Message ID | 0a17416b-57a0-99e7-2e7e-90a63da66fe6@linux.ibm.com |
---|---|
State | New |
Series | [rs6000] avoid using unaligned vsx or lxvd2x/stxvd2x for memcpy/memmove inline expansion |
On Wed, Dec 19, 2018 at 01:53:05PM -0600, Aaron Sawdey wrote:
> Because of POWER9 dd2.1 issues with certain unaligned vsx instructions
> to cache inhibited memory, here is a patch that keeps memmove (and memcpy)
> inline expansion from doing unaligned vector or using vector load/store
> other than lvx/stvx.  More description of the issue is here:
>
> https://patchwork.ozlabs.org/patch/814059/
>
> OK for trunk if bootstrap/regtest ok?

Okay, but see below.

> 2018-12-19  Aaron Sawdey  <acsawdey@linux.ibm.com>
>
> 	* config/rs6000/rs6000-string.c (expand_block_move): Don't use
> 	unaligned vsx and avoid lxvd2x/stxvd2x.
> 	(gen_lvx_v4si_move): New function.
> +static rtx
> +gen_lvx_v4si_move (rtx dest, rtx src)
> +{
> +  rtx rv = NULL;
> +  if (MEM_P (dest))
> +    {
> +      gcc_assert (!MEM_P (src));
> +      gcc_assert (GET_MODE (src) == V4SImode);
> +      rv = gen_altivec_stvx_v4si_internal (dest, src);
> +    }
> +  else if (MEM_P (src))
> +    {
> +      gcc_assert (!MEM_P (dest));
> +      gcc_assert (GET_MODE (dest) == V4SImode);
> +      rv = gen_altivec_lvx_v4si_internal (dest, src);
> +    }
> +  else
> +    gcc_unreachable ();
> +
> +  return rv;
> +}

This is extraordinarily clumsy :-)  Maybe something like:

static rtx
gen_lvx_v4si_move (rtx dest, rtx src)
{
  gcc_assert (!(MEM_P (dest) && MEM_P (src));
  gcc_assert (GET_MODE (dest) == V4SImode && GET_MODE (src) == V4SImode);
  if (MEM_P (dest))
    return gen_altivec_stvx_v4si_internal (dest, src);
  else if (MEM_P (src))
    return gen_altivec_lvx_v4si_internal (dest, src);
  else
    gcc_unreachable ();
}

(Or do you allow VOIDmode for src as well?)  Anyway, at least get rid of
the useless extra variable.

Thanks!


Segher
On 12/20/18 3:51 AM, Segher Boessenkool wrote:
> On Wed, Dec 19, 2018 at 01:53:05PM -0600, Aaron Sawdey wrote:
>> Because of POWER9 dd2.1 issues with certain unaligned vsx instructions
>> to cache inhibited memory, here is a patch that keeps memmove (and memcpy)
>> inline expansion from doing unaligned vector or using vector load/store
>> other than lvx/stvx.  More description of the issue is here:
>>
>> https://patchwork.ozlabs.org/patch/814059/
>>
>> OK for trunk if bootstrap/regtest ok?
>
> Okay, but see below.
>
[snip]
>
> This is extraordinarily clumsy :-)  Maybe something like:
>
> static rtx
> gen_lvx_v4si_move (rtx dest, rtx src)
> {
>   gcc_assert (!(MEM_P (dest) && MEM_P (src));
>   gcc_assert (GET_MODE (dest) == V4SImode && GET_MODE (src) == V4SImode);
>   if (MEM_P (dest))
>     return gen_altivec_stvx_v4si_internal (dest, src);
>   else if (MEM_P (src))
>     return gen_altivec_lvx_v4si_internal (dest, src);
>   else
>     gcc_unreachable ();
> }
>
> (Or do you allow VOIDmode for src as well?)  Anyway, at least get rid of
> the useless extra variable.

I think this should be better:

static rtx
gen_lvx_v4si_move (rtx dest, rtx src)
{
  gcc_assert ((MEM_P (dest) && !MEM_P (src)) || (MEM_P (src) && !MEM_P(dest)));
  gcc_assert (GET_MODE (dest) == V4SImode && GET_MODE (src) == V4SImode);
  if (MEM_P (dest))
    return gen_altivec_stvx_v4si_internal (dest, src);
  else if (MEM_P (src))
    return gen_altivec_lvx_v4si_internal (dest, src);
  gcc_unreachable ();
}

I'll commit after I re-regstrap.

Thanks!

Aaron
On Thu, Dec 20, 2018 at 05:34:54PM -0600, Aaron Sawdey wrote:
> On 12/20/18 3:51 AM, Segher Boessenkool wrote:
> > On Wed, Dec 19, 2018 at 01:53:05PM -0600, Aaron Sawdey wrote:
> >> Because of POWER9 dd2.1 issues with certain unaligned vsx instructions
> >> to cache inhibited memory, here is a patch that keeps memmove (and memcpy)
> >> inline expansion from doing unaligned vector or using vector load/store
> >> other than lvx/stvx.  More description of the issue is here:
> >>
> >> https://patchwork.ozlabs.org/patch/814059/
> >>
> >> OK for trunk if bootstrap/regtest ok?
> >
> > Okay, but see below.
> >
> [snip]
> >
> > This is extraordinarily clumsy :-)  Maybe something like:
> >
> > static rtx
> > gen_lvx_v4si_move (rtx dest, rtx src)
> > {
> >   gcc_assert (!(MEM_P (dest) && MEM_P (src));
> >   gcc_assert (GET_MODE (dest) == V4SImode && GET_MODE (src) == V4SImode);
> >   if (MEM_P (dest))
> >     return gen_altivec_stvx_v4si_internal (dest, src);
> >   else if (MEM_P (src))
> >     return gen_altivec_lvx_v4si_internal (dest, src);
> >   else
> >     gcc_unreachable ();
> > }
> >
> > (Or do you allow VOIDmode for src as well?)  Anyway, at least get rid of
> > the useless extra variable.
>
> I think this should be better:

The gcc_unreachable at the end catches the non-mem to non-mem case.

> static rtx
> gen_lvx_v4si_move (rtx dest, rtx src)
> {
>   gcc_assert ((MEM_P (dest) && !MEM_P (src)) || (MEM_P (src) && !MEM_P(dest)));

But if you prefer this, how about

{
  gcc_assert (MEM_P (dest) ^ MEM_P (src));
  gcc_assert (GET_MODE (dest) == V4SImode && GET_MODE (src) == V4SImode);

  if (MEM_P (dest))
    return gen_altivec_stvx_v4si_internal (dest, src);
  else
    return gen_altivec_lvx_v4si_internal (dest, src);
}

:-)


Segher
On 12/20/18 5:44 PM, Segher Boessenkool wrote:
> On Thu, Dec 20, 2018 at 05:34:54PM -0600, Aaron Sawdey wrote:
[snip]
>
> But if you prefer this, how about
>
> {
>   gcc_assert (MEM_P (dest) ^ MEM_P (src));
>   gcc_assert (GET_MODE (dest) == V4SImode && GET_MODE (src) == V4SImode);
>
>   if (MEM_P (dest))
>     return gen_altivec_stvx_v4si_internal (dest, src);
>   else
>     return gen_altivec_lvx_v4si_internal (dest, src);
> }
>
> :-)
>
>
> Segher
>

I like that even better, thanks!
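As an aside for readers outside GCC: the `^` form works because MEM_P is a boolean-valued predicate (0 or 1), so the assert reads "exactly one of dest and src is a memory operand", which is exactly the precondition the load/store dispatch needs. Below is a minimal standalone C sketch of the same idiom; it is not GCC code, and the operand struct and classify_move helper are made up purely for illustration.

#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for GCC's MEM_P predicate: is_mem is 1 if the
   operand is a memory reference, 0 if it is a register.  */
struct operand { int is_mem; };

/* Exactly one of dest/src may be a memory reference, mirroring the
   lvx/stvx helper: memory dest -> store, memory src -> load.  */
static const char *
classify_move (const struct operand *dest, const struct operand *src)
{
  assert (dest->is_mem ^ src->is_mem);   /* exactly one is memory */
  return dest->is_mem ? "store (stvx)" : "load (lvx)";
}

int
main (void)
{
  struct operand reg = { 0 }, mem = { 1 };
  printf ("%s\n", classify_move (&mem, &reg));   /* store (stvx) */
  printf ("%s\n", classify_move (&reg, &mem));   /* load (lvx) */
  return 0;
}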
The patch for this was committed to trunk as 267562 (see below).  Is this also ok for backport to 8?

Thanks,
   Aaron

On 12/20/18 5:44 PM, Segher Boessenkool wrote:
[snip]

2019-01-03  Aaron Sawdey  <acsawdey@linux.ibm.com>

	* config/rs6000/rs6000-string.c (expand_block_move): Don't use
	unaligned vsx and avoid lxvd2x/stxvd2x.
	(gen_lvx_v4si_move): New function.


Index: gcc/config/rs6000/rs6000-string.c
===================================================================
--- gcc/config/rs6000/rs6000-string.c	(revision 267299)
+++ gcc/config/rs6000/rs6000-string.c	(working copy)
@@ -2669,6 +2669,25 @@
   return true;
 }
 
+/* Generate loads and stores for a move of v4si mode using lvx/stvx.
+   This uses altivec_{l,st}vx_<mode>_internal which use unspecs to
+   keep combine from changing what instruction gets used.
+
+   DEST is the destination for the data.
+   SRC is the source of the data for the move.  */
+
+static rtx
+gen_lvx_v4si_move (rtx dest, rtx src)
+{
+  gcc_assert (MEM_P (dest) ^ MEM_P (src));
+  gcc_assert (GET_MODE (dest) == V4SImode && GET_MODE (src) == V4SImode);
+
+  if (MEM_P (dest))
+    return gen_altivec_stvx_v4si_internal (dest, src);
+  else
+    return gen_altivec_lvx_v4si_internal (dest, src);
+}
+
 /* Expand a block move operation, and return 1 if successful.  Return 0
    if we should let the compiler generate normal code.
 
@@ -2721,11 +2740,11 @@
 
   /* Altivec first, since it will be faster than a string move
      when it applies, and usually not significantly larger.  */
-  if (TARGET_ALTIVEC && bytes >= 16 && (TARGET_EFFICIENT_UNALIGNED_VSX || align >= 128))
+  if (TARGET_ALTIVEC && bytes >= 16 && align >= 128)
     {
       move_bytes = 16;
       mode = V4SImode;
-      gen_func.mov = gen_movv4si;
+      gen_func.mov = gen_lvx_v4si_move;
     }
   else if (bytes >= 8 && TARGET_POWERPC64
	   && (align >= 64 || !STRICT_ALIGNMENT))
On Mon, Jan 14, 2019 at 12:49:33PM -0600, Aaron Sawdey wrote:
> The patch for this was committed to trunk as 267562 (see below).  Is this also ok for backport to 8?

Yes please.  Thanks!


Segher

> [snip]
Index: gcc/config/rs6000/rs6000-string.c
===================================================================
--- gcc/config/rs6000/rs6000-string.c	(revision 267055)
+++ gcc/config/rs6000/rs6000-string.c	(working copy)
@@ -2669,6 +2669,35 @@
   return true;
 }
 
+/* Generate loads and stores for a move of v4si mode using lvx/stvx.
+   This uses altivec_{l,st}vx_<mode>_internal which use unspecs to
+   keep combine from changing what instruction gets used.
+
+   DEST is the destination for the data.
+   SRC is the source of the data for the move.  */
+
+static rtx
+gen_lvx_v4si_move (rtx dest, rtx src)
+{
+  rtx rv = NULL;
+  if (MEM_P (dest))
+    {
+      gcc_assert (!MEM_P (src));
+      gcc_assert (GET_MODE (src) == V4SImode);
+      rv = gen_altivec_stvx_v4si_internal (dest, src);
+    }
+  else if (MEM_P (src))
+    {
+      gcc_assert (!MEM_P (dest));
+      gcc_assert (GET_MODE (dest) == V4SImode);
+      rv = gen_altivec_lvx_v4si_internal (dest, src);
+    }
+  else
+    gcc_unreachable ();
+
+  return rv;
+}
+
 /* Expand a block move operation, and return 1 if successful.  Return 0
    if we should let the compiler generate normal code.
 
@@ -2721,11 +2750,11 @@
 
   /* Altivec first, since it will be faster than a string move
      when it applies, and usually not significantly larger.  */
-  if (TARGET_ALTIVEC && bytes >= 16 && (TARGET_EFFICIENT_UNALIGNED_VSX || align >= 128))
+  if (TARGET_ALTIVEC && bytes >= 16 && align >= 128)
    {
      move_bytes = 16;
      mode = V4SImode;
-      gen_func.mov = gen_movv4si;
+      gen_func.mov = gen_lvx_v4si_move;
    }
   else if (bytes >= 8 && TARGET_POWERPC64
	   && (align >= 64 || !STRICT_ALIGNMENT))
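Editorial note on the user-visible effect: in expand_block_move the alignment is tracked in bits (compare the `align >= 64` test on the doubleword path), so `align >= 128` means the compiler can prove 16-byte alignment, and only then does the inline expansion keep using the V4SImode lvx/stvx path; copies without that guarantee no longer get unaligned VSX accesses. The C sketch below is an illustration of the kind of code affected, under the assumption that the memcpy calls are small and constant-sized enough to be expanded inline at -O2; the type and function names are made up for the example.

#include <string.h>

/* 16-byte aligned buffers: alignment of 128 bits is provable, so the
   inline expansion may still use the lvx/stvx vector path.  */
typedef struct { int v[8]; } __attribute__ ((aligned (16))) vec32;

void
copy_aligned (vec32 *d, const vec32 *s)
{
  memcpy (d, s, sizeof (vec32));   /* eligible for the V4SImode path */
}

/* Only 4-byte alignment is known here, so after the patch the inline
   expansion avoids unaligned vector loads/stores and falls back to
   scalar moves (or a library call) instead.  */
void
copy_unaligned (int *d, const int *s)
{
  memcpy (d, s, 32);               /* not eligible for lvx/stvx */
}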