
[v2,0-6/11] Fix PR/61114, make direct vector reductions endianness-neutral

Message ID 544A3E0B.2000803@arm.com
State New

Commit Message

Alan Lawrence Oct. 24, 2014, 11:54 a.m. UTC
This is the first half of my previous patch series 
(https://gcc.gnu.org/ml/gcc-patches/2014-09/msg01456.html), that is the part 
making the REDUC_..._EXPR tree codes endian-neutral, and adding a new 
reduce-to-scalar optab in place of the endianness-dependent 
reduc_[us](plus|min|max)_optab.
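
As a rough C model of the semantic difference (function names are invented
here purely for illustration; the old optabs' exact result-lane convention
varied with target endianness, which is the problem being fixed):

typedef int v4si __attribute__ ((vector_size (16)));

/* Old style: the sum lands in one lane of a vector result, and which
   lane depended on target endianness (the lane chosen below is
   arbitrary, for illustration only).  */
v4si
reduc_splus_model (v4si x, int big_endian)
{
  v4si r = { 0, 0, 0, 0 };
  r[big_endian ? 0 : 3] = x[0] + x[1] + x[2] + x[3];
  return r;
}

/* New style: the result is the scalar itself, so there is no lane
   position to get wrong.  */
int
reduc_plus_scal_model (v4si x)
{
  return x[0] + x[1] + x[2] + x[3];
}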

I'm leaving the vec_shr portion out of this patch series, as the link between 
the two halves is only the end goal of removing an "if (BYTES_BIG_ENDIAN)" from 
tree-vect-loop.c; this series removes that from one code path, so it can stand alone.

Patches 1-6 are as previously posted apart from rebasing and removing the 
old/poisoned AArch64 patterns as per the maintainer's request. Patches 1, 2, 4, 5 
and 6 have already been approved; patch 3 was discussed somewhat, but I think we 
decided against most of the ideas raised, and I have added a comment to 
scalar_reduc_to_vector. I have now reread Richie's "Otherwise the patch looks 
good to me" and wonder if I should have taken that as an approval, but I didn't 
read it that way at the time...

Patches 7-11 migrate ARM, x86, IA64 (I think), and mostly PowerPC, to 
the new reduc_(plus|[us](min|max))_scal_optab. I have not managed to work out 
how to do the same for MIPS (specifically what I need to add to 
mips_expand_vec_reduc), and have had no response from the maintainers, so am 
leaving that for now. Also I haven't migrated (or worked out how to target) 
rs6000/paired.md; help would be most welcome.


The suggestion was then to "complete" the migration, by removing the old optabs. 
There are a few options here and I'll follow up with appropriate patches 
according to feedback received. I see options:

(1) just delete the old optabs (and the migration code). This would 
performance-regress the MIPS backend, but should not break it, although one 
should really do *something* with the then-unused reduc_[us](plus|min|max)_optab 
in config/mips/loongson.md.

(2) also renaming reduc_..._scal_optab back to reduc_..._optab; would break the 
MIPS backend if something were not done with its existing patterns.

(2a) Alternatively I could just use a different new name, e.g. reduce_...., 
reduct_...., vec_reduc_..., anything that's less of a mouthful than 
reduc_..._scal. Being only very slightly different from the current 
reduc_... might be confusing, but so might changing the meaning and signature 
of the optab while keeping the existing name, so I am open to suggestions.

Cheers, Alan

Comments

Richard Biener Oct. 24, 2014, 11:57 a.m. UTC | #1
On Fri, 24 Oct 2014, Alan Lawrence wrote:

> This is the first half of my previous patch series
> (https://gcc.gnu.org/ml/gcc-patches/2014-09/msg01456.html), that is the part
> making the REDUC_..._EXPR tree codes endian-neutral, and adding a new
> reduce-to-scalar optab in place of the endianness-dependent
> reduc_[us](plus|min|max)_optab.
> 
> I'm leaving the vec_shr portion out of this patch series, as the link between
> the two halves is only the end goal of removing an "if (BYTES_BIG_ENDIAN)"
> from tree-vect-loop.c; this series removes that from one code path, so it can
> stand alone.
> 
> Patches 1-6 are as previously posted apart from rebasing and removing the
> old/poisoned AArch64 patterns as per the maintainer's request. Patches 1, 2, 4, 5
> and 6 have already been approved; patch 3 was discussed somewhat, but I think
> we decided against most of the ideas raised, and I have added a comment to
> scalar_reduc_to_vector. I have now reread Richie's "Otherwise the patch looks
> good to me" and wonder if I should have taken that as an approval, but I didn't
> read it that way at the time...

Yes, it was an approval ;)

> Patches 7-11 migrate ARM, x86, IA64 (I think), and mostly PowerPC, to
> the new reduc_(plus|[us](min|max))_scal_optab. I have not managed to work out
> how to do the same for MIPS (specifically what I need to add to
> mips_expand_vec_reduc), and have had no response from the maintainers, so am
> leaving that for now. Also I haven't migrated (or worked out how to target)
> rs6000/paired.md; help would be most welcome.
> 
> 
> The suggestion was then to "complete" the migration, by removing the old
> optabs. There are a few options here and I'll follow up with appropriate
> patches according to feedback received. I see options:
> 
> (1) just delete the old optabs (and the migration code). This would
> performance-regress the MIPS backend, but should not break it, although one
> should really do *something* with the then-unused
> reduc_[us](plus|min|max)_optab in config/mips/loongson.md.
>
> (2) also renaming reduc_..._scal_optab back to reduc_..._optab; would break
> the MIPS backend if something were not done with its existing patterns.
> 
> (2a) Alternatively I could just use a different new name, e.g. reduce_....,
> reduct_...., vec_reduc_..., anything that's less of a mouthful than
> reduc_..._scal. Being only very slightly different from the current
> reduc_... might be confusing, but so might changing the meaning and signature
> of the optab while keeping the existing name, so I am open to suggestions.

I definitely prefer (2).

Thanks,
Richard.
Alan Lawrence Oct. 24, 2014, 12:06 p.m. UTC | #2
This migrates the reduction patterns in altivec.md and vector.md to the new 
names. I've not touched paired.md as I wasn't really sure how to fix that (how 
do I vec_extractv2sf ?), moreover the testing I did didn't seem to exercise any 
of those patterns (iow: I'm not sure what would be an appropriate target machine?).

I note the reduc_uplus_v16qi (which I've removed, as unsigned and signed 
addition should be equivalent) differed from reduc_splus_v16qi in using 
gen_altivec_vsum4ubs rather than gen_altivec_vsum4sbs.  Testcases 
gcc.dg/vect/{slp-24-big-array.c,slp-24.c,vect-reduc-1char-big-array.c,vect-reduc-1char.c} 
thus produce assembly which differs from previously (only) in that "vsum4ubs" 
becomes "vsum4sbs". These tests are still passing so I assume this is OK.

The combining of signed and unsigned addition also improves 
gcc.dg/vect/{vect-outer-4i.c,vect-reduc-1short.c,vect-reduc-dot-u8b.c,vect-reduc-pattern-1c-big-array.c,vect-reduc-pattern-1c.c} 
: these are now reduced using direct vector reduction, rather than with shifts 
as previously (because there was only a reduc_splus rather than the reduc_uplus 
these tests looked for).
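
For reference, the shift-based fallback works roughly like the following C
model (illustrative only, not the vectorizer's actual code): add the vector to
a copy of itself shifted down by half the remaining width, log2(N) times, then
read out one element.

typedef int v4si __attribute__ ((vector_size (16)));

int
reduc_via_shifts_model (v4si x)
{
  v4si t = x;
  v4si by2 = { t[2], t[3], 0, 0 };  /* vector shifted down by 2 elements */
  t += by2;
  v4si by1 = { t[1], 0, 0, 0 };     /* vector shifted down by 1 element */
  t += by1;
  return t[0];                      /* == x[0] + x[1] + x[2] + x[3] */
}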

((Side note: the RTL changes to vector.md are to match the combine patterns in 
vsx.md; now that we no longer depend upon combine to generate those patterns 
(as the optab outputs them directly), one might wish to remove the smaller 
pattern from vsx.md, and/or simplify the RTL. I theorize that a reduction of a 
two-element vector is just adding the first element to the second, so maybe to 
something like

   [(parallel [(set (match_operand:DF 0 "vfloat_operand" "")
		   (VEC_reduc:V2DF
		    (vec_select:DF
		     (match_operand:V2DF 1 "vfloat_operand" "")
		     (parallel [(const_int 1)]))
		    (vec_select:DF
		     (match_dup 1)
		     (parallel [(const_int 0)]))))
	      (clobber (match_scratch:V2DF 2 ""))])]

but I think it's best for me to leave that to the port maintainers.))

Bootstrapped and check-gcc on powerpc64-none-linux-gnu (gcc110.fsffrance.org, 
with thanks to the GCC Compile Farm).

gcc/ChangeLog:

	* config/rs6000/altivec.md (reduc_splus_<mode>): Rename to...
	(reduc_plus_scal_<mode>): ...this, and use rs6000_expand_vector_extract.
	(reduc_uplus_v16qi): Remove.

	* config/rs6000/vector.md (VEC_reduc_name): Change "splus" to "plus".
	(reduc_<VEC_reduc_name>_v2df): Rename to...
	(reduc_<VEC_reduc_name>_scal_v2df): ...this, wrap VEC_reduc in a
	vec_select of element 1.
	(reduc_<VEC_reduc_name>_v4sf): Rename to...
	(reduc_<VEC_reduc_name>_scal_v4sf): ...this, wrap VEC_reduc in a
	vec_select of element 3, add scratch register.
Alan Lawrence Oct. 24, 2014, 12:07 p.m. UTC | #3
This is an attempt to migrate IA64 to the newer optabs; however, I found that 
none of the tests in gcc.dg/vect seemed to touch any of the affected patterns, 
so this is only really tested by building a stage-1 compiler.

gcc/ChangeLog:

	* config/ia64/vect.md (reduc_splus_v2sf): Rename to...
	(reduc_plus_scal_v2sf): ...this, add a vec_extractv2sf.
	(reduc_smin_v2sf): Rename to...
	(reduc_smin_scal_v2sf): ...this, add a vec_extractv2sf.
	(reduc_smax_v2sf): Rename to...
	(reduc_smax_scal_v2sf): ...this, add a vec_extractv2sf.
Matthew Fortune Oct. 24, 2014, 3:17 p.m. UTC | #4
Alan Lawrence <alan.lawrence@arm.com> writes:
> Patches 7-11 migrate ARM, x86, IA64 (I think), and mostly PowerPC, to
> the new reduc_(plus|[us](min|max))_scal_optab. I have not managed to work out
> how to do the same for MIPS (specifically what I need to add to
> mips_expand_vec_reduc), and have had no response from the maintainers, so am

Sorry, I was looking at this but failed to send an email saying so. The lack
of vec_extract appears to be the stumbling point here so at the very least
we need to add a naïve version of that I believe.

> (2) also renaming reduc_..._scal_optab back to reduc_..._optab; would break
> the MIPS backend if something were not done with its existing patterns.

I suspect we can deal with this in time to make a rename OK.

One thing occurred to me about this change in general: on the whole, the
reduction to a scalar seems good for an epilogue, but is there a problem if the
result is then replicated across a vector for further processing? I.e. a vector
is reduced to a scalar, which moves the value from a SIMD register to a GP
register (because scalar modes are generally not supported in SIMD registers)
and then gets moved back to a SIMD register to form part of a new vector.
Would you expect the redundant moves to get eliminated?
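
In C terms, the scenario I have in mind is something like this (hypothetical
code, purely to make the concern concrete):

typedef float v4sf __attribute__ ((vector_size (16)));

/* Reduce to a scalar, then splat the scalar back across a vector; on a
   target whose SIMD registers do not generally hold scalar modes, the
   value may round-trip through a GP or scalar FP register in between.  */
v4sf
reduce_then_splat (v4sf acc, v4sf next)
{
  float s = acc[0] + acc[1] + acc[2] + acc[3];  /* reduction to scalar */
  v4sf splat = { s, s, s, s };                  /* back into a vector  */
  return splat + next;
}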

Thanks,
Matthew
David Edelsohn Oct. 24, 2014, 11:49 p.m. UTC | #5
On Fri, Oct 24, 2014 at 8:06 AM, Alan Lawrence <alan.lawrence@arm.com> wrote:
> This migrates the reduction patterns in altivec.md and vector.md to the new
> names. I've not touched paired.md as I wasn't really sure how to fix that
> (how do I vec_extractv2sf ?), moreover the testing I did didn't seem to
> exercise any of those patterns (iow: I'm not sure what would be an
> appropriate target machine?).
>
> I note the reduc_uplus_v16qi (which I've removed, as unsigned and signed
> addition should be equivalent) differed from reduc_splus_v16qi in using
> gen_altivec_vsum4ubs rather than gen_altivec_vsum4sbs.  Testcases
> gcc.dg/vect/{slp-24-big-array.c,slp-24.c,vect-reduc-1char-big-array.c,vect-reduc-1char.c}
> thus produce assembly which differs from previously (only) in that
> "vsum4ubs" becomes "vsum4sbs". These tests are still passing so I assume
> this is OK.
>
> The combining of signed and unsigned addition also improves
> gcc.dg/vect/{vect-outer-4i.c,vect-reduc-1short.c,vect-reduc-dot-u8b.c,vect-reduc-pattern-1c-big-array.c,vect-reduc-pattern-1c.c}
> : these are now reduced using direct vector reduction, rather than with
> shifts as previously (because there was only a reduc_splus rather than the
> reduc_uplus these tests looked for).
>
> ((Side note: the RTL changes to vector.md are to match the combine patterns
> in vsx.md; now that we no longer depend upon combine to generate those
> patterns (as the optab outputs them directly), one might wish to remove the
> smaller pattern from vsx.md, and/or simplify the RTL. I theorize that a
> reduction of a two-element vector is just adding the first element to the
> second, so maybe to something like
>
>   [(parallel [(set (match_operand:DF 0 "vfloat_operand" "")
>                    (VEC_reduc:V2DF
>                     (vec_select:DF
>                      (match_operand:V2DF 1 "vfloat_operand" "")
>                      (parallel [(const_int 1)]))
>                     (vec_select:DF
>                      (match_dup 1)
>                      (parallel [(const_int 0)]))))
>               (clobber (match_scratch:V2DF 2 ""))])]
>
> but I think it's best for me to leave that to the port maintainers.))
>
> Bootstrapped and check-gcc on powerpc64-none-linux-gnu
> (gcc110.fsffrance.org, with thanks to the GCC Compile Farm).
>
> gcc/ChangeLog:
>
>         * config/rs6000/altivec.md (reduc_splus_<mode>): Rename to...
>         (reduc_plus_scal_<mode>): ...this, and use rs6000_expand_vector_extract.
>         (reduc_uplus_v16qi): Remove.
>
>         * config/rs6000/vector.md (VEC_reduc_name): Change "splus" to "plus".
>         (reduc_<VEC_reduc_name>_v2df): Rename to...
>         (reduc_<VEC_reduc_name>_scal_v2df): ...this, wrap VEC_reduc in a
>         vec_select of element 1.
>         (reduc_<VEC_reduc_name>_v4sf): Rename to...
>         (reduc_<VEC_reduc_name>_scal_v4sf): ...this, wrap VEC_reduc in a
>         vec_select of element 3, add scratch register.
>

This needs some input from Bill, but he will be busy with a conference
this week.

- David
Richard Biener Oct. 27, 2014, 11:43 a.m. UTC | #6
On Fri, Oct 24, 2014 at 5:17 PM, Matthew Fortune
<Matthew.Fortune@imgtec.com> wrote:
> Alan Lawrence <alan.lawrence@arm.com> writes:
>> Patches 7-11 migrate ARM, x86, IA64 (I think), and mostly PowerPC, to
>> the new reduc_(plus|[us](min|max))_scal_optab. I have not managed to work out
>> how to do the same for MIPS (specifically what I need to add to
>> mips_expand_vec_reduc), and have had no response from the maintainers, so am
>
> Sorry, I was looking at this but failed to send an email saying so. The lack
> of vec_extract appears to be the stumbling point here so at the very least
> we need to add a naïve version of that I believe.
>
>> (2) also renaming reduc_..._scal_optab back to reduc_..._optab; would break
>> the MIPS backend if something were not done with its existing patterns.
>
> I suspect we can deal with this in time to make a rename OK.
>
> One thing occurred to me about this change in general: on the whole, the
> reduction to a scalar seems good for an epilogue, but is there a problem if
> the result is then replicated across a vector for further processing? I.e. a
> vector is reduced to a scalar, which moves the value from a SIMD register to
> a GP register (because scalar modes are generally not supported in SIMD
> registers) and then gets moved back to a SIMD register to form part of a new
> vector. Would you expect the redundant moves to get eliminated?

Combine should be able to do this if you help it (it of course depends
on what your actual processor instruction doing the reduction does).

Richard.

> Thanks,
> Matthew
Bill Schmidt Nov. 3, 2014, 5:50 p.m. UTC | #7
On Fri, 2014-10-24 at 19:49 -0400, David Edelsohn wrote:
> On Fri, Oct 24, 2014 at 8:06 AM, Alan Lawrence <alan.lawrence@arm.com> wrote:
> > This migrates the reduction patterns in altivec.md and vector.md to the new
> > names. I've not touched paired.md as I wasn't really sure how to fix that
> > (how do I vec_extractv2sf ?), moreover the testing I did didn't seem to
> > exercise any of those patterns (iow: I'm not sure what would be an
> > appropriate target machine?).
> >
> > I note the reduc_uplus_v16qi (which I've removed, as unsigned and signed
> > addition should be equivalent) differed from reduc_splus_v16qi in using
> > gen_altivec_vsum4ubs rather than gen_altivec_vsum4sbs.  Testcases
> > gcc.dg/vect/{slp-24-big-array.c,slp-24.c,vect-reduc-1char-big-array.c,vect-reduc-1char.c}
> > thus produce assembly which differs from previously (only) in that
> > "vsum4ubs" becomes "vsum4sbs". These tests are still passing so I assume
> > this is OK.

Given that the only 32-bit quantity being added here is zero, the
difference in saturation points for vsum4ubs and vsum4sbs won't come
into play, so I agree this should be fine.

I would like to ask Mike Meissner to look over the changes to the
reduction patterns in vector.md.  He wrote those and is more familiar
with that piece than I am.  On the surface I don't see any problems, but
I could miss something subtle.

Otherwise I'm ok with the patch.

Thanks,
Bill

(p.s. Sorry for the delay on reviewing this.  As David noted, I was
traveling, and I ended up having no access to my mail for most of the
week due to an IT snafu.)

> >
> > The combining of signed and unsigned addition also improves
> > gcc.dg/vect/{vect-outer-4i.c,vect-reduc-1short.c,vect-reduc-dot-u8b.c,vect-reduc-pattern-1c-big-array.c,vect-reduc-pattern-1c.c}
> > : these are now reduced using direct vector reduction, rather than with
> > shifts as previously (because there was only a reduc_splus rather than the
> > reduc_uplus these tests looked for).
> >
> > ((Side note: the RTL changes to vector.md are to match the combine patterns
> > in vsx.md; now that we no longer depend upon combine to generate those
> > patterns (as the optab outputs them directly), one might wish to remove the
> > smaller pattern from vsx.md, and/or simplify the RTL. I theorize that a
> > reduction of a two-element vector is just adding the first element to the
> > second, so maybe to something like
> >
> >   [(parallel [(set (match_operand:DF 0 "vfloat_operand" "")
> >                    (VEC_reduc:V2DF
> >                     (vec_select:DF
> >                      (match_operand:V2DF 1 "vfloat_operand" "")
> >                      (parallel [(const_int 1)]))
> >                     (vec_select:DF
> >                      (match_dup 1)
> >                      (parallel [(const_int 0)]))))
> >               (clobber (match_scratch:V2DF 2 ""))])]
> >
> > but I think it's best for me to leave that to the port maintainers.))
> >
> > Bootstrapped and check-gcc on powerpc64-none-linux-gnu
> > (gcc110.fsffrance.org, with thanks to the GCC Compile Farm).
> >
> > gcc/ChangeLog:
> >
> >         * config/rs6000/altivec.md (reduc_splus_<mode>): Rename to...
> >         (reduc_plus_scal_<mode>): ...this, and use rs6000_expand_vector_extract.
> >         (reduc_uplus_v16qi): Remove.
> >
> >         * config/rs6000/vector.md (VEC_reduc_name): Change "splus" to "plus".
> >         (reduc_<VEC_reduc_name>_v2df): Rename to...
> >         (reduc_<VEC_reduc_name>_scal_v2df): ...this, wrap VEC_reduc in a
> >         vec_select of element 1.
> >         (reduc_<VEC_reduc_name>_v4sf): Rename to...
> >         (reduc_<VEC_reduc_name>_scal_v4sf): ...this, wrap VEC_reduc in a
> >         vec_select of element 3, add scratch register.
> >
> 
> This needs some input from Bill, but he will be busy with a conference
> this week.
> 
> - David
>
Alan Lawrence Nov. 6, 2014, 4:44 p.m. UTC | #8
Hmmm. I am a little surprised by your mention of "saturation points" as I would 
not expect any variety of reduc_plus to be a saturating operation???

A.

Bill Schmidt wrote:
> On Fri, 2014-10-24 at 19:49 -0400, David Edelsohn wrote:
>> On Fri, Oct 24, 2014 at 8:06 AM, Alan Lawrence <alan.lawrence@arm.com> wrote:
>>> This migrates the reduction patterns in altivec.md and vector.md to the new
>>> names. I've not touched paired.md as I wasn't really sure how to fix that
>>> (how do I vec_extractv2sf ?), moreover the testing I did didn't seem to
>>> exercise any of those patterns (iow: I'm not sure what would be an
>>> appropriate target machine?).
>>>
>>> I note the reduc_uplus_v16qi (which I've removed, as unsigned and signed
>>> addition should be equivalent) differed from reduc_splus_v16qi in using
>>> gen_altivec_vsum4ubs rather than gen_altivec_vsum4sbs.  Testcases
>>> gcc.dg/vect/{slp-24-big-array.c,slp-24.c,vect-reduc-1char-big-array.c,vect-reduc-1char.c}
>>> thus produce assembly which differs from previously (only) in that
>>> "vsum4ubs" becomes "vsum4sbs". These tests are still passing so I assume
>>> this is OK.
> 
> Given that the only 32-bit quantity being added here is zero, the
> difference in saturation points for vsum4ubs and vsum4sbs won't come
> into play, so I agree this should be fine.
> 
> I would like to ask Mike Meissner to look over the changes to the
> reduction patterns in vector.md.  He wrote those and is more familiar
> with that piece than I am.  On the surface I don't see any problems, but
> I could miss something subtle.
> 
> Otherwise I'm ok with the patch.
> 
> Thanks,
> Bill
> 
> (p.s. Sorry for the delay on reviewing this.  As David noted, I was
> traveling, and I ended up having no access to my mail for most of the
> week due to an IT snafu.)
> 
>>> The combining of signed and unsigned addition also improves
>>> gcc.dg/vect/{vect-outer-4i.c,vect-reduc-1short.c,vect-reduc-dot-u8b.c,vect-reduc-pattern-1c-big-array.c,vect-reduc-pattern-1c.c}
>>> : these are now reduced using direct vector reduction, rather than with
>>> shifts as previously (because there was only a reduc_splus rather than the
>>> reduc_uplus these tests looked for).
>>>
>>> ((Side note: the RTL changes to vector.md are to match the combine patterns
>>> in vsx.md; now that we no longer depend upon combine to generate those
>>> patterns (as the optab outputs them directly), one might wish to remove the
>>> smaller pattern from vsx.md, and/or simplify the RTL. I theorize that a
>>> reduction of a two-element vector is just adding the first element to the
>>> second, so maybe to something like
>>>
>>>   [(parallel [(set (match_operand:DF 0 "vfloat_operand" "")
>>>                    (VEC_reduc:V2DF
>>>                     (vec_select:DF
>>>                      (match_operand:V2DF 1 "vfloat_operand" "")
>>>                      (parallel [(const_int 1)]))
>>>                     (vec_select:DF
>>>                      (match_dup 1)
>>>                      (parallel [(const_int 0)]))))
>>>               (clobber (match_scratch:V2DF 2 ""))])]
>>>
>>> but I think it's best for me to leave that to the port maintainers.))
>>>
>>> Bootstrapped and check-gcc on powerpc64-none-linux-gnu
>>> (gcc110.fsffrance.org, with thanks to the GCC Compile Farm).
>>>
>>> gcc/ChangeLog:
>>>
>>>         * config/rs6000/altivec.md (reduc_splus_<mode>): Rename to...
>>>         (reduc_plus_scal_<mode>): ...this, and use rs6000_expand_vector_extract.
>>>         (reduc_uplus_v16qi): Remove.
>>>
>>>         * config/rs6000/vector.md (VEC_reduc_name): Change "splus" to "plus".
>>>         (reduc_<VEC_reduc_name>_v2df): Rename to...
>>>         (reduc_<VEC_reduc_name>_scal_v2df): ...this, wrap VEC_reduc in a
>>>         vec_select of element 1.
>>>         (reduc_<VEC_reduc_name>_v4sf): Rename to...
>>>         (reduc_<VEC_reduc_name>_scal_v4sf): ...this, wrap VEC_reduc in a
>>>         vec_select of element 3, add scratch register.
>>>
>> This needs some input from Bill, but he will be busy with a conference
>> this week.
>>
>> - David
>>
> 
> 
>
Bill Schmidt Nov. 6, 2014, 6:56 p.m. UTC | #9
On Thu, 2014-11-06 at 16:44 +0000, Alan Lawrence wrote:
> Hmmm. I am a little surprised by your mention of "saturation points" as I would 
> not expect any variety of reduc_plus to be a saturating operation???

I wouldn't either, but the underlying vsum4ubs and vsum4sbs instructions
used in these patterns do both a reduction and an add to another value.
If that other value is large enough this can trigger a saturation event.
However, the patterns use vzero for this other value, so it's not
possible to approach the saturation cutoff for either instruction since
the reductions are being done on byte values.  (Each word in the vector
result is the sum of the corresponding four byte values in the vector
source, added to the other value, which here is zero.)
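
In C terms, the semantics are roughly as follows (an illustrative model only,
not the GCC implementation):

#include <stdint.h>

/* Model of vsum4sbs: each 32-bit word of the result is the sum of the
   four signed bytes in the corresponding word of A, added to the
   corresponding signed word of B, with signed saturation.  With B = 0,
   as in these patterns, each sum is bounded by 4 * 127 = 508, nowhere
   near the saturation cutoff.  */
void
vsum4sbs_model (int32_t dst[4], const int8_t a[16], const int32_t b[4])
{
  for (int w = 0; w < 4; w++)
    {
      int64_t sum = b[w];
      for (int i = 0; i < 4; i++)
        sum += a[4 * w + i];
      if (sum > INT32_MAX)
        sum = INT32_MAX;
      else if (sum < INT32_MIN)
        sum = INT32_MIN;
      dst[w] = (int32_t) sum;
    }
}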

Thanks,
Bill

> 
> A.
> 
> Bill Schmidt wrote:
> > On Fri, 2014-10-24 at 19:49 -0400, David Edelsohn wrote:
> >> On Fri, Oct 24, 2014 at 8:06 AM, Alan Lawrence <alan.lawrence@arm.com> wrote:
> >>> This migrates the reduction patterns in altivec.md and vector.md to the new
> >>> names. I've not touched paired.md as I wasn't really sure how to fix that
> >>> (how do I vec_extractv2sf ?), moreover the testing I did didn't seem to
> >>> exercise any of those patterns (iow: I'm not sure what would be an
> >>> appropriate target machine?).
> >>>
> >>> I note the reduc_uplus_v16qi (which I've removed, as unsigned and signed
> >>> addition should be equivalent) differed from reduc_splus_v16qi in using
> >>> gen_altivec_vsum4ubs rather than gen_altivec_vsum4sbs.  Testcases
> >>> gcc.dg/vect/{slp-24-big-array.c,slp-24.c,vect-reduc-1char-big-array.c,vect-reduc-1char.c}
> >>> thus produce assembly which differs from previously (only) in that
> >>> "vsum4ubs" becomes "vsum4sbs". These tests are still passing so I assume
> >>> this is OK.
> > 
> > Given that the only 32-bit quantity being added here is zero, the
> > difference in saturation points for vsum4ubs and vsum4sbs won't come
> > into play, so I agree this should be fine.
> > 
> > I would like to ask Mike Meissner to look over the changes to the
> > reduction patterns in vector.md.  He wrote those and is more familiar
> > with that piece than I am.  On the surface I don't see any problems, but
> > I could miss something subtle.
> > 
> > Otherwise I'm ok with the patch.
> > 
> > Thanks,
> > Bill
> > 
> > (p.s. Sorry for the delay on reviewing this.  As David noted, I was
> > traveling, and I ended up having no access to my mail for most of the
> > week due to an IT snafu.)
> > 
> >>> The combining of signed and unsigned addition also improves
> >>> gcc.dg/vect/{vect-outer-4i.c,vect-reduc-1short.c,vect-reduc-dot-u8b.c,vect-reduc-pattern-1c-big-array.c,vect-reduc-pattern-1c.c}
> >>> : these are now reduced using direct vector reduction, rather than with
> >>> shifts as previously (because there was only a reduc_splus rather than the
> >>> reduc_uplus these tests looked for).
> >>>
> >>> ((Side note: the RTL changes to vector.md are to match the combine patterns
> >>> in vsx.md; now that we no longer depend upon combine to generate those
> >>> patterns (as the optab outputs them directly), one might wish to remove the
> >>> smaller pattern from vsx.md, and/or simplify the RTL. I theorize that a
> >>> reduction of a two-element vector is just adding the first element to the
> >>> second, so maybe to something like
> >>>
> >>>   [(parallel [(set (match_operand:DF 0 "vfloat_operand" "")
> >>>                    (VEC_reduc:V2DF
> >>>                     (vec_select:DF
> >>>                      (match_operand:V2DF 1 "vfloat_operand" "")
> >>>                      (parallel [(const_int 1)]))
> >>>                     (vec_select:DF
> >>>                      (match_dup 1)
> >>>                      (parallel [(const_int 0)]))))
> >>>               (clobber (match_scratch:V2DF 2 ""))])]
> >>>
> >>> but I think it's best for me to leave that to the port maintainers.))
> >>>
> >>> Bootstrapped and check-gcc on powerpc64-none-linux-gnu
> >>> (gcc110.fsffrance.org, with thanks to the GCC Compile Farm).
> >>>
> >>> gcc/ChangeLog:
> >>>
> >>>         * config/rs6000/altivec.md (reduc_splus_<mode>): Rename to...
> >>>         (reduc_plus_scal_<mode>): ...this, and use rs6000_expand_vector_extract.
> >>>         (reduc_uplus_v16qi): Remove.
> >>>
> >>>         * config/rs6000/vector.md (VEC_reduc_name): Change "splus" to "plus".
> >>>         (reduc_<VEC_reduc_name>_v2df): Rename to...
> >>>         (reduc_<VEC_reduc_name>_scal_v2df): ...this, wrap VEC_reduc in a
> >>>         vec_select of element 1.
> >>>         (reduc_<VEC_reduc_name>_v4sf): Rename to...
> >>>         (reduc_<VEC_reduc_name>_scal_v4sf): ...this, wrap VEC_reduc in a
> >>>         vec_select of element 3, add scratch register.
> >>>
> >> This needs some input from Bill, but he will be busy with a conference
> >> this week.
> >>
> >> - David
> >>
> > 
> > 
> > 
> 
>
Alan Lawrence Nov. 7, 2014, 10:09 a.m. UTC | #10
Ah, I see now! Thank you for explaining that bit; I was a bit puzzled when I saw 
it, but it makes sense now!

Cheers, Alan

Bill Schmidt wrote:
> On Thu, 2014-11-06 at 16:44 +0000, Alan Lawrence wrote:
>> Hmmm. I am a little surprised by your mention of "saturation points" as I would 
>> not expect any variety of reduc_plus to be a saturating operation???
> 
> I wouldn't either, but the underlying vsum4ubs and vsum4sbs instructions
> used in these patterns do both a reduction and an add to another value.
> If that other value is large enough this can trigger a saturation event.
> However, the patterns use vzero for this other value, so it's not
> possible to approach the saturation cutoff for either instruction since
> the reductions are being done on byte values.  (Each word in the vector
> result is the sum of the corresponding four byte values in the vector
> source, added to the other value, which here is zero.)
> 
> Thanks,
> Bill
Michael Meissner Nov. 10, 2014, 10:36 p.m. UTC | #11
On Fri, Oct 24, 2014 at 01:06:41PM +0100, Alan Lawrence wrote:
> This migrates the reduction patterns in altivec.md and vector.md to
> the new names. I've not touched paired.md as I wasn't really sure
> how to fix that (how do I vec_extractv2sf ?), moreover the testing I
> did didn't seem to exercise any of those patterns (iow: I'm not sure
> what would be an appropriate target machine?).
> 
> I note the reduc_uplus_v16qi (which I've removed, as unsigned and
> signed addition should be equivalent) differed from
> reduc_splus_v16qi in using gen_altivec_vsum4ubs rather than
> gen_altivec_vsum4sbs.  Testcases gcc.dg/vect/{slp-24-big-array.c,slp-24.c,vect-reduc-1char-big-array.c,vect-reduc-1char.c}
> thus produce assembly which differs from previously (only) in that
> "vsum4ubs" becomes "vsum4sbs". These tests are still passing so I
> assume this is OK.
> 
> The combining of signed and unsigned addition also improves gcc.dg/vect/{vect-outer-4i.c,vect-reduc-1short.c,vect-reduc-dot-u8b.c,vect-reduc-pattern-1c-big-array.c,vect-reduc-pattern-1c.c}
> : these are now reduced using direct vector reduction, rather than
> with shifts as previously (because there was only a reduc_splus
> rather than the reduc_uplus these tests looked for).

I checked the integer vector add reductions, and it seems to generate the same
value with old/new code, and I like eliminating the vector shift.

> ((Side note: the RTL changes to vector.md are to match the combine
> patterns in vsx.md; now that we no longer depend upon combine to
> generate those patterns (as the optab outputs them directly), one
> might wish to remove the smaller pattern from vsx.md, and/or
> simplify the RTL. I theorize that a reduction of a two-element
> vector is just adding the first element to the second, so maybe to
> something like
> 
>   [(parallel [(set (match_operand:DF 0 "vfloat_operand" "")
> 		   (VEC_reduc:V2DF
> 		    (vec_select:DF
> 		     (match_operand:V2DF 1 "vfloat_operand" "")
> 		     (parallel [(const_int 1)]))
> 		    (vec_select:DF
> 		     (match_dup 1)
> 		     (parallel [(const_int 0)]))))
> 	      (clobber (match_scratch:V2DF 2 ""))])]
> 
> but I think it's best for me to leave that to the port maintainers.))
> 
> Bootstrapped and check-gcc on powerpc64-none-linux-gnu
> (gcc110.fsffrance.org, with thanks to the GCC Compile Farm).

However, the double pattern is completely broken.  This cannot go in.

Consider this source:

#include <stdio.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#ifndef TYPE
#define TYPE double
#endif

#ifndef OTYPE
#define OTYPE TYPE
#endif

#ifndef SIZE
#define SIZE 1024
#endif

#ifndef ALIGN
#define ALIGN 32
#endif

TYPE a[SIZE] __attribute__((__aligned__(ALIGN)));

OTYPE sum (void) __attribute__((__noinline__));

OTYPE
sum (void)
{
  size_t i;
  OTYPE s = (OTYPE) 0;

  for (i = 0; i < SIZE; i++)
    s += a[i];

  return s;
}

If I compile with today's trunk and -mcpu=power8 -ffast-math -O3, I get code
that I expect (though it could use xxpermdi instead of xxsldwi):

sum:
	.quad	.L.sum,.TOC.@tocbase,0
	.previous
	.type	sum, @function
.L.sum:
	li 10,512
	addis 9,2,.LC1@toc@ha		# gpr load fusion, type long
	ld 9,.LC1@toc@l(9)
	xxlxor 0,0,0
	mtctr 10
	.p2align 4,,15
.L2:
	lxvd2x 12,0,9
	addi 9,9,16
	xvadddp 0,0,12
	bdnz .L2
	xxsldwi 12,0,0,2
	xvadddp 1,12,0
	xxpermdi 1,1,1,2
	blr
	.long 0

However, the code produced by the patches gives:

sum:
	.quad	.L.sum,.TOC.@tocbase,0
	.previous
	.type	sum, @function
.L.sum:
	xxlxor 0,0,0
	addi 10,1,-16
	li 8,512
	addis 9,2,.LC1@toc@ha		# gpr load fusion, type long
	ld 9,.LC1@toc@l(9)
	mtctr 8
	stxvd2x 0,0,10
	.p2align 5,,31
.L2:
	addi 10,1,-16
	lxvd2x 0,0,9
	addi 9,9,16
	lxvd2x 12,0,10
	xvadddp 12,12,0
	stxvd2x 12,0,10
	bdnz .L2
	lfd 0,-16(1)
	xxpermdi 1,12,12,2
	fadd 1,0,1
	blr
	.long 0

It is unacceptable for the inner loop to do a load, vector add, and store on
every iteration.
Segher Boessenkool Nov. 11, 2014, 7:10 a.m. UTC | #12
On Mon, Nov 10, 2014 at 05:36:24PM -0500, Michael Meissner wrote:
> However, the double pattern is completely broken.  This cannot go in.

[snip]

> It is unacceptable for the inner loop to do a load, vector add, and store on
> every iteration.

Before the patch, the final reduction used *vsx_reduc_splus_v2df; after
the patch, it is *vsx_reduc_plus_v2df_scalar.  The former does a vector
add, the latter a float add.  And it uses the same pseudoregister for the
accumulator throughout.  IRA decides a register is more expensive than
memory for this, I suppose because it wants both V2DF and DF?  It doesn't
seem to like the subreg very much.

The new code does look nicer otherwise :-)


Segher
Michael Meissner Nov. 12, 2014, 1:27 a.m. UTC | #13
On Tue, Nov 11, 2014 at 01:10:01AM -0600, Segher Boessenkool wrote:
> On Mon, Nov 10, 2014 at 05:36:24PM -0500, Michael Meissner wrote:
> > However, the double pattern is completely broken.  This cannot go in.
> 
> [snip]
> 
> > It is unacceptable for the inner loop to do a load, vector add, and store
> > on every iteration.
> 
> Before the patch, the final reduction used *vsx_reduc_splus_v2df; after
> the patch, it is *vsx_reduc_plus_v2df_scalar.  The former does a vector
> add, the latter a float add.  And it uses the same pseudoregister for the
> accumulator throughout.  IRA decides a register is more expensive than
> memory for this, I suppose because it wants both V2DF and DF?  It doesn't
> seem to like the subreg very much.

I haven't looked into it in detail (I've been a little busy with the upper regs
patch), but I suspect the problem is that 128-bit and 64-bit types cannot
overlap (i.e. rs6000_cannot_change_mode_class returns true).  This is due to
the fact that scalars in VSX registers occupy the upper 64 bits, which would
not match the compiler's notion that the value should be in the bottom 64 bits.
Segher Boessenkool Nov. 12, 2014, 9:26 a.m. UTC | #14
On Tue, Nov 11, 2014 at 08:27:22PM -0500, Michael Meissner wrote:
> > Before the patch, the final reduction used *vsx_reduc_splus_v2df; after
> > the patch, it is *vsx_reduc_plus_v2df_scalar.  The former does a vector
> > add, the latter a float add.  And it uses the same pseudoregister for the
> > accumulator throughout.  IRA decides a register is more expensive than
> > memory for this, I suppose because it wants both V2DF and DF?  It doesn't
> > seem to like the subreg very much.
> 
> I haven't looked into it in detail (I've been a little busy with the upper regs
> patch), but I suspect the problem is that 128-bit and 64-bit types cannot
> overlap (i.e. rs6000_cannot_change_mode_class returns true).  This is due to
> the fact that scalars in VSX registers occupy the upper 64 bits, which would
> not match the compiler's notion that the value should be in the bottom 64 bits.

You suspect correctly.  Hacking around that in cannot_change_mode_class
doesn't help; subreg_get_info disallows it next.

Changing the pattern so it does two extracts instead of an extract and
a subreg works (you get an fmr for the high part, though; register allocation
doesn't know dest=src is free here).

_Should_ the subreg thing work?  Or should the patterns be fixed?


Segher
Michael Meissner Nov. 12, 2014, 7:05 p.m. UTC | #15
On Wed, Nov 12, 2014 at 03:26:35AM -0600, Segher Boessenkool wrote:
> On Tue, Nov 11, 2014 at 08:27:22PM -0500, Michael Meissner wrote:
> > > Before the patch, the final reduction used *vsx_reduc_splus_v2df; after
> > > the patch, it is *vsx_reduc_plus_v2df_scalar.  The former does a vector
> > > add, the latter a float add.  And it uses the same pseudoregister for the
> > > accumulator throughout.  IRA decides a register is more expensive than
> > > memory for this, I suppose because it wants both V2DF and DF?  It doesn't
> > > seem to like the subreg very much.
> > 
> > I haven't looked into it in detail (I've been a little busy with the upper regs
> > patch), but I suspect the problem is that 128-bit and 64-bit types cannot
> > overlap (i.e. rs6000_cannot_change_mode_class returns true).  This is due to
> > the fact that scalars in VSX registers occupy the upper 64 bits, which would
> > not match the compiler's notion that the value should be in the bottom 64 bits.
> 
> You suspect correctly.  Hacking around that in cannot_change_mode_class
> doesn't help; subreg_get_info disallows it next.
> 
> Changing the pattern so it does two extracts instead of an extract and
> a subreg works (you get an fmr for the high part, though; register allocation
> doesn't know dest=src is free here).
> 
> _Should_ the subreg thing work?  Or should the patterns be fixed?

As I said, we cannot allow CANNOT_CHANGE_MODE_CLASS to return false for this
case, because the hardware just does not agree with what GCC believes is the
natural placement for smaller values inside of larger register fields.  I
suspect even if you add new target support macros to fix it, it will be a game
of whack-a-mole to find all of the places where there are hidden assumptions in
the compiler about subreg ordering.

Patch

commit 846d5932041e04bbf386efbc739aee9749051bc7
Author: Alan Lawrence <alan.lawrence@arm.com>
Date:   Wed Aug 13 17:25:13 2014 +0100

    AArch64: Reintroduce gimple_fold for min+max+plus

diff --git a/gcc/config/aarch64/aarch64-builtins.c b/gcc/config/aarch64/aarch64-builtins.c
index a49da89..283469b 100644
--- a/gcc/config/aarch64/aarch64-builtins.c
+++ b/gcc/config/aarch64/aarch64-builtins.c
@@ -1188,9 +1188,6 @@  aarch64_fold_builtin (tree fndecl, int n_args ATTRIBUTE_UNUSED, tree *args,
   return NULL_TREE;
 }
 
-/* Handling of reduction operations temporarily removed so as to decouple
-   changes to tree codes from AArch64 NEON Intrinsics.  */
-#if 0
 bool
 aarch64_gimple_fold_builtin (gimple_stmt_iterator *gsi)
 {
@@ -1200,19 +1197,6 @@  aarch64_gimple_fold_builtin (gimple_stmt_iterator *gsi)
   tree fndecl;
   gimple new_stmt = NULL;
 
-  /* The operations folded below are reduction operations.  These are
-     defined to leave their result in the 0'th element (from the perspective
-     of GCC).  The architectural instruction we are folding will leave the
-     result in the 0'th element (from the perspective of the architecture).
-     For big-endian systems, these perspectives are not aligned.
-
-     It is therefore wrong to perform this fold on big-endian.  There
-     are some tricks we could play with shuffling, but the mid-end is
-     inconsistent in the way it treats reduction operations, so we will
-     end up in difficulty.  Until we fix the ambiguity - just bail out.  */
-  if (BYTES_BIG_ENDIAN)
-    return false;
-
   if (call)
     {
       fndecl = gimple_call_fndecl (stmt);
@@ -1224,23 +1208,28 @@  aarch64_gimple_fold_builtin (gimple_stmt_iterator *gsi)
 			? gimple_call_arg_ptr (stmt, 0)
 			: &error_mark_node);
 
+	  /* We use gimple's REDUC_(PLUS|MIN|MAX)_EXPRs for float, signed int
+	     and unsigned int; it will distinguish according to the types of
+	     the arguments to the __builtin.  */
 	  switch (fcode)
 	    {
-	      BUILTIN_VALL (UNOP, reduc_splus_, 10)
-		new_stmt = gimple_build_assign_with_ops (
+	      BUILTIN_VALL (UNOP, reduc_plus_scal_, 10)
+	        new_stmt = gimple_build_assign_with_ops (
 						REDUC_PLUS_EXPR,
 						gimple_call_lhs (stmt),
 						args[0],
 						NULL_TREE);
 		break;
-	      BUILTIN_VDQIF (UNOP, reduc_smax_, 10)
+	      BUILTIN_VDQIF (UNOP, reduc_smax_scal_, 10)
+	      BUILTIN_VDQ_BHSI (UNOPU, reduc_umax_scal_, 10)
 		new_stmt = gimple_build_assign_with_ops (
 						REDUC_MAX_EXPR,
 						gimple_call_lhs (stmt),
 						args[0],
 						NULL_TREE);
 		break;
-	      BUILTIN_VDQIF (UNOP, reduc_smin_, 10)
+	      BUILTIN_VDQIF (UNOP, reduc_smin_scal_, 10)
+	      BUILTIN_VDQ_BHSI (UNOPU, reduc_umin_scal_, 10)
 		new_stmt = gimple_build_assign_with_ops (
 						REDUC_MIN_EXPR,
 						gimple_call_lhs (stmt),
@@ -1262,7 +1251,6 @@  aarch64_gimple_fold_builtin (gimple_stmt_iterator *gsi)
 
   return changed;
 }
-#endif
 
 void
 aarch64_atomic_assign_expand_fenv (tree *hold, tree *clear, tree *update)
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 27d82f3..db5ff59 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -10015,8 +10015,8 @@  aarch64_asan_shadow_offset (void)
 #undef TARGET_FRAME_POINTER_REQUIRED
 #define TARGET_FRAME_POINTER_REQUIRED aarch64_frame_pointer_required
 
-//#undef TARGET_GIMPLE_FOLD_BUILTIN
-//#define TARGET_GIMPLE_FOLD_BUILTIN aarch64_gimple_fold_builtin
+#undef TARGET_GIMPLE_FOLD_BUILTIN
+#define TARGET_GIMPLE_FOLD_BUILTIN aarch64_gimple_fold_builtin
 
 #undef TARGET_GIMPLIFY_VA_ARG_EXPR
 #define TARGET_GIMPLIFY_VA_ARG_EXPR aarch64_gimplify_va_arg_expr