Message ID | 56C1B74D.4070009@foss.arm.com |
---|---|
State | New |
Headers | show |
On Mon, Feb 15, 2016 at 11:32 AM, Kyrill Tkachov <kyrylo.tkachov@foss.arm.com> wrote: > > On 04/02/16 08:58, Ramana Radhakrishnan wrote: >> >> On Tue, Jun 30, 2015 at 2:15 AM, Jim Wilson <jim.wilson@linaro.org> wrote: >>> >>> This is my suggested fix for PR 65932, which is a linux kernel >>> miscompile with gcc-5.1. >>> >>> The problem here is caused by a chain of events. The first is that >>> the relatively new eipa_sra pass creates fake parameters that behave >>> slightly differently than normal parameters. The second is that the >>> optimizer creates phi nodes that copy local variables to fake >>> parameters and/or vice versa. The third is that the ouf-of-ssa pass >>> assumes that it can emit simple move instructions for these phi nodes. >>> And the fourth is that the ARM port has a PROMOTE_MODE macro that >>> forces QImode and HImode to unsigned, but a >>> TARGET_PROMOTE_FUNCTION_MODE hook that does not. So signed char and >>> short parameters have different in register representations than local >>> variables, and require a conversion when copying between them, a >>> conversion that the out-of-ssa pass can't easily emit. >>> >>> Ultimately, I think this is a problem in the arm backend. It should >>> not have a PROMOTE_MODE macro that is changing the sign of char and >>> short local variables. I also think that we should merge the >>> PROMOTE_MODE macro with the TARGET_PROMOTE_FUNCTION_MODE hook to >>> prevent this from happening again. >>> >>> I see four general problems with the current ARM PROMOTE_MODE definition. >>> 1) Unsigned char is only faster for armv5 and earlier, before the sxtb >>> instruction was added. It is a lose for armv6 and later. >>> 2) Unsigned short was only faster for targets that don't support >>> unaligned accesses. Support for these targets was removed a while >>> ago, and this PROMODE_MODE hunk should have been removed at the same >>> time. It was accidentally left behind. >>> 3) TARGET_PROMOTE_FUNCTION_MODE used to be a boolean hook, when it was >>> converted to a function, the PROMOTE_MODE code was copied without the >>> UNSIGNEDP changes. Thus it is only an accident that >>> TARGET_PROMOTE_FUNCTION_MODE and PROMOTE_MODE disagree. Changing >>> TARGET_PROMOTE_FUNCTION_MODE is an ABI change, so only PROMOTE_MODE >>> changes to resolve the difference are safe. >>> 4) There is a general principle that you should only change signedness >>> in PROMOTE_MODE if the hardware forces it, as otherwise this results >>> in extra conversion instructions that make code slower. The mips64 >>> hardware for instance requires that 32-bit values be sign-extended >>> regardless of type, and instructions may trap if this is not true. >>> However, it has a set of 32-bit instructions that operate on these >>> values, and hence no conversions are required. There is no similar >>> case on ARM. Thus the conversions are unnecessary and unwise. This >>> can be seen in the testcases where gcc emits both a zero-extend and a >>> sign-extend inside a loop, as the sign-extend is required for a >>> compare, and the zero-extend is required by PROMOTE_MODE. >> >> Given Kyrill's testing with the patch and the reasonably detailed >> check of the effects of code generation changes - The arm.h hunk is ok >> - I do think we should make this explicit in the documentation that >> TARGET_PROMOTE_MODE and TARGET_PROMOTE_FUNCTION_MODE should agree and >> better still maybe put in a checking assert for the same in the >> mid-end but that could be the subject of a follow-up patch. >> >> Ok to apply just the arm.h hunk as I think Kyrill has taken care of >> the testsuite fallout separately. > > Hi all, > > I'd like to backport the arm.h from this ( r233130) to the GCC 5 > branch. As the CSE patch from my series had some fallout on x86_64 > due to a deficiency in the AVX patterns that is too invasive to fix > at this stage (and presumably backport), I'd like to just backport > this arm.h fix and adjust the tests to XFAIL the fallout that comes > with not applying the CSE patch. The attached patch does that. I would hope that someone looks to fix the AVX port for GCC 7 - as the CSE patch is something that is useful on ARM for performance in real terms and could have benefits elsewhere. OK to apply if no regressions - thanks for pushing this through. Thanks, Ramana > > The code quality fallout on code outside the testsuite is not > that gread. The SPEC benchmarks are not affected by not applying > the CSE change, and only a single sequence in a popular embedded benchmark > shows some degradation for -mtune=cortex-a9 in the same way as the > wmul-1.c and wmul-2.c tests. > > I think that's a fair tradeoff for fixing the wrong code bug on that branch. > > Ok to backport r233130 and the attached testsuite patch to the GCC 5 branch? > > Thanks, > Kyrill > > 2016-02-15 Kyrylo Tkachov <kyrylo.tkachov@arm.com> > > PR target/65932 > * gcc.target/arm/wmul-1.c: Add -mtune=cortex-a9 to dg-options. > xfail the scan-assembler test. > * gcc.target/arm/wmul-2.c: Likewise. > * gcc.target/arm/wmul-3.c: Simplify test to generate a single smulbb. > > > > >> >> regards >> Ramana >> >> >> >> >>> My change was tested with an arm bootstrap, make check, and SPEC >>> CPU2000 run. The original poster verified that this gives a linux >>> kernel that boots correctly. >>> >>> The PRMOTE_MODE change causes 3 testsuite testcases to fail. These >>> are tests to verify that smulbb and/or smlabb are generated. >>> Eliminating the unnecessary sign conversions causes us to get better >>> code that doesn't include the smulbb and smlabb instructions. I had >>> to modify the testcases to get them to emit the desired instructions. >>> With the testcase changes there are no additional testsuite failures, >>> though I'm concerned that these testcases with the changes may be >>> fragile, and future changes may break them again. >> >> >> >>> If there are ARM parts where smulbb/smlabb are faster than mul/mla, >>> then maybe we should try to add new patterns to get the instructions >>> emitted again for the unmodified testcases. >>> >>> Jim > >
On 15 February 2016 at 12:32, Kyrill Tkachov <kyrylo.tkachov@foss.arm.com> wrote: > > On 04/02/16 08:58, Ramana Radhakrishnan wrote: >> >> On Tue, Jun 30, 2015 at 2:15 AM, Jim Wilson <jim.wilson@linaro.org> wrote: >>> >>> This is my suggested fix for PR 65932, which is a linux kernel >>> miscompile with gcc-5.1. >>> >>> The problem here is caused by a chain of events. The first is that >>> the relatively new eipa_sra pass creates fake parameters that behave >>> slightly differently than normal parameters. The second is that the >>> optimizer creates phi nodes that copy local variables to fake >>> parameters and/or vice versa. The third is that the ouf-of-ssa pass >>> assumes that it can emit simple move instructions for these phi nodes. >>> And the fourth is that the ARM port has a PROMOTE_MODE macro that >>> forces QImode and HImode to unsigned, but a >>> TARGET_PROMOTE_FUNCTION_MODE hook that does not. So signed char and >>> short parameters have different in register representations than local >>> variables, and require a conversion when copying between them, a >>> conversion that the out-of-ssa pass can't easily emit. >>> >>> Ultimately, I think this is a problem in the arm backend. It should >>> not have a PROMOTE_MODE macro that is changing the sign of char and >>> short local variables. I also think that we should merge the >>> PROMOTE_MODE macro with the TARGET_PROMOTE_FUNCTION_MODE hook to >>> prevent this from happening again. >>> >>> I see four general problems with the current ARM PROMOTE_MODE definition. >>> 1) Unsigned char is only faster for armv5 and earlier, before the sxtb >>> instruction was added. It is a lose for armv6 and later. >>> 2) Unsigned short was only faster for targets that don't support >>> unaligned accesses. Support for these targets was removed a while >>> ago, and this PROMODE_MODE hunk should have been removed at the same >>> time. It was accidentally left behind. >>> 3) TARGET_PROMOTE_FUNCTION_MODE used to be a boolean hook, when it was >>> converted to a function, the PROMOTE_MODE code was copied without the >>> UNSIGNEDP changes. Thus it is only an accident that >>> TARGET_PROMOTE_FUNCTION_MODE and PROMOTE_MODE disagree. Changing >>> TARGET_PROMOTE_FUNCTION_MODE is an ABI change, so only PROMOTE_MODE >>> changes to resolve the difference are safe. >>> 4) There is a general principle that you should only change signedness >>> in PROMOTE_MODE if the hardware forces it, as otherwise this results >>> in extra conversion instructions that make code slower. The mips64 >>> hardware for instance requires that 32-bit values be sign-extended >>> regardless of type, and instructions may trap if this is not true. >>> However, it has a set of 32-bit instructions that operate on these >>> values, and hence no conversions are required. There is no similar >>> case on ARM. Thus the conversions are unnecessary and unwise. This >>> can be seen in the testcases where gcc emits both a zero-extend and a >>> sign-extend inside a loop, as the sign-extend is required for a >>> compare, and the zero-extend is required by PROMOTE_MODE. >> >> Given Kyrill's testing with the patch and the reasonably detailed >> check of the effects of code generation changes - The arm.h hunk is ok >> - I do think we should make this explicit in the documentation that >> TARGET_PROMOTE_MODE and TARGET_PROMOTE_FUNCTION_MODE should agree and >> better still maybe put in a checking assert for the same in the >> mid-end but that could be the subject of a follow-up patch. >> >> Ok to apply just the arm.h hunk as I think Kyrill has taken care of >> the testsuite fallout separately. > > Hi all, > > I'd like to backport the arm.h from this ( r233130) to the GCC 5 > branch. As the CSE patch from my series had some fallout on x86_64 > due to a deficiency in the AVX patterns that is too invasive to fix > at this stage (and presumably backport), I'd like to just backport > this arm.h fix and adjust the tests to XFAIL the fallout that comes > with not applying the CSE patch. The attached patch does that. > > The code quality fallout on code outside the testsuite is not > that gread. The SPEC benchmarks are not affected by not applying > the CSE change, and only a single sequence in a popular embedded benchmark > shows some degradation for -mtune=cortex-a9 in the same way as the > wmul-1.c and wmul-2.c tests. > > I think that's a fair tradeoff for fixing the wrong code bug on that branch. > > Ok to backport r233130 and the attached testsuite patch to the GCC 5 branch? > > Thanks, > Kyrill > > 2016-02-15 Kyrylo Tkachov <kyrylo.tkachov@arm.com> > > PR target/65932 > * gcc.target/arm/wmul-1.c: Add -mtune=cortex-a9 to dg-options. > xfail the scan-assembler test. > * gcc.target/arm/wmul-2.c: Likewise. > * gcc.target/arm/wmul-3.c: Simplify test to generate a single smulbb. > > Hi Kyrill, I've noticed that wmul-3 still fails on the gcc-5 branch when forcing GCC configuration to: --with-cpu cortex-a5 --with-fpu vfpv3-d16-fp16 (target arm-none-linux-gnueabihf) The generated code is: sxth r0, r0 sxth r1, r1 mul r0, r1, r0 instead of smulbb r0, r1, r0 on trunk. I guess we don't worry? > > >> >> regards >> Ramana >> >> >> >> >>> My change was tested with an arm bootstrap, make check, and SPEC >>> CPU2000 run. The original poster verified that this gives a linux >>> kernel that boots correctly. >>> >>> The PRMOTE_MODE change causes 3 testsuite testcases to fail. These >>> are tests to verify that smulbb and/or smlabb are generated. >>> Eliminating the unnecessary sign conversions causes us to get better >>> code that doesn't include the smulbb and smlabb instructions. I had >>> to modify the testcases to get them to emit the desired instructions. >>> With the testcase changes there are no additional testsuite failures, >>> though I'm concerned that these testcases with the changes may be >>> fragile, and future changes may break them again. >> >> >> >>> If there are ARM parts where smulbb/smlabb are faster than mul/mla, >>> then maybe we should try to add new patterns to get the instructions >>> emitted again for the unmodified testcases. >>> >>> Jim > >
On 17/02/16 10:03, Christophe Lyon wrote: > On 15 February 2016 at 12:32, Kyrill Tkachov > <kyrylo.tkachov@foss.arm.com> wrote: >> On 04/02/16 08:58, Ramana Radhakrishnan wrote: >>> On Tue, Jun 30, 2015 at 2:15 AM, Jim Wilson <jim.wilson@linaro.org> wrote: >>>> This is my suggested fix for PR 65932, which is a linux kernel >>>> miscompile with gcc-5.1. >>>> >>>> The problem here is caused by a chain of events. The first is that >>>> the relatively new eipa_sra pass creates fake parameters that behave >>>> slightly differently than normal parameters. The second is that the >>>> optimizer creates phi nodes that copy local variables to fake >>>> parameters and/or vice versa. The third is that the ouf-of-ssa pass >>>> assumes that it can emit simple move instructions for these phi nodes. >>>> And the fourth is that the ARM port has a PROMOTE_MODE macro that >>>> forces QImode and HImode to unsigned, but a >>>> TARGET_PROMOTE_FUNCTION_MODE hook that does not. So signed char and >>>> short parameters have different in register representations than local >>>> variables, and require a conversion when copying between them, a >>>> conversion that the out-of-ssa pass can't easily emit. >>>> >>>> Ultimately, I think this is a problem in the arm backend. It should >>>> not have a PROMOTE_MODE macro that is changing the sign of char and >>>> short local variables. I also think that we should merge the >>>> PROMOTE_MODE macro with the TARGET_PROMOTE_FUNCTION_MODE hook to >>>> prevent this from happening again. >>>> >>>> I see four general problems with the current ARM PROMOTE_MODE definition. >>>> 1) Unsigned char is only faster for armv5 and earlier, before the sxtb >>>> instruction was added. It is a lose for armv6 and later. >>>> 2) Unsigned short was only faster for targets that don't support >>>> unaligned accesses. Support for these targets was removed a while >>>> ago, and this PROMODE_MODE hunk should have been removed at the same >>>> time. It was accidentally left behind. >>>> 3) TARGET_PROMOTE_FUNCTION_MODE used to be a boolean hook, when it was >>>> converted to a function, the PROMOTE_MODE code was copied without the >>>> UNSIGNEDP changes. Thus it is only an accident that >>>> TARGET_PROMOTE_FUNCTION_MODE and PROMOTE_MODE disagree. Changing >>>> TARGET_PROMOTE_FUNCTION_MODE is an ABI change, so only PROMOTE_MODE >>>> changes to resolve the difference are safe. >>>> 4) There is a general principle that you should only change signedness >>>> in PROMOTE_MODE if the hardware forces it, as otherwise this results >>>> in extra conversion instructions that make code slower. The mips64 >>>> hardware for instance requires that 32-bit values be sign-extended >>>> regardless of type, and instructions may trap if this is not true. >>>> However, it has a set of 32-bit instructions that operate on these >>>> values, and hence no conversions are required. There is no similar >>>> case on ARM. Thus the conversions are unnecessary and unwise. This >>>> can be seen in the testcases where gcc emits both a zero-extend and a >>>> sign-extend inside a loop, as the sign-extend is required for a >>>> compare, and the zero-extend is required by PROMOTE_MODE. >>> Given Kyrill's testing with the patch and the reasonably detailed >>> check of the effects of code generation changes - The arm.h hunk is ok >>> - I do think we should make this explicit in the documentation that >>> TARGET_PROMOTE_MODE and TARGET_PROMOTE_FUNCTION_MODE should agree and >>> better still maybe put in a checking assert for the same in the >>> mid-end but that could be the subject of a follow-up patch. >>> >>> Ok to apply just the arm.h hunk as I think Kyrill has taken care of >>> the testsuite fallout separately. >> Hi all, >> >> I'd like to backport the arm.h from this ( r233130) to the GCC 5 >> branch. As the CSE patch from my series had some fallout on x86_64 >> due to a deficiency in the AVX patterns that is too invasive to fix >> at this stage (and presumably backport), I'd like to just backport >> this arm.h fix and adjust the tests to XFAIL the fallout that comes >> with not applying the CSE patch. The attached patch does that. >> >> The code quality fallout on code outside the testsuite is not >> that gread. The SPEC benchmarks are not affected by not applying >> the CSE change, and only a single sequence in a popular embedded benchmark >> shows some degradation for -mtune=cortex-a9 in the same way as the >> wmul-1.c and wmul-2.c tests. >> >> I think that's a fair tradeoff for fixing the wrong code bug on that branch. >> >> Ok to backport r233130 and the attached testsuite patch to the GCC 5 branch? >> >> Thanks, >> Kyrill >> >> 2016-02-15 Kyrylo Tkachov <kyrylo.tkachov@arm.com> >> >> PR target/65932 >> * gcc.target/arm/wmul-1.c: Add -mtune=cortex-a9 to dg-options. >> xfail the scan-assembler test. >> * gcc.target/arm/wmul-2.c: Likewise. >> * gcc.target/arm/wmul-3.c: Simplify test to generate a single smulbb. >> >> > Hi Kyrill, > > I've noticed that wmul-3 still fails on the gcc-5 branch when forcing GCC > configuration to: > --with-cpu cortex-a5 --with-fpu vfpv3-d16-fp16 > (target arm-none-linux-gnueabihf) > > The generated code is: > sxth r0, r0 > sxth r1, r1 > mul r0, r1, r0 > instead of > smulbb r0, r1, r0 > on trunk. > > I guess we don't worry? Hi Christophe, Hmmm, I suspect we might want to backport https://gcc.gnu.org/ml/gcc-patches/2016-01/msg01714.html to fix backend the costing logic of smulbb. Could you please try that patch to see if it helps? Thanks, Kyrill >> >>> regards >>> Ramana >>> >>> >>> >>> >>>> My change was tested with an arm bootstrap, make check, and SPEC >>>> CPU2000 run. The original poster verified that this gives a linux >>>> kernel that boots correctly. >>>> >>>> The PRMOTE_MODE change causes 3 testsuite testcases to fail. These >>>> are tests to verify that smulbb and/or smlabb are generated. >>>> Eliminating the unnecessary sign conversions causes us to get better >>>> code that doesn't include the smulbb and smlabb instructions. I had >>>> to modify the testcases to get them to emit the desired instructions. >>>> With the testcase changes there are no additional testsuite failures, >>>> though I'm concerned that these testcases with the changes may be >>>> fragile, and future changes may break them again. >>> >>> >>>> If there are ARM parts where smulbb/smlabb are faster than mul/mla, >>>> then maybe we should try to add new patterns to get the instructions >>>> emitted again for the unmodified testcases. >>>> >>>> Jim >>
diff --git a/gcc/testsuite/gcc.target/arm/wmul-1.c b/gcc/testsuite/gcc.target/arm/wmul-1.c index ddddd509fe645ea98877753773e7bcf9b6787897..ce14769c570537f16f022c8cde35843ab0695f74 100644 --- a/gcc/testsuite/gcc.target/arm/wmul-1.c +++ b/gcc/testsuite/gcc.target/arm/wmul-1.c @@ -1,6 +1,6 @@ /* { dg-do compile } */ /* { dg-require-effective-target arm_dsp } */ -/* { dg-options "-O1 -fexpensive-optimizations" } */ +/* { dg-options "-O1 -fexpensive-optimizations -mtune=cortex-a9" } */ int mac(const short *a, const short *b, int sqr, int *sum) { @@ -16,4 +16,4 @@ int mac(const short *a, const short *b, int sqr, int *sum) return sqr; } -/* { dg-final { scan-assembler-times "smlabb" 2 } } */ +/* { dg-final { scan-assembler-times "smlabb" 2 { xfail *-*-* } } } */ diff --git a/gcc/testsuite/gcc.target/arm/wmul-2.c b/gcc/testsuite/gcc.target/arm/wmul-2.c index 2ea55f9fbe12f74f38754cb72be791fd6e9495f4..a74d81b195b210650fc1bc1da598ac7c1fe82ac2 100644 --- a/gcc/testsuite/gcc.target/arm/wmul-2.c +++ b/gcc/testsuite/gcc.target/arm/wmul-2.c @@ -1,6 +1,6 @@ /* { dg-do compile } */ /* { dg-require-effective-target arm_dsp } */ -/* { dg-options "-O1 -fexpensive-optimizations" } */ +/* { dg-options "-O1 -fexpensive-optimizations -mtune=cortex-a9" } */ void vec_mpy(int y[], const short x[], short scaler) { @@ -10,4 +10,4 @@ void vec_mpy(int y[], const short x[], short scaler) y[i] += ((scaler * x[i]) >> 31); } -/* { dg-final { scan-assembler-times "smulbb" 1 } } */ +/* { dg-final { scan-assembler-times "smulbb" 1 { xfail *-*-* } } } */ diff --git a/gcc/testsuite/gcc.target/arm/wmul-3.c b/gcc/testsuite/gcc.target/arm/wmul-3.c index 144b553082e6158701639f05929987de01e7125a..87eba740142a80a1dc1979b4e79d9272a839e7b2 100644 --- a/gcc/testsuite/gcc.target/arm/wmul-3.c +++ b/gcc/testsuite/gcc.target/arm/wmul-3.c @@ -1,19 +1,11 @@ /* { dg-do compile } */ /* { dg-require-effective-target arm_dsp } */ -/* { dg-options "-O1 -fexpensive-optimizations" } */ +/* { dg-options "-O" } */ -int mac(const short *a, const short *b, int sqr, int *sum) +int +foo (int a, int b) { - int i; - int dotp = *sum; - - for (i = 0; i < 150; i++) { - dotp -= b[i] * a[i]; - sqr -= b[i] * b[i]; - } - - *sum = dotp; - return sqr; + return (short) a * (short) b; } -/* { dg-final { scan-assembler-times "smulbb" 2 } } */ +/* { dg-final { scan-assembler-times "smulbb" 1 } } */