Message ID | mpt36inj0sq.fsf@arm.com |
---|---|
State | New |
Headers | show |
Series | Don't use integer "FMA" for shifts | expand |
On Tue, Jul 30, 2019 at 12:04 PM Richard Sandiford <richard.sandiford@arm.com> wrote: > > tree-ssa-math-opts supports FMA optabs for integers as well as > floating-point types, even though there's no distinction between > fused and unfused there. It turns out to be pretty handy for the > IFN_COND_* handling though, so I don't want to remove it, however > weird it might seem. > > Instead this patch makes sure that we don't apply it to integer > multiplications that are actually shifts (but that are represented > in gimple as multiplications because that's the canonical form). > > This is a preemptive strike. The test doesn't fail for SVE as-is, > but would after a later patch. > > Tested on aarch64-linux-gnu, aarch64_be-elf and x86_64-linux-gnu. > OK to install? > > Richard > > > 2019-07-30 Richard Sandiford <richard.sandiford@arm.com> > > gcc/ > * tree-ssa-math-opts.c (convert_mult_to_fma): Reject integer > multiplications by a power of 2 or a negative power of 2. > > gcc/testsuite/ > * gcc.dg/vect/vect-cond-arith-8.c: New test. > > Index: gcc/tree-ssa-math-opts.c > =================================================================== > --- gcc/tree-ssa-math-opts.c 2019-07-30 10:51:51.827405171 +0100 > +++ gcc/tree-ssa-math-opts.c 2019-07-30 10:52:27.327139149 +0100 > @@ -3074,10 +3074,20 @@ convert_mult_to_fma (gimple *mul_stmt, t > && flag_fp_contract_mode == FP_CONTRACT_OFF) > return false; > > - /* We don't want to do bitfield reduction ops. */ > - if (INTEGRAL_TYPE_P (type) > - && (!type_has_mode_precision_p (type) || TYPE_OVERFLOW_TRAPS (type))) > - return false; > + if (ANY_INTEGRAL_TYPE_P (type)) > + { > + /* We don't want to do bitfield reduction ops. */ > + tree itype = INTEGRAL_TYPE_P (type) ? type : TREE_TYPE (type); you can use element_type () for this. But I think it's easier to leave the existing test in-place since vector or complex types cannot have non-mode precision components. > + if (!type_has_mode_precision_p (itype) || TYPE_OVERFLOW_TRAPS (itype)) > + return false; > + > + /* Don't use FMA for multiplications that are actually shifts. */ So I question this - if the FMA can do the shift "for free" and it eventually is same cost/latency as an add why do we want to not allow this (for all targets)? Esp. for vector types I would guess a add plus a shift may be more costly (not all ISAs implement shifts anyhow). Can this be fixed up in the target instead? (by a splitter or appropriate expander?) > + tree val = VECTOR_TYPE_P (type) ? uniform_vector_p (op2) : op2; > + if (val > + && TREE_CODE (val) == INTEGER_CST > + && wi::popcount (wi::abs (wi::to_wide (val))) == 1) > + return false; > + } > > /* If the target doesn't support it, don't generate it. We assume that > if fma isn't available then fms, fnma or fnms are not either. */ > Index: gcc/testsuite/gcc.dg/vect/vect-cond-arith-8.c > =================================================================== > --- /dev/null 2019-07-30 08:53:31.317691683 +0100 > +++ gcc/testsuite/gcc.dg/vect/vect-cond-arith-8.c 2019-07-30 10:52:27.327139149 +0100 > @@ -0,0 +1,57 @@ > +/* { dg-require-effective-target scalar_all_fma } */ > +/* { dg-additional-options "-fdump-tree-optimized -ffp-contract=fast" } */ > + > +#include "tree-vect.h" > + > +#define N (VECTOR_BITS * 11 / 64 + 3) > + > +#define DEF(INV) \ > + void __attribute__ ((noipa)) \ > + f_##INV (int *restrict a, int *restrict b, \ > + int *restrict c) \ > + { \ > + for (int i = 0; i < N; ++i) \ > + { \ > + int mb = (INV & 1 ? -b[i] : b[i]); \ > + int mc = (INV & 2 ? -c[i] : c[i]); \ > + a[i] = b[i] < 10 ? mb * 8 + mc : 10; \ > + } \ > + } > + > +#define TEST(INV) \ > + { \ > + f_##INV (a, b, c); \ > + for (int i = 0; i < N; ++i) \ > + { \ > + int mb = (INV & 1 ? -b[i] : b[i]); \ > + int mc = (INV & 2 ? -c[i] : c[i]); \ > + int truev = mb * 8 + mc; \ > + if (a[i] != (i % 17 < 10 ? truev : 10)) \ > + __builtin_abort (); \ > + asm volatile ("" ::: "memory"); \ > + } \ > + } > + > +#define FOR_EACH_INV(T) \ > + T (0) T (1) T (2) T (3) > + > +FOR_EACH_INV (DEF) > + > +int > +main (void) > +{ > + int a[N], b[N], c[N]; > + for (int i = 0; i < N; ++i) > + { > + b[i] = i % 17; > + c[i] = i % 13 + 14; > + asm volatile ("" ::: "memory"); > + } > + FOR_EACH_INV (TEST) > + return 0; > +} > + > +/* { dg-final { scan-tree-dump-not { = \.COND_FMA } "optimized" } } */ > +/* { dg-final { scan-tree-dump-not { = \.COND_FMS } "optimized" } } */ > +/* { dg-final { scan-tree-dump-not { = \.COND_FNMA } "optimized" } } */ > +/* { dg-final { scan-tree-dump-not { = \.COND_FNMS } "optimized" } } */
Richard Biener <richard.guenther@gmail.com> writes: > On Tue, Jul 30, 2019 at 12:04 PM Richard Sandiford > <richard.sandiford@arm.com> wrote: >> >> tree-ssa-math-opts supports FMA optabs for integers as well as >> floating-point types, even though there's no distinction between >> fused and unfused there. It turns out to be pretty handy for the >> IFN_COND_* handling though, so I don't want to remove it, however >> weird it might seem. >> >> Instead this patch makes sure that we don't apply it to integer >> multiplications that are actually shifts (but that are represented >> in gimple as multiplications because that's the canonical form). >> >> This is a preemptive strike. The test doesn't fail for SVE as-is, >> but would after a later patch. >> >> Tested on aarch64-linux-gnu, aarch64_be-elf and x86_64-linux-gnu. >> OK to install? >> >> Richard >> >> >> 2019-07-30 Richard Sandiford <richard.sandiford@arm.com> >> >> gcc/ >> * tree-ssa-math-opts.c (convert_mult_to_fma): Reject integer >> multiplications by a power of 2 or a negative power of 2. >> >> gcc/testsuite/ >> * gcc.dg/vect/vect-cond-arith-8.c: New test. >> >> Index: gcc/tree-ssa-math-opts.c >> =================================================================== >> --- gcc/tree-ssa-math-opts.c 2019-07-30 10:51:51.827405171 +0100 >> +++ gcc/tree-ssa-math-opts.c 2019-07-30 10:52:27.327139149 +0100 >> @@ -3074,10 +3074,20 @@ convert_mult_to_fma (gimple *mul_stmt, t >> && flag_fp_contract_mode == FP_CONTRACT_OFF) >> return false; >> >> - /* We don't want to do bitfield reduction ops. */ >> - if (INTEGRAL_TYPE_P (type) >> - && (!type_has_mode_precision_p (type) || TYPE_OVERFLOW_TRAPS (type))) >> - return false; >> + if (ANY_INTEGRAL_TYPE_P (type)) >> + { >> + /* We don't want to do bitfield reduction ops. */ >> + tree itype = INTEGRAL_TYPE_P (type) ? type : TREE_TYPE (type); > > you can use element_type () for this. But I think it's easier to > leave the existing > test in-place since vector or complex types cannot have non-mode precision > components. Ah, yeah. Guess I was over-generalising here :-) >> + if (!type_has_mode_precision_p (itype) || TYPE_OVERFLOW_TRAPS (itype)) >> + return false; >> + >> + /* Don't use FMA for multiplications that are actually shifts. */ > > So I question this - if the FMA can do the shift "for free" and it eventually is > same cost/latency as an add why do we want to not allow this (for all targets)? > Esp. for vector types I would guess a add plus a shift may be more costly > (not all ISAs implement shifts anyhow). The shift doesn't really come for free. By using FMA we're converting the shift by a constant into a general multiplication. > Can this be fixed up in the target instead? (by a splitter or appropriate > expander?) OK, I'll try to do it that way instead. Thanks, Richard > >> + tree val = VECTOR_TYPE_P (type) ? uniform_vector_p (op2) : op2; >> + if (val >> + && TREE_CODE (val) == INTEGER_CST >> + && wi::popcount (wi::abs (wi::to_wide (val))) == 1) >> + return false; >> + } >> >> /* If the target doesn't support it, don't generate it. We assume that >> if fma isn't available then fms, fnma or fnms are not either. */ >> Index: gcc/testsuite/gcc.dg/vect/vect-cond-arith-8.c >> =================================================================== >> --- /dev/null 2019-07-30 08:53:31.317691683 +0100 >> +++ gcc/testsuite/gcc.dg/vect/vect-cond-arith-8.c 2019-07-30 10:52:27.327139149 +0100 >> @@ -0,0 +1,57 @@ >> +/* { dg-require-effective-target scalar_all_fma } */ >> +/* { dg-additional-options "-fdump-tree-optimized -ffp-contract=fast" } */ >> + >> +#include "tree-vect.h" >> + >> +#define N (VECTOR_BITS * 11 / 64 + 3) >> + >> +#define DEF(INV) \ >> + void __attribute__ ((noipa)) \ >> + f_##INV (int *restrict a, int *restrict b, \ >> + int *restrict c) \ >> + { \ >> + for (int i = 0; i < N; ++i) \ >> + { \ >> + int mb = (INV & 1 ? -b[i] : b[i]); \ >> + int mc = (INV & 2 ? -c[i] : c[i]); \ >> + a[i] = b[i] < 10 ? mb * 8 + mc : 10; \ >> + } \ >> + } >> + >> +#define TEST(INV) \ >> + { \ >> + f_##INV (a, b, c); \ >> + for (int i = 0; i < N; ++i) \ >> + { \ >> + int mb = (INV & 1 ? -b[i] : b[i]); \ >> + int mc = (INV & 2 ? -c[i] : c[i]); \ >> + int truev = mb * 8 + mc; \ >> + if (a[i] != (i % 17 < 10 ? truev : 10)) \ >> + __builtin_abort (); \ >> + asm volatile ("" ::: "memory"); \ >> + } \ >> + } >> + >> +#define FOR_EACH_INV(T) \ >> + T (0) T (1) T (2) T (3) >> + >> +FOR_EACH_INV (DEF) >> + >> +int >> +main (void) >> +{ >> + int a[N], b[N], c[N]; >> + for (int i = 0; i < N; ++i) >> + { >> + b[i] = i % 17; >> + c[i] = i % 13 + 14; >> + asm volatile ("" ::: "memory"); >> + } >> + FOR_EACH_INV (TEST) >> + return 0; >> +} >> + >> +/* { dg-final { scan-tree-dump-not { = \.COND_FMA } "optimized" } } */ >> +/* { dg-final { scan-tree-dump-not { = \.COND_FMS } "optimized" } } */ >> +/* { dg-final { scan-tree-dump-not { = \.COND_FNMA } "optimized" } } */ >> +/* { dg-final { scan-tree-dump-not { = \.COND_FNMS } "optimized" } } */
On Tue, Jul 30, 2019 at 12:50 PM Richard Sandiford <richard.sandiford@arm.com> wrote: > > Richard Biener <richard.guenther@gmail.com> writes: > > On Tue, Jul 30, 2019 at 12:04 PM Richard Sandiford > > <richard.sandiford@arm.com> wrote: > >> > >> tree-ssa-math-opts supports FMA optabs for integers as well as > >> floating-point types, even though there's no distinction between > >> fused and unfused there. It turns out to be pretty handy for the > >> IFN_COND_* handling though, so I don't want to remove it, however > >> weird it might seem. > >> > >> Instead this patch makes sure that we don't apply it to integer > >> multiplications that are actually shifts (but that are represented > >> in gimple as multiplications because that's the canonical form). > >> > >> This is a preemptive strike. The test doesn't fail for SVE as-is, > >> but would after a later patch. > >> > >> Tested on aarch64-linux-gnu, aarch64_be-elf and x86_64-linux-gnu. > >> OK to install? > >> > >> Richard > >> > >> > >> 2019-07-30 Richard Sandiford <richard.sandiford@arm.com> > >> > >> gcc/ > >> * tree-ssa-math-opts.c (convert_mult_to_fma): Reject integer > >> multiplications by a power of 2 or a negative power of 2. > >> > >> gcc/testsuite/ > >> * gcc.dg/vect/vect-cond-arith-8.c: New test. > >> > >> Index: gcc/tree-ssa-math-opts.c > >> =================================================================== > >> --- gcc/tree-ssa-math-opts.c 2019-07-30 10:51:51.827405171 +0100 > >> +++ gcc/tree-ssa-math-opts.c 2019-07-30 10:52:27.327139149 +0100 > >> @@ -3074,10 +3074,20 @@ convert_mult_to_fma (gimple *mul_stmt, t > >> && flag_fp_contract_mode == FP_CONTRACT_OFF) > >> return false; > >> > >> - /* We don't want to do bitfield reduction ops. */ > >> - if (INTEGRAL_TYPE_P (type) > >> - && (!type_has_mode_precision_p (type) || TYPE_OVERFLOW_TRAPS (type))) > >> - return false; > >> + if (ANY_INTEGRAL_TYPE_P (type)) > >> + { > >> + /* We don't want to do bitfield reduction ops. */ > >> + tree itype = INTEGRAL_TYPE_P (type) ? type : TREE_TYPE (type); > > > > you can use element_type () for this. But I think it's easier to > > leave the existing > > test in-place since vector or complex types cannot have non-mode precision > > components. > > Ah, yeah. Guess I was over-generalising here :-) > > >> + if (!type_has_mode_precision_p (itype) || TYPE_OVERFLOW_TRAPS (itype)) > >> + return false; > >> + > >> + /* Don't use FMA for multiplications that are actually shifts. */ > > > > So I question this - if the FMA can do the shift "for free" and it eventually is > > same cost/latency as an add why do we want to not allow this (for all targets)? > > Esp. for vector types I would guess a add plus a shift may be more costly > > (not all ISAs implement shifts anyhow). > > The shift doesn't really come for free. By using FMA we're converting > the shift by a constant into a general multiplication. Yeah, it probably "depends". OTOH the shift may end up being a (vector) multiply anyway due to target constraints. I wonder how we select between (vector) shift and multiply when expanding without FMA. Ideally the scheduler would have a say here based on port utilization ;) > > Can this be fixed up in the target instead? (by a splitter or appropriate > > expander?) > > OK, I'll try to do it that way instead. > > Thanks, > Richard > > > > >> + tree val = VECTOR_TYPE_P (type) ? uniform_vector_p (op2) : op2; > >> + if (val > >> + && TREE_CODE (val) == INTEGER_CST > >> + && wi::popcount (wi::abs (wi::to_wide (val))) == 1) > >> + return false; > >> + } > >> > >> /* If the target doesn't support it, don't generate it. We assume that > >> if fma isn't available then fms, fnma or fnms are not either. */ > >> Index: gcc/testsuite/gcc.dg/vect/vect-cond-arith-8.c > >> =================================================================== > >> --- /dev/null 2019-07-30 08:53:31.317691683 +0100 > >> +++ gcc/testsuite/gcc.dg/vect/vect-cond-arith-8.c 2019-07-30 10:52:27.327139149 +0100 > >> @@ -0,0 +1,57 @@ > >> +/* { dg-require-effective-target scalar_all_fma } */ > >> +/* { dg-additional-options "-fdump-tree-optimized -ffp-contract=fast" } */ > >> + > >> +#include "tree-vect.h" > >> + > >> +#define N (VECTOR_BITS * 11 / 64 + 3) > >> + > >> +#define DEF(INV) \ > >> + void __attribute__ ((noipa)) \ > >> + f_##INV (int *restrict a, int *restrict b, \ > >> + int *restrict c) \ > >> + { \ > >> + for (int i = 0; i < N; ++i) \ > >> + { \ > >> + int mb = (INV & 1 ? -b[i] : b[i]); \ > >> + int mc = (INV & 2 ? -c[i] : c[i]); \ > >> + a[i] = b[i] < 10 ? mb * 8 + mc : 10; \ > >> + } \ > >> + } > >> + > >> +#define TEST(INV) \ > >> + { \ > >> + f_##INV (a, b, c); \ > >> + for (int i = 0; i < N; ++i) \ > >> + { \ > >> + int mb = (INV & 1 ? -b[i] : b[i]); \ > >> + int mc = (INV & 2 ? -c[i] : c[i]); \ > >> + int truev = mb * 8 + mc; \ > >> + if (a[i] != (i % 17 < 10 ? truev : 10)) \ > >> + __builtin_abort (); \ > >> + asm volatile ("" ::: "memory"); \ > >> + } \ > >> + } > >> + > >> +#define FOR_EACH_INV(T) \ > >> + T (0) T (1) T (2) T (3) > >> + > >> +FOR_EACH_INV (DEF) > >> + > >> +int > >> +main (void) > >> +{ > >> + int a[N], b[N], c[N]; > >> + for (int i = 0; i < N; ++i) > >> + { > >> + b[i] = i % 17; > >> + c[i] = i % 13 + 14; > >> + asm volatile ("" ::: "memory"); > >> + } > >> + FOR_EACH_INV (TEST) > >> + return 0; > >> +} > >> + > >> +/* { dg-final { scan-tree-dump-not { = \.COND_FMA } "optimized" } } */ > >> +/* { dg-final { scan-tree-dump-not { = \.COND_FMS } "optimized" } } */ > >> +/* { dg-final { scan-tree-dump-not { = \.COND_FNMA } "optimized" } } */ > >> +/* { dg-final { scan-tree-dump-not { = \.COND_FNMS } "optimized" } } */
On 30/07/2019 11:50, Richard Sandiford wrote: > Richard Biener <richard.guenther@gmail.com> writes: >> On Tue, Jul 30, 2019 at 12:04 PM Richard Sandiford >> <richard.sandiford@arm.com> wrote: >>> >>> tree-ssa-math-opts supports FMA optabs for integers as well as >>> floating-point types, even though there's no distinction between >>> fused and unfused there. It turns out to be pretty handy for the >>> IFN_COND_* handling though, so I don't want to remove it, however >>> weird it might seem. >>> >>> Instead this patch makes sure that we don't apply it to integer >>> multiplications that are actually shifts (but that are represented >>> in gimple as multiplications because that's the canonical form). >>> >>> This is a preemptive strike. The test doesn't fail for SVE as-is, >>> but would after a later patch. >>> >>> Tested on aarch64-linux-gnu, aarch64_be-elf and x86_64-linux-gnu. >>> OK to install? >>> >>> Richard >>> >>> >>> 2019-07-30 Richard Sandiford <richard.sandiford@arm.com> >>> >>> gcc/ >>> * tree-ssa-math-opts.c (convert_mult_to_fma): Reject integer >>> multiplications by a power of 2 or a negative power of 2. >>> >>> gcc/testsuite/ >>> * gcc.dg/vect/vect-cond-arith-8.c: New test. >>> >>> Index: gcc/tree-ssa-math-opts.c >>> =================================================================== >>> --- gcc/tree-ssa-math-opts.c 2019-07-30 10:51:51.827405171 +0100 >>> +++ gcc/tree-ssa-math-opts.c 2019-07-30 10:52:27.327139149 +0100 >>> @@ -3074,10 +3074,20 @@ convert_mult_to_fma (gimple *mul_stmt, t >>> && flag_fp_contract_mode == FP_CONTRACT_OFF) >>> return false; >>> >>> - /* We don't want to do bitfield reduction ops. */ >>> - if (INTEGRAL_TYPE_P (type) >>> - && (!type_has_mode_precision_p (type) || TYPE_OVERFLOW_TRAPS (type))) >>> - return false; >>> + if (ANY_INTEGRAL_TYPE_P (type)) >>> + { >>> + /* We don't want to do bitfield reduction ops. */ >>> + tree itype = INTEGRAL_TYPE_P (type) ? type : TREE_TYPE (type); >> >> you can use element_type () for this. But I think it's easier to >> leave the existing >> test in-place since vector or complex types cannot have non-mode precision >> components. > > Ah, yeah. Guess I was over-generalising here :-) > >>> + if (!type_has_mode_precision_p (itype) || TYPE_OVERFLOW_TRAPS (itype)) >>> + return false; >>> + >>> + /* Don't use FMA for multiplications that are actually shifts. */ >> >> So I question this - if the FMA can do the shift "for free" and it eventually is >> same cost/latency as an add why do we want to not allow this (for all targets)? >> Esp. for vector types I would guess a add plus a shift may be more costly >> (not all ISAs implement shifts anyhow). > > The shift doesn't really come for free. By using FMA we're converting > the shift by a constant into a general multiplication. Isn't this just the problem that the madd<mode> pattern for aarch64 doesn't accept constants that are a power of 2? So it ends up trying to force the constant into a register unnecessarily. (define_insn "madd<mode>" [(set (match_operand:GPI 0 "register_operand" "=r") (plus:GPI (mult:GPI (match_operand:GPI 1 "register_operand" "r") (match_operand:GPI 2 "register_operand" "r")) --- This should be reg_or_imm_power2, then a second alternative for the immediate case. (match_operand:GPI 3 "register_operand" "r")))] "" R. > >> Can this be fixed up in the target instead? (by a splitter or appropriate >> expander?) > > OK, I'll try to do it that way instead. > > Thanks, > Richard > >> >>> + tree val = VECTOR_TYPE_P (type) ? uniform_vector_p (op2) : op2; >>> + if (val >>> + && TREE_CODE (val) == INTEGER_CST >>> + && wi::popcount (wi::abs (wi::to_wide (val))) == 1) >>> + return false; >>> + } >>> >>> /* If the target doesn't support it, don't generate it. We assume that >>> if fma isn't available then fms, fnma or fnms are not either. */ >>> Index: gcc/testsuite/gcc.dg/vect/vect-cond-arith-8.c >>> =================================================================== >>> --- /dev/null 2019-07-30 08:53:31.317691683 +0100 >>> +++ gcc/testsuite/gcc.dg/vect/vect-cond-arith-8.c 2019-07-30 10:52:27.327139149 +0100 >>> @@ -0,0 +1,57 @@ >>> +/* { dg-require-effective-target scalar_all_fma } */ >>> +/* { dg-additional-options "-fdump-tree-optimized -ffp-contract=fast" } */ >>> + >>> +#include "tree-vect.h" >>> + >>> +#define N (VECTOR_BITS * 11 / 64 + 3) >>> + >>> +#define DEF(INV) \ >>> + void __attribute__ ((noipa)) \ >>> + f_##INV (int *restrict a, int *restrict b, \ >>> + int *restrict c) \ >>> + { \ >>> + for (int i = 0; i < N; ++i) \ >>> + { \ >>> + int mb = (INV & 1 ? -b[i] : b[i]); \ >>> + int mc = (INV & 2 ? -c[i] : c[i]); \ >>> + a[i] = b[i] < 10 ? mb * 8 + mc : 10; \ >>> + } \ >>> + } >>> + >>> +#define TEST(INV) \ >>> + { \ >>> + f_##INV (a, b, c); \ >>> + for (int i = 0; i < N; ++i) \ >>> + { \ >>> + int mb = (INV & 1 ? -b[i] : b[i]); \ >>> + int mc = (INV & 2 ? -c[i] : c[i]); \ >>> + int truev = mb * 8 + mc; \ >>> + if (a[i] != (i % 17 < 10 ? truev : 10)) \ >>> + __builtin_abort (); \ >>> + asm volatile ("" ::: "memory"); \ >>> + } \ >>> + } >>> + >>> +#define FOR_EACH_INV(T) \ >>> + T (0) T (1) T (2) T (3) >>> + >>> +FOR_EACH_INV (DEF) >>> + >>> +int >>> +main (void) >>> +{ >>> + int a[N], b[N], c[N]; >>> + for (int i = 0; i < N; ++i) >>> + { >>> + b[i] = i % 17; >>> + c[i] = i % 13 + 14; >>> + asm volatile ("" ::: "memory"); >>> + } >>> + FOR_EACH_INV (TEST) >>> + return 0; >>> +} >>> + >>> +/* { dg-final { scan-tree-dump-not { = \.COND_FMA } "optimized" } } */ >>> +/* { dg-final { scan-tree-dump-not { = \.COND_FMS } "optimized" } } */ >>> +/* { dg-final { scan-tree-dump-not { = \.COND_FNMA } "optimized" } } */ >>> +/* { dg-final { scan-tree-dump-not { = \.COND_FNMS } "optimized" } } */
"Richard Earnshaw (lists)" <Richard.Earnshaw@arm.com> writes: > On 30/07/2019 11:50, Richard Sandiford wrote: >> Richard Biener <richard.guenther@gmail.com> writes: >>> On Tue, Jul 30, 2019 at 12:04 PM Richard Sandiford >>> <richard.sandiford@arm.com> wrote: >>>> >>>> tree-ssa-math-opts supports FMA optabs for integers as well as >>>> floating-point types, even though there's no distinction between >>>> fused and unfused there. It turns out to be pretty handy for the >>>> IFN_COND_* handling though, so I don't want to remove it, however >>>> weird it might seem. >>>> >>>> Instead this patch makes sure that we don't apply it to integer >>>> multiplications that are actually shifts (but that are represented >>>> in gimple as multiplications because that's the canonical form). >>>> >>>> This is a preemptive strike. The test doesn't fail for SVE as-is, >>>> but would after a later patch. >>>> >>>> Tested on aarch64-linux-gnu, aarch64_be-elf and x86_64-linux-gnu. >>>> OK to install? >>>> >>>> Richard >>>> >>>> >>>> 2019-07-30 Richard Sandiford <richard.sandiford@arm.com> >>>> >>>> gcc/ >>>> * tree-ssa-math-opts.c (convert_mult_to_fma): Reject integer >>>> multiplications by a power of 2 or a negative power of 2. >>>> >>>> gcc/testsuite/ >>>> * gcc.dg/vect/vect-cond-arith-8.c: New test. >>>> >>>> Index: gcc/tree-ssa-math-opts.c >>>> =================================================================== >>>> --- gcc/tree-ssa-math-opts.c 2019-07-30 10:51:51.827405171 +0100 >>>> +++ gcc/tree-ssa-math-opts.c 2019-07-30 10:52:27.327139149 +0100 >>>> @@ -3074,10 +3074,20 @@ convert_mult_to_fma (gimple *mul_stmt, t >>>> && flag_fp_contract_mode == FP_CONTRACT_OFF) >>>> return false; >>>> >>>> - /* We don't want to do bitfield reduction ops. */ >>>> - if (INTEGRAL_TYPE_P (type) >>>> - && (!type_has_mode_precision_p (type) || TYPE_OVERFLOW_TRAPS (type))) >>>> - return false; >>>> + if (ANY_INTEGRAL_TYPE_P (type)) >>>> + { >>>> + /* We don't want to do bitfield reduction ops. */ >>>> + tree itype = INTEGRAL_TYPE_P (type) ? type : TREE_TYPE (type); >>> >>> you can use element_type () for this. But I think it's easier to >>> leave the existing >>> test in-place since vector or complex types cannot have non-mode precision >>> components. >> >> Ah, yeah. Guess I was over-generalising here :-) >> >>>> + if (!type_has_mode_precision_p (itype) || TYPE_OVERFLOW_TRAPS (itype)) >>>> + return false; >>>> + >>>> + /* Don't use FMA for multiplications that are actually shifts. */ >>> >>> So I question this - if the FMA can do the shift "for free" and it eventually is >>> same cost/latency as an add why do we want to not allow this (for all targets)? >>> Esp. for vector types I would guess a add plus a shift may be more costly >>> (not all ISAs implement shifts anyhow). >> >> The shift doesn't really come for free. By using FMA we're converting >> the shift by a constant into a general multiplication. > > Isn't this just the problem that the madd<mode> pattern for aarch64 > doesn't accept constants that are a power of 2? So it ends up trying to > force the constant into a register unnecessarily. > > (define_insn "madd<mode>" > [(set (match_operand:GPI 0 "register_operand" "=r") > (plus:GPI (mult:GPI (match_operand:GPI 1 "register_operand" "r") > (match_operand:GPI 2 "register_operand" "r")) > --- This should be reg_or_imm_power2, then a second alternative for the > immediate case. > > (match_operand:GPI 3 "register_operand" "r")))] > "" That might be a problem too (e.g. maybe for intrinsics?), but this patch was only dealing with the way that the "fma<mode>4" group of patterns are used. The AArch64 port doesn't yet define fma for any integer modes AFAICT, but I have a patch that adds them for SVE. (And it now handles shifts specially, as Richard suggested.) Thanks, Richard > > R. > >> >>> Can this be fixed up in the target instead? (by a splitter or appropriate >>> expander?) >> >> OK, I'll try to do it that way instead. >> >> Thanks, >> Richard >> >>> >>>> + tree val = VECTOR_TYPE_P (type) ? uniform_vector_p (op2) : op2; >>>> + if (val >>>> + && TREE_CODE (val) == INTEGER_CST >>>> + && wi::popcount (wi::abs (wi::to_wide (val))) == 1) >>>> + return false; >>>> + } >>>> >>>> /* If the target doesn't support it, don't generate it. We assume that >>>> if fma isn't available then fms, fnma or fnms are not either. */ >>>> Index: gcc/testsuite/gcc.dg/vect/vect-cond-arith-8.c >>>> =================================================================== >>>> --- /dev/null 2019-07-30 08:53:31.317691683 +0100 >>>> +++ gcc/testsuite/gcc.dg/vect/vect-cond-arith-8.c 2019-07-30 10:52:27.327139149 +0100 >>>> @@ -0,0 +1,57 @@ >>>> +/* { dg-require-effective-target scalar_all_fma } */ >>>> +/* { dg-additional-options "-fdump-tree-optimized -ffp-contract=fast" } */ >>>> + >>>> +#include "tree-vect.h" >>>> + >>>> +#define N (VECTOR_BITS * 11 / 64 + 3) >>>> + >>>> +#define DEF(INV) \ >>>> + void __attribute__ ((noipa)) \ >>>> + f_##INV (int *restrict a, int *restrict b, \ >>>> + int *restrict c) \ >>>> + { \ >>>> + for (int i = 0; i < N; ++i) \ >>>> + { \ >>>> + int mb = (INV & 1 ? -b[i] : b[i]); \ >>>> + int mc = (INV & 2 ? -c[i] : c[i]); \ >>>> + a[i] = b[i] < 10 ? mb * 8 + mc : 10; \ >>>> + } \ >>>> + } >>>> + >>>> +#define TEST(INV) \ >>>> + { \ >>>> + f_##INV (a, b, c); \ >>>> + for (int i = 0; i < N; ++i) \ >>>> + { \ >>>> + int mb = (INV & 1 ? -b[i] : b[i]); \ >>>> + int mc = (INV & 2 ? -c[i] : c[i]); \ >>>> + int truev = mb * 8 + mc; \ >>>> + if (a[i] != (i % 17 < 10 ? truev : 10)) \ >>>> + __builtin_abort (); \ >>>> + asm volatile ("" ::: "memory"); \ >>>> + } \ >>>> + } >>>> + >>>> +#define FOR_EACH_INV(T) \ >>>> + T (0) T (1) T (2) T (3) >>>> + >>>> +FOR_EACH_INV (DEF) >>>> + >>>> +int >>>> +main (void) >>>> +{ >>>> + int a[N], b[N], c[N]; >>>> + for (int i = 0; i < N; ++i) >>>> + { >>>> + b[i] = i % 17; >>>> + c[i] = i % 13 + 14; >>>> + asm volatile ("" ::: "memory"); >>>> + } >>>> + FOR_EACH_INV (TEST) >>>> + return 0; >>>> +} >>>> + >>>> +/* { dg-final { scan-tree-dump-not { = \.COND_FMA } "optimized" } } */ >>>> +/* { dg-final { scan-tree-dump-not { = \.COND_FMS } "optimized" } } */ >>>> +/* { dg-final { scan-tree-dump-not { = \.COND_FNMA } "optimized" } } */ >>>> +/* { dg-final { scan-tree-dump-not { = \.COND_FNMS } "optimized" } } */
Index: gcc/tree-ssa-math-opts.c =================================================================== --- gcc/tree-ssa-math-opts.c 2019-07-30 10:51:51.827405171 +0100 +++ gcc/tree-ssa-math-opts.c 2019-07-30 10:52:27.327139149 +0100 @@ -3074,10 +3074,20 @@ convert_mult_to_fma (gimple *mul_stmt, t && flag_fp_contract_mode == FP_CONTRACT_OFF) return false; - /* We don't want to do bitfield reduction ops. */ - if (INTEGRAL_TYPE_P (type) - && (!type_has_mode_precision_p (type) || TYPE_OVERFLOW_TRAPS (type))) - return false; + if (ANY_INTEGRAL_TYPE_P (type)) + { + /* We don't want to do bitfield reduction ops. */ + tree itype = INTEGRAL_TYPE_P (type) ? type : TREE_TYPE (type); + if (!type_has_mode_precision_p (itype) || TYPE_OVERFLOW_TRAPS (itype)) + return false; + + /* Don't use FMA for multiplications that are actually shifts. */ + tree val = VECTOR_TYPE_P (type) ? uniform_vector_p (op2) : op2; + if (val + && TREE_CODE (val) == INTEGER_CST + && wi::popcount (wi::abs (wi::to_wide (val))) == 1) + return false; + } /* If the target doesn't support it, don't generate it. We assume that if fma isn't available then fms, fnma or fnms are not either. */ Index: gcc/testsuite/gcc.dg/vect/vect-cond-arith-8.c =================================================================== --- /dev/null 2019-07-30 08:53:31.317691683 +0100 +++ gcc/testsuite/gcc.dg/vect/vect-cond-arith-8.c 2019-07-30 10:52:27.327139149 +0100 @@ -0,0 +1,57 @@ +/* { dg-require-effective-target scalar_all_fma } */ +/* { dg-additional-options "-fdump-tree-optimized -ffp-contract=fast" } */ + +#include "tree-vect.h" + +#define N (VECTOR_BITS * 11 / 64 + 3) + +#define DEF(INV) \ + void __attribute__ ((noipa)) \ + f_##INV (int *restrict a, int *restrict b, \ + int *restrict c) \ + { \ + for (int i = 0; i < N; ++i) \ + { \ + int mb = (INV & 1 ? -b[i] : b[i]); \ + int mc = (INV & 2 ? -c[i] : c[i]); \ + a[i] = b[i] < 10 ? mb * 8 + mc : 10; \ + } \ + } + +#define TEST(INV) \ + { \ + f_##INV (a, b, c); \ + for (int i = 0; i < N; ++i) \ + { \ + int mb = (INV & 1 ? -b[i] : b[i]); \ + int mc = (INV & 2 ? -c[i] : c[i]); \ + int truev = mb * 8 + mc; \ + if (a[i] != (i % 17 < 10 ? truev : 10)) \ + __builtin_abort (); \ + asm volatile ("" ::: "memory"); \ + } \ + } + +#define FOR_EACH_INV(T) \ + T (0) T (1) T (2) T (3) + +FOR_EACH_INV (DEF) + +int +main (void) +{ + int a[N], b[N], c[N]; + for (int i = 0; i < N; ++i) + { + b[i] = i % 17; + c[i] = i % 13 + 14; + asm volatile ("" ::: "memory"); + } + FOR_EACH_INV (TEST) + return 0; +} + +/* { dg-final { scan-tree-dump-not { = \.COND_FMA } "optimized" } } */ +/* { dg-final { scan-tree-dump-not { = \.COND_FMS } "optimized" } } */ +/* { dg-final { scan-tree-dump-not { = \.COND_FNMA } "optimized" } } */ +/* { dg-final { scan-tree-dump-not { = \.COND_FNMS } "optimized" } } */