
Don't use integer "FMA" for shifts

Message ID mpt36inj0sq.fsf@arm.com
State New
Series Don't use integer "FMA" for shifts

Commit Message

Richard Sandiford July 30, 2019, 10:04 a.m. UTC
tree-ssa-math-opts supports FMA optabs for integers as well as
floating-point types, even though there's no distinction between
fused and unfused there.  It turns out to be pretty handy for the
IFN_COND_* handling though, so I don't want to remove it, however
weird it might seem.

Instead this patch makes sure that we don't apply it to integer
multiplications that are actually shifts (but that are represented
in gimple as multiplications because that's the canonical form).
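
For illustration (this snippet is mine, not part of the patch): a multiplication by a power of 2 computes the same value as the corresponding shift, which is why gimple can keep just the multiplication as the canonical form:

```c
#include <assert.h>
#include <stdint.h>

/* Gimple canonicalizes b << 3 to b * 8; both compute the same value
   wherever the shift is well-defined, so only one form needs to be
   pattern-matched in the middle end.  */
static int32_t mul_form (int32_t b) { return b * 8; }
static int32_t shift_form (int32_t b) { return b << 3; }
```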

This is a preemptive strike.  The test doesn't fail for SVE as-is,
but would after a later patch.

Tested on aarch64-linux-gnu, aarch64_be-elf and x86_64-linux-gnu.
OK to install?

Richard


2019-07-30  Richard Sandiford  <richard.sandiford@arm.com>

gcc/
	* tree-ssa-math-opts.c (convert_mult_to_fma): Reject integer
	multiplications by a power of 2 or a negative power of 2.

gcc/testsuite/
	* gcc.dg/vect/vect-cond-arith-8.c: New test.

Comments

Richard Biener July 30, 2019, 10:33 a.m. UTC | #1
On Tue, Jul 30, 2019 at 12:04 PM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> tree-ssa-math-opts supports FMA optabs for integers as well as
> floating-point types, even though there's no distinction between
> fused and unfused there.  It turns out to be pretty handy for the
> IFN_COND_* handling though, so I don't want to remove it, however
> weird it might seem.
>
> Instead this patch makes sure that we don't apply it to integer
> multiplications that are actually shifts (but that are represented
> in gimple as multiplications because that's the canonical form).
>
> This is a preemptive strike.  The test doesn't fail for SVE as-is,
> but would after a later patch.
>
> Tested on aarch64-linux-gnu, aarch64_be-elf and x86_64-linux-gnu.
> OK to install?
>
> Richard
>
>
> 2019-07-30  Richard Sandiford  <richard.sandiford@arm.com>
>
> gcc/
>         * tree-ssa-math-opts.c (convert_mult_to_fma): Reject integer
>         multiplications by a power of 2 or a negative power of 2.
>
> gcc/testsuite/
>         * gcc.dg/vect/vect-cond-arith-8.c: New test.
>
> Index: gcc/tree-ssa-math-opts.c
> ===================================================================
> --- gcc/tree-ssa-math-opts.c    2019-07-30 10:51:51.827405171 +0100
> +++ gcc/tree-ssa-math-opts.c    2019-07-30 10:52:27.327139149 +0100
> @@ -3074,10 +3074,20 @@ convert_mult_to_fma (gimple *mul_stmt, t
>        && flag_fp_contract_mode == FP_CONTRACT_OFF)
>      return false;
>
> -  /* We don't want to do bitfield reduction ops.  */
> -  if (INTEGRAL_TYPE_P (type)
> -      && (!type_has_mode_precision_p (type) || TYPE_OVERFLOW_TRAPS (type)))
> -    return false;
> +  if (ANY_INTEGRAL_TYPE_P (type))
> +    {
> +      /* We don't want to do bitfield reduction ops.  */
> +      tree itype = INTEGRAL_TYPE_P (type) ? type : TREE_TYPE (type);

you can use element_type () for this.  But I think it's easier to
leave the existing
test in-place since vector or complex types cannot have non-mode precision
components.

> +      if (!type_has_mode_precision_p (itype) || TYPE_OVERFLOW_TRAPS (itype))
> +       return false;
> +
> +      /* Don't use FMA for multiplications that are actually shifts.  */

So I question this - if the FMA can do the shift "for free" and it eventually is the
same cost/latency as an add, why do we want to not allow this (for all targets)?
Esp. for vector types I would guess an add plus a shift may be more costly
(not all ISAs implement shifts anyhow).

Can this be fixed up in the target instead?  (by a splitter or appropriate
expander?)

> +      tree val = VECTOR_TYPE_P (type) ? uniform_vector_p (op2) : op2;
> +      if (val
> +         && TREE_CODE (val) == INTEGER_CST
> +         && wi::popcount (wi::abs (wi::to_wide (val))) == 1)
> +       return false;
> +    }
>
>    /* If the target doesn't support it, don't generate it.  We assume that
>       if fma isn't available then fms, fnma or fnms are not either.  */
> Index: gcc/testsuite/gcc.dg/vect/vect-cond-arith-8.c
> ===================================================================
> --- /dev/null   2019-07-30 08:53:31.317691683 +0100
> +++ gcc/testsuite/gcc.dg/vect/vect-cond-arith-8.c       2019-07-30 10:52:27.327139149 +0100
> @@ -0,0 +1,57 @@
> +/* { dg-require-effective-target scalar_all_fma } */
> +/* { dg-additional-options "-fdump-tree-optimized -ffp-contract=fast" } */
> +
> +#include "tree-vect.h"
> +
> +#define N (VECTOR_BITS * 11 / 64 + 3)
> +
> +#define DEF(INV)                                       \
> +  void __attribute__ ((noipa))                         \
> +  f_##INV (int *restrict a, int *restrict b,           \
> +          int *restrict c)                             \
> +  {                                                    \
> +    for (int i = 0; i < N; ++i)                                \
> +      {                                                        \
> +       int mb = (INV & 1 ? -b[i] : b[i]);              \
> +       int mc = (INV & 2 ? -c[i] : c[i]);              \
> +       a[i] = b[i] < 10 ? mb * 8 + mc : 10;            \
> +      }                                                        \
> +  }
> +
> +#define TEST(INV)                                      \
> +  {                                                    \
> +    f_##INV (a, b, c);                                 \
> +    for (int i = 0; i < N; ++i)                                \
> +      {                                                        \
> +       int mb = (INV & 1 ? -b[i] : b[i]);              \
> +       int mc = (INV & 2 ? -c[i] : c[i]);              \
> +       int truev = mb * 8 + mc;                        \
> +       if (a[i] != (i % 17 < 10 ? truev : 10))         \
> +         __builtin_abort ();                           \
> +       asm volatile ("" ::: "memory");                 \
> +      }                                                        \
> +  }
> +
> +#define FOR_EACH_INV(T) \
> +  T (0) T (1) T (2) T (3)
> +
> +FOR_EACH_INV (DEF)
> +
> +int
> +main (void)
> +{
> +  int a[N], b[N], c[N];
> +  for (int i = 0; i < N; ++i)
> +    {
> +      b[i] = i % 17;
> +      c[i] = i % 13 + 14;
> +      asm volatile ("" ::: "memory");
> +    }
> +  FOR_EACH_INV (TEST)
> +  return 0;
> +}
> +
> +/* { dg-final { scan-tree-dump-not { = \.COND_FMA } "optimized" } } */
> +/* { dg-final { scan-tree-dump-not { = \.COND_FMS } "optimized" } } */
> +/* { dg-final { scan-tree-dump-not { = \.COND_FNMA } "optimized" } } */
> +/* { dg-final { scan-tree-dump-not { = \.COND_FNMS } "optimized" } } */
Richard Sandiford July 30, 2019, 10:50 a.m. UTC | #2
Richard Biener <richard.guenther@gmail.com> writes:
> On Tue, Jul 30, 2019 at 12:04 PM Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>>
>> tree-ssa-math-opts supports FMA optabs for integers as well as
>> floating-point types, even though there's no distinction between
>> fused and unfused there.  It turns out to be pretty handy for the
>> IFN_COND_* handling though, so I don't want to remove it, however
>> weird it might seem.
>>
>> Instead this patch makes sure that we don't apply it to integer
>> multiplications that are actually shifts (but that are represented
>> in gimple as multiplications because that's the canonical form).
>>
>> This is a preemptive strike.  The test doesn't fail for SVE as-is,
>> but would after a later patch.
>>
>> Tested on aarch64-linux-gnu, aarch64_be-elf and x86_64-linux-gnu.
>> OK to install?
>>
>> Richard
>>
>>
>> 2019-07-30  Richard Sandiford  <richard.sandiford@arm.com>
>>
>> gcc/
>>         * tree-ssa-math-opts.c (convert_mult_to_fma): Reject integer
>>         multiplications by a power of 2 or a negative power of 2.
>>
>> gcc/testsuite/
>>         * gcc.dg/vect/vect-cond-arith-8.c: New test.
>>
>> Index: gcc/tree-ssa-math-opts.c
>> ===================================================================
>> --- gcc/tree-ssa-math-opts.c    2019-07-30 10:51:51.827405171 +0100
>> +++ gcc/tree-ssa-math-opts.c    2019-07-30 10:52:27.327139149 +0100
>> @@ -3074,10 +3074,20 @@ convert_mult_to_fma (gimple *mul_stmt, t
>>        && flag_fp_contract_mode == FP_CONTRACT_OFF)
>>      return false;
>>
>> -  /* We don't want to do bitfield reduction ops.  */
>> -  if (INTEGRAL_TYPE_P (type)
>> -      && (!type_has_mode_precision_p (type) || TYPE_OVERFLOW_TRAPS (type)))
>> -    return false;
>> +  if (ANY_INTEGRAL_TYPE_P (type))
>> +    {
>> +      /* We don't want to do bitfield reduction ops.  */
>> +      tree itype = INTEGRAL_TYPE_P (type) ? type : TREE_TYPE (type);
>
> you can use element_type () for this.  But I think it's easier to
> leave the existing
> test in-place since vector or complex types cannot have non-mode precision
> components.

Ah, yeah.  Guess I was over-generalising here :-)

>> +      if (!type_has_mode_precision_p (itype) || TYPE_OVERFLOW_TRAPS (itype))
>> +       return false;
>> +
>> +      /* Don't use FMA for multiplications that are actually shifts.  */
>
> So I question this - if the FMA can do the shift "for free" and it eventually is the
> same cost/latency as an add, why do we want to not allow this (for all targets)?
> Esp. for vector types I would guess an add plus a shift may be more costly
> (not all ISAs implement shifts anyhow).

The shift doesn't really come for free.  By using FMA we're converting
the shift by a constant into a general multiplication.
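
A hedged C sketch of the condition the patch adds (function name is mine; only the logic mirrors the `wi::popcount (wi::abs (wi::to_wide (val))) == 1` test): a multiplier counts as "actually a shift" when its absolute value has exactly one bit set, i.e. it is a power of 2 or a negative power of 2.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Sketch of the patch's check: reject multipliers whose magnitude is
   a power of 2 (popcount of the absolute value equals 1), since those
   multiplications are really shifts in disguise.  */
static bool
mult_is_shift_p (int64_t val)
{
  /* Compute |val| in unsigned arithmetic to avoid overflow on the
     most negative value.  */
  uint64_t mag = val < 0 ? -(uint64_t) val : (uint64_t) val;
  return mag != 0 && (mag & (mag - 1)) == 0;  /* popcount (mag) == 1 */
}
```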

> Can this be fixed up in the target instead?  (by a splitter or appropriate
> expander?)

OK, I'll try to do it that way instead.

Thanks,
Richard

>
>> +      tree val = VECTOR_TYPE_P (type) ? uniform_vector_p (op2) : op2;
>> +      if (val
>> +         && TREE_CODE (val) == INTEGER_CST
>> +         && wi::popcount (wi::abs (wi::to_wide (val))) == 1)
>> +       return false;
>> +    }
>>
>>    /* If the target doesn't support it, don't generate it.  We assume that
>>       if fma isn't available then fms, fnma or fnms are not either.  */
>> Index: gcc/testsuite/gcc.dg/vect/vect-cond-arith-8.c
>> ===================================================================
>> --- /dev/null   2019-07-30 08:53:31.317691683 +0100
>> +++ gcc/testsuite/gcc.dg/vect/vect-cond-arith-8.c       2019-07-30 10:52:27.327139149 +0100
>> @@ -0,0 +1,57 @@
>> +/* { dg-require-effective-target scalar_all_fma } */
>> +/* { dg-additional-options "-fdump-tree-optimized -ffp-contract=fast" } */
>> +
>> +#include "tree-vect.h"
>> +
>> +#define N (VECTOR_BITS * 11 / 64 + 3)
>> +
>> +#define DEF(INV)                                       \
>> +  void __attribute__ ((noipa))                         \
>> +  f_##INV (int *restrict a, int *restrict b,           \
>> +          int *restrict c)                             \
>> +  {                                                    \
>> +    for (int i = 0; i < N; ++i)                                \
>> +      {                                                        \
>> +       int mb = (INV & 1 ? -b[i] : b[i]);              \
>> +       int mc = (INV & 2 ? -c[i] : c[i]);              \
>> +       a[i] = b[i] < 10 ? mb * 8 + mc : 10;            \
>> +      }                                                        \
>> +  }
>> +
>> +#define TEST(INV)                                      \
>> +  {                                                    \
>> +    f_##INV (a, b, c);                                 \
>> +    for (int i = 0; i < N; ++i)                                \
>> +      {                                                        \
>> +       int mb = (INV & 1 ? -b[i] : b[i]);              \
>> +       int mc = (INV & 2 ? -c[i] : c[i]);              \
>> +       int truev = mb * 8 + mc;                        \
>> +       if (a[i] != (i % 17 < 10 ? truev : 10))         \
>> +         __builtin_abort ();                           \
>> +       asm volatile ("" ::: "memory");                 \
>> +      }                                                        \
>> +  }
>> +
>> +#define FOR_EACH_INV(T) \
>> +  T (0) T (1) T (2) T (3)
>> +
>> +FOR_EACH_INV (DEF)
>> +
>> +int
>> +main (void)
>> +{
>> +  int a[N], b[N], c[N];
>> +  for (int i = 0; i < N; ++i)
>> +    {
>> +      b[i] = i % 17;
>> +      c[i] = i % 13 + 14;
>> +      asm volatile ("" ::: "memory");
>> +    }
>> +  FOR_EACH_INV (TEST)
>> +  return 0;
>> +}
>> +
>> +/* { dg-final { scan-tree-dump-not { = \.COND_FMA } "optimized" } } */
>> +/* { dg-final { scan-tree-dump-not { = \.COND_FMS } "optimized" } } */
>> +/* { dg-final { scan-tree-dump-not { = \.COND_FNMA } "optimized" } } */
>> +/* { dg-final { scan-tree-dump-not { = \.COND_FNMS } "optimized" } } */
Richard Biener July 30, 2019, 11 a.m. UTC | #3
On Tue, Jul 30, 2019 at 12:50 PM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Richard Biener <richard.guenther@gmail.com> writes:
> > On Tue, Jul 30, 2019 at 12:04 PM Richard Sandiford
> > <richard.sandiford@arm.com> wrote:
> >>
> >> tree-ssa-math-opts supports FMA optabs for integers as well as
> >> floating-point types, even though there's no distinction between
> >> fused and unfused there.  It turns out to be pretty handy for the
> >> IFN_COND_* handling though, so I don't want to remove it, however
> >> weird it might seem.
> >>
> >> Instead this patch makes sure that we don't apply it to integer
> >> multiplications that are actually shifts (but that are represented
> >> in gimple as multiplications because that's the canonical form).
> >>
> >> This is a preemptive strike.  The test doesn't fail for SVE as-is,
> >> but would after a later patch.
> >>
> >> Tested on aarch64-linux-gnu, aarch64_be-elf and x86_64-linux-gnu.
> >> OK to install?
> >>
> >> Richard
> >>
> >>
> >> 2019-07-30  Richard Sandiford  <richard.sandiford@arm.com>
> >>
> >> gcc/
> >>         * tree-ssa-math-opts.c (convert_mult_to_fma): Reject integer
> >>         multiplications by a power of 2 or a negative power of 2.
> >>
> >> gcc/testsuite/
> >>         * gcc.dg/vect/vect-cond-arith-8.c: New test.
> >>
> >> Index: gcc/tree-ssa-math-opts.c
> >> ===================================================================
> >> --- gcc/tree-ssa-math-opts.c    2019-07-30 10:51:51.827405171 +0100
> >> +++ gcc/tree-ssa-math-opts.c    2019-07-30 10:52:27.327139149 +0100
> >> @@ -3074,10 +3074,20 @@ convert_mult_to_fma (gimple *mul_stmt, t
> >>        && flag_fp_contract_mode == FP_CONTRACT_OFF)
> >>      return false;
> >>
> >> -  /* We don't want to do bitfield reduction ops.  */
> >> -  if (INTEGRAL_TYPE_P (type)
> >> -      && (!type_has_mode_precision_p (type) || TYPE_OVERFLOW_TRAPS (type)))
> >> -    return false;
> >> +  if (ANY_INTEGRAL_TYPE_P (type))
> >> +    {
> >> +      /* We don't want to do bitfield reduction ops.  */
> >> +      tree itype = INTEGRAL_TYPE_P (type) ? type : TREE_TYPE (type);
> >
> > you can use element_type () for this.  But I think it's easier to
> > leave the existing
> > test in-place since vector or complex types cannot have non-mode precision
> > components.
>
> Ah, yeah.  Guess I was over-generalising here :-)
>
> >> +      if (!type_has_mode_precision_p (itype) || TYPE_OVERFLOW_TRAPS (itype))
> >> +       return false;
> >> +
> >> +      /* Don't use FMA for multiplications that are actually shifts.  */
> >
> > So I question this - if the FMA can do the shift "for free" and it eventually is the
> > same cost/latency as an add, why do we want to not allow this (for all targets)?
> > Esp. for vector types I would guess an add plus a shift may be more costly
> > (not all ISAs implement shifts anyhow).
>
> The shift doesn't really come for free.  By using FMA we're converting
> the shift by a constant into a general multiplication.

Yeah, it probably "depends".  OTOH the shift may end up being a (vector)
multiply anyway due to target constraints.

I wonder how we select between (vector) shift and multiply when expanding
without FMA.  Ideally the scheduler would have a say here based on
port utilization ;)

> > Can this be fixed up in the target instead?  (by a splitter or appropriate
> > expander?)
>
> OK, I'll try to do it that way instead.
>
> Thanks,
> Richard
>
> >
> >> +      tree val = VECTOR_TYPE_P (type) ? uniform_vector_p (op2) : op2;
> >> +      if (val
> >> +         && TREE_CODE (val) == INTEGER_CST
> >> +         && wi::popcount (wi::abs (wi::to_wide (val))) == 1)
> >> +       return false;
> >> +    }
> >>
> >>    /* If the target doesn't support it, don't generate it.  We assume that
> >>       if fma isn't available then fms, fnma or fnms are not either.  */
> >> Index: gcc/testsuite/gcc.dg/vect/vect-cond-arith-8.c
> >> ===================================================================
> >> --- /dev/null   2019-07-30 08:53:31.317691683 +0100
> >> +++ gcc/testsuite/gcc.dg/vect/vect-cond-arith-8.c       2019-07-30 10:52:27.327139149 +0100
> >> @@ -0,0 +1,57 @@
> >> +/* { dg-require-effective-target scalar_all_fma } */
> >> +/* { dg-additional-options "-fdump-tree-optimized -ffp-contract=fast" } */
> >> +
> >> +#include "tree-vect.h"
> >> +
> >> +#define N (VECTOR_BITS * 11 / 64 + 3)
> >> +
> >> +#define DEF(INV)                                       \
> >> +  void __attribute__ ((noipa))                         \
> >> +  f_##INV (int *restrict a, int *restrict b,           \
> >> +          int *restrict c)                             \
> >> +  {                                                    \
> >> +    for (int i = 0; i < N; ++i)                                \
> >> +      {                                                        \
> >> +       int mb = (INV & 1 ? -b[i] : b[i]);              \
> >> +       int mc = (INV & 2 ? -c[i] : c[i]);              \
> >> +       a[i] = b[i] < 10 ? mb * 8 + mc : 10;            \
> >> +      }                                                        \
> >> +  }
> >> +
> >> +#define TEST(INV)                                      \
> >> +  {                                                    \
> >> +    f_##INV (a, b, c);                                 \
> >> +    for (int i = 0; i < N; ++i)                                \
> >> +      {                                                        \
> >> +       int mb = (INV & 1 ? -b[i] : b[i]);              \
> >> +       int mc = (INV & 2 ? -c[i] : c[i]);              \
> >> +       int truev = mb * 8 + mc;                        \
> >> +       if (a[i] != (i % 17 < 10 ? truev : 10))         \
> >> +         __builtin_abort ();                           \
> >> +       asm volatile ("" ::: "memory");                 \
> >> +      }                                                        \
> >> +  }
> >> +
> >> +#define FOR_EACH_INV(T) \
> >> +  T (0) T (1) T (2) T (3)
> >> +
> >> +FOR_EACH_INV (DEF)
> >> +
> >> +int
> >> +main (void)
> >> +{
> >> +  int a[N], b[N], c[N];
> >> +  for (int i = 0; i < N; ++i)
> >> +    {
> >> +      b[i] = i % 17;
> >> +      c[i] = i % 13 + 14;
> >> +      asm volatile ("" ::: "memory");
> >> +    }
> >> +  FOR_EACH_INV (TEST)
> >> +  return 0;
> >> +}
> >> +
> >> +/* { dg-final { scan-tree-dump-not { = \.COND_FMA } "optimized" } } */
> >> +/* { dg-final { scan-tree-dump-not { = \.COND_FMS } "optimized" } } */
> >> +/* { dg-final { scan-tree-dump-not { = \.COND_FNMA } "optimized" } } */
> >> +/* { dg-final { scan-tree-dump-not { = \.COND_FNMS } "optimized" } } */
Richard Earnshaw (lists) July 30, 2019, 12:22 p.m. UTC | #4
On 30/07/2019 11:50, Richard Sandiford wrote:
> Richard Biener <richard.guenther@gmail.com> writes:
>> On Tue, Jul 30, 2019 at 12:04 PM Richard Sandiford
>> <richard.sandiford@arm.com> wrote:
>>>
>>> tree-ssa-math-opts supports FMA optabs for integers as well as
>>> floating-point types, even though there's no distinction between
>>> fused and unfused there.  It turns out to be pretty handy for the
>>> IFN_COND_* handling though, so I don't want to remove it, however
>>> weird it might seem.
>>>
>>> Instead this patch makes sure that we don't apply it to integer
>>> multiplications that are actually shifts (but that are represented
>>> in gimple as multiplications because that's the canonical form).
>>>
>>> This is a preemptive strike.  The test doesn't fail for SVE as-is,
>>> but would after a later patch.
>>>
>>> Tested on aarch64-linux-gnu, aarch64_be-elf and x86_64-linux-gnu.
>>> OK to install?
>>>
>>> Richard
>>>
>>>
>>> 2019-07-30  Richard Sandiford  <richard.sandiford@arm.com>
>>>
>>> gcc/
>>>          * tree-ssa-math-opts.c (convert_mult_to_fma): Reject integer
>>>          multiplications by a power of 2 or a negative power of 2.
>>>
>>> gcc/testsuite/
>>>          * gcc.dg/vect/vect-cond-arith-8.c: New test.
>>>
>>> Index: gcc/tree-ssa-math-opts.c
>>> ===================================================================
>>> --- gcc/tree-ssa-math-opts.c    2019-07-30 10:51:51.827405171 +0100
>>> +++ gcc/tree-ssa-math-opts.c    2019-07-30 10:52:27.327139149 +0100
>>> @@ -3074,10 +3074,20 @@ convert_mult_to_fma (gimple *mul_stmt, t
>>>         && flag_fp_contract_mode == FP_CONTRACT_OFF)
>>>       return false;
>>>
>>> -  /* We don't want to do bitfield reduction ops.  */
>>> -  if (INTEGRAL_TYPE_P (type)
>>> -      && (!type_has_mode_precision_p (type) || TYPE_OVERFLOW_TRAPS (type)))
>>> -    return false;
>>> +  if (ANY_INTEGRAL_TYPE_P (type))
>>> +    {
>>> +      /* We don't want to do bitfield reduction ops.  */
>>> +      tree itype = INTEGRAL_TYPE_P (type) ? type : TREE_TYPE (type);
>>
>> you can use element_type () for this.  But I think it's easier to
>> leave the existing
>> test in-place since vector or complex types cannot have non-mode precision
>> components.
> 
> Ah, yeah.  Guess I was over-generalising here :-)
> 
>>> +      if (!type_has_mode_precision_p (itype) || TYPE_OVERFLOW_TRAPS (itype))
>>> +       return false;
>>> +
>>> +      /* Don't use FMA for multiplications that are actually shifts.  */
>>
>> So I question this - if the FMA can do the shift "for free" and it eventually is the
>> same cost/latency as an add, why do we want to not allow this (for all targets)?
>> Esp. for vector types I would guess an add plus a shift may be more costly
>> (not all ISAs implement shifts anyhow).
> 
> The shift doesn't really come for free.  By using FMA we're converting
> the shift by a constant into a general multiplication.

Isn't this just the problem that the madd<mode> pattern for aarch64 
doesn't accept constants that are a power of 2?  So it ends up trying to 
force the constant into a register unnecessarily.

(define_insn "madd<mode>"
   [(set (match_operand:GPI 0 "register_operand" "=r")
	(plus:GPI (mult:GPI (match_operand:GPI 1 "register_operand" "r")
			    (match_operand:GPI 2 "register_operand" "r"))
--- This should be reg_or_imm_power2, then a second alternative for the 
immediate case.

		  (match_operand:GPI 3 "register_operand" "r")))]
   ""

R.
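
A sketch of the identity behind that suggestion (helper names are mine, not from the port): a multiply-add whose multiplier is a power of 2 is just an add of a shifted operand, so an immediate alternative could typically emit a single add-with-shifted-register instead of forcing the constant into a register.

```c
#include <assert.h>
#include <stdint.h>

/* b * 2^k + c is equivalent to (b << k) + c, which AArch64 can
   usually encode as one "add x0, x3, x1, lsl #k" instruction.  */
static int64_t madd_pow2 (int64_t b, unsigned k, int64_t c)
{
  return b * ((int64_t) 1 << k) + c;
}
static int64_t add_shifted (int64_t b, unsigned k, int64_t c)
{
  return (b << k) + c;
}
```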

> 
>> Can this be fixed up in the target instead?  (by a splitter or appropriate
>> expander?)
> 
> OK, I'll try to do it that way instead.
> 
> Thanks,
> Richard
> 
>>
>>> +      tree val = VECTOR_TYPE_P (type) ? uniform_vector_p (op2) : op2;
>>> +      if (val
>>> +         && TREE_CODE (val) == INTEGER_CST
>>> +         && wi::popcount (wi::abs (wi::to_wide (val))) == 1)
>>> +       return false;
>>> +    }
>>>
>>>     /* If the target doesn't support it, don't generate it.  We assume that
>>>        if fma isn't available then fms, fnma or fnms are not either.  */
>>> Index: gcc/testsuite/gcc.dg/vect/vect-cond-arith-8.c
>>> ===================================================================
>>> --- /dev/null   2019-07-30 08:53:31.317691683 +0100
>>> +++ gcc/testsuite/gcc.dg/vect/vect-cond-arith-8.c       2019-07-30 10:52:27.327139149 +0100
>>> @@ -0,0 +1,57 @@
>>> +/* { dg-require-effective-target scalar_all_fma } */
>>> +/* { dg-additional-options "-fdump-tree-optimized -ffp-contract=fast" } */
>>> +
>>> +#include "tree-vect.h"
>>> +
>>> +#define N (VECTOR_BITS * 11 / 64 + 3)
>>> +
>>> +#define DEF(INV)                                       \
>>> +  void __attribute__ ((noipa))                         \
>>> +  f_##INV (int *restrict a, int *restrict b,           \
>>> +          int *restrict c)                             \
>>> +  {                                                    \
>>> +    for (int i = 0; i < N; ++i)                                \
>>> +      {                                                        \
>>> +       int mb = (INV & 1 ? -b[i] : b[i]);              \
>>> +       int mc = (INV & 2 ? -c[i] : c[i]);              \
>>> +       a[i] = b[i] < 10 ? mb * 8 + mc : 10;            \
>>> +      }                                                        \
>>> +  }
>>> +
>>> +#define TEST(INV)                                      \
>>> +  {                                                    \
>>> +    f_##INV (a, b, c);                                 \
>>> +    for (int i = 0; i < N; ++i)                                \
>>> +      {                                                        \
>>> +       int mb = (INV & 1 ? -b[i] : b[i]);              \
>>> +       int mc = (INV & 2 ? -c[i] : c[i]);              \
>>> +       int truev = mb * 8 + mc;                        \
>>> +       if (a[i] != (i % 17 < 10 ? truev : 10))         \
>>> +         __builtin_abort ();                           \
>>> +       asm volatile ("" ::: "memory");                 \
>>> +      }                                                        \
>>> +  }
>>> +
>>> +#define FOR_EACH_INV(T) \
>>> +  T (0) T (1) T (2) T (3)
>>> +
>>> +FOR_EACH_INV (DEF)
>>> +
>>> +int
>>> +main (void)
>>> +{
>>> +  int a[N], b[N], c[N];
>>> +  for (int i = 0; i < N; ++i)
>>> +    {
>>> +      b[i] = i % 17;
>>> +      c[i] = i % 13 + 14;
>>> +      asm volatile ("" ::: "memory");
>>> +    }
>>> +  FOR_EACH_INV (TEST)
>>> +  return 0;
>>> +}
>>> +
>>> +/* { dg-final { scan-tree-dump-not { = \.COND_FMA } "optimized" } } */
>>> +/* { dg-final { scan-tree-dump-not { = \.COND_FMS } "optimized" } } */
>>> +/* { dg-final { scan-tree-dump-not { = \.COND_FNMA } "optimized" } } */
>>> +/* { dg-final { scan-tree-dump-not { = \.COND_FNMS } "optimized" } } */
Richard Sandiford Aug. 6, 2019, 1:57 p.m. UTC | #5
"Richard Earnshaw (lists)" <Richard.Earnshaw@arm.com> writes:
> On 30/07/2019 11:50, Richard Sandiford wrote:
>> Richard Biener <richard.guenther@gmail.com> writes:
>>> On Tue, Jul 30, 2019 at 12:04 PM Richard Sandiford
>>> <richard.sandiford@arm.com> wrote:
>>>>
>>>> tree-ssa-math-opts supports FMA optabs for integers as well as
>>>> floating-point types, even though there's no distinction between
>>>> fused and unfused there.  It turns out to be pretty handy for the
>>>> IFN_COND_* handling though, so I don't want to remove it, however
>>>> weird it might seem.
>>>>
>>>> Instead this patch makes sure that we don't apply it to integer
>>>> multiplications that are actually shifts (but that are represented
>>>> in gimple as multiplications because that's the canonical form).
>>>>
>>>> This is a preemptive strike.  The test doesn't fail for SVE as-is,
>>>> but would after a later patch.
>>>>
>>>> Tested on aarch64-linux-gnu, aarch64_be-elf and x86_64-linux-gnu.
>>>> OK to install?
>>>>
>>>> Richard
>>>>
>>>>
>>>> 2019-07-30  Richard Sandiford  <richard.sandiford@arm.com>
>>>>
>>>> gcc/
>>>>          * tree-ssa-math-opts.c (convert_mult_to_fma): Reject integer
>>>>          multiplications by a power of 2 or a negative power of 2.
>>>>
>>>> gcc/testsuite/
>>>>          * gcc.dg/vect/vect-cond-arith-8.c: New test.
>>>>
>>>> Index: gcc/tree-ssa-math-opts.c
>>>> ===================================================================
>>>> --- gcc/tree-ssa-math-opts.c    2019-07-30 10:51:51.827405171 +0100
>>>> +++ gcc/tree-ssa-math-opts.c    2019-07-30 10:52:27.327139149 +0100
>>>> @@ -3074,10 +3074,20 @@ convert_mult_to_fma (gimple *mul_stmt, t
>>>>         && flag_fp_contract_mode == FP_CONTRACT_OFF)
>>>>       return false;
>>>>
>>>> -  /* We don't want to do bitfield reduction ops.  */
>>>> -  if (INTEGRAL_TYPE_P (type)
>>>> -      && (!type_has_mode_precision_p (type) || TYPE_OVERFLOW_TRAPS (type)))
>>>> -    return false;
>>>> +  if (ANY_INTEGRAL_TYPE_P (type))
>>>> +    {
>>>> +      /* We don't want to do bitfield reduction ops.  */
>>>> +      tree itype = INTEGRAL_TYPE_P (type) ? type : TREE_TYPE (type);
>>>
>>> you can use element_type () for this.  But I think it's easier to
>>> leave the existing
>>> test in-place since vector or complex types cannot have non-mode precision
>>> components.
>> 
>> Ah, yeah.  Guess I was over-generalising here :-)
>> 
>>>> +      if (!type_has_mode_precision_p (itype) || TYPE_OVERFLOW_TRAPS (itype))
>>>> +       return false;
>>>> +
>>>> +      /* Don't use FMA for multiplications that are actually shifts.  */
>>>
>>> So I question this - if the FMA can do the shift "for free" and it eventually is the
>>> same cost/latency as an add, why do we want to not allow this (for all targets)?
>>> Esp. for vector types I would guess an add plus a shift may be more costly
>>> (not all ISAs implement shifts anyhow).
>> 
>> The shift doesn't really come for free.  By using FMA we're converting
>> the shift by a constant into a general multiplication.
>
> Isn't this just the problem that the madd<mode> pattern for aarch64 
> doesn't accept constants that are a power of 2?  So it ends up trying to 
> force the constant into a register unnecessarily.
>
> (define_insn "madd<mode>"
>    [(set (match_operand:GPI 0 "register_operand" "=r")
> 	(plus:GPI (mult:GPI (match_operand:GPI 1 "register_operand" "r")
> 			    (match_operand:GPI 2 "register_operand" "r"))
> --- This should be reg_or_imm_power2, then a second alternative for the 
> immediate case.
>
> 		  (match_operand:GPI 3 "register_operand" "r")))]
>    ""

That might be a problem too (e.g. maybe for intrinsics?), but this patch
was only dealing with the way that the "fma<mode>4" group of patterns is
used.  The AArch64 port doesn't yet define fma for any integer modes AFAICT,
but I have a patch that adds them for SVE.  (And it now handles shifts
specially, as Richard suggested.)

Thanks,
Richard


>
> R.
>
>> 
>>> Can this be fixed up in the target instead?  (by a splitter or appropriate
>>> expander?)
>> 
>> OK, I'll try to do it that way instead.
>> 
>> Thanks,
>> Richard
>> 
>>>
>>>> +      tree val = VECTOR_TYPE_P (type) ? uniform_vector_p (op2) : op2;
>>>> +      if (val
>>>> +         && TREE_CODE (val) == INTEGER_CST
>>>> +         && wi::popcount (wi::abs (wi::to_wide (val))) == 1)
>>>> +       return false;
>>>> +    }
>>>>
>>>>     /* If the target doesn't support it, don't generate it.  We assume that
>>>>        if fma isn't available then fms, fnma or fnms are not either.  */
>>>> Index: gcc/testsuite/gcc.dg/vect/vect-cond-arith-8.c
>>>> ===================================================================
>>>> --- /dev/null   2019-07-30 08:53:31.317691683 +0100
>>>> +++ gcc/testsuite/gcc.dg/vect/vect-cond-arith-8.c       2019-07-30 10:52:27.327139149 +0100
>>>> @@ -0,0 +1,57 @@
>>>> +/* { dg-require-effective-target scalar_all_fma } */
>>>> +/* { dg-additional-options "-fdump-tree-optimized -ffp-contract=fast" } */
>>>> +
>>>> +#include "tree-vect.h"
>>>> +
>>>> +#define N (VECTOR_BITS * 11 / 64 + 3)
>>>> +
>>>> +#define DEF(INV)                                       \
>>>> +  void __attribute__ ((noipa))                         \
>>>> +  f_##INV (int *restrict a, int *restrict b,           \
>>>> +          int *restrict c)                             \
>>>> +  {                                                    \
>>>> +    for (int i = 0; i < N; ++i)                                \
>>>> +      {                                                        \
>>>> +       int mb = (INV & 1 ? -b[i] : b[i]);              \
>>>> +       int mc = (INV & 2 ? -c[i] : c[i]);              \
>>>> +       a[i] = b[i] < 10 ? mb * 8 + mc : 10;            \
>>>> +      }                                                        \
>>>> +  }
>>>> +
>>>> +#define TEST(INV)                                      \
>>>> +  {                                                    \
>>>> +    f_##INV (a, b, c);                                 \
>>>> +    for (int i = 0; i < N; ++i)                                \
>>>> +      {                                                        \
>>>> +       int mb = (INV & 1 ? -b[i] : b[i]);              \
>>>> +       int mc = (INV & 2 ? -c[i] : c[i]);              \
>>>> +       int truev = mb * 8 + mc;                        \
>>>> +       if (a[i] != (i % 17 < 10 ? truev : 10))         \
>>>> +         __builtin_abort ();                           \
>>>> +       asm volatile ("" ::: "memory");                 \
>>>> +      }                                                        \
>>>> +  }
>>>> +
>>>> +#define FOR_EACH_INV(T) \
>>>> +  T (0) T (1) T (2) T (3)
>>>> +
>>>> +FOR_EACH_INV (DEF)
>>>> +
>>>> +int
>>>> +main (void)
>>>> +{
>>>> +  int a[N], b[N], c[N];
>>>> +  for (int i = 0; i < N; ++i)
>>>> +    {
>>>> +      b[i] = i % 17;
>>>> +      c[i] = i % 13 + 14;
>>>> +      asm volatile ("" ::: "memory");
>>>> +    }
>>>> +  FOR_EACH_INV (TEST)
>>>> +  return 0;
>>>> +}
>>>> +
>>>> +/* { dg-final { scan-tree-dump-not { = \.COND_FMA } "optimized" } } */
>>>> +/* { dg-final { scan-tree-dump-not { = \.COND_FMS } "optimized" } } */
>>>> +/* { dg-final { scan-tree-dump-not { = \.COND_FNMA } "optimized" } } */
>>>> +/* { dg-final { scan-tree-dump-not { = \.COND_FNMS } "optimized" } } */
Patch

Index: gcc/tree-ssa-math-opts.c
===================================================================
--- gcc/tree-ssa-math-opts.c	2019-07-30 10:51:51.827405171 +0100
+++ gcc/tree-ssa-math-opts.c	2019-07-30 10:52:27.327139149 +0100
@@ -3074,10 +3074,20 @@  convert_mult_to_fma (gimple *mul_stmt, t
       && flag_fp_contract_mode == FP_CONTRACT_OFF)
     return false;
 
-  /* We don't want to do bitfield reduction ops.  */
-  if (INTEGRAL_TYPE_P (type)
-      && (!type_has_mode_precision_p (type) || TYPE_OVERFLOW_TRAPS (type)))
-    return false;
+  if (ANY_INTEGRAL_TYPE_P (type))
+    {
+      /* We don't want to do bitfield reduction ops.  */
+      tree itype = INTEGRAL_TYPE_P (type) ? type : TREE_TYPE (type);
+      if (!type_has_mode_precision_p (itype) || TYPE_OVERFLOW_TRAPS (itype))
+	return false;
+
+      /* Don't use FMA for multiplications that are actually shifts.  */
+      tree val = VECTOR_TYPE_P (type) ? uniform_vector_p (op2) : op2;
+      if (val
+	  && TREE_CODE (val) == INTEGER_CST
+	  && wi::popcount (wi::abs (wi::to_wide (val))) == 1)
+	return false;
+    }
 
   /* If the target doesn't support it, don't generate it.  We assume that
      if fma isn't available then fms, fnma or fnms are not either.  */
Index: gcc/testsuite/gcc.dg/vect/vect-cond-arith-8.c
===================================================================
--- /dev/null	2019-07-30 08:53:31.317691683 +0100
+++ gcc/testsuite/gcc.dg/vect/vect-cond-arith-8.c	2019-07-30 10:52:27.327139149 +0100
@@ -0,0 +1,57 @@ 
+/* { dg-require-effective-target scalar_all_fma } */
+/* { dg-additional-options "-fdump-tree-optimized -ffp-contract=fast" } */
+
+#include "tree-vect.h"
+
+#define N (VECTOR_BITS * 11 / 64 + 3)
+
+#define DEF(INV)					\
+  void __attribute__ ((noipa))				\
+  f_##INV (int *restrict a, int *restrict b,		\
+	   int *restrict c)				\
+  {							\
+    for (int i = 0; i < N; ++i)				\
+      {							\
+	int mb = (INV & 1 ? -b[i] : b[i]);		\
+	int mc = (INV & 2 ? -c[i] : c[i]);		\
+	a[i] = b[i] < 10 ? mb * 8 + mc : 10;		\
+      }							\
+  }
+
+#define TEST(INV)					\
+  {							\
+    f_##INV (a, b, c);					\
+    for (int i = 0; i < N; ++i)				\
+      {							\
+	int mb = (INV & 1 ? -b[i] : b[i]);		\
+	int mc = (INV & 2 ? -c[i] : c[i]);		\
+	int truev = mb * 8 + mc;			\
+	if (a[i] != (i % 17 < 10 ? truev : 10))		\
+	  __builtin_abort ();				\
+	asm volatile ("" ::: "memory");			\
+      }							\
+  }
+
+#define FOR_EACH_INV(T) \
+  T (0) T (1) T (2) T (3)
+
+FOR_EACH_INV (DEF)
+
+int
+main (void)
+{
+  int a[N], b[N], c[N];
+  for (int i = 0; i < N; ++i)
+    {
+      b[i] = i % 17;
+      c[i] = i % 13 + 14;
+      asm volatile ("" ::: "memory");
+    }
+  FOR_EACH_INV (TEST)
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-not { = \.COND_FMA } "optimized" } } */
+/* { dg-final { scan-tree-dump-not { = \.COND_FMS } "optimized" } } */
+/* { dg-final { scan-tree-dump-not { = \.COND_FNMA } "optimized" } } */
+/* { dg-final { scan-tree-dump-not { = \.COND_FNMS } "optimized" } } */