[6/8] aarch64: Tweak MLA vector costs

Message ID mpto8aebumt.fsf@arm.com
State New
Series aarch64 vector cost tweaks

Commit Message

Richard Sandiford Aug. 3, 2021, 12:06 p.m. UTC
The issue-based vector costs currently assume that a multiply-add
sequence can be implemented using a single instruction.  This is
generally true for scalars (which have a 4-operand instruction)
and SVE (which allows the output to be tied to any input).
However, for Advanced SIMD, multiplying two values and adding
an invariant will end up being a move and an MLA.
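
For example (a minimal sketch of the affected pattern; assume the
addend is the loop-invariant parameter "a" — the replies below walk
through the generated code for all three cases):

void
f (float *restrict x, float *restrict y, float *restrict z, float a)
{
  for (int i = 0; i < 100; ++i)
    x[i] = y[i] * z[i] + a;
}

The scalar code can use the four-operand fmadd, and SVE's fmad can tie
its result to one of the multiplicands, but the Advanced SIMD fmla must
overwrite the register holding its accumulator operand.  Keeping the
invariant "a" live across iterations therefore costs a MOV inside the
loop body.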

The only target to use the issue-based vector costs is Neoverse V1,
which would generally prefer SVE in this case anyway.  I therefore
don't have a self-contained testcase.  However, the distinction
becomes more important with a later patch.

gcc/
	* config/aarch64/aarch64.c (aarch64_multiply_add_p): Add a vec_flags
	parameter.  Detect cases in which an Advanced SIMD MLA would almost
	certainly require a MOV.
	(aarch64_count_ops): Update accordingly.
---
 gcc/config/aarch64/aarch64.c | 25 ++++++++++++++++++++++---
 1 file changed, 22 insertions(+), 3 deletions(-)

Comments

Richard Biener Aug. 4, 2021, 11:45 a.m. UTC | #1
On Tue, Aug 3, 2021 at 2:10 PM Richard Sandiford via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> The issue-based vector costs currently assume that a multiply-add
> sequence can be implemented using a single instruction.  This is
> generally true for scalars (which have a 4-operand instruction)
> and SVE (which allows the output to be tied to any input).
> However, for Advanced SIMD, multiplying two values and adding
> an invariant will end up being a move and an MLA.
>
> The only target to use the issue-based vector costs is Neoverse V1,
> which would generally prefer SVE in this case anyway.  I therefore
> don't have a self-contained testcase.  However, the distinction
> becomes more important with a later patch.

But we do cost any invariants separately (for the prologue), so they
should be available in a register.  Why doesn't that work?

Richard Sandiford Aug. 4, 2021, 12:14 p.m. UTC | #2
Richard Biener <richard.guenther@gmail.com> writes:
> But we do cost any invariants separately (for the prologue), so they
> should be available in a register.  Why doesn't that work?

Yeah, that works, and the prologue part is costed correctly.  But the
result of an Advanced SIMD FMLA is tied to the accumulator input, so if
the accumulator input is an invariant, we need a register move (in the
loop body) before the FMLA.

E.g. for:

void
f (float *restrict x, float *restrict y, float *restrict z, float a)
{
  for (int i = 0; i < 100; ++i)
    x[i] = y[i] * z[i];
}

the scalar code is:

.L2:
        ldr     s1, [x1, x3]
        ldr     s2, [x2, x3]
        fmadd   s1, s1, s2, s0
        str     s1, [x0, x3]
        add     x3, x3, 4
        cmp     x3, 400
        bne     .L2

the SVE code is:

.L2:
        ld1w    z1.s, p0/z, [x1, x3, lsl 2]
        ld1w    z0.s, p0/z, [x2, x3, lsl 2]
        fmad    z0.s, p1/m, z1.s, z2.s
        st1w    z0.s, p0, [x0, x3, lsl 2]
        add     x3, x3, x5
        whilelo p0.s, w3, w4
        b.any   .L2

but the Advanced SIMD code is:

.L2:
        mov     v0.16b, v3.16b   // <-------- boo
        ldr     q2, [x2, x3]
        ldr     q1, [x1, x3]
        fmla    v0.4s, v2.4s, v1.4s
        str     q0, [x0, x3]
        add     x3, x3, 16
        cmp     x3, 400
        bne     .L2

Thanks,
Richard


Richard Sandiford Aug. 4, 2021, 12:19 p.m. UTC | #3
Richard Sandiford via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> E.g. for:
>
> void
> f (float *restrict x, float *restrict y, float *restrict z, float a)
> {
>   for (int i = 0; i < 100; ++i)
>     x[i] = y[i] * z[i];

+ 1.0, not sure where that went.

> }
>
> the scalar code is:
>
> .L2:
>         ldr     s1, [x1, x3]
>         ldr     s2, [x2, x3]
>         fmadd   s1, s1, s2, s0
>         str     s1, [x0, x3]
>         add     x3, x3, 4
>         cmp     x3, 400
>         bne     .L2
>
> the SVE code is:
>
> .L2:
>         ld1w    z1.s, p0/z, [x1, x3, lsl 2]
>         ld1w    z0.s, p0/z, [x2, x3, lsl 2]
>         fmad    z0.s, p1/m, z1.s, z2.s
>         st1w    z0.s, p0, [x0, x3, lsl 2]
>         add     x3, x3, x5
>         whilelo p0.s, w3, w4
>         b.any   .L2
>
> but the Advanced SIMD code is:
>
> .L2:
>         mov     v0.16b, v3.16b   // <-------- boo
>         ldr     q2, [x2, x3]
>         ldr     q1, [x1, x3]
>         fmla    v0.4s, v2.4s, v1.4s
>         str     q0, [x0, x3]
>         add     x3, x3, 16
>         cmp     x3, 400
>         bne     .L2
>
> Thanks,
> Richard
>
>
>>
>>> gcc/
>>>         * config/aarch64/aarch64.c (aarch64_multiply_add_p): Add a vec_flags
>>>         parameter.  Detect cases in which an Advanced SIMD MLA would almost
>>>         certainly require a MOV.
>>>         (aarch64_count_ops): Update accordingly.
>>> ---
>>>  gcc/config/aarch64/aarch64.c | 25 ++++++++++++++++++++++---
>>>  1 file changed, 22 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
>>> index 084f8caa0da..19045ef6944 100644
>>> --- a/gcc/config/aarch64/aarch64.c
>>> +++ b/gcc/config/aarch64/aarch64.c
>>> @@ -14767,9 +14767,12 @@ aarch64_integer_truncation_p (stmt_vec_info stmt_info)
>>>
>>>  /* Return true if STMT_INFO is the second part of a two-statement multiply-add
>>>     or multiply-subtract sequence that might be suitable for fusing into a
>>> -   single instruction.  */
>>> +   single instruction.  If VEC_FLAGS is zero, analyze the operation as
>>> +   a scalar one, otherwise analyze it as an operation on vectors with those
>>> +   VEC_* flags.  */
>>>  static bool
>>> -aarch64_multiply_add_p (vec_info *vinfo, stmt_vec_info stmt_info)
>>> +aarch64_multiply_add_p (vec_info *vinfo, stmt_vec_info stmt_info,
>>> +                       unsigned int vec_flags)
>>>  {
>>>    gassign *assign = dyn_cast<gassign *> (stmt_info->stmt);
>>>    if (!assign)
>>> @@ -14797,6 +14800,22 @@ aarch64_multiply_add_p (vec_info *vinfo, stmt_vec_info stmt_info)
>>>        if (!rhs_assign || gimple_assign_rhs_code (rhs_assign) != MULT_EXPR)
>>>         continue;
>>>
>>> +      if (vec_flags & VEC_ADVSIMD)
>>> +       {
>>> +         /* Scalar and SVE code can tie the result to any FMLA input (or none,
>>> +            although that requires a MOVPRFX for SVE).  However, Advanced SIMD
>>> +            only supports MLA forms, so will require a move if the result
>>> +            cannot be tied to the accumulator.  The most important case in
>>> +            which this is true is when the accumulator input is invariant.  */
>>> +         rhs = gimple_op (assign, 3 - i);
>>> +         if (TREE_CODE (rhs) != SSA_NAME)
>>> +           return false;
>>> +         def_stmt_info = vinfo->lookup_def (rhs);
>>> +         if (!def_stmt_info
>>> +             || STMT_VINFO_DEF_TYPE (def_stmt_info) == vect_external_def)
>>> +           return false;
>>> +       }
>>> +
>>>        return true;
>>>      }
>>>    return false;
>>> @@ -15232,7 +15251,7 @@ aarch64_count_ops (class vec_info *vinfo, aarch64_vector_costs *costs,
>>>      }
>>>
>>>    /* Assume that multiply-adds will become a single operation.  */
>>> -  if (stmt_info && aarch64_multiply_add_p (vinfo, stmt_info))
>>> +  if (stmt_info && aarch64_multiply_add_p (vinfo, stmt_info, vec_flags))
>>>      return;
>>>
>>>    /* When costing scalar statements in vector code, the count already
diff mbox series

Patch

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 084f8caa0da..19045ef6944 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -14767,9 +14767,12 @@ aarch64_integer_truncation_p (stmt_vec_info stmt_info)
 
 /* Return true if STMT_INFO is the second part of a two-statement multiply-add
    or multiply-subtract sequence that might be suitable for fusing into a
-   single instruction.  */
+   single instruction.  If VEC_FLAGS is zero, analyze the operation as
+   a scalar one, otherwise analyze it as an operation on vectors with those
+   VEC_* flags.  */
 static bool
-aarch64_multiply_add_p (vec_info *vinfo, stmt_vec_info stmt_info)
+aarch64_multiply_add_p (vec_info *vinfo, stmt_vec_info stmt_info,
+			unsigned int vec_flags)
 {
   gassign *assign = dyn_cast<gassign *> (stmt_info->stmt);
   if (!assign)
@@ -14797,6 +14800,22 @@ aarch64_multiply_add_p (vec_info *vinfo, stmt_vec_info stmt_info)
       if (!rhs_assign || gimple_assign_rhs_code (rhs_assign) != MULT_EXPR)
 	continue;
 
+      if (vec_flags & VEC_ADVSIMD)
+	{
+	  /* Scalar and SVE code can tie the result to any FMLA input (or none,
+	     although that requires a MOVPRFX for SVE).  However, Advanced SIMD
+	     only supports MLA forms, so will require a move if the result
+	     cannot be tied to the accumulator.  The most important case in
+	     which this is true is when the accumulator input is invariant.  */
+	  rhs = gimple_op (assign, 3 - i);
+	  if (TREE_CODE (rhs) != SSA_NAME)
+	    return false;
+	  def_stmt_info = vinfo->lookup_def (rhs);
+	  if (!def_stmt_info
+	      || STMT_VINFO_DEF_TYPE (def_stmt_info) == vect_external_def)
+	    return false;
+	}
+
       return true;
     }
   return false;
@@ -15232,7 +15251,7 @@ aarch64_count_ops (class vec_info *vinfo, aarch64_vector_costs *costs,
     }
 
   /* Assume that multiply-adds will become a single operation.  */
-  if (stmt_info && aarch64_multiply_add_p (vinfo, stmt_info))
+  if (stmt_info && aarch64_multiply_add_p (vinfo, stmt_info, vec_flags))
     return;
 
   /* When costing scalar statements in vector code, the count already
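
To make the new test concrete (a sketch, not part of the patch),
consider a statement pair of the shape:

  t = b * c;   /* MULT_EXPR feeding one operand of the addition */
  r = t + a;   /* PLUS_EXPR; the other operand is the accumulator */

aarch64_multiply_add_p walks the operands of the addition looking for
the MULT_EXPR definition; for Advanced SIMD it then checks the other
operand (gimple_op (assign, 3 - i)).  If that operand is not an
SSA_NAME, or if its definition is external to the vectorized region
(vect_external_def, as for a loop-invariant "a"), the function returns
false, so aarch64_count_ops costs the multiply and the add as two
operations rather than one fused MLA.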