
Generate fused widening multiply-and-accumulate operations only when the widening multiply has a single use

Message ID 5265A43E.7060507@arm.com
State New

Commit Message

Yufeng Zhang Oct. 21, 2013, 10:01 p.m. UTC
Hi,

This patch changes the widening_mul pass to fuse the widening multiply 
with accumulate only when the multiply has a single use.  The widening_mul 
pass currently does the conversion regardless of the number of uses, 
which can cause poor code-gen in cases like the following:

typedef int ArrT [10][10];

void
foo (ArrT Arr, int Idx)
{
   Arr[Idx][Idx] = 1;
   Arr[Idx + 10][Idx] = 2;
}

On AArch64, after widening_mul, the IR is like

   _2 = (long unsigned int) Idx_1(D);
   _3 = Idx_1(D) w* 40;                           <----
   _5 = Arr_4(D) + _3;
   *_5[Idx_1(D)] = 1;
   _8 = WIDEN_MULT_PLUS_EXPR <Idx_1(D), 40, 400>; <----
   _9 = Arr_4(D) + _8;
   *_9[Idx_1(D)] = 2;

Where the arrows point, there are redundant widening multiplies.

Bootstrapped successfully on x86_64.

The patch passes the regtest on aarch64, arm and x86_64.

OK for the trunk?

Thanks,
Yufeng

p.s. Note that x86_64 doesn't suffer from this issue as the 
corresponding widening multiply accumulate op is not available on the 
target.

gcc/

         * tree-ssa-math-opts.c (convert_plusminus_to_widen): Call
         has_single_use () and do not do the conversion if it returns
         false for the multiplication result.

Comments

Andrew Stubbs Oct. 22, 2013, 8:53 a.m. UTC | #1
On 21/10/13 23:01, Yufeng Zhang wrote:
> Hi,
>
> This patch changes the widening_mul pass to fuse the widening multiply
> with accumulate only when the multiply has a single use.  The widening_mul
> pass currently does the conversion regardless of the number of uses,
> which can cause poor code-gen in cases like the following:

This seems reasonable to me, not that I have authority to approve it.

You should probably create at least one test case, ideally for both 
AArch32 and AArch64. There are already AArch32 testcases in gcc.target 
for the widening multiply cases, but as they are single-use only, you 
shouldn't need to adjust them.

Andrew
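A testcase along the lines Andrew suggests might look like the sketch 
below.  The dg- directives follow the usual GCC testsuite conventions, 
but the options and the final scan pattern are guesses of mine and 
would need checking against real AArch64 output before use:

```c
/* { dg-do compile } */
/* { dg-options "-O2" } */

typedef int ArrT [10][10];

void
foo (ArrT Arr, int Idx)
{
  Arr[Idx][Idx] = 1;
  Arr[Idx + 10][Idx] = 2;
}

/* After the fix the widening multiply should be computed once and
   shared; the exact instruction to scan for (e.g. smull vs. smaddl)
   is an assumption that would need verifying.  */
/* { dg-final { scan-assembler-times "smull" 1 } } */
```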
Richard Biener Oct. 23, 2013, 9:42 a.m. UTC | #2
On Tue, Oct 22, 2013 at 12:01 AM, Yufeng Zhang <Yufeng.Zhang@arm.com> wrote:
> Hi,
>
> This patch changes the widening_mul pass to fuse the widening multiply with
> accumulate only when the multiply has a single use.  The widening_mul pass
> currently does the conversion regardless of the number of uses, which
> can cause poor code-gen in cases like the following:
>
> typedef int ArrT [10][10];
>
> void
> foo (ArrT Arr, int Idx)
> {
>   Arr[Idx][Idx] = 1;
>   Arr[Idx + 10][Idx] = 2;
> }
>
> On AArch64, after widening_mul, the IR is like
>
>   _2 = (long unsigned int) Idx_1(D);
>   _3 = Idx_1(D) w* 40;                           <----
>   _5 = Arr_4(D) + _3;
>   *_5[Idx_1(D)] = 1;
>   _8 = WIDEN_MULT_PLUS_EXPR <Idx_1(D), 40, 400>; <----
>   _9 = Arr_4(D) + _8;
>   *_9[Idx_1(D)] = 2;
>
> Where the arrows point, there are redundant widening multiplies.
>
> Bootstrapped successfully on x86_64.
>
> The patch passes the regtest on aarch64, arm and x86_64.
>
> OK for the trunk?

       if (!is_widening_mult_p (rhs1_stmt, &type1, &mult_rhs1,
-       &type2, &mult_rhs2))
+       &type2, &mult_rhs2)
+  || !has_single_use (rhs1))

Please check has_single_use first; it's the cheaper check.

Ok with that change (and possibly a testcase).

Thanks,
Richard.



> Thanks,
> Yufeng
>
> p.s. Note that x86_64 doesn't suffer from this issue as the corresponding
> widening multiply accumulate op is not available on the target.
>
> gcc/
>
>         * tree-ssa-math-opts.c (convert_plusminus_to_widen): Call
>         has_single_use () and do not do the conversion if it returns
>         false for the multiplication result.
Richard Henderson Oct. 24, 2013, 12:29 a.m. UTC | #3
On 10/21/2013 03:01 PM, Yufeng Zhang wrote:
> 
> This patch changes the widening_mul pass to fuse the widening multiply with
> accumulate only when the multiply has a single use.  The widening_mul pass
> currently does the conversion regardless of the number of uses, which can
> cause poor code-gen in cases like the following:
> 
> typedef int ArrT [10][10];
> 
> void
> foo (ArrT Arr, int Idx)
> {
>   Arr[Idx][Idx] = 1;
>   Arr[Idx + 10][Idx] = 2;
> }
> 
> On AArch64, after widening_mul, the IR is like
> 
>   _2 = (long unsigned int) Idx_1(D);
>   _3 = Idx_1(D) w* 40;                           <----
>   _5 = Arr_4(D) + _3;
>   *_5[Idx_1(D)] = 1;
>   _8 = WIDEN_MULT_PLUS_EXPR <Idx_1(D), 40, 400>; <----
>   _9 = Arr_4(D) + _8;
>   *_9[Idx_1(D)] = 2;
> 
> Where the arrows point, there are redundant widening multiplies.

So they're redundant.  Why does this imply poor code-gen?

If a target has more than one FMA unit, then the target might
be able to issue the computation for _3 and _8 in parallel.

Even if the target only has one FMA unit, but the unit is
pipelined, the computations could overlap.


r~
Yufeng Zhang Oct. 24, 2013, 2:52 p.m. UTC | #4
On 10/24/13 01:29, Richard Henderson wrote:
> On 10/21/2013 03:01 PM, Yufeng Zhang wrote:
>>
>> This patch changes the widening_mul pass to fuse the widening multiply with
>> accumulate only when the multiply has a single use.  The widening_mul pass
>> currently does the conversion regardless of the number of uses, which can
>> cause poor code-gen in cases like the following:
>>
>> typedef int ArrT [10][10];
>>
>> void
>> foo (ArrT Arr, int Idx)
>> {
>>    Arr[Idx][Idx] = 1;
>>    Arr[Idx + 10][Idx] = 2;
>> }
>>
>> On AArch64, after widening_mul, the IR is like
>>
>>    _2 = (long unsigned int) Idx_1(D);
>>    _3 = Idx_1(D) w* 40;                           <----
>>    _5 = Arr_4(D) + _3;
>>    *_5[Idx_1(D)] = 1;
>>    _8 = WIDEN_MULT_PLUS_EXPR <Idx_1(D), 40, 400>; <----
>>    _9 = Arr_4(D) + _8;
>>    *_9[Idx_1(D)] = 2;
>>
>> Where the arrows point, there are redundant widening multiplies.
>
> So they're redundant.  Why does this imply poor code-gen?
>
> If a target has more than one FMA unit, then the target might
> be able to issue the computation for _3 and _8 in parallel.
>
> Even if the target only has one FMA unit, but the unit is
> pipelined, the computations could overlap.

Thanks for the review.

I think it is a fair point that redundancy doesn't always indicate poor 
code-gen, but there are a few reasons why I think this patch makes 
sense.

Firstly, the generated WIDEN_MULT_PLUS_EXPR can prevent other 
optimization passes from analyzing the IR sequence effectively.  As in 
the above example, the widening multiply can be part of a larger common 
sub-expression (Arr_4(D) + Idx_1(D) w* 40 + Idx_1(D) * 4); blindly 
merging the multiply with the accumulate makes recognizing the common 
sub-expression rather difficult.

Secondly, it is generally more expensive (in terms of both latency and 
energy) to multiply than to accumulate.  Even with multiple MAC units* 
or a well-pipelined unit, it is not always the case that multiple 
widening multiply-and-accumulate instructions can be scheduled 
(statically or dynamically) together.  Merged multiply-and-accumulates 
can add to the register pressure as well.  So it may be better to let 
the back end do the conversion (when the multiply has more than one 
use).

Also, isn't it a general principle that new common sub-expressions (the 
widening multiply in this case) should not be created in the gimple IR 
when there is no obvious benefit?  I can see that floating-point 
multiply-and-accumulate may be a different case: on one hand, the 
arithmetic is usually pure data-processing rather than serving other 
purposes like address calculation (as its integer peers may do); on the 
other hand, on micro-architectures with more FMA units than FADD units, 
it probably makes more sense to generate more FMA instructions in order 
to exploit the available throughput.

This patch only tackles the integer widening multiply-and-accumulate, 
and it doesn't seem beneficial to me to merge the widening multiply 
with the accumulate so aggressively.  You could argue that other 
optimization passes should instead be extended to handle 
WIDEN_MULT_PLUS_EXPR and its friends; while that is an option I'm 
considering, it is more likely to be a longer-term solution.

Regards,
Yufeng

*) I think I abused the word 'fused' in my previous emails.  'Fused' 
more often refers to a floating-point multiply-and-accumulate performed 
with a single rounding.

Patch

diff --git a/gcc/tree-ssa-math-opts.c b/gcc/tree-ssa-math-opts.c
index f7f8ec9..d316990 100644
--- a/gcc/tree-ssa-math-opts.c
+++ b/gcc/tree-ssa-math-opts.c
@@ -2425,12 +2425,16 @@  convert_plusminus_to_widen (gimple_stmt_iterator *gsi, gimple stmt,
 
      It might also appear that it would be sufficient to use the existing
      operands of the widening multiply, but that would limit the choice of
-     multiply-and-accumulate instructions.  */
+     multiply-and-accumulate instructions.
+
+     If the widened-multiplication result has more than one use, it is
+     probably wiser not to do the conversion.  */
   if (code == PLUS_EXPR
       && (rhs1_code == MULT_EXPR || rhs1_code == WIDEN_MULT_EXPR))
     {
       if (!is_widening_mult_p (rhs1_stmt, &type1, &mult_rhs1,
-			       &type2, &mult_rhs2))
+			       &type2, &mult_rhs2)
+	  || !has_single_use (rhs1))
 	return false;
       add_rhs = rhs2;
       conv_stmt = conv1_stmt;
@@ -2438,7 +2442,8 @@  convert_plusminus_to_widen (gimple_stmt_iterator *gsi, gimple stmt,
   else if (rhs2_code == MULT_EXPR || rhs2_code == WIDEN_MULT_EXPR)
     {
       if (!is_widening_mult_p (rhs2_stmt, &type1, &mult_rhs1,
-			       &type2, &mult_rhs2))
+			       &type2, &mult_rhs2)
+	  || !has_single_use (rhs2))
 	return false;
       add_rhs = rhs1;
       conv_stmt = conv2_stmt;