Message ID: 5265A43E.7060507@arm.com
State: New
On 21/10/13 23:01, Yufeng Zhang wrote:
> Hi,
>
> This patch changes the widening_mul pass to fuse the widening multiply
> with accumulate only when the multiply has single use. The widening_mul
> pass currently does the conversion regardless of the number of the uses,
> which can cause poor code-gen in cases like the following:

This seems reasonable to me, not that I have the authority to approve it.

You should probably create at least one test case, ideally for both
AArch32 and AArch64.  There are already AArch32 test cases in gcc.target
for the widening-multiply cases, but as they are single-use only, you
shouldn't need to adjust them.

Andrew
On Tue, Oct 22, 2013 at 12:01 AM, Yufeng Zhang <Yufeng.Zhang@arm.com> wrote:
> Hi,
>
> This patch changes the widening_mul pass to fuse the widening multiply with
> accumulate only when the multiply has single use. The widening_mul pass
> currently does the conversion regardless of the number of the uses, which
> can cause poor code-gen in cases like the following:
>
> typedef int ArrT [10][10];
>
> void
> foo (ArrT Arr, int Idx)
> {
>   Arr[Idx][Idx] = 1;
>   Arr[Idx + 10][Idx] = 2;
> }
>
> On AArch64, after widening_mul, the IR is like
>
>   _2 = (long unsigned int) Idx_1(D);
>   _3 = Idx_1(D) w* 40;                            <----
>   _5 = Arr_4(D) + _3;
>   *_5[Idx_1(D)] = 1;
>   _8 = WIDEN_MULT_PLUS_EXPR <Idx_1(D), 40, 400>;  <----
>   _9 = Arr_4(D) + _8;
>   *_9[Idx_1(D)] = 2;
>
> Where the arrows point, there are redundant widening multiplies.
>
> Bootstrap successfully on x86_64.
>
> The patch passes the regtest on aarch64, arm and x86_64.
>
> OK for the trunk?

       if (!is_widening_mult_p (rhs1_stmt, &type1, &mult_rhs1,
-                               &type2, &mult_rhs2))
+                               &type2, &mult_rhs2)
+          || !has_single_use (rhs1))

Please check has_single_use first; it's the cheaper check.

Ok with that change (and possibly a testcase).

Thanks,
Richard.

> Thanks,
> Yufeng
>
> p.s. Note that x86_64 doesn't suffer from this issue as the corresponding
> widening multiply accumulate op is not available on the target.
>
> gcc/
>
>         * tree-ssa-math-opts.c (convert_plusminus_to_widen): Call
>         has_single_use () and not do the conversion if has_single_use ()
>         returns false for the multiplication result.
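[Editor's note: Richard's remark that has_single_use should be checked first relies on C's short-circuit evaluation of `||` — when the first operand is already true, the second is never evaluated, so putting the cheap predicate first skips the costly analysis in the common rejection case. A minimal sketch of that ordering; the function names here are hypothetical stand-ins, not GCC internals:]

```c
#include <assert.h>
#include <stdbool.h>

static int expensive_calls;   /* how often the costly analysis ran */

/* Hypothetical stand-in for a cheap predicate such as
   has_single_use (): just a flag lookup.  */
static bool cheap_test_fails (bool fails)
{
  return fails;
}

/* Hypothetical stand-in for a costly predicate such as
   !is_widening_mult_p (); it also counts its invocations.  */
static bool expensive_test_fails (bool fails)
{
  expensive_calls++;
  return fails;
}

/* With the cheap test first, || short-circuits: whenever the cheap
   test already fails, the expensive analysis never runs.  */
static bool reject_conversion (bool cheap_fails, bool costly_fails)
{
  return cheap_test_fails (cheap_fails)
         || expensive_test_fails (costly_fails);
}
```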
On 10/21/2013 03:01 PM, Yufeng Zhang wrote:
>
> This patch changes the widening_mul pass to fuse the widening multiply with
> accumulate only when the multiply has single use. The widening_mul pass
> currently does the conversion regardless of the number of the uses, which can
> cause poor code-gen in cases like the following:
>
> typedef int ArrT [10][10];
>
> void
> foo (ArrT Arr, int Idx)
> {
>   Arr[Idx][Idx] = 1;
>   Arr[Idx + 10][Idx] = 2;
> }
>
> On AArch64, after widening_mul, the IR is like
>
>   _2 = (long unsigned int) Idx_1(D);
>   _3 = Idx_1(D) w* 40;                            <----
>   _5 = Arr_4(D) + _3;
>   *_5[Idx_1(D)] = 1;
>   _8 = WIDEN_MULT_PLUS_EXPR <Idx_1(D), 40, 400>;  <----
>   _9 = Arr_4(D) + _8;
>   *_9[Idx_1(D)] = 2;
>
> Where the arrows point, there are redundant widening multiplies.

So they're redundant.  Why does this imply poor code-gen?

If a target has more than one FMA unit, then the target might
be able to issue the computation for _3 and _8 in parallel.

Even if the target only has one FMA unit, but the unit is
pipelined, the computations could overlap.

r~
On 10/24/13 01:29, Richard Henderson wrote:
> On 10/21/2013 03:01 PM, Yufeng Zhang wrote:
>>
>> This patch changes the widening_mul pass to fuse the widening multiply with
>> accumulate only when the multiply has single use. The widening_mul pass
>> currently does the conversion regardless of the number of the uses, which can
>> cause poor code-gen in cases like the following:
>>
>> typedef int ArrT [10][10];
>>
>> void
>> foo (ArrT Arr, int Idx)
>> {
>>   Arr[Idx][Idx] = 1;
>>   Arr[Idx + 10][Idx] = 2;
>> }
>>
>> On AArch64, after widening_mul, the IR is like
>>
>>   _2 = (long unsigned int) Idx_1(D);
>>   _3 = Idx_1(D) w* 40;                            <----
>>   _5 = Arr_4(D) + _3;
>>   *_5[Idx_1(D)] = 1;
>>   _8 = WIDEN_MULT_PLUS_EXPR <Idx_1(D), 40, 400>;  <----
>>   _9 = Arr_4(D) + _8;
>>   *_9[Idx_1(D)] = 2;
>>
>> Where the arrows point, there are redundant widening multiplies.
>
> So they're redundant.  Why does this imply poor code-gen?
>
> If a target has more than one FMA unit, then the target might
> be able to issue the computation for _3 and _8 in parallel.
>
> Even if the target only has one FMA unit, but the unit is
> pipelined, the computations could overlap.

Thanks for the review.

I think it is a fair point that redundancy doesn't always indicate poor
code-gen, but there are a few reasons why I think this patch makes sense.

Firstly, the generated WIDEN_MULT_PLUS_EXPR can prevent other
optimization passes from analyzing the IR sequence effectively.  In the
above example, the widening multiply can be part of a larger common
sub-expression (Arr_4(D) + Idx_1(D) w* 40 + Idx_1(D) * 4); by blindly
merging the multiply with the accumulate, the pass makes the recognition
of the common sub-expression rather difficult.

Secondly, it is generally more expensive (in terms of both latency and
energy) to multiply than to accumulate.  Even when there are multiple
MAC units* or a well-pipelined one, it is not always the case that
multiple widening multiply-and-accumulate instructions can be scheduled
(statically or dynamically) together.  A merged multiply-and-accumulate
can add to the register pressure as well.  So maybe it is better to let
the back end do the conversion (when the multiply has more uses).

Also, isn't it a general principle that a new common sub-expression (the
widening multiply in this case) shall not be created in the gimple IR
when there is no obvious benefit?

I can see that it may be a different case for floating-point
multiply-and-accumulate: on one hand, the arithmetic is usually pure
data-processing rather than serving other purposes such as address
calculation (as its integer peers may do); on the other hand, on
micro-architectures with more FMA units than FADD units, it probably
makes more sense to generate more FMA instructions in order to take
advantage of the throughput capacity.

The area this patch tries to tackle is only the integer widening
multiply-and-accumulate, and it doesn't seem beneficial to me to merge
the widening multiply with the accumulate so aggressively.  One could
argue that other optimization passes should be extended to handle
WIDEN_MULT_PLUS_EXPR and its friends; while that is an option I'm
considering, it is more likely to be a longer-term solution.

Regards,
Yufeng

*) I think I had abused the word 'fused' in my previous emails.  It
seems that 'fused' is more often used to refer to a floating-point
multiply-and-accumulate with a single rounding.
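[Editor's note: the common-sub-expression point can be checked by hand. For ArrT = int[10][10], the store to Arr[Idx][Idx] is at byte offset Idx*40 + Idx*4 = Idx*44, and the store to Arr[Idx + 10][Idx] is at that same offset plus the constant 400 — so one multiply suffices for both addresses. A sketch of that hand-CSE'd form; foo_cse is a hypothetical name, not part of the patch:]

```c
#include <assert.h>
#include <string.h>

typedef int ArrT[10][10];   /* from the example in the thread */

/* Original form, as in the patch description.  */
void foo (ArrT Arr, int Idx)
{
  Arr[Idx][Idx] = 1;
  Arr[Idx + 10][Idx] = 2;
}

/* Hand-CSE'd form: both addresses share the sub-expression
   (char *) Arr + Idx * 44 (row offset Idx * 40 plus column offset
   Idx * 4); the second store only adds the constant 400.  */
void foo_cse (ArrT Arr, int Idx)
{
  char *base = (char *) Arr + (long) Idx * 44;
  *(int *) base = 1;            /* Arr[Idx][Idx] */
  *(int *) (base + 400) = 2;    /* Arr[Idx + 10][Idx] */
}
```

With the sharing made explicit, only one multiply remains and no widening multiply-and-accumulate is needed at all — which is the kind of opportunity a prematurely formed WIDEN_MULT_PLUS_EXPR can hide.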
diff --git a/gcc/tree-ssa-math-opts.c b/gcc/tree-ssa-math-opts.c
index f7f8ec9..d316990 100644
--- a/gcc/tree-ssa-math-opts.c
+++ b/gcc/tree-ssa-math-opts.c
@@ -2425,12 +2425,16 @@ convert_plusminus_to_widen (gimple_stmt_iterator *gsi, gimple stmt,
 
      It might also appear that it would be sufficient to use the existing
      operands of the widening multiply, but that would limit the choice of
-     multiply-and-accumulate instructions.  */
+     multiply-and-accumulate instructions.
+
+     If the widened-multiplication result has more than one uses, it is
+     probably wiser not to do the conversion.  */
 
   if (code == PLUS_EXPR
       && (rhs1_code == MULT_EXPR || rhs1_code == WIDEN_MULT_EXPR))
     {
       if (!is_widening_mult_p (rhs1_stmt, &type1, &mult_rhs1,
-                               &type2, &mult_rhs2))
+                               &type2, &mult_rhs2)
+          || !has_single_use (rhs1))
        return false;
       add_rhs = rhs2;
       conv_stmt = conv1_stmt;
@@ -2438,7 +2442,8 @@ convert_plusminus_to_widen (gimple_stmt_iterator *gsi, gimple stmt,
   else if (rhs2_code == MULT_EXPR || rhs2_code == WIDEN_MULT_EXPR)
     {
       if (!is_widening_mult_p (rhs2_stmt, &type1, &mult_rhs1,
-                               &type2, &mult_rhs2))
+                               &type2, &mult_rhs2)
+          || !has_single_use (rhs2))
        return false;
       add_rhs = rhs1;
       conv_stmt = conv2_stmt;
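[Editor's note: the condition the patch adds can be modeled in miniature. In this toy sketch an SSA value merely records its use count, and the conversion bails out unless the multiply result is single-use; the type and function names are illustrative, not GCC's real data structures:]

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for an SSA name: all we track is its use count.  */
struct ssa_value { int num_uses; };

/* Toy version of has_single_use ().  */
static bool has_single_use_toy (const struct ssa_value *v)
{
  return v->num_uses == 1;
}

/* Mirrors the shape of the patched check in
   convert_plusminus_to_widen: refuse to fuse when the pattern
   check fails or the product has other uses.  */
static bool try_convert_to_widen_mult_plus (bool is_widening_mult,
                                            const struct ssa_value *product)
{
  if (!is_widening_mult || !has_single_use_toy (product))
    return false;   /* keep the multiply and the add separate */
  return true;      /* fuse into a WIDEN_MULT_PLUS_EXPR */
}
```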