
[tree-ssa] PR54295 Incorrect value extension in widening multiply-accumulate

Message ID 502E4F54.9040309@arm.com
State New

Commit Message

Richard Earnshaw Aug. 17, 2012, 2:04 p.m. UTC
PR54295 shows that widening multiply-accumulate operations can end up
with incorrect code.  The path to the failure is as follows:

1) Compiler detects a widening multiply operation and generates the
correct widening-multiply operation.  This involves stripping off the
widening cast to leave the underlying operands.

That is

	X = (c1) a * (c2) b
=>
	X = a w* b

the type of the w* operation is picked based on the types of a and b: if
both are signed, a signed multiply-extend operation is used; if both are
unsigned, an unsigned multiply-extend is used.  If they differ, a
signed/unsigned multiply is used, if it exists.  If either a or b is a
positive constant it is coerced, if possible, into the type of the other
operand.
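
For illustration (a hypothetical example, not taken from the PR), a
source pattern that produces such a widening multiply on a target with
a 32x32->64 multiply instruction would be:

	/* The explicit widening casts are stripped and the multiply
	   becomes a w* b; a and b keep their 32-bit types.  */
	long long wmul (int a, int b)
	{
	  return (long long) a * (long long) b;
	}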

2) The compiler then notices that the result, X is used in an addition
operation

	Y = X + d

As X is a widening multiply, the compiler then looks inside it to try
to generate a widening-multiply-accumulate operation.  However, it
passes the rewritten expression (X = a w* b) to is_widening_mult_p, and
that routine does not correctly check what type of multiply is being
done: the code blindly tries to strip off any conversion operations.

The failure happens when a is also the result of a cast operation, for
example, being cast from an unsigned to an int.  The compiler then
re-formulates a signed multiply-extend into an unsigned multiply-extend.
That is, if
	a = (int) e	// typeof(e) == unsigned

Then instead of
	Y = WIDEN_MULT_PLUS (a, b, d)
we end up with
	Y = WIDEN_MULT_PLUS (e, b, d)
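
To see concretely why this is wrong (values chosen for illustration;
the new test below exercises the same pattern): if e == 0xffffffff,
then a == -1, and the correct signed widening multiply-accumulate
computes

	Y = d + (long long) -1 * b	/* == d - b */

whereas the rewritten form computes

	Y = d + (unsigned long long) 0xffffffff * b

which differs whenever b != 0.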

The fix is to make is_widening_mult_p note that it has been called with
a WIDEN_MULT_EXPR and rather than decompose the operands again, to
simply extract the existing operands, which have already been formulated
correctly for a widening multiply operation.

	PR tree-ssa/54295
	* tree-ssa-math-opts.c (is_widening_mult_p): Short-circuit when
	the stmt is already a widening multiply.

Testsuite

	PR tree-ssa/54295
	* gcc.c-torture/execute/20120817-1.c: New test.

OK to commit once testing has completed?

Comments

Andrew Stubbs Aug. 17, 2012, 2:22 p.m. UTC | #1
On 17/08/12 15:04, Richard Earnshaw wrote:
> The fix is to make is_widening_mult_p note that it has been called with
> a WIDEN_MULT_EXPR and rather than decompose the operands again, to
> simply extract the existing operands, which have already been formulated
> correctly for a widening multiply operation.

As long as the existing test cases work, I think the only problem with 
this idea is if some architecture has a wider range of widening 
multiply-and-accumulate than it does plain widening multiply.

If no such architecture exists then this is fine with me, for whatever 
that's worth. IIRC, ARM is one of only two architectures (in GCC) with 
widening multiplies that widen to more than twice the operand width, and, last I
looked, the only one that has the patterns to use this code, so it's 
probably safe.

Andrew
Richard Earnshaw Aug. 17, 2012, 2:31 p.m. UTC | #2
On 17/08/12 15:22, Andrew Stubbs wrote:
> On 17/08/12 15:04, Richard Earnshaw wrote:
>> The fix is to make is_widening_mult_p note that it has been called with
>> a WIDEN_MULT_EXPR and rather than decompose the operands again, to
>> simply extract the existing operands, which have already been formulated
>> correctly for a widening multiply operation.
> 
> As long as the existing test cases work, I think the only problem with 
> this idea is if some architecture has a wider range of widening 
> multiply-and-accumulate than it does plain widening multiply.

But surely in that case step1 wouldn't have applied and we'd then be
looking at a MULT_EXPR not a WIDEN_MULT_EXPR.

R.
Andrew Stubbs Aug. 17, 2012, 2:39 p.m. UTC | #3
On 17/08/12 15:31, Richard Earnshaw wrote:
> On 17/08/12 15:22, Andrew Stubbs wrote:
>> On 17/08/12 15:04, Richard Earnshaw wrote:
>>> The fix is to make is_widening_mult_p note that it has been called with
>>> a WIDEN_MULT_EXPR and rather than decompose the operands again, to
>>> simply extract the existing operands, which have already been formulated
>>> correctly for a widening multiply operation.
>>
>> As long as the existing test cases work, I think the only problem with
>> this idea is if some architecture has a wider range of widening
>> multiply-and-accumulate than it does plain widening multiply.
>
> But surely in that case step1 wouldn't have applied and we'd then be
> looking at a MULT_EXPR not a WIDEN_MULT_EXPR.

Not necessarily.

For example, it will use a 32x32->64 signed widening multiply for 16-bit
unsigned data if there is no unsigned 16x16->64 option.  Hypothetically,
if there happened to be an unsigned 16x16->64 multiply-and-accumulate 
then you'd end up masking it.

Like I said though, cross that bridge if it ever comes to it.

Andrew
Richard Earnshaw Aug. 17, 2012, 2:47 p.m. UTC | #4
On 17/08/12 15:39, Andrew Stubbs wrote:
> On 17/08/12 15:31, Richard Earnshaw wrote:
>> On 17/08/12 15:22, Andrew Stubbs wrote:
>>> On 17/08/12 15:04, Richard Earnshaw wrote:
>>>> The fix is to make is_widening_mult_p note that it has been called with
>>>> a WIDEN_MULT_EXPR and rather than decompose the operands again, to
>>>> simply extract the existing operands, which have already been formulated
>>>> correctly for a widening multiply operation.
>>>
>>> As long as the existing test cases work, I think the only problem with
>>> this idea is if some architecture has a wider range of widening
>>> multiply-and-accumulate than it does plain widening multiply.
>>
>> But surely in that case step1 wouldn't have applied and we'd then be
>> looking at a MULT_EXPR not a WIDEN_MULT_EXPR.
> 
> Not necessarily.
> 
> For example, it will use a 32x32->64 signed widening multiply for 16 bit 
> unsigned data if there is no unsigned 16x16->64 option. Hypothetically, 
> if there happened to be an unsigned 16x16->64 multiply-and-accumulate 
> then you'd end up masking it.
> 
> Like I said though, cross that bridge if it ever comes to it.
> 
> Andrew
> 

You've lost me.

If we don't have a 16x16->64 mult operation then after step 1 we'll
still have a MULT_EXPR, not a WIDEN_MULT_EXPR, so when we reach step2
there's nothing to short circuit.

Unless, of course, you're expecting us to get

step1 -> 16x16->32 widen mult
step2 -> widen64(step1) + acc64

But even this is only safe iff widen64 is the same type of widening as
the original widening step, and we determine that by looking at the
operands of the widening mult, not at any casts on them.

The key thing is that the type of multiply that we want is based on the
types of the operands, not on the type of the result.  So it's essential
we don't strip type conversions beyond the widening conversion.

R.
Andrew Stubbs Aug. 17, 2012, 3:06 p.m. UTC | #5
On 17/08/12 15:47, Richard Earnshaw wrote:
> If we don't have a 16x16->64 mult operation then after step 1 we'll
> still have a MULT_EXPR, not a WIDEN_MULT_EXPR, so when we reach step2
> there's nothing to short circuit.
>
> Unless, of course, you're expecting us to get
>
> step1 -> 16x16->32 widen mult
> step2 -> widen64(step1) + acc64

No, given a u16xu16->u64 operation in the code, and that the arch 
doesn't have such an opcode, I'd expect to get

step1 -> (u32)u16 x (u32)u16 -> u64

Likewise, 8x8->32 might give (16)8x(16)8->32.

The code can't see that the widening operation is non-optimal without
looking beyond it into its inputs.

Andrew
Richard Earnshaw Aug. 17, 2012, 3:20 p.m. UTC | #6
On 17/08/12 16:06, Andrew Stubbs wrote:
> On 17/08/12 15:47, Richard Earnshaw wrote:
>> If we don't have a 16x16->64 mult operation then after step 1 we'll
>> still have a MULT_EXPR, not a WIDEN_MULT_EXPR, so when we reach step2
>> there's nothing to short circuit.
>>
>> Unless, of course, you're expecting us to get
>>
>> step1 -> 16x16->32 widen mult
>> step2 -> widen64(step1) + acc64
> 
> No, given a u16xu16->u64 operation in the code, and that the arch 
> doesn't have such an opcode, I'd expect to get
> 
> step1 -> (u32)u16 x (u32)u16 -> u64

Hmm, I would have thought that would be more costly than

	(u64)(u16 x u16 -> u32)

> 
> Likewise, 8x8->32 might give (16)8x(16)8->32.
> 
> The code can't see that the widening operation is non-optimal without 
> looking beyond into its inputs.

Ok, in which case we have to give is_widening_mult_rhs_p enough smarts
to not strip

	(s32)u32

and return u32.

I'll have another think about it.

R.
Andrew Stubbs Aug. 17, 2012, 6:57 p.m. UTC | #7
On 17/08/12 16:20, Richard Earnshaw wrote:
>> No, given a u16xu16->u64 operation in the code, and that the arch
>> doesn't have such an opcode, I'd expect to get
>>
>> step1 -> (u32)u16 x (u32)u16 -> u64
>
> Hmm, I would have thought that would be more costly than
>
> 	(u64)(u16 x u16 -> u32)

You might be right, but then extends are often free, especially with 
unsigned types, so it's hard to say for sure.

Did you reproduce one? It's a long time since I last looked at this 
stuff, so I could be confused.

Andrew

Patch

--- tree-ssa-math-opts.c	(revision 190462)
+++ tree-ssa-math-opts.c	(local)
@@ -2039,6 +2039,25 @@  is_widening_mult_p (gimple stmt,
       && TREE_CODE (type) != FIXED_POINT_TYPE)
     return false;
 
+  /* If we already have a widening multiply expression, simply extract the
+     operands.  */
+  if (gimple_assign_rhs_code (stmt) == WIDEN_MULT_EXPR)
+    {
+      *rhs1_out = gimple_assign_rhs1 (stmt);
+      *rhs2_out = gimple_assign_rhs2 (stmt);
+      if (TREE_CODE (*rhs1_out) == INTEGER_CST)
+	*type1_out = TREE_TYPE (*rhs2_out);
+      else
+	*type1_out = TREE_TYPE (*rhs1_out);
+
+      if (TREE_CODE (*rhs2_out) == INTEGER_CST)
+	*type2_out = TREE_TYPE (*rhs1_out);
+      else
+	*type2_out = TREE_TYPE (*rhs2_out);
+
+      return true;
+    }
+
   if (!is_widening_mult_rhs_p (type, gimple_assign_rhs1 (stmt), type1_out,
 			       rhs1_out))
     return false;
Index: 20120817-1.c
===================================================================
--- 20120817-1.c	(revision 0)
+++ 20120817-1.c	(revision 0)
@@ -0,0 +1,14 @@ 
+typedef unsigned long long u64;
+unsigned long foo = 0;
+u64 f() __attribute__((noinline));
+
+u64 f() {
+  return ((u64)40) + ((u64) 24) * (int)(foo - 1);
+}
+
+int main ()
+{
+  if (f () != 16)
+    abort ();
+  exit (0);
+}
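
For reference, the test's expected value works out as follows: foo is 0,
so on a typical 32-bit target foo - 1 is 0xffffffff and (int)(foo - 1)
is -1, giving

	f () = 40 + 24 * -1 = 16

With the bug present, the multiply operand is instead taken as the
unsigned value 0xffffffff, so the widening multiply-accumulate computes
40 + 24 * 0xffffffffULL, which is not 16.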