diff mbox

[ARM] Fix 16-bit -> 64-bit multiply and accumulate

Message ID 4D01018F.3020108@codesourcery.com
State New
Headers show

Commit Message

Andrew Stubbs Dec. 9, 2010, 4:19 p.m. UTC
The attached patch fixes a bug that prevented smlalbb being generated, 
even though it was present in the machine description.

The problem was that the MULT portion of the maddhidi4 pattern did not 
match the mulhisi3 pattern, and so combine could not build matching 
patterns.

My patch simply adjusts the (internal) modes and extends in the 
maddhidi4 pattern so that they match. The externally visible modes 
remain unaltered.

Test case:

   long long foolong (long long x, short *a, short *b)
   {
       return x + *a * *b;
   }

Before, compiled with "-O2 -mcpu=cortex-a8 -mthumb":

           ldrh    r2, [r2, #0]
           ldrh    r3, [r3, #0]
           smulbb  r3, r2, r3
           adds    r0, r0, r3
           adc     r1, r1, r3, asr #31
           bx      lr

The compiler is forced to do the multiply and addition operation separately.

After:
           ldrh    r2, [r2, #0]
           ldrh    r3, [r3, #0]
           smlalbb r0, r1, r2, r3
           bx      lr


OK to commit, once stage 1 opens again?

Andrew

Comments

Ramana Radhakrishnan Jan. 27, 2011, 1:32 a.m. UTC | #1
Hi Andrew,



>
> OK to commit, once stage 1 opens again?

I can't approve or reject your patch but is there any reason why this
shouldn't be covered under the same rules as the smulbb patch that you
submitted around the same time ?

http://gcc.gnu.org/ml/gcc-patches/2010-12/msg00772.html

Ramana
Andrew Stubbs Jan. 27, 2011, 9:47 a.m. UTC | #2
On 27/01/11 01:32, Ramana Radhakrishnan wrote:
> I can't approve or reject your patch but is there any reason why this
> shouldn't be covered under the same rules as the smulbb patch that you
> submitted around the same time ?
>
> http://gcc.gnu.org/ml/gcc-patches/2010-12/msg00772.html

Only that that was stage 3, and we're now in stage 4, I think?

Otherwise it's exactly the same sort of issue.

Andrew
Richard Earnshaw Jan. 27, 2011, 6:40 p.m. UTC | #3
On Thu, 2010-12-09 at 16:19 +0000, Andrew Stubbs wrote:
> The attached patch fixes a bug that prevented smlalbb being generated, 
> even though it was present in the machine description.
> 
> The problem was that the MULT portion of the maddhidi4 pattern did not 
> match the mulhisi3 pattern, and so combine could not build matching 
> patterns.
> 
> My patch simply adjusts the (internal) modes and extends in the 
> maddhidi4 pattern so that they match. The externally visible modes 
> remain unaltered.
> 
> Test case:
> 
>    long long foolong (long long x, short *a, short *b)
>    {
>        return x + *a * *b;
>    }
> 
> Before, compiled with "-O2 -mcpu=cortex-a8 -mthumb":
> 
>            ldrh    r2, [r2, #0]
>            ldrh    r3, [r3, #0]
>            smulbb  r3, r2, r3
>            adds    r0, r0, r3
>            adc     r1, r1, r3, asr #31
>            bx      lr
> 
> The compiler is forced to do the multiply and addition operation separately.
> 
> After:
>            ldrh    r2, [r2, #0]
>            ldrh    r3, [r3, #0]
>            smlalbb r0, r1, r2, r3
>            bx      lr
> 
> 
> OK to commit, once stage 1 opens again?
> 
> Andrew

I'm worried about this change.  The proposed RTL might be what GCC
generates for this particular case, but the pattern hardly looks
canonical: a sign-extend of a multiply that contained sign-extended
operands.

Are you sure that fixing this particular case doesn't introduce a
regression elsewhere?

Alternatively, is there some part of the manual that describes the form
you have as the correct canonical form?

R.
Andrew Stubbs Jan. 28, 2011, 10:07 a.m. UTC | #4
On 27/01/11 18:40, Richard Earnshaw wrote:
> I'm worried about this change.  The proposed RTL might be what GCC
> generates for this particular case, but the pattern hardly looks
> canonical: a sign-extend of a multiply that contained sign-extended
> operands.
>
> Are you sure that fixing this particular case doesn't introduce a
> regression elsewhere?
>
> Alternatively, is there some part of the manual that describes the form
> you have as the correct canonical form?

I don't think I said it was the canonical form (although the other 
similar smulbb patch was about that), but if it weren't in the canonical 
form then combine would not work (which it does with my patch), so I 
don't believe that's the problem here.

Anyway, combine only canonicalizes the *order* or the expressions, not 
the modes, which is what I'm adjusting here. Or have I missed something?

The problem here is that the expand pass used mulhisi3 which has the 
'internal' modes set one way, and then combine tries to insert that into 
a multiply-and-accumulate form, but recog rejects it because the 
'internal' modes of maddhidi4 are set a different way.

Now, mulhisi3 is currently written using MULT:SI mode, and I can't see 
how this could be any other way - it's an SI mode insn. If we want to be 
able to insert that into maddhidi4, and we do, I don't see any way to do 
that if it uses a MULT:DI - it just isn't compatible.

I have regression tested the patch for correctness, but I haven't proven 
that I haven't produced worse code in some other case.

I could introduce a second pattern, say "*maddhidi4_2", so that both 
patterns match and then there could be no regressions, right? However, I 
don't believe it's necessary.

Andrew
Richard Earnshaw Jan. 28, 2011, 2:12 p.m. UTC | #5
On Fri, 2011-01-28 at 10:07 +0000, Andrew Stubbs wrote:
> On 27/01/11 18:40, Richard Earnshaw wrote:
> > I'm worried about this change.  The proposed RTL might be what GCC
> > generates for this particular case, but the pattern hardly looks
> > canonical: a sign-extend of a multiply that contained sign-extended
> > operands.
> >
> > Are you sure that fixing this particular case doesn't introduce a
> > regression elsewhere?
> >
> > Alternatively, is there some part of the manual that describes the form
> > you have as the correct canonical form?
> 
> I don't think I said it was the canonical form (although the other 
> similar smulbb patch was about that), but if it weren't in the canonical 
> form then combine would not work (which it does with my patch), so I 
> don't believe that's the problem here.
> 
> Anyway, combine only canonicalizes the *order* or the expressions, not 
> the modes, which is what I'm adjusting here. Or have I missed something?
> 

Canonical RTL is not just about the order of the operands its about not
having multiple ways of representing the same operation.  Order is part
of that, but not the end of it.  For example, the canonical form of 

	(lt:SI (reg:SI) (const_int 0))

is
	(lshiftrt:SI (reg:SI) (const_int 31))

As many machines have an instruction for the latter but not the former.

My question really, is about whether 

	(sign_extend:DI (mult:SI (sign_extend:SI (reg:HI))
				 (sign_extend:SI (reg:HI)))))

should be canonical, or whether

	(mult:DI (sign_extend:DI (reg:HI)
		 (sign_extend:DI (reg:HI))

is a better expression of this operation.

Ultimately this comes down to what is the correct thing to do in GCC.
If its the latter, then simplify RTX may need some work to make it do
the right thing, rather than tweaking the ARM back-end.  

> The problem here is that the expand pass used mulhisi3 which has the 
> 'internal' modes set one way, and then combine tries to insert that into 
> a multiply-and-accumulate form, but recog rejects it because the 
> 'internal' modes of maddhidi4 are set a different way.
> 
> Now, mulhisi3 is currently written using MULT:SI mode, and I can't see 
> how this could be any other way - it's an SI mode insn. If we want to be 
> able to insert that into maddhidi4, and we do, I don't see any way to do 
> that if it uses a MULT:DI - it just isn't compatible.
> 
> I have regression tested the patch for correctness, but I haven't proven 
> that I haven't produced worse code in some other case.
> 
> I could introduce a second pattern, say "*maddhidi4_2", so that both 
> patterns match and then there could be no regressions, right? However, I 
> don't believe it's necessary.
> 

So what happens to a variation of your testcase:

long long foolong (long long x, short *a, short *b)
   {
       return x + (long long)*a * (long long)*b;
   }

With your patch?  This should generate identical code to your original
test-case.

R.
Andrew Stubbs Jan. 28, 2011, 3:13 p.m. UTC | #6
On 28/01/11 14:12, Richard Earnshaw wrote:
> So what happens to a variation of your testcase:
>
> long long foolong (long long x, short *a, short *b)
>     {
>         return x + (long long)*a * (long long)*b;
>     }
>
> With your patch?  This should generate identical code to your original
> test-case.
>

The patch has no effect on that testcase - it's broken in some other 
way, I think, and the same with and without my patch:

         ldrsh   r3, [r3, #0]
         ldrsh   r2, [r2, #0]
         push    {r4, r5}
         asrs    r4, r3, #31
         asrs    r5, r2, #31
         mul     r4, r2, r4
         mla     r4, r3, r5, r4
         umull   r2, r3, r2, r3
         adds    r3, r4, r3
         adds    r0, r0, r2
         adc     r1, r1, r3
         pop     {r4, r5}
         bx      lr

Hmmm, that probably doesn't add anything useful to the discussion. :(

I'll add that one to the todo list ...

Andrew
Richard Earnshaw Jan. 28, 2011, 3:20 p.m. UTC | #7
On Fri, 2011-01-28 at 15:13 +0000, Andrew Stubbs wrote:
> On 28/01/11 14:12, Richard Earnshaw wrote:
> > So what happens to a variation of your testcase:
> >
> > long long foolong (long long x, short *a, short *b)
> >     {
> >         return x + (long long)*a * (long long)*b;
> >     }
> >
> > With your patch?  This should generate identical code to your original
> > test-case.
> >
> 
> The patch has no effect on that testcase - it's broken in some other 
> way, I think, and the same with and without my patch:
> 
>          ldrsh   r3, [r3, #0]
>          ldrsh   r2, [r2, #0]
>          push    {r4, r5}
>          asrs    r4, r3, #31
>          asrs    r5, r2, #31
>          mul     r4, r2, r4
>          mla     r4, r3, r5, r4
>          umull   r2, r3, r2, r3
>          adds    r3, r4, r3
>          adds    r0, r0, r2
>          adc     r1, r1, r3
>          pop     {r4, r5}
>          bx      lr
> 
> Hmmm, that probably doesn't add anything useful to the discussion. :(
> 
> I'll add that one to the todo list ...
> 
> Andrew
> 

Ouch!  I though that used to work :-(

R.
Hans-Peter Nilsson Feb. 9, 2011, 6:31 a.m. UTC | #8
On Fri, 28 Jan 2011, Richard Earnshaw wrote:
> Canonical RTL is not just about the order of the operands its about not
> having multiple ways of representing the same operation.  Order is part
> of that, but not the end of it.  For example, the canonical form of
>
> 	(lt:SI (reg:SI) (const_int 0))
>
> is
> 	(lshiftrt:SI (reg:SI) (const_int 31))

I kind of doubt that's canon.  At least it shouldn't be that
way, except as a variant attempted by combine.

If so, ISTM i'd be a complicating factor for the cmpM patterns
of most ports. :)  (Likewise unnamed variants for combine.)

brgds, H-P
Andrew Stubbs March 25, 2011, 4:19 p.m. UTC | #9
On 28/01/11 15:20, Richard Earnshaw wrote:
>
> On Fri, 2011-01-28 at 15:13 +0000, Andrew Stubbs wrote:
>> On 28/01/11 14:12, Richard Earnshaw wrote:
>>> So what happens to a variation of your testcase:
>>>
>>> long long foolong (long long x, short *a, short *b)
>>>      {
>>>          return x + (long long)*a * (long long)*b;
>>>      }
>>>
>>> With your patch?  This should generate identical code to your original
>>> test-case.
>>>
>>
>> The patch has no effect on that testcase - it's broken in some other
>> way, I think, and the same with and without my patch:
>>
>>           ldrsh   r3, [r3, #0]
>>           ldrsh   r2, [r2, #0]
>>           push    {r4, r5}
>>           asrs    r4, r3, #31
>>           asrs    r5, r2, #31
>>           mul     r4, r2, r4
>>           mla     r4, r3, r5, r4
>>           umull   r2, r3, r2, r3
>>           adds    r3, r4, r3
>>           adds    r0, r0, r2
>>           adc     r1, r1, r3
>>           pop     {r4, r5}
>>           bx      lr
>>
>> Hmmm, that probably doesn't add anything useful to the discussion. :(
>>
>> I'll add that one to the todo list ...
>>
>> Andrew
>>
>
> Ouch!  I though that used to work :-(


I looked at this one again, but on a second inspection I'm not sure 
there's much wrong with it?

When I wrote the above I thought that there was a 64-bit multiply 
instruction, but now I look more closely I see there isn't, hence the 
above. It does two 16-bit loads, sign-extends the inputs to 64-bit, does 
a 64-bit -> 64-bit multiply, and then adds 'x'.

Can the umull/add/add/adc be optimized using umlal? It's too late on a 
Friday to workout what's going on with the carries ....

Also, I don't fully understand why the compiler can't just disregard the 
casts and use maddhidi4? Isn't that mathematically equivalent in this case?

When you say it used to work, what did it use to look like?

Thanks

Andrew
Andrew Stubbs April 15, 2011, 10:54 a.m. UTC | #10
Ping.

On 25/03/11 16:19, Andrew Stubbs wrote:
> On 28/01/11 15:20, Richard Earnshaw wrote:
>>
>> On Fri, 2011-01-28 at 15:13 +0000, Andrew Stubbs wrote:
>>> On 28/01/11 14:12, Richard Earnshaw wrote:
>>>> So what happens to a variation of your testcase:
>>>>
>>>> long long foolong (long long x, short *a, short *b)
>>>> {
>>>> return x + (long long)*a * (long long)*b;
>>>> }
>>>>
>>>> With your patch? This should generate identical code to your original
>>>> test-case.
>>>>
>>>
>>> The patch has no effect on that testcase - it's broken in some other
>>> way, I think, and the same with and without my patch:
>>>
>>> ldrsh r3, [r3, #0]
>>> ldrsh r2, [r2, #0]
>>> push {r4, r5}
>>> asrs r4, r3, #31
>>> asrs r5, r2, #31
>>> mul r4, r2, r4
>>> mla r4, r3, r5, r4
>>> umull r2, r3, r2, r3
>>> adds r3, r4, r3
>>> adds r0, r0, r2
>>> adc r1, r1, r3
>>> pop {r4, r5}
>>> bx lr
>>>
>>> Hmmm, that probably doesn't add anything useful to the discussion. :(
>>>
>>> I'll add that one to the todo list ...
>>>
>>> Andrew
>>>
>>
>> Ouch! I though that used to work :-(
>
>
> I looked at this one again, but on a second inspection I'm not sure
> there's much wrong with it?
>
> When I wrote the above I thought that there was a 64-bit multiply
> instruction, but now I look more closely I see there isn't, hence the
> above. It does two 16-bit loads, sign-extends the inputs to 64-bit, does
> a 64-bit -> 64-bit multiply, and then adds 'x'.
>
> Can the umull/add/add/adc be optimized using umlal? It's too late on a
> Friday to workout what's going on with the carries ....
>
> Also, I don't fully understand why the compiler can't just disregard the
> casts and use maddhidi4? Isn't that mathematically equivalent in this case?
>
> When you say it used to work, what did it use to look like?
>
> Thanks
>
> Andrew
>
diff mbox

Patch

2010-12-09  Andrew Stubbs  <ams@codesourcery.com>

	gcc/
	* config/arm/arm.md (*maddhidi4): Make mult mode match
	mulhisi3 for combine.

	gcc/testsuite/
	* gcc.target/arm/smlalbb.c: New file.

---
 src/gcc-mainline/gcc/config/arm/arm.md             |    9 +++++----
 .../gcc/testsuite/gcc.target/arm/smlalbb.c         |    8 ++++++++
 2 files changed, 13 insertions(+), 4 deletions(-)
 create mode 100644 src/gcc-mainline/gcc/testsuite/gcc.target/arm/smlalbb.c

diff --git a/src/gcc-mainline/gcc/config/arm/arm.md b/src/gcc-mainline/gcc/config/arm/arm.md
index c816126..dc9da52 100644
--- a/src/gcc-mainline/gcc/config/arm/arm.md
+++ b/src/gcc-mainline/gcc/config/arm/arm.md
@@ -1814,10 +1814,11 @@ 
 (define_insn "*maddhidi4"
   [(set (match_operand:DI 0 "s_register_operand" "=r")
 	(plus:DI
-	  (mult:DI (sign_extend:DI
-	 	    (match_operand:HI 1 "s_register_operand" "%r"))
-		   (sign_extend:DI
-		    (match_operand:HI 2 "s_register_operand" "r")))
+	  (sign_extend:DI
+	   (mult:SI (sign_extend:SI
+		     (match_operand:HI 1 "s_register_operand" "%r"))
+		    (sign_extend:SI
+		     (match_operand:HI 2 "s_register_operand" "r"))))
 	  (match_operand:DI 3 "s_register_operand" "0")))]
   "TARGET_DSP_MULTIPLY"
   "smlalbb%?\\t%Q0, %R0, %1, %2"
diff --git a/src/gcc-mainline/gcc/testsuite/gcc.target/arm/smlalbb.c b/src/gcc-mainline/gcc/testsuite/gcc.target/arm/smlalbb.c
new file mode 100644
index 0000000..0e4dd70
--- /dev/null
+++ b/src/gcc-mainline/gcc/testsuite/gcc.target/arm/smlalbb.c
@@ -0,0 +1,8 @@ 
+/* Ensure the smlalbb pattern works.  */
+/* { dg-options "-O2 -march=armv7-a" }  */
+/* { dg-final { scan-assembler "smlalbb" } } */
+
+long long foolong (long long x, short *a, short *b)
+{
+    return x + *a * *b;
+}