Patchwork [3/3] tcg/arm: improve constant loading

login
register
mail settings
Submitter Aurelien Jarno
Date Jan. 6, 2011, 9:54 p.m.
Message ID <1294350874-6885-3-git-send-email-aurelien@aurel32.net>
Download mbox | patch
Permalink /patch/77812/
State New
Headers show

Comments

Aurelien Jarno - Jan. 6, 2011, 9:54 p.m.
Improve constant loading in two ways:
- On all ARM versions, it's possible to load 0xffffff00 = -0x100 using
  the mvn rd, #0. Fix the conditions.
- On <= ARMv6 versions, where movw and movt are not available, load the
  constants using mov and orr with rotations depending on the constant
  to load. This is very useful for example to load constants where the
  low byte is 0. This reduce the generated code size by about 7%.

Also fix the coding style at the same time.

Cc: Andrzej Zaborowski <balrog@zabor.org>
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
---
 tcg/arm/tcg-target.c |   39 +++++++++++++++++++++------------------
 1 files changed, 21 insertions(+), 18 deletions(-)
andrzej zaborowski - Jan. 7, 2011, 12:52 p.m.
Hi,

On 6 January 2011 22:54, Aurelien Jarno <aurelien@aurel32.net> wrote:
> Improve constant loading in two ways:
> - On all ARM versions, it's possible to load 0xffffff00 = -0x100 using
>  the mvn rd, #0. Fix the conditions.
> - On <= ARMv6 versions, where movw and movt are not available, load the
>  constants using mov and orr with rotations depending on the constant
>  to load. This is very useful for example to load constants where the
>  low byte is 0. This reduce the generated code size by about 7%.

That's a nice improvement.  For some instructions using MVN and AND
could yield even shorter code and I think with that the optimisation
options (except loading from a constant pool) would be exhausted :)

...
>         }
> +    } else {
> +        int opc = ARITH_MOV;
> +        int rn = 0;
> +
> +        do {
> +            int i, rot;
> +
> +            i = ctz32(arg) & ~1;
> +            rot = ((32 - i) << 7) & 0xf00;
> +            tcg_out_dat_imm(s, cond, opc, rd, rn, ((arg >> i) & 0xff) | rot);
> +            arg &= ~(0xff << i);
> +
> +            opc = ARITH_ORR;
> +            rn = rd;

I think you could get rid of rn and just use rd from the start of the
loop.  Otherwise acked by me too.

Best regards
andrzej zaborowski - Jan. 7, 2011, 12:55 p.m.
On 7 January 2011 13:52, andrzej zaborowski <balrog@zabor.org> wrote:
> On 6 January 2011 22:54, Aurelien Jarno <aurelien@aurel32.net> wrote:
>> Improve constant loading in two ways:
>> - On all ARM versions, it's possible to load 0xffffff00 = -0x100 using
>>  the mvn rd, #0. Fix the conditions.
>> - On <= ARMv6 versions, where movw and movt are not available, load the
>>  constants using mov and orr with rotations depending on the constant
>>  to load. This is very useful for example to load constants where the
>>  low byte is 0. This reduce the generated code size by about 7%.
>
> That's a nice improvement.  For some instructions using MVN and AND

Oops, I mean for some *values*.
Aurelien Jarno - Jan. 7, 2011, 2:40 p.m.
On Fri, Jan 07, 2011 at 01:52:25PM +0100, andrzej zaborowski wrote:
> Hi,
> 
> On 6 January 2011 22:54, Aurelien Jarno <aurelien@aurel32.net> wrote:
> > Improve constant loading in two ways:
> > - On all ARM versions, it's possible to load 0xffffff00 = -0x100 using
> >  the mvn rd, #0. Fix the conditions.
> > - On <= ARMv6 versions, where movw and movt are not available, load the
> >  constants using mov and orr with rotations depending on the constant
> >  to load. This is very useful for example to load constants where the
> >  low byte is 0. This reduce the generated code size by about 7%.
> 
> That's a nice improvement.  For some instructions using MVN and AND
> could yield even shorter code and I think with that the optimisation
> options (except loading from a constant pool) would be exhausted :)

I also did something with MVN and BIC, it works well, but the problem is
to find the right heuristic to choose between MOV/ORR and MVN/BIC. In my
tries, it was making the code bigger.

> ...
> >         }
> > +    } else {
> > +        int opc = ARITH_MOV;
> > +        int rn = 0;
> > +
> > +        do {
> > +            int i, rot;
> > +
> > +            i = ctz32(arg) & ~1;
> > +            rot = ((32 - i) << 7) & 0xf00;
> > +            tcg_out_dat_imm(s, cond, opc, rd, rn, ((arg >> i) & 0xff) | rot);
> > +            arg &= ~(0xff << i);
> > +
> > +            opc = ARITH_ORR;
> > +            rn = rd;
> 
> I think you could get rid of rn and just use rd from the start of the
> loop.  Otherwise acked by me too.
> 

What do you mean exactly? rn has to be 0 when opc is ARITH_MOV in order
to generate a correct ARM instruction.
andrzej zaborowski - Jan. 7, 2011, 3:56 p.m.
On 7 January 2011 15:40, Aurelien Jarno <aurelien@aurel32.net> wrote:
> On Fri, Jan 07, 2011 at 01:52:25PM +0100, andrzej zaborowski wrote:
>> Hi,
>>
>> On 6 January 2011 22:54, Aurelien Jarno <aurelien@aurel32.net> wrote:
>> > Improve constant loading in two ways:
>> > - On all ARM versions, it's possible to load 0xffffff00 = -0x100 using
>> >  the mvn rd, #0. Fix the conditions.
>> > - On <= ARMv6 versions, where movw and movt are not available, load the
>> >  constants using mov and orr with rotations depending on the constant
>> >  to load. This is very useful for example to load constants where the
>> >  low byte is 0. This reduce the generated code size by about 7%.
>>
>> That's a nice improvement.  For some instructions using MVN and AND
>> could yield even shorter code and I think with that the optimisation
>> options (except loading from a constant pool) would be exhausted :)
>
> I also did something with MVN and BIC, it works well, but the problem is
> to find the right heuristic to choose between MOV/ORR and MVN/BIC. In my
> tries, it was making the code bigger.

I was thinking of running both without writing the instructions, then
comparing the lengths and then running the better method.  It's
possible that the cost of this outweights the shorter code advantage
though.

>
>> ...
>> >         }
>> > +    } else {
>> > +        int opc = ARITH_MOV;
>> > +        int rn = 0;
>> > +
>> > +        do {
>> > +            int i, rot;
>> > +
>> > +            i = ctz32(arg) & ~1;
>> > +            rot = ((32 - i) << 7) & 0xf00;
>> > +            tcg_out_dat_imm(s, cond, opc, rd, rn, ((arg >> i) & 0xff) | rot);
>> > +            arg &= ~(0xff << i);
>> > +
>> > +            opc = ARITH_ORR;
>> > +            rn = rd;
>>
>> I think you could get rid of rn and just use rd from the start of the
>> loop.  Otherwise acked by me too.
>>
>
> What do you mean exactly? rn has to be 0 when opc is ARITH_MOV in order
> to generate a correct ARM instruction.

According to my ARM926 manual rn is ignored for MOV/MVN, perhaps it's
different in later revisions.

Cheers
Aurelien Jarno - Jan. 9, 2011, 10:40 p.m.
On Fri, Jan 07, 2011 at 04:56:32PM +0100, andrzej zaborowski wrote:
> On 7 January 2011 15:40, Aurelien Jarno <aurelien@aurel32.net> wrote:
> > On Fri, Jan 07, 2011 at 01:52:25PM +0100, andrzej zaborowski wrote:
> >> Hi,
> >>
> >> On 6 January 2011 22:54, Aurelien Jarno <aurelien@aurel32.net> wrote:
> >> > Improve constant loading in two ways:
> >> > - On all ARM versions, it's possible to load 0xffffff00 = -0x100 using
> >> >  the mvn rd, #0. Fix the conditions.
> >> > - On <= ARMv6 versions, where movw and movt are not available, load the
> >> >  constants using mov and orr with rotations depending on the constant
> >> >  to load. This is very useful for example to load constants where the
> >> >  low byte is 0. This reduce the generated code size by about 7%.
> >>
> >> That's a nice improvement.  For some instructions using MVN and AND
> >> could yield even shorter code and I think with that the optimisation
> >> options (except loading from a constant pool) would be exhausted :)
> >
> > I also did something with MVN and BIC, it works well, but the problem is
> > to find the right heuristic to choose between MOV/ORR and MVN/BIC. In my
> > tries, it was making the code bigger.
> 
> I was thinking of running both without writing the instructions, then
> comparing the lengths and then running the better method.  It's
> possible that the cost of this outweights the shorter code advantage
> though.
> 
> >
> >> ...
> >> >         }
> >> > +    } else {
> >> > +        int opc = ARITH_MOV;
> >> > +        int rn = 0;
> >> > +
> >> > +        do {
> >> > +            int i, rot;
> >> > +
> >> > +            i = ctz32(arg) & ~1;
> >> > +            rot = ((32 - i) << 7) & 0xf00;
> >> > +            tcg_out_dat_imm(s, cond, opc, rd, rn, ((arg >> i) & 0xff) | rot);
> >> > +            arg &= ~(0xff << i);
> >> > +
> >> > +            opc = ARITH_ORR;
> >> > +            rn = rd;
> >>
> >> I think you could get rid of rn and just use rd from the start of the
> >> loop.  Otherwise acked by me too.
> >>
> >
> > What do you mean exactly? rn has to be 0 when opc is ARITH_MOV in order
> > to generate a correct ARM instruction.
> 
> According to my ARM926 manual rn is ignored for MOV/MVN, perhaps it's
> different in later revisions.
> 

I have just tried, and it actually works (tried on ARMv5 and ARMv7). 
Note that binutils is not able to disassemble such an instruction and
outputs in qemu.log something like:
| 0x01000008:  e3aa50ff  undefined instruction 0xe3aa50ff

However what worries me the most is that the "ARM Architecture Reference
Manual ARMv7-A and ARMv7-R edition" defines this opcode with the rn field
as "(0)(0)(0)(0)". Looking at what it means:

| An instruction is UNPREDICTABLE if:
| [...]
| * the pseudocode for that encoding does not indicate that a different
|   special case applies, and a bit marked (0) or (1) in the encoding 
| diagram of an instruction is not 0 or 1 respectively.

In short is it still going to work on newer CPUs?
andrzej zaborowski - Jan. 9, 2011, 11:33 p.m.
On 9 January 2011 23:40, Aurelien Jarno <aurelien@aurel32.net> wrote:
> On Fri, Jan 07, 2011 at 04:56:32PM +0100, andrzej zaborowski wrote:
>> On 7 January 2011 15:40, Aurelien Jarno <aurelien@aurel32.net> wrote:
>> > On Fri, Jan 07, 2011 at 01:52:25PM +0100, andrzej zaborowski wrote:
>> >> On 6 January 2011 22:54, Aurelien Jarno <aurelien@aurel32.net> wrote:
>> >> ...
>> >> >         }
>> >> > +    } else {
>> >> > +        int opc = ARITH_MOV;
>> >> > +        int rn = 0;
>> >> > +
>> >> > +        do {
>> >> > +            int i, rot;
>> >> > +
>> >> > +            i = ctz32(arg) & ~1;
>> >> > +            rot = ((32 - i) << 7) & 0xf00;
>> >> > +            tcg_out_dat_imm(s, cond, opc, rd, rn, ((arg >> i) & 0xff) | rot);
>> >> > +            arg &= ~(0xff << i);
>> >> > +
>> >> > +            opc = ARITH_ORR;
>> >> > +            rn = rd;
>> >>
>> >> I think you could get rid of rn and just use rd from the start of the
>> >> loop.  Otherwise acked by me too.
>> >>
>> >
>> > What do you mean exactly? rn has to be 0 when opc is ARITH_MOV in order
>> > to generate a correct ARM instruction.
>>
>> According to my ARM926 manual rn is ignored for MOV/MVN, perhaps it's
>> different in later revisions.
>>
>
> I have just tried, and it actually works (tried on ARMv5 and ARMv7).

Also works under qemu-arm :)

> Note that binutils is not able to disassemble such an instruction and
> outputs in qemu.log something like:
> | 0x01000008:  e3aa50ff  undefined instruction 0xe3aa50ff
>
> However what worries me the most is that the "ARM Architecture Reference
> Manual ARMv7-A and ARMv7-R edition" defines this opcode with the rn field
> as "(0)(0)(0)(0)". Looking at what it means:
>
> | An instruction is UNPREDICTABLE if:
> | [...]
> | * the pseudocode for that encoding does not indicate that a different
> |   special case applies, and a bit marked (0) or (1) in the encoding
> | diagram of an instruction is not 0 or 1 respectively.
>
> In short is it still going to work on newer CPUs?

Perhaps let's be on the safe side and use your version with rn = 0.

I think it *should* work on the new ARM ISAs because of backwards
compatibility: x works under ARMv4 & ARMv5 and x is not listed under
the differences between new and old ISA, thus it needs to work under a
new ISA.

Cheers
Peter Maydell - Jan. 10, 2011, 3:41 a.m.
On 9 January 2011 23:33, andrzej zaborowski <balrogg@gmail.com> wrote:
> On 9 January 2011 23:40, Aurelien Jarno <aurelien@aurel32.net> wrote:
>> Note that binutils is not able to disassemble such an instruction and
>> outputs in qemu.log something like:
>> | 0x01000008:  e3aa50ff  undefined instruction 0xe3aa50ff
>>
>> However what worries me the most is that the "ARM Architecture Reference
>> Manual ARMv7-A and ARMv7-R edition" defines this opcode with the rn field
>> as "(0)(0)(0)(0)". Looking at what it means:
>>
>> | An instruction is UNPREDICTABLE if:
>> | [...]
>> | * the pseudocode for that encoding does not indicate that a different
>> |   special case applies, and a bit marked (0) or (1) in the encoding
>> | diagram of an instruction is not 0 or 1 respectively.
>>
>> In short is it still going to work on newer CPUs?

It might not work on existing CPUs, never mind newer ones. We
mean it about UNPREDICTABLE :-) Some cores choose to make
patterns which fail these "should be zero/one" checks cause an
UNDEF exception. Some don't.

> I think it *should* work on the new ARM ISAs because of backwards
> compatibility: x works under ARMv4 & ARMv5 and x is not listed under
> the differences between new and old ISA, thus it needs to work under a
> new ISA.

I went back and checked the ARM ARM for ARMv4 (that's ARM
document DUI0100B, dated 1996). It says that for MOV and MVN
bits 19..16 are "SBZ", ie "Should Be Zero", meaning that non-zero
is UNPREDICTABLE. So this isn't a change in behaviour -- the
ISA has always been clear that you should not do it.

[Note for the unwary: UNPREDICTABLE in ARM docs doesn't
mean totally unpredictable -- an implementation isn't allowed to
permit it to be a security hole or to hang the processor, for instance.
But you can't rely on it doing anything useful or consistent.]

-- PMM

Patch

diff --git a/tcg/arm/tcg-target.c b/tcg/arm/tcg-target.c
index 08c44c1..1eb5605 100644
--- a/tcg/arm/tcg-target.c
+++ b/tcg/arm/tcg-target.c
@@ -406,35 +406,38 @@  static inline void tcg_out_dat_imm(TCGContext *s,
 }
 
 static inline void tcg_out_movi32(TCGContext *s,
-                int cond, int rd, int32_t arg)
+                int cond, int rd, uint32_t arg)
 {
     /* TODO: This is very suboptimal, we can easily have a constant
      * pool somewhere after all the instructions.  */
-
-    if (arg < 0 && arg > -0x100)
-        return tcg_out_dat_imm(s, cond, ARITH_MVN, rd, 0, (~arg) & 0xff);
-
-    if (use_armv7_instructions) {
+    if ((int)arg < 0 && (int)arg >= -0x100) {
+        tcg_out_dat_imm(s, cond, ARITH_MVN, rd, 0, (~arg) & 0xff);
+    } else if (use_armv7_instructions) {
         /* use movw/movt */
         /* movw */
         tcg_out32(s, (cond << 28) | 0x03000000 | (rd << 12)
                   | ((arg << 4) & 0x000f0000) | (arg & 0xfff));
-        if (arg & 0xffff0000)
+        if (arg & 0xffff0000) {
             /* movt */
             tcg_out32(s, (cond << 28) | 0x03400000 | (rd << 12)
                       | ((arg >> 12) & 0x000f0000) | ((arg >> 16) & 0xfff));
-    } else {
-        tcg_out_dat_imm(s, cond, ARITH_MOV, rd, 0, arg & 0xff);
-        if (arg & 0x0000ff00)
-            tcg_out_dat_imm(s, cond, ARITH_ORR, rd, rd,
-                            ((arg >>  8) & 0xff) | 0xc00);
-        if (arg & 0x00ff0000)
-            tcg_out_dat_imm(s, cond, ARITH_ORR, rd, rd,
-                            ((arg >> 16) & 0xff) | 0x800);
-        if (arg & 0xff000000)
-            tcg_out_dat_imm(s, cond, ARITH_ORR, rd, rd,
-                            ((arg >> 24) & 0xff) | 0x400);
         }
+    } else {
+        int opc = ARITH_MOV;
+        int rn = 0;
+
+        do {
+            int i, rot;
+
+            i = ctz32(arg) & ~1;
+            rot = ((32 - i) << 7) & 0xf00;
+            tcg_out_dat_imm(s, cond, opc, rd, rn, ((arg >> i) & 0xff) | rot);
+            arg &= ~(0xff << i);
+
+            opc = ARITH_ORR;
+            rn = rd;
+        } while (arg);
+    }
 }
 
 static inline void tcg_out_mul32(TCGContext *s,