Patchwork [v2,05/10] target-arm: optimize arm load/store multiple ops

Submitter Juha.Riihimaki@nokia.com
Date Oct. 24, 2009, 12:19 p.m.
Message ID <1256386749-85299-6-git-send-email-juha.riihimaki@nokia.com>
Permalink /patch/36834/
State New

Comments

Juha.Riihimaki@nokia.com - Oct. 24, 2009, 12:19 p.m.
From: Juha Riihimäki <juha.riihimaki@nokia.com>

ARM load/store multiple instructions can be slightly optimized by
loading the register offset constant into a variable outside the
register loop and using the preloaded variable inside the loop instead
of reloading the offset value into a temporary variable on each loop
iteration. This generates fewer TCG ops for an ARM load/store multiple
instruction when more than one register is accessed; otherwise the
number of generated TCG ops is the same.

Signed-off-by: Juha Riihimäki <juha.riihimaki@nokia.com>
Acked-by: Laurent Desnogues <laurent.desnogues@gmail.com>
---
 target-arm/translate.c |    8 +++++---
 1 files changed, 5 insertions(+), 3 deletions(-)
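
To make the op-count argument in the commit message concrete, here is a
toy counting model in C. It does not use QEMU's TCG API: every name in it
is hypothetical, and it assumes that an add-immediate expands to one
"load constant" op plus one add op. It models only a pre-increment
transfer of n registers.

```c
#include <assert.h>

/* Toy model only: these counts stand in for emitted TCG ops.
 * None of these names exist in QEMU. */

/* Assumption: add-immediate = load constant into a temp + add. */
enum { OPS_MOVI = 1, OPS_ADD = 1, OPS_ADDI = 2, OPS_MEM = 1 };

/* Original scheme: every address step uses add-immediate. */
int ops_addi_each_time(int n)
{
    assert(n >= 1);
    int ops = OPS_ADDI;          /* pre-increment before the first transfer */
    for (int j = 0; j < n; j++) {
        ops += OPS_MEM;          /* one load or store per register */
        if (j + 1 != n)
            ops += OPS_ADDI;     /* no add after the last transfer */
    }
    return ops;
}

/* Patched scheme: the constant is loaded once, outside the loop. */
int ops_hoisted_constant(int n)
{
    assert(n >= 1);
    int ops = OPS_MOVI;          /* load the constant 4 once */
    ops += OPS_ADD;              /* pre-increment using the preloaded temp */
    for (int j = 0; j < n; j++) {
        ops += OPS_MEM;          /* one load or store per register */
        if (j + 1 != n)
            ops += OPS_ADD;      /* plain register add, no constant reload */
    }
    return ops;
}
```

Under these assumptions both schemes cost 3 ops for a single register,
while for four registers the counts are 12 and 9. The model ignores base
register write-back and everything else the translator emits.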
Aurelien Jarno - Oct. 27, 2009, 8:39 a.m.
On Sat, Oct 24, 2009 at 03:19:04PM +0300, juha.riihimaki@nokia.com wrote:
> From: Juha Riihimäki <juha.riihimaki@nokia.com>
> 
> ARM load/store multiple instructions can be slightly optimized by
> loading the register offset constant into a variable outside the
> register loop and using the preloaded variable inside the loop instead
> of reloading the offset value into a temporary variable on each loop
> iteration. This generates fewer TCG ops for an ARM load/store multiple
> instruction when more than one register is accessed; otherwise the
> number of generated TCG ops is the same.
> 
> Signed-off-by: Juha Riihimäki <juha.riihimaki@nokia.com>
> Acked-by: Laurent Desnogues <laurent.desnogues@gmail.com>

This patch breaks the boot of an ARM kernel, as tmp2 is used elsewhere
within this code path.

OTOH, while it reduces the number of TCG ops, that should not impact the
generated host asm code, as most (all?) targets are able to add a
small constant value to a register in one instruction.
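
The breakage described above is the usual hazard of parking a long-lived
value in a scratch variable that the loop body also writes. The sketch
below models it with plain integers rather than TCG temporaries. The
function names and the way the loop reuses tmp2 are assumptions made for
illustration, since the thread does not show the conflicting code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: tmp2 is meant to hold the step (4) for the whole
 * loop, but the loop body also uses it as scratch storage. */
uint32_t final_addr_shared_temp(uint32_t base, const uint32_t *values, size_t n)
{
    uint32_t addr = base;
    uint32_t tmp2 = 4;               /* intended to stay 4 */
    for (size_t j = 0; j < n; j++) {
        tmp2 = values[j];            /* scratch use overwrites the constant */
        if (j + 1 != n)
            addr += tmp2;            /* adds the transferred value, not 4 */
    }
    return addr;
}

/* Same loop with a dedicated variable for the step, so nothing the
 * loop body does can clobber it. */
uint32_t final_addr_dedicated_temp(uint32_t base, const uint32_t *values, size_t n)
{
    uint32_t addr = base;
    const uint32_t step = 4;         /* never reassigned */
    for (size_t j = 0; j < n; j++) {
        uint32_t scratch = values[j];
        (void)scratch;               /* stands in for the actual transfer */
        if (j + 1 != n)
            addr += step;
    }
    return addr;
}
```

With three transferred words the dedicated version advances the address
by 8, while the shared version advances it by whatever values happened
to pass through the scratch variable.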

Juha.Riihimaki@nokia.com - Oct. 27, 2009, 8:48 a.m.
On Oct 27, 2009, at 10:39, ext Aurelien Jarno wrote:

> This patch breaks the boot of an ARM kernel, as tmp2 is used elsewhere
> within this code path.

True, I just noticed that as well. This is because the resource leak patch
was refactored to use tmp2 inside the loop as well. I just sent a new
revision of this patch that uses tmp3 for the constant value.

> OTOH, while it reduces the number of TCG ops, that should not impact the
> generated host asm code, as most (all?) targets are able to add a
> small constant value to a register in one instruction.

This is true, but I still think it provides a small speed gain, as there are
fewer TCG ops to be processed when generating host code...?


Cheers,
Juha
Aurelien Jarno - Oct. 27, 2009, 9 a.m.
Juha.Riihimaki@nokia.com wrote:
> On Oct 27, 2009, at 10:39, ext Aurelien Jarno wrote:
> 
>> This patch breaks the boot of an ARM kernel, as tmp2 is used elsewhere
>> within this code path.
> 
> True, I just noticed that as well. This is because the resource leak patch
> was refactored to use tmp2 inside the loop as well. I just sent a new
> revision of this patch that uses tmp3 for the constant value.
> 
>> OTOH, while it reduces the number of TCG ops, that should not impact the
>> generated host asm code, as most (all?) targets are able to add a
>> small constant value to a register in one instruction.
> 
> This is true, but I still think it provides a small speed gain, as there are
> fewer TCG ops to be processed when generating host code...?

It means fewer TCG ops, but one more temp variable; therefore, if there is
a gain, I don't think it is even measurable.

OTOH it makes the code a bit more complex to read. I am not really
opposed to this patch (and the other patches of the same kind), but I
will need some more arguments to convince me.
Juha.Riihimaki@nokia.com - Oct. 27, 2009, 9:05 a.m.
On Oct 27, 2009, at 11:00, ext Aurelien Jarno wrote:

> Juha.Riihimaki@nokia.com wrote:
>> On Oct 27, 2009, at 10:39, ext Aurelien Jarno wrote:
>>
>>> This patch breaks the boot of an ARM kernel, as tmp2 is used elsewhere
>>> within this code path.
>>
>> True, I just noticed that as well. This is because the resource leak patch
>> was refactored to use tmp2 inside the loop as well. I just sent a new
>> revision of this patch that uses tmp3 for the constant value.
>>
>>> OTOH, while it reduces the number of TCG ops, that should not impact the
>>> generated host asm code, as most (all?) targets are able to add a
>>> small constant value to a register in one instruction.
>>
>> This is true, but I still think it provides a small speed gain, as there are
>> fewer TCG ops to be processed when generating host code...?
>
> It means fewer TCG ops, but one more temp variable; therefore, if there is
> a gain, I don't think it is even measurable.
>
> OTOH it makes the code a bit more complex to read. I am not really
> opposed to this patch (and the other patches of the same kind), but I
> will need some more arguments to convince me.

Shouldn't the number of temp variables stay the same? tcg_gen_addi_i32
will internally allocate a temporary variable to hold the integer
parameter. The only difference is that the temporary stays alive
during the loop instead of being allocated/deallocated during each
iteration. But I agree, the performance gain is probably quite small.


Regards,
Juha
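
The point about temporaries in the last message can be illustrated with a
small bookkeeping model. It is a sketch with hypothetical names, not
QEMU code: it tracks only the temporary that holds the constant, and it
assumes (as the message states) that each add-immediate allocates and
frees one such temporary.

```c
#include <assert.h>

/* Hypothetical bookkeeping for the constant-holding temporary only. */
struct temp_stats {
    int allocs;      /* total allocations performed */
    int live;        /* temporaries currently allocated */
    int peak_live;   /* highest value 'live' ever reached */
};

static void temp_alloc(struct temp_stats *s)
{
    s->allocs++;
    s->live++;
    if (s->live > s->peak_live)
        s->peak_live = s->live;
}

static void temp_free(struct temp_stats *s)
{
    assert(s->live > 0);
    s->live--;
}

/* Add-immediate between transfers: one short-lived temp per step. */
struct temp_stats temps_addi_each_time(int n)
{
    struct temp_stats s = {0, 0, 0};
    for (int j = 0; j + 1 < n; j++) {
        temp_alloc(&s);              /* temp for the constant */
        temp_free(&s);               /* released right after the add */
    }
    return s;
}

/* Hoisted constant: one temp that stays alive across the whole loop. */
struct temp_stats temps_hoisted_constant(int n)
{
    struct temp_stats s = {0, 0, 0};
    temp_alloc(&s);                  /* allocated before the loop */
    for (int j = 0; j + 1 < n; j++) {
        /* each step reuses the same temp */
    }
    temp_free(&s);                   /* released after the loop */
    return s;
}
```

For four registers the first variant performs three allocations and the
second performs one, yet in both cases at most one constant-holding
temporary is live at any moment. The model says nothing about how the
longer lifetime interacts with the other temporaries used in the loop,
which is the concern raised earlier in the thread.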

Patch

diff --git a/target-arm/translate.c b/target-arm/translate.c
index 38fb833..d1e2ed2 100644
--- a/target-arm/translate.c
+++ b/target-arm/translate.c
@@ -6852,6 +6852,7 @@  static void disas_arm_insn(CPUState * env, DisasContext *s)
                 }
                 rn = (insn >> 16) & 0xf;
                 addr = load_reg(s, rn);
+                tmp2 = tcg_const_i32(4);
 
                 /* compute total size */
                 loaded_base = 0;
@@ -6865,7 +6866,7 @@  static void disas_arm_insn(CPUState * env, DisasContext *s)
                 if (insn & (1 << 23)) {
                     if (insn & (1 << 24)) {
                         /* pre increment */
-                        tcg_gen_addi_i32(addr, addr, 4);
+                        tcg_gen_add_i32(addr, addr, tmp2);
                     } else {
                         /* post increment */
                     }
@@ -6918,7 +6919,7 @@  static void disas_arm_insn(CPUState * env, DisasContext *s)
                         j++;
                         /* no need to add after the last transfer */
                         if (j != n)
-                            tcg_gen_addi_i32(addr, addr, 4);
+                            tcg_gen_add_i32(addr, addr, tmp2);
                     }
                 }
                 if (insn & (1 << 21)) {
@@ -6928,7 +6929,7 @@  static void disas_arm_insn(CPUState * env, DisasContext *s)
                             /* pre increment */
                         } else {
                             /* post increment */
-                            tcg_gen_addi_i32(addr, addr, 4);
+                            tcg_gen_add_i32(addr, addr, tmp2);
                         }
                     } else {
                         if (insn & (1 << 24)) {
@@ -6944,6 +6945,7 @@  static void disas_arm_insn(CPUState * env, DisasContext *s)
                 } else {
                     dead_tmp(addr);
                 }
+                tcg_temp_free_i32(tmp2);
                 if (loaded_base) {
                     store_reg(s, rn, loaded_var);
                 }