diff mbox

[RFT] Speedup 'tb_find_slow' by using the same heuristic as during memory page lookup

Message ID 4CEAC792.7040307@ispras.ru
State New
Headers show

Commit Message

Kirill Batuzov Nov. 22, 2010, 7:42 p.m. UTC
Move the most recently found TB to the head of the list so it is found 
more quickly the next time it is looked up.

Signed-off-by: Kirill Batuzov <batuzovk@ispras.ru>
Signed-off-by: Pavel Yushchenko <pau@ispras.ru>
---
Hello.  This patch gives a significant boost to a fairly rich ARM-based 
embedded system we use (featuring an X server, many daemons, and 
applications), literally halving its boot-to-desktop time!  (The average 
number of 'tb_phys_hash' entries traversed in the main loop of 
'tb_find_slow' dropped from 20 to 1.5.)  We were also able to shorten 
boot-to-login time by about 25% using Debian on versatilepb (no X 
server, only the basic system).  Kernel boot time seems unaffected.  No 
problems were encountered during our experiments.

We look forward to comments on this change and would appreciate help 
with testing.  Thanks in advance!

  cpu-exec.c |    5 +++++
  1 files changed, 5 insertions(+), 0 deletions(-)

Comments

Riku Voipio Nov. 23, 2010, 12:38 p.m. UTC | #1
Hi,

Tested; it indeed improves startup speed.

On Mon, Nov 22, 2010 at 10:42:10PM +0300, Kirill Batuzov wrote:
> Move the most recently found TB to the head of the list so it is found
> more quickly the next time it is looked up.
>
> Signed-off-by: Kirill Batuzov <batuzovk@ispras.ru>
> Signed-off-by: Pavel Yushchenko <pau@ispras.ru>
> ---
> Hello.  This patch gives a significant boost to a fairly rich ARM-based
> embedded system we use (featuring an X server, many daemons, and
> applications), literally halving its boot-to-desktop time!  (The average
> number of 'tb_phys_hash' entries traversed in the main loop of
> 'tb_find_slow' dropped from 20 to 1.5.)  We were also able to shorten
> boot-to-login time by about 25% using Debian on versatilepb (no X
> server, only the basic system).  Kernel boot time seems unaffected.  No
> problems were encountered during our experiments.
>
> We look forward to comments on this change and would appreciate help
> with testing.  Thanks in advance!
>
>  cpu-exec.c |    5 +++++
>  1 files changed, 5 insertions(+), 0 deletions(-)
>
> diff --git a/cpu-exec.c b/cpu-exec.c
> index 5d6dd51..55c4526 100644
> --- a/cpu-exec.c
> +++ b/cpu-exec.c
> @@ -161,6 +161,11 @@ static TranslationBlock *tb_find_slow(target_ulong pc,
>      tb = tb_gen_code(env, pc, cs_base, flags, 0);
>
>   found:
> +    if (*ptb1) {
> +        *ptb1 = tb->phys_hash_next;
> +        tb->phys_hash_next = tb_phys_hash[h];
> +        tb_phys_hash[h] = tb;
> +    }
>      /* we add the TB in the virtual pc hash table */
>      env->tb_jmp_cache[tb_jmp_cache_hash_func(pc)] = tb;
>      return tb;
Mulyadi Santosa Nov. 23, 2010, 4:23 p.m. UTC | #2
Dear Kirill

On Tue, Nov 23, 2010 at 02:42, Kirill Batuzov <batuzovk@ispras.ru> wrote:
> Move the most recently found TB to the head of the list so it is found
> more quickly the next time it is looked up.
...
>  found:
> +    if (*ptb1) {
> +        *ptb1 = tb->phys_hash_next;
> +        tb->phys_hash_next = tb_phys_hash[h];
> +        tb_phys_hash[h] = tb;
> +    }
>     /* we add the TB in the virtual pc hash table */
>     env->tb_jmp_cache[tb_jmp_cache_hash_func(pc)] = tb;
>     return tb;
>

Thank you, because you have indirectly taught me how to do this. I have
wanted to do the same thing for a long time, but I never thoroughly
understood how TBs are managed.

May I suggest something?
a. The "if (*ptb1)" check could be improved with the likely() or
unlikely() macros (I forget the real GCC builtin's name; I am just
writing them the way the Linux kernel names them).

I would guess that, statistically, the hit rate on the last executed TB
is somewhere above 50% (by the locality principle, which IIRC says that
roughly 10% of the code accounts for 90% of the total run time, CMIIW).
So likely() should improve the code ordering.

b. Or, better, we need some kind of LRU ordering of the TB list. On
each TB hit, that TB's "hit count" is incremented. At a certain time
interval, the entire list is scanned and reordered so that the most
frequently called TBs are at the head.

All in all, I think this is a consequence of the simple TB "clean up"
mechanism used so far: whenever the cache is full, all TBs are simply
dumped. While I agree it is the simplest approach, I remember Julian
Seward's tests showing a runtime speedup when the TB cache size was
increased up to a certain point. To me, this demonstrates to some
degree that more efficient management of the TB cache is needed.

Just my 2 cents... and so far this is all I can suggest to help QEMU
development.

Patch

diff --git a/cpu-exec.c b/cpu-exec.c
index 5d6dd51..55c4526 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -161,6 +161,11 @@  static TranslationBlock *tb_find_slow(target_ulong pc,
      tb = tb_gen_code(env, pc, cs_base, flags, 0);

   found:
+    if (*ptb1) {
+        *ptb1 = tb->phys_hash_next;
+        tb->phys_hash_next = tb_phys_hash[h];
+        tb_phys_hash[h] = tb;
+    }
      /* we add the TB in the virtual pc hash table */
      env->tb_jmp_cache[tb_jmp_cache_hash_func(pc)] = tb;
      return tb;