Patchwork [v2] TCG: Convert global variables to be TLS.

Submitter Evgeny Voevodin
Date Feb. 27, 2012, 12:13 p.m.
Message ID <1330344787-14482-1-git-send-email-e.voevodin@samsung.com>
Download mbox | patch
Permalink /patch/143181/
State New
Headers show

Comments

Evgeny Voevodin - Feb. 27, 2012, 12:13 p.m.
This commit converts code_gen_buffer, code_gen_ptr, tbs, nb_tbs to
TLS. We need this if we want TCG to become multithreaded.

Initialization of code_gen_buffer and code_gen_ptr is moved to a new
tcg_gen_buffer_init() function, because we do not need to allocate and
initialize TCG buffers for the IO thread. Initialization is now done in
qemu_tcg_cpu_thread_fn() by each HW thread individually.

Also, tcg_enabled() now returns a flag variable instead of
(code_gen_buffer != NULL), since the latter would always be false when
called from the IO thread.

Also some code format changes.

Signed-off-by: Evgeny Voevodin <e.voevodin@samsung.com>
---
 bsd-user/main.c    |    1 +
 cpus.c             |    2 +
 darwin-user/main.c |    1 +
 exec.c             |  123 +++++++++++++++++++++++++++++++---------------------
 linux-user/main.c  |    1 +
 qemu-common.h      |    1 +
 6 files changed, 79 insertions(+), 50 deletions(-)
Peter Maydell - Feb. 27, 2012, 12:35 p.m.
On 27 February 2012 12:13, Evgeny Voevodin <e.voevodin@samsung.com> wrote:
> This commit converts code_gen_buffer, code_gen_ptr, tbs, nb_tbs to
> TLS. We need this if we want TCG to become multithreaded.

I'm sceptical about doing this kind of thing as a change on its
own. A true multithreaded TCG is a large project, and unless we're
going to commit to doing that I don't see much value in making
some variables per-thread when we might instead need to do
larger refactorings (properly encapsulating the codegen
caches as qom objects, maybe?).

-- PMM
Evgeny Voevodin - Feb. 28, 2012, 3:13 a.m.
On 27.02.2012 16:35, Peter Maydell wrote:
> On 27 February 2012 12:13, Evgeny Voevodin<e.voevodin@samsung.com>  wrote:
>> This commit converts code_gen_buffer, code_gen_ptr, tbs, nb_tbs to
>> TLS. We need this if we want TCG to become multithreaded.
> I'm sceptical about doing this kind of thing as a change on its
> own. A true multithreaded TCG is a large project, and unless we're
> going to commit to doing that I don't see much value in making
> some variables per-thread when we might instead need to do
> larger refactorings (properly encapsulating the codegen
> caches as qom objects, maybe?).
>
> -- PMM
>

I wanted to get some feedback and pointers showing a direction to move
in this field.
And qomification of translation caches is an interesting suggestion, I think.
Peter Maydell - Feb. 28, 2012, 8:10 a.m.
On 28 February 2012 03:13, Evgeny Voevodin <e.voevodin@samsung.com> wrote:
> I wanted to get some feedback and points to show up a direction to move in
> this field.
> And qomification of translation caches is an interesting suggestion I think.

If you're serious about multithreading TCG then I think the first
steps are:
 * fix existing race conditions
 * think very hard
 * come up with an overall design for what you're proposing

You won't get there by incremental steps unless you know where
you're going...

-- PMM
陳韋任 - Feb. 29, 2012, 3:26 a.m.
On Tue, Feb 28, 2012 at 08:10:58AM +0000, Peter Maydell wrote:
> On 28 February 2012 03:13, Evgeny Voevodin <e.voevodin@samsung.com> wrote:
> > I wanted to get some feedback and points to show up a direction to move in
> > this field.
> > And qomification of translation caches is an interesting suggestion I think.
> 
> If you're serious about multithreading TCG then I think the first
> steps are:
>  * fix existing race conditions
>  * think very hard
>  * come up with an overall design for what you're proposing
> 
> You won't get there by incremental steps unless you know where
> you're going...

  Would the paper "PQEMU: A Parallel System Emulator Based on QEMU" [1] help with this?

Regards,
chenwj

[1] http://www.cs.nthu.edu.tw/~ychung/conference/ICPADS2011.pdf
Evgeny Voevodin - Feb. 29, 2012, 3:43 a.m.
On 29.02.2012 07:26, 陳韋任 wrote:
> On Tue, Feb 28, 2012 at 08:10:58AM +0000, Peter Maydell wrote:
>> On 28 February 2012 03:13, Evgeny Voevodin<e.voevodin@samsung.com>  wrote:
>>> I wanted to get some feedback and points to show up a direction to move in
>>> this field.
>>> And qomification of translation caches is an interesting suggestion I think.
>> If you're serious about multithreading TCG then I think the first
>> steps are:
>>   * fix existing race conditions
>>   * think very hard
>>   * come up with an overall design for what you're proposing
>>
>> You won't get there by incremental steps unless you know where
>> you're going...
>    Would the paper "PQEMU: A Parallel System Emulator Based on QEMU " help on this?
>
> Regards,
> chenwj
>
> [1] http://www.cs.nthu.edu.tw/~ychung/conference/ICPADS2011.pdf
>

Certainly would :) Also I've studied COREMU:
http://ppi.fudan.edu.cn/_media/publications%3Bcoremu-ppopp11.pdf
But they are based on v0.14 as far as I remember, and it seems that this
project is not going to come upstream.
Anyway, there are a lot of useful approaches they developed while facing
different problems on the way to parallelizing TCG.
I'm sure those approaches should be used in future work.
陳韋任 - Feb. 29, 2012, 3:46 a.m.
> Certainly would :) Also I've studied COREMU: 
> http://ppi.fudan.edu.cn/_media/publications%3Bcoremu-ppopp11.pdf
> But they are based on v0.14 as I can remember and seems that this 
> project is not going to come upstream.
> Anyway, thee are a lot of useful approaches they done while facing 
> different problems on the way of paralleling the TCG.
> I'm sure that those approaches should be used in future work.

  FWIW, the COREMU maintainers intend to upstream their work, but they
have another project to do right now, so ... ;)

Regards,
chenwj
Evgeny Voevodin - Feb. 29, 2012, 4:01 a.m.
On 29.02.2012 07:46, 陳韋任 wrote:
>> Certainly would :) Also I've studied COREMU:
>> http://ppi.fudan.edu.cn/_media/publications%3Bcoremu-ppopp11.pdf
>> But they are based on v0.14 as I can remember and seems that this
>> project is not going to come upstream.
>> Anyway, thee are a lot of useful approaches they done while facing
>> different problems on the way of paralleling the TCG.
>> I'm sure that those approaches should be used in future work.
>    FWIW, COREMU maintainer tends to upstream their work but they have
> another project to do right now, so ... ;)
>
> Regards,
> chenwj
>

Their git tree has not been updated for more than a year, and they are
based on v0.14, in which one thread was used for HW and IO. Also, their
code is split into the coremu lib and a modified qemu from which the
coremu interfaces are called.
陳韋任 - March 1, 2012, 7:51 a.m.
> If you're serious about multithreading TCG then I think the first
> steps are:
>  * fix existing race conditions
>  * think very hard
>  * come up with an overall design for what you're proposing

  As COREMU [1] points out, the current QEMU approach to emulating
atomic instructions is problematic. For example, a guest application
might use the x86 xchg instruction to implement spin lock/unlock (addr
is in shared memory).


      spin_unlock:                   spin_lock:
                                     
                                     try:
                                       r10 = 1;
                                       xchg addr, r10;
                                       if (r10 == 0)
                                         goto success;
      *addr = 0;                     fail:
                                       pause;
                                       if (*addr != 0)
                                         goto fail;

                                       goto try;

                                     success:

                                     
After QEMU translation, guest xchg instruction becomes

      spin_unlock:                   spin_lock:

                                     helper_lock;

      *addr = 0;                     T0 = r10;
                                     T1 = *addr;
                                     *addr = T0;
                                     r10 = T1;

                                     helper_unlock;

  You can see that the atomicity on which spin lock/unlock rely is
broken: "*addr = 0" can happen in between helper_lock and helper_unlock.
COREMU solves this by using a lightweight software transactional memory
to emulate atomic instructions. I think this issue is quite important if
we want to make TCG multithreaded, right? Is there a better way to
solve it?

Regards,
chenwj

[1]
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.187.6011&rep=rep1&type=pdf
Andreas Färber - March 1, 2012, 8:22 a.m.
Am 28.02.2012 04:13, schrieb Evgeny Voevodin:
> On 27.02.2012 16:35, Peter Maydell wrote:
>> A true multithreaded TCG is a large project, and unless we're
>> going to commit to doing that I don't see much value in making
>> some variables per-thread when we might instead need to do
>> larger refactorings (properly encapsulating the codegen
>> caches as qom objects, maybe?).
> 
> [...] qomification of translation caches is an interesting suggestion I
> think.

While I have come to like QOM and am using it for the CPUState, I don't
see the benefit in using it for these secondary structures. There are
already dedicated monitor commands to inspect them, no?

Andreas
Peter Maydell - March 1, 2012, 8:27 a.m.
On 1 March 2012 08:22, Andreas Färber <afaerber@suse.de> wrote:
> Am 28.02.2012 04:13, schrieb Evgeny Voevodin:
>> On 27.02.2012 16:35, Peter Maydell wrote:
>>> A true multithreaded TCG is a large project, and unless we're
>>> going to commit to doing that I don't see much value in making
>>> some variables per-thread when we might instead need to do
>>> larger refactorings (properly encapsulating the codegen
>>> caches as qom objects, maybe?).
>>
>> [...] qomification of translation caches is an interesting suggestion I
>> think.
>
> While I have come to like QOM and am using it for the CPUState, I don't
> see the benefit in using it for these secondary structures. There are
> already dedicated monitor commands to inspect them, no?

Mostly I was thinking about the encapsulation of knowing which data
structures are associated with a translation cache and letting you
have more than one of them. You could do that with a plain struct
but since we have this OO infrastructure now why not use it?

-- PMM
Evgeny Voevodin - March 1, 2012, 10:57 a.m.
On 01.03.2012 12:27, Peter Maydell wrote:
> On 1 March 2012 08:22, Andreas Färber<afaerber@suse.de>  wrote:
>> Am 28.02.2012 04:13, schrieb Evgeny Voevodin:
>>> On 27.02.2012 16:35, Peter Maydell wrote:
>>>> A true multithreaded TCG is a large project, and unless we're
>>>> going to commit to doing that I don't see much value in making
>>>> some variables per-thread when we might instead need to do
>>>> larger refactorings (properly encapsulating the codegen
>>>> caches as qom objects, maybe?).
>>> [...] qomification of translation caches is an interesting suggestion I
>>> think.
>> While I have come to like QOM and am using it for the CPUState, I don't
>> see the benefit in using it for these secondary structures. There are
>> already dedicated monitor commands to inspect them, no?
> Mostly I was thinking about the encapsulation of knowing which data
> structures are associated with a translation cache and letting you
> have more than one of them. You could do that with a plain struct
> but since we have this OO infrastructure now why not use it?
>
> -- PMM
>

Actually, I didn't dive deep enough into QOM and can't see any benefits
or disadvantages in such encapsulation. As it stands to me now, QOM is
mostly an interface, but the internal things are still structs :) And if
we implement an appropriate model for multithreading TCG, I believe it
could easily be wrapped with QOM if needed.
Also, there are at least two approaches for the cache: unified for all
VCPUs, and exclusive to each VCPU.
The first is better when a lot of different threads run in the target,
since each cache holds unique code and thread communication is small.
The second is better when a lot of identical threads are running, since
no excessive translation of identical code is done by each VCPU thread,
but communication between threads on accessing the cache is high.
So, what I'm saying is that if we use a unified cache, we may not need
to have more than one cache instance.
Evgeny Voevodin - March 2, 2012, 6:08 a.m.
On 01.03.2012 11:51, 陳韋任 wrote:
>> If you're serious about multithreading TCG then I think the first
>> steps are:
>>   * fix existing race conditions
>>   * think very hard
>>   * come up with an overall design for what you're proposing
>
>    As COREMU [1] point out, current QEMU atomic instruction emulation approach is
> problematic. For example, guest application might use x86 xchg instruction to
> implement spin lock/unlock (addr is a shared memory space).
>
>
>        spin_unlock:                   spin_lock:
>
>                                       try:
>                                         r10 = 1;
>                                         xchg addr, r10;
>                                         if (r10 == 0)
>                                           goto success;
>        *addr = 0;                     fail:
>                                         pause;
>                                         if (*addr != 0)
>                                           goto fail;
>
>                                         goto try;
>
>                                       success:
>
>
> After QEMU translation, guest xchg instruction becomes
>
>        spin_unlock:                   spin_lock:
>
>                                       helper_lock;
>
>        *addr = 0;                     T0 = r10;
>                                       T1 = *addr;
>                                       *addr = T0;
>                                       r10 = T1;
>
>                                       helper_unlock;
>
>    You can the see the atomicity on which spin lock/unlock rely is broken.
> "*addr = 0" can happened in the between of helper_lock/helper_unlock.
> COREMU solve this by using a lightway software transaction memory to emulate
> atomic instructions. I think this issue is quite important if we want to make
> TCG multithreaded, right? Is there a better way to solve this?
>
> Regards,
> chenwj
>
> [1]
> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.187.6011&rep=rep1&type=pdf
>

In the COREMU implementation they rely on the host architecture
supporting single-word CAS instructions. If such support is present, we
can use a CASN algorithm when we need a multiple-word CAS. So this
approach limits the supported host architectures. The general question:
is there some host which QEMU can run on and which doesn't support CAS?

Patch

diff --git a/bsd-user/main.c b/bsd-user/main.c
index cc7d4a3..11e4540 100644
--- a/bsd-user/main.c
+++ b/bsd-user/main.c
@@ -906,6 +906,7 @@  int main(int argc, char **argv)
 #endif
     }
     tcg_exec_init(0);
+    tcg_gen_buffer_init();
     cpu_exec_init_all();
     /* NOTE: we need to init the CPU at this stage to get
        qemu_host_page_size */
diff --git a/cpus.c b/cpus.c
index f45a438..6190250 100644
--- a/cpus.c
+++ b/cpus.c
@@ -746,6 +746,8 @@  static void *qemu_tcg_cpu_thread_fn(void *arg)
 {
     CPUState *env = arg;
 
+    tcg_gen_buffer_init();
+
     qemu_tcg_init_cpu_signals();
     qemu_thread_get_self(env->thread);
 
diff --git a/darwin-user/main.c b/darwin-user/main.c
index 9b57c20..8618a52 100644
--- a/darwin-user/main.c
+++ b/darwin-user/main.c
@@ -851,6 +851,7 @@  int main(int argc, char **argv)
 #endif
     }
     tcg_exec_init(0);
+    tcg_gen_buffer_init();
     cpu_exec_init_all();
     /* NOTE: we need to init the CPU at this stage to get
        qemu_host_page_size */
diff --git a/exec.c b/exec.c
index b81677a..cf673a5 100644
--- a/exec.c
+++ b/exec.c
@@ -79,10 +79,10 @@ 
 
 #define SMC_BITMAP_USE_THRESHOLD 10
 
-static TranslationBlock *tbs;
+static DEFINE_TLS(TranslationBlock*, tbs);
 static int code_gen_max_blocks;
 TranslationBlock *tb_phys_hash[CODE_GEN_PHYS_HASH_SIZE];
-static int nb_tbs;
+static DEFINE_TLS(int, nb_tbs);
 /* any access to the tbs or the page table must use this lock */
 spinlock_t tb_lock = SPIN_LOCK_UNLOCKED;
 
@@ -103,11 +103,12 @@  spinlock_t tb_lock = SPIN_LOCK_UNLOCKED;
 #endif
 
 uint8_t code_gen_prologue[1024] code_gen_section;
-static uint8_t *code_gen_buffer;
+static bool code_gen_enabled;
+static DEFINE_TLS(uint8_t*, code_gen_buffer);
 static unsigned long code_gen_buffer_size;
 /* threshold to flush the translated code buffer */
 static unsigned long code_gen_buffer_max_size;
-static uint8_t *code_gen_ptr;
+static DEFINE_TLS(uint8_t*, code_gen_ptr);
 
 #if !defined(CONFIG_USER_ONLY)
 int phys_ram_fd;
@@ -469,18 +470,17 @@  static void tlb_unprotect_code_phys(CPUState *env, ram_addr_t ram_addr,
 #endif
 
 #ifdef USE_STATIC_CODE_GEN_BUFFER
-static uint8_t static_code_gen_buffer[DEFAULT_CODE_GEN_BUFFER_SIZE]
-               __attribute__((aligned (CODE_GEN_ALIGN)));
+static DEFINE_TLS(uint8_t [DEFAULT_CODE_GEN_BUFFER_SIZE],
+        static_code_gen_buffer) __attribute__((aligned(CODE_GEN_ALIGN)));
 #endif
 
-static void code_gen_alloc(unsigned long tb_size)
+static void code_gen_alloc(void)
 {
 #ifdef USE_STATIC_CODE_GEN_BUFFER
-    code_gen_buffer = static_code_gen_buffer;
+    tls_var(code_gen_buffer) = tls_var(static_code_gen_buffer);
     code_gen_buffer_size = DEFAULT_CODE_GEN_BUFFER_SIZE;
-    map_exec(code_gen_buffer, code_gen_buffer_size);
+    map_exec(tls_var(code_gen_buffer), code_gen_buffer_size);
 #else
-    code_gen_buffer_size = tb_size;
     if (code_gen_buffer_size == 0) {
 #if defined(CONFIG_USER_ONLY)
         code_gen_buffer_size = DEFAULT_CODE_GEN_BUFFER_SIZE;
@@ -522,10 +522,10 @@  static void code_gen_alloc(unsigned long tb_size)
         }
         start = (void *)0x90000000UL;
 #endif
-        code_gen_buffer = mmap(start, code_gen_buffer_size,
+        tls_var(code_gen_buffer) = mmap(start, code_gen_buffer_size,
                                PROT_WRITE | PROT_READ | PROT_EXEC,
                                flags, -1, 0);
-        if (code_gen_buffer == MAP_FAILED) {
+        if (tls_var(code_gen_buffer) == MAP_FAILED) {
             fprintf(stderr, "Could not allocate dynamic translator buffer\n");
             exit(1);
         }
@@ -553,24 +553,30 @@  static void code_gen_alloc(unsigned long tb_size)
             code_gen_buffer_size = (512 * 1024 * 1024);
         }
 #endif
-        code_gen_buffer = mmap(addr, code_gen_buffer_size,
+        tls_var(code_gen_buffer) = mmap(addr, code_gen_buffer_size,
                                PROT_WRITE | PROT_READ | PROT_EXEC, 
                                flags, -1, 0);
-        if (code_gen_buffer == MAP_FAILED) {
+        if (tls_var(code_gen_buffer) == MAP_FAILED) {
             fprintf(stderr, "Could not allocate dynamic translator buffer\n");
             exit(1);
         }
     }
 #else
-    code_gen_buffer = g_malloc(code_gen_buffer_size);
-    map_exec(code_gen_buffer, code_gen_buffer_size);
+    tls_var(code_gen_buffer) = g_malloc(code_gen_buffer_size);
+    map_exec(tls_var(code_gen_buffer), code_gen_buffer_size);
 #endif
 #endif /* !USE_STATIC_CODE_GEN_BUFFER */
     map_exec(code_gen_prologue, sizeof(code_gen_prologue));
     code_gen_buffer_max_size = code_gen_buffer_size -
         (TCG_MAX_OP_SIZE * OPC_BUF_SIZE);
     code_gen_max_blocks = code_gen_buffer_size / CODE_GEN_AVG_BLOCK_SIZE;
-    tbs = g_malloc(code_gen_max_blocks * sizeof(TranslationBlock));
+    tls_var(tbs) = g_malloc(code_gen_max_blocks * sizeof(TranslationBlock));
+}
+
+void tcg_gen_buffer_init(void)
+{
+    code_gen_alloc();
+    tls_var(code_gen_ptr) = tls_var(code_gen_buffer);
 }
 
 /* Must be called before using the QEMU cpus. 'tb_size' is the size
@@ -579,19 +585,21 @@  static void code_gen_alloc(unsigned long tb_size)
 void tcg_exec_init(unsigned long tb_size)
 {
     cpu_gen_init();
-    code_gen_alloc(tb_size);
-    code_gen_ptr = code_gen_buffer;
+    code_gen_buffer_size = tb_size;
     page_init();
 #if !defined(CONFIG_USER_ONLY) || !defined(CONFIG_USE_GUEST_BASE)
     /* There's no guest base to take into account, so go ahead and
        initialize the prologue now.  */
     tcg_prologue_init(&tcg_ctx);
 #endif
+    /* tcg_enabled() only tells whether TCG is enabled, not whether it
+     * has been initialized. */
+    code_gen_enabled = 1;
 }
 
 bool tcg_enabled(void)
 {
-    return code_gen_buffer != NULL;
+    return code_gen_enabled;
 }
 
 void cpu_exec_init_all(void)
@@ -682,10 +690,13 @@  static TranslationBlock *tb_alloc(target_ulong pc)
 {
     TranslationBlock *tb;
 
-    if (nb_tbs >= code_gen_max_blocks ||
-        (code_gen_ptr - code_gen_buffer) >= code_gen_buffer_max_size)
+    if (tls_var(nb_tbs) >= code_gen_max_blocks ||
+        (tls_var(code_gen_ptr) - tls_var(code_gen_buffer)) >=
+        code_gen_buffer_max_size) {
         return NULL;
-    tb = &tbs[nb_tbs++];
+    }
+
+    tb = &tls_var(tbs)[tls_var(nb_tbs)++];
     tb->pc = pc;
     tb->cflags = 0;
     return tb;
@@ -696,9 +707,9 @@  void tb_free(TranslationBlock *tb)
     /* In practice this is mostly used for single use temporary TB
        Ignore the hard cases and just back up if this TB happens to
        be the last one generated.  */
-    if (nb_tbs > 0 && tb == &tbs[nb_tbs - 1]) {
-        code_gen_ptr = tb->tc_ptr;
-        nb_tbs--;
+    if (tls_var(nb_tbs) > 0 && tb == &tls_var(tbs)[tls_var(nb_tbs) - 1]) {
+        tls_var(code_gen_ptr) = tb->tc_ptr;
+        tls_var(nb_tbs)--;
     }
 }
 
@@ -749,14 +760,17 @@  void tb_flush(CPUState *env1)
     CPUState *env;
 #if defined(DEBUG_FLUSH)
     printf("qemu: flush code_size=%ld nb_tbs=%d avg_tb_size=%ld\n",
-           (unsigned long)(code_gen_ptr - code_gen_buffer),
-           nb_tbs, nb_tbs > 0 ?
-           ((unsigned long)(code_gen_ptr - code_gen_buffer)) / nb_tbs : 0);
+           (unsigned long)(tls_var(code_gen_ptr) - tls_var(code_gen_buffer)),
+           tls_var(nb_tbs), tls_var(nb_tbs) > 0 ?
+           ((unsigned long)(tls_var(code_gen_ptr) - tls_var(code_gen_buffer))) /
+           tls_var(nb_tbs) : 0);
 #endif
-    if ((unsigned long)(code_gen_ptr - code_gen_buffer) > code_gen_buffer_size)
+    if ((unsigned long)(tls_var(code_gen_ptr) - tls_var(code_gen_buffer)) >
+        code_gen_buffer_size) {
         cpu_abort(env1, "Internal error: code buffer overflow\n");
+    }
 
-    nb_tbs = 0;
+    tls_var(nb_tbs) = 0;
 
     for(env = first_cpu; env != NULL; env = env->next_cpu) {
         memset (env->tb_jmp_cache, 0, TB_JMP_CACHE_SIZE * sizeof (void *));
@@ -765,7 +779,7 @@  void tb_flush(CPUState *env1)
     memset (tb_phys_hash, 0, CODE_GEN_PHYS_HASH_SIZE * sizeof (void *));
     page_flush_tb();
 
-    code_gen_ptr = code_gen_buffer;
+    tls_var(code_gen_ptr) = tls_var(code_gen_buffer);
     /* XXX: flush processor icache at this point if cache flush is
        expensive */
     tb_flush_count++;
@@ -1008,13 +1022,14 @@  TranslationBlock *tb_gen_code(CPUState *env,
         /* Don't forget to invalidate previous TB info.  */
         tb_invalidated_flag = 1;
     }
-    tc_ptr = code_gen_ptr;
+    tc_ptr = tls_var(code_gen_ptr);
     tb->tc_ptr = tc_ptr;
     tb->cs_base = cs_base;
     tb->flags = flags;
     tb->cflags = cflags;
     cpu_gen_code(env, tb, &code_gen_size);
-    code_gen_ptr = (void *)(((unsigned long)code_gen_ptr + code_gen_size + CODE_GEN_ALIGN - 1) & ~(CODE_GEN_ALIGN - 1));
+    tls_var(code_gen_ptr) = (void *)(((unsigned long)tls_var(code_gen_ptr) +
+                code_gen_size + CODE_GEN_ALIGN - 1) & ~(CODE_GEN_ALIGN - 1));
 
     /* check next page if needed */
     virt_page2 = (pc + tb->size - 1) & TARGET_PAGE_MASK;
@@ -1330,17 +1345,19 @@  TranslationBlock *tb_find_pc(unsigned long tc_ptr)
     unsigned long v;
     TranslationBlock *tb;
 
-    if (nb_tbs <= 0)
+    if (tls_var(nb_tbs) <= 0) {
         return NULL;
-    if (tc_ptr < (unsigned long)code_gen_buffer ||
-        tc_ptr >= (unsigned long)code_gen_ptr)
+    }
+    if (tc_ptr < (unsigned long)tls_var(code_gen_buffer) ||
+        tc_ptr >= (unsigned long)tls_var(code_gen_ptr)) {
         return NULL;
+    }
     /* binary search (cf Knuth) */
     m_min = 0;
-    m_max = nb_tbs - 1;
+    m_max = tls_var(nb_tbs) - 1;
     while (m_min <= m_max) {
         m = (m_min + m_max) >> 1;
-        tb = &tbs[m];
+        tb = &tls_var(tbs)[m];
         v = (unsigned long)tb->tc_ptr;
         if (v == tc_ptr)
             return tb;
@@ -1350,7 +1367,7 @@  TranslationBlock *tb_find_pc(unsigned long tc_ptr)
             m_min = m + 1;
         }
     }
-    return &tbs[m_max];
+    return &tls_var(tbs)[m_max];
 }
 
 static void tb_reset_jump_recursive(TranslationBlock *tb);
@@ -4332,8 +4349,8 @@  void dump_exec_info(FILE *f, fprintf_function cpu_fprintf)
     cross_page = 0;
     direct_jmp_count = 0;
     direct_jmp2_count = 0;
-    for(i = 0; i < nb_tbs; i++) {
-        tb = &tbs[i];
+    for(i = 0; i < tls_var(nb_tbs); i++) {
+        tb = &tls_var(tbs)[i];
         target_code_size += tb->size;
         if (tb->size > max_target_code_size)
             max_target_code_size = tb->size;
@@ -4349,23 +4366,29 @@  void dump_exec_info(FILE *f, fprintf_function cpu_fprintf)
     /* XXX: avoid using doubles ? */
     cpu_fprintf(f, "Translation buffer state:\n");
     cpu_fprintf(f, "gen code size       %td/%ld\n",
-                code_gen_ptr - code_gen_buffer, code_gen_buffer_max_size);
+                tls_var(code_gen_ptr) - tls_var(code_gen_buffer),
+                code_gen_buffer_max_size);
     cpu_fprintf(f, "TB count            %d/%d\n", 
-                nb_tbs, code_gen_max_blocks);
+                tls_var(nb_tbs), code_gen_max_blocks);
     cpu_fprintf(f, "TB avg target size  %d max=%d bytes\n",
-                nb_tbs ? target_code_size / nb_tbs : 0,
+                tls_var(nb_tbs) ? target_code_size / tls_var(nb_tbs) : 0,
                 max_target_code_size);
     cpu_fprintf(f, "TB avg host size    %td bytes (expansion ratio: %0.1f)\n",
-                nb_tbs ? (code_gen_ptr - code_gen_buffer) / nb_tbs : 0,
-                target_code_size ? (double) (code_gen_ptr - code_gen_buffer) / target_code_size : 0);
+                tls_var(nb_tbs) ?
+                    (tls_var(code_gen_ptr) - tls_var(code_gen_buffer)) /
+                    tls_var(nb_tbs) : 0,
+                target_code_size ? (double) (tls_var(code_gen_ptr) -
+                    tls_var(code_gen_buffer)) / target_code_size : 0);
     cpu_fprintf(f, "cross page TB count %d (%d%%)\n",
             cross_page,
-            nb_tbs ? (cross_page * 100) / nb_tbs : 0);
+            tls_var(nb_tbs) ? (cross_page * 100) / tls_var(nb_tbs) : 0);
     cpu_fprintf(f, "direct jump count   %d (%d%%) (2 jumps=%d %d%%)\n",
                 direct_jmp_count,
-                nb_tbs ? (direct_jmp_count * 100) / nb_tbs : 0,
+                tls_var(nb_tbs) ?
+                    (direct_jmp_count * 100) / tls_var(nb_tbs) : 0,
                 direct_jmp2_count,
-                nb_tbs ? (direct_jmp2_count * 100) / nb_tbs : 0);
+                tls_var(nb_tbs) ?
+                    (direct_jmp2_count * 100) / tls_var(nb_tbs) : 0);
     cpu_fprintf(f, "\nStatistics:\n");
     cpu_fprintf(f, "TB flush count      %d\n", tb_flush_count);
     cpu_fprintf(f, "TB invalidate count %d\n", tb_phys_invalidate_count);
diff --git a/linux-user/main.c b/linux-user/main.c
index 14bf5f0..483482f 100644
--- a/linux-user/main.c
+++ b/linux-user/main.c
@@ -3364,6 +3364,7 @@  int main(int argc, char **argv, char **envp)
 #endif
     }
     tcg_exec_init(0);
+    tcg_gen_buffer_init();
     cpu_exec_init_all();
     /* NOTE: we need to init the CPU at this stage to get
        qemu_host_page_size */
diff --git a/qemu-common.h b/qemu-common.h
index c5e9cad..13d45e0 100644
--- a/qemu-common.h
+++ b/qemu-common.h
@@ -258,6 +258,7 @@  typedef enum LostTickPolicy {
     LOST_TICK_MAX
 } LostTickPolicy;
 
+void tcg_gen_buffer_init(void);
 void tcg_exec_init(unsigned long tb_size);
 bool tcg_enabled(void);