Message ID: 1496702979-26132-4-git-send-email-cota@braap.org
State: New
On Mon, Jun 5, 2017 at 6:49 PM, Emilio G. Cota <cota@braap.org> wrote:
> Allocating an arbitrarily-sized array of tbs results in either
> (a) a lot of memory wasted or (b) unnecessary flushes of the code
> cache when we run out of TB structs in the array.
>
> An obvious solution would be to just malloc a TB struct when needed,
> and keep the TB array as an array of pointers (recall that tb_find_pc()
> needs the TB array to run in O(log n)).
>
> Perhaps a better solution, which is implemented in this patch, is to
> allocate TB's right before the translated code they describe. This
> results in some memory waste due to padding to have code and TBs in
> separate cache lines--for instance, I measured 4.7% of padding in the
> used portion of code_gen_buffer when booting aarch64 Linux on a
> host with 64-byte cache lines. However, it can allow for optimizations
> in some host architectures, since TCG backends could safely assume that
> the TB and the corresponding translated code are very close to each
> other in memory. See this message by rth for a detailed explanation:
>
> https://lists.gnu.org/archive/html/qemu-devel/2017-03/msg05172.html
> Subject: Re: GSoC 2017 Proposal: TCG performance enhancements
> Message-ID: <1e67644b-4b30-887e-d329-1848e94c9484@twiddle.net>

Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>

Thanks for doing this Emilio. Do you plan to continue working on rth's
suggestions in that email? If so, can we co-ordinate our work?
On 06/05/2017 03:49 PM, Emilio G. Cota wrote:
> +TranslationBlock *tcg_tb_alloc(TCGContext *s)
> +{
> +    void *aligned;
> +
> +    aligned = (void *)ROUND_UP((uintptr_t)s->code_gen_ptr, QEMU_CACHELINE_SIZE);
> +    if (unlikely(aligned + sizeof(TranslationBlock) > s->code_gen_highwater)) {
> +        return NULL;
> +    }
> +    s->code_gen_ptr += aligned - s->code_gen_ptr + sizeof(TranslationBlock);
> +    return aligned;

We don't really need the 2/3 patch. We don't gain anything by telling the
compiler that the structure is more aligned than it needs to be.

We can query the line size at runtime, as suggested by Pranith, and use
that for the alignment here. Which means that the binary isn't tied to a
particular cpu implementation, which is clearly preferable for
distributions.

r~
On Tue, Jun 06, 2017 at 01:24:11 -0700, Richard Henderson wrote:
> On 06/05/2017 03:49 PM, Emilio G. Cota wrote:
> > +TranslationBlock *tcg_tb_alloc(TCGContext *s)
> > +{
> > +    void *aligned;
> > +
> > +    aligned = (void *)ROUND_UP((uintptr_t)s->code_gen_ptr, QEMU_CACHELINE_SIZE);
> > +    if (unlikely(aligned + sizeof(TranslationBlock) > s->code_gen_highwater)) {
> > +        return NULL;
> > +    }
> > +    s->code_gen_ptr += aligned - s->code_gen_ptr + sizeof(TranslationBlock);
> > +    return aligned;
>
> We don't really need the 2/3 patch. We don't gain anything by telling the
> compiler that the structure is more aligned than it needs to be.

The compile-time requirement is for the compiler to pad the structs
appropriately; this is critical to avoid false sharing when allocating
arrays of structs like those test programs do.

> We can query the line size at runtime, as suggested by Pranith, and use
> that for the alignment here. Which means that the binary isn't tied to a
> particular cpu implementation, which is clearly preferable for
> distributions.

For this particular case we can get away without padding the structs if
we're OK with having the end of a TB struct immediately followed by its
translated code, instead of having that code on the following cache line.

		E.
On 06/06/2017 09:25 AM, Emilio G. Cota wrote:
> For this particular case we can get away without padding the structs if
> we're OK with having the end of a TB struct immediately followed by its
> translated code, instead of having that code on the following cache
> line.

Uh, no, if you can manually pad before the struct, you can manually pad
after the struct too.

r~
On Tue, Jun 06, 2017 at 01:36:50 -0400, Pranith Kumar wrote:
> Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
>
> Thanks for doing this Emilio. Do you plan to continue working on rth's
> suggestions in that email? If so, can we co-ordinate our work?

My plan is to work on instrumentation. This was just low-hanging fruit;
I was curious to see the impact on cache miss rates of bringing the TBs
close to the corresponding translated code. Turns out it's pretty small,
or my L1 caches are too big :-)

The memory savings are significant, though, with the added benefit that
this can enable more efficient translated code, as Richard pointed out.

I've just left a message on the GSoC thread with ideas.

		Emilio
On Tue, Jun 06, 2017 at 10:02:17 -0700, Richard Henderson wrote:
> On 06/06/2017 09:25 AM, Emilio G. Cota wrote:
> > For this particular case we can get away without padding the structs if
> > we're OK with having the end of a TB struct immediately followed by its
> > translated code, instead of having that code on the following cache
> > line.
>
> Uh, no, if you can manually pad before the struct, you can manually pad
> after the struct too.

Yes of course =)

I'll respin the series to do this.

		E.
diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index 87ae10b..00c0f43 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -363,7 +363,7 @@ struct TranslationBlock {
      */
     uintptr_t jmp_list_next[2];
     uintptr_t jmp_list_first;
-};
+} QEMU_ALIGNED(QEMU_CACHELINE_SIZE);
 
 void tb_free(TranslationBlock *tb);
 void tb_flush(CPUState *cpu);
diff --git a/include/exec/tb-context.h b/include/exec/tb-context.h
index c7f17f2..25c2afe 100644
--- a/include/exec/tb-context.h
+++ b/include/exec/tb-context.h
@@ -31,8 +31,9 @@ typedef struct TBContext TBContext;
 
 struct TBContext {
 
-    TranslationBlock *tbs;
+    TranslationBlock **tbs;
     struct qht htable;
+    size_t tbs_size;
     int nb_tbs;
     /* any access to the tbs or the page table must use this lock */
     QemuMutex tb_lock;
diff --git a/tcg/tcg.c b/tcg/tcg.c
index 564292f..f657c51 100644
--- a/tcg/tcg.c
+++ b/tcg/tcg.c
@@ -383,6 +383,22 @@ void tcg_context_init(TCGContext *s)
     }
 }
 
+/*
+ * Allocate TBs right before their corresponding translated code, making
+ * sure that TBs and code are on different cache lines.
+ */
+TranslationBlock *tcg_tb_alloc(TCGContext *s)
+{
+    void *aligned;
+
+    aligned = (void *)ROUND_UP((uintptr_t)s->code_gen_ptr, QEMU_CACHELINE_SIZE);
+    if (unlikely(aligned + sizeof(TranslationBlock) > s->code_gen_highwater)) {
+        return NULL;
+    }
+    s->code_gen_ptr += aligned - s->code_gen_ptr + sizeof(TranslationBlock);
+    return aligned;
+}
+
 void tcg_prologue_init(TCGContext *s)
 {
     size_t prologue_size, total_size;
diff --git a/tcg/tcg.h b/tcg/tcg.h
index 5ec48d1..9e37722 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -697,7 +697,6 @@ struct TCGContext {
        here, because there's too much arithmetic throughout that relies
        on addition and subtraction working on bytes.  Rely on the GCC
        extension that allows arithmetic on void*.  */
-    int code_gen_max_blocks;
     void *code_gen_prologue;
     void *code_gen_epilogue;
     void *code_gen_buffer;
@@ -756,6 +755,7 @@ static inline bool tcg_op_buf_full(void)
 /* tb_lock must be held for tcg_malloc_internal. */
 void *tcg_malloc_internal(TCGContext *s, int size);
 void tcg_pool_reset(TCGContext *s);
+TranslationBlock *tcg_tb_alloc(TCGContext *s);
 
 void tb_lock(void);
 void tb_unlock(void);
diff --git a/translate-all.c b/translate-all.c
index b3ee876..0eb9d13 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -781,12 +781,13 @@ static inline void code_gen_alloc(size_t tb_size)
         exit(1);
     }
 
-    /* Estimate a good size for the number of TBs we can support.  We
-       still haven't deducted the prologue from the buffer size here,
-       but that's minimal and won't affect the estimate much.  */
-    tcg_ctx.code_gen_max_blocks
-        = tcg_ctx.code_gen_buffer_size / CODE_GEN_AVG_BLOCK_SIZE;
-    tcg_ctx.tb_ctx.tbs = g_new(TranslationBlock, tcg_ctx.code_gen_max_blocks);
+    /* size this conservatively -- realloc later if needed */
+    tcg_ctx.tb_ctx.tbs_size =
+        tcg_ctx.code_gen_buffer_size / CODE_GEN_AVG_BLOCK_SIZE / 8;
+    if (unlikely(!tcg_ctx.tb_ctx.tbs_size)) {
+        tcg_ctx.tb_ctx.tbs_size = 64 * 1024;
+    }
+    tcg_ctx.tb_ctx.tbs = g_new(TranslationBlock *, tcg_ctx.tb_ctx.tbs_size);
 
     qemu_mutex_init(&tcg_ctx.tb_ctx.tb_lock);
 }
@@ -828,13 +829,20 @@ bool tcg_enabled(void)
 static TranslationBlock *tb_alloc(target_ulong pc)
 {
     TranslationBlock *tb;
+    TBContext *ctx;
 
     assert_tb_locked();
 
-    if (tcg_ctx.tb_ctx.nb_tbs >= tcg_ctx.code_gen_max_blocks) {
+    tb = tcg_tb_alloc(&tcg_ctx);
+    if (unlikely(tb == NULL)) {
         return NULL;
     }
-    tb = &tcg_ctx.tb_ctx.tbs[tcg_ctx.tb_ctx.nb_tbs++];
+    ctx = &tcg_ctx.tb_ctx;
+    if (unlikely(ctx->nb_tbs == ctx->tbs_size)) {
+        ctx->tbs_size *= 2;
+        ctx->tbs = g_renew(TranslationBlock *, ctx->tbs, ctx->tbs_size);
+    }
+    ctx->tbs[ctx->nb_tbs++] = tb;
     tb->pc = pc;
     tb->cflags = 0;
     tb->invalid = false;
@@ -850,8 +858,8 @@ void tb_free(TranslationBlock *tb)
        Ignore the hard cases and just back up if this TB happens to
        be the last one generated.  */
     if (tcg_ctx.tb_ctx.nb_tbs > 0 &&
-        tb == &tcg_ctx.tb_ctx.tbs[tcg_ctx.tb_ctx.nb_tbs - 1]) {
-        tcg_ctx.code_gen_ptr = tb->tc_ptr;
+        tb == tcg_ctx.tb_ctx.tbs[tcg_ctx.tb_ctx.nb_tbs - 1]) {
+        tcg_ctx.code_gen_ptr = tb->tc_ptr - sizeof(TranslationBlock);
         tcg_ctx.tb_ctx.nb_tbs--;
     }
 }
@@ -1666,7 +1674,7 @@ static TranslationBlock *tb_find_pc(uintptr_t tc_ptr)
     m_max = tcg_ctx.tb_ctx.nb_tbs - 1;
     while (m_min <= m_max) {
         m = (m_min + m_max) >> 1;
-        tb = &tcg_ctx.tb_ctx.tbs[m];
+        tb = tcg_ctx.tb_ctx.tbs[m];
         v = (uintptr_t)tb->tc_ptr;
         if (v == tc_ptr) {
             return tb;
@@ -1676,7 +1684,7 @@ static TranslationBlock *tb_find_pc(uintptr_t tc_ptr)
             m_min = m + 1;
         }
     }
-    return &tcg_ctx.tb_ctx.tbs[m_max];
+    return tcg_ctx.tb_ctx.tbs[m_max];
 }
 
 #if !defined(CONFIG_USER_ONLY)
@@ -1874,7 +1882,7 @@ void dump_exec_info(FILE *f, fprintf_function cpu_fprintf)
     direct_jmp_count = 0;
     direct_jmp2_count = 0;
     for (i = 0; i < tcg_ctx.tb_ctx.nb_tbs; i++) {
-        tb = &tcg_ctx.tb_ctx.tbs[i];
+        tb = tcg_ctx.tb_ctx.tbs[i];
         target_code_size += tb->size;
         if (tb->size > max_target_code_size) {
             max_target_code_size = tb->size;
@@ -1894,8 +1902,7 @@ void dump_exec_info(FILE *f, fprintf_function cpu_fprintf)
     cpu_fprintf(f, "gen code size       %td/%zd\n",
                 tcg_ctx.code_gen_ptr - tcg_ctx.code_gen_buffer,
                 tcg_ctx.code_gen_highwater - tcg_ctx.code_gen_buffer);
-    cpu_fprintf(f, "TB count            %d/%d\n",
-                tcg_ctx.tb_ctx.nb_tbs, tcg_ctx.code_gen_max_blocks);
+    cpu_fprintf(f, "TB count            %d\n", tcg_ctx.tb_ctx.nb_tbs);
     cpu_fprintf(f, "TB avg target size  %d max=%d bytes\n",
                 tcg_ctx.tb_ctx.nb_tbs ? target_code_size /
                 tcg_ctx.tb_ctx.nb_tbs : 0,
Allocating an arbitrarily-sized array of tbs results in either
(a) a lot of memory wasted or (b) unnecessary flushes of the code
cache when we run out of TB structs in the array.

An obvious solution would be to just malloc a TB struct when needed,
and keep the TB array as an array of pointers (recall that tb_find_pc()
needs the TB array to run in O(log n)).

Perhaps a better solution, which is implemented in this patch, is to
allocate TB's right before the translated code they describe. This
results in some memory waste due to padding to have code and TBs in
separate cache lines--for instance, I measured 4.7% of padding in the
used portion of code_gen_buffer when booting aarch64 Linux on a
host with 64-byte cache lines. However, it can allow for optimizations
in some host architectures, since TCG backends could safely assume that
the TB and the corresponding translated code are very close to each
other in memory. See this message by rth for a detailed explanation:

https://lists.gnu.org/archive/html/qemu-devel/2017-03/msg05172.html
Subject: Re: GSoC 2017 Proposal: TCG performance enhancements
Message-ID: <1e67644b-4b30-887e-d329-1848e94c9484@twiddle.net>

Suggested-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/exec/exec-all.h   |  2 +-
 include/exec/tb-context.h |  3 ++-
 tcg/tcg.c                 | 16 ++++++++++++++++
 tcg/tcg.h                 |  2 +-
 translate-all.c           | 37 ++++++++++++++++++++++---------------
 5 files changed, 42 insertions(+), 18 deletions(-)