Message ID | 20221021170112.151393-2-leandro.lupori@eldorado.org.br
---|---
State | New
Series | Performance optimizations for PPC64
On 10/22/22 03:01, Leandro Lupori wrote:
> Profiling QEMU during Fedora 35 for PPC64 boot revealed that a
> considerable amount of time was being spent in
> check_for_breakpoints() (0.61% of total time on PPC64 and 2.19% on
> amd64), even though it was just checking that its queue was empty
> and returning, when no breakpoints were set. It turns out this
> function is not inlined by the compiler and it's always called by
> helper_lookup_tb_ptr(), one of the most called functions.
>
> By moving the check for empty queue to the have_breakpoints()
> macro and calling check_for_breakpoints() only when it returns
> true, it's possible to avoid the call overhead. An improvement of
> about 3% in total time was measured on POWER9.

Wow, 3%?

> Signed-off-by: Leandro Lupori <leandro.lupori@eldorado.org.br>
> ---
>  accel/tcg/cpu-exec.c | 13 +++++++------
>  1 file changed, 7 insertions(+), 6 deletions(-)
>
> diff --git a/accel/tcg/cpu-exec.c b/accel/tcg/cpu-exec.c
> index f9e5cc9ba0..9eec01ad9a 100644
> --- a/accel/tcg/cpu-exec.c
> +++ b/accel/tcg/cpu-exec.c
> @@ -304,16 +304,15 @@ static void log_cpu_exec(target_ulong pc, CPUState *cpu,
>      }
>  }
>
> +#define have_breakpoints(cpu) (likely(QTAILQ_EMPTY(&(cpu)->breakpoints)) ? \
> +                               false : true)

First, always avoid useless ?:.

Second, I think we should rename the existing check_for_breakpoints to
check_for_breakpoints_slow and make this test an inline function
instead. E.g.

    static bool check_for_breakpoints(...)
    {
        if (unlikely(!QTAILQ_EMPTY(&cpu->breakpoints))) {
            return check_for_breakpoints_slow(...);
        }
        return false;
    }

r~
On 10/22/22 08:12, Richard Henderson wrote:
> On 10/22/22 03:01, Leandro Lupori wrote:
>> Profiling QEMU during Fedora 35 for PPC64 boot revealed that a
>> considerable amount of time was being spent in
>> check_for_breakpoints() (0.61% of total time on PPC64 and 2.19% on
>> amd64), even though it was just checking that its queue was empty
>> and returning, when no breakpoints were set. It turns out this
>> function is not inlined by the compiler and it's always called by
>> helper_lookup_tb_ptr(), one of the most called functions.
>>
>> By moving the check for empty queue to the have_breakpoints()
>> macro and calling check_for_breakpoints() only when it returns
>> true, it's possible to avoid the call overhead. An improvement of
>> about 3% in total time was measured on POWER9.
>
> Wow, 3%?

I've compared the averages of 3 runs. To get a more precise number it
would probably be better to take the average or median of 10 runs with
each version.

>> Signed-off-by: Leandro Lupori <leandro.lupori@eldorado.org.br>
>> ---
>>  accel/tcg/cpu-exec.c | 13 +++++++------
>>  1 file changed, 7 insertions(+), 6 deletions(-)
>>
>> diff --git a/accel/tcg/cpu-exec.c b/accel/tcg/cpu-exec.c
>> index f9e5cc9ba0..9eec01ad9a 100644
>> --- a/accel/tcg/cpu-exec.c
>> +++ b/accel/tcg/cpu-exec.c
>> @@ -304,16 +304,15 @@ static void log_cpu_exec(target_ulong pc, CPUState *cpu,
>>      }
>>  }
>>
>> +#define have_breakpoints(cpu) (likely(QTAILQ_EMPTY(&(cpu)->breakpoints)) ? \
>> +                               false : true)
>
> First, always avoid useless ?:.
>
> Second, I think we should rename the existing check_for_breakpoints to
> check_for_breakpoints_slow and make this test an inline function
> instead. E.g.
>
>     static bool check_for_breakpoints(...)
>     {
>         if (unlikely(!QTAILQ_EMPTY(&cpu->breakpoints))) {
>             return check_for_breakpoints_slow(...);
>         }
>         return false;
>     }

Ok. I had left the ?: to take advantage of likely() in the macro.

Thanks,
Leandro

> r~
On 10/21/22 19:01, Leandro Lupori wrote:
> Profiling QEMU during Fedora 35 for PPC64 boot revealed that a
> considerable amount of time was being spent in
> check_for_breakpoints() (0.61% of total time on PPC64 and 2.19% on
> amd64), even though it was just checking that its queue was empty
> and returning, when no breakpoints were set. It turns out this
> function is not inlined by the compiler and it's always called by
> helper_lookup_tb_ptr(), one of the most called functions.
>
> By moving the check for empty queue to the have_breakpoints()
> macro and calling check_for_breakpoints() only when it returns
> true, it's possible to avoid the call overhead. An improvement of
> about 3% in total time was measured on POWER9.
>
> Signed-off-by: Leandro Lupori <leandro.lupori@eldorado.org.br>
> ---
>  accel/tcg/cpu-exec.c | 13 +++++++------
>  1 file changed, 7 insertions(+), 6 deletions(-)
>
> diff --git a/accel/tcg/cpu-exec.c b/accel/tcg/cpu-exec.c
> index f9e5cc9ba0..9eec01ad9a 100644
> --- a/accel/tcg/cpu-exec.c
> +++ b/accel/tcg/cpu-exec.c
> @@ -304,16 +304,15 @@ static void log_cpu_exec(target_ulong pc, CPUState *cpu,
>      }
>  }
>
> +#define have_breakpoints(cpu) (likely(QTAILQ_EMPTY(&(cpu)->breakpoints)) ? \
> +                               false : true)
> +
>  static bool check_for_breakpoints(CPUState *cpu, target_ulong pc,
>                                    uint32_t *cflags)
>  {
>      CPUBreakpoint *bp;
>      bool match_page = false;
>
> -    if (likely(QTAILQ_EMPTY(&cpu->breakpoints))) {
> -        return false;
> -    }
> -

It's a little more readable to just split out the slow path:

-static inline bool check_for_breakpoints(CPUState *cpu, target_ulong pc,
-                                         uint32_t *cflags)
+static bool check_for_breakpoints_slow(CPUState *cpu, target_ulong pc,
+                                       uint32_t *cflags)
 {
     CPUBreakpoint *bp;
     bool match_page = false;

-    if (likely(QTAILQ_EMPTY(&cpu->breakpoints))) {
-        return false;
-    }
     ...
 }
+
+static inline bool check_for_breakpoints(CPUState *cpu, target_ulong pc,
+                                         uint32_t *cflags)
+{
+    return unlikely(!QTAILQ_EMPTY(&cpu->breakpoints))
+        && check_for_breakpoints_slow(cpu, pc, cflags);
+}

Paolo
diff --git a/accel/tcg/cpu-exec.c b/accel/tcg/cpu-exec.c
index f9e5cc9ba0..9eec01ad9a 100644
--- a/accel/tcg/cpu-exec.c
+++ b/accel/tcg/cpu-exec.c
@@ -304,16 +304,15 @@ static void log_cpu_exec(target_ulong pc, CPUState *cpu,
     }
 }
 
+#define have_breakpoints(cpu) (likely(QTAILQ_EMPTY(&(cpu)->breakpoints)) ? \
+                               false : true)
+
 static bool check_for_breakpoints(CPUState *cpu, target_ulong pc,
                                   uint32_t *cflags)
 {
     CPUBreakpoint *bp;
     bool match_page = false;
 
-    if (likely(QTAILQ_EMPTY(&cpu->breakpoints))) {
-        return false;
-    }
-
     /*
      * Singlestep overrides breakpoints.
      * This requirement is visible in the record-replay tests, where
@@ -392,7 +391,8 @@ const void *HELPER(lookup_tb_ptr)(CPUArchState *env)
 
     cpu_get_tb_cpu_state(env, &pc, &cs_base, &flags);
     cflags = curr_cflags(cpu);
-    if (check_for_breakpoints(cpu, pc, &cflags)) {
+
+    if (have_breakpoints(cpu) && check_for_breakpoints(cpu, pc, &cflags)) {
         cpu_loop_exit(cpu);
     }
 
@@ -990,7 +990,8 @@ int cpu_exec(CPUState *cpu)
                     cpu->cflags_next_tb = -1;
                 }
 
-                if (check_for_breakpoints(cpu, pc, &cflags)) {
+                if (have_breakpoints(cpu) &&
+                    check_for_breakpoints(cpu, pc, &cflags)) {
                     break;
                 }
Profiling QEMU during Fedora 35 for PPC64 boot revealed that a
considerable amount of time was being spent in
check_for_breakpoints() (0.61% of total time on PPC64 and 2.19% on
amd64), even though it was just checking that its queue was empty
and returning, when no breakpoints were set. It turns out this
function is not inlined by the compiler and it's always called by
helper_lookup_tb_ptr(), one of the most called functions.

By moving the check for empty queue to the have_breakpoints()
macro and calling check_for_breakpoints() only when it returns
true, it's possible to avoid the call overhead. An improvement of
about 3% in total time was measured on POWER9.

Signed-off-by: Leandro Lupori <leandro.lupori@eldorado.org.br>
---
 accel/tcg/cpu-exec.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)