diff mbox series

Do not enable QOM debugging by default

Message ID 20180716195851.0FE177456B7@zero.eik.bme.hu
State New
Headers show
Series Do not enable QOM debugging by default | expand

Commit Message

BALATON Zoltan July 16, 2018, 7:47 p.m. UTC
Commit 3556c233d 5 years ago added an option to disable QOM debugging,
noting that it can have high performance cost but left the default to on.
Change the default to off and only enable it when debugging is requested
to avoid the performance penalty for those compiling without debug options.

Signed-off-by: BALATON Zoltan <balaton@eik.bme.hu>
---
 configure | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Comments

Peter Maydell July 16, 2018, 8:41 p.m. UTC | #1
On 16 July 2018 at 20:47, BALATON Zoltan <balaton@eik.bme.hu> wrote:
> Commit 3556c233d 5 years ago added an option to disable QOM debugging,
> noting that it can have high performance cost but left the default to on.
> Change the default to off and only enable it when debugging is requested
> to avoid the performance penalty for those compiling without debug options.
>
> Signed-off-by: BALATON Zoltan <balaton@eik.bme.hu>

Looking at the code, doesn't this disable all the type checking
on the QOM casts? That is not something I think we should drop,
in the same way that we do not disable assert()s even in the
non-debug build.

Is this coming up as significant in profiling? In the past we've
done some special casing if there are particular QOM casts in
hot paths that use significant runtime.

thanks
-- PMM
BALATON Zoltan July 16, 2018, 9:08 p.m. UTC | #2
On Mon, 16 Jul 2018, Peter Maydell wrote:
> On 16 July 2018 at 20:47, BALATON Zoltan <balaton@eik.bme.hu> wrote:
>> Commit 3556c233d 5 years ago added an option to disable QOM debugging,
>> noting that it can have high performance cost but left the default to on.
>> Change the default to off and only enable it when debugging is requested
>> to avoid the performance penalty for those compiling without debug options.
>>
>> Signed-off-by: BALATON Zoltan <balaton@eik.bme.hu>
>
> Looking at the code, doesn't this disable all the type checking
> on the QOM casts? That is not something I think we should drop,
> in the same way that we do not disable assert()s even in the
> non-debug build.
>
> Is this coming up as significant in profiling? In the past we've
> done some special casing if there are particular QOM casts in
> hot paths that use significant runtime.

My profiling results are not very accurate (could only gather sample based 
results with oprofile as --enable-gprof haven't worked for me) so I don't 
have numbers but I did see object_class_dynamic_cast_assert coming out 
among the top few without disabling this option but I don't know which 
call sites are responsible.

Are there any recommendations for profiling QEMU (especially TCG) 
somewhere that I should be aware of?

Regards,
BALATON Zoltan
Daniel P. Berrangé July 17, 2018, 9:40 a.m. UTC | #3
On Mon, Jul 16, 2018 at 09:41:12PM +0100, Peter Maydell wrote:
> On 16 July 2018 at 20:47, BALATON Zoltan <balaton@eik.bme.hu> wrote:
> > Commit 3556c233d 5 years ago added an option to disable QOM debugging,
> > noting that it can have high performance cost but left the default to on.
> > Change the default to off and only enable it when debugging is requested
> > to avoid the performance penalty for those compiling without debug options.
> >
> > Signed-off-by: BALATON Zoltan <balaton@eik.bme.hu>
> 
> Looking at the code, doesn't this disable all the type checking
> on the QOM casts? That is not something I think we should drop,
> in the same way that we do not disable assert()s even in the
> non-debug build.
> 
> Is this coming up as significant in profiling? In the past we've
> done some special casing if there are particular QOM casts in
> hot paths that use significant runtime.

There might also be scope for doing optimization of the
object_dynamic_cast_assert() implementation too.

Regards,
Daniel
BALATON Zoltan July 17, 2018, 3:09 p.m. UTC | #4
On Mon, 16 Jul 2018, Peter Maydell wrote:
> Is this coming up as significant in profiling? In the past we've

This seems to depend on the workload. From the cases I'm interested in 
AROS and AmigaOS on qemu-system-ppc -M sam460ex does not seem to be 
effected much (object_class_dynamic_cast_assert is not in top 10 with <2%) 
but for MorphOS on mac99 this seems to be significant. This is with 
default configure (--enable-qom-cast-debug):

%       cum. %     linenr info                 symbol name
9.7057   9.7057    exec-all.h:410              helper_lookup_tb_ptr
8.0330  17.7387    object.c:711                object_class_dynamic_cast_assert
6.9411  24.6798    cputlb.c:793                io_readx
6.3219  31.0017    sm501_template.h:62         draw_line16_32
5.3601  36.3617    cputlb.c:114                tlb_flush_nocheck
3.6170  39.9787    translate-all.c:749         page_trylock_add
3.1188  43.0975    translate-all.c:803         page_collection_lock
3.0405  46.1380    exec.c:3025                 iotlb_to_section
2.7044  48.8424    softmmu_template.h:112      helper_ret_ldub_mmu
2.4154  51.2578    memory.c:1350               memory_region_access_valid

and improves a bit (but not much) with --disable-qom-cast-debug

%        cum. %     linenr info                 symbol name
10.2063  10.2063    exec-all.h:410              helper_lookup_tb_ptr
  7.1581  17.3644    object.c:711                object_class_dynamic_cast_assert
  5.9297  23.2941    sm501_template.h:62         draw_line16_32
  5.9227  29.2168    cputlb.c:793                io_readx
  5.3030  34.5198    cputlb.c:114                tlb_flush_nocheck
  3.6445  38.1643    memory.c:1350               memory_region_access_valid
  3.5499  41.7142    softmmu_template.h:112      helper_ret_ldub_mmu
  3.0383  44.7525    translate-all.c:803         page_collection_lock
  2.9735  47.7260    memory.c:1415               memory_region_dispatch_read
  2.9503  50.6763    translate-all.c:749         page_trylock_add

But the workloads may not have been 100% identical so this is not 
conclusive, maybe this debug code is not that expensive at the moment.

AROS on sam460ex has a different profile:

%        cum. %     linenr info                 symbol name
8.9905   8.9905     translate-all.c:749         page_trylock_add
8.7658  17.7563     exec-all.h:410              helper_lookup_tb_ptr
7.7349  25.4911     translate-all.c:803         page_collection_lock
5.8246  31.3158     cputlb.c:924                victim_tlb_hit
3.1640  34.4797     cpus.c:347                  cpu_get_clock
3.1538  37.6335     translate-all.c:788         tb_page_addr_cmp
2.7969  40.4304     exec.c:435                  address_space_translate_internal
2.6647  43.0951     memory.c:571                access_with_adjusted_size
2.0615  45.1567     exec.c:569                  flatview_do_translate
1.9586  47.1153     memory.c:1350               memory_region_access_valid

Would anyone be able to guess what are the places that should be looked at 
or what to check to get more info on this?

Regards,
BALATON Zoltan
Mark Cave-Ayland July 17, 2018, 5:09 p.m. UTC | #5
On 17/07/18 16:09, BALATON Zoltan wrote:

> On Mon, 16 Jul 2018, Peter Maydell wrote:
>> Is this coming up as significant in profiling? In the past we've
> 
> This seems to depend on the workload. From the cases I'm interested in 
> AROS and AmigaOS on qemu-system-ppc -M sam460ex does not seem to be 
> effected much (object_class_dynamic_cast_assert is not in top 10 with 
> <2%) but for MorphOS on mac99 this seems to be significant. This is with 
> default configure (--enable-qom-cast-debug):
> 
> %       cum. %     linenr info                 symbol name
> 9.7057   9.7057    exec-all.h:410              helper_lookup_tb_ptr
> 8.0330  17.7387    object.c:711                
> object_class_dynamic_cast_assert
> 6.9411  24.6798    cputlb.c:793                io_readx
> 6.3219  31.0017    sm501_template.h:62         draw_line16_32
> 5.3601  36.3617    cputlb.c:114                tlb_flush_nocheck
> 3.6170  39.9787    translate-all.c:749         page_trylock_add
> 3.1188  43.0975    translate-all.c:803         page_collection_lock
> 3.0405  46.1380    exec.c:3025                 iotlb_to_section
> 2.7044  48.8424    softmmu_template.h:112      helper_ret_ldub_mmu
> 2.4154  51.2578    memory.c:1350               memory_region_access_valid
> 
> and improves a bit (but not much) with --disable-qom-cast-debug
> 
> %        cum. %     linenr info                 symbol name
> 10.2063  10.2063    exec-all.h:410              helper_lookup_tb_ptr
>   7.1581  17.3644    object.c:711                
> object_class_dynamic_cast_assert
>   5.9297  23.2941    sm501_template.h:62         draw_line16_32
>   5.9227  29.2168    cputlb.c:793                io_readx
>   5.3030  34.5198    cputlb.c:114                tlb_flush_nocheck
>   3.6445  38.1643    memory.c:1350               memory_region_access_valid
>   3.5499  41.7142    softmmu_template.h:112      helper_ret_ldub_mmu
>   3.0383  44.7525    translate-all.c:803         page_collection_lock
>   2.9735  47.7260    memory.c:1415               
> memory_region_dispatch_read
>   2.9503  50.6763    translate-all.c:749         page_trylock_add
> 
> But the workloads may not have been 100% identical so this is not 
> conclusive, maybe this debug code is not that expensive at the moment.
> 
> AROS on sam460ex has a different profile:
> 
> %        cum. %     linenr info                 symbol name
> 8.9905   8.9905     translate-all.c:749         page_trylock_add
> 8.7658  17.7563     exec-all.h:410              helper_lookup_tb_ptr
> 7.7349  25.4911     translate-all.c:803         page_collection_lock
> 5.8246  31.3158     cputlb.c:924                victim_tlb_hit
> 3.1640  34.4797     cpus.c:347                  cpu_get_clock
> 3.1538  37.6335     translate-all.c:788         tb_page_addr_cmp
> 2.7969  40.4304     exec.c:435                  
> address_space_translate_internal
> 2.6647  43.0951     memory.c:571                access_with_adjusted_size
> 2.0615  45.1567     exec.c:569                  flatview_do_translate
> 1.9586  47.1153     memory.c:1350               memory_region_access_valid
> 
> Would anyone be able to guess what are the places that should be looked 
> at or what to check to get more info on this?

My first thought is that there is a QOM cast somewhere in a hot path on 
-M mac99 - can you show us the call stack information from the profile?

I had a similar issue with SPARC32 whereby each DMA request needs to be 
manually word-swapped, so instead of adding the QOM cast into these 
routines I did a direct C cast from the opaque to ensure that the 
overhead was as little as possible.


ATB,

Mark.
BALATON Zoltan July 17, 2018, 7:35 p.m. UTC | #6
On Tue, 17 Jul 2018, Mark Cave-Ayland wrote:
>> MorphOS on mac99 this seems to be significant. This is with default 
>> configure (--enable-qom-cast-debug):
>> 
>> %?????? cum. %???? linenr info???????????????? symbol name
>> 9.7057?? 9.7057??? exec-all.h:410????????????? helper_lookup_tb_ptr
>> 8.0330? 17.7387??? object.c:711 
>> object_class_dynamic_cast_assert
>> 6.9411? 24.6798??? cputlb.c:793??????????????? io_readx
>> 6.3219? 31.0017??? sm501_template.h:62???????? draw_line16_32
>> 5.3601? 36.3617??? cputlb.c:114??????????????? tlb_flush_nocheck
>> 3.6170? 39.9787??? translate-all.c:749???????? page_trylock_add
>> 3.1188? 43.0975??? translate-all.c:803???????? page_collection_lock
>> 3.0405? 46.1380??? exec.c:3025???????????????? iotlb_to_section
>> 2.7044? 48.8424??? softmmu_template.h:112????? helper_ret_ldub_mmu
>> 2.4154? 51.2578??? memory.c:1350?????????????? memory_region_access_valid
[...]
> My first thought is that there is a QOM cast somewhere in a hot path on -M 
> mac99 - can you show us the call stack information from the profile?

Not really. Oprofile that gave me this profile could not display call 
graph for this call. I've tried again with the perf tool but I'm not quite 
sure how to interpret its results. If I'm not mistaken it points me in the 
direction of cpu_asidx_from_attrs, called from iotlb_to_section, called 
from io_readx. The object_class_dynamic_cast_assert call likely comes from 
OBJECT_CLASS_CHECK, expanded from OBJECT_GET_CLASS, expanded from 
CPU_GET_CLASS. Would that make any sense? Any idea what to do about it?

Thank you,
BALATON Zoltan
Mark Cave-Ayland July 17, 2018, 7:57 p.m. UTC | #7
On 17/07/18 20:35, BALATON Zoltan wrote:

> On Tue, 17 Jul 2018, Mark Cave-Ayland wrote:
>>> MorphOS on mac99 this seems to be significant. This is with default 
>>> configure (--enable-qom-cast-debug):
>>>
>>> %?????? cum. %???? linenr info???????????????? symbol name
>>> 9.7057?? 9.7057??? exec-all.h:410????????????? helper_lookup_tb_ptr
>>> 8.0330? 17.7387??? object.c:711 object_class_dynamic_cast_assert
>>> 6.9411? 24.6798??? cputlb.c:793??????????????? io_readx
>>> 6.3219? 31.0017??? sm501_template.h:62???????? draw_line16_32
>>> 5.3601? 36.3617??? cputlb.c:114??????????????? tlb_flush_nocheck
>>> 3.6170? 39.9787??? translate-all.c:749???????? page_trylock_add
>>> 3.1188? 43.0975??? translate-all.c:803???????? page_collection_lock
>>> 3.0405? 46.1380??? exec.c:3025???????????????? iotlb_to_section
>>> 2.7044? 48.8424??? softmmu_template.h:112????? helper_ret_ldub_mmu
>>> 2.4154? 51.2578??? memory.c:1350?????????????? 
>>> memory_region_access_valid
> [...]
>> My first thought is that there is a QOM cast somewhere in a hot path 
>> on -M mac99 - can you show us the call stack information from the 
>> profile?
> 
> Not really. Oprofile that gave me this profile could not display call 
> graph for this call. I've tried again with the perf tool but I'm not 
> quite sure how to interpret its results. If I'm not mistaken it points 
> me in the direction of cpu_asidx_from_attrs, called from 
> iotlb_to_section, called from io_readx. The 
> object_class_dynamic_cast_assert call likely comes from 
> OBJECT_CLASS_CHECK, expanded from OBJECT_GET_CLASS, expanded from 
> CPU_GET_CLASS. Would that make any sense? Any idea what to do about it?

Good question. A quick grep for 'asidx_from_attrs' shows that 
cc->asidx_from_attrs() isn't set for PPC targets, so as a quick test 
does replacing the inline function cpu_asidx_from_attrs() in 
include/qom/cpu.h with a simple "return 0" change the profile at all?


ATB,

Mark.
BALATON Zoltan July 17, 2018, 8:46 p.m. UTC | #8
On Tue, 17 Jul 2018, Mark Cave-Ayland wrote:
> On 17/07/18 20:35, BALATON Zoltan wrote:
>> On Tue, 17 Jul 2018, Mark Cave-Ayland wrote:
>>>> MorphOS on mac99 this seems to be significant. This is with default 
>>>> configure (--enable-qom-cast-debug):
>>>> 
>>>> %?????? cum. %???? linenr info???????????????? symbol name
>>>> 9.7057?? 9.7057??? exec-all.h:410????????????? helper_lookup_tb_ptr
>>>> 8.0330? 17.7387??? object.c:711 object_class_dynamic_cast_assert
>>>> 6.9411? 24.6798??? cputlb.c:793??????????????? io_readx
>>>> 6.3219? 31.0017??? sm501_template.h:62???????? draw_line16_32
>>>> 5.3601? 36.3617??? cputlb.c:114??????????????? tlb_flush_nocheck
>>>> 3.6170? 39.9787??? translate-all.c:749???????? page_trylock_add
>>>> 3.1188? 43.0975??? translate-all.c:803???????? page_collection_lock
>>>> 3.0405? 46.1380??? exec.c:3025???????????????? iotlb_to_section
>>>> 2.7044? 48.8424??? softmmu_template.h:112????? helper_ret_ldub_mmu
>>>> 2.4154? 51.2578??? memory.c:1350?????????????? memory_region_access_valid
>> [...]
>>> My first thought is that there is a QOM cast somewhere in a hot path on -M 
>>> mac99 - can you show us the call stack information from the profile?
>> 
>> Not really. Oprofile that gave me this profile could not display call graph 
>> for this call. I've tried again with the perf tool but I'm not quite sure 
>> how to interpret its results. If I'm not mistaken it points me in the 
>> direction of cpu_asidx_from_attrs, called from iotlb_to_section, called 
>> from io_readx. The object_class_dynamic_cast_assert call likely comes from 
>> OBJECT_CLASS_CHECK, expanded from OBJECT_GET_CLASS, expanded from 
>> CPU_GET_CLASS. Would that make any sense? Any idea what to do about it?
>
> Good question. A quick grep for 'asidx_from_attrs' shows that 
> cc->asidx_from_attrs() isn't set for PPC targets, so as a quick test does 
> replacing the inline function cpu_asidx_from_attrs() in include/qom/cpu.h 
> with a simple "return 0" change the profile at all?

It does seem to lessen its impact but it's still higher than I expected:

%        cum. %     linenr info                 symbol name
10.7949  10.7949    exec-all.h:410              helper_lookup_tb_ptr
  7.8663  18.6612    cputlb.c:793                io_readx
  6.0265  24.6878    cputlb.c:114                tlb_flush_nocheck
  4.0671  28.7548    sm501_template.h:62         draw_line16_32
  4.0559  32.8107    object.c:765                object_class_dynamic_cast_assert
  3.3780  36.1887    memory.c:1350               memory_region_access_valid
  2.8920  39.0808    qemu-thread-posix.c:61      qemu_mutex_lock_impl
  2.7187  41.7995    memory.c:1415               memory_region_dispatch_read
  2.6011  44.4006    qht.c:487                   qht_lookup_custom
  2.5356  46.9362    softmmu_template.h:112      helper_ret_ldub_mmu

Maybe it's called from somewhere else too? I know draw_line16_32 but I 
wonder where could helper_lookup_tb_ptr and tlb flushes come from? Those 
seem to be significant. And io_readx in itself seems to be too high on the 
list too. I wonder if it may have something to do with the background task 
trying to read non-implemented i2c stuff frequently (as discussed in point 
2. in http://zero.eik.bme.hu/~balaton/qemu/amiga/#morphos).

Regards,
BALATON Zoltan
Peter Maydell July 17, 2018, 9:53 p.m. UTC | #9
On 17 July 2018 at 21:46, BALATON Zoltan <balaton@eik.bme.hu> wrote:
> On Tue, 17 Jul 2018, Mark Cave-Ayland wrote:
>> Good question. A quick grep for 'asidx_from_attrs' shows that
>> cc->asidx_from_attrs() isn't set for PPC targets, so as a quick test does
>> replacing the inline function cpu_asidx_from_attrs() in include/qom/cpu.h
>> with a simple "return 0" change the profile at all?
>
>
> It does seem to lessen its impact but it's still higher than I expected:

It may be worth special-casing the CPU method lookups (or at
least that one) if we can, then...

> %        cum. %     linenr info                 symbol name
> 10.7949  10.7949    exec-all.h:410              helper_lookup_tb_ptr
>  7.8663  18.6612    cputlb.c:793                io_readx
>  6.0265  24.6878    cputlb.c:114                tlb_flush_nocheck
>  4.0671  28.7548    sm501_template.h:62         draw_line16_32
>  4.0559  32.8107    object.c:765
> object_class_dynamic_cast_assert
>  3.3780  36.1887    memory.c:1350               memory_region_access_valid
>  2.8920  39.0808    qemu-thread-posix.c:61      qemu_mutex_lock_impl
>  2.7187  41.7995    memory.c:1415               memory_region_dispatch_read
>  2.6011  44.4006    qht.c:487                   qht_lookup_custom
>  2.5356  46.9362    softmmu_template.h:112      helper_ret_ldub_mmu
>
> Maybe it's called from somewhere else too? I know draw_line16_32 but I
> wonder where could helper_lookup_tb_ptr and tlb flushes come from? Those
> seem to be significant. And io_readx in itself seems to be too high on the
> list too.

helper_lookup_tb_ptr is part of TCG -- it's where we look for
the next TB to go to. Any non-computed branch to a different page
will result in our calling this. So it's high on the profile
because we do it a lot, I think, but that's not necessarily a
problem as such.

io_readx is the slow path for guest memory accesses -- any
guest access to something that's not RAM will have to go through
here. My first guess (given the other things in the profile,
especially helper_ret_ldub_mmu, memory_region_dispatch_read
and memory_region_access_valid) is that the guest is in a tight
loop doing a read on a device register a lot of the time.

> I wonder if it may have something to do with the background task
> trying to read non-implemented i2c stuff frequently (as discussed in point
> 2. in http://zero.eik.bme.hu/~balaton/qemu/amiga/#morphos).

Could be, or some similar thing. If you suspect the i2c you
could try putting in an unimplemented-device stub in the
right place and see how often -d unimp yells about reads to it.

So overall I'd be a little wary of optimizing based on this
profile, because I suspect it's atypical -- the guest is sat
in a tight polling loop and the profile says "all the functions
in the code path for doing device access are really hot".
The fix is to improve our model so the guest doesn't get
stuck like that, not to try to slightly improve the speed
of device accesses (we call it the "slow path" for a reason :-))

(But places like asidx_from_attrs are likely to be on hot
paths in general, so having the QOM class lookup there be
overly heavyweight is maybe worth fixing anyhow.)

thanks
-- PMM
diff mbox series

Patch

diff --git a/configure b/configure
index 2a7796e..a8fc1ed 100755
--- a/configure
+++ b/configure
@@ -415,7 +415,7 @@  bsd_user="no"
 blobs="yes"
 pkgversion=""
 pie=""
-qom_cast_debug="yes"
+qom_cast_debug="no"
 trace_backends="log"
 trace_file="trace"
 spice=""
@@ -1062,6 +1062,7 @@  for opt do
       debug_tcg="yes"
       debug_mutex="yes"
       debug="yes"
+      qom_cast_debug="yes"
       strip_opt="no"
       fortify_source="no"
   ;;