Message ID | 20180716195851.0FE177456B7@zero.eik.bme.hu |
---|---|
State | New |
Headers | show |
Series | Do not enable QOM debugging by default | expand |
On 16 July 2018 at 20:47, BALATON Zoltan <balaton@eik.bme.hu> wrote: > Commit 3556c233d 5 years ago added an option to disable QOM debugging, > noting that it can have high performance cost but left the default to on. > Change the default to off and only enable it when debugging is requested > to avoid the performance penalty for those compiling without debug options. > > Signed-off-by: BALATON Zoltan <balaton@eik.bme.hu> Looking at the code, doesn't this disable all the type checking on the QOM casts? That is not something I think we should drop, in the same way that we do not disable assert()s even in the non-debug build. Is this coming up as significant in profiling? In the past we've done some special casing if there are particular QOM casts in hot paths that use significant runtime. thanks -- PMM
On Mon, 16 Jul 2018, Peter Maydell wrote: > On 16 July 2018 at 20:47, BALATON Zoltan <balaton@eik.bme.hu> wrote: >> Commit 3556c233d 5 years ago added an option to disable QOM debugging, >> noting that it can have high performance cost but left the default to on. >> Change the default to off and only enable it when debugging is requested >> to avoid the performance penalty for those compiling without debug options. >> >> Signed-off-by: BALATON Zoltan <balaton@eik.bme.hu> > > Looking at the code, doesn't this disable all the type checking > on the QOM casts? That is not something I think we should drop, > in the same way that we do not disable assert()s even in the > non-debug build. > > Is this coming up as significant in profiling? In the past we've > done some special casing if there are particular QOM casts in > hot paths that use significant runtime. My profiling results are not very accurate (could only gather sample based results with oprofile as --enable-gprof haven't worked for me) so I don't have numbers but I did see object_class_dynamic_cast_assert coming out among the top few without disabling this option but I don't know which call sites are responsible. Are there any recommendations for profiling QEMU (especially TCG) somewhere that I should be aware of? Regards, BALATON Zoltan
On Mon, Jul 16, 2018 at 09:41:12PM +0100, Peter Maydell wrote: > On 16 July 2018 at 20:47, BALATON Zoltan <balaton@eik.bme.hu> wrote: > > Commit 3556c233d 5 years ago added an option to disable QOM debugging, > > noting that it can have high performance cost but left the default to on. > > Change the default to off and only enable it when debugging is requested > > to avoid the performance penalty for those compiling without debug options. > > > > Signed-off-by: BALATON Zoltan <balaton@eik.bme.hu> > > Looking at the code, doesn't this disable all the type checking > on the QOM casts? That is not something I think we should drop, > in the same way that we do not disable assert()s even in the > non-debug build. > > Is this coming up as significant in profiling? In the past we've > done some special casing if there are particular QOM casts in > hot paths that use significant runtime. There might also be scope for doing optimization of the object_dynamic_cast_assert() implementation too. Regards, Daniel
On Mon, 16 Jul 2018, Peter Maydell wrote:
> Is this coming up as significant in profiling? In the past we've
This seems to depend on the workload. From the cases I'm interested in
AROS and AmigaOS on qemu-system-ppc -M sam460ex does not seem to be
effected much (object_class_dynamic_cast_assert is not in top 10 with <2%)
but for MorphOS on mac99 this seems to be significant. This is with
default configure (--enable-qom-cast-debug):
% cum. % linenr info symbol name
9.7057 9.7057 exec-all.h:410 helper_lookup_tb_ptr
8.0330 17.7387 object.c:711 object_class_dynamic_cast_assert
6.9411 24.6798 cputlb.c:793 io_readx
6.3219 31.0017 sm501_template.h:62 draw_line16_32
5.3601 36.3617 cputlb.c:114 tlb_flush_nocheck
3.6170 39.9787 translate-all.c:749 page_trylock_add
3.1188 43.0975 translate-all.c:803 page_collection_lock
3.0405 46.1380 exec.c:3025 iotlb_to_section
2.7044 48.8424 softmmu_template.h:112 helper_ret_ldub_mmu
2.4154 51.2578 memory.c:1350 memory_region_access_valid
and improves a bit (but not much) with --disable-qom-cast-debug
% cum. % linenr info symbol name
10.2063 10.2063 exec-all.h:410 helper_lookup_tb_ptr
7.1581 17.3644 object.c:711 object_class_dynamic_cast_assert
5.9297 23.2941 sm501_template.h:62 draw_line16_32
5.9227 29.2168 cputlb.c:793 io_readx
5.3030 34.5198 cputlb.c:114 tlb_flush_nocheck
3.6445 38.1643 memory.c:1350 memory_region_access_valid
3.5499 41.7142 softmmu_template.h:112 helper_ret_ldub_mmu
3.0383 44.7525 translate-all.c:803 page_collection_lock
2.9735 47.7260 memory.c:1415 memory_region_dispatch_read
2.9503 50.6763 translate-all.c:749 page_trylock_add
But the workloads may not have been 100% identical so this is not
conclusive, maybe this debug code is not that expensive at the moment.
AROS on sam460ex has a different profile:
% cum. % linenr info symbol name
8.9905 8.9905 translate-all.c:749 page_trylock_add
8.7658 17.7563 exec-all.h:410 helper_lookup_tb_ptr
7.7349 25.4911 translate-all.c:803 page_collection_lock
5.8246 31.3158 cputlb.c:924 victim_tlb_hit
3.1640 34.4797 cpus.c:347 cpu_get_clock
3.1538 37.6335 translate-all.c:788 tb_page_addr_cmp
2.7969 40.4304 exec.c:435 address_space_translate_internal
2.6647 43.0951 memory.c:571 access_with_adjusted_size
2.0615 45.1567 exec.c:569 flatview_do_translate
1.9586 47.1153 memory.c:1350 memory_region_access_valid
Would anyone be able to guess what are the places that should be looked at
or what to check to get more info on this?
Regards,
BALATON Zoltan
On 17/07/18 16:09, BALATON Zoltan wrote: > On Mon, 16 Jul 2018, Peter Maydell wrote: >> Is this coming up as significant in profiling? In the past we've > > This seems to depend on the workload. From the cases I'm interested in > AROS and AmigaOS on qemu-system-ppc -M sam460ex does not seem to be > effected much (object_class_dynamic_cast_assert is not in top 10 with > <2%) but for MorphOS on mac99 this seems to be significant. This is with > default configure (--enable-qom-cast-debug): > > % cum. % linenr info symbol name > 9.7057 9.7057 exec-all.h:410 helper_lookup_tb_ptr > 8.0330 17.7387 object.c:711 > object_class_dynamic_cast_assert > 6.9411 24.6798 cputlb.c:793 io_readx > 6.3219 31.0017 sm501_template.h:62 draw_line16_32 > 5.3601 36.3617 cputlb.c:114 tlb_flush_nocheck > 3.6170 39.9787 translate-all.c:749 page_trylock_add > 3.1188 43.0975 translate-all.c:803 page_collection_lock > 3.0405 46.1380 exec.c:3025 iotlb_to_section > 2.7044 48.8424 softmmu_template.h:112 helper_ret_ldub_mmu > 2.4154 51.2578 memory.c:1350 memory_region_access_valid > > and improves a bit (but not much) with --disable-qom-cast-debug > > % cum. % linenr info symbol name > 10.2063 10.2063 exec-all.h:410 helper_lookup_tb_ptr > 7.1581 17.3644 object.c:711 > object_class_dynamic_cast_assert > 5.9297 23.2941 sm501_template.h:62 draw_line16_32 > 5.9227 29.2168 cputlb.c:793 io_readx > 5.3030 34.5198 cputlb.c:114 tlb_flush_nocheck > 3.6445 38.1643 memory.c:1350 memory_region_access_valid > 3.5499 41.7142 softmmu_template.h:112 helper_ret_ldub_mmu > 3.0383 44.7525 translate-all.c:803 page_collection_lock > 2.9735 47.7260 memory.c:1415 > memory_region_dispatch_read > 2.9503 50.6763 translate-all.c:749 page_trylock_add > > But the workloads may not have been 100% identical so this is not > conclusive, maybe this debug code is not that expensive at the moment. > > AROS on sam460ex has a different profile: > > % cum. % linenr info symbol name > 8.9905 8.9905 translate-all.c:749 page_trylock_add > 8.7658 17.7563 exec-all.h:410 helper_lookup_tb_ptr > 7.7349 25.4911 translate-all.c:803 page_collection_lock > 5.8246 31.3158 cputlb.c:924 victim_tlb_hit > 3.1640 34.4797 cpus.c:347 cpu_get_clock > 3.1538 37.6335 translate-all.c:788 tb_page_addr_cmp > 2.7969 40.4304 exec.c:435 > address_space_translate_internal > 2.6647 43.0951 memory.c:571 access_with_adjusted_size > 2.0615 45.1567 exec.c:569 flatview_do_translate > 1.9586 47.1153 memory.c:1350 memory_region_access_valid > > Would anyone be able to guess what are the places that should be looked > at or what to check to get more info on this? My first thought is that there is a QOM cast somewhere in a hot path on -M mac99 - can you show us the call stack information from the profile? I had a similar issue with SPARC32 whereby each DMA request needs to be manually word-swapped, so instead of adding the QOM cast into these routines I did a direct C cast from the opaque to ensure that the overhead was as little as possible. ATB, Mark.
On Tue, 17 Jul 2018, Mark Cave-Ayland wrote: >> MorphOS on mac99 this seems to be significant. This is with default >> configure (--enable-qom-cast-debug): >> >> %?????? cum. %???? linenr info???????????????? symbol name >> 9.7057?? 9.7057??? exec-all.h:410????????????? helper_lookup_tb_ptr >> 8.0330? 17.7387??? object.c:711 >> object_class_dynamic_cast_assert >> 6.9411? 24.6798??? cputlb.c:793??????????????? io_readx >> 6.3219? 31.0017??? sm501_template.h:62???????? draw_line16_32 >> 5.3601? 36.3617??? cputlb.c:114??????????????? tlb_flush_nocheck >> 3.6170? 39.9787??? translate-all.c:749???????? page_trylock_add >> 3.1188? 43.0975??? translate-all.c:803???????? page_collection_lock >> 3.0405? 46.1380??? exec.c:3025???????????????? iotlb_to_section >> 2.7044? 48.8424??? softmmu_template.h:112????? helper_ret_ldub_mmu >> 2.4154? 51.2578??? memory.c:1350?????????????? memory_region_access_valid [...] > My first thought is that there is a QOM cast somewhere in a hot path on -M > mac99 - can you show us the call stack information from the profile? Not really. Oprofile that gave me this profile could not display call graph for this call. I've tried again with the perf tool but I'm not quite sure how to interpret its results. If I'm not mistaken it points me in the direction of cpu_asidx_from_attrs, called from iotlb_to_section, called from io_readx. The object_class_dynamic_cast_assert call likely comes from OBJECT_CLASS_CHECK, expanded from OBJECT_GET_CLASS, expanded from CPU_GET_CLASS. Would that make any sense? Any idea what to do about it? Thank you, BALATON Zoltan
On 17/07/18 20:35, BALATON Zoltan wrote: > On Tue, 17 Jul 2018, Mark Cave-Ayland wrote: >>> MorphOS on mac99 this seems to be significant. This is with default >>> configure (--enable-qom-cast-debug): >>> >>> %?????? cum. %???? linenr info???????????????? symbol name >>> 9.7057?? 9.7057??? exec-all.h:410????????????? helper_lookup_tb_ptr >>> 8.0330? 17.7387??? object.c:711 object_class_dynamic_cast_assert >>> 6.9411? 24.6798??? cputlb.c:793??????????????? io_readx >>> 6.3219? 31.0017??? sm501_template.h:62???????? draw_line16_32 >>> 5.3601? 36.3617??? cputlb.c:114??????????????? tlb_flush_nocheck >>> 3.6170? 39.9787??? translate-all.c:749???????? page_trylock_add >>> 3.1188? 43.0975??? translate-all.c:803???????? page_collection_lock >>> 3.0405? 46.1380??? exec.c:3025???????????????? iotlb_to_section >>> 2.7044? 48.8424??? softmmu_template.h:112????? helper_ret_ldub_mmu >>> 2.4154? 51.2578??? memory.c:1350?????????????? >>> memory_region_access_valid > [...] >> My first thought is that there is a QOM cast somewhere in a hot path >> on -M mac99 - can you show us the call stack information from the >> profile? > > Not really. Oprofile that gave me this profile could not display call > graph for this call. I've tried again with the perf tool but I'm not > quite sure how to interpret its results. If I'm not mistaken it points > me in the direction of cpu_asidx_from_attrs, called from > iotlb_to_section, called from io_readx. The > object_class_dynamic_cast_assert call likely comes from > OBJECT_CLASS_CHECK, expanded from OBJECT_GET_CLASS, expanded from > CPU_GET_CLASS. Would that make any sense? Any idea what to do about it? Good question. A quick grep for 'asidx_from_attrs' shows that cc->asidx_from_attrs() isn't set for PPC targets, so as a quick test does replacing the inline function cpu_asidx_from_attrs() in include/qom/cpu.h with a simple "return 0" change the profile at all? ATB, Mark.
On Tue, 17 Jul 2018, Mark Cave-Ayland wrote: > On 17/07/18 20:35, BALATON Zoltan wrote: >> On Tue, 17 Jul 2018, Mark Cave-Ayland wrote: >>>> MorphOS on mac99 this seems to be significant. This is with default >>>> configure (--enable-qom-cast-debug): >>>> >>>> %?????? cum. %???? linenr info???????????????? symbol name >>>> 9.7057?? 9.7057??? exec-all.h:410????????????? helper_lookup_tb_ptr >>>> 8.0330? 17.7387??? object.c:711 object_class_dynamic_cast_assert >>>> 6.9411? 24.6798??? cputlb.c:793??????????????? io_readx >>>> 6.3219? 31.0017??? sm501_template.h:62???????? draw_line16_32 >>>> 5.3601? 36.3617??? cputlb.c:114??????????????? tlb_flush_nocheck >>>> 3.6170? 39.9787??? translate-all.c:749???????? page_trylock_add >>>> 3.1188? 43.0975??? translate-all.c:803???????? page_collection_lock >>>> 3.0405? 46.1380??? exec.c:3025???????????????? iotlb_to_section >>>> 2.7044? 48.8424??? softmmu_template.h:112????? helper_ret_ldub_mmu >>>> 2.4154? 51.2578??? memory.c:1350?????????????? memory_region_access_valid >> [...] >>> My first thought is that there is a QOM cast somewhere in a hot path on -M >>> mac99 - can you show us the call stack information from the profile? >> >> Not really. Oprofile that gave me this profile could not display call graph >> for this call. I've tried again with the perf tool but I'm not quite sure >> how to interpret its results. If I'm not mistaken it points me in the >> direction of cpu_asidx_from_attrs, called from iotlb_to_section, called >> from io_readx. The object_class_dynamic_cast_assert call likely comes from >> OBJECT_CLASS_CHECK, expanded from OBJECT_GET_CLASS, expanded from >> CPU_GET_CLASS. Would that make any sense? Any idea what to do about it? > > Good question. A quick grep for 'asidx_from_attrs' shows that > cc->asidx_from_attrs() isn't set for PPC targets, so as a quick test does > replacing the inline function cpu_asidx_from_attrs() in include/qom/cpu.h > with a simple "return 0" change the profile at all? It does seem to lessen its impact but it's still higher than I expected: % cum. % linenr info symbol name 10.7949 10.7949 exec-all.h:410 helper_lookup_tb_ptr 7.8663 18.6612 cputlb.c:793 io_readx 6.0265 24.6878 cputlb.c:114 tlb_flush_nocheck 4.0671 28.7548 sm501_template.h:62 draw_line16_32 4.0559 32.8107 object.c:765 object_class_dynamic_cast_assert 3.3780 36.1887 memory.c:1350 memory_region_access_valid 2.8920 39.0808 qemu-thread-posix.c:61 qemu_mutex_lock_impl 2.7187 41.7995 memory.c:1415 memory_region_dispatch_read 2.6011 44.4006 qht.c:487 qht_lookup_custom 2.5356 46.9362 softmmu_template.h:112 helper_ret_ldub_mmu Maybe it's called from somewhere else too? I know draw_line16_32 but I wonder where could helper_lookup_tb_ptr and tlb flushes come from? Those seem to be significant. And io_readx in itself seems to be too high on the list too. I wonder if it may have something to do with the background task trying to read non-implemented i2c stuff frequently (as discussed in point 2. in http://zero.eik.bme.hu/~balaton/qemu/amiga/#morphos). Regards, BALATON Zoltan
On 17 July 2018 at 21:46, BALATON Zoltan <balaton@eik.bme.hu> wrote: > On Tue, 17 Jul 2018, Mark Cave-Ayland wrote: >> Good question. A quick grep for 'asidx_from_attrs' shows that >> cc->asidx_from_attrs() isn't set for PPC targets, so as a quick test does >> replacing the inline function cpu_asidx_from_attrs() in include/qom/cpu.h >> with a simple "return 0" change the profile at all? > > > It does seem to lessen its impact but it's still higher than I expected: It may be worth special-casing the CPU method lookups (or at least that one) if we can, then... > % cum. % linenr info symbol name > 10.7949 10.7949 exec-all.h:410 helper_lookup_tb_ptr > 7.8663 18.6612 cputlb.c:793 io_readx > 6.0265 24.6878 cputlb.c:114 tlb_flush_nocheck > 4.0671 28.7548 sm501_template.h:62 draw_line16_32 > 4.0559 32.8107 object.c:765 > object_class_dynamic_cast_assert > 3.3780 36.1887 memory.c:1350 memory_region_access_valid > 2.8920 39.0808 qemu-thread-posix.c:61 qemu_mutex_lock_impl > 2.7187 41.7995 memory.c:1415 memory_region_dispatch_read > 2.6011 44.4006 qht.c:487 qht_lookup_custom > 2.5356 46.9362 softmmu_template.h:112 helper_ret_ldub_mmu > > Maybe it's called from somewhere else too? I know draw_line16_32 but I > wonder where could helper_lookup_tb_ptr and tlb flushes come from? Those > seem to be significant. And io_readx in itself seems to be too high on the > list too. helper_lookup_tb_ptr is part of TCG -- it's where we look for the next TB to go to. Any non-computed branch to a different page will result in our calling this. So it's high on the profile because we do it a lot, I think, but that's not necessarily a problem as such. io_readx is the slow path for guest memory accesses -- any guest access to something that's not RAM will have to go through here. My first guess (given the other things in the profile, especially helper_ret_ldub_mmu, memory_region_dispatch_read and memory_region_access_valid) is that the guest is in a tight loop doing a read on a device register a lot of the time. > I wonder if it may have something to do with the background task > trying to read non-implemented i2c stuff frequently (as discussed in point > 2. in http://zero.eik.bme.hu/~balaton/qemu/amiga/#morphos). Could be, or some similar thing. If you suspect the i2c you could try putting in an unimplemented-device stub in the right place and see how often -d unimp yells about reads to it. So overall I'd be a little wary of optimizing based on this profile, because I suspect it's atypical -- the guest is sat in a tight polling loop and the profile says "all the functions in the code path for doing device access are really hot". The fix is to improve our model so the guest doesn't get stuck like that, not to try to slightly improve the speed of device accesses (we call it the "slow path" for a reason :-)) (But places like asidx_from_attrs are likely to be on hot paths in general, so having the QOM class lookup there be overly heavyweight is maybe worth fixing anyhow.) thanks -- PMM
diff --git a/configure b/configure index 2a7796e..a8fc1ed 100755 --- a/configure +++ b/configure @@ -415,7 +415,7 @@ bsd_user="no" blobs="yes" pkgversion="" pie="" -qom_cast_debug="yes" +qom_cast_debug="no" trace_backends="log" trace_file="trace" spice="" @@ -1062,6 +1062,7 @@ for opt do debug_tcg="yes" debug_mutex="yes" debug="yes" + qom_cast_debug="yes" strip_opt="no" fortify_source="no" ;;
Commit 3556c233d 5 years ago added an option to disable QOM debugging, noting that it can have high performance cost but left the default to on. Change the default to off and only enable it when debugging is requested to avoid the performance penalty for those compiling without debug options. Signed-off-by: BALATON Zoltan <balaton@eik.bme.hu> --- configure | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)