AIO: Reduce number of threads for 32bit hosts

Message ID 1421197016-69426-1-git-send-email-agraf@suse.de
State New

Commit Message

Alexander Graf Jan. 14, 2015, 12:56 a.m. UTC
On hosts with limited virtual address space (32bit pointers), we can very
easily run out of virtual memory with big thread pools.

Instead, we should limit ourselves to small pools to keep memory footprint
low on those systems.

This patch fixes random VM stalls like

  (process:25114): GLib-ERROR **: gmem.c:103: failed to allocate 1048576 bytes

on 32bit ARM systems for me.

Signed-off-by: Alexander Graf <agraf@suse.de>
---
 thread-pool.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

Comments

Paolo Bonzini Jan. 14, 2015, 7:37 a.m. UTC | #1
On 14/01/2015 01:56, Alexander Graf wrote:
> +    if (sizeof(pool) == 4) {
> +        /* 32bit systems run out of virtual memory quickly */
> +        pool->max_threads = 4;
> +    } else {
> +        pool->max_threads = 64;
> +    }

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>

The same problem applies to coroutine stacks, and those cannot be
throttled down as easily.  But I guess if you limit the number of
threads, the guest gets slowed down and doesn't create as many coroutines.

It would be nice to have a way to measure coroutine stack usage, similar
to what the kernel does.

Paolo
Kevin Wolf Jan. 14, 2015, 10:20 a.m. UTC | #2
Am 14.01.2015 um 08:37 hat Paolo Bonzini geschrieben:
> 
> 
> On 14/01/2015 01:56, Alexander Graf wrote:
> > +    if (sizeof(pool) == 4) {
> > +        /* 32bit systems run out of virtual memory quickly */
> > +        pool->max_threads = 4;
> > +    } else {
> > +        pool->max_threads = 64;
> > +    }
> 
> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
> 
> The same problem applies to coroutine stacks, and those cannot be
> throttled down as easily.  But I guess if you limit the number of
> threads, the guest gets slowed down and doesn't create as many coroutines.

Shouldn't we rather try and decrease the stack sizes a bit? 1 MB per
coroutine is really a lot, and as I understand it, threads take even
more by default.

> It would be nice to have a way to measure coroutine stack usage, similar
> to what the kernel does.

The information which pages are mapped should be there somewhere...

Kevin
Paolo Bonzini Jan. 14, 2015, 11:18 a.m. UTC | #3
On 14/01/2015 11:20, Kevin Wolf wrote:
>> > The same problem applies to coroutine stacks, and those cannot be
>> > throttled down as easily.  But I guess if you limit the number of
>> > threads, the guest gets slowed down and doesn't create as many coroutines.
> Shouldn't we rather try and decrease the stack sizes a bit? 1 MB per
> coroutine is really a lot, and as I understand it, threads take even
> more by default.

Yup, 2 MB.  Last time I proposed this, I think Markus was strongly in 
the "better safe than sorry" camp. :)

But thread pool workers definitely don't need a big stack.

>> > It would be nice to have a way to measure coroutine stack usage, similar
>> > to what the kernel does.
> The information which pages are mapped should be there somewhere...

Yes, there is mincore(2).  The complicated part is doing it fast, but
perhaps it doesn't need to be fast.

I tried gathering warnings from GCC's -Wstack-usage=1023 option, and the
block layer does not seem to have functions with huge stacks in the I/O
path.

So, assuming a maximum stack depth of 50 (already pretty generous since
there shouldn't be any recursive calls) a 100K stack should be pretty
much okay for coroutines and thread-pool threads.

That said there are some offenders in the device models.  Other
QemuThreads, especially VCPU threads, had better stay with a big stack.

Paolo

1024-2048:
block/linux-aio.c                      ioq_submit                         stack usage is 1104 bytes
block/mirror.c                         mirror_run                         stack usage is 1280 bytes
block/qapi.c                           bdrv_query_image_info              stack usage is 1152 bytes
block/vmdk.c                           vmdk_open_vmdk4                    stack usage is 1840 bytes
block/vpc.c                            vpc_create                         stack usage is 1152 bytes
cpus.c                                 qmp_memsave                        stack usage is 1120 bytes
cpus.c                                 qmp_pmemsave                       stack usage is 1120 bytes
disas/sparc.c                          build_hash_table                   stack usage is 1072 bytes
fsdev/virtfs-proxy-helper.c            main                               stack usage is 1264 bytes
hw/arm/highbank.c                      calxeda_init                       stack usage is 1200 bytes
hw/arm/stellaris.c                     stellaris_init                     stack usage is 1120 bytes
hw/arm/virt.c                          machvirt_init                      stack usage is 1264 bytes
hw/block/dataplane/virtio-blk.c        handle_notify                      stack usage is 1632 bytes
hw/block/virtio-blk.c                  virtio_blk_dma_restart_bh          stack usage is 1600 bytes
hw/block/virtio-blk.c                  virtio_blk_handle_output           stack usage is 1584 bytes
hw/core/qdev.c                         qdev_get_legacy_property           stack usage is 1104 bytes
hw/net/vmxnet_tx_pkt.c                 vmxnet_tx_pkt_do_sw_fragmentation  stack usage is 1168 bytes
hw/ppc/e500.c                          ppce500_load_device_tree           stack usage is 1072 bytes
hw/scsi/megasas.c                      megasas_dcmd_ld_get_list           stack usage is 1152 bytes
hw/tpm/tpm_passthrough.c               tpm_passthrough_test_tpmdev        stack usage is 1216 bytes
linux-user/main.c                      main                               stack usage is 1664 bytes
linux-user/main.c                      main                               stack usage is 1696 bytes
linux-user/main.c                      main                               stack usage is 1728 bytes
linux-user/main.c                      main                               stack usage is 1744 bytes
linux-user/main.c                      main                               stack usage is 1760 bytes
linux-user/main.c                      main                               stack usage is 1776 bytes
linux-user/main.c                      main                               stack usage is 1856 bytes
linux-user/main.c                      main                               stack usage is 1872 bytes
linux-user/main.c                      main                               stack usage is 1888 bytes
linux-user/main.c                      main                               stack usage is 1952 bytes
linux-user/main.c                      main                               stack usage is 1984 bytes
linux-user/main.c                      main                               stack usage is 2032 bytes
migration/vmstate.c                    get_unused_buffer                  stack usage is 1088 bytes
monitor.c                              monitor_parse_command              stack usage is 1424 bytes
monitor.c                              parse_cmdline                      stack usage is 1136 bytes
qemu-img.c                             img_rebase                         stack usage is 1248 bytes
qobject/json-parser.c                  parse_error                        stack usage is 1280 bytes
qobject/qjson.c                        to_json                            stack usage is 1136 bytes
qtest.c                                qtest_send                         stack usage is 1280 bytes
savevm.c                               do_savevm                          stack usage is 1424 bytes
slirp/slirp.c                          if_encap                           stack usage is 1680 bytes
target-i386/kvm.c                      kvm_get_apic                       stack usage is 1072 bytes
target-i386/kvm.c                      kvm_put_apic                       stack usage is 1072 bytes
target-xtensa/xtensa-semi.c            helper_simcall                     stack usage is 1296 bytes
trace/control.c                        trace_init_events                  stack usage is 1152 bytes
ui/keymaps.c                           parse_keyboard_layout              stack usage is 1152 bytes
ui/vnc.c                               addr_to_string                     stack usage is 1136 bytes
ui/vnc.c                               protocol_client_init               stack usage is 1072 bytes
ui/vnc.c                               qmp_query_vnc                      stack usage is 1344 bytes
ui/vnc.c                               vnc_basic_info_get                 stack usage is 1136 bytes
ui/vnc-enc-tight.c                     tight_detect_smooth_image16        stack usage is 1184 bytes
ui/vnc-enc-tight.c                     tight_detect_smooth_image24        stack usage is 1152 bytes
ui/vnc-enc-tight.c                     tight_detect_smooth_image32        stack usage is 1184 bytes
ui/vnc-enc-zrle-template.c             zrle_encode_tile15be               stack usage is 1264 bytes
ui/vnc-enc-zrle-template.c             zrle_encode_tile15le               stack usage is 1264 bytes
ui/vnc-enc-zrle-template.c             zrle_encode_tile16be               stack usage is 1264 bytes
ui/vnc-enc-zrle-template.c             zrle_encode_tile16le               stack usage is 1264 bytes
ui/vnc-enc-zrle-template.c             zrle_encode_tile24abe              stack usage is 1200 bytes
ui/vnc-enc-zrle-template.c             zrle_encode_tile24ale              stack usage is 1200 bytes
ui/vnc-enc-zrle-template.c             zrle_encode_tile24bbe              stack usage is 1200 bytes
ui/vnc-enc-zrle-template.c             zrle_encode_tile24ble              stack usage is 1200 bytes
ui/vnc-enc-zrle-template.c             zrle_encode_tile32be               stack usage is 1184 bytes
ui/vnc-enc-zrle-template.c             zrle_encode_tile32le               stack usage is 1184 bytes
ui/vnc-enc-zrle-template.c             zrle_encode_tile8ne                stack usage is 1184 bytes
util/qemu-option.c                     opts_do_parse                      stack usage is 1280 bytes
util/qemu-option.c                     opts_parse                         stack usage is 1104 bytes



2048-4096:
block.c                                bdrv_swap                          stack usage is 3920 bytes
coroutine-ucontext.c                   qemu_coroutine_new                 stack usage is 2192 bytes
hw/arm/musicpal.c                      eth_send                           stack usage is 2160 bytes
hw/dma/etraxfs_dma.c                   channel_out_run                    stack usage is 2160 bytes
hw/net/cadence_gem.c                   gem_receive                        stack usage is 2192 bytes
hw/net/cadence_gem.c                   gem_transmit                       stack usage is 2176 bytes
hw/net/eepro100.c                      tx_command                         stack usage is 2736 bytes
hw/net/mcf_fec.c                       mcf_fec_do_tx                      stack usage is 2144 bytes
hw/ppc/spapr_pci.c                     spapr_populate_pci_dt              stack usage is 4032 bytes
hw/scsi/megasas.c                      megasas_ctrl_get_info              stack usage is 2176 bytes
libcacard/vcard_emul_nss.c             vcard_emul_rsa_op                  stack usage is 2224 bytes
monitor.c                              file_completion                    stack usage is 3328 bytes
monitor.c                              host_net_remove_completion         stack usage is 2176 bytes
monitor.c                              netdev_del_completion              stack usage is 2112 bytes
monitor.c                              set_link_completion                stack usage is 2112 bytes
qemu-char.c                            tcp_chr_connect                    stack usage is 2608 bytes
qemu-nbd.c                             main                               stack usage is 2336 bytes
slirp/misc.c                           fork_exec                          stack usage is 2176 bytes
slirp/socket.c                         sosendoob                          stack usage is 2144 bytes
target-arm/helper.c                    register_cp_regs_for_features      stack usage is 3232 bytes
target-arm/helper.c                    register_cp_regs_for_features      stack usage is 3360 bytes
target-i386/kvm.c                      kvm_get_msrs                       stack usage is 2448 bytes
target-i386/kvm.c                      kvm_put_msrs                       stack usage is 2448 bytes
ui/sdl.c                               sdl_update_caption                 stack usage is 2112 bytes
util/qemu-config.c                     qemu_config_parse                  stack usage is 2416 bytes



4096-8192:
block.c                                bdrv_commit                        stack usage is 4208 bytes
block/raw-posix.c                      hdev_open                          stack usage is 4320 bytes
block/vmdk.c                           vmdk_parse_extents                 stack usage is 4848 bytes
block/vvfat.c                          check_directory_consistency        stack usage is 5152 bytes
gdbstub.c                              gdb_monitor_output                 stack usage is 4144 bytes
hw/arm/boot.c                          set_kernel_args                    stack usage is 4176 bytes
hw/audio/ac97.c                        read_audio                         stack usage is 4208 bytes
hw/audio/ac97.c                        write_audio                        stack usage is 4208 bytes
hw/audio/es1370.c                      es1370_transfer_audio              stack usage is 4240 bytes
hw/audio/gus.c                         GUS_read_DMA                       stack usage is 4192 bytes
hw/audio/marvell_88w8618.c             mv88w8618_audio_callback           stack usage is 4160 bytes
hw/audio/milkymist-ac97.c              ac97_in_cb                         stack usage is 4176 bytes
hw/audio/milkymist-ac97.c              ac97_out_cb                        stack usage is 4176 bytes
hw/audio/sb16.c                        write_audio                        stack usage is 4192 bytes
hw/char/sclpconsole-lm.c               process_mdb                        stack usage is 4144 bytes
hw/s390x/sclp.c                        sclp_service_call                  stack usage is 4336 bytes
hw/scsi/lsi53c895a.c                   lsi_memcpy                         stack usage is 4192 bytes
hw/scsi/megasas.c                      megasas_dcmd_cfg_read              stack usage is 4224 bytes
hw/scsi/megasas.c                      megasas_dcmd_pd_get_list           stack usage is 5808 bytes
hw/tpm/tpm_passthrough.c               tpm_passthrough_open_sysfs_cancel  stack usage is 4144 bytes
linux-user/elfload.c                   vma_dump_size                      stack usage is 4128 bytes
linux-user/syscall.c                   do_openat                          stack usage is 4176 bytes
net/tap.c                              net_bridge_run_helper              stack usage is 4720 bytes
qemu-bridge-helper.c                   parse_acl_file                     stack usage is 4176 bytes
qemu-char.c                            fd_chr_read                        stack usage is 4160 bytes
qemu-char.c                            pty_chr_read                       stack usage is 4160 bytes
qemu-char.c                            qemu_chr_fe_printf                 stack usage is 4352 bytes
qemu-char.c                            qemu_chr_open_pty                  stack usage is 4176 bytes
qemu-char.c                            tcp_chr_read                       stack usage is 4192 bytes
qga/commands-posix.c                   get_pci_driver                     stack usage is 4144 bytes
qga/main.c                             channel_event_cb                   stack usage is 4160 bytes
target-i386/kvm.c                      kvm_arch_init_vcpu                 stack usage is 4144 bytes
target-s390x/ioinst.c                  ioinst_handle_chsc_scsc            stack usage is 4112 bytes
target-s390x/misc_helper.c             helper_stsi                        stack usage is 4128 bytes
ui/vnc-auth-sasl.c                     vnc_client_read_sasl               stack usage is 4160 bytes
util/oslib-posix.c                     qemu_init_exec_dir                 stack usage is 4144 bytes
util/path.c                            init_paths                         stack usage is 4144 bytes



8192-16384:
block/vmdk.c                           vmdk_parent_open                   stack usage is 10288 bytes
block/vmdk.c                           vmdk_read_cid                      stack usage is 10304 bytes
gdbstub.c                              gdb_handle_packet                  stack usage is 8336 bytes
hw/acpi/core.c                         acpi_table_add                     stack usage is 8336 bytes
hw/audio/cs4231a.c                     cs_write_audio                     stack usage is 12384 bytes
hw/core/qdev-properties-system.c       set_netdev                         stack usage is 8288 bytes
hw/i386/kvm/pci-assign.c               assign_failed_examine              stack usage is 12480 bytes
hw/i386/pc.c                           load_linux                         stack usage is 8336 bytes
hw/net/rtl8139.c                       rtl8139_transmit_one               stack usage is 8256 bytes
hw/net/xgmac.c                         xgmac_enet_send                    stack usage is 8288 bytes
hw/s390x/s390-pci-inst.c               clp_service_call                   stack usage is 8384 bytes
hw/sparc64/sun4u.c                     sun4u_NVRAM_set_params             stack usage is 8272 bytes
hw/sparc/sun4m.c                       nvram_init                         stack usage is 8272 bytes
hw/vfio/pci.c                          vfio_initfn                        stack usage is 8592 bytes
linux-user/elfload.c                   elf_core_dump                      stack usage is 12640 bytes
linux-user/elfload.c                   elf_core_dump                      stack usage is 8544 bytes
linux-user/elfload.c                   vma_dump_size                      stack usage is 8224 bytes
net/net.c                              qemu_del_net_client                stack usage is 8256 bytes
net/net.c                              qmp_set_link                       stack usage is 8256 bytes



16384+:
block/vmdk.c                           vmdk_create                        stack usage is 29376 bytes
block/vmdk.c                           vmdk_write_cid                     stack usage is 20544 bytes
hw/arm/nseries.c                       n8x0_init                          stack usage is 65728 bytes
hw/char/virtio-serial-bus.c            control_out                        stack usage is 49312 bytes
hw/char/virtio-serial-bus.c            discard_vq_data                    stack usage is 49216 bytes
hw/char/virtio-serial-bus.c            send_control_msg                   stack usage is 49232 bytes
hw/char/virtio-serial-bus.c            write_to_port                      stack usage is 49264 bytes
hw/display/vmware_vga.c                vmsvga_fifo_run                    stack usage is 20688 bytes
hw/dma/xilinx_axidma.c                 stream_process_mem2s               stack usage is 16480 bytes
hw/net/opencores_eth.c                 open_eth_start_xmit                stack usage is 65600 bytes
hw/net/virtio-net.c                    virtio_net_flush_tx                stack usage is 65664 bytes
hw/net/virtio-net.c                    virtio_net_handle_ctrl             stack usage is 49424 bytes
hw/net/virtio-net.c                    virtio_net_handle_ctrl             stack usage is 49440 bytes
hw/net/virtio-net.c                    virtio_net_receive                 stack usage is 65728 bytes
hw/virtio/virtio-balloon.c             virtio_balloon_handle_output       stack usage is 49376 bytes
hw/virtio/virtio-rng.c                 chr_read                           stack usage is 49264 bytes
net/net.c                              nc_sendv_compat                    stack usage is 69680 bytes
net/socket.c                           net_socket_send                    stack usage is 69712 bytes
net/tap.c                              net_init_tap                       stack usage is 16704 bytes
ui/vnc-jobs.c                          vnc_worker_thread_loop             stack usage is 75632 bytes



"might be unbounded":
backends/baum.c                        baum_write_packet                  stack usage might be unbounded
backends/baum.c                        baum_write                         stack usage might be unbounded
backends/rng-random.c                  entropy_available                  stack usage might be unbounded
block/qapi.c                           dump_qobject                       stack usage might be unbounded
block/vpc.c                            vpc_co_write                       stack usage might be unbounded
exec.c                                 mem_commit                         stack usage might be unbounded
hw/block/nvme.c                        nvme_map_prp                       stack usage might be unbounded
hw/bt/l2cap.c                          l2cap_cframe_in                    stack usage might be unbounded
hw/i386/multiboot.c                    load_multiboot                     stack usage might be unbounded
hw/intc/xics.c                         ics_reset                          stack usage might be unbounded
hw/net/fsl_etsec/rings.c               etsec_walk_rx_ring                 stack usage might be unbounded
hw/net/spapr_llan.c                    h_send_logical_lan                 stack usage might be unbounded
hw/ppc/spapr.c                         spapr_fixup_cpu_dt                 stack usage might be unbounded
hw/ppc/spapr_pci.c                     spapr_phb_realize                  stack usage might be unbounded
hw/ssi/xilinx_spips.c                  xilinx_spips_flush_txfifo          stack usage might be unbounded
hw/usb/dev-hid.c                       usb_hid_handle_data                stack usage might be unbounded
hw/usb/dev-mtp.c                       usb_mtp_add_str                    stack usage might be unbounded
hw/usb/dev-mtp.c                       usb_mtp_handle_data                stack usage might be unbounded
hw/usb/dev-wacom.c                     usb_wacom_handle_data              stack usage might be unbounded
hw/usb/hcd-xhci.c                      xhci_process_commands              stack usage might be unbounded
hw/usb/redirect.c                      usbredir_handle_data               stack usage might be unbounded
linux-user/elfload.c                   load_elf_image                     stack usage might be unbounded
linux-user/elfload.c                   load_symbols                       stack usage might be unbounded
linux-user/syscall.c                   do_accept4                         stack usage might be unbounded
linux-user/syscall.c                   do_bind                            stack usage might be unbounded
linux-user/syscall.c                   do_connect                         stack usage might be unbounded
linux-user/syscall.c                   do_getpeername                     stack usage might be unbounded
linux-user/syscall.c                   do_getsockname                     stack usage might be unbounded
linux-user/syscall.c                   do_recvfrom                        stack usage might be unbounded
linux-user/syscall.c                   do_sendrecvmsg_locked              stack usage might be unbounded
linux-user/syscall.c                   do_sendto                          stack usage might be unbounded
linux-user/syscall.c                   do_setsockopt                      stack usage might be unbounded
linux-user/syscall.c                   do_syscall                         stack usage might be unbounded
net/tap.c                              tap_receive_iov                    stack usage might be unbounded
qemu-char.c                            tcp_chr_write                      stack usage might be unbounded
target-arm/helper.c                    helper_dc_zva                      stack usage might be unbounded
ui/vnc-enc-hextile-template.h          send_hextile_tile_32               stack usage might be unbounded
ui/vnc-enc-hextile-template.h          send_hextile_tile_generic_32       stack usage might be unbounded
ui/vnc-enc-tight.c                     send_palette_rect                  stack usage might be unbounded
util/iov.c                             qemu_iovec_clone                   stack usage might be unbounded
Kevin Wolf Jan. 14, 2015, 1:38 p.m. UTC | #4
Am 14.01.2015 um 12:18 hat Paolo Bonzini geschrieben:
> 
> 
> On 14/01/2015 11:20, Kevin Wolf wrote:
> >> > The same problem applies to coroutine stacks, and those cannot be
> >> > throttled down as easily.  But I guess if you limit the number of
> >> > threads, the guest gets slowed down and doesn't create as many coroutines.
> > Shouldn't we rather try and decrease the stack sizes a bit? 1 MB per
> > coroutine is really a lot, and as I understand it, threads take even
> > more by default.
> 
> Yup, 2 MB.  Last time I proposed this, I think Markus was strongly in 
> the "better safe than sorry" camp. :)
> 
> But thread pool workers definitely don't need a big stack.

Right, I think we need to consider what kind of thread it is. For the
moment, I'm talking about the block layer only.

> >> > It would be nice to have a way to measure coroutine stack usage, similar
> >> > to what the kernel does.
> > The information which pages are mapped should be there somewhere...
> 
> Yes, there is mincore(2).  The complicated part is doing it fast, but
> perhaps it doesn't need to be fast.

Well, what do you want to use it for? I thought it would only be for a
one-time check where we usually end up rather than something that would
be enabled in production, but maybe I misunderstood.

> I tried gathering warnings from GCC's -Wstack-usage=1023 option and the
> block layer does not seem to have functions with huge stacks in the I/O
> path.
> 
> So, assuming a maximum stack depth of 50 (already pretty generous since
> there shouldn't be any recursive calls) a 100K stack should be pretty
> much okay for coroutines and thread-pool threads.

The potential problem in the block layer is long backing file chains.
Perhaps we need to do something to solve that iteratively instead of
recursively.

> That said there are some offenders in the device models.  Other
> QemuThreads, especially VCPU threads, had better stay with a big stack.

Yes, that's not exactly surprising.

Kevin
Paolo Bonzini Jan. 14, 2015, 1:49 p.m. UTC | #5
On 14/01/2015 14:38, Kevin Wolf wrote:
> Well, what do you want to use it for? I thought it would only be for a
> one-time check where we usually end up rather than something that would
> be enabled in production, but maybe I misunderstood.

No, you didn't.  Though I guess we could limit the checks to the yield
points.  If we have BDS recursion, as in the backing file case, yield
points should not be far from the deepest part of the stack.

Another possibility (which cannot be enabled in production) is to fill
the stack with a known 64-bit value, and do a binary search when the
coroutine is destroyed.

>> I tried gathering warnings from GCC's -Wstack-usage=1023 option and the
>> block layer does not seem to have functions with huge stacks in the I/O
>> path.
>>
>> So, assuming a maximum stack depth of 50 (already pretty generous since
>> there shouldn't be any recursive calls) a 100K stack should be pretty
>> much okay for coroutines and thread-pool threads.
> 
> The potential problem in the block layer is long backing file chains.
> Perhaps we need to do something to solve that iteratively instead of
> recursively.

Basically first read stuff from the current BDS, and then "fill in the
blanks" with a tail call on bs->backing_file?  That would be quite a
change, and we'd need a stopgap measure like Alex's patch in the meanwhile.

Paolo
Kevin Wolf Jan. 14, 2015, 2:07 p.m. UTC | #6
Am 14.01.2015 um 14:49 hat Paolo Bonzini geschrieben:
> 
> 
> On 14/01/2015 14:38, Kevin Wolf wrote:
> > Well, what do you want to use it for? I thought it would only be for a
> > one-time check where we usually end up rather than something that would
> > be enabled in production, but maybe I misunderstood.
> 
> No, you didn't.  Though I guess we could limit the checks to the yield
> points.  If we have BDS recursion, as in the backing file case, yield
> points should not be far from the deepest part of the stack.
> 
> Another possibility (which cannot be enabled in production) is to fill
> the stack with a known 64-bit value, and do a binary search when the
> coroutine is destroyed.

Maybe that's the easiest one, yes.

> >> I tried gathering warnings from GCC's -Wstack-usage=1023 option and the
> >> block layer does not seem to have functions with huge stacks in the I/O
> >> path.
> >>
> >> So, assuming a maximum stack depth of 50 (already pretty generous since
> >> there shouldn't be any recursive calls) a 100K stack should be pretty
> >> much okay for coroutines and thread-pool threads.
> > 
> > The potential problem in the block layer is long backing file chains.
> > Perhaps we need to do something to solve that iteratively instead of
> > recursively.
> 
> Basically first read stuff from the current BDS, and then "fill in the
> blanks" with a tail call on bs->backing_file?  That would be quite a
> change, and we'd need a stopgap measure like Alex's patch in the meanwhile.

Basically block.c would do something like get_block_status() first and
only then call the read/write functions of the individual drivers. But
yes, that's more a theoretical consideration at this point.

I think with the 50 recursions that you calculated we should be fine in
practice for now. I would however strongly recommend finally implementing
a guard page for coroutine stacks before we make that change.

Anyway, the thread pool workers aren't affected by any of this, so they
would be the obvious first step.

Kevin
Alexander Graf Jan. 14, 2015, 2:09 p.m. UTC | #7
On 01/14/15 15:07, Kevin Wolf wrote:
> Am 14.01.2015 um 14:49 hat Paolo Bonzini geschrieben:
>>
>> On 14/01/2015 14:38, Kevin Wolf wrote:
>>> Well, what do you want to use it for? I thought it would only be for a
>>> one-time check where we usually end up rather than something that would
>>> be enabled in production, but maybe I misunderstood.
>> No, you didn't.  Though I guess we could limit the checks to the yield
>> points.  If we have BDS recursion, as in the backing file case, yield
>> points should not be far from the deepest part of the stack.
>>
>> Another possibility (which cannot be enabled in production) is to fill
>> the stack with a known 64-bit value, and do a binary search when the
>> coroutine is destroyed.
> Maybe that's the easiest one, yes.
>
>>>> I tried gathering warnings from GCC's -Wstack-usage=1023 option and the
>>>> block layer does not seem to have functions with huge stacks in the I/O
>>>> path.
>>>>
>>>> So, assuming a maximum stack depth of 50 (already pretty generous since
>>>> there shouldn't be any recursive calls) a 100K stack should be pretty
>>>> much okay for coroutines and thread-pool threads.
>>> The potential problem in the block layer is long backing file chains.
>>> Perhaps we need to do something to solve that iteratively instead of
>>> recursively.
>> Basically first read stuff from the current BDS, and then "fill in the
>> blanks" with a tail call on bs->backing_file?  That would be quite a
>> change, and we'd need a stopgap measure like Alex's patch in the meanwhile.
> Basically block.c would do something like get_block_status() first and
> only then call the read/write functions of the individual drivers. But
> yes, that's more a theoretical consideration at this point.
>
> I think with the 50 recursions that you calculated we should be fine in
> practice for now. I would however strongly recommend finally implementing
> a guard page for coroutine stacks before we make that change.

We could just write-protect an extra mapped page with mprotect as a guard page, no?

> Anyway, the thread pool workers aren't affected by any of this, so they
> would be the obvious first step.

Yes, ideally we would have the maximum number of threads be runtime 
configurable too. That way you can adjust them to your workload.


Alex
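
[Editor's note: a rough sketch of the stack-painting idea Paolo describes
above — fill the stack with a marker when it is created, then binary-search
for the lowest clobbered word when the coroutine is destroyed. The names are
illustrative, not QEMU API, and the search can be fooled if a live stack word
happens to equal the marker, which is one reason this can't be enabled in
production.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Marker written into every word of a fresh stack. */
#define STACK_PAINT 0xdeadbeefcafef00dULL

/* Paint a downward-growing stack before its first use. */
static void stack_paint(uint64_t *stack, size_t nwords)
{
    for (size_t i = 0; i < nwords; i++) {
        stack[i] = STACK_PAINT;
    }
}

/* At teardown, binary-search for the lowest word that was overwritten.
 * Since the stack grows down, words [0, lo) should still hold the
 * marker and [lo, nwords) were (potentially) used. */
static size_t stack_usage_words(const uint64_t *stack, size_t nwords)
{
    size_t lo = 0, hi = nwords;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (stack[mid] == STACK_PAINT) {
            lo = mid + 1;       /* still painted: usage starts higher up */
        } else {
            hi = mid;           /* clobbered: usage reaches at least here */
        }
    }
    return nwords - lo;
}
```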
Markus Armbruster Jan. 14, 2015, 2:24 p.m. UTC | #8
Paolo Bonzini <pbonzini@redhat.com> writes:

> On 14/01/2015 11:20, Kevin Wolf wrote:
>>> > The same problem applies to coroutine stacks, and those cannot be
>>> > throttled down as easily.  But I guess if you limit the number of
>>> > threads, the guest gets slowed down and doesn't create as many coroutines.
>> Shouldn't we rather try and decrease the stack sizes a bit? 1 MB per
>> coroutine is really a lot, and as I understand it, threads take even
>> more by default.
>
> Yup, 2 MB.  Last time I proposed this, I think Markus was strongly in 
> the "better safe than sorry" camp. :)

Yup.  Running out of stack is a nasty failure mode.

> But thread pool workers definitely don't need a big stack.

When analysis leads to an upper bound for stack size, by all means use
it.

Absent rigorous analysis backed by testing, I recommend to throw address
space at the problem.

32 bit limitations can force us to throw less generously (and be less
safe) there.  Not a good reason to compromise safety on 64 bit machines
as well, though.

[...]
Kevin Wolf Jan. 15, 2015, 10 a.m. UTC | #9
Am 14.01.2015 um 15:09 hat Alexander Graf geschrieben:
> On 01/14/15 15:07, Kevin Wolf wrote:
> >Am 14.01.2015 um 14:49 hat Paolo Bonzini geschrieben:
> >>
> >>On 14/01/2015 14:38, Kevin Wolf wrote:
> >>>Well, what do you want to use it for? I thought it would only be for a
> >>>one-time check where we usually end up rather than something that would
> >>>be enabled in production, but maybe I misunderstood.
> >>No, you didn't.  Though I guess we could limit the checks to the yield
> >>points.  If we have BDS recursion, as in the backing file case, yield
> >>points should not be far from the deepest part of the stack.
> >>
> >>Another possibility (which cannot be enabled in production) is to fill
> >>the stack with a known 64-bit value, and do a binary search when the
> >>coroutine is destroyed.
> >Maybe that's the easiest one, yes.
> >
> >>>>I tried gathering warnings from GCC's -Wstack-usage=1023 option and the
> >>>>block layer does not seem to have functions with huge stacks in the I/O
> >>>>path.
> >>>>
> >>>>So, assuming a maximum stack depth of 50 (already pretty generous since
> >>>>there shouldn't be any recursive calls) a 100K stack should be pretty
> >>>>much okay for coroutines and thread-pool threads.
> >>>The potential problem in the block layer is long backing file chains.
> >>>Perhaps we need to do something to solve that iteratively instead of
> >>>recursively.
> >>Basically first read stuff from the current BDS, and then "fill in the
> >>blanks" with a tail call on bs->backing_file?  That would be quite a
> >>change, and we'd need a stopgap measure like Alex's patch in the meanwhile.
> >Basically block.c would do something like get_block_status() first and
> >only then call the read/write functions of the individual drivers. But
> >yes, that's more a theoretical consideration at this point.
> >
> >I think with the 50 recursions that you calculated we should be fine in
> >practice for now. I would however strongly recommend finally implementing
> >a guard page for coroutine stacks before we make that change.
> 
> We could just write-protect an extra mapped page with mprotect as a guard page, no?

Not just write protect, but PROT_NONE; but otherwise yes, I think that's
how it's usually done (or directly with mmap instead of modifying it
after the fact).

> >Anyway, the thread pool workers aren't affected by any of this, so they
> >would be the obvious first step.
> 
> Yes, ideally we would have the maximum number of threads be runtime
> configurable too. That way you can adjust them to your workload.

Should be easy enough to add an option to raw-{posix,win32} for that.

Kevin
Stefan Hajnoczi Feb. 12, 2015, 3:38 p.m. UTC | #10
On Wed, Jan 14, 2015 at 01:56:56AM +0100, Alexander Graf wrote:
> On hosts with limited virtual address space (32bit pointers), we can very
> easily run out of virtual memory with big thread pools.
> 
> Instead, we should limit ourselves to small pools to keep memory footprint
> low on those systems.
> 
> This patch fixes random VM stalls like
> 
>   (process:25114): GLib-ERROR **: gmem.c:103: failed to allocate 1048576 bytes
> 
> on 32bit ARM systems for me.
> 
> Signed-off-by: Alexander Graf <agraf@suse.de>
> ---
>  thread-pool.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/thread-pool.c b/thread-pool.c
> index e2cac8e..87a3ea9 100644
> --- a/thread-pool.c
> +++ b/thread-pool.c
> @@ -299,7 +299,12 @@ static void thread_pool_init_one(ThreadPool *pool, AioContext *ctx)
>      qemu_mutex_init(&pool->lock);
>      qemu_cond_init(&pool->worker_stopped);
>      qemu_sem_init(&pool->sem, 0);
> -    pool->max_threads = 64;
> +    if (sizeof(pool) == 4) {
> +        /* 32bit systems run out of virtual memory quickly */
> +        pool->max_threads = 4;
> +    } else {
> +        pool->max_threads = 64;
> +    }
>      pool->new_thread_bh = aio_bh_new(ctx, spawn_thread_bh_fn, pool);
>  
>      QLIST_INIT(&pool->head);

This patch has not been applied.

Have there been any changes in this area?

Stefan
Kevin Wolf Feb. 12, 2015, 3:59 p.m. UTC | #11
Am 12.02.2015 um 16:38 hat Stefan Hajnoczi geschrieben:
> On Wed, Jan 14, 2015 at 01:56:56AM +0100, Alexander Graf wrote:
> > On hosts with limited virtual address space (32bit pointers), we can very
> > easily run out of virtual memory with big thread pools.
> > 
> > Instead, we should limit ourselves to small pools to keep memory footprint
> > low on those systems.
> > 
> > This patch fixes random VM stalls like
> > 
> >   (process:25114): GLib-ERROR **: gmem.c:103: failed to allocate 1048576 bytes
> > 
> > on 32bit ARM systems for me.
> > 
> > Signed-off-by: Alexander Graf <agraf@suse.de>
> > ---
> >  thread-pool.c | 7 ++++++-
> >  1 file changed, 6 insertions(+), 1 deletion(-)
> > 
> > diff --git a/thread-pool.c b/thread-pool.c
> > index e2cac8e..87a3ea9 100644
> > --- a/thread-pool.c
> > +++ b/thread-pool.c
> > @@ -299,7 +299,12 @@ static void thread_pool_init_one(ThreadPool *pool, AioContext *ctx)
> >      qemu_mutex_init(&pool->lock);
> >      qemu_cond_init(&pool->worker_stopped);
> >      qemu_sem_init(&pool->sem, 0);
> > -    pool->max_threads = 64;
> > +    if (sizeof(pool) == 4) {
> > +        /* 32bit systems run out of virtual memory quickly */
> > +        pool->max_threads = 4;
> > +    } else {
> > +        pool->max_threads = 64;
> > +    }
> >      pool->new_thread_bh = aio_bh_new(ctx, spawn_thread_bh_fn, pool);
> >  
> >      QLIST_INIT(&pool->head);
> 
> This patch has not been applied.
> 
> Have there been any changes in this area?

I think the idea was to implement guard pages for coroutine stacks and
then decrease the stack size of both coroutines and worker threads on
32 bit.

Kevin

Patch
diff mbox

diff --git a/thread-pool.c b/thread-pool.c
index e2cac8e..87a3ea9 100644
--- a/thread-pool.c
+++ b/thread-pool.c
@@ -299,7 +299,12 @@  static void thread_pool_init_one(ThreadPool *pool, AioContext *ctx)
     qemu_mutex_init(&pool->lock);
     qemu_cond_init(&pool->worker_stopped);
     qemu_sem_init(&pool->sem, 0);
-    pool->max_threads = 64;
+    if (sizeof(pool) == 4) {
+        /* 32bit systems run out of virtual memory quickly */
+        pool->max_threads = 4;
+    } else {
+        pool->max_threads = 64;
+    }
     pool->new_thread_bh = aio_bh_new(ctx, spawn_thread_bh_fn, pool);
 
     QLIST_INIT(&pool->head);
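
[Editor's note: for readers puzzled by the `sizeof(pool) == 4` test above:
`pool` is a `ThreadPool *`, so its size is the host's pointer width — exactly
4 bytes on 32-bit hosts. A standalone illustration of the same heuristic:]

```c
#include <assert.h>

/* Pick a thread-pool size based on pointer width: 32-bit hosts have
 * scarce virtual address space, so keep the pool small there. */
static int default_max_threads(void)
{
    return sizeof(void *) == 4 ? 4 : 64;
}
```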