Message ID: 1421197016-69426-1-git-send-email-agraf@suse.de
State: New
On 14/01/2015 01:56, Alexander Graf wrote:
> +    if (sizeof(pool) == 4) {
> +        /* 32bit systems run out of virtual memory quickly */
> +        pool->max_threads = 4;
> +    } else {
> +        pool->max_threads = 64;
> +    }

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>

The same problem applies to coroutine stacks, and those cannot be throttled down as easily. But I guess if you limit the number of threads, the guest gets slowed down and doesn't create as many coroutines.

It would be nice to have a way to measure coroutine stack usage, similar to what the kernel does.

Paolo
On 14.01.2015 at 08:37, Paolo Bonzini wrote:
> On 14/01/2015 01:56, Alexander Graf wrote:
> > +    if (sizeof(pool) == 4) {
> > +        /* 32bit systems run out of virtual memory quickly */
> > +        pool->max_threads = 4;
> > +    } else {
> > +        pool->max_threads = 64;
> > +    }
>
> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
>
> The same problem applies to coroutine stacks, and those cannot be
> throttled down as easily. But I guess if you limit the number of
> threads, the guest gets slowed down and doesn't create as many coroutines.

Shouldn't we rather try and decrease the stack sizes a bit? 1 MB per coroutine is really a lot, and as I understand it, threads take even more by default.

> It would be nice to have a way to measure coroutine stack usage, similar
> to what the kernel does.

The information which pages are mapped should be there somewhere...

Kevin
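For reference, a thread does not have to use the platform's default stack size: a pool worker can be created with an explicitly smaller stack via pthread attributes. The sketch below is illustrative only; the 256 KiB figure and the function names are assumptions, not QEMU code.

#include <pthread.h>

#define WORKER_STACK_SIZE (256 * 1024)   /* assumed to be plenty for pool work */

static void *worker_fn(void *opaque)
{
    (void)opaque;
    /* ... process thread pool requests ... */
    return NULL;
}

static int spawn_small_stack_worker(pthread_t *tid)
{
    pthread_attr_t attr;
    int ret;

    pthread_attr_init(&attr);
    /* Request a small stack instead of the default (often several MB) */
    ret = pthread_attr_setstacksize(&attr, WORKER_STACK_SIZE);
    if (ret == 0) {
        ret = pthread_create(tid, &attr, worker_fn, NULL);
    }
    pthread_attr_destroy(&attr);
    return ret;
}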
On 14/01/2015 11:20, Kevin Wolf wrote:
>> The same problem applies to coroutine stacks, and those cannot be
>> throttled down as easily. But I guess if you limit the number of
>> threads, the guest gets slowed down and doesn't create as many coroutines.
>
> Shouldn't we rather try and decrease the stack sizes a bit? 1 MB per
> coroutine is really a lot, and as I understand it, threads take even
> more by default.

Yup, 2 MB. Last time I proposed this, I think Markus was strongly in the "better safe than sorry" camp. :)

But thread pool workers definitely don't need a big stack.

>> It would be nice to have a way to measure coroutine stack usage, similar
>> to what the kernel does.
>
> The information which pages are mapped should be there somewhere...

Yes, there is mincore(2). The complicated part is doing it fast, but perhaps it doesn't need to be fast.

I tried gathering warnings from GCC's -Wstack-usage=1023 option, and the block layer does not seem to have functions with huge stacks in the I/O path.

So, assuming a maximum stack depth of 50 (already pretty generous, since there shouldn't be any recursive calls), a 100K stack should be pretty much okay for coroutines and thread-pool threads. That said, there are some offenders in the device models. Other QemuThreads, especially VCPU threads, had better stay with a big stack.

Paolo

1024-2048: block/linux-aio.c ioq_submit stack usage is 1104 bytes block/mirror.c mirror_run stack usage is 1280 bytes block/qapi.c bdrv_query_image_info stack usage is 1152 bytes block/vmdk.c vmdk_open_vmdk4 stack usage is 1840 bytes block/vpc.c vpc_create stack usage is 1152 bytes cpus.c qmp_memsave stack usage is 1120 bytes cpus.c qmp_pmemsave stack usage is 1120 bytes disas/sparc.c build_hash_table stack usage is 1072 bytes fsdev/virtfs-proxy-helper.c main stack usage is 1264 bytes hw/arm/highbank.c calxeda_init stack usage is 1200 bytes hw/arm/stellaris.c stellaris_init stack usage is 1120 bytes hw/arm/virt.c machvirt_init stack usage is 1264 bytes hw/block/dataplane/virtio-blk.c handle_notify stack usage is 1632 bytes hw/block/virtio-blk.c virtio_blk_dma_restart_bh stack usage is 1600 bytes hw/block/virtio-blk.c virtio_blk_handle_output stack usage is 1584 bytes hw/core/qdev.c qdev_get_legacy_property stack usage is 1104 bytes hw/net/vmxnet_tx_pkt.c vmxnet_tx_pkt_do_sw_fragmentation stack usage is 1168 bytes hw/ppc/e500.c ppce500_load_device_tree stack usage is 1072 bytes hw/scsi/megasas.c megasas_dcmd_ld_get_list stack usage is 1152 bytes hw/tpm/tpm_passthrough.c tpm_passthrough_test_tpmdev stack usage is 1216 bytes linux-user/main.c main stack usage is 1664 bytes linux-user/main.c main stack usage is 1696 bytes linux-user/main.c main stack usage is 1728 bytes linux-user/main.c main stack usage is 1744 bytes linux-user/main.c main stack usage is 1760 bytes linux-user/main.c main stack usage is 1776 bytes linux-user/main.c main stack usage is 1856 bytes linux-user/main.c main stack usage is 1872 bytes linux-user/main.c main stack usage is 1888 bytes linux-user/main.c main stack usage is 1952 bytes linux-user/main.c main stack usage is 1984 bytes linux-user/main.c main stack usage is 2032 bytes migration/vmstate.c get_unused_buffer stack usage is 1088 bytes monitor.c monitor_parse_command stack usage is 1424 bytes monitor.c parse_cmdline stack usage is 1136 bytes qemu-img.c img_rebase stack usage is 1248 bytes qobject/json-parser.c parse_error stack usage is 1280 bytes qobject/qjson.c to_json stack usage is 1136 bytes qtest.c qtest_send stack usage is 1280
bytes savevm.c do_savevm stack usage is 1424 bytes slirp/slirp.c if_encap stack usage is 1680 bytes target-i386/kvm.c kvm_get_apic stack usage is 1072 bytes target-i386/kvm.c kvm_put_apic stack usage is 1072 bytes target-xtensa/xtensa-semi.c helper_simcall stack usage is 1296 bytes trace/control.c trace_init_events stack usage is 1152 bytes ui/keymaps.c parse_keyboard_layout stack usage is 1152 bytes ui/vnc.c addr_to_string stack usage is 1136 bytes ui/vnc.c protocol_client_init stack usage is 1072 bytes ui/vnc.c qmp_query_vnc stack usage is 1344 bytes ui/vnc.c vnc_basic_info_get stack usage is 1136 bytes ui/vnc-enc-tight.c tight_detect_smooth_image16 stack usage is 1184 bytes ui/vnc-enc-tight.c tight_detect_smooth_image24 stack usage is 1152 bytes ui/vnc-enc-tight.c tight_detect_smooth_image32 stack usage is 1184 bytes ui/vnc-enc-zrle-template.c zrle_encode_tile15be stack usage is 1264 bytes ui/vnc-enc-zrle-template.c zrle_encode_tile15le stack usage is 1264 bytes ui/vnc-enc-zrle-template.c zrle_encode_tile16be stack usage is 1264 bytes ui/vnc-enc-zrle-template.c zrle_encode_tile16le stack usage is 1264 bytes ui/vnc-enc-zrle-template.c zrle_encode_tile24abe stack usage is 1200 bytes ui/vnc-enc-zrle-template.c zrle_encode_tile24ale stack usage is 1200 bytes ui/vnc-enc-zrle-template.c zrle_encode_tile24bbe stack usage is 1200 bytes ui/vnc-enc-zrle-template.c zrle_encode_tile24ble stack usage is 1200 bytes ui/vnc-enc-zrle-template.c zrle_encode_tile32be stack usage is 1184 bytes ui/vnc-enc-zrle-template.c zrle_encode_tile32le stack usage is 1184 bytes ui/vnc-enc-zrle-template.c zrle_encode_tile8ne stack usage is 1184 bytes util/qemu-option.c opts_do_parse stack usage is 1280 bytes util/qemu-option.c opts_parse stack usage is 1104 bytes 2048-4096: block.c bdrv_swap stack usage is 3920 bytes coroutine-ucontext.c qemu_coroutine_new stack usage is 2192 bytes hw/arm/musicpal.c eth_send stack usage is 2160 bytes hw/dma/etraxfs_dma.c channel_out_run stack usage is 2160 bytes hw/net/cadence_gem.c gem_receive stack usage is 2192 bytes hw/net/cadence_gem.c gem_transmit stack usage is 2176 bytes hw/net/eepro100.c tx_command stack usage is 2736 bytes hw/net/mcf_fec.c mcf_fec_do_tx stack usage is 2144 bytes hw/ppc/spapr_pci.c spapr_populate_pci_dt stack usage is 4032 bytes hw/scsi/megasas.c megasas_ctrl_get_info stack usage is 2176 bytes libcacard/vcard_emul_nss.c vcard_emul_rsa_op stack usage is 2224 bytes monitor.c file_completion stack usage is 3328 bytes monitor.c host_net_remove_completion stack usage is 2176 bytes monitor.c netdev_del_completion stack usage is 2112 bytes monitor.c set_link_completion stack usage is 2112 bytes qemu-char.c tcp_chr_connect stack usage is 2608 bytes qemu-nbd.c main stack usage is 2336 bytes slirp/misc.c fork_exec stack usage is 2176 bytes slirp/socket.c sosendoob stack usage is 2144 bytes target-arm/helper.c register_cp_regs_for_features stack usage is 3232 bytes target-arm/helper.c register_cp_regs_for_features stack usage is 3360 bytes target-i386/kvm.c kvm_get_msrs stack usage is 2448 bytes target-i386/kvm.c kvm_put_msrs stack usage is 2448 bytes ui/sdl.c sdl_update_caption stack usage is 2112 bytes util/qemu-config.c qemu_config_parse stack usage is 2416 bytes 4096-8192: block.c bdrv_commit stack usage is 4208 bytes block/raw-posix.c hdev_open stack usage is 4320 bytes block/vmdk.c vmdk_parse_extents stack usage is 4848 bytes block/vvfat.c check_directory_consistency stack usage is 5152 bytes gdbstub.c gdb_monitor_output stack usage is 4144 bytes hw/arm/boot.c 
set_kernel_args stack usage is 4176 bytes hw/audio/ac97.c read_audio stack usage is 4208 bytes hw/audio/ac97.c write_audio stack usage is 4208 bytes hw/audio/es1370.c es1370_transfer_audio stack usage is 4240 bytes hw/audio/gus.c GUS_read_DMA stack usage is 4192 bytes hw/audio/marvell_88w8618.c mv88w8618_audio_callback stack usage is 4160 bytes hw/audio/milkymist-ac97.c ac97_in_cb stack usage is 4176 bytes hw/audio/milkymist-ac97.c ac97_out_cb stack usage is 4176 bytes hw/audio/sb16.c write_audio stack usage is 4192 bytes hw/char/sclpconsole-lm.c process_mdb stack usage is 4144 bytes hw/s390x/sclp.c sclp_service_call stack usage is 4336 bytes hw/scsi/lsi53c895a.c lsi_memcpy stack usage is 4192 bytes hw/scsi/megasas.c megasas_dcmd_cfg_read stack usage is 4224 bytes hw/scsi/megasas.c megasas_dcmd_pd_get_list stack usage is 5808 bytes hw/tpm/tpm_passthrough.c tpm_passthrough_open_sysfs_cancel stack usage is 4144 bytes linux-user/elfload.c vma_dump_size stack usage is 4128 bytes linux-user/syscall.c do_openat stack usage is 4176 bytes net/tap.c net_bridge_run_helper stack usage is 4720 bytes qemu-bridge-helper.c parse_acl_file stack usage is 4176 bytes qemu-char.c fd_chr_read stack usage is 4160 bytes qemu-char.c pty_chr_read stack usage is 4160 bytes qemu-char.c qemu_chr_fe_printf stack usage is 4352 bytes qemu-char.c qemu_chr_open_pty stack usage is 4176 bytes qemu-char.c tcp_chr_read stack usage is 4192 bytes qga/commands-posix.c get_pci_driver stack usage is 4144 bytes qga/main.c channel_event_cb stack usage is 4160 bytes target-i386/kvm.c kvm_arch_init_vcpu stack usage is 4144 bytes target-s390x/ioinst.c ioinst_handle_chsc_scsc stack usage is 4112 bytes target-s390x/misc_helper.c helper_stsi stack usage is 4128 bytes ui/vnc-auth-sasl.c vnc_client_read_sasl stack usage is 4160 bytes util/oslib-posix.c qemu_init_exec_dir stack usage is 4144 bytes util/path.c init_paths stack usage is 4144 bytes 8192-16384: block/vmdk.c vmdk_parent_open stack usage is 10288 bytes block/vmdk.c vmdk_read_cid stack usage is 10304 bytes gdbstub.c gdb_handle_packet stack usage is 8336 bytes hw/acpi/core.c acpi_table_add stack usage is 8336 bytes hw/audio/cs4231a.c cs_write_audio stack usage is 12384 bytes hw/core/qdev-properties-system.c set_netdev stack usage is 8288 bytes hw/i386/kvm/pci-assign.c assign_failed_examine stack usage is 12480 bytes hw/i386/pc.c load_linux stack usage is 8336 bytes hw/net/rtl8139.c rtl8139_transmit_one stack usage is 8256 bytes hw/net/xgmac.c xgmac_enet_send stack usage is 8288 bytes hw/s390x/s390-pci-inst.c clp_service_call stack usage is 8384 bytes hw/sparc64/sun4u.c sun4u_NVRAM_set_params stack usage is 8272 bytes hw/sparc/sun4m.c nvram_init stack usage is 8272 bytes hw/vfio/pci.c vfio_initfn stack usage is 8592 bytes linux-user/elfload.c elf_core_dump stack usage is 12640 bytes linux-user/elfload.c elf_core_dump stack usage is 8544 bytes linux-user/elfload.c vma_dump_size stack usage is 8224 bytes net/net.c qemu_del_net_client stack usage is 8256 bytes net/net.c qmp_set_link stack usage is 8256 bytes 16384+: block/vmdk.c vmdk_create stack usage is 29376 bytes block/vmdk.c vmdk_write_cid stack usage is 20544 bytes hw/arm/nseries.c n8x0_init stack usage is 65728 bytes hw/char/virtio-serial-bus.c control_out stack usage is 49312 bytes hw/char/virtio-serial-bus.c discard_vq_data stack usage is 49216 bytes hw/char/virtio-serial-bus.c send_control_msg stack usage is 49232 bytes hw/char/virtio-serial-bus.c write_to_port stack usage is 49264 bytes hw/display/vmware_vga.c vmsvga_fifo_run 
stack usage is 20688 bytes hw/dma/xilinx_axidma.c stream_process_mem2s stack usage is 16480 bytes hw/net/opencores_eth.c open_eth_start_xmit stack usage is 65600 bytes hw/net/virtio-net.c virtio_net_flush_tx stack usage is 65664 bytes hw/net/virtio-net.c virtio_net_handle_ctrl stack usage is 49424 bytes hw/net/virtio-net.c virtio_net_handle_ctrl stack usage is 49440 bytes hw/net/virtio-net.c virtio_net_receive stack usage is 65728 bytes hw/virtio/virtio-balloon.c virtio_balloon_handle_output stack usage is 49376 bytes hw/virtio/virtio-rng.c chr_read stack usage is 49264 bytes net/net.c nc_sendv_compat stack usage is 69680 bytes net/socket.c net_socket_send stack usage is 69712 bytes net/tap.c net_init_tap stack usage is 16704 bytes ui/vnc-jobs.c vnc_worker_thread_loop stack usage is 75632 bytes "might be unbounded": backends/baum.c baum_write_packet stack usage might be unbounded backends/baum.c baum_write stack usage might be unbounded backends/rng-random.c entropy_available stack usage might be unbounded block/qapi.c dump_qobject stack usage might be unbounded block/vpc.c vpc_co_write stack usage might be unbounded exec.c mem_commit stack usage might be unbounded hw/block/nvme.c nvme_map_prp stack usage might be unbounded hw/bt/l2cap.c l2cap_cframe_in stack usage might be unbounded hw/i386/multiboot.c load_multiboot stack usage might be unbounded hw/intc/xics.c ics_reset stack usage might be unbounded hw/net/fsl_etsec/rings.c etsec_walk_rx_ring stack usage might be unbounded hw/net/spapr_llan.c h_send_logical_lan stack usage might be unbounded hw/ppc/spapr.c spapr_fixup_cpu_dt stack usage might be unbounded hw/ppc/spapr_pci.c spapr_phb_realize stack usage might be unbounded hw/ssi/xilinx_spips.c xilinx_spips_flush_txfifo stack usage might be unbounded hw/usb/dev-hid.c usb_hid_handle_data stack usage might be unbounded hw/usb/dev-mtp.c usb_mtp_add_str stack usage might be unbounded hw/usb/dev-mtp.c usb_mtp_handle_data stack usage might be unbounded hw/usb/dev-wacom.c usb_wacom_handle_data stack usage might be unbounded hw/usb/hcd-xhci.c xhci_process_commands stack usage might be unbounded hw/usb/redirect.c usbredir_handle_data stack usage might be unbounded linux-user/elfload.c load_elf_image stack usage might be unbounded linux-user/elfload.c load_symbols stack usage might be unbounded linux-user/syscall.c do_accept4 stack usage might be unbounded linux-user/syscall.c do_bind stack usage might be unbounded linux-user/syscall.c do_connect stack usage might be unbounded linux-user/syscall.c do_getpeername stack usage might be unbounded linux-user/syscall.c do_getsockname stack usage might be unbounded linux-user/syscall.c do_recvfrom stack usage might be unbounded linux-user/syscall.c do_sendrecvmsg_locked stack usage might be unbounded linux-user/syscall.c do_sendto stack usage might be unbounded linux-user/syscall.c do_setsockopt stack usage might be unbounded linux-user/syscall.c do_syscall stack usage might be unbounded net/tap.c tap_receive_iov stack usage might be unbounded qemu-char.c tcp_chr_write stack usage might be unbounded target-arm/helper.c helper_dc_zva stack usage might be unbounded ui/vnc-enc-hextile-template.h send_hextile_tile_32 stack usage might be unbounded ui/vnc-enc-hextile-template.h send_hextile_tile_generic_32 stack usage might be unbounded ui/vnc-enc-tight.c send_palette_rect stack usage might be unbounded util/iov.c qemu_iovec_clone stack usage might be unbounded
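As a rough illustration of the mincore(2) idea mentioned above, the following sketch counts how many pages of an mmap'ed, page-aligned coroutine stack have ever been faulted in. The function and its parameters are assumptions for illustration, not QEMU code, and the result is only a page-granular approximation of peak stack usage.

#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Report how many bytes of an anonymous stack mapping have been touched,
 * based on per-page residency information from mincore(2). */
static size_t stack_bytes_touched(void *stack, size_t stack_size)
{
    size_t page_size = sysconf(_SC_PAGESIZE);
    size_t npages = (stack_size + page_size - 1) / page_size;
    unsigned char *vec = malloc(npages);
    size_t touched = 0;

    /* 'stack' must be page aligned, which an mmap'ed stack already is */
    if (vec && mincore(stack, stack_size, vec) == 0) {
        for (size_t i = 0; i < npages; i++) {
            touched += vec[i] & 1;   /* bit 0: page is resident */
        }
    }
    free(vec);
    return touched * page_size;
}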
On 14.01.2015 at 12:18, Paolo Bonzini wrote:
> On 14/01/2015 11:20, Kevin Wolf wrote:
> >> The same problem applies to coroutine stacks, and those cannot be
> >> throttled down as easily. But I guess if you limit the number of
> >> threads, the guest gets slowed down and doesn't create as many coroutines.
> > Shouldn't we rather try and decrease the stack sizes a bit? 1 MB per
> > coroutine is really a lot, and as I understand it, threads take even
> > more by default.
>
> Yup, 2 MB. Last time I proposed this, I think Markus was strongly in
> the "better safe than sorry" camp. :)
>
> But thread pool workers definitely don't need a big stack.

Right, I think we need to consider what kind of thread it is. For the moment, I'm talking about the block layer only.

> >> It would be nice to have a way to measure coroutine stack usage, similar
> >> to what the kernel does.
> > The information which pages are mapped should be there somewhere...
>
> Yes, there is mincore(2). The complicated part is doing it fast, but
> perhaps it doesn't need to be fast.

Well, what do you want to use it for? I thought it would only be for a one-time check where we usually end up rather than something that would be enabled in production, but maybe I misunderstood.

> I tried gathering warnings from GCC's -Wstack-usage=1023 option and the
> block layer does not seem to have functions with huge stacks in the I/O
> path.
>
> So, assuming a maximum stack depth of 50 (already pretty generous since
> there shouldn't be any recursive calls) a 100K stack should be pretty
> much okay for coroutines and thread-pool threads.

The potential problem in the block layer is long backing file chains. Perhaps we need to do something to solve that iteratively instead of recursively.

> That said there are some offenders in the device models. Other
> QemuThreads, especially VCPU threads, had better stay with a big stack.

Yes, that's not exactly surprising.

Kevin
On 14/01/2015 14:38, Kevin Wolf wrote:
> Well, what do you want to use it for? I thought it would only be for a
> one-time check where we usually end up rather than something that would
> be enabled in production, but maybe I misunderstood.

No, you didn't. Though I guess we could limit the checks to the yield points. If we have BDS recursion, as in the backing file case, yield points should not be far from the deepest part of the stack.

Another possibility (which cannot be enabled in production) is to fill the stack with a known 64-bit value, and do a binary search when the coroutine is destroyed.

>> I tried gathering warnings from GCC's -Wstack-usage=1023 option and the
>> block layer does not seem to have functions with huge stacks in the I/O
>> path.
>>
>> So, assuming a maximum stack depth of 50 (already pretty generous since
>> there shouldn't be any recursive calls) a 100K stack should be pretty
>> much okay for coroutines and thread-pool threads.
>
> The potential problem in the block layer is long backing file chains.
> Perhaps we need to do something to solve that iteratively instead of
> recursively.

Basically first read stuff from the current BDS, and then "fill in the blanks" with a tail call on bs->backing_file? That would be quite a change, and we'd need a stopgap measure like Alex's patch in the meanwhile.

Paolo
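A minimal sketch of the fill-and-binary-search idea, assuming a downward-growing stack, a debug-only build, and that live stack data never happens to equal the chosen magic value; the constant and the function names here are illustrative, not QEMU's actual coroutine code.

#include <stddef.h>
#include <stdint.h>

#define STACK_MAGIC 0xdeadbeefcafebabeULL

/* Fill the freshly allocated stack with the magic pattern. */
static void stack_poison(uint64_t *stack, size_t words)
{
    for (size_t i = 0; i < words; i++) {
        stack[i] = STACK_MAGIC;
    }
}

/* On coroutine destruction, binary-search for the lowest word that was
 * overwritten. stack[0] is the lowest address; usage grows downwards from
 * stack[words - 1]. Returns the estimated peak stack usage in bytes. */
static size_t stack_high_water(const uint64_t *stack, size_t words)
{
    size_t lo = 0, hi = words;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (stack[mid] == STACK_MAGIC) {
            lo = mid + 1;   /* never touched; the deepest use is higher up */
        } else {
            hi = mid;       /* touched; the boundary is at or below mid */
        }
    }
    return (words - lo) * sizeof(uint64_t);
}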
On 14.01.2015 at 14:49, Paolo Bonzini wrote:
> On 14/01/2015 14:38, Kevin Wolf wrote:
> > Well, what do you want to use it for? I thought it would only be for a
> > one-time check where we usually end up rather than something that would
> > be enabled in production, but maybe I misunderstood.
>
> No, you didn't. Though I guess we could limit the checks to the yield
> points. If we have BDS recursion, as in the backing file case, yield
> points should not be far from the deepest part of the stack.
>
> Another possibility (which cannot be enabled in production) is to fill
> the stack with a known 64-bit value, and do a binary search when the
> coroutine is destroyed.

Maybe that's the easiest one, yes.

> >> I tried gathering warnings from GCC's -Wstack-usage=1023 option and the
> >> block layer does not seem to have functions with huge stacks in the I/O
> >> path.
> >>
> >> So, assuming a maximum stack depth of 50 (already pretty generous since
> >> there shouldn't be any recursive calls) a 100K stack should be pretty
> >> much okay for coroutines and thread-pool threads.
> >
> > The potential problem in the block layer is long backing file chains.
> > Perhaps we need to do something to solve that iteratively instead of
> > recursively.
>
> Basically first read stuff from the current BDS, and then "fill in the
> blanks" with a tail call on bs->backing_file? That would be quite a
> change, and we'd need a stopgap measure like Alex's patch in the meanwhile.

Basically block.c would do something like get_block_status() first and only then call the read/write functions of the individual drivers. But yes, that's more a theoretical consideration at this point.

I think with the 50 recursions that you calculated we should be fine in practice for now. I would however strongly recommend finally implementing a guard page for coroutine stacks before we make that change.

Anyway, the thread pool workers aren't affected by any of this, so they would be the obvious first step.

Kevin
On 01/14/15 15:07, Kevin Wolf wrote:
> On 14.01.2015 at 14:49, Paolo Bonzini wrote:
>>
>> On 14/01/2015 14:38, Kevin Wolf wrote:
>>> Well, what do you want to use it for? I thought it would only be for a
>>> one-time check where we usually end up rather than something that would
>>> be enabled in production, but maybe I misunderstood.
>> No, you didn't. Though I guess we could limit the checks to the yield
>> points. If we have BDS recursion, as in the backing file case, yield
>> points should not be far from the deepest part of the stack.
>>
>> Another possibility (which cannot be enabled in production) is to fill
>> the stack with a known 64-bit value, and do a binary search when the
>> coroutine is destroyed.
> Maybe that's the easiest one, yes.
>
>>>> I tried gathering warnings from GCC's -Wstack-usage=1023 option and the
>>>> block layer does not seem to have functions with huge stacks in the I/O
>>>> path.
>>>>
>>>> So, assuming a maximum stack depth of 50 (already pretty generous since
>>>> there shouldn't be any recursive calls) a 100K stack should be pretty
>>>> much okay for coroutines and thread-pool threads.
>>> The potential problem in the block layer is long backing file chains.
>>> Perhaps we need to do something to solve that iteratively instead of
>>> recursively.
>> Basically first read stuff from the current BDS, and then "fill in the
>> blanks" with a tail call on bs->backing_file? That would be quite a
>> change, and we'd need a stopgap measure like Alex's patch in the meanwhile.
> Basically block.c would do something like get_block_status() first and
> only then call the read/write functions of the individual drivers. But
> yes, that's more a theoretical consideration at this point.
>
> I think with the 50 recursions that you calculated we should be fine in
> practice for now. I would however strongly recommend finally implementing
> a guard page for coroutine stacks before we make that change.

We could just write mprotect an excessively mapped page as guard page, no?

> Anyway, the thread pool workers aren't affected by any of this, so they
> would be the obvious first step.

Yes, ideally we would have the maximum number of threads be runtime configurable too. That way you can adjust them to your workload.

Alex
Paolo Bonzini <pbonzini@redhat.com> writes:

> On 14/01/2015 11:20, Kevin Wolf wrote:
>>> The same problem applies to coroutine stacks, and those cannot be
>>> throttled down as easily. But I guess if you limit the number of
>>> threads, the guest gets slowed down and doesn't create as many coroutines.
>> Shouldn't we rather try and decrease the stack sizes a bit? 1 MB per
>> coroutine is really a lot, and as I understand it, threads take even
>> more by default.
>
> Yup, 2 MB. Last time I proposed this, I think Markus was strongly in
> the "better safe than sorry" camp. :)

Yup. Running out of stack is a nasty failure mode.

> But thread pool workers definitely don't need a big stack.

When analysis leads to an upper bound for stack size, by all means use it. Absent rigorous analysis backed by testing, I recommend to throw address space at the problem. 32 bit limitations can force us to throw less generously (and be less safe) there. Not a good reason to compromise safety on 64 bit machines as well, though.

[...]
On 14.01.2015 at 15:09, Alexander Graf wrote:
> On 01/14/15 15:07, Kevin Wolf wrote:
> > On 14.01.2015 at 14:49, Paolo Bonzini wrote:
> >>
> >> On 14/01/2015 14:38, Kevin Wolf wrote:
> >>> Well, what do you want to use it for? I thought it would only be for a
> >>> one-time check where we usually end up rather than something that would
> >>> be enabled in production, but maybe I misunderstood.
> >> No, you didn't. Though I guess we could limit the checks to the yield
> >> points. If we have BDS recursion, as in the backing file case, yield
> >> points should not be far from the deepest part of the stack.
> >>
> >> Another possibility (which cannot be enabled in production) is to fill
> >> the stack with a known 64-bit value, and do a binary search when the
> >> coroutine is destroyed.
> > Maybe that's the easiest one, yes.
> >
> >>>> I tried gathering warnings from GCC's -Wstack-usage=1023 option and the
> >>>> block layer does not seem to have functions with huge stacks in the I/O
> >>>> path.
> >>>>
> >>>> So, assuming a maximum stack depth of 50 (already pretty generous since
> >>>> there shouldn't be any recursive calls) a 100K stack should be pretty
> >>>> much okay for coroutines and thread-pool threads.
> >>> The potential problem in the block layer is long backing file chains.
> >>> Perhaps we need to do something to solve that iteratively instead of
> >>> recursively.
> >> Basically first read stuff from the current BDS, and then "fill in the
> >> blanks" with a tail call on bs->backing_file? That would be quite a
> >> change, and we'd need a stopgap measure like Alex's patch in the meanwhile.
> > Basically block.c would do something like get_block_status() first and
> > only then call the read/write functions of the individual drivers. But
> > yes, that's more a theoretical consideration at this point.
> >
> > I think with the 50 recursions that you calculated we should be fine in
> > practice for now. I would however strongly recommend finally implementing
> > a guard page for coroutine stacks before we make that change.
>
> We could just write mprotect an excessively mapped page as guard page, no?

Not just write protect, but PROT_NONE, but otherwise yes, I think that's how it's usually done (or directly with a mmap instead of modifying it after the fact).

> > Anyway, the thread pool workers aren't affected by any of this, so they
> > would be the obvious first step.
>
> Yes, ideally we would have the maximum number of threads be runtime
> configurable too. That way you can adjust them to your workload.

Should be easy enough to add an option to raw-{posix,win32} for that.

Kevin
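A minimal sketch of such a guard page, assuming an mmap'ed, downward-growing coroutine stack; the names and the abbreviated error handling are illustrative, not QEMU's actual coroutine code.

#include <sys/mman.h>
#include <unistd.h>

/* Allocate a stack with one PROT_NONE page at the low end, so that an
 * overflow faults instead of silently corrupting whatever is mapped below. */
static void *alloc_stack_with_guard(size_t stack_size)
{
    size_t page_size = sysconf(_SC_PAGESIZE);
    void *base = mmap(NULL, stack_size + page_size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (base == MAP_FAILED) {
        return NULL;
    }
    if (mprotect(base, page_size, PROT_NONE) != 0) {
        munmap(base, stack_size + page_size);
        return NULL;
    }
    return (char *)base + page_size;   /* usable stack starts above the guard */
}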
On Wed, Jan 14, 2015 at 01:56:56AM +0100, Alexander Graf wrote:
> On hosts with limited virtual address space (32bit pointers), we can very
> easily run out of virtual memory with big thread pools.
>
> Instead, we should limit ourselves to small pools to keep memory footprint
> low on those systems.
>
> This patch fixes random VM stalls like
>
>   (process:25114): GLib-ERROR **: gmem.c:103: failed to allocate 1048576 bytes
>
> on 32bit ARM systems for me.
>
> Signed-off-by: Alexander Graf <agraf@suse.de>
> ---
>  thread-pool.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/thread-pool.c b/thread-pool.c
> index e2cac8e..87a3ea9 100644
> --- a/thread-pool.c
> +++ b/thread-pool.c
> @@ -299,7 +299,12 @@ static void thread_pool_init_one(ThreadPool *pool, AioContext *ctx)
>      qemu_mutex_init(&pool->lock);
>      qemu_cond_init(&pool->worker_stopped);
>      qemu_sem_init(&pool->sem, 0);
> -    pool->max_threads = 64;
> +    if (sizeof(pool) == 4) {
> +        /* 32bit systems run out of virtual memory quickly */
> +        pool->max_threads = 4;
> +    } else {
> +        pool->max_threads = 64;
> +    }
>      pool->new_thread_bh = aio_bh_new(ctx, spawn_thread_bh_fn, pool);
>
>      QLIST_INIT(&pool->head);

This patch has not been applied.

Have there been any changes in this area?

Stefan
On 12.02.2015 at 16:38, Stefan Hajnoczi wrote:
> On Wed, Jan 14, 2015 at 01:56:56AM +0100, Alexander Graf wrote:
> > On hosts with limited virtual address space (32bit pointers), we can very
> > easily run out of virtual memory with big thread pools.
> >
> > Instead, we should limit ourselves to small pools to keep memory footprint
> > low on those systems.
> >
> > This patch fixes random VM stalls like
> >
> >   (process:25114): GLib-ERROR **: gmem.c:103: failed to allocate 1048576 bytes
> >
> > on 32bit ARM systems for me.
> >
> > Signed-off-by: Alexander Graf <agraf@suse.de>
> > ---
> >  thread-pool.c | 7 ++++++-
> >  1 file changed, 6 insertions(+), 1 deletion(-)
> >
> > diff --git a/thread-pool.c b/thread-pool.c
> > index e2cac8e..87a3ea9 100644
> > --- a/thread-pool.c
> > +++ b/thread-pool.c
> > @@ -299,7 +299,12 @@ static void thread_pool_init_one(ThreadPool *pool, AioContext *ctx)
> >      qemu_mutex_init(&pool->lock);
> >      qemu_cond_init(&pool->worker_stopped);
> >      qemu_sem_init(&pool->sem, 0);
> > -    pool->max_threads = 64;
> > +    if (sizeof(pool) == 4) {
> > +        /* 32bit systems run out of virtual memory quickly */
> > +        pool->max_threads = 4;
> > +    } else {
> > +        pool->max_threads = 64;
> > +    }
> >      pool->new_thread_bh = aio_bh_new(ctx, spawn_thread_bh_fn, pool);
> >
> >      QLIST_INIT(&pool->head);
>
> This patch has not been applied.
>
> Have there been any changes in this area?

I think the idea was to implement guard pages for coroutine stacks and then decrease the stack size of both coroutines and worker threads on 32 bit.

Kevin
diff --git a/thread-pool.c b/thread-pool.c
index e2cac8e..87a3ea9 100644
--- a/thread-pool.c
+++ b/thread-pool.c
@@ -299,7 +299,12 @@ static void thread_pool_init_one(ThreadPool *pool, AioContext *ctx)
     qemu_mutex_init(&pool->lock);
     qemu_cond_init(&pool->worker_stopped);
     qemu_sem_init(&pool->sem, 0);
-    pool->max_threads = 64;
+    if (sizeof(pool) == 4) {
+        /* 32bit systems run out of virtual memory quickly */
+        pool->max_threads = 4;
+    } else {
+        pool->max_threads = 64;
+    }
     pool->new_thread_bh = aio_bh_new(ctx, spawn_thread_bh_fn, pool);

     QLIST_INIT(&pool->head);
On hosts with limited virtual address space (32bit pointers), we can very easily run out of virtual memory with big thread pools.

Instead, we should limit ourselves to small pools to keep memory footprint low on those systems.

This patch fixes random VM stalls like

  (process:25114): GLib-ERROR **: gmem.c:103: failed to allocate 1048576 bytes

on 32bit ARM systems for me.

Signed-off-by: Alexander Graf <agraf@suse.de>
---
 thread-pool.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)