diff mbox series

virtio-blk: iothread-vq-mapping coroutine pool sizing

Message ID 20240311201423.387622-1-stefanha@redhat.com
State New
Headers show
Series virtio-blk: iothread-vq-mapping coroutine pool sizing | expand

Commit Message

Stefan Hajnoczi March 11, 2024, 8:14 p.m. UTC
It is possible to hit the sysctl vm.max_map_count limit when the
coroutine pool size becomes large. Each coroutine requires two mappings
(one for the stack and one for the guard page). QEMU can crash with
"failed to set up stack guard page" or "failed to allocate memory for
stack" when this happens.

Coroutine pool sizing is simple when there is only one AioContext: sum
up all I/O requests across all virtqueues.

When the iothread-vq-mapping option is used we should calculate tighter
bounds: take the maximum number of the number of I/O requests across all
virtqueues. This number is lower than simply summing all virtqueues when
only a subset of the virtqueues is handled by each AioContext.

This is not a solution to hitting vm.max_map_count, but it helps. A
guest with 64 vCPUs (hence 64 virtqueues) across 4 IOThreads with one
iothread-vq-mapping virtio-blk device and a root disk without goes from
pool_max_size 16,448 to 10,304.

Reported-by: Sanjay Rao <srao@redhat.com>
Reported-by: Boaz Ben Shabat <bbenshab@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 include/hw/virtio/virtio-blk.h |  2 ++
 hw/block/virtio-blk.c          | 34 ++++++++++++++++++++++++++++++++--
 2 files changed, 34 insertions(+), 2 deletions(-)

Comments

Kevin Wolf March 12, 2024, 11:24 a.m. UTC | #1
Am 11.03.2024 um 21:14 hat Stefan Hajnoczi geschrieben:
> It is possible to hit the sysctl vm.max_map_count limit when the
> coroutine pool size becomes large. Each coroutine requires two mappings
> (one for the stack and one for the guard page). QEMU can crash with
> "failed to set up stack guard page" or "failed to allocate memory for
> stack" when this happens.
> 
> Coroutine pool sizing is simple when there is only one AioContext: sum
> up all I/O requests across all virtqueues.
> 
> When the iothread-vq-mapping option is used we should calculate tighter
> bounds: take the maximum number of the number of I/O requests across all
> virtqueues. This number is lower than simply summing all virtqueues when
> only a subset of the virtqueues is handled by each AioContext.

The reasoning is that each thread has its own coroutine pool for which
the pool size applies individually, and it doesn't need to have space
for coroutines running in a different thread, right? I'd like to have
this recorded in the commit message.

Of course, this also makes me wonder if a global coroutine pool size
really makes sense or if it should be per thread. One thread could be
serving only one queue (maybe the main thread with a CD-ROM device) and
another thread 32 queues (the iothread with the interesting disks).
There is no reason for the first thread to have a coroutine pool as big
as the second one.

But before we make the size thread-local, maybe having thread-local
pools wasn't right to begin with because multiple threads can run main
context code and they should therefore share the same coroutine pool (we
already had the problem earlier that coroutines start on the vcpu thread
and terminate on the main thread and this plays havoc with coroutine
pools).

Maybe per-AioContext pools with per-AioContext sizes would make more
sense?

> This is not a solution to hitting vm.max_map_count, but it helps. A
> guest with 64 vCPUs (hence 64 virtqueues) across 4 IOThreads with one
> iothread-vq-mapping virtio-blk device and a root disk without goes from
> pool_max_size 16,448 to 10,304.
> 
> Reported-by: Sanjay Rao <srao@redhat.com>
> Reported-by: Boaz Ben Shabat <bbenshab@redhat.com>
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>

Either way, this should already strictly improve the situation, so I'm
happy to apply this change for now.

Kevin
Stefan Hajnoczi March 12, 2024, 11:55 a.m. UTC | #2
On Tue, Mar 12, 2024 at 12:24:30PM +0100, Kevin Wolf wrote:
> Am 11.03.2024 um 21:14 hat Stefan Hajnoczi geschrieben:
> > It is possible to hit the sysctl vm.max_map_count limit when the
> > coroutine pool size becomes large. Each coroutine requires two mappings
> > (one for the stack and one for the guard page). QEMU can crash with
> > "failed to set up stack guard page" or "failed to allocate memory for
> > stack" when this happens.
> > 
> > Coroutine pool sizing is simple when there is only one AioContext: sum
> > up all I/O requests across all virtqueues.
> > 
> > When the iothread-vq-mapping option is used we should calculate tighter
> > bounds: take the maximum number of the number of I/O requests across all
> > virtqueues. This number is lower than simply summing all virtqueues when
> > only a subset of the virtqueues is handled by each AioContext.
> 
> The reasoning is that each thread has its own coroutine pool for which
> the pool size applies individually, and it doesn't need to have space
> for coroutines running in a different thread, right? I'd like to have
> this recorded in the commit message.

Right, I'll update the commit description to mention this.

I also forgot to include Joe Mario in the Reported-by tags. Joe has been
helping investigate the issue.

> Of course, this also makes me wonder if a global coroutine pool size
> really makes sense or if it should be per thread. One thread could be
> serving only one queue (maybe the main thread with a CD-ROM device) and
> another thread 32 queues (the iothread with the interesting disks).
> There is no reason for the first thread to have a coroutine pool as big
> as the second one.

Agreed. The main loop thread, in particular, has very different
coroutine pool sizing requirements from the IOThreads that are
performing the bulk of the I/O.

> But before we make the size thread-local, maybe having thread-local
> pools wasn't right to begin with because multiple threads can run main
> context code and they should therefore share the same coroutine pool (we
> already had the problem earlier that coroutines start on the vcpu thread
> and terminate on the main thread and this plays havoc with coroutine
> pools).
> 
> Maybe per-AioContext pools with per-AioContext sizes would make more
> sense?

That's a good observation. Originally I hoped we could keep the
coroutine code independent of AioContext, but it is already connected in
other ways (e.g. co->ctx) so there's no reason to avoid it.

I'm not happy with how coroutine pools perform. I reproduced the
max_map_count failures that Sanjay, Boaz, and Joe have encountered and
found that one IOThread had 10k coroutines pooled while another
IOThread's pool was empty. Using timers to decay the pool size when it's
not actively in use has come to mind.

On the other hand, the fundamental problem is that the guest has access
to more virtqueue space than QEMU can allocate coroutines. Adding timers
to decay the pool size might hide issues but a guest could fill all
virtqueues at once and still crash.

QEMU needs to be reliable and run out of resources. There is no
accounting code in place capable of tracking all possible coroutine
usage, but maybe QEMU should refuse to launch when the virtqueue
resources promised to the guest exceed vm.max_map_count...

> > This is not a solution to hitting vm.max_map_count, but it helps. A
> > guest with 64 vCPUs (hence 64 virtqueues) across 4 IOThreads with one
> > iothread-vq-mapping virtio-blk device and a root disk without goes from
> > pool_max_size 16,448 to 10,304.
> > 
> > Reported-by: Sanjay Rao <srao@redhat.com>
> > Reported-by: Boaz Ben Shabat <bbenshab@redhat.com>
> > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> 
> Either way, this should already strictly improve the situation, so I'm
> happy to apply this change for now.

Thanks, I will send a v2.

Stefan
diff mbox series

Patch

diff --git a/include/hw/virtio/virtio-blk.h b/include/hw/virtio/virtio-blk.h
index 5c14110c4b..ac29700ad4 100644
--- a/include/hw/virtio/virtio-blk.h
+++ b/include/hw/virtio/virtio-blk.h
@@ -74,6 +74,8 @@  struct VirtIOBlock {
     uint64_t host_features;
     size_t config_size;
     BlockRAMRegistrar blk_ram_registrar;
+
+    unsigned coroutine_pool_size;
 };
 
 typedef struct VirtIOBlockReq {
diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index 738cb2ac36..0a14b2b175 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -1957,6 +1957,35 @@  static void virtio_blk_stop_ioeventfd(VirtIODevice *vdev)
     s->ioeventfd_stopping = false;
 }
 
+/* Increase the coroutine pool size to include our I/O requests */
+static void virtio_blk_inc_coroutine_pool_size(VirtIOBlock *s)
+{
+    VirtIOBlkConf *conf = &s->conf;
+    unsigned max_requests = 0;
+
+    /* Tracks the total number of requests for AioContext */
+    g_autoptr(GHashTable) counters = g_hash_table_new(NULL, NULL);
+
+    /* Call this function after setting up vq_aio_context[] */
+    assert(s->vq_aio_context);
+
+    for (unsigned i = 0; i < conf->num_queues; i++) {
+        AioContext *ctx = s->vq_aio_context[i];
+        unsigned n = GPOINTER_TO_UINT(g_hash_table_lookup(counters, ctx));
+
+        n += conf->queue_size / 2; /* this is a heuristic */
+
+        g_hash_table_insert(counters, ctx, GUINT_TO_POINTER(n));
+
+        if (n > max_requests) {
+            max_requests = n;
+        }
+    }
+
+    qemu_coroutine_inc_pool_size(max_requests);
+    s->coroutine_pool_size = max_requests; /* stash it for ->unrealize() */
+}
+
 static void virtio_blk_device_realize(DeviceState *dev, Error **errp)
 {
     VirtIODevice *vdev = VIRTIO_DEVICE(dev);
@@ -2048,7 +2077,6 @@  static void virtio_blk_device_realize(DeviceState *dev, Error **errp)
     for (i = 0; i < conf->num_queues; i++) {
         virtio_add_queue(vdev, conf->queue_size, virtio_blk_handle_output);
     }
-    qemu_coroutine_inc_pool_size(conf->num_queues * conf->queue_size / 2);
 
     /* Don't start ioeventfd if transport does not support notifiers. */
     if (!virtio_device_ioeventfd_enabled(vdev)) {
@@ -2065,6 +2093,8 @@  static void virtio_blk_device_realize(DeviceState *dev, Error **errp)
         return;
     }
 
+    virtio_blk_inc_coroutine_pool_size(s);
+
     /*
      * This must be after virtio_init() so virtio_blk_dma_restart_cb() gets
      * called after ->start_ioeventfd() has already set blk's AioContext.
@@ -2096,7 +2126,7 @@  static void virtio_blk_device_unrealize(DeviceState *dev)
     for (i = 0; i < conf->num_queues; i++) {
         virtio_del_queue(vdev, i);
     }
-    qemu_coroutine_dec_pool_size(conf->num_queues * conf->queue_size / 2);
+    qemu_coroutine_dec_pool_size(s->coroutine_pool_size);
     qemu_mutex_destroy(&s->rq_lock);
     blk_ram_registrar_destroy(&s->blk_ram_registrar);
     qemu_del_vm_change_state_handler(s->change);