
Re: [RFC][PATCH] performance improvement for windows guests, running on top of virtio block device

Message ID 20100111134248.GA25622@lst.de
State New

Commit Message

Christoph Hellwig Jan. 11, 2010, 1:42 p.m. UTC
On Mon, Jan 11, 2010 at 10:30:53AM +0200, Avi Kivity wrote:
> The patch has potential to reduce performance on volumes with multiple 
> spindles.  Consider two processes issuing sequential reads into a RAID 
> array.  With this patch, the reads will be executed sequentially rather 
> than in parallel, so I think a follow-on patch to make the minimum depth 
> a parameter (set by the guest? the host?) would be helpful.

Let's think about the life cycle of I/O requests a bit.

We have an idle virtqueue (aka one virtio-blk device).  The first (read)
request comes in, we get the virtio notify from the guest, which calls
into virtio_blk_handle_output.  With the new code we now disable the
notify once we start processing the first request.  If the second
request hits the queue before we call into virtio_blk_get_request
the second time we're fine even with the new code as we keep picking it
up.  If however it hits after we leave virtio_blk_handle_output, but
before we complete the first request we do indeed introduce additional
latency.

So instead of disabling notify while requests are active we might want
to only disable it while we are inside virtio_blk_handle_output.
Something like the following minimally tested patch:

Comments

Anthony Liguori Jan. 11, 2010, 1:49 p.m. UTC | #1
On 01/11/2010 07:42 AM, Christoph Hellwig wrote:
> On Mon, Jan 11, 2010 at 10:30:53AM +0200, Avi Kivity wrote:
>    
>> The patch has potential to reduce performance on volumes with multiple
>> spindles.  Consider two processes issuing sequential reads into a RAID
>> array.  With this patch, the reads will be executed sequentially rather
>> than in parallel, so I think a follow-on patch to make the minimum depth
>> a parameter (set by the guest? the host?) would be helpful.
>>      
> Let's think about the life cycle of I/O requests a bit.
>
> We have an idle virtqueue (aka one virtio-blk device).  The first (read)
> request comes in, we get the virtio notify from the guest, which calls
> into virtio_blk_handle_output.  With the new code we now disable the
> notify once we start processing the first request.  If the second
> request hits the queue before we call into virtio_blk_get_request
> the second time we're fine even with the new code as we keep picking it
> up.  If however it hits after we leave virtio_blk_handle_output, but
> before we complete the first request we do indeed introduce additional
> latency.
>
> So instead of disabling notify while requests are active we might want
> to only disable it while we are inside virtio_blk_handle_output.
> Something like the following minimally tested patch:
>    

I'd suggest that we get even more aggressive and install an idle bottom 
half that checks the queue for newly submitted requests.  If we keep 
getting requests submitted before a new one completes, we'll never take 
an I/O exit.

The same approach is probably a good idea for virtio-net.

Regards,

Anthony Liguori
Avi Kivity Jan. 11, 2010, 2:25 p.m. UTC | #2
On 01/11/2010 03:42 PM, Christoph Hellwig wrote:
> On Mon, Jan 11, 2010 at 10:30:53AM +0200, Avi Kivity wrote:
>    
>> The patch has potential to reduce performance on volumes with multiple
>> spindles.  Consider two processes issuing sequential reads into a RAID
>> array.  With this patch, the reads will be executed sequentially rather
>> than in parallel, so I think a follow-on patch to make the minimum depth
>> a parameter (set by the guest? the host?) would be helpful.
>>      
> Let's think about the life cycle of I/O requests a bit.
>
> We have an idle virtqueue (aka one virtio-blk device).  The first (read)
> request comes in, we get the virtio notify from the guest, which calls
> into virtio_blk_handle_output.  With the new code we now disable the
> notify once we start processing the first request.  If the second
> request hits the queue before we call into virtio_blk_get_request
> the second time we're fine even with the new code as we keep picking it
> up.  If however it hits after we leave virtio_blk_handle_output, but
> before we complete the first request we do indeed introduce additional
> latency.
>
> So instead of disabling notify while requests are active we might want
> to only disable it while we are inside virtio_blk_handle_output.
> Something like the following minimally tested patch:
>
>
> Index: qemu/hw/virtio-blk.c
> ===================================================================
> --- qemu.orig/hw/virtio-blk.c	2010-01-11 14:28:42.896010503 +0100
> +++ qemu/hw/virtio-blk.c	2010-01-11 14:40:13.535256353 +0100
> @@ -328,7 +328,15 @@ static void virtio_blk_handle_output(Vir
>       int num_writes = 0;
>       BlockDriverState *old_bs = NULL;
>
> +    /*
> +     * While we are processing requests there is no need to get further
> +     * notifications from the guest - it'll just burn cpu cycles doing
> +     * useless context switches into the host.
> +     */
> +    virtio_queue_set_notification(s->vq, 0);
> +
>       while ((req = virtio_blk_get_request(s))) {
> +handle_request:
>           if (req->elem.out_num < 1 || req->elem.in_num < 1) {
>               fprintf(stderr, "virtio-blk missing headers\n");
>               exit(1);
> @@ -358,6 +366,18 @@ static void virtio_blk_handle_output(Vir
>           }
>       }
>
> +    /*
> +     * Once we're done processing all pending requests re-enable the queue
> +     * notification.  If there's an entry pending after we enabled
> +     * notification again we hit a race and need to process it before
> +     * returning.
> +     */
> +    virtio_queue_set_notification(s->vq, 1);
> +    req = virtio_blk_get_request(s);
> +    if (req) {
> +        goto handle_request;
> +    }
> +
>    

I don't think this will have much effect.  First, the time spent in 
virtio_blk_handle_output() is a small fraction of total guest time, so 
the probability of the guest hitting the window in which notifications are disabled is 
low.  Second, while we're in that function, the vcpu that kicked us is 
stalled, and other vcpus are likely to be locked out of the queue by the 
guest.
Avi Kivity Jan. 11, 2010, 2:29 p.m. UTC | #3
On 01/11/2010 03:49 PM, Anthony Liguori wrote:
>> So instead of disabling notify while requests are active we might want
>> to only disable it while we are inside virtio_blk_handle_output.
>> Something like the following minimally tested patch:
>
>
> I'd suggest that we get even more aggressive and install an idle 
> bottom half that checks the queue for newly submitted requests.  If we 
> keep getting requests submitted before a new one completes, we'll 
> never take an I/O exit.
>

That has the downside of bouncing a cache line on unrelated exits.  It 
probably doesn't matter with qemu as it is now, since it will bounce 
qemu_mutex, but it will hurt with large guests (especially if they have 
many rings).

IMO we should get things to work well without riding on unrelated exits, 
especially as we're trying to reduce those exits.

> The same approach is probably a good idea for virtio-net.

With vhost-net you don't see exits.
Anthony Liguori Jan. 11, 2010, 2:37 p.m. UTC | #4
On 01/11/2010 08:29 AM, Avi Kivity wrote:
> On 01/11/2010 03:49 PM, Anthony Liguori wrote:
>>> So instead of disabling notify while requests are active we might want
>>> to only disable it while we are inside virtio_blk_handle_output.
>>> Something like the following minimally tested patch:
>>
>>
>> I'd suggest that we get even more aggressive and install an idle 
>> bottom half that checks the queue for newly submitted requests.  If 
>> we keep getting requests submitted before a new one completes, we'll 
>> never take an I/O exit.
>>
>
> That has the downside of bouncing a cache line on unrelated exits.

The read and write sides of the ring are widely separated in physical 
memory specifically to avoid cache line bouncing.

>   It probably doesn't matter with qemu as it is now, since it will 
> bounce qemu_mutex, but it will hurt with large guests (especially if 
> they have many rings).
>
> IMO we should get things to work well without riding on unrelated 
> exits, especially as we're trying to reduce those exits.

A block I/O request can potentially be very, very long lived.  By 
serializing requests like this, there's a high likelihood that it's 
going to kill performance with anything capable of processing multiple 
requests.

OTOH, if we aggressively poll the ring when we have an opportunity to, 
there's very little down side to that and it addresses the serialization 
problem.

>> The same approach is probably a good idea for virtio-net.
>
> With vhost-net you don't see exits.

The point is, when we've disabled notification, we should poll on the 
ring for additional requests instead of waiting for one to complete 
before looking at another one.

Even with vhost-net, this logic is still applicable although potentially 
harder to achieve.

Regards,

Anthony Liguori
Avi Kivity Jan. 11, 2010, 2:46 p.m. UTC | #5
On 01/11/2010 04:37 PM, Anthony Liguori wrote:
>> That has the downside of bouncing a cache line on unrelated exits.
>
>
> The read and write sides of the ring are widely separated in physical 
> memory specifically to avoid cache line bouncing.

I meant, exits on random vcpus will cause the cacheline containing the 
notification disable flag to bounce around.  As it is, we read it on the 
vcpu that owns the queue and write it on that vcpu or the I/O thread.

>>   It probably doesn't matter with qemu as it is now, since it will 
>> bounce qemu_mutex, but it will hurt with large guests (especially if 
>> they have many rings).
>>
>> IMO we should get things to work well without riding on unrelated 
>> exits, especially as we're trying to reduce those exits.
>
> A block I/O request can potentially be very, very long lived.  By 
> serializing requests like this, there's a high likelihood that it's 
> going to kill performance with anything capable of processing multiple 
> requests.

Right, that's why I suggested having a queue depth at which disabling 
notification kicks in.  The patch hardcodes this depth to 1, unpatched 
qemu is infinite, a good value is probably spindle count + VAT.

> OTOH, if we aggressively poll the ring when we have an opportunity to, 
> there's very little down side to that and it addresses the 
> serialization problem.

But we can't guarantee that we'll get those opportunities, so it doesn't 
address the problem in a general way.  A guest that doesn't use hpet and 
only has a single virtio-blk device will not have any reason to exit to 
qemu.
Anthony Liguori Jan. 11, 2010, 3:13 p.m. UTC | #6
On 01/11/2010 08:46 AM, Avi Kivity wrote:
> On 01/11/2010 04:37 PM, Anthony Liguori wrote:
>>> That has the downside of bouncing a cache line on unrelated exits.
>>
>>
>> The read and write sides of the ring are widely separated in physical 
>> memory specifically to avoid cache line bouncing.
>
> I meant, exits on random vcpus will cause the cacheline containing the 
> notification disable flag to bounce around.  As it is, we read it on 
> the vcpu that owns the queue and write it on that vcpu or the I/O thread.

Bottom halves are always run from the IO thread.
>>>   It probably doesn't matter with qemu as it is now, since it will 
>>> bounce qemu_mutex, but it will hurt with large guests (especially if 
>>> they have many rings).
>>>
>>> IMO we should get things to work well without riding on unrelated 
>>> exits, especially as we're trying to reduce those exits.
>>
>> A block I/O request can potentially be very, very long lived.  By 
>> serializing requests like this, there's a high likelihood that it's 
>> going to kill performance with anything capable of processing 
>> multiple requests.
>
> Right, that's why I suggested having a queue depth at which disabling 
> notification kicks in.  The patch hardcodes this depth to 1, unpatched 
> qemu is infinite, a good value is probably spindle count + VAT.

That means we would need a user-visible option, which is quite unfortunate.

Also, that logic only really makes sense with cache=off.  With 
cache=writethrough, you can get pathological cases where you have an 
uncached access followed by cached accesses.  In fact, with read-ahead, 
this is probably not an uncommon scenario.

>> OTOH, if we aggressively poll the ring when we have an opportunity 
>> to, there's very little down side to that and it addresses the 
>> serialization problem.
>
> But we can't guarantee that we'll get those opportunities, so it 
> doesn't address the problem in a general way.  A guest that doesn't 
> use hpet and only has a single virtio-blk device will not have any 
> reason to exit to qemu.

We can mitigate this with a timer but honestly, we need to do perf 
measurements to see.  My feeling is that we will need some more 
aggressive form of polling than just waiting for IO completion.  I don't 
think queue depth is enough because it assumes that all requests are 
equal.  When dealing with cache=off or even just storage with its own 
cache, that's simply not the case.

Regards,

Anthony Liguori
Avi Kivity Jan. 11, 2010, 3:19 p.m. UTC | #7
On 01/11/2010 05:13 PM, Anthony Liguori wrote:
> On 01/11/2010 08:46 AM, Avi Kivity wrote:
>> On 01/11/2010 04:37 PM, Anthony Liguori wrote:
>>>> That has the downside of bouncing a cache line on unrelated exits.
>>>
>>>
>>> The read and write sides of the ring are widely separated in 
>>> physical memory specifically to avoid cache line bouncing.
>>
>> I meant, exits on random vcpus will cause the cacheline containing 
>> the notification disable flag to bounce around.  As it is, we read it 
>> on the vcpu that owns the queue and write it on that vcpu or the I/O 
>> thread.
>
> Bottom halves are always run from the IO thread.

Okay, so that won't be an issue.

>>>>   It probably doesn't matter with qemu as it is now, since it will 
>>>> bounce qemu_mutex, but it will hurt with large guests (especially 
>>>> if they have many rings).
>>>>
>>>> IMO we should get things to work well without riding on unrelated 
>>>> exits, especially as we're trying to reduce those exits.
>>>
>>> A block I/O request can potentially be very, very long lived.  By 
>>> serializing requests like this, there's a high likelihood that it's 
>>> going to kill performance with anything capable of processing 
>>> multiple requests.
>>
>> Right, that's why I suggested having a queue depth at which disabling 
>> notification kicks in.  The patch hardcodes this depth to 1, 
>> unpatched qemu is infinite, a good value is probably spindle count + 
>> VAT.
>
> That means we would need a user visible option which is quite 
> unfortunate.

We could guess it, perhaps.

> Also, that logic only really makes sense with cache=off.  With 
> cache=writethrough, you can get pathological cases where you have an 
> uncached access followed by cached accesses.  In fact, with 
> read-ahead, this is probably not an uncommon scenario.

So you'd increase the disable depth when cache!=none.

>>> OTOH, if we aggressively poll the ring when we have an opportunity 
>>> to, there's very little down side to that and it addresses the 
>>> serialization problem.
>>
>> But we can't guarantee that we'll get those opportunities, so it 
>> doesn't address the problem in a general way.  A guest that doesn't 
>> use hpet and only has a single virtio-blk device will not have any 
>> reason to exit to qemu.
>
> We can mitigate this with a timer but honestly, we need to do perf 
> measurements to see.  My feeling is that we will need some more 
> aggressive form of polling than just waiting for IO completion.  I 
> don't think queue depth is enough because it assumes that all requests 
> are equal.  When dealing with cache=off or even just storage with it's 
> own cache, that's simply not the case.

Maybe we can adapt behaviour dynamically based on how fast results come in.
Anthony Liguori Jan. 11, 2010, 3:22 p.m. UTC | #8
On 01/11/2010 09:19 AM, Avi Kivity wrote:
>>>> OTOH, if we aggressively poll the ring when we have an opportunity 
>>>> to, there's very little down side to that and it addresses the 
>>>> serialization problem.
>>>
>>> But we can't guarantee that we'll get those opportunities, so it 
>>> doesn't address the problem in a general way.  A guest that doesn't 
>>> use hpet and only has a single virtio-blk device will not have any 
>>> reason to exit to qemu.
>>
>> We can mitigate this with a timer but honestly, we need to do perf 
>> measurements to see.  My feeling is that we will need some more 
>> aggressive form of polling than just waiting for IO completion.  I 
>> don't think queue depth is enough because it assumes that all 
>> requests are equal.  When dealing with cache=off or even just storage 
>> with its own cache, that's simply not the case.
>
> Maybe we can adapt behaviour dynamically based on how fast results 
> come in.

Based on our experiences with virtio-net, what I'd suggest is to make a 
lot of tunable options (ring size, various tx mitigation schemes, 
timeout durations, etc) and then we can do some deep performance studies 
to see how things interact with each other.

I think we should do that before making any changes because I'm deeply 
concerned that we'll introduce significant performance regressions.

Regards,

Anthony Liguori
Avi Kivity Jan. 11, 2010, 3:31 p.m. UTC | #9
On 01/11/2010 05:22 PM, Anthony Liguori wrote:
>
> Based on our experiences with virtio-net, what I'd suggest is to make 
> a lot of tunable options (ring size, various tx mitigation schemes, 
> timeout durations, etc) and then we can do some deep performance 
> studies to see how things interact with each other.
>
> I think we should do that before making any changes because I'm deeply 
> concerned that we'll introduce significant performance regressions.
>

I agree.  We can start with this patch, with a tunable depth, defaulting 
to current behaviour.
Anthony Liguori Jan. 11, 2010, 3:32 p.m. UTC | #10
On 01/11/2010 09:31 AM, Avi Kivity wrote:
> On 01/11/2010 05:22 PM, Anthony Liguori wrote:
>>
>> Based on our experiences with virtio-net, what I'd suggest is to make 
>> a lot of tunable options (ring size, various tx mitigation schemes, 
>> timeout durations, etc) and then we can do some deep performance 
>> studies to see how things interact with each other.
>>
>> I think we should do that before making any changes because I'm 
>> deeply concerned that we'll introduce significant performance 
>> regressions.
>>
>
> I agree.  We can start with this patch, with a tunable depth, 
> defaulting to current behaviour.

I wouldn't be opposed to that provided we made it clear that these 
options were not supported long term.  I don't want management tools 
(like libvirt) to start relying on them.

Regards,

Anthony Liguori
Avi Kivity Jan. 11, 2010, 3:35 p.m. UTC | #11
On 01/11/2010 05:32 PM, Anthony Liguori wrote:
> On 01/11/2010 09:31 AM, Avi Kivity wrote:
>> On 01/11/2010 05:22 PM, Anthony Liguori wrote:
>>>
>>> Based on our experiences with virtio-net, what I'd suggest is to 
>>> make a lot of tunable options (ring size, various tx mitigation 
>>> schemes, timeout durations, etc) and then we can do some deep 
>>> performance studies to see how things interact with each other.
>>>
>>> I think we should do that before making any changes because I'm 
>>> deeply concerned that we'll introduce significant performance 
>>> regressions.
>>>
>>
>> I agree.  We can start with this patch, with a tunable depth, 
>> defaulting to current behaviour.
>
> I wouldn't be opposed to that provided we made it clear that these 
> options were not supported long term.  I don't want management tools 
> (like libvirt) to start relying on them.
>

x-option-name for experimental options?

-device disk,if=virtio,x-queue-depth-suppress-notify=4
Anthony Liguori Jan. 11, 2010, 3:38 p.m. UTC | #12
On 01/11/2010 09:35 AM, Avi Kivity wrote:
> On 01/11/2010 05:32 PM, Anthony Liguori wrote:
>> On 01/11/2010 09:31 AM, Avi Kivity wrote:
>>> On 01/11/2010 05:22 PM, Anthony Liguori wrote:
>>>>
>>>> Based on our experiences with virtio-net, what I'd suggest is to 
>>>> make a lot of tunable options (ring size, various tx mitigation 
>>>> schemes, timeout durations, etc) and then we can do some deep 
>>>> performance studies to see how things interact with each other.
>>>>
>>>> I think we should do that before making any changes because I'm 
>>>> deeply concerned that we'll introduce significant performance 
>>>> regressions.
>>>>
>>>
>>> I agree.  We can start with this patch, with a tunable depth, 
>>> defaulting to current behaviour.
>>
>> I wouldn't be opposed to that provided we made it clear that these 
>> options were not supported long term.  I don't want management tools 
>> (like libvirt) to start relying on them.
>>
>
> x-option-name for experimental options?
>
> -device disk,if=virtio,x-queue-depth-suppress-notify=4

Sounds reasonable to me.

Regards,

Anthony Liguori
Michael S. Tsirkin Jan. 11, 2010, 6:20 p.m. UTC | #13
On Mon, Jan 11, 2010 at 08:37:10AM -0600, Anthony Liguori wrote:
> On 01/11/2010 08:29 AM, Avi Kivity wrote:
>> On 01/11/2010 03:49 PM, Anthony Liguori wrote:
>>>> So instead of disabling notify while requests are active we might want
>>>> to only disable it while we are inside virtio_blk_handle_output.
>>>> Something like the following minimally tested patch:
>>>
>>>
>>> I'd suggest that we get even more aggressive and install an idle  
>>> bottom half that checks the queue for newly submitted requests.  If  
>>> we keep getting requests submitted before a new one completes, we'll  
>>> never take an I/O exit.
>>>
>>
>> That has the downside of bouncing a cache line on unrelated exits.
>
> The read and write sides of the ring are widely separated in physical  
> memory specifically to avoid cache line bouncing.
>
>>   It probably doesn't matter with qemu as it is now, since it will  
>> bounce qemu_mutex, but it will hurt with large guests (especially if  
>> they have many rings).
>>
>> IMO we should get things to work well without riding on unrelated  
>> exits, especially as we're trying to reduce those exits.
>
> A block I/O request can potentially be very, very long lived.  By  
> serializing requests like this, there's a high likelihood that it's  
> going to kill performance with anything capable of processing multiple  
> requests.
>
> OTOH, if we aggressively poll the ring when we have an opportunity to,  
> there's very little down side to that and it addresses the serialization  
> problem.
>
>>> The same approach is probably a good idea for virtio-net.
>>
>> With vhost-net you don't see exits.
>
> The point is, when we've disabled notification, we should poll on the  
> ring for additional requests instead of waiting for one to complete  
> before looking at another one.
>
> Even with vhost-net, this logic is still applicable although potentially  
> harder to achieve.
>
> Regards,
>
> Anthony Liguori
>

vhost-net does this already: it has a mode where it polls when skbs have
left the send queue: for tap this is when they have crossed the bridge, and
for a packet socket, when they have been transmitted.
Michael S. Tsirkin Jan. 11, 2010, 6:22 p.m. UTC | #14
On Mon, Jan 11, 2010 at 09:13:23AM -0600, Anthony Liguori wrote:
> On 01/11/2010 08:46 AM, Avi Kivity wrote:
>> On 01/11/2010 04:37 PM, Anthony Liguori wrote:
>>>> That has the downside of bouncing a cache line on unrelated exits.
>>>
>>>
>>> The read and write sides of the ring are widely separated in physical 
>>> memory specifically to avoid cache line bouncing.
>>
>> I meant, exits on random vcpus will cause the cacheline containing the  
>> notification disable flag to bounce around.  As it is, we read it on  
>> the vcpu that owns the queue and write it on that vcpu or the I/O 
>> thread.
>
> Bottom halves are always run from the IO thread.
>>>>   It probably doesn't matter with qemu as it is now, since it will  
>>>> bounce qemu_mutex, but it will hurt with large guests (especially 
>>>> if they have many rings).
>>>>
>>>> IMO we should get things to work well without riding on unrelated  
>>>> exits, especially as we're trying to reduce those exits.
>>>
>>> A block I/O request can potentially be very, very long lived.  By  
>>> serializing requests like this, there's a high likelihood that it's  
>>> going to kill performance with anything capable of processing  
>>> multiple requests.
>>
>> Right, that's why I suggested having a queue depth at which disabling  
>> notification kicks in.  The patch hardcodes this depth to 1, unpatched  
>> qemu is infinite, a good value is probably spindle count + VAT.
>
> That means we would need a user visible option which is quite unfortunate.
>
> Also, that logic only really makes sense with cache=off.  With  
> cache=writethrough, you can get pathological cases where you have an  
> uncached access followed by cached accesses.  In fact, with read-ahead,  
> this is probably not an uncommon scenario.
>
>>> OTOH, if we aggressively poll the ring when we have an opportunity  
>>> to, there's very little down side to that and it addresses the  
>>> serialization problem.
>>
>> But we can't guarantee that we'll get those opportunities, so it  
>> doesn't address the problem in a general way.  A guest that doesn't  
>> use hpet and only has a single virtio-blk device will not have any  
>> reason to exit to qemu.
>
> We can mitigate this with a timer but honestly, we need to do perf  
> measurements to see.  My feeling is that we will need some more  
> aggressive form of polling than just waiting for IO completion.  I don't  
> think queue depth is enough because it assumes that all requests are  
> equal.  When dealing with cache=off or even just storage with its own  
> cache, that's simply not the case.
>
> Regards,
>
> Anthony Liguori
>

BTW this is why vhost-net uses queue depth in bytes.

Patch

Index: qemu/hw/virtio-blk.c
===================================================================
--- qemu.orig/hw/virtio-blk.c	2010-01-11 14:28:42.896010503 +0100
+++ qemu/hw/virtio-blk.c	2010-01-11 14:40:13.535256353 +0100
@@ -328,7 +328,15 @@  static void virtio_blk_handle_output(Vir
     int num_writes = 0;
     BlockDriverState *old_bs = NULL;
 
+    /*
+     * While we are processing requests there is no need to get further
+     * notifications from the guest - it'll just burn cpu cycles doing
+     * useless context switches into the host.
+     */
+    virtio_queue_set_notification(s->vq, 0);
+
     while ((req = virtio_blk_get_request(s))) {
+handle_request:
         if (req->elem.out_num < 1 || req->elem.in_num < 1) {
             fprintf(stderr, "virtio-blk missing headers\n");
             exit(1);
@@ -358,6 +366,18 @@  static void virtio_blk_handle_output(Vir
         }
     }
 
+    /*
+     * Once we're done processing all pending requests re-enable the queue
+     * notification.  If there's an entry pending after we enabled
+     * notification again we hit a race and need to process it before
+     * returning.
+     */
+    virtio_queue_set_notification(s->vq, 1);
+    req = virtio_blk_get_request(s);
+    if (req) {
+        goto handle_request;
+    }
+
     if (num_writes > 0) {
         do_multiwrite(old_bs, blkreq, num_writes);
     }