Patchwork [+stable] block: don't attempt to merge overlapping requests

login
register
mail settings
Submitter Avi Kivity
Date May 18, 2010, 5:18 p.m.
Message ID <1274203124-14318-1-git-send-email-avi@redhat.com>
Download mbox | patch
Permalink /patch/52896/
State New
Headers show

Comments

Avi Kivity - May 18, 2010, 5:18 p.m.
The block multiwrite code pretends to be able to merge overlapping requests,
but doesn't do so in fact.  This leads to I/O errors (for example on mkfs
of a large virtio disk).

Signed-off-by: Avi Kivity <avi@redhat.com>
---
 block.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)
Stefan Hajnoczi - May 18, 2010, 7:22 p.m.
On Tue, May 18, 2010 at 6:18 PM, Avi Kivity <avi@redhat.com> wrote:
> The block multiwrite code pretends to be able to merge overlapping requests,
> but doesn't do so in fact.  This leads to I/O errors (for example on mkfs
> of a large virtio disk).

Are overlapping write requests correct guest behavior?  I thought the
ordering semantics require a flush between overlapping writes to
ensure A is written before B.

What cache= mode are you running?

Stefan
Stefan Hajnoczi - May 18, 2010, 7:37 p.m.
I just caught up on mails and saw you had already mentioned that
overlapping writes from the guest look fishy in the "the >1Tb block
issue".  Cache mode might still be interesting because it affects how
guest virtio-blk chooses queue ordering mode.

Stefan
Michael Tokarev - May 18, 2010, 7:41 p.m.
18.05.2010 23:37, Stefan Hajnoczi wrote:
> I just caught up on mails and saw you had already mentioned that
> overlapping writes from the guest look fishy in the "the>1Tb block
> issue".  Cache mode might still be interesting because it affects how
> guest virtio-blk chooses queue ordering mode.

What I can say for sure is that the issue mentioned (>1Tb block)
occurs regardless of the cache mode.  I tried all 3, with exactly
the same results (well, not entirely exactly, since the prob
depends on timing too, and timing is different depending on the
cache mode), and performed all further tests with cache=writeback
since it's the mode which lets the guest to finish mkfs'ing the
1.5Tb "disk" in a reasonable time (or else it takes hours).

But actually I don't quite see where that dependency is: guest
does not know how its data is cached on host...

/mjt
Stefan Hajnoczi - May 18, 2010, 8:19 p.m.
On Tue, May 18, 2010 at 8:41 PM, Michael Tokarev <mjt@tls.msk.ru> wrote:
> But actually I don't quite see where that dependency is: guest
> does not know how its data is cached on host...

I was thinking of this bit in linux-2.6:drivers/block/virtio_blk.c:

/* If barriers are supported, tell block layer that queue is ordered */
if (virtio_has_feature(vdev, VIRTIO_BLK_F_FLUSH))
    blk_queue_ordered(q, QUEUE_ORDERED_DRAIN_FLUSH,
                                  virtblk_prepare_flush);
else if (virtio_has_feature(vdev, VIRTIO_BLK_F_BARRIER))
    blk_queue_ordered(q, QUEUE_ORDERED_TAG, NULL);

I was wondering whether we have something wrong, causing the guest to
perform overlapping write requests without draining the queue or
flushing.

Stefan
Avi Kivity - May 19, 2010, 8:09 a.m.
On 05/18/2010 10:22 PM, Stefan Hajnoczi wrote:
> On Tue, May 18, 2010 at 6:18 PM, Avi Kivity<avi@redhat.com>  wrote:
>    
>> The block multiwrite code pretends to be able to merge overlapping requests,
>> but doesn't do so in fact.  This leads to I/O errors (for example on mkfs
>> of a large virtio disk).
>>      
> Are overlapping write requests correct guest behavior?  I thought the
> ordering semantics require a flush between overlapping writes to
> ensure A is written before B.
>
> What cache= mode are you running?
>    

writeback.
Stefan Hajnoczi - May 19, 2010, 9:01 a.m.
On Wed, May 19, 2010 at 9:09 AM, Avi Kivity <avi@redhat.com> wrote:
> On 05/18/2010 10:22 PM, Stefan Hajnoczi wrote:
>> What cache= mode are you running?
>
> writeback.

In the cache=writeback case the virtio-blk guest driver does:

blk_queue_ordered(q, QUEUE_ORDERED_DRAIN_FLUSH, ...)

Stefan
Avi Kivity - May 19, 2010, 9:06 a.m.
On 05/19/2010 12:01 PM, Stefan Hajnoczi wrote:
> On Wed, May 19, 2010 at 9:09 AM, Avi Kivity<avi@redhat.com>  wrote:
>    
>> On 05/18/2010 10:22 PM, Stefan Hajnoczi wrote:
>>      
>>> What cache= mode are you running?
>>>        
>> writeback.
>>      
> In the cache=writeback case the virtio-blk guest driver does:
>
> blk_queue_ordered(q, QUEUE_ORDERED_DRAIN_FLUSH, ...)
>    

I don't follow.  What's the implication?

btw I really dislike how the cache attribute (which I see as a pure host 
choice) is exposed to the guest.  It means we can't change caching mode 
on the fly (for example after live migration), or that changing caching 
mode during a restart may expose a previously hidden guest bug.
Stefan Hajnoczi - May 19, 2010, 9:23 a.m.
On Wed, May 19, 2010 at 10:06 AM, Avi Kivity <avi@redhat.com> wrote:
>> In the cache=writeback case the virtio-blk guest driver does:
>>
>> blk_queue_ordered(q, QUEUE_ORDERED_DRAIN_FLUSH, ...)
>>
>
> I don't follow.  What's the implication?

I was wondering whether the queue is incorrectly set to a mode where
overlapping write requests aren't being ordered.  Anyway, Christoph
says overlapping write requests can be issued so my theory is broken.

> btw I really dislike how the cache attribute (which I see as a pure host
> choice) is exposed to the guest.  It means we can't change caching mode on
> the fly (for example after live migration), or that changing caching mode
> during a restart may expose a previously hidden guest bug.

Christoph has mentioned wanting to make the write cache switchable
from inside the guest.

Stefan
Christoph Hellwig - May 19, 2010, 9:31 a.m.
On Wed, May 19, 2010 at 10:23:44AM +0100, Stefan Hajnoczi wrote:
> On Wed, May 19, 2010 at 10:06 AM, Avi Kivity <avi@redhat.com> wrote:
> >> In the cache=writeback case the virtio-blk guest driver does:
> >>
> >> blk_queue_ordered(q, QUEUE_ORDERED_DRAIN_FLUSH, ...)
> >>
> >
> > I don't follow. ?What's the implication?
> 
> I was wondering whether the queue is incorrectly set to a mode where
> overlapping write requests aren't being ordered.  Anyway, Christoph
> says overlapping write requests can be issued so my theory is broken.

They can happen, but won't during usual pagecache based writeback.  So
this should not happen for the pagecache based mke2fs workload.  It
could happen for a workload with mkfs that uses O_DIRECT.  And we
need to handle it either way.

And btw, our barrier handling for devices using multiwrite (aka virtio)
is utterly busted right now, patch will follow ASAP.

Patch

diff --git a/block.c b/block.c
index 48305b7..0e44e26 100644
--- a/block.c
+++ b/block.c
@@ -1956,8 +1956,8 @@  static int multiwrite_merge(BlockDriverState *bs, BlockRequest *reqs,
         int64_t oldreq_last = reqs[outidx].sector + reqs[outidx].nb_sectors;
 
         // This handles the cases that are valid for all block drivers, namely
-        // exactly sequential writes and overlapping writes.
-        if (reqs[i].sector <= oldreq_last) {
+        // exactly sequential writes
+        if (reqs[i].sector == oldreq_last) {
             merge = 1;
         }