Patchwork [v2,6/7] qed: Read/write support

login
register
mail settings
Submitter Stefan Hajnoczi
Date Oct. 8, 2010, 3:48 p.m.
Message ID <1286552914-27014-7-git-send-email-stefanha@linux.vnet.ibm.com>
Download mbox | patch
Permalink /patch/67256/
State New
Headers show

Comments

Stefan Hajnoczi - Oct. 8, 2010, 3:48 p.m.
This patch implements the read/write state machine.  Operations are
fully asynchronous and multiple operations may be active at any time.

Allocating writes lock tables to ensure metadata updates do not
interfere with each other.  If two allocating writes need to update the
same L2 table they will run sequentially.  If two allocating writes need
to update different L2 tables they will run in parallel.

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
 Makefile.objs    |    1 +
 block/qed-lock.c |  124 ++++++++++++
 block/qed.c      |  593 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
 block/qed.h      |   43 ++++
 trace-events     |   10 +
 5 files changed, 769 insertions(+), 2 deletions(-)
 create mode 100644 block/qed-lock.c
Avi Kivity - Oct. 10, 2010, 9:10 a.m.
On 10/08/2010 05:48 PM, Stefan Hajnoczi wrote:
> This patch implements the read/write state machine.  Operations are
> fully asynchronous and multiple operations may be active at any time.
>
> Allocating writes lock tables to ensure metadata updates do not
> interfere with each other.  If two allocating writes need to update the
> same L2 table they will run sequentially.  If two allocating writes need
> to update different L2 tables they will run in parallel.
>

Shouldn't there be a flush between an allocating write and an L2 
update?  Otherwise a reuse of a cluster can move logical sectors from 
one place to another, causing a data disclosure.

Can be skipped if the new cluster is beyond the physical image size.

> +
> +/*
> + * Table locking works as follows:
> + *
> + * Reads and non-allocating writes do not acquire locks because they do not
> + * modify tables and only see committed L2 cache entries.

What about a non-allocating write that follows an allocating write?

1 Guest writes to sector 0
2 Host reads backing image (or supplies zeros), sectors 1-127
3 Host writes sectors 0-127
4 Guest writes sector 1
5 Host writes sector 1

There needs to be a barrier that prevents the host and the disk from 
reordering operations 3 and 5, or guest operation 4 is lost.  As far as 
the guest is concerned no overlapping writes were issued, so it isn't 
required to provide any barriers.

(based on the comment only, haven't read the code)
Stefan Hajnoczi - Oct. 11, 2010, 10:37 a.m.
On Sun, Oct 10, 2010 at 11:10:15AM +0200, Avi Kivity wrote:
>  On 10/08/2010 05:48 PM, Stefan Hajnoczi wrote:
> >This patch implements the read/write state machine.  Operations are
> >fully asynchronous and multiple operations may be active at any time.
> >
> >Allocating writes lock tables to ensure metadata updates do not
> >interfere with each other.  If two allocating writes need to update the
> >same L2 table they will run sequentially.  If two allocating writes need
> >to update different L2 tables they will run in parallel.
> >
> 
> Shouldn't there be a flush between an allocating write and an L2
> update?  Otherwise a reuse of a cluster can move logical sectors
> from one place to another, causing a data disclosure.
> 
> Can be skipped if the new cluster is beyond the physical image size.

Currently clusters are never reused and new clusters are always beyond
physical image size.  The only exception I can think of is when the
image file size is not a multiple of the cluster size and we round down
to the start of cluster.

In an implementation that supports TRIM or otherwise reuses clusters
this is a cost.

> >+
> >+/*
> >+ * Table locking works as follows:
> >+ *
> >+ * Reads and non-allocating writes do not acquire locks because they do not
> >+ * modify tables and only see committed L2 cache entries.
> 
> What about a non-allocating write that follows an allocating write?
> 
> 1 Guest writes to sector 0
> 2 Host reads backing image (or supplies zeros), sectors 1-127
> 3 Host writes sectors 0-127
> 4 Guest writes sector 1
> 5 Host writes sector 1
> 
> There needs to be a barrier that prevents the host and the disk from
> reordering operations 3 and 5, or guest operation 4 is lost.  As far
> as the guest is concerned no overlapping writes were issued, so it
> isn't required to provide any barriers.
> 
> (based on the comment only, haven't read the code)

There is no barrier between operations 3 and 5.  However, operation 5
only starts after operation 3 has completed because of table locking.
It is my understanding that *independent* requests may be reordered but
two writes to the *same* sector will not be reordered if write A
completes before write B is issued.

Imagine a test program that uses pwrite() to rewrite a counter many
times on disk.  When the program finishes it prints the counter
variable's last value.  This scenario is like operations 3 and 5 above.
If we read the counter back from disk it will be the final value, not
some intermediate value.  The writes will not be reordered.

Stefan
Avi Kivity - Oct. 11, 2010, 1:10 p.m.
On 10/11/2010 12:37 PM, Stefan Hajnoczi wrote:
> On Sun, Oct 10, 2010 at 11:10:15AM +0200, Avi Kivity wrote:
> >   On 10/08/2010 05:48 PM, Stefan Hajnoczi wrote:
> >  >This patch implements the read/write state machine.  Operations are
> >  >fully asynchronous and multiple operations may be active at any time.
> >  >
> >  >Allocating writes lock tables to ensure metadata updates do not
> >  >interfere with each other.  If two allocating writes need to update the
> >  >same L2 table they will run sequentially.  If two allocating writes need
> >  >to update different L2 tables they will run in parallel.
> >  >
> >
> >  Shouldn't there be a flush between an allocating write and an L2
> >  update?  Otherwise a reuse of a cluster can move logical sectors
> >  from one place to another, causing a data disclosure.
> >
> >  Can be skipped if the new cluster is beyond the physical image size.
>
> Currently clusters are never reused and new clusters are always beyond
> physical image size.  The only exception I can think of is when the
> image file size is not a multiple of the cluster size and we round down
> to the start of cluster.
>
> In an implementation that supports TRIM or otherwise reuses clusters
> this is a cost.
>
> >  >+
> >  >+/*
> >  >+ * Table locking works as follows:
> >  >+ *
> >  >+ * Reads and non-allocating writes do not acquire locks because they do not
> >  >+ * modify tables and only see committed L2 cache entries.
> >
> >  What about a non-allocating write that follows an allocating write?
> >
> >  1 Guest writes to sector 0
> >  2 Host reads backing image (or supplies zeros), sectors 1-127
> >  3 Host writes sectors 0-127
> >  4 Guest writes sector 1
> >  5 Host writes sector 1
> >
> >  There needs to be a barrier that prevents the host and the disk from
> >  reordering operations 3 and 5, or guest operation 4 is lost.  As far
> >  as the guest is concerned no overlapping writes were issued, so it
> >  isn't required to provide any barriers.
> >
> >  (based on the comment only, haven't read the code)
>
> There is no barrier between operations 3 and 5.  However, operation 5
> only starts after operation 3 has completed because of table locking.
> It is my understanding that *independent* requests may be reordered but
> two writes to the *same* sector will not be reordered if write A
> completes before write B is issued.
>
> Imagine a test program that uses pwrite() to rewrite a counter many
> times on disk.  When the program finishes it prints the counter
> variable's last value.  This scenario is like operations 3 and 5 above.
> If we read the counter back from disk it will be the final value, not
> some intermediate value.  The writes will not be reordered.

Yes, all that is needed is a barrier in program order.  So, operation 5 
is an allocating write as long as 3 hasn't returned? (at which point it 
becomes a non-allocating write)?
Stefan Hajnoczi - Oct. 11, 2010, 1:55 p.m.
On Mon, Oct 11, 2010 at 03:10:16PM +0200, Avi Kivity wrote:
>  On 10/11/2010 12:37 PM, Stefan Hajnoczi wrote:
> >On Sun, Oct 10, 2010 at 11:10:15AM +0200, Avi Kivity wrote:
> >>   On 10/08/2010 05:48 PM, Stefan Hajnoczi wrote:
> >>  >This patch implements the read/write state machine.  Operations are
> >>  >fully asynchronous and multiple operations may be active at any time.
> >>  >
> >>  >Allocating writes lock tables to ensure metadata updates do not
> >>  >interfere with each other.  If two allocating writes need to update the
> >>  >same L2 table they will run sequentially.  If two allocating writes need
> >>  >to update different L2 tables they will run in parallel.
> >>  >
> >>
> >>  Shouldn't there be a flush between an allocating write and an L2
> >>  update?  Otherwise a reuse of a cluster can move logical sectors
> >>  from one place to another, causing a data disclosure.
> >>
> >>  Can be skipped if the new cluster is beyond the physical image size.
> >
> >Currently clusters are never reused and new clusters are always beyond
> >physical image size.  The only exception I can think of is when the
> >image file size is not a multiple of the cluster size and we round down
> >to the start of cluster.
> >
> >In an implementation that supports TRIM or otherwise reuses clusters
> >this is a cost.
> >
> >>  >+
> >>  >+/*
> >>  >+ * Table locking works as follows:
> >>  >+ *
> >>  >+ * Reads and non-allocating writes do not acquire locks because they do not
> >>  >+ * modify tables and only see committed L2 cache entries.
> >>
> >>  What about a non-allocating write that follows an allocating write?
> >>
> >>  1 Guest writes to sector 0
> >>  2 Host reads backing image (or supplies zeros), sectors 1-127
> >>  3 Host writes sectors 0-127
> >>  4 Guest writes sector 1
> >>  5 Host writes sector 1
> >>
> >>  There needs to be a barrier that prevents the host and the disk from
> >>  reordering operations 3 and 5, or guest operation 4 is lost.  As far
> >>  as the guest is concerned no overlapping writes were issued, so it
> >>  isn't required to provide any barriers.
> >>
> >>  (based on the comment only, haven't read the code)
> >
> >There is no barrier between operations 3 and 5.  However, operation 5
> >only starts after operation 3 has completed because of table locking.
> >It is my understanding that *independent* requests may be reordered but
> >two writes to the *same* sector will not be reordered if write A
> >completes before write B is issued.
> >
> >Imagine a test program that uses pwrite() to rewrite a counter many
> >times on disk.  When the program finishes it prints the counter
> >variable's last value.  This scenario is like operations 3 and 5 above.
> >If we read the counter back from disk it will be the final value, not
> >some intermediate value.  The writes will not be reordered.
> 
> Yes, all that is needed is a barrier in program order.  So,
> operation 5 is an allocating write as long as 3 hasn't returned? (at
> which point it becomes a non-allocating write)?

Yes, operation 5 waits until operation 3 completes.  After waking up on
the lock, the request looks up the cluster again because it may now be
allocated - operation 5 switches to an non-allocating write.

Stefan
Anthony Liguori - Oct. 11, 2010, 2:57 p.m.
On 10/11/2010 08:10 AM, Avi Kivity wrote:
>  On 10/11/2010 12:37 PM, Stefan Hajnoczi wrote:
>> On Sun, Oct 10, 2010 at 11:10:15AM +0200, Avi Kivity wrote:
>> >   On 10/08/2010 05:48 PM, Stefan Hajnoczi wrote:
>> > >This patch implements the read/write state machine.  Operations are
>> > >fully asynchronous and multiple operations may be active at any time.
>> > >
>> > >Allocating writes lock tables to ensure metadata updates do not
>> > >interfere with each other.  If two allocating writes need to 
>> update the
>> > >same L2 table they will run sequentially.  If two allocating 
>> writes need
>> > >to update different L2 tables they will run in parallel.
>> > >
>> >
>> >  Shouldn't there be a flush between an allocating write and an L2
>> >  update?  Otherwise a reuse of a cluster can move logical sectors
>> >  from one place to another, causing a data disclosure.
>> >
>> >  Can be skipped if the new cluster is beyond the physical image size.
>>
>> Currently clusters are never reused and new clusters are always beyond
>> physical image size.  The only exception I can think of is when the
>> image file size is not a multiple of the cluster size and we round down
>> to the start of cluster.
>>
>> In an implementation that supports TRIM or otherwise reuses clusters
>> this is a cost.
>>
>> > >+
>> > >+/*
>> > >+ * Table locking works as follows:
>> > >+ *
>> > >+ * Reads and non-allocating writes do not acquire locks because 
>> they do not
>> > >+ * modify tables and only see committed L2 cache entries.
>> >
>> >  What about a non-allocating write that follows an allocating write?
>> >
>> >  1 Guest writes to sector 0
>> >  2 Host reads backing image (or supplies zeros), sectors 1-127
>> >  3 Host writes sectors 0-127
>> >  4 Guest writes sector 1
>> >  5 Host writes sector 1
>> >
>> >  There needs to be a barrier that prevents the host and the disk from
>> >  reordering operations 3 and 5, or guest operation 4 is lost.  As far
>> >  as the guest is concerned no overlapping writes were issued, so it
>> >  isn't required to provide any barriers.
>> >
>> >  (based on the comment only, haven't read the code)
>>
>> There is no barrier between operations 3 and 5.  However, operation 5
>> only starts after operation 3 has completed because of table locking.
>> It is my understanding that *independent* requests may be reordered but
>> two writes to the *same* sector will not be reordered if write A
>> completes before write B is issued.
>>
>> Imagine a test program that uses pwrite() to rewrite a counter many
>> times on disk.  When the program finishes it prints the counter
>> variable's last value.  This scenario is like operations 3 and 5 above.
>> If we read the counter back from disk it will be the final value, not
>> some intermediate value.  The writes will not be reordered.
>
> Yes, all that is needed is a barrier in program order.  So, operation 
> 5 is an allocating write as long as 3 hasn't returned? (at which point 
> it becomes a non-allocating write)?

Correct.  The table lock in 0 is held until the request completes fully 
(including the write out of all of the fill data--step 3) which means 5 
will not begin until 3 has completed

Regards,

Anthony Liguori
Kevin Wolf - Oct. 12, 2010, 3:08 p.m.
Am 08.10.2010 17:48, schrieb Stefan Hajnoczi:
> This patch implements the read/write state machine.  Operations are
> fully asynchronous and multiple operations may be active at any time.
> 
> Allocating writes lock tables to ensure metadata updates do not
> interfere with each other.  If two allocating writes need to update the
> same L2 table they will run sequentially.  If two allocating writes need
> to update different L2 tables they will run in parallel.
> 
> Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>

I hope to review this in more detail tomorrow, but there's one thing I
already noticed: When allocating a cluster, but not writing the whole
cluster (i.e. requests involving COW), I think we need to flush after
the COW and before the cluster allocation is written to the L2 table to
maintain the right order. Otherwise we might destroy data that isn't
even touched by the guest request in case of a crash.

Kevin
Anthony Liguori - Oct. 12, 2010, 3:22 p.m.
On 10/12/2010 10:08 AM, Kevin Wolf wrote:
>   Otherwise we might destroy data that isn't
> even touched by the guest request in case of a crash.
>    

The failure scenarios are either that the cluster is leaked in which 
case, the old version of the data is still present or the cluster is 
orphaned because the L2 entry is written, in which case the old version 
of the data is present.

Are you referring to a scenario where the cluster is partially written 
because the data is present in the write cache and the write cache isn't 
flushed on power failure?

Regards,

Anthony Liguori

> Kevin
>
Kevin Wolf - Oct. 12, 2010, 3:39 p.m.
Am 12.10.2010 17:22, schrieb Anthony Liguori:
> On 10/12/2010 10:08 AM, Kevin Wolf wrote:
>>   Otherwise we might destroy data that isn't
>> even touched by the guest request in case of a crash.
>>    
> 
> The failure scenarios are either that the cluster is leaked in which 
> case, the old version of the data is still present or the cluster is 
> orphaned because the L2 entry is written, in which case the old version 
> of the data is present.

Hm, how does the latter case work? Or rather, what do mean by "orphaned"?

> Are you referring to a scenario where the cluster is partially written 
> because the data is present in the write cache and the write cache isn't 
> flushed on power failure?

The case I'm referring to is a COW. So let's assume a partial write to
an unallocated cluster, we then need to do a COW in pre/postfill. Then
we do a normal write and link the new cluster in the L2 table.

Assume that the write to the L2 table is already on the disk, but the
pre/postfill data isn't yet. At this point we have a bad state because
if we crash now we have lost the data that should have been copied from
the backing file.

If we can't guarantee that a new cluster is all zeros, the same happens
without a backing file. So as soon as we start reusing freed clusters,
we get this case for all QED images.

Kevin
Stefan Hajnoczi - Oct. 12, 2010, 3:59 p.m.
On Tue, Oct 12, 2010 at 05:39:48PM +0200, Kevin Wolf wrote:
> Am 12.10.2010 17:22, schrieb Anthony Liguori:
> > On 10/12/2010 10:08 AM, Kevin Wolf wrote:
> >>   Otherwise we might destroy data that isn't
> >> even touched by the guest request in case of a crash.
> >>    
> > 
> > The failure scenarios are either that the cluster is leaked in which 
> > case, the old version of the data is still present or the cluster is 
> > orphaned because the L2 entry is written, in which case the old version 
> > of the data is present.
> 
> Hm, how does the latter case work? Or rather, what do mean by "orphaned"?
> 
> > Are you referring to a scenario where the cluster is partially written 
> > because the data is present in the write cache and the write cache isn't 
> > flushed on power failure?
> 
> The case I'm referring to is a COW. So let's assume a partial write to
> an unallocated cluster, we then need to do a COW in pre/postfill. Then
> we do a normal write and link the new cluster in the L2 table.
> 
> Assume that the write to the L2 table is already on the disk, but the
> pre/postfill data isn't yet. At this point we have a bad state because
> if we crash now we have lost the data that should have been copied from
> the backing file.

In this case QED_F_NEED_CHECK is set and the invalid cluster offset
should be reset to zero on open.

However, I think we can get into a state where the pre/postfill data
isn't on the disk yet but another allocation has increased the file
size, making the unwritten cluster "valid".  This fools consistency
check into thinking the data cluster (which was never written to on
disk) is valid.

Will think about this more tonight.

Stefan
Anthony Liguori - Oct. 12, 2010, 4:16 p.m.
On 10/12/2010 10:59 AM, Stefan Hajnoczi wrote:
> On Tue, Oct 12, 2010 at 05:39:48PM +0200, Kevin Wolf wrote:
>    
>> Am 12.10.2010 17:22, schrieb Anthony Liguori:
>>      
>>> On 10/12/2010 10:08 AM, Kevin Wolf wrote:
>>>        
>>>>    Otherwise we might destroy data that isn't
>>>> even touched by the guest request in case of a crash.
>>>>
>>>>          
>>> The failure scenarios are either that the cluster is leaked in which
>>> case, the old version of the data is still present or the cluster is
>>> orphaned because the L2 entry is written, in which case the old version
>>> of the data is present.
>>>        
>> Hm, how does the latter case work? Or rather, what do mean by "orphaned"?
>>
>>      
>>> Are you referring to a scenario where the cluster is partially written
>>> because the data is present in the write cache and the write cache isn't
>>> flushed on power failure?
>>>        
>> The case I'm referring to is a COW. So let's assume a partial write to
>> an unallocated cluster, we then need to do a COW in pre/postfill. Then
>> we do a normal write and link the new cluster in the L2 table.
>>
>> Assume that the write to the L2 table is already on the disk, but the
>> pre/postfill data isn't yet. At this point we have a bad state because
>> if we crash now we have lost the data that should have been copied from
>> the backing file.
>>      
> In this case QED_F_NEED_CHECK is set and the invalid cluster offset
> should be reset to zero on open.
>
> However, I think we can get into a state where the pre/postfill data
> isn't on the disk yet but another allocation has increased the file
> size, making the unwritten cluster "valid".  This fools consistency
> check into thinking the data cluster (which was never written to on
> disk) is valid.
>
> Will think about this more tonight.
>    

It's fairly simple to add a sync to this path.  It's probably worth 
checking the prefill/postfill for zeros and avoiding the write/sync if 
that's the case.  That should optimize the common cases of allocating 
new space within a file.

My intuition is that we can avoid the sync entirely but we'll need to 
think about it further.

Regards,

Anthony Liguori

> Stefan
>
Avi Kivity - Oct. 12, 2010, 4:21 p.m.
On 10/12/2010 06:16 PM, Anthony Liguori wrote:
>
> It's fairly simple to add a sync to this path.  It's probably worth 
> checking the prefill/postfill for zeros and avoiding the write/sync if 
> that's the case.  That should optimize the common cases of allocating 
> new space within a file.
>
> My intuition is that we can avoid the sync entirely but we'll need to 
> think about it further.
>

I don't think so.  This isn't a guest initiated write so we can't shift 
responsibility to the guest.
Stefan Hajnoczi - Oct. 13, 2010, 12:13 p.m.
On Tue, Oct 12, 2010 at 11:16:17AM -0500, Anthony Liguori wrote:
> On 10/12/2010 10:59 AM, Stefan Hajnoczi wrote:
> >On Tue, Oct 12, 2010 at 05:39:48PM +0200, Kevin Wolf wrote:
> >>Am 12.10.2010 17:22, schrieb Anthony Liguori:
> >>>On 10/12/2010 10:08 AM, Kevin Wolf wrote:
> >>>>   Otherwise we might destroy data that isn't
> >>>>even touched by the guest request in case of a crash.
> >>>>
> >>>The failure scenarios are either that the cluster is leaked in which
> >>>case, the old version of the data is still present or the cluster is
> >>>orphaned because the L2 entry is written, in which case the old version
> >>>of the data is present.
> >>Hm, how does the latter case work? Or rather, what do mean by "orphaned"?
> >>
> >>>Are you referring to a scenario where the cluster is partially written
> >>>because the data is present in the write cache and the write cache isn't
> >>>flushed on power failure?
> >>The case I'm referring to is a COW. So let's assume a partial write to
> >>an unallocated cluster, we then need to do a COW in pre/postfill. Then
> >>we do a normal write and link the new cluster in the L2 table.
> >>
> >>Assume that the write to the L2 table is already on the disk, but the
> >>pre/postfill data isn't yet. At this point we have a bad state because
> >>if we crash now we have lost the data that should have been copied from
> >>the backing file.
> >In this case QED_F_NEED_CHECK is set and the invalid cluster offset
> >should be reset to zero on open.
> >
> >However, I think we can get into a state where the pre/postfill data
> >isn't on the disk yet but another allocation has increased the file
> >size, making the unwritten cluster "valid".  This fools consistency
> >check into thinking the data cluster (which was never written to on
> >disk) is valid.
> >
> >Will think about this more tonight.
> 
> It's fairly simple to add a sync to this path.  It's probably worth
> checking the prefill/postfill for zeros and avoiding the write/sync
> if that's the case.  That should optimize the common cases of
> allocating new space within a file.
> 
> My intuition is that we can avoid the sync entirely but we'll need
> to think about it further.

We can avoid it when a backing image is not used.  Your idea to check
for zeroes in the backing image is neat too, it may well reduce the
common case even for backing images.

Stefan
Kevin Wolf - Oct. 13, 2010, 1:07 p.m.
Am 13.10.2010 14:13, schrieb Stefan Hajnoczi:
> On Tue, Oct 12, 2010 at 11:16:17AM -0500, Anthony Liguori wrote:
>> On 10/12/2010 10:59 AM, Stefan Hajnoczi wrote:
>>> On Tue, Oct 12, 2010 at 05:39:48PM +0200, Kevin Wolf wrote:
>>>> Am 12.10.2010 17:22, schrieb Anthony Liguori:
>>>>> On 10/12/2010 10:08 AM, Kevin Wolf wrote:
>>>>>>   Otherwise we might destroy data that isn't
>>>>>> even touched by the guest request in case of a crash.
>>>>>>
>>>>> The failure scenarios are either that the cluster is leaked in which
>>>>> case, the old version of the data is still present or the cluster is
>>>>> orphaned because the L2 entry is written, in which case the old version
>>>>> of the data is present.
>>>> Hm, how does the latter case work? Or rather, what do mean by "orphaned"?
>>>>
>>>>> Are you referring to a scenario where the cluster is partially written
>>>>> because the data is present in the write cache and the write cache isn't
>>>>> flushed on power failure?
>>>> The case I'm referring to is a COW. So let's assume a partial write to
>>>> an unallocated cluster, we then need to do a COW in pre/postfill. Then
>>>> we do a normal write and link the new cluster in the L2 table.
>>>>
>>>> Assume that the write to the L2 table is already on the disk, but the
>>>> pre/postfill data isn't yet. At this point we have a bad state because
>>>> if we crash now we have lost the data that should have been copied from
>>>> the backing file.
>>> In this case QED_F_NEED_CHECK is set and the invalid cluster offset
>>> should be reset to zero on open.
>>>
>>> However, I think we can get into a state where the pre/postfill data
>>> isn't on the disk yet but another allocation has increased the file
>>> size, making the unwritten cluster "valid".  This fools consistency
>>> check into thinking the data cluster (which was never written to on
>>> disk) is valid.
>>>
>>> Will think about this more tonight.
>>
>> It's fairly simple to add a sync to this path.  It's probably worth
>> checking the prefill/postfill for zeros and avoiding the write/sync
>> if that's the case.  That should optimize the common cases of
>> allocating new space within a file.
>>
>> My intuition is that we can avoid the sync entirely but we'll need
>> to think about it further.
> 
> We can avoid it when a backing image is not used.  Your idea to check
> for zeroes in the backing image is neat too, it may well reduce the
> common case even for backing images.

The additional requirement is that we're extending the file and not
reusing an old cluster. (And bdrv_has_zero_init() == true, but QED
doesn't work on host_devices anyway)

Kevin
Anthony Liguori - Oct. 13, 2010, 1:24 p.m.
On 10/13/2010 08:07 AM, Kevin Wolf wrote:
> Am 13.10.2010 14:13, schrieb Stefan Hajnoczi:
>    
>> We can avoid it when a backing image is not used.  Your idea to check
>> for zeroes in the backing image is neat too, it may well reduce the
>> common case even for backing images.
>>      
> The additional requirement is that we're extending the file and not
> reusing an old cluster. (And bdrv_has_zero_init() == true, but QED
> doesn't work on host_devices anyway)
>    

Yes, that's a good point.

BTW, I think we've decided that making it work on host_devices is not 
that bad.

We can add an additional feature called QED_F_PHYSICAL_SIZE.

This feature will add another field to the header that contains an 
offset immediately following the last cluster allocation.

During a metadata scan, we can accurately recreate this field so we only 
need to update this field whenever we clear the header dirty bit (which 
means during an fsync()).

That means we can maintain the physical size without introducing 
additional fsync()s in the allocation path.  Since we're already writing 
out the header anyway, the write operation is basically free too.

Regards,

Anthony Liguori

> Kevin
>
Avi Kivity - Oct. 13, 2010, 1:50 p.m.
On 10/13/2010 03:24 PM, Anthony Liguori wrote:
> On 10/13/2010 08:07 AM, Kevin Wolf wrote:
>> Am 13.10.2010 14:13, schrieb Stefan Hajnoczi:
>>> We can avoid it when a backing image is not used.  Your idea to check
>>> for zeroes in the backing image is neat too, it may well reduce the
>>> common case even for backing images.
>> The additional requirement is that we're extending the file and not
>> reusing an old cluster. (And bdrv_has_zero_init() == true, but QED
>> doesn't work on host_devices anyway)
>
> Yes, that's a good point.
>
> BTW, I think we've decided that making it work on host_devices is not 
> that bad.
>
> We can add an additional feature called QED_F_PHYSICAL_SIZE.
>
> This feature will add another field to the header that contains an 
> offset immediately following the last cluster allocation.
>
> During a metadata scan, we can accurately recreate this field so we 
> only need to update this field whenever we clear the header dirty bit 
> (which means during an fsync()).

If you make QED_F_PHYSICAL_SIZE an autoclear bit, you don't need the 
header dirty bit.


>
> That means we can maintain the physical size without introducing 
> additional fsync()s in the allocation path.  Since we're already 
> writing out the header anyway, the write operation is basically free too.

I don't see how it is free.  It's an extra write.  The good news is that 
it's very easy to amortize.
Stefan Hajnoczi - Oct. 13, 2010, 2:07 p.m.
On Wed, Oct 13, 2010 at 03:50:00PM +0200, Avi Kivity wrote:
>  On 10/13/2010 03:24 PM, Anthony Liguori wrote:
> >On 10/13/2010 08:07 AM, Kevin Wolf wrote:
> >>Am 13.10.2010 14:13, schrieb Stefan Hajnoczi:
> >>>We can avoid it when a backing image is not used.  Your idea to check
> >>>for zeroes in the backing image is neat too, it may well reduce the
> >>>common case even for backing images.
> >>The additional requirement is that we're extending the file and not
> >>reusing an old cluster. (And bdrv_has_zero_init() == true, but QED
> >>doesn't work on host_devices anyway)
> >
> >Yes, that's a good point.
> >
> >BTW, I think we've decided that making it work on host_devices is
> >not that bad.
> >
> >We can add an additional feature called QED_F_PHYSICAL_SIZE.
> >
> >This feature will add another field to the header that contains an
> >offset immediately following the last cluster allocation.
> >
> >During a metadata scan, we can accurately recreate this field so
> >we only need to update this field whenever we clear the header
> >dirty bit (which means during an fsync()).
> 
> If you make QED_F_PHYSICAL_SIZE an autoclear bit, you don't need the
> header dirty bit.

Do you mean we just need to check the physical size header field against
the actual file size?  If the two are different, then a consistency
check is forced.

> >
> >That means we can maintain the physical size without introducing
> >additional fsync()s in the allocation path.  Since we're already
> >writing out the header anyway, the write operation is basically
> >free too.
> 
> I don't see how it is free.  It's an extra write.  The good news is
> that it's very easy to amortize.

We only need to update the header field on disk when we're already
updating the header, so it's not even an extra write operation.

Stefan
Anthony Liguori - Oct. 13, 2010, 2:08 p.m.
On 10/13/2010 09:07 AM, Stefan Hajnoczi wrote:
>
>>> That means we can maintain the physical size without introducing
>>> additional fsync()s in the allocation path.  Since we're already
>>> writing out the header anyway, the write operation is basically
>>> free too.
>>>        
>> I don't see how it is free.  It's an extra write.  The good news is
>> that it's very easy to amortize.
>>      
> We only need to update the header field on disk when we're already
> updating the header, so it's not even an extra write operation.
>    

Because we're already writing out the sector that contains that field in 
the header.

Regards,

Anthony Liguori

> Stefan
>
Anthony Liguori - Oct. 13, 2010, 2:10 p.m.
On 10/13/2010 08:50 AM, Avi Kivity wrote:
>  On 10/13/2010 03:24 PM, Anthony Liguori wrote:
>> On 10/13/2010 08:07 AM, Kevin Wolf wrote:
>>> Am 13.10.2010 14:13, schrieb Stefan Hajnoczi:
>>>> We can avoid it when a backing image is not used.  Your idea to check
>>>> for zeroes in the backing image is neat too, it may well reduce the
>>>> common case even for backing images.
>>> The additional requirement is that we're extending the file and not
>>> reusing an old cluster. (And bdrv_has_zero_init() == true, but QED
>>> doesn't work on host_devices anyway)
>>
>> Yes, that's a good point.
>>
>> BTW, I think we've decided that making it work on host_devices is not 
>> that bad.
>>
>> We can add an additional feature called QED_F_PHYSICAL_SIZE.
>>
>> This feature will add another field to the header that contains an 
>> offset immediately following the last cluster allocation.
>>
>> During a metadata scan, we can accurately recreate this field so we 
>> only need to update this field whenever we clear the header dirty bit 
>> (which means during an fsync()).
>
> If you make QED_F_PHYSICAL_SIZE an autoclear bit, you don't need the 
> header dirty bit.

Yes, autoclear bits are essentially granular header dirty bits.

Regards,

Anthony Liguori

>
>>
>> That means we can maintain the physical size without introducing 
>> additional fsync()s in the allocation path.  Since we're already 
>> writing out the header anyway, the write operation is basically free 
>> too.
>
> I don't see how it is free.  It's an extra write.  The good news is 
> that it's very easy to amortize.
>
Avi Kivity - Oct. 13, 2010, 2:10 p.m.
On 10/13/2010 04:07 PM, Stefan Hajnoczi wrote:
> On Wed, Oct 13, 2010 at 03:50:00PM +0200, Avi Kivity wrote:
> >   On 10/13/2010 03:24 PM, Anthony Liguori wrote:
> >  >On 10/13/2010 08:07 AM, Kevin Wolf wrote:
> >  >>Am 13.10.2010 14:13, schrieb Stefan Hajnoczi:
> >  >>>We can avoid it when a backing image is not used.  Your idea to check
> >  >>>for zeroes in the backing image is neat too, it may well reduce the
> >  >>>common case even for backing images.
> >  >>The additional requirement is that we're extending the file and not
> >  >>reusing an old cluster. (And bdrv_has_zero_init() == true, but QED
> >  >>doesn't work on host_devices anyway)
> >  >
> >  >Yes, that's a good point.
> >  >
> >  >BTW, I think we've decided that making it work on host_devices is
> >  >not that bad.
> >  >
> >  >We can add an additional feature called QED_F_PHYSICAL_SIZE.
> >  >
> >  >This feature will add another field to the header that contains an
> >  >offset immediately following the last cluster allocation.
> >  >
> >  >During a metadata scan, we can accurately recreate this field so
> >  >we only need to update this field whenever we clear the header
> >  >dirty bit (which means during an fsync()).
> >
> >  If you make QED_F_PHYSICAL_SIZE an autoclear bit, you don't need the
> >  header dirty bit.
>
> Do you mean we just need to check the physical size header field against
> the actual file size?  If the two are different, then a consistency
> check is forced.

I thought you'd only use a header size field when you don't have a real 
file size.  Why do you need both?

> >  >
> >  >That means we can maintain the physical size without introducing
> >  >additional fsync()s in the allocation path.  Since we're already
> >  >writing out the header anyway, the write operation is basically
> >  >free too.
> >
> >  I don't see how it is free.  It's an extra write.  The good news is
> >  that it's very easy to amortize.
>
> We only need to update the header field on disk when we're already
> updating the header, so it's not even an extra write operation.

Why would you ever update the header, apart from relocating L1 for some 
reason?
Anthony Liguori - Oct. 13, 2010, 2:11 p.m.
>> > >
>> > >That means we can maintain the physical size without introducing
>> > >additional fsync()s in the allocation path.  Since we're already
>> > >writing out the header anyway, the write operation is basically
>> > >free too.
>> >
>> >  I don't see how it is free.  It's an extra write.  The good news is
>> >  that it's very easy to amortize.
>>
>> We only need to update the header field on disk when we're already
>> updating the header, so it's not even an extra write operation.
>
> Why would you ever update the header, apart from relocating L1 for 
> some reason?

To update the L1/L2 tables clean bit.  That's what prevents a check in 
the normal case where you have a clean shutdown.

Regards,

Anthony Liguori
Avi Kivity - Oct. 13, 2010, 2:16 p.m.
On 10/13/2010 04:11 PM, Anthony Liguori wrote:
>> Why would you ever update the header, apart from relocating L1 for 
>> some reason?
>
>
> To update the L1/L2 tables clean bit.  That's what prevents a check in 
> the normal case where you have a clean shutdown.

I see - so you wouldn't update it every allocation, only when the disk 
has been quiet for a while.
Anthony Liguori - Oct. 13, 2010, 2:53 p.m.
On 10/13/2010 09:16 AM, Avi Kivity wrote:
>  On 10/13/2010 04:11 PM, Anthony Liguori wrote:
>>> Why would you ever update the header, apart from relocating L1 for 
>>> some reason?
>>
>>
>> To update the L1/L2 tables clean bit.  That's what prevents a check 
>> in the normal case where you have a clean shutdown.
>
> I see - so you wouldn't update it every allocation, only when the disk 
> has been quiet for a while.

Right, the current plan is to flush the header dirty bit on shutdown or 
whenever there is an explicit flush of the device.  Current that is 
caused by either a guest-initiated flush or a L1 update.  We also plan 
to add a timer-based flush such that a flush is scheduled for some 
period of time (like 5 minutes) after the dirty bit is set.

The end result should be that the only window for requiring a metadata 
scan is if a crash occurs within 5 minutes of a cluster allocation and 
an explicit flush has not occurred for some other reason.

Regards,

Anthony Liguori
Avi Kivity - Oct. 13, 2010, 3:08 p.m.
On 10/13/2010 04:53 PM, Anthony Liguori wrote:
> On 10/13/2010 09:16 AM, Avi Kivity wrote:
>>  On 10/13/2010 04:11 PM, Anthony Liguori wrote:
>>>> Why would you ever update the header, apart from relocating L1 for 
>>>> some reason?
>>>
>>>
>>> To update the L1/L2 tables clean bit.  That's what prevents a check 
>>> in the normal case where you have a clean shutdown.
>>
>> I see - so you wouldn't update it every allocation, only when the 
>> disk has been quiet for a while.
>
> Right, the current plan is to flush the header dirty bit on shutdown 
> or whenever there is an explicit flush of the device.  Current that is 
> caused by either a guest-initiated flush or a L1 update.

That does add an extra write (and a new write+flush later to mark the 
header dirty again when you start allocating).  I'd drop it and only use 
the timer.

in fact, it adds an extra flush too.  The sequence

1 L1 update
2 mark clean
3 flush

is unsafe since you can crash between 2 and 3, ad only 2 makes it.  So 
I'd do something like

1 opportunistic flush (for whatever reason)
2 set timer
3 no intervening metadata changes
4 mark clean
5 no intervening metadata changes
6 mark dirty
7 flush
8 metadata changes
Anthony Liguori - Oct. 13, 2010, 3:42 p.m.
On 10/13/2010 10:08 AM, Avi Kivity wrote:
>  On 10/13/2010 04:53 PM, Anthony Liguori wrote:
>> On 10/13/2010 09:16 AM, Avi Kivity wrote:
>>>  On 10/13/2010 04:11 PM, Anthony Liguori wrote:
>>>>> Why would you ever update the header, apart from relocating L1 for 
>>>>> some reason?
>>>>
>>>>
>>>> To update the L1/L2 tables clean bit.  That's what prevents a check 
>>>> in the normal case where you have a clean shutdown.
>>>
>>> I see - so you wouldn't update it every allocation, only when the 
>>> disk has been quiet for a while.
>>
>> Right, the current plan is to flush the header dirty bit on shutdown 
>> or whenever there is an explicit flush of the device.  Current that 
>> is caused by either a guest-initiated flush or a L1 update.
>
> That does add an extra write (and a new write+flush later to mark the 
> header dirty again when you start allocating).  I'd drop it and only 
> use the timer.
>
> in fact, it adds an extra flush too.  The sequence
>
> 1 L1 update
> 2 mark clean
> 3 flush
>
> is unsafe since you can crash between 2 and 3, ad only 2 makes it.  So 
> I'd do something like

You've got the order wrong.

1. L1 update
2. flush()
3. mark clean

If (3) doesn't make it to disk, that's okay.  It just causes an extra scan.

> 1 opportunistic flush (for whatever reason)
> 2 set timer
> 3 no intervening metadata changes
> 4 mark clean
> 5 no intervening metadata changes
> 6 mark dirty
> 7 flush
> 8 metadata changes

Not sure I see why we set the timer in step 2 as opposed to:

0 clear scheduled flush (if necessary)
1 opportunistic flush (for whatever reason)
2 mark clean
3 no intervening metadata changes
4 mark dirty
5 flush
6 schedule flush (in 5 minutes)
7 metadata changes

Which is now recorded at http://wiki.qemu.org/Features/QED/ScanAvoidance 
so we can keep track of this.

Regards,

Anthony Liguori

>
>
Stefan Hajnoczi - Oct. 14, 2010, 11:06 a.m.
On Wed, Oct 13, 2010 at 04:10:25PM +0200, Avi Kivity wrote:
>  On 10/13/2010 04:07 PM, Stefan Hajnoczi wrote:
> >On Wed, Oct 13, 2010 at 03:50:00PM +0200, Avi Kivity wrote:
> >>   On 10/13/2010 03:24 PM, Anthony Liguori wrote:
> >>  >On 10/13/2010 08:07 AM, Kevin Wolf wrote:
> >>  >>Am 13.10.2010 14:13, schrieb Stefan Hajnoczi:
> >>  >>>We can avoid it when a backing image is not used.  Your idea to check
> >>  >>>for zeroes in the backing image is neat too, it may well reduce the
> >>  >>>common case even for backing images.
> >>  >>The additional requirement is that we're extending the file and not
> >>  >>reusing an old cluster. (And bdrv_has_zero_init() == true, but QED
> >>  >>doesn't work on host_devices anyway)
> >>  >
> >>  >Yes, that's a good point.
> >>  >
> >>  >BTW, I think we've decided that making it work on host_devices is
> >>  >not that bad.
> >>  >
> >>  >We can add an additional feature called QED_F_PHYSICAL_SIZE.
> >>  >
> >>  >This feature will add another field to the header that contains an
> >>  >offset immediately following the last cluster allocation.
> >>  >
> >>  >During a metadata scan, we can accurately recreate this field so
> >>  >we only need to update this field whenever we clear the header
> >>  >dirty bit (which means during an fsync()).
> >>
> >>  If you make QED_F_PHYSICAL_SIZE an autoclear bit, you don't need the
> >>  header dirty bit.
> >
> >Do you mean we just need to check the physical size header field against
> >the actual file size?  If the two are different, then a consistency
> >check is forced.
> 
> I thought you'd only use a header size field when you don't have a
> real file size.  Why do you need both?

I probably didn't understand correctly :).  You said with
QED_F_PHYSICAL_SIZE autoclear you don't need the header dirty bit.  I
don't see how it eliminates the need for the header dirty bit.

Stefan

Patch

diff --git a/Makefile.objs b/Makefile.objs
index 7b3b19c..24d734f 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -15,6 +15,7 @@  block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o
 block-nested-y += raw.o cow.o qcow.o vdi.o vmdk.o cloop.o dmg.o bochs.o vpc.o vvfat.o
 block-nested-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o
 block-nested-y += qed.o qed-gencb.o qed-l2-cache.o qed-table.o qed-cluster.o
+block-nested-y += qed-lock.o
 block-nested-y += parallels.o nbd.o blkdebug.o sheepdog.o blkverify.o
 block-nested-$(CONFIG_WIN32) += raw-win32.o
 block-nested-$(CONFIG_POSIX) += raw-posix.o
diff --git a/block/qed-lock.c b/block/qed-lock.c
new file mode 100644
index 0000000..bd91729
--- /dev/null
+++ b/block/qed-lock.c
@@ -0,0 +1,124 @@ 
+/*
+ * QEMU Enhanced Disk Format Lock
+ *
+ * Copyright IBM, Corp. 2010
+ *
+ * Authors:
+ *  Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
+ *
+ * This work is licensed under the terms of the GNU LGPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+/*
+ * Table locking works as follows:
+ *
+ * Reads and non-allocating writes do not acquire locks because they do not
+ * modify tables and only see committed L2 cache entries.
+ *
+ * An allocating write request that needs to update an existing L2 table
+ * acquires a lock on the table.  This serializes requests that touch the same
+ * L2 table.
+ *
+ * An allocating write request that needs to create a new L2 table and update
+ * the L1 table acquires a lock on the L1 table.  This serializes requests that
+ * create new L2 tables.
+ *
+ * When a request is unable to acquire a lock, it is put to sleep and must
+ * return.  When the lock it attempted to acquire becomes available, a wakeup
+ * function is invoked to activate it again.
+ *
+ * A request must retry its cluster lookup after waking up because the tables
+ * have changed.  For example, an allocating write may no longer need to
+ * allocate if the previous request already allocated the cluster.
+ */
+
+#include "qed.h"
+
+struct QEDLockEntry {
+    uint64_t key;
+    QSIMPLEQ_HEAD(, QEDAIOCB) reqs;
+    QTAILQ_ENTRY(QEDLockEntry) next;
+};
+
+/**
+ * Initialize a lock
+ *
+ * @lock:           Lock
+ * @wakeup_fn:      Callback to reactivate a sleeping request
+ */
+void qed_lock_init(QEDLock *lock, BlockDriverCompletionFunc *wakeup_fn)
+{
+    QTAILQ_INIT(&lock->entries);
+    lock->wakeup_fn = wakeup_fn;
+}
+
+/**
+ * Acquire a lock on a given key
+ *
+ * @lock:           Lock
+ * @key:            Key to lock on
+ * @acb:            Request
+ * @ret:            true if lock was acquired, false if request needs to sleep
+ *
+ * If the request currently has another lock held, that lock will be released.
+ */
+bool qed_lock(QEDLock *lock, uint64_t key, QEDAIOCB *acb)
+{
+    QEDLockEntry *entry = acb->lock_entry;
+
+    if (entry) {
+        /* Lock already held */
+        if (entry->key == key) {
+            return true;
+        }
+
+        /* Release old lock */
+        qed_unlock(lock, acb);
+    }
+
+    /* Find held lock */
+    QTAILQ_FOREACH(entry, &lock->entries, next) {
+        if (entry->key == key) {
+            QSIMPLEQ_INSERT_TAIL(&entry->reqs, acb, next);
+            acb->lock_entry = entry;
+            return false;
+        }
+    }
+
+    /* Allocate new lock entry */
+    entry = qemu_malloc(sizeof(*entry));
+    entry->key = key;
+    QSIMPLEQ_INIT(&entry->reqs);
+    QSIMPLEQ_INSERT_TAIL(&entry->reqs, acb, next);
+    QTAILQ_INSERT_TAIL(&lock->entries, entry, next);
+    acb->lock_entry = entry;
+    return true;
+}
+
+/**
+ * Release a held lock
+ */
+void qed_unlock(QEDLock *lock, QEDAIOCB *acb)
+{
+    QEDLockEntry *entry = acb->lock_entry;
+
+    if (!entry) {
+        return;
+    }
+
+    acb->lock_entry = NULL;
+    QSIMPLEQ_REMOVE_HEAD(&entry->reqs, next);
+
+    /* Wake up next lock holder */
+    acb = QSIMPLEQ_FIRST(&entry->reqs);
+    if (acb) {
+        lock->wakeup_fn(acb, 0);
+        return;
+    }
+
+    /* Free lock entry */
+    QTAILQ_REMOVE(&lock->entries, entry, next);
+    qemu_free(entry);
+}
diff --git a/block/qed.c b/block/qed.c
index 6d7f4d7..4fded31 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -12,8 +12,26 @@ 
  *
  */
 
+#include "trace.h"
 #include "qed.h"
 
+static void qed_aio_cancel(BlockDriverAIOCB *blockacb)
+{
+    QEDAIOCB *acb = (QEDAIOCB *)blockacb;
+    bool finished = false;
+
+    /* Wait for the request to finish */
+    acb->finished = &finished;
+    while (!finished) {
+        qemu_aio_wait();
+    }
+}
+
+static AIOPool qed_aio_pool = {
+    .aiocb_size         = sizeof(QEDAIOCB),
+    .cancel             = qed_aio_cancel,
+};
+
 static int bdrv_qed_probe(const uint8_t *buf, int buf_size,
                           const char *filename)
 {
@@ -139,6 +157,20 @@  static int qed_read_string(BlockDriverState *file, uint64_t offset, size_t n,
     return 0;
 }
 
+/**
+ * Allocate new clusters
+ *
+ * @s:          QED state
+ * @n:          Number of contiguous clusters to allocate
+ * @offset:     Offset of first allocated cluster, filled in on success
+ */
+static int qed_alloc_clusters(BDRVQEDState *s, unsigned int n, uint64_t *offset)
+{
+    *offset = s->file_size;
+    s->file_size += n * s->header.cluster_size;
+    return 0;
+}
+
 static QEDTable *qed_alloc_table(void *opaque)
 {
     BDRVQEDState *s = opaque;
@@ -148,6 +180,30 @@  static QEDTable *qed_alloc_table(void *opaque)
                            s->header.cluster_size * s->header.table_size);
 }
 
+/**
+ * Allocate a new zeroed L2 table
+ */
+static CachedL2Table *qed_new_l2_table(BDRVQEDState *s)
+{
+    uint64_t offset;
+    int ret;
+    CachedL2Table *l2_table;
+
+    ret = qed_alloc_clusters(s, s->header.table_size, &offset);
+    if (ret) {
+        return NULL;
+    }
+
+    l2_table = qed_alloc_l2_cache_entry(&s->l2_cache);
+    l2_table->offset = offset;
+
+    memset(l2_table->table->offsets, 0,
+           s->header.cluster_size * s->header.table_size);
+    return l2_table;
+}
+
+static void qed_aio_next_io(void *opaque, int ret);
+
 static int bdrv_qed_open(BlockDriverState *bs, int flags)
 {
     BDRVQEDState *s = bs->opaque;
@@ -156,6 +212,7 @@  static int bdrv_qed_open(BlockDriverState *bs, int flags)
     int ret;
 
     s->bs = bs;
+    qed_lock_init(&s->lock, qed_aio_next_io);
 
     ret = bdrv_pread(bs->file, 0, &le_header, sizeof(le_header));
     if (ret != sizeof(le_header)) {
@@ -402,13 +459,545 @@  static int bdrv_qed_make_empty(BlockDriverState *bs)
     return -ENOTSUP;
 }
 
+static BDRVQEDState *acb_to_s(QEDAIOCB *acb)
+{
+    return acb->common.bs->opaque;
+}
+
+typedef struct {
+    GenericCB gencb;
+    BDRVQEDState *s;
+    QEMUIOVector qiov;
+    struct iovec iov;
+    uint64_t offset;
+} CopyFromBackingFileCB;
+
+static void qed_copy_from_backing_file_cb(void *opaque, int ret)
+{
+    CopyFromBackingFileCB *copy_cb = opaque;
+    qemu_vfree(copy_cb->iov.iov_base);
+    gencb_complete(&copy_cb->gencb, ret);
+}
+
+static void qed_copy_from_backing_file_write(void *opaque, int ret)
+{
+    CopyFromBackingFileCB *copy_cb = opaque;
+    BDRVQEDState *s = copy_cb->s;
+    BlockDriverAIOCB *aiocb;
+
+    if (ret) {
+        qed_copy_from_backing_file_cb(copy_cb, ret);
+        return;
+    }
+
+    BLKDBG_EVENT(s->bs->file, BLKDBG_COW_WRITE);
+    aiocb = bdrv_aio_writev(s->bs->file, copy_cb->offset / BDRV_SECTOR_SIZE,
+                            &copy_cb->qiov,
+                            copy_cb->qiov.size / BDRV_SECTOR_SIZE,
+                            qed_copy_from_backing_file_cb, copy_cb);
+    if (!aiocb) {
+        qed_copy_from_backing_file_cb(copy_cb, -EIO);
+    }
+}
+
+/**
+ * Copy data from backing file into the image
+ *
+ * @s:          QED state
+ * @pos:        Byte position in device
+ * @len:        Number of bytes
+ * @offset:     Byte offset in image file
+ * @cb:         Completion function
+ * @opaque:     User data for completion function
+ */
+static void qed_copy_from_backing_file(BDRVQEDState *s, uint64_t pos,
+                                       uint64_t len, uint64_t offset,
+                                       BlockDriverCompletionFunc *cb,
+                                       void *opaque)
+{
+    CopyFromBackingFileCB *copy_cb;
+    BlockDriverAIOCB *aiocb;
+
+    /* Skip copy entirely if there is no work to do */
+    if (len == 0) {
+        cb(opaque, 0);
+        return;
+    }
+
+    copy_cb = gencb_alloc(sizeof(*copy_cb), cb, opaque);
+    copy_cb->s = s;
+    copy_cb->offset = offset;
+    copy_cb->iov.iov_base = qemu_blockalign(s->bs, len);
+    copy_cb->iov.iov_len = len;
+    qemu_iovec_init_external(&copy_cb->qiov, &copy_cb->iov, 1);
+
+    /* Zero sectors if there is no backing file */
+    if (!s->bs->backing_hd) {
+        /* Note that it is possible to skip writing zeroes for prefill if the
+         * cluster is not yet allocated and the file guarantees new space is
+         * zeroed.  Don't take this shortcut for now since it also forces us to
+         * handle the special case of rounding file size down on open, which
+         * can be solved by also doing a truncate to free any extra data at the
+         * end of the file.
+         */
+        memset(copy_cb->iov.iov_base, 0, len);
+        qed_copy_from_backing_file_write(copy_cb, 0);
+        return;
+    }
+
+    BLKDBG_EVENT(s->bs->file, BLKDBG_READ_BACKING);
+    aiocb = bdrv_aio_readv(s->bs->backing_hd, pos / BDRV_SECTOR_SIZE,
+                           &copy_cb->qiov, len / BDRV_SECTOR_SIZE,
+                           qed_copy_from_backing_file_write, copy_cb);
+    if (!aiocb) {
+        qed_copy_from_backing_file_cb(copy_cb, -EIO);
+    }
+}
+
+/**
+ * Link one or more contiguous clusters into a table
+ *
+ * @s:              QED state
+ * @table:          L2 table
+ * @index:          First cluster index
+ * @n:              Number of contiguous clusters
+ * @cluster:        First cluster byte offset in image file
+ */
+static void qed_update_l2_table(BDRVQEDState *s, QEDTable *table, int index,
+                                unsigned int n, uint64_t cluster)
+{
+    int i;
+    for (i = index; i < index + n; i++) {
+        table->offsets[i] = cluster;
+        cluster += s->header.cluster_size;
+    }
+}
+
+static void qed_aio_complete_bh(void *opaque)
+{
+    QEDAIOCB *acb = opaque;
+    BlockDriverCompletionFunc *cb = acb->common.cb;
+    void *user_opaque = acb->common.opaque;
+    int ret = acb->bh_ret;
+    bool *finished = acb->finished;
+
+    qemu_bh_delete(acb->bh);
+    qemu_aio_release(acb);
+
+    /* Invoke callback */
+    cb(user_opaque, ret);
+
+    /* Signal cancel completion */
+    if (finished) {
+        *finished = true;
+    }
+}
+
+static void qed_aio_complete(QEDAIOCB *acb, int ret)
+{
+    BDRVQEDState *s = acb_to_s(acb);
+
+    trace_qed_aio_complete(s, acb, ret);
+
+    /* Free resources */
+    qemu_iovec_destroy(&acb->cur_qiov);
+    qed_unref_l2_cache_entry(&s->l2_cache, acb->request.l2_table);
+    qed_unlock(&s->lock, acb);
+
+    /* Arrange for a bh to invoke the completion function */
+    acb->bh_ret = ret;
+    acb->bh = qemu_bh_new(qed_aio_complete_bh, acb);
+    qemu_bh_schedule(acb->bh);
+}
+
+/**
+ * Construct an iovec array for a given length
+ *
+ * @acb:        I/O request
+ * @len:        Maximum number of bytes
+ *
+ * This function can be called several times to build subset iovec arrays of
+ * acb->qiov.  For example:
+ *
+ *   acb->qiov->iov[] = {{0x100000, 1024},
+ *                       {0x200000, 1024}}
+ *
+ *   qed_acb_build_qiov(acb, 512) =>
+ *                      {{0x100000, 512}}
+ *
+ *   qed_acb_build_qiov(acb, 1024) =>
+ *                      {{0x100200, 512},
+ *                       {0x200000, 512}}
+ *
+ *   qed_acb_build_qiov(acb, 512) =>
+ *                      {{0x200200, 512}}
+ */
+static void qed_acb_build_qiov(QEDAIOCB *acb, size_t len)
+{
+    struct iovec *iov_end = &acb->qiov->iov[acb->qiov->niov];
+    size_t iov_offset = acb->cur_iov_offset;
+    struct iovec *iov = acb->cur_iov;
+
+    while (iov != iov_end && len > 0) {
+        size_t nbytes = MIN(iov->iov_len - iov_offset, len);
+
+        qemu_iovec_add(&acb->cur_qiov, iov->iov_base + iov_offset, nbytes);
+        iov_offset += nbytes;
+        len -= nbytes;
+
+        if (iov_offset >= iov->iov_len) {
+            iov_offset = 0;
+            iov++;
+        }
+    }
+
+    /* Stash state for next time */
+    acb->cur_iov = iov;
+    acb->cur_iov_offset = iov_offset;
+}
+
+/**
+ * Commit the current L2 table to the cache
+ */
+static void qed_commit_l2_update(void *opaque, int ret)
+{
+    QEDAIOCB *acb = opaque;
+    BDRVQEDState *s = acb_to_s(acb);
+    CachedL2Table *l2_table = acb->request.l2_table;
+
+    qed_commit_l2_cache_entry(&s->l2_cache, l2_table);
+
+    /* This is guaranteed to succeed because we just committed the entry to the
+     * cache.
+     */
+    acb->request.l2_table = qed_find_l2_cache_entry(&s->l2_cache,
+                                                    l2_table->offset);
+    assert(acb->request.l2_table != NULL);
+
+    qed_aio_next_io(opaque, ret);
+}
+
+/**
+ * Update L1 table with new L2 table offset and write it out
+ */
+static void qed_aio_write_l1_update(void *opaque, int ret)
+{
+    QEDAIOCB *acb = opaque;
+    BDRVQEDState *s = acb_to_s(acb);
+    int index;
+
+    if (ret) {
+        qed_aio_complete(acb, ret);
+        return;
+    }
+
+    index = qed_l1_index(s, acb->cur_pos);
+    s->l1_table->offsets[index] = acb->request.l2_table->offset;
+
+    qed_write_l1_table(s, index, 1, qed_commit_l2_update, acb);
+}
+
+/**
+ * Update L2 table with new cluster offsets and write them out
+ */
+static void qed_aio_write_l2_update(void *opaque, int ret)
+{
+    QEDAIOCB *acb = opaque;
+    BDRVQEDState *s = acb_to_s(acb);
+    bool need_alloc = acb->find_cluster_ret == QED_CLUSTER_L1;
+    int index;
+
+    if (ret) {
+        goto err;
+    }
+
+    if (need_alloc) {
+        qed_unref_l2_cache_entry(&s->l2_cache, acb->request.l2_table);
+        acb->request.l2_table = qed_new_l2_table(s);
+        if (!acb->request.l2_table) {
+            ret = -EIO;
+            goto err;
+        }
+    }
+
+    index = qed_l2_index(s, acb->cur_pos);
+    qed_update_l2_table(s, acb->request.l2_table->table, index, acb->cur_nclusters,
+                         acb->cur_cluster);
+
+    if (need_alloc) {
+        /* Write out the whole new L2 table */
+        qed_write_l2_table(s, &acb->request, 0, s->table_nelems, true,
+                            qed_aio_write_l1_update, acb);
+    } else {
+        /* Write out only the updated part of the L2 table */
+        qed_write_l2_table(s, &acb->request, index, acb->cur_nclusters, false,
+                            qed_aio_next_io, acb);
+    }
+    return;
+
+err:
+    qed_aio_complete(acb, ret);
+}
+
+/**
+ * Write data to the image file
+ */
+static void qed_aio_write_main(void *opaque, int ret)
+{
+    QEDAIOCB *acb = opaque;
+    BDRVQEDState *s = acb_to_s(acb);
+    bool need_alloc = acb->find_cluster_ret != QED_CLUSTER_FOUND;
+    uint64_t offset = acb->cur_cluster;
+    BlockDriverAIOCB *file_acb;
+
+    trace_qed_aio_write_main(s, acb, ret, offset, acb->cur_qiov.size);
+
+    if (ret) {
+        qed_aio_complete(acb, ret);
+        return;
+    }
+
+    offset += qed_offset_into_cluster(s, acb->cur_pos);
+    BLKDBG_EVENT(s->bs->file, BLKDBG_WRITE_AIO);
+    file_acb = bdrv_aio_writev(s->bs->file, offset / BDRV_SECTOR_SIZE,
+                               &acb->cur_qiov,
+                               acb->cur_qiov.size / BDRV_SECTOR_SIZE,
+                               need_alloc ? qed_aio_write_l2_update :
+                                            qed_aio_next_io,
+                               acb);
+    if (!file_acb) {
+        qed_aio_complete(acb, -EIO);
+    }
+}
+
+/**
+ * Populate back untouched region of new data cluster
+ */
+static void qed_aio_write_postfill(void *opaque, int ret)
+{
+    QEDAIOCB *acb = opaque;
+    BDRVQEDState *s = acb_to_s(acb);
+    uint64_t start = acb->cur_pos + acb->cur_qiov.size;
+    uint64_t len = qed_start_of_cluster(s, start + s->header.cluster_size - 1) - start;
+    uint64_t offset = acb->cur_cluster + qed_offset_into_cluster(s, acb->cur_pos) + acb->cur_qiov.size;
+
+    if (ret) {
+        qed_aio_complete(acb, ret);
+        return;
+    }
+
+    trace_qed_aio_write_postfill(s, acb, start, len, offset);
+    qed_copy_from_backing_file(s, start, len, offset,
+                                qed_aio_write_main, acb);
+}
+
+/**
+ * Populate front untouched region of new data cluster
+ */
+static void qed_aio_write_prefill(void *opaque, int ret)
+{
+    QEDAIOCB *acb = opaque;
+    BDRVQEDState *s = acb_to_s(acb);
+    uint64_t start = qed_start_of_cluster(s, acb->cur_pos);
+    uint64_t len = qed_offset_into_cluster(s, acb->cur_pos);
+
+    trace_qed_aio_write_prefill(s, acb, start, len, acb->cur_cluster);
+    qed_copy_from_backing_file(s, start, len, acb->cur_cluster,
+                                qed_aio_write_postfill, acb);
+}
+
+/**
+ * Write data cluster
+ *
+ * @opaque:     Write request
+ * @ret:        QED_CLUSTER_FOUND, QED_CLUSTER_L2, QED_CLUSTER_L1,
+ *              or QED_CLUSTER_ERROR
+ * @offset:     Cluster offset in bytes
+ * @len:        Length in bytes
+ *
+ * Callback from qed_find_cluster().
+ */
+static void qed_aio_write_data(void *opaque, int ret,
+                               uint64_t offset, size_t len)
+{
+    QEDAIOCB *acb = opaque;
+    BDRVQEDState *s = acb_to_s(acb);
+    bool need_alloc = ret != QED_CLUSTER_FOUND;
+
+    trace_qed_aio_write_data(s, acb, ret, offset, len);
+
+    if (ret == QED_CLUSTER_ERROR) {
+        goto err;
+    }
+
+    if (need_alloc) {
+        /* If a new L2 table is being allocated, lock the L1 table.  Otherwise
+         * just lock the L2 table.
+         */
+        uint64_t lock_key = ret == QED_CLUSTER_L1 ?
+                            s->header.l1_table_offset :
+                            acb->request.l2_table->offset;
+
+        if (!qed_lock(&s->lock, lock_key, acb)) {
+            return; /* sleep until woken up again */
+        }
+    } else {
+        /* If we're still holding a lock, release it */
+        qed_unlock(&s->lock, acb);
+    }
+
+    acb->cur_nclusters = qed_bytes_to_clusters(s,
+                             qed_offset_into_cluster(s, acb->cur_pos) + len);
+
+    if (need_alloc) {
+        if (qed_alloc_clusters(s, acb->cur_nclusters, &offset) != 0) {
+            goto err;
+        }
+    }
+
+    acb->find_cluster_ret = ret;
+    acb->cur_cluster = offset;
+    qed_acb_build_qiov(acb, len);
+
+    /* Write data in place if the cluster already exists */
+    if (!need_alloc) {
+        qed_aio_write_main(acb, 0);
+        return;
+    }
+
+    /* Write new cluster */
+    qed_aio_write_prefill(acb, 0);
+    return;
+
+err:
+    qed_aio_complete(acb, -EIO);
+}
+
+/**
+ * Read data cluster
+ *
+ * @opaque:     Read request
+ * @ret:        QED_CLUSTER_FOUND, QED_CLUSTER_L2, QED_CLUSTER_L1,
+ *              or QED_CLUSTER_ERROR
+ * @offset:     Cluster offset in bytes
+ * @len:        Length in bytes
+ *
+ * Callback from qed_find_cluster().
+ */
+static void qed_aio_read_data(void *opaque, int ret,
+                              uint64_t offset, size_t len)
+{
+    QEDAIOCB *acb = opaque;
+    BDRVQEDState *s = acb_to_s(acb);
+    BlockDriverState *bs = acb->common.bs;
+    BlockDriverState *file = bs->file;
+    BlockDriverAIOCB *file_acb;
+
+    trace_qed_aio_read_data(s, acb, ret, offset, len);
+
+    if (ret == QED_CLUSTER_ERROR) {
+        goto err;
+    }
+
+    qed_acb_build_qiov(acb, len);
+
+    /* Adjust offset into cluster */
+    offset += qed_offset_into_cluster(s, acb->cur_pos);
+
+    /* Handle backing file and unallocated sparse hole reads */
+    if (ret != QED_CLUSTER_FOUND) {
+        if (!bs->backing_hd) {
+            qemu_iovec_memset(&acb->cur_qiov, 0, acb->cur_qiov.size);
+            qed_aio_next_io(acb, 0);
+            return;
+        }
+
+        /* Pass through read to backing file */
+        offset = acb->cur_pos;
+        file = bs->backing_hd;
+    }
+
+    BLKDBG_EVENT(bs->file, BLKDBG_READ_AIO);
+    file_acb = bdrv_aio_readv(file, offset / BDRV_SECTOR_SIZE,
+                              &acb->cur_qiov,
+                              acb->cur_qiov.size / BDRV_SECTOR_SIZE,
+                              qed_aio_next_io, acb);
+    if (!file_acb) {
+        goto err;
+    }
+    return;
+
+err:
+    qed_aio_complete(acb, -EIO);
+}
+
+/**
+ * Begin next I/O or complete the request
+ */
+static void qed_aio_next_io(void *opaque, int ret)
+{
+    QEDAIOCB *acb = opaque;
+    BDRVQEDState *s = acb_to_s(acb);
+    QEDFindClusterFunc *io_fn =
+        acb->is_write ? qed_aio_write_data : qed_aio_read_data;
+
+    trace_qed_aio_next_io(s, acb, ret, acb->cur_pos + acb->cur_qiov.size);
+
+    /* Handle I/O error */
+    if (ret) {
+        qed_aio_complete(acb, ret);
+        return;
+    }
+
+    acb->cur_pos += acb->cur_qiov.size;
+    qemu_iovec_reset(&acb->cur_qiov);
+
+    /* Complete request */
+    if (acb->cur_pos >= acb->end_pos) {
+        qed_aio_complete(acb, 0);
+        return;
+    }
+
+    /* Find next cluster and start I/O */
+    qed_find_cluster(s, &acb->request,
+                      acb->cur_pos, acb->end_pos - acb->cur_pos,
+                      io_fn, acb);
+}
+
+static BlockDriverAIOCB *qed_aio_setup(BlockDriverState *bs,
+                                       int64_t sector_num,
+                                       QEMUIOVector *qiov, int nb_sectors,
+                                       BlockDriverCompletionFunc *cb,
+                                       void *opaque, bool is_write)
+{
+    QEDAIOCB *acb = qemu_aio_get(&qed_aio_pool, bs, cb, opaque);
+
+    trace_qed_aio_setup(bs->opaque, acb, sector_num, nb_sectors,
+                         opaque, is_write);
+
+    acb->is_write = is_write;
+    acb->finished = NULL;
+    acb->qiov = qiov;
+    acb->cur_iov = acb->qiov->iov;
+    acb->cur_iov_offset = 0;
+    acb->cur_pos = (uint64_t)sector_num * BDRV_SECTOR_SIZE;
+    acb->end_pos = acb->cur_pos + nb_sectors * BDRV_SECTOR_SIZE;
+    acb->request.l2_table = NULL;
+    acb->lock_entry = NULL;
+    qemu_iovec_init(&acb->cur_qiov, qiov->niov);
+
+    /* Start request */
+    qed_aio_next_io(acb, 0);
+    return &acb->common;
+}
+
 static BlockDriverAIOCB *bdrv_qed_aio_readv(BlockDriverState *bs,
                                             int64_t sector_num,
                                             QEMUIOVector *qiov, int nb_sectors,
                                             BlockDriverCompletionFunc *cb,
                                             void *opaque)
 {
-    return NULL;
+    return qed_aio_setup(bs, sector_num, qiov, nb_sectors, cb, opaque, false);
 }
 
 static BlockDriverAIOCB *bdrv_qed_aio_writev(BlockDriverState *bs,
@@ -417,7 +1006,7 @@  static BlockDriverAIOCB *bdrv_qed_aio_writev(BlockDriverState *bs,
                                              BlockDriverCompletionFunc *cb,
                                              void *opaque)
 {
-    return NULL;
+    return qed_aio_setup(bs, sector_num, qiov, nb_sectors, cb, opaque, true);
 }
 
 static BlockDriverAIOCB *bdrv_qed_aio_flush(BlockDriverState *bs,
diff --git a/block/qed.h b/block/qed.h
index 9ea288f..91c91f4 100644
--- a/block/qed.h
+++ b/block/qed.h
@@ -126,6 +126,41 @@  typedef struct QEDRequest {
     CachedL2Table *l2_table;
 } QEDRequest;
 
+typedef struct QEDLockEntry QEDLockEntry;
+
+typedef struct QEDAIOCB {
+    BlockDriverAIOCB common;
+    QEMUBH *bh;
+    int bh_ret;                     /* final return status for completion bh */
+    QSIMPLEQ_ENTRY(QEDAIOCB) next;  /* next request */
+    QEDLockEntry *lock_entry;       /* held lock */
+    bool is_write;                  /* false - read, true - write */
+    bool *finished;                 /* signal for cancel completion */
+    uint64_t end_pos;               /* request end on block device, in bytes */
+
+    /* User scatter-gather list */
+    QEMUIOVector *qiov;
+    struct iovec *cur_iov;          /* current iovec to process */
+    size_t cur_iov_offset;          /* byte count already processed in iovec */
+
+    /* Current cluster scatter-gather list */
+    QEMUIOVector cur_qiov;
+    uint64_t cur_pos;               /* position on block device, in bytes */
+    uint64_t cur_cluster;           /* cluster offset in image file */
+    unsigned int cur_nclusters;     /* number of clusters being accessed */
+    int find_cluster_ret;           /* used for L1/L2 update */
+
+    QEDRequest request;
+} QEDAIOCB;
+
+/**
+ * Lock used to serialize requests touching the same table
+ */
+typedef struct {
+    QTAILQ_HEAD(, QEDLockEntry) entries;
+    BlockDriverCompletionFunc *wakeup_fn;
+} QEDLock;
+
 typedef struct {
     BlockDriverState *bs;           /* device */
     uint64_t file_size;             /* length of image file, in bytes */
@@ -133,6 +168,7 @@  typedef struct {
     QEDHeader header;               /* always cpu-endian */
     QEDTable *l1_table;
     L2TableCache l2_cache;          /* l2 table cache */
+    QEDLock lock;                   /* table lock */
     uint32_t table_nelems;
     uint32_t l1_shift;
     uint32_t l2_shift;
@@ -170,6 +206,13 @@  CachedL2Table *qed_find_l2_cache_entry(L2TableCache *l2_cache, uint64_t offset);
 void qed_commit_l2_cache_entry(L2TableCache *l2_cache, CachedL2Table *l2_table);
 
 /**
+ * Lock functions
+ */
+void qed_lock_init(QEDLock *lock, BlockDriverCompletionFunc *wakeup_fn);
+bool qed_lock(QEDLock *lock, uint64_t key, QEDAIOCB *acb);
+void qed_unlock(QEDLock *lock, QEDAIOCB *acb);
+
+/**
  * Table I/O functions
  */
 int qed_read_l1_table_sync(BDRVQEDState *s);
diff --git a/trace-events b/trace-events
index a390196..86a1a75 100644
--- a/trace-events
+++ b/trace-events
@@ -73,3 +73,13 @@  disable qed_read_table(void *s, uint64_t offset, void *table) "s %p offset %"PRI
 disable qed_read_table_cb(void *s, void *table, int ret) "s %p table %p ret %d"
 disable qed_write_table(void *s, uint64_t offset, void *table, unsigned int index, unsigned int n) "s %p offset %"PRIu64" table %p index %u n %u"
 disable qed_write_table_cb(void *s, void *table, int ret) "s %p table %p ret %d"
+
+# block/qed.c
+disable qed_aio_complete(void *s, void *acb, int ret) "s %p acb %p ret %d"
+disable qed_aio_setup(void *s, void *acb, int64_t sector_num, int nb_sectors, void *opaque, int is_write) "s %p acb %p sector_num %"PRId64" nb_sectors %d opaque %p is_write %d"
+disable qed_aio_next_io(void *s, void *acb, int ret, uint64_t cur_pos) "s %p acb %p ret %d cur_pos %"PRIu64""
+disable qed_aio_read_data(void *s, void *acb, int ret, uint64_t offset, size_t len) "s %p acb %p ret %d offset %"PRIu64" len %zu"
+disable qed_aio_write_data(void *s, void *acb, int ret, uint64_t offset, size_t len) "s %p acb %p ret %d offset %"PRIu64" len %zu"
+disable qed_aio_write_prefill(void *s, void *acb, uint64_t start, size_t len, uint64_t offset) "s %p acb %p start %"PRIu64" len %zu offset %"PRIu64""
+disable qed_aio_write_postfill(void *s, void *acb, uint64_t start, size_t len, uint64_t offset) "s %p acb %p start %"PRIu64" len %zu offset %"PRIu64""
+disable qed_aio_write_main(void *s, void *acb, int ret, uint64_t offset, size_t len) "s %p acb %p ret %d offset %"PRIu64" len %zu"