Patchwork [RFC,STABLE,0.13] Revert "qcow2: Use bdrv_(p)write_sync for metadata writes"

login
register
mail settings
Submitter Kevin Wolf
Date Aug. 24, 2010, 10:40 a.m.
Message ID <1282646430-5777-1-git-send-email-kwolf@redhat.com>
Download mbox | patch
Permalink /patch/62563/
State New
Headers show

Comments

Kevin Wolf - Aug. 24, 2010, 10:40 a.m.
This reverts commit 8b3b720620a1137a1b794fc3ed64734236f94e06.

This fix has caused severe slowdowns on recent kernels that actually do flush
when they are told so. Reverting this patch hurts correctness and means that we
could get corrupted images in case of a host crash. This means that qcow2 might
not be an option for some people without this fix. On the other hand, I get
reports that the slowdown is so massive that not reverting it would mean that
people can't use it either because it just takes ages to complete stuff. It
probably can be fixed, but not in time for 0.13.0.

Usually, if there's a possible tradeoff between correctness and performance, I
tend to choose correctness, but I'm not so sure in this case. I'm not sure with
reverting either, which is why I post this as an RFC only.

I hope to get some more comments on how to proceed here for 0.13.

Kevin


---
 block/qcow2-cluster.c  |   24 ++++++++++++------------
 block/qcow2-refcount.c |   24 ++++++++++++------------
 block/qcow2-snapshot.c |   23 ++++++++++++-----------
 block/qcow2.c          |   10 +++++-----
 4 files changed, 41 insertions(+), 40 deletions(-)
Stefan Hajnoczi - Aug. 24, 2010, 11:02 a.m.
On Tue, Aug 24, 2010 at 11:40 AM, Kevin Wolf <kwolf@redhat.com> wrote:
> This reverts commit 8b3b720620a1137a1b794fc3ed64734236f94e06.
>
> This fix has caused severe slowdowns on recent kernels that actually do flush
> when they are told so. Reverting this patch hurts correctness and means that we
> could get corrupted images in case of a host crash. This means that qcow2 might
> not be an option for some people without this fix. On the other hand, I get
> reports that the slowdown is so massive that not reverting it would mean that
> people can't use it either because it just takes ages to complete stuff. It
> probably can be fixed, but not in time for 0.13.0.
>
> Usually, if there's a possible tradeoff between correctness and performance, I
> tend to choose correctness, but I'm not so sure in this case. I'm not sure with
> reverting either, which is why I post this as an RFC only.
>
> I hope to get some more comments on how to proceed here for 0.13.

Sometimes an improvement has a side effect and it makes sense to hold
back the improvement until the side effect can be resolved.  The
period of time in which users could rely on qcow2 data integrity is
small to none, I feel reverting the commit makes sense.

QEMU 0.12.5 has qcow2 sync metadata writes in commit
37060c28e522843fbf6f7e59af745dfcb05b132c.  Was the performance
regression spotted on 0.12.5 or 0.13?

Stefan
Michael Tokarev - Aug. 24, 2010, 11:06 a.m.
24.08.2010 15:02, Stefan Hajnoczi wrote:
[]
> Sometimes an improvement has a side effect and it makes sense to hold
> back the improvement until the side effect can be resolved.  The
> period of time in which users could rely on qcow2 data integrity is
> small to none, I feel reverting the commit makes sense.

And I agree with this.

> QEMU 0.12.5 has qcow2 sync metadata writes in commit
> 37060c28e522843fbf6f7e59af745dfcb05b132c.  Was the performance
> regression spotted on 0.12.5 or 0.13?

It started as a debian bugreport ( http://bugs.debian.org/594069 )
which were reported once 0.12.5 entered debian testing.  So
it were spotted in 0.12.5, not 0.13.  I verfied the bug locally
and indeed, it makes _huge_ difference (0.12.4 vs 0.12.5), --
e.g 600mb mem windows7 hybernation on my machine now takes about
40 minutes to complete instead of ~20 sec in 0.12.4.

I'm reverting whole series (there are 5 patches in total, covering
qcow, qcow2, vmdk and vpc plus infrastructure) for debian 0.12
package.  It is sad but there's no other option for now.

/mjt
Kevin Wolf - Aug. 24, 2010, 11:40 a.m.
Am 24.08.2010 13:02, schrieb Stefan Hajnoczi:
> On Tue, Aug 24, 2010 at 11:40 AM, Kevin Wolf <kwolf@redhat.com> wrote:
>> This reverts commit 8b3b720620a1137a1b794fc3ed64734236f94e06.
>>
>> This fix has caused severe slowdowns on recent kernels that actually do flush
>> when they are told so. Reverting this patch hurts correctness and means that we
>> could get corrupted images in case of a host crash. This means that qcow2 might
>> not be an option for some people without this fix. On the other hand, I get
>> reports that the slowdown is so massive that not reverting it would mean that
>> people can't use it either because it just takes ages to complete stuff. It
>> probably can be fixed, but not in time for 0.13.0.
>>
>> Usually, if there's a possible tradeoff between correctness and performance, I
>> tend to choose correctness, but I'm not so sure in this case. I'm not sure with
>> reverting either, which is why I post this as an RFC only.
>>
>> I hope to get some more comments on how to proceed here for 0.13.
> 
> Sometimes an improvement has a side effect and it makes sense to hold
> back the improvement until the side effect can be resolved.  The
> period of time in which users could rely on qcow2 data integrity is
> small to none, I feel reverting the commit makes sense.

Right, that's the vague feeling I have, too.

> QEMU 0.12.5 has qcow2 sync metadata writes in commit
> 37060c28e522843fbf6f7e59af745dfcb05b132c.  Was the performance
> regression spotted on 0.12.5 or 0.13?

Both. You mean we should consider a 0.12.6 if we decide to revert? I
think so far 0.12.5 was planned to be last 0.12.x release.

Kevin
Alexander Graf - Aug. 24, 2010, 11:56 a.m.
Kevin Wolf wrote:
> Am 24.08.2010 13:02, schrieb Stefan Hajnoczi:
>   
>> On Tue, Aug 24, 2010 at 11:40 AM, Kevin Wolf <kwolf@redhat.com> wrote:
>>     
>>> This reverts commit 8b3b720620a1137a1b794fc3ed64734236f94e06.
>>>
>>> This fix has caused severe slowdowns on recent kernels that actually do flush
>>> when they are told so. Reverting this patch hurts correctness and means that we
>>> could get corrupted images in case of a host crash. This means that qcow2 might
>>> not be an option for some people without this fix. On the other hand, I get
>>> reports that the slowdown is so massive that not reverting it would mean that
>>> people can't use it either because it just takes ages to complete stuff. It
>>> probably can be fixed, but not in time for 0.13.0.
>>>
>>> Usually, if there's a possible tradeoff between correctness and performance, I
>>> tend to choose correctness, but I'm not so sure in this case. I'm not sure with
>>> reverting either, which is why I post this as an RFC only.
>>>
>>> I hope to get some more comments on how to proceed here for 0.13.
>>>       
>> Sometimes an improvement has a side effect and it makes sense to hold
>> back the improvement until the side effect can be resolved.  The
>> period of time in which users could rely on qcow2 data integrity is
>> small to none, I feel reverting the commit makes sense.
>>     
>
> Right, that's the vague feeling I have, too.
>   

If we don't think of qcow2 as integer format, why don't we just default
to cache=unsafe there then? That way you could keep all the syncs in
place making it stable with cache=!unsafe, but the default for users
would be fast albeit unsafe, which it already is.


Alex
Kevin Wolf - Aug. 24, 2010, 12:10 p.m.
Am 24.08.2010 13:56, schrieb Alexander Graf:
> Kevin Wolf wrote:
>> Am 24.08.2010 13:02, schrieb Stefan Hajnoczi:
>>   
>>> On Tue, Aug 24, 2010 at 11:40 AM, Kevin Wolf <kwolf@redhat.com> wrote:
>>>     
>>>> This reverts commit 8b3b720620a1137a1b794fc3ed64734236f94e06.
>>>>
>>>> This fix has caused severe slowdowns on recent kernels that actually do flush
>>>> when they are told so. Reverting this patch hurts correctness and means that we
>>>> could get corrupted images in case of a host crash. This means that qcow2 might
>>>> not be an option for some people without this fix. On the other hand, I get
>>>> reports that the slowdown is so massive that not reverting it would mean that
>>>> people can't use it either because it just takes ages to complete stuff. It
>>>> probably can be fixed, but not in time for 0.13.0.
>>>>
>>>> Usually, if there's a possible tradeoff between correctness and performance, I
>>>> tend to choose correctness, but I'm not so sure in this case. I'm not sure with
>>>> reverting either, which is why I post this as an RFC only.
>>>>
>>>> I hope to get some more comments on how to proceed here for 0.13.
>>>>       
>>> Sometimes an improvement has a side effect and it makes sense to hold
>>> back the improvement until the side effect can be resolved.  The
>>> period of time in which users could rely on qcow2 data integrity is
>>> small to none, I feel reverting the commit makes sense.
>>>     
>>
>> Right, that's the vague feeling I have, too.
>>   
> 
> If we don't think of qcow2 as integer format, why don't we just default
> to cache=unsafe there then? That way you could keep all the syncs in
> place making it stable with cache=!unsafe, but the default for users
> would be fast albeit unsafe, which it already is.

Well, safety is not boolean. Considering to make it mostly safe instead
of completely safe because of the performance doesn't mean that we
should make it completely unsafe.

That said, what we should do is changing the cache mode to unsafe in
certain places in qemu-img, e.g. in convert for the destination image.
If it fails, you'll throw it away anyway.

Kevin
Alexander Graf - Aug. 24, 2010, 12:12 p.m.
Kevin Wolf wrote:
> Am 24.08.2010 13:56, schrieb Alexander Graf:
>   
>> Kevin Wolf wrote:
>>     
>>> Am 24.08.2010 13:02, schrieb Stefan Hajnoczi:
>>>   
>>>       
>>>> On Tue, Aug 24, 2010 at 11:40 AM, Kevin Wolf <kwolf@redhat.com> wrote:
>>>>     
>>>>         
>>>>> This reverts commit 8b3b720620a1137a1b794fc3ed64734236f94e06.
>>>>>
>>>>> This fix has caused severe slowdowns on recent kernels that actually do flush
>>>>> when they are told so. Reverting this patch hurts correctness and means that we
>>>>> could get corrupted images in case of a host crash. This means that qcow2 might
>>>>> not be an option for some people without this fix. On the other hand, I get
>>>>> reports that the slowdown is so massive that not reverting it would mean that
>>>>> people can't use it either because it just takes ages to complete stuff. It
>>>>> probably can be fixed, but not in time for 0.13.0.
>>>>>
>>>>> Usually, if there's a possible tradeoff between correctness and performance, I
>>>>> tend to choose correctness, but I'm not so sure in this case. I'm not sure with
>>>>> reverting either, which is why I post this as an RFC only.
>>>>>
>>>>> I hope to get some more comments on how to proceed here for 0.13.
>>>>>       
>>>>>           
>>>> Sometimes an improvement has a side effect and it makes sense to hold
>>>> back the improvement until the side effect can be resolved.  The
>>>> period of time in which users could rely on qcow2 data integrity is
>>>> small to none, I feel reverting the commit makes sense.
>>>>     
>>>>         
>>> Right, that's the vague feeling I have, too.
>>>   
>>>       
>> If we don't think of qcow2 as integer format, why don't we just default
>> to cache=unsafe there then? That way you could keep all the syncs in
>> place making it stable with cache=!unsafe, but the default for users
>> would be fast albeit unsafe, which it already is.
>>     
>
> Well, safety is not boolean. Considering to make it mostly safe instead
> of completely safe because of the performance doesn't mean that we
> should make it completely unsafe.
>   

What is safety then? A vague feeling of "oh today is monday so my data
is safe, but on tuesday I always lose my image data"? Either we promise
to keep data safe or we don't. There is no in between.

> That said, what we should do is changing the cache mode to unsafe in
> certain places in qemu-img, e.g. in convert for the destination image.
> If it fails, you'll throw it away anyway.
>   

That would be useful either way.


Alex
Avi Kivity - Aug. 24, 2010, 12:18 p.m.
On 08/24/2010 03:12 PM, Alexander Graf wrote:
>
>> Well, safety is not boolean. Considering to make it mostly safe instead
>> of completely safe because of the performance doesn't mean that we
>> should make it completely unsafe.
>>
> What is safety then? A vague feeling of "oh today is monday so my data
> is safe, but on tuesday I always lose my image data"? Either we promise
> to keep data safe or we don't. There is no in between.
>

Do you drive a car?

Though in general I agree we shouldn't compromise on data integrity.
Alexander Graf - Aug. 24, 2010, 12:21 p.m.
Avi Kivity wrote:
>  On 08/24/2010 03:12 PM, Alexander Graf wrote:
>>
>>> Well, safety is not boolean. Considering to make it mostly safe instead
>>> of completely safe because of the performance doesn't mean that we
>>> should make it completely unsafe.
>>>
>> What is safety then? A vague feeling of "oh today is monday so my data
>> is safe, but on tuesday I always lose my image data"? Either we promise
>> to keep data safe or we don't. There is no in between.
>>
>
> Do you drive a car?

Would you buy a car where the breaks are known to not always work? ;)

> Though in general I agree we shouldn't compromise on data integrity.

That's my point. Either we go for it or we don't.


Alex
Stefan Hajnoczi - Aug. 24, 2010, 12:21 p.m.
On Tue, Aug 24, 2010 at 12:40 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> Am 24.08.2010 13:02, schrieb Stefan Hajnoczi:
>> QEMU 0.12.5 has qcow2 sync metadata writes in commit
>> 37060c28e522843fbf6f7e59af745dfcb05b132c.  Was the performance
>> regression spotted on 0.12.5 or 0.13?
>
> Both. You mean we should consider a 0.12.6 if we decide to revert? I
> think so far 0.12.5 was planned to be last 0.12.x release.

Yes, especially if distros will revert the patches manually without an
upstream release.  I can see arguments for either way though.

Stefan
Michael Tokarev - Aug. 24, 2010, 12:23 p.m.
24.08.2010 16:21, Stefan Hajnoczi wrote:
> On Tue, Aug 24, 2010 at 12:40 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>> Am 24.08.2010 13:02, schrieb Stefan Hajnoczi:
>>> QEMU 0.12.5 has qcow2 sync metadata writes in commit
>>> 37060c28e522843fbf6f7e59af745dfcb05b132c.  Was the performance
>>> regression spotted on 0.12.5 or 0.13?
>>
>> Both. You mean we should consider a 0.12.6 if we decide to revert? I
>> think so far 0.12.5 was planned to be last 0.12.x release.
> 
> Yes, especially if distros will revert the patches manually without an
> upstream release.  I can see arguments for either way though.

Note that for 0.12 there's no other option than to revert the
whole thing.  Because, well, there's no "cache=unsafe" there,
not to say about it being the default.  And quemu-kvm breaks
the user's systems in the middle of stable series (by "breaks"
I mean making their working guests unusable).

/mjt
Avi Kivity - Aug. 24, 2010, 12:27 p.m.
On 08/24/2010 03:21 PM, Alexander Graf wrote:
> Avi Kivity wrote:
>>   On 08/24/2010 03:12 PM, Alexander Graf wrote:
>>>> Well, safety is not boolean. Considering to make it mostly safe instead
>>>> of completely safe because of the performance doesn't mean that we
>>>> should make it completely unsafe.
>>>>
>>> What is safety then? A vague feeling of "oh today is monday so my data
>>> is safe, but on tuesday I always lose my image data"? Either we promise
>>> to keep data safe or we don't. There is no in between.
>>>
>> Do you drive a car?
> Would you buy a car where the breaks are known to not always work? ;)

That's not the case, even with cache=unsafe.

>> Though in general I agree we shouldn't compromise on data integrity.
> That's my point. Either we go for it or we don't.

I don't know how bad the performance regression is, and how large the 
integrity risk is.  I'd default towards preserving integrity, but maybe 
this situation is different.
Kevin Wolf - Aug. 24, 2010, 12:35 p.m.
Am 24.08.2010 14:27, schrieb Avi Kivity:
>   On 08/24/2010 03:21 PM, Alexander Graf wrote:
>> Avi Kivity wrote:
>>>   On 08/24/2010 03:12 PM, Alexander Graf wrote:
>>>>> Well, safety is not boolean. Considering to make it mostly safe instead
>>>>> of completely safe because of the performance doesn't mean that we
>>>>> should make it completely unsafe.
>>>>>
>>>> What is safety then? A vague feeling of "oh today is monday so my data
>>>> is safe, but on tuesday I always lose my image data"? Either we promise
>>>> to keep data safe or we don't. There is no in between.
>>>>
>>> Do you drive a car?
>> Would you buy a car where the breaks are known to not always work? ;)
> 
> That's not the case, even with cache=unsafe.
> 
>>> Though in general I agree we shouldn't compromise on data integrity.
>> That's my point. Either we go for it or we don't.
> 
> I don't know how bad the performance regression is, and how large the 
> integrity risk is.  I'd default towards preserving integrity, but maybe 
> this situation is different.

I have reports of installations taking like 50 min instead of 14 min. My
own qemu-io based test goes up from 1 s to 23 s. And I think the winner
is Michael's image conversion which went up from 30 s to 49 min.

So it's not like we're talking about just some 10 or 20 percent.

Kevin
Avi Kivity - Aug. 24, 2010, 12:39 p.m.
On 08/24/2010 03:35 PM, Kevin Wolf wrote:
>
>> I don't know how bad the performance regression is, and how large the
>> integrity risk is.  I'd default towards preserving integrity, but maybe
>> this situation is different.
> I have reports of installations taking like 50 min instead of 14 min. My
> own qemu-io based test goes up from 1 s to 23 s. And I think the winner
> is Michael's image conversion which went up from 30 s to 49 min.
>
> So it's not like we're talking about just some 10 or 20 percent.

Image conversion should be done with cache=unsafe (with an fsync at the 
end to make sure a guest isn't launched with volatile data in the 
cache[1]).  The io test isn't a user workload.

The 14 min -> 50 min regression is pretty bad, but I'm not sure it's bad 
enough to merit risking user data.

[1] or better, when launching a guest with cache!=unsafe start with an 
fsync to make sure it's on disk.
Juan Quintela - Aug. 24, 2010, 12:48 p.m.
Kevin Wolf <kwolf@redhat.com> wrote:
> This reverts commit 8b3b720620a1137a1b794fc3ed64734236f94e06.
>
> This fix has caused severe slowdowns on recent kernels that actually do flush
> when they are told so. Reverting this patch hurts correctness and means that we
> could get corrupted images in case of a host crash. This means that qcow2 might
> not be an option for some people without this fix. On the other hand, I get
> reports that the slowdown is so massive that not reverting it would mean that
> people can't use it either because it just takes ages to complete stuff. It
> probably can be fixed, but not in time for 0.13.0.
>
> Usually, if there's a possible tradeoff between correctness and performance, I
> tend to choose correctness, but I'm not so sure in this case. I'm not sure with
> reverting either, which is why I post this as an RFC only.
>
> I hope to get some more comments on how to proceed here for 0.13.

I think that we have to revert also.

It is a thoug decission so.  This patch makes things more "correct",
but it makes some load really slow, for instance, and install went from
15m to 50m on my hardware :(

My point here is that with the patch, qcow2 becomes almost unusable, so
... what to do?

Later, Juan.
Kevin Wolf - Aug. 24, 2010, 12:53 p.m.
Am 24.08.2010 14:39, schrieb Avi Kivity:
>   On 08/24/2010 03:35 PM, Kevin Wolf wrote:
>>
>>> I don't know how bad the performance regression is, and how large the
>>> integrity risk is.  I'd default towards preserving integrity, but maybe
>>> this situation is different.
>> I have reports of installations taking like 50 min instead of 14 min. My
>> own qemu-io based test goes up from 1 s to 23 s. And I think the winner
>> is Michael's image conversion which went up from 30 s to 49 min.
>>
>> So it's not like we're talking about just some 10 or 20 percent.
> 
> Image conversion should be done with cache=unsafe (with an fsync at the 

Well, should, but is not. Arguably we should improve our format drivers
to make the metadata flushes less of a problem, too. But we haven't done
so yet. This is why this discussion is about 0.13, not 0.14.

> end to make sure a guest isn't launched with volatile data in the 
> cache[1]).  The io test isn't a user workload.
> 
> The 14 min -> 50 min regression is pretty bad, but I'm not sure it's bad 
> enough to merit risking user data.

Without a disk write cache (or with a battery-backed one) cache=none is
safe without any additional flushes. So yes, users need to be aware of
the meaning of cache options. But they already had to be aware of it
with 0.12.

However, it might trap users who are not aware of it, which is why we
even have this discussion. It's not an easy question.

> [1] or better, when launching a guest with cache!=unsafe start with an 
> fsync to make sure it's on disk.

Hm, that's an interesting idea, never thought of that. Might make sense.

Kevin
Anthony Liguori - Aug. 24, 2010, 1:01 p.m.
On 08/24/2010 05:40 AM, Kevin Wolf wrote:
> This reverts commit 8b3b720620a1137a1b794fc3ed64734236f94e06.
>
> This fix has caused severe slowdowns on recent kernels that actually do flush
> when they are told so. Reverting this patch hurts correctness and means that we
> could get corrupted images in case of a host crash. This means that qcow2 might
> not be an option for some people without this fix. On the other hand, I get
> reports that the slowdown is so massive that not reverting it would mean that
> people can't use it either because it just takes ages to complete stuff. It
> probably can be fixed, but not in time for 0.13.0.
>
> Usually, if there's a possible tradeoff between correctness and performance, I
> tend to choose correctness, but I'm not so sure in this case. I'm not sure with
> reverting either, which is why I post this as an RFC only.
>
> I hope to get some more comments on how to proceed here for 0.13.
>    

How fundamental of an issue is this?  Is this something we think we know 
how to fix and we just don't think there's time to fix it for 0.13?

And with this fix in place, how much confidence do we have in qcow2 with 
respect to data integrity on power loss?

We've shipped every version of QEMU since qcow2 was introduced with 
known data corruptions.  It sucks big time.  I think it's either that 
building an image format is a really hard problem akin to making a file 
system and we shouldn't be in that business or that qcow2 is bad as an 
image format which makes this all harder than it should be.

Either way, I think we need to look seriously at enabling alternatives 
(i.e. external snapshots, simple sparse files, etc.).

Regards,

Anthony Liguori

> Kevin
>
Kevin Wolf - Aug. 24, 2010, 1:16 p.m.
Am 24.08.2010 15:01, schrieb Anthony Liguori:
> On 08/24/2010 05:40 AM, Kevin Wolf wrote:
>> This reverts commit 8b3b720620a1137a1b794fc3ed64734236f94e06.
>>
>> This fix has caused severe slowdowns on recent kernels that actually do flush
>> when they are told so. Reverting this patch hurts correctness and means that we
>> could get corrupted images in case of a host crash. This means that qcow2 might
>> not be an option for some people without this fix. On the other hand, I get
>> reports that the slowdown is so massive that not reverting it would mean that
>> people can't use it either because it just takes ages to complete stuff. It
>> probably can be fixed, but not in time for 0.13.0.
>>
>> Usually, if there's a possible tradeoff between correctness and performance, I
>> tend to choose correctness, but I'm not so sure in this case. I'm not sure with
>> reverting either, which is why I post this as an RFC only.
>>
>> I hope to get some more comments on how to proceed here for 0.13.
>>    
> 
> How fundamental of an issue is this?  Is this something we think we know 
> how to fix and we just don't think there's time to fix it for 0.13?

I think we can improve things basically by trying to batch metadata
writes and do them in parallel while already processing the next requests.

I'm not sure what the numbers are going to look like with something like
this in place, I need to try it. It's definitely not something that I
want to go into 0.13 at this point.

> And with this fix in place, how much confidence do we have in qcow2 with 
> respect to data integrity on power loss?

How do you measure confidence? :-)

There are power failure tests and I don't have any bugs open to that
respect. I'm not sure how intensively it's tested, though.

> We've shipped every version of QEMU since qcow2 was introduced with 
> known data corruptions.  It sucks big time.  I think it's either that 
> building an image format is a really hard problem akin to making a file 
> system and we shouldn't be in that business or that qcow2 is bad as an 
> image format which makes this all harder than it should be.

I tend to say that it's just hard to get right. Most of the problems
that were fixed in qcow2 over the last year are probably present in our
VMDK implementation as well, just to pick one example.

Kevin
Anthony Liguori - Aug. 24, 2010, 1:29 p.m.
On 08/24/2010 08:16 AM, Kevin Wolf wrote:
> Am 24.08.2010 15:01, schrieb Anthony Liguori:
>    
>> On 08/24/2010 05:40 AM, Kevin Wolf wrote:
>>      
>>> This reverts commit 8b3b720620a1137a1b794fc3ed64734236f94e06.
>>>
>>> This fix has caused severe slowdowns on recent kernels that actually do flush
>>> when they are told so. Reverting this patch hurts correctness and means that we
>>> could get corrupted images in case of a host crash. This means that qcow2 might
>>> not be an option for some people without this fix. On the other hand, I get
>>> reports that the slowdown is so massive that not reverting it would mean that
>>> people can't use it either because it just takes ages to complete stuff. It
>>> probably can be fixed, but not in time for 0.13.0.
>>>
>>> Usually, if there's a possible tradeoff between correctness and performance, I
>>> tend to choose correctness, but I'm not so sure in this case. I'm not sure with
>>> reverting either, which is why I post this as an RFC only.
>>>
>>> I hope to get some more comments on how to proceed here for 0.13.
>>>
>>>        
>> How fundamental of an issue is this?  Is this something we think we know
>> how to fix and we just don't think there's time to fix it for 0.13?
>>      
> I think we can improve things basically by trying to batch metadata
> writes and do them in parallel while already processing the next requests.
>
> I'm not sure what the numbers are going to look like with something like
> this in place, I need to try it. It's definitely not something that I
> want to go into 0.13 at this point.
>    

I'm not sure this patch is needed in the first place.

If you have a sequence of operations like:

0) receive guest write request Z
1) submit write A
2) write A completes
3) submit write B
4) write B completes
5) report guest write Z complete

You're adding a:

4.5) sync write B

Which is ultimately unnecessary if what you care about is avoiding 
reordering of step (2) and (4).  When a write() request completes, 
you're guaranteed that a subsequent read() request will return the 
written data.  That's always true.  If I could do a write(A) followed by 
a write(B) and then read()=A, no software would actually function correctly.

It's important to make sure that you don't get image corruption if (2) 
happens but not (4).  But I think that's okay in qcow2 today.

Regards,

Anthony Liguori

>> And with this fix in place, how much confidence do we have in qcow2 with
>> respect to data integrity on power loss?
>>      
> How do you measure confidence? :-)
>
> There are power failure tests and I don't have any bugs open to that
> respect. I'm not sure how intensively it's tested, though.
>
>    
>> We've shipped every version of QEMU since qcow2 was introduced with
>> known data corruptions.  It sucks big time.  I think it's either that
>> building an image format is a really hard problem akin to making a file
>> system and we shouldn't be in that business or that qcow2 is bad as an
>> image format which makes this all harder than it should be.
>>      
> I tend to say that it's just hard to get right. Most of the problems
> that were fixed in qcow2 over the last year are probably present in our
> VMDK implementation as well, just to pick one example.
>
> Kevin
>
Avi Kivity - Aug. 24, 2010, 1:31 p.m.
On 08/24/2010 04:29 PM, Anthony Liguori wrote:
>
> I'm not sure this patch is needed in the first place.
>
> If you have a sequence of operations like:
>
> 0) receive guest write request Z
> 1) submit write A
> 2) write A completes
> 3) submit write B
> 4) write B completes
> 5) report guest write Z complete
>
> You're adding a:
>
> 4.5) sync write B
>
> Which is ultimately unnecessary if what you care about is avoiding 
> reordering of step (2) and (4).  When a write() request completes, 
> you're guaranteed that a subsequent read() request will return the 
> written data.  That's always true.  If I could do a write(A) followed 
> by a write(B) and then read()=A, no software would actually function 
> correctly.
>
> It's important to make sure that you don't get image corruption if (2) 
> happens but not (4).  But I think that's okay in qcow2 today.

It's about metadata writes.  If an operation changes metadata, we must 
sync it to disk before writing any data or other metadata which depends 
on it, regardless of any promises to the guest.
Anthony Liguori - Aug. 24, 2010, 1:35 p.m.
On 08/24/2010 08:31 AM, Avi Kivity wrote:
>  On 08/24/2010 04:29 PM, Anthony Liguori wrote:
>>
>> I'm not sure this patch is needed in the first place.
>>
>> If you have a sequence of operations like:
>>
>> 0) receive guest write request Z
>> 1) submit write A
>> 2) write A completes
>> 3) submit write B
>> 4) write B completes
>> 5) report guest write Z complete
>>
>> You're adding a:
>>
>> 4.5) sync write B
>>
>> Which is ultimately unnecessary if what you care about is avoiding 
>> reordering of step (2) and (4).  When a write() request completes, 
>> you're guaranteed that a subsequent read() request will return the 
>> written data.  That's always true.  If I could do a write(A) followed 
>> by a write(B) and then read()=A, no software would actually function 
>> correctly.
>>
>> It's important to make sure that you don't get image corruption if 
>> (2) happens but not (4).  But I think that's okay in qcow2 today.
>
> It's about metadata writes.  If an operation changes metadata, we must 
> sync it to disk before writing any data or other metadata which 
> depends on it, regardless of any promises to the guest.

Why?  If the metadata isn't sync, we loose the write.

But that can happen anyway because we're not sync'ing the data

We need to sync the metadata in the event of a guest initiated flush, 
but we shouldn't need to for a normal write.

Regards,

Anthony Liguori
Avi Kivity - Aug. 24, 2010, 1:39 p.m.
On 08/24/2010 04:35 PM, Anthony Liguori wrote:
>> It's about metadata writes.  If an operation changes metadata, we 
>> must sync it to disk before writing any data or other metadata which 
>> depends on it, regardless of any promises to the guest.
>
>
> Why?  If the metadata isn't sync, we loose the write.
>
> But that can happen anyway because we're not sync'ing the data
>
> We need to sync the metadata in the event of a guest initiated flush, 
> but we shouldn't need to for a normal write.

1. Allocate a cluster (increase refcount table)

2. Link cluster to L2 table

3. Second operation makes it to disk; first still in pagecache

4. Crash

5. Dangling pointer from L2 to freed cluster
Anthony Liguori - Aug. 24, 2010, 1:40 p.m.
On 08/24/2010 08:39 AM, Avi Kivity wrote:
>  On 08/24/2010 04:35 PM, Anthony Liguori wrote:
>>> It's about metadata writes.  If an operation changes metadata, we 
>>> must sync it to disk before writing any data or other metadata which 
>>> depends on it, regardless of any promises to the guest.
>>
>>
>> Why?  If the metadata isn't sync, we loose the write.
>>
>> But that can happen anyway because we're not sync'ing the data
>>
>> We need to sync the metadata in the event of a guest initiated flush, 
>> but we shouldn't need to for a normal write.
>
> 1. Allocate a cluster (increase refcount table)
>
> 2. Link cluster to L2 table
>
> 3. Second operation makes it to disk; first still in pagecache
>
> 4. Crash
>
> 5. Dangling pointer from L2 to freed cluster

Yes, having this discussion in IRC.

The problem is that we maintain a refcount table.  If we didn't do 
internal disk snapshots, we wouldn't have this problem.  IOW, VMDK 
doesn't have this problem so the answer to my very first question is 
that qcow2 is too difficult a format to get right.

Regards,

Anthony Liguori
Avi Kivity - Aug. 24, 2010, 1:44 p.m.
On 08/24/2010 04:40 PM, Anthony Liguori wrote:
>> 1. Allocate a cluster (increase refcount table)
>>
>> 2. Link cluster to L2 table
>>
>> 3. Second operation makes it to disk; first still in pagecache
>>
>> 4. Crash
>>
>> 5. Dangling pointer from L2 to freed cluster
>
>
> Yes, having this discussion in IRC.
>
> The problem is that we maintain a refcount table. 

Are you sure that's the only issue?

> If we didn't do internal disk snapshots, we wouldn't have this 
> problem.  IOW, VMDK doesn't have this problem so the answer to my very 
> first question is that qcow2 is too difficult a format to get right.

One doesn't follow from the other (though I'm no fan of internal 
snapshots, myself).
Anthony Liguori - Aug. 24, 2010, 1:56 p.m.
On 08/24/2010 08:44 AM, Avi Kivity wrote:
>  On 08/24/2010 04:40 PM, Anthony Liguori wrote:
>>> 1. Allocate a cluster (increase refcount table)
>>>
>>> 2. Link cluster to L2 table
>>>
>>> 3. Second operation makes it to disk; first still in pagecache
>>>
>>> 4. Crash
>>>
>>> 5. Dangling pointer from L2 to freed cluster
>>
>>
>> Yes, having this discussion in IRC.
>>
>> The problem is that we maintain a refcount table. 
>
> Are you sure that's the only issue?

No.

>> If we didn't do internal disk snapshots, we wouldn't have this 
>> problem.  IOW, VMDK doesn't have this problem so the answer to my 
>> very first question is that qcow2 is too difficult a format to get 
>> right.
>
> One doesn't follow from the other (though I'm no fan of internal 
> snapshots, myself).

It does.  Let's consider the failure scenarios:

1) guest submits write request
2) allocate extent
3) write data to disk (a)
4) write (a) completes
5) update reference count table for new extent (b)
6) write (b) completes
7) write extent table (c)
8) write (c) completes
9) complete guest write request

If this all happened in order and we lost power, the worst case error is 
that we leak a block which isn't terrible.

But we're not guaranteed that this happens in order.

If (b) or (c) happen before (a), then the image is not corrupted but 
data gets lost.  That's okay because it's part of the guest contract.

If (c) happens before (b), then we've created an extent that's attached 
to a table with a zero reference count.  This is a corrupt image.

Let's consider if we eliminate the reference count table which means 
eliminating internal snapshots.

1) guest submits write request
2) allocate extent
3) write data to disk (a)
4) write (a) completes
5) write extent table (c)
6) write (c) completes
7) complete guest write request

If this all happens in order and we lose power, we just leak a block.  
It means we need a periodic fsck.

If (c) completes before (a), then it means that the image is not 
corrupted but data gets lost.  This is okay based on the guest contract.

And that's it.  There is no scenario where the disk is corrupted.

So in summary, both situations are not perfect, but scenario (1) can 
result in a corrupted image whereas scenario (2) results in leakage.  
The classic solution to this is fsck.

Regards,

Anthony Liguori
Avi Kivity - Aug. 25, 2010, 7:14 a.m.
On 08/24/2010 04:56 PM, Anthony Liguori wrote:
>> One doesn't follow from the other (though I'm no fan of internal 
>> snapshots, myself).
>
>
> It does.  Let's consider the failure scenarios:
>
> 1) guest submits write request
> 2) allocate extent
> 3) write data to disk (a)
> 4) write (a) completes
> 5) update reference count table for new extent (b)
> 6) write (b) completes
> 7) write extent table (c)
> 8) write (c) completes
> 9) complete guest write request
>
> If this all happened in order and we lost power, the worst case error 
> is that we leak a block which isn't terrible.
>
> But we're not guaranteed that this happens in order.
>
> If (b) or (c) happen before (a), then the image is not corrupted but 
> data gets lost.  That's okay because it's part of the guest contract.
>
> If (c) happens before (b), then we've created an extent that's 
> attached to a table with a zero reference count.  This is a corrupt 
> image.
>

If the only issue is new block allocation, it can be easily solved.  
Instead of allocating exactly the needed amount of blocks, allocate a 
large extent and hold them in memory.  The next allocation can then be 
filled from memory, so the allocation sync is amortized over many 
blocks.  A power fail will leak the preallocated blocks, losing some 
megabytes of address space, but not real disk space.


> Let's consider if we eliminate the reference count table which means 
> eliminating internal snapshots.
>
> 1) guest submits write request
> 2) allocate extent
> 3) write data to disk (a)
> 4) write (a) completes
> 5) write extent table (c)
> 6) write (c) completes
> 7) complete guest write request
>
> If this all happens in order and we lose power, we just leak a block.  
> It means we need a periodic fsck.
>
> If (c) completes before (a), then it means that the image is not 
> corrupted but data gets lost.  This is okay based on the guest contract.
>
> And that's it.  There is no scenario where the disk is corrupted.

_if_ that's the only failure mode.
Anthony Liguori - Aug. 25, 2010, 12:46 p.m.
On 08/25/2010 02:14 AM, Avi Kivity wrote:
>> If (c) happens before (b), then we've created an extent that's 
>> attached to a table with a zero reference count.  This is a corrupt 
>> image.
>>
>
>
> If the only issue is new block allocation, it can be easily solved.

Technically, I believe there are similar issues around creating 
snapshots but I don't think we care.

>   Instead of allocating exactly the needed amount of blocks, allocate 
> a large extent and hold them in memory.

So you're suggesting that we allocate a bunch of blocks, update the ref 
count table so that they are seen as allocated even though they aren't 
attached to an l1 table?

>   The next allocation can then be filled from memory, so the 
> allocation sync is amortized over many blocks.  A power fail will leak 
> the preallocated blocks, losing some megabytes of address space, but 
> not real disk space.

It's a clever idea, but it would lose real disk space which is probably 
not a huge issue.

>> Let's consider if we eliminate the reference count table which means 
>> eliminating internal snapshots.
>>
>> 1) guest submits write request
>> 2) allocate extent
>> 3) write data to disk (a)
>> 4) write (a) completes
>> 5) write extent table (c)
>> 6) write (c) completes
>> 7) complete guest write request
>>
>> If this all happens in order and we lose power, we just leak a 
>> block.  It means we need a periodic fsck.
>>
>> If (c) completes before (a), then it means that the image is not 
>> corrupted but data gets lost.  This is okay based on the guest contract.
>>
>> And that's it.  There is no scenario where the disk is corrupted.
>
> _if_ that's the only failure mode.

If we had another disk format that only supported growth and metadata 
for a backing file, can you think of another failure scenario?

Regards,

Anthony Liguori
Avi Kivity - Aug. 25, 2010, 1:07 p.m.
On 08/25/2010 03:46 PM, Anthony Liguori wrote:
> On 08/25/2010 02:14 AM, Avi Kivity wrote:
>>> If (c) happens before (b), then we've created an extent that's 
>>> attached to a table with a zero reference count.  This is a corrupt 
>>> image.
>>>
>>
>>
>> If the only issue is new block allocation, it can be easily solved.
>
> Technically, I believe there are similar issues around creating 
> snapshots but I don't think we care.
>
>>   Instead of allocating exactly the needed amount of blocks, allocate 
>> a large extent and hold them in memory.
>
> So you're suggesting that we allocate a bunch of blocks, update the 
> ref count table so that they are seen as allocated even though they 
> aren't attached to an l1 table?

Yes.  Like malloc() will ask the OS for more memory that the 20-byte 
allocation you've requested.

>
>>   The next allocation can then be filled from memory, so the 
>> allocation sync is amortized over many blocks.  A power fail will 
>> leak the preallocated blocks, losing some megabytes of address space, 
>> but not real disk space.
>
> It's a clever idea, but it would lose real disk space which is 
> probably not a huge issue.

Not real disk space since no pwrite() would ever touch the disk.  If the 
image were copied, _then_ we'd lose the disk space, if the copy command 
and filesystem don't optimize zeros away.

>>>
>>> And that's it.  There is no scenario where the disk is corrupted.
>>
>> _if_ that's the only failure mode.
>
> If we had another disk format that only supported growth and metadata 
> for a backing file, can you think of another failure scenario?
>

I can't think of one, but that's not saying much.
Avi Kivity - Aug. 25, 2010, 1:23 p.m.
On 08/25/2010 03:46 PM, Anthony Liguori wrote:
>
> If we had another disk format that only supported growth and metadata 
> for a backing file, can you think of another failure scenario?

btw, only supporting growth is a step backwards.  Currently file-backed 
disks keep growing even the guest-used storage doesn't grow, since once 
we allocate something we never release it.  But eventually guests will 
start using TRIM or DISCARD or however it's called, and then we can 
expose it and reclaim unused blocks.
Anthony Liguori - Aug. 25, 2010, 1:37 p.m.
On 08/25/2010 08:07 AM, Avi Kivity wrote:
>>
>>>   The next allocation can then be filled from memory, so the 
>>> allocation sync is amortized over many blocks.  A power fail will 
>>> leak the preallocated blocks, losing some megabytes of address 
>>> space, but not real disk space.
>>
>> It's a clever idea, but it would lose real disk space which is 
>> probably not a huge issue.
>
> Not real disk space since no pwrite() would ever touch the disk.  If 
> the image were copied, _then_ we'd lose the disk space, if the copy 
> command and filesystem don't optimize zeros away. \

Ok.

Regards,

Anthony Liguori

>>>>
>>>> And that's it.  There is no scenario where the disk is corrupted.
>>>
>>> _if_ that's the only failure mode.
>>
>> If we had another disk format that only supported growth and metadata 
>> for a backing file, can you think of another failure scenario?
>>
>
> I can't think of one, but that's not saying much.
>
Anthony Liguori - Aug. 25, 2010, 1:42 p.m.
On 08/25/2010 08:23 AM, Avi Kivity wrote:
>  On 08/25/2010 03:46 PM, Anthony Liguori wrote:
>>
>> If we had another disk format that only supported growth and metadata 
>> for a backing file, can you think of another failure scenario?
>
> btw, only supporting growth is a step backwards.  Currently 
> file-backed disks keep growing even the guest-used storage doesn't 
> grow, since once we allocate something we never release it.  But 
> eventually guests will start using TRIM or DISCARD or however it's 
> called, and then we can expose it and reclaim unused blocks.

You can do this in one of two ways.  You can do online compaction or you 
can maintain a free list.  Online compaction has an advantage because it 
does not require any operations in the fast path whereas a free list 
would require ordered metadata updates (must remove something from the 
first list before updating the l2 table) which implies a sync.

At a high level, I don't think online compaction requires any specific 
support from an image format.

Regards,

Anthony Liguori
Anthony Liguori - Aug. 25, 2010, 1:46 p.m.
On 08/25/2010 08:23 AM, Avi Kivity wrote:
>  On 08/25/2010 03:46 PM, Anthony Liguori wrote:
>>
>> If we had another disk format that only supported growth and metadata 
>> for a backing file, can you think of another failure scenario?
>
> btw, only supporting growth is a step backwards.  Currently 
> file-backed disks keep growing even the guest-used storage doesn't 
> grow, since once we allocate something we never release it.  But 
> eventually guests will start using TRIM or DISCARD or however it's 
> called, and then we can expose it and reclaim unused blocks.

BTW, something that had the features of qcow2 that people actually used 
andwas fully asynchronous, performed well, and had a high degree of 
confidence in data integrity would be a major step forward, not backwards.

Regards,

Anthony Liguori
Avi Kivity - Aug. 25, 2010, 2 p.m.
On 08/25/2010 04:42 PM, Anthony Liguori wrote:
> On 08/25/2010 08:23 AM, Avi Kivity wrote:
>>  On 08/25/2010 03:46 PM, Anthony Liguori wrote:
>>>
>>> If we had another disk format that only supported growth and 
>>> metadata for a backing file, can you think of another failure scenario?
>>
>> btw, only supporting growth is a step backwards.  Currently 
>> file-backed disks keep growing even the guest-used storage doesn't 
>> grow, since once we allocate something we never release it.  But 
>> eventually guests will start using TRIM or DISCARD or however it's 
>> called, and then we can expose it and reclaim unused blocks.
>
> You can do this in one of two ways.  You can do online compaction or 
> you can maintain a free list.  Online compaction has an advantage 
> because it does not require any operations in the fast path whereas a 
> free list would require ordered metadata updates (must remove 
> something from the first list before updating the l2 table) which 
> implies a sync.

DISCARD/TRIM can queue blocks to the same preallocated block list we 
have to optimize allocation.  New
allocations can come from this list, if it grows too large we sync part 
of it to disk to avoid loss of a lot of free space on power fail.

> At a high level, I don't think online compaction requires any specific 
> support from an image format.
>

You need to know that the block is free and can be reallocated.
Avi Kivity - Aug. 25, 2010, 2:03 p.m.
On 08/25/2010 04:46 PM, Anthony Liguori wrote:
> On 08/25/2010 08:23 AM, Avi Kivity wrote:
>>  On 08/25/2010 03:46 PM, Anthony Liguori wrote:
>>>
>>> If we had another disk format that only supported growth and 
>>> metadata for a backing file, can you think of another failure scenario?
>>
>> btw, only supporting growth is a step backwards.  Currently 
>> file-backed disks keep growing even the guest-used storage doesn't 
>> grow, since once we allocate something we never release it.  But 
>> eventually guests will start using TRIM or DISCARD or however it's 
>> called, and then we can expose it and reclaim unused blocks.
>
> BTW, something that had the features of qcow2 that people actually 
> used andwas fully asynchronous, performed well, and had a high degree 
> of confidence in data integrity would be a major step forward, not 
> backwards.
>

TRIM/DISCARD is not a feature that people use (esp. as it's not 
available) but we do want to support it in our image formats.
Anthony Liguori - Aug. 25, 2010, 2:14 p.m.
On 08/25/2010 09:00 AM, Avi Kivity wrote:
>  On 08/25/2010 04:42 PM, Anthony Liguori wrote:
>> On 08/25/2010 08:23 AM, Avi Kivity wrote:
>>>  On 08/25/2010 03:46 PM, Anthony Liguori wrote:
>>>>
>>>> If we had another disk format that only supported growth and 
>>>> metadata for a backing file, can you think of another failure 
>>>> scenario?
>>>
>>> btw, only supporting growth is a step backwards.  Currently 
>>> file-backed disks keep growing even the guest-used storage doesn't 
>>> grow, since once we allocate something we never release it.  But 
>>> eventually guests will start using TRIM or DISCARD or however it's 
>>> called, and then we can expose it and reclaim unused blocks.
>>
>> You can do this in one of two ways.  You can do online compaction or 
>> you can maintain a free list.  Online compaction has an advantage 
>> because it does not require any operations in the fast path whereas a 
>> free list would require ordered metadata updates (must remove 
>> something from the first list before updating the l2 table) which 
>> implies a sync.
>
> DISCARD/TRIM can queue blocks to the same preallocated block list we 
> have to optimize allocation.  New
> allocations can come from this list, if it grows too large we sync 
> part of it to disk to avoid loss of a lot of free space on power fail.
>
>> At a high level, I don't think online compaction requires any 
>> specific support from an image format.
>>
>
> You need to know that the block is free and can be reallocated.

Semantically, TRIM/DISCARD means that "I don't care about the contents 
of the block anymore until I do another write."  Behind the scenes, we 
can keep track of which blocks have been discarded in an in-memory list 
whereas the first write to the block causes it to be evicted from the 
discarded list.

A background task would attempt to detect idle I/O and copy a block from 
the end of the file to a location on the discarded list.  When the copy 
has completed, you can then remove the L2 entry for the discarded block 
(effectively punching a hole in the image), sync, and then update the l2 
entry for the block at the end of file location to point to the new 
block location.  You can then ftruncate to reduce overall file size.

If you tried to maintain a free list, then you would need to sync on 
TRIM/DISCARD which is potentially a fast path.  While a background task 
may be less efficient in the short term, it's just as efficient in the 
long term and it has the advantage of keeping any fast path fast.

Regards,

Anthony Liguori
Christoph Hellwig - Aug. 25, 2010, 2:18 p.m.
On Wed, Aug 25, 2010 at 04:23:38PM +0300, Avi Kivity wrote:
>  On 08/25/2010 03:46 PM, Anthony Liguori wrote:
> >
> >If we had another disk format that only supported growth and metadata 
> >for a backing file, can you think of another failure scenario?
> 
> btw, only supporting growth is a step backwards.  Currently file-backed 
> disks keep growing even the guest-used storage doesn't grow, since once 
> we allocate something we never release it.  But eventually guests will 
> start using TRIM or DISCARD or however it's called, and then we can 
> expose it and reclaim unused blocks.

Together with file level snapshots Thin Provisioning support basically
makes qcow2 obsolete.

I already have a prototype implementation of scsi discard for qemu,
which together with the XFS extent size hints allows us to very
efficiently using growing and shrinking images on the bare filesystem.
XFS at this point doesn't have file-level snapshots, but I'm planning
to port the hole punhing and extent size hints to btrfs and maybe
ocfs2 which will give you all that.
Christoph Hellwig - Aug. 25, 2010, 2:19 p.m.
On Wed, Aug 25, 2010 at 08:46:15AM -0500, Anthony Liguori wrote:
> BTW, something that had the features of qcow2 that people actually used 
> andwas fully asynchronous, performed well, and had a high degree of 
> confidence in data integrity would be a major step forward, not backwards.

That means you'll actually need to agree on a feature set first.
Anthony Liguori - Aug. 25, 2010, 2:26 p.m.
On 08/25/2010 09:18 AM, Christoph Hellwig wrote:
> On Wed, Aug 25, 2010 at 04:23:38PM +0300, Avi Kivity wrote:
>    
>>   On 08/25/2010 03:46 PM, Anthony Liguori wrote:
>>      
>>> If we had another disk format that only supported growth and metadata
>>> for a backing file, can you think of another failure scenario?
>>>        
>> btw, only supporting growth is a step backwards.  Currently file-backed
>> disks keep growing even the guest-used storage doesn't grow, since once
>> we allocate something we never release it.  But eventually guests will
>> start using TRIM or DISCARD or however it's called, and then we can
>> expose it and reclaim unused blocks.
>>      
> Together with file level snapshots Thin Provisioning support basically
> makes qcow2 obsolete.
>    

I think we'll always need a sparse image format.  As long as people copy 
images to USB thumb drives containing vfat file systems or over dumb 
transports like HTTP, it will be necessary.

However, I agree that really advanced things like snapshotting is best 
done with a smart file system.

Regards,

Anthony Liguori

> I already have a prototype implementation of scsi discard for qemu,
> which together with the XFS extent size hints allows us to very
> efficiently using growing and shrinking images on the bare filesystem.
> XFS at this point doesn't have file-level snapshots, but I'm planning
> to port the hole punhing and extent size hints to btrfs and maybe
> ocfs2 which will give you all that.
>
>
Avi Kivity - Aug. 25, 2010, 2:36 p.m.
On 08/25/2010 05:14 PM, Anthony Liguori wrote:
>>
>>> At a high level, I don't think online compaction requires any 
>>> specific support from an image format.
>>>
>>
>> You need to know that the block is free and can be reallocated.
>
>
> Semantically, TRIM/DISCARD means that "I don't care about the contents 
> of the block anymore until I do another write."  Behind the scenes, we 
> can keep track of which blocks have been discarded in an in-memory 
> list whereas the first write to the block causes it to be evicted from 
> the discarded list.
>
> A background task would attempt to detect idle I/O and copy a block 
> from the end of the file to a location on the discarded list.  When 
> the copy has completed, you can then remove the L2 entry for the 
> discarded block (effectively punching a hole in the image), sync, and 
> then update the l2 entry for the block at the end of file location to 
> point to the new block location.  You can then ftruncate to reduce 
> overall file size.

That should work.

>
> If you tried to maintain a free list, then you would need to sync on 
> TRIM/DISCARD which is potentially a fast path.  While a background 
> task may be less efficient in the short term, it's just as efficient 
> in the long term and it has the advantage of keeping any fast path fast.
>

You only need to sync when the free list size grows beyond the amount of 
space you're prepared to lose on power fail.  And you may be able to 
defer the background task indefinitely by satisfying new allocations 
from the free list.
Avi Kivity - Aug. 25, 2010, 2:37 p.m.
On 08/25/2010 05:19 PM, Christoph Hellwig wrote:
> On Wed, Aug 25, 2010 at 08:46:15AM -0500, Anthony Liguori wrote:
>> BTW, something that had the features of qcow2 that people actually used
>> andwas fully asynchronous, performed well, and had a high degree of
>> confidence in data integrity would be a major step forward, not backwards.
> That means you'll actually need to agree on a feature set first.
>

:)

The next image format should be raw-over-btrfs.  And we need to continue 
developing qcow2, we can't keep moving users from format to format just 
because it's complicated.
Daniel P. Berrange - Aug. 25, 2010, 2:49 p.m.
On Wed, Aug 25, 2010 at 09:26:04AM -0500, Anthony Liguori wrote:
> On 08/25/2010 09:18 AM, Christoph Hellwig wrote:
> >On Wed, Aug 25, 2010 at 04:23:38PM +0300, Avi Kivity wrote:
> >   
> >>  On 08/25/2010 03:46 PM, Anthony Liguori wrote:
> >>     
> >>>If we had another disk format that only supported growth and metadata
> >>>for a backing file, can you think of another failure scenario?
> >>>       
> >>btw, only supporting growth is a step backwards.  Currently file-backed
> >>disks keep growing even the guest-used storage doesn't grow, since once
> >>we allocate something we never release it.  But eventually guests will
> >>start using TRIM or DISCARD or however it's called, and then we can
> >>expose it and reclaim unused blocks.
> >>     
> >Together with file level snapshots Thin Provisioning support basically
> >makes qcow2 obsolete.
> >   
> 
> I think we'll always need a sparse image format.  As long as people copy 
> images to USB thumb drives containing vfat file systems or over dumb 
> transports like HTTP, it will be necessary.

QCow2 is also useful with NFS because the encryption capabilities
let the virt host administrators protect guest data from the
storage administrators.

Regards,
Daniel
Anthony Liguori - Aug. 25, 2010, 3:06 p.m.
On 08/25/2010 09:36 AM, Avi Kivity wrote:
>>
>> If you tried to maintain a free list, then you would need to sync on 
>> TRIM/DISCARD which is potentially a fast path.  While a background 
>> task may be less efficient in the short term, it's just as efficient 
>> in the long term and it has the advantage of keeping any fast path fast.
>>
>
> You only need to sync when the free list size grows beyond the amount 
> of space you're prepared to lose on power fail.  And you may be able 
> to defer the background task indefinitely by satisfying new 
> allocations from the free list.

Free does not mean free.  If you immediately punch a hole in the l2 
without doing a sync, then you're never sure whether the hole is there 
on disk or not.  So if you then allocate that block and put it somewhere 
else in another l2 table, you need to sync the previous l2 change before 
you update the new l2.

Otherwise you can have two l2 entries pointing to the same block after a 
power failure.  That's not a leak, that's a data corruption.

Regards,

Anthony Liguori
Avi Kivity - Aug. 25, 2010, 3:15 p.m.
On 08/25/2010 06:06 PM, Anthony Liguori wrote:
> On 08/25/2010 09:36 AM, Avi Kivity wrote:
>>>
>>> If you tried to maintain a free list, then you would need to sync on 
>>> TRIM/DISCARD which is potentially a fast path.  While a background 
>>> task may be less efficient in the short term, it's just as efficient 
>>> in the long term and it has the advantage of keeping any fast path 
>>> fast.
>>>
>>
>> You only need to sync when the free list size grows beyond the amount 
>> of space you're prepared to lose on power fail.  And you may be able 
>> to defer the background task indefinitely by satisfying new 
>> allocations from the free list.
>
> Free does not mean free.  If you immediately punch a hole in the l2 
> without doing a sync, then you're never sure whether the hole is there 
> on disk or not.  So if you then allocate that block and put it 
> somewhere else in another l2 table, you need to sync the previous l2 
> change before you update the new l2.
>
> Otherwise you can have two l2 entries pointing to the same block after 
> a power failure.  That's not a leak, that's a data corruption.

L2 certainly needs to be updated before the block is reused.  But that's 
not different from a file format without a free list.

The batching I was referring to was only for free list management, same 
as the allocation issue which started this thread.
Anthony Liguori - Aug. 25, 2010, 3:21 p.m.
On 08/25/2010 10:15 AM, Avi Kivity wrote:
>  On 08/25/2010 06:06 PM, Anthony Liguori wrote:
>> On 08/25/2010 09:36 AM, Avi Kivity wrote:
>>>>
>>>> If you tried to maintain a free list, then you would need to sync 
>>>> on TRIM/DISCARD which is potentially a fast path.  While a 
>>>> background task may be less efficient in the short term, it's just 
>>>> as efficient in the long term and it has the advantage of keeping 
>>>> any fast path fast.
>>>>
>>>
>>> You only need to sync when the free list size grows beyond the 
>>> amount of space you're prepared to lose on power fail.  And you may 
>>> be able to defer the background task indefinitely by satisfying new 
>>> allocations from the free list.
>>
>> Free does not mean free.  If you immediately punch a hole in the l2 
>> without doing a sync, then you're never sure whether the hole is 
>> there on disk or not.  So if you then allocate that block and put it 
>> somewhere else in another l2 table, you need to sync the previous l2 
>> change before you update the new l2.
>>
>> Otherwise you can have two l2 entries pointing to the same block 
>> after a power failure.  That's not a leak, that's a data corruption.
>
> L2 certainly needs to be updated before the block is reused.  But 
> that's not different from a file format without a free list.
>
> The batching I was referring to was only for free list management, 
> same as the allocation issue which started this thread.

Okay, yes, you can orphan blocks pro-actively and then use them for 
future allocations instead of doing active defragmentation.

I'm still not sure a free list stored in the format is all that useful 
but you could certainly implement it lazily.

Regards,

Anthony Liguori

Patch

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 166922f..5760ad6 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -64,8 +64,8 @@  int qcow2_grow_l1_table(BlockDriverState *bs, int min_size)
     BLKDBG_EVENT(bs->file, BLKDBG_L1_GROW_WRITE_TABLE);
     for(i = 0; i < s->l1_size; i++)
         new_l1_table[i] = cpu_to_be64(new_l1_table[i]);
-    ret = bdrv_pwrite_sync(bs->file, new_l1_table_offset, new_l1_table, new_l1_size2);
-    if (ret < 0)
+    ret = bdrv_pwrite(bs->file, new_l1_table_offset, new_l1_table, new_l1_size2);
+    if (ret != new_l1_size2)
         goto fail;
     for(i = 0; i < s->l1_size; i++)
         new_l1_table[i] = be64_to_cpu(new_l1_table[i]);
@@ -74,8 +74,8 @@  int qcow2_grow_l1_table(BlockDriverState *bs, int min_size)
     BLKDBG_EVENT(bs->file, BLKDBG_L1_GROW_ACTIVATE_TABLE);
     cpu_to_be32w((uint32_t*)data, new_l1_size);
     cpu_to_be64w((uint64_t*)(data + 4), new_l1_table_offset);
-    ret = bdrv_pwrite_sync(bs->file, offsetof(QCowHeader, l1_size), data,sizeof(data));
-    if (ret < 0) {
+    ret = bdrv_pwrite(bs->file, offsetof(QCowHeader, l1_size), data,sizeof(data));
+    if (ret != sizeof(data)) {
         goto fail;
     }
     qemu_free(s->l1_table);
@@ -87,7 +87,7 @@  int qcow2_grow_l1_table(BlockDriverState *bs, int min_size)
  fail:
     qemu_free(new_l1_table);
     qcow2_free_clusters(bs, new_l1_table_offset, new_l1_size2);
-    return ret;
+    return ret < 0 ? ret : -EIO;
 }
 
 void qcow2_l2_cache_reset(BlockDriverState *bs)
@@ -207,7 +207,7 @@  static int write_l1_entry(BlockDriverState *bs, int l1_index)
     }
 
     BLKDBG_EVENT(bs->file, BLKDBG_L1_UPDATE);
-    ret = bdrv_pwrite_sync(bs->file, s->l1_table_offset + 8 * l1_start_index,
+    ret = bdrv_pwrite(bs->file, s->l1_table_offset + 8 * l1_start_index,
         buf, sizeof(buf));
     if (ret < 0) {
         return ret;
@@ -263,7 +263,7 @@  static int l2_allocate(BlockDriverState *bs, int l1_index, uint64_t **table)
     }
     /* write the l2 table to the file */
     BLKDBG_EVENT(bs->file, BLKDBG_L2_ALLOC_WRITE);
-    ret = bdrv_pwrite_sync(bs->file, l2_offset, l2_table,
+    ret = bdrv_pwrite(bs->file, l2_offset, l2_table,
         s->l2_size * sizeof(uint64_t));
     if (ret < 0) {
         goto fail;
@@ -413,8 +413,8 @@  static int copy_sectors(BlockDriverState *bs, uint64_t start_sect,
                         &s->aes_encrypt_key);
     }
     BLKDBG_EVENT(bs->file, BLKDBG_COW_WRITE);
-    ret = bdrv_write_sync(bs->file, (cluster_offset >> 9) + n_start,
-        s->cluster_data, n);
+    ret = bdrv_write(bs->file, (cluster_offset >> 9) + n_start,
+                     s->cluster_data, n);
     if (ret < 0)
         return ret;
     return 0;
@@ -631,10 +631,10 @@  uint64_t qcow2_alloc_compressed_cluster_offset(BlockDriverState *bs,
 
     BLKDBG_EVENT(bs->file, BLKDBG_L2_UPDATE_COMPRESSED);
     l2_table[l2_index] = cpu_to_be64(cluster_offset);
-    if (bdrv_pwrite_sync(bs->file,
+    if (bdrv_pwrite(bs->file,
                     l2_offset + l2_index * sizeof(uint64_t),
                     l2_table + l2_index,
-                    sizeof(uint64_t)) < 0)
+                    sizeof(uint64_t)) != sizeof(uint64_t))
         return 0;
 
     return cluster_offset;
@@ -655,7 +655,7 @@  static int write_l2_entries(BlockDriverState *bs, uint64_t *l2_table,
     int ret;
 
     BLKDBG_EVENT(bs->file, BLKDBG_L2_UPDATE);
-    ret = bdrv_pwrite_sync(bs->file, l2_offset + start_offset,
+    ret = bdrv_pwrite(bs->file, l2_offset + start_offset,
         &l2_table[l2_start_index], len);
     if (ret < 0) {
         return ret;
diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 4c19e7e..4542068 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -44,8 +44,8 @@  static int write_refcount_block(BlockDriverState *bs)
     }
 
     BLKDBG_EVENT(bs->file, BLKDBG_REFBLOCK_UPDATE);
-    if (bdrv_pwrite_sync(bs->file, s->refcount_block_cache_offset,
-            s->refcount_block_cache, size) < 0)
+    if (bdrv_pwrite(bs->file, s->refcount_block_cache_offset,
+            s->refcount_block_cache, size) != size)
     {
         return -EIO;
     }
@@ -269,7 +269,7 @@  static int64_t alloc_refcount_block(BlockDriverState *bs, int64_t cluster_index)
 
     /* Now the new refcount block needs to be written to disk */
     BLKDBG_EVENT(bs->file, BLKDBG_REFBLOCK_ALLOC_WRITE);
-    ret = bdrv_pwrite_sync(bs->file, new_block, s->refcount_block_cache,
+    ret = bdrv_pwrite(bs->file, new_block, s->refcount_block_cache,
         s->cluster_size);
     if (ret < 0) {
         goto fail_block;
@@ -279,7 +279,7 @@  static int64_t alloc_refcount_block(BlockDriverState *bs, int64_t cluster_index)
     if (refcount_table_index < s->refcount_table_size) {
         uint64_t data64 = cpu_to_be64(new_block);
         BLKDBG_EVENT(bs->file, BLKDBG_REFBLOCK_ALLOC_HOOKUP);
-        ret = bdrv_pwrite_sync(bs->file,
+        ret = bdrv_pwrite(bs->file,
             s->refcount_table_offset + refcount_table_index * sizeof(uint64_t),
             &data64, sizeof(data64));
         if (ret < 0) {
@@ -359,7 +359,7 @@  static int64_t alloc_refcount_block(BlockDriverState *bs, int64_t cluster_index)
 
     /* Write refcount blocks to disk */
     BLKDBG_EVENT(bs->file, BLKDBG_REFBLOCK_ALLOC_WRITE_BLOCKS);
-    ret = bdrv_pwrite_sync(bs->file, meta_offset, new_blocks,
+    ret = bdrv_pwrite(bs->file, meta_offset, new_blocks,
         blocks_clusters * s->cluster_size);
     qemu_free(new_blocks);
     if (ret < 0) {
@@ -372,7 +372,7 @@  static int64_t alloc_refcount_block(BlockDriverState *bs, int64_t cluster_index)
     }
 
     BLKDBG_EVENT(bs->file, BLKDBG_REFBLOCK_ALLOC_WRITE_TABLE);
-    ret = bdrv_pwrite_sync(bs->file, table_offset, new_table,
+    ret = bdrv_pwrite(bs->file, table_offset, new_table,
         table_size * sizeof(uint64_t));
     if (ret < 0) {
         goto fail_table;
@@ -387,7 +387,7 @@  static int64_t alloc_refcount_block(BlockDriverState *bs, int64_t cluster_index)
     cpu_to_be64w((uint64_t*)data, table_offset);
     cpu_to_be32w((uint32_t*)(data + 8), table_clusters);
     BLKDBG_EVENT(bs->file, BLKDBG_REFBLOCK_ALLOC_SWITCH_TABLE);
-    ret = bdrv_pwrite_sync(bs->file, offsetof(QCowHeader, refcount_table_offset),
+    ret = bdrv_pwrite(bs->file, offsetof(QCowHeader, refcount_table_offset),
         data, sizeof(data));
     if (ret < 0) {
         goto fail_table;
@@ -444,7 +444,7 @@  static int write_refcount_block_entries(BlockDriverState *bs,
     size = (last_index - first_index) << REFCOUNT_SHIFT;
 
     BLKDBG_EVENT(bs->file, BLKDBG_REFBLOCK_UPDATE_PART);
-    ret = bdrv_pwrite_sync(bs->file,
+    ret = bdrv_pwrite(bs->file,
         refcount_block_offset + (first_index << REFCOUNT_SHIFT),
         &s->refcount_block_cache[first_index], size);
     if (ret < 0) {
@@ -826,8 +826,8 @@  int qcow2_update_snapshot_refcount(BlockDriverState *bs,
                 }
             }
             if (l2_modified) {
-                if (bdrv_pwrite_sync(bs->file,
-                                l2_offset, l2_table, l2_size) < 0)
+                if (bdrv_pwrite(bs->file,
+                                l2_offset, l2_table, l2_size) != l2_size)
                     goto fail;
             }
 
@@ -850,8 +850,8 @@  int qcow2_update_snapshot_refcount(BlockDriverState *bs,
     if (l1_modified) {
         for(i = 0; i < l1_size; i++)
             cpu_to_be64s(&l1_table[i]);
-        if (bdrv_pwrite_sync(bs->file, l1_table_offset, l1_table,
-                        l1_size2) < 0)
+        if (bdrv_pwrite(bs->file, l1_table_offset, l1_table,
+                        l1_size2) != l1_size2)
             goto fail;
         for(i = 0; i < l1_size; i++)
             be64_to_cpus(&l1_table[i]);
diff --git a/block/qcow2-snapshot.c b/block/qcow2-snapshot.c
index 6228612..2a21c17 100644
--- a/block/qcow2-snapshot.c
+++ b/block/qcow2-snapshot.c
@@ -158,25 +158,25 @@  static int qcow_write_snapshots(BlockDriverState *bs)
         h.id_str_size = cpu_to_be16(id_str_size);
         h.name_size = cpu_to_be16(name_size);
         offset = align_offset(offset, 8);
-        if (bdrv_pwrite_sync(bs->file, offset, &h, sizeof(h)) < 0)
+        if (bdrv_pwrite(bs->file, offset, &h, sizeof(h)) != sizeof(h))
             goto fail;
         offset += sizeof(h);
-        if (bdrv_pwrite_sync(bs->file, offset, sn->id_str, id_str_size) < 0)
+        if (bdrv_pwrite(bs->file, offset, sn->id_str, id_str_size) != id_str_size)
             goto fail;
         offset += id_str_size;
-        if (bdrv_pwrite_sync(bs->file, offset, sn->name, name_size) < 0)
+        if (bdrv_pwrite(bs->file, offset, sn->name, name_size) != name_size)
             goto fail;
         offset += name_size;
     }
 
     /* update the various header fields */
     data64 = cpu_to_be64(snapshots_offset);
-    if (bdrv_pwrite_sync(bs->file, offsetof(QCowHeader, snapshots_offset),
-                    &data64, sizeof(data64)) < 0)
+    if (bdrv_pwrite(bs->file, offsetof(QCowHeader, snapshots_offset),
+                    &data64, sizeof(data64)) != sizeof(data64))
         goto fail;
     data32 = cpu_to_be32(s->nb_snapshots);
-    if (bdrv_pwrite_sync(bs->file, offsetof(QCowHeader, nb_snapshots),
-                    &data32, sizeof(data32)) < 0)
+    if (bdrv_pwrite(bs->file, offsetof(QCowHeader, nb_snapshots),
+                    &data32, sizeof(data32)) != sizeof(data32))
         goto fail;
 
     /* free the old snapshot table */
@@ -284,8 +284,9 @@  int qcow2_snapshot_create(BlockDriverState *bs, QEMUSnapshotInfo *sn_info)
     for(i = 0; i < s->l1_size; i++) {
         l1_table[i] = cpu_to_be64(s->l1_table[i]);
     }
-    if (bdrv_pwrite_sync(bs->file, sn->l1_table_offset,
-                    l1_table, s->l1_size * sizeof(uint64_t)) < 0)
+    if (bdrv_pwrite(bs->file, sn->l1_table_offset,
+                    l1_table, s->l1_size * sizeof(uint64_t)) !=
+        (s->l1_size * sizeof(uint64_t)))
         goto fail;
     qemu_free(l1_table);
     l1_table = NULL;
@@ -334,8 +335,8 @@  int qcow2_snapshot_goto(BlockDriverState *bs, const char *snapshot_id)
     if (bdrv_pread(bs->file, sn->l1_table_offset,
                    s->l1_table, l1_size2) != l1_size2)
         goto fail;
-    if (bdrv_pwrite_sync(bs->file, s->l1_table_offset,
-                    s->l1_table, l1_size2) < 0)
+    if (bdrv_pwrite(bs->file, s->l1_table_offset,
+                    s->l1_table, l1_size2) != l1_size2)
         goto fail;
     for(i = 0;i < s->l1_size; i++) {
         be64_to_cpus(&s->l1_table[i]);
diff --git a/block/qcow2.c b/block/qcow2.c
index a53014d..a62a010 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -735,7 +735,7 @@  static int qcow2_update_ext_header(BlockDriverState *bs,
         backing_file_offset = sizeof(QCowHeader) + offset;
     }
 
-    ret = bdrv_pwrite_sync(bs->file, sizeof(QCowHeader), buf, ext_size);
+    ret = bdrv_pwrite(bs->file, sizeof(QCowHeader), buf, ext_size);
     if (ret < 0) {
         goto fail;
     }
@@ -744,13 +744,13 @@  static int qcow2_update_ext_header(BlockDriverState *bs,
     uint64_t be_backing_file_offset = cpu_to_be64(backing_file_offset);
     uint32_t be_backing_file_size = cpu_to_be32(backing_file_len);
 
-    ret = bdrv_pwrite_sync(bs->file, offsetof(QCowHeader, backing_file_offset),
+    ret = bdrv_pwrite(bs->file, offsetof(QCowHeader, backing_file_offset),
         &be_backing_file_offset, sizeof(uint64_t));
     if (ret < 0) {
         goto fail;
     }
 
-    ret = bdrv_pwrite_sync(bs->file, offsetof(QCowHeader, backing_file_size),
+    ret = bdrv_pwrite(bs->file, offsetof(QCowHeader, backing_file_size),
         &be_backing_file_size, sizeof(uint32_t));
     if (ret < 0) {
         goto fail;
@@ -1134,8 +1134,8 @@  static int qcow2_truncate(BlockDriverState *bs, int64_t offset)
 
     /* write updated header.size */
     offset = cpu_to_be64(offset);
-    ret = bdrv_pwrite_sync(bs->file, offsetof(QCowHeader, size),
-                           &offset, sizeof(uint64_t));
+    ret = bdrv_pwrite(bs->file, offsetof(QCowHeader, size),
+                      &offset, sizeof(uint64_t));
     if (ret < 0) {
         return ret;
     }