[BUG] qemu-1.1.2 [FIXED-BY] qcow2: Fix avail_sectors in cluster allocation code

Message ID 201212121425.41850.hahn@univention.de
State New

Commit Message

Philipp Hahn Dec. 12, 2012, 1:25 p.m. UTC
Hello Kevin, hello Michael, hello *,

we noticed a data corruption bug in qemu-1.1.2, which will be shipped by 
Debian and our own Debian-based distribution.
The corruption mostly manifests while installing large Debian package files 
and seems to be related to memory pressure: As long as the file is still in 
the page cache, everything looks fine, but when the file is re-read from the 
virtual hard disk using a qcow2 file backed by another qcow2 file, the file 
is corrupted: dpkg complains that the .tar.gz file inside the Debian archive 
file is corrupted and the md5sum no longer matches.

I tracked this down using "git bisect" to your patch attached below, which 
fixed this bug, so everything is fine with qemu-kvm-1.2.0.
From my reading this seems to explain our problems, since during my own 
testing during development I never used backing chains and the problem only 
showed up when my colleagues started using qemu-kvm-1.1.2 with their VMs using 
backing chains.

@Kevin: Do you think that's a valid explanation and your patch should fix 
that problem?
I'd like to get your expertise before filing a bug with Debian and asking 
Michael to include that patch with his next stable update for 1.1.

Thanks in advance.

Sincerely
Philipp

Comments

Kevin Wolf Dec. 12, 2012, 1:41 p.m. UTC | #1
Hi Philipp,

On 12.12.2012 14:25, Philipp Hahn wrote:
> Hello Kevin, hello Michael, hello *,
> 
> we noticed a data corruption bug in qemu-1.1.2, which will be shipped by 
> Debian and our own Debian-based distribution.
> The corruption mostly manifests while installing large Debian package files 
> and seems to be related to memory pressure: As long as the file is still in 
> the page cache, everything looks fine, but when the file is re-read from the 
> virtual hard disk using a qcow2 file backed by another qcow2 file, the file 
> is corrupted: dpkg complains that the .tar.gz file inside the Debian archive 
> file is corrupted and the md5sum no longer matches.
> 
> I tracked this down using "git bisect" to your patch attached below, which 
> fixed this bug, so everything is fine with qemu-kvm-1.2.0.
> From my reading this seems to explain our problems, since during my own 
> testing during development I never used backing chains and the problem only 
> showed up when my colleagues started using qemu-kvm-1.1.2 with their VMs using 
> backing chains.
> 
> @Kevin: Do you think that's a valid explanation and your patch should fix 
> that problem?
> I'd like to get your expertise before filing a bug with Debian and asking 
> Michael to include that patch with his next stable update for 1.1.

As you can see in the commit message of that patch I was convinced that
no bug did exist in practice and this was only dangerous with respect to
future changes. Therefore my first question is if you're using an
unmodified upstream qemu or if some backported patches are applied to
it? If it's indeed unmodified, we should probably review the code once
again to understand why it makes a difference.

In any case, this is the cluster allocation code. It's probably not
related to rereading things from disk, but rather to the writeout of the
page cache.

Kevin
Philipp Hahn Dec. 12, 2012, 2:09 p.m. UTC | #2
Hello Kevin, 

On Wednesday 12 December 2012 14:41:49 Kevin Wolf wrote:
> As you can see in the commit message of that patch I was convinced that
> no bug did exist in practice and this was only dangerous with respect to
> future changes. Therefore my first question is if you're using an
> unmodified upstream qemu or if some backported patches are applied to
> it? If it's indeed unmodified, we should probably review the code once
> again to understand why it makes a difference.

These were all unmodified versions directly from git between "qemu-kvm-1.1.0" 
and "qemu-kvm-1.2.0"

"git checkout b7ab0fea37c15ca9e249c42c46f5c48fd1a0943c" works,
"git checkout b7ab0fea37c15ca9e249c42c46f5c48fd1a0943c~1" is broken.
"git checkout qemu-kvm-1.1.2"  is broken,
"git checkout qemu-kvm-1.1.2 ; git cherry-pick 
b7ab0fea37c15ca9e249c42c46f5c48fd1a0943c"  works

> In any case, this is the cluster allocation code. It's probably not
> related to rereading things from disk, but rather to the writeout of the
> page cache.

Yes, the problem is probably write-related. But as the write "doesn't explode 
with some spectacular error", I only notice the error on the following read 
by comparing md5 sums.
I just re-checked it: After a reboot the md5sums are still invalid, so I guess 
the data is corrupted on writeout.

Sincerely
Philipp
Kevin Wolf Dec. 12, 2012, 4:54 p.m. UTC | #3
On 12.12.2012 15:09, Philipp Hahn wrote:
> Hello Kevin, 
> 
> On Wednesday 12 December 2012 14:41:49 Kevin Wolf wrote:
>> As you can see in the commit message of that patch I was convinced that
>> no bug did exist in practice and this was only dangerous with respect to
>> future changes. Therefore my first question is if you're using an
>> unmodified upstream qemu or if some backported patches are applied to
>> it? If it's indeed unmodified, we should probably review the code once
>> again to understand why it makes a difference.
> 
> These were all unmodified versions directly from git between "qemu-kvm-1.1.0" 
> and "qemu-kvm-1.2.0"
> 
> "git checkout b7ab0fea37c15ca9e249c42c46f5c48fd1a0943c" works,
> "git checkout b7ab0fea37c15ca9e249c42c46f5c48fd1a0943c~1" is broken.
> "git checkout qemu-kvm-1.1.2"  is broken,
> "git checkout qemu-kvm-1.1.2 ; git cherry-pick 
> b7ab0fea37c15ca9e249c42c46f5c48fd1a0943c"  works

Ok, thanks for clarifying. Then I must have missed some interesting case
while doing the patch.

Ideally we would find a sequence of qemu-io commands to reliably
reproduce this. First thing worth trying would be running the current
qemu-iotests suite on the old versions. If we don't find it this way, I
guess we need to catch it with code review. I'm not sure if I can get to
it this week, and starting next week I'll be on vacation, so any help
with finding a reproducer would be appreciated.

>> In any case, this is the cluster allocation code. It's probably not
>> related to rereading things from disk, but rather to the writeout of the
>> page cache.
> 
> Yes, the problem is probably write-related. But as the write "doesn't explode 
> with some spectacular error", I only notice the error on the following read 
> by comparing md5 sums.
> I just re-checked it: After a reboot the md5sums are still invalid, so I guess 
> the data is corrupted on writeout.

Yes, it's really the only thing that makes sense in connection with this
patch.

Kevin
Philipp Hahn Dec. 12, 2012, 5:29 p.m. UTC | #4
Hello Kevin,

On Wednesday 12 December 2012 17:54:58 Kevin Wolf wrote:
> On 12.12.2012 15:09, Philipp Hahn wrote:
> > On Wednesday 12 December 2012 14:41:49 Kevin Wolf wrote:
> >> As you can see in the commit message of that patch I was convinced that
> >> no bug did exist in practice and this was only dangerous with respect to
> >> future changes. Therefore my first question is if you're using an
> >> unmodified upstream qemu or if some backported patches are applied to
> >> it? If it's indeed unmodified, we should probably review the code once
> >> again to understand why it makes a difference.
> >
> > These were all unmodified versions directly from git between
> > "qemu-kvm-1.1.0" and "qemu-kvm-1.2.0"
> >
> > "git checkout b7ab0fea37c15ca9e249c42c46f5c48fd1a0943c" works,
> > "git checkout b7ab0fea37c15ca9e249c42c46f5c48fd1a0943c~1" is broken.
> > "git checkout qemu-kvm-1.1.2"  is broken,
> > "git checkout qemu-kvm-1.1.2 ; git cherry-pick
> > b7ab0fea37c15ca9e249c42c46f5c48fd1a0943c"  works
>
> Ok, thanks for clarifying. Then I must have missed some interesting case
> while doing the patch.

I just re-ran my "git bisect run ~/bisect.sh" case, but it again arrived at 
that patch. I just queued another run for tonight to make sure the test is 
reliable:

# bad: [4c3e02beed9878a5f760eeceb6cd42c475cf0127] Merge tag 'v1.2.0'
# good: [bd11ac4feb54d32653e5d4eb7994bed18be0609c] fdc: fix implied seek while there is no media in drive
git bisect start 'qemu-kvm-1.2.0' 'qemu-kvm-1.1.0'
# good: [15ecf28f39e2b6fba359ed094770c8fa4ad8dc60] Merge tag 'v1.1.0' into next
git bisect good 15ecf28f39e2b6fba359ed094770c8fa4ad8dc60
# bad: [2fa5008ffd49e51540756adccf966a2fcde6e6c1] hd-geometry: Factor out guess_chs_for_size()
git bisect bad 2fa5008ffd49e51540756adccf966a2fcde6e6c1
# bad: [306a571a2d75e32cd2eae5486c2714b7b7792a63] hw/arm_gic: Add qdev property for GIC revision
git bisect bad 306a571a2d75e32cd2eae5486c2714b7b7792a63
# good: [5c6f4f178ba542358c012ca033985f73e61b8ae5] z2: Rename PXA2xxState variable
git bisect good 5c6f4f178ba542358c012ca033985f73e61b8ae5
# good: [833e40858cb9501c5e76b3aa345e4bb5be34385a] qcow2: remove a line of unnecessary code
git bisect good 833e40858cb9501c5e76b3aa345e4bb5be34385a
# bad: [0b0cb9d310edfe2b2d108f18be4f013a1e552cfd] Merge remote-tracking branch 'kwolf/for-anthony' into staging
git bisect bad 0b0cb9d310edfe2b2d108f18be4f013a1e552cfd
# bad: [0446919dcab51e7468f346c0a009a88632c5c5e0] qemu-iotests: COW with many AIO requests on the same cluster
git bisect bad 0446919dcab51e7468f346c0a009a88632c5c5e0
# good: [b75a02829dde98723dfe16fa098338cb267b28b9] Prevent disk data loss when closing qemu
git bisect good b75a02829dde98723dfe16fa098338cb267b28b9
# good: [c4a248a138028bee63a099410c79b428db0c4779] block: copy enable_write_cache in bdrv_append
git bisect good c4a248a138028bee63a099410c79b428db0c4779
# good: [6af4e9ead4ec9491259c9861b1b35f9abee24a66] qcow2: always operate caches in writeback mode
git bisect good 6af4e9ead4ec9491259c9861b1b35f9abee24a66
# bad: [b7ab0fea37c15ca9e249c42c46f5c48fd1a0943c] qcow2: Fix avail_sectors in cluster allocation code
git bisect bad b7ab0fea37c15ca9e249c42c46f5c48fd1a0943c
# good: [cdba7fee1daa8865bac2d69da288171fe7c21aae] qcow2: Simplify calculation for COW area at the end
git bisect good cdba7fee1daa8865bac2d69da288171fe7c21aae

> Ideally we would find a sequence of qemu-io commands to reliably
> reproduce this. First thing worth trying would be running the current
> qemu-iotests suite on the old versions. If we don't find it this way, I
> guess we need to catch it with code review. I'm not sure if I can get to
> it this week, and starting next week I'll be on vacation, so any help
> with finding a reproducer would be appreciated.

I'll have a look at it tomorrow.

Thank you for your fast replies, and have a nice vacation in case we don't hear 
from each other again this week.

BYtE
Philipp
Philipp Hahn Dec. 14, 2012, 1:03 p.m. UTC | #5
Hello Kevin,

On Wednesday 12 December 2012 18:29:48 Philipp Hahn wrote:
> I just re-ran my "git bisect run ~/bisect.sh" case, but it again arrived
> at that patch. I just queued another run for tonight to make sure the test
> is reliable:

The run from last night again arrived at the referenced patch.

> > Ideally we would find a sequence of qemu-io commands to reliably
> > reproduce this. First thing worth trying would be running the current
> > qemu-iotests suite on the old versions. If we don't find it this way, I
> > guess we need to catch it with code review. I'm not sure if I can get to
> > it this week, and starting next week I'll be on vacation, so any help
> > with finding a reproducer would be appreciated.

I took a closer look at what gets corrupted; I've attached my notes.
Please note that the partitions are not aligned properly.

If you would like to look at the full qcow2_alloc_clusters_offset trace, I can 
provide you with a link to the trace file.

BYtE
Philipp
Philipp Hahn Dec. 18, 2012, 9:46 a.m. UTC | #6
Hello Kevin, hello Michael,

On Wednesday 12 December 2012 17:54:58 Kevin Wolf wrote:
> On 12.12.2012 15:09, Philipp Hahn wrote:
> > On Wednesday 12 December 2012 14:41:49 Kevin Wolf wrote:
> >> As you can see in the commit message of that patch I was convinced that
> >> no bug did exist in practice and this was only dangerous with respect to
> >> future changes. Therefore my first question is if you're using an
> >> unmodified upstream qemu or if some backported patches are applied to
> >> it? If it's indeed unmodified, we should probably review the code once
> >> again to understand why it makes a difference.
> >
> > These were all unmodified versions directly from git between
> > "qemu-kvm-1.1.0" and "qemu-kvm-1.2.0"
> >
> > "git checkout b7ab0fea37c15ca9e249c42c46f5c48fd1a0943c" works,
> > "git checkout b7ab0fea37c15ca9e249c42c46f5c48fd1a0943c~1" is broken.
> > "git checkout qemu-kvm-1.1.2"  is broken,
> > "git checkout qemu-kvm-1.1.2 ; git cherry-pick
> > b7ab0fea37c15ca9e249c42c46f5c48fd1a0943c"  works
>
> Ok, thanks for clarifying. Then I must have missed some interesting case
> while doing the patch.

I think I found your missing link:
After filling in "QCowL2Meta *m", that request is queued:
  QLIST_INSERT_HEAD(&s->cluster_allocs, m, next_in_flight);
to prevent double-allocating the same cluster for overlapping requests, which 
is checked in do_alloc_cluster_offset().

I guess that since the sector count was wrong, the overlap detection didn't 
work and the two concurrent write requests to the same cluster overwrote each 
other.

> Ideally we would find a sequence of qemu-io commands to reliably
> reproduce this.

You're the block guru, so I leave that to you (or anybody else who knows more 
about the working of qemu-io.) ;-)

Sincerely
Philipp
Michael Tokarev Dec. 18, 2012, 12:12 p.m. UTC | #7
On 18.12.2012 13:46, Philipp Hahn wrote:

> I think I found your missing link:
> After filling in "QCowL2Meta *m", that request is queued:
>   QLIST_INSERT_HEAD(&s->cluster_allocs, m, next_in_flight);
> to prevent double-allocating the same cluster for overlapping requests, which 
> is checked in do_alloc_cluster_offset().
> 
> I guess that since the sector count was wrong, the overlap detection didn't 
> work and the two concurrent write requests to the same cluster overwrote each 
> other.

Meh.  And I already closed the Debian bug report... :)

But thank you Philipp for your excellent work on the matter!

/mjt
Kevin Wolf Jan. 22, 2013, 10:25 a.m. UTC | #8
On 18.12.2012 10:46, Philipp Hahn wrote:
> Hello Kevin, hello Michael,
> 
> On Wednesday 12 December 2012 17:54:58 Kevin Wolf wrote:
>> On 12.12.2012 15:09, Philipp Hahn wrote:
>>> On Wednesday 12 December 2012 14:41:49 Kevin Wolf wrote:
>>>> As you can see in the commit message of that patch I was convinced that
>>>> no bug did exist in practice and this was only dangerous with respect to
>>>> future changes. Therefore my first question is if you're using an
>>>> unmodified upstream qemu or if some backported patches are applied to
>>>> it? If it's indeed unmodified, we should probably review the code once
>>>> again to understand why it makes a difference.
>>>
>>> These were all unmodified versions directly from git between
>>> "qemu-kvm-1.1.0" and "qemu-kvm-1.2.0"
>>>
>>> "git checkout b7ab0fea37c15ca9e249c42c46f5c48fd1a0943c" works,
>>> "git checkout b7ab0fea37c15ca9e249c42c46f5c48fd1a0943c~1" is broken.
>>> "git checkout qemu-kvm-1.1.2"  is broken,
>>> "git checkout qemu-kvm-1.1.2 ; git cherry-pick
>>> b7ab0fea37c15ca9e249c42c46f5c48fd1a0943c"  works
>>
>> Ok, thanks for clarifying. Then I must have missed some interesting case
>> while doing the patch.
> 
> I think I found your missing link:
> After filling in "QCowL2Meta *m", that request is queued:
>   QLIST_INSERT_HEAD(&s->cluster_allocs, m, next_in_flight);
> to prevent double-allocating the same cluster for overlapping requests, which 
> is checked in do_alloc_cluster_offset().
> I guess that since the sector count was wrong, the overlap detection didn't 
> work and the two concurrent write requests to the same cluster overwrote each 
> other.

I'm still not seeing it. The overlap detection code doesn't use
m->nb_available at all, so why would it make a difference?

I guess I need some people who can decently review code - Laszlo, Paolo,
any idea why upstream commit b7ab0fea does change the behaviour,
contrary to what I claimed in the commit message?

>> Ideally we would find a sequence of qemu-io commands to reliably
>> reproduce this.
> 
> You're the block guru, so I leave that to you (or anybody else who knows more 
> about the working of qemu-io.) ;-)

Yeah, as soon as I know what's happening, calling qemu-io with the right
options shouldn't be a problem.

Kevin
Laszlo Ersek Jan. 22, 2013, 1:17 p.m. UTC | #9
On 01/22/13 11:25, Kevin Wolf wrote:
> On 18.12.2012 10:46, Philipp Hahn wrote:
>> Hello Kevin, hello Michael,
>>
>> On Wednesday 12 December 2012 17:54:58 Kevin Wolf wrote:
>>> On 12.12.2012 15:09, Philipp Hahn wrote:
>>>> On Wednesday 12 December 2012 14:41:49 Kevin Wolf wrote:
>>>>> As you can see in the commit message of that patch I was convinced that
>>>>> no bug did exist in practice and this was only dangerous with respect to
>>>>> future changes. Therefore my first question is if you're using an
>>>>> unmodified upstream qemu or if some backported patches are applied to
>>>>> it? If it's indeed unmodified, we should probably review the code once
>>>>> again to understand why it makes a difference.
>>>>
>>>> These were all unmodified versions directly from git between
>>>> "qemu-kvm-1.1.0" and "qemu-kvm-1.2.0"
>>>>
>>>> "git checkout b7ab0fea37c15ca9e249c42c46f5c48fd1a0943c" works,
>>>> "git checkout b7ab0fea37c15ca9e249c42c46f5c48fd1a0943c~1" is broken.
>>>> "git checkout qemu-kvm-1.1.2"  is broken,
>>>> "git checkout qemu-kvm-1.1.2 ; git cherry-pick
>>>> b7ab0fea37c15ca9e249c42c46f5c48fd1a0943c"  works
>>>
>>> Ok, thanks for clarifying. Then I must have missed some interesting case
>>> while doing the patch.
>>
>> I think I found your missing link:
>> After filling in "QCowL2Meta *m", that request is queued:
>>   QLIST_INSERT_HEAD(&s->cluster_allocs, m, next_in_flight);
>> to prevent double-allocating the same cluster for overlapping requests, which 
>> is checked in do_alloc_cluster_offset().
>> I guess that since the sector count was wrong, the overlap detection didn't 
>> work and the two concurrent write requests to the same cluster overwrote each 
>> other.
> 
> I'm still not seeing it. The overlap detection code doesn't use
> m->nb_available at all, so why would it make a difference?
> 
> I guess I need some people who can decently review code - Laszlo, Paolo,
> any idea why upstream commit b7ab0fea does change the behaviour,
> contrary to what I claimed in the commit message?

As I understand it, the question is not whether b7ab0fea works or not
(it does), the question is whether *without* b7ab0fea there is a real
problem (user visible bug).

Commit b7ab0fea decreased "avail_sectors" by

    keep_clusters << (s->cluster_bits - BDRV_SECTOR_BITS)

which should make a difference for "m->nb_available" in two cases:

(a)

    avail_sectors_patched   <
    requested_sectors       <=
    avail_sectors_unpatched

(ie. the decrease in "avail_sectors" flipped MIN to prefer it over
requested_sectors, lowering m->nb_available)

(b)

    avail_sectors_patched   <
    avail_sectors_unpatched <=
    requested_sectors

(ie. MIN had already preferred "avail_sectors", and now it went lower).


It seems that

  1 << (s->cluster_bits - BDRV_SECTOR_BITS) == s->cluster_sectors    [1]

Therefore both "avail_sectors_patched" and "avail_sectors_unpatched" are
an integral multiple of "s->cluster_sectors". Whenever MIN() selects
either of them, the condition in qcow2_alloc_cluster_link_l2() will
evaluate to false.
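
As a numeric cross-check, a small standalone sketch (editorial, not QEMU
code): the expressions for requested_sectors and avail_sectors are taken
from the patch quoted at the end of this page, MIN() and the end-of-cluster
COW condition are modelled here as a plain remainder check, and the
keep_clusters/nb_clusters/n_end values are made up so that they fall into
case (a); whether the allocator can actually reach such a state is exactly
the question discussed below.

/* Editorial sketch: compare nb_available before and after commit b7ab0fea
 * for a hypothetical case (a) input. */
#include <stdio.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

int main(void)
{
    int cluster_sectors = 128;   /* 64k clusters, 512-byte sectors */
    int keep_clusters   = 1;     /* clusters kept from the existing mapping */
    int nb_clusters     = 2;     /* newly allocated clusters */
    int n_end           = 400;   /* end of the write request, in sectors */

    int requested_sectors = n_end - keep_clusters * cluster_sectors;
    int avail_unpatched   = (keep_clusters + nb_clusters) * cluster_sectors;
    int avail_patched     = nb_clusters * cluster_sectors;

    int nb_avail_unpatched = MIN(requested_sectors, avail_unpatched);
    int nb_avail_patched   = MIN(requested_sectors, avail_patched);

    /* case (a): avail_patched < requested_sectors <= avail_unpatched */
    printf("requested=%d avail_unpatched=%d avail_patched=%d\n",
           requested_sectors, avail_unpatched, avail_patched);
    printf("unpatched nb_available=%d, tail COW: %s\n", nb_avail_unpatched,
           nb_avail_unpatched % cluster_sectors ? "yes" : "no");
    printf("patched   nb_available=%d, tail COW: %s\n", nb_avail_patched,
           nb_avail_patched % cluster_sectors ? "yes" : "no");
    return 0;
}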


In case (b) there seems to be no difference between patched & unpatched
in this regard: MIN() always selects an integral multiple, and so
copy_sectors() is never invoked.

In case (a), the patched version skips copy_sectors(), while the
unpatched version invokes it.

I'm not sure if case (a) is possible at all -- let's expand it:

    avail_sectors_patched   <
    requested_sectors       <=
    avail_sectors_unpatched

Substitute the assigned values:

                  nb_clusters << (s->cluster_bits - BDRV_SECTOR_BITS) <
    n_end - keep_clusters * s->cluster_sectors                        <=
(keep_clusters + nb_clusters) << (s->cluster_bits - BDRV_SECTOR_BITS)

Using [1], replace the shifts with multiplication:

                    nb_clusters * s->cluster_sectors <
          n_end - keep_clusters * s->cluster_sectors <=
  (keep_clusters + nb_clusters) * s->cluster_sectors

Distribute the addition:

                    nb_clusters * s->cluster_sectors   <
          n_end - keep_clusters * s->cluster_sectors   <=
                  keep_clusters * s->cluster_sectors +
                    nb_clusters * s->cluster_sectors

N :=   nb_clusters * s->cluster_sectors [sectors]
K := keep_clusters * s->cluster_sectors [sectors]

                  N  <  n_end - K  <=    K + N

Add K

              K + N  <  n_end      <=  2*K + N                       [2]

I have no clue about qcow2, but *each one* of K and N seems to be a
function of *both* n_end and the current "qcow2 disk state" (ie. sectors
being allocated vs. just referenced). IOW, I suspect that case (a) is
possible for some (disk state, n_end) pairs. And case (a) -- if indeed
possible -- seems to contradict the commit message:

  allocation. A COW occurs only if the request wasn't cluster aligned,
                                       ^^^^^^^    ^^^^^^^^^^^^^^^^^^^
  which in turn would imply that requested_sectors was less than
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  avail_sectors (both in the original and in the fixed version). In this
  ^^^^^^^^^^^^^                           ^^^^^^^^^^^^^^^^^^^^

In case (a) -- expanded as [2] --, "requested_sectors" is greater than
"avail_sectors_patched", and less than or equal to
"avail_sectors_unpatched".

Laszlo
Kevin Wolf Jan. 22, 2013, 2:14 p.m. UTC | #10
On 22.01.2013 14:17, Laszlo Ersek wrote:
> As I understand it, the question is not whether b7ab0fea works or not
> (it does), the question is whether *without* b7ab0fea there is a real
> problem (user visible bug).

Correct. Or actually not whether, but why.

> Commit b7ab0fea decreased "avail_sectors" by
> 
>     keep_clusters << (s->cluster_bits - BDRV_SECTOR_BITS)
> 
> which should make a difference for "m->nb_available" in two cases:
> 
> (a)
> 
>     avail_sectors_patched   <
>     requested_sectors       <=
>     avail_sectors_unpatched
> 
> (ie. the decrease in "avail_sectors" flipped MIN to prefer it over
> requested_sectors, lowering m->nb_available)
> 
> (b)
> 
>     avail_sectors_patched   <
>     avail_sectors_unpatched <=
>     requested_sectors
> 
> (ie. MIN had already preferred "avail_sectors", and now it went lower).
> 
> 
> It seems that
> 
>   1 << (s->cluster_bits - BDRV_SECTOR_BITS) == s->cluster_sectors    [1]
> 
> Therefore both "avail_sectors_patched" and "avail_sectors_unpatched" are
> an integral multiple of "s->cluster_sectors". Whenever MIN() selects
> either of them, the condition in qcow2_alloc_cluster_link_l2() will
> evaluate to false.
> 
> 
> In case (b) there seems to be no difference between patched & unpatched
> in this regard: MIN() always selects an integral multiple, and so
> copy_sectors() is never invoked.
> 
> In case (a), the patched version skips copy_sectors(), while the
> unpatched version invokes it.
> 
> I'm not sure if case (a) is possible at all -- let's expand it:
> 
>     avail_sectors_patched   <
>     requested_sectors       <=
>     avail_sectors_unpatched
> 
> Substitute the assigned values:
> 
>                   nb_clusters << (s->cluster_bits - BDRV_SECTOR_BITS) <
>     n_end - keep_clusters * s->cluster_sectors                        <=
> (keep_clusters + nb_clusters) << (s->cluster_bits - BDRV_SECTOR_BITS)
> 
> Using [1], replace the shifts with multiplication:
> 
>                     nb_clusters * s->cluster_sectors <
>           n_end - keep_clusters * s->cluster_sectors <=
>   (keep_clusters + nb_clusters) * s->cluster_sectors
> 
> Distribute the addition:
> 
>                     nb_clusters * s->cluster_sectors   <
>           n_end - keep_clusters * s->cluster_sectors   <=
>                   keep_clusters * s->cluster_sectors +
>                     nb_clusters * s->cluster_sectors
> 
> N :=   nb_clusters * s->cluster_sectors [sectors]
> K := keep_clusters * s->cluster_sectors [sectors]
> 
>                   N  <  n_end - K  <=    K + N
> 
> Add K
> 
>               K + N  <  n_end      <=  2*K + N                       [2]

Thank you so much, Laszlo, you are great!

I more or less completely failed to see case (a). With your excellent
analysis I can indeed trigger a copy on write when it shouldn't happen.
The series of requests to get into this state is somewhat tricky as it
requires a specific physical ordering of clusters in the image file:

Host cluster h:     Guest cluster g is mapped to here
Host cluster h + 1: Free, can be allocated for not yet mapped g + 1
Host cluster h + 2: Guest cluster g + 2 is mapped to here

I can get this with this command on a fresh qcow2 image (64k cluster size):

./qemu-io
  -c 'write -P 0x66 0 64k'
  -c 'write 64k 64k'
  -c 'write -P 0x88 128k 64k'
  -c 'discard 64k 64k'
  -c 'write -P 0x77 0 160k'
  /tmp/test.qcow2

Now I get an additional COW for the area from 96k to 128k. However, this
alone doesn't corrupt an image - a copy on write by itself is harmless
because it only rewrites the data that is already there.
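
(Worked through with the formulas discussed above -- an editorial
reconstruction, so the keep_clusters/nb_clusters values are inferred from
the command sequence rather than taken from a trace: the 160k write spans
sectors 0..320 with cluster_sectors = 128; guest cluster 0 is kept
(keep_clusters = 1) and only guest cluster 1 is newly allocated
(nb_clusters = 1). Then requested_sectors = 320 - 128 = 192, avail_sectors
is (1 + 1) * 128 = 256 without the patch but 1 * 128 = 128 with it, so
MIN() yields nb_available = 192, i.e. 96k, instead of 128, i.e. 64k. 192
does not end on a cluster boundary, which is the additional COW from 96k up
to the 128k boundary seen here; 128 does, so the fixed code skips it.)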

The two things I'm going to check next are:

a) What happens with concurrent requests now, are requests for the
   same area still correctly serialised?

b) Can I modify the parameters so that copy_sectors() is working
   with values it wasn't designed for so that the data is copied to
   a wrong place or something like that. m->nb_available is used as
   an argument to it, so it certainly seems realistic.

> I have no clue about qcow2, but *each one* of K and N seems to be a
> function of *both* n_end and the current "qcow2 disk state" (ie. sectors
> being allocated vs. just referenced). IOW, I suspect that case (a) is
> possible for some (disk state, n_end) pairs. And case (a) -- if indeed
> possible -- seems to contradict the commit message:
> 
>   allocation. A COW occurs only if the request wasn't cluster aligned,
>                                        ^^^^^^^    ^^^^^^^^^^^^^^^^^^^
>   which in turn would imply that requested_sectors was less than
>                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>   avail_sectors (both in the original and in the fixed version). In this
>   ^^^^^^^^^^^^^                           ^^^^^^^^^^^^^^^^^^^^
> 
> In case (a) -- expanded as [2] --, "requested_sectors" is greater than
> "avail_sectors_patched", and less than or equal to
> "avail_sectors_unpatched".

Yes, I agree. The commit message is definitely wrong and I understand
now why, so that was a great step forward. There's still a missing piece
about how it creates a user-visible bug, but hopefully it won't be that
hard to find.

Kevin
Kevin Wolf Jan. 23, 2013, 11:43 a.m. UTC | #11
On 22.01.2013 15:14, Kevin Wolf wrote:
> The series of requests to get into this state is somewhat tricky as it
> requires a specific physical ordering of clusters in the image file:
> 
> Host cluster h:     Guest cluster g is mapped to here
> Host cluster h + 1: Free, can be allocated for not yet mapped g + 1
> Host cluster h + 2: Guest cluster g + 2 is mapped to here
> 
> I can get this with this command on a fresh qcow2 image (64k cluster size):
> 
> ./qemu-io
>   -c 'write -P 0x66 0 64k'
>   -c 'write 64k 64k'
>   -c 'write -P 0x88 128k 64k'
>   -c 'discard 64k 64k'
>   -c 'write -P 0x77 0 160k'
>   /tmp/test.qcow2
> 
> Now I get an additional COW for the area from 96k to 128k. However, this
> alone doesn't corrupt an image - a copy on write by itself is harmless
> because it only rewrites the data that is already there.
> 
> The two things I'm going to check next are:
> 
> a) What happens with concurrent requests now, are requests for the
>    same area still correctly serialised?
> 
> b) Can I modify the parameters so that copy_sectors() is working
>    with values it wasn't designed for so that the data is copied to
>    a wrong place or something like that. m->nb_available is used as
>    an argument to it, so it certainly seems realistic.

I can reliably reproduce a bug with a) at least. It requires some
backports to qemu-io in order to get more control of the timing of AIO
requests, but then it works like this:

write -P 0x66 0 320k
write 320k 128k
write -P 0x88 448k 128k
discard 320k 128k
aio_flush

break cow_write A
aio_write -P 0x77 0 480k
wait_break A
write -P 0x99 480k 32k
resume A
aio_flush

read -P 0x77 0 480k
read -P 0x99 480k 32k
read -P 0x88 512k 64k

What happens is that the write of 0x99 to 480k doesn't detect an overlap
with the running AIO request (because obviously it doesn't overlap), but
the COW is actually taking place in the wrong cluster and therefore
unprotected. Stop the AIO request between the COW read and the COW write
to write 0x99 to that area, and the wrong COW will overwrite it, i.e.
corrupt the image.

Basing a real test case on this is not trivial, though, because I have
to stop on the blkdebug event cow_write - which obviously isn't even
emitted in the fixed version.

I still have to look into option b)

Kevin
Kevin Wolf Jan. 23, 2013, 1:54 p.m. UTC | #12
On 23.01.2013 12:43, Kevin Wolf wrote:
> On 22.01.2013 15:14, Kevin Wolf wrote:
>> The series of requests to get into this state is somewhat tricky as it
>> requires a specific physical ordering of clusters in the image file:
>>
>> Host cluster h:     Guest cluster g is mapped to here
>> Host cluster h + 1: Free, can be allocated for not yet mapped g + 1
>> Host cluster h + 2: Guest cluster g + 2 is mapped to here
>>
>> I can get this with this command on a fresh qcow2 image (64k cluster size):
>>
>> ./qemu-io
>>   -c 'write -P 0x66 0 64k'
>>   -c 'write 64k 64k'
>>   -c 'write -P 0x88 128k 64k'
>>   -c 'discard 64k 64k'
>>   -c 'write -P 0x77 0 160k'
>>   /tmp/test.qcow2
>>
>> Now I get an additional COW for the area from 96k to 128k. However, this
>> alone doesn't corrupt an image - a copy on write by itself is harmless
>> because it only rewrites the data that is already there.
>>
>> The two things I'm going to check next are:
>>
>> a) What happens with concurrent requests now, are requests for the
>>    same area still correctly serialised?
>>
>> b) Can I modify the parameters so that copy_sectors() is working
>>    with values it wasn't designed for so that the data is copied to
>>    a wrong place or something like that. m->nb_available is used as
>>    an argument to it, so it certainly seems realistic.
> 
> I can reliably reproduce a bug with a) at least. It requires some
> backports to qemu-io in order to get more control of the timing of AIO
> requests, but then it works like this:
> 
> write -P 0x66 0 320k
> write 320k 128k
> write -P 0x88 448k 128k
> discard 320k 128k
> aio_flush
> 
> break cow_write A
> aio_write -P 0x77 0 480k
> wait_break A
> write -P 0x99 480k 32k
> resume A
> aio_flush
> 
> read -P 0x77 0 480k
> read -P 0x99 480k 32k
> read -P 0x88 512k 64k
> 
> What happens is that the write of 0x99 to 480k doesn't detect an overlap
> with the running AIO request (because obviously it doesn't overlap), but
> the COW is actually taking place in the wrong cluster and therefore
> unprotected. Stop the AIO request between the COW read and the COW write
> to write 0x99 to that area, and the wrong COW will overwrite it, i.e.
> corrupt the image.
> 
> Basing a real test case on this is not trivial, though, because I have
> to stop on the blkdebug event cow_write - which obviously isn't even
> emitted in the fixed version.
> 
> I still have to look into option b)

This is the reproducer using b):

write -P 0x66 0 320k
write 320k 128k
write -P 0x55 1M 128k
write -P 0x88 448k 128k
discard 320k 128k
aio_flush

write -P 0x77 0 480k
aio_flush

read -P 0x77 0 480k
read -P 0x88 480k 96k
read -P 0x55 1M 128k

Not sure if a stable branch for 1.1 is still maintained, but I'm copying
qemu-stable just in case. Commit b7ab0fea actually fixes a qcow2 data
corruption issue (even if only under rare circumstances) and should
definitely be cherry-picked for any new 1.1.x release.

qemu 1.0 and older don't have the bug.

Kevin
Michael Tokarev Jan. 23, 2013, 2:08 p.m. UTC | #13
23.01.2013 17:54, Kevin Wolf wrote:
[]
> Not sure if a stable branch for 1.1 is still maintained, but I'm copying
> qemu-stable just in case. Commit b7ab0fea actually fixes a qcow2 data
> corruption issue (even if only under rare circumstances) and should
> definitely be cherry-picked for any new 1.1.x release.

It's been picked up by me once Philipp found it after bisection.
I understand that at that time it wasn't known what actually
happened, but since Philipp tested it in his tree, it is a
bugfix, and it was applied to master (actually to a released
version), I thought it was okay.  It is also included in current
Debian packages.

Speaking of 1.1, I really want to make a 1.1.3 release, but I'm
just unsure how to do that... ;)

Thank you Kevin!

/mjt

Patch

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 98fba71..d7e0e19 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -947,8 +947,16 @@  again:
 
         /* save info needed for meta data update */
         if (nb_clusters > 0) {
+            /*
+             * requested_sectors: Number of sectors from the start of the first
+             * newly allocated cluster to the end of the (possibly shortened
+             * before) write request.
+             *
+             * avail_sectors: Number of sectors from the start of the first
+             * newly allocated to the end of the last newly allocated cluster.
+             */
             int requested_sectors = n_end - keep_clusters * s->cluster_sectors;
-            int avail_sectors = (keep_clusters + nb_clusters)
+            int avail_sectors = nb_clusters
                                 << (s->cluster_bits - BDRV_SECTOR_BITS);
 
             *m = (QCowL2Meta) {