Message ID: 201212121425.41850.hahn@univention.de
State: New
Hi Philipp,

On 12.12.2012 14:25, Philipp Hahn wrote:
> Hello Kevin, hello Michael, hello *,
>
> we noticed a data corruption bug in qemu-1.1.2, which will be shipped by
> Debian and our own Debian-based distribution.
> The corruption mostly manifests while installing large Debian package files
> and seems to be related to memory pressure: As long as the file is still in
> the page cache, everything looks fine, but when the file is re-read from the
> virtual hard disk using a qcow2 file backed by another qcow2 file, the file
> is corrupted: dpkg complains that the .tar.gz file inside the Debian archive
> file is corrupted and the md5sum no longer matches.
>
> I tracked this down using "git bisect" to your patch attached below, which
> fixed this bug, so everything is fine with qemu-kvm-1.2.0.
> From my reading this seems to explain our problems, since during my own
> testing during development I never used backing chains, and the problem only
> showed up when my colleagues started using qemu-kvm-1.1.2 with their VMs
> using backing chains.
>
> @Kevin: Do you think that's a valid explanation and that your patch should
> fix that problem?
> I'd like to get your expertise before filing a bug with Debian and asking
> Michael to include that patch with his next stable update for 1.1.

As you can see in the commit message of that patch, I was convinced that no
bug existed in practice and that this was only dangerous with respect to
future changes. Therefore my first question is whether you're using an
unmodified upstream qemu or whether some backported patches are applied to
it? If it's indeed unmodified, we should probably review the code once again
to understand why it makes a difference.

In any case, this is the cluster allocation code. It's probably not related
to rereading things from disk, but rather to the writeout of the page cache.

Kevin
Hello Kevin,

On Wednesday, 12 December 2012 14:41:49, Kevin Wolf wrote:
> As you can see in the commit message of that patch I was convinced that
> no bug did exist in practice and this was only dangerous with respect to
> future changes. Therefore my first question is if you're using an
> unmodified upstream qemu or if some backported patches are applied to
> it? If it's indeed unmodified, we should probably review the code once
> again to understand why it makes a difference.

These were all unmodified versions directly from git, between
"qemu-kvm-1.1.0" and "qemu-kvm-1.2.0":

"git checkout b7ab0fea37c15ca9e249c42c46f5c48fd1a0943c" works,
"git checkout b7ab0fea37c15ca9e249c42c46f5c48fd1a0943c~1" is broken.
"git checkout qemu-kvm-1.1.2" is broken,
"git checkout qemu-kvm-1.1.2 ; git cherry-pick
b7ab0fea37c15ca9e249c42c46f5c48fd1a0943c" works.

> In any case, this is the cluster allocation code. It's probably not
> related to rereading things from disk, but rather to the writeout of the
> page cache.

Yes, the problem is probably write related. But as the write "doesn't
explode with some spectacular error", I only notice the error on the
following read, by comparing md5 sums.
I just re-checked it: after a reboot the md5sums are still invalid, so I
guess the data is corrupted on writeout.

Sincerely
Philipp
On 12.12.2012 15:09, Philipp Hahn wrote:
> Hello Kevin,
>
> On Wednesday, 12 December 2012 14:41:49, Kevin Wolf wrote:
>> As you can see in the commit message of that patch I was convinced that
>> no bug did exist in practice and this was only dangerous with respect to
>> future changes. Therefore my first question is if you're using an
>> unmodified upstream qemu or if some backported patches are applied to
>> it? If it's indeed unmodified, we should probably review the code once
>> again to understand why it makes a difference.
>
> These were all unmodified versions directly from git, between
> "qemu-kvm-1.1.0" and "qemu-kvm-1.2.0":
>
> "git checkout b7ab0fea37c15ca9e249c42c46f5c48fd1a0943c" works,
> "git checkout b7ab0fea37c15ca9e249c42c46f5c48fd1a0943c~1" is broken.
> "git checkout qemu-kvm-1.1.2" is broken,
> "git checkout qemu-kvm-1.1.2 ; git cherry-pick
> b7ab0fea37c15ca9e249c42c46f5c48fd1a0943c" works.

Ok, thanks for clarifying. Then I must have missed some interesting case
while doing the patch.

Ideally we would find a sequence of qemu-io commands to reliably reproduce
this. First thing worth trying would be running the current qemu-iotests
suite on the old versions. If we don't find it this way, I guess we need to
catch it with code review. I'm not sure if I can get to it this week, and
starting next week I'll be on vacation, so any help with finding a
reproducer would be appreciated.

>> In any case, this is the cluster allocation code. It's probably not
>> related to rereading things from disk, but rather to the writeout of the
>> page cache.
>
> Yes, the problem is probably write related. But as the write "doesn't
> explode with some spectacular error", I only notice the error on the
> following read, by comparing md5 sums.
> I just re-checked it: after a reboot the md5sums are still invalid, so I
> guess the data is corrupted on writeout.

Yes, it's really the only thing that makes sense in connection with this
patch.

Kevin
Hello Kevin,

On Wednesday, 12 December 2012 17:54:58, Kevin Wolf wrote:
> On 12.12.2012 15:09, Philipp Hahn wrote:
>> On Wednesday, 12 December 2012 14:41:49, Kevin Wolf wrote:
>>> As you can see in the commit message of that patch I was convinced that
>>> no bug did exist in practice and this was only dangerous with respect
>>> to future changes. Therefore my first question is if you're using an
>>> unmodified upstream qemu or if some backported patches are applied to
>>> it? If it's indeed unmodified, we should probably review the code once
>>> again to understand why it makes a difference.
>>
>> These were all unmodified versions directly from git, between
>> "qemu-kvm-1.1.0" and "qemu-kvm-1.2.0":
>>
>> "git checkout b7ab0fea37c15ca9e249c42c46f5c48fd1a0943c" works,
>> "git checkout b7ab0fea37c15ca9e249c42c46f5c48fd1a0943c~1" is broken.
>> "git checkout qemu-kvm-1.1.2" is broken,
>> "git checkout qemu-kvm-1.1.2 ; git cherry-pick
>> b7ab0fea37c15ca9e249c42c46f5c48fd1a0943c" works.
>
> Ok, thanks for clarifying. Then I must have missed some interesting case
> while doing the patch.

I just re-ran my "git bisect run ~/bisect.sh" case, but it again arrived at
that patch.
I just queued another run for tonight to make sure the test is reliable:

# bad: [4c3e02beed9878a5f760eeceb6cd42c475cf0127] Merge tag 'v1.2.0'
# good: [bd11ac4feb54d32653e5d4eb7994bed18be0609c] fdc: fix implied seek while there is no media in drive
git bisect start 'qemu-kvm-1.2.0' 'qemu-kvm-1.1.0'
# good: [15ecf28f39e2b6fba359ed094770c8fa4ad8dc60] Merge tag 'v1.1.0' into next
git bisect good 15ecf28f39e2b6fba359ed094770c8fa4ad8dc60
# bad: [2fa5008ffd49e51540756adccf966a2fcde6e6c1] hd-geometry: Factor out guess_chs_for_size()
git bisect bad 2fa5008ffd49e51540756adccf966a2fcde6e6c1
# bad: [306a571a2d75e32cd2eae5486c2714b7b7792a63] hw/arm_gic: Add qdev property for GIC revision
git bisect bad 306a571a2d75e32cd2eae5486c2714b7b7792a63
# good: [5c6f4f178ba542358c012ca033985f73e61b8ae5] z2: Rename PXA2xxState variable
git bisect good 5c6f4f178ba542358c012ca033985f73e61b8ae5
# good: [833e40858cb9501c5e76b3aa345e4bb5be34385a] qcow2: remove a line of unnecessary code
git bisect good 833e40858cb9501c5e76b3aa345e4bb5be34385a
# bad: [0b0cb9d310edfe2b2d108f18be4f013a1e552cfd] Merge remote-tracking branch 'kwolf/for-anthony' into staging
git bisect bad 0b0cb9d310edfe2b2d108f18be4f013a1e552cfd
# bad: [0446919dcab51e7468f346c0a009a88632c5c5e0] qemu-iotests: COW with many AIO requests on the same cluster
git bisect bad 0446919dcab51e7468f346c0a009a88632c5c5e0
# good: [b75a02829dde98723dfe16fa098338cb267b28b9] Prevent disk data loss when closing qemu
git bisect good b75a02829dde98723dfe16fa098338cb267b28b9
# good: [c4a248a138028bee63a099410c79b428db0c4779] block: copy enable_write_cache in bdrv_append
git bisect good c4a248a138028bee63a099410c79b428db0c4779
# good: [6af4e9ead4ec9491259c9861b1b35f9abee24a66] qcow2: always operate caches in writeback mode
git bisect good 6af4e9ead4ec9491259c9861b1b35f9abee24a66
# bad: [b7ab0fea37c15ca9e249c42c46f5c48fd1a0943c] qcow2: Fix avail_sectors in cluster allocation code
git bisect bad b7ab0fea37c15ca9e249c42c46f5c48fd1a0943c
# good: [cdba7fee1daa8865bac2d69da288171fe7c21aae] qcow2: Simplify calculation for COW area at the end
git bisect good cdba7fee1daa8865bac2d69da288171fe7c21aae

> Ideally we would find a sequence of qemu-io commands to reliably
> reproduce this. First thing worth trying would be running the current
> qemu-iotests suite on the old versions. If we don't find it this way, I
> guess we need to catch it with code review. I'm not sure if I can get to
> it this week, and starting next week I'll be on vacation, so any help
> with finding a reproducer would be appreciated.

I'll have a look at it tomorrow.

Thank you for your fast replies, and have a nice vacation in case we don't
hear from each other again this week.

BYtE
Philipp
Hello Kevin,

On Wednesday, 12 December 2012 18:29:48, Philipp Hahn wrote:
> I just re-ran my "git bisect run ~/bisect.sh" case, but it again arrived
> at that patch. I just queued another run for tonight to make sure the
> test is reliable:

The run from last night again arrived at the referenced patch.

>> Ideally we would find a sequence of qemu-io commands to reliably
>> reproduce this. First thing worth trying would be running the current
>> qemu-iotests suite on the old versions. If we don't find it this way, I
>> guess we need to catch it with code review. I'm not sure if I can get to
>> it this week, and starting next week I'll be on vacation, so any help
>> with finding a reproducer would be appreciated.

I took a closer look at what gets corrupted; I've attached my notes. Please
notice that the partitions are not aligned properly.

If you would like to look at the full qcow2_alloc_clusters_offset trace, I
can provide you with a link to the trace file.

BYtE
Philipp
Hello Kevin, hello Michael,

On Wednesday, 12 December 2012 17:54:58, Kevin Wolf wrote:
> On 12.12.2012 15:09, Philipp Hahn wrote:
>> On Wednesday, 12 December 2012 14:41:49, Kevin Wolf wrote:
>>> As you can see in the commit message of that patch I was convinced that
>>> no bug did exist in practice and this was only dangerous with respect
>>> to future changes. Therefore my first question is if you're using an
>>> unmodified upstream qemu or if some backported patches are applied to
>>> it? If it's indeed unmodified, we should probably review the code once
>>> again to understand why it makes a difference.
>>
>> These were all unmodified versions directly from git, between
>> "qemu-kvm-1.1.0" and "qemu-kvm-1.2.0":
>>
>> "git checkout b7ab0fea37c15ca9e249c42c46f5c48fd1a0943c" works,
>> "git checkout b7ab0fea37c15ca9e249c42c46f5c48fd1a0943c~1" is broken.
>> "git checkout qemu-kvm-1.1.2" is broken,
>> "git checkout qemu-kvm-1.1.2 ; git cherry-pick
>> b7ab0fea37c15ca9e249c42c46f5c48fd1a0943c" works.
>
> Ok, thanks for clarifying. Then I must have missed some interesting case
> while doing the patch.

I think I found your missing link: after filling in "QCowL2Meta *m", that
request is queued:

    QLIST_INSERT_HEAD(&s->cluster_allocs, m, next_in_flight);

to prevent allocating the same cluster twice for overlapping requests,
which is checked in do_alloc_cluster_offset().

I guess that since the sector count was wrong, the overlap detection didn't
work and the two concurrent write requests to the same cluster overwrote
each other.

> Ideally we would find a sequence of qemu-io commands to reliably
> reproduce this.

You're the block guru, so I leave that to you (or anybody else who knows
more about the workings of qemu-io). ;-)

Sincerely
Philipp
On 18.12.2012 13:46, Philipp Hahn wrote:
> I think I found your missing link: after filling in "QCowL2Meta *m", that
> request is queued:
>
>     QLIST_INSERT_HEAD(&s->cluster_allocs, m, next_in_flight);
>
> to prevent allocating the same cluster twice for overlapping requests,
> which is checked in do_alloc_cluster_offset().
>
> I guess that since the sector count was wrong, the overlap detection
> didn't work and the two concurrent write requests to the same cluster
> overwrote each other.

Meh. And I already closed the Debian bug report... :)

But thank you Philipp for your excellent work on the matter!

/mjt
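Philipp's hypothesis about the in-flight overlap check can be illustrated
with a small standalone model. The struct and function names below are
illustrative, not qemu's: the real check lives in do_alloc_cluster_offset()
and, as Kevin points out in his reply, does not actually consult
m->nb_available. What the sketch shows is only the general idea of why an
under-reported in-flight range would defeat range-based serialisation:

```c
#include <assert.h>

/* Toy model of in-flight allocation tracking: each allocating write
 * queues its guest-cluster range, and a new allocation is supposed to be
 * serialised against any queued range it intersects. All names here are
 * illustrative, not qemu's. */
struct in_flight {
    unsigned long long start_cluster; /* first guest cluster of the request */
    unsigned long long nb_clusters;   /* clusters the request claims to cover */
};

/* Returns 1 if the half-open cluster range [start, start + nb) intersects
 * the queued request's range. */
static int ranges_overlap(const struct in_flight *old,
                          unsigned long long start, unsigned long long nb)
{
    return start < old->start_cluster + old->nb_clusters
        && old->start_cluster < start + nb;
}
```

If the queued request records fewer clusters than it really touches, a
second request for the tail clusters passes the check and the two writes
race, which is the failure mode Philipp is guessing at.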
On 18.12.2012 10:46, Philipp Hahn wrote:
> Hello Kevin, hello Michael,
>
> On Wednesday, 12 December 2012 17:54:58, Kevin Wolf wrote:
>> On 12.12.2012 15:09, Philipp Hahn wrote:
>>> On Wednesday, 12 December 2012 14:41:49, Kevin Wolf wrote:
>>>> As you can see in the commit message of that patch I was convinced
>>>> that no bug did exist in practice and this was only dangerous with
>>>> respect to future changes. Therefore my first question is if you're
>>>> using an unmodified upstream qemu or if some backported patches are
>>>> applied to it? If it's indeed unmodified, we should probably review
>>>> the code once again to understand why it makes a difference.
>>>
>>> These were all unmodified versions directly from git, between
>>> "qemu-kvm-1.1.0" and "qemu-kvm-1.2.0":
>>>
>>> "git checkout b7ab0fea37c15ca9e249c42c46f5c48fd1a0943c" works,
>>> "git checkout b7ab0fea37c15ca9e249c42c46f5c48fd1a0943c~1" is broken.
>>> "git checkout qemu-kvm-1.1.2" is broken,
>>> "git checkout qemu-kvm-1.1.2 ; git cherry-pick
>>> b7ab0fea37c15ca9e249c42c46f5c48fd1a0943c" works.
>>
>> Ok, thanks for clarifying. Then I must have missed some interesting case
>> while doing the patch.
>
> I think I found your missing link: after filling in "QCowL2Meta *m", that
> request is queued:
>
>     QLIST_INSERT_HEAD(&s->cluster_allocs, m, next_in_flight);
>
> to prevent allocating the same cluster twice for overlapping requests,
> which is checked in do_alloc_cluster_offset().
>
> I guess that since the sector count was wrong, the overlap detection
> didn't work and the two concurrent write requests to the same cluster
> overwrote each other.

I'm still not seeing it. The overlap detection code doesn't use
m->nb_available at all, so why would it make a difference?

I guess I need some people who can decently review code - Laszlo, Paolo,
any idea why upstream commit b7ab0fea does change the behaviour, contrary
to what I claimed in the commit message?
>> Ideally we would find a sequence of qemu-io commands to reliably
>> reproduce this.
>
> You're the block guru, so I leave that to you (or anybody else who knows
> more about the workings of qemu-io). ;-)

Yeah, as soon as I know what's happening, calling qemu-io with the right
options shouldn't be a problem.

Kevin
On 01/22/13 11:25, Kevin Wolf wrote:
> On 18.12.2012 10:46, Philipp Hahn wrote:
>> Hello Kevin, hello Michael,
>>
>> On Wednesday, 12 December 2012 17:54:58, Kevin Wolf wrote:
>>> On 12.12.2012 15:09, Philipp Hahn wrote:
>>>> On Wednesday, 12 December 2012 14:41:49, Kevin Wolf wrote:
>>>>> As you can see in the commit message of that patch I was convinced
>>>>> that no bug did exist in practice and this was only dangerous with
>>>>> respect to future changes. Therefore my first question is if you're
>>>>> using an unmodified upstream qemu or if some backported patches are
>>>>> applied to it? If it's indeed unmodified, we should probably review
>>>>> the code once again to understand why it makes a difference.
>>>>
>>>> These were all unmodified versions directly from git, between
>>>> "qemu-kvm-1.1.0" and "qemu-kvm-1.2.0":
>>>>
>>>> "git checkout b7ab0fea37c15ca9e249c42c46f5c48fd1a0943c" works,
>>>> "git checkout b7ab0fea37c15ca9e249c42c46f5c48fd1a0943c~1" is broken.
>>>> "git checkout qemu-kvm-1.1.2" is broken,
>>>> "git checkout qemu-kvm-1.1.2 ; git cherry-pick
>>>> b7ab0fea37c15ca9e249c42c46f5c48fd1a0943c" works.
>>>
>>> Ok, thanks for clarifying. Then I must have missed some interesting
>>> case while doing the patch.
>>
>> I think I found your missing link: after filling in "QCowL2Meta *m",
>> that request is queued:
>>
>>     QLIST_INSERT_HEAD(&s->cluster_allocs, m, next_in_flight);
>>
>> to prevent allocating the same cluster twice for overlapping requests,
>> which is checked in do_alloc_cluster_offset().
>>
>> I guess that since the sector count was wrong, the overlap detection
>> didn't work and the two concurrent write requests to the same cluster
>> overwrote each other.
>
> I'm still not seeing it. The overlap detection code doesn't use
> m->nb_available at all, so why would it make a difference?
> I guess I need some people who can decently review code - Laszlo, Paolo,
> any idea why upstream commit b7ab0fea does change the behaviour,
> contrary to what I claimed in the commit message?

As I understand it, the question is not whether b7ab0fea works or not (it
does); the question is whether *without* b7ab0fea there is a real problem
(user-visible bug).

Commit b7ab0fea decreased "avail_sectors" by

    keep_clusters << (s->cluster_bits - BDRV_SECTOR_BITS)

which should make a difference for "m->nb_available" in two cases:

(a)
    avail_sectors_patched <
    requested_sectors <=
    avail_sectors_unpatched

(ie. the decrease in "avail_sectors" flipped MIN() to prefer it over
requested_sectors, lowering m->nb_available)

(b)
    avail_sectors_patched <
    avail_sectors_unpatched <=
    requested_sectors

(ie. MIN() had already preferred "avail_sectors", and now it went lower).

It seems that

    1 << (s->cluster_bits - BDRV_SECTOR_BITS) == s->cluster_sectors    [1]

Therefore both "avail_sectors_patched" and "avail_sectors_unpatched" are an
integral multiple of "s->cluster_sectors". Whenever MIN() selects either of
them, the condition in qcow2_alloc_cluster_link_l2() will evaluate to
false.

In case (b) there seems to be no difference between patched & unpatched in
this regard: MIN() always selects an integral multiple, and so
copy_sectors() is never invoked.

In case (a), the patched version skips copy_sectors(), while the unpatched
version invokes it.
I'm not sure if case (a) is possible at all -- let's expand it:

    avail_sectors_patched <
    requested_sectors <=
    avail_sectors_unpatched

Substitute the assigned values:

    nb_clusters << (s->cluster_bits - BDRV_SECTOR_BITS) <
    n_end - keep_clusters * s->cluster_sectors <=
    (keep_clusters + nb_clusters) << (s->cluster_bits - BDRV_SECTOR_BITS)

Using [1], replace the shifts with multiplication:

    nb_clusters * s->cluster_sectors <
    n_end - keep_clusters * s->cluster_sectors <=
    (keep_clusters + nb_clusters) * s->cluster_sectors

Distribute the multiplication over the addition:

    nb_clusters * s->cluster_sectors <
    n_end - keep_clusters * s->cluster_sectors <=
    keep_clusters * s->cluster_sectors + nb_clusters * s->cluster_sectors

Let

    N := nb_clusters * s->cluster_sectors    [sectors]
    K := keep_clusters * s->cluster_sectors  [sectors]

then

    N < n_end - K <= K + N

Add K to all three sides:

    K + N < n_end <= 2*K + N    [2]

I have no clue about qcow2, but *each one* of K and N seems to be a
function of *both* n_end and the current "qcow2 disk state" (ie. sectors
being allocated vs. just referenced). IOW, I suspect that case (a) is
possible for some (disk state, n_end) pairs.

And case (a) -- if indeed possible -- seems to contradict the commit
message:

    allocation. A COW occurs only if the request wasn't cluster aligned,
                      ^^^^^^^         ^^^^^^^^^^^^^^^^^^^
    which in turn would imply that requested_sectors was less than
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    avail_sectors (both in the original and in the fixed version). In this
    ^^^^^^^^^^^^^                          ^^^^^^^^^^^^^^^^^^^^

In case (a) -- expanded as [2] --, "requested_sectors" is greater than
"avail_sectors_patched", and less than or equal to
"avail_sectors_unpatched".

Laszlo
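Laszlo's case split can also be checked mechanically. The following
standalone C sketch is my own packaging (the function names and the example
values in the assertions are hypothetical); only the arithmetic is taken
from the mails and from commit b7ab0fea:

```c
#include <assert.h>

/* avail_sectors as computed before commit b7ab0fea:
 * (keep_clusters + nb_clusters) << (cluster_bits - BDRV_SECTOR_BITS),
 * rewritten as a multiplication per identity [1] above. */
static int avail_sectors_unpatched(int keep_clusters, int nb_clusters,
                                   int cluster_sectors)
{
    return (keep_clusters + nb_clusters) * cluster_sectors;
}

/* avail_sectors as computed after commit b7ab0fea. */
static int avail_sectors_patched(int nb_clusters, int cluster_sectors)
{
    return nb_clusters * cluster_sectors;
}

/* Case (a): the patch flips MIN() from avail_sectors over to
 * requested_sectors, i.e.
 *     avail_patched < requested_sectors <= avail_unpatched
 * which, expanded, is exactly condition [2]: K + N < n_end <= 2*K + N. */
static int is_case_a(int keep_clusters, int nb_clusters,
                     int cluster_sectors, int n_end)
{
    int requested_sectors = n_end - keep_clusters * cluster_sectors;

    return avail_sectors_patched(nb_clusters, cluster_sectors)
               < requested_sectors
        && requested_sectors
               <= avail_sectors_unpatched(keep_clusters, nb_clusters,
                                          cluster_sectors);
}
```

With 64k clusters (128 sectors per cluster), one kept cluster, one newly
allocated cluster, and a request ending at sector 320 (160k), the
inequality holds, so case (a) is not vacuous.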
On 22.01.2013 14:17, Laszlo Ersek wrote:
> As I understand it, the question is not whether b7ab0fea works or not
> (it does); the question is whether *without* b7ab0fea there is a real
> problem (user-visible bug).

Correct. Or actually not whether, but why.

> Commit b7ab0fea decreased "avail_sectors" by
>
>     keep_clusters << (s->cluster_bits - BDRV_SECTOR_BITS)
>
> which should make a difference for "m->nb_available" in two cases:
>
> (a)
>     avail_sectors_patched <
>     requested_sectors <=
>     avail_sectors_unpatched
>
> (ie. the decrease in "avail_sectors" flipped MIN() to prefer it over
> requested_sectors, lowering m->nb_available)
>
> (b)
>     avail_sectors_patched <
>     avail_sectors_unpatched <=
>     requested_sectors
>
> (ie. MIN() had already preferred "avail_sectors", and now it went lower).
>
> It seems that
>
>     1 << (s->cluster_bits - BDRV_SECTOR_BITS) == s->cluster_sectors    [1]
>
> Therefore both "avail_sectors_patched" and "avail_sectors_unpatched" are
> an integral multiple of "s->cluster_sectors". Whenever MIN() selects
> either of them, the condition in qcow2_alloc_cluster_link_l2() will
> evaluate to false.
>
> In case (b) there seems to be no difference between patched & unpatched
> in this regard: MIN() always selects an integral multiple, and so
> copy_sectors() is never invoked.
>
> In case (a), the patched version skips copy_sectors(), while the
> unpatched version invokes it.
> I'm not sure if case (a) is possible at all -- let's expand it:
>
>     avail_sectors_patched <
>     requested_sectors <=
>     avail_sectors_unpatched
>
> Substitute the assigned values:
>
>     nb_clusters << (s->cluster_bits - BDRV_SECTOR_BITS) <
>     n_end - keep_clusters * s->cluster_sectors <=
>     (keep_clusters + nb_clusters) << (s->cluster_bits - BDRV_SECTOR_BITS)
>
> Using [1], replace the shifts with multiplication:
>
>     nb_clusters * s->cluster_sectors <
>     n_end - keep_clusters * s->cluster_sectors <=
>     (keep_clusters + nb_clusters) * s->cluster_sectors
>
> Distribute the multiplication over the addition:
>
>     nb_clusters * s->cluster_sectors <
>     n_end - keep_clusters * s->cluster_sectors <=
>     keep_clusters * s->cluster_sectors + nb_clusters * s->cluster_sectors
>
> Let
>
>     N := nb_clusters * s->cluster_sectors    [sectors]
>     K := keep_clusters * s->cluster_sectors  [sectors]
>
> then
>
>     N < n_end - K <= K + N
>
> Add K to all three sides:
>
>     K + N < n_end <= 2*K + N    [2]

Thank you so much, Laszlo, you are great! I more or less completely failed
to see case (a).

With your excellent analysis I can indeed trigger a copy on write when it
shouldn't happen. The series of requests to get into this state is somewhat
tricky, as it requires a specific physical ordering of clusters in the
image file:

    Host cluster h:     Guest cluster g is mapped to here
    Host cluster h + 1: Free, can be allocated for the not yet mapped g + 1
    Host cluster h + 2: Guest cluster g + 2 is mapped to here

I can get this with this command on a fresh qcow2 image (64k cluster size):

    ./qemu-io -c 'write -P 0x66 0 64k' \
              -c 'write 64k 64k' \
              -c 'write -P 0x88 128k 64k' \
              -c 'discard 64k 64k' \
              -c 'write -P 0x77 0 160k' \
              /tmp/test.qcow2

Now I get an additional COW for the area from 96k to 128k. However, this
alone doesn't corrupt an image - a copy on write by itself is harmless
because it only rewrites the data that is already there.

The two things I'm going to check next are:

a) What happens with concurrent requests now, are requests for the
   same area still correctly serialised?
b) Can I modify the parameters so that copy_sectors() is working with
   values it wasn't designed for, so that the data is copied to a wrong
   place or something like that? m->nb_available is used as an argument
   to it, so it certainly seems realistic.

> I have no clue about qcow2, but *each one* of K and N seems to be a
> function of *both* n_end and the current "qcow2 disk state" (ie. sectors
> being allocated vs. just referenced). IOW, I suspect that case (a) is
> possible for some (disk state, n_end) pairs.
>
> And case (a) -- if indeed possible -- seems to contradict the commit
> message:
>
>     allocation. A COW occurs only if the request wasn't cluster aligned,
>                       ^^^^^^^         ^^^^^^^^^^^^^^^^^^^
>     which in turn would imply that requested_sectors was less than
>     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>     avail_sectors (both in the original and in the fixed version). In this
>     ^^^^^^^^^^^^^                          ^^^^^^^^^^^^^^^^^^^^
>
> In case (a) -- expanded as [2] --, "requested_sectors" is greater than
> "avail_sectors_patched", and less than or equal to
> "avail_sectors_unpatched".

Yes, I agree. The commit message is definitely wrong, and I understand now
why, so that was a great step forward. There's still a missing piece about
how it creates a user-visible bug, but hopefully it won't be that hard to
find.

Kevin
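For the 64k-cluster reproducer above, the two versions of the
m->nb_available computation can be compared directly. This standalone
sketch is my own packaging of the two expressions from commit b7ab0fea
(the values keep_clusters = 1 and nb_clusters = 1 used below are my
reading of the described cluster layout, not something stated in the
mails):

```c
#include <assert.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* m->nb_available with the pre-b7ab0fea avail_sectors. */
static int nb_available_unpatched(int keep_clusters, int nb_clusters,
                                  int cluster_sectors, int n_end)
{
    int requested_sectors = n_end - keep_clusters * cluster_sectors;
    int avail_sectors = (keep_clusters + nb_clusters) * cluster_sectors;

    return MIN(requested_sectors, avail_sectors);
}

/* m->nb_available with the fixed avail_sectors from commit b7ab0fea. */
static int nb_available_patched(int keep_clusters, int nb_clusters,
                                int cluster_sectors, int n_end)
{
    int requested_sectors = n_end - keep_clusters * cluster_sectors;
    int avail_sectors = nb_clusters * cluster_sectors;

    return MIN(requested_sectors, avail_sectors);
}
```

With cluster_sectors = 128 (64k clusters) and n_end = 320 sectors (the
160k write), the unpatched version yields 192 sectors, which is not a
multiple of 128 and therefore looks like an unaligned request to the COW
check, while the patched version yields a cluster-aligned 128.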
On 22.01.2013 15:14, Kevin Wolf wrote:
> The series of requests to get into this state is somewhat tricky as it
> requires a specific physical ordering of clusters in the image file:
>
>     Host cluster h:     Guest cluster g is mapped to here
>     Host cluster h + 1: Free, can be allocated for the not yet mapped g + 1
>     Host cluster h + 2: Guest cluster g + 2 is mapped to here
>
> I can get this with this command on a fresh qcow2 image (64k cluster
> size):
>
>     ./qemu-io -c 'write -P 0x66 0 64k' \
>               -c 'write 64k 64k' \
>               -c 'write -P 0x88 128k 64k' \
>               -c 'discard 64k 64k' \
>               -c 'write -P 0x77 0 160k' \
>               /tmp/test.qcow2
>
> Now I get an additional COW for the area from 96k to 128k. However, this
> alone doesn't corrupt an image - a copy on write by itself is harmless
> because it only rewrites the data that is already there.
>
> The two things I'm going to check next are:
>
> a) What happens with concurrent requests now, are requests for the
>    same area still correctly serialised?
>
> b) Can I modify the parameters so that copy_sectors() is working with
>    values it wasn't designed for so that the data is copied to a wrong
>    place or something like that. m->nb_available is used as an argument
>    to it, so it certainly seems realistic.

I can reliably reproduce a bug with a) at least. It requires some backports
to qemu-io in order to get more control of the timing of AIO requests, but
then it works like this:

    write -P 0x66 0 320k
    write 320k 128k
    write -P 0x88 448k 128k
    discard 320k 128k
    aio_flush

    break cow_write A
    aio_write -P 0x77 0 480k
    wait_break A
    write -P 0x99 480k 32k
    resume A
    aio_flush

    read -P 0x77 0 480k
    read -P 0x99 480k 32k
    read -P 0x88 512k 64k

What happens is that the write of 0x99 to 480k doesn't detect an overlap
with the running AIO request (because obviously it doesn't overlap), but
the COW is actually taking place in the wrong cluster and is therefore
unprotected.
Stop the AIO request between the COW read and the COW write, write 0x99 to
that area, and the wrong COW will overwrite it, i.e. corrupt the image.

Basing a real test case on this is not trivial, though, because I have to
stop on the blkdebug event cow_write - which obviously isn't even emitted
in the fixed version.

I still have to look into option b).

Kevin
On 23.01.2013 12:43, Kevin Wolf wrote:
> On 22.01.2013 15:14, Kevin Wolf wrote:
>> The series of requests to get into this state is somewhat tricky as it
>> requires a specific physical ordering of clusters in the image file:
>>
>>     Host cluster h:     Guest cluster g is mapped to here
>>     Host cluster h + 1: Free, can be allocated for the not yet mapped g + 1
>>     Host cluster h + 2: Guest cluster g + 2 is mapped to here
>>
>> I can get this with this command on a fresh qcow2 image (64k cluster
>> size):
>>
>>     ./qemu-io -c 'write -P 0x66 0 64k' \
>>               -c 'write 64k 64k' \
>>               -c 'write -P 0x88 128k 64k' \
>>               -c 'discard 64k 64k' \
>>               -c 'write -P 0x77 0 160k' \
>>               /tmp/test.qcow2
>>
>> Now I get an additional COW for the area from 96k to 128k. However, this
>> alone doesn't corrupt an image - a copy on write by itself is harmless
>> because it only rewrites the data that is already there.
>>
>> The two things I'm going to check next are:
>>
>> a) What happens with concurrent requests now, are requests for the
>>    same area still correctly serialised?
>>
>> b) Can I modify the parameters so that copy_sectors() is working with
>>    values it wasn't designed for so that the data is copied to a wrong
>>    place or something like that. m->nb_available is used as an argument
>>    to it, so it certainly seems realistic.
>
> I can reliably reproduce a bug with a) at least.
> It requires some backports to qemu-io in order to get more control of
> the timing of AIO requests, but then it works like this:
>
>     write -P 0x66 0 320k
>     write 320k 128k
>     write -P 0x88 448k 128k
>     discard 320k 128k
>     aio_flush
>
>     break cow_write A
>     aio_write -P 0x77 0 480k
>     wait_break A
>     write -P 0x99 480k 32k
>     resume A
>     aio_flush
>
>     read -P 0x77 0 480k
>     read -P 0x99 480k 32k
>     read -P 0x88 512k 64k
>
> What happens is that the write of 0x99 to 480k doesn't detect an overlap
> with the running AIO request (because obviously it doesn't overlap), but
> the COW is actually taking place in the wrong cluster and is therefore
> unprotected. Stop the AIO request between the COW read and the COW
> write, write 0x99 to that area, and the wrong COW will overwrite it,
> i.e. corrupt the image.
>
> Basing a real test case on this is not trivial, though, because I have
> to stop on the blkdebug event cow_write - which obviously isn't even
> emitted in the fixed version.
>
> I still have to look into option b)

This is the reproducer using b):

    write -P 0x66 0 320k
    write 320k 128k
    write -P 0x55 1M 128k
    write -P 0x88 448k 128k
    discard 320k 128k
    aio_flush

    write -P 0x77 0 480k
    aio_flush

    read -P 0x77 0 480k
    read -P 0x88 480k 96k
    read -P 0x55 1M 128k

Not sure if a stable branch for 1.1 is still maintained, but I'm copying
qemu-stable just in case. Commit b7ab0fea actually fixes a qcow2 data
corruption issue (even though under rare circumstances) and should
definitely be cherry-picked for any new 1.1.x release. qemu 1.0 and older
don't have the bug.

Kevin
On 23.01.2013 17:54, Kevin Wolf wrote:
[]
> Not sure if a stable branch for 1.1 is still maintained, but I'm copying
> qemu-stable just in case. Commit b7ab0fea actually fixes a qcow2 data
> corruption issue (even though under rare circumstances) and should
> definitely be cherry-picked for any new 1.1.x release.

It's been picked up by me once Philipp found it after bisection. I
understand that at that time it wasn't known what actually happens, but
since Philipp tested it in his tree, it is a bugfix, and it was applied to
master (actually to a released version), I thought it was okay. It is also
included in the current Debian packages.

Speaking of 1.1, I really want to make a 1.1.3 release, but I'm just
unsure how to do that... ;)

Thank you Kevin!

/mjt
diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 98fba71..d7e0e19 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -947,8 +947,16 @@ again:
 
     /* save info needed for meta data update */
     if (nb_clusters > 0) {
+        /*
+         * requested_sectors: Number of sectors from the start of the first
+         * newly allocated cluster to the end of the (possibly shortened
+         * before) write request.
+         *
+         * avail_sectors: Number of sectors from the start of the first
+         * newly allocated to the end of the last newly allocated cluster.
+         */
         int requested_sectors = n_end - keep_clusters * s->cluster_sectors;
-        int avail_sectors = (keep_clusters + nb_clusters)
+        int avail_sectors = nb_clusters
                             << (s->cluster_bits - BDRV_SECTOR_BITS);
 
         *m = (QCowL2Meta) {