Message ID | 1426002357-6889-1-git-send-email-pbonzini@redhat.com
---|---
State | New
On 10.03.2015 at 16:45, Paolo Bonzini wrote:
[...]
>
> Moral of the story: I should refrain from writing more clever stuff.
> At least it looks like it is not too clever to be undebuggable.
>
> Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
> Fixes: c740ad92d0d958fa785e5d7aa1b67ecaf30a6a54
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>

> ---
>  include/qemu/queue.h | 11 ++++++-----
>  1 file changed, 6 insertions(+), 5 deletions(-)
[...]
On Tue, Mar 10, 2015 at 04:45:57PM +0100, Paolo Bonzini wrote:
> There is a not-so-subtle race in QSLIST_INSERT_HEAD_ATOMIC.
[...]
> ---
>  include/qemu/queue.h | 11 ++++++-----
>  1 file changed, 6 insertions(+), 5 deletions(-)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
On Tue, Mar 10, 2015 at 04:45:57PM +0100, Paolo Bonzini wrote:
> There is a not-so-subtle race in QSLIST_INSERT_HEAD_ATOMIC.
[...]
> ---
>  include/qemu/queue.h | 11 ++++++-----
>  1 file changed, 6 insertions(+), 5 deletions(-)

Thanks, applied to my block tree for QEMU 2.3:
https://github.com/stefanha/qemu/commits/block

Stefan
diff --git a/include/qemu/queue.h b/include/qemu/queue.h
index 8094150..f781aa2 100644
--- a/include/qemu/queue.h
+++ b/include/qemu/queue.h
@@ -197,11 +197,12 @@ struct {                                                \
     (head)->slh_first = (elm);                                              \
 } while (/*CONSTCOND*/0)
 
-#define QSLIST_INSERT_HEAD_ATOMIC(head, elm, field) do {                    \
-    do {                                                                    \
-        (elm)->field.sle_next = (head)->slh_first;                          \
-    } while (atomic_cmpxchg(&(head)->slh_first, (elm)->field.sle_next,      \
-            (elm)) != (elm)->field.sle_next);                               \
+#define QSLIST_INSERT_HEAD_ATOMIC(head, elm, field) do {                    \
+    typeof(elm) save_sle_next;                                              \
+    do {                                                                    \
+        save_sle_next = (elm)->field.sle_next = (head)->slh_first;          \
+    } while (atomic_cmpxchg(&(head)->slh_first, save_sle_next, (elm)) !=    \
+             save_sle_next);                                                \
 } while (/*CONSTCOND*/0)
 
 #define QSLIST_MOVE_ATOMIC(dest, src) do {                                  \
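
For readers who want to try the fix outside the QEMU tree, here is a minimal
standalone sketch of the same insert, written against C11 atomics rather than
QEMU's atomic_cmpxchg.  The names node, slist_head and slist_insert_head_atomic
are invented for illustration and are not QEMU identifiers.  C11's
atomic_compare_exchange_weak returns a success flag and keeps the expected
value in a local by construction, so it avoids the compare-against-memory trap
by design:

/* Standalone sketch only -- QEMU's real code is the macro in the patch
 * above.  All names here are illustrative. */
#include <stdatomic.h>
#include <stdio.h>

struct node {
    struct node *next;                 /* stands in for field.sle_next */
};

struct slist_head {
    _Atomic(struct node *) first;      /* stands in for slh_first */
};

static void slist_insert_head_atomic(struct slist_head *head, struct node *elm)
{
    /* 'expected' is a local, so the success check can never be defeated
     * by a reload from shared memory.  On failure the CAS rewrites
     * 'expected' with the head it actually observed, and the loop simply
     * re-links elm to that value and retries. */
    struct node *expected = atomic_load(&head->first);
    do {
        elm->next = expected;
    } while (!atomic_compare_exchange_weak(&head->first, &expected, elm));
}

int main(void)
{
    struct slist_head head = { NULL };
    struct node a = { NULL }, b = { NULL };

    slist_insert_head_atomic(&head, &a);
    slist_insert_head_atomic(&head, &b);   /* list is now b -> a */

    for (struct node *n = atomic_load(&head.first); n; n = n->next) {
        printf("node %p\n", (void *)n);
    }
    return 0;
}

The weak CAS may fail spuriously, but since the operation already sits in a
retry loop that is harmless, and the weak form is typically cheaper than the
strong one.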
There is a not-so-subtle race in QSLIST_INSERT_HEAD_ATOMIC.

Because atomic_cmpxchg returns the old value instead of a success flag,
QSLIST_INSERT_HEAD_ATOMIC was checking for success by comparing against
the second argument to atomic_cmpxchg.  Unfortunately, this only works
if the second argument is a local or thread-local variable.

If it is in memory, it can be subject to common subexpression elimination
(and then everything's fine) or reloaded after the atomic_cmpxchg,
depending on the compiler's whims.  If the latter happens, the race can
happen.  A thread can sneak in, doing something on elm->field.sle_next
after the atomic_cmpxchg and before the comparison.  This causes a wrong
failure, and then two threads are using "elm" at the same time.  In the
case discovered by Christian, the sequence was likely something like this:

    thread 1                     | thread 2
    -----------------------------+-----------------------------
    QSLIST_INSERT_HEAD_ATOMIC    |
      atomic_cmpxchg succeeds    |
      elm added to list          |
                                 | steal release_pool
                                 | QSLIST_REMOVE_HEAD
                                 |   elm removed from list
                                 | ...
                                 | QSLIST_INSERT_HEAD_ATOMIC
                                 |   (overwrites sle_next)
      spurious failure           |
      atomic_cmpxchg succeeds    |
      elm added to list again    |
                                 |
    steal release_pool           |
    QSLIST_REMOVE_HEAD           |
      elm removed again          |

The last three steps could be done by a third thread as well.
A reproducer that failed in a matter of seconds is as follows:

- the guest has 32 VCPUs on a 28-core host (hyperthreading was enabled),
  memory was 16G just to err on the safe side (the host has 64G, but hey
  at least you need no s390)

- the guest has 24 null-aio virtio-blk devices using dataplane
  (-object iothread,id=ioN -drive if=none,id=blkN,driver=null-aio,size=500G
  -device virtio-blk-pci,iothread=ioN,drive=blkN)

- the guest also has a single network interface.  It's only doing loopback
  tests, so slirp vs. tap and the model don't matter.

- the guest is running fio with the following script:

    [global]
    rw=randread
    blocksize=16k
    ioengine=libaio
    runtime=10m
    buffered=0
    fallocate=none
    time_based
    iodepth=32

    [virtio1a]
    filename=/dev/block/252\:16

    [virtio1b]
    filename=/dev/block/252\:16

    ...

    [virtio24a]
    filename=/dev/block/252\:384

    [virtio24b]
    filename=/dev/block/252\:384

    [listen1]
    protocol=tcp
    ioengine=net
    port=12345
    listen
    rw=read
    bs=4k
    size=1000g

    [connect1]
    protocol=tcp
    hostname=localhost
    ioengine=net
    port=12345
    rw=write
    startdelay=1
    size=1000g

    ...

    [listen8]
    protocol=tcp
    ioengine=net
    port=12352
    listen
    rw=read
    bs=4k
    size=1000g

    [connect8]
    protocol=tcp
    hostname=localhost
    ioengine=net
    port=12352
    rw=write
    startdelay=1
    size=1000g

Moral of the story: I should refrain from writing more clever stuff.
At least it looks like it is not too clever to be undebuggable.

Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
Fixes: c740ad92d0d958fa785e5d7aa1b67ecaf30a6a54
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 include/qemu/queue.h | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)
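
To make the compiler hazard described above concrete, here is a hedged
side-by-side sketch of the old and new success checks, using GCC's
__sync_val_compare_and_swap as a stand-in for QEMU's value-returning
atomic_cmpxchg.  The names broken_insert and fixed_insert are illustrative,
not QEMU source:

/* Illustrative sketch, not QEMU code: contrasts the buggy and fixed
 * success checks around a value-returning compare-and-swap. */
struct node { struct node *next; };
struct head { struct node *first; };

/* BROKEN (the old shape): elm->next is read three times, and nothing
 * stops the compiler from compiling the '!= elm->next' comparison as a
 * fresh load.  If another thread pops elm and overwrites elm->next
 * between the CAS and that reload, a CAS that actually SUCCEEDED looks
 * like a failure, and elm is inserted a second time -- exactly the
 * thread-1/thread-2 interleaving shown above. */
static void broken_insert(struct head *h, struct node *elm)
{
    do {
        elm->next = h->first;
    } while (__sync_val_compare_and_swap(&h->first, elm->next, elm)
             != elm->next);
}

/* FIXED (the patch's shape): snapshot the expected head in a local.
 * The comparison now uses exactly the value handed to the CAS, so the
 * success check cannot be perturbed by other threads. */
static void fixed_insert(struct head *h, struct node *elm)
{
    struct node *save_next;
    do {
        save_next = elm->next = h->first;
    } while (__sync_val_compare_and_swap(&h->first, save_next, elm)
             != save_next);
}

int main(void)
{
    struct head h = { NULL };
    struct node n1 = { NULL }, n2 = { NULL };

    broken_insert(&h, &n1);  /* fine single-threaded; racy under contention */
    fixed_insert(&h, &n2);   /* h.first == &n2, n2.next == &n1 */
    return 0;
}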