Message ID | 20211214143542.14758-1-stefanha@redhat.com |
---|---|
State | New |
Headers | show |
Series | [v2] block-backend: prevent dangling BDS pointers across aio_poll() | expand |
Am 14.12.2021 um 15:35 hat Stefan Hajnoczi geschrieben: > The BlockBackend root child can change when aio_poll() is invoked. This > happens when a temporary filter node is removed upon blockjob > completion, for example. > > Functions in block/block-backend.c must be aware of this when using a > blk_bs() pointer across aio_poll() because the BlockDriverState refcnt > may reach 0, resulting in a stale pointer. > > One example is scsi_device_purge_requests(), which calls blk_drain() to > wait for in-flight requests to cancel. If the backup blockjob is active, > then the BlockBackend root child is a temporary filter BDS owned by the > blockjob. The blockjob can complete during bdrv_drained_begin() and the > last reference to the BDS is released when the temporary filter node is > removed. This results in a use-after-free when blk_drain() calls > bdrv_drained_end(bs) on the dangling pointer. > > Explicitly hold a reference to bs across block APIs that invoke > aio_poll(). > > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> > --- > v2: > - Audit block/block-backend.c and fix additional cases > --- > block/block-backend.c | 11 +++++++++++ > 1 file changed, 11 insertions(+) > > diff --git a/block/block-backend.c b/block/block-backend.c > index 12ef80ea17..a40ad7fa92 100644 > --- a/block/block-backend.c > +++ b/block/block-backend.c > @@ -828,10 +828,12 @@ void blk_remove_bs(BlockBackend *blk) > notifier_list_notify(&blk->remove_bs_notifiers, blk); > if (tgm->throttle_state) { > bs = blk_bs(blk); > + bdrv_ref(bs); > bdrv_drained_begin(bs); > throttle_group_detach_aio_context(tgm); > throttle_group_attach_aio_context(tgm, qemu_get_aio_context()); > bdrv_drained_end(bs); > + bdrv_unref(bs); > } > > blk_update_root_state(blk); This hunk is unnecessary, we still hold a reference that is only given up a few lines down with bdrv_root_unref_child(root). The rest looks good to me, so without this hunk: Reviewed-by: Kevin Wolf <kwolf@redhat.com>
On Tue, Dec 14, 2021 at 03:59:49PM +0100, Kevin Wolf wrote: > Am 14.12.2021 um 15:35 hat Stefan Hajnoczi geschrieben: > > The BlockBackend root child can change when aio_poll() is invoked. This > > happens when a temporary filter node is removed upon blockjob > > completion, for example. > > > > Functions in block/block-backend.c must be aware of this when using a > > blk_bs() pointer across aio_poll() because the BlockDriverState refcnt > > may reach 0, resulting in a stale pointer. > > > > One example is scsi_device_purge_requests(), which calls blk_drain() to > > wait for in-flight requests to cancel. If the backup blockjob is active, > > then the BlockBackend root child is a temporary filter BDS owned by the > > blockjob. The blockjob can complete during bdrv_drained_begin() and the > > last reference to the BDS is released when the temporary filter node is > > removed. This results in a use-after-free when blk_drain() calls > > bdrv_drained_end(bs) on the dangling pointer. > > > > Explicitly hold a reference to bs across block APIs that invoke > > aio_poll(). > > > > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> > > --- > > v2: > > - Audit block/block-backend.c and fix additional cases > > --- > > block/block-backend.c | 11 +++++++++++ > > 1 file changed, 11 insertions(+) > > > > diff --git a/block/block-backend.c b/block/block-backend.c > > index 12ef80ea17..a40ad7fa92 100644 > > --- a/block/block-backend.c > > +++ b/block/block-backend.c > > @@ -828,10 +828,12 @@ void blk_remove_bs(BlockBackend *blk) > > notifier_list_notify(&blk->remove_bs_notifiers, blk); > > if (tgm->throttle_state) { > > bs = blk_bs(blk); > > + bdrv_ref(bs); > > bdrv_drained_begin(bs); > > throttle_group_detach_aio_context(tgm); > > throttle_group_attach_aio_context(tgm, qemu_get_aio_context()); > > bdrv_drained_end(bs); > > + bdrv_unref(bs); > > } > > > > blk_update_root_state(blk); > > This hunk is unnecessary, we still hold a reference that is only given > up a few lines down with bdrv_root_unref_child(root). That's not the only place where the reference can be dropped: bdrv_drop_filter() removes the filter node from the graph. Here is a case where it happens: block/backup.c:backup_clean() -> bdrv_cbw_drop() -> bdrv_drop_filter() -> bdrv_replace_node_common() -> bdrv_replace_child_commit(). After we reach this bdrv_unref() is called a few times and all references are dropped because the node is no longer in the graph. This happens during blk_remove_bs() -> bdrv_drained_begin(), so the bs pointer in the above hunk can be stale. Stefan
Am 15.12.2021 um 12:28 hat Stefan Hajnoczi geschrieben: > On Tue, Dec 14, 2021 at 03:59:49PM +0100, Kevin Wolf wrote: > > Am 14.12.2021 um 15:35 hat Stefan Hajnoczi geschrieben: > > > The BlockBackend root child can change when aio_poll() is invoked. This > > > happens when a temporary filter node is removed upon blockjob > > > completion, for example. > > > > > > Functions in block/block-backend.c must be aware of this when using a > > > blk_bs() pointer across aio_poll() because the BlockDriverState refcnt > > > may reach 0, resulting in a stale pointer. > > > > > > One example is scsi_device_purge_requests(), which calls blk_drain() to > > > wait for in-flight requests to cancel. If the backup blockjob is active, > > > then the BlockBackend root child is a temporary filter BDS owned by the > > > blockjob. The blockjob can complete during bdrv_drained_begin() and the > > > last reference to the BDS is released when the temporary filter node is > > > removed. This results in a use-after-free when blk_drain() calls > > > bdrv_drained_end(bs) on the dangling pointer. > > > > > > Explicitly hold a reference to bs across block APIs that invoke > > > aio_poll(). > > > > > > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> > > > --- > > > v2: > > > - Audit block/block-backend.c and fix additional cases > > > --- > > > block/block-backend.c | 11 +++++++++++ > > > 1 file changed, 11 insertions(+) > > > > > > diff --git a/block/block-backend.c b/block/block-backend.c > > > index 12ef80ea17..a40ad7fa92 100644 > > > --- a/block/block-backend.c > > > +++ b/block/block-backend.c > > > @@ -828,10 +828,12 @@ void blk_remove_bs(BlockBackend *blk) > > > notifier_list_notify(&blk->remove_bs_notifiers, blk); > > > if (tgm->throttle_state) { > > > bs = blk_bs(blk); > > > + bdrv_ref(bs); > > > bdrv_drained_begin(bs); > > > throttle_group_detach_aio_context(tgm); > > > throttle_group_attach_aio_context(tgm, qemu_get_aio_context()); > > > bdrv_drained_end(bs); > > > + bdrv_unref(bs); > > > } > > > > > > blk_update_root_state(blk); > > > > This hunk is unnecessary, we still hold a reference that is only given > > up a few lines down with bdrv_root_unref_child(root). > > That's not the only place where the reference can be dropped: > bdrv_drop_filter() removes the filter node from the graph. > > Here is a case where it happens: block/backup.c:backup_clean() -> > bdrv_cbw_drop() -> bdrv_drop_filter() -> bdrv_replace_node_common() -> > bdrv_replace_child_commit(). After we reach this bdrv_unref() is called > a few times and all references are dropped because the node is no longer > in the graph. > > This happens during blk_remove_bs() -> bdrv_drained_begin(), so the bs > pointer in the above hunk can be stale. Is the scenario that blk->root doesn't go away, but the node it references changes? So the unref below will be for a different node than we're draining here? If so, let's add a comment that blk_bs(blk) can change after the drain, and maybe move the BlockDriverState *bs declaration inside the if block because the variable is invalid after it anyway. Can it also happen that instead of removing a node from the chain, a new one is inserted and we actually end up having drained the wrong node before switching the context for tgm? Kevin
On Wed, Dec 15, 2021 at 04:31:26PM +0100, Kevin Wolf wrote: > Am 15.12.2021 um 12:28 hat Stefan Hajnoczi geschrieben: > > On Tue, Dec 14, 2021 at 03:59:49PM +0100, Kevin Wolf wrote: > > > Am 14.12.2021 um 15:35 hat Stefan Hajnoczi geschrieben: > > > > The BlockBackend root child can change when aio_poll() is invoked. This > > > > happens when a temporary filter node is removed upon blockjob > > > > completion, for example. > > > > > > > > Functions in block/block-backend.c must be aware of this when using a > > > > blk_bs() pointer across aio_poll() because the BlockDriverState refcnt > > > > may reach 0, resulting in a stale pointer. > > > > > > > > One example is scsi_device_purge_requests(), which calls blk_drain() to > > > > wait for in-flight requests to cancel. If the backup blockjob is active, > > > > then the BlockBackend root child is a temporary filter BDS owned by the > > > > blockjob. The blockjob can complete during bdrv_drained_begin() and the > > > > last reference to the BDS is released when the temporary filter node is > > > > removed. This results in a use-after-free when blk_drain() calls > > > > bdrv_drained_end(bs) on the dangling pointer. > > > > > > > > Explicitly hold a reference to bs across block APIs that invoke > > > > aio_poll(). > > > > > > > > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> > > > > --- > > > > v2: > > > > - Audit block/block-backend.c and fix additional cases > > > > --- > > > > block/block-backend.c | 11 +++++++++++ > > > > 1 file changed, 11 insertions(+) > > > > > > > > diff --git a/block/block-backend.c b/block/block-backend.c > > > > index 12ef80ea17..a40ad7fa92 100644 > > > > --- a/block/block-backend.c > > > > +++ b/block/block-backend.c > > > > @@ -828,10 +828,12 @@ void blk_remove_bs(BlockBackend *blk) > > > > notifier_list_notify(&blk->remove_bs_notifiers, blk); > > > > if (tgm->throttle_state) { > > > > bs = blk_bs(blk); > > > > + bdrv_ref(bs); > > > > bdrv_drained_begin(bs); > > > > throttle_group_detach_aio_context(tgm); > > > > throttle_group_attach_aio_context(tgm, qemu_get_aio_context()); > > > > bdrv_drained_end(bs); > > > > + bdrv_unref(bs); > > > > } > > > > > > > > blk_update_root_state(blk); > > > > > > This hunk is unnecessary, we still hold a reference that is only given > > > up a few lines down with bdrv_root_unref_child(root). > > > > That's not the only place where the reference can be dropped: > > bdrv_drop_filter() removes the filter node from the graph. > > > > Here is a case where it happens: block/backup.c:backup_clean() -> > > bdrv_cbw_drop() -> bdrv_drop_filter() -> bdrv_replace_node_common() -> > > bdrv_replace_child_commit(). After we reach this bdrv_unref() is called > > a few times and all references are dropped because the node is no longer > > in the graph. > > > > This happens during blk_remove_bs() -> bdrv_drained_begin(), so the bs > > pointer in the above hunk can be stale. > > Is the scenario that blk->root doesn't go away, but the node it > references changes? So the unref below will be for a different node than > we're draining here? Yes, exactly. > If so, let's add a comment that blk_bs(blk) can change after the drain, > and maybe move the BlockDriverState *bs declaration inside the if block > because the variable is invalid after it anyway. Will fix. > Can it also happen that instead of removing a node from the chain, a new > one is inserted and we actually end up having drained the wrong node > before switching the context for tgm? I'll investigate that. There is already some level of support for draining new nodes but I'm not sure it covers all insert and replace cases. Stefan
On 14.12.21 15:35, Stefan Hajnoczi wrote: > The BlockBackend root child can change when aio_poll() is invoked. This > happens when a temporary filter node is removed upon blockjob > completion, for example. > > Functions in block/block-backend.c must be aware of this when using a > blk_bs() pointer across aio_poll() because the BlockDriverState refcnt > may reach 0, resulting in a stale pointer. > > One example is scsi_device_purge_requests(), which calls blk_drain() to > wait for in-flight requests to cancel. If the backup blockjob is active, > then the BlockBackend root child is a temporary filter BDS owned by the > blockjob. The blockjob can complete during bdrv_drained_begin() and the > last reference to the BDS is released when the temporary filter node is > removed. This results in a use-after-free when blk_drain() calls > bdrv_drained_end(bs) on the dangling pointer. By the way, I have a BZ for this, though it’s about block-stream instead of backup (https://bugzilla.redhat.com/show_bug.cgi?id=2036178). But I’m happy to report your patch seems* to fix that problem, too! (Thanks for fixing my BZs! :)) *I’ve written a reproducer in iotest form (https://gitlab.com/hreitz/qemu/-/blob/stefans-fix-and-a-test/tests/qemu-iotests/tests/stream-error-on-reset), and so far I can only assume it indeed reproduces the report, but I found that iotest to indeed be fixed by this patch. (Which made me very happy.) Hanna > Explicitly hold a reference to bs across block APIs that invoke > aio_poll(). > > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> > --- > v2: > - Audit block/block-backend.c and fix additional cases > --- > block/block-backend.c | 11 +++++++++++ > 1 file changed, 11 insertions(+) > > diff --git a/block/block-backend.c b/block/block-backend.c > index 12ef80ea17..a40ad7fa92 100644 > --- a/block/block-backend.c > +++ b/block/block-backend.c > @@ -828,10 +828,12 @@ void blk_remove_bs(BlockBackend *blk) > notifier_list_notify(&blk->remove_bs_notifiers, blk); > if (tgm->throttle_state) { > bs = blk_bs(blk); > + bdrv_ref(bs); > bdrv_drained_begin(bs); > throttle_group_detach_aio_context(tgm); > throttle_group_attach_aio_context(tgm, qemu_get_aio_context()); > bdrv_drained_end(bs); > + bdrv_unref(bs); > } > > blk_update_root_state(blk); > @@ -1705,6 +1707,7 @@ void blk_drain(BlockBackend *blk) > BlockDriverState *bs = blk_bs(blk); > > if (bs) { > + bdrv_ref(bs); > bdrv_drained_begin(bs); > } > > @@ -1714,6 +1717,7 @@ void blk_drain(BlockBackend *blk) > > if (bs) { > bdrv_drained_end(bs); > + bdrv_unref(bs); > } > } > > @@ -2044,10 +2048,13 @@ static int blk_do_set_aio_context(BlockBackend *blk, AioContext *new_context, > int ret; > > if (bs) { > + bdrv_ref(bs); > + > if (update_root_node) { > ret = bdrv_child_try_set_aio_context(bs, new_context, blk->root, > errp); > if (ret < 0) { > + bdrv_unref(bs); > return ret; > } > } > @@ -2057,6 +2064,8 @@ static int blk_do_set_aio_context(BlockBackend *blk, AioContext *new_context, > throttle_group_attach_aio_context(tgm, new_context); > bdrv_drained_end(bs); > } > + > + bdrv_unref(bs); > } > > blk->ctx = new_context; > @@ -2326,11 +2335,13 @@ void blk_io_limits_disable(BlockBackend *blk) > ThrottleGroupMember *tgm = &blk->public.throttle_group_member; > assert(tgm->throttle_state); > if (bs) { > + bdrv_ref(bs); > bdrv_drained_begin(bs); > } > throttle_group_unregister_tgm(tgm); > if (bs) { > bdrv_drained_end(bs); > + bdrv_unref(bs); > } > } >
On Wed, Dec 15, 2021 at 04:31:26PM +0100, Kevin Wolf wrote: > Am 15.12.2021 um 12:28 hat Stefan Hajnoczi geschrieben: > > On Tue, Dec 14, 2021 at 03:59:49PM +0100, Kevin Wolf wrote: > > > Am 14.12.2021 um 15:35 hat Stefan Hajnoczi geschrieben: > > > > The BlockBackend root child can change when aio_poll() is invoked. This > > > > happens when a temporary filter node is removed upon blockjob > > > > completion, for example. > > > > > > > > Functions in block/block-backend.c must be aware of this when using a > > > > blk_bs() pointer across aio_poll() because the BlockDriverState refcnt > > > > may reach 0, resulting in a stale pointer. > > > > > > > > One example is scsi_device_purge_requests(), which calls blk_drain() to > > > > wait for in-flight requests to cancel. If the backup blockjob is active, > > > > then the BlockBackend root child is a temporary filter BDS owned by the > > > > blockjob. The blockjob can complete during bdrv_drained_begin() and the > > > > last reference to the BDS is released when the temporary filter node is > > > > removed. This results in a use-after-free when blk_drain() calls > > > > bdrv_drained_end(bs) on the dangling pointer. > > > > > > > > Explicitly hold a reference to bs across block APIs that invoke > > > > aio_poll(). > > > > > > > > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> > > > > --- > > > > v2: > > > > - Audit block/block-backend.c and fix additional cases > > > > --- > > > > block/block-backend.c | 11 +++++++++++ > > > > 1 file changed, 11 insertions(+) > > > > > > > > diff --git a/block/block-backend.c b/block/block-backend.c > > > > index 12ef80ea17..a40ad7fa92 100644 > > > > --- a/block/block-backend.c > > > > +++ b/block/block-backend.c > > > > @@ -828,10 +828,12 @@ void blk_remove_bs(BlockBackend *blk) > > > > notifier_list_notify(&blk->remove_bs_notifiers, blk); > > > > if (tgm->throttle_state) { > > > > bs = blk_bs(blk); > > > > + bdrv_ref(bs); > > > > bdrv_drained_begin(bs); > > > > throttle_group_detach_aio_context(tgm); > > > > throttle_group_attach_aio_context(tgm, qemu_get_aio_context()); > > > > bdrv_drained_end(bs); > > > > + bdrv_unref(bs); > > > > } > > > > > > > > blk_update_root_state(blk); > > > > > > This hunk is unnecessary, we still hold a reference that is only given > > > up a few lines down with bdrv_root_unref_child(root). > > > > That's not the only place where the reference can be dropped: > > bdrv_drop_filter() removes the filter node from the graph. > > > > Here is a case where it happens: block/backup.c:backup_clean() -> > > bdrv_cbw_drop() -> bdrv_drop_filter() -> bdrv_replace_node_common() -> > > bdrv_replace_child_commit(). After we reach this bdrv_unref() is called > > a few times and all references are dropped because the node is no longer > > in the graph. > > > > This happens during blk_remove_bs() -> bdrv_drained_begin(), so the bs > > pointer in the above hunk can be stale. > > Is the scenario that blk->root doesn't go away, but the node it > references changes? Yes, blk->root remains non-NULL but its value changes across aio_poll(). > So the unref below will be for a different node than > we're draining here? Yes. > If so, let's add a comment that blk_bs(blk) can change after the drain, > and maybe move the BlockDriverState *bs declaration inside the if block > because the variable is invalid after it anyway. Sounds good. > Can it also happen that instead of removing a node from the chain, a new > one is inserted and we actually end up having drained the wrong node > before switching the context for tgm? I think a new node can theoretically be inserted. I don't know if that can happen in practice since insertions typically happen due to monitor commands and monitor commands don't execute during aio_poll(). Stefan
On Mon, Jan 10, 2022 at 07:57:05PM +0100, Hanna Reitz wrote: > On 14.12.21 15:35, Stefan Hajnoczi wrote: > > The BlockBackend root child can change when aio_poll() is invoked. This > > happens when a temporary filter node is removed upon blockjob > > completion, for example. > > > > Functions in block/block-backend.c must be aware of this when using a > > blk_bs() pointer across aio_poll() because the BlockDriverState refcnt > > may reach 0, resulting in a stale pointer. > > > > One example is scsi_device_purge_requests(), which calls blk_drain() to > > wait for in-flight requests to cancel. If the backup blockjob is active, > > then the BlockBackend root child is a temporary filter BDS owned by the > > blockjob. The blockjob can complete during bdrv_drained_begin() and the > > last reference to the BDS is released when the temporary filter node is > > removed. This results in a use-after-free when blk_drain() calls > > bdrv_drained_end(bs) on the dangling pointer. > > By the way, I have a BZ for this, though it’s about block-stream instead of > backup (https://bugzilla.redhat.com/show_bug.cgi?id=2036178). But I’m happy > to report your patch seems* to fix that problem, too! (Thanks for fixing my > BZs! :)) > > *I’ve written a reproducer in iotest form (https://gitlab.com/hreitz/qemu/-/blob/stefans-fix-and-a-test/tests/qemu-iotests/tests/stream-error-on-reset), > and so far I can only assume it indeed reproduces the report, but I found > that iotest to indeed be fixed by this patch. (Which made me very happy.) Great, I have merged your test case and sent a v3. Thanks, Stefan
diff --git a/block/block-backend.c b/block/block-backend.c index 12ef80ea17..a40ad7fa92 100644 --- a/block/block-backend.c +++ b/block/block-backend.c @@ -828,10 +828,12 @@ void blk_remove_bs(BlockBackend *blk) notifier_list_notify(&blk->remove_bs_notifiers, blk); if (tgm->throttle_state) { bs = blk_bs(blk); + bdrv_ref(bs); bdrv_drained_begin(bs); throttle_group_detach_aio_context(tgm); throttle_group_attach_aio_context(tgm, qemu_get_aio_context()); bdrv_drained_end(bs); + bdrv_unref(bs); } blk_update_root_state(blk); @@ -1705,6 +1707,7 @@ void blk_drain(BlockBackend *blk) BlockDriverState *bs = blk_bs(blk); if (bs) { + bdrv_ref(bs); bdrv_drained_begin(bs); } @@ -1714,6 +1717,7 @@ void blk_drain(BlockBackend *blk) if (bs) { bdrv_drained_end(bs); + bdrv_unref(bs); } } @@ -2044,10 +2048,13 @@ static int blk_do_set_aio_context(BlockBackend *blk, AioContext *new_context, int ret; if (bs) { + bdrv_ref(bs); + if (update_root_node) { ret = bdrv_child_try_set_aio_context(bs, new_context, blk->root, errp); if (ret < 0) { + bdrv_unref(bs); return ret; } } @@ -2057,6 +2064,8 @@ static int blk_do_set_aio_context(BlockBackend *blk, AioContext *new_context, throttle_group_attach_aio_context(tgm, new_context); bdrv_drained_end(bs); } + + bdrv_unref(bs); } blk->ctx = new_context; @@ -2326,11 +2335,13 @@ void blk_io_limits_disable(BlockBackend *blk) ThrottleGroupMember *tgm = &blk->public.throttle_group_member; assert(tgm->throttle_state); if (bs) { + bdrv_ref(bs); bdrv_drained_begin(bs); } throttle_group_unregister_tgm(tgm); if (bs) { bdrv_drained_end(bs); + bdrv_unref(bs); } }
The BlockBackend root child can change when aio_poll() is invoked. This happens when a temporary filter node is removed upon blockjob completion, for example. Functions in block/block-backend.c must be aware of this when using a blk_bs() pointer across aio_poll() because the BlockDriverState refcnt may reach 0, resulting in a stale pointer. One example is scsi_device_purge_requests(), which calls blk_drain() to wait for in-flight requests to cancel. If the backup blockjob is active, then the BlockBackend root child is a temporary filter BDS owned by the blockjob. The blockjob can complete during bdrv_drained_begin() and the last reference to the BDS is released when the temporary filter node is removed. This results in a use-after-free when blk_drain() calls bdrv_drained_end(bs) on the dangling pointer. Explicitly hold a reference to bs across block APIs that invoke aio_poll(). Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> --- v2: - Audit block/block-backend.c and fix additional cases --- block/block-backend.c | 11 +++++++++++ 1 file changed, 11 insertions(+)