Message ID: 1455638000-18051-1-git-send-email-pbonzini@redhat.com
State: New
On Tue, 02/16 16:53, Paolo Bonzini wrote:
> The current implementation of bdrv_qed_drain can cause a double
> completion of a request.
>
> The problem is that bdrv_qed_drain calls qed_plug_allocating_write_reqs
> unconditionally, but this is not correct if an allocating write
> is queued.  In this case, qed_unplug_allocating_write_reqs will
> restart the allocating write and possibly cause it to complete.
> The aiocb however is still in use for the L2/L1 table writes,
> and will then be completed again as soon as the table writes
> are stable.
>
> The fix is to only call qed_plug_allocating_write_reqs and
> bdrv_aio_flush (which is the same as the timer callback---the patch
> makes this explicit) only if the timer was scheduled in the first
> place.  This fixes qemu-iotests test 011.
>
> Cc: qemu-stable@nongnu.org
> Cc: qemu-block@nongnu.org
> Cc: Stefan Hajnoczi <stefanha@redhat.com>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  block/qed.c | 13 +++++++------
>  1 file changed, 7 insertions(+), 6 deletions(-)
>
> diff --git a/block/qed.c b/block/qed.c
> index 404be1e..ebba220 100644
> --- a/block/qed.c
> +++ b/block/qed.c
> @@ -380,12 +380,13 @@ static void bdrv_qed_drain(BlockDriverState *bs)
>  {
>      BDRVQEDState *s = bs->opaque;
>
> -    /* Cancel timer and start doing I/O that were meant to happen as if it
> -     * fired, that way we get bdrv_drain() taking care of the ongoing requests
> -     * correctly. */
> -    qed_cancel_need_check_timer(s);
> -    qed_plug_allocating_write_reqs(s);
> -    bdrv_aio_flush(s->bs, qed_clear_need_check, s);
> +    /* Fire the timer immediately in order to start doing I/O as soon as the
> +     * header is flushed.
> +     */
> +    if (s->need_check_timer && timer_pending(s->need_check_timer)) {

We can assert(s->need_check_timer);

> +        qed_cancel_need_check_timer(s);
> +        qed_need_check_timer_cb(s);
> +    }

What if an allocating write is queued (the else branch case)? Its completion
will be in bdrv_drain and it could arm the need_check_timer which is wrong.

We need to drain the allocating_write_reqs queue before checking the timer.

Fam
On 17/02/2016 03:57, Fam Zheng wrote:
> On Tue, 02/16 16:53, Paolo Bonzini wrote:
>> The current implementation of bdrv_qed_drain can cause a double
>> completion of a request.
>>
>> [...]
>>
>> +    /* Fire the timer immediately in order to start doing I/O as soon as the
>> +     * header is flushed.
>> +     */
>> +    if (s->need_check_timer && timer_pending(s->need_check_timer)) {
>
> We can assert(s->need_check_timer);

I've seen it NULL, but didn't check why.  This was also a source of
segmentation faults.

>> +        qed_cancel_need_check_timer(s);
>> +        qed_need_check_timer_cb(s);
>> +    }
>
> What if an allocating write is queued (the else branch case)? Its completion
> will be in bdrv_drain and it could arm the need_check_timer which is wrong.
>
> We need to drain the allocating_write_reqs queue before checking the timer.

You're right, but how?  That's what bdrv_drain(bs) does, it's a
chicken-and-egg problem.

In my new series, draining works separately on each BlockDriverState along
the chain, from parent (bs) to children (bs->file->bs).  We could then have
.before_drain and .after_drain; .before_drain disables the timer, while
.after_drain does the same operation as the timer.

But that couldn't be backported, and 2.5 would remain broken forever.  This
patch at least is a band-aid to make QED functional.

Perhaps the alternative is to remove write support for QED altogether (and
the drain callback with it, since it's the only user).  Not sure why anyone
would use it these days.

Paolo
On Wed, 02/17 12:28, Paolo Bonzini wrote:
> On 17/02/2016 03:57, Fam Zheng wrote:
> > On Tue, 02/16 16:53, Paolo Bonzini wrote:
> >> [...]
> >>
> >> +    if (s->need_check_timer && timer_pending(s->need_check_timer)) {
> >
> > We can assert(s->need_check_timer);
>
> I've seen it NULL, but didn't check why.  This was also a source of
> segmentation faults.
>
> >> +        qed_cancel_need_check_timer(s);
> >> +        qed_need_check_timer_cb(s);
> >> +    }
> >
> > What if an allocating write is queued (the else branch case)? Its completion
> > will be in bdrv_drain and it could arm the need_check_timer which is wrong.
> >
> > We need to drain the allocating_write_reqs queue before checking the timer.
>
> You're right, but how?  That's what bdrv_drain(bs) does, it's a
> chicken-and-egg problem.

Maybe use an aio_poll loop before the if?

Fam
On 23/02/2016 06:57, Fam Zheng wrote:
>>>> +    qed_cancel_need_check_timer(s);
>>>> +    qed_need_check_timer_cb(s);
>>>> +    }
>>>
>>> What if an allocating write is queued (the else branch case)? Its completion
>>> will be in bdrv_drain and it could arm the need_check_timer which is wrong.
>>>
>>> We need to drain the allocating_write_reqs queue before checking the timer.
>>
>> You're right, but how?  That's what bdrv_drain(bs) does, it's a
>> chicken-and-egg problem.
>
> Maybe use an aio_poll loop before the if?

That would not change the fact that you're reimplementing bdrv_drain
inside bdrv_qed_drain.

Perhaps for now it's simplest to just remove the QED .bdrv_drain callback,
if you think this patch is not a good stopgap measure to avoid the
segmentation faults.

Once the bdrv_drain rework is in, we can move the callback _after_ I/O is
drained on bs and before it is drained on bs->file->bs.

Paolo
On Tue, 02/23 11:43, Paolo Bonzini wrote:
> On 23/02/2016 06:57, Fam Zheng wrote:
> >> [...]
> >>
> >> You're right, but how?  That's what bdrv_drain(bs) does, it's a
> >> chicken-and-egg problem.
> >
> > Maybe use an aio_poll loop before the if?
>
> That would not change the fact that you're reimplementing bdrv_drain
> inside bdrv_qed_drain.

But it fulfills the contract of .bdrv_drain.  This is the easy way; the
hard way would be iterating through the allocating_write_reqs list and
processing reqs one by one synchronously, which still involves aio_poll
indirectly.

> Perhaps for now it's simplest to just remove the QED .bdrv_drain
> callback, if you think this patch is not a good stopgap measure to avoid
> the segmentation faults.

OK, I'm fine with this as a stopgap measure.

> Once the bdrv_drain rework is in, we can move the callback _after_ I/O
> is drained on bs and before it is drained on bs->file->bs.

Sounds good.

Fam
On 23/02/2016 13:49, Fam Zheng wrote:
> On Tue, 02/23 11:43, Paolo Bonzini wrote:
>> On 23/02/2016 06:57, Fam Zheng wrote:
>>> Maybe use an aio_poll loop before the if?
>>
>> That would not change the fact that you're reimplementing bdrv_drain
>> inside bdrv_qed_drain.
>
> But it fulfills the contract of .bdrv_drain.  This is the easy way, the hard way
> would be iterating through the allocating_write_reqs list and process reqs one
> by one synchronously, which still involves aio_poll indirectly.

The easy way would be better then.

Stefan, any second opinion?

Paolo

>> Perhaps for now it's simplest to just remove the QED .bdrv_drain
>> callback, if you think this patch is not a good stopgap measure to avoid
>> the segmentation faults.
>
> OK, I'm fine with this as a stopgap measure.
>
>> Once the bdrv_drain rework is in, we can move the callback _after_ I/O
>> is drained on bs and before it is drained on bs->file->bs.
>
> Sounds good.
Am 23.02.2016 um 14:54 hat Paolo Bonzini geschrieben:
> On 23/02/2016 13:49, Fam Zheng wrote:
> > [...]
> >
> > But it fulfills the contract of .bdrv_drain.  This is the easy way, the hard way
> > would be iterating through the allocating_write_reqs list and process reqs one
> > by one synchronously, which still involves aio_poll indirectly.
>
> The easy way would be better then.
>
> Stefan, any second opinion?

What's the status here?  It would be good to have qed not segfaulting all
the time.

Kevin
On Mon, Mar 07, 2016 at 05:57:41PM +0100, Kevin Wolf wrote:
> Am 23.02.2016 um 14:54 hat Paolo Bonzini geschrieben:
> > [...]
> >
> > The easy way would be better then.
> >
> > Stefan, any second opinion?
>
> What's the status here?  It would be good to have qed not segfaulting all
> the time.

I think the timer concept itself is troublesome.  A radical approach but
limited to changing QED itself is to drop the timer and instead keep a
timestamp for the last allocating write request.  Next time a write request
(allocating or rewriting) is about to complete, do the flush and clear the
need check flag as part of the write request (if 5 seconds have passed
since the timestamp).

Because the need check flag clearing is part of the write request
completion, it will integrate properly with drain.

Stefan
By the way, I'll send a patch on Tuesday showing the timerless need check
mechanism.  It's a little too late tonight to start it.

Stefan
On 07/03/2016 21:56, Stefan Hajnoczi wrote:
> I think the timer concept itself is troublesome.  A radical approach but
> limited to changing QED itself is to drop the timer and instead keep a
> timestamp for the last allocating write request.  Next time a write
> request (allocating or rewriting) is about to complete, do the flush and
> clear the need check flag as part of the write request (if 5 seconds
> have passed since the timestamp).

bdrv_qed_drain should be easy to fix in my new drain implementation, which
is based on draining the parent before the BdrvChild-ren.  It's just
troublesome in the current one which alternates flushing and draining.

I would just revert the patch that introduced bdrv_qed_drain for now, and
reintroduce it later (note however that something was messed up in commit
df9a681, "qed: Implement .bdrv_drain", 2015-11-12, because it includes
some dirty bitmap stuff).

Or perhaps the idea of making QED read-only has some merit...

Paolo
On Mon, Mar 07, 2016 at 10:22:58PM +0100, Paolo Bonzini wrote:
> On 07/03/2016 21:56, Stefan Hajnoczi wrote:
> > [...]
>
> bdrv_qed_drain should be easy to fix in my new drain implementation,
> which is based on draining the parent before the BdrvChild-ren.  It's
> just troublesome in the current one which alternates flushing and
> draining.
>
> I would just revert the patch that introduced bdrv_qed_drain for now,
> and reintroduce it later (note however that something was messed up in
> commit df9a681, "qed: Implement .bdrv_drain", 2015-11-12, because it
> includes some dirty bitmap stuff).

You're right, that might be the best solution for the time being.  AFAIK
the need check write is harmless.  It does not result in a guest-visible
change and is basically just a flush + header update.

Stefan
Am 07.03.2016 um 21:56 hat Stefan Hajnoczi geschrieben:
> On Mon, Mar 07, 2016 at 05:57:41PM +0100, Kevin Wolf wrote:
> > [...]
> >
> > What's the status here?  It would be good to have qed not segfaulting
> > all the time.
>
> I think the timer concept itself is troublesome.  A radical approach but
> limited to changing QED itself is to drop the timer and instead keep a
> timestamp for the last allocating write request.  Next time a write
> request (allocating or rewriting) is about to complete, do the flush and
> clear the need check flag as part of the write request (if 5 seconds
> have passed since the timestamp).

Isn't that kind of backwards, though, because it's very likely that after
some new activity a second write request will come in soon and mark the
image dirty again?

Apart from that, doesn't it fail to provide the intended effect, that the
image doesn't stay dirty for a long time and a repair isn't usually needed
after a crash?

I think rather than doing this, I'd just remove the mechanism altogether
and instead mark the image clean when the guest flushes the disk cache.

Kevin

> Because the need check flag clearing is part of the write request
> completion, it will integrate properly with drain.
>
> Stefan
Am 08.03.2016 um 10:52 hat Stefan Hajnoczi geschrieben:
> On Mon, Mar 07, 2016 at 10:22:58PM +0100, Paolo Bonzini wrote:
> > [...]
> >
> > I would just revert the patch that introduced bdrv_qed_drain for now,
> > and reintroduce it later (note however that something was messed up in
> > commit df9a681, "qed: Implement .bdrv_drain", 2015-11-12, because it
> > includes some dirty bitmap stuff).
>
> You're right, that might be the best solution for the time being.  AFAIK
> the need check write is harmless.  It does not result in a guest-visible
> change and is basically just a flush + header update.

I think it caused occasional assertion failures.

Kevin
On Tue, Mar 08, 2016 at 10:58:47AM +0100, Kevin Wolf wrote:
> Am 07.03.2016 um 21:56 hat Stefan Hajnoczi geschrieben:
> > [...]
> >
> > I think the timer concept itself is troublesome.  A radical approach
> > but limited to changing QED itself is to drop the timer and instead
> > keep a timestamp for the last allocating write request.  Next time a
> > write request (allocating or rewriting) is about to complete, do the
> > flush and clear the need check flag as part of the write request (if 5
> > seconds have passed since the timestamp).
>
> Isn't that kind of backwards, though, because it's very likely that
> after some new activity a second write request will come in soon and
> mark the image dirty again?
>
> Apart from that, doesn't it fail to provide the intended effect, that
> the image doesn't stay dirty for a long time and a repair isn't usually
> needed after a crash?

Yes, it relies on a non-allocating write request after the timeout.

I've reverted the drain patch for now.

Stefan
diff --git a/block/qed.c b/block/qed.c
index 404be1e..ebba220 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -380,12 +380,13 @@ static void bdrv_qed_drain(BlockDriverState *bs)
 {
     BDRVQEDState *s = bs->opaque;
 
-    /* Cancel timer and start doing I/O that were meant to happen as if it
-     * fired, that way we get bdrv_drain() taking care of the ongoing requests
-     * correctly. */
-    qed_cancel_need_check_timer(s);
-    qed_plug_allocating_write_reqs(s);
-    bdrv_aio_flush(s->bs, qed_clear_need_check, s);
+    /* Fire the timer immediately in order to start doing I/O as soon as the
+     * header is flushed.
+     */
+    if (s->need_check_timer && timer_pending(s->need_check_timer)) {
+        qed_cancel_need_check_timer(s);
+        qed_need_check_timer_cb(s);
+    }
 }
 
 static int bdrv_qed_open(BlockDriverState *bs, QDict *options, int flags,
The current implementation of bdrv_qed_drain can cause a double completion
of a request.

The problem is that bdrv_qed_drain calls qed_plug_allocating_write_reqs
unconditionally, but this is not correct if an allocating write is queued.
In this case, qed_unplug_allocating_write_reqs will restart the allocating
write and possibly cause it to complete.  The aiocb however is still in use
for the L2/L1 table writes, and will then be completed again as soon as the
table writes are stable.

The fix is to call qed_plug_allocating_write_reqs and bdrv_aio_flush (which
is the same as the timer callback---the patch makes this explicit) only if
the timer was scheduled in the first place.  This fixes qemu-iotests test
011.

Cc: qemu-stable@nongnu.org
Cc: qemu-block@nongnu.org
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 block/qed.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)