ide: Set BSY bit during FLUSH

Message ID	1369729127-24499-1-git-send-email-afaerber@suse.de
State	New
Headers
Return-Path: <qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org>
From: Andreas Färber <afaerber@suse.de>
To: qemu-devel@nongnu.org
Date: Tue, 28 May 2013 10:18:47 +0200
Message-Id: <1369729127-24499-1-git-send-email-afaerber@suse.de>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
Cc: Kevin Wolf <kwolf@redhat.com>, stefano.stabellini@eu.citrix.com, stefanha@gmail.com, Heiko Rommel <rommel@suse.com>, Bruce Rogers <brogers@suse.com>, arei.gonglei@huawei.com, pbonzini@redhat.com, Andreas Färber <afaerber@suse.de>
Subject: [Qemu-devel] [PATCH] ide: Set BSY bit during FLUSH
Precedence: list
Errors-To: qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org
Sender: qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org

Andreas Färber May 28, 2013, 8:18 a.m. UTC

The implementation of the ATA FLUSH command invokes a flush at the block
layer, which may on raw files on POSIX entail a synchronous fdatasync().
This may in some cases take so long that the SLES 11 SP1 guest driver
reports I/O errors and filesystems get corrupted or remounted read-only.

Avoid this by setting BUSY_STAT, so that the guest is made aware we are
in the middle of an operation and no ATA commands are attempted to be
processed concurrently.

Addresses BNC#637297.

Suggested-by: Gonglei (Arei) <arei.gonglei@huawei.com>
Signed-off-by: Andreas Färber <afaerber@suse.de>
---
 hw/ide/core.c | 3 +++
 1 file changed, 3 insertions(+)
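
Editorial sketch: the pattern described in the commit message can be modelled outside the QEMU tree as follows. This is a toy model, not QEMU code. `ToyIDEState`, `toy_flush_cache()`, `toy_flush_cb()` and `toy_complete_pending()` are hypothetical names, the asynchronous block layer is replaced by a stored function pointer, and the assumption that the completion callback assigns a fresh ready status (which drops BSY as a side effect) is taken from the review comment that the status "is already reset".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ATA status register bits. */
#define BUSY_STAT  0x80
#define READY_STAT 0x40
#define SEEK_STAT  0x10

/* Hypothetical stand-in for the device state; only the status
 * register and one pending completion are modelled. */
typedef struct ToyIDEState {
    uint8_t status;
    void (*pending_cb)(void *opaque, int ret);
} ToyIDEState;

/* Completion callback. Assumption: a plain assignment of the ready
 * status is what clears BSY, so no explicit "&= ~BUSY_STAT" is needed. */
void toy_flush_cb(void *opaque, int ret)
{
    ToyIDEState *s = opaque;

    (void)ret; /* error handling is out of scope for this sketch */
    s->status = READY_STAT | SEEK_STAT;
}

/* Start a flush: raise BSY first, then hand the request to the
 * (here simulated) asynchronous backend. */
void toy_flush_cache(ToyIDEState *s)
{
    assert(s != NULL);
    s->status |= BUSY_STAT;
    s->pending_cb = toy_flush_cb;
}

/* Simulate the backend finishing the request at some later point. */
void toy_complete_pending(ToyIDEState *s, int ret)
{
    void (*cb)(void *, int) = s->pending_cb;

    assert(cb != NULL);
    s->pending_cb = NULL;
    cb(s, ret);
}
```

Between `toy_flush_cache()` and `toy_complete_pending()` a guest polling the status register would see BSY set, which is the window the patch is meant to cover.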

Kevin Wolf May 28, 2013, 8:27 a.m. UTC | #1

On 28.05.2013 at 10:18, Andreas Färber wrote:
> The implementation of the ATA FLUSH command invokes a flush at the block
> layer, which may on raw files on POSIX entail a synchronous fdatasync().
> This may in some cases take so long that the SLES 11 SP1 guest driver
> reports I/O errors and filesystems get corrupted or remounted read-only.
> 
> Avoid this by setting BUSY_STAT, so that the guest is made aware we are
> in the middle of an operation and no ATA commands are attempted to be
> processed concurrently.
> 
> Addresses BNC#637297.
> 
> Suggested-by: Gonglei (Arei) <arei.gonglei@huawei.com>
> Signed-off-by: Andreas Färber <afaerber@suse.de>
> ---
>  hw/ide/core.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/hw/ide/core.c b/hw/ide/core.c
> index c7a8041..bf1ff18 100644
> --- a/hw/ide/core.c
> +++ b/hw/ide/core.c
> @@ -795,6 +795,8 @@ static void ide_flush_cb(void *opaque, int ret)
>  {
>      IDEState *s = opaque;
>  
> +    s->status &= ~BUSY_STAT;
> +

This part is unnecessary, the status is already reset.

>      if (ret < 0) {
>          /* XXX: What sector number to set here? */
>          if (ide_handle_rw_error(s, -ret, BM_STATUS_RETRY_FLUSH)) {
> @@ -814,6 +816,7 @@ void ide_flush_cache(IDEState *s)
>          return;
>      }
>  
> +    s->status |= BUSY_STAT;
>      bdrv_acct_start(s->bs, &s->acct, 0, BDRV_ACCT_FLUSH);
>      bdrv_aio_flush(s->bs, ide_flush_cb, s);
>  }

This should fix the bug, but only in a one-off way. I was planning to
fix it by setting BSY for all commands and having an explicit command
completion everywhere. This part of IDE is currently a mess.

The other reason I haven't sent a fix yet is that I don't have a test
case for it. I guess I need to extend blkdebug first before this can be
reliably tested by qtest.

Kevin

Andreas Färber May 28, 2013, 8:46 a.m. UTC | #2

On 28.05.2013 10:27, Kevin Wolf wrote:
> On 28.05.2013 at 10:18, Andreas Färber wrote:
>> The implementation of the ATA FLUSH command invokes a flush at the block
>> layer, which may on raw files on POSIX entail a synchronous fdatasync().
>> This may in some cases take so long that the SLES 11 SP1 guest driver
>> reports I/O errors and filesystems get corrupted or remounted read-only.
>>
>> Avoid this by setting BUSY_STAT, so that the guest is made aware we are
>> in the middle of an operation and no ATA commands are attempted to be
>> processed concurrently.
>>
>> Addresses BNC#637297.
>>
>> Suggested-by: Gonglei (Arei) <arei.gonglei@huawei.com>
>> Signed-off-by: Andreas Färber <afaerber@suse.de>
>> ---
>>  hw/ide/core.c | 3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/hw/ide/core.c b/hw/ide/core.c
>> index c7a8041..bf1ff18 100644
>> --- a/hw/ide/core.c
>> +++ b/hw/ide/core.c
>> @@ -795,6 +795,8 @@ static void ide_flush_cb(void *opaque, int ret)
>>  {
>>      IDEState *s = opaque;
>>  
>> +    s->status &= ~BUSY_STAT;
>> +
> 
> This part is unnecessary, the status is already reset.

Only in the ret >= 0 case though AFAICS?

>>      if (ret < 0) {
>>          /* XXX: What sector number to set here? */
>>          if (ide_handle_rw_error(s, -ret, BM_STATUS_RETRY_FLUSH)) {
>> @@ -814,6 +816,7 @@ void ide_flush_cache(IDEState *s)
>>          return;
>>      }
>>  
>> +    s->status |= BUSY_STAT;
>>      bdrv_acct_start(s->bs, &s->acct, 0, BDRV_ACCT_FLUSH);
>>      bdrv_aio_flush(s->bs, ide_flush_cb, s);
>>  }
> 
> This should fix the bug, however in an one-off way. I was planning to
> fix it by setting BSY for all commands and having an explicit command
> completion everywhere. This part is a mess currently in IDE.

That's a valid idea, but I had backporting to 0.15 in mind. ;)
And doh, I forgot qemu-stable.

> The other part why I haven't sent a fix yet is that I don't have a test
> case for it.

Temporarily add a sleep(31) in qemu_fdatasync()?

Being lazy, I tested with -snapshot so as not to corrupt my disk image;
AFAIU that does not trigger the same issue, since it is qcow2-backed.

> I guess I need to extend blkdebug first before this can be
> reliably tested by qtest.

It can't, since it's not a pure device emulation issue but depends on
the relative timing of filesystem operations and subsequent commands.

Andreas

Kevin Wolf May 28, 2013, 9:18 a.m. UTC | #3

On 28.05.2013 at 10:46, Andreas Färber wrote:
> On 28.05.2013 10:27, Kevin Wolf wrote:
> > On 28.05.2013 at 10:18, Andreas Färber wrote:
> >> The implementation of the ATA FLUSH command invokes a flush at the block
> >> layer, which may on raw files on POSIX entail a synchronous fdatasync().
> >> This may in some cases take so long that the SLES 11 SP1 guest driver
> >> reports I/O errors and filesystems get corrupted or remounted read-only.
> >>
> >> Avoid this by setting BUSY_STAT, so that the guest is made aware we are
> >> in the middle of an operation and no ATA commands are attempted to be
> >> processed concurrently.
> >>
> >> Addresses BNC#637297.
> >>
> >> Suggested-by: Gonglei (Arei) <arei.gonglei@huawei.com>
> >> Signed-off-by: Andreas Färber <afaerber@suse.de>
> >> ---
> >>  hw/ide/core.c | 3 +++
> >>  1 file changed, 3 insertions(+)
> >>
> >> diff --git a/hw/ide/core.c b/hw/ide/core.c
> >> index c7a8041..bf1ff18 100644
> >> --- a/hw/ide/core.c
> >> +++ b/hw/ide/core.c
> >> @@ -795,6 +795,8 @@ static void ide_flush_cb(void *opaque, int ret)
> >>  {
> >>      IDEState *s = opaque;
> >>  
> >> +    s->status &= ~BUSY_STAT;
> >> +
> > 
> > This part is unnecessary, the status is already reset.
> 
> Only in the ret >= 0 case though AFAICS?

ide_handle_rw_error() takes care of resetting the status as well, except
when the VM is stopped. But then it will be immediately set again when
the VM is continued and the request is restarted. So the semantic
difference is just whether BSY would be set or not when you somehow
inspect the state while the VM is stopped after an I/O error.

> >>      if (ret < 0) {
> >>          /* XXX: What sector number to set here? */
> >>          if (ide_handle_rw_error(s, -ret, BM_STATUS_RETRY_FLUSH)) {
> >> @@ -814,6 +816,7 @@ void ide_flush_cache(IDEState *s)
> >>          return;
> >>      }
> >>  
> >> +    s->status |= BUSY_STAT;
> >>      bdrv_acct_start(s->bs, &s->acct, 0, BDRV_ACCT_FLUSH);
> >>      bdrv_aio_flush(s->bs, ide_flush_cb, s);
> >>  }
> > 
> > This should fix the bug, however in an one-off way. I was planning to
> > fix it by setting BSY for all commands and having an explicit command
> > completion everywhere. This part is a mess currently in IDE.
> 
> That's a valid idea, but I had backporting to 0.15 in mind. ;)
> And doh, I forgot qemu-stable.

Fair enough, we can merge something like this first and do the real
thing on top. Though nobody will be interested in the real thing any
more, as usual... :-/

> > The other part why I haven't sent a fix yet is that I don't have a test
> > case for it.
> 
> Temporarily add a sleep(31) in qemu_fdatasync()?
> 
> I was lazy in testing with -snapshot to not corrupt my disk image, which
> would not trigger the same issue since qcow2-backed AFAIU.
> 
> > I guess I need to extend blkdebug first before this can be
> > reliably tested by qtest.
> 
> It can't, since it's not a pure device emulation issue but depends on
> the relative timing of filesystem operations and subsequent commands.

That's why you need to influence the timing. It's no excuse for
merging without a test case. If we only ever tested devices that have no
relation to the outside world, our testing would be pretty useless and
would always stay as bad as it is today in many areas.

Kevin
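
Editorial sketch: the error-path behaviour described at the top of this message can be modelled as a toy. Nothing here is QEMU code. `ToyDrive`, `ToyErrorPolicy`, `toy_handle_flush_error()` and `toy_vm_continue()` are hypothetical names, and the exact status value written when an error is reported to the guest is an assumption. The only point illustrated is the one from the message: BSY stays visible only while the VM is stopped, and the restarted request sets it again on continue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* ATA status register bits. */
#define BUSY_STAT  0x80
#define READY_STAT 0x40
#define ERR_STAT   0x01

/* Hypothetical error policies: report to the guest, or stop the VM. */
typedef enum {
    TOY_ON_ERROR_REPORT,
    TOY_ON_ERROR_STOP
} ToyErrorPolicy;

typedef struct ToyDrive {
    uint8_t status;
    bool vm_running;
    bool retry_flush; /* a flush is waiting to be restarted */
} ToyDrive;

/* Handle a failed flush according to the policy. */
void toy_handle_flush_error(ToyDrive *d, ToyErrorPolicy policy)
{
    assert(d != NULL);
    if (policy == TOY_ON_ERROR_STOP) {
        /* VM stopped: the status is left alone, so BSY remains set
         * for anyone inspecting the state while stopped. */
        d->vm_running = false;
        d->retry_flush = true;
    } else {
        /* Error reported to the guest: the assignment drops BSY.
         * The concrete value is an assumption of this sketch. */
        d->status = READY_STAT | ERR_STAT;
    }
}

/* Continue the VM and restart a parked flush, which raises BSY again. */
void toy_vm_continue(ToyDrive *d)
{
    assert(d != NULL);
    d->vm_running = true;
    if (d->retry_flush) {
        d->retry_flush = false;
        d->status |= BUSY_STAT;
    }
}
```

In this model an explicit clear of BSY at the top of the callback would change nothing a running guest can observe, which matches the argument that the hunk is unnecessary.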

Paolo Bonzini May 28, 2013, 9:24 a.m. UTC | #4

On 28/05/2013 11:18, Kevin Wolf wrote:
>>> The other part why I haven't sent a fix yet is that I don't have a test
>>> case for it.
>>
>> Temporarily add a sleep(31) in qemu_fdatasync()?
>>
>> I was lazy in testing with -snapshot to not corrupt my disk image, which
>> would not trigger the same issue since qcow2-backed AFAIU.
>>
>>> I guess I need to extend blkdebug first before this can be
>>> reliably tested by qtest.
>>
>> It can't, since it's not a pure device emulation issue but depends on
>> the relative timing of filesystem operations and subsequent commands.
> 
> That's why you need to take influence on the timing. It's no excuse for
> merging without a test case. If we only ever tested devices that have no
> relation to the outside world, our testing would be pretty useless and
> always stay as bad as it is today in many areas.

I don't think the qtest would be timing dependent.  The Linux testcase
is timing dependent, but for the qtest all you need to check is "is BUSY
set during a flush?".  This can be done with blkdebug suspend/resume,
except that there is no way to call bdrv_debug_resume from QEMU.

Paolo
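
Editorial sketch: the check proposed above ("is BUSY set during a flush?") does not depend on wall-clock timing once the backend can be suspended and resumed on demand. The following toy shows the shape of such a check. It does not use the qtest or blkdebug APIs; `ToyDisk`, `toy_flush()`, `toy_backend_resume()` and `toy_busy_while_flush_suspended()` are hypothetical names, and the suspend flag merely stands in for a blkdebug-style breakpoint.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* ATA status register bits. */
#define BUSY_STAT  0x80
#define READY_STAT 0x40
#define SEEK_STAT  0x10

typedef struct ToyDisk {
    uint8_t status;
    bool backend_suspended; /* stand-in for a suspended request */
    bool flush_parked;      /* a flush is held back by the backend */
} ToyDisk;

/* Flush completion: the device becomes ready again. */
void toy_flush_done(ToyDisk *d)
{
    d->status = READY_STAT | SEEK_STAT;
}

/* Issue a flush. BSY is raised before the backend is involved;
 * a suspended backend parks the request instead of completing it. */
void toy_flush(ToyDisk *d)
{
    assert(d != NULL);
    d->status |= BUSY_STAT;
    if (d->backend_suspended) {
        d->flush_parked = true;
        return;
    }
    toy_flush_done(d);
}

/* Resume the backend and complete a parked flush, if any. */
void toy_backend_resume(ToyDisk *d)
{
    assert(d != NULL);
    d->backend_suspended = false;
    if (d->flush_parked) {
        d->flush_parked = false;
        toy_flush_done(d);
    }
}

/* The check itself: suspend, flush, sample BSY, resume, sample again.
 * Returns true if BSY was set mid-flush and cleared afterwards. */
bool toy_busy_while_flush_suspended(ToyDisk *d)
{
    bool busy_during;
    bool busy_after;

    assert(d != NULL);
    d->backend_suspended = true;
    toy_flush(d);
    busy_during = (d->status & BUSY_STAT) != 0;
    toy_backend_resume(d);
    busy_after = (d->status & BUSY_STAT) != 0;
    return busy_during && !busy_after;
}
```

Removing the `d->status |= BUSY_STAT;` line from `toy_flush()` makes the check return false, which is the regression such a test would be meant to catch.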

Kevin Wolf May 28, 2013, 9:36 a.m. UTC | #5

On 28.05.2013 at 11:24, Paolo Bonzini wrote:
> On 28/05/2013 11:18, Kevin Wolf wrote:
> >>> The other part why I haven't sent a fix yet is that I don't have a test
> >>> case for it.
> >>
> >> Temporarily add a sleep(31) in qemu_fdatasync()?
> >>
> >> I was lazy in testing with -snapshot to not corrupt my disk image, which
> >> would not trigger the same issue since qcow2-backed AFAIU.
> >>
> >>> I guess I need to extend blkdebug first before this can be
> >>> reliably tested by qtest.
> >>
> >> It can't, since it's not a pure device emulation issue but depends on
> >> the relative timing of filesystem operations and subsequent commands.
> > 
> > That's why you need to take influence on the timing. It's no excuse for
> > merging without a test case. If we only ever tested devices that have no
> > relation to the outside world, our testing would be pretty useless and
> > always stay as bad as it is today in many areas.
> 
> I don't think the qtest would be timing dependent.  The Linux testcase
> is timing dependent, but for the qtest all you need to check is "is BUSY
> set during a flush?".  This can be done with blkdebug suspend/resume,
> except that there is no way to call bdrv_debug_resume from QEMU.

That's exactly what I was talking about: suspending a request influences
its timing. I'm looking into this right now. (And it's not just resume;
bdrv_debug_suspend can't be called from QEMU either.)

In fact, I'm checking whether we can have a monitor command to issue
qemu-io commands, which will be more generally useful for test cases. We
just need to make it obvious that it doesn't become an ABI. Maybe prefix
it with "__org.qemu.debug-" or something like that.

Kevin

Paolo Bonzini May 28, 2013, 9:48 a.m. UTC | #6

On 28/05/2013 11:36, Kevin Wolf wrote:
> On 28.05.2013 at 11:24, Paolo Bonzini wrote:
>> On 28/05/2013 11:18, Kevin Wolf wrote:
>>>>> The other part why I haven't sent a fix yet is that I don't have a test
>>>>> case for it.
>>>>
>>>> Temporarily add a sleep(31) in qemu_fdatasync()?
>>>>
>>>> I was lazy in testing with -snapshot to not corrupt my disk image, which
>>>> would not trigger the same issue since qcow2-backed AFAIU.
>>>>
>>>>> I guess I need to extend blkdebug first before this can be
>>>>> reliably tested by qtest.
>>>>
>>>> It can't, since it's not a pure device emulation issue but depends on
>>>> the relative timing of filesystem operations and subsequent commands.
>>>
>>> That's why you need to take influence on the timing. It's no excuse for
>>> merging without a test case. If we only ever tested devices that have no
>>> relation to the outside world, our testing would be pretty useless and
>>> always stay as bad as it is today in many areas.
>>
>> I don't think the qtest would be timing dependent.  The Linux testcase
>> is timing dependent, but for the qtest all you need to check is "is BUSY
>> set during a flush?".  This can be done with blkdebug suspend/resume,
>> except that there is no way to call bdrv_debug_resume from QEMU.
> 
> That's exactly what I was talking about, suspending a request is taking
> influence on its timing. I'm looking into this right now. (And it's not
> just resume, bdrv_debug_suspend can't be called from QEMU either)

It can be called from the rules file though, can't it?

> In fact, I'm checking whether we can have a monitor command to issue
> qemu-io commands, which will be more generally useful for test cases. We
> just need to make obvious that it doesn't become an ABI. Maybe prefix it
> with "__org.qemu.debug-" or something like that.

Makes sense.  I'm not sure why you'd want to read or write from
testcases, but bdrv_drain(_all) can also be useful from testcases.

Paolo

Kevin Wolf May 28, 2013, 9:59 a.m. UTC | #7

On 28.05.2013 at 11:48, Paolo Bonzini wrote:
> On 28/05/2013 11:36, Kevin Wolf wrote:
> > On 28.05.2013 at 11:24, Paolo Bonzini wrote:
> >> On 28/05/2013 11:18, Kevin Wolf wrote:
> >>>>> The other part why I haven't sent a fix yet is that I don't have a test
> >>>>> case for it.
> >>>>
> >>>> Temporarily add a sleep(31) in qemu_fdatasync()?
> >>>>
> >>>> I was lazy in testing with -snapshot to not corrupt my disk image, which
> >>>> would not trigger the same issue since qcow2-backed AFAIU.
> >>>>
> >>>>> I guess I need to extend blkdebug first before this can be
> >>>>> reliably tested by qtest.
> >>>>
> >>>> It can't, since it's not a pure device emulation issue but depends on
> >>>> the relative timing of filesystem operations and subsequent commands.
> >>>
> >>> That's why you need to take influence on the timing. It's no excuse for
> >>> merging without a test case. If we only ever tested devices that have no
> >>> relation to the outside world, our testing would be pretty useless and
> >>> always stay as bad as it is today in many areas.
> >>
> >> I don't think the qtest would be timing dependent.  The Linux testcase
> >> is timing dependent, but for the qtest all you need to check is "is BUSY
> >> set during a flush?".  This can be done with blkdebug suspend/resume,
> >> except that there is no way to call bdrv_debug_resume from QEMU.
> > 
> > That's exactly what I was talking about, suspending a request is taking
> > influence on its timing. I'm looking into this right now. (And it's not
> > just resume, bdrv_debug_suspend can't be called from QEMU either)
> 
> It can be called from the rules file though, can't it?

No, you can only define ACTION_INJECT_ERROR and ACTION_SET_STATE from
the config file, but not ACTION_SUSPEND. Maybe we should add it, but it
would still require a manual resume.

So far all test cases suspend requests with explicit qemu-io commands.

> > In fact, I'm checking whether we can have a monitor command to issue
> > qemu-io commands, which will be more generally useful for test cases. We
> > just need to make obvious that it doesn't become an ABI. Maybe prefix it
> > with "__org.qemu.debug-" or something like that.
> 
> Makes sense.  I'm not sure why you'd want to read or write from
> testcases, but bdrv_drain(_all) can also be useful from testcases.

I imagine writing could be very useful for block job test cases.

Kevin
