diff mbox series

[V4] file-posix: allow -EBUSY error during ioctl(fd, BLKZEROOUT, range) on block

Message ID 2021032217253258728710@chinatelecom.cn
State New
Headers show
Series [V4] file-posix: allow -EBUSY error during ioctl(fd, BLKZEROOUT, range) on block | expand

Commit Message

ChangLimin March 22, 2021, 9:25 a.m. UTC
For Linux 5.10/5.11, qemu write zeros to a multipath device using
ioctl(fd, BLKZEROOUT, range) with cache none or directsync return -EBUSY
permanently. Fallback to pwritev instead of exit for -EBUSY error.

The issue was introduced in Linux 5.10:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=384d87ef2c954fc58e6c5fd8253e4a1984f5fe02

Fixed in Linux 5.12:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=56887cffe946bb0a90c74429fa94d6110a73119d

Signed-off-by: ChangLimin <changlm@chinatelecom.cn>
---
 block/file-posix.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

--
2.27.0

Comments

John Snow March 22, 2021, 5:50 p.m. UTC | #1
On 3/22/21 5:25 AM, ChangLimin wrote:
> For Linux 5.10/5.11, qemu write zeros to a multipath device using
> ioctl(fd, BLKZEROOUT, range) with cache none or directsync return -EBUSY
> permanently. Fallback to pwritev instead of exit for -EBUSY error.
> 
> The issue was introduced in Linux 5.10:
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=384d87ef2c954fc58e6c5fd8253e4a1984f5fe02
> 
> Fixed in Linux 5.12:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=56887cffe946bb0a90c74429fa94d6110a73119d
> 
> Signed-off-by: ChangLimin <changlm@chinatelecom.cn>

To be clear, when I asked "When do we get -EINVAL?" it wasn't because I 
doubted that we would ever get it, I was just unclear of the 
circumstances in which we might receive EINVAL and was hoping you would 
explain it to me.

> ---
>   block/file-posix.c | 8 ++++++--
>   1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/block/file-posix.c b/block/file-posix.c
> index 20e14f8e96..d4054ac9cb 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -1624,8 +1624,12 @@ static ssize_t 
> handle_aiocb_write_zeroes_block(RawPosixAIOData *aiocb)
>           } while (errno == EINTR);
> 
>           ret = translate_err(-errno);
> -        if (ret == -ENOTSUP) {
> -            s->has_write_zeroes = false;
> +        switch (ret) {
> +        case -ENOTSUP:
> +            s->has_write_zeroes = false; /* fall through */
> +        case -EBUSY: /* Linux 5.10/5.11 may return -EBUSY for multipath 
> devices */
> +            return -ENOTSUP;
> +            break;

What effect does this have, now?

We'll return ENOTSUP but we won't disable trying it again in the future, 
is that right?

Kevin, is this what you had in mind?

--js

>           }
>       }
>   #endif
> --
> 2.27.0
>
Max Reitz March 24, 2021, 2:50 p.m. UTC | #2
On 22.03.21 10:25, ChangLimin wrote:
> For Linux 5.10/5.11, qemu write zeros to a multipath device using
> ioctl(fd, BLKZEROOUT, range) with cache none or directsync return -EBUSY
> permanently.

So as far as I can track back the discussion, Kevin asked on v1 why we’d 
set has_write_zeroes to false, i.e. whether the EBUSY might not go away 
at some point, and if it did, whether we shouldn’t retry BLKZEROOUT then.
You haven’t explicitly replied to that question (as far as I can see), 
so it kind of still stands.

Implicitly, there are two conflicting answers in this patch: On one 
hand, the commit message says “permanently”, and this is what you told 
Nir as a realistic case where this can occur.  So that to me implies 
that we actually should not retry BLKZEROOUT, because the EBUSY will 
remain, and that condition won’t change while the block device is in use 
by qemu.

On the other hand, in the code, you have decided not to reset 
has_write_zeroes to false, so the implementation will retry.

So I don’t quite understand.  Should we keep trying BLKZEROOUT or is 
there no chance of it working after it has at one point failed with 
EBUSY?  (Are there other cases besides what’s described in this commit 
message where EBUSY might be returned and it is only temporary?)

> Fallback to pwritev instead of exit for -EBUSY error.
> 
> The issue was introduced in Linux 5.10:
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=384d87ef2c954fc58e6c5fd8253e4a1984f5fe02
> 
> Fixed in Linux 5.12:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=56887cffe946bb0a90c74429fa94d6110a73119d
> 
> Signed-off-by: ChangLimin <changlm@chinatelecom.cn>
> ---
>   block/file-posix.c | 8 ++++++--
>   1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/block/file-posix.c b/block/file-posix.c
> index 20e14f8e96..d4054ac9cb 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -1624,8 +1624,12 @@ static ssize_t 
> handle_aiocb_write_zeroes_block(RawPosixAIOData *aiocb)
>           } while (errno == EINTR);
> 
>           ret = translate_err(-errno);
> -        if (ret == -ENOTSUP) {
> -            s->has_write_zeroes = false;
> +        switch (ret) {
> +        case -ENOTSUP:
> +            s->has_write_zeroes = false; /* fall through */
> +        case -EBUSY: /* Linux 5.10/5.11 may return -EBUSY for multipath 
> devices */
> +            return -ENOTSUP;
> +            break;

(Not sure why this break is here.)

Max

>           }
>       }
>   #endif
> --
> 2.27.0
>
Nir Soffer March 24, 2021, 3:49 p.m. UTC | #3
On Wed, Mar 24, 2021 at 4:52 PM Max Reitz <mreitz@redhat.com> wrote:

> On 22.03.21 10:25, ChangLimin wrote:
> > For Linux 5.10/5.11, qemu write zeros to a multipath device using
> > ioctl(fd, BLKZEROOUT, range) with cache none or directsync return -EBUSY
> > permanently.
>
> So as far as I can track back the discussion, Kevin asked on v1 why we’d
> set has_write_zeroes to false, i.e. whether the EBUSY might not go away
> at some point, and if it did, whether we shouldn’t retry BLKZEROOUT then.
> You haven’t explicitly replied to that question (as far as I can see),
> so it kind of still stands.
>
> Implicitly, there are two conflicting answers in this patch: On one
> hand, the commit message says “permanently”, and this is what you told
> Nir as a realistic case where this can occur.


I'm afraid ChangLimin did not answer my question. I'm looking for real
world used case when qemu cannot write zeros to multipath device, when
nobody else is using the device.

I tried to reproduce this on Fedora (kernel 5.10) with qemu-img convert,
once with a multipath device, and once with logical volume on a vg created
on the multipath device, and I could not reproduce this issue.

If I understand the kernel change correctly, this can happen when there is
a mounted file system on top of the multipath device. I don't think we have
a use case when qemu accesses a multipath device when the device is used
by a file system, but maybe I missed something.


> So that to me implies
> that we actually should not retry BLKZEROOUT, because the EBUSY will
> remain, and that condition won’t change while the block device is in use
> by qemu.
>
> On the other hand, in the code, you have decided not to reset
> has_write_zeroes to false, so the implementation will retry.
>

EBUSY is usually a temporary error, so retrying makes sense. The question
is if we really can write zeroes manually in this case?


> So I don’t quite understand.  Should we keep trying BLKZEROOUT or is
> there no chance of it working after it has at one point failed with
> EBUSY?  (Are there other cases besides what’s described in this commit
> message where EBUSY might be returned and it is only temporary?)
>
> > Fallback to pwritev instead of exit for -EBUSY error.
> >
> > The issue was introduced in Linux 5.10:
> >
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=384d87ef2c954fc58e6c5fd8253e4a1984f5fe02
> >
> > Fixed in Linux 5.12:
> >
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=56887cffe946bb0a90c74429fa94d6110a73119d
> >
> > Signed-off-by: ChangLimin <changlm@chinatelecom.cn>
> > ---
> >   block/file-posix.c | 8 ++++++--
> >   1 file changed, 6 insertions(+), 2 deletions(-)
> >
> > diff --git a/block/file-posix.c b/block/file-posix.c
> > index 20e14f8e96..d4054ac9cb 100644
> > --- a/block/file-posix.c
> > +++ b/block/file-posix.c
> > @@ -1624,8 +1624,12 @@ static ssize_t
> > handle_aiocb_write_zeroes_block(RawPosixAIOData *aiocb)
> >           } while (errno == EINTR);
> >
> >           ret = translate_err(-errno);
> > -        if (ret == -ENOTSUP) {
> > -            s->has_write_zeroes = false;
> > +        switch (ret) {
> > +        case -ENOTSUP:
> > +            s->has_write_zeroes = false; /* fall through */
> > +        case -EBUSY: /* Linux 5.10/5.11 may return -EBUSY for multipath
> > devices */
> > +            return -ENOTSUP;
> > +            break;
>
> (Not sure why this break is here.)
>
> Max
>
> >           }
> >       }
> >   #endif
> > --
> > 2.27.0
> >
>
>
>
ChangLimin March 25, 2021, 6:06 a.m. UTC | #4
>On Wed, Mar 24, 2021 at 4:52 PM Max Reitz <mreitz@redhat.com> wrote:
>On 22.03.21 10:25, ChangLimin wrote:
>> For Linux 5.10/5.11, qemu write zeros to a multipath device using
>> ioctl(fd, BLKZEROOUT, range) with cache none or directsync return -EBUSY
>> permanently.
>
>So as far as I can track back the discussion, Kevin asked on v1 why we’d 
>set has_write_zeroes to false, i.e. whether the EBUSY might not go away 
>at some point, and if it did, whether we shouldn’t retry BLKZEROOUT then.
>You haven’t explicitly replied to that question (as far as I can see), 
>so it kind of still stands.
>
>Implicitly, there are two conflicting answers in this patch: On one 
>hand, the commit message says “permanently”, and this is what you told 
>Nir as a realistic case where this can occur. 

For Linux 5.10/5.11, the EBUSY is permanently, the reproduce step is below. 
For other Linux version, the EBUSY may be temporary. 
Because  Linux 5.10/5.11 is not used widely, so do not set has_write_zeroes to false.

>I'm afraid ChangLimin did not answer my question. I'm looking for real
>world used case when qemu cannot write zeros to multipath device, when
>nobody else is using the device.
>
>I tried to reproduce this on Fedora (kernel 5.10) with qemu-img convert,
>once with a multipath device, and once with logical volume on a vg created
>on the multipath device, and I could not reproduce this issue.

The following is steps to reproduct the issue on Fedora 34.

# uname -a
Linux fedora-34 5.11.3-300.fc34.x86_64 #1 SMP Thu Mar 4 19:03:18 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

# qemu-img -V
qemu-img version 5.2.0 (qemu-5.2.0-5.fc34.1)

1.  Login in an ISCSI LUN created using targetcli on ubuntu 20.04
# iscsiadm -m discovery -t st -p 192.169.1.109
192.169.1.109:3260,1 iqn.2003-01.org.linux-iscsi:lio-lv100

# iscsiadm -m node -l -T iqn.2003-01.org.linux-iscsi:lio-lv100
# iscsiadm -m session
tcp: [1] 192.169.1.109:3260,1 iqn.2003-01.org.linux-iscsi:lio-lv100 (non-flash)

2. start multipathd service
# mpathconf --enable
# systemctl start multipathd

3.  add multipath path
# multipath -a `/lib/udev/scsi_id -g /dev/sdb`   # sdb means the ISCSI LUN
wwid '36001405b76856e4816b48b99c6a77de3' added

# multipathd add path /dev/sdb
ok

# multipath -ll     # /dev/dm-1 is the multipath device based on /dev/sdb
mpatha (36001405bebfc3a0522541cda30220db9) dm-1 LIO-ORG,lv102
size=1.0G features='0' hwhandler='1 alua' wp=rw
`-+- policy='service-time 0' prio=50 status=active
  `- 5:0:0:0  sdd  8:48   active ready running

4. qemu-img return EBUSY both to dm-1 and sdb
# wget http://download.cirros-cloud.net/0.4.0/cirros-0.4.0-x86_64-disk.img
# qemu-img convert -O raw -t none cirros-0.4.0-x86_64-disk.img /dev/dm-1
qemu-img: error while writing at byte 0: Device or resource busy

# qemu-img convert -O raw -t none cirros-0.4.0-x86_64-disk.img /dev/sdb
qemu-img: error while writing at byte 0: Device or resource busy

5. blkdiscard also return EBUSY  both to dm-1 and sdb
# blkdiscard -o 0 -l 4096 /dev/dm-1
blkdiscard: cannot open /dev/dm-1: Device or resource busy

# blkdiscard -o 0 -l 4096 /dev/sdb
blkdiscard: cannot open /dev/sdb: No such file or directory

6. dd write zero is good, because it does not use blkdiscard
# dd if=/dev/zero of=/dev/dm-1 bs=1M count=100 oflag=direct 
100+0 records in
100+0 records out
104857600 bytes (105 MB, 100 MiB) copied, 2.33623 s, 44.9 MB/s

7. The LUN should support blkdiscard feature, otherwise it will not write zero 
with  ioctl(fd, BLKZEROOUT, range) 

>If I understand the kernel change correctly, this can happen when there is
>a mounted file system on top of the multipath device. I don't think we have 
>a use case when qemu accesses a multipath device when the device is used
>by a file system, but maybe I missed something.
> 
>So that to me implies 
>that we actually should not retry BLKZEROOUT, because the EBUSY will 
>remain, and that condition won’t change while the block device is in use 
>by qemu.
>
>On the other hand, in the code, you have decided not to reset 
>has_write_zeroes to false, so the implementation will retry.
>
>EBUSY is usually a temporary error, so retrying makes sense. The question
>is if we really can write zeroes manually in this case?
> 
>So I don’t quite understand.  Should we keep trying BLKZEROOUT or is 
>there no chance of it working after it has at one point failed with 
>EBUSY?  (Are there other cases besides what’s described in this commit 
>message where EBUSY might be returned and it is only temporary?)
>
>> Fallback to pwritev instead of exit for -EBUSY error.
>> 
>> The issue was introduced in Linux 5.10:
>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=384d87ef2c954fc58e6c5fd8253e4a1984f5fe02
>> 
>> Fixed in Linux 5.12:
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=56887cffe946bb0a90c74429fa94d6110a73119d
>> 
>> Signed-off-by: ChangLimin <changlm@chinatelecom.cn>
>> ---
>>   block/file-posix.c | 8 ++++++--
>>   1 file changed, 6 insertions(+), 2 deletions(-)
>> 
>> diff --git a/block/file-posix.c b/block/file-posix.c
>> index 20e14f8e96..d4054ac9cb 100644
>> --- a/block/file-posix.c
>> +++ b/block/file-posix.c
>> @@ -1624,8 +1624,12 @@ static ssize_t 
>> handle_aiocb_write_zeroes_block(RawPosixAIOData *aiocb)
>>           } while (errno == EINTR);
>> 
>>           ret = translate_err(-errno);
>> -        if (ret == -ENOTSUP) {
>> -            s->has_write_zeroes = false;
>> +        switch (ret) {
>> +        case -ENOTSUP:
>> +            s->has_write_zeroes = false; /* fall through */
>> +        case -EBUSY: /* Linux 5.10/5.11 may return -EBUSY for multipath 
>> devices */
>> +            return -ENOTSUP;
>> +            break;
>
>(Not sure why this break is here.)
>
>Max
>
>>           }
>>       }
>>   #endif
>> --
>> 2.27.0
>>
Nir Soffer March 25, 2021, 3:48 p.m. UTC | #5
On Thu, Mar 25, 2021 at 8:07 AM ChangLimin <changlm@chinatelecom.cn> wrote:

> >On Wed, Mar 24, 2021 at 4:52 PM Max Reitz <mreitz@redhat.com> wrote:
> >On 22.03.21 10:25, ChangLimin wrote:
> >> For Linux 5.10/5.11, qemu write zeros to a multipath device using
> >> ioctl(fd, BLKZEROOUT, range) with cache none or directsync return -EBUSY
> >> permanently.
> >
> >So as far as I can track back the discussion, Kevin asked on v1 why we’d
> >set has_write_zeroes to false, i.e. whether the EBUSY might not go away
> >at some point, and if it did, whether we shouldn’t retry BLKZEROOUT then.
> >You haven’t explicitly replied to that question (as far as I can see),
> >so it kind of still stands.
> >
> >Implicitly, there are two conflicting answers in this patch: On one
> >hand, the commit message says “permanently”, and this is what you told
> >Nir as a realistic case where this can occur.
>
> For Linux 5.10/5.11, the EBUSY is permanently, the reproduce step is
> below.
> For other Linux version, the EBUSY may be temporary.
> Because  Linux 5.10/5.11 is not used widely, so do not set has_write_zeroes
> to false.
>
> >I'm afraid ChangLimin did not answer my question. I'm looking for real
> >world used case when qemu cannot write zeros to multipath device, when
> >nobody else is using the device.
> >
> >I tried to reproduce this on Fedora (kernel 5.10) with qemu-img convert,
> >once with a multipath device, and once with logical volume on a vg created
> >on the multipath device, and I could not reproduce this issue.
>
> The following is steps to reproduct the issue on Fedora 34.
>
> # uname -a
> Linux fedora-34 5.11.3-300.fc34.x86_64 #1 SMP Thu Mar 4 19:03:18 UTC 2021
> x86_64 x86_64 x86_64 GNU/Linux
>

Is this the most recent kernel? I have 5.11.7 in fedora 32.


>
> # qemu-img -V
> qemu-img version 5.2.0 (qemu-5.2.0-5.fc34.1)
>
> 1.  Login in an ISCSI LUN created using targetcli on ubuntu 20.04
> # iscsiadm -m discovery -t st -p 192.169.1.109
> 192.169.1.109:3260,1 iqn.2003-01.org.linux-iscsi:lio-lv100
>
> # iscsiadm -m node -l -T iqn.2003-01.org.linux-iscsi:lio-lv100
> # iscsiadm -m session
> tcp: [1] 192.169.1.109:3260,1 iqn.2003-01.org.linux-iscsi:lio-lv100
> (non-flash)
>
> 2. start multipathd service
> # mpathconf --enable
> # systemctl start multipathd
>
> 3.  add multipath path
> # multipath -a `/lib/udev/scsi_id -g /dev/sdb`   # sdb means the ISCSI LUN
> wwid '36001405b76856e4816b48b99c6a77de3' added
>
> # multipathd add path /dev/sdb
> ok
>
> # multipath -ll     # /dev/dm-1 is the multipath device based on /dev/sdb
> mpatha (36001405bebfc3a0522541cda30220db9) dm-1 LIO-ORG,lv102
> size=1.0G features='0' hwhandler='1 alua' wp=rw
> `-+- policy='service-time 0' prio=50 status=active
>   `- 5:0:0:0  sdd  8:48   active ready running
>

You are using user_friendly_names which is (sadly) the default.
But I don't think it should matter.

4. qemu-img return EBUSY both to dm-1 and sdb
> # wget http://download.cirros-cloud.net/0.4.0/cirros-0.4.0-x86_64-disk.img
> # qemu-img convert -O raw -t none cirros-0.4.0-x86_64-disk.img /dev/dm-1
> qemu-img: error while writing at byte 0: Device or resource busy
>
> # qemu-img convert -O raw -t none cirros-0.4.0-x86_64-disk.img /dev/sdb
> qemu-img: error while writing at byte 0: Device or resource busy
>
> 5. blkdiscard also return EBUSY  both to dm-1 and sdb
> # blkdiscard -o 0 -l 4096 /dev/dm-1
> blkdiscard: cannot open /dev/dm-1: Device or resource busy
>
> # blkdiscard -o 0 -l 4096 /dev/sdb
> blkdiscard: cannot open /dev/sdb: No such file or directory
>
> 6. dd write zero is good, because it does not use blkdiscard
> # dd if=/dev/zero of=/dev/dm-1 bs=1M count=100 oflag=direct
> 100+0 records in
> 100+0 records out
> 104857600 bytes (105 MB, 100 MiB) copied, 2.33623 s, 44.9 MB/s
>
> 7. The LUN should support blkdiscard feature, otherwise it will not write
> zero
> with  ioctl(fd, BLKZEROOUT, range)
>

Thanks!

I could not reproduce this with kernel 5.10, but now I'm no 5.11:
# uname -r
5.11.7-100.fc32.x86_64

# qemu-img --version
qemu-img version 5.2.0 (qemu-5.2.0-6.fc32.1)
Copyright (c) 2003-2020 Fabrice Bellard and the QEMU Project developers

# cat /etc/multipath.conf
defaults {
user_friendly_names no
find_multipaths no
}

blacklist_exceptions {
        property "(SCSI_IDENT_|ID_WWN)"
}

blacklist {
}

# multipath -ll 36001405e884ab8ff4b44fdba6901099c
36001405e884ab8ff4b44fdba6901099c dm-8 LIO-ORG,3-09
size=6.0G features='0' hwhandler='1 alua' wp=rw
`-+- policy='service-time 0' prio=50 status=active
  `- 1:0:0:9 sdk     8:160 active ready running

$ lsblk /dev/sdk
NAME                                MAJ:MIN RM SIZE RO TYPE  MOUNTPOINT
sdk                                   8:160  0   6G  0 disk
└─36001405e884ab8ff4b44fdba6901099c 253:13   0   6G  0 mpath

$ virt-builder fedora-32 -o disk.img
[   2.9] Downloading: http://builder.libguestfs.org/fedora-32.xz
[   3.8] Planning how to build this image
[   3.8] Uncompressing
[  11.1] Opening the new disk
[  16.1] Setting a random seed
[  16.1] Setting passwords
virt-builder: Setting random password of root to TcikQqRxAaIqS1kF
[  17.0] Finishing off
                   Output file: disk.img
                   Output size: 6.0G
                 Output format: raw
            Total usable space: 5.4G
                    Free space: 4.0G (74%)

$ qemu-img info disk.img
image: disk.img
file format: raw
virtual size: 6 GiB (6442450944 bytes)
disk size: 1.29 GiB

# qemu-img convert -p -f raw -O raw -t none -T none disk.img
/dev/mapper/36001405e884ab8ff4b44fdba6901099c
    (100.00/100%)

Works.

# lsblk /dev/sdk
NAME                                   MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINT
sdk                                      8:160  0    6G  0 disk
└─36001405e884ab8ff4b44fdba6901099c    253:13   0    6G  0 mpath
  ├─36001405e884ab8ff4b44fdba6901099c1 253:14   0    1M  0 part
  ├─36001405e884ab8ff4b44fdba6901099c2 253:15   0    1G  0 part
  ├─36001405e884ab8ff4b44fdba6901099c3 253:16   0  615M  0 part
  └─36001405e884ab8ff4b44fdba6901099c4 253:17   0  4.4G  0 part

# qemu-img convert -p -f raw -O raw -t none -T none disk.img
/dev/mapper/36001405e884ab8ff4b44fdba6901099c
    (100.00/100%)

Works, wiping the partitions.

# mount /dev/mapper/36001405e884ab8ff4b44fdba6901099c4 /tmp/mnt

# mount | grep /dev/mapper/36001405e884ab8ff4b44fdba6901099c4
/dev/mapper/36001405e884ab8ff4b44fdba6901099c4 on /tmp/mnt type xfs
(rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota)

This is invalid use, converting to device with mounted file system, but
let's try:

# qemu-img convert -p -f raw -O raw -t none -T none disk.img
/dev/mapper/36001405e884ab8ff4b44fdba6901099c
    (100.00/100%)

Still works?!

# wipefs -a /dev/mapper/36001405e884ab8ff4b44fdba6901099c
wipefs: error: /dev/mapper/36001405e884ab8ff4b44fdba6901099c: probing
initialization failed: Device or resource busy

# blkdiscard /dev/mapper/36001405e884ab8ff4b44fdba6901099c

Works.

This the configuration of the LUN on the server side:

   {

      "aio": false,

      "alua_tpgs": [

        {

          "alua_access_state": 0,

          "alua_access_status": 0,

          "alua_access_type": 3,

          "alua_support_active_nonoptimized": 1,

          "alua_support_active_optimized": 1,

          "alua_support_offline": 1,

          "alua_support_standby": 1,

          "alua_support_transitioning": 1,

          "alua_support_unavailable": 1,

          "alua_write_metadata": 0,

          "implicit_trans_secs": 0,

          "name": "default_tg_pt_gp",

          "nonop_delay_msecs": 100,

          "preferred": 0,

          "tg_pt_gp_id": 0,

          "trans_delay_msecs": 0

        }

      ],

      "attributes": {

        "block_size": 512,

        "emulate_3pc": 1,

        "emulate_caw": 1,

        "emulate_dpo": 1,

        "emulate_fua_read": 1,

        "emulate_fua_write": 1,

        "emulate_model_alias": 1,

        "emulate_pr": 1,

        "emulate_rest_reord": 0,

        "emulate_tas": 1,

        "emulate_tpu": 1,

        "emulate_tpws": 1,

        "emulate_ua_intlck_ctrl": 0,

        "emulate_write_cache": 1,

        "enforce_pr_isids": 1,

        "force_pr_aptpl": 0,

        "is_nonrot": 0,

        "max_unmap_block_desc_count": 1,

        "max_unmap_lba_count": 8192,

        "max_write_same_len": 65335,

        "optimal_sectors": 16384,

        "pi_prot_format": 0,

        "pi_prot_type": 0,

        "pi_prot_verify": 0,

        "queue_depth": 128,

        "unmap_granularity": 1,

        "unmap_granularity_alignment": 0,

        "unmap_zeroes_data": 0

      },

      "dev": "/target/3/09",

      "name": "3-09",

      "plugin": "fileio",

      "size": 6442450944,

      "write_back": true,

      "wwn": "e884ab8f-f4b4-4fdb-a690-1099c072c86d"

    },


Maybe this upstream change is not in all downstream 5.11 kernels, or 5.11.7
already includes the fix?

Adding Ben, maybe he had more insight on the multipath side.

>If I understand the kernel change correctly, this can happen when there is
> >a mounted file system on top of the multipath device. I don't think we
> have
> >a use case when qemu accesses a multipath device when the device is used
> >by a file system, but maybe I missed something.
> >
> >So that to me implies
> >that we actually should not retry BLKZEROOUT, because the EBUSY will
> >remain, and that condition won’t change while the block device is in use
> >by qemu.
> >
> >On the other hand, in the code, you have decided not to reset
> >has_write_zeroes to false, so the implementation will retry.
> >
> >EBUSY is usually a temporary error, so retrying makes sense. The question
> >is if we really can write zeroes manually in this case?
> >
> >So I don’t quite understand.  Should we keep trying BLKZEROOUT or is
> >there no chance of it working after it has at one point failed with
> >EBUSY?  (Are there other cases besides what’s described in this commit
> >message where EBUSY might be returned and it is only temporary?)
> >
> >> Fallback to pwritev instead of exit for -EBUSY error.
> >>
> >> The issue was introduced in Linux 5.10:
> >>
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=384d87ef2c954fc58e6c5fd8253e4a1984f5fe02
> >>
> >> Fixed in Linux 5.12:
> >>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=56887cffe946bb0a90c74429fa94d6110a73119d
> >>
> >> Signed-off-by: ChangLimin <changlm@chinatelecom.cn>
> >> ---
> >>   block/file-posix.c | 8 ++++++--
> >>   1 file changed, 6 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/block/file-posix.c b/block/file-posix.c
> >> index 20e14f8e96..d4054ac9cb 100644
> >> --- a/block/file-posix.c
> >> +++ b/block/file-posix.c
> >> @@ -1624,8 +1624,12 @@ static ssize_t
> >> handle_aiocb_write_zeroes_block(RawPosixAIOData *aiocb)
> >>           } while (errno == EINTR);
> >>
> >>           ret = translate_err(-errno);
> >> -        if (ret == -ENOTSUP) {
> >> -            s->has_write_zeroes = false;
> >> +        switch (ret) {
> >> +        case -ENOTSUP:
> >> +            s->has_write_zeroes = false; /* fall through */
> >> +        case -EBUSY: /* Linux 5.10/5.11 may return -EBUSY for
> multipath
> >> devices */
> >> +            return -ENOTSUP;
> >> +            break;
> >
> >(Not sure why this break is here.)
> >
> >Max
> >
> >>           }
> >>       }
> >>   #endif
> >> --
> >> 2.27.0
> >>
>
>
ChangLimin March 26, 2021, 12:20 a.m. UTC | #6
>On Thu, Mar 25, 2021 at 8:07 AM ChangLimin <changlm@chinatelecom.cn> wrote:
>>On Wed, Mar 24, 2021 at 4:52 PM Max Reitz <mreitz@redhat.com> wrote:
>>On 22.03.21 10:25, ChangLimin wrote:
>>> For Linux 5.10/5.11, qemu write zeros to a multipath device using
>>> ioctl(fd, BLKZEROOUT, range) with cache none or directsync return -EBUSY
>>> permanently.
>>
>>So as far as I can track back the discussion, Kevin asked on v1 why we’d 
>>set has_write_zeroes to false, i.e. whether the EBUSY might not go away 
>>at some point, and if it did, whether we shouldn’t retry BLKZEROOUT then.
>>You haven’t explicitly replied to that question (as far as I can see), 
>>so it kind of still stands.
>>
>>Implicitly, there are two conflicting answers in this patch: On one 
>>hand, the commit message says “permanently”, and this is what you told 
>>Nir as a realistic case where this can occur. 
>
>For Linux 5.10/5.11, the EBUSY is permanently, the reproduce step is below. 
>For other Linux version, the EBUSY may be temporary. 
>Because  Linux 5.10/5.11 is not used widely, so do not set has_write_zeroes to false.
>
>>I'm afraid ChangLimin did not answer my question. I'm looking for real
>>world used case when qemu cannot write zeros to multipath device, when
>>nobody else is using the device.
>>
>>I tried to reproduce this on Fedora (kernel 5.10) with qemu-img convert,
>>once with a multipath device, and once with logical volume on a vg created
>>on the multipath device, and I could not reproduce this issue.
>
>The following is steps to reproduct the issue on Fedora 34.
>
># uname -a
>Linux fedora-34 5.11.3-300.fc34.x86_64 #1 SMP Thu Mar 4 19:03:18 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
>
>Is this the most recent kernel? I have 5.11.7 in fedora 32.
> 
>
># qemu-img -V
>qemu-img version 5.2.0 (qemu-5.2.0-5.fc34.1)
>
>1.  Login in an ISCSI LUN created using targetcli on ubuntu 20.04
># iscsiadm -m discovery -t st -p 192.169.1.109
>192.169.1.109:3260,1 iqn.2003-01.org.linux-iscsi:lio-lv100
>
># iscsiadm -m node -l -T iqn.2003-01.org.linux-iscsi:lio-lv100
># iscsiadm -m session
>tcp: [1] 192.169.1.109:3260,1 iqn.2003-01.org.linux-iscsi:lio-lv100 (non-flash)
>
>2. start multipathd service
># mpathconf --enable
># systemctl start multipathd
>
>3.  add multipath path
># multipath -a `/lib/udev/scsi_id -g /dev/sdb`   # sdb means the ISCSI LUN
>wwid '36001405b76856e4816b48b99c6a77de3' added
>
># multipathd add path /dev/sdb
>ok
>
># multipath -ll     # /dev/dm-1 is the multipath device based on /dev/sdb
>mpatha (36001405bebfc3a0522541cda30220db9) dm-1 LIO-ORG,lv102
>size=1.0G features='0' hwhandler='1 alua' wp=rw
>`-+- policy='service-time 0' prio=50 status=active
>  `- 5:0:0:0  sdd  8:48   active ready running
>
>You are using user_friendly_names which is (sadly) the default.
>But I don't think it should matter.
>
>4. qemu-img return EBUSY both to dm-1 and sdb
># wget http://download.cirros-cloud.net/0.4.0/cirros-0.4.0-x86_64-disk.img
># qemu-img convert -O raw -t none cirros-0.4.0-x86_64-disk.img /dev/dm-1
>qemu-img: error while writing at byte 0: Device or resource busy
>
># qemu-img convert -O raw -t none cirros-0.4.0-x86_64-disk.img /dev/sdb
>qemu-img: error while writing at byte 0: Device or resource busy
>
>5. blkdiscard also return EBUSY  both to dm-1 and sdb
># blkdiscard -o 0 -l 4096 /dev/dm-1
>blkdiscard: cannot open /dev/dm-1: Device or resource busy
>
># blkdiscard -o 0 -l 4096 /dev/sdb
>blkdiscard: cannot open /dev/sdb: No such file or directory
>
>6. dd write zero is good, because it does not use blkdiscard
># dd if=/dev/zero of=/dev/dm-1 bs=1M count=100 oflag=direct 
>100+0 records in
>100+0 records out
>104857600 bytes (105 MB, 100 MiB) copied, 2.33623 s, 44.9 MB/s
>
>7. The LUN should support blkdiscard feature, otherwise it will not write zero 
>with  ioctl(fd, BLKZEROOUT, range) 
>
>Thanks!
>
>I could not reproduce this with kernel 5.10, but now I'm no 5.11:
># uname -r
>5.11.7-100.fc32.x86_64
>
># qemu-img --version
>qemu-img version 5.2.0 (qemu-5.2.0-6.fc32.1)
>Copyright (c) 2003-2020 Fabrice Bellard and the QEMU Project developers
>
># cat /etc/multipath.conf
>defaults {
>user_friendly_names no
>find_multipaths no
>}
>
>blacklist_exceptions {
>        property "(SCSI_IDENT_|ID_WWN)"
>}
>
>blacklist {
>}
>
># multipath -ll 36001405e884ab8ff4b44fdba6901099c
>36001405e884ab8ff4b44fdba6901099c dm-8 LIO-ORG,3-09
>size=6.0G features='0' hwhandler='1 alua' wp=rw
>`-+- policy='service-time 0' prio=50 status=active
>  `- 1:0:0:9 sdk     8:160 active ready running
>
>$ lsblk /dev/sdk
>NAME                                MAJ:MIN RM SIZE RO TYPE  MOUNTPOINT
>sdk                                   8:160  0   6G  0 disk  
>└─36001405e884ab8ff4b44fdba6901099c 253:13   0   6G  0 mpath 
>
>$ virt-builder fedora-32 -o disk.img
>[   2.9] Downloading: http://builder.libguestfs.org/fedora-32.xz
>[   3.8] Planning how to build this image
>[   3.8] Uncompressing
>[  11.1] Opening the new disk
>[  16.1] Setting a random seed
>[  16.1] Setting passwords
>virt-builder: Setting random password of root to TcikQqRxAaIqS1kF
>[  17.0] Finishing off
>                   Output file: disk.img
>                   Output size: 6.0G
>                 Output format: raw
>            Total usable space: 5.4G
>                    Free space: 4.0G (74%)
>
>$ qemu-img info disk.img
>image: disk.img
>file format: raw
>virtual size: 6 GiB (6442450944 bytes)
>disk size: 1.29 GiB
>
># qemu-img convert -p -f raw -O raw -t none -T none disk.img /dev/mapper/36001405e884ab8ff4b44fdba6901099c
>    (100.00/100%)
>
>Works.
>
># lsblk /dev/sdk
>NAME                                   MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINT
>sdk                                      8:160  0    6G  0 disk  
>└─36001405e884ab8ff4b44fdba6901099c    253:13   0    6G  0 mpath 
>  ├─36001405e884ab8ff4b44fdba6901099c1 253:14   0    1M  0 part  
>  ├─36001405e884ab8ff4b44fdba6901099c2 253:15   0    1G  0 part  
>  ├─36001405e884ab8ff4b44fdba6901099c3 253:16   0  615M  0 part  
>  └─36001405e884ab8ff4b44fdba6901099c4 253:17   0  4.4G  0 part  
>
># qemu-img convert -p -f raw -O raw -t none -T none disk.img /dev/mapper/36001405e884ab8ff4b44fdba6901099c
>    (100.00/100%)
>
>Works, wiping the partitions.
>
># mount /dev/mapper/36001405e884ab8ff4b44fdba6901099c4 /tmp/mnt
>
># mount | grep /dev/mapper/36001405e884ab8ff4b44fdba6901099c4
>/dev/mapper/36001405e884ab8ff4b44fdba6901099c4 on /tmp/mnt type xfs (rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota)
>
>This is invalid use, converting to device with mounted file system, but let's try:
>
># qemu-img convert -p -f raw -O raw -t none -T none disk.img /dev/mapper/36001405e884ab8ff4b44fdba6901099c
>    (100.00/100%)
>
>Still works?!
>
># wipefs -a /dev/mapper/36001405e884ab8ff4b44fdba6901099c
>wipefs: error: /dev/mapper/36001405e884ab8ff4b44fdba6901099c: probing initialization failed: Device or resource busy
>
># blkdiscard /dev/mapper/36001405e884ab8ff4b44fdba6901099c
>
>Works.
>
>This the configuration of the LUN on the server side:
>
>   {                                                                                                                         
>      "aio": false,                                                                                                           
>      "alua_tpgs": [                                                                                                          
>        {                                                                                                                     
>          "alua_access_state": 0,                                                                                             
>          "alua_access_status": 0,                                                                                            
>          "alua_access_type": 3,                                                                                              
>          "alua_support_active_nonoptimized": 1,                                                                              
>          "alua_support_active_optimized": 1,                                                                                 
>          "alua_support_offline": 1,                                                                                          
>          "alua_support_standby": 1,                                                                                          
>          "alua_support_transitioning": 1,                                                                                    
>          "alua_support_unavailable": 1,                                                                                      
>          "alua_write_metadata": 0,                                                                                           
>          "implicit_trans_secs": 0,                                                                                           
>          "name": "default_tg_pt_gp",                                                                                         
>          "nonop_delay_msecs": 100,                                                                                           
>          "preferred": 0,                                                                                                     
>          "tg_pt_gp_id": 0,                                                                                                   
>          "trans_delay_msecs": 0                                                                                              
>        }                                                                                                                     
>      ],                                                                                                                      
>      "attributes": {                                                                                                         
>        "block_size": 512,                                                                                                    
>        "emulate_3pc": 1,                                                                                                     
>        "emulate_caw": 1,                                                                                                     
>        "emulate_dpo": 1,                                                                                                     
>        "emulate_fua_read": 1,                                                                                                
>        "emulate_fua_write": 1,                                                                                               
>        "emulate_model_alias": 1,                                                                                             
>        "emulate_pr": 1,                                                                                                      
>        "emulate_rest_reord": 0,                                                                                              
>        "emulate_tas": 1,                                                                                                     
>        "emulate_tpu": 1,                                                                                                     
>        "emulate_tpws": 1,                                                                                                    
>        "emulate_ua_intlck_ctrl": 0,                                                                                          
>        "emulate_write_cache": 1,                                                                                             
>        "enforce_pr_isids": 1,                                                                                                
>        "force_pr_aptpl": 0,                                                                                                  
>        "is_nonrot": 0,                                                                                                       
>        "max_unmap_block_desc_count": 1,                                                                                      
>        "max_unmap_lba_count": 8192,                                                                                          
>        "max_write_same_len": 65335,                                                                                          
>        "optimal_sectors": 16384,                                                                                             
>        "pi_prot_format": 0,                                                                                                  
>        "pi_prot_type": 0,                                                                                                    
>        "pi_prot_verify": 0,                                                                                                  
>        "queue_depth": 128,                                                                                                   
>        "unmap_granularity": 1,                                                                                               
>        "unmap_granularity_alignment": 0,                                                                                     
>        "unmap_zeroes_data": 0                                                                                                
>      },                                                                                                                      
>      "dev": "/target/3/09",                                                                                                  
>      "name": "3-09",                                                                                                         
>      "plugin": "fileio",                                                                                                     
>      "size": 6442450944,                                                                                                     
>      "write_back": true,                                                                                                     
>      "wwn": "e884ab8f-f4b4-4fdb-a690-1099c072c86d"                                                                           
>    },                                                                                                                        
>
>Maybe this upstream change is not in all downstream 5.11 kernels, or 5.11.7
>already includes the fix?

Linux 5.10.24/5.11.7 already merged the fix on 2021-03-17 by Greg Kroah-Hartman.

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.11.y&id=7e0815797656f029fab2edc309406cddf931b9d8
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.10.y&id=d44c9780ed40db88626c9354868eab72159c7a7f

# git describe d44c9780ed40db88626c9354868eab72159c7a7f
v5.10.23-184-gd44c9780ed40

# git describe 7e0815797656f029fab2edc309406cddf931b9d8
v5.11.6-178-g7e0815797656

>Adding Ben, maybe he had more insight on the multipath side.
>
>>If I understand the kernel change correctly, this can happen when there is
>>a mounted file system on top of the multipath device. I don't think we have 
>>a use case when qemu accesses a multipath device when the device is used
>>by a file system, but maybe I missed something.
>> 
>>So that to me implies 
>>that we actually should not retry BLKZEROOUT, because the EBUSY will 
>>remain, and that condition won’t change while the block device is in use 
>>by qemu.
>>
>>On the other hand, in the code, you have decided not to reset 
>>has_write_zeroes to false, so the implementation will retry.
>>
>>EBUSY is usually a temporary error, so retrying makes sense. The question
>>is if we really can write zeroes manually in this case?
>> 
>>So I don’t quite understand.  Should we keep trying BLKZEROOUT or is 
>>there no chance of it working after it has at one point failed with 
>>EBUSY?  (Are there other cases besides what’s described in this commit 
>>message where EBUSY might be returned and it is only temporary?)
>>
>>> Fallback to pwritev instead of exit for -EBUSY error.
>>> 
>>> The issue was introduced in Linux 5.10:
>>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=384d87ef2c954fc58e6c5fd8253e4a1984f5fe02
>>> 
>>> Fixed in Linux 5.12:
>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=56887cffe946bb0a90c74429fa94d6110a73119d
>>> 
>>> Signed-off-by: ChangLimin <changlm@chinatelecom.cn>
>>> ---
>>>   block/file-posix.c | 8 ++++++--
>>>   1 file changed, 6 insertions(+), 2 deletions(-)
>>> 
>>> diff --git a/block/file-posix.c b/block/file-posix.c
>>> index 20e14f8e96..d4054ac9cb 100644
>>> --- a/block/file-posix.c
>>> +++ b/block/file-posix.c
>>> @@ -1624,8 +1624,12 @@ static ssize_t 
>>> handle_aiocb_write_zeroes_block(RawPosixAIOData *aiocb)
>>>           } while (errno == EINTR);
>>> 
>>>           ret = translate_err(-errno);
>>> -        if (ret == -ENOTSUP) {
>>> -            s->has_write_zeroes = false;
>>> +        switch (ret) {
>>> +        case -ENOTSUP:
>>> +            s->has_write_zeroes = false; /* fall through */
>>> +        case -EBUSY: /* Linux 5.10/5.11 may return -EBUSY for multipath 
>>> devices */
>>> +            return -ENOTSUP;
>>> +            break;
>>
>>(Not sure why this break is here.)
>>
>>Max
>>
>>>           }
>>>       }
>>>   #endif
>>> --
>>> 2.27.0
>>>
Nir Soffer March 26, 2021, 7:39 p.m. UTC | #7
On Fri, Mar 26, 2021 at 3:21 AM ChangLimin <changlm@chinatelecom.cn> wrote:

> >On Thu, Mar 25, 2021 at 8:07 AM ChangLimin <changlm@chinatelecom.cn>
> wrote:
> >>On Wed, Mar 24, 2021 at 4:52 PM Max Reitz <mreitz@redhat.com> wrote:
> >>On 22.03.21 10:25, ChangLimin wrote:
> >>> For Linux 5.10/5.11, qemu write zeros to a multipath device using
> >>> ioctl(fd, BLKZEROOUT, range) with cache none or directsync return
> -EBUSY
> >>> permanently.
> >>
> >>So as far as I can track back the discussion, Kevin asked on v1 why we’d
> >>set has_write_zeroes to false, i.e. whether the EBUSY might not go away
> >>at some point, and if it did, whether we shouldn’t retry BLKZEROOUT then.
> >>You haven’t explicitly replied to that question (as far as I can see),
> >>so it kind of still stands.
> >>
> >>Implicitly, there are two conflicting answers in this patch: On one
> >>hand, the commit message says “permanently”, and this is what you told
> >>Nir as a realistic case where this can occur.
> >
> >For Linux 5.10/5.11, the EBUSY is permanently, the reproduce step is
> below.
> >For other Linux version, the EBUSY may be temporary.
> >Because  Linux 5.10/5.11 is not used widely, so do not set
> has_write_zeroes to false.
> >
> >>I'm afraid ChangLimin did not answer my question. I'm looking for real
> >>world used case when qemu cannot write zeros to multipath device, when
> >>nobody else is using the device.
> >>
> >>I tried to reproduce this on Fedora (kernel 5.10) with qemu-img convert,
> >>once with a multipath device, and once with logical volume on a vg
> created
> >>on the multipath device, and I could not reproduce this issue.
> >
> >The following is steps to reproduct the issue on Fedora 34.
> >
> ># uname -a
> >Linux fedora-34 5.11.3-300.fc34.x86_64 #1 SMP Thu Mar 4 19:03:18 UTC 2021
> x86_64 x86_64 x86_64 GNU/Linux
> >
> >Is this the most recent kernel? I have 5.11.7 in fedora 32.
> >
> >
> ># qemu-img -V
> >qemu-img version 5.2.0 (qemu-5.2.0-5.fc34.1)
> >
> >1.  Login in an ISCSI LUN created using targetcli on ubuntu 20.04
> ># iscsiadm -m discovery -t st -p 192.169.1.109
> >192.169.1.109:3260,1 iqn.2003-01.org.linux-iscsi:lio-lv100
> >
> ># iscsiadm -m node -l -T iqn.2003-01.org.linux-iscsi:lio-lv100
> ># iscsiadm -m session
> >tcp: [1] 192.169.1.109:3260,1 iqn.2003-01.org.linux-iscsi:lio-lv100
> (non-flash)
> >
> >2. start multipathd service
> ># mpathconf --enable
> ># systemctl start multipathd
> >
> >3.  add multipath path
> ># multipath -a `/lib/udev/scsi_id -g /dev/sdb`   # sdb means the ISCSI LUN
> >wwid '36001405b76856e4816b48b99c6a77de3' added
> >
> ># multipathd add path /dev/sdb
> >ok
> >
> ># multipath -ll     # /dev/dm-1 is the multipath device based on /dev/sdb
> >mpatha (36001405bebfc3a0522541cda30220db9) dm-1 LIO-ORG,lv102
> >size=1.0G features='0' hwhandler='1 alua' wp=rw
> >`-+- policy='service-time 0' prio=50 status=active
> >  `- 5:0:0:0  sdd  8:48   active ready running
> >
> >You are using user_friendly_names which is (sadly) the default.
> >But I don't think it should matter.
> >
> >4. qemu-img return EBUSY both to dm-1 and sdb
> ># wget
> http://download.cirros-cloud.net/0.4.0/cirros-0.4.0-x86_64-disk.img
> ># qemu-img convert -O raw -t none cirros-0.4.0-x86_64-disk.img /dev/dm-1
> >qemu-img: error while writing at byte 0: Device or resource busy
> >
> ># qemu-img convert -O raw -t none cirros-0.4.0-x86_64-disk.img /dev/sdb
> >qemu-img: error while writing at byte 0: Device or resource busy
> >
> >5. blkdiscard also return EBUSY  both to dm-1 and sdb
> ># blkdiscard -o 0 -l 4096 /dev/dm-1
> >blkdiscard: cannot open /dev/dm-1: Device or resource busy
> >
> ># blkdiscard -o 0 -l 4096 /dev/sdb
> >blkdiscard: cannot open /dev/sdb: No such file or directory
> >
> >6. dd write zero is good, because it does not use blkdiscard
> ># dd if=/dev/zero of=/dev/dm-1 bs=1M count=100 oflag=direct
> >100+0 records in
> >100+0 records out
> >104857600 bytes (105 MB, 100 MiB) copied, 2.33623 s, 44.9 MB/s
> >
> >7. The LUN should support blkdiscard feature, otherwise it will not write
> zero
> >with  ioctl(fd, BLKZEROOUT, range)
> >
> >Thanks!
> >
> >I could not reproduce this with kernel 5.10, but now I'm no 5.11:
> ># uname -r
> >5.11.7-100.fc32.x86_64
> >
> ># qemu-img --version
> >qemu-img version 5.2.0 (qemu-5.2.0-6.fc32.1)
> >Copyright (c) 2003-2020 Fabrice Bellard and the QEMU Project developers
> >
> ># cat /etc/multipath.conf
> >defaults {
> >user_friendly_names no
> >find_multipaths no
> >}
> >
> >blacklist_exceptions {
> >        property "(SCSI_IDENT_|ID_WWN)"
> >}
> >
> >blacklist {
> >}
> >
> ># multipath -ll 36001405e884ab8ff4b44fdba6901099c
> >36001405e884ab8ff4b44fdba6901099c dm-8 LIO-ORG,3-09
> >size=6.0G features='0' hwhandler='1 alua' wp=rw
> >`-+- policy='service-time 0' prio=50 status=active
> >  `- 1:0:0:9 sdk     8:160 active ready running
> >
> >$ lsblk /dev/sdk
> >NAME                                MAJ:MIN RM SIZE RO TYPE  MOUNTPOINT
> >sdk                                   8:160  0   6G  0 disk
> >└─36001405e884ab8ff4b44fdba6901099c 253:13   0   6G  0 mpath
> >
> >$ virt-builder fedora-32 -o disk.img
> >[   2.9] Downloading: http://builder.libguestfs.org/fedora-32.xz
> >[   3.8] Planning how to build this image
> >[   3.8] Uncompressing
> >[  11.1] Opening the new disk
> >[  16.1] Setting a random seed
> >[  16.1] Setting passwords
> >virt-builder: Setting random password of root to TcikQqRxAaIqS1kF
> >[  17.0] Finishing off
> >                   Output file: disk.img
> >                   Output size: 6.0G
> >                 Output format: raw
> >            Total usable space: 5.4G
> >                    Free space: 4.0G (74%)
> >
> >$ qemu-img info disk.img
> >image: disk.img
> >file format: raw
> >virtual size: 6 GiB (6442450944 bytes)
> >disk size: 1.29 GiB
> >
> ># qemu-img convert -p -f raw -O raw -t none -T none disk.img
> /dev/mapper/36001405e884ab8ff4b44fdba6901099c
> >    (100.00/100%)
> >
> >Works.
> >
> ># lsblk /dev/sdk
> >NAME                                   MAJ:MIN RM  SIZE RO TYPE
>  MOUNTPOINT
> >sdk                                      8:160  0    6G  0 disk
> >└─36001405e884ab8ff4b44fdba6901099c    253:13   0    6G  0 mpath
> >  ├─36001405e884ab8ff4b44fdba6901099c1 253:14   0    1M  0 part
> >  ├─36001405e884ab8ff4b44fdba6901099c2 253:15   0    1G  0 part
> >  ├─36001405e884ab8ff4b44fdba6901099c3 253:16   0  615M  0 part
> >  └─36001405e884ab8ff4b44fdba6901099c4 253:17   0  4.4G  0 part
> >
> ># qemu-img convert -p -f raw -O raw -t none -T none disk.img
> /dev/mapper/36001405e884ab8ff4b44fdba6901099c
> >    (100.00/100%)
> >
> >Works, wiping the partitions.
> >
> ># mount /dev/mapper/36001405e884ab8ff4b44fdba6901099c4 /tmp/mnt
> >
> ># mount | grep /dev/mapper/36001405e884ab8ff4b44fdba6901099c4
> >/dev/mapper/36001405e884ab8ff4b44fdba6901099c4 on /tmp/mnt type xfs
> (rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota)
> >
> >This is invalid use, converting to device with mounted file system, but
> let's try:
> >
> ># qemu-img convert -p -f raw -O raw -t none -T none disk.img
> /dev/mapper/36001405e884ab8ff4b44fdba6901099c
> >    (100.00/100%)
> >
> >Still works?!
> >
> ># wipefs -a /dev/mapper/36001405e884ab8ff4b44fdba6901099c
> >wipefs: error: /dev/mapper/36001405e884ab8ff4b44fdba6901099c: probing
> initialization failed: Device or resource busy
> >
> ># blkdiscard /dev/mapper/36001405e884ab8ff4b44fdba6901099c
> >
> >Works.
> >
> >This the configuration of the LUN on the server side:
> >
> >   {
>
> >      "aio": false,
>
> >      "alua_tpgs": [
>
> >        {
>
> >          "alua_access_state": 0,
>
> >          "alua_access_status": 0,
>
> >          "alua_access_type": 3,
>
> >          "alua_support_active_nonoptimized": 1,
>
> >          "alua_support_active_optimized": 1,
>
> >          "alua_support_offline": 1,
>
> >          "alua_support_standby": 1,
>
> >          "alua_support_transitioning": 1,
>
> >          "alua_support_unavailable": 1,
>
> >          "alua_write_metadata": 0,
>
> >          "implicit_trans_secs": 0,
>
> >          "name": "default_tg_pt_gp",
>
> >          "nonop_delay_msecs": 100,
>
> >          "preferred": 0,
>
> >          "tg_pt_gp_id": 0,
>
> >          "trans_delay_msecs": 0
>
> >        }
>
> >      ],
>
> >      "attributes": {
>
> >        "block_size": 512,
>
> >        "emulate_3pc": 1,
>
> >        "emulate_caw": 1,
>
> >        "emulate_dpo": 1,
>
> >        "emulate_fua_read": 1,
>
> >        "emulate_fua_write": 1,
>
> >        "emulate_model_alias": 1,
>
> >        "emulate_pr": 1,
>
> >        "emulate_rest_reord": 0,
>
> >        "emulate_tas": 1,
>
> >        "emulate_tpu": 1,
>
> >        "emulate_tpws": 1,
>
> >        "emulate_ua_intlck_ctrl": 0,
>
> >        "emulate_write_cache": 1,
>
> >        "enforce_pr_isids": 1,
>
> >        "force_pr_aptpl": 0,
>
> >        "is_nonrot": 0,
>
> >        "max_unmap_block_desc_count": 1,
>
> >        "max_unmap_lba_count": 8192,
>
> >        "max_write_same_len": 65335,
>
> >        "optimal_sectors": 16384,
>
> >        "pi_prot_format": 0,
>
> >        "pi_prot_type": 0,
>
> >        "pi_prot_verify": 0,
>
> >        "queue_depth": 128,
>
> >        "unmap_granularity": 1,
>
> >        "unmap_granularity_alignment": 0,
>
> >        "unmap_zeroes_data": 0
>
> >      },
>
> >      "dev": "/target/3/09",
>
> >      "name": "3-09",
>
> >      "plugin": "fileio",
>
> >      "size": 6442450944,
>
> >      "write_back": true,
>
> >      "wwn": "e884ab8f-f4b4-4fdb-a690-1099c072c86d"
>
> >    },
>
> >
> >Maybe this upstream change is not in all downstream 5.11 kernels, or
> 5.11.7
> >already includes the fix?
>
> Linux 5.10.24/5.11.7 already merged the fix on 2021-03-17 by Greg
> Kroah-Hartman.
>
>
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.11.y&id=7e0815797656f029fab2edc309406cddf931b9d8
>
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.10.y&id=d44c9780ed40db88626c9354868eab72159c7a7f
>
> # git describe d44c9780ed40db88626c9354868eab72159c7a7f
> v5.10.23-184-gd44c9780ed40
>
> # git describe 7e0815797656f029fab2edc309406cddf931b9d8
> v5.11.6-178-g7e0815797656
>

So this is practically fixed.

I don't think keeping a workaround for this in qemu is a good idea.

>Adding Ben, maybe he had more insight on the multipath side.
> >
> >>If I understand the kernel change correctly, this can happen when there
> is
> >>a mounted file system on top of the multipath device. I don't think we
> have
> >>a use case when qemu accesses a multipath device when the device is used
> >>by a file system, but maybe I missed something.
> >>
> >>So that to me implies
> >>that we actually should not retry BLKZEROOUT, because the EBUSY will
> >>remain, and that condition won’t change while the block device is in use
> >>by qemu.
> >>
> >>On the other hand, in the code, you have decided not to reset
> >>has_write_zeroes to false, so the implementation will retry.
> >>
> >>EBUSY is usually a temporary error, so retrying makes sense. The question
> >>is if we really can write zeroes manually in this case?
> >>
> >>So I don’t quite understand.  Should we keep trying BLKZEROOUT or is
> >>there no chance of it working after it has at one point failed with
> >>EBUSY?  (Are there other cases besides what’s described in this commit
> >>message where EBUSY might be returned and it is only temporary?)
> >>
> >>> Fallback to pwritev instead of exit for -EBUSY error.
> >>>
> >>> The issue was introduced in Linux 5.10:
> >>>
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=384d87ef2c954fc58e6c5fd8253e4a1984f5fe02
> >>>
> >>> Fixed in Linux 5.12:
> >>>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=56887cffe946bb0a90c74429fa94d6110a73119d
> >>>
> >>> Signed-off-by: ChangLimin <changlm@chinatelecom.cn>
> >>> ---
> >>>   block/file-posix.c | 8 ++++++--
> >>>   1 file changed, 6 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/block/file-posix.c b/block/file-posix.c
> >>> index 20e14f8e96..d4054ac9cb 100644
> >>> --- a/block/file-posix.c
> >>> +++ b/block/file-posix.c
> >>> @@ -1624,8 +1624,12 @@ static ssize_t
> >>> handle_aiocb_write_zeroes_block(RawPosixAIOData *aiocb)
> >>>           } while (errno == EINTR);
> >>>
> >>>           ret = translate_err(-errno);
> >>> -        if (ret == -ENOTSUP) {
> >>> -            s->has_write_zeroes = false;
> >>> +        switch (ret) {
> >>> +        case -ENOTSUP:
> >>> +            s->has_write_zeroes = false; /* fall through */
> >>> +        case -EBUSY: /* Linux 5.10/5.11 may return -EBUSY for
> multipath
> >>> devices */
> >>> +            return -ENOTSUP;
> >>> +            break;
> >>
> >>(Not sure why this break is here.)
> >>
> >>Max
> >>
> >>>           }
> >>>       }
> >>>   #endif
> >>> --
> >>> 2.27.0
> >>>
>
>
diff mbox series

Patch

diff --git a/block/file-posix.c b/block/file-posix.c
index 20e14f8e96..d4054ac9cb 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -1624,8 +1624,12 @@  static ssize_t handle_aiocb_write_zeroes_block(RawPosixAIOData *aiocb)
         } while (errno == EINTR);

         ret = translate_err(-errno);
-        if (ret == -ENOTSUP) {
-            s->has_write_zeroes = false;
+        switch (ret) {
+        case -ENOTSUP:
+            s->has_write_zeroes = false; /* fall through */
+        case -EBUSY: /* Linux 5.10/5.11 may return -EBUSY for multipath devices */
+            return -ENOTSUP;
+            break;
         }
     }
 #endif