Message ID | 2021032217253258728710@chinatelecom.cn |
---|---|
State | New |
Headers | show |
Series | [V4] file-posix: allow -EBUSY error during ioctl(fd, BLKZEROOUT, range) on block | expand |
On 3/22/21 5:25 AM, ChangLimin wrote: > For Linux 5.10/5.11, qemu write zeros to a multipath device using > ioctl(fd, BLKZEROOUT, range) with cache none or directsync return -EBUSY > permanently. Fallback to pwritev instead of exit for -EBUSY error. > > The issue was introduced in Linux 5.10: > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=384d87ef2c954fc58e6c5fd8253e4a1984f5fe02 > > Fixed in Linux 5.12: > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=56887cffe946bb0a90c74429fa94d6110a73119d > > Signed-off-by: ChangLimin <changlm@chinatelecom.cn> To be clear, when I asked "When do we get -EINVAL?" it wasn't because I doubted that we would ever get it, I was just unclear of the circumstances in which we might receive EINVAL and was hoping you would explain it to me. > --- > block/file-posix.c | 8 ++++++-- > 1 file changed, 6 insertions(+), 2 deletions(-) > > diff --git a/block/file-posix.c b/block/file-posix.c > index 20e14f8e96..d4054ac9cb 100644 > --- a/block/file-posix.c > +++ b/block/file-posix.c > @@ -1624,8 +1624,12 @@ static ssize_t > handle_aiocb_write_zeroes_block(RawPosixAIOData *aiocb) > } while (errno == EINTR); > > ret = translate_err(-errno); > - if (ret == -ENOTSUP) { > - s->has_write_zeroes = false; > + switch (ret) { > + case -ENOTSUP: > + s->has_write_zeroes = false; /* fall through */ > + case -EBUSY: /* Linux 5.10/5.11 may return -EBUSY for multipath > devices */ > + return -ENOTSUP; > + break; What effect does this have, now? We'll return ENOTSUP but we won't disable trying it again in the future, is that right? Kevin, is this what you had in mind? --js > } > } > #endif > -- > 2.27.0 >
On 22.03.21 10:25, ChangLimin wrote: > For Linux 5.10/5.11, qemu write zeros to a multipath device using > ioctl(fd, BLKZEROOUT, range) with cache none or directsync return -EBUSY > permanently. So as far as I can track back the discussion, Kevin asked on v1 why we’d set has_write_zeroes to false, i.e. whether the EBUSY might not go away at some point, and if it did, whether we shouldn’t retry BLKZEROOUT then. You haven’t explicitly replied to that question (as far as I can see), so it kind of still stands. Implicitly, there are two conflicting answers in this patch: On one hand, the commit message says “permanently”, and this is what you told Nir as a realistic case where this can occur. So that to me implies that we actually should not retry BLKZEROOUT, because the EBUSY will remain, and that condition won’t change while the block device is in use by qemu. On the other hand, in the code, you have decided not to reset has_write_zeroes to false, so the implementation will retry. So I don’t quite understand. Should we keep trying BLKZEROOUT or is there no chance of it working after it has at one point failed with EBUSY? (Are there other cases besides what’s described in this commit message where EBUSY might be returned and it is only temporary?) > Fallback to pwritev instead of exit for -EBUSY error. > > The issue was introduced in Linux 5.10: > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=384d87ef2c954fc58e6c5fd8253e4a1984f5fe02 > > Fixed in Linux 5.12: > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=56887cffe946bb0a90c74429fa94d6110a73119d > > Signed-off-by: ChangLimin <changlm@chinatelecom.cn> > --- > block/file-posix.c | 8 ++++++-- > 1 file changed, 6 insertions(+), 2 deletions(-) > > diff --git a/block/file-posix.c b/block/file-posix.c > index 20e14f8e96..d4054ac9cb 100644 > --- a/block/file-posix.c > +++ b/block/file-posix.c > @@ -1624,8 +1624,12 @@ static ssize_t > handle_aiocb_write_zeroes_block(RawPosixAIOData *aiocb) > } while (errno == EINTR); > > ret = translate_err(-errno); > - if (ret == -ENOTSUP) { > - s->has_write_zeroes = false; > + switch (ret) { > + case -ENOTSUP: > + s->has_write_zeroes = false; /* fall through */ > + case -EBUSY: /* Linux 5.10/5.11 may return -EBUSY for multipath > devices */ > + return -ENOTSUP; > + break; (Not sure why this break is here.) Max > } > } > #endif > -- > 2.27.0 >
On Wed, Mar 24, 2021 at 4:52 PM Max Reitz <mreitz@redhat.com> wrote: > On 22.03.21 10:25, ChangLimin wrote: > > For Linux 5.10/5.11, qemu write zeros to a multipath device using > > ioctl(fd, BLKZEROOUT, range) with cache none or directsync return -EBUSY > > permanently. > > So as far as I can track back the discussion, Kevin asked on v1 why we’d > set has_write_zeroes to false, i.e. whether the EBUSY might not go away > at some point, and if it did, whether we shouldn’t retry BLKZEROOUT then. > You haven’t explicitly replied to that question (as far as I can see), > so it kind of still stands. > > Implicitly, there are two conflicting answers in this patch: On one > hand, the commit message says “permanently”, and this is what you told > Nir as a realistic case where this can occur. I'm afraid ChangLimin did not answer my question. I'm looking for real world used case when qemu cannot write zeros to multipath device, when nobody else is using the device. I tried to reproduce this on Fedora (kernel 5.10) with qemu-img convert, once with a multipath device, and once with logical volume on a vg created on the multipath device, and I could not reproduce this issue. If I understand the kernel change correctly, this can happen when there is a mounted file system on top of the multipath device. I don't think we have a use case when qemu accesses a multipath device when the device is used by a file system, but maybe I missed something. > So that to me implies > that we actually should not retry BLKZEROOUT, because the EBUSY will > remain, and that condition won’t change while the block device is in use > by qemu. > > On the other hand, in the code, you have decided not to reset > has_write_zeroes to false, so the implementation will retry. > EBUSY is usually a temporary error, so retrying makes sense. The question is if we really can write zeroes manually in this case? > So I don’t quite understand. Should we keep trying BLKZEROOUT or is > there no chance of it working after it has at one point failed with > EBUSY? (Are there other cases besides what’s described in this commit > message where EBUSY might be returned and it is only temporary?) > > > Fallback to pwritev instead of exit for -EBUSY error. > > > > The issue was introduced in Linux 5.10: > > > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=384d87ef2c954fc58e6c5fd8253e4a1984f5fe02 > > > > Fixed in Linux 5.12: > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=56887cffe946bb0a90c74429fa94d6110a73119d > > > > Signed-off-by: ChangLimin <changlm@chinatelecom.cn> > > --- > > block/file-posix.c | 8 ++++++-- > > 1 file changed, 6 insertions(+), 2 deletions(-) > > > > diff --git a/block/file-posix.c b/block/file-posix.c > > index 20e14f8e96..d4054ac9cb 100644 > > --- a/block/file-posix.c > > +++ b/block/file-posix.c > > @@ -1624,8 +1624,12 @@ static ssize_t > > handle_aiocb_write_zeroes_block(RawPosixAIOData *aiocb) > > } while (errno == EINTR); > > > > ret = translate_err(-errno); > > - if (ret == -ENOTSUP) { > > - s->has_write_zeroes = false; > > + switch (ret) { > > + case -ENOTSUP: > > + s->has_write_zeroes = false; /* fall through */ > > + case -EBUSY: /* Linux 5.10/5.11 may return -EBUSY for multipath > > devices */ > > + return -ENOTSUP; > > + break; > > (Not sure why this break is here.) > > Max > > > } > > } > > #endif > > -- > > 2.27.0 > > > > >
>On Wed, Mar 24, 2021 at 4:52 PM Max Reitz <mreitz@redhat.com> wrote: >On 22.03.21 10:25, ChangLimin wrote: >> For Linux 5.10/5.11, qemu write zeros to a multipath device using >> ioctl(fd, BLKZEROOUT, range) with cache none or directsync return -EBUSY >> permanently. > >So as far as I can track back the discussion, Kevin asked on v1 why we’d >set has_write_zeroes to false, i.e. whether the EBUSY might not go away >at some point, and if it did, whether we shouldn’t retry BLKZEROOUT then. >You haven’t explicitly replied to that question (as far as I can see), >so it kind of still stands. > >Implicitly, there are two conflicting answers in this patch: On one >hand, the commit message says “permanently”, and this is what you told >Nir as a realistic case where this can occur. For Linux 5.10/5.11, the EBUSY is permanently, the reproduce step is below. For other Linux version, the EBUSY may be temporary. Because Linux 5.10/5.11 is not used widely, so do not set has_write_zeroes to false. >I'm afraid ChangLimin did not answer my question. I'm looking for real >world used case when qemu cannot write zeros to multipath device, when >nobody else is using the device. > >I tried to reproduce this on Fedora (kernel 5.10) with qemu-img convert, >once with a multipath device, and once with logical volume on a vg created >on the multipath device, and I could not reproduce this issue. The following is steps to reproduct the issue on Fedora 34. # uname -a Linux fedora-34 5.11.3-300.fc34.x86_64 #1 SMP Thu Mar 4 19:03:18 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux # qemu-img -V qemu-img version 5.2.0 (qemu-5.2.0-5.fc34.1) 1. Login in an ISCSI LUN created using targetcli on ubuntu 20.04 # iscsiadm -m discovery -t st -p 192.169.1.109 192.169.1.109:3260,1 iqn.2003-01.org.linux-iscsi:lio-lv100 # iscsiadm -m node -l -T iqn.2003-01.org.linux-iscsi:lio-lv100 # iscsiadm -m session tcp: [1] 192.169.1.109:3260,1 iqn.2003-01.org.linux-iscsi:lio-lv100 (non-flash) 2. start multipathd service # mpathconf --enable # systemctl start multipathd 3. add multipath path # multipath -a `/lib/udev/scsi_id -g /dev/sdb` # sdb means the ISCSI LUN wwid '36001405b76856e4816b48b99c6a77de3' added # multipathd add path /dev/sdb ok # multipath -ll # /dev/dm-1 is the multipath device based on /dev/sdb mpatha (36001405bebfc3a0522541cda30220db9) dm-1 LIO-ORG,lv102 size=1.0G features='0' hwhandler='1 alua' wp=rw `-+- policy='service-time 0' prio=50 status=active `- 5:0:0:0 sdd 8:48 active ready running 4. qemu-img return EBUSY both to dm-1 and sdb # wget http://download.cirros-cloud.net/0.4.0/cirros-0.4.0-x86_64-disk.img # qemu-img convert -O raw -t none cirros-0.4.0-x86_64-disk.img /dev/dm-1 qemu-img: error while writing at byte 0: Device or resource busy # qemu-img convert -O raw -t none cirros-0.4.0-x86_64-disk.img /dev/sdb qemu-img: error while writing at byte 0: Device or resource busy 5. blkdiscard also return EBUSY both to dm-1 and sdb # blkdiscard -o 0 -l 4096 /dev/dm-1 blkdiscard: cannot open /dev/dm-1: Device or resource busy # blkdiscard -o 0 -l 4096 /dev/sdb blkdiscard: cannot open /dev/sdb: No such file or directory 6. dd write zero is good, because it does not use blkdiscard # dd if=/dev/zero of=/dev/dm-1 bs=1M count=100 oflag=direct 100+0 records in 100+0 records out 104857600 bytes (105 MB, 100 MiB) copied, 2.33623 s, 44.9 MB/s 7. The LUN should support blkdiscard feature, otherwise it will not write zero with ioctl(fd, BLKZEROOUT, range) >If I understand the kernel change correctly, this can happen when there is >a mounted file system on top of the multipath device. I don't think we have >a use case when qemu accesses a multipath device when the device is used >by a file system, but maybe I missed something. > >So that to me implies >that we actually should not retry BLKZEROOUT, because the EBUSY will >remain, and that condition won’t change while the block device is in use >by qemu. > >On the other hand, in the code, you have decided not to reset >has_write_zeroes to false, so the implementation will retry. > >EBUSY is usually a temporary error, so retrying makes sense. The question >is if we really can write zeroes manually in this case? > >So I don’t quite understand. Should we keep trying BLKZEROOUT or is >there no chance of it working after it has at one point failed with >EBUSY? (Are there other cases besides what’s described in this commit >message where EBUSY might be returned and it is only temporary?) > >> Fallback to pwritev instead of exit for -EBUSY error. >> >> The issue was introduced in Linux 5.10: >> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=384d87ef2c954fc58e6c5fd8253e4a1984f5fe02 >> >> Fixed in Linux 5.12: >> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=56887cffe946bb0a90c74429fa94d6110a73119d >> >> Signed-off-by: ChangLimin <changlm@chinatelecom.cn> >> --- >> block/file-posix.c | 8 ++++++-- >> 1 file changed, 6 insertions(+), 2 deletions(-) >> >> diff --git a/block/file-posix.c b/block/file-posix.c >> index 20e14f8e96..d4054ac9cb 100644 >> --- a/block/file-posix.c >> +++ b/block/file-posix.c >> @@ -1624,8 +1624,12 @@ static ssize_t >> handle_aiocb_write_zeroes_block(RawPosixAIOData *aiocb) >> } while (errno == EINTR); >> >> ret = translate_err(-errno); >> - if (ret == -ENOTSUP) { >> - s->has_write_zeroes = false; >> + switch (ret) { >> + case -ENOTSUP: >> + s->has_write_zeroes = false; /* fall through */ >> + case -EBUSY: /* Linux 5.10/5.11 may return -EBUSY for multipath >> devices */ >> + return -ENOTSUP; >> + break; > >(Not sure why this break is here.) > >Max > >> } >> } >> #endif >> -- >> 2.27.0 >>
On Thu, Mar 25, 2021 at 8:07 AM ChangLimin <changlm@chinatelecom.cn> wrote: > >On Wed, Mar 24, 2021 at 4:52 PM Max Reitz <mreitz@redhat.com> wrote: > >On 22.03.21 10:25, ChangLimin wrote: > >> For Linux 5.10/5.11, qemu write zeros to a multipath device using > >> ioctl(fd, BLKZEROOUT, range) with cache none or directsync return -EBUSY > >> permanently. > > > >So as far as I can track back the discussion, Kevin asked on v1 why we’d > >set has_write_zeroes to false, i.e. whether the EBUSY might not go away > >at some point, and if it did, whether we shouldn’t retry BLKZEROOUT then. > >You haven’t explicitly replied to that question (as far as I can see), > >so it kind of still stands. > > > >Implicitly, there are two conflicting answers in this patch: On one > >hand, the commit message says “permanently”, and this is what you told > >Nir as a realistic case where this can occur. > > For Linux 5.10/5.11, the EBUSY is permanently, the reproduce step is > below. > For other Linux version, the EBUSY may be temporary. > Because Linux 5.10/5.11 is not used widely, so do not set has_write_zeroes > to false. > > >I'm afraid ChangLimin did not answer my question. I'm looking for real > >world used case when qemu cannot write zeros to multipath device, when > >nobody else is using the device. > > > >I tried to reproduce this on Fedora (kernel 5.10) with qemu-img convert, > >once with a multipath device, and once with logical volume on a vg created > >on the multipath device, and I could not reproduce this issue. > > The following is steps to reproduct the issue on Fedora 34. > > # uname -a > Linux fedora-34 5.11.3-300.fc34.x86_64 #1 SMP Thu Mar 4 19:03:18 UTC 2021 > x86_64 x86_64 x86_64 GNU/Linux > Is this the most recent kernel? I have 5.11.7 in fedora 32. > > # qemu-img -V > qemu-img version 5.2.0 (qemu-5.2.0-5.fc34.1) > > 1. Login in an ISCSI LUN created using targetcli on ubuntu 20.04 > # iscsiadm -m discovery -t st -p 192.169.1.109 > 192.169.1.109:3260,1 iqn.2003-01.org.linux-iscsi:lio-lv100 > > # iscsiadm -m node -l -T iqn.2003-01.org.linux-iscsi:lio-lv100 > # iscsiadm -m session > tcp: [1] 192.169.1.109:3260,1 iqn.2003-01.org.linux-iscsi:lio-lv100 > (non-flash) > > 2. start multipathd service > # mpathconf --enable > # systemctl start multipathd > > 3. add multipath path > # multipath -a `/lib/udev/scsi_id -g /dev/sdb` # sdb means the ISCSI LUN > wwid '36001405b76856e4816b48b99c6a77de3' added > > # multipathd add path /dev/sdb > ok > > # multipath -ll # /dev/dm-1 is the multipath device based on /dev/sdb > mpatha (36001405bebfc3a0522541cda30220db9) dm-1 LIO-ORG,lv102 > size=1.0G features='0' hwhandler='1 alua' wp=rw > `-+- policy='service-time 0' prio=50 status=active > `- 5:0:0:0 sdd 8:48 active ready running > You are using user_friendly_names which is (sadly) the default. But I don't think it should matter. 4. qemu-img return EBUSY both to dm-1 and sdb > # wget http://download.cirros-cloud.net/0.4.0/cirros-0.4.0-x86_64-disk.img > # qemu-img convert -O raw -t none cirros-0.4.0-x86_64-disk.img /dev/dm-1 > qemu-img: error while writing at byte 0: Device or resource busy > > # qemu-img convert -O raw -t none cirros-0.4.0-x86_64-disk.img /dev/sdb > qemu-img: error while writing at byte 0: Device or resource busy > > 5. blkdiscard also return EBUSY both to dm-1 and sdb > # blkdiscard -o 0 -l 4096 /dev/dm-1 > blkdiscard: cannot open /dev/dm-1: Device or resource busy > > # blkdiscard -o 0 -l 4096 /dev/sdb > blkdiscard: cannot open /dev/sdb: No such file or directory > > 6. dd write zero is good, because it does not use blkdiscard > # dd if=/dev/zero of=/dev/dm-1 bs=1M count=100 oflag=direct > 100+0 records in > 100+0 records out > 104857600 bytes (105 MB, 100 MiB) copied, 2.33623 s, 44.9 MB/s > > 7. The LUN should support blkdiscard feature, otherwise it will not write > zero > with ioctl(fd, BLKZEROOUT, range) > Thanks! I could not reproduce this with kernel 5.10, but now I'm no 5.11: # uname -r 5.11.7-100.fc32.x86_64 # qemu-img --version qemu-img version 5.2.0 (qemu-5.2.0-6.fc32.1) Copyright (c) 2003-2020 Fabrice Bellard and the QEMU Project developers # cat /etc/multipath.conf defaults { user_friendly_names no find_multipaths no } blacklist_exceptions { property "(SCSI_IDENT_|ID_WWN)" } blacklist { } # multipath -ll 36001405e884ab8ff4b44fdba6901099c 36001405e884ab8ff4b44fdba6901099c dm-8 LIO-ORG,3-09 size=6.0G features='0' hwhandler='1 alua' wp=rw `-+- policy='service-time 0' prio=50 status=active `- 1:0:0:9 sdk 8:160 active ready running $ lsblk /dev/sdk NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT sdk 8:160 0 6G 0 disk └─36001405e884ab8ff4b44fdba6901099c 253:13 0 6G 0 mpath $ virt-builder fedora-32 -o disk.img [ 2.9] Downloading: http://builder.libguestfs.org/fedora-32.xz [ 3.8] Planning how to build this image [ 3.8] Uncompressing [ 11.1] Opening the new disk [ 16.1] Setting a random seed [ 16.1] Setting passwords virt-builder: Setting random password of root to TcikQqRxAaIqS1kF [ 17.0] Finishing off Output file: disk.img Output size: 6.0G Output format: raw Total usable space: 5.4G Free space: 4.0G (74%) $ qemu-img info disk.img image: disk.img file format: raw virtual size: 6 GiB (6442450944 bytes) disk size: 1.29 GiB # qemu-img convert -p -f raw -O raw -t none -T none disk.img /dev/mapper/36001405e884ab8ff4b44fdba6901099c (100.00/100%) Works. # lsblk /dev/sdk NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT sdk 8:160 0 6G 0 disk └─36001405e884ab8ff4b44fdba6901099c 253:13 0 6G 0 mpath ├─36001405e884ab8ff4b44fdba6901099c1 253:14 0 1M 0 part ├─36001405e884ab8ff4b44fdba6901099c2 253:15 0 1G 0 part ├─36001405e884ab8ff4b44fdba6901099c3 253:16 0 615M 0 part └─36001405e884ab8ff4b44fdba6901099c4 253:17 0 4.4G 0 part # qemu-img convert -p -f raw -O raw -t none -T none disk.img /dev/mapper/36001405e884ab8ff4b44fdba6901099c (100.00/100%) Works, wiping the partitions. # mount /dev/mapper/36001405e884ab8ff4b44fdba6901099c4 /tmp/mnt # mount | grep /dev/mapper/36001405e884ab8ff4b44fdba6901099c4 /dev/mapper/36001405e884ab8ff4b44fdba6901099c4 on /tmp/mnt type xfs (rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota) This is invalid use, converting to device with mounted file system, but let's try: # qemu-img convert -p -f raw -O raw -t none -T none disk.img /dev/mapper/36001405e884ab8ff4b44fdba6901099c (100.00/100%) Still works?! # wipefs -a /dev/mapper/36001405e884ab8ff4b44fdba6901099c wipefs: error: /dev/mapper/36001405e884ab8ff4b44fdba6901099c: probing initialization failed: Device or resource busy # blkdiscard /dev/mapper/36001405e884ab8ff4b44fdba6901099c Works. This the configuration of the LUN on the server side: { "aio": false, "alua_tpgs": [ { "alua_access_state": 0, "alua_access_status": 0, "alua_access_type": 3, "alua_support_active_nonoptimized": 1, "alua_support_active_optimized": 1, "alua_support_offline": 1, "alua_support_standby": 1, "alua_support_transitioning": 1, "alua_support_unavailable": 1, "alua_write_metadata": 0, "implicit_trans_secs": 0, "name": "default_tg_pt_gp", "nonop_delay_msecs": 100, "preferred": 0, "tg_pt_gp_id": 0, "trans_delay_msecs": 0 } ], "attributes": { "block_size": 512, "emulate_3pc": 1, "emulate_caw": 1, "emulate_dpo": 1, "emulate_fua_read": 1, "emulate_fua_write": 1, "emulate_model_alias": 1, "emulate_pr": 1, "emulate_rest_reord": 0, "emulate_tas": 1, "emulate_tpu": 1, "emulate_tpws": 1, "emulate_ua_intlck_ctrl": 0, "emulate_write_cache": 1, "enforce_pr_isids": 1, "force_pr_aptpl": 0, "is_nonrot": 0, "max_unmap_block_desc_count": 1, "max_unmap_lba_count": 8192, "max_write_same_len": 65335, "optimal_sectors": 16384, "pi_prot_format": 0, "pi_prot_type": 0, "pi_prot_verify": 0, "queue_depth": 128, "unmap_granularity": 1, "unmap_granularity_alignment": 0, "unmap_zeroes_data": 0 }, "dev": "/target/3/09", "name": "3-09", "plugin": "fileio", "size": 6442450944, "write_back": true, "wwn": "e884ab8f-f4b4-4fdb-a690-1099c072c86d" }, Maybe this upstream change is not in all downstream 5.11 kernels, or 5.11.7 already includes the fix? Adding Ben, maybe he had more insight on the multipath side. >If I understand the kernel change correctly, this can happen when there is > >a mounted file system on top of the multipath device. I don't think we > have > >a use case when qemu accesses a multipath device when the device is used > >by a file system, but maybe I missed something. > > > >So that to me implies > >that we actually should not retry BLKZEROOUT, because the EBUSY will > >remain, and that condition won’t change while the block device is in use > >by qemu. > > > >On the other hand, in the code, you have decided not to reset > >has_write_zeroes to false, so the implementation will retry. > > > >EBUSY is usually a temporary error, so retrying makes sense. The question > >is if we really can write zeroes manually in this case? > > > >So I don’t quite understand. Should we keep trying BLKZEROOUT or is > >there no chance of it working after it has at one point failed with > >EBUSY? (Are there other cases besides what’s described in this commit > >message where EBUSY might be returned and it is only temporary?) > > > >> Fallback to pwritev instead of exit for -EBUSY error. > >> > >> The issue was introduced in Linux 5.10: > >> > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=384d87ef2c954fc58e6c5fd8253e4a1984f5fe02 > >> > >> Fixed in Linux 5.12: > >> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=56887cffe946bb0a90c74429fa94d6110a73119d > >> > >> Signed-off-by: ChangLimin <changlm@chinatelecom.cn> > >> --- > >> block/file-posix.c | 8 ++++++-- > >> 1 file changed, 6 insertions(+), 2 deletions(-) > >> > >> diff --git a/block/file-posix.c b/block/file-posix.c > >> index 20e14f8e96..d4054ac9cb 100644 > >> --- a/block/file-posix.c > >> +++ b/block/file-posix.c > >> @@ -1624,8 +1624,12 @@ static ssize_t > >> handle_aiocb_write_zeroes_block(RawPosixAIOData *aiocb) > >> } while (errno == EINTR); > >> > >> ret = translate_err(-errno); > >> - if (ret == -ENOTSUP) { > >> - s->has_write_zeroes = false; > >> + switch (ret) { > >> + case -ENOTSUP: > >> + s->has_write_zeroes = false; /* fall through */ > >> + case -EBUSY: /* Linux 5.10/5.11 may return -EBUSY for > multipath > >> devices */ > >> + return -ENOTSUP; > >> + break; > > > >(Not sure why this break is here.) > > > >Max > > > >> } > >> } > >> #endif > >> -- > >> 2.27.0 > >> > >
>On Thu, Mar 25, 2021 at 8:07 AM ChangLimin <changlm@chinatelecom.cn> wrote: >>On Wed, Mar 24, 2021 at 4:52 PM Max Reitz <mreitz@redhat.com> wrote: >>On 22.03.21 10:25, ChangLimin wrote: >>> For Linux 5.10/5.11, qemu write zeros to a multipath device using >>> ioctl(fd, BLKZEROOUT, range) with cache none or directsync return -EBUSY >>> permanently. >> >>So as far as I can track back the discussion, Kevin asked on v1 why we’d >>set has_write_zeroes to false, i.e. whether the EBUSY might not go away >>at some point, and if it did, whether we shouldn’t retry BLKZEROOUT then. >>You haven’t explicitly replied to that question (as far as I can see), >>so it kind of still stands. >> >>Implicitly, there are two conflicting answers in this patch: On one >>hand, the commit message says “permanently”, and this is what you told >>Nir as a realistic case where this can occur. > >For Linux 5.10/5.11, the EBUSY is permanently, the reproduce step is below. >For other Linux version, the EBUSY may be temporary. >Because Linux 5.10/5.11 is not used widely, so do not set has_write_zeroes to false. > >>I'm afraid ChangLimin did not answer my question. I'm looking for real >>world used case when qemu cannot write zeros to multipath device, when >>nobody else is using the device. >> >>I tried to reproduce this on Fedora (kernel 5.10) with qemu-img convert, >>once with a multipath device, and once with logical volume on a vg created >>on the multipath device, and I could not reproduce this issue. > >The following is steps to reproduct the issue on Fedora 34. > ># uname -a >Linux fedora-34 5.11.3-300.fc34.x86_64 #1 SMP Thu Mar 4 19:03:18 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux > >Is this the most recent kernel? I have 5.11.7 in fedora 32. > > ># qemu-img -V >qemu-img version 5.2.0 (qemu-5.2.0-5.fc34.1) > >1. Login in an ISCSI LUN created using targetcli on ubuntu 20.04 ># iscsiadm -m discovery -t st -p 192.169.1.109 >192.169.1.109:3260,1 iqn.2003-01.org.linux-iscsi:lio-lv100 > ># iscsiadm -m node -l -T iqn.2003-01.org.linux-iscsi:lio-lv100 ># iscsiadm -m session >tcp: [1] 192.169.1.109:3260,1 iqn.2003-01.org.linux-iscsi:lio-lv100 (non-flash) > >2. start multipathd service ># mpathconf --enable ># systemctl start multipathd > >3. add multipath path ># multipath -a `/lib/udev/scsi_id -g /dev/sdb` # sdb means the ISCSI LUN >wwid '36001405b76856e4816b48b99c6a77de3' added > ># multipathd add path /dev/sdb >ok > ># multipath -ll # /dev/dm-1 is the multipath device based on /dev/sdb >mpatha (36001405bebfc3a0522541cda30220db9) dm-1 LIO-ORG,lv102 >size=1.0G features='0' hwhandler='1 alua' wp=rw >`-+- policy='service-time 0' prio=50 status=active > `- 5:0:0:0 sdd 8:48 active ready running > >You are using user_friendly_names which is (sadly) the default. >But I don't think it should matter. > >4. qemu-img return EBUSY both to dm-1 and sdb ># wget http://download.cirros-cloud.net/0.4.0/cirros-0.4.0-x86_64-disk.img ># qemu-img convert -O raw -t none cirros-0.4.0-x86_64-disk.img /dev/dm-1 >qemu-img: error while writing at byte 0: Device or resource busy > ># qemu-img convert -O raw -t none cirros-0.4.0-x86_64-disk.img /dev/sdb >qemu-img: error while writing at byte 0: Device or resource busy > >5. blkdiscard also return EBUSY both to dm-1 and sdb ># blkdiscard -o 0 -l 4096 /dev/dm-1 >blkdiscard: cannot open /dev/dm-1: Device or resource busy > ># blkdiscard -o 0 -l 4096 /dev/sdb >blkdiscard: cannot open /dev/sdb: No such file or directory > >6. dd write zero is good, because it does not use blkdiscard ># dd if=/dev/zero of=/dev/dm-1 bs=1M count=100 oflag=direct >100+0 records in >100+0 records out >104857600 bytes (105 MB, 100 MiB) copied, 2.33623 s, 44.9 MB/s > >7. The LUN should support blkdiscard feature, otherwise it will not write zero >with ioctl(fd, BLKZEROOUT, range) > >Thanks! > >I could not reproduce this with kernel 5.10, but now I'm no 5.11: ># uname -r >5.11.7-100.fc32.x86_64 > ># qemu-img --version >qemu-img version 5.2.0 (qemu-5.2.0-6.fc32.1) >Copyright (c) 2003-2020 Fabrice Bellard and the QEMU Project developers > ># cat /etc/multipath.conf >defaults { >user_friendly_names no >find_multipaths no >} > >blacklist_exceptions { > property "(SCSI_IDENT_|ID_WWN)" >} > >blacklist { >} > ># multipath -ll 36001405e884ab8ff4b44fdba6901099c >36001405e884ab8ff4b44fdba6901099c dm-8 LIO-ORG,3-09 >size=6.0G features='0' hwhandler='1 alua' wp=rw >`-+- policy='service-time 0' prio=50 status=active > `- 1:0:0:9 sdk 8:160 active ready running > >$ lsblk /dev/sdk >NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT >sdk 8:160 0 6G 0 disk >└─36001405e884ab8ff4b44fdba6901099c 253:13 0 6G 0 mpath > >$ virt-builder fedora-32 -o disk.img >[ 2.9] Downloading: http://builder.libguestfs.org/fedora-32.xz >[ 3.8] Planning how to build this image >[ 3.8] Uncompressing >[ 11.1] Opening the new disk >[ 16.1] Setting a random seed >[ 16.1] Setting passwords >virt-builder: Setting random password of root to TcikQqRxAaIqS1kF >[ 17.0] Finishing off > Output file: disk.img > Output size: 6.0G > Output format: raw > Total usable space: 5.4G > Free space: 4.0G (74%) > >$ qemu-img info disk.img >image: disk.img >file format: raw >virtual size: 6 GiB (6442450944 bytes) >disk size: 1.29 GiB > ># qemu-img convert -p -f raw -O raw -t none -T none disk.img /dev/mapper/36001405e884ab8ff4b44fdba6901099c > (100.00/100%) > >Works. > ># lsblk /dev/sdk >NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT >sdk 8:160 0 6G 0 disk >└─36001405e884ab8ff4b44fdba6901099c 253:13 0 6G 0 mpath > ├─36001405e884ab8ff4b44fdba6901099c1 253:14 0 1M 0 part > ├─36001405e884ab8ff4b44fdba6901099c2 253:15 0 1G 0 part > ├─36001405e884ab8ff4b44fdba6901099c3 253:16 0 615M 0 part > └─36001405e884ab8ff4b44fdba6901099c4 253:17 0 4.4G 0 part > ># qemu-img convert -p -f raw -O raw -t none -T none disk.img /dev/mapper/36001405e884ab8ff4b44fdba6901099c > (100.00/100%) > >Works, wiping the partitions. > ># mount /dev/mapper/36001405e884ab8ff4b44fdba6901099c4 /tmp/mnt > ># mount | grep /dev/mapper/36001405e884ab8ff4b44fdba6901099c4 >/dev/mapper/36001405e884ab8ff4b44fdba6901099c4 on /tmp/mnt type xfs (rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota) > >This is invalid use, converting to device with mounted file system, but let's try: > ># qemu-img convert -p -f raw -O raw -t none -T none disk.img /dev/mapper/36001405e884ab8ff4b44fdba6901099c > (100.00/100%) > >Still works?! > ># wipefs -a /dev/mapper/36001405e884ab8ff4b44fdba6901099c >wipefs: error: /dev/mapper/36001405e884ab8ff4b44fdba6901099c: probing initialization failed: Device or resource busy > ># blkdiscard /dev/mapper/36001405e884ab8ff4b44fdba6901099c > >Works. > >This the configuration of the LUN on the server side: > > { > "aio": false, > "alua_tpgs": [ > { > "alua_access_state": 0, > "alua_access_status": 0, > "alua_access_type": 3, > "alua_support_active_nonoptimized": 1, > "alua_support_active_optimized": 1, > "alua_support_offline": 1, > "alua_support_standby": 1, > "alua_support_transitioning": 1, > "alua_support_unavailable": 1, > "alua_write_metadata": 0, > "implicit_trans_secs": 0, > "name": "default_tg_pt_gp", > "nonop_delay_msecs": 100, > "preferred": 0, > "tg_pt_gp_id": 0, > "trans_delay_msecs": 0 > } > ], > "attributes": { > "block_size": 512, > "emulate_3pc": 1, > "emulate_caw": 1, > "emulate_dpo": 1, > "emulate_fua_read": 1, > "emulate_fua_write": 1, > "emulate_model_alias": 1, > "emulate_pr": 1, > "emulate_rest_reord": 0, > "emulate_tas": 1, > "emulate_tpu": 1, > "emulate_tpws": 1, > "emulate_ua_intlck_ctrl": 0, > "emulate_write_cache": 1, > "enforce_pr_isids": 1, > "force_pr_aptpl": 0, > "is_nonrot": 0, > "max_unmap_block_desc_count": 1, > "max_unmap_lba_count": 8192, > "max_write_same_len": 65335, > "optimal_sectors": 16384, > "pi_prot_format": 0, > "pi_prot_type": 0, > "pi_prot_verify": 0, > "queue_depth": 128, > "unmap_granularity": 1, > "unmap_granularity_alignment": 0, > "unmap_zeroes_data": 0 > }, > "dev": "/target/3/09", > "name": "3-09", > "plugin": "fileio", > "size": 6442450944, > "write_back": true, > "wwn": "e884ab8f-f4b4-4fdb-a690-1099c072c86d" > }, > >Maybe this upstream change is not in all downstream 5.11 kernels, or 5.11.7 >already includes the fix? Linux 5.10.24/5.11.7 already merged the fix on 2021-03-17 by Greg Kroah-Hartman. https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.11.y&id=7e0815797656f029fab2edc309406cddf931b9d8 https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.10.y&id=d44c9780ed40db88626c9354868eab72159c7a7f # git describe d44c9780ed40db88626c9354868eab72159c7a7f v5.10.23-184-gd44c9780ed40 # git describe 7e0815797656f029fab2edc309406cddf931b9d8 v5.11.6-178-g7e0815797656 >Adding Ben, maybe he had more insight on the multipath side. > >>If I understand the kernel change correctly, this can happen when there is >>a mounted file system on top of the multipath device. I don't think we have >>a use case when qemu accesses a multipath device when the device is used >>by a file system, but maybe I missed something. >> >>So that to me implies >>that we actually should not retry BLKZEROOUT, because the EBUSY will >>remain, and that condition won’t change while the block device is in use >>by qemu. >> >>On the other hand, in the code, you have decided not to reset >>has_write_zeroes to false, so the implementation will retry. >> >>EBUSY is usually a temporary error, so retrying makes sense. The question >>is if we really can write zeroes manually in this case? >> >>So I don’t quite understand. Should we keep trying BLKZEROOUT or is >>there no chance of it working after it has at one point failed with >>EBUSY? (Are there other cases besides what’s described in this commit >>message where EBUSY might be returned and it is only temporary?) >> >>> Fallback to pwritev instead of exit for -EBUSY error. >>> >>> The issue was introduced in Linux 5.10: >>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=384d87ef2c954fc58e6c5fd8253e4a1984f5fe02 >>> >>> Fixed in Linux 5.12: >>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=56887cffe946bb0a90c74429fa94d6110a73119d >>> >>> Signed-off-by: ChangLimin <changlm@chinatelecom.cn> >>> --- >>> block/file-posix.c | 8 ++++++-- >>> 1 file changed, 6 insertions(+), 2 deletions(-) >>> >>> diff --git a/block/file-posix.c b/block/file-posix.c >>> index 20e14f8e96..d4054ac9cb 100644 >>> --- a/block/file-posix.c >>> +++ b/block/file-posix.c >>> @@ -1624,8 +1624,12 @@ static ssize_t >>> handle_aiocb_write_zeroes_block(RawPosixAIOData *aiocb) >>> } while (errno == EINTR); >>> >>> ret = translate_err(-errno); >>> - if (ret == -ENOTSUP) { >>> - s->has_write_zeroes = false; >>> + switch (ret) { >>> + case -ENOTSUP: >>> + s->has_write_zeroes = false; /* fall through */ >>> + case -EBUSY: /* Linux 5.10/5.11 may return -EBUSY for multipath >>> devices */ >>> + return -ENOTSUP; >>> + break; >> >>(Not sure why this break is here.) >> >>Max >> >>> } >>> } >>> #endif >>> -- >>> 2.27.0 >>>
On Fri, Mar 26, 2021 at 3:21 AM ChangLimin <changlm@chinatelecom.cn> wrote: > >On Thu, Mar 25, 2021 at 8:07 AM ChangLimin <changlm@chinatelecom.cn> > wrote: > >>On Wed, Mar 24, 2021 at 4:52 PM Max Reitz <mreitz@redhat.com> wrote: > >>On 22.03.21 10:25, ChangLimin wrote: > >>> For Linux 5.10/5.11, qemu write zeros to a multipath device using > >>> ioctl(fd, BLKZEROOUT, range) with cache none or directsync return > -EBUSY > >>> permanently. > >> > >>So as far as I can track back the discussion, Kevin asked on v1 why we’d > >>set has_write_zeroes to false, i.e. whether the EBUSY might not go away > >>at some point, and if it did, whether we shouldn’t retry BLKZEROOUT then. > >>You haven’t explicitly replied to that question (as far as I can see), > >>so it kind of still stands. > >> > >>Implicitly, there are two conflicting answers in this patch: On one > >>hand, the commit message says “permanently”, and this is what you told > >>Nir as a realistic case where this can occur. > > > >For Linux 5.10/5.11, the EBUSY is permanently, the reproduce step is > below. > >For other Linux version, the EBUSY may be temporary. > >Because Linux 5.10/5.11 is not used widely, so do not set > has_write_zeroes to false. > > > >>I'm afraid ChangLimin did not answer my question. I'm looking for real > >>world used case when qemu cannot write zeros to multipath device, when > >>nobody else is using the device. > >> > >>I tried to reproduce this on Fedora (kernel 5.10) with qemu-img convert, > >>once with a multipath device, and once with logical volume on a vg > created > >>on the multipath device, and I could not reproduce this issue. > > > >The following is steps to reproduct the issue on Fedora 34. > > > ># uname -a > >Linux fedora-34 5.11.3-300.fc34.x86_64 #1 SMP Thu Mar 4 19:03:18 UTC 2021 > x86_64 x86_64 x86_64 GNU/Linux > > > >Is this the most recent kernel? I have 5.11.7 in fedora 32. > > > > > ># qemu-img -V > >qemu-img version 5.2.0 (qemu-5.2.0-5.fc34.1) > > > >1. Login in an ISCSI LUN created using targetcli on ubuntu 20.04 > ># iscsiadm -m discovery -t st -p 192.169.1.109 > >192.169.1.109:3260,1 iqn.2003-01.org.linux-iscsi:lio-lv100 > > > ># iscsiadm -m node -l -T iqn.2003-01.org.linux-iscsi:lio-lv100 > ># iscsiadm -m session > >tcp: [1] 192.169.1.109:3260,1 iqn.2003-01.org.linux-iscsi:lio-lv100 > (non-flash) > > > >2. start multipathd service > ># mpathconf --enable > ># systemctl start multipathd > > > >3. add multipath path > ># multipath -a `/lib/udev/scsi_id -g /dev/sdb` # sdb means the ISCSI LUN > >wwid '36001405b76856e4816b48b99c6a77de3' added > > > ># multipathd add path /dev/sdb > >ok > > > ># multipath -ll # /dev/dm-1 is the multipath device based on /dev/sdb > >mpatha (36001405bebfc3a0522541cda30220db9) dm-1 LIO-ORG,lv102 > >size=1.0G features='0' hwhandler='1 alua' wp=rw > >`-+- policy='service-time 0' prio=50 status=active > > `- 5:0:0:0 sdd 8:48 active ready running > > > >You are using user_friendly_names which is (sadly) the default. > >But I don't think it should matter. > > > >4. qemu-img return EBUSY both to dm-1 and sdb > ># wget > http://download.cirros-cloud.net/0.4.0/cirros-0.4.0-x86_64-disk.img > ># qemu-img convert -O raw -t none cirros-0.4.0-x86_64-disk.img /dev/dm-1 > >qemu-img: error while writing at byte 0: Device or resource busy > > > ># qemu-img convert -O raw -t none cirros-0.4.0-x86_64-disk.img /dev/sdb > >qemu-img: error while writing at byte 0: Device or resource busy > > > >5. blkdiscard also return EBUSY both to dm-1 and sdb > ># blkdiscard -o 0 -l 4096 /dev/dm-1 > >blkdiscard: cannot open /dev/dm-1: Device or resource busy > > > ># blkdiscard -o 0 -l 4096 /dev/sdb > >blkdiscard: cannot open /dev/sdb: No such file or directory > > > >6. dd write zero is good, because it does not use blkdiscard > ># dd if=/dev/zero of=/dev/dm-1 bs=1M count=100 oflag=direct > >100+0 records in > >100+0 records out > >104857600 bytes (105 MB, 100 MiB) copied, 2.33623 s, 44.9 MB/s > > > >7. The LUN should support blkdiscard feature, otherwise it will not write > zero > >with ioctl(fd, BLKZEROOUT, range) > > > >Thanks! > > > >I could not reproduce this with kernel 5.10, but now I'm no 5.11: > ># uname -r > >5.11.7-100.fc32.x86_64 > > > ># qemu-img --version > >qemu-img version 5.2.0 (qemu-5.2.0-6.fc32.1) > >Copyright (c) 2003-2020 Fabrice Bellard and the QEMU Project developers > > > ># cat /etc/multipath.conf > >defaults { > >user_friendly_names no > >find_multipaths no > >} > > > >blacklist_exceptions { > > property "(SCSI_IDENT_|ID_WWN)" > >} > > > >blacklist { > >} > > > ># multipath -ll 36001405e884ab8ff4b44fdba6901099c > >36001405e884ab8ff4b44fdba6901099c dm-8 LIO-ORG,3-09 > >size=6.0G features='0' hwhandler='1 alua' wp=rw > >`-+- policy='service-time 0' prio=50 status=active > > `- 1:0:0:9 sdk 8:160 active ready running > > > >$ lsblk /dev/sdk > >NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT > >sdk 8:160 0 6G 0 disk > >└─36001405e884ab8ff4b44fdba6901099c 253:13 0 6G 0 mpath > > > >$ virt-builder fedora-32 -o disk.img > >[ 2.9] Downloading: http://builder.libguestfs.org/fedora-32.xz > >[ 3.8] Planning how to build this image > >[ 3.8] Uncompressing > >[ 11.1] Opening the new disk > >[ 16.1] Setting a random seed > >[ 16.1] Setting passwords > >virt-builder: Setting random password of root to TcikQqRxAaIqS1kF > >[ 17.0] Finishing off > > Output file: disk.img > > Output size: 6.0G > > Output format: raw > > Total usable space: 5.4G > > Free space: 4.0G (74%) > > > >$ qemu-img info disk.img > >image: disk.img > >file format: raw > >virtual size: 6 GiB (6442450944 bytes) > >disk size: 1.29 GiB > > > ># qemu-img convert -p -f raw -O raw -t none -T none disk.img > /dev/mapper/36001405e884ab8ff4b44fdba6901099c > > (100.00/100%) > > > >Works. > > > ># lsblk /dev/sdk > >NAME MAJ:MIN RM SIZE RO TYPE > MOUNTPOINT > >sdk 8:160 0 6G 0 disk > >└─36001405e884ab8ff4b44fdba6901099c 253:13 0 6G 0 mpath > > ├─36001405e884ab8ff4b44fdba6901099c1 253:14 0 1M 0 part > > ├─36001405e884ab8ff4b44fdba6901099c2 253:15 0 1G 0 part > > ├─36001405e884ab8ff4b44fdba6901099c3 253:16 0 615M 0 part > > └─36001405e884ab8ff4b44fdba6901099c4 253:17 0 4.4G 0 part > > > ># qemu-img convert -p -f raw -O raw -t none -T none disk.img > /dev/mapper/36001405e884ab8ff4b44fdba6901099c > > (100.00/100%) > > > >Works, wiping the partitions. > > > ># mount /dev/mapper/36001405e884ab8ff4b44fdba6901099c4 /tmp/mnt > > > ># mount | grep /dev/mapper/36001405e884ab8ff4b44fdba6901099c4 > >/dev/mapper/36001405e884ab8ff4b44fdba6901099c4 on /tmp/mnt type xfs > (rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota) > > > >This is invalid use, converting to device with mounted file system, but > let's try: > > > ># qemu-img convert -p -f raw -O raw -t none -T none disk.img > /dev/mapper/36001405e884ab8ff4b44fdba6901099c > > (100.00/100%) > > > >Still works?! > > > ># wipefs -a /dev/mapper/36001405e884ab8ff4b44fdba6901099c > >wipefs: error: /dev/mapper/36001405e884ab8ff4b44fdba6901099c: probing > initialization failed: Device or resource busy > > > ># blkdiscard /dev/mapper/36001405e884ab8ff4b44fdba6901099c > > > >Works. > > > >This the configuration of the LUN on the server side: > > > > { > > > "aio": false, > > > "alua_tpgs": [ > > > { > > > "alua_access_state": 0, > > > "alua_access_status": 0, > > > "alua_access_type": 3, > > > "alua_support_active_nonoptimized": 1, > > > "alua_support_active_optimized": 1, > > > "alua_support_offline": 1, > > > "alua_support_standby": 1, > > > "alua_support_transitioning": 1, > > > "alua_support_unavailable": 1, > > > "alua_write_metadata": 0, > > > "implicit_trans_secs": 0, > > > "name": "default_tg_pt_gp", > > > "nonop_delay_msecs": 100, > > > "preferred": 0, > > > "tg_pt_gp_id": 0, > > > "trans_delay_msecs": 0 > > > } > > > ], > > > "attributes": { > > > "block_size": 512, > > > "emulate_3pc": 1, > > > "emulate_caw": 1, > > > "emulate_dpo": 1, > > > "emulate_fua_read": 1, > > > "emulate_fua_write": 1, > > > "emulate_model_alias": 1, > > > "emulate_pr": 1, > > > "emulate_rest_reord": 0, > > > "emulate_tas": 1, > > > "emulate_tpu": 1, > > > "emulate_tpws": 1, > > > "emulate_ua_intlck_ctrl": 0, > > > "emulate_write_cache": 1, > > > "enforce_pr_isids": 1, > > > "force_pr_aptpl": 0, > > > "is_nonrot": 0, > > > "max_unmap_block_desc_count": 1, > > > "max_unmap_lba_count": 8192, > > > "max_write_same_len": 65335, > > > "optimal_sectors": 16384, > > > "pi_prot_format": 0, > > > "pi_prot_type": 0, > > > "pi_prot_verify": 0, > > > "queue_depth": 128, > > > "unmap_granularity": 1, > > > "unmap_granularity_alignment": 0, > > > "unmap_zeroes_data": 0 > > > }, > > > "dev": "/target/3/09", > > > "name": "3-09", > > > "plugin": "fileio", > > > "size": 6442450944, > > > "write_back": true, > > > "wwn": "e884ab8f-f4b4-4fdb-a690-1099c072c86d" > > > }, > > > > >Maybe this upstream change is not in all downstream 5.11 kernels, or > 5.11.7 > >already includes the fix? > > Linux 5.10.24/5.11.7 already merged the fix on 2021-03-17 by Greg > Kroah-Hartman. > > > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.11.y&id=7e0815797656f029fab2edc309406cddf931b9d8 > > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.10.y&id=d44c9780ed40db88626c9354868eab72159c7a7f > > # git describe d44c9780ed40db88626c9354868eab72159c7a7f > v5.10.23-184-gd44c9780ed40 > > # git describe 7e0815797656f029fab2edc309406cddf931b9d8 > v5.11.6-178-g7e0815797656 > So this is practically fixed. I don't think keeping a workaround for this in qemu is a good idea. >Adding Ben, maybe he had more insight on the multipath side. > > > >>If I understand the kernel change correctly, this can happen when there > is > >>a mounted file system on top of the multipath device. I don't think we > have > >>a use case when qemu accesses a multipath device when the device is used > >>by a file system, but maybe I missed something. > >> > >>So that to me implies > >>that we actually should not retry BLKZEROOUT, because the EBUSY will > >>remain, and that condition won’t change while the block device is in use > >>by qemu. > >> > >>On the other hand, in the code, you have decided not to reset > >>has_write_zeroes to false, so the implementation will retry. > >> > >>EBUSY is usually a temporary error, so retrying makes sense. The question > >>is if we really can write zeroes manually in this case? > >> > >>So I don’t quite understand. Should we keep trying BLKZEROOUT or is > >>there no chance of it working after it has at one point failed with > >>EBUSY? (Are there other cases besides what’s described in this commit > >>message where EBUSY might be returned and it is only temporary?) > >> > >>> Fallback to pwritev instead of exit for -EBUSY error. > >>> > >>> The issue was introduced in Linux 5.10: > >>> > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=384d87ef2c954fc58e6c5fd8253e4a1984f5fe02 > >>> > >>> Fixed in Linux 5.12: > >>> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=56887cffe946bb0a90c74429fa94d6110a73119d > >>> > >>> Signed-off-by: ChangLimin <changlm@chinatelecom.cn> > >>> --- > >>> block/file-posix.c | 8 ++++++-- > >>> 1 file changed, 6 insertions(+), 2 deletions(-) > >>> > >>> diff --git a/block/file-posix.c b/block/file-posix.c > >>> index 20e14f8e96..d4054ac9cb 100644 > >>> --- a/block/file-posix.c > >>> +++ b/block/file-posix.c > >>> @@ -1624,8 +1624,12 @@ static ssize_t > >>> handle_aiocb_write_zeroes_block(RawPosixAIOData *aiocb) > >>> } while (errno == EINTR); > >>> > >>> ret = translate_err(-errno); > >>> - if (ret == -ENOTSUP) { > >>> - s->has_write_zeroes = false; > >>> + switch (ret) { > >>> + case -ENOTSUP: > >>> + s->has_write_zeroes = false; /* fall through */ > >>> + case -EBUSY: /* Linux 5.10/5.11 may return -EBUSY for > multipath > >>> devices */ > >>> + return -ENOTSUP; > >>> + break; > >> > >>(Not sure why this break is here.) > >> > >>Max > >> > >>> } > >>> } > >>> #endif > >>> -- > >>> 2.27.0 > >>> > >
diff --git a/block/file-posix.c b/block/file-posix.c index 20e14f8e96..d4054ac9cb 100644 --- a/block/file-posix.c +++ b/block/file-posix.c @@ -1624,8 +1624,12 @@ static ssize_t handle_aiocb_write_zeroes_block(RawPosixAIOData *aiocb) } while (errno == EINTR); ret = translate_err(-errno); - if (ret == -ENOTSUP) { - s->has_write_zeroes = false; + switch (ret) { + case -ENOTSUP: + s->has_write_zeroes = false; /* fall through */ + case -EBUSY: /* Linux 5.10/5.11 may return -EBUSY for multipath devices */ + return -ENOTSUP; + break; } } #endif
For Linux 5.10/5.11, qemu write zeros to a multipath device using ioctl(fd, BLKZEROOUT, range) with cache none or directsync return -EBUSY permanently. Fallback to pwritev instead of exit for -EBUSY error. The issue was introduced in Linux 5.10: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=384d87ef2c954fc58e6c5fd8253e4a1984f5fe02 Fixed in Linux 5.12: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=56887cffe946bb0a90c74429fa94d6110a73119d Signed-off-by: ChangLimin <changlm@chinatelecom.cn> --- block/file-posix.c | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) -- 2.27.0