Message ID: 20190630150855.1016-1-mlevitsk@redhat.com
Series: RFC: don't obey the block device max transfer len / max segments for block devices
On Sun, 2019-06-30 at 18:08 +0300, Maxim Levitsky wrote:
> It looks like Linux block devices, even in O_DIRECT mode, don't have any
> user-visible limit on the transfer size / number of segments that the
> underlying block device can have; the block layer takes care of enforcing
> these limits by splitting the bios.
>
> By limiting the transfer sizes, we force qemu to do the splitting itself,
> which introduces various overheads. It is especially visible in the NBD
> server, where the low max transfer size of the underlying device forces us
> to advertise it over NBD, thus increasing the traffic overhead in the case
> of image conversion, which benefits from large blocks.
>
> More information can be found here:
> https://bugzilla.redhat.com/show_bug.cgi?id=1647104
>
> Tested this with qemu-img convert over NBD and natively, and to my
> surprise, even native IO performance improved a bit. (The device on which
> it was tested is an Intel Optane DC P4800X, which has a 128k max transfer
> size.)
>
> The benchmark:
>
> Images were created using:
>
> Sparse image:
>   qemu-img create -f qcow2 /dev/nvme0n1p3 1G / 10G / 100G
> Allocated image:
>   qemu-img create -f qcow2 /dev/nvme0n1p3 -o preallocation=metadata 1G / 10G / 100G
>
> The test was:
>
>   echo "convert native:"
>   rm -rf /dev/shm/disk.img
>   time qemu-img convert -p -f qcow2 -O raw -T none $FILE /dev/shm/disk.img > /dev/zero
>
>   echo "convert via nbd:"
>   qemu-nbd -k /tmp/nbd.sock -v -f qcow2 $FILE -x export --cache=none --aio=native --fork
>   rm -rf /dev/shm/disk.img
>   time qemu-img convert -p -f raw -O raw nbd:unix:/tmp/nbd.sock:exportname=export /dev/shm/disk.img > /dev/zero
>
> The results:
>
> =========================================
> 1G sparse image:
>   native:
>     before: 0.027s
>     after:  0.027s
>   nbd:
>     before: 0.287s
>     after:  0.035s
>
> =========================================
> 100G sparse image:
>   native:
>     before: 0.028s
>     after:  0.028s
>   nbd:
>     before: 23.796s
>     after:  0.109s
>
> =========================================
> 1G preallocated image:
>   native:
>     before: 0.454s
>     after:  0.427s
>   nbd:
>     before: 0.649s
>     after:  0.546s
>
> The block limits of max transfer size / max segment size are retained for
> SCSI passthrough, because in that case the kernel passes the userspace
> request directly to the kernel SCSI driver, bypassing the block layer, and
> thus there is no code to split such requests.
>
> What do you think?
>
> Fam, since you were the original author of the code that added these
> limits, could you share your opinion on that? What was the reason besides
> SCSI passthrough?
>
> Best regards,
>         Maxim Levitsky
>
> Maxim Levitsky (1):
>   raw-posix.c - use max transfer length / max segemnt count only for
>     SCSI passthrough
>
>  block/file-posix.c | 16 +++++++---------
>  1 file changed, 7 insertions(+), 9 deletions(-)

Ping

Best regards,
        Maxim Levitsky
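The traffic overhead described in the cover letter is easy to quantify: every client request larger than the advertised limit must be split into multiple round trips. A rough back-of-the-envelope sketch (an editor's illustration, not part of the thread; the 100 MiB image size and 32 MiB copy buffer are assumed figures):

```python
def requests_needed(total_bytes: int, max_transfer: int) -> int:
    """Number of requests needed to move total_bytes when each
    request is capped at max_transfer bytes (ceiling division)."""
    return -(-total_bytes // max_transfer)

MiB = 1024 * 1024
image = 100 * MiB  # hypothetical amount of allocated data to copy

# With the device's 128 KiB cap advertised over NBD:
capped = requests_needed(image, 128 * 1024)    # 800 requests
# With an assumed 32 MiB copy buffer and no device cap:
uncapped = requests_needed(image, 32 * MiB)    # 4 requests
```

Two orders of magnitude more round trips for the same data, which is consistent with the 23.796s-to-0.109s NBD result above.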
On Sun, Jun 30, 2019 at 06:08:54PM +0300, Maxim Levitsky wrote:
> It looks like Linux block devices, even in O_DIRECT mode don't have any
> user visible limit on transfer size / number of segments, which underlying
> block device can have. The block layer takes care of enforcing these
> limits by splitting the bios.
>
> By limiting the transfer sizes, we force qemu to do the splitting itself
> which introduces various overheads.
[...]
> The block limits of max transfer size/max segment size are retained
> for the SCSI passthrough because in this case the kernel passes the
> userspace request directly to the kernel scsi driver, bypassing the block
> layer, and thus there is no code to split such requests.
[...]
>  block/file-posix.c | 16 +++++++---------
>  1 file changed, 7 insertions(+), 9 deletions(-)

Adding Eric Blake, who implemented the generic request splitting in the
block layer and may know if there were any other reasons aside from SCSI
passthrough why file-posix.c enforces the host block device's maximum
transfer size.

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
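For readers unfamiliar with the splitting being discussed, a simplified model of it (an editor's sketch, not actual kernel or QEMU code) shows the core idea: a lower layer that honors a transfer-size limit by splitting requests lets everything above it issue arbitrarily large requests.

```python
from typing import List, Tuple

def split_request(offset: int, length: int,
                  max_transfer: int) -> List[Tuple[int, int]]:
    """Split one (offset, length) request into pieces no larger than
    max_transfer bytes each, preserving order and total coverage.
    This mimics (very loosely) what the kernel block layer does when
    it splits an oversized bio against the queue limits."""
    pieces = []
    while length > 0:
        chunk = min(length, max_transfer)
        pieces.append((offset, chunk))
        offset += chunk
        length -= chunk
    return pieces
```

When this splitting happens in the kernel, qemu's own copy of it in the block layer is redundant work; when it happens in qemu *before* the NBD layer, the small pieces leak into the advertised NBD limits, which is the problem the series addresses.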
On 7/3/19 4:52 AM, Stefan Hajnoczi wrote:
> On Sun, Jun 30, 2019 at 06:08:54PM +0300, Maxim Levitsky wrote:
>> It looks like Linux block devices, even in O_DIRECT mode don't have any user visible
>> limit on transfer size / number of segments, which underlying block device can have.
>> The block layer takes care of enforcing these limits by splitting the bios.

s/The block layer/The kernel block layer/

>>
>> By limiting the transfer sizes, we force qemu to do the splitting itself  which

double space

>> introduces various overheads.
>> It is especially visible in nbd server, where the low max transfer size of the
>> underlying device forces us to advertise this over NBD, thus increasing the traffic overhead in case of

Long line for a commit message.

>> image conversion which benefits from large blocks.
>>
>> More information can be found here:
>> https://bugzilla.redhat.com/show_bug.cgi?id=1647104
>>
>> Tested this with qemu-img convert over nbd and natively and to my surprise,
>> even native IO performance improved a bit.
>> (The device on which it was tested is Intel Optane DC P4800X, which has
>> 128k max transfer size)
>>
>> The benchmark:
>>

I'm sorry I didn't see this before softfreeze, but as a performance
improvement, I think it still classes as a bug fix and is safe for
inclusion in rc0.

>> The block limits of max transfer size/max segment size are retained
>> for the SCSI passthrough because in this case the kernel passes the
>> userspace request directly to the kernel scsi driver, bypassing the
>> block layer, and thus there is no code to split such requests.
>>
>> What do you think?

Seems like a reasonable explanation.

>>
>> Fam, since you was the original author of the code that added
>> these limits, could you share your opinion on that?
>> What was the reason besides SCSI passthrough?
>>
>> Best regards,
>> Maxim Levitsky
>>
>> Maxim Levitsky (1):
>>   raw-posix.c - use max transfer length / max segemnt count only for
>>     SCSI passthrough
>>
>>  block/file-posix.c | 16 +++++++---------
>>  1 file changed, 7 insertions(+), 9 deletions(-)
>
> Adding Eric Blake, who implemented the generic request splitting in the
> block layer and may know if there were any other reasons aside from SCSI
> passthrough why file-posix.c enforces the host block device's maximum
> transfer size.

No, I don't have any strong reasons for why file I/O must be capped to a
specific limit other than size_t (since the kernel does just fine at
splitting things up).

>
> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
>
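As a side note for anyone reproducing the measurements: on Linux the queue limits under discussion are visible in sysfs as queue/max_sectors_kb and queue/max_segments under /sys/block/<dev>/. The helper below is an editor's illustrative sketch, not part of the patch; the sysfs_base parameter exists only so it can be exercised against a mock directory instead of a real device.

```python
import os

def read_queue_limits(device: str, sysfs_base: str = "/sys/block"):
    """Return (max transfer in bytes, max segments) for a block device,
    reading max_sectors_kb and max_segments from its queue directory."""
    queue = os.path.join(sysfs_base, device, "queue")
    with open(os.path.join(queue, "max_sectors_kb")) as f:
        max_transfer = int(f.read()) * 1024  # file is in KiB
    with open(os.path.join(queue, "max_segments")) as f:
        max_segments = int(f.read())
    return max_transfer, max_segments
```

For the Optane device in the benchmark, max_sectors_kb of 128 would correspond to the 128k max transfer size mentioned in the cover letter.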