mbox series

[RFC,for-3.0-rc3,0/3] qemu-img: Disable copy offloading by default

Message ID 20180727033402.13513-1-famz@redhat.com
Headers show
Series qemu-img: Disable copy offloading by default | expand

Message

Fam Zheng July 27, 2018, 3:33 a.m. UTC
Kevin pointed out that both glibc and kernel provides a slow fallback of
copy_file_range which hurts thin provisioning. This is particularly true for
thin LVs, because host_device driver cannot get allocation info from the
volume, and copy_file_range is called on every sectors, making the dst fully
allocated.

NFS mount points also doesn't support SEEK_DATA well, so the allocation
information is unknown to QEMU.

That leaves only iscsi:// which seems to do what we want so far, but it is a
smaller use case.

Add an option to qemu-img convert, "-C", to enable (attempting) copy offloading
explicitly. And mark it incompatible with "-S" and "-c".

Fam Zheng (3):
  Revert "qemu-img: Document copy offloading implications with -S and
    -c"
  qemu-img: Add -C option for convert with copy offloading
  iotests: Add test for 'qemu-img convert -C' compatibility

 qemu-img-cmds.hx           |  2 +-
 qemu-img.c                 | 21 +++++++++++++++++----
 qemu-img.texi              | 14 +++++++++-----
 tests/qemu-iotests/082     |  8 ++++++++
 tests/qemu-iotests/082.out | 11 +++++++++++
 5 files changed, 46 insertions(+), 10 deletions(-)

Comments

Kevin Wolf July 27, 2018, 10:29 a.m. UTC | #1
Am 27.07.2018 um 05:33 hat Fam Zheng geschrieben:
> Kevin pointed out that both glibc and kernel provides a slow fallback of
> copy_file_range which hurts thin provisioning. This is particularly true for
> thin LVs, because host_device driver cannot get allocation info from the
> volume, and copy_file_range is called on every sectors, making the dst fully
> allocated.
> 
> NFS mount points also doesn't support SEEK_DATA well, so the allocation
> information is unknown to QEMU.
> 
> That leaves only iscsi:// which seems to do what we want so far, but it is a
> smaller use case.
> 
> Add an option to qemu-img convert, "-C", to enable (attempting) copy offloading
> explicitly. And mark it incompatible with "-S" and "-c".

Reviewed-by: Kevin Wolf <kwolf@redhat.com>

Not sure why you made this an RFC only, but I think we absolutely need
this. People are used to using 'qemu-img convert' to compact images and
this would regress with automatic copy offloading.

Do you think we need more discussion?

Kevin
Fam Zheng July 27, 2018, 12:14 p.m. UTC | #2
On Fri, Jul 27, 2018 at 6:29 PM Kevin Wolf <kwolf@redhat.com> wrote:
>
> Am 27.07.2018 um 05:33 hat Fam Zheng geschrieben:
> > Kevin pointed out that both glibc and kernel provides a slow fallback of
> > copy_file_range which hurts thin provisioning. This is particularly true for
> > thin LVs, because host_device driver cannot get allocation info from the
> > volume, and copy_file_range is called on every sectors, making the dst fully
> > allocated.
> >
> > NFS mount points also doesn't support SEEK_DATA well, so the allocation
> > information is unknown to QEMU.
> >
> > That leaves only iscsi:// which seems to do what we want so far, but it is a
> > smaller use case.
> >
> > Add an option to qemu-img convert, "-C", to enable (attempting) copy offloading
> > explicitly. And mark it incompatible with "-S" and "-c".
>
> Reviewed-by: Kevin Wolf <kwolf@redhat.com>
>
> Not sure why you made this an RFC only, but I think we absolutely need
> this. People are used to using 'qemu-img convert' to compact images and
> this would regress with automatic copy offloading.
>
> Do you think we need more discussion?

I think merging this for 3.0 is the right thing do to.

What worries me is the general usability of the feature. We could
probably explore ideas about how we can better take advantage of copy
offloading. I don't think counting on the user to make the right
decision between disk efficiency (thin provisioning) and BW efficiency
(copy offloading) will ever work. Even if we don't care about breaking
the default '-S 4k' behavior, the lack of SEEK_DATA/SEEK_HOLE support
on host NFS and block devices will make it very hard to use. Making it
worse, if the network to NFS server is good enough, convert with
pread64/pwrite64 with host page cache is also more efficient than
copy_file_range, so we'll be slower by trying to play clever. :(

Any thought?

Fam
Nir Soffer July 27, 2018, 1:40 p.m. UTC | #3
On Fri, Jul 27, 2018 at 3:15 PM Fam Zheng <famz@redhat.com> wrote:

> On Fri, Jul 27, 2018 at 6:29 PM Kevin Wolf <kwolf@redhat.com> wrote:
> >
> > Am 27.07.2018 um 05:33 hat Fam Zheng geschrieben:
> > > Kevin pointed out that both glibc and kernel provides a slow fallback
> of
> > > copy_file_range which hurts thin provisioning. This is particularly
> true for
> > > thin LVs, because host_device driver cannot get allocation info from
> the
> > > volume, and copy_file_range is called on every sectors, making the dst
> fully
> > > allocated.
> > >
> > > NFS mount points also doesn't support SEEK_DATA well, so the allocation
> > > information is unknown to QEMU.
>

NFS >= 4.2 supports SEEK_DATA/HOLE.


> > >
> > > That leaves only iscsi:// which seems to do what we want so far, but
> it is a
> > > smaller use case.
> > >
> > > Add an option to qemu-img convert, "-C", to enable (attempting) copy
> offloading
> > > explicitly. And mark it incompatible with "-S" and "-c".
> >
> > Reviewed-by: Kevin Wolf <kwolf@redhat.com>
> >
> > Not sure why you made this an RFC only, but I think we absolutely need
> > this. People are used to using 'qemu-img convert' to compact images and
> > this would regress with automatic copy offloading.
> >
> > Do you think we need more discussion?
>
> I think merging this for 3.0 is the right thing do to.
>
> What worries me is the general usability of the feature. We could
> probably explore ideas about how we can better take advantage of copy
> offloading. I don't think counting on the user to make the right
> decision between disk efficiency (thin provisioning) and BW efficiency

(copy offloading) will ever work. Even if we don't care about breaking
> the default '-S 4k' behavior, the lack of SEEK_DATA/SEEK_HOLE support
> on host NFS and block devices will make it very hard to use. Making it
> worse, if the network to NFS server is good enough, convert with
> pread64/pwrite64 with host page cache

is also more efficient than
> copy_file_range, so we'll be slower by trying to play clever. :(
>

In oVirt we always disable caching (-t none -T none), so using
host page cache is not an option.

I think adding an option for copy offloading is the right thing. This way
we can introduce the feature early without breaking the rest of the stuck.

For long term it would be nicer if qemu could select the best way to do
the copy automatically, keeping the allocation policy specified by the user.
oVirt still have bugs related to converting preallocated images to sparse
and sparse images to preallocated. Users like to have control on this, so
qemu-img cannot change the policy.

Do we have some documentation on the useful cases for copy
offloading? Did we benchmark this with different kind of storage?

The interesting use case for oVirt are:
- block storage: copying rregular LV on shared storage to
  another LV on same VG, but may be on a different PV.
- file storage: copying files on same NFS or GlusterFS mont.
- copying between different servers or file types, I guess
  copy_file_range will not help in this case, right?

Nir
Kevin Wolf July 27, 2018, 2:52 p.m. UTC | #4
Am 27.07.2018 um 14:14 hat Fam Zheng geschrieben:
> On Fri, Jul 27, 2018 at 6:29 PM Kevin Wolf <kwolf@redhat.com> wrote:
> >
> > Am 27.07.2018 um 05:33 hat Fam Zheng geschrieben:
> > > Kevin pointed out that both glibc and kernel provides a slow fallback of
> > > copy_file_range which hurts thin provisioning. This is particularly true for
> > > thin LVs, because host_device driver cannot get allocation info from the
> > > volume, and copy_file_range is called on every sectors, making the dst fully
> > > allocated.
> > >
> > > NFS mount points also doesn't support SEEK_DATA well, so the allocation
> > > information is unknown to QEMU.
> > >
> > > That leaves only iscsi:// which seems to do what we want so far, but it is a
> > > smaller use case.
> > >
> > > Add an option to qemu-img convert, "-C", to enable (attempting) copy offloading
> > > explicitly. And mark it incompatible with "-S" and "-c".
> >
> > Reviewed-by: Kevin Wolf <kwolf@redhat.com>
> >
> > Not sure why you made this an RFC only, but I think we absolutely need
> > this. People are used to using 'qemu-img convert' to compact images and
> > this would regress with automatic copy offloading.
> >
> > Do you think we need more discussion?
> 
> I think merging this for 3.0 is the right thing do to.

Thanks, applied to the block branch.

> What worries me is the general usability of the feature. We could
> probably explore ideas about how we can better take advantage of copy
> offloading. I don't think counting on the user to make the right
> decision between disk efficiency (thin provisioning) and BW efficiency
> (copy offloading) will ever work.

Ultimately it is a decision that QEMU can't make, though. The user needs
to tell us at least whether they are trying to compact the image or
whether they just want to get it copied as fast as possible.

Even if we know this, of course, we may still be lacking information
about the storage to make the best decision.

> Even if we don't care about breaking the default '-S 4k' behavior, the
> lack of SEEK_DATA/SEEK_HOLE support on host NFS and block devices will
> make it very hard to use.

I hope that NFS 4.2 will gain some ground in the future. Without it,
thin provisioning with raw images on NFS is basically impossible.

I'm not aware of problems with block devices?

> Making it worse, if the network to NFS server is good enough, convert
> with pread64/pwrite64 with host page cache is also more efficient than
> copy_file_range, so we'll be slower by trying to play clever. :(

Really? I could understand not being much faster, but being actually
slower sounds wrong. Are we slower because of overhead in
copy_file_range() itself or something during preparation?

Kevin