
[RFC,00/18] Map memory at destination .load_setup in vDPA-net migration

Message ID 20231019143455.2377694-1-eperezma@redhat.com

Message

Eugenio Perez Martin Oct. 19, 2023, 2:34 p.m. UTC
Memory operations like pinning may take a lot of time at the
destination.  Currently they are done after the source of the migration is
stopped and before the workload is resumed at the destination.  This is a
period where neither traffic can flow nor the VM workload can continue
(downtime).
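To get an intuition for the cost being moved out of the downtime window,
here is a standalone micro-benchmark (not part of the series, and only an
approximation of what pin_user_pages() costs in the kernel) that times
mlock(2) on a large pre-faulted anonymous mapping:

    /* pin_bench.c -- build with: gcc -O2 -o pin_bench pin_bench.c */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <time.h>

    int main(int argc, char **argv)
    {
        /* Size in GiB on the command line, 4 GiB by default. */
        size_t size = (argc > 1 ? strtoull(argv[1], NULL, 0) : 4ULL) << 30;
        void *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        struct timespec t0, t1;

        if (buf == MAP_FAILED) {
            perror("mmap");
            return 1;
        }
        memset(buf, 0, size);           /* fault pages in, like a running guest */
        clock_gettime(CLOCK_MONOTONIC, &t0);
        if (mlock(buf, size)) {         /* stand-in for the kernel pinning work */
            perror("mlock (check ulimit -l)");
            return 1;
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("pinned %zu GiB in %.3f s\n", size >> 30,
               t1.tv_sec - t0.tv_sec + (t1.tv_nsec - t0.tv_nsec) / 1e9);
        return 0;
    }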

We can do better, as we know the memory layout of the guest RAM at the
destination from the moment the migration starts.  Moving that operation
forward lets QEMU communicate the maps to the kernel while the workload is
still running on the source, so Linux can start mapping them.  Ideally, the
whole IOMMU is configured by then, but even if the vDPA parent driver uses an
on-chip IOMMU and .set_map we still save all the pinning time.

Note that further device setup at the end of the migration may alter the
guest memory layout.  But, as with the previous point, many operations like
memory pinning are still done incrementally, so we save time anyway.

The first bunch of patches just reorganizes the code so that memory-related
operation parameters are shared between all vhost_vdpa devices.  This is
because the destination does not know which vhost_vdpa struct will have the
registered listener member, so it is easier to place them in a shared struct
than to keep them in the vhost_vdpa struct.  Future versions may squash or
omit these patches.
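Roughly, the split looks like this (a sketch only, with approximate field
names, not the literal patch contents):

    /* Sketch: state that is really per vhost-vdpa character device, rather
     * than per virtqueue group, moves out of struct vhost_vdpa into a single
     * object shared by every vhost_vdpa instance of the same device. */
    typedef struct VhostVDPAShared {
        int device_fd;                     /* /dev/vhost-vdpa-N */
        struct vhost_vdpa_iova_range iova_range;
        VhostIOVATree *iova_tree;          /* SVQ IOVA allocations */
        MemoryListener listener;           /* one listener for the device */
        uint64_t backend_cap;
        bool iotlb_batch_begin_sent;
        bool shadow_data;
    } VhostVDPAShared;

    struct vhost_vdpa {
        VhostVDPAShared *shared;           /* memory-related, device-wide state */
        /* ... per virtqueue group state stays here ... */
    };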

Only tested with vdpa_sim.  I'm sending this before a full benchmark, as some
work like [1] can be based on it, and Si-Wei agreed to benchmark this series
with his experience.

Future directions on top of this series may include:
* Iterative migration of virtio-net devices, as it may reduce downtime per [1].
  vhost-vdpa net can apply the configuration through CVQ in the destination
  while the source is still migrating.
* Move more things ahead of migration time, like DRIVER_OK.
* Check that the devices at the destination are valid, and cancel the
  migration if they are not.

[1] https://lore.kernel.org/qemu-devel/6c8ebb97-d546-3f1c-4cdd-54e23a566f61@nvidia.com/T/

Eugenio Pérez (18):
  vdpa: add VhostVDPAShared
  vdpa: move iova tree to the shared struct
  vdpa: move iova_range to vhost_vdpa_shared
  vdpa: move shadow_data to vhost_vdpa_shared
  vdpa: use vdpa shared for tracing
  vdpa: move file descriptor to vhost_vdpa_shared
  vdpa: move iotlb_batch_begin_sent to vhost_vdpa_shared
  vdpa: move backend_cap to vhost_vdpa_shared
  vdpa: remove msg type of vhost_vdpa
  vdpa: move iommu_list to vhost_vdpa_shared
  vdpa: use VhostVDPAShared in vdpa_dma_map and unmap
  vdpa: use dev_shared in vdpa_iommu
  vdpa: move memory listener to vhost_vdpa_shared
  vdpa: do not set virtio status bits if unneeded
  vdpa: add vhost_vdpa_load_setup
  vdpa: add vhost_vdpa_net_load_setup NetClient callback
  vdpa: use shadow_data instead of first device v->shadow_vqs_enabled
  virtio_net: register incremental migration handlers

 include/hw/virtio/vhost-vdpa.h |  43 +++++---
 include/net/net.h              |   4 +
 hw/net/virtio-net.c            |  23 +++++
 hw/virtio/vdpa-dev.c           |   7 +-
 hw/virtio/vhost-vdpa.c         | 183 ++++++++++++++++++---------------
 net/vhost-vdpa.c               | 127 ++++++++++++-----------
 hw/virtio/trace-events         |  14 +--
 7 files changed, 239 insertions(+), 162 deletions(-)

Comments

Jason Wang Nov. 2, 2023, 4:36 a.m. UTC | #1
On Thu, Oct 19, 2023 at 10:35 PM Eugenio Pérez <eperezma@redhat.com> wrote:
>
> Current memory operations like pinning may take a lot of time at the
> destination.  Currently they are done after the source of the migration is
> stopped, and before the workload is resumed at the destination.  This is a
> period where neigher traffic can flow, nor the VM workload can continue
> (downtime).
>
> We can do better as we know the memory layout of the guest RAM at the
> destination from the moment the migration starts.  Moving that operation allows
> QEMU to communicate the kernel the maps while the workload is still running in
> the source, so Linux can start mapping them.  Ideally, all IOMMU is configured,
> but if the vDPA parent driver uses on-chip IOMMU and .set_map we're still
> saving all the pinning time.
>
> Note that further devices setup at the end of the migration may alter the guest
> memory layout. But same as the previous point, many operations are still done
> incrementally, like memory pinning, so we're saving time anyway.
>
> The first bunch of patches just reorganizes the code, so memory related
> operation parameters are shared between all vhost_vdpa devices.  This is
> because the destination does not know what vhost_vdpa struct will have the
> registered listener member, so it is easier to place them in a shared struct
> rather to keep them in vhost_vdpa struct.  Future version may squash or omit
> these patches.
>
> Only tested with vdpa_sim. I'm sending this before full benchmark, as some work
> like [1] can be based on it, and Si-Wei agreed on benchmark this series with
> his experience.

I'd expect we can see some improvement even without other optimizations?
For example, do we see an improvement on mlx5?

(Or we can probably add some delay to the simulator to see)

Thanks

>
> Future directions on top of this series may include:
> * Iterative migration of virtio-net devices, as it may reduce downtime per [1].
>   vhost-vdpa net can apply the configuration through CVQ in the destination
>   while the source is still migrating.
> * Move more things ahead of migration time, like DRIVER_OK.
> * Check that the devices of the destination are valid, and cancel the migration
>   in case it is not.
>
> [1] https://lore.kernel.org/qemu-devel/6c8ebb97-d546-3f1c-4cdd-54e23a566f61@nvidia.com/T/
>
> Eugenio Pérez (18):
>   vdpa: add VhostVDPAShared
>   vdpa: move iova tree to the shared struct
>   vdpa: move iova_range to vhost_vdpa_shared
>   vdpa: move shadow_data to vhost_vdpa_shared
>   vdpa: use vdpa shared for tracing
>   vdpa: move file descriptor to vhost_vdpa_shared
>   vdpa: move iotlb_batch_begin_sent to vhost_vdpa_shared
>   vdpa: move backend_cap to vhost_vdpa_shared
>   vdpa: remove msg type of vhost_vdpa
>   vdpa: move iommu_list to vhost_vdpa_shared
>   vdpa: use VhostVDPAShared in vdpa_dma_map and unmap
>   vdpa: use dev_shared in vdpa_iommu
>   vdpa: move memory listener to vhost_vdpa_shared
>   vdpa: do not set virtio status bits if unneeded
>   vdpa: add vhost_vdpa_load_setup
>   vdpa: add vhost_vdpa_net_load_setup NetClient callback
>   vdpa: use shadow_data instead of first device v->shadow_vqs_enabled
>   virtio_net: register incremental migration handlers
>
>  include/hw/virtio/vhost-vdpa.h |  43 +++++---
>  include/net/net.h              |   4 +
>  hw/net/virtio-net.c            |  23 +++++
>  hw/virtio/vdpa-dev.c           |   7 +-
>  hw/virtio/vhost-vdpa.c         | 183 ++++++++++++++++++---------------
>  net/vhost-vdpa.c               | 127 ++++++++++++-----------
>  hw/virtio/trace-events         |  14 +--
>  7 files changed, 239 insertions(+), 162 deletions(-)
>
> --
> 2.39.3
>
>
Si-Wei Liu Nov. 2, 2023, 10:12 a.m. UTC | #2
On 10/19/2023 7:34 AM, Eugenio Pérez wrote:
> Current memory operations like pinning may take a lot of time at the
>
> destination.  Currently they are done after the source of the migration is
>
> stopped, and before the workload is resumed at the destination.  This is a
>
> period where neigher traffic can flow, nor the VM workload can continue
>
> (downtime).
>
>
>
> We can do better as we know the memory layout of the guest RAM at the
>
> destination from the moment the migration starts.  Moving that operation allows
>
> QEMU to communicate the kernel the maps while the workload is still running in
>
> the source, so Linux can start mapping them.  Ideally, all IOMMU is configured,
>
> but if the vDPA parent driver uses on-chip IOMMU and .set_map we're still
>
> saving all the pinning time.
I get what you want to say, though I'm not sure how pinning is relevant to 
on-chip IOMMU and .set_map here; essentially, pinning is required for all 
parent vdpa drivers that perform DMA and hence don't want VM pages to move 
around.
>
>
>
> Note that further devices setup at the end of the migration may alter the guest
>
> memory layout. But same as the previous point, many operations are still done
>
> incrementally, like memory pinning, so we're saving time anyway.
>
>
>
> The first bunch of patches just reorganizes the code, so memory related
>
> operation parameters are shared between all vhost_vdpa devices.  This is
>
> because the destination does not know what vhost_vdpa struct will have the
>
> registered listener member, so it is easier to place them in a shared struct
>
> rather to keep them in vhost_vdpa struct.  Future version may squash or omit
>
> these patches.
It looks like this VhostVDPAShared facility (patches 1-13) is also what I 
need in my SVQ descriptor group series [*], for which I've built a similar 
construct.  If possible please try to merge this in ASAP; I'll rework my 
series on top of that.

[*] 
https://github.com/siwliu-kernel/qemu/commit/813518354af5ee8a6e867b2bf7dff3d6004fbcd5

>
>
>
> Only tested with vdpa_sim. I'm sending this before full benchmark, as some work
>
> like [1] can be based on it, and Si-Wei agreed on benchmark this series with
>
> his experience.
Haven't done the full benchmark against pre-mapping at the destination yet, 
though one observation is that the destination QEMU very easily gets stuck 
for a very long time while in the middle of pinning pages.  During this 
period, any client doing a read-only QMP query or executing an HMP info 
command gets frozen indefinitely (subject to how large the memory being 
pinned is).  Is it possible to unblock those QMP requests or HMP commands 
(at least the read-only ones) while in migration?  Yield from the load_setup 
coroutine and spawn another thread?

Having said that, I'm not sure .load_setup is a good fit for what we want to 
do.  Looking at all current users of .load_setup, either the job can be done 
instantly or the task is time-bound without trapping into the kernel for too 
long.  Maybe pinning is too special a use case here...

-Siwei
>
>
>
> Future directions on top of this series may include:
>
> * Iterative migration of virtio-net devices, as it may reduce downtime per [1].
>
>    vhost-vdpa net can apply the configuration through CVQ in the destination
>
>    while the source is still migrating.
>
> * Move more things ahead of migration time, like DRIVER_OK.
>
> * Check that the devices of the destination are valid, and cancel the migration
>
>    in case it is not.
>
>
>
> [1] https://lore.kernel.org/qemu-devel/6c8ebb97-d546-3f1c-4cdd-54e23a566f61@nvidia.com/T/
>
>
>
> Eugenio Pérez (18):
>
>    vdpa: add VhostVDPAShared
>
>    vdpa: move iova tree to the shared struct
>
>    vdpa: move iova_range to vhost_vdpa_shared
>
>    vdpa: move shadow_data to vhost_vdpa_shared
>
>    vdpa: use vdpa shared for tracing
>
>    vdpa: move file descriptor to vhost_vdpa_shared
>
>    vdpa: move iotlb_batch_begin_sent to vhost_vdpa_shared
>
>    vdpa: move backend_cap to vhost_vdpa_shared
>
>    vdpa: remove msg type of vhost_vdpa
>
>    vdpa: move iommu_list to vhost_vdpa_shared
>
>    vdpa: use VhostVDPAShared in vdpa_dma_map and unmap
>
>    vdpa: use dev_shared in vdpa_iommu
>
>    vdpa: move memory listener to vhost_vdpa_shared
>
>    vdpa: do not set virtio status bits if unneeded
>
>    vdpa: add vhost_vdpa_load_setup
>
>    vdpa: add vhost_vdpa_net_load_setup NetClient callback
>
>    vdpa: use shadow_data instead of first device v->shadow_vqs_enabled
>
>    virtio_net: register incremental migration handlers
>
>
>
>   include/hw/virtio/vhost-vdpa.h |  43 +++++---
>
>   include/net/net.h              |   4 +
>
>   hw/net/virtio-net.c            |  23 +++++
>
>   hw/virtio/vdpa-dev.c           |   7 +-
>
>   hw/virtio/vhost-vdpa.c         | 183 ++++++++++++++++++---------------
>
>   net/vhost-vdpa.c               | 127 ++++++++++++-----------
>
>   hw/virtio/trace-events         |  14 +--
>
>   7 files changed, 239 insertions(+), 162 deletions(-)
>
>
>
Eugenio Perez Martin Nov. 2, 2023, 12:37 p.m. UTC | #3
On Thu, Nov 2, 2023 at 11:13 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>
>
>
> On 10/19/2023 7:34 AM, Eugenio Pérez wrote:
> > Current memory operations like pinning may take a lot of time at the
> >
> > destination.  Currently they are done after the source of the migration is
> >
> > stopped, and before the workload is resumed at the destination.  This is a
> >
> > period where neigher traffic can flow, nor the VM workload can continue
> >
> > (downtime).
> >
> >
> >
> > We can do better as we know the memory layout of the guest RAM at the
> >
> > destination from the moment the migration starts.  Moving that operation allows
> >
> > QEMU to communicate the kernel the maps while the workload is still running in
> >
> > the source, so Linux can start mapping them.  Ideally, all IOMMU is configured,
> >
> > but if the vDPA parent driver uses on-chip IOMMU and .set_map we're still
> >
> > saving all the pinning time.
> I get what you want to say, though not sure how pinning is relevant to
> on-chip IOMMU and .set_map here, essentially pinning is required for all
> parent vdpa drivers that perform DMA hence don't want VM pages to move
> around.

Basically highlighting that the work done under .set_map is not only
pinning, but that pinning is a significant fraction of it.  It can be
reworded or deleted for sure.

> >
> >
> >
> > Note that further devices setup at the end of the migration may alter the guest
> >
> > memory layout. But same as the previous point, many operations are still done
> >
> > incrementally, like memory pinning, so we're saving time anyway.
> >
> >
> >
> > The first bunch of patches just reorganizes the code, so memory related
> >
> > operation parameters are shared between all vhost_vdpa devices.  This is
> >
> > because the destination does not know what vhost_vdpa struct will have the
> >
> > registered listener member, so it is easier to place them in a shared struct
> >
> > rather to keep them in vhost_vdpa struct.  Future version may squash or omit
> >
> > these patches.
> It looks this VhostVDPAShared facility (patch 1-13) is also what I need
> in my SVQ descriptor group series [*], for which I've built similar
> construct there. If possible please try to merge this in ASAP. I'll
> rework my series on top of that.
>
> [*]
> https://github.com/siwliu-kernel/qemu/commit/813518354af5ee8a6e867b2bf7dff3d6004fbcd5
>

I can send it individually, for sure.

MST, Jason, can this first part be merged? It doesn't add a lot by
itself but it helps pave the way for future changes.

> >
> >
> >
> > Only tested with vdpa_sim. I'm sending this before full benchmark, as some work
> >
> > like [1] can be based on it, and Si-Wei agreed on benchmark this series with
> >
> > his experience.
> Haven't done the full benchmark compared to pre-map at destination yet,
> though an observation is that the destination QEMU seems very easy to
> get stuck for very long time while in mid of pinning pages. During this
> period, any client doing read-only QMP query or executing HMP info
> command got frozen indefinitely (subject to how large size the memory is
> being pinned). Is it possible to unblock those QMP request or HMP
> command from being executed (at least the read-only ones) while in
> migration? Yield from the load_setup corourtine and spawn another thread?
>

Ok, I wasn't aware of that.

I think we cannot yield in a coroutine and wait for an ioctl. One
option that came to my mind is to effectively use another thread, and
use a POSIX barrier (or equivalent on glib / QEMU) before finishing
the migration. I'm not sure if there are more points where we can
check the barrier and tell the migration to continue or stop though.

Another option is to start issuing these ioctls in an asynchronous way,
io_uring-cmd like, but I'd like to achieve this first.
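Roughly this shape, I think (a self-contained sketch with plain pthreads;
the do_dma_maps() placeholder stands in for the real write()/ioctl() calls
on the vhost-vdpa fd, and QEMU would use its own thread and synchronization
primitives instead):

    /* build with: gcc -O2 -pthread -o early_maps early_maps.c */
    #include <pthread.h>
    #include <stdio.h>

    static pthread_barrier_t maps_done;

    static void do_dma_maps(void)
    {
        /* ... VHOST_IOTLB_UPDATE for every guest RAM section; may block ... */
    }

    static void *map_worker(void *opaque)
    {
        (void)opaque;
        do_dma_maps();
        pthread_barrier_wait(&maps_done);   /* release whoever waits for us */
        return NULL;
    }

    /* Called early, e.g. from .load_setup, while the source still runs. */
    static void start_early_maps(void)
    {
        pthread_t tid;

        pthread_barrier_init(&maps_done, NULL, 2);
        pthread_create(&tid, NULL, map_worker, NULL);
        pthread_detach(tid);
    }

    /* Called right before starting the device at the end of the migration,
     * so the device never runs with the maps half done. */
    static void wait_for_early_maps(void)
    {
        pthread_barrier_wait(&maps_done);
    }

    int main(void)
    {
        start_early_maps();
        /* ... the rest of the incoming migration would run here ... */
        wait_for_early_maps();
        printf("maps ready, device can be started\n");
        return 0;
    }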

> Having said, not sure if .load_setup is a good fit for what we want to
> do. Searching all current users of .load_setup, either the job can be
> done instantly or the task is time bound without trapping into kernel
> for too long. Maybe pinning is too special use case here...
>
> -Siwei
> >
> >
> >
> > Future directions on top of this series may include:
> >
> > * Iterative migration of virtio-net devices, as it may reduce downtime per [1].
> >
> >    vhost-vdpa net can apply the configuration through CVQ in the destination
> >
> >    while the source is still migrating.
> >
> > * Move more things ahead of migration time, like DRIVER_OK.
> >
> > * Check that the devices of the destination are valid, and cancel the migration
> >
> >    in case it is not.
> >
> >
> >
> > [1] https://lore.kernel.org/qemu-devel/6c8ebb97-d546-3f1c-4cdd-54e23a566f61@nvidia.com/T/
> >
> >
> >
> > Eugenio Pérez (18):
> >
> >    vdpa: add VhostVDPAShared
> >
> >    vdpa: move iova tree to the shared struct
> >
> >    vdpa: move iova_range to vhost_vdpa_shared
> >
> >    vdpa: move shadow_data to vhost_vdpa_shared
> >
> >    vdpa: use vdpa shared for tracing
> >
> >    vdpa: move file descriptor to vhost_vdpa_shared
> >
> >    vdpa: move iotlb_batch_begin_sent to vhost_vdpa_shared
> >
> >    vdpa: move backend_cap to vhost_vdpa_shared
> >
> >    vdpa: remove msg type of vhost_vdpa
> >
> >    vdpa: move iommu_list to vhost_vdpa_shared
> >
> >    vdpa: use VhostVDPAShared in vdpa_dma_map and unmap
> >
> >    vdpa: use dev_shared in vdpa_iommu
> >
> >    vdpa: move memory listener to vhost_vdpa_shared
> >
> >    vdpa: do not set virtio status bits if unneeded
> >
> >    vdpa: add vhost_vdpa_load_setup
> >
> >    vdpa: add vhost_vdpa_net_load_setup NetClient callback
> >
> >    vdpa: use shadow_data instead of first device v->shadow_vqs_enabled
> >
> >    virtio_net: register incremental migration handlers
> >
> >
> >
> >   include/hw/virtio/vhost-vdpa.h |  43 +++++---
> >
> >   include/net/net.h              |   4 +
> >
> >   hw/net/virtio-net.c            |  23 +++++
> >
> >   hw/virtio/vdpa-dev.c           |   7 +-
> >
> >   hw/virtio/vhost-vdpa.c         | 183 ++++++++++++++++++---------------
> >
> >   net/vhost-vdpa.c               | 127 ++++++++++++-----------
> >
> >   hw/virtio/trace-events         |  14 +--
> >
> >   7 files changed, 239 insertions(+), 162 deletions(-)
> >
> >
> >
>
Si-Wei Liu Nov. 3, 2023, 8:19 p.m. UTC | #4
On 11/2/2023 5:37 AM, Eugenio Perez Martin wrote:
> On Thu, Nov 2, 2023 at 11:13 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>
>>
>> On 10/19/2023 7:34 AM, Eugenio Pérez wrote:
>>> Current memory operations like pinning may take a lot of time at the
>>>
>>> destination.  Currently they are done after the source of the migration is
>>>
>>> stopped, and before the workload is resumed at the destination.  This is a
>>>
>>> period where neigher traffic can flow, nor the VM workload can continue
>>>
>>> (downtime).
>>>
>>>
>>>
>>> We can do better as we know the memory layout of the guest RAM at the
>>>
>>> destination from the moment the migration starts.  Moving that operation allows
>>>
>>> QEMU to communicate the kernel the maps while the workload is still running in
>>>
>>> the source, so Linux can start mapping them.  Ideally, all IOMMU is configured,
>>>
>>> but if the vDPA parent driver uses on-chip IOMMU and .set_map we're still
>>>
>>> saving all the pinning time.
>> I get what you want to say, though not sure how pinning is relevant to
>> on-chip IOMMU and .set_map here, essentially pinning is required for all
>> parent vdpa drivers that perform DMA hence don't want VM pages to move
>> around.
> Basically highlighting that the work done under .set_map is not only
> pinning, but it is a significant fraction on it. It can be reworded or
> deleted for sure.
>
>>>
>>>
>>> Note that further devices setup at the end of the migration may alter the guest
>>>
>>> memory layout. But same as the previous point, many operations are still done
>>>
>>> incrementally, like memory pinning, so we're saving time anyway.
>>>
>>>
>>>
>>> The first bunch of patches just reorganizes the code, so memory related
>>>
>>> operation parameters are shared between all vhost_vdpa devices.  This is
>>>
>>> because the destination does not know what vhost_vdpa struct will have the
>>>
>>> registered listener member, so it is easier to place them in a shared struct
>>>
>>> rather to keep them in vhost_vdpa struct.  Future version may squash or omit
>>>
>>> these patches.
>> It looks this VhostVDPAShared facility (patch 1-13) is also what I need
>> in my SVQ descriptor group series [*], for which I've built similar
>> construct there. If possible please try to merge this in ASAP. I'll
>> rework my series on top of that.
>>
>> [*]
>> https://github.com/siwliu-kernel/qemu/commit/813518354af5ee8a6e867b2bf7dff3d6004fbcd5
>>
> I can send it individually, for sure.
>
> MST, Jason, can this first part be merged? It doesn't add a lot by
> itself but it helps pave the way for future changes.
If it cannot, it doesn't matter. I can pick it from here and get my 
series posted with your patches 1-13 applied upfront. This should work, 
I think?

>>>
>>>
>>> Only tested with vdpa_sim. I'm sending this before full benchmark, as some work
>>>
>>> like [1] can be based on it, and Si-Wei agreed on benchmark this series with
>>>
>>> his experience.
>> Haven't done the full benchmark compared to pre-map at destination yet,
>> though an observation is that the destination QEMU seems very easy to
>> get stuck for very long time while in mid of pinning pages. During this
>> period, any client doing read-only QMP query or executing HMP info
>> command got frozen indefinitely (subject to how large size the memory is
>> being pinned). Is it possible to unblock those QMP request or HMP
>> command from being executed (at least the read-only ones) while in
>> migration? Yield from the load_setup corourtine and spawn another thread?
>>
> Ok, I wasn't aware of that.
>
> I think we cannot yield in a coroutine and wait for an ioctl.
I was wondering if we need a separate coroutine, out of the general 
migration path, to support this special code without overloading load_setup 
or its callers.  For instance, unblocking the source from sending guest RAM 
while allowing the destination to pin pages in parallel should be possible.

Regardless, a separate thread is needed to carry out all the heavy 
lifting, i.e. the ioctl(2) or write(2) syscalls that map and pin pages.


> One
> option that came to my mind is to effectively use another thread, and
> use a POSIX barrier (or equivalent on glib / QEMU) before finishing
> the migration.
Yes, a separate thread is needed anyway.

>   I'm not sure if there are more points where we can
> check the barrier and tell the migration to continue or stop though.
I think there is, e.g. what if the dma_map fails?  There must be a check 
point for that.

>
> Another option is to effectively start doing these ioctls in an
> asynchronous way, io_uring cmds like, but I'd like to achieve this
> first.
Yes, io_uring or any async API could be another option, though that needs a 
new uAPI and additional kernel facilities to support it.  Anyway, it's up to 
you to decide. :)

Regards,
-Siwei
>> Having said, not sure if .load_setup is a good fit for what we want to
>> do. Searching all current users of .load_setup, either the job can be
>> done instantly or the task is time bound without trapping into kernel
>> for too long. Maybe pinning is too special use case here...
>>
>> -Siwei
>>>
>>>
>>> Future directions on top of this series may include:
>>>
>>> * Iterative migration of virtio-net devices, as it may reduce downtime per [1].
>>>
>>>     vhost-vdpa net can apply the configuration through CVQ in the destination
>>>
>>>     while the source is still migrating.
>>>
>>> * Move more things ahead of migration time, like DRIVER_OK.
>>>
>>> * Check that the devices of the destination are valid, and cancel the migration
>>>
>>>     in case it is not.
>>>
>>>
>>>
>>> [1] https://lore.kernel.org/qemu-devel/6c8ebb97-d546-3f1c-4cdd-54e23a566f61@nvidia.com/T/
>>>
>>>
>>>
>>> Eugenio Pérez (18):
>>>
>>>     vdpa: add VhostVDPAShared
>>>
>>>     vdpa: move iova tree to the shared struct
>>>
>>>     vdpa: move iova_range to vhost_vdpa_shared
>>>
>>>     vdpa: move shadow_data to vhost_vdpa_shared
>>>
>>>     vdpa: use vdpa shared for tracing
>>>
>>>     vdpa: move file descriptor to vhost_vdpa_shared
>>>
>>>     vdpa: move iotlb_batch_begin_sent to vhost_vdpa_shared
>>>
>>>     vdpa: move backend_cap to vhost_vdpa_shared
>>>
>>>     vdpa: remove msg type of vhost_vdpa
>>>
>>>     vdpa: move iommu_list to vhost_vdpa_shared
>>>
>>>     vdpa: use VhostVDPAShared in vdpa_dma_map and unmap
>>>
>>>     vdpa: use dev_shared in vdpa_iommu
>>>
>>>     vdpa: move memory listener to vhost_vdpa_shared
>>>
>>>     vdpa: do not set virtio status bits if unneeded
>>>
>>>     vdpa: add vhost_vdpa_load_setup
>>>
>>>     vdpa: add vhost_vdpa_net_load_setup NetClient callback
>>>
>>>     vdpa: use shadow_data instead of first device v->shadow_vqs_enabled
>>>
>>>     virtio_net: register incremental migration handlers
>>>
>>>
>>>
>>>    include/hw/virtio/vhost-vdpa.h |  43 +++++---
>>>
>>>    include/net/net.h              |   4 +
>>>
>>>    hw/net/virtio-net.c            |  23 +++++
>>>
>>>    hw/virtio/vdpa-dev.c           |   7 +-
>>>
>>>    hw/virtio/vhost-vdpa.c         | 183 ++++++++++++++++++---------------
>>>
>>>    net/vhost-vdpa.c               | 127 ++++++++++++-----------
>>>
>>>    hw/virtio/trace-events         |  14 +--
>>>
>>>    7 files changed, 239 insertions(+), 162 deletions(-)
>>>
>>>
>>>
Si-Wei Liu Nov. 3, 2023, 8:40 p.m. UTC | #5
On 11/2/2023 3:12 AM, Si-Wei Liu wrote:
>
>
> On 10/19/2023 7:34 AM, Eugenio Pérez wrote:
>> Current memory operations like pinning may take a lot of time at the
>>
>> destination.  Currently they are done after the source of the 
>> migration is
>>
>> stopped, and before the workload is resumed at the destination. This 
>> is a
>>
>> period where neigher traffic can flow, nor the VM workload can continue
>>
>> (downtime).
>>
>>
>>
>> We can do better as we know the memory layout of the guest RAM at the
>>
>> destination from the moment the migration starts.  Moving that 
>> operation allows
>>
>> QEMU to communicate the kernel the maps while the workload is still 
>> running in
>>
>> the source, so Linux can start mapping them.  Ideally, all IOMMU is 
>> configured,
>>
>> but if the vDPA parent driver uses on-chip IOMMU and .set_map we're 
>> still
>>
>> saving all the pinning time.
> I get what you want to say, though not sure how pinning is relevant to 
> on-chip IOMMU and .set_map here, essentially pinning is required for 
> all parent vdpa drivers that perform DMA hence don't want VM pages to 
> move around.
>>
>>
>>
>> Note that further devices setup at the end of the migration may alter 
>> the guest
>>
>> memory layout. But same as the previous point, many operations are 
>> still done
>>
>> incrementally, like memory pinning, so we're saving time anyway.
>>
>>
>>
>> The first bunch of patches just reorganizes the code, so memory related
>>
>> operation parameters are shared between all vhost_vdpa devices. This is
>>
>> because the destination does not know what vhost_vdpa struct will 
>> have the
>>
>> registered listener member, so it is easier to place them in a shared 
>> struct
>>
>> rather to keep them in vhost_vdpa struct.  Future version may squash 
>> or omit
>>
>> these patches.
> It looks this VhostVDPAShared facility (patch 1-13) is also what I 
> need in my SVQ descriptor group series [*], for which I've built 
> similar construct there. If possible please try to merge this in ASAP. 
> I'll rework my series on top of that.
>
> [*] 
> https://github.com/siwliu-kernel/qemu/commit/813518354af5ee8a6e867b2bf7dff3d6004fbcd5
>
>>
>>
>>
>> Only tested with vdpa_sim. I'm sending this before full benchmark, as 
>> some work
>>
>> like [1] can be based on it, and Si-Wei agreed on benchmark this 
>> series with
>>
>> his experience.
> Haven't done the full benchmark compared to pre-map at destination yet,
Hi Eugenio,

I just noticed one thing that affects the performance benchmark for this 
series in terms of migration total_time (to be fair, it's mlx5_vdpa 
specific).  It looks like iotlb map batching is not acked (via 
vhost_vdpa_set_backend_cap) at the point of vhost_vdpa_load_setup, 
effectively causing quite extensive time to be spent on hundreds of dma_map 
calls from listener_register(), while the equivalent code had been 
implemented in my destination pre-map patch [1].  Although I can benchmark 
the current patchset by removing batching from my code, I guess that's not 
the goal of this benchmark, right?

It would be best to have map batching in place, so the benchmarks for 
both options can match.  What do you think?
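For reference, the batching this depends on boils down to acking
VHOST_BACKEND_F_IOTLB_BATCH on the device fd first and then bracketing the
per-section updates, roughly like below (a sketch from the uAPI in
<linux/vhost.h> / <linux/vhost_types.h> as I remember it, untested, so
please double-check the constants):

    #include <linux/vhost.h>
    #include <linux/vhost_types.h>
    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    static int iotlb_send(int fd, uint8_t type, uint64_t iova,
                          uint64_t size, uint64_t uaddr)
    {
        struct vhost_msg_v2 msg = {
            .type = VHOST_IOTLB_MSG_V2,
            .iotlb = {
                .type = type,
                .iova = iova,
                .size = size,
                .uaddr = uaddr,
                .perm = VHOST_ACCESS_RW,
            },
        };

        return write(fd, &msg, sizeof(msg)) == sizeof(msg) ? 0 : -1;
    }

    static int map_all_sections(int vdpa_fd)
    {
        uint64_t features = (1ULL << VHOST_BACKEND_F_IOTLB_MSG_V2) |
                            (1ULL << VHOST_BACKEND_F_IOTLB_BATCH);

        /* Without this ack at load_setup time, every update below is a
         * separate .set_map round trip in the parent driver. */
        if (ioctl(vdpa_fd, VHOST_SET_BACKEND_FEATURES, &features)) {
            return -1;
        }

        iotlb_send(vdpa_fd, VHOST_IOTLB_BATCH_BEGIN, 0, 0, 0);
        /* ... one iotlb_send(vdpa_fd, VHOST_IOTLB_UPDATE, iova, size, uaddr)
         * per guest RAM section ... */
        iotlb_send(vdpa_fd, VHOST_IOTLB_BATCH_END, 0, 0, 0);
        return 0;
    }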

Thanks,
-Siwei

[1]
https://github.com/siwliu-kernel/qemu/commit/0ce225b0c7e618163ea09da3846c93c4de2f85ed#diff-45489c6f25dc36fd84e1cd28cbf3b8ff03301e2d24dadb6d1c334c9e8f14c00cR639

> though an observation is that the destination QEMU seems very easy to 
> get stuck for very long time while in mid of pinning pages. During 
> this period, any client doing read-only QMP query or executing HMP 
> info command got frozen indefinitely (subject to how large size the 
> memory is being pinned). Is it possible to unblock those QMP request 
> or HMP command from being executed (at least the read-only ones) while 
> in migration? Yield from the load_setup corourtine and spawn another 
> thread?
>
> Having said, not sure if .load_setup is a good fit for what we want to 
> do. Searching all current users of .load_setup, either the job can be 
> done instantly or the task is time bound without trapping into kernel 
> for too long. Maybe pinning is too special use case here...
>
> -Siwei
>>
>>
>>
>> Future directions on top of this series may include:
>>
>> * Iterative migration of virtio-net devices, as it may reduce 
>> downtime per [1].
>>
>>    vhost-vdpa net can apply the configuration through CVQ in the 
>> destination
>>
>>    while the source is still migrating.
>>
>> * Move more things ahead of migration time, like DRIVER_OK.
>>
>> * Check that the devices of the destination are valid, and cancel the 
>> migration
>>
>>    in case it is not.
>>
>>
>>
>> [1] 
>> https://lore.kernel.org/qemu-devel/6c8ebb97-d546-3f1c-4cdd-54e23a566f61@nvidia.com/T/
>>
>>
>>
>> Eugenio Pérez (18):
>>
>>    vdpa: add VhostVDPAShared
>>
>>    vdpa: move iova tree to the shared struct
>>
>>    vdpa: move iova_range to vhost_vdpa_shared
>>
>>    vdpa: move shadow_data to vhost_vdpa_shared
>>
>>    vdpa: use vdpa shared for tracing
>>
>>    vdpa: move file descriptor to vhost_vdpa_shared
>>
>>    vdpa: move iotlb_batch_begin_sent to vhost_vdpa_shared
>>
>>    vdpa: move backend_cap to vhost_vdpa_shared
>>
>>    vdpa: remove msg type of vhost_vdpa
>>
>>    vdpa: move iommu_list to vhost_vdpa_shared
>>
>>    vdpa: use VhostVDPAShared in vdpa_dma_map and unmap
>>
>>    vdpa: use dev_shared in vdpa_iommu
>>
>>    vdpa: move memory listener to vhost_vdpa_shared
>>
>>    vdpa: do not set virtio status bits if unneeded
>>
>>    vdpa: add vhost_vdpa_load_setup
>>
>>    vdpa: add vhost_vdpa_net_load_setup NetClient callback
>>
>>    vdpa: use shadow_data instead of first device v->shadow_vqs_enabled
>>
>>    virtio_net: register incremental migration handlers
>>
>>
>>
>>   include/hw/virtio/vhost-vdpa.h |  43 +++++---
>>
>>   include/net/net.h              |   4 +
>>
>>   hw/net/virtio-net.c            |  23 +++++
>>
>>   hw/virtio/vdpa-dev.c           |   7 +-
>>
>>   hw/virtio/vhost-vdpa.c         | 183 ++++++++++++++++++---------------
>>
>>   net/vhost-vdpa.c               | 127 ++++++++++++-----------
>>
>>   hw/virtio/trace-events         |  14 +--
>>
>>   7 files changed, 239 insertions(+), 162 deletions(-)
>>
>>
>>
>
Jason Wang Nov. 6, 2023, 4:17 a.m. UTC | #6
On Thu, Nov 2, 2023 at 8:38 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
>
> On Thu, Nov 2, 2023 at 11:13 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >
> >
> >
> > On 10/19/2023 7:34 AM, Eugenio Pérez wrote:
> > > Current memory operations like pinning may take a lot of time at the
> > >
> > > destination.  Currently they are done after the source of the migration is
> > >
> > > stopped, and before the workload is resumed at the destination.  This is a
> > >
> > > period where neigher traffic can flow, nor the VM workload can continue
> > >
> > > (downtime).
> > >
> > >
> > >
> > > We can do better as we know the memory layout of the guest RAM at the
> > >
> > > destination from the moment the migration starts.  Moving that operation allows
> > >
> > > QEMU to communicate the kernel the maps while the workload is still running in
> > >
> > > the source, so Linux can start mapping them.  Ideally, all IOMMU is configured,
> > >
> > > but if the vDPA parent driver uses on-chip IOMMU and .set_map we're still
> > >
> > > saving all the pinning time.
> > I get what you want to say, though not sure how pinning is relevant to
> > on-chip IOMMU and .set_map here, essentially pinning is required for all
> > parent vdpa drivers that perform DMA hence don't want VM pages to move
> > around.
>
> Basically highlighting that the work done under .set_map is not only
> pinning, but it is a significant fraction on it. It can be reworded or
> deleted for sure.
>
> > >
> > >
> > >
> > > Note that further devices setup at the end of the migration may alter the guest
> > >
> > > memory layout. But same as the previous point, many operations are still done
> > >
> > > incrementally, like memory pinning, so we're saving time anyway.
> > >
> > >
> > >
> > > The first bunch of patches just reorganizes the code, so memory related
> > >
> > > operation parameters are shared between all vhost_vdpa devices.  This is
> > >
> > > because the destination does not know what vhost_vdpa struct will have the
> > >
> > > registered listener member, so it is easier to place them in a shared struct
> > >
> > > rather to keep them in vhost_vdpa struct.  Future version may squash or omit
> > >
> > > these patches.
> > It looks this VhostVDPAShared facility (patch 1-13) is also what I need
> > in my SVQ descriptor group series [*], for which I've built similar
> > construct there. If possible please try to merge this in ASAP. I'll
> > rework my series on top of that.
> >
> > [*]
> > https://github.com/siwliu-kernel/qemu/commit/813518354af5ee8a6e867b2bf7dff3d6004fbcd5
> >
>
> I can send it individually, for sure.
>
> MST, Jason, can this first part be merged? It doesn't add a lot by
> itself but it helps pave the way for future changes.
>

I'm fine with that, or it might be easier if you resend it as a
standalone series.

Thanks
Eugenio Perez Martin Nov. 6, 2023, 9:04 a.m. UTC | #7
On Fri, Nov 3, 2023 at 9:41 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>
>
>
> On 11/2/2023 3:12 AM, Si-Wei Liu wrote:
> >
> >
> > On 10/19/2023 7:34 AM, Eugenio Pérez wrote:
> >> Current memory operations like pinning may take a lot of time at the
> >>
> >> destination.  Currently they are done after the source of the
> >> migration is
> >>
> >> stopped, and before the workload is resumed at the destination. This
> >> is a
> >>
> >> period where neigher traffic can flow, nor the VM workload can continue
> >>
> >> (downtime).
> >>
> >>
> >>
> >> We can do better as we know the memory layout of the guest RAM at the
> >>
> >> destination from the moment the migration starts.  Moving that
> >> operation allows
> >>
> >> QEMU to communicate the kernel the maps while the workload is still
> >> running in
> >>
> >> the source, so Linux can start mapping them.  Ideally, all IOMMU is
> >> configured,
> >>
> >> but if the vDPA parent driver uses on-chip IOMMU and .set_map we're
> >> still
> >>
> >> saving all the pinning time.
> > I get what you want to say, though not sure how pinning is relevant to
> > on-chip IOMMU and .set_map here, essentially pinning is required for
> > all parent vdpa drivers that perform DMA hence don't want VM pages to
> > move around.
> >>
> >>
> >>
> >> Note that further devices setup at the end of the migration may alter
> >> the guest
> >>
> >> memory layout. But same as the previous point, many operations are
> >> still done
> >>
> >> incrementally, like memory pinning, so we're saving time anyway.
> >>
> >>
> >>
> >> The first bunch of patches just reorganizes the code, so memory related
> >>
> >> operation parameters are shared between all vhost_vdpa devices. This is
> >>
> >> because the destination does not know what vhost_vdpa struct will
> >> have the
> >>
> >> registered listener member, so it is easier to place them in a shared
> >> struct
> >>
> >> rather to keep them in vhost_vdpa struct.  Future version may squash
> >> or omit
> >>
> >> these patches.
> > It looks this VhostVDPAShared facility (patch 1-13) is also what I
> > need in my SVQ descriptor group series [*], for which I've built
> > similar construct there. If possible please try to merge this in ASAP.
> > I'll rework my series on top of that.
> >
> > [*]
> > https://github.com/siwliu-kernel/qemu/commit/813518354af5ee8a6e867b2bf7dff3d6004fbcd5
> >
> >>
> >>
> >>
> >> Only tested with vdpa_sim. I'm sending this before full benchmark, as
> >> some work
> >>
> >> like [1] can be based on it, and Si-Wei agreed on benchmark this
> >> series with
> >>
> >> his experience.
> > Haven't done the full benchmark compared to pre-map at destination yet,
> Hi Eugenio,
>
> I just notice one thing that affects the performance benchmark for this
> series in terms of migration total_time (to be fair, it's mlx5_vdpa
> specific). It looks like iotlb map batching is not acked (via
> vhost_vdpa_set_backend_cap) at the point of vhost-vdpa_load_setup,
> effectively causing quite extensive time spent on hundreds of dma_map
> calls from listener_register().  While the equivalent code had been
> implemented in my destination pre-map patch [1]. Although I can
> benchmark the current patchset by remove batching from my code, I guess
> that's not the goal of this benchmark, right?
>
> If would be the best to have map batching in place, so benchmark for
> both options could match. What do you think?
>

This is definitely a bug I need to fix too, thanks for catching it!
:). But I'm not sure how this happens, as net_init_vhost_vdpa should
be called before .load_setup. I'll take a deeper look, thanks!

> Thanks,
> -Siwei
>
> [1]
> https://github.com/siwliu-kernel/qemu/commit/0ce225b0c7e618163ea09da3846c93c4de2f85ed#diff-45489c6f25dc36fd84e1cd28cbf3b8ff03301e2d24dadb6d1c334c9e8f14c00cR639
>
> > though an observation is that the destination QEMU seems very easy to
> > get stuck for very long time while in mid of pinning pages. During
> > this period, any client doing read-only QMP query or executing HMP
> > info command got frozen indefinitely (subject to how large size the
> > memory is being pinned). Is it possible to unblock those QMP request
> > or HMP command from being executed (at least the read-only ones) while
> > in migration? Yield from the load_setup corourtine and spawn another
> > thread?
> >
> > Having said, not sure if .load_setup is a good fit for what we want to
> > do. Searching all current users of .load_setup, either the job can be
> > done instantly or the task is time bound without trapping into kernel
> > for too long. Maybe pinning is too special use case here...
> >
> > -Siwei
> >>
> >>
> >>
> >> Future directions on top of this series may include:
> >>
> >> * Iterative migration of virtio-net devices, as it may reduce
> >> downtime per [1].
> >>
> >>    vhost-vdpa net can apply the configuration through CVQ in the
> >> destination
> >>
> >>    while the source is still migrating.
> >>
> >> * Move more things ahead of migration time, like DRIVER_OK.
> >>
> >> * Check that the devices of the destination are valid, and cancel the
> >> migration
> >>
> >>    in case it is not.
> >>
> >>
> >>
> >> [1]
> >> https://lore.kernel.org/qemu-devel/6c8ebb97-d546-3f1c-4cdd-54e23a566f61@nvidia.com/T/
> >>
> >>
> >>
> >> Eugenio Pérez (18):
> >>
> >>    vdpa: add VhostVDPAShared
> >>
> >>    vdpa: move iova tree to the shared struct
> >>
> >>    vdpa: move iova_range to vhost_vdpa_shared
> >>
> >>    vdpa: move shadow_data to vhost_vdpa_shared
> >>
> >>    vdpa: use vdpa shared for tracing
> >>
> >>    vdpa: move file descriptor to vhost_vdpa_shared
> >>
> >>    vdpa: move iotlb_batch_begin_sent to vhost_vdpa_shared
> >>
> >>    vdpa: move backend_cap to vhost_vdpa_shared
> >>
> >>    vdpa: remove msg type of vhost_vdpa
> >>
> >>    vdpa: move iommu_list to vhost_vdpa_shared
> >>
> >>    vdpa: use VhostVDPAShared in vdpa_dma_map and unmap
> >>
> >>    vdpa: use dev_shared in vdpa_iommu
> >>
> >>    vdpa: move memory listener to vhost_vdpa_shared
> >>
> >>    vdpa: do not set virtio status bits if unneeded
> >>
> >>    vdpa: add vhost_vdpa_load_setup
> >>
> >>    vdpa: add vhost_vdpa_net_load_setup NetClient callback
> >>
> >>    vdpa: use shadow_data instead of first device v->shadow_vqs_enabled
> >>
> >>    virtio_net: register incremental migration handlers
> >>
> >>
> >>
> >>   include/hw/virtio/vhost-vdpa.h |  43 +++++---
> >>
> >>   include/net/net.h              |   4 +
> >>
> >>   hw/net/virtio-net.c            |  23 +++++
> >>
> >>   hw/virtio/vdpa-dev.c           |   7 +-
> >>
> >>   hw/virtio/vhost-vdpa.c         | 183 ++++++++++++++++++---------------
> >>
> >>   net/vhost-vdpa.c               | 127 ++++++++++++-----------
> >>
> >>   hw/virtio/trace-events         |  14 +--
> >>
> >>   7 files changed, 239 insertions(+), 162 deletions(-)
> >>
> >>
> >>
> >
>
Eugenio Perez Martin Dec. 5, 2023, 2:23 p.m. UTC | #8
On Fri, Nov 3, 2023 at 9:19 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>
>
>
> On 11/2/2023 5:37 AM, Eugenio Perez Martin wrote:
> > On Thu, Nov 2, 2023 at 11:13 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>
> >>
> >> On 10/19/2023 7:34 AM, Eugenio Pérez wrote:
> >>> Current memory operations like pinning may take a lot of time at the
> >>>
> >>> destination.  Currently they are done after the source of the migration is
> >>>
> >>> stopped, and before the workload is resumed at the destination.  This is a
> >>>
> >>> period where neigher traffic can flow, nor the VM workload can continue
> >>>
> >>> (downtime).
> >>>
> >>>
> >>>
> >>> We can do better as we know the memory layout of the guest RAM at the
> >>>
> >>> destination from the moment the migration starts.  Moving that operation allows
> >>>
> >>> QEMU to communicate the kernel the maps while the workload is still running in
> >>>
> >>> the source, so Linux can start mapping them.  Ideally, all IOMMU is configured,
> >>>
> >>> but if the vDPA parent driver uses on-chip IOMMU and .set_map we're still
> >>>
> >>> saving all the pinning time.
> >> I get what you want to say, though not sure how pinning is relevant to
> >> on-chip IOMMU and .set_map here, essentially pinning is required for all
> >> parent vdpa drivers that perform DMA hence don't want VM pages to move
> >> around.
> > Basically highlighting that the work done under .set_map is not only
> > pinning, but it is a significant fraction on it. It can be reworded or
> > deleted for sure.
> >
> >>>
> >>>
> >>> Note that further devices setup at the end of the migration may alter the guest
> >>>
> >>> memory layout. But same as the previous point, many operations are still done
> >>>
> >>> incrementally, like memory pinning, so we're saving time anyway.
> >>>
> >>>
> >>>
> >>> The first bunch of patches just reorganizes the code, so memory related
> >>>
> >>> operation parameters are shared between all vhost_vdpa devices.  This is
> >>>
> >>> because the destination does not know what vhost_vdpa struct will have the
> >>>
> >>> registered listener member, so it is easier to place them in a shared struct
> >>>
> >>> rather to keep them in vhost_vdpa struct.  Future version may squash or omit
> >>>
> >>> these patches.
> >> It looks this VhostVDPAShared facility (patch 1-13) is also what I need
> >> in my SVQ descriptor group series [*], for which I've built similar
> >> construct there. If possible please try to merge this in ASAP. I'll
> >> rework my series on top of that.
> >>
> >> [*]
> >> https://github.com/siwliu-kernel/qemu/commit/813518354af5ee8a6e867b2bf7dff3d6004fbcd5
> >>
> > I can send it individually, for sure.
> >
> > MST, Jason, can this first part be merged? It doesn't add a lot by
> > itself but it helps pave the way for future changes.
> If it cannot, it doesn't matter. I can pick it from here and get my
> series posted with your patches 1-13 applied upfront. This should work,
> I think?
>
> >>>
> >>>
> >>> Only tested with vdpa_sim. I'm sending this before full benchmark, as some work
> >>>
> >>> like [1] can be based on it, and Si-Wei agreed on benchmark this series with
> >>>
> >>> his experience.
> >> Haven't done the full benchmark compared to pre-map at destination yet,
> >> though an observation is that the destination QEMU seems very easy to
> >> get stuck for very long time while in mid of pinning pages. During this
> >> period, any client doing read-only QMP query or executing HMP info
> >> command got frozen indefinitely (subject to how large size the memory is
> >> being pinned). Is it possible to unblock those QMP request or HMP
> >> command from being executed (at least the read-only ones) while in
> >> migration? Yield from the load_setup corourtine and spawn another thread?
> >>
> > Ok, I wasn't aware of that.
> >
> > I think we cannot yield in a coroutine and wait for an ioctl.
> I was wondering if we need a separate coroutine out of the general
> migration path to support this special code without overloading
> load_setup or its callers. For instance, unblock the source from sending
> guest rams while allow destination pin pages in parallel should be
> possible.
>

Hi Si-Wei,

I'm working on this; I think I'll be able to send a new version soon.
Just a question: when the mapping is done in vhost_vdpa_dev_start, as
current upstream master does, are you able to interact with QMP?

Thanks!

> Regardless, a separate thread is needed to carry out all the heavy
> lifting, i.e. ioctl(2) or write(2) syscalls to map&pin pages.
>
>
> > One
> > option that came to my mind is to effectively use another thread, and
> > use a POSIX barrier (or equivalent on glib / QEMU) before finishing
> > the migration.
> Yes, a separate thread is needed anyway.
>
> >   I'm not sure if there are more points where we can
> > check the barrier and tell the migration to continue or stop though.
> I think there is, for e.g. what if the dma_map fails. There must be a
> check point for that.
>
> >
> > Another option is to effectively start doing these ioctls in an
> > asynchronous way, io_uring cmds like, but I'd like to achieve this
> > first.
> Yes, io_uring or any async API could be another option. Though this
> needs new uAPI through additional kernel facility to support. Anyway,
> it's up to you to decide. :)
>
> Regards,
> -Siwei
> >> Having said, not sure if .load_setup is a good fit for what we want to
> >> do. Searching all current users of .load_setup, either the job can be
> >> done instantly or the task is time bound without trapping into kernel
> >> for too long. Maybe pinning is too special use case here...
> >>
> >> -Siwei
> >>>
> >>>
> >>> Future directions on top of this series may include:
> >>>
> >>> * Iterative migration of virtio-net devices, as it may reduce downtime per [1].
> >>>
> >>>     vhost-vdpa net can apply the configuration through CVQ in the destination
> >>>
> >>>     while the source is still migrating.
> >>>
> >>> * Move more things ahead of migration time, like DRIVER_OK.
> >>>
> >>> * Check that the devices of the destination are valid, and cancel the migration
> >>>
> >>>     in case it is not.
> >>>
> >>>
> >>>
> >>> [1] https://lore.kernel.org/qemu-devel/6c8ebb97-d546-3f1c-4cdd-54e23a566f61@nvidia.com/T/
> >>>
> >>>
> >>>
> >>> Eugenio Pérez (18):
> >>>
> >>>     vdpa: add VhostVDPAShared
> >>>
> >>>     vdpa: move iova tree to the shared struct
> >>>
> >>>     vdpa: move iova_range to vhost_vdpa_shared
> >>>
> >>>     vdpa: move shadow_data to vhost_vdpa_shared
> >>>
> >>>     vdpa: use vdpa shared for tracing
> >>>
> >>>     vdpa: move file descriptor to vhost_vdpa_shared
> >>>
> >>>     vdpa: move iotlb_batch_begin_sent to vhost_vdpa_shared
> >>>
> >>>     vdpa: move backend_cap to vhost_vdpa_shared
> >>>
> >>>     vdpa: remove msg type of vhost_vdpa
> >>>
> >>>     vdpa: move iommu_list to vhost_vdpa_shared
> >>>
> >>>     vdpa: use VhostVDPAShared in vdpa_dma_map and unmap
> >>>
> >>>     vdpa: use dev_shared in vdpa_iommu
> >>>
> >>>     vdpa: move memory listener to vhost_vdpa_shared
> >>>
> >>>     vdpa: do not set virtio status bits if unneeded
> >>>
> >>>     vdpa: add vhost_vdpa_load_setup
> >>>
> >>>     vdpa: add vhost_vdpa_net_load_setup NetClient callback
> >>>
> >>>     vdpa: use shadow_data instead of first device v->shadow_vqs_enabled
> >>>
> >>>     virtio_net: register incremental migration handlers
> >>>
> >>>
> >>>
> >>>    include/hw/virtio/vhost-vdpa.h |  43 +++++---
> >>>
> >>>    include/net/net.h              |   4 +
> >>>
> >>>    hw/net/virtio-net.c            |  23 +++++
> >>>
> >>>    hw/virtio/vdpa-dev.c           |   7 +-
> >>>
> >>>    hw/virtio/vhost-vdpa.c         | 183 ++++++++++++++++++---------------
> >>>
> >>>    net/vhost-vdpa.c               | 127 ++++++++++++-----------
> >>>
> >>>    hw/virtio/trace-events         |  14 +--
> >>>
> >>>    7 files changed, 239 insertions(+), 162 deletions(-)
> >>>
> >>>
> >>>
>
Si-Wei Liu Dec. 6, 2023, 12:36 a.m. UTC | #9
On 12/5/2023 6:23 AM, Eugenio Perez Martin wrote:
> On Fri, Nov 3, 2023 at 9:19 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>
>>
>> On 11/2/2023 5:37 AM, Eugenio Perez Martin wrote:
>>> On Thu, Nov 2, 2023 at 11:13 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>>
>>>> On 10/19/2023 7:34 AM, Eugenio Pérez wrote:
>>>>> Current memory operations like pinning may take a lot of time at the
>>>>>
>>>>> destination.  Currently they are done after the source of the migration is
>>>>>
>>>>> stopped, and before the workload is resumed at the destination.  This is a
>>>>>
>>>>> period where neigher traffic can flow, nor the VM workload can continue
>>>>>
>>>>> (downtime).
>>>>>
>>>>>
>>>>>
>>>>> We can do better as we know the memory layout of the guest RAM at the
>>>>>
>>>>> destination from the moment the migration starts.  Moving that operation allows
>>>>>
>>>>> QEMU to communicate the kernel the maps while the workload is still running in
>>>>>
>>>>> the source, so Linux can start mapping them.  Ideally, all IOMMU is configured,
>>>>>
>>>>> but if the vDPA parent driver uses on-chip IOMMU and .set_map we're still
>>>>>
>>>>> saving all the pinning time.
>>>> I get what you want to say, though not sure how pinning is relevant to
>>>> on-chip IOMMU and .set_map here, essentially pinning is required for all
>>>> parent vdpa drivers that perform DMA hence don't want VM pages to move
>>>> around.
>>> Basically highlighting that the work done under .set_map is not only
>>> pinning, but it is a significant fraction on it. It can be reworded or
>>> deleted for sure.
>>>
>>>>>
>>>>> Note that further devices setup at the end of the migration may alter the guest
>>>>>
>>>>> memory layout. But same as the previous point, many operations are still done
>>>>>
>>>>> incrementally, like memory pinning, so we're saving time anyway.
>>>>>
>>>>>
>>>>>
>>>>> The first bunch of patches just reorganizes the code, so memory related
>>>>>
>>>>> operation parameters are shared between all vhost_vdpa devices.  This is
>>>>>
>>>>> because the destination does not know what vhost_vdpa struct will have the
>>>>>
>>>>> registered listener member, so it is easier to place them in a shared struct
>>>>>
>>>>> rather to keep them in vhost_vdpa struct.  Future version may squash or omit
>>>>>
>>>>> these patches.
>>>> It looks this VhostVDPAShared facility (patch 1-13) is also what I need
>>>> in my SVQ descriptor group series [*], for which I've built similar
>>>> construct there. If possible please try to merge this in ASAP. I'll
>>>> rework my series on top of that.
>>>>
>>>> [*]
>>>> https://github.com/siwliu-kernel/qemu/commit/813518354af5ee8a6e867b2bf7dff3d6004fbcd5
>>>>
>>> I can send it individually, for sure.
>>>
>>> MST, Jason, can this first part be merged? It doesn't add a lot by
>>> itself but it helps pave the way for future changes.
>> If it cannot, it doesn't matter. I can pick it from here and get my
>> series posted with your patches 1-13 applied upfront. This should work,
>> I think?
>>
>>>>>
>>>>> Only tested with vdpa_sim. I'm sending this before full benchmark, as some work
>>>>>
>>>>> like [1] can be based on it, and Si-Wei agreed on benchmark this series with
>>>>>
>>>>> his experience.
>>>> Haven't done the full benchmark compared to pre-map at destination yet,
>>>> though an observation is that the destination QEMU seems very easy to
>>>> get stuck for very long time while in mid of pinning pages. During this
>>>> period, any client doing read-only QMP query or executing HMP info
>>>> command got frozen indefinitely (subject to how large size the memory is
>>>> being pinned). Is it possible to unblock those QMP request or HMP
>>>> command from being executed (at least the read-only ones) while in
>>>> migration? Yield from the load_setup corourtine and spawn another thread?
>>>>
>>> Ok, I wasn't aware of that.
>>>
>>> I think we cannot yield in a coroutine and wait for an ioctl.
>> I was wondering if we need a separate coroutine out of the general
>> migration path to support this special code without overloading
>> load_setup or its callers. For instance, unblock the source from sending
>> guest rams while allow destination pin pages in parallel should be
>> possible.
>>
> Hi Si-Wei,
>
> I'm working on this, I think I'll be able to send a new version soon.
> Just a question, when the mapping is done in vhost_vdpa_dev_start as
> the current upstream master does, are you able to interact with QMP?
Hi Eugenio,

Yes, the latest version works pretty well!  I did not get to all of the QMP 
commands, but at least I can do read-only QMP without a problem.  That 
addresses our typical usage.  Thanks for the prompt fix!

I've rebased my series on top of the .load_setup series instead of the top 
13 patches for 9.0, as there are some other dependent patches from this 
series; this avoids duplicate work.  I am debugging some problems I ran into 
after the code merge.  Once they are sorted out I'll post my patch series 
soon!

Thanks,
-Siwei



>
> Thanks!
>
>> Regardless, a separate thread is needed to carry out all the heavy
>> lifting, i.e. ioctl(2) or write(2) syscalls to map&pin pages.
>>
>>
>>> One
>>> option that came to my mind is to effectively use another thread, and
>>> use a POSIX barrier (or equivalent on glib / QEMU) before finishing
>>> the migration.
>> Yes, a separate thread is needed anyway.
>>
>>>    I'm not sure if there are more points where we can
>>> check the barrier and tell the migration to continue or stop though.
>> I think there is, for e.g. what if the dma_map fails. There must be a
>> check point for that.
>>
>>> Another option is to effectively start doing these ioctls in an
>>> asynchronous way, io_uring cmds like, but I'd like to achieve this
>>> first.
>> Yes, io_uring or any async API could be another option. Though this
>> needs new uAPI through additional kernel facility to support. Anyway,
>> it's up to you to decide. :)
>>
>> Regards,
>> -Siwei
>>>> Having said, not sure if .load_setup is a good fit for what we want to
>>>> do. Searching all current users of .load_setup, either the job can be
>>>> done instantly or the task is time bound without trapping into kernel
>>>> for too long. Maybe pinning is too special use case here...
>>>>
>>>> -Siwei
>>>>>
>>>>> Future directions on top of this series may include:
>>>>>
>>>>> * Iterative migration of virtio-net devices, as it may reduce downtime per [1].
>>>>>
>>>>>      vhost-vdpa net can apply the configuration through CVQ in the destination
>>>>>
>>>>>      while the source is still migrating.
>>>>>
>>>>> * Move more things ahead of migration time, like DRIVER_OK.
>>>>>
>>>>> * Check that the devices of the destination are valid, and cancel the migration
>>>>>
>>>>>      in case it is not.
>>>>>
>>>>>
>>>>>
>>>>> [1] https://lore.kernel.org/qemu-devel/6c8ebb97-d546-3f1c-4cdd-54e23a566f61@nvidia.com/T/
>>>>>
>>>>>
>>>>>
>>>>> Eugenio Pérez (18):
>>>>>
>>>>>      vdpa: add VhostVDPAShared
>>>>>
>>>>>      vdpa: move iova tree to the shared struct
>>>>>
>>>>>      vdpa: move iova_range to vhost_vdpa_shared
>>>>>
>>>>>      vdpa: move shadow_data to vhost_vdpa_shared
>>>>>
>>>>>      vdpa: use vdpa shared for tracing
>>>>>
>>>>>      vdpa: move file descriptor to vhost_vdpa_shared
>>>>>
>>>>>      vdpa: move iotlb_batch_begin_sent to vhost_vdpa_shared
>>>>>
>>>>>      vdpa: move backend_cap to vhost_vdpa_shared
>>>>>
>>>>>      vdpa: remove msg type of vhost_vdpa
>>>>>
>>>>>      vdpa: move iommu_list to vhost_vdpa_shared
>>>>>
>>>>>      vdpa: use VhostVDPAShared in vdpa_dma_map and unmap
>>>>>
>>>>>      vdpa: use dev_shared in vdpa_iommu
>>>>>
>>>>>      vdpa: move memory listener to vhost_vdpa_shared
>>>>>
>>>>>      vdpa: do not set virtio status bits if unneeded
>>>>>
>>>>>      vdpa: add vhost_vdpa_load_setup
>>>>>
>>>>>      vdpa: add vhost_vdpa_net_load_setup NetClient callback
>>>>>
>>>>>      vdpa: use shadow_data instead of first device v->shadow_vqs_enabled
>>>>>
>>>>>      virtio_net: register incremental migration handlers
>>>>>
>>>>>
>>>>>
>>>>>     include/hw/virtio/vhost-vdpa.h |  43 +++++---
>>>>>
>>>>>     include/net/net.h              |   4 +
>>>>>
>>>>>     hw/net/virtio-net.c            |  23 +++++
>>>>>
>>>>>     hw/virtio/vdpa-dev.c           |   7 +-
>>>>>
>>>>>     hw/virtio/vhost-vdpa.c         | 183 ++++++++++++++++++---------------
>>>>>
>>>>>     net/vhost-vdpa.c               | 127 ++++++++++++-----------
>>>>>
>>>>>     hw/virtio/trace-events         |  14 +--
>>>>>
>>>>>     7 files changed, 239 insertions(+), 162 deletions(-)
>>>>>
>>>>>
>>>>>