mbox series

[RFC,0/2] Identify aliased maps in vdpa SVQ iova_tree

Message ID 20240410100345.389462-1-eperezma@redhat.com
Headers show
Series Identify aliased maps in vdpa SVQ iova_tree | expand

Message

Eugenio Perez Martin April 10, 2024, 10:03 a.m. UTC
The guest may have overlapped memory regions, where different GPA leads
to the same HVA.  This causes a problem when overlapped regions
(different GPA but same translated HVA) exists in the tree, as looking
them by HVA will return them twice.

To solve this, track GPA in the DMA entry that acs as unique identifiers
to the maps.  When the map needs to be removed, iova tree is able to
find the right one.

Users that does not go to this extra layer of indirection can use the
iova tree as usual, with id = 0.

This was found by Si-Wei Liu <si-wei.liu@oracle.com>, but I'm having a hard
time to reproduce the issue.  This has been tested only without overlapping
maps.  If it works with overlapping maps, it will be intergrated in the main
series.

Comments are welcome.  Thanks!

Eugenio Pérez (2):
  iova_tree: add an id member to DMAMap
  vdpa: identify aliased maps in iova_tree

 hw/virtio/vhost-vdpa.c   | 2 ++
 include/qemu/iova-tree.h | 5 +++--
 util/iova-tree.c         | 3 ++-
 3 files changed, 7 insertions(+), 3 deletions(-)

Comments

Jason Wang April 12, 2024, 6:46 a.m. UTC | #1
On Wed, Apr 10, 2024 at 6:03 PM Eugenio Pérez <eperezma@redhat.com> wrote:
>
> The guest may have overlapped memory regions, where different GPA leads
> to the same HVA.  This causes a problem when overlapped regions
> (different GPA but same translated HVA) exists in the tree, as looking
> them by HVA will return them twice.

I think I don't understand if there's any side effect for shadow virtqueue?

Thanks

>
> To solve this, track GPA in the DMA entry that acs as unique identifiers
> to the maps.  When the map needs to be removed, iova tree is able to
> find the right one.
>
> Users that does not go to this extra layer of indirection can use the
> iova tree as usual, with id = 0.
>
> This was found by Si-Wei Liu <si-wei.liu@oracle.com>, but I'm having a hard
> time to reproduce the issue.  This has been tested only without overlapping
> maps.  If it works with overlapping maps, it will be intergrated in the main
> series.
>
> Comments are welcome.  Thanks!
>
> Eugenio Pérez (2):
>   iova_tree: add an id member to DMAMap
>   vdpa: identify aliased maps in iova_tree
>
>  hw/virtio/vhost-vdpa.c   | 2 ++
>  include/qemu/iova-tree.h | 5 +++--
>  util/iova-tree.c         | 3 ++-
>  3 files changed, 7 insertions(+), 3 deletions(-)
>
> --
> 2.44.0
>
Eugenio Perez Martin April 12, 2024, 7:56 a.m. UTC | #2
On Fri, Apr 12, 2024 at 8:47 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Wed, Apr 10, 2024 at 6:03 PM Eugenio Pérez <eperezma@redhat.com> wrote:
> >
> > The guest may have overlapped memory regions, where different GPA leads
> > to the same HVA.  This causes a problem when overlapped regions
> > (different GPA but same translated HVA) exists in the tree, as looking
> > them by HVA will return them twice.
>
> I think I don't understand if there's any side effect for shadow virtqueue?
>

My bad, I totally forgot to put a reference to where this comes from.

Si-Wei found that during initialization this sequences of maps /
unmaps happens [1]:

HVA                    GPA                IOVA
-------------------------------------------------------------------------------------------------------------------------
Map
[0x7f7903e00000, 0x7f7983e00000)    [0x0, 0x80000000) [0x1000, 0x80000000)
[0x7f7983e00000, 0x7f9903e00000)    [0x100000000, 0x2080000000)
[0x80001000, 0x2000001000)
[0x7f7903ea0000, 0x7f7903ec0000)    [0xfeda0000, 0xfedc0000)
[0x2000001000, 0x2000021000)

Unmap
[0x7f7903ea0000, 0x7f7903ec0000)    [0xfeda0000, 0xfedc0000) [0x1000,
0x20000) ???

The third HVA range is contained in the first one, but exposed under a
different GVA (aliased). This is not "flattened" by QEMU, as GPA does
not overlap, only HVA.

At the third chunk unmap, the current algorithm finds the first chunk,
not the second one. This series is the way to tell the difference at
unmap time.

[1] https://lists.nongnu.org/archive/html/qemu-devel/2024-04/msg00079.html

Thanks!

> Thanks
>
> >
> > To solve this, track GPA in the DMA entry that acs as unique identifiers
> > to the maps.  When the map needs to be removed, iova tree is able to
> > find the right one.
> >
> > Users that does not go to this extra layer of indirection can use the
> > iova tree as usual, with id = 0.
> >
> > This was found by Si-Wei Liu <si-wei.liu@oracle.com>, but I'm having a hard
> > time to reproduce the issue.  This has been tested only without overlapping
> > maps.  If it works with overlapping maps, it will be intergrated in the main
> > series.
> >
> > Comments are welcome.  Thanks!
> >
> > Eugenio Pérez (2):
> >   iova_tree: add an id member to DMAMap
> >   vdpa: identify aliased maps in iova_tree
> >
> >  hw/virtio/vhost-vdpa.c   | 2 ++
> >  include/qemu/iova-tree.h | 5 +++--
> >  util/iova-tree.c         | 3 ++-
> >  3 files changed, 7 insertions(+), 3 deletions(-)
> >
> > --
> > 2.44.0
> >
>
Jason Wang May 7, 2024, 7:29 a.m. UTC | #3
On Fri, Apr 12, 2024 at 3:56 PM Eugenio Perez Martin
<eperezma@redhat.com> wrote:
>
> On Fri, Apr 12, 2024 at 8:47 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Wed, Apr 10, 2024 at 6:03 PM Eugenio Pérez <eperezma@redhat.com> wrote:
> > >
> > > The guest may have overlapped memory regions, where different GPA leads
> > > to the same HVA.  This causes a problem when overlapped regions
> > > (different GPA but same translated HVA) exists in the tree, as looking
> > > them by HVA will return them twice.
> >
> > I think I don't understand if there's any side effect for shadow virtqueue?
> >
>
> My bad, I totally forgot to put a reference to where this comes from.
>
> Si-Wei found that during initialization this sequences of maps /
> unmaps happens [1]:
>
> HVA                    GPA                IOVA
> -------------------------------------------------------------------------------------------------------------------------
> Map
> [0x7f7903e00000, 0x7f7983e00000)    [0x0, 0x80000000) [0x1000, 0x80000000)
> [0x7f7983e00000, 0x7f9903e00000)    [0x100000000, 0x2080000000)
> [0x80001000, 0x2000001000)
> [0x7f7903ea0000, 0x7f7903ec0000)    [0xfeda0000, 0xfedc0000)
> [0x2000001000, 0x2000021000)
>
> Unmap
> [0x7f7903ea0000, 0x7f7903ec0000)    [0xfeda0000, 0xfedc0000) [0x1000,
> 0x20000) ???
>
> The third HVA range is contained in the first one, but exposed under a
> different GVA (aliased). This is not "flattened" by QEMU, as GPA does
> not overlap, only HVA.
>
> At the third chunk unmap, the current algorithm finds the first chunk,
> not the second one. This series is the way to tell the difference at
> unmap time.
>
> [1] https://lists.nongnu.org/archive/html/qemu-devel/2024-04/msg00079.html
>
> Thanks!

Ok, I was wondering if we need to store GPA(GIOVA) to HVA mappings in
the iova tree to solve this issue completely. Then there won't be
aliasing issues.

Thanks

>
> > Thanks
> >
> > >
> > > To solve this, track GPA in the DMA entry that acs as unique identifiers
> > > to the maps.  When the map needs to be removed, iova tree is able to
> > > find the right one.
> > >
> > > Users that does not go to this extra layer of indirection can use the
> > > iova tree as usual, with id = 0.
> > >
> > > This was found by Si-Wei Liu <si-wei.liu@oracle.com>, but I'm having a hard
> > > time to reproduce the issue.  This has been tested only without overlapping
> > > maps.  If it works with overlapping maps, it will be intergrated in the main
> > > series.
> > >
> > > Comments are welcome.  Thanks!
> > >
> > > Eugenio Pérez (2):
> > >   iova_tree: add an id member to DMAMap
> > >   vdpa: identify aliased maps in iova_tree
> > >
> > >  hw/virtio/vhost-vdpa.c   | 2 ++
> > >  include/qemu/iova-tree.h | 5 +++--
> > >  util/iova-tree.c         | 3 ++-
> > >  3 files changed, 7 insertions(+), 3 deletions(-)
> > >
> > > --
> > > 2.44.0
> > >
> >
>
Eugenio Perez Martin May 7, 2024, 10:56 a.m. UTC | #4
On Tue, May 7, 2024 at 9:29 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Fri, Apr 12, 2024 at 3:56 PM Eugenio Perez Martin
> <eperezma@redhat.com> wrote:
> >
> > On Fri, Apr 12, 2024 at 8:47 AM Jason Wang <jasowang@redhat.com> wrote:
> > >
> > > On Wed, Apr 10, 2024 at 6:03 PM Eugenio Pérez <eperezma@redhat.com> wrote:
> > > >
> > > > The guest may have overlapped memory regions, where different GPA leads
> > > > to the same HVA.  This causes a problem when overlapped regions
> > > > (different GPA but same translated HVA) exists in the tree, as looking
> > > > them by HVA will return them twice.
> > >
> > > I think I don't understand if there's any side effect for shadow virtqueue?
> > >
> >
> > My bad, I totally forgot to put a reference to where this comes from.
> >
> > Si-Wei found that during initialization this sequences of maps /
> > unmaps happens [1]:
> >
> > HVA                    GPA                IOVA
> > -------------------------------------------------------------------------------------------------------------------------
> > Map
> > [0x7f7903e00000, 0x7f7983e00000)    [0x0, 0x80000000) [0x1000, 0x80000000)
> > [0x7f7983e00000, 0x7f9903e00000)    [0x100000000, 0x2080000000)
> > [0x80001000, 0x2000001000)
> > [0x7f7903ea0000, 0x7f7903ec0000)    [0xfeda0000, 0xfedc0000)
> > [0x2000001000, 0x2000021000)
> >
> > Unmap
> > [0x7f7903ea0000, 0x7f7903ec0000)    [0xfeda0000, 0xfedc0000) [0x1000,
> > 0x20000) ???
> >
> > The third HVA range is contained in the first one, but exposed under a
> > different GVA (aliased). This is not "flattened" by QEMU, as GPA does
> > not overlap, only HVA.
> >
> > At the third chunk unmap, the current algorithm finds the first chunk,
> > not the second one. This series is the way to tell the difference at
> > unmap time.
> >
> > [1] https://lists.nongnu.org/archive/html/qemu-devel/2024-04/msg00079.html
> >
> > Thanks!
>
> Ok, I was wondering if we need to store GPA(GIOVA) to HVA mappings in
> the iova tree to solve this issue completely. Then there won't be
> aliasing issues.
>

I'm ok to explore that route but this has another problem. Both SVQ
vrings and CVQ buffers also need to be addressable by VhostIOVATree,
and they do not have GPA.

At this moment vhost_svq_translate_addr is able to handle this
transparently as we translate vaddr to SVQ IOVA. How can we store
these new entries? Maybe a (hwaddr)-1 GPA to signal it has no GPA and
then a list to go through other entries (SVQ vaddr and CVQ buffers).

Thanks!

> Thanks
>
> >
> > > Thanks
> > >
> > > >
> > > > To solve this, track GPA in the DMA entry that acs as unique identifiers
> > > > to the maps.  When the map needs to be removed, iova tree is able to
> > > > find the right one.
> > > >
> > > > Users that does not go to this extra layer of indirection can use the
> > > > iova tree as usual, with id = 0.
> > > >
> > > > This was found by Si-Wei Liu <si-wei.liu@oracle.com>, but I'm having a hard
> > > > time to reproduce the issue.  This has been tested only without overlapping
> > > > maps.  If it works with overlapping maps, it will be intergrated in the main
> > > > series.
> > > >
> > > > Comments are welcome.  Thanks!
> > > >
> > > > Eugenio Pérez (2):
> > > >   iova_tree: add an id member to DMAMap
> > > >   vdpa: identify aliased maps in iova_tree
> > > >
> > > >  hw/virtio/vhost-vdpa.c   | 2 ++
> > > >  include/qemu/iova-tree.h | 5 +++--
> > > >  util/iova-tree.c         | 3 ++-
> > > >  3 files changed, 7 insertions(+), 3 deletions(-)
> > > >
> > > > --
> > > > 2.44.0
> > > >
> > >
> >
>
Jason Wang May 8, 2024, 2:29 a.m. UTC | #5
On Tue, May 7, 2024 at 6:57 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
>
> On Tue, May 7, 2024 at 9:29 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Fri, Apr 12, 2024 at 3:56 PM Eugenio Perez Martin
> > <eperezma@redhat.com> wrote:
> > >
> > > On Fri, Apr 12, 2024 at 8:47 AM Jason Wang <jasowang@redhat.com> wrote:
> > > >
> > > > On Wed, Apr 10, 2024 at 6:03 PM Eugenio Pérez <eperezma@redhat.com> wrote:
> > > > >
> > > > > The guest may have overlapped memory regions, where different GPA leads
> > > > > to the same HVA.  This causes a problem when overlapped regions
> > > > > (different GPA but same translated HVA) exists in the tree, as looking
> > > > > them by HVA will return them twice.
> > > >
> > > > I think I don't understand if there's any side effect for shadow virtqueue?
> > > >
> > >
> > > My bad, I totally forgot to put a reference to where this comes from.
> > >
> > > Si-Wei found that during initialization this sequences of maps /
> > > unmaps happens [1]:
> > >
> > > HVA                    GPA                IOVA
> > > -------------------------------------------------------------------------------------------------------------------------
> > > Map
> > > [0x7f7903e00000, 0x7f7983e00000)    [0x0, 0x80000000) [0x1000, 0x80000000)
> > > [0x7f7983e00000, 0x7f9903e00000)    [0x100000000, 0x2080000000)
> > > [0x80001000, 0x2000001000)
> > > [0x7f7903ea0000, 0x7f7903ec0000)    [0xfeda0000, 0xfedc0000)
> > > [0x2000001000, 0x2000021000)
> > >
> > > Unmap
> > > [0x7f7903ea0000, 0x7f7903ec0000)    [0xfeda0000, 0xfedc0000) [0x1000,
> > > 0x20000) ???
> > >
> > > The third HVA range is contained in the first one, but exposed under a
> > > different GVA (aliased). This is not "flattened" by QEMU, as GPA does
> > > not overlap, only HVA.
> > >
> > > At the third chunk unmap, the current algorithm finds the first chunk,
> > > not the second one. This series is the way to tell the difference at
> > > unmap time.
> > >
> > > [1] https://lists.nongnu.org/archive/html/qemu-devel/2024-04/msg00079.html
> > >
> > > Thanks!
> >
> > Ok, I was wondering if we need to store GPA(GIOVA) to HVA mappings in
> > the iova tree to solve this issue completely. Then there won't be
> > aliasing issues.
> >
>
> I'm ok to explore that route but this has another problem. Both SVQ
> vrings and CVQ buffers also need to be addressable by VhostIOVATree,
> and they do not have GPA.
>
> At this moment vhost_svq_translate_addr is able to handle this
> transparently as we translate vaddr to SVQ IOVA. How can we store
> these new entries? Maybe a (hwaddr)-1 GPA to signal it has no GPA and
> then a list to go through other entries (SVQ vaddr and CVQ buffers).

This seems to be tricky.

As discussed, it could be another iova tree.

Thanks

>
> Thanks!
>
> > Thanks
> >
> > >
> > > > Thanks
> > > >
> > > > >
> > > > > To solve this, track GPA in the DMA entry that acs as unique identifiers
> > > > > to the maps.  When the map needs to be removed, iova tree is able to
> > > > > find the right one.
> > > > >
> > > > > Users that does not go to this extra layer of indirection can use the
> > > > > iova tree as usual, with id = 0.
> > > > >
> > > > > This was found by Si-Wei Liu <si-wei.liu@oracle.com>, but I'm having a hard
> > > > > time to reproduce the issue.  This has been tested only without overlapping
> > > > > maps.  If it works with overlapping maps, it will be intergrated in the main
> > > > > series.
> > > > >
> > > > > Comments are welcome.  Thanks!
> > > > >
> > > > > Eugenio Pérez (2):
> > > > >   iova_tree: add an id member to DMAMap
> > > > >   vdpa: identify aliased maps in iova_tree
> > > > >
> > > > >  hw/virtio/vhost-vdpa.c   | 2 ++
> > > > >  include/qemu/iova-tree.h | 5 +++--
> > > > >  util/iova-tree.c         | 3 ++-
> > > > >  3 files changed, 7 insertions(+), 3 deletions(-)
> > > > >
> > > > > --
> > > > > 2.44.0
> > > > >
> > > >
> > >
> >
>
Eugenio Perez Martin May 8, 2024, 5:15 p.m. UTC | #6
On Wed, May 8, 2024 at 4:29 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Tue, May 7, 2024 at 6:57 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
> >
> > On Tue, May 7, 2024 at 9:29 AM Jason Wang <jasowang@redhat.com> wrote:
> > >
> > > On Fri, Apr 12, 2024 at 3:56 PM Eugenio Perez Martin
> > > <eperezma@redhat.com> wrote:
> > > >
> > > > On Fri, Apr 12, 2024 at 8:47 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > >
> > > > > On Wed, Apr 10, 2024 at 6:03 PM Eugenio Pérez <eperezma@redhat.com> wrote:
> > > > > >
> > > > > > The guest may have overlapped memory regions, where different GPA leads
> > > > > > to the same HVA.  This causes a problem when overlapped regions
> > > > > > (different GPA but same translated HVA) exists in the tree, as looking
> > > > > > them by HVA will return them twice.
> > > > >
> > > > > I think I don't understand if there's any side effect for shadow virtqueue?
> > > > >
> > > >
> > > > My bad, I totally forgot to put a reference to where this comes from.
> > > >
> > > > Si-Wei found that during initialization this sequences of maps /
> > > > unmaps happens [1]:
> > > >
> > > > HVA                    GPA                IOVA
> > > > -------------------------------------------------------------------------------------------------------------------------
> > > > Map
> > > > [0x7f7903e00000, 0x7f7983e00000)    [0x0, 0x80000000) [0x1000, 0x80000000)
> > > > [0x7f7983e00000, 0x7f9903e00000)    [0x100000000, 0x2080000000)
> > > > [0x80001000, 0x2000001000)
> > > > [0x7f7903ea0000, 0x7f7903ec0000)    [0xfeda0000, 0xfedc0000)
> > > > [0x2000001000, 0x2000021000)
> > > >
> > > > Unmap
> > > > [0x7f7903ea0000, 0x7f7903ec0000)    [0xfeda0000, 0xfedc0000) [0x1000,
> > > > 0x20000) ???
> > > >
> > > > The third HVA range is contained in the first one, but exposed under a
> > > > different GVA (aliased). This is not "flattened" by QEMU, as GPA does
> > > > not overlap, only HVA.
> > > >
> > > > At the third chunk unmap, the current algorithm finds the first chunk,
> > > > not the second one. This series is the way to tell the difference at
> > > > unmap time.
> > > >
> > > > [1] https://lists.nongnu.org/archive/html/qemu-devel/2024-04/msg00079.html
> > > >
> > > > Thanks!
> > >
> > > Ok, I was wondering if we need to store GPA(GIOVA) to HVA mappings in
> > > the iova tree to solve this issue completely. Then there won't be
> > > aliasing issues.
> > >
> >
> > I'm ok to explore that route but this has another problem. Both SVQ
> > vrings and CVQ buffers also need to be addressable by VhostIOVATree,
> > and they do not have GPA.
> >
> > At this moment vhost_svq_translate_addr is able to handle this
> > transparently as we translate vaddr to SVQ IOVA. How can we store
> > these new entries? Maybe a (hwaddr)-1 GPA to signal it has no GPA and
> > then a list to go through other entries (SVQ vaddr and CVQ buffers).
>
> This seems to be tricky.
>
> As discussed, it could be another iova tree.
>

Yes but there are many ways to add another IOVATree. Let me expand & recap.

Option 1 is to simply add another iova tree to VhostShadowVirtqueue.
Let's call it gpa_iova_tree, as opposed to the current iova_tree that
translates from vaddr to SVQ IOVA. To know which one to use is easy at
adding or removing, like in the memory listener, but how to know at
vhost_svq_translate_addr?

The easiest way for me is to rely on memory_region_from_host(). When
vaddr is from the guest, it returns a valid MemoryRegion. When it is
not, it returns NULL. I'm not sure if this is a valid use case, it
just worked in my tests so far.

Now we have the second problem: The GPA values of the regions of the
two IOVA tree must be unique. We need to be able to find unallocated
regions in SVQ IOVA. At this moment there is only one IOVATree, so
this is done easily by vhost_iova_tree_map_alloc. But it is very
complicated with two trees.

Option 2a is to add another IOVATree in VhostIOVATree. I think the
easiest way is to keep the GPA -> SVQ IOVA in one tree, let's call it
iova_gpa_map, and the current vaddr -> SVQ IOVA tree in
iova_taddr_map. This second tree should contain both vaddr memory that
belongs to the guest and host-only vaddr like vrings and CVQ buffers.

How to pass the GPA to VhostIOVATree API? To add it to DMAMap is
complicated, as it is shared with intel_iommu. We can add new
functions to VhostIOVATree that accepts vaddr plus GPA, or maybe it is
enough with GPA only. It should be functions to add, remove, and
allocate new entries. But vaddr ones must be kept, because buffers
might be host-only.

Then the caller can choose which version to call: for adding and
removing guest memory from the memory listener, the GPA variant.
Adding SVQ vrings and CVQ buffers should use the current vaddr
versions. vhost_svq_translate_addr still needs to use
memory_region_from_host() to know which one to call.

Although I didn't like this approach because it complicates
VhostIOVATree, I think it is the easier way except for option 4, I'll
explain later.

This has an extra advantage: currently, the lookup in
vhost_svq_translate_addr is linear, O(1). This would allow us to use
the tree properly.

Option 2b could be to keep them totally separated. So current
VhostIOVATree->iova_taddr_map only contains host-only entries, and the
new iova_gpa_map containst the guest entries. I don't think this case
adds anything except less memory usage, as the gpa map (which should
be the fastest) will be the same size. Also, it makes it difficult to
implement vhost_iova_tree_map_alloc.

Option 3 is to not add new functions but extend the current ones. That
would require special values of GPA values to indicate no GPA, like
SVQ vrings. I think option 2a is better, but this may help to keep the
interface simpler.

Option 4 is what I'm proposing in this RFC. To store the GPA as map id
so we can tell if the vaddr corresponds to one SVQ IOVA or another.
Now I'm having trouble retrieving the memory section I see in the
memory listener. It should not be so difficult but. The main advantage
is not to duplicate data structs that are already in QEMU, but maybe
it is not worth the effort.

Going further with this option, we could add a flag to ignore the .id
member added. But it adds more and more complexity to the API so I
would prefer option 2a for this.

> Thanks
>
> >
> > Thanks!
> >
> > > Thanks
> > >
> > > >
> > > > > Thanks
> > > > >
> > > > > >
> > > > > > To solve this, track GPA in the DMA entry that acs as unique identifiers
> > > > > > to the maps.  When the map needs to be removed, iova tree is able to
> > > > > > find the right one.
> > > > > >
> > > > > > Users that does not go to this extra layer of indirection can use the
> > > > > > iova tree as usual, with id = 0.
> > > > > >
> > > > > > This was found by Si-Wei Liu <si-wei.liu@oracle.com>, but I'm having a hard
> > > > > > time to reproduce the issue.  This has been tested only without overlapping
> > > > > > maps.  If it works with overlapping maps, it will be intergrated in the main
> > > > > > series.
> > > > > >
> > > > > > Comments are welcome.  Thanks!
> > > > > >
> > > > > > Eugenio Pérez (2):
> > > > > >   iova_tree: add an id member to DMAMap
> > > > > >   vdpa: identify aliased maps in iova_tree
> > > > > >
> > > > > >  hw/virtio/vhost-vdpa.c   | 2 ++
> > > > > >  include/qemu/iova-tree.h | 5 +++--
> > > > > >  util/iova-tree.c         | 3 ++-
> > > > > >  3 files changed, 7 insertions(+), 3 deletions(-)
> > > > > >
> > > > > > --
> > > > > > 2.44.0
> > > > > >
> > > > >
> > > >
> > >
> >
>
Jason Wang May 9, 2024, 6:27 a.m. UTC | #7
On Thu, May 9, 2024 at 1:16 AM Eugenio Perez Martin <eperezma@redhat.com> wrote:
>
> On Wed, May 8, 2024 at 4:29 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Tue, May 7, 2024 at 6:57 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
> > >
> > > On Tue, May 7, 2024 at 9:29 AM Jason Wang <jasowang@redhat.com> wrote:
> > > >
> > > > On Fri, Apr 12, 2024 at 3:56 PM Eugenio Perez Martin
> > > > <eperezma@redhat.com> wrote:
> > > > >
> > > > > On Fri, Apr 12, 2024 at 8:47 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > >
> > > > > > On Wed, Apr 10, 2024 at 6:03 PM Eugenio Pérez <eperezma@redhat.com> wrote:
> > > > > > >
> > > > > > > The guest may have overlapped memory regions, where different GPA leads
> > > > > > > to the same HVA.  This causes a problem when overlapped regions
> > > > > > > (different GPA but same translated HVA) exists in the tree, as looking
> > > > > > > them by HVA will return them twice.
> > > > > >
> > > > > > I think I don't understand if there's any side effect for shadow virtqueue?
> > > > > >
> > > > >
> > > > > My bad, I totally forgot to put a reference to where this comes from.
> > > > >
> > > > > Si-Wei found that during initialization this sequences of maps /
> > > > > unmaps happens [1]:
> > > > >
> > > > > HVA                    GPA                IOVA
> > > > > -------------------------------------------------------------------------------------------------------------------------
> > > > > Map
> > > > > [0x7f7903e00000, 0x7f7983e00000)    [0x0, 0x80000000) [0x1000, 0x80000000)
> > > > > [0x7f7983e00000, 0x7f9903e00000)    [0x100000000, 0x2080000000)
> > > > > [0x80001000, 0x2000001000)
> > > > > [0x7f7903ea0000, 0x7f7903ec0000)    [0xfeda0000, 0xfedc0000)
> > > > > [0x2000001000, 0x2000021000)
> > > > >
> > > > > Unmap
> > > > > [0x7f7903ea0000, 0x7f7903ec0000)    [0xfeda0000, 0xfedc0000) [0x1000,
> > > > > 0x20000) ???
> > > > >
> > > > > The third HVA range is contained in the first one, but exposed under a
> > > > > different GVA (aliased). This is not "flattened" by QEMU, as GPA does
> > > > > not overlap, only HVA.
> > > > >
> > > > > At the third chunk unmap, the current algorithm finds the first chunk,
> > > > > not the second one. This series is the way to tell the difference at
> > > > > unmap time.
> > > > >
> > > > > [1] https://lists.nongnu.org/archive/html/qemu-devel/2024-04/msg00079.html
> > > > >
> > > > > Thanks!
> > > >
> > > > Ok, I was wondering if we need to store GPA(GIOVA) to HVA mappings in
> > > > the iova tree to solve this issue completely. Then there won't be
> > > > aliasing issues.
> > > >
> > >
> > > I'm ok to explore that route but this has another problem. Both SVQ
> > > vrings and CVQ buffers also need to be addressable by VhostIOVATree,
> > > and they do not have GPA.
> > >
> > > At this moment vhost_svq_translate_addr is able to handle this
> > > transparently as we translate vaddr to SVQ IOVA. How can we store
> > > these new entries? Maybe a (hwaddr)-1 GPA to signal it has no GPA and
> > > then a list to go through other entries (SVQ vaddr and CVQ buffers).
> >
> > This seems to be tricky.
> >
> > As discussed, it could be another iova tree.
> >
>
> Yes but there are many ways to add another IOVATree. Let me expand & recap.
>
> Option 1 is to simply add another iova tree to VhostShadowVirtqueue.
> Let's call it gpa_iova_tree, as opposed to the current iova_tree that
> translates from vaddr to SVQ IOVA. To know which one to use is easy at
> adding or removing, like in the memory listener, but how to know at
> vhost_svq_translate_addr?

Then we won't use virtqueue_pop() at all, we need a SVQ version of
virtqueue_pop() to translate GPA to SVQ IOVA directly?

>
> The easiest way for me is to rely on memory_region_from_host(). When
> vaddr is from the guest, it returns a valid MemoryRegion. When it is
> not, it returns NULL. I'm not sure if this is a valid use case, it
> just worked in my tests so far.
>
> Now we have the second problem: The GPA values of the regions of the
> two IOVA tree must be unique. We need to be able to find unallocated
> regions in SVQ IOVA. At this moment there is only one IOVATree, so
> this is done easily by vhost_iova_tree_map_alloc. But it is very
> complicated with two trees.

Would it be simpler if we decouple the IOVA allocator? For example, we
can have a dedicated gtree to track the allocated IOVA ranges. It is
shared by both

1) Guest memory (GPA)
2) SVQ virtqueue and buffers

And another gtree to track the GPA to IOVA.

The SVQ code could use either

1) one linear mappings that contains both SVQ virtqueue and buffers

or

2) dynamic IOVA allocation/deallocation helpers

So we don't actually need the third gtree for SVQ HVA -> SVQ IOVA?

>
> Option 2a is to add another IOVATree in VhostIOVATree. I think the
> easiest way is to keep the GPA -> SVQ IOVA in one tree, let's call it
> iova_gpa_map, and the current vaddr -> SVQ IOVA tree in
> iova_taddr_map. This second tree should contain both vaddr memory that
> belongs to the guest and host-only vaddr like vrings and CVQ buffers.
>
> How to pass the GPA to VhostIOVATree API? To add it to DMAMap is
> complicated, as it is shared with intel_iommu. We can add new
> functions to VhostIOVATree that accepts vaddr plus GPA, or maybe it is
> enough with GPA only. It should be functions to add, remove, and
> allocate new entries. But vaddr ones must be kept, because buffers
> might be host-only.
>
> Then the caller can choose which version to call: for adding and
> removing guest memory from the memory listener, the GPA variant.
> Adding SVQ vrings and CVQ buffers should use the current vaddr
> versions. vhost_svq_translate_addr still needs to use
> memory_region_from_host() to know which one to call.

So the idea is, for region_del/region_add use iova_gpa_map? For the
SVQ translation, use the iova_taddr_map?

I suspect we still need to synchronize with those two trees so it
might be still problematic as iova_taddr_map contains the overlapped
regions.

>
> Although I didn't like this approach because it complicates
> VhostIOVATree, I think it is the easier way except for option 4, I'll
> explain later.
>
> This has an extra advantage: currently, the lookup in
> vhost_svq_translate_addr is linear, O(1). This would allow us to use
> the tree properly.

It uses g_tree_foreach() which I guess is not O(1).

>
> Option 2b could be to keep them totally separated. So current
> VhostIOVATree->iova_taddr_map only contains host-only entries, and the
> new iova_gpa_map containst the guest entries. I don't think this case
> adds anything except less memory usage, as the gpa map (which should
> be the fastest) will be the same size. Also, it makes it difficult to
> implement vhost_iova_tree_map_alloc.
>
> Option 3 is to not add new functions but extend the current ones. That
> would require special values of GPA values to indicate no GPA, like
> SVQ vrings. I think option 2a is better, but this may help to keep the
> interface simpler.
>
> Option 4 is what I'm proposing in this RFC. To store the GPA as map id
> so we can tell if the vaddr corresponds to one SVQ IOVA or another.
> Now I'm having trouble retrieving the memory section I see in the
> memory listener. It should not be so difficult but. The main advantage
> is not to duplicate data structs that are already in QEMU, but maybe
> it is not worth the effort.
>
> Going further with this option, we could add a flag to ignore the .id
> member added. But it adds more and more complexity to the API so I
> would prefer option 2a for this.

Thanks

>
> > Thanks
> >
> > >
> > > Thanks!
> > >
> > > > Thanks
> > > >
> > > > >
> > > > > > Thanks
> > > > > >
> > > > > > >
> > > > > > > To solve this, track GPA in the DMA entry that acs as unique identifiers
> > > > > > > to the maps.  When the map needs to be removed, iova tree is able to
> > > > > > > find the right one.
> > > > > > >
> > > > > > > Users that does not go to this extra layer of indirection can use the
> > > > > > > iova tree as usual, with id = 0.
> > > > > > >
> > > > > > > This was found by Si-Wei Liu <si-wei.liu@oracle.com>, but I'm having a hard
> > > > > > > time to reproduce the issue.  This has been tested only without overlapping
> > > > > > > maps.  If it works with overlapping maps, it will be intergrated in the main
> > > > > > > series.
> > > > > > >
> > > > > > > Comments are welcome.  Thanks!
> > > > > > >
> > > > > > > Eugenio Pérez (2):
> > > > > > >   iova_tree: add an id member to DMAMap
> > > > > > >   vdpa: identify aliased maps in iova_tree
> > > > > > >
> > > > > > >  hw/virtio/vhost-vdpa.c   | 2 ++
> > > > > > >  include/qemu/iova-tree.h | 5 +++--
> > > > > > >  util/iova-tree.c         | 3 ++-
> > > > > > >  3 files changed, 7 insertions(+), 3 deletions(-)
> > > > > > >
> > > > > > > --
> > > > > > > 2.44.0
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
Eugenio Perez Martin May 9, 2024, 7:10 a.m. UTC | #8
On Thu, May 9, 2024 at 8:27 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Thu, May 9, 2024 at 1:16 AM Eugenio Perez Martin <eperezma@redhat.com> wrote:
> >
> > On Wed, May 8, 2024 at 4:29 AM Jason Wang <jasowang@redhat.com> wrote:
> > >
> > > On Tue, May 7, 2024 at 6:57 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
> > > >
> > > > On Tue, May 7, 2024 at 9:29 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > >
> > > > > On Fri, Apr 12, 2024 at 3:56 PM Eugenio Perez Martin
> > > > > <eperezma@redhat.com> wrote:
> > > > > >
> > > > > > On Fri, Apr 12, 2024 at 8:47 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > >
> > > > > > > On Wed, Apr 10, 2024 at 6:03 PM Eugenio Pérez <eperezma@redhat.com> wrote:
> > > > > > > >
> > > > > > > > The guest may have overlapped memory regions, where different GPA leads
> > > > > > > > to the same HVA.  This causes a problem when overlapped regions
> > > > > > > > (different GPA but same translated HVA) exists in the tree, as looking
> > > > > > > > them by HVA will return them twice.
> > > > > > >
> > > > > > > I think I don't understand if there's any side effect for shadow virtqueue?
> > > > > > >
> > > > > >
> > > > > > My bad, I totally forgot to put a reference to where this comes from.
> > > > > >
> > > > > > Si-Wei found that during initialization this sequences of maps /
> > > > > > unmaps happens [1]:
> > > > > >
> > > > > > HVA                    GPA                IOVA
> > > > > > -------------------------------------------------------------------------------------------------------------------------
> > > > > > Map
> > > > > > [0x7f7903e00000, 0x7f7983e00000)    [0x0, 0x80000000) [0x1000, 0x80000000)
> > > > > > [0x7f7983e00000, 0x7f9903e00000)    [0x100000000, 0x2080000000)
> > > > > > [0x80001000, 0x2000001000)
> > > > > > [0x7f7903ea0000, 0x7f7903ec0000)    [0xfeda0000, 0xfedc0000)
> > > > > > [0x2000001000, 0x2000021000)
> > > > > >
> > > > > > Unmap
> > > > > > [0x7f7903ea0000, 0x7f7903ec0000)    [0xfeda0000, 0xfedc0000) [0x1000,
> > > > > > 0x20000) ???
> > > > > >
> > > > > > The third HVA range is contained in the first one, but exposed under a
> > > > > > different GVA (aliased). This is not "flattened" by QEMU, as GPA does
> > > > > > not overlap, only HVA.
> > > > > >
> > > > > > At the third chunk unmap, the current algorithm finds the first chunk,
> > > > > > not the second one. This series is the way to tell the difference at
> > > > > > unmap time.
> > > > > >
> > > > > > [1] https://lists.nongnu.org/archive/html/qemu-devel/2024-04/msg00079.html
> > > > > >
> > > > > > Thanks!
> > > > >
> > > > > Ok, I was wondering if we need to store GPA(GIOVA) to HVA mappings in
> > > > > the iova tree to solve this issue completely. Then there won't be
> > > > > aliasing issues.
> > > > >
> > > >
> > > > I'm ok to explore that route but this has another problem. Both SVQ
> > > > vrings and CVQ buffers also need to be addressable by VhostIOVATree,
> > > > and they do not have GPA.
> > > >
> > > > At this moment vhost_svq_translate_addr is able to handle this
> > > > transparently as we translate vaddr to SVQ IOVA. How can we store
> > > > these new entries? Maybe a (hwaddr)-1 GPA to signal it has no GPA and
> > > > then a list to go through other entries (SVQ vaddr and CVQ buffers).
> > >
> > > This seems to be tricky.
> > >
> > > As discussed, it could be another iova tree.
> > >
> >
> > Yes but there are many ways to add another IOVATree. Let me expand & recap.
> >
> > Option 1 is to simply add another iova tree to VhostShadowVirtqueue.
> > Let's call it gpa_iova_tree, as opposed to the current iova_tree that
> > translates from vaddr to SVQ IOVA. To know which one to use is easy at
> > adding or removing, like in the memory listener, but how to know at
> > vhost_svq_translate_addr?
>
> Then we won't use virtqueue_pop() at all, we need a SVQ version of
> virtqueue_pop() to translate GPA to SVQ IOVA directly?
>

The problem is not virtqueue_pop, that's out of the
vhost_svq_translate_addr. The problem is the need of adding
conditionals / complexity in all the callers of

> >
> > The easiest way for me is to rely on memory_region_from_host(). When
> > vaddr is from the guest, it returns a valid MemoryRegion. When it is
> > not, it returns NULL. I'm not sure if this is a valid use case, it
> > just worked in my tests so far.
> >
> > Now we have the second problem: The GPA values of the regions of the
> > two IOVA tree must be unique. We need to be able to find unallocated
> > regions in SVQ IOVA. At this moment there is only one IOVATree, so
> > this is done easily by vhost_iova_tree_map_alloc. But it is very
> > complicated with two trees.
>
> Would it be simpler if we decouple the IOVA allocator? For example, we
> can have a dedicated gtree to track the allocated IOVA ranges. It is
> shared by both
>
> 1) Guest memory (GPA)
> 2) SVQ virtqueue and buffers
>
> And another gtree to track the GPA to IOVA.
>
> The SVQ code could use either
>
> 1) one linear mappings that contains both SVQ virtqueue and buffers
>
> or
>
> 2) dynamic IOVA allocation/deallocation helpers
>
> So we don't actually need the third gtree for SVQ HVA -> SVQ IOVA?
>

That's possible, but that scatters the IOVA handling code instead of
keeping it self-contained in VhostIOVATree.

> >
> > Option 2a is to add another IOVATree in VhostIOVATree. I think the
> > easiest way is to keep the GPA -> SVQ IOVA in one tree, let's call it
> > iova_gpa_map, and the current vaddr -> SVQ IOVA tree in
> > iova_taddr_map. This second tree should contain both vaddr memory that
> > belongs to the guest and host-only vaddr like vrings and CVQ buffers.
> >
> > How to pass the GPA to VhostIOVATree API? To add it to DMAMap is
> > complicated, as it is shared with intel_iommu. We can add new
> > functions to VhostIOVATree that accepts vaddr plus GPA, or maybe it is
> > enough with GPA only. It should be functions to add, remove, and
> > allocate new entries. But vaddr ones must be kept, because buffers
> > might be host-only.
> >
> > Then the caller can choose which version to call: for adding and
> > removing guest memory from the memory listener, the GPA variant.
> > Adding SVQ vrings and CVQ buffers should use the current vaddr
> > versions. vhost_svq_translate_addr still needs to use
> > memory_region_from_host() to know which one to call.
>
> So the idea is, for region_del/region_add use iova_gpa_map? For the
> SVQ translation, use the iova_taddr_map?
>
> I suspect we still need to synchronize with those two trees so it
> might be still problematic as iova_taddr_map contains the overlapped
> regions.
>

Right. All adding / removing functions with GPA must also update the
current iova_taddr_map too.

> >
> > Although I didn't like this approach because it complicates
> > VhostIOVATree, I think it is the easier way except for option 4, I'll
> > explain later.
> >
> > This has an extra advantage: currently, the lookup in
> > vhost_svq_translate_addr is linear, O(1). This would allow us to use
> > the tree properly.
>
> It uses g_tree_foreach() which I guess is not O(1).
>

I'm sorry I meant O(N).

> >
> > Option 2b could be to keep them totally separated. So current
> > VhostIOVATree->iova_taddr_map only contains host-only entries, and the
> > new iova_gpa_map containst the guest entries. I don't think this case
> > adds anything except less memory usage, as the gpa map (which should
> > be the fastest) will be the same size. Also, it makes it difficult to
> > implement vhost_iova_tree_map_alloc.
> >
> > Option 3 is to not add new functions but extend the current ones. That
> > would require special values of GPA values to indicate no GPA, like
> > SVQ vrings. I think option 2a is better, but this may help to keep the
> > interface simpler.
> >
> > Option 4 is what I'm proposing in this RFC. To store the GPA as map id
> > so we can tell if the vaddr corresponds to one SVQ IOVA or another.
> > Now I'm having trouble retrieving the memory section I see in the
> > memory listener. It should not be so difficult but. The main advantage
> > is not to duplicate data structs that are already in QEMU, but maybe
> > it is not worth the effort.
> >
> > Going further with this option, we could add a flag to ignore the .id
> > member added. But it adds more and more complexity to the API so I
> > would prefer option 2a for this.
>
> Thanks
>
> >
> > > Thanks
> > >
> > > >
> > > > Thanks!
> > > >
> > > > > Thanks
> > > > >
> > > > > >
> > > > > > > Thanks
> > > > > > >
> > > > > > > >
> > > > > > > > To solve this, track GPA in the DMA entry that acs as unique identifiers
> > > > > > > > to the maps.  When the map needs to be removed, iova tree is able to
> > > > > > > > find the right one.
> > > > > > > >
> > > > > > > > Users that does not go to this extra layer of indirection can use the
> > > > > > > > iova tree as usual, with id = 0.
> > > > > > > >
> > > > > > > > This was found by Si-Wei Liu <si-wei.liu@oracle.com>, but I'm having a hard
> > > > > > > > time to reproduce the issue.  This has been tested only without overlapping
> > > > > > > > maps.  If it works with overlapping maps, it will be intergrated in the main
> > > > > > > > series.
> > > > > > > >
> > > > > > > > Comments are welcome.  Thanks!
> > > > > > > >
> > > > > > > > Eugenio Pérez (2):
> > > > > > > >   iova_tree: add an id member to DMAMap
> > > > > > > >   vdpa: identify aliased maps in iova_tree
> > > > > > > >
> > > > > > > >  hw/virtio/vhost-vdpa.c   | 2 ++
> > > > > > > >  include/qemu/iova-tree.h | 5 +++--
> > > > > > > >  util/iova-tree.c         | 3 ++-
> > > > > > > >  3 files changed, 7 insertions(+), 3 deletions(-)
> > > > > > > >
> > > > > > > > --
> > > > > > > > 2.44.0
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
Jason Wang May 10, 2024, 4:28 a.m. UTC | #9
On Thu, May 9, 2024 at 3:10 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
>
> On Thu, May 9, 2024 at 8:27 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Thu, May 9, 2024 at 1:16 AM Eugenio Perez Martin <eperezma@redhat.com> wrote:
> > >
> > > On Wed, May 8, 2024 at 4:29 AM Jason Wang <jasowang@redhat.com> wrote:
> > > >
> > > > On Tue, May 7, 2024 at 6:57 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
> > > > >
> > > > > On Tue, May 7, 2024 at 9:29 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > >
> > > > > > On Fri, Apr 12, 2024 at 3:56 PM Eugenio Perez Martin
> > > > > > <eperezma@redhat.com> wrote:
> > > > > > >
> > > > > > > On Fri, Apr 12, 2024 at 8:47 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > >
> > > > > > > > On Wed, Apr 10, 2024 at 6:03 PM Eugenio Pérez <eperezma@redhat.com> wrote:
> > > > > > > > >
> > > > > > > > > The guest may have overlapped memory regions, where different GPA leads
> > > > > > > > > to the same HVA.  This causes a problem when overlapped regions
> > > > > > > > > (different GPA but same translated HVA) exists in the tree, as looking
> > > > > > > > > them by HVA will return them twice.
> > > > > > > >
> > > > > > > > I think I don't understand if there's any side effect for shadow virtqueue?
> > > > > > > >
> > > > > > >
> > > > > > > My bad, I totally forgot to put a reference to where this comes from.
> > > > > > >
> > > > > > > Si-Wei found that during initialization this sequences of maps /
> > > > > > > unmaps happens [1]:
> > > > > > >
> > > > > > > HVA                    GPA                IOVA
> > > > > > > -------------------------------------------------------------------------------------------------------------------------
> > > > > > > Map
> > > > > > > [0x7f7903e00000, 0x7f7983e00000)    [0x0, 0x80000000) [0x1000, 0x80000000)
> > > > > > > [0x7f7983e00000, 0x7f9903e00000)    [0x100000000, 0x2080000000)
> > > > > > > [0x80001000, 0x2000001000)
> > > > > > > [0x7f7903ea0000, 0x7f7903ec0000)    [0xfeda0000, 0xfedc0000)
> > > > > > > [0x2000001000, 0x2000021000)
> > > > > > >
> > > > > > > Unmap
> > > > > > > [0x7f7903ea0000, 0x7f7903ec0000)    [0xfeda0000, 0xfedc0000) [0x1000,
> > > > > > > 0x20000) ???
> > > > > > >
> > > > > > > The third HVA range is contained in the first one, but exposed under a
> > > > > > > different GVA (aliased). This is not "flattened" by QEMU, as GPA does
> > > > > > > not overlap, only HVA.
> > > > > > >
> > > > > > > At the third chunk unmap, the current algorithm finds the first chunk,
> > > > > > > not the second one. This series is the way to tell the difference at
> > > > > > > unmap time.
> > > > > > >
> > > > > > > [1] https://lists.nongnu.org/archive/html/qemu-devel/2024-04/msg00079.html
> > > > > > >
> > > > > > > Thanks!
> > > > > >
> > > > > > Ok, I was wondering if we need to store GPA(GIOVA) to HVA mappings in
> > > > > > the iova tree to solve this issue completely. Then there won't be
> > > > > > aliasing issues.
> > > > > >
> > > > >
> > > > > I'm ok to explore that route but this has another problem. Both SVQ
> > > > > vrings and CVQ buffers also need to be addressable by VhostIOVATree,
> > > > > and they do not have GPA.
> > > > >
> > > > > At this moment vhost_svq_translate_addr is able to handle this
> > > > > transparently as we translate vaddr to SVQ IOVA. How can we store
> > > > > these new entries? Maybe a (hwaddr)-1 GPA to signal it has no GPA and
> > > > > then a list to go through other entries (SVQ vaddr and CVQ buffers).
> > > >
> > > > This seems to be tricky.
> > > >
> > > > As discussed, it could be another iova tree.
> > > >
> > >
> > > Yes but there are many ways to add another IOVATree. Let me expand & recap.
> > >
> > > Option 1 is to simply add another iova tree to VhostShadowVirtqueue.
> > > Let's call it gpa_iova_tree, as opposed to the current iova_tree that
> > > translates from vaddr to SVQ IOVA. To know which one to use is easy at
> > > adding or removing, like in the memory listener, but how to know at
> > > vhost_svq_translate_addr?
> >
> > Then we won't use virtqueue_pop() at all, we need a SVQ version of
> > virtqueue_pop() to translate GPA to SVQ IOVA directly?
> >
>
> The problem is not virtqueue_pop, that's out of the
> vhost_svq_translate_addr. The problem is the need of adding
> conditionals / complexity in all the callers of
>
> > >
> > > The easiest way for me is to rely on memory_region_from_host(). When
> > > vaddr is from the guest, it returns a valid MemoryRegion. When it is
> > > not, it returns NULL. I'm not sure if this is a valid use case, it
> > > just worked in my tests so far.
> > >
> > > Now we have the second problem: The GPA values of the regions of the
> > > two IOVA tree must be unique. We need to be able to find unallocated
> > > regions in SVQ IOVA. At this moment there is only one IOVATree, so
> > > this is done easily by vhost_iova_tree_map_alloc. But it is very
> > > complicated with two trees.
> >
> > Would it be simpler if we decouple the IOVA allocator? For example, we
> > can have a dedicated gtree to track the allocated IOVA ranges. It is
> > shared by both
> >
> > 1) Guest memory (GPA)
> > 2) SVQ virtqueue and buffers
> >
> > And another gtree to track the GPA to IOVA.
> >
> > The SVQ code could use either
> >
> > 1) one linear mappings that contains both SVQ virtqueue and buffers
> >
> > or
> >
> > 2) dynamic IOVA allocation/deallocation helpers
> >
> > So we don't actually need the third gtree for SVQ HVA -> SVQ IOVA?
> >
>
> That's possible, but that scatters the IOVA handling code instead of
> keeping it self-contained in VhostIOVATree.

To me, the IOVA range/allocation is orthogonal to how IOVA is used.

An example is the iova allocator in the kernel.

Note that there's an even simpler IOVA "allocator" in NVME passthrough
code, not sure it is useful here though (haven't had a deep look at
that).

Thanks
Eugenio Perez Martin May 10, 2024, 7:16 a.m. UTC | #10
On Fri, May 10, 2024 at 6:29 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Thu, May 9, 2024 at 3:10 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
> >
> > On Thu, May 9, 2024 at 8:27 AM Jason Wang <jasowang@redhat.com> wrote:
> > >
> > > On Thu, May 9, 2024 at 1:16 AM Eugenio Perez Martin <eperezma@redhat.com> wrote:
> > > >
> > > > On Wed, May 8, 2024 at 4:29 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > >
> > > > > On Tue, May 7, 2024 at 6:57 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
> > > > > >
> > > > > > On Tue, May 7, 2024 at 9:29 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > >
> > > > > > > On Fri, Apr 12, 2024 at 3:56 PM Eugenio Perez Martin
> > > > > > > <eperezma@redhat.com> wrote:
> > > > > > > >
> > > > > > > > On Fri, Apr 12, 2024 at 8:47 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > > >
> > > > > > > > > On Wed, Apr 10, 2024 at 6:03 PM Eugenio Pérez <eperezma@redhat.com> wrote:
> > > > > > > > > >
> > > > > > > > > > The guest may have overlapped memory regions, where different GPA leads
> > > > > > > > > > to the same HVA.  This causes a problem when overlapped regions
> > > > > > > > > > (different GPA but same translated HVA) exists in the tree, as looking
> > > > > > > > > > them by HVA will return them twice.
> > > > > > > > >
> > > > > > > > > I think I don't understand if there's any side effect for shadow virtqueue?
> > > > > > > > >
> > > > > > > >
> > > > > > > > My bad, I totally forgot to put a reference to where this comes from.
> > > > > > > >
> > > > > > > > Si-Wei found that during initialization this sequences of maps /
> > > > > > > > unmaps happens [1]:
> > > > > > > >
> > > > > > > > HVA                    GPA                IOVA
> > > > > > > > -------------------------------------------------------------------------------------------------------------------------
> > > > > > > > Map
> > > > > > > > [0x7f7903e00000, 0x7f7983e00000)    [0x0, 0x80000000) [0x1000, 0x80000000)
> > > > > > > > [0x7f7983e00000, 0x7f9903e00000)    [0x100000000, 0x2080000000)
> > > > > > > > [0x80001000, 0x2000001000)
> > > > > > > > [0x7f7903ea0000, 0x7f7903ec0000)    [0xfeda0000, 0xfedc0000)
> > > > > > > > [0x2000001000, 0x2000021000)
> > > > > > > >
> > > > > > > > Unmap
> > > > > > > > [0x7f7903ea0000, 0x7f7903ec0000)    [0xfeda0000, 0xfedc0000) [0x1000,
> > > > > > > > 0x20000) ???
> > > > > > > >
> > > > > > > > The third HVA range is contained in the first one, but exposed under a
> > > > > > > > different GVA (aliased). This is not "flattened" by QEMU, as GPA does
> > > > > > > > not overlap, only HVA.
> > > > > > > >
> > > > > > > > At the third chunk unmap, the current algorithm finds the first chunk,
> > > > > > > > not the second one. This series is the way to tell the difference at
> > > > > > > > unmap time.
> > > > > > > >
> > > > > > > > [1] https://lists.nongnu.org/archive/html/qemu-devel/2024-04/msg00079.html
> > > > > > > >
> > > > > > > > Thanks!
> > > > > > >
> > > > > > > Ok, I was wondering if we need to store GPA(GIOVA) to HVA mappings in
> > > > > > > the iova tree to solve this issue completely. Then there won't be
> > > > > > > aliasing issues.
> > > > > > >
> > > > > >
> > > > > > I'm ok to explore that route but this has another problem. Both SVQ
> > > > > > vrings and CVQ buffers also need to be addressable by VhostIOVATree,
> > > > > > and they do not have GPA.
> > > > > >
> > > > > > At this moment vhost_svq_translate_addr is able to handle this
> > > > > > transparently as we translate vaddr to SVQ IOVA. How can we store
> > > > > > these new entries? Maybe a (hwaddr)-1 GPA to signal it has no GPA and
> > > > > > then a list to go through other entries (SVQ vaddr and CVQ buffers).
> > > > >
> > > > > This seems to be tricky.
> > > > >
> > > > > As discussed, it could be another iova tree.
> > > > >
> > > >
> > > > Yes but there are many ways to add another IOVATree. Let me expand & recap.
> > > >
> > > > Option 1 is to simply add another iova tree to VhostShadowVirtqueue.
> > > > Let's call it gpa_iova_tree, as opposed to the current iova_tree that
> > > > translates from vaddr to SVQ IOVA. To know which one to use is easy at
> > > > adding or removing, like in the memory listener, but how to know at
> > > > vhost_svq_translate_addr?
> > >
> > > Then we won't use virtqueue_pop() at all, we need a SVQ version of
> > > virtqueue_pop() to translate GPA to SVQ IOVA directly?
> > >
> >
> > The problem is not virtqueue_pop, that's out of the
> > vhost_svq_translate_addr. The problem is the need of adding
> > conditionals / complexity in all the callers of
> >
> > > >
> > > > The easiest way for me is to rely on memory_region_from_host(). When
> > > > vaddr is from the guest, it returns a valid MemoryRegion. When it is
> > > > not, it returns NULL. I'm not sure if this is a valid use case, it
> > > > just worked in my tests so far.
> > > >
> > > > Now we have the second problem: The GPA values of the regions of the
> > > > two IOVA tree must be unique. We need to be able to find unallocated
> > > > regions in SVQ IOVA. At this moment there is only one IOVATree, so
> > > > this is done easily by vhost_iova_tree_map_alloc. But it is very
> > > > complicated with two trees.
> > >
> > > Would it be simpler if we decouple the IOVA allocator? For example, we
> > > can have a dedicated gtree to track the allocated IOVA ranges. It is
> > > shared by both
> > >
> > > 1) Guest memory (GPA)
> > > 2) SVQ virtqueue and buffers
> > >
> > > And another gtree to track the GPA to IOVA.
> > >
> > > The SVQ code could use either
> > >
> > > 1) one linear mappings that contains both SVQ virtqueue and buffers
> > >
> > > or
> > >
> > > 2) dynamic IOVA allocation/deallocation helpers
> > >
> > > So we don't actually need the third gtree for SVQ HVA -> SVQ IOVA?
> > >
> >
> > That's possible, but that scatters the IOVA handling code instead of
> > keeping it self-contained in VhostIOVATree.
>
> To me, the IOVA range/allocation is orthogonal to how IOVA is used.
>
> An example is the iova allocator in the kernel.
>
> Note that there's an even simpler IOVA "allocator" in NVME passthrough
> code, not sure it is useful here though (haven't had a deep look at
> that).
>

I don't know enough about them to have an opinion. I keep seeing the
drawback of needing to synchronize both allocation & adding in all the
places we want to modify the IOVATree. At this moment, these are the
vhost-vdpa memory listener, the SVQ vring creation and removal, and
net CVQ buffers. But it may be more in the future.

What are the advantages of keeping these separated that justifies
needing to synchronize in all these places, compared with keeping them
synchronized in VhostIOVATree?

Thanks!
Jason Wang May 11, 2024, 4 a.m. UTC | #11
On Fri, May 10, 2024 at 3:16 PM Eugenio Perez Martin
<eperezma@redhat.com> wrote:
>
> On Fri, May 10, 2024 at 6:29 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Thu, May 9, 2024 at 3:10 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
> > >
> > > On Thu, May 9, 2024 at 8:27 AM Jason Wang <jasowang@redhat.com> wrote:
> > > >
> > > > On Thu, May 9, 2024 at 1:16 AM Eugenio Perez Martin <eperezma@redhat.com> wrote:
> > > > >
> > > > > On Wed, May 8, 2024 at 4:29 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > >
> > > > > > On Tue, May 7, 2024 at 6:57 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
> > > > > > >
> > > > > > > On Tue, May 7, 2024 at 9:29 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > >
> > > > > > > > On Fri, Apr 12, 2024 at 3:56 PM Eugenio Perez Martin
> > > > > > > > <eperezma@redhat.com> wrote:
> > > > > > > > >
> > > > > > > > > On Fri, Apr 12, 2024 at 8:47 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > > > >
> > > > > > > > > > On Wed, Apr 10, 2024 at 6:03 PM Eugenio Pérez <eperezma@redhat.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > The guest may have overlapped memory regions, where different GPA leads
> > > > > > > > > > > to the same HVA.  This causes a problem when overlapped regions
> > > > > > > > > > > (different GPA but same translated HVA) exists in the tree, as looking
> > > > > > > > > > > them by HVA will return them twice.
> > > > > > > > > >
> > > > > > > > > > I think I don't understand if there's any side effect for shadow virtqueue?
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > My bad, I totally forgot to put a reference to where this comes from.
> > > > > > > > >
> > > > > > > > > Si-Wei found that during initialization this sequences of maps /
> > > > > > > > > unmaps happens [1]:
> > > > > > > > >
> > > > > > > > > HVA                    GPA                IOVA
> > > > > > > > > -------------------------------------------------------------------------------------------------------------------------
> > > > > > > > > Map
> > > > > > > > > [0x7f7903e00000, 0x7f7983e00000)    [0x0, 0x80000000) [0x1000, 0x80000000)
> > > > > > > > > [0x7f7983e00000, 0x7f9903e00000)    [0x100000000, 0x2080000000)
> > > > > > > > > [0x80001000, 0x2000001000)
> > > > > > > > > [0x7f7903ea0000, 0x7f7903ec0000)    [0xfeda0000, 0xfedc0000)
> > > > > > > > > [0x2000001000, 0x2000021000)
> > > > > > > > >
> > > > > > > > > Unmap
> > > > > > > > > [0x7f7903ea0000, 0x7f7903ec0000)    [0xfeda0000, 0xfedc0000) [0x1000,
> > > > > > > > > 0x20000) ???
> > > > > > > > >
> > > > > > > > > The third HVA range is contained in the first one, but exposed under a
> > > > > > > > > different GVA (aliased). This is not "flattened" by QEMU, as GPA does
> > > > > > > > > not overlap, only HVA.
> > > > > > > > >
> > > > > > > > > At the third chunk unmap, the current algorithm finds the first chunk,
> > > > > > > > > not the second one. This series is the way to tell the difference at
> > > > > > > > > unmap time.
> > > > > > > > >
> > > > > > > > > [1] https://lists.nongnu.org/archive/html/qemu-devel/2024-04/msg00079.html
> > > > > > > > >
> > > > > > > > > Thanks!
> > > > > > > >
> > > > > > > > Ok, I was wondering if we need to store GPA(GIOVA) to HVA mappings in
> > > > > > > > the iova tree to solve this issue completely. Then there won't be
> > > > > > > > aliasing issues.
> > > > > > > >
> > > > > > >
> > > > > > > I'm ok to explore that route but this has another problem. Both SVQ
> > > > > > > vrings and CVQ buffers also need to be addressable by VhostIOVATree,
> > > > > > > and they do not have GPA.
> > > > > > >
> > > > > > > At this moment vhost_svq_translate_addr is able to handle this
> > > > > > > transparently as we translate vaddr to SVQ IOVA. How can we store
> > > > > > > these new entries? Maybe a (hwaddr)-1 GPA to signal it has no GPA and
> > > > > > > then a list to go through other entries (SVQ vaddr and CVQ buffers).
> > > > > >
> > > > > > This seems to be tricky.
> > > > > >
> > > > > > As discussed, it could be another iova tree.
> > > > > >
> > > > >
> > > > > Yes but there are many ways to add another IOVATree. Let me expand & recap.
> > > > >
> > > > > Option 1 is to simply add another iova tree to VhostShadowVirtqueue.
> > > > > Let's call it gpa_iova_tree, as opposed to the current iova_tree that
> > > > > translates from vaddr to SVQ IOVA. To know which one to use is easy at
> > > > > adding or removing, like in the memory listener, but how to know at
> > > > > vhost_svq_translate_addr?
> > > >
> > > > Then we won't use virtqueue_pop() at all, we need a SVQ version of
> > > > virtqueue_pop() to translate GPA to SVQ IOVA directly?
> > > >
> > >
> > > The problem is not virtqueue_pop, that's out of the
> > > vhost_svq_translate_addr. The problem is the need of adding
> > > conditionals / complexity in all the callers of
> > >
> > > > >
> > > > > The easiest way for me is to rely on memory_region_from_host(). When
> > > > > vaddr is from the guest, it returns a valid MemoryRegion. When it is
> > > > > not, it returns NULL. I'm not sure if this is a valid use case, it
> > > > > just worked in my tests so far.
> > > > >
> > > > > Now we have the second problem: The GPA values of the regions of the
> > > > > two IOVA tree must be unique. We need to be able to find unallocated
> > > > > regions in SVQ IOVA. At this moment there is only one IOVATree, so
> > > > > this is done easily by vhost_iova_tree_map_alloc. But it is very
> > > > > complicated with two trees.
> > > >
> > > > Would it be simpler if we decouple the IOVA allocator? For example, we
> > > > can have a dedicated gtree to track the allocated IOVA ranges. It is
> > > > shared by both
> > > >
> > > > 1) Guest memory (GPA)
> > > > 2) SVQ virtqueue and buffers
> > > >
> > > > And another gtree to track the GPA to IOVA.
> > > >
> > > > The SVQ code could use either
> > > >
> > > > 1) one linear mappings that contains both SVQ virtqueue and buffers
> > > >
> > > > or
> > > >
> > > > 2) dynamic IOVA allocation/deallocation helpers
> > > >
> > > > So we don't actually need the third gtree for SVQ HVA -> SVQ IOVA?
> > > >
> > >
> > > That's possible, but that scatters the IOVA handling code instead of
> > > keeping it self-contained in VhostIOVATree.
> >
> > To me, the IOVA range/allocation is orthogonal to how IOVA is used.
> >
> > An example is the iova allocator in the kernel.
> >
> > Note that there's an even simpler IOVA "allocator" in NVME passthrough
> > code, not sure it is useful here though (haven't had a deep look at
> > that).
> >
>
> I don't know enough about them to have an opinion. I keep seeing the
> drawback of needing to synchronize both allocation & adding in all the
> places we want to modify the IOVATree. At this moment, these are the
> vhost-vdpa memory listener, the SVQ vring creation and removal, and
> net CVQ buffers. But it may be more in the future.
>
> What are the advantages of keeping these separated that justifies
> needing to synchronize in all these places, compared with keeping them
> synchronized in VhostIOVATree?

It doesn't need to be synchronized.

Assuming guest and SVQ shares IOVA range. IOVA only needs to track
which part of the range has been used.

This simplifies things, we can store GPA->IOVA mappings and SVQ ->
IOVA mappings separately.

Thanks

>
> Thanks!
>
Eugenio Perez Martin May 13, 2024, 6:27 a.m. UTC | #12
On Sat, May 11, 2024 at 6:07 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Fri, May 10, 2024 at 3:16 PM Eugenio Perez Martin
> <eperezma@redhat.com> wrote:
> >
> > On Fri, May 10, 2024 at 6:29 AM Jason Wang <jasowang@redhat.com> wrote:
> > >
> > > On Thu, May 9, 2024 at 3:10 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
> > > >
> > > > On Thu, May 9, 2024 at 8:27 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > >
> > > > > On Thu, May 9, 2024 at 1:16 AM Eugenio Perez Martin <eperezma@redhat.com> wrote:
> > > > > >
> > > > > > On Wed, May 8, 2024 at 4:29 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > >
> > > > > > > On Tue, May 7, 2024 at 6:57 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
> > > > > > > >
> > > > > > > > On Tue, May 7, 2024 at 9:29 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > > >
> > > > > > > > > On Fri, Apr 12, 2024 at 3:56 PM Eugenio Perez Martin
> > > > > > > > > <eperezma@redhat.com> wrote:
> > > > > > > > > >
> > > > > > > > > > On Fri, Apr 12, 2024 at 8:47 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Apr 10, 2024 at 6:03 PM Eugenio Pérez <eperezma@redhat.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > The guest may have overlapped memory regions, where different GPA leads
> > > > > > > > > > > > to the same HVA.  This causes a problem when overlapped regions
> > > > > > > > > > > > (different GPA but same translated HVA) exists in the tree, as looking
> > > > > > > > > > > > them by HVA will return them twice.
> > > > > > > > > > >
> > > > > > > > > > > I think I don't understand if there's any side effect for shadow virtqueue?
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > My bad, I totally forgot to put a reference to where this comes from.
> > > > > > > > > >
> > > > > > > > > > Si-Wei found that during initialization this sequences of maps /
> > > > > > > > > > unmaps happens [1]:
> > > > > > > > > >
> > > > > > > > > > HVA                    GPA                IOVA
> > > > > > > > > > -------------------------------------------------------------------------------------------------------------------------
> > > > > > > > > > Map
> > > > > > > > > > [0x7f7903e00000, 0x7f7983e00000)    [0x0, 0x80000000) [0x1000, 0x80000000)
> > > > > > > > > > [0x7f7983e00000, 0x7f9903e00000)    [0x100000000, 0x2080000000)
> > > > > > > > > > [0x80001000, 0x2000001000)
> > > > > > > > > > [0x7f7903ea0000, 0x7f7903ec0000)    [0xfeda0000, 0xfedc0000)
> > > > > > > > > > [0x2000001000, 0x2000021000)
> > > > > > > > > >
> > > > > > > > > > Unmap
> > > > > > > > > > [0x7f7903ea0000, 0x7f7903ec0000)    [0xfeda0000, 0xfedc0000) [0x1000,
> > > > > > > > > > 0x20000) ???
> > > > > > > > > >
> > > > > > > > > > The third HVA range is contained in the first one, but exposed under a
> > > > > > > > > > different GVA (aliased). This is not "flattened" by QEMU, as GPA does
> > > > > > > > > > not overlap, only HVA.
> > > > > > > > > >
> > > > > > > > > > At the third chunk unmap, the current algorithm finds the first chunk,
> > > > > > > > > > not the second one. This series is the way to tell the difference at
> > > > > > > > > > unmap time.
> > > > > > > > > >
> > > > > > > > > > [1] https://lists.nongnu.org/archive/html/qemu-devel/2024-04/msg00079.html
> > > > > > > > > >
> > > > > > > > > > Thanks!
> > > > > > > > >
> > > > > > > > > Ok, I was wondering if we need to store GPA(GIOVA) to HVA mappings in
> > > > > > > > > the iova tree to solve this issue completely. Then there won't be
> > > > > > > > > aliasing issues.
> > > > > > > > >
> > > > > > > >
> > > > > > > > I'm ok to explore that route but this has another problem. Both SVQ
> > > > > > > > vrings and CVQ buffers also need to be addressable by VhostIOVATree,
> > > > > > > > and they do not have GPA.
> > > > > > > >
> > > > > > > > At this moment vhost_svq_translate_addr is able to handle this
> > > > > > > > transparently as we translate vaddr to SVQ IOVA. How can we store
> > > > > > > > these new entries? Maybe a (hwaddr)-1 GPA to signal it has no GPA and
> > > > > > > > then a list to go through other entries (SVQ vaddr and CVQ buffers).
> > > > > > >
> > > > > > > This seems to be tricky.
> > > > > > >
> > > > > > > As discussed, it could be another iova tree.
> > > > > > >
> > > > > >
> > > > > > Yes but there are many ways to add another IOVATree. Let me expand & recap.
> > > > > >
> > > > > > Option 1 is to simply add another iova tree to VhostShadowVirtqueue.
> > > > > > Let's call it gpa_iova_tree, as opposed to the current iova_tree that
> > > > > > translates from vaddr to SVQ IOVA. To know which one to use is easy at
> > > > > > adding or removing, like in the memory listener, but how to know at
> > > > > > vhost_svq_translate_addr?
> > > > >
> > > > > Then we won't use virtqueue_pop() at all, we need a SVQ version of
> > > > > virtqueue_pop() to translate GPA to SVQ IOVA directly?
> > > > >
> > > >
> > > > The problem is not virtqueue_pop, that's out of the
> > > > vhost_svq_translate_addr. The problem is the need of adding
> > > > conditionals / complexity in all the callers of
> > > >
> > > > > >
> > > > > > The easiest way for me is to rely on memory_region_from_host(). When
> > > > > > vaddr is from the guest, it returns a valid MemoryRegion. When it is
> > > > > > not, it returns NULL. I'm not sure if this is a valid use case, it
> > > > > > just worked in my tests so far.
> > > > > >
> > > > > > Now we have the second problem: The GPA values of the regions of the
> > > > > > two IOVA tree must be unique. We need to be able to find unallocated
> > > > > > regions in SVQ IOVA. At this moment there is only one IOVATree, so
> > > > > > this is done easily by vhost_iova_tree_map_alloc. But it is very
> > > > > > complicated with two trees.
> > > > >
> > > > > Would it be simpler if we decouple the IOVA allocator? For example, we
> > > > > can have a dedicated gtree to track the allocated IOVA ranges. It is
> > > > > shared by both
> > > > >
> > > > > 1) Guest memory (GPA)
> > > > > 2) SVQ virtqueue and buffers
> > > > >
> > > > > And another gtree to track the GPA to IOVA.
> > > > >
> > > > > The SVQ code could use either
> > > > >
> > > > > 1) one linear mappings that contains both SVQ virtqueue and buffers
> > > > >
> > > > > or
> > > > >
> > > > > 2) dynamic IOVA allocation/deallocation helpers
> > > > >
> > > > > So we don't actually need the third gtree for SVQ HVA -> SVQ IOVA?
> > > > >
> > > >
> > > > That's possible, but that scatters the IOVA handling code instead of
> > > > keeping it self-contained in VhostIOVATree.
> > >
> > > To me, the IOVA range/allocation is orthogonal to how IOVA is used.
> > >
> > > An example is the iova allocator in the kernel.
> > >
> > > Note that there's an even simpler IOVA "allocator" in NVME passthrough
> > > code, not sure it is useful here though (haven't had a deep look at
> > > that).
> > >
> >
> > I don't know enough about them to have an opinion. I keep seeing the
> > drawback of needing to synchronize both allocation & adding in all the
> > places we want to modify the IOVATree. At this moment, these are the
> > vhost-vdpa memory listener, the SVQ vring creation and removal, and
> > net CVQ buffers. But it may be more in the future.
> >
> > What are the advantages of keeping these separated that justifies
> > needing to synchronize in all these places, compared with keeping them
> > synchronized in VhostIOVATree?
>
> It doesn't need to be synchronized.
>
> Assuming guest and SVQ shares IOVA range. IOVA only needs to track
> which part of the range has been used.
>

Not sure if I follow, that is what I mean with "synchronized".

> This simplifies things, we can store GPA->IOVA mappings and SVQ ->
> IOVA mappings separately.
>

Sorry, I still cannot see the whole picture :).

Let's assume we have all the GPA mapped to specific IOVA regions, so
we have the first IOVA tree (GPA -> IOVA) filled. Now we enable SVQ
because of the migration. How can we know where we can place SVQ
vrings without having them synchronized?

At this moment we're using a tree. The tree nature of the current SVQ
IOVA -> VA makes all nodes ordered so it is more or less easy to look
for holes.

Now your proposal uses the SVQ IOVA as tree values. Should we iterate
over all of them, order them, of the two trees, and then look for
holes there?

> Thanks
>
> >
> > Thanks!
> >
>
Jason Wang May 13, 2024, 8:28 a.m. UTC | #13
On Mon, May 13, 2024 at 2:28 PM Eugenio Perez Martin
<eperezma@redhat.com> wrote:
>
> On Sat, May 11, 2024 at 6:07 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Fri, May 10, 2024 at 3:16 PM Eugenio Perez Martin
> > <eperezma@redhat.com> wrote:
> > >
> > > On Fri, May 10, 2024 at 6:29 AM Jason Wang <jasowang@redhat.com> wrote:
> > > >
> > > > On Thu, May 9, 2024 at 3:10 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
> > > > >
> > > > > On Thu, May 9, 2024 at 8:27 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > >
> > > > > > On Thu, May 9, 2024 at 1:16 AM Eugenio Perez Martin <eperezma@redhat.com> wrote:
> > > > > > >
> > > > > > > On Wed, May 8, 2024 at 4:29 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > >
> > > > > > > > On Tue, May 7, 2024 at 6:57 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
> > > > > > > > >
> > > > > > > > > On Tue, May 7, 2024 at 9:29 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > > > >
> > > > > > > > > > On Fri, Apr 12, 2024 at 3:56 PM Eugenio Perez Martin
> > > > > > > > > > <eperezma@redhat.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Fri, Apr 12, 2024 at 8:47 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, Apr 10, 2024 at 6:03 PM Eugenio Pérez <eperezma@redhat.com> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > The guest may have overlapped memory regions, where different GPA leads
> > > > > > > > > > > > > to the same HVA.  This causes a problem when overlapped regions
> > > > > > > > > > > > > (different GPA but same translated HVA) exists in the tree, as looking
> > > > > > > > > > > > > them by HVA will return them twice.
> > > > > > > > > > > >
> > > > > > > > > > > > I think I don't understand if there's any side effect for shadow virtqueue?
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > My bad, I totally forgot to put a reference to where this comes from.
> > > > > > > > > > >
> > > > > > > > > > > Si-Wei found that during initialization this sequences of maps /
> > > > > > > > > > > unmaps happens [1]:
> > > > > > > > > > >
> > > > > > > > > > > HVA                    GPA                IOVA
> > > > > > > > > > > -------------------------------------------------------------------------------------------------------------------------
> > > > > > > > > > > Map
> > > > > > > > > > > [0x7f7903e00000, 0x7f7983e00000)    [0x0, 0x80000000) [0x1000, 0x80000000)
> > > > > > > > > > > [0x7f7983e00000, 0x7f9903e00000)    [0x100000000, 0x2080000000)
> > > > > > > > > > > [0x80001000, 0x2000001000)
> > > > > > > > > > > [0x7f7903ea0000, 0x7f7903ec0000)    [0xfeda0000, 0xfedc0000)
> > > > > > > > > > > [0x2000001000, 0x2000021000)
> > > > > > > > > > >
> > > > > > > > > > > Unmap
> > > > > > > > > > > [0x7f7903ea0000, 0x7f7903ec0000)    [0xfeda0000, 0xfedc0000) [0x1000,
> > > > > > > > > > > 0x20000) ???
> > > > > > > > > > >
> > > > > > > > > > > The third HVA range is contained in the first one, but exposed under a
> > > > > > > > > > > different GVA (aliased). This is not "flattened" by QEMU, as GPA does
> > > > > > > > > > > not overlap, only HVA.
> > > > > > > > > > >
> > > > > > > > > > > At the third chunk unmap, the current algorithm finds the first chunk,
> > > > > > > > > > > not the second one. This series is the way to tell the difference at
> > > > > > > > > > > unmap time.
> > > > > > > > > > >
> > > > > > > > > > > [1] https://lists.nongnu.org/archive/html/qemu-devel/2024-04/msg00079.html
> > > > > > > > > > >
> > > > > > > > > > > Thanks!
> > > > > > > > > >
> > > > > > > > > > Ok, I was wondering if we need to store GPA(GIOVA) to HVA mappings in
> > > > > > > > > > the iova tree to solve this issue completely. Then there won't be
> > > > > > > > > > aliasing issues.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > I'm ok to explore that route but this has another problem. Both SVQ
> > > > > > > > > vrings and CVQ buffers also need to be addressable by VhostIOVATree,
> > > > > > > > > and they do not have GPA.
> > > > > > > > >
> > > > > > > > > At this moment vhost_svq_translate_addr is able to handle this
> > > > > > > > > transparently as we translate vaddr to SVQ IOVA. How can we store
> > > > > > > > > these new entries? Maybe a (hwaddr)-1 GPA to signal it has no GPA and
> > > > > > > > > then a list to go through other entries (SVQ vaddr and CVQ buffers).
> > > > > > > >
> > > > > > > > This seems to be tricky.
> > > > > > > >
> > > > > > > > As discussed, it could be another iova tree.
> > > > > > > >
> > > > > > >
> > > > > > > Yes but there are many ways to add another IOVATree. Let me expand & recap.
> > > > > > >
> > > > > > > Option 1 is to simply add another iova tree to VhostShadowVirtqueue.
> > > > > > > Let's call it gpa_iova_tree, as opposed to the current iova_tree that
> > > > > > > translates from vaddr to SVQ IOVA. To know which one to use is easy at
> > > > > > > adding or removing, like in the memory listener, but how to know at
> > > > > > > vhost_svq_translate_addr?
> > > > > >
> > > > > > Then we won't use virtqueue_pop() at all, we need a SVQ version of
> > > > > > virtqueue_pop() to translate GPA to SVQ IOVA directly?
> > > > > >
> > > > >
> > > > > The problem is not virtqueue_pop, that's out of the
> > > > > vhost_svq_translate_addr. The problem is the need of adding
> > > > > conditionals / complexity in all the callers of
> > > > >
> > > > > > >
> > > > > > > The easiest way for me is to rely on memory_region_from_host(). When
> > > > > > > vaddr is from the guest, it returns a valid MemoryRegion. When it is
> > > > > > > not, it returns NULL. I'm not sure if this is a valid use case, it
> > > > > > > just worked in my tests so far.
> > > > > > >
> > > > > > > Now we have the second problem: The GPA values of the regions of the
> > > > > > > two IOVA tree must be unique. We need to be able to find unallocated
> > > > > > > regions in SVQ IOVA. At this moment there is only one IOVATree, so
> > > > > > > this is done easily by vhost_iova_tree_map_alloc. But it is very
> > > > > > > complicated with two trees.
> > > > > >
> > > > > > Would it be simpler if we decouple the IOVA allocator? For example, we
> > > > > > can have a dedicated gtree to track the allocated IOVA ranges. It is
> > > > > > shared by both
> > > > > >
> > > > > > 1) Guest memory (GPA)
> > > > > > 2) SVQ virtqueue and buffers
> > > > > >
> > > > > > And another gtree to track the GPA to IOVA.
> > > > > >
> > > > > > The SVQ code could use either
> > > > > >
> > > > > > 1) one linear mappings that contains both SVQ virtqueue and buffers
> > > > > >
> > > > > > or
> > > > > >
> > > > > > 2) dynamic IOVA allocation/deallocation helpers
> > > > > >
> > > > > > So we don't actually need the third gtree for SVQ HVA -> SVQ IOVA?
> > > > > >
> > > > >
> > > > > That's possible, but that scatters the IOVA handling code instead of
> > > > > keeping it self-contained in VhostIOVATree.
> > > >
> > > > To me, the IOVA range/allocation is orthogonal to how IOVA is used.
> > > >
> > > > An example is the iova allocator in the kernel.
> > > >
> > > > Note that there's an even simpler IOVA "allocator" in NVME passthrough
> > > > code, not sure it is useful here though (haven't had a deep look at
> > > > that).
> > > >
> > >
> > > I don't know enough about them to have an opinion. I keep seeing the
> > > drawback of needing to synchronize both allocation & adding in all the
> > > places we want to modify the IOVATree. At this moment, these are the
> > > vhost-vdpa memory listener, the SVQ vring creation and removal, and
> > > net CVQ buffers. But it may be more in the future.
> > >
> > > What are the advantages of keeping these separated that justifies
> > > needing to synchronize in all these places, compared with keeping them
> > > synchronized in VhostIOVATree?
> >
> > It doesn't need to be synchronized.
> >
> > Assuming guest and SVQ shares IOVA range. IOVA only needs to track
> > which part of the range has been used.
> >
>
> Not sure if I follow, that is what I mean with "synchronized".

Oh right.

>
> > This simplifies things, we can store GPA->IOVA mappings and SVQ ->
> > IOVA mappings separately.
> >
>
> Sorry, I still cannot see the whole picture :).
>
> Let's assume we have all the GPA mapped to specific IOVA regions, so
> we have the first IOVA tree (GPA -> IOVA) filled. Now we enable SVQ
> because of the migration. How can we know where we can place SVQ
> vrings without having them synchronized?

Just allocating a new IOVA range for SVQ?

>
> At this moment we're using a tree. The tree nature of the current SVQ
> IOVA -> VA makes all nodes ordered so it is more or less easy to look
> for holes.

Yes, iova allocate could still be implemented via a tree.

>
> Now your proposal uses the SVQ IOVA as tree values. Should we iterate
> over all of them, order them, of the two trees, and then look for
> holes there?

Let me clarify, correct me if I was wrong:

1) IOVA allocator is still implemented via a tree, we just don't need
to store how the IOVA is used
2) A dedicated GPA -> IOVA tree, updated via listeners and is used in
the datapath SVQ translation
3) A linear mapping or another SVQ -> IOVA tree used for SVQ

Thanks

>
> > Thanks
> >
> > >
> > > Thanks!
> > >
> >
>
Eugenio Perez Martin May 13, 2024, 9:56 a.m. UTC | #14
On Mon, May 13, 2024 at 10:28 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Mon, May 13, 2024 at 2:28 PM Eugenio Perez Martin
> <eperezma@redhat.com> wrote:
> >
> > On Sat, May 11, 2024 at 6:07 AM Jason Wang <jasowang@redhat.com> wrote:
> > >
> > > On Fri, May 10, 2024 at 3:16 PM Eugenio Perez Martin
> > > <eperezma@redhat.com> wrote:
> > > >
> > > > On Fri, May 10, 2024 at 6:29 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > >
> > > > > On Thu, May 9, 2024 at 3:10 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
> > > > > >
> > > > > > On Thu, May 9, 2024 at 8:27 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > >
> > > > > > > On Thu, May 9, 2024 at 1:16 AM Eugenio Perez Martin <eperezma@redhat.com> wrote:
> > > > > > > >
> > > > > > > > On Wed, May 8, 2024 at 4:29 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > > >
> > > > > > > > > On Tue, May 7, 2024 at 6:57 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
> > > > > > > > > >
> > > > > > > > > > On Tue, May 7, 2024 at 9:29 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Fri, Apr 12, 2024 at 3:56 PM Eugenio Perez Martin
> > > > > > > > > > > <eperezma@redhat.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > On Fri, Apr 12, 2024 at 8:47 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Wed, Apr 10, 2024 at 6:03 PM Eugenio Pérez <eperezma@redhat.com> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > The guest may have overlapped memory regions, where different GPA leads
> > > > > > > > > > > > > > to the same HVA.  This causes a problem when overlapped regions
> > > > > > > > > > > > > > (different GPA but same translated HVA) exists in the tree, as looking
> > > > > > > > > > > > > > them by HVA will return them twice.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I think I don't understand if there's any side effect for shadow virtqueue?
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > My bad, I totally forgot to put a reference to where this comes from.
> > > > > > > > > > > >
> > > > > > > > > > > > Si-Wei found that during initialization this sequences of maps /
> > > > > > > > > > > > unmaps happens [1]:
> > > > > > > > > > > >
> > > > > > > > > > > > HVA                    GPA                IOVA
> > > > > > > > > > > > -------------------------------------------------------------------------------------------------------------------------
> > > > > > > > > > > > Map
> > > > > > > > > > > > [0x7f7903e00000, 0x7f7983e00000)    [0x0, 0x80000000) [0x1000, 0x80000000)
> > > > > > > > > > > > [0x7f7983e00000, 0x7f9903e00000)    [0x100000000, 0x2080000000)
> > > > > > > > > > > > [0x80001000, 0x2000001000)
> > > > > > > > > > > > [0x7f7903ea0000, 0x7f7903ec0000)    [0xfeda0000, 0xfedc0000)
> > > > > > > > > > > > [0x2000001000, 0x2000021000)
> > > > > > > > > > > >
> > > > > > > > > > > > Unmap
> > > > > > > > > > > > [0x7f7903ea0000, 0x7f7903ec0000)    [0xfeda0000, 0xfedc0000) [0x1000,
> > > > > > > > > > > > 0x20000) ???
> > > > > > > > > > > >
> > > > > > > > > > > > The third HVA range is contained in the first one, but exposed under a
> > > > > > > > > > > > different GVA (aliased). This is not "flattened" by QEMU, as GPA does
> > > > > > > > > > > > not overlap, only HVA.
> > > > > > > > > > > >
> > > > > > > > > > > > At the third chunk unmap, the current algorithm finds the first chunk,
> > > > > > > > > > > > not the second one. This series is the way to tell the difference at
> > > > > > > > > > > > unmap time.
> > > > > > > > > > > >
> > > > > > > > > > > > [1] https://lists.nongnu.org/archive/html/qemu-devel/2024-04/msg00079.html
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks!
> > > > > > > > > > >
> > > > > > > > > > > Ok, I was wondering if we need to store GPA(GIOVA) to HVA mappings in
> > > > > > > > > > > the iova tree to solve this issue completely. Then there won't be
> > > > > > > > > > > aliasing issues.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > I'm ok to explore that route but this has another problem. Both SVQ
> > > > > > > > > > vrings and CVQ buffers also need to be addressable by VhostIOVATree,
> > > > > > > > > > and they do not have GPA.
> > > > > > > > > >
> > > > > > > > > > At this moment vhost_svq_translate_addr is able to handle this
> > > > > > > > > > transparently as we translate vaddr to SVQ IOVA. How can we store
> > > > > > > > > > these new entries? Maybe a (hwaddr)-1 GPA to signal it has no GPA and
> > > > > > > > > > then a list to go through other entries (SVQ vaddr and CVQ buffers).
> > > > > > > > >
> > > > > > > > > This seems to be tricky.
> > > > > > > > >
> > > > > > > > > As discussed, it could be another iova tree.
> > > > > > > > >
> > > > > > > >
> > > > > > > > Yes but there are many ways to add another IOVATree. Let me expand & recap.
> > > > > > > >
> > > > > > > > Option 1 is to simply add another iova tree to VhostShadowVirtqueue.
> > > > > > > > Let's call it gpa_iova_tree, as opposed to the current iova_tree that
> > > > > > > > translates from vaddr to SVQ IOVA. To know which one to use is easy at
> > > > > > > > adding or removing, like in the memory listener, but how to know at
> > > > > > > > vhost_svq_translate_addr?
> > > > > > >
> > > > > > > Then we won't use virtqueue_pop() at all, we need a SVQ version of
> > > > > > > virtqueue_pop() to translate GPA to SVQ IOVA directly?
> > > > > > >
> > > > > >
> > > > > > The problem is not virtqueue_pop, that's out of the
> > > > > > vhost_svq_translate_addr. The problem is the need of adding
> > > > > > conditionals / complexity in all the callers of
> > > > > >
> > > > > > > >
> > > > > > > > The easiest way for me is to rely on memory_region_from_host(). When
> > > > > > > > vaddr is from the guest, it returns a valid MemoryRegion. When it is
> > > > > > > > not, it returns NULL. I'm not sure if this is a valid use case, it
> > > > > > > > just worked in my tests so far.
> > > > > > > >
> > > > > > > > Now we have the second problem: The GPA values of the regions of the
> > > > > > > > two IOVA tree must be unique. We need to be able to find unallocated
> > > > > > > > regions in SVQ IOVA. At this moment there is only one IOVATree, so
> > > > > > > > this is done easily by vhost_iova_tree_map_alloc. But it is very
> > > > > > > > complicated with two trees.
> > > > > > >
> > > > > > > Would it be simpler if we decouple the IOVA allocator? For example, we
> > > > > > > can have a dedicated gtree to track the allocated IOVA ranges. It is
> > > > > > > shared by both
> > > > > > >
> > > > > > > 1) Guest memory (GPA)
> > > > > > > 2) SVQ virtqueue and buffers
> > > > > > >
> > > > > > > And another gtree to track the GPA to IOVA.
> > > > > > >
> > > > > > > The SVQ code could use either
> > > > > > >
> > > > > > > 1) one linear mappings that contains both SVQ virtqueue and buffers
> > > > > > >
> > > > > > > or
> > > > > > >
> > > > > > > 2) dynamic IOVA allocation/deallocation helpers
> > > > > > >
> > > > > > > So we don't actually need the third gtree for SVQ HVA -> SVQ IOVA?
> > > > > > >
> > > > > >
> > > > > > That's possible, but that scatters the IOVA handling code instead of
> > > > > > keeping it self-contained in VhostIOVATree.
> > > > >
> > > > > To me, the IOVA range/allocation is orthogonal to how IOVA is used.
> > > > >
> > > > > An example is the iova allocator in the kernel.
> > > > >
> > > > > Note that there's an even simpler IOVA "allocator" in NVME passthrough
> > > > > code, not sure it is useful here though (haven't had a deep look at
> > > > > that).
> > > > >
> > > >
> > > > I don't know enough about them to have an opinion. I keep seeing the
> > > > drawback of needing to synchronize both allocation & adding in all the
> > > > places we want to modify the IOVATree. At this moment, these are the
> > > > vhost-vdpa memory listener, the SVQ vring creation and removal, and
> > > > net CVQ buffers. But it may be more in the future.
> > > >
> > > > What are the advantages of keeping these separated that justifies
> > > > needing to synchronize in all these places, compared with keeping them
> > > > synchronized in VhostIOVATree?
> > >
> > > It doesn't need to be synchronized.
> > >
> > > Assuming guest and SVQ shares IOVA range. IOVA only needs to track
> > > which part of the range has been used.
> > >
> >
> > Not sure if I follow, that is what I mean with "synchronized".
>
> Oh right.
>
> >
> > > This simplifies things, we can store GPA->IOVA mappings and SVQ ->
> > > IOVA mappings separately.
> > >
> >
> > Sorry, I still cannot see the whole picture :).
> >
> > Let's assume we have all the GPA mapped to specific IOVA regions, so
> > we have the first IOVA tree (GPA -> IOVA) filled. Now we enable SVQ
> > because of the migration. How can we know where we can place SVQ
> > vrings without having them synchronized?
>
> Just allocating a new IOVA range for SVQ?
>
> >
> > At this moment we're using a tree. The tree nature of the current SVQ
> > IOVA -> VA makes all nodes ordered so it is more or less easy to look
> > for holes.
>
> Yes, iova allocate could still be implemented via a tree.
>
> >
> > Now your proposal uses the SVQ IOVA as tree values. Should we iterate
> > over all of them, order them, of the two trees, and then look for
> > holes there?
>
> Let me clarify, correct me if I was wrong:
>
> 1) IOVA allocator is still implemented via a tree, we just don't need
> to store how the IOVA is used
> 2) A dedicated GPA -> IOVA tree, updated via listeners and is used in
> the datapath SVQ translation
> 3) A linear mapping or another SVQ -> IOVA tree used for SVQ
>

Ok, so the part I was missing is that now we have 3 whole trees, with
somehow redundant information :).

In some sense this is simpler than trying to get all the information
from only two trees. On the bad side, all SVQ calls that allocate some
region need to remember to add to one of the two other threes. That is
what I mean by synchronized. But sure, we can go that way.

> Thanks
>
> >
> > > Thanks
> > >
> > > >
> > > > Thanks!
> > > >
> > >
> >
>
Jason Wang May 14, 2024, 3:56 a.m. UTC | #15
On Mon, May 13, 2024 at 5:58 PM Eugenio Perez Martin
<eperezma@redhat.com> wrote:
>
> On Mon, May 13, 2024 at 10:28 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Mon, May 13, 2024 at 2:28 PM Eugenio Perez Martin
> > <eperezma@redhat.com> wrote:
> > >
> > > On Sat, May 11, 2024 at 6:07 AM Jason Wang <jasowang@redhat.com> wrote:
> > > >
> > > > On Fri, May 10, 2024 at 3:16 PM Eugenio Perez Martin
> > > > <eperezma@redhat.com> wrote:
> > > > >
> > > > > On Fri, May 10, 2024 at 6:29 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > >
> > > > > > On Thu, May 9, 2024 at 3:10 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
> > > > > > >
> > > > > > > On Thu, May 9, 2024 at 8:27 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > >
> > > > > > > > On Thu, May 9, 2024 at 1:16 AM Eugenio Perez Martin <eperezma@redhat.com> wrote:
> > > > > > > > >
> > > > > > > > > On Wed, May 8, 2024 at 4:29 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > > > >
> > > > > > > > > > On Tue, May 7, 2024 at 6:57 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Tue, May 7, 2024 at 9:29 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > On Fri, Apr 12, 2024 at 3:56 PM Eugenio Perez Martin
> > > > > > > > > > > > <eperezma@redhat.com> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Fri, Apr 12, 2024 at 8:47 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Wed, Apr 10, 2024 at 6:03 PM Eugenio Pérez <eperezma@redhat.com> wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > The guest may have overlapped memory regions, where different GPA leads
> > > > > > > > > > > > > > > to the same HVA.  This causes a problem when overlapped regions
> > > > > > > > > > > > > > > (different GPA but same translated HVA) exists in the tree, as looking
> > > > > > > > > > > > > > > them by HVA will return them twice.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I think I don't understand if there's any side effect for shadow virtqueue?
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > My bad, I totally forgot to put a reference to where this comes from.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Si-Wei found that during initialization this sequences of maps /
> > > > > > > > > > > > > unmaps happens [1]:
> > > > > > > > > > > > >
> > > > > > > > > > > > > HVA                    GPA                IOVA
> > > > > > > > > > > > > -------------------------------------------------------------------------------------------------------------------------
> > > > > > > > > > > > > Map
> > > > > > > > > > > > > [0x7f7903e00000, 0x7f7983e00000)    [0x0, 0x80000000) [0x1000, 0x80000000)
> > > > > > > > > > > > > [0x7f7983e00000, 0x7f9903e00000)    [0x100000000, 0x2080000000)
> > > > > > > > > > > > > [0x80001000, 0x2000001000)
> > > > > > > > > > > > > [0x7f7903ea0000, 0x7f7903ec0000)    [0xfeda0000, 0xfedc0000)
> > > > > > > > > > > > > [0x2000001000, 0x2000021000)
> > > > > > > > > > > > >
> > > > > > > > > > > > > Unmap
> > > > > > > > > > > > > [0x7f7903ea0000, 0x7f7903ec0000)    [0xfeda0000, 0xfedc0000) [0x1000,
> > > > > > > > > > > > > 0x20000) ???
> > > > > > > > > > > > >
> > > > > > > > > > > > > The third HVA range is contained in the first one, but exposed under a
> > > > > > > > > > > > > different GVA (aliased). This is not "flattened" by QEMU, as GPA does
> > > > > > > > > > > > > not overlap, only HVA.
> > > > > > > > > > > > >
> > > > > > > > > > > > > At the third chunk unmap, the current algorithm finds the first chunk,
> > > > > > > > > > > > > not the second one. This series is the way to tell the difference at
> > > > > > > > > > > > > unmap time.
> > > > > > > > > > > > >
> > > > > > > > > > > > > [1] https://lists.nongnu.org/archive/html/qemu-devel/2024-04/msg00079.html
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks!
> > > > > > > > > > > >
> > > > > > > > > > > > Ok, I was wondering if we need to store GPA(GIOVA) to HVA mappings in
> > > > > > > > > > > > the iova tree to solve this issue completely. Then there won't be
> > > > > > > > > > > > aliasing issues.
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > I'm ok to explore that route but this has another problem. Both SVQ
> > > > > > > > > > > vrings and CVQ buffers also need to be addressable by VhostIOVATree,
> > > > > > > > > > > and they do not have GPA.
> > > > > > > > > > >
> > > > > > > > > > > At this moment vhost_svq_translate_addr is able to handle this
> > > > > > > > > > > transparently as we translate vaddr to SVQ IOVA. How can we store
> > > > > > > > > > > these new entries? Maybe a (hwaddr)-1 GPA to signal it has no GPA and
> > > > > > > > > > > then a list to go through other entries (SVQ vaddr and CVQ buffers).
> > > > > > > > > >
> > > > > > > > > > This seems to be tricky.
> > > > > > > > > >
> > > > > > > > > > As discussed, it could be another iova tree.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Yes but there are many ways to add another IOVATree. Let me expand & recap.
> > > > > > > > >
> > > > > > > > > Option 1 is to simply add another iova tree to VhostShadowVirtqueue.
> > > > > > > > > Let's call it gpa_iova_tree, as opposed to the current iova_tree that
> > > > > > > > > translates from vaddr to SVQ IOVA. To know which one to use is easy at
> > > > > > > > > adding or removing, like in the memory listener, but how to know at
> > > > > > > > > vhost_svq_translate_addr?
> > > > > > > >
> > > > > > > > Then we won't use virtqueue_pop() at all, we need a SVQ version of
> > > > > > > > virtqueue_pop() to translate GPA to SVQ IOVA directly?
> > > > > > > >
> > > > > > >
> > > > > > > The problem is not virtqueue_pop, that's out of the
> > > > > > > vhost_svq_translate_addr. The problem is the need of adding
> > > > > > > conditionals / complexity in all the callers of
> > > > > > >
> > > > > > > > >
> > > > > > > > > The easiest way for me is to rely on memory_region_from_host(). When
> > > > > > > > > vaddr is from the guest, it returns a valid MemoryRegion. When it is
> > > > > > > > > not, it returns NULL. I'm not sure if this is a valid use case, it
> > > > > > > > > just worked in my tests so far.
> > > > > > > > >
> > > > > > > > > Now we have the second problem: The GPA values of the regions of the
> > > > > > > > > two IOVA tree must be unique. We need to be able to find unallocated
> > > > > > > > > regions in SVQ IOVA. At this moment there is only one IOVATree, so
> > > > > > > > > this is done easily by vhost_iova_tree_map_alloc. But it is very
> > > > > > > > > complicated with two trees.
> > > > > > > >
> > > > > > > > Would it be simpler if we decouple the IOVA allocator? For example, we
> > > > > > > > can have a dedicated gtree to track the allocated IOVA ranges. It is
> > > > > > > > shared by both
> > > > > > > >
> > > > > > > > 1) Guest memory (GPA)
> > > > > > > > 2) SVQ virtqueue and buffers
> > > > > > > >
> > > > > > > > And another gtree to track the GPA to IOVA.
> > > > > > > >
> > > > > > > > The SVQ code could use either
> > > > > > > >
> > > > > > > > 1) one linear mappings that contains both SVQ virtqueue and buffers
> > > > > > > >
> > > > > > > > or
> > > > > > > >
> > > > > > > > 2) dynamic IOVA allocation/deallocation helpers
> > > > > > > >
> > > > > > > > So we don't actually need the third gtree for SVQ HVA -> SVQ IOVA?
> > > > > > > >
> > > > > > >
> > > > > > > That's possible, but that scatters the IOVA handling code instead of
> > > > > > > keeping it self-contained in VhostIOVATree.
> > > > > >
> > > > > > To me, the IOVA range/allocation is orthogonal to how IOVA is used.
> > > > > >
> > > > > > An example is the iova allocator in the kernel.
> > > > > >
> > > > > > Note that there's an even simpler IOVA "allocator" in NVME passthrough
> > > > > > code, not sure it is useful here though (haven't had a deep look at
> > > > > > that).
> > > > > >
> > > > >
> > > > > I don't know enough about them to have an opinion. I keep seeing the
> > > > > drawback of needing to synchronize both allocation & adding in all the
> > > > > places we want to modify the IOVATree. At this moment, these are the
> > > > > vhost-vdpa memory listener, the SVQ vring creation and removal, and
> > > > > net CVQ buffers. But it may be more in the future.
> > > > >
> > > > > What are the advantages of keeping these separated that justifies
> > > > > needing to synchronize in all these places, compared with keeping them
> > > > > synchronized in VhostIOVATree?
> > > >
> > > > It doesn't need to be synchronized.
> > > >
> > > > Assuming guest and SVQ shares IOVA range. IOVA only needs to track
> > > > which part of the range has been used.
> > > >
> > >
> > > Not sure if I follow, that is what I mean with "synchronized".
> >
> > Oh right.
> >
> > >
> > > > This simplifies things, we can store GPA->IOVA mappings and SVQ ->
> > > > IOVA mappings separately.
> > > >
> > >
> > > Sorry, I still cannot see the whole picture :).
> > >
> > > Let's assume we have all the GPA mapped to specific IOVA regions, so
> > > we have the first IOVA tree (GPA -> IOVA) filled. Now we enable SVQ
> > > because of the migration. How can we know where we can place SVQ
> > > vrings without having them synchronized?
> >
> > Just allocating a new IOVA range for SVQ?
> >
> > >
> > > At this moment we're using a tree. The tree nature of the current SVQ
> > > IOVA -> VA makes all nodes ordered so it is more or less easy to look
> > > for holes.
> >
> > Yes, iova allocate could still be implemented via a tree.
> >
> > >
> > > Now your proposal uses the SVQ IOVA as tree values. Should we iterate
> > > over all of them, order them, of the two trees, and then look for
> > > holes there?
> >
> > Let me clarify, correct me if I was wrong:
> >
> > 1) IOVA allocator is still implemented via a tree, we just don't need
> > to store how the IOVA is used
> > 2) A dedicated GPA -> IOVA tree, updated via listeners and is used in
> > the datapath SVQ translation
> > 3) A linear mapping or another SVQ -> IOVA tree used for SVQ
> >
>
> Ok, so the part I was missing is that now we have 3 whole trees, with
> somehow redundant information :).

Somehow, it decouples the IOVA usage out of the IOVA allocator. This
might be simple as guests and SVQ may try to share a single IOVA
address space.

>
> In some sense this is simpler than trying to get all the information
> from only two trees. On the bad side, all SVQ calls that allocate some
> region need to remember to add to one of the two other threes. That is
> what I mean by synchronized. But sure, we can go that way.

Just a suggestion, if it turns out to complicate the issue, I'm fine
to go the other way.

Thanks

>
> > Thanks
> >
> > >
> > > > Thanks
> > > >
> > > > >
> > > > > Thanks!
> > > > >
> > > >
> > >
> >
>
>