
[V8,1/4] mem: add share parameter to memory-backend-ram

Message ID 20180117095421.124787-2-marcel@redhat.com
State New
Series hw/pvrdma: PVRDMA device implementation

Commit Message

Marcel Apfelbaum Jan. 17, 2018, 9:54 a.m. UTC
Currently only the file-backed memory backend can
be created with a "share" flag, which allows
sharing guest RAM with other processes on the host.

Add the "share" flag also to the RAM memory backend
in order to allow remapping parts of the guest RAM
to different host virtual addresses. This is needed
by the RDMA devices in order to remap non-contiguous
QEMU virtual addresses into a contiguous virtual address range.

Moved the "share" flag to the Host Memory base class,
modified phys_mem_alloc to include the new parameter
and a new interface memory_region_init_ram_shared_nomigrate.

There are no functional changes if the new flag is not used.

Signed-off-by: Marcel Apfelbaum <marcel@redhat.com>
---
 backends/hostmem-file.c  | 25 +------------------------
 backends/hostmem-ram.c   |  4 ++--
 backends/hostmem.c       | 21 +++++++++++++++++++++
 exec.c                   | 26 +++++++++++++++-----------
 include/exec/memory.h    | 23 +++++++++++++++++++++++
 include/exec/ram_addr.h  |  3 ++-
 include/qemu/osdep.h     |  2 +-
 include/sysemu/hostmem.h |  2 +-
 include/sysemu/kvm.h     |  2 +-
 memory.c                 | 16 +++++++++++++---
 target/s390x/kvm.c       |  4 ++--
 util/oslib-posix.c       |  4 ++--
 util/oslib-win32.c       |  2 +-
 13 files changed, 85 insertions(+), 49 deletions(-)
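
As an illustration, a caller could use the new interface roughly as follows
(a sketch assuming the signature mirrors the existing
memory_region_init_ram_nomigrate() helper; the patch hunks themselves are
authoritative):

    #include "qemu/osdep.h"
    #include "exec/memory.h"

    /* Sketch only: allocate guest RAM with a shared (MAP_SHARED) mapping so
     * that parts of it can later be remapped to other host virtual
     * addresses. */
    static void init_shared_ram(MemoryRegion *mr, Object *owner,
                                uint64_t size, Error **errp)
    {
        memory_region_init_ram_shared_nomigrate(mr, owner, "backend.ram",
                                                size, true /* share */, errp);
    }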

Comments

Eduardo Habkost Jan. 31, 2018, 8:40 p.m. UTC | #1
On Wed, Jan 17, 2018 at 11:54:18AM +0200, Marcel Apfelbaum wrote:
> Currently only file backed memory backend can
> be created with a "share" flag in order to allow
> sharing guest RAM with other processes in the host.
> 
> Add the "share" flag also to RAM Memory Backend
> in order to allow remapping parts of the guest RAM
> to different host virtual addresses. This is needed
> by the RDMA devices in order to remap non-contiguous
> QEMU virtual addresses to a contiguous virtual address range.
> 

Why do we need to make this configurable?  Would anything break
if MAP_SHARED was always used if possible?


> Moved the "share" flag to the Host Memory base class,
> modified phys_mem_alloc to include the new parameter
> and a new interface memory_region_init_ram_shared_nomigrate.
> 
> There are no functional changes if the new flag is not used.
> 
> Signed-off-by: Marcel Apfelbaum <marcel@redhat.com>
[...]
Michael S. Tsirkin Jan. 31, 2018, 9:10 p.m. UTC | #2
On Wed, Jan 31, 2018 at 06:40:59PM -0200, Eduardo Habkost wrote:
> On Wed, Jan 17, 2018 at 11:54:18AM +0200, Marcel Apfelbaum wrote:
> > Currently only file backed memory backend can
> > be created with a "share" flag in order to allow
> > sharing guest RAM with other processes in the host.
> > 
> > Add the "share" flag also to RAM Memory Backend
> > in order to allow remapping parts of the guest RAM
> > to different host virtual addresses. This is needed
> > by the RDMA devices in order to remap non-contiguous
> > QEMU virtual addresses to a contiguous virtual address range.
> > 
> 
> Why do we need to make this configurable?  Would anything break
> if MAP_SHARED was always used if possible?

See Documentation/vm/numa_memory_policy.txt for a list
of complications.

Maybe we should make more of an effort to detect and report these
issues.

> 
> > Moved the "share" flag to the Host Memory base class,
> > modified phys_mem_alloc to include the new parameter
> > and a new interface memory_region_init_ram_shared_nomigrate.
> > 
> > There are no functional changes if the new flag is not used.
> > 
> > Signed-off-by: Marcel Apfelbaum <marcel@redhat.com>
> [...]
> 
> -- 
> Eduardo
Eduardo Habkost Jan. 31, 2018, 11:34 p.m. UTC | #3
On Wed, Jan 31, 2018 at 11:10:07PM +0200, Michael S. Tsirkin wrote:
> On Wed, Jan 31, 2018 at 06:40:59PM -0200, Eduardo Habkost wrote:
> > On Wed, Jan 17, 2018 at 11:54:18AM +0200, Marcel Apfelbaum wrote:
> > > Currently only file backed memory backend can
> > > be created with a "share" flag in order to allow
> > > sharing guest RAM with other processes in the host.
> > > 
> > > Add the "share" flag also to RAM Memory Backend
> > > in order to allow remapping parts of the guest RAM
> > > to different host virtual addresses. This is needed
> > > by the RDMA devices in order to remap non-contiguous
> > > QEMU virtual addresses to a contiguous virtual address range.
> > > 
> > 
> > Why do we need to make this configurable?  Would anything break
> > if MAP_SHARED was always used if possible?
> 
> See Documentation/vm/numa_memory_policy.txt for a list
> of complications.

Ew.

> 
> Maybe we should more of an effort to detect and report these
> issues.

Probably.  Having other features breaking silently when using
pvrdma doesn't sound good.  We must at least document those
problems in the documentation for memory-backend-ram.

BTW, what's the root cause for requiring HVAs in the buffer?  Can
this be fixed?
Michael S. Tsirkin Feb. 1, 2018, 2:22 a.m. UTC | #4
On Wed, Jan 31, 2018 at 09:34:22PM -0200, Eduardo Habkost wrote:
> On Wed, Jan 31, 2018 at 11:10:07PM +0200, Michael S. Tsirkin wrote:
> > On Wed, Jan 31, 2018 at 06:40:59PM -0200, Eduardo Habkost wrote:
> > > On Wed, Jan 17, 2018 at 11:54:18AM +0200, Marcel Apfelbaum wrote:
> > > > Currently only file backed memory backend can
> > > > be created with a "share" flag in order to allow
> > > > sharing guest RAM with other processes in the host.
> > > > 
> > > > Add the "share" flag also to RAM Memory Backend
> > > > in order to allow remapping parts of the guest RAM
> > > > to different host virtual addresses. This is needed
> > > > by the RDMA devices in order to remap non-contiguous
> > > > QEMU virtual addresses to a contiguous virtual address range.
> > > > 
> > > 
> > > Why do we need to make this configurable?  Would anything break
> > > if MAP_SHARED was always used if possible?
> > 
> > See Documentation/vm/numa_memory_policy.txt for a list
> > of complications.
> 
> Ew.
> 
> > 
> > Maybe we should more of an effort to detect and report these
> > issues.
> 
> Probably.  Having other features breaking silently when using
> pvrdma doesn't sound good.  We must at least document those
> problems in the documentation for memory-backend-ram.
> 
> BTW, what's the root cause for requiring HVAs in the buffer?

It's a side effect of the kernel/userspace API which always wants
a single HVA/len pair to map memory for the application.
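
For reference, the userspace verbs call in question takes exactly one
contiguous address/length pair (a sketch using the standard libibverbs
prototype):

    #include <infiniband/verbs.h>
    #include <stddef.h>

    /* ibv_reg_mr() accepts a single contiguous (addr, length) pair; there is
     * no way to hand it a scatter/gather list of host-virtual chunks. */
    static struct ibv_mr *register_guest_ram(struct ibv_pd *pd,
                                             void *hva, size_t len)
    {
        return ibv_reg_mr(pd, hva, len,
                          IBV_ACCESS_LOCAL_WRITE |
                          IBV_ACCESS_REMOTE_READ |
                          IBV_ACCESS_REMOTE_WRITE);
    }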


>  Can
> this be fixed?

I think yes.  It'd need to be a kernel patch for the RDMA subsystem
mapping an s/g list with actual memory. The HVA/len pair would then just
be used to refer to the region, without creating the two mappings.

Something like splitting the register mr into

mr = create mr (va/len) - allocate a handle and record the va/len

addmemory(mr, offset, hva, len) - pin memory

register mr - pass it to HW

As a nice side effect we won't burn so much virtual address space.

This will fix rdma with hugetlbfs as well which is currently broken.
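
Sketched in C, the proposed flow might look something like this (every name
below is hypothetical; no such API exists in libibverbs or the kernel today):

    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical interface for the proposed split; illustration only. */
    struct hyp_mr;
    struct hyp_mr *hyp_create_mr(struct ibv_pd *pd, uint64_t iova, size_t len);
    int hyp_mr_add_memory(struct hyp_mr *mr, size_t offset, void *hva,
                          size_t len);
    int hyp_register_mr(struct hyp_mr *mr, int access);

    /* Register several non-contiguous QEMU chunks behind one IOVA/len
     * handle. */
    static int register_scattered(struct ibv_pd *pd, uint64_t iova,
                                  void **hva, size_t *len, int nchunks)
    {
        struct hyp_mr *mr;
        size_t total = 0, off = 0;
        int i;

        for (i = 0; i < nchunks; i++) {
            total += len[i];
        }
        mr = hyp_create_mr(pd, iova, total);            /* record iova/len */
        if (!mr) {
            return -1;
        }
        for (i = 0; i < nchunks; i++) {
            hyp_mr_add_memory(mr, off, hva[i], len[i]); /* pin each chunk */
            off += len[i];
        }
        return hyp_register_mr(mr, IBV_ACCESS_LOCAL_WRITE); /* hand to HW */
    }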


> -- 
> Eduardo
Marcel Apfelbaum Feb. 1, 2018, 5:36 a.m. UTC | #5
On 01/02/2018 4:22, Michael S. Tsirkin wrote:
> On Wed, Jan 31, 2018 at 09:34:22PM -0200, Eduardo Habkost wrote:
>> On Wed, Jan 31, 2018 at 11:10:07PM +0200, Michael S. Tsirkin wrote:
>>> On Wed, Jan 31, 2018 at 06:40:59PM -0200, Eduardo Habkost wrote:
>>>> On Wed, Jan 17, 2018 at 11:54:18AM +0200, Marcel Apfelbaum wrote:
>>>>> Currently only file backed memory backend can
>>>>> be created with a "share" flag in order to allow
>>>>> sharing guest RAM with other processes in the host.
>>>>>
>>>>> Add the "share" flag also to RAM Memory Backend
>>>>> in order to allow remapping parts of the guest RAM
>>>>> to different host virtual addresses. This is needed
>>>>> by the RDMA devices in order to remap non-contiguous
>>>>> QEMU virtual addresses to a contiguous virtual address range.
>>>>>
>>>>
>>>> Why do we need to make this configurable?  Would anything break
>>>> if MAP_SHARED was always used if possible?
>>>
>>> See Documentation/vm/numa_memory_policy.txt for a list
>>> of complications.
>>
>> Ew.
>>
>>>
>>> Maybe we should more of an effort to detect and report these
>>> issues.
>>
>> Probably.  Having other features breaking silently when using
>> pvrdma doesn't sound good.  We must at least document those
>> problems in the documentation for memory-backend-ram.
>>
>> BTW, what's the root cause for requiring HVAs in the buffer?
> 
> It's a side effect of the kernel/userspace API which always wants
> a single HVA/len pair to map memory for the application.
> 
> 

Hi Eduardo and Michael,

>>  Can
>> this be fixed?
> 
> I think yes.  It'd need to be a kernel patch for the RDMA subsystem
> mapping an s/g list with actual memory. The HVA/len pair would then just
> be used to refer to the region, without creating the two mappings.
> 
> Something like splitting the register mr into
> 
> mr = create mr (va/len) - allocate a handle and record the va/len
> 
> addmemory(mr, offset, hva, len) - pin memory
> 
> register mr - pass it to HW
> 
> As a nice side effect we won't burn so much virtual address space.
>

We would still need a contiguous virtual address space range (for post-send),
which we don't have, since a contiguous guest virtual address range
will always end up as non-contiguous host virtual addresses.

I am not sure the RDMA HW can handle a large VA with holes.

An alternative would be a 0-based MR: QEMU intercepts the post-send
operations and subtracts the guest VA base address.
However, I didn't see an implementation in the kernel for 0-based MRs,
and the RDMA maintainer said it would work for local keys
but not for remote keys.

> This will fix rdma with hugetlbfs as well which is currently broken.
> 
> 

There is already a discussion on the linux-rdma list:
    https://www.spinics.net/lists/linux-rdma/msg60079.html
But it will take some (actually a lot of) time; we are currently discussing
a possible API. And it does not solve the re-mapping...

Thanks,
Marcel

>> -- 
>> Eduardo
Eduardo Habkost Feb. 1, 2018, 12:10 p.m. UTC | #6
On Thu, Feb 01, 2018 at 07:36:50AM +0200, Marcel Apfelbaum wrote:
> On 01/02/2018 4:22, Michael S. Tsirkin wrote:
> > On Wed, Jan 31, 2018 at 09:34:22PM -0200, Eduardo Habkost wrote:
[...]
> >> BTW, what's the root cause for requiring HVAs in the buffer?
> > 
> > It's a side effect of the kernel/userspace API which always wants
> > a single HVA/len pair to map memory for the application.
> > 
> > 
> 
> Hi Eduardo and Michael,
> 
> >>  Can
> >> this be fixed?
> > 
> > I think yes.  It'd need to be a kernel patch for the RDMA subsystem
> > mapping an s/g list with actual memory. The HVA/len pair would then just
> > be used to refer to the region, without creating the two mappings.
> > 
> > Something like splitting the register mr into
> > 
> > mr = create mr (va/len) - allocate a handle and record the va/len
> > 
> > addmemory(mr, offset, hva, len) - pin memory
> > 
> > register mr - pass it to HW
> > 
> > As a nice side effect we won't burn so much virtual address space.
> >
> 
> We would still need a contiguous virtual address space range (for post-send)
> which we don't have since guest contiguous virtual address space
> will always end up as non-contiguous host virtual address space.
> 
> I am not sure the RDMA HW can handle a large VA with holes.

I'm confused.  Why would the hardware see and care about virtual
addresses?  How exactly does the hardware translates VAs to
PAs?  What if the process page tables change?

> 
> An alternative would be 0-based MR, QEMU intercepts the post-send
> operations and can substract the guest VA base address.
> However I didn't see the implementation in kernel for 0 based MRs
> and also the RDMA maintainer said it would work for local keys
> and not for remote keys.

This is also unexpected: are GVAs visible to the virtual RDMA
hardware?  Where does the QEMU pvrdma code translates GVAs to
GPAs?

> 
> > This will fix rdma with hugetlbfs as well which is currently broken.
> > 
> > 
> 
> There is already a discussion on the linux-rdma list:
>     https://www.spinics.net/lists/linux-rdma/msg60079.html
> But it will take some (actually a lot of) time, we are currently talking about
> a possible API. And it does not solve the re-mapping...
> 
> Thanks,
> Marcel
> 
> >> -- 
> >> Eduardo
>
Marcel Apfelbaum Feb. 1, 2018, 12:29 p.m. UTC | #7
On 01/02/2018 14:10, Eduardo Habkost wrote:
> On Thu, Feb 01, 2018 at 07:36:50AM +0200, Marcel Apfelbaum wrote:
>> On 01/02/2018 4:22, Michael S. Tsirkin wrote:
>>> On Wed, Jan 31, 2018 at 09:34:22PM -0200, Eduardo Habkost wrote:
> [...]
>>>> BTW, what's the root cause for requiring HVAs in the buffer?
>>>
>>> It's a side effect of the kernel/userspace API which always wants
>>> a single HVA/len pair to map memory for the application.
>>>
>>>
>>
>> Hi Eduardo and Michael,
>>
>>>>  Can
>>>> this be fixed?
>>>
>>> I think yes.  It'd need to be a kernel patch for the RDMA subsystem
>>> mapping an s/g list with actual memory. The HVA/len pair would then just
>>> be used to refer to the region, without creating the two mappings.
>>>
>>> Something like splitting the register mr into
>>>
>>> mr = create mr (va/len) - allocate a handle and record the va/len
>>>
>>> addmemory(mr, offset, hva, len) - pin memory
>>>
>>> register mr - pass it to HW
>>>
>>> As a nice side effect we won't burn so much virtual address space.
>>>
>>
>> We would still need a contiguous virtual address space range (for post-send)
>> which we don't have since guest contiguous virtual address space
>> will always end up as non-contiguous host virtual address space.
>>
>> I am not sure the RDMA HW can handle a large VA with holes.
> 
> I'm confused.  Why would the hardware see and care about virtual
> addresses? 

The post-send operation bypasses the kernel, and the process
puts GVA addresses in the work request.

> How exactly does the hardware translates VAs to
> PAs? 

The HW maintains its own page-directory-like structure, separate from the MMU,
mapping VAs -> phys pages.

> What if the process page tables change?
> 

Since the HW uses its own page tables, we just need the phys
pages to be pinned.

>>
>> An alternative would be 0-based MR, QEMU intercepts the post-send
>> operations and can substract the guest VA base address.
>> However I didn't see the implementation in kernel for 0 based MRs
>> and also the RDMA maintainer said it would work for local keys
>> and not for remote keys.
> 
> This is also unexpected: are GVAs visible to the virtual RDMA
> hardware? 

Yes, explained above

> Where does the QEMU pvrdma code translates GVAs to
> GPAs?
> 

During reg_mr (the memory registration commands).
QEMU then registers the same addresses with the real HW
(as host virtual addresses).
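
A minimal sketch of the kind of remapping a shared backend makes possible
(relying on Linux mremap() semantics for old_size == 0, which duplicates a
mapping and is only permitted for shared mappings; this illustrates the
mechanism, not the actual pvrdma code):

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <stddef.h>

    /* Duplicate scattered MAP_SHARED pages of guest RAM into one contiguous
     * HVA window, so the result can be handed to ibv_reg_mr() as a single
     * addr/len pair.  Illustration only. */
    static void *map_contiguous(void **page_hva, size_t npages, size_t pgsz)
    {
        void *base = mmap(NULL, npages * pgsz, PROT_NONE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        size_t i;

        if (base == MAP_FAILED) {
            return NULL;
        }
        for (i = 0; i < npages; i++) {
            /* old_size == 0 creates a second mapping of the same pages
             * instead of moving them; Linux allows this only for shared
             * mappings, hence the need for share=on on the backend. */
            if (mremap(page_hva[i], 0, pgsz, MREMAP_MAYMOVE | MREMAP_FIXED,
                       (char *)base + i * pgsz) == MAP_FAILED) {
                munmap(base, npages * pgsz);
                return NULL;
            }
        }
        return base;
    }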

Thanks,
Marcel

>>
>>> This will fix rdma with hugetlbfs as well which is currently broken.
>>>
>>>
>>
>> There is already a discussion on the linux-rdma list:
>>     https://www.spinics.net/lists/linux-rdma/msg60079.html
>> But it will take some (actually a lot of) time, we are currently talking about
>> a possible API. And it does not solve the re-mapping...
>>
>> Thanks,
>> Marcel
>>
>>>> -- 
>>>> Eduardo
>>
>
Michael S. Tsirkin Feb. 1, 2018, 12:57 p.m. UTC | #8
On Thu, Feb 01, 2018 at 07:36:50AM +0200, Marcel Apfelbaum wrote:
> On 01/02/2018 4:22, Michael S. Tsirkin wrote:
> > On Wed, Jan 31, 2018 at 09:34:22PM -0200, Eduardo Habkost wrote:
> >> On Wed, Jan 31, 2018 at 11:10:07PM +0200, Michael S. Tsirkin wrote:
> >>> On Wed, Jan 31, 2018 at 06:40:59PM -0200, Eduardo Habkost wrote:
> >>>> On Wed, Jan 17, 2018 at 11:54:18AM +0200, Marcel Apfelbaum wrote:
> >>>>> Currently only file backed memory backend can
> >>>>> be created with a "share" flag in order to allow
> >>>>> sharing guest RAM with other processes in the host.
> >>>>>
> >>>>> Add the "share" flag also to RAM Memory Backend
> >>>>> in order to allow remapping parts of the guest RAM
> >>>>> to different host virtual addresses. This is needed
> >>>>> by the RDMA devices in order to remap non-contiguous
> >>>>> QEMU virtual addresses to a contiguous virtual address range.
> >>>>>
> >>>>
> >>>> Why do we need to make this configurable?  Would anything break
> >>>> if MAP_SHARED was always used if possible?
> >>>
> >>> See Documentation/vm/numa_memory_policy.txt for a list
> >>> of complications.
> >>
> >> Ew.
> >>
> >>>
> >>> Maybe we should more of an effort to detect and report these
> >>> issues.
> >>
> >> Probably.  Having other features breaking silently when using
> >> pvrdma doesn't sound good.  We must at least document those
> >> problems in the documentation for memory-backend-ram.
> >>
> >> BTW, what's the root cause for requiring HVAs in the buffer?
> > 
> > It's a side effect of the kernel/userspace API which always wants
> > a single HVA/len pair to map memory for the application.
> > 
> > 
> 
> Hi Eduardo and Michael,
> 
> >>  Can
> >> this be fixed?
> > 
> > I think yes.  It'd need to be a kernel patch for the RDMA subsystem
> > mapping an s/g list with actual memory. The HVA/len pair would then just
> > be used to refer to the region, without creating the two mappings.
> > 
> > Something like splitting the register mr into
> > 
> > mr = create mr (va/len) - allocate a handle and record the va/len
> > 
> > addmemory(mr, offset, hva, len) - pin memory
> > 
> > register mr - pass it to HW
> > 
> > As a nice side effect we won't burn so much virtual address space.
> >
> 
> We would still need a contiguous virtual address space range (for post-send)
> which we don't have since guest contiguous virtual address space
> will always end up as non-contiguous host virtual address space.

It just needs to be contiguous in the HCA virtual address space.
Software never accesses through this pointer.
In other words - basically expose register physical mr to userspace.


> 
> I am not sure the RDMA HW can handle a large VA with holes.
> 
> An alternative would be 0-based MR, QEMU intercepts the post-send
> operations and can substract the guest VA base address.
> However I didn't see the implementation in kernel for 0 based MRs
> and also the RDMA maintainer said it would work for local keys
> and not for remote keys.
> 
> > This will fix rdma with hugetlbfs as well which is currently broken.
> > 
> > 
> 
> There is already a discussion on the linux-rdma list:
>     https://www.spinics.net/lists/linux-rdma/msg60079.html
> But it will take some (actually a lot of) time, we are currently talking about
> a possible API.

You probably need to pass the s/g piece by piece since it might exceed
any reasonable array size.

> And it does not solve the re-mapping...
> 
> Thanks,
> Marcel

Haven't read through that discussion. But at least what I posted solves
it since you do not need it contiguous in HVA any longer.

> >> -- 
> >> Eduardo
Eduardo Habkost Feb. 1, 2018, 1:53 p.m. UTC | #9
On Thu, Feb 01, 2018 at 02:29:25PM +0200, Marcel Apfelbaum wrote:
> On 01/02/2018 14:10, Eduardo Habkost wrote:
> > On Thu, Feb 01, 2018 at 07:36:50AM +0200, Marcel Apfelbaum wrote:
> >> On 01/02/2018 4:22, Michael S. Tsirkin wrote:
> >>> On Wed, Jan 31, 2018 at 09:34:22PM -0200, Eduardo Habkost wrote:
> > [...]
> >>>> BTW, what's the root cause for requiring HVAs in the buffer?
> >>>
> >>> It's a side effect of the kernel/userspace API which always wants
> >>> a single HVA/len pair to map memory for the application.
> >>>
> >>>
> >>
> >> Hi Eduardo and Michael,
> >>
> >>>>  Can
> >>>> this be fixed?
> >>>
> >>> I think yes.  It'd need to be a kernel patch for the RDMA subsystem
> >>> mapping an s/g list with actual memory. The HVA/len pair would then just
> >>> be used to refer to the region, without creating the two mappings.
> >>>
> >>> Something like splitting the register mr into
> >>>
> >>> mr = create mr (va/len) - allocate a handle and record the va/len
> >>>
> >>> addmemory(mr, offset, hva, len) - pin memory
> >>>
> >>> register mr - pass it to HW
> >>>
> >>> As a nice side effect we won't burn so much virtual address space.
> >>>
> >>
> >> We would still need a contiguous virtual address space range (for post-send)
> >> which we don't have since guest contiguous virtual address space
> >> will always end up as non-contiguous host virtual address space.
> >>
> >> I am not sure the RDMA HW can handle a large VA with holes.
> > 
> > I'm confused.  Why would the hardware see and care about virtual
> > addresses? 
> 
> The post-send operations bypasses the kernel, and the process
> puts in the work request GVA addresses.
> 
> > How exactly does the hardware translates VAs to
> > PAs? 
> 
> The HW maintains a page-directory like structure different form MMU
> VA -> phys pages
> 
> > What if the process page tables change?
> > 
> 
> Since the page tables the HW uses are their own, we just need the phys
> page to be pinned.

So there's no hardware-imposed requirement that the hardware VAs
(mapped by the HW page directory) match the VAs in QEMU
address-space, right?  If the RDMA API is updated to remove this
requirement, couldn't you just use the untranslated guest VAs
directly?
Michael S. Tsirkin Feb. 1, 2018, 2:24 p.m. UTC | #10
On Thu, Feb 01, 2018 at 02:29:25PM +0200, Marcel Apfelbaum wrote:
> On 01/02/2018 14:10, Eduardo Habkost wrote:
> > On Thu, Feb 01, 2018 at 07:36:50AM +0200, Marcel Apfelbaum wrote:
> >> On 01/02/2018 4:22, Michael S. Tsirkin wrote:
> >>> On Wed, Jan 31, 2018 at 09:34:22PM -0200, Eduardo Habkost wrote:
> > [...]
> >>>> BTW, what's the root cause for requiring HVAs in the buffer?
> >>>
> >>> It's a side effect of the kernel/userspace API which always wants
> >>> a single HVA/len pair to map memory for the application.
> >>>
> >>>
> >>
> >> Hi Eduardo and Michael,
> >>
> >>>>  Can
> >>>> this be fixed?
> >>>
> >>> I think yes.  It'd need to be a kernel patch for the RDMA subsystem
> >>> mapping an s/g list with actual memory. The HVA/len pair would then just
> >>> be used to refer to the region, without creating the two mappings.
> >>>
> >>> Something like splitting the register mr into
> >>>
> >>> mr = create mr (va/len) - allocate a handle and record the va/len
> >>>
> >>> addmemory(mr, offset, hva, len) - pin memory
> >>>
> >>> register mr - pass it to HW
> >>>
> >>> As a nice side effect we won't burn so much virtual address space.
> >>>
> >>
> >> We would still need a contiguous virtual address space range (for post-send)
> >> which we don't have since guest contiguous virtual address space
> >> will always end up as non-contiguous host virtual address space.
> >>
> >> I am not sure the RDMA HW can handle a large VA with holes.
> > 
> > I'm confused.  Why would the hardware see and care about virtual
> > addresses? 
> 
> The post-send operations bypasses the kernel, and the process
> puts in the work request GVA addresses.

To be more precise, it's the guest supplied IOVA that is sent to the card.

> > How exactly does the hardware translates VAs to
> > PAs? 
> 
> The HW maintains a page-directory like structure different form MMU
> VA -> phys pages
> 
> > What if the process page tables change?
> > 
> 
> Since the page tables the HW uses are their own, we just need the phys
> page to be pinned.
> 
> >>
> >> An alternative would be 0-based MR, QEMU intercepts the post-send
> >> operations and can substract the guest VA base address.
> >> However I didn't see the implementation in kernel for 0 based MRs
> >> and also the RDMA maintainer said it would work for local keys
> >> and not for remote keys.
> > 
> > This is also unexpected: are GVAs visible to the virtual RDMA
> > hardware? 
> 
> Yes, explained above
> 
> > Where does the QEMU pvrdma code translates GVAs to
> > GPAs?
> > 
> 
> During reg_mr (memory registration commands)
> Then it registers the same addresses to the real HW.
> (as Host virtual addresses)
> 
> Thanks,
> Marcel


The full fix would be to allow QEMU to map a list of
pages to a guest supplied IOVA.

> >>
> >>> This will fix rdma with hugetlbfs as well which is currently broken.
> >>>
> >>>
> >>
> >> There is already a discussion on the linux-rdma list:
> >>     https://www.spinics.net/lists/linux-rdma/msg60079.html
> >> But it will take some (actually a lot of) time, we are currently talking about
> >> a possible API. And it does not solve the re-mapping...
> >>
> >> Thanks,
> >> Marcel
> >>
> >>>> -- 
> >>>> Eduardo
> >>
> >
Eduardo Habkost Feb. 1, 2018, 4:31 p.m. UTC | #11
On Thu, Feb 01, 2018 at 04:24:30PM +0200, Michael S. Tsirkin wrote:
[...]
> The full fix would be to allow QEMU to map a list of
> pages to a guest supplied IOVA.

Thanks, that's what I expected.

While this is not possible, the only requests I have for this
patch are that we clearly document:
* What's the only purpose of share=on on a host-memory-backend
  object (due to pvrdma limitations).
* The potential undesirable side-effects of setting share=on.
* On the commit message and other comments, clearly distinguish
  HVAs in the QEMU address-space from IOVAs, to avoid confusion.
Michael S. Tsirkin Feb. 1, 2018, 4:48 p.m. UTC | #12
On Thu, Feb 01, 2018 at 02:31:32PM -0200, Eduardo Habkost wrote:
> On Thu, Feb 01, 2018 at 04:24:30PM +0200, Michael S. Tsirkin wrote:
> [...]
> > The full fix would be to allow QEMU to map a list of
> > pages to a guest supplied IOVA.
> 
> Thanks, that's what I expected.
> 
> While this is not possible, the only requests I have for this
> patch is that we clearly document:
> * What's the only purpose of share=on on a host-memory-backend
>   object (due to pvrdma limitations).
> * The potential undesirable side-effects of setting share=on.
> * On the commit message and other comments, clearly distinguish
>   HVAs in the QEMU address-space from IOVAs, to avoid confusion.

Looking forward, when we do support it, how will management find out
it no longer needs to pass the share parameter?

Further, if the side effects of the share parameter go away,
how will it know these no longer hold?

> -- 
> Eduardo
Eduardo Habkost Feb. 1, 2018, 4:57 p.m. UTC | #13
On Thu, Feb 01, 2018 at 06:48:54PM +0200, Michael S. Tsirkin wrote:
> On Thu, Feb 01, 2018 at 02:31:32PM -0200, Eduardo Habkost wrote:
> > On Thu, Feb 01, 2018 at 04:24:30PM +0200, Michael S. Tsirkin wrote:
> > [...]
> > > The full fix would be to allow QEMU to map a list of
> > > pages to a guest supplied IOVA.
> > 
> > Thanks, that's what I expected.
> > 
> > While this is not possible, the only requests I have for this
> > patch is that we clearly document:
> > * What's the only purpose of share=on on a host-memory-backend
> >   object (due to pvrdma limitations).
> > * The potential undesirable side-effects of setting share=on.
> > * On the commit message and other comments, clearly distinguish
> >   HVAs in the QEMU address-space from IOVAs, to avoid confusion.
> 
> Looking forward, when we do support it, how will management find out
> it no longer needs to pass the share parameter?
> 
> Further, if the side effects of the share parameter go away,
> how will it know these no longer hold?

A query-host-capabilities or similar QMP command seems necessary
for that.  It would be useful for other stuff like MAP_SYNC.
Michael S. Tsirkin Feb. 1, 2018, 4:59 p.m. UTC | #14
On Thu, Feb 01, 2018 at 02:57:39PM -0200, Eduardo Habkost wrote:
> On Thu, Feb 01, 2018 at 06:48:54PM +0200, Michael S. Tsirkin wrote:
> > On Thu, Feb 01, 2018 at 02:31:32PM -0200, Eduardo Habkost wrote:
> > > On Thu, Feb 01, 2018 at 04:24:30PM +0200, Michael S. Tsirkin wrote:
> > > [...]
> > > > The full fix would be to allow QEMU to map a list of
> > > > pages to a guest supplied IOVA.
> > > 
> > > Thanks, that's what I expected.
> > > 
> > > While this is not possible, the only requests I have for this
> > > patch is that we clearly document:
> > > * What's the only purpose of share=on on a host-memory-backend
> > >   object (due to pvrdma limitations).
> > > * The potential undesirable side-effects of setting share=on.
> > > * On the commit message and other comments, clearly distinguish
> > >   HVAs in the QEMU address-space from IOVAs, to avoid confusion.
> > 
> > Looking forward, when we do support it, how will management find out
> > it no longer needs to pass the share parameter?
> > 
> > Further, if the side effects of the share parameter go away,
> > how will it know these no longer hold?
> 
> A query-host-capabilities or similar QMP command seems necessary
> for that.

Is anyone working on that?

> It would be useful for other stuff like MAP_SYNC.


> -- 
> Eduardo
Eduardo Habkost Feb. 1, 2018, 5:01 p.m. UTC | #15
On Thu, Feb 01, 2018 at 06:59:07PM +0200, Michael S. Tsirkin wrote:
> On Thu, Feb 01, 2018 at 02:57:39PM -0200, Eduardo Habkost wrote:
> > On Thu, Feb 01, 2018 at 06:48:54PM +0200, Michael S. Tsirkin wrote:
> > > On Thu, Feb 01, 2018 at 02:31:32PM -0200, Eduardo Habkost wrote:
> > > > On Thu, Feb 01, 2018 at 04:24:30PM +0200, Michael S. Tsirkin wrote:
> > > > [...]
> > > > > The full fix would be to allow QEMU to map a list of
> > > > > pages to a guest supplied IOVA.
> > > > 
> > > > Thanks, that's what I expected.
> > > > 
> > > > While this is not possible, the only requests I have for this
> > > > patch is that we clearly document:
> > > > * What's the only purpose of share=on on a host-memory-backend
> > > >   object (due to pvrdma limitations).
> > > > * The potential undesirable side-effects of setting share=on.
> > > > * On the commit message and other comments, clearly distinguish
> > > >   HVAs in the QEMU address-space from IOVAs, to avoid confusion.
> > > 
> > > Looking forward, when we do support it, how will management find out
> > > it no longer needs to pass the share parameter?
> > > 
> > > Further, if the side effects of the share parameter go away,
> > > how will it know these no longer hold?
> > 
> > A query-host-capabilities or similar QMP command seems necessary
> > for that.
> 
> Is anyone working on that?

Not yet.
Michael S. Tsirkin Feb. 1, 2018, 5:12 p.m. UTC | #16
On Thu, Feb 01, 2018 at 03:01:36PM -0200, Eduardo Habkost wrote:
> On Thu, Feb 01, 2018 at 06:59:07PM +0200, Michael S. Tsirkin wrote:
> > On Thu, Feb 01, 2018 at 02:57:39PM -0200, Eduardo Habkost wrote:
> > > On Thu, Feb 01, 2018 at 06:48:54PM +0200, Michael S. Tsirkin wrote:
> > > > On Thu, Feb 01, 2018 at 02:31:32PM -0200, Eduardo Habkost wrote:
> > > > > On Thu, Feb 01, 2018 at 04:24:30PM +0200, Michael S. Tsirkin wrote:
> > > > > [...]
> > > > > > The full fix would be to allow QEMU to map a list of
> > > > > > pages to a guest supplied IOVA.
> > > > > 
> > > > > Thanks, that's what I expected.
> > > > > 
> > > > > While this is not possible, the only requests I have for this
> > > > > patch is that we clearly document:
> > > > > * What's the only purpose of share=on on a host-memory-backend
> > > > >   object (due to pvrdma limitations).
> > > > > * The potential undesirable side-effects of setting share=on.
> > > > > * On the commit message and other comments, clearly distinguish
> > > > >   HVAs in the QEMU address-space from IOVAs, to avoid confusion.
> > > > 
> > > > Looking forward, when we do support it, how will management find out
> > > > it no longer needs to pass the share parameter?
> > > > 
> > > > Further, if the side effects of the share parameter go away,
> > > > how will it know these no longer hold?
> > > 
> > > A query-host-capabilities or similar QMP command seems necessary
> > > for that.
> > 
> > Is anyone working on that?
> 
> Not yet.
> 
> -- 
> Eduardo

Do these patches need to wait until we do have that command?

I'm thinking it's better to have "share=on required with rdma"
and "hugetlbfs not supported with rdma"
than the reverse; this way new hosts do not need to carry
this stuff around forever.

Also, how does management know which devices are affected?
Eduardo Habkost Feb. 1, 2018, 5:36 p.m. UTC | #17
On Thu, Feb 01, 2018 at 07:12:45PM +0200, Michael S. Tsirkin wrote:
> On Thu, Feb 01, 2018 at 03:01:36PM -0200, Eduardo Habkost wrote:
> > On Thu, Feb 01, 2018 at 06:59:07PM +0200, Michael S. Tsirkin wrote:
> > > On Thu, Feb 01, 2018 at 02:57:39PM -0200, Eduardo Habkost wrote:
> > > > On Thu, Feb 01, 2018 at 06:48:54PM +0200, Michael S. Tsirkin wrote:
> > > > > On Thu, Feb 01, 2018 at 02:31:32PM -0200, Eduardo Habkost wrote:
> > > > > > On Thu, Feb 01, 2018 at 04:24:30PM +0200, Michael S. Tsirkin wrote:
> > > > > > [...]
> > > > > > > The full fix would be to allow QEMU to map a list of
> > > > > > > pages to a guest supplied IOVA.
> > > > > > 
> > > > > > Thanks, that's what I expected.
> > > > > > 
> > > > > > While this is not possible, the only requests I have for this
> > > > > > patch is that we clearly document:
> > > > > > * What's the only purpose of share=on on a host-memory-backend
> > > > > >   object (due to pvrdma limitations).
> > > > > > * The potential undesirable side-effects of setting share=on.
> > > > > > * On the commit message and other comments, clearly distinguish
> > > > > >   HVAs in the QEMU address-space from IOVAs, to avoid confusion.
> > > > > 
> > > > > Looking forward, when we do support it, how will management find out
> > > > > it no longer needs to pass the share parameter?
> > > > > 
> > > > > Further, if the side effects of the share parameter go away,
> > > > > how will it know these no longer hold?
> > > > 
> > > > A query-host-capabilities or similar QMP command seems necessary
> > > > for that.
> > > 
> > > Is anyone working on that?
> > 
> > Not yet.
> > 
> > -- 
> > Eduardo
> 
> Do these patches need to wait until we do have that command?

I don't think so.  The command will be needed only when
support for pvrdma without share=on gets implemented.

Right now, all we need is clear documentation.

> 
> I'm thinking it's better to have "share=on required with rdma"
> and "hugetlbfs not supported with rdma"
> than the reverse, this way new hosts do not need to carry
> thus stuff around forever.

What do you mean by "the reverse"?

IIUC, the requirements/limitations are:

* share=on required for pvrdma.  Already documented and enforced
  by pvrdma code in this series.
* hugetlbfs not supported with rdma. Is this detected/reported by
  QEMU?  Is it documented?
* side-effects of share=on.  This is not detected nor documented,
  and probably already applies to other memory backends.
  * Nice to have: document when share=on is useful (answer:
    because of pvrdma), when adding share=on support to
    host-memory-backend.

> 
> Also, how does management know which devices are affected?

Right now?  By reading documentation.
Marcel Apfelbaum Feb. 1, 2018, 5:58 p.m. UTC | #18
On 01/02/2018 19:36, Eduardo Habkost wrote:
> On Thu, Feb 01, 2018 at 07:12:45PM +0200, Michael S. Tsirkin wrote:
>> On Thu, Feb 01, 2018 at 03:01:36PM -0200, Eduardo Habkost wrote:
>>> On Thu, Feb 01, 2018 at 06:59:07PM +0200, Michael S. Tsirkin wrote:
>>>> On Thu, Feb 01, 2018 at 02:57:39PM -0200, Eduardo Habkost wrote:
>>>>> On Thu, Feb 01, 2018 at 06:48:54PM +0200, Michael S. Tsirkin wrote:
>>>>>> On Thu, Feb 01, 2018 at 02:31:32PM -0200, Eduardo Habkost wrote:
>>>>>>> On Thu, Feb 01, 2018 at 04:24:30PM +0200, Michael S. Tsirkin wrote:
>>>>>>> [...]
>>>>>>>> The full fix would be to allow QEMU to map a list of
>>>>>>>> pages to a guest supplied IOVA.
>>>>>>>
>>>>>>> Thanks, that's what I expected.
>>>>>>>
>>>>>>> While this is not possible, the only requests I have for this
>>>>>>> patch is that we clearly document:
>>>>>>> * What's the only purpose of share=on on a host-memory-backend
>>>>>>>   object (due to pvrdma limitations).
>>>>>>> * The potential undesirable side-effects of setting share=on.
>>>>>>> * On the commit message and other comments, clearly distinguish
>>>>>>>   HVAs in the QEMU address-space from IOVAs, to avoid confusion.
>>>>>>
>>>>>> Looking forward, when we do support it, how will management find out
>>>>>> it no longer needs to pass the share parameter?
>>>>>>
>>>>>> Further, if the side effects of the share parameter go away,
>>>>>> how will it know these no longer hold?
>>>>>
>>>>> A query-host-capabilities or similar QMP command seems necessary
>>>>> for that.
>>>>
>>>> Is anyone working on that?
>>>
>>> Not yet.
>>>
>>> -- 
>>> Eduardo
>>
>> Do these patches need to wait until we do have that command?
> 
> I don't think so.  The command will be needed only when
> support for pvrdma without share=on gets implemented.
> 
> Right now, all we need is clear documentation.
> 
>>
>> I'm thinking it's better to have "share=on required with rdma"
>> and "hugetlbfs not supported with rdma"
>> than the reverse, this way new hosts do not need to carry
>> thus stuff around forever.
> 
> What do you mean by "the reverse"?
> 
> IIUC, the requirements/limitations are:
> 
> * share=on required for pvrdma.  Already documented and enforced
>   by pvrdma code in this series.

Right.

> * hugetlbfs not supported with rdma. Is this detected/reported by
>   QEMU?  Is it documented?

Yes, enforced by the pvrdma device initialization and documented in the
corresponding pvrdma doc.

> * side-effects of share=on.  This is not detected nor documented,
>   and probably already applies to other memory backends.
>   * Nice to have: document when share=on is useful (answer:
>     because of pvrdma), when adding share=on support to
>     host-memory-backend.
> 

The documentation is part of the pvrdma doc.
What are the side-effects of share=on? I missed that.
(share=on is new for the RAM-backed memory backend; the
file-backed one already had the share parameter)

One can just grep for "share=on" in the docs directory
and easily see the only current usage. But maybe there will
be more; maybe we don't want to limit it for now.
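
For reference, the patch exposes this as a boolean "share" property on the
host-memory-backend base class; below is a sketch of the usual QOM pattern
(the actual hunks in backends/hostmem.c are authoritative and may differ in
detail):

    #include "qemu/osdep.h"
    #include "sysemu/hostmem.h"
    #include "qapi/error.h"

    static bool host_memory_backend_get_share(Object *o, Error **errp)
    {
        HostMemoryBackend *backend = MEMORY_BACKEND(o);

        return backend->share;
    }

    static void host_memory_backend_set_share(Object *o, bool value,
                                              Error **errp)
    {
        HostMemoryBackend *backend = MEMORY_BACKEND(o);

        /* The mapping is created when the backend is realized, so the
         * property can only be changed before that point. */
        if (host_memory_backend_mr_inited(backend)) {
            error_setg(errp, "cannot change property value");
            return;
        }
        backend->share = value;
    }

    /* In the class init: */
    object_class_property_add_bool(oc, "share",
                                   host_memory_backend_get_share,
                                   host_memory_backend_set_share,
                                   &error_abort);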

I am planning to re-spin today/tomorrow before sending
a pull request; can you please point me to what documentation
to add and what side-effects I should document?

Thanks,
Marcel

>>
>> Also, how does management know which devices are affected?
> 
> Right now?  By reading documentation.
>
Michael S. Tsirkin Feb. 1, 2018, 6:01 p.m. UTC | #19
On Thu, Feb 01, 2018 at 03:36:55PM -0200, Eduardo Habkost wrote:
> On Thu, Feb 01, 2018 at 07:12:45PM +0200, Michael S. Tsirkin wrote:
> > On Thu, Feb 01, 2018 at 03:01:36PM -0200, Eduardo Habkost wrote:
> > > On Thu, Feb 01, 2018 at 06:59:07PM +0200, Michael S. Tsirkin wrote:
> > > > On Thu, Feb 01, 2018 at 02:57:39PM -0200, Eduardo Habkost wrote:
> > > > > On Thu, Feb 01, 2018 at 06:48:54PM +0200, Michael S. Tsirkin wrote:
> > > > > > On Thu, Feb 01, 2018 at 02:31:32PM -0200, Eduardo Habkost wrote:
> > > > > > > On Thu, Feb 01, 2018 at 04:24:30PM +0200, Michael S. Tsirkin wrote:
> > > > > > > [...]
> > > > > > > > The full fix would be to allow QEMU to map a list of
> > > > > > > > pages to a guest supplied IOVA.
> > > > > > > 
> > > > > > > Thanks, that's what I expected.
> > > > > > > 
> > > > > > > While this is not possible, the only requests I have for this
> > > > > > > patch is that we clearly document:
> > > > > > > * What's the only purpose of share=on on a host-memory-backend
> > > > > > >   object (due to pvrdma limitations).
> > > > > > > * The potential undesirable side-effects of setting share=on.
> > > > > > > * On the commit message and other comments, clearly distinguish
> > > > > > >   HVAs in the QEMU address-space from IOVAs, to avoid confusion.
> > > > > > 
> > > > > > Looking forward, when we do support it, how will management find out
> > > > > > it no longer needs to pass the share parameter?
> > > > > > 
> > > > > > Further, if the side effects of the share parameter go away,
> > > > > > how will it know these no longer hold?
> > > > > 
> > > > > A query-host-capabilities or similar QMP command seems necessary
> > > > > for that.
> > > > 
> > > > Is anyone working on that?
> > > 
> > > Not yet.
> > > 
> > > -- 
> > > Eduardo
> > 
> > Do these patches need to wait until we do have that command?
> 
> I don't think so.  The command will be needed only when
> support for pvrdma without share=on gets implemented.
> 
> Right now, all we need is clear documentation.
> 
> > 
> > I'm thinking it's better to have "share=on required with rdma"
> > and "hugetlbfs not supported with rdma"
> > than the reverse, this way new hosts do not need to carry
> > thus stuff around forever.
> 
> What do you mean by "the reverse"?
> 
> IIUC, the requirements/limitations are:
> 
> * share=on required for pvrdma.  Already documented and enforced
>   by pvrdma code in this series.
> * hugetlbfs not supported with rdma. Is this detected/reported by
>   QEMU?  Is it documented?

Probably should be.

> * side-effects of share=on.  This is not detected nor documented,
>   and probably already applies to other memory backends.
>   * Nice to have: document when share=on is useful (answer:
>     because of pvrdma), when adding share=on support to
>     host-memory-backend.
> 
> > 
> > Also, how does management know which devices are affected?
> 
> Right now?  By reading documentation.


> -- 
> Eduardo
Marcel Apfelbaum Feb. 1, 2018, 6:03 p.m. UTC | #20
On 01/02/2018 15:53, Eduardo Habkost wrote:
> On Thu, Feb 01, 2018 at 02:29:25PM +0200, Marcel Apfelbaum wrote:
>> On 01/02/2018 14:10, Eduardo Habkost wrote:
>>> On Thu, Feb 01, 2018 at 07:36:50AM +0200, Marcel Apfelbaum wrote:
>>>> On 01/02/2018 4:22, Michael S. Tsirkin wrote:
>>>>> On Wed, Jan 31, 2018 at 09:34:22PM -0200, Eduardo Habkost wrote:
>>> [...]
>>>>>> BTW, what's the root cause for requiring HVAs in the buffer?
>>>>>
>>>>> It's a side effect of the kernel/userspace API which always wants
>>>>> a single HVA/len pair to map memory for the application.
>>>>>
>>>>>
>>>>
>>>> Hi Eduardo and Michael,
>>>>
>>>>>>  Can
>>>>>> this be fixed?
>>>>>
>>>>> I think yes.  It'd need to be a kernel patch for the RDMA subsystem
>>>>> mapping an s/g list with actual memory. The HVA/len pair would then just
>>>>> be used to refer to the region, without creating the two mappings.
>>>>>
>>>>> Something like splitting the register mr into
>>>>>
>>>>> mr = create mr (va/len) - allocate a handle and record the va/len
>>>>>
>>>>> addmemory(mr, offset, hva, len) - pin memory
>>>>>
>>>>> register mr - pass it to HW
>>>>>
>>>>> As a nice side effect we won't burn so much virtual address space.
>>>>>
>>>>
>>>> We would still need a contiguous virtual address space range (for post-send)
>>>> which we don't have since guest contiguous virtual address space
>>>> will always end up as non-contiguous host virtual address space.
>>>>
>>>> I am not sure the RDMA HW can handle a large VA with holes.
>>>
>>> I'm confused.  Why would the hardware see and care about virtual
>>> addresses? 
>>
>> The post-send operations bypasses the kernel, and the process
>> puts in the work request GVA addresses.
>>
>>> How exactly does the hardware translates VAs to
>>> PAs? 
>>
>> The HW maintains a page-directory like structure different form MMU
>> VA -> phys pages
>>
>>> What if the process page tables change?
>>>
>>
>> Since the page tables the HW uses are their own, we just need the phys
>> page to be pinned.
> 
> So there's no hardware-imposed requirement that the hardware VAs
> (mapped by the HW page directory) match the VAs in QEMU
> address-space, right? 

Actually there is. Today it works exactly as you described.

> If the RDMA API is updated to remove this
> requirement, couldn't you just use the untranslated guest VAs
> directly?
> 

Once the new RDMA API is decided we will still need to pass
some kind of VA to the HW, but if we succeed in decoupling the
phys page registration from the VA, yes, it will be possible.


Thanks,
Marcel
Marcel Apfelbaum Feb. 1, 2018, 6:07 p.m. UTC | #21
On 01/02/2018 16:24, Michael S. Tsirkin wrote:
> On Thu, Feb 01, 2018 at 02:29:25PM +0200, Marcel Apfelbaum wrote:
>> On 01/02/2018 14:10, Eduardo Habkost wrote:
>>> On Thu, Feb 01, 2018 at 07:36:50AM +0200, Marcel Apfelbaum wrote:
>>>> On 01/02/2018 4:22, Michael S. Tsirkin wrote:
>>>>> On Wed, Jan 31, 2018 at 09:34:22PM -0200, Eduardo Habkost wrote:
>>> [...]
>>>>>> BTW, what's the root cause for requiring HVAs in the buffer?
>>>>>
>>>>> It's a side effect of the kernel/userspace API which always wants
>>>>> a single HVA/len pair to map memory for the application.
>>>>>
>>>>>
>>>>
>>>> Hi Eduardo and Michael,
>>>>
>>>>>>  Can
>>>>>> this be fixed?
>>>>>
>>>>> I think yes.  It'd need to be a kernel patch for the RDMA subsystem
>>>>> mapping an s/g list with actual memory. The HVA/len pair would then just
>>>>> be used to refer to the region, without creating the two mappings.
>>>>>
>>>>> Something like splitting the register mr into
>>>>>
>>>>> mr = create mr (va/len) - allocate a handle and record the va/len
>>>>>
>>>>> addmemory(mr, offset, hva, len) - pin memory
>>>>>
>>>>> register mr - pass it to HW
>>>>>
>>>>> As a nice side effect we won't burn so much virtual address space.
>>>>>
>>>>
>>>> We would still need a contiguous virtual address space range (for post-send)
>>>> which we don't have since guest contiguous virtual address space
>>>> will always end up as non-contiguous host virtual address space.
>>>>
>>>> I am not sure the RDMA HW can handle a large VA with holes.
>>>
>>> I'm confused.  Why would the hardware see and care about virtual
>>> addresses? 
>>
>> The post-send operations bypasses the kernel, and the process
>> puts in the work request GVA addresses.
> 
> To be more precise, it's the guest supplied IOVA that is sent to the card.
> 
>>> How exactly does the hardware translates VAs to
>>> PAs? 
>>
>> The HW maintains a page-directory like structure different form MMU
>> VA -> phys pages
>>
>>> What if the process page tables change?
>>>
>>
>> Since the page tables the HW uses are their own, we just need the phys
>> page to be pinned.
>>
>>>>
>>>> An alternative would be 0-based MR, QEMU intercepts the post-send
>>>> operations and can substract the guest VA base address.
>>>> However I didn't see the implementation in kernel for 0 based MRs
>>>> and also the RDMA maintainer said it would work for local keys
>>>> and not for remote keys.
>>>
>>> This is also unexpected: are GVAs visible to the virtual RDMA
>>> hardware? 
>>
>> Yes, explained above
>>
>>> Where does the QEMU pvrdma code translates GVAs to
>>> GPAs?
>>>
>>
>> During reg_mr (memory registration commands)
>> Then it registers the same addresses to the real HW.
>> (as Host virtual addresses)
>>
>> Thanks,
>> Marcel
> 
> 
> The full fix would be to allow QEMU to map a list of
> pages to a guest supplied IOVA.
> 

Agreed, we are trying to influence the RDMA discussion
on the new API in this direction.

Thanks,
Marcel

>>>>
>>>>> This will fix rdma with hugetlbfs as well which is currently broken.
>>>>>
>>>>>
>>>>
>>>> There is already a discussion on the linux-rdma list:
>>>>     https://www.spinics.net/lists/linux-rdma/msg60079.html
>>>> But it will take some (actually a lot of) time, we are currently talking about
>>>> a possible API. And it does not solve the re-mapping...
>>>>
>>>> Thanks,
>>>> Marcel
>>>>
>>>>>> -- 
>>>>>> Eduardo
>>>>
>>>
Marcel Apfelbaum Feb. 1, 2018, 6:11 p.m. UTC | #22
On 01/02/2018 14:57, Michael S. Tsirkin wrote:
> On Thu, Feb 01, 2018 at 07:36:50AM +0200, Marcel Apfelbaum wrote:
>> On 01/02/2018 4:22, Michael S. Tsirkin wrote:
>>> On Wed, Jan 31, 2018 at 09:34:22PM -0200, Eduardo Habkost wrote:
>>>> On Wed, Jan 31, 2018 at 11:10:07PM +0200, Michael S. Tsirkin wrote:
>>>>> On Wed, Jan 31, 2018 at 06:40:59PM -0200, Eduardo Habkost wrote:
>>>>>> On Wed, Jan 17, 2018 at 11:54:18AM +0200, Marcel Apfelbaum wrote:
>>>>>>> Currently only file backed memory backend can
>>>>>>> be created with a "share" flag in order to allow
>>>>>>> sharing guest RAM with other processes in the host.
>>>>>>>
>>>>>>> Add the "share" flag also to RAM Memory Backend
>>>>>>> in order to allow remapping parts of the guest RAM
>>>>>>> to different host virtual addresses. This is needed
>>>>>>> by the RDMA devices in order to remap non-contiguous
>>>>>>> QEMU virtual addresses to a contiguous virtual address range.
>>>>>>>
>>>>>>
>>>>>> Why do we need to make this configurable?  Would anything break
>>>>>> if MAP_SHARED was always used if possible?
>>>>>
>>>>> See Documentation/vm/numa_memory_policy.txt for a list
>>>>> of complications.
>>>>
>>>> Ew.
>>>>
>>>>>
>>>>> Maybe we should more of an effort to detect and report these
>>>>> issues.
>>>>
>>>> Probably.  Having other features breaking silently when using
>>>> pvrdma doesn't sound good.  We must at least document those
>>>> problems in the documentation for memory-backend-ram.
>>>>
>>>> BTW, what's the root cause for requiring HVAs in the buffer?
>>>
>>> It's a side effect of the kernel/userspace API which always wants
>>> a single HVA/len pair to map memory for the application.
>>>
>>>
>>
>> Hi Eduardo and Michael,
>>
>>>>  Can
>>>> this be fixed?
>>>
>>> I think yes.  It'd need to be a kernel patch for the RDMA subsystem
>>> mapping an s/g list with actual memory. The HVA/len pair would then just
>>> be used to refer to the region, without creating the two mappings.
>>>
>>> Something like splitting the register mr into
>>>
>>> mr = create mr (va/len) - allocate a handle and record the va/len
>>>
>>> addmemory(mr, offset, hva, len) - pin memory
>>>
>>> register mr - pass it to HW
>>>
>>> As a nice side effect we won't burn so much virtual address space.
>>>
>>
>> We would still need a contiguous virtual address space range (for post-send)
>> which we don't have since guest contiguous virtual address space
>> will always end up as non-contiguous host virtual address space.
> 
> It just needs to be contiguous in the HCA virtual address space.
> Software never accesses through this pointer.
> In other words - basically expose register physical mr to userspace.
> 
> 
>>
>> I am not sure the RDMA HW can handle a large VA with holes.
>>
>> An alternative would be 0-based MR, QEMU intercepts the post-send
>> operations and can substract the guest VA base address.
>> However I didn't see the implementation in kernel for 0 based MRs
>> and also the RDMA maintainer said it would work for local keys
>> and not for remote keys.
>>
>>> This will fix rdma with hugetlbfs as well which is currently broken.
>>>
>>>
>>
>> There is already a discussion on the linux-rdma list:
>>     https://www.spinics.net/lists/linux-rdma/msg60079.html
>> But it will take some (actually a lot of) time, we are currently talking about
>> a possible API.
> 
> You probably need to pass the s/g piece by piece since it might exceed
> any reasonable array size.

Right. They say the new API is ioctl based, so this is not a limitation.
We also proposed a bitmap representation of a large range,
but what we really need is what you mentioned:
  to pass the guest VA directly to reg_mr.

Thanks,
Marcel

> 
>> And it does not solve the re-mapping...
>>
>> Thanks,
>> Marcel
> 
> Haven't read through that discussion. But at least what I posted solves
> it since you do not need it contiguous in HVA any longer.
> 
>>>> -- 
>>>> Eduardo
Eduardo Habkost Feb. 1, 2018, 6:18 p.m. UTC | #23
On Thu, Feb 01, 2018 at 07:58:10PM +0200, Marcel Apfelbaum wrote:
> On 01/02/2018 19:36, Eduardo Habkost wrote:
> > On Thu, Feb 01, 2018 at 07:12:45PM +0200, Michael S. Tsirkin wrote:
> >> On Thu, Feb 01, 2018 at 03:01:36PM -0200, Eduardo Habkost wrote:
> >>> On Thu, Feb 01, 2018 at 06:59:07PM +0200, Michael S. Tsirkin wrote:
> >>>> On Thu, Feb 01, 2018 at 02:57:39PM -0200, Eduardo Habkost wrote:
> >>>>> On Thu, Feb 01, 2018 at 06:48:54PM +0200, Michael S. Tsirkin wrote:
> >>>>>> On Thu, Feb 01, 2018 at 02:31:32PM -0200, Eduardo Habkost wrote:
> >>>>>>> On Thu, Feb 01, 2018 at 04:24:30PM +0200, Michael S. Tsirkin wrote:
> >>>>>>> [...]
> >>>>>>>> The full fix would be to allow QEMU to map a list of
> >>>>>>>> pages to a guest supplied IOVA.
> >>>>>>>
> >>>>>>> Thanks, that's what I expected.
> >>>>>>>
> >>>>>>> While this is not possible, the only requests I have for this
> >>>>>>> patch is that we clearly document:
> >>>>>>> * What's the only purpose of share=on on a host-memory-backend
> >>>>>>>   object (due to pvrdma limitations).
> >>>>>>> * The potential undesirable side-effects of setting share=on.
> >>>>>>> * On the commit message and other comments, clearly distinguish
> >>>>>>>   HVAs in the QEMU address-space from IOVAs, to avoid confusion.
> >>>>>>
> >>>>>> Looking forward, when we do support it, how will management find out
> >>>>>> it no longer needs to pass the share parameter?
> >>>>>>
> >>>>>> Further, if the side effects of the share parameter go away,
> >>>>>> how will it know these no longer hold?
> >>>>>
> >>>>> A query-host-capabilities or similar QMP command seems necessary
> >>>>> for that.
> >>>>
> >>>> Is anyone working on that?
> >>>
> >>> Not yet.
> >>>
> >>> -- 
> >>> Eduardo
> >>
> >> Do these patches need to wait until we do have that command?
> > 
> > I don't think so.  The command will be needed only when
> > support for pvrdma without share=on gets implemented.
> > 
> > Right now, all we need is clear documentation.
> > 
> >>
> >> I'm thinking it's better to have "share=on required with rdma"
> >> and "hugetlbfs not supported with rdma"
> >> than the reverse, this way new hosts do not need to carry
> >> thus stuff around forever.
> > 
> > What do you mean by "the reverse"?
> > 
> > IIUC, the requirements/limitations are:
> > 
> > * share=on required for pvrdma.  Already documented and enforced
> >   by pvrdma code in this series.
> 
> Right.
> 
> > * hugetlbfs not supported with rdma. Is this detected/reported by
> >   QEMU?  Is it documented?
> 
> Yes, enforced by the pvrdma device initialization and documented in the
> corresponding pvrdma doc.
> 
> > * side-effects of share=on.  This is not detected nor documented,
> >   and probably already applies to other memory backends.
> >   * Nice to have: document when share=on is useful (answer:
> >     because of pvrdma), when adding share=on support to
> >     host-memory-backend.
> > 
> 
> The documentation is part of the pvrdma doc.
> What are the side-effects of share=on? I missed that.
> (share=on is new for the memory backed RAM, the file
> backed RAM already had the share parameter)
> 
> One can just grep for "share=on" in the docs directory
> and can easily see the only current usage. But maybe will
> be more, maybe we don't want to limit it for now.
> 
> I am planning to re-spin today/tomorrow before sending
> a pull-request, can you please point me on what documentation
> to add and what side-effects I should document?
> 

The full list of side-effects is not clear to me.  For some of
them, see Documentation/vm/numa_memory_policy.txt on the kernel
tree.

The documentation for memory backend options is at
qemu-options.hx.  Maybe something like this, extending the
existing paragraph:

  The @option{share} boolean option determines whether the memory
  region is marked as private to QEMU, or shared (mapped using
  the MAP_SHARED flag).  The latter allows a co-operating
  external process to access the QEMU memory region.

  @option{share} is also required for pvrdma devices due to
  limitations in the RDMA API provided by Linux.

  Setting share=on might affect the ability to configure NUMA
  bindings for the memory backend under some circumstances, see
  Documentation/vm/numa_memory_policy.txt on the Linux kernel
  source tree for additional details.

I hate to point users to low-level documentation on the kernel
tree, but it's better than nothing.

We also need to list "share" as a valid option at the
"@item -object memory-backend-ram,[...]" line.
Eduardo Habkost Feb. 1, 2018, 6:21 p.m. UTC | #24
On Thu, Feb 01, 2018 at 08:03:53PM +0200, Marcel Apfelbaum wrote:
> On 01/02/2018 15:53, Eduardo Habkost wrote:
> > On Thu, Feb 01, 2018 at 02:29:25PM +0200, Marcel Apfelbaum wrote:
> >> On 01/02/2018 14:10, Eduardo Habkost wrote:
> >>> On Thu, Feb 01, 2018 at 07:36:50AM +0200, Marcel Apfelbaum wrote:
> >>>> On 01/02/2018 4:22, Michael S. Tsirkin wrote:
> >>>>> On Wed, Jan 31, 2018 at 09:34:22PM -0200, Eduardo Habkost wrote:
> >>> [...]
> >>>>>> BTW, what's the root cause for requiring HVAs in the buffer?
> >>>>>
> >>>>> It's a side effect of the kernel/userspace API which always wants
> >>>>> a single HVA/len pair to map memory for the application.
> >>>>>
> >>>>>
> >>>>
> >>>> Hi Eduardo and Michael,
> >>>>
> >>>>>>  Can
> >>>>>> this be fixed?
> >>>>>
> >>>>> I think yes.  It'd need to be a kernel patch for the RDMA subsystem
> >>>>> mapping an s/g list with actual memory. The HVA/len pair would then just
> >>>>> be used to refer to the region, without creating the two mappings.
> >>>>>
> >>>>> Something like splitting the register mr into
> >>>>>
> >>>>> mr = create mr (va/len) - allocate a handle and record the va/len
> >>>>>
> >>>>> addmemory(mr, offset, hva, len) - pin memory
> >>>>>
> >>>>> register mr - pass it to HW
> >>>>>
> >>>>> As a nice side effect we won't burn so much virtual address space.
> >>>>>
> >>>>
> >>>> We would still need a contiguous virtual address space range (for post-send)
> >>>> which we don't have since guest contiguous virtual address space
> >>>> will always end up as non-contiguous host virtual address space.
> >>>>
> >>>> I am not sure the RDMA HW can handle a large VA with holes.
> >>>
> >>> I'm confused.  Why would the hardware see and care about virtual
> >>> addresses? 
> >>
> >> The post-send operations bypasses the kernel, and the process
> >> puts in the work request GVA addresses.
> >>
> >>> How exactly does the hardware translates VAs to
> >>> PAs? 
> >>
> >> The HW maintains a page-directory like structure different form MMU
> >> VA -> phys pages
> >>
> >>> What if the process page tables change?
> >>>
> >>
> >> Since the page tables the HW uses are their own, we just need the phys
> >> page to be pinned.
> > 
> > So there's no hardware-imposed requirement that the hardware VAs
> > (mapped by the HW page directory) match the VAs in QEMU
> > address-space, right? 
> 
> Actually there is. Today it works exactly as you described.

Are you sure there's such a hardware-imposed requirement?

Why would the hardware require VAs to match the ones in the
userspace address-space, if it doesn't use the CPU MMU at all?
Marcel Apfelbaum Feb. 1, 2018, 6:31 p.m. UTC | #25
On 01/02/2018 20:21, Eduardo Habkost wrote:
> On Thu, Feb 01, 2018 at 08:03:53PM +0200, Marcel Apfelbaum wrote:
>> On 01/02/2018 15:53, Eduardo Habkost wrote:
>>> On Thu, Feb 01, 2018 at 02:29:25PM +0200, Marcel Apfelbaum wrote:
>>>> On 01/02/2018 14:10, Eduardo Habkost wrote:
>>>>> On Thu, Feb 01, 2018 at 07:36:50AM +0200, Marcel Apfelbaum wrote:
>>>>>> On 01/02/2018 4:22, Michael S. Tsirkin wrote:
>>>>>>> On Wed, Jan 31, 2018 at 09:34:22PM -0200, Eduardo Habkost wrote:
>>>>> [...]
>>>>>>>> BTW, what's the root cause for requiring HVAs in the buffer?
>>>>>>>
>>>>>>> It's a side effect of the kernel/userspace API which always wants
>>>>>>> a single HVA/len pair to map memory for the application.
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> Hi Eduardo and Michael,
>>>>>>
>>>>>>>>  Can
>>>>>>>> this be fixed?
>>>>>>>
>>>>>>> I think yes.  It'd need to be a kernel patch for the RDMA subsystem
>>>>>>> mapping an s/g list with actual memory. The HVA/len pair would then just
>>>>>>> be used to refer to the region, without creating the two mappings.
>>>>>>>
>>>>>>> Something like splitting the register mr into
>>>>>>>
>>>>>>> mr = create mr (va/len) - allocate a handle and record the va/len
>>>>>>>
>>>>>>> addmemory(mr, offset, hva, len) - pin memory
>>>>>>>
>>>>>>> register mr - pass it to HW
>>>>>>>
>>>>>>> As a nice side effect we won't burn so much virtual address space.
>>>>>>>
>>>>>>
>>>>>> We would still need a contiguous virtual address space range (for post-send)
>>>>>> which we don't have since guest contiguous virtual address space
>>>>>> will always end up as non-contiguous host virtual address space.
>>>>>>
>>>>>> I am not sure the RDMA HW can handle a large VA with holes.
>>>>>
>>>>> I'm confused.  Why would the hardware see and care about virtual
>>>>> addresses? 
>>>>
>>>> The post-send operations bypasses the kernel, and the process
>>>> puts in the work request GVA addresses.
>>>>
>>>>> How exactly does the hardware translates VAs to
>>>>> PAs? 
>>>>
>>>> The HW maintains a page-directory like structure different form MMU
>>>> VA -> phys pages
>>>>
>>>>> What if the process page tables change?
>>>>>
>>>>
>>>> Since the page tables the HW uses are their own, we just need the phys
>>>> page to be pinned.
>>>
>>> So there's no hardware-imposed requirement that the hardware VAs
>>> (mapped by the HW page directory) match the VAs in QEMU
>>> address-space, right? 
>>
>> Actually there is. Today it works exactly as you described.
> 
> Are you sure there's such hardware-imposed requirement?
> 

Yes.

> Why would the hardware require VAs to match the ones in the
> userspace address-space, if it doesn't use the CPU MMU at all?
> 

It works like this:

1. We register a buffer from the process address space,
   giving its base address and length.
   This call goes to the kernel, which in turn pins the physical
   pages and registers them with the device *together* with the
   base address (a virtual address!).
2. The device builds its own page tables so it can translate
   those virtual addresses to the actual physical pages.
3. The process issues post-send requests directly to the hardware,
   bypassing the kernel, and puts process virtual addresses in the
   work requests.
4. The device uses its own page tables to translate the virtual
   addresses to physical pages and sends the data.

Theoretically it is possible to use any contiguous IOVA instead of
the process's own virtual addresses, but that is not how it works today.
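
To make the flow above concrete, here is a rough sketch (illustrative
only, not QEMU or pvrdma code) of steps 1 and 3 using the plain
libibverbs API; pd, qp and buf are assumed to already exist, and error
handling and ibv_dereg_mr() are omitted:

#include <infiniband/verbs.h>
#include <stdint.h>

static int send_from_process_va(struct ibv_pd *pd, struct ibv_qp *qp,
                                void *buf, uint32_t len)
{
    /* Step 1: the kernel pins the physical pages and registers them
     * with the device together with the base (virtual!) address. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);
    if (!mr)
        return -1;

    /* Step 3: the post-send bypasses the kernel and carries the same
     * process virtual address; the device resolves it with its own
     * page tables, not with the CPU MMU. */
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,   /* process VA, not a PA or IOVA */
        .length = len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr = {
        .sg_list = &sge,
        .num_sge = 1,
        .opcode  = IBV_WR_SEND,
    };
    struct ibv_send_wr *bad_wr = NULL;

    return ibv_post_send(qp, &wr, &bad_wr);
}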

Makes sense?

Thanks,
Marcel
Marcel Apfelbaum Feb. 1, 2018, 6:34 p.m. UTC | #26
On 01/02/2018 20:18, Eduardo Habkost wrote:
> On Thu, Feb 01, 2018 at 07:58:10PM +0200, Marcel Apfelbaum wrote:
>> On 01/02/2018 19:36, Eduardo Habkost wrote:
>>> On Thu, Feb 01, 2018 at 07:12:45PM +0200, Michael S. Tsirkin wrote:
>>>> On Thu, Feb 01, 2018 at 03:01:36PM -0200, Eduardo Habkost wrote:
>>>>> On Thu, Feb 01, 2018 at 06:59:07PM +0200, Michael S. Tsirkin wrote:
>>>>>> On Thu, Feb 01, 2018 at 02:57:39PM -0200, Eduardo Habkost wrote:
>>>>>>> On Thu, Feb 01, 2018 at 06:48:54PM +0200, Michael S. Tsirkin wrote:
>>>>>>>> On Thu, Feb 01, 2018 at 02:31:32PM -0200, Eduardo Habkost wrote:
>>>>>>>>> On Thu, Feb 01, 2018 at 04:24:30PM +0200, Michael S. Tsirkin wrote:
>>>>>>>>> [...]
>>>>>>>>>> The full fix would be to allow QEMU to map a list of
>>>>>>>>>> pages to a guest supplied IOVA.
>>>>>>>>>
>>>>>>>>> Thanks, that's what I expected.
>>>>>>>>>
>>>>>>>>> While this is not possible, the only requests I have for this
>>>>>>>>> patch is that we clearly document:
>>>>>>>>> * What's the only purpose of share=on on a host-memory-backend
>>>>>>>>>   object (due to pvrdma limitations).
>>>>>>>>> * The potential undesirable side-effects of setting share=on.
>>>>>>>>> * On the commit message and other comments, clearly distinguish
>>>>>>>>>   HVAs in the QEMU address-space from IOVAs, to avoid confusion.
>>>>>>>>
>>>>>>>> Looking forward, when we do support it, how will management find out
>>>>>>>> it no longer needs to pass the share parameter?
>>>>>>>>
>>>>>>>> Further, if the side effects of the share parameter go away,
>>>>>>>> how will it know these no longer hold?
>>>>>>>
>>>>>>> A query-host-capabilities or similar QMP command seems necessary
>>>>>>> for that.
>>>>>>
>>>>>> Is anyone working on that?
>>>>>
>>>>> Not yet.
>>>>>
>>>>> -- 
>>>>> Eduardo
>>>>
>>>> Do these patches need to wait until we do have that command?
>>>
>>> I don't think so.  The command will be needed only when
>>> support for pvrdma without share=on gets implemented.
>>>
>>> Right now, all we need is clear documentation.
>>>
>>>>
>>>> I'm thinking it's better to have "share=on required with rdma"
>>>> and "hugetlbfs not supported with rdma"
>>>> than the reverse, this way new hosts do not need to carry
>>>> thus stuff around forever.
>>>
>>> What do you mean by "the reverse"?
>>>
>>> IIUC, the requirements/limitations are:
>>>
>>> * share=on required for pvrdma.  Already documented and enforced
>>>   by pvrdma code in this series.
>>
>> Right.
>>
>>> * hugetlbfs not supported with rdma. Is this detected/reported by
>>>   QEMU?  Is it documented?
>>
>> Yes, enforced by the pvrdma device initialization and documented in the
>> corresponding pvrdma doc.
>>
>>> * side-effects of share=on.  This is not detected nor documented,
>>>   and probably already applies to other memory backends.
>>>   * Nice to have: document when share=on is useful (answer:
>>>     because of pvrdma), when adding share=on support to
>>>     host-memory-backend.
>>>
>>
>> The documentation is part of the pvrdma doc.
>> What are the side-effects of share=on? I missed that.
>> (share=on is new for the memory backed RAM, the file
>> backed RAM already had the share parameter)
>>
>> One can just grep for "share=on" in the docs directory
>> and can easily see the only current usage. But maybe will
>> be more, maybe we don't want to limit it for now.
>>
>> I am planning to re-spin today/tomorrow before sending
>> a pull-request, can you please point me on what documentation
>> to add and what side-effects I should document?
>>
> 
> The full list of side-effects is not clear to me.  For some of
> them, see Documentation/vm/numa_memory_policy.txt on the kernel
> tree.
> 
> The documentation for memory backend options is at
> qemu-options.hx.  Maybe something like this, extending the
> existing paragraph:
> 
>   The @option{share} boolean option determines whether the memory
>   region is marked as private to QEMU, or shared (mapped using
>   the MAP_SHARED flag).  The latter allows a co-operating
>   external process to access the QEMU memory region.
> 
>   @option{share} is also required for pvrdma devices due to
>   limitations in the RDMA API provided by Linux.
> 
>   Setting share=on might affect the ability to configure NUMA
>   bindings for the memory backend under some circumstances, see
>   Documentation/vm/numa_memory_policy.txt on the Linux kernel
>   source tree for additional details.
> 
> I hate to point users to low-level documentation on the kernel
> tree, but it's better than nothing.
> 
> We also need to list "share" as a valid option at the
> "@item -object memory-backend-ram,[...]" line.
> 

Thanks for the help, Eduardo. I'll be sure to update the docs as
advised.
Marcel
Eduardo Habkost Feb. 1, 2018, 6:51 p.m. UTC | #27
On Thu, Feb 01, 2018 at 08:31:09PM +0200, Marcel Apfelbaum wrote:
> On 01/02/2018 20:21, Eduardo Habkost wrote:
> > On Thu, Feb 01, 2018 at 08:03:53PM +0200, Marcel Apfelbaum wrote:
> >> On 01/02/2018 15:53, Eduardo Habkost wrote:
> >>> On Thu, Feb 01, 2018 at 02:29:25PM +0200, Marcel Apfelbaum wrote:
> >>>> On 01/02/2018 14:10, Eduardo Habkost wrote:
> >>>>> On Thu, Feb 01, 2018 at 07:36:50AM +0200, Marcel Apfelbaum wrote:
> >>>>>> On 01/02/2018 4:22, Michael S. Tsirkin wrote:
> >>>>>>> On Wed, Jan 31, 2018 at 09:34:22PM -0200, Eduardo Habkost wrote:
> >>>>> [...]
> >>>>>>>> BTW, what's the root cause for requiring HVAs in the buffer?
> >>>>>>>
> >>>>>>> It's a side effect of the kernel/userspace API which always wants
> >>>>>>> a single HVA/len pair to map memory for the application.
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>> Hi Eduardo and Michael,
> >>>>>>
> >>>>>>>>  Can
> >>>>>>>> this be fixed?
> >>>>>>>
> >>>>>>> I think yes.  It'd need to be a kernel patch for the RDMA subsystem
> >>>>>>> mapping an s/g list with actual memory. The HVA/len pair would then just
> >>>>>>> be used to refer to the region, without creating the two mappings.
> >>>>>>>
> >>>>>>> Something like splitting the register mr into
> >>>>>>>
> >>>>>>> mr = create mr (va/len) - allocate a handle and record the va/len
> >>>>>>>
> >>>>>>> addmemory(mr, offset, hva, len) - pin memory
> >>>>>>>
> >>>>>>> register mr - pass it to HW
> >>>>>>>
> >>>>>>> As a nice side effect we won't burn so much virtual address space.
> >>>>>>>
> >>>>>>
> >>>>>> We would still need a contiguous virtual address space range (for post-send)
> >>>>>> which we don't have since guest contiguous virtual address space
> >>>>>> will always end up as non-contiguous host virtual address space.
> >>>>>>
> >>>>>> I am not sure the RDMA HW can handle a large VA with holes.
> >>>>>
> >>>>> I'm confused.  Why would the hardware see and care about virtual
> >>>>> addresses? 
> >>>>
> >>>> The post-send operations bypasses the kernel, and the process
> >>>> puts in the work request GVA addresses.
> >>>>
> >>>>> How exactly does the hardware translates VAs to
> >>>>> PAs? 
> >>>>
> >>>> The HW maintains a page-directory like structure different form MMU
> >>>> VA -> phys pages
> >>>>
> >>>>> What if the process page tables change?
> >>>>>
> >>>>
> >>>> Since the page tables the HW uses are their own, we just need the phys
> >>>> page to be pinned.
> >>>
> >>> So there's no hardware-imposed requirement that the hardware VAs
> >>> (mapped by the HW page directory) match the VAs in QEMU
> >>> address-space, right? 
> >>
> >> Actually there is. Today it works exactly as you described.
> > 
> > Are you sure there's such hardware-imposed requirement?
> > 
> 
> Yes.
> 
> > Why would the hardware require VAs to match the ones in the
> > userspace address-space, if it doesn't use the CPU MMU at all?
> > 
> 
> It works like that:
> 
> 1. We register a buffer from the process address space
>    giving its base address and length.
>    This call goes to kernel which in turn pins the phys pages
>    and registers them with the device *together* with the base
>    address (virtual address!)
> 2. The device builds its own page tables to be able to translate
>    the virtual addresses to actual phys pages.

How would the device be able to do that?  It would require the
device to look at the process page tables, wouldn't it?  Isn't
the HW IOVA->PA translation table built by the OS?


> 3. The process executes post-send requests directly to hw by-passing
>    the kernel giving process virtual addresses in work requests.
> 4. The device uses its own page tables to translate the virtual
>    addresses to phys pages and sending them.
> 
> Theoretically is possible to send any contiguous IOVA instead of
> process's one but is not how is working today.
> 
> Makes sense?
Michael S. Tsirkin Feb. 1, 2018, 6:52 p.m. UTC | #28
On Thu, Feb 01, 2018 at 08:31:09PM +0200, Marcel Apfelbaum wrote:
> Theoretically is possible to send any contiguous IOVA instead of
> process's one but is not how is working today.

It works this way in hardware today, but it's not the hardware that
limits it to working this way - it's a software limitation.
Marcel Apfelbaum Feb. 1, 2018, 6:58 p.m. UTC | #29
On 01/02/2018 20:51, Eduardo Habkost wrote:
> On Thu, Feb 01, 2018 at 08:31:09PM +0200, Marcel Apfelbaum wrote:
>> On 01/02/2018 20:21, Eduardo Habkost wrote:
>>> On Thu, Feb 01, 2018 at 08:03:53PM +0200, Marcel Apfelbaum wrote:
>>>> On 01/02/2018 15:53, Eduardo Habkost wrote:
>>>>> On Thu, Feb 01, 2018 at 02:29:25PM +0200, Marcel Apfelbaum wrote:
>>>>>> On 01/02/2018 14:10, Eduardo Habkost wrote:
>>>>>>> On Thu, Feb 01, 2018 at 07:36:50AM +0200, Marcel Apfelbaum wrote:
>>>>>>>> On 01/02/2018 4:22, Michael S. Tsirkin wrote:
>>>>>>>>> On Wed, Jan 31, 2018 at 09:34:22PM -0200, Eduardo Habkost wrote:
>>>>>>> [...]
>>>>>>>>>> BTW, what's the root cause for requiring HVAs in the buffer?
>>>>>>>>>
>>>>>>>>> It's a side effect of the kernel/userspace API which always wants
>>>>>>>>> a single HVA/len pair to map memory for the application.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> Hi Eduardo and Michael,
>>>>>>>>
>>>>>>>>>>  Can
>>>>>>>>>> this be fixed?
>>>>>>>>>
>>>>>>>>> I think yes.  It'd need to be a kernel patch for the RDMA subsystem
>>>>>>>>> mapping an s/g list with actual memory. The HVA/len pair would then just
>>>>>>>>> be used to refer to the region, without creating the two mappings.
>>>>>>>>>
>>>>>>>>> Something like splitting the register mr into
>>>>>>>>>
>>>>>>>>> mr = create mr (va/len) - allocate a handle and record the va/len
>>>>>>>>>
>>>>>>>>> addmemory(mr, offset, hva, len) - pin memory
>>>>>>>>>
>>>>>>>>> register mr - pass it to HW
>>>>>>>>>
>>>>>>>>> As a nice side effect we won't burn so much virtual address space.
>>>>>>>>>
>>>>>>>>
>>>>>>>> We would still need a contiguous virtual address space range (for post-send)
>>>>>>>> which we don't have since guest contiguous virtual address space
>>>>>>>> will always end up as non-contiguous host virtual address space.
>>>>>>>>
>>>>>>>> I am not sure the RDMA HW can handle a large VA with holes.
>>>>>>>
>>>>>>> I'm confused.  Why would the hardware see and care about virtual
>>>>>>> addresses? 
>>>>>>
>>>>>> The post-send operations bypasses the kernel, and the process
>>>>>> puts in the work request GVA addresses.
>>>>>>
>>>>>>> How exactly does the hardware translates VAs to
>>>>>>> PAs? 
>>>>>>
>>>>>> The HW maintains a page-directory like structure different form MMU
>>>>>> VA -> phys pages
>>>>>>
>>>>>>> What if the process page tables change?
>>>>>>>
>>>>>>
>>>>>> Since the page tables the HW uses are their own, we just need the phys
>>>>>> page to be pinned.
>>>>>
>>>>> So there's no hardware-imposed requirement that the hardware VAs
>>>>> (mapped by the HW page directory) match the VAs in QEMU
>>>>> address-space, right? 
>>>>
>>>> Actually there is. Today it works exactly as you described.
>>>
>>> Are you sure there's such hardware-imposed requirement?
>>>
>>
>> Yes.
>>
>>> Why would the hardware require VAs to match the ones in the
>>> userspace address-space, if it doesn't use the CPU MMU at all?
>>>
>>
>> It works like that:
>>
>> 1. We register a buffer from the process address space
>>    giving its base address and length.
>>    This call goes to kernel which in turn pins the phys pages
>>    and registers them with the device *together* with the base
>>    address (virtual address!)
>> 2. The device builds its own page tables to be able to translate
>>    the virtual addresses to actual phys pages.
> 
> How would the device be able to do that?  It would require the
> device to look at the process page tables, wouldn't it?  Isn't
> the HW IOVA->PA translation table built by the OS?
> 

As stated above, these tables are private to the device.
(I think they even have a HW-vendor-specific layout,
 since the device keeps some of it cached.)

The device looks at its own private page tables, not
at the OS ones.

> 
>> 3. The process executes post-send requests directly to hw by-passing
>>    the kernel giving process virtual addresses in work requests.
>> 4. The device uses its own page tables to translate the virtual
>>    addresses to phys pages and sending them.
>>
>> Theoretically is possible to send any contiguous IOVA instead of
>> process's one but is not how is working today.
>>
>> Makes sense?
>
Eduardo Habkost Feb. 1, 2018, 7:21 p.m. UTC | #30
On Thu, Feb 01, 2018 at 08:58:32PM +0200, Marcel Apfelbaum wrote:
> On 01/02/2018 20:51, Eduardo Habkost wrote:
> > On Thu, Feb 01, 2018 at 08:31:09PM +0200, Marcel Apfelbaum wrote:
> >> On 01/02/2018 20:21, Eduardo Habkost wrote:
> >>> On Thu, Feb 01, 2018 at 08:03:53PM +0200, Marcel Apfelbaum wrote:
> >>>> On 01/02/2018 15:53, Eduardo Habkost wrote:
> >>>>> On Thu, Feb 01, 2018 at 02:29:25PM +0200, Marcel Apfelbaum wrote:
> >>>>>> On 01/02/2018 14:10, Eduardo Habkost wrote:
> >>>>>>> On Thu, Feb 01, 2018 at 07:36:50AM +0200, Marcel Apfelbaum wrote:
> >>>>>>>> On 01/02/2018 4:22, Michael S. Tsirkin wrote:
> >>>>>>>>> On Wed, Jan 31, 2018 at 09:34:22PM -0200, Eduardo Habkost wrote:
> >>>>>>> [...]
> >>>>>>>>>> BTW, what's the root cause for requiring HVAs in the buffer?
> >>>>>>>>>
> >>>>>>>>> It's a side effect of the kernel/userspace API which always wants
> >>>>>>>>> a single HVA/len pair to map memory for the application.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> Hi Eduardo and Michael,
> >>>>>>>>
> >>>>>>>>>>  Can
> >>>>>>>>>> this be fixed?
> >>>>>>>>>
> >>>>>>>>> I think yes.  It'd need to be a kernel patch for the RDMA subsystem
> >>>>>>>>> mapping an s/g list with actual memory. The HVA/len pair would then just
> >>>>>>>>> be used to refer to the region, without creating the two mappings.
> >>>>>>>>>
> >>>>>>>>> Something like splitting the register mr into
> >>>>>>>>>
> >>>>>>>>> mr = create mr (va/len) - allocate a handle and record the va/len
> >>>>>>>>>
> >>>>>>>>> addmemory(mr, offset, hva, len) - pin memory
> >>>>>>>>>
> >>>>>>>>> register mr - pass it to HW
> >>>>>>>>>
> >>>>>>>>> As a nice side effect we won't burn so much virtual address space.
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> We would still need a contiguous virtual address space range (for post-send)
> >>>>>>>> which we don't have since guest contiguous virtual address space
> >>>>>>>> will always end up as non-contiguous host virtual address space.
> >>>>>>>>
> >>>>>>>> I am not sure the RDMA HW can handle a large VA with holes.
> >>>>>>>
> >>>>>>> I'm confused.  Why would the hardware see and care about virtual
> >>>>>>> addresses? 
> >>>>>>
> >>>>>> The post-send operations bypasses the kernel, and the process
> >>>>>> puts in the work request GVA addresses.
> >>>>>>
> >>>>>>> How exactly does the hardware translates VAs to
> >>>>>>> PAs? 
> >>>>>>
> >>>>>> The HW maintains a page-directory like structure different form MMU
> >>>>>> VA -> phys pages
> >>>>>>
> >>>>>>> What if the process page tables change?
> >>>>>>>
> >>>>>>
> >>>>>> Since the page tables the HW uses are their own, we just need the phys
> >>>>>> page to be pinned.
> >>>>>
> >>>>> So there's no hardware-imposed requirement that the hardware VAs
> >>>>> (mapped by the HW page directory) match the VAs in QEMU
> >>>>> address-space, right? 
> >>>>
> >>>> Actually there is. Today it works exactly as you described.
> >>>
> >>> Are you sure there's such hardware-imposed requirement?
> >>>
> >>
> >> Yes.
> >>
> >>> Why would the hardware require VAs to match the ones in the
> >>> userspace address-space, if it doesn't use the CPU MMU at all?
> >>>
> >>
> >> It works like that:
> >>
> >> 1. We register a buffer from the process address space
> >>    giving its base address and length.
> >>    This call goes to kernel which in turn pins the phys pages
> >>    and registers them with the device *together* with the base
> >>    address (virtual address!)
> >> 2. The device builds its own page tables to be able to translate
> >>    the virtual addresses to actual phys pages.
> > 
> > How would the device be able to do that?  It would require the
> > device to look at the process page tables, wouldn't it?  Isn't
> > the HW IOVA->PA translation table built by the OS?
> > 
> 
> As stated above, these are tables private for the device.
> (They even have a hw vendor specific layout I think,
>  since the device holds some cache)
> 
> The device looks at its own private page tables, and not
> to the OS ones.

I'm still confused by your statement that the device builds its
own [IOVA->PA] page table.  How would the device do that if it
doesn't have access to the CPU MMU state?  Isn't the IOVA->PA
translation table built by the OS?

> 
> > 
> >> 3. The process executes post-send requests directly to hw by-passing
> >>    the kernel giving process virtual addresses in work requests.
> >> 4. The device uses its own page tables to translate the virtual
> >>    addresses to phys pages and sending them.
> >>
> >> Theoretically is possible to send any contiguous IOVA instead of
> >> process's one but is not how is working today.
> >>
> >> Makes sense?
> > 
>
Marcel Apfelbaum Feb. 1, 2018, 7:28 p.m. UTC | #31
On 01/02/2018 21:21, Eduardo Habkost wrote:
> On Thu, Feb 01, 2018 at 08:58:32PM +0200, Marcel Apfelbaum wrote:
>> On 01/02/2018 20:51, Eduardo Habkost wrote:
>>> On Thu, Feb 01, 2018 at 08:31:09PM +0200, Marcel Apfelbaum wrote:
>>>> On 01/02/2018 20:21, Eduardo Habkost wrote:
>>>>> On Thu, Feb 01, 2018 at 08:03:53PM +0200, Marcel Apfelbaum wrote:
>>>>>> On 01/02/2018 15:53, Eduardo Habkost wrote:
>>>>>>> On Thu, Feb 01, 2018 at 02:29:25PM +0200, Marcel Apfelbaum wrote:
>>>>>>>> On 01/02/2018 14:10, Eduardo Habkost wrote:
>>>>>>>>> On Thu, Feb 01, 2018 at 07:36:50AM +0200, Marcel Apfelbaum wrote:
>>>>>>>>>> On 01/02/2018 4:22, Michael S. Tsirkin wrote:
>>>>>>>>>>> On Wed, Jan 31, 2018 at 09:34:22PM -0200, Eduardo Habkost wrote:
>>>>>>>>> [...]
>>>>>>>>>>>> BTW, what's the root cause for requiring HVAs in the buffer?
>>>>>>>>>>>
>>>>>>>>>>> It's a side effect of the kernel/userspace API which always wants
>>>>>>>>>>> a single HVA/len pair to map memory for the application.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Hi Eduardo and Michael,
>>>>>>>>>>
>>>>>>>>>>>>  Can
>>>>>>>>>>>> this be fixed?
>>>>>>>>>>>
>>>>>>>>>>> I think yes.  It'd need to be a kernel patch for the RDMA subsystem
>>>>>>>>>>> mapping an s/g list with actual memory. The HVA/len pair would then just
>>>>>>>>>>> be used to refer to the region, without creating the two mappings.
>>>>>>>>>>>
>>>>>>>>>>> Something like splitting the register mr into
>>>>>>>>>>>
>>>>>>>>>>> mr = create mr (va/len) - allocate a handle and record the va/len
>>>>>>>>>>>
>>>>>>>>>>> addmemory(mr, offset, hva, len) - pin memory
>>>>>>>>>>>
>>>>>>>>>>> register mr - pass it to HW
>>>>>>>>>>>
>>>>>>>>>>> As a nice side effect we won't burn so much virtual address space.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> We would still need a contiguous virtual address space range (for post-send)
>>>>>>>>>> which we don't have since guest contiguous virtual address space
>>>>>>>>>> will always end up as non-contiguous host virtual address space.
>>>>>>>>>>
>>>>>>>>>> I am not sure the RDMA HW can handle a large VA with holes.
>>>>>>>>>
>>>>>>>>> I'm confused.  Why would the hardware see and care about virtual
>>>>>>>>> addresses? 
>>>>>>>>
>>>>>>>> The post-send operations bypasses the kernel, and the process
>>>>>>>> puts in the work request GVA addresses.
>>>>>>>>
>>>>>>>>> How exactly does the hardware translates VAs to
>>>>>>>>> PAs? 
>>>>>>>>
>>>>>>>> The HW maintains a page-directory like structure different form MMU
>>>>>>>> VA -> phys pages
>>>>>>>>
>>>>>>>>> What if the process page tables change?
>>>>>>>>>
>>>>>>>>
>>>>>>>> Since the page tables the HW uses are their own, we just need the phys
>>>>>>>> page to be pinned.
>>>>>>>
>>>>>>> So there's no hardware-imposed requirement that the hardware VAs
>>>>>>> (mapped by the HW page directory) match the VAs in QEMU
>>>>>>> address-space, right? 
>>>>>>
>>>>>> Actually there is. Today it works exactly as you described.
>>>>>
>>>>> Are you sure there's such hardware-imposed requirement?
>>>>>
>>>>
>>>> Yes.
>>>>
>>>>> Why would the hardware require VAs to match the ones in the
>>>>> userspace address-space, if it doesn't use the CPU MMU at all?
>>>>>
>>>>
>>>> It works like that:
>>>>
>>>> 1. We register a buffer from the process address space
>>>>    giving its base address and length.
>>>>    This call goes to kernel which in turn pins the phys pages
>>>>    and registers them with the device *together* with the base
>>>>    address (virtual address!)
>>>> 2. The device builds its own page tables to be able to translate
>>>>    the virtual addresses to actual phys pages.
>>>
>>> How would the device be able to do that?  It would require the
>>> device to look at the process page tables, wouldn't it?  Isn't
>>> the HW IOVA->PA translation table built by the OS?
>>>
>>
>> As stated above, these are tables private for the device.
>> (They even have a hw vendor specific layout I think,
>>  since the device holds some cache)
>>
>> The device looks at its own private page tables, and not
>> to the OS ones.
> 
> I'm still confused by your statement that the device builds its
> own [IOVA->PA] page table.  How would the device do that if it
> doesn't have access to the CPU MMU state?  Isn't the IOVA->PA
> translation table built by the OS?
> 

Sorry about the confusion. The device gets a base virtual address,
the memory region length and a list of physical pages.
This is enough information for it to build its own kind of tables,
which will tell it, for example, that if the IOVA range starts at
address 0x1000, then address 0x1001 is in page 0 and address 0x2000
is in page 1.

Be aware that this base virtual address can come from any address
space, not only from the process address space; using the process
address space is just what the current software implementation does.
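
Purely as an illustration of the kind of lookup such a table enables
(my own sketch, assuming 4 KiB pages; the real layout is HW-vendor
specific and lives on the device):

#include <stdint.h>
#include <stddef.h>

#define PAGE_SHIFT 12                        /* assuming 4 KiB pages */
#define PAGE_MASK  ((1ULL << PAGE_SHIFT) - 1)

struct mr_table {
    uint64_t  base_va;   /* base VA/IOVA given at registration time */
    size_t    npages;
    uint64_t *page_pa;   /* physical address of each pinned page */
};

/* With base_va 0x1000, address 0x1001 lands in page 0 and address
 * 0x2000 lands in page 1, matching the example above. */
static uint64_t mr_translate(const struct mr_table *t, uint64_t va)
{
    uint64_t off = va - t->base_va;

    return t->page_pa[off >> PAGE_SHIFT] + (off & PAGE_MASK);
}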

Thanks,
Marcel

>>
>>>
>>>> 3. The process executes post-send requests directly to hw by-passing
>>>>    the kernel giving process virtual addresses in work requests.
>>>> 4. The device uses its own page tables to translate the virtual
>>>>    addresses to phys pages and sending them.
>>>>
>>>> Theoretically is possible to send any contiguous IOVA instead of
>>>> process's one but is not how is working today.
>>>>
>>>> Makes sense?
>>>
>>
>
Paolo Bonzini Feb. 1, 2018, 7:35 p.m. UTC | #32
On 01/02/2018 14:21, Eduardo Habkost wrote:
>> The device looks at its own private page tables, and not
>> to the OS ones.
> I'm still confused by your statement that the device builds its
> own [IOVA->PA] page table.  How would the device do that if it
> doesn't have access to the CPU MMU state?  Isn't the IOVA->PA
> translation table built by the OS?

The driver builds a page table for the device, either when it pins the
pages or by using MMU notifiers.

Paolo
Patch

diff --git a/backends/hostmem-file.c b/backends/hostmem-file.c
index e44c319915..bc95022a68 100644
--- a/backends/hostmem-file.c
+++ b/backends/hostmem-file.c
@@ -31,7 +31,6 @@  typedef struct HostMemoryBackendFile HostMemoryBackendFile;
 struct HostMemoryBackendFile {
     HostMemoryBackend parent_obj;
 
-    bool share;
     bool discard_data;
     char *mem_path;
 };
@@ -58,7 +57,7 @@  file_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
         path = object_get_canonical_path(OBJECT(backend));
         memory_region_init_ram_from_file(&backend->mr, OBJECT(backend),
                                  path,
-                                 backend->size, fb->share,
+                                 backend->size, backend->share,
                                  fb->mem_path, errp);
         g_free(path);
     }
@@ -85,25 +84,6 @@  static void set_mem_path(Object *o, const char *str, Error **errp)
     fb->mem_path = g_strdup(str);
 }
 
-static bool file_memory_backend_get_share(Object *o, Error **errp)
-{
-    HostMemoryBackendFile *fb = MEMORY_BACKEND_FILE(o);
-
-    return fb->share;
-}
-
-static void file_memory_backend_set_share(Object *o, bool value, Error **errp)
-{
-    HostMemoryBackend *backend = MEMORY_BACKEND(o);
-    HostMemoryBackendFile *fb = MEMORY_BACKEND_FILE(o);
-
-    if (host_memory_backend_mr_inited(backend)) {
-        error_setg(errp, "cannot change property value");
-        return;
-    }
-    fb->share = value;
-}
-
 static bool file_memory_backend_get_discard_data(Object *o, Error **errp)
 {
     return MEMORY_BACKEND_FILE(o)->discard_data;
@@ -136,9 +116,6 @@  file_backend_class_init(ObjectClass *oc, void *data)
     bc->alloc = file_backend_memory_alloc;
     oc->unparent = file_backend_unparent;
 
-    object_class_property_add_bool(oc, "share",
-        file_memory_backend_get_share, file_memory_backend_set_share,
-        &error_abort);
     object_class_property_add_bool(oc, "discard-data",
         file_memory_backend_get_discard_data, file_memory_backend_set_discard_data,
         &error_abort);
diff --git a/backends/hostmem-ram.c b/backends/hostmem-ram.c
index 38977be73e..7ddd08d370 100644
--- a/backends/hostmem-ram.c
+++ b/backends/hostmem-ram.c
@@ -28,8 +28,8 @@  ram_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
     }
 
     path = object_get_canonical_path_component(OBJECT(backend));
-    memory_region_init_ram_nomigrate(&backend->mr, OBJECT(backend), path,
-                           backend->size, errp);
+    memory_region_init_ram_shared_nomigrate(&backend->mr, OBJECT(backend), path,
+                           backend->size, backend->share, errp);
     g_free(path);
 }
 
diff --git a/backends/hostmem.c b/backends/hostmem.c
index ee2c2d5bfd..1daf13bd2e 100644
--- a/backends/hostmem.c
+++ b/backends/hostmem.c
@@ -369,6 +369,24 @@  static void set_id(Object *o, const char *str, Error **errp)
     backend->id = g_strdup(str);
 }
 
+static bool host_memory_backend_get_share(Object *o, Error **errp)
+{
+    HostMemoryBackend *backend = MEMORY_BACKEND(o);
+
+    return backend->share;
+}
+
+static void host_memory_backend_set_share(Object *o, bool value, Error **errp)
+{
+    HostMemoryBackend *backend = MEMORY_BACKEND(o);
+
+    if (host_memory_backend_mr_inited(backend)) {
+        error_setg(errp, "cannot change property value");
+        return;
+    }
+    backend->share = value;
+}
+
 static void
 host_memory_backend_class_init(ObjectClass *oc, void *data)
 {
@@ -399,6 +417,9 @@  host_memory_backend_class_init(ObjectClass *oc, void *data)
         host_memory_backend_get_policy,
         host_memory_backend_set_policy, &error_abort);
     object_class_property_add_str(oc, "id", get_id, set_id, &error_abort);
+    object_class_property_add_bool(oc, "share",
+        host_memory_backend_get_share, host_memory_backend_set_share,
+        &error_abort);
 }
 
 static void host_memory_backend_finalize(Object *o)
diff --git a/exec.c b/exec.c
index d28fc0cd3d..7f543fba80 100644
--- a/exec.c
+++ b/exec.c
@@ -1285,7 +1285,7 @@  static int subpage_register (subpage_t *mmio, uint32_t start, uint32_t end,
                              uint16_t section);
 static subpage_t *subpage_init(FlatView *fv, hwaddr base);
 
-static void *(*phys_mem_alloc)(size_t size, uint64_t *align) =
+static void *(*phys_mem_alloc)(size_t size, uint64_t *align, bool shared) =
                                qemu_anon_ram_alloc;
 
 /*
@@ -1293,7 +1293,7 @@  static void *(*phys_mem_alloc)(size_t size, uint64_t *align) =
  * Accelerators with unusual needs may need this.  Hopefully, we can
  * get rid of it eventually.
  */
-void phys_mem_set_alloc(void *(*alloc)(size_t, uint64_t *align))
+void phys_mem_set_alloc(void *(*alloc)(size_t, uint64_t *align, bool shared))
 {
     phys_mem_alloc = alloc;
 }
@@ -1915,7 +1915,7 @@  static void dirty_memory_extend(ram_addr_t old_ram_size,
     }
 }
 
-static void ram_block_add(RAMBlock *new_block, Error **errp)
+static void ram_block_add(RAMBlock *new_block, Error **errp, bool shared)
 {
     RAMBlock *block;
     RAMBlock *last_block = NULL;
@@ -1938,7 +1938,7 @@  static void ram_block_add(RAMBlock *new_block, Error **errp)
             }
         } else {
             new_block->host = phys_mem_alloc(new_block->max_length,
-                                             &new_block->mr->align);
+                                             &new_block->mr->align, shared);
             if (!new_block->host) {
                 error_setg_errno(errp, errno,
                                  "cannot set up guest memory '%s'",
@@ -2043,7 +2043,7 @@  RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr,
         return NULL;
     }
 
-    ram_block_add(new_block, &local_err);
+    ram_block_add(new_block, &local_err, share);
     if (local_err) {
         g_free(new_block);
         error_propagate(errp, local_err);
@@ -2085,7 +2085,7 @@  RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size,
                                   void (*resized)(const char*,
                                                   uint64_t length,
                                                   void *host),
-                                  void *host, bool resizeable,
+                                  void *host, bool resizeable, bool share,
                                   MemoryRegion *mr, Error **errp)
 {
     RAMBlock *new_block;
@@ -2108,7 +2108,7 @@  RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size,
     if (resizeable) {
         new_block->flags |= RAM_RESIZEABLE;
     }
-    ram_block_add(new_block, &local_err);
+    ram_block_add(new_block, &local_err, share);
     if (local_err) {
         g_free(new_block);
         error_propagate(errp, local_err);
@@ -2120,12 +2120,15 @@  RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size,
 RAMBlock *qemu_ram_alloc_from_ptr(ram_addr_t size, void *host,
                                    MemoryRegion *mr, Error **errp)
 {
-    return qemu_ram_alloc_internal(size, size, NULL, host, false, mr, errp);
+    return qemu_ram_alloc_internal(size, size, NULL, host, false,
+                                   false, mr, errp);
 }
 
-RAMBlock *qemu_ram_alloc(ram_addr_t size, MemoryRegion *mr, Error **errp)
+RAMBlock *qemu_ram_alloc(ram_addr_t size, bool share,
+                         MemoryRegion *mr, Error **errp)
 {
-    return qemu_ram_alloc_internal(size, size, NULL, NULL, false, mr, errp);
+    return qemu_ram_alloc_internal(size, size, NULL, NULL, false,
+                                   share, mr, errp);
 }
 
 RAMBlock *qemu_ram_alloc_resizeable(ram_addr_t size, ram_addr_t maxsz,
@@ -2134,7 +2137,8 @@  RAMBlock *qemu_ram_alloc_resizeable(ram_addr_t size, ram_addr_t maxsz,
                                                      void *host),
                                      MemoryRegion *mr, Error **errp)
 {
-    return qemu_ram_alloc_internal(size, maxsz, resized, NULL, true, mr, errp);
+    return qemu_ram_alloc_internal(size, maxsz, resized, NULL, true,
+                                   false, mr, errp);
 }
 
 static void reclaim_ramblock(RAMBlock *block)
diff --git a/include/exec/memory.h b/include/exec/memory.h
index a4cabdf44c..dd28eaba68 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -428,6 +428,29 @@  void memory_region_init_ram_nomigrate(MemoryRegion *mr,
                                       Error **errp);
 
 /**
+ * memory_region_init_ram_shared_nomigrate:  Initialize RAM memory region.
+ *                                           Accesses into the region will
+ *                                           modify memory directly.
+ *
+ * @mr: the #MemoryRegion to be initialized.
+ * @owner: the object that tracks the region's reference count
+ * @name: Region name, becomes part of RAMBlock name used in migration stream
+ *        must be unique within any device
+ * @size: size of the region.
+ * @share: allow remapping RAM to different addresses
+ * @errp: pointer to Error*, to store an error if it happens.
+ *
+ * Note that this function is similar to memory_region_init_ram_nomigrate.
+ * The only difference is part of the RAM region can be remapped.
+ */
+void memory_region_init_ram_shared_nomigrate(MemoryRegion *mr,
+                                             struct Object *owner,
+                                             const char *name,
+                                             uint64_t size,
+                                             bool share,
+                                             Error **errp);
+
+/**
  * memory_region_init_resizeable_ram:  Initialize memory region with resizeable
  *                                     RAM.  Accesses into the region will
  *                                     modify memory directly.  Only an initial
diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
index 7633ef6342..cf2446a176 100644
--- a/include/exec/ram_addr.h
+++ b/include/exec/ram_addr.h
@@ -80,7 +80,8 @@  RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr,
                                  Error **errp);
 RAMBlock *qemu_ram_alloc_from_ptr(ram_addr_t size, void *host,
                                   MemoryRegion *mr, Error **errp);
-RAMBlock *qemu_ram_alloc(ram_addr_t size, MemoryRegion *mr, Error **errp);
+RAMBlock *qemu_ram_alloc(ram_addr_t size, bool share, MemoryRegion *mr,
+                         Error **errp);
 RAMBlock *qemu_ram_alloc_resizeable(ram_addr_t size, ram_addr_t max_size,
                                     void (*resized)(const char*,
                                                     uint64_t length,
diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
index adb3758275..41658060a7 100644
--- a/include/qemu/osdep.h
+++ b/include/qemu/osdep.h
@@ -255,7 +255,7 @@  extern int daemon(int, int);
 int qemu_daemon(int nochdir, int noclose);
 void *qemu_try_memalign(size_t alignment, size_t size);
 void *qemu_memalign(size_t alignment, size_t size);
-void *qemu_anon_ram_alloc(size_t size, uint64_t *align);
+void *qemu_anon_ram_alloc(size_t size, uint64_t *align, bool shared);
 void qemu_vfree(void *ptr);
 void qemu_anon_ram_free(void *ptr, size_t size);
 
diff --git a/include/sysemu/hostmem.h b/include/sysemu/hostmem.h
index ed6a437f4d..4d8f859f03 100644
--- a/include/sysemu/hostmem.h
+++ b/include/sysemu/hostmem.h
@@ -55,7 +55,7 @@  struct HostMemoryBackend {
     char *id;
     uint64_t size;
     bool merge, dump;
-    bool prealloc, force_prealloc, is_mapped;
+    bool prealloc, force_prealloc, is_mapped, share;
     DECLARE_BITMAP(host_nodes, MAX_NODES + 1);
     HostMemPolicy policy;
 
diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index bbf12a1723..85002ac49a 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -248,7 +248,7 @@  int kvm_on_sigbus(int code, void *addr);
 
 /* interface with exec.c */
 
-void phys_mem_set_alloc(void *(*alloc)(size_t, uint64_t *align));
+void phys_mem_set_alloc(void *(*alloc)(size_t, uint64_t *align, bool shared));
 
 /* internal API */
 
diff --git a/memory.c b/memory.c
index 4b41fb837b..cb4fe1a55a 100644
--- a/memory.c
+++ b/memory.c
@@ -1538,11 +1538,21 @@  void memory_region_init_ram_nomigrate(MemoryRegion *mr,
                                       uint64_t size,
                                       Error **errp)
 {
+    memory_region_init_ram_shared_nomigrate(mr, owner, name, size, false, errp);
+}
+
+void memory_region_init_ram_shared_nomigrate(MemoryRegion *mr,
+                                             Object *owner,
+                                             const char *name,
+                                             uint64_t size,
+                                             bool share,
+                                             Error **errp)
+{
     memory_region_init(mr, owner, name, size);
     mr->ram = true;
     mr->terminates = true;
     mr->destructor = memory_region_destructor_ram;
-    mr->ram_block = qemu_ram_alloc(size, mr, errp);
+    mr->ram_block = qemu_ram_alloc(size, share, mr, errp);
     mr->dirty_log_mask = tcg_enabled() ? (1 << DIRTY_MEMORY_CODE) : 0;
 }
 
@@ -1651,7 +1661,7 @@  void memory_region_init_rom_nomigrate(MemoryRegion *mr,
     mr->readonly = true;
     mr->terminates = true;
     mr->destructor = memory_region_destructor_ram;
-    mr->ram_block = qemu_ram_alloc(size, mr, errp);
+    mr->ram_block = qemu_ram_alloc(size, false, mr, errp);
     mr->dirty_log_mask = tcg_enabled() ? (1 << DIRTY_MEMORY_CODE) : 0;
 }
 
@@ -1670,7 +1680,7 @@  void memory_region_init_rom_device_nomigrate(MemoryRegion *mr,
     mr->terminates = true;
     mr->rom_device = true;
     mr->destructor = memory_region_destructor_ram;
-    mr->ram_block = qemu_ram_alloc(size, mr, errp);
+    mr->ram_block = qemu_ram_alloc(size, false,  mr, errp);
 }
 
 void memory_region_init_iommu(void *_iommu_mr,
diff --git a/target/s390x/kvm.c b/target/s390x/kvm.c
index 6a18a413b4..73629ebb25 100644
--- a/target/s390x/kvm.c
+++ b/target/s390x/kvm.c
@@ -144,7 +144,7 @@  static int cap_gs;
 
 static int active_cmma;
 
-static void *legacy_s390_alloc(size_t size, uint64_t *align);
+static void *legacy_s390_alloc(size_t size, uint64_t *align, bool shared);
 
 static int kvm_s390_query_mem_limit(uint64_t *memory_limit)
 {
@@ -743,7 +743,7 @@  int kvm_s390_mem_op(S390CPU *cpu, vaddr addr, uint8_t ar, void *hostbuf,
  * to grow. We also have to use MAP parameters that avoid
  * read-only mapping of guest pages.
  */
-static void *legacy_s390_alloc(size_t size, uint64_t *align)
+static void *legacy_s390_alloc(size_t size, uint64_t *align, bool shared)
 {
     void *mem;
 
diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index 77369c92ce..0cf3548778 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -127,10 +127,10 @@  void *qemu_memalign(size_t alignment, size_t size)
 }
 
 /* alloc shared memory pages */
-void *qemu_anon_ram_alloc(size_t size, uint64_t *alignment)
+void *qemu_anon_ram_alloc(size_t size, uint64_t *alignment, bool shared)
 {
     size_t align = QEMU_VMALLOC_ALIGN;
-    void *ptr = qemu_ram_mmap(-1, size, align, false);
+    void *ptr = qemu_ram_mmap(-1, size, align, shared);
 
     if (ptr == MAP_FAILED) {
         return NULL;
diff --git a/util/oslib-win32.c b/util/oslib-win32.c
index 69a6286d50..bb5ad28bd3 100644
--- a/util/oslib-win32.c
+++ b/util/oslib-win32.c
@@ -67,7 +67,7 @@  void *qemu_memalign(size_t alignment, size_t size)
     return qemu_oom_check(qemu_try_memalign(alignment, size));
 }
 
-void *qemu_anon_ram_alloc(size_t size, uint64_t *align)
+void *qemu_anon_ram_alloc(size_t size, uint64_t *align, bool shared)
 {
     void *ptr;