diff mbox series

[v3,2/2] vhost: Only align sections for vhost-user

Message ID 20200116202414.157959-3-dgilbert@redhat.com
State New
Headers show
Series Fix hyperv synic on vhost | expand

Commit Message

Dr. David Alan Gilbert Jan. 16, 2020, 8:24 p.m. UTC
From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

I added hugepage alignment code in c1ece84e7c9 to deal with
vhost-user + postcopy which needs aligned pages when using userfault.
However, on x86 the lower 2MB of address space tends to be shotgun'd
with small fragments around the 512-640k range - e.g. video RAM, and
with HyperV synic pages tend to sit around there - again splitting
it up.  The alignment code complains with a 'Section rounded to ...'
error and gives up.

Since vhost-user already filters out devices without an fd
(see vhost-user.c vhost_user_mem_section_filter) it shouldn't be
affected by those overlaps.

Turn the alignment off on vhost-kernel so that it doesn't try
and align, and thus won't hit the rounding issues.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 hw/virtio/vhost.c | 34 ++++++++++++++++++----------------
 1 file changed, 18 insertions(+), 16 deletions(-)

Comments

Paolo Bonzini Jan. 17, 2020, 12:52 p.m. UTC | #1
On 16/01/20 21:24, Dr. David Alan Gilbert (git) wrote:
> +    if (dev->vhost_ops->backend_type == VHOST_BACKEND_TYPE_USER) {   
> +        /* Round the section to it's page size */
> +        /* First align the start down to a page boundary */
> +        size_t mrs_page = qemu_ram_pagesize(mrs_rb);
> +        uint64_t alignage = mrs_host & (mrs_page - 1);
> +        if (alignage) {
> +            mrs_host -= alignage;
> +            mrs_size += alignage;
> +            mrs_gpa  -= alignage;
> +        }
> +        /* Now align the size up to a page boundary */
> +        alignage = mrs_size & (mrs_page - 1);
> +        if (alignage) {
> +            mrs_size += mrs_page - alignage;
> +        }
> +        trace_vhost_region_add_section_aligned(section->mr->name, mrs_gpa, mrs_size,
> +                                               mrs_host);
> +    }

Ok, now I understand!

So it seems to me that the vhost-user protocol is deficient, it should
have had two different fields for the region size and the region
alignment (so that mmap does not fail).  But I guess that's just yet
another thing to remember for vhost-user v2.

I would add a comment to explain why the alignment is needed in the
first place, but this fix is certainly much more pleasant.  Thanks very
much.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>

Paolo
Michael S. Tsirkin Jan. 17, 2020, 1:40 p.m. UTC | #2
On Fri, Jan 17, 2020 at 01:52:44PM +0100, Paolo Bonzini wrote:
> On 16/01/20 21:24, Dr. David Alan Gilbert (git) wrote:
> > +    if (dev->vhost_ops->backend_type == VHOST_BACKEND_TYPE_USER) {   
> > +        /* Round the section to it's page size */
> > +        /* First align the start down to a page boundary */
> > +        size_t mrs_page = qemu_ram_pagesize(mrs_rb);
> > +        uint64_t alignage = mrs_host & (mrs_page - 1);
> > +        if (alignage) {
> > +            mrs_host -= alignage;
> > +            mrs_size += alignage;
> > +            mrs_gpa  -= alignage;
> > +        }
> > +        /* Now align the size up to a page boundary */
> > +        alignage = mrs_size & (mrs_page - 1);
> > +        if (alignage) {
> > +            mrs_size += mrs_page - alignage;
> > +        }
> > +        trace_vhost_region_add_section_aligned(section->mr->name, mrs_gpa, mrs_size,
> > +                                               mrs_host);
> > +    }
> 
> Ok, now I understand!
> 
> So it seems to me that the vhost-user protocol is deficient, it should
> have had two different fields for the region size and the region
> alignment (so that mmap does not fail).  But I guess that's just yet
> another thing to remember for vhost-user v2.

We don't really need v2 just to add a field. Compatibility is maintained
using feature bits. Adding that is a subject for another patch.
But I'm not sure I understand why does remote need to know about alignment.
This patch seems to handle it locally ...

> I would add a comment to explain why the alignment is needed in the
> first place, but this fix is certainly much more pleasant.  Thanks very
> much.
> 
> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
> 
> Paolo
Paolo Bonzini Jan. 17, 2020, 1:58 p.m. UTC | #3
On 17/01/20 14:40, Michael S. Tsirkin wrote:
> We don't really need v2 just to add a field. Compatibility is maintained
> using feature bits. Adding that is a subject for another patch.
> But I'm not sure I understand why does remote need to know about alignment.
> This patch seems to handle it locally ...

Because the remote vhost here will not be able to use the synic regions.
 If it did, it would have the same overlap problem as vhost-kernel.

The alignment is needed because, even if you are mapping only [768k,1M)
of a 2M hugepage, you need to mmap [0,2M).  You can then discard the
rest, but IIUC if you only mmap [768k,1M) then the kernel will fail the
mmap.

Paolo
Michael S. Tsirkin Jan. 17, 2020, 2:25 p.m. UTC | #4
On Fri, Jan 17, 2020 at 02:58:47PM +0100, Paolo Bonzini wrote:
> On 17/01/20 14:40, Michael S. Tsirkin wrote:
> > We don't really need v2 just to add a field. Compatibility is maintained
> > using feature bits. Adding that is a subject for another patch.
> > But I'm not sure I understand why does remote need to know about alignment.
> > This patch seems to handle it locally ...
> 
> Because the remote vhost here will not be able to use the synic regions.
>  If it did, it would have the same overlap problem as vhost-kernel.
> 
> The alignment is needed because, even if you are mapping only [768k,1M)
> of a 2M hugepage, you need to mmap [0,2M).  You can then discard the
> rest, but IIUC if you only mmap [768k,1M) then the kernel will fail the
> mmap.
> 
> Paolo

So right now remote will query the fd passed to get the alignment.
You are basically saying it's not enough in some cases?
Paolo Bonzini Jan. 17, 2020, 3:06 p.m. UTC | #5
On 17/01/20 15:25, Michael S. Tsirkin wrote:
> On Fri, Jan 17, 2020 at 02:58:47PM +0100, Paolo Bonzini wrote:
>> On 17/01/20 14:40, Michael S. Tsirkin wrote:
>>> We don't really need v2 just to add a field. Compatibility is maintained
>>> using feature bits. Adding that is a subject for another patch.
>>> But I'm not sure I understand why does remote need to know about alignment.
>>> This patch seems to handle it locally ...
>>
>> Because the remote vhost here will not be able to use the synic regions.
>>  If it did, it would have the same overlap problem as vhost-kernel.
>>
>> The alignment is needed because, even if you are mapping only [768k,1M)
>> of a 2M hugepage, you need to mmap [0,2M).  You can then discard the
>> rest, but IIUC if you only mmap [768k,1M) then the kernel will fail the
>> mmap.
> 
> So right now remote will query the fd passed to get the alignment.

It should, but will it?  It's not in the spec and I assume QEMU is doing
this alignment work because some server is not doing it.  But indeed we
could use a feature bit to say "don't worry I will be doing the right
thing".

Paolo

> You are basically saying it's not enough in some cases?
>
Dr. David Alan Gilbert Jan. 17, 2020, 3:10 p.m. UTC | #6
* Paolo Bonzini (pbonzini@redhat.com) wrote:
> On 17/01/20 14:40, Michael S. Tsirkin wrote:
> > We don't really need v2 just to add a field. Compatibility is maintained
> > using feature bits. Adding that is a subject for another patch.
> > But I'm not sure I understand why does remote need to know about alignment.
> > This patch seems to handle it locally ...
> 
> Because the remote vhost here will not be able to use the synic regions.
>  If it did, it would have the same overlap problem as vhost-kernel.
> 
> The alignment is needed because, even if you are mapping only [768k,1M)
> of a 2M hugepage, you need to mmap [0,2M).  You can then discard the
> rest, but IIUC if you only mmap [768k,1M) then the kernel will fail the
> mmap.

The vhost-user problem is worse than that.

Lets ignore the synic regions for a minute;  imagine a normal PC,
0-512k ish, and a 512k-1GB map; the client side can do the alignment
to allow an mmap to work for each of those two mappings, you then
end up with two mmap's on the client, both aligned at 0, each covering
part of the same range.  If the client is careful in normal use it can
make sure it's OK, but you do end up with two mappings which is odd.
Once you add userfault for psotcopy, having those two mappings of the same
range gets even more problematic when we wait for one of the pages
to become available.

The other problem is that there's a limited number of slots for mappings
in the vhost-user protocol; and they easily get filled up if you
break up the memory ranges.

Doing the aggregation on the qemu side keeps your slot usage down and
also removes the problem of the double mappings.

The client can't easily resolve the double mapping alignment because it
can't tell if the two fd's it's been passed actually refer to the
same backing file; so it can't aggregate (from memory).

Dave



> Paolo
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
diff mbox series

Patch

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 774d87d98e..25fd469179 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -547,26 +547,28 @@  static void vhost_region_add_section(struct vhost_dev *dev,
     uintptr_t mrs_host = (uintptr_t)memory_region_get_ram_ptr(section->mr) +
                          section->offset_within_region;
     RAMBlock *mrs_rb = section->mr->ram_block;
-    size_t mrs_page = qemu_ram_pagesize(mrs_rb);
 
     trace_vhost_region_add_section(section->mr->name, mrs_gpa, mrs_size,
                                    mrs_host);
 
-    /* Round the section to it's page size */
-    /* First align the start down to a page boundary */
-    uint64_t alignage = mrs_host & (mrs_page - 1);
-    if (alignage) {
-        mrs_host -= alignage;
-        mrs_size += alignage;
-        mrs_gpa  -= alignage;
-    }
-    /* Now align the size up to a page boundary */
-    alignage = mrs_size & (mrs_page - 1);
-    if (alignage) {
-        mrs_size += mrs_page - alignage;
-    }
-    trace_vhost_region_add_section_aligned(section->mr->name, mrs_gpa, mrs_size,
-                                           mrs_host);
+    if (dev->vhost_ops->backend_type == VHOST_BACKEND_TYPE_USER) {   
+        /* Round the section to it's page size */
+        /* First align the start down to a page boundary */
+        size_t mrs_page = qemu_ram_pagesize(mrs_rb);
+        uint64_t alignage = mrs_host & (mrs_page - 1);
+        if (alignage) {
+            mrs_host -= alignage;
+            mrs_size += alignage;
+            mrs_gpa  -= alignage;
+        }
+        /* Now align the size up to a page boundary */
+        alignage = mrs_size & (mrs_page - 1);
+        if (alignage) {
+            mrs_size += mrs_page - alignage;
+        }
+        trace_vhost_region_add_section_aligned(section->mr->name, mrs_gpa, mrs_size,
+                                               mrs_host);
+    }
 
     if (dev->n_tmp_sections) {
         /* Since we already have at least one section, lets see if