
[v3,12/15] util: vfio-helpers: Implement ram_block_resized()

Message ID 20200227101205.5616-13-david@redhat.com
State New
Series Ram blocks with resizeable anonymous allocations under POSIX

Commit Message

David Hildenbrand Feb. 27, 2020, 10:12 a.m. UTC
Let's implement ram_block_resized(), allowing resizeable mappings.

For resizeable mappings, we reserve $max_size IOVA address space, but only
map $size of it. When resizing, unmap the old part and remap the new
part. We'll need a new ioctl to do this atomically (e.g., to resize
while the guest is running - not allowed for now).

Cc: Richard Henderson <rth@twiddle.net>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 util/trace-events   |  5 ++--
 util/vfio-helpers.c | 70 ++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 66 insertions(+), 9 deletions(-)

Comments

Peter Xu Feb. 28, 2020, 7:42 p.m. UTC | #1
On Thu, Feb 27, 2020 at 11:12:02AM +0100, David Hildenbrand wrote:
> Let's implement ram_block_resized(), allowing resizeable mappings.
> 
> For resizeable mappings, we reserve $max_size IOVA address space, but only
> map $size of it. When resizing, unmap the old part and remap the new
> part. We'll need a new ioctl to do this atomically (e.g., to resize
> while the guest is running - not allowed for now).

Curious: I think it's true for now because resizing only happens
during reboot or destination VM during migration (but before
switching).  However is that guaranteed too in the future?

[...]

> @@ -631,7 +658,7 @@ int qemu_vfio_dma_map(QEMUVFIOState *s, void *host, size_t size,
>                  qemu_vfio_remove_mapping(s, mapping);
>                  goto out;
>              }
> -            s->low_water_mark += size;
> +            s->low_water_mark += max_size;

I think it's fine to only increase the low water mark here, however
imo it would be better to also cache the max size in IOVAMapping too,
then in resize() we double check new_size <= max_size?  It also makes
IOVAMapping more self contained.

Thanks,
Peter Xu Feb. 28, 2020, 7:55 p.m. UTC | #2
On Thu, Feb 27, 2020 at 11:12:02AM +0100, David Hildenbrand wrote:
> +static void qemu_vfio_dma_map_resize(QEMUVFIOState *s, void *host,
> +                                     size_t old_size, size_t new_size)
> +{
> +    IOVAMapping *m;
> +    int index = 0;
> +
> +    qemu_mutex_lock(&s->lock);
> +    m = qemu_vfio_find_mapping(s, host, &index);
> +    if (!m) {
> +        return;
> +    }
> +    assert(m->size == old_size);
> +
> +    /* Note: Not atomic - we need a new ioctl for that. */
> +    qemu_vfio_undo_mapping(s, m->iova, m->size);
> +    qemu_vfio_do_mapping(s, host, m->iova, new_size);

Another way to ask my previous question 1 (in the other reply): Can we
simply map/unmap the extra, while keep the shared untouched?

Thanks,

> +
> +    m->size = new_size;
> +    assert(qemu_vfio_verify_mappings(s));
> +
> +    qemu_mutex_unlock(&s->lock);
> +}
David Hildenbrand Feb. 28, 2020, 8:16 p.m. UTC | #3
> Am 28.02.2020 um 20:43 schrieb Peter Xu <peterx@redhat.com>:
> 
> On Thu, Feb 27, 2020 at 11:12:02AM +0100, David Hildenbrand wrote:
>> Let's implement ram_block_resized(), allowing resizeable mappings.
>> 
>> For resizeable mappings, we reserve $max_size IOVA address space, but only
>> map $size of it. When resizing, unmap the old part and remap the new
>> part. We'll need a new ioctl to do this atomically (e.g., to resize
>> while the guest is running - not allowed for now).
> 
> Curious: I think it's true for now because resizing only happens
> during reboot or destination VM during migration (but before
> switching).  However is that guaranteed too in the future?
> 

E.g., virtio-mem will run mutually exclusive with vfio until vfio no longer blindly pins all guest memory (IOW, until it is compatible with memory overcommit and discarding of ram blocks).

> [...]
> 
>> @@ -631,7 +658,7 @@ int qemu_vfio_dma_map(QEMUVFIOState *s, void *host, size_t size,
>>                 qemu_vfio_remove_mapping(s, mapping);
>>                 goto out;
>>             }
>> -            s->low_water_mark += size;
>> +            s->low_water_mark += max_size;
> 
> I think it's fine to only increase the low water mark here, however
> imo it would be better to also cache the max size in IOVAMapping too,
> then in resize() we double check new_size <= max_size?  It also makes
> IOVAMapping more self contained.
> 

I'll have a look at how much additional code that will imply - if it's simple, I'll do it.

Thanks!

> Thanks,
> 
> -- 
> Peter Xu
>
David Hildenbrand Feb. 28, 2020, 8:19 p.m. UTC | #4
> Am 28.02.2020 um 20:55 schrieb Peter Xu <peterx@redhat.com>:
> 
> On Thu, Feb 27, 2020 at 11:12:02AM +0100, David Hildenbrand wrote:
>> +static void qemu_vfio_dma_map_resize(QEMUVFIOState *s, void *host,
>> +                                     size_t old_size, size_t new_size)
>> +{
>> +    IOVAMapping *m;
>> +    int index = 0;
>> +
>> +    qemu_mutex_lock(&s->lock);
>> +    m = qemu_vfio_find_mapping(s, host, &index);
>> +    if (!m) {
>> +        return;
>> +    }
>> +    assert(m->size == old_size);
>> +
>> +    /* Note: Not atomic - we need a new ioctl for that. */
>> +    qemu_vfio_undo_mapping(s, m->iova, m->size);
>> +    qemu_vfio_do_mapping(s, host, m->iova, new_size);
> 
> Another way to ask my previous question 1 (in the other reply): Can we
> simply map/unmap the extra, while keep the shared untouched?

As far as I understand the kernel implementation, unfortunately no. You might be able to grow (by mapping the delta), but shrinking is not possible AFAIR. And I *think* with many resizes, there could be an issue if I remember correctly.

Thanks!

> 
> Thanks,
> 
>> +
>> +    m->size = new_size;
>> +    assert(qemu_vfio_verify_mappings(s));
>> +
>> +    qemu_mutex_unlock(&s->lock);
>> +}
> 
> -- 
> Peter Xu
>
Peter Xu Feb. 28, 2020, 8:49 p.m. UTC | #5
On Fri, Feb 28, 2020 at 03:19:45PM -0500, David Hildenbrand wrote:
> 
> 
> > Am 28.02.2020 um 20:55 schrieb Peter Xu <peterx@redhat.com>:
> > 
> > On Thu, Feb 27, 2020 at 11:12:02AM +0100, David Hildenbrand wrote:
> >> +static void qemu_vfio_dma_map_resize(QEMUVFIOState *s, void *host,
> >> +                                     size_t old_size, size_t new_size)
> >> +{
> >> +    IOVAMapping *m;
> >> +    int index = 0;
> >> +
> >> +    qemu_mutex_lock(&s->lock);
> >> +    m = qemu_vfio_find_mapping(s, host, &index);
> >> +    if (!m) {
> >> +        return;
> >> +    }
> >> +    assert(m->size == old_size);
> >> +
> >> +    /* Note: Not atomic - we need a new ioctl for that. */
> >> +    qemu_vfio_undo_mapping(s, m->iova, m->size);
> >> +    qemu_vfio_do_mapping(s, host, m->iova, new_size);
> > 
> > Another way to ask my previous question 1 (in the other reply): Can we
> > simply map/unmap the extra, while keep the shared untouched?
> 
> As far as I understand the kernel implementation, unfortunately no. You might be able to grow (by mapping the delta), but shrinking is not possible AFAIR. And I *think* with many resizes, there could be an issue if I remember correctly.

Ah right (and I think the same thing will happen to vfio-pci during
memory_region_set_size()).  Then please double confirm that virtio-mem
is disabled for both vfio-helper users and vfio-pci.  Also, would you
mind comment above the unmap+map with more information?  E.g.:

  For now, we must unmap the whole mapped range first and remap it
  with the new size.  The reason is that VFIO_IOMMU_UNMAP_DMA might
  fail for partial unmaps of previous mappings, so even though we
  could add extra mappings to extend the old range, we would not
  always be able to shrink it.  The side effect is that it is never
  safe to resize while the VM is executing.

Or something better.  With an enriched comment, I think it's fine:

Reviewed-by: Peter Xu <peterx@redhat.com>

Thanks,
David Hildenbrand Feb. 28, 2020, 8:56 p.m. UTC | #6
> Am 28.02.2020 um 21:49 schrieb Peter Xu <peterx@redhat.com>:
> 
> On Fri, Feb 28, 2020 at 03:19:45PM -0500, David Hildenbrand wrote:
>> 
>> 
>>>> Am 28.02.2020 um 20:55 schrieb Peter Xu <peterx@redhat.com>:
>>> 
>>> On Thu, Feb 27, 2020 at 11:12:02AM +0100, David Hildenbrand wrote:
>>>> +static void qemu_vfio_dma_map_resize(QEMUVFIOState *s, void *host,
>>>> +                                     size_t old_size, size_t new_size)
>>>> +{
>>>> +    IOVAMapping *m;
>>>> +    int index = 0;
>>>> +
>>>> +    qemu_mutex_lock(&s->lock);
>>>> +    m = qemu_vfio_find_mapping(s, host, &index);
>>>> +    if (!m) {
>>>> +        return;
>>>> +    }
>>>> +    assert(m->size == old_size);
>>>> +
>>>> +    /* Note: Not atomic - we need a new ioctl for that. */
>>>> +    qemu_vfio_undo_mapping(s, m->iova, m->size);
>>>> +    qemu_vfio_do_mapping(s, host, m->iova, new_size);
>>> 
>>> Another way to ask my previous question 1 (in the other reply): Can we
>>> simply map/unmap the extra, while keep the shared untouched?
>> 
>> As far as I understand the kernel implementation, unfortunately no. You might be able to grow (by mapping the delta), but shrinking is not possible AFAIR. And I *think* with many resizes, there could be an issue if I remember correctly.
> 
> Ah right (and I think the same thing will happen to vfio-pci during
> memory_region_set_size()).  Then please double confirm that virtio-mem
> is disabled for both vfio-helper users and vfio-pci.  Also, would you

Yes, will double check, should have taken care of both.

> mind comment above the unmap+map with more information?  E.g.:
> 

Sure, will add - thanks!

>  For now, we must unmap the whole mapped range first and remap it
>  with the new size.  The reason is that VFIO_IOMMU_UNMAP_DMA might
>  fail for partial unmaps of previous mappings, so even though we
>  could add extra mappings to extend the old range, we would not
>  always be able to shrink it.  The side effect is that it is never
>  safe to resize while the VM is executing.
> 
> Or something better.  With an enriched comment, I think it's fine:
> 
> Reviewed-by: Peter Xu <peterx@redhat.com>
> 

Thanks Peter! Have a nice weekend!
Peter Xu Feb. 28, 2020, 9:01 p.m. UTC | #7
On Fri, Feb 28, 2020 at 09:16:28PM +0100, David Hildenbrand wrote:
> 

[...]

> >> @@ -631,7 +658,7 @@ int qemu_vfio_dma_map(QEMUVFIOState *s, void *host, size_t size,
> >>                 qemu_vfio_remove_mapping(s, mapping);
> >>                 goto out;
> >>             }
> >> -            s->low_water_mark += size;
> >> +            s->low_water_mark += max_size;
> > 
> > I think it's fine to only increase the low water mark here, however
> > imo it would be better to also cache the max size in IOVAMapping too,
> > then in resize() we double check new_size <= max_size?  It also makes
> > IOVAMapping more self contained.
> > 
> 
> I'll have a look at how much additional code that will imply - if it's simple, I'll do it.

AFAICT it'll be as simple as introducing IOVAMapping.max_size, then
pass max_size into qemu_vfio_add_mapping().  Thanks,
David Hildenbrand March 2, 2020, 3:01 p.m. UTC | #8
On 28.02.20 22:01, Peter Xu wrote:
> On Fri, Feb 28, 2020 at 09:16:28PM +0100, David Hildenbrand wrote:
>>
> 
> [...]
> 
>>>> @@ -631,7 +658,7 @@ int qemu_vfio_dma_map(QEMUVFIOState *s, void *host, size_t size,
>>>>                 qemu_vfio_remove_mapping(s, mapping);
>>>>                 goto out;
>>>>             }
>>>> -            s->low_water_mark += size;
>>>> +            s->low_water_mark += max_size;
>>>
>>> I think it's fine to only increase the low water mark here, however
>>> imo it would be better to also cache the max size in IOVAMapping too,
>>> then in resize() we double check new_size <= max_size?  It also makes
>>> IOVAMapping more self contained.
>>>
>>
>> I'll have a look at how much additional code that will imply - if it's simple, I'll do it.
> 
> AFAICT it'll be as simple as introducing IOVAMapping.max_size, then
> pass max_size into qemu_vfio_add_mapping().  Thanks,
> 

Yeah, was answering from my mobile without code at hand :) added! Thanks!

Patch

diff --git a/util/trace-events b/util/trace-events
index 83b6639018..88b7dbf4a5 100644
--- a/util/trace-events
+++ b/util/trace-events
@@ -74,8 +74,9 @@  qemu_mutex_unlock(void *mutex, const char *file, const int line) "released mutex
 
 # vfio-helpers.c
 qemu_vfio_dma_reset_temporary(void *s) "s %p"
-qemu_vfio_ram_block_added(void *s, void *p, size_t size) "s %p host %p size 0x%zx"
-qemu_vfio_ram_block_removed(void *s, void *p, size_t size) "s %p host %p size 0x%zx"
+qemu_vfio_ram_block_added(void *s, void *p, size_t size, size_t max_size) "s %p host %p size 0x%zx max_size 0x%zx"
+qemu_vfio_ram_block_removed(void *s, void *p, size_t size, size_t max_size) "s %p host %p size 0x%zx max_size 0x%zx"
+qemu_vfio_ram_block_resized(void *s, void *p, size_t old_size, size_t new_size) "s %p host %p old_size 0x%zx new_size 0x%zx"
 qemu_vfio_find_mapping(void *s, void *p) "s %p host %p"
 qemu_vfio_new_mapping(void *s, void *host, size_t size, int index, uint64_t iova) "s %p host %p size %zu index %d iova 0x%"PRIx64
 qemu_vfio_do_mapping(void *s, void *host, size_t size, uint64_t iova) "s %p host %p size %zu iova 0x%"PRIx64
diff --git a/util/vfio-helpers.c b/util/vfio-helpers.c
index f0c77f0d69..566eb4e1eb 100644
--- a/util/vfio-helpers.c
+++ b/util/vfio-helpers.c
@@ -372,14 +372,20 @@  fail_container:
     return ret;
 }
 
+static int qemu_vfio_dma_map_resizeable(QEMUVFIOState *s, void *host,
+                                        size_t size, size_t max_size,
+                                        bool temporary, uint64_t *iova);
+static void qemu_vfio_dma_map_resize(QEMUVFIOState *s, void *host,
+                                     size_t old_size, size_t new_size);
+
 static void qemu_vfio_ram_block_added(RAMBlockNotifier *n, void *host,
                                       size_t size, size_t max_size)
 {
     QEMUVFIOState *s = container_of(n, QEMUVFIOState, ram_notifier);
     int ret;
 
-    trace_qemu_vfio_ram_block_added(s, host, max_size);
-    ret = qemu_vfio_dma_map(s, host, max_size, false, NULL);
+    trace_qemu_vfio_ram_block_added(s, host, size, max_size);
+    ret = qemu_vfio_dma_map_resizeable(s, host, size, max_size, false, NULL);
     if (ret) {
         error_report("qemu_vfio_dma_map(%p, %zu) failed: %s", host, max_size,
                      strerror(-ret));
@@ -391,16 +397,28 @@  static void qemu_vfio_ram_block_removed(RAMBlockNotifier *n, void *host,
 {
     QEMUVFIOState *s = container_of(n, QEMUVFIOState, ram_notifier);
     if (host) {
-        trace_qemu_vfio_ram_block_removed(s, host, max_size);
+        trace_qemu_vfio_ram_block_removed(s, host, size, max_size);
         qemu_vfio_dma_unmap(s, host);
     }
 }
 
+static void qemu_vfio_ram_block_resized(RAMBlockNotifier *n, void *host,
+                                        size_t old_size, size_t new_size)
+{
+    QEMUVFIOState *s = container_of(n, QEMUVFIOState, ram_notifier);
+
+    if (host) {
+        trace_qemu_vfio_ram_block_resized(s, host, old_size, new_size);
+        qemu_vfio_dma_map_resize(s, host, old_size, new_size);
+    }
+}
+
 static void qemu_vfio_open_common(QEMUVFIOState *s)
 {
     qemu_mutex_init(&s->lock);
     s->ram_notifier.ram_block_added = qemu_vfio_ram_block_added;
     s->ram_notifier.ram_block_removed = qemu_vfio_ram_block_removed;
+    s->ram_notifier.ram_block_resized = qemu_vfio_ram_block_resized;
     s->low_water_mark = QEMU_VFIO_IOVA_MIN;
     s->high_water_mark = QEMU_VFIO_IOVA_MAX;
     ram_block_notifier_add(&s->ram_notifier);
@@ -597,9 +615,14 @@  static bool qemu_vfio_verify_mappings(QEMUVFIOState *s)
  * the result in @iova if not NULL. The caller need to make sure the area is
  * aligned to page size, and mustn't overlap with existing mapping areas (split
  * mapping status within this area is not allowed).
+ *
+ * If size < max_size, a region of max_size in IOVA address is reserved, such
+ * that the mapping can later be resized. Resizeable mappings are only allowed
+ * for !temporary mappings.
  */
-int qemu_vfio_dma_map(QEMUVFIOState *s, void *host, size_t size,
-                      bool temporary, uint64_t *iova)
+static int qemu_vfio_dma_map_resizeable(QEMUVFIOState *s, void *host,
+                                        size_t size, size_t max_size,
+                                        bool temporary, uint64_t *iova)
 {
     int ret = 0;
     int index;
@@ -608,13 +631,17 @@  int qemu_vfio_dma_map(QEMUVFIOState *s, void *host, size_t size,
 
     assert(QEMU_PTR_IS_ALIGNED(host, qemu_real_host_page_size));
     assert(QEMU_IS_ALIGNED(size, qemu_real_host_page_size));
+    assert(QEMU_IS_ALIGNED(max_size, qemu_real_host_page_size));
+    assert(size == max_size || !temporary);
+    assert(size <= max_size);
+
     trace_qemu_vfio_dma_map(s, host, size, temporary, iova);
     qemu_mutex_lock(&s->lock);
     mapping = qemu_vfio_find_mapping(s, host, &index);
     if (mapping) {
         iova0 = mapping->iova + ((uint8_t *)host - (uint8_t *)mapping->host);
     } else {
-        if (s->high_water_mark - s->low_water_mark + 1 < size) {
+        if (s->high_water_mark - s->low_water_mark + 1 < max_size) {
             ret = -ENOMEM;
             goto out;
         }
@@ -631,7 +658,7 @@  int qemu_vfio_dma_map(QEMUVFIOState *s, void *host, size_t size,
                 qemu_vfio_remove_mapping(s, mapping);
                 goto out;
             }
-            s->low_water_mark += size;
+            s->low_water_mark += max_size;
             qemu_vfio_dump_mappings(s);
         } else {
             iova0 = s->high_water_mark - size;
@@ -650,6 +677,12 @@  out:
     return ret;
 }
 
+int qemu_vfio_dma_map(QEMUVFIOState *s, void *host, size_t size,
+                      bool temporary, uint64_t *iova)
+{
+    return qemu_vfio_dma_map_resizeable(s, host, size, size, temporary, iova);
+}
+
 /* Reset the high watermark and free all "temporary" mappings. */
 int qemu_vfio_dma_reset_temporary(QEMUVFIOState *s)
 {
@@ -694,6 +727,29 @@  out:
     qemu_mutex_unlock(&s->lock);
 }
 
+static void qemu_vfio_dma_map_resize(QEMUVFIOState *s, void *host,
+                                     size_t old_size, size_t new_size)
+{
+    IOVAMapping *m;
+    int index = 0;
+
+    qemu_mutex_lock(&s->lock);
+    m = qemu_vfio_find_mapping(s, host, &index);
+    if (!m) {
+        qemu_mutex_unlock(&s->lock);
+        return;
+    }
+    assert(m->size == old_size);
+
+    /* Note: Not atomic - we need a new ioctl for that. */
+    qemu_vfio_undo_mapping(s, m->iova, m->size);
+    qemu_vfio_do_mapping(s, host, m->iova, new_size);
+
+    m->size = new_size;
+    assert(qemu_vfio_verify_mappings(s));
+
+    qemu_mutex_unlock(&s->lock);
+}
+
 static void qemu_vfio_reset(QEMUVFIOState *s)
 {
     ioctl(s->device, VFIO_DEVICE_RESET);