
[v5,07/18] s390x: protvirt: Inhibit balloon when switching to protected mode

Message ID 20200226122038.61481-8-frankja@linux.ibm.com
State New
Series s390x: Protected Virtualization support

Commit Message

Janosch Frank Feb. 26, 2020, 12:20 p.m. UTC
Ballooning in protected VMs can only be done when the guest shares the
pages it gives to the host. Hence, until we have a solution for this
in the guest kernel, we inhibit ballooning when switching into
protected mode and reverse that once we move out of it.

Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
---
 hw/s390x/s390-virtio-ccw.c | 3 +++
 1 file changed, 3 insertions(+)

Comments

David Hildenbrand Feb. 26, 2020, 2:59 p.m. UTC | #1
On 26.02.20 13:20, Janosch Frank wrote:
> Ballooning in protected VMs can only be done when the guest shares the
> pages it gives to the host. Hence, until we have a solution for this
> in the guest kernel, we inhibit ballooning when switching into
> protected mode and reverse that once we move out of it.

I don't understand what you mean here, sorry. zapping a page will mean
that a fresh one will be faulted in when accessed. And AFAIK, that means
it will be encrypted again when needed.

Is that more like the UV will detect this as an integrity issue and
crash the VM?

> 
> Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
> ---
>  hw/s390x/s390-virtio-ccw.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/hw/s390x/s390-virtio-ccw.c b/hw/s390x/s390-virtio-ccw.c
> index 9983165b05..0f4455d1df 100644
> --- a/hw/s390x/s390-virtio-ccw.c
> +++ b/hw/s390x/s390-virtio-ccw.c
> @@ -41,6 +41,7 @@
>  #include "hw/qdev-properties.h"
>  #include "hw/s390x/tod.h"
>  #include "sysemu/sysemu.h"
> +#include "sysemu/balloon.h"
>  #include "hw/s390x/pv.h"
>  #include "migration/blocker.h"
>  
> @@ -336,6 +337,7 @@ static void s390_machine_unprotect(S390CcwMachineState *ms)
>          ms->pv = false;
>      }
>      migrate_del_blocker(pv_mig_blocker);
> +    qemu_balloon_inhibit(false);
>  }
>  
>  static int s390_machine_protect(S390CcwMachineState *ms)
> @@ -344,6 +346,7 @@ static int s390_machine_protect(S390CcwMachineState *ms)
>      CPUState *t;
>      int rc;
>  
> +    qemu_balloon_inhibit(true);
>      if (!pv_mig_blocker) {
>          error_setg(&pv_mig_blocker,
>                     "protected VMs are currently not migrateable.");
>
Christian Borntraeger Feb. 26, 2020, 3:06 p.m. UTC | #2
On 26.02.20 15:59, David Hildenbrand wrote:
> On 26.02.20 13:20, Janosch Frank wrote:
>> Ballooning in protected VMs can only be done when the guest shares the
>> pages it gives to the host. Hence, until we have a solution for this
>> in the guest kernel, we inhibit ballooning when switching into
>> protected mode and reverse that once we move out of it.
> 
> I don't understand what you mean here, sorry. zapping a page will mean
> that a fresh one will be faulted in when accessed. And AFAIK, that means
> it will be encrypted again when needed.
> 
> Is that more like the UV will detect this as an integrity issue and
> crash the VM?

yes, the UV will detect a fresh page as an integrity issue.
Only if the page was defined to be shared by the guest would we avoid the
integrity check.
Janosch Frank Feb. 26, 2020, 3:11 p.m. UTC | #3
On 2/26/20 3:59 PM, David Hildenbrand wrote:
> On 26.02.20 13:20, Janosch Frank wrote:
>> Ballooning in protected VMs can only be done when the guest shares the
>> pages it gives to the host. Hence, until we have a solution for this
>> in the guest kernel, we inhibit ballooning when switching into
>> protected mode and reverse that once we move out of it.
> 
> I don't understand what you mean here, sorry. zapping a page will mean
> that a fresh one will be faulted in when accessed. And AFAIK, that means
> it will be encrypted again when needed.

Yes, as soon as the host alters non-shared memory we'll run into
integrity issues.


I've been talking to Halil after I sent this out, and it looks like we'll
instead try to automatically enable the IOMMU for all devices when
switching into protected mode. He said that if the IOMMU feature is set,
the balloon code will do an early exit during feature negotiation.

> 
> Is that more like the UV will detect this as an integrity issue and
> crash the VM?
> 
>>
>> Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
>> ---
>>  hw/s390x/s390-virtio-ccw.c | 3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/hw/s390x/s390-virtio-ccw.c b/hw/s390x/s390-virtio-ccw.c
>> index 9983165b05..0f4455d1df 100644
>> --- a/hw/s390x/s390-virtio-ccw.c
>> +++ b/hw/s390x/s390-virtio-ccw.c
>> @@ -41,6 +41,7 @@
>>  #include "hw/qdev-properties.h"
>>  #include "hw/s390x/tod.h"
>>  #include "sysemu/sysemu.h"
>> +#include "sysemu/balloon.h"
>>  #include "hw/s390x/pv.h"
>>  #include "migration/blocker.h"
>>  
>> @@ -336,6 +337,7 @@ static void s390_machine_unprotect(S390CcwMachineState *ms)
>>          ms->pv = false;
>>      }
>>      migrate_del_blocker(pv_mig_blocker);
>> +    qemu_balloon_inhibit(false);
>>  }
>>  
>>  static int s390_machine_protect(S390CcwMachineState *ms)
>> @@ -344,6 +346,7 @@ static int s390_machine_protect(S390CcwMachineState *ms)
>>      CPUState *t;
>>      int rc;
>>  
>> +    qemu_balloon_inhibit(true);
>>      if (!pv_mig_blocker) {
>>          error_setg(&pv_mig_blocker,
>>                     "protected VMs are currently not migrateable.");
>>
> 
>
Christian Borntraeger Feb. 26, 2020, 3:13 p.m. UTC | #4
On 26.02.20 16:11, Janosch Frank wrote:
> On 2/26/20 3:59 PM, David Hildenbrand wrote:
>> On 26.02.20 13:20, Janosch Frank wrote:
>>> Ballooning in protected VMs can only be done when the guest shares the
>>> pages it gives to the host. Hence, until we have a solution for this
>>> in the guest kernel, we inhibit ballooning when switching into
>>> protected mode and reverse that once we move out of it.
>>
>> I don't understand what you mean here, sorry. zapping a page will mean
>> that a fresh one will be faulted in when accessed. And AFAIK, that means
>> it will be encrypted again when needed.
> 
> Yes, as soon as the host alters non-shared memory we'll run into
> integrity issues.
> 
> 
> I've been talking to Halil after I sent this out and it looks like we'll
> rather try to automatically enable the IOMMU for all devices when
> switching into protected mode. He said that if the IOMMU is set the
> balloon code will do an early exit on feature negotiation.

I think we should fence the balloon here nevertheless, so the patch in 
itself is probably fine.
David Hildenbrand Feb. 26, 2020, 3:15 p.m. UTC | #5
On 26.02.20 16:13, Christian Borntraeger wrote:
> 
> 
> On 26.02.20 16:11, Janosch Frank wrote:
>> On 2/26/20 3:59 PM, David Hildenbrand wrote:
>>> On 26.02.20 13:20, Janosch Frank wrote:
>>>> Ballooning in protected VMs can only be done when the guest shares the
>>>> pages it gives to the host. Hence, until we have a solution for this
>>>> in the guest kernel, we inhibit ballooning when switching into
>>>> protected mode and reverse that once we move out of it.
>>>
>>> I don't understand what you mean here, sorry. zapping a page will mean
>>> that a fresh one will be faulted in when accessed. And AFAIK, that means
>>> it will be encrypted again when needed.
>>
>> Yes, as soon as the host alters non-shared memory we'll run into
>> integrity issues.
>>
>>
>> I've been talking to Halil after I sent this out and it looks like we'll
>> rather try to automatically enable the IOMMU for all devices when
>> switching into protected mode. He said that if the IOMMU is set the
>> balloon code will do an early exit on feature negotiation.
> 
> I think we should fence the balloon here nevertheless, so the patch in 
> itself is probably fine.

+1, this is a global "don't use ram_block_discard" trigger.
David Hildenbrand Feb. 26, 2020, 3:16 p.m. UTC | #6
On 26.02.20 16:06, Christian Borntraeger wrote:
> 
> 
> On 26.02.20 15:59, David Hildenbrand wrote:
>> On 26.02.20 13:20, Janosch Frank wrote:
>>> Ballooning in protected VMs can only be done when the guest shares the
>>> pages it gives to the host. Hence, until we have a solution for this
>>> in the guest kernel, we inhibit ballooning when switching into
>>> protected mode and reverse that once we move out of it.
>>
>> I don't understand what you mean here, sorry. zapping a page will mean
>> that a fresh one will be faulted in when accessed. And AFAIK, that means
>> it will be encrypted again when needed.
>>
>> Is that more like the UV will detect this as an integrity issue and
>> crash the VM?
> 
> yes, the UV will detect a fresh page as an integrity issue.
> Only if the page was defined to be shared by the guest, we would avoid the
> integrity check.
> 

Please make that clearer in the patch description. With that

Reviewed-by: David Hildenbrand <david@redhat.com>
Janosch Frank Feb. 26, 2020, 3:30 p.m. UTC | #7
On 2/26/20 4:16 PM, David Hildenbrand wrote:
> On 26.02.20 16:06, Christian Borntraeger wrote:
>>
>>
>> On 26.02.20 15:59, David Hildenbrand wrote:
>>> On 26.02.20 13:20, Janosch Frank wrote:
>>>> Ballooning in protected VMs can only be done when the guest shares the
>>>> pages it gives to the host. Hence, until we have a solution for this
>>>> in the guest kernel, we inhibit ballooning when switching into
>>>> protected mode and reverse that once we move out of it.
>>>
>>> I don't understand what you mean here, sorry. zapping a page will mean
>>> that a fresh one will be faulted in when accessed. And AFAIK, that means
>>> it will be encrypted again when needed.
>>>
>>> Is that more like the UV will detect this as an integrity issue and
>>> crash the VM?
>>
>> yes, the UV will detect a fresh page as an integrity issue.
>> Only if the page was defined to be shared by the guest, we would avoid the
>> integrity check.
>>
> 
> Please make that clearer in the patch description. With that
> 
> Reviewed-by: David Hildenbrand <david@redhat.com>
> 

How about:
s390x: protvirt: Inhibit balloon when switching to protected mode

Ballooning in protected VMs can only be done when the guest shares the
pages it gives to the host. If pages are not shared, the integrity
checks will fail once those pages have been altered and are given back
to the guest.

Hence, until we have a solution for this in the guest kernel, we
inhibit ballooning when switching into protected mode and reverse that
once we move out of it.
David Hildenbrand Feb. 26, 2020, 3:31 p.m. UTC | #8
On 26.02.20 16:30, Janosch Frank wrote:
> On 2/26/20 4:16 PM, David Hildenbrand wrote:
>> On 26.02.20 16:06, Christian Borntraeger wrote:
>>>
>>>
>>> On 26.02.20 15:59, David Hildenbrand wrote:
>>>> On 26.02.20 13:20, Janosch Frank wrote:
>>>>> Ballooning in protected VMs can only be done when the guest shares the
>>>>> pages it gives to the host. Hence, until we have a solution for this
>>>>> in the guest kernel, we inhibit ballooning when switching into
>>>>> protected mode and reverse that once we move out of it.
>>>>
>>>> I don't understand what you mean here, sorry. zapping a page will mean
>>>> that a fresh one will be faulted in when accessed. And AFAIK, that means
>>>> it will be encrypted again when needed.
>>>>
>>>> Is that more like the UV will detect this as an integrity issue and
>>>> crash the VM?
>>>
>>> yes, the UV will detect a fresh page as an integrity issue.
>>> Only if the page was defined to be shared by the guest, we would avoid the
>>> integrity check.
>>>
>>
>> Please make that clearer in the patch description. With that
>>
>> Reviewed-by: David Hildenbrand <david@redhat.com>
>>
> 
> How about:
> s390x: protvirt: Inhibit balloon when switching to protected mode
> 
> Ballooning in protected VMs can only be done when the guest shares the
> pages it gives to the host. If pages are not shared, the integrity
> checks will fail once those pages have been altered and are given back
> to the guest.
> 
> Hence, until we have a solution for this in the guest kernel, we
> inhibit ballooning when switching into protected mode and reverse that
> once we move out of it.
> 

Yep, sounds good!
Cornelia Huck Feb. 26, 2020, 6:24 p.m. UTC | #9
On Wed, 26 Feb 2020 16:30:32 +0100
Janosch Frank <frankja@linux.ibm.com> wrote:

> On 2/26/20 4:16 PM, David Hildenbrand wrote:
> > On 26.02.20 16:06, Christian Borntraeger wrote:  
> >>
> >>
> >> On 26.02.20 15:59, David Hildenbrand wrote:  
> >>> On 26.02.20 13:20, Janosch Frank wrote:  
> >>>> Ballooning in protected VMs can only be done when the guest shares the
> >>>> pages it gives to the host. Hence, until we have a solution for this
> >>>> in the guest kernel, we inhibit ballooning when switching into
> >>>> protected mode and reverse that once we move out of it.  
> >>>
> >>> I don't understand what you mean here, sorry. zapping a page will mean
> >>> that a fresh one will be faulted in when accessed. And AFAIK, that means
> >>> it will be encrypted again when needed.
> >>>
> >>> Is that more like the UV will detect this as an integrity issue and
> >>> crash the VM?  
> >>
> >> yes, the UV will detect a fresh page as an integrity issue.
> >> Only if the page was defined to be shared by the guest, we would avoid the
> >> integrity check.
> >>  
> > 
> > Please make that clearer in the patch description. With that
> > 
> > Reviewed-by: David Hildenbrand <david@redhat.com>
> >   
> 
> How about:
> s390x: protvirt: Inhibit balloon when switching to protected mode
> 
> Ballooning in protected VMs can only be done when the guest shares the
> pages it gives to the host. If pages are not shared, the integrity
> checks will fail once those pages have been altered and are given back
> to the guest.

This makes sense to me...

> 
> Hence, until we have a solution for this in the guest kernel, we
> inhibit ballooning when switching into protected mode and reverse that
> once we move out of it.

...however, I'm scratching my head here.

If we have a future guest that knows how to handle this, how do we
know? We know that the current Linux driver clears
VIRTIO_F_IOMMU_PLATFORM during feature negotiation, and presumably a
guest that knows how to handle this will not do that. But it still
won't work as expected, as we inhibit ballooning...

So, either
- we don't inhibit ballooning now; any guest that clears the (required)
  virtio feature bit won't be able to drive the virtio-balloon device
  anyway, but a future guest that negotiates the bit would work; or
- we inhibit ballooning now; no guest can therefore use ballooning,
  regardless of what they are doing or not doing (this includes guests
  that negotiate the feature bit, but fail to handle pages properly).

Or is there some other reason why we need to inhibit ballooning for
protected vms?
Halil Pasic Feb. 27, 2020, 12:24 p.m. UTC | #10
On Wed, 26 Feb 2020 16:11:03 +0100
Janosch Frank <frankja@linux.ibm.com> wrote:

> On 2/26/20 3:59 PM, David Hildenbrand wrote:
> > On 26.02.20 13:20, Janosch Frank wrote:
> >> Ballooning in protected VMs can only be done when the guest shares the
> >> pages it gives to the host. Hence, until we have a solution for this
> >> in the guest kernel, we inhibit ballooning when switching into
> >> protected mode and reverse that once we move out of it.
> > 
> > I don't understand what you mean here, sorry. zapping a page will mean
> > that a fresh one will be faulted in when accessed. And AFAIK, that means
> > it will be encrypted again when needed.
> 
> Yes, as soon as the host alters non-shared memory we'll run into
> integrity issues.
> 
> 
> I've been talking to Halil after I sent this out and it looks like we'll
> rather try to automatically enable the IOMMU for all devices when
> switching into protected mode. He said that if the IOMMU is set the
> balloon code will do an early exit on feature negotiation.
> 

I have a discussion starter RFC for you.

--------------------------8<----------------------------------------------
From: Halil Pasic <pasic@linux.ibm.com>
Date: Wed, 26 Feb 2020 16:48:21 +0100
Subject: [RFC PATCH 1/1] virtio-ccw: auto-manage VIRTIO_F_IOMMU_PLATFORM

The virtio specification says that the device is to present
VIRTIO_F_ACCESS_PLATFORM (a.k.a. VIRTIO_F_IOMMU_PLATFORM) when the
device "can only access certain memory addresses with said access
specified and/or granted by the platform". This is the case for a
protected VM, as the device can access only memory addresses that are in
pages that are currently shared (only the guest can share/unshare its
pages).

No VM however starts out as a protected VM, but some VMs may be
converted to protected VMs if the guest decides so.

Making the end user explicitly manage the VIRTIO_F_ACCESS_PLATFORM via
the property iommu_on is a minor disaster. If the correctness of the
paravirtualized virtio devices depends on it (and thus, in a sense, the
correctness of the hypervisor does too), then the hypervisor should have
the last word about whether VIRTIO_F_ACCESS_PLATFORM is to be presented
or not.

Let's manage the VIRTIO_F_ACCESS_PLATFORM virtio feature automatically
for virtio-ccw devices, so that we set it before we start the protected
configuration, and unset it when our VM is not protected.

Signed-off-by: Halil Pasic <pasic@linux.ibm.com>
---
NOTES:
* I wanted to have a discussion starter fast; there are multiple open
questions.

* Doing more than one system_reset() is hackish. We should look into
this.

* The user interface implications of this patch are also an ugly can of
worms. Needs to be discussed.

* We should consider keeping the original value if !pv and restoring it
on pv --> !pv, instead of forcing it to be unset when leaving pv (and
not forcing it to be unset when !pv).

* It might make sense to do something like this in virtio core, but I
  decided to start the discussion with a ccw-only change.

* Maybe we need a machine option that enables this sort of behavior,
especially if we decide not to do the conserving/restoring.

* Lightly tested.
---
 hw/s390x/s390-virtio-ccw.c |  4 ++--
 hw/s390x/virtio-ccw.c      | 13 +++++++++++++
 2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/hw/s390x/s390-virtio-ccw.c b/hw/s390x/s390-virtio-ccw.c
index 0f4455d1df..996124f152 100644
--- a/hw/s390x/s390-virtio-ccw.c
+++ b/hw/s390x/s390-virtio-ccw.c
@@ -337,7 +337,7 @@ static void s390_machine_unprotect(S390CcwMachineState *ms)
         ms->pv = false;
     }
     migrate_del_blocker(pv_mig_blocker);
-    qemu_balloon_inhibit(false);
+    subsystem_reset();
 }
 
 static int s390_machine_protect(S390CcwMachineState *ms)
@@ -346,7 +346,6 @@ static int s390_machine_protect(S390CcwMachineState *ms)
     CPUState *t;
     int rc;
 
-    qemu_balloon_inhibit(true);
     if (!pv_mig_blocker) {
         error_setg(&pv_mig_blocker,
                    "protected VMs are currently not migrateable.");
@@ -384,6 +383,7 @@ static int s390_machine_protect(S390CcwMachineState *ms)
     if (rc) {
         goto out_err;
     }
+    subsystem_reset();
     return rc;
 
 out_err:
diff --git a/hw/s390x/virtio-ccw.c b/hw/s390x/virtio-ccw.c
index 13f57e7b67..48bb54f68e 100644
--- a/hw/s390x/virtio-ccw.c
+++ b/hw/s390x/virtio-ccw.c
@@ -869,12 +869,24 @@ static void virtio_ccw_notify(DeviceState *d, uint16_t vector)
     }
 }
 
+static inline void virtio_ccw_pv_enforce_features(VirtIODevice *vdev)
+{
+    S390CcwMachineState *ms = S390_CCW_MACHINE(qdev_get_machine());
+
+    if (ms->pv) {
+        virtio_add_feature(&vdev->host_features, VIRTIO_F_IOMMU_PLATFORM);
+    } else {
+        virtio_clear_feature(&vdev->host_features, VIRTIO_F_IOMMU_PLATFORM);
+    }
+}
+
 static void virtio_ccw_reset(DeviceState *d)
 {
     VirtioCcwDevice *dev = VIRTIO_CCW_DEVICE(d);
     VirtIODevice *vdev = virtio_bus_get_device(&dev->bus);
     VirtIOCCWDeviceClass *vdc = VIRTIO_CCW_DEVICE_GET_CLASS(dev);
 
+    virtio_ccw_pv_enforce_features(vdev);
     virtio_ccw_reset_virtio(dev, vdev);
     if (vdc->parent_reset) {
         vdc->parent_reset(d);
@@ -1103,6 +1115,7 @@ static void virtio_ccw_pre_plugged(DeviceState *d, Error **errp)
     if (dev->max_rev >= 1) {
         virtio_add_feature(&vdev->host_features, VIRTIO_F_VERSION_1);
     }
+    virtio_ccw_pv_enforce_features(vdev);
 }
 
 /* This is called by virtio-bus just after the device is plugged. */

base-commit: 8665f2475f5999d4c9f33598f1360f0b0797c489
Christian Borntraeger March 3, 2020, 12:42 p.m. UTC | #11
On 26.02.20 19:24, Cornelia Huck wrote:
> On Wed, 26 Feb 2020 16:30:32 +0100
> Janosch Frank <frankja@linux.ibm.com> wrote:
> 
>> On 2/26/20 4:16 PM, David Hildenbrand wrote:
>>> On 26.02.20 16:06, Christian Borntraeger wrote:  
>>>>
>>>>
>>>> On 26.02.20 15:59, David Hildenbrand wrote:  
>>>>> On 26.02.20 13:20, Janosch Frank wrote:  
>>>>>> Ballooning in protected VMs can only be done when the guest shares the
>>>>>> pages it gives to the host. Hence, until we have a solution for this
>>>>>> in the guest kernel, we inhibit ballooning when switching into
>>>>>> protected mode and reverse that once we move out of it.  
>>>>>
>>>>> I don't understand what you mean here, sorry. zapping a page will mean
>>>>> that a fresh one will be faulted in when accessed. And AFAIK, that means
>>>>> it will be encrypted again when needed.
>>>>>
>>>>> Is that more like the UV will detect this as an integrity issue and
>>>>> crash the VM?  
>>>>
>>>> yes, the UV will detect a fresh page as an integrity issue.
>>>> Only if the page was defined to be shared by the guest, we would avoid the
>>>> integrity check.
>>>>  
>>>
>>> Please make that clearer in the patch description. With that
>>>
>>> Reviewed-by: David Hildenbrand <david@redhat.com>
>>>   
>>
>> How about:
>> s390x: protvirt: Inhibit balloon when switching to protected mode
>>
>> Ballooning in protected VMs can only be done when the guest shares the
>> pages it gives to the host. If pages are not shared, the integrity
>> checks will fail once those pages have been altered and are given back
>> to the guest.
> 
> This makes sense to me...
> 
>>
>> Hence, until we have a solution for this in the guest kernel, we
>> inhibit ballooning when switching into protected mode and reverse that
>> once we move out of it.
> 
> ...however, I'm scratching my head here.
> 
> If we have a future guest that knows how to handle this, how do we
> know? We know that the current Linux driver clears
> VIRTIO_F_IOMMU_PLATFORM during feature negotiation, and presumably a
> guest that knows how to handle this will not do that. But it still
> won't work as expected, as we inhibit ballooning...
> 
> So, either
> - we don't inhibit ballooning now; any guest that clears the (required)
>   virtio feature bit won't be able to drive the virtio-balloon device
>   anyway, but a future guest that negotiates the bit would work; or
> - we inhibit ballooning now; no guest can therefore use ballooning,
>   regardless what they are doing or not doing (this includes guests
>   that negotiate the feature bit, but fail to handle pages properly).
> 
> Or is there some other reason why we need to inhibit ballooning for
> protected vms?


So here is my proposal:
1. We block ballooning now in QEMU (take this patch now).
2. Later, Halil will provide a change to virtio that removes the blocker
and adds VIRTIO_F_IOMMU_PLATFORM automatically in QEMU when doing the
protvirt switch. This is OK, as the guest balloon driver will refuse to
work once the IOMMU feature is set (see
https://lore.kernel.org/qemu-devel/20200227132402.67a38047.pasic@linux.ibm.com/).
3. Later we can fix the guest balloon driver to do the right thing for
this case (e.g. issue the make-shared call).
Halil Pasic March 19, 2020, 1:42 p.m. UTC | #12
On Thu, 27 Feb 2020 13:24:02 +0100
Halil Pasic <pasic@linux.ibm.com> wrote:

> On Wed, 26 Feb 2020 16:11:03 +0100
> Janosch Frank <frankja@linux.ibm.com> wrote:
> 
> > On 2/26/20 3:59 PM, David Hildenbrand wrote:
> > > On 26.02.20 13:20, Janosch Frank wrote:
> > >> Ballooning in protected VMs can only be done when the guest shares the
> > >> pages it gives to the host. Hence, until we have a solution for this
> > >> in the guest kernel, we inhibit ballooning when switching into
> > >> protected mode and reverse that once we move out of it.
> > > 
> > > I don't understand what you mean here, sorry. zapping a page will mean
> > > that a fresh one will be faulted in when accessed. And AFAIK, that means
> > > it will be encrypted again when needed.
> > 
> > Yes, as soon as the host alters non-shared memory we'll run into
> > integrity issues.
> > 
> > 
> > I've been talking to Halil after I sent this out and it looks like we'll
> > rather try to automatically enable the IOMMU for all devices when
> > switching into protected mode. He said that if the IOMMU is set the
> > balloon code will do an early exit on feature negotiation.
> > 
> 
> I have a discussion starter RFC for you.
> 
> --------------------------8<----------------------------------------------
> From: Halil Pasic <pasic@linux.ibm.com>
> Date: Wed, 26 Feb 2020 16:48:21 +0100
> Subject: [RFC PATCH 1/1] virtio-ccw: auto-manage VIRTIO_F_IOMMU_PLATFORM
> 
> The virtio specification tells that the device is to present
> VIRTIO_F_ACCESS_PLATFORM (a.k.a. VIRTIO_F_IOMMU_PLATFORM) when the
> device "can only access certain memory addresses with said access
> specified and/or granted by the platform". This is the case for a
> protected VM, as the device can access only memory addresses that are in
> pages that are currently shared (only the guest can share/unsare its
> page).
> 
> No VM however starts out as a protected VM, but some VMs may be
> converted to protected VMs if the guest decides so.
> 
> Making the end user explicitly manage the VIRTIO_F_ACCESS_PLATFORM via
> the property iommu_on is a minor disaster. If the correctness of the
> paravirtualized virtio devices depends (and thus in a sense the
> correctness of the hypervisor), then the hypervisor should have the last
> word about whether VIRTIO_F_ACCESS_PLATFORM is to be presented or not.
> 
> Let's manage the VIRTIO_F_ACCESS_PLATFORM virtio feature automatically
> for virtio-ccw devices, so that we set it before we start the protected
> configuration, and unset it when our VM is not protected.
> 
> Signed-off-by: Halil Pasic <pasic@linux.ibm.com>
> ---
> NOTES:
> * I wanted to have a discussion starter fast, there are multiple open
> questions.
> 
> * Doing more than one system_resets() is hackish.  We
> should look into this.
> 
> * The user interface implications of this patch are also an ugly can of
> worms. Needs to be discussed.
> 
> * We should consider keeping the original value if !pv and restoring it
> on pv --> !pv, instead of forcing to unset when leaving pv, and actually
> not forcing unset when !pv.
> 
> * It might make sense to do something like this in virtio core, but I
>   decided start the discussion with a ccw-only change.
> 
> * Maybe we need a machine option that enables this sort of behavior,
> especially if we decide not to do the conserving/restoring.
> 
> * Lightly tested.

ping

Any interest in this?

> ---
>  hw/s390x/s390-virtio-ccw.c |  4 ++--
>  hw/s390x/virtio-ccw.c      | 13 +++++++++++++
>  2 files changed, 15 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/s390x/s390-virtio-ccw.c b/hw/s390x/s390-virtio-ccw.c
> index 0f4455d1df..996124f152 100644
> --- a/hw/s390x/s390-virtio-ccw.c
> +++ b/hw/s390x/s390-virtio-ccw.c
> @@ -337,7 +337,7 @@ static void s390_machine_unprotect(S390CcwMachineState *ms)
>          ms->pv = false;
>      }
>      migrate_del_blocker(pv_mig_blocker);
> -    qemu_balloon_inhibit(false);
> +    subsystem_reset();
>  }
>  
>  static int s390_machine_protect(S390CcwMachineState *ms)
> @@ -346,7 +346,6 @@ static int s390_machine_protect(S390CcwMachineState *ms)
>      CPUState *t;
>      int rc;
>  
> -    qemu_balloon_inhibit(true);
>      if (!pv_mig_blocker) {
>          error_setg(&pv_mig_blocker,
>                     "protected VMs are currently not migrateable.");
> @@ -384,6 +383,7 @@ static int s390_machine_protect(S390CcwMachineState *ms)
>      if (rc) {
>          goto out_err;
>      }
> +    subsystem_reset();
>      return rc;
>  
>  out_err:
> diff --git a/hw/s390x/virtio-ccw.c b/hw/s390x/virtio-ccw.c
> index 13f57e7b67..48bb54f68e 100644
> --- a/hw/s390x/virtio-ccw.c
> +++ b/hw/s390x/virtio-ccw.c
> @@ -869,12 +869,24 @@ static void virtio_ccw_notify(DeviceState *d, uint16_t vector)
>      }
>  }
>  
> +static inline void virtio_ccw_pv_enforce_features(VirtIODevice *vdev)
> +{
> +    S390CcwMachineState *ms = S390_CCW_MACHINE(qdev_get_machine());
> +
> +    if (ms->pv) {
> +        virtio_add_feature(&vdev->host_features, VIRTIO_F_IOMMU_PLATFORM);
> +    } else {
> +        virtio_clear_feature(&vdev->host_features, VIRTIO_F_IOMMU_PLATFORM);
> +    }
> +}
> +
>  static void virtio_ccw_reset(DeviceState *d)
>  {
>      VirtioCcwDevice *dev = VIRTIO_CCW_DEVICE(d);
>      VirtIODevice *vdev = virtio_bus_get_device(&dev->bus);
>      VirtIOCCWDeviceClass *vdc = VIRTIO_CCW_DEVICE_GET_CLASS(dev);
>  
> +    virtio_ccw_pv_enforce_features(vdev);
>      virtio_ccw_reset_virtio(dev, vdev);
>      if (vdc->parent_reset) {
>          vdc->parent_reset(d);
> @@ -1103,6 +1115,7 @@ static void virtio_ccw_pre_plugged(DeviceState *d, Error **errp)
>      if (dev->max_rev >= 1) {
>          virtio_add_feature(&vdev->host_features, VIRTIO_F_VERSION_1);
>      }
> +    virtio_ccw_pv_enforce_features(vdev);
>  }
>  
>  /* This is called by virtio-bus just after the device is plugged. */
> 
> base-commit: 8665f2475f5999d4c9f33598f1360f0b0797c489
David Hildenbrand March 19, 2020, 1:54 p.m. UTC | #13
On 27.02.20 13:24, Halil Pasic wrote:
> On Wed, 26 Feb 2020 16:11:03 +0100
> Janosch Frank <frankja@linux.ibm.com> wrote:
> 
>> On 2/26/20 3:59 PM, David Hildenbrand wrote:
>>> On 26.02.20 13:20, Janosch Frank wrote:
>>>> Ballooning in protected VMs can only be done when the guest shares the
>>>> pages it gives to the host. Hence, until we have a solution for this
>>>> in the guest kernel, we inhibit ballooning when switching into
>>>> protected mode and reverse that once we move out of it.
>>>
>>> I don't understand what you mean here, sorry. zapping a page will mean
>>> that a fresh one will be faulted in when accessed. And AFAIK, that means
>>> it will be encrypted again when needed.
>>
>> Yes, as soon as the host alters non-shared memory we'll run into
>> integrity issues.
>>
>>
>> I've been talking to Halil after I sent this out and it looks like we'll
>> rather try to automatically enable the IOMMU for all devices when
>> switching into protected mode. He said that if the IOMMU is set the
>> balloon code will do an early exit on feature negotiation.
>>
> 
> I have a discussion starter RFC for you.
> 
> --------------------------8<----------------------------------------------
> From: Halil Pasic <pasic@linux.ibm.com>
> Date: Wed, 26 Feb 2020 16:48:21 +0100
> Subject: [RFC PATCH 1/1] virtio-ccw: auto-manage VIRTIO_F_IOMMU_PLATFORM
> 
> The virtio specification tells that the device is to present
> VIRTIO_F_ACCESS_PLATFORM (a.k.a. VIRTIO_F_IOMMU_PLATFORM) when the
> device "can only access certain memory addresses with said access
> specified and/or granted by the platform". This is the case for a
> protected VM, as the device can access only memory addresses that are in
> pages that are currently shared (only the guest can share/unsare its
> page).
> 
> No VM however starts out as a protected VM, but some VMs may be
> converted to protected VMs if the guest decides so.
> 
> Making the end user explicitly manage the VIRTIO_F_ACCESS_PLATFORM via
> the property iommu_on is a minor disaster. If the correctness of the
> paravirtualized virtio devices depends (and thus in a sense the
> correctness of the hypervisor), then the hypervisor should have the last
> word about whether VIRTIO_F_ACCESS_PLATFORM is to be presented or not.
> 
> Let's manage the VIRTIO_F_ACCESS_PLATFORM virtio feature automatically
> for virtio-ccw devices, so that we set it before we start the protected
> configuration, and unset it when our VM is not protected.
> 
> Signed-off-by: Halil Pasic <pasic@linux.ibm.com>
> ---
> NOTES:
> * I wanted to have a discussion starter fast, there are multiple open
> questions.
> 
> * Doing more than one system_resets() is hackish.  We
> should look into this.
> 
> * The user interface implications of this patch are also an ugly can of
> worms. Needs to be discussed.
> 
> * We should consider keeping the original value if !pv and restoring it
> on pv --> !pv, instead of forcing to unset when leaving pv, and actually
> not forcing unset when !pv.
> 
> * It might make sense to do something like this in virtio core, but I
> >   decided to start the discussion with a ccw-only change.
> 
> * Maybe we need a machine option that enables this sort of behavior,
> especially if we decide not to do the conserving/restoring.
> 
> * Lightly tested.
> ---
>  hw/s390x/s390-virtio-ccw.c |  4 ++--
>  hw/s390x/virtio-ccw.c      | 13 +++++++++++++
>  2 files changed, 15 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/s390x/s390-virtio-ccw.c b/hw/s390x/s390-virtio-ccw.c
> index 0f4455d1df..996124f152 100644
> --- a/hw/s390x/s390-virtio-ccw.c
> +++ b/hw/s390x/s390-virtio-ccw.c
> @@ -337,7 +337,7 @@ static void s390_machine_unprotect(S390CcwMachineState *ms)
>          ms->pv = false;
>      }
>      migrate_del_blocker(pv_mig_blocker);
> -    qemu_balloon_inhibit(false);
> +    subsystem_reset();
>  }
>  
>  static int s390_machine_protect(S390CcwMachineState *ms)
> @@ -346,7 +346,6 @@ static int s390_machine_protect(S390CcwMachineState *ms)
>      CPUState *t;
>      int rc;
>  
> -    qemu_balloon_inhibit(true);
>      if (!pv_mig_blocker) {
>          error_setg(&pv_mig_blocker,
>                     "protected VMs are currently not migrateable.");
> @@ -384,6 +383,7 @@ static int s390_machine_protect(S390CcwMachineState *ms)
>      if (rc) {
>          goto out_err;
>      }
> +    subsystem_reset();
>      return rc;
>  
>  out_err:
> diff --git a/hw/s390x/virtio-ccw.c b/hw/s390x/virtio-ccw.c
> index 13f57e7b67..48bb54f68e 100644
> --- a/hw/s390x/virtio-ccw.c
> +++ b/hw/s390x/virtio-ccw.c
> @@ -869,12 +869,24 @@ static void virtio_ccw_notify(DeviceState *d, uint16_t vector)
>      }
>  }
>  
> +static inline void virtio_ccw_pv_enforce_features(VirtIODevice *vdev)
> +{
> +    S390CcwMachineState *ms = S390_CCW_MACHINE(qdev_get_machine());
> +
> +    if (ms->pv) {
> +        virtio_add_feature(&vdev->host_features, VIRTIO_F_IOMMU_PLATFORM);
> +    } else {
> +        virtio_clear_feature(&vdev->host_features, VIRTIO_F_IOMMU_PLATFORM);
> +    }
> +}
> +
>  static void virtio_ccw_reset(DeviceState *d)
>  {
>      VirtioCcwDevice *dev = VIRTIO_CCW_DEVICE(d);
>      VirtIODevice *vdev = virtio_bus_get_device(&dev->bus);
>      VirtIOCCWDeviceClass *vdc = VIRTIO_CCW_DEVICE_GET_CLASS(dev);
>  
> +    virtio_ccw_pv_enforce_features(vdev);
>      virtio_ccw_reset_virtio(dev, vdev);
>      if (vdc->parent_reset) {
>          vdc->parent_reset(d);
> @@ -1103,6 +1115,7 @@ static void virtio_ccw_pre_plugged(DeviceState *d, Error **errp)
>      if (dev->max_rev >= 1) {
>          virtio_add_feature(&vdev->host_features, VIRTIO_F_VERSION_1);
>      }
> +    virtio_ccw_pv_enforce_features(vdev);
>  }
>  
>  /* This is called by virtio-bus just after the device is plugged. */
> 
> base-commit: 8665f2475f5999d4c9f33598f1360f0b0797c489
> 

I asked this question already to Michael (cc) via a different channel,
but here it is again:

Why does the balloon driver not support VIRTIO_F_IOMMU_PLATFORM? It is
absolutely not clear to me. The introducing commit mentioned that it
"bypasses DMA". I fail to see that.

At least the communication via the SG mechanism should work perfectly
fine with an IOMMU enabled. So I assume it boils down to the pages that
we inflate/deflate not being referenced via IOVA?

I don't think they have to be IOVA addresses. We're neither reading nor
writing these pages. We really speak about "physical memory in the
system" when ballooning. Everything else doesn't really make sense.
There is no need to map/unmap pages we inflate/deflate AFAIKs.

IMHO, we should not try to piggy-back on VIRTIO_F_IOMMU_PLATFORM here,
but instead explicitly disable it either in the hypervisor or the guest.

I hope someone can clarify what the real issue with an IOMMU and
ballooning is, because I'll be having the same "issue" with virtio-mem.
Halil Pasic March 19, 2020, 3:40 p.m. UTC | #14
On Thu, 19 Mar 2020 14:54:11 +0100
David Hildenbrand <david@redhat.com> wrote:

> On 27.02.20 13:24, Halil Pasic wrote:
> > On Wed, 26 Feb 2020 16:11:03 +0100
> > Janosch Frank <frankja@linux.ibm.com> wrote:
> > 
> >> On 2/26/20 3:59 PM, David Hildenbrand wrote:
> >>> On 26.02.20 13:20, Janosch Frank wrote:
> >>>> Ballooning in protected VMs can only be done when the guest shares the
> >>>> pages it gives to the host. Hence, until we have a solution for this
> >>>> in the guest kernel, we inhibit ballooning when switching into
> >>>> protected mode and reverse that once we move out of it.
> >>>
> >>> I don't understand what you mean here, sorry. zapping a page will mean
> >>> that a fresh one will be faulted in when accessed. And AFAIK, that means
> >>> it will be encrypted again when needed.
> >>
> >> Yes, as soon as the host alters non-shared memory we'll run into
> >> integrity issues.
> >>
> >>
> >> I've been talking to Halil after I sent this out and it looks like we'll
> >> rather try to automatically enable the IOMMU for all devices when
> >> switching into protected mode. He said that if the IOMMU is set the
> >> balloon code will do an early exit on feature negotiation.
> >>
> > 
> > I have a discussion starter RFC for you.
> > 
> > --------------------------8<----------------------------------------------
> > From: Halil Pasic <pasic@linux.ibm.com>
> > Date: Wed, 26 Feb 2020 16:48:21 +0100
> > Subject: [RFC PATCH 1/1] virtio-ccw: auto-manage VIRTIO_F_IOMMU_PLATFORM
> > 
> > The virtio specification tells that the device is to present
> > VIRTIO_F_ACCESS_PLATFORM (a.k.a. VIRTIO_F_IOMMU_PLATFORM) when the
> > device "can only access certain memory addresses with said access
> > specified and/or granted by the platform". This is the case for a
> > protected VM, as the device can access only memory addresses that are in
> > pages that are currently shared (only the guest can share/unshare its
> > pages).
> > 
> > No VM however starts out as a protected VM, but some VMs may be
> > converted to protected VMs if the guest decides so.
> > 
> > Making the end user explicitly manage the VIRTIO_F_ACCESS_PLATFORM via
> > the property iommu_on is a minor disaster. If the correctness of the
> > paravirtualized virtio devices (and thus, in a sense, the correctness
> > of the hypervisor) depends on it, then the hypervisor should have the
> > last word about whether VIRTIO_F_ACCESS_PLATFORM is to be presented or not.
> > 
> > Let's manage the VIRTIO_F_ACCESS_PLATFORM virtio feature automatically
> > for virtio-ccw devices, so that we set it before we start the protected
> > configuration, and unset it when our VM is not protected.
> > 
> > Signed-off-by: Halil Pasic <pasic@linux.ibm.com>
> > ---
> > NOTES:
> > * I wanted to have a discussion starter fast, there are multiple open
> > questions.
> > 
> > * Doing more than one system_resets() is hackish.  We
> > should look into this.
> > 
> > * The user interface implications of this patch are also an ugly can of
> > worms. Needs to be discussed.
> > 
> > * We should consider keeping the original value if !pv and restoring it
> > on pv --> !pv, instead of forcing to unset when leaving pv, and actually
> > not forcing unset when !pv.
> > 
> > * It might make sense to do something like this in virtio core, but I
> > >   decided to start the discussion with a ccw-only change.
> > 
> > * Maybe we need a machine option that enables this sort of behavior,
> > especially if we decide not to do the conserving/restoring.
> > 
> > * Lightly tested.
> > ---
> >  hw/s390x/s390-virtio-ccw.c |  4 ++--
> >  hw/s390x/virtio-ccw.c      | 13 +++++++++++++
> >  2 files changed, 15 insertions(+), 2 deletions(-)
> > 
> > diff --git a/hw/s390x/s390-virtio-ccw.c b/hw/s390x/s390-virtio-ccw.c
> > index 0f4455d1df..996124f152 100644
> > --- a/hw/s390x/s390-virtio-ccw.c
> > +++ b/hw/s390x/s390-virtio-ccw.c
> > @@ -337,7 +337,7 @@ static void s390_machine_unprotect(S390CcwMachineState *ms)
> >          ms->pv = false;
> >      }
> >      migrate_del_blocker(pv_mig_blocker);
> > -    qemu_balloon_inhibit(false);
> > +    subsystem_reset();
> >  }
> >  
> >  static int s390_machine_protect(S390CcwMachineState *ms)
> > @@ -346,7 +346,6 @@ static int s390_machine_protect(S390CcwMachineState *ms)
> >      CPUState *t;
> >      int rc;
> >  
> > -    qemu_balloon_inhibit(true);
> >      if (!pv_mig_blocker) {
> >          error_setg(&pv_mig_blocker,
> >                     "protected VMs are currently not migrateable.");
> > @@ -384,6 +383,7 @@ static int s390_machine_protect(S390CcwMachineState *ms)
> >      if (rc) {
> >          goto out_err;
> >      }
> > +    subsystem_reset();
> >      return rc;
> >  
> >  out_err:
> > diff --git a/hw/s390x/virtio-ccw.c b/hw/s390x/virtio-ccw.c
> > index 13f57e7b67..48bb54f68e 100644
> > --- a/hw/s390x/virtio-ccw.c
> > +++ b/hw/s390x/virtio-ccw.c
> > @@ -869,12 +869,24 @@ static void virtio_ccw_notify(DeviceState *d, uint16_t vector)
> >      }
> >  }
> >  
> > +static inline void virtio_ccw_pv_enforce_features(VirtIODevice *vdev)
> > +{
> > +    S390CcwMachineState *ms = S390_CCW_MACHINE(qdev_get_machine());
> > +
> > +    if (ms->pv) {
> > +        virtio_add_feature(&vdev->host_features, VIRTIO_F_IOMMU_PLATFORM);
> > +    } else {
> > +        virtio_clear_feature(&vdev->host_features, VIRTIO_F_IOMMU_PLATFORM);
> > +    }
> > +}
> > +
> >  static void virtio_ccw_reset(DeviceState *d)
> >  {
> >      VirtioCcwDevice *dev = VIRTIO_CCW_DEVICE(d);
> >      VirtIODevice *vdev = virtio_bus_get_device(&dev->bus);
> >      VirtIOCCWDeviceClass *vdc = VIRTIO_CCW_DEVICE_GET_CLASS(dev);
> >  
> > +    virtio_ccw_pv_enforce_features(vdev);
> >      virtio_ccw_reset_virtio(dev, vdev);
> >      if (vdc->parent_reset) {
> >          vdc->parent_reset(d);
> > @@ -1103,6 +1115,7 @@ static void virtio_ccw_pre_plugged(DeviceState *d, Error **errp)
> >      if (dev->max_rev >= 1) {
> >          virtio_add_feature(&vdev->host_features, VIRTIO_F_VERSION_1);
> >      }
> > +    virtio_ccw_pv_enforce_features(vdev);
> >  }
> >  
> >  /* This is called by virtio-bus just after the device is plugged. */
> > 
> > base-commit: 8665f2475f5999d4c9f33598f1360f0b0797c489
> > 
> 
> I asked this question already to Michael (cc) via a different channel,
> but here it is again:
> 
> Why does the balloon driver not support VIRTIO_F_IOMMU_PLATFORM? It is
> absolutely not clear to me. The introducing commit mentioned that it
> "bypasses DMA". I fail to see that.
> 
> At least the communication via the SG mechanism should work perfectly
> fine with an IOMMU enabled. So I assume it boils down to the pages that
> we inflate/deflate not being referenced via IOVA?

AFAIU the IOVA/GPA stuff is not the problem here. You have said it
yourself, the SG mechanism would work for balloon out of the box, as it
does for the other virtio devices. 

But VIRTIO_F_ACCESS_PLATFORM (aka VIRTIO_F_IOMMU_PLATFORM)  not presented
means according to Michael that the device has full access to the entire
guest RAM. If VIRTIO_F_ACCESS_PLATFORM is negotiated this may or may not
be the case.

The actual problem is that the pages denoted by the buffer transmitted
via the virtqueue are normally not shared pages. I.e. the hypervisor
can not reuse them (what is the point of balloon inflate). To make this
work, the guest would need to share the pages before saying 'host these
are in my balloon, so you can use them'. This is a piece of logic we
need only if the host/the device does not have full access to the
guest RAM. That is in my opinion why the balloon driver fences
VIRTIO_F_ACCESS_PLATFORM.

Does that make sense?

> 
> I don't think they have to be IOVA addresses. We're neither reading nor
> writing these pages. We really speak about "physical memory in the
> system" when ballooning. Everything else doesn't really make sense.
> There is no need to map/unmap pages we inflate/deflate AFAIKs.
> 
> IMHO, we should not try to piggy-back on VIRTIO_F_IOMMU_PLATFORM here,
> but instead explicitly disable it either in the hypervisor or the guest.
> 

We need a feature bit here. We can say fencing VIRTIO_F_ACCESS_PLATFORM
was a bug, fix that bug, and then invent another 'the guest RAM is
somehow different' feature bit specific to the balloon, and then create
arch hooks in the driver that get active if this feature is negotiated.

I assumed the fact that the balloon driver fences
VIRTIO_F_ACCESS_PLATFORM is not a bug.

> I hope someone can clarify what the real issue with an IOMMU and
> ballooning is, because I'll be having the same "issue" with
> virtio-mem.
> 

The issue is not with the IOMMU, the issue is with restricted access
to guest RAM. The definition of VIRTIO_F_ACCESS_PLATFORM is such that we
pretty much know what's up when VIRTIO_F_ACCESS_PLATFORM is not
presented, but VIRTIO_F_ACCESS_PLATFORM presented can mean a couple of
things.

Regards,
Halil
David Hildenbrand March 19, 2020, 5:31 p.m. UTC | #15
[...]

>>
>> I asked this question already to Michael (cc) via a different channel,
>> but here it is again:
>>
>> Why does the balloon driver not support VIRTIO_F_IOMMU_PLATFORM? It is
>> absolutely not clear to me. The introducing commit mentioned that it
>> "bypasses DMA". I fail to see that.
>>
>> At least the communication via the SG mechanism should work perfectly
>> fine with an IOMMU enabled. So I assume it boils down to the pages that
>> we inflate/deflate not being referenced via IOVA?
> 
> AFAIU the IOVA/GPA stuff is not the problem here. You have said it
> yourself, the SG mechanism would work for balloon out of the box, as it
> does for the other virtio devices. 
> 
> But VIRTIO_F_ACCESS_PLATFORM (aka VIRTIO_F_IOMMU_PLATFORM)  not presented
> means according to Michael that the device has full access to the entire
> guest RAM. If VIRTIO_F_ACCESS_PLATFORM is negotiated this may or may not
> be the case.

So you say

"The virtio specification tells that the device is to present
VIRTIO_F_ACCESS_PLATFORM (a.k.a. VIRTIO_F_IOMMU_PLATFORM) when the
device "can only access certain memory addresses with said access
specified and/or granted by the platform"."

So, AFAIU, *any* virtio device (hypervisor side) has to present this
flag when PV is enabled. In that regard, your patch makes perfect sense
(although I am not sure it's a good idea to overwrite these feature bits
- maybe they should be activated on the cmdline permanently instead when
PV is to be used? (or enable )).

> 
> The actual problem is that the pages denoted by the buffer transmitted
> via the virtqueue are normally not shared pages. I.e. the hypervisor
> can not reuse them (what is the point of balloon inflate). To make this
> work, the guest would need to share the pages before saying 'host these
> are in my balloon, so you can use them'. This is a piece of logic we

What exactly would have to be done in the hypervisor to support it?

Assume we have to trigger sharing/unsharing - this sounds like a very
architecture specific thing? Or is this e.g., doing a map/unmap
operation like mapping/unmapping the SG?

Right now it sounds to me "we have to do $ARCHSPECIFIC when
inflating/deflating in the guest", which feels wrong.

> need only if the host/the device does not have full access to the
> guest RAM. That is in my opinion why the balloon driver fences
> VIRTIO_F_ACCESS_PLATFORM.
> Does that make sense?

Yeah, I understood the "device has to set VIRTIO_F_ACCESS_PLATFORM"
part. Struggling with the "what can the guest driver actually do" part.

> 
>>
>> I don't think they have to be IOVA addresses. We're neither reading nor
>> writing these pages. We really speak about "physical memory in the
>> system" when ballooning. Everything else doesn't really make sense.
>> There is no need to map/unmap pages we inflate/deflate AFAIKs.
>>
>> IMHO, we should not try to piggy-back on VIRTIO_F_IOMMU_PLATFORM here,
>> but instead explicitly disable it either in the hypervisor or the guest.
>>
> 
> We need a feature bit here. We can say fencing VIRTIO_F_ACCESS_PLATFORM
> was a bug, fix that bug, and then invent another 'the guest RAM is
> somehow different' feature bit specific to the balloon, and then create
> arch hooks in the driver that get active if this feature is negotiated.
> 
> I assumed the fact that the balloon driver fences
> VIRTIO_F_ACCESS_PLATFORM is not a bug.
> 
>> I hope someone can clarify what the real issue with an IOMMU and
>> ballooning is, because I'll be having the same "issue" with
>> virtio-mem.
>>
> 
> The issue is not with the IOMMU, the issue is with restricted access
> to guest RAM. The definition of VIRTIO_F_ACCESS_PLATFORM is such that we
> pretty much know what's up when VIRTIO_F_ACCESS_PLATFORM is not
> presented, but VIRTIO_F_ACCESS_PLATFORM presented can mean a couple of
> things.

Understood.
Michael S. Tsirkin March 19, 2020, 5:45 p.m. UTC | #16
On Thu, Mar 19, 2020 at 02:54:11PM +0100, David Hildenbrand wrote:
> Why does the balloon driver not support VIRTIO_F_IOMMU_PLATFORM? It is
> absolutely not clear to me. The introducing commit mentioned that it
> "bypasses DMA". I fail to see that.

Well sure one can put the balloon behind an IOMMU.  It will shuffle PFN
lists through a shared page.  Problem is, you can't run an untrusted
driver with it since if you do it can corrupt guest memory.
And VIRTIO_F_IOMMU_PLATFORM so far meant that you can run
a userspace driver.

Maybe we need a separate feature bit for this kind of thing where you
assume the driver is trusted? Such a bit - unlike
VIRTIO_F_IOMMU_PLATFORM - would allow legacy guests ...
David Hildenbrand March 20, 2020, 9:36 a.m. UTC | #17
On 19.03.20 18:45, Michael S. Tsirkin wrote:
> On Thu, Mar 19, 2020 at 02:54:11PM +0100, David Hildenbrand wrote:
>> Why does the balloon driver not support VIRTIO_F_IOMMU_PLATFORM? It is
>> absolutely not clear to me. The introducing commit mentioned that it
>> "bypasses DMA". I fail to see that.
> 
> Well sure one can put the balloon behind an IOMMU.  It will shuffle PFN
> lists through a shared page.  Problem is, you can't run an untrusted
> driver with it since if you do it can corrupt guest memory.
> And VIRTIO_F_IOMMU_PLATFORM so far meant that you can run
> a userspace driver.

Just to clarify: Is it sufficient to clear VIRTIO_F_IOMMU_PLATFORM in
the *guest kernel driver* to prohibit *guest userspace drivers*?
I would have thought we would have to disallow on the hypervisor/device
side. (no expert on user space drivers, especially how they
detect/enable/access virtio devices)

> 
> Maybe we need a separate feature bit for this kind of thing where you
> assume the driver is trusted? Such a bit - unlike
> VIRTIO_F_IOMMU_PLATFORM - would allow legacy guests ...

Let's take virtio-mem as an example. You cannot zap memory outside of
the scope of a virtio-mem device. So I assume having a user space driver
would be ok (although most probably of limited use :) )?

Still, for virtio-mem, special s390x handling, similar to virtio-balloon
- (un)sharing of pages - would have to be performed.

So some feature bits to cleanly separate the different limitations would
be great. At least in regard to s390x, I guess we don't have to worry
too much about legacy guests.
Halil Pasic March 20, 2020, 6:43 p.m. UTC | #18
On Thu, 19 Mar 2020 18:31:11 +0100
David Hildenbrand <david@redhat.com> wrote:

> [...]
> 
> >>
> >> I asked this question already to Michael (cc) via a different
> >> channel, but here it is again:
> >>
> >> Why does the balloon driver not support VIRTIO_F_IOMMU_PLATFORM? It
> >> is absolutely not clear to me. The introducing commit mentioned
> >> that it "bypasses DMA". I fail to see that.
> >>
> >> At least the communication via the SG mechanism should work
> >> perfectly fine with an IOMMU enabled. So I assume it boils down to
> >> the pages that we inflate/deflate not being referenced via IOVA?
> > 
> > AFAIU the IOVA/GPA stuff is not the problem here. You have said it
> > yourself, the SG mechanism would work for balloon out of the box, as
> > it does for the other virtio devices. 
> > 
> > But VIRTIO_F_ACCESS_PLATFORM (aka VIRTIO_F_IOMMU_PLATFORM)  not
> > presented means according to Michael that the device has full access
> > to the entire guest RAM. If VIRTIO_F_ACCESS_PLATFORM is negotiated
> > this may or may not be the case.
> 
> So you say
> 
> "The virtio specification tells that the device is to present
> VIRTIO_F_ACCESS_PLATFORM (a.k.a. VIRTIO_F_IOMMU_PLATFORM) when the
> device "can only access certain memory addresses with said access
> specified and/or granted by the platform"."
> 
> So, AFAIU, *any* virtio device (hypervisor side) has to present this
> flag when PV is enabled. 

Yes, and making the balloon say bye-bye when running in PV mode is only a
secondary objective. I've compiled some references:

"To summarize, the necessary conditions for a hack along these lines
(using DMA API without VIRTIO_F_ACCESS_PLATFORM) are that we detect that:

  - secure guest mode is enabled - so we know that since we don't share
    most memory regular virtio code won't
    work, even though the buggy hypervisor didn't set VIRTIO_F_ACCESS_PLATFORM" 
(Michael Tsirkin, https://lkml.org/lkml/2020/2/20/1021)
I.e.: PV but !VIRTIO_F_ACCESS_PLATFORM \implies buggy hypervisor


"If VIRTIO_F_ACCESS_PLATFORM is set then things just work.  If
VIRTIO_F_ACCESS_PLATFORM is clear device is supposed to have access to
all of memory.  You can argue in various ways but it's easier to just
declare a behaviour that violates this a bug."
(Michael Tsirkin, https://lkml.org/lkml/2020/2/21/1626)
This one is about all of guest memory, and not just the buffers transferred
via the virtqueue, which surprised me a bit at the beginning. But balloon
actually needs this.

"A device SHOULD offer VIRTIO_F_ACCESS_PLATFORM if its access to memory
is through bus addresses distinct from and translated by the platform to
physical addresses used by the driver, and/or if it can only access
certain memory addresses with said access specified and/or granted by
the platform. A device MAY fail to operate further if
VIRTIO_F_ACCESS_PLATFORM is not accepted. "
(https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#x1-4120002)


> In that regard, your patch makes perfect sense
> (although I am not sure it's a good idea to overwrite these feature
> bits
> - maybe they should be activated on the cmdline permanently instead
> when PV is to be used? (or enable )).

I didn't understand the last part. I believe preserving the
user-specified value when not running in PV mode is better than the hard
overwrite I did here. I wanted a discussion starter.

I think the other option (with respect to let QEMU manage this for user,
i.e. what I try to do here) is to fence the conversion if virtio devices
that do not offer VIRTIO_F_ACCESS_PLATFORM are attached; and disallow
hotplug of such devices at some point during the conversion.

I believe that alternative is even uglier.

IMHO we don't want the end user to fiddle with iommu_platform, because
all the 'benefit' he gets from that is possibility to make a mistake.
For example, I got an internal bug report saying virtio is broken with
PV, which boiled down to an overlooked auto generated NIC, which of
course had iommu_platform (VIRTIO_F_ACCESS_PLATFORM) not set.
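For context, this is the kind of per-device knob the user has to get right by hand today (illustrative command line, device and backend names made up):

```sh
qemu-system-s390x ... \
    -device virtio-net-ccw,netdev=net0,iommu_platform=on \
    -device virtio-blk-ccw,drive=disk0,iommu_platform=on
```

One forgotten device, like the auto-generated NIC above, and the PV guest is broken.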

> 
> > 
> > The actual problem is that the pages denoted by the buffer
> > transmitted via the virtqueue are normally not shared pages. I.e.
> > the hypervisor can not reuse them (what is the point of balloon
> > inflate). To make this work, the guest would need to share the pages
> > before saying 'host these are in my balloon, so you can use them'.
> > This is a piece of logic we
> 
> What exactly would have to be done in the hypervisor to support it?

AFAIK nothing. The guest needs to share the pages, and everything works.
Janosch, can you help me with this one? 

> 
> Assume we have to trigger sharing/unsharing - this sounds like a very
> architecture specific thing?

It is, but any guest having sovereignty over its memory may need
something similar.

> Or is this e.g., doing a map/unmap
> operation like mapping/unmapping the SG?

No this is something different. We need stronger guarantees than the
streaming portion of the DMA API provides. And what we actually want
is not DMA but something very different.

> 
> Right now it sounds to me "we have to do $ARCHSPECIFIC when
> inflating/deflating in the guest", which feels wrong.
> 

It is wrong in a sense. Drivers are mostly supposed to be portable. But
balloon is not a run of the mill device. I don't see any other way to
make this work.

> > need only if the host/the device does not have full access to the
> > guest RAM. That is in my opinion why the balloon driver fences
> > VIRTIO_F_ACCESS_PLATFORM.
> > Does that make sense?
> 
> Yeah, I understood the "device has to set VIRTIO_F_ACCESS_PLATFORM"
> part. Struggling with the "what can the guest driver actually do" part.
> 

Let me try to reword this. The point of PV is that the guest has
exclusive access to its pages unless the guest decides to share some
of them using a dedicated ultravisor call.

The point of the memballoon is, as far as I understand, to effectively
dynamically manage the guest's memory size within given boundaries, and
without requiring memory hotplug. The basic idea is that the pages in
the balloon belong to the host. The host attempting to re-use a
non-shared page of a guest leads to problems. AFAIR the main problem
was that, should we ever want to deflate such a page (make it again
available for guest use), we would need to do an import, and that can
only work if we have the exact same content as when it was exported.
Otherwise the integrity check fails, as if we had a malicious hypervisor
trying to inject stuff into the guest.

I'm sure Janosch can provide a better explanation.

I really don't see another way, how memory ballooning could work with
something like PV, without the balloon driver relinquishing the guests
ownership of the pages that are going to leave the guest via the balloon.

On that note ccing the AMD SEV people. Balloon is at this point
dysfunctional for them as well. @Tom: Right? If yes what problems need to
be solved so virtio-balloon can work under SEV?

Regards,
Halil
Brijesh Singh March 24, 2020, 9:07 p.m. UTC | #19
On 3/20/20 1:43 PM, Halil Pasic wrote:
> On Thu, 19 Mar 2020 18:31:11 +0100
> David Hildenbrand <david@redhat.com> wrote:
>
>> [...]
>>
>>>> I asked this question already to Michael (cc) via a different
>>>> channel, but here it is again:
>>>>
>>>> Why does the balloon driver not support VIRTIO_F_IOMMU_PLATFORM? It
>>>> is absolutely not clear to me. The introducing commit mentioned
>>>> that it "bypasses DMA". I fail to see that.
>>>>
>>>> At least the communication via the SG mechanism should work
>>>> perfectly fine with an IOMMU enabled. So I assume it boils down to
>>>> the pages that we inflate/deflate not being referenced via IOVA?
>>> AFAIU the IOVA/GPA stuff is not the problem here. You have said it
>>> yourself, the SG mechanism would work for balloon out of the box, as
>>> it does for the other virtio devices. 
>>>
>>> But VIRTIO_F_ACCESS_PLATFORM (aka VIRTIO_F_IOMMU_PLATFORM)  not
>>> presented means according to Michael that the device has full access
>>> to the entire guest RAM. If VIRTIO_F_ACCESS_PLATFORM is negotiated
>>> this may or may not be the case.
>> So you say
>>
>> "The virtio specification tells that the device is to present
>> VIRTIO_F_ACCESS_PLATFORM (a.k.a. VIRTIO_F_IOMMU_PLATFORM) when the
>> device "can only access certain memory addresses with said access
>> specified and/or granted by the platform"."
>>
>> So, AFAIU, *any* virtio device (hypervisor side) has to present this
>> flag when PV is enabled. 
> Yes, and balloon says bye bye when running in PV mode is only a secondary
> objective. I've compiled some references:
>
> "To summarize, the necessary conditions for a hack along these lines
> (using DMA API without VIRTIO_F_ACCESS_PLATFORM) are that we detect that:
>
>   - secure guest mode is enabled - so we know that since we don't share
>     most memory regular virtio code won't
>     work, even though the buggy hypervisor didn't set VIRTIO_F_ACCESS_PLATFORM" 
> (Michael Tsirkin, https://lkml.org/lkml/2020/2/20/1021)
> I.e.: PV but !VIRTIO_F_ACCESS_PLATFORM \implies buggy hypervisor
>
>
> "If VIRTIO_F_ACCESS_PLATFORM is set then things just work.  If
> VIRTIO_F_ACCESS_PLATFORM is clear device is supposed to have access to
> all of memory.  You can argue in various ways but it's easier to just
> declare a behaviour that violates this a bug."
> (Michael Tsirkin, https://lkml.org/lkml/2020/2/21/1626)
> This one is about all of guest memory, and not just the buffers transferred
> via the virtqueue, which surprised me a bit at the beginning. But balloon
> actually needs this.
>
> "A device SHOULD offer VIRTIO_F_ACCESS_PLATFORM if its access to memory
> is through bus addresses distinct from and translated by the platform to
> physical addresses used by the driver, and/or if it can only access
> certain memory addresses with said access specified and/or granted by
> the platform. A device MAY fail to operate further if
> VIRTIO_F_ACCESS_PLATFORM is not accepted. "
> (https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#x1-4120002)
>
>
>> In that regard, your patch makes perfect sense
>> (although I am not sure it's a good idea to overwrite these feature
>> bits
>> - maybe they should be activated on the cmdline permanently instead
>> when PV is to be used? (or enable )).
> I didn't understand the last part. I believe conserving the user
> specified value when not running in PV mode is better than the hard
> overwrite I did here. I wanted a discussion starter.
>
> I think the other option (with respect to let QEMU manage this for user,
> i.e. what I try to do here) is to fence the conversion if virtio devices
> that do not offer VIRTIO_F_ACCESS_PLATFORM are attached; and disallow
> hotplug of such devices at some point during the conversion.
>
> I believe that alternative is even uglier.
>
> IMHO we don't want the end user to fiddle with iommu_platform, because
> all the 'benefit' he gets from that is possibility to make a mistake.
> For example, I got an internal bug report saying virtio is broken with
> PV, which boiled down to an overlooked auto generated NIC, which of
> course had iommu_platform (VIRTIO_F_ACCESS_PLATFORM) not set.
>
>>> The actual problem is that the pages denoted by the buffer
>>> transmitted via the virtqueue are normally not shared pages. I.e.
>>> the hypervisor can not reuse them (what is the point of balloon
>>> inflate). To make this work, the guest would need to share the pages
>>> before saying 'host these are in my balloon, so you can use them'.
>>> This is a piece of logic we
>> What exactly would have to be done in the hypervisor to support it?
> AFAIK nothing. The guest needs to share the pages, and everything works.
> Janosch, can you help me with this one? 
>
>> Assume we have to trigger sharing/unsharing - this sounds like a very
>> architecture specific thing?
> It is, but any guest having sovereignty over its memory may need
> something similar.
>
>> Or is this e.g., doing a map/unmap
>> operation like mapping/unmapping the SG?
> No this is something different. We need stronger guarantees than the
> streaming portion of the DMA API provides. And what we actually want
> is not DMA but something very different.
>
>> Right now it sounds to me "we have to do $ARCHSPECIFIC when
>> inflating/deflating in the guest", which feels wrong.
>>
> It is wrong in a sense. Drivers are mostly supposed to be portable. But
> balloon is not a run of the mill device. I don't see any other way to
> make this work.
>
>>> need only if the host/the device does not have full access to the
>>> guest RAM. That is in my opinion why the balloon driver fences
>>> VIRTIO_F_ACCESS_PLATFORM.
>>> Does that make sense?
>> Yeah, I understood the "device has to set VIRTIO_F_ACCESS_PLATFORM"
>> part. Struggling with the "what can the guest driver actually do" part.
>>
> Let me try to reword this. The point of PV is that the guest has
> exclusive access to its pages unless the guest decides to share some
> of them using a dedicated ultravisor call.
>
> The point of the memballoon is, as far as I understand, to effectively
> dynamically manage the guest's memory size within given boundaries, and
> without requiring memory hotplug. The basic idea is that the pages in
> the balloon belong to the host. The host attempting to re-use a
> non-shared page of a guest leads to problems. AFAIR the main problem
> was that should we ever want to deflate such a page (make it again
> available for guest use), we would need to do an import, and that can
> only work if we have the exact same content as when it was exported.
> Otherwise the integrity check fails, as if a malicious hypervisor
> were trying to inject stuff into the guest.
>
> I'm sure Janosch can provide a better explanation.
>
> I really don't see another way, how memory ballooning could work with
> something like PV, without the balloon driver relinquishing the guests
> ownership of the pages that are going to leave the guest via the balloon.
>
> On that note ccing the AMD SEV people. Balloon is at this point
> dysfunctional for them as well. @Tom: Right? If yes what problems need to
> be solved so virtio-balloon can work under SEV?


Yes, balloon does not work for SEV as well. I have not investigated in
great detail what it would take to make it work. At the time, the
main issue was that the balloon virtio device did not have the
VIRTIO_F_ACCESS_PLATFORM flag set, so we were failing to create a
shared virtio ring. Current SEV does not have integrity protection
support, so we should be okay to release a buffer from the guest and
import it back into the guest. We probably need to deal with some cache
flushes, because the cache may still contain the data with a different
C-bit etc. AMD has recently published the SNP architecture, which
supports integrity protection, and it appears that this architecture
provides instructions the guest can use to validate and invalidate
pages as they leave and enter the guest memory space. Overall, I
think it is doable, but the devil is in the details.

> Regards,
> Halil 
>
>
David Hildenbrand March 27, 2020, 10:50 a.m. UTC | #20
>> So, AFAIU, *any* virtio device (hypervisor side) has to present this
>> flag when PV is enabled. 
> 
> Yes, and balloon says bye bye when running in PV mode is only a secondary
> objective. I've compiled some references:

Thanks!

> 
> "To summarize, the necessary conditions for a hack along these lines
> (using DMA API without VIRTIO_F_ACCESS_PLATFORM) are that we detect that:
> 
>   - secure guest mode is enabled - so we know that since we don't share
>     most memory regular virtio code won't
>     work, even though the buggy hypervisor didn't set VIRTIO_F_ACCESS_PLATFORM" 
> (Michael Tsirkin, https://lkml.org/lkml/2020/2/20/1021)
> I.e.: PV but !VIRTIO_F_ACCESS_PLATFORM implies a buggy hypervisor
> 
> 
> "If VIRTIO_F_ACCESS_PLATFORM is set then things just work.  If
> VIRTIO_F_ACCESS_PLATFORM is clear device is supposed to have access to
> all of memory.  You can argue in various ways but it's easier to just
> declare a behaviour that violates this a bug."
> (Michael Tsirkin, https://lkml.org/lkml/2020/2/21/1626)
> This one is about all of guest memory, and not just the buffers transferred
> via the virtqueue, which surprised me a bit at the beginning. But balloon
> actually needs this.
> 
> "A device SHOULD offer VIRTIO_F_ACCESS_PLATFORM if its access to memory
> is through bus addresses distinct from and translated by the platform to
> physical addresses used by the driver, and/or if it can only access
> certain memory addresses with said access specified and/or granted by
> the platform. A device MAY fail to operate further if
> VIRTIO_F_ACCESS_PLATFORM is not accepted. "
> (https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#x1-4120002)
> 

> 
>> In that regard, your patch makes perfect sense
>> (although I am not sure it's a good idea to overwrite these feature
>> bits
>> - maybe they should be activated on the cmdline permanently instead
>> when PV is to be used? (or enable )).
> 
> I didn't understand the last part. I believe conserving the user
> specified value when not running in PV mode is better than the hard
> overwrite I did here. I wanted a discussion starter.
> 
> I think the other option (with respect to let QEMU manage this for user,
> i.e. what I try to do here) is to fence the conversion if virtio devices
> that do not offer VIRTIO_F_ACCESS_PLATFORM are attached; and disallow
> hotplug of such devices at some point during the conversion.
> 
> I believe that alternative is even uglier.
> 
> IMHO we don't want the end user to fiddle with iommu_platform, because
> all the 'benefit' they get from that is the possibility of making a mistake.
> For example, I got an internal bug report saying virtio is broken with
> PV, which boiled down to an overlooked auto generated NIC, which of
> course had iommu_platform (VIRTIO_F_ACCESS_PLATFORM) not set.
> 
>>
>>>
>>> The actual problem is that the pages denoted by the buffer
>>> transmitted via the virtqueue are normally not shared pages. I.e.
>>> the hypervisor cannot reuse them (which is the whole point of balloon
>>> inflate). To make this work, the guest would need to share the pages
>>> before saying 'host these are in my balloon, so you can use them'.
>>> This is a piece of logic we
>>
>> What exactly would have to be done in the hypervisor to support it?
> 
> AFAIK nothing. The guest needs to share the pages, and everything works.
> Janosch, can you help me with this one? 
> 

See below, making this work on the hypervisor side would be much cleaner
IMHO, but most probably not possible due to guest integrity.

FWIW, "Free page reporting" will (never) work with PV, where there is
basically no manual "deflation" step anymore.

>>
>> Assume we have to trigger sharing/unsharing - this sounds like a very
>> architecture specific thing?
> 
> It is, but any guest having sovereignty over its memory may need
> something similar.
> 
>> Or is this e.g., doing a map/unmap
>> operation like mapping/unmapping the SG?
> 
> No this is something different. We need stronger guarantees than the
> streaming portion of the DMA API provides. And what we actually want
> is not DMA but something very different.

Right, that's what I was expecting ...

> 
>>
>> Right now it sounds to me "we have to do $ARCHSPECIFIC when
>> inflating/deflating in the guest", which feels wrong.
>>
> 
> It is wrong in a sense. Drivers are mostly supposed to be portable. But
> balloon is not a run of the mill device. I don't see any other way to
> make this work.

Well, it is mostly architecture independent until now ...

> 
>>> need only if the host/the device does not have full access to the
>>> guest RAM. That is in my opinion why the balloon driver fences
>>> VIRTIO_F_ACCESS_PLATFORM.
>>> Does that make sense?
>>
>> Yeah, I understood the "device has to set VIRTIO_F_ACCESS_PLATFORM"
>> part. Struggling with the "what can the guest driver actually do" part.
>>
> 
> Let me try to reword this. The point of PV is that the guest has
> exclusive access to its pages unless the guest decides to share some
> of them using a dedicated ultravisor call.
> 
> The point of the memballoon is, as far as I understand, to effectively
> dynamically manage the guest's memory size within given boundaries, and
> without requiring memory hotplug. The basic idea is that the pages in
> the balloon belong to the host. The host attempting to re-use a
> non-shared page of a guest leads to problems. AFAIR the main problem
> was that should we ever want to deflate such a page (make it again
> available for guest use), we would need to do an import, and that can
> only work if we have the exact same content as when it was exported.
> Otherwise the integrity check fails, as if a malicious hypervisor
> were trying to inject stuff into the guest.
> 
> I'm sure Janosch can provide a better explanation.
> 
> I really don't see another way, how memory ballooning could work with
> something like PV, without the balloon driver relinquishing the guests
> ownership of the pages that are going to leave the guest via the balloon.
>
> On that note ccing the AMD SEV people. Balloon is at this point
> dysfunctional for them as well. @Tom: Right? If yes what problems need to
> be solved so virtio-balloon can work under SEV?

SEV even pins all guest memory, so it's useless and would be even
dangerous to use.


Some thoughts:


1. I would really prefer if there is a way to zero-out+share a page and
zero-out+unshare a page triggered by the hypervisor. Then only the
hypervisor has to do "the right thing" when
inflating/deflating/rebooting etc. I know we can "unshare all" via the
UV - we e.g., have to do that on reboots. But I assume this might mess
with "guest integrity" (being able to zero out random guest pages
technically) and is therefore not possible.


2. Have some other way to communicate "careful, ballooning won't work".
E.g., the guest detecting *itself* that it is running inside a PV
environment and not loading virtio-balloon until it properly
shares/unshares. Again, piggy-backing on IOMMU/VIRTIO_F_ACCESS_PLATFORM
somehow feels wrong.


E.g., even once you supported inflation/deflation in virtio-balloon, free
page reporting could still not be supported. So it's more than just a single
arch specific inflation/deflation callback.


And virtio-mem [1] will have similar issues once we want to use that on
s390x. But there, an arch-specific share/unshare callback should be
sufficient most probably. Still, there would have to be a way to block
it on s390x PV until implemented. Ideally it will be the same as for
virtio-balloon.

Again, being able to do that in the hypervisor instead of in the guest
would be much cleaner.

[1] https://lkml.kernel.org/r/20200311171422.10484-1-david@redhat.com
Michael S. Tsirkin March 29, 2020, 2:48 p.m. UTC | #21
On Fri, Mar 20, 2020 at 10:36:53AM +0100, David Hildenbrand wrote:
> On 19.03.20 18:45, Michael S. Tsirkin wrote:
> > On Thu, Mar 19, 2020 at 02:54:11PM +0100, David Hildenbrand wrote:
> >> Why does the balloon driver not support VIRTIO_F_IOMMU_PLATFORM? It is
> >> absolutely not clear to me. The introducing commit mentioned that it
> >> "bypasses DMA". I fail to see that.
> > 
> > Well sure one can put the balloon behind an IOMMU.  It will shuffle PFN
> > lists through a shared page.  Problem is, you can't run an untrusted
> > driver with it since if you do it can corrupt guest memory.
> > And VIRTIO_F_IOMMU_PLATFORM so far meant that you can run
> > a userspace driver.
> 
> Just to clarify: Is it sufficient to clear VIRTIO_F_IOMMU_PLATFORM in
> the *guest kernel driver* to prohibit *guest userspace drivers*?

No it isn't sufficient.

> I would have thought we would have to disallow on the hypervisor/device
> side. (no expert on user space drivers, especially how they
> detect/enable/access virtio devices)

QEMU does exactly this:

static int virtio_validate_features(VirtIODevice *vdev)
{
    VirtioDeviceClass *k = VIRTIO_DEVICE_GET_CLASS(vdev);

    if (virtio_host_has_feature(vdev, VIRTIO_F_IOMMU_PLATFORM) &&
        !virtio_vdev_has_feature(vdev, VIRTIO_F_IOMMU_PLATFORM)) {
        return -EFAULT;
    }
...
}


> > 
> > Maybe we need a separate feature bit for this kind of thing where you
> > assume the driver is trusted? Such a bit - unlike
> > VIRTIO_F_IOMMU_PLATFORM - would allow legacy guests ...
> 
> Let's take virtio-mem as an example. You cannot zap memory outside of
> the scope of a virtio-mem device. So I assume having a user space driver
> would be ok (although most probably of limited use :) )?
> 
> Still, for virtio-mem, special s390x handling, similar to virtio-balloon
> - (un)sharing of pages - would have to be performed.
> 
> So some feature bits to cleanly separate the different limitations would
> be great. At least in regard to s390x, I guess we don't have to worry
> too much about legacy guests.

So if you have the cycles to think through and document how balloon
interacts with different access limitations, that would be great!

> -- 
> Thanks,
> 
> David / dhildenb
David Hildenbrand March 31, 2020, 8:59 a.m. UTC | #22
>> I would have thought we would have to disallow on the hypervisor/device
>> side. (no expert on user space drivers, especially how they
>> detect/enable/access virtio devices)
> 
> QEMU does exactly this:
> 
> static int virtio_validate_features(VirtIODevice *vdev)
> {
>     VirtioDeviceClass *k = VIRTIO_DEVICE_GET_CLASS(vdev);
> 
>     if (virtio_host_has_feature(vdev, VIRTIO_F_IOMMU_PLATFORM) &&
>         !virtio_vdev_has_feature(vdev, VIRTIO_F_IOMMU_PLATFORM)) {
>         return -EFAULT;
>     }
> ...
> }

Okay, that makes sense. Thanks!

> 
> 
>>>
>>> Maybe we need a separate feature bit for this kind of thing where you
>>> assume the driver is trusted? Such a bit - unlike
>>> VIRTIO_F_IOMMU_PLATFORM - would allow legacy guests ...
>>
>> Let's take virtio-mem as an example. You cannot zap memory outside of
>> the scope of a virtio-mem device. So I assume having a user space driver
>> would be ok (although most probably of limited use :) )?
>>
>> Still, for virtio-mem, special s390x handling, similar to virtio-balloon
>> - (un)sharing of pages - would have to be performed.
>>
>> So some feature bits to cleanly separate the different limitations would
>> be great. At least in regard to s390x, I guess we don't have to worry
>> too much about legacy guests.
> 
> So if you have the cycles to think through and document how balloon
> interacts with different access limitations, that would be great!

I'll add it to my ever-growing todo list. Would be great if Halil could
help out thinking how to express the semantics so we can handle PV
properly (both, virtio-balloon, but also virtio-mem).

Cheers!
diff mbox series

Patch

diff --git a/hw/s390x/s390-virtio-ccw.c b/hw/s390x/s390-virtio-ccw.c
index 9983165b05..0f4455d1df 100644
--- a/hw/s390x/s390-virtio-ccw.c
+++ b/hw/s390x/s390-virtio-ccw.c
@@ -41,6 +41,7 @@ 
 #include "hw/qdev-properties.h"
 #include "hw/s390x/tod.h"
 #include "sysemu/sysemu.h"
+#include "sysemu/balloon.h"
 #include "hw/s390x/pv.h"
 #include "migration/blocker.h"
 
@@ -336,6 +337,7 @@  static void s390_machine_unprotect(S390CcwMachineState *ms)
         ms->pv = false;
     }
     migrate_del_blocker(pv_mig_blocker);
+    qemu_balloon_inhibit(false);
 }
 
 static int s390_machine_protect(S390CcwMachineState *ms)
@@ -344,6 +346,7 @@  static int s390_machine_protect(S390CcwMachineState *ms)
     CPUState *t;
     int rc;
 
+    qemu_balloon_inhibit(true);
     if (!pv_mig_blocker) {
         error_setg(&pv_mig_blocker,
                    "protected VMs are currently not migrateable.");