diff mbox series

hw/vfio: let readonly flag take effect for mmaped regions

Message ID 20200327111934.71066-1-yan.y.zhao@intel.com
State New
Headers show
Series hw/vfio: let readonly flag take effect for mmaped regions | expand

Commit Message

Yan Zhao March 27, 2020, 11:19 a.m. UTC
From: Yan Zhao <yan.y.zhao@intel.com>

currently, vfio regions without VFIO_REGION_INFO_FLAG_WRITE are only
read-only when VFIO_REGION_INFO_FLAG_MMAP is not set.

regions with flag VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_MMAP
are only read-only in host page table for qemu.

This patch sets corresponding ept page entries read-only for regions
with flag VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_MMAP.

accordingly, it ignores guest write when guest writes to the read-only
regions are trapped.

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Xin Zeng <xin.zeng@intel.com>
---
 hw/vfio/common.c | 4 ++++
 memory.c         | 3 +++
 2 files changed, 7 insertions(+)

Comments

Philippe Mathieu-Daudé March 27, 2020, 10:51 a.m. UTC | #1
Hi Yan,

On 3/27/20 12:19 PM, yan.y.zhao@intel.com wrote:
> From: Yan Zhao <yan.y.zhao@intel.com>
> 
> currently, vfio regions without VFIO_REGION_INFO_FLAG_WRITE are only
> read-only when VFIO_REGION_INFO_FLAG_MMAP is not set.
> 
> regions with flag VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_MMAP
> are only read-only in host page table for qemu.
> 
> This patch sets corresponding ept page entries read-only for regions
> with flag VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_MMAP.
> 
> accordingly, it ignores guest write when guest writes to the read-only
> regions are trapped.
> 
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> Signed-off-by: Xin Zeng <xin.zeng@intel.com>
> ---
>   hw/vfio/common.c | 4 ++++
>   memory.c         | 3 +++
>   2 files changed, 7 insertions(+)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 0b3593b3c0..e901621ca0 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -971,6 +971,10 @@ int vfio_region_mmap(VFIORegion *region)
>                                             name, region->mmaps[i].size,
>                                             region->mmaps[i].mmap);
>           g_free(name);
> +
> +        if (!(region->flags & VFIO_REGION_INFO_FLAG_WRITE)) {
> +            memory_region_set_readonly(&region->mmaps[i].mem, true);
> +        }
>           memory_region_add_subregion(region->mem, region->mmaps[i].offset,
>                                       &region->mmaps[i].mem);
>   
> diff --git a/memory.c b/memory.c
> index 601b749906..4b1071dc74 100644
> --- a/memory.c
> +++ b/memory.c
> @@ -1313,6 +1313,9 @@ static void memory_region_ram_device_write(void *opaque, hwaddr addr,
>       MemoryRegion *mr = opaque;
>   
>       trace_memory_region_ram_device_write(get_cpu_index(), mr, addr, data, size);
> +    if (mr->readonly) {
> +        return;
> +    }

Shouldn't this be in memory_region_dispatch_write()?

Please split this patch in 2, this (generic) hunk as first patch, then 
the VFIO more specific change.

>   
>       switch (size) {
>       case 1:
>
Paolo Bonzini March 27, 2020, 4:17 p.m. UTC | #2
On 27/03/20 11:51, Philippe Mathieu-Daudé wrote:
>>  diff --git a/memory.c b/memory.c
>> index 601b749906..4b1071dc74 100644
>> --- a/memory.c
>> +++ b/memory.c
>> @@ -1313,6 +1313,9 @@ static void memory_region_ram_device_write(void
>> *opaque, hwaddr addr,
>>       MemoryRegion *mr = opaque;
>>         trace_memory_region_ram_device_write(get_cpu_index(), mr,
>> addr, data, size);
>> +    if (mr->readonly) {
>> +        return;
>> +    }
> 
> Shouldn't this be in memory_region_dispatch_write()?

No, in general you want memory regions to get writes, so that they
become for example a machine-check exception of some sorts.  However,
memory_region_ram_device_write should probably be changed to a
.write_with_attrs operation, so that it can return MEMTX_ERROR.

> Please split this patch in 2, this (generic) hunk as first patch, then
> the VFIO more specific change.
> 
>>         switch (size) {
>>       case 1:
>>
> 

No need, I can just add my Acked-by for Alex to pick up the patch.

Paolo
Alex Williamson March 27, 2020, 5:25 p.m. UTC | #3
On Fri, 27 Mar 2020 11:19:34 +0000
yan.y.zhao@intel.com wrote:

> From: Yan Zhao <yan.y.zhao@intel.com>
> 
> currently, vfio regions without VFIO_REGION_INFO_FLAG_WRITE are only
> read-only when VFIO_REGION_INFO_FLAG_MMAP is not set.
> 
> regions with flag VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_MMAP
> are only read-only in host page table for qemu.
> 
> This patch sets corresponding ept page entries read-only for regions
> with flag VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_MMAP.
> 
> accordingly, it ignores guest write when guest writes to the read-only
> regions are trapped.
> 
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> Signed-off-by: Xin Zeng <xin.zeng@intel.com>
> ---

Currently we set the r/w protection on the mmap, do I understand
correctly that the change in the vfio code below results in KVM exiting
to QEMU to handle a write to a read-only region and therefore we need
the memory.c change to drop the write?  This prevents a SIGBUS or
similar?

Meanwhile vfio_region_setup() uses the same vfio_region_ops for all
regions and vfio_region_write() would still allow writes, so if the
device were using x-no-mmap=on, I think we'd still get a write to this
region and expect the vfio device to drop it.  Should we prevent that
write in QEMU as well?

Can you also identify what device and region requires this so that we
can decide whether this is QEMU 5.0 or 5.1 material?  PCI BARs are of
course always R/W and the ROM uses different ops and doesn't support
mmap, so this is a device specific region of some sort.  Thanks,

Alex


>  hw/vfio/common.c | 4 ++++
>  memory.c         | 3 +++
>  2 files changed, 7 insertions(+)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 0b3593b3c0..e901621ca0 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -971,6 +971,10 @@ int vfio_region_mmap(VFIORegion *region)
>                                            name, region->mmaps[i].size,
>                                            region->mmaps[i].mmap);
>          g_free(name);
> +
> +        if (!(region->flags & VFIO_REGION_INFO_FLAG_WRITE)) {
> +            memory_region_set_readonly(&region->mmaps[i].mem, true);
> +        }
>          memory_region_add_subregion(region->mem, region->mmaps[i].offset,
>                                      &region->mmaps[i].mem);
>  
> diff --git a/memory.c b/memory.c
> index 601b749906..4b1071dc74 100644
> --- a/memory.c
> +++ b/memory.c
> @@ -1313,6 +1313,9 @@ static void memory_region_ram_device_write(void *opaque, hwaddr addr,
>      MemoryRegion *mr = opaque;
>  
>      trace_memory_region_ram_device_write(get_cpu_index(), mr, addr, data, size);
> +    if (mr->readonly) {
> +        return;
> +    }
>  
>      switch (size) {
>      case 1:
Yan Zhao March 30, 2020, 1:35 a.m. UTC | #4
On Sat, Mar 28, 2020 at 01:25:37AM +0800, Alex Williamson wrote:
> On Fri, 27 Mar 2020 11:19:34 +0000
> yan.y.zhao@intel.com wrote:
> 
> > From: Yan Zhao <yan.y.zhao@intel.com>
> > 
> > currently, vfio regions without VFIO_REGION_INFO_FLAG_WRITE are only
> > read-only when VFIO_REGION_INFO_FLAG_MMAP is not set.
> > 
> > regions with flag VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_MMAP
> > are only read-only in host page table for qemu.
> > 
> > This patch sets corresponding ept page entries read-only for regions
> > with flag VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_MMAP.
> > 
> > accordingly, it ignores guest write when guest writes to the read-only
> > regions are trapped.
> > 
> > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > Signed-off-by: Xin Zeng <xin.zeng@intel.com>
> > ---
> 
> Currently we set the r/w protection on the mmap, do I understand
> correctly that the change in the vfio code below results in KVM exiting
> to QEMU to handle a write to a read-only region and therefore we need
> the memory.c change to drop the write?  This prevents a SIGBUS or
> similar?
yes, correct. the change in memory.c is to prevent a SIGSEGV in host as
it's mmaped to read-only. we think it's better to just drop the writes
from guest rather than corrupt the qemu.

> 
> Meanwhile vfio_region_setup() uses the same vfio_region_ops for all
> regions and vfio_region_write() would still allow writes, so if the
> device were using x-no-mmap=on, I think we'd still get a write to this
> region and expect the vfio device to drop it.  Should we prevent that
> write in QEMU as well?
yes, it expects vfio device to drop it right now.
As the driver sets the flag without VFIO_REGION_INFO_FLAG_WRITE, it should
handle it properly.
both dropping in qemu and dropping in vfio device are fine to us.
we wonder which one is your preference :)


> Can you also identify what device and region requires this so that we
> can decide whether this is QEMU 5.0 or 5.1 material?  PCI BARs are of
> course always R/W and the ROM uses different ops and doesn't support
> mmap, so this is a device specific region of some sort.  Thanks,
> 
It's a virtual mdev device for which we want to emulate a virtual
read-only MMIO BAR.
Is there any consideration that PCI BARs have to be R/W ?
we didn't find it out in PCI specification.

Thanks
Yan


> 
> >  hw/vfio/common.c | 4 ++++
> >  memory.c         | 3 +++
> >  2 files changed, 7 insertions(+)
> > 
> > diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> > index 0b3593b3c0..e901621ca0 100644
> > --- a/hw/vfio/common.c
> > +++ b/hw/vfio/common.c
> > @@ -971,6 +971,10 @@ int vfio_region_mmap(VFIORegion *region)
> >                                            name, region->mmaps[i].size,
> >                                            region->mmaps[i].mmap);
> >          g_free(name);
> > +
> > +        if (!(region->flags & VFIO_REGION_INFO_FLAG_WRITE)) {
> > +            memory_region_set_readonly(&region->mmaps[i].mem, true);
> > +        }
> >          memory_region_add_subregion(region->mem, region->mmaps[i].offset,
> >                                      &region->mmaps[i].mem);
> >  
> > diff --git a/memory.c b/memory.c
> > index 601b749906..4b1071dc74 100644
> > --- a/memory.c
> > +++ b/memory.c
> > @@ -1313,6 +1313,9 @@ static void memory_region_ram_device_write(void *opaque, hwaddr addr,
> >      MemoryRegion *mr = opaque;
> >  
> >      trace_memory_region_ram_device_write(get_cpu_index(), mr, addr, data, size);
> > +    if (mr->readonly) {
> > +        return;
> > +    }
> >  
> >      switch (size) {
> >      case 1:
>
Yan Zhao March 30, 2020, 6:34 a.m. UTC | #5
On Mon, Mar 30, 2020 at 09:35:27AM +0800, Yan Zhao wrote:
> On Sat, Mar 28, 2020 at 01:25:37AM +0800, Alex Williamson wrote:
> > On Fri, 27 Mar 2020 11:19:34 +0000
> > yan.y.zhao@intel.com wrote:
> > 
> > > From: Yan Zhao <yan.y.zhao@intel.com>
> > > 
> > > currently, vfio regions without VFIO_REGION_INFO_FLAG_WRITE are only
> > > read-only when VFIO_REGION_INFO_FLAG_MMAP is not set.
> > > 
> > > regions with flag VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_MMAP
> > > are only read-only in host page table for qemu.
> > > 
> > > This patch sets corresponding ept page entries read-only for regions
> > > with flag VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_MMAP.
> > > 
> > > accordingly, it ignores guest write when guest writes to the read-only
> > > regions are trapped.
> > > 
> > > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > > Signed-off-by: Xin Zeng <xin.zeng@intel.com>
> > > ---
> > 
> > Currently we set the r/w protection on the mmap, do I understand
> > correctly that the change in the vfio code below results in KVM exiting
> > to QEMU to handle a write to a read-only region and therefore we need
> > the memory.c change to drop the write?  This prevents a SIGBUS or
> > similar?
> yes, correct. the change in memory.c is to prevent a SIGSEGV in host as
> it's mmaped to read-only. we think it's better to just drop the writes
> from guest rather than corrupt the qemu.
> 
> > 
> > Meanwhile vfio_region_setup() uses the same vfio_region_ops for all
> > regions and vfio_region_write() would still allow writes, so if the
> > device were using x-no-mmap=on, I think we'd still get a write to this
> > region and expect the vfio device to drop it.  Should we prevent that
> > write in QEMU as well?
> yes, it expects vfio device to drop it right now.
> As the driver sets the flag without VFIO_REGION_INFO_FLAG_WRITE, it should
> handle it properly.
> both dropping in qemu and dropping in vfio device are fine to us.
> we wonder which one is your preference :)
>
> 
> > Can you also identify what device and region requires this so that we
> > can decide whether this is QEMU 5.0 or 5.1 material?  PCI BARs are of
> > course always R/W and the ROM uses different ops and doesn't support
> > mmap, so this is a device specific region of some sort.  Thanks,
> > 
> It's a virtual mdev device for which we want to emulate a virtual
> read-only MMIO BAR.
> Is there any consideration that PCI BARs have to be R/W ?
> we didn't find it out in PCI specification.
> 
>
looks MMIO regions in vfio platform are also possible to be read-only and
mmaped.

Thanks
Yan

> 
> > 
> > >  hw/vfio/common.c | 4 ++++
> > >  memory.c         | 3 +++
> > >  2 files changed, 7 insertions(+)
> > > 
> > > diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> > > index 0b3593b3c0..e901621ca0 100644
> > > --- a/hw/vfio/common.c
> > > +++ b/hw/vfio/common.c
> > > @@ -971,6 +971,10 @@ int vfio_region_mmap(VFIORegion *region)
> > >                                            name, region->mmaps[i].size,
> > >                                            region->mmaps[i].mmap);
> > >          g_free(name);
> > > +
> > > +        if (!(region->flags & VFIO_REGION_INFO_FLAG_WRITE)) {
> > > +            memory_region_set_readonly(&region->mmaps[i].mem, true);
> > > +        }
> > >          memory_region_add_subregion(region->mem, region->mmaps[i].offset,
> > >                                      &region->mmaps[i].mem);
> > >  
> > > diff --git a/memory.c b/memory.c
> > > index 601b749906..4b1071dc74 100644
> > > --- a/memory.c
> > > +++ b/memory.c
> > > @@ -1313,6 +1313,9 @@ static void memory_region_ram_device_write(void *opaque, hwaddr addr,
> > >      MemoryRegion *mr = opaque;
> > >  
> > >      trace_memory_region_ram_device_write(get_cpu_index(), mr, addr, data, size);
> > > +    if (mr->readonly) {
> > > +        return;
> > > +    }
> > >  
> > >      switch (size) {
> > >      case 1:
> > 
>
Alex Williamson March 30, 2020, 2:59 p.m. UTC | #6
On Mon, 30 Mar 2020 02:34:02 -0400
Yan Zhao <yan.y.zhao@intel.com> wrote:

> On Mon, Mar 30, 2020 at 09:35:27AM +0800, Yan Zhao wrote:
> > On Sat, Mar 28, 2020 at 01:25:37AM +0800, Alex Williamson wrote:  
> > > On Fri, 27 Mar 2020 11:19:34 +0000
> > > yan.y.zhao@intel.com wrote:
> > >   
> > > > From: Yan Zhao <yan.y.zhao@intel.com>
> > > > 
> > > > currently, vfio regions without VFIO_REGION_INFO_FLAG_WRITE are only
> > > > read-only when VFIO_REGION_INFO_FLAG_MMAP is not set.
> > > > 
> > > > regions with flag VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_MMAP
> > > > are only read-only in host page table for qemu.
> > > > 
> > > > This patch sets corresponding ept page entries read-only for regions
> > > > with flag VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_MMAP.
> > > > 
> > > > accordingly, it ignores guest write when guest writes to the read-only
> > > > regions are trapped.
> > > > 
> > > > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > > > Signed-off-by: Xin Zeng <xin.zeng@intel.com>
> > > > ---  
> > > 
> > > Currently we set the r/w protection on the mmap, do I understand
> > > correctly that the change in the vfio code below results in KVM exiting
> > > to QEMU to handle a write to a read-only region and therefore we need
> > > the memory.c change to drop the write?  This prevents a SIGBUS or
> > > similar?  
> > yes, correct. the change in memory.c is to prevent a SIGSEGV in host as
> > it's mmaped to read-only. we think it's better to just drop the writes
> > from guest rather than corrupt the qemu.
> >   
> > > 
> > > Meanwhile vfio_region_setup() uses the same vfio_region_ops for all
> > > regions and vfio_region_write() would still allow writes, so if the
> > > device were using x-no-mmap=on, I think we'd still get a write to this
> > > region and expect the vfio device to drop it.  Should we prevent that
> > > write in QEMU as well?  
> > yes, it expects vfio device to drop it right now.
> > As the driver sets the flag without VFIO_REGION_INFO_FLAG_WRITE, it should
> > handle it properly.
> > both dropping in qemu and dropping in vfio device are fine to us.
> > we wonder which one is your preference :)

The kernel and device should always do the right thing, we cannot rely
on the user to honor the mapping, but it's also a reasonable response
from the kernel to kill the process with a SIGSEGV if the user ignores
the protections.  So I don't think it's an either/or, the kernel needs
to do the right thing for itself and in this case QEMU should do the
right thing for itself, which is to drop writes for regions that don't
support it.  So in general, I agree with your patch.
 
> > > Can you also identify what device and region requires this so that we
> > > can decide whether this is QEMU 5.0 or 5.1 material?  PCI BARs are of
> > > course always R/W and the ROM uses different ops and doesn't support
> > > mmap, so this is a device specific region of some sort.  Thanks,
> > >   
> > It's a virtual mdev device for which we want to emulate a virtual
> > read-only MMIO BAR.
> > Is there any consideration that PCI BARs have to be R/W ?
> > we didn't find it out in PCI specification.

What the device chooses to do with writes to a BAR is its own business,
the PCI spec shouldn't try to define that.  There's also no PCI spec
mechanism to declare the access protections for an entire BAR, that's
device specific behavior.  The current QEMU vfio-pci behavior is
therefore somewhat implicit in knowing this for a directly assigned
device.  We can mmap the device and we expect writes to unwritable
registers within that mapping to be dropped.

For an mdev device, we can't rely on the user honoring the access
protections, ie. the user shouldn't be able to exploit the kernel or
device by doing so, but I also agree that QEMU, as a friendly vfio
user, should avoid unsupported operations and protect itself from how
the kernel may handle the fault.

Since this mdev device doesn't exist yet, I'm thinking this is QEMU
v5.1 material though.

> looks MMIO regions in vfio platform are also possible to be read-only and
> mmaped.

Yes.  Thanks,

Alex
Yan Zhao March 31, 2020, 1:59 a.m. UTC | #7
On Mon, Mar 30, 2020 at 10:59:23PM +0800, Alex Williamson wrote:
> On Mon, 30 Mar 2020 02:34:02 -0400
> Yan Zhao <yan.y.zhao@intel.com> wrote:
> 
> > On Mon, Mar 30, 2020 at 09:35:27AM +0800, Yan Zhao wrote:
> > > On Sat, Mar 28, 2020 at 01:25:37AM +0800, Alex Williamson wrote:  
> > > > On Fri, 27 Mar 2020 11:19:34 +0000
> > > > yan.y.zhao@intel.com wrote:
> > > >   
> > > > > From: Yan Zhao <yan.y.zhao@intel.com>
> > > > > 
> > > > > currently, vfio regions without VFIO_REGION_INFO_FLAG_WRITE are only
> > > > > read-only when VFIO_REGION_INFO_FLAG_MMAP is not set.
> > > > > 
> > > > > regions with flag VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_MMAP
> > > > > are only read-only in host page table for qemu.
> > > > > 
> > > > > This patch sets corresponding ept page entries read-only for regions
> > > > > with flag VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_MMAP.
> > > > > 
> > > > > accordingly, it ignores guest write when guest writes to the read-only
> > > > > regions are trapped.
> > > > > 
> > > > > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > > > > Signed-off-by: Xin Zeng <xin.zeng@intel.com>
> > > > > ---  
> > > > 
> > > > Currently we set the r/w protection on the mmap, do I understand
> > > > correctly that the change in the vfio code below results in KVM exiting
> > > > to QEMU to handle a write to a read-only region and therefore we need
> > > > the memory.c change to drop the write?  This prevents a SIGBUS or
> > > > similar?  
> > > yes, correct. the change in memory.c is to prevent a SIGSEGV in host as
> > > it's mmaped to read-only. we think it's better to just drop the writes
> > > from guest rather than corrupt the qemu.
> > >   
> > > > 
> > > > Meanwhile vfio_region_setup() uses the same vfio_region_ops for all
> > > > regions and vfio_region_write() would still allow writes, so if the
> > > > device were using x-no-mmap=on, I think we'd still get a write to this
> > > > region and expect the vfio device to drop it.  Should we prevent that
> > > > write in QEMU as well?  
> > > yes, it expects vfio device to drop it right now.
> > > As the driver sets the flag without VFIO_REGION_INFO_FLAG_WRITE, it should
> > > handle it properly.
> > > both dropping in qemu and dropping in vfio device are fine to us.
> > > we wonder which one is your preference :)
> 
> The kernel and device should always do the right thing, we cannot rely
> on the user to honor the mapping, but it's also a reasonable response
> from the kernel to kill the process with a SIGSEGV if the user ignores
> the protections.  So I don't think it's an either/or, the kernel needs
> to do the right thing for itself and in this case QEMU should do the
> right thing for itself, which is to drop writes for regions that don't
> support it.  So in general, I agree with your patch.
>
hi Alex
so is there anything I need to do? do I need to add a write dropping in
vfio_region_write() too? if yes, do I need to keep the
trace_vfio_region_write() before dropping ?

Thanks
Yan

> > > > Can you also identify what device and region requires this so that we
> > > > can decide whether this is QEMU 5.0 or 5.1 material?  PCI BARs are of
> > > > course always R/W and the ROM uses different ops and doesn't support
> > > > mmap, so this is a device specific region of some sort.  Thanks,
> > > >   
> > > It's a virtual mdev device for which we want to emulate a virtual
> > > read-only MMIO BAR.
> > > Is there any consideration that PCI BARs have to be R/W ?
> > > we didn't find it out in PCI specification.
> 
> What the device chooses to do with writes to a BAR is its own business,
> the PCI spec shouldn't try to define that.  There's also no PCI spec
> mechanism to declare the access protections for an entire BAR, that's
> device specific behavior.  The current QEMU vfio-pci behavior is
> therefore somewhat implicit in knowing this for a directly assigned
> device.  We can mmap the device and we expect writes to unwritable
> registers within that mapping to be dropped.
> 
> For an mdev device, we can't rely on the user honoring the access
> protections, ie. the user shouldn't be able to exploit the kernel or
> device by doing so, but I also agree that QEMU, as a friendly vfio
> user, should avoid unsupported operations and protect itself from how
> the kernel may handle the fault.
> 
> Since this mdev device doesn't exist yet, I'm thinking this is QEMU
> v5.1 material though.
> 
> > looks MMIO regions in vfio platform are also possible to be read-only and
> > mmaped.
> 
> Yes.  Thanks,
> 
> Alex
>
Philippe Mathieu-Daudé March 31, 2020, 7:59 a.m. UTC | #8
On 3/27/20 5:17 PM, Paolo Bonzini wrote:
> On 27/03/20 11:51, Philippe Mathieu-Daudé wrote:
>>>    diff --git a/memory.c b/memory.c
>>> index 601b749906..4b1071dc74 100644
>>> --- a/memory.c
>>> +++ b/memory.c
>>> @@ -1313,6 +1313,9 @@ static void memory_region_ram_device_write(void
>>> *opaque, hwaddr addr,
>>>        MemoryRegion *mr = opaque;
>>>          trace_memory_region_ram_device_write(get_cpu_index(), mr,
>>> addr, data, size);
>>> +    if (mr->readonly) {
>>> +        return;
>>> +    }
>>
>> Shouldn't this be in memory_region_dispatch_write()?
> 
> No, in general you want memory regions to get writes, so that they
> become for example a machine-check exception of some sorts.  However,
> memory_region_ram_device_write should probably be changed to a
> .write_with_attrs operation, so that it can return MEMTX_ERROR.
> 
>> Please split this patch in 2, this (generic) hunk as first patch, then
>> the VFIO more specific change.
>>
>>>          switch (size) {
>>>        case 1:
>>>
>>
> 
> No need, I can just add my Acked-by for Alex to pick up the patch.

Having 2 different fix in 2 different patches helps when cherry-picking 
(bisecting, backporting...) and reverting. My 2 cents anyway.
Alex Williamson March 31, 2020, 7:28 p.m. UTC | #9
On Mon, 30 Mar 2020 21:59:41 -0400
Yan Zhao <yan.y.zhao@intel.com> wrote:

> On Mon, Mar 30, 2020 at 10:59:23PM +0800, Alex Williamson wrote:
> > On Mon, 30 Mar 2020 02:34:02 -0400
> > Yan Zhao <yan.y.zhao@intel.com> wrote:
> >   
> > > On Mon, Mar 30, 2020 at 09:35:27AM +0800, Yan Zhao wrote:  
> > > > On Sat, Mar 28, 2020 at 01:25:37AM +0800, Alex Williamson wrote:    
> > > > > On Fri, 27 Mar 2020 11:19:34 +0000
> > > > > yan.y.zhao@intel.com wrote:
> > > > >     
> > > > > > From: Yan Zhao <yan.y.zhao@intel.com>
> > > > > > 
> > > > > > currently, vfio regions without VFIO_REGION_INFO_FLAG_WRITE are only
> > > > > > read-only when VFIO_REGION_INFO_FLAG_MMAP is not set.
> > > > > > 
> > > > > > regions with flag VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_MMAP
> > > > > > are only read-only in host page table for qemu.
> > > > > > 
> > > > > > This patch sets corresponding ept page entries read-only for regions
> > > > > > with flag VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_MMAP.
> > > > > > 
> > > > > > accordingly, it ignores guest write when guest writes to the read-only
> > > > > > regions are trapped.
> > > > > > 
> > > > > > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > > > > > Signed-off-by: Xin Zeng <xin.zeng@intel.com>
> > > > > > ---    
> > > > > 
> > > > > Currently we set the r/w protection on the mmap, do I understand
> > > > > correctly that the change in the vfio code below results in KVM exiting
> > > > > to QEMU to handle a write to a read-only region and therefore we need
> > > > > the memory.c change to drop the write?  This prevents a SIGBUS or
> > > > > similar?    
> > > > yes, correct. the change in memory.c is to prevent a SIGSEGV in host as
> > > > it's mmaped to read-only. we think it's better to just drop the writes
> > > > from guest rather than corrupt the qemu.
> > > >     
> > > > > 
> > > > > Meanwhile vfio_region_setup() uses the same vfio_region_ops for all
> > > > > regions and vfio_region_write() would still allow writes, so if the
> > > > > device were using x-no-mmap=on, I think we'd still get a write to this
> > > > > region and expect the vfio device to drop it.  Should we prevent that
> > > > > write in QEMU as well?    
> > > > yes, it expects vfio device to drop it right now.
> > > > As the driver sets the flag without VFIO_REGION_INFO_FLAG_WRITE, it should
> > > > handle it properly.
> > > > both dropping in qemu and dropping in vfio device are fine to us.
> > > > we wonder which one is your preference :)  
> > 
> > The kernel and device should always do the right thing, we cannot rely
> > on the user to honor the mapping, but it's also a reasonable response
> > from the kernel to kill the process with a SIGSEGV if the user ignores
> > the protections.  So I don't think it's an either/or, the kernel needs
> > to do the right thing for itself and in this case QEMU should do the
> > right thing for itself, which is to drop writes for regions that don't
> > support it.  So in general, I agree with your patch.
> >  
> hi Alex
> so is there anything I need to do? do I need to add a write dropping in
> vfio_region_write() too? if yes, do I need to keep the
> trace_vfio_region_write() before dropping ?

Hi Yan,

Yes to both feels the most consistent, right?  If we want the same
behavior between mmap'd and non-mmap'd regions, QEMU should drop both
writes.  Your change to memory_region_ram_device_write() drops the
write after tracing.  For both cases though it might be worthwhile to
have a separate trace recorded to indicate a dropped write, probably in
place of the normal trace to avoid confusion.  Thanks,

Alex
Yan Zhao April 1, 2020, 6:45 a.m. UTC | #10
On Wed, Apr 01, 2020 at 03:28:27AM +0800, Alex Williamson wrote:
> On Mon, 30 Mar 2020 21:59:41 -0400
> Yan Zhao <yan.y.zhao@intel.com> wrote:
> 
> > On Mon, Mar 30, 2020 at 10:59:23PM +0800, Alex Williamson wrote:
> > > On Mon, 30 Mar 2020 02:34:02 -0400
> > > Yan Zhao <yan.y.zhao@intel.com> wrote:
> > >   
> > > > On Mon, Mar 30, 2020 at 09:35:27AM +0800, Yan Zhao wrote:  
> > > > > On Sat, Mar 28, 2020 at 01:25:37AM +0800, Alex Williamson wrote:    
> > > > > > On Fri, 27 Mar 2020 11:19:34 +0000
> > > > > > yan.y.zhao@intel.com wrote:
> > > > > >     
> > > > > > > From: Yan Zhao <yan.y.zhao@intel.com>
> > > > > > > 
> > > > > > > currently, vfio regions without VFIO_REGION_INFO_FLAG_WRITE are only
> > > > > > > read-only when VFIO_REGION_INFO_FLAG_MMAP is not set.
> > > > > > > 
> > > > > > > regions with flag VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_MMAP
> > > > > > > are only read-only in host page table for qemu.
> > > > > > > 
> > > > > > > This patch sets corresponding ept page entries read-only for regions
> > > > > > > with flag VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_MMAP.
> > > > > > > 
> > > > > > > accordingly, it ignores guest write when guest writes to the read-only
> > > > > > > regions are trapped.
> > > > > > > 
> > > > > > > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > > > > > > Signed-off-by: Xin Zeng <xin.zeng@intel.com>
> > > > > > > ---    
> > > > > > 
> > > > > > Currently we set the r/w protection on the mmap, do I understand
> > > > > > correctly that the change in the vfio code below results in KVM exiting
> > > > > > to QEMU to handle a write to a read-only region and therefore we need
> > > > > > the memory.c change to drop the write?  This prevents a SIGBUS or
> > > > > > similar?    
> > > > > yes, correct. the change in memory.c is to prevent a SIGSEGV in host as
> > > > > it's mmaped to read-only. we think it's better to just drop the writes
> > > > > from guest rather than corrupt the qemu.
> > > > >     
> > > > > > 
> > > > > > Meanwhile vfio_region_setup() uses the same vfio_region_ops for all
> > > > > > regions and vfio_region_write() would still allow writes, so if the
> > > > > > device were using x-no-mmap=on, I think we'd still get a write to this
> > > > > > region and expect the vfio device to drop it.  Should we prevent that
> > > > > > write in QEMU as well?    
> > > > > yes, it expects vfio device to drop it right now.
> > > > > As the driver sets the flag without VFIO_REGION_INFO_FLAG_WRITE, it should
> > > > > handle it properly.
> > > > > both dropping in qemu and dropping in vfio device are fine to us.
> > > > > we wonder which one is your preference :)  
> > > 
> > > The kernel and device should always do the right thing, we cannot rely
> > > on the user to honor the mapping, but it's also a reasonable response
> > > from the kernel to kill the process with a SIGSEGV if the user ignores
> > > the protections.  So I don't think it's an either/or, the kernel needs
> > > to do the right thing for itself and in this case QEMU should do the
> > > right thing for itself, which is to drop writes for regions that don't
> > > support it.  So in general, I agree with your patch.
> > >  
> > hi Alex
> > so is there anything I need to do? do I need to add a write dropping in
> > vfio_region_write() too? if yes, do I need to keep the
> > trace_vfio_region_write() before dropping ?
> 
> Hi Yan,
> 
> Yes to both feels the most consistent, right?  If we want the same
> behavior between mmap'd and non-mmap'd regions, QEMU should drop both
> writes.  Your change to memory_region_ram_device_write() drops the
> write after tracing.  For both cases though it might be worthwhile to
> have a separate trace recorded to indicate a dropped write, probably in
> place of the normal trace to avoid confusion.  Thanks,
> 
got it. will do it.

Thanks
Yan
Yan Zhao April 1, 2020, 6:47 a.m. UTC | #11
On Tue, Mar 31, 2020 at 03:59:02PM +0800, Philippe Mathieu-Daudé wrote:
> On 3/27/20 5:17 PM, Paolo Bonzini wrote:
> > On 27/03/20 11:51, Philippe Mathieu-Daudé wrote:
> >>>    diff --git a/memory.c b/memory.c
> >>> index 601b749906..4b1071dc74 100644
> >>> --- a/memory.c
> >>> +++ b/memory.c
> >>> @@ -1313,6 +1313,9 @@ static void memory_region_ram_device_write(void
> >>> *opaque, hwaddr addr,
> >>>        MemoryRegion *mr = opaque;
> >>>          trace_memory_region_ram_device_write(get_cpu_index(), mr,
> >>> addr, data, size);
> >>> +    if (mr->readonly) {
> >>> +        return;
> >>> +    }
> >>
> >> Shouldn't this be in memory_region_dispatch_write()?
> > 
> > No, in general you want memory regions to get writes, so that they
> > become for example a machine-check exception of some sorts.  However,
> > memory_region_ram_device_write should probably be changed to a
> > .write_with_attrs operation, so that it can return MEMTX_ERROR.
> > 
> >> Please split this patch in 2, this (generic) hunk as first patch, then
> >> the VFIO more specific change.
> >>
> >>>          switch (size) {
> >>>        case 1:
> >>>
> >>
> > 
> > No need, I can just add my Acked-by for Alex to pick up the patch.
> 
> Having 2 different fix in 2 different patches helps when cherry-picking 
> (bisecting, backporting...) and reverting. My 2 cents anyway.
ok. I can seperate it in patch v2.

Thanks for your input:)
diff mbox series

Patch

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 0b3593b3c0..e901621ca0 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -971,6 +971,10 @@  int vfio_region_mmap(VFIORegion *region)
                                           name, region->mmaps[i].size,
                                           region->mmaps[i].mmap);
         g_free(name);
+
+        if (!(region->flags & VFIO_REGION_INFO_FLAG_WRITE)) {
+            memory_region_set_readonly(&region->mmaps[i].mem, true);
+        }
         memory_region_add_subregion(region->mem, region->mmaps[i].offset,
                                     &region->mmaps[i].mem);
 
diff --git a/memory.c b/memory.c
index 601b749906..4b1071dc74 100644
--- a/memory.c
+++ b/memory.c
@@ -1313,6 +1313,9 @@  static void memory_region_ram_device_write(void *opaque, hwaddr addr,
     MemoryRegion *mr = opaque;
 
     trace_memory_region_ram_device_write(get_cpu_index(), mr, addr, data, size);
+    if (mr->readonly) {
+        return;
+    }
 
     switch (size) {
     case 1: