diff mbox

[V2,RFC] fixup! virtio: convert to use DMA api

Message ID 1461245745-6710-1-git-send-email-mst@redhat.com
State New
Headers show

Commit Message

Michael S. Tsirkin April 21, 2016, 1:43 p.m. UTC
This adds a flag to enable/disable bypassing the IOMMU by
virtio devices.

This is on top of patch
http://article.gmane.org/gmane.comp.emulators.qemu/403467
virtio: convert to use DMA api

Tested with patchset
http://article.gmane.org/gmane.linux.kernel.virtualization/27545
virtio-pci: iommu support (note: bit number has been kept at 34
intentionally to match posted guest code. a non-RFC version will
renumber bits to be contigious).

changes from v1:
    drop PASSTHROUGH flag

The interaction between virtio and DMA API is messy.

On most systems with virtio, physical addresses match bus addresses,
and it doesn't particularly matter whether we use the DMA API.

On some systems, including Xen and any system with a physical device
that speaks virtio behind a physical IOMMU, we must use the DMA API
for virtio DMA to work at all.

Add a feature bit to detect that: VIRTIO_F_IOMMU_PLATFORM.

If not there, we preserve historic behavior and bypass the DMA
API unless within Xen guest. This is actually required for
systems, including SPARC and PPC64, where virtio-pci devices are
enumerated as though they are behind an IOMMU, but the virtio host
ignores the IOMMU, so we must either pretend that the IOMMU isn't
there or somehow map everything as the identity.

Re: non-virtio devices.

It turns out that on old QEMU hosts, only emulated devices which were
part of QEMU use the IOMMU.  Should we want to bypass the IOMMU for such
devices *only*, it would be rather easy to detect them by looking at
subsystem vendor and device ID. Thus, no new interfaces are required
except for virtio which always uses the same subsystem vendor and device ID.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 include/hw/virtio/virtio-access.h              | 3 ++-
 include/hw/virtio/virtio.h                     | 4 +++-
 include/standard-headers/linux/virtio_config.h | 2 ++
 3 files changed, 7 insertions(+), 2 deletions(-)

Comments

Wei Liu April 21, 2016, 1:54 p.m. UTC | #1
Add Stefano and Anthony

On Thu, Apr 21, 2016 at 04:43:45PM +0300, Michael S. Tsirkin wrote:
> This adds a flag to enable/disable bypassing the IOMMU by
> virtio devices.
> 
> This is on top of patch
> http://article.gmane.org/gmane.comp.emulators.qemu/403467
> virtio: convert to use DMA api
> 
> Tested with patchset
> http://article.gmane.org/gmane.linux.kernel.virtualization/27545
> virtio-pci: iommu support (note: bit number has been kept at 34
> intentionally to match posted guest code. a non-RFC version will
> renumber bits to be contigious).
> 
> changes from v1:
>     drop PASSTHROUGH flag
> 
> The interaction between virtio and DMA API is messy.
> 
> On most systems with virtio, physical addresses match bus addresses,
> and it doesn't particularly matter whether we use the DMA API.
> 
> On some systems, including Xen and any system with a physical device
> that speaks virtio behind a physical IOMMU, we must use the DMA API
> for virtio DMA to work at all.
> 
> Add a feature bit to detect that: VIRTIO_F_IOMMU_PLATFORM.
> 
> If not there, we preserve historic behavior and bypass the DMA
> API unless within Xen guest. This is actually required for
> systems, including SPARC and PPC64, where virtio-pci devices are
> enumerated as though they are behind an IOMMU, but the virtio host
> ignores the IOMMU, so we must either pretend that the IOMMU isn't
> there or somehow map everything as the identity.
> 
> Re: non-virtio devices.
> 
> It turns out that on old QEMU hosts, only emulated devices which were
> part of QEMU use the IOMMU.  Should we want to bypass the IOMMU for such
> devices *only*, it would be rather easy to detect them by looking at
> subsystem vendor and device ID. Thus, no new interfaces are required
> except for virtio which always uses the same subsystem vendor and device ID.
> 
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> ---
>  include/hw/virtio/virtio-access.h              | 3 ++-
>  include/hw/virtio/virtio.h                     | 4 +++-
>  include/standard-headers/linux/virtio_config.h | 2 ++
>  3 files changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/include/hw/virtio/virtio-access.h b/include/hw/virtio/virtio-access.h
> index 967cc75..bb6f34e 100644
> --- a/include/hw/virtio/virtio-access.h
> +++ b/include/hw/virtio/virtio-access.h
> @@ -23,7 +23,8 @@ static inline AddressSpace *virtio_get_dma_as(VirtIODevice *vdev)
>      BusState *qbus = qdev_get_parent_bus(DEVICE(vdev));
>      VirtioBusClass *k = VIRTIO_BUS_GET_CLASS(qbus);
>  
> -    if (k->get_dma_as) {
> +    if (virtio_host_has_feature(vdev, VIRTIO_F_IOMMU_PLATFORM) &&
> +        k->get_dma_as) {
>          return k->get_dma_as(qbus->parent);
>      }
>      return &address_space_memory;
> diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
> index b12faa9..44f3788 100644
> --- a/include/hw/virtio/virtio.h
> +++ b/include/hw/virtio/virtio.h
> @@ -228,7 +228,9 @@ typedef struct VirtIORNGConf VirtIORNGConf;
>      DEFINE_PROP_BIT64("notify_on_empty", _state, _field,  \
>                        VIRTIO_F_NOTIFY_ON_EMPTY, true), \
>      DEFINE_PROP_BIT64("any_layout", _state, _field, \
> -                      VIRTIO_F_ANY_LAYOUT, true)
> +                      VIRTIO_F_ANY_LAYOUT, true), \
> +    DEFINE_PROP_BIT64("iommu_platform", _state, _field, \
> +                      VIRTIO_F_IOMMU_PLATFORM, false)
>  
>  hwaddr virtio_queue_get_desc_addr(VirtIODevice *vdev, int n);
>  hwaddr virtio_queue_get_avail_addr(VirtIODevice *vdev, int n);
> diff --git a/include/standard-headers/linux/virtio_config.h b/include/standard-headers/linux/virtio_config.h
> index bcc445b..3fcfbb1 100644
> --- a/include/standard-headers/linux/virtio_config.h
> +++ b/include/standard-headers/linux/virtio_config.h
> @@ -61,4 +61,6 @@
>  /* v1.0 compliant. */
>  #define VIRTIO_F_VERSION_1		32
>  
> +/* Do not bypass the IOMMU (if configured) */
> +#define VIRTIO_F_IOMMU_PLATFORM		34
>  #endif /* _LINUX_VIRTIO_CONFIG_H */
> -- 
> MST
Stefan Hajnoczi April 21, 2016, 2:56 p.m. UTC | #2
On Thu, Apr 21, 2016 at 04:43:45PM +0300, Michael S. Tsirkin wrote:
> This adds a flag to enable/disable bypassing the IOMMU by
> virtio devices.
> 
> This is on top of patch
> http://article.gmane.org/gmane.comp.emulators.qemu/403467
> virtio: convert to use DMA api
> 
> Tested with patchset
> http://article.gmane.org/gmane.linux.kernel.virtualization/27545
> virtio-pci: iommu support (note: bit number has been kept at 34
> intentionally to match posted guest code. a non-RFC version will
> renumber bits to be contigious).
> 
> changes from v1:
>     drop PASSTHROUGH flag
> 
> The interaction between virtio and DMA API is messy.
> 
> On most systems with virtio, physical addresses match bus addresses,
> and it doesn't particularly matter whether we use the DMA API.
> 
> On some systems, including Xen and any system with a physical device
> that speaks virtio behind a physical IOMMU, we must use the DMA API
> for virtio DMA to work at all.
> 
> Add a feature bit to detect that: VIRTIO_F_IOMMU_PLATFORM.
> 
> If not there, we preserve historic behavior and bypass the DMA
> API unless within Xen guest. This is actually required for
> systems, including SPARC and PPC64, where virtio-pci devices are
> enumerated as though they are behind an IOMMU, but the virtio host
> ignores the IOMMU, so we must either pretend that the IOMMU isn't
> there or somehow map everything as the identity.
> 
> Re: non-virtio devices.
> 
> It turns out that on old QEMU hosts, only emulated devices which were
> part of QEMU use the IOMMU.  Should we want to bypass the IOMMU for such
> devices *only*, it would be rather easy to detect them by looking at
> subsystem vendor and device ID. Thus, no new interfaces are required
> except for virtio which always uses the same subsystem vendor and device ID.
> 
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> ---
>  include/hw/virtio/virtio-access.h              | 3 ++-
>  include/hw/virtio/virtio.h                     | 4 +++-
>  include/standard-headers/linux/virtio_config.h | 2 ++
>  3 files changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/include/hw/virtio/virtio-access.h b/include/hw/virtio/virtio-access.h
> index 967cc75..bb6f34e 100644
> --- a/include/hw/virtio/virtio-access.h
> +++ b/include/hw/virtio/virtio-access.h
> @@ -23,7 +23,8 @@ static inline AddressSpace *virtio_get_dma_as(VirtIODevice *vdev)
>      BusState *qbus = qdev_get_parent_bus(DEVICE(vdev));
>      VirtioBusClass *k = VIRTIO_BUS_GET_CLASS(qbus);
>  
> -    if (k->get_dma_as) {
> +    if (virtio_host_has_feature(vdev, VIRTIO_F_IOMMU_PLATFORM) &&
> +        k->get_dma_as) {
>          return k->get_dma_as(qbus->parent);
>      }
>      return &address_space_memory;
> diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
> index b12faa9..44f3788 100644
> --- a/include/hw/virtio/virtio.h
> +++ b/include/hw/virtio/virtio.h
> @@ -228,7 +228,9 @@ typedef struct VirtIORNGConf VirtIORNGConf;
>      DEFINE_PROP_BIT64("notify_on_empty", _state, _field,  \
>                        VIRTIO_F_NOTIFY_ON_EMPTY, true), \
>      DEFINE_PROP_BIT64("any_layout", _state, _field, \
> -                      VIRTIO_F_ANY_LAYOUT, true)
> +                      VIRTIO_F_ANY_LAYOUT, true), \
> +    DEFINE_PROP_BIT64("iommu_platform", _state, _field, \
> +                      VIRTIO_F_IOMMU_PLATFORM, false)

Looks like the impact of this patch is that users who relied on
k->get_dma_as today may now have to explicitly add iommu_platform=on.
Are there any such users (e.g. Xen)?

Instead of breaking the command-line for these users you could invert
the flag's meaning ("iommu_bypass=on") and set it in the SPARC/PPC
machine types.

Stefan
Michael S. Tsirkin April 21, 2016, 3:11 p.m. UTC | #3
On Thu, Apr 21, 2016 at 03:56:53PM +0100, Stefan Hajnoczi wrote:
> On Thu, Apr 21, 2016 at 04:43:45PM +0300, Michael S. Tsirkin wrote:
> > This adds a flag to enable/disable bypassing the IOMMU by
> > virtio devices.
> > 
> > This is on top of patch
> > http://article.gmane.org/gmane.comp.emulators.qemu/403467
> > virtio: convert to use DMA api
> > 
> > Tested with patchset
> > http://article.gmane.org/gmane.linux.kernel.virtualization/27545
> > virtio-pci: iommu support (note: bit number has been kept at 34
> > intentionally to match posted guest code. a non-RFC version will
> > renumber bits to be contigious).
> > 
> > changes from v1:
> >     drop PASSTHROUGH flag
> > 
> > The interaction between virtio and DMA API is messy.
> > 
> > On most systems with virtio, physical addresses match bus addresses,
> > and it doesn't particularly matter whether we use the DMA API.
> > 
> > On some systems, including Xen and any system with a physical device
> > that speaks virtio behind a physical IOMMU, we must use the DMA API
> > for virtio DMA to work at all.
> > 
> > Add a feature bit to detect that: VIRTIO_F_IOMMU_PLATFORM.
> > 
> > If not there, we preserve historic behavior and bypass the DMA
> > API unless within Xen guest. This is actually required for
> > systems, including SPARC and PPC64, where virtio-pci devices are
> > enumerated as though they are behind an IOMMU, but the virtio host
> > ignores the IOMMU, so we must either pretend that the IOMMU isn't
> > there or somehow map everything as the identity.
> > 
> > Re: non-virtio devices.
> > 
> > It turns out that on old QEMU hosts, only emulated devices which were
> > part of QEMU use the IOMMU.  Should we want to bypass the IOMMU for such
> > devices *only*, it would be rather easy to detect them by looking at
> > subsystem vendor and device ID. Thus, no new interfaces are required
> > except for virtio which always uses the same subsystem vendor and device ID.
> > 
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > ---
> >  include/hw/virtio/virtio-access.h              | 3 ++-
> >  include/hw/virtio/virtio.h                     | 4 +++-
> >  include/standard-headers/linux/virtio_config.h | 2 ++
> >  3 files changed, 7 insertions(+), 2 deletions(-)
> > 
> > diff --git a/include/hw/virtio/virtio-access.h b/include/hw/virtio/virtio-access.h
> > index 967cc75..bb6f34e 100644
> > --- a/include/hw/virtio/virtio-access.h
> > +++ b/include/hw/virtio/virtio-access.h
> > @@ -23,7 +23,8 @@ static inline AddressSpace *virtio_get_dma_as(VirtIODevice *vdev)
> >      BusState *qbus = qdev_get_parent_bus(DEVICE(vdev));
> >      VirtioBusClass *k = VIRTIO_BUS_GET_CLASS(qbus);
> >  
> > -    if (k->get_dma_as) {
> > +    if (virtio_host_has_feature(vdev, VIRTIO_F_IOMMU_PLATFORM) &&
> > +        k->get_dma_as) {
> >          return k->get_dma_as(qbus->parent);
> >      }
> >      return &address_space_memory;
> > diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
> > index b12faa9..44f3788 100644
> > --- a/include/hw/virtio/virtio.h
> > +++ b/include/hw/virtio/virtio.h
> > @@ -228,7 +228,9 @@ typedef struct VirtIORNGConf VirtIORNGConf;
> >      DEFINE_PROP_BIT64("notify_on_empty", _state, _field,  \
> >                        VIRTIO_F_NOTIFY_ON_EMPTY, true), \
> >      DEFINE_PROP_BIT64("any_layout", _state, _field, \
> > -                      VIRTIO_F_ANY_LAYOUT, true)
> > +                      VIRTIO_F_ANY_LAYOUT, true), \
> > +    DEFINE_PROP_BIT64("iommu_platform", _state, _field, \
> > +                      VIRTIO_F_IOMMU_PLATFORM, false)
> 
> Looks like the impact of this patch is that users who relied on
> k->get_dma_as today may now have to explicitly add iommu_platform=on.
> Are there any such users (e.g. Xen)?

No because upstream this is ignored. This is an incremental patch
on top of Jason's one.

> Instead of breaking the command-line for these users you could invert
> the flag's meaning ("iommu_bypass=on") and set it in the SPARC/PPC
> machine types.
> 
> Stefan

I hope I made it clear that there are no such users.
Stefan Hajnoczi April 22, 2016, 9:33 a.m. UTC | #4
On Thu, Apr 21, 2016 at 06:11:40PM +0300, Michael S. Tsirkin wrote:
> On Thu, Apr 21, 2016 at 03:56:53PM +0100, Stefan Hajnoczi wrote:
> > On Thu, Apr 21, 2016 at 04:43:45PM +0300, Michael S. Tsirkin wrote:
> > > This adds a flag to enable/disable bypassing the IOMMU by
> > > virtio devices.
> > > 
> > > This is on top of patch
> > > http://article.gmane.org/gmane.comp.emulators.qemu/403467
> > > virtio: convert to use DMA api
> > > 
> > > Tested with patchset
> > > http://article.gmane.org/gmane.linux.kernel.virtualization/27545
> > > virtio-pci: iommu support (note: bit number has been kept at 34
> > > intentionally to match posted guest code. a non-RFC version will
> > > renumber bits to be contigious).
> > > 
> > > changes from v1:
> > >     drop PASSTHROUGH flag
> > > 
> > > The interaction between virtio and DMA API is messy.
> > > 
> > > On most systems with virtio, physical addresses match bus addresses,
> > > and it doesn't particularly matter whether we use the DMA API.
> > > 
> > > On some systems, including Xen and any system with a physical device
> > > that speaks virtio behind a physical IOMMU, we must use the DMA API
> > > for virtio DMA to work at all.
> > > 
> > > Add a feature bit to detect that: VIRTIO_F_IOMMU_PLATFORM.
> > > 
> > > If not there, we preserve historic behavior and bypass the DMA
> > > API unless within Xen guest. This is actually required for
> > > systems, including SPARC and PPC64, where virtio-pci devices are
> > > enumerated as though they are behind an IOMMU, but the virtio host
> > > ignores the IOMMU, so we must either pretend that the IOMMU isn't
> > > there or somehow map everything as the identity.
> > > 
> > > Re: non-virtio devices.
> > > 
> > > It turns out that on old QEMU hosts, only emulated devices which were
> > > part of QEMU use the IOMMU.  Should we want to bypass the IOMMU for such
> > > devices *only*, it would be rather easy to detect them by looking at
> > > subsystem vendor and device ID. Thus, no new interfaces are required
> > > except for virtio which always uses the same subsystem vendor and device ID.
> > > 
> > > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > > ---
> > >  include/hw/virtio/virtio-access.h              | 3 ++-
> > >  include/hw/virtio/virtio.h                     | 4 +++-
> > >  include/standard-headers/linux/virtio_config.h | 2 ++
> > >  3 files changed, 7 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/include/hw/virtio/virtio-access.h b/include/hw/virtio/virtio-access.h
> > > index 967cc75..bb6f34e 100644
> > > --- a/include/hw/virtio/virtio-access.h
> > > +++ b/include/hw/virtio/virtio-access.h
> > > @@ -23,7 +23,8 @@ static inline AddressSpace *virtio_get_dma_as(VirtIODevice *vdev)
> > >      BusState *qbus = qdev_get_parent_bus(DEVICE(vdev));
> > >      VirtioBusClass *k = VIRTIO_BUS_GET_CLASS(qbus);
> > >  
> > > -    if (k->get_dma_as) {
> > > +    if (virtio_host_has_feature(vdev, VIRTIO_F_IOMMU_PLATFORM) &&
> > > +        k->get_dma_as) {
> > >          return k->get_dma_as(qbus->parent);
> > >      }
> > >      return &address_space_memory;
> > > diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
> > > index b12faa9..44f3788 100644
> > > --- a/include/hw/virtio/virtio.h
> > > +++ b/include/hw/virtio/virtio.h
> > > @@ -228,7 +228,9 @@ typedef struct VirtIORNGConf VirtIORNGConf;
> > >      DEFINE_PROP_BIT64("notify_on_empty", _state, _field,  \
> > >                        VIRTIO_F_NOTIFY_ON_EMPTY, true), \
> > >      DEFINE_PROP_BIT64("any_layout", _state, _field, \
> > > -                      VIRTIO_F_ANY_LAYOUT, true)
> > > +                      VIRTIO_F_ANY_LAYOUT, true), \
> > > +    DEFINE_PROP_BIT64("iommu_platform", _state, _field, \
> > > +                      VIRTIO_F_IOMMU_PLATFORM, false)
> > 
> > Looks like the impact of this patch is that users who relied on
> > k->get_dma_as today may now have to explicitly add iommu_platform=on.
> > Are there any such users (e.g. Xen)?
> 
> No because upstream this is ignored. This is an incremental patch
> on top of Jason's one.
> 
> > Instead of breaking the command-line for these users you could invert
> > the flag's meaning ("iommu_bypass=on") and set it in the SPARC/PPC
> > machine types.
> > 
> > Stefan
> 
> I hope I made it clear that there are no such users.

Okay, thanks!

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
David Woodhouse April 27, 2016, 12:18 p.m. UTC | #5
> > On some systems, including Xen and any system with a physical device
> > that speaks virtio behind a physical IOMMU, we must use the DMA API
> > for virtio DMA to work at all.
> > 
> > Add a feature bit to detect that: VIRTIO_F_IOMMU_PLATFORM.
> > 
> > If not there, we preserve historic behavior and bypass the DMA
> > API unless within Xen guest. This is actually required for
> > systems, including SPARC and PPC64, where virtio-pci devices are
> > enumerated as though they are behind an IOMMU, but the virtio host
> > ignores the IOMMU, so we must either pretend that the IOMMU isn't
> > there or somehow map everything as the identity.
> > 
> > Re: non-virtio devices.
> > 
> > It turns out that on old QEMU hosts, only emulated devices which were
> > part of QEMU use the IOMMU.  Should we want to bypass the IOMMU for such
> > devices *only*, it would be rather easy to detect them by looking at
> > subsystem vendor and device ID. Thus, no new interfaces are required
> > except for virtio which always uses the same subsystem vendor and device ID.

Apologies for dropping this thread; I've been travelling.

But seriously, NO!

I understand why you want to see this as a virtio-specific issue, but
it isn't. And we don't *want* it to be.

In the guest, drivers SHALL use the DMA API. And the DMA API SHALL do
the right thing for each device according to its needs.

So any information passed from qemu to the guest should be directed at
the platform IOMMU code (or handled by qemu-detection quirks in the
guest, if we must).

It is *not* acceptable for the virtio drivers in the guest to just
eschew the DMA API completely, triggered by some device-specific flag.

The qemu implementation is, of course, monolithic. In qemu the fact
that virtio doesn't get translated by the emulated IOMMU *is* actually
down to code in the virtio implementation. I get that.

But then again, it's not just virtio. *Any* device which we emulate for
the guest could have that same issue, and appear as untranslated. (And
assigned PCI devices currently do).

Let's think about the parallel with a system-on-chip. Let's say we have
a peripheral which got included, but which was wired up such that it
bypasses the IOMMU and gets to do direct physical DMA. Is that a
feature of that specific peripheral? Do we hack its drivers to make the
distinction between this incarnation, and a normal discrete version of
the same device? No! It's a feature of the *system* and needs to be
conveyed to the OS IOMMU code to do the right thing. Not to the driver.

In my opinion, adding the VIRTIO_F_IOMMU_PLATFORM feature bit is
absolutely the wrong thing to do.

What we *should* do is a patchset in the guest which both fixes virtio
drivers to *always* use the DMA API, and fixes the DMA API to DTRT at
the same time — by detecting qemu and installing no-op DMA ops for the
appropriate devices, perhaps.

Then we can look at giving qemu a way to properly indicate which
devices it actually does DMA mapping for, so we can remove those
heuristic assumptions.

But that flag does *not* live in the virtio host←→guest ABI.
Michael S. Tsirkin April 27, 2016, 1:37 p.m. UTC | #6
On Wed, Apr 27, 2016 at 01:18:21PM +0100, David Woodhouse wrote:
> 
> > > On some systems, including Xen and any system with a physical device
> > > that speaks virtio behind a physical IOMMU, we must use the DMA API
> > > for virtio DMA to work at all.
> > > 
> > > Add a feature bit to detect that: VIRTIO_F_IOMMU_PLATFORM.
> > > 
> > > If not there, we preserve historic behavior and bypass the DMA
> > > API unless within Xen guest. This is actually required for
> > > systems, including SPARC and PPC64, where virtio-pci devices are
> > > enumerated as though they are behind an IOMMU, but the virtio host
> > > ignores the IOMMU, so we must either pretend that the IOMMU isn't
> > > there or somehow map everything as the identity.
> > > 
> > > Re: non-virtio devices.
> > > 
> > > It turns out that on old QEMU hosts, only emulated devices which were
> > > part of QEMU use the IOMMU.  Should we want to bypass the IOMMU for such
> > > devices *only*, it would be rather easy to detect them by looking at
> > > subsystem vendor and device ID. Thus, no new interfaces are required
> > > except for virtio which always uses the same subsystem vendor and device ID.
> 
> Apologies for dropping this thread; I've been travelling.
> 
> But seriously, NO!
> 
> I understand why you want to see this as a virtio-specific issue, but
> it isn't. And we don't *want* it to be.
> 
> In the guest, drivers SHALL use the DMA API. And the DMA API SHALL do
> the right thing for each device according to its needs.
> 
> So any information passed from qemu to the guest should be directed at
> the platform IOMMU code (or handled by qemu-detection quirks in the
> guest, if we must).
> 
> It is *not* acceptable for the virtio drivers in the guest to just
> eschew the DMA API completely, triggered by some device-specific flag.
> 
> The qemu implementation is, of course, monolithic. In qemu the fact
> that virtio doesn't get translated by the emulated IOMMU *is* actually
> down to code in the virtio implementation. I get that.
> 
> But then again, it's not just virtio. *Any* device which we emulate for
> the guest could have that same issue, and appear as untranslated. (And
> assigned PCI devices currently do).
> 
> Let's think about the parallel with a system-on-chip. Let's say we have
> a peripheral which got included, but which was wired up such that it
> bypasses the IOMMU and gets to do direct physical DMA. Is that a
> feature of that specific peripheral? Do we hack its drivers to make the
> distinction between this incarnation, and a normal discrete version of
> the same device? No! It's a feature of the *system*

One correction: it's a feature of the device in the system.
There could be a mix of devices bypassing and not
bypassing the IOMMU.

> and needs to be
> conveyed to the OS IOMMU code to do the right thing. Not to the driver.
> 
> In my opinion, adding the VIRTIO_F_IOMMU_PLATFORM feature bit is
> absolutely the wrong thing to do.
> 
> What we *should* do is a patchset in the guest which both fixes virtio
> drivers to *always* use the DMA API, and fixes the DMA API to DTRT at
> the same time — by detecting qemu and installing no-op DMA ops for the
> appropriate devices, perhaps.

Sounds good. And a way to detect appropriate devices could
be by looking at the feature flag, perhaps?


> Then we can look at giving qemu a way to properly indicate which
> devices it actually does DMA mapping for, so we can remove those
> heuristic assumptions.
> 
> But that flag does *not* live in the virtio host←→guest ABI.
> 
> -- 
> David Woodhouse                            Open Source Technology Centre
> David.Woodhouse@intel.com                              Intel Corporation
>
Joerg Roedel April 27, 2016, 2:23 p.m. UTC | #7
On Wed, Apr 27, 2016 at 04:37:04PM +0300, Michael S. Tsirkin wrote:
> One correction: it's a feature of the device in the system.
> There could be a mix of devices bypassing and not
> bypassing the IOMMU.

No, it really is not. A device can't chose to bypass the IOMMU. But the
IOMMU can chose to let the device bypass. So any fix here belongs
into the platform/iommu code too and not into some driver.

> Sounds good. And a way to detect appropriate devices could
> be by looking at the feature flag, perhaps?

Again, no! The way to detect that is to look into the iommu description
structures provided by the firmware. They provide everything necessary
to tell the iommu code which devices are not translated.



	Joerg
Andy Lutomirski April 27, 2016, 2:31 p.m. UTC | #8
On Wed, Apr 27, 2016 at 7:23 AM, Joerg Roedel <joro@8bytes.org> wrote:
> On Wed, Apr 27, 2016 at 04:37:04PM +0300, Michael S. Tsirkin wrote:
>> One correction: it's a feature of the device in the system.
>> There could be a mix of devices bypassing and not
>> bypassing the IOMMU.
>
> No, it really is not. A device can't chose to bypass the IOMMU. But the
> IOMMU can chose to let the device bypass. So any fix here belongs
> into the platform/iommu code too and not into some driver.
>
>> Sounds good. And a way to detect appropriate devices could
>> be by looking at the feature flag, perhaps?
>
> Again, no! The way to detect that is to look into the iommu description
> structures provided by the firmware. They provide everything necessary
> to tell the iommu code which devices are not translated.
>

Except on PPC and SPARC.  As far as I know, those are the only
problematic platforms.

Is it too late to *disable* QEMU's q35-iommu thingy until it can be
fixed to report correct data in the DMAR tables?

--Andy
Michael S. Tsirkin April 27, 2016, 2:34 p.m. UTC | #9
On Wed, Apr 27, 2016 at 04:23:32PM +0200, Joerg Roedel wrote:
> On Wed, Apr 27, 2016 at 04:37:04PM +0300, Michael S. Tsirkin wrote:
> > One correction: it's a feature of the device in the system.
> > There could be a mix of devices bypassing and not
> > bypassing the IOMMU.
> 
> No, it really is not. A device can't chose to bypass the IOMMU. But the
> IOMMU can chose to let the device bypass.

QEMU can choose to bypass IOMMU for one device and not the other.
IOMMU in QEMU isn't involved when it's bypassed.

> So any fix here belongs
> into the platform/iommu code too and not into some driver.

Fine but this is beside the point. Almost all virtio devices
bypass IOMMU and what this patch does is create a way
to detect devices that don't. This code can maybe go into
platform.

> > Sounds good. And a way to detect appropriate devices could
> > be by looking at the feature flag, perhaps?
> 
> Again, no! The way to detect that is to look into the iommu description
> structures provided by the firmware. They provide everything necessary
> to tell the iommu code which devices are not translated.
> 
> 
> 
> 	Joerg

It would be easy if they did but they don't do this on all systems.
In particular the idea for firmware interface was clearly
that a given bus either is connected through IOMMU or bypassing it.
Whether virtio bypasses the iommu is unrelated to the bus it's on.
Michael S. Tsirkin April 27, 2016, 2:38 p.m. UTC | #10
On Wed, Apr 27, 2016 at 07:31:43AM -0700, Andy Lutomirski wrote:
> On Wed, Apr 27, 2016 at 7:23 AM, Joerg Roedel <joro@8bytes.org> wrote:
> > On Wed, Apr 27, 2016 at 04:37:04PM +0300, Michael S. Tsirkin wrote:
> >> One correction: it's a feature of the device in the system.
> >> There could be a mix of devices bypassing and not
> >> bypassing the IOMMU.
> >
> > No, it really is not. A device can't chose to bypass the IOMMU. But the
> > IOMMU can chose to let the device bypass. So any fix here belongs
> > into the platform/iommu code too and not into some driver.
> >
> >> Sounds good. And a way to detect appropriate devices could
> >> be by looking at the feature flag, perhaps?
> >
> > Again, no! The way to detect that is to look into the iommu description
> > structures provided by the firmware. They provide everything necessary
> > to tell the iommu code which devices are not translated.
> >
> 
> Except on PPC and SPARC.  As far as I know, those are the only
> problematic platforms.
> 
> Is it too late to *disable* QEMU's q35-iommu thingy until it can be
> fixed to report correct data in the DMAR tables?
> 
> --Andy

Meaning virtio or assigned devices?
For virtio - it's way too late since these are working configurations.
For assigned devices - they don't work on x86 so it doesn't have
to be disabled, it's safe to ignore.
Andy Lutomirski April 27, 2016, 2:43 p.m. UTC | #11
On Wed, Apr 27, 2016 at 7:38 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Wed, Apr 27, 2016 at 07:31:43AM -0700, Andy Lutomirski wrote:
>> On Wed, Apr 27, 2016 at 7:23 AM, Joerg Roedel <joro@8bytes.org> wrote:
>> > On Wed, Apr 27, 2016 at 04:37:04PM +0300, Michael S. Tsirkin wrote:
>> >> One correction: it's a feature of the device in the system.
>> >> There could be a mix of devices bypassing and not
>> >> bypassing the IOMMU.
>> >
>> > No, it really is not. A device can't chose to bypass the IOMMU. But the
>> > IOMMU can chose to let the device bypass. So any fix here belongs
>> > into the platform/iommu code too and not into some driver.
>> >
>> >> Sounds good. And a way to detect appropriate devices could
>> >> be by looking at the feature flag, perhaps?
>> >
>> > Again, no! The way to detect that is to look into the iommu description
>> > structures provided by the firmware. They provide everything necessary
>> > to tell the iommu code which devices are not translated.
>> >
>>
>> Except on PPC and SPARC.  As far as I know, those are the only
>> problematic platforms.
>>
>> Is it too late to *disable* QEMU's q35-iommu thingy until it can be
>> fixed to report correct data in the DMAR tables?
>>
>> --Andy
>
> Meaning virtio or assigned devices?
> For virtio - it's way too late since these are working configurations.
> For assigned devices - they don't work on x86 so it doesn't have
> to be disabled, it's safe to ignore.

I mean actually prevent QEMU from running in q35-iommu mode with any
virtio devices attached or maybe even turn off q35-iommu mode entirely
[1].  Doesn't it require that the user literally pass the word
"experimental" into QEMU right now?  It did at some point IIRC.

The reason I'm asking is that, other than q35-iommu, QEMU's virtio
devices *don't* bypass the IOMMU except on PPC and SPARC, simply
because there is no other configuration AFAICT that has virtio and and
IOMMU.  So maybe the right solution is to fix q35-iommu to use DMAR
correctly (thus breaking q35-iommu users with older guest kernels,
which hopefully don't actually exist) and to come up with a PPC- and
SPARC-specific solution, or maybe OpenFirmware-specific solution, to
handle PPC and SPARC down the road.

[1] I'm pretty sure I emailed the QEMU list before q35-iommu ever
showed up in a release asking the QEMU team to please not do that
until this issue was resolved.  Sadly, that email was ignored :(

--Andy
Michael S. Tsirkin April 27, 2016, 2:54 p.m. UTC | #12
On Wed, Apr 27, 2016 at 07:43:07AM -0700, Andy Lutomirski wrote:
> On Wed, Apr 27, 2016 at 7:38 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Wed, Apr 27, 2016 at 07:31:43AM -0700, Andy Lutomirski wrote:
> >> On Wed, Apr 27, 2016 at 7:23 AM, Joerg Roedel <joro@8bytes.org> wrote:
> >> > On Wed, Apr 27, 2016 at 04:37:04PM +0300, Michael S. Tsirkin wrote:
> >> >> One correction: it's a feature of the device in the system.
> >> >> There could be a mix of devices bypassing and not
> >> >> bypassing the IOMMU.
> >> >
> >> > No, it really is not. A device can't chose to bypass the IOMMU. But the
> >> > IOMMU can chose to let the device bypass. So any fix here belongs
> >> > into the platform/iommu code too and not into some driver.
> >> >
> >> >> Sounds good. And a way to detect appropriate devices could
> >> >> be by looking at the feature flag, perhaps?
> >> >
> >> > Again, no! The way to detect that is to look into the iommu description
> >> > structures provided by the firmware. They provide everything necessary
> >> > to tell the iommu code which devices are not translated.
> >> >
> >>
> >> Except on PPC and SPARC.  As far as I know, those are the only
> >> problematic platforms.
> >>
> >> Is it too late to *disable* QEMU's q35-iommu thingy until it can be
> >> fixed to report correct data in the DMAR tables?
> >>
> >> --Andy
> >
> > Meaning virtio or assigned devices?
> > For virtio - it's way too late since these are working configurations.
> > For assigned devices - they don't work on x86 so it doesn't have
> > to be disabled, it's safe to ignore.
> 
> I mean actually prevent QEMU from running in q35-iommu mode with any
> virtio devices attached or maybe even turn off q35-iommu mode entirely
> [1].  Doesn't it require that the user literally pass the word
> "experimental" into QEMU right now?  It did at some point IIRC.
> 
> The reason I'm asking is that, other than q35-iommu, QEMU's virtio
> devices *don't* bypass the IOMMU except on PPC and SPARC, simply
> because there is no other configuration AFAICT that has virtio and and
> IOMMU.  So maybe the right solution is to fix q35-iommu to use DMAR
> correctly (thus breaking q35-iommu users with older guest kernels,
> which hopefully don't actually exist) and to come up with a PPC- and
> SPARC-specific solution, or maybe OpenFirmware-specific solution, to
> handle PPC and SPARC down the road.
> 
> [1] I'm pretty sure I emailed the QEMU list before q35-iommu ever
> showed up in a release asking the QEMU team to please not do that
> until this issue was resolved.  Sadly, that email was ignored :(
> 
> --Andy

Sorry, I didn't make myself clear.
Point is, QEMU is not the only virtio implementation out there.
So we can't know no virtio implementations have an IOMMU as long as
linux supports this IOMMU.
virtio always used physical addresses since it was born and if it
changes that it must do this in a way that does not break existing
users.
Joerg Roedel April 27, 2016, 2:56 p.m. UTC | #13
On Wed, Apr 27, 2016 at 05:34:30PM +0300, Michael S. Tsirkin wrote:
> On Wed, Apr 27, 2016 at 04:23:32PM +0200, Joerg Roedel wrote:

> QEMU can choose to bypass IOMMU for one device and not the other.
> IOMMU in QEMU isn't involved when it's bypassed.

And it is QEMU's task to tell the OS, right? And the correct way to do
this is via the firmware ACPI tables.

> Fine but this is beside the point. Almost all virtio devices
> bypass IOMMU and what this patch does is create a way
> to detect devices that don't. This code can maybe go into
> platform.

Again, the way to detect this is in platform code must not be device
specific. This is what the DMAR and IVRS tables on x86 are for.

When there is no way to do this in the firmware (or there is no firmware
at all), we have to do a quirk in the platform code for it.



	Joerg
Joerg Roedel April 27, 2016, 2:58 p.m. UTC | #14
On Wed, Apr 27, 2016 at 05:54:57PM +0300, Michael S. Tsirkin wrote:
> Point is, QEMU is not the only virtio implementation out there.
> So we can't know no virtio implementations have an IOMMU as long as
> linux supports this IOMMU.
> virtio always used physical addresses since it was born and if it
> changes that it must do this in a way that does not break existing
> users.

FWIW, virtio in qemu can continue to just use physical addresses. But
qemu needs to advertise that fact correctly to the OS in the DMAR table.
This way old kernels (where virtio does not use DMA-API) will also
continue to work on the fixed qemu.



	Joerg
Michael S. Tsirkin April 27, 2016, 3:05 p.m. UTC | #15
On Wed, Apr 27, 2016 at 04:56:32PM +0200, Joerg Roedel wrote:
> On Wed, Apr 27, 2016 at 05:34:30PM +0300, Michael S. Tsirkin wrote:
> > On Wed, Apr 27, 2016 at 04:23:32PM +0200, Joerg Roedel wrote:
> 
> > QEMU can choose to bypass IOMMU for one device and not the other.
> > IOMMU in QEMU isn't involved when it's bypassed.
> 
> And it is QEMU's task to tell the OS, right? And the correct way to do
> this is via the firmware ACPI tables.

Going forward, this might be reasonable. Of course it didn't in the past
and no one cared because virtio devices used physical addresses. We have
to keep these setups working.

> > Fine but this is beside the point. Almost all virtio devices
> > bypass IOMMU and what this patch does is create a way
> > to detect devices that don't. This code can maybe go into
> > platform.
> 
> Again, the way to detect this is in platform code must not be device
> specific. This is what the DMAR and IVRS tables on x86 are for.
> 
> When there is no way to do this in the firmware (or there is no firmware
> at all), we have to do a quirk in the platform code for it.
> 
> 
> 
> 	Joerg

I really don't get it.

There's exactly one device that works now and needs the work-around and
so that we need to support, and that is virtio. It happens to have
exactly the same issue on all platforms.

Why would we want to work hard to build platform-specific
solutions to a problem that can be solved in 5 lines of
generic code?
Michael S. Tsirkin April 27, 2016, 3:09 p.m. UTC | #16
On Wed, Apr 27, 2016 at 04:58:51PM +0200, Joerg Roedel wrote:
> On Wed, Apr 27, 2016 at 05:54:57PM +0300, Michael S. Tsirkin wrote:
> > Point is, QEMU is not the only virtio implementation out there.
> > So we can't know no virtio implementations have an IOMMU as long as
> > linux supports this IOMMU.
> > virtio always used physical addresses since it was born and if it
> > changes that it must do this in a way that does not break existing
> > users.
> 
> FWIW, virtio in qemu can continue to just use physical addresses. But
> qemu needs to advertise that fact correctly to the OS in the DMAR table.
> This way old kernels (where virtio does not use DMA-API) will also
> continue to work on the fixed qemu.
> 
> 
> 
> 	Joerg

It's not clear it can do this since DMAR tables seem to assume
a given slot is either bypassing IOMMU or going through it.
QEMU allows reusing same slot for virtio and non-virtio devices.

Besides, this patch is not about it, it's a flag for QEMU to
tell guest that it can trust DMAR tables.
Andy Lutomirski April 27, 2016, 3:10 p.m. UTC | #17
On Wed, Apr 27, 2016 at 7:54 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Wed, Apr 27, 2016 at 07:43:07AM -0700, Andy Lutomirski wrote:
>> On Wed, Apr 27, 2016 at 7:38 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> > On Wed, Apr 27, 2016 at 07:31:43AM -0700, Andy Lutomirski wrote:
>> >> On Wed, Apr 27, 2016 at 7:23 AM, Joerg Roedel <joro@8bytes.org> wrote:
>> >> > On Wed, Apr 27, 2016 at 04:37:04PM +0300, Michael S. Tsirkin wrote:
>> >> >> One correction: it's a feature of the device in the system.
>> >> >> There could be a mix of devices bypassing and not
>> >> >> bypassing the IOMMU.
>> >> >
>> >> > No, it really is not. A device can't chose to bypass the IOMMU. But the
>> >> > IOMMU can chose to let the device bypass. So any fix here belongs
>> >> > into the platform/iommu code too and not into some driver.
>> >> >
>> >> >> Sounds good. And a way to detect appropriate devices could
>> >> >> be by looking at the feature flag, perhaps?
>> >> >
>> >> > Again, no! The way to detect that is to look into the iommu description
>> >> > structures provided by the firmware. They provide everything necessary
>> >> > to tell the iommu code which devices are not translated.
>> >> >
>> >>
>> >> Except on PPC and SPARC.  As far as I know, those are the only
>> >> problematic platforms.
>> >>
>> >> Is it too late to *disable* QEMU's q35-iommu thingy until it can be
>> >> fixed to report correct data in the DMAR tables?
>> >>
>> >> --Andy
>> >
>> > Meaning virtio or assigned devices?
>> > For virtio - it's way too late since these are working configurations.
>> > For assigned devices - they don't work on x86 so it doesn't have
>> > to be disabled, it's safe to ignore.
>>
>> I mean actually prevent QEMU from running in q35-iommu mode with any
>> virtio devices attached or maybe even turn off q35-iommu mode entirely
>> [1].  Doesn't it require that the user literally pass the word
>> "experimental" into QEMU right now?  It did at some point IIRC.
>>
>> The reason I'm asking is that, other than q35-iommu, QEMU's virtio
>> devices *don't* bypass the IOMMU except on PPC and SPARC, simply
>> because there is no other configuration AFAICT that has virtio and and
>> IOMMU.  So maybe the right solution is to fix q35-iommu to use DMAR
>> correctly (thus breaking q35-iommu users with older guest kernels,
>> which hopefully don't actually exist) and to come up with a PPC- and
>> SPARC-specific solution, or maybe OpenFirmware-specific solution, to
>> handle PPC and SPARC down the road.
>>
>> [1] I'm pretty sure I emailed the QEMU list before q35-iommu ever
>> showed up in a release asking the QEMU team to please not do that
>> until this issue was resolved.  Sadly, that email was ignored :(
>>
>> --Andy
>
> Sorry, I didn't make myself clear.
> Point is, QEMU is not the only virtio implementation out there.
> So we can't know no virtio implementations have an IOMMU as long as
> linux supports this IOMMU.
> virtio always used physical addresses since it was born and if it
> changes that it must do this in a way that does not break existing
> users.

Is there any non-QEMU virtio implementation can provide an
IOMMU-bypassing virtio device on a platform that has a nontrivial
IOMMU?

--Andy
David Woodhouse April 27, 2016, 3:15 p.m. UTC | #18
On Wed, 2016-04-27 at 18:05 +0300, Michael S. Tsirkin wrote:
> 
> I really don't get it.
> 
> There's exactly one device that works now and needs the work-around and
> so that we need to support, and that is virtio. It happens to have
> exactly the same issue on all platforms.

False. We have other devices which are currently *not* translated by
the emulated IOMMU and which aren't going to be in the short term
either.

We also have other devices (emulated hardware NICs) to which precisely
the same "we don't need protection" arguments apply, and which we
*could* expose to the guest without an IOMMU translation if we really
wanted to. It makes as much sense as exposing virtio without an IOMMU,
going forward.

> Why would we want to work hard to build platform-specific
> solutions to a problem that can be solved in 5 lines of
> generic code?

Because it's a dirty hack in the *wrong* place.
Michael S. Tsirkin April 27, 2016, 6:17 p.m. UTC | #19
On Wed, Apr 27, 2016 at 04:15:35PM +0100, David Woodhouse wrote:
> On Wed, 2016-04-27 at 18:05 +0300, Michael S. Tsirkin wrote:
> > 
> > I really don't get it.
> > 
> > There's exactly one device that works now and needs the work-around and
> > so that we need to support, and that is virtio. It happens to have
> > exactly the same issue on all platforms.
> 
> False. We have other devices which are currently *not* translated by
> the emulated IOMMU and which aren't going to be in the short term
> either.
> 
> We also have other devices (emulated hardware NICs) to which precisely
> the same "we don't need protection" arguments apply, and which we
> *could* expose to the guest without an IOMMU translation if we really
> wanted to. It makes as much sense as exposing virtio without an IOMMU,
> going forward.

The reasons for virtio are mostly dealing legacy.
We don't need protection is a separate issue
that I'd rather drop for now.

> > Why would we want to work hard to build platform-specific
> > solutions to a problem that can be solved in 5 lines of
> > generic code?
> 
> Because it's a dirty hack in the *wrong* place.

No one came up with a better one so far :(

> -- 
> dwmw2
David Woodhouse April 27, 2016, 7:16 p.m. UTC | #20
On Wed, 2016-04-27 at 21:17 +0300, Michael S. Tsirkin wrote:
> 
> > Because it's a dirty hack in the *wrong* place.
> 
> No one came up with a better one so far :(

Seriously?

Take a look at drivers/iommu/intel-iommu.c. It has quirks for all kinds
of shitty devices that have to be put in passthrough mode or otherwise
excluded.

We don't actually *need* it for the Intel IOMMU; all we need is for
QEMU to stop lying in its DMAR tables.

We *do* want the same kind of quirks in the relevant POWER and ARM
IOMMU code in the kernel. Do that (hell, a simple quirk for all virtio
devices will suffice, but NOT in the virtio driver) at the same moment
you fix the virtio devices to use the DMA API. Job done.

Some time *later* we can work on *refining* that quirk, and a way for
QEMU to tell the guest (via something generic like fwcfg, maybe) that
some devices are and aren't translated.

Actually, I'm about to look at moving dma_ops into struct device and
cleaning up the way we detect which IOMMU is attached, at device
instantiation time. Perhaps I can shove the virtio-exception quirk in
there while I'm at it...
Michael S. Tsirkin April 28, 2016, 2:34 p.m. UTC | #21
On Wed, Apr 27, 2016 at 08:16:57PM +0100, David Woodhouse wrote:
> On Wed, 2016-04-27 at 21:17 +0300, Michael S. Tsirkin wrote:
> > 
> > > Because it's a dirty hack in the *wrong* place.
> > 
> > No one came up with a better one so far :(
> 
> Seriously?
> 
> Take a look at drivers/iommu/intel-iommu.c. It has quirks for all kinds
> of shitty devices that have to be put in passthrough mode or otherwise
> excluded.

I see work-arounds for broken IOMMUs but not for
individual devices. Could you point me to a more specific
example?


> We don't actually *need* it for the Intel IOMMU; all we need is for
> QEMU to stop lying in its DMAR tables.

We need it for legacy QEMU anyway, and it's not easy for QEMU to stop
lying about virtio, so we'll need it for a while.
I think it's easy for QEMU to stop lying about assigned devices,
so we don't need it for non-virtio devices.

> We *do* want the same kind of quirks in the relevant POWER and ARM
> IOMMU code in the kernel. Do that (hell, a simple quirk for all virtio
> devices will suffice, but NOT in the virtio driver

Sure - that works. It does not have to be in the driver.

>) at the same moment
> you fix the virtio devices to use the DMA API. Job done.
> 
> Some time *later* we can work on *refining* that quirk, and a way for
> QEMU to tell the guest (via something generic like fwcfg, maybe) that
> some devices are and aren't translated.

I don't see why how fwcfg can work here. It's a static thing,
devices can come and go with hotplug.

> Actually, I'm about to look at moving dma_ops into struct device and
> cleaning up the way we detect which IOMMU is attached, at device
> instantiation time. Perhaps I can shove the virtio-exception quirk in
> there while I'm at it...

Sounds good.

> -- 
> dwmw2
>
David Woodhouse April 28, 2016, 3:11 p.m. UTC | #22
On Thu, 2016-04-28 at 17:34 +0300, Michael S. Tsirkin wrote:
> I see work-arounds for broken IOMMUs but not for
> individual devices. Could you point me to a more specific
> example?

I think the closest example is probably quirk_ioat_snb_local_iommu().

If we see this particular device, we *know* what the topology actually
looks like. We check the hardware setup, and if we're *not* being told
the truth, then we stick it in bypass mode because we know it *isn't*
actually being translated.

Actually, that's almost *identical* to what we want, isn't it?

Except instead of checking undocumented chipset registers, it wants to
be checking "am I on a version of qemu known to lie about virtio being
translated?"

> > We don't actually *need* it for the Intel IOMMU; all we need is for
> > QEMU to stop lying in its DMAR tables.
> We need it for legacy QEMU anyway, and it's not easy for QEMU to stop
> lying about virtio, so we'll need it for a while.
> I think it's easy for QEMU to stop lying about assigned devices,
> so we don't need it for non-virtio devices.

Why is it easier for QEMU to tell the truth about assigned devices,
than it is for virtio? Assuming they both remain actually untranslated
for now, why's it easier to fix the DMAR table for one and not the
other?

(Implementing translation of assigned devices is on my list, but it's a
long way off).

> I don't see why how fwcfg can work here. It's a static thing,
> devices can come and go with hotplug.

This touches on something you said elsewhere, that it's
painful/impossible to hot-unplug a translated device and hot-plug an
untranslated device in the same slot (and vice versa).

So let's assume for now that a given slot is indeed static, and either
translated or untranslated. Like the DMAR table, the fwcfg can just
give a list of slot which are (or aren't) translated.

And then you can *only* add a translated device to a translated slot,
or an untranslated device to an untranslated slot.

All the internally-emulated devices *can* be either translated or
untranslated. That's just a matter of software. Surely, you currently
*can't* have translated assigned devices (until someone implements the
whole VT-d page table shadowing or whatever), so you'll be barred from
assigning a device to a slot which *previously* had an untranslated
device. But so what? Put it in a different slot instead.
Michael S. Tsirkin April 28, 2016, 3:37 p.m. UTC | #23
On Thu, Apr 28, 2016 at 04:11:54PM +0100, David Woodhouse wrote:
> On Thu, 2016-04-28 at 17:34 +0300, Michael S. Tsirkin wrote:
> > I see work-arounds for broken IOMMUs but not for
> > individual devices. Could you point me to a more specific
> > example?
> 
> I think the closest example is probably quirk_ioat_snb_local_iommu().

OK, so for intel, it seems that it's enough to set
	pdev->dev.archdata.iommu = DUMMY_DEVICE_DOMAIN_INFO;
for the device.

Do I have to poke at each iommu implementation to find
a way to do this, or is there some way to do it
portably?

> If we see this particular device, we *know* what the topology actually
> looks like. We check the hardware setup, and if we're *not* being told
> the truth, then we stick it in bypass mode because we know it *isn't*
> actually being translated.
> 
> Actually, that's almost *identical* to what we want, isn't it?
> 
> Except instead of checking undocumented chipset registers, it wants to
> be checking "am I on a version of qemu known to lie about virtio being
> translated?"

Not exactly - I think that future versions of qemu might lie
about some devices but not others.

> > > We don't actually *need* it for the Intel IOMMU; all we need is for
> > > QEMU to stop lying in its DMAR tables.
> > We need it for legacy QEMU anyway, and it's not easy for QEMU to stop
> > lying about virtio, so we'll need it for a while.
> > I think it's easy for QEMU to stop lying about assigned devices,
> > so we don't need it for non-virtio devices.
> 
> Why is it easier for QEMU to tell the truth about assigned devices,
> than it is for virtio? Assuming they both remain actually untranslated
> for now, why's it easier to fix the DMAR table for one and not the
> other?
> 
> (Implementing translation of assigned devices is on my list, but it's a
> long way off).

DMAR is unfortunately not a good match for what people do with QEMU.

There is a patchset on list fixing translation of assigned
devices. So the fix for these will simply be to do translation for all
assigned devices. It's harder for virtio as it isn't always
processed in QEMU - there's vhost in kernel and an out of process
vhost-user plugin. So we can end up e.g. with modern QEMU which
does translate in-process virtio but not out of process one.

> > I don't see why how fwcfg can work here. It's a static thing,
> > devices can come and go with hotplug.
> 
> This touches on something you said elsewhere, that it's
> painful/impossible to hot-unplug a translated device and hot-plug an
> untranslated device in the same slot (and vice versa).
> 
> So let's assume for now that a given slot is indeed static, and either
> translated or untranslated. Like the DMAR table, the fwcfg can just
> give a list of slot which are (or aren't) translated.
> 
> And then you can *only* add a translated device to a translated slot,
> or an untranslated device to an untranslated slot.
> 
> All the internally-emulated devices *can* be either translated or
> untranslated. That's just a matter of software. Surely, you currently
> *can't* have translated assigned devices (until someone implements the
> whole VT-d page table shadowing or whatever), so you'll be barred from
> assigning a device to a slot which *previously* had an untranslated
> device. But so what? Put it in a different slot instead.

Unfortunately people got used to be able to put any device
in any slot, and built external tools around that ability.
It's rather painful to break this assumption.

> -- 
> dwmw2
>
David Woodhouse April 28, 2016, 3:48 p.m. UTC | #24
On Thu, 2016-04-28 at 18:37 +0300, Michael S. Tsirkin wrote:
> OK, so for intel, it seems that it's enough to set
> 	pdev->dev.archdata.iommu = DUMMY_DEVICE_DOMAIN_INFO;
> for the device.

Yes, currently. Although that's vile. In fact what we *want* to happen
is for the intel-iommu code simply to decline to provide DMA ops for
this device, and let it fall back to the swiotlb or no-op DMA ops, as
appropriate.

As it is, we have the intel-iommu DMA ops *unconditionally, and they
have a hack to manually fall back to calling swiotlb. It's all just
horrid, which is why I want to clean it up with nice per-device DMA ops
and discovery thereof :)

> Do I have to poke at each iommu implementation to find
> a way to do this, or is there some way to do it
> portably?

There *will* be.... Christoph has already done some of the cleanup in
this space, and I need to take stock of what he's already done, and
finish off the parts I want to build on top of it.

> Not exactly - I think that future versions of qemu might lie
> about some devices but not others.

Can we keep this simple?

QEMU currently lies about some devices. Let's implement a heuristic for
the guest OS to know about that, and react accordingly.

Then let's fix QEMU to tell the truth. All the time, unconditionally.
Even on POWER/ARM where there's no obvious *way* for it to tell the
truth (because you don't have the flexibility that DMAR tables do), and
we need to devise a way to put it in the device-tree or fwcfg or
something else.

And only once QEMU consistently tells the *truth*, then we can start to
do new stuff and let it actually change its behaviour.

> DMAR is unfortunately not a good match for what people do with QEMU.
> 
> There is a patchset on list fixing translation of assigned
> devices. So the fix for these will simply be to do translation for
> all assigned devices. It's harder for virtio as it isn't always
> processed in QEMU - there's vhost in kernel and an out of process
> vhost-user plugin. So we can end up e.g. with modern QEMU which
> does translate in-process virtio but not out of process one.

Right... just stop. Fix QEMU to tell the truth first, and *then* once
we can trust it, we can start to change its behaviour. :)

> Unfortunately people got used to be able to put any device
> in any slot, and built external tools around that ability.
> It's rather painful to break this assumption.

Well, if you just said you have a patch set which allows translation of
assigned devices then you are most of the way there, aren't you? We
just need to fix the out-of-process virtio case, and everything can be
either translated or untranslated?
Michael S. Tsirkin May 1, 2016, 10:37 a.m. UTC | #25
On Thu, Apr 28, 2016 at 04:48:25PM +0100, David Woodhouse wrote:
> On Thu, 2016-04-28 at 18:37 +0300, Michael S. Tsirkin wrote:
> > OK, so for intel, it seems that it's enough to set
> > 	pdev->dev.archdata.iommu = DUMMY_DEVICE_DOMAIN_INFO;
> > for the device.
> 
> Yes, currently. Although that's vile. In fact what we *want* to happen
> is for the intel-iommu code simply to decline to provide DMA ops for
> this device, and let it fall back to the swiotlb or no-op DMA ops, as
> appropriate.
> 
> As it is, we have the intel-iommu DMA ops *unconditionally, and they
> have a hack to manually fall back to calling swiotlb. It's all just
> horrid, which is why I want to clean it up with nice per-device DMA ops
> and discovery thereof :)
> 
> > Do I have to poke at each iommu implementation to find
> > a way to do this, or is there some way to do it
> > portably?
> 
> There *will* be.... Christoph has already done some of the cleanup in
> this space, and I need to take stock of what he's already done, and
> finish off the parts I want to build on top of it.
> 
> > Not exactly - I think that future versions of qemu might lie
> > about some devices but not others.
> 
> Can we keep this simple?
> 
> QEMU currently lies about some devices. Let's implement a heuristic for
> the guest OS to know about that, and react accordingly.
> 
> Then let's fix QEMU to tell the truth. All the time, unconditionally.
> Even on POWER/ARM where there's no obvious *way* for it to tell the
> truth (because you don't have the flexibility that DMAR tables do), and
> we need to devise a way to put it in the device-tree or fwcfg or
> something else.

Right.  Unfortunately all these aren't easy to implement at all.
So I'm inclined to go the "something else" route.
It has the added benefit of giving us a heuristic for free.

> And only once QEMU consistently tells the *truth*, then we can start to
> do new stuff and let it actually change its behaviour.
> 
> > DMAR is unfortunately not a good match for what people do with QEMU.
> > 
> > There is a patchset on list fixing translation of assigned
> > devices. So the fix for these will simply be to do translation for
> > all assigned devices. It's harder for virtio as it isn't always
> > processed in QEMU - there's vhost in kernel and an out of process
> > vhost-user plugin. So we can end up e.g. with modern QEMU which
> > does translate in-process virtio but not out of process one.
> 
> Right... just stop. Fix QEMU to tell the truth first, and *then* once
> we can trust it, we can start to change its behaviour. :)
> 
> > Unfortunately people got used to be able to put any device
> > in any slot, and built external tools around that ability.
> > It's rather painful to break this assumption.
> 
> Well, if you just said you have a patch set which allows translation of
> assigned devices then you are most of the way there, aren't you? We
> just need to fix the out-of-process virtio case, and everything can be
> either translated or untranslated?

Absolutely. But that "just" will take a while.  With out of process
there's always a chance that remote doesn't implement translation. E.g.
new QEMU running on an old host kernel.

> -- 
> dwmw2
>
Paolo Bonzini May 9, 2016, 11:09 a.m. UTC | #26
On 28/04/2016 17:37, Michael S. Tsirkin wrote:
> > All the internally-emulated devices *can* be either translated or
> > untranslated. That's just a matter of software. Surely, you currently
> > *can't* have translated assigned devices (until someone implements the
> > whole VT-d page table shadowing or whatever), so you'll be barred from
> > assigning a device to a slot which *previously* had an untranslated
> > device. But so what? Put it in a different slot instead.
> 
> Unfortunately people got used to be able to put any device
> in any slot, and built external tools around that ability.
> It's rather painful to break this assumption.

Once you move to PCIe, a lot of things become more complicated.  This is
just one of them; instead of needing half a dozen PCI bridges, you'll
need half a dozen plus one.

Paolo
diff mbox

Patch

diff --git a/include/hw/virtio/virtio-access.h b/include/hw/virtio/virtio-access.h
index 967cc75..bb6f34e 100644
--- a/include/hw/virtio/virtio-access.h
+++ b/include/hw/virtio/virtio-access.h
@@ -23,7 +23,8 @@  static inline AddressSpace *virtio_get_dma_as(VirtIODevice *vdev)
     BusState *qbus = qdev_get_parent_bus(DEVICE(vdev));
     VirtioBusClass *k = VIRTIO_BUS_GET_CLASS(qbus);
 
-    if (k->get_dma_as) {
+    if (virtio_host_has_feature(vdev, VIRTIO_F_IOMMU_PLATFORM) &&
+        k->get_dma_as) {
         return k->get_dma_as(qbus->parent);
     }
     return &address_space_memory;
diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
index b12faa9..44f3788 100644
--- a/include/hw/virtio/virtio.h
+++ b/include/hw/virtio/virtio.h
@@ -228,7 +228,9 @@  typedef struct VirtIORNGConf VirtIORNGConf;
     DEFINE_PROP_BIT64("notify_on_empty", _state, _field,  \
                       VIRTIO_F_NOTIFY_ON_EMPTY, true), \
     DEFINE_PROP_BIT64("any_layout", _state, _field, \
-                      VIRTIO_F_ANY_LAYOUT, true)
+                      VIRTIO_F_ANY_LAYOUT, true), \
+    DEFINE_PROP_BIT64("iommu_platform", _state, _field, \
+                      VIRTIO_F_IOMMU_PLATFORM, false)
 
 hwaddr virtio_queue_get_desc_addr(VirtIODevice *vdev, int n);
 hwaddr virtio_queue_get_avail_addr(VirtIODevice *vdev, int n);
diff --git a/include/standard-headers/linux/virtio_config.h b/include/standard-headers/linux/virtio_config.h
index bcc445b..3fcfbb1 100644
--- a/include/standard-headers/linux/virtio_config.h
+++ b/include/standard-headers/linux/virtio_config.h
@@ -61,4 +61,6 @@ 
 /* v1.0 compliant. */
 #define VIRTIO_F_VERSION_1		32
 
+/* Do not bypass the IOMMU (if configured) */
+#define VIRTIO_F_IOMMU_PLATFORM		34
 #endif /* _LINUX_VIRTIO_CONFIG_H */