Patchwork Supporting emulation of IOMMUs

login
register
mail settings
Submitter David Gibson
Date April 21, 2011, 7:03 a.m.
Message ID <20110421070347.GG11968@yookeroo>
Download mbox | patch
Permalink /patch/92382/
State New
Headers show

Comments

David Gibson - April 21, 2011, 7:03 a.m.
A few months ago, Eduard - Gabriel Munteanu posted a series of patches
implementing support for emulating the AMD PCI IOMMU
(http://lists.nongnu.org/archive/html/qemu-devel/2011-01/msg03196.html).

In fact, this series implemented a general DMA/IOMMU layer which can
be used by any device model, and one translation backend for this
implementing the AMD specific PCI IOMMU.

These patches don't seem to have gone anywhere for the last few
months, however, and so far I've been unable to contact the author
(trying again with this mail).

I have an interest in this code, because the pSeries machine will also
need IOMMU emulation support.  At present we only support virtual
devices, through the PAPR interface, and we have support for the
hypervisor-controller IOMMU translation in the PAPR VIO code.
However, we want to add PCI device support and this will also need
IOMMU translation.

The series seems to have the right basic approach, so if the author is
indeed MIA, I was planning to pick up the patches and resubmit them
(with support for the pSeries IOMMU added).

Before I do that, I was hoping to get some consensus that this is the
right way to go.  For reference, I have an updated version of the
first patch (which adds the core IOMMU layer) below.

From: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
Date: Fri, 4 Feb 2011 01:32:55 +0200
Subject: [PATCH] Generic DMA memory access interface

This introduces replacements for memory access functions like
cpu_physical_memory_read(). The new interface can handle address
translation and access checking through an IOMMU.

David Gibson: I have made several bugfixes and cleanups to Eduard's
original patch.

 * dma_memory_rw() was incorrectly using (uninitialized) plen instead
   of len in the fallback to no-IOMMU case.

 * the dma_memory_map() tracking was storing the guest physical
   address of each mapping, but not the qemu user virtual address.
   However in unmap() it was then attempting to lookup by virtual
   using a completely bogus cast.

 * The dma_memory_rw() function is moved from dma_rw.h to dma_rw.c, it
   was a bit too much code for an inline.

 * IOMMU support is now available on all target platforms, not just
   i386, but is configurable (--enable-iommu/--disable-iommu).  Stubs
   are used so that individual drivers can use the new dma interface
   and it will turn into old-style cpu physical accesses at no cost on
   IOMMU-less builds.

Signed-off-by: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 Makefile.target |    1 +
 configure       |   12 ++++
 hw/dma_rw.c     |  147 +++++++++++++++++++++++++++++++++++++++++++++++++++
 hw/dma_rw.h     |  156 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 316 insertions(+), 0 deletions(-)
 create mode 100644 hw/dma_rw.c
 create mode 100644 hw/dma_rw.h
Alexander Graf - April 21, 2011, 9:39 a.m.
On 21.04.2011, at 09:03, David Gibson wrote:

> A few months ago, Eduard - Gabriel Munteanu posted a series of patches
> implementing support for emulating the AMD PCI IOMMU
> (http://lists.nongnu.org/archive/html/qemu-devel/2011-01/msg03196.html).
> 
> In fact, this series implemented a general DMA/IOMMU layer which can
> be used by any device model, and one translation backend for this
> implementing the AMD specific PCI IOMMU.
> 
> These patches don't seem to have gone anywhere for the last few
> months, however, and so far I've been unable to contact the author
> (trying again with this mail).
> 
> I have an interest in this code, because the pSeries machine will also
> need IOMMU emulation support.  At present we only support virtual
> devices, through the PAPR interface, and we have support for the
> hypervisor-controller IOMMU translation in the PAPR VIO code.
> However, we want to add PCI device support and this will also need
> IOMMU translation.
> 
> The series seems to have the right basic approach, so if the author is
> indeed MIA, I was planning to pick up the patches and resubmit them
> (with support for the pSeries IOMMU added).
> 
> Before I do that, I was hoping to get some consensus that this is the
> right way to go.  For reference, I have an updated version of the
> first patch (which adds the core IOMMU layer) below.
> 
> From: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
> Date: Fri, 4 Feb 2011 01:32:55 +0200
> Subject: [PATCH] Generic DMA memory access interface
> 
> This introduces replacements for memory access functions like
> cpu_physical_memory_read(). The new interface can handle address
> translation and access checking through an IOMMU.
> 
> David Gibson: I have made several bugfixes and cleanups to Eduard's
> original patch.
> 
> * dma_memory_rw() was incorrectly using (uninitialized) plen instead
>   of len in the fallback to no-IOMMU case.
> 
> * the dma_memory_map() tracking was storing the guest physical
>   address of each mapping, but not the qemu user virtual address.
>   However in unmap() it was then attempting to lookup by virtual
>   using a completely bogus cast.
> 
> * The dma_memory_rw() function is moved from dma_rw.h to dma_rw.c, it
>   was a bit too much code for an inline.
> 
> * IOMMU support is now available on all target platforms, not just
>   i386, but is configurable (--enable-iommu/--disable-iommu).  Stubs
>   are used so that individual drivers can use the new dma interface
>   and it will turn into old-style cpu physical accesses at no cost on
>   IOMMU-less builds.
> 
> Signed-off-by: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
> Makefile.target |    1 +
> configure       |   12 ++++
> hw/dma_rw.c     |  147 +++++++++++++++++++++++++++++++++++++++++++++++++++
> hw/dma_rw.h     |  156 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 4 files changed, 316 insertions(+), 0 deletions(-)
> create mode 100644 hw/dma_rw.c
> create mode 100644 hw/dma_rw.h
> 
> diff --git a/Makefile.target b/Makefile.target
> index 95f5eda..c3d36c6 100644
> --- a/Makefile.target
> +++ b/Makefile.target
> @@ -198,6 +198,7 @@ obj-$(CONFIG_REALLY_VIRTFS) += virtio-9p.o
> obj-y += rwhandler.o
> obj-$(CONFIG_KVM) += kvm.o kvm-all.o
> obj-$(CONFIG_NO_KVM) += kvm-stub.o
> +obj-$(CONFIG_IOMMU) += dma_rw.o
> LIBS+=-lz
> 
> QEMU_CFLAGS += $(VNC_TLS_CFLAGS)
> diff --git a/configure b/configure
> index da2da04..fa6f4d5 100755
> --- a/configure
> +++ b/configure
> @@ -130,6 +130,7 @@ xen=""
> linux_aio=""
> attr=""
> vhost_net=""
> +iommu="no"
> xfs=""
> 
> gprof="no"
> @@ -723,6 +724,10 @@ for opt do
>   ;;
>   --enable-vhost-net) vhost_net="yes"
>   ;;
> +  --enable-iommu) iommu="yes"
> +  ;;
> +  --disable-iommu) iommu="no"
> +  ;;
>   --disable-opengl) opengl="no"
>   ;;
>   --enable-opengl) opengl="yes"
> @@ -934,6 +939,8 @@ echo "  --enable-docs            enable documentation build"
> echo "  --disable-docs           disable documentation build"
> echo "  --disable-vhost-net      disable vhost-net acceleration support"
> echo "  --enable-vhost-net       enable vhost-net acceleration support"
> +echo "  --disable-iommu          disable IOMMU emulation support"
> +echo "  --enable-vhost-net       enable IOMMU emulation support"

eeh?

> echo "  --enable-trace-backend=B Set trace backend"
> echo "                           Available backends:" $("$source_path"/scripts/tracetool --list-backends)
> echo "  --with-trace-file=NAME   Full PATH,NAME of file to store traces"
> @@ -2608,6 +2615,7 @@ echo "madvise           $madvise"
> echo "posix_madvise     $posix_madvise"
> echo "uuid support      $uuid"
> echo "vhost-net support $vhost_net"
> +echo "IOMMU support     $iommu"
> echo "Trace backend     $trace_backend"
> echo "Trace output file $trace_file-<pid>"
> echo "spice support     $spice"
> @@ -3412,6 +3420,10 @@ if test "$target_softmmu" = "yes" -a \( \
>   echo "CONFIG_NEED_MMU=y" >> $config_target_mak
> fi
> 
> +if test "$iommu" = "yes" ; then
> +  echo "CONFIG_IOMMU=y" >> $config_target_mak
> +fi
> +
> if test "$gprof" = "yes" ; then
>   echo "TARGET_GPROF=yes" >> $config_target_mak
>   if test "$target_linux_user" = "yes" ; then
> diff --git a/hw/dma_rw.c b/hw/dma_rw.c
> new file mode 100644
> index 0000000..627835c
> --- /dev/null
> +++ b/hw/dma_rw.c
> @@ -0,0 +1,147 @@
> +/*
> + * Generic DMA memory access interface.
> + *
> + * Copyright (c) 2011 Eduard - Gabriel Munteanu
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a copy
> + * of this software and associated documentation files (the "Software"), to deal
> + * in the Software without restriction, including without limitation the rights
> + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
> + * copies of the Software, and to permit persons to whom the Software is
> + * furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice shall be included in
> + * all copies or substantial portions of the Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
> + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
> + * THE SOFTWARE.
> + */
> +
> +#include "dma_rw.h"
> +#include "range.h"
> +
> +void dma_memory_rw(DMADevice *dev, dma_addr_t addr,
> +                   void *buf, dma_addr_t len, int is_write)
> +{
> +    dma_addr_t paddr, plen;
> +    int err;
> +
> +    /*
> +     * Fast-path non-iommu.
> +     * More importantly, makes it obvious what this function does.
> +     */
> +    if (NO_IOMMU(dev)) {
> +        cpu_physical_memory_rw(addr, buf, len, is_write);
> +        return;
> +    }
> +
> +    while (len) {
> +        err = dev->mmu->translate(dev, addr, &paddr, &plen, is_write);
> +        if (err) {
> +            return;
> +        }
> +
> +        /* The translation might be valid for larger regions. */
> +        if (plen > len) {
> +            plen = len;
> +        }
> +
> +        cpu_physical_memory_rw(paddr, buf, plen, is_write);
> +
> +        len -= plen;
> +        addr += plen;
> +        buf += plen;
> +    }
> +}
> +
> +typedef struct DMAMemoryMap DMAMemoryMap;
> +struct DMAMemoryMap {
> +    dma_addr_t              addr;
> +    size_t                  len;
> +    void *                  buf;
> +    DMAInvalidateMapFunc    *invalidate;
> +    void                    *invalidate_opaque;
> +
> +    QLIST_ENTRY(DMAMemoryMap) list;
> +};
> +
> +void dma_invalidate_memory_range(DMADevice *dev,
> +                                 dma_addr_t addr, dma_addr_t len)
> +{
> +    DMAMemoryMap *map;
> +
> +    QLIST_FOREACH(map, &dev->mmu->memory_maps, list) {
> +        if (ranges_overlap(addr, len, map->addr, map->len)) {
> +            map->invalidate(map->invalidate_opaque);
> +            QLIST_REMOVE(map, list);
> +            free(map);
> +        }
> +    }
> +}
> +
> +void *dma_memory_map(DMADevice *dev, DMAInvalidateMapFunc *cb, void *opaque,
> +                     dma_addr_t addr, dma_addr_t *len, int is_write)
> +{
> +    int err;
> +    target_phys_addr_t paddr, plen;
> +    void *buf;
> +
> +    if (NO_IOMMU(dev)) {
> +        return cpu_physical_memory_map(addr, len, is_write);
> +    }
> +
> +    plen = *len;
> +    err = dev->mmu->translate(dev, addr, &paddr, &plen, is_write);
> +    if (err) {
> +        return NULL;
> +    }
> +
> +    /*
> +     * If this is true, the virtual region is contiguous,
> +     * but the translated physical region isn't. We just
> +     * clamp *len, much like cpu_physical_memory_map() does.
> +     */
> +    if (plen < *len) {
> +        *len = plen;
> +    }
> +
> +    buf = cpu_physical_memory_map(paddr, len, is_write);
> +
> +    /* We treat maps as remote TLBs to cope with stuff like AIO. */
> +    if (cb) {
> +        DMAMemoryMap *map;
> +
> +        map = qemu_malloc(sizeof(DMAMemoryMap));
> +        map->addr = addr;
> +        map->len = *len;
> +        map->buf = buf;
> +        map->invalidate = cb;
> +        map->invalidate_opaque = opaque;
> +        
> +        QLIST_INSERT_HEAD(&dev->mmu->memory_maps, map, list);
> +    }
> +
> +    return buf;
> +}
> +
> +void dma_memory_unmap(DMADevice *dev, void *buffer, dma_addr_t len,
> +                      int is_write, dma_addr_t access_len)
> +{
> +    cpu_physical_memory_unmap(buffer, len, is_write, access_len);
> +    if (!NO_IOMMU(dev)) {
> +        DMAMemoryMap *map;
> +
> +        QLIST_FOREACH(map, &dev->mmu->memory_maps, list) {
> +            if ((map->buf == buffer) && (map->len == len)) {
> +                QLIST_REMOVE(map, list);
> +                free(map);
> +            }
> +        }
> +    }
> +}
> +
> diff --git a/hw/dma_rw.h b/hw/dma_rw.h
> new file mode 100644
> index 0000000..58a3d0f
> --- /dev/null
> +++ b/hw/dma_rw.h
> @@ -0,0 +1,156 @@
> +#ifndef DMA_RW_H
> +#define DMA_RW_H
> +
> +#include "qemu-common.h"
> +
> +typedef uint64_t dma_addr_t;
> +
> +typedef struct DMAMmu DMAMmu;
> +typedef struct DMADevice DMADevice;
> +
> +typedef int DMATranslateFunc(DMADevice *dev,
> +                             dma_addr_t addr,
> +                             dma_addr_t *paddr,
> +                             dma_addr_t *len,
> +                             int is_write);
> +
> +typedef void DMAInvalidateMapFunc(void *);
> +
> +#ifndef CONFIG_IOMMU
> +struct DMAMmu {
> +};
> +
> +struct DMADevice {
> +};
> +
> +#define NO_IOMMU(_dev) (1)
> +
> +static inline void dma_memory_rw(DMADevice *dev, dma_addr_t addr,
> +                                 void *buf, dma_addr_t len, int is_write)
> +{
> +    cpu_physical_memory_rw(addr, buf, len, is_write);
> +}
> +
> +static inline void *dma_memory_map(DMADevice *dev,
> +                                   DMAInvalidateMapFunc *cb, void *opaque,
> +                                   dma_addr_t addr, dma_addr_t *len,
> +                                   int is_write)
> +{
> +    return cpu_physical_memory_map(addr, len, is_write);
> +}
> +
> +static inline void dma_memory_unmap(DMADevice *dev,
> +                                    void *buffer, dma_addr_t len,
> +                                    int is_write, dma_addr_t access_len)
> +{
> +    cpu_physical_memory_unmap(buffer, len, is_write, access_len);
> +}
> +
> +#define DEFINE_DMA_LD(suffix, size)                                       \
> +static inline uint##size##_t                                              \
> +dma_ld##suffix(DMADevice *dev, dma_addr_t addr)                           \
> +{                                                                         \
> +    return ld##suffix##_phys(addr);                                       \
> +}
> +
> +#define DEFINE_DMA_ST(suffix, size)                                       \
> +static inline void                                                        \
> +dma_st##suffix(DMADevice *dev, dma_addr_t addr, uint##size##_t val)       \
> +{                                                                         \
> +    st##suffix##_phys(addr, val);                                         \
> +}
> +
> +#else
> +struct DMAMmu {
> +    DeviceState *iommu;
> +    DMATranslateFunc *translate;
> +    QLIST_HEAD(memory_maps, DMAMemoryMap) memory_maps;
> +};
> +
> +struct DMADevice {

How exactly is this going to be used? Also, in the end I think that most devices should just go through a PCI specific interface that then calls the DMA helpers:

pci_memory_rw(PCIDevice *d, ...)

even if it's only as simple as calling

dma_memory_rw(d->iommu, ...)


Alex
Eduard - Gabriel Munteanu - April 21, 2011, 6:47 p.m.
On Thu, Apr 21, 2011 at 05:03:47PM +1000, David Gibson wrote:
> A few months ago, Eduard - Gabriel Munteanu posted a series of patches
> implementing support for emulating the AMD PCI IOMMU
> (http://lists.nongnu.org/archive/html/qemu-devel/2011-01/msg03196.html).
> 
> In fact, this series implemented a general DMA/IOMMU layer which can
> be used by any device model, and one translation backend for this
> implementing the AMD specific PCI IOMMU.
> 
> These patches don't seem to have gone anywhere for the last few
> months, however, and so far I've been unable to contact the author
> (trying again with this mail).
> 
> I have an interest in this code, because the pSeries machine will also
> need IOMMU emulation support.  At present we only support virtual
> devices, through the PAPR interface, and we have support for the
> hypervisor-controller IOMMU translation in the PAPR VIO code.
> However, we want to add PCI device support and this will also need
> IOMMU translation.
> 
> The series seems to have the right basic approach, so if the author is
> indeed MIA, I was planning to pick up the patches and resubmit them
> (with support for the pSeries IOMMU added).

Hi,

Not really MIA, but I've been a bit busy lately, so I'm sorry if I
couldn't answer your mail in a timely fashion.

I'll try making another merge attempt tonight/tomorrow.

> Before I do that, I was hoping to get some consensus that this is the
> right way to go.  For reference, I have an updated version of the
> first patch (which adds the core IOMMU layer) below.

Some developers expressed a few concerns during my last merge attempt,
I'm going to go through them and see if they have been solved.

> From: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
> Date: Fri, 4 Feb 2011 01:32:55 +0200
> Subject: [PATCH] Generic DMA memory access interface
> 
> This introduces replacements for memory access functions like
> cpu_physical_memory_read(). The new interface can handle address
> translation and access checking through an IOMMU.
> 
> David Gibson: I have made several bugfixes and cleanups to Eduard's
> original patch.
> 
>  * dma_memory_rw() was incorrectly using (uninitialized) plen instead
>    of len in the fallback to no-IOMMU case.
> 
>  * the dma_memory_map() tracking was storing the guest physical
>    address of each mapping, but not the qemu user virtual address.
>    However in unmap() it was then attempting to lookup by virtual
>    using a completely bogus cast.

Thanks. Map invalidation didn't get much testing, maybe figuring out a
way to trigger it from a guest would be nice, say a testcase.

>  * The dma_memory_rw() function is moved from dma_rw.h to dma_rw.c, it
>    was a bit too much code for an inline.
> 
>  * IOMMU support is now available on all target platforms, not just
>    i386, but is configurable (--enable-iommu/--disable-iommu).  Stubs
>    are used so that individual drivers can use the new dma interface
>    and it will turn into old-style cpu physical accesses at no cost on
>    IOMMU-less builds.

My impression was people were in favor of having the IOMMU code always
built in (and go through the direct cpu_physical_* when not configured
by the guest). And perhaps poison the old interfaces once everything
goes through the new DMA layer. I'm okay any way, though.

> Signed-off-by: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  Makefile.target |    1 +
>  configure       |   12 ++++
>  hw/dma_rw.c     |  147 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  hw/dma_rw.h     |  156 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 316 insertions(+), 0 deletions(-)
>  create mode 100644 hw/dma_rw.c
>  create mode 100644 hw/dma_rw.h
> 
> diff --git a/Makefile.target b/Makefile.target
> index 95f5eda..c3d36c6 100644
> --- a/Makefile.target
> +++ b/Makefile.target
> @@ -198,6 +198,7 @@ obj-$(CONFIG_REALLY_VIRTFS) += virtio-9p.o
>  obj-y += rwhandler.o
>  obj-$(CONFIG_KVM) += kvm.o kvm-all.o
>  obj-$(CONFIG_NO_KVM) += kvm-stub.o
> +obj-$(CONFIG_IOMMU) += dma_rw.o
>  LIBS+=-lz
>  
>  QEMU_CFLAGS += $(VNC_TLS_CFLAGS)
> diff --git a/configure b/configure
> index da2da04..fa6f4d5 100755
> --- a/configure
> +++ b/configure
> @@ -130,6 +130,7 @@ xen=""
>  linux_aio=""
>  attr=""
>  vhost_net=""
> +iommu="no"
>  xfs=""
>  
>  gprof="no"
> @@ -723,6 +724,10 @@ for opt do
>    ;;
>    --enable-vhost-net) vhost_net="yes"
>    ;;
> +  --enable-iommu) iommu="yes"
> +  ;;
> +  --disable-iommu) iommu="no"
> +  ;;
>    --disable-opengl) opengl="no"
>    ;;
>    --enable-opengl) opengl="yes"
> @@ -934,6 +939,8 @@ echo "  --enable-docs            enable documentation build"
>  echo "  --disable-docs           disable documentation build"
>  echo "  --disable-vhost-net      disable vhost-net acceleration support"
>  echo "  --enable-vhost-net       enable vhost-net acceleration support"
> +echo "  --disable-iommu          disable IOMMU emulation support"
> +echo "  --enable-vhost-net       enable IOMMU emulation support"
>  echo "  --enable-trace-backend=B Set trace backend"
>  echo "                           Available backends:" $("$source_path"/scripts/tracetool --list-backends)
>  echo "  --with-trace-file=NAME   Full PATH,NAME of file to store traces"
> @@ -2608,6 +2615,7 @@ echo "madvise           $madvise"
>  echo "posix_madvise     $posix_madvise"
>  echo "uuid support      $uuid"
>  echo "vhost-net support $vhost_net"
> +echo "IOMMU support     $iommu"
>  echo "Trace backend     $trace_backend"
>  echo "Trace output file $trace_file-<pid>"
>  echo "spice support     $spice"
> @@ -3412,6 +3420,10 @@ if test "$target_softmmu" = "yes" -a \( \
>    echo "CONFIG_NEED_MMU=y" >> $config_target_mak
>  fi
>  
> +if test "$iommu" = "yes" ; then
> +  echo "CONFIG_IOMMU=y" >> $config_target_mak
> +fi
> +
>  if test "$gprof" = "yes" ; then
>    echo "TARGET_GPROF=yes" >> $config_target_mak
>    if test "$target_linux_user" = "yes" ; then
> diff --git a/hw/dma_rw.c b/hw/dma_rw.c
> new file mode 100644
> index 0000000..627835c
> --- /dev/null
> +++ b/hw/dma_rw.c
> @@ -0,0 +1,147 @@
> +/*
> + * Generic DMA memory access interface.
> + *
> + * Copyright (c) 2011 Eduard - Gabriel Munteanu
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a copy
> + * of this software and associated documentation files (the "Software"), to deal
> + * in the Software without restriction, including without limitation the rights
> + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
> + * copies of the Software, and to permit persons to whom the Software is
> + * furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice shall be included in
> + * all copies or substantial portions of the Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
> + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
> + * THE SOFTWARE.
> + */
> +
> +#include "dma_rw.h"
> +#include "range.h"
> +
> +void dma_memory_rw(DMADevice *dev, dma_addr_t addr,
> +                   void *buf, dma_addr_t len, int is_write)
> +{
> +    dma_addr_t paddr, plen;
> +    int err;
> +
> +    /*
> +     * Fast-path non-iommu.
> +     * More importantly, makes it obvious what this function does.
> +     */
> +    if (NO_IOMMU(dev)) {
> +        cpu_physical_memory_rw(addr, buf, len, is_write);
> +        return;
> +    }
> +
> +    while (len) {
> +        err = dev->mmu->translate(dev, addr, &paddr, &plen, is_write);
> +        if (err) {
> +            return;
> +        }
> +
> +        /* The translation might be valid for larger regions. */
> +        if (plen > len) {
> +            plen = len;
> +        }
> +
> +        cpu_physical_memory_rw(paddr, buf, plen, is_write);
> +
> +        len -= plen;
> +        addr += plen;
> +        buf += plen;
> +    }
> +}
> +
> +typedef struct DMAMemoryMap DMAMemoryMap;
> +struct DMAMemoryMap {
> +    dma_addr_t              addr;
> +    size_t                  len;
> +    void *                  buf;
> +    DMAInvalidateMapFunc    *invalidate;
> +    void                    *invalidate_opaque;
> +
> +    QLIST_ENTRY(DMAMemoryMap) list;
> +};
> +
> +void dma_invalidate_memory_range(DMADevice *dev,
> +                                 dma_addr_t addr, dma_addr_t len)
> +{
> +    DMAMemoryMap *map;
> +
> +    QLIST_FOREACH(map, &dev->mmu->memory_maps, list) {
> +        if (ranges_overlap(addr, len, map->addr, map->len)) {
> +            map->invalidate(map->invalidate_opaque);
> +            QLIST_REMOVE(map, list);
> +            free(map);
> +        }
> +    }
> +}
> +
> +void *dma_memory_map(DMADevice *dev, DMAInvalidateMapFunc *cb, void *opaque,
> +                     dma_addr_t addr, dma_addr_t *len, int is_write)
> +{
> +    int err;
> +    target_phys_addr_t paddr, plen;
> +    void *buf;
> +
> +    if (NO_IOMMU(dev)) {
> +        return cpu_physical_memory_map(addr, len, is_write);
> +    }
> +
> +    plen = *len;
> +    err = dev->mmu->translate(dev, addr, &paddr, &plen, is_write);
> +    if (err) {
> +        return NULL;
> +    }
> +
> +    /*
> +     * If this is true, the virtual region is contiguous,
> +     * but the translated physical region isn't. We just
> +     * clamp *len, much like cpu_physical_memory_map() does.
> +     */
> +    if (plen < *len) {
> +        *len = plen;
> +    }
> +
> +    buf = cpu_physical_memory_map(paddr, len, is_write);
> +
> +    /* We treat maps as remote TLBs to cope with stuff like AIO. */
> +    if (cb) {
> +        DMAMemoryMap *map;
> +
> +        map = qemu_malloc(sizeof(DMAMemoryMap));
> +        map->addr = addr;
> +        map->len = *len;
> +        map->buf = buf;
> +        map->invalidate = cb;
> +        map->invalidate_opaque = opaque;
> +        
> +        QLIST_INSERT_HEAD(&dev->mmu->memory_maps, map, list);
> +    }
> +
> +    return buf;
> +}
> +
> +void dma_memory_unmap(DMADevice *dev, void *buffer, dma_addr_t len,
> +                      int is_write, dma_addr_t access_len)
> +{
> +    cpu_physical_memory_unmap(buffer, len, is_write, access_len);
> +    if (!NO_IOMMU(dev)) {
> +        DMAMemoryMap *map;
> +
> +        QLIST_FOREACH(map, &dev->mmu->memory_maps, list) {
> +            if ((map->buf == buffer) && (map->len == len)) {
> +                QLIST_REMOVE(map, list);
> +                free(map);
> +            }
> +        }
> +    }
> +}
> +
> diff --git a/hw/dma_rw.h b/hw/dma_rw.h
> new file mode 100644
> index 0000000..58a3d0f
> --- /dev/null
> +++ b/hw/dma_rw.h
> @@ -0,0 +1,156 @@
> +#ifndef DMA_RW_H
> +#define DMA_RW_H
> +
> +#include "qemu-common.h"
> +
> +typedef uint64_t dma_addr_t;
> +
> +typedef struct DMAMmu DMAMmu;
> +typedef struct DMADevice DMADevice;
> +
> +typedef int DMATranslateFunc(DMADevice *dev,
> +                             dma_addr_t addr,
> +                             dma_addr_t *paddr,
> +                             dma_addr_t *len,
> +                             int is_write);
> +
> +typedef void DMAInvalidateMapFunc(void *);
> +
> +#ifndef CONFIG_IOMMU
> +struct DMAMmu {
> +};
> +
> +struct DMADevice {
> +};
> +
> +#define NO_IOMMU(_dev) (1)
> +
> +static inline void dma_memory_rw(DMADevice *dev, dma_addr_t addr,
> +                                 void *buf, dma_addr_t len, int is_write)
> +{
> +    cpu_physical_memory_rw(addr, buf, len, is_write);
> +}
> +
> +static inline void *dma_memory_map(DMADevice *dev,
> +                                   DMAInvalidateMapFunc *cb, void *opaque,
> +                                   dma_addr_t addr, dma_addr_t *len,
> +                                   int is_write)
> +{
> +    return cpu_physical_memory_map(addr, len, is_write);
> +}
> +
> +static inline void dma_memory_unmap(DMADevice *dev,
> +                                    void *buffer, dma_addr_t len,
> +                                    int is_write, dma_addr_t access_len)
> +{
> +    cpu_physical_memory_unmap(buffer, len, is_write, access_len);
> +}
> +
> +#define DEFINE_DMA_LD(suffix, size)                                       \
> +static inline uint##size##_t                                              \
> +dma_ld##suffix(DMADevice *dev, dma_addr_t addr)                           \
> +{                                                                         \
> +    return ld##suffix##_phys(addr);                                       \
> +}
> +
> +#define DEFINE_DMA_ST(suffix, size)                                       \
> +static inline void                                                        \
> +dma_st##suffix(DMADevice *dev, dma_addr_t addr, uint##size##_t val)       \
> +{                                                                         \
> +    st##suffix##_phys(addr, val);                                         \
> +}
> +
> +#else
> +struct DMAMmu {
> +    DeviceState *iommu;
> +    DMATranslateFunc *translate;
> +    QLIST_HEAD(memory_maps, DMAMemoryMap) memory_maps;
> +};
> +
> +struct DMADevice {
> +    DMAMmu *mmu;
> +};
> +
> +#define NO_IOMMU(_dev) (!(_dev) || !(_dev)->mmu)
> +
> +void dma_memory_rw(DMADevice *dev, dma_addr_t addr,
> +                   void *buf, dma_addr_t len, int is_write);
> +
> +void *dma_memory_map(DMADevice *dev,
> +                     DMAInvalidateMapFunc *cb, void *opaque,
> +                     dma_addr_t addr, dma_addr_t *len,
> +                     int is_write);
> +void dma_memory_unmap(DMADevice *dev,
> +                      void *buffer, dma_addr_t len,
> +                      int is_write, dma_addr_t access_len);
> +
> +void dma_invalidate_memory_range(DMADevice *dev,
> +                                 dma_addr_t addr, dma_addr_t len);
> +
> +/* warning: like the corresponding ldX_phys / stX_phys functions, these
> + * DMA accessors can only handle aligned accesses */
> +
> +#define DEFINE_DMA_LD(suffix, size)                                       \
> +static inline uint##size##_t                                              \
> +dma_ld##suffix(DMADevice *dev, dma_addr_t addr)                           \
> +{                                                                         \
> +    int err;                                                              \
> +    dma_addr_t paddr, plen;                                               \
> +                                                                          \
> +    if (NO_IOMMU(dev)) {                                                  \
> +        return ld##suffix##_phys(addr);                                   \
> +    }                                                                     \
> +                                                                          \
> +    err = dev->mmu->translate(dev, addr, &paddr, &plen, 0);               \
> +    if (err || (plen < size / 8))                                         \
> +        return 0;                                                         \
> +                                                                          \
> +    return ld##suffix##_phys(paddr);                                      \
> +}
> +
> +#define DEFINE_DMA_ST(suffix, size)                                       \
> +static inline void                                                        \
> +dma_st##suffix(DMADevice *dev, dma_addr_t addr, uint##size##_t val)       \
> +{                                                                         \
> +    int err;                                                              \
> +    target_phys_addr_t paddr, plen;                                       \
> +                                                                          \
> +    if (NO_IOMMU(dev)) {                                                  \
> +        st##suffix##_phys(addr, val);                                     \
> +        return;                                                           \
> +    }                                                                     \
> +    err = dev->mmu->translate(dev, addr, &paddr, &plen, 1);               \
> +    if (err || (plen < size / 8))                                         \
> +        return;                                                           \
> +                                                                          \
> +    st##suffix##_phys(paddr, val);                                        \
> +}
> +#endif /* CONFIG_IOMMU */
> +
> +DEFINE_DMA_LD(ub, 8)
> +DEFINE_DMA_LD(uw, 16)
> +DEFINE_DMA_LD(l, 32)
> +DEFINE_DMA_LD(q, 64)
> +
> +DEFINE_DMA_ST(b, 8)
> +DEFINE_DMA_ST(w, 16)
> +DEFINE_DMA_ST(l, 32)
> +DEFINE_DMA_ST(q, 64)
> +
> +static inline void dma_memory_read(DMADevice *dev,
> +                                   dma_addr_t addr,
> +                                   void *buf,
> +                                   dma_addr_t len)
> +{
> +    dma_memory_rw(dev, addr, buf, len, 0);
> +}
> +
> +static inline void dma_memory_write(DMADevice *dev,
> +                                    dma_addr_t addr,
> +                                    const void *buf,
> +                                    dma_addr_t len)
> +{
> +    dma_memory_rw(dev, addr, (void *) buf, len, 1);
> +}
> +
> +#endif /* DMA_RW_H */
> -- 
> 1.7.1
> 
> 
> -- 
> David Gibson			| I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
> 				| _way_ _around_!
> http://www.ozlabs.org/~dgibson
Joerg Roedel - April 26, 2011, 3:58 p.m.
Hello David,

On Thu, Apr 21, 2011 at 03:03:47AM -0400, David Gibson wrote:
> A few months ago, Eduard - Gabriel Munteanu posted a series of patches
> implementing support for emulating the AMD PCI IOMMU
> (http://lists.nongnu.org/archive/html/qemu-devel/2011-01/msg03196.html).
> 
> In fact, this series implemented a general DMA/IOMMU layer which can
> be used by any device model, and one translation backend for this
> implementing the AMD specific PCI IOMMU.

the patches for AMD IOMMU emulation are not yet upstream has you have
already noticed. Eduard is busy with his studies at the moment so any
help is greatly appreciated :)
Feel free to pick up hist patches and re-submit them (keeping the author
correct of course).
If you submit your code, it would be cool if you could put me on Cc so I
can easily follow the discussion and jump in if required.

Regards,

	Joerg
Richard Henderson - April 28, 2011, 9:57 p.m.
On 04/21/2011 02:39 AM, Alexander Graf wrote:
> How exactly is this going to be used? Also, in the end I think that
> most devices should just go through a PCI specific interface that
> then calls the DMA helpers:
> 
> pci_memory_rw(PCIDevice *d, ...)
> 
> even if it's only as simple as calling
> 
> dma_memory_rw(d->iommu, ...)

I've had a read through the patches posted in January.  It all does
seem relatively sane.  At least, I can readily see how I would apply
these interfaces to my Alpha port without trouble.

I'll agree with Alex though that the raw dma_memory_rw functions 
should not be exposed to the drivers for any given bus.  They should
always go through {pci,isa}_memory_rw.  And these should almost 
certainly be inline functions that just pass on &device->bus.mmu.


r~
Richard Henderson - April 29, 2011, 2:27 p.m.
On 04/28/2011 02:57 PM, Richard Henderson wrote:
> I've had a read through the patches posted in January.  It all does
> seem relatively sane.  At least, I can readily see how I would apply
> these interfaces to my Alpha port without trouble.

I take that back, I see one rather annoying problem: the assumption
that the translate function operates on some sort of standalone device.

This assumption is present in two places:

> void pci_register_iommu(PCIDevice *dev, DMATranslateFunc *translate);

Here, you're assuming that the IOMMU is a device on the PCI bus itself.

While this may indeed be how the AMD-IOMMU presents itself for the 
convenience of the pc-minded operating system, that's certainly not how
the hardware is implemented.  In practice it's a function of the PCI
host controller.  And indeed, that's how it is presented to the system
for the Sparc and Alpha controllers with which I am familiar.  I assume
it's similar for PowerPC, but I've never looked.

> struct DMAMmu {
>     DeviceState *iommu;
>     DMATranslateFunc *translate;
>     QLIST_HEAD(memory_maps, DMAMemoryMap) memory_maps;
> };

Here, you're assuming that the "iommu state" is a standalone qdev.

This is probably true most of the time, given that 

  FROM_SYSBUS(PCIHostState, sysbus_from_qdev(dev))

is true.  However, the Alpha Typhoon chipset has two pci host
controllers that are tied together in ways that would be 
difficult, or at least irritating, to represent as two separate
qdev entities.

I suggest that, like many other places we have callbacks, this
should be an opaque value private to the translate function.

In your AMD-IOMMU case you can then do

  pci_register_iommu(dev->bus, amd_iommu_translate, st);

and avoid 

>     PCIDevice *pci_dev = container_of(dev, PCIDevice, dma);
>     PCIDevice *iommu_dev = DO_UPCAST(PCIDevice, qdev, dev->mmu->iommu);
>     AMDIOMMUState *st = DO_UPCAST(AMDIOMMUState, dev, iommu_dev);

at the beginning of your translate function.  You currently have
three levels of casting and pointer chasing; surely you can agree
that having a single cast from void* is much easier to follow.

In my Alpha case I can then do

  pci_register_iommu(pci_bus0, typhoon_iommu_translate, &state->pchip0);
  pci_register_iommu(pci_bus1, typhoon_iommu_translate, &state->pchip1);

and be off to the races.


r~
David Gibson - May 9, 2011, 2:05 a.m.
On Thu, Apr 21, 2011 at 11:39:22AM +0200, Alexander Graf wrote:
> On 21.04.2011, at 09:03, David Gibson wrote:
[snip]
> > @@ -934,6 +939,8 @@ echo "  --enable-docs            enable documentation build"
> > echo "  --disable-docs           disable documentation build"
> > echo "  --disable-vhost-net      disable vhost-net acceleration support"
> > echo "  --enable-vhost-net       enable vhost-net acceleration support"
> > +echo "  --disable-iommu          disable IOMMU emulation support"
> > +echo "  --enable-vhost-net       enable IOMMU emulation support"
> 
> eeh?

Oops, copy and paste error.  Fix in my tree now.

[snip]
> > +#else
> > +struct DMAMmu {
> > +    DeviceState *iommu;
> > +    DMATranslateFunc *translate;
> > +    QLIST_HEAD(memory_maps, DMAMemoryMap) memory_maps;
> > +};
> > +
> > +struct DMADevice {
> 
> How exactly is this going to be used? 

Well, the guts of the dma layer is not my work, so I can't really
answer that.

> Also, in the end I think that
> most devices should just go through a PCI specific interface that
> then calls the DMA helpers:
> 
> pci_memory_rw(PCIDevice *d, ...)
> 
> even if it's only as simple as calling
> 
> dma_memory_rw(d->iommu, ...)

So, I was actually thinking it would make most sense to have an iommu
pointer in *every* qdev, PCI or otherwise, so that we just use
	dma_memory_rw(qdev, ...)
everywhere.  Obviously a NULL iommu pointer would fall back to no
translation.
David Gibson - May 9, 2011, 2:06 a.m.
On Thu, Apr 28, 2011 at 02:57:14PM -0700, Richard Henderson wrote:
> On 04/21/2011 02:39 AM, Alexander Graf wrote:
> > How exactly is this going to be used? Also, in the end I think that
> > most devices should just go through a PCI specific interface that
> > then calls the DMA helpers:
> > 
> > pci_memory_rw(PCIDevice *d, ...)
> > 
> > even if it's only as simple as calling
> > 
> > dma_memory_rw(d->iommu, ...)
> 
> I've had a read through the patches posted in January.  It all does
> seem relatively sane.  At least, I can readily see how I would apply
> these interfaces to my Alpha port without trouble.
> 
> I'll agree with Alex though that the raw dma_memory_rw functions 
> should not be exposed to the drivers for any given bus.  They should
> always go through {pci,isa}_memory_rw.  And these should almost 
> certainly be inline functions that just pass on &device->bus.mmu.

I don't really see the point of having per-bus helpers.  This is going
to be widespread enough that we can have an iommu pointer in the core
qdev, then everything can use the same helpers.
David Gibson - May 9, 2011, 3:28 a.m.
On Fri, Apr 29, 2011 at 07:27:38AM -0700, Richard Henderson wrote:
> On 04/28/2011 02:57 PM, Richard Henderson wrote:
> > I've had a read through the patches posted in January.  It all does
> > seem relatively sane.  At least, I can readily see how I would apply
> > these interfaces to my Alpha port without trouble.
> 
> I take that back, I see one rather annoying problem: the assumption
> that the translate function operates on some sort of standalone device.
> 
> This assumption is present in two places:

Yeah, I noticed this problem too, though I think it's in the AMD IOMMU
support patch rather than the DMA translation core.

> > void pci_register_iommu(PCIDevice *dev, DMATranslateFunc *translate);
> 
> Here, you're assuming that the IOMMU is a device on the PCI bus itself.
> 
> While this may indeed be how the AMD-IOMMU presents itself for the 
> convenience of the pc-minded operating system, that's certainly not how
> the hardware is implemented.  In practice it's a function of the PCI
> host controller.  And indeed, that's how it is presented to the system
> for the Sparc and Alpha controllers with which I am familiar.  I assume
> it's similar for PowerPC, but I've never looked.

Yes, that's right.  On pSeries platforms IOMMU translations are
configured using hcalls.

> > struct DMAMmu {
> >     DeviceState *iommu;
> >     DMATranslateFunc *translate;
> >     QLIST_HEAD(memory_maps, DMAMemoryMap) memory_maps;
> > };
> 
> Here, you're assuming that the "iommu state" is a standalone qdev.
> 
> This is probably true most of the time, given that 
> 
>   FROM_SYSBUS(PCIHostState, sysbus_from_qdev(dev))
> 
> is true.  However, the Alpha Typhoon chipset has two pci host
> controllers that are tied together in ways that would be 
> difficult, or at least irritating, to represent as two separate
> qdev entities.
> 
> I suggest that, like many other places we have callbacks, this
> should be an opaque value private to the translate function.
> 
> In your AMD-IOMMU case you can then do
> 
>   pci_register_iommu(dev->bus, amd_iommu_translate, st);
> 
> and avoid 
> 
> >     PCIDevice *pci_dev = container_of(dev, PCIDevice, dma);
> >     PCIDevice *iommu_dev = DO_UPCAST(PCIDevice, qdev, dev->mmu->iommu);
> >     AMDIOMMUState *st = DO_UPCAST(AMDIOMMUState, dev, iommu_dev);
> 
> at the beginning of your translate function.  You currently have
> three levels of casting and pointer chasing; surely you can agree
> that having a single cast from void* is much easier to follow.
> 
> In my Alpha case I can then do
> 
>   pci_register_iommu(pci_bus0, typhoon_iommu_translate, &state->pchip0);
>   pci_register_iommu(pci_bus1, typhoon_iommu_translate, &state->pchip1);
> 
> and be off to the races.

Yes, I think that makes sense.
David Gibson - May 10, 2011, 1:44 a.m.
On Thu, Apr 21, 2011 at 09:47:31PM +0300, Eduard - Gabriel Munteanu wrote:
> On Thu, Apr 21, 2011 at 05:03:47PM +1000, David Gibson wrote:
> > A few months ago, Eduard - Gabriel Munteanu posted a series of patches
> > implementing support for emulating the AMD PCI IOMMU
> > (http://lists.nongnu.org/archive/html/qemu-devel/2011-01/msg03196.html).
> > 
> > In fact, this series implemented a general DMA/IOMMU layer which can
> > be used by any device model, and one translation backend for this
> > implementing the AMD specific PCI IOMMU.
> > 
> > These patches don't seem to have gone anywhere for the last few
> > months, however, and so far I've been unable to contact the author
> > (trying again with this mail).
> > 
> > I have an interest in this code, because the pSeries machine will also
> > need IOMMU emulation support.  At present we only support virtual
> > devices, through the PAPR interface, and we have support for the
> > hypervisor-controller IOMMU translation in the PAPR VIO code.
> > However, we want to add PCI device support and this will also need
> > IOMMU translation.
> > 
> > The series seems to have the right basic approach, so if the author is
> > indeed MIA, I was planning to pick up the patches and resubmit them
> > (with support for the pSeries IOMMU added).
> 
> Hi,
> 
> Not really MIA, but I've been a bit busy lately, so I'm sorry if I
> couldn't answer your mail in a timely fashion.
> 
> I'll try making another merge attempt tonight/tomorrow.

Ok.  Did this happen?  Sorry, I've been away the last couple of weeks
- I had a google at the qemu-devel archives but couldn't spot a new
merge, but did I just not look hard enough?

> > Before I do that, I was hoping to get some consensus that this is the
> > right way to go.  For reference, I have an updated version of the
> > first patch (which adds the core IOMMU layer) below.

I think the base DMA layer is the correct approach.  There are some
problems with the handling in PCI - as someone else pointed out the
fact that it assumes the IOMMU is itself a PCI device is problematic
for non-x86 platforms.

> Some developers expressed a few concerns during my last merge attempt,
> I'm going to go through them and see if they have been solved.

Ok.

[snip]
> >  * the dma_memory_map() tracking was storing the guest physical
> >    address of each mapping, but not the qemu user virtual address.
> >    However in unmap() it was then attempting to lookup by virtual
> >    using a completely bogus cast.
> 
> Thanks. Map invalidation didn't get much testing, maybe figuring out a
> way to trigger it from a guest would be nice, say a testcase.

Well, I didn't catch a logic problem - for me this bug caused compile
failure.

> >  * The dma_memory_rw() function is moved from dma_rw.h to dma_rw.c, it
> >    was a bit too much code for an inline.
> > 
> >  * IOMMU support is now available on all target platforms, not just
> >    i386, but is configurable (--enable-iommu/--disable-iommu).  Stubs
> >    are used so that individual drivers can use the new dma interface
> >    and it will turn into old-style cpu physical accesses at no cost on
> >    IOMMU-less builds.
> 
> My impression was people were in favor of having the IOMMU code always
> built in (and go through the direct cpu_physical_* when not configured
> by the guest). And perhaps poison the old interfaces once everything
> goes through the new DMA layer. I'm okay any way, though.

Oh, I had the opposite impression.  I don't care either way,
personally.
Eduard - Gabriel Munteanu - May 14, 2011, 3:27 p.m.
On Tue, May 10, 2011 at 11:44:26AM +1000, David Gibson wrote:
> On Thu, Apr 21, 2011 at 09:47:31PM +0300, Eduard - Gabriel Munteanu wrote:
> > On Thu, Apr 21, 2011 at 05:03:47PM +1000, David Gibson wrote:
> > > A few months ago, Eduard - Gabriel Munteanu posted a series of patches
> > > implementing support for emulating the AMD PCI IOMMU
> > > (http://lists.nongnu.org/archive/html/qemu-devel/2011-01/msg03196.html).
> > > 
> > > In fact, this series implemented a general DMA/IOMMU layer which can
> > > be used by any device model, and one translation backend for this
> > > implementing the AMD specific PCI IOMMU.
> > > 
> > > These patches don't seem to have gone anywhere for the last few
> > > months, however, and so far I've been unable to contact the author
> > > (trying again with this mail).
> > > 
> > > I have an interest in this code, because the pSeries machine will also
> > > need IOMMU emulation support.  At present we only support virtual
> > > devices, through the PAPR interface, and we have support for the
> > > hypervisor-controller IOMMU translation in the PAPR VIO code.
> > > However, we want to add PCI device support and this will also need
> > > IOMMU translation.
> > > 
> > > The series seems to have the right basic approach, so if the author is
> > > indeed MIA, I was planning to pick up the patches and resubmit them
> > > (with support for the pSeries IOMMU added).
> > 
> > Hi,
> > 
> > Not really MIA, but I've been a bit busy lately, so I'm sorry if I
> > couldn't answer your mail in a timely fashion.
> > 
> > I'll try making another merge attempt tonight/tomorrow.
> 
> Ok.  Did this happen?  Sorry, I've been away the last couple of weeks
> - I had a google at the qemu-devel archives but couldn't spot a new
> merge, but did I just not look hard enough?

[snip]

No, I've made some progress but I've still got a few concerns to
address, mainly how to handle unaligned accesses and some things related
to the IOMMU behavior like target aborts.

I've added some macro magic to declare bus-specific interfaces and
converted PCI devices to use pci_memory_*().

I'll try not to hold this back too much, but I can't make promises wrt
timing.


	Eduard
Benjamin Herrenschmidt - May 14, 2011, 11:16 p.m.
On Sat, 2011-05-14 at 18:27 +0300, Eduard - Gabriel Munteanu wrote:
> No, I've made some progress but I've still got a few concerns to
> address, mainly how to handle unaligned accesses and some things
> related
> to the IOMMU behavior like target aborts.
> 
> I've added some macro magic to declare bus-specific interfaces and
> converted PCI devices to use pci_memory_*().
> 
> I'll try not to hold this back too much, but I can't make promises wrt
> timing.

If you have newer patches, even if they don't address all the above
issues, please send them.

We cannot afford to be blocked waiting for you, we have to make progress
on implementing the ppc side.

Cheers,
Ben.

Patch

diff --git a/Makefile.target b/Makefile.target
index 95f5eda..c3d36c6 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -198,6 +198,7 @@  obj-$(CONFIG_REALLY_VIRTFS) += virtio-9p.o
 obj-y += rwhandler.o
 obj-$(CONFIG_KVM) += kvm.o kvm-all.o
 obj-$(CONFIG_NO_KVM) += kvm-stub.o
+obj-$(CONFIG_IOMMU) += dma_rw.o
 LIBS+=-lz
 
 QEMU_CFLAGS += $(VNC_TLS_CFLAGS)
diff --git a/configure b/configure
index da2da04..fa6f4d5 100755
--- a/configure
+++ b/configure
@@ -130,6 +130,7 @@  xen=""
 linux_aio=""
 attr=""
 vhost_net=""
+iommu="no"
 xfs=""
 
 gprof="no"
@@ -723,6 +724,10 @@  for opt do
   ;;
   --enable-vhost-net) vhost_net="yes"
   ;;
+  --enable-iommu) iommu="yes"
+  ;;
+  --disable-iommu) iommu="no"
+  ;;
   --disable-opengl) opengl="no"
   ;;
   --enable-opengl) opengl="yes"
@@ -934,6 +939,8 @@  echo "  --enable-docs            enable documentation build"
 echo "  --disable-docs           disable documentation build"
 echo "  --disable-vhost-net      disable vhost-net acceleration support"
 echo "  --enable-vhost-net       enable vhost-net acceleration support"
+echo "  --disable-iommu          disable IOMMU emulation support"
+echo "  --enable-vhost-net       enable IOMMU emulation support"
 echo "  --enable-trace-backend=B Set trace backend"
 echo "                           Available backends:" $("$source_path"/scripts/tracetool --list-backends)
 echo "  --with-trace-file=NAME   Full PATH,NAME of file to store traces"
@@ -2608,6 +2615,7 @@  echo "madvise           $madvise"
 echo "posix_madvise     $posix_madvise"
 echo "uuid support      $uuid"
 echo "vhost-net support $vhost_net"
+echo "IOMMU support     $iommu"
 echo "Trace backend     $trace_backend"
 echo "Trace output file $trace_file-<pid>"
 echo "spice support     $spice"
@@ -3412,6 +3420,10 @@  if test "$target_softmmu" = "yes" -a \( \
   echo "CONFIG_NEED_MMU=y" >> $config_target_mak
 fi
 
+if test "$iommu" = "yes" ; then
+  echo "CONFIG_IOMMU=y" >> $config_target_mak
+fi
+
 if test "$gprof" = "yes" ; then
   echo "TARGET_GPROF=yes" >> $config_target_mak
   if test "$target_linux_user" = "yes" ; then
diff --git a/hw/dma_rw.c b/hw/dma_rw.c
new file mode 100644
index 0000000..627835c
--- /dev/null
+++ b/hw/dma_rw.c
@@ -0,0 +1,147 @@ 
+/*
+ * Generic DMA memory access interface.
+ *
+ * Copyright (c) 2011 Eduard - Gabriel Munteanu
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ */
+
+#include "dma_rw.h"
+#include "range.h"
+
+void dma_memory_rw(DMADevice *dev, dma_addr_t addr,
+                   void *buf, dma_addr_t len, int is_write)
+{
+    dma_addr_t paddr, plen;
+    int err;
+
+    /*
+     * Fast-path non-iommu.
+     * More importantly, makes it obvious what this function does.
+     */
+    if (NO_IOMMU(dev)) {
+        cpu_physical_memory_rw(addr, buf, len, is_write);
+        return;
+    }
+
+    while (len) {
+        err = dev->mmu->translate(dev, addr, &paddr, &plen, is_write);
+        if (err) {
+            return;
+        }
+
+        /* The translation might be valid for larger regions. */
+        if (plen > len) {
+            plen = len;
+        }
+
+        cpu_physical_memory_rw(paddr, buf, plen, is_write);
+
+        len -= plen;
+        addr += plen;
+        buf += plen;
+    }
+}
+
+typedef struct DMAMemoryMap DMAMemoryMap;
+struct DMAMemoryMap {
+    dma_addr_t              addr;
+    size_t                  len;
+    void *                  buf;
+    DMAInvalidateMapFunc    *invalidate;
+    void                    *invalidate_opaque;
+
+    QLIST_ENTRY(DMAMemoryMap) list;
+};
+
+void dma_invalidate_memory_range(DMADevice *dev,
+                                 dma_addr_t addr, dma_addr_t len)
+{
+    DMAMemoryMap *map;
+
+    QLIST_FOREACH(map, &dev->mmu->memory_maps, list) {
+        if (ranges_overlap(addr, len, map->addr, map->len)) {
+            map->invalidate(map->invalidate_opaque);
+            QLIST_REMOVE(map, list);
+            free(map);
+        }
+    }
+}
+
+void *dma_memory_map(DMADevice *dev, DMAInvalidateMapFunc *cb, void *opaque,
+                     dma_addr_t addr, dma_addr_t *len, int is_write)
+{
+    int err;
+    target_phys_addr_t paddr, plen;
+    void *buf;
+
+    if (NO_IOMMU(dev)) {
+        return cpu_physical_memory_map(addr, len, is_write);
+    }
+
+    plen = *len;
+    err = dev->mmu->translate(dev, addr, &paddr, &plen, is_write);
+    if (err) {
+        return NULL;
+    }
+
+    /*
+     * If this is true, the virtual region is contiguous,
+     * but the translated physical region isn't. We just
+     * clamp *len, much like cpu_physical_memory_map() does.
+     */
+    if (plen < *len) {
+        *len = plen;
+    }
+
+    buf = cpu_physical_memory_map(paddr, len, is_write);
+
+    /* We treat maps as remote TLBs to cope with stuff like AIO. */
+    if (cb) {
+        DMAMemoryMap *map;
+
+        map = qemu_malloc(sizeof(DMAMemoryMap));
+        map->addr = addr;
+        map->len = *len;
+        map->buf = buf;
+        map->invalidate = cb;
+        map->invalidate_opaque = opaque;
+        
+        QLIST_INSERT_HEAD(&dev->mmu->memory_maps, map, list);
+    }
+
+    return buf;
+}
+
+void dma_memory_unmap(DMADevice *dev, void *buffer, dma_addr_t len,
+                      int is_write, dma_addr_t access_len)
+{
+    cpu_physical_memory_unmap(buffer, len, is_write, access_len);
+    if (!NO_IOMMU(dev)) {
+        DMAMemoryMap *map;
+
+        QLIST_FOREACH(map, &dev->mmu->memory_maps, list) {
+            if ((map->buf == buffer) && (map->len == len)) {
+                QLIST_REMOVE(map, list);
+                free(map);
+            }
+        }
+    }
+}
+
diff --git a/hw/dma_rw.h b/hw/dma_rw.h
new file mode 100644
index 0000000..58a3d0f
--- /dev/null
+++ b/hw/dma_rw.h
@@ -0,0 +1,156 @@ 
+#ifndef DMA_RW_H
+#define DMA_RW_H
+
+#include "qemu-common.h"
+
+typedef uint64_t dma_addr_t;
+
+typedef struct DMAMmu DMAMmu;
+typedef struct DMADevice DMADevice;
+
+typedef int DMATranslateFunc(DMADevice *dev,
+                             dma_addr_t addr,
+                             dma_addr_t *paddr,
+                             dma_addr_t *len,
+                             int is_write);
+
+typedef void DMAInvalidateMapFunc(void *);
+
+#ifndef CONFIG_IOMMU
+struct DMAMmu {
+};
+
+struct DMADevice {
+};
+
+#define NO_IOMMU(_dev) (1)
+
+static inline void dma_memory_rw(DMADevice *dev, dma_addr_t addr,
+                                 void *buf, dma_addr_t len, int is_write)
+{
+    cpu_physical_memory_rw(addr, buf, len, is_write);
+}
+
+static inline void *dma_memory_map(DMADevice *dev,
+                                   DMAInvalidateMapFunc *cb, void *opaque,
+                                   dma_addr_t addr, dma_addr_t *len,
+                                   int is_write)
+{
+    return cpu_physical_memory_map(addr, len, is_write);
+}
+
+static inline void dma_memory_unmap(DMADevice *dev,
+                                    void *buffer, dma_addr_t len,
+                                    int is_write, dma_addr_t access_len)
+{
+    cpu_physical_memory_unmap(buffer, len, is_write, access_len);
+}
+
+#define DEFINE_DMA_LD(suffix, size)                                       \
+static inline uint##size##_t                                              \
+dma_ld##suffix(DMADevice *dev, dma_addr_t addr)                           \
+{                                                                         \
+    return ld##suffix##_phys(addr);                                       \
+}
+
+#define DEFINE_DMA_ST(suffix, size)                                       \
+static inline void                                                        \
+dma_st##suffix(DMADevice *dev, dma_addr_t addr, uint##size##_t val)       \
+{                                                                         \
+    st##suffix##_phys(addr, val);                                         \
+}
+
+#else
+struct DMAMmu {
+    DeviceState *iommu;
+    DMATranslateFunc *translate;
+    QLIST_HEAD(memory_maps, DMAMemoryMap) memory_maps;
+};
+
+struct DMADevice {
+    DMAMmu *mmu;
+};
+
+#define NO_IOMMU(_dev) (!(_dev) || !(_dev)->mmu)
+
+void dma_memory_rw(DMADevice *dev, dma_addr_t addr,
+                   void *buf, dma_addr_t len, int is_write);
+
+void *dma_memory_map(DMADevice *dev,
+                     DMAInvalidateMapFunc *cb, void *opaque,
+                     dma_addr_t addr, dma_addr_t *len,
+                     int is_write);
+void dma_memory_unmap(DMADevice *dev,
+                      void *buffer, dma_addr_t len,
+                      int is_write, dma_addr_t access_len);
+
+void dma_invalidate_memory_range(DMADevice *dev,
+                                 dma_addr_t addr, dma_addr_t len);
+
+/* warning: like the corresponding ldX_phys / stX_phys functions, these
+ * DMA accessors can only handle aligned accesses */
+
+#define DEFINE_DMA_LD(suffix, size)                                       \
+static inline uint##size##_t                                              \
+dma_ld##suffix(DMADevice *dev, dma_addr_t addr)                           \
+{                                                                         \
+    int err;                                                              \
+    dma_addr_t paddr, plen;                                               \
+                                                                          \
+    if (NO_IOMMU(dev)) {                                                  \
+        return ld##suffix##_phys(addr);                                   \
+    }                                                                     \
+                                                                          \
+    err = dev->mmu->translate(dev, addr, &paddr, &plen, 0);               \
+    if (err || (plen < size / 8))                                         \
+        return 0;                                                         \
+                                                                          \
+    return ld##suffix##_phys(paddr);                                      \
+}
+
+#define DEFINE_DMA_ST(suffix, size)                                       \
+static inline void                                                        \
+dma_st##suffix(DMADevice *dev, dma_addr_t addr, uint##size##_t val)       \
+{                                                                         \
+    int err;                                                              \
+    target_phys_addr_t paddr, plen;                                       \
+                                                                          \
+    if (NO_IOMMU(dev)) {                                                  \
+        st##suffix##_phys(addr, val);                                     \
+        return;                                                           \
+    }                                                                     \
+    err = dev->mmu->translate(dev, addr, &paddr, &plen, 1);               \
+    if (err || (plen < size / 8))                                         \
+        return;                                                           \
+                                                                          \
+    st##suffix##_phys(paddr, val);                                        \
+}
+#endif /* CONFIG_IOMMU */
+
+DEFINE_DMA_LD(ub, 8)
+DEFINE_DMA_LD(uw, 16)
+DEFINE_DMA_LD(l, 32)
+DEFINE_DMA_LD(q, 64)
+
+DEFINE_DMA_ST(b, 8)
+DEFINE_DMA_ST(w, 16)
+DEFINE_DMA_ST(l, 32)
+DEFINE_DMA_ST(q, 64)
+
+static inline void dma_memory_read(DMADevice *dev,
+                                   dma_addr_t addr,
+                                   void *buf,
+                                   dma_addr_t len)
+{
+    dma_memory_rw(dev, addr, buf, len, 0);
+}
+
+static inline void dma_memory_write(DMADevice *dev,
+                                    dma_addr_t addr,
+                                    const void *buf,
+                                    dma_addr_t len)
+{
+    dma_memory_rw(dev, addr, (void *) buf, len, 1);
+}
+
+#endif /* DMA_RW_H */