[09/13] iommu: Add facility to cancel in-use dma memory maps

Message ID 1340087992-2399-10-git-send-email-benh@kernel.crashing.org
State New

Commit Message

Benjamin Herrenschmidt June 19, 2012, 6:39 a.m. UTC
From: David Gibson <david@gibson.dropbear.id.au>

One new complication raised by IOMMU support, compared with only
handling DMA directly to physical addresses, is the dma_memory_map()
case (replacing cpu_physical_memory_map()): the IOMMU translations for
the IOVAs covered by such a map may be invalidated or changed while the
map is still active.  This should never happen with correct guest
software, but we do need to handle buggy guests.  This case might also
occur during handovers between different guest software stages if the
handover protocols aren't fully seamless.

The iommu implementation will have to wait for maps to be removed
before it can "complete" an invalidation of a translation, which
can take a long time. In order to make it possible to speed that
process up, we add a "Cancel" callback to the map function which
the clients can optionally provide.

The core makes no use of that, but the iommu backend implementation
may choose to keep track of maps and call the respective cancel
callback whenever a translation within a map is removed, allowing
the driver to do things like cancel async IOs etc.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---
 dma-helpers.c |   49 ++++++++++++++++++++++++++++---------------------
 dma.h         |   23 +++++++++++++++++++----
 2 files changed, 47 insertions(+), 25 deletions(-)
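
A minimal sketch of how a device model might use the interface added
here, based on the dma.h hunk below. The device state type and helper
names (MyDeviceState, my_device_dma_cancel, my_device_map) are
hypothetical, for illustration only:

    #include "dma.h"  /* dma_memory_map_with_cancel(), DMADirection */

    /* Hypothetical device state, for illustration only. */
    typedef struct MyDeviceState MyDeviceState;

    /* Optional cancel callback: an iommu backend that tracks maps may
     * invoke it when a translation covered by this map is invalidated,
     * so the device can initiate cancellation of outstanding async I/O
     * and unmap sooner. */
    static void my_device_dma_cancel(void *opaque)
    {
        MyDeviceState *s = opaque;
        (void)s;  /* e.g. initiate cancellation of pending AIO here */
    }

    static void *my_device_map(MyDeviceState *s, DMAContext *dma,
                               dma_addr_t iova, dma_addr_t *len)
    {
        /* Callers that don't care keep using dma_memory_map(), which
         * simply passes NULL/NULL for the callback (see dma.h below). */
        return dma_memory_map_with_cancel(dma, my_device_dma_cancel, s,
                                          iova, len,
                                          DMA_DIRECTION_FROM_DEVICE);
    }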

Comments

Anthony Liguori June 20, 2012, 9:25 p.m. UTC | #1
On 06/19/2012 01:39 AM, Benjamin Herrenschmidt wrote:
> From: David Gibson<david@gibson.dropbear.id.au>
>
> One new complication raised by IOMMU support, compared with only
> handling DMA directly to physical addresses, is the dma_memory_map()
> case (replacing cpu_physical_memory_map()): the IOMMU translations for
> the IOVAs covered by such a map may be invalidated or changed while the
> map is still active.  This should never happen with correct guest
> software, but we do need to handle buggy guests.  This case might also
> occur during handovers between different guest software stages if the
> handover protocols aren't fully seamless.
>
> The iommu implementation will have to wait for maps to be removed
> before it can "complete" an invalidation of a translation, which
> can take a long time. In order to make it possible to speed that
> process up, we add a "Cancel" callback to the map function which
> the clients can optionally provide.
>
> The core makes no use of that, but the iommu backend implementation
> may choose to keep track of maps and call the respective cancel
> callback whenever a translation within a map is removed, allowing
> the driver to do things like cancel async IOs etc.
>
> Signed-off-by: David Gibson<david@gibson.dropbear.id.au>
> Signed-off-by: Benjamin Herrenschmidt<benh@kernel.crashing.org>
> ---
>   dma-helpers.c |   49 ++++++++++++++++++++++++++++---------------------
>   dma.h         |   23 +++++++++++++++++++----
>   2 files changed, 47 insertions(+), 25 deletions(-)
>
> diff --git a/dma-helpers.c b/dma-helpers.c
> index b4ee827..6e6c7b3 100644
> --- a/dma-helpers.c
> +++ b/dma-helpers.c
> @@ -107,6 +107,28 @@ static void dma_complete(DMAAIOCB *dbs, int ret)
>       }
>   }
>
> +static void dma_aio_cancel(BlockDriverAIOCB *acb)
> +{
> +    DMAAIOCB *dbs = container_of(acb, DMAAIOCB, common);
> +
> +    trace_dma_aio_cancel(dbs);
> +
> +    if (dbs->acb) {
> +        BlockDriverAIOCB *acb = dbs->acb;
> +        dbs->acb = NULL;
> +        dbs->in_cancel = true;
> +        bdrv_aio_cancel(acb);
> +        dbs->in_cancel = false;
> +    }
> +    dbs->common.cb = NULL;
> +    dma_complete(dbs, 0);

So this cancellation stuff is hopelessly broken

It's simply not possible to fully cancel pending DMA in a synchronous callback.

Indeed, bdrv_aio_cancel ends up having a nasty little loop in it:

     if (active) {
         /* fail safe: if the aio could not be canceled, we wait for
            it */
         while (qemu_paio_error(acb) == EINPROGRESS)
             ;
     }

That spins w/100% CPU.

Can you explain when DMA cancellation really happens and what the effect would 
be if we simply ignored it?

Can we do something more clever like use an asynchronous callback to handle 
flushing active DMA mappings?

There's just no way a callback like this is going to work.

Regards,

Anthony Liguori
Benjamin Herrenschmidt June 20, 2012, 9:52 p.m. UTC | #2
On Wed, 2012-06-20 at 16:25 -0500, Anthony Liguori wrote:

> So this cancellation stuff is hopelessly broken
> 
> It's simply not possible to fully cancel pending DMA in a synchronous callback.

Well, at least for PAPR H_PUT_TCE, cancellation must be synchronous, i.e.
the hypercall must not return until the cancellation is complete.

> Indeed, bdrv_aio_cancel ends up having a nasty little loop in it:
> 
>      if (active) {
>          /* fail safe: if the aio could not be canceled, we wait for
>             it */
>          while (qemu_paio_error(acb) == EINPROGRESS)
>              ;
>      }
> 
> That spins w/100% CPU.
> 
> Can you explain when DMA cancellation really happens and what the effect would 
> be if we simply ignored it?

It will almost never happen in practice. It will actually never happen
with the current patch series. Where it -will- eventually happen in the
long run is if the guest removes a translation that is "in use" by a
dma_map() mapping established by a device. It's always a guest
programming error though and it's not an attack vector since the guest
can only shoot itself in the foot, but it might make things like kdump
less reliable inside the guest.

We need a way to signal the device that the translation is going away,
and we need -a- way to synchronize, though it could be a two-step
process (see below).

> Can we do something more clever like use an asynchronous callback to handle 
> flushing active DMA mappings?
> 
> There's just no way a callback like this is going to work.

Ok so first, let's see what happens on real HW: one of the DMA accesses
gets a target abort back from the host bridge. The device interrupts
its operations and signals an error.

Now, I agree that requiring a cancel callback to act synchronously might
be a bit fishy, so how about we define the following semantics:

 - First, this assumes our iommu backend decides to implement that level
of correctness; as I said above, none do in that patch series (yet)

 - The basic idea is that for most iommus, there's an MMIO operation to
start a TLB flush and an MMIO register the guest spins on to check
whether the TLB flush has completed, so we can do things asynchronously
that way. However, we -still- need to do things synchronously for the
hypercall used by PAPR, but as we discussed earlier that can be done
without spinning, by delaying the completion of the hypercall.

 - So step 1, no callback at all. When an iommu TLB flush operation is
started, we tag all pending maps (see below). We signal completion when
all those maps have been unmapped.

 - The above tagging can be done using some kind of generation count
along with an ordered list of maps: we keep track of the "oldest" map
still active, that sort of thing (see the rough sketch after this list).
Not too hard.

 - step 2: because some maps can be long lived and we don't want to
delay invalidations forever, we add a cancel callback which a device can
optionally install along with a map. This callback is only meant to
-initiate- a cancellation in order to speed up when the unmap will
occur.
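
A rough, self-contained sketch of the generation-count tracking
described in step 1 above. Everything here (names, structure, the
assumption of a single pending flush) is illustrative only, not code
from this series:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stddef.h>

    /* Each active map is stamped with the generation in which it was
     * created and kept on a list ordered oldest-first. */
    typedef struct IOMMUMap {
        uint64_t gen;
        struct IOMMUMap *next;
    } IOMMUMap;

    typedef struct IOMMUMapTracker {
        uint64_t cur_gen;    /* stamped onto newly created maps       */
        uint64_t flush_gen;  /* generation of the one pending flush   */
        IOMMUMap *oldest;    /* head of the ordered list of live maps */
    } IOMMUMapTracker;

    /* Guest starts a TLB flush: remember the current generation and
     * move to a new one, so maps created from now on don't hold up the
     * flush. */
    void tracker_start_flush(IOMMUMapTracker *t)
    {
        t->flush_gen = t->cur_gen++;
    }

    /* The flush may be signalled as complete once every map created at
     * or before flush_gen has been unmapped, i.e. the oldest live map
     * (if any) is newer than the flush. */
    bool tracker_flush_done(const IOMMUMapTracker *t)
    {
        return t->oldest == NULL || t->oldest->gen > t->flush_gen;
    }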

What do you think? Would that work? As I explained in my email
exchange with Gerd, there are quite a few issues in actually
implementing cancellation properly anyway. For example, today on PAPR,
H_PUT_TCE is implemented by KVM in the kernel, in real mode, for
performance reasons. So we would need to change the KVM API to be able
to keep the kernel informed that there are maps covering portions of the
iommu space (a bitmap?) in order to force exits to qemu when an
invalidation collides with a map, for example.

Additionally, to be totally -correct-, we would need synchronization
with qemu to a much larger extent. I.e., any invalidation must also make
sure that anything that used a previous translation has completed; even
the simple dma_ld/st ops must be synchronized in theory.

My conclusion is that the complexity of solving the problem is huge,
while the actual problem scope is close to non-existent. So I think we
can safely merge the series and ignore the issue for the time being.

Cheers,
Ben.
Benjamin Herrenschmidt June 22, 2012, 3:18 a.m. UTC | #3
On Wed, 2012-06-20 at 16:25 -0500, Anthony Liguori wrote:

> > +static void dma_aio_cancel(BlockDriverAIOCB *acb)
> > +{
> > +    DMAAIOCB *dbs = container_of(acb, DMAAIOCB, common);
> > +
> > +    trace_dma_aio_cancel(dbs);
> > +
> > +    if (dbs->acb) {
> > +        BlockDriverAIOCB *acb = dbs->acb;
> > +        dbs->acb = NULL;
> > +        dbs->in_cancel = true;
> > +        bdrv_aio_cancel(acb);
> > +        dbs->in_cancel = false;
> > +    }
> > +    dbs->common.cb = NULL;
> > +    dma_complete(dbs, 0);
> 
> So this cancellation stuff is hopelessly broken
> 
> It's simply not possible to fully cancel pending DMA in a synchronous callback.
> 
> Indeed, bdrv_aio_cancel ends up having a nasty little loop in it:

Yes, it's broken. Note that the patch didn't add the above function,
only moved it around.

In any case, I've decided to just drop that patch completely from the
series. I.e., I'm not adding the dma_memory_map_with_cancel() variant;
there's no point since:

 - Nothing will call the cancel callback today and possibly for a while

 - Nothing passes a cancel callback other than the bdrv stuff and that
callback is hopelessly broken as you mentioned above.

So there's just no point. We will add an optional cancel callback again
later, when we eventually decide to sort that problem out properly. It
will be an asynchronous cancel, i.e. it will just "initiate" the
cancellation, and I'll probably add it as an argument to the normal
dma_memory_map() (adding NULL to all callers that don't care) at that
point.
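
Purely as a sketch of what that eventual interface might look like
(an assumption for illustration, not something in this series):

    /* Hypothetical future signature: the cancel callback becomes an
     * argument of dma_memory_map() itself; callers that don't care pass
     * NULL/NULL. The callback would only *initiate* cancellation; the
     * invalidation still completes when the map is actually unmapped. */
    void *dma_memory_map(DMAContext *dma,
                         DMACancelMapFunc *cb, void *cb_opaque,
                         dma_addr_t addr, dma_addr_t *len,
                         DMADirection dir);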

For now, let's not add a known to be broken and unused interface.

Cheers,
Ben.

Patch

diff --git a/dma-helpers.c b/dma-helpers.c
index b4ee827..6e6c7b3 100644
--- a/dma-helpers.c
+++ b/dma-helpers.c
@@ -107,6 +107,28 @@  static void dma_complete(DMAAIOCB *dbs, int ret)
     }
 }
 
+static void dma_aio_cancel(BlockDriverAIOCB *acb)
+{
+    DMAAIOCB *dbs = container_of(acb, DMAAIOCB, common);
+
+    trace_dma_aio_cancel(dbs);
+
+    if (dbs->acb) {
+        BlockDriverAIOCB *acb = dbs->acb;
+        dbs->acb = NULL;
+        dbs->in_cancel = true;
+        bdrv_aio_cancel(acb);
+        dbs->in_cancel = false;
+    }
+    dbs->common.cb = NULL;
+    dma_complete(dbs, 0);
+}
+
+static void dma_bdrv_cancel_cb(void *opaque)
+{
+    dma_aio_cancel(&((DMAAIOCB *)opaque)->common);
+}
+
 static void dma_bdrv_cb(void *opaque, int ret)
 {
     DMAAIOCB *dbs = (DMAAIOCB *)opaque;
@@ -127,7 +149,8 @@  static void dma_bdrv_cb(void *opaque, int ret)
     while (dbs->sg_cur_index < dbs->sg->nsg) {
         cur_addr = dbs->sg->sg[dbs->sg_cur_index].base + dbs->sg_cur_byte;
         cur_len = dbs->sg->sg[dbs->sg_cur_index].len - dbs->sg_cur_byte;
-        mem = dma_memory_map(dbs->sg->dma, cur_addr, &cur_len, dbs->dir);
+        mem = dma_memory_map_with_cancel(dbs->sg->dma, dma_bdrv_cancel_cb, dbs,
+                                         cur_addr, &cur_len, dbs->dir);
         if (!mem)
             break;
         qemu_iovec_add(&dbs->iov, mem, cur_len);
@@ -149,23 +172,6 @@  static void dma_bdrv_cb(void *opaque, int ret)
     assert(dbs->acb);
 }
 
-static void dma_aio_cancel(BlockDriverAIOCB *acb)
-{
-    DMAAIOCB *dbs = container_of(acb, DMAAIOCB, common);
-
-    trace_dma_aio_cancel(dbs);
-
-    if (dbs->acb) {
-        BlockDriverAIOCB *acb = dbs->acb;
-        dbs->acb = NULL;
-        dbs->in_cancel = true;
-        bdrv_aio_cancel(acb);
-        dbs->in_cancel = false;
-    }
-    dbs->common.cb = NULL;
-    dma_complete(dbs, 0);
-}
-
 static AIOPool dma_aio_pool = {
     .aiocb_size         = sizeof(DMAAIOCB),
     .cancel             = dma_aio_cancel,
@@ -353,7 +359,9 @@  void dma_context_init(DMAContext *dma, DMATranslateFunc translate,
     dma->unmap = unmap;
 }
 
-void *iommu_dma_memory_map(DMAContext *dma, dma_addr_t addr, dma_addr_t *len,
+void *iommu_dma_memory_map(DMAContext *dma,
+                           DMACancelMapFunc cb, void *cb_opaque,
+                           dma_addr_t addr, dma_addr_t *len,
                            DMADirection dir)
 {
     int err;
@@ -361,7 +369,7 @@  void *iommu_dma_memory_map(DMAContext *dma, dma_addr_t addr, dma_addr_t *len,
     void *buf;
 
     if (dma->map) {
-        return dma->map(dma, addr, len, dir);
+        return dma->map(dma, cb, cb_opaque, addr, len, dir);
     }
 
     plen = *len;
@@ -397,5 +405,4 @@  void iommu_dma_memory_unmap(DMAContext *dma, void *buffer, dma_addr_t len,
     cpu_physical_memory_unmap(buffer, len,
                               dir == DMA_DIRECTION_FROM_DEVICE,
                               access_len);
-
 }
diff --git a/dma.h b/dma.h
index 14fe17d..f1fcb71 100644
--- a/dma.h
+++ b/dma.h
@@ -49,10 +49,15 @@  typedef int DMATranslateFunc(DMAContext *dma,
                              target_phys_addr_t *paddr,
                              target_phys_addr_t *len,
                              DMADirection dir);
+
+typedef void DMACancelMapFunc(void *);
 typedef void* DMAMapFunc(DMAContext *dma,
+                         DMACancelMapFunc cb,
+                         void *cb_opaque,	 
                          dma_addr_t addr,
                          dma_addr_t *len,
                          DMADirection dir);
+
 typedef void DMAUnmapFunc(DMAContext *dma,
                           void *buffer,
                           dma_addr_t len,
@@ -129,11 +134,15 @@  static inline int dma_memory_set(DMAContext *dma, dma_addr_t addr,
 }
 
 void *iommu_dma_memory_map(DMAContext *dma,
+                           DMACancelMapFunc *cb, void *opaque,
                            dma_addr_t addr, dma_addr_t *len,
                            DMADirection dir);
-static inline void *dma_memory_map(DMAContext *dma,
-                                   dma_addr_t addr, dma_addr_t *len,
-                                   DMADirection dir)
+static inline void *dma_memory_map_with_cancel(DMAContext *dma,
+                                               DMACancelMapFunc *cb,
+                                               void *opaque,
+                                               dma_addr_t addr,
+                                               dma_addr_t *len,
+                                               DMADirection dir)
 {
     if (!dma_has_iommu(dma)) {
         target_phys_addr_t xlen = *len;
@@ -144,9 +153,15 @@  static inline void *dma_memory_map(DMAContext *dma,
         *len = xlen;
         return p;
     } else {
-        return iommu_dma_memory_map(dma, addr, len, dir);
+        return iommu_dma_memory_map(dma, cb, opaque, addr, len, dir);
     }
 }
+static inline void *dma_memory_map(DMAContext *dma,
+                                   dma_addr_t addr, dma_addr_t *len,
+                                   DMADirection dir)
+{
+    return dma_memory_map_with_cancel(dma, NULL, NULL, addr, len, dir);
+}
 
 void iommu_dma_memory_unmap(DMAContext *dma,
                             void *buffer, dma_addr_t len,