Message ID: 1340087992-2399-10-git-send-email-benh@kernel.crashing.org
State: New
On 06/19/2012 01:39 AM, Benjamin Herrenschmidt wrote:
> From: David Gibson <david@gibson.dropbear.id.au>
>
> One new complication raised by IOMMU support, over only handling DMA
> directly to physical addresses, is the dma_memory_map() case
> (replacing cpu_physical_memory_map()): the IOMMU translations for the
> IOVAs covered by such a map may be invalidated or changed while the
> map is active. This should never happen with correct guest software,
> but we do need to handle buggy guests. This case might also occur
> during handovers between different guest software stages if the
> handover protocols aren't fully seamless.
>
> The iommu implementation will have to wait for maps to be removed
> before it can "complete" an invalidation of a translation, which can
> take a long time. In order to make it possible to speed that process
> up, we add a "cancel" callback to the map function, which clients can
> optionally provide.
>
> The core makes no use of it, but the iommu backend implementation may
> choose to keep track of maps and call the respective cancel callback
> whenever a translation within a map is removed, allowing the driver
> to do things like cancel async I/Os.
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> ---
>  dma-helpers.c |   49 ++++++++++++++++++++++++++++---------------------
>  dma.h         |   23 +++++++++++++++++++----
>  2 files changed, 47 insertions(+), 25 deletions(-)
>
> diff --git a/dma-helpers.c b/dma-helpers.c
> index b4ee827..6e6c7b3 100644
> --- a/dma-helpers.c
> +++ b/dma-helpers.c
> @@ -107,6 +107,28 @@ static void dma_complete(DMAAIOCB *dbs, int ret)
>      }
>  }
>
> +static void dma_aio_cancel(BlockDriverAIOCB *acb)
> +{
> +    DMAAIOCB *dbs = container_of(acb, DMAAIOCB, common);
> +
> +    trace_dma_aio_cancel(dbs);
> +
> +    if (dbs->acb) {
> +        BlockDriverAIOCB *acb = dbs->acb;
> +        dbs->acb = NULL;
> +        dbs->in_cancel = true;
> +        bdrv_aio_cancel(acb);
> +        dbs->in_cancel = false;
> +    }
> +    dbs->common.cb = NULL;
> +    dma_complete(dbs, 0);

So this cancellation stuff is hopelessly broken.

It's simply not possible to fully cancel pending DMA in a synchronous
callback. Indeed, bdrv_aio_cancel ends up having a nasty little loop
in it:

    if (active) {
        /* fail safe: if the aio could not be canceled, we wait for it */
        while (qemu_paio_error(acb) == EINPROGRESS)
            ;
    }

That spins with 100% CPU.

Can you explain when DMA cancellation really happens, and what the
effect would be if we simply ignored it? Can we do something more
clever, like using an asynchronous callback to handle flushing active
DMA mappings?

There's just no way a callback like this is going to work.

Regards,

Anthony Liguori
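To picture the asynchronous alternative Anthony asks about, here is a minimal sketch (all names here are invented for illustration; none of this is QEMU's actual API): cancellation merely sets a flag and returns immediately, and the cancel is considered complete only when the event loop later delivers the backend completion, so nothing busy-waits on EINPROGRESS.

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch of an initiate-only cancel: aio_cancel_async() never blocks;
 * if a backend request is still in flight, completion of the cancel is
 * deferred until the event loop reports that request finished. */

typedef struct AsyncAIOCB {
    bool in_flight;         /* a backend request is still pending */
    bool cancel_requested;
    bool cancel_completed;
} AsyncAIOCB;

/* Request cancellation without blocking.  If nothing is in flight the
 * cancel completes immediately; otherwise it completes later. */
static void aio_cancel_async(AsyncAIOCB *acb)
{
    acb->cancel_requested = true;
    if (!acb->in_flight) {
        acb->cancel_completed = true;
    }
}

/* Invoked from the event loop when the backend request finishes; a
 * previously requested cancel actually completes here. */
static void aio_request_complete(AsyncAIOCB *acb)
{
    acb->in_flight = false;
    if (acb->cancel_requested) {
        acb->cancel_completed = true;
    }
}
```

The point of the split is that the caller gets control back right away and the expensive wait moves into the event loop, where it belongs.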
On Wed, 2012-06-20 at 16:25 -0500, Anthony Liguori wrote:
> So this cancellation stuff is hopelessly broken
>
> It's simply not possible to fully cancel pending DMA in a synchronous
> callback.

Well, at least for PAPR H_PUT_TCE, cancellation must be synchronous,
i.e. the hypercall must not return until the cancellation is complete.

> Indeed, bdrv_aio_cancel ends up having a nasty little loop in it:
>
>     if (active) {
>         /* fail safe: if the aio could not be canceled, we wait for it */
>         while (qemu_paio_error(acb) == EINPROGRESS)
>             ;
>     }
>
> That spins w/100% CPU.
>
> Can you explain when DMA cancellation really happens and what the
> effect would be if we simply ignored it?

It will almost never happen in practice, and it will actually never
happen with the current patch series. Where it -will- eventually
happen, in the long run, is when the guest removes a translation that
is "in use" by a dma_map() mapping established by a device. That is
always a guest programming error, though, and it's not an attack
vector since the guest can only shoot itself in the foot, but it might
make things like kdump less reliable inside the guest.

We need a way to signal the device that the translation is going away,
and we need -a- way to synchronize, though it could be a two-step
process (see below).

> Can we do something more clever like use an asynchronous callback to
> handle flushing active DMA mappings?
>
> There's just no way a callback like this is going to work.

Ok, so first let's see what happens on real HW: one of the DMA
accesses gets a target abort from the host bridge, and the device
interrupts its operations and signals an error.
Now, I agree that requiring a cancel callback to act synchronously
might be a bit fishy, so what about defining the following semantics:

 - First, this assumes our iommu backend decides to implement that
   level of correctness; as I said above, none do in this patch series
   (yet).

 - The basic idea is that for most iommus there's an MMIO to start a
   TLB flush, and an MMIO the guest spins on to get the status as to
   whether the flush has completed, so we can do things asynchronously
   that way. However, we -still- need to do things synchronously for
   the hypercall used by PAPR, but as we discussed earlier that can be
   done without spinning, by delaying the completion of the hypercall.

 - So, step 1: no callback at all. When an iommu TLB flush operation
   is started, we tag all pending maps (see below), and we signal
   completion when all those maps have been unmapped.

 - The above tagging can be done using some kind of generation count
   along with an ordered list of maps; we keep track of the "oldest"
   map still active, that sort of thing. Not too hard.

 - Step 2: because some maps can be long-lived and we don't want to
   delay invalidations forever, we add a cancel callback which a
   device can optionally install along with a map. This callback is
   only meant to -initiate- a cancellation, in order to speed up when
   the unmap will occur.

What do you think? Would that work?

As I explained in my email exchange with Gerd, there are quite a few
issues in actually implementing cancellation properly anyway. For
example, today on PAPR, H_PUT_TCE is implemented by KVM in the kernel,
in real mode, for performance reasons. So we would need to change the
KVM API to keep the kernel informed that there are maps covering
portions of the iommu space (a bitmap?), in order to force exits to
qemu when an invalidation collides with a map.

Additionally, to be totally -correct-, we would need synchronization
with qemu to a much larger extent: any invalidation must also make
sure that anything that used a previous translation has completed,
i.e. even the simple dma_ld/st ops would need to be synchronized in
theory.

My conclusion is that the complexity of solving the problem is huge,
while the actual problem scope is close to non-existent. So I think we
can safely merge the series and ignore the issue for the time being.

Cheers,
Ben.
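Step 1 of the scheme Ben proposes above, tagging pending maps so a TLB flush can report completion once the older maps drain, might be sketched as follows (a minimal illustration with invented names, not the eventual implementation; a real guest would poll flush completion via an MMIO status register, or a PAPR hypercall would be held until it reports done):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* A flush started at generation G completes once every map created at
 * or before G has been unmapped; maps created after the flush started
 * do not block it. */

#define MAX_MAPS 16

typedef struct MapGenTracker {
    uint64_t current_gen;        /* bumped each time a flush starts */
    uint64_t map_gen[MAX_MAPS];  /* creation generation per slot */
    bool     active[MAX_MAPS];
} MapGenTracker;

/* Record a new map, tagged with the current generation. */
static int tracker_map(MapGenTracker *t)
{
    for (int i = 0; i < MAX_MAPS; i++) {
        if (!t->active[i]) {
            t->active[i] = true;
            t->map_gen[i] = t->current_gen;
            return i;
        }
    }
    return -1;                   /* no free slot */
}

static void tracker_unmap(MapGenTracker *t, int slot)
{
    t->active[slot] = false;
}

/* Start a flush: remember the generation it must wait out, then bump
 * the counter so later maps are not tagged by this flush. */
static uint64_t tracker_flush_start(MapGenTracker *t)
{
    return t->current_gen++;
}

/* Has the flush started at generation g completed?  This is what an
 * MMIO status read (or a held hypercall) would check. */
static bool tracker_flush_done(const MapGenTracker *t, uint64_t g)
{
    for (int i = 0; i < MAX_MAPS; i++) {
        if (t->active[i] && t->map_gen[i] <= g) {
            return false;
        }
    }
    return true;
}
```

An ordered list keyed on the oldest still-active generation, as Ben suggests, would make the "done" check O(1) instead of a scan; the array form is just the shortest way to show the tagging.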
On Wed, 2012-06-20 at 16:25 -0500, Anthony Liguori wrote:
> > +static void dma_aio_cancel(BlockDriverAIOCB *acb)
> > +{
> > +    DMAAIOCB *dbs = container_of(acb, DMAAIOCB, common);
> > +
> > +    trace_dma_aio_cancel(dbs);
> > +
> > +    if (dbs->acb) {
> > +        BlockDriverAIOCB *acb = dbs->acb;
> > +        dbs->acb = NULL;
> > +        dbs->in_cancel = true;
> > +        bdrv_aio_cancel(acb);
> > +        dbs->in_cancel = false;
> > +    }
> > +    dbs->common.cb = NULL;
> > +    dma_complete(dbs, 0);
>
> So this cancellation stuff is hopelessly broken
>
> It's simply not possible to fully cancel pending DMA in a synchronous
> callback.
>
> Indeed, bdrv_aio_cancel ends up having a nasty little loop in it:

Yes, it's broken. Note that the patch didn't add the function above,
it only moved it around.

In any case, I've decided to drop that patch from the series
completely, i.e. I'm not adding the dma_memory_map_with_cancel()
variant. There's just no point, since:

 - Nothing will call the cancel callback today, and possibly not for a
   while.

 - Nothing passes a cancel callback other than the bdrv code, and that
   callback is hopelessly broken, as you mentioned above.

We will add an optional cancel callback again later, when we
eventually decide to sort that problem out properly. It will be an
asynchronous cancel, i.e. it will just "initiate" the cancellation,
and I'll probably add it as an argument to the normal dma_memory_map()
(passing NULL in all the callers that don't care) at that point. For
now, let's not add a known-broken and unused interface.

Cheers,
Ben.
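For a sense of what that eventual initiate-only cancel callback could do on the backend side, here is a rough sketch (every name is invented for the example; the dropped patch defines the callback type but no backend in the series implements this tracking): the backend records each map's optional cancel callback, and an invalidation merely initiates cancellation of the maps whose IOVA ranges it overlaps, with the maps staying active until the clients actually unmap them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical backend-side map tracking: every map registered with a
 * cancel callback is recorded, and an invalidation fires the callback
 * of any active map whose IOVA range it overlaps. */

typedef void DMACancelMapFunc(void *opaque);

typedef struct TrackedMap {
    uint64_t iova;              /* start of the mapped IOVA range */
    uint64_t len;
    DMACancelMapFunc *cancel;   /* optional; NULL if the client has none */
    void *cancel_opaque;
    bool active;
} TrackedMap;

#define MAX_MAPS 8
static TrackedMap maps[MAX_MAPS];

/* Example client callback: just records that cancellation was begun. */
static void note_cancel(void *opaque)
{
    *(bool *)opaque = true;
}

/* Record a map; returns its slot, or -1 if the table is full. */
static int track_map(uint64_t iova, uint64_t len,
                     DMACancelMapFunc *cancel, void *opaque)
{
    for (int i = 0; i < MAX_MAPS; i++) {
        if (!maps[i].active) {
            maps[i] = (TrackedMap){ iova, len, cancel, opaque, true };
            return i;
        }
    }
    return -1;
}

/* Invalidate [iova, iova + len): initiate cancellation of every
 * overlapping active map; returns how many callbacks were invoked.
 * Note the maps stay active: completion still waits for the unmaps. */
static int invalidate_range(uint64_t iova, uint64_t len)
{
    int fired = 0;
    for (int i = 0; i < MAX_MAPS; i++) {
        TrackedMap *m = &maps[i];
        if (m->active && m->iova < iova + len && iova < m->iova + m->len
            && m->cancel) {
            m->cancel(m->cancel_opaque);
            fired++;
        }
    }
    return fired;
}
```

A real version would hang the list off the DMAContext rather than a global table; the global array is only to keep the sketch short.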
diff --git a/dma-helpers.c b/dma-helpers.c
index b4ee827..6e6c7b3 100644
--- a/dma-helpers.c
+++ b/dma-helpers.c
@@ -107,6 +107,28 @@ static void dma_complete(DMAAIOCB *dbs, int ret)
     }
 }
 
+static void dma_aio_cancel(BlockDriverAIOCB *acb)
+{
+    DMAAIOCB *dbs = container_of(acb, DMAAIOCB, common);
+
+    trace_dma_aio_cancel(dbs);
+
+    if (dbs->acb) {
+        BlockDriverAIOCB *acb = dbs->acb;
+        dbs->acb = NULL;
+        dbs->in_cancel = true;
+        bdrv_aio_cancel(acb);
+        dbs->in_cancel = false;
+    }
+    dbs->common.cb = NULL;
+    dma_complete(dbs, 0);
+}
+
+static void dma_bdrv_cancel_cb(void *opaque)
+{
+    dma_aio_cancel(&((DMAAIOCB *)opaque)->common);
+}
+
 static void dma_bdrv_cb(void *opaque, int ret)
 {
     DMAAIOCB *dbs = (DMAAIOCB *)opaque;
@@ -127,7 +149,8 @@ static void dma_bdrv_cb(void *opaque, int ret)
     while (dbs->sg_cur_index < dbs->sg->nsg) {
         cur_addr = dbs->sg->sg[dbs->sg_cur_index].base + dbs->sg_cur_byte;
         cur_len = dbs->sg->sg[dbs->sg_cur_index].len - dbs->sg_cur_byte;
-        mem = dma_memory_map(dbs->sg->dma, cur_addr, &cur_len, dbs->dir);
+        mem = dma_memory_map_with_cancel(dbs->sg->dma, dma_bdrv_cancel_cb, dbs,
+                                         cur_addr, &cur_len, dbs->dir);
         if (!mem)
             break;
         qemu_iovec_add(&dbs->iov, mem, cur_len);
@@ -149,23 +172,6 @@ static void dma_bdrv_cb(void *opaque, int ret)
     assert(dbs->acb);
 }
 
-static void dma_aio_cancel(BlockDriverAIOCB *acb)
-{
-    DMAAIOCB *dbs = container_of(acb, DMAAIOCB, common);
-
-    trace_dma_aio_cancel(dbs);
-
-    if (dbs->acb) {
-        BlockDriverAIOCB *acb = dbs->acb;
-        dbs->acb = NULL;
-        dbs->in_cancel = true;
-        bdrv_aio_cancel(acb);
-        dbs->in_cancel = false;
-    }
-    dbs->common.cb = NULL;
-    dma_complete(dbs, 0);
-}
-
 static AIOPool dma_aio_pool = {
     .aiocb_size         = sizeof(DMAAIOCB),
     .cancel             = dma_aio_cancel,
@@ -353,7 +359,9 @@ void dma_context_init(DMAContext *dma, DMATranslateFunc translate,
     dma->unmap = unmap;
 }
 
-void *iommu_dma_memory_map(DMAContext *dma, dma_addr_t addr, dma_addr_t *len,
+void *iommu_dma_memory_map(DMAContext *dma,
+                           DMACancelMapFunc cb, void *cb_opaque,
+                           dma_addr_t addr, dma_addr_t *len,
                            DMADirection dir)
 {
     int err;
@@ -361,7 +369,7 @@ void *iommu_dma_memory_map(DMAContext *dma, dma_addr_t addr, dma_addr_t *len,
     void *buf;
 
     if (dma->map) {
-        return dma->map(dma, addr, len, dir);
+        return dma->map(dma, cb, cb_opaque, addr, len, dir);
     }
 
     plen = *len;
@@ -397,5 +405,4 @@ void iommu_dma_memory_unmap(DMAContext *dma, void *buffer, dma_addr_t len,
     cpu_physical_memory_unmap(buffer, len,
                               dir == DMA_DIRECTION_FROM_DEVICE,
                               access_len);
-
 }
diff --git a/dma.h b/dma.h
index 14fe17d..f1fcb71 100644
--- a/dma.h
+++ b/dma.h
@@ -49,10 +49,15 @@ typedef int DMATranslateFunc(DMAContext *dma,
                              target_phys_addr_t *paddr,
                              target_phys_addr_t *len,
                              DMADirection dir);
+
+typedef void DMACancelMapFunc(void *);
 typedef void* DMAMapFunc(DMAContext *dma,
+                         DMACancelMapFunc cb,
+                         void *cb_opaque,
                          dma_addr_t addr,
                          dma_addr_t *len,
                          DMADirection dir);
+
 typedef void DMAUnmapFunc(DMAContext *dma,
                           void *buffer,
                           dma_addr_t len,
@@ -129,11 +134,15 @@ static inline int dma_memory_set(DMAContext *dma, dma_addr_t addr,
 }
 
 void *iommu_dma_memory_map(DMAContext *dma,
+                           DMACancelMapFunc *cb, void *opaque,
                            dma_addr_t addr, dma_addr_t *len,
                            DMADirection dir);
-static inline void *dma_memory_map(DMAContext *dma,
-                                   dma_addr_t addr, dma_addr_t *len,
-                                   DMADirection dir)
+static inline void *dma_memory_map_with_cancel(DMAContext *dma,
+                                               DMACancelMapFunc *cb,
+                                               void *opaque,
+                                               dma_addr_t addr,
+                                               dma_addr_t *len,
+                                               DMADirection dir)
 {
     if (!dma_has_iommu(dma)) {
         target_phys_addr_t xlen = *len;
@@ -144,9 +153,15 @@ static inline void *dma_memory_map(DMAContext *dma,
         *len = xlen;
         return p;
     } else {
-        return iommu_dma_memory_map(dma, addr, len, dir);
+        return iommu_dma_memory_map(dma, cb, opaque, addr, len, dir);
     }
 }
+static inline void *dma_memory_map(DMAContext *dma,
+                                   dma_addr_t addr, dma_addr_t *len,
+                                   DMADirection dir)
+{
+    return dma_memory_map_with_cancel(dma, NULL, NULL, addr, len, dir);
+}
 
 void iommu_dma_memory_unmap(DMAContext *dma, void *buffer, dma_addr_t len,