Patchwork [4/8] ahci: use qiov instead of dma helpers

login
register
mail settings
Submitter Alexander Graf
Date Dec. 20, 2010, 9:13 p.m.
Message ID <1292879604-22268-5-git-send-email-agraf@suse.de>
Download mbox | patch
Permalink /patch/76232/
State New
Headers show

Comments

Alexander Graf - Dec. 20, 2010, 9:13 p.m.
The DMA helpers incur additional overhead on data transfers. I'm not
sure we need the additional complexity provided by them. So let's just
use qiovs directly when running in the fast path (ncq).

Signed-off-by: Alexander Graf <agraf@suse.de>
---
 hw/ide/ahci.c |  100 ++++++++++++++++++++++++++++++++++++++++++++++++++++----
 hw/ide/ahci.h |    3 ++
 2 files changed, 95 insertions(+), 8 deletions(-)
Kevin Wolf - Jan. 18, 2011, 12:35 p.m.
Am 20.12.2010 22:13, schrieb Alexander Graf:
> The DMA helpers incur additional overhead on data transfers. I'm not
> sure we need the additional complexity provided by them. So let's just
> use qiovs directly when running in the fast path (ncq).
> 
> Signed-off-by: Alexander Graf <agraf@suse.de>
> ---
>  hw/ide/ahci.c |  100 ++++++++++++++++++++++++++++++++++++++++++++++++++++----
>  hw/ide/ahci.h |    3 ++
>  2 files changed, 95 insertions(+), 8 deletions(-)

I don't feel comfortable with this one, and I think a while ago we
discussed on IRC why the DMA helpers even exist. If AHCI doesn't need
them, probably nobody needed them.

However, I'm inclined to think that AHCI actually _does_ need them in
corner cases, even though it might not break in all the common cases
that you have tested. Can you explain why only AHCI doesn't need it or
is it just "didn't break for me in practice"?

Where does the overhead in the DMA helpers come from? Can we optimize
this code instead of making the device emulation less correct?

Kevin
Alexander Graf - Jan. 18, 2011, 12:45 p.m.
On 18.01.2011, at 13:35, Kevin Wolf wrote:

> Am 20.12.2010 22:13, schrieb Alexander Graf:
>> The DMA helpers incur additional overhead on data transfers. I'm not
>> sure we need the additional complexity provided by them. So let's just
>> use qiovs directly when running in the fast path (ncq).
>> 
>> Signed-off-by: Alexander Graf <agraf@suse.de>
>> ---
>> hw/ide/ahci.c |  100 ++++++++++++++++++++++++++++++++++++++++++++++++++++----
>> hw/ide/ahci.h |    3 ++
>> 2 files changed, 95 insertions(+), 8 deletions(-)
> 
> I don't feel comfortable with this one, and I think a while ago we
> discussed on IRC why the DMA helpers even exist. If AHCI doesn't need
> them, probably nobody needed them.
> 
> However, I'm inclined to think that AHCI actually _does_ need them in
> corner cases, even though it might not break in all the common cases
> that you have tested. Can you explain why only AHCI doesn't need it or
> is it just "didn't break for me in practice"?
> 

It's the latter.

> Where does the overhead in the DMA helpers come from? Can we optimize
> this code instead of making the device emulation less correct?

Well, dma helpers involve another malloc which is probably the biggest hog. I frankly don't see the point in making it correct for the fast path though. I'd rather like to have a fast block emulation that works with all OSs than an accurate one that emulates something nobody cares about.

Virtio for example doesn't use dma helpers either - they just claim it's not defined in the spec. So if virtio-blk gets away with it, it means that all OSs should never make use of the additional complexity.


Alex
Stefan Hajnoczi - Jan. 18, 2011, 1:14 p.m.
On Tue, Jan 18, 2011 at 01:45:40PM +0100, Alexander Graf wrote:
> 
> On 18.01.2011, at 13:35, Kevin Wolf wrote:
> 
> > Am 20.12.2010 22:13, schrieb Alexander Graf:
> >> The DMA helpers incur additional overhead on data transfers. I'm not
> >> sure we need the additional complexity provided by them. So let's just
> >> use qiovs directly when running in the fast path (ncq).
> >> 
> >> Signed-off-by: Alexander Graf <agraf@suse.de>
> >> ---
> >> hw/ide/ahci.c |  100 ++++++++++++++++++++++++++++++++++++++++++++++++++++----
> >> hw/ide/ahci.h |    3 ++
> >> 2 files changed, 95 insertions(+), 8 deletions(-)
> > 
> > I don't feel comfortable with this one, and I think a while ago we
> > discussed on IRC why the DMA helpers even exist. If AHCI doesn't need
> > them, probably nobody needed them.
> > 
> > However, I'm inclined to think that AHCI actually _does_ need them in
> > corner cases, even though it might not break in all the common cases
> > that you have tested. Can you explain why only AHCI doesn't need it or
> > is it just "didn't break for me in practice"?
> > 
> 
> It's the latter.
> 
> > Where does the overhead in the DMA helpers come from? Can we optimize
> > this code instead of making the device emulation less correct?
> 
> Well, dma helpers involve another malloc which is probably the biggest hog. I frankly don't see the point in making it correct for the fast path though. I'd rather like to have a fast block emulation that works with all OSs than an accurate one that emulates something nobody cares about.
> 
> Virtio for example doesn't use dma helpers either - they just claim it's not defined in the spec. So if virtio-blk gets away with it, it means that all OSs should never make use of the additional complexity.

From what I can tell DMA helpers is common AIO request code plus:

1. It handles map failure using cpu_register_map_client().
2. It handles short maps that are unable to map a full sglist element.

These two requirements are due to QEMU's guest memory mapping API.  IIUC
the limitations on that API are due to a limited amount of bounce buffer
space being used for some targets allowing mapping of non-RAM memory.

Perhaps this means that AHCI does not work on those targets if you
decide to send non-RAM pages to disk?

I'd be interested in understanding how this all works and how the QEMU
RAM API that Anthony and Alex Williamson have been playing with comes
into play.

Stefan
Kevin Wolf - Jan. 18, 2011, 1:30 p.m.
Am 18.01.2011 14:14, schrieb Stefan Hajnoczi:
> On Tue, Jan 18, 2011 at 01:45:40PM +0100, Alexander Graf wrote:
>>
>> On 18.01.2011, at 13:35, Kevin Wolf wrote:
>>
>>> Am 20.12.2010 22:13, schrieb Alexander Graf:
>>>> The DMA helpers incur additional overhead on data transfers. I'm not
>>>> sure we need the additional complexity provided by them. So let's just
>>>> use qiovs directly when running in the fast path (ncq).
>>>>
>>>> Signed-off-by: Alexander Graf <agraf@suse.de>
>>>> ---
>>>> hw/ide/ahci.c |  100 ++++++++++++++++++++++++++++++++++++++++++++++++++++----
>>>> hw/ide/ahci.h |    3 ++
>>>> 2 files changed, 95 insertions(+), 8 deletions(-)
>>>
>>> I don't feel comfortable with this one, and I think a while ago we
>>> discussed on IRC why the DMA helpers even exist. If AHCI doesn't need
>>> them, probably nobody needed them.
>>>
>>> However, I'm inclined to think that AHCI actually _does_ need them in
>>> corner cases, even though it might not break in all the common cases
>>> that you have tested. Can you explain why only AHCI doesn't need it or
>>> is it just "didn't break for me in practice"?
>>>
>>
>> It's the latter.
>>
>>> Where does the overhead in the DMA helpers come from? Can we optimize
>>> this code instead of making the device emulation less correct?
>>
>> Well, dma helpers involve another malloc which is probably the biggest hog.

If you can get around this malloc in AHCI (IIUC, you do it by reusing
the old sglist buffer?), we can probably do something similar in the
generic DMA helper code and have IDE etc. benefit from it as well.

> I frankly don't see the point in making it correct for the fast path though. I'd rather like to have a fast block emulation that works with all OSs than an accurate one that emulates something nobody cares about.

Sorry, but "Windows and Linux" != "all OSes".

I'm sure there is a way that gives us correctness _and_ performance.

>> Virtio for example doesn't use dma helpers either - they just claim it's not defined in the spec. So if virtio-blk gets away with it, it means that all OSs should never make use of the additional complexity.
> 
> From what I can tell DMA helpers is common AIO request code plus:
> 
> 1. It handles map failure using cpu_register_map_client().
> 2. It handles short maps that are unable to map a full sglist element.
> 
> These two requirements are due to QEMU's guest memory mapping API.  IIUC
> the limitations on that API are due to a limited amount of bounce buffer
> space being used for some targets allowing mapping of non-RAM memory.
> 
> Perhaps this means that AHCI does not work on those targets if you
> decide to send non-RAM pages to disk?
> 
> I'd be interested in understanding how this all works and how the QEMU
> RAM API that Anthony and Alex Williamson have been playing with comes
> into play.

Yeah, I don't know the details either. I hope that Anthony will comment
on it.

Kevin

Patch

diff --git a/hw/ide/ahci.c b/hw/ide/ahci.c
index fa97f9b..7a29925 100644
--- a/hw/ide/ahci.c
+++ b/hw/ide/ahci.c
@@ -642,6 +642,87 @@  static void ahci_write_fis_d2h(AHCIDevice *ad, uint8_t *cmd_fis)
     }
 }
 
+static int ahci_iov_destroy(AHCIDevice *ad, QEMUIOVector *qiov, int is_write)
+{
+    int i;
+    struct iovec *iov = qiov->iov;
+
+    for (i = 0; i < qiov->niov; i++) {
+        /* flags_size is zero-based */
+        cpu_physical_memory_unmap(iov[i].iov_base, !is_write,
+                                  iov[i].iov_len, iov[i].iov_len);
+    }
+
+    return 0;
+}
+
+static int ahci_populate_iov(AHCIDevice *ad, NCQTransferState *ncq_tfs,
+                             int is_write)
+{
+    QEMUIOVector *qiov = &ncq_tfs->qiov;
+    struct iovec *iov;
+    AHCICmdHdr *cmd = ad->cur_cmd;
+    uint32_t opts = le32_to_cpu(cmd->opts);
+    uint64_t prdt_addr = le64_to_cpu(cmd->tbl_addr) + 0x80;
+    int sglist_alloc_hint = opts >> AHCI_CMD_HDR_PRDT_LEN;
+    target_phys_addr_t prdt_len = (sglist_alloc_hint * sizeof(AHCI_SG));
+    target_phys_addr_t real_prdt_len = prdt_len;
+    uint8_t *prdt;
+    int i;
+    int r = 0;
+
+    if (!sglist_alloc_hint) {
+        DPRINTF(ad->port_no, "no sg list given by guest: 0x%08x\n", opts);
+        return -1;
+    }
+
+    /* map PRDT */
+    if (!(prdt = cpu_physical_memory_map(prdt_addr, &prdt_len, 0))){
+        DPRINTF(ad->port_no, "map failed\n");
+        return -1;
+    }
+
+    if (prdt_len < real_prdt_len) {
+        DPRINTF(ad->port_no, "mapped less than expected\n");
+        r = -1;
+        goto out;
+    }
+
+    /* Ensure we have enough space to store the iovecs */
+    if (ncq_tfs->iov_max < sglist_alloc_hint) {
+        ncq_tfs->iov = qemu_realloc(ncq_tfs->iov,
+                                    sglist_alloc_hint * sizeof(struct iovec));
+        ncq_tfs->iov_max = sglist_alloc_hint;
+    }
+    iov = ncq_tfs->iov;
+
+    if (!iov) {
+        DPRINTF(ad->port_no, "out of memory\n");
+        ncq_tfs->iov_max = 0;
+        goto out;
+    }
+
+    /* Get entries in the PRDT, init a qemu sglist accordingly */
+    if (sglist_alloc_hint > 0) {
+        AHCI_SG *tbl = (AHCI_SG *)prdt;
+        void *p;
+
+        for (i = 0; i < sglist_alloc_hint; i++) {
+            target_phys_addr_t len;
+            /* flags_size is zero-based */
+            len = iov[i].iov_len = le32_to_cpu(tbl[i].flags_size) + 1;
+            p = cpu_physical_memory_map(le64_to_cpu(tbl[i].addr),
+                                        &len, !is_write);
+            iov[i].iov_base = p;
+        }
+        qemu_iovec_init_external(qiov, iov, sglist_alloc_hint);
+    }
+
+out:
+    cpu_physical_memory_unmap(prdt, 0, prdt_len, prdt_len);
+    return r;
+}
+
 static int ahci_populate_sglist(AHCIDevice *ad, QEMUSGList *sglist)
 {
     AHCICmdHdr *cmd = ad->cur_cmd;
@@ -711,7 +792,7 @@  static void ncq_cb(void *opaque, int ret)
     DPRINTF(ncq_tfs->drive->port_no, "NCQ transfer tag %d finished\n",
             ncq_tfs->tag);
 
-    qemu_sglist_destroy(&ncq_tfs->sglist);
+    ahci_iov_destroy(ncq_tfs->drive, &ncq_tfs->qiov, !ncq_tfs->is_read);
     ncq_tfs->used = 0;
 }
 
@@ -747,7 +828,6 @@  static void process_ncq_command(AHCIState *s, int port, uint8_t *cmd_fis,
             ncq_tfs->lba, ncq_tfs->lba + ncq_tfs->sector_count - 2,
             s->dev[port].port.ifs[0].nb_sectors - 1);
 
-    ahci_populate_sglist(&s->dev[port], &ncq_tfs->sglist);
     ncq_tfs->tag = tag;
 
     switch(ncq_fis->command) {
@@ -757,9 +837,11 @@  static void process_ncq_command(AHCIState *s, int port, uint8_t *cmd_fis,
             ncq_tfs->is_read = 1;
 
             DPRINTF(port, "tag %d aio read %ld\n", ncq_tfs->tag, ncq_tfs->lba);
-            ncq_tfs->aiocb = dma_bdrv_read(ncq_tfs->drive->port.ifs[0].bs,
-                                           &ncq_tfs->sglist, ncq_tfs->lba,
-                                           ncq_cb, ncq_tfs);
+            ahci_populate_iov(&s->dev[port], ncq_tfs, 0);
+            ncq_tfs->aiocb = bdrv_aio_readv(ncq_tfs->drive->port.ifs[0].bs,
+                                            ncq_tfs->lba, &ncq_tfs->qiov,
+                                            ncq_tfs->sector_count, ncq_cb,
+                                            ncq_tfs);
             break;
         case WRITE_FPDMA_QUEUED:
             DPRINTF(port, "NCQ writing %d sectors to LBA %ld, tag %d\n",
@@ -767,9 +849,11 @@  static void process_ncq_command(AHCIState *s, int port, uint8_t *cmd_fis,
             ncq_tfs->is_read = 0;
 
             DPRINTF(port, "tag %d aio write %ld\n", ncq_tfs->tag, ncq_tfs->lba);
-            ncq_tfs->aiocb = dma_bdrv_write(ncq_tfs->drive->port.ifs[0].bs,
-                                            &ncq_tfs->sglist, ncq_tfs->lba,
-                                            ncq_cb, ncq_tfs);
+            ahci_populate_iov(&s->dev[port], ncq_tfs, 1);
+            ncq_tfs->aiocb = bdrv_aio_writev(ncq_tfs->drive->port.ifs[0].bs,
+                                             ncq_tfs->lba, &ncq_tfs->qiov,
+                                             ncq_tfs->sector_count, ncq_cb,
+                                             ncq_tfs);
             break;
         default:
             DPRINTF(port, "error: tried to process non-NCQ command as NCQ\n");
diff --git a/hw/ide/ahci.h b/hw/ide/ahci.h
index 5f8126a..0824064 100644
--- a/hw/ide/ahci.h
+++ b/hw/ide/ahci.h
@@ -243,6 +243,9 @@  typedef struct NCQTransferState {
     uint8_t tag;
     int slot;
     int used;
+    QEMUIOVector qiov;
+    struct iovec *iov;
+    int iov_max;
 } NCQTransferState;
 
 struct AHCIDevice {