
[1/9] dma-helpers: Expose the sg mapping logic

Message ID 1450284917-10508-2-git-send-email-apyrgio@arrikto.com
State New

Commit Message

Alex Pyrgiotis Dec. 16, 2015, 4:55 p.m. UTC
The mapping of scatter-gather lists from physical addresses (as
perceived by the guest kernel) to the virtual address space of the QEMU
process is a vital step of any DMA operation. This step is currently
implemented, along with other logic, inside dma_blk_cb(), which makes it
impossible for other code to reuse it.

In order to pave the way for DMA support of ioctl commands, expose the
aforementioned logic in a separate function called "dma_map_sg". Also
expose some other important pieces, such as the initialization of the
dbs structure.
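
For illustration only, here is a rough sketch of how an ioctl path might
use the exposed helpers. Everything below (dma_blk_ioctl(),
dma_ioctl_cb(), the sg_io_hdr plumbing) is a made-up placeholder and not
part of this patch:

    /* Hypothetical caller, not part of this patch: back an SG_IO-style
     * ioctl with a mapped QEMUIOVector instead of a bounce buffer.
     * Assumes <scsi/sg.h> for struct sg_io_hdr and SG_IO. */
    static void dma_ioctl_cb(void *opaque, int ret)
    {
        DMAAIOCB *dbs = (DMAAIOCB *)opaque;

        /* Unlike read/write, an ioctl cannot be restarted half-way,
         * so completion is all-or-nothing. */
        dma_complete(dbs, ret);
    }

    BlockAIOCB *dma_blk_ioctl(BlockBackend *blk, struct sg_io_hdr *hdr,
                              QEMUSGList *sg, DMADirection dir,
                              BlockCompletionFunc *cb, void *opaque)
    {
        DMAAIOCB *dbs = blk_aio_get(&dma_aiocb_info, blk, cb, opaque);

        dma_init_dbs(dbs, blk, sg, dir);   /* exposed by this patch */
        dma_map_sg(dbs);                   /* exposed by this patch */

        /* Hand the mapped guest buffers to the host kernel as an iovec.
         * Note: this assumes dma_map_sg() mapped the whole list. */
        hdr->dxferp = dbs->iov.iov;
        hdr->iovec_count = dbs->iov.niov;

        dbs->acb = blk_aio_ioctl(blk, SG_IO, hdr, dma_ioctl_cb, dbs);
        return &dbs->common;
    }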

Signed-off-by: Alex Pyrgiotis <apyrgio@arrikto.com>
Signed-off-by: Dimitris Aragiorgis <dimara@arrikto.com>

Comments

Paolo Bonzini Feb. 11, 2016, 11:17 a.m. UTC | #1
On 16/12/2015 17:55, Alex Pyrgiotis wrote:
> +/*
> + * Create a QEMUIOVector from a scatter-gather list.
> + *
> + * This function does not copy the data of the scatter-gather list. Instead, it
> + * uses the dma_memory_map() function to map physical memory regions of the
> + * virtual device (as interpreted by the guest kernel) into the address space
> + * of the QEMU process, in order to have access to the data.
> + */
> +static void dma_map_sg(DMAAIOCB *dbs)

In special cases where the QEMUSGList includes MMIO regions, dma_map_sg
might not be able to map the whole list.  In this case, for regular I/O
it is possible to break the operation into multiple steps---in fact, this
splitting of requests is the main purpose of most of the code in
dma-helpers.c.

However, it is not possible to do the same for ioctls.  This is actually
the reason why no one has ever tried to make scsi-generic do anything
but bounce-buffering.  I think that your code breaks horribly in this
case, and I don't see a way to fix it, except for reverting to bounce
buffering.

This would require major changes in your patches, and I'm not sure
whether they are worth it for the single use case of tape devices...

Paolo
Alex Pyrgiotis Feb. 19, 2016, 11:50 a.m. UTC | #2
Hi Paolo,

Sorry for the late reply and thanks for the review. See my comments inline:

On 02/11/2016 01:17 PM, Paolo Bonzini wrote:
> 
> 
> On 16/12/2015 17:55, Alex Pyrgiotis wrote:
>> +/*
>> + * Create a QEMUIOVector from a scatter-gather list.
>> + *
>> + * This function does not copy the data of the scatter-gather list. Instead, it
>> + * uses the dma_memory_map() function to map physical memory regions of the
>> + * virtual device (as interpreted by the guest kernel) into the address space
>> + * of the QEMU process, in order to have access to the data.
>> + */
>> +static void dma_map_sg(DMAAIOCB *dbs)
> 
> In special cases where the QEMUSGList includes MMIO regions, dma_map_sg
> might not be able to map the whole list.  In this case, for regular I/O
> it is possible to break the operation in multiple steps---in fact, this
> breaking of requests is the main purpose of most of the code in
> dma-helpers.c.

You're right. I've written a comment about that, yet somehow the
implications of rescheduling a DMA operation flew over my head. Great.

So, I've tried to grasp the current situation and this is what I've got
so far. Please correct me where I'm wrong.


Guest kernel space:
1. The guest kernel wants to send a SCSI request to a (para)virtual SCSI
   controller.
2. This SCSI request involves a list of physical address ranges.
   Whether the request only refers to a single contiguous physical
   range, or whether it is a scatter-gather list depends on the
   implementation of the kernel's Low-level Device Driver (LLDD) and
   the capabilities of the SCSI controller.
3. Some of these ranges may include MMIO regions. Since I'm not aware
   of a scenario where this happens, can you provide an example?
4. The LLDD writes the SCSI CDB to the SCSI controller's queue and
   informs the SCSI controller by writing to some sort of doorbell
   register.

QEMU/Hardware space:
5. The SCSI controller code will create a QEMUSGList that points to
   the memory regions of the SCSI request. This QEMUSGList will also
   include the MMIO regions.
6. The QEMU device implementation, e.g. scsi-block, chooses to use
   the dma_* interface.
7. The dma_blk_read/write() code will ultimately attempt to map all the
   memory regions pointed by the QEMUSGList in order to create a
   QEMUIOVector.
8. At some point during the mapping loop, the code will encounter an
   MMIO region. Since reading and writing from/to an MMIO region
   requires  special handling, e.g., we need to call
   MemoryRegion->ops->write(), we cannot include it in our read/write
   system call to the host kernel.
9. This leads to a partial read/write and the mapping loop will resume
   once the partial read/write() has finished.


Are we on the same page so far? If so, let's dig into the
dma_blk_read/write() logic. The following is simplified pseudocode:


	allocate QEMUIOVector
callback:

	# Reset any changes to QEMUIOVector from previous operations
	for address in QEMUIOVector:
		unmap address
	reset QEMUIOVector

	# Check if the previous IO operation was the last one and if so,
	# call the completion callback of the user and return.
	if operation is completed:
		dma_complete()
		return

	offset = adjust read/write offset
	# Map each sg of QEMUSGList and add it to the QEMUIOVector
	for sg in QEMUSGList:
		address = dma_memory_map(sg)
		if address == NULL:
			break
		add address to QEMUIOVector

	# When either of the following operations has finished, restart
	# this DMA operation.
	if nothing mapped:
		# If we can't map anything, we need to retry
		create reschedule_dma callback
		cpu_register_map_client(callback)
	else:
		# Else, we can perform a partial read/write()
		do readv/writev(offset, QEMUIOVector, callback)
	

Are the above OK? If so, I have some questions:

a) Is an MMIO region one of the reasons why we can't map an sg?
b) At which point will the relevant ops->write() method for the MMIO
   region be called when we have to DMA into the region? Is it handled
   implicitly in dma_memory_map()?
c) I'm not quite sure about the logic of the "nothing mapped" section.
   Correct me if I'm wrong, but what I think it does is that it
   registers a callback (reschedule_dma) once some sort of mapping has
   completed. What kind of mapping is this? Is there anything more to
   it?

> However, it is not possible to do the same for ioctls.  This is actually
> the reason why no one has ever tried to make scsi-generic do anything
> but bounce-buffering. I think that your code breaks horribly in this
> case, and I don't see a way to fix it, except for reverting to bounce
> buffering.
>
> This would require major changes in your patches, and I'm not sure
> whether they are worth it for the single use case of tape devices...

Well, I wouldn't narrow it down to tape devices. The scsi-generic
interface is the universal interface for all SCSI devices and the only
interface that is fully passthrough. So this patch would effectively
boost the baseline performance of SCSI devices. I think it's worth a try.

Thanks,
Alex
Paolo Bonzini Feb. 22, 2016, 10:43 a.m. UTC | #3
On 19/02/2016 12:50, Alex Pyrgiotis wrote:
> QEMU/Hardware space:
> 5. The SCSI controller code will create a QEMUSGList that points to
>    the memory regions of the SCSI request. This QEMUSGList will also
>    include the MMIO regions.
> 6. The QEMU device implementation, e.g. scsi-block, chooses to use
>    the dma_* interface.
> 7. The dma_blk_read/write() code will ultimately attempt to map all the
>    memory regions pointed by the QEMUSGList in order to create a
>    QEMUIOVector.
> 8. At some point during the mapping loop, the code will encounter an
>    MMIO region. Since reading and writing from/to an MMIO region
>    requires  special handling, e.g., we need to call
>    MemoryRegion->ops->write(), we cannot include it in our read/write
>    system call to the host kernel.
> 9. This leads to a partial read/write and the mapping loop will resume
>    once the partial read/write() has finished.
> 
> Are we in the same page so far?

Yes.

> Are the above OK? If so, I have some questions:
> 
> a) Is an MMIO region one of the reasons why we can't map an sg?

Yes, the only one pretty much.

> b) At which point will the relevant ops->write() method for the MMIO
>    region be called when we have to DMA into the region?? Is it handled
>    implicitly in dma_memory_map()?

It's in address_space_unmap:

    if (is_write) {
        address_space_write(as, bounce.addr, MEMTXATTRS_UNSPECIFIED,
                            bounce.buffer, access_len);
    }

Likewise, address_space_map does the ops->read call through
address_space_read.
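
For reference, the map side in address_space_map looks roughly like this
(paraphrased, not verbatim):

    if (!memory_access_is_direct(mr, is_write)) {
        if (atomic_xchg(&bounce.in_use, true)) {
            return NULL;    /* the single global buffer is already taken */
        }
        bounce.buffer = qemu_memalign(TARGET_PAGE_SIZE, l);
        bounce.addr = addr;
        bounce.len = l;
        if (!is_write) {
            address_space_read(as, addr, MEMTXATTRS_UNSPECIFIED,
                               bounce.buffer, l);
        }
        *plen = l;
        return bounce.buffer;
    }

So the ops->read happens while mapping, and the ops->write happens while
unmapping.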

> c) I'm not quite sure about the logic of the "nothing mapped" section.
>    Correct me if I'm wrong, but what I think it does is that it
>    registers a callback (reschedule_dma) once some sort of mapping has
>    completed. What kind of mapping is this? Is there anything more to
>    it?

Once something (presumably a concurrent user of dma-helpers.c) calls
address_space_unmap to free the mapping (the bounce.buffer in the above
address_space_write call), reschedule_dma is called.
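
The registration side lives in dma_blk_cb's "nothing mapped" branch,
which is approximately:

    if (dbs->iov.size == 0) {
        /* nothing could be mapped: park the request and wait */
        trace_dma_map_wait(dbs);
        dbs->bh = qemu_bh_new(reschedule_dma, dbs);
        cpu_register_map_client(dbs->bh);
        return;
    }

When address_space_unmap releases the bounce buffer it notifies the
registered map clients, the bottom half runs reschedule_dma, and the
mapping loop picks up where it stopped.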

>> However, it is not possible to do the same for ioctls.  This is actually
>> the reason why no one has ever tried to make scsi-generic do anything
>> but bounce-buffering. I think that your code breaks horribly in this
>> case, and I don't see a way to fix it, except for reverting to bounce
>> buffering.
>>
>> This would require major changes in your patches, and I'm not sure
>> whether they are worth it for the single use case of tape devices...
> 
> Well, I wouldn't narrow it down to tape devices. The scsi-generic
> interface is the universal interface for all SCSI devices and the only
> interface that is fully passthrough.

Sure, but what's the advantage of a fully passthrough interface over
scsi-block?

> So this patch would effectively
> boost the baseline performance of SCSI devices. I think it's worth a try.

I think the first step is understanding what to do about the weird "&
~BDRV_SECTOR_MASK" case, then.

Paolo
Alex Pyrgiotis Feb. 25, 2016, 10:10 a.m. UTC | #4
Hi Paolo,

Thanks a lot for your clarifications. See my comments inline:

(tl;dr: I suggest we reconsider Fam Zheng's attempt to remove the global
bounce buffer, which would make dma-helpers simpler and unblock this patch)

On 02/22/2016 12:43 PM, Paolo Bonzini wrote:
> 
> 
> On 19/02/2016 12:50, Alex Pyrgiotis wrote:
>> QEMU/Hardware space:
>> 5. The SCSI controller code will create a QEMUSGList that points to
>>    the memory regions of the SCSI request. This QEMUSGList will also
>>    include the MMIO regions.
>> 6. The QEMU device implementation, e.g. scsi-block, chooses to use
>>    the dma_* interface.
>> 7. The dma_blk_read/write() code will ultimately attempt to map all the
>>    memory regions pointed by the QEMUSGList in order to create a
>>    QEMUIOVector.
>> 8. At some point during the mapping loop, the code will encounter an
>>    MMIO region. Since reading and writing from/to an MMIO region
>>    requires  special handling, e.g., we need to call
>>    MemoryRegion->ops->write(), we cannot include it in our read/write
>>    system call to the host kernel.

This step and the next one were not clear to me, but thanks to your
comments, I now get what's happening behind the scenes. So, let's reiterate:

All normal regions in a QEMUSGList point to an address range in the
guest's RAM. The MMIO regions of QEMU's virtual devices, however, do not
correspond to such an address range, so QEMU must create a bounce buffer
to represent them. This bounce buffer is added in the I/O vector which
contains the rest of the mapped addresses and is later given to a
readv()/writev() call.

>> 9. This leads to a partial read/write and the mapping loop will resume
>>    once the partial read/write() has finished.

The MMIO region is the trigger for a partial read/write, but it's not
the actual reason. The actual reason is that there is only *one*
*global* bounce buffer. This means that if it's in use or we need to
use it twice, we will have to wait.

>> Are we in the same page so far?
> 
> Yes.
> 
>> Are the above OK? If so, I have some questions:
>>
>> a) Is an MMIO region one of the reasons why we can't map an sg?
> 
> Yes, the only one pretty much.
> 
>> b) At which point will the relevant ops->write() method for the MMIO
>>    region be called when we have to DMA into the region?? Is it handled
>>    implicitly in dma_memory_map()?
> 
> It's in address_space_unmap:
> 
>     if (is_write) {
>         address_space_write(as, bounce.addr, MEMTXATTRS_UNSPECIFIED,
>                             bounce.buffer, access_len);
>     }
> 
> Likewise, address_space_map does the ops->read call through
> address_space_read.
> 
>> c) I'm not quite sure about the logic of the "nothing mapped" section.
>>    Correct me if I'm wrong, but what I think it does is that it
>>    registers a callback (reschedule_dma) once some sort of mapping has
>>    completed. What kind of mapping is this? Is there anything more to
>>    it?
> 
> Once something (presumably a concurrent user of dma-helpers.c) calls
> address_space_unmap to free the mapping (the bounce.buffer in the above
> address_space_write call), reschedule_dma is called.
>
>>> However, it is not possible to do the same for ioctls.  This is actually
>>> the reason why no one has ever tried to make scsi-generic do anything
>>> but bounce-buffering. I think that your code breaks horribly in this
>>> case, and I don't see a way to fix it, except for reverting to bounce
>>> buffering.

Sure, you're right, there's no sensible way to break an ioctl()
operation into multiple steps. However, I'd argue that we shouldn't need
to, as it
would be much better if the DMA operation was never restarted in the
first place. Instead, if we dealt with the bigger issue of the global
bounce buffer, we could kill two birds with one stone.

I see that there was an attempt [1] to replace the global bounce buffer
with something more dynamic. In short, the objections [2] were the
following:

1. It introduced locking/unlocking a global mutex in the hot path as
   well as a hash table lookup.
2. It allowed for unbounded memory allocations.

An improvement that would address (1) is to get rid of any global state:

Since the mapping operation takes place in the context of a DMA
operation, we could provide a ctx-type struct to the dma_memory_(un)map
--> address_space_(un)map functions that would contain a hash table. If
any memory allocations were needed, we would track them using that hash
table, which would require no locks. Moreover, if the initialization of
the hash table hurts the performance in the general case, we could use
instead a skip list, if the number of memory allocations is small (e.g.
< 100).

If a mapping operation does not take place in the context of a DMA
operation, we could pass NULL and the address_space_(un)map code would
fallback to the global bounce buffer. Having a fallback would allow for
a smooth transition.

As for the point raised in (2), we can have a limit on the allocated
pages, e.g. 1024. If a well-behaved guest reaches that limit, it will
resume the allocation once a request has completed. For misbehaving
guests, this means that their request will block. Is this reasonable?
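
To make the idea more concrete, here is a rough sketch of what I have in
mind; every name below is made up:

    /* Hypothetical per-request mapping context. dma-helpers.c would
     * create one per DMA operation and thread it down through
     * dma_memory_map() to address_space_map(), instead of touching any
     * global state. */
    typedef struct DMAMapContext {
        GHashTable *bounces;     /* host pointer -> bounce metadata; no
                                  * locking, the context belongs to a
                                  * single DMA request */
        unsigned long pages;     /* bounce pages allocated so far */
        unsigned long max_pages; /* e.g. 1024; past this, mapping fails
                                  * and the request is resumed later */
    } DMAMapContext;

    void *address_space_map_ctx(AddressSpace *as, hwaddr addr,
                                hwaddr *plen, bool is_write,
                                DMAMapContext *ctx);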

>>> This would require major changes in your patches, and I'm not sure
>>> whether they are worth it for the single use case of tape devices...
>>
>> Well, I wouldn't narrow it down to tape devices. The scsi-generic
>> interface is the universal interface for all SCSI devices and the only
>> interface that is fully passthrough.
> 
> Sure, but what's the advantage of a fully passthrough interface over
> scsi-block?

This is not a matter of advantage between the two interfaces. A fully
passthrough interface is simply an alternative data path which one
should be able to test and benchmark.

>> So this patch would effectively
>> boost the baseline performance of SCSI devices. I think it's worth a try.
> 
> I think the first step is understanding what to do about the weird "&
> ~BDRV_SECTOR_MASK" case, then.

We can discuss this case in the "dma-helpers: Do not truncate small
qiovs" thread. I'm all for the removal of this check, but I was hoping
that Kevin would clarify whether it's still relevant.

Cheers,
Alex


[1] http://lists.nongnu.org/archive/html/qemu-devel/2015-03/msg01776.html
[2] http://lists.nongnu.org/archive/html/qemu-devel/2015-03/msg01884.html
Paolo Bonzini Feb. 25, 2016, 10:22 a.m. UTC | #5
On 25/02/2016 11:10, Alex Pyrgiotis wrote:
> All normal regions in a QEMUSGList point to an address range in the
> guest's RAM. The MMIO regions of QEMU's virtual devices, however, do not
> correspond to such an address range, so QEMU must create a bounce buffer
> to represent them. This bounce buffer is added in the I/O vector which
> contains the rest of the mapped addresses and is later given to a
> readv()/writev() call.

Correct.

>>> 9. This leads to a partial read/write and the mapping loop will resume
>>>    once the partial read/write() has finished.
> 
> The MMIO region is the trigger for a partial read/write, but it's not
> the actual reason. The actual reason is that there is only *one*
> *global* bounce buffer. This means that if it's in use or we
> need to use it twice, we will have to wait.

Yes.

>>>> However, it is not possible to do the same for ioctls.  This is actually
>>>> the reason why no one has ever tried to make scsi-generic do anything
>>>> but bounce-buffering. I think that your code breaks horribly in this
>>>> case, and I don't see a way to fix it, except for reverting to bounce
>>>> buffering.
> 
> Sure, you're right, there's no sensible way to break an ioctl()
> operation in many. However, I'd argue that we shouldn't need to, as it
> would be much better if the DMA operation was never restarted in the
> first place. Instead, if we dealt with the bigger issue of the global
> bounce buffer, we could kill two birds with one stone.
> 
> I see that there was an attempt [1] to replace the global bounce buffer
> with something more dynamic. In short, the objections [2] were the
> following:
> 
> 1. It introduced locking/unlocking a global mutex in the hot path as
>    well as a hash table lookup.
> 2. It allowed for unbounded memory allocations.
> 
> An improvement that would address (1) is to get rid of any global state:
> 
> Since the mapping operation takes place in the context of a DMA
> operation, we could provide a ctx-type struct to the dma_memory_(un)map
> --> address_space_(un)map functions that would contain a hash table. If
> any memory allocations were needed, we would track them using that hash
> table, which would require no locks. Moreover, if the initialization of
> the hash table hurts the performance in the general case, we could use
> instead a skip list, if the number of memory allocations is small (e.g.
> < 100).

You don't need a hash table either if you manage the bounce buffer list
per DMA transfer, and the simplest way to achieve that is to move the
bounce buffer from exec.c to dma-helpers.c entirely.

The patch could first introduce address_space_map_direct that never uses
the bounce buffer.  dma-helpers.c can call address_space_map_direct and,
if it fails, proceed to allocate (and fill if writing to the device) a
bounce buffer.  Since the QEMUSGList is mapped and unmapped
beginning-to-end, you can just use a FIFO queue.  The FIFO queue stores
a (QEMUSGList, buffer) tuple.  When unmapping a QEMUSGList you check if
it matches the head of the queue; if it does, you write back the
contents of the bounce buffer (for reads from the device) and free it.
If it doesn't match, you call address_space_unmap.
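
Roughly, one way this could look (made-up names, per-mapping
granularity, and assuming the address_space_map_direct described above):

    /* One queue entry per MMIO segment that could not be mapped directly. */
    typedef struct DMABounce {
        dma_addr_t addr;          /* guest address covered by this entry */
        dma_addr_t len;
        void *buffer;             /* qemu_memalign()ed bounce memory */
        QSIMPLEQ_ENTRY(DMABounce) next;
    } DMABounce;

    /* dbs->bounces would be a QSIMPLEQ_HEAD(, DMABounce) in DMAAIOCB. */

    static void *dma_map_one(DMAAIOCB *dbs, dma_addr_t addr, dma_addr_t *len)
    {
        bool is_write = (dbs->dir == DMA_DIRECTION_FROM_DEVICE);
        void *mem = address_space_map_direct(dbs->sg->as, addr, len,
                                             is_write);
        DMABounce *b;

        if (mem) {
            return mem;                      /* plain RAM, same as today */
        }
        /* MMIO: use a private bounce buffer instead of the global one. */
        b = g_new0(DMABounce, 1);
        b->addr = addr;
        b->len = *len;
        b->buffer = qemu_memalign(4096, *len);
        if (dbs->dir == DMA_DIRECTION_TO_DEVICE) {
            /* pre-fill it, going through the region's read handlers */
            dma_memory_read(dbs->sg->as, addr, b->buffer, *len);
        }
        QSIMPLEQ_INSERT_TAIL(&dbs->bounces, b, next);
        return b->buffer;
    }

    static void dma_unmap_one(DMAAIOCB *dbs, void *mem, dma_addr_t len)
    {
        DMABounce *b = QSIMPLEQ_FIRST(&dbs->bounces);

        if (b && b->buffer == mem) {
            if (dbs->dir == DMA_DIRECTION_FROM_DEVICE) {
                /* write back through the region's write handlers */
                dma_memory_write(dbs->sg->as, b->addr, b->buffer, len);
            }
            QSIMPLEQ_REMOVE_HEAD(&dbs->bounces, next);
            qemu_vfree(b->buffer);
            g_free(b);
        } else {
            dma_memory_unmap(dbs->sg->as, mem, len, dbs->dir, len);
        }
    }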

Then, once the bounce buffer is implemented within dma-helpers.c, you
remove address_space_map and rename address_space_map_direct to
address_space_map.  cpu_register_map_client goes away.

The unbounded memory allocation can be avoided by bounding the number of
entries in the queue.  In the case of scsi-generic you could just as
well allow INT_MAX entries, because scsi-generic would do unbounded
memory allocation anyway for the bounce buffer.

Modulo the "& BDRV_SECTOR_MASK" issue, this actually seems simpler than
what this series was doing.

Paolo
Alex Pyrgiotis Feb. 25, 2016, 11:19 a.m. UTC | #6
Hi Paolo,

On 02/25/2016 12:22 PM, Paolo Bonzini wrote:
> 
> 
> On 25/02/2016 11:10, Alex Pyrgiotis wrote:
>> 8<... snip ...>8
>>
>> Sure, you're right, there's no sensible way to break an ioctl()
>> operation in many. However, I'd argue that we shouldn't need to, as it
>> would be much better if the DMA operation was never restarted in the
>> first place. Instead, if we dealt with the bigger issue of the global
>> bounce buffer, we could kill two birds with one stone.
>>
>> I see that there was an attempt [1] to replace the global bounce buffer
>> with something more dynamic. In short, the objections [2] were the
>> following:
>>
>> 1. It introduced locking/unlocking a global mutex in the hot path as
>>    well as a hash table lookup.
>> 2. It allowed for unbounded memory allocations.
>>
>> An improvement that would address (1) is to get rid of any global state:
>>
>> Since the mapping operation takes place in the context of a DMA
>> operation, we could provide a ctx-type struct to the dma_memory_(un)map
>> --> address_space_(un)map functions that would contain a hash table. If
>> any memory allocations were needed, we would track them using that hash
>> table, which would require no locks. Moreover, if the initialization of
>> the hash table hurts the performance in the general case, we could use
>> instead a skip list, if the number of memory allocations is small (e.g.
>> < 100).
> 
> You don't need a hash table either if you manage the bounce buffer list
> per DMA transfer, and the simplest way to achieve that is to move the
> bounce buffer from exec.c to dma-helpers.c entirely.
> 
> The patch could first introduce address_space_map_direct that never uses
> the bounce buffer.  dma-helpers.c can call address_space_map_direct and,
> if it fails, proceed to allocate (and fill if writing to the device) a
> bounce buffer.

You mean that address_space_map_direct() is a copy of the
address_space_map() code, minus the bounce buffer part which will be
handled in dma-helpers.c? If so, I agree.

Also, the buffer should be removed from address_space_unmap.

> Since the QEMUSGList is mapped and unmapped
> beginning-to-end, you can just use a FIFO queue.  The FIFO queue stores
> a (QEMUSGList, buffer) tuple.

I suppose that this queue is stored in the dbs structure of dma-helpers?
If so, I agree.

> When unmapping a QEMUSGList you check if
> it matches the head of the queue; if it does, you write back the
> contents of the bounce buffer (for reads from the device) and free it.
> If it doesn't match, you call address_space_unmap.

Agree.

> Then, once the bounce buffer is implemented within dma-helpers.c, you
> remove address_space_map and rename address_space_map_direct to
> address_space_map.  cpu_register_map_client goes away.

What about the other users of address_space_map()? Is the bounce buffer
used only for DMA operations? If so, I agree. Else, we need a fallback.

> The unbounded memory allocation can be avoided by bounding the number of
> entries in the queue.

Agree.

Thanks,
Alex
Paolo Bonzini Feb. 25, 2016, 1:01 p.m. UTC | #7
On 25/02/2016 12:19, Alex Pyrgiotis wrote:
>> The patch could first introduce address_space_map_direct that never uses
>> the bounce buffer.  dma-helpers.c can call address_space_map_direct and,
>> if it fails, proceed to allocate (and fill if writing to the device) a
>> bounce buffer.
> 
> You mean that address_space_map_direct() is a copy of the
> address_space_map() code, minus the bounce buffer part which will be
> handled in dma-helpers.c? If so, I agree.

Yes.  In particular minus the:

    if (!memory_access_is_direct(mr, is_write))

part, hence the name. :)  If direct access isn't possible,
address_space_map_direct would always return NULL, like
address_space_map does when bounce.in_use is already true.
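
In other words, the branch would shrink to something like:

    if (!memory_access_is_direct(mr, is_write)) {
        /* no global bounce buffer fallback any more */
        return NULL;    /* plus whatever unlocking the real code needs */
    }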

> Also, the buffer should be removed from address_space_unmap.

Right, though only after address_space_map_direct is renamed to
address_space_map.  That's also the point where cpu_register_map_client
disappears, along with other parts of dma-helpers.c such as the bottom half.

>> Since the QEMUSGList is mapped and unmapped
>> beginning-to-end, you can just use a FIFO queue.  The FIFO queue stores
>> a (QEMUSGList, buffer) tuple.
> 
> I suppose that this queue is stored in the dbs structure of dma-helpers?
> If so, I agree.

Yes.

>> Then, once the bounce buffer is implemented within dma-helpers.c, you
>> remove address_space_map and rename address_space_map_direct to
>> address_space_map.  cpu_register_map_client goes away.
> 
> What about the other users of address_space_map()? Is the bounce buffer
> used only for DMA operations? If so, I agree. Else, we need a fallback.

Other users usually fail if address_space_map cannot return a mapping as
big as requested.  They also do not use cpu_register_map_client.  Both
shortcomings mean that in practice they do not support the bounce buffer.

Paolo
Kevin Wolf Feb. 26, 2016, 9:20 a.m. UTC | #8
On 25.02.2016 at 11:10, Alex Pyrgiotis wrote:
> > I think the first step is understanding what to do about the weird "&
> > ~BDRV_SECTOR_MASK" case, then.
> 
> We can discuss about this case in the "dma-helpers: Do not truncate
> small qiovs" thread. I'm all for the removal of this check, but I was
> hoping that Kevin would clarify if it's still relevant.

I think all of dma-helpers.c is currently sector based, and as long as
it is, this code needs to stay. You need to look at it in relation to
the divisions by 512 that the same function performs - the block layer
simply expects that if you have an I/O request for one sector, your qiov
is 512 bytes in length and not 527. So whenever the integer division
rounds down, the qiov needs to be trimmed, too.
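
For example, with the relevant lines from dma-helpers.c and a 527-byte
qiov:

    dbs->sector_num += dbs->iov.size / 512;      /* 527 / 512 == 1 sector */
    if (dbs->iov.size & ~BDRV_SECTOR_MASK) {     /* 527 & 511 == 15 bytes */
        qemu_iovec_discard_back(&dbs->iov,
                                dbs->iov.size & ~BDRV_SECTOR_MASK);
    }

The request is accounted as one sector, so the 15 extra bytes have to be
dropped from the qiov before it is handed to the block layer.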

As soon as you convert the whole function to byte granularity (and the
block layer does have byte granularity APIs, so that shouldn't be a lot
of work), you don't have any implied rounding by integer divisions any
more and the qiov trimming can go away as well.

Kevin

Patch

diff --git a/dma-helpers.c b/dma-helpers.c
index 4faec5d..c38661e 100644
--- a/dma-helpers.c
+++ b/dma-helpers.c
@@ -79,9 +79,10 @@  typedef struct {
     QEMUIOVector iov;
     QEMUBH *bh;
     DMAIOFunc *io_func;
+    BlockCompletionFunc *dma_cb;
 } DMAAIOCB;
 
-static void dma_blk_cb(void *opaque, int ret);
+static void dma_blk_io_cb(void *opaque, int ret);
 
 static void reschedule_dma(void *opaque)
 {
@@ -89,7 +90,7 @@  static void reschedule_dma(void *opaque)
 
     qemu_bh_delete(dbs->bh);
     dbs->bh = NULL;
-    dma_blk_cb(dbs, 0);
+    dbs->dma_cb(dbs, 0);
 }
 
 static void dma_blk_unmap(DMAAIOCB *dbs)
@@ -120,21 +121,19 @@  static void dma_complete(DMAAIOCB *dbs, int ret)
     qemu_aio_unref(dbs);
 }
 
-static void dma_blk_cb(void *opaque, int ret)
+/*
+ * Create a QEMUIOVector from a scatter-gather list.
+ *
+ * This function does not copy the data of the scatter-gather list. Instead, it
+ * uses the dma_memory_map() function to map physical memory regions of the
+ * virtual device (as interpreted by the guest kernel) into the address space
+ * of the QEMU process, in order to have access to the data.
+ */
+static void dma_map_sg(DMAAIOCB *dbs)
 {
-    DMAAIOCB *dbs = (DMAAIOCB *)opaque;
     dma_addr_t cur_addr, cur_len;
     void *mem;
 
-    trace_dma_blk_cb(dbs, ret);
-
-    dbs->acb = NULL;
-    dbs->sector_num += dbs->iov.size / 512;
-
-    if (dbs->sg_cur_index == dbs->sg->nsg || ret < 0) {
-        dma_complete(dbs, ret);
-        return;
-    }
     dma_blk_unmap(dbs);
 
     while (dbs->sg_cur_index < dbs->sg->nsg) {
@@ -162,9 +161,38 @@  static void dma_blk_cb(void *opaque, int ret)
     if (dbs->iov.size & ~BDRV_SECTOR_MASK) {
         qemu_iovec_discard_back(&dbs->iov, dbs->iov.size & ~BDRV_SECTOR_MASK);
     }
+}
+
+/*
+ * Callback function for DMA read/write operations.
+ *
+ * This function initiates the read/write operation and also acts as a
+ * completion callback. It uses the dma_map_sg() function to map the
+ * scatter-gather list to a QEMUIOVector and then passes this iovec to the
+ * underlying read/write I/O function.
+ *
+ * If the DMA operation cannot take place in one step, e.g. it couldn't map all
+ * the scatter-gather entries, then this function will do a partial I/O
+ * operation and once done, it will be called and will retry the I/O operation.
+ */
+static void dma_blk_io_cb(void *opaque, int ret)
+{
+    DMAAIOCB *dbs = (DMAAIOCB *)opaque;
+
+    trace_dma_blk_io_cb(dbs, ret);
+
+    dbs->acb = NULL;
+    dbs->sector_num += dbs->iov.size / 512;
+
+    if (dbs->sg_cur_index == dbs->sg->nsg || ret < 0) {
+        dma_complete(dbs, ret);
+        return;
+    }
+
+    dma_map_sg(dbs);
 
     dbs->acb = dbs->io_func(dbs->blk, dbs->sector_num, &dbs->iov,
-                            dbs->iov.size / 512, dma_blk_cb, dbs);
+                            dbs->iov.size / 512, dma_blk_io_cb, dbs);
     assert(dbs->acb);
 }
 
@@ -190,6 +218,22 @@  static const AIOCBInfo dma_aiocb_info = {
     .cancel_async       = dma_aio_cancel,
 };
 
+/*
+ * Initialize the dbs structure and the QEMUIOVector to sane defaults.
+ */
+static void dma_init_dbs(DMAAIOCB *dbs, BlockBackend *blk, QEMUSGList *sg,
+                         DMADirection dir)
+{
+    dbs->acb = NULL;
+    dbs->blk = blk;
+    dbs->sg = sg;
+    dbs->sg_cur_index = 0;
+    dbs->sg_cur_byte = 0;
+    dbs->dir = dir;
+    dbs->bh = NULL;
+    qemu_iovec_init(&dbs->iov, sg->nsg);
+}
+
 BlockAIOCB *dma_blk_io(
     BlockBackend *blk, QEMUSGList *sg, uint64_t sector_num,
     DMAIOFunc *io_func, BlockCompletionFunc *cb,
@@ -199,21 +243,14 @@  BlockAIOCB *dma_blk_io(
 
     trace_dma_blk_io(dbs, blk, sector_num, (dir == DMA_DIRECTION_TO_DEVICE));
 
-    dbs->acb = NULL;
-    dbs->blk = blk;
-    dbs->sg = sg;
+    dma_init_dbs(dbs, blk, sg, dir);
     dbs->sector_num = sector_num;
-    dbs->sg_cur_index = 0;
-    dbs->sg_cur_byte = 0;
-    dbs->dir = dir;
     dbs->io_func = io_func;
-    dbs->bh = NULL;
-    qemu_iovec_init(&dbs->iov, sg->nsg);
-    dma_blk_cb(dbs, 0);
+    dbs->dma_cb = dma_blk_io_cb;
+    dbs->dma_cb(dbs, 0);
     return &dbs->common;
 }
 
-
 BlockAIOCB *dma_blk_read(BlockBackend *blk,
                          QEMUSGList *sg, uint64_t sector,
                          void (*cb)(void *opaque, int ret), void *opaque)
diff --git a/trace-events b/trace-events
index 2fce98e..120cdd4 100644
--- a/trace-events
+++ b/trace-events
@@ -1127,7 +1127,7 @@  win_helper_retry(uint32_t tl) "tl=%d"
 dma_blk_io(void *dbs, void *bs, int64_t sector_num, bool to_dev) "dbs=%p bs=%p sector_num=%" PRId64 " to_dev=%d"
 dma_aio_cancel(void *dbs) "dbs=%p"
 dma_complete(void *dbs, int ret, void *cb) "dbs=%p ret=%d cb=%p"
-dma_blk_cb(void *dbs, int ret) "dbs=%p ret=%d"
+dma_blk_io_cb(void *dbs, int ret) "dbs=%p ret=%d"
 dma_map_wait(void *dbs) "dbs=%p"
 
 # ui/console.c