Patchwork [RFC,4/7] ide: IOMMU support

Submitter Eduard - Gabriel Munteanu
Date July 14, 2010, 5:45 a.m.
Message ID <1279086307-9596-5-git-send-email-eduard.munteanu@linux360.ro>
Permalink /patch/58839/
State New

Comments

Eduard - Gabriel Munteanu - July 14, 2010, 5:45 a.m.
Memory accesses must go through the IOMMU layer.

Signed-off-by: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
---
 hw/ide/core.c |   46 +++++++++++++++++++++++++++++++---------------
 1 files changed, 31 insertions(+), 15 deletions(-)
Paul Brook - July 14, 2010, 1:53 p.m.
> Memory accesses must go through the IOMMU layer.

No. Devices should not know or care whether an IOMMU is present.

You should be adding a DeviceState argument to cpu_physical_memory_{rw,map}. 
This should then handle IOMMU translation transparently.

You also need to accommodate the case where multiple IOMMUs are present.

Paul
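Sketched in C, Paul's suggestion might look like the following (all names here are hypothetical stand-ins, not QEMU's actual API): the device calls one access function with itself as an argument, and any translation installed by the bus is applied transparently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t target_phys_addr_t;

typedef struct DeviceState DeviceState;

/* Hypothetical: the bus (not the device) installs a translation hook when
 * it wires up the device; NULL means an identity mapping. */
struct DeviceState {
    target_phys_addr_t (*translate)(DeviceState *dev, target_phys_addr_t addr);
};

static uint8_t ram[256];    /* stand-in for guest RAM */

/* Same shape as cpu_physical_memory_rw() plus the DeviceState argument.
 * The device never checks whether an IOMMU exists; translation, if any,
 * happens here. */
static void dma_memory_rw(DeviceState *dev, target_phys_addr_t addr,
                          uint8_t *buf, size_t len, int is_write)
{
    if (dev && dev->translate)
        addr = dev->translate(dev, addr);
    for (size_t i = 0; i < len; i++) {
        if (is_write)
            ram[addr + i] = buf[i];
        else
            buf[i] = ram[addr + i];
    }
}

/* Example mapping standing in for a real page-table walk: a flat offset. */
static target_phys_addr_t shift_by_16(DeviceState *dev, target_phys_addr_t a)
{
    (void)dev;
    return a + 16;
}
```

A device with no IOMMU in front of it would simply have `translate == NULL` and behave exactly as before.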
Joerg Roedel - July 14, 2010, 6:33 p.m.
On Wed, Jul 14, 2010 at 02:53:03PM +0100, Paul Brook wrote:
> > Memory accesses must go through the IOMMU layer.
> 
> No. Devices should not know or care whether an IOMMU is present.

There are real devices that care very much about an IOMMU. Basically all
devices supporting ATS care about that. So I don't see a problem if the
device emulation code of qemu also cares about present IOMMUs.

> You should be adding a DeviceState argument to cpu_physical_memory_{rw,map}. 
> This should then handle IOMMU translation transparently.

That's not a good idea imho. With an IOMMU the device no longer accesses
cpu physical memory. It accesses device virtual memory. Using
cpu_physical_memory* functions in device code becomes misleading when
the device virtual address space differs from cpu physical. So different
functions for devices make a lot of sense here. Another reason for
separate functions is that we can extend them later to support emulation
of ATS devices.

> You also need to accommodate the case where multiple IOMMUs are present.

This, indeed, is something transparent to the device. This should be
handled inside the iommu emulation code.

	Joerg
Paul Brook - July 14, 2010, 8:13 p.m.
> On Wed, Jul 14, 2010 at 02:53:03PM +0100, Paul Brook wrote:
> > > Memory accesses must go through the IOMMU layer.
> > 
> > No. Devices should not know or care whether an IOMMU is present.
> 
> There are real devices that care very much about an IOMMU. Basically all
> devices supporting ATS care about that. So I don't see a problem if the
> device emulation code of qemu also cares about present IOMMUs.
> 
> > You should be adding a DeviceState argument to
> > cpu_physical_memory_{rw,map}. This should then handle IOMMU translation
> > transparently.
> 
> That's not a good idea imho. With an IOMMU the device no longer accesses
> cpu physical memory. It accesses device virtual memory. Using
> cpu_physical_memory* functions in device code becomes misleading when
> the device virtual address space differs from cpu physical. 

Well, ok, the function name needs fixing too.  However I think the only thing 
missing from the current API is that it does not provide a way to determine 
which device is performing the access.

Depending on how we decide to handle IOMMU invalidation, it may also be 
necessary to augment the memory_map API to allow the system to request a 
mapping be revoked.  However this issue is not specific to the IOMMU 
implementation. Such bugs are already present on any system that allows 
dynamic reconfiguration of the address space, e.g. by changing PCI BARs.

> So different
> functions for devices make a lot of sense here. Another reason for
> separate functions is that we can extend them later to support emulation
> of ATS devices.

I disagree. ATS should be an independent feature, and is inherently bus 
specific.  As usual the PCI spec is not publicly available, but based on the 
AMD IOMMU docs I'd say that ATS is completely independent of memory accesses - 
the convention being that you trust an ATS capable device to DTRT, and 
configure the bus IOMMU to apply a flat mapping for accesses from such 
devices.

> > You also need to accommodate the case where multiple IOMMUs are
> > present.
> 
> This, indeed, is something transparent to the device. This should be
> handled inside the iommu emulation code.

I think you've got the abstraction boundaries all wrong.

A device performs a memory access on its local bus. It has no knowledge of how 
that access is routed to its destination.  The device should not be aware of 
any IOMMUs, in the same way that it doesn't know whether it happens to be 
accessing RAM or memory mapped peripherals on another device.

Each IOMMU is fundamentally part of a bus bridge. For example the bridge 
between a PCI bus and the system bus. It provides an address mapping from one 
bus to another. 

There should be no direct interaction between an IOMMU and a device (ignoring 
ATS, which is effectively a separate data channel).  Everything should be done 
via the cpu_physical_memory_* code.  Likewise on a system with multiple nested 
IOMMUs there should be no direct interaction between these. 
cpu_physical_memory_* should walk the device/bus tree to determine where the 
access terminates, applying mappings appropriately.

Paul
Anthony Liguori - July 14, 2010, 9:29 p.m.
On 07/14/2010 03:13 PM, Paul Brook wrote:
>> On Wed, Jul 14, 2010 at 02:53:03PM +0100, Paul Brook wrote:
>>      
>>>> Memory accesses must go through the IOMMU layer.
>>>>          
>>> No. Devices should not know or care whether an IOMMU is present.
>>>        
>> There are real devices that care very much about an IOMMU. Basically all
>> devices supporting ATS care about that. So I don't see a problem if the
>> device emulation code of qemu also cares about present IOMMUs.
>>
>>      
>>> You should be adding a DeviceState argument to
>>> cpu_physical_memory_{rw,map}. This should then handle IOMMU translation
>>> transparently.
>>>        
>> That's not a good idea imho. With an IOMMU the device no longer accesses
>> cpu physical memory. It accesses device virtual memory. Using
>> cpu_physical_memory* functions in device code becomes misleading when
>> the device virtual address space differs from cpu physical.
>>      
> Well, ok, the function name needs fixing too.  However I think the only thing
> missing from the current API is that it does not provide a way to determine
> which device is performing the access.
>    

I agree with Paul.

The right approach IMHO is to convert devices to use bus-specific 
functions to access memory.  The bus specific functions should have a 
device argument as the first parameter.

For PCI-based IOMMUs, the implementation exists solely within the PCI 
bus.  For platforms (like SPARC) that have lower level IOMMUs, we would 
need to probably introduce a sysbus memory access layer and then provide 
a hook to implement an IOMMU there.

> Depending on how we decide to handle IOMMU invalidation, it may also be
> necessary to augment the memory_map API to allow the system to request a
> mapping be revoked.  However this issue is not specific to the IOMMU
> implementation. Such bugs are already present on any system that allows
> dynamic reconfiguration of the address space, e.g. by changing PCI BARs.
>    

That's why the memory_map API today does not allow mappings to persist 
after trips back to the main loop.

Regards,

Anthony Liguori

>> So different
>> functions for devices make a lot of sense here. Another reason for
>> separate functions is that we can extend them later to support emulation
>> of ATS devices.
>>      
> I disagree. ATS should be an independent feature, and is inherently bus
> specific.  As usual the PCI spec is not publicly available, but based on the
> AMD IOMMU docs I'd say that ATS is completely independent of memory accesses -
> the convention being that you trust an ATS capable device to DTRT, and
> configure the bus IOMMU to apply a flat mapping for accesses from such
> devices.
>
>    
>>> You also need to accommodate the case where multiple IOMMUs are
>>> present.
>>>        
>> This, indeed, is something transparent to the device. This should be
>> handled inside the iommu emulation code.
>>      
> I think you've got the abstraction boundaries all wrong.
>
> A device performs a memory access on its local bus. It has no knowledge of how
> that access is routed to its destination.  The device should not be aware of
> any IOMMUs, in the same way that it doesn't know whether it happens to be
> accessing RAM or memory mapped peripherals on another device.
>
> Each IOMMU is fundamentally part of a bus bridge. For example the bridge
> between a PCI bus and the system bus. It provides an address mapping from one
> bus to another.
>
> There should be no direct interaction between an IOMMU and a device (ignoring
> ATS, which is effectively a separate data channel).  Everything should be done
> via the cpu_physical_memory_* code.  Likewise on a system with multiple nested
> IOMMUs there should be no direct interaction between these.
> cpu_physical_memory_* should walk the device/bus tree to determine where the
> access terminates, applying mappings appropriately.
>
> Paul
Chris Wright - July 14, 2010, 10:24 p.m.
* Anthony Liguori (anthony@codemonkey.ws) wrote:
> On 07/14/2010 03:13 PM, Paul Brook wrote:
> >>On Wed, Jul 14, 2010 at 02:53:03PM +0100, Paul Brook wrote:
> >>>>Memory accesses must go through the IOMMU layer.
> >>>No. Devices should not know or care whether an IOMMU is present.
> >>There are real devices that care very much about an IOMMU. Basically all
> >>devices supporting ATS care about that. So I don't see a problem if the
> >>device emulation code of qemu also cares about present IOMMUs.
> >>
> >>>You should be adding a DeviceState argument to
> >>>cpu_physical_memory_{rw,map}. This should then handle IOMMU translation
> >>>transparently.
> >>That's not a good idea imho. With an IOMMU the device no longer accesses
> >>cpu physical memory. It accesses device virtual memory. Using
> >>cpu_physical_memory* functions in device code becomes misleading when
> >>the device virtual address space differs from cpu physical.
> >Well, ok, the function name needs fixing too.  However I think the only thing
> >missing from the current API is that it does not provide a way to determine
> >which device is performing the access.
> 
> I agree with Paul.

I do too.

> The right approach IMHO is to convert devices to use bus-specific
> functions to access memory.  The bus specific functions should have
> a device argument as the first parameter.

As for ATS, the internal api to handle the device's dma request needs
a notion of a translated vs. an untranslated request.  IOW, if qemu ever
had a device with ATS support, the device would use its local cache to
translate the dma address and then submit a translated request to the
pci bus (effectively doing a raw cpu physical memory* in that case).

thanks,
-chris
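Chris's translated/untranslated distinction could be modelled with a request-type flag (a hypothetical sketch, not actual QEMU code): a pre-translated request from an ATS device bypasses the IOMMU page walk, while an untranslated one goes through it.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t dma_addr_t;

/* Hypothetical request types, mirroring the PCIe AT field distinction. */
enum { DMA_UNTRANSLATED, DMA_PRETRANSLATED };

/* Stand-in for the bus IOMMU page-table walk: a fixed offset. */
static dma_addr_t iommu_walk(dma_addr_t addr)
{
    return addr + 0x1000;
}

/* An ATS-capable device translates via its local cache (its ATC) and then
 * issues a pre-translated request; the bus honours the flag and skips the
 * walk, which is effectively a raw physical-memory access. */
static dma_addr_t bus_resolve(dma_addr_t addr, int req_type)
{
    if (req_type == DMA_PRETRANSLATED)
        return addr;            /* trusted: already a system address */
    return iommu_walk(addr);    /* normal request: translate here */
}
```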
Eduard - Gabriel Munteanu - July 14, 2010, 11:11 p.m.
On Wed, Jul 14, 2010 at 02:53:03PM +0100, Paul Brook wrote:
> > Memory accesses must go through the IOMMU layer.
> 
> No. Devices should not know or care whether an IOMMU is present.

They don't really care. iommu_get() et al. are convenience functions
which can and do return NULL when there's no IOMMU and device code can
pass that NULL around without checking. I could've probably made the r/w
functions take only the DeviceState in addition to normal args, but
wanted to avoid looking up the related structures on each I/O operation.

> You should be adding a DeviceState argument to cpu_physical_memory_{rw,map}. 
> This should then handle IOMMU translation transparently.
> 
> You also need to accomodate the the case where multiple IOMMU are present.
> 
> Paul

We don't assume there's a single IOMMU in the generic layer. The
callbacks within 'struct iommu' could very well dispatch the request to
one of multiple, coexisting IOMMUs.


	Eduard
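One possible shape for the dispatch Eduard describes (hypothetical; not the patch's actual definitions): the callback held inside `struct iommu` can itself route a request to one of several coexisting IOMMUs, for example by address range.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t target_phys_addr_t;

/* Hypothetical generic-layer type: one translation callback per IOMMU. */
struct iommu {
    target_phys_addr_t (*translate)(struct iommu *iommu,
                                    target_phys_addr_t addr);
};

/* Two back-end IOMMUs with distinct (arbitrary) mappings. */
static target_phys_addr_t low_xlate(struct iommu *i, target_phys_addr_t a)
{
    (void)i;
    return a | 0x10000;
}

static target_phys_addr_t high_xlate(struct iommu *i, target_phys_addr_t a)
{
    (void)i;
    return a | 0x20000;
}

static struct iommu low_iommu  = { low_xlate };
static struct iommu high_iommu = { high_xlate };

/* Front-end dispatcher: itself just another 'struct iommu', so the device
 * code sees a single IOMMU regardless of how many exist behind it. */
static target_phys_addr_t dispatch_xlate(struct iommu *i,
                                         target_phys_addr_t a)
{
    (void)i;
    struct iommu *target = (a < 0x8000) ? &low_iommu : &high_iommu;
    return target->translate(target, a);
}

static struct iommu dispatcher = { dispatch_xlate };
```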
Eduard - Gabriel Munteanu - July 14, 2010, 11:39 p.m.
On Wed, Jul 14, 2010 at 09:13:44PM +0100, Paul Brook wrote:

> Well, ok, the function name needs fixing too.  However I think the only thing 
> missing from the current API is that it does not provide a way to determine 
> which device is performing the access.
> 
> Depending on how we decide to handle IOMMU invalidation, it may also be 
> necessary to augment the memory_map API to allow the system to request a 
> mapping be revoked.  However this issue is not specific to the IOMMU 
> implementation. Such bugs are already present on any system that allows 
> dynamic reconfiguration of the address space, e.g. by changing PCI BARs.

Yeah, having a way to alter existing maps would be good. Basically it's
the only place where we truly care about the existence of an IOMMU, and
that's due to AIO.

But it's really tricky to do, unfortunately. I also think having an
abort_doing_io_on_map() kind of notifier would be sufficient. It should
notify back when I/O has been completely stopped.

This should be enough, since we don't expect the results of IDE-like DMA
to be recoverable in case a mapping change occurs, even on real hardware.

> I disagree. ATS should be an independent feature, and is inherently bus 
> specific.  As usual the PCI spec is not publicly available, but based on the 
> AMD IOMMU docs I'd say that ATS is completely independent of memory accesses - 
> the convention being that you trust an ATS capable device to DTRT, and 
> configure the bus IOMMU to apply a flat mapping for accesses from such 
> devices.

ATS is documented in the Hypertransport specs which are publicly
available.

[snip]

> A device performs a memory access on its local bus. It has no knowledge of how 
> that access is routed to its destination.  The device should not be aware of 
> any IOMMUs, in the same way that it doesn't know whether it happens to be 
> accessing RAM or memory mapped peripherals on another device.
> 
> Each IOMMU is fundamentally part of a bus bridge. For example the bridge 
> between a PCI bus and the system bus. It provides an address mapping from one 
> bus to another. 
> 
> There should be no direct interaction between an IOMMU and a device (ignoring 
> ATS, which is effectively a separate data channel).  Everything should be done 
> via the cpu_physical_memory_* code.  Likewise on a system with multiple nested 
> IOMMUs there should be no direct interaction between these. 
> cpu_physical_memory_* should walk the device/bus tree to determine where the 
> access terminates, applying mappings appropriately.
> 
> Paul

Admittedly I could make __iommu_rw() repeatedly call itself instead of
doing cpu_physical_memory_rw(). That's what I actually intended, and it
should handle IOMMU nesting. It's a trivial change to do so.

Note that emulating hardware realistically defeats some performance
purposes. It'd make AIO impossible, if we imagine some sort of message
passing scenario (which would be just like the real thing, but a lot
slower).


	Eduard
Joerg Roedel - July 15, 2010, 9:10 a.m.
On Wed, Jul 14, 2010 at 04:29:18PM -0500, Anthony Liguori wrote:
> On 07/14/2010 03:13 PM, Paul Brook wrote:
>> Well, ok, the function name needs fixing too.  However I think the only thing
>> missing from the current API is that it does not provide a way to determine
>> which device is performing the access.
>
> I agree with Paul.
>
> The right approach IMHO is to convert devices to use bus-specific  
> functions to access memory.  The bus specific functions should have a  
> device argument as the first parameter.

If this means a separate interface for device DMA accesses rather than folding
that functionality into the cpu_physical_memory* interface, I agree too :-)

		Joerg
Joerg Roedel - July 15, 2010, 9:22 a.m.
On Wed, Jul 14, 2010 at 09:13:44PM +0100, Paul Brook wrote:

> A device performs a memory access on its local bus. It has no knowledge of how 
> that access is routed to its destination.  The device should not be aware of 
> any IOMMUs, in the same way that it doesn't know whether it happens to be 
> accessing RAM or memory mapped peripherals on another device.

Right.

> Each IOMMU is fundamentally part of a bus bridge. For example the bridge 
> between a PCI bus and the system bus. It provides an address mapping from one 
> bus to another.

An IOMMU is not necessarily part of a bus bridge. By concept an IOMMU
can also be implemented on a plugin-card translating only that card.
Real implementations that I am aware of always implement the IOMMU in
the PCI root bridge, though.

> There should be no direct interaction between an IOMMU and a device (ignoring 
> ATS, which is effectively a separate data channel).  Everything should be done 
> via the cpu_physical_memory_* code.  Likewise on a system with multiple nested 
> IOMMUs there should be no direct interaction between these. 
> cpu_physical_memory_* should walk the device/bus tree to determine where the 
> access terminates, applying mappings appropriately.

That's the point where I disagree. I think there should be a separate set
of functions independent from cpu_physical_memory_* to handle device
memory accesses. This would keep the changes small and non-intrusive.
Besides that, real memory controllers can also handle cpu memory accesses
differently from device memory accesses. The AMD northbridge GART uses
this to decide whether it needs to remap a request or not. The GART can
be configured to translate cpu and device accesses separately.


		Joerg
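The GART behaviour Joerg describes could be sketched as follows (the field names and the aperture lookup are simplified stand-ins for the real registers and GATT walk): independent enable bits make the remap decision depend on who is asking, which is awkward to express if CPU and device accesses share one code path.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t hwaddr_t;

enum { REQ_CPU, REQ_DEVICE };

/* Hypothetical GART model: separate enables for CPU-originated and
 * device-originated accesses into the remapped aperture. */
struct gart {
    int cpu_enable;
    int dev_enable;
    hwaddr_t aper_base, aper_size;
};

static hwaddr_t gart_resolve(const struct gart *g, int requester,
                             hwaddr_t addr)
{
    int enabled = (requester == REQ_CPU) ? g->cpu_enable : g->dev_enable;

    /* Outside the aperture, or translation disabled for this requester:
     * pass the address through unchanged. */
    if (!enabled || addr < g->aper_base || addr >= g->aper_base + g->aper_size)
        return addr;
    /* Inside the aperture: remap (stand-in for the real GATT lookup). */
    return addr - g->aper_base;
}
```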
Paul Brook - July 15, 2010, 10:28 a.m.
> > The right approach IMHO is to convert devices to use bus-specific
> > functions to access memory.  The bus specific functions should have
> > a device argument as the first parameter.
> 
> As for ATS, the internal api to handle the device's dma request needs
> a notion of a translated vs. an untranslated request.  IOW, if qemu ever
> had a device with ATS support, the device would use its local cache to
> translate the dma address and then submit a translated request to the
> pci bus (effectively doing a raw cpu physical memory* in that case).

Really? Can you provide an documentation to support this claim?
My impression is that there is no difference between translated and 
untranslated devices, and the translation is explicitly disabled by software.

Paul
Paul Brook - July 15, 2010, 10:33 a.m.
> > Depending on how we decide to handle IOMMU invalidation, it may also be
> > necessary to augment the memory_map API to allow the system to request a
> > mapping be revoked.  However this issue is not specific to the IOMMU
> > implementation. Such bugs are already present on any system that allows
> > dynamic reconfiguration of the address space, e.g. by changing PCI BARs.
> 
> That's why the memory_map API today does not allow mappings to persist
> after trips back to the main loop.

Sure it does.  If you can't combine zero-copy memory access with asynchronous 
IO then IMO it's fairly useless. See e.g. dma-helpers.c

Paul
Paul Brook - July 15, 2010, 10:49 a.m.
> On Wed, Jul 14, 2010 at 09:13:44PM +0100, Paul Brook wrote:
> > A device performs a memory access on its local bus. It has no knowledge
> > of how that access is routed to its destination.  The device should not
> > be aware of any IOMMUs, in the same way that it doesn't know whether it
> > happens to be accessing RAM or memory mapped peripherals on another
> > device.
> 
> Right.
> 
> > Each IOMMU is fundamentally part of a bus bridge. For example the bridge
> > between a PCI bus and the system bus. It provides an address mapping from
> > one bus to another.
> 
> An IOMMU is not necessarily part of a bus bridge. By concept an IOMMU
> can also be implemented on a plugin-card translating only that card.
> Real implementations that I am aware of always implement the IOMMU in
> the PCI root bridge, though.

If the IOMMU is implemented on the card, then it isn't an interesting case. 
It's effectively just a complex form of scatter-gather.

If the on-card MMU can delegate pagetable walks to an external device then IMO 
that's also an unrelated feature, and requires an additional data channel.

> > There should be no direct interaction between an IOMMU and a device
> > (ignoring ATS, which is effectively a separate data channel). 
> > Everything should be done via the cpu_physical_memory_* code.  Likewise
> > on a system with multiple nested IOMMUs there should be no direct
> > interaction between these.
> > cpu_physical_memory_* should walk the device/bus tree to determine where
> > the access terminates, applying mappings appropriately.
> 
> That's the point where I disagree. I think there should be a separate set
> of functions independent from cpu_physical_memory_* to handle device
> memory accesses. This would keep the changes small and non-intrusive.
> Besides that, real memory controllers can also handle cpu memory accesses
> differently from device memory accesses. The AMD northbridge GART uses
> this to decide whether it needs to remap a request or not. The GART can
> be configured to translate cpu and device accesses separately.

My point still stands. You should not be pushing the IOMMU handling into 
device specific code. All you need to do is make the memory access routines 
aware of which device caused the access.

The fact that the GART can translate CPU accesses proves my point.  If you 
have separate code for CPU and devices, then you need to duplicate the GART 
handling code.

Paul
Paul Brook - July 15, 2010, 10:58 a.m.
> On Wed, Jul 14, 2010 at 02:53:03PM +0100, Paul Brook wrote:
> > > Memory accesses must go through the IOMMU layer.
> > 
> > No. Devices should not know or care whether an IOMMU is present.
> 
> They don't really care. iommu_get() et al. are convenience functions
> which can and do return NULL when there's no IOMMU and device code can
> pass that NULL around without checking. 

Devices should not need to know any of this. You're introducing a significant 
amount of duplication and complexity into every device.

The assumption that all accesses will go through the same IOMMU is also false. 
Accesses to devices on the same bus will not be translated by the IOMMU. 
Currently there are probably also other things that will break in this case, 
but your API seems fundamentally incapable of handling this.

> I could've probably made the r/w
> functions take only the DeviceState in addition to normal args, but
> wanted to avoid looking up the related structures on each I/O operation.

That's exactly what you should be doing.  If this is inefficient then there 
are much better ways of fixing this. e.g. by not having the device perform so 
many accesses, or by adding some sort of translation cache.
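A translation cache of the kind Paul mentions might look like this minimal direct-mapped IOTLB (an assumed structure for illustration, not an existing QEMU mechanism): repeated accesses to the same page skip the full page-table walk.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t dma_addr_t;

#define PAGE_SHIFT 12
#define TLB_SIZE   16    /* direct-mapped, per device */

typedef struct {
    dma_addr_t vpage;    /* tagged device-virtual page (+1 so 0 = empty) */
    dma_addr_t ppage;
} IOTLBEntry;

static IOTLBEntry iotlb[TLB_SIZE];
static int slow_walks;   /* counts full walks, to demonstrate the hit path */

/* Stand-in for the expensive IOMMU page-table walk. */
static dma_addr_t page_walk(dma_addr_t vpage)
{
    slow_walks++;
    return vpage + 0x100;          /* arbitrary fixed mapping */
}

/* Cached translate: a hit avoids the walk entirely.  A real cache would
 * also need flushing when the guest invalidates IOMMU mappings. */
static dma_addr_t iotlb_translate(dma_addr_t addr)
{
    dma_addr_t vpage = addr >> PAGE_SHIFT;
    IOTLBEntry *e = &iotlb[vpage % TLB_SIZE];

    if (e->vpage != vpage + 1) {   /* miss: walk and fill */
        e->vpage = vpage + 1;
        e->ppage = page_walk(vpage);
    }
    return (e->ppage << PAGE_SHIFT) | (addr & ((1 << PAGE_SHIFT) - 1));
}
```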

> > You should be adding a DeviceState argument to
> > cpu_physical_memory_{rw,map}. This should then handle IOMMU translation
> > transparently.
> > 
> > You also need to accommodate the case where multiple IOMMUs are
> > present.
> 
> We don't assume there's a single IOMMU in the generic layer. The
> callbacks within 'struct iommu' could very well dispatch the request to
> one of multiple, coexisting IOMMUs.

So you've now introduced yet another copy of the translation code. Not only 
does every device have to be IOMMU aware, but every IOMMU also has to be aware 
of nested IOMMUs.

Paul
Anthony Liguori - July 15, 2010, 12:42 p.m.
On 07/15/2010 05:33 AM, Paul Brook wrote:
>>> Depending on how we decide to handle IOMMU invalidation, it may also be
>>> necessary to augment the memory_map API to allow the system to request a
>>> mapping be revoked.  However this issue is not specific to the IOMMU
>>> implementation. Such bugs are already present on any system that allows
>>> dynamic reconfiguration of the address space, e.g. by changing PCI BARs.
>>>        
>> That's why the memory_map API today does not allow mappings to persist
>> after trips back to the main loop.
>>      
> Sure it does.  If you can't combine zero-copy memory access with asynchronous
> IO then IMO it's fairly useless. See e.g. dma-helpers.c
>    

DMA's a very special case.  DMA is performed asynchronously to the 
execution of the CPU so you generally can't make any guarantees about 
what state the transaction is in until it's completed.  That gives us a 
fair bit of wiggle room when dealing with a DMA operation to a region of 
physical memory where the physical memory mapping is altered in some way 
during the transaction.

However, that is not true in the general case.

Regards,

Anthony Liguori

> Paul
>
Anthony Liguori - July 15, 2010, 12:45 p.m.
On 07/15/2010 04:10 AM, Joerg Roedel wrote:
> On Wed, Jul 14, 2010 at 04:29:18PM -0500, Anthony Liguori wrote:
>    
>> On 07/14/2010 03:13 PM, Paul Brook wrote:
>>      
>>> Well, ok, the function name needs fixing too.  However I think the only thing
>>> missing from the current API is that it does not provide a way to determine
>>> which device is performing the access.
>>>        
>> I agree with Paul.
>>
>> The right approach IMHO is to convert devices to use bus-specific
>> functions to access memory.  The bus specific functions should have a
>> device argument as the first parameter.
>>      
> If this means a separate interface for device DMA accesses rather than folding
> that functionality into the cpu_physical_memory* interface, I agree too :-)
>    

No.  PCI devices should never call cpu_physical_memory*.

PCI devices should call pci_memory*.

ISA devices should call isa_memory*.

All device memory accesses should go through their respective buses.  
There can be multiple IOMMUs at different levels of the device 
hierarchy.  If you don't provide bus-level memory access functions that 
chain through the hierarchy, it's extremely difficult to implement all 
the necessary hooks to perform the translations at different places.

Regards,

Anthony Liguori

> 		Joerg
>
>
Paul Brook - July 15, 2010, 2:02 p.m.
> >>> Depending on how we decide to handle IOMMU invalidation, it may also
> >>> be necessary to augment the memory_map API to allow the system to
> >>> request a mapping be revoked.  However this issue is not specific to
> >>> the IOMMU implementation. Such bugs are already present on any system
> >>> that allows dynamic reconfiguration of the address space, e.g. by
> >>> changing PCI BARs.
> >> 
> >> That's why the memory_map API today does not allow mappings to persist
> >> after trips back to the main loop.
> > 
> > Sure it does.  If you can't combine zero-copy memory access with
> > asynchronous IO then IMO it's fairly useless. See e.g. dma-helpers.c
> 
> DMA's a very special case.  

Special compared to what?  The whole purpose of this API is to provide DMA.

> DMA is performed asynchronously to the
> execution of the CPU so you generally can't make any guarantees about
> what state the transaction is in until it's completed.  That gives us a
> fair bit of wiggle room when dealing with a DMA operation to a region of
> physical memory where the physical memory mapping is altered in some way
> during the transaction.

You do have ordering constraints though. While it may not be possible to 
directly determine whether the DMA completed before or after the remapping, 
and you might not be able to make any assumptions about the atomicity of the 
transaction as a whole, it is reasonable to assume that any writes to the old 
mapping will occur before the remapping operation completes.

While things like store buffers potentially allows reordering and deferral of 
accesses, there are generally fairly tight constraints on this. For example a 
PCI host bridge may buffer CPU writes. However it will guarantee that those 
writes have been flushed out before a subsequent read operation completes.

Consider the case where the hypervisor allows passthrough of a device, using 
the IOMMU to support DMA from that device into virtual machine RAM. When that 
virtual machine is destroyed the IOMMU mapping for that device will be 
invalidated. Once the invalidation has completed that RAM can be reused by the 
hypervisor for other purposes. This may happen before the device is reset.  We 
probably don't really care what happens to the device in this case, but we do 
need to prevent the device stomping on ram it no longer owns.

There are two ways this can be handled:

If your address translation mechanism allows updates to be deferred 
indefinitely then we can stall until all relevant DMA transactions have 
completed.  This is probably sufficient for well behaved guests, but 
potentially opens up a significant window for DoS attacks. 

If you need the remapping to occur in a finite timeframe (in the PCI BAR case 
this is probably before the next CPU access to that bus) then you need some 
mechanism for revoking the host mapping provided by cpu_physical_memory_map.
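A revocation mechanism of the kind Paul asks for could be sketched as a map call that also registers an invalidate notifier (names and structure are entirely hypothetical): when the system needs the mapping back, it calls the notifier so the owner can abort in-flight I/O before the host pointer goes away.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef void (*InvalidateNotifier)(void *opaque);

typedef struct {
    void *host_ptr;
    InvalidateNotifier notify;
    void *opaque;
    int live;
} Mapping;

static uint8_t backing[64];
static Mapping mapping;    /* a single outstanding mapping, for brevity */

/* Like cpu_physical_memory_map(), but the caller supplies a callback the
 * system may invoke to revoke the mapping (IOMMU invalidation, BAR
 * reprogramming, ...). */
static void *memory_map_revocable(InvalidateNotifier cb, void *opaque)
{
    mapping = (Mapping){ backing, cb, opaque, 1 };
    return mapping.host_ptr;
}

/* Called by the system when the guest remaps the region: the owner must
 * stop using the pointer; it is dead once this returns. */
static void memory_map_revoke(void)
{
    if (mapping.live) {
        mapping.notify(mapping.opaque);
        mapping.live = 0;
    }
}

/* Example owner: an AIO request that cancels itself on revocation. */
static void aio_cancel_cb(void *opaque)
{
    *(int *)opaque = 1;    /* mark the request as cancelled */
}
```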

Note that a QEMU DMA transaction typically encompasses a whole block of data. 
The transaction is started when the AIO request is issued, and remains live 
until the transfer completes. This includes the time taken to fetch the data 
from external media/devices.

On real hardware a DMA transaction typically only covers a single burst memory 
write (maybe 16 bytes). This will generally not start until the device has 
buffered sufficient data to satisfy the burst (or has sufficient buffer space 
to receive the whole burst).

Paul
Joerg Roedel - July 15, 2010, 2:45 p.m.
On Thu, Jul 15, 2010 at 07:45:06AM -0500, Anthony Liguori wrote:
> On 07/15/2010 04:10 AM, Joerg Roedel wrote:

>> If this means a separate interface for device DMA accesses rather than folding
>> that functionality into the cpu_physical_memory* interface, I agree too :-)
>>
> No.  PCI devices should never call cpu_physical_memory*.

Fully agreed.

> PCI devices should call pci_memory*.
>
> ISA devices should call isa_memory*.

This is a separate interface. I like the idea and, as you stated below, it
has clear advantages, so let's go this way.

> All device memory accesses should go through their respective buses.   
> There can be multiple IOMMUs at different levels of the device  
> hierarchy.  If you don't provide bus-level memory access functions that  
> chain through the hierarchy, it's extremely difficult to implement all  
> the necessary hooks to perform the translations at different places.


		Joerg
Joerg Roedel - July 15, 2010, 2:59 p.m.
On Thu, Jul 15, 2010 at 11:49:20AM +0100, Paul Brook wrote:

> > An IOMMU is not necessarily part of a bus bridge. By concept an IOMMU
> > can also be implemented on a plugin-card translating only that card.
> > Real implementations that I am aware of always implement the IOMMU in
> > the PCI root bridge, though.
> 
> If the IOMMU is implemented on the card, then it isn't an interesting case. 
> It's effectively just a complex form of scatter-gather.
> 
> If the on-card MMU can delegate pagetable walks to an external device then IMO 
> that's also an unrelated feature, and requires an additional data channel.

But that would be handled by the same IOMMU emulation code, so the
hooks need to be usable there too.

> My point still stands. You should not be pushing the IOMMU handling into 
> device specific code. All you need to do is make the memory access routines 
> aware of which device caused the access.

Right, the device does not need to know too much about the IOMMU in the
general case. The iommu_get/iommu_read/iommu_write interface should
replaced by the pci_memory* functions like suggested by Anthony.

> The fact that the GART can translate CPU accesses proves my point.  If you 
> have separate code for CPU and devices, then you need to duplicate the GART 
> handling code.

You can configure the GART to translate device accesses only, cpu
accesses only, or to translate both. This is hard to handle if cpu and
device emulation use the same memory access functions.
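The point can be illustrated with a toy model in which the translate
function checks who issued the access. This is a sketch only; the mode
bits and aperture layout are invented, not AMD's actual GART registers:

```c
#include <assert.h>
#include <stdint.h>

/* Invented mode bits: which requesters the GART translates. */
enum { GART_XLATE_CPU = 1, GART_XLATE_DEV = 2 };
enum req_type { REQ_CPU, REQ_DEVICE };

/* Toy aperture: addresses in [base, base + size) remap to 'target'. */
struct gart {
    unsigned mode;
    uint64_t base, size, target;
};

static uint64_t gart_translate(const struct gart *g, enum req_type who,
                               uint64_t addr)
{
    unsigned need = (who == REQ_CPU) ? GART_XLATE_CPU : GART_XLATE_DEV;

    if (!(g->mode & need)) {
        return addr;            /* this kind of requester is untranslated */
    }
    if (addr < g->base || addr >= g->base + g->size) {
        return addr;            /* outside the aperture */
    }
    return g->target + (addr - g->base);
}
```

A single shared memory access function has no 'who' parameter, which is
exactly why the device-only and cpu-only modes are hard to express there.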


		Joerg
Eduard - Gabriel Munteanu - July 15, 2010, 4:45 p.m.
On Thu, Jul 15, 2010 at 07:45:06AM -0500, Anthony Liguori wrote:
> 
> No.  PCI devices should never call cpu_physical_memory*.
> 
> PCI devices should call pci_memory*.
> 
> ISA devices should call isa_memory*.
> 
> All device memory accesses should go through their respective buses.  
> There can be multiple IOMMUs at different levels of the device 
> hierarchy.  If you don't provide bus-level memory access functions that 
> chain through the hierarchy, it's extremely difficult to implement all 
> the necessary hooks to perform the translations at different places.
> 
> Regards,
> 
> Anthony Liguori
> 

I liked Paul's initial approach more, at least if I understood him
correctly. Basically I'm suggesting a single memory_* function that
simply asks the bus for I/O and translation. Say you have something like
this:

+ Bus 1
|
---- Memory 1
|
---+ Bus 2 bridge
   |
   ---- Memory 2
   |
   ---+ Bus 3 bridge
      |
      ---- Device

Say Device wants to write to memory. If we have the DeviceState, the
device code itself needn't care whether this is a BusOneDevice or a
BusTwoDevice. We would just call

memory_rw(dev_state, addr, buf, size, is_write);

which simply recurses through DeviceState's and BusState's through their
parent pointers. The actual bus can set up those to provide
identification information and perhaps hooks for translation and access
checking. So memory_rw() looks like this (pseudocode):

static void memory_rw(DeviceState *dev,
                      target_phys_addr_t addr,
		      uint8_t *buf,
		      int size,
		      int is_write)
{
	BusState *bus = get_parent_bus_of_dev(dev);
	DeviceState *pdev = get_parent_dev(dev);
	target_phys_addr_t taddr;

	if (!bus) {
		/* This shouldn't happen. */
		assert(0);
	}

	if (bus->responsible_for(addr)) {
		raw_physical_memory_rw(addr, buf, size, is_write);
		return;
	}

	taddr = bus->translate(dev, addr);
	memory_rw(pdev, taddr, buf, size, is_write);
}

If we do this, it seems there's no need to provide separate
functions. The actual buses must instead initialize those hooks
properly. Translation here is something inherent to the bus, which
handles arbitration between possibly multiple IOMMUs. Our memory would
normally reside on / belong to the top-level bus.

What do you think? (Naming could be better though.)


	Eduard
Chris Wright - July 15, 2010, 4:52 p.m.
* Paul Brook (paul@codesourcery.com) wrote:
> > > The right approach IMHO is to convert devices to use bus-specific
> > > functions to access memory.  The bus specific functions should have
> > > a device argument as the first parameter.
> > 
> > As for ATS, the internal api to handle the device's dma request needs
> > a notion of a translated vs. an untranslated request.  IOW, if qemu ever
> > had a device with ATS support, the device would use its local cache to
> > translate the dma address and then submit a translated request to the
> > pci bus (effectively doing a raw cpu physical memory* in that case).
> 
> Really? Can you provide any documentation to support this claim?
> My impression is that there is no difference between translated and 
> untranslated devices, and the translation is explicitly disabled by software.

ATS allows an I/O device to request a translation from the IOMMU.
The device can then cache that translation and use the translated address
in a PCIe memory transaction.  PCIe uses a couple of previously reserved
bits in the transaction layer packet header to describe the address
type for memory transactions.  The default (00) maps to legacy PCIe and
describes the memory address as untranslated.  This is the normal mode,
and could then incur a translation if an IOMMU is present and programmed
w/ page tables, etc. as it passes through the host bridge.

Another type is simply a transaction requesting a translation.  This is
new, and allows a device to request (and cache) a translation from the
IOMMU for subsequent use.

The third type is a memory transaction tagged as already translated.
This is the type of transaction an ATS capable I/O device will generate
when it was able to translate the memory address from its own cache.

Of course, there's also an invalidation request that the IOMMU can send
to ATS capable I/O devices to invalidate the cached translation.
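In TLP terms these three kinds correspond to the 2-bit AT field in the
header. A sketch of how an emulated host bridge might dispatch on it
(the AT encodings follow my reading of the ATS spec; the handler and
its action strings are of course invented):

```c
#include <assert.h>
#include <string.h>

/* PCIe TLP Address Type (AT) field values. */
enum at_field {
    AT_UNTRANSLATED  = 0x0,  /* legacy default; IOMMU translates en route */
    AT_XLATE_REQUEST = 0x1,  /* device asks the IOMMU for a translation */
    AT_TRANSLATED    = 0x2,  /* address already translated from the
                              * device's own cache */
};

/* What an emulated host bridge would do for each type (sketch only). */
static const char *bridge_action(enum at_field at)
{
    switch (at) {
    case AT_UNTRANSLATED:  return "walk IOMMU tables, then access memory";
    case AT_XLATE_REQUEST: return "return a translation for the device to cache";
    case AT_TRANSLATED:    return "skip translation, access memory directly";
    default:               return "reserved";
    }
}
```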

thanks,
-chris
Avi Kivity - July 15, 2010, 5:02 p.m.
On 07/15/2010 07:52 PM, Chris Wright wrote:
>
>> Really? Can you provide any documentation to support this claim?
>> My impression is that there is no difference between translated and
>> untranslated devices, and the translation is explicitly disabled by software.
>>      
> ATS allows an I/O device to request a translation from the IOMMU.
> The device can then cache that translation and use the translated address
> in a PCIe memory transaction.  PCIe uses a couple of previously reserved
> bits in the transaction layer packet header to describe the address
> type for memory transactions.  The default (00) maps to legacy PCIe and
> describes the memory address as untranslated.  This is the normal mode,
> and could then incur a translation if an IOMMU is present and programmed
> w/ page tables, etc. as it passes through the host bridge.
>
> Another type is simply a transaction requesting a translation.  This is
> new, and allows a device to request (and cache) a translation from the
> IOMMU for subsequent use.
>
> The third type is a memory transaction tagged as already translated.
> This is the type of transaction an ATS capable I/O device will generate
> when it was able to translate the memory address from its own cache.
>
> Of course, there's also an invalidation request that the IOMMU can send
> to ATS capable I/O devices to invalidate the cached translation.
>    

For emulated device, it seems like we can ignore ATS completely, no?
Chris Wright - July 15, 2010, 5:14 p.m.
* Chris Wright (chrisw@sous-sol.org) wrote:
> * Paul Brook (paul@codesourcery.com) wrote:
> > > > The right approach IMHO is to convert devices to use bus-specific
> > > > functions to access memory.  The bus specific functions should have
> > > > a device argument as the first parameter.
> > > 
> > > As for ATS, the internal api to handle the device's dma request needs
> > > a notion of a translated vs. an untranslated request.  IOW, if qemu ever
> > > had a device with ATS support, the device would use its local cache to
> > > translate the dma address and then submit a translated request to the
> > > pci bus (effectively doing a raw cpu physical memory* in that case).
> > 
> > Really? Can you provide any documentation to support this claim?

Wow...color me surprised...there's actually some apparently public
"training" docs that might help give a more complete view:

http://www.pcisig.com/developers/main/training_materials/get_document?doc_id=0ab681ba7001e40cdb297ddaf279a8de82a7dc40

ATS discussion starts on slide 23.

> > My impression is that there is no difference between translated and 
> > untranslated devices, and the translation is explicitly disabled by software.

And now that I re-read that sentence, I see what you are talking about.
Yes, there is the above notion as well.

A device can live in a 1:1 mapping of device address space to physical
memory.  This could be achieved in a few ways (all done by the OS software
programming the IOMMU).

One is to simply create a set of page tables that map 1:1 all of device
memory to physical memory.  Another is to somehow mark the device as
special (either omit translation tables or mark the translation entry
as effectively "do not translate").  This is often referred to as Pass
Through mode.  But this is not the same as ATS.

Pass Through mode is the functional equivalent of disabling the
translation/isolation capabilities of the IOMMU.  It's typically used
when an OS wants to keep a device for itself and isn't interested in
the isolation properties of the IOMMU.  It then only creates isolating
translation tables for devices it's giving to unprivileged software
(e.g. Linux/KVM giving a device to a guest, Linux giving a device to
a user space process, etc.).
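Both variants end up at the same result in an emulated IOMMU's lookup
path. A sketch with invented structures (not the real AMD device-table
format, and a single-level page table for brevity):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

#define PT_PAGE_SHIFT 12

/* Invented per-device entry: either flagged "do not translate" or
 * pointing at a page table of target frame numbers. */
struct dev_entry {
    int pass_through;
    const uint64_t *pt;        /* indexed by device-virtual page number */
};

static uint64_t iommu_lookup(const struct dev_entry *e, uint64_t addr)
{
    if (e->pass_through) {
        return addr;           /* equivalent of disabling translation */
    }
    uint64_t vpfn = addr >> PT_PAGE_SHIFT;
    uint64_t off  = addr & ((1u << PT_PAGE_SHIFT) - 1);
    return (e->pt[vpfn] << PT_PAGE_SHIFT) | off;
}
```

An identity page table and the pass-through flag yield the same
addresses; only pass-through skips the table walk, and with it the
walk's cost and permission checks.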

> ATS allows an I/O device to request a translation from the IOMMU.
> The device can then cache that translation and use the translated address
> in a PCIe memory transaction.  PCIe uses a couple of previously reserved
> bits in the transaction layer packet header to describe the address
> type for memory transactions.  The default (00) maps to legacy PCIe and
> describes the memory address as untranslated.  This is the normal mode,
> and could then incur a translation if an IOMMU is present and programmed
> w/ page tables, etc. as it passes through the host bridge.
> 
> Another type is simply a transaction requesting a translation.  This is
> new, and allows a device to request (and cache) a translation from the
> IOMMU for subsequent use.
> 
> The third type is a memory transaction tagged as already translated.
> This is the type of transaction an ATS capable I/O device will generate
> when it was able to translate the memory address from its own cache.
> 
> Of course, there's also an invalidation request that the IOMMU can send
> to ATS capable I/O devices to invalidate the cached translation.

thanks,
-chris
Chris Wright - July 15, 2010, 5:17 p.m.
* Avi Kivity (avi@redhat.com) wrote:
> On 07/15/2010 07:52 PM, Chris Wright wrote:
> >
> >>Really? Can you provide any documentation to support this claim?
> >>My impression is that there is no difference between translated and
> >>untranslated devices, and the translation is explicitly disabled by software.
> >ATS allows an I/O device to request a translation from the IOMMU.
> >The device can then cache that translation and use the translated address
> >in a PCIe memory transaction.  PCIe uses a couple of previously reserved
> >bits in the transaction layer packet header to describe the address
> >type for memory transactions.  The default (00) maps to legacy PCIe and
> >describes the memory address as untranslated.  This is the normal mode,
> >and could then incur a translation if an IOMMU is present and programmed
> >w/ page tables, etc. as it passes through the host bridge.
> >
> >Another type is simply a transaction requesting a translation.  This is
> >new, and allows a device to request (and cache) a translation from the
> >IOMMU for subsequent use.
> >
> >The third type is a memory transaction tagged as already translated.
> >This is the type of transaction an ATS capable I/O device will generate
> >when it was able to translate the memory address from its own cache.
> >
> >Of course, there's also an invalidation request that the IOMMU can send
> >to ATS capable I/O devices to invalidate the cached translation.
> 
> For emulated device, it seems like we can ignore ATS completely, no?

Not if you want to emulate an ATS capable device ;)

Earlier upthread I said:

  IOW, if qemu ever had a device with ATS support...

So, that should've been a much bigger _IF_

thanks,
-chris
Avi Kivity - July 15, 2010, 5:22 p.m.
On 07/15/2010 08:17 PM, Chris Wright wrote:
>
>> For emulated device, it seems like we can ignore ATS completely, no?
>>      
> Not if you want to emulate an ATS capable device ;)
>    

What I meant was that the whole request translation, invalidate, dma 
using a translated address thing is invisible to software.  We can 
emulate an ATS capable device by going through the iommu every time.
Joerg Roedel - July 15, 2010, 5:22 p.m.
On Thu, Jul 15, 2010 at 08:02:05PM +0300, Avi Kivity wrote:

> For emulated device, it seems like we can ignore ATS completely, no?

An important use-case for emulation is software testing, and IOMMU
caching is an important part that needs to be handled in software. For
this purpose it makes sense to emulate the behavior of the caches too.
So we probably should not ignore the possibility of an emulated ATS
device completely.

		Joerg
Chris Wright - July 15, 2010, 5:25 p.m.
* Avi Kivity (avi@redhat.com) wrote:
> On 07/15/2010 08:17 PM, Chris Wright wrote:
> >
> >>For emulated device, it seems like we can ignore ATS completely, no?
> >Not if you want to emulate an ATS capable device ;)
> 
> What I meant was that the whole request translation, invalidate, dma
> using a translated address thing is invisible to software.  We can
> emulate an ATS capable device by going through the iommu every time.

Well, I don't see any reason to completely ignore it.  It'd be really
useful for testing (I'd use it that way).  Especially to verify the
invalidation of the device IOTLBs.

But I think it's not a difficult thing to emulate once we have a proper
api encapsulating a device's dma request.

thanks,
-chris
Eduard - Gabriel Munteanu - July 15, 2010, 5:27 p.m.
On Thu, Jul 15, 2010 at 10:17:10AM -0700, Chris Wright wrote:
> * Avi Kivity (avi@redhat.com) wrote:
> > 
> > For emulated device, it seems like we can ignore ATS completely, no?
> 
> Not if you want to emulate an ATS capable device ;)
> 
> Earlier upthread I said:
> 
>   IOW, if qemu ever had a device with ATS support...
> 
> So, that should've been a much bigger _IF_
> 
> thanks,
> -chris

I think we can augment some devices with ATS capability if there are
performance gains in doing so. This doesn't seem to be a detail the
actual guest OS would be interested in, so we can do it even for devices
that existed long before the AMD IOMMU came into existence. But I'm not
really sure about this, it's just a thought.

Linux seems to be issuing IOTLB invalidation commands anyway.


	Eduard
Anthony Liguori - July 15, 2010, 5:42 p.m.
On 07/15/2010 11:45 AM, Eduard - Gabriel Munteanu wrote:
> On Thu, Jul 15, 2010 at 07:45:06AM -0500, Anthony Liguori wrote:
>    
>> No.  PCI devices should never call cpu_physical_memory*.
>>
>> PCI devices should call pci_memory*.
>>
>> ISA devices should call isa_memory*.
>>
>> All device memory accesses should go through their respective buses.
>> There can be multiple IOMMUs at different levels of the device
>> hierarchy.  If you don't provide bus-level memory access functions that
>> chain through the hierarchy, it's extremely difficult to implement all
>> the necessary hooks to perform the translations at different places.
>>
>> Regards,
>>
>> Anthony Liguori
>>
>>      
> I liked Paul's initial approach more, at least if I understood him
> correctly. Basically I'm suggesting a single memory_* function that
> simply asks the bus for I/O and translation. Say you have something like
> this:
>
> + Bus 1
> |
> ---- Memory 1
> |
> ---+ Bus 2 bridge
>     |
>     ---- Memory 2
>     |
>     ---+ Bus 3 bridge
>        |
>        ---- Device
>
> Say Device wants to write to memory. If we have the DeviceState, the
> device code itself needn't care whether this is a BusOneDevice or a
> BusTwoDevice. We would just call
>
> memory_rw(dev_state, addr, buf, size, is_write);
>    

I dislike this API for a few reasons:

1) Buses have different types of addresses with different address 
ranges.  This API would have to take a generic address type.
2) dev_state would be the qdev device state.  This means qdev needs to 
have memory hook mechanisms that are chainable.  I think that's 
unnecessary at the qdev level.
3) Users have upcasted device states, so it's more natural to pass 
PCIDevice than DeviceState.
4) There's an assumption that all devices can get to DeviceState.  
That's not always true today.

> which simply recurses through DeviceState's and BusState's through their
> parent pointers. The actual bus can set up those to provide
> identification information and perhaps hooks for translation and access
> checking. So memory_rw() looks like this (pseudocode):
>
> static void memory_rw(DeviceState *dev,
>                        target_phys_addr_t addr,
> 		      uint8_t *buf,
> 		      int size,
> 		      int is_write)
> {
> 	BusState *bus = get_parent_bus_of_dev(dev);
> 	DeviceState *pdev = get_parent_dev(dev);
> 	target_phys_addr_t taddr;
>
> 	if (!bus) {
> 		/* This shouldn't happen. */
> 		assert(0);
> 	}
>
> 	if (bus->responsible_for(addr)) {
> 		raw_physical_memory_rw(addr, buf, size, is_write);
> 		return;
> 	}
>
> 	taddr = bus->translate(dev, addr);
> 	memory_rw(pdev, taddr, buf, size, is_write);
>    

This is too simplistic because you sometimes have layering that doesn't 
fit into the bus model.  For instance, virtio + pci.

We really want a virtio_memory_rw that calls either syborg_memory_rw or 
pci_memory_rw based on the transport.  In your proposal, we would have 
to model virtio-pci as a bus with a single device which appears awkward 
to me.
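For comparison, the transport split can be expressed without any bus
object at all: just a table of function pointers that the binding fills
in. Everything below is an invented sketch, not the actual virtio code:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Invented transport ops: the virtio core only ever calls through
 * these, and each binding (pci, syborg, ...) fills them in. */
typedef struct VirtIOTransport {
    void (*memory_rw)(void *opaque, uint64_t addr,
                      uint8_t *buf, size_t len, int is_write);
    void *opaque;              /* the binding's device state */
} VirtIOTransport;

static void virtio_memory_rw(VirtIOTransport *t, uint64_t addr,
                             uint8_t *buf, size_t len, int is_write)
{
    t->memory_rw(t->opaque, addr, buf, len, is_write);
}

/* Toy "pci" binding; a real one would instead call pci_memory_rw()
 * with its PCIDevice, so IOMMU translation still happens underneath. */
static uint8_t fake_ram[256];

static void toy_pci_rw(void *opaque, uint64_t addr,
                       uint8_t *buf, size_t len, int is_write)
{
    (void)opaque;
    if (is_write) {
        memcpy(fake_ram + addr, buf, len);
    } else {
        memcpy(buf, fake_ram + addr, len);
    }
}
```

The virtio core never sees a bus; which memory access path is used is
decided once, when the binding installs its ops.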

Regards,

Anthony Liguori

> }
>
> If we do this, it seems there's no need to provide separate
> functions. The actual buses must instead initialize those hooks
> properly. Translation here is something inherent to the bus, that
> handles arbitration between possibly multiple IOMMUs. Our memory would
> normally reside on / belong to the top-level bus.
>
> What do you think? (Naming could be better though.)
>
>
> 	Eduard
>
>

Patch

diff --git a/hw/ide/core.c b/hw/ide/core.c
index 0b3b7c2..7f8f7df 100644
--- a/hw/ide/core.c
+++ b/hw/ide/core.c
@@ -26,6 +26,7 @@ 
 #include <hw/pc.h>
 #include <hw/pci.h>
 #include <hw/scsi.h>
+#include <hw/iommu.h>
 #include "qemu-timer.h"
 #include "sysemu.h"
 #include "dma.h"
@@ -433,7 +434,12 @@  static int dma_buf_prepare(BMDMAState *bm, int is_write)
         uint32_t addr;
         uint32_t size;
     } prd;
-    int l, len;
+    int l, len, err, io_len;
+    struct iommu *iommu;
+    DeviceState *dev;
+    target_phys_addr_t io_addr;
+
+    iommu = iommu_get(s->bus->qbus.parent, &dev);
 
     qemu_sglist_init(&s->sg, s->nsector / (IDE_PAGE_SIZE / 512) + 1);
     s->io_buffer_size = 0;
@@ -443,7 +449,7 @@  static int dma_buf_prepare(BMDMAState *bm, int is_write)
             if (bm->cur_prd_last ||
                 (bm->cur_addr - bm->addr) >= IDE_PAGE_SIZE)
                 return s->io_buffer_size != 0;
-            cpu_physical_memory_read(bm->cur_addr, (uint8_t *)&prd, 8);
+            err = iommu_read(iommu, dev, bm->cur_addr, (uint8_t *)&prd, 8);
             bm->cur_addr += 8;
             prd.addr = le32_to_cpu(prd.addr);
             prd.size = le32_to_cpu(prd.size);
@@ -455,11 +461,22 @@  static int dma_buf_prepare(BMDMAState *bm, int is_write)
             bm->cur_prd_last = (prd.size & 0x80000000);
         }
         l = bm->cur_prd_len;
-        if (l > 0) {
-            qemu_sglist_add(&s->sg, bm->cur_prd_addr, l);
-            bm->cur_prd_addr += l;
-            bm->cur_prd_len -= l;
-            s->io_buffer_size += l;
+        while (l > 0) {
+            /*
+             * In case translation / access checking fails no
+             * transfer happens but we pretend it went through.
+             */
+            err = iommu_translate(iommu, dev, bm->cur_prd_addr,
+                                  &io_addr, &io_len, !is_write);
+            if (!err) {
+                if (io_len > l)
+                    io_len = l;
+                qemu_sglist_add(&s->sg, io_addr, io_len);
+            }
+            bm->cur_prd_addr += io_len;
+            bm->cur_prd_len -= io_len;
+            s->io_buffer_size += io_len;
+            l -= io_len;
         }
     }
     return 1;
@@ -516,6 +533,10 @@  static int dma_buf_rw(BMDMAState *bm, int is_write)
         uint32_t size;
     } prd;
     int l, len;
+    struct iommu *iommu;
+    DeviceState *dev;
+
+    iommu = iommu_get(s->bus->qbus.parent, &dev);
 
     for(;;) {
         l = s->io_buffer_size - s->io_buffer_index;
@@ -526,7 +547,7 @@  static int dma_buf_rw(BMDMAState *bm, int is_write)
             if (bm->cur_prd_last ||
                 (bm->cur_addr - bm->addr) >= IDE_PAGE_SIZE)
                 return 0;
-            cpu_physical_memory_read(bm->cur_addr, (uint8_t *)&prd, 8);
+            iommu_read(iommu, dev, bm->cur_addr, (uint8_t *)&prd, 8);
             bm->cur_addr += 8;
             prd.addr = le32_to_cpu(prd.addr);
             prd.size = le32_to_cpu(prd.size);
@@ -540,13 +561,8 @@  static int dma_buf_rw(BMDMAState *bm, int is_write)
         if (l > bm->cur_prd_len)
             l = bm->cur_prd_len;
         if (l > 0) {
-            if (is_write) {
-                cpu_physical_memory_write(bm->cur_prd_addr,
-                                          s->io_buffer + s->io_buffer_index, l);
-            } else {
-                cpu_physical_memory_read(bm->cur_prd_addr,
-                                          s->io_buffer + s->io_buffer_index, l);
-            }
+            iommu_rw(iommu, dev, bm->cur_prd_addr,
+                     s->io_buffer + s->io_buffer_index, l, is_write);
             bm->cur_prd_addr += l;
             bm->cur_prd_len -= l;
             s->io_buffer_index += l;