Message ID: 1279086307-9596-5-git-send-email-eduard.munteanu@linux360.ro
State: New
> Memory accesses must go through the IOMMU layer.
No. Devices should not know or care whether an IOMMU is present.
You should be adding a DeviceState argument to cpu_physical_memory_{rw,map}.
This should then handle IOMMU translation transparently.
You also need to accommodate the case where multiple IOMMUs are present.
Paul
On Wed, Jul 14, 2010 at 02:53:03PM +0100, Paul Brook wrote:

> > Memory accesses must go through the IOMMU layer.
>
> No. Devices should not know or care whether an IOMMU is present.

There are real devices that care very much about an IOMMU. Basically all devices supporting ATS care about that. So I don't see a problem if the device emulation code of qemu also cares about present IOMMUs.

> You should be adding a DeviceState argument to cpu_physical_memory_{rw,map}.
> This should then handle IOMMU translation transparently.

That's not a good idea imho. With an IOMMU the device no longer accesses cpu physical memory. It accesses device virtual memory. Using cpu_physical_memory* functions in device code becomes misleading when the device virtual address space differs from cpu physical. So different functions for devices make a lot of sense here. Another reason for separate functions is that we can extend them later to support emulation of ATS devices.

> You also need to accommodate the case where multiple IOMMUs are present.

This, indeed, is something transparent to the device. This should be handled inside the iommu emulation code.

Joerg
> On Wed, Jul 14, 2010 at 02:53:03PM +0100, Paul Brook wrote:
> > > Memory accesses must go through the IOMMU layer.
> >
> > No. Devices should not know or care whether an IOMMU is present.
>
> There are real devices that care very much about an IOMMU. Basically all
> devices supporting ATS care about that. So I don't see a problem if the
> device emulation code of qemu also cares about present IOMMUs.
>
> > You should be adding a DeviceState argument to
> > cpu_physical_memory_{rw,map}. This should then handle IOMMU translation
> > transparently.
>
> That's not a good idea imho. With an IOMMU the device no longer accesses
> cpu physical memory. It accesses device virtual memory. Using
> cpu_physical_memory* functions in device code becomes misleading when
> the device virtual address space differs from cpu physical.

Well, ok, the function name needs fixing too. However I think the only thing missing from the current API is that it does not provide a way to determine which device is performing the access.

Depending on how we decide to handle IOMMU invalidation, it may also be necessary to augment the memory_map API to allow the system to request that a mapping be revoked. However this issue is not specific to the IOMMU implementation. Such bugs are already present on any system that allows dynamic reconfiguration of the address space, e.g. by changing PCI BARs.

> So different functions for devices make a lot of sense here. Another
> reason for separate functions is that we can extend them later to
> support emulation of ATS devices.

I disagree. ATS should be an independent feature, and is inherently bus specific. As usual the PCI spec is not publicly available, but based on the AMD IOMMU docs I'd say that ATS is completely independent of memory accesses, the convention being that you trust an ATS-capable device to DTRT, and configure the bus IOMMU to apply a flat mapping for accesses from such devices.

> > You also need to accommodate the case where multiple IOMMUs are
> > present.
>
> This, indeed, is something transparent to the device. This should be
> handled inside the iommu emulation code.

I think you've got the abstraction boundaries all wrong.

A device performs a memory access on its local bus. It has no knowledge of how that access is routed to its destination. The device should not be aware of any IOMMUs, in the same way that it doesn't know whether it happens to be accessing RAM or memory mapped peripherals on another device.

Each IOMMU is fundamentally part of a bus bridge, for example the bridge between a PCI bus and the system bus. It provides an address mapping from one bus to another.

There should be no direct interaction between an IOMMU and a device (ignoring ATS, which is effectively a separate data channel). Everything should be done via the cpu_physical_memory_* code. Likewise on a system with multiple nested IOMMUs there should be no direct interaction between these. cpu_physical_memory_* should walk the device/bus tree to determine where the access terminates, applying mappings appropriately.

Paul
On 07/14/2010 03:13 PM, Paul Brook wrote:

> > On Wed, Jul 14, 2010 at 02:53:03PM +0100, Paul Brook wrote:
> > > > Memory accesses must go through the IOMMU layer.
> > >
> > > No. Devices should not know or care whether an IOMMU is present.
> >
> > There are real devices that care very much about an IOMMU. Basically
> > all devices supporting ATS care about that. So I don't see a problem
> > if the device emulation code of qemu also cares about present IOMMUs.
> >
> > > You should be adding a DeviceState argument to
> > > cpu_physical_memory_{rw,map}. This should then handle IOMMU
> > > translation transparently.
> >
> > That's not a good idea imho. With an IOMMU the device no longer
> > accesses cpu physical memory. It accesses device virtual memory. Using
> > cpu_physical_memory* functions in device code becomes misleading when
> > the device virtual address space differs from cpu physical.
>
> Well, ok, the function name needs fixing too. However I think the only
> thing missing from the current API is that it does not provide a way to
> determine which device is performing the access.

I agree with Paul.

The right approach IMHO is to convert devices to use bus-specific functions to access memory. The bus-specific functions should have a device argument as the first parameter. For PCI-based IOMMUs, the implementation exists solely within the PCI bus. For platforms (like SPARC) that have lower level IOMMUs, we would probably need to introduce a sysbus memory access layer and then provide a hook to implement an IOMMU there.

> Depending on how we decide to handle IOMMU invalidation, it may also be
> necessary to augment the memory_map API to allow the system to request
> that a mapping be revoked. However this issue is not specific to the
> IOMMU implementation. Such bugs are already present on any system that
> allows dynamic reconfiguration of the address space, e.g. by changing
> PCI BARs.

That's why the memory_map API today does not allow mappings to persist after trips back to the main loop.

Regards,

Anthony Liguori
* Anthony Liguori (anthony@codemonkey.ws) wrote:
> On 07/14/2010 03:13 PM, Paul Brook wrote:
> > Well, ok, the function name needs fixing too. However I think the only
> > thing missing from the current API is that it does not provide a way
> > to determine which device is performing the access.
>
> I agree with Paul.

I do too.

> The right approach IMHO is to convert devices to use bus-specific
> functions to access memory. The bus specific functions should have
> a device argument as the first parameter.

As for ATS, the internal api to handle the device's dma request needs a notion of a translated vs. an untranslated request. IOW, if qemu ever had a device with ATS support, the device would use its local cache to translate the dma address and then submit a translated request to the pci bus (effectively doing a raw cpu_physical_memory* access in that case).

thanks,
-chris
On Wed, Jul 14, 2010 at 02:53:03PM +0100, Paul Brook wrote:

> > Memory accesses must go through the IOMMU layer.
>
> No. Devices should not know or care whether an IOMMU is present.

They don't really care. iommu_get() et al. are convenience functions which can and do return NULL when there's no IOMMU, and device code can pass that NULL around without checking. I could've probably made the r/w functions take only the DeviceState in addition to the normal args, but wanted to avoid looking up the related structures on each I/O operation.

> You should be adding a DeviceState argument to cpu_physical_memory_{rw,map}.
> This should then handle IOMMU translation transparently.
>
> You also need to accommodate the case where multiple IOMMUs are present.
>
> Paul

We don't assume there's a single IOMMU in the generic layer. The callbacks within 'struct iommu' could very well dispatch the request to one of multiple, coexisting IOMMUs.

Eduard
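As a concrete illustration of the NULL-tolerant convenience pattern Eduard describes, a minimal sketch might look like the following. All names and the trivial "translation" are invented for the example; only the pattern (a NULL IOMMU degrades to a raw physical access) is from the thread, and this is not the actual patch API.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct Iommu Iommu;
struct Iommu {
    /* Translate a device-virtual address to a physical one. */
    uint64_t (*translate)(Iommu *iommu, uint64_t addr);
};

static uint8_t fake_ram[64];   /* stand-in for guest physical memory */

static void raw_physical_rw(uint64_t addr, uint8_t *buf, size_t len,
                            int is_write)
{
    if (is_write) {
        memcpy(fake_ram + addr, buf, len);
    } else {
        memcpy(buf, fake_ram + addr, len);
    }
}

/* Device code calls this with whatever iommu_get() returned -- possibly
 * NULL -- without having to check. */
static void iommu_rw(Iommu *iommu, uint64_t addr, uint8_t *buf,
                     size_t len, int is_write)
{
    if (iommu != NULL) {
        addr = iommu->translate(iommu, addr);
    }
    raw_physical_rw(addr, buf, len, is_write);
}

/* Round-trip one byte with no IOMMU present; returns the byte read back. */
static int iommu_rw_demo(void)
{
    uint8_t out = 0x5A, in = 0;
    iommu_rw(NULL, 8, &out, 1, 1);
    iommu_rw(NULL, 8, &in, 1, 0);
    return in;
}
```

The point of contention in the thread is exactly this shape: the device holds an IOMMU handle at all, rather than the bus resolving translation behind a device-blind API.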
On Wed, Jul 14, 2010 at 09:13:44PM +0100, Paul Brook wrote:

> Well, ok, the function name needs fixing too. However I think the only
> thing missing from the current API is that it does not provide a way to
> determine which device is performing the access.
>
> Depending on how we decide to handle IOMMU invalidation, it may also be
> necessary to augment the memory_map API to allow the system to request
> that a mapping be revoked. However this issue is not specific to the
> IOMMU implementation. Such bugs are already present on any system that
> allows dynamic reconfiguration of the address space, e.g. by changing
> PCI BARs.

Yeah, having a way to alter existing maps would be good. Basically it's the only place where we truly care about the existence of an IOMMU, and that's due to AIO. But it's really tricky to do, unfortunately. I also think having an abort_doing_io_on_map() kind of notifier would be sufficient. It should notify back when I/O has been completely stopped. This should be enough, since we don't expect the results of IDE-like DMA to be recoverable in case a mapping change occurs, even on real hardware.

> I disagree. ATS should be an independent feature, and is inherently bus
> specific. As usual the PCI spec is not publicly available, but based on
> the AMD IOMMU docs I'd say that ATS is completely independent of memory
> accesses, the convention being that you trust an ATS-capable device to
> DTRT, and configure the bus IOMMU to apply a flat mapping for accesses
> from such devices.

ATS is documented in the HyperTransport specs, which are publicly available.

[snip]

> A device performs a memory access on its local bus. It has no knowledge
> of how that access is routed to its destination. The device should not
> be aware of any IOMMUs, in the same way that it doesn't know whether it
> happens to be accessing RAM or memory mapped peripherals on another
> device.
>
> Each IOMMU is fundamentally part of a bus bridge, for example the bridge
> between a PCI bus and the system bus. It provides an address mapping
> from one bus to another.
>
> There should be no direct interaction between an IOMMU and a device
> (ignoring ATS, which is effectively a separate data channel). Everything
> should be done via the cpu_physical_memory_* code. Likewise on a system
> with multiple nested IOMMUs there should be no direct interaction
> between these. cpu_physical_memory_* should walk the device/bus tree to
> determine where the access terminates, applying mappings appropriately.
>
> Paul

Admittedly I could make __iommu_rw() repeatedly call itself instead of doing cpu_physical_memory_rw(). That's what I actually intended, and it should handle IOMMU nesting. It's a trivial change to do so.

Note that emulating hardware realistically defeats some performance purposes. It'd make AIO impossible, if we imagine some sort of message-passing scenario (which would be just like the real thing, but a lot slower).

Eduard
On Wed, Jul 14, 2010 at 04:29:18PM -0500, Anthony Liguori wrote:

> On 07/14/2010 03:13 PM, Paul Brook wrote:
> > Well, ok, the function name needs fixing too. However I think the only
> > thing missing from the current API is that it does not provide a way
> > to determine which device is performing the access.
>
> I agree with Paul.
>
> The right approach IMHO is to convert devices to use bus-specific
> functions to access memory. The bus specific functions should have a
> device argument as the first parameter.

If this means a separate interface for device dma accesses, and not folding that functionality into the cpu_physical_memory* interface, I agree too :-)

Joerg
On Wed, Jul 14, 2010 at 09:13:44PM +0100, Paul Brook wrote:

> A device performs a memory access on its local bus. It has no knowledge
> of how that access is routed to its destination. The device should not
> be aware of any IOMMUs, in the same way that it doesn't know whether it
> happens to be accessing RAM or memory mapped peripherals on another
> device.

Right.

> Each IOMMU is fundamentally part of a bus bridge. For example the bridge
> between a PCI bus and the system bus. It provides an address mapping
> from one bus to another.

An IOMMU is not necessarily part of a bus bridge. Conceptually, an IOMMU can also be implemented on a plugin card, translating only for that card. Real implementations that I am aware of always implement the IOMMU in the PCI root bridge, though.

> There should be no direct interaction between an IOMMU and a device
> (ignoring ATS, which is effectively a separate data channel). Everything
> should be done via the cpu_physical_memory_* code. Likewise on a system
> with multiple nested IOMMUs there should be no direct interaction
> between these. cpu_physical_memory_* should walk the device/bus tree to
> determine where the access terminates, applying mappings appropriately.

That's the point where I disagree. I think there should be a separate set of functions, independent from cpu_physical_memory_*, to handle device memory accesses. This would keep the changes small and non-intrusive. Besides that, real memory controllers can also handle cpu memory accesses differently from device memory accesses. The AMD northbridge GART uses this to decide whether it needs to remap a request or not. The GART can be configured to translate cpu and device accesses separately.

Joerg
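Joerg's GART point can be made concrete with a toy translation routine: whether an address in the aperture gets remapped depends on who issued the access and which enable bits are set. The aperture and target addresses, flag names, and the identity fallback are all invented for illustration, not the real GART programming model.

```c
#include <stdint.h>

enum AccessSource { ACCESS_FROM_CPU, ACCESS_FROM_DEVICE };

enum {
    GART_TRANSLATE_CPU = 1 << 0,   /* remap CPU accesses */
    GART_TRANSLATE_DEV = 1 << 1,   /* remap device accesses */
};

/* Identity-map everything except a small "aperture", which is remapped
 * only when the enable bit matching the access source is set. */
static uint64_t gart_translate(uint64_t addr, enum AccessSource src,
                               unsigned flags)
{
    const uint64_t aperture_base = 0x10000, aperture_size = 0x1000;
    const uint64_t target_base = 0x80000;
    int enabled = (src == ACCESS_FROM_CPU) ? (flags & GART_TRANSLATE_CPU)
                                           : (flags & GART_TRANSLATE_DEV);

    if (enabled && addr >= aperture_base &&
        addr < aperture_base + aperture_size) {
        return target_base + (addr - aperture_base);
    }
    return addr;
}
```

With a single shared access path for CPU and devices, the `src` argument above is exactly the information that would be lost, which is why Joerg argues for separate entry points.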
> > The right approach IMHO is to convert devices to use bus-specific
> > functions to access memory. The bus specific functions should have
> > a device argument as the first parameter.
>
> As for ATS, the internal api to handle the device's dma request needs
> a notion of a translated vs. an untranslated request. IOW, if qemu ever
> had a device with ATS support, the device would use its local cache to
> translate the dma address and then submit a translated request to the
> pci bus (effectively doing a raw cpu_physical_memory* access in that
> case).

Really? Can you provide any documentation to support this claim? My impression is that there is no difference between translated and untranslated devices, and that the translation is explicitly disabled by software.

Paul
> > Depending on how we decide to handle IOMMU invalidation, it may also
> > be necessary to augment the memory_map API to allow the system to
> > request that a mapping be revoked. However this issue is not specific
> > to the IOMMU implementation. Such bugs are already present on any
> > system that allows dynamic reconfiguration of the address space, e.g.
> > by changing PCI BARs.
>
> That's why the memory_map API today does not allow mappings to persist
> after trips back to the main loop.

Sure it does. If you can't combine zero-copy memory access with asynchronous IO then IMO it's fairly useless. See e.g. dma-helpers.c.

Paul
> On Wed, Jul 14, 2010 at 09:13:44PM +0100, Paul Brook wrote:
> > A device performs a memory access on its local bus. It has no
> > knowledge of how that access is routed to its destination. The device
> > should not be aware of any IOMMUs, in the same way that it doesn't
> > know whether it happens to be accessing RAM or memory mapped
> > peripherals on another device.
>
> Right.
>
> > Each IOMMU is fundamentally part of a bus bridge. For example the
> > bridge between a PCI bus and the system bus. It provides an address
> > mapping from one bus to another.
>
> An IOMMU is not necessarily part of a bus bridge. Conceptually, an IOMMU
> can also be implemented on a plugin card, translating only for that
> card. Real implementations that I am aware of always implement the
> IOMMU in the PCI root bridge, though.

If the IOMMU is implemented on the card, then it isn't an interesting case. It's effectively just a complex form of scatter-gather.

If the on-card MMU can delegate pagetable walks to an external device then IMO that's also an unrelated feature, and requires an additional data channel.

> > There should be no direct interaction between an IOMMU and a device
> > (ignoring ATS, which is effectively a separate data channel).
> > Everything should be done via the cpu_physical_memory_* code. Likewise
> > on a system with multiple nested IOMMUs there should be no direct
> > interaction between these. cpu_physical_memory_* should walk the
> > device/bus tree to determine where the access terminates, applying
> > mappings appropriately.
>
> That's the point where I disagree. I think there should be a separate
> set of functions, independent from cpu_physical_memory_*, to handle
> device memory accesses. This would keep the changes small and
> non-intrusive. Besides that, real memory controllers can also handle
> cpu memory accesses differently from device memory accesses. The AMD
> northbridge GART uses this to decide whether it needs to remap a
> request or not. The GART can be configured to translate cpu and device
> accesses separately.

My point still stands. You should not be pushing the IOMMU handling into device-specific code. All you need to do is make the memory access routines aware of which device caused the access.

The fact that the GART can translate CPU accesses proves my point. If you have separate code for CPU and devices, then you need to duplicate the GART handling code.

Paul
> On Wed, Jul 14, 2010 at 02:53:03PM +0100, Paul Brook wrote:
> > > Memory accesses must go through the IOMMU layer.
> >
> > No. Devices should not know or care whether an IOMMU is present.
>
> They don't really care. iommu_get() et al. are convenience functions
> which can and do return NULL when there's no IOMMU, and device code can
> pass that NULL around without checking.

Devices should not need to know any of this. You're introducing a significant amount of duplication and complexity into every device.

The assumption that all accesses will go through the same IOMMU is also false. Accesses to devices on the same bus will not be translated by the IOMMU. Currently there are probably also other things that will break in this case, but your API seems fundamentally incapable of handling this.

> I could've probably made the r/w functions take only the DeviceState in
> addition to normal args, but wanted to avoid looking up the related
> structures on each I/O operation.

That's exactly what you should be doing. If this is inefficient then there are much better ways of fixing it, e.g. by not having the device perform so many accesses, or by adding some sort of translation cache.

> > You should be adding a DeviceState argument to
> > cpu_physical_memory_{rw,map}. This should then handle IOMMU
> > translation transparently.
> >
> > You also need to accommodate the case where multiple IOMMUs are
> > present.
>
> We don't assume there's a single IOMMU in the generic layer. The
> callbacks within 'struct iommu' could very well dispatch the request to
> one of multiple, coexisting IOMMUs.

So you've now introduced yet another copy of the translation code. Not only does every device have to be IOMMU-aware, but every IOMMU also has to be aware of nested IOMMUs.

Paul
On 07/15/2010 05:33 AM, Paul Brook wrote:

> > > Depending on how we decide to handle IOMMU invalidation, it may also
> > > be necessary to augment the memory_map API to allow the system to
> > > request that a mapping be revoked. However this issue is not
> > > specific to the IOMMU implementation. Such bugs are already present
> > > on any system that allows dynamic reconfiguration of the address
> > > space, e.g. by changing PCI BARs.
> >
> > That's why the memory_map API today does not allow mappings to persist
> > after trips back to the main loop.
>
> Sure it does. If you can't combine zero-copy memory access with
> asynchronous IO then IMO it's fairly useless. See e.g. dma-helpers.c.

DMA's a very special case. DMA is performed asynchronously to the execution of the CPU, so you generally can't make any guarantees about what state the transaction is in until it's completed. That gives us a fair bit of wiggle room when dealing with a DMA operation to a region of physical memory where the physical memory mapping is altered in some way during the transaction. However, that is not true in the general case.

Regards,

Anthony Liguori
On 07/15/2010 04:10 AM, Joerg Roedel wrote:

> On Wed, Jul 14, 2010 at 04:29:18PM -0500, Anthony Liguori wrote:
> > The right approach IMHO is to convert devices to use bus-specific
> > functions to access memory. The bus specific functions should have a
> > device argument as the first parameter.
>
> If this means a separate interface for device dma accesses, and not
> folding that functionality into the cpu_physical_memory* interface, I
> agree too :-)

No. PCI devices should never call cpu_physical_memory*.

PCI devices should call pci_memory*.

ISA devices should call isa_memory*.

All device memory accesses should go through their respective buses. There can be multiple IOMMUs at different levels of the device hierarchy. If you don't provide bus-level memory access functions that chain through the hierarchy, it's extremely difficult to implement all the necessary hooks to perform the translations at different places.

Regards,

Anthony Liguori
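The chaining Anthony describes could be sketched as follows. Each bus level applies its own translation (its IOMMU, if it has one) and then defers to its parent bus, so a hook can sit at any level of the hierarchy. The struct layout, function names, and the fixed-offset "IOMMU" are assumptions for the example, not QEMU API.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct Bus Bus;
struct Bus {
    Bus *parent;                                 /* NULL at the system bus */
    uint64_t (*translate)(Bus *bus, uint64_t addr);  /* NULL if no IOMMU */
};

/* Apply each level's mapping on the way up to the system bus. */
static uint64_t bus_translate_up(Bus *bus, uint64_t addr)
{
    while (bus != NULL) {
        if (bus->translate) {
            addr = bus->translate(bus, addr);
        }
        bus = bus->parent;
    }
    return addr;   /* now a system-bus (physical) address */
}

/* Toy IOMMU on the PCI root bridge: remap by a fixed offset. */
static uint64_t pci_iommu_translate(Bus *bus, uint64_t addr)
{
    (void)bus;
    return addr + 0x1000;
}

/* What a pci_memory_rw(dev, ...) would do first: resolve the address
 * through the bus hierarchy before touching memory. */
static uint64_t pci_demo(void)
{
    Bus system_bus = { NULL, NULL };
    Bus pci_bus = { &system_bus, pci_iommu_translate };
    return bus_translate_up(&pci_bus, 0x20);
}
```

A nested IOMMU is then just another bus level with its own `translate` hook; no device or IOMMU needs to know about the others.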
> > > > Depending on how we decide to handle IOMMU invalidation, it may
> > > > also be necessary to augment the memory_map API to allow the
> > > > system to request that a mapping be revoked. However this issue is
> > > > not specific to the IOMMU implementation. Such bugs are already
> > > > present on any system that allows dynamic reconfiguration of the
> > > > address space, e.g. by changing PCI BARs.
> > >
> > > That's why the memory_map API today does not allow mappings to
> > > persist after trips back to the main loop.
> >
> > Sure it does. If you can't combine zero-copy memory access with
> > asynchronous IO then IMO it's fairly useless. See e.g. dma-helpers.c.
>
> DMA's a very special case.

Special compared to what? The whole purpose of this API is to provide DMA.

> DMA is performed asynchronously to the execution of the CPU, so you
> generally can't make any guarantees about what state the transaction is
> in until it's completed. That gives us a fair bit of wiggle room when
> dealing with a DMA operation to a region of physical memory where the
> physical memory mapping is altered in some way during the transaction.

You do have ordering constraints though. While it may not be possible to directly determine whether the DMA completed before or after the remapping, and you might not be able to make any assumptions about the atomicity of the transaction as a whole, it is reasonable to assume that any writes to the old mapping will occur before the remapping operation completes. While things like store buffers potentially allow reordering and deferral of accesses, there are generally fairly tight constraints on this. For example, a PCI host bridge may buffer CPU writes. However, it will guarantee that those writes have been flushed out before a subsequent read operation completes.

Consider the case where the hypervisor allows passthrough of a device, using the IOMMU to support DMA from that device into virtual machine RAM. When that virtual machine is destroyed, the IOMMU mapping for that device will be invalidated. Once the invalidation has completed, that RAM can be reused by the hypervisor for other purposes. This may happen before the device is reset. We probably don't really care what happens to the device in this case, but we do need to prevent the device stomping on RAM it no longer owns.

There are two ways this can be handled:

If your address translation mechanism allows updates to be deferred indefinitely, then we can stall until all relevant DMA transactions have completed. This is probably sufficient for well-behaved guests, but potentially opens up a significant window for DoS attacks.

If you need the remapping to occur in a finite timeframe (in the PCI BAR case this is probably before the next CPU access to that bus), then you need some mechanism for revoking the host mapping provided by cpu_physical_memory_map.

Note that a QEMU DMA transaction typically encompasses a whole block of data. The transaction is started when the AIO request is issued, and remains live until the transfer completes. This includes the time taken to fetch the data from external media/devices. On real hardware a DMA transaction typically only covers a single burst memory write (maybe 16 bytes). This will generally not start until the device has buffered sufficient data to satisfy the burst (or has sufficient buffer space to receive the whole burst).

Paul
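The revocation mechanism Paul asks for could be sketched as below: each live zero-copy mapping carries a cancel callback, and invalidating a range fires the callback for every overlapping mapping so any in-flight AIO using it can be aborted. Nothing here is existing QEMU API; the structure only illustrates the mechanism under discussion.

```c
#include <stddef.h>

typedef struct Mapping Mapping;
struct Mapping {
    unsigned long start, len;
    int revoked;                    /* set once the mapping is dead */
    void (*cancel)(Mapping *m);     /* abort outstanding I/O on m */
    Mapping *next;
};

/* Demo cancel hook: a real device would abort its AIO here. */
static void mark_revoked(Mapping *m)
{
    m->revoked = 1;
}

/* Revoke every live mapping overlapping [start, start + len). */
static void invalidate_range(Mapping *maps, unsigned long start,
                             unsigned long len)
{
    for (Mapping *m = maps; m != NULL; m = m->next) {
        if (!m->revoked && m->start < start + len &&
            start < m->start + m->len) {
            m->cancel(m);
        }
    }
}

/* Two mappings; invalidate a range that hits the first but not the
 * second.  Returns a.revoked * 10 + b.revoked. */
static int revoke_demo(void)
{
    Mapping a = { 0x1000, 0x100, 0, mark_revoked, NULL };
    Mapping b = { 0x9000, 0x100, 0, mark_revoked, &a };
    invalidate_range(&b, 0x1000, 0x1000);
    return a.revoked * 10 + b.revoked;
}
```

Whether cancellation can complete in bounded time is exactly the DoS trade-off Paul raises: the callback must be able to stop the I/O, not merely be notified of it.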
On Thu, Jul 15, 2010 at 07:45:06AM -0500, Anthony Liguori wrote:

> On 07/15/2010 04:10 AM, Joerg Roedel wrote:
> > If this means a separate interface for device dma accesses, and not
> > folding that functionality into the cpu_physical_memory* interface, I
> > agree too :-)
>
> No. PCI devices should never call cpu_physical_memory*.

Fully agreed.

> PCI devices should call pci_memory*.
>
> ISA devices should call isa_memory*.

This is a separate interface. I like the idea, and as you stated below it has clear advantages, so let's go this way.

> All device memory accesses should go through their respective buses.
> There can be multiple IOMMUs at different levels of the device
> hierarchy. If you don't provide bus-level memory access functions that
> chain through the hierarchy, it's extremely difficult to implement all
> the necessary hooks to perform the translations at different places.

Joerg
On Thu, Jul 15, 2010 at 11:49:20AM +0100, Paul Brook wrote:

> > An IOMMU is not necessarily part of a bus bridge. Conceptually, an
> > IOMMU can also be implemented on a plugin card, translating only for
> > that card. Real implementations that I am aware of always implement
> > the IOMMU in the PCI root bridge, though.
>
> If the IOMMU is implemented on the card, then it isn't an interesting
> case. It's effectively just a complex form of scatter-gather.
>
> If the on-card MMU can delegate pagetable walks to an external device
> then IMO that's also an unrelated feature, and requires an additional
> data channel.

But that would be handled by the same IOMMU emulation code, so the hooks need to be usable there too.

> My point still stands. You should not be pushing the IOMMU handling
> into device-specific code. All you need to do is make the memory access
> routines aware of which device caused the access.

Right, the device does not need to know too much about the IOMMU in the general case. The iommu_get/iommu_read/iommu_write interface should be replaced by the pci_memory* functions as suggested by Anthony.

> The fact that the GART can translate CPU accesses proves my point. If
> you have separate code for CPU and devices, then you need to duplicate
> the GART handling code.

You can configure the GART to translate device accesses only, cpu accesses only, or both. This is hard to handle if cpu and device emulation use the same memory access functions.

Joerg
On Thu, Jul 15, 2010 at 07:45:06AM -0500, Anthony Liguori wrote: > > No. PCI devices should never call cpu_physical_memory*. > > PCI devices should call pci_memory*. > > ISA devices should call isa_memory*. > > All device memory accesses should go through their respective buses. > There can be multiple IOMMUs at different levels of the device > hierarchy. If you don't provide bus-level memory access functions that > chain through the hierarchy, it's extremely difficult to implement all > the necessary hooks to perform the translations at different places. > > Regards, > > Anthony Liguori > I liked Paul's initial approach more, at least if I understood him correctly. Basically I'm suggesting a single memory_* function that simply asks the bus for I/O and translation. Say you have something like this: + Bus 1 | ---- Memory 1 | ---+ Bus 2 bridge | ---- Memory 2 | ---+ Bus 3 bridge | ---- Device Say Device wants to write to memory. If we have the DeviceState we needn't concern whether this is a BusOneDevice or BusTwoDevice from device code itself. We would just call memory_rw(dev_state, addr, buf, size, is_write); which simply recurses through DeviceState's and BusState's through their parent pointers. The actual bus can set up those to provide identification information and perhaps hooks for translation and access checking. So memory_rw() looks like this (pseudocode): static void memory_rw(DeviceState *dev, target_phys_addr_t addr, uint8_t *buf, int size, int is_write) { BusState *bus = get_parent_bus_of_dev(dev); DeviceState *pdev = get_parent_dev(dev); target_phys_addr_t taddr; if (!bus) { /* This shouldn't happen. */ assert(0); } if (bus->responsible_for(addr)) { raw_physical_memory_rw(addr, buf, size, is_write); return; } taddr = bus->translate(dev, addr); memory_rw(pdev, taddr, buf, size, is_write); } If we do this, it seems there's no need to provide separate functions. The actual buses must instead initialize those hooks properly. 
Translation here is something inherent to the bus, that handles
arbitration between possibly multiple IOMMUs. Our memory would normally
reside on / belong to the top-level bus.

What do you think? (Naming could be better though.)


	Eduard
* Paul Brook (paul@codesourcery.com) wrote:
> > > The right approach IMHO is to convert devices to use bus-specific
> > > functions to access memory. The bus specific functions should have
> > > a device argument as the first parameter.
> >
> > As for ATS, the internal api to handle the device's dma request needs
> > a notion of a translated vs. an untranslated request. IOW, if qemu ever
> > had a device with ATS support, the device would use its local cache to
> > translate the dma address and then submit a translated request to the
> > pci bus (effectively doing a raw cpu physical memory* in that case).
>
> Really? Can you provide any documentation to support this claim?
> My impression is that there is no difference between translated and
> untranslated devices, and the translation is explicitly disabled by software.

ATS allows an I/O device to request a translation from the IOMMU.
The device can then cache that translation and use the translated address
in a PCIe memory transaction. PCIe uses a couple of previously reserved
bits in the transaction layer packet header to describe the address
type for memory transactions. The default (00) maps to legacy PCIe and
describes the memory address as untranslated. This is the normal mode,
and could then incur a translation if an IOMMU is present and programmed
w/ page tables, etc. as it passes through the host bridge.

Another type is simply a transaction requesting a translation. This is
new, and allows a device to request (and cache) a translation from the
IOMMU for subsequent use.

The third type is a memory transaction tagged as already translated.
This is the type of transaction an ATS capable I/O device will generate
when it was able to translate the memory address from its own cache.

Of course, there's also an invalidation request that the IOMMU can send
to ATS capable I/O devices to invalidate the cached translation.

thanks,
-chris
On 07/15/2010 07:52 PM, Chris Wright wrote:
>> Really? Can you provide any documentation to support this claim?
>> My impression is that there is no difference between translated and
>> untranslated devices, and the translation is explicitly disabled by software.
>
> ATS allows an I/O device to request a translation from the IOMMU.
> The device can then cache that translation and use the translated address
> in a PCIe memory transaction. PCIe uses a couple of previously reserved
> bits in the transaction layer packet header to describe the address
> type for memory transactions. The default (00) maps to legacy PCIe and
> describes the memory address as untranslated. This is the normal mode,
> and could then incur a translation if an IOMMU is present and programmed
> w/ page tables, etc. as it passes through the host bridge.
>
> Another type is simply a transaction requesting a translation. This is
> new, and allows a device to request (and cache) a translation from the
> IOMMU for subsequent use.
>
> The third type is a memory transaction tagged as already translated.
> This is the type of transaction an ATS capable I/O device will generate
> when it was able to translate the memory address from its own cache.
>
> Of course, there's also an invalidation request that the IOMMU can send
> to ATS capable I/O devices to invalidate the cached translation.

For emulated device, it seems like we can ignore ATS completely, no?
* Chris Wright (chrisw@sous-sol.org) wrote:
> * Paul Brook (paul@codesourcery.com) wrote:
> > > > The right approach IMHO is to convert devices to use bus-specific
> > > > functions to access memory. The bus specific functions should have
> > > > a device argument as the first parameter.
> > >
> > > As for ATS, the internal api to handle the device's dma request needs
> > > a notion of a translated vs. an untranslated request. IOW, if qemu ever
> > > had a device with ATS support, the device would use its local cache to
> > > translate the dma address and then submit a translated request to the
> > > pci bus (effectively doing a raw cpu physical memory* in that case).
> >
> > Really? Can you provide any documentation to support this claim?

Wow...color me surprised...there's actually some apparently public
"training" docs that might help give a more complete view:

http://www.pcisig.com/developers/main/training_materials/get_document?doc_id=0ab681ba7001e40cdb297ddaf279a8de82a7dc40

ATS discussion starts on slide 23.

> > My impression is that there is no difference between translated and
> > untranslated devices, and the translation is explicitly disabled by software.

And now that I re-read that sentence, I see what you are talking about.
Yes, there is the above notion as well. A device can live in a 1:1
mapping of device address space to physical memory. This could be
achieved in a few ways (all done by the OS software programming the
IOMMU). One is to simply create a set of page tables that map 1:1 all
of device memory to physical memory. Another is to somehow mark the
device as special (either omit translation tables or mark the
translation entry as effectively "do not translate"). This is often
referred to as Pass Through mode. But this is not the same as ATS.
Pass Through mode is the functional equivalent of disabling the
translation/isolation capabilities of the IOMMU. It's typically used
when an OS wants to keep a device for itself and isn't interested in
the isolation properties of the IOMMU. It then only creates isolating
translation tables for devices it's giving to unprivileged software
(e.g. Linux/KVM giving a device to a guest, Linux giving a device to a
user space process, etc.)

> ATS allows an I/O device to request a translation from the IOMMU.
> The device can then cache that translation and use the translated address
> in a PCIe memory transaction. PCIe uses a couple of previously reserved
> bits in the transaction layer packet header to describe the address
> type for memory transactions. The default (00) maps to legacy PCIe and
> describes the memory address as untranslated. This is the normal mode,
> and could then incur a translation if an IOMMU is present and programmed
> w/ page tables, etc. as it passes through the host bridge.
>
> Another type is simply a transaction requesting a translation. This is
> new, and allows a device to request (and cache) a translation from the
> IOMMU for subsequent use.
>
> The third type is a memory transaction tagged as already translated.
> This is the type of transaction an ATS capable I/O device will generate
> when it was able to translate the memory address from its own cache.
>
> Of course, there's also an invalidation request that the IOMMU can send
> to ATS capable I/O devices to invalidate the cached translation.

thanks,
-chris
* Avi Kivity (avi@redhat.com) wrote:
> On 07/15/2010 07:52 PM, Chris Wright wrote:
> >
> >> Really? Can you provide any documentation to support this claim?
> >> My impression is that there is no difference between translated and
> >> untranslated devices, and the translation is explicitly disabled by software.
> >
> > ATS allows an I/O device to request a translation from the IOMMU.
> > The device can then cache that translation and use the translated address
> > in a PCIe memory transaction. PCIe uses a couple of previously reserved
> > bits in the transaction layer packet header to describe the address
> > type for memory transactions. The default (00) maps to legacy PCIe and
> > describes the memory address as untranslated. This is the normal mode,
> > and could then incur a translation if an IOMMU is present and programmed
> > w/ page tables, etc. as it passes through the host bridge.
> >
> > Another type is simply a transaction requesting a translation. This is
> > new, and allows a device to request (and cache) a translation from the
> > IOMMU for subsequent use.
> >
> > The third type is a memory transaction tagged as already translated.
> > This is the type of transaction an ATS capable I/O device will generate
> > when it was able to translate the memory address from its own cache.
> >
> > Of course, there's also an invalidation request that the IOMMU can send
> > to ATS capable I/O devices to invalidate the cached translation.
>
> For emulated device, it seems like we can ignore ATS completely, no?

Not if you want to emulate an ATS capable device ;)

Earlier upthread I said:

  IOW, if qemu ever had a device with ATS support...

So, that should've been a much bigger _IF_

thanks,
-chris
On 07/15/2010 08:17 PM, Chris Wright wrote:
>> For emulated device, it seems like we can ignore ATS completely, no?
>
> Not if you want to emulate an ATS capable device ;)

What I meant was that the whole request translation, invalidate, dma
using a translated address thing is invisible to software. We can
emulate an ATS capable device by going through the iommu every time.
On Thu, Jul 15, 2010 at 08:02:05PM +0300, Avi Kivity wrote:
> For emulated device, it seems like we can ignore ATS completely, no?
An important use-case for emulation is software testing, and caching in
IOMMUs is an important part that needs to be handled correctly by that
software. For this purpose it makes sense to emulate the behavior of
the caches too. So we probably should not ignore the possibility of an
emulated ATS device completely.
Joerg
* Avi Kivity (avi@redhat.com) wrote:
> On 07/15/2010 08:17 PM, Chris Wright wrote:
> >
> >> For emulated device, it seems like we can ignore ATS completely, no?
> >
> > Not if you want to emulate an ATS capable device ;)
>
> What I meant was that the whole request translation, invalidate, dma
> using a translated address thing is invisible to software. We can
> emulate an ATS capable device by going through the iommu every time.

Well, I don't see any reason to completely ignore it. It'd be really
useful for testing (I'd use it that way). Esp to verify the
invalidation of the device IOTLBs.

But I think it's not a difficult thing to emulate once we have a proper
api encapsulating a device's dma request.

thanks,
-chris
On Thu, Jul 15, 2010 at 10:17:10AM -0700, Chris Wright wrote:
> * Avi Kivity (avi@redhat.com) wrote:
> >
> > For emulated device, it seems like we can ignore ATS completely, no?
>
> Not if you want to emulate an ATS capable device ;)
>
> Earlier upthread I said:
>
>   IOW, if qemu ever had a device with ATS support...
>
> So, that should've been a much bigger _IF_
>
> thanks,
> -chris

I think we can augment some devices with ATS capability if there are
performance gains in doing so. This doesn't seem to be a detail the
actual guest OS would be interested in, so we can do it even for
devices that existed long before the AMD IOMMU came into existence.
But I'm not really sure about this, it's just a thought. Linux seems
to be issuing IOTLB invalidation commands anyway.


	Eduard
On 07/15/2010 11:45 AM, Eduard - Gabriel Munteanu wrote:
> On Thu, Jul 15, 2010 at 07:45:06AM -0500, Anthony Liguori wrote:
>> No. PCI devices should never call cpu_physical_memory*.
>>
>> PCI devices should call pci_memory*.
>>
>> ISA devices should call isa_memory*.
>>
>> All device memory accesses should go through their respective buses.
>> There can be multiple IOMMUs at different levels of the device
>> hierarchy. If you don't provide bus-level memory access functions that
>> chain through the hierarchy, it's extremely difficult to implement all
>> the necessary hooks to perform the translations at different places.
>>
>> Regards,
>>
>> Anthony Liguori
>
> I liked Paul's initial approach more, at least if I understood him
> correctly. Basically I'm suggesting a single memory_* function that
> simply asks the bus for I/O and translation. Say you have something like
> this:
>
> + Bus 1
> |
> ---- Memory 1
> |
> ---+ Bus 2 bridge
>     |
>     ---- Memory 2
>     |
>     ---+ Bus 3 bridge
>         |
>         ---- Device
>
> Say Device wants to write to memory. If we have the DeviceState we
> needn't concern whether this is a BusOneDevice or BusTwoDevice from
> device code itself. We would just call
>
> memory_rw(dev_state, addr, buf, size, is_write);

I dislike this API for a few reasons:

1) buses have different types of addresses with different address
ranges. this api would have to take a generic address type.

2) dev_state would be the qdev device state. this means qdev needs to
have memory hook mechanisms that's chainable. I think it's unnecessary
at the qdev level

3) users have upcasted device states, so it's more natural to pass
PCIDevice than DeviceState.

4) there's an assumption that all devices can get to DeviceState.
that's not always true today.

> which simply recurses through DeviceState's and BusState's through their
> parent pointers. The actual bus can set up those to provide
> identification information and perhaps hooks for translation and access
> checking.
>
> So memory_rw() looks like this (pseudocode):
>
> static void memory_rw(DeviceState *dev,
>                       target_phys_addr_t addr,
>                       uint8_t *buf,
>                       int size,
>                       int is_write)
> {
>     BusState *bus = get_parent_bus_of_dev(dev);
>     DeviceState *pdev = get_parent_dev(dev);
>     target_phys_addr_t taddr;
>
>     if (!bus) {
>         /* This shouldn't happen. */
>         assert(0);
>     }
>
>     if (bus->responsible_for(addr)) {
>         raw_physical_memory_rw(addr, buf, size, is_write);
>         return;
>     }
>
>     taddr = bus->translate(dev, addr);
>     memory_rw(pdev, taddr, buf, size, is_write);
> }

This is too simplistic because you sometimes have layering that doesn't
fit into the bus model. For instance, virtio + pci. We really want a
virtio_memory_rw that calls either syborg_memory_rw or pci_memory_rw
based on the transport.

In your proposal, we would have to model virtio-pci as a bus with a
single device which appears awkward to me.

Regards,

Anthony Liguori

> }
>
> If we do this, it seems there's no need to provide separate
> functions. The actual buses must instead initialize those hooks
> properly. Translation here is something inherent to the bus, that
> handles arbitration between possibly multiple IOMMUs. Our memory would
> normally reside on / belong to the top-level bus.
>
> What do you think? (Naming could be better though.)
>
>
> 	Eduard
diff --git a/hw/ide/core.c b/hw/ide/core.c
index 0b3b7c2..7f8f7df 100644
--- a/hw/ide/core.c
+++ b/hw/ide/core.c
@@ -26,6 +26,7 @@
 #include <hw/pc.h>
 #include <hw/pci.h>
 #include <hw/scsi.h>
+#include <hw/iommu.h>
 #include "qemu-timer.h"
 #include "sysemu.h"
 #include "dma.h"
@@ -433,7 +434,12 @@ static int dma_buf_prepare(BMDMAState *bm, int is_write)
         uint32_t addr;
         uint32_t size;
     } prd;
-    int l, len;
+    int l, len, err, io_len;
+    struct iommu *iommu;
+    DeviceState *dev;
+    target_phys_addr_t io_addr;
+
+    iommu = iommu_get(s->bus->qbus.parent, &dev);
 
     qemu_sglist_init(&s->sg, s->nsector / (IDE_PAGE_SIZE / 512) + 1);
     s->io_buffer_size = 0;
@@ -443,7 +449,7 @@ static int dma_buf_prepare(BMDMAState *bm, int is_write)
             if (bm->cur_prd_last ||
                 (bm->cur_addr - bm->addr) >= IDE_PAGE_SIZE)
                 return s->io_buffer_size != 0;
-            cpu_physical_memory_read(bm->cur_addr, (uint8_t *)&prd, 8);
+            err = iommu_read(iommu, dev, bm->cur_addr, (uint8_t *)&prd, 8);
             bm->cur_addr += 8;
             prd.addr = le32_to_cpu(prd.addr);
             prd.size = le32_to_cpu(prd.size);
@@ -455,11 +461,22 @@ static int dma_buf_prepare(BMDMAState *bm, int is_write)
             bm->cur_prd_last = (prd.size & 0x80000000);
         }
         l = bm->cur_prd_len;
-        if (l > 0) {
-            qemu_sglist_add(&s->sg, bm->cur_prd_addr, l);
-            bm->cur_prd_addr += l;
-            bm->cur_prd_len -= l;
-            s->io_buffer_size += l;
+        while (l > 0) {
+            /*
+             * In case translation / access checking fails no
+             * transfer happens but we pretend it went through.
+             */
+            err = iommu_translate(iommu, dev, bm->cur_prd_addr,
+                                  &io_addr, &io_len, !is_write);
+            if (!err) {
+                if (io_len > l)
+                    io_len = l;
+                qemu_sglist_add(&s->sg, io_addr, io_len);
+            }
+            bm->cur_prd_addr += io_len;
+            bm->cur_prd_len -= io_len;
+            s->io_buffer_size += io_len;
+            l -= io_len;
         }
     }
     return 1;
@@ -516,6 +533,10 @@ static int dma_buf_rw(BMDMAState *bm, int is_write)
         uint32_t size;
     } prd;
     int l, len;
+    struct iommu *iommu;
+    DeviceState *dev;
+
+    iommu = iommu_get(s->bus->qbus.parent, &dev);
 
     for(;;) {
         l = s->io_buffer_size - s->io_buffer_index;
@@ -526,7 +547,7 @@ static int dma_buf_rw(BMDMAState *bm, int is_write)
             if (bm->cur_prd_last ||
                 (bm->cur_addr - bm->addr) >= IDE_PAGE_SIZE)
                 return 0;
-            cpu_physical_memory_read(bm->cur_addr, (uint8_t *)&prd, 8);
+            iommu_read(iommu, dev, bm->cur_addr, (uint8_t *)&prd, 8);
             bm->cur_addr += 8;
             prd.addr = le32_to_cpu(prd.addr);
             prd.size = le32_to_cpu(prd.size);
@@ -540,13 +561,8 @@ static int dma_buf_rw(BMDMAState *bm, int is_write)
             if (l > bm->cur_prd_len)
                 l = bm->cur_prd_len;
             if (l > 0) {
-                if (is_write) {
-                    cpu_physical_memory_write(bm->cur_prd_addr,
-                                              s->io_buffer + s->io_buffer_index, l);
-                } else {
-                    cpu_physical_memory_read(bm->cur_prd_addr,
-                                             s->io_buffer + s->io_buffer_index, l);
-                }
+                iommu_rw(iommu, dev, bm->cur_prd_addr,
+                         s->io_buffer + s->io_buffer_index, l, is_write);
                 bm->cur_prd_addr += l;
                 bm->cur_prd_len -= l;
                 s->io_buffer_index += l;
Memory accesses must go through the IOMMU layer.

Signed-off-by: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
---
 hw/ide/core.c |   46 +++++++++++++++++++++++++++++++---------------
 1 files changed, 31 insertions(+), 15 deletions(-)