Message ID: 20150327171946.GL1562@arm.com
State: New
Hi Will,

On Fri, Mar 27, 2015 at 05:19:46PM +0000, Will Deacon wrote:
> Please can you pull the following IOMMU changes for 4.1? They move the per-iommu_ops pgsize_bitmap field into the iommu_domain, which allows IOMMUs such as the ARM SMMU to support different page sizes within a given SoC.

I have some concerns about the direction taken with this patch-set. The goal for the IOMMU-API is still to have domains that can be attached to arbitrary devices (even when mappings already exist). But with this patch-set we move into a direction where a domain can only be used on IOMMUs that support the page-sizes required by the domain. In the end this would be visible to the user of the IOMMU-API, which is not what we want.

I can understand the motivation behind these patches, but we need to think about how this could work with the desired semantics of the IOMMU-API.

Joerg
On Tue, Mar 31, 2015 at 03:24:40PM +0100, Joerg Roedel wrote:
> Hi Will,

Hi Joerg,

> On Fri, Mar 27, 2015 at 05:19:46PM +0000, Will Deacon wrote:
> > Please can you pull the following IOMMU changes for 4.1? They move the per-iommu_ops pgsize_bitmap field into the iommu_domain, which allows IOMMUs such as the ARM SMMU to support different page sizes within a given SoC.
>
> I have some concerns about the direction taken with this patch-set. The goal for the IOMMU-API is still to have domains that can be attached to arbitrary devices (even when mappings already exist). But with this patch-set we move into a direction where a domain can only be used on IOMMUs that support the page-sizes required by the domain. In the end this would be visible to the user of the IOMMU-API, which is not what we want.

But isn't this restriction already the case in practice? For example, if I have a domain with some mapping already configured, then that mapping will be using some fixed set of page sizes. Attaching a device behind another IOMMU that doesn't support that page size would effectively require the domain page tables to be freed and re-allocated from scratch.

So I don't think this patch series leaves us any worse off than we currently are.

The plus points of the patches are that:

- We can support different page sizes per domain (the ARM SMMU hardware really does support this and it would be nice to exploit that to gain better TLB utilisation)

- We can support systems containing IOMMUs that don't support a common page size (I believe the arm64 Juno platform has this feature)

- I don't have to manipulate a const data structure (iommu_ops) at runtime whenever I find a new IOMMU with a different set of supported page sizes.

> I can understand the motivation behind these patches, but we need to think about how this could work with the desired semantics of the IOMMU-API.

Do we have any code using this feature of the IOMMU API? I don't think it's realistic in the general case to allow arbitrary devices to be attached to a domain unless the domain can also span multiple IOMMUs. In that case, we'd actually need multiple sets of page tables, potentially described using different formats...

Will
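Will's point that an existing mapping "will be using some fixed set of page sizes" can be sketched concretely. The snippet below is an illustrative, stand-alone model (not the kernel code) of how a page-size bitmap constrains a mapping: given a bitmap where bit N means "2^N-byte pages are supported", pick the largest page usable for a chunk at a given IOVA/physical address. The core iommu_map() path performs a similar split; the function name here is invented.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical sketch: choose the largest supported page size for one
 * chunk of a mapping. A page size is usable only if it is no larger
 * than the remaining length and divides the alignment of both the IOVA
 * and the physical address.
 */
static uint64_t pick_pgsize(uint64_t pgsize_bitmap, uint64_t iova,
                            uint64_t paddr, uint64_t size)
{
    uint64_t addr_merge = iova | paddr;
    uint64_t pgsizes;

    assert(size);

    /* Rule out page sizes larger than the remaining length... */
    pgsizes = pgsize_bitmap & ((2ULL << (63 - __builtin_clzll(size))) - 1);

    /* ...and larger than the alignment of the addresses. */
    if (addr_merge)
        pgsizes &= (2ULL << __builtin_ctzll(addr_merge)) - 1;

    assert(pgsizes && "no supported page size fits this chunk");

    return 1ULL << (63 - __builtin_clzll(pgsizes)); /* highest bit left */
}
```

For example, with a 4K|2M|1G bitmap, a 4MB mapping at a 2M-aligned address can use 2M pages, while the same mapping at a merely 4K-aligned address drops to 4K pages. This is also why the set of page sizes baked into a domain's existing page tables matters when attaching a device behind a less capable IOMMU.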
On Tue, 2015-03-31 at 15:49 +0100, Will Deacon wrote:
> On Tue, Mar 31, 2015 at 03:24:40PM +0100, Joerg Roedel wrote:
> > Hi Will,
>
> Hi Joerg,
>
> > On Fri, Mar 27, 2015 at 05:19:46PM +0000, Will Deacon wrote:
> > > Please can you pull the following IOMMU changes for 4.1? They move the per-iommu_ops pgsize_bitmap field into the iommu_domain, which allows IOMMUs such as the ARM SMMU to support different page sizes within a given SoC.
> >
> > I have some concerns about the direction taken with this patch-set. The goal for the IOMMU-API is still to have domains that can be attached to arbitrary devices (even when mappings already exist). But with this patch-set we move into a direction where a domain can only be used on IOMMUs that support the page-sizes required by the domain. In the end this would be visible to the user of the IOMMU-API, which is not what we want.
>
> But isn't this restriction already the case in practice? For example, if I have a domain with some mapping already configured, then that mapping will be using some fixed set of page sizes. Attaching a device behind another IOMMU that doesn't support that page size would effectively require the domain page tables to be freed and re-allocated from scratch.
>
> So I don't think this patch series leaves us any worse off than we currently are.
>
> The plus points of the patches are that:
>
> - We can support different page sizes per domain (the ARM SMMU hardware really does support this and it would be nice to exploit that to gain better TLB utilisation)
>
> - We can support systems containing IOMMUs that don't support a common page size (I believe the arm64 Juno platform has this feature)
>
> - I don't have to manipulate a const data structure (iommu_ops) at runtime whenever I find a new IOMMU with a different set of supported page sizes.
>
> > I can understand the motivation behind these patches, but we need to think about how this could work with the desired semantics of the IOMMU-API.
>
> Do we have any code using this feature of the IOMMU API? I don't think it's realistic in the general case to allow arbitrary devices to be attached to a domain unless the domain can also span multiple IOMMUs. In that case, we'd actually need multiple sets of page tables, potentially described using different formats...

Legacy KVM assignment relies on being able to attach all the devices to a single IOMMU domain, and the hardware generally supports the domain page table being used by multiple hardware units. It's not without issue though. For instance, there's no hardware spec that requires that all the hardware IOMMUs for an iommu_ops must support IOMMU_CAP_CACHE_COHERENCY. That's a per-domain capability, not per iommu_ops. If we start with a device attached to an IOMMU that does support this capability and create our mappings with the IOMMU_CACHE protection flag, that domain is incompatible with other IOMMU hardware units that do not support that capability. On VT-d, the IOMMU API lets us share the domain between hardware units, but we might get invalid reserved field faults if we mix-n-match too much.

This is why VFIO had to add support for multiple IOMMU domains within a VFIO container. It used to be that a VFIO container was essentially a 1:1 abstraction of an IOMMU domain, but issues like IOMMU_CACHE forced us to extend that abstraction.

It makes sense to me that supported page sizes has a similar problem to IOMMU_CACHE; IOMMU mappings can be made that are dependent on the composition of the domain at the time of mapping, and there's no requirement that all the IOMMU hardware units support the exact same features. VFIO already assumes that separate domains don't necessarily use the same ops, and we make sure mappings and un-mappings are aligned to the smallest common size. We'll have some work to do though if there is no common size between the domains, and we may need to add a test to our notion of compatible domains if pgsize_bitmap moves from iommu_ops (sorry, I forget whether you already had a patch for that in this pull request).

Thanks,
Alex
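Alex's "aligned to the smallest common size" rule can be illustrated with a toy model: intersect the per-domain page-size bitmaps, then require user mappings to be aligned to the smallest size left in the intersection. This is only a sketch with invented names (the real logic lives in VFIO's type1 IOMMU backend), but it shows why an empty intersection leaves VFIO with no valid mapping granule at all.

```c
#include <assert.h>
#include <stdint.h>

/* Intersect the supported-page-size bitmaps of all domains in a container. */
static uint64_t common_pgsize_bitmap(const uint64_t *domain_bitmaps, int n)
{
    uint64_t common = ~0ULL;
    int i;

    for (i = 0; i < n; i++)
        common &= domain_bitmaps[i];
    return common;
}

/* A user mapping is acceptable if iova and size are both multiples of
 * the smallest common page size. */
static int mapping_allowed(uint64_t common, uint64_t iova, uint64_t size)
{
    uint64_t min_pg;

    if (!common || !size)
        return 0;                       /* no common page size at all */
    min_pg = common & -common;          /* lowest set bit = smallest size */
    return !((iova | size) & (min_pg - 1));
}
```

For instance, if one domain supports 4K|2M and another supports 64K|2M, the only common size is 2M, so every mapping request must be 2M-aligned.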
Hi Joerg,

On 31/03/15 15:24, Joerg Roedel wrote:
> Hi Will,
>
> On Fri, Mar 27, 2015 at 05:19:46PM +0000, Will Deacon wrote:
>> Please can you pull the following IOMMU changes for 4.1? They move the per-iommu_ops pgsize_bitmap field into the iommu_domain, which allows IOMMUs such as the ARM SMMU to support different page sizes within a given SoC.
>
> I have some concerns about the direction taken with this patch-set. The goal for the IOMMU-API is still to have domains that can be attached to arbitrary devices (even when mappings already exist). But with this patch-set we move into a direction where a domain can only be used on IOMMUs that support the page-sizes required by the domain. In the end this would be visible to the user of the IOMMU-API, which is not what we want.

I think the notion of sharing domains between arbitrary devices is already broken for systems with multiple IOMMU devices - with the iommu_domain itself now embedded in a private data structure, it's certainly not safe to pass it to two different drivers, should the client devices be behind heterogeneous IOMMUs. Furthermore, since most of those private data structures contain instance-specific data, that prevents a domain spanning even two homogeneous IOMMU devices.

If the goal is for iommu_domains to be a hardware-independent abstraction, then shouldn't we be looking to move the other instance-specific features like attributes and geometry out of them? Perhaps the answer is the opposite; accept the existing iommu_domain as what it appears to be used as - a fairly thin abstraction of one or more context banks/uTLBs/DMARs/whatever on a single IOMMU device - and consider some higher-level interface for grouping one or more of these "device domains" together and sharing a single IOVA space between them.

Robin.

> I can understand the motivation behind these patches, but we need to think about how this could work with the desired semantics of the IOMMU-API.
>
> Joerg
Hi Alex, Joerg,

On Tue, Mar 31, 2015 at 04:50:50PM +0100, Alex Williamson wrote:
> On Tue, 2015-03-31 at 15:49 +0100, Will Deacon wrote:
> > On Tue, Mar 31, 2015 at 03:24:40PM +0100, Joerg Roedel wrote:
> > > On Fri, Mar 27, 2015 at 05:19:46PM +0000, Will Deacon wrote:
> > > > Please can you pull the following IOMMU changes for 4.1? They move the per-iommu_ops pgsize_bitmap field into the iommu_domain, which allows IOMMUs such as the ARM SMMU to support different page sizes within a given SoC.
> > >
> > > I have some concerns about the direction taken with this patch-set. The goal for the IOMMU-API is still to have domains that can be attached to arbitrary devices (even when mappings already exist). But with this patch-set we move into a direction where a domain can only be used on IOMMUs that support the page-sizes required by the domain. In the end this would be visible to the user of the IOMMU-API, which is not what we want.
> >
> > But isn't this restriction already the case in practice? For example, if I have a domain with some mapping already configured, then that mapping will be using some fixed set of page sizes. Attaching a device behind another IOMMU that doesn't support that page size would effectively require the domain page tables to be freed and re-allocated from scratch.
> >
> > So I don't think this patch series leaves us any worse off than we currently are.
> >
> > The plus points of the patches are that:
> >
> > - We can support different page sizes per domain (the ARM SMMU hardware really does support this and it would be nice to exploit that to gain better TLB utilisation)
> >
> > - We can support systems containing IOMMUs that don't support a common page size (I believe the arm64 Juno platform has this feature)
> >
> > - I don't have to manipulate a const data structure (iommu_ops) at runtime whenever I find a new IOMMU with a different set of supported page sizes.
> >
> > > I can understand the motivation behind these patches, but we need to think about how this could work with the desired semantics of the IOMMU-API.
> >
> > Do we have any code using this feature of the IOMMU API? I don't think it's realistic in the general case to allow arbitrary devices to be attached to a domain unless the domain can also span multiple IOMMUs. In that case, we'd actually need multiple sets of page tables, potentially described using different formats...
>
> Legacy KVM assignment relies on being able to attach all the devices to a single IOMMU domain, and the hardware generally supports the domain page table being used by multiple hardware units. It's not without issue though. For instance, there's no hardware spec that requires that all the hardware IOMMUs for an iommu_ops must support IOMMU_CAP_CACHE_COHERENCY. That's a per-domain capability, not per iommu_ops. If we start with a device attached to an IOMMU that does support this capability and create our mappings with the IOMMU_CACHE protection flag, that domain is incompatible with other IOMMU hardware units that do not support that capability. On VT-d, the IOMMU API lets us share the domain between hardware units, but we might get invalid reserved field faults if we mix-n-match too much.

FWIW, legacy KVM assignment is also only supported on x86, so hopefully the mixing and matching is somewhat limited by the available platforms.

> This is why VFIO had to add support for multiple IOMMU domains within a VFIO container. It used to be that a VFIO container was essentially a 1:1 abstraction of an IOMMU domain, but issues like IOMMU_CACHE forced us to extend that abstraction.

Agreed. We also need to cater for multiple, heterogeneous IOMMUs in the system too. For example, the new ChromeBit device based on rk3288 contains both a rockchip iommu (rockchip-iommu.c) *and* an ARM SMMU (arm-smmu.c). These IOMMUs both deal with masters on the platform bus and they have separate page table formats. Granted, this patch series doesn't address that problem, but it doesn't make it worse either.

> It makes sense to me that supported page sizes has a similar problem to IOMMU_CACHE; IOMMU mappings can be made that are dependent on the composition of the domain at the time of mapping, and there's no requirement that all the IOMMU hardware units support the exact same features. VFIO already assumes that separate domains don't necessarily use the same ops, and we make sure mappings and un-mappings are aligned to the smallest common size. We'll have some work to do though if there is no common size between the domains, and we may need to add a test to our notion of compatible domains if pgsize_bitmap moves from iommu_ops (sorry, I forget whether you already had a patch for that in this pull request). Thanks,

The series updates vfio_pgsize_bitmap to use the domains, so dma_do_map will fail if there's no common page size amongst the IOMMUs referenced by the container. That preserves the existing behaviour, but I'm not sure why it's actually required. Couldn't we potentially allow different domains to use different page sizes? We'd just need to advertise the page sizes that are large enough to be supported by all domains. iommu_map will then take care of breaking them down into smaller chunks when required.

On the ARM SMMU, you may have the choice of a few different page sizes but, once you've chosen one (actually a subset), you can't use the others for that domain. This is currently a decision made when the first device is attached to a domain (i.e. when we allocate the initial page table) and we basically try to find the page size closest to the one in use by the CPU. See arm_lpae_restrict_pgsizes for the gory details. With the current code, that means when one domain decides to use a particular subset, we have to update (const) iommu_ops->pgsize_bitmap and therefore needlessly prevent any other domain on any other ARM SMMU from using a different set of page sizes.

Joerg: are you still against merging this? If so, what do you need to see in order to change your mind? This is certainly something that we need on ARM-based systems and, tbh, the direction that the IOMMU core is going seems to be towards a N:1 domain:IOMMU mapping anyway (c.f. your series to embed struct iommu_domain in a driver-private data structure).

Will
On Fri, 2015-03-27 at 17:19 +0000, Will Deacon wrote:
> Please can you pull the following IOMMU changes for 4.1? They move the per-iommu_ops pgsize_bitmap field into the iommu_domain, which allows IOMMUs such as the ARM SMMU to support different page sizes within a given SoC.

Can't we kill it entirely? It's always been horrid. Just let the IOMMU driver fill the page tables as it sees fit, according to the virtual/physical size and alignment of the range it's asked to map.

We already lie about the supported page sizes in the Intel VT-d driver and claim to support everything, just to make the caller "Just Ask" as it should.

We might have to fix the loop that KVM uses to unmap pages, and potentially the bizarre unmap() API which returns the page size that it unmapped, but we should do that anyway :)

Perhaps we could have an unmap function which returns a list of unmapped pages chained on pg->freelist, if this was the last use of those pages (as ISTR it is in the KVM case).
On Wed, Apr 01, 2015 at 02:14:11PM +0100, David Woodhouse wrote:
> On Fri, 2015-03-27 at 17:19 +0000, Will Deacon wrote:
> > Please can you pull the following IOMMU changes for 4.1? They move the per-iommu_ops pgsize_bitmap field into the iommu_domain, which allows IOMMUs such as the ARM SMMU to support different page sizes within a given SoC.
>
> Can't we kill it entirely? It's always been horrid. Just let the IOMMU driver fill the page tables as it sees fit, according to the virtual/physical size and alignment of the range it's asked to map.
>
> We already lie about the supported page sizes in the Intel VT-d driver and claim to support everything, just to make the caller "Just Ask" as it should.
>
> We might have to fix the loop that KVM uses to unmap pages, and potentially the bizarre unmap() API which returns the page size that it unmapped, but we should do that anyway :)

We'd also need to update any page-table code that relies on being called one page at a time (i.e. the new io-pgtable stuff we merged for 4.0) so that we iterate over the requested range. That shouldn't be too difficult (i.e. just inline the iommu_map logic into each driver), but it would need some testing.

In fact, that would also allow us to nuke the whole TLB on VFIO unmap of large ranges. We currently iterate over the range page-by-page, which takes ages (minutes) in simulation.

Will
On Wed, 2015-04-01 at 14:39 +0100, Will Deacon wrote:
> We'd also need to update any page-table code that relies on being called one page at a time (i.e. the new io-pgtable stuff we merged for 4.0) so that we iterate over the requested range.

Ick. I hadn't noticed we'd entrenched that mistake even harder. It certainly *wasn't* that hard to fix it...

> That shouldn't be too difficult (i.e. just inline the iommu_map logic into each driver), but it would need some testing.

Good.

> In fact, that would also allow us to nuke the whole TLB on VFIO unmap of large ranges. We currently iterate over the range page-by-page, which takes ages (minutes) in simulation.

Which is one of the big reasons we lie about page sizes in VT-d.
On Wed, Apr 01, 2015 at 02:52:57PM +0100, David Woodhouse wrote:
> On Wed, 2015-04-01 at 14:39 +0100, Will Deacon wrote:
> > We'd also need to update any page-table code that relies on being called one page at a time (i.e. the new io-pgtable stuff we merged for 4.0) so that we iterate over the requested range.
>
> Ick. I hadn't noticed we'd entrenched that mistake even harder. It certainly *wasn't* that hard to fix it...

Well, at least it's now in one place! (Before, the same mistake was in each driver.)

> > That shouldn't be too difficult (i.e. just inline the iommu_map logic into each driver), but it would need some testing.
>
> Good.
>
> > In fact, that would also allow us to nuke the whole TLB on VFIO unmap of large ranges. We currently iterate over the range page-by-page, which takes ages (minutes) in simulation.
>
> Which is one of the big reasons we lie about page sizes in VT-d.

Makes sense for now. That just leaves the case where the caller tries to map something smaller than a page. Do we want to return the amount mapped in iommu_map?

Will
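The cost asymmetry behind "nuke the whole TLB" versus page-by-page unmap can be shown with a toy model. Everything below is invented for illustration (counters stand in for real TLB maintenance, and this is not the io-pgtable API): a per-page unmap path issues one invalidate per page, while a driver that sees the whole range can clear the PTEs in one walk and flush once.

```c
#include <assert.h>
#include <stdint.h>

/* Toy TLB that just counts maintenance operations. */
struct toy_tlb {
    unsigned long ops;
};

/* One invalidate-by-IOVA per page: cost grows linearly with range size. */
static void unmap_page_by_page(struct toy_tlb *tlb, uint64_t size, uint64_t pgsz)
{
    uint64_t off;

    for (off = 0; off < size; off += pgsz)
        tlb->ops++;
}

/* Driver walks the whole range itself, then nukes the TLB once. */
static void unmap_whole_range(struct toy_tlb *tlb, uint64_t size, uint64_t pgsz)
{
    (void)size;
    (void)pgsz;     /* clearing the PTEs happens in one page-table walk */
    tlb->ops++;     /* a single flush-all at the end */
}
```

For a 1GB range of 4K pages that is 262144 invalidations versus one, which is the sort of difference that shows up as "minutes" in simulation.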
On Wed, 2015-04-01 at 15:05 +0100, Will Deacon wrote:
> That just leaves the case where the caller tries to map something smaller than a page. Do we want to return the amount mapped in iommu_map?

Perhaps we should just expose the *minimum* page size? Although for the IOMMU API I thought that was always assumed to be the same as PAGE_SIZE, and for the DMA API it doesn't matter much because you're *given* the IOVA, and you just know that it's mapped *enough*.
On Wed, Apr 01, 2015 at 03:28:10PM +0100, David Woodhouse wrote:
> On Wed, 2015-04-01 at 15:05 +0100, Will Deacon wrote:
> > That just leaves the case where the caller tries to map something smaller than a page. Do we want to return the amount mapped in iommu_map?
>
> Perhaps we should just expose the *minimum* page size? Although for the IOMMU API I thought that was always assumed to be the same as PAGE_SIZE, and for the DMA API it doesn't matter much because you're *given* the IOVA, and you just know that it's mapped *enough*.

We can certainly run with a 64k CPU page size and a 4k IOMMU page size on arm64 systems. The other way around is much more problematic, so I think you're right to suggest the minimum page size.

However, once we have that, we run into the same problem that we've got with the current pgsize_bitmap. Namely, that it needs to be a per-domain property to avoid it changing dynamically following an initial map request or a probe of a new IOMMU device.

Will
On Wed, 2015-04-01 at 15:39 +0100, Will Deacon wrote:
> However, once we have that, we run into the same problem that we've got with the current pgsize_bitmap. Namely, that it needs to be a per-domain property to avoid it changing dynamically following an initial map request or a probe of a new IOMMU device.

That's not so much of a problem, surely? When adding devices to a new domain without any mappings, the minimum page size of the domain is the largest of all the IOMMUs participating in that domain. And if you try to add new devices after you've already started making mappings, we refuse the addition if it would raise the minimum page size higher than the page sizes you're already using.

In VT-d we already have very similar logic for superpage and snooping support.
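The attach policy David describes can be sketched in a few lines, with entirely invented names: the domain's minimum page size is the largest minimum of its member IOMMUs, and attaching a further IOMMU is refused once mappings already exist at a granule smaller than that IOMMU can handle.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-domain state for the attach check. */
struct toy_domain {
    uint64_t min_pgsize;        /* current domain-wide minimum page size */
    uint64_t smallest_mapped;   /* smallest granule mapped so far, 0 if none */
};

/* Attach an IOMMU with the given minimum page size to the domain.
 * Returns 0 on success, -1 if existing mappings are finer than the
 * minimum the new IOMMU would impose. */
static int toy_attach(struct toy_domain *d, uint64_t iommu_min_pgsize)
{
    uint64_t new_min = iommu_min_pgsize > d->min_pgsize ?
                       iommu_min_pgsize : d->min_pgsize;

    if (d->smallest_mapped && new_min > d->smallest_mapped)
        return -1;  /* would invalidate existing fine-grained mappings */

    d->min_pgsize = new_min;
    return 0;
}
```

So a 64K-minimum IOMMU can join a 4K domain freely while the domain is empty (the minimum just grows to 64K), but once 64K mappings exist, an IOMMU needing 2M granules is rejected, mirroring the VT-d-style compatibility checks for superpage and snooping.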
Hi Will,

On Tue, Mar 31, 2015 at 03:49:56PM +0100, Will Deacon wrote:
> But isn't this restriction already the case in practice? For example, if I have a domain with some mapping already configured, then that mapping will be using some fixed set of page sizes. Attaching a device behind another IOMMU that doesn't support that page size would effectively require the domain page tables to be freed and re-allocated from scratch.

The problem is that this restriction depends on the IOMMU driver in use. From the beginning of the IOMMU-API the backend drivers have supported sharing a domain by multiple devices. In fact, the domain is just an abstraction for an address space that can be attached to devices just like cpu page-tables are assigned to threads of a process.

We can discuss whether this fundamental concept of the IOMMU-API needs to be changed (moving into the direction this patch-set proposes). I could imagine removing the domain concept entirely and just having map/unmap functions like this:

	iommu_map(struct device *dev, dma_addr_t iova,
		  phys_addr_t phys, size_t size)
	iommu_unmap(struct device *dev, dma_addr_t iova,
		    phys_addr_t phys, size_t size)

This would require some changes for the users of the IOMMU-API, but it shouldn't be too hard.

Or we keep the desired semantics as they are now and do everything needed for that in the IOMMU drivers. For the arm-smmu this would mean exposing a common set of supported pgsizes between IOMMUs, or building multiple page-tables to match the different IOMMU capabilities.

I am open to discussion either way, but I like the current semantics a bit more, as they allow us to share page-tables between devices and we can move all of the nasty code in VFIO that already creates multiple domains to get different page-tables into the IOMMU core or the drivers (where it belongs).

What I definitely don't want is a mixture of both concepts depending on the IOMMU driver in use. We should settle on one way and force the drivers to behave accordingly. We don't need a common API when every driver behaves differently.

Joerg
Hi Will,

On Wed, Apr 01, 2015 at 12:53:40PM +0100, Will Deacon wrote:
> Joerg: are you still against merging this? If so, what do you need to see in order to change your mind? This is certainly something that we need on ARM-based systems and, tbh, the direction that the IOMMU core is going seems to be towards a N:1 domain:IOMMU mapping anyway (c.f. your series to embed struct iommu_domain in a driver-private data structure).

What I want is consistent behavior between drivers. Until now I have tried to push for the domain concept, which allows attaching arbitrary devices to a domain. This patch-set makes it impossible to achieve this goal; that's why I am resistant to it.

We can discuss moving in another direction with the IOMMU-API, and re-evaluate then whether this patch-set is still required.

Joerg
On Wed, Apr 01, 2015 at 03:46:10PM +0100, David Woodhouse wrote:
> On Wed, 2015-04-01 at 15:39 +0100, Will Deacon wrote:
> > However, once we have that, we run into the same problem that we've got with the current pgsize_bitmap. Namely, that it needs to be a per-domain property to avoid it changing dynamically following an initial map request or a probe of a new IOMMU device.
>
> That's not so much of a problem, surely? When adding devices to a new domain without any mappings, the minimum page size of the domain is the largest of all the IOMMUs participating in that domain.

The issue (speaking in terms of the ARM SMMU, since that's what I'm familiar with) is that we don't know the page sizes until we've chosen our translation regime. For example, we may have an SMMU that supports the following page sizes:

	1:  4k | 2M  | 1G
	2: 16k | 32M
	3: 64k | 512M

so a domain on the SMMU can use one of those three regimes. Other domains *on the same SMMU* may use a different regime. That means we've actually got three minimum page sizes for the SMMU. Therefore, the page size we end up using for a domain depends on both:

  (a) Whether or not we've probed another SMMU (using the same driver) that goes and further restricts iommu_ops->min_pgsize (or whatever it ends up being called)

  (b) Which one of the available regimes is chosen by the page-table code.

Currently (b) tries to get something close to the CPU page size, but you could imagine adding an interface to bias the selection towards some other size. We could pick the largest minimum page size of the supported regimes (I can't word this better... it would be 64k in the example above), but I worry that this will end up being too big in practice and we'll over-map for most small mappings. If we put the min_pgsize in the domain, we can just describe it using the regime that domain is using.

> And if you try to add new devices after you've already started making mappings, we refuse the addition if it would raise the minimum page size higher than the page sizes you're already using.

Ok, so I think that solves (a) above, but means we have to keep track of the min_pgsize in the domain anyway so that we can check against it, in which case it may just as well be removed from iommu_ops.

Will
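The three translation regimes from Will's example can be written out as pgsize bitmaps (bit N set means 2^N-byte pages supported). The selection sketched below — pick the regime whose base granule matches the CPU page size — is a deliberate simplification of what arm_lpae_restrict_pgsizes does, with invented names; its point is that each regime has a different *minimum* (4k, 16k, or 64k), so the SMMU has no single minimum page size until a domain has chosen its regime.

```c
#include <assert.h>
#include <stdint.h>

/* The three regimes from the example, as page-size bitmaps. */
static const uint64_t smmu_regimes[3] = {
    (1ULL << 12) | (1ULL << 21) | (1ULL << 30),  /*  4k | 2M  | 1G  */
    (1ULL << 14) | (1ULL << 25),                 /* 16k | 32M       */
    (1ULL << 16) | (1ULL << 29),                 /* 64k | 512M      */
};

/* Pick the regime whose smallest page (lowest set bit) matches the CPU
 * page size, falling back to the 4k regime otherwise. */
static uint64_t pick_regime(uint64_t cpu_page_size)
{
    int i;

    for (i = 0; i < 3; i++)
        if ((smmu_regimes[i] & -smmu_regimes[i]) == cpu_page_size)
            return smmu_regimes[i];
    return smmu_regimes[0];
}
```

Two domains on the same SMMU may end up with different bitmaps (say, regime 1 and regime 3), which is exactly the situation a single per-iommu_ops pgsize_bitmap or min_pgsize cannot describe.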
On Wed, 2015-04-01 at 12:53 +0100, Will Deacon wrote:
> On Tue, Mar 31, 2015 at 04:50:50PM +0100, Alex Williamson wrote:
> > It makes sense to me that supported page sizes has a similar problem to IOMMU_CACHE; IOMMU mappings can be made that are dependent on the composition of the domain at the time of mapping, and there's no requirement that all the IOMMU hardware units support the exact same features. VFIO already assumes that separate domains don't necessarily use the same ops, and we make sure mappings and un-mappings are aligned to the smallest common size. We'll have some work to do though if there is no common size between the domains, and we may need to add a test to our notion of compatible domains if pgsize_bitmap moves from iommu_ops (sorry, I forget whether you already had a patch for that in this pull request). Thanks,
>
> The series updates vfio_pgsize_bitmap to use the domains, so dma_do_map will fail if there's no common page size amongst the IOMMUs referenced by the container. That preserves the existing behaviour, but I'm not sure why it's actually required. Couldn't we potentially allow different domains to use different page sizes?

We can. I think the only thing we're trying to do with it now is confirm that the user mapping meets the minimum size and alignment requirements of the IOMMU. We don't use it as a basis for our actual mappings; it's the IOMMU API's job to do that.

Thanks,
Alex
On Wed, 2015-04-01 at 15:52 +0200, David Woodhouse wrote:
> On Wed, 2015-04-01 at 14:39 +0100, Will Deacon wrote:
> > We'd also need to update any page-table code that relies on being called one page at a time (i.e. the new io-pgtable stuff we merged for 4.0) so that we iterate over the requested range.
>
> Ick. I hadn't noticed we'd entrenched that mistake even harder. It certainly *wasn't* that hard to fix it...
>
> > That shouldn't be too difficult (i.e. just inline the iommu_map logic into each driver), but it would need some testing.
>
> Good.
>
> > In fact, that would also allow us to nuke the whole TLB on VFIO unmap of large ranges. We currently iterate over the range page-by-page, which takes ages (minutes) in simulation.
>
> Which is one of the big reasons we lie about page sizes in VT-d.

I think (hope) we've solved a significant portion of the unmap problem with the chunking patches that are now in v4.0. We should now be attempting to find the largest contiguous range to unmap instead of doing the page-by-page unmap that was mostly hoping the IOMMU actually used a super page. I'm not sure how lying about page sizes had any effect on this though. In order for vfio to get any smarter about it we'd need a "commit" callback and to be able to use the freelist to track pages to unpin after commit.

Thanks,
Alex
On Wed, Apr 01, 2015 at 04:38:54PM +0100, Joerg Roedel wrote: > Hi Will, Hi Joerg, > On Tue, Mar 31, 2015 at 03:49:56PM +0100, Will Deacon wrote: > > But isn't this restriction already the case in practice? For example, if > > I have a domain with some mapping already configured, then that mapping > > will be using some fixed set of page sizes. Attaching a device behind > > another IOMMU that doesn't support that page size would effectively require > > the domain page tables to be freed and re-allocated from scratch. > > The problem is that this restriction depends on the IOMMU driver in use. > From the beginning of the IOMMU-API the backend drivers have supported > sharing a domain by multiple devices. In fact, the domain is just an > abstraction for an address space that can be attached to devices just > like cpu page-tables are assigned to threads of a process. I think the domain is a useful abstraction; I just don't think it works too well when you attach devices upstream of different IOMMUs to the same domain. If all the devices attached to a domain share an IOMMU, everything works nicely. > We can discuss whether this fundamental concept of the IOMMU-API needs > to be changed (moving into the direction this patch-set proposes). I > could imaging removing the domain concept entirely and just have > map/unmap functions like this: > > iommu_map(struct device *dev, dma_addr_t iova, > phys_addr_t phys, size_t size) > iommu_unmap(struct device *dev, dma_addr_t iova, > phys_addr_t phys, size_t size) > > This would require some changes for the users of the IOMMU-API, but it > shouldn't be too hard. > > Or we keep the desired semantics as they are now and do everything > needed for that in the IOMMU drivers. For the arm-smmu this would mean > exposing a common set of supported pgsizes between IOMMUs, or to build > multiple page-tables to match the different IOMMU capabilities. 
>
> I am open for discussions either way, but I like the current
> semantics a bit more, as it allows us to share page-tables between
> devices and we can move all of the nasty code in VFIO that already
> creates multiple domains to get different page-tables into the
> IOMMU core or the drivers (where it belongs).

I agree that sharing page tables is desirable and I certainly wouldn't
suggest removing the domain abstraction.

> What I definitely don't want is a mixture of both concepts depending on
> the IOMMU driver in use. We should settle on one way and force the
> drivers to behave accordingly. We don't need a common API when every
> driver behaves differently.

Agreed. How would you feel about restricting domains to be per-IOMMU
instance? VFIO already copes with this, so I think we'd just need
something to keep legacy KVM device passthrough working on x86. Maybe
we could add a new domain type using your new series
(DOMAIN_X86_KVM_LEGACY or something) and avoid the cross-IOMMU domain
checks for that.

Will
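The per-IOMMU-instance restriction Will floats here reduces to one check at attach time. A minimal sketch, assuming hypothetical toy types (nothing here is the real `iommu_attach_device()` logic):

```c
#include <assert.h>
#include <stddef.h>

struct toy_iommu { int id; };

struct toy_domain {
	struct toy_iommu *iommu;   /* bound at first attach */
};

struct toy_device {
	struct toy_iommu *iommu;   /* the instance this device sits behind */
};

/*
 * Attach succeeds only if the device sits behind the same IOMMU
 * instance the domain is already bound to; the first attach does the
 * binding. A legacy domain type could simply skip this check.
 */
static int toy_attach(struct toy_domain *dom, struct toy_device *dev)
{
	if (!dom->iommu)
		dom->iommu = dev->iommu;
	return dom->iommu == dev->iommu ? 0 : -1;
}
```

The page-size problem then disappears by construction: every device in the domain shares one instance, so one pgsize_bitmap and one set of page tables suffice.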
Hi Alex,

On Wed, Apr 01, 2015 at 05:51:04PM +0100, Alex Williamson wrote:
> On Wed, 2015-04-01 at 15:52 +0200, David Woodhouse wrote:
> > On Wed, 2015-04-01 at 14:39 +0100, Will Deacon wrote:
> > > In fact, that would also allow us to nuke the whole TLB on VFIO unmap of
> > > large ranges. We currently iterate over the range page-by-page, which takes
> > > ages (minutes) in simulation.
> >
> > Which is one of the big reasons we lie about page sizes in VT-d.
>
> I think (hope) we've solved a significant portion of the unmap problem
> with the chunking patches that are now in v4.0. We should now be
> attempting to find the largest contiguous range to unmap instead of
> doing the page-by-page unmap that was mostly hoping the IOMMU actually
> used a super page. I'm not sure how lying about page sizes had any
> effect on this though. In order for vfio to get any smarter about it
> we'd need a "commit" callback and to be able to use the freelist to
> track pages to unpin after commit. Thanks,

Just so I understand this correctly, is that "commit" callback to ensure
that the domain_destroy has actually taken effect on the IOMMU (e.g. by
invalidating the TLBs and waiting for that to take effect)?

The io-pgtable API has a tlb_sync callback, but actually I made
tlb_flush_all (which is called when destroying the page tables) always
synchronous so that you wouldn't need to unmap anything explicitly in
that case.

Unfortunately, 4.0 doesn't seem to have improved wrt VFIO unmap for me.
Is there something extra I need to add to the ARM SMMU driver?

Will
On Wed, 2015-04-01 at 18:50 +0100, Will Deacon wrote:
> Hi Alex,
>
> On Wed, Apr 01, 2015 at 05:51:04PM +0100, Alex Williamson wrote:
> > On Wed, 2015-04-01 at 15:52 +0200, David Woodhouse wrote:
> > > On Wed, 2015-04-01 at 14:39 +0100, Will Deacon wrote:
> > > > In fact, that would also allow us to nuke the whole TLB on VFIO unmap of
> > > > large ranges. We currently iterate over the range page-by-page, which takes
> > > > ages (minutes) in simulation.
> > >
> > > Which is one of the big reasons we lie about page sizes in VT-d.
> >
> > I think (hope) we've solved a significant portion of the unmap problem
> > with the chunking patches that are now in v4.0. We should now be
> > attempting to find the largest contiguous range to unmap instead of
> > doing the page-by-page unmap that was mostly hoping the IOMMU actually
> > used a super page. I'm not sure how lying about page sizes had any
> > effect on this though. In order for vfio to get any smarter about it
> > we'd need a "commit" callback and to be able to use the freelist to
> > track pages to unpin after commit. Thanks,
>
> Just so I understand this correctly, is that "commit" callback to ensure
> that the domain_destroy has actually taken effect on the IOMMU (e.g. by
> invalidating the TLBs and waiting for that to take effect)?

Not necessarily domain_destroy, but simply large unmaps that require
multiple calls to iommu_unmap(). Each call requires the IOMMU driver to
flush io-tlb page structures. If we could defer all of the io-tlb
flushes to the end, I imagine we'd get a good improvement in efficiency.
We can't unpin pages until after the flush though, so we'd need
something like the page freelist to bridge the gap.

> The io-pgtable API has a tlb_sync callback, but actually I made
> tlb_flush_all (which is called when destroying the page tables) always
> synchronous so that you wouldn't need to unmap anything explicitly in
> that case.
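The ordering constraint Alex describes — queue unmapped pages on a freelist, issue one io-tlb flush for the whole batch, and only then unpin — can be sketched as follows. The "commit" callback and the batch structure are hypothetical; counters stand in for the real flush and unpin operations:

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PAGES 16

/*
 * Hypothetical batching scheme: unmap calls queue pages on a freelist
 * and defer the io-tlb flush; a single "commit" flushes once, after
 * which it is safe to unpin everything on the list, because no stale
 * io-tlb entry can still translate to those pages.
 */
struct unmap_batch {
	unsigned long freelist[MAX_PAGES];  /* pages awaiting unpin */
	size_t nr;
	int flushes;                        /* io-tlb flushes issued */
	int unpinned;                       /* pages released back */
};

static void batch_unmap(struct unmap_batch *b, unsigned long pfn)
{
	/* remove the mapping, but do NOT flush or unpin yet */
	b->freelist[b->nr++] = pfn;
}

static void batch_commit(struct unmap_batch *b)
{
	b->flushes++;                 /* one flush covers the whole batch */
	for (size_t i = 0; i < b->nr; i++)
		b->unpinned++;        /* now safe to unpin */
	b->nr = 0;
}
```

Eight unmaps followed by one commit cost a single flush instead of eight, which is exactly the saving being discussed.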
Unfortunately we need to unmap before we destroy the domain because the
IOMMU is tracking the pages we have pinned. It's again a chicken-and-egg
problem: we need the IOMMU to know what to unpin, but we can't unpin it
until the IOMMU mapping is removed, because that would give the
userspace device access to an IOVA not backed by pinned memory.

> Unfortunately, 4.0 doesn't seem to have improved wrt VFIO unmap for me.
> Is there something extra I need to add to the ARM SMMU driver?

No, but see the comment above
drivers/vfio/vfio_iommu_type1.c:vfio_test_domain_fgsp(). It's an ugly
heuristic, but the only way I could come up with for improving
performance on VT-d without degrading performance on AMD-Vi. I don't
know what your hardware does, but if it's more like AMD-Vi, you may not
see a difference. Thanks,

Alex
Hi Will,

On Wed, Apr 01, 2015 at 06:03:30PM +0100, Will Deacon wrote:
> Agreed. How would you feel about restricting domains to be per-IOMMU
> instance?

If we find a good way to do that, it would be fine. We need to expose
information about individual IOMMUs for that. The question is whether we
need to expose it to the IOMMU-API user or just from the drivers to the
IOMMU core code. I prefer the second option, but if there is a good and
clean way to do the first, I am open for discussion.

> VFIO already copes with this, so I think we'd just need something to
> keep legacy KVM device passthrough working on x86.

VFIO uses a lot of implicit and internal knowledge about the backend
IOMMU drivers to decide whether a second domain is needed. I think this
logic should either be moved into the iommu core code or made generic
enough that it works not only for different supported page sizes, but
also for snooping flags or whatever other differences exist between
IOMMUs in a system.

> Maybe we could add a new domain type using your new series
> (DOMAIN_X86_KVM_LEGACY or something) and avoid the cross-IOMMU domain
> checks for that.

The legacy device assignment code can also be adapted to whatever
changes we agree upon; there is no need to support a stable API in the
iommu code.

Joerg
On Wed, Apr 01, 2015 at 05:36:18PM +0100, Will Deacon wrote:
> The issue (speaking in terms of the ARM SMMU, since that's what I'm familiar
> with) is that we don't know the page sizes until we've chosen our
> translation regime.

Can't you hard-code one regime in the driver and just not use the
others? Or might an SMMU only support a subset of the possible regimes?

Joerg
On Wed, Apr 01, 2015 at 10:28:54PM +0100, joro@8bytes.org wrote:
> On Wed, Apr 01, 2015 at 05:36:18PM +0100, Will Deacon wrote:
> > The issue (speaking in terms of the ARM SMMU, since that's what I'm familiar
> > with) is that we don't know the page sizes until we've chosen our
> > translation regime.
>
> Can't you hard-code one regime in the driver and just not use the
> others? Or might an SMMU only support a subset of the possible regimes?

Indeed, we can end up with systems that only support a subset of the
regimes. Different workloads may also benefit from different regimes due
to better TLB utilisation, but there's currently no way to bias the
page table allocator away from the CPU page size.

Will
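To illustrate why the supported page sizes fall out of the chosen translation regime: in the ARMv8 long-descriptor format, each translation granule implies a different set of block sizes (4K granule gives 4K/2M/1G, 16K gives 16K/32M, 64K gives 64K/512M). The sketch below is illustrative, not the io-pgtable code; only the size values come from the architecture:

```c
#include <assert.h>

#define SZ_4K   0x00001000UL
#define SZ_16K  0x00004000UL
#define SZ_64K  0x00010000UL
#define SZ_2M   0x00200000UL
#define SZ_32M  0x02000000UL
#define SZ_512M 0x20000000UL
#define SZ_1G   0x40000000UL

/*
 * Map a translation granule to the pgsize bitmap it supports.
 * Until the granule (i.e. the regime) is fixed, the driver cannot
 * report a meaningful pgsize_bitmap -- which is the problem Will
 * describes above.
 */
static unsigned long pgsize_bitmap_for_granule(unsigned long granule)
{
	switch (granule) {
	case SZ_4K:
		return SZ_4K | SZ_2M | SZ_1G;
	case SZ_16K:
		return SZ_16K | SZ_32M;
	case SZ_64K:
		return SZ_64K | SZ_512M;
	default:
		return 0;	/* unsupported granule */
	}
}
```

Note that the three bitmaps are pairwise disjoint, which is why a domain built for one regime cannot simply be attached behind an SMMU that only implements another.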