
[GIT,PULL] iommu: Kill off pgsize_bitmap field from struct iommu_ops

Message ID 20150327171946.GL1562@arm.com
State New

Pull-request

git://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git for-joerg/arm-smmu/pgsize

Message

Will Deacon March 27, 2015, 5:19 p.m. UTC
Hi Joerg,

Please can you pull the following IOMMU changes for 4.1? They move the
per-iommu_ops pgsize_bitmap field into the iommu_domain, which allows
IOMMUs such as the ARM SMMU to support different page sizes within a
given SoC.
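
In rough terms (an illustrative sketch of the end result, not the exact
diff), the field simply moves from the shared ops structure into each
domain:

	/* Before: one bitmap shared by every IOMMU instance using these ops */
	struct iommu_ops {
		unsigned long pgsize_bitmap;
		/* ... callbacks ... */
	};

	/* After: each domain carries its own bitmap, so two domains driven
	 * by the same ops can advertise different page sizes */
	struct iommu_domain {
		const struct iommu_ops *ops;
		unsigned long pgsize_bitmap;
		/* ... geometry, fault handler, etc. ... */
	};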

Cheers,

Will

--->8

The following changes since commit 06e5801b8cb3fc057d88cb4dc03c0b64b2744cda:

  Linux 4.0-rc4 (2015-03-15 17:38:20 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git for-joerg/arm-smmu/pgsize

for you to fetch changes up to 0232a7bfdab918a0cfdc7545abf09e359eb8c32c:

  iommu: of: enforce const-ness of struct iommu_ops (2015-03-20 17:39:34 +0000)

----------------------------------------------------------------
Robin Murphy (1):
      iommu: of: enforce const-ness of struct iommu_ops

Will Deacon (2):
      iommu: remove unused priv field from struct iommu_ops
      iommu: move pgsize_bitmap from struct iommu_ops to struct iommu_domain

 arch/arm/include/asm/dma-mapping.h   |  2 +-
 arch/arm/mm/dma-mapping.c            |  6 +++---
 arch/arm64/include/asm/dma-mapping.h |  3 ++-
 drivers/iommu/amd_iommu.c            |  2 +-
 drivers/iommu/arm-smmu.c             | 12 +++++-------
 drivers/iommu/exynos-iommu.c         |  3 ++-
 drivers/iommu/intel-iommu.c          |  2 +-
 drivers/iommu/iommu.c                | 16 ++++++++--------
 drivers/iommu/ipmmu-vmsa.c           |  2 +-
 drivers/iommu/msm_iommu.c            |  3 ++-
 drivers/iommu/of_iommu.c             | 12 ++++++------
 drivers/iommu/omap-iommu.c           |  3 ++-
 drivers/iommu/rockchip-iommu.c       |  2 +-
 drivers/iommu/shmobile-iommu.c       |  2 +-
 drivers/iommu/tegra-gart.c           |  2 +-
 drivers/iommu/tegra-smmu.c           |  3 +--
 drivers/of/platform.c                |  2 +-
 drivers/vfio/vfio_iommu_type1.c      |  2 +-
 include/linux/dma-mapping.h          |  2 +-
 include/linux/iommu.h                |  6 +-----
 include/linux/of_iommu.h             |  8 ++++----
 21 files changed, 46 insertions(+), 49 deletions(-)

Comments

Joerg Roedel March 31, 2015, 2:24 p.m. UTC | #1
Hi Will,

On Fri, Mar 27, 2015 at 05:19:46PM +0000, Will Deacon wrote:
> Please can you pull the following IOMMU changes for 4.1? They move the
> per-iommu_ops pgsize_bitmap field into the iommu_domain, which allows
> IOMMUs such as the ARM SMMU to support different page sizes within a
> given SoC.

I have some concerns about the direction taken with this patch-set. The
goal for the IOMMU-API is still to have domains that can be attached to
arbitrary devices (even when mappings already exist). But with this
patch-set we move into a direction where a domain can only be used on
IOMMUs that support the page-sizes required by the domain. In the end
this would be visible to the user of the IOMMU-API, which is not what we
want.

I can understand the motivation behind these patches, but we need to
think about how this could work with the desired semantics of the
IOMMU-API.


	Joerg
Will Deacon March 31, 2015, 2:49 p.m. UTC | #2
On Tue, Mar 31, 2015 at 03:24:40PM +0100, Joerg Roedel wrote:
> Hi Will,

Hi Joerg,

> On Fri, Mar 27, 2015 at 05:19:46PM +0000, Will Deacon wrote:
> > Please can you pull the following IOMMU changes for 4.1? They move the
> > per-iommu_ops pgsize_bitmap field into the iommu_domain, which allows
> > IOMMUs such as the ARM SMMU to support different page sizes within a
> > given SoC.
> 
> I have some concerns about the direction taken with this patch-set. The
> goal for the IOMMU-API is still to have domains that can be attached to
> arbitrary devices (even when mappings already exist). But with this
> patch-set we move into a direction where a domain can only be used on
> IOMMUs that support the page-sizes required by the domain. In the end
> this would be visible to the user of the IOMMU-API, which is not what we
> want.

But isn't this restriction already the case in practice? For example, if
I have a domain with some mapping already configured, then that mapping
will be using some fixed set of page sizes. Attaching a device behind
another IOMMU that doesn't support that page size would effectively require
the domain page tables to be freed and re-allocated from scratch.

So I don't think this patch series leaves us any worse off than we currently
are.

The plus points of the patches are that:

  - We can support different page sizes per domain (the ARM SMMU hardware
    really does support this and it would be nice to exploit that to gain
    better TLB utilisation)

  - We can support systems containing IOMMUs that don't support a common
    page size (I believe the arm64 Juno platform has this feature)

  - I don't have to manipulate a const data structure (iommu_ops) at runtime
    whenever I find a new IOMMU with a different set of supported page
    sizes.

> I can understand the motivation behind these patches, but we need to
> think about how this could work with the desired semantics of the
> IOMMU-API.

Do we have any code using this feature of the IOMMU API? I don't think it's
realistic in the general case to allow arbitrary devices to be attached to a
domain unless the domain can also span multiple IOMMUs. In that case, we'd
actually need multiple sets of page tables, potentially described using
different formats...

Will
Alex Williamson March 31, 2015, 3:50 p.m. UTC | #3
On Tue, 2015-03-31 at 15:49 +0100, Will Deacon wrote:
> On Tue, Mar 31, 2015 at 03:24:40PM +0100, Joerg Roedel wrote:
> > Hi Will,
> 
> Hi Joerg,
> 
> > On Fri, Mar 27, 2015 at 05:19:46PM +0000, Will Deacon wrote:
> > > Please can you pull the following IOMMU changes for 4.1? They move the
> > > per-iommu_ops pgsize_bitmap field into the iommu_domain, which allows
> > > IOMMUs such as the ARM SMMU to support different page sizes within a
> > > given SoC.
> > 
> > I have some concerns about the direction taken with this patch-set. The
> > goal for the IOMMU-API is still to have domains that can be attached to
> > arbitrary devices (even when mappings already exist). But with this
> > patch-set we move into a direction where a domain can only be used on
> > IOMMUs that support the page-sizes required by the domain. In the end
> > this would be visible to the user of the IOMMU-API, which is not what we
> > want.
> 
> But isn't this restriction already the case in practice? For example, if
> I have a domain with some mapping already configured, then that mapping
> will be using some fixed set of page sizes. Attaching a device behind
> another IOMMU that doesn't support that page size would effectively require
> the domain page tables to be freed and re-allocated from scratch.
> 
> So I don't think this patch series leaves us any worse off than we currently
> are.
> 
> The plus points of the patches are that:
> 
>   - We can support different page sizes per domain (the ARM SMMU hardware
>     really does support this and it would be nice to exploit that to gain
>     better TLB utilisation)
> 
>   - We can support systems containing IOMMUs that don't support a common
>     page size (I believe the arm64 Juno platform has this feature)
> 
>   - I don't have to manipulate a const data structure (iommu_ops) at runtime
>     whenever I find a new IOMMU with a different set of supported page
>     sizes.
> 
> > I can understand the motivation behind these patches, but we need to
> > think about how this could work with the desired semantics of the
> > IOMMU-API.
> 
> Do we have any code using this feature of the IOMMU API? I don't think it's
> realistic in the general case to allow arbitrary devices to be attached to a
> domain unless the domain can also span multiple IOMMUs. In that case, we'd
> actually need multiple sets of page tables, potentially described using
> different formats...

Legacy KVM assignment relies on being able to attach all the devices to
a single IOMMU domain and the hardware generally supports the domain
page table being used by multiple hardware units.  It's not without
issue though.  For instance, there's no hardware spec that requires that
all the hardware IOMMUs for an iommu_ops must support
IOMMU_CAP_CACHE_COHERENCY.  That's a per domain capability, not per
iommu_ops.  If we start with a device attached to an IOMMU that does
support this capability and create our mappings with the IOMMU_CACHE
protection flag, that domain is incompatible with other IOMMU hardware
units that do not support that capability.  On VT-d, the IOMMU API lets
us share the domain between hardware units, but we might get invalid
reserved field faults if we mix-n-match too much.

This is why VFIO had to add support for multiple IOMMU domains within a
VFIO container.  It used to be that a VFIO container was essentially a
1:1 abstraction of an IOMMU domain, but issues like IOMMU_CACHE forced
us to extend that abstraction.

It makes sense to me that supported page sizes has a similar problem to
IOMMU_CACHE; IOMMU mappings can be made that are dependent on the
composition of the domain at the time of mapping and there's no
requirement that all the IOMMU hardware units support the exact same
features.  VFIO already assumes that separate domains don't necessarily
use the same ops and we make sure mappings and un-mappings are aligned
to the smallest common size.  We'll have some work to do though if there
is no common size between the domains and we may need to add a test to
our notion of compatible domains if pgsize_bitmap moves from iommu_ops
(sorry, I forget whether you already had a patch for that in this pull
request).  Thanks,

Alex
Robin Murphy March 31, 2015, 4:07 p.m. UTC | #4
Hi Joerg,

On 31/03/15 15:24, Joerg Roedel wrote:
> Hi Will,
>
> On Fri, Mar 27, 2015 at 05:19:46PM +0000, Will Deacon wrote:
>> Please can you pull the following IOMMU changes for 4.1? They move the
>> per-iommu_ops pgsize_bitmap field into the iommu_domain, which allows
>> IOMMUs such as the ARM SMMU to support different page sizes within a
>> given SoC.
>
> I have some concerns about the direction taken with this patch-set. The
> goal for the IOMMU-API is still to have domains that can be attached to
> arbitrary devices (even when mappings already exist). But with this
> patch-set we move into a direction where a domain can only be used on
> IOMMUs that support the page-sizes required by the domain. In the end
> this would be visible to the user of the IOMMU-API, which is not what we
> want.

I think the notion of sharing domains between arbitrary devices is 
already broken for systems with multiple IOMMU devices - with the 
iommu_domain itself now embedded in a private data structure, it's 
certainly not safe to pass it to two different drivers, should the 
client devices be behind heterogeneous IOMMUs. Furthermore, since most 
of those private data structures contain instance-specific data, that 
prevents a domain spanning even two homogeneous IOMMU devices.

If the goal is for iommu_domains to be a hardware-independent 
abstraction, then shouldn't we be looking to move the other 
instance-specific features like attributes and geometry out of them? 
Perhaps the answer is the opposite; accept the existing iommu_domain as 
what it appears to be used as - a fairly thin abstraction of one or more 
context banks/uTLBs/DMARs/whatever on a single IOMMU device - and 
consider some higher-level interface for grouping one or more of these 
"device domains" together and sharing a single IOVA space between them.


Robin.

> I can understand the motivation behind these patches, but we need to
> think about how this could work with the desired semantics of the
> IOMMU-API.
>
>
> 	Joerg
>
Will Deacon April 1, 2015, 11:53 a.m. UTC | #5
Hi Alex, Joerg,

On Tue, Mar 31, 2015 at 04:50:50PM +0100, Alex Williamson wrote:
> On Tue, 2015-03-31 at 15:49 +0100, Will Deacon wrote:
> > On Tue, Mar 31, 2015 at 03:24:40PM +0100, Joerg Roedel wrote:
> > > On Fri, Mar 27, 2015 at 05:19:46PM +0000, Will Deacon wrote:
> > > > Please can you pull the following IOMMU changes for 4.1? They move the
> > > > per-iommu_ops pgsize_bitmap field into the iommu_domain, which allows
> > > > IOMMUs such as the ARM SMMU to support different page sizes within a
> > > > given SoC.
> > > 
> > > I have some concerns about the direction taken with this patch-set. The
> > > goal for the IOMMU-API is still to have domains that can be attached to
> > > arbitrary devices (even when mappings already exist). But with this
> > > patch-set we move into a direction where a domain can only be used on
> > > IOMMUs that support the page-sizes required by the domain. In the end
> > > this would be visible to the user of the IOMMU-API, which is not what we
> > > want.
> > 
> > But isn't this restriction already the case in practice? For example, if
> > I have a domain with some mapping already configured, then that mapping
> > will be using some fixed set of page sizes. Attaching a device behind
> > another IOMMU that doesn't support that page size would effectively require
> > the domain page tables to be freed and re-allocated from scratch.
> > 
> > So I don't think this patch series leaves us any worse off than we currently
> > are.
> > 
> > The plus points of the patches are that:
> > 
> >   - We can support different page sizes per domain (the ARM SMMU hardware
> >     really does support this and it would be nice to exploit that to gain
> >     better TLB utilisation)
> > 
> >   - We can support systems containing IOMMUs that don't support a common
> >     page size (I believe the arm64 Juno platform has this feature)
> > 
> >   - I don't have to manipulate a const data structure (iommu_ops) at runtime
> >     whenever I find a new IOMMU with a different set of supported page
> >     sizes.
> > 
> > > I can understand the motivation behind these patches, but we need to
> > > think about how this could work with the desired semantics of the
> > > IOMMU-API.
> > 
> > Do we have any code using this feature of the IOMMU API? I don't think it's
> > realistic in the general case to allow arbitrary devices to be attached to a
> > domain unless the domain can also span multiple IOMMUs. In that case, we'd
> > actually need multiple sets of page tables, potentially described using
> > different formats...
> 
> Legacy KVM assignment relies on being able to attach all the devices to
> a single IOMMU domain and the hardware generally supports the domain
> page table being used by multiple hardware units.  It's not without
> issue though.  For instance, there's no hardware spec that requires that
> all the hardware IOMMUs for an iommu_ops must support
> IOMMU_CAP_CACHE_COHERENCY.  That's a per domain capability, not per
> iommu_ops.  If we start with a device attached to an IOMMU that does
> support this capability and create our mappings with the IOMMU_CACHE
> protection flag, that domain is incompatible with other IOMMU hardware
> units that do not support that capability.  On VT-d, the IOMMU API lets
> us share the domain between hardware units, but we might get invalid
> reserved field faults if we mix-n-match too much.

FWIW, legacy KVM assignment is also only supported on x86 so hopefully
the mixing and matching is somewhat limited by the available platforms.

> This is why VFIO had to add support for multiple IOMMU domains within a
> VFIO container.  It used to be that a VFIO container was essentially a
> 1:1 abstraction of an IOMMU domain, but issues like IOMMU_CACHE forced
> us to extend that abstraction.

Agreed. We also need to cater for multiple, heterogeneous IOMMUs in the
system too. For example, the new ChromeBit device based on rk3288 contains
both a rockchip iommu (rockchip-iommu.c) *and* an ARM SMMU (arm-smmu.c).
These IOMMUs both deal with masters on the platform bus and they have
separate page table formats.

Granted, this patch series doesn't address that problem, but it doesn't
make it worse either.

> It makes sense to me that supported page sizes has a similar problem to
> IOMMU_CACHE; IOMMU mappings can be made that are dependent on the
> composition of the domain at the time of mapping and there's no
> requirement that all the IOMMU hardware units support the exact same
> features.  VFIO already assumes that separate domains don't necessarily
> use the same ops and we make sure mappings and un-mappings are aligned
> to the smallest common size.  We'll have some work to do though if there
> is no common size between the domains and we may need to add a test to
> our notion of compatible domains if pgsize_bitmap moves from iommu_ops
> (sorry, I forget whether you already had a patch for that in this pull
> request).  Thanks,

The series updates vfio_pgsize_bitmap to use the domains, so dma_do_map
will fail if there's no common page size amongst the IOMMUs referenced
by the container. That preserves the existing behaviour, but I'm not
sure why it's actually required. Couldn't we potentially allow different
domains to use different page sizes? We'd just need to advertise the
page sizes that are large enough to be supported by all domains. iommu_map
will then take care of breaking them down into smaller chunks when required.
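
As a rough illustration (the names and array-based layout here are made
up, not the real VFIO structures), that check is just an intersection of
the per-domain bitmaps:

	static unsigned long common_pgsize_bitmap(const unsigned long *domain_bitmaps,
						  unsigned int nr_domains)
	{
		unsigned long bitmap = ~0UL;
		unsigned int i;

		/* Only page sizes supported by every attached domain survive */
		for (i = 0; i < nr_domains; i++)
			bitmap &= domain_bitmaps[i];

		return bitmap;	/* 0 means no size is common, so the map fails */
	}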

On the ARM SMMU, you may have the choice of a few different page sizes but,
once you've chosen one (actually a subset), you can't use the others for
that domain. This is currently a decision made when the first device is
attached to a domain (i.e. when we allocate the initial page table) and we
basically try to find the page size closest to the one in use by the CPU.
See arm_lpae_restrict_pgsizes for the gory details. With the current code,
that means when one domain decides to use a particular subset, we have to
update (const) iommu_ops->pgsize_bitmap and therefore needlessly prevent any
other domain on any other ARM SMMU from using a different set of page sizes.

Joerg: are you still against merging this? If so, what do you need to
see in order to change your mind? This is certainly something that we
need on ARM-based systems and, tbh, the direction that the IOMMU core is
going seems to be towards a N:1 domain:IOMMU mapping anyway (c.f. your
series to embed struct iommu_domain in a driver-private data structure).

Will
David Woodhouse April 1, 2015, 1:14 p.m. UTC | #6
On Fri, 2015-03-27 at 17:19 +0000, Will Deacon wrote:
> 
> Please can you pull the following IOMMU changes for 4.1? They move the
> per-iommu_ops pgsize_bitmap field into the iommu_domain, which allows
> IOMMUs such as the ARM SMMU to support different page sizes within a
> given SoC.

Can't we kill it entirely? It's always been horrid. Just let the IOMMU
driver fill the page tables as it sees fit according to the virtual/physical
size and alignment of the range it's asked to map.

We already lie about the supported page sizes in the Intel VT-d driver
and claim to support everything, just to make the caller "Just Ask" as
it should.

We might have to fix the loop that KVM uses to unmap pages, and
potentially the bizarre unmap() API which returns the page size that it
unmapped, but we should do that anyway :)

Perhaps we could have an unmap function which returns a list of unmapped
pages chained on pg->freelist, if this was the last use of those pages
(as ISTR it is in the KVM case).
Will Deacon April 1, 2015, 1:39 p.m. UTC | #7
On Wed, Apr 01, 2015 at 02:14:11PM +0100, David Woodhouse wrote:
> On Fri, 2015-03-27 at 17:19 +0000, Will Deacon wrote:
> > 
> > Please can you pull the following IOMMU changes for 4.1? They move the
> > per-iommu_ops pgsize_bitmap field into the iommu_domain, which allows
> > IOMMUs such as the ARM SMMU to support different page sizes within a
> > given SoC.
> 
> Can't we kill it entirely? It's always been horrid. Just let the IOMMU
> driver fill the page tables as it sees fit according to the virtual/physical
> size and alignment of the range it's asked to map.
> 
> We already lie about the supported page sizes in the Intel VT-d driver
> and claim to support everything, just to make the caller "Just Ask" as
> it should.
> 
> We might have to fix the loop that KVM uses to unmap pages, and
> potentially the bizarre unmap() API which returns the page size that it
> unmapped, but we should do that anyway :)

We'd also need to update any page-table code that relies on being called
one page at a time (i.e. the new io-pgtable stuff we merged for 4.0) so
that we iterate over the requested range. That shouldn't be too difficult
(i.e. just inline the iommu_map logic into each driver), but it would
need some testing.

In fact, that would also allow us to nuke the whole TLB on VFIO unmap of
large ranges. We currently iterate over the range page-by-page, which takes
ages (minutes) in simulation.

Will
David Woodhouse April 1, 2015, 1:52 p.m. UTC | #8
On Wed, 2015-04-01 at 14:39 +0100, Will Deacon wrote:
> We'd also need to update any page-table code that relies on being called
> one page at a time (i.e. the new io-pgtable stuff we merged for 4.0) so
> that we iterate over the requested range.

Ick. I hadn't noticed we'd entrenched that mistake even harder. It
certainly *wasn't* that hard to fix it...

>  That shouldn't be too difficult (i.e. just inline the iommu_map logic
> into each driver), but it would need some testing.

Good.

> In fact, that would also allow us to nuke the whole TLB on VFIO unmap of
> large ranges. We currently iterate over the range page-by-page, which takes
> ages (minutes) in simulation.

Which is one of the big reasons we lie about page sizes in VT-d.
Will Deacon April 1, 2015, 2:05 p.m. UTC | #9
On Wed, Apr 01, 2015 at 02:52:57PM +0100, David Woodhouse wrote:
> On Wed, 2015-04-01 at 14:39 +0100, Will Deacon wrote:
> > We'd also need to update any page-table code that relies on being called
> > one page at a time (i.e. the new io-pgtable stuff we merged for 4.0) so
> > that we iterate over the requested range.
> 
> Ick. I hadn't noticed we'd entrenched that mistake even harder. It
> certainly *wasn't* that hard to fix it...

Well, at least it's now in one place! (before, the same mistake was in
each driver).

> >  That shouldn't be too difficult (i.e. just inline the iommu_map logic
> > into each driver), but it would need some testing.
> 
> Good.
> 
> > In fact, that would also allow us to nuke the whole TLB on VFIO unmap of
> > large ranges. We currently iterate over the range page-by-page, which takes
> > ages (minutes) in simulation.
> 
> Which is one of the big reasons we lie about page sizes in VT-d.

Makes sense for now.

That just leaves the case where the caller tries to map something smaller
than a page. Do we want to return the amount mapped in iommu_map?

Will
David Woodhouse April 1, 2015, 2:28 p.m. UTC | #10
On Wed, 2015-04-01 at 15:05 +0100, Will Deacon wrote:
> That just leaves the case where the caller tries to map something
> smaller than a page. Do we want to return the amount mapped in
> iommu_map?

Perhaps we should just expose the *minimum* page size? Although for the
IOMMU API I thought that was always assumed to be the same as PAGE_SIZE,
and for the DMA API it doesn't matter much because you're *given* the
IOVA, and you just know that it's mapped *enough*.
Will Deacon April 1, 2015, 2:39 p.m. UTC | #11
On Wed, Apr 01, 2015 at 03:28:10PM +0100, David Woodhouse wrote:
> On Wed, 2015-04-01 at 15:05 +0100, Will Deacon wrote:
> > That just leaves the case where the caller tries to map something
> > smaller than a page. Do we want to return the amount mapped in
> > iommu_map?
> 
> Perhaps we should just expose the *minimum* page size? Although for the
> IOMMU API I thought that was always assumed to be the same as PAGE_SIZE,
> and for the DMA API it doesn't matter much because you're *given* the
> IOVA, and you just know that it's mapped *enough*.

We can certainly run with a 64k CPU page size and a 4k IOMMU page size on
arm64 systems. The other way around is much more problematic, so I think
you're right to suggest the minimum page size.

However, once we have that, we run into the same problem that we've got
with the current pgsize_bitmap. Namely, that it needs to be a per-domain
property to avoid it changing dynamically following an initial map request
or a probe of a new IOMMU device.

Will
David Woodhouse April 1, 2015, 2:46 p.m. UTC | #12
On Wed, 2015-04-01 at 15:39 +0100, Will Deacon wrote:
> 
> However, once we have that, we run into the same problem that we've got
> with the current pgsize_bitmap. Namely, that it needs to be a per-domain
> property to avoid it changing dynamically following an initial map request
> or a probe of a new IOMMU device.

That's not so much of a problem, surely? When adding devices to a new
domain without any mappings, the minimum page size of the domain is the
largest of all the IOMMUs participating in that domain.

And if you try to add new devices after you've already started making
mappings, we refuse the addition if it would raise the minimum page size
higher than the page sizes you're already using.
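
Something like the following sketch captures that rule (the structure and
names are hypothetical, not existing kernel code):

	struct example_domain {
		unsigned long min_pgsize;	/* current minimum for the domain */
		int has_mappings;
	};

	static int example_attach(struct example_domain *dom,
				  unsigned long iommu_min_pgsize)
	{
		/* The new IOMMU is no more restrictive: nothing to do */
		if (iommu_min_pgsize <= dom->min_pgsize)
			return 0;

		/* Raising the minimum would invalidate existing mappings */
		if (dom->has_mappings)
			return -1;

		/* Empty domain: adopt the larger minimum page size */
		dom->min_pgsize = iommu_min_pgsize;
		return 0;
	}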

In VT-d we already have very similar logic for superpage and snooping
support.
Joerg Roedel April 1, 2015, 3:38 p.m. UTC | #13
Hi Will,

On Tue, Mar 31, 2015 at 03:49:56PM +0100, Will Deacon wrote:
> But isn't this restriction already the case in practice? For example, if
> I have a domain with some mapping already configured, then that mapping
> will be using some fixed set of page sizes. Attaching a device behind
> another IOMMU that doesn't support that page size would effectively require
> the domain page tables to be freed and re-allocated from scratch.

The problem is that this restriction depends on the IOMMU driver in use.
From the beginning of the IOMMU-API the backend drivers have supported
sharing a domain by multiple devices. In fact, the domain is just an
abstraction for an address space that can be attached to devices just
like cpu page-tables are assigned to threads of a process.

We can discuss whether this fundamental concept of the IOMMU-API needs
to be changed (moving into the direction this patch-set proposes). I
could imagine removing the domain concept entirely and just have
map/unmap functions like this:

	iommu_map(struct device *dev, dma_addr_t iova,
		  phys_addr_t phys, size_t size)
	iommu_unmap(struct device *dev, dma_addr_t iova,
		    phys_addr_t phys, size_t size)

This would require some changes for the users of the IOMMU-API, but it
shouldn't be too hard.

Or we keep the desired semantics as they are now and do everything
needed for that in the IOMMU drivers. For the arm-smmu this would mean
exposing a common set of supported pgsizes between IOMMUs, or building
multiple page-tables to match the different IOMMU capabilities.

I am open to discussion either way, but I like the current
semantics a bit more, as it allows us to share page-tables between
devices and we can move all of the nasty code in VFIO that already
creates multiple domains to get different page-tables into the
IOMMU core or the drivers (where it belongs).

What I definitely don't want is a mixture of both concepts depending on
the IOMMU driver in use. We should settle on one way and force the
drivers to behave accordingly. We don't need a common API when every
driver behaves differently.


	Joerg
Joerg Roedel April 1, 2015, 3:53 p.m. UTC | #14
Hi Will,

On Wed, Apr 01, 2015 at 12:53:40PM +0100, Will Deacon wrote:
> Joerg: are you still against merging this? If so, what do you need to
> see in order to change your mind? This is certainly something that we
> need on ARM-based systems and, tbh, the direction that the IOMMU core is
> going seems to be towards a N:1 domain:IOMMU mapping anyway (c.f. your
> series to embed struct iommu_domain in a driver-private data structure).

What I want is consistent behavior between drivers. Until now I tried to
push for the domain concept, which allows arbitrary devices to be attached
to a domain. This patch-set makes it impossible to achieve that goal,
which is why I am resistant to it.

We can discuss moving into another direction with the IOMMU-API, and
re-evaluate then if this patch-set is still required.


	Joerg
Will Deacon April 1, 2015, 4:36 p.m. UTC | #15
On Wed, Apr 01, 2015 at 03:46:10PM +0100, David Woodhouse wrote:
> On Wed, 2015-04-01 at 15:39 +0100, Will Deacon wrote:
> > However, once we have that, we run into the same problem that we've got
> > with the current pgsize_bitmap. Namely, that it needs to be a per-domain
> > property to avoid it changing dynamically following an initial map request
> > or a probe of a new IOMMU device.
> 
> That's not so much of a problem, surely? When adding devices to a new
> domain without any mappings, the minimum page size of the domain is the
> largest of all the IOMMUs participating in that domain.

The issue (speaking in terms of the ARM SMMU, since that's what I'm familiar
with) is that we don't know the page sizes until we've chosen our
translation regime.

For example, we may have an SMMU that supports the following page sizes

  1:  4k | 2M | 1G
  2: 16k | 32M
  3: 64k | 512M

so a domain on the SMMU can use one of those three regimes. Other domains
*on the same SMMU* may use a different regime. That means we've actually
got three minimum page sizes for the SMMU.
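
To put that in concrete terms (the constants below just spell out the
example regimes as bitmaps; they are not the arm-smmu driver's actual
tables):

	#define SZ_4K	0x00001000UL
	#define SZ_2M	0x00200000UL
	#define SZ_1G	0x40000000UL
	#define SZ_16K	0x00004000UL
	#define SZ_32M	0x02000000UL
	#define SZ_64K	0x00010000UL
	#define SZ_512M	0x20000000UL

	/* One bitmap per translation regime; a domain commits to a single row */
	static const unsigned long regime_pgsizes[] = {
		SZ_4K  | SZ_2M  | SZ_1G,	/* regime 1: minimum 4k  */
		SZ_16K | SZ_32M,		/* regime 2: minimum 16k */
		SZ_64K | SZ_512M,		/* regime 3: minimum 64k */
	};

	/* The minimum page size of a regime is simply its lowest set bit */
	static unsigned long regime_min_pgsize(unsigned long pgsizes)
	{
		return pgsizes & -pgsizes;
	}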

Therefore, the page size we end up using for a domain depends on both:

  (a) Whether or not we've probed another SMMU (using the same driver) that
      goes and further restricts iommu_ops->min_pgsize (or whatever it ends
      up being called)

  (b) Which one of the available regimes is chosen by the page-table code.

Currently (b) tries to get something close to the CPU page size, but you
could imagine adding an interface to bias the selection towards some other
size. We could pick the largest minimum page size of the supported regimes
(I can't word this better... it would be 64k in the example above), but I
worry that this will end up being too big in practice and we'll over-map
for most small mappings.

If we put the min_pgsize in the domain, we can just describe it using the
regime that domain is using.

> And if you try to add new devices after you've already started making
> mappings, we refuse the addition if it would raise the minimum page size
> higher than the page sizes you're already using.

Ok, so I think that solves (a) above, but means we have to keep track of
the min_pgsize in the domain anyway so that we can check against it, in
which case it may just as well be removed from iommu_ops.

Will
Alex Williamson April 1, 2015, 4:45 p.m. UTC | #16
On Wed, 2015-04-01 at 12:53 +0100, Will Deacon wrote:
> On Tue, Mar 31, 2015 at 04:50:50PM +0100, Alex Williamson wrote:
> > It makes sense to me that supported page sizes has a similar problem to
> > IOMMU_CACHE; IOMMU mappings can be made that are dependent on the
> > composition of the domain at the time of mapping and there's no
> > requirement that all the IOMMU hardware units support the exact same
> > features.  VFIO already assumes that separate domains don't necessarily
> > use the same ops and we make sure mappings and un-mappings are aligned
> > to the smallest common size.  We'll have some work to do though if there
> > is no common size between the domains and we may need to add a test to
> > our notion of compatible domains if pgsize_bitmap moves from iommu_ops
> > (sorry, I forget whether you already had a patch for that in this pull
> > request).  Thanks,
> 
> The series updates vfio_pgsize_bitmap to use the domains, so dma_do_map
> will fail if there's no common page size amongst the IOMMUs referenced
> by the container. That preserves the existing behaviour, but I'm not
> sure why it's actually required. Couldn't we potentially allow different
> domains to use different page sizes?

We can.  I think the only thing we're trying to do with it now is
confirm that the user mapping meets the minimum size and alignment
requirements of the IOMMU.  We don't use it as a basis for our actual
mappings, it's the IOMMU API's job to do that.  Thanks,

Alex
Alex Williamson April 1, 2015, 4:51 p.m. UTC | #17
On Wed, 2015-04-01 at 15:52 +0200, David Woodhouse wrote:
> On Wed, 2015-04-01 at 14:39 +0100, Will Deacon wrote:
> > We'd also need to update any page-table code that relies on being called
> > one page at a time (i.e. the new io-pgtable stuff we merged for 4.0) so
> > that we iterate over the requested range.
> 
> Ick. I hadn't noticed we'd entrenched that mistake even harder. It
> certainly *wasn't* that hard to fix it...
> 
> >  That shouldn't be too difficult (i.e. just inline the iommu_map logic
> > into each driver), but it would need some testing.
> 
> Good.
> 
> > In fact, that would also allow us to nuke the whole TLB on VFIO unmap of
> > large ranges. We currently iterate over the range page-by-page, which takes
> > ages (minutes) in simulation.
> 
> Which is one of the big reasons we lie about page sizes in VT-d.

I think (hope) we've solved a significant portion of the unmap problem
with the chunking patches that are now in v4.0.  We should now be
attempting to find the largest contiguous range to unmap instead of
doing the page-by-page unmap that was mostly hoping the IOMMU actually
used a super page.  I'm not sure how lying about page sizes had any
effect on this though.  In order for vfio to get any smarter about it
we'd need a "commit" callback and to be able to use the freelist to
track pages to unpin after commit.  Thanks,

Alex
Will Deacon April 1, 2015, 5:03 p.m. UTC | #18
On Wed, Apr 01, 2015 at 04:38:54PM +0100, Joerg Roedel wrote:
> Hi Will,

Hi Joerg,

> On Tue, Mar 31, 2015 at 03:49:56PM +0100, Will Deacon wrote:
> > But isn't this restriction already the case in practice? For example, if
> > I have a domain with some mapping already configured, then that mapping
> > will be using some fixed set of page sizes. Attaching a device behind
> > another IOMMU that doesn't support that page size would effectively require
> > the domain page tables to be freed and re-allocated from scratch.
> 
> The problem is that this restriction depends on the IOMMU driver in use.
> From the beginning of the IOMMU-API the backend drivers have supported
> sharing a domain by multiple devices. In fact, the domain is just an
> abstraction for an address space that can be attached to devices just
> like cpu page-tables are assigned to threads of a process.

I think the domain is a useful abstraction; I just don't think it works
too well when you attach devices upstream of different IOMMUs to the same
domain. If all the devices attached to a domain share an IOMMU, everything
works nicely.

> We can discuss whether this fundamental concept of the IOMMU-API needs
> to be changed (moving into the direction this patch-set proposes). I
> could imagine removing the domain concept entirely and just have
> map/unmap functions like this:
> 
> 	iommu_map(struct device *dev, dma_addr_t iova,
> 		  phys_addr_t phys, size_t size)
> 	iommu_unmap(struct device *dev, dma_addr_t iova,
> 		    phys_addr_t phys, size_t size)
> 
> This would require some changes for the users of the IOMMU-API, but it
> shouldn't be too hard.
> 
> Or we keep the desired semantics as they are now and do everything
> needed for that in the IOMMU drivers. For the arm-smmu this would mean
> exposing a common set of supported pgsizes between IOMMUs, or building
> multiple page-tables to match the different IOMMU capabilities.
> 
> I am open to discussion either way, but I like the current
> semantics a bit more, as it allows us to share page-tables between
> devices and we can move all of the nasty code in VFIO that already
> creates multiple domains to get different page-tables into the
> IOMMU core or the drivers (where it belongs).

I agree that sharing page tables is desirable and I certainly wouldn't
suggest removing the domain abstraction.

> What I definitely don't want is a mixture of both concepts depending on
> the IOMMU driver in use. We should settle on one way and force the
> drivers to behave accordingly. We don't need a common API when every
> driver behaves differently.

Agreed. How would you feel about restricting domains to be per-IOMMU
instance? VFIO already copes with this, so I think we'd just need
something to keep legacy KVM device passthrough working on x86. Maybe we
could add a new domain type using your new series (DOMAIN_X86_KVM_LEGACY
or something) and avoid the cross-IOMMU domain checks for that.

Will
Will Deacon April 1, 2015, 5:50 p.m. UTC | #19
Hi Alex,

On Wed, Apr 01, 2015 at 05:51:04PM +0100, Alex Williamson wrote:
> On Wed, 2015-04-01 at 15:52 +0200, David Woodhouse wrote:
> > On Wed, 2015-04-01 at 14:39 +0100, Will Deacon wrote:
> > > In fact, that would also allow us to nuke the whole TLB on VFIO unmap of
> > > large ranges. We currently iterate over the range page-by-page, which takes
> > > ages (minutes) in simulation.
> > 
> > Which is one of the big reasons we lie about page sizes in VT-d.
> 
> I think (hope) we've solved a significant portion of the unmap problem
> with the chunking patches that are now in v4.0.  We should now be
> attempting to find the largest contiguous range to unmap instead of
> doing the page-by-page unmap that was mostly hoping the IOMMU actually
> used a super page.  I'm not sure how lying about page sizes had any
> effect on this though.  In order for vfio to get any smarter about it
> we'd need a "commit" callback and to be able to use the freelist to
> track pages to unpin after commit.  Thanks,

Just so I understand this correctly, is that "commit" callback to ensure
that the domain_destroy has actually taken effect on the IOMMU (e.g. by
invalidating the TLBs and waiting for that to take effect)?

The io-pgtable API has a tlb_sync callback, but actually I made
tlb_flush_all (which is called when destroying the page tables) always
synchronous so that you wouldn't need to unmap anything explicitly in
that case.

Unfortunately, 4.0 doesn't seem to have improved wrt VFIO unmap for me.
Is there something extra I need to add to the ARM SMMU driver?

Will
Alex Williamson April 1, 2015, 6:18 p.m. UTC | #20
On Wed, 2015-04-01 at 18:50 +0100, Will Deacon wrote:
> Hi Alex,
> 
> On Wed, Apr 01, 2015 at 05:51:04PM +0100, Alex Williamson wrote:
> > On Wed, 2015-04-01 at 15:52 +0200, David Woodhouse wrote:
> > > On Wed, 2015-04-01 at 14:39 +0100, Will Deacon wrote:
> > > > In fact, that would also allow us to nuke the whole TLB on VFIO unmap of
> > > > large ranges. We currently iterate over the range page-by-page, which takes
> > > > ages (minutes) in simulation.
> > > 
> > > Which is one of the big reasons we lie about page sizes in VT-d.
> > 
> > I think (hope) we've solved a significant portion of the unmap problem
> > with the chunking patches that are now in v4.0.  We should now be
> > attempting to find the largest contiguous range to unmap instead of
> > doing the page-by-page unmap that was mostly hoping the IOMMU actually
> > used a super page.  I'm not sure how lying about page sizes had any
> > effect on this though.  In order for vfio to get any smarter about it
> > we'd need a "commit" callback and to be able to use the freelist to
> > track pages to unpin after commit.  Thanks,
> 
> Just so I understand this correctly, is that "commit" callback to ensure
> that the domain_destroy has actually taken effect on the IOMMU (e.g. by
> invalidating the TLBs and waiting for that to take effect)?

Not necessarily domain_destroy, but simply large unmaps that require
multiple calls to iommu_unmap().  Each call requires the IOMMU driver to
flush io-tlb page structures.  If we could save the io-tlb flushes all
to the end, I imagine we'd get a good improvement in efficiency.  We
can't unpin pages until after the flush though, so we'd need something
like the page freelist to bridge the gap.

> The io-pgtable API has a tlb_sync callback, but actually I made
> tlb_flush_all (which is called when destroying the page tables) always
> synchronous so that you wouldn't need to unmap anything explicitly in
> that case.

Unfortunately we need to unmap before we destroy the domain because the
IOMMU is tracking the pages we have pinned.  It's again a chicken and
egg problem, we need the IOMMU to know what to unpin, but we can't unpin
it until the IOMMU mapping is removed because that would give the
userspace device access to an IOVA not backed by pinned memory.

> Unfortunately, 4.0 doesn't seem to have improved wrt VFIO unmap for me.
> Is there something extra I need to add to the ARM SMMU driver?

No, but see the comment above
drivers/vfio/vfio_iommu_type1.c:vfio_test_domain_fgsp(). It's an ugly
heuristic, but the only way I could come up with for improving
performance on VT-d w/o degrading performance on AMD-Vi.  I don't know
what your hardware does, but if it's more like AMD-Vi, you may not see a
difference.  Thanks,

Alex
Joerg Roedel April 1, 2015, 9:24 p.m. UTC | #21
Hi Will,

On Wed, Apr 01, 2015 at 06:03:30PM +0100, Will Deacon wrote:
> Agreed. How would you feel about restricting domains to be per-IOMMU
> instance?

If we find a good way to do that it would be fine. We need to expose
information about individual IOMMUs for that. The question is whether we
need to expose them to the IOMMU-API user or just from the drivers to
the IOMMU core code.

I prefer the second option, but if there is a good and clean way for the
first one I am open for discussions.

> VFIO already copes with this, so I think we'd just need something to
> keep legacy KVM device passthrough working on x86.

VFIO uses a lot of implicit and internal knowledge about the backend
IOMMU drivers to decide if a second domain is needed. I think this logic
should either be moved into the iommu core code or made generic enough
that it works not only for different supported page sizes, but also for
snooping flags or whatever differences exist between IOMMUs in a system.

> Maybe we could add a new domain type using your new series
> (DOMAIN_X86_KVM_LEGACY or something) and avoid the cross-IOMMU domain
> checks for that.

The legacy device assignment code can also be adapted to whatever
changes we agree upon, no need to support a stable API in the iommu
code.


	Joerg
Joerg Roedel April 1, 2015, 9:28 p.m. UTC | #22
On Wed, Apr 01, 2015 at 05:36:18PM +0100, Will Deacon wrote:
> The issue (speaking in terms of the ARM SMMU, since that's what I'm familiar
> with) is that we don't know the page sizes until we've chosen our
> translation regime.

Can't you hard-code one regime in the driver and just not use the
others? Or might an SMMU only support a subset of the possible regimes?


	Joerg
Will Deacon April 2, 2015, 8:58 a.m. UTC | #23
On Wed, Apr 01, 2015 at 10:28:54PM +0100, joro@8bytes.org wrote:
> On Wed, Apr 01, 2015 at 05:36:18PM +0100, Will Deacon wrote:
> > The issue (speaking in terms of the ARM SMMU, since that's what I'm familiar
> > with) is that we don't know the page sizes until we've chosen our
> > translation regime.
> 
> Can't you hard-code one regime in the driver and just not use the
> others? Or might an SMMU only support a subset of the possible regimes?

Indeed, we can end up with systems that only support a subset of the
regimes. Different workloads may also benefit from different regimes due
to better TLB utilisation, but there's not currently a way to bias the page
table allocator away from the CPU page size.

Will