[v6,00/25] iommu: Shared Virtual Addressing for SMMUv3

Message ID 20200430143424.2787566-1-jean-philippe@linaro.org

Message

Jean-Philippe Brucker April 30, 2020, 2:33 p.m. UTC
Shared Virtual Addressing (SVA) allows sharing process page tables with
devices using the IOMMU, PASIDs and I/O page faults. Add SVA support to
the Arm SMMUv3 driver.
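
For device drivers, the result looks roughly like this (a sketch based on
the existing IOMMU SVA API; my_dev_program_pasid() is a made-up
device-specific hook, and error paths are abbreviated):

#include <linux/iommu.h>
#include <linux/sched.h>

static int my_dev_bind_current_mm(struct device *dev)
{
	struct iommu_sva *handle;
	int pasid;

	/* Assumes the driver already enabled IOMMU_DEV_FEAT_SVA */
	handle = iommu_sva_bind_device(dev, current->mm, NULL);
	if (IS_ERR(handle))
		return PTR_ERR(handle);

	pasid = iommu_sva_get_pasid(handle);
	/* Tag DMA with this PASID; faults resolve against current->mm */
	my_dev_program_pasid(dev, pasid);
	return 0;
}

The driver calls iommu_sva_unbind_device(handle) once DMA for this PASID
has stopped.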

Since v5 [1]:

* Added patches 1-3. Patch 1 adds a PASID field to mm_struct as
  discussed in [1] and [2]. This is also needed for Intel ENQCMD. Patch
  2 adds refcounts to IOASID and patch 3 adds a couple of helpers to
  allocate the PASID.

* Dropped most of iommu-sva.c. After getting rid of io_mm following
  review of v5, there wasn't enough generic code left to justify the
  indirect branch overhead of io_mm_ops in the MMU notifiers. I ended up
  with more glue than useful code, and couldn't find an easy way to deal
  with domains in the SMMU driver (we keep PASID tables per domain,
  while x86 keeps them per device). The direct approach in patch 17 is
  nicer and a little easier to read. The SMMU driver only gained 160
  lines, while iommu-sva lost 470 lines.

  As a result I dropped the MMU notifier patch.

  Jacob, one upside of this rework is that we now free ioasids in
  blocking context, which might help with your addition of notifiers to
  ioasid.c.

* Simplified io-pgfault a bit, since flush() isn't called from the mm
  exit path anymore.

* Fixed a bug in patch 17 (don't clear the stall bit when stall is
  forced).

You can find the latest version on https://jpbrucker.net/git/linux,
branch sva/current, and sva/zip-devel for the HiSilicon zip accelerator.

[1] https://lore.kernel.org/linux-iommu/20200414170252.714402-1-jean-philippe@linaro.org/
[2] https://lore.kernel.org/linux-iommu/1585596788-193989-6-git-send-email-fenghua.yu@intel.com/

Jean-Philippe Brucker (25):
  mm: Add a PASID field to mm_struct
  iommu/ioasid: Add ioasid references
  iommu/sva: Add PASID helpers
  iommu: Add a page fault handler
  iommu/iopf: Handle mm faults
  arm64: mm: Add asid_gen_match() helper
  arm64: mm: Pin down ASIDs for sharing mm with devices
  iommu/io-pgtable-arm: Move some definitions to a header
  iommu/arm-smmu-v3: Manage ASIDs with xarray
  arm64: cpufeature: Export symbol read_sanitised_ftr_reg()
  iommu/arm-smmu-v3: Share process page tables
  iommu/arm-smmu-v3: Seize private ASID
  iommu/arm-smmu-v3: Add support for VHE
  iommu/arm-smmu-v3: Enable broadcast TLB maintenance
  iommu/arm-smmu-v3: Add SVA feature checking
  iommu/arm-smmu-v3: Add SVA device feature
  iommu/arm-smmu-v3: Implement iommu_sva_bind/unbind()
  iommu/arm-smmu-v3: Hook up ATC invalidation to mm ops
  iommu/arm-smmu-v3: Add support for Hardware Translation Table Update
  iommu/arm-smmu-v3: Maintain a SID->device structure
  dt-bindings: document stall property for IOMMU masters
  iommu/arm-smmu-v3: Add stall support for platform devices
  PCI/ATS: Add PRI stubs
  PCI/ATS: Export PRI functions
  iommu/arm-smmu-v3: Add support for PRI

 drivers/iommu/Kconfig                         |   11 +
 drivers/iommu/Makefile                        |    2 +
 .../devicetree/bindings/iommu/iommu.txt       |   18 +
 arch/arm64/include/asm/mmu.h                  |    1 +
 arch/arm64/include/asm/mmu_context.h          |   11 +-
 drivers/iommu/io-pgtable-arm.h                |   30 +
 drivers/iommu/iommu-sva.h                     |   15 +
 include/linux/ioasid.h                        |   10 +-
 include/linux/iommu.h                         |   53 +
 include/linux/mm_types.h                      |    4 +
 include/linux/pci-ats.h                       |    8 +
 arch/arm64/kernel/cpufeature.c                |    1 +
 arch/arm64/mm/context.c                       |  103 +-
 drivers/iommu/arm-smmu-v3.c                   | 1554 +++++++++++++++--
 drivers/iommu/io-pgfault.c                    |  458 +++++
 drivers/iommu/io-pgtable-arm.c                |   27 +-
 drivers/iommu/ioasid.c                        |   30 +-
 drivers/iommu/iommu-sva.c                     |   85 +
 drivers/iommu/of_iommu.c                      |    5 +-
 drivers/pci/ats.c                             |    4 +
 MAINTAINERS                                   |    3 +-
 21 files changed, 2275 insertions(+), 158 deletions(-)
 create mode 100644 drivers/iommu/io-pgtable-arm.h
 create mode 100644 drivers/iommu/iommu-sva.h
 create mode 100644 drivers/iommu/io-pgfault.c
 create mode 100644 drivers/iommu/iommu-sva.c

Comments

Suzuki K Poulose April 30, 2020, 3:18 p.m. UTC | #1
On 04/30/2020 03:34 PM, Jean-Philippe Brucker wrote:
> The SMMUv3 driver would like to read the MMFR0 PARANGE field in order to
> share CPU page tables with devices. Allow the driver to be built as a
> module by exporting the read_sanitised_ftr_reg() cpufeature symbol.
> 
> Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
> Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>

Acked-by: Suzuki K Poulose <suzuki.poulose@arm.com>

> ---
>   arch/arm64/kernel/cpufeature.c | 1 +
>   1 file changed, 1 insertion(+)
> 
> diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
> index 9fac745aa7bb2..5f6adbf4ae893 100644
> --- a/arch/arm64/kernel/cpufeature.c
> +++ b/arch/arm64/kernel/cpufeature.c
> @@ -841,6 +841,7 @@ u64 read_sanitised_ftr_reg(u32 id)
>   	BUG_ON(!regp);
>   	return regp->sys_val;
>   }
> +EXPORT_SYMBOL_GPL(read_sanitised_ftr_reg);
>   
>   #define read_sysreg_case(r)	\
>   	case r:		return read_sysreg_s(r)
>
Suzuki K Poulose April 30, 2020, 3:39 p.m. UTC | #2
On 04/30/2020 03:34 PM, Jean-Philippe Brucker wrote:
> With Shared Virtual Addressing (SVA), we need to mirror CPU TTBR, TCR,
> MAIR and ASIDs in SMMU contexts. Each SMMU has a single ASID space split
> into two sets, shared and private. Shared ASIDs correspond to those
> obtained from the arch ASID allocator, and private ASIDs are used for
> "classic" map/unmap DMA.
> 
> Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
> Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
> ---

> +
> +	tcr = FIELD_PREP(CTXDESC_CD_0_TCR_T0SZ, 64ULL - VA_BITS) |
> +	      FIELD_PREP(CTXDESC_CD_0_TCR_IRGN0, ARM_LPAE_TCR_RGN_WBWA) |
> +	      FIELD_PREP(CTXDESC_CD_0_TCR_ORGN0, ARM_LPAE_TCR_RGN_WBWA) |
> +	      FIELD_PREP(CTXDESC_CD_0_TCR_SH0, ARM_LPAE_TCR_SH_IS) |
> +	      CTXDESC_CD_0_TCR_EPD1 | CTXDESC_CD_0_AA64;
> +
> +	switch (PAGE_SIZE) {
> +	case SZ_4K:
> +		tcr |= FIELD_PREP(CTXDESC_CD_0_TCR_TG0, ARM_LPAE_TCR_TG0_4K);
> +		break;
> +	case SZ_16K:
> +		tcr |= FIELD_PREP(CTXDESC_CD_0_TCR_TG0, ARM_LPAE_TCR_TG0_16K);
> +		break;
> +	case SZ_64K:
> +		tcr |= FIELD_PREP(CTXDESC_CD_0_TCR_TG0, ARM_LPAE_TCR_TG0_64K);
> +		break;
> +	default:
> +		WARN_ON(1);
> +		ret = -EINVAL;
> +		goto err_free_asid;
> +	}
> +
> +	reg = read_sanitised_ftr_reg(SYS_ID_AA64MMFR0_EL1);
> +	par = cpuid_feature_extract_unsigned_field(reg, ID_AA64MMFR0_PARANGE_SHIFT);
> +	tcr |= FIELD_PREP(CTXDESC_CD_0_TCR_IPS, par);
> +
> +	cd->ttbr = virt_to_phys(mm->pgd);

Does the TTBR follow the same layout as TTBR_ELx for 52-bit IPA? i.e.,
TTBR[5:2] = BADDR[51:48]? Are you covered for that?

Suzuki
Jacob Pan April 30, 2020, 6:39 p.m. UTC | #3
On Thu, 30 Apr 2020 16:34:01 +0200
Jean-Philippe Brucker <jean-philippe@linaro.org> wrote:

> Let IOASID users take references to existing ioasids with
> ioasid_get(). ioasid_free() drops a reference and only frees the
> ioasid when its reference count is zero. It returns whether the
> ioasid was freed.
> 
Looks good to me; I was planning to do the same for VT-d use. Just a
couple of points for potential extension. I can rebase on top of this.
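
As a quick illustration of the semantics this patch introduces (a sketch;
the second user could for instance be a KVM reference):

ioasid_t id = ioasid_alloc(set, min, max, private);	/* refcount = 1 */

ioasid_get(id);		/* refcount = 2, a second user takes a reference */
ioasid_free(id);	/* refcount = 1, returns false: not freed yet */
ioasid_free(id);	/* refcount = 0, returns true: ID reclaimed */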


> Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
> ---
>  include/linux/ioasid.h | 10 ++++++++--
>  drivers/iommu/ioasid.c | 30 +++++++++++++++++++++++++++++-
>  2 files changed, 37 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
> index 6f000d7a0ddcd..609ba6f15b9e3 100644
> --- a/include/linux/ioasid.h
> +++ b/include/linux/ioasid.h
> @@ -34,7 +34,8 @@ struct ioasid_allocator_ops {
>  #if IS_ENABLED(CONFIG_IOASID)
>  ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
>  		      void *private);
> -void ioasid_free(ioasid_t ioasid);
> +void ioasid_get(ioasid_t ioasid);
> +bool ioasid_free(ioasid_t ioasid);
>  void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
>  		  bool (*getter)(void *));
>  int ioasid_register_allocator(struct ioasid_allocator_ops *allocator);
> @@ -48,10 +49,15 @@ static inline ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min,
>  	return INVALID_IOASID;
>  }
>  
> -static inline void ioasid_free(ioasid_t ioasid)
> +static inline void ioasid_get(ioasid_t ioasid)
>  {
>  }
>  
> +static inline bool ioasid_free(ioasid_t ioasid)
> +{
> +	return false;
> +}
> +
>  static inline void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
>  				  bool (*getter)(void *))
>  {
> diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
> index 0f8dd377aada3..46511ac53e0c8 100644
> --- a/drivers/iommu/ioasid.c
> +++ b/drivers/iommu/ioasid.c
> @@ -15,6 +15,7 @@ struct ioasid_data {
>  	struct ioasid_set *set;
>  	void *private;
>  	struct rcu_head rcu;
> +	refcount_t refs;
>  };
>  
>  /*
> @@ -314,6 +315,7 @@ ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
>  	data->set = set;
>  	data->private = private;
> +	refcount_set(&data->refs, 1);
>  
>  	/*
>  	 * Custom allocator needs allocator data to perform platform specific
> @@ -345,12 +347,33 @@ ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
>  }
>  EXPORT_SYMBOL_GPL(ioasid_alloc);
>  
> +/**
> + * ioasid_get - obtain a reference to the IOASID
> + */
> +void ioasid_get(ioasid_t ioasid)
Why void? What if the ioasid is not valid?

> +{
> +	struct ioasid_data *ioasid_data;
> +
> +	spin_lock(&ioasid_allocator_lock);
> +	ioasid_data = xa_load(&active_allocator->xa, ioasid);
> +	if (ioasid_data)
> +		refcount_inc(&ioasid_data->refs);
> +	spin_unlock(&ioasid_allocator_lock);
> +}
> +EXPORT_SYMBOL_GPL(ioasid_get);
> +
>  /**
>   * ioasid_free - Free an IOASID
>   * @ioasid: the ID to remove
> + *
> + * Put a reference to the IOASID, free it when the number of references
> + * drops to zero.
> + *
> + * Return: %true if the IOASID was freed, %false otherwise.
>   */
> -void ioasid_free(ioasid_t ioasid)
> +bool ioasid_free(ioasid_t ioasid)
>  {
> +	bool free = false;
>  	struct ioasid_data *ioasid_data;
>  
>  	spin_lock(&ioasid_allocator_lock);
> @@ -360,6 +383,10 @@ void ioasid_free(ioasid_t ioasid)
>  		goto exit_unlock;
>  	}
>  
> +	free = refcount_dec_and_test(&ioasid_data->refs);
> +	if (!free)
> +		goto exit_unlock;
> +
Just FYI, we may need to add states for the IOASID, e.g. mark the IOASID
inactive after free, and prohibit ioasid_get() after it is freed. For VT-d,
this is useful when KVM queries the IOASID.

>  	active_allocator->ops->free(ioasid, active_allocator->ops->pdata);
>  	/* Custom allocator needs additional steps to free the xa element */
>  	if (active_allocator->flags & IOASID_ALLOCATOR_CUSTOM) {
> @@ -369,6 +396,7 @@ void ioasid_free(ioasid_t ioasid)
>  exit_unlock:
>  	spin_unlock(&ioasid_allocator_lock);
> +	return free;
>  }
>  EXPORT_SYMBOL_GPL(ioasid_free);
>  

[Jacob Pan]
Jacob Pan April 30, 2020, 8:48 p.m. UTC | #4
On Thu, 30 Apr 2020 11:39:31 -0700
Jacob Pan <jacob.jun.pan@linux.intel.com> wrote:

> > -void ioasid_free(ioasid_t ioasid)
> > +bool ioasid_free(ioasid_t ioasid)
> >  {
Sorry I missed this in the last reply.

I think free needs to be unconditional since there is not a good way to
fail it.

Also, can we have more symmetric APIs? It seems we don't have ioasid_put()
in this patchset. How about:

ioasid_alloc();
ioasid_free();	// drop a reference; mark inactive, but don't reclaim
		// while the refcount is not zero
ioasid_get();	// return an error if the ioasid was marked inactive
		// by ioasid_free()
ioasid_put();	// drop a reference; reclaim when the refcount hits 0

It is similar to get/put/alloc/free for PIDs.
Jacob Pan April 30, 2020, 9:16 p.m. UTC | #5
Hi Jean,

A couple of questions on how the SMMU handles CD.V and translation disable.

On Thu, 30 Apr 2020 16:34:16 +0200
Jean-Philippe Brucker <jean-philippe@linaro.org> wrote:

> The sva_bind() function allows devices to access process address
> spaces using a PASID (aka SSID).
> 
> (1) bind() allocates or gets an existing MMU notifier tied to the
>     (domain, mm) pair. Each mm gets one PASID.
> 
> (2) Any change to the address space calls invalidate_range() which
> sends ATC invalidations (in a subsequent patch).
> 
> (3) When the process address space dies, the release() notifier
>     disables the CD to allow reclaiming the page tables. Since release()
>     has to be light, we do not instruct device drivers to stop DMA here;
>     we just ignore incoming page faults.
> 
>     To avoid any event 0x0a print (C_BAD_CD) we disable translation
>     without clearing CD.V. PCIe Translation Requests and Page Requests
>     are silently denied. Don't clear the R bit because the S bit can't
>     be cleared when STALL_MODEL==0b10 (forced), and clearing R without
>     clearing S is useless. Faulting transactions will stall and will
> be aborted by the IOPF handler.
> 
> (4) After stopping DMA, the device driver releases the bond by calling
>     unbind(). We release the MMU notifier, free the PASID and the
>     bond.
> 
> Three structures keep track of bonds:
> * arm_smmu_bond: one per (device, mm) pair, the handle returned to the
>   device driver for a bind() request.
> * arm_smmu_mmu_notifier: one per (domain, mm) pair, deals with ATS/TLB
>   invalidations and clearing the context descriptor on mm exit.
> * arm_smmu_ctx_desc: one per mm, holds the pinned ASID and pgd.
> 
> Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
> ---
> v5->v6:
> * Implement bind() directly instead of going through io_mm_ops
> * Don't clear S and R bits in step (3), it doesn't work with
>   STALL_FORCE.
> ---
>  drivers/iommu/Kconfig       |   1 +
>  drivers/iommu/arm-smmu-v3.c | 256 +++++++++++++++++++++++++++++++++++-
>  2 files changed, 253 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
> index 1e64ee6592e16..f863c4562feeb 100644
> --- a/drivers/iommu/Kconfig
> +++ b/drivers/iommu/Kconfig
> @@ -432,6 +432,7 @@ config ARM_SMMU_V3
>  	tristate "ARM Ltd. System MMU Version 3 (SMMUv3) Support"
>  	depends on ARM64
>  	select IOMMU_API
> +	select IOMMU_SVA
>  	select IOMMU_IO_PGTABLE_LPAE
>  	select GENERIC_MSI_IRQ_DOMAIN
>  	help
> diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
> index c7942d0540599..00e5b69bb81a5 100644
> --- a/drivers/iommu/arm-smmu-v3.c
> +++ b/drivers/iommu/arm-smmu-v3.c
> @@ -24,6 +24,7 @@
>  #include <linux/iopoll.h>
>  #include <linux/module.h>
>  #include <linux/mmu_context.h>
> +#include <linux/mmu_notifier.h>
>  #include <linux/msi.h>
>  #include <linux/of.h>
>  #include <linux/of_address.h>
> @@ -36,6 +37,7 @@
>  #include <linux/amba/bus.h>
>  
>  #include "io-pgtable-arm.h"
> +#include "iommu-sva.h"
>  
>  /* MMIO registers */
>  #define ARM_SMMU_IDR0			0x0
> @@ -731,8 +733,31 @@ struct arm_smmu_domain {
>  
>  	struct list_head		devices;
>  	spinlock_t			devices_lock;
> +
> +	struct mmu_notifier_ops		mn_ops;
>  };
>  
> +struct arm_smmu_mmu_notifier {
> +	struct mmu_notifier		mn;
> +	struct arm_smmu_ctx_desc	*cd;
> +	bool				cleared;
> +	refcount_t			refs;
> +	struct arm_smmu_domain		*domain;
> +};
> +
> +#define mn_to_smmu(mn) container_of(mn, struct arm_smmu_mmu_notifier, mn)
> +
> +struct arm_smmu_bond {
> +	struct iommu_sva		sva;
> +	struct mm_struct		*mm;
> +	struct arm_smmu_mmu_notifier	*smmu_mn;
> +	struct list_head		list;
> +	refcount_t			refs;
> +};
> +
> +#define sva_to_bond(handle) \
> +	container_of(handle, struct arm_smmu_bond, sva)
> +
>  struct arm_smmu_option_prop {
>  	u32 opt;
>  	const char *prop;
> @@ -742,6 +767,13 @@ static DEFINE_XARRAY_ALLOC1(asid_xa);
>  static DEFINE_SPINLOCK(contexts_lock);
>  static DEFINE_MUTEX(arm_smmu_sva_lock);
>  
> +/*
> + * When a process dies, DMA is still running but we need to clear the
> + * pgd. If we simply cleared the valid bit from the context descriptor,
> + * we'd get event 0x0a, which is not recoverable.
> + */
> +static struct arm_smmu_ctx_desc invalid_cd = { 0 };
> +
>  static struct arm_smmu_option_prop arm_smmu_options[] = {
>  	{ ARM_SMMU_OPT_SKIP_PREFETCH, "hisilicon,broken-prefetch-cmd" },
>  	{ ARM_SMMU_OPT_PAGE0_REGS_ONLY, "cavium,cn9900-broken-page1-regspace"},
> @@ -1652,7 +1684,9 @@ static int __arm_smmu_write_ctx_desc(struct arm_smmu_domain *smmu_domain,
>  	 * (2) Install a secondary CD, for SID+SSID traffic.
>  	 * (3) Update ASID of a CD. Atomically write the first 64 bits of the
>  	 *     CD, then invalidate the old entry and mappings.
> -	 * (4) Remove a secondary CD.
> +	 * (4) Quiesce the context without clearing the valid bit. Disable
> +	 *     translation, and ignore any translation fault.
> +	 * (5) Remove a secondary CD.
>  	 */
>  	u64 val;
>  	bool cd_live;
> @@ -1669,8 +1703,10 @@ static int __arm_smmu_write_ctx_desc(struct arm_smmu_domain *smmu_domain,
>  	val = le64_to_cpu(cdptr[0]);
>  	cd_live = !!(val & CTXDESC_CD_0_V);
>  
> -	if (!cd) { /* (4) */
> +	if (!cd) { /* (5) */
>  		val = 0;
> +	} else if (cd == &invalid_cd) { /* (4) */
> +		val |= CTXDESC_CD_0_TCR_EPD0;
>  	} else if (cd_live) { /* (3) */
>  		val &= ~CTXDESC_CD_0_ASID;
>  		val |= FIELD_PREP(CTXDESC_CD_0_ASID, cd->asid);
> @@ -1883,7 +1919,6 @@ static struct arm_smmu_ctx_desc *arm_smmu_share_asid(u16 asid)
>  	return NULL;
>  }
>  
> -__maybe_unused
>  static struct arm_smmu_ctx_desc *arm_smmu_alloc_shared_cd(struct mm_struct *mm)
>  {
>  	u16 asid;
> @@ -1976,7 +2011,6 @@ static struct arm_smmu_ctx_desc *arm_smmu_alloc_shared_cd(struct mm_struct *mm)
>  	return ERR_PTR(ret);
>  }
>  
> -__maybe_unused
>  static void arm_smmu_free_shared_cd(struct arm_smmu_ctx_desc *cd)
>  {
>  	if (arm_smmu_free_asid(cd)) {
> @@ -2611,6 +2645,8 @@ static bool arm_smmu_capable(enum iommu_cap cap)
>  	}
>  }
>  
> +static struct mmu_notifier_ops arm_smmu_mmu_notifier_ops;
> +
>  static struct iommu_domain *arm_smmu_domain_alloc(unsigned type)
>  {
>  	struct arm_smmu_domain *smmu_domain;
> @@ -2638,6 +2674,7 @@ static struct iommu_domain *arm_smmu_domain_alloc(unsigned type)
>  	mutex_init(&smmu_domain->init_mutex);
>  	INIT_LIST_HEAD(&smmu_domain->devices);
>  	spin_lock_init(&smmu_domain->devices_lock);
> +	smmu_domain->mn_ops = arm_smmu_mmu_notifier_ops;
>  
>  	return &smmu_domain->domain;
>  }
> @@ -3118,6 +3155,208 @@ arm_smmu_iova_to_phys(struct iommu_domain *domain, dma_addr_t iova)
>  	return ops->iova_to_phys(ops, iova);
>  }
>  
> +static struct mmu_notifier *arm_smmu_mmu_notifier_alloc(struct mm_struct *mm)
> +{
> +	struct arm_smmu_mmu_notifier *smmu_mn;
> +
> +	smmu_mn = kzalloc(sizeof(*smmu_mn), GFP_KERNEL);
> +	if (!smmu_mn)
> +		return ERR_PTR(-ENOMEM);
> +
> +	smmu_mn->cd = arm_smmu_alloc_shared_cd(mm);
> +	if (IS_ERR(smmu_mn->cd)) {
> +		void *ptr = ERR_CAST(smmu_mn->cd);
> +
> +		kfree(smmu_mn);
> +		return ptr;
> +	}
> +	refcount_set(&smmu_mn->refs, 1);
> +
> +	return &smmu_mn->mn;
> +}
> +
> +static void arm_smmu_mmu_notifier_free(struct mmu_notifier *mn)
> +{
> +	struct arm_smmu_mmu_notifier *smmu_mn = mn_to_smmu(mn);
> +
> +	arm_smmu_free_shared_cd(smmu_mn->cd);
> +	kfree(smmu_mn);
> +}
> +
> +static void arm_smmu_mm_invalidate_range(struct mmu_notifier *mn,
> +					 struct mm_struct *mm,
> +					 unsigned long start, unsigned long end)
> +{
> +	/* TODO: invalidate ATS */
> +}
> +
> +static void arm_smmu_mm_release(struct mmu_notifier *mn, struct mm_struct *mm)
> +{
> +	struct arm_smmu_mmu_notifier *smmu_mn = mn_to_smmu(mn);
> +	struct arm_smmu_domain *smmu_domain;
> +
> +	mutex_lock(&arm_smmu_sva_lock);
> +	if (smmu_mn->cleared) {
> +		mutex_unlock(&arm_smmu_sva_lock);
> +		return;
> +	}
> +
> +	smmu_domain = smmu_mn->domain;
> +
> +	/*
> +	 * DMA may still be running. Keep the cd valid but disable
> +	 * translation, so that new events will still result in stall.
> +	 */
Does "disable translation" also disable translated requests? I guess
release is called after the TLB invalidate range, so I'm assuming no more
devTLB entries are left to generate translated requests?

> +	arm_smmu_write_ctx_desc(smmu_domain, mm->pasid, &invalid_cd);
> +
> +	arm_smmu_tlb_inv_asid(smmu_domain->smmu, smmu_mn->cd->asid);
> +	/* TODO: invalidate ATS */
> +
If mm release is called after the TLB invalidate range, is it still
necessary to invalidate again?

> +	smmu_mn->cleared = true;
> +	mutex_unlock(&arm_smmu_sva_lock);
> +}
> +
> +static struct mmu_notifier_ops arm_smmu_mmu_notifier_ops = {
> +	.alloc_notifier		= arm_smmu_mmu_notifier_alloc,
> +	.free_notifier		= arm_smmu_mmu_notifier_free,
> +	.invalidate_range	= arm_smmu_mm_invalidate_range,
> +	.release		= arm_smmu_mm_release,
> +};
> +
> +static struct iommu_sva *
> +__arm_smmu_sva_bind(struct device *dev, struct mm_struct *mm)
> +{
> +	int ret;
> +	ioasid_t pasid;
> +	struct mmu_notifier *mn;
> +	struct arm_smmu_bond *bond;
> +	struct arm_smmu_mmu_notifier *smmu_mn;
> +	struct arm_smmu_master *master = dev_iommu_priv_get(dev);
> +	struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
> +	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
> +
> +	if (!master || !master->sva_enabled)
> +		return ERR_PTR(-ENODEV);
> +
> +	/* If bind() was already called for this (dev, mm) pair, reuse it. */
> +	list_for_each_entry(bond, &master->bonds, list) {
> +		if (bond->mm == mm) {
> +			refcount_inc(&bond->refs);
> +			return &bond->sva;
> +		}
> +	}
> +
> +	mn = mmu_notifier_get(&smmu_domain->mn_ops, mm);
> +	if (IS_ERR(mn))
> +		return ERR_CAST(mn);
> +
> +	smmu_mn = mn_to_smmu(mn);
> +	if (smmu_mn->domain)
> +		refcount_inc(&smmu_mn->refs);
> +
> +	bond = kzalloc(sizeof(*bond), GFP_KERNEL);
> +	if (!bond) {
> +		ret = -ENOMEM;
> +		goto err_put_mn;
> +	}
> +
> +	/* Allocate a PASID for this mm if necessary */
> +	pasid = iommu_sva_alloc_pasid(mm, 1, (1U << master->ssid_bits) - 1);
> +	if (pasid == INVALID_IOASID) {
> +		ret = -ENOSPC;
> +		goto err_free_bond;
> +	}
> +	bond->mm = mm;
> +	bond->sva.dev = dev;
> +	bond->smmu_mn = smmu_mn;
> +	refcount_set(&bond->refs, 1);
> +
> +	ret = arm_smmu_write_ctx_desc(smmu_domain, mm->pasid, smmu_mn->cd);
> +	if (ret)
> +		goto err_free_pasid;
> +
> +	bond->sva.dev = dev;
> +	list_add(&bond->list, &master->bonds);
> +	smmu_mn->domain = smmu_domain;
> +	return &bond->sva;
> +
> +err_free_pasid:
> +	iommu_sva_free_pasid(mm);
> +err_free_bond:
> +	kfree(bond);
> +err_put_mn:
> +	refcount_dec(&smmu_mn->refs);
> +	mmu_notifier_put(mn);
> +	return ERR_PTR(ret);
> +}
> +
> +static void __arm_smmu_sva_unbind(struct iommu_sva *handle)
> +{
> +	struct arm_smmu_mmu_notifier *smmu_mn;
> +	struct arm_smmu_bond *bond = sva_to_bond(handle);
> +	struct iommu_domain *domain = iommu_get_domain_for_dev(handle->dev);
> +	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
> +
> +	if (!refcount_dec_and_test(&bond->refs))
> +		return;
> +
> +	list_del(&bond->list);
> +
> +	smmu_mn = bond->smmu_mn;
> +	/*
> +	 * This is redundant as the MMU notifier already counts refs, but frees
> +	 * the bond in an RCU callback which cannot sleep. We have much cleaning
> +	 * to do and we hold all the right locks, so duplicate the refcounting.
> +	 */
> +	if (refcount_dec_and_test(&smmu_mn->refs)) {
> +		arm_smmu_write_ctx_desc(smmu_domain, bond->mm->pasid, NULL);
> +
> +		/*
> +		 * If we went through clear(), we've already invalidated, and no
> +		 * new TLB entry can have been formed.
> +		 */
> +		if (!smmu_mn->cleared) {
> +			arm_smmu_tlb_inv_asid(smmu_domain->smmu,
> +					      smmu_mn->cd->asid);
> +			/* TODO: invalidate ATS */
> +		}
> +	}
> +
> +	iommu_sva_free_pasid(bond->mm);
> +	kfree(bond);
> +	mmu_notifier_put(&smmu_mn->mn);
> +}
> +
> +static struct iommu_sva *
> +arm_smmu_sva_bind(struct device *dev, struct mm_struct *mm, void *drvdata)
> +{
> +	struct iommu_sva *handle;
> +	struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
> +	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
> +
> +	if (smmu_domain->stage != ARM_SMMU_DOMAIN_S1)
> +		return ERR_PTR(-EINVAL);
> +
> +	mutex_lock(&arm_smmu_sva_lock);
> +	handle = __arm_smmu_sva_bind(dev, mm);
> +	mutex_unlock(&arm_smmu_sva_lock);
> +	return handle;
> +}
> +
> +static void arm_smmu_sva_unbind(struct iommu_sva *handle)
> +{
> +	mutex_lock(&arm_smmu_sva_lock);
> +	__arm_smmu_sva_unbind(handle);
> +	mutex_unlock(&arm_smmu_sva_lock);
> +}
> +
> +static int arm_smmu_sva_get_pasid(struct iommu_sva *handle)
> +{
> +	struct arm_smmu_bond *bond = sva_to_bond(handle);
> +
> +	return bond->mm->pasid;
> +}
> +
>  static struct platform_driver arm_smmu_driver;
>  
>  static
> @@ -3426,6 +3665,12 @@ static int arm_smmu_dev_disable_sva(struct device *dev)
>  	master->sva_enabled = false;
>  	mutex_unlock(&arm_smmu_sva_lock);
>  
> +	/*
> +	 * Since the MMU notifier ops are held in the domain, it is not safe
> +	 * to free the domain until all MMU notifiers are freed.
> +	 */
> +	mmu_notifier_synchronize();
> +
>  	return 0;
>  }
>  
> @@ -3482,6 +3727,9 @@ static struct iommu_ops arm_smmu_ops = {
>  	.dev_feat_enabled	= arm_smmu_dev_feature_enabled,
>  	.dev_enable_feat	= arm_smmu_dev_enable_feature,
>  	.dev_disable_feat	= arm_smmu_dev_disable_feature,
> +	.sva_bind		= arm_smmu_sva_bind,
> +	.sva_unbind		= arm_smmu_sva_unbind,
> +	.sva_get_pasid		= arm_smmu_sva_get_pasid,
>  	.pgsize_bitmap		= -1UL, /* Restricted during device attach */
>  };
>  

[Jacob Pan]
Jacob Pan April 30, 2020, 9:18 p.m. UTC | #6
On Thu, 30 Apr 2020 16:33:59 +0200
Jean-Philippe Brucker <jean-philippe@linaro.org> wrote:

> Shared Virtual Addressing (SVA) allows sharing process page tables
> with devices using the IOMMU, PASIDs and I/O page faults. Add SVA
> support to the Arm SMMUv3 driver.
> 
> Since v5 [1]:
> 
> * Added patches 1-3. Patch 1 adds a PASID field to mm_struct as
>   discussed in [1] and [2]. This is also needed for Intel ENQCMD.
> Patch 2 adds refcounts to IOASID and patch 3 adds a couple of helpers
> to allocate the PASID.
> 
> * Dropped most of iommu-sva.c. After getting rid of io_mm following
>   review of v5, there wasn't enough generic code left to justify the
>   indirect branch overhead of io_mm_ops in the MMU notifiers. I ended
> up with more glue than useful code, and couldn't find an easy way to
> deal with domains in the SMMU driver (we keep PASID tables per domain,
>   while x86 keeps them per device). The direct approach in patch 17 is
>   nicer and a little easier to read. The SMMU driver only gained 160
>   lines, while iommu-sva lost 470 lines.
> 
>   As a result I dropped the MMU notifier patch.
> 
>   Jacob, one upside of this rework is that we now free ioasids in
>   blocking context, which might help with your addition of notifiers
> to ioasid.c
> 
Thanks for the note. It does make the notifier much easier, plus the
refcount can alleviate the constraint on ordering.

I guess we don't share mmu notifier code for now :)

> [...]

[Jacob Pan]
Christoph Hellwig May 1, 2020, 12:15 p.m. UTC | #7
> @@ -432,6 +432,7 @@ config ARM_SMMU_V3
>  	tristate "ARM Ltd. System MMU Version 3 (SMMUv3) Support"
>  	depends on ARM64
>  	select IOMMU_API
> +	select IOMMU_SVA
>  	select IOMMU_IO_PGTABLE_LPAE
>  	select GENERIC_MSI_IRQ_DOMAIN

Doesn't this need to select MMU_NOTIFIER now?

> +	struct mmu_notifier_ops		mn_ops;

Note: not a pointer.

> +	/* If bind() was already called for this (dev, mm) pair, reuse it. */
> +	list_for_each_entry(bond, &master->bonds, list) {
> +		if (bond->mm == mm) {
> +			refcount_inc(&bond->refs);
> +			return &bond->sva;
> +		}
> +	}
> +
> +	mn = mmu_notifier_get(&smmu_domain->mn_ops, mm);
> +	if (IS_ERR(mn))
> +		return ERR_CAST(mn);

Which seems to be there to avoid mmu_notifier_get() reusing notifiers
registered by other arm_smmu_master instances, right?

Either you could just use plain old mmu_notifier_register to avoid
the reuse.  Or we could enhance mmu_notifier_get to pass a private
opaque instance ID pointer, which is checked in addition to the ops,
and you could probably kill off the bonds list and lookup.
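
A rough sketch of that last idea (purely illustrative; this helper does
not exist, and its exact shape would be up to the mm maintainers):

/* Match on (ops, mm, instance) instead of just (ops, mm) */
struct mmu_notifier *mmu_notifier_get_instance(const struct mmu_notifier_ops *ops,
					       struct mm_struct *mm,
					       void *instance);

/* The SMMU driver would pass the domain as the instance, so each domain
 * gets its own notifier and the bonds list and lookup become unnecessary. */
mn = mmu_notifier_get_instance(&arm_smmu_mmu_notifier_ops, mm, smmu_domain);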
Jason Gunthorpe May 1, 2020, 12:55 p.m. UTC | #8
On Fri, May 01, 2020 at 05:15:52AM -0700, Christoph Hellwig wrote:
> > @@ -432,6 +432,7 @@ config ARM_SMMU_V3
> >  	tristate "ARM Ltd. System MMU Version 3 (SMMUv3) Support"
> >  	depends on ARM64
> >  	select IOMMU_API
> > +	select IOMMU_SVA
> >  	select IOMMU_IO_PGTABLE_LPAE
> >  	select GENERIC_MSI_IRQ_DOMAIN
> 
> Doesn't this need to select MMU_NOTIFIER now?
> 
> > +	struct mmu_notifier_ops		mn_ops;
> 
> Note: not a pointer.
> 
> > +	/* If bind() was already called for this (dev, mm) pair, reuse it. */
> > +	list_for_each_entry(bond, &master->bonds, list) {
> > +		if (bond->mm == mm) {
> > +			refcount_inc(&bond->refs);
> > +			return &bond->sva;
> > +		}
> > +	}

I would also like it if searching for mms in linked lists were not
necessary; this is kind of the point of 'get'.

Is this a side effect of the earlier remark to get rid of the linked
list inside the notifier?

> Or we could enhance mmu_notifier_get to pass a private
> opaque instance ID pointer, which is checked in addition to the ops,
> and you could probably kill off the bonds list and lookup.

This might be the best option if it can absorb the above search..

Jason
Baolu Lu May 3, 2020, 5:49 a.m. UTC | #9
Hi Jean,

On 2020/4/30 22:34, Jean-Philippe Brucker wrote:
> Some systems allow devices to handle I/O Page Faults in the core mm. For
> example systems implementing the PCIe PRI extension or Arm SMMU stall
> model. Infrastructure for reporting these recoverable page faults was
> added to the IOMMU core by commit 0c830e6b3282 ("iommu: Introduce device
> fault report API"). Add a page fault handler for host SVA.
> 
> IOMMU drivers can now instantiate several fault workqueues and link them
> to IOPF-capable devices. Drivers can choose between a single global
> workqueue, one per IOMMU device, one per low-level fault queue, one per
> domain, etc.
> 
> When it receives a fault event, supposedly in an IRQ handler, the IOMMU
> driver reports the fault using iommu_report_device_fault(), which calls
> the registered handler. The page fault handler then calls the mm fault
> handler, and reports either success or failure with iommu_page_response().
> When the handler succeeded, the IOMMU retries the access.
> 
> The iopf_param pointer could be embedded into iommu_fault_param. But
> putting iopf_param into the iommu_param structure allows us not to care
> about ordering between calls to iopf_queue_add_device() and
> iommu_register_device_fault_handler().
> 
> Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
> ---
> v5->v6: Simplify flush. As we're not flushing in the mm exit path
>    anymore, we can mandate that IOMMU drivers flush their low-level queue
>    themselves before calling iopf_queue_flush_dev(). No need to register
>    a flush callback anymore.
> ---
>   drivers/iommu/Kconfig      |   3 +
>   drivers/iommu/Makefile     |   1 +
>   include/linux/iommu.h      |  51 +++++
>   drivers/iommu/io-pgfault.c | 383 +++++++++++++++++++++++++++++++++++++
>   4 files changed, 438 insertions(+)
>   create mode 100644 drivers/iommu/io-pgfault.c
> 

[...]

> +
> +static void iopf_handle_group(struct work_struct *work)
> +{
> +	struct iopf_group *group;
> +	struct iopf_fault *iopf, *next;
> +	enum iommu_page_response_code status = IOMMU_PAGE_RESP_SUCCESS;
> +
> +	group = container_of(work, struct iopf_group, work);
> +
> +	list_for_each_entry_safe(iopf, next, &group->faults, head) {
> +		/*
> +		 * For the moment, errors are sticky: don't handle subsequent
> +		 * faults in the group if there is an error.
> +		 */
> +		if (status == IOMMU_PAGE_RESP_SUCCESS)
> +			status = iopf_handle_single(iopf);
> +
> +		if (!(iopf->fault.prm.flags &
> +		      IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE))
> +			kfree(iopf);

The iopf is freed, but not removed from the list. This will leave
dangling pointers in the list.
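
A sketch of the minimal fix, unlinking before freeing (assuming the entry
should not stay on group->faults at all):

		if (!(iopf->fault.prm.flags &
		      IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE)) {
			list_del(&iopf->head);
			kfree(iopf);
		}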

> +	}
> +
> +	iopf_complete_group(group->dev, &group->last_fault, status);
> +	kfree(group);
> +}
> +

[...]

> +/**
> + * iopf_queue_flush_dev - Ensure that all queued faults have been processed
> + * @dev: the endpoint whose faults need to be flushed.
> + * @pasid: the PASID affected by this flush
> + *
> + * The IOMMU driver calls this before releasing a PASID, to ensure that all
> + * pending faults for this PASID have been handled, and won't hit the address
> + * space of the next process that uses this PASID. The driver must make sure
> + * that no new fault is added to the queue. In particular it must flush its
> + * low-level queue before calling this function.
> + *
> + * Return: 0 on success and <0 on error.
> + */
> +int iopf_queue_flush_dev(struct device *dev, int pasid)
> +{
> +	int ret = 0;
> +	struct iopf_device_param *iopf_param;
> +	struct dev_iommu *param = dev->iommu;
> +
> +	if (!param)
> +		return -ENODEV;
> +
> +	mutex_lock(&param->lock);
> +	iopf_param = param->iopf_param;
> +	if (iopf_param)
> +		flush_workqueue(iopf_param->queue->wq);

There may be iopf work for other PASIDs in the workqueue. Flushing all
tasks in the workqueue will hurt those other PASIDs. But I might be
missing some context.

> +	else
> +		ret = -ENODEV;
> +	mutex_unlock(&param->lock);
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(iopf_queue_flush_dev);
> +
> +/**
> + * iopf_queue_discard_partial - Remove all pending partial fault
> + * @queue: the queue whose partial faults need to be discarded
> + *
> + * When the hardware queue overflows, last page faults in a group may have been
> + * lost and the IOMMU driver calls this to discard all partial faults. The
> + * driver shouldn't be adding new faults to this queue concurrently.
> + *
> + * Return: 0 on success and <0 on error.
> + */
> +int iopf_queue_discard_partial(struct iopf_queue *queue)
> +{
> +	struct iopf_fault *iopf, *next;
> +	struct iopf_device_param *iopf_param;
> +
> +	if (!queue)
> +		return -EINVAL;
> +
> +	mutex_lock(&queue->lock);
> +	list_for_each_entry(iopf_param, &queue->devices, queue_list) {
> +		list_for_each_entry_safe(iopf, next, &iopf_param->partial, head)
> +			kfree(iopf);

iopf is freed but not removed from the list.

> +	}
> +	mutex_unlock(&queue->lock);
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(iopf_queue_discard_partial);
> +
> +/**
> + * iopf_queue_add_device - Add producer to the fault queue
> + * @queue: IOPF queue
> + * @dev: device to add
> + *
> + * Return: 0 on success and <0 on error.
> + */
> +int iopf_queue_add_device(struct iopf_queue *queue, struct device *dev)
> +{
> +	int ret = -EBUSY;
> +	struct iopf_device_param *iopf_param;
> +	struct dev_iommu *param = dev->iommu;
> +
> +	if (!param)
> +		return -ENODEV;
> +
> +	iopf_param = kzalloc(sizeof(*iopf_param), GFP_KERNEL);
> +	if (!iopf_param)
> +		return -ENOMEM;
> +
> +	INIT_LIST_HEAD(&iopf_param->partial);
> +	iopf_param->queue = queue;
> +	iopf_param->dev = dev;
> +
> +	mutex_lock(&queue->lock);
> +	mutex_lock(&param->lock);
> +	if (!param->iopf_param) {
> +		list_add(&iopf_param->queue_list, &queue->devices);
> +		param->iopf_param = iopf_param;
> +		ret = 0;
> +	}
> +	mutex_unlock(&param->lock);
> +	mutex_unlock(&queue->lock);
> +
> +	if (ret)
> +		kfree(iopf_param);
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(iopf_queue_add_device);
> +
> +/**
> + * iopf_queue_remove_device - Remove producer from fault queue
> + * @queue: IOPF queue
> + * @dev: device to remove
> + *
> + * Caller makes sure that no more faults are reported for this device.
> + *
> + * Return: 0 on success and <0 on error.
> + */
> +int iopf_queue_remove_device(struct iopf_queue *queue, struct device *dev)
> +{
> +	int ret = 0;
> +	struct iopf_fault *iopf, *next;
> +	struct iopf_device_param *iopf_param;
> +	struct dev_iommu *param = dev->iommu;
> +
> +	if (!param || !queue)
> +		return -EINVAL;
> +
> +	mutex_lock(&queue->lock);
> +	mutex_lock(&param->lock);
> +	iopf_param = param->iopf_param;
> +	if (iopf_param && iopf_param->queue == queue) {
> +		list_del(&iopf_param->queue_list);
> +		param->iopf_param = NULL;
> +	} else {
> +		ret = -EINVAL;
> +	}
> +	mutex_unlock(&param->lock);
> +	mutex_unlock(&queue->lock);
> +	if (ret)
> +		return ret;
> +
> +	/* Just in case some faults are still stuck */
> +	list_for_each_entry_safe(iopf, next, &iopf_param->partial, head)
> +		kfree(iopf);

The same here.

> +
> +	kfree(iopf_param);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(iopf_queue_remove_device);
> +
> +/**
> + * iopf_queue_alloc - Allocate and initialize a fault queue
> + * @name: a unique string identifying the queue (for workqueue)
> + *
> + * Return: the queue on success and NULL on error.
> + */
> +struct iopf_queue *iopf_queue_alloc(const char *name)
> +{
> +	struct iopf_queue *queue;
> +
> +	queue = kzalloc(sizeof(*queue), GFP_KERNEL);
> +	if (!queue)
> +		return NULL;
> +
> +	/*
> +	 * The WQ is unordered because the low-level handler enqueues faults by
> +	 * group. PRI requests within a group have to be ordered, but once
> +	 * that's dealt with, the high-level function can handle groups out of
> +	 * order.
> +	 */
> +	queue->wq = alloc_workqueue("iopf_queue/%s", WQ_UNBOUND, 0, name);
> +	if (!queue->wq) {
> +		kfree(queue);
> +		return NULL;
> +	}
> +
> +	INIT_LIST_HEAD(&queue->devices);
> +	mutex_init(&queue->lock);
> +
> +	return queue;
> +}
> +EXPORT_SYMBOL_GPL(iopf_queue_alloc);
> +
> +/**
> + * iopf_queue_free - Free IOPF queue
> + * @queue: queue to free
> + *
> + * Counterpart to iopf_queue_alloc(). The driver must not be queuing faults or
> + * adding/removing devices on this queue anymore.
> + */
> +void iopf_queue_free(struct iopf_queue *queue)
> +{
> +	struct iopf_device_param *iopf_param, *next;
> +
> +	if (!queue)
> +		return;
> +
> +	list_for_each_entry_safe(iopf_param, next, &queue->devices, queue_list)
> +		iopf_queue_remove_device(queue, iopf_param->dev);
> +
> +	destroy_workqueue(queue->wq);
> +	kfree(queue);
> +}
> +EXPORT_SYMBOL_GPL(iopf_queue_free);
> 

Best regards,
baolu
Baolu Lu May 3, 2020, 5:54 a.m. UTC | #10
On 2020/4/30 22:34, Jean-Philippe Brucker wrote:
> When a recoverable page fault is handled by the fault workqueue, find the
> associated mm and call handle_mm_fault.
> 
> Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
> ---
> v5->v6: select CONFIG_IOMMU_SVA
> ---
>   drivers/iommu/Kconfig      |  1 +
>   drivers/iommu/io-pgfault.c | 79 +++++++++++++++++++++++++++++++++++++-
>   2 files changed, 78 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
> index 4f33e489f0726..1e64ee6592e16 100644
> --- a/drivers/iommu/Kconfig
> +++ b/drivers/iommu/Kconfig
> @@ -109,6 +109,7 @@ config IOMMU_SVA
>   
>   config IOMMU_PAGE_FAULT
>   	bool
> +	select IOMMU_SVA

It would be better to move this to the previous patch.

>   
>   config FSL_PAMU
>   	bool "Freescale IOMMU support"
> diff --git a/drivers/iommu/io-pgfault.c b/drivers/iommu/io-pgfault.c
> index 38732e97faac1..09a71dc4de20a 100644
> --- a/drivers/iommu/io-pgfault.c
> +++ b/drivers/iommu/io-pgfault.c
> @@ -7,9 +7,12 @@
>   
>   #include <linux/iommu.h>
>   #include <linux/list.h>
> +#include <linux/sched/mm.h>
>   #include <linux/slab.h>
>   #include <linux/workqueue.h>
>   
> +#include "iommu-sva.h"
> +
>   /**
>    * struct iopf_queue - IO Page Fault queue
>    * @wq: the fault workqueue
> @@ -68,8 +71,57 @@ static int iopf_complete_group(struct device *dev, struct iopf_fault *iopf,
>   static enum iommu_page_response_code
>   iopf_handle_single(struct iopf_fault *iopf)
>   {
> -	/* TODO */
> -	return -ENODEV;
> +	vm_fault_t ret;
> +	struct mm_struct *mm;
> +	struct vm_area_struct *vma;
> +	unsigned int access_flags = 0;
> +	unsigned int fault_flags = FAULT_FLAG_REMOTE;
> +	struct iommu_fault_page_request *prm = &iopf->fault.prm;
> +	enum iommu_page_response_code status = IOMMU_PAGE_RESP_INVALID;
> +
> +	if (!(prm->flags & IOMMU_FAULT_PAGE_REQUEST_PASID_VALID))
> +		return status;
> +
> +	mm = iommu_sva_find(prm->pasid);
> +	if (IS_ERR_OR_NULL(mm))
> +		return status;
> +
> +	down_read(&mm->mmap_sem);
> +
> +	vma = find_extend_vma(mm, prm->addr);
> +	if (!vma)
> +		/* Unmapped area */
> +		goto out_put_mm;
> +
> +	if (prm->perm & IOMMU_FAULT_PERM_READ)
> +		access_flags |= VM_READ;
> +
> +	if (prm->perm & IOMMU_FAULT_PERM_WRITE) {
> +		access_flags |= VM_WRITE;
> +		fault_flags |= FAULT_FLAG_WRITE;
> +	}
> +
> +	if (prm->perm & IOMMU_FAULT_PERM_EXEC) {
> +		access_flags |= VM_EXEC;
> +		fault_flags |= FAULT_FLAG_INSTRUCTION;
> +	}
> +
> +	if (!(prm->perm & IOMMU_FAULT_PERM_PRIV))
> +		fault_flags |= FAULT_FLAG_USER;
> +
> +	if (access_flags & ~vma->vm_flags)
> +		/* Access fault */
> +		goto out_put_mm;
> +
> +	ret = handle_mm_fault(vma, prm->addr, fault_flags);
> +	status = ret & VM_FAULT_ERROR ? IOMMU_PAGE_RESP_INVALID :
> +		IOMMU_PAGE_RESP_SUCCESS;
> +
> +out_put_mm:
> +	up_read(&mm->mmap_sem);
> +	mmput(mm);
> +
> +	return status;
>   }
>   
>   static void iopf_handle_group(struct work_struct *work)
> @@ -104,6 +156,29 @@ static void iopf_handle_group(struct work_struct *work)
>    *
>    * Add a fault to the device workqueue, to be handled by mm.
>    *
> + * This module doesn't handle PCI PASID Stop Marker; IOMMU drivers must discard
> + * them before reporting faults. A PASID Stop Marker (LRW = 0b100) doesn't
> + * expect a response. It may be generated when disabling a PASID (issuing a
> + * PASID stop request) by some PCI devices.
> + *
> + * The PASID stop request is issued by the device driver before unbind(). Once
> + * it completes, no page request is generated for this PASID anymore and
> + * outstanding ones have been pushed to the IOMMU (as per PCIe 4.0r1.0 - 6.20.1
> + * and 10.4.1.2 - Managing PASID TLP Prefix Usage). Some PCI devices will wait
> + * for all outstanding page requests to come back with a response before
> + * completing the PASID stop request. Others do not wait for page responses, and
> + * instead issue this Stop Marker that tells us when the PASID can be
> + * reallocated.
> + *
> + * It is safe to discard the Stop Marker because it is an optimization.
> + * a. Page requests, which are posted requests, have been flushed to the IOMMU
> + *    when the stop request completes.
> + * b. We flush all fault queues on unbind() before freeing the PASID.
> + *
> + * So even though the Stop Marker might be issued by the device *after* the stop
> + * request completes, outstanding faults will have been dealt with by the time
> + * we free the PASID.
> + *
>    * Return: 0 on success and <0 on error.
>    */
>   int iommu_queue_iopf(struct iommu_fault *fault, void *cookie)
> 

The same applies to the comments.

Best regards,
baolu
Xu Zaibo May 4, 2020, 1:52 a.m. UTC | #11
On 2020/4/30 22:34, Jean-Philippe Brucker wrote:
> Some devices can tag their DMA requests with a 20-bit Process Address
> Space ID (PASID), allowing them to access multiple address spaces. In
> combination with recoverable I/O page faults (for example PCIe PRI),
> PASID allows the IOMMU to share page tables with the MMU.
>
> To make sure that a single PASID is allocated for each address space, as
> required by Intel ENQCMD, store the PASID in the mm_struct. The IOMMU
> driver is in charge of serializing modifications to the PASID field.
>
> Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
> ---
> For the field's validity I'm thinking invalid PASID = 0. In ioasid.h we
> define INVALID_IOASID as ~0U, but I think we can now change it to 0,
> since Intel is now also reserving PASID #0 for Transactions without
> PASID and AMD IOMMU uses GIoV for this too.
> ---
>   include/linux/mm_types.h | 4 ++++
>   1 file changed, 4 insertions(+)
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 4aba6c0c2ba80..8db6472758175 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -534,6 +534,10 @@ struct mm_struct {
>   		atomic_long_t hugetlb_usage;
>   #endif
>   		struct work_struct async_put_work;
> +#ifdef CONFIG_IOMMU_SUPPORT
> +		/* Address space ID used by device DMA */
> +		unsigned int pasid;
> +#endif
Maybe '#ifdef CONFIG_IOMMU_SVA ... #endif' is more reasonable?

Thanks,
Zaibo

>   	} __randomize_layout;
>   
>   	/*
Jean-Philippe Brucker May 4, 2020, 2:11 p.m. UTC | #12
On Thu, Apr 30, 2020 at 04:39:53PM +0100, Suzuki K Poulose wrote:
> On 04/30/2020 03:34 PM, Jean-Philippe Brucker wrote:
> > With Shared Virtual Addressing (SVA), we need to mirror CPU TTBR, TCR,
> > MAIR and ASIDs in SMMU contexts. Each SMMU has a single ASID space split
> > into two sets, shared and private. Shared ASIDs correspond to those
> > obtained from the arch ASID allocator, and private ASIDs are used for
> > "classic" map/unmap DMA.
> > 
> > Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
> > Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
> > ---
> 
> > +
> > +	tcr = FIELD_PREP(CTXDESC_CD_0_TCR_T0SZ, 64ULL - VA_BITS) |
> > +	      FIELD_PREP(CTXDESC_CD_0_TCR_IRGN0, ARM_LPAE_TCR_RGN_WBWA) |
> > +	      FIELD_PREP(CTXDESC_CD_0_TCR_ORGN0, ARM_LPAE_TCR_RGN_WBWA) |
> > +	      FIELD_PREP(CTXDESC_CD_0_TCR_SH0, ARM_LPAE_TCR_SH_IS) |
> > +	      CTXDESC_CD_0_TCR_EPD1 | CTXDESC_CD_0_AA64;
> > +
> > +	switch (PAGE_SIZE) {
> > +	case SZ_4K:
> > +		tcr |= FIELD_PREP(CTXDESC_CD_0_TCR_TG0, ARM_LPAE_TCR_TG0_4K);
> > +		break;
> > +	case SZ_16K:
> > +		tcr |= FIELD_PREP(CTXDESC_CD_0_TCR_TG0, ARM_LPAE_TCR_TG0_16K);
> > +		break;
> > +	case SZ_64K:
> > +		tcr |= FIELD_PREP(CTXDESC_CD_0_TCR_TG0, ARM_LPAE_TCR_TG0_64K);
> > +		break;
> > +	default:
> > +		WARN_ON(1);
> > +		ret = -EINVAL;
> > +		goto err_free_asid;
> > +	}
> > +
> > +	reg = read_sanitised_ftr_reg(SYS_ID_AA64MMFR0_EL1);
> > +	par = cpuid_feature_extract_unsigned_field(reg, ID_AA64MMFR0_PARANGE_SHIFT);
> > +	tcr |= FIELD_PREP(CTXDESC_CD_0_TCR_IPS, par);
> > +
> > +	cd->ttbr = virt_to_phys(mm->pgd);
> 
> Does the TTBR follow the same layout as TTBR_ELx for 52-bit IPA? i.e.,
> TTBR[5:2] = BADDR[51:48]? Are you covered for that?

Good point, I don't remember checking this. The SMMU TTBR doesn't have the
same layout as the CPU's, and we don't need to swizzle the bits. For the
lower bits, the alignment requirements on the pgd are identical to the
MMU's.
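
(For reference, the CPU-side swizzle in question is phys_to_ttbr() in
arch/arm64; quoted from memory, so the exact form may differ:

	/* arch/arm64/include/asm/pgtable.h, CONFIG_ARM64_PA_BITS_52 */
	#define phys_to_ttbr(addr)	(((addr) | ((addr) >> 46)) & TTBR_BADDR_MASK_52)

whereas the SMMU CD holds the TTB's physical address bits directly.)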

Thanks,
Jean
Prabhakar Kushwaha May 4, 2020, 2:24 p.m. UTC | #13
Dear Jean,

On Thu, Apr 30, 2020 at 8:11 PM Jean-Philippe Brucker
<jean-philippe@linaro.org> wrote:
>
> If the SMMU supports it and the kernel was built with HTTU support, enable

Is there any framework/config for HTTU which must be enabled to use this patch?


> We can enable HTTU even if CPUs don't support it, because the kernel
> always checks for HW dirty bit and updates the PTE flags atomically.
>
I believe this statement is valid in the context of this patch set only.

One cannot use a code snippet to test HTTU because the existing
io-pgtable-arm.c driver doesn't have a framework to leverage the HTTU
benefits. It sets AF=1 by default and does not set DBM.

Thanks

--pk
Jean-Philippe Brucker May 4, 2020, 2:25 p.m. UTC | #14
On Thu, Apr 30, 2020 at 11:39:31AM -0700, Jacob Pan wrote:
> > +/**
> > + * ioasid_get - obtain a reference to the IOASID
> > + */
> > +void ioasid_get(ioasid_t ioasid)
> why void? what if the ioasid is not valid.

My intended use was for the caller to get an additional reference when
they're already holding one. So this should always succeed and I'd prefer
a WARN_ON if the ioasid isn't valid rather than returning an error. But if
you intend to add a state to ioasids between dropping refcount and free,
then a return value makes sense.

Thanks,
Jean

> 
> > +{
> > +	struct ioasid_data *ioasid_data;
> > +
> > +	spin_lock(&ioasid_allocator_lock);
> > +	ioasid_data = xa_load(&active_allocator->xa, ioasid);
> > +	if (ioasid_data)
> > +		refcount_inc(&ioasid_data->refs);
> > +	spin_unlock(&ioasid_allocator_lock);
> > +}
> > +EXPORT_SYMBOL_GPL(ioasid_get);
> > +
> >  /**
> >   * ioasid_free - Free an IOASID
> >   * @ioasid: the ID to remove
> > + *
> > + * Put a reference to the IOASID, free it when the number of
> > references drops to
> > + * zero.
> > + *
> > + * Return: %true if the IOASID was freed, %false otherwise.
> >   */
> > -void ioasid_free(ioasid_t ioasid)
> > +bool ioasid_free(ioasid_t ioasid)
> >  {
> > +	bool free = false;
> >  	struct ioasid_data *ioasid_data;
> >  
> >  	spin_lock(&ioasid_allocator_lock);
> > @@ -360,6 +383,10 @@ void ioasid_free(ioasid_t ioasid)
> >  		goto exit_unlock;
> >  	}
> >  
> > +	free = refcount_dec_and_test(&ioasid_data->refs);
> > +	if (!free)
> > +		goto exit_unlock;
> > +
> Just FYI, we may need to add states for the IOASID, i.g. mark the IOASID
> inactive after free. And prohibit ioasid_get() after freed. For VT-d,
> this is useful when KVM queries the IOASID.
> 
> >  	active_allocator->ops->free(ioasid,
> > active_allocator->ops->pdata); /* Custom allocator needs additional
> > steps to free the xa element */ if (active_allocator->flags &
> > IOASID_ALLOCATOR_CUSTOM) { @@ -369,6 +396,7 @@ void
> > ioasid_free(ioasid_t ioasid) 
> >  exit_unlock:
> >  	spin_unlock(&ioasid_allocator_lock);
> > +	return free;
> >  }
> >  EXPORT_SYMBOL_GPL(ioasid_free);
> >  
> 
> [Jacob Pan]
Jean-Philippe Brucker May 4, 2020, 2:39 p.m. UTC | #15
On Thu, Apr 30, 2020 at 01:48:42PM -0700, Jacob Pan wrote:
> On Thu, 30 Apr 2020 11:39:31 -0700
> Jacob Pan <jacob.jun.pan@linux.intel.com> wrote:
> 
> > > -void ioasid_free(ioasid_t ioasid)
> > > +bool ioasid_free(ioasid_t ioasid)
> > >  {
> Sorry I missed this in the last reply.
> 
> I think free needs to be unconditional since there is not a good way to
> fail it.
> 
> Also can we have more symmetric APIs, seems we don't have ioasid_put()
> in this patchset.

Yes, I was thinking of renaming ioasid_free() to ioasid_put() but got lazy.

> How about?
> ioasid_alloc()
> ioasid_free(); //drop reference, mark inactive, but not reclaimed if
> 		refcount is not zero.
> ioasid_get() // returns err if the ioasid is marked inactive by
> 		ioasid_free()

How does the caller know that the ioasid is in active/inactive state, and
not freed/reallocated?

> ioasid_put();// drop reference, reclaim if refcount is 0.

I'll add ioasid_put() for now. I'd like to avoid introducing the inactive
state in this patch, so shall I change the calls in the Intel driver to
ioasid_put(), and not introduce a new ioasid_free() for the moment?

Thanks,
Jean
Suzuki K Poulose May 4, 2020, 2:42 p.m. UTC | #16
On 05/04/2020 03:11 PM, Jean-Philippe Brucker wrote:
> On Thu, Apr 30, 2020 at 04:39:53PM +0100, Suzuki K Poulose wrote:
>> On 04/30/2020 03:34 PM, Jean-Philippe Brucker wrote:
>>> With Shared Virtual Addressing (SVA), we need to mirror CPU TTBR, TCR,
>>> MAIR and ASIDs in SMMU contexts. Each SMMU has a single ASID space split
>>> into two sets, shared and private. Shared ASIDs correspond to those
>>> obtained from the arch ASID allocator, and private ASIDs are used for
>>> "classic" map/unmap DMA.
>>>
>>> Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
>>> Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
>>> ---
>>
>>> +
>>> +	tcr = FIELD_PREP(CTXDESC_CD_0_TCR_T0SZ, 64ULL - VA_BITS) |
>>> +	      FIELD_PREP(CTXDESC_CD_0_TCR_IRGN0, ARM_LPAE_TCR_RGN_WBWA) |
>>> +	      FIELD_PREP(CTXDESC_CD_0_TCR_ORGN0, ARM_LPAE_TCR_RGN_WBWA) |
>>> +	      FIELD_PREP(CTXDESC_CD_0_TCR_SH0, ARM_LPAE_TCR_SH_IS) |
>>> +	      CTXDESC_CD_0_TCR_EPD1 | CTXDESC_CD_0_AA64;
>>> +
>>> +	switch (PAGE_SIZE) {
>>> +	case SZ_4K:
>>> +		tcr |= FIELD_PREP(CTXDESC_CD_0_TCR_TG0, ARM_LPAE_TCR_TG0_4K);
>>> +		break;
>>> +	case SZ_16K:
>>> +		tcr |= FIELD_PREP(CTXDESC_CD_0_TCR_TG0, ARM_LPAE_TCR_TG0_16K);
>>> +		break;
>>> +	case SZ_64K:
>>> +		tcr |= FIELD_PREP(CTXDESC_CD_0_TCR_TG0, ARM_LPAE_TCR_TG0_64K);
>>> +		break;
>>> +	default:
>>> +		WARN_ON(1);
>>> +		ret = -EINVAL;
>>> +		goto err_free_asid;
>>> +	}
>>> +
>>> +	reg = read_sanitised_ftr_reg(SYS_ID_AA64MMFR0_EL1);
>>> +	par = cpuid_feature_extract_unsigned_field(reg, ID_AA64MMFR0_PARANGE_SHIFT);
>>> +	tcr |= FIELD_PREP(CTXDESC_CD_0_TCR_IPS, par);
>>> +
>>> +	cd->ttbr = virt_to_phys(mm->pgd);
>>
>> Does the TTBR follow the same layout as TTBR_ELx for 52-bit IPA? i.e.,
>> TTBR[5:2] = BADDR[51:48]? Are you covered for that?
> 
> Good point, I don't remember checking this. The SMMU TTBR doesn't have the
> same layout as the CPU's, and we don't need to swizzle the bits. For the
> lower bits, the alignment requirements on the pgd are identical to the
> MMU.

Ok, if that is the case:

Acked-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Jean-Philippe Brucker May 4, 2020, 3:09 p.m. UTC | #17
On Thu, Apr 30, 2020 at 02:18:16PM -0700, Jacob Pan wrote:
> On Thu, 30 Apr 2020 16:33:59 +0200
> Jean-Philippe Brucker <jean-philippe@linaro.org> wrote:
> 
> > Shared Virtual Addressing (SVA) allows to share process page tables
> > with devices using the IOMMU, PASIDs and I/O page faults. Add SVA
> > support to the Arm SMMUv3 driver.
> > 
> > Since v5 [1]:
> > 
> > * Added patches 1-3. Patch 1 adds a PASID field to mm_struct as
> >   discussed in [1] and [2]. This is also needed for Intel ENQCMD.
> > Patch 2 adds refcounts to IOASID and patch 3 adds a couple of helpers
> > to allocate the PASID.
> > 
> > * Dropped most of iommu-sva.c. After getting rid of io_mm following
> >   review of v5, there wasn't enough generic code left to justify the
> >   indirect branch overhead of io_mm_ops in the MMU notifiers. I ended
> > up with more glue than useful code, and couldn't find an easy way to
> > deal with domains in the SMMU driver (we keep PASID tables per domain,
> >   while x86 keeps them per device). The direct approach in patch 17 is
> >   nicer and a little easier to read. The SMMU driver only gained 160
> >   lines, while iommu-sva lost 470 lines.
> > 
> >   As a result I dropped the MMU notifier patch.
> > 
> >   Jacob, one upside of this rework is that we now free ioasids in
> >   blocking context, which might help with your addition of notifiers
> > to ioasid.c
> > 
> Thanks for the note. It does make notifier much easier, plus the
> refcount can alleviate the constraint on ordering.
> 
> I guess we don't share mmu notifier code for now :)

I think it's more efficient for each IOMMU driver to at least implement
its own invalidate_range() callback and avoid indirect branches. For the
rest I couldn't find a lot of code to share; most of it is writing PASID
tables and invalidating TLBs. We can revisit later: as long as we agree on
the bind() API, the implementations should be similar enough.
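
For SMMUv3 that callback would be roughly (a sketch; the ATS invalidation
is still a TODO at this point in the series):

  static void arm_smmu_mm_invalidate_range(struct mmu_notifier *mn,
                                           struct mm_struct *mm,
                                           unsigned long start,
                                           unsigned long end)
  {
          struct arm_smmu_mmu_notifier *smmu_mn = mn_to_smmu(mn);

          /* Direct call into the driver, no io_mm_ops indirection */
          arm_smmu_atc_inv_domain(smmu_mn->domain, mm->pasid, start,
                                  end - start);
  }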

Thanks,
Jean
Jacob Pan May 4, 2020, 3:22 p.m. UTC | #18
On Mon, 4 May 2020 16:39:32 +0200
Jean-Philippe Brucker <jean-philippe@linaro.org> wrote:

> On Thu, Apr 30, 2020 at 01:48:42PM -0700, Jacob Pan wrote:
> > On Thu, 30 Apr 2020 11:39:31 -0700
> > Jacob Pan <jacob.jun.pan@linux.intel.com> wrote:
> >   
> > > > -void ioasid_free(ioasid_t ioasid)
> > > > +bool ioasid_free(ioasid_t ioasid)
> > > >  {  
> > Sorry I missed this in the last reply.
> > 
> > I think free needs to be unconditional since there is not a good
> > way to fail it.
> > 
> > Also can we have more symmetric APIs, seems we don't have
> > ioasid_put() in this patchset.  
> 
> Yes I was thinking of renaming ioasid_free() to ioasid_put() but got
> lazy. 
> 
> > How about?
> > ioasid_alloc()
> > ioasid_free(); //drop reference, mark inactive, but not reclaimed if
> > 		refcount is not zero.
> > ioasid_get() // returns err if the ioasid is marked inactive by
> > 		ioasid_free()  
> 
> How does the caller know that the ioasid is in active/inactive state,
> and not freed/reallocated?
> 
In the inactive state, calls to ioasid_find() and ioasid_get() would all
fail. Only ioasid_put() can still operate on it.

In the freed state (i.e. not allocated), the behavior is the same as
above, except that ioasid_put() has no effect.
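
To illustrate the proposed lifecycle (pseudo-code; the inactive state is
not part of this patchset yet):

  enum ioasid_state {
          IOASID_FREE,            /* not allocated */
          IOASID_ACTIVE,          /* allocated, references held */
          IOASID_INACTIVE,        /* ioasid_free() called, references remain */
  };

  /*
   * ACTIVE:   ioasid_find(), ioasid_get() and ioasid_put() all work
   * INACTIVE: ioasid_find() and ioasid_get() fail, ioasid_put() works
   * FREE:     ioasid_put() has no effect
   */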

> > ioasid_put();// drop reference, reclaim if refcount is 0.  
> 
> I'll add ioasid_put() for now. I'd like to avoid introducing the
> inactive state in this patch,
Sounds good. I just wanted to consult with you about the above APIs. I
will introduce the state when we have a real use.

> so shall I change the calls in the
> Intel driver to ioasid_put(), and not introduce a new ioasid_free()
> for the moment?
> 
Sounds good. 

> Thanks,
> Jean
> 

[Jacob Pan]
Jacob Pan May 4, 2020, 3:27 p.m. UTC | #19
On Mon, 4 May 2020 16:25:48 +0200
Jean-Philippe Brucker <jean-philippe@linaro.org> wrote:

> On Thu, Apr 30, 2020 at 11:39:31AM -0700, Jacob Pan wrote:
> > > +/**
> > > + * ioasid_get - obtain a reference to the IOASID
> > > + */
> > > +void ioasid_get(ioasid_t ioasid)  
> > Why void? What if the ioasid is not valid?
> 
> My intended use was for the caller to get an additional reference when
> they're already holding one. So this should always succeed and I'd
> prefer a WARN_ON if the ioasid isn't valid rather than returning an
> error. But if you intend to add a state to ioasids between dropping
> refcount and free, then a return value makes sense.
> 
Yes, a WARN_ON will do. No need for a return value for now.

> Thanks,
> Jean
> 
> >   
> > > +{
> > > +	struct ioasid_data *ioasid_data;
> > > +
> > > +	spin_lock(&ioasid_allocator_lock);
> > > +	ioasid_data = xa_load(&active_allocator->xa, ioasid);
> > > +	if (ioasid_data)
> > > +		refcount_inc(&ioasid_data->refs);
> > > +	spin_unlock(&ioasid_allocator_lock);
> > > +}
> > > +EXPORT_SYMBOL_GPL(ioasid_get);
> > > +
> > >  /**
> > >   * ioasid_free - Free an IOASID
> > >   * @ioasid: the ID to remove
> > > + *
> > > + * Put a reference to the IOASID, free it when the number of
> > > references drops to
> > > + * zero.
> > > + *
> > > + * Return: %true if the IOASID was freed, %false otherwise.
> > >   */
> > > -void ioasid_free(ioasid_t ioasid)
> > > +bool ioasid_free(ioasid_t ioasid)
> > >  {
> > > +	bool free = false;
> > >  	struct ioasid_data *ioasid_data;
> > >  
> > >  	spin_lock(&ioasid_allocator_lock);
> > > @@ -360,6 +383,10 @@ void ioasid_free(ioasid_t ioasid)
> > >  		goto exit_unlock;
> > >  	}
> > >  
> > > +	free = refcount_dec_and_test(&ioasid_data->refs);
> > > +	if (!free)
> > > +		goto exit_unlock;
> > > +  
> > > Just FYI, we may need to add states for the IOASID, e.g. mark the
> > > IOASID inactive after free, and prohibit ioasid_get() once freed.
> > For VT-d, this is useful when KVM queries the IOASID.
> >   
> > >  	active_allocator->ops->free(ioasid, active_allocator->ops->pdata);
> > >  	/* Custom allocator needs additional steps to free the xa element */
> > >  	if (active_allocator->flags & IOASID_ALLOCATOR_CUSTOM) {
> > > @@ -369,6 +396,7 @@ void ioasid_free(ioasid_t ioasid)
> > >  exit_unlock:
> > >  	spin_unlock(&ioasid_allocator_lock);
> > > +	return free;
> > >  }
> > >  EXPORT_SYMBOL_GPL(ioasid_free);
> > >    
> > 
> > [Jacob Pan]  

[Jacob Pan]
Jean-Philippe Brucker May 4, 2020, 4:06 p.m. UTC | #20
On Fri, May 01, 2020 at 05:15:52AM -0700, Christoph Hellwig wrote:
> > @@ -432,6 +432,7 @@ config ARM_SMMU_V3
> >  	tristate "ARM Ltd. System MMU Version 3 (SMMUv3) Support"
> >  	depends on ARM64
> >  	select IOMMU_API
> > +	select IOMMU_SVA
> >  	select IOMMU_IO_PGTABLE_LPAE
> >  	select GENERIC_MSI_IRQ_DOMAIN
> 
> Doesn't this need to select MMU_NOTIFIER now?

Yes, will fix

> > +	struct mmu_notifier_ops		mn_ops;
> 
> Note: not a pointer.
> 
> > +	/* If bind() was already called for this (dev, mm) pair, reuse it. */
> > +	list_for_each_entry(bond, &master->bonds, list) {
> > +		if (bond->mm == mm) {
> > +			refcount_inc(&bond->refs);
> > +			return &bond->sva;
> > +		}
> > +	}
> > +
> > +	mn = mmu_notifier_get(&smmu_domain->mn_ops, mm);
> > +	if (IS_ERR(mn))
> > +		return ERR_CAST(mn);
> 
> Which seems to be to avoid mmu_notifier_get reusing notifiers registered
> by other arm_smmu_master instances, right?

Yes, although I'm registering a single mmu notifier per (domain, mm) pair,
not (master, mm), because the SMMU driver keeps one set of PASID tables
per IOMMU domain.

> Either you could just use plain old mmu_notifier_register to avoid
> the reuse.  Or we could enhance the mmu_notifier_get to pass a private
> opaque instance ID pointer, which is checked in addition to the ops,
> and you could probably kill off the bonds list and lookup.

Going back to mmu_notifier_register() seems better for now. I don't want
to change the core APIs just for this driver, because it's likely to
change again when more hardware starts appearing and we optimize it.
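
Something like this in bind(), then (a rough sketch; the structure and
ops names follow this series):

  struct arm_smmu_mmu_notifier *smmu_mn;
  int ret;

  smmu_mn = kzalloc(sizeof(*smmu_mn), GFP_KERNEL);
  if (!smmu_mn)
          return ERR_PTR(-ENOMEM);

  smmu_mn->mn.ops = &arm_smmu_mmu_notifier_ops;
  smmu_mn->domain = smmu_domain;

  /* One notifier per (domain, mm); no ops reuse across instances */
  ret = mmu_notifier_register(&smmu_mn->mn, mm);
  if (ret) {
          kfree(smmu_mn);
          return ERR_PTR(ret);
  }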

Thanks,
Jean
Jean-Philippe Brucker May 4, 2020, 4:07 p.m. UTC | #21
On Fri, May 01, 2020 at 09:55:13AM -0300, Jason Gunthorpe wrote:
> On Fri, May 01, 2020 at 05:15:52AM -0700, Christoph Hellwig wrote:
> > > @@ -432,6 +432,7 @@ config ARM_SMMU_V3
> > >  	tristate "ARM Ltd. System MMU Version 3 (SMMUv3) Support"
> > >  	depends on ARM64
> > >  	select IOMMU_API
> > > +	select IOMMU_SVA
> > >  	select IOMMU_IO_PGTABLE_LPAE
> > >  	select GENERIC_MSI_IRQ_DOMAIN
> > 
> > Doesn't this need to select MMU_NOTIFIER now?
> > 
> > > +	struct mmu_notifier_ops		mn_ops;
> > 
> > Note: not a pointer.
> > 
> > > +	/* If bind() was already called for this (dev, mm) pair, reuse it. */
> > > +	list_for_each_entry(bond, &master->bonds, list) {
> > > +		if (bond->mm == mm) {
> > > +			refcount_inc(&bond->refs);
> > > +			return &bond->sva;
> > > +		}
> > > +	}
> 
> I also would like it if searching for mms in linked lists was not
> necessary; this is kind of the point of 'get'.
> 
> Is this a side effect of the earlier remark to get rid of the linked
> list inside the notifier?
> 
> > Or we could enhance the mmu_notifier_get to pass a private
> > opaque instance ID pointer, which is checked in addition to the ops,
> > and you could probably kill off the bonds list and lookup.
> 
> This might be the best option if it can absorb the above search..

It wouldn't; the above search is separate. I currently register the MMU
notifier on (IOMMU domain, mm). The (dev, mm) search above is there to
follow the iommu_sva_bind_device() API doc, which states that we should
return the same handle for a given (dev, mm) pair.

Thanks,
Jean
Jean-Philippe Brucker May 4, 2020, 4:22 p.m. UTC | #22
On Sun, May 03, 2020 at 01:49:01PM +0800, Lu Baolu wrote:
> > +static void iopf_handle_group(struct work_struct *work)
> > +{
> > +	struct iopf_group *group;
> > +	struct iopf_fault *iopf, *next;
> > +	enum iommu_page_response_code status = IOMMU_PAGE_RESP_SUCCESS;
> > +
> > +	group = container_of(work, struct iopf_group, work);
> > +
> > +	list_for_each_entry_safe(iopf, next, &group->faults, head) {
> > +		/*
> > +		 * For the moment, errors are sticky: don't handle subsequent
> > +		 * faults in the group if there is an error.
> > +		 */
> > +		if (status == IOMMU_PAGE_RESP_SUCCESS)
> > +			status = iopf_handle_single(iopf);
> > +
> > +		if (!(iopf->fault.prm.flags &
> > +		      IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE))
> > +			kfree(iopf);
> 
> The iopf is freed, but not removed from the list. This will leave a
> dangling pointer in the list.

We free the list with the group below, so this one is fine.

> 
> > +	}
> > +
> > +	iopf_complete_group(group->dev, &group->last_fault, status);
> > +	kfree(group);
> > +}
> > +
> 
> [...]
> 
> > +/**
> > + * iopf_queue_flush_dev - Ensure that all queued faults have been processed
> > + * @dev: the endpoint whose faults need to be flushed.
> > + * @pasid: the PASID affected by this flush
> > + *
> > + * The IOMMU driver calls this before releasing a PASID, to ensure that all
> > + * pending faults for this PASID have been handled, and won't hit the address
> > + * space of the next process that uses this PASID. The driver must make sure
> > + * that no new fault is added to the queue. In particular it must flush its
> > + * low-level queue before calling this function.
> > + *
> > + * Return: 0 on success and <0 on error.
> > + */
> > +int iopf_queue_flush_dev(struct device *dev, int pasid)
> > +{
> > +	int ret = 0;
> > +	struct iopf_device_param *iopf_param;
> > +	struct dev_iommu *param = dev->iommu;
> > +
> > +	if (!param)
> > +		return -ENODEV;
> > +
> > +	mutex_lock(&param->lock);
> > +	iopf_param = param->iopf_param;
> > +	if (iopf_param)
> > +		flush_workqueue(iopf_param->queue->wq);
> 
> There may be iopf entries for other PASIDs in the workqueue. Flushing
> all tasks in the workqueue will hurt those other PASIDs. We might lose
> some context.

Granted this isn't optimal because we don't take the PASID argument into
account (I think I'll remove it, don't know how to use it). But I don't
think it affects other PASIDs, because all flush_workqueue() does is wait
until all faults currently in the workqueue are processed. So it only
blocks the current thread, but nothing is lost.
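
For clarity, the intended calling sequence on unbind() is roughly (a
sketch; the hardware-queue flush helper name is made up):

  /* 1. Clear the PASID context so no new faults are recorded */
  /* 2. Flush the driver's low-level fault queue, so that no fault for
   *    this PASID is still in flight (hypothetical helper):
   */
  driver_flush_hw_fault_queue(dev);
  /* 3. Wait for the fault workqueue to drain */
  ret = iopf_queue_flush_dev(dev, pasid);
  /* 4. The PASID can now be freed; no pending fault can hit the next
   *    address space bound to it
   */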

> 
> > +	else
> > +		ret = -ENODEV;
> > +	mutex_unlock(&param->lock);
> > +
> > +	return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(iopf_queue_flush_dev);
> > +
> > +/**
> > + * iopf_queue_discard_partial - Remove all pending partial fault
> > + * @queue: the queue whose partial faults need to be discarded
> > + *
> > + * When the hardware queue overflows, last page faults in a group may have been
> > + * lost and the IOMMU driver calls this to discard all partial faults. The
> > + * driver shouldn't be adding new faults to this queue concurrently.
> > + *
> > + * Return: 0 on success and <0 on error.
> > + */
> > +int iopf_queue_discard_partial(struct iopf_queue *queue)
> > +{
> > +	struct iopf_fault *iopf, *next;
> > +	struct iopf_device_param *iopf_param;
> > +
> > +	if (!queue)
> > +		return -EINVAL;
> > +
> > +	mutex_lock(&queue->lock);
> > +	list_for_each_entry(iopf_param, &queue->devices, queue_list) {
> > +		list_for_each_entry_safe(iopf, next, &iopf_param->partial, head)
> > +			kfree(iopf);
> 
> iopf is freed but not removed from the list.

Ouch yes this is wrong, will fix.

> 
> > +	}
> > +	mutex_unlock(&queue->lock);
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(iopf_queue_discard_partial);
> > +
> > +/**
> > + * iopf_queue_add_device - Add producer to the fault queue
> > + * @queue: IOPF queue
> > + * @dev: device to add
> > + *
> > + * Return: 0 on success and <0 on error.
> > + */
> > +int iopf_queue_add_device(struct iopf_queue *queue, struct device *dev)
> > +{
> > +	int ret = -EBUSY;
> > +	struct iopf_device_param *iopf_param;
> > +	struct dev_iommu *param = dev->iommu;
> > +
> > +	if (!param)
> > +		return -ENODEV;
> > +
> > +	iopf_param = kzalloc(sizeof(*iopf_param), GFP_KERNEL);
> > +	if (!iopf_param)
> > +		return -ENOMEM;
> > +
> > +	INIT_LIST_HEAD(&iopf_param->partial);
> > +	iopf_param->queue = queue;
> > +	iopf_param->dev = dev;
> > +
> > +	mutex_lock(&queue->lock);
> > +	mutex_lock(&param->lock);
> > +	if (!param->iopf_param) {
> > +		list_add(&iopf_param->queue_list, &queue->devices);
> > +		param->iopf_param = iopf_param;
> > +		ret = 0;
> > +	}
> > +	mutex_unlock(&param->lock);
> > +	mutex_unlock(&queue->lock);
> > +
> > +	if (ret)
> > +		kfree(iopf_param);
> > +
> > +	return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(iopf_queue_add_device);
> > +
> > +/**
> > + * iopf_queue_remove_device - Remove producer from fault queue
> > + * @queue: IOPF queue
> > + * @dev: device to remove
> > + *
> > + * Caller makes sure that no more faults are reported for this device.
> > + *
> > + * Return: 0 on success and <0 on error.
> > + */
> > +int iopf_queue_remove_device(struct iopf_queue *queue, struct device *dev)
> > +{
> > +	int ret = 0;
> > +	struct iopf_fault *iopf, *next;
> > +	struct iopf_device_param *iopf_param;
> > +	struct dev_iommu *param = dev->iommu;
> > +
> > +	if (!param || !queue)
> > +		return -EINVAL;
> > +
> > +	mutex_lock(&queue->lock);
> > +	mutex_lock(&param->lock);
> > +	iopf_param = param->iopf_param;
> > +	if (iopf_param && iopf_param->queue == queue) {
> > +		list_del(&iopf_param->queue_list);
> > +		param->iopf_param = NULL;
> > +	} else {
> > +		ret = -EINVAL;
> > +	}
> > +	mutex_unlock(&param->lock);
> > +	mutex_unlock(&queue->lock);
> > +	if (ret)
> > +		return ret;
> > +
> > +	/* Just in case some faults are still stuck */
> > +	list_for_each_entry_safe(iopf, next, &iopf_param->partial, head)
> > +		kfree(iopf);
> 
> The same here.

Here it's fine, we free the whole iopf_param below.

Thanks,
Jean

> 
> > +
> > +	kfree(iopf_param);
> > +
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(iopf_queue_remove_device);
Jean-Philippe Brucker May 4, 2020, 4:25 p.m. UTC | #23
On Sun, May 03, 2020 at 01:54:36PM +0800, Lu Baolu wrote:
> On 2020/4/30 22:34, Jean-Philippe Brucker wrote:
> > When a recoverable page fault is handled by the fault workqueue, find the
> > associated mm and call handle_mm_fault.
> > 
> > Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
> > ---
> > v5->v6: select CONFIG_IOMMU_SVA
> > ---
> >   drivers/iommu/Kconfig      |  1 +
> >   drivers/iommu/io-pgfault.c | 79 +++++++++++++++++++++++++++++++++++++-
> >   2 files changed, 78 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
> > index 4f33e489f0726..1e64ee6592e16 100644
> > --- a/drivers/iommu/Kconfig
> > +++ b/drivers/iommu/Kconfig
> > @@ -109,6 +109,7 @@ config IOMMU_SVA
> >   config IOMMU_PAGE_FAULT
> >   	bool
> > +	select IOMMU_SVA
> 
> It would be better to move this to the previous patch.
> 
[...]
> > @@ -104,6 +156,29 @@ static void iopf_handle_group(struct work_struct *work)
> >    *
> >    * Add a fault to the device workqueue, to be handled by mm.
> >    *
> > + * This module doesn't handle PCI PASID Stop Marker; IOMMU drivers must discard
> > + * them before reporting faults. A PASID Stop Marker (LRW = 0b100) doesn't
> > + * expect a response. It may be generated when disabling a PASID (issuing a
> > + * PASID stop request) by some PCI devices.
> > + *
> > + * The PASID stop request is issued by the device driver before unbind(). Once
> > + * it completes, no page request is generated for this PASID anymore and
> > + * outstanding ones have been pushed to the IOMMU (as per PCIe 4.0r1.0 - 6.20.1
> > + * and 10.4.1.2 - Managing PASID TLP Prefix Usage). Some PCI devices will wait
> > + * for all outstanding page requests to come back with a response before
> > + * completing the PASID stop request. Others do not wait for page responses, and
> > + * instead issue this Stop Marker that tells us when the PASID can be
> > + * reallocated.
> > + *
> > + * It is safe to discard the Stop Marker because it is an optimization.
> > + * a. Page requests, which are posted requests, have been flushed to the IOMMU
> > + *    when the stop request completes.
> > + * b. We flush all fault queues on unbind() before freeing the PASID.
> > + *
> > + * So even though the Stop Marker might be issued by the device *after* the stop
> > + * request completes, outstanding faults will have been dealt with by the time
> > + * we free the PASID.
> > + *
> >    * Return: 0 on success and <0 on error.
> >    */
> >   int iommu_queue_iopf(struct iommu_fault *fault, void *cookie)
> > 
> 
> The same for the comments.

I think I'll squash both patches; it probably doesn't make the review any
harder.

Thanks,
Jean
Jean-Philippe Brucker May 4, 2020, 4:29 p.m. UTC | #24
On Mon, May 04, 2020 at 09:52:44AM +0800, Xu Zaibo wrote:
> 
> On 2020/4/30 22:34, Jean-Philippe Brucker wrote:
> > Some devices can tag their DMA requests with a 20-bit Process Address
> > Space ID (PASID), allowing them to access multiple address spaces. In
> > combination with recoverable I/O page faults (for example PCIe PRI),
> > PASID allows the IOMMU to share page tables with the MMU.
> > 
> > To make sure that a single PASID is allocated for each address space, as
> > required by Intel ENQCMD, store the PASID in the mm_struct. The IOMMU
> > driver is in charge of serializing modifications to the PASID field.
> > 
> > Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
> > ---
> > For the field's validity I'm thinking invalid PASID = 0. In ioasid.h we
> > define INVALID_IOASID as ~0U, but I think we can now change it to 0,
> > since Intel is now also reserving PASID #0 for Transactions without
> > PASID and AMD IOMMU uses GIoV for this too.
> > ---
> >   include/linux/mm_types.h | 4 ++++
> >   1 file changed, 4 insertions(+)
> > 
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 4aba6c0c2ba80..8db6472758175 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -534,6 +534,10 @@ struct mm_struct {
> >   		atomic_long_t hugetlb_usage;
> >   #endif
> >   		struct work_struct async_put_work;
> > +#ifdef CONFIG_IOMMU_SUPPORT
> > +		/* Address space ID used by device DMA */
> > +		unsigned int pasid;
> > +#endif
> Maybe '#ifdef CONFIG_IOMMU_SVA ... #endif' is more reasonable?

CONFIG_IOMMU_SVA enables a few helpers but IOMMU drivers don't have to use
them, so I think IOMMU_SUPPORT is more appropriate.

Thanks,
Jean
Jean-Philippe Brucker May 4, 2020, 4:35 p.m. UTC | #25
Hi,

On Mon, May 04, 2020 at 07:54:03PM +0530, Prabhakar Kushwaha wrote:
> Dear Jean,
> 
> On Thu, Apr 30, 2020 at 8:11 PM Jean-Philippe Brucker
> <jean-philippe@linaro.org> wrote:
> >
> > If the SMMU supports it and the kernel was built with HTTU support, enable
> 
> Is there any framework/config for HTTU which must be enabled to use this patch?
> 
> 
> > We can enable HTTU even if CPUs don't support it, because the kernel
> > always checks for HW dirty bit and updates the PTE flags atomically.
> >
> I believe this statement is valid in the context of this patch-set only.
> 
> One cannot use a code snippet to test HTTU, because the existing
> io-pgtable-arm.c driver doesn't have a framework to leverage the HTTU
> benefits. It sets AF=1 by default and does not set DBM.

Right, this patch only sets the hardware access and dirty flags for SVA
(page tables shared with the CPU through iommu_bind*()); it doesn't enable
anything for iommu_map/unmap(). Although I remember it being discussed for
VM migration, I don't know of any effort to use the hardware access/dirty
bits outside of SVA.
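
Concretely, the SVA path only ORs the hardware-update bits into the
context descriptor's TCR, roughly (a sketch; the feature and field names
are the ones this series adds):

  if (smmu->features & ARM_SMMU_FEAT_HD)
          tcr |= CTXDESC_CD_0_TCR_HA | CTXDESC_CD_0_TCR_HD;
  else if (smmu->features & ARM_SMMU_FEAT_HA)
          tcr |= CTXDESC_CD_0_TCR_HA;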

Thanks,
Jean
Jean-Philippe Brucker May 4, 2020, 4:43 p.m. UTC | #26
On Thu, Apr 30, 2020 at 02:16:17PM -0700, Jacob Pan wrote:
> > +static void arm_smmu_mm_invalidate_range(struct mmu_notifier *mn,
> > +					 struct mm_struct *mm,
> > +					 unsigned long start,
> > +					 unsigned long end)
> > +{
> > +	/* TODO: invalidate ATS */
> > +}
> > +
> > +static void arm_smmu_mm_release(struct mmu_notifier *mn, struct mm_struct *mm)
> > +{
> > +	struct arm_smmu_mmu_notifier *smmu_mn = mn_to_smmu(mn);
> > +	struct arm_smmu_domain *smmu_domain;
> > +
> > +	mutex_lock(&arm_smmu_sva_lock);
> > +	if (smmu_mn->cleared) {
> > +		mutex_unlock(&arm_smmu_sva_lock);
> > +		return;
> > +	}
> > +
> > +	smmu_domain = smmu_mn->domain;
> > +
> > +	/*
> > +	 * DMA may still be running. Keep the cd valid but disable
> > +	 * translation, so that new events will still result in stall.
> > +	 */
> Does "disable translation" also disable translated requests?

No, it doesn't disable translated requests; it only prevents the SMMU from
accessing the pgd.

> I guess
> release is called after tlb invalidate range, so assuming no more
> devTLB left to generate translated request?

I'm counting on the invalidate below (a TODO here, implemented in the next
patch) to drop all devTLB entries. After that invalidate, the device:
* issues a Translation Request, returns with R=W=0 because we disabled
  translation (and it isn't present in the SMMU TLB).
* issues a Page Request, returns with InvalidRequest because
  mmget_not_zero() fails.
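
Putting it together, once the ATS invalidation from the next patch is
filled in, the release path looks roughly like this (sketch):

  /* Disable translation but keep the CD valid, so new events stall */
  arm_smmu_write_ctx_desc(smmu_domain, mm->pasid, &invalid_cd);

  /* Remove SMMU TLB entries for the shared ASID */
  arm_smmu_tlb_inv_asid(smmu_domain->smmu, smmu_mn->cd->asid);

  /* Remove all device ATC (devTLB) entries for this PASID */
  arm_smmu_atc_inv_domain(smmu_domain, mm->pasid, 0, 0);

  smmu_mn->cleared = true;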

> 
> > +	arm_smmu_write_ctx_desc(smmu_domain, mm->pasid, &invalid_cd);
> > +
> > +	arm_smmu_tlb_inv_asid(smmu_domain->smmu, smmu_mn->cd->asid);
> > +	/* TODO: invalidate ATS */
> > +
> If mm release is called after tlb invalidate range, is it still
> necessary to invalidate again?

No, provided all mappings from the address space are unmapped and
invalidated. I'll double check, but in my tests invalidate range didn't
seem to be called for all mappings on mm exit, so I believe we do need
this.

Thanks,
Jean
Jacob Pan May 4, 2020, 8:47 p.m. UTC | #27
On Mon, 4 May 2020 18:43:51 +0200
Jean-Philippe Brucker <jean-philippe@linaro.org> wrote:

> On Thu, Apr 30, 2020 at 02:16:17PM -0700, Jacob Pan wrote:
> > > +static void arm_smmu_mm_invalidate_range(struct mmu_notifier *mn,
> > > +					 struct mm_struct *mm,
> > > +					 unsigned long start,
> > > +					 unsigned long end)
> > > +{
> > > +	/* TODO: invalidate ATS */
> > > +}
> > > +
> > > +static void arm_smmu_mm_release(struct mmu_notifier *mn, struct mm_struct *mm)
> > > +{
> > > +	struct arm_smmu_mmu_notifier *smmu_mn = mn_to_smmu(mn);
> > > +	struct arm_smmu_domain *smmu_domain;
> > > +
> > > +	mutex_lock(&arm_smmu_sva_lock);
> > > +	if (smmu_mn->cleared) {
> > > +		mutex_unlock(&arm_smmu_sva_lock);
> > > +		return;
> > > +	}
> > > +
> > > +	smmu_domain = smmu_mn->domain;
> > > +
> > > +	/*
> > > +	 * DMA may still be running. Keep the cd valid but disable
> > > +	 * translation, so that new events will still result in stall.
> > > +	 */  
> > Does "disable translation" also disable translated requests?  
> 
> No, it doesn't disable translated requests; it only prevents the SMMU
> from accessing the pgd.
> 
OK. same as VT-d.

> > I guess
> > release is called after tlb invalidate range, so assuming no more
> > devTLB left to generate translated request?  
> 
> I'm counting on the invalidate below (a TODO here, implemented in the
> next patch) to drop all devTLB entries. After that invalidate, the device:
> * issues a Translation Request, returns with R=W=0 because we disabled
>   translation (and it isn't present in the SMMU TLB).
> * issues a Page Request, returns with InvalidRequest because
>   mmget_not_zero() fails.
> 
Same flow. Thanks for the explanation.

> >   
> > > +	arm_smmu_write_ctx_desc(smmu_domain, mm->pasid, &invalid_cd);
> > > +
> > > +	arm_smmu_tlb_inv_asid(smmu_domain->smmu, smmu_mn->cd->asid);
> > > +	/* TODO: invalidate ATS */
> > > +
> > If mm release is called after tlb invalidate range, is it still
> > necessary to invalidate again?  
> 
> No, provided all mappings from the address space are unmapped and
> invalidated. I'll double check, but in my tests invalidate range
> didn't seem to be called for all mappings on mm exit, so I believe we
> do need this.
> 
I think it is safe to invalidate again. There was a concern that mm
release may remove the IOMMU driver from the notification list and miss
a TLB invalidate range. I had a hard time confirming that with ftrace
while killing a process; there were many lost events.


> Thanks,
> Jean
> 

[Jacob Pan]
Jean-Philippe Brucker May 5, 2020, 9:15 a.m. UTC | #28
On Mon, May 04, 2020 at 01:47:23PM -0700, Jacob Pan wrote:
> > > > +	arm_smmu_write_ctx_desc(smmu_domain, mm->pasid, &invalid_cd);
> > > > +
> > > > +	arm_smmu_tlb_inv_asid(smmu_domain->smmu, smmu_mn->cd->asid);
> > > > +	/* TODO: invalidate ATS */
> > > > +
> > > If mm release is called after tlb invalidate range, is it still
> > > necessary to invalidate again?  
> > 
> > No, provided all mappings from the address space are unmapped and
> > invalidated. I'll double check, but in my tests invalidate range
> > didn't seem to be called for all mappings on mm exit, so I believe we
> > do need this.
> > 
> I think it is safe to invalidate again. There was a concern that mm
> release may remove the IOMMU driver from the notification list and
> miss a TLB invalidate range. I had a hard time confirming that with
> ftrace while killing a process; there were many lost events.
> 

If it helps, I have a test that generates small DMA transactions on an
SMMU model. This is the trace for a job on an 8kB mmap'd buffer:

  smmu_bind_alloc: dev=0000:00:03.0 pasid=1
  dev_fault: IOMMU:0000:00:03.0 type=2 reason=0 addr=0x0000ffff860e6000 pasid=1 group=74 flags=3 prot=2
  dev_page_response: IOMMU:0000:00:03.0 code=0 pasid=1 group=74
  dev_fault: IOMMU:0000:00:03.0 type=2 reason=0 addr=0x0000ffff860e7000 pasid=1 group=143 flags=3 prot=2
  dev_page_response: IOMMU:0000:00:03.0 code=0 pasid=1 group=143
  smmu_mm_invalidate: pasid=1 start=0xffff860e6000 end=0xffff860e8000
  smmu_mm_invalidate: pasid=1 start=0xffff860e6000 end=0xffff860e8000
  smmu_mm_invalidate: pasid=1 start=0xffff860e8000 end=0xffff860ea000
  smmu_mm_invalidate: pasid=1 start=0xffff860e8000 end=0xffff860ea000
  smmu_unbind_free: dev=0000:00:03.0 pasid=1

And this is the same job, but the process immediately kills itself after
launching it.

  smmu_bind_alloc: dev=0000:00:03.0 pasid=1
  dev_fault: IOMMU:0000:00:03.0 type=2 reason=0 addr=0x0000ffffb9d15000 pasid=1 group=259 flags=3 prot=2
  smmu_mm_release: pasid=1
  dev_page_response: IOMMU:0000:00:03.0 code=0 pasid=1 group=259
  dev_fault: IOMMU:0000:00:03.0 type=2 reason=0 addr=0x0000ffffb9d15000 pasid=1 group=383 flags=3 prot=2
  dev_page_response: IOMMU:0000:00:03.0 code=1 pasid=1 group=383
  smmu_unbind_free: dev=0000:00:03.0 pasid=1

We don't get any invalidate_range notification in this case.

Thanks,
Jean
Jacob Pan May 7, 2020, 4:31 p.m. UTC | #29
On Tue, 5 May 2020 11:15:31 +0200
Jean-Philippe Brucker <jean-philippe@linaro.org> wrote:

> On Mon, May 04, 2020 at 01:47:23PM -0700, Jacob Pan wrote:
> > > > > +	arm_smmu_write_ctx_desc(smmu_domain, mm->pasid, &invalid_cd);
> > > > > +
> > > > > +	arm_smmu_tlb_inv_asid(smmu_domain->smmu, smmu_mn->cd->asid);
> > > > > +	/* TODO: invalidate ATS */
> > > > > +
> > > > If mm release is called after tlb invalidate range, is it still
> > > > necessary to invalidate again?    
> > > 
> > > No, provided all mappings from the address space are unmapped and
> > > invalidated. I'll double check, but in my tests invalidate range
> > > didn't seem to be called for all mappings on mm exit, so I
> > > believe we do need this.
> > >   
> > I think it is safe to invalidate again. There was a concern that mm
> > release may remove the IOMMU driver from the notification list and
> > miss a TLB invalidate range. I had a hard time confirming that with
> > ftrace while killing a process; there were many lost events.
> >   
> 
> If it helps, I have a test that generates small DMA transactions on an
> SMMU model. This is the trace for a job on an 8kB mmap'd buffer:
> 
>   smmu_bind_alloc: dev=0000:00:03.0 pasid=1
>   dev_fault: IOMMU:0000:00:03.0 type=2 reason=0 addr=0x0000ffff860e6000 pasid=1 group=74 flags=3 prot=2
>   dev_page_response: IOMMU:0000:00:03.0 code=0 pasid=1 group=74
>   dev_fault: IOMMU:0000:00:03.0 type=2 reason=0 addr=0x0000ffff860e7000 pasid=1 group=143 flags=3 prot=2
>   dev_page_response: IOMMU:0000:00:03.0 code=0 pasid=1 group=143
>   smmu_mm_invalidate: pasid=1 start=0xffff860e6000 end=0xffff860e8000
>   smmu_mm_invalidate: pasid=1 start=0xffff860e6000 end=0xffff860e8000
>   smmu_mm_invalidate: pasid=1 start=0xffff860e8000 end=0xffff860ea000
>   smmu_mm_invalidate: pasid=1 start=0xffff860e8000 end=0xffff860ea000
>   smmu_unbind_free: dev=0000:00:03.0 pasid=1
> 
> And this is the same job, but the process immediately kills itself
> after launching it.
> 
>   smmu_bind_alloc: dev=0000:00:03.0 pasid=1
>   dev_fault: IOMMU:0000:00:03.0 type=2 reason=0 addr=0x0000ffffb9d15000 pasid=1 group=259 flags=3 prot=2
>   smmu_mm_release: pasid=1
>   dev_page_response: IOMMU:0000:00:03.0 code=0 pasid=1 group=259
>   dev_fault: IOMMU:0000:00:03.0 type=2 reason=0 addr=0x0000ffffb9d15000 pasid=1 group=383 flags=3 prot=2
>   dev_page_response: IOMMU:0000:00:03.0 code=1 pasid=1 group=383
>   smmu_unbind_free: dev=0000:00:03.0 pasid=1
> 
> We don't get any invalidate_range notification in this case.
> 
Thanks for the confirmation. We do need to invalidate here.

> Thanks,
> Jean

[Jacob Pan]