
[v7,00/24] iommu: Shared Virtual Addressing for SMMUv3

Message ID 20200519175502.2504091-1-jean-philippe@linaro.org

Message

Jean-Philippe Brucker May 19, 2020, 5:54 p.m. UTC
Shared Virtual Addressing (SVA) allows sharing process page tables with
devices, using the IOMMU, PASIDs and I/O page faults. Add SVA support to
the Arm SMMUv3 driver.
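
As a rough illustration (not part of the series), a device driver consumes
the SVA API along these lines. The example_bind_current_mm() wrapper is
hypothetical and exact signatures may differ between trees:

#include <linux/device.h>
#include <linux/err.h>
#include <linux/iommu.h>
#include <linux/sched.h>

/* Hypothetical example: share current->mm with 'dev' for SVA DMA. */
static int example_bind_current_mm(struct device *dev)
{
	struct iommu_sva *handle;
	int pasid, ret;

	ret = iommu_dev_enable_feature(dev, IOMMU_DEV_FEAT_SVA);
	if (ret)
		return ret;

	handle = iommu_sva_bind_device(dev, current->mm, NULL);
	if (IS_ERR(handle)) {
		iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_SVA);
		return PTR_ERR(handle);
	}

	pasid = iommu_sva_get_pasid(handle);
	dev_info(dev, "bound PASID %d\n", pasid);
	/* Program the PASID into the device and issue DMA on process VAs. */

	iommu_sva_unbind_device(handle);
	iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_SVA);
	return 0;
}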

Since v6 [1]:
* Rename ioasid_free() to ioasid_put() in patch 02, requiring changes to
  the Intel drivers.
* Use mmu_notifier_register() in patch 16 to avoid copying the ops and
  simplify the invalidate() notifier in patch 17.
* As a result, replace context spinlock with a mutex. Simplified locking in
  patch 11 (That patch still looks awful, but I think the series is more
  readable overall). And I've finally been able to remove the GFP_ATOMIC
  allocations.
* Use a single patch (04) for io-pgfault.c, since the code was simplified
  in v6. Fixed partial list in patch 04.

[1] https://lore.kernel.org/linux-iommu/20200430143424.2787566-1-jean-philippe@linaro.org/

Jean-Philippe Brucker (24):
  mm: Add a PASID field to mm_struct
  iommu/ioasid: Add ioasid references
  iommu/sva: Add PASID helpers
  iommu: Add a page fault handler
  arm64: mm: Add asid_gen_match() helper
  arm64: mm: Pin down ASIDs for sharing mm with devices
  iommu/io-pgtable-arm: Move some definitions to a header
  iommu/arm-smmu-v3: Manage ASIDs with xarray
  arm64: cpufeature: Export symbol read_sanitised_ftr_reg()
  iommu/arm-smmu-v3: Share process page tables
  iommu/arm-smmu-v3: Seize private ASID
  iommu/arm-smmu-v3: Add support for VHE
  iommu/arm-smmu-v3: Enable broadcast TLB maintenance
  iommu/arm-smmu-v3: Add SVA feature checking
  iommu/arm-smmu-v3: Add SVA device feature
  iommu/arm-smmu-v3: Implement iommu_sva_bind/unbind()
  iommu/arm-smmu-v3: Hook up ATC invalidation to mm ops
  iommu/arm-smmu-v3: Add support for Hardware Translation Table Update
  iommu/arm-smmu-v3: Maintain a SID->device structure
  dt-bindings: document stall property for IOMMU masters
  iommu/arm-smmu-v3: Add stall support for platform devices
  PCI/ATS: Add PRI stubs
  PCI/ATS: Export PRI functions
  iommu/arm-smmu-v3: Add support for PRI

 drivers/iommu/Kconfig                         |   12 +
 drivers/iommu/Makefile                        |    2 +
 .../devicetree/bindings/iommu/iommu.txt       |   18 +
 arch/arm64/include/asm/mmu.h                  |    1 +
 arch/arm64/include/asm/mmu_context.h          |   11 +-
 drivers/iommu/io-pgtable-arm.h                |   30 +
 drivers/iommu/iommu-sva.h                     |   15 +
 include/linux/ioasid.h                        |   10 +-
 include/linux/iommu.h                         |   53 +
 include/linux/mm_types.h                      |    4 +
 include/linux/pci-ats.h                       |    8 +
 arch/arm64/kernel/cpufeature.c                |    1 +
 arch/arm64/mm/context.c                       |  103 +-
 drivers/iommu/arm-smmu-v3.c                   | 1552 +++++++++++++++--
 drivers/iommu/intel-iommu.c                   |    4 +-
 drivers/iommu/intel-svm.c                     |    6 +-
 drivers/iommu/io-pgfault.c                    |  459 +++++
 drivers/iommu/io-pgtable-arm.c                |   27 +-
 drivers/iommu/ioasid.c                        |   38 +-
 drivers/iommu/iommu-sva.c                     |   85 +
 drivers/iommu/of_iommu.c                      |    5 +-
 drivers/pci/ats.c                             |    4 +
 MAINTAINERS                                   |    3 +-
 23 files changed, 2286 insertions(+), 165 deletions(-)
 create mode 100644 drivers/iommu/io-pgtable-arm.h
 create mode 100644 drivers/iommu/iommu-sva.h
 create mode 100644 drivers/iommu/io-pgfault.c
 create mode 100644 drivers/iommu/iommu-sva.c

Comments

Baolu Lu May 20, 2020, 2:31 a.m. UTC | #1
On 5/20/20 1:54 AM, Jean-Philippe Brucker wrote:
> Let IOASID users take references to existing ioasids with ioasid_get().
> ioasid_put() drops a reference and only frees the ioasid when its
> reference count drops to zero. It returns true if the ioasid was freed.
> For drivers that don't call ioasid_get(), ioasid_put() is the same as
> ioasid_free().
> 
> Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
> ---
> v6->v7: rename ioasid_free() to ioasid_put(), add WARN in ioasid_get()
> ---
>   include/linux/ioasid.h      | 10 ++++++++--
>   drivers/iommu/intel-iommu.c |  4 ++--
>   drivers/iommu/intel-svm.c   |  6 +++---
>   drivers/iommu/ioasid.c      | 38 +++++++++++++++++++++++++++++++++----
>   4 files changed, 47 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
> index 6f000d7a0ddc..e9dacd4b9f6b 100644
> --- a/include/linux/ioasid.h
> +++ b/include/linux/ioasid.h
> @@ -34,7 +34,8 @@ struct ioasid_allocator_ops {
>   #if IS_ENABLED(CONFIG_IOASID)
>   ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
>   		      void *private);
> -void ioasid_free(ioasid_t ioasid);
> +void ioasid_get(ioasid_t ioasid);
> +bool ioasid_put(ioasid_t ioasid);
>   void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
>   		  bool (*getter)(void *));
>   int ioasid_register_allocator(struct ioasid_allocator_ops *allocator);
> @@ -48,10 +49,15 @@ static inline ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min,
>   	return INVALID_IOASID;
>   }
>   
> -static inline void ioasid_free(ioasid_t ioasid)
> +static inline void ioasid_get(ioasid_t ioasid)
>   {
>   }
>   
> +static inline bool ioasid_put(ioasid_t ioasid)
> +{
> +	return false;
> +}
> +
>   static inline void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
>   				bool (*getter)(void *))
>   {
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index ed21ce6d1238..0230f35480ee 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -5432,7 +5432,7 @@ static void auxiliary_unlink_device(struct dmar_domain *domain,
>   	domain->auxd_refcnt--;
>   
>   	if (!domain->auxd_refcnt && domain->default_pasid > 0)
> -		ioasid_free(domain->default_pasid);
> +		ioasid_put(domain->default_pasid);
>   }
>   
>   static int aux_domain_add_dev(struct dmar_domain *domain,
> @@ -5494,7 +5494,7 @@ static int aux_domain_add_dev(struct dmar_domain *domain,
>   	spin_unlock(&iommu->lock);
>   	spin_unlock_irqrestore(&device_domain_lock, flags);
>   	if (!domain->auxd_refcnt && domain->default_pasid > 0)
> -		ioasid_free(domain->default_pasid);
> +		ioasid_put(domain->default_pasid);
>   
>   	return ret;
>   }
> diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
> index 2998418f0a38..86f1264bd07c 100644
> --- a/drivers/iommu/intel-svm.c
> +++ b/drivers/iommu/intel-svm.c
> @@ -353,7 +353,7 @@ int intel_svm_bind_mm(struct device *dev, int *pasid, int flags, struct svm_dev_
>   		if (mm) {
>   			ret = mmu_notifier_register(&svm->notifier, mm);
>   			if (ret) {
> -				ioasid_free(svm->pasid);
> +				ioasid_put(svm->pasid);
>   				kfree(svm);
>   				kfree(sdev);
>   				goto out;
> @@ -371,7 +371,7 @@ int intel_svm_bind_mm(struct device *dev, int *pasid, int flags, struct svm_dev_
>   		if (ret) {
>   			if (mm)
>   				mmu_notifier_unregister(&svm->notifier, mm);
> -			ioasid_free(svm->pasid);
> +			ioasid_put(svm->pasid);
>   			kfree(svm);
>   			kfree(sdev);
>   			goto out;
> @@ -447,7 +447,7 @@ int intel_svm_unbind_mm(struct device *dev, int pasid)
>   			kfree_rcu(sdev, rcu);
>   
>   			if (list_empty(&svm->devs)) {
> -				ioasid_free(svm->pasid);
> +				ioasid_put(svm->pasid);
>   				if (svm->mm)
>   					mmu_notifier_unregister(&svm->notifier, svm->mm);
>   				list_del(&svm->list);
> diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
> index 0f8dd377aada..50ee27bbd04e 100644
> --- a/drivers/iommu/ioasid.c
> +++ b/drivers/iommu/ioasid.c
> @@ -2,7 +2,7 @@
>   /*
>    * I/O Address Space ID allocator. There is one global IOASID space, split into
>    * subsets. Users create a subset with DECLARE_IOASID_SET, then allocate and
> - * free IOASIDs with ioasid_alloc and ioasid_free.
> + * free IOASIDs with ioasid_alloc and ioasid_put.
>    */
>   #include <linux/ioasid.h>
>   #include <linux/module.h>
> @@ -15,6 +15,7 @@ struct ioasid_data {
>   	struct ioasid_set *set;
>   	void *private;
>   	struct rcu_head rcu;
> +	refcount_t refs;
>   };
>   
>   /*
> @@ -314,6 +315,7 @@ ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
>   
>   	data->set = set;
>   	data->private = private;
> +	refcount_set(&data->refs, 1);
>   
>   	/*
>   	 * Custom allocator needs allocator data to perform platform specific
> @@ -346,11 +348,34 @@ ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
>   EXPORT_SYMBOL_GPL(ioasid_alloc);
>   
>   /**
> - * ioasid_free - Free an IOASID
> + * ioasid_get - obtain a reference to the IOASID
> + */
> +void ioasid_get(ioasid_t ioasid)
> +{
> +	struct ioasid_data *ioasid_data;
> +
> +	spin_lock(&ioasid_allocator_lock);
> +	ioasid_data = xa_load(&active_allocator->xa, ioasid);
> +	if (ioasid_data)
> +		refcount_inc(&ioasid_data->refs);
> +	else
> +		WARN_ON(1);
> +	spin_unlock(&ioasid_allocator_lock);
> +}
> +EXPORT_SYMBOL_GPL(ioasid_get);
> +
> +/**
> + * ioasid_put - Release a reference to an ioasid
>    * @ioasid: the ID to remove
> + *
> + * Put a reference to the IOASID, free it when the number of references drops to
> + * zero.
> + *
> + * Return: %true if the IOASID was freed, %false otherwise.
>    */
> -void ioasid_free(ioasid_t ioasid)
> +bool ioasid_put(ioasid_t ioasid)
>   {
> +	bool free = false;
>   	struct ioasid_data *ioasid_data;
>   
>   	spin_lock(&ioasid_allocator_lock);
> @@ -360,6 +385,10 @@ void ioasid_free(ioasid_t ioasid)
>   		goto exit_unlock;
>   	}
>   
> +	free = refcount_dec_and_test(&ioasid_data->refs);
> +	if (!free)
> +		goto exit_unlock;
> +
>   	active_allocator->ops->free(ioasid, active_allocator->ops->pdata);
>   	/* Custom allocator needs additional steps to free the xa element */
>   	if (active_allocator->flags & IOASID_ALLOCATOR_CUSTOM) {
> @@ -369,8 +398,9 @@ void ioasid_free(ioasid_t ioasid)
>   
>   exit_unlock:
>   	spin_unlock(&ioasid_allocator_lock);
> +	return free;
>   }
> -EXPORT_SYMBOL_GPL(ioasid_free);
> +EXPORT_SYMBOL_GPL(ioasid_put);
>   
>   /**
>    * ioasid_find - Find IOASID data
> 

Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>

Best regards,
baolu
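
As a usage note (not from the patch): a caller that needs to keep a PASID
alive across asynchronous work would pair the new helpers roughly as in the
sketch below, where 'wq' and 'work' are made-up names and the PASID was
allocated elsewhere with ioasid_alloc():

	/* Sketch: hold a reference while a worker consumes the PASID. */
	ioasid_get(pasid);
	queue_work(wq, &work);

	/* Later, once the worker is done with it: */
	if (ioasid_put(pasid))
		pr_debug("PASID %u fully released\n", pasid);
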
Baolu Lu May 20, 2020, 2:41 a.m. UTC | #2
On 5/20/20 1:54 AM, Jean-Philippe Brucker wrote:
> Let IOMMU drivers allocate a single PASID per mm. Store the mm in the
> IOASID set to allow refcounting and searching mm by PASID, when handling
> an I/O page fault.
> 
> Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
> ---
>   drivers/iommu/Kconfig     |  5 +++
>   drivers/iommu/Makefile    |  1 +
>   drivers/iommu/iommu-sva.h | 15 +++++++
>   drivers/iommu/iommu-sva.c | 85 +++++++++++++++++++++++++++++++++++++++
>   4 files changed, 106 insertions(+)
>   create mode 100644 drivers/iommu/iommu-sva.h
>   create mode 100644 drivers/iommu/iommu-sva.c
> 
> diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
> index 2ab07ce17abb..d9fa5b410015 100644
> --- a/drivers/iommu/Kconfig
> +++ b/drivers/iommu/Kconfig
> @@ -102,6 +102,11 @@ config IOMMU_DMA
>   	select IRQ_MSI_IOMMU
>   	select NEED_SG_DMA_LENGTH
>   
> +# Shared Virtual Addressing library
> +config IOMMU_SVA

This looks too generic. It doesn't match the code it actually controls.

How about IOMMU_SVA_LIB?

> +	bool
> +	select IOASID
> +
>   config FSL_PAMU
>   	bool "Freescale IOMMU support"
>   	depends on PCI
> diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
> index 9f33fdb3bb05..40c800dd4e3e 100644
> --- a/drivers/iommu/Makefile
> +++ b/drivers/iommu/Makefile
> @@ -37,3 +37,4 @@ obj-$(CONFIG_S390_IOMMU) += s390-iommu.o
>   obj-$(CONFIG_QCOM_IOMMU) += qcom_iommu.o
>   obj-$(CONFIG_HYPERV_IOMMU) += hyperv-iommu.o
>   obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o
> +obj-$(CONFIG_IOMMU_SVA) += iommu-sva.o
> diff --git a/drivers/iommu/iommu-sva.h b/drivers/iommu/iommu-sva.h
> new file mode 100644
> index 000000000000..78f806fcacbe
> --- /dev/null
> +++ b/drivers/iommu/iommu-sva.h
> @@ -0,0 +1,15 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * SVA library for IOMMU drivers
> + */
> +#ifndef _IOMMU_SVA_H
> +#define _IOMMU_SVA_H
> +
> +#include <linux/ioasid.h>
> +#include <linux/mm_types.h>
> +
> +int iommu_sva_alloc_pasid(struct mm_struct *mm, ioasid_t min, ioasid_t max);
> +void iommu_sva_free_pasid(struct mm_struct *mm);
> +struct mm_struct *iommu_sva_find(ioasid_t pasid);
> +
> +#endif /* _IOMMU_SVA_H */
> diff --git a/drivers/iommu/iommu-sva.c b/drivers/iommu/iommu-sva.c
> new file mode 100644
> index 000000000000..442644a1ade0
> --- /dev/null
> +++ b/drivers/iommu/iommu-sva.c
> @@ -0,0 +1,85 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Helpers for IOMMU drivers implementing SVA
> + */
> +#include <linux/mutex.h>
> +#include <linux/sched/mm.h>
> +
> +#include "iommu-sva.h"
> +
> +static DEFINE_MUTEX(iommu_sva_lock);
> +static DECLARE_IOASID_SET(shared_pasid);

NIT: how about iommu_sva_pasid?

> +
> +/**
> + * iommu_sva_alloc_pasid - Allocate a PASID for the mm
> + * @mm: the mm
> + * @min: minimum PASID value (inclusive)
> + * @max: maximum PASID value (inclusive)
> + *
> + * Try to allocate a PASID for this mm, or take a reference to the existing one
> + * provided it fits within the [min, max] range. On success the PASID is
> + * available in mm->pasid, and must be released with iommu_sva_free_pasid().
> + *
> + * Returns 0 on success and < 0 on error.
> + */
> +int iommu_sva_alloc_pasid(struct mm_struct *mm, ioasid_t min, ioasid_t max)
> +{
> +	int ret = 0;
> +	ioasid_t pasid;
> +
> +	if (min == INVALID_IOASID || max == INVALID_IOASID ||
> +	    min == 0 || max < min)
> +		return -EINVAL;
> +
> +	mutex_lock(&iommu_sva_lock);
> +	if (mm->pasid) {
> +		if (mm->pasid >= min && mm->pasid <= max)
> +			ioasid_get(mm->pasid);
> +		else
> +			ret = -EOVERFLOW;
> +	} else {
> +		pasid = ioasid_alloc(&shared_pasid, min, max, mm);
> +		if (pasid == INVALID_IOASID)
> +			ret = -ENOMEM;
> +		else
> +			mm->pasid = pasid;
> +	}
> +	mutex_unlock(&iommu_sva_lock);
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(iommu_sva_alloc_pasid);
> +
> +/**
> + * iommu_sva_free_pasid - Release the mm's PASID
> + * @mm: the mm.
> + *
> + * Drop one reference to a PASID allocated with iommu_sva_alloc_pasid()
> + */
> +void iommu_sva_free_pasid(struct mm_struct *mm)
> +{
> +	mutex_lock(&iommu_sva_lock);
> +	if (ioasid_put(mm->pasid))
> +		mm->pasid = 0;
> +	mutex_unlock(&iommu_sva_lock);
> +}
> +EXPORT_SYMBOL_GPL(iommu_sva_free_pasid);
> +
> +/* ioasid wants a void * argument */
> +static bool __mmget_not_zero(void *mm)
> +{
> +	return mmget_not_zero(mm);
> +}
> +
> +/**
> + * iommu_sva_find() - Find mm associated to the given PASID
> + * @pasid: Process Address Space ID assigned to the mm
> + *
> + * On success a reference to the mm is taken, and must be released with mmput().
> + *
> + * Returns the mm corresponding to this PASID, or an error if not found.
> + */
> +struct mm_struct *iommu_sva_find(ioasid_t pasid)
> +{
> +	return ioasid_find(&shared_pasid, pasid, __mmget_not_zero);
> +}
> +EXPORT_SYMBOL_GPL(iommu_sva_find);
> 

Best regards,
baolu
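
To make the intended calling convention concrete, a bind() path in an IOMMU
driver could use these helpers roughly as in the sketch below; the 20-bit
PASID limit and the smmu_attach_pasid() helper are made up for the example:

	/* Hypothetical bind path built on the SVA library. */
	static int example_sva_bind(struct device *dev, struct mm_struct *mm)
	{
		int ret;

		/* PASID 0 is reserved; the upper bound comes from the hardware. */
		ret = iommu_sva_alloc_pasid(mm, 1, (1 << 20) - 1);
		if (ret)
			return ret;

		/* Made-up helper that writes the context descriptor. */
		ret = smmu_attach_pasid(dev, mm, mm->pasid);
		if (ret)
			iommu_sva_free_pasid(mm);
		return ret;
	}
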
Baolu Lu May 20, 2020, 6:42 a.m. UTC | #3
Hi Jean,

On 2020/5/20 1:54, Jean-Philippe Brucker wrote:
> Some systems allow devices to handle I/O Page Faults in the core mm. For
> example systems implementing the PCIe PRI extension or Arm SMMU stall
> model. Infrastructure for reporting these recoverable page faults was
> added to the IOMMU core by commit 0c830e6b3282 ("iommu: Introduce device
> fault report API"). Add a page fault handler for host SVA.
> 
> IOMMU drivers can now instantiate several fault workqueues and link them
> to IOPF-capable devices. Drivers can choose between a single global
> workqueue, one per IOMMU device, one per low-level fault queue, one per
> domain, etc.
> 
> When it receives a fault event, typically in an IRQ handler, the IOMMU
> driver reports the fault using iommu_report_device_fault(), which calls
> the registered handler. The page fault handler then calls the mm fault
> handler, and reports either success or failure with iommu_page_response().
> When the handler succeeds, the IOMMU retries the access.
> 
> The iopf_param pointer could be embedded into iommu_fault_param. But
> putting iopf_param into the iommu_param structure allows us not to care
> about ordering between calls to iopf_queue_add_device() and
> iommu_register_device_fault_handler().
> 
> Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
> ---
> v6->v7: Fix leak in iopf_queue_discard_partial()
> ---
>   drivers/iommu/Kconfig      |   4 +
>   drivers/iommu/Makefile     |   1 +
>   include/linux/iommu.h      |  51 +++++
>   drivers/iommu/io-pgfault.c | 459 +++++++++++++++++++++++++++++++++++++
>   4 files changed, 515 insertions(+)
>   create mode 100644 drivers/iommu/io-pgfault.c
> 
> diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
> index d9fa5b410015..15e9dc4e503c 100644
> --- a/drivers/iommu/Kconfig
> +++ b/drivers/iommu/Kconfig
> @@ -107,6 +107,10 @@ config IOMMU_SVA
>   	bool
>   	select IOASID
>   
> +config IOMMU_PAGE_FAULT
> +	bool
> +	select IOMMU_SVA
> +
>   config FSL_PAMU
>   	bool "Freescale IOMMU support"
>   	depends on PCI
> diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
> index 40c800dd4e3e..bf5cb4ee8409 100644
> --- a/drivers/iommu/Makefile
> +++ b/drivers/iommu/Makefile
> @@ -4,6 +4,7 @@ obj-$(CONFIG_IOMMU_API) += iommu-traces.o
>   obj-$(CONFIG_IOMMU_API) += iommu-sysfs.o
>   obj-$(CONFIG_IOMMU_DEBUGFS) += iommu-debugfs.o
>   obj-$(CONFIG_IOMMU_DMA) += dma-iommu.o
> +obj-$(CONFIG_IOMMU_PAGE_FAULT) += io-pgfault.o
>   obj-$(CONFIG_IOMMU_IO_PGTABLE) += io-pgtable.o
>   obj-$(CONFIG_IOMMU_IO_PGTABLE_ARMV7S) += io-pgtable-arm-v7s.o
>   obj-$(CONFIG_IOMMU_IO_PGTABLE_LPAE) += io-pgtable-arm.o

[SNIP]

> +
> +static enum iommu_page_response_code
> +iopf_handle_single(struct iopf_fault *iopf)
> +{
> +	vm_fault_t ret;
> +	struct mm_struct *mm;
> +	struct vm_area_struct *vma;
> +	unsigned int access_flags = 0;
> +	unsigned int fault_flags = FAULT_FLAG_REMOTE;
> +	struct iommu_fault_page_request *prm = &iopf->fault.prm;
> +	enum iommu_page_response_code status = IOMMU_PAGE_RESP_INVALID;
> +
> +	if (!(prm->flags & IOMMU_FAULT_PAGE_REQUEST_PASID_VALID))
> +		return status;
> +
> +	mm = iommu_sva_find(prm->pasid);
> +	if (IS_ERR_OR_NULL(mm))
> +		return status;
> +
> +	down_read(&mm->mmap_sem);
> +
> +	vma = find_extend_vma(mm, prm->addr);
> +	if (!vma)
> +		/* Unmapped area */
> +		goto out_put_mm;
> +
> +	if (prm->perm & IOMMU_FAULT_PERM_READ)
> +		access_flags |= VM_READ;
> +
> +	if (prm->perm & IOMMU_FAULT_PERM_WRITE) {
> +		access_flags |= VM_WRITE;
> +		fault_flags |= FAULT_FLAG_WRITE;
> +	}
> +
> +	if (prm->perm & IOMMU_FAULT_PERM_EXEC) {
> +		access_flags |= VM_EXEC;
> +		fault_flags |= FAULT_FLAG_INSTRUCTION;
> +	}
> +
> +	if (!(prm->perm & IOMMU_FAULT_PERM_PRIV))
> +		fault_flags |= FAULT_FLAG_USER;
> +
> +	if (access_flags & ~vma->vm_flags)
> +		/* Access fault */
> +		goto out_put_mm;
> +
> +	ret = handle_mm_fault(vma, prm->addr, fault_flags);
> +	status = ret & VM_FAULT_ERROR ? IOMMU_PAGE_RESP_INVALID :

Do you mind telling why it's IOMMU_PAGE_RESP_INVALID but not
IOMMU_PAGE_RESP_FAILURE?

> +		IOMMU_PAGE_RESP_SUCCESS;
> +
> +out_put_mm:
> +	up_read(&mm->mmap_sem);
> +	mmput(mm);
> +
> +	return status;
> +}
> +
> +static void iopf_handle_group(struct work_struct *work)
> +{
> +	struct iopf_group *group;
> +	struct iopf_fault *iopf, *next;
> +	enum iommu_page_response_code status = IOMMU_PAGE_RESP_SUCCESS;
> +
> +	group = container_of(work, struct iopf_group, work);
> +
> +	list_for_each_entry_safe(iopf, next, &group->faults, list) {
> +		/*
> +		 * For the moment, errors are sticky: don't handle subsequent
> +		 * faults in the group if there is an error.
> +		 */
> +		if (status == IOMMU_PAGE_RESP_SUCCESS)
> +			status = iopf_handle_single(iopf);
> +
> +		if (!(iopf->fault.prm.flags &
> +		      IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE))
> +			kfree(iopf);
> +	}
> +
> +	iopf_complete_group(group->dev, &group->last_fault, status);
> +	kfree(group);
> +}
> +
> +/**
> + * iommu_queue_iopf - IO Page Fault handler
> + * @evt: fault event

@fault?

> + * @cookie: struct device, passed to iommu_register_device_fault_handler.
> + *
> + * Add a fault to the device workqueue, to be handled by mm.
> + *
> + * This module doesn't handle PCI PASID Stop Marker; IOMMU drivers must discard
> + * them before reporting faults. A PASID Stop Marker (LRW = 0b100) doesn't
> + * expect a response. It may be generated when disabling a PASID (issuing a
> + * PASID stop request) by some PCI devices.
> + *
> + * The PASID stop request is issued by the device driver before unbind(). Once
> + * it completes, no page request is generated for this PASID anymore and
> + * outstanding ones have been pushed to the IOMMU (as per PCIe 4.0r1.0 - 6.20.1
> + * and 10.4.1.2 - Managing PASID TLP Prefix Usage). Some PCI devices will wait
> + * for all outstanding page requests to come back with a response before
> + * completing the PASID stop request. Others do not wait for page responses, and
> + * instead issue this Stop Marker that tells us when the PASID can be
> + * reallocated.
> + *
> + * It is safe to discard the Stop Marker because it is an optimization.
> + * a. Page requests, which are posted requests, have been flushed to the IOMMU
> + *    when the stop request completes.
> + * b. We flush all fault queues on unbind() before freeing the PASID.
> + *
> + * So even though the Stop Marker might be issued by the device *after* the stop
> + * request completes, outstanding faults will have been dealt with by the time
> + * we free the PASID.
> + *
> + * Return: 0 on success and <0 on error.
> + */
> +int iommu_queue_iopf(struct iommu_fault *fault, void *cookie)
> +{
> +	int ret;
> +	struct iopf_group *group;
> +	struct iopf_fault *iopf, *next;
> +	struct iopf_device_param *iopf_param;
> +
> +	struct device *dev = cookie;
> +	struct dev_iommu *param = dev->iommu;
> +
> +	lockdep_assert_held(&param->lock);
> +
> +	if (fault->type != IOMMU_FAULT_PAGE_REQ)
> +		/* Not a recoverable page fault */
> +		return -EOPNOTSUPP;
> +
> +	/*
> +	 * As long as we're holding param->lock, the queue can't be unlinked
> +	 * from the device and therefore cannot disappear.
> +	 */
> +	iopf_param = param->iopf_param;
> +	if (!iopf_param)
> +		return -ENODEV;
> +
> +	if (!(fault->prm.flags & IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE)) {
> +		iopf = kzalloc(sizeof(*iopf), GFP_KERNEL);
> +		if (!iopf)
> +			return -ENOMEM;
> +
> +		iopf->fault = *fault;
> +
> +		/* Non-last request of a group. Postpone until the last one */
> +		list_add(&iopf->list, &iopf_param->partial);
> +
> +		return 0;
> +	}
> +
> +	group = kzalloc(sizeof(*group), GFP_KERNEL);
> +	if (!group) {
> +		/*
> +		 * The caller will send a response to the hardware. But we do
> +		 * need to clean up before leaving, otherwise partial faults
> +		 * will be stuck.
> +		 */
> +		ret = -ENOMEM;
> +		goto cleanup_partial;
> +	}
> +
> +	group->dev = dev;
> +	group->last_fault.fault = *fault;
> +	INIT_LIST_HEAD(&group->faults);
> +	list_add(&group->last_fault.list, &group->faults);
> +	INIT_WORK(&group->work, iopf_handle_group);
> +
> +	/* See if we have partial faults for this group */
> +	list_for_each_entry_safe(iopf, next, &iopf_param->partial, list) {
> +		if (iopf->fault.prm.grpid == fault->prm.grpid)
> +			/* Insert *before* the last fault */
> +			list_move(&iopf->list, &group->faults);
> +	}
> +
> +	queue_work(iopf_param->queue->wq, &group->work);
> +	return 0;
> +
> +cleanup_partial:
> +	list_for_each_entry_safe(iopf, next, &iopf_param->partial, list) {
> +		if (iopf->fault.prm.grpid == fault->prm.grpid) {
> +			list_del(&iopf->list);
> +			kfree(iopf);
> +		}
> +	}
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(iommu_queue_iopf);

[SNIP]

> + 
> +
> +/**
> + * iopf_queue_add_device - Add producer to the fault queue
> + * @queue: IOPF queue
> + * @dev: device to add
> + *
> + * Return: 0 on success and <0 on error.
> + */
> +int iopf_queue_add_device(struct iopf_queue *queue, struct device *dev)
> +{
> +	int ret = -EBUSY;
> +	struct iopf_device_param *iopf_param;
> +	struct dev_iommu *param = dev->iommu;
> +
> +	if (!param)
> +		return -ENODEV;
> +
> +	iopf_param = kzalloc(sizeof(*iopf_param), GFP_KERNEL);
> +	if (!iopf_param)
> +		return -ENOMEM;
> +
> +	INIT_LIST_HEAD(&iopf_param->partial);
> +	iopf_param->queue = queue; iopf_param->dev = dev;

Two lines?

> +
> +	mutex_lock(&queue->lock);
> +	mutex_lock(&param->lock);
> +	if (!param->iopf_param) {
> +		list_add(&iopf_param->queue_list, &queue->devices);
> +		param->iopf_param = iopf_param;
> +		ret = 0;
> +	}
> +	mutex_unlock(&param->lock);
> +	mutex_unlock(&queue->lock);
> +
> +	if (ret)
> +		kfree(iopf_param);
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(iopf_queue_add_device);
> +

[SNIP]
Best regards,
baolu
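
For reference, the expected wiring of this API in an IOMMU driver is roughly
the sketch below; the probe/enable function names are placeholders, and the
SMMUv3 patches later in the series do something similar for their event
queue:

	static struct iopf_queue *example_iopf_queue;

	/* Sketch: one fault queue per IOMMU instance. */
	static int example_iommu_probe(struct device *smmu_dev)
	{
		example_iopf_queue = iopf_queue_alloc(dev_name(smmu_dev));
		return example_iopf_queue ? 0 : -ENOMEM;
	}

	/* Sketch: opt a faulting endpoint into the queue. */
	static int example_enable_iopf(struct device *dev)
	{
		int ret;

		ret = iopf_queue_add_device(example_iopf_queue, dev);
		if (ret)
			return ret;

		/* Route recoverable faults for 'dev' to iommu_queue_iopf(). */
		ret = iommu_register_device_fault_handler(dev, iommu_queue_iopf, dev);
		if (ret)
			iopf_queue_remove_device(example_iopf_queue, dev);
		return ret;
	}
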
Will Deacon May 21, 2020, 10:35 a.m. UTC | #4
Hi Jean-Philippe,

On Tue, May 19, 2020 at 07:54:38PM +0200, Jean-Philippe Brucker wrote:
> Shared Virtual Addressing (SVA) allows to share process page tables with
> devices using the IOMMU, PASIDs and I/O page faults. Add SVA support to
> the Arm SMMUv3 driver.
> 
> Since v6 [1]:
> * Rename ioasid_free() to ioasid_put() in patch 02, requiring changes to
>   the Intel drivers.
> * Use mmu_notifier_register() in patch 16 to avoid copying the ops and
>   simplify the invalidate() notifier in patch 17.
> * As a result, replace context spinlock with a mutex. Simplified locking in
>   patch 11 (That patch still looks awful, but I think the series is more
>   readable overall). And I've finally been able to remove the GFP_ATOMIC
>   allocations.
> * Use a single patch (04) for io-pgfault.c, since the code was simplified
>   in v6. Fixed partial list in patch 04.

There's an awful lot here and it stretches across quite a few subsystems,
with different git trees. What's the plan for merging it?

I'm happy to take some of the arm64 and smmu changes for 5.8, then perhaps
we can review what's left and target 5.9? It would also be helpful to split
that up into separate series where there aren't strong dependencies, I
think.

Will
Will Deacon May 21, 2020, 11:12 a.m. UTC | #5
On Tue, May 19, 2020 at 07:54:56PM +0200, Jean-Philippe Brucker wrote:
> If the SMMU supports it and the kernel was built with HTTU support,
> enable hardware update of access and dirty flags. This is essential for
> shared page tables, to reduce the number of access faults on the fault
> queue. Normal DMA with io-pgtables doesn't currently use the access or
> dirty flags.
> 
> We can enable HTTU even if CPUs don't support it, because the kernel
> always checks for the HW dirty bit and updates the PTE flags atomically.
> 
> Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
> ---
>  drivers/iommu/arm-smmu-v3.c | 24 +++++++++++++++++++++++-
>  1 file changed, 23 insertions(+), 1 deletion(-)

How does this work if the SMMU isn't cache coherent? I'm guessing we don't
want to enable any SVA stuff in that case, but I couldn't spot where that
was being enforced. Did I just miss it?

Will
Will Deacon May 21, 2020, 2:16 p.m. UTC | #6
On Tue, May 19, 2020 at 07:54:45PM +0200, Jean-Philippe Brucker wrote:
> Extract some of the most generic TCR defines, so they can be reused by
> the page table sharing code.
> 
> Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
> ---
>  drivers/iommu/io-pgtable-arm.h | 30 ++++++++++++++++++++++++++++++
>  drivers/iommu/io-pgtable-arm.c | 27 ++-------------------------
>  MAINTAINERS                    |  3 +--
>  3 files changed, 33 insertions(+), 27 deletions(-)
>  create mode 100644 drivers/iommu/io-pgtable-arm.h

Acked-by: Will Deacon <will@kernel.org>

Will
Will Deacon May 21, 2020, 2:16 p.m. UTC | #7
On Tue, May 19, 2020 at 07:54:50PM +0200, Jean-Philippe Brucker wrote:
> ARMv8.1 extensions added Virtualization Host Extensions (VHE), which allow
> running a host kernel at EL2. When using normal DMA, device and CPU address
> spaces are dissociated, and do not need to implement the same
> capabilities, so VHE hasn't been used in the SMMU until now.
> 
> With shared address spaces however, ASIDs are shared between MMU and SMMU,
> and broadcast TLB invalidations issued by a CPU are taken into account by
> the SMMU. TLB entries on both sides need to have identical exception level
> in order to be cleared with a single invalidation.
> 
> When the CPU is using VHE, enable VHE in the SMMU for all STEs. Normal DMA
> mappings will need to use TLBI_EL2 commands instead of TLBI_NH, but
> shouldn't be otherwise affected by this change.
> 
> Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
> ---
>  drivers/iommu/arm-smmu-v3.c | 31 ++++++++++++++++++++++++++-----
>  1 file changed, 26 insertions(+), 5 deletions(-)

Acked-by: Will Deacon <will@kernel.org>

Will
Will Deacon May 21, 2020, 2:17 p.m. UTC | #8
On Tue, May 19, 2020 at 07:54:52PM +0200, Jean-Philippe Brucker wrote:
> Aggregate all sanity-checks for sharing CPU page tables with the SMMU
> under a single ARM_SMMU_FEAT_SVA bit. For PCIe SVA, users also need to
> check FEAT_ATS and FEAT_PRI. For platform SVA, they will most likely have
> to check FEAT_STALLS.
> 
> Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
> Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
> ---
>  drivers/iommu/arm-smmu-v3.c | 72 +++++++++++++++++++++++++++++++++++++
>  1 file changed, 72 insertions(+)
> 
> diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
> index 9332253e3608..a9f6f1d7014e 100644
> --- a/drivers/iommu/arm-smmu-v3.c
> +++ b/drivers/iommu/arm-smmu-v3.c
> @@ -660,6 +660,7 @@ struct arm_smmu_device {
>  #define ARM_SMMU_FEAT_RANGE_INV		(1 << 15)
>  #define ARM_SMMU_FEAT_E2H		(1 << 16)
>  #define ARM_SMMU_FEAT_BTM		(1 << 17)
> +#define ARM_SMMU_FEAT_SVA		(1 << 18)
>  	u32				features;
>  
>  #define ARM_SMMU_OPT_SKIP_PREFETCH	(1 << 0)
> @@ -3935,6 +3936,74 @@ static int arm_smmu_device_reset(struct arm_smmu_device *smmu, bool bypass)
>  	return 0;
>  }
>  
> +static bool arm_smmu_supports_sva(struct arm_smmu_device *smmu)
> +{
> +	unsigned long reg, fld;
> +	unsigned long oas;
> +	unsigned long asid_bits;
> +
> +	u32 feat_mask = ARM_SMMU_FEAT_BTM | ARM_SMMU_FEAT_COHERENCY;

Aha -- here's the coherency check I missed!

> +
> +	if ((smmu->features & feat_mask) != feat_mask)
> +		return false;
> +
> +	if (!(smmu->pgsize_bitmap & PAGE_SIZE))
> +		return false;
> +
> +	/*
> +	 * Get the smallest PA size of all CPUs (sanitized by cpufeature). We're
> +	 * not even pretending to support AArch32 here.
> +	 */
> +	reg = read_sanitised_ftr_reg(SYS_ID_AA64MMFR0_EL1);
> +	fld = cpuid_feature_extract_unsigned_field(reg, ID_AA64MMFR0_PARANGE_SHIFT);
> +	switch (fld) {
> +	case 0x0:
> +		oas = 32;
> +		break;
> +	case 0x1:
> +		oas = 36;
> +		break;
> +	case 0x2:
> +		oas = 40;
> +		break;
> +	case 0x3:
> +		oas = 42;
> +		break;
> +	case 0x4:
> +		oas = 44;
> +		break;
> +	case 0x5:
> +		oas = 48;
> +		break;
> +	case 0x6:

We can use ID_AA64MMFR0_PARANGE_xx constants instead of the hardcoded hex
numbers here.

With that:

Acked-by: Will Deacon <will@kernel.org>

Will
Will Deacon May 21, 2020, 2:17 p.m. UTC | #9
On Thu, May 21, 2020 at 11:35:14AM +0100, Will Deacon wrote:
> On Tue, May 19, 2020 at 07:54:38PM +0200, Jean-Philippe Brucker wrote:
> > Shared Virtual Addressing (SVA) allows to share process page tables with
> > devices using the IOMMU, PASIDs and I/O page faults. Add SVA support to
> > the Arm SMMUv3 driver.
> > 
> > Since v6 [1]:
> > * Rename ioasid_free() to ioasid_put() in patch 02, requiring changes to
> >   the Intel drivers.
> > * Use mmu_notifier_register() in patch 16 to avoid copying the ops and
> >   simplify the invalidate() notifier in patch 17.
> > * As a result, replace context spinlock with a mutex. Simplified locking in
> >   patch 11 (That patch still looks awful, but I think the series is more
> >   readable overall). And I've finally been able to remove the GFP_ATOMIC
> >   allocations.
> > * Use a single patch (04) for io-pgfault.c, since the code was simplified
> >   in v6. Fixed partial list in patch 04.
> 
> There's an awful lot here and it stretches across quite a few subsystems,
> with different git trees. What's the plan for merging it?
> 
> I'm happy to take some of the arm64 and smmu changes for 5.8, then perhaps
> we can review what's left and target 5.9? It would also be helpful to split
> that up into separate series where there aren't strong dependencies, I
> think.

Hmm, so the way the series is structured makes it quite difficult to apply
much of this at all :(

I've taken patch 5 into the arm64 tree and patch 8 into the smmu tree. I'll
leave a couple of Acks on some of the simpler patches, but I think this
really needs splitting up a bit to make it more manageable.

I also notice a bunch of TODOs that get introduced and then removed. Given
that the series needs to be bisectable, these shouldn't be needed and can
just be removed.

Thanks,

Will
Will Deacon May 21, 2020, 2:17 p.m. UTC | #10
[+Marc]

On Tue, May 19, 2020 at 07:54:51PM +0200, Jean-Philippe Brucker wrote:
> The SMMUv3 can handle invalidation targeted at TLB entries with shared
> ASIDs. If the implementation supports broadcast TLB maintenance, enable it
> and keep track of it in a feature bit. The SMMU will then be affected by
> inner-shareable TLB invalidations from other agents.
> 
> A major side-effect of this change is that stage-2 translation contexts
> are now affected by all invalidations by VMID. VMIDs are all shared and
> the only ways to prevent over-invalidation, since the stage-2 page tables
> are not shared between CPU and SMMU, are to either disable BTM or allocate
> different VMIDs. This patch does not address the problem.

This sounds like a potential performance issue, particularly as we expose
stage-2 contexts via VFIO directly. Maybe we could reserve some portion of
VMID space for the SMMU? Marc, what do you reckon?

Will
Marc Zyngier May 21, 2020, 2:38 p.m. UTC | #11
On 2020-05-21 15:17, Will Deacon wrote:
> [+Marc]
> 
> On Tue, May 19, 2020 at 07:54:51PM +0200, Jean-Philippe Brucker wrote:
>> The SMMUv3 can handle invalidation targeted at TLB entries with shared
>> ASIDs. If the implementation supports broadcast TLB maintenance, 
>> enable it
>> and keep track of it in a feature bit. The SMMU will then be affected 
>> by
>> inner-shareable TLB invalidations from other agents.
>> 
>> A major side-effect of this change is that stage-2 translation 
>> contexts
>> are now affected by all invalidations by VMID. VMIDs are all shared 
>> and
>> the only ways to prevent over-invalidation, since the stage-2 page 
>> tables
>> are not shared between CPU and SMMU, are to either disable BTM or 
>> allocate
>> different VMIDs. This patch does not address the problem.
> 
> This sounds like a potential performance issue, particularly as we 
> expose
> stage-2 contexts via VFIO directly. Maybe we could reserve some portion 
> of
> VMID space for the SMMU? Marc, what do you reckon?

Certainly doable when we have 16-bit VMIDs. With smaller VMID spaces (like on
v8.0), this is a bit more difficult (we do have pretty large v8.0 systems
around). How many VMID bits are we talking about?

         M.
Jean-Philippe Brucker May 22, 2020, 10:17 a.m. UTC | #12
[+Eric]

On Thu, May 21, 2020 at 03:38:35PM +0100, Marc Zyngier wrote:
> On 2020-05-21 15:17, Will Deacon wrote:
> > [+Marc]
> > 
> > On Tue, May 19, 2020 at 07:54:51PM +0200, Jean-Philippe Brucker wrote:
> > > The SMMUv3 can handle invalidation targeted at TLB entries with shared
> > > ASIDs. If the implementation supports broadcast TLB maintenance,
> > > enable it
> > > and keep track of it in a feature bit. The SMMU will then be
> > > affected by
> > > inner-shareable TLB invalidations from other agents.
> > > 
> > > A major side-effect of this change is that stage-2 translation
> > > contexts
> > > are now affected by all invalidations by VMID. VMIDs are all shared
> > > and
> > > the only ways to prevent over-invalidation, since the stage-2 page
> > > tables
> > > are not shared between CPU and SMMU, are to either disable BTM or
> > > allocate
> > > different VMIDs. This patch does not address the problem.
> > 
> > This sounds like a potential performance issue, particularly as we
> > expose
> > stage-2 contexts via VFIO directly.

Yes it's certainly going to affect SMMU performance, though I haven't
measured it. QEMU and kvmtool currently use stage-1 translations instead
of stage-2, so it won't be a problem until they start using nested
translation (and unless the SMMU only supports stage-2).

In the coming month I'd like to have a look at coordinating VMID
allocation between KVM and SMMU, for guest SVA. If the guest wants to
share page tables with the SMMU, the SMMU has to use the same VMIDs as the
VM to receive broadcast TLBI.

Similarly to patch 06 ("arm64: mm: Pin down ASIDs for sharing mm with
devices") the SMMU would request a VMID allocated by KVM, when setting up
a nesting VFIO container. One major downside is that the VMID is pinned
and cannot be recycled on rollover while it's being used for DMA.

I wonder if we could use this even when page tables aren't shared between
CPU and SMMU, to avoid splitting the VMID space.

> > Maybe we could reserve some portion
> > of
> > VMID space for the SMMU? Marc, what do you reckon?
> 
> Certainly doable when we have 16bits VMIDs. With smaller VMID spaces (like
> on
> v8.0), this is a bit more difficult (we do have pretty large v8.0 systems
> around).

It's only an issue if those systems have an SMMUv3 supporting DVM. With
any luck that doesn't exist?

> How many VMID bits are we talking about?

That's anyone's guess... One passed-through device per VM would halve the
VMID space. But the SMMU allocates one VMID for each device assigned to a
guest, not one per VM (well one per domain, or VFIO container, but I think
it boils down to one per device with QEMU). So with SR-IOV for example it
should be pretty easy to reach 256 VMIDs in the SMMU.

Thanks,
Jean
Xiang Zheng May 27, 2020, 3 a.m. UTC | #13
Hi Jean,

This patch only enables HTTU bits in CDs. Is it also necessary to enable
HTTU bits in STEs in this patch?

On 2020/5/20 1:54, Jean-Philippe Brucker wrote:
> If the SMMU supports it and the kernel was built with HTTU support,
> enable hardware update of access and dirty flags. This is essential for
> shared page tables, to reduce the number of access faults on the fault
> queue. Normal DMA with io-pgtables doesn't currently use the access or
> dirty flags.
> 
> We can enable HTTU even if CPUs don't support it, because the kernel
> always checks for the HW dirty bit and updates the PTE flags atomically.
> 
> Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
> ---
>  drivers/iommu/arm-smmu-v3.c | 24 +++++++++++++++++++++++-
>  1 file changed, 23 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
> index 1386d4d2bc60..6a368218f54c 100644
> --- a/drivers/iommu/arm-smmu-v3.c
> +++ b/drivers/iommu/arm-smmu-v3.c
> @@ -58,6 +58,8 @@
>  #define IDR0_ASID16			(1 << 12)
>  #define IDR0_ATS			(1 << 10)
>  #define IDR0_HYP			(1 << 9)
> +#define IDR0_HD				(1 << 7)
> +#define IDR0_HA				(1 << 6)
>  #define IDR0_BTM			(1 << 5)
>  #define IDR0_COHACC			(1 << 4)
>  #define IDR0_TTF			GENMASK(3, 2)
> @@ -311,6 +313,9 @@
>  #define CTXDESC_CD_0_TCR_IPS		GENMASK_ULL(34, 32)
>  #define CTXDESC_CD_0_TCR_TBI0		(1ULL << 38)
>  
> +#define CTXDESC_CD_0_TCR_HA		(1UL << 43)
> +#define CTXDESC_CD_0_TCR_HD		(1UL << 42)
> +

>  #define CTXDESC_CD_0_AA64		(1UL << 41)
>  #define CTXDESC_CD_0_S			(1UL << 44)
>  #define CTXDESC_CD_0_R			(1UL << 45)
> @@ -663,6 +668,8 @@ struct arm_smmu_device {
>  #define ARM_SMMU_FEAT_E2H		(1 << 16)
>  #define ARM_SMMU_FEAT_BTM		(1 << 17)
>  #define ARM_SMMU_FEAT_SVA		(1 << 18)
> +#define ARM_SMMU_FEAT_HA		(1 << 19)
> +#define ARM_SMMU_FEAT_HD		(1 << 20)
>  	u32				features;
>  
>  #define ARM_SMMU_OPT_SKIP_PREFETCH	(1 << 0)
> @@ -1718,10 +1725,17 @@ static int arm_smmu_write_ctx_desc(struct arm_smmu_domain *smmu_domain,
>  		 * this substream's traffic
>  		 */
>  	} else { /* (1) and (2) */
> +		u64 tcr = cd->tcr;
> +
>  		cdptr[1] = cpu_to_le64(cd->ttbr & CTXDESC_CD_1_TTB0_MASK);
>  		cdptr[2] = 0;
>  		cdptr[3] = cpu_to_le64(cd->mair);
>  
> +		if (!(smmu->features & ARM_SMMU_FEAT_HD))
> +			tcr &= ~CTXDESC_CD_0_TCR_HD;
> +		if (!(smmu->features & ARM_SMMU_FEAT_HA))
> +			tcr &= ~CTXDESC_CD_0_TCR_HA;
> +
>  		/*
>  		 * STE is live, and the SMMU might read dwords of this CD in any
>  		 * order. Ensure that it observes valid values before reading
> @@ -1729,7 +1743,7 @@ static int arm_smmu_write_ctx_desc(struct arm_smmu_domain *smmu_domain,
>  		 */
>  		arm_smmu_sync_cd(smmu_domain, ssid, true);
>  
> -		val = cd->tcr |
> +		val = tcr |
>  #ifdef __BIG_ENDIAN
>  			CTXDESC_CD_0_ENDI |
>  #endif
> @@ -1958,10 +1972,12 @@ static struct arm_smmu_ctx_desc *arm_smmu_alloc_shared_cd(struct mm_struct *mm)
>  		return old_cd;
>  	}
>  
> +	/* HA and HD will be filtered out later if not supported by the SMMU */
>  	tcr = FIELD_PREP(CTXDESC_CD_0_TCR_T0SZ, 64ULL - VA_BITS) |
>  	      FIELD_PREP(CTXDESC_CD_0_TCR_IRGN0, ARM_LPAE_TCR_RGN_WBWA) |
>  	      FIELD_PREP(CTXDESC_CD_0_TCR_ORGN0, ARM_LPAE_TCR_RGN_WBWA) |
>  	      FIELD_PREP(CTXDESC_CD_0_TCR_SH0, ARM_LPAE_TCR_SH_IS) |
> +	      CTXDESC_CD_0_TCR_HA | CTXDESC_CD_0_TCR_HD |
>  	      CTXDESC_CD_0_TCR_EPD1 | CTXDESC_CD_0_AA64;
>  
>  	switch (PAGE_SIZE) {
> @@ -4454,6 +4470,12 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
>  			smmu->features |= ARM_SMMU_FEAT_E2H;
>  	}
>  
> +	if (reg & (IDR0_HA | IDR0_HD)) {

> +		smmu->features |= ARM_SMMU_FEAT_HA;
> +		if (reg & IDR0_HD)
> +			smmu->features |= ARM_SMMU_FEAT_HD;
> +	}
> +

>  	/*
>  	 * If the CPU is using VHE, but the SMMU doesn't support it, the SMMU
>  	 * will create TLB entries for NH-EL1 world and will miss the
>
Jean-Philippe Brucker May 27, 2020, 8:41 a.m. UTC | #14
On Wed, May 27, 2020 at 11:00:29AM +0800, Xiang Zheng wrote:
> Hi Jean,
> 
> This patch only enables HTTU bits in CDs. Is it also neccessary to enable
> HTTU bits in STEs in this patch?

Only if you need HTTU for stage-2 tables. This series is only about
sharing stage-1 page tables, for which HTTU is enabled in the CD. I'll add
a statement in the commit message.

Thanks,
Jean
Xiang Zheng May 29, 2020, 9:18 a.m. UTC | #15
Hi,

On 2020/5/20 1:54, Jean-Philippe Brucker wrote:
> Some systems allow devices to handle I/O Page Faults in the core mm. For
> example systems implementing the PCIe PRI extension or Arm SMMU stall
> model. Infrastructure for reporting these recoverable page faults was
> added to the IOMMU core by commit 0c830e6b3282 ("iommu: Introduce device
> fault report API"). Add a page fault handler for host SVA.
> 
> IOMMU drivers can now instantiate several fault workqueues and link them
> to IOPF-capable devices. Drivers can choose between a single global
> workqueue, one per IOMMU device, one per low-level fault queue, one per
> domain, etc.
> 
> When it receives a fault event, typically in an IRQ handler, the IOMMU
> driver reports the fault using iommu_report_device_fault(), which calls
> the registered handler. The page fault handler then calls the mm fault
> handler, and reports either success or failure with iommu_page_response().
> When the handler succeeds, the IOMMU retries the access.
> 
> The iopf_param pointer could be embedded into iommu_fault_param. But
> putting iopf_param into the iommu_param structure allows us not to care
> about ordering between calls to iopf_queue_add_device() and
> iommu_register_device_fault_handler().
> 
> Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
> ---
> v6->v7: Fix leak in iopf_queue_discard_partial()
> ---
>  drivers/iommu/Kconfig      |   4 +
>  drivers/iommu/Makefile     |   1 +
>  include/linux/iommu.h      |  51 +++++
>  drivers/iommu/io-pgfault.c | 459 +++++++++++++++++++++++++++++++++++++
>  4 files changed, 515 insertions(+)
>  create mode 100644 drivers/iommu/io-pgfault.c
> 
> diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
> index d9fa5b410015..15e9dc4e503c 100644
> --- a/drivers/iommu/Kconfig
> +++ b/drivers/iommu/Kconfig
> @@ -107,6 +107,10 @@ config IOMMU_SVA
>  	bool
>  	select IOASID
>  
> +config IOMMU_PAGE_FAULT
> +	bool
> +	select IOMMU_SVA
> +
>  config FSL_PAMU
>  	bool "Freescale IOMMU support"
>  	depends on PCI
> diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
> index 40c800dd4e3e..bf5cb4ee8409 100644
> --- a/drivers/iommu/Makefile
> +++ b/drivers/iommu/Makefile
> @@ -4,6 +4,7 @@ obj-$(CONFIG_IOMMU_API) += iommu-traces.o
>  obj-$(CONFIG_IOMMU_API) += iommu-sysfs.o
>  obj-$(CONFIG_IOMMU_DEBUGFS) += iommu-debugfs.o
>  obj-$(CONFIG_IOMMU_DMA) += dma-iommu.o
> +obj-$(CONFIG_IOMMU_PAGE_FAULT) += io-pgfault.o
>  obj-$(CONFIG_IOMMU_IO_PGTABLE) += io-pgtable.o
>  obj-$(CONFIG_IOMMU_IO_PGTABLE_ARMV7S) += io-pgtable-arm-v7s.o
>  obj-$(CONFIG_IOMMU_IO_PGTABLE_LPAE) += io-pgtable-arm.o
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index b62525747bd9..a462157c855b 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -46,6 +46,7 @@ struct iommu_domain;
>  struct notifier_block;
>  struct iommu_sva;
>  struct iommu_fault_event;
> +struct iopf_queue;
>  
>  /* iommu fault flags */
>  #define IOMMU_FAULT_READ	0x0
> @@ -347,6 +348,7 @@ struct iommu_fault_param {
>   * struct dev_iommu - Collection of per-device IOMMU data
>   *
>   * @fault_param: IOMMU detected device fault reporting data
> + * @iopf_param:	 I/O Page Fault queue and data
>   * @fwspec:	 IOMMU fwspec data
>   * @priv:	 IOMMU Driver private data
>   *
> @@ -356,6 +358,7 @@ struct iommu_fault_param {
>  struct dev_iommu {
>  	struct mutex lock;
>  	struct iommu_fault_param	*fault_param;
> +	struct iopf_device_param	*iopf_param;
>  	struct iommu_fwspec		*fwspec;
>  	void				*priv;
>  };

[...]

>  #endif /* __LINUX_IOMMU_H */
> diff --git a/drivers/iommu/io-pgfault.c b/drivers/iommu/io-pgfault.c
> new file mode 100644
> index 000000000000..1f61c1bc05da
> --- /dev/null
> +++ b/drivers/iommu/io-pgfault.c
> @@ -0,0 +1,459 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Handle device page faults
> + *
> + * Copyright (C) 2020 ARM Ltd.
> + */
> +
> +#include <linux/iommu.h>
> +#include <linux/list.h>
> +#include <linux/sched/mm.h>
> +#include <linux/slab.h>
> +#include <linux/workqueue.h>
> +
> +#include "iommu-sva.h"
> +
> +/**
> + * struct iopf_queue - IO Page Fault queue
> + * @wq: the fault workqueue
> + * @devices: devices attached to this queue
> + * @lock: protects the device list
> + */
> +struct iopf_queue {
> +	struct workqueue_struct		*wq;
> +	struct list_head		devices;
> +	struct mutex			lock;
> +};
> +
> +/**
> + * struct iopf_device_param - IO Page Fault data attached to a device
> + * @dev: the device that owns this param
> + * @queue: IOPF queue
> + * @queue_list: index into queue->devices
> + * @partial: faults that are part of a Page Request Group for which the last
> + *           request hasn't been submitted yet.
> + */
> +struct iopf_device_param {
> +	struct device			*dev;
> +	struct iopf_queue		*queue;
> +	struct list_head		queue_list;
> +	struct list_head		partial;
> +};
> +
> +struct iopf_fault {
> +	struct iommu_fault		fault;
> +	struct list_head		list;
> +};
> +
> +struct iopf_group {
> +	struct iopf_fault		last_fault;
> +	struct list_head		faults;
> +	struct work_struct		work;
> +	struct device			*dev;
> +};
> +
>
[...]

> +
> +static void iopf_handle_group(struct work_struct *work)
> +{
> +	struct iopf_group *group;
> +	struct iopf_fault *iopf, *next;
> +	enum iommu_page_response_code status = IOMMU_PAGE_RESP_SUCCESS;
> +
> +	group = container_of(work, struct iopf_group, work);
> +
> +	list_for_each_entry_safe(iopf, next, &group->faults, list) {
> +		/*
> +		 * For the moment, errors are sticky: don't handle subsequent
> +		 * faults in the group if there is an error.
> +		 */
> +		if (status == IOMMU_PAGE_RESP_SUCCESS)
> +			status = iopf_handle_single(iopf);
> +
> +		if (!(iopf->fault.prm.flags &
> +		      IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE))
> +			kfree(iopf);
> +	}
> +
> +	iopf_complete_group(group->dev, &group->last_fault, status);
> +	kfree(group);
> +}
> +
> +/**
> + * iommu_queue_iopf - IO Page Fault handler
> + * @evt: fault event
> + * @cookie: struct device, passed to iommu_register_device_fault_handler.
> + *
> + * Add a fault to the device workqueue, to be handled by mm.
> + *
> + * This module doesn't handle PCI PASID Stop Marker; IOMMU drivers must discard
> + * them before reporting faults. A PASID Stop Marker (LRW = 0b100) doesn't
> + * expect a response. It may be generated when disabling a PASID (issuing a
> + * PASID stop request) by some PCI devices.
> + *
> + * The PASID stop request is issued by the device driver before unbind(). Once
> + * it completes, no page request is generated for this PASID anymore and
> + * outstanding ones have been pushed to the IOMMU (as per PCIe 4.0r1.0 - 6.20.1
> + * and 10.4.1.2 - Managing PASID TLP Prefix Usage). Some PCI devices will wait
> + * for all outstanding page requests to come back with a response before
> + * completing the PASID stop request. Others do not wait for page responses, and
> + * instead issue this Stop Marker that tells us when the PASID can be
> + * reallocated.
> + *
> + * It is safe to discard the Stop Marker because it is an optimization.
> + * a. Page requests, which are posted requests, have been flushed to the IOMMU
> + *    when the stop request completes.
> + * b. We flush all fault queues on unbind() before freeing the PASID.
> + *
> + * So even though the Stop Marker might be issued by the device *after* the stop
> + * request completes, outstanding faults will have been dealt with by the time
> + * we free the PASID.
> + *
> + * Return: 0 on success and <0 on error.
> + */
> +int iommu_queue_iopf(struct iommu_fault *fault, void *cookie)
> +{
> +	int ret;
> +	struct iopf_group *group;
> +	struct iopf_fault *iopf, *next;
> +	struct iopf_device_param *iopf_param;
> +
> +	struct device *dev = cookie;
> +	struct dev_iommu *param = dev->iommu;
> +
> +	lockdep_assert_held(&param->lock);
> +
> +	if (fault->type != IOMMU_FAULT_PAGE_REQ)
> +		/* Not a recoverable page fault */
> +		return -EOPNOTSUPP;
> +
> +	/*
> +	 * As long as we're holding param->lock, the queue can't be unlinked
> +	 * from the device and therefore cannot disappear.
> +	 */
> +	iopf_param = param->iopf_param;
> +	if (!iopf_param)
> +		return -ENODEV;
> +
> +	if (!(fault->prm.flags & IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE)) {
> +		iopf = kzalloc(sizeof(*iopf), GFP_KERNEL);
> +		if (!iopf)
> +			return -ENOMEM;
> +
> +		iopf->fault = *fault;
> +
> +		/* Non-last request of a group. Postpone until the last one */
> +		list_add(&iopf->list, &iopf_param->partial);
> +
> +		return 0;
> +	}
> +
> +	group = kzalloc(sizeof(*group), GFP_KERNEL);
> +	if (!group) {
> +		/*
> +		 * The caller will send a response to the hardware. But we do
> +		 * need to clean up before leaving, otherwise partial faults
> +		 * will be stuck.
> +		 */
> +		ret = -ENOMEM;
> +		goto cleanup_partial;
> +	}
> +
> +	group->dev = dev;
> +	group->last_fault.fault = *fault;
> +	INIT_LIST_HEAD(&group->faults);
> +	list_add(&group->last_fault.list, &group->faults);
> +	INIT_WORK(&group->work, iopf_handle_group);
> +
> +	/* See if we have partial faults for this group */
> +	list_for_each_entry_safe(iopf, next, &iopf_param->partial, list) {
> +		if (iopf->fault.prm.grpid == fault->prm.grpid)
> +			/* Insert *before* the last fault */
> +			list_move(&iopf->list, &group->faults);
> +	}
> +
> +	queue_work(iopf_param->queue->wq, &group->work);
> +	return 0;
> +
> +cleanup_partial:
> +	list_for_each_entry_safe(iopf, next, &iopf_param->partial, list) {
> +		if (iopf->fault.prm.grpid == fault->prm.grpid) {
> +			list_del(&iopf->list);
> +			kfree(iopf);
> +		}
> +	}
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(iommu_queue_iopf);

[...]

> +/**
> + * iopf_queue_remove_device - Remove producer from fault queue
> + * @queue: IOPF queue
> + * @dev: device to remove
> + *
> + * Caller makes sure that no more faults are reported for this device.
> + *
> + * Return: 0 on success and <0 on error.
> + */
> +int iopf_queue_remove_device(struct iopf_queue *queue, struct device *dev)
> +{
> +	int ret = 0;
> +	struct iopf_fault *iopf, *next;
> +	struct iopf_device_param *iopf_param;
> +	struct dev_iommu *param = dev->iommu;
> +
> +	if (!param || !queue)
> +		return -EINVAL;
> +
> +	mutex_lock(&queue->lock);
> +	mutex_lock(&param->lock);
> +	iopf_param = param->iopf_param;
> +	if (iopf_param && iopf_param->queue == queue) {
> +		list_del(&iopf_param->queue_list);
> +		param->iopf_param = NULL;
> +	} else {
> +		ret = -EINVAL;
> +	}
> +	mutex_unlock(&param->lock);
> +	mutex_unlock(&queue->lock);
> +	if (ret)
> +		return ret;
> +
> +	/* Just in case some faults are still stuck */
> +	list_for_each_entry_safe(iopf, next, &iopf_param->partial, list)
> +		kfree(iopf);
> +
> +	kfree(iopf_param);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(iopf_queue_remove_device);
> +
> +/**
> + * iopf_queue_alloc - Allocate and initialize a fault queue
> + * @name: a unique string identifying the queue (for workqueue)
> + *
> + * Return: the queue on success and NULL on error.
> + */
> +struct iopf_queue *iopf_queue_alloc(const char *name)
> +{
> +	struct iopf_queue *queue;
> +
> +	queue = kzalloc(sizeof(*queue), GFP_KERNEL);
> +	if (!queue)
> +		return NULL;
> +
> +	/*
> +	 * The WQ is unordered because the low-level handler enqueues faults by
> +	 * group. PRI requests within a group have to be ordered, but once
> +	 * that's dealt with, the high-level function can handle groups out of
> +	 * order.
> +	 */
> +	queue->wq = alloc_workqueue("iopf_queue/%s", WQ_UNBOUND, 0, name);
> +	if (!queue->wq) {
> +		kfree(queue);
> +		return NULL;
> +	}
> +
> +	INIT_LIST_HEAD(&queue->devices);
> +	mutex_init(&queue->lock);
> +
> +	return queue;
> +}
> +EXPORT_SYMBOL_GPL(iopf_queue_alloc);
> +
> +/**
> + * iopf_queue_free - Free IOPF queue
> + * @queue: queue to free
> + *
> + * Counterpart to iopf_queue_alloc(). The driver must not be queuing faults or
> + * adding/removing devices on this queue anymore.
> + */
> +void iopf_queue_free(struct iopf_queue *queue)
> +{
> +	struct iopf_device_param *iopf_param, *next;
> +
> +	if (!queue)
> +		return;
> +
> +	list_for_each_entry_safe(iopf_param, next, &queue->devices, queue_list)
> +		iopf_queue_remove_device(queue, iopf_param->dev);
> +
> +	destroy_workqueue(queue->wq);

Do we need to free the iopf_group(s) here in case their work items haven't
been scheduled yet? If that occurs, we might leak memory here.

If the caller can ensure that would never occur, or I'm missing something,
please ignore my noise. :)

> +	kfree(queue);
> +}
> +EXPORT_SYMBOL_GPL(iopf_queue_free);
>
Shameerali Kolothum Thodi June 1, 2020, 12:42 p.m. UTC | #16
Hi Jean,

> -----Original Message-----
> From: iommu [mailto:iommu-bounces@lists.linux-foundation.org] On Behalf Of
> Jean-Philippe Brucker
> Sent: 19 May 2020 18:55
> To: iommu@lists.linux-foundation.org; devicetree@vger.kernel.org;
> linux-arm-kernel@lists.infradead.org; linux-pci@vger.kernel.org;
> linux-mm@kvack.org
> Cc: fenghua.yu@intel.com; kevin.tian@intel.com; jgg@ziepe.ca;
> catalin.marinas@arm.com; robin.murphy@arm.com; hch@infradead.org;
> zhangfei.gao@linaro.org; Jean-Philippe Brucker <jean-philippe@linaro.org>;
> felix.kuehling@amd.com; will@kernel.org; christian.koenig@amd.com
> Subject: [PATCH v7 21/24] iommu/arm-smmu-v3: Add stall support for platform
> devices
> 
> The SMMU provides a Stall model for handling page faults in platform
> devices. It is similar to PCI PRI, but doesn't require devices to have
> their own translation cache. Instead, faulting transactions are parked
> and the OS is given a chance to fix the page tables and retry the
> transaction.
> 
> Enable stall for devices that support it (opt-in by firmware). When an
> event corresponds to a translation error, call the IOMMU fault handler.
> If the fault is recoverable, it will call us back to terminate or
> continue the stall.
> 
> Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
> ---
>  drivers/iommu/Kconfig       |   1 +
>  include/linux/iommu.h       |   2 +
>  drivers/iommu/arm-smmu-v3.c | 284
> ++++++++++++++++++++++++++++++++++--
>  drivers/iommu/of_iommu.c    |   5 +-
>  4 files changed, 281 insertions(+), 11 deletions(-)
> 

[...]
 
> +static int arm_smmu_page_response(struct device *dev,
> +				  struct iommu_fault_event *unused,
> +				  struct iommu_page_response *resp)
> +{
> +	struct arm_smmu_cmdq_ent cmd = {0};
> +	struct arm_smmu_master *master = dev_iommu_priv_get(dev);
> +	int sid = master->streams[0].id;
> +
> +	if (master->stall_enabled) {
> +		cmd.opcode		= CMDQ_OP_RESUME;
> +		cmd.resume.sid		= sid;
> +		cmd.resume.stag		= resp->grpid;
> +		switch (resp->code) {
> +		case IOMMU_PAGE_RESP_INVALID:
> +		case IOMMU_PAGE_RESP_FAILURE:
> +			cmd.resume.resp = CMDQ_RESUME_0_RESP_ABORT;
> +			break;
> +		case IOMMU_PAGE_RESP_SUCCESS:
> +			cmd.resume.resp = CMDQ_RESUME_0_RESP_RETRY;
> +			break;
> +		default:
> +			return -EINVAL;
> +		}
> +	} else {
> +		/* TODO: insert PRI response here */
> +		return -ENODEV;
> +	}
> +
> +	arm_smmu_cmdq_issue_cmd(master->smmu, &cmd);
> +	/*
> +	 * Don't send a SYNC, it doesn't do anything for RESUME or PRI_RESP.
> +	 * RESUME consumption guarantees that the stalled transaction will be
> +	 * terminated... at some point in the future. PRI_RESP is fire and
> +	 * forget.
> +	 */
> +
> +	return 0;
> +}
> +
>  /* Context descriptor manipulation functions */
>  static void arm_smmu_tlb_inv_asid(struct arm_smmu_device *smmu, u16
> asid)
>  {
> @@ -1762,8 +1846,7 @@ static int arm_smmu_write_ctx_desc(struct
> arm_smmu_domain *smmu_domain,
>  			FIELD_PREP(CTXDESC_CD_0_ASID, cd->asid) |
>  			CTXDESC_CD_0_V;
> 
> -		/* STALL_MODEL==0b10 && CD.S==0 is ILLEGAL */
> -		if (smmu->features & ARM_SMMU_FEAT_STALL_FORCE)
> +		if (smmu_domain->stall_enabled)
>  			val |= CTXDESC_CD_0_S;
>  	}
> 
> @@ -2171,7 +2254,7 @@ static void arm_smmu_write_strtab_ent(struct
> arm_smmu_master *master, u32 sid,
>  			 FIELD_PREP(STRTAB_STE_1_STRW, strw));
> 
>  		if (smmu->features & ARM_SMMU_FEAT_STALLS &&
> -		   !(smmu->features & ARM_SMMU_FEAT_STALL_FORCE))
> +		    !master->stall_enabled)
>  			dst[1] |= cpu_to_le64(STRTAB_STE_1_S1STALLD);
> 
>  		val |= (s1_cfg->cdcfg.cdtab_dma & STRTAB_STE_0_S1CTXPTR_MASK)
> |
> @@ -2248,7 +2331,6 @@ static int arm_smmu_init_l2_strtab(struct
> arm_smmu_device *smmu, u32 sid)
>  	return 0;
>  }
> 
> -__maybe_unused
>  static struct arm_smmu_master *
>  arm_smmu_find_master(struct arm_smmu_device *smmu, u32 sid)
>  {
> @@ -2275,23 +2357,123 @@ arm_smmu_find_master(struct
> arm_smmu_device *smmu, u32 sid)
>  }
> 
>  /* IRQ and event handlers */
> +static int arm_smmu_handle_evt(struct arm_smmu_device *smmu, u64 *evt)
> +{
> +	int ret;
> +	u32 perm = 0;
> +	struct arm_smmu_master *master;
> +	bool ssid_valid = evt[0] & EVTQ_0_SSV;
> +	u8 type = FIELD_GET(EVTQ_0_ID, evt[0]);
> +	u32 sid = FIELD_GET(EVTQ_0_SID, evt[0]);
> +	struct iommu_fault_event fault_evt = { };
> +	struct iommu_fault *flt = &fault_evt.fault;
> +
> +	/* Stage-2 is always pinned at the moment */
> +	if (evt[1] & EVTQ_1_S2)
> +		return -EFAULT;
> +
> +	master = arm_smmu_find_master(smmu, sid);
> +	if (!master)
> +		return -EINVAL;
> +
> +	if (evt[1] & EVTQ_1_READ)
> +		perm |= IOMMU_FAULT_PERM_READ;
> +	else
> +		perm |= IOMMU_FAULT_PERM_WRITE;
> +
> +	if (evt[1] & EVTQ_1_EXEC)
> +		perm |= IOMMU_FAULT_PERM_EXEC;
> +
> +	if (evt[1] & EVTQ_1_PRIV)
> +		perm |= IOMMU_FAULT_PERM_PRIV;
> +
> +	if (evt[1] & EVTQ_1_STALL) {
> +		flt->type = IOMMU_FAULT_PAGE_REQ;
> +		flt->prm = (struct iommu_fault_page_request) {
> +			.flags = IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE,
> +			.pasid = FIELD_GET(EVTQ_0_SSID, evt[0]),
> +			.grpid = FIELD_GET(EVTQ_1_STAG, evt[1]),
> +			.perm = perm,
> +			.addr = FIELD_GET(EVTQ_2_ADDR, evt[2]),
> +		};
> +

> +		if (ssid_valid)
> +			flt->prm.flags |= IOMMU_FAULT_PAGE_REQUEST_PASID_VALID;

Do we need to set this for devices that only support STALL mode? I had an issue
with this being set on a vSVA POC based on our D06 zip device (which is a "fake"
PCI dev that supports STALL mode but no PRI). The issue is that CMDQ_OP_RESUME
doesn't have any ssid or SSV params and works on sid and stag only. Hence, it is
difficult for the Qemu SMMUv3 to populate these fields while preparing a page
response. I can see that this flag is also being checked in iopf_handle_single()
(patch 04/24). For the POC, I used a temp fix[1] to work around this. Please let
me know your thoughts.

Thanks,
Shameer

1. https://github.com/hisilicon/kernel-dev/commit/99ff96146e924055f38d97a5897e4becfa378d15
Jean-Philippe Brucker June 2, 2020, 9:38 a.m. UTC | #17
Hi Shameer,

On Mon, Jun 01, 2020 at 12:42:15PM +0000, Shameerali Kolothum Thodi wrote:
> >  /* IRQ and event handlers */
> > +static int arm_smmu_handle_evt(struct arm_smmu_device *smmu, u64 *evt)
> > +{
> > +	int ret;
> > +	u32 perm = 0;
> > +	struct arm_smmu_master *master;
> > +	bool ssid_valid = evt[0] & EVTQ_0_SSV;
> > +	u8 type = FIELD_GET(EVTQ_0_ID, evt[0]);
> > +	u32 sid = FIELD_GET(EVTQ_0_SID, evt[0]);
> > +	struct iommu_fault_event fault_evt = { };
> > +	struct iommu_fault *flt = &fault_evt.fault;
> > +
> > +	/* Stage-2 is always pinned at the moment */
> > +	if (evt[1] & EVTQ_1_S2)
> > +		return -EFAULT;
> > +
> > +	master = arm_smmu_find_master(smmu, sid);
> > +	if (!master)
> > +		return -EINVAL;
> > +
> > +	if (evt[1] & EVTQ_1_READ)
> > +		perm |= IOMMU_FAULT_PERM_READ;
> > +	else
> > +		perm |= IOMMU_FAULT_PERM_WRITE;
> > +
> > +	if (evt[1] & EVTQ_1_EXEC)
> > +		perm |= IOMMU_FAULT_PERM_EXEC;
> > +
> > +	if (evt[1] & EVTQ_1_PRIV)
> > +		perm |= IOMMU_FAULT_PERM_PRIV;
> > +
> > +	if (evt[1] & EVTQ_1_STALL) {
> > +		flt->type = IOMMU_FAULT_PAGE_REQ;
> > +		flt->prm = (struct iommu_fault_page_request) {
> > +			.flags = IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE,
> > +			.pasid = FIELD_GET(EVTQ_0_SSID, evt[0]),
> > +			.grpid = FIELD_GET(EVTQ_1_STAG, evt[1]),
> > +			.perm = perm,
> > +			.addr = FIELD_GET(EVTQ_2_ADDR, evt[2]),
> > +		};
> > +
> 
> > +		if (ssid_valid)
> > +			flt->prm.flags |= IOMMU_FAULT_PAGE_REQUEST_PASID_VALID;
> 
> Do we need to set this for STALL mode only support? I had an issue with this
> being set on a vSVA POC based on our D06 zip device(which is a "fake " pci dev
> that supports STALL mode but no PRI). The issue is, CMDQ_OP_RESUME doesn't
> have any ssid or SSV params and works on sid and stag only.

I don't understand the problem, arm_smmu_page_response() doesn't set SSID
or SSV when sending a CMDQ_OP_RESUME. Could you detail the flow of a stall
event and RESUME command in your prototype?  Are you getting issues with
the host driver or the guest driver?

We do need to forward the SSV flag all the way to the guest driver, so that
the guest can find the faulting address space using the SSID. Once the guest
has handled the fault, we don't send the SSID back to the host as part of
the RESUME command.

Thanks,
Jean

> Hence, it is difficult for
> Qemu SMMUv3 to populate this fields while preparing a page response. I can see
> that this flag is being checked in iopf_handle_single() (patch 04/24) as well. For POC,
> I used a temp fix[1] to work around this. Please let me know your thoughts.
> 
> Thanks,
> Shameer
> 
> 1. https://github.com/hisilicon/kernel-dev/commit/99ff96146e924055f38d97a5897e4becfa378d15
>
Shameerali Kolothum Thodi June 2, 2020, 10:31 a.m. UTC | #18
Hi Jean,

> -----Original Message-----
> From: linux-arm-kernel [mailto:linux-arm-kernel-bounces@lists.infradead.org]
> On Behalf Of Jean-Philippe Brucker
> Sent: 02 June 2020 10:39
> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
> Cc: devicetree@vger.kernel.org; kevin.tian@intel.com; will@kernel.org;
> fenghua.yu@intel.com; jgg@ziepe.ca; linux-pci@vger.kernel.org;
> felix.kuehling@amd.com; hch@infradead.org; linux-mm@kvack.org;
> iommu@lists.linux-foundation.org; catalin.marinas@arm.com;
> zhangfei.gao@linaro.org; robin.murphy@arm.com;
> christian.koenig@amd.com; linux-arm-kernel@lists.infradead.org
> Subject: Re: [PATCH v7 21/24] iommu/arm-smmu-v3: Add stall support for
> platform devices
> 
> Hi Shameer,
> 
> On Mon, Jun 01, 2020 at 12:42:15PM +0000, Shameerali Kolothum Thodi
> wrote:
> > >  /* IRQ and event handlers */
> > > +static int arm_smmu_handle_evt(struct arm_smmu_device *smmu, u64
> > > +*evt) {
> > > +	int ret;
> > > +	u32 perm = 0;
> > > +	struct arm_smmu_master *master;
> > > +	bool ssid_valid = evt[0] & EVTQ_0_SSV;
> > > +	u8 type = FIELD_GET(EVTQ_0_ID, evt[0]);
> > > +	u32 sid = FIELD_GET(EVTQ_0_SID, evt[0]);
> > > +	struct iommu_fault_event fault_evt = { };
> > > +	struct iommu_fault *flt = &fault_evt.fault;
> > > +
> > > +	/* Stage-2 is always pinned at the moment */
> > > +	if (evt[1] & EVTQ_1_S2)
> > > +		return -EFAULT;
> > > +
> > > +	master = arm_smmu_find_master(smmu, sid);
> > > +	if (!master)
> > > +		return -EINVAL;
> > > +
> > > +	if (evt[1] & EVTQ_1_READ)
> > > +		perm |= IOMMU_FAULT_PERM_READ;
> > > +	else
> > > +		perm |= IOMMU_FAULT_PERM_WRITE;
> > > +
> > > +	if (evt[1] & EVTQ_1_EXEC)
> > > +		perm |= IOMMU_FAULT_PERM_EXEC;
> > > +
> > > +	if (evt[1] & EVTQ_1_PRIV)
> > > +		perm |= IOMMU_FAULT_PERM_PRIV;
> > > +
> > > +	if (evt[1] & EVTQ_1_STALL) {
> > > +		flt->type = IOMMU_FAULT_PAGE_REQ;
> > > +		flt->prm = (struct iommu_fault_page_request) {
> > > +			.flags = IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE,
> > > +			.pasid = FIELD_GET(EVTQ_0_SSID, evt[0]),
> > > +			.grpid = FIELD_GET(EVTQ_1_STAG, evt[1]),
> > > +			.perm = perm,
> > > +			.addr = FIELD_GET(EVTQ_2_ADDR, evt[2]),
> > > +		};
> > > +
> >
> > > +		if (ssid_valid)
> > > +			flt->prm.flags |=
> IOMMU_FAULT_PAGE_REQUEST_PASID_VALID;
> >
> > Do we need to set this for STALL mode only support? I had an issue
> > with this being set on a vSVA POC based on our D06 zip device(which is
> > a "fake " pci dev that supports STALL mode but no PRI). The issue is,
> > CMDQ_OP_RESUME doesn't have any ssid or SSV params and works on sid
> and stag only.
> 
> I don't understand the problem, arm_smmu_page_response() doesn't set SSID
> or SSV when sending a CMDQ_OP_RESUME. Could you detail the flow of a stall
> event and RESUME command in your prototype?  Are you getting issues with
> the host driver or the guest driver?

The issue is on the host side iommu_page_response(). The flow is something like
below.

Stall: Host:-

arm_smmu_handle_evt()
  iommu_report_device_fault()
    vfio_pci_iommu_dev_fault_handler()
      
Stall: Qemu:-

vfio_dma_fault_notifier_handler()
  inject_faults()
    smmuv3_inject_faults()

Stall: Guest:-

arm_smmu_handle_evt()
  iommu_report_device_fault()
    iommu_queue_iopf
  ...
  iopf_handle_group()
    iopf_handle_single()
      handle_mm_fault()
        iopf_complete()
           iommu_page_response()
             arm_smmu_page_response()
               arm_smmu_cmdq_issue_cmd(CMDQ_OP_RESUME)

Resume: Qemu:-

smmuv3_cmdq_consume(SMMU_CMD_RESUME)
  smmuv3_notify_page_resp()
    vfio:ioctl(page_response)  --> struct iommu_page_response is filled
                             with only version, grpid and code.

Resume: Host:-
  ioctl(page_response)
    iommu_page_response()  --> fails as the pending req has PASID_VALID flag
                             set and it checks for a match.
      arm_smmu_page_response()

Hope the above is clear.

> We do need to forward the SSV flag all the way to the guest driver, so the guest
> can find the faulting address space using the SSID. Once the guest handled the
> fault, then we don't send the SSID back to the host as part of the RESUME
> command.

True, the guest requires the SSV flag to handle the page fault. But, as shown
in the flow above, the issue is on the host side in iommu_page_response(),
where it searches for a matching pending request based on pasid. I'm not sure
we can bypass that and call arm_smmu_page_response() directly, as we would
then have to delete the pending request from the list as well.

Please let me know if there is a better way to handle the host side page
response.

Thanks,
Shameer

> Thanks,
> Jean
> 
> > Hence, it is difficult for
> > Qemu SMMUv3 to populate this fields while preparing a page response. I
> > can see that this flag is being checked in iopf_handle_single() (patch
> > 04/24) as well. For POC, I used a temp fix[1] to work around this. Please let
> me know your thoughts.
> >
> > Thanks,
> > Shameer
> >
> > 1.
> > https://github.com/hisilicon/kernel-dev/commit/99ff96146e924055f38d97a
> > 5897e4becfa378d15
> >
> 
Shameerali Kolothum Thodi June 2, 2020, 12:12 p.m. UTC | #19
> -----Original Message-----
> From: linux-arm-kernel [mailto:linux-arm-kernel-bounces@lists.infradead.org]
> On Behalf Of Jean-Philippe Brucker
> Sent: 02 June 2020 12:46
> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
> Cc: devicetree@vger.kernel.org; kevin.tian@intel.com; fenghua.yu@intel.com;
> linux-pci@vger.kernel.org; felix.kuehling@amd.com; robin.murphy@arm.com;
> christian.koenig@amd.com; hch@infradead.org; jgg@ziepe.ca;
> iommu@lists.linux-foundation.org; catalin.marinas@arm.com;
> zhangfei.gao@linaro.org; will@kernel.org; linux-mm@kvack.org;
> linux-arm-kernel@lists.infradead.org
> Subject: Re: [PATCH v7 21/24] iommu/arm-smmu-v3: Add stall support for
> platform devices
> 
> On Tue, Jun 02, 2020 at 10:31:29AM +0000, Shameerali Kolothum Thodi wrote:
> > Hi Jean,
> >
> > > -----Original Message-----
> > > From: linux-arm-kernel
> > > [mailto:linux-arm-kernel-bounces@lists.infradead.org]
> > > On Behalf Of Jean-Philippe Brucker
> > > Sent: 02 June 2020 10:39
> > > To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
> > > Cc: devicetree@vger.kernel.org; kevin.tian@intel.com;
> > > will@kernel.org; fenghua.yu@intel.com; jgg@ziepe.ca;
> > > linux-pci@vger.kernel.org; felix.kuehling@amd.com;
> > > hch@infradead.org; linux-mm@kvack.org;
> > > iommu@lists.linux-foundation.org; catalin.marinas@arm.com;
> > > zhangfei.gao@linaro.org; robin.murphy@arm.com;
> > > christian.koenig@amd.com; linux-arm-kernel@lists.infradead.org
> > > Subject: Re: [PATCH v7 21/24] iommu/arm-smmu-v3: Add stall support
> > > for platform devices
> > >
> > > Hi Shameer,
> > >
> > > On Mon, Jun 01, 2020 at 12:42:15PM +0000, Shameerali Kolothum Thodi
> > > wrote:
> > > > >  /* IRQ and event handlers */
> > > > > +static int arm_smmu_handle_evt(struct arm_smmu_device *smmu,
> > > > > +u64
> > > > > +*evt) {
> > > > > +	int ret;
> > > > > +	u32 perm = 0;
> > > > > +	struct arm_smmu_master *master;
> > > > > +	bool ssid_valid = evt[0] & EVTQ_0_SSV;
> > > > > +	u8 type = FIELD_GET(EVTQ_0_ID, evt[0]);
> > > > > +	u32 sid = FIELD_GET(EVTQ_0_SID, evt[0]);
> > > > > +	struct iommu_fault_event fault_evt = { };
> > > > > +	struct iommu_fault *flt = &fault_evt.fault;
> > > > > +
> > > > > +	/* Stage-2 is always pinned at the moment */
> > > > > +	if (evt[1] & EVTQ_1_S2)
> > > > > +		return -EFAULT;
> > > > > +
> > > > > +	master = arm_smmu_find_master(smmu, sid);
> > > > > +	if (!master)
> > > > > +		return -EINVAL;
> > > > > +
> > > > > +	if (evt[1] & EVTQ_1_READ)
> > > > > +		perm |= IOMMU_FAULT_PERM_READ;
> > > > > +	else
> > > > > +		perm |= IOMMU_FAULT_PERM_WRITE;
> > > > > +
> > > > > +	if (evt[1] & EVTQ_1_EXEC)
> > > > > +		perm |= IOMMU_FAULT_PERM_EXEC;
> > > > > +
> > > > > +	if (evt[1] & EVTQ_1_PRIV)
> > > > > +		perm |= IOMMU_FAULT_PERM_PRIV;
> > > > > +
> > > > > +	if (evt[1] & EVTQ_1_STALL) {
> > > > > +		flt->type = IOMMU_FAULT_PAGE_REQ;
> > > > > +		flt->prm = (struct iommu_fault_page_request) {
> > > > > +			.flags = IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE,
> > > > > +			.pasid = FIELD_GET(EVTQ_0_SSID, evt[0]),
> > > > > +			.grpid = FIELD_GET(EVTQ_1_STAG, evt[1]),
> > > > > +			.perm = perm,
> > > > > +			.addr = FIELD_GET(EVTQ_2_ADDR, evt[2]),
> > > > > +		};
> > > > > +
> > > >
> > > > > +		if (ssid_valid)
> > > > > +			flt->prm.flags |=
> > > IOMMU_FAULT_PAGE_REQUEST_PASID_VALID;
> > > >
> > > > Do we need to set this for STALL mode only support? I had an issue
> > > > with this being set on a vSVA POC based on our D06 zip
> > > > device(which is a "fake " pci dev that supports STALL mode but no
> > > > PRI). The issue is, CMDQ_OP_RESUME doesn't have any ssid or SSV
> > > > params and works on sid
> > > and stag only.
> > >
> > > I don't understand the problem, arm_smmu_page_response() doesn't set
> > > SSID or SSV when sending a CMDQ_OP_RESUME. Could you detail the flow
> > > of a stall event and RESUME command in your prototype?  Are you
> > > getting issues with the host driver or the guest driver?
> >
> > The issue is on the host side iommu_page_response(). The flow is
> > something like below.
> >
> > Stall: Host:-
> >
> > arm_smmu_handle_evt()
> >   iommu_report_device_fault()
> >     vfio_pci_iommu_dev_fault_handler()
> >
> > Stall: Qemu:-
> >
> > vfio_dma_fault_notifier_handler()
> >   inject_faults()
> >     smmuv3_inject_faults()
> >
> > Stall: Guest:-
> >
> > arm_smmu_handle_evt()
> >   iommu_report_device_fault()
> >     iommu_queue_iopf
> >   ...
> >   iopf_handle_group()
> >     iopf_handle_single()
> >       handle_mm_fault()
> >         iopf_complete()
> >            iommu_page_response()
> >              arm_smmu_page_response()
> >                arm_smmu_cmdq_issue_cmd(CMDQ_OP_RESUME)
> >
> > Resume: Qemu:-
> >
> > smmuv3_cmdq_consume(SMMU_CMD_RESUME)
> >   smmuv3_notify_page_resp()
> >     vfio:ioctl(page_response)  --> struct iommu_page_response is filled
> >                              with only version, grpid and code.
> >
> > Resume: Host:-
> >   ioctl(page_response)
> >     iommu_page_response()  --> fails as the pending req has PASID_VALID
> flag
> >                              set and it checks for a match.
> 
> I believe the fix needs to be here. It's also wrong for PRI since not all PCIe
> endpoint require a PASID in the page response. Could you try the attached
> patch?

Going through the patch, yes, that will definitely fix the issue. But isn't it
better if the request itself indicates whether it expects a response msg with a
valid pasid or not? The response msg can come from userspace as well (vSVA),
and if for some reason it doesn't carry a pasid for a request that expects one,
then that should be an error, right? In the temp fix I had, I introduced
another flag to indicate whether the endpoint has PRI support or not, and used
that to verify the pasid requirement. But for the PRI case you mentioned above,
I'm not sure it is easy to get that information. Maybe I am complicating things
here :)

Thanks,
Shameer

> Thanks,
> Jean
> 
> >       arm_smmu_page_response()
> >
> > Hope the above is clear.
> >
> > > We do need to forward the SSV flag all the way to the guest driver,
> > > so the guest can find the faulting address space using the SSID.
> > > Once the guest handled the fault, then we don't send the SSID back
> > > to the host as part of the RESUME command.
> >
> > True, the guest requires SSV flag to handle the page fault. But, as
> > shown in the flow above, the issue is on the host side
> > iommu_page_response() where it searches for a matching pending req
> > based on pasid. Not sure we can bypass that and call
> > arm_smmu_page_response() directly but then have to delete the pending req
> from the list as well.
> >
> > Please let me know if there is a better way to handle the host side
> > page response.
> >
> > Thanks,
> > Shameer
> >
> > > Thanks,
> > > Jean
> > >
> > > > Hence, it is difficult for
> > > > Qemu SMMUv3 to populate this fields while preparing a page
> > > > response. I can see that this flag is being checked in
> > > > iopf_handle_single() (patch
> > > > 04/24) as well. For POC, I used a temp fix[1] to work around this.
> > > > Please let
> > > me know your thoughts.
> > > >
> > > > Thanks,
> > > > Shameer
> > > >
> > > > 1.
> > > > https://github.com/hisilicon/kernel-dev/commit/99ff96146e924055f38
> > > > d97a
> > > > 5897e4becfa378d15
> > > >
> > >
Jean-Philippe Brucker June 3, 2020, 7:38 a.m. UTC | #20
On Tue, Jun 02, 2020 at 12:12:30PM +0000, Shameerali Kolothum Thodi wrote:
> > > > > > +		if (ssid_valid)
> > > > > > +			flt->prm.flags |=
> > > > IOMMU_FAULT_PAGE_REQUEST_PASID_VALID;
> > > > >
> > > > > Do we need to set this for STALL mode only support? I had an issue
> > > > > with this being set on a vSVA POC based on our D06 zip
> > > > > device(which is a "fake " pci dev that supports STALL mode but no
> > > > > PRI). The issue is, CMDQ_OP_RESUME doesn't have any ssid or SSV
> > > > > params and works on sid
> > > > and stag only.
> > > >
> > > > I don't understand the problem, arm_smmu_page_response() doesn't set
> > > > SSID or SSV when sending a CMDQ_OP_RESUME. Could you detail the flow
> > > > of a stall event and RESUME command in your prototype?  Are you
> > > > getting issues with the host driver or the guest driver?
> > >
> > > The issue is on the host side iommu_page_response(). The flow is
> > > something like below.
> > >
> > > Stall: Host:-
> > >
> > > arm_smmu_handle_evt()
> > >   iommu_report_device_fault()
> > >     vfio_pci_iommu_dev_fault_handler()
> > >
> > > Stall: Qemu:-
> > >
> > > vfio_dma_fault_notifier_handler()
> > >   inject_faults()
> > >     smmuv3_inject_faults()
> > >
> > > Stall: Guest:-
> > >
> > > arm_smmu_handle_evt()
> > >   iommu_report_device_fault()
> > >     iommu_queue_iopf
> > >   ...
> > >   iopf_handle_group()
> > >     iopf_handle_single()
> > >       handle_mm_fault()
> > >         iopf_complete()
> > >            iommu_page_response()
> > >              arm_smmu_page_response()
> > >                arm_smmu_cmdq_issue_cmd(CMDQ_OP_RESUME)
> > >
> > > Resume: Qemu:-
> > >
> > > smmuv3_cmdq_consume(SMMU_CMD_RESUME)
> > >   smmuv3_notify_page_resp()
> > >     vfio:ioctl(page_response)  --> struct iommu_page_response is filled
> > >                              with only version, grpid and code.
> > >
> > > Resume: Host:-
> > >   ioctl(page_response)
> > >     iommu_page_response()  --> fails as the pending req has PASID_VALID
> > flag
> > >                              set and it checks for a match.
> > 
> > I believe the fix needs to be here. It's also wrong for PRI since not all PCIe
> > endpoint require a PASID in the page response. Could you try the attached
> > patch?
> 
> Going through the patch, yes, that will definitely fix the issue. But isn't it better if
> the request itself indicate whether it expects a response msg with a valid pasid or
> not? The response msg can come from userspace as well(vSVA) and if for some reason
> doesn't set it for a req that expects pasid then it should be an error, right? In the temp
> fix I had, I introduced another flag to indicate the endpoint has PRI support or not and
> used that to verify the pasid requirement. But for the PRI case you mentioned 
> above, not sure it is easy to get that information or not. May be I am complicating things
> here :)

No, you're right, we shouldn't send back malformed responses to the SMMU. I
suppose we can store a "PASID required" flag in the fault and check it
against the response. If we have to discard the guest's response, then we
can either fake a response (abort the stall) right away, or wait for the
response timeout to kick in, which will do the same.
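
Roughly something like this (just a sketch, not the patch I attached; the
IOMMU_FAULT_PAGE_RESPONSE_NEEDS_PASID flag and the helper name are only
placeholders for the example):

/*
 * Sketch: match a page response against a pending fault when the endpoint
 * may not require a PASID in the response. The hypothetical "needs PASID"
 * flag would be set by the IOMMU driver when reporting a fault whose
 * response must carry the PASID (e.g. PRI on a function that sets PRG
 * Response PASID Required), and left clear for stall events.
 */
static bool iopf_response_matches(const struct iommu_fault_page_request *prm,
				  const struct iommu_page_response *msg)
{
	bool needs_pasid = prm->flags & IOMMU_FAULT_PAGE_RESPONSE_NEEDS_PASID;

	if (prm->grpid != msg->grpid)
		return false;

	/* Only enforce the PASID match when the device actually needs one */
	if (needs_pasid &&
	    (!(msg->flags & IOMMU_PAGE_RESP_PASID_VALID) ||
	     prm->pasid != msg->pasid))
		return false;

	return true;
}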

Thanks,
Jean
Zenghui Yu Aug. 28, 2020, 9:28 a.m. UTC | #21
On 2020/5/20 1:54, Jean-Philippe Brucker wrote:
> @@ -4454,6 +4470,12 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
>   			smmu->features |= ARM_SMMU_FEAT_E2H;
>   	}
>   
> +	if (reg & (IDR0_HA | IDR0_HD)) {
> +		smmu->features |= ARM_SMMU_FEAT_HA;
> +		if (reg & IDR0_HD)
> +			smmu->features |= ARM_SMMU_FEAT_HD;
> +	}
> +

nitpick:

As per the IORT spec (DEN0049D, 3.1.1.2 SMMUv3 node, Table 10), the
"HTTU Override" flag of the SMMUv3 node can override the value in
SMMU_IDR0.HTTU. You may want to check this bit before selecting the
{HA,HD} features and shout if there is a mismatch between firmware and
the SMMU implementation. Just like how ARM_SMMU_FEAT_COHERENCY is
selected.
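
Something along these lines is what I have in mind (only a sketch; fw_httu
stands for the IORT "HTTU Override" value, already parsed and converted to
the IDR0_HA/IDR0_HD bit positions, which this series doesn't do yet):

/*
 * Sketch: warn if firmware claims HTTU support that the SMMU doesn't
 * implement, and only enable what both agree on, similar in spirit to the
 * ARM_SMMU_FEAT_COHERENCY handling.
 */
static void arm_smmu_check_httu(struct arm_smmu_device *smmu, u32 reg,
				u32 fw_httu)
{
	u32 hw_httu = reg & (IDR0_HA | IDR0_HD);

	if (fw_httu & ~hw_httu)
		dev_warn(smmu->dev,
			 "IORT HTTU override 0x%x not implemented by the SMMU (0x%x)\n",
			 fw_httu, hw_httu);

	if (fw_httu & hw_httu & IDR0_HA)
		smmu->features |= ARM_SMMU_FEAT_HA;
	if (fw_httu & hw_httu & IDR0_HD)
		smmu->features |= ARM_SMMU_FEAT_HD;
}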


Thanks,
Zenghui
Jean-Philippe Brucker Sept. 16, 2020, 2:11 p.m. UTC | #22
On Fri, Aug 28, 2020 at 05:28:22PM +0800, Zenghui Yu wrote:
> On 2020/5/20 1:54, Jean-Philippe Brucker wrote:
> > @@ -4454,6 +4470,12 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
> >   			smmu->features |= ARM_SMMU_FEAT_E2H;
> >   	}
> > +	if (reg & (IDR0_HA | IDR0_HD)) {
> > +		smmu->features |= ARM_SMMU_FEAT_HA;
> > +		if (reg & IDR0_HD)
> > +			smmu->features |= ARM_SMMU_FEAT_HD;
> > +	}
> > +
> 
> nitpick:
> 
> As per the IORT spec (DEN0049D, 3.1.1.2 SMMUv3 node, Table 10), the
> "HTTU Override" flag of the SMMUv3 node can override the value in
> SMMU_IDR0.HTTU. You may want to check this bit before selecting the
> {HA,HD} features and shout if there is a mismatch between firmware and
> the SMMU implementation. Just like how ARM_SMMU_FEAT_COHERENCY is
> selected.

Thanks for pointing this out, I didn't know about these flags but have
added them to the patch now.

Thanks,
Jean
Jean-Philippe Brucker Nov. 11, 2020, 1:57 p.m. UTC | #23
Hi Baolu,

Thanks for the review. I'm only now reworking this and realized I've never
sent a reply, sorry about that.

On Wed, May 20, 2020 at 02:42:21PM +0800, Lu Baolu wrote:
> Hi Jean,
> 
> On 2020/5/20 1:54, Jean-Philippe Brucker wrote:
> > Some systems allow devices to handle I/O Page Faults in the core mm. For
> > example systems implementing the PCIe PRI extension or Arm SMMU stall
> > model. Infrastructure for reporting these recoverable page faults was
> > added to the IOMMU core by commit 0c830e6b3282 ("iommu: Introduce device
> > fault report API"). Add a page fault handler for host SVA.
> > 
> > IOMMU driver can now instantiate several fault workqueues and link them
> > to IOPF-capable devices. Drivers can choose between a single global
> > workqueue, one per IOMMU device, one per low-level fault queue, one per
> > domain, etc.
> > 
> > When it receives a fault event, supposedly in an IRQ handler, the IOMMU
> > driver reports the fault using iommu_report_device_fault(), which calls
> > the registered handler. The page fault handler then calls the mm fault
> > handler, and reports either success or failure with iommu_page_response().
> > When the handler succeeded, the IOMMU retries the access.
> > 
> > The iopf_param pointer could be embedded into iommu_fault_param. But
> > putting iopf_param into the iommu_param structure allows us not to care
> > about ordering between calls to iopf_queue_add_device() and
> > iommu_register_device_fault_handler().
> > 
> > Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
[...]
> > +static enum iommu_page_response_code
> > +iopf_handle_single(struct iopf_fault *iopf)
> > +{
> > +	vm_fault_t ret;
> > +	struct mm_struct *mm;
> > +	struct vm_area_struct *vma;
> > +	unsigned int access_flags = 0;
> > +	unsigned int fault_flags = FAULT_FLAG_REMOTE;
> > +	struct iommu_fault_page_request *prm = &iopf->fault.prm;
> > +	enum iommu_page_response_code status = IOMMU_PAGE_RESP_INVALID;
> > +
> > +	if (!(prm->flags & IOMMU_FAULT_PAGE_REQUEST_PASID_VALID))
> > +		return status;
> > +
> > +	mm = iommu_sva_find(prm->pasid);
> > +	if (IS_ERR_OR_NULL(mm))
> > +		return status;
> > +
> > +	down_read(&mm->mmap_sem);
> > +
> > +	vma = find_extend_vma(mm, prm->addr);
> > +	if (!vma)
> > +		/* Unmapped area */
> > +		goto out_put_mm;
> > +
> > +	if (prm->perm & IOMMU_FAULT_PERM_READ)
> > +		access_flags |= VM_READ;
> > +
> > +	if (prm->perm & IOMMU_FAULT_PERM_WRITE) {
> > +		access_flags |= VM_WRITE;
> > +		fault_flags |= FAULT_FLAG_WRITE;
> > +	}
> > +
> > +	if (prm->perm & IOMMU_FAULT_PERM_EXEC) {
> > +		access_flags |= VM_EXEC;
> > +		fault_flags |= FAULT_FLAG_INSTRUCTION;
> > +	}
> > +
> > +	if (!(prm->perm & IOMMU_FAULT_PERM_PRIV))
> > +		fault_flags |= FAULT_FLAG_USER;
> > +
> > +	if (access_flags & ~vma->vm_flags)
> > +		/* Access fault */
> > +		goto out_put_mm;
> > +
> > +	ret = handle_mm_fault(vma, prm->addr, fault_flags);
> > +	status = ret & VM_FAULT_ERROR ? IOMMU_PAGE_RESP_INVALID :
> 
> Do you mind telling why it's IOMMU_PAGE_RESP_INVALID but not
> IOMMU_PAGE_RESP_FAILURE?

PAGE_RESP_FAILURE maps to the PRI Response Code "Response Failure", which
indicates a catastrophic error and causes the function to disable PRI.
PAGE_RESP_INVALID, on the other hand, maps to the PRI Response Code
"Invalid Request", which tells the function that the address is invalid and
there is no point retrying this particular access.
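
For illustration, the mapping is roughly the following (a sketch only; the
SMMU builds the actual PRG response from CMD_PRI_RESP, so this helper doesn't
exist in the driver, and the raw values are the PCIe PRG Response Code
encodings as I read them):

/*
 * Sketch: translate the generic page response code into a PCIe PRG Response
 * Code. "Invalid Request" rejects just this PRG, while "Response Failure"
 * signals a catastrophic error and makes the function disable PRI.
 */
static u16 iommu_resp_to_prg_resp(enum iommu_page_response_code code)
{
	switch (code) {
	case IOMMU_PAGE_RESP_SUCCESS:
		return 0x0;	/* Success: the function retries the requests */
	case IOMMU_PAGE_RESP_INVALID:
		return 0x1;	/* Invalid Request: don't retry this access */
	case IOMMU_PAGE_RESP_FAILURE:
	default:
		return 0xf;	/* Response Failure: PRI gets disabled */
	}
}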

Thanks,
Jean
Jean-Philippe Brucker Nov. 11, 2020, 1:57 p.m. UTC | #24
Hi Xiang,

Thank you for reviewing this. I forgot to send a reply, sorry for the
delay.

On Fri, May 29, 2020 at 05:18:27PM +0800, Xiang Zheng wrote:
> Hi,
> 
> On 2020/5/20 1:54, Jean-Philippe Brucker wrote:
> > Some systems allow devices to handle I/O Page Faults in the core mm. For
> > example systems implementing the PCIe PRI extension or Arm SMMU stall
> > model. Infrastructure for reporting these recoverable page faults was
> > added to the IOMMU core by commit 0c830e6b3282 ("iommu: Introduce device
> > fault report API"). Add a page fault handler for host SVA.
> > 
> > IOMMU driver can now instantiate several fault workqueues and link them
> > to IOPF-capable devices. Drivers can choose between a single global
> > workqueue, one per IOMMU device, one per low-level fault queue, one per
> > domain, etc.
> > 
> > When it receives a fault event, supposedly in an IRQ handler, the IOMMU
> > driver reports the fault using iommu_report_device_fault(), which calls
> > the registered handler. The page fault handler then calls the mm fault
> > handler, and reports either success or failure with iommu_page_response().
> > When the handler succeeded, the IOMMU retries the access.
> > 
> > The iopf_param pointer could be embedded into iommu_fault_param. But
> > putting iopf_param into the iommu_param structure allows us not to care
> > about ordering between calls to iopf_queue_add_device() and
> > iommu_register_device_fault_handler().
> > 
> > Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
[...]
> > +/**
> > + * iopf_queue_free - Free IOPF queue
> > + * @queue: queue to free
> > + *
> > + * Counterpart to iopf_queue_alloc(). The driver must not be queuing faults or
> > + * adding/removing devices on this queue anymore.
> > + */
> > +void iopf_queue_free(struct iopf_queue *queue)
> > +{
> > +	struct iopf_device_param *iopf_param, *next;
> > +
> > +	if (!queue)
> > +		return;
> > +
> > +	list_for_each_entry_safe(iopf_param, next, &queue->devices, queue_list)
> > +		iopf_queue_remove_device(queue, iopf_param->dev);
> > +
> > +	destroy_workqueue(queue->wq);
> 
> Do we need to free iopf_group(s) here in case the work queue of the group(s) are not
> scheduled yet? If that occurs, we might leak memory here.

Partial groups are freed by iopf_queue_remove_device(), and all other
groups are freed when destroy_workqueue() executes the remaining work.
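
To illustrate (a rough, trimmed-down sketch of what the patch 04 work
function does, not the exact code):

/*
 * Sketch: each queued group is freed by its own work function, and
 * destroy_workqueue() drains the queue before tearing it down, so any
 * group already queued still gets handled and freed.
 */
static void iopf_handle_group(struct work_struct *work)
{
	struct iopf_group *group = container_of(work, struct iopf_group, work);
	struct iopf_fault *iopf, *next;

	list_for_each_entry_safe(iopf, next, &group->faults, list) {
		/* ... handle the fault, then send the page response ... */
		if (!(iopf->fault.prm.flags & IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE))
			kfree(iopf);
	}

	kfree(group);	/* freed even when run during queue teardown */
}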

Thanks,
Jean
Baolu Lu Nov. 11, 2020, 11:11 p.m. UTC | #25
Hi Jean,

On 2020/11/11 21:57, Jean-Philippe Brucker wrote:
> Hi Baolu,
> 
> Thanks for the review. I'm only now reworking this and realized I've never
> sent a reply, sorry about that.
> 
> On Wed, May 20, 2020 at 02:42:21PM +0800, Lu Baolu wrote:
>> Hi Jean,
>>
>> On 2020/5/20 1:54, Jean-Philippe Brucker wrote:
>>> Some systems allow devices to handle I/O Page Faults in the core mm. For
>>> example systems implementing the PCIe PRI extension or Arm SMMU stall
>>> model. Infrastructure for reporting these recoverable page faults was
>>> added to the IOMMU core by commit 0c830e6b3282 ("iommu: Introduce device
>>> fault report API"). Add a page fault handler for host SVA.
>>>
>>> IOMMU driver can now instantiate several fault workqueues and link them
>>> to IOPF-capable devices. Drivers can choose between a single global
>>> workqueue, one per IOMMU device, one per low-level fault queue, one per
>>> domain, etc.
>>>
>>> When it receives a fault event, supposedly in an IRQ handler, the IOMMU
>>> driver reports the fault using iommu_report_device_fault(), which calls
>>> the registered handler. The page fault handler then calls the mm fault
>>> handler, and reports either success or failure with iommu_page_response().
>>> When the handler succeeded, the IOMMU retries the access.
>>>
>>> The iopf_param pointer could be embedded into iommu_fault_param. But
>>> putting iopf_param into the iommu_param structure allows us not to care
>>> about ordering between calls to iopf_queue_add_device() and
>>> iommu_register_device_fault_handler().
>>>
>>> Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
> [...]
>>> +static enum iommu_page_response_code
>>> +iopf_handle_single(struct iopf_fault *iopf)
>>> +{
>>> +	vm_fault_t ret;
>>> +	struct mm_struct *mm;
>>> +	struct vm_area_struct *vma;
>>> +	unsigned int access_flags = 0;
>>> +	unsigned int fault_flags = FAULT_FLAG_REMOTE;
>>> +	struct iommu_fault_page_request *prm = &iopf->fault.prm;
>>> +	enum iommu_page_response_code status = IOMMU_PAGE_RESP_INVALID;
>>> +
>>> +	if (!(prm->flags & IOMMU_FAULT_PAGE_REQUEST_PASID_VALID))
>>> +		return status;
>>> +
>>> +	mm = iommu_sva_find(prm->pasid);
>>> +	if (IS_ERR_OR_NULL(mm))
>>> +		return status;
>>> +
>>> +	down_read(&mm->mmap_sem);
>>> +
>>> +	vma = find_extend_vma(mm, prm->addr);
>>> +	if (!vma)
>>> +		/* Unmapped area */
>>> +		goto out_put_mm;
>>> +
>>> +	if (prm->perm & IOMMU_FAULT_PERM_READ)
>>> +		access_flags |= VM_READ;
>>> +
>>> +	if (prm->perm & IOMMU_FAULT_PERM_WRITE) {
>>> +		access_flags |= VM_WRITE;
>>> +		fault_flags |= FAULT_FLAG_WRITE;
>>> +	}
>>> +
>>> +	if (prm->perm & IOMMU_FAULT_PERM_EXEC) {
>>> +		access_flags |= VM_EXEC;
>>> +		fault_flags |= FAULT_FLAG_INSTRUCTION;
>>> +	}
>>> +
>>> +	if (!(prm->perm & IOMMU_FAULT_PERM_PRIV))
>>> +		fault_flags |= FAULT_FLAG_USER;
>>> +
>>> +	if (access_flags & ~vma->vm_flags)
>>> +		/* Access fault */
>>> +		goto out_put_mm;
>>> +
>>> +	ret = handle_mm_fault(vma, prm->addr, fault_flags);
>>> +	status = ret & VM_FAULT_ERROR ? IOMMU_PAGE_RESP_INVALID :
>>
>> Do you mind telling why it's IOMMU_PAGE_RESP_INVALID but not
>> IOMMU_PAGE_RESP_FAILURE?
> 
> PAGE_RESP_FAILURE maps to PRI Response code "Response Failure" which
> indicates a catastrophic error and causes the function to disable PRI.
> Instead PAGE_RESP_INVALID maps to PRI Response code "Invalid request",
> which tells the function that the address is invalid and there is no point
> retrying this particular access.

Thanks for the explanation. I am also working on converting Intel VT-d
to use this framework (and the sva helpers). So far so good.

Best regards,
baolu