[kernel,v5,10/10] KVM: PPC: VFIO: Add in-kernel acceleration for VFIO

Message ID 20170222082133.10277-11-aik@ozlabs.ru
State Superseded

Commit Message

Alexey Kardashevskiy Feb. 22, 2017, 8:21 a.m.
This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
and H_STUFF_TCE requests targeted at an IOMMU TCE table used for VFIO
without passing them to user space, which saves time on switching
to user space and back.

This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
KVM first tries to handle a TCE request in real mode; if that fails,
it passes the request to virtual mode to complete the operation.
If the virtual mode handler fails as well, the request is passed to
user space; this is not expected to happen though.
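
The fallback chain described above can be pictured with a small user-space
model. This is only a sketch: the enum values, the request structure and
the function names are hypothetical and do not appear in the patch, and it
assumes that a "too hard" return code is what hands a request to the next
level.

```c
#include <stdbool.h>

/* Hypothetical return codes for this sketch; the kernel's H_* values differ. */
enum sketch_ret { SK_SUCCESS = 0, SK_TOO_HARD = 1, SK_PARAMETER = 2 };

/* Level at which the request was finally completed. */
enum sketch_level { SK_REAL_MODE, SK_VIRTUAL_MODE, SK_USER_SPACE };

/* Hypothetical description of one TCE request. */
struct sketch_request {
	bool rm_can_handle;	/* real mode has everything it needs */
	bool vm_can_handle;	/* virtual mode has everything it needs */
	bool bad_parameter;	/* the guest passed an invalid argument */
};

/* Model of a real mode handler: it either finishes or reports "too hard". */
enum sketch_ret sketch_rm_handler(const struct sketch_request *req)
{
	if (req->bad_parameter)
		return SK_PARAMETER;
	return req->rm_can_handle ? SK_SUCCESS : SK_TOO_HARD;
}

/* Model of a virtual mode handler with the same contract. */
enum sketch_ret sketch_vm_handler(const struct sketch_request *req)
{
	if (req->bad_parameter)
		return SK_PARAMETER;
	return req->vm_can_handle ? SK_SUCCESS : SK_TOO_HARD;
}

/* Try real mode, then virtual mode, and only then fall back to user space. */
enum sketch_level sketch_dispatch(const struct sketch_request *req,
				  enum sketch_ret *ret)
{
	*ret = sketch_rm_handler(req);
	if (*ret != SK_TOO_HARD)
		return SK_REAL_MODE;

	*ret = sketch_vm_handler(req);
	if (*ret != SK_TOO_HARD)
		return SK_VIRTUAL_MODE;

	/* *ret is still SK_TOO_HARD: user space has to complete the request. */
	return SK_USER_SPACE;
}
```

In this model a definite answer (success or a parameter error) stops the
chain at the level that produced it; only "too hard" moves the request on.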

To avoid dealing with page use counters (which is tricky in real mode),
this only accelerates SPAPR TCE IOMMU v2 clients which are required
to pre-register the userspace memory. The very first TCE request will
be handled in the VFIO SPAPR TCE driver anyway as the userspace view
of the TCE table (iommu_table::it_userspace) is not allocated till
the very first mapping happens and we cannot call vmalloc in real mode.

If we fail to update a hardware IOMMU table for an unexpected reason, we
just clear the affected entry and move on as there is nothing we can
really do about it - for example, if we hot plug a VFIO device to a guest,
existing TCE tables will be mirrored automatically to the hardware and
there is no interface to report possible failures to the guest.

This adds a new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
the VFIO KVM device. It takes a VFIO group fd and a SPAPR TCE table fd
and associates a physical IOMMU table with the SPAPR TCE table (which
is a guest view of the hardware IOMMU table). The iommu_table object
is cached and referenced so we do not have to look it up in real mode.
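
From the user space side, setting the attribute could look roughly like
the sketch below. It is an illustration under assumptions, not tested
against this series: the structure is a local mirror of the layout
proposed here (deliberately given a different name so that it cannot clash
with the installed headers), the attribute number 3 is the value this
patch proposes, and the sketch_* names are hypothetical. The device fd is
assumed to be a VFIO KVM device and tablefd a TCE table fd obtained via
KVM_CREATE_SPAPR_TCE.

```c
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Local mirror of the argument layout proposed in this patch. */
struct sketch_vfio_spapr_tce {
	uint32_t argsz;
	uint32_t flags;
	int32_t groupfd;
	int32_t tablefd;
};

/* Attribute number proposed by this patch, spelled out locally. */
#define SKETCH_GROUP_SET_SPAPR_TCE 3

/* Fill in the argument structure and the device attribute pointing at it. */
void sketch_fill_attach(struct sketch_vfio_spapr_tce *param,
			struct kvm_device_attr *attr,
			int groupfd, int tablefd)
{
	memset(param, 0, sizeof(*param));
	param->argsz = sizeof(*param);
	param->flags = 0;		/* no flags are supported, must be zero */
	param->groupfd = groupfd;	/* VFIO group fd */
	param->tablefd = tablefd;	/* fd from KVM_CREATE_SPAPR_TCE */

	memset(attr, 0, sizeof(*attr));
	attr->group = KVM_DEV_VFIO_GROUP;
	attr->attr = SKETCH_GROUP_SET_SPAPR_TCE;
	attr->addr = (uint64_t)(uintptr_t)param;
}

/* Ask the VFIO KVM device to attach the group's table to the LIOBN. */
int sketch_attach(int vfio_kvm_dev_fd, int groupfd, int tablefd)
{
	struct sketch_vfio_spapr_tce param;
	struct kvm_device_attr attr;

	sketch_fill_attach(&param, &attr, groupfd, tablefd);
	return ioctl(vfio_kvm_dev_fd, KVM_SET_DEVICE_ATTR, &attr);
}
```

The group is expected to have been added with KVM_DEV_VFIO_GROUP_ADD
beforehand, since the handler in this patch only attaches groups that are
already on the device's group list.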

This does not implement the UNSET counterpart as there is no use for it -
once the acceleration is enabled, the existing userspace won't
disable it unless a VFIO container is destroyed; this adds the necessary
cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.

As this creates a descriptor per IOMMU table-LIOBN pair (called
kvmppc_spapr_tce_iommu_table), it is possible to have several
descriptors with the same iommu_table (hardware IOMMU table) attached
to the same LIOBN; we do not remove duplicates though, as
iommu_table_ops::exchange does not just update a TCE entry (which is
shared among IOMMU groups) but also invalidates the TCE cache
(one per IOMMU group).

This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
space.

This finally makes use of vfio_external_user_iommu_id() which was
introduced quite some time ago and was considered for removal.

Tests show that this patch increases transmission speed from 220MB/s
to 750..1020MB/s on a 10Gb network (Chelsio CXGB3 10Gb ethernet card).

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v5:
* changed error codes in multiple places
* added a bunch of WARN_ON() in places which should not really happen
* added a check that an iommu table is not attached already to LIOBN
* dropped explicit calls to iommu_tce_clear_param_check/
iommu_tce_put_param_check as kvmppc_tce_validate/kvmppc_ioba_validate
call them anyway (since the previous patch)
* if we fail to update a hardware IOMMU table for an unexpected reason,
this just clears the entry

v4:
* added note to the commit log about allowing multiple updates of
the same IOMMU table;
* instead of checking whether any memory was preregistered, this
returns H_TOO_HARD if a specific page was not;
* fixed comments from v3 about error handling in many places;
* simplified TCE handlers and merged IOMMU parts inline - for example,
there used to be kvmppc_h_put_tce_iommu(), now it is merged into
kvmppc_h_put_tce(); this allows checking IOBA boundaries against
the first attached table only (makes the code simpler);

v3:
* simplified not to use VFIO group notifiers
* reworked cleanup, should be cleaner/simpler now

v2:
* reworked to use new VFIO notifiers
* now the same iommu_table may appear in the list several times, to be fixed later
---
 Documentation/virtual/kvm/devices/vfio.txt |  22 ++-
 arch/powerpc/include/asm/kvm_host.h        |   8 +
 arch/powerpc/include/asm/kvm_ppc.h         |   4 +
 include/uapi/linux/kvm.h                   |   8 +
 arch/powerpc/kvm/book3s_64_vio.c           | 307 ++++++++++++++++++++++++++++-
 arch/powerpc/kvm/book3s_64_vio_hv.c        | 152 +++++++++++++-
 arch/powerpc/kvm/powerpc.c                 |   2 +
 virt/kvm/vfio.c                            |  60 ++++++
 8 files changed, 555 insertions(+), 8 deletions(-)

Comments

David Gibson Feb. 24, 2017, 2:14 a.m. | #1
On Wed, Feb 22, 2017 at 07:21:33PM +1100, Alexey Kardashevskiy wrote:

I have some comments on this patch, but all the definite ones are
pretty minor and could be done as later cleanups.

I have some more serious queries, but they are just queries and
requests for clarification.  If there are satisfactory answers to
them, I'll add my R-b.


> diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
> index ef51740c67ca..f95d867168ea 100644
> --- a/Documentation/virtual/kvm/devices/vfio.txt
> +++ b/Documentation/virtual/kvm/devices/vfio.txt
> @@ -16,7 +16,25 @@ Groups:
>  
>  KVM_DEV_VFIO_GROUP attributes:
>    KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
> +	kvm_device_attr.addr points to an int32_t file descriptor
> +	for the VFIO group.
>    KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
> +	kvm_device_attr.addr points to an int32_t file descriptor
> +	for the VFIO group.
> +  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: attaches a guest visible TCE table
> +	allocated by sPAPR KVM.
> +	kvm_device_attr.addr points to a struct:
>  
> -For each, kvm_device_attr.addr points to an int32_t file descriptor
> -for the VFIO group.
> +	struct kvm_vfio_spapr_tce {
> +		__u32	argsz;
> +		__u32	flags;
> +		__s32	groupfd;
> +		__s32	tablefd;
> +	};
> +
> +	where
> +	@argsz is the size of kvm_vfio_spapr_tce_liobn;
> +	@flags are not supported now, must be zero;
> +	@groupfd is a file descriptor for a VFIO group;
> +	@tablefd is a file descriptor for a TCE table allocated via
> +		KVM_CREATE_SPAPR_TCE.
> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> index e59b172666cd..a827006941f8 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -191,6 +191,13 @@ struct kvmppc_pginfo {
>  	atomic_t refcnt;
>  };
>  
> +struct kvmppc_spapr_tce_iommu_table {
> +	struct rcu_head rcu;
> +	struct list_head next;
> +	struct vfio_group *group;
> +	struct iommu_table *tbl;
> +};
> +
>  struct kvmppc_spapr_tce_table {
>  	struct list_head list;
>  	struct kvm *kvm;
> @@ -199,6 +206,7 @@ struct kvmppc_spapr_tce_table {
>  	u32 page_shift;
>  	u64 offset;		/* in pages */
>  	u64 size;		/* window size in pages */
> +	struct list_head iommu_tables;
>  	struct page *pages[0];
>  };
>  
> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
> index e04b7fb8ccaa..b8a39dec92cf 100644
> --- a/arch/powerpc/include/asm/kvm_ppc.h
> +++ b/arch/powerpc/include/asm/kvm_ppc.h
> @@ -163,6 +163,10 @@ extern long kvmppc_prepare_vrma(struct kvm *kvm,
>  extern void kvmppc_map_vrma(struct kvm_vcpu *vcpu,
>  			struct kvm_memory_slot *memslot, unsigned long porder);
>  extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
> +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
> +		struct vfio_group *group);
> +extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
> +		struct vfio_group *group);
>  
>  extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>  				struct kvm_create_spapr_tce_64 *args);
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index a2c9bb5a0ead..cdfa01169bd2 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1076,6 +1076,7 @@ struct kvm_device_attr {
>  #define  KVM_DEV_VFIO_GROUP			1
>  #define   KVM_DEV_VFIO_GROUP_ADD			1
>  #define   KVM_DEV_VFIO_GROUP_DEL			2
> +#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE		3
>  
>  enum kvm_device_type {
>  	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
> @@ -1097,6 +1098,13 @@ enum kvm_device_type {
>  	KVM_DEV_TYPE_MAX,
>  };
>  
> +struct kvm_vfio_spapr_tce {
> +	__u32	argsz;
> +	__u32	flags;
> +	__s32	groupfd;
> +	__s32	tablefd;
> +};
> +
>  /*
>   * ioctls for VM fds
>   */
> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> index 15df8ae627d9..062407af09ee 100644
> --- a/arch/powerpc/kvm/book3s_64_vio.c
> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> @@ -27,6 +27,10 @@
>  #include <linux/hugetlb.h>
>  #include <linux/list.h>
>  #include <linux/anon_inodes.h>
> +#include <linux/iommu.h>
> +#include <linux/file.h>
> +#include <linux/vfio.h>
> +#include <linux/module.h>
>  
>  #include <asm/tlbflush.h>
>  #include <asm/kvm_ppc.h>
> @@ -39,6 +43,36 @@
>  #include <asm/udbg.h>
>  #include <asm/iommu.h>
>  #include <asm/tce.h>
> +#include <asm/mmu_context.h>
> +
> +static void kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
> +{
> +	void (*fn)(struct vfio_group *);
> +
> +	fn = symbol_get(vfio_group_put_external_user);
> +	if (WARN_ON(!fn))
> +		return;
> +
> +	fn(vfio_group);
> +
> +	symbol_put(vfio_group_put_external_user);
> +}
> +
> +static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
> +{
> +	int (*fn)(struct vfio_group *);
> +	int ret = -1;
> +
> +	fn = symbol_get(vfio_external_user_iommu_id);
> +	if (!fn)
> +		return ret;
> +
> +	ret = fn(vfio_group);
> +
> +	symbol_put(vfio_external_user_iommu_id);
> +
> +	return ret;
> +}
>  
>  static unsigned long kvmppc_tce_pages(unsigned long iommu_pages)
>  {
> @@ -90,6 +124,130 @@ static long kvmppc_account_memlimit(unsigned long stt_pages, bool inc)
>  	return ret;
>  }
>  
> +static void kvm_spapr_tce_iommu_table_free(struct rcu_head *head)
> +{
> +	struct kvmppc_spapr_tce_iommu_table *stit = container_of(head,
> +			struct kvmppc_spapr_tce_iommu_table, rcu);
> +
> +	iommu_table_put(stit->tbl);
> +	kvm_vfio_group_put_external_user(stit->group);
> +
> +	kfree(stit);
> +}
> +
> +static void kvm_spapr_tce_liobn_release_iommu_group(
> +		struct kvmppc_spapr_tce_table *stt,
> +		struct vfio_group *group)
> +{
> +	struct kvmppc_spapr_tce_iommu_table *stit, *tmp;
> +
> +	list_for_each_entry_safe(stit, tmp, &stt->iommu_tables, next) {
> +		if (group && (stit->group != group))
> +			continue;
> +
> +		list_del_rcu(&stit->next);
> +
> +		call_rcu(&stit->rcu, kvm_spapr_tce_iommu_table_free);
> +	}
> +}
> +
> +extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
> +		struct vfio_group *group)
> +{
> +	struct kvmppc_spapr_tce_table *stt;
> +
> +	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list)
> +		kvm_spapr_tce_liobn_release_iommu_group(stt, group);
> +}
> +
> +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
> +		struct vfio_group *group)
> +{
> +	struct kvmppc_spapr_tce_table *stt = NULL;
> +	bool found = false;
> +	struct iommu_table *tbl = NULL;
> +	struct iommu_table_group *table_group;
> +	long i, ret = 0;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
> +	struct fd f;
> +	int group_id;
> +	struct iommu_group *grp;
> +
> +	group_id = kvm_vfio_external_user_iommu_id(group);
> +	grp = iommu_group_get_by_id(group_id);
> +	if (WARN_ON(!grp))
> +		return -EIO;

I think it would be nicer to have a function that goes directly from
vfio_group to iommu_group rather than going via id.  That can be a
later cleanup, though.

> +
> +	f = fdget(tablefd);
> +	if (!f.file) {
> +		ret = -EBADF;
> +		goto put_exit;
> +	}
> +
> +	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list) {
> +		if (stt == f.file->private_data) {
> +			found = true;
> +			break;
> +		}
> +	}
> +
> +	fdput(f);
> +
> +	if (!found) {
> +		ret = -EINVAL;
> +		goto put_exit;
> +	}
> +
> +	table_group = iommu_group_get_iommudata(grp);
> +	if (WARN_ON(!table_group)) {
> +		ret = -EFAULT;
> +		goto put_exit;
> +	}
> +
> +	for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
> +		struct iommu_table *tbltmp = table_group->tables[i];
> +
> +		if (!tbltmp)
> +			continue;
> +
> +		/*
> +		 * Make sure hardware table parameters are exactly the same;
> +		 * this is used in the TCE handlers where boundary checks
> +		 * use only the first attached table.
> +		 */
> +		if ((tbltmp->it_page_shift == stt->page_shift) &&
> +				(tbltmp->it_offset == stt->offset) &&
> +				(tbltmp->it_size == stt->size)) {
> +			tbl = tbltmp;
> +			break;
> +		}
> +	}
> +	if (!tbl) {
> +		ret = -EINVAL;
> +		goto put_exit;
> +	}
> +
> +	list_for_each_entry_rcu(stit, &stt->iommu_tables, next) {
> +		if ((stit->tbl == tbl) && (stit->group == group)) {
> +			ret = -EBUSY;
> +			goto put_exit;
> +		}
> +	}
> +
> +	iommu_table_get(tbl);
> +
> +	stit = kzalloc(sizeof(*stit), GFP_KERNEL);
> +	stit->tbl = tbl;
> +	stit->group = group;
> +
> +	list_add_rcu(&stit->next, &stt->iommu_tables);
> +
> +put_exit:
> +	iommu_group_put(grp);
> +
> +	return ret;
> +}
> +
>  static void release_spapr_tce_table(struct rcu_head *head)
>  {
>  	struct kvmppc_spapr_tce_table *stt = container_of(head,
> @@ -132,6 +290,8 @@ static int kvm_spapr_tce_release(struct inode *inode, struct file *filp)
>  
>  	list_del_rcu(&stt->list);
>  
> +	kvm_spapr_tce_liobn_release_iommu_group(stt, NULL /* release all */);
> +
>  	kvm_put_kvm(stt->kvm);
>  
>  	kvmppc_account_memlimit(
> @@ -181,6 +341,7 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>  	stt->offset = args->offset;
>  	stt->size = size;
>  	stt->kvm = kvm;
> +	INIT_LIST_HEAD_RCU(&stt->iommu_tables);
>  
>  	for (i = 0; i < npages; i++) {
>  		stt->pages[i] = alloc_page(GFP_KERNEL | __GFP_ZERO);
> @@ -209,11 +370,102 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>  	return ret;
>  }
>  
> +static void kvmppc_clear_tce(struct iommu_table *tbl, unsigned long entry)
> +{
> +	unsigned long hpa = 0;
> +	enum dma_data_direction dir = DMA_NONE;
> +
> +	iommu_tce_xchg(tbl, entry, &hpa, &dir);
> +}
> +
> +static long kvmppc_tce_iommu_mapped_dec(struct kvm *kvm,
> +		struct iommu_table *tbl, unsigned long entry)
> +{
> +	struct mm_iommu_table_group_mem_t *mem = NULL;

Initialization is unnecessary for the variable above.

> +	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> +
> +	if (WARN_ON(!pua))
> +		return H_HARDWARE;
> +
> +	mem = mm_iommu_lookup(kvm->mm, *pua, pgsize);
> +	if (!mem)
> +		return H_TOO_HARD;
> +
> +	mm_iommu_mapped_dec(mem);
> +
> +	*pua = 0;
> +
> +	return H_SUCCESS;
> +}
> +
> +static long kvmppc_tce_iommu_unmap(struct kvm *kvm,
> +		struct iommu_table *tbl, unsigned long entry)
> +{
> +	enum dma_data_direction dir = DMA_NONE;
> +	unsigned long hpa = 0;
> +	long ret;
> +
> +	if (iommu_tce_xchg(tbl, entry, &hpa, &dir))
> +		return H_HARDWARE;
> +
> +	if (dir == DMA_NONE)
> +		return H_SUCCESS;
> +
> +	ret = kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
> +	if (ret != H_SUCCESS)
> +		iommu_tce_xchg(tbl, entry, &hpa, &dir);
> +
> +	return ret;
> +}
> +
> +long kvmppc_tce_iommu_map(struct kvm *kvm, struct iommu_table *tbl,
> +		unsigned long entry, unsigned long gpa,
> +		enum dma_data_direction dir)
> +{
> +	long ret;
> +	unsigned long hpa, ua, *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> +	struct mm_iommu_table_group_mem_t *mem;
> +
> +	if (!pua)
> +		/* it_userspace allocation might be delayed */
> +		return H_TOO_HARD;
> +
> +	if (kvmppc_gpa_to_ua(kvm, gpa, &ua, NULL))
> +		return H_PARAMETER;
> +
> +	mem = mm_iommu_lookup(kvm->mm, ua, 1ULL << tbl->it_page_shift);
> +	if (!mem)
> +		return H_TOO_HARD;
> +
> +	if (WARN_ON(mm_iommu_ua_to_hpa(mem, ua, &hpa)))
> +		return H_HARDWARE;

Remind me - could a failure here be triggered by userspace failing to
preregister all the memory it should?

> +	if (mm_iommu_mapped_inc(mem))
> +		return H_CLOSED;
> +
> +	ret = iommu_tce_xchg(tbl, entry, &hpa, &dir);
> +	if (ret) {
> +		mm_iommu_mapped_dec(mem);
> +		return H_TOO_HARD;
> +	}
> +
> +	if (dir != DMA_NONE)
> +		kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
> +
> +	*pua = ua;
> +
> +	return 0;
> +}
> +
>  long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  		      unsigned long ioba, unsigned long tce)
>  {
>  	struct kvmppc_spapr_tce_table *stt;
> -	long ret;
> +	long ret, idx;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
> +	unsigned long entry, gpa;
> +	enum dma_data_direction dir;
>  
>  	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
>  	/* 	    liobn, ioba, tce); */
> @@ -230,7 +482,28 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  	if (ret != H_SUCCESS)
>  		return ret;
>  
> -	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
> +	entry = ioba >> stt->page_shift;
> +	gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> +	dir = iommu_tce_direction(tce);
> +
> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> +		if (dir == DMA_NONE) {
> +			ret = kvmppc_tce_iommu_unmap(vcpu->kvm,
> +					stit->tbl, entry);
> +		} else {
> +			idx = srcu_read_lock(&vcpu->kvm->srcu);
> +			ret = kvmppc_tce_iommu_map(vcpu->kvm, stit->tbl,
> +					entry, gpa, dir);
> +			srcu_read_unlock(&vcpu->kvm->srcu, idx);
> +		}
> +
> +		if (WARN_ON_ONCE(ret == H_HARDWARE))
> +			kvmppc_clear_tce(stit->tbl, entry);
> +		else if (ret != H_SUCCESS)
> +			return ret;
> +	}
> +
> +	kvmppc_tce_put(stt, entry, tce);
>  
>  	return H_SUCCESS;
>  }
> @@ -242,9 +515,10 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  {
>  	struct kvmppc_spapr_tce_table *stt;
>  	long i, ret = H_SUCCESS, idx;
> -	unsigned long entry, ua = 0;
> +	unsigned long entry, ua = 0, gpa;
>  	u64 __user *tces;
>  	u64 tce;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
>  
>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>  	if (!stt)
> @@ -283,6 +557,18 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  		if (ret != H_SUCCESS)
>  			goto unlock_exit;
>  
> +		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> +
> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> +			ret = kvmppc_tce_iommu_map(vcpu->kvm,
> +					stit->tbl, entry + i, gpa,
> +					iommu_tce_direction(tce));
> +			if (WARN_ON_ONCE(ret == H_HARDWARE))
> +				kvmppc_clear_tce(stit->tbl, entry);
> +			else if (ret != H_SUCCESS)
> +				goto unlock_exit;
> +		}
> +
>  		kvmppc_tce_put(stt, entry + i, tce);
>  	}
>  
> @@ -299,6 +585,7 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
>  {
>  	struct kvmppc_spapr_tce_table *stt;
>  	long i, ret;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
>  
>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>  	if (!stt)
> @@ -312,6 +599,20 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
>  	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
>  		return H_PARAMETER;
>  
> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> +		unsigned long entry = ioba >> stit->tbl->it_page_shift;
> +
> +		for (i = 0; i < npages; ++i) {
> +			ret = kvmppc_tce_iommu_unmap(vcpu->kvm,
> +					stit->tbl, entry + i);
> +
> +			if (WARN_ON_ONCE(ret == H_HARDWARE))
> +				kvmppc_clear_tce(stit->tbl, entry);
> +			else if (ret != H_SUCCESS)
> +				return ret;
> +		}
> +	}
> +
>  	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
>  		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
>  
> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
> index 92d769f4eaea..4a1d978ebd98 100644
> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
> @@ -161,11 +161,111 @@ long kvmppc_gpa_to_ua(struct kvm *kvm, unsigned long gpa,
>  EXPORT_SYMBOL_GPL(kvmppc_gpa_to_ua);
>  
>  #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
> +static void kvmppc_rm_clear_tce(struct iommu_table *tbl, unsigned long entry)
> +{
> +	unsigned long hpa = 0;
> +	enum dma_data_direction dir = DMA_NONE;
> +
> +	iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
> +}
> +
> +static long kvmppc_rm_tce_iommu_mapped_dec(struct kvm *kvm,
> +		struct iommu_table *tbl, unsigned long entry)
> +{
> +	struct mm_iommu_table_group_mem_t *mem = NULL;
> +	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> +
> +	if (WARN_ON(!pua))
> +		return H_HARDWARE;
> +
> +	pua = (void *) vmalloc_to_phys(pua);
> +	if (!pua)
> +		return H_TOO_HARD;
> +
> +	mem = mm_iommu_lookup_rm(kvm->mm, *pua, pgsize);
> +	if (!mem)
> +		return H_TOO_HARD;
> +
> +	mm_iommu_mapped_dec(mem);
> +
> +	*pua = 0;
> +
> +	return H_SUCCESS;
> +}
> +
> +static long kvmppc_rm_tce_iommu_unmap(struct kvm *kvm,
> +		struct iommu_table *tbl, unsigned long entry)
> +{
> +	enum dma_data_direction dir = DMA_NONE;
> +	unsigned long hpa = 0;
> +	long ret;
> +
> +	if (iommu_tce_xchg_rm(tbl, entry, &hpa, &dir))
> +		return H_HARDWARE;

To avoid a double WARN() (and to make the warnings easier to
understand) I'd suggest putting a WARN_ON() here, rather than in the
callers when they receive an H_HARDWARE.  IIUC this really shouldn't
ever happen, and it certainly can't be the guest's fault?
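
To make the suggestion concrete, here is a user-space model of warning at
the point of failure instead of in each caller.  It is only a sketch: all
sketch_* names are hypothetical stand-ins (sketch_warn_on() for WARN_ON(),
sketch_xchg() for iommu_tce_xchg_rm()), and the caller merely counts the
entries it would clear.

```c
#include <stdio.h>

enum { SKW_SUCCESS = 0, SKW_HARDWARE = 1 };

int sketch_warn_count;	/* number of warnings emitted so far */

/* Stand-in for WARN_ON(): report, count and return the condition. */
int sketch_warn_on(int cond, const char *where)
{
	if (cond) {
		sketch_warn_count++;
		fprintf(stderr, "WARNING at %s\n", where);
	}
	return cond;
}

/* Stand-in for the TCE exchange: fails only when asked to. */
int sketch_xchg(int should_fail)
{
	return should_fail ? -1 : 0;
}

/* The unmap helper warns exactly where the failure is detected. */
long sketch_unmap(int should_fail)
{
	if (sketch_warn_on(sketch_xchg(should_fail) != 0, "sketch_unmap"))
		return SKW_HARDWARE;
	return SKW_SUCCESS;
}

/*
 * The caller does not warn a second time; it only performs the recovery
 * step.  Returns how many entries it would have cleared.
 */
int sketch_stuff(const int *fail, int npages)
{
	int i, cleared = 0;

	for (i = 0; i < npages; ++i)
		if (sketch_unmap(fail[i]) == SKW_HARDWARE)
			cleared++;	/* the clear-TCE call would go here */

	return cleared;
}
```

With the warning inside the helper, every failure is reported once and the
message names the function that actually saw it.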

> +
> +	if (dir == DMA_NONE)
> +		return H_SUCCESS;
> +
> +	ret = kvmppc_rm_tce_iommu_mapped_dec(kvm, tbl, entry);
> +	if (ret)
> +		iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
> +
> +	return ret;
> +}
> +
> +static long kvmppc_rm_tce_iommu_map(struct kvm *kvm, struct iommu_table *tbl,
> +		unsigned long entry, unsigned long gpa,
> +		enum dma_data_direction dir)
> +{
> +	long ret;
> +	unsigned long hpa = 0, ua;
> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> +	struct mm_iommu_table_group_mem_t *mem;
> +
> +	if (!pua)
> +		/* it_userspace allocation might be delayed */
> +		return H_TOO_HARD;
> +
> +	if (kvmppc_gpa_to_ua(kvm, gpa, &ua, NULL))
> +		return H_PARAMETER;
> +
> +	mem = mm_iommu_lookup_rm(kvm->mm, ua, 1ULL << tbl->it_page_shift);
> +	if (!mem)
> +		return H_TOO_HARD;
> +
> +	if (WARN_ON(mm_iommu_ua_to_hpa_rm(mem, ua, &hpa)))
> +		return H_HARDWARE;
> +
> +	pua = (void *) vmalloc_to_phys(pua);
> +	if (WARN_ON(!pua))
> +		return H_HARDWARE;
> +
> +	if (WARN_ON(mm_iommu_mapped_inc(mem)))
> +		return H_CLOSED;
> +
> +	ret = iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
> +	if (ret) {
> +		mm_iommu_mapped_dec(mem);
> +		return H_TOO_HARD;
> +	}
> +
> +	if (dir != DMA_NONE)
> +		kvmppc_rm_tce_iommu_mapped_dec(kvm, tbl, entry);
> +
> +	*pua = ua;
> +
> +	return 0;
> +}
> +
>  long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  		unsigned long ioba, unsigned long tce)
>  {
>  	struct kvmppc_spapr_tce_table *stt;
>  	long ret;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
> +	unsigned long entry, gpa;
> +	enum dma_data_direction dir;
>  
>  	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
>  	/* 	    liobn, ioba, tce); */
> @@ -182,7 +282,25 @@ long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  	if (ret != H_SUCCESS)
>  		return ret;
>  
> -	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
> +	entry = ioba >> stt->page_shift;
> +	gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> +	dir = iommu_tce_direction(tce);
> +
> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> +		if (dir == DMA_NONE)
> +			ret = kvmppc_rm_tce_iommu_unmap(vcpu->kvm,
> +					stit->tbl, entry);
> +		else
> +			ret = kvmppc_rm_tce_iommu_map(vcpu->kvm,
> +					stit->tbl, entry, gpa, dir);
> +
> +		if (WARN_ON_ONCE(ret == H_HARDWARE))
> +			kvmppc_rm_clear_tce(stit->tbl, entry);
> +		else if (ret != H_SUCCESS)
> +			return ret;
> +	}
> +
> +	kvmppc_tce_put(stt, entry, tce);
>  
>  	return H_SUCCESS;
>  }
> @@ -220,9 +338,10 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  {
>  	struct kvmppc_spapr_tce_table *stt;
>  	long i, ret = H_SUCCESS;
> -	unsigned long tces, entry, ua = 0;
> +	unsigned long tces, entry, ua = 0, tce, gpa;
>  	unsigned long *rmap = NULL;
>  	bool prereg = false;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
>  
>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>  	if (!stt)
> @@ -287,12 +406,24 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  	}
>  
>  	for (i = 0; i < npages; ++i) {
> -		unsigned long tce = be64_to_cpu(((u64 *)tces)[i]);
> +		tce = be64_to_cpu(((u64 *)tces)[i]);
>  
>  		ret = kvmppc_tce_validate(stt, tce);
>  		if (ret != H_SUCCESS)
>  			goto unlock_exit;
>  
> +		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> +
> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> +			ret = kvmppc_rm_tce_iommu_map(vcpu->kvm,
> +					stit->tbl, entry + i, gpa,
> +					iommu_tce_direction(tce));
> +			if (WARN_ON_ONCE(ret == H_HARDWARE))

I don't think you need the WARN() here - the only H_HARDWARE failure
path in iommu_map() already includes a WARN().

> +				kvmppc_rm_clear_tce(stit->tbl, entry);
> +			else if (ret != H_SUCCESS)
> +				goto unlock_exit;

It's also not clear to me why the H_HARDWARE error path clears the
entry, but the other failure paths don't.  Or why an H_HARDWARE will
result in continuing to set the rest of the TCEs, but other failures
won't.
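
To spell out the asymmetry I mean, here is a user-space model of the loop
over attached tables as I read it in the H_PUT_TCE path.  This is a sketch
only: the sketch_* names and SKE_* codes are hypothetical, and the
per-table results are passed in instead of being produced by real map
calls.

```c
enum { SKE_SUCCESS = 0, SKE_HARDWARE = 1, SKE_TOO_HARD = 2 };

struct sketch_result {
	long ret;	/* value handed back to the caller */
	int updated;	/* tables whose entry was updated */
	int cleared;	/* tables whose entry was cleared after a hardware error */
	int visited;	/* tables looked at before the loop ended */
};

/*
 * per_table_ret[i] is what the map/unmap helper would return for the
 * i-th attached hardware table.
 */
struct sketch_result sketch_put_tce(const long *per_table_ret, int ntables)
{
	struct sketch_result res = { SKE_SUCCESS, 0, 0, 0 };
	int i;

	for (i = 0; i < ntables; ++i) {
		long ret = per_table_ret[i];

		res.visited++;

		if (ret == SKE_HARDWARE) {
			/* Hardware error: clear the entry and keep going. */
			res.cleared++;
		} else if (ret != SKE_SUCCESS) {
			/* Any other error: stop, earlier tables stay updated. */
			res.ret = ret;
			return res;
		} else {
			res.updated++;
		}
	}

	return res;
}
```

For per-table results { success, hardware error, success } the model
visits all three tables and reports success, while for { success, too
hard, success } it stops at the second table and leaves the first one
already updated.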

> +		}
> +
>  		kvmppc_tce_put(stt, entry + i, tce);
>  	}
>  
> @@ -309,6 +440,7 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
>  {
>  	struct kvmppc_spapr_tce_table *stt;
>  	long i, ret;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
>  
>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>  	if (!stt)
> @@ -322,6 +454,20 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
>  	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
>  		return H_PARAMETER;
>  
> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> +		unsigned long entry = ioba >> stit->tbl->it_page_shift;
> +
> +		for (i = 0; i < npages; ++i) {
> +			ret = kvmppc_rm_tce_iommu_unmap(vcpu->kvm,
> +					stit->tbl, entry + i);
> +
> +			if (WARN_ON_ONCE(ret == H_HARDWARE))
> +				kvmppc_rm_clear_tce(stit->tbl, entry);

As noted earlier, I think this WARN belongs within iommu_unmap()
rather than out here.

> +			else if (ret != H_SUCCESS)
> +				return ret;
> +		}
> +	}
> +
>  	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
>  		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
>  
> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> index cd892dec7cb6..f3127dc87912 100644
> --- a/arch/powerpc/kvm/powerpc.c
> +++ b/arch/powerpc/kvm/powerpc.c
> @@ -536,6 +536,8 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>  #ifdef CONFIG_PPC_BOOK3S_64
>  	case KVM_CAP_SPAPR_TCE:
>  	case KVM_CAP_SPAPR_TCE_64:
> +		/* fallthrough */
> +	case KVM_CAP_SPAPR_TCE_VFIO:
>  	case KVM_CAP_PPC_RTAS:
>  	case KVM_CAP_PPC_FIXUP_HCALL:
>  	case KVM_CAP_PPC_ENABLE_HCALL:
> diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
> index d32f239eb471..2b7dc22265fe 100644
> --- a/virt/kvm/vfio.c
> +++ b/virt/kvm/vfio.c
> @@ -20,6 +20,10 @@
>  #include <linux/vfio.h>
>  #include "vfio.h"
>  
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> +#include <asm/kvm_ppc.h>
> +#endif
> +
>  struct kvm_vfio_group {
>  	struct list_head node;
>  	struct vfio_group *vfio_group;
> @@ -211,6 +215,9 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
>  
>  		mutex_unlock(&kv->lock);
>  
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> +		kvm_spapr_tce_release_iommu_group(dev->kvm, vfio_group);
> +#endif
>  		kvm_vfio_group_set_kvm(vfio_group, NULL);
>  
>  		kvm_vfio_group_put_external_user(vfio_group);
> @@ -218,6 +225,53 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
>  		kvm_vfio_update_coherency(dev);
>  
>  		return ret;
> +
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> +	case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: {
> +		struct kvm_vfio_spapr_tce param;
> +		unsigned long minsz;
> +		struct kvm_vfio *kv = dev->private;
> +		struct vfio_group *vfio_group;
> +		struct kvm_vfio_group *kvg;
> +		struct fd f;
> +
> +		minsz = offsetofend(struct kvm_vfio_spapr_tce, tablefd);
> +
> +		if (copy_from_user(&param, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (param.argsz < minsz || param.flags)
> +			return -EINVAL;
> +
> +		f = fdget(param.groupfd);
> +		if (!f.file)
> +			return -EBADF;
> +
> +		vfio_group = kvm_vfio_group_get_external_user(f.file);
> +		fdput(f);
> +
> +		if (IS_ERR(vfio_group))
> +			return PTR_ERR(vfio_group);
> +
> +		ret = -ENOENT;
> +
> +		mutex_lock(&kv->lock);
> +
> +		list_for_each_entry(kvg, &kv->group_list, node) {
> +			if (kvg->vfio_group != vfio_group)
> +				continue;
> +
> +			ret = kvm_spapr_tce_attach_iommu_group(dev->kvm,
> +					param.tablefd, vfio_group);
> +
> +			break;
> +		}
> +
> +		mutex_unlock(&kv->lock);
> +
> +		return ret;
> +	}
> +#endif /* CONFIG_SPAPR_TCE_IOMMU */
>  	}
>  
>  	return -ENXIO;
> @@ -242,6 +296,9 @@ static int kvm_vfio_has_attr(struct kvm_device *dev,
>  		switch (attr->attr) {
>  		case KVM_DEV_VFIO_GROUP_ADD:
>  		case KVM_DEV_VFIO_GROUP_DEL:
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> +		case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE:
> +#endif
>  			return 0;
>  		}
>  
> @@ -257,6 +314,9 @@ static void kvm_vfio_destroy(struct kvm_device *dev)
>  	struct kvm_vfio_group *kvg, *tmp;
>  
>  	list_for_each_entry_safe(kvg, tmp, &kv->group_list, node) {
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> +		kvm_spapr_tce_release_iommu_group(dev->kvm, kvg->vfio_group);
> +#endif
>  		kvm_vfio_group_set_kvm(kvg->vfio_group, NULL);
>  		kvm_vfio_group_put_external_user(kvg->vfio_group);
>  		list_del(&kvg->node);
Alexey Kardashevskiy Feb. 24, 2017, 3:29 a.m. | #2
On 24/02/17 13:14, David Gibson wrote:
> On Wed, Feb 22, 2017 at 07:21:33PM +1100, Alexey Kardashevskiy wrote:
>> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
>> and H_STUFF_TCE requests targeted at an IOMMU TCE table used for VFIO
>> without passing them to user space which saves time on switching
>> to user space and back.
>>
>> This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
>> KVM tries to handle a TCE request in real mode; if that fails,
>> it passes the request to virtual mode to complete the operation.
>> If the virtual mode handler fails, the request is passed to
>> user space; this is not expected to happen though.
>>
>> To avoid dealing with page use counters (which is tricky in real mode),
>> this only accelerates SPAPR TCE IOMMU v2 clients which are required
>> to pre-register the userspace memory. The very first TCE request will
>> be handled in the VFIO SPAPR TCE driver anyway as the userspace view
>> of the TCE table (iommu_table::it_userspace) is not allocated till
>> the very first mapping happens and we cannot call vmalloc in real mode.
>>
>> If we fail to update a hardware IOMMU table for an unexpected reason, we just
>> clear it and move on as there is nothing really we can do about it -
>> for example, if we hot plug a VFIO device to a guest, existing TCE tables
>> will be mirrored automatically to the hardware and there is no interface
>> to report to the guest about possible failures.
>>
>> This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
>> the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
>> and associates a physical IOMMU table with the SPAPR TCE table (which
>> is a guest view of the hardware IOMMU table). The iommu_table object
>> is cached and referenced so we do not have to look up for it in real mode.
>>
>> This does not implement the UNSET counterpart as there is no use for it -
>> once the acceleration is enabled, the existing userspace won't
>> disable it unless a VFIO container is destroyed; this adds necessary
>> cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
>>
>> As this creates a descriptor per IOMMU table-LIOBN couple (called
>> kvmppc_spapr_tce_iommu_table), it is possible to have several
>> descriptors with the same iommu_table (hardware IOMMU table) attached
>> to the same LIOBN; we do not remove duplicates though as
>> iommu_table_ops::exchange does not just update a TCE entry (which is
>> shared among IOMMU groups) but also invalidates the TCE cache
>> (one per IOMMU group).
>>
>> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
>> space.
>>
>> This finally makes use of vfio_external_user_iommu_id() which was
>> introduced quite some time ago and was considered for removal.
>>
>> Tests show that this patch increases transmission speed from 220MB/s
>> to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> 
> I have some comments on this patch, but all the definite ones are
> pretty minor and could be done as later cleanups.
> 
> I have some more serious queries, but they are just queries and
> requests for clarification.  If there are satisfactory answers to
> them, I'll add my R-b.
> 
> 
> ---
>> Changes:
>> v5:
>> * changed error codes in multiple places
>> * added bunch of WARN_ON() in places which should not really happen
>> * added a check that an iommu table is not attached already to LIOBN
>> * dropped explicit calls to iommu_tce_clear_param_check/
>> iommu_tce_put_param_check as kvmppc_tce_validate/kvmppc_ioba_validate
>> call them anyway (since the previous patch)
>> * if we fail to update a hardware IOMMU table for unexpected reason,
>> this just clears the entry
>>
>> v4:
>> * added note to the commit log about allowing multiple updates of
>> the same IOMMU table;
>> * instead of checking for if any memory was preregistered, this
>> returns H_TOO_HARD if a specific page was not;
>> * fixed comments from v3 about error handling in many places;
>> * simplified TCE handlers and merged IOMMU parts inline - for example,
>> there used to be kvmppc_h_put_tce_iommu(), now it is merged into
>> kvmppc_h_put_tce(); this allows to check IOBA boundaries against
>> the first attached table only (makes the code simpler);
>>
>> v3:
>> * simplified not to use VFIO group notifiers
>> * reworked cleanup, should be cleaner/simpler now
>>
>> v2:
>> * reworked to use new VFIO notifiers
>> * now same iommu_table may appear in the list several times, to be fixed later
>> ---
>>  Documentation/virtual/kvm/devices/vfio.txt |  22 ++-
>>  arch/powerpc/include/asm/kvm_host.h        |   8 +
>>  arch/powerpc/include/asm/kvm_ppc.h         |   4 +
>>  include/uapi/linux/kvm.h                   |   8 +
>>  arch/powerpc/kvm/book3s_64_vio.c           | 307 ++++++++++++++++++++++++++++-
>>  arch/powerpc/kvm/book3s_64_vio_hv.c        | 152 +++++++++++++-
>>  arch/powerpc/kvm/powerpc.c                 |   2 +
>>  virt/kvm/vfio.c                            |  60 ++++++
>>  8 files changed, 555 insertions(+), 8 deletions(-)
>>
>> diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
>> index ef51740c67ca..f95d867168ea 100644
>> --- a/Documentation/virtual/kvm/devices/vfio.txt
>> +++ b/Documentation/virtual/kvm/devices/vfio.txt
>> @@ -16,7 +16,25 @@ Groups:
>>  
>>  KVM_DEV_VFIO_GROUP attributes:
>>    KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
>> +	kvm_device_attr.addr points to an int32_t file descriptor
>> +	for the VFIO group.
>>    KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
>> +	kvm_device_attr.addr points to an int32_t file descriptor
>> +	for the VFIO group.
>> +  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: attaches a guest visible TCE table
>> +	allocated by sPAPR KVM.
>> +	kvm_device_attr.addr points to a struct:
>>  
>> -For each, kvm_device_attr.addr points to an int32_t file descriptor
>> -for the VFIO group.
>> +	struct kvm_vfio_spapr_tce {
>> +		__u32	argsz;
>> +		__u32	flags;
>> +		__s32	groupfd;
>> +		__s32	tablefd;
>> +	};
>> +
>> +	where
>> +	@argsz is the size of kvm_vfio_spapr_tce_liobn;
>> +	@flags are not supported now, must be zero;
>> +	@groupfd is a file descriptor for a VFIO group;
>> +	@tablefd is a file descriptor for a TCE table allocated via
>> +		KVM_CREATE_SPAPR_TCE.
>> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
>> index e59b172666cd..a827006941f8 100644
>> --- a/arch/powerpc/include/asm/kvm_host.h
>> +++ b/arch/powerpc/include/asm/kvm_host.h
>> @@ -191,6 +191,13 @@ struct kvmppc_pginfo {
>>  	atomic_t refcnt;
>>  };
>>  
>> +struct kvmppc_spapr_tce_iommu_table {
>> +	struct rcu_head rcu;
>> +	struct list_head next;
>> +	struct vfio_group *group;
>> +	struct iommu_table *tbl;
>> +};
>> +
>>  struct kvmppc_spapr_tce_table {
>>  	struct list_head list;
>>  	struct kvm *kvm;
>> @@ -199,6 +206,7 @@ struct kvmppc_spapr_tce_table {
>>  	u32 page_shift;
>>  	u64 offset;		/* in pages */
>>  	u64 size;		/* window size in pages */
>> +	struct list_head iommu_tables;
>>  	struct page *pages[0];
>>  };
>>  
>> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
>> index e04b7fb8ccaa..b8a39dec92cf 100644
>> --- a/arch/powerpc/include/asm/kvm_ppc.h
>> +++ b/arch/powerpc/include/asm/kvm_ppc.h
>> @@ -163,6 +163,10 @@ extern long kvmppc_prepare_vrma(struct kvm *kvm,
>>  extern void kvmppc_map_vrma(struct kvm_vcpu *vcpu,
>>  			struct kvm_memory_slot *memslot, unsigned long porder);
>>  extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
>> +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
>> +		struct vfio_group *group);
>> +extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
>> +		struct vfio_group *group);
>>  
>>  extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>>  				struct kvm_create_spapr_tce_64 *args);
>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>> index a2c9bb5a0ead..cdfa01169bd2 100644
>> --- a/include/uapi/linux/kvm.h
>> +++ b/include/uapi/linux/kvm.h
>> @@ -1076,6 +1076,7 @@ struct kvm_device_attr {
>>  #define  KVM_DEV_VFIO_GROUP			1
>>  #define   KVM_DEV_VFIO_GROUP_ADD			1
>>  #define   KVM_DEV_VFIO_GROUP_DEL			2
>> +#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE		3
>>  
>>  enum kvm_device_type {
>>  	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
>> @@ -1097,6 +1098,13 @@ enum kvm_device_type {
>>  	KVM_DEV_TYPE_MAX,
>>  };
>>  
>> +struct kvm_vfio_spapr_tce {
>> +	__u32	argsz;
>> +	__u32	flags;
>> +	__s32	groupfd;
>> +	__s32	tablefd;
>> +};
>> +
>>  /*
>>   * ioctls for VM fds
>>   */
>> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
>> index 15df8ae627d9..062407af09ee 100644
>> --- a/arch/powerpc/kvm/book3s_64_vio.c
>> +++ b/arch/powerpc/kvm/book3s_64_vio.c
>> @@ -27,6 +27,10 @@
>>  #include <linux/hugetlb.h>
>>  #include <linux/list.h>
>>  #include <linux/anon_inodes.h>
>> +#include <linux/iommu.h>
>> +#include <linux/file.h>
>> +#include <linux/vfio.h>
>> +#include <linux/module.h>
>>  
>>  #include <asm/tlbflush.h>
>>  #include <asm/kvm_ppc.h>
>> @@ -39,6 +43,36 @@
>>  #include <asm/udbg.h>
>>  #include <asm/iommu.h>
>>  #include <asm/tce.h>
>> +#include <asm/mmu_context.h>
>> +
>> +static void kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
>> +{
>> +	void (*fn)(struct vfio_group *);
>> +
>> +	fn = symbol_get(vfio_group_put_external_user);
>> +	if (WARN_ON(!fn))
>> +		return;
>> +
>> +	fn(vfio_group);
>> +
>> +	symbol_put(vfio_group_put_external_user);
>> +}
>> +
>> +static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
>> +{
>> +	int (*fn)(struct vfio_group *);
>> +	int ret = -1;
>> +
>> +	fn = symbol_get(vfio_external_user_iommu_id);
>> +	if (!fn)
>> +		return ret;
>> +
>> +	ret = fn(vfio_group);
>> +
>> +	symbol_put(vfio_external_user_iommu_id);
>> +
>> +	return ret;
>> +}
>>  
>>  static unsigned long kvmppc_tce_pages(unsigned long iommu_pages)
>>  {
>> @@ -90,6 +124,130 @@ static long kvmppc_account_memlimit(unsigned long stt_pages, bool inc)
>>  	return ret;
>>  }
>>  
>> +static void kvm_spapr_tce_iommu_table_free(struct rcu_head *head)
>> +{
>> +	struct kvmppc_spapr_tce_iommu_table *stit = container_of(head,
>> +			struct kvmppc_spapr_tce_iommu_table, rcu);
>> +
>> +	iommu_table_put(stit->tbl);
>> +	kvm_vfio_group_put_external_user(stit->group);
>> +
>> +	kfree(stit);
>> +}
>> +
>> +static void kvm_spapr_tce_liobn_release_iommu_group(
>> +		struct kvmppc_spapr_tce_table *stt,
>> +		struct vfio_group *group)
>> +{
>> +	struct kvmppc_spapr_tce_iommu_table *stit, *tmp;
>> +
>> +	list_for_each_entry_safe(stit, tmp, &stt->iommu_tables, next) {
>> +		if (group && (stit->group != group))
>> +			continue;
>> +
>> +		list_del_rcu(&stit->next);
>> +
>> +		call_rcu(&stit->rcu, kvm_spapr_tce_iommu_table_free);
>> +	}
>> +}
>> +
>> +extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
>> +		struct vfio_group *group)
>> +{
>> +	struct kvmppc_spapr_tce_table *stt;
>> +
>> +	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list)
>> +		kvm_spapr_tce_liobn_release_iommu_group(stt, group);
>> +}
>> +
>> +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
>> +		struct vfio_group *group)
>> +{
>> +	struct kvmppc_spapr_tce_table *stt = NULL;
>> +	bool found = false;
>> +	struct iommu_table *tbl = NULL;
>> +	struct iommu_table_group *table_group;
>> +	long i, ret = 0;
>> +	struct kvmppc_spapr_tce_iommu_table *stit;
>> +	struct fd f;
>> +	int group_id;
>> +	struct iommu_group *grp;
>> +
>> +	group_id = kvm_vfio_external_user_iommu_id(group);
>> +	grp = iommu_group_get_by_id(group_id);
>> +	if (WARN_ON(!grp))
>> +		return -EIO;
> 
> I think it would be nicer to have a function that goes directly from
> vfio_group to iommu_group rather than going via id.  That can be a
> later cleanup, though.
> 
>> +
>> +	f = fdget(tablefd);
>> +	if (!f.file) {
>> +		ret = -EBADF;
>> +		goto put_exit;
>> +	}
>> +
>> +	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list) {
>> +		if (stt == f.file->private_data) {
>> +			found = true;
>> +			break;
>> +		}
>> +	}
>> +
>> +	fdput(f);
>> +
>> +	if (!found) {
>> +		ret = -EINVAL;
>> +		goto put_exit;
>> +	}
>> +
>> +	table_group = iommu_group_get_iommudata(grp);
>> +	if (WARN_ON(!table_group)) {
>> +		ret = -EFAULT;
>> +		goto put_exit;
>> +	}
>> +
>> +	for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
>> +		struct iommu_table *tbltmp = table_group->tables[i];
>> +
>> +		if (!tbltmp)
>> +			continue;
>> +
>> +		/*
>> +		 * Make sure hardware table parameters are exactly the same;
>> +		 * this is used in the TCE handlers where boundary checks
>> +		 * use only the first attached table.
>> +		 */
>> +		if ((tbltmp->it_page_shift == stt->page_shift) &&
>> +				(tbltmp->it_offset == stt->offset) &&
>> +				(tbltmp->it_size == stt->size)) {
>> +			tbl = tbltmp;
>> +			break;
>> +		}
>> +	}
>> +	if (!tbl) {
>> +		ret = -EINVAL;
>> +		goto put_exit;
>> +	}
>> +
>> +	list_for_each_entry_rcu(stit, &stt->iommu_tables, next) {
>> +		if ((stit->tbl == tbl) && (stit->group == group)) {
>> +			ret = -EBUSY;
>> +			goto put_exit;
>> +		}
>> +	}
>> +
>> +	iommu_table_get(tbl);
>> +
>> +	stit = kzalloc(sizeof(*stit), GFP_KERNEL);
>> +	stit->tbl = tbl;
>> +	stit->group = group;
>> +
>> +	list_add_rcu(&stit->next, &stt->iommu_tables);
>> +
>> +put_exit:
>> +	iommu_group_put(grp);
>> +
>> +	return ret;
>> +}
>> +
>>  static void release_spapr_tce_table(struct rcu_head *head)
>>  {
>>  	struct kvmppc_spapr_tce_table *stt = container_of(head,
>> @@ -132,6 +290,8 @@ static int kvm_spapr_tce_release(struct inode *inode, struct file *filp)
>>  
>>  	list_del_rcu(&stt->list);
>>  
>> +	kvm_spapr_tce_liobn_release_iommu_group(stt, NULL /* release all */);
>> +
>>  	kvm_put_kvm(stt->kvm);
>>  
>>  	kvmppc_account_memlimit(
>> @@ -181,6 +341,7 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>>  	stt->offset = args->offset;
>>  	stt->size = size;
>>  	stt->kvm = kvm;
>> +	INIT_LIST_HEAD_RCU(&stt->iommu_tables);
>>  
>>  	for (i = 0; i < npages; i++) {
>>  		stt->pages[i] = alloc_page(GFP_KERNEL | __GFP_ZERO);
>> @@ -209,11 +370,102 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>>  	return ret;
>>  }
>>  
>> +static void kvmppc_clear_tce(struct iommu_table *tbl, unsigned long entry)
>> +{
>> +	unsigned long hpa = 0;
>> +	enum dma_data_direction dir = DMA_NONE;
>> +
>> +	iommu_tce_xchg(tbl, entry, &hpa, &dir);
>> +}
>> +
>> +static long kvmppc_tce_iommu_mapped_dec(struct kvm *kvm,
>> +		struct iommu_table *tbl, unsigned long entry)
>> +{
>> +	struct mm_iommu_table_group_mem_t *mem = NULL;
> 
> Initialization is unnecessary for the variable above.
> 
>> +	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
>> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
>> +
>> +	if (WARN_ON(!pua))
>> +		return H_HARDWARE;
>> +
>> +	mem = mm_iommu_lookup(kvm->mm, *pua, pgsize);
>> +	if (!mem)
>> +		return H_TOO_HARD;
>> +
>> +	mm_iommu_mapped_dec(mem);
>> +
>> +	*pua = 0;
>> +
>> +	return H_SUCCESS;
>> +}
>> +
>> +static long kvmppc_tce_iommu_unmap(struct kvm *kvm,
>> +		struct iommu_table *tbl, unsigned long entry)
>> +{
>> +	enum dma_data_direction dir = DMA_NONE;
>> +	unsigned long hpa = 0;
>> +	long ret;
>> +
>> +	if (iommu_tce_xchg(tbl, entry, &hpa, &dir))
>> +		return H_HARDWARE;
>> +
>> +	if (dir == DMA_NONE)
>> +		return H_SUCCESS;
>> +
>> +	ret = kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
>> +	if (ret != H_SUCCESS)
>> +		iommu_tce_xchg(tbl, entry, &hpa, &dir);
>> +
>> +	return ret;
>> +}
>> +
>> +long kvmppc_tce_iommu_map(struct kvm *kvm, struct iommu_table *tbl,
>> +		unsigned long entry, unsigned long gpa,
>> +		enum dma_data_direction dir)
>> +{
>> +	long ret;
>> +	unsigned long hpa, ua, *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
>> +	struct mm_iommu_table_group_mem_t *mem;
>> +
>> +	if (!pua)
>> +		/* it_userspace allocation might be delayed */
>> +		return H_TOO_HARD;
>> +
>> +	if (kvmppc_gpa_to_ua(kvm, gpa, &ua, NULL))
>> +		return H_PARAMETER;
>> +
>> +	mem = mm_iommu_lookup(kvm->mm, ua, 1ULL << tbl->it_page_shift);
>> +	if (!mem)
>> +		return H_TOO_HARD;
>> +
>> +	if (WARN_ON(mm_iommu_ua_to_hpa(mem, ua, &hpa)))
>> +		return H_HARDWARE;
> 
> Remind me - could a failure here be triggered by userspace failing to
> preregister all the memory it should?


mm_iommu_lookup() should fail first in that case: the only reason for
mm_iommu_ua_to_hpa() to fail is (entry >= mem->entries), and that is
already checked in mm_iommu_lookup().
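
To illustrate the ordering argument, here is a minimal userspace sketch.
It assumes a single preregistered region and a fixed 4K page size; struct
prereg_mem and the model_*() names are hypothetical stand-ins, not the
kernel's mm_iommu_*() implementation. Any address that passes the range
check in the lookup yields an in-range page index, so the translation
step cannot fail afterwards.

```c
#include <stddef.h>

#define MODEL_PAGE_SHIFT 12

/* Hypothetical stand-in for one preregistered memory region. */
struct prereg_mem {
	unsigned long ua;       /* userspace start address of the region */
	unsigned long entries;  /* number of pages in the region */
	unsigned long *hpas;    /* host physical address of each page */
};

/* Return the region only if [ua, ua + size) lies entirely inside it. */
static struct prereg_mem *model_lookup(struct prereg_mem *mem,
		unsigned long ua, unsigned long size)
{
	unsigned long end = mem->ua + (mem->entries << MODEL_PAGE_SHIFT);

	if (ua >= mem->ua && ua + size <= end)
		return mem;

	return NULL;
}

/* Translate ua to a host physical address; fails only if out of range. */
static long model_ua_to_hpa(struct prereg_mem *mem, unsigned long ua,
		unsigned long *hpa)
{
	unsigned long entry = (ua - mem->ua) >> MODEL_PAGE_SHIFT;

	if (entry >= mem->entries)
		return -1;

	/* Keep the offset within the page, swap in the physical page. */
	*hpa = mem->hpas[entry] | (ua & ((1UL << MODEL_PAGE_SHIFT) - 1));

	return 0;
}
```

With this shape, a caller that only reaches model_ua_to_hpa() after a
successful model_lookup() never sees the -1 path, which is why a WARN_ON()
around the translation is reasonable.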

> 
>> +	if (mm_iommu_mapped_inc(mem))
>> +		return H_CLOSED;
>> +
>> +	ret = iommu_tce_xchg(tbl, entry, &hpa, &dir);
>> +	if (ret) {
>> +		mm_iommu_mapped_dec(mem);
>> +		return H_TOO_HARD;
>> +	}
>> +
>> +	if (dir != DMA_NONE)
>> +		kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
>> +
>> +	*pua = ua;
>> +
>> +	return 0;
>> +}
>> +
>>  long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>>  		      unsigned long ioba, unsigned long tce)
>>  {
>>  	struct kvmppc_spapr_tce_table *stt;
>> -	long ret;
>> +	long ret, idx;
>> +	struct kvmppc_spapr_tce_iommu_table *stit;
>> +	unsigned long entry, gpa;
>> +	enum dma_data_direction dir;
>>  
>>  	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
>>  	/* 	    liobn, ioba, tce); */
>> @@ -230,7 +482,28 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>>  	if (ret != H_SUCCESS)
>>  		return ret;
>>  
>> -	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
>> +	entry = ioba >> stt->page_shift;
>> +	gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
>> +	dir = iommu_tce_direction(tce);
>> +
>> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
>> +		if (dir == DMA_NONE) {
>> +			ret = kvmppc_tce_iommu_unmap(vcpu->kvm,
>> +					stit->tbl, entry);
>> +		} else {
>> +			idx = srcu_read_lock(&vcpu->kvm->srcu);
>> +			ret = kvmppc_tce_iommu_map(vcpu->kvm, stit->tbl,
>> +					entry, gpa, dir);
>> +			srcu_read_unlock(&vcpu->kvm->srcu, idx);
>> +		}
>> +
>> +		if (WARN_ON_ONCE(ret == H_HARDWARE))
>> +			kvmppc_clear_tce(stit->tbl, entry);
>> +		else if (ret != H_SUCCESS)
>> +			return ret;
>> +	}
>> +
>> +	kvmppc_tce_put(stt, entry, tce);
>>  
>>  	return H_SUCCESS;
>>  }
>> @@ -242,9 +515,10 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>  {
>>  	struct kvmppc_spapr_tce_table *stt;
>>  	long i, ret = H_SUCCESS, idx;
>> -	unsigned long entry, ua = 0;
>> +	unsigned long entry, ua = 0, gpa;
>>  	u64 __user *tces;
>>  	u64 tce;
>> +	struct kvmppc_spapr_tce_iommu_table *stit;
>>  
>>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>>  	if (!stt)
>> @@ -283,6 +557,18 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>  		if (ret != H_SUCCESS)
>>  			goto unlock_exit;
>>  
>> +		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
>> +
>> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
>> +			ret = kvmppc_tce_iommu_map(vcpu->kvm,
>> +					stit->tbl, entry + i, gpa,
>> +					iommu_tce_direction(tce));
>> +			if (WARN_ON_ONCE(ret == H_HARDWARE))
>> +				kvmppc_clear_tce(stit->tbl, entry);
>> +			else if (ret != H_SUCCESS)
>> +				goto unlock_exit;
>> +		}
>> +
>>  		kvmppc_tce_put(stt, entry + i, tce);
>>  	}
>>  
>> @@ -299,6 +585,7 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
>>  {
>>  	struct kvmppc_spapr_tce_table *stt;
>>  	long i, ret;
>> +	struct kvmppc_spapr_tce_iommu_table *stit;
>>  
>>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>>  	if (!stt)
>> @@ -312,6 +599,20 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
>>  	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
>>  		return H_PARAMETER;
>>  
>> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
>> +		unsigned long entry = ioba >> stit->tbl->it_page_shift;
>> +
>> +		for (i = 0; i < npages; ++i) {
>> +			ret = kvmppc_tce_iommu_unmap(vcpu->kvm,
>> +					stit->tbl, entry + i);
>> +
>> +			if (WARN_ON_ONCE(ret == H_HARDWARE))
>> +				kvmppc_clear_tce(stit->tbl, entry);
>> +			else if (ret != H_SUCCESS)
>> +				return ret;
>> +		}
>> +	}
>> +
>>  	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
>>  		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
>>  
>> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
>> index 92d769f4eaea..4a1d978ebd98 100644
>> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
>> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
>> @@ -161,11 +161,111 @@ long kvmppc_gpa_to_ua(struct kvm *kvm, unsigned long gpa,
>>  EXPORT_SYMBOL_GPL(kvmppc_gpa_to_ua);
>>  
>>  #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
>> +static void kvmppc_rm_clear_tce(struct iommu_table *tbl, unsigned long entry)
>> +{
>> +	unsigned long hpa = 0;
>> +	enum dma_data_direction dir = DMA_NONE;
>> +
>> +	iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
>> +}
>> +
>> +static long kvmppc_rm_tce_iommu_mapped_dec(struct kvm *kvm,
>> +		struct iommu_table *tbl, unsigned long entry)
>> +{
>> +	struct mm_iommu_table_group_mem_t *mem = NULL;
>> +	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
>> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
>> +
>> +	if (WARN_ON(!pua))
>> +		return H_HARDWARE;
>> +
>> +	pua = (void *) vmalloc_to_phys(pua);
>> +	if (!pua)
>> +		return H_TOO_HARD;
>> +
>> +	mem = mm_iommu_lookup_rm(kvm->mm, *pua, pgsize);
>> +	if (!mem)
>> +		return H_TOO_HARD;
>> +
>> +	mm_iommu_mapped_dec(mem);
>> +
>> +	*pua = 0;
>> +
>> +	return H_SUCCESS;
>> +}
>> +
>> +static long kvmppc_rm_tce_iommu_unmap(struct kvm *kvm,
>> +		struct iommu_table *tbl, unsigned long entry)
>> +{
>> +	enum dma_data_direction dir = DMA_NONE;
>> +	unsigned long hpa = 0;
>> +	long ret;
>> +
>> +	if (iommu_tce_xchg_rm(tbl, entry, &hpa, &dir))
>> +		return H_HARDWARE;
> 
> To avoid a double WARN() (and to make the warnings easier to
> understand) I'd suggest putting a WARN_ON() here, rather than in the
> callers when they receive an H_HARDWARE.  IIUC this really shouldn't
> ever happen, and it certainly can't be the guest's fault?


Makes sense.
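
A rough sketch of what that could look like, as a userspace model rather
than the actual follow-up patch: model_warn_on(), model_xchg() and the
other model_*() names are hypothetical, and the H_* values are local
stand-ins. The helper warns where the exchange fails, and the caller only
acts on the return code.

```c
#include <stdio.h>

/* Local stand-ins for the hcall return codes; not taken from kernel headers. */
#define H_SUCCESS   0L
#define H_HARDWARE  (-1L)

static int model_warn_count;

/* Userspace stand-in for WARN_ON(): count and log, then yield the condition. */
static int model_warn_on(int cond, const char *what)
{
	if (cond) {
		model_warn_count++;
		fprintf(stderr, "WARN: %s\n", what);
	}

	return cond;
}

/* Hypothetical exchange step; returns non-zero when the update fails. */
static long model_xchg(int fail)
{
	return fail ? -1 : 0;
}

/* The helper warns at the point of failure... */
static long model_unmap(int fail)
{
	if (model_warn_on(model_xchg(fail) != 0, "xchg failed in unmap"))
		return H_HARDWARE;

	return H_SUCCESS;
}

/* ...so the caller reacts to the return code and does not warn again. */
static long model_caller(int fail, int *cleared)
{
	long ret = model_unmap(fail);

	if (ret == H_HARDWARE)
		*cleared = 1;

	return ret;
}
```

This way each failure produces exactly one warning, and the backtrace
points at the helper where the exchange failed.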


>> +
>> +	if (dir == DMA_NONE)
>> +		return H_SUCCESS;
>> +
>> +	ret = kvmppc_rm_tce_iommu_mapped_dec(kvm, tbl, entry);
>> +	if (ret)
>> +		iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
>> +
>> +	return ret;
>> +}
>> +
>> +static long kvmppc_rm_tce_iommu_map(struct kvm *kvm, struct iommu_table *tbl,
>> +		unsigned long entry, unsigned long gpa,
>> +		enum dma_data_direction dir)
>> +{
>> +	long ret;
>> +	unsigned long hpa = 0, ua;
>> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
>> +	struct mm_iommu_table_group_mem_t *mem;
>> +
>> +	if (!pua)
>> +		/* it_userspace allocation might be delayed */
>> +		return H_TOO_HARD;
>> +
>> +	if (kvmppc_gpa_to_ua(kvm, gpa, &ua, NULL))
>> +		return H_PARAMETER;


Referred to below as [1]


>> +
>> +	mem = mm_iommu_lookup_rm(kvm->mm, ua, 1ULL << tbl->it_page_shift);
>> +	if (!mem)
>> +		return H_TOO_HARD;
>> +
>> +	if (WARN_ON(mm_iommu_ua_to_hpa_rm(mem, ua, &hpa)))
>> +		return H_HARDWARE;
>> +
>> +	pua = (void *) vmalloc_to_phys(pua);
>> +	if (WARN_ON(!pua))
>> +		return H_HARDWARE;
>> +
>> +	if (WARN_ON(mm_iommu_mapped_inc(mem)))
>> +		return H_CLOSED;
>> +
>> +	ret = iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
>> +	if (ret) {
>> +		mm_iommu_mapped_dec(mem);
>> +		return H_TOO_HARD;
>> +	}
>> +
>> +	if (dir != DMA_NONE)
>> +		kvmppc_rm_tce_iommu_mapped_dec(kvm, tbl, entry);
>> +
>> +	*pua = ua;
>> +
>> +	return 0;
>> +}
>> +
>>  long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>>  		unsigned long ioba, unsigned long tce)
>>  {
>>  	struct kvmppc_spapr_tce_table *stt;
>>  	long ret;
>> +	struct kvmppc_spapr_tce_iommu_table *stit;
>> +	unsigned long entry, gpa;
>> +	enum dma_data_direction dir;
>>  
>>  	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
>>  	/* 	    liobn, ioba, tce); */
>> @@ -182,7 +282,25 @@ long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>>  	if (ret != H_SUCCESS)
>>  		return ret;
>>  
>> -	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
>> +	entry = ioba >> stt->page_shift;
>> +	gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
>> +	dir = iommu_tce_direction(tce);
>> +
>> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
>> +		if (dir == DMA_NONE)
>> +			ret = kvmppc_rm_tce_iommu_unmap(vcpu->kvm,
>> +					stit->tbl, entry);
>> +		else
>> +			ret = kvmppc_rm_tce_iommu_map(vcpu->kvm,
>> +					stit->tbl, entry, gpa, dir);
>> +
>> +		if (WARN_ON_ONCE(ret == H_HARDWARE))
>> +			kvmppc_rm_clear_tce(stit->tbl, entry);
>> +		else if (ret != H_SUCCESS)
>> +			return ret;
>> +	}
>> +
>> +	kvmppc_tce_put(stt, entry, tce);
>>  
>>  	return H_SUCCESS;
>>  }
>> @@ -220,9 +338,10 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>  {
>>  	struct kvmppc_spapr_tce_table *stt;
>>  	long i, ret = H_SUCCESS;
>> -	unsigned long tces, entry, ua = 0;
>> +	unsigned long tces, entry, ua = 0, tce, gpa;
>>  	unsigned long *rmap = NULL;
>>  	bool prereg = false;
>> +	struct kvmppc_spapr_tce_iommu_table *stit;
>>  
>>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>>  	if (!stt)
>> @@ -287,12 +406,24 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>  	}
>>  
>>  	for (i = 0; i < npages; ++i) {
>> -		unsigned long tce = be64_to_cpu(((u64 *)tces)[i]);
>> +		tce = be64_to_cpu(((u64 *)tces)[i]);
>>  
>>  		ret = kvmppc_tce_validate(stt, tce);
>>  		if (ret != H_SUCCESS)
>>  			goto unlock_exit;
>>  
>> +		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
>> +
>> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
>> +			ret = kvmppc_rm_tce_iommu_map(vcpu->kvm,
>> +					stit->tbl, entry + i, gpa,
>> +					iommu_tce_direction(tce));
>> +			if (WARN_ON_ONCE(ret == H_HARDWARE))
> 
> I don't think you need the WARN() here - the only H_HARDWARE failure
> path in iommu_map() already includes a WARN().


True, I can drop it here.


> 
>> +				kvmppc_rm_clear_tce(stit->tbl, entry);
>> +			else if (ret != H_SUCCESS)
>> +				goto unlock_exit;
> 
> It's also not clear to me why the H_HARDWARE error path clears the
> entry, but the other failure paths don't.  Or why an H_HARDWARE will
> result in continuing to set the rest of the TCEs, but other failures
> won't.


The idea was that for other failures there is still some chance that
handling may succeed in virtual mode or via QEMU, whereas H_HARDWARE is fatal.

I am just not sure that H_PARAMETER is what I want to return at [1]. To make
the calling code simpler, I could return H_HARDWARE there as well (instead
of H_PARAMETER).
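
To spell out the policy, here is a small userspace model. It is only a
sketch: struct model_table, model_map() and model_put_tces() are
hypothetical, the H_* values are local stand-ins, and the per-entry
results are injected by the test rather than produced by a real IOMMU.
H_HARDWARE is treated as fatal for that entry (clear it and carry on),
while anything else is returned straight away so that the next level
(virtual mode or QEMU) can retry.

```c
/* Local stand-ins for the hcall return codes; not taken from kernel headers. */
#define H_SUCCESS   0L
#define H_HARDWARE  (-1L)
#define H_PARAMETER (-4L)
#define H_TOO_HARD  9999L

#define MODEL_NTCES 4

/* Hypothetical table: injected result per entry plus what happened to it. */
struct model_table {
	long result[MODEL_NTCES];
	int mapped[MODEL_NTCES];
	int cleared[MODEL_NTCES];
};

/* Pretend to map one entry; succeeds or fails as dictated by result[]. */
static long model_map(struct model_table *t, long entry)
{
	if (t->result[entry] == H_SUCCESS)
		t->mapped[entry] = 1;

	return t->result[entry];
}

static long model_put_tces(struct model_table *t, long npages)
{
	long i, ret;

	for (i = 0; i < npages; ++i) {
		ret = model_map(t, i);

		if (ret == H_HARDWARE)
			t->cleared[i] = 1;	/* fatal: clear and move on */
		else if (ret != H_SUCCESS)
			return ret;		/* retryable: let the next level try */
	}

	return H_SUCCESS;
}
```

If [1] returned H_HARDWARE instead of H_PARAMETER, a bad GPA would take
the "clear and move on" branch in this model instead of returning early.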


> 
>> +		}
>> +
>>  		kvmppc_tce_put(stt, entry + i, tce);
>>  	}
>>  
>> @@ -309,6 +440,7 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
>>  {
>>  	struct kvmppc_spapr_tce_table *stt;
>>  	long i, ret;
>> +	struct kvmppc_spapr_tce_iommu_table *stit;
>>  
>>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>>  	if (!stt)
>> @@ -322,6 +454,20 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
>>  	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
>>  		return H_PARAMETER;
>>  
>> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
>> +		unsigned long entry = ioba >> stit->tbl->it_page_shift;
>> +
>> +		for (i = 0; i < npages; ++i) {
>> +			ret = kvmppc_rm_tce_iommu_unmap(vcpu->kvm,
>> +					stit->tbl, entry + i);
>> +
>> +			if (WARN_ON_ONCE(ret == H_HARDWARE))
>> +				kvmppc_rm_clear_tce(stit->tbl, entry);
> 
> As noted earlier, I think this WARN belongs within iommu_unmap()
> rather than out here.
> 
>> +			else if (ret != H_SUCCESS)
>> +				return ret;
>> +		}
>> +	}
>> +
>>  	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
>>  		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
>>  
>> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
>> index cd892dec7cb6..f3127dc87912 100644
>> --- a/arch/powerpc/kvm/powerpc.c
>> +++ b/arch/powerpc/kvm/powerpc.c
>> @@ -536,6 +536,8 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>>  #ifdef CONFIG_PPC_BOOK3S_64
>>  	case KVM_CAP_SPAPR_TCE:
>>  	case KVM_CAP_SPAPR_TCE_64:
>> +		/* fallthrough */
>> +	case KVM_CAP_SPAPR_TCE_VFIO:
>>  	case KVM_CAP_PPC_RTAS:
>>  	case KVM_CAP_PPC_FIXUP_HCALL:
>>  	case KVM_CAP_PPC_ENABLE_HCALL:
>> diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
>> index d32f239eb471..2b7dc22265fe 100644
>> --- a/virt/kvm/vfio.c
>> +++ b/virt/kvm/vfio.c
>> @@ -20,6 +20,10 @@
>>  #include <linux/vfio.h>
>>  #include "vfio.h"
>>  
>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>> +#include <asm/kvm_ppc.h>
>> +#endif
>> +
>>  struct kvm_vfio_group {
>>  	struct list_head node;
>>  	struct vfio_group *vfio_group;
>> @@ -211,6 +215,9 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
>>  
>>  		mutex_unlock(&kv->lock);
>>  
>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>> +		kvm_spapr_tce_release_iommu_group(dev->kvm, vfio_group);
>> +#endif
>>  		kvm_vfio_group_set_kvm(vfio_group, NULL);
>>  
>>  		kvm_vfio_group_put_external_user(vfio_group);
>> @@ -218,6 +225,53 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
>>  		kvm_vfio_update_coherency(dev);
>>  
>>  		return ret;
>> +
>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>> +	case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: {
>> +		struct kvm_vfio_spapr_tce param;
>> +		unsigned long minsz;
>> +		struct kvm_vfio *kv = dev->private;
>> +		struct vfio_group *vfio_group;
>> +		struct kvm_vfio_group *kvg;
>> +		struct fd f;
>> +
>> +		minsz = offsetofend(struct kvm_vfio_spapr_tce, tablefd);
>> +
>> +		if (copy_from_user(&param, (void __user *)arg, minsz))
>> +			return -EFAULT;
>> +
>> +		if (param.argsz < minsz || param.flags)
>> +			return -EINVAL;
>> +
>> +		f = fdget(param.groupfd);
>> +		if (!f.file)
>> +			return -EBADF;
>> +
>> +		vfio_group = kvm_vfio_group_get_external_user(f.file);
>> +		fdput(f);
>> +
>> +		if (IS_ERR(vfio_group))
>> +			return PTR_ERR(vfio_group);
>> +
>> +		ret = -ENOENT;
>> +
>> +		mutex_lock(&kv->lock);
>> +
>> +		list_for_each_entry(kvg, &kv->group_list, node) {
>> +			if (kvg->vfio_group != vfio_group)
>> +				continue;
>> +
>> +			ret = kvm_spapr_tce_attach_iommu_group(dev->kvm,
>> +					param.tablefd, vfio_group);
>> +
>> +			break;
>> +		}
>> +
>> +		mutex_unlock(&kv->lock);
>> +
>> +		return ret;
>> +	}
>> +#endif /* CONFIG_SPAPR_TCE_IOMMU */
>>  	}
>>  
>>  	return -ENXIO;
>> @@ -242,6 +296,9 @@ static int kvm_vfio_has_attr(struct kvm_device *dev,
>>  		switch (attr->attr) {
>>  		case KVM_DEV_VFIO_GROUP_ADD:
>>  		case KVM_DEV_VFIO_GROUP_DEL:
>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>> +		case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE:
>> +#endif
>>  			return 0;
>>  		}
>>  
>> @@ -257,6 +314,9 @@ static void kvm_vfio_destroy(struct kvm_device *dev)
>>  	struct kvm_vfio_group *kvg, *tmp;
>>  
>>  	list_for_each_entry_safe(kvg, tmp, &kv->group_list, node) {
>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>> +		kvm_spapr_tce_release_iommu_group(dev->kvm, kvg->vfio_group);
>> +#endif
>>  		kvm_vfio_group_set_kvm(kvg->vfio_group, NULL);
>>  		kvm_vfio_group_put_external_user(kvg->vfio_group);
>>  		list_del(&kvg->node);
>
David Gibson Feb. 24, 2017, 3:36 a.m. | #3
On Fri, Feb 24, 2017 at 02:29:14PM +1100, Alexey Kardashevskiy wrote:
> On 24/02/17 13:14, David Gibson wrote:
> > On Wed, Feb 22, 2017 at 07:21:33PM +1100, Alexey Kardashevskiy wrote:
> >> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
> > >> and H_STUFF_TCE requests targeted at an IOMMU TCE table used for VFIO
> >> without passing them to user space which saves time on switching
> >> to user space and back.
> >>
> >> This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
> > >> KVM tries to handle a TCE request in real mode; if that fails,
> > >> it passes the request to virtual mode to complete the operation.
> > >> If the virtual mode handler fails, the request is passed to
> > >> user space; this is not expected to happen though.
> >>
> >> To avoid dealing with page use counters (which is tricky in real mode),
> >> this only accelerates SPAPR TCE IOMMU v2 clients which are required
> >> to pre-register the userspace memory. The very first TCE request will
> >> be handled in the VFIO SPAPR TCE driver anyway as the userspace view
> >> of the TCE table (iommu_table::it_userspace) is not allocated till
> >> the very first mapping happens and we cannot call vmalloc in real mode.
> >>
> > >> If we fail to update a hardware IOMMU table for an unexpected reason, we just
> >> clear it and move on as there is nothing really we can do about it -
> >> for example, if we hot plug a VFIO device to a guest, existing TCE tables
> >> will be mirrored automatically to the hardware and there is no interface
> >> to report to the guest about possible failures.
> >>
> > >> This adds a new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
> > >> the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
> > >> and associates a physical IOMMU table with the SPAPR TCE table (which
> > >> is a guest view of the hardware IOMMU table). The iommu_table object
> > >> is cached and referenced so we do not have to look it up in real mode.
> >>
> >> This does not implement the UNSET counterpart as there is no use for it -
> >> once the acceleration is enabled, the existing userspace won't
> >> disable it unless a VFIO container is destroyed; this adds necessary
> >> cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
> >>
> >> As this creates a descriptor per IOMMU table-LIOBN couple (called
> >> kvmppc_spapr_tce_iommu_table), it is possible to have several
> >> descriptors with the same iommu_table (hardware IOMMU table) attached
> >> to the same LIOBN; we do not remove duplicates though as
> > >> iommu_table_ops::exchange does not just update a TCE entry (which is
> >> shared among IOMMU groups) but also invalidates the TCE cache
> >> (one per IOMMU group).
> >>
> >> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
> >> space.
> >>
> >> This finally makes use of vfio_external_user_iommu_id() which was
> >> introduced quite some time ago and was considered for removal.
> >>
> >> Tests show that this patch increases transmission speed from 220MB/s
> >> to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).
> >>
> >> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> > 
> > I have some comments on this patch, but all the definite ones are
> > pretty minor and could be done as later cleanups.
> > 
> > I have some more serious queries, but they are just queries and
> > requests for clarification.  If there are satisfactory answers to
> > them, I'll add my R-b.
> > 
> > 
> > ---
> >> Changes:
> >> v5:
> >> * changed error codes in multiple places
> > >> * added a bunch of WARN_ON() for conditions which should not really happen
> > >> * added a check that an iommu table is not attached already to LIOBN
> >> * dropped explicit calls to iommu_tce_clear_param_check/
> >> iommu_tce_put_param_check as kvmppc_tce_validate/kvmppc_ioba_validate
> >> call them anyway (since the previous patch)
> >> * if we fail to update a hardware IOMMU table for unexpected reason,
> >> this just clears the entry
> >>
> >> v4:
> >> * added note to the commit log about allowing multiple updates of
> >> the same IOMMU table;
> > >> * instead of checking whether any memory was preregistered, this
> > >> returns H_TOO_HARD if a specific page was not;
> >> * fixed comments from v3 about error handling in many places;
> >> * simplified TCE handlers and merged IOMMU parts inline - for example,
> >> there used to be kvmppc_h_put_tce_iommu(), now it is merged into
> > >> kvmppc_h_put_tce(); this allows checking IOBA boundaries against
> >> the first attached table only (makes the code simpler);
> >>
> >> v3:
> >> * simplified not to use VFIO group notifiers
> >> * reworked cleanup, should be cleaner/simpler now
> >>
> >> v2:
> >> * reworked to use new VFIO notifiers
> >> * now same iommu_table may appear in the list several times, to be fixed later
> >> ---
> >>  Documentation/virtual/kvm/devices/vfio.txt |  22 ++-
> >>  arch/powerpc/include/asm/kvm_host.h        |   8 +
> >>  arch/powerpc/include/asm/kvm_ppc.h         |   4 +
> >>  include/uapi/linux/kvm.h                   |   8 +
> >>  arch/powerpc/kvm/book3s_64_vio.c           | 307 ++++++++++++++++++++++++++++-
> >>  arch/powerpc/kvm/book3s_64_vio_hv.c        | 152 +++++++++++++-
> >>  arch/powerpc/kvm/powerpc.c                 |   2 +
> >>  virt/kvm/vfio.c                            |  60 ++++++
> >>  8 files changed, 555 insertions(+), 8 deletions(-)
> >>
> >> diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
> >> index ef51740c67ca..f95d867168ea 100644
> >> --- a/Documentation/virtual/kvm/devices/vfio.txt
> >> +++ b/Documentation/virtual/kvm/devices/vfio.txt
> >> @@ -16,7 +16,25 @@ Groups:
> >>  
> >>  KVM_DEV_VFIO_GROUP attributes:
> >>    KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
> >> +	kvm_device_attr.addr points to an int32_t file descriptor
> >> +	for the VFIO group.
> >>    KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
> >> +	kvm_device_attr.addr points to an int32_t file descriptor
> >> +	for the VFIO group.
> >> +  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: attaches a guest visible TCE table
> >> +	allocated by sPAPR KVM.
> >> +	kvm_device_attr.addr points to a struct:
> >>  
> >> -For each, kvm_device_attr.addr points to an int32_t file descriptor
> >> -for the VFIO group.
> >> +	struct kvm_vfio_spapr_tce {
> >> +		__u32	argsz;
> >> +		__u32	flags;
> >> +		__s32	groupfd;
> >> +		__s32	tablefd;
> >> +	};
> >> +
> >> +	where
> >> +	@argsz is the size of kvm_vfio_spapr_tce_liobn;
> >> +	@flags are not supported now, must be zero;
> >> +	@groupfd is a file descriptor for a VFIO group;
> >> +	@tablefd is a file descriptor for a TCE table allocated via
> >> +		KVM_CREATE_SPAPR_TCE.
> >> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> >> index e59b172666cd..a827006941f8 100644
> >> --- a/arch/powerpc/include/asm/kvm_host.h
> >> +++ b/arch/powerpc/include/asm/kvm_host.h
> >> @@ -191,6 +191,13 @@ struct kvmppc_pginfo {
> >>  	atomic_t refcnt;
> >>  };
> >>  
> >> +struct kvmppc_spapr_tce_iommu_table {
> >> +	struct rcu_head rcu;
> >> +	struct list_head next;
> >> +	struct vfio_group *group;
> >> +	struct iommu_table *tbl;
> >> +};
> >> +
> >>  struct kvmppc_spapr_tce_table {
> >>  	struct list_head list;
> >>  	struct kvm *kvm;
> >> @@ -199,6 +206,7 @@ struct kvmppc_spapr_tce_table {
> >>  	u32 page_shift;
> >>  	u64 offset;		/* in pages */
> >>  	u64 size;		/* window size in pages */
> >> +	struct list_head iommu_tables;
> >>  	struct page *pages[0];
> >>  };
> >>  
> >> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
> >> index e04b7fb8ccaa..b8a39dec92cf 100644
> >> --- a/arch/powerpc/include/asm/kvm_ppc.h
> >> +++ b/arch/powerpc/include/asm/kvm_ppc.h
> >> @@ -163,6 +163,10 @@ extern long kvmppc_prepare_vrma(struct kvm *kvm,
> >>  extern void kvmppc_map_vrma(struct kvm_vcpu *vcpu,
> >>  			struct kvm_memory_slot *memslot, unsigned long porder);
> >>  extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
> >> +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
> >> +		struct vfio_group *group);
> >> +extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
> >> +		struct vfio_group *group);
> >>  
> >>  extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
> >>  				struct kvm_create_spapr_tce_64 *args);
> >> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> >> index a2c9bb5a0ead..cdfa01169bd2 100644
> >> --- a/include/uapi/linux/kvm.h
> >> +++ b/include/uapi/linux/kvm.h
> >> @@ -1076,6 +1076,7 @@ struct kvm_device_attr {
> >>  #define  KVM_DEV_VFIO_GROUP			1
> >>  #define   KVM_DEV_VFIO_GROUP_ADD			1
> >>  #define   KVM_DEV_VFIO_GROUP_DEL			2
> >> +#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE		3
> >>  
> >>  enum kvm_device_type {
> >>  	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
> >> @@ -1097,6 +1098,13 @@ enum kvm_device_type {
> >>  	KVM_DEV_TYPE_MAX,
> >>  };
> >>  
> >> +struct kvm_vfio_spapr_tce {
> >> +	__u32	argsz;
> >> +	__u32	flags;
> >> +	__s32	groupfd;
> >> +	__s32	tablefd;
> >> +};
> >> +
> >>  /*
> >>   * ioctls for VM fds
> >>   */
> >> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> >> index 15df8ae627d9..062407af09ee 100644
> >> --- a/arch/powerpc/kvm/book3s_64_vio.c
> >> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> >> @@ -27,6 +27,10 @@
> >>  #include <linux/hugetlb.h>
> >>  #include <linux/list.h>
> >>  #include <linux/anon_inodes.h>
> >> +#include <linux/iommu.h>
> >> +#include <linux/file.h>
> >> +#include <linux/vfio.h>
> >> +#include <linux/module.h>
> >>  
> >>  #include <asm/tlbflush.h>
> >>  #include <asm/kvm_ppc.h>
> >> @@ -39,6 +43,36 @@
> >>  #include <asm/udbg.h>
> >>  #include <asm/iommu.h>
> >>  #include <asm/tce.h>
> >> +#include <asm/mmu_context.h>
> >> +
> >> +static void kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
> >> +{
> >> +	void (*fn)(struct vfio_group *);
> >> +
> >> +	fn = symbol_get(vfio_group_put_external_user);
> >> +	if (WARN_ON(!fn))
> >> +		return;
> >> +
> >> +	fn(vfio_group);
> >> +
> >> +	symbol_put(vfio_group_put_external_user);
> >> +}
> >> +
> >> +static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
> >> +{
> >> +	int (*fn)(struct vfio_group *);
> >> +	int ret = -1;
> >> +
> >> +	fn = symbol_get(vfio_external_user_iommu_id);
> >> +	if (!fn)
> >> +		return ret;
> >> +
> >> +	ret = fn(vfio_group);
> >> +
> >> +	symbol_put(vfio_external_user_iommu_id);
> >> +
> >> +	return ret;
> >> +}
> >>  
> >>  static unsigned long kvmppc_tce_pages(unsigned long iommu_pages)
> >>  {
> >> @@ -90,6 +124,130 @@ static long kvmppc_account_memlimit(unsigned long stt_pages, bool inc)
> >>  	return ret;
> >>  }
> >>  
> >> +static void kvm_spapr_tce_iommu_table_free(struct rcu_head *head)
> >> +{
> >> +	struct kvmppc_spapr_tce_iommu_table *stit = container_of(head,
> >> +			struct kvmppc_spapr_tce_iommu_table, rcu);
> >> +
> >> +	iommu_table_put(stit->tbl);
> >> +	kvm_vfio_group_put_external_user(stit->group);
> >> +
> >> +	kfree(stit);
> >> +}
> >> +
> >> +static void kvm_spapr_tce_liobn_release_iommu_group(
> >> +		struct kvmppc_spapr_tce_table *stt,
> >> +		struct vfio_group *group)
> >> +{
> >> +	struct kvmppc_spapr_tce_iommu_table *stit, *tmp;
> >> +
> >> +	list_for_each_entry_safe(stit, tmp, &stt->iommu_tables, next) {
> >> +		if (group && (stit->group != group))
> >> +			continue;
> >> +
> >> +		list_del_rcu(&stit->next);
> >> +
> >> +		call_rcu(&stit->rcu, kvm_spapr_tce_iommu_table_free);
> >> +	}
> >> +}
> >> +
> >> +extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
> >> +		struct vfio_group *group)
> >> +{
> >> +	struct kvmppc_spapr_tce_table *stt;
> >> +
> >> +	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list)
> >> +		kvm_spapr_tce_liobn_release_iommu_group(stt, group);
> >> +}
> >> +
> >> +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
> >> +		struct vfio_group *group)
> >> +{
> >> +	struct kvmppc_spapr_tce_table *stt = NULL;
> >> +	bool found = false;
> >> +	struct iommu_table *tbl = NULL;
> >> +	struct iommu_table_group *table_group;
> >> +	long i, ret = 0;
> >> +	struct kvmppc_spapr_tce_iommu_table *stit;
> >> +	struct fd f;
> >> +	int group_id;
> >> +	struct iommu_group *grp;
> >> +
> >> +	group_id = kvm_vfio_external_user_iommu_id(group);
> >> +	grp = iommu_group_get_by_id(group_id);
> >> +	if (WARN_ON(!grp))
> >> +		return -EIO;
> > 
> > I think it would be nicer to have a function that goes directly from
> > vfio_group to iommu_group rather than going via id.  That can be a
> > later cleanup, though.
> > 
> >> +
> >> +	f = fdget(tablefd);
> >> +	if (!f.file) {
> >> +		ret = -EBADF;
> >> +		goto put_exit;
> >> +	}
> >> +
> >> +	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list) {
> >> +		if (stt == f.file->private_data) {
> >> +			found = true;
> >> +			break;
> >> +		}
> >> +	}
> >> +
> >> +	fdput(f);
> >> +
> >> +	if (!found) {
> >> +		ret = -EINVAL;
> >> +		goto put_exit;
> >> +	}
> >> +
> >> +	table_group = iommu_group_get_iommudata(grp);
> >> +	if (WARN_ON(!table_group)) {
> >> +		ret = -EFAULT;
> >> +		goto put_exit;
> >> +	}
> >> +
> >> +	for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
> >> +		struct iommu_table *tbltmp = table_group->tables[i];
> >> +
> >> +		if (!tbltmp)
> >> +			continue;
> >> +
> >> +		/*
> >> +		 * Make sure hardware table parameters are exactly the same;
> >> +		 * this is used in the TCE handlers where boundary checks
> >> +		 * use only the first attached table.
> >> +		 */
> >> +		if ((tbltmp->it_page_shift == stt->page_shift) &&
> >> +				(tbltmp->it_offset == stt->offset) &&
> >> +				(tbltmp->it_size == stt->size)) {
> >> +			tbl = tbltmp;
> >> +			break;
> >> +		}
> >> +	}
> >> +	if (!tbl) {
> >> +		ret = -EINVAL;
> >> +		goto put_exit;
> >> +	}
> >> +
> >> +	list_for_each_entry_rcu(stit, &stt->iommu_tables, next) {
> >> +		if ((stit->tbl == tbl) && (stit->group == group)) {
> >> +			ret = -EBUSY;
> >> +			goto put_exit;
> >> +		}
> >> +	}
> >> +
> >> +	iommu_table_get(tbl);
> >> +
> >> +	stit = kzalloc(sizeof(*stit), GFP_KERNEL);
> >> +	stit->tbl = tbl;
> >> +	stit->group = group;
> >> +
> >> +	list_add_rcu(&stit->next, &stt->iommu_tables);
> >> +
> >> +put_exit:
> >> +	iommu_group_put(grp);
> >> +
> >> +	return ret;
> >> +}
> >> +
> >>  static void release_spapr_tce_table(struct rcu_head *head)
> >>  {
> >>  	struct kvmppc_spapr_tce_table *stt = container_of(head,
> >> @@ -132,6 +290,8 @@ static int kvm_spapr_tce_release(struct inode *inode, struct file *filp)
> >>  
> >>  	list_del_rcu(&stt->list);
> >>  
> >> +	kvm_spapr_tce_liobn_release_iommu_group(stt, NULL /* release all */);
> >> +
> >>  	kvm_put_kvm(stt->kvm);
> >>  
> >>  	kvmppc_account_memlimit(
> >> @@ -181,6 +341,7 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
> >>  	stt->offset = args->offset;
> >>  	stt->size = size;
> >>  	stt->kvm = kvm;
> >> +	INIT_LIST_HEAD_RCU(&stt->iommu_tables);
> >>  
> >>  	for (i = 0; i < npages; i++) {
> >>  		stt->pages[i] = alloc_page(GFP_KERNEL | __GFP_ZERO);
> >> @@ -209,11 +370,102 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
> >>  	return ret;
> >>  }
> >>  
> >> +static void kvmppc_clear_tce(struct iommu_table *tbl, unsigned long entry)
> >> +{
> >> +	unsigned long hpa = 0;
> >> +	enum dma_data_direction dir = DMA_NONE;
> >> +
> >> +	iommu_tce_xchg(tbl, entry, &hpa, &dir);
> >> +}
> >> +
> >> +static long kvmppc_tce_iommu_mapped_dec(struct kvm *kvm,
> >> +		struct iommu_table *tbl, unsigned long entry)
> >> +{
> >> +	struct mm_iommu_table_group_mem_t *mem = NULL;
> > 
> > Initialization is unnecessary for the variable above.
> > 
> >> +	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
> >> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> >> +
> >> +	if (WARN_ON(!pua))
> >> +		return H_HARDWARE;
> >> +
> >> +	mem = mm_iommu_lookup(kvm->mm, *pua, pgsize);
> >> +	if (!mem)
> >> +		return H_TOO_HARD;
> >> +
> >> +	mm_iommu_mapped_dec(mem);
> >> +
> >> +	*pua = 0;
> >> +
> >> +	return H_SUCCESS;
> >> +}
> >> +
> >> +static long kvmppc_tce_iommu_unmap(struct kvm *kvm,
> >> +		struct iommu_table *tbl, unsigned long entry)
> >> +{
> >> +	enum dma_data_direction dir = DMA_NONE;
> >> +	unsigned long hpa = 0;
> >> +	long ret;
> >> +
> >> +	if (iommu_tce_xchg(tbl, entry, &hpa, &dir))
> >> +		return H_HARDWARE;
> >> +
> >> +	if (dir == DMA_NONE)
> >> +		return H_SUCCESS;
> >> +
> >> +	ret = kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
> >> +	if (ret != H_SUCCESS)
> >> +		iommu_tce_xchg(tbl, entry, &hpa, &dir);
> >> +
> >> +	return ret;
> >> +}
> >> +
> >> +long kvmppc_tce_iommu_map(struct kvm *kvm, struct iommu_table *tbl,
> >> +		unsigned long entry, unsigned long gpa,
> >> +		enum dma_data_direction dir)
> >> +{
> >> +	long ret;
> >> +	unsigned long hpa, ua, *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> >> +	struct mm_iommu_table_group_mem_t *mem;
> >> +
> >> +	if (!pua)
> >> +		/* it_userspace allocation might be delayed */
> >> +		return H_TOO_HARD;
> >> +
> >> +	if (kvmppc_gpa_to_ua(kvm, gpa, &ua, NULL))
> >> +		return H_PARAMETER;
> >> +
> >> +	mem = mm_iommu_lookup(kvm->mm, ua, 1ULL << tbl->it_page_shift);
> >> +	if (!mem)
> >> +		return H_TOO_HARD;
> >> +
> >> +	if (WARN_ON(mm_iommu_ua_to_hpa(mem, ua, &hpa)))
> >> +		return H_HARDWARE;
> > 
> > Remind me - could a failure here be triggered by userspace failing to
> > preregister all the memory it should?
> 
> 
> mm_iommu_lookup() should fail first as the only reason for
> mm_iommu_ua_to_hpa() to fail is (entry >= mem->entries) but this is checked
> in mm_iommu_lookup().

Ah, yes of course, I forgot.  Ok, that should be fine.
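
For anyone following the argument, here is a minimal userspace sketch of why the second check cannot trip once the first has passed. It is a model only, not the kernel's mm_iommu_* code: `struct region`, `region_lookup()`, `region_ua_to_hpa()` and the 4K `MODEL_PAGE_SHIFT` are hypothetical stand-ins chosen for illustration.

```c
#include <stddef.h>

#define MODEL_PAGE_SHIFT 12UL	/* assumption: 4K pages in this model */

/* Hypothetical stand-in for one preregistered memory region. */
struct region {
	unsigned long ua;	/* userspace address of the region start */
	unsigned long entries;	/* number of pages in the region */
	unsigned long *hpas;	/* one host physical address per page */
};

/* Return the region only if [ua, ua + size) lies entirely inside it. */
static struct region *region_lookup(struct region *regs, size_t n,
		unsigned long ua, unsigned long size)
{
	size_t i;

	for (i = 0; i < n; ++i) {
		unsigned long end = regs[i].ua +
				(regs[i].entries << MODEL_PAGE_SHIFT);

		if (regs[i].ua <= ua && ua + size <= end)
			return &regs[i];
	}

	return NULL;
}

/* Fails only when the page index falls outside the region. */
static int region_ua_to_hpa(const struct region *mem, unsigned long ua,
		unsigned long *hpa)
{
	unsigned long entry = (ua - mem->ua) >> MODEL_PAGE_SHIFT;

	if (entry >= mem->entries)
		return -1;

	/* Keep the offset within the page. */
	*hpa = mem->hpas[entry] | (ua & ((1UL << MODEL_PAGE_SHIFT) - 1));

	return 0;
}
```

In this model, any ua accepted by region_lookup() yields an entry index below mem->entries, so region_ua_to_hpa() can only fail if it is called without a prior successful lookup.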


> > 
> >> +	if (mm_iommu_mapped_inc(mem))
> >> +		return H_CLOSED;
> >> +
> >> +	ret = iommu_tce_xchg(tbl, entry, &hpa, &dir);
> >> +	if (ret) {
> >> +		mm_iommu_mapped_dec(mem);
> >> +		return H_TOO_HARD;
> >> +	}
> >> +
> >> +	if (dir != DMA_NONE)
> >> +		kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
> >> +
> >> +	*pua = ua;
> >> +
> >> +	return 0;
> >> +}
> >> +
> >>  long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
> >>  		      unsigned long ioba, unsigned long tce)
> >>  {
> >>  	struct kvmppc_spapr_tce_table *stt;
> >> -	long ret;
> >> +	long ret, idx;
> >> +	struct kvmppc_spapr_tce_iommu_table *stit;
> >> +	unsigned long entry, gpa;
> >> +	enum dma_data_direction dir;
> >>  
> >>  	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
> >>  	/* 	    liobn, ioba, tce); */
> >> @@ -230,7 +482,28 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
> >>  	if (ret != H_SUCCESS)
> >>  		return ret;
> >>  
> >> -	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
> >> +	entry = ioba >> stt->page_shift;
> >> +	gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> >> +	dir = iommu_tce_direction(tce);
> >> +
> >> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> >> +		if (dir == DMA_NONE) {
> >> +			ret = kvmppc_tce_iommu_unmap(vcpu->kvm,
> >> +					stit->tbl, entry);
> >> +		} else {
> >> +			idx = srcu_read_lock(&vcpu->kvm->srcu);
> >> +			ret = kvmppc_tce_iommu_map(vcpu->kvm, stit->tbl,
> >> +					entry, gpa, dir);
> >> +			srcu_read_unlock(&vcpu->kvm->srcu, idx);
> >> +		}
> >> +
> >> +		if (WARN_ON_ONCE(ret == H_HARDWARE))
> >> +			kvmppc_clear_tce(stit->tbl, entry);
> >> +		else if (ret != H_SUCCESS)
> >> +			return ret;
> >> +	}
> >> +
> >> +	kvmppc_tce_put(stt, entry, tce);
> >>  
> >>  	return H_SUCCESS;
> >>  }
> >> @@ -242,9 +515,10 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >>  {
> >>  	struct kvmppc_spapr_tce_table *stt;
> >>  	long i, ret = H_SUCCESS, idx;
> >> -	unsigned long entry, ua = 0;
> >> +	unsigned long entry, ua = 0, gpa;
> >>  	u64 __user *tces;
> >>  	u64 tce;
> >> +	struct kvmppc_spapr_tce_iommu_table *stit;
> >>  
> >>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
> >>  	if (!stt)
> >> @@ -283,6 +557,18 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >>  		if (ret != H_SUCCESS)
> >>  			goto unlock_exit;
> >>  
> >> +		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> >> +
> >> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> >> +			ret = kvmppc_tce_iommu_map(vcpu->kvm,
> >> +					stit->tbl, entry + i, gpa,
> >> +					iommu_tce_direction(tce));
> >> +			if (WARN_ON_ONCE(ret == H_HARDWARE))
> >> +				kvmppc_clear_tce(stit->tbl, entry);
> >> +			else if (ret != H_SUCCESS)
> >> +				goto unlock_exit;
> >> +		}
> >> +
> >>  		kvmppc_tce_put(stt, entry + i, tce);
> >>  	}
> >>  
> >> @@ -299,6 +585,7 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
> >>  {
> >>  	struct kvmppc_spapr_tce_table *stt;
> >>  	long i, ret;
> >> +	struct kvmppc_spapr_tce_iommu_table *stit;
> >>  
> >>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
> >>  	if (!stt)
> >> @@ -312,6 +599,20 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
> >>  	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
> >>  		return H_PARAMETER;
> >>  
> >> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> >> +		unsigned long entry = ioba >> stit->tbl->it_page_shift;
> >> +
> >> +		for (i = 0; i < npages; ++i) {
> >> +			ret = kvmppc_tce_iommu_unmap(vcpu->kvm,
> >> +					stit->tbl, entry + i);
> >> +
> >> +			if (WARN_ON_ONCE(ret == H_HARDWARE))
> >> +				kvmppc_clear_tce(stit->tbl, entry);
> >> +			else if (ret != H_SUCCESS)
> >> +				return ret;
> >> +		}
> >> +	}
> >> +
> >>  	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
> >>  		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
> >>  
> >> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
> >> index 92d769f4eaea..4a1d978ebd98 100644
> >> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
> >> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
> >> @@ -161,11 +161,111 @@ long kvmppc_gpa_to_ua(struct kvm *kvm, unsigned long gpa,
> >>  EXPORT_SYMBOL_GPL(kvmppc_gpa_to_ua);
> >>  
> >>  #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
> >> +static void kvmppc_rm_clear_tce(struct iommu_table *tbl, unsigned long entry)
> >> +{
> >> +	unsigned long hpa = 0;
> >> +	enum dma_data_direction dir = DMA_NONE;
> >> +
> >> +	iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
> >> +}
> >> +
> >> +static long kvmppc_rm_tce_iommu_mapped_dec(struct kvm *kvm,
> >> +		struct iommu_table *tbl, unsigned long entry)
> >> +{
> >> +	struct mm_iommu_table_group_mem_t *mem = NULL;
> >> +	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
> >> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> >> +
> >> +	if (WARN_ON(!pua))
> >> +		return H_HARDWARE;
> >> +
> >> +	pua = (void *) vmalloc_to_phys(pua);
> >> +	if (!pua)
> >> +		return H_TOO_HARD;
> >> +
> >> +	mem = mm_iommu_lookup_rm(kvm->mm, *pua, pgsize);
> >> +	if (!mem)
> >> +		return H_TOO_HARD;
> >> +
> >> +	mm_iommu_mapped_dec(mem);
> >> +
> >> +	*pua = 0;
> >> +
> >> +	return H_SUCCESS;
> >> +}
> >> +
> >> +static long kvmppc_rm_tce_iommu_unmap(struct kvm *kvm,
> >> +		struct iommu_table *tbl, unsigned long entry)
> >> +{
> >> +	enum dma_data_direction dir = DMA_NONE;
> >> +	unsigned long hpa = 0;
> >> +	long ret;
> >> +
> >> +	if (iommu_tce_xchg_rm(tbl, entry, &hpa, &dir))
> >> +		return H_HARDWARE;
> > 
> > To avoid a double WARN() (and to make the warnings easier to
> > understand) I'd suggest putting a WARN_ON() here, rather than in the
> > callers when they receive an H_HARDWARE.  IIUC this really shouldn't
> > ever happen, and it certainly can't be the guest's fault?
> 
> 
> Makes sense.

I guess it might want WARN_ON_ONCE() to avoid spamming the user with
errors for every TCE, though.
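
To illustrate the difference, here is a small userspace sketch of warn-once behaviour placed inside the unmap helper. It is a model under stated assumptions: `model_warn_on_once()`, `model_tce_iommu_unmap()` and the `MODEL_H_*` codes are hypothetical names, and the kernel's WARN_ON_ONCE() is a macro with its own per-call-site state rather than this function.

```c
#include <stdio.h>

/* Placeholder status codes for this model only. */
#define MODEL_H_SUCCESS   0L
#define MODEL_H_HARDWARE  (-1L)

static int model_warnings;	/* how many warnings were actually printed */

/*
 * Report a failed condition at most once per flag. Returns the truth
 * value of cond every time, so it can be used directly in an if ().
 */
static int model_warn_on_once(int cond, int *warned, const char *what)
{
	if (cond && !*warned) {
		*warned = 1;
		++model_warnings;
		fprintf(stderr, "WARNING: %s\n", what);
	}

	return cond != 0;
}

/* Hypothetical unmap helper: xchg_failed models a failing TCE exchange. */
static long model_tce_iommu_unmap(int xchg_failed)
{
	static int warned;	/* one flag for this call site */

	if (model_warn_on_once(xchg_failed, &warned, "TCE exchange failed"))
		return MODEL_H_HARDWARE;

	return MODEL_H_SUCCESS;
}
```

With the check inside the helper, callers only propagate the status, and a request covering many TCEs produces a single message instead of one per entry.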

> >> +
> >> +	if (dir == DMA_NONE)
> >> +		return H_SUCCESS;
> >> +
> >> +	ret = kvmppc_rm_tce_iommu_mapped_dec(kvm, tbl, entry);
> >> +	if (ret)
> >> +		iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
> >> +
> >> +	return ret;
> >> +}
> >> +
> >> +static long kvmppc_rm_tce_iommu_map(struct kvm *kvm, struct iommu_table *tbl,
> >> +		unsigned long entry, unsigned long gpa,
> >> +		enum dma_data_direction dir)
> >> +{
> >> +	long ret;
> >> +	unsigned long hpa = 0, ua;
> >> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> >> +	struct mm_iommu_table_group_mem_t *mem;
> >> +
> >> +	if (!pua)
> >> +		/* it_userspace allocation might be delayed */
> >> +		return H_TOO_HARD;
> >> +
> >> +	if (kvmppc_gpa_to_ua(kvm, gpa, &ua, NULL))
> >> +		return H_PARAMETER;
> 
> 
> Referred below as [1]
> 
> 
> >> +
> >> +	mem = mm_iommu_lookup_rm(kvm->mm, ua, 1ULL << tbl->it_page_shift);
> >> +	if (!mem)
> >> +		return H_TOO_HARD;
> >> +
> >> +	if (WARN_ON(mm_iommu_ua_to_hpa_rm(mem, ua, &hpa)))
> >> +		return H_HARDWARE;
> >> +
> >> +	pua = (void *) vmalloc_to_phys(pua);
> >> +	if (WARN_ON(!pua))
> >> +		return H_HARDWARE;
> >> +
> >> +	if (WARN_ON(mm_iommu_mapped_inc(mem)))
> >> +		return H_CLOSED;
> >> +
> >> +	ret = iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
> >> +	if (ret) {
> >> +		mm_iommu_mapped_dec(mem);
> >> +		return H_TOO_HARD;
> >> +	}
> >> +
> >> +	if (dir != DMA_NONE)
> >> +		kvmppc_rm_tce_iommu_mapped_dec(kvm, tbl, entry);
> >> +
> >> +	*pua = ua;
> >> +
> >> +	return 0;
> >> +}
> >> +
> >>  long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
> >>  		unsigned long ioba, unsigned long tce)
> >>  {
> >>  	struct kvmppc_spapr_tce_table *stt;
> >>  	long ret;
> >> +	struct kvmppc_spapr_tce_iommu_table *stit;
> >> +	unsigned long entry, gpa;
> >> +	enum dma_data_direction dir;
> >>  
> >>  	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
> >>  	/* 	    liobn, ioba, tce); */
> >> @@ -182,7 +282,25 @@ long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
> >>  	if (ret != H_SUCCESS)
> >>  		return ret;
> >>  
> >> -	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
> >> +	entry = ioba >> stt->page_shift;
> >> +	gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> >> +	dir = iommu_tce_direction(tce);
> >> +
> >> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> >> +		if (dir == DMA_NONE)
> >> +			ret = kvmppc_rm_tce_iommu_unmap(vcpu->kvm,
> >> +					stit->tbl, entry);
> >> +		else
> >> +			ret = kvmppc_rm_tce_iommu_map(vcpu->kvm,
> >> +					stit->tbl, entry, gpa, dir);
> >> +
> >> +		if (WARN_ON_ONCE(ret == H_HARDWARE))
> >> +			kvmppc_rm_clear_tce(stit->tbl, entry);
> >> +		else if (ret != H_SUCCESS)
> >> +			return ret;
> >> +	}
> >> +
> >> +	kvmppc_tce_put(stt, entry, tce);
> >>  
> >>  	return H_SUCCESS;
> >>  }
> >> @@ -220,9 +338,10 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >>  {
> >>  	struct kvmppc_spapr_tce_table *stt;
> >>  	long i, ret = H_SUCCESS;
> >> -	unsigned long tces, entry, ua = 0;
> >> +	unsigned long tces, entry, ua = 0, tce, gpa;
> >>  	unsigned long *rmap = NULL;
> >>  	bool prereg = false;
> >> +	struct kvmppc_spapr_tce_iommu_table *stit;
> >>  
> >>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
> >>  	if (!stt)
> >> @@ -287,12 +406,24 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >>  	}
> >>  
> >>  	for (i = 0; i < npages; ++i) {
> >> -		unsigned long tce = be64_to_cpu(((u64 *)tces)[i]);
> >> +		tce = be64_to_cpu(((u64 *)tces)[i]);
> >>  
> >>  		ret = kvmppc_tce_validate(stt, tce);
> >>  		if (ret != H_SUCCESS)
> >>  			goto unlock_exit;
> >>  
> >> +		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> >> +
> >> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> >> +			ret = kvmppc_rm_tce_iommu_map(vcpu->kvm,
> >> +					stit->tbl, entry + i, gpa,
> >> +					iommu_tce_direction(tce));
> >> +			if (WARN_ON_ONCE(ret == H_HARDWARE))
> > 
> > I don't think you need the WARN() here - the only H_HARDWARE failure
> > path in iommu_map() already includes a WARN().
> 
> 
> True, I can drop it here.
> 
> 
> > 
> >> +				kvmppc_rm_clear_tce(stit->tbl, entry);
> >> +			else if (ret != H_SUCCESS)
> >> +				goto unlock_exit;
> > 
> > It's also not clear to me why the H_HARDWARE error path clears the
> > entry, but the other failure paths don't.  Or why an H_HARDWARE will
> > result in continuing to set the rest of the TCEs, but other failures
> > won't.
> 
> 
> The idea was that other failures still have some chance that handling may
> succeed in virtual mode or via QEMU; H_HARDWARE is fatal.

Um... yes.. but the logic seems to be backwards for that: on
H_HARDWARE you warn and keep going, on other errors you bail out
entirely.
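
To make the control flow concrete, here is a minimal userspace model of the per-table loop as posted. It is a sketch only: `model_update_tables()` and the `MODEL_H_*` codes are hypothetical, and the per-table outcome is passed in as data instead of being computed.

```c
#include <stddef.h>

/* Placeholder status codes for this model only. */
enum {
	MODEL_H_SUCCESS  = 0,
	MODEL_H_HARDWARE = -1,
	MODEL_H_TOO_HARD = 9999,
};

/*
 * results[i] is the outcome of the map/unmap step for attached table i.
 * cleared[i] is set to 1 when the entry in table i gets cleared.
 * *visited receives the number of tables the loop touched.
 */
static long model_update_tables(const long *results, int *cleared,
		size_t ntables, size_t *visited)
{
	size_t i;

	for (i = 0; i < ntables; ++i)
		cleared[i] = 0;

	*visited = 0;
	for (i = 0; i < ntables; ++i) {
		++*visited;
		if (results[i] == MODEL_H_HARDWARE)
			cleared[i] = 1;		/* warn, clear, keep going */
		else if (results[i] != MODEL_H_SUCCESS)
			return results[i];	/* bail out entirely */
	}

	return MODEL_H_SUCCESS;
}
```

In this model the supposedly fatal status still lets the loop reach every table and report success, while a retryable status stops the loop and leaves the remaining tables untouched.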

> I am just not sure if H_PARAMETER is what I want to return at [1]; to make
> the calling code simpler, I could return H_HARDWARE there as well (instead
> of H_PARAMETER).

That sounds right, IIUC the gpa to ua translation shouldn't ever
fail because of something the guest did.  So I'd expect either
H_HARDWARE, or H_TOO_HARD (if there's some hope that virtual mode can
make the translation when real mode couldn't).
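
As a rough illustration of what H_TOO_HARD means in this scheme, here is a userspace sketch of the fallback chain described in the commit message (real mode, then virtual mode, then user space). Everything in it is an assumption made for the example: the `model_*` names, the `MODEL_H_*` codes and the address limits that decide which stage can cope.

```c
/* Placeholder status codes for this model only. */
enum {
	MODEL_H_SUCCESS  = 0,
	MODEL_H_TOO_HARD = 9999,
};

enum model_stage { MODEL_REAL_MODE, MODEL_VIRT_MODE, MODEL_USER_SPACE };

/* Hypothetical limits: what each stage is able to translate. */
#define MODEL_RM_LIMIT   0x1000UL
#define MODEL_VIRT_LIMIT 0x2000UL

static long model_rm_handler(unsigned long gpa)
{
	return gpa < MODEL_RM_LIMIT ? MODEL_H_SUCCESS : MODEL_H_TOO_HARD;
}

static long model_virt_handler(unsigned long gpa)
{
	return gpa < MODEL_VIRT_LIMIT ? MODEL_H_SUCCESS : MODEL_H_TOO_HARD;
}

/*
 * Try real mode first, then virtual mode. H_TOO_HARD from a stage means
 * "could not finish here, let the next stage try"; any other status is
 * final. *stage reports which stage ended up owning the request.
 */
static long model_dispatch(unsigned long gpa, enum model_stage *stage)
{
	long ret = model_rm_handler(gpa);

	if (ret != MODEL_H_TOO_HARD) {
		*stage = MODEL_REAL_MODE;
		return ret;
	}

	ret = model_virt_handler(gpa);
	if (ret != MODEL_H_TOO_HARD) {
		*stage = MODEL_VIRT_MODE;
		return ret;
	}

	/* Neither in-kernel stage could finish: hand over to user space. */
	*stage = MODEL_USER_SPACE;
	return ret;
}
```

Under this reading, a status other than H_TOO_HARD ends the chain at the stage that produced it, which is why the choice between H_HARDWARE and H_TOO_HARD at [1] matters.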

> >> +		}
> >> +
> >>  		kvmppc_tce_put(stt, entry + i, tce);
> >>  	}
> >>  
> >> @@ -309,6 +440,7 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
> >>  {
> >>  	struct kvmppc_spapr_tce_table *stt;
> >>  	long i, ret;
> >> +	struct kvmppc_spapr_tce_iommu_table *stit;
> >>  
> >>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
> >>  	if (!stt)
> >> @@ -322,6 +454,20 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
> >>  	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
> >>  		return H_PARAMETER;
> >>  
> >> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> >> +		unsigned long entry = ioba >> stit->tbl->it_page_shift;
> >> +
> >> +		for (i = 0; i < npages; ++i) {
> >> +			ret = kvmppc_rm_tce_iommu_unmap(vcpu->kvm,
> >> +					stit->tbl, entry + i);
> >> +
> >> +			if (WARN_ON_ONCE(ret == H_HARDWARE))
> >> +				kvmppc_rm_clear_tce(stit->tbl, entry);
> > 
> > As noted earlier, I think this WARN belongs within iommu_unmap()
> > rather than out here.
> > 
> >> +			else if (ret != H_SUCCESS)
> >> +				return ret;
> >> +		}
> >> +	}
> >> +
> >>  	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
> >>  		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
> >>  
> >> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> >> index cd892dec7cb6..f3127dc87912 100644
> >> --- a/arch/powerpc/kvm/powerpc.c
> >> +++ b/arch/powerpc/kvm/powerpc.c
> >> @@ -536,6 +536,8 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> >>  #ifdef CONFIG_PPC_BOOK3S_64
> >>  	case KVM_CAP_SPAPR_TCE:
> >>  	case KVM_CAP_SPAPR_TCE_64:
> >> +		/* fallthrough */
> >> +	case KVM_CAP_SPAPR_TCE_VFIO:
> >>  	case KVM_CAP_PPC_RTAS:
> >>  	case KVM_CAP_PPC_FIXUP_HCALL:
> >>  	case KVM_CAP_PPC_ENABLE_HCALL:
> >> diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
> >> index d32f239eb471..2b7dc22265fe 100644
> >> --- a/virt/kvm/vfio.c
> >> +++ b/virt/kvm/vfio.c
> >> @@ -20,6 +20,10 @@
> >>  #include <linux/vfio.h>
> >>  #include "vfio.h"
> >>  
> >> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> >> +#include <asm/kvm_ppc.h>
> >> +#endif
> >> +
> >>  struct kvm_vfio_group {
> >>  	struct list_head node;
> >>  	struct vfio_group *vfio_group;
> >> @@ -211,6 +215,9 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
> >>  
> >>  		mutex_unlock(&kv->lock);
> >>  
> >> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> >> +		kvm_spapr_tce_release_iommu_group(dev->kvm, vfio_group);
> >> +#endif
> >>  		kvm_vfio_group_set_kvm(vfio_group, NULL);
> >>  
> >>  		kvm_vfio_group_put_external_user(vfio_group);
> >> @@ -218,6 +225,53 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
> >>  		kvm_vfio_update_coherency(dev);
> >>  
> >>  		return ret;
> >> +
> >> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> >> +	case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: {
> >> +		struct kvm_vfio_spapr_tce param;
> >> +		unsigned long minsz;
> >> +		struct kvm_vfio *kv = dev->private;
> >> +		struct vfio_group *vfio_group;
> >> +		struct kvm_vfio_group *kvg;
> >> +		struct fd f;
> >> +
> >> +		minsz = offsetofend(struct kvm_vfio_spapr_tce, tablefd);
> >> +
> >> +		if (copy_from_user(&param, (void __user *)arg, minsz))
> >> +			return -EFAULT;
> >> +
> >> +		if (param.argsz < minsz || param.flags)
> >> +			return -EINVAL;
> >> +
> >> +		f = fdget(param.groupfd);
> >> +		if (!f.file)
> >> +			return -EBADF;
> >> +
> >> +		vfio_group = kvm_vfio_group_get_external_user(f.file);
> >> +		fdput(f);
> >> +
> >> +		if (IS_ERR(vfio_group))
> >> +			return PTR_ERR(vfio_group);
> >> +
> >> +		ret = -ENOENT;
> >> +
> >> +		mutex_lock(&kv->lock);
> >> +
> >> +		list_for_each_entry(kvg, &kv->group_list, node) {
> >> +			if (kvg->vfio_group != vfio_group)
> >> +				continue;
> >> +
> >> +			ret = kvm_spapr_tce_attach_iommu_group(dev->kvm,
> >> +					param.tablefd, vfio_group);
> >> +
> >> +			break;
> >> +		}
> >> +
> >> +		mutex_unlock(&kv->lock);
> >> +
> >> +		return ret;
> >> +	}
> >> +#endif /* CONFIG_SPAPR_TCE_IOMMU */
> >>  	}
> >>  
> >>  	return -ENXIO;
> >> @@ -242,6 +296,9 @@ static int kvm_vfio_has_attr(struct kvm_device *dev,
> >>  		switch (attr->attr) {
> >>  		case KVM_DEV_VFIO_GROUP_ADD:
> >>  		case KVM_DEV_VFIO_GROUP_DEL:
> >> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> >> +		case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE:
> >> +#endif
> >>  			return 0;
> >>  		}
> >>  
> >> @@ -257,6 +314,9 @@ static void kvm_vfio_destroy(struct kvm_device *dev)
> >>  	struct kvm_vfio_group *kvg, *tmp;
> >>  
> >>  	list_for_each_entry_safe(kvg, tmp, &kv->group_list, node) {
> >> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> >> +		kvm_spapr_tce_release_iommu_group(dev->kvm, kvg->vfio_group);
> >> +#endif
> >>  		kvm_vfio_group_set_kvm(kvg->vfio_group, NULL);
> >>  		kvm_vfio_group_put_external_user(kvg->vfio_group);
> >>  		list_del(&kvg->node);
> > 
> 
>
Alexey Kardashevskiy Feb. 24, 2017, 3:43 a.m. | #4
On 24/02/17 14:36, David Gibson wrote:
> On Fri, Feb 24, 2017 at 02:29:14PM +1100, Alexey Kardashevskiy wrote:
>> On 24/02/17 13:14, David Gibson wrote:
>>> On Wed, Feb 22, 2017 at 07:21:33PM +1100, Alexey Kardashevskiy wrote:
>>>> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
>>>> and H_STUFF_TCE requests targeting an IOMMU TCE table used for VFIO
>>>> without passing them to user space which saves time on switching
>>>> to user space and back.
>>>>
>>>> This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
>>>> KVM tries to handle a TCE request in real mode; if that fails,
>>>> it passes the request to virtual mode to complete the operation.
>>>> If the virtual mode handler fails too, the request is passed to
>>>> user space; this is not expected to happen though.
>>>>
>>>> To avoid dealing with page use counters (which is tricky in real mode),
>>>> this only accelerates SPAPR TCE IOMMU v2 clients which are required
>>>> to pre-register the userspace memory. The very first TCE request will
>>>> be handled in the VFIO SPAPR TCE driver anyway as the userspace view
>>>> of the TCE table (iommu_table::it_userspace) is not allocated till
>>>> the very first mapping happens and we cannot call vmalloc in real mode.
>>>>
>>>> If we fail to update a hardware IOMMU table for an unexpected reason, we just
>>>> clear it and move on as there is nothing really we can do about it -
>>>> for example, if we hot plug a VFIO device to a guest, existing TCE tables
>>>> will be mirrored automatically to the hardware and there is no interface
>>>> to report to the guest about possible failures.
>>>>
>>>> This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
>>>> the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
>>>> and associates a physical IOMMU table with the SPAPR TCE table (which
>>>> is a guest view of the hardware IOMMU table). The iommu_table object
>>>> is cached and referenced so we do not have to look it up in real mode.
>>>>
>>>> This does not implement the UNSET counterpart as there is no use for it -
>>>> once the acceleration is enabled, the existing userspace won't
>>>> disable it unless a VFIO container is destroyed; this adds necessary
>>>> cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
>>>>
>>>> As this creates a descriptor per IOMMU table-LIOBN couple (called
>>>> kvmppc_spapr_tce_iommu_table), it is possible to have several
>>>> descriptors with the same iommu_table (hardware IOMMU table) attached
>>>> to the same LIOBN; we do not remove duplicates though as
>>>> iommu_table_ops::exchange does not just update a TCE entry (which is
>>>> shared among IOMMU groups) but also invalidates the TCE cache
>>>> (one per IOMMU group).
>>>>
>>>> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
>>>> space.
>>>>
>>>> This finally makes use of vfio_external_user_iommu_id() which was
>>>> introduced quite some time ago and was considered for removal.
>>>>
>>>> Tests show that this patch increases transmission speed from 220MB/s
>>>> to 750..1020MB/s on 10Gb network (Chelsio CXGB3 10Gb ethernet card).
>>>>
>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>
>>> I have some comments on this patch, but all the definite ones are
>>> pretty minor and could be done as later cleanups.
>>>
>>> I have some more serious queries, but they are just queries and
>>> requests for clarification.  If there are satisfactory answers to
>>> them, I'll add my R-b.
>>>
>>>
>>> ---
>>>> Changes:
>>>> v5:
>>>> * changed error codes in multiple places
>>>> * added a bunch of WARN_ON() in places which should not really happen
>>>> * added a check that an iommu table is not attached already to LIOBN
>>>> * dropped explicit calls to iommu_tce_clear_param_check/
>>>> iommu_tce_put_param_check as kvmppc_tce_validate/kvmppc_ioba_validate
>>>> call them anyway (since the previous patch)
>>>> * if we fail to update a hardware IOMMU table for unexpected reason,
>>>> this just clears the entry
>>>>
>>>> v4:
>>>> * added note to the commit log about allowing multiple updates of
>>>> the same IOMMU table;
>>>> * instead of checking whether any memory was preregistered, this
>>>> returns H_TOO_HARD if a specific page was not;
>>>> * fixed comments from v3 about error handling in many places;
>>>> * simplified TCE handlers and merged IOMMU parts inline - for example,
>>>> there used to be kvmppc_h_put_tce_iommu(), now it is merged into
>>>> kvmppc_h_put_tce(); this allows to check IOBA boundaries against
>>>> the first attached table only (makes the code simpler);
>>>>
>>>> v3:
>>>> * simplified not to use VFIO group notifiers
>>>> * reworked cleanup, should be cleaner/simpler now
>>>>
>>>> v2:
>>>> * reworked to use new VFIO notifiers
>>>> * now same iommu_table may appear in the list several times, to be fixed later
>>>> ---
>>>>  Documentation/virtual/kvm/devices/vfio.txt |  22 ++-
>>>>  arch/powerpc/include/asm/kvm_host.h        |   8 +
>>>>  arch/powerpc/include/asm/kvm_ppc.h         |   4 +
>>>>  include/uapi/linux/kvm.h                   |   8 +
>>>>  arch/powerpc/kvm/book3s_64_vio.c           | 307 ++++++++++++++++++++++++++++-
>>>>  arch/powerpc/kvm/book3s_64_vio_hv.c        | 152 +++++++++++++-
>>>>  arch/powerpc/kvm/powerpc.c                 |   2 +
>>>>  virt/kvm/vfio.c                            |  60 ++++++
>>>>  8 files changed, 555 insertions(+), 8 deletions(-)
>>>>
>>>> diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
>>>> index ef51740c67ca..f95d867168ea 100644
>>>> --- a/Documentation/virtual/kvm/devices/vfio.txt
>>>> +++ b/Documentation/virtual/kvm/devices/vfio.txt
>>>> @@ -16,7 +16,25 @@ Groups:
>>>>  
>>>>  KVM_DEV_VFIO_GROUP attributes:
>>>>    KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
>>>> +	kvm_device_attr.addr points to an int32_t file descriptor
>>>> +	for the VFIO group.
>>>>    KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
>>>> +	kvm_device_attr.addr points to an int32_t file descriptor
>>>> +	for the VFIO group.
>>>> +  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: attaches a guest visible TCE table
>>>> +	allocated by sPAPR KVM.
>>>> +	kvm_device_attr.addr points to a struct:
>>>>  
>>>> -For each, kvm_device_attr.addr points to an int32_t file descriptor
>>>> -for the VFIO group.
>>>> +	struct kvm_vfio_spapr_tce {
>>>> +		__u32	argsz;
>>>> +		__u32	flags;
>>>> +		__s32	groupfd;
>>>> +		__s32	tablefd;
>>>> +	};
>>>> +
>>>> +	where
>>>> +	@argsz is the size of kvm_vfio_spapr_tce;
>>>> +	@flags are not supported now, must be zero;
>>>> +	@groupfd is a file descriptor for a VFIO group;
>>>> +	@tablefd is a file descriptor for a TCE table allocated via
>>>> +		KVM_CREATE_SPAPR_TCE.
>>>> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
>>>> index e59b172666cd..a827006941f8 100644
>>>> --- a/arch/powerpc/include/asm/kvm_host.h
>>>> +++ b/arch/powerpc/include/asm/kvm_host.h
>>>> @@ -191,6 +191,13 @@ struct kvmppc_pginfo {
>>>>  	atomic_t refcnt;
>>>>  };
>>>>  
>>>> +struct kvmppc_spapr_tce_iommu_table {
>>>> +	struct rcu_head rcu;
>>>> +	struct list_head next;
>>>> +	struct vfio_group *group;
>>>> +	struct iommu_table *tbl;
>>>> +};
>>>> +
>>>>  struct kvmppc_spapr_tce_table {
>>>>  	struct list_head list;
>>>>  	struct kvm *kvm;
>>>> @@ -199,6 +206,7 @@ struct kvmppc_spapr_tce_table {
>>>>  	u32 page_shift;
>>>>  	u64 offset;		/* in pages */
>>>>  	u64 size;		/* window size in pages */
>>>> +	struct list_head iommu_tables;
>>>>  	struct page *pages[0];
>>>>  };
>>>>  
>>>> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
>>>> index e04b7fb8ccaa..b8a39dec92cf 100644
>>>> --- a/arch/powerpc/include/asm/kvm_ppc.h
>>>> +++ b/arch/powerpc/include/asm/kvm_ppc.h
>>>> @@ -163,6 +163,10 @@ extern long kvmppc_prepare_vrma(struct kvm *kvm,
>>>>  extern void kvmppc_map_vrma(struct kvm_vcpu *vcpu,
>>>>  			struct kvm_memory_slot *memslot, unsigned long porder);
>>>>  extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
>>>> +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
>>>> +		struct vfio_group *group);
>>>> +extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
>>>> +		struct vfio_group *group);
>>>>  
>>>>  extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>>>>  				struct kvm_create_spapr_tce_64 *args);
>>>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>>>> index a2c9bb5a0ead..cdfa01169bd2 100644
>>>> --- a/include/uapi/linux/kvm.h
>>>> +++ b/include/uapi/linux/kvm.h
>>>> @@ -1076,6 +1076,7 @@ struct kvm_device_attr {
>>>>  #define  KVM_DEV_VFIO_GROUP			1
>>>>  #define   KVM_DEV_VFIO_GROUP_ADD			1
>>>>  #define   KVM_DEV_VFIO_GROUP_DEL			2
>>>> +#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE		3
>>>>  
>>>>  enum kvm_device_type {
>>>>  	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
>>>> @@ -1097,6 +1098,13 @@ enum kvm_device_type {
>>>>  	KVM_DEV_TYPE_MAX,
>>>>  };
>>>>  
>>>> +struct kvm_vfio_spapr_tce {
>>>> +	__u32	argsz;
>>>> +	__u32	flags;
>>>> +	__s32	groupfd;
>>>> +	__s32	tablefd;
>>>> +};
>>>> +
>>>>  /*
>>>>   * ioctls for VM fds
>>>>   */
>>>> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
>>>> index 15df8ae627d9..062407af09ee 100644
>>>> --- a/arch/powerpc/kvm/book3s_64_vio.c
>>>> +++ b/arch/powerpc/kvm/book3s_64_vio.c
>>>> @@ -27,6 +27,10 @@
>>>>  #include <linux/hugetlb.h>
>>>>  #include <linux/list.h>
>>>>  #include <linux/anon_inodes.h>
>>>> +#include <linux/iommu.h>
>>>> +#include <linux/file.h>
>>>> +#include <linux/vfio.h>
>>>> +#include <linux/module.h>
>>>>  
>>>>  #include <asm/tlbflush.h>
>>>>  #include <asm/kvm_ppc.h>
>>>> @@ -39,6 +43,36 @@
>>>>  #include <asm/udbg.h>
>>>>  #include <asm/iommu.h>
>>>>  #include <asm/tce.h>
>>>> +#include <asm/mmu_context.h>
>>>> +
>>>> +static void kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
>>>> +{
>>>> +	void (*fn)(struct vfio_group *);
>>>> +
>>>> +	fn = symbol_get(vfio_group_put_external_user);
>>>> +	if (WARN_ON(!fn))
>>>> +		return;
>>>> +
>>>> +	fn(vfio_group);
>>>> +
>>>> +	symbol_put(vfio_group_put_external_user);
>>>> +}
>>>> +
>>>> +static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
>>>> +{
>>>> +	int (*fn)(struct vfio_group *);
>>>> +	int ret = -1;
>>>> +
>>>> +	fn = symbol_get(vfio_external_user_iommu_id);
>>>> +	if (!fn)
>>>> +		return ret;
>>>> +
>>>> +	ret = fn(vfio_group);
>>>> +
>>>> +	symbol_put(vfio_external_user_iommu_id);
>>>> +
>>>> +	return ret;
>>>> +}
>>>>  
>>>>  static unsigned long kvmppc_tce_pages(unsigned long iommu_pages)
>>>>  {
>>>> @@ -90,6 +124,130 @@ static long kvmppc_account_memlimit(unsigned long stt_pages, bool inc)
>>>>  	return ret;
>>>>  }
>>>>  
>>>> +static void kvm_spapr_tce_iommu_table_free(struct rcu_head *head)
>>>> +{
>>>> +	struct kvmppc_spapr_tce_iommu_table *stit = container_of(head,
>>>> +			struct kvmppc_spapr_tce_iommu_table, rcu);
>>>> +
>>>> +	iommu_table_put(stit->tbl);
>>>> +	kvm_vfio_group_put_external_user(stit->group);
>>>> +
>>>> +	kfree(stit);
>>>> +}
>>>> +
>>>> +static void kvm_spapr_tce_liobn_release_iommu_group(
>>>> +		struct kvmppc_spapr_tce_table *stt,
>>>> +		struct vfio_group *group)
>>>> +{
>>>> +	struct kvmppc_spapr_tce_iommu_table *stit, *tmp;
>>>> +
>>>> +	list_for_each_entry_safe(stit, tmp, &stt->iommu_tables, next) {
>>>> +		if (group && (stit->group != group))
>>>> +			continue;
>>>> +
>>>> +		list_del_rcu(&stit->next);
>>>> +
>>>> +		call_rcu(&stit->rcu, kvm_spapr_tce_iommu_table_free);
>>>> +	}
>>>> +}
>>>> +
>>>> +extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
>>>> +		struct vfio_group *group)
>>>> +{
>>>> +	struct kvmppc_spapr_tce_table *stt;
>>>> +
>>>> +	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list)
>>>> +		kvm_spapr_tce_liobn_release_iommu_group(stt, group);
>>>> +}
>>>> +
>>>> +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
>>>> +		struct vfio_group *group)
>>>> +{
>>>> +	struct kvmppc_spapr_tce_table *stt = NULL;
>>>> +	bool found = false;
>>>> +	struct iommu_table *tbl = NULL;
>>>> +	struct iommu_table_group *table_group;
>>>> +	long i, ret = 0;
>>>> +	struct kvmppc_spapr_tce_iommu_table *stit;
>>>> +	struct fd f;
>>>> +	int group_id;
>>>> +	struct iommu_group *grp;
>>>> +
>>>> +	group_id = kvm_vfio_external_user_iommu_id(group);
>>>> +	grp = iommu_group_get_by_id(group_id);
>>>> +	if (WARN_ON(!grp))
>>>> +		return -EIO;
>>>
>>> I think it would be nicer to have a function that goes directly from
>>> vfio_group to iommu_group rather than going via id.  That can be a
>>> later cleanup, though.
>>>
>>>> +
>>>> +	f = fdget(tablefd);
>>>> +	if (!f.file) {
>>>> +		ret = -EBADF;
>>>> +		goto put_exit;
>>>> +	}
>>>> +
>>>> +	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list) {
>>>> +		if (stt == f.file->private_data) {
>>>> +			found = true;
>>>> +			break;
>>>> +		}
>>>> +	}
>>>> +
>>>> +	fdput(f);
>>>> +
>>>> +	if (!found) {
>>>> +		ret = -EINVAL;
>>>> +		goto put_exit;
>>>> +	}
>>>> +
>>>> +	table_group = iommu_group_get_iommudata(grp);
>>>> +	if (WARN_ON(!table_group)) {
>>>> +		ret = -EFAULT;
>>>> +		goto put_exit;
>>>> +	}
>>>> +
>>>> +	for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
>>>> +		struct iommu_table *tbltmp = table_group->tables[i];
>>>> +
>>>> +		if (!tbltmp)
>>>> +			continue;
>>>> +
>>>> +		/*
>>>> +		 * Make sure hardware table parameters are exactly the same;
>>>> +		 * this is used in the TCE handlers where boundary checks
>>>> +		 * use only the first attached table.
>>>> +		 */
>>>> +		if ((tbltmp->it_page_shift == stt->page_shift) &&
>>>> +				(tbltmp->it_offset == stt->offset) &&
>>>> +				(tbltmp->it_size == stt->size)) {
>>>> +			tbl = tbltmp;
>>>> +			break;
>>>> +		}
>>>> +	}
>>>> +	if (!tbl) {
>>>> +		ret = -EINVAL;
>>>> +		goto put_exit;
>>>> +	}
>>>> +
>>>> +	list_for_each_entry_rcu(stit, &stt->iommu_tables, next) {
>>>> +		if ((stit->tbl == tbl) && (stit->group == group)) {
>>>> +			ret = -EBUSY;
>>>> +			goto put_exit;
>>>> +		}
>>>> +	}
>>>> +
>>>> +	iommu_table_get(tbl);
>>>> +
>>>> +	stit = kzalloc(sizeof(*stit), GFP_KERNEL);
>>>> +	stit->tbl = tbl;
>>>> +	stit->group = group;
>>>> +
>>>> +	list_add_rcu(&stit->next, &stt->iommu_tables);
>>>> +
>>>> +put_exit:
>>>> +	iommu_group_put(grp);
>>>> +
>>>> +	return ret;
>>>> +}
>>>> +
>>>>  static void release_spapr_tce_table(struct rcu_head *head)
>>>>  {
>>>>  	struct kvmppc_spapr_tce_table *stt = container_of(head,
>>>> @@ -132,6 +290,8 @@ static int kvm_spapr_tce_release(struct inode *inode, struct file *filp)
>>>>  
>>>>  	list_del_rcu(&stt->list);
>>>>  
>>>> +	kvm_spapr_tce_liobn_release_iommu_group(stt, NULL /* release all */);
>>>> +
>>>>  	kvm_put_kvm(stt->kvm);
>>>>  
>>>>  	kvmppc_account_memlimit(
>>>> @@ -181,6 +341,7 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>>>>  	stt->offset = args->offset;
>>>>  	stt->size = size;
>>>>  	stt->kvm = kvm;
>>>> +	INIT_LIST_HEAD_RCU(&stt->iommu_tables);
>>>>  
>>>>  	for (i = 0; i < npages; i++) {
>>>>  		stt->pages[i] = alloc_page(GFP_KERNEL | __GFP_ZERO);
>>>> @@ -209,11 +370,102 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>>>>  	return ret;
>>>>  }
>>>>  
>>>> +static void kvmppc_clear_tce(struct iommu_table *tbl, unsigned long entry)
>>>> +{
>>>> +	unsigned long hpa = 0;
>>>> +	enum dma_data_direction dir = DMA_NONE;
>>>> +
>>>> +	iommu_tce_xchg(tbl, entry, &hpa, &dir);
>>>> +}
>>>> +
>>>> +static long kvmppc_tce_iommu_mapped_dec(struct kvm *kvm,
>>>> +		struct iommu_table *tbl, unsigned long entry)
>>>> +{
>>>> +	struct mm_iommu_table_group_mem_t *mem = NULL;
>>>
>>> Initialization is unnecessary for the variable above.
>>>
>>>> +	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
>>>> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
>>>> +
>>>> +	if (WARN_ON(!pua))
>>>> +		return H_HARDWARE;
>>>> +
>>>> +	mem = mm_iommu_lookup(kvm->mm, *pua, pgsize);
>>>> +	if (!mem)
>>>> +		return H_TOO_HARD;
>>>> +
>>>> +	mm_iommu_mapped_dec(mem);
>>>> +
>>>> +	*pua = 0;
>>>> +
>>>> +	return H_SUCCESS;
>>>> +}
>>>> +
>>>> +static long kvmppc_tce_iommu_unmap(struct kvm *kvm,
>>>> +		struct iommu_table *tbl, unsigned long entry)
>>>> +{
>>>> +	enum dma_data_direction dir = DMA_NONE;
>>>> +	unsigned long hpa = 0;
>>>> +	long ret;
>>>> +
>>>> +	if (iommu_tce_xchg(tbl, entry, &hpa, &dir))
>>>> +		return H_HARDWARE;
>>>> +
>>>> +	if (dir == DMA_NONE)
>>>> +		return H_SUCCESS;
>>>> +
>>>> +	ret = kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
>>>> +	if (ret != H_SUCCESS)
>>>> +		iommu_tce_xchg(tbl, entry, &hpa, &dir);
>>>> +
>>>> +	return ret;
>>>> +}
>>>> +
>>>> +long kvmppc_tce_iommu_map(struct kvm *kvm, struct iommu_table *tbl,
>>>> +		unsigned long entry, unsigned long gpa,
>>>> +		enum dma_data_direction dir)
>>>> +{
>>>> +	long ret;
>>>> +	unsigned long hpa, ua, *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
>>>> +	struct mm_iommu_table_group_mem_t *mem;
>>>> +
>>>> +	if (!pua)
>>>> +		/* it_userspace allocation might be delayed */
>>>> +		return H_TOO_HARD;
>>>> +
>>>> +	if (kvmppc_gpa_to_ua(kvm, gpa, &ua, NULL))
>>>> +		return H_PARAMETER;
>>>> +
>>>> +	mem = mm_iommu_lookup(kvm->mm, ua, 1ULL << tbl->it_page_shift);
>>>> +	if (!mem)
>>>> +		return H_TOO_HARD;
>>>> +
>>>> +	if (WARN_ON(mm_iommu_ua_to_hpa(mem, ua, &hpa)))
>>>> +		return H_HARDWARE;
>>>
>>> Remind me - could a failure here be triggered by userspace failing to
>>> preregister all the memory it should?
>>
>>
>> mm_iommu_lookup() should fail first as the only reason for
>> mm_iommu_ua_to_hpa() to fail is (entry >= mem->entries) but this is checked
>> in mm_iommu_lookup().
> 
> Ah, yes of course, I forgot.  Ok, that should be fine.
> 
> 
>>>
>>>> +	if (mm_iommu_mapped_inc(mem))
>>>> +		return H_CLOSED;
>>>> +
>>>> +	ret = iommu_tce_xchg(tbl, entry, &hpa, &dir);
>>>> +	if (ret) {
>>>> +		mm_iommu_mapped_dec(mem);
>>>> +		return H_TOO_HARD;
>>>> +	}
>>>> +
>>>> +	if (dir != DMA_NONE)
>>>> +		kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
>>>> +
>>>> +	*pua = ua;
>>>> +
>>>> +	return 0;
>>>> +}
>>>> +
>>>>  long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>>>>  		      unsigned long ioba, unsigned long tce)
>>>>  {
>>>>  	struct kvmppc_spapr_tce_table *stt;
>>>> -	long ret;
>>>> +	long ret, idx;
>>>> +	struct kvmppc_spapr_tce_iommu_table *stit;
>>>> +	unsigned long entry, gpa;
>>>> +	enum dma_data_direction dir;
>>>>  
>>>>  	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
>>>>  	/* 	    liobn, ioba, tce); */
>>>> @@ -230,7 +482,28 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>>>>  	if (ret != H_SUCCESS)
>>>>  		return ret;
>>>>  
>>>> -	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
>>>> +	entry = ioba >> stt->page_shift;
>>>> +	gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
>>>> +	dir = iommu_tce_direction(tce);
>>>> +
>>>> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
>>>> +		if (dir == DMA_NONE) {
>>>> +			ret = kvmppc_tce_iommu_unmap(vcpu->kvm,
>>>> +					stit->tbl, entry);
>>>> +		} else {
>>>> +			idx = srcu_read_lock(&vcpu->kvm->srcu);
>>>> +			ret = kvmppc_tce_iommu_map(vcpu->kvm, stit->tbl,
>>>> +					entry, gpa, dir);
>>>> +			srcu_read_unlock(&vcpu->kvm->srcu, idx);
>>>> +		}
>>>> +
>>>> +		if (WARN_ON_ONCE(ret == H_HARDWARE))
>>>> +			kvmppc_clear_tce(stit->tbl, entry);
>>>> +		else if (ret != H_SUCCESS)
>>>> +			return ret;
>>>> +	}
>>>> +
>>>> +	kvmppc_tce_put(stt, entry, tce);
>>>>  
>>>>  	return H_SUCCESS;
>>>>  }
>>>> @@ -242,9 +515,10 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>>>  {
>>>>  	struct kvmppc_spapr_tce_table *stt;
>>>>  	long i, ret = H_SUCCESS, idx;
>>>> -	unsigned long entry, ua = 0;
>>>> +	unsigned long entry, ua = 0, gpa;
>>>>  	u64 __user *tces;
>>>>  	u64 tce;
>>>> +	struct kvmppc_spapr_tce_iommu_table *stit;
>>>>  
>>>>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>>>>  	if (!stt)
>>>> @@ -283,6 +557,18 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>>>  		if (ret != H_SUCCESS)
>>>>  			goto unlock_exit;
>>>>  
>>>> +		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
>>>> +
>>>> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
>>>> +			ret = kvmppc_tce_iommu_map(vcpu->kvm,
>>>> +					stit->tbl, entry + i, gpa,
>>>> +					iommu_tce_direction(tce));
>>>> +			if (WARN_ON_ONCE(ret == H_HARDWARE))
>>>> +				kvmppc_clear_tce(stit->tbl, entry);
>>>> +			else if (ret != H_SUCCESS)
>>>> +				goto unlock_exit;
>>>> +		}
>>>> +
>>>>  		kvmppc_tce_put(stt, entry + i, tce);
>>>>  	}
>>>>  
>>>> @@ -299,6 +585,7 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
>>>>  {
>>>>  	struct kvmppc_spapr_tce_table *stt;
>>>>  	long i, ret;
>>>> +	struct kvmppc_spapr_tce_iommu_table *stit;
>>>>  
>>>>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>>>>  	if (!stt)
>>>> @@ -312,6 +599,20 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
>>>>  	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
>>>>  		return H_PARAMETER;
>>>>  
>>>> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
>>>> +		unsigned long entry = ioba >> stit->tbl->it_page_shift;
>>>> +
>>>> +		for (i = 0; i < npages; ++i) {
>>>> +			ret = kvmppc_tce_iommu_unmap(vcpu->kvm,
>>>> +					stit->tbl, entry + i);
>>>> +
>>>> +			if (WARN_ON_ONCE(ret == H_HARDWARE))
>>>> +				kvmppc_clear_tce(stit->tbl, entry);
>>>> +			else if (ret != H_SUCCESS)
>>>> +				return ret;
>>>> +		}
>>>> +	}
>>>> +
>>>>  	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
>>>>  		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
>>>>  
>>>> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
>>>> index 92d769f4eaea..4a1d978ebd98 100644
>>>> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
>>>> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
>>>> @@ -161,11 +161,111 @@ long kvmppc_gpa_to_ua(struct kvm *kvm, unsigned long gpa,
>>>>  EXPORT_SYMBOL_GPL(kvmppc_gpa_to_ua);
>>>>  
>>>>  #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
>>>> +static void kvmppc_rm_clear_tce(struct iommu_table *tbl, unsigned long entry)
>>>> +{
>>>> +	unsigned long hpa = 0;
>>>> +	enum dma_data_direction dir = DMA_NONE;
>>>> +
>>>> +	iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
>>>> +}
>>>> +
>>>> +static long kvmppc_rm_tce_iommu_mapped_dec(struct kvm *kvm,
>>>> +		struct iommu_table *tbl, unsigned long entry)
>>>> +{
>>>> +	struct mm_iommu_table_group_mem_t *mem = NULL;
>>>> +	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
>>>> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
>>>> +
>>>> +	if (WARN_ON(!pua))
>>>> +		return H_HARDWARE;
>>>> +
>>>> +	pua = (void *) vmalloc_to_phys(pua);
>>>> +	if (!pua)
>>>> +		return H_TOO_HARD;
>>>> +
>>>> +	mem = mm_iommu_lookup_rm(kvm->mm, *pua, pgsize);
>>>> +	if (!mem)
>>>> +		return H_TOO_HARD;
>>>> +
>>>> +	mm_iommu_mapped_dec(mem);
>>>> +
>>>> +	*pua = 0;
>>>> +
>>>> +	return H_SUCCESS;
>>>> +}
>>>> +
>>>> +static long kvmppc_rm_tce_iommu_unmap(struct kvm *kvm,
>>>> +		struct iommu_table *tbl, unsigned long entry)
>>>> +{
>>>> +	enum dma_data_direction dir = DMA_NONE;
>>>> +	unsigned long hpa = 0;
>>>> +	long ret;
>>>> +
>>>> +	if (iommu_tce_xchg_rm(tbl, entry, &hpa, &dir))
>>>> +		return H_HARDWARE;
>>>
>>> To avoid a double WARN() (and to make the warnings easier to
>>> understand) I'd suggest putting a WARN_ON() here, rather than in the
>>> callers when they receive an H_HARDWARE.  IIUC this really shouldn't
>>> ever happen, and it certainly can't be the guest's fault?
>>
>>
>> Makes sense.
> 
> I guess it might want WARN_ON_ONCE() to avoid spamming the user with
> errors for every TCE, though.


We do not expect this to happen at all :) I can convert all of them to
WARN_ON_ONCE() really, as the purpose of WARN_ON() here is mostly to
document what we do not expect.
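
To illustrate the point about not spamming the log, here is a rough
userspace model of the "warn once, still report the failure" behaviour.
It is only a sketch: warn_on_once_model() and unmap_all_model() are
hypothetical names, not kernel code, and every update is assumed to fail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Number of warnings actually printed, kept only so the sketch is checkable. */
static int warnings_printed;

/*
 * Hypothetical userspace model of WARN_ON_ONCE(): report the condition the
 * first time it is true, but always hand the condition back to the caller.
 */
static bool warn_on_once_model(bool cond, bool *already_warned)
{
	if (cond && !*already_warned) {
		*already_warned = true;
		warnings_printed++;
		fprintf(stderr, "WARNING: unexpected TCE update failure\n");
	}
	return cond;
}

/*
 * Pretend to unmap npages entries where every update fails.
 * Returns how many failures the caller still observed.
 */
static int unmap_all_model(int npages)
{
	/*
	 * Plays the role of the per-call-site flag; the kernel keeps its
	 * flag for as long as the kernel runs, here it lasts for one call.
	 */
	bool warned = false;
	int failures = 0;

	assert(npages >= 0);

	for (int i = 0; i < npages; ++i) {
		bool failed = true;	/* assumption: the update failed */

		if (warn_on_once_model(failed, &warned))
			failures++;
	}
	return failures;
}
```

With this model a request covering many entries produces a single
warning, while the caller can still act on each individual failure.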



>>>> +
>>>> +	if (dir == DMA_NONE)
>>>> +		return H_SUCCESS;
>>>> +
>>>> +	ret = kvmppc_rm_tce_iommu_mapped_dec(kvm, tbl, entry);
>>>> +	if (ret)
>>>> +		iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
>>>> +
>>>> +	return ret;
>>>> +}
>>>> +
>>>> +static long kvmppc_rm_tce_iommu_map(struct kvm *kvm, struct iommu_table *tbl,
>>>> +		unsigned long entry, unsigned long gpa,
>>>> +		enum dma_data_direction dir)
>>>> +{
>>>> +	long ret;
>>>> +	unsigned long hpa = 0, ua;
>>>> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
>>>> +	struct mm_iommu_table_group_mem_t *mem;
>>>> +
>>>> +	if (!pua)
>>>> +		/* it_userspace allocation might be delayed */
>>>> +		return H_TOO_HARD;
>>>> +
>>>> +	if (kvmppc_gpa_to_ua(kvm, gpa, &ua, NULL))
>>>> +		return H_PARAMETER;
>>
>>
>> Referred below as [1]
>>
>>
>>>> +
>>>> +	mem = mm_iommu_lookup_rm(kvm->mm, ua, 1ULL << tbl->it_page_shift);
>>>> +	if (!mem)
>>>> +		return H_TOO_HARD;
>>>> +
>>>> +	if (WARN_ON(mm_iommu_ua_to_hpa_rm(mem, ua, &hpa)))
>>>> +		return H_HARDWARE;
>>>> +
>>>> +	pua = (void *) vmalloc_to_phys(pua);
>>>> +	if (WARN_ON(!pua))
>>>> +		return H_HARDWARE;
>>>> +
>>>> +	if (WARN_ON(mm_iommu_mapped_inc(mem)))
>>>> +		return H_CLOSED;
>>>> +
>>>> +	ret = iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
>>>> +	if (ret) {
>>>> +		mm_iommu_mapped_dec(mem);
>>>> +		return H_TOO_HARD;
>>>> +	}
>>>> +
>>>> +	if (dir != DMA_NONE)
>>>> +		kvmppc_rm_tce_iommu_mapped_dec(kvm, tbl, entry);
>>>> +
>>>> +	*pua = ua;
>>>> +
>>>> +	return 0;
>>>> +}
>>>> +
>>>>  long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>>>>  		unsigned long ioba, unsigned long tce)
>>>>  {
>>>>  	struct kvmppc_spapr_tce_table *stt;
>>>>  	long ret;
>>>> +	struct kvmppc_spapr_tce_iommu_table *stit;
>>>> +	unsigned long entry, gpa;
>>>> +	enum dma_data_direction dir;
>>>>  
>>>>  	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
>>>>  	/* 	    liobn, ioba, tce); */
>>>> @@ -182,7 +282,25 @@ long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>>>>  	if (ret != H_SUCCESS)
>>>>  		return ret;
>>>>  
>>>> -	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
>>>> +	entry = ioba >> stt->page_shift;
>>>> +	gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
>>>> +	dir = iommu_tce_direction(tce);
>>>> +
>>>> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
>>>> +		if (dir == DMA_NONE)
>>>> +			ret = kvmppc_rm_tce_iommu_unmap(vcpu->kvm,
>>>> +					stit->tbl, entry);
>>>> +		else
>>>> +			ret = kvmppc_rm_tce_iommu_map(vcpu->kvm,
>>>> +					stit->tbl, entry, gpa, dir);
>>>> +
>>>> +		if (WARN_ON_ONCE(ret == H_HARDWARE))
>>>> +			kvmppc_rm_clear_tce(stit->tbl, entry);
>>>> +		else if (ret != H_SUCCESS)
>>>> +			return ret;
>>>> +	}
>>>> +
>>>> +	kvmppc_tce_put(stt, entry, tce);
>>>>  
>>>>  	return H_SUCCESS;
>>>>  }
>>>> @@ -220,9 +338,10 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>>>  {
>>>>  	struct kvmppc_spapr_tce_table *stt;
>>>>  	long i, ret = H_SUCCESS;
>>>> -	unsigned long tces, entry, ua = 0;
>>>> +	unsigned long tces, entry, ua = 0, tce, gpa;
>>>>  	unsigned long *rmap = NULL;
>>>>  	bool prereg = false;
>>>> +	struct kvmppc_spapr_tce_iommu_table *stit;
>>>>  
>>>>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>>>>  	if (!stt)
>>>> @@ -287,12 +406,24 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>>>  	}
>>>>  
>>>>  	for (i = 0; i < npages; ++i) {
>>>> -		unsigned long tce = be64_to_cpu(((u64 *)tces)[i]);
>>>> +		tce = be64_to_cpu(((u64 *)tces)[i]);
>>>>  
>>>>  		ret = kvmppc_tce_validate(stt, tce);
>>>>  		if (ret != H_SUCCESS)
>>>>  			goto unlock_exit;
>>>>  
>>>> +		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
>>>> +
>>>> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
>>>> +			ret = kvmppc_rm_tce_iommu_map(vcpu->kvm,
>>>> +					stit->tbl, entry + i, gpa,
>>>> +					iommu_tce_direction(tce));
>>>> +			if (WARN_ON_ONCE(ret == H_HARDWARE))
>>>
>>> I don't think you need the WARN() here - the only H_HARDWARE failure
>>> path in iommu_map() already includes a WARN().
>>
>>
>> True, I can drop it here.
>>
>>
>>>
>>>> +				kvmppc_rm_clear_tce(stit->tbl, entry);
>>>> +			else if (ret != H_SUCCESS)
>>>> +				goto unlock_exit;
>>>
>>> It's also not clear to me why the H_HARDWARE error path clears the
>>> entry, but the other failure paths don't.  Or why an H_HARDWARE will
>>> result in continuing to set the rest of the TCEs, but other failures
>>> won't.
>>
>>
>> The idea was that other failures still have some chance that handling may
>> succeed in virtual mode or via QEMU, H_HARDWARE is fatal.
> 
> Um... yes.. but the logic seems to be backwards for that: on
> H_HARDWARE you warn and keep going, on other errors you bail out
> entirely.

By "fatal" I mean fatal for this particular hardware TCE(s): there is no
hope in retrying this particular TCE in virtual mode.
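
The policy can be sketched as a small userspace model. This is only an
illustration under assumptions: the MODEL_* codes, struct model_table and
put_tce_model() are hypothetical, and the numeric values are arbitrary
rather than the real hcall return codes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical return codes; the values are arbitrary, not the PAPR ones. */
enum model_ret {
	MODEL_SUCCESS,
	MODEL_HARDWARE,		/* this hardware TCE cannot be updated at all */
	MODEL_TOO_HARD,		/* a slower path may still succeed */
};

/* Hypothetical stand-in for one attached hardware table. */
struct model_table {
	enum model_ret update_result;	/* what the update attempt returns */
	int cleared;			/* set when the entry was cleared */
};

/*
 * Apply one TCE update to every attached table.
 * MODEL_HARDWARE: clear the entry and carry on with the next table.
 * Any other failure: stop and let the caller retry the whole request
 * on a slower path.
 */
static enum model_ret put_tce_model(struct model_table *tables, size_t n)
{
	assert(tables != NULL || n == 0);

	for (size_t i = 0; i < n; ++i) {
		enum model_ret ret = tables[i].update_result;

		if (ret == MODEL_HARDWARE)
			tables[i].cleared = 1;
		else if (ret != MODEL_SUCCESS)
			return ret;
	}
	return MODEL_SUCCESS;
}
```

So a table reporting MODEL_HARDWARE ends up with a cleared entry and the
request still completes, while MODEL_TOO_HARD aborts the loop early.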


> 
>> I am just not sure if H_PARAMETER is what I want to return at [1]; to make
>> the calling code simpler, I could return H_HARDWARE there as well (instead
>> of H_PARAMETER).
> 
> That sounds right, IIUC the gpa to ua translation shouldn't ever
> fail because of something the guest did. 


The guest can easily pass a bad TCE/GPA which is not in any registered slot,
so it is rather H_PARAMETER.

> So I'd expect either
> H_HARDWARE, or H_TOO_HARD (if there's some hope that virtual mode can
> make the translation when real mode couldn't).

No, virtual mode uses the exact same helper.


>>>> +		}
>>>> +
>>>>  		kvmppc_tce_put(stt, entry + i, tce);
>>>>  	}
>>>>  
>>>> @@ -309,6 +440,7 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
>>>>  {
>>>>  	struct kvmppc_spapr_tce_table *stt;
>>>>  	long i, ret;
>>>> +	struct kvmppc_spapr_tce_iommu_table *stit;
>>>>  
>>>>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>>>>  	if (!stt)
>>>> @@ -322,6 +454,20 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
>>>>  	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
>>>>  		return H_PARAMETER;
>>>>  
>>>> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
>>>> +		unsigned long entry = ioba >> stit->tbl->it_page_shift;
>>>> +
>>>> +		for (i = 0; i < npages; ++i) {
>>>> +			ret = kvmppc_rm_tce_iommu_unmap(vcpu->kvm,
>>>> +					stit->tbl, entry + i);
>>>> +
>>>> +			if (WARN_ON_ONCE(ret == H_HARDWARE))
>>>> +				kvmppc_rm_clear_tce(stit->tbl, entry);
>>>
>>> As noted earlier, I think this WARN belongs within iommu_unmap()
>>> rather than out here.
>>>
>>>> +			else if (ret != H_SUCCESS)
>>>> +				return ret;
>>>> +		}
>>>> +	}
>>>> +
>>>>  	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
>>>>  		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
>>>>  
>>>> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
>>>> index cd892dec7cb6..f3127dc87912 100644
>>>> --- a/arch/powerpc/kvm/powerpc.c
>>>> +++ b/arch/powerpc/kvm/powerpc.c
>>>> @@ -536,6 +536,8 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>>>>  #ifdef CONFIG_PPC_BOOK3S_64
>>>>  	case KVM_CAP_SPAPR_TCE:
>>>>  	case KVM_CAP_SPAPR_TCE_64:
>>>> +		/* fallthrough */
>>>> +	case KVM_CAP_SPAPR_TCE_VFIO:
>>>>  	case KVM_CAP_PPC_RTAS:
>>>>  	case KVM_CAP_PPC_FIXUP_HCALL:
>>>>  	case KVM_CAP_PPC_ENABLE_HCALL:
>>>> diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
>>>> index d32f239eb471..2b7dc22265fe 100644
>>>> --- a/virt/kvm/vfio.c
>>>> +++ b/virt/kvm/vfio.c
>>>> @@ -20,6 +20,10 @@
>>>>  #include <linux/vfio.h>
>>>>  #include "vfio.h"
>>>>  
>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>>>> +#include <asm/kvm_ppc.h>
>>>> +#endif
>>>> +
>>>>  struct kvm_vfio_group {
>>>>  	struct list_head node;
>>>>  	struct vfio_group *vfio_group;
>>>> @@ -211,6 +215,9 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
>>>>  
>>>>  		mutex_unlock(&kv->lock);
>>>>  
>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>>>> +		kvm_spapr_tce_release_iommu_group(dev->kvm, vfio_group);
>>>> +#endif
>>>>  		kvm_vfio_group_set_kvm(vfio_group, NULL);
>>>>  
>>>>  		kvm_vfio_group_put_external_user(vfio_group);
>>>> @@ -218,6 +225,53 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
>>>>  		kvm_vfio_update_coherency(dev);
>>>>  
>>>>  		return ret;
>>>> +
>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>>>> +	case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: {
>>>> +		struct kvm_vfio_spapr_tce param;
>>>> +		unsigned long minsz;
>>>> +		struct kvm_vfio *kv = dev->private;
>>>> +		struct vfio_group *vfio_group;
>>>> +		struct kvm_vfio_group *kvg;
>>>> +		struct fd f;
>>>> +
>>>> +		minsz = offsetofend(struct kvm_vfio_spapr_tce, tablefd);
>>>> +
>>>> +		if (copy_from_user(&param, (void __user *)arg, minsz))
>>>> +			return -EFAULT;
>>>> +
>>>> +		if (param.argsz < minsz || param.flags)
>>>> +			return -EINVAL;
>>>> +
>>>> +		f = fdget(param.groupfd);
>>>> +		if (!f.file)
>>>> +			return -EBADF;
>>>> +
>>>> +		vfio_group = kvm_vfio_group_get_external_user(f.file);
>>>> +		fdput(f);
>>>> +
>>>> +		if (IS_ERR(vfio_group))
>>>> +			return PTR_ERR(vfio_group);
>>>> +
>>>> +		ret = -ENOENT;
>>>> +
>>>> +		mutex_lock(&kv->lock);
>>>> +
>>>> +		list_for_each_entry(kvg, &kv->group_list, node) {
>>>> +			if (kvg->vfio_group != vfio_group)
>>>> +				continue;
>>>> +
>>>> +			ret = kvm_spapr_tce_attach_iommu_group(dev->kvm,
>>>> +					param.tablefd, vfio_group);
>>>> +
>>>> +			break;
>>>> +		}
>>>> +
>>>> +		mutex_unlock(&kv->lock);
>>>> +
>>>> +		return ret;
>>>> +	}
>>>> +#endif /* CONFIG_SPAPR_TCE_IOMMU */
>>>>  	}
>>>>  
>>>>  	return -ENXIO;
>>>> @@ -242,6 +296,9 @@ static int kvm_vfio_has_attr(struct kvm_device *dev,
>>>>  		switch (attr->attr) {
>>>>  		case KVM_DEV_VFIO_GROUP_ADD:
>>>>  		case KVM_DEV_VFIO_GROUP_DEL:
>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>>>> +		case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE:
>>>> +#endif
>>>>  			return 0;
>>>>  		}
>>>>  
>>>> @@ -257,6 +314,9 @@ static void kvm_vfio_destroy(struct kvm_device *dev)
>>>>  	struct kvm_vfio_group *kvg, *tmp;
>>>>  
>>>>  	list_for_each_entry_safe(kvg, tmp, &kv->group_list, node) {
>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>>>> +		kvm_spapr_tce_release_iommu_group(dev->kvm, kvg->vfio_group);
>>>> +#endif
>>>>  		kvm_vfio_group_set_kvm(kvg->vfio_group, NULL);
>>>>  		kvm_vfio_group_put_external_user(kvg->vfio_group);
>>>>  		list_del(&kvg->node);
>>>
>>
>>
> 
> 
> 
>
Alexey Kardashevskiy Feb. 24, 2017, 3:46 a.m. | #5
On 24/02/17 14:43, Alexey Kardashevskiy wrote:
> On 24/02/17 14:36, David Gibson wrote:
>> On Fri, Feb 24, 2017 at 02:29:14PM +1100, Alexey Kardashevskiy wrote:
>>> On 24/02/17 13:14, David Gibson wrote:
>>>> On Wed, Feb 22, 2017 at 07:21:33PM +1100, Alexey Kardashevskiy wrote:
>>>>> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
>>>>> and H_STUFF_TCE requests targeted at an IOMMU TCE table used for VFIO
>>>>> without passing them to user space which saves time on switching
>>>>> to user space and back.
>>>>>
>>>>> This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
>>>>> KVM tries to handle a TCE request in real mode; if that fails,
>>>>> it passes the request to virtual mode to complete the operation.
>>>>> If the virtual mode handler fails, the request is passed to
>>>>> user space; this is not expected to happen though.
>>>>>
>>>>> To avoid dealing with page use counters (which is tricky in real mode),
>>>>> this only accelerates SPAPR TCE IOMMU v2 clients which are required
>>>>> to pre-register the userspace memory. The very first TCE request will
>>>>> be handled in the VFIO SPAPR TCE driver anyway as the userspace view
>>>>> of the TCE table (iommu_table::it_userspace) is not allocated till
>>>>> the very first mapping happens and we cannot call vmalloc in real mode.
>>>>>
>>>>> If we fail to update a hardware IOMMU table for an unexpected reason, we just
>>>>> clear it and move on as there is nothing really we can do about it -
>>>>> for example, if we hot plug a VFIO device to a guest, existing TCE tables
>>>>> will be mirrored automatically to the hardware and there is no interface
>>>>> to report to the guest about possible failures.
>>>>>
>>>>> This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
>>>>> the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
>>>>> and associates a physical IOMMU table with the SPAPR TCE table (which
>>>>> is a guest view of the hardware IOMMU table). The iommu_table object
>>>>> is cached and referenced so we do not have to look it up in real mode.
>>>>>
>>>>> This does not implement the UNSET counterpart as there is no use for it -
>>>>> once the acceleration is enabled, the existing userspace won't
>>>>> disable it unless a VFIO container is destroyed; this adds necessary
>>>>> cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
>>>>>
>>>>> As this creates a descriptor per IOMMU table-LIOBN couple (called
>>>>> kvmppc_spapr_tce_iommu_table), it is possible to have several
>>>>> descriptors with the same iommu_table (hardware IOMMU table) attached
>>>>> to the same LIOBN; we do not remove duplicates though as
>>>>> iommu_table_ops::exchange does not just update a TCE entry (which is
>>>>> shared among IOMMU groups) but also invalidates the TCE cache
>>>>> (one per IOMMU group).
>>>>>
>>>>> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
>>>>> space.
>>>>>
>>>>> This finally makes use of vfio_external_user_iommu_id() which was
>>>>> introduced quite some time ago and was considered for removal.
>>>>>
>>>>> Tests show that this patch increases transmission speed from 220MB/s
>>>>> to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).
>>>>>
>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>
>>>> I have some comments on this patch, but all the definite ones are
>>>> pretty minor and could be done as later cleanups.
>>>>
>>>> I have some more serious queries, but they are just queries and
>>>> requests for clarification.  If there are satisfactory answers to
>>>> them, I'll add my R-b.
>>>>
>>>>
>>>> ---
>>>>> Changes:
>>>>> v5:
>>>>> * changed error codes in multiple places
>>>>> * added a bunch of WARN_ON() for conditions which should not really happen
>>>>> * added a check that an iommu table is not attached already to LIOBN
>>>>> * dropped explicit calls to iommu_tce_clear_param_check/
>>>>> iommu_tce_put_param_check as kvmppc_tce_validate/kvmppc_ioba_validate
>>>>> call them anyway (since the previous patch)
>>>>> * if we fail to update a hardware IOMMU table for unexpected reason,
>>>>> this just clears the entry
>>>>>
>>>>> v4:
>>>>> * added note to the commit log about allowing multiple updates of
>>>>> the same IOMMU table;
>>>>> * instead of checking for if any memory was preregistered, this
>>>>> returns H_TOO_HARD if a specific page was not;
>>>>> * fixed comments from v3 about error handling in many places;
>>>>> * simplified TCE handlers and merged IOMMU parts inline - for example,
>>>>> there used to be kvmppc_h_put_tce_iommu(), now it is merged into
>>>>> kvmppc_h_put_tce(); this allows to check IOBA boundaries against
>>>>> the first attached table only (makes the code simpler);
>>>>>
>>>>> v3:
>>>>> * simplified not to use VFIO group notifiers
>>>>> * reworked cleanup, should be cleaner/simpler now
>>>>>
>>>>> v2:
>>>>> * reworked to use new VFIO notifiers
>>>>> * now same iommu_table may appear in the list several times, to be fixed later
>>>>> ---
>>>>>  Documentation/virtual/kvm/devices/vfio.txt |  22 ++-
>>>>>  arch/powerpc/include/asm/kvm_host.h        |   8 +
>>>>>  arch/powerpc/include/asm/kvm_ppc.h         |   4 +
>>>>>  include/uapi/linux/kvm.h                   |   8 +
>>>>>  arch/powerpc/kvm/book3s_64_vio.c           | 307 ++++++++++++++++++++++++++++-
>>>>>  arch/powerpc/kvm/book3s_64_vio_hv.c        | 152 +++++++++++++-
>>>>>  arch/powerpc/kvm/powerpc.c                 |   2 +
>>>>>  virt/kvm/vfio.c                            |  60 ++++++
>>>>>  8 files changed, 555 insertions(+), 8 deletions(-)
>>>>>
>>>>> diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
>>>>> index ef51740c67ca..f95d867168ea 100644
>>>>> --- a/Documentation/virtual/kvm/devices/vfio.txt
>>>>> +++ b/Documentation/virtual/kvm/devices/vfio.txt
>>>>> @@ -16,7 +16,25 @@ Groups:
>>>>>  
>>>>>  KVM_DEV_VFIO_GROUP attributes:
>>>>>    KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
>>>>> +	kvm_device_attr.addr points to an int32_t file descriptor
>>>>> +	for the VFIO group.
>>>>>    KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
>>>>> +	kvm_device_attr.addr points to an int32_t file descriptor
>>>>> +	for the VFIO group.
>>>>> +  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: attaches a guest visible TCE table
>>>>> +	allocated by sPAPR KVM.
>>>>> +	kvm_device_attr.addr points to a struct:
>>>>>  
>>>>> -For each, kvm_device_attr.addr points to an int32_t file descriptor
>>>>> -for the VFIO group.
>>>>> +	struct kvm_vfio_spapr_tce {
>>>>> +		__u32	argsz;
>>>>> +		__u32	flags;
>>>>> +		__s32	groupfd;
>>>>> +		__s32	tablefd;
>>>>> +	};
>>>>> +
>>>>> +	where
>>>>> +	@argsz is the size of kvm_vfio_spapr_tce;
>>>>> +	@flags are not supported now, must be zero;
>>>>> +	@groupfd is a file descriptor for a VFIO group;
>>>>> +	@tablefd is a file descriptor for a TCE table allocated via
>>>>> +		KVM_CREATE_SPAPR_TCE.
>>>>> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
>>>>> index e59b172666cd..a827006941f8 100644
>>>>> --- a/arch/powerpc/include/asm/kvm_host.h
>>>>> +++ b/arch/powerpc/include/asm/kvm_host.h
>>>>> @@ -191,6 +191,13 @@ struct kvmppc_pginfo {
>>>>>  	atomic_t refcnt;
>>>>>  };
>>>>>  
>>>>> +struct kvmppc_spapr_tce_iommu_table {
>>>>> +	struct rcu_head rcu;
>>>>> +	struct list_head next;
>>>>> +	struct vfio_group *group;
>>>>> +	struct iommu_table *tbl;
>>>>> +};
>>>>> +
>>>>>  struct kvmppc_spapr_tce_table {
>>>>>  	struct list_head list;
>>>>>  	struct kvm *kvm;
>>>>> @@ -199,6 +206,7 @@ struct kvmppc_spapr_tce_table {
>>>>>  	u32 page_shift;
>>>>>  	u64 offset;		/* in pages */
>>>>>  	u64 size;		/* window size in pages */
>>>>> +	struct list_head iommu_tables;
>>>>>  	struct page *pages[0];
>>>>>  };
>>>>>  
>>>>> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
>>>>> index e04b7fb8ccaa..b8a39dec92cf 100644
>>>>> --- a/arch/powerpc/include/asm/kvm_ppc.h
>>>>> +++ b/arch/powerpc/include/asm/kvm_ppc.h
>>>>> @@ -163,6 +163,10 @@ extern long kvmppc_prepare_vrma(struct kvm *kvm,
>>>>>  extern void kvmppc_map_vrma(struct kvm_vcpu *vcpu,
>>>>>  			struct kvm_memory_slot *memslot, unsigned long porder);
>>>>>  extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
>>>>> +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
>>>>> +		struct vfio_group *group);
>>>>> +extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
>>>>> +		struct vfio_group *group);
>>>>>  
>>>>>  extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>>>>>  				struct kvm_create_spapr_tce_64 *args);
>>>>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>>>>> index a2c9bb5a0ead..cdfa01169bd2 100644
>>>>> --- a/include/uapi/linux/kvm.h
>>>>> +++ b/include/uapi/linux/kvm.h
>>>>> @@ -1076,6 +1076,7 @@ struct kvm_device_attr {
>>>>>  #define  KVM_DEV_VFIO_GROUP			1
>>>>>  #define   KVM_DEV_VFIO_GROUP_ADD			1
>>>>>  #define   KVM_DEV_VFIO_GROUP_DEL			2
>>>>> +#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE		3
>>>>>  
>>>>>  enum kvm_device_type {
>>>>>  	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
>>>>> @@ -1097,6 +1098,13 @@ enum kvm_device_type {
>>>>>  	KVM_DEV_TYPE_MAX,
>>>>>  };
>>>>>  
>>>>> +struct kvm_vfio_spapr_tce {
>>>>> +	__u32	argsz;
>>>>> +	__u32	flags;
>>>>> +	__s32	groupfd;
>>>>> +	__s32	tablefd;
>>>>> +};
>>>>> +
>>>>>  /*
>>>>>   * ioctls for VM fds
>>>>>   */
>>>>> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
>>>>> index 15df8ae627d9..062407af09ee 100644
>>>>> --- a/arch/powerpc/kvm/book3s_64_vio.c
>>>>> +++ b/arch/powerpc/kvm/book3s_64_vio.c
>>>>> @@ -27,6 +27,10 @@
>>>>>  #include <linux/hugetlb.h>
>>>>>  #include <linux/list.h>
>>>>>  #include <linux/anon_inodes.h>
>>>>> +#include <linux/iommu.h>
>>>>> +#include <linux/file.h>
>>>>> +#include <linux/vfio.h>
>>>>> +#include <linux/module.h>
>>>>>  
>>>>>  #include <asm/tlbflush.h>
>>>>>  #include <asm/kvm_ppc.h>
>>>>> @@ -39,6 +43,36 @@
>>>>>  #include <asm/udbg.h>
>>>>>  #include <asm/iommu.h>
>>>>>  #include <asm/tce.h>
>>>>> +#include <asm/mmu_context.h>
>>>>> +
>>>>> +static void kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
>>>>> +{
>>>>> +	void (*fn)(struct vfio_group *);
>>>>> +
>>>>> +	fn = symbol_get(vfio_group_put_external_user);
>>>>> +	if (WARN_ON(!fn))
>>>>> +		return;
>>>>> +
>>>>> +	fn(vfio_group);
>>>>> +
>>>>> +	symbol_put(vfio_group_put_external_user);
>>>>> +}
>>>>> +
>>>>> +static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
>>>>> +{
>>>>> +	int (*fn)(struct vfio_group *);
>>>>> +	int ret = -1;
>>>>> +
>>>>> +	fn = symbol_get(vfio_external_user_iommu_id);
>>>>> +	if (!fn)
>>>>> +		return ret;
>>>>> +
>>>>> +	ret = fn(vfio_group);
>>>>> +
>>>>> +	symbol_put(vfio_external_user_iommu_id);
>>>>> +
>>>>> +	return ret;
>>>>> +}
>>>>>  
>>>>>  static unsigned long kvmppc_tce_pages(unsigned long iommu_pages)
>>>>>  {
>>>>> @@ -90,6 +124,130 @@ static long kvmppc_account_memlimit(unsigned long stt_pages, bool inc)
>>>>>  	return ret;
>>>>>  }
>>>>>  
>>>>> +static void kvm_spapr_tce_iommu_table_free(struct rcu_head *head)
>>>>> +{
>>>>> +	struct kvmppc_spapr_tce_iommu_table *stit = container_of(head,
>>>>> +			struct kvmppc_spapr_tce_iommu_table, rcu);
>>>>> +
>>>>> +	iommu_table_put(stit->tbl);
>>>>> +	kvm_vfio_group_put_external_user(stit->group);
>>>>> +
>>>>> +	kfree(stit);
>>>>> +}
>>>>> +
>>>>> +static void kvm_spapr_tce_liobn_release_iommu_group(
>>>>> +		struct kvmppc_spapr_tce_table *stt,
>>>>> +		struct vfio_group *group)
>>>>> +{
>>>>> +	struct kvmppc_spapr_tce_iommu_table *stit, *tmp;
>>>>> +
>>>>> +	list_for_each_entry_safe(stit, tmp, &stt->iommu_tables, next) {
>>>>> +		if (group && (stit->group != group))
>>>>> +			continue;
>>>>> +
>>>>> +		list_del_rcu(&stit->next);
>>>>> +
>>>>> +		call_rcu(&stit->rcu, kvm_spapr_tce_iommu_table_free);
>>>>> +	}
>>>>> +}
>>>>> +
>>>>> +extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
>>>>> +		struct vfio_group *group)
>>>>> +{
>>>>> +	struct kvmppc_spapr_tce_table *stt;
>>>>> +
>>>>> +	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list)
>>>>> +		kvm_spapr_tce_liobn_release_iommu_group(stt, group);
>>>>> +}
>>>>> +
>>>>> +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
>>>>> +		struct vfio_group *group)
>>>>> +{
>>>>> +	struct kvmppc_spapr_tce_table *stt = NULL;
>>>>> +	bool found = false;
>>>>> +	struct iommu_table *tbl = NULL;
>>>>> +	struct iommu_table_group *table_group;
>>>>> +	long i, ret = 0;
>>>>> +	struct kvmppc_spapr_tce_iommu_table *stit;
>>>>> +	struct fd f;
>>>>> +	int group_id;
>>>>> +	struct iommu_group *grp;
>>>>> +
>>>>> +	group_id = kvm_vfio_external_user_iommu_id(group);
>>>>> +	grp = iommu_group_get_by_id(group_id);
>>>>> +	if (WARN_ON(!grp))
>>>>> +		return -EIO;
>>>>
>>>> I think it would be nicer to have a function that goes directly from
>>>> vfio_group to iommu_group rather than going via id.  That can be a
>>>> later cleanup, though.
>>>>
>>>>> +
>>>>> +	f = fdget(tablefd);
>>>>> +	if (!f.file) {
>>>>> +		ret = -EBADF;
>>>>> +		goto put_exit;
>>>>> +	}
>>>>> +
>>>>> +	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list) {
>>>>> +		if (stt == f.file->private_data) {
>>>>> +			found = true;
>>>>> +			break;
>>>>> +		}
>>>>> +	}
>>>>> +
>>>>> +	fdput(f);
>>>>> +
>>>>> +	if (!found) {
>>>>> +		ret = -EINVAL;
>>>>> +		goto put_exit;
>>>>> +	}
>>>>> +
>>>>> +	table_group = iommu_group_get_iommudata(grp);
>>>>> +	if (WARN_ON(!table_group)) {
>>>>> +		ret = -EFAULT;
>>>>> +		goto put_exit;
>>>>> +	}
>>>>> +
>>>>> +	for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
>>>>> +		struct iommu_table *tbltmp = table_group->tables[i];
>>>>> +
>>>>> +		if (!tbltmp)
>>>>> +			continue;
>>>>> +
>>>>> +		/*
>>>>> +		 * Make sure hardware table parameters are exactly the same;
>>>>> +		 * this is used in the TCE handlers where boundary checks
>>>>> +		 * use only the first attached table.
>>>>> +		 */
>>>>> +		if ((tbltmp->it_page_shift == stt->page_shift) &&
>>>>> +				(tbltmp->it_offset == stt->offset) &&
>>>>> +				(tbltmp->it_size == stt->size)) {
>>>>> +			tbl = tbltmp;
>>>>> +			break;
>>>>> +		}
>>>>> +	}
>>>>> +	if (!tbl) {
>>>>> +		ret = -EINVAL;
>>>>> +		goto put_exit;
>>>>> +	}
>>>>> +
>>>>> +	list_for_each_entry_rcu(stit, &stt->iommu_tables, next) {
>>>>> +		if ((stit->tbl == tbl) && (stit->group == group)) {
>>>>> +			ret = -EBUSY;
>>>>> +			goto put_exit;
>>>>> +		}
>>>>> +	}
>>>>> +
>>>>> +	iommu_table_get(tbl);
>>>>> +
>>>>> +	stit = kzalloc(sizeof(*stit), GFP_KERNEL);
>>>>> +	stit->tbl = tbl;
>>>>> +	stit->group = group;
>>>>> +
>>>>> +	list_add_rcu(&stit->next, &stt->iommu_tables);
>>>>> +
>>>>> +put_exit:
>>>>> +	iommu_group_put(grp);
>>>>> +
>>>>> +	return ret;
>>>>> +}
>>>>> +
>>>>>  static void release_spapr_tce_table(struct rcu_head *head)
>>>>>  {
>>>>>  	struct kvmppc_spapr_tce_table *stt = container_of(head,
>>>>> @@ -132,6 +290,8 @@ static int kvm_spapr_tce_release(struct inode *inode, struct file *filp)
>>>>>  
>>>>>  	list_del_rcu(&stt->list);
>>>>>  
>>>>> +	kvm_spapr_tce_liobn_release_iommu_group(stt, NULL /* release all */);
>>>>> +
>>>>>  	kvm_put_kvm(stt->kvm);
>>>>>  
>>>>>  	kvmppc_account_memlimit(
>>>>> @@ -181,6 +341,7 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>>>>>  	stt->offset = args->offset;
>>>>>  	stt->size = size;
>>>>>  	stt->kvm = kvm;
>>>>> +	INIT_LIST_HEAD_RCU(&stt->iommu_tables);
>>>>>  
>>>>>  	for (i = 0; i < npages; i++) {
>>>>>  		stt->pages[i] = alloc_page(GFP_KERNEL | __GFP_ZERO);
>>>>> @@ -209,11 +370,102 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>>>>>  	return ret;
>>>>>  }
>>>>>  
>>>>> +static void kvmppc_clear_tce(struct iommu_table *tbl, unsigned long entry)
>>>>> +{
>>>>> +	unsigned long hpa = 0;
>>>>> +	enum dma_data_direction dir = DMA_NONE;
>>>>> +
>>>>> +	iommu_tce_xchg(tbl, entry, &hpa, &dir);
>>>>> +}
>>>>> +
>>>>> +static long kvmppc_tce_iommu_mapped_dec(struct kvm *kvm,
>>>>> +		struct iommu_table *tbl, unsigned long entry)
>>>>> +{
>>>>> +	struct mm_iommu_table_group_mem_t *mem = NULL;
>>>>
>>>> Initialization is unnecessary for the variable above.
>>>>
>>>>> +	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
>>>>> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
>>>>> +
>>>>> +	if (WARN_ON(!pua))
>>>>> +		return H_HARDWARE;
>>>>> +
>>>>> +	mem = mm_iommu_lookup(kvm->mm, *pua, pgsize);
>>>>> +	if (!mem)
>>>>> +		return H_TOO_HARD;
>>>>> +
>>>>> +	mm_iommu_mapped_dec(mem);
>>>>> +
>>>>> +	*pua = 0;
>>>>> +
>>>>> +	return H_SUCCESS;
>>>>> +}
>>>>> +
>>>>> +static long kvmppc_tce_iommu_unmap(struct kvm *kvm,
>>>>> +		struct iommu_table *tbl, unsigned long entry)
>>>>> +{
>>>>> +	enum dma_data_direction dir = DMA_NONE;
>>>>> +	unsigned long hpa = 0;
>>>>> +	long ret;
>>>>> +
>>>>> +	if (iommu_tce_xchg(tbl, entry, &hpa, &dir))
>>>>> +		return H_HARDWARE;
>>>>> +
>>>>> +	if (dir == DMA_NONE)
>>>>> +		return H_SUCCESS;
>>>>> +
>>>>> +	ret = kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
>>>>> +	if (ret != H_SUCCESS)
>>>>> +		iommu_tce_xchg(tbl, entry, &hpa, &dir);
>>>>> +
>>>>> +	return ret;
>>>>> +}
>>>>> +
>>>>> +long kvmppc_tce_iommu_map(struct kvm *kvm, struct iommu_table *tbl,
>>>>> +		unsigned long entry, unsigned long gpa,
>>>>> +		enum dma_data_direction dir)
>>>>> +{
>>>>> +	long ret;
>>>>> +	unsigned long hpa, ua, *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
>>>>> +	struct mm_iommu_table_group_mem_t *mem;
>>>>> +
>>>>> +	if (!pua)
>>>>> +		/* it_userspace allocation might be delayed */
>>>>> +		return H_TOO_HARD;
>>>>> +
>>>>> +	if (kvmppc_gpa_to_ua(kvm, gpa, &ua, NULL))
>>>>> +		return H_PARAMETER;
>>>>> +
>>>>> +	mem = mm_iommu_lookup(kvm->mm, ua, 1ULL << tbl->it_page_shift);
>>>>> +	if (!mem)
>>>>> +		return H_TOO_HARD;
>>>>> +
>>>>> +	if (WARN_ON(mm_iommu_ua_to_hpa(mem, ua, &hpa)))
>>>>> +		return H_HARDWARE;
>>>>
>>>> Remind me - could a failure here be triggered by userspace failing to
>>>> preregister all the memory it should?
>>>
>>>
>>> mm_iommu_lookup() should fail first as the only reason for
>>> mm_iommu_ua_to_hpa() to fail is (entry >= mem->entries) but this is checked
>>> in mm_iommu_lookup().
>>
>> Ah, yes of course, I forgot.  Ok, that should be fine.
>>
>>
>>>>
>>>>> +	if (mm_iommu_mapped_inc(mem))
>>>>> +		return H_CLOSED;
>>>>> +
>>>>> +	ret = iommu_tce_xchg(tbl, entry, &hpa, &dir);
>>>>> +	if (ret) {
>>>>> +		mm_iommu_mapped_dec(mem);
>>>>> +		return H_TOO_HARD;
>>>>> +	}
>>>>> +
>>>>> +	if (dir != DMA_NONE)
>>>>> +		kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
>>>>> +
>>>>> +	*pua = ua;
>>>>> +
>>>>> +	return 0;
>>>>> +}
>>>>> +
>>>>>  long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>>>>>  		      unsigned long ioba, unsigned long tce)
>>>>>  {
>>>>>  	struct kvmppc_spapr_tce_table *stt;
>>>>> -	long ret;
>>>>> +	long ret, idx;
>>>>> +	struct kvmppc_spapr_tce_iommu_table *stit;
>>>>> +	unsigned long entry, gpa;
>>>>> +	enum dma_data_direction dir;
>>>>>  
>>>>>  	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
>>>>>  	/* 	    liobn, ioba, tce); */
>>>>> @@ -230,7 +482,28 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>>>>>  	if (ret != H_SUCCESS)
>>>>>  		return ret;
>>>>>  
>>>>> -	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
>>>>> +	entry = ioba >> stt->page_shift;
>>>>> +	gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
>>>>> +	dir = iommu_tce_direction(tce);
>>>>> +
>>>>> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
>>>>> +		if (dir == DMA_NONE) {
>>>>> +			ret = kvmppc_tce_iommu_unmap(vcpu->kvm,
>>>>> +					stit->tbl, entry);
>>>>> +		} else {
>>>>> +			idx = srcu_read_lock(&vcpu->kvm->srcu);
>>>>> +			ret = kvmppc_tce_iommu_map(vcpu->kvm, stit->tbl,
>>>>> +					entry, gpa, dir);
>>>>> +			srcu_read_unlock(&vcpu->kvm->srcu, idx);
>>>>> +		}
>>>>> +
>>>>> +		if (WARN_ON_ONCE(ret == H_HARDWARE))
>>>>> +			kvmppc_clear_tce(stit->tbl, entry);
>>>>> +		else if (ret != H_SUCCESS)
>>>>> +			return ret;
>>>>> +	}
>>>>> +
>>>>> +	kvmppc_tce_put(stt, entry, tce);
>>>>>  
>>>>>  	return H_SUCCESS;
>>>>>  }
>>>>> @@ -242,9 +515,10 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>>>>  {
>>>>>  	struct kvmppc_spapr_tce_table *stt;
>>>>>  	long i, ret = H_SUCCESS, idx;
>>>>> -	unsigned long entry, ua = 0;
>>>>> +	unsigned long entry, ua = 0, gpa;
>>>>>  	u64 __user *tces;
>>>>>  	u64 tce;
>>>>> +	struct kvmppc_spapr_tce_iommu_table *stit;
>>>>>  
>>>>>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>>>>>  	if (!stt)
>>>>> @@ -283,6 +557,18 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>>>>  		if (ret != H_SUCCESS)
>>>>>  			goto unlock_exit;
>>>>>  
>>>>> +		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
>>>>> +
>>>>> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
>>>>> +			ret = kvmppc_tce_iommu_map(vcpu->kvm,
>>>>> +					stit->tbl, entry + i, gpa,
>>>>> +					iommu_tce_direction(tce));
>>>>> +			if (WARN_ON_ONCE(ret == H_HARDWARE))
>>>>> +				kvmppc_clear_tce(stit->tbl, entry);
>>>>> +			else if (ret != H_SUCCESS)
>>>>> +				goto unlock_exit;
>>>>> +		}
>>>>> +
>>>>>  		kvmppc_tce_put(stt, entry + i, tce);
>>>>>  	}
>>>>>  
>>>>> @@ -299,6 +585,7 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
>>>>>  {
>>>>>  	struct kvmppc_spapr_tce_table *stt;
>>>>>  	long i, ret;
>>>>> +	struct kvmppc_spapr_tce_iommu_table *stit;
>>>>>  
>>>>>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>>>>>  	if (!stt)
>>>>> @@ -312,6 +599,20 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
>>>>>  	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
>>>>>  		return H_PARAMETER;
>>>>>  
>>>>> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
>>>>> +		unsigned long entry = ioba >> stit->tbl->it_page_shift;
>>>>> +
>>>>> +		for (i = 0; i < npages; ++i) {
>>>>> +			ret = kvmppc_tce_iommu_unmap(vcpu->kvm,
>>>>> +					stit->tbl, entry + i);
>>>>> +
>>>>> +			if (WARN_ON_ONCE(ret == H_HARDWARE))
>>>>> +				kvmppc_clear_tce(stit->tbl, entry);
>>>>> +			else if (ret != H_SUCCESS)
>>>>> +				return ret;
>>>>> +		}
>>>>> +	}
>>>>> +
>>>>>  	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
>>>>>  		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
>>>>>  
>>>>> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
>>>>> index 92d769f4eaea..4a1d978ebd98 100644
>>>>> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
>>>>> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
>>>>> @@ -161,11 +161,111 @@ long kvmppc_gpa_to_ua(struct kvm *kvm, unsigned long gpa,
>>>>>  EXPORT_SYMBOL_GPL(kvmppc_gpa_to_ua);
>>>>>  
>>>>>  #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
>>>>> +static void kvmppc_rm_clear_tce(struct iommu_table *tbl, unsigned long entry)
>>>>> +{
>>>>> +	unsigned long hpa = 0;
>>>>> +	enum dma_data_direction dir = DMA_NONE;
>>>>> +
>>>>> +	iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
>>>>> +}
>>>>> +
>>>>> +static long kvmppc_rm_tce_iommu_mapped_dec(struct kvm *kvm,
>>>>> +		struct iommu_table *tbl, unsigned long entry)
>>>>> +{
>>>>> +	struct mm_iommu_table_group_mem_t *mem = NULL;
>>>>> +	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
>>>>> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
>>>>> +
>>>>> +	if (WARN_ON(!pua))
>>>>> +		return H_HARDWARE;
>>>>> +
>>>>> +	pua = (void *) vmalloc_to_phys(pua);
>>>>> +	if (!pua)
>>>>> +		return H_TOO_HARD;
>>>>> +
>>>>> +	mem = mm_iommu_lookup_rm(kvm->mm, *pua, pgsize);
>>>>> +	if (!mem)
>>>>> +		return H_TOO_HARD;
>>>>> +
>>>>> +	mm_iommu_mapped_dec(mem);
>>>>> +
>>>>> +	*pua = 0;
>>>>> +
>>>>> +	return H_SUCCESS;
>>>>> +}
>>>>> +
>>>>> +static long kvmppc_rm_tce_iommu_unmap(struct kvm *kvm,
>>>>> +		struct iommu_table *tbl, unsigned long entry)
>>>>> +{
>>>>> +	enum dma_data_direction dir = DMA_NONE;
>>>>> +	unsigned long hpa = 0;
>>>>> +	long ret;
>>>>> +
>>>>> +	if (iommu_tce_xchg_rm(tbl, entry, &hpa, &dir))
>>>>> +		return H_HARDWARE;
>>>>
>>>> To avoid a double WARN() (and to make the warnings easier to
>>>> understand) I'd suggest putting a WARN_ON() here, rather than in the
>>>> callers when they receive an H_HARDWARE.  IIUC this really shouldn't
>>>> ever happen, and it certainly can't be the guest's fault?
>>>
>>>
>>> Makes sense.
>>
>> I guess it might want WARN_ON_ONCE() to avoid spamming the user with
>> errors for every TCE, though.
> 
> 
> We do not expect this to happen at all :) I can convert all of them to
> _ONCE really as the purpose of WARN_ON is mostly to document what we do not
> expect.
> 
> 
> 
>>>>> +
>>>>> +	if (dir == DMA_NONE)
>>>>> +		return H_SUCCESS;
>>>>> +
>>>>> +	ret = kvmppc_rm_tce_iommu_mapped_dec(kvm, tbl, entry);
>>>>> +	if (ret)
>>>>> +		iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
>>>>> +
>>>>> +	return ret;
>>>>> +}
>>>>> +
>>>>> +static long kvmppc_rm_tce_iommu_map(struct kvm *kvm, struct iommu_table *tbl,
>>>>> +		unsigned long entry, unsigned long gpa,
>>>>> +		enum dma_data_direction dir)
>>>>> +{
>>>>> +	long ret;
>>>>> +	unsigned long hpa = 0, ua;
>>>>> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
>>>>> +	struct mm_iommu_table_group_mem_t *mem;
>>>>> +
>>>>> +	if (!pua)
>>>>> +		/* it_userspace allocation might be delayed */
>>>>> +		return H_TOO_HARD;
>>>>> +
>>>>> +	if (kvmppc_gpa_to_ua(kvm, gpa, &ua, NULL))
>>>>> +		return H_PARAMETER;
>>>
>>>
>>> Referred below as [1]
>>>
>>>
>>>>> +
>>>>> +	mem = mm_iommu_lookup_rm(kvm->mm, ua, 1ULL << tbl->it_page_shift);
>>>>> +	if (!mem)
>>>>> +		return H_TOO_HARD;
>>>>> +
>>>>> +	if (WARN_ON(mm_iommu_ua_to_hpa_rm(mem, ua, &hpa)))
>>>>> +		return H_HARDWARE;
>>>>> +
>>>>> +	pua = (void *) vmalloc_to_phys(pua);
>>>>> +	if (WARN_ON(!pua))
>>>>> +		return H_HARDWARE;
>>>>> +
>>>>> +	if (WARN_ON(mm_iommu_mapped_inc(mem)))
>>>>> +		return H_CLOSED;
>>>>> +
>>>>> +	ret = iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
>>>>> +	if (ret) {
>>>>> +		mm_iommu_mapped_dec(mem);
>>>>> +		return H_TOO_HARD;
>>>>> +	}
>>>>> +
>>>>> +	if (dir != DMA_NONE)
>>>>> +		kvmppc_rm_tce_iommu_mapped_dec(kvm, tbl, entry);
>>>>> +
>>>>> +	*pua = ua;
>>>>> +
>>>>> +	return 0;
>>>>> +}
>>>>> +
>>>>>  long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>>>>>  		unsigned long ioba, unsigned long tce)
>>>>>  {
>>>>>  	struct kvmppc_spapr_tce_table *stt;
>>>>>  	long ret;
>>>>> +	struct kvmppc_spapr_tce_iommu_table *stit;
>>>>> +	unsigned long entry, gpa;
>>>>> +	enum dma_data_direction dir;
>>>>>  
>>>>>  	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
>>>>>  	/* 	    liobn, ioba, tce); */
>>>>> @@ -182,7 +282,25 @@ long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>>>>>  	if (ret != H_SUCCESS)
>>>>>  		return ret;
>>>>>  
>>>>> -	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
>>>>> +	entry = ioba >> stt->page_shift;
>>>>> +	gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
>>>>> +	dir = iommu_tce_direction(tce);
>>>>> +
>>>>> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
>>>>> +		if (dir == DMA_NONE)
>>>>> +			ret = kvmppc_rm_tce_iommu_unmap(vcpu->kvm,
>>>>> +					stit->tbl, entry);
>>>>> +		else
>>>>> +			ret = kvmppc_rm_tce_iommu_map(vcpu->kvm,
>>>>> +					stit->tbl, entry, gpa, dir);
>>>>> +
>>>>> +		if (WARN_ON_ONCE(ret == H_HARDWARE))
>>>>> +			kvmppc_rm_clear_tce(stit->tbl, entry);
>>>>> +		else if (ret != H_SUCCESS)
>>>>> +			return ret;
>>>>> +	}
>>>>> +
>>>>> +	kvmppc_tce_put(stt, entry, tce);
>>>>>  
>>>>>  	return H_SUCCESS;
>>>>>  }
>>>>> @@ -220,9 +338,10 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>>>>  {
>>>>>  	struct kvmppc_spapr_tce_table *stt;
>>>>>  	long i, ret = H_SUCCESS;
>>>>> -	unsigned long tces, entry, ua = 0;
>>>>> +	unsigned long tces, entry, ua = 0, tce, gpa;
>>>>>  	unsigned long *rmap = NULL;
>>>>>  	bool prereg = false;
>>>>> +	struct kvmppc_spapr_tce_iommu_table *stit;
>>>>>  
>>>>>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>>>>>  	if (!stt)
>>>>> @@ -287,12 +406,24 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>>>>  	}
>>>>>  
>>>>>  	for (i = 0; i < npages; ++i) {
>>>>> -		unsigned long tce = be64_to_cpu(((u64 *)tces)[i]);
>>>>> +		tce = be64_to_cpu(((u64 *)tces)[i]);
>>>>>  
>>>>>  		ret = kvmppc_tce_validate(stt, tce);
>>>>>  		if (ret != H_SUCCESS)
>>>>>  			goto unlock_exit;
>>>>>  
>>>>> +		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
>>>>> +
>>>>> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
>>>>> +			ret = kvmppc_rm_tce_iommu_map(vcpu->kvm,
>>>>> +					stit->tbl, entry + i, gpa,
>>>>> +					iommu_tce_direction(tce));
>>>>> +			if (WARN_ON_ONCE(ret == H_HARDWARE))
>>>>
>>>> I don't think you need the WARN() here - the only H_HARDWARE failure
>>>> path in iommu_map() already includes a WARN().
>>>
>>>
>>> True, I can drop it here.
>>>
>>>
>>>>
>>>>> +				kvmppc_rm_clear_tce(stit->tbl, entry + i);
>>>>> +			else if (ret != H_SUCCESS)
>>>>> +				goto unlock_exit;
>>>>
>>>> It's also not clear to me why the H_HARDWARE error path clears the
>>>> entry, but the other failure paths don't.  Or why an H_HARDWARE will
>>>> result in continuing to set the rest of the TCEs, but other failures
>>>> won't.
>>>
>>>
>>> The idea was that other failures still have some chance that handling may
>>> succeed in virtual mode or via QEMU, H_HARDWARE is fatal.
>>
>> Um... yes.. but the logic seems to be backwards for that: on
>> H_HARDWARE you warn and keep going, on other errors you bail out
>> entirely.
> 
> By "fatal" I mean fatal for this particular hardware TCE(s), no hope in
> trying this particular TCE in virtual mode.
> 
> 
>>
>>> I am just not sure if H_PARAMETER is what I want to return at [1], to make
>>> the calling code simpler, I could return H_HARDWARE there as well (instead
>>> of H_PARAMETER).
>>
>> That sounds right, IIUC the gpa to ua translation shouldn't ever
>> fail because of something the guest did. 
> 
> 
> The guest can easily pass bad TCE/GPA which is not in any registered slot.
> So it is rather H_PARAMETER.
> 
>> So I'd expect either
>> H_HARDWARE, or H_TOO_HARD (if there's some hope that virtual mode can
>> make the translation when real mode couldn't).
> 
> No, virtual mode uses the exact same helper.


I just realized that kvmppc_gpa_to_ua() could be a part of
kvmppc_tce_validate() and kvmppc_rm_tce_iommu_map() could just receive the
userspace address rather than guest physical.
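
A rough sketch of that idea, written as a self-contained userspace model
rather than kernel code. Every sk_* name, the simplified memslot
structure and the locally defined return codes are assumptions made for
illustration only; none of this is taken from the patch. Validation does
the gpa-to-ua translation once and reports a bad guest address as
H_PARAMETER, and the map helper only ever sees a userspace address:

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-ins for hcall return codes and TCE permission bits (illustrative). */
#define SK_H_SUCCESS		0L
#define SK_H_PARAMETER		(-4L)
#define SK_TCE_PCI_READ		0x1UL
#define SK_TCE_PCI_WRITE	0x2UL

/* Hypothetical, simplified memory slot: a guest-physical range backed by userspace memory. */
struct sk_memslot {
	unsigned long gpa_base;
	unsigned long size;
	unsigned long ua_base;
};

/* Translate gpa to ua; returns 0 on success, -1 if no slot covers gpa. */
static int sk_gpa_to_ua(const struct sk_memslot *slots, size_t nslots,
		unsigned long gpa, unsigned long *ua)
{
	size_t i;

	for (i = 0; i < nslots; ++i) {
		if (gpa >= slots[i].gpa_base &&
				gpa - slots[i].gpa_base < slots[i].size) {
			*ua = slots[i].ua_base + (gpa - slots[i].gpa_base);
			return 0;
		}
	}
	return -1;
}

/* Validation translates once; a GPA outside every slot is the guest's mistake. */
static long sk_tce_validate(const struct sk_memslot *slots, size_t nslots,
		unsigned long tce, unsigned long *ua)
{
	unsigned long gpa = tce & ~(SK_TCE_PCI_READ | SK_TCE_PCI_WRITE);

	if (sk_gpa_to_ua(slots, nslots, gpa, ua))
		return SK_H_PARAMETER;
	return SK_H_SUCCESS;
}

/* The map helper receives ua directly, so it has no translation failure path. */
static long sk_tce_iommu_map(unsigned long *userspace_view,
		unsigned long entry, unsigned long ua)
{
	userspace_view[entry] = ua;
	return SK_H_SUCCESS;
}
```

With this split the H_PARAMETER question at [1] would be settled in one
place, before any hardware table is touched.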


> 
> 
>>>>> +		}
>>>>> +
>>>>>  		kvmppc_tce_put(stt, entry + i, tce);
>>>>>  	}
>>>>>  
>>>>> @@ -309,6 +440,7 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
>>>>>  {
>>>>>  	struct kvmppc_spapr_tce_table *stt;
>>>>>  	long i, ret;
>>>>> +	struct kvmppc_spapr_tce_iommu_table *stit;
>>>>>  
>>>>>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>>>>>  	if (!stt)
>>>>> @@ -322,6 +454,20 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
>>>>>  	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
>>>>>  		return H_PARAMETER;
>>>>>  
>>>>> +	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
>>>>> +		unsigned long entry = ioba >> stit->tbl->it_page_shift;
>>>>> +
>>>>> +		for (i = 0; i < npages; ++i) {
>>>>> +			ret = kvmppc_rm_tce_iommu_unmap(vcpu->kvm,
>>>>> +					stit->tbl, entry + i);
>>>>> +
>>>>> +			if (WARN_ON_ONCE(ret == H_HARDWARE))
>>>>> +				kvmppc_rm_clear_tce(stit->tbl, entry + i);
>>>>
>>>> As noted earlier, I think this WARN belongs within iommu_unmap()
>>>> rather than out here.
>>>>
>>>>> +			else if (ret != H_SUCCESS)
>>>>> +				return ret;
>>>>> +		}
>>>>> +	}
>>>>> +
>>>>>  	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
>>>>>  		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
>>>>>  
>>>>> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
>>>>> index cd892dec7cb6..f3127dc87912 100644
>>>>> --- a/arch/powerpc/kvm/powerpc.c
>>>>> +++ b/arch/powerpc/kvm/powerpc.c
>>>>> @@ -536,6 +536,8 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>>>>>  #ifdef CONFIG_PPC_BOOK3S_64
>>>>>  	case KVM_CAP_SPAPR_TCE:
>>>>>  	case KVM_CAP_SPAPR_TCE_64:
>>>>> +		/* fallthrough */
>>>>> +	case KVM_CAP_SPAPR_TCE_VFIO:
>>>>>  	case KVM_CAP_PPC_RTAS:
>>>>>  	case KVM_CAP_PPC_FIXUP_HCALL:
>>>>>  	case KVM_CAP_PPC_ENABLE_HCALL:
>>>>> diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
>>>>> index d32f239eb471..2b7dc22265fe 100644
>>>>> --- a/virt/kvm/vfio.c
>>>>> +++ b/virt/kvm/vfio.c
>>>>> @@ -20,6 +20,10 @@
>>>>>  #include <linux/vfio.h>
>>>>>  #include "vfio.h"
>>>>>  
>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>>>>> +#include <asm/kvm_ppc.h>
>>>>> +#endif
>>>>> +
>>>>>  struct kvm_vfio_group {
>>>>>  	struct list_head node;
>>>>>  	struct vfio_group *vfio_group;
>>>>> @@ -211,6 +215,9 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
>>>>>  
>>>>>  		mutex_unlock(&kv->lock);
>>>>>  
>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>>>>> +		kvm_spapr_tce_release_iommu_group(dev->kvm, vfio_group);
>>>>> +#endif
>>>>>  		kvm_vfio_group_set_kvm(vfio_group, NULL);
>>>>>  
>>>>>  		kvm_vfio_group_put_external_user(vfio_group);
>>>>> @@ -218,6 +225,53 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
>>>>>  		kvm_vfio_update_coherency(dev);
>>>>>  
>>>>>  		return ret;
>>>>> +
>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>>>>> +	case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: {
>>>>> +		struct kvm_vfio_spapr_tce param;
>>>>> +		unsigned long minsz;
>>>>> +		struct kvm_vfio *kv = dev->private;
>>>>> +		struct vfio_group *vfio_group;
>>>>> +		struct kvm_vfio_group *kvg;
>>>>> +		struct fd f;
>>>>> +
>>>>> +		minsz = offsetofend(struct kvm_vfio_spapr_tce, tablefd);
>>>>> +
>>>>> +		if (copy_from_user(&param, (void __user *)arg, minsz))
>>>>> +			return -EFAULT;
>>>>> +
>>>>> +		if (param.argsz < minsz || param.flags)
>>>>> +			return -EINVAL;
>>>>> +
>>>>> +		f = fdget(param.groupfd);
>>>>> +		if (!f.file)
>>>>> +			return -EBADF;
>>>>> +
>>>>> +		vfio_group = kvm_vfio_group_get_external_user(f.file);
>>>>> +		fdput(f);
>>>>> +
>>>>> +		if (IS_ERR(vfio_group))
>>>>> +			return PTR_ERR(vfio_group);
>>>>> +
>>>>> +		ret = -ENOENT;
>>>>> +
>>>>> +		mutex_lock(&kv->lock);
>>>>> +
>>>>> +		list_for_each_entry(kvg, &kv->group_list, node) {
>>>>> +			if (kvg->vfio_group != vfio_group)
>>>>> +				continue;
>>>>> +
>>>>> +			ret = kvm_spapr_tce_attach_iommu_group(dev->kvm,
>>>>> +					param.tablefd, vfio_group);
>>>>> +
>>>>> +			break;
>>>>> +		}
>>>>> +
>>>>> +		mutex_unlock(&kv->lock);
>>>>> +
>>>>> +		return ret;
>>>>> +	}
>>>>> +#endif /* CONFIG_SPAPR_TCE_IOMMU */
>>>>>  	}
>>>>>  
>>>>>  	return -ENXIO;
>>>>> @@ -242,6 +296,9 @@ static int kvm_vfio_has_attr(struct kvm_device *dev,
>>>>>  		switch (attr->attr) {
>>>>>  		case KVM_DEV_VFIO_GROUP_ADD:
>>>>>  		case KVM_DEV_VFIO_GROUP_DEL:
>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>>>>> +		case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE:
>>>>> +#endif
>>>>>  			return 0;
>>>>>  		}
>>>>>  
>>>>> @@ -257,6 +314,9 @@ static void kvm_vfio_destroy(struct kvm_device *dev)
>>>>>  	struct kvm_vfio_group *kvg, *tmp;
>>>>>  
>>>>>  	list_for_each_entry_safe(kvg, tmp, &kv->group_list, node) {
>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>>>>> +		kvm_spapr_tce_release_iommu_group(dev->kvm, kvg->vfio_group);
>>>>> +#endif
>>>>>  		kvm_vfio_group_set_kvm(kvg->vfio_group, NULL);
>>>>>  		kvm_vfio_group_put_external_user(kvg->vfio_group);
>>>>>  		list_del(&kvg->node);
>>>>
David Gibson Feb. 27, 2017, 1:53 a.m. | #6
On Fri, Feb 24, 2017 at 02:43:05PM +1100, Alexey Kardashevskiy wrote:
> On 24/02/17 14:36, David Gibson wrote:
> > On Fri, Feb 24, 2017 at 02:29:14PM +1100, Alexey Kardashevskiy wrote:
> >> On 24/02/17 13:14, David Gibson wrote:
> >>> On Wed, Feb 22, 2017 at 07:21:33PM +1100, Alexey Kardashevskiy wrote:
[snip]
> >>>> +static long kvmppc_rm_tce_iommu_unmap(struct kvm *kvm,
> >>>> +		struct iommu_table *tbl, unsigned long entry)
> >>>> +{
> >>>> +	enum dma_data_direction dir = DMA_NONE;
> >>>> +	unsigned long hpa = 0;
> >>>> +	long ret;
> >>>> +
> >>>> +	if (iommu_tce_xchg_rm(tbl, entry, &hpa, &dir))
> >>>> +		return H_HARDWARE;
> >>>
> >>> To avoid a double WARN() (and to make the warnings easier to
> >>> understand) I'd suggest putting a WARN_ON() here, rather than in the
> >>> callers when they receive an H_HARDWARE.  IIUC this really shouldn't
> >>> ever happen, and it certainly can't be the guest's fault?
> >>
> >>
> >> Makes sense.
> > 
> > I guess it might want WARN_ON_ONCE() to avoid spamming the user with
> > errors for every TCE, though.
> 
> 
> We do not expect this to happen at all :) I can convert all of them to
> _ONCE really as the purpose of WARN_ON is mostly to document what we do not
> expect.

Sure, seems reasonable.
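
To make the intended behaviour concrete, here is a small userspace
approximation, not the kernel macro itself. The sk_* names, the explicit
"already warned" flag and the toy exchange function are all assumptions
for illustration. The point is that the warning lives inside the unmap
helper, the caller still sees H_HARDWARE every time, and the log gets at
most one message:

```c
#include <assert.h>
#include <stdio.h>

#define SK_H_SUCCESS	0L
#define SK_H_HARDWARE	(-1L)

/* Count of messages actually emitted, so the behaviour can be checked. */
static int sk_warnings_printed;

/*
 * Userspace approximation of WARN_ON_ONCE(): reports the condition to
 * the caller every time, but prints at most once per flag.
 */
static int sk_warn_on_once(int cond, int *already_warned, const char *what)
{
	if (cond && !*already_warned) {
		*already_warned = 1;
		++sk_warnings_printed;
		fprintf(stderr, "WARNING: %s\n", what);
	}
	return cond != 0;
}

/* Toy exchange: fails for odd entries to simulate a failed hardware update. */
static int sk_tce_xchg(unsigned long entry)
{
	return (entry & 1) ? -1 : 0;
}

/* The warning sits in the helper, so callers need no WARN of their own. */
static long sk_tce_iommu_unmap(unsigned long entry)
{
	static int warned;

	if (sk_warn_on_once(sk_tce_xchg(entry) != 0, &warned,
			"TCE exchange failed"))
		return SK_H_HARDWARE;
	return SK_H_SUCCESS;
}
```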

[snip]
> >>>> @@ -220,9 +338,10 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >>>>  {
> >>>>  	struct kvmppc_spapr_tce_table *stt;
> >>>>  	long i, ret = H_SUCCESS;
> >>>> -	unsigned long tces, entry, ua = 0;
> >>>> +	unsigned long tces, entry, ua = 0, tce, gpa;
> >>>>  	unsigned long *rmap = NULL;
> >>>>  	bool prereg = false;
> >>>> +	struct kvmppc_spapr_tce_iommu_table *stit;
> >>>>  
> >>>>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
> >>>>  	if (!stt)
> >>>> @@ -287,12 +406,24 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >>>>  	}
> >>>>  
> >>>>  	for (i = 0; i < npages; ++i) {
> >>>> -		unsigned long tce = be64_to_cpu(((u64 *)tces)[i]);
> >>>> +		tce = be64_to_cpu(((u64 *)tces)[i]);
> >>>>  
> >>>>  		ret = kvmppc_tce_validate(stt, tce);
> >>>>  		if (ret != H_SUCCESS)
> >>>>  			goto unlock_exit;
> >>>>  
> >>>> +		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> >>>> +
> >>>> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> >>>> +			ret = kvmppc_rm_tce_iommu_map(vcpu->kvm,
> >>>> +					stit->tbl, entry + i, gpa,
> >>>> +					iommu_tce_direction(tce));
> >>>> +			if (WARN_ON_ONCE(ret == H_HARDWARE))
> >>>
> >>> I don't think you need the WARN() here - the only H_HARDWARE failure
> >>> path in iommu_map() already includes a WARN().
> >>
> >>
> >> True, I can drop it here.
> >>
> >>
> >>>
> >>>> +				kvmppc_rm_clear_tce(stit->tbl, entry + i);
> >>>> +			else if (ret != H_SUCCESS)
> >>>> +				goto unlock_exit;
> >>>
> >>> It's also not clear to me why the H_HARDWARE error path clears the
> >>> entry, but the other failure paths don't.  Or why an H_HARDWARE will
> >>> result in continuing to set the rest of the TCEs, but other failures
> >>> won't.
> >>
> >>
> >> The idea was that other failures still have some chance that handling may
> >> succeed in virtual mode or via QEMU, H_HARDWARE is fatal.
> > 
> > Um... yes.. but the logic seems to be backwards for that: on
> > H_HARDWARE you warn and keep going, on other errors you bail out
> > entirely.
> 
> > By "fatal" I mean fatal for this particular hardware TCE(s), no hope in
> trying this particular TCE in virtual mode.

Ok... still not following why that means the "fatal" error results in
continuing to attempt for the rest of the updated TCEs, whereas the
"non fatal" one bails out.  Especially since the bail out will only go
to virtual mode if ret == H_TOO_HARD, which it isn't clear is the only
possibility.
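
For reference, the escalation being discussed can be modelled roughly as
below. This is a userspace sketch under assumptions: the sk_* names, the
handler typedef and the locally defined return codes are hypothetical,
and the real dispatch happens in the KVM hcall exit path, not in a
function like this. Only H_TOO_HARD moves a request to the next level;
any other value is final:

```c
#include <assert.h>

/* Local stand-ins for hcall return codes (illustrative values). */
#define SK_H_SUCCESS	0L
#define SK_H_PARAMETER	(-4L)
#define SK_H_TOO_HARD	9999L

enum sk_level { SK_REAL_MODE, SK_VIRT_MODE, SK_USERSPACE };

typedef long (*sk_handler)(unsigned long tce);

/*
 * Try real mode, then virtual mode, then userspace.  Only H_TOO_HARD
 * escalates; any other result is returned to the guest as-is.
 */
static long sk_dispatch(sk_handler rm, sk_handler vm, sk_handler user,
		unsigned long tce, enum sk_level *done_at)
{
	long ret;

	*done_at = SK_REAL_MODE;
	ret = rm(tce);
	if (ret != SK_H_TOO_HARD)
		return ret;

	*done_at = SK_VIRT_MODE;
	ret = vm(tce);
	if (ret != SK_H_TOO_HARD)
		return ret;

	*done_at = SK_USERSPACE;
	return user(tce);
}

/* Toy handlers used only to exercise the dispatcher. */
static long sk_always_too_hard(unsigned long tce)
{
	(void)tce;
	return SK_H_TOO_HARD;
}

static long sk_always_ok(unsigned long tce)
{
	(void)tce;
	return SK_H_SUCCESS;
}

static long sk_bad_param(unsigned long tce)
{
	(void)tce;
	return SK_H_PARAMETER;
}
```

In this model a real-mode H_PARAMETER or H_CLOSED never reaches virtual
mode, which is why the set of values that cause a bail-out matters.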

> >> I am just not sure if H_PARAMETER is what I want to return at [1], to make
> >> the calling code simpler, I could return H_HARDWARE there as well (instead
> >> of H_PARAMETER).
> > 
> > That sounds right, IIUC the gpa to ua translation shouldn't ever
> > fail because of something the guest did. 
> 
> 
> The guest can easily pass bad TCE/GPA which is not in any registered slot.
> So it is rather H_PARAMETER.

Ah, yes.

> > So I'd expect either
> > H_HARDWARE, or H_TOO_HARD (if there's some hope that virtual mode can
> > make the translation when real mode couldn't).
> 
> No, virtual mode uses the exact same helper.

Ok.
Alexey Kardashevskiy Feb. 27, 2017, 3:20 a.m. | #7
On 27/02/17 12:53, David Gibson wrote:
> On Fri, Feb 24, 2017 at 02:43:05PM +1100, Alexey Kardashevskiy wrote:
>> On 24/02/17 14:36, David Gibson wrote:
>>> On Fri, Feb 24, 2017 at 02:29:14PM +1100, Alexey Kardashevskiy wrote:
>>>> On 24/02/17 13:14, David Gibson wrote:
>>>>> On Wed, Feb 22, 2017 at 07:21:33PM +1100, Alexey Kardashevskiy wrote:
> [snip]
>>>>>> +static long kvmppc_rm_tce_iommu_unmap(struct kvm *kvm,
>>>>>> +		struct iommu_table *tbl, unsigned long entry)
>>>>>> +{
>>>>>> +	enum dma_data_direction dir = DMA_NONE;
>>>>>> +	unsigned long hpa = 0;
>>>>>> +	long ret;
>>>>>> +
>>>>>> +	if (iommu_tce_xchg_rm(tbl, entry, &hpa, &dir))
>>>>>> +		return H_HARDWARE;
>>>>>
>>>>> To avoid a double WARN() (and to make the warnings easier to
>>>>> understand) I'd suggest putting a WARN_ON() here, rather than in the
>>>>> callers when they receive an H_HARDWARE.  IIUC this really shouldn't
>>>>> ever happen, and it certainly can't be the guest's fault?
>>>>
>>>>
>>>> Makes sense.
>>>
>>> I guess it might want WARN_ON_ONCE() to avoid spamming the user with
>>> errors for every TCE, though.
>>
>>
>> We do not expect this to happen at all :) I can convert all of them to
>> _ONCE really as the purpose of WARN_ON is mostly to document what we do not
>> expect.
> 
> Sure, seems reasonable.
> 
> [snip]
>>>>>> @@ -220,9 +338,10 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>>>>>  {
>>>>>>  	struct kvmppc_spapr_tce_table *stt;
>>>>>>  	long i, ret = H_SUCCESS;
>>>>>> -	unsigned long tces, entry, ua = 0;
>>>>>> +	unsigned long tces, entry, ua = 0, tce, gpa;
>>>>>>  	unsigned long *rmap = NULL;
>>>>>>  	bool prereg = false;
>>>>>> +	struct kvmppc_spapr_tce_iommu_table *stit;
>>>>>>  
>>>>>>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>>>>>>  	if (!stt)
>>>>>> @@ -287,12 +406,24 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>>>>>  	}
>>>>>>  
>>>>>>  	for (i = 0; i < npages; ++i) {
>>>>>> -		unsigned long tce = be64_to_cpu(((u64 *)tces)[i]);
>>>>>> +		tce = be64_to_cpu(((u64 *)tces)[i]);
>>>>>>  
>>>>>>  		ret = kvmppc_tce_validate(stt, tce);
>>>>>>  		if (ret != H_SUCCESS)
>>>>>>  			goto unlock_exit;
>>>>>>  
>>>>>> +		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
>>>>>> +
>>>>>> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
>>>>>> +			ret = kvmppc_rm_tce_iommu_map(vcpu->kvm,
>>>>>> +					stit->tbl, entry + i, gpa,
>>>>>> +					iommu_tce_direction(tce));
>>>>>> +			if (WARN_ON_ONCE(ret == H_HARDWARE))
>>>>>
>>>>> I don't think you need the WARN() here - the only H_HARDWARE failure
>>>>> path in iommu_map() already includes a WARN().
>>>>
>>>>
>>>> True, I can drop it here.
>>>>
>>>>
>>>>>
>>>>>> +				kvmppc_rm_clear_tce(stit->tbl, entry + i);
>>>>>> +			else if (ret != H_SUCCESS)
>>>>>> +				goto unlock_exit;
>>>>>
>>>>> It's also not clear to me why the H_HARDWARE error path clears the
>>>>> entry, but the other failure paths don't.  Or why an H_HARDWARE will
>>>>> result in continuing to set the rest of the TCEs, but other failures
>>>>> won't.
>>>>
>>>>
>>>> The idea was that other failures still have some chance that handling may
>>>> succeed in virtual mode or via QEMU, H_HARDWARE is fatal.
>>>
>>> Um... yes.. but the logic seems to be backwards for that: on
>>> H_HARDWARE you warn and keep going, on other errors you bail out
>>> entirely.
>>
>> By "fatal" I mean fatal for this particular hardware TCE(s), no hope in
>> trying this particular TCE in virtual mode.
> 
> Ok... still not following why that means the "fatal" error results in
> continuing to attempt for the rest of the updated TCEs, whereas the
> "non fatal" one bails out.

I was applying the principle that if, after all checks are done, we still
cannot update the hardware table, then we just clear the TCE and move on.
Or did I misunderstand the idea?


> Especially since the bail out will only go
> to virtual mode if ret == H_TOO_HARD, which it isn't clear is the only
> possibility.


H_TOO_HARD goes to virtual mode, H_TOO_HARD in virtual goes to the
userspace (QEMU).

Will "if (WARN_ON_ONCE(ret != H_SUCCESS && ret != H_TOO_HARD))" make more
sense?
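
One possible reading of that condition, sketched as a userspace model.
This is only an assumption about what the body of the "if" could be,
not something settled in this thread, and all sk_* names and the scripted
table results are hypothetical. H_TOO_HARD stops the loop so a slower
path can retry; anything else that is not H_SUCCESS counts as
unexpected, so the entry is cleared in that table and the loop carries on:

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-ins for hcall return codes (illustrative values). */
#define SK_H_SUCCESS	0L
#define SK_H_HARDWARE	(-1L)
#define SK_H_TOO_HARD	9999L

/* Hypothetical hardware table: a few TCE slots plus a scripted update result. */
struct sk_table {
	unsigned long tce[4];
	long next_result;
};

static long sk_map(struct sk_table *t, unsigned long entry, unsigned long hpa)
{
	if (t->next_result == SK_H_SUCCESS)
		t->tce[entry] = hpa;
	return t->next_result;
}

static void sk_clear_tce(struct sk_table *t, unsigned long entry)
{
	t->tce[entry] = 0;
}

/*
 * Update one entry in every attached table.  H_TOO_HARD bails out;
 * any other failure is unexpected: clear the entry and move on.
 */
static long sk_put_tce(struct sk_table *tables, size_t n,
		unsigned long entry, unsigned long hpa, int *unexpected)
{
	size_t i;
	long ret;

	for (i = 0; i < n; ++i) {
		ret = sk_map(&tables[i], entry, hpa);
		if (ret == SK_H_TOO_HARD)
			return ret;
		if (ret != SK_H_SUCCESS) {
			++*unexpected;	/* stands in for WARN_ON_ONCE() */
			sk_clear_tce(&tables[i], entry);
		}
	}
	return SK_H_SUCCESS;
}
```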



> 
>>>> I am just not sure if H_PARAMETER is what I want to return at [1], to make
>>>> the calling code simpler, I could return H_HARDWARE there as well (instead
>>>> of H_PARAMETER).
>>>
>>> That sounds right, IIUC the gpa to ua translation shouldn't ever
>>> fail because of something the guest did. 
>>
>>
>> The guest can easily pass bad TCE/GPA which is not in any registered slot.
>> So it is rather H_PARAMETER.
> 
> Ah, yes.
> 
>>> So I'd expect either
>>> H_HARDWARE, or H_TOO_HARD (if there's some hope that virtual mode can
>>> make the translation when real mode couldn't).
>>
>> No, virtual mode uses the exact same helper.
> 
> Ok.
>
David Gibson Feb. 28, 2017, 12:54 a.m. | #8
On Mon, Feb 27, 2017 at 02:20:13PM +1100, Alexey Kardashevskiy wrote:
> On 27/02/17 12:53, David Gibson wrote:
> > On Fri, Feb 24, 2017 at 02:43:05PM +1100, Alexey Kardashevskiy wrote:
> >> On 24/02/17 14:36, David Gibson wrote:
> >>> On Fri, Feb 24, 2017 at 02:29:14PM +1100, Alexey Kardashevskiy wrote:
> >>>> On 24/02/17 13:14, David Gibson wrote:
> >>>>> On Wed, Feb 22, 2017 at 07:21:33PM +1100, Alexey Kardashevskiy wrote:
> > [snip]
> >>>>>> +static long kvmppc_rm_tce_iommu_unmap(struct kvm *kvm,
> >>>>>> +		struct iommu_table *tbl, unsigned long entry)
> >>>>>> +{
> >>>>>> +	enum dma_data_direction dir = DMA_NONE;
> >>>>>> +	unsigned long hpa = 0;
> >>>>>> +	long ret;
> >>>>>> +
> >>>>>> +	if (iommu_tce_xchg_rm(tbl, entry, &hpa, &dir))
> >>>>>> +		return H_HARDWARE;
> >>>>>
> >>>>> To avoid a double WARN() (and to make the warnings easier to
> >>>>> understand) I'd suggest putting a WARN_ON() here, rather than in the
> >>>>> callers when they receive an H_HARDWARE.  IIUC this really shouldn't
> >>>>> ever happen, and it certainly can't be the guest's fault?
> >>>>
> >>>>
> >>>> Makes sense.
> >>>
> >>> I guess it might want WARN_ON_ONCE() to avoid spamming the user with
> >>> errors for every TCE, though.
> >>
> >>
> >> We do not expect this to happen at all :) I can convert all of them to
> >> _ONCE really as the purpose of WARN_ON is mostly to document what we do not
> >> expect.
> > 
> > Sure, seems reasonable.
> > 
> > [snip]
> >>>>>> @@ -220,9 +338,10 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >>>>>>  {
> >>>>>>  	struct kvmppc_spapr_tce_table *stt;
> >>>>>>  	long i, ret = H_SUCCESS;
> >>>>>> -	unsigned long tces, entry, ua = 0;
> >>>>>> +	unsigned long tces, entry, ua = 0, tce, gpa;
> >>>>>>  	unsigned long *rmap = NULL;
> >>>>>>  	bool prereg = false;
> >>>>>> +	struct kvmppc_spapr_tce_iommu_table *stit;
> >>>>>>  
> >>>>>>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
> >>>>>>  	if (!stt)
> >>>>>> @@ -287,12 +406,24 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >>>>>>  	}
> >>>>>>  
> >>>>>>  	for (i = 0; i < npages; ++i) {
> >>>>>> -		unsigned long tce = be64_to_cpu(((u64 *)tces)[i]);
> >>>>>> +		tce = be64_to_cpu(((u64 *)tces)[i]);
> >>>>>>  
> >>>>>>  		ret = kvmppc_tce_validate(stt, tce);
> >>>>>>  		if (ret != H_SUCCESS)
> >>>>>>  			goto unlock_exit;
> >>>>>>  
> >>>>>> +		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> >>>>>> +
> >>>>>> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> >>>>>> +			ret = kvmppc_rm_tce_iommu_map(vcpu->kvm,
> >>>>>> +					stit->tbl, entry + i, gpa,
> >>>>>> +					iommu_tce_direction(tce));
> >>>>>> +			if (WARN_ON_ONCE(ret == H_HARDWARE))
> >>>>>
> >>>>> I don't think you need the WARN() here - the only H_HARDWARE failure
> >>>>> path in iommu_map() already includes a WARN().
> >>>>
> >>>>
> >>>> True, I can drop it here.
> >>>>
> >>>>
> >>>>>
> >>>>>> +				kvmppc_rm_clear_tce(stit->tbl, entry + i);
> >>>>>> +			else if (ret != H_SUCCESS)
> >>>>>> +				goto unlock_exit;
> >>>>>
> >>>>> It's also not clear to me why the H_HARDWARE error path clears the
> >>>>> entry, but the other failure paths don't.  Or why an H_HARDWARE will
> >>>>> result in continuing to set the rest of the TCEs, but other failures
> >>>>> won't.
> >>>>
> >>>>
> >>>> The idea was that other failures still have some chance that handling may
> >>>> succeed in virtual mode or via QEMU, H_HARDWARE is fatal.
> >>>
> >>> Um... yes.. but the logic seems to be backwards for that: on
> >>> H_HARDWARE you warn and keep going, on other errors you bail out
> >>> entirely.
> >>
> >> By "fatal" I mean fatal for this particular hardware TCE(s), no hope in
> >> trying this particular TCE in virtual mode.
> > 
> > Ok... still not following why that means the "fatal" error results in
> > continuing to attempt for the rest of the updated TCEs, whereas the
> > "non fatal" one bails out.
> 
> I was applying the principle that if, after all checks are done, we still
> cannot update the hardware table, then we just clear the TCE and move on.
> Or did I misunderstand the idea?

*Still* not seeing why if we cannot update the hardware table we keep
trying with the rest of the entries, but on other failures we don't.

> > Especially since the bail out will only go
> > to virtual mode if ret == H_TOO_HARD, which it isn't clear is the only
> > possibility.
> 
> 
> H_TOO_HARD goes to virtual mode, H_TOO_HARD in virtual goes to the
> userspace (QEMU).
> 
> Will "if (WARN_ON_ONCE(ret != H_SUCCESS && ret != H_TOO_HARD))" make more
> sense?

Probably, but depends what's in the if.

> > 
> >>>> I am just not sure if H_PARAMETER is what I want to return at [1], to make
> >>>> the calling code simpler, I could return H_HARDWARE there as well (instead
> >>>> of H_PARAMETER).
> >>>
> >>> That sounds right, IIUC the gpa to ua translation shouldn't ever
> >>> fail because of something the guest did. 
> >>
> >>
> >> The guest can easily pass bad TCE/GPA which is not in any registered slot.
> >> So it is rather H_PARAMETER.
> > 
> > Ah, yes.
> > 
> >>> So I'd expect either
> >>> H_HARDWARE, or H_TOO_HARD (if there's some hope that virtual mode can
> >>> make the translation when real mode couldn't).
> >>
> >> No, virtual mode uses the exact same helper.
> > 
> > Ok.
> > 
> 
>

Patch

diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
index ef51740c67ca..f95d867168ea 100644
--- a/Documentation/virtual/kvm/devices/vfio.txt
+++ b/Documentation/virtual/kvm/devices/vfio.txt
@@ -16,7 +16,25 @@  Groups:
 
 KVM_DEV_VFIO_GROUP attributes:
   KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
+	kvm_device_attr.addr points to an int32_t file descriptor
+	for the VFIO group.
   KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
+	kvm_device_attr.addr points to an int32_t file descriptor
+	for the VFIO group.
+  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: attaches a guest visible TCE table
+	allocated by sPAPR KVM.
+	kvm_device_attr.addr points to a struct:
 
-For each, kvm_device_attr.addr points to an int32_t file descriptor
-for the VFIO group.
+	struct kvm_vfio_spapr_tce {
+		__u32	argsz;
+		__u32	flags;
+		__s32	groupfd;
+		__s32	tablefd;
+	};
+
+	where
+	@argsz is the size of kvm_vfio_spapr_tce;
+	@flags are not supported now, must be zero;
+	@groupfd is a file descriptor for a VFIO group;
+	@tablefd is a file descriptor for a TCE table allocated via
+		KVM_CREATE_SPAPR_TCE.
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index e59b172666cd..a827006941f8 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -191,6 +191,13 @@  struct kvmppc_pginfo {
 	atomic_t refcnt;
 };
 
+struct kvmppc_spapr_tce_iommu_table {
+	struct rcu_head rcu;
+	struct list_head next;
+	struct vfio_group *group;
+	struct iommu_table *tbl;
+};
+
 struct kvmppc_spapr_tce_table {
 	struct list_head list;
 	struct kvm *kvm;
@@ -199,6 +206,7 @@  struct kvmppc_spapr_tce_table {
 	u32 page_shift;
 	u64 offset;		/* in pages */
 	u64 size;		/* window size in pages */
+	struct list_head iommu_tables;
 	struct page *pages[0];
 };
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index e04b7fb8ccaa..b8a39dec92cf 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -163,6 +163,10 @@  extern long kvmppc_prepare_vrma(struct kvm *kvm,
 extern void kvmppc_map_vrma(struct kvm_vcpu *vcpu,
 			struct kvm_memory_slot *memslot, unsigned long porder);
 extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
+extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
+		struct vfio_group *group);
+extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
+		struct vfio_group *group);
 
 extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 				struct kvm_create_spapr_tce_64 *args);
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index a2c9bb5a0ead..cdfa01169bd2 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1076,6 +1076,7 @@  struct kvm_device_attr {
 #define  KVM_DEV_VFIO_GROUP			1
 #define   KVM_DEV_VFIO_GROUP_ADD			1
 #define   KVM_DEV_VFIO_GROUP_DEL			2
+#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE		3
 
 enum kvm_device_type {
 	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
@@ -1097,6 +1098,13 @@  enum kvm_device_type {
 	KVM_DEV_TYPE_MAX,
 };
 
+struct kvm_vfio_spapr_tce {
+	__u32	argsz;
+	__u32	flags;
+	__s32	groupfd;
+	__s32	tablefd;
+};
+
 /*
  * ioctls for VM fds
  */
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 15df8ae627d9..062407af09ee 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -27,6 +27,10 @@ 
 #include <linux/hugetlb.h>
 #include <linux/list.h>
 #include <linux/anon_inodes.h>
+#include <linux/iommu.h>
+#include <linux/file.h>
+#include <linux/vfio.h>
+#include <linux/module.h>
 
 #include <asm/tlbflush.h>
 #include <asm/kvm_ppc.h>
@@ -39,6 +43,36 @@ 
 #include <asm/udbg.h>
 #include <asm/iommu.h>
 #include <asm/tce.h>
+#include <asm/mmu_context.h>
+
+static void kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
+{
+	void (*fn)(struct vfio_group *);
+
+	fn = symbol_get(vfio_group_put_external_user);
+	if (WARN_ON(!fn))
+		return;
+
+	fn(vfio_group);
+
+	symbol_put(vfio_group_put_external_user);
+}
+
+static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
+{
+	int (*fn)(struct vfio_group *);
+	int ret = -1;
+
+	fn = symbol_get(vfio_external_user_iommu_id);
+	if (!fn)
+		return ret;
+
+	ret = fn(vfio_group);
+
+	symbol_put(vfio_external_user_iommu_id);
+
+	return ret;
+}
 
 static unsigned long kvmppc_tce_pages(unsigned long iommu_pages)
 {
@@ -90,6 +124,130 @@  static long kvmppc_account_memlimit(unsigned long stt_pages, bool inc)
 	return ret;
 }
 
+static void kvm_spapr_tce_iommu_table_free(struct rcu_head *head)
+{
+	struct kvmppc_spapr_tce_iommu_table *stit = container_of(head,
+			struct kvmppc_spapr_tce_iommu_table, rcu);
+
+	iommu_table_put(stit->tbl);
+	kvm_vfio_group_put_external_user(stit->group);
+
+	kfree(stit);
+}
+
+static void kvm_spapr_tce_liobn_release_iommu_group(
+		struct kvmppc_spapr_tce_table *stt,
+		struct vfio_group *group)
+{
+	struct kvmppc_spapr_tce_iommu_table *stit, *tmp;
+
+	list_for_each_entry_safe(stit, tmp, &stt->iommu_tables, next) {
+		if (group && (stit->group != group))
+			continue;
+
+		list_del_rcu(&stit->next);
+
+		call_rcu(&stit->rcu, kvm_spapr_tce_iommu_table_free);
+	}
+}
+
+extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
+		struct vfio_group *group)
+{
+	struct kvmppc_spapr_tce_table *stt;
+
+	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list)
+		kvm_spapr_tce_liobn_release_iommu_group(stt, group);
+}
+
+extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
+		struct vfio_group *group)
+{
+	struct kvmppc_spapr_tce_table *stt = NULL;
+	bool found = false;
+	struct iommu_table *tbl = NULL;
+	struct iommu_table_group *table_group;
+	long i, ret = 0;
+	struct kvmppc_spapr_tce_iommu_table *stit;
+	struct fd f;
+	int group_id;
+	struct iommu_group *grp;
+
+	group_id = kvm_vfio_external_user_iommu_id(group);
+	grp = iommu_group_get_by_id(group_id);
+	if (WARN_ON(!grp))
+		return -EIO;
+
+	f = fdget(tablefd);
+	if (!f.file) {
+		ret = -EBADF;
+		goto put_exit;
+	}
+
+	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list) {
+		if (stt == f.file->private_data) {
+			found = true;
+			break;
+		}
+	}
+
+	fdput(f);
+
+	if (!found) {
+		ret = -EINVAL;
+		goto put_exit;
+	}
+
+	table_group = iommu_group_get_iommudata(grp);
+	if (WARN_ON(!table_group)) {
+		ret = -EFAULT;
+		goto put_exit;
+	}
+
+	for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
+		struct iommu_table *tbltmp = table_group->tables[i];
+
+		if (!tbltmp)
+			continue;
+
+		/*
+		 * Make sure hardware table parameters are exactly the same;
+		 * this is used in the TCE handlers where boundary checks
+		 * use only the first attached table.
+		 */
+		if ((tbltmp->it_page_shift == stt->page_shift) &&
+				(tbltmp->it_offset == stt->offset) &&
+				(tbltmp->it_size == stt->size)) {
+			tbl = tbltmp;
+			break;
+		}
+	}
+	if (!tbl) {
+		ret = -EINVAL;
+		goto put_exit;
+	}
+
+	list_for_each_entry_rcu(stit, &stt->iommu_tables, next) {
+		if ((stit->tbl == tbl) && (stit->group == group)) {
+			ret = -EBUSY;
+			goto put_exit;
+		}
+	}
+
+	iommu_table_get(tbl);
+
+	stit = kzalloc(sizeof(*stit), GFP_KERNEL);
+	stit->tbl = tbl;
+	stit->group = group;
+
+	list_add_rcu(&stit->next, &stt->iommu_tables);
+
+put_exit:
+	iommu_group_put(grp);
+
+	return ret;
+}
+
 static void release_spapr_tce_table(struct rcu_head *head)
 {
 	struct kvmppc_spapr_tce_table *stt = container_of(head,
@@ -132,6 +290,8 @@  static int kvm_spapr_tce_release(struct inode *inode, struct file *filp)
 
 	list_del_rcu(&stt->list);
 
+	kvm_spapr_tce_liobn_release_iommu_group(stt, NULL /* release all */);
+
 	kvm_put_kvm(stt->kvm);
 
 	kvmppc_account_memlimit(
@@ -181,6 +341,7 @@  long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 	stt->offset = args->offset;
 	stt->size = size;
 	stt->kvm = kvm;
+	INIT_LIST_HEAD_RCU(&stt->iommu_tables);
 
 	for (i = 0; i < npages; i++) {
 		stt->pages[i] = alloc_page(GFP_KERNEL | __GFP_ZERO);
@@ -209,11 +370,102 @@  long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 	return ret;
 }
 
+static void kvmppc_clear_tce(struct iommu_table *tbl, unsigned long entry)
+{
+	unsigned long hpa = 0;
+	enum dma_data_direction dir = DMA_NONE;
+
+	iommu_tce_xchg(tbl, entry, &hpa, &dir);
+}
+
+static long kvmppc_tce_iommu_mapped_dec(struct kvm *kvm,
+		struct iommu_table *tbl, unsigned long entry)
+{
+	struct mm_iommu_table_group_mem_t *mem = NULL;
+	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
+	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
+
+	if (WARN_ON(!pua))
+		return H_HARDWARE;
+
+	mem = mm_iommu_lookup(kvm->mm, *pua, pgsize);
+	if (!mem)
+		return H_TOO_HARD;
+
+	mm_iommu_mapped_dec(mem);
+
+	*pua = 0;
+
+	return H_SUCCESS;
+}
+
+static long kvmppc_tce_iommu_unmap(struct kvm *kvm,
+		struct iommu_table *tbl, unsigned long entry)
+{
+	enum dma_data_direction dir = DMA_NONE;
+	unsigned long hpa = 0;
+	long ret;
+
+	if (iommu_tce_xchg(tbl, entry, &hpa, &dir))
+		return H_HARDWARE;
+
+	if (dir == DMA_NONE)
+		return H_SUCCESS;
+
+	ret = kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
+	if (ret != H_SUCCESS)
+		iommu_tce_xchg(tbl, entry, &hpa, &dir);
+
+	return ret;
+}
+
+long kvmppc_tce_iommu_map(struct kvm *kvm, struct iommu_table *tbl,
+		unsigned long entry, unsigned long gpa,
+		enum dma_data_direction dir)
+{
+	long ret;
+	unsigned long hpa, ua, *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
+	struct mm_iommu_table_group_mem_t *mem;
+
+	if (!pua)
+		/* it_userspace allocation might be delayed */
+		return H_TOO_HARD;
+
+	if (kvmppc_gpa_to_ua(kvm, gpa, &ua, NULL))
+		return H_PARAMETER;
+
+	mem = mm_iommu_lookup(kvm->mm, ua, 1ULL << tbl->it_page_shift);
+	if (!mem)
+		return H_TOO_HARD;
+
+	if (WARN_ON(mm_iommu_ua_to_hpa(mem, ua, &hpa)))
+		return H_HARDWARE;
+
+	if (mm_iommu_mapped_inc(mem))
+		return H_CLOSED;
+
+	ret = iommu_tce_xchg(tbl, entry, &hpa, &dir);
+	if (ret) {
+		mm_iommu_mapped_dec(mem);
+		return H_TOO_HARD;
+	}
+
+	if (dir != DMA_NONE)
+		kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
+
+	*pua = ua;
+
+	return 0;
+}
+
 long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		      unsigned long ioba, unsigned long tce)
 {
 	struct kvmppc_spapr_tce_table *stt;
-	long ret;
+	long ret, idx;
+	struct kvmppc_spapr_tce_iommu_table *stit;
+	unsigned long entry, gpa;
+	enum dma_data_direction dir;
 
 	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
 	/* 	    liobn, ioba, tce); */
@@ -230,7 +482,28 @@  long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 	if (ret != H_SUCCESS)
 		return ret;
 
-	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
+	entry = ioba >> stt->page_shift;
+	gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
+	dir = iommu_tce_direction(tce);
+
+	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
+		if (dir == DMA_NONE) {
+			ret = kvmppc_tce_iommu_unmap(vcpu->kvm,
+					stit->tbl, entry);
+		} else {
+			idx = srcu_read_lock(&vcpu->kvm->srcu);
+			ret = kvmppc_tce_iommu_map(vcpu->kvm, stit->tbl,
+					entry, gpa, dir);
+			srcu_read_unlock(&vcpu->kvm->srcu, idx);
+		}
+
+		if (WARN_ON_ONCE(ret == H_HARDWARE))
+			kvmppc_clear_tce(stit->tbl, entry);
+		else if (ret != H_SUCCESS)
+			return ret;
+	}
+
+	kvmppc_tce_put(stt, entry, tce);
 
 	return H_SUCCESS;
 }
@@ -242,9 +515,10 @@  long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 {
 	struct kvmppc_spapr_tce_table *stt;
 	long i, ret = H_SUCCESS, idx;
-	unsigned long entry, ua = 0;
+	unsigned long entry, ua = 0, gpa;
 	u64 __user *tces;
 	u64 tce;
+	struct kvmppc_spapr_tce_iommu_table *stit;
 
 	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
@@ -283,6 +557,18 @@  long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 		if (ret != H_SUCCESS)
 			goto unlock_exit;
 
+		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
+
+		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
+			ret = kvmppc_tce_iommu_map(vcpu->kvm,
+					stit->tbl, entry + i, gpa,
+					iommu_tce_direction(tce));
+			if (WARN_ON_ONCE(ret == H_HARDWARE))
+				kvmppc_clear_tce(stit->tbl, entry + i);
+			else if (ret != H_SUCCESS)
+				goto unlock_exit;
+		}
+
 		kvmppc_tce_put(stt, entry + i, tce);
 	}
 
@@ -299,6 +585,7 @@  long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
 {
 	struct kvmppc_spapr_tce_table *stt;
 	long i, ret;
+	struct kvmppc_spapr_tce_iommu_table *stit;
 
 	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
@@ -312,6 +599,20 @@  long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
 	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
 		return H_PARAMETER;
 
+	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
+		unsigned long entry = ioba >> stit->tbl->it_page_shift;
+
+		for (i = 0; i < npages; ++i) {
+			ret = kvmppc_tce_iommu_unmap(vcpu->kvm,
+					stit->tbl, entry + i);
+
+			if (WARN_ON_ONCE(ret == H_HARDWARE))
+				kvmppc_clear_tce(stit->tbl, entry + i);
+			else if (ret != H_SUCCESS)
+				return ret;
+		}
+	}
+
 	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
 		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
 
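The H_PUT_TCE path above splits each guest TCE into a table index, a guest physical address and a DMA direction before touching the hardware tables. A minimal user-space sketch of that decoding follows. It assumes the permission-bit values from arch/powerpc's asm/tce.h (TCE_PCI_READ = 0x1, TCE_PCI_WRITE = 0x2); enum dma_dir, tce_direction(), tce_gpa() and tce_entry() are hypothetical stand-ins for illustration, not kernel symbols.

```c
#include <assert.h>
#include <stdint.h>

/* Permission bits in the low end of a TCE (assumed to match asm/tce.h). */
#define TCE_PCI_READ  0x1ULL
#define TCE_PCI_WRITE 0x2ULL

/* Hypothetical local stand-in for the kernel's enum dma_data_direction. */
enum dma_dir { DIR_BIDIRECTIONAL, DIR_TO_DEVICE, DIR_FROM_DEVICE, DIR_NONE };

/* Models what iommu_tce_direction() is expected to return for a TCE. */
static enum dma_dir tce_direction(uint64_t tce)
{
	switch (tce & (TCE_PCI_READ | TCE_PCI_WRITE)) {
	case TCE_PCI_READ | TCE_PCI_WRITE:
		return DIR_BIDIRECTIONAL;
	case TCE_PCI_READ:
		return DIR_TO_DEVICE;
	case TCE_PCI_WRITE:
		return DIR_FROM_DEVICE;
	default:
		return DIR_NONE;	/* no permission bits: an unmap request */
	}
}

/* Guest physical address: the TCE with the permission bits masked off. */
static uint64_t tce_gpa(uint64_t tce)
{
	return tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
}

/* Table index: the I/O bus address divided by the IOMMU page size. */
static uint64_t tce_entry(uint64_t ioba, unsigned int page_shift)
{
	return ioba >> page_shift;
}
```

With 4K IOMMU pages (page_shift = 12), an ioba of 0x30000 selects entry 0x30, and a TCE with no permission bits decodes to DIR_NONE, which is the case the handler routes to the unmap helper.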
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index 92d769f4eaea..4a1d978ebd98 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -161,11 +161,111 @@  long kvmppc_gpa_to_ua(struct kvm *kvm, unsigned long gpa,
 EXPORT_SYMBOL_GPL(kvmppc_gpa_to_ua);
 
 #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
+static void kvmppc_rm_clear_tce(struct iommu_table *tbl, unsigned long entry)
+{
+	unsigned long hpa = 0;
+	enum dma_data_direction dir = DMA_NONE;
+
+	iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
+}
+
+static long kvmppc_rm_tce_iommu_mapped_dec(struct kvm *kvm,
+		struct iommu_table *tbl, unsigned long entry)
+{
+	struct mm_iommu_table_group_mem_t *mem = NULL;
+	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
+	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
+
+	if (WARN_ON(!pua))
+		return H_HARDWARE;
+
+	pua = (void *) vmalloc_to_phys(pua);
+	if (!pua)
+		return H_TOO_HARD;
+
+	mem = mm_iommu_lookup_rm(kvm->mm, *pua, pgsize);
+	if (!mem)
+		return H_TOO_HARD;
+
+	mm_iommu_mapped_dec(mem);
+
+	*pua = 0;
+
+	return H_SUCCESS;
+}
+
+static long kvmppc_rm_tce_iommu_unmap(struct kvm *kvm,
+		struct iommu_table *tbl, unsigned long entry)
+{
+	enum dma_data_direction dir = DMA_NONE;
+	unsigned long hpa = 0;
+	long ret;
+
+	if (iommu_tce_xchg_rm(tbl, entry, &hpa, &dir))
+		return H_HARDWARE;
+
+	if (dir == DMA_NONE)
+		return H_SUCCESS;
+
+	ret = kvmppc_rm_tce_iommu_mapped_dec(kvm, tbl, entry);
+	if (ret)
+		iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
+
+	return ret;
+}
+
+static long kvmppc_rm_tce_iommu_map(struct kvm *kvm, struct iommu_table *tbl,
+		unsigned long entry, unsigned long gpa,
+		enum dma_data_direction dir)
+{
+	long ret;
+	unsigned long hpa = 0, ua;
+	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
+	struct mm_iommu_table_group_mem_t *mem;
+
+	if (!pua)
+		/* it_userspace allocation might be delayed */
+		return H_TOO_HARD;
+
+	if (kvmppc_gpa_to_ua(kvm, gpa, &ua, NULL))
+		return H_PARAMETER;
+
+	mem = mm_iommu_lookup_rm(kvm->mm, ua, 1ULL << tbl->it_page_shift);
+	if (!mem)
+		return H_TOO_HARD;
+
+	if (WARN_ON(mm_iommu_ua_to_hpa_rm(mem, ua, &hpa)))
+		return H_HARDWARE;
+
+	pua = (void *) vmalloc_to_phys(pua);
+	if (WARN_ON(!pua))
+		return H_HARDWARE;
+
+	if (WARN_ON(mm_iommu_mapped_inc(mem)))
+		return H_CLOSED;
+
+	ret = iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
+	if (ret) {
+		mm_iommu_mapped_dec(mem);
+		return H_TOO_HARD;
+	}
+
+	if (dir != DMA_NONE)
+		kvmppc_rm_tce_iommu_mapped_dec(kvm, tbl, entry);
+
+	*pua = ua;
+
+	return 0;
+}
+
 long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		unsigned long ioba, unsigned long tce)
 {
 	struct kvmppc_spapr_tce_table *stt;
 	long ret;
+	struct kvmppc_spapr_tce_iommu_table *stit;
+	unsigned long entry, gpa;
+	enum dma_data_direction dir;
 
 	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
 	/* 	    liobn, ioba, tce); */
@@ -182,7 +282,25 @@  long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 	if (ret != H_SUCCESS)
 		return ret;
 
-	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
+	entry = ioba >> stt->page_shift;
+	gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
+	dir = iommu_tce_direction(tce);
+
+	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
+		if (dir == DMA_NONE)
+			ret = kvmppc_rm_tce_iommu_unmap(vcpu->kvm,
+					stit->tbl, entry);
+		else
+			ret = kvmppc_rm_tce_iommu_map(vcpu->kvm,
+					stit->tbl, entry, gpa, dir);
+
+		if (WARN_ON_ONCE(ret == H_HARDWARE))
+			kvmppc_rm_clear_tce(stit->tbl, entry);
+		else if (ret != H_SUCCESS)
+			return ret;
+	}
+
+	kvmppc_tce_put(stt, entry, tce);
 
 	return H_SUCCESS;
 }
@@ -220,9 +338,10 @@  long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 {
 	struct kvmppc_spapr_tce_table *stt;
 	long i, ret = H_SUCCESS;
-	unsigned long tces, entry, ua = 0;
+	unsigned long tces, entry, ua = 0, tce, gpa;
 	unsigned long *rmap = NULL;
 	bool prereg = false;
+	struct kvmppc_spapr_tce_iommu_table *stit;
 
 	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
@@ -287,12 +406,24 @@  long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	}
 
 	for (i = 0; i < npages; ++i) {
-		unsigned long tce = be64_to_cpu(((u64 *)tces)[i]);
+		tce = be64_to_cpu(((u64 *)tces)[i]);
 
 		ret = kvmppc_tce_validate(stt, tce);
 		if (ret != H_SUCCESS)
 			goto unlock_exit;
 
+		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
+
+		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
+			ret = kvmppc_rm_tce_iommu_map(vcpu->kvm,
+					stit->tbl, entry + i, gpa,
+					iommu_tce_direction(tce));
+			if (WARN_ON_ONCE(ret == H_HARDWARE))
+				kvmppc_rm_clear_tce(stit->tbl, entry + i);
+			else if (ret != H_SUCCESS)
+				goto unlock_exit;
+		}
+
 		kvmppc_tce_put(stt, entry + i, tce);
 	}
 
@@ -309,6 +440,7 @@  long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
 {
 	struct kvmppc_spapr_tce_table *stt;
 	long i, ret;
+	struct kvmppc_spapr_tce_iommu_table *stit;
 
 	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
@@ -322,6 +454,20 @@  long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
 	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
 		return H_PARAMETER;
 
+	list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
+		unsigned long entry = ioba >> stit->tbl->it_page_shift;
+
+		for (i = 0; i < npages; ++i) {
+			ret = kvmppc_rm_tce_iommu_unmap(vcpu->kvm,
+					stit->tbl, entry + i);
+
+			if (WARN_ON_ONCE(ret == H_HARDWARE))
+				kvmppc_rm_clear_tce(stit->tbl, entry + i);
+			else if (ret != H_SUCCESS)
+				return ret;
+		}
+	}
+
 	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
 		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
 
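Every handler above applies the same error policy while walking the attached hardware tables: H_HARDWARE means the entry is cleared and the walk continues, any other failure is returned to the caller (which then retries in virtual mode or user space), and only a clean walk reaches kvmppc_tce_put(). The sketch below models that policy in user space. The return-code values are assumed to match asm/hvcall.h; struct fake_table and update_all_tables() are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypercall return codes (assumed to match arch/powerpc's asm/hvcall.h). */
#define H_SUCCESS	0L
#define H_HARDWARE	(-1L)
#define H_PARAMETER	(-4L)
#define H_TOO_HARD	9999L

/* Hypothetical stand-in for one attached hardware IOMMU table. */
struct fake_table {
	long map_result;	/* what the per-table map/unmap call returns */
	int cleared;		/* set when the entry was force-cleared */
};

/*
 * Walk all tables the way the handlers do: clear and continue on
 * H_HARDWARE, bail out on any other error, succeed otherwise.
 */
static long update_all_tables(struct fake_table *tables, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		long ret = tables[i].map_result;

		if (ret == H_HARDWARE)
			tables[i].cleared = 1;	/* models kvmppc_clear_tce() */
		else if (ret != H_SUCCESS)
			return ret;		/* e.g. H_TOO_HARD: retry elsewhere */
	}

	return H_SUCCESS;
}
```

Note that an early return leaves tables later in the list untouched, so a request that fails part way through is expected to be replayed in full by the next handler in the chain.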
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index cd892dec7cb6..f3127dc87912 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -536,6 +536,8 @@  int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 #ifdef CONFIG_PPC_BOOK3S_64
 	case KVM_CAP_SPAPR_TCE:
 	case KVM_CAP_SPAPR_TCE_64:
+		/* fallthrough */
+	case KVM_CAP_SPAPR_TCE_VFIO:
 	case KVM_CAP_PPC_RTAS:
 	case KVM_CAP_PPC_FIXUP_HCALL:
 	case KVM_CAP_PPC_ENABLE_HCALL:
diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
index d32f239eb471..2b7dc22265fe 100644
--- a/virt/kvm/vfio.c
+++ b/virt/kvm/vfio.c
@@ -20,6 +20,10 @@ 
 #include <linux/vfio.h>
 #include "vfio.h"
 
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+#include <asm/kvm_ppc.h>
+#endif
+
 struct kvm_vfio_group {
 	struct list_head node;
 	struct vfio_group *vfio_group;
@@ -211,6 +215,9 @@  static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
 
 		mutex_unlock(&kv->lock);
 
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+		kvm_spapr_tce_release_iommu_group(dev->kvm, vfio_group);
+#endif
 		kvm_vfio_group_set_kvm(vfio_group, NULL);
 
 		kvm_vfio_group_put_external_user(vfio_group);
@@ -218,6 +225,53 @@  static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
 		kvm_vfio_update_coherency(dev);
 
 		return ret;
+
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+	case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: {
+		struct kvm_vfio_spapr_tce param;
+		unsigned long minsz;
+		struct kvm_vfio *kv = dev->private;
+		struct vfio_group *vfio_group;
+		struct kvm_vfio_group *kvg;
+		struct fd f;
+
+		minsz = offsetofend(struct kvm_vfio_spapr_tce, tablefd);
+
+		if (copy_from_user(&param, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (param.argsz < minsz || param.flags)
+			return -EINVAL;
+
+		f = fdget(param.groupfd);
+		if (!f.file)
+			return -EBADF;
+
+		vfio_group = kvm_vfio_group_get_external_user(f.file);
+		fdput(f);
+
+		if (IS_ERR(vfio_group))
+			return PTR_ERR(vfio_group);
+
+		ret = -ENOENT;
+
+		mutex_lock(&kv->lock);
+
+		list_for_each_entry(kvg, &kv->group_list, node) {
+			if (kvg->vfio_group != vfio_group)
+				continue;
+
+			ret = kvm_spapr_tce_attach_iommu_group(dev->kvm,
+					param.tablefd, vfio_group);
+
+			break;
+		}
+
+		mutex_unlock(&kv->lock);
+
+		return ret;
+	}
+#endif /* CONFIG_SPAPR_TCE_IOMMU */
 	}
 
 	return -ENXIO;
@@ -242,6 +296,9 @@  static int kvm_vfio_has_attr(struct kvm_device *dev,
 		switch (attr->attr) {
 		case KVM_DEV_VFIO_GROUP_ADD:
 		case KVM_DEV_VFIO_GROUP_DEL:
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+		case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE:
+#endif
 			return 0;
 		}
 
@@ -257,6 +314,9 @@  static void kvm_vfio_destroy(struct kvm_device *dev)
 	struct kvm_vfio_group *kvg, *tmp;
 
 	list_for_each_entry_safe(kvg, tmp, &kv->group_list, node) {
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+		kvm_spapr_tce_release_iommu_group(dev->kvm, kvg->vfio_group);
+#endif
 		kvm_vfio_group_set_kvm(kvg->vfio_group, NULL);
 		kvm_vfio_group_put_external_user(kvg->vfio_group);
 		list_del(&kvg->node);
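The KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE handler above copies only the first minsz bytes of the user structure and rejects a short argsz or any non-zero flags. A minimal user-space sketch of that check follows. The field order of struct spapr_tce_param is an assumption made for illustration (the real layout is whatever the uapi header in this series defines), OFFSETOFEND() is a local equivalent of the kernel's offsetofend(), and memcpy() stands in for copy_from_user().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout; only the field names come from the handler above. */
struct spapr_tce_param {
	uint32_t argsz;
	uint32_t flags;
	int32_t groupfd;
	int32_t tablefd;
};

/* Local equivalent of the kernel's offsetofend() helper. */
#define OFFSETOFEND(type, member) \
	(offsetof(type, member) + sizeof(((type *)0)->member))

/* Returns 0 and fills *out on success, -EINVAL on a malformed request. */
static int check_param(const void *user_buf, struct spapr_tce_param *out)
{
	size_t minsz = OFFSETOFEND(struct spapr_tce_param, tablefd);

	/* Stands in for copy_from_user(); copies only the known fields. */
	memcpy(out, user_buf, minsz);

	if (out->argsz < minsz || out->flags)
		return -EINVAL;

	return 0;
}
```

Checking argsz against minsz rather than sizeof() lets the structure grow later: newer user space can pass a larger argsz, while older kernels still validate the fields they know about.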