Patchwork [3/4] KVM: PPC: Add support for IOMMU in-kernel handling

login
register
mail settings
Submitter Alexey Kardashevskiy
Date June 5, 2013, 6:11 a.m.
Message ID <1370412673-1345-4-git-send-email-aik@ozlabs.ru>
Download mbox | patch
Permalink /patch/248921/
State New
Headers show

Comments

Alexey Kardashevskiy - June 5, 2013, 6:11 a.m.
This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
and H_STUFF_TCE requests without passing them to QEMU, which should
save time on switching to QEMU and back.

Both real and virtual modes are supported - whenever the kernel
fails to handle TCE request, it passes it to the virtual mode.
If it the virtual mode handlers fail, then the request is passed
to the user mode, for example, to QEMU.

This adds a new KVM_CAP_SPAPR_TCE_IOMMU ioctl to asssociate
a virtual PCI bus ID (LIOBN) with an IOMMU group, which enables
in-kernel handling of IOMMU map/unmap.

Tests show that this patch increases transmission speed from 220MB/s
to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).

Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@samba.org>

---

Changes:
2013/06/05:
* changed capability number
* changed ioctl number
* update the doc article number

2013/05/20:
* removed get_user() from real mode handlers
* kvm_vcpu_arch::tce_tmp usage extended. Now real mode handler puts there
translated TCEs, tries realmode_get_page() on those and if it fails, it
passes control over the virtual mode handler which tries to finish
the request handling
* kvmppc_lookup_pte() now does realmode_get_page() protected by BUSY bit
on a page
* The only reason to pass the request to user mode now is when the user mode
did not register TCE table in the kernel, in all other cases the virtual mode
handler is expected to do the job
---
 Documentation/virtual/kvm/api.txt   |   28 +++++
 arch/powerpc/include/asm/kvm_host.h |    3 +
 arch/powerpc/include/asm/kvm_ppc.h  |    2 +
 arch/powerpc/include/uapi/asm/kvm.h |    7 ++
 arch/powerpc/kvm/book3s_64_vio.c    |  198 ++++++++++++++++++++++++++++++++++-
 arch/powerpc/kvm/book3s_64_vio_hv.c |  193 +++++++++++++++++++++++++++++++++-
 arch/powerpc/kvm/powerpc.c          |   12 +++
 include/uapi/linux/kvm.h            |    2 +
 8 files changed, 439 insertions(+), 6 deletions(-)
Benjamin Herrenschmidt - June 16, 2013, 4:39 a.m.
>  static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing,
> -			unsigned long *pte_sizep)
> +			unsigned long *pte_sizep, bool do_get_page)
>  {
>  	pte_t *ptep;
>  	unsigned int shift = 0;
> @@ -135,6 +136,14 @@ static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing,
>  	if (!pte_present(*ptep))
>  		return __pte(0);
>  
> +	/*
> +	 * Put huge pages handling to the virtual mode.
> +	 * The only exception is for TCE list pages which we
> +	 * do need to call get_page() for.
> +	 */
> +	if ((*pte_sizep > PAGE_SIZE) && do_get_page)
> +		return __pte(0);
> +
>  	/* wait until _PAGE_BUSY is clear then set it atomically */
>  	__asm__ __volatile__ (
>  		"1:	ldarx	%0,0,%3\n"
> @@ -148,6 +157,18 @@ static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing,
>  		: "cc");
>  
>  	ret = pte;
> +	if (do_get_page && pte_present(pte) && (!writing || pte_write(pte))) {
> +		struct page *pg = NULL;
> +		pg = realmode_pfn_to_page(pte_pfn(pte));
> +		if (realmode_get_page(pg)) {
> +			ret = __pte(0);
> +		} else {
> +			pte = pte_mkyoung(pte);
> +			if (writing)
> +				pte = pte_mkdirty(pte);
> +		}
> +	}
> +	*ptep = pte;	/* clears _PAGE_BUSY */
>  
>  	return ret;
>  }

So now you are adding the clearing of _PAGE_BUSY that was missing for
your first patch, except that this is not enough since that means that
in the "emulated" case (ie, !do_get_page) you will in essence return
and then use a PTE that is not locked without any synchronization to
ensure that the underlying page doesn't go away... then you'll
dereference that page.

So either make everything use speculative get_page, or make the emulated
case use the MMU notifier to drop the operation in case of collision.

The former looks easier.

Also, any specific reason why you do:

  - Lock the PTE
  - get_page()
  - Unlock the PTE

Instead of

  - Read the PTE
  - get_page_unless_zero
  - re-check PTE

Like get_user_pages_fast() does ?

The former will be two atomic ops, the latter only one (faster), but
maybe you have a good reason why that can't work...

Cheers,
Ben.


--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alexander Graf - June 16, 2013, 10:25 p.m.
On 05.06.2013, at 08:11, Alexey Kardashevskiy wrote:

> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
> and H_STUFF_TCE requests without passing them to QEMU, which should
> save time on switching to QEMU and back.
> 
> Both real and virtual modes are supported - whenever the kernel
> fails to handle TCE request, it passes it to the virtual mode.
> If it the virtual mode handlers fail, then the request is passed
> to the user mode, for example, to QEMU.
> 
> This adds a new KVM_CAP_SPAPR_TCE_IOMMU ioctl to asssociate
> a virtual PCI bus ID (LIOBN) with an IOMMU group, which enables
> in-kernel handling of IOMMU map/unmap.
> 
> Tests show that this patch increases transmission speed from 220MB/s
> to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).
> 
> Cc: David Gibson <david@gibson.dropbear.id.au>
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> Signed-off-by: Paul Mackerras <paulus@samba.org>
> 
> ---
> 
> Changes:
> 2013/06/05:
> * changed capability number
> * changed ioctl number
> * update the doc article number
> 
> 2013/05/20:
> * removed get_user() from real mode handlers
> * kvm_vcpu_arch::tce_tmp usage extended. Now real mode handler puts there
> translated TCEs, tries realmode_get_page() on those and if it fails, it
> passes control over the virtual mode handler which tries to finish
> the request handling
> * kvmppc_lookup_pte() now does realmode_get_page() protected by BUSY bit
> on a page
> * The only reason to pass the request to user mode now is when the user mode
> did not register TCE table in the kernel, in all other cases the virtual mode
> handler is expected to do the job
> ---
> Documentation/virtual/kvm/api.txt   |   28 +++++
> arch/powerpc/include/asm/kvm_host.h |    3 +
> arch/powerpc/include/asm/kvm_ppc.h  |    2 +
> arch/powerpc/include/uapi/asm/kvm.h |    7 ++
> arch/powerpc/kvm/book3s_64_vio.c    |  198 ++++++++++++++++++++++++++++++++++-
> arch/powerpc/kvm/book3s_64_vio_hv.c |  193 +++++++++++++++++++++++++++++++++-
> arch/powerpc/kvm/powerpc.c          |   12 +++
> include/uapi/linux/kvm.h            |    2 +
> 8 files changed, 439 insertions(+), 6 deletions(-)
> 
> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> index 6c082ff..e962e3b 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -2379,6 +2379,34 @@ the guest. Othwerwise it might be better for the guest to continue using H_PUT_T
> hypercall (if KVM_CAP_SPAPR_TCE or KVM_CAP_SPAPR_TCE_IOMMU are present).
> 
> 
> +4.84 KVM_CREATE_SPAPR_TCE_IOMMU
> +
> +Capability: KVM_CAP_SPAPR_TCE_IOMMU
> +Architectures: powerpc
> +Type: vm ioctl
> +Parameters: struct kvm_create_spapr_tce_iommu (in)
> +Returns: 0 on success, -1 on error
> +
> +This creates a link between IOMMU group and a hardware TCE (translation
> +control entry) table. This link lets the host kernel know what IOMMU
> +group (i.e. TCE table) to use for the LIOBN number passed with
> +H_PUT_TCE, H_PUT_TCE_INDIRECT, H_STUFF_TCE hypercalls.
> +
> +/* for KVM_CAP_SPAPR_TCE_IOMMU */
> +struct kvm_create_spapr_tce_iommu {
> +	__u64 liobn;
> +	__u32 iommu_id;
> +	__u32 flags;
> +};
> +
> +No flag is supported at the moment.
> +
> +When the guest issues TCE call on a liobn for which a TCE table has been
> +registered, the kernel will handle it in real mode, updating the hardware
> +TCE table. TCE table calls for other liobns will cause a vm exit and must
> +be handled by userspace.

Ok, please walk me through the security model you have in mind here.

Basically what this ioctl does is that it creates a guest TCE table that reflects its changes into a host TCE table whenever it gets modified. So far so good.

Now I don't see any checks that verify whether iommu_id is actually good to use from that user's access rights. Just because I have access to /dev/kvm I don't necessarily have access to an iommu control device.

So the least I can see would be a local DoS attack where one user space program with only access to /dev/kvm can simply kill any access to another process's device by overflowing a host iommu TCE table with junk entries.

There's even a certain chance of an information disclosure exploit here where a malicious user space program could get itself all network traffic DMA'd from another VM.

How does this work on the host level? What is the security token to take control of a host TCE table?


Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Benjamin Herrenschmidt - June 16, 2013, 10:39 p.m.
On Wed, 2013-06-05 at 16:11 +1000, Alexey Kardashevskiy wrote:
> +long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
> +               struct kvm_create_spapr_tce_iommu *args)
> +{
> +       struct kvmppc_spapr_tce_table *tt = NULL;
> +       struct iommu_group *grp;
> +       struct iommu_table *tbl;
> +
> +       /* Find an IOMMU table for the given ID */
> +       grp = iommu_group_get_by_id(args->iommu_id);
> +       if (!grp)
> +               return -ENXIO;
> +
> +       tbl = iommu_group_get_iommudata(grp);
> +       if (!tbl)
> +               return -ENXIO;

So Alex Graf pointed out here, there is a security issue here, or are we
missing something ?

What prevents a malicious program that has access to /dev/kvm from
taking over random iommu groups (including host used ones) that way?

What is the security model of that whole iommu stuff to begin with ?

Cheers,
Ben.


--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alex Williamson - June 17, 2013, 3:13 a.m.
On Mon, 2013-06-17 at 08:39 +1000, Benjamin Herrenschmidt wrote:
> On Wed, 2013-06-05 at 16:11 +1000, Alexey Kardashevskiy wrote:
> > +long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
> > +               struct kvm_create_spapr_tce_iommu *args)
> > +{
> > +       struct kvmppc_spapr_tce_table *tt = NULL;
> > +       struct iommu_group *grp;
> > +       struct iommu_table *tbl;
> > +
> > +       /* Find an IOMMU table for the given ID */
> > +       grp = iommu_group_get_by_id(args->iommu_id);
> > +       if (!grp)
> > +               return -ENXIO;
> > +
> > +       tbl = iommu_group_get_iommudata(grp);
> > +       if (!tbl)
> > +               return -ENXIO;
> 
> So Alex Graf pointed out here, there is a security issue here, or are we
> missing something ?
> 
> What prevents a malicious program that has access to /dev/kvm from
> taking over random iommu groups (including host used ones) that way?
> 
> What is the security model of that whole iommu stuff to begin with ?

IOMMU groups themselves don't provide security, they're accessed by
interfaces like VFIO, which provide the security.  Given a brief look, I
agree, this looks like a possible backdoor.  The typical VFIO way to
handle this would be to pass a VFIO file descriptor here to prove that
the process has access to the IOMMU group.  This is how /dev/vfio/vfio
gains the ability to setup an IOMMU domain an do mappings with the
SET_CONTAINER ioctl using a group fd.  Thanks,

Alex



--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Benjamin Herrenschmidt - June 17, 2013, 3:56 a.m.
On Sun, 2013-06-16 at 21:13 -0600, Alex Williamson wrote:

> IOMMU groups themselves don't provide security, they're accessed by
> interfaces like VFIO, which provide the security.  Given a brief look, I
> agree, this looks like a possible backdoor.  The typical VFIO way to
> handle this would be to pass a VFIO file descriptor here to prove that
> the process has access to the IOMMU group.  This is how /dev/vfio/vfio
> gains the ability to setup an IOMMU domain an do mappings with the
> SET_CONTAINER ioctl using a group fd.  Thanks,

How do you envision that in the kernel ? IE. I'm in KVM code, gets that
vfio fd, what do I do with it ?

Basically, KVM needs to know that the user is allowed to use that iommu
group. I don't think we want KVM however to call into VFIO directly
right ?

Cheers,
Ben.


--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alex Williamson - June 18, 2013, 2:32 a.m.
On Mon, 2013-06-17 at 13:56 +1000, Benjamin Herrenschmidt wrote:
> On Sun, 2013-06-16 at 21:13 -0600, Alex Williamson wrote:
> 
> > IOMMU groups themselves don't provide security, they're accessed by
> > interfaces like VFIO, which provide the security.  Given a brief look, I
> > agree, this looks like a possible backdoor.  The typical VFIO way to
> > handle this would be to pass a VFIO file descriptor here to prove that
> > the process has access to the IOMMU group.  This is how /dev/vfio/vfio
> > gains the ability to setup an IOMMU domain an do mappings with the
> > SET_CONTAINER ioctl using a group fd.  Thanks,
> 
> How do you envision that in the kernel ? IE. I'm in KVM code, gets that
> vfio fd, what do I do with it ?
> 
> Basically, KVM needs to know that the user is allowed to use that iommu
> group. I don't think we want KVM however to call into VFIO directly
> right ?

Right, we don't want to create dependencies across modules.  I don't
have a vision for how this should work.  This is effectively a complete
side-band to vfio, so we're really just dealing in the iommu group
space.  Maybe there needs to be some kind of registration of ownership
for the group using some kind of token.  It would need to include some
kind of notification when that ownership ends.  That might also be a
convenient tag to toggle driver probing off for devices in the group.
Other ideas?  Thanks,

Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Benjamin Herrenschmidt - June 18, 2013, 4:38 a.m.
On Mon, 2013-06-17 at 20:32 -0600, Alex Williamson wrote:

> Right, we don't want to create dependencies across modules.  I don't
> have a vision for how this should work.  This is effectively a complete
> side-band to vfio, so we're really just dealing in the iommu group
> space.  Maybe there needs to be some kind of registration of ownership
> for the group using some kind of token.  It would need to include some
> kind of notification when that ownership ends.  That might also be a
> convenient tag to toggle driver probing off for devices in the group.
> Other ideas?  Thanks,

All of that smells nasty like it will need a pile of bloody
infrastructure.... which makes me think it's too complicated and not the
right approach.

How does access control work today on x86/VFIO ? Can you give me a bit
more details ? I didn't get a good grasp in your previous email....

From the look of it, the VFIO file descriptor is what has the "access
control" to the underlying iommu, is this right ? So we somewhat need to
transfer (or copy) that ownership from the VFIO fd to the KVM VM.

I don't see a way to do that without some cross-layering here...

Rusty, are you aware of some kernel mechanism we can use for that ?

Cheers,
Ben.


--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alex Williamson - June 18, 2013, 2:48 p.m.
On Tue, 2013-06-18 at 14:38 +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2013-06-17 at 20:32 -0600, Alex Williamson wrote:
> 
> > Right, we don't want to create dependencies across modules.  I don't
> > have a vision for how this should work.  This is effectively a complete
> > side-band to vfio, so we're really just dealing in the iommu group
> > space.  Maybe there needs to be some kind of registration of ownership
> > for the group using some kind of token.  It would need to include some
> > kind of notification when that ownership ends.  That might also be a
> > convenient tag to toggle driver probing off for devices in the group.
> > Other ideas?  Thanks,
> 
> All of that smells nasty like it will need a pile of bloody
> infrastructure.... which makes me think it's too complicated and not the
> right approach.
> 
> How does access control work today on x86/VFIO ? Can you give me a bit
> more details ? I didn't get a good grasp in your previous email....

The current model is not x86 specific, but it only covers doing iommu
and device access through vfio.  The kink here is that we're trying to
do device access and setup through vfio, but iommu manipulation through
kvm.  We may want to revisit whether we can do the in-kernel iommu
manipulation through vfio rather than kvm.

For vfio in general, the group is the unit of ownership.  A user is
granted access to /dev/vfio/$GROUP through file permissions.  The user
opens the group and a container (/dev/vfio/vfio) and calls SET_CONTAINER
on the group.  If supported by the platform, multiple groups can be set
to the same container, which allows for iommu domain sharing.  Once a
group is associated with a container, an iommu backend can be
initialized for the container.  Only then can a device be accessed
through the group.

So even if we were to pass a vfio group file descriptor into kvm and it
matched as some kind of ownership token on the iommu group, it's not
clear that's sufficient to assume we can start programming the iommu.
Thanks,

Alex

> From the look of it, the VFIO file descriptor is what has the "access
> control" to the underlying iommu, is this right ? So we somewhat need to
> transfer (or copy) that ownership from the VFIO fd to the KVM VM.
> 
> I don't see a way to do that without some cross-layering here...
> 
> Rusty, are you aware of some kernel mechanism we can use for that ?
> 
> Cheers,
> Ben.
> 
> 



--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Benjamin Herrenschmidt - June 18, 2013, 9:58 p.m.
On Tue, 2013-06-18 at 08:48 -0600, Alex Williamson wrote:
> On Tue, 2013-06-18 at 14:38 +1000, Benjamin Herrenschmidt wrote:
> > On Mon, 2013-06-17 at 20:32 -0600, Alex Williamson wrote:
> > 
> > > Right, we don't want to create dependencies across modules.  I don't
> > > have a vision for how this should work.  This is effectively a complete
> > > side-band to vfio, so we're really just dealing in the iommu group
> > > space.  Maybe there needs to be some kind of registration of ownership
> > > for the group using some kind of token.  It would need to include some
> > > kind of notification when that ownership ends.  That might also be a
> > > convenient tag to toggle driver probing off for devices in the group.
> > > Other ideas?  Thanks,
> > 
> > All of that smells nasty like it will need a pile of bloody
> > infrastructure.... which makes me think it's too complicated and not the
> > right approach.
> > 
> > How does access control work today on x86/VFIO ? Can you give me a bit
> > more details ? I didn't get a good grasp in your previous email....
> 
> The current model is not x86 specific, but it only covers doing iommu
> and device access through vfio.  The kink here is that we're trying to
> do device access and setup through vfio, but iommu manipulation through
> kvm.  We may want to revisit whether we can do the in-kernel iommu
> manipulation through vfio rather than kvm.

How would that be possible ?

The hypercalls from the guest arrive in KVM... in a very very specific &
restricted environment which we call real mode (MMU off but still
running in guest context), where we try to do as much as possible, or in
virtual mode, where they get handled as normal KVM exits.

The only way we could handle them "in VFIO" would be if somewhat VFIO
registered callbacks with KVM... if we have that sort of
cross-dependency, then we may as well have a simpler one where VFIO
tells KVM what iommu is available for the VM

> For vfio in general, the group is the unit of ownership.  A user is
> granted access to /dev/vfio/$GROUP through file permissions.  The user
> opens the group and a container (/dev/vfio/vfio) and calls SET_CONTAINER
> on the group.  If supported by the platform, multiple groups can be set
> to the same container, which allows for iommu domain sharing.  Once a
> group is associated with a container, an iommu backend can be
> initialized for the container.  Only then can a device be accessed
> through the group.
> 
> So even if we were to pass a vfio group file descriptor into kvm and it
> matched as some kind of ownership token on the iommu group, it's not
> clear that's sufficient to assume we can start programming the iommu.
> Thanks,

Your scheme seems to me that it would have the same problem if you
wanted to do virtualized iommu....

In any case, this is a big deal. We have a requirement for pass-through.
It cannot work with any remotely usable performance level if we don't
implement the calls in KVM, so it needs to be sorted one way or another
and I'm at a loss how here...

Ben.

> Alex
> 
> > From the look of it, the VFIO file descriptor is what has the "access
> > control" to the underlying iommu, is this right ? So we somewhat need to
> > transfer (or copy) that ownership from the VFIO fd to the KVM VM.
> > 
> > I don't see a way to do that without some cross-layering here...
> > 
> > Rusty, are you aware of some kernel mechanism we can use for that ?
> > 
> > Cheers,
> > Ben.
> > 
> > 
> 
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alexey Kardashevskiy - June 19, 2013, 3:17 a.m.
On 06/16/2013 02:39 PM, Benjamin Herrenschmidt wrote:
>>  static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing,
>> -			unsigned long *pte_sizep)
>> +			unsigned long *pte_sizep, bool do_get_page)
>>  {
>>  	pte_t *ptep;
>>  	unsigned int shift = 0;
>> @@ -135,6 +136,14 @@ static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing,
>>  	if (!pte_present(*ptep))
>>  		return __pte(0);
>>  
>> +	/*
>> +	 * Put huge pages handling to the virtual mode.
>> +	 * The only exception is for TCE list pages which we
>> +	 * do need to call get_page() for.
>> +	 */
>> +	if ((*pte_sizep > PAGE_SIZE) && do_get_page)
>> +		return __pte(0);
>> +
>>  	/* wait until _PAGE_BUSY is clear then set it atomically */
>>  	__asm__ __volatile__ (
>>  		"1:	ldarx	%0,0,%3\n"
>> @@ -148,6 +157,18 @@ static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing,
>>  		: "cc");
>>  
>>  	ret = pte;
>> +	if (do_get_page && pte_present(pte) && (!writing || pte_write(pte))) {
>> +		struct page *pg = NULL;
>> +		pg = realmode_pfn_to_page(pte_pfn(pte));
>> +		if (realmode_get_page(pg)) {
>> +			ret = __pte(0);
>> +		} else {
>> +			pte = pte_mkyoung(pte);
>> +			if (writing)
>> +				pte = pte_mkdirty(pte);
>> +		}
>> +	}
>> +	*ptep = pte;	/* clears _PAGE_BUSY */
>>  
>>  	return ret;
>>  }
> 
> So now you are adding the clearing of _PAGE_BUSY that was missing for
> your first patch, except that this is not enough since that means that
> in the "emulated" case (ie, !do_get_page) you will in essence return
> and then use a PTE that is not locked without any synchronization to
> ensure that the underlying page doesn't go away... then you'll
> dereference that page.
> 
> So either make everything use speculative get_page, or make the emulated
> case use the MMU notifier to drop the operation in case of collision.
> 
> The former looks easier.
> 
> Also, any specific reason why you do:
> 
>   - Lock the PTE
>   - get_page()
>   - Unlock the PTE
> 
> Instead of
> 
>   - Read the PTE
>   - get_page_unless_zero
>   - re-check PTE
> 
> Like get_user_pages_fast() does ?
> 
> The former will be two atomic ops, the latter only one (faster), but
> maybe you have a good reason why that can't work...



If we want to set "dirty" and "young" bits for pte then I do not know how
to avoid _PAGE_BUSY.
Rusty Russell - June 19, 2013, 3:35 a.m.
Alex Williamson <alex.williamson@redhat.com> writes:
> On Mon, 2013-06-17 at 13:56 +1000, Benjamin Herrenschmidt wrote:
>> On Sun, 2013-06-16 at 21:13 -0600, Alex Williamson wrote:
>> 
>> > IOMMU groups themselves don't provide security, they're accessed by
>> > interfaces like VFIO, which provide the security.  Given a brief look, I
>> > agree, this looks like a possible backdoor.  The typical VFIO way to
>> > handle this would be to pass a VFIO file descriptor here to prove that
>> > the process has access to the IOMMU group.  This is how /dev/vfio/vfio
>> > gains the ability to setup an IOMMU domain an do mappings with the
>> > SET_CONTAINER ioctl using a group fd.  Thanks,
>> 
>> How do you envision that in the kernel ? IE. I'm in KVM code, gets that
>> vfio fd, what do I do with it ?
>> 
>> Basically, KVM needs to know that the user is allowed to use that iommu
>> group. I don't think we want KVM however to call into VFIO directly
>> right ?
>
> Right, we don't want to create dependencies across modules.  I don't
> have a vision for how this should work.  This is effectively a complete
> side-band to vfio, so we're really just dealing in the iommu group
> space.  Maybe there needs to be some kind of registration of ownership
> for the group using some kind of token.  It would need to include some
> kind of notification when that ownership ends.  That might also be a
> convenient tag to toggle driver probing off for devices in the group.
> Other ideas?  Thanks,

It's actually not that bad.

eg. 

struct vfio_container *vfio_container_from_file(struct file *filp)
{
        if (filp->f_op != &vfio_device_fops)
                return ERR_PTR(-EINVAL);

        /* OK it really is a vfio fd, return the data. */
        ....
}
EXPORT_SYMBOL_GPL(vfio_container_from_file);

...

inside KVM_CREATE_SPAPR_TCE_IOMMU:

        struct file *vfio_filp;
        struct vfio_container *(lookup)(struct file *filp);

        vfio_filp = fget(create_tce_iommu.fd);
        if (!vfio)
                ret = -EBADF;
        lookup = symbol_get(vfio_container_from_file);
        if (!lookup)
                ret = -EINVAL;
        else {
                container = lookup(vfio_filp);
                if (IS_ERR(container))
                        ret = PTR_ERR(container);
                else
                        ...
                symbol_put(vfio_container_from_file);
        }

symbol_get() won't try to load a module; it'll just fail.  This is what
you want, since they must have vfio in the kernel to get a valid fd...

Hope that helps,
Rusty.
                
--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Benjamin Herrenschmidt - June 19, 2013, 4:59 a.m.
On Wed, 2013-06-19 at 13:05 +0930, Rusty Russell wrote:
> symbol_get() won't try to load a module; it'll just fail.  This is what
> you want, since they must have vfio in the kernel to get a valid fd...

Ok, cool. I suppose what we want here Alexey is slightly higher level,
something like:

	vfio_validate_iommu_id(file, iommu_id)

Which verifies that the file that was passed in is allowed to use
that iommu_id.

That's a simple and flexible interface (ie, it will work even if we
support multiple iommu IDs in the future for a vfio, for example
for DDW windows etc...), the logic to know about the ID remains
in qemu, this is strictly a validation call.

That way we also don't have to expose the containing vfio struct etc...
just that simple function.

Alex, any objection ?

Do we need to make it a get/put interface instead ?

	vfio_validate_and_use_iommu(file, iommu_id);

	vfio_release_iommu(file, iommu_id);

To ensure that the resource remains owned by the process until KVM
is closed as well ?

Or do we want to register with VFIO with a callback so that VFIO can
call us if it needs us to give it up ?

Cheers,
Ben.


--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alexander Graf - June 19, 2013, 9:58 a.m.
On 19.06.2013, at 06:59, Benjamin Herrenschmidt wrote:

> On Wed, 2013-06-19 at 13:05 +0930, Rusty Russell wrote:
>> symbol_get() won't try to load a module; it'll just fail.  This is what
>> you want, since they must have vfio in the kernel to get a valid fd...
> 
> Ok, cool. I suppose what we want here Alexey is slightly higher level,
> something like:
> 
> 	vfio_validate_iommu_id(file, iommu_id)
> 
> Which verifies that the file that was passed in is allowed to use
> that iommu_id.
> 
> That's a simple and flexible interface (ie, it will work even if we
> support multiple iommu IDs in the future for a vfio, for example
> for DDW windows etc...), the logic to know about the ID remains
> in qemu, this is strictly a validation call.
> 
> That way we also don't have to expose the containing vfio struct etc...
> just that simple function.
> 
> Alex, any objection ?

Which Alex? :)

I think validate works, it keeps iteration logic out of the kernel which is a good thing. There still needs to be an interface for getting the iommu id in VFIO, but I suppose that one's for the other Alex and Jörg to comment on.

> 
> Do we need to make it a get/put interface instead ?
> 
> 	vfio_validate_and_use_iommu(file, iommu_id);
> 
> 	vfio_release_iommu(file, iommu_id);
> 
> To ensure that the resource remains owned by the process until KVM
> is closed as well ?
> 
> Or do we want to register with VFIO with a callback so that VFIO can
> call us if it needs us to give it up ?

Can't we just register a handler on the fd and get notified when it closes? Can you kill VFIO access without closing the fd?


Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Benjamin Herrenschmidt - June 19, 2013, 2:50 p.m.
On Wed, 2013-06-19 at 11:58 +0200, Alexander Graf wrote:

> > Alex, any objection ?
> 
> Which Alex? :)

Heh, mostly Williamson in this specific case but your input is still
welcome :-)

> I think validate works, it keeps iteration logic out of the kernel
> which is a good thing. There still needs to be an interface for
> getting the iommu id in VFIO, but I suppose that one's for the other
> Alex and Jörg to comment on.

I think getting the iommu fd is already covered by separate patches from
Alexey.

> > 
> > Do we need to make it a get/put interface instead ?
> > 
> > 	vfio_validate_and_use_iommu(file, iommu_id);
> > 
> > 	vfio_release_iommu(file, iommu_id);
> > 
> > To ensure that the resource remains owned by the process until KVM
> > is closed as well ?
> > 
> > Or do we want to register with VFIO with a callback so that VFIO can
> > call us if it needs us to give it up ?
> 
> Can't we just register a handler on the fd and get notified when it
> closes? Can you kill VFIO access without closing the fd?

That sounds actually harder :-)

The question is basically: When we validate that relationship between a
specific VFIO struct file with an iommu, what is the lifetime of that
and how do we handle this lifetime properly.

There's two ways for that sort of situation: The notification model
where we get notified when the relationship is broken, and the refcount
model where we become a "user" and thus delay the breaking of the
relationship until we have been disposed of as well.

In this specific case, it's hard to tell what is the right model from my
perspective, which is why I would welcome Alex (W.) input.

In the end, the solution will end up being in the form of APIs exposed
by VFIO for use by KVM (via that symbol lookup mechanism) so Alex (W),
as owner of VFIO at this stage, what do you want those to look
like ? :-)

Cheers,
Ben.
 
> 
> Alex
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alex Williamson - June 19, 2013, 3:49 p.m.
On Thu, 2013-06-20 at 00:50 +1000, Benjamin Herrenschmidt wrote:
> On Wed, 2013-06-19 at 11:58 +0200, Alexander Graf wrote:
> 
> > > Alex, any objection ?
> > 
> > Which Alex? :)
> 
> Heh, mostly Williamson in this specific case but your input is still
> welcome :-)
> 
> > I think validate works, it keeps iteration logic out of the kernel
> > which is a good thing. There still needs to be an interface for
> > getting the iommu id in VFIO, but I suppose that one's for the other
> > Alex and Jörg to comment on.
> 
> I think getting the iommu fd is already covered by separate patches from
> Alexey.
> 
> > > 
> > > Do we need to make it a get/put interface instead ?
> > > 
> > > 	vfio_validate_and_use_iommu(file, iommu_id);
> > > 
> > > 	vfio_release_iommu(file, iommu_id);
> > > 
> > > To ensure that the resource remains owned by the process until KVM
> > > is closed as well ?
> > > 
> > > Or do we want to register with VFIO with a callback so that VFIO can
> > > call us if it needs us to give it up ?
> > 
> > Can't we just register a handler on the fd and get notified when it
> > closes? Can you kill VFIO access without closing the fd?
> 
> That sounds actually harder :-)
> 
> The question is basically: When we validate that relationship between a
> specific VFIO struct file with an iommu, what is the lifetime of that
> and how do we handle this lifetime properly.
> 
> There's two ways for that sort of situation: The notification model
> where we get notified when the relationship is broken, and the refcount
> model where we become a "user" and thus delay the breaking of the
> relationship until we have been disposed of as well.
> 
> In this specific case, it's hard to tell what is the right model from my
> perspective, which is why I would welcome Alex (W.) input.
> 
> In the end, the solution will end up being in the form of APIs exposed
> by VFIO for use by KVM (via that symbol lookup mechanism) so Alex (W),
> as owner of VFIO at this stage, what do you want those to look
> like ? :-)

My first thought is that we should use the same reference counting as we
have for vfio devices (group->container_users).  An interface for that
might look like:

int vfio_group_add_external_user(struct file *filep)
{
	struct vfio_group *group = filep->private_data;

	if (filep->f_op != &vfio_group_fops)
		return -EINVAL;


	if (!atomic_inc_not_zero(&group->container_users))
		return -EINVAL;

	return 0;
}

void vfio_group_del_external_user(struct file *filep)
{
	struct vfio_group *group = filep->private_data;

	BUG_ON(filep->f_op != &vfio_group_fops);

	vfio_group_try_dissolve_container(group);
}

int vfio_group_iommu_id_from_file(struct file *filep)
{
	struct vfio_group *group = filep->private_data;

	BUG_ON(filep->f_op != &vfio_group_fops);

	return iommu_group_id(group->iommu_group);
}

Would that work?  Thanks,

Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alexey Kardashevskiy - June 20, 2013, 4:58 a.m.
On 06/20/2013 01:49 AM, Alex Williamson wrote:
> On Thu, 2013-06-20 at 00:50 +1000, Benjamin Herrenschmidt wrote:
>> On Wed, 2013-06-19 at 11:58 +0200, Alexander Graf wrote:
>>
>>>> Alex, any objection ?
>>>
>>> Which Alex? :)
>>
>> Heh, mostly Williamson in this specific case but your input is still
>> welcome :-)
>>
>>> I think validate works, it keeps iteration logic out of the kernel
>>> which is a good thing. There still needs to be an interface for
>>> getting the iommu id in VFIO, but I suppose that one's for the other
>>> Alex and Jörg to comment on.
>>
>> I think getting the iommu fd is already covered by separate patches from
>> Alexey.
>>
>>>>
>>>> Do we need to make it a get/put interface instead ?
>>>>
>>>> 	vfio_validate_and_use_iommu(file, iommu_id);
>>>>
>>>> 	vfio_release_iommu(file, iommu_id);
>>>>
>>>> To ensure that the resource remains owned by the process until KVM
>>>> is closed as well ?
>>>>
>>>> Or do we want to register with VFIO with a callback so that VFIO can
>>>> call us if it needs us to give it up ?
>>>
>>> Can't we just register a handler on the fd and get notified when it
>>> closes? Can you kill VFIO access without closing the fd?
>>
>> That sounds actually harder :-)
>>
>> The question is basically: When we validate that relationship between a
>> specific VFIO struct file with an iommu, what is the lifetime of that
>> and how do we handle this lifetime properly.
>>
>> There's two ways for that sort of situation: The notification model
>> where we get notified when the relationship is broken, and the refcount
>> model where we become a "user" and thus delay the breaking of the
>> relationship until we have been disposed of as well.
>>
>> In this specific case, it's hard to tell what is the right model from my
>> perspective, which is why I would welcome Alex (W.) input.
>>
>> In the end, the solution will end up being in the form of APIs exposed
>> by VFIO for use by KVM (via that symbol lookup mechanism) so Alex (W),
>> as owner of VFIO at this stage, what do you want those to look
>> like ? :-)
> 
> My first thought is that we should use the same reference counting as we
> have for vfio devices (group->container_users).  An interface for that
> might look like:
> 
> int vfio_group_add_external_user(struct file *filep)
> {
> 	struct vfio_group *group = filep->private_data;
> 
> 	if (filep->f_op != &vfio_group_fops)
> 		return -EINVAL;
> 
> 
> 	if (!atomic_inc_not_zero(&group->container_users))
> 		return -EINVAL;
> 
> 	return 0;
> }
> 
> void vfio_group_del_external_user(struct file *filep)
> {
> 	struct vfio_group *group = filep->private_data;
> 
> 	BUG_ON(filep->f_op != &vfio_group_fops);
> 
> 	vfio_group_try_dissolve_container(group);
> }
> 
> int vfio_group_iommu_id_from_file(struct file *filep)
> {
> 	struct vfio_group *group = filep->private_data;
> 
> 	BUG_ON(filep->f_op != &vfio_group_fops);
> 
> 	return iommu_group_id(group->iommu_group);
> }
> 
> Would that work?  Thanks,


Just out of curiosity - would not get_file() and fput_atomic() on a group's
file* do the right job instead of vfio_group_add_external_user() and
vfio_group_del_external_user()?
David Gibson - June 20, 2013, 5:28 a.m.
On Thu, Jun 20, 2013 at 02:58:18PM +1000, Alexey Kardashevskiy wrote:
> On 06/20/2013 01:49 AM, Alex Williamson wrote:
> > On Thu, 2013-06-20 at 00:50 +1000, Benjamin Herrenschmidt wrote:
> >> On Wed, 2013-06-19 at 11:58 +0200, Alexander Graf wrote:
> >>
> >>>> Alex, any objection ?
> >>>
> >>> Which Alex? :)
> >>
> >> Heh, mostly Williamson in this specific case but your input is still
> >> welcome :-)
> >>
> >>> I think validate works, it keeps iteration logic out of the kernel
> >>> which is a good thing. There still needs to be an interface for
> >>> getting the iommu id in VFIO, but I suppose that one's for the other
> >>> Alex and Jörg to comment on.
> >>
> >> I think getting the iommu fd is already covered by separate patches from
> >> Alexey.
> >>
> >>>>
> >>>> Do we need to make it a get/put interface instead ?
> >>>>
> >>>> 	vfio_validate_and_use_iommu(file, iommu_id);
> >>>>
> >>>> 	vfio_release_iommu(file, iommu_id);
> >>>>
> >>>> To ensure that the resource remains owned by the process until KVM
> >>>> is closed as well ?
> >>>>
> >>>> Or do we want to register with VFIO with a callback so that VFIO can
> >>>> call us if it needs us to give it up ?
> >>>
> >>> Can't we just register a handler on the fd and get notified when it
> >>> closes? Can you kill VFIO access without closing the fd?
> >>
> >> That sounds actually harder :-)
> >>
> >> The question is basically: When we validate that relationship between a
> >> specific VFIO struct file with an iommu, what is the lifetime of that
> >> and how do we handle this lifetime properly.
> >>
> >> There's two ways for that sort of situation: The notification model
> >> where we get notified when the relationship is broken, and the refcount
> >> model where we become a "user" and thus delay the breaking of the
> >> relationship until we have been disposed of as well.
> >>
> >> In this specific case, it's hard to tell what is the right model from my
> >> perspective, which is why I would welcome Alex (W.) input.
> >>
> >> In the end, the solution will end up being in the form of APIs exposed
> >> by VFIO for use by KVM (via that symbol lookup mechanism) so Alex (W),
> >> as owner of VFIO at this stage, what do you want those to look
> >> like ? :-)
> > 
> > My first thought is that we should use the same reference counting as we
> > have for vfio devices (group->container_users).  An interface for that
> > might look like:
> > 
> > int vfio_group_add_external_user(struct file *filep)
> > {
> > 	struct vfio_group *group = filep->private_data;
> > 
> > 	if (filep->f_op != &vfio_group_fops)
> > 		return -EINVAL;
> > 
> > 
> > 	if (!atomic_inc_not_zero(&group->container_users))
> > 		return -EINVAL;
> > 
> > 	return 0;
> > }
> > 
> > void vfio_group_del_external_user(struct file *filep)
> > {
> > 	struct vfio_group *group = filep->private_data;
> > 
> > 	BUG_ON(filep->f_op != &vfio_group_fops);
> > 
> > 	vfio_group_try_dissolve_container(group);
> > }
> > 
> > int vfio_group_iommu_id_from_file(struct file *filep)
> > {
> > 	struct vfio_group *group = filep->private_data;
> > 
> > 	BUG_ON(filep->f_op != &vfio_group_fops);
> > 
> > 	return iommu_group_id(group->iommu_group);
> > }
> > 
> > Would that work?  Thanks,
> 
> 
> Just out of curiosity - would not get_file() and fput_atomic() on a group's
> file* do the right job instead of vfio_group_add_external_user() and
> vfio_group_del_external_user()?

I was thinking that too.  Grabbing a file reference would certainly be
the usual way of handling this sort of thing.
Benjamin Herrenschmidt - June 20, 2013, 7:47 a.m.
On Thu, 2013-06-20 at 15:28 +1000, David Gibson wrote:
> > Just out of curiosity - would not get_file() and fput_atomic() on a
> group's
> > file* do the right job instead of vfio_group_add_external_user() and
> > vfio_group_del_external_user()?
> 
> I was thinking that too.  Grabbing a file reference would certainly be
> the usual way of handling this sort of thing.

But that wouldn't prevent the group ownership to be returned to
the kernel or another user would it ?

Ben.


--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alexey Kardashevskiy - June 20, 2013, 8:48 a.m.
On 06/20/2013 05:47 PM, Benjamin Herrenschmidt wrote:
> On Thu, 2013-06-20 at 15:28 +1000, David Gibson wrote:
>>> Just out of curiosity - would not get_file() and fput_atomic() on a
>> group's
>>> file* do the right job instead of vfio_group_add_external_user() and
>>> vfio_group_del_external_user()?
>>
>> I was thinking that too.  Grabbing a file reference would certainly be
>> the usual way of handling this sort of thing.
> 
> But that wouldn't prevent the group ownership to be returned to
> the kernel or another user would it ?


Holding the file pointer does not let the group->container_users counter go
to zero and this is exactly what vfio_group_add_external_user() and
vfio_group_del_external_user() do. The difference is only in absolute value
- 2 vs. 3.

No change in behaviour whether I use new vfio API or simply hold file* till
KVM closes fd created when IOMMU was connected to LIOBN.

And while this counter is not zero, QEMU cannot take ownership over the group.

I am definitely still missing the bigger picture...
Alex Williamson - June 20, 2013, 2:55 p.m.
On Thu, 2013-06-20 at 18:48 +1000, Alexey Kardashevskiy wrote:
> On 06/20/2013 05:47 PM, Benjamin Herrenschmidt wrote:
> > On Thu, 2013-06-20 at 15:28 +1000, David Gibson wrote:
> >>> Just out of curiosity - would not get_file() and fput_atomic() on a
> >> group's
> >>> file* do the right job instead of vfio_group_add_external_user() and
> >>> vfio_group_del_external_user()?
> >>
> >> I was thinking that too.  Grabbing a file reference would certainly be
> >> the usual way of handling this sort of thing.
> > 
> > But that wouldn't prevent the group ownership to be returned to
> > the kernel or another user would it ?
> 
> 
> Holding the file pointer does not let the group->container_users counter go
> to zero

How so?  Holding the file pointer means the file won't go away, which
means the group release function won't be called.  That means the group
won't go away, but that doesn't mean it's attached to an IOMMU.  A user
could call UNSET_CONTAINER.

>  and this is exactly what vfio_group_add_external_user() and
> vfio_group_del_external_user() do. The difference is only in absolute value
> - 2 vs. 3.
> 
> No change in behaviour whether I use new vfio API or simply hold file* till
> KVM closes fd created when IOMMU was connected to LIOBN.

By that notion you could open(/dev/vfio/$GROUP) and you're safe, right?
But what about SET_CONTAINER & SET_IOMMU?  All that you guarantee
holding the file pointer is that the vfio_group exists.

> And while this counter is not zero, QEMU cannot take ownership over the group.
>
> I am definitely still missing the bigger picture...

The bigger picture is that the group needs to exist AND it needs to be
setup and maintained to have IOMMU protection.  Actually, my first stab
at add_external_user doesn't look sufficient, it needs to look more like
vfio_group_get_device_fd, checking group->container->iommu and
group_viable().  As written it would allow an external user after
SET_CONTAINER without SET_IOMMU.  It should also be part of the API that
the external user must hold the file reference between add_external_use
and del_external_user and do cleanup on any exit paths.  Thanks,

Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alexey Kardashevskiy - June 22, 2013, 8:25 a.m.
On 06/21/2013 12:55 AM, Alex Williamson wrote:
> On Thu, 2013-06-20 at 18:48 +1000, Alexey Kardashevskiy wrote:
>> On 06/20/2013 05:47 PM, Benjamin Herrenschmidt wrote:
>>> On Thu, 2013-06-20 at 15:28 +1000, David Gibson wrote:
>>>>> Just out of curiosity - would not get_file() and fput_atomic() on a
>>>> group's
>>>>> file* do the right job instead of vfio_group_add_external_user() and
>>>>> vfio_group_del_external_user()?
>>>>
>>>> I was thinking that too.  Grabbing a file reference would certainly be
>>>> the usual way of handling this sort of thing.
>>>
>>> But that wouldn't prevent the group ownership to be returned to
>>> the kernel or another user would it ?
>>
>>
>> Holding the file pointer does not let the group->container_users counter go
>> to zero
> 
> How so?  Holding the file pointer means the file won't go away, which
> means the group release function won't be called.  That means the group
> won't go away, but that doesn't mean it's attached to an IOMMU.  A user
> could call UNSET_CONTAINER.
> 
>>  and this is exactly what vfio_group_add_external_user() and
>> vfio_group_del_external_user() do. The difference is only in absolute value
>> - 2 vs. 3.
>>
>> No change in behaviour whether I use new vfio API or simply hold file* till
>> KVM closes fd created when IOMMU was connected to LIOBN.
> 
> By that notion you could open(/dev/vfio/$GROUP) and you're safe, right?
> But what about SET_CONTAINER & SET_IOMMU?  All that you guarantee
> holding the file pointer is that the vfio_group exists.
> 
>> And while this counter is not zero, QEMU cannot take ownership over the group.
>>
>> I am definitely still missing the bigger picture...
> 
> The bigger picture is that the group needs to exist AND it needs to be
> setup and maintained to have IOMMU protection.  Actually, my first stab
> at add_external_user doesn't look sufficient, it needs to look more like
> vfio_group_get_device_fd, checking group->container->iommu and
> group_viable().


This makes sense. If you did this, that would be great. Without it, I
really cannot see how the proposed inc/dec of container_users is better
than simple holding file*. Thanks.


> As written it would allow an external user after
> SET_CONTAINER without SET_IOMMU.  It should also be part of the API that
> the external user must hold the file reference between add_external_use
> and del_external_user and do cleanup on any exit paths.  Thanks,
David Gibson - June 22, 2013, 12:03 p.m.
On Thu, Jun 20, 2013 at 08:55:13AM -0600, Alex Williamson wrote:
> On Thu, 2013-06-20 at 18:48 +1000, Alexey Kardashevskiy wrote:
> > On 06/20/2013 05:47 PM, Benjamin Herrenschmidt wrote:
> > > On Thu, 2013-06-20 at 15:28 +1000, David Gibson wrote:
> > >>> Just out of curiosity - would not get_file() and fput_atomic() on a
> > >> group's
> > >>> file* do the right job instead of vfio_group_add_external_user() and
> > >>> vfio_group_del_external_user()?
> > >>
> > >> I was thinking that too.  Grabbing a file reference would certainly be
> > >> the usual way of handling this sort of thing.
> > > 
> > > But that wouldn't prevent the group ownership to be returned to
> > > the kernel or another user would it ?
> > 
> > 
> > Holding the file pointer does not let the group->container_users counter go
> > to zero
> 
> How so?  Holding the file pointer means the file won't go away, which
> means the group release function won't be called.  That means the group
> won't go away, but that doesn't mean it's attached to an IOMMU.  A user
> could call UNSET_CONTAINER.

Uhh... *thinks*.  Ah, I see.

I think the interface should not take the group fd, but the container
fd.  Holding a reference to *that* would keep the necessary things
around.  But more to the point, it's the right thing semantically:

The container is essentially the handle on a host iommu address space,
and so that's what should be bound by the KVM call to a particular
guest iommu address space.  e.g. it would make no sense to bind two
different groups to different guest iommu address spaces, if they were
in the same container - the guest thinks they are different spaces,
but if they're in the same container they must be the same space.
Alex Williamson - June 22, 2013, 2:28 p.m.
On Sat, 2013-06-22 at 22:03 +1000, David Gibson wrote:
> On Thu, Jun 20, 2013 at 08:55:13AM -0600, Alex Williamson wrote:
> > On Thu, 2013-06-20 at 18:48 +1000, Alexey Kardashevskiy wrote:
> > > On 06/20/2013 05:47 PM, Benjamin Herrenschmidt wrote:
> > > > On Thu, 2013-06-20 at 15:28 +1000, David Gibson wrote:
> > > >>> Just out of curiosity - would not get_file() and fput_atomic() on a
> > > >> group's
> > > >>> file* do the right job instead of vfio_group_add_external_user() and
> > > >>> vfio_group_del_external_user()?
> > > >>
> > > >> I was thinking that too.  Grabbing a file reference would certainly be
> > > >> the usual way of handling this sort of thing.
> > > > 
> > > > But that wouldn't prevent the group ownership to be returned to
> > > > the kernel or another user would it ?
> > > 
> > > 
> > > Holding the file pointer does not let the group->container_users counter go
> > > to zero
> > 
> > How so?  Holding the file pointer means the file won't go away, which
> > means the group release function won't be called.  That means the group
> > won't go away, but that doesn't mean it's attached to an IOMMU.  A user
> > could call UNSET_CONTAINER.
> 
> Uhh... *thinks*.  Ah, I see.
> 
> I think the interface should not take the group fd, but the container
> fd.  Holding a reference to *that* would keep the necessary things
> around.  But more to the point, it's the right thing semantically:
> 
> The container is essentially the handle on a host iommu address space,
> and so that's what should be bound by the KVM call to a particular
> guest iommu address space.  e.g. it would make no sense to bind two
> different groups to different guest iommu address spaces, if they were
> in the same container - the guest thinks they are different spaces,
> but if they're in the same container they must be the same space.

While the container is the gateway to the iommu, what empowers the
container to maintain an iommu is the group.  What happens to a
container when all the groups are disconnected or closed?  Groups are
the unit that indicates hardware access, not containers.  Thanks,

Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Benjamin Herrenschmidt - June 22, 2013, 11:28 p.m.
On Sat, 2013-06-22 at 22:03 +1000, David Gibson wrote:
> I think the interface should not take the group fd, but the container
> fd.  Holding a reference to *that* would keep the necessary things
> around.  But more to the point, it's the right thing semantically:
> 
> The container is essentially the handle on a host iommu address space,
> and so that's what should be bound by the KVM call to a particular
> guest iommu address space.  e.g. it would make no sense to bind two
> different groups to different guest iommu address spaces, if they were
> in the same container - the guest thinks they are different spaces,
> but if they're in the same container they must be the same space.

Interestingly, how are we going to extend that when/if we implement
DDW ?

DDW means an API by which the guest can request the creation of
additional iommus for a given device (typically, in addition to the
default smallish 32-bit one using 4k pages, the guest can request
a larger window in 64-bit space using a larger page size).

Cheers,
Ben.


--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Gibson - June 24, 2013, 3:52 a.m.
On Sat, Jun 22, 2013 at 08:28:06AM -0600, Alex Williamson wrote:
> On Sat, 2013-06-22 at 22:03 +1000, David Gibson wrote:
> > On Thu, Jun 20, 2013 at 08:55:13AM -0600, Alex Williamson wrote:
> > > On Thu, 2013-06-20 at 18:48 +1000, Alexey Kardashevskiy wrote:
> > > > On 06/20/2013 05:47 PM, Benjamin Herrenschmidt wrote:
> > > > > On Thu, 2013-06-20 at 15:28 +1000, David Gibson wrote:
> > > > >>> Just out of curiosity - would not get_file() and fput_atomic() on a
> > > > >> group's
> > > > >>> file* do the right job instead of vfio_group_add_external_user() and
> > > > >>> vfio_group_del_external_user()?
> > > > >>
> > > > >> I was thinking that too.  Grabbing a file reference would certainly be
> > > > >> the usual way of handling this sort of thing.
> > > > > 
> > > > > But that wouldn't prevent the group ownership to be returned to
> > > > > the kernel or another user would it ?
> > > > 
> > > > 
> > > > Holding the file pointer does not let the group->container_users counter go
> > > > to zero
> > > 
> > > How so?  Holding the file pointer means the file won't go away, which
> > > means the group release function won't be called.  That means the group
> > > won't go away, but that doesn't mean it's attached to an IOMMU.  A user
> > > could call UNSET_CONTAINER.
> > 
> > Uhh... *thinks*.  Ah, I see.
> > 
> > I think the interface should not take the group fd, but the container
> > fd.  Holding a reference to *that* would keep the necessary things
> > around.  But more to the point, it's the right thing semantically:
> > 
> > The container is essentially the handle on a host iommu address space,
> > and so that's what should be bound by the KVM call to a particular
> > guest iommu address space.  e.g. it would make no sense to bind two
> > different groups to different guest iommu address spaces, if they were
> > in the same container - the guest thinks they are different spaces,
> > but if they're in the same container they must be the same space.
> 
> While the container is the gateway to the iommu, what empowers the
> container to maintain an iommu is the group.  What happens to a
> container when all the groups are disconnected or closed?  Groups are
> the unit that indicates hardware access, not containers.  Thanks,

Uh... huh?  I'm really not sure what you're getting at.

The operation we're doing for KVM here is binding a guest iommu
address space to a particular host iommu address space.  Why would we
not want to use the obvious handle on the host iommu address space,
which is the container fd?
David Gibson - June 24, 2013, 3:54 a.m.
On Sun, Jun 23, 2013 at 09:28:13AM +1000, Benjamin Herrenschmidt wrote:
> On Sat, 2013-06-22 at 22:03 +1000, David Gibson wrote:
> > I think the interface should not take the group fd, but the container
> > fd.  Holding a reference to *that* would keep the necessary things
> > around.  But more to the point, it's the right thing semantically:
> > 
> > The container is essentially the handle on a host iommu address space,
> > and so that's what should be bound by the KVM call to a particular
> > guest iommu address space.  e.g. it would make no sense to bind two
> > different groups to different guest iommu address spaces, if they were
> > in the same container - the guest thinks they are different spaces,
> > but if they're in the same container they must be the same space.
> 
> Interestingly, how are we going to extend that when/if we implement
> DDW ?
> 
> DDW means an API by which the guest can request the creation of
> additional iommus for a given device (typically, in addition to the
> default smallish 32-bit one using 4k pages, the guest can request
> a larger window in 64-bit space using a larger page size).

So, would a PAPR gest requesting this expect the new window to have
a new liobn, or an existing liobn?
Benjamin Herrenschmidt - June 24, 2013, 3:58 a.m.
On Mon, 2013-06-24 at 13:54 +1000, David Gibson wrote:
> > DDW means an API by which the guest can request the creation of
> > additional iommus for a given device (typically, in addition to the
> > default smallish 32-bit one using 4k pages, the guest can request
> > a larger window in 64-bit space using a larger page size).
> 
> So, would a PAPR gest requesting this expect the new window to have
> a new liobn, or an existing liobn?

New liobn or there is no way to H_PUT_TCE it (it exists in addition
to the legacy window).

Cheers,
Ben.


--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alex Williamson - June 24, 2013, 4:41 a.m.
On Mon, 2013-06-24 at 13:52 +1000, David Gibson wrote:
> On Sat, Jun 22, 2013 at 08:28:06AM -0600, Alex Williamson wrote:
> > On Sat, 2013-06-22 at 22:03 +1000, David Gibson wrote:
> > > On Thu, Jun 20, 2013 at 08:55:13AM -0600, Alex Williamson wrote:
> > > > On Thu, 2013-06-20 at 18:48 +1000, Alexey Kardashevskiy wrote:
> > > > > On 06/20/2013 05:47 PM, Benjamin Herrenschmidt wrote:
> > > > > > On Thu, 2013-06-20 at 15:28 +1000, David Gibson wrote:
> > > > > >>> Just out of curiosity - would not get_file() and fput_atomic() on a
> > > > > >> group's
> > > > > >>> file* do the right job instead of vfio_group_add_external_user() and
> > > > > >>> vfio_group_del_external_user()?
> > > > > >>
> > > > > >> I was thinking that too.  Grabbing a file reference would certainly be
> > > > > >> the usual way of handling this sort of thing.
> > > > > > 
> > > > > > But that wouldn't prevent the group ownership to be returned to
> > > > > > the kernel or another user would it ?
> > > > > 
> > > > > 
> > > > > Holding the file pointer does not let the group->container_users counter go
> > > > > to zero
> > > > 
> > > > How so?  Holding the file pointer means the file won't go away, which
> > > > means the group release function won't be called.  That means the group
> > > > won't go away, but that doesn't mean it's attached to an IOMMU.  A user
> > > > could call UNSET_CONTAINER.
> > > 
> > > Uhh... *thinks*.  Ah, I see.
> > > 
> > > I think the interface should not take the group fd, but the container
> > > fd.  Holding a reference to *that* would keep the necessary things
> > > around.  But more to the point, it's the right thing semantically:
> > > 
> > > The container is essentially the handle on a host iommu address space,
> > > and so that's what should be bound by the KVM call to a particular
> > > guest iommu address space.  e.g. it would make no sense to bind two
> > > different groups to different guest iommu address spaces, if they were
> > > in the same container - the guest thinks they are different spaces,
> > > but if they're in the same container they must be the same space.
> > 
> > While the container is the gateway to the iommu, what empowers the
> > container to maintain an iommu is the group.  What happens to a
> > container when all the groups are disconnected or closed?  Groups are
> > the unit that indicates hardware access, not containers.  Thanks,
> 
> Uh... huh?  I'm really not sure what you're getting at.
> 
> The operation we're doing for KVM here is binding a guest iommu
> address space to a particular host iommu address space.  Why would we
> not want to use the obvious handle on the host iommu address space,
> which is the container fd?

AIUI, the request isn't for an interface through which to do iommu
mappings.  The request is for an interface to show that the user has
sufficient privileges to do mappings.  Groups are what gives the user
that ability.  The iommu is also possibly associated with multiple iommu
groups and I believe what is being asked for here is a way to hold and
lock a single iommu group with iommu protection.

From a practical point of view, the iommu interface is de-privileged
once the groups are disconnected or closed.  Holding a reference count
on the iommu fd won't prevent that.  That means we'd have to use a
notifier to have KVM stop the side-channel iommu access.  Meanwhile
holding the file descriptor for the group and adding an interface that
bumps use counter allows KVM to lock itself in, just as if it had a
device opened itself.  Thanks,

Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Gibson - June 27, 2013, 11:01 a.m.
On Sun, Jun 23, 2013 at 10:41:24PM -0600, Alex Williamson wrote:
> On Mon, 2013-06-24 at 13:52 +1000, David Gibson wrote:
> > On Sat, Jun 22, 2013 at 08:28:06AM -0600, Alex Williamson wrote:
> > > On Sat, 2013-06-22 at 22:03 +1000, David Gibson wrote:
> > > > On Thu, Jun 20, 2013 at 08:55:13AM -0600, Alex Williamson wrote:
> > > > > On Thu, 2013-06-20 at 18:48 +1000, Alexey Kardashevskiy wrote:
> > > > > > On 06/20/2013 05:47 PM, Benjamin Herrenschmidt wrote:
> > > > > > > On Thu, 2013-06-20 at 15:28 +1000, David Gibson wrote:
> > > > > > >>> Just out of curiosity - would not get_file() and fput_atomic() on a
> > > > > > >> group's
> > > > > > >>> file* do the right job instead of vfio_group_add_external_user() and
> > > > > > >>> vfio_group_del_external_user()?
> > > > > > >>
> > > > > > >> I was thinking that too.  Grabbing a file reference would certainly be
> > > > > > >> the usual way of handling this sort of thing.
> > > > > > > 
> > > > > > > But that wouldn't prevent the group ownership to be returned to
> > > > > > > the kernel or another user would it ?
> > > > > > 
> > > > > > 
> > > > > > Holding the file pointer does not let the group->container_users counter go
> > > > > > to zero
> > > > > 
> > > > > How so?  Holding the file pointer means the file won't go away, which
> > > > > means the group release function won't be called.  That means the group
> > > > > won't go away, but that doesn't mean it's attached to an IOMMU.  A user
> > > > > could call UNSET_CONTAINER.
> > > > 
> > > > Uhh... *thinks*.  Ah, I see.
> > > > 
> > > > I think the interface should not take the group fd, but the container
> > > > fd.  Holding a reference to *that* would keep the necessary things
> > > > around.  But more to the point, it's the right thing semantically:
> > > > 
> > > > The container is essentially the handle on a host iommu address space,
> > > > and so that's what should be bound by the KVM call to a particular
> > > > guest iommu address space.  e.g. it would make no sense to bind two
> > > > different groups to different guest iommu address spaces, if they were
> > > > in the same container - the guest thinks they are different spaces,
> > > > but if they're in the same container they must be the same space.
> > > 
> > > While the container is the gateway to the iommu, what empowers the
> > > container to maintain an iommu is the group.  What happens to a
> > > container when all the groups are disconnected or closed?  Groups are
> > > the unit that indicates hardware access, not containers.  Thanks,
> > 
> > Uh... huh?  I'm really not sure what you're getting at.
> > 
> > The operation we're doing for KVM here is binding a guest iommu
> > address space to a particular host iommu address space.  Why would we
> > not want to use the obvious handle on the host iommu address space,
> > which is the container fd?
> 
> AIUI, the request isn't for an interface through which to do iommu
> mappings.  The request is for an interface to show that the user has
> sufficient privileges to do mappings.  Groups are what gives the user
> that ability.  The iommu is also possibly associated with multiple iommu
> groups and I believe what is being asked for here is a way to hold and
> lock a single iommu group with iommu protection.
> 
> >From a practical point of view, the iommu interface is de-privileged
> once the groups are disconnected or closed.  Holding a reference count
> on the iommu fd won't prevent that.  That means we'd have to use a
> notifier to have KVM stop the side-channel iommu access.  Meanwhile
> holding the file descriptor for the group and adding an interface that
> bumps use counter allows KVM to lock itself in, just as if it had a
> device opened itself.  Thanks,

Ah, good point.

Patch

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 6c082ff..e962e3b 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2379,6 +2379,34 @@  the guest. Othwerwise it might be better for the guest to continue using H_PUT_T
 hypercall (if KVM_CAP_SPAPR_TCE or KVM_CAP_SPAPR_TCE_IOMMU are present).
 
 
+4.84 KVM_CREATE_SPAPR_TCE_IOMMU
+
+Capability: KVM_CAP_SPAPR_TCE_IOMMU
+Architectures: powerpc
+Type: vm ioctl
+Parameters: struct kvm_create_spapr_tce_iommu (in)
+Returns: 0 on success, -1 on error
+
+This creates a link between IOMMU group and a hardware TCE (translation
+control entry) table. This link lets the host kernel know what IOMMU
+group (i.e. TCE table) to use for the LIOBN number passed with
+H_PUT_TCE, H_PUT_TCE_INDIRECT, H_STUFF_TCE hypercalls.
+
+/* for KVM_CAP_SPAPR_TCE_IOMMU */
+struct kvm_create_spapr_tce_iommu {
+	__u64 liobn;
+	__u32 iommu_id;
+	__u32 flags;
+};
+
+No flag is supported at the moment.
+
+When the guest issues TCE call on a liobn for which a TCE table has been
+registered, the kernel will handle it in real mode, updating the hardware
+TCE table. TCE table calls for other liobns will cause a vm exit and must
+be handled by userspace.
+
+
 5. The kvm_run structure
 ------------------------
 
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 85d8f26..ac0e2fe 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -180,6 +180,7 @@  struct kvmppc_spapr_tce_table {
 	struct kvm *kvm;
 	u64 liobn;
 	u32 window_size;
+	struct iommu_group *grp;    /* used for IOMMU groups */
 	struct page *pages[0];
 };
 
@@ -611,6 +612,8 @@  struct kvm_vcpu_arch {
 	u64 busy_preempt;
 
 	unsigned long *tce_tmp;    /* TCE cache for TCE_PUT_INDIRECT hall */
+	unsigned long tce_tmp_num; /* Number of handled TCEs in the cache */
+	unsigned long tce_reason;  /* The reason of switching to the virtmode */
 #endif
 };
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index e852921b..934e01d 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -133,6 +133,8 @@  extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
 
 extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 				struct kvm_create_spapr_tce *args);
+extern long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
+				struct kvm_create_spapr_tce_iommu *args);
 extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(
 		struct kvm_vcpu *vcpu, unsigned long liobn);
 extern long kvmppc_emulated_validate_tce(unsigned long tce);
diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
index 0fb1a6e..cf82af4 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -319,6 +319,13 @@  struct kvm_create_spapr_tce {
 	__u32 window_size;
 };
 
+/* for KVM_CAP_SPAPR_TCE_IOMMU */
+struct kvm_create_spapr_tce_iommu {
+	__u64 liobn;
+	__u32 iommu_id;
+	__u32 flags;
+};
+
 /* for KVM_ALLOCATE_RMA */
 struct kvm_allocate_rma {
 	__u64 rma_size;
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 06b7b20..ffb4698 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -27,6 +27,8 @@ 
 #include <linux/hugetlb.h>
 #include <linux/list.h>
 #include <linux/anon_inodes.h>
+#include <linux/pci.h>
+#include <linux/iommu.h>
 
 #include <asm/tlbflush.h>
 #include <asm/kvm_ppc.h>
@@ -56,8 +58,13 @@  static void release_spapr_tce_table(struct kvmppc_spapr_tce_table *stt)
 
 	mutex_lock(&kvm->lock);
 	list_del(&stt->list);
-	for (i = 0; i < kvmppc_stt_npages(stt->window_size); i++)
-		__free_page(stt->pages[i]);
+#ifdef CONFIG_IOMMU_API
+	if (stt->grp) {
+		iommu_group_put(stt->grp);
+	} else
+#endif
+		for (i = 0; i < kvmppc_stt_npages(stt->window_size); i++)
+			__free_page(stt->pages[i]);
 	kfree(stt);
 	mutex_unlock(&kvm->lock);
 
@@ -153,6 +160,62 @@  fail:
 	return ret;
 }
 
+#ifdef CONFIG_IOMMU_API
+static const struct file_operations kvm_spapr_tce_iommu_fops = {
+	.release	= kvm_spapr_tce_release,
+};
+
+long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
+		struct kvm_create_spapr_tce_iommu *args)
+{
+	struct kvmppc_spapr_tce_table *tt = NULL;
+	struct iommu_group *grp;
+	struct iommu_table *tbl;
+
+	/* Find an IOMMU table for the given ID */
+	grp = iommu_group_get_by_id(args->iommu_id);
+	if (!grp)
+		return -ENXIO;
+
+	tbl = iommu_group_get_iommudata(grp);
+	if (!tbl)
+		return -ENXIO;
+
+	/* Check this LIOBN hasn't been previously allocated */
+	list_for_each_entry(tt, &kvm->arch.spapr_tce_tables, list) {
+		if (tt->liobn == args->liobn)
+			return -EBUSY;
+	}
+
+	tt = kzalloc(sizeof(*tt), GFP_KERNEL);
+	if (!tt)
+		return -ENOMEM;
+
+	tt->liobn = args->liobn;
+	tt->kvm = kvm;
+	tt->grp = grp;
+
+	kvm_get_kvm(kvm);
+
+	mutex_lock(&kvm->lock);
+	list_add(&tt->list, &kvm->arch.spapr_tce_tables);
+
+	mutex_unlock(&kvm->lock);
+
+	pr_debug("LIOBN=%llX hooked to IOMMU %d, flags=%u\n",
+			args->liobn, args->iommu_id, args->flags);
+
+	return anon_inode_getfd("kvm-spapr-tce-iommu",
+			&kvm_spapr_tce_iommu_fops, tt, O_RDWR);
+}
+#else
+long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
+		struct kvm_create_spapr_tce_iommu *args)
+{
+	return -ENOSYS;
+}
+#endif /* CONFIG_IOMMU_API */
+
 /* Converts guest physical address into host virtual */
 static void __user *kvmppc_virtmode_gpa_to_hva(struct kvm_vcpu *vcpu,
 		unsigned long gpa)
@@ -180,6 +243,46 @@  long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu,
 	if (!tt)
 		return H_TOO_HARD;
 
+#ifdef CONFIG_IOMMU_API
+	if (tt->grp) {
+		unsigned long entry = ioba >> IOMMU_PAGE_SHIFT;
+		struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+
+		/* Return error if the group is being destroyed */
+		if (!tbl)
+			return H_RESCINDED;
+
+		if (vcpu->arch.tce_reason == H_HARDWARE) {
+			iommu_clear_tces_and_put_pages(tbl, entry, 1);
+			return H_HARDWARE;
+
+		} else if (!(tce & (TCE_PCI_READ | TCE_PCI_WRITE))) {
+			if (iommu_tce_clear_param_check(tbl, ioba, 0, 1))
+				return H_PARAMETER;
+
+			ret = iommu_clear_tces_and_put_pages(tbl, entry, 1);
+		} else {
+			void *hva;
+
+			if (iommu_tce_put_param_check(tbl, ioba, tce))
+				return H_PARAMETER;
+
+			hva = kvmppc_virtmode_gpa_to_hva(vcpu, tce);
+			if (hva == ERROR_ADDR)
+				return H_HARDWARE;
+
+			ret = iommu_put_tce_user_mode(tbl,
+					ioba >> IOMMU_PAGE_SHIFT,
+					(unsigned long) hva);
+		}
+		iommu_flush_tce(tbl);
+
+		if (ret)
+			return H_HARDWARE;
+
+		return H_SUCCESS;
+	}
+#endif
 	/* Emulated IO */
 	if (ioba >= tt->window_size)
 		return H_PARAMETER;
@@ -220,6 +323,70 @@  long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	if (tces == ERROR_ADDR)
 		return H_TOO_HARD;
 
+#ifdef CONFIG_IOMMU_API
+	if (tt->grp) {
+		struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+
+		/* Return error if the group is being destroyed */
+		if (!tbl)
+			return H_RESCINDED;
+
+		/* Something bad happened, do cleanup and exit */
+		if (vcpu->arch.tce_reason == H_HARDWARE) {
+			i = vcpu->arch.tce_tmp_num;
+			goto fail_clear_tce;
+		} else if (vcpu->arch.tce_reason != H_TOO_HARD) {
+			/*
+			 * We get here only in PR KVM mode, otherwise
+			 * the real mode handler would have checked TCEs
+			 * already and failed on guest TCE translation.
+			 */
+			for (i = 0; i < npages; ++i) {
+				if (get_user(vcpu->arch.tce_tmp[i], tces + i))
+					return H_HARDWARE;
+
+				if (iommu_tce_put_param_check(tbl, ioba +
+						(i << IOMMU_PAGE_SHIFT),
+						vcpu->arch.tce_tmp[i]))
+					return H_PARAMETER;
+			}
+		} /* else: The real mode handler checked TCEs already */
+
+		/* Translate TCEs */
+		for (i = vcpu->arch.tce_tmp_num; i < npages; ++i) {
+			void *hva = kvmppc_virtmode_gpa_to_hva(vcpu,
+					vcpu->arch.tce_tmp[i]);
+
+			if (hva == ERROR_ADDR)
+				goto fail_clear_tce;
+
+			vcpu->arch.tce_tmp[i] = (unsigned long) hva;
+		}
+
+		/* Do get_page and put TCEs for all pages */
+		for (i = 0; i < npages; ++i) {
+			if (iommu_put_tce_user_mode(tbl,
+					(ioba >> IOMMU_PAGE_SHIFT) + i,
+					vcpu->arch.tce_tmp[i])) {
+				i = npages;
+				goto fail_clear_tce;
+			}
+		}
+
+		iommu_flush_tce(tbl);
+
+		return H_SUCCESS;
+
+fail_clear_tce:
+		/* Cannot complete the translation, clean up and exit */
+		iommu_clear_tces_and_put_pages(tbl,
+				ioba >> IOMMU_PAGE_SHIFT, i);
+
+		iommu_flush_tce(tbl);
+
+		return H_HARDWARE;
+	}
+#endif
 	/* Emulated IO */
 	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
 		return H_PARAMETER;
@@ -253,6 +420,33 @@  long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu,
 	if (!tt)
 		return H_TOO_HARD;
 
+#ifdef CONFIG_IOMMU_API
+	if (tt->grp) {
+		struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+		unsigned long tmp, entry = ioba >> IOMMU_PAGE_SHIFT;
+
+		vcpu->arch.tce_tmp_num = 0;
+
+		/* Return error if the group is being destroyed */
+		if (!tbl)
+			return H_RESCINDED;
+
+		/* PR KVM? */
+		if (!vcpu->arch.tce_tmp_num &&
+				(vcpu->arch.tce_reason != H_TOO_HARD) &&
+				iommu_tce_clear_param_check(tbl, ioba,
+						tce_value, npages))
+			return H_PARAMETER;
+
+		/* Do actual cleanup */
+		tmp = vcpu->arch.tce_tmp_num;
+		if (iommu_clear_tces_and_put_pages(tbl, entry + tmp,
+				npages - tmp))
+			return H_PARAMETER;
+
+		return H_SUCCESS;
+	}
+#endif
 	/* Emulated IO */
 	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
 		return H_PARAMETER;
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index c68d538..dc4ae32 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -26,6 +26,7 @@ 
 #include <linux/slab.h>
 #include <linux/hugetlb.h>
 #include <linux/list.h>
+#include <linux/iommu.h>
 
 #include <asm/tlbflush.h>
 #include <asm/kvm_ppc.h>
@@ -118,7 +119,7 @@  EXPORT_SYMBOL_GPL(kvmppc_emulated_put_tce);
 #ifdef CONFIG_KVM_BOOK3S_64_HV
 
 static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing,
-			unsigned long *pte_sizep)
+			unsigned long *pte_sizep, bool do_get_page)
 {
 	pte_t *ptep;
 	unsigned int shift = 0;
@@ -135,6 +136,14 @@  static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing,
 	if (!pte_present(*ptep))
 		return __pte(0);
 
+	/*
+	 * Put huge pages handling to the virtual mode.
+	 * The only exception is for TCE list pages which we
+	 * do need to call get_page() for.
+	 */
+	if ((*pte_sizep > PAGE_SIZE) && do_get_page)
+		return __pte(0);
+
 	/* wait until _PAGE_BUSY is clear then set it atomically */
 	__asm__ __volatile__ (
 		"1:	ldarx	%0,0,%3\n"
@@ -148,6 +157,18 @@  static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing,
 		: "cc");
 
 	ret = pte;
+	if (do_get_page && pte_present(pte) && (!writing || pte_write(pte))) {
+		struct page *pg = NULL;
+		pg = realmode_pfn_to_page(pte_pfn(pte));
+		if (realmode_get_page(pg)) {
+			ret = __pte(0);
+		} else {
+			pte = pte_mkyoung(pte);
+			if (writing)
+				pte = pte_mkdirty(pte);
+		}
+	}
+	*ptep = pte;	/* clears _PAGE_BUSY */
 
 	return ret;
 }
@@ -157,7 +178,7 @@  static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing,
  * Also returns pte and page size if the page is present in page table.
  */
 static unsigned long kvmppc_realmode_gpa_to_hpa(struct kvm_vcpu *vcpu,
-		unsigned long gpa)
+		unsigned long gpa, bool do_get_page)
 {
 	struct kvm_memory_slot *memslot;
 	pte_t pte;
@@ -175,7 +196,7 @@  static unsigned long kvmppc_realmode_gpa_to_hpa(struct kvm_vcpu *vcpu,
 
 	/* Find a PTE and determine the size */
 	pte = kvmppc_lookup_pte(vcpu->arch.pgdir, hva,
-			writing, &pg_size);
+			writing, &pg_size, do_get_page);
 	if (!pte)
 		return ERROR_ADDR;
 
@@ -188,6 +209,52 @@  static unsigned long kvmppc_realmode_gpa_to_hpa(struct kvm_vcpu *vcpu,
 	return hpa;
 }
 
+#ifdef CONFIG_IOMMU_API
+static long kvmppc_clear_tce_real_mode(struct kvm_vcpu *vcpu,
+		struct iommu_table *tbl, unsigned long ioba,
+		unsigned long tce_value, unsigned long npages)
+{
+	long ret = 0, i;
+	unsigned long entry = ioba >> IOMMU_PAGE_SHIFT;
+
+	if (iommu_tce_clear_param_check(tbl, ioba, tce_value, npages))
+		return H_PARAMETER;
+
+	for (i = 0; i < npages; ++i) {
+		struct page *page;
+		unsigned long oldtce;
+
+		oldtce = iommu_clear_tce(tbl, entry + i);
+		if (!oldtce)
+			continue;
+
+		page = realmode_pfn_to_page(oldtce >> PAGE_SHIFT);
+		if (!page) {
+			ret = H_TOO_HARD;
+			break;
+		}
+
+		if (oldtce & TCE_PCI_WRITE)
+			SetPageDirty(page);
+
+		if (realmode_put_page(page)) {
+			ret = H_TOO_HARD;
+			break;
+		}
+	}
+
+	if (ret == H_TOO_HARD) {
+		vcpu->arch.tce_tmp_num = i;
+		vcpu->arch.tce_reason = H_TOO_HARD;
+	}
+	/* if (ret < 0)
+		pr_err("iommu_tce: %s failed ioba=%lx, tce_value=%lx ret=%d\n",
+				__func__, ioba, tce_value, ret); */
+
+	return ret;
+}
+#endif /* CONFIG_IOMMU_API */
+
 long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		      unsigned long ioba, unsigned long tce)
 {
@@ -199,6 +266,52 @@  long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 	if (!tt)
 		return H_TOO_HARD;
 
+#ifdef CONFIG_IOMMU_API
+	if (tt->grp) {
+		struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+
+		/* Return error if the group is being destroyed */
+		if (!tbl)
+			return H_RESCINDED;
+
+		vcpu->arch.tce_reason = 0;
+
+		if (tce & (TCE_PCI_READ | TCE_PCI_WRITE)) {
+			unsigned long hpa, hva;
+
+			if (iommu_tce_put_param_check(tbl, ioba, tce))
+				return H_PARAMETER;
+
+			hpa = kvmppc_realmode_gpa_to_hpa(vcpu, tce, true);
+			if (hpa == ERROR_ADDR) {
+				vcpu->arch.tce_reason = H_TOO_HARD;
+				return H_TOO_HARD;
+			}
+
+			hva = (unsigned long) __va(hpa);
+			ret = iommu_tce_build(tbl,
+					ioba >> IOMMU_PAGE_SHIFT,
+					hva, iommu_tce_direction(hva));
+			if (unlikely(ret)) {
+				struct page *pg = realmode_pfn_to_page(hpa);
+				BUG_ON(!pg);
+				if (realmode_put_page(pg)) {
+					vcpu->arch.tce_reason = H_HARDWARE;
+					return H_TOO_HARD;
+				}
+				return H_HARDWARE;
+			}
+		} else {
+			ret = kvmppc_clear_tce_real_mode(vcpu, tbl, ioba, 0, 1);
+			if (ret)
+				return ret;
+		}
+
+		iommu_flush_tce(tbl);
+
+		return H_SUCCESS;
+	}
+#endif
 	/* Emulated IO */
 	if (ioba >= tt->window_size)
 		return H_PARAMETER;
@@ -235,10 +348,62 @@  long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	if (tce_list & ~IOMMU_PAGE_MASK)
 		return H_PARAMETER;
 
-	tces = (unsigned long *) kvmppc_realmode_gpa_to_hpa(vcpu, tce_list);
+	vcpu->arch.tce_tmp_num = 0;
+	vcpu->arch.tce_reason = 0;
+
+	tces = (unsigned long *) kvmppc_realmode_gpa_to_hpa(vcpu,
+			tce_list, false);
 	if ((unsigned long)tces == ERROR_ADDR)
 		return H_TOO_HARD;
 
+#ifdef CONFIG_IOMMU_API
+	if (tt->grp) {
+		struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+
+		/* Return error if the group is being destroyed */
+		if (!tbl)
+			return H_RESCINDED;
+
+		/* Check all TCEs */
+		for (i = 0; i < npages; ++i) {
+			if (iommu_tce_put_param_check(tbl, ioba +
+					(i << IOMMU_PAGE_SHIFT), tces[i]))
+				return H_PARAMETER;
+			vcpu->arch.tce_tmp[i] = tces[i];
+		}
+
+		/* Translate TCEs and go get_page */
+		for (i = 0; i < npages; ++i) {
+			unsigned long hpa = kvmppc_realmode_gpa_to_hpa(vcpu,
+					vcpu->arch.tce_tmp[i], true);
+			if (hpa == ERROR_ADDR) {
+				vcpu->arch.tce_tmp_num = i;
+				vcpu->arch.tce_reason = H_TOO_HARD;
+				return H_TOO_HARD;
+			}
+			vcpu->arch.tce_tmp[i] = hpa;
+		}
+
+		/* Put TCEs to the table */
+		for (i = 0; i < npages; ++i) {
+			unsigned long hva = (unsigned long)
+					__va(vcpu->arch.tce_tmp[i]);
+
+			ret = iommu_tce_build(tbl,
+					(ioba >> IOMMU_PAGE_SHIFT) + i,
+					hva, iommu_tce_direction(hva));
+			if (ret) {
+				/* All wrong, go virtmode and do cleanup */
+				vcpu->arch.tce_reason = H_HARDWARE;
+				return H_TOO_HARD;
+			}
+		}
+
+		iommu_flush_tce(tbl);
+
+		return H_SUCCESS;
+	}
+#endif
 	/* Emulated IO */
 	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
 		return H_PARAMETER;
@@ -268,6 +433,26 @@  long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
 	if (!tt)
 		return H_TOO_HARD;
 
+#ifdef CONFIG_IOMMU_API
+	if (tt->grp) {
+		struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+
+		/* Return error if the group is being destroyed */
+		if (!tbl)
+			return H_RESCINDED;
+
+		vcpu->arch.tce_reason = 0;
+
+		ret = kvmppc_clear_tce_real_mode(vcpu, tbl, ioba,
+				tce_value, npages);
+		if (ret)
+			return ret;
+
+		iommu_flush_tce(tbl);
+
+		return H_SUCCESS;
+	}
+#endif
 	/* Emulated IO */
 	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
 		return H_PARAMETER;
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 8465c2a..da6bf61 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -396,6 +396,7 @@  int kvm_dev_ioctl_check_extension(long ext)
 		break;
 #endif
 	case KVM_CAP_SPAPR_MULTITCE:
+	case KVM_CAP_SPAPR_TCE_IOMMU:
 		r = 1;
 		break;
 	default:
@@ -1025,6 +1026,17 @@  long kvm_arch_vm_ioctl(struct file *filp,
 		r = kvm_vm_ioctl_create_spapr_tce(kvm, &create_tce);
 		goto out;
 	}
+	case KVM_CREATE_SPAPR_TCE_IOMMU: {
+		struct kvm_create_spapr_tce_iommu create_tce_iommu;
+		struct kvm *kvm = filp->private_data;
+
+		r = -EFAULT;
+		if (copy_from_user(&create_tce_iommu, argp,
+				sizeof(create_tce_iommu)))
+			goto out;
+		r = kvm_vm_ioctl_create_spapr_tce_iommu(kvm, &create_tce_iommu);
+		goto out;
+	}
 #endif /* CONFIG_PPC_BOOK3S_64 */
 
 #ifdef CONFIG_KVM_BOOK3S_64_HV
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index fc0d6b9..8cf86dc 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -667,6 +667,7 @@  struct kvm_ppc_smmu_info {
 #define KVM_CAP_PPC_RTAS 91
 #define KVM_CAP_IRQ_XICS 92
 #define KVM_CAP_SPAPR_MULTITCE 93
+#define KVM_CAP_SPAPR_TCE_IOMMU 94
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -922,6 +923,7 @@  struct kvm_s390_ucas_mapping {
 /* Available with KVM_CAP_PPC_ALLOC_HTAB */
 #define KVM_PPC_ALLOCATE_HTAB	  _IOWR(KVMIO, 0xa7, __u32)
 #define KVM_CREATE_SPAPR_TCE	  _IOW(KVMIO,  0xa8, struct kvm_create_spapr_tce)
+#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO,  0xaf, struct kvm_create_spapr_tce_iommu)
 /* Available with KVM_CAP_RMA */
 #define KVM_ALLOCATE_RMA	  _IOR(KVMIO,  0xa9, struct kvm_allocate_rma)
 /* Available with KVM_CAP_PPC_HTAB_FD */