diff mbox

[6/8] KVM: PPC: Add support for multiple-TCE hcalls

Message ID 1373123227-22969-7-git-send-email-aik@ozlabs.ru (mailing list archive)
State Not Applicable
Headers show

Commit Message

Alexey Kardashevskiy July 6, 2013, 3:07 p.m. UTC
This adds real mode handlers for the H_PUT_TCE_INDIRECT and
H_STUFF_TCE hypercalls for QEMU emulated devices such as IBMVIO
devices or emulated PCI.  These calls allow adding multiple entries
(up to 512) into the TCE table in one call which saves time on
transition to/from real mode.

This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs
(copied from user and verified) before writing the whole list into
the TCE table. This cache will be utilized more in the upcoming
VFIO/IOMMU support to continue TCE list processing in the virtual
mode in the case if the real mode handler failed for some reason.

This adds a guest physical to host real address converter
and calls the existing H_PUT_TCE handler. The converting function
is going to be fully utilized by upcoming VFIO supporting patches.

This also implements the KVM_CAP_PPC_MULTITCE capability,
so in order to support the functionality of this patch, QEMU
needs to query for this capability and set the "hcall-multi-tce"
hypertas property only if the capability is present, otherwise
there will be serious performance degradation.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

---
Changelog:
2013/07/06:
* fixed number of wrong get_page()/put_page() calls

2013/06/27:
* fixed clear of BUSY bit in kvmppc_lookup_pte()
* H_PUT_TCE_INDIRECT does realmode_get_page() now
* KVM_CAP_SPAPR_MULTITCE now depends on CONFIG_PPC_BOOK3S_64
* updated doc

2013/06/05:
* fixed mistype about IBMVIO in the commit message
* updated doc and moved it to another section
* changed capability number

2013/05/21:
* added kvm_vcpu_arch::tce_tmp
* removed cleanup if put_indirect failed, instead we do not even start
writing to TCE table if we cannot get TCEs from the user and they are
invalid
* kvmppc_emulated_h_put_tce is split to kvmppc_emulated_put_tce
and kvmppc_emulated_validate_tce (for the previous item)
* fixed bug with failthrough for H_IPI
* removed all get_user() from real mode handlers
* kvmppc_lookup_pte() added (instead of making lookup_linux_pte public)

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 Documentation/virtual/kvm/api.txt       |  25 +++
 arch/powerpc/include/asm/kvm_host.h     |   9 ++
 arch/powerpc/include/asm/kvm_ppc.h      |  16 +-
 arch/powerpc/kvm/book3s_64_vio.c        | 154 ++++++++++++++++++-
 arch/powerpc/kvm/book3s_64_vio_hv.c     | 260 ++++++++++++++++++++++++++++----
 arch/powerpc/kvm/book3s_hv.c            |  41 ++++-
 arch/powerpc/kvm/book3s_hv_rmhandlers.S |   6 +
 arch/powerpc/kvm/book3s_pr_papr.c       |  37 ++++-
 arch/powerpc/kvm/powerpc.c              |   3 +
 9 files changed, 517 insertions(+), 34 deletions(-)

Comments

Alexander Graf July 9, 2013, 5:02 p.m. UTC | #1
On 07/06/2013 05:07 PM, Alexey Kardashevskiy wrote:
> This adds real mode handlers for the H_PUT_TCE_INDIRECT and
> H_STUFF_TCE hypercalls for QEMU emulated devices such as IBMVIO
> devices or emulated PCI.  These calls allow adding multiple entries
> (up to 512) into the TCE table in one call which saves time on
> transition to/from real mode.

We don't mention QEMU explicitly in KVM code usually.

> This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs
> (copied from user and verified) before writing the whole list into
> the TCE table. This cache will be utilized more in the upcoming
> VFIO/IOMMU support to continue TCE list processing in the virtual
> mode in the case if the real mode handler failed for some reason.
>
> This adds a guest physical to host real address converter
> and calls the existing H_PUT_TCE handler. The converting function
> is going to be fully utilized by upcoming VFIO supporting patches.
>
> This also implements the KVM_CAP_PPC_MULTITCE capability,
> so in order to support the functionality of this patch, QEMU
> needs to query for this capability and set the "hcall-multi-tce"
> hypertas property only if the capability is present, otherwise
> there will be serious performance degradation.

Same as above. But really you're only giving recommendations here. 
What's the point? Please describe what the benefit of this patch is, not 
what some other random subsystem might do with the benefits it brings.

>
> Signed-off-by: Paul Mackerras<paulus@samba.org>
> Signed-off-by: Alexey Kardashevskiy<aik@ozlabs.ru>
>
> ---
> Changelog:
> 2013/07/06:
> * fixed number of wrong get_page()/put_page() calls
>
> 2013/06/27:
> * fixed clear of BUSY bit in kvmppc_lookup_pte()
> * H_PUT_TCE_INDIRECT does realmode_get_page() now
> * KVM_CAP_SPAPR_MULTITCE now depends on CONFIG_PPC_BOOK3S_64
> * updated doc
>
> 2013/06/05:
> * fixed mistype about IBMVIO in the commit message
> * updated doc and moved it to another section
> * changed capability number
>
> 2013/05/21:
> * added kvm_vcpu_arch::tce_tmp
> * removed cleanup if put_indirect failed, instead we do not even start
> writing to TCE table if we cannot get TCEs from the user and they are
> invalid
> * kvmppc_emulated_h_put_tce is split to kvmppc_emulated_put_tce
> and kvmppc_emulated_validate_tce (for the previous item)
> * fixed bug with failthrough for H_IPI
> * removed all get_user() from real mode handlers
> * kvmppc_lookup_pte() added (instead of making lookup_linux_pte public)
>
> Signed-off-by: Alexey Kardashevskiy<aik@ozlabs.ru>
> ---
>   Documentation/virtual/kvm/api.txt       |  25 +++
>   arch/powerpc/include/asm/kvm_host.h     |   9 ++
>   arch/powerpc/include/asm/kvm_ppc.h      |  16 +-
>   arch/powerpc/kvm/book3s_64_vio.c        | 154 ++++++++++++++++++-
>   arch/powerpc/kvm/book3s_64_vio_hv.c     | 260 ++++++++++++++++++++++++++++----
>   arch/powerpc/kvm/book3s_hv.c            |  41 ++++-
>   arch/powerpc/kvm/book3s_hv_rmhandlers.S |   6 +
>   arch/powerpc/kvm/book3s_pr_papr.c       |  37 ++++-
>   arch/powerpc/kvm/powerpc.c              |   3 +
>   9 files changed, 517 insertions(+), 34 deletions(-)
>
> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> index 6365fef..762c703 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -2362,6 +2362,31 @@ calls by the guest for that service will be passed to userspace to be
>   handled.
>
>
> +4.86 KVM_CAP_PPC_MULTITCE
> +
> +Capability: KVM_CAP_PPC_MULTITCE
> +Architectures: ppc
> +Type: vm
> +
> +This capability means the kernel is capable of handling hypercalls
> +H_PUT_TCE_INDIRECT and H_STUFF_TCE without passing those into the user
> +space. This significanly accelerates DMA operations for PPC KVM guests.

significanly? Please run this through a spell checker.

> +The user space should expect that its handlers for these hypercalls

s/The//

> +are not going to be called.

Is user space guaranteed they will not be called? Or can it still happen?

> +
> +In order to enable H_PUT_TCE_INDIRECT and H_STUFF_TCE use in the guest,
> +the user space might have to advertise it for the guest. For example,
> +IBM pSeries guest starts using them if "hcall-multi-tce" is present in
> +the "ibm,hypertas-functions" device-tree property.

This paragraph describes sPAPR. That's fine, but please document it as 
such. Also please check your grammar.

> +
> +Without this capability, only H_PUT_TCE is handled by the kernel and
> +therefore the use of H_PUT_TCE_INDIRECT and H_STUFF_TCE is not recommended
> +unless the capability is present as passing hypercalls to the userspace
> +slows operations a lot.
> +
> +Unlike other capabilities of this section, this one is always enabled.

Why? Wouldn't that confuse older user space?

> +
> +
>   5. The kvm_run structure
>   ------------------------
>
> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> index af326cd..20d04bd 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -180,6 +180,7 @@ struct kvmppc_spapr_tce_table {
>   	struct kvm *kvm;
>   	u64 liobn;
>   	u32 window_size;
> +	struct { struct { unsigned long put, indir, stuff; } rm, vm; } stat;

You don't need this.

>   	struct page *pages[0];
>   };
>
> @@ -609,6 +610,14 @@ struct kvm_vcpu_arch {
>   	spinlock_t tbacct_lock;
>   	u64 busy_stolen;
>   	u64 busy_preempt;
> +
> +	unsigned long *tce_tmp_hpas;	/* TCE cache for TCE_PUT_INDIRECT hcall */
> +	enum {
> +		TCERM_NONE,
> +		TCERM_GETPAGE,
> +		TCERM_PUTTCE,
> +		TCERM_PUTLIST,
> +	} tce_rm_fail;			/* failed stage of request processing */
>   #endif
>   };
>
> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
> index a5287fe..fa722a0 100644
> --- a/arch/powerpc/include/asm/kvm_ppc.h
> +++ b/arch/powerpc/include/asm/kvm_ppc.h
> @@ -133,8 +133,20 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
>
>   extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>   				struct kvm_create_spapr_tce *args);
> -extern long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
> -			     unsigned long ioba, unsigned long tce);
> +extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(
> +		struct kvm_vcpu *vcpu, unsigned long liobn);
> +extern long kvmppc_emulated_validate_tce(unsigned long tce);
> +extern void kvmppc_emulated_put_tce(struct kvmppc_spapr_tce_table *tt,
> +		unsigned long ioba, unsigned long tce);
> +extern long kvmppc_vm_h_put_tce(struct kvm_vcpu *vcpu,
> +		unsigned long liobn, unsigned long ioba,
> +		unsigned long tce);
> +extern long kvmppc_vm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> +		unsigned long liobn, unsigned long ioba,
> +		unsigned long tce_list, unsigned long npages);
> +extern long kvmppc_vm_h_stuff_tce(struct kvm_vcpu *vcpu,
> +		unsigned long liobn, unsigned long ioba,
> +		unsigned long tce_value, unsigned long npages);
>   extern long kvm_vm_ioctl_allocate_rma(struct kvm *kvm,
>   				struct kvm_allocate_rma *rma);
>   extern struct kvmppc_linear_info *kvm_alloc_rma(void);
> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> index b2d3f3b..99bf4e5 100644
> --- a/arch/powerpc/kvm/book3s_64_vio.c
> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> @@ -14,6 +14,7 @@
>    *
>    * Copyright 2010 Paul Mackerras, IBM Corp.<paulus@au1.ibm.com>
>    * Copyright 2011 David Gibson, IBM Corporation<dwg@au1.ibm.com>
> + * Copyright 2013 Alexey Kardashevskiy, IBM Corporation<aik@au1.ibm.com>
>    */
>
>   #include<linux/types.h>
> @@ -36,8 +37,10 @@
>   #include<asm/ppc-opcode.h>
>   #include<asm/kvm_host.h>
>   #include<asm/udbg.h>
> +#include<asm/iommu.h>
> +#include<asm/tce.h>
>
> -#define TCES_PER_PAGE	(PAGE_SIZE / sizeof(u64))
> +#define ERROR_ADDR      ((void *)~(unsigned long)0x0)
>
>   static long kvmppc_stt_npages(unsigned long window_size)
>   {
> @@ -50,6 +53,20 @@ static void release_spapr_tce_table(struct kvmppc_spapr_tce_table *stt)
>   	struct kvm *kvm = stt->kvm;
>   	int i;
>
> +#define __SV(x) stt->stat.x
> +#define __SVD(x) (__SV(rm.x)?(__SV(rm.x)-__SV(vm.x)):0)
> +	pr_debug("%s stat for liobn=%llx\n"
> +			"--------------- realmode ----- virtmode ---\n"
> +			"put_tce       %10ld     %10ld\n"
> +			"put_tce_indir %10ld     %10ld\n"
> +			"stuff_tce     %10ld     %10ld\n",
> +			__func__, stt->liobn,
> +			__SVD(put), __SV(vm.put),
> +			__SVD(indir), __SV(vm.indir),
> +			__SVD(stuff), __SV(vm.stuff));
> +#undef __SVD
> +#undef __SV

All of these stat points should just be trace points. You can do the 
statistic gathering from user space then.

> +
>   	mutex_lock(&kvm->lock);
>   	list_del(&stt->list);
>   	for (i = 0; i<  kvmppc_stt_npages(stt->window_size); i++)
> @@ -148,3 +165,138 @@ fail:
>   	}
>   	return ret;
>   }
> +
> +/* Converts guest physical address to host virtual address */
> +static void __user *kvmppc_vm_gpa_to_hva_and_get(struct kvm_vcpu *vcpu,

Please don't distinguish _vm versions. They're the normal case. _rm ones 
are the special ones.

> +		unsigned long gpa, struct page **pg)
> +{
> +	unsigned long hva, gfn = gpa>>  PAGE_SHIFT;
> +	struct kvm_memory_slot *memslot;
> +
> +	memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
> +	if (!memslot)
> +		return ERROR_ADDR;
> +
> +	hva = __gfn_to_hva_memslot(memslot, gfn) + (gpa&  ~PAGE_MASK);

s/+/|/

> +
> +	if (get_user_pages_fast(hva&  PAGE_MASK, 1, 0, pg) != 1)
> +		return ERROR_ADDR;
> +
> +	return (void *) hva;
> +}
> +
> +long kvmppc_vm_h_put_tce(struct kvm_vcpu *vcpu,
> +		unsigned long liobn, unsigned long ioba,
> +		unsigned long tce)
> +{
> +	long ret;
> +	struct kvmppc_spapr_tce_table *tt;
> +
> +	tt = kvmppc_find_tce_table(vcpu, liobn);
> +	/* Didn't find the liobn, put it to userspace */

Unclear comment.

> +	if (!tt)
> +		return H_TOO_HARD;
> +
> +	++tt->stat.vm.put;
> +
> +	if (ioba>= tt->window_size)
> +		return H_PARAMETER;
> +
> +	ret = kvmppc_emulated_validate_tce(tce);
> +	if (ret)
> +		return ret;
> +
> +	kvmppc_emulated_put_tce(tt, ioba, tce);
> +
> +	return H_SUCCESS;
> +}
> +
> +long kvmppc_vm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> +		unsigned long liobn, unsigned long ioba,
> +		unsigned long tce_list, unsigned long npages)
> +{
> +	struct kvmppc_spapr_tce_table *tt;
> +	long i, ret = H_SUCCESS;
> +	unsigned long __user *tces;
> +	struct page *pg = NULL;
> +
> +	tt = kvmppc_find_tce_table(vcpu, liobn);
> +	/* Didn't find the liobn, put it to userspace */
> +	if (!tt)
> +		return H_TOO_HARD;
> +
> +	++tt->stat.vm.indir;
> +
> +	/*
> +	 * The spec says that the maximum size of the list is 512 TCEs so
> +	 * so the whole table addressed resides in 4K page

so so?

> +	 */
> +	if (npages>  512)
> +		return H_PARAMETER;
> +
> +	if (tce_list&  ~IOMMU_PAGE_MASK)
> +		return H_PARAMETER;
> +
> +	if ((ioba + (npages<<  IOMMU_PAGE_SHIFT))>  tt->window_size)
> +		return H_PARAMETER;
> +
> +	tces = kvmppc_vm_gpa_to_hva_and_get(vcpu, tce_list,&pg);
> +	if (tces == ERROR_ADDR)
> +		return H_TOO_HARD;
> +
> +	if (vcpu->arch.tce_rm_fail == TCERM_PUTLIST)
> +		goto put_list_page_exit;
> +
> +	for (i = 0; i<  npages; ++i) {
> +		if (get_user(vcpu->arch.tce_tmp_hpas[i], tces + i)) {
> +			ret = H_PARAMETER;
> +			goto put_list_page_exit;
> +		}
> +
> +		ret = kvmppc_emulated_validate_tce(vcpu->arch.tce_tmp_hpas[i]);
> +		if (ret)
> +			goto put_list_page_exit;
> +	}
> +
> +	for (i = 0; i<  npages; ++i)
> +		kvmppc_emulated_put_tce(tt, ioba + (i<<  IOMMU_PAGE_SHIFT),
> +				vcpu->arch.tce_tmp_hpas[i]);
> +put_list_page_exit:
> +	if (pg)
> +		put_page(pg);
> +
> +	if (vcpu->arch.tce_rm_fail != TCERM_NONE) {
> +		vcpu->arch.tce_rm_fail = TCERM_NONE;
> +		if (pg&&  !PageCompound(pg))
> +			put_page(pg); /* finish pending realmode_put_page() */
> +	}
> +
> +	return ret;
> +}
> +
> +long kvmppc_vm_h_stuff_tce(struct kvm_vcpu *vcpu,
> +		unsigned long liobn, unsigned long ioba,
> +		unsigned long tce_value, unsigned long npages)
> +{
> +	struct kvmppc_spapr_tce_table *tt;
> +	long i, ret;
> +
> +	tt = kvmppc_find_tce_table(vcpu, liobn);
> +	/* Didn't find the liobn, put it to userspace */
> +	if (!tt)
> +		return H_TOO_HARD;
> +
> +	++tt->stat.vm.stuff;
> +
> +	if ((ioba + (npages<<  IOMMU_PAGE_SHIFT))>  tt->window_size)
> +		return H_PARAMETER;
> +
> +	ret = kvmppc_emulated_validate_tce(tce_value);
> +	if (ret || (tce_value&  (TCE_PCI_WRITE | TCE_PCI_READ)))
> +		return H_PARAMETER;
> +
> +	for (i = 0; i<  npages; ++i, ioba += IOMMU_PAGE_SIZE)
> +		kvmppc_emulated_put_tce(tt, ioba, tce_value);
> +
> +	return H_SUCCESS;
> +}
> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
> index 30c2f3b..cd3e6f9 100644
> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
> @@ -14,6 +14,7 @@
>    *
>    * Copyright 2010 Paul Mackerras, IBM Corp.<paulus@au1.ibm.com>
>    * Copyright 2011 David Gibson, IBM Corporation<dwg@au1.ibm.com>
> + * Copyright 2013 Alexey Kardashevskiy, IBM Corporation<aik@au1.ibm.com>
>    */
>
>   #include<linux/types.h>
> @@ -35,42 +36,243 @@
>   #include<asm/ppc-opcode.h>
>   #include<asm/kvm_host.h>
>   #include<asm/udbg.h>
> +#include<asm/iommu.h>
> +#include<asm/tce.h>
>
>   #define TCES_PER_PAGE	(PAGE_SIZE / sizeof(u64))
> +#define ERROR_ADDR      (~(unsigned long)0x0)
>
> -/* WARNING: This will be called in real-mode on HV KVM and virtual
> - *          mode on PR KVM

What's wrong with the warning?

> +/*
> + * Finds a TCE table descriptor by LIOBN
>    */
> +struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(struct kvm_vcpu *vcpu,
> +		unsigned long liobn)
> +{
> +	struct kvmppc_spapr_tce_table *tt;
> +
> +	list_for_each_entry(tt,&vcpu->kvm->arch.spapr_tce_tables, list) {
> +		if (tt->liobn == liobn)
> +			return tt;
> +	}
> +
> +	return NULL;
> +}
> +EXPORT_SYMBOL_GPL(kvmppc_find_tce_table);
> +
> +#ifdef DEBUG
> +/*
> + * Lets user mode disable realmode handlers by putting big number
> + * in the bottom value of LIOBN

What? Seriously? Just don't enable the CAP.

> + */
> +#define kvmppc_find_tce_table(a, b) \
> +		((((b)&0xffff)>10000)?NULL:kvmppc_find_tce_table((a), (b)))
> +#endif
> +
> +/*
> + * Validates TCE address.
> + * At the moment only flags are validated as other checks will significantly slow
> + * down or can make it even impossible to handle TCE requests in real mode.

What?

> + */
> +long kvmppc_emulated_validate_tce(unsigned long tce)

I don't like the naming scheme. Please turn this around and make it 
kvmppc_tce_validate().

> +{
> +	if (tce&  ~(IOMMU_PAGE_MASK | TCE_PCI_WRITE | TCE_PCI_READ))
> +		return H_PARAMETER;
> +
> +	return H_SUCCESS;
> +}
> +EXPORT_SYMBOL_GPL(kvmppc_emulated_validate_tce);
> +
> +/*
> + * Handles TCE requests for QEMU emulated devices.

We still don't mention QEMU in KVM code. And does it really matter 
whether they're emulated by QEMU? Devices could also be emulated by KVM.

> + * Puts guest TCE values to the table and expects QEMU to convert them
> + * later in a QEMU device implementation.
> + * Called in both real and virtual modes.
> + * Cannot fail so kvmppc_emulated_validate_tce must be called before it.
> + */
> +void kvmppc_emulated_put_tce(struct kvmppc_spapr_tce_table *tt,

kvmppc_tce_put()

> +		unsigned long ioba, unsigned long tce)
> +{
> +	unsigned long idx = ioba>>  SPAPR_TCE_SHIFT;
> +	struct page *page;
> +	u64 *tbl;
> +
> +	/*
> +	 * Note on the use of page_address() in real mode,
> +	 *
> +	 * It is safe to use page_address() in real mode on ppc64 because
> +	 * page_address() is always defined as lowmem_page_address()
> +	 * which returns __va(PFN_PHYS(page_to_pfn(page))) which is arithmetial
> +	 * operation and does not access page struct.
> +	 *
> +	 * Theoretically page_address() could be defined different
> +	 * but either WANT_PAGE_VIRTUAL or HASHED_PAGE_VIRTUAL
> +	 * should be enabled.
> +	 * WANT_PAGE_VIRTUAL is never enabled on ppc32/ppc64,
> +	 * HASHED_PAGE_VIRTUAL could be enabled for ppc32 only and only
> +	 * if CONFIG_HIGHMEM is defined. As CONFIG_SPARSEMEM_VMEMMAP
> +	 * is not expected to be enabled on ppc32, page_address()
> +	 * is safe for ppc32 as well.
> +	 */
> +#if defined(HASHED_PAGE_VIRTUAL) || defined(WANT_PAGE_VIRTUAL)
> +#error TODO: fix to avoid page_address() here
> +#endif

Can you extract the text above, the check and the page_address call into 
a simple wrapper function?

> +	page = tt->pages[idx / TCES_PER_PAGE];
> +	tbl = (u64 *)page_address(page);
> +
> +	/* udbg_printf("tce @ %p\n",&tbl[idx % TCES_PER_PAGE]); */

This is not an RFC, is it?

> +	tbl[idx % TCES_PER_PAGE] = tce;
> +}
> +EXPORT_SYMBOL_GPL(kvmppc_emulated_put_tce);
> +
> +#ifdef CONFIG_KVM_BOOK3S_64_HV
> +/*
> + * Converts guest physical address to host physical address.
> + * Tries to increase page counter via realmode_get_page() and
> + * returns ERROR_ADDR if failed.
> + */
> +static unsigned long kvmppc_rm_gpa_to_hpa_and_get(struct kvm_vcpu *vcpu,
> +		unsigned long gpa, struct page **pg)
> +{
> +	struct kvm_memory_slot *memslot;
> +	pte_t *ptep, pte;
> +	unsigned long hva, hpa = ERROR_ADDR;
> +	unsigned long gfn = gpa>>  PAGE_SHIFT;
> +	unsigned shift = 0;
> +
> +	memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
> +	if (!memslot)
> +		return ERROR_ADDR;
> +
> +	hva = __gfn_to_hva_memslot(memslot, gfn);
> +
> +	ptep = find_linux_pte_or_hugepte(vcpu->arch.pgdir, hva,&shift);
> +	if (!ptep || !pte_present(*ptep))
> +		return ERROR_ADDR;
> +	pte = *ptep;
> +
> +	if (((gpa&  TCE_PCI_WRITE) || pte_write(pte))&&  !pte_dirty(pte))
> +		return ERROR_ADDR;
> +
> +	if (!pte_young(pte))
> +		return ERROR_ADDR;
> +
> +	if (!shift)
> +		shift = PAGE_SHIFT;
> +
> +	/* Put huge pages handling to the virtual mode */
> +	if (shift>  PAGE_SHIFT)
> +		return ERROR_ADDR;
> +
> +	*pg = realmode_pfn_to_page(pte_pfn(pte));
> +	if (!*pg || realmode_get_page(*pg))
> +		return ERROR_ADDR;
> +
> +	/* pte_pfn(pte) returns address aligned to pg_size */
> +	hpa = (pte_pfn(pte)<<  PAGE_SHIFT) + (gpa&  ((1<<  shift) - 1));
> +
> +	if (unlikely(pte_val(pte) != pte_val(*ptep))) {
> +		hpa = ERROR_ADDR;
> +		realmode_put_page(*pg);
> +		*pg = NULL;
> +	}
> +
> +	return hpa;
> +}
> +
>   long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>   		      unsigned long ioba, unsigned long tce)
>   {
> -	struct kvm *kvm = vcpu->kvm;
> -	struct kvmppc_spapr_tce_table *stt;
> -
> -	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
> -	/* 	    liobn, ioba, tce); */
> -
> -	list_for_each_entry(stt,&kvm->arch.spapr_tce_tables, list) {
> -		if (stt->liobn == liobn) {
> -			unsigned long idx = ioba>>  SPAPR_TCE_SHIFT;
> -			struct page *page;
> -			u64 *tbl;
> -
> -			/* udbg_printf("H_PUT_TCE: liobn 0x%lx =>  stt=%p  window_size=0x%x\n", */
> -			/* 	    liobn, stt, stt->window_size); */
> -			if (ioba>= stt->window_size)
> -				return H_PARAMETER;
> -
> -			page = stt->pages[idx / TCES_PER_PAGE];
> -			tbl = (u64 *)page_address(page);
> -
> -			/* FIXME: Need to validate the TCE itself */
> -			/* udbg_printf("tce @ %p\n",&tbl[idx % TCES_PER_PAGE]); */
> -			tbl[idx % TCES_PER_PAGE] = tce;
> -			return H_SUCCESS;
> -		}
> +	long ret;
> +	struct kvmppc_spapr_tce_table *tt = kvmppc_find_tce_table(vcpu, liobn);
> +
> +	if (!tt)
> +		return H_TOO_HARD;
> +
> +	++tt->stat.rm.put;
> +
> +	if (ioba>= tt->window_size)
> +		return H_PARAMETER;
> +
> +	ret = kvmppc_emulated_validate_tce(tce);
> +	if (!ret)
> +		kvmppc_emulated_put_tce(tt, ioba, tce);
> +
> +	return ret;
> +}
> +
> +long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,

So the _vm version is the normal one and this is the _rm version? If so, 
please mark it as such. Is there any way to generate both from the same 
source? The way it's now there is a lot of duplicate code.


Alex

> +		unsigned long liobn, unsigned long ioba,
> +		unsigned long tce_list,	unsigned long npages)
> +{
> +	struct kvmppc_spapr_tce_table *tt = kvmppc_find_tce_table(vcpu, liobn);
> +	long i, ret = H_SUCCESS;
> +	unsigned long tces;
> +	struct page *pg = NULL;
> +
> +	if (!tt)
> +		return H_TOO_HARD;
> +
> +	++tt->stat.rm.indir;
> +
> +	/*
> +	 * The spec says that the maximum size of the list is 512 TCEs so
> +	 * so the whole table addressed resides in 4K page
> +	 */
> +	if (npages>  512)
> +		return H_PARAMETER;
> +
> +	if (tce_list&  ~IOMMU_PAGE_MASK)
> +		return H_PARAMETER;
> +
> +	if ((ioba + (npages<<  IOMMU_PAGE_SHIFT))>  tt->window_size)
> +		return H_PARAMETER;
> +
> +	tces = kvmppc_rm_gpa_to_hpa_and_get(vcpu, tce_list,&pg);
> +	if (tces == ERROR_ADDR)
> +		return H_TOO_HARD;
> +
> +	for (i = 0; i<  npages; ++i) {
> +		ret = kvmppc_emulated_validate_tce(((unsigned long *)tces)[i]);
> +		if (ret)
> +			goto put_unlock_exit;
> +	}
> +
> +	for (i = 0; i<  npages; ++i)
> +		kvmppc_emulated_put_tce(tt, ioba + (i<<  IOMMU_PAGE_SHIFT),
> +				((unsigned long *)tces)[i]);
> +
> +put_unlock_exit:
> +	if (!ret&&  pg&&  !PageCompound(pg)&&  realmode_put_page(pg)) {
> +		vcpu->arch.tce_rm_fail = TCERM_PUTLIST;
> +		ret = H_TOO_HARD;
>   	}
>
> -	/* Didn't find the liobn, punt it to userspace */
> -	return H_TOO_HARD;
> +	return ret;
> +}
> +
> +long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
> +		unsigned long liobn, unsigned long ioba,
> +		unsigned long tce_value, unsigned long npages)
> +{
> +	struct kvmppc_spapr_tce_table *tt;
> +	long i, ret;
> +
> +	tt = kvmppc_find_tce_table(vcpu, liobn);
> +	if (!tt)
> +		return H_TOO_HARD;
> +
> +	++tt->stat.rm.stuff;
> +
> +	if ((ioba + (npages<<  IOMMU_PAGE_SHIFT))>  tt->window_size)
> +		return H_PARAMETER;
> +
> +	ret = kvmppc_emulated_validate_tce(tce_value);
> +	if (ret || (tce_value&  (TCE_PCI_WRITE | TCE_PCI_READ)))
> +		return H_PARAMETER;
> +
> +	for (i = 0; i<  npages; ++i, ioba += IOMMU_PAGE_SIZE)
> +		kvmppc_emulated_put_tce(tt, ioba, tce_value);
> +
> +	return H_SUCCESS;
>   }
> +#endif /* CONFIG_KVM_BOOK3S_64_HV */
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 550f592..ac41d01 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -567,7 +567,31 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
>   		if (kvmppc_xics_enabled(vcpu)) {
>   			ret = kvmppc_xics_hcall(vcpu, req);
>   			break;
> -		} /* fallthrough */
> +		}
> +		return RESUME_HOST;
> +	case H_PUT_TCE:
> +		ret = kvmppc_vm_h_put_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
> +						kvmppc_get_gpr(vcpu, 5),
> +						kvmppc_get_gpr(vcpu, 6));
> +		if (ret == H_TOO_HARD)
> +			return RESUME_HOST;
> +		break;
> +	case H_PUT_TCE_INDIRECT:
> +		ret = kvmppc_vm_h_put_tce_indirect(vcpu, kvmppc_get_gpr(vcpu, 4),
> +						kvmppc_get_gpr(vcpu, 5),
> +						kvmppc_get_gpr(vcpu, 6),
> +						kvmppc_get_gpr(vcpu, 7));
> +		if (ret == H_TOO_HARD)
> +			return RESUME_HOST;
> +		break;
> +	case H_STUFF_TCE:
> +		ret = kvmppc_vm_h_stuff_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
> +						kvmppc_get_gpr(vcpu, 5),
> +						kvmppc_get_gpr(vcpu, 6),
> +						kvmppc_get_gpr(vcpu, 7));
> +		if (ret == H_TOO_HARD)
> +			return RESUME_HOST;
> +		break;
>   	default:
>   		return RESUME_HOST;
>   	}
> @@ -958,6 +982,20 @@ struct kvm_vcpu *kvmppc_core_vcpu_create(struct kvm *kvm, unsigned int id)
>   	vcpu->arch.cpu_type = KVM_CPU_3S_64;
>   	kvmppc_sanity_check(vcpu);
>
> +	/*
> +	 * As we want to minimize the chance of having H_PUT_TCE_INDIRECT
> +	 * half executed, we first read TCEs from the user, check them and
> +	 * return error if something went wrong and only then put TCEs into
> +	 * the TCE table.
> +	 *
> +	 * tce_tmp_hpas is a cache for TCEs to avoid stack allocation or
> +	 * kmalloc as the whole TCE list can take up to 512 items 8 bytes
> +	 * each (4096 bytes).
> +	 */
> +	vcpu->arch.tce_tmp_hpas = kmalloc(4096, GFP_KERNEL);
> +	if (!vcpu->arch.tce_tmp_hpas)
> +		goto free_vcpu;
> +
>   	return vcpu;
>
>   free_vcpu:
> @@ -980,6 +1018,7 @@ void kvmppc_core_vcpu_free(struct kvm_vcpu *vcpu)
>   	unpin_vpa(vcpu->kvm,&vcpu->arch.slb_shadow);
>   	unpin_vpa(vcpu->kvm,&vcpu->arch.vpa);
>   	spin_unlock(&vcpu->arch.vpa_update_lock);
> +	kfree(vcpu->arch.tce_tmp_hpas);
>   	kvm_vcpu_uninit(vcpu);
>   	kmem_cache_free(kvm_vcpu_cache, vcpu);
>   }
> diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> index b02f91e..d35554e 100644
> --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> @@ -1490,6 +1490,12 @@ hcall_real_table:
>   	.long	0		/* 0x11c */
>   	.long	0		/* 0x120 */
>   	.long	.kvmppc_h_bulk_remove - hcall_real_table
> +	.long	0		/* 0x128 */
> +	.long	0		/* 0x12c */
> +	.long	0		/* 0x130 */
> +	.long	0		/* 0x134 */
> +	.long	.kvmppc_h_stuff_tce - hcall_real_table
> +	.long	.kvmppc_h_put_tce_indirect - hcall_real_table
>   hcall_real_table_end:
>
>   ignore_hdec:
> diff --git a/arch/powerpc/kvm/book3s_pr_papr.c b/arch/powerpc/kvm/book3s_pr_papr.c
> index da0e0bc..edfea88 100644
> --- a/arch/powerpc/kvm/book3s_pr_papr.c
> +++ b/arch/powerpc/kvm/book3s_pr_papr.c
> @@ -220,7 +220,38 @@ static int kvmppc_h_pr_put_tce(struct kvm_vcpu *vcpu)
>   	unsigned long tce = kvmppc_get_gpr(vcpu, 6);
>   	long rc;
>
> -	rc = kvmppc_h_put_tce(vcpu, liobn, ioba, tce);
> +	rc = kvmppc_vm_h_put_tce(vcpu, liobn, ioba, tce);
> +	if (rc == H_TOO_HARD)
> +		return EMULATE_FAIL;
> +	kvmppc_set_gpr(vcpu, 3, rc);
> +	return EMULATE_DONE;
> +}
> +
> +static int kvmppc_h_pr_put_tce_indirect(struct kvm_vcpu *vcpu)
> +{
> +	unsigned long liobn = kvmppc_get_gpr(vcpu, 4);
> +	unsigned long ioba = kvmppc_get_gpr(vcpu, 5);
> +	unsigned long tce = kvmppc_get_gpr(vcpu, 6);
> +	unsigned long npages = kvmppc_get_gpr(vcpu, 7);
> +	long rc;
> +
> +	rc = kvmppc_vm_h_put_tce_indirect(vcpu, liobn, ioba,
> +			tce, npages);
> +	if (rc == H_TOO_HARD)
> +		return EMULATE_FAIL;
> +	kvmppc_set_gpr(vcpu, 3, rc);
> +	return EMULATE_DONE;
> +}
> +
> +static int kvmppc_h_pr_stuff_tce(struct kvm_vcpu *vcpu)
> +{
> +	unsigned long liobn = kvmppc_get_gpr(vcpu, 4);
> +	unsigned long ioba = kvmppc_get_gpr(vcpu, 5);
> +	unsigned long tce_value = kvmppc_get_gpr(vcpu, 6);
> +	unsigned long npages = kvmppc_get_gpr(vcpu, 7);
> +	long rc;
> +
> +	rc = kvmppc_vm_h_stuff_tce(vcpu, liobn, ioba, tce_value, npages);
>   	if (rc == H_TOO_HARD)
>   		return EMULATE_FAIL;
>   	kvmppc_set_gpr(vcpu, 3, rc);
> @@ -247,6 +278,10 @@ int kvmppc_h_pr(struct kvm_vcpu *vcpu, unsigned long cmd)
>   		return kvmppc_h_pr_bulk_remove(vcpu);
>   	case H_PUT_TCE:
>   		return kvmppc_h_pr_put_tce(vcpu);
> +	case H_PUT_TCE_INDIRECT:
> +		return kvmppc_h_pr_put_tce_indirect(vcpu);
> +	case H_STUFF_TCE:
> +		return kvmppc_h_pr_stuff_tce(vcpu);
>   	case H_CEDE:
>   		vcpu->arch.shared->msr |= MSR_EE;
>   		kvm_vcpu_block(vcpu);
> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> index 6316ee3..ccb578b 100644
> --- a/arch/powerpc/kvm/powerpc.c
> +++ b/arch/powerpc/kvm/powerpc.c
> @@ -394,6 +394,9 @@ int kvm_dev_ioctl_check_extension(long ext)
>   	case KVM_CAP_PPC_GET_SMMU_INFO:
>   		r = 1;
>   		break;
> +	case KVM_CAP_SPAPR_MULTITCE:
> +		r = 1;
> +		break;
>   #endif
>   	default:
>   		r = 0;
Alexey Kardashevskiy July 10, 2013, 5 a.m. UTC | #2
On 07/10/2013 03:02 AM, Alexander Graf wrote:
> On 07/06/2013 05:07 PM, Alexey Kardashevskiy wrote:
>> This adds real mode handlers for the H_PUT_TCE_INDIRECT and
>> H_STUFF_TCE hypercalls for QEMU emulated devices such as IBMVIO
>> devices or emulated PCI.  These calls allow adding multiple entries
>> (up to 512) into the TCE table in one call which saves time on
>> transition to/from real mode.
> 
> We don't mention QEMU explicitly in KVM code usually.
> 
>> This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs
>> (copied from user and verified) before writing the whole list into
>> the TCE table. This cache will be utilized more in the upcoming
>> VFIO/IOMMU support to continue TCE list processing in the virtual
>> mode in the case if the real mode handler failed for some reason.
>>
>> This adds a guest physical to host real address converter
>> and calls the existing H_PUT_TCE handler. The converting function
>> is going to be fully utilized by upcoming VFIO supporting patches.
>>
>> This also implements the KVM_CAP_PPC_MULTITCE capability,
>> so in order to support the functionality of this patch, QEMU
>> needs to query for this capability and set the "hcall-multi-tce"
>> hypertas property only if the capability is present, otherwise
>> there will be serious performance degradation.
> 
> Same as above. But really you're only giving recommendations here. What's
> the point? Please describe what the benefit of this patch is, not what some
> other random subsystem might do with the benefits it brings.
> 
>>
>> Signed-off-by: Paul Mackerras<paulus@samba.org>
>> Signed-off-by: Alexey Kardashevskiy<aik@ozlabs.ru>
>>
>> ---
>> Changelog:
>> 2013/07/06:
>> * fixed number of wrong get_page()/put_page() calls
>>
>> 2013/06/27:
>> * fixed clear of BUSY bit in kvmppc_lookup_pte()
>> * H_PUT_TCE_INDIRECT does realmode_get_page() now
>> * KVM_CAP_SPAPR_MULTITCE now depends on CONFIG_PPC_BOOK3S_64
>> * updated doc
>>
>> 2013/06/05:
>> * fixed mistype about IBMVIO in the commit message
>> * updated doc and moved it to another section
>> * changed capability number
>>
>> 2013/05/21:
>> * added kvm_vcpu_arch::tce_tmp
>> * removed cleanup if put_indirect failed, instead we do not even start
>> writing to TCE table if we cannot get TCEs from the user and they are
>> invalid
>> * kvmppc_emulated_h_put_tce is split to kvmppc_emulated_put_tce
>> and kvmppc_emulated_validate_tce (for the previous item)
>> * fixed bug with failthrough for H_IPI
>> * removed all get_user() from real mode handlers
>> * kvmppc_lookup_pte() added (instead of making lookup_linux_pte public)
>>
>> Signed-off-by: Alexey Kardashevskiy<aik@ozlabs.ru>
>> ---
>>   Documentation/virtual/kvm/api.txt       |  25 +++
>>   arch/powerpc/include/asm/kvm_host.h     |   9 ++
>>   arch/powerpc/include/asm/kvm_ppc.h      |  16 +-
>>   arch/powerpc/kvm/book3s_64_vio.c        | 154 ++++++++++++++++++-
>>   arch/powerpc/kvm/book3s_64_vio_hv.c     | 260
>> ++++++++++++++++++++++++++++----
>>   arch/powerpc/kvm/book3s_hv.c            |  41 ++++-
>>   arch/powerpc/kvm/book3s_hv_rmhandlers.S |   6 +
>>   arch/powerpc/kvm/book3s_pr_papr.c       |  37 ++++-
>>   arch/powerpc/kvm/powerpc.c              |   3 +
>>   9 files changed, 517 insertions(+), 34 deletions(-)
>>
>> diff --git a/Documentation/virtual/kvm/api.txt
>> b/Documentation/virtual/kvm/api.txt
>> index 6365fef..762c703 100644
>> --- a/Documentation/virtual/kvm/api.txt
>> +++ b/Documentation/virtual/kvm/api.txt
>> @@ -2362,6 +2362,31 @@ calls by the guest for that service will be passed
>> to userspace to be
>>   handled.
>>
>>
>> +4.86 KVM_CAP_PPC_MULTITCE
>> +
>> +Capability: KVM_CAP_PPC_MULTITCE
>> +Architectures: ppc
>> +Type: vm
>> +
>> +This capability means the kernel is capable of handling hypercalls
>> +H_PUT_TCE_INDIRECT and H_STUFF_TCE without passing those into the user
>> +space. This significanly accelerates DMA operations for PPC KVM guests.
> 
> significanly? Please run this through a spell checker.
> 
>> +The user space should expect that its handlers for these hypercalls
> 
> s/The//
> 
>> +are not going to be called.
> 
> Is user space guaranteed they will not be called? Or can it still happen?

... if user space previously registered LIOBN in KVM (via
KVM_CREATE_SPAPR_TCE or similar calls).

ok?

There is also KVM_CREATE_SPAPR_TCE_IOMMU but it is not in the kernel yet
and may never get there.


>> +In order to enable H_PUT_TCE_INDIRECT and H_STUFF_TCE use in the guest,
>> +the user space might have to advertise it for the guest. For example,
>> +IBM pSeries guest starts using them if "hcall-multi-tce" is present in
>> +the "ibm,hypertas-functions" device-tree property.
> 
> This paragraph describes sPAPR. That's fine, but please document it as
> such. Also please check your grammar.

>> +
>> +Without this capability, only H_PUT_TCE is handled by the kernel and
>> +therefore the use of H_PUT_TCE_INDIRECT and H_STUFF_TCE is not recommended
>> +unless the capability is present as passing hypercalls to the userspace
>> +slows operations a lot.
>> +
>> +Unlike other capabilities of this section, this one is always enabled.
> 
> Why? Wouldn't that confuse older user space?


How? Old user space won't check for this capability and won't tell the
guest to use it (via "hcall-multi-tce"). Old H_PUT_TCE is still there.

If the guest always uses H_PUT_TCE_INDIRECT/H_STUFF_TCE no matter what,
then it is its problem - it won't work now anyway as neither QEMU nor host
kernel supports these calls.



>> +
>> +
>>   5. The kvm_run structure
>>   ------------------------
>>
>> diff --git a/arch/powerpc/include/asm/kvm_host.h
>> b/arch/powerpc/include/asm/kvm_host.h
>> index af326cd..20d04bd 100644
>> --- a/arch/powerpc/include/asm/kvm_host.h
>> +++ b/arch/powerpc/include/asm/kvm_host.h
>> @@ -180,6 +180,7 @@ struct kvmppc_spapr_tce_table {
>>       struct kvm *kvm;
>>       u64 liobn;
>>       u32 window_size;
>> +    struct { struct { unsigned long put, indir, stuff; } rm, vm; } stat;
> 
> You don't need this.
>
>>       struct page *pages[0];
>>   };
>>
>> @@ -609,6 +610,14 @@ struct kvm_vcpu_arch {
>>       spinlock_t tbacct_lock;
>>       u64 busy_stolen;
>>       u64 busy_preempt;
>> +
>> +    unsigned long *tce_tmp_hpas;    /* TCE cache for TCE_PUT_INDIRECT
>> hcall */
>> +    enum {
>> +        TCERM_NONE,
>> +        TCERM_GETPAGE,
>> +        TCERM_PUTTCE,
>> +        TCERM_PUTLIST,
>> +    } tce_rm_fail;            /* failed stage of request processing */
>>   #endif
>>   };
>>
>> diff --git a/arch/powerpc/include/asm/kvm_ppc.h
>> b/arch/powerpc/include/asm/kvm_ppc.h
>> index a5287fe..fa722a0 100644
>> --- a/arch/powerpc/include/asm/kvm_ppc.h
>> +++ b/arch/powerpc/include/asm/kvm_ppc.h
>> @@ -133,8 +133,20 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu
>> *vcpu);
>>
>>   extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>>                   struct kvm_create_spapr_tce *args);
>> -extern long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>> -                 unsigned long ioba, unsigned long tce);
>> +extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(
>> +        struct kvm_vcpu *vcpu, unsigned long liobn);
>> +extern long kvmppc_emulated_validate_tce(unsigned long tce);
>> +extern void kvmppc_emulated_put_tce(struct kvmppc_spapr_tce_table *tt,
>> +        unsigned long ioba, unsigned long tce);
>> +extern long kvmppc_vm_h_put_tce(struct kvm_vcpu *vcpu,
>> +        unsigned long liobn, unsigned long ioba,
>> +        unsigned long tce);
>> +extern long kvmppc_vm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>> +        unsigned long liobn, unsigned long ioba,
>> +        unsigned long tce_list, unsigned long npages);
>> +extern long kvmppc_vm_h_stuff_tce(struct kvm_vcpu *vcpu,
>> +        unsigned long liobn, unsigned long ioba,
>> +        unsigned long tce_value, unsigned long npages);
>>   extern long kvm_vm_ioctl_allocate_rma(struct kvm *kvm,
>>                   struct kvm_allocate_rma *rma);
>>   extern struct kvmppc_linear_info *kvm_alloc_rma(void);
>> diff --git a/arch/powerpc/kvm/book3s_64_vio.c
>> b/arch/powerpc/kvm/book3s_64_vio.c
>> index b2d3f3b..99bf4e5 100644
>> --- a/arch/powerpc/kvm/book3s_64_vio.c
>> +++ b/arch/powerpc/kvm/book3s_64_vio.c
>> @@ -14,6 +14,7 @@
>>    *
>>    * Copyright 2010 Paul Mackerras, IBM Corp.<paulus@au1.ibm.com>
>>    * Copyright 2011 David Gibson, IBM Corporation<dwg@au1.ibm.com>
>> + * Copyright 2013 Alexey Kardashevskiy, IBM Corporation<aik@au1.ibm.com>
>>    */
>>
>>   #include<linux/types.h>
>> @@ -36,8 +37,10 @@
>>   #include<asm/ppc-opcode.h>
>>   #include<asm/kvm_host.h>
>>   #include<asm/udbg.h>
>> +#include<asm/iommu.h>
>> +#include<asm/tce.h>
>>
>> -#define TCES_PER_PAGE    (PAGE_SIZE / sizeof(u64))
>> +#define ERROR_ADDR      ((void *)~(unsigned long)0x0)
>>
>>   static long kvmppc_stt_npages(unsigned long window_size)
>>   {
>> @@ -50,6 +53,20 @@ static void release_spapr_tce_table(struct
>> kvmppc_spapr_tce_table *stt)
>>       struct kvm *kvm = stt->kvm;
>>       int i;
>>
>> +#define __SV(x) stt->stat.x
>> +#define __SVD(x) (__SV(rm.x)?(__SV(rm.x)-__SV(vm.x)):0)
>> +    pr_debug("%s stat for liobn=%llx\n"
>> +            "--------------- realmode ----- virtmode ---\n"
>> +            "put_tce       %10ld     %10ld\n"
>> +            "put_tce_indir %10ld     %10ld\n"
>> +            "stuff_tce     %10ld     %10ld\n",
>> +            __func__, stt->liobn,
>> +            __SVD(put), __SV(vm.put),
>> +            __SVD(indir), __SV(vm.indir),
>> +            __SVD(stuff), __SV(vm.stuff));
>> +#undef __SVD
>> +#undef __SV
> 
> All of these stat points should just be trace points. You can do the
> statistic gathering from user space then.
> 
>> +
>>       mutex_lock(&kvm->lock);
>>       list_del(&stt->list);
>>       for (i = 0; i<  kvmppc_stt_npages(stt->window_size); i++)
>> @@ -148,3 +165,138 @@ fail:
>>       }
>>       return ret;
>>   }
>> +
>> +/* Converts guest physical address to host virtual address */
>> +static void __user *kvmppc_vm_gpa_to_hva_and_get(struct kvm_vcpu *vcpu,
> 
> Please don't distinguish _vm versions. They're the normal case. _rm ones
> are the special ones.
> 
>> +        unsigned long gpa, struct page **pg)
>> +{
>> +    unsigned long hva, gfn = gpa>>  PAGE_SHIFT;
>> +    struct kvm_memory_slot *memslot;
>> +
>> +    memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
>> +    if (!memslot)
>> +        return ERROR_ADDR;
>> +
>> +    hva = __gfn_to_hva_memslot(memslot, gfn) + (gpa&  ~PAGE_MASK);
> 
> s/+/|/
> 
>> +
>> +    if (get_user_pages_fast(hva&  PAGE_MASK, 1, 0, pg) != 1)
>> +        return ERROR_ADDR;
>> +
>> +    return (void *) hva;
>> +}
>> +
>> +long kvmppc_vm_h_put_tce(struct kvm_vcpu *vcpu,
>> +        unsigned long liobn, unsigned long ioba,
>> +        unsigned long tce)
>> +{
>> +    long ret;
>> +    struct kvmppc_spapr_tce_table *tt;
>> +
>> +    tt = kvmppc_find_tce_table(vcpu, liobn);
>> +    /* Didn't find the liobn, put it to userspace */
> 
> Unclear comment.


What detail is missing?


>> +    if (!tt)
>> +        return H_TOO_HARD;
>> +
>> +    ++tt->stat.vm.put;
>> +
>> +    if (ioba>= tt->window_size)
>> +        return H_PARAMETER;
>> +
>> +    ret = kvmppc_emulated_validate_tce(tce);
>> +    if (ret)
>> +        return ret;
>> +
>> +    kvmppc_emulated_put_tce(tt, ioba, tce);
>> +
>> +    return H_SUCCESS;
>> +}
>> +
>> +long kvmppc_vm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>> +        unsigned long liobn, unsigned long ioba,
>> +        unsigned long tce_list, unsigned long npages)
>> +{
>> +    struct kvmppc_spapr_tce_table *tt;
>> +    long i, ret = H_SUCCESS;
>> +    unsigned long __user *tces;
>> +    struct page *pg = NULL;
>> +
>> +    tt = kvmppc_find_tce_table(vcpu, liobn);
>> +    /* Didn't find the liobn, put it to userspace */
>> +    if (!tt)
>> +        return H_TOO_HARD;
>> +
>> +    ++tt->stat.vm.indir;
>> +
>> +    /*
>> +     * The spec says that the maximum size of the list is 512 TCEs so
>> +     * so the whole table addressed resides in 4K page
> 
> so so?
>
>> +     */
>> +    if (npages>  512)
>> +        return H_PARAMETER;
>> +
>> +    if (tce_list&  ~IOMMU_PAGE_MASK)
>> +        return H_PARAMETER;
>> +
>> +    if ((ioba + (npages<<  IOMMU_PAGE_SHIFT))>  tt->window_size)
>> +        return H_PARAMETER;
>> +
>> +    tces = kvmppc_vm_gpa_to_hva_and_get(vcpu, tce_list,&pg);
>> +    if (tces == ERROR_ADDR)
>> +        return H_TOO_HARD;
>> +
>> +    if (vcpu->arch.tce_rm_fail == TCERM_PUTLIST)
>> +        goto put_list_page_exit;
>> +
>> +    for (i = 0; i<  npages; ++i) {
>> +        if (get_user(vcpu->arch.tce_tmp_hpas[i], tces + i)) {
>> +            ret = H_PARAMETER;
>> +            goto put_list_page_exit;
>> +        }
>> +
>> +        ret = kvmppc_emulated_validate_tce(vcpu->arch.tce_tmp_hpas[i]);
>> +        if (ret)
>> +            goto put_list_page_exit;
>> +    }
>> +
>> +    for (i = 0; i<  npages; ++i)
>> +        kvmppc_emulated_put_tce(tt, ioba + (i<<  IOMMU_PAGE_SHIFT),
>> +                vcpu->arch.tce_tmp_hpas[i]);
>> +put_list_page_exit:
>> +    if (pg)
>> +        put_page(pg);
>> +
>> +    if (vcpu->arch.tce_rm_fail != TCERM_NONE) {
>> +        vcpu->arch.tce_rm_fail = TCERM_NONE;
>> +        if (pg&&  !PageCompound(pg))
>> +            put_page(pg); /* finish pending realmode_put_page() */
>> +    }
>> +
>> +    return ret;
>> +}
>> +
>> +long kvmppc_vm_h_stuff_tce(struct kvm_vcpu *vcpu,
>> +        unsigned long liobn, unsigned long ioba,
>> +        unsigned long tce_value, unsigned long npages)
>> +{
>> +    struct kvmppc_spapr_tce_table *tt;
>> +    long i, ret;
>> +
>> +    tt = kvmppc_find_tce_table(vcpu, liobn);
>> +    /* Didn't find the liobn, put it to userspace */
>> +    if (!tt)
>> +        return H_TOO_HARD;
>> +
>> +    ++tt->stat.vm.stuff;
>> +
>> +    if ((ioba + (npages<<  IOMMU_PAGE_SHIFT))>  tt->window_size)
>> +        return H_PARAMETER;
>> +
>> +    ret = kvmppc_emulated_validate_tce(tce_value);
>> +    if (ret || (tce_value&  (TCE_PCI_WRITE | TCE_PCI_READ)))
>> +        return H_PARAMETER;
>> +
>> +    for (i = 0; i<  npages; ++i, ioba += IOMMU_PAGE_SIZE)
>> +        kvmppc_emulated_put_tce(tt, ioba, tce_value);
>> +
>> +    return H_SUCCESS;
>> +}
>> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c
>> b/arch/powerpc/kvm/book3s_64_vio_hv.c
>> index 30c2f3b..cd3e6f9 100644
>> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
>> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
>> @@ -14,6 +14,7 @@
>>    *
>>    * Copyright 2010 Paul Mackerras, IBM Corp.<paulus@au1.ibm.com>
>>    * Copyright 2011 David Gibson, IBM Corporation<dwg@au1.ibm.com>
>> + * Copyright 2013 Alexey Kardashevskiy, IBM Corporation<aik@au1.ibm.com>
>>    */
>>
>>   #include<linux/types.h>
>> @@ -35,42 +36,243 @@
>>   #include<asm/ppc-opcode.h>
>>   #include<asm/kvm_host.h>
>>   #include<asm/udbg.h>
>> +#include<asm/iommu.h>
>> +#include<asm/tce.h>
>>
>>   #define TCES_PER_PAGE    (PAGE_SIZE / sizeof(u64))
>> +#define ERROR_ADDR      (~(unsigned long)0x0)
>>
>> -/* WARNING: This will be called in real-mode on HV KVM and virtual
>> - *          mode on PR KVM
> 
> What's wrong with the warning?


It belongs to kvmppc_h_put_tce() which is not called in virtual mode anymore.

It is technically correct for kvmppc_find_tce_table() though. Should I put
this comment before every function which may be called from real and
virtual modes?



>> +/*
>> + * Finds a TCE table descriptor by LIOBN
>>    */
>> +struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(struct kvm_vcpu *vcpu,
>> +        unsigned long liobn)
>> +{
>> +    struct kvmppc_spapr_tce_table *tt;
>> +
>> +    list_for_each_entry(tt,&vcpu->kvm->arch.spapr_tce_tables, list) {
>> +        if (tt->liobn == liobn)
>> +            return tt;
>> +    }
>> +
>> +    return NULL;
>> +}
>> +EXPORT_SYMBOL_GPL(kvmppc_find_tce_table);
>> +
>> +#ifdef DEBUG
>> +/*
>> + * Lets user mode disable realmode handlers by putting big number
>> + * in the bottom value of LIOBN
> 
> What? Seriously? Just don't enable the CAP.


It is under DEBUG. It really, really helps to be able to disable real mode
handlers without reboot. Ok, no debug code, I'll remove.


>> + */
>> +#define kvmppc_find_tce_table(a, b) \
>> +        ((((b)&0xffff)>10000)?NULL:kvmppc_find_tce_table((a), (b)))
>> +#endif
>> +
>> +/*
>> + * Validates TCE address.
>> + * At the moment only flags are validated as other checks will
>> significantly slow
>> + * down or can make it even impossible to handle TCE requests in real mode.
> 
> What?


What is missing here (besides good english)?


>> + */
>> +long kvmppc_emulated_validate_tce(unsigned long tce)
> 
> I don't like the naming scheme. Please turn this around and make it
> kvmppc_tce_validate().


Oh. "Like"... Ok.


>> +{
>> +    if (tce&  ~(IOMMU_PAGE_MASK | TCE_PCI_WRITE | TCE_PCI_READ))
>> +        return H_PARAMETER;
>> +
>> +    return H_SUCCESS;
>> +}
>> +EXPORT_SYMBOL_GPL(kvmppc_emulated_validate_tce);
>> +
>> +/*
>> + * Handles TCE requests for QEMU emulated devices.
> 
> We still don't mention QEMU in KVM code. And does it really matter whether
> they're emulated by QEMU? Devices could also be emulated by KVM.
>
>> + * Puts guest TCE values to the table and expects QEMU to convert them
>> + * later in a QEMU device implementation.
>> + * Called in both real and virtual modes.
>> + * Cannot fail so kvmppc_emulated_validate_tce must be called before it.
>> + */
>> +void kvmppc_emulated_put_tce(struct kvmppc_spapr_tce_table *tt,
> 
> kvmppc_tce_put()
> 
>> +        unsigned long ioba, unsigned long tce)
>> +{
>> +    unsigned long idx = ioba>>  SPAPR_TCE_SHIFT;
>> +    struct page *page;
>> +    u64 *tbl;
>> +
>> +    /*
>> +     * Note on the use of page_address() in real mode,
>> +     *
>> +     * It is safe to use page_address() in real mode on ppc64 because
>> +     * page_address() is always defined as lowmem_page_address()
>> +     * which returns __va(PFN_PHYS(page_to_pfn(page))) which is arithmetial
>> +     * operation and does not access page struct.
>> +     *
>> +     * Theoretically page_address() could be defined different
>> +     * but either WANT_PAGE_VIRTUAL or HASHED_PAGE_VIRTUAL
>> +     * should be enabled.
>> +     * WANT_PAGE_VIRTUAL is never enabled on ppc32/ppc64,
>> +     * HASHED_PAGE_VIRTUAL could be enabled for ppc32 only and only
>> +     * if CONFIG_HIGHMEM is defined. As CONFIG_SPARSEMEM_VMEMMAP
>> +     * is not expected to be enabled on ppc32, page_address()
>> +     * is safe for ppc32 as well.
>> +     */
>> +#if defined(HASHED_PAGE_VIRTUAL) || defined(WANT_PAGE_VIRTUAL)
>> +#error TODO: fix to avoid page_address() here
>> +#endif
> 
> Can you extract the text above, the check and the page_address call into a
> simple wrapper function?


Is this function also too big? Sorry, I do not understand the comment.


>> +    page = tt->pages[idx / TCES_PER_PAGE];
>> +    tbl = (u64 *)page_address(page);
>> +
>> +    /* udbg_printf("tce @ %p\n",&tbl[idx % TCES_PER_PAGE]); */
> 
> This is not an RFC, is it?


Any debug code is prohibited? Ok, I'll remove.


>> +    tbl[idx % TCES_PER_PAGE] = tce;
>> +}
>> +EXPORT_SYMBOL_GPL(kvmppc_emulated_put_tce);
>> +
>> +#ifdef CONFIG_KVM_BOOK3S_64_HV
>> +/*
>> + * Converts guest physical address to host physical address.
>> + * Tries to increase page counter via realmode_get_page() and
>> + * returns ERROR_ADDR if failed.
>> + */
>> +static unsigned long kvmppc_rm_gpa_to_hpa_and_get(struct kvm_vcpu *vcpu,
>> +        unsigned long gpa, struct page **pg)
>> +{
>> +    struct kvm_memory_slot *memslot;
>> +    pte_t *ptep, pte;
>> +    unsigned long hva, hpa = ERROR_ADDR;
>> +    unsigned long gfn = gpa>>  PAGE_SHIFT;
>> +    unsigned shift = 0;
>> +
>> +    memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
>> +    if (!memslot)
>> +        return ERROR_ADDR;
>> +
>> +    hva = __gfn_to_hva_memslot(memslot, gfn);
>> +
>> +    ptep = find_linux_pte_or_hugepte(vcpu->arch.pgdir, hva,&shift);
>> +    if (!ptep || !pte_present(*ptep))
>> +        return ERROR_ADDR;
>> +    pte = *ptep;
>> +
>> +    if (((gpa&  TCE_PCI_WRITE) || pte_write(pte))&&  !pte_dirty(pte))
>> +        return ERROR_ADDR;
>> +
>> +    if (!pte_young(pte))
>> +        return ERROR_ADDR;
>> +
>> +    if (!shift)
>> +        shift = PAGE_SHIFT;
>> +
>> +    /* Put huge pages handling to the virtual mode */
>> +    if (shift>  PAGE_SHIFT)
>> +        return ERROR_ADDR;
>> +
>> +    *pg = realmode_pfn_to_page(pte_pfn(pte));
>> +    if (!*pg || realmode_get_page(*pg))
>> +        return ERROR_ADDR;
>> +
>> +    /* pte_pfn(pte) returns address aligned to pg_size */
>> +    hpa = (pte_pfn(pte)<<  PAGE_SHIFT) + (gpa&  ((1<<  shift) - 1));
>> +
>> +    if (unlikely(pte_val(pte) != pte_val(*ptep))) {
>> +        hpa = ERROR_ADDR;
>> +        realmode_put_page(*pg);
>> +        *pg = NULL;
>> +    }
>> +
>> +    return hpa;
>> +}
>> +
>>   long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>>                 unsigned long ioba, unsigned long tce)
>>   {
>> -    struct kvm *kvm = vcpu->kvm;
>> -    struct kvmppc_spapr_tce_table *stt;
>> -
>> -    /* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
>> -    /*         liobn, ioba, tce); */
>> -
>> -    list_for_each_entry(stt,&kvm->arch.spapr_tce_tables, list) {
>> -        if (stt->liobn == liobn) {
>> -            unsigned long idx = ioba>>  SPAPR_TCE_SHIFT;
>> -            struct page *page;
>> -            u64 *tbl;
>> -
>> -            /* udbg_printf("H_PUT_TCE: liobn 0x%lx =>  stt=%p 
>> window_size=0x%x\n", */
>> -            /*         liobn, stt, stt->window_size); */
>> -            if (ioba>= stt->window_size)
>> -                return H_PARAMETER;
>> -
>> -            page = stt->pages[idx / TCES_PER_PAGE];
>> -            tbl = (u64 *)page_address(page);
>> -
>> -            /* FIXME: Need to validate the TCE itself */
>> -            /* udbg_printf("tce @ %p\n",&tbl[idx % TCES_PER_PAGE]); */
>> -            tbl[idx % TCES_PER_PAGE] = tce;
>> -            return H_SUCCESS;
>> -        }
>> +    long ret;
>> +    struct kvmppc_spapr_tce_table *tt = kvmppc_find_tce_table(vcpu, liobn);
>> +
>> +    if (!tt)
>> +        return H_TOO_HARD;
>> +
>> +    ++tt->stat.rm.put;
>> +
>> +    if (ioba>= tt->window_size)
>> +        return H_PARAMETER;
>> +
>> +    ret = kvmppc_emulated_validate_tce(tce);
>> +    if (!ret)
>> +        kvmppc_emulated_put_tce(tt, ioba, tce);
>> +
>> +    return ret;
>> +}
>> +
>> +long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> 
> So the _vm version is the normal one and this is the _rm version? If so,
> please mark it as such. Is there any way to generate both from the same
> source? The way it's now there is a lot of duplicate code.


I tried, looked very ugly. If you insist, I will do so.
Alexander Graf July 10, 2013, 10:05 a.m. UTC | #3
On 10.07.2013, at 07:00, Alexey Kardashevskiy wrote:

> On 07/10/2013 03:02 AM, Alexander Graf wrote:
>> On 07/06/2013 05:07 PM, Alexey Kardashevskiy wrote:
>>> This adds real mode handlers for the H_PUT_TCE_INDIRECT and
>>> H_STUFF_TCE hypercalls for QEMU emulated devices such as IBMVIO
>>> devices or emulated PCI.  These calls allow adding multiple entries
>>> (up to 512) into the TCE table in one call which saves time on
>>> transition to/from real mode.
>> 
>> We don't mention QEMU explicitly in KVM code usually.
>> 
>>> This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs
>>> (copied from user and verified) before writing the whole list into
>>> the TCE table. This cache will be utilized more in the upcoming
>>> VFIO/IOMMU support to continue TCE list processing in the virtual
>>> mode in the case if the real mode handler failed for some reason.
>>> 
>>> This adds a guest physical to host real address converter
>>> and calls the existing H_PUT_TCE handler. The converting function
>>> is going to be fully utilized by upcoming VFIO supporting patches.
>>> 
>>> This also implements the KVM_CAP_PPC_MULTITCE capability,
>>> so in order to support the functionality of this patch, QEMU
>>> needs to query for this capability and set the "hcall-multi-tce"
>>> hypertas property only if the capability is present, otherwise
>>> there will be serious performance degradation.
>> 
>> Same as above. But really you're only giving recommendations here. What's
>> the point? Please describe what the benefit of this patch is, not what some
>> other random subsystem might do with the benefits it brings.
>> 
>>> 
>>> Signed-off-by: Paul Mackerras<paulus@samba.org>
>>> Signed-off-by: Alexey Kardashevskiy<aik@ozlabs.ru>
>>> 
>>> ---
>>> Changelog:
>>> 2013/07/06:
>>> * fixed number of wrong get_page()/put_page() calls
>>> 
>>> 2013/06/27:
>>> * fixed clear of BUSY bit in kvmppc_lookup_pte()
>>> * H_PUT_TCE_INDIRECT does realmode_get_page() now
>>> * KVM_CAP_SPAPR_MULTITCE now depends on CONFIG_PPC_BOOK3S_64
>>> * updated doc
>>> 
>>> 2013/06/05:
>>> * fixed mistype about IBMVIO in the commit message
>>> * updated doc and moved it to another section
>>> * changed capability number
>>> 
>>> 2013/05/21:
>>> * added kvm_vcpu_arch::tce_tmp
>>> * removed cleanup if put_indirect failed, instead we do not even start
>>> writing to TCE table if we cannot get TCEs from the user and they are
>>> invalid
>>> * kvmppc_emulated_h_put_tce is split to kvmppc_emulated_put_tce
>>> and kvmppc_emulated_validate_tce (for the previous item)
>>> * fixed bug with failthrough for H_IPI
>>> * removed all get_user() from real mode handlers
>>> * kvmppc_lookup_pte() added (instead of making lookup_linux_pte public)
>>> 
>>> Signed-off-by: Alexey Kardashevskiy<aik@ozlabs.ru>
>>> ---
>>>  Documentation/virtual/kvm/api.txt       |  25 +++
>>>  arch/powerpc/include/asm/kvm_host.h     |   9 ++
>>>  arch/powerpc/include/asm/kvm_ppc.h      |  16 +-
>>>  arch/powerpc/kvm/book3s_64_vio.c        | 154 ++++++++++++++++++-
>>>  arch/powerpc/kvm/book3s_64_vio_hv.c     | 260
>>> ++++++++++++++++++++++++++++----
>>>  arch/powerpc/kvm/book3s_hv.c            |  41 ++++-
>>>  arch/powerpc/kvm/book3s_hv_rmhandlers.S |   6 +
>>>  arch/powerpc/kvm/book3s_pr_papr.c       |  37 ++++-
>>>  arch/powerpc/kvm/powerpc.c              |   3 +
>>>  9 files changed, 517 insertions(+), 34 deletions(-)
>>> 
>>> diff --git a/Documentation/virtual/kvm/api.txt
>>> b/Documentation/virtual/kvm/api.txt
>>> index 6365fef..762c703 100644
>>> --- a/Documentation/virtual/kvm/api.txt
>>> +++ b/Documentation/virtual/kvm/api.txt
>>> @@ -2362,6 +2362,31 @@ calls by the guest for that service will be passed
>>> to userspace to be
>>>  handled.
>>> 
>>> 
>>> +4.86 KVM_CAP_PPC_MULTITCE
>>> +
>>> +Capability: KVM_CAP_PPC_MULTITCE
>>> +Architectures: ppc
>>> +Type: vm
>>> +
>>> +This capability means the kernel is capable of handling hypercalls
>>> +H_PUT_TCE_INDIRECT and H_STUFF_TCE without passing those into the user
>>> +space. This significanly accelerates DMA operations for PPC KVM guests.
>> 
>> significanly? Please run this through a spell checker.
>> 
>>> +The user space should expect that its handlers for these hypercalls
>> 
>> s/The//
>> 
>>> +are not going to be called.
>> 
>> Is user space guaranteed they will not be called? Or can it still happen?
> 
> ... if user space previously registered LIOBN in KVM (via
> KVM_CREATE_SPAPR_TCE or similar calls).
> 
> ok?

How about this?

The hypercalls mentioned above may or may not be processed successfully in the kernel based fast path. If they can not be handled by the kernel, they will get passed on to user space. So user space still has to have an implementation for these despite the in kernel acceleration.

---

The target audience for this documentation is user space KVM API users. Someone developing kvm tool for example. They want to know implications specific CAPs have.

> 
> There is also KVM_CREATE_SPAPR_TCE_IOMMU but it is not in the kernel yet
> and may never get there.
> 
> 
>>> +In order to enable H_PUT_TCE_INDIRECT and H_STUFF_TCE use in the guest,
>>> +the user space might have to advertise it for the guest. For example,
>>> +IBM pSeries guest starts using them if "hcall-multi-tce" is present in
>>> +the "ibm,hypertas-functions" device-tree property.
>> 
>> This paragraph describes sPAPR. That's fine, but please document it as
>> such. Also please check your grammar.
> 
>>> +
>>> +Without this capability, only H_PUT_TCE is handled by the kernel and
>>> +therefore the use of H_PUT_TCE_INDIRECT and H_STUFF_TCE is not recommended
>>> +unless the capability is present as passing hypercalls to the userspace
>>> +slows operations a lot.
>>> +
>>> +Unlike other capabilities of this section, this one is always enabled.
>> 
>> Why? Wouldn't that confuse older user space?
> 
> 
> How? Old user space won't check for this capability and won't tell the
> guest to use it (via "hcall-multi-tce"). Old H_PUT_TCE is still there.
> 
> If the guest always uses H_PUT_TCE_INDIRECT/H_STUFF_TCE no matter what,
> then it is its problem - it won't work now anyway as neither QEMU nor host
> kernel supports these calls.

Always assume that you are a kernel developer without knowledge of any user space code using your interfaces. So there is the theoretical possibility that there is a user space client out there that implements H_PUT_TCE_INDIRECT and advertises hcall-multi-tce to the guest. Would that client break? If so, we should definitely have the CAP disabled by default.

But really, it's also as much about consistency as anything else. If we leave everything as is and always extend functionality by enabling new CAPs, we're pretty much guaranteed that we don't break anything by accident. It also makes debugging easier because you can for example disable this particular feature to see whether something has bad side effects.

> 
> 
> 
>>> +
>>> +
>>>  5. The kvm_run structure
>>>  ------------------------
>>> 
>>> diff --git a/arch/powerpc/include/asm/kvm_host.h
>>> b/arch/powerpc/include/asm/kvm_host.h
>>> index af326cd..20d04bd 100644
>>> --- a/arch/powerpc/include/asm/kvm_host.h
>>> +++ b/arch/powerpc/include/asm/kvm_host.h
>>> @@ -180,6 +180,7 @@ struct kvmppc_spapr_tce_table {
>>>      struct kvm *kvm;
>>>      u64 liobn;
>>>      u32 window_size;
>>> +    struct { struct { unsigned long put, indir, stuff; } rm, vm; } stat;
>> 
>> You don't need this.
>> 
>>>      struct page *pages[0];
>>>  };
>>> 
>>> @@ -609,6 +610,14 @@ struct kvm_vcpu_arch {
>>>      spinlock_t tbacct_lock;
>>>      u64 busy_stolen;
>>>      u64 busy_preempt;
>>> +
>>> +    unsigned long *tce_tmp_hpas;    /* TCE cache for TCE_PUT_INDIRECT
>>> hcall */
>>> +    enum {
>>> +        TCERM_NONE,
>>> +        TCERM_GETPAGE,
>>> +        TCERM_PUTTCE,
>>> +        TCERM_PUTLIST,
>>> +    } tce_rm_fail;            /* failed stage of request processing */
>>>  #endif
>>>  };
>>> 
>>> diff --git a/arch/powerpc/include/asm/kvm_ppc.h
>>> b/arch/powerpc/include/asm/kvm_ppc.h
>>> index a5287fe..fa722a0 100644
>>> --- a/arch/powerpc/include/asm/kvm_ppc.h
>>> +++ b/arch/powerpc/include/asm/kvm_ppc.h
>>> @@ -133,8 +133,20 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu
>>> *vcpu);
>>> 
>>>  extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>>>                  struct kvm_create_spapr_tce *args);
>>> -extern long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>>> -                 unsigned long ioba, unsigned long tce);
>>> +extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(
>>> +        struct kvm_vcpu *vcpu, unsigned long liobn);
>>> +extern long kvmppc_emulated_validate_tce(unsigned long tce);
>>> +extern void kvmppc_emulated_put_tce(struct kvmppc_spapr_tce_table *tt,
>>> +        unsigned long ioba, unsigned long tce);
>>> +extern long kvmppc_vm_h_put_tce(struct kvm_vcpu *vcpu,
>>> +        unsigned long liobn, unsigned long ioba,
>>> +        unsigned long tce);
>>> +extern long kvmppc_vm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>> +        unsigned long liobn, unsigned long ioba,
>>> +        unsigned long tce_list, unsigned long npages);
>>> +extern long kvmppc_vm_h_stuff_tce(struct kvm_vcpu *vcpu,
>>> +        unsigned long liobn, unsigned long ioba,
>>> +        unsigned long tce_value, unsigned long npages);
>>>  extern long kvm_vm_ioctl_allocate_rma(struct kvm *kvm,
>>>                  struct kvm_allocate_rma *rma);
>>>  extern struct kvmppc_linear_info *kvm_alloc_rma(void);
>>> diff --git a/arch/powerpc/kvm/book3s_64_vio.c
>>> b/arch/powerpc/kvm/book3s_64_vio.c
>>> index b2d3f3b..99bf4e5 100644
>>> --- a/arch/powerpc/kvm/book3s_64_vio.c
>>> +++ b/arch/powerpc/kvm/book3s_64_vio.c
>>> @@ -14,6 +14,7 @@
>>>   *
>>>   * Copyright 2010 Paul Mackerras, IBM Corp.<paulus@au1.ibm.com>
>>>   * Copyright 2011 David Gibson, IBM Corporation<dwg@au1.ibm.com>
>>> + * Copyright 2013 Alexey Kardashevskiy, IBM Corporation<aik@au1.ibm.com>
>>>   */
>>> 
>>>  #include<linux/types.h>
>>> @@ -36,8 +37,10 @@
>>>  #include<asm/ppc-opcode.h>
>>>  #include<asm/kvm_host.h>
>>>  #include<asm/udbg.h>
>>> +#include<asm/iommu.h>
>>> +#include<asm/tce.h>
>>> 
>>> -#define TCES_PER_PAGE    (PAGE_SIZE / sizeof(u64))
>>> +#define ERROR_ADDR      ((void *)~(unsigned long)0x0)
>>> 
>>>  static long kvmppc_stt_npages(unsigned long window_size)
>>>  {
>>> @@ -50,6 +53,20 @@ static void release_spapr_tce_table(struct
>>> kvmppc_spapr_tce_table *stt)
>>>      struct kvm *kvm = stt->kvm;
>>>      int i;
>>> 
>>> +#define __SV(x) stt->stat.x
>>> +#define __SVD(x) (__SV(rm.x)?(__SV(rm.x)-__SV(vm.x)):0)
>>> +    pr_debug("%s stat for liobn=%llx\n"
>>> +            "--------------- realmode ----- virtmode ---\n"
>>> +            "put_tce       %10ld     %10ld\n"
>>> +            "put_tce_indir %10ld     %10ld\n"
>>> +            "stuff_tce     %10ld     %10ld\n",
>>> +            __func__, stt->liobn,
>>> +            __SVD(put), __SV(vm.put),
>>> +            __SVD(indir), __SV(vm.indir),
>>> +            __SVD(stuff), __SV(vm.stuff));
>>> +#undef __SVD
>>> +#undef __SV
>> 
>> All of these stat points should just be trace points. You can do the
>> statistic gathering from user space then.
>> 
>>> +
>>>      mutex_lock(&kvm->lock);
>>>      list_del(&stt->list);
>>>      for (i = 0; i<  kvmppc_stt_npages(stt->window_size); i++)
>>> @@ -148,3 +165,138 @@ fail:
>>>      }
>>>      return ret;
>>>  }
>>> +
>>> +/* Converts guest physical address to host virtual address */
>>> +static void __user *kvmppc_vm_gpa_to_hva_and_get(struct kvm_vcpu *vcpu,
>> 
>> Please don't distinguish _vm versions. They're the normal case. _rm ones
>> are the special ones.
>> 
>>> +        unsigned long gpa, struct page **pg)
>>> +{
>>> +    unsigned long hva, gfn = gpa>>  PAGE_SHIFT;
>>> +    struct kvm_memory_slot *memslot;
>>> +
>>> +    memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
>>> +    if (!memslot)
>>> +        return ERROR_ADDR;
>>> +
>>> +    hva = __gfn_to_hva_memslot(memslot, gfn) + (gpa&  ~PAGE_MASK);
>> 
>> s/+/|/
>> 
>>> +
>>> +    if (get_user_pages_fast(hva&  PAGE_MASK, 1, 0, pg) != 1)
>>> +        return ERROR_ADDR;
>>> +
>>> +    return (void *) hva;
>>> +}
>>> +
>>> +long kvmppc_vm_h_put_tce(struct kvm_vcpu *vcpu,
>>> +        unsigned long liobn, unsigned long ioba,
>>> +        unsigned long tce)
>>> +{
>>> +    long ret;
>>> +    struct kvmppc_spapr_tce_table *tt;
>>> +
>>> +    tt = kvmppc_find_tce_table(vcpu, liobn);
>>> +    /* Didn't find the liobn, put it to userspace */
>> 
>> Unclear comment.
> 
> 
> What detail is missing?

Grammar wise "it" in the second half of the sentence refers to liobn. So you "put" the "liobn to userspace". That sentence doesn't make any sense.

What you really want to say is:

  /* Couldn't find the liobn. Something went wrong. Let user space handle the hypercall. That has better ways of dealing with errors. */

> 
> 
>>> +    if (!tt)
>>> +        return H_TOO_HARD;
>>> +
>>> +    ++tt->stat.vm.put;
>>> +
>>> +    if (ioba>= tt->window_size)
>>> +        return H_PARAMETER;
>>> +
>>> +    ret = kvmppc_emulated_validate_tce(tce);
>>> +    if (ret)
>>> +        return ret;
>>> +
>>> +    kvmppc_emulated_put_tce(tt, ioba, tce);
>>> +
>>> +    return H_SUCCESS;
>>> +}
>>> +
>>> +long kvmppc_vm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>> +        unsigned long liobn, unsigned long ioba,
>>> +        unsigned long tce_list, unsigned long npages)
>>> +{
>>> +    struct kvmppc_spapr_tce_table *tt;
>>> +    long i, ret = H_SUCCESS;
>>> +    unsigned long __user *tces;
>>> +    struct page *pg = NULL;
>>> +
>>> +    tt = kvmppc_find_tce_table(vcpu, liobn);
>>> +    /* Didn't find the liobn, put it to userspace */
>>> +    if (!tt)
>>> +        return H_TOO_HARD;
>>> +
>>> +    ++tt->stat.vm.indir;
>>> +
>>> +    /*
>>> +     * The spec says that the maximum size of the list is 512 TCEs so
>>> +     * so the whole table addressed resides in 4K page
>> 
>> so so?
>> 
>>> +     */
>>> +    if (npages>  512)
>>> +        return H_PARAMETER;
>>> +
>>> +    if (tce_list&  ~IOMMU_PAGE_MASK)
>>> +        return H_PARAMETER;
>>> +
>>> +    if ((ioba + (npages<<  IOMMU_PAGE_SHIFT))>  tt->window_size)
>>> +        return H_PARAMETER;
>>> +
>>> +    tces = kvmppc_vm_gpa_to_hva_and_get(vcpu, tce_list,&pg);
>>> +    if (tces == ERROR_ADDR)
>>> +        return H_TOO_HARD;
>>> +
>>> +    if (vcpu->arch.tce_rm_fail == TCERM_PUTLIST)
>>> +        goto put_list_page_exit;
>>> +
>>> +    for (i = 0; i<  npages; ++i) {
>>> +        if (get_user(vcpu->arch.tce_tmp_hpas[i], tces + i)) {
>>> +            ret = H_PARAMETER;
>>> +            goto put_list_page_exit;
>>> +        }
>>> +
>>> +        ret = kvmppc_emulated_validate_tce(vcpu->arch.tce_tmp_hpas[i]);
>>> +        if (ret)
>>> +            goto put_list_page_exit;
>>> +    }
>>> +
>>> +    for (i = 0; i<  npages; ++i)
>>> +        kvmppc_emulated_put_tce(tt, ioba + (i<<  IOMMU_PAGE_SHIFT),
>>> +                vcpu->arch.tce_tmp_hpas[i]);
>>> +put_list_page_exit:
>>> +    if (pg)
>>> +        put_page(pg);
>>> +
>>> +    if (vcpu->arch.tce_rm_fail != TCERM_NONE) {
>>> +        vcpu->arch.tce_rm_fail = TCERM_NONE;
>>> +        if (pg&&  !PageCompound(pg))
>>> +            put_page(pg); /* finish pending realmode_put_page() */
>>> +    }
>>> +
>>> +    return ret;
>>> +}
>>> +
>>> +long kvmppc_vm_h_stuff_tce(struct kvm_vcpu *vcpu,
>>> +        unsigned long liobn, unsigned long ioba,
>>> +        unsigned long tce_value, unsigned long npages)
>>> +{
>>> +    struct kvmppc_spapr_tce_table *tt;
>>> +    long i, ret;
>>> +
>>> +    tt = kvmppc_find_tce_table(vcpu, liobn);
>>> +    /* Didn't find the liobn, put it to userspace */
>>> +    if (!tt)
>>> +        return H_TOO_HARD;
>>> +
>>> +    ++tt->stat.vm.stuff;
>>> +
>>> +    if ((ioba + (npages<<  IOMMU_PAGE_SHIFT))>  tt->window_size)
>>> +        return H_PARAMETER;
>>> +
>>> +    ret = kvmppc_emulated_validate_tce(tce_value);
>>> +    if (ret || (tce_value&  (TCE_PCI_WRITE | TCE_PCI_READ)))
>>> +        return H_PARAMETER;
>>> +
>>> +    for (i = 0; i<  npages; ++i, ioba += IOMMU_PAGE_SIZE)
>>> +        kvmppc_emulated_put_tce(tt, ioba, tce_value);
>>> +
>>> +    return H_SUCCESS;
>>> +}
>>> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c
>>> b/arch/powerpc/kvm/book3s_64_vio_hv.c
>>> index 30c2f3b..cd3e6f9 100644
>>> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
>>> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
>>> @@ -14,6 +14,7 @@
>>>   *
>>>   * Copyright 2010 Paul Mackerras, IBM Corp.<paulus@au1.ibm.com>
>>>   * Copyright 2011 David Gibson, IBM Corporation<dwg@au1.ibm.com>
>>> + * Copyright 2013 Alexey Kardashevskiy, IBM Corporation<aik@au1.ibm.com>
>>>   */
>>> 
>>>  #include<linux/types.h>
>>> @@ -35,42 +36,243 @@
>>>  #include<asm/ppc-opcode.h>
>>>  #include<asm/kvm_host.h>
>>>  #include<asm/udbg.h>
>>> +#include<asm/iommu.h>
>>> +#include<asm/tce.h>
>>> 
>>>  #define TCES_PER_PAGE    (PAGE_SIZE / sizeof(u64))
>>> +#define ERROR_ADDR      (~(unsigned long)0x0)
>>> 
>>> -/* WARNING: This will be called in real-mode on HV KVM and virtual
>>> - *          mode on PR KVM
>> 
>> What's wrong with the warning?
> 
> 
> It belongs to kvmppc_h_put_tce() which is not called in virtual mode anymore.

I thought the comment applied to the whole file before? Hrm. Maybe I misread it then.

> It is technically correct for kvmppc_find_tce_table() though. Should I put
> this comment before every function which may be called from real and
> virtual modes?

Yes, please. Otherwise someone might stick an access to a non-linear address in there by accident.

> 
> 
> 
>>> +/*
>>> + * Finds a TCE table descriptor by LIOBN
>>>   */
>>> +struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(struct kvm_vcpu *vcpu,
>>> +        unsigned long liobn)
>>> +{
>>> +    struct kvmppc_spapr_tce_table *tt;
>>> +
>>> +    list_for_each_entry(tt,&vcpu->kvm->arch.spapr_tce_tables, list) {
>>> +        if (tt->liobn == liobn)
>>> +            return tt;
>>> +    }
>>> +
>>> +    return NULL;
>>> +}
>>> +EXPORT_SYMBOL_GPL(kvmppc_find_tce_table);
>>> +
>>> +#ifdef DEBUG
>>> +/*
>>> + * Lets user mode disable realmode handlers by putting big number
>>> + * in the bottom value of LIOBN
>> 
>> What? Seriously? Just don't enable the CAP.
> 
> 
> It is under DEBUG. It really, really helps to be able to disable real mode
> handlers without reboot. Ok, no debug code, I'll remove.

Debug code is good, but #ifdefs are bad. For you, an #ifdef reads like "code that doesn't do any hard when disabled". For me, #ifdefs read "code that definitely breaks because nobody turns the #define on".

So please, avoid #ifdef'ed code whenever possible. Switching the CAP on and off is a much better debug approach in this case.

> 
> 
>>> + */
>>> +#define kvmppc_find_tce_table(a, b) \
>>> +        ((((b)&0xffff)>10000)?NULL:kvmppc_find_tce_table((a), (b)))
>>> +#endif
>>> +
>>> +/*
>>> + * Validates TCE address.
>>> + * At the moment only flags are validated as other checks will
>>> significantly slow
>>> + * down or can make it even impossible to handle TCE requests in real mode.
>> 
>> What?
> 
> 
> What is missing here (besides good english)?

What badness could slip through by not validating everything?

> 
> 
>>> + */
>>> +long kvmppc_emulated_validate_tce(unsigned long tce)
>> 
>> I don't like the naming scheme. Please turn this around and make it
>> kvmppc_tce_validate().
> 
> 
> Oh. "Like"... Ok.

Yes. Like.

> 
> 
>>> +{
>>> +    if (tce&  ~(IOMMU_PAGE_MASK | TCE_PCI_WRITE | TCE_PCI_READ))
>>> +        return H_PARAMETER;
>>> +
>>> +    return H_SUCCESS;
>>> +}
>>> +EXPORT_SYMBOL_GPL(kvmppc_emulated_validate_tce);
>>> +
>>> +/*
>>> + * Handles TCE requests for QEMU emulated devices.
>> 
>> We still don't mention QEMU in KVM code. And does it really matter whether
>> they're emulated by QEMU? Devices could also be emulated by KVM.
>> 
>>> + * Puts guest TCE values to the table and expects QEMU to convert them
>>> + * later in a QEMU device implementation.
>>> + * Called in both real and virtual modes.
>>> + * Cannot fail so kvmppc_emulated_validate_tce must be called before it.
>>> + */
>>> +void kvmppc_emulated_put_tce(struct kvmppc_spapr_tce_table *tt,
>> 
>> kvmppc_tce_put()
>> 
>>> +        unsigned long ioba, unsigned long tce)
>>> +{
>>> +    unsigned long idx = ioba>>  SPAPR_TCE_SHIFT;
>>> +    struct page *page;
>>> +    u64 *tbl;
>>> +
>>> +    /*
>>> +     * Note on the use of page_address() in real mode,
>>> +     *
>>> +     * It is safe to use page_address() in real mode on ppc64 because
>>> +     * page_address() is always defined as lowmem_page_address()
>>> +     * which returns __va(PFN_PHYS(page_to_pfn(page))) which is arithmetial
>>> +     * operation and does not access page struct.
>>> +     *
>>> +     * Theoretically page_address() could be defined different
>>> +     * but either WANT_PAGE_VIRTUAL or HASHED_PAGE_VIRTUAL
>>> +     * should be enabled.
>>> +     * WANT_PAGE_VIRTUAL is never enabled on ppc32/ppc64,
>>> +     * HASHED_PAGE_VIRTUAL could be enabled for ppc32 only and only
>>> +     * if CONFIG_HIGHMEM is defined. As CONFIG_SPARSEMEM_VMEMMAP
>>> +     * is not expected to be enabled on ppc32, page_address()
>>> +     * is safe for ppc32 as well.
>>> +     */
>>> +#if defined(HASHED_PAGE_VIRTUAL) || defined(WANT_PAGE_VIRTUAL)
>>> +#error TODO: fix to avoid page_address() here
>>> +#endif
>> 
>> Can you extract the text above, the check and the page_address call into a
>> simple wrapper function?
> 
> 
> Is this function also too big? Sorry, I do not understand the comment.

All of the comment and #if here only deal with the fact that you have a real mode hack to call page_address() that happens to work under specific circumstances.

There's nothing kvmppc_tce_put() specific about this. The page_address() code happens to get called here, sure. But if I read the kvmppc_tce_put() function I don't care about these details - I want to understand the code flow that ends up writing the TCE.

> 
> 
>>> +    page = tt->pages[idx / TCES_PER_PAGE];
>>> +    tbl = (u64 *)page_address(page);
>>> +
>>> +    /* udbg_printf("tce @ %p\n",&tbl[idx % TCES_PER_PAGE]); */
>> 
>> This is not an RFC, is it?
> 
> 
> Any debug code is prohibited? Ok, I'll remove.

Debug code that requires code changes is prohibited, yes. Debug code that is runtime switchable (pr_debug, trace points, etc) are allowed.

> 
> 
>>> +    tbl[idx % TCES_PER_PAGE] = tce;
>>> +}
>>> +EXPORT_SYMBOL_GPL(kvmppc_emulated_put_tce);
>>> +
>>> +#ifdef CONFIG_KVM_BOOK3S_64_HV
>>> +/*
>>> + * Converts guest physical address to host physical address.
>>> + * Tries to increase page counter via realmode_get_page() and
>>> + * returns ERROR_ADDR if failed.
>>> + */
>>> +static unsigned long kvmppc_rm_gpa_to_hpa_and_get(struct kvm_vcpu *vcpu,
>>> +        unsigned long gpa, struct page **pg)
>>> +{
>>> +    struct kvm_memory_slot *memslot;
>>> +    pte_t *ptep, pte;
>>> +    unsigned long hva, hpa = ERROR_ADDR;
>>> +    unsigned long gfn = gpa>>  PAGE_SHIFT;
>>> +    unsigned shift = 0;
>>> +
>>> +    memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
>>> +    if (!memslot)
>>> +        return ERROR_ADDR;
>>> +
>>> +    hva = __gfn_to_hva_memslot(memslot, gfn);
>>> +
>>> +    ptep = find_linux_pte_or_hugepte(vcpu->arch.pgdir, hva,&shift);
>>> +    if (!ptep || !pte_present(*ptep))
>>> +        return ERROR_ADDR;
>>> +    pte = *ptep;
>>> +
>>> +    if (((gpa&  TCE_PCI_WRITE) || pte_write(pte))&&  !pte_dirty(pte))
>>> +        return ERROR_ADDR;
>>> +
>>> +    if (!pte_young(pte))
>>> +        return ERROR_ADDR;
>>> +
>>> +    if (!shift)
>>> +        shift = PAGE_SHIFT;
>>> +
>>> +    /* Put huge pages handling to the virtual mode */
>>> +    if (shift>  PAGE_SHIFT)
>>> +        return ERROR_ADDR;
>>> +
>>> +    *pg = realmode_pfn_to_page(pte_pfn(pte));
>>> +    if (!*pg || realmode_get_page(*pg))
>>> +        return ERROR_ADDR;
>>> +
>>> +    /* pte_pfn(pte) returns address aligned to pg_size */
>>> +    hpa = (pte_pfn(pte)<<  PAGE_SHIFT) + (gpa&  ((1<<  shift) - 1));
>>> +
>>> +    if (unlikely(pte_val(pte) != pte_val(*ptep))) {
>>> +        hpa = ERROR_ADDR;
>>> +        realmode_put_page(*pg);
>>> +        *pg = NULL;
>>> +    }
>>> +
>>> +    return hpa;
>>> +}
>>> +
>>>  long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>>>                unsigned long ioba, unsigned long tce)
>>>  {
>>> -    struct kvm *kvm = vcpu->kvm;
>>> -    struct kvmppc_spapr_tce_table *stt;
>>> -
>>> -    /* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
>>> -    /*         liobn, ioba, tce); */
>>> -
>>> -    list_for_each_entry(stt,&kvm->arch.spapr_tce_tables, list) {
>>> -        if (stt->liobn == liobn) {
>>> -            unsigned long idx = ioba>>  SPAPR_TCE_SHIFT;
>>> -            struct page *page;
>>> -            u64 *tbl;
>>> -
>>> -            /* udbg_printf("H_PUT_TCE: liobn 0x%lx =>  stt=%p 
>>> window_size=0x%x\n", */
>>> -            /*         liobn, stt, stt->window_size); */
>>> -            if (ioba>= stt->window_size)
>>> -                return H_PARAMETER;
>>> -
>>> -            page = stt->pages[idx / TCES_PER_PAGE];
>>> -            tbl = (u64 *)page_address(page);
>>> -
>>> -            /* FIXME: Need to validate the TCE itself */
>>> -            /* udbg_printf("tce @ %p\n",&tbl[idx % TCES_PER_PAGE]); */
>>> -            tbl[idx % TCES_PER_PAGE] = tce;
>>> -            return H_SUCCESS;
>>> -        }
>>> +    long ret;
>>> +    struct kvmppc_spapr_tce_table *tt = kvmppc_find_tce_table(vcpu, liobn);
>>> +
>>> +    if (!tt)
>>> +        return H_TOO_HARD;
>>> +
>>> +    ++tt->stat.rm.put;
>>> +
>>> +    if (ioba>= tt->window_size)
>>> +        return H_PARAMETER;
>>> +
>>> +    ret = kvmppc_emulated_validate_tce(tce);
>>> +    if (!ret)
>>> +        kvmppc_emulated_put_tce(tt, ioba, tce);
>>> +
>>> +    return ret;
>>> +}
>>> +
>>> +long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>> 
>> So the _vm version is the normal one and this is the _rm version? If so,
>> please mark it as such. Is there any way to generate both from the same
>> source? The way it's now there is a lot of duplicate code.
> 
> 
> I tried, looked very ugly. If you insist, I will do so.

If it looks ugly better don't. I just want to make sure you explored the option. But please keep the naming scheme consistent.


Alex
Alexey Kardashevskiy July 11, 2013, 5:12 a.m. UTC | #4
On 07/10/2013 08:05 PM, Alexander Graf wrote:
> 
> On 10.07.2013, at 07:00, Alexey Kardashevskiy wrote:
> 
>> On 07/10/2013 03:02 AM, Alexander Graf wrote:
>>> On 07/06/2013 05:07 PM, Alexey Kardashevskiy wrote:
>>>> This adds real mode handlers for the H_PUT_TCE_INDIRECT and
>>>> H_STUFF_TCE hypercalls for QEMU emulated devices such as IBMVIO
>>>> devices or emulated PCI.  These calls allow adding multiple entries
>>>> (up to 512) into the TCE table in one call which saves time on
>>>> transition to/from real mode.
>>>
>>> We don't mention QEMU explicitly in KVM code usually.
>>>
>>>> This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs
>>>> (copied from user and verified) before writing the whole list into
>>>> the TCE table. This cache will be utilized more in the upcoming
>>>> VFIO/IOMMU support to continue TCE list processing in the virtual
>>>> mode in the case if the real mode handler failed for some reason.
>>>>
>>>> This adds a guest physical to host real address converter
>>>> and calls the existing H_PUT_TCE handler. The converting function
>>>> is going to be fully utilized by upcoming VFIO supporting patches.
>>>>
>>>> This also implements the KVM_CAP_PPC_MULTITCE capability,
>>>> so in order to support the functionality of this patch, QEMU
>>>> needs to query for this capability and set the "hcall-multi-tce"
>>>> hypertas property only if the capability is present, otherwise
>>>> there will be serious performance degradation.
>>>
>>> Same as above. But really you're only giving recommendations here. What's
>>> the point? Please describe what the benefit of this patch is, not what some
>>> other random subsystem might do with the benefits it brings.
>>>
>>>>
>>>> Signed-off-by: Paul Mackerras<paulus@samba.org>
>>>> Signed-off-by: Alexey Kardashevskiy<aik@ozlabs.ru>
>>>>
>>>> ---
>>>> Changelog:
>>>> 2013/07/06:
>>>> * fixed number of wrong get_page()/put_page() calls
>>>>
>>>> 2013/06/27:
>>>> * fixed clear of BUSY bit in kvmppc_lookup_pte()
>>>> * H_PUT_TCE_INDIRECT does realmode_get_page() now
>>>> * KVM_CAP_SPAPR_MULTITCE now depends on CONFIG_PPC_BOOK3S_64
>>>> * updated doc
>>>>
>>>> 2013/06/05:
>>>> * fixed mistype about IBMVIO in the commit message
>>>> * updated doc and moved it to another section
>>>> * changed capability number
>>>>
>>>> 2013/05/21:
>>>> * added kvm_vcpu_arch::tce_tmp
>>>> * removed cleanup if put_indirect failed, instead we do not even start
>>>> writing to TCE table if we cannot get TCEs from the user and they are
>>>> invalid
>>>> * kvmppc_emulated_h_put_tce is split to kvmppc_emulated_put_tce
>>>> and kvmppc_emulated_validate_tce (for the previous item)
>>>> * fixed bug with failthrough for H_IPI
>>>> * removed all get_user() from real mode handlers
>>>> * kvmppc_lookup_pte() added (instead of making lookup_linux_pte public)
>>>>
>>>> Signed-off-by: Alexey Kardashevskiy<aik@ozlabs.ru>
>>>> ---
>>>>  Documentation/virtual/kvm/api.txt       |  25 +++
>>>>  arch/powerpc/include/asm/kvm_host.h     |   9 ++
>>>>  arch/powerpc/include/asm/kvm_ppc.h      |  16 +-
>>>>  arch/powerpc/kvm/book3s_64_vio.c        | 154 ++++++++++++++++++-
>>>>  arch/powerpc/kvm/book3s_64_vio_hv.c     | 260
>>>> ++++++++++++++++++++++++++++----
>>>>  arch/powerpc/kvm/book3s_hv.c            |  41 ++++-
>>>>  arch/powerpc/kvm/book3s_hv_rmhandlers.S |   6 +
>>>>  arch/powerpc/kvm/book3s_pr_papr.c       |  37 ++++-
>>>>  arch/powerpc/kvm/powerpc.c              |   3 +
>>>>  9 files changed, 517 insertions(+), 34 deletions(-)
>>>>
>>>> diff --git a/Documentation/virtual/kvm/api.txt
>>>> b/Documentation/virtual/kvm/api.txt
>>>> index 6365fef..762c703 100644
>>>> --- a/Documentation/virtual/kvm/api.txt
>>>> +++ b/Documentation/virtual/kvm/api.txt
>>>> @@ -2362,6 +2362,31 @@ calls by the guest for that service will be passed
>>>> to userspace to be
>>>>  handled.
>>>>
>>>>
>>>> +4.86 KVM_CAP_PPC_MULTITCE
>>>> +
>>>> +Capability: KVM_CAP_PPC_MULTITCE
>>>> +Architectures: ppc
>>>> +Type: vm
>>>> +
>>>> +This capability means the kernel is capable of handling hypercalls
>>>> +H_PUT_TCE_INDIRECT and H_STUFF_TCE without passing those into the user
>>>> +space. This significanly accelerates DMA operations for PPC KVM guests.
>>>
>>> significanly? Please run this through a spell checker.
>>>
>>>> +The user space should expect that its handlers for these hypercalls
>>>
>>> s/The//
>>>
>>>> +are not going to be called.
>>>
>>> Is user space guaranteed they will not be called? Or can it still happen?
>>
>> ... if user space previously registered LIOBN in KVM (via
>> KVM_CREATE_SPAPR_TCE or similar calls).
>>
>> ok?
> 
> How about this?
> 
> The hypercalls mentioned above may or may not be processed successfully in the kernel based fast path. If they can not be handled by the kernel, they will get passed on to user space. So user space still has to have an implementation for these despite the in kernel acceleration.
> 
> ---
> 
> The target audience for this documentation is user space KVM API users. Someone developing kvm tool for example. They want to know implications specific CAPs have.
> 
>>
>> There is also KVM_CREATE_SPAPR_TCE_IOMMU but it is not in the kernel yet
>> and may never get there.
>>
>>
>>>> +In order to enable H_PUT_TCE_INDIRECT and H_STUFF_TCE use in the guest,
>>>> +the user space might have to advertise it for the guest. For example,
>>>> +IBM pSeries guest starts using them if "hcall-multi-tce" is present in
>>>> +the "ibm,hypertas-functions" device-tree property.
>>>
>>> This paragraph describes sPAPR. That's fine, but please document it as
>>> such. Also please check your grammar.
>>
>>>> +
>>>> +Without this capability, only H_PUT_TCE is handled by the kernel and
>>>> +therefore the use of H_PUT_TCE_INDIRECT and H_STUFF_TCE is not recommended
>>>> +unless the capability is present as passing hypercalls to the userspace
>>>> +slows operations a lot.
>>>> +
>>>> +Unlike other capabilities of this section, this one is always enabled.
>>>
>>> Why? Wouldn't that confuse older user space?
>>
>>
>> How? Old user space won't check for this capability and won't tell the
>> guest to use it (via "hcall-multi-tce"). Old H_PUT_TCE is still there.
>>
>> If the guest always uses H_PUT_TCE_INDIRECT/H_STUFF_TCE no matter what,
>> then it is its problem - it won't work now anyway as neither QEMU nor host
>> kernel supports these calls.


> Always assume that you are a kernel developer without knowledge
> of any user space code using your interfaces. So there is the theoretical
> possibility that there is a user space client out there that implements
> H_PUT_TCE_INDIRECT and advertises hcall-multi-tce to the guest. 
> Would that client break? If so, we should definitely have
> the CAP disabled by default.


No, it won't break. Why would it break? I really do not get it. This user
space client has to do an extra step to get this acceleration by calling
ioctl(KVM_CREATE_SPAPR_TCE) anyway. Previously that ioctl only had effect
on H_PUT_TCE, now on all three hcalls.


> But really, it's also as much about consistency as anything else.
> If we leave everything as is and always extend functionality
> by enabling new CAPs, we're pretty much guaranteed that we
> don't break anything by accident. It also makes debugging easier
> because you can for example disable this particular feature
> to see whether something has bad side effects.


So I must add one more ioctl to enable MULTITCE in kernel handling. Is it
what you are saying?

I can see KVM_CHECK_EXTENSION but I do not see KVM_ENABLE_EXTENSION or
anything like that.



>>>> +
>>>> +
>>>>  5. The kvm_run structure
>>>>  ------------------------
>>>>
>>>> diff --git a/arch/powerpc/include/asm/kvm_host.h
>>>> b/arch/powerpc/include/asm/kvm_host.h
>>>> index af326cd..20d04bd 100644
>>>> --- a/arch/powerpc/include/asm/kvm_host.h
>>>> +++ b/arch/powerpc/include/asm/kvm_host.h
>>>> @@ -180,6 +180,7 @@ struct kvmppc_spapr_tce_table {
>>>>      struct kvm *kvm;
>>>>      u64 liobn;
>>>>      u32 window_size;
>>>> +    struct { struct { unsigned long put, indir, stuff; } rm, vm; } stat;
>>>
>>> You don't need this.
>>>
>>>>      struct page *pages[0];
>>>>  };
>>>>
>>>> @@ -609,6 +610,14 @@ struct kvm_vcpu_arch {
>>>>      spinlock_t tbacct_lock;
>>>>      u64 busy_stolen;
>>>>      u64 busy_preempt;
>>>> +
>>>> +    unsigned long *tce_tmp_hpas;    /* TCE cache for TCE_PUT_INDIRECT
>>>> hcall */
>>>> +    enum {
>>>> +        TCERM_NONE,
>>>> +        TCERM_GETPAGE,
>>>> +        TCERM_PUTTCE,
>>>> +        TCERM_PUTLIST,
>>>> +    } tce_rm_fail;            /* failed stage of request processing */
>>>>  #endif
>>>>  };
>>>>
>>>> diff --git a/arch/powerpc/include/asm/kvm_ppc.h
>>>> b/arch/powerpc/include/asm/kvm_ppc.h
>>>> index a5287fe..fa722a0 100644
>>>> --- a/arch/powerpc/include/asm/kvm_ppc.h
>>>> +++ b/arch/powerpc/include/asm/kvm_ppc.h
>>>> @@ -133,8 +133,20 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu
>>>> *vcpu);
>>>>
>>>>  extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>>>>                  struct kvm_create_spapr_tce *args);
>>>> -extern long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>>>> -                 unsigned long ioba, unsigned long tce);
>>>> +extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(
>>>> +        struct kvm_vcpu *vcpu, unsigned long liobn);
>>>> +extern long kvmppc_emulated_validate_tce(unsigned long tce);
>>>> +extern void kvmppc_emulated_put_tce(struct kvmppc_spapr_tce_table *tt,
>>>> +        unsigned long ioba, unsigned long tce);
>>>> +extern long kvmppc_vm_h_put_tce(struct kvm_vcpu *vcpu,
>>>> +        unsigned long liobn, unsigned long ioba,
>>>> +        unsigned long tce);
>>>> +extern long kvmppc_vm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>>> +        unsigned long liobn, unsigned long ioba,
>>>> +        unsigned long tce_list, unsigned long npages);
>>>> +extern long kvmppc_vm_h_stuff_tce(struct kvm_vcpu *vcpu,
>>>> +        unsigned long liobn, unsigned long ioba,
>>>> +        unsigned long tce_value, unsigned long npages);
>>>>  extern long kvm_vm_ioctl_allocate_rma(struct kvm *kvm,
>>>>                  struct kvm_allocate_rma *rma);
>>>>  extern struct kvmppc_linear_info *kvm_alloc_rma(void);
>>>> diff --git a/arch/powerpc/kvm/book3s_64_vio.c
>>>> b/arch/powerpc/kvm/book3s_64_vio.c
>>>> index b2d3f3b..99bf4e5 100644
>>>> --- a/arch/powerpc/kvm/book3s_64_vio.c
>>>> +++ b/arch/powerpc/kvm/book3s_64_vio.c
>>>> @@ -14,6 +14,7 @@
>>>>   *
>>>>   * Copyright 2010 Paul Mackerras, IBM Corp.<paulus@au1.ibm.com>
>>>>   * Copyright 2011 David Gibson, IBM Corporation<dwg@au1.ibm.com>
>>>> + * Copyright 2013 Alexey Kardashevskiy, IBM Corporation<aik@au1.ibm.com>
>>>>   */
>>>>
>>>>  #include<linux/types.h>
>>>> @@ -36,8 +37,10 @@
>>>>  #include<asm/ppc-opcode.h>
>>>>  #include<asm/kvm_host.h>
>>>>  #include<asm/udbg.h>
>>>> +#include<asm/iommu.h>
>>>> +#include<asm/tce.h>
>>>>
>>>> -#define TCES_PER_PAGE    (PAGE_SIZE / sizeof(u64))
>>>> +#define ERROR_ADDR      ((void *)~(unsigned long)0x0)
>>>>
>>>>  static long kvmppc_stt_npages(unsigned long window_size)
>>>>  {
>>>> @@ -50,6 +53,20 @@ static void release_spapr_tce_table(struct
>>>> kvmppc_spapr_tce_table *stt)
>>>>      struct kvm *kvm = stt->kvm;
>>>>      int i;
>>>>
>>>> +#define __SV(x) stt->stat.x
>>>> +#define __SVD(x) (__SV(rm.x)?(__SV(rm.x)-__SV(vm.x)):0)
>>>> +    pr_debug("%s stat for liobn=%llx\n"
>>>> +            "--------------- realmode ----- virtmode ---\n"
>>>> +            "put_tce       %10ld     %10ld\n"
>>>> +            "put_tce_indir %10ld     %10ld\n"
>>>> +            "stuff_tce     %10ld     %10ld\n",
>>>> +            __func__, stt->liobn,
>>>> +            __SVD(put), __SV(vm.put),
>>>> +            __SVD(indir), __SV(vm.indir),
>>>> +            __SVD(stuff), __SV(vm.stuff));
>>>> +#undef __SVD
>>>> +#undef __SV
>>>
>>> All of these stat points should just be trace points. You can do the
>>> statistic gathering from user space then.
>>>
>>>> +
>>>>      mutex_lock(&kvm->lock);
>>>>      list_del(&stt->list);
>>>>      for (i = 0; i<  kvmppc_stt_npages(stt->window_size); i++)
>>>> @@ -148,3 +165,138 @@ fail:
>>>>      }
>>>>      return ret;
>>>>  }
>>>> +
>>>> +/* Converts guest physical address to host virtual address */
>>>> +static void __user *kvmppc_vm_gpa_to_hva_and_get(struct kvm_vcpu *vcpu,
>>>
>>> Please don't distinguish _vm versions. They're the normal case. _rm ones
>>> are the special ones.
>>>
>>>> +        unsigned long gpa, struct page **pg)
>>>> +{
>>>> +    unsigned long hva, gfn = gpa>>  PAGE_SHIFT;
>>>> +    struct kvm_memory_slot *memslot;
>>>> +
>>>> +    memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
>>>> +    if (!memslot)
>>>> +        return ERROR_ADDR;
>>>> +
>>>> +    hva = __gfn_to_hva_memslot(memslot, gfn) + (gpa&  ~PAGE_MASK);
>>>
>>> s/+/|/
>>>
>>>> +
>>>> +    if (get_user_pages_fast(hva&  PAGE_MASK, 1, 0, pg) != 1)
>>>> +        return ERROR_ADDR;
>>>> +
>>>> +    return (void *) hva;
>>>> +}
>>>> +
>>>> +long kvmppc_vm_h_put_tce(struct kvm_vcpu *vcpu,
>>>> +        unsigned long liobn, unsigned long ioba,
>>>> +        unsigned long tce)
>>>> +{
>>>> +    long ret;
>>>> +    struct kvmppc_spapr_tce_table *tt;
>>>> +
>>>> +    tt = kvmppc_find_tce_table(vcpu, liobn);
>>>> +    /* Didn't find the liobn, put it to userspace */
>>>
>>> Unclear comment.
>>
>>
>> What detail is missing?
> 

> Grammar wise "it" in the second half of the sentence refers to liobn.
> So you "put" the "liobn to userspace". That sentence doesn't
> make any sense.


Removed it. H_TOO_HARD itself says enough already.


> What you really want to say is:
> 
>   /* Couldn't find the liobn. Something went wrong. Let user space handle the hypercall. That has better ways of dealing with errors. */
> 
>>
>>
>>>> +    if (!tt)
>>>> +        return H_TOO_HARD;
>>>> +
>>>> +    ++tt->stat.vm.put;
>>>> +
>>>> +    if (ioba>= tt->window_size)
>>>> +        return H_PARAMETER;
>>>> +
>>>> +    ret = kvmppc_emulated_validate_tce(tce);
>>>> +    if (ret)
>>>> +        return ret;
>>>> +
>>>> +    kvmppc_emulated_put_tce(tt, ioba, tce);
>>>> +
>>>> +    return H_SUCCESS;
>>>> +}
>>>> +
>>>> +long kvmppc_vm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>>> +        unsigned long liobn, unsigned long ioba,
>>>> +        unsigned long tce_list, unsigned long npages)
>>>> +{
>>>> +    struct kvmppc_spapr_tce_table *tt;
>>>> +    long i, ret = H_SUCCESS;
>>>> +    unsigned long __user *tces;
>>>> +    struct page *pg = NULL;
>>>> +
>>>> +    tt = kvmppc_find_tce_table(vcpu, liobn);
>>>> +    /* Didn't find the liobn, put it to userspace */
>>>> +    if (!tt)
>>>> +        return H_TOO_HARD;
>>>> +
>>>> +    ++tt->stat.vm.indir;
>>>> +
>>>> +    /*
>>>> +     * The spec says that the maximum size of the list is 512 TCEs so
>>>> +     * so the whole table addressed resides in 4K page
>>>
>>> so so?
>>>
>>>> +     */
>>>> +    if (npages>  512)
>>>> +        return H_PARAMETER;
>>>> +
>>>> +    if (tce_list&  ~IOMMU_PAGE_MASK)
>>>> +        return H_PARAMETER;
>>>> +
>>>> +    if ((ioba + (npages<<  IOMMU_PAGE_SHIFT))>  tt->window_size)
>>>> +        return H_PARAMETER;
>>>> +
>>>> +    tces = kvmppc_vm_gpa_to_hva_and_get(vcpu, tce_list,&pg);
>>>> +    if (tces == ERROR_ADDR)
>>>> +        return H_TOO_HARD;
>>>> +
>>>> +    if (vcpu->arch.tce_rm_fail == TCERM_PUTLIST)
>>>> +        goto put_list_page_exit;
>>>> +
>>>> +    for (i = 0; i<  npages; ++i) {
>>>> +        if (get_user(vcpu->arch.tce_tmp_hpas[i], tces + i)) {
>>>> +            ret = H_PARAMETER;
>>>> +            goto put_list_page_exit;
>>>> +        }
>>>> +
>>>> +        ret = kvmppc_emulated_validate_tce(vcpu->arch.tce_tmp_hpas[i]);
>>>> +        if (ret)
>>>> +            goto put_list_page_exit;
>>>> +    }
>>>> +
>>>> +    for (i = 0; i<  npages; ++i)
>>>> +        kvmppc_emulated_put_tce(tt, ioba + (i<<  IOMMU_PAGE_SHIFT),
>>>> +                vcpu->arch.tce_tmp_hpas[i]);
>>>> +put_list_page_exit:
>>>> +    if (pg)
>>>> +        put_page(pg);
>>>> +
>>>> +    if (vcpu->arch.tce_rm_fail != TCERM_NONE) {
>>>> +        vcpu->arch.tce_rm_fail = TCERM_NONE;
>>>> +        if (pg&&  !PageCompound(pg))
>>>> +            put_page(pg); /* finish pending realmode_put_page() */
>>>> +    }
>>>> +
>>>> +    return ret;
>>>> +}
>>>> +
>>>> +long kvmppc_vm_h_stuff_tce(struct kvm_vcpu *vcpu,
>>>> +        unsigned long liobn, unsigned long ioba,
>>>> +        unsigned long tce_value, unsigned long npages)
>>>> +{
>>>> +    struct kvmppc_spapr_tce_table *tt;
>>>> +    long i, ret;
>>>> +
>>>> +    tt = kvmppc_find_tce_table(vcpu, liobn);
>>>> +    /* Didn't find the liobn, put it to userspace */
>>>> +    if (!tt)
>>>> +        return H_TOO_HARD;
>>>> +
>>>> +    ++tt->stat.vm.stuff;
>>>> +
>>>> +    if ((ioba + (npages<<  IOMMU_PAGE_SHIFT))>  tt->window_size)
>>>> +        return H_PARAMETER;
>>>> +
>>>> +    ret = kvmppc_emulated_validate_tce(tce_value);
>>>> +    if (ret || (tce_value&  (TCE_PCI_WRITE | TCE_PCI_READ)))
>>>> +        return H_PARAMETER;
>>>> +
>>>> +    for (i = 0; i<  npages; ++i, ioba += IOMMU_PAGE_SIZE)
>>>> +        kvmppc_emulated_put_tce(tt, ioba, tce_value);
>>>> +
>>>> +    return H_SUCCESS;
>>>> +}
>>>> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c
>>>> b/arch/powerpc/kvm/book3s_64_vio_hv.c
>>>> index 30c2f3b..cd3e6f9 100644
>>>> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
>>>> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
>>>> @@ -14,6 +14,7 @@
>>>>   *
>>>>   * Copyright 2010 Paul Mackerras, IBM Corp.<paulus@au1.ibm.com>
>>>>   * Copyright 2011 David Gibson, IBM Corporation<dwg@au1.ibm.com>
>>>> + * Copyright 2013 Alexey Kardashevskiy, IBM Corporation<aik@au1.ibm.com>
>>>>   */
>>>>
>>>>  #include<linux/types.h>
>>>> @@ -35,42 +36,243 @@
>>>>  #include<asm/ppc-opcode.h>
>>>>  #include<asm/kvm_host.h>
>>>>  #include<asm/udbg.h>
>>>> +#include<asm/iommu.h>
>>>> +#include<asm/tce.h>
>>>>
>>>>  #define TCES_PER_PAGE    (PAGE_SIZE / sizeof(u64))
>>>> +#define ERROR_ADDR      (~(unsigned long)0x0)
>>>>
>>>> -/* WARNING: This will be called in real-mode on HV KVM and virtual
>>>> - *          mode on PR KVM
>>>
>>> What's wrong with the warning?
>>
>>
>> It belongs to kvmppc_h_put_tce() which is not called in virtual mode anymore.
> 
> I thought the comment applied to the whole file before? Hrm. Maybe I misread it then.
> 
>> It is technically correct for kvmppc_find_tce_table() though. Should I put
>> this comment before every function which may be called from real and
>> virtual modes?
> 
> Yes, please. Otherwise someone might stick an access to a non-linear address
> in there by accident.
> 
>>
>>
>>
>>>> +/*
>>>> + * Finds a TCE table descriptor by LIOBN
>>>>   */
>>>> +struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(struct kvm_vcpu *vcpu,
>>>> +        unsigned long liobn)
>>>> +{
>>>> +    struct kvmppc_spapr_tce_table *tt;
>>>> +
>>>> +    list_for_each_entry(tt,&vcpu->kvm->arch.spapr_tce_tables, list) {
>>>> +        if (tt->liobn == liobn)
>>>> +            return tt;
>>>> +    }
>>>> +
>>>> +    return NULL;
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(kvmppc_find_tce_table);
>>>> +
>>>> +#ifdef DEBUG
>>>> +/*
>>>> + * Lets user mode disable realmode handlers by putting big number
>>>> + * in the bottom value of LIOBN
>>>
>>> What? Seriously? Just don't enable the CAP.
>>
>>
>> It is under DEBUG. It really, really helps to be able to disable real mode
>> handlers without reboot. Ok, no debug code, I'll remove.
> 
> Debug code is good, but #ifdefs are bad. For you, an #ifdef reads like
> "code that doesn't do any hard when disabled". For me, #ifdefs read
> "code that definitely breaks because nobody turns the #define on".
> 
> So please, avoid #ifdef'ed code whenever possible. Switching the CAP on and
> off is a much better debug approach in this case.
> 
>>
>>
>>>> + */
>>>> +#define kvmppc_find_tce_table(a, b) \
>>>> +        ((((b)&0xffff)>10000)?NULL:kvmppc_find_tce_table((a), (b)))
>>>> +#endif
>>>> +
>>>> +/*
>>>> + * Validates TCE address.
>>>> + * At the moment only flags are validated as other checks will
>>>> significantly slow
>>>> + * down or can make it even impossible to handle TCE requests in real mode.
>>>
>>> What?
>>
>>
>> What is missing here (besides good english)?
> 
> What badness could slip through by not validating everything?


I cannot think of any good check which could be done in real mode and not
be "more than 2 calls deep" (c) Ben. Check that the page is allocated at
all? How? Don't know.



>>>> + */
>>>> +long kvmppc_emulated_validate_tce(unsigned long tce)
>>>
>>> I don't like the naming scheme. Please turn this around and make it
>>> kvmppc_tce_validate().
>>
>>
>> Oh. "Like"... Ok.
> 
> Yes. Like.
> 
>>
>>
>>>> +{
>>>> +    if (tce&  ~(IOMMU_PAGE_MASK | TCE_PCI_WRITE | TCE_PCI_READ))
>>>> +        return H_PARAMETER;
>>>> +
>>>> +    return H_SUCCESS;
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(kvmppc_emulated_validate_tce);
>>>> +
>>>> +/*
>>>> + * Handles TCE requests for QEMU emulated devices.
>>>
>>> We still don't mention QEMU in KVM code. And does it really matter whether
>>> they're emulated by QEMU? Devices could also be emulated by KVM.
>>>
>>>> + * Puts guest TCE values to the table and expects QEMU to convert them
>>>> + * later in a QEMU device implementation.
>>>> + * Called in both real and virtual modes.
>>>> + * Cannot fail so kvmppc_emulated_validate_tce must be called before it.
>>>> + */
>>>> +void kvmppc_emulated_put_tce(struct kvmppc_spapr_tce_table *tt,
>>>
>>> kvmppc_tce_put()
>>>
>>>> +        unsigned long ioba, unsigned long tce)
>>>> +{
>>>> +    unsigned long idx = ioba>>  SPAPR_TCE_SHIFT;
>>>> +    struct page *page;
>>>> +    u64 *tbl;
>>>> +
>>>> +    /*
>>>> +     * Note on the use of page_address() in real mode,
>>>> +     *
>>>> +     * It is safe to use page_address() in real mode on ppc64 because
>>>> +     * page_address() is always defined as lowmem_page_address()
>>>> +     * which returns __va(PFN_PHYS(page_to_pfn(page))) which is arithmetial
>>>> +     * operation and does not access page struct.
>>>> +     *
>>>> +     * Theoretically page_address() could be defined different
>>>> +     * but either WANT_PAGE_VIRTUAL or HASHED_PAGE_VIRTUAL
>>>> +     * should be enabled.
>>>> +     * WANT_PAGE_VIRTUAL is never enabled on ppc32/ppc64,
>>>> +     * HASHED_PAGE_VIRTUAL could be enabled for ppc32 only and only
>>>> +     * if CONFIG_HIGHMEM is defined. As CONFIG_SPARSEMEM_VMEMMAP
>>>> +     * is not expected to be enabled on ppc32, page_address()
>>>> +     * is safe for ppc32 as well.
>>>> +     */
>>>> +#if defined(HASHED_PAGE_VIRTUAL) || defined(WANT_PAGE_VIRTUAL)
>>>> +#error TODO: fix to avoid page_address() here
>>>> +#endif
>>>
>>> Can you extract the text above, the check and the page_address call into a
>>> simple wrapper function?
>>
>>
>> Is this function also too big? Sorry, I do not understand the comment.
> 
> All of the comment and #if here only deal with the fact that you
> have a real mode hack to call page_address() that happens
> to work under specific circumstances.
>
> There's nothing kvmppc_tce_put() specific about this.
> The page_address() code happens to get called here, sure.
> But if I read the kvmppc_tce_put() function I don't care about
> these details - I want to understand the code flow that ends
> up writing the TCE.
>
>>>> +    page = tt->pages[idx / TCES_PER_PAGE];
>>>> +    tbl = (u64 *)page_address(page);
>>>> +
>>>> +    /* udbg_printf("tce @ %p\n",&tbl[idx % TCES_PER_PAGE]); */
>>>
>>> This is not an RFC, is it?
>>
>>
>> Any debug code is prohibited? Ok, I'll remove.
> 
> Debug code that requires code changes is prohibited, yes.
> Debug code that is runtime switchable (pr_debug, trace points, etc)
> are allowed.


Is there any easy way to enable just this specific udbg_printf (not all of
them at once)? Trace points do not work in real mode as we figured out.


>>>> +    tbl[idx % TCES_PER_PAGE] = tce;
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(kvmppc_emulated_put_tce);
>>>> +
>>>> +#ifdef CONFIG_KVM_BOOK3S_64_HV
>>>> +/*
>>>> + * Converts guest physical address to host physical address.
>>>> + * Tries to increase page counter via realmode_get_page() and
>>>> + * returns ERROR_ADDR if failed.
>>>> + */
>>>> +static unsigned long kvmppc_rm_gpa_to_hpa_and_get(struct kvm_vcpu *vcpu,
>>>> +        unsigned long gpa, struct page **pg)
>>>> +{
>>>> +    struct kvm_memory_slot *memslot;
>>>> +    pte_t *ptep, pte;
>>>> +    unsigned long hva, hpa = ERROR_ADDR;
>>>> +    unsigned long gfn = gpa>>  PAGE_SHIFT;
>>>> +    unsigned shift = 0;
>>>> +
>>>> +    memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
>>>> +    if (!memslot)
>>>> +        return ERROR_ADDR;
>>>> +
>>>> +    hva = __gfn_to_hva_memslot(memslot, gfn);
>>>> +
>>>> +    ptep = find_linux_pte_or_hugepte(vcpu->arch.pgdir, hva,&shift);
>>>> +    if (!ptep || !pte_present(*ptep))
>>>> +        return ERROR_ADDR;
>>>> +    pte = *ptep;
>>>> +
>>>> +    if (((gpa&  TCE_PCI_WRITE) || pte_write(pte))&&  !pte_dirty(pte))
>>>> +        return ERROR_ADDR;
>>>> +
>>>> +    if (!pte_young(pte))
>>>> +        return ERROR_ADDR;
>>>> +
>>>> +    if (!shift)
>>>> +        shift = PAGE_SHIFT;
>>>> +
>>>> +    /* Put huge pages handling to the virtual mode */
>>>> +    if (shift>  PAGE_SHIFT)
>>>> +        return ERROR_ADDR;
>>>> +
>>>> +    *pg = realmode_pfn_to_page(pte_pfn(pte));
>>>> +    if (!*pg || realmode_get_page(*pg))
>>>> +        return ERROR_ADDR;
>>>> +
>>>> +    /* pte_pfn(pte) returns address aligned to pg_size */
>>>> +    hpa = (pte_pfn(pte)<<  PAGE_SHIFT) + (gpa&  ((1<<  shift) - 1));
>>>> +
>>>> +    if (unlikely(pte_val(pte) != pte_val(*ptep))) {
>>>> +        hpa = ERROR_ADDR;
>>>> +        realmode_put_page(*pg);
>>>> +        *pg = NULL;
>>>> +    }
>>>> +
>>>> +    return hpa;
>>>> +}
>>>> +
>>>>  long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>>>>                unsigned long ioba, unsigned long tce)
>>>>  {
>>>> -    struct kvm *kvm = vcpu->kvm;
>>>> -    struct kvmppc_spapr_tce_table *stt;
>>>> -
>>>> -    /* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
>>>> -    /*         liobn, ioba, tce); */
>>>> -
>>>> -    list_for_each_entry(stt,&kvm->arch.spapr_tce_tables, list) {
>>>> -        if (stt->liobn == liobn) {
>>>> -            unsigned long idx = ioba>>  SPAPR_TCE_SHIFT;
>>>> -            struct page *page;
>>>> -            u64 *tbl;
>>>> -
>>>> -            /* udbg_printf("H_PUT_TCE: liobn 0x%lx =>  stt=%p 
>>>> window_size=0x%x\n", */
>>>> -            /*         liobn, stt, stt->window_size); */
>>>> -            if (ioba>= stt->window_size)
>>>> -                return H_PARAMETER;
>>>> -
>>>> -            page = stt->pages[idx / TCES_PER_PAGE];
>>>> -            tbl = (u64 *)page_address(page);
>>>> -
>>>> -            /* FIXME: Need to validate the TCE itself */
>>>> -            /* udbg_printf("tce @ %p\n",&tbl[idx % TCES_PER_PAGE]); */
>>>> -            tbl[idx % TCES_PER_PAGE] = tce;
>>>> -            return H_SUCCESS;
>>>> -        }
>>>> +    long ret;
>>>> +    struct kvmppc_spapr_tce_table *tt = kvmppc_find_tce_table(vcpu, liobn);
>>>> +
>>>> +    if (!tt)
>>>> +        return H_TOO_HARD;
>>>> +
>>>> +    ++tt->stat.rm.put;
>>>> +
>>>> +    if (ioba>= tt->window_size)
>>>> +        return H_PARAMETER;
>>>> +
>>>> +    ret = kvmppc_emulated_validate_tce(tce);
>>>> +    if (!ret)
>>>> +        kvmppc_emulated_put_tce(tt, ioba, tce);
>>>> +
>>>> +    return ret;
>>>> +}
>>>> +
>>>> +long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>>
>>> So the _vm version is the normal one and this is the _rm version? If so,
>>> please mark it as such. Is there any way to generate both from the same
>>> source? The way it's now there is a lot of duplicate code.
>>
>>
>> I tried, looked very ugly. If you insist, I will do so.
> 

> If it looks ugly better don't. I just want to make sure you explored the option.
> But please keep the naming scheme consistent.


Removed _vm everywhere and put _rm in realmode handlers. I just was
confused by _vm in kvm_vm_ioctl_create_spapr_tce() at the first place.
Alexander Graf July 11, 2013, 10:11 a.m. UTC | #5
On 11.07.2013, at 07:12, Alexey Kardashevskiy wrote:

> On 07/10/2013 08:05 PM, Alexander Graf wrote:
>> 
>> On 10.07.2013, at 07:00, Alexey Kardashevskiy wrote:
>> 
>>> On 07/10/2013 03:02 AM, Alexander Graf wrote:
>>>> On 07/06/2013 05:07 PM, Alexey Kardashevskiy wrote:
>>>>> This adds real mode handlers for the H_PUT_TCE_INDIRECT and
>>>>> H_STUFF_TCE hypercalls for QEMU emulated devices such as IBMVIO
>>>>> devices or emulated PCI.  These calls allow adding multiple entries
>>>>> (up to 512) into the TCE table in one call which saves time on
>>>>> transition to/from real mode.
>>>> 
>>>> We don't mention QEMU explicitly in KVM code usually.
>>>> 
>>>>> This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs
>>>>> (copied from user and verified) before writing the whole list into
>>>>> the TCE table. This cache will be utilized more in the upcoming
>>>>> VFIO/IOMMU support to continue TCE list processing in the virtual
>>>>> mode in the case if the real mode handler failed for some reason.
>>>>> 
>>>>> This adds a guest physical to host real address converter
>>>>> and calls the existing H_PUT_TCE handler. The converting function
>>>>> is going to be fully utilized by upcoming VFIO supporting patches.
>>>>> 
>>>>> This also implements the KVM_CAP_PPC_MULTITCE capability,
>>>>> so in order to support the functionality of this patch, QEMU
>>>>> needs to query for this capability and set the "hcall-multi-tce"
>>>>> hypertas property only if the capability is present, otherwise
>>>>> there will be serious performance degradation.
>>>> 
>>>> Same as above. But really you're only giving recommendations here. What's
>>>> the point? Please describe what the benefit of this patch is, not what some
>>>> other random subsystem might do with the benefits it brings.
>>>> 
>>>>> 
>>>>> Signed-off-by: Paul Mackerras<paulus@samba.org>
>>>>> Signed-off-by: Alexey Kardashevskiy<aik@ozlabs.ru>
>>>>> 
>>>>> ---
>>>>> Changelog:
>>>>> 2013/07/06:
>>>>> * fixed number of wrong get_page()/put_page() calls
>>>>> 
>>>>> 2013/06/27:
>>>>> * fixed clear of BUSY bit in kvmppc_lookup_pte()
>>>>> * H_PUT_TCE_INDIRECT does realmode_get_page() now
>>>>> * KVM_CAP_SPAPR_MULTITCE now depends on CONFIG_PPC_BOOK3S_64
>>>>> * updated doc
>>>>> 
>>>>> 2013/06/05:
>>>>> * fixed mistype about IBMVIO in the commit message
>>>>> * updated doc and moved it to another section
>>>>> * changed capability number
>>>>> 
>>>>> 2013/05/21:
>>>>> * added kvm_vcpu_arch::tce_tmp
>>>>> * removed cleanup if put_indirect failed, instead we do not even start
>>>>> writing to TCE table if we cannot get TCEs from the user and they are
>>>>> invalid
>>>>> * kvmppc_emulated_h_put_tce is split to kvmppc_emulated_put_tce
>>>>> and kvmppc_emulated_validate_tce (for the previous item)
>>>>> * fixed bug with failthrough for H_IPI
>>>>> * removed all get_user() from real mode handlers
>>>>> * kvmppc_lookup_pte() added (instead of making lookup_linux_pte public)
>>>>> 
>>>>> Signed-off-by: Alexey Kardashevskiy<aik@ozlabs.ru>
>>>>> ---
>>>>> Documentation/virtual/kvm/api.txt       |  25 +++
>>>>> arch/powerpc/include/asm/kvm_host.h     |   9 ++
>>>>> arch/powerpc/include/asm/kvm_ppc.h      |  16 +-
>>>>> arch/powerpc/kvm/book3s_64_vio.c        | 154 ++++++++++++++++++-
>>>>> arch/powerpc/kvm/book3s_64_vio_hv.c     | 260
>>>>> ++++++++++++++++++++++++++++----
>>>>> arch/powerpc/kvm/book3s_hv.c            |  41 ++++-
>>>>> arch/powerpc/kvm/book3s_hv_rmhandlers.S |   6 +
>>>>> arch/powerpc/kvm/book3s_pr_papr.c       |  37 ++++-
>>>>> arch/powerpc/kvm/powerpc.c              |   3 +
>>>>> 9 files changed, 517 insertions(+), 34 deletions(-)
>>>>> 
>>>>> diff --git a/Documentation/virtual/kvm/api.txt
>>>>> b/Documentation/virtual/kvm/api.txt
>>>>> index 6365fef..762c703 100644
>>>>> --- a/Documentation/virtual/kvm/api.txt
>>>>> +++ b/Documentation/virtual/kvm/api.txt
>>>>> @@ -2362,6 +2362,31 @@ calls by the guest for that service will be passed
>>>>> to userspace to be
>>>>> handled.
>>>>> 
>>>>> 
>>>>> +4.86 KVM_CAP_PPC_MULTITCE
>>>>> +
>>>>> +Capability: KVM_CAP_PPC_MULTITCE
>>>>> +Architectures: ppc
>>>>> +Type: vm
>>>>> +
>>>>> +This capability means the kernel is capable of handling hypercalls
>>>>> +H_PUT_TCE_INDIRECT and H_STUFF_TCE without passing those into the user
>>>>> +space. This significanly accelerates DMA operations for PPC KVM guests.
>>>> 
>>>> significanly? Please run this through a spell checker.
>>>> 
>>>>> +The user space should expect that its handlers for these hypercalls
>>>> 
>>>> s/The//
>>>> 
>>>>> +are not going to be called.
>>>> 
>>>> Is user space guaranteed they will not be called? Or can it still happen?
>>> 
>>> ... if user space previously registered LIOBN in KVM (via
>>> KVM_CREATE_SPAPR_TCE or similar calls).
>>> 
>>> ok?
>> 
>> How about this?
>> 
>> The hypercalls mentioned above may or may not be processed successfully in the kernel based fast path. If they can not be handled by the kernel, they will get passed on to user space. So user space still has to have an implementation for these despite the in kernel acceleration.
>> 
>> ---
>> 
>> The target audience for this documentation is user space KVM API users. Someone developing kvm tool for example. They want to know implications specific CAPs have.
>> 
>>> 
>>> There is also KVM_CREATE_SPAPR_TCE_IOMMU but it is not in the kernel yet
>>> and may never get there.
>>> 
>>> 
>>>>> +In order to enable H_PUT_TCE_INDIRECT and H_STUFF_TCE use in the guest,
>>>>> +the user space might have to advertise it for the guest. For example,
>>>>> +IBM pSeries guest starts using them if "hcall-multi-tce" is present in
>>>>> +the "ibm,hypertas-functions" device-tree property.
>>>> 
>>>> This paragraph describes sPAPR. That's fine, but please document it as
>>>> such. Also please check your grammar.
>>> 
>>>>> +
>>>>> +Without this capability, only H_PUT_TCE is handled by the kernel and
>>>>> +therefore the use of H_PUT_TCE_INDIRECT and H_STUFF_TCE is not recommended
>>>>> +unless the capability is present as passing hypercalls to the userspace
>>>>> +slows operations a lot.
>>>>> +
>>>>> +Unlike other capabilities of this section, this one is always enabled.
>>>> 
>>>> Why? Wouldn't that confuse older user space?
>>> 
>>> 
>>> How? Old user space won't check for this capability and won't tell the
>>> guest to use it (via "hcall-multi-tce"). Old H_PUT_TCE is still there.
>>> 
>>> If the guest always uses H_PUT_TCE_INDIRECT/H_STUFF_TCE no matter what,
>>> then it is its problem - it won't work now anyway as neither QEMU nor host
>>> kernel supports these calls.
> 
> 
>> Always assume that you are a kernel developer without knowledge
>> of any user space code using your interfaces. So there is the theoretical
>> possibility that there is a user space client out there that implements
>> H_PUT_TCE_INDIRECT and advertises hcall-multi-tce to the guest. 
>> Would that client break? If so, we should definitely have
>> the CAP disabled by default.
> 
> 
> No, it won't break. Why would it break? I really do not get it. This user
> space client has to do an extra step to get this acceleration by calling
> ioctl(KVM_CREATE_SPAPR_TCE) anyway. Previously that ioctl only had effect
> on H_PUT_TCE, now on all three hcalls.

Hrm. It's a change of behavior, it probably wouldn't break, yes.

> 
> 
>> But really, it's also as much about consistency as anything else.
>> If we leave everything as is and always extend functionality
>> by enabling new CAPs, we're pretty much guaranteed that we
>> don't break anything by accident. It also makes debugging easier
>> because you can for example disable this particular feature
>> to see whether something has bad side effects.
> 
> 
> So I must add one more ioctl to enable MULTITCE in kernel handling. Is it
> what you are saying?
> 
> I can see KVM_CHECK_EXTENSION but I do not see KVM_ENABLE_EXTENSION or
> anything like that.

KVM_ENABLE_CAP. It's how we enable sPAPR capabilities too.

> 
> 
> 
>>>>> +
>>>>> +
>>>>> 5. The kvm_run structure
>>>>> ------------------------
>>>>> 
>>>>> diff --git a/arch/powerpc/include/asm/kvm_host.h
>>>>> b/arch/powerpc/include/asm/kvm_host.h
>>>>> index af326cd..20d04bd 100644
>>>>> --- a/arch/powerpc/include/asm/kvm_host.h
>>>>> +++ b/arch/powerpc/include/asm/kvm_host.h
>>>>> @@ -180,6 +180,7 @@ struct kvmppc_spapr_tce_table {
>>>>>     struct kvm *kvm;
>>>>>     u64 liobn;
>>>>>     u32 window_size;
>>>>> +    struct { struct { unsigned long put, indir, stuff; } rm, vm; } stat;
>>>> 
>>>> You don't need this.
>>>> 
>>>>>     struct page *pages[0];
>>>>> };
>>>>> 
>>>>> @@ -609,6 +610,14 @@ struct kvm_vcpu_arch {
>>>>>     spinlock_t tbacct_lock;
>>>>>     u64 busy_stolen;
>>>>>     u64 busy_preempt;
>>>>> +
>>>>> +    unsigned long *tce_tmp_hpas;    /* TCE cache for TCE_PUT_INDIRECT
>>>>> hcall */
>>>>> +    enum {
>>>>> +        TCERM_NONE,
>>>>> +        TCERM_GETPAGE,
>>>>> +        TCERM_PUTTCE,
>>>>> +        TCERM_PUTLIST,
>>>>> +    } tce_rm_fail;            /* failed stage of request processing */
>>>>> #endif
>>>>> };
>>>>> 
>>>>> diff --git a/arch/powerpc/include/asm/kvm_ppc.h
>>>>> b/arch/powerpc/include/asm/kvm_ppc.h
>>>>> index a5287fe..fa722a0 100644
>>>>> --- a/arch/powerpc/include/asm/kvm_ppc.h
>>>>> +++ b/arch/powerpc/include/asm/kvm_ppc.h
>>>>> @@ -133,8 +133,20 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu
>>>>> *vcpu);
>>>>> 
>>>>> extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>>>>>                 struct kvm_create_spapr_tce *args);
>>>>> -extern long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>>>>> -                 unsigned long ioba, unsigned long tce);
>>>>> +extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(
>>>>> +        struct kvm_vcpu *vcpu, unsigned long liobn);
>>>>> +extern long kvmppc_emulated_validate_tce(unsigned long tce);
>>>>> +extern void kvmppc_emulated_put_tce(struct kvmppc_spapr_tce_table *tt,
>>>>> +        unsigned long ioba, unsigned long tce);
>>>>> +extern long kvmppc_vm_h_put_tce(struct kvm_vcpu *vcpu,
>>>>> +        unsigned long liobn, unsigned long ioba,
>>>>> +        unsigned long tce);
>>>>> +extern long kvmppc_vm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>>>> +        unsigned long liobn, unsigned long ioba,
>>>>> +        unsigned long tce_list, unsigned long npages);
>>>>> +extern long kvmppc_vm_h_stuff_tce(struct kvm_vcpu *vcpu,
>>>>> +        unsigned long liobn, unsigned long ioba,
>>>>> +        unsigned long tce_value, unsigned long npages);
>>>>> extern long kvm_vm_ioctl_allocate_rma(struct kvm *kvm,
>>>>>                 struct kvm_allocate_rma *rma);
>>>>> extern struct kvmppc_linear_info *kvm_alloc_rma(void);
>>>>> diff --git a/arch/powerpc/kvm/book3s_64_vio.c
>>>>> b/arch/powerpc/kvm/book3s_64_vio.c
>>>>> index b2d3f3b..99bf4e5 100644
>>>>> --- a/arch/powerpc/kvm/book3s_64_vio.c
>>>>> +++ b/arch/powerpc/kvm/book3s_64_vio.c
>>>>> @@ -14,6 +14,7 @@
>>>>>  *
>>>>>  * Copyright 2010 Paul Mackerras, IBM Corp.<paulus@au1.ibm.com>
>>>>>  * Copyright 2011 David Gibson, IBM Corporation<dwg@au1.ibm.com>
>>>>> + * Copyright 2013 Alexey Kardashevskiy, IBM Corporation<aik@au1.ibm.com>
>>>>>  */
>>>>> 
>>>>> #include<linux/types.h>
>>>>> @@ -36,8 +37,10 @@
>>>>> #include<asm/ppc-opcode.h>
>>>>> #include<asm/kvm_host.h>
>>>>> #include<asm/udbg.h>
>>>>> +#include<asm/iommu.h>
>>>>> +#include<asm/tce.h>
>>>>> 
>>>>> -#define TCES_PER_PAGE    (PAGE_SIZE / sizeof(u64))
>>>>> +#define ERROR_ADDR      ((void *)~(unsigned long)0x0)
>>>>> 
>>>>> static long kvmppc_stt_npages(unsigned long window_size)
>>>>> {
>>>>> @@ -50,6 +53,20 @@ static void release_spapr_tce_table(struct
>>>>> kvmppc_spapr_tce_table *stt)
>>>>>     struct kvm *kvm = stt->kvm;
>>>>>     int i;
>>>>> 
>>>>> +#define __SV(x) stt->stat.x
>>>>> +#define __SVD(x) (__SV(rm.x)?(__SV(rm.x)-__SV(vm.x)):0)
>>>>> +    pr_debug("%s stat for liobn=%llx\n"
>>>>> +            "--------------- realmode ----- virtmode ---\n"
>>>>> +            "put_tce       %10ld     %10ld\n"
>>>>> +            "put_tce_indir %10ld     %10ld\n"
>>>>> +            "stuff_tce     %10ld     %10ld\n",
>>>>> +            __func__, stt->liobn,
>>>>> +            __SVD(put), __SV(vm.put),
>>>>> +            __SVD(indir), __SV(vm.indir),
>>>>> +            __SVD(stuff), __SV(vm.stuff));
>>>>> +#undef __SVD
>>>>> +#undef __SV
>>>> 
>>>> All of these stat points should just be trace points. You can do the
>>>> statistic gathering from user space then.
>>>> 
>>>>> +
>>>>>     mutex_lock(&kvm->lock);
>>>>>     list_del(&stt->list);
>>>>>     for (i = 0; i<  kvmppc_stt_npages(stt->window_size); i++)
>>>>> @@ -148,3 +165,138 @@ fail:
>>>>>     }
>>>>>     return ret;
>>>>> }
>>>>> +
>>>>> +/* Converts guest physical address to host virtual address */
>>>>> +static void __user *kvmppc_vm_gpa_to_hva_and_get(struct kvm_vcpu *vcpu,
>>>> 
>>>> Please don't distinguish _vm versions. They're the normal case. _rm ones
>>>> are the special ones.
>>>> 
>>>>> +        unsigned long gpa, struct page **pg)
>>>>> +{
>>>>> +    unsigned long hva, gfn = gpa>>  PAGE_SHIFT;
>>>>> +    struct kvm_memory_slot *memslot;
>>>>> +
>>>>> +    memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
>>>>> +    if (!memslot)
>>>>> +        return ERROR_ADDR;
>>>>> +
>>>>> +    hva = __gfn_to_hva_memslot(memslot, gfn) + (gpa&  ~PAGE_MASK);
>>>> 
>>>> s/+/|/
>>>> 
>>>>> +
>>>>> +    if (get_user_pages_fast(hva&  PAGE_MASK, 1, 0, pg) != 1)
>>>>> +        return ERROR_ADDR;
>>>>> +
>>>>> +    return (void *) hva;
>>>>> +}
>>>>> +
>>>>> +long kvmppc_vm_h_put_tce(struct kvm_vcpu *vcpu,
>>>>> +        unsigned long liobn, unsigned long ioba,
>>>>> +        unsigned long tce)
>>>>> +{
>>>>> +    long ret;
>>>>> +    struct kvmppc_spapr_tce_table *tt;
>>>>> +
>>>>> +    tt = kvmppc_find_tce_table(vcpu, liobn);
>>>>> +    /* Didn't find the liobn, put it to userspace */
>>>> 
>>>> Unclear comment.
>>> 
>>> 
>>> What detail is missing?
>> 
> 
>> Grammar wise "it" in the second half of the sentence refers to liobn.
>> So you "put" the "liobn to userspace". That sentence doesn't
>> make any sense.
> 
> 
> Removed it. H_TOO_HARD itself says enough already.
> 
> 
>> What you really want to say is:
>> 
>>  /* Couldn't find the liobn. Something went wrong. Let user space handle the hypercall. That has better ways of dealing with errors. */
>> 
>>> 
>>> 
>>>>> +    if (!tt)
>>>>> +        return H_TOO_HARD;
>>>>> +
>>>>> +    ++tt->stat.vm.put;
>>>>> +
>>>>> +    if (ioba>= tt->window_size)
>>>>> +        return H_PARAMETER;
>>>>> +
>>>>> +    ret = kvmppc_emulated_validate_tce(tce);
>>>>> +    if (ret)
>>>>> +        return ret;
>>>>> +
>>>>> +    kvmppc_emulated_put_tce(tt, ioba, tce);
>>>>> +
>>>>> +    return H_SUCCESS;
>>>>> +}
>>>>> +
>>>>> +long kvmppc_vm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>>>> +        unsigned long liobn, unsigned long ioba,
>>>>> +        unsigned long tce_list, unsigned long npages)
>>>>> +{
>>>>> +    struct kvmppc_spapr_tce_table *tt;
>>>>> +    long i, ret = H_SUCCESS;
>>>>> +    unsigned long __user *tces;
>>>>> +    struct page *pg = NULL;
>>>>> +
>>>>> +    tt = kvmppc_find_tce_table(vcpu, liobn);
>>>>> +    /* Didn't find the liobn, put it to userspace */
>>>>> +    if (!tt)
>>>>> +        return H_TOO_HARD;
>>>>> +
>>>>> +    ++tt->stat.vm.indir;
>>>>> +
>>>>> +    /*
>>>>> +     * The spec says that the maximum size of the list is 512 TCEs so
>>>>> +     * so the whole table addressed resides in 4K page
>>>> 
>>>> so so?
>>>> 
>>>>> +     */
>>>>> +    if (npages>  512)
>>>>> +        return H_PARAMETER;
>>>>> +
>>>>> +    if (tce_list&  ~IOMMU_PAGE_MASK)
>>>>> +        return H_PARAMETER;
>>>>> +
>>>>> +    if ((ioba + (npages<<  IOMMU_PAGE_SHIFT))>  tt->window_size)
>>>>> +        return H_PARAMETER;
>>>>> +
>>>>> +    tces = kvmppc_vm_gpa_to_hva_and_get(vcpu, tce_list,&pg);
>>>>> +    if (tces == ERROR_ADDR)
>>>>> +        return H_TOO_HARD;
>>>>> +
>>>>> +    if (vcpu->arch.tce_rm_fail == TCERM_PUTLIST)
>>>>> +        goto put_list_page_exit;
>>>>> +
>>>>> +    for (i = 0; i<  npages; ++i) {
>>>>> +        if (get_user(vcpu->arch.tce_tmp_hpas[i], tces + i)) {
>>>>> +            ret = H_PARAMETER;
>>>>> +            goto put_list_page_exit;
>>>>> +        }
>>>>> +
>>>>> +        ret = kvmppc_emulated_validate_tce(vcpu->arch.tce_tmp_hpas[i]);
>>>>> +        if (ret)
>>>>> +            goto put_list_page_exit;
>>>>> +    }
>>>>> +
>>>>> +    for (i = 0; i<  npages; ++i)
>>>>> +        kvmppc_emulated_put_tce(tt, ioba + (i<<  IOMMU_PAGE_SHIFT),
>>>>> +                vcpu->arch.tce_tmp_hpas[i]);
>>>>> +put_list_page_exit:
>>>>> +    if (pg)
>>>>> +        put_page(pg);
>>>>> +
>>>>> +    if (vcpu->arch.tce_rm_fail != TCERM_NONE) {
>>>>> +        vcpu->arch.tce_rm_fail = TCERM_NONE;
>>>>> +        if (pg&&  !PageCompound(pg))
>>>>> +            put_page(pg); /* finish pending realmode_put_page() */
>>>>> +    }
>>>>> +
>>>>> +    return ret;
>>>>> +}
>>>>> +
>>>>> +long kvmppc_vm_h_stuff_tce(struct kvm_vcpu *vcpu,
>>>>> +        unsigned long liobn, unsigned long ioba,
>>>>> +        unsigned long tce_value, unsigned long npages)
>>>>> +{
>>>>> +    struct kvmppc_spapr_tce_table *tt;
>>>>> +    long i, ret;
>>>>> +
>>>>> +    tt = kvmppc_find_tce_table(vcpu, liobn);
>>>>> +    /* Didn't find the liobn, put it to userspace */
>>>>> +    if (!tt)
>>>>> +        return H_TOO_HARD;
>>>>> +
>>>>> +    ++tt->stat.vm.stuff;
>>>>> +
>>>>> +    if ((ioba + (npages<<  IOMMU_PAGE_SHIFT))>  tt->window_size)
>>>>> +        return H_PARAMETER;
>>>>> +
>>>>> +    ret = kvmppc_emulated_validate_tce(tce_value);
>>>>> +    if (ret || (tce_value&  (TCE_PCI_WRITE | TCE_PCI_READ)))
>>>>> +        return H_PARAMETER;
>>>>> +
>>>>> +    for (i = 0; i<  npages; ++i, ioba += IOMMU_PAGE_SIZE)
>>>>> +        kvmppc_emulated_put_tce(tt, ioba, tce_value);
>>>>> +
>>>>> +    return H_SUCCESS;
>>>>> +}
>>>>> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c
>>>>> b/arch/powerpc/kvm/book3s_64_vio_hv.c
>>>>> index 30c2f3b..cd3e6f9 100644
>>>>> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
>>>>> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
>>>>> @@ -14,6 +14,7 @@
>>>>>  *
>>>>>  * Copyright 2010 Paul Mackerras, IBM Corp.<paulus@au1.ibm.com>
>>>>>  * Copyright 2011 David Gibson, IBM Corporation<dwg@au1.ibm.com>
>>>>> + * Copyright 2013 Alexey Kardashevskiy, IBM Corporation<aik@au1.ibm.com>
>>>>>  */
>>>>> 
>>>>> #include<linux/types.h>
>>>>> @@ -35,42 +36,243 @@
>>>>> #include<asm/ppc-opcode.h>
>>>>> #include<asm/kvm_host.h>
>>>>> #include<asm/udbg.h>
>>>>> +#include<asm/iommu.h>
>>>>> +#include<asm/tce.h>
>>>>> 
>>>>> #define TCES_PER_PAGE    (PAGE_SIZE / sizeof(u64))
>>>>> +#define ERROR_ADDR      (~(unsigned long)0x0)
>>>>> 
>>>>> -/* WARNING: This will be called in real-mode on HV KVM and virtual
>>>>> - *          mode on PR KVM
>>>> 
>>>> What's wrong with the warning?
>>> 
>>> 
>>> It belongs to kvmppc_h_put_tce() which is not called in virtual mode anymore.
>> 
>> I thought the comment applied to the whole file before? Hrm. Maybe I misread it then.
>> 
>>> It is technically correct for kvmppc_find_tce_table() though. Should I put
>>> this comment before every function which may be called from real and
>>> virtual modes?
>> 
>> Yes, please. Otherwise someone might stick an access to a non-linear address
>> in there by accident.
>> 
>>> 
>>> 
>>> 
>>>>> +/*
>>>>> + * Finds a TCE table descriptor by LIOBN
>>>>>  */
>>>>> +struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(struct kvm_vcpu *vcpu,
>>>>> +        unsigned long liobn)
>>>>> +{
>>>>> +    struct kvmppc_spapr_tce_table *tt;
>>>>> +
>>>>> +    list_for_each_entry(tt,&vcpu->kvm->arch.spapr_tce_tables, list) {
>>>>> +        if (tt->liobn == liobn)
>>>>> +            return tt;
>>>>> +    }
>>>>> +
>>>>> +    return NULL;
>>>>> +}
>>>>> +EXPORT_SYMBOL_GPL(kvmppc_find_tce_table);
>>>>> +
>>>>> +#ifdef DEBUG
>>>>> +/*
>>>>> + * Lets user mode disable realmode handlers by putting big number
>>>>> + * in the bottom value of LIOBN
>>>> 
>>>> What? Seriously? Just don't enable the CAP.
>>> 
>>> 
>>> It is under DEBUG. It really, really helps to be able to disable real mode
>>> handlers without reboot. Ok, no debug code, I'll remove.
>> 
>> Debug code is good, but #ifdefs are bad. For you, an #ifdef reads like
>> "code that doesn't do any hard when disabled". For me, #ifdefs read
>> "code that definitely breaks because nobody turns the #define on".
>> 
>> So please, avoid #ifdef'ed code whenever possible. Switching the CAP on and
>> off is a much better debug approach in this case.
>> 
>>> 
>>> 
>>>>> + */
>>>>> +#define kvmppc_find_tce_table(a, b) \
>>>>> +        ((((b)&0xffff)>10000)?NULL:kvmppc_find_tce_table((a), (b)))
>>>>> +#endif
>>>>> +
>>>>> +/*
>>>>> + * Validates TCE address.
>>>>> + * At the moment only flags are validated as other checks will
>>>>> significantly slow
>>>>> + * down or can make it even impossible to handle TCE requests in real mode.
>>>> 
>>>> What?
>>> 
>>> 
>>> What is missing here (besides good english)?
>> 
>> What badness could slip through by not validating everything?
> 
> 
> I cannot think of any good check which could be done in real mode and not
> be "more than 2 calls deep" (c) Ben. Check that the page is allocated at
> all? How? Don't know.

If you say that our validation doesn't validate everything, that makes me really weary. Could the guest use it to maliciously inject anything? Could a missing check make our code go berserk?

What checks exactly would you do in addition when this was virtual mode?

> 
> 
> 
>>>>> + */
>>>>> +long kvmppc_emulated_validate_tce(unsigned long tce)
>>>> 
>>>> I don't like the naming scheme. Please turn this around and make it
>>>> kvmppc_tce_validate().
>>> 
>>> 
>>> Oh. "Like"... Ok.
>> 
>> Yes. Like.
>> 
>>> 
>>> 
>>>>> +{
>>>>> +    if (tce&  ~(IOMMU_PAGE_MASK | TCE_PCI_WRITE | TCE_PCI_READ))
>>>>> +        return H_PARAMETER;
>>>>> +
>>>>> +    return H_SUCCESS;
>>>>> +}
>>>>> +EXPORT_SYMBOL_GPL(kvmppc_emulated_validate_tce);
>>>>> +
>>>>> +/*
>>>>> + * Handles TCE requests for QEMU emulated devices.
>>>> 
>>>> We still don't mention QEMU in KVM code. And does it really matter whether
>>>> they're emulated by QEMU? Devices could also be emulated by KVM.
>>>> 
>>>>> + * Puts guest TCE values to the table and expects QEMU to convert them
>>>>> + * later in a QEMU device implementation.
>>>>> + * Called in both real and virtual modes.
>>>>> + * Cannot fail so kvmppc_emulated_validate_tce must be called before it.
>>>>> + */
>>>>> +void kvmppc_emulated_put_tce(struct kvmppc_spapr_tce_table *tt,
>>>> 
>>>> kvmppc_tce_put()
>>>> 
>>>>> +        unsigned long ioba, unsigned long tce)
>>>>> +{
>>>>> +    unsigned long idx = ioba>>  SPAPR_TCE_SHIFT;
>>>>> +    struct page *page;
>>>>> +    u64 *tbl;
>>>>> +
>>>>> +    /*
>>>>> +     * Note on the use of page_address() in real mode,
>>>>> +     *
>>>>> +     * It is safe to use page_address() in real mode on ppc64 because
>>>>> +     * page_address() is always defined as lowmem_page_address()
>>>>> +     * which returns __va(PFN_PHYS(page_to_pfn(page))) which is arithmetial
>>>>> +     * operation and does not access page struct.
>>>>> +     *
>>>>> +     * Theoretically page_address() could be defined different
>>>>> +     * but either WANT_PAGE_VIRTUAL or HASHED_PAGE_VIRTUAL
>>>>> +     * should be enabled.
>>>>> +     * WANT_PAGE_VIRTUAL is never enabled on ppc32/ppc64,
>>>>> +     * HASHED_PAGE_VIRTUAL could be enabled for ppc32 only and only
>>>>> +     * if CONFIG_HIGHMEM is defined. As CONFIG_SPARSEMEM_VMEMMAP
>>>>> +     * is not expected to be enabled on ppc32, page_address()
>>>>> +     * is safe for ppc32 as well.
>>>>> +     */
>>>>> +#if defined(HASHED_PAGE_VIRTUAL) || defined(WANT_PAGE_VIRTUAL)
>>>>> +#error TODO: fix to avoid page_address() here
>>>>> +#endif
>>>> 
>>>> Can you extract the text above, the check and the page_address call into a
>>>> simple wrapper function?
>>> 
>>> 
>>> Is this function also too big? Sorry, I do not understand the comment.
>> 
>> All of the comment and #if here only deal with the fact that you
>> have a real mode hack to call page_address() that happens
>> to work under specific circumstances.
>> 
>> There's nothing kvmppc_tce_put() specific about this.
>> The page_address() code happens to get called here, sure.
>> But if I read the kvmppc_tce_put() function I don't care about
>> these details - I want to understand the code flow that ends
>> up writing the TCE.
>> 
>>>>> +    page = tt->pages[idx / TCES_PER_PAGE];
>>>>> +    tbl = (u64 *)page_address(page);
>>>>> +
>>>>> +    /* udbg_printf("tce @ %p\n",&tbl[idx % TCES_PER_PAGE]); */
>>>> 
>>>> This is not an RFC, is it?
>>> 
>>> 
>>> Any debug code is prohibited? Ok, I'll remove.
>> 
>> Debug code that requires code changes is prohibited, yes.
>> Debug code that is runtime switchable (pr_debug, trace points, etc)
>> are allowed.
> 
> 
> Is there any easy way to enable just this specific udbg_printf (not all of
> them at once)? Trace points do not work in real mode as we figured out.

You can enable pr_debug by file IIRC.

> 
> 
>>>>> +    tbl[idx % TCES_PER_PAGE] = tce;
>>>>> +}
>>>>> +EXPORT_SYMBOL_GPL(kvmppc_emulated_put_tce);
>>>>> +
>>>>> +#ifdef CONFIG_KVM_BOOK3S_64_HV
>>>>> +/*
>>>>> + * Converts guest physical address to host physical address.
>>>>> + * Tries to increase page counter via realmode_get_page() and
>>>>> + * returns ERROR_ADDR if failed.
>>>>> + */
>>>>> +static unsigned long kvmppc_rm_gpa_to_hpa_and_get(struct kvm_vcpu *vcpu,
>>>>> +        unsigned long gpa, struct page **pg)
>>>>> +{
>>>>> +    struct kvm_memory_slot *memslot;
>>>>> +    pte_t *ptep, pte;
>>>>> +    unsigned long hva, hpa = ERROR_ADDR;
>>>>> +    unsigned long gfn = gpa>>  PAGE_SHIFT;
>>>>> +    unsigned shift = 0;
>>>>> +
>>>>> +    memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
>>>>> +    if (!memslot)
>>>>> +        return ERROR_ADDR;
>>>>> +
>>>>> +    hva = __gfn_to_hva_memslot(memslot, gfn);
>>>>> +
>>>>> +    ptep = find_linux_pte_or_hugepte(vcpu->arch.pgdir, hva,&shift);
>>>>> +    if (!ptep || !pte_present(*ptep))
>>>>> +        return ERROR_ADDR;
>>>>> +    pte = *ptep;
>>>>> +
>>>>> +    if (((gpa&  TCE_PCI_WRITE) || pte_write(pte))&&  !pte_dirty(pte))
>>>>> +        return ERROR_ADDR;
>>>>> +
>>>>> +    if (!pte_young(pte))
>>>>> +        return ERROR_ADDR;
>>>>> +
>>>>> +    if (!shift)
>>>>> +        shift = PAGE_SHIFT;
>>>>> +
>>>>> +    /* Put huge pages handling to the virtual mode */
>>>>> +    if (shift>  PAGE_SHIFT)
>>>>> +        return ERROR_ADDR;
>>>>> +
>>>>> +    *pg = realmode_pfn_to_page(pte_pfn(pte));
>>>>> +    if (!*pg || realmode_get_page(*pg))
>>>>> +        return ERROR_ADDR;
>>>>> +
>>>>> +    /* pte_pfn(pte) returns address aligned to pg_size */
>>>>> +    hpa = (pte_pfn(pte)<<  PAGE_SHIFT) + (gpa&  ((1<<  shift) - 1));
>>>>> +
>>>>> +    if (unlikely(pte_val(pte) != pte_val(*ptep))) {
>>>>> +        hpa = ERROR_ADDR;
>>>>> +        realmode_put_page(*pg);
>>>>> +        *pg = NULL;
>>>>> +    }
>>>>> +
>>>>> +    return hpa;
>>>>> +}
>>>>> +
>>>>> long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>>>>>               unsigned long ioba, unsigned long tce)
>>>>> {
>>>>> -    struct kvm *kvm = vcpu->kvm;
>>>>> -    struct kvmppc_spapr_tce_table *stt;
>>>>> -
>>>>> -    /* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
>>>>> -    /*         liobn, ioba, tce); */
>>>>> -
>>>>> -    list_for_each_entry(stt,&kvm->arch.spapr_tce_tables, list) {
>>>>> -        if (stt->liobn == liobn) {
>>>>> -            unsigned long idx = ioba>>  SPAPR_TCE_SHIFT;
>>>>> -            struct page *page;
>>>>> -            u64 *tbl;
>>>>> -
>>>>> -            /* udbg_printf("H_PUT_TCE: liobn 0x%lx =>  stt=%p 
>>>>> window_size=0x%x\n", */
>>>>> -            /*         liobn, stt, stt->window_size); */
>>>>> -            if (ioba>= stt->window_size)
>>>>> -                return H_PARAMETER;
>>>>> -
>>>>> -            page = stt->pages[idx / TCES_PER_PAGE];
>>>>> -            tbl = (u64 *)page_address(page);
>>>>> -
>>>>> -            /* FIXME: Need to validate the TCE itself */
>>>>> -            /* udbg_printf("tce @ %p\n",&tbl[idx % TCES_PER_PAGE]); */
>>>>> -            tbl[idx % TCES_PER_PAGE] = tce;
>>>>> -            return H_SUCCESS;
>>>>> -        }
>>>>> +    long ret;
>>>>> +    struct kvmppc_spapr_tce_table *tt = kvmppc_find_tce_table(vcpu, liobn);
>>>>> +
>>>>> +    if (!tt)
>>>>> +        return H_TOO_HARD;
>>>>> +
>>>>> +    ++tt->stat.rm.put;
>>>>> +
>>>>> +    if (ioba>= tt->window_size)
>>>>> +        return H_PARAMETER;
>>>>> +
>>>>> +    ret = kvmppc_emulated_validate_tce(tce);
>>>>> +    if (!ret)
>>>>> +        kvmppc_emulated_put_tce(tt, ioba, tce);
>>>>> +
>>>>> +    return ret;
>>>>> +}
>>>>> +
>>>>> +long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>>> 
>>>> So the _vm version is the normal one and this is the _rm version? If so,
>>>> please mark it as such. Is there any way to generate both from the same
>>>> source? The way it's now there is a lot of duplicate code.
>>> 
>>> 
>>> I tried, looked very ugly. If you insist, I will do so.
>> 
> 
>> If it looks ugly better don't. I just want to make sure you explored the option.
>> But please keep the naming scheme consistent.
> 
> 
> Removed _vm everywhere and put _rm in realmode handlers. I just was
> confused by _vm in kvm_vm_ioctl_create_spapr_tce() at the first place.

That vm refers to the virtual machine. It's on VM scope, not VCPU scope.


Alex

> 
> 
> -- 
> Alexey
Alexey Kardashevskiy July 11, 2013, 10:54 a.m. UTC | #6
On 07/11/2013 08:11 PM, Alexander Graf wrote:
> 
> On 11.07.2013, at 07:12, Alexey Kardashevskiy wrote:
> 
>> On 07/10/2013 08:05 PM, Alexander Graf wrote:
>>>
>>> On 10.07.2013, at 07:00, Alexey Kardashevskiy wrote:
>>>
>>>> On 07/10/2013 03:02 AM, Alexander Graf wrote:
>>>>> On 07/06/2013 05:07 PM, Alexey Kardashevskiy wrote:
>>>>>> This adds real mode handlers for the H_PUT_TCE_INDIRECT and
>>>>>> H_STUFF_TCE hypercalls for QEMU emulated devices such as IBMVIO
>>>>>> devices or emulated PCI.  These calls allow adding multiple entries
>>>>>> (up to 512) into the TCE table in one call which saves time on
>>>>>> transition to/from real mode.
>>>>>
>>>>> We don't mention QEMU explicitly in KVM code usually.
>>>>>
>>>>>> This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs
>>>>>> (copied from user and verified) before writing the whole list into
>>>>>> the TCE table. This cache will be utilized more in the upcoming
>>>>>> VFIO/IOMMU support to continue TCE list processing in the virtual
>>>>>> mode in the case if the real mode handler failed for some reason.
>>>>>>
>>>>>> This adds a guest physical to host real address converter
>>>>>> and calls the existing H_PUT_TCE handler. The converting function
>>>>>> is going to be fully utilized by upcoming VFIO supporting patches.
>>>>>>
>>>>>> This also implements the KVM_CAP_PPC_MULTITCE capability,
>>>>>> so in order to support the functionality of this patch, QEMU
>>>>>> needs to query for this capability and set the "hcall-multi-tce"
>>>>>> hypertas property only if the capability is present, otherwise
>>>>>> there will be serious performance degradation.
>>>>>
>>>>> Same as above. But really you're only giving recommendations here. What's
>>>>> the point? Please describe what the benefit of this patch is, not what some
>>>>> other random subsystem might do with the benefits it brings.
>>>>>
>>>>>>
>>>>>> Signed-off-by: Paul Mackerras<paulus@samba.org>
>>>>>> Signed-off-by: Alexey Kardashevskiy<aik@ozlabs.ru>
>>>>>>
>>>>>> ---
>>>>>> Changelog:
>>>>>> 2013/07/06:
>>>>>> * fixed number of wrong get_page()/put_page() calls
>>>>>>
>>>>>> 2013/06/27:
>>>>>> * fixed clear of BUSY bit in kvmppc_lookup_pte()
>>>>>> * H_PUT_TCE_INDIRECT does realmode_get_page() now
>>>>>> * KVM_CAP_SPAPR_MULTITCE now depends on CONFIG_PPC_BOOK3S_64
>>>>>> * updated doc
>>>>>>
>>>>>> 2013/06/05:
>>>>>> * fixed mistype about IBMVIO in the commit message
>>>>>> * updated doc and moved it to another section
>>>>>> * changed capability number
>>>>>>
>>>>>> 2013/05/21:
>>>>>> * added kvm_vcpu_arch::tce_tmp
>>>>>> * removed cleanup if put_indirect failed, instead we do not even start
>>>>>> writing to TCE table if we cannot get TCEs from the user and they are
>>>>>> invalid
>>>>>> * kvmppc_emulated_h_put_tce is split to kvmppc_emulated_put_tce
>>>>>> and kvmppc_emulated_validate_tce (for the previous item)
>>>>>> * fixed bug with failthrough for H_IPI
>>>>>> * removed all get_user() from real mode handlers
>>>>>> * kvmppc_lookup_pte() added (instead of making lookup_linux_pte public)
>>>>>>
>>>>>> Signed-off-by: Alexey Kardashevskiy<aik@ozlabs.ru>
>>>>>> ---
>>>>>> Documentation/virtual/kvm/api.txt       |  25 +++
>>>>>> arch/powerpc/include/asm/kvm_host.h     |   9 ++
>>>>>> arch/powerpc/include/asm/kvm_ppc.h      |  16 +-
>>>>>> arch/powerpc/kvm/book3s_64_vio.c        | 154 ++++++++++++++++++-
>>>>>> arch/powerpc/kvm/book3s_64_vio_hv.c     | 260
>>>>>> ++++++++++++++++++++++++++++----
>>>>>> arch/powerpc/kvm/book3s_hv.c            |  41 ++++-
>>>>>> arch/powerpc/kvm/book3s_hv_rmhandlers.S |   6 +
>>>>>> arch/powerpc/kvm/book3s_pr_papr.c       |  37 ++++-
>>>>>> arch/powerpc/kvm/powerpc.c              |   3 +
>>>>>> 9 files changed, 517 insertions(+), 34 deletions(-)
>>>>>>
>>>>>> diff --git a/Documentation/virtual/kvm/api.txt
>>>>>> b/Documentation/virtual/kvm/api.txt
>>>>>> index 6365fef..762c703 100644
>>>>>> --- a/Documentation/virtual/kvm/api.txt
>>>>>> +++ b/Documentation/virtual/kvm/api.txt
>>>>>> @@ -2362,6 +2362,31 @@ calls by the guest for that service will be passed
>>>>>> to userspace to be
>>>>>> handled.
>>>>>>
>>>>>>
>>>>>> +4.86 KVM_CAP_PPC_MULTITCE
>>>>>> +
>>>>>> +Capability: KVM_CAP_PPC_MULTITCE
>>>>>> +Architectures: ppc
>>>>>> +Type: vm
>>>>>> +
>>>>>> +This capability means the kernel is capable of handling hypercalls
>>>>>> +H_PUT_TCE_INDIRECT and H_STUFF_TCE without passing those into the user
>>>>>> +space. This significanly accelerates DMA operations for PPC KVM guests.
>>>>>
>>>>> significanly? Please run this through a spell checker.
>>>>>
>>>>>> +The user space should expect that its handlers for these hypercalls
>>>>>
>>>>> s/The//
>>>>>
>>>>>> +are not going to be called.
>>>>>
>>>>> Is user space guaranteed they will not be called? Or can it still happen?
>>>>
>>>> ... if user space previously registered LIOBN in KVM (via
>>>> KVM_CREATE_SPAPR_TCE or similar calls).
>>>>
>>>> ok?
>>>
>>> How about this?
>>>
>>> The hypercalls mentioned above may or may not be processed successfully in the kernel based fast path. If they can not be handled by the kernel, they will get passed on to user space. So user space still has to have an implementation for these despite the in kernel acceleration.
>>>
>>> ---
>>>
>>> The target audience for this documentation is user space KVM API users. Someone developing kvm tool for example. They want to know implications specific CAPs have.
>>>
>>>>
>>>> There is also KVM_CREATE_SPAPR_TCE_IOMMU but it is not in the kernel yet
>>>> and may never get there.
>>>>
>>>>
>>>>>> +In order to enable H_PUT_TCE_INDIRECT and H_STUFF_TCE use in the guest,
>>>>>> +the user space might have to advertise it for the guest. For example,
>>>>>> +IBM pSeries guest starts using them if "hcall-multi-tce" is present in
>>>>>> +the "ibm,hypertas-functions" device-tree property.
>>>>>
>>>>> This paragraph describes sPAPR. That's fine, but please document it as
>>>>> such. Also please check your grammar.
>>>>
>>>>>> +
>>>>>> +Without this capability, only H_PUT_TCE is handled by the kernel and
>>>>>> +therefore the use of H_PUT_TCE_INDIRECT and H_STUFF_TCE is not recommended
>>>>>> +unless the capability is present as passing hypercalls to the userspace
>>>>>> +slows operations a lot.
>>>>>> +
>>>>>> +Unlike other capabilities of this section, this one is always enabled.
>>>>>
>>>>> Why? Wouldn't that confuse older user space?
>>>>
>>>>
>>>> How? Old user space won't check for this capability and won't tell the
>>>> guest to use it (via "hcall-multi-tce"). Old H_PUT_TCE is still there.
>>>>
>>>> If the guest always uses H_PUT_TCE_INDIRECT/H_STUFF_TCE no matter what,
>>>> then it is its problem - it won't work now anyway as neither QEMU nor host
>>>> kernel supports these calls.
>>
>>
>>> Always assume that you are a kernel developer without knowledge
>>> of any user space code using your interfaces. So there is the theoretical
>>> possibility that there is a user space client out there that implements
>>> H_PUT_TCE_INDIRECT and advertises hcall-multi-tce to the guest. 
>>> Would that client break? If so, we should definitely have
>>> the CAP disabled by default.
>>
>>
>> No, it won't break. Why would it break? I really do not get it. This user
>> space client has to do an extra step to get this acceleration by calling
>> ioctl(KVM_CREATE_SPAPR_TCE) anyway. Previously that ioctl only had effect
>> on H_PUT_TCE, now on all three hcalls.
> 
> Hrm. It's a change of behavior, it probably wouldn't break, yes.


Aaand?


>>> But really, it's also as much about consistency as anything else.
>>> If we leave everything as is and always extend functionality
>>> by enabling new CAPs, we're pretty much guaranteed that we
>>> don't break anything by accident. It also makes debugging easier
>>> because you can for example disable this particular feature
>>> to see whether something has bad side effects.
>>
>>
>> So I must add one more ioctl to enable MULTITCE in kernel handling. Is it
>> what you are saying?
>>
>> I can see KVM_CHECK_EXTENSION but I do not see KVM_ENABLE_EXTENSION or
>> anything like that.
> 
> KVM_ENABLE_CAP. It's how we enable sPAPR capabilities too.


Yeah, Paul already explained. It is platform specific but ok.
And does not have "EXTENSION" in the name for some reason but ok too.

KVM_ENABLE_CAP is vcpu ioctl. So kvm_arch_vcpu_ioctl() enables VCPU's
capabilities while KVM_CAP_SPAPR_MULTITCE is KVM (or more precisely
SPAPR-TCE/LIOBN but I really do not want it to be that specific) capability.

Sure I can add to kvm_arch_vcpu_ioctl():

case KVM_CAP_SPAPR_MULTITCE:
        r = 0;
        vcpu->kvm->arch.spapr_multitce_enabled = cap->args[0];
        break;

But I suspect you and Ben will call it ugly. SO do I have to implement
KVM_ENABLE_CAP in kvm_arch_vm_ioctl and change the api.txt that it is not
just about vcpu ioctl anymore? Or my brand new ioctl for this?




>>>>>> +
>>>>>> +
>>>>>> 5. The kvm_run structure
>>>>>> ------------------------
>>>>>>
>>>>>> diff --git a/arch/powerpc/include/asm/kvm_host.h
>>>>>> b/arch/powerpc/include/asm/kvm_host.h
>>>>>> index af326cd..20d04bd 100644
>>>>>> --- a/arch/powerpc/include/asm/kvm_host.h
>>>>>> +++ b/arch/powerpc/include/asm/kvm_host.h
>>>>>> @@ -180,6 +180,7 @@ struct kvmppc_spapr_tce_table {
>>>>>>     struct kvm *kvm;
>>>>>>     u64 liobn;
>>>>>>     u32 window_size;
>>>>>> +    struct { struct { unsigned long put, indir, stuff; } rm, vm; } stat;
>>>>>
>>>>> You don't need this.
>>>>>
>>>>>>     struct page *pages[0];
>>>>>> };
>>>>>>
>>>>>> @@ -609,6 +610,14 @@ struct kvm_vcpu_arch {
>>>>>>     spinlock_t tbacct_lock;
>>>>>>     u64 busy_stolen;
>>>>>>     u64 busy_preempt;
>>>>>> +
>>>>>> +    unsigned long *tce_tmp_hpas;    /* TCE cache for TCE_PUT_INDIRECT
>>>>>> hcall */
>>>>>> +    enum {
>>>>>> +        TCERM_NONE,
>>>>>> +        TCERM_GETPAGE,
>>>>>> +        TCERM_PUTTCE,
>>>>>> +        TCERM_PUTLIST,
>>>>>> +    } tce_rm_fail;            /* failed stage of request processing */
>>>>>> #endif
>>>>>> };
>>>>>>
>>>>>> diff --git a/arch/powerpc/include/asm/kvm_ppc.h
>>>>>> b/arch/powerpc/include/asm/kvm_ppc.h
>>>>>> index a5287fe..fa722a0 100644
>>>>>> --- a/arch/powerpc/include/asm/kvm_ppc.h
>>>>>> +++ b/arch/powerpc/include/asm/kvm_ppc.h
>>>>>> @@ -133,8 +133,20 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu
>>>>>> *vcpu);
>>>>>>
>>>>>> extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>>>>>>                 struct kvm_create_spapr_tce *args);
>>>>>> -extern long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>>>>>> -                 unsigned long ioba, unsigned long tce);
>>>>>> +extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(
>>>>>> +        struct kvm_vcpu *vcpu, unsigned long liobn);
>>>>>> +extern long kvmppc_emulated_validate_tce(unsigned long tce);
>>>>>> +extern void kvmppc_emulated_put_tce(struct kvmppc_spapr_tce_table *tt,
>>>>>> +        unsigned long ioba, unsigned long tce);
>>>>>> +extern long kvmppc_vm_h_put_tce(struct kvm_vcpu *vcpu,
>>>>>> +        unsigned long liobn, unsigned long ioba,
>>>>>> +        unsigned long tce);
>>>>>> +extern long kvmppc_vm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>>>>> +        unsigned long liobn, unsigned long ioba,
>>>>>> +        unsigned long tce_list, unsigned long npages);
>>>>>> +extern long kvmppc_vm_h_stuff_tce(struct kvm_vcpu *vcpu,
>>>>>> +        unsigned long liobn, unsigned long ioba,
>>>>>> +        unsigned long tce_value, unsigned long npages);
>>>>>> extern long kvm_vm_ioctl_allocate_rma(struct kvm *kvm,
>>>>>>                 struct kvm_allocate_rma *rma);
>>>>>> extern struct kvmppc_linear_info *kvm_alloc_rma(void);
>>>>>> diff --git a/arch/powerpc/kvm/book3s_64_vio.c
>>>>>> b/arch/powerpc/kvm/book3s_64_vio.c
>>>>>> index b2d3f3b..99bf4e5 100644
>>>>>> --- a/arch/powerpc/kvm/book3s_64_vio.c
>>>>>> +++ b/arch/powerpc/kvm/book3s_64_vio.c
>>>>>> @@ -14,6 +14,7 @@
>>>>>>  *
>>>>>>  * Copyright 2010 Paul Mackerras, IBM Corp.<paulus@au1.ibm.com>
>>>>>>  * Copyright 2011 David Gibson, IBM Corporation<dwg@au1.ibm.com>
>>>>>> + * Copyright 2013 Alexey Kardashevskiy, IBM Corporation<aik@au1.ibm.com>
>>>>>>  */
>>>>>>
>>>>>> #include<linux/types.h>
>>>>>> @@ -36,8 +37,10 @@
>>>>>> #include<asm/ppc-opcode.h>
>>>>>> #include<asm/kvm_host.h>
>>>>>> #include<asm/udbg.h>
>>>>>> +#include<asm/iommu.h>
>>>>>> +#include<asm/tce.h>
>>>>>>
>>>>>> -#define TCES_PER_PAGE    (PAGE_SIZE / sizeof(u64))
>>>>>> +#define ERROR_ADDR      ((void *)~(unsigned long)0x0)
>>>>>>
>>>>>> static long kvmppc_stt_npages(unsigned long window_size)
>>>>>> {
>>>>>> @@ -50,6 +53,20 @@ static void release_spapr_tce_table(struct
>>>>>> kvmppc_spapr_tce_table *stt)
>>>>>>     struct kvm *kvm = stt->kvm;
>>>>>>     int i;
>>>>>>
>>>>>> +#define __SV(x) stt->stat.x
>>>>>> +#define __SVD(x) (__SV(rm.x)?(__SV(rm.x)-__SV(vm.x)):0)
>>>>>> +    pr_debug("%s stat for liobn=%llx\n"
>>>>>> +            "--------------- realmode ----- virtmode ---\n"
>>>>>> +            "put_tce       %10ld     %10ld\n"
>>>>>> +            "put_tce_indir %10ld     %10ld\n"
>>>>>> +            "stuff_tce     %10ld     %10ld\n",
>>>>>> +            __func__, stt->liobn,
>>>>>> +            __SVD(put), __SV(vm.put),
>>>>>> +            __SVD(indir), __SV(vm.indir),
>>>>>> +            __SVD(stuff), __SV(vm.stuff));
>>>>>> +#undef __SVD
>>>>>> +#undef __SV
>>>>>
>>>>> All of these stat points should just be trace points. You can do the
>>>>> statistic gathering from user space then.
>>>>>
>>>>>> +
>>>>>>     mutex_lock(&kvm->lock);
>>>>>>     list_del(&stt->list);
>>>>>>     for (i = 0; i<  kvmppc_stt_npages(stt->window_size); i++)
>>>>>> @@ -148,3 +165,138 @@ fail:
>>>>>>     }
>>>>>>     return ret;
>>>>>> }
>>>>>> +
>>>>>> +/* Converts guest physical address to host virtual address */
>>>>>> +static void __user *kvmppc_vm_gpa_to_hva_and_get(struct kvm_vcpu *vcpu,
>>>>>
>>>>> Please don't distinguish _vm versions. They're the normal case. _rm ones
>>>>> are the special ones.
>>>>>
>>>>>> +        unsigned long gpa, struct page **pg)
>>>>>> +{
>>>>>> +    unsigned long hva, gfn = gpa>>  PAGE_SHIFT;
>>>>>> +    struct kvm_memory_slot *memslot;
>>>>>> +
>>>>>> +    memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
>>>>>> +    if (!memslot)
>>>>>> +        return ERROR_ADDR;
>>>>>> +
>>>>>> +    hva = __gfn_to_hva_memslot(memslot, gfn) + (gpa&  ~PAGE_MASK);
>>>>>
>>>>> s/+/|/
>>>>>
>>>>>> +
>>>>>> +    if (get_user_pages_fast(hva&  PAGE_MASK, 1, 0, pg) != 1)
>>>>>> +        return ERROR_ADDR;
>>>>>> +
>>>>>> +    return (void *) hva;
>>>>>> +}
>>>>>> +
>>>>>> +long kvmppc_vm_h_put_tce(struct kvm_vcpu *vcpu,
>>>>>> +        unsigned long liobn, unsigned long ioba,
>>>>>> +        unsigned long tce)
>>>>>> +{
>>>>>> +    long ret;
>>>>>> +    struct kvmppc_spapr_tce_table *tt;
>>>>>> +
>>>>>> +    tt = kvmppc_find_tce_table(vcpu, liobn);
>>>>>> +    /* Didn't find the liobn, put it to userspace */
>>>>>
>>>>> Unclear comment.
>>>>
>>>>
>>>> What detail is missing?
>>>
>>
>>> Grammar wise "it" in the second half of the sentence refers to liobn.
>>> So you "put" the "liobn to userspace". That sentence doesn't
>>> make any sense.
>>
>>
>> Removed it. H_TOO_HARD itself says enough already.
>>
>>
>>> What you really want to say is:
>>>
>>>  /* Couldn't find the liobn. Something went wrong. Let user space handle the hypercall. That has better ways of dealing with errors. */
>>>
>>>>
>>>>
>>>>>> +    if (!tt)
>>>>>> +        return H_TOO_HARD;
>>>>>> +
>>>>>> +    ++tt->stat.vm.put;
>>>>>> +
>>>>>> +    if (ioba>= tt->window_size)
>>>>>> +        return H_PARAMETER;
>>>>>> +
>>>>>> +    ret = kvmppc_emulated_validate_tce(tce);
>>>>>> +    if (ret)
>>>>>> +        return ret;
>>>>>> +
>>>>>> +    kvmppc_emulated_put_tce(tt, ioba, tce);
>>>>>> +
>>>>>> +    return H_SUCCESS;
>>>>>> +}
>>>>>> +
>>>>>> +long kvmppc_vm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>>>>> +        unsigned long liobn, unsigned long ioba,
>>>>>> +        unsigned long tce_list, unsigned long npages)
>>>>>> +{
>>>>>> +    struct kvmppc_spapr_tce_table *tt;
>>>>>> +    long i, ret = H_SUCCESS;
>>>>>> +    unsigned long __user *tces;
>>>>>> +    struct page *pg = NULL;
>>>>>> +
>>>>>> +    tt = kvmppc_find_tce_table(vcpu, liobn);
>>>>>> +    /* Didn't find the liobn, put it to userspace */
>>>>>> +    if (!tt)
>>>>>> +        return H_TOO_HARD;
>>>>>> +
>>>>>> +    ++tt->stat.vm.indir;
>>>>>> +
>>>>>> +    /*
>>>>>> +     * The spec says that the maximum size of the list is 512 TCEs so
>>>>>> +     * so the whole table addressed resides in 4K page
>>>>>
>>>>> so so?
>>>>>
>>>>>> +     */
>>>>>> +    if (npages>  512)
>>>>>> +        return H_PARAMETER;
>>>>>> +
>>>>>> +    if (tce_list&  ~IOMMU_PAGE_MASK)
>>>>>> +        return H_PARAMETER;
>>>>>> +
>>>>>> +    if ((ioba + (npages<<  IOMMU_PAGE_SHIFT))>  tt->window_size)
>>>>>> +        return H_PARAMETER;
>>>>>> +
>>>>>> +    tces = kvmppc_vm_gpa_to_hva_and_get(vcpu, tce_list,&pg);
>>>>>> +    if (tces == ERROR_ADDR)
>>>>>> +        return H_TOO_HARD;
>>>>>> +
>>>>>> +    if (vcpu->arch.tce_rm_fail == TCERM_PUTLIST)
>>>>>> +        goto put_list_page_exit;
>>>>>> +
>>>>>> +    for (i = 0; i<  npages; ++i) {
>>>>>> +        if (get_user(vcpu->arch.tce_tmp_hpas[i], tces + i)) {
>>>>>> +            ret = H_PARAMETER;
>>>>>> +            goto put_list_page_exit;
>>>>>> +        }
>>>>>> +
>>>>>> +        ret = kvmppc_emulated_validate_tce(vcpu->arch.tce_tmp_hpas[i]);
>>>>>> +        if (ret)
>>>>>> +            goto put_list_page_exit;
>>>>>> +    }
>>>>>> +
>>>>>> +    for (i = 0; i<  npages; ++i)
>>>>>> +        kvmppc_emulated_put_tce(tt, ioba + (i<<  IOMMU_PAGE_SHIFT),
>>>>>> +                vcpu->arch.tce_tmp_hpas[i]);
>>>>>> +put_list_page_exit:
>>>>>> +    if (pg)
>>>>>> +        put_page(pg);
>>>>>> +
>>>>>> +    if (vcpu->arch.tce_rm_fail != TCERM_NONE) {
>>>>>> +        vcpu->arch.tce_rm_fail = TCERM_NONE;
>>>>>> +        if (pg&&  !PageCompound(pg))
>>>>>> +            put_page(pg); /* finish pending realmode_put_page() */
>>>>>> +    }
>>>>>> +
>>>>>> +    return ret;
>>>>>> +}
>>>>>> +
>>>>>> +long kvmppc_vm_h_stuff_tce(struct kvm_vcpu *vcpu,
>>>>>> +        unsigned long liobn, unsigned long ioba,
>>>>>> +        unsigned long tce_value, unsigned long npages)
>>>>>> +{
>>>>>> +    struct kvmppc_spapr_tce_table *tt;
>>>>>> +    long i, ret;
>>>>>> +
>>>>>> +    tt = kvmppc_find_tce_table(vcpu, liobn);
>>>>>> +    /* Didn't find the liobn, put it to userspace */
>>>>>> +    if (!tt)
>>>>>> +        return H_TOO_HARD;
>>>>>> +
>>>>>> +    ++tt->stat.vm.stuff;
>>>>>> +
>>>>>> +    if ((ioba + (npages<<  IOMMU_PAGE_SHIFT))>  tt->window_size)
>>>>>> +        return H_PARAMETER;
>>>>>> +
>>>>>> +    ret = kvmppc_emulated_validate_tce(tce_value);
>>>>>> +    if (ret || (tce_value&  (TCE_PCI_WRITE | TCE_PCI_READ)))
>>>>>> +        return H_PARAMETER;
>>>>>> +
>>>>>> +    for (i = 0; i<  npages; ++i, ioba += IOMMU_PAGE_SIZE)
>>>>>> +        kvmppc_emulated_put_tce(tt, ioba, tce_value);
>>>>>> +
>>>>>> +    return H_SUCCESS;
>>>>>> +}
>>>>>> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c
>>>>>> b/arch/powerpc/kvm/book3s_64_vio_hv.c
>>>>>> index 30c2f3b..cd3e6f9 100644
>>>>>> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
>>>>>> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
>>>>>> @@ -14,6 +14,7 @@
>>>>>>  *
>>>>>>  * Copyright 2010 Paul Mackerras, IBM Corp.<paulus@au1.ibm.com>
>>>>>>  * Copyright 2011 David Gibson, IBM Corporation<dwg@au1.ibm.com>
>>>>>> + * Copyright 2013 Alexey Kardashevskiy, IBM Corporation<aik@au1.ibm.com>
>>>>>>  */
>>>>>>
>>>>>> #include<linux/types.h>
>>>>>> @@ -35,42 +36,243 @@
>>>>>> #include<asm/ppc-opcode.h>
>>>>>> #include<asm/kvm_host.h>
>>>>>> #include<asm/udbg.h>
>>>>>> +#include<asm/iommu.h>
>>>>>> +#include<asm/tce.h>
>>>>>>
>>>>>> #define TCES_PER_PAGE    (PAGE_SIZE / sizeof(u64))
>>>>>> +#define ERROR_ADDR      (~(unsigned long)0x0)
>>>>>>
>>>>>> -/* WARNING: This will be called in real-mode on HV KVM and virtual
>>>>>> - *          mode on PR KVM
>>>>>
>>>>> What's wrong with the warning?
>>>>
>>>>
>>>> It belongs to kvmppc_h_put_tce() which is not called in virtual mode anymore.
>>>
>>> I thought the comment applied to the whole file before? Hrm. Maybe I misread it then.
>>>
>>>> It is technically correct for kvmppc_find_tce_table() though. Should I put
>>>> this comment before every function which may be called from real and
>>>> virtual modes?
>>>
>>> Yes, please. Otherwise someone might stick an access to a non-linear address
>>> in there by accident.
>>>
>>>>
>>>>
>>>>
>>>>>> +/*
>>>>>> + * Finds a TCE table descriptor by LIOBN
>>>>>>  */
>>>>>> +struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(struct kvm_vcpu *vcpu,
>>>>>> +        unsigned long liobn)
>>>>>> +{
>>>>>> +    struct kvmppc_spapr_tce_table *tt;
>>>>>> +
>>>>>> +    list_for_each_entry(tt,&vcpu->kvm->arch.spapr_tce_tables, list) {
>>>>>> +        if (tt->liobn == liobn)
>>>>>> +            return tt;
>>>>>> +    }
>>>>>> +
>>>>>> +    return NULL;
>>>>>> +}
>>>>>> +EXPORT_SYMBOL_GPL(kvmppc_find_tce_table);
>>>>>> +
>>>>>> +#ifdef DEBUG
>>>>>> +/*
>>>>>> + * Lets user mode disable realmode handlers by putting big number
>>>>>> + * in the bottom value of LIOBN
>>>>>
>>>>> What? Seriously? Just don't enable the CAP.
>>>>
>>>>
>>>> It is under DEBUG. It really, really helps to be able to disable real mode
>>>> handlers without reboot. Ok, no debug code, I'll remove.
>>>
>>> Debug code is good, but #ifdefs are bad. For you, an #ifdef reads like
>>> "code that doesn't do any hard when disabled". For me, #ifdefs read
>>> "code that definitely breaks because nobody turns the #define on".
>>>
>>> So please, avoid #ifdef'ed code whenever possible. Switching the CAP on and
>>> off is a much better debug approach in this case.
>>>
>>>>
>>>>
>>>>>> + */
>>>>>> +#define kvmppc_find_tce_table(a, b) \
>>>>>> +        ((((b)&0xffff)>10000)?NULL:kvmppc_find_tce_table((a), (b)))
>>>>>> +#endif
>>>>>> +
>>>>>> +/*
>>>>>> + * Validates TCE address.
>>>>>> + * At the moment only flags are validated as other checks will
>>>>>> significantly slow
>>>>>> + * down or can make it even impossible to handle TCE requests in real mode.
>>>>>
>>>>> What?
>>>>
>>>>
>>>> What is missing here (besides good english)?
>>>
>>> What badness could slip through by not validating everything?
>>
>>
>> I cannot think of any good check which could be done in real mode and not
>> be "more than 2 calls deep" (c) Ben. Check that the page is allocated at
>> all? How? Don't know.
> 

> If you say that our validation doesn't validate everything, that makes
> me really weary.


It checks that TCE does not have any bit set in bits 2..12. If they are
set, something went very wrong. Better than nothing.


> Could the guest use it to maliciously inject anything?
> Could a missing check make our code go berserk?


No. KVM does not do anything with those addresses, just puts them to the
table and lets QEMU or a guest deal with it.


> What checks exactly would you do in addition when this was virtual mode?


Check that TCE is within RAM boundaries. Or check that the page was
allocated. find_linux_pte_or_hugepte? It can fail in real mode but in
virtual mode I can call get_user_fast_page and confirm that the address is
ok. Not sure, did not think much about it. Compare page flags with TCE
flags if both or neither have "write" set, this kind of stuff.

I am not really sure we need any of those checks for emulated TCE at all.

Remove the comment then?


>>>>>> + */
>>>>>> +long kvmppc_emulated_validate_tce(unsigned long tce)
>>>>>
>>>>> I don't like the naming scheme. Please turn this around and make it
>>>>> kvmppc_tce_validate().
>>>>
>>>>
>>>> Oh. "Like"... Ok.
>>>
>>> Yes. Like.
>>>
>>>>
>>>>
>>>>>> +{
>>>>>> +    if (tce&  ~(IOMMU_PAGE_MASK | TCE_PCI_WRITE | TCE_PCI_READ))
>>>>>> +        return H_PARAMETER;
>>>>>> +
>>>>>> +    return H_SUCCESS;
>>>>>> +}
>>>>>> +EXPORT_SYMBOL_GPL(kvmppc_emulated_validate_tce);
>>>>>> +
>>>>>> +/*
>>>>>> + * Handles TCE requests for QEMU emulated devices.
>>>>>
>>>>> We still don't mention QEMU in KVM code. And does it really matter whether
>>>>> they're emulated by QEMU? Devices could also be emulated by KVM.
>>>>>
>>>>>> + * Puts guest TCE values to the table and expects QEMU to convert them
>>>>>> + * later in a QEMU device implementation.
>>>>>> + * Called in both real and virtual modes.
>>>>>> + * Cannot fail so kvmppc_emulated_validate_tce must be called before it.
>>>>>> + */
>>>>>> +void kvmppc_emulated_put_tce(struct kvmppc_spapr_tce_table *tt,
>>>>>
>>>>> kvmppc_tce_put()
>>>>>
>>>>>> +        unsigned long ioba, unsigned long tce)
>>>>>> +{
>>>>>> +    unsigned long idx = ioba>>  SPAPR_TCE_SHIFT;
>>>>>> +    struct page *page;
>>>>>> +    u64 *tbl;
>>>>>> +
>>>>>> +    /*
>>>>>> +     * Note on the use of page_address() in real mode,
>>>>>> +     *
>>>>>> +     * It is safe to use page_address() in real mode on ppc64 because
>>>>>> +     * page_address() is always defined as lowmem_page_address()
>>>>>> +     * which returns __va(PFN_PHYS(page_to_pfn(page))) which is arithmetial
>>>>>> +     * operation and does not access page struct.
>>>>>> +     *
>>>>>> +     * Theoretically page_address() could be defined different
>>>>>> +     * but either WANT_PAGE_VIRTUAL or HASHED_PAGE_VIRTUAL
>>>>>> +     * should be enabled.
>>>>>> +     * WANT_PAGE_VIRTUAL is never enabled on ppc32/ppc64,
>>>>>> +     * HASHED_PAGE_VIRTUAL could be enabled for ppc32 only and only
>>>>>> +     * if CONFIG_HIGHMEM is defined. As CONFIG_SPARSEMEM_VMEMMAP
>>>>>> +     * is not expected to be enabled on ppc32, page_address()
>>>>>> +     * is safe for ppc32 as well.
>>>>>> +     */
>>>>>> +#if defined(HASHED_PAGE_VIRTUAL) || defined(WANT_PAGE_VIRTUAL)
>>>>>> +#error TODO: fix to avoid page_address() here
>>>>>> +#endif
>>>>>
>>>>> Can you extract the text above, the check and the page_address call into a
>>>>> simple wrapper function?
>>>>
>>>>
>>>> Is this function also too big? Sorry, I do not understand the comment.
>>>
>>> All of the comment and #if here only deal with the fact that you
>>> have a real mode hack to call page_address() that happens
>>> to work under specific circumstances.
>>>
>>> There's nothing kvmppc_tce_put() specific about this.
>>> The page_address() code happens to get called here, sure.
>>> But if I read the kvmppc_tce_put() function I don't care about
>>> these details - I want to understand the code flow that ends
>>> up writing the TCE.
>>>
>>>>>> +    page = tt->pages[idx / TCES_PER_PAGE];
>>>>>> +    tbl = (u64 *)page_address(page);
>>>>>> +
>>>>>> +    /* udbg_printf("tce @ %p\n",&tbl[idx % TCES_PER_PAGE]); */
>>>>>
>>>>> This is not an RFC, is it?
>>>>
>>>>
>>>> Any debug code is prohibited? Ok, I'll remove.
>>>
>>> Debug code that requires code changes is prohibited, yes.
>>> Debug code that is runtime switchable (pr_debug, trace points, etc)
>>> are allowed.
>>
>>
>> Is there any easy way to enable just this specific udbg_printf (not all of
>> them at once)? Trace points do not work in real mode as we figured out.
> 
> You can enable pr_debug by file IIRC.


On already running kernel? :-/ Wow. How?



>>>>>> +    tbl[idx % TCES_PER_PAGE] = tce;
>>>>>> +}
>>>>>> +EXPORT_SYMBOL_GPL(kvmppc_emulated_put_tce);
>>>>>> +
>>>>>> +#ifdef CONFIG_KVM_BOOK3S_64_HV
>>>>>> +/*
>>>>>> + * Converts guest physical address to host physical address.
>>>>>> + * Tries to increase page counter via realmode_get_page() and
>>>>>> + * returns ERROR_ADDR if failed.
>>>>>> + */
>>>>>> +static unsigned long kvmppc_rm_gpa_to_hpa_and_get(struct kvm_vcpu *vcpu,
>>>>>> +        unsigned long gpa, struct page **pg)
>>>>>> +{
>>>>>> +    struct kvm_memory_slot *memslot;
>>>>>> +    pte_t *ptep, pte;
>>>>>> +    unsigned long hva, hpa = ERROR_ADDR;
>>>>>> +    unsigned long gfn = gpa>>  PAGE_SHIFT;
>>>>>> +    unsigned shift = 0;
>>>>>> +
>>>>>> +    memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
>>>>>> +    if (!memslot)
>>>>>> +        return ERROR_ADDR;
>>>>>> +
>>>>>> +    hva = __gfn_to_hva_memslot(memslot, gfn);
>>>>>> +
>>>>>> +    ptep = find_linux_pte_or_hugepte(vcpu->arch.pgdir, hva,&shift);
>>>>>> +    if (!ptep || !pte_present(*ptep))
>>>>>> +        return ERROR_ADDR;
>>>>>> +    pte = *ptep;
>>>>>> +
>>>>>> +    if (((gpa&  TCE_PCI_WRITE) || pte_write(pte))&&  !pte_dirty(pte))
>>>>>> +        return ERROR_ADDR;
>>>>>> +
>>>>>> +    if (!pte_young(pte))
>>>>>> +        return ERROR_ADDR;
>>>>>> +
>>>>>> +    if (!shift)
>>>>>> +        shift = PAGE_SHIFT;
>>>>>> +
>>>>>> +    /* Put huge pages handling to the virtual mode */
>>>>>> +    if (shift>  PAGE_SHIFT)
>>>>>> +        return ERROR_ADDR;
>>>>>> +
>>>>>> +    *pg = realmode_pfn_to_page(pte_pfn(pte));
>>>>>> +    if (!*pg || realmode_get_page(*pg))
>>>>>> +        return ERROR_ADDR;
>>>>>> +
>>>>>> +    /* pte_pfn(pte) returns address aligned to pg_size */
>>>>>> +    hpa = (pte_pfn(pte)<<  PAGE_SHIFT) + (gpa&  ((1<<  shift) - 1));
>>>>>> +
>>>>>> +    if (unlikely(pte_val(pte) != pte_val(*ptep))) {
>>>>>> +        hpa = ERROR_ADDR;
>>>>>> +        realmode_put_page(*pg);
>>>>>> +        *pg = NULL;
>>>>>> +    }
>>>>>> +
>>>>>> +    return hpa;
>>>>>> +}
>>>>>> +
>>>>>> long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>>>>>>               unsigned long ioba, unsigned long tce)
>>>>>> {
>>>>>> -    struct kvm *kvm = vcpu->kvm;
>>>>>> -    struct kvmppc_spapr_tce_table *stt;
>>>>>> -
>>>>>> -    /* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
>>>>>> -    /*         liobn, ioba, tce); */
>>>>>> -
>>>>>> -    list_for_each_entry(stt,&kvm->arch.spapr_tce_tables, list) {
>>>>>> -        if (stt->liobn == liobn) {
>>>>>> -            unsigned long idx = ioba>>  SPAPR_TCE_SHIFT;
>>>>>> -            struct page *page;
>>>>>> -            u64 *tbl;
>>>>>> -
>>>>>> -            /* udbg_printf("H_PUT_TCE: liobn 0x%lx =>  stt=%p 
>>>>>> window_size=0x%x\n", */
>>>>>> -            /*         liobn, stt, stt->window_size); */
>>>>>> -            if (ioba>= stt->window_size)
>>>>>> -                return H_PARAMETER;
>>>>>> -
>>>>>> -            page = stt->pages[idx / TCES_PER_PAGE];
>>>>>> -            tbl = (u64 *)page_address(page);
>>>>>> -
>>>>>> -            /* FIXME: Need to validate the TCE itself */
>>>>>> -            /* udbg_printf("tce @ %p\n",&tbl[idx % TCES_PER_PAGE]); */
>>>>>> -            tbl[idx % TCES_PER_PAGE] = tce;
>>>>>> -            return H_SUCCESS;
>>>>>> -        }
>>>>>> +    long ret;
>>>>>> +    struct kvmppc_spapr_tce_table *tt = kvmppc_find_tce_table(vcpu, liobn);
>>>>>> +
>>>>>> +    if (!tt)
>>>>>> +        return H_TOO_HARD;
>>>>>> +
>>>>>> +    ++tt->stat.rm.put;
>>>>>> +
>>>>>> +    if (ioba>= tt->window_size)
>>>>>> +        return H_PARAMETER;
>>>>>> +
>>>>>> +    ret = kvmppc_emulated_validate_tce(tce);
>>>>>> +    if (!ret)
>>>>>> +        kvmppc_emulated_put_tce(tt, ioba, tce);
>>>>>> +
>>>>>> +    return ret;
>>>>>> +}
>>>>>> +
>>>>>> +long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>>>>
>>>>> So the _vm version is the normal one and this is the _rm version? If so,
>>>>> please mark it as such. Is there any way to generate both from the same
>>>>> source? The way it's now there is a lot of duplicate code.
>>>>
>>>>
>>>> I tried, looked very ugly. If you insist, I will do so.
>>>
>>
>>> If it looks ugly better don't. I just want to make sure you explored the option.
>>> But please keep the naming scheme consistent.
>>
>>
>> Removed _vm everywhere and put _rm in realmode handlers. I just was
>> confused by _vm in kvm_vm_ioctl_create_spapr_tce() at the first place.
> 
> That vm refers to the virtual machine. It's on VM scope, not VCPU scope.


I do not mind, just saying where it came from :)
Alexander Graf July 11, 2013, 11:15 a.m. UTC | #7
On 11.07.2013, at 12:54, Alexey Kardashevskiy wrote:

> On 07/11/2013 08:11 PM, Alexander Graf wrote:
>> 
>> On 11.07.2013, at 07:12, Alexey Kardashevskiy wrote:
>> 
>>> On 07/10/2013 08:05 PM, Alexander Graf wrote:
>>>> 
>>>> On 10.07.2013, at 07:00, Alexey Kardashevskiy wrote:
>>>> 
>>>>> On 07/10/2013 03:02 AM, Alexander Graf wrote:
>>>>>> On 07/06/2013 05:07 PM, Alexey Kardashevskiy wrote:
>>>>>>> This adds real mode handlers for the H_PUT_TCE_INDIRECT and
>>>>>>> H_STUFF_TCE hypercalls for QEMU emulated devices such as IBMVIO
>>>>>>> devices or emulated PCI.  These calls allow adding multiple entries
>>>>>>> (up to 512) into the TCE table in one call which saves time on
>>>>>>> transition to/from real mode.
>>>>>> 
>>>>>> We don't mention QEMU explicitly in KVM code usually.
>>>>>> 
>>>>>>> This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs
>>>>>>> (copied from user and verified) before writing the whole list into
>>>>>>> the TCE table. This cache will be utilized more in the upcoming
>>>>>>> VFIO/IOMMU support to continue TCE list processing in the virtual
>>>>>>> mode in the case if the real mode handler failed for some reason.
>>>>>>> 
>>>>>>> This adds a guest physical to host real address converter
>>>>>>> and calls the existing H_PUT_TCE handler. The converting function
>>>>>>> is going to be fully utilized by upcoming VFIO supporting patches.
>>>>>>> 
>>>>>>> This also implements the KVM_CAP_PPC_MULTITCE capability,
>>>>>>> so in order to support the functionality of this patch, QEMU
>>>>>>> needs to query for this capability and set the "hcall-multi-tce"
>>>>>>> hypertas property only if the capability is present, otherwise
>>>>>>> there will be serious performance degradation.
>>>>>> 
>>>>>> Same as above. But really you're only giving recommendations here. What's
>>>>>> the point? Please describe what the benefit of this patch is, not what some
>>>>>> other random subsystem might do with the benefits it brings.
>>>>>> 
>>>>>>> 
>>>>>>> Signed-off-by: Paul Mackerras<paulus@samba.org>
>>>>>>> Signed-off-by: Alexey Kardashevskiy<aik@ozlabs.ru>
>>>>>>> 
>>>>>>> ---
>>>>>>> Changelog:
>>>>>>> 2013/07/06:
>>>>>>> * fixed number of wrong get_page()/put_page() calls
>>>>>>> 
>>>>>>> 2013/06/27:
>>>>>>> * fixed clear of BUSY bit in kvmppc_lookup_pte()
>>>>>>> * H_PUT_TCE_INDIRECT does realmode_get_page() now
>>>>>>> * KVM_CAP_SPAPR_MULTITCE now depends on CONFIG_PPC_BOOK3S_64
>>>>>>> * updated doc
>>>>>>> 
>>>>>>> 2013/06/05:
>>>>>>> * fixed mistype about IBMVIO in the commit message
>>>>>>> * updated doc and moved it to another section
>>>>>>> * changed capability number
>>>>>>> 
>>>>>>> 2013/05/21:
>>>>>>> * added kvm_vcpu_arch::tce_tmp
>>>>>>> * removed cleanup if put_indirect failed, instead we do not even start
>>>>>>> writing to TCE table if we cannot get TCEs from the user and they are
>>>>>>> invalid
>>>>>>> * kvmppc_emulated_h_put_tce is split to kvmppc_emulated_put_tce
>>>>>>> and kvmppc_emulated_validate_tce (for the previous item)
>>>>>>> * fixed bug with failthrough for H_IPI
>>>>>>> * removed all get_user() from real mode handlers
>>>>>>> * kvmppc_lookup_pte() added (instead of making lookup_linux_pte public)
>>>>>>> 
>>>>>>> Signed-off-by: Alexey Kardashevskiy<aik@ozlabs.ru>
>>>>>>> ---
>>>>>>> Documentation/virtual/kvm/api.txt       |  25 +++
>>>>>>> arch/powerpc/include/asm/kvm_host.h     |   9 ++
>>>>>>> arch/powerpc/include/asm/kvm_ppc.h      |  16 +-
>>>>>>> arch/powerpc/kvm/book3s_64_vio.c        | 154 ++++++++++++++++++-
>>>>>>> arch/powerpc/kvm/book3s_64_vio_hv.c     | 260
>>>>>>> ++++++++++++++++++++++++++++----
>>>>>>> arch/powerpc/kvm/book3s_hv.c            |  41 ++++-
>>>>>>> arch/powerpc/kvm/book3s_hv_rmhandlers.S |   6 +
>>>>>>> arch/powerpc/kvm/book3s_pr_papr.c       |  37 ++++-
>>>>>>> arch/powerpc/kvm/powerpc.c              |   3 +
>>>>>>> 9 files changed, 517 insertions(+), 34 deletions(-)
>>>>>>> 
>>>>>>> diff --git a/Documentation/virtual/kvm/api.txt
>>>>>>> b/Documentation/virtual/kvm/api.txt
>>>>>>> index 6365fef..762c703 100644
>>>>>>> --- a/Documentation/virtual/kvm/api.txt
>>>>>>> +++ b/Documentation/virtual/kvm/api.txt
>>>>>>> @@ -2362,6 +2362,31 @@ calls by the guest for that service will be passed
>>>>>>> to userspace to be
>>>>>>> handled.
>>>>>>> 
>>>>>>> 
>>>>>>> +4.86 KVM_CAP_PPC_MULTITCE
>>>>>>> +
>>>>>>> +Capability: KVM_CAP_PPC_MULTITCE
>>>>>>> +Architectures: ppc
>>>>>>> +Type: vm
>>>>>>> +
>>>>>>> +This capability means the kernel is capable of handling hypercalls
>>>>>>> +H_PUT_TCE_INDIRECT and H_STUFF_TCE without passing those into the user
>>>>>>> +space. This significanly accelerates DMA operations for PPC KVM guests.
>>>>>> 
>>>>>> significanly? Please run this through a spell checker.
>>>>>> 
>>>>>>> +The user space should expect that its handlers for these hypercalls
>>>>>> 
>>>>>> s/The//
>>>>>> 
>>>>>>> +are not going to be called.
>>>>>> 
>>>>>> Is user space guaranteed they will not be called? Or can it still happen?
>>>>> 
>>>>> ... if user space previously registered LIOBN in KVM (via
>>>>> KVM_CREATE_SPAPR_TCE or similar calls).
>>>>> 
>>>>> ok?
>>>> 
>>>> How about this?
>>>> 
>>>> The hypercalls mentioned above may or may not be processed successfully in the kernel based fast path. If they can not be handled by the kernel, they will get passed on to user space. So user space still has to have an implementation for these despite the in kernel acceleration.
>>>> 
>>>> ---
>>>> 
>>>> The target audience for this documentation is user space KVM API users. Someone developing kvm tool for example. They want to know implications specific CAPs have.
>>>> 
>>>>> 
>>>>> There is also KVM_CREATE_SPAPR_TCE_IOMMU but it is not in the kernel yet
>>>>> and may never get there.
>>>>> 
>>>>> 
>>>>>>> +In order to enable H_PUT_TCE_INDIRECT and H_STUFF_TCE use in the guest,
>>>>>>> +the user space might have to advertise it for the guest. For example,
>>>>>>> +IBM pSeries guest starts using them if "hcall-multi-tce" is present in
>>>>>>> +the "ibm,hypertas-functions" device-tree property.
>>>>>> 
>>>>>> This paragraph describes sPAPR. That's fine, but please document it as
>>>>>> such. Also please check your grammar.
>>>>> 
>>>>>>> +
>>>>>>> +Without this capability, only H_PUT_TCE is handled by the kernel and
>>>>>>> +therefore the use of H_PUT_TCE_INDIRECT and H_STUFF_TCE is not recommended
>>>>>>> +unless the capability is present as passing hypercalls to the userspace
>>>>>>> +slows operations a lot.
>>>>>>> +
>>>>>>> +Unlike other capabilities of this section, this one is always enabled.
>>>>>> 
>>>>>> Why? Wouldn't that confuse older user space?
>>>>> 
>>>>> 
>>>>> How? Old user space won't check for this capability and won't tell the
>>>>> guest to use it (via "hcall-multi-tce"). Old H_PUT_TCE is still there.
>>>>> 
>>>>> If the guest always uses H_PUT_TCE_INDIRECT/H_STUFF_TCE no matter what,
>>>>> then it is its problem - it won't work now anyway as neither QEMU nor host
>>>>> kernel supports these calls.
>>> 
>>> 
>>>> Always assume that you are a kernel developer without knowledge
>>>> of any user space code using your interfaces. So there is the theoretical
>>>> possibility that there is a user space client out there that implements
>>>> H_PUT_TCE_INDIRECT and advertises hcall-multi-tce to the guest. 
>>>> Would that client break? If so, we should definitely have
>>>> the CAP disabled by default.
>>> 
>>> 
>>> No, it won't break. Why would it break? I really do not get it. This user
>>> space client has to do an extra step to get this acceleration by calling
>>> ioctl(KVM_CREATE_SPAPR_TCE) anyway. Previously that ioctl only had effect
>>> on H_PUT_TCE, now on all three hcalls.
>> 
>> Hrm. It's a change of behavior, it probably wouldn't break, yes.
> 
> 
> Aaand?


And that's bad. Jeez, seriously. Don't argue this case. We enable new features individually unless we're 100% sure we can keep everything working. In this case an ENABLE_CAP doesn't hurt at all, because user space still needs to handle the hypercalls if it wants them anyways. But you get debugging for free for example.

> 
> 
>>>> But really, it's also as much about consistency as anything else.
>>>> If we leave everything as is and always extend functionality
>>>> by enabling new CAPs, we're pretty much guaranteed that we
>>>> don't break anything by accident. It also makes debugging easier
>>>> because you can for example disable this particular feature
>>>> to see whether something has bad side effects.
>>> 
>>> 
>>> So I must add one more ioctl to enable MULTITCE in kernel handling. Is it
>>> what you are saying?
>>> 
>>> I can see KVM_CHECK_EXTENSION but I do not see KVM_ENABLE_EXTENSION or
>>> anything like that.
>> 
>> KVM_ENABLE_CAP. It's how we enable sPAPR capabilities too.
> 
> 
> Yeah, Paul already explained. It is platform specific but ok.
> And does not have "EXTENSION" in the name for some reason but ok too.
> 
> KVM_ENABLE_CAP is vcpu ioctl. So kvm_arch_vcpu_ioctl() enables VCPU's
> capabilities while KVM_CAP_SPAPR_MULTITCE is KVM (or more precisely
> SPAPR-TCE/LIOBN but I really do not want it to be that specific) capability.
> 
> Sure I can add to kvm_arch_vcpu_ioctl():
> 
> case KVM_CAP_SPAPR_MULTITCE:
>        r = 0;
>        vcpu->kvm->arch.spapr_multitce_enabled = cap->args[0];
>        break;
> 
> But I suspect you and Ben will call it ugly. SO do I have to implement
> KVM_ENABLE_CAP in kvm_arch_vm_ioctl and change the api.txt that it is not
> just about vcpu ioctl anymore? Or my brand new ioctl for this?


There are 2 ways of dealing with this:

  1) Call the ENABLE_CAP on every vcpu. That way one CPU may handle this hypercall in the kernel while another one may not. The same as we handle PAPR today.

  2) Create a new ENABLE_CAP for the vm.

I think in this case option 1 is fine - it's how we handle everything else already.

> 
> 
> 
> 
>>>>>>> +
>>>>>>> +
>>>>>>> 5. The kvm_run structure
>>>>>>> ------------------------
>>>>>>> 
>>>>>>> diff --git a/arch/powerpc/include/asm/kvm_host.h
>>>>>>> b/arch/powerpc/include/asm/kvm_host.h
>>>>>>> index af326cd..20d04bd 100644
>>>>>>> --- a/arch/powerpc/include/asm/kvm_host.h
>>>>>>> +++ b/arch/powerpc/include/asm/kvm_host.h
>>>>>>> @@ -180,6 +180,7 @@ struct kvmppc_spapr_tce_table {
>>>>>>>    struct kvm *kvm;
>>>>>>>    u64 liobn;
>>>>>>>    u32 window_size;
>>>>>>> +    struct { struct { unsigned long put, indir, stuff; } rm, vm; } stat;
>>>>>> 
>>>>>> You don't need this.
>>>>>> 
>>>>>>>    struct page *pages[0];
>>>>>>> };
>>>>>>> 
>>>>>>> @@ -609,6 +610,14 @@ struct kvm_vcpu_arch {
>>>>>>>    spinlock_t tbacct_lock;
>>>>>>>    u64 busy_stolen;
>>>>>>>    u64 busy_preempt;
>>>>>>> +
>>>>>>> +    unsigned long *tce_tmp_hpas;    /* TCE cache for TCE_PUT_INDIRECT
>>>>>>> hcall */
>>>>>>> +    enum {
>>>>>>> +        TCERM_NONE,
>>>>>>> +        TCERM_GETPAGE,
>>>>>>> +        TCERM_PUTTCE,
>>>>>>> +        TCERM_PUTLIST,
>>>>>>> +    } tce_rm_fail;            /* failed stage of request processing */
>>>>>>> #endif
>>>>>>> };
>>>>>>> 
>>>>>>> diff --git a/arch/powerpc/include/asm/kvm_ppc.h
>>>>>>> b/arch/powerpc/include/asm/kvm_ppc.h
>>>>>>> index a5287fe..fa722a0 100644
>>>>>>> --- a/arch/powerpc/include/asm/kvm_ppc.h
>>>>>>> +++ b/arch/powerpc/include/asm/kvm_ppc.h
>>>>>>> @@ -133,8 +133,20 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu
>>>>>>> *vcpu);
>>>>>>> 
>>>>>>> extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>>>>>>>                struct kvm_create_spapr_tce *args);
>>>>>>> -extern long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>>>>>>> -                 unsigned long ioba, unsigned long tce);
>>>>>>> +extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(
>>>>>>> +        struct kvm_vcpu *vcpu, unsigned long liobn);
>>>>>>> +extern long kvmppc_emulated_validate_tce(unsigned long tce);
>>>>>>> +extern void kvmppc_emulated_put_tce(struct kvmppc_spapr_tce_table *tt,
>>>>>>> +        unsigned long ioba, unsigned long tce);
>>>>>>> +extern long kvmppc_vm_h_put_tce(struct kvm_vcpu *vcpu,
>>>>>>> +        unsigned long liobn, unsigned long ioba,
>>>>>>> +        unsigned long tce);
>>>>>>> +extern long kvmppc_vm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>>>>>> +        unsigned long liobn, unsigned long ioba,
>>>>>>> +        unsigned long tce_list, unsigned long npages);
>>>>>>> +extern long kvmppc_vm_h_stuff_tce(struct kvm_vcpu *vcpu,
>>>>>>> +        unsigned long liobn, unsigned long ioba,
>>>>>>> +        unsigned long tce_value, unsigned long npages);
>>>>>>> extern long kvm_vm_ioctl_allocate_rma(struct kvm *kvm,
>>>>>>>                struct kvm_allocate_rma *rma);
>>>>>>> extern struct kvmppc_linear_info *kvm_alloc_rma(void);
>>>>>>> diff --git a/arch/powerpc/kvm/book3s_64_vio.c
>>>>>>> b/arch/powerpc/kvm/book3s_64_vio.c
>>>>>>> index b2d3f3b..99bf4e5 100644
>>>>>>> --- a/arch/powerpc/kvm/book3s_64_vio.c
>>>>>>> +++ b/arch/powerpc/kvm/book3s_64_vio.c
>>>>>>> @@ -14,6 +14,7 @@
>>>>>>> *
>>>>>>> * Copyright 2010 Paul Mackerras, IBM Corp.<paulus@au1.ibm.com>
>>>>>>> * Copyright 2011 David Gibson, IBM Corporation<dwg@au1.ibm.com>
>>>>>>> + * Copyright 2013 Alexey Kardashevskiy, IBM Corporation<aik@au1.ibm.com>
>>>>>>> */
>>>>>>> 
>>>>>>> #include<linux/types.h>
>>>>>>> @@ -36,8 +37,10 @@
>>>>>>> #include<asm/ppc-opcode.h>
>>>>>>> #include<asm/kvm_host.h>
>>>>>>> #include<asm/udbg.h>
>>>>>>> +#include<asm/iommu.h>
>>>>>>> +#include<asm/tce.h>
>>>>>>> 
>>>>>>> -#define TCES_PER_PAGE    (PAGE_SIZE / sizeof(u64))
>>>>>>> +#define ERROR_ADDR      ((void *)~(unsigned long)0x0)
>>>>>>> 
>>>>>>> static long kvmppc_stt_npages(unsigned long window_size)
>>>>>>> {
>>>>>>> @@ -50,6 +53,20 @@ static void release_spapr_tce_table(struct
>>>>>>> kvmppc_spapr_tce_table *stt)
>>>>>>>    struct kvm *kvm = stt->kvm;
>>>>>>>    int i;
>>>>>>> 
>>>>>>> +#define __SV(x) stt->stat.x
>>>>>>> +#define __SVD(x) (__SV(rm.x)?(__SV(rm.x)-__SV(vm.x)):0)
>>>>>>> +    pr_debug("%s stat for liobn=%llx\n"
>>>>>>> +            "--------------- realmode ----- virtmode ---\n"
>>>>>>> +            "put_tce       %10ld     %10ld\n"
>>>>>>> +            "put_tce_indir %10ld     %10ld\n"
>>>>>>> +            "stuff_tce     %10ld     %10ld\n",
>>>>>>> +            __func__, stt->liobn,
>>>>>>> +            __SVD(put), __SV(vm.put),
>>>>>>> +            __SVD(indir), __SV(vm.indir),
>>>>>>> +            __SVD(stuff), __SV(vm.stuff));
>>>>>>> +#undef __SVD
>>>>>>> +#undef __SV
>>>>>> 
>>>>>> All of these stat points should just be trace points. You can do the
>>>>>> statistic gathering from user space then.
>>>>>> 
>>>>>>> +
>>>>>>>    mutex_lock(&kvm->lock);
>>>>>>>    list_del(&stt->list);
>>>>>>>    for (i = 0; i<  kvmppc_stt_npages(stt->window_size); i++)
>>>>>>> @@ -148,3 +165,138 @@ fail:
>>>>>>>    }
>>>>>>>    return ret;
>>>>>>> }
>>>>>>> +
>>>>>>> +/* Converts guest physical address to host virtual address */
>>>>>>> +static void __user *kvmppc_vm_gpa_to_hva_and_get(struct kvm_vcpu *vcpu,
>>>>>> 
>>>>>> Please don't distinguish _vm versions. They're the normal case. _rm ones
>>>>>> are the special ones.
>>>>>> 
>>>>>>> +        unsigned long gpa, struct page **pg)
>>>>>>> +{
>>>>>>> +    unsigned long hva, gfn = gpa>>  PAGE_SHIFT;
>>>>>>> +    struct kvm_memory_slot *memslot;
>>>>>>> +
>>>>>>> +    memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
>>>>>>> +    if (!memslot)
>>>>>>> +        return ERROR_ADDR;
>>>>>>> +
>>>>>>> +    hva = __gfn_to_hva_memslot(memslot, gfn) + (gpa&  ~PAGE_MASK);
>>>>>> 
>>>>>> s/+/|/
>>>>>> 
>>>>>>> +
>>>>>>> +    if (get_user_pages_fast(hva&  PAGE_MASK, 1, 0, pg) != 1)
>>>>>>> +        return ERROR_ADDR;
>>>>>>> +
>>>>>>> +    return (void *) hva;
>>>>>>> +}
>>>>>>> +
>>>>>>> +long kvmppc_vm_h_put_tce(struct kvm_vcpu *vcpu,
>>>>>>> +        unsigned long liobn, unsigned long ioba,
>>>>>>> +        unsigned long tce)
>>>>>>> +{
>>>>>>> +    long ret;
>>>>>>> +    struct kvmppc_spapr_tce_table *tt;
>>>>>>> +
>>>>>>> +    tt = kvmppc_find_tce_table(vcpu, liobn);
>>>>>>> +    /* Didn't find the liobn, put it to userspace */
>>>>>> 
>>>>>> Unclear comment.
>>>>> 
>>>>> 
>>>>> What detail is missing?
>>>> 
>>> 
>>>> Grammar wise "it" in the second half of the sentence refers to liobn.
>>>> So you "put" the "liobn to userspace". That sentence doesn't
>>>> make any sense.
>>> 
>>> 
>>> Removed it. H_TOO_HARD itself says enough already.
>>> 
>>> 
>>>> What you really want to say is:
>>>> 
>>>> /* Couldn't find the liobn. Something went wrong. Let user space handle the hypercall. That has better ways of dealing with errors. */
>>>> 
>>>>> 
>>>>> 
>>>>>>> +    if (!tt)
>>>>>>> +        return H_TOO_HARD;
>>>>>>> +
>>>>>>> +    ++tt->stat.vm.put;
>>>>>>> +
>>>>>>> +    if (ioba>= tt->window_size)
>>>>>>> +        return H_PARAMETER;
>>>>>>> +
>>>>>>> +    ret = kvmppc_emulated_validate_tce(tce);
>>>>>>> +    if (ret)
>>>>>>> +        return ret;
>>>>>>> +
>>>>>>> +    kvmppc_emulated_put_tce(tt, ioba, tce);
>>>>>>> +
>>>>>>> +    return H_SUCCESS;
>>>>>>> +}
>>>>>>> +
>>>>>>> +long kvmppc_vm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>>>>>> +        unsigned long liobn, unsigned long ioba,
>>>>>>> +        unsigned long tce_list, unsigned long npages)
>>>>>>> +{
>>>>>>> +    struct kvmppc_spapr_tce_table *tt;
>>>>>>> +    long i, ret = H_SUCCESS;
>>>>>>> +    unsigned long __user *tces;
>>>>>>> +    struct page *pg = NULL;
>>>>>>> +
>>>>>>> +    tt = kvmppc_find_tce_table(vcpu, liobn);
>>>>>>> +    /* Didn't find the liobn, put it to userspace */
>>>>>>> +    if (!tt)
>>>>>>> +        return H_TOO_HARD;
>>>>>>> +
>>>>>>> +    ++tt->stat.vm.indir;
>>>>>>> +
>>>>>>> +    /*
>>>>>>> +     * The spec says that the maximum size of the list is 512 TCEs so
>>>>>>> +     * so the whole table addressed resides in 4K page
>>>>>> 
>>>>>> so so?
>>>>>> 
>>>>>>> +     */
>>>>>>> +    if (npages>  512)
>>>>>>> +        return H_PARAMETER;
>>>>>>> +
>>>>>>> +    if (tce_list&  ~IOMMU_PAGE_MASK)
>>>>>>> +        return H_PARAMETER;
>>>>>>> +
>>>>>>> +    if ((ioba + (npages<<  IOMMU_PAGE_SHIFT))>  tt->window_size)
>>>>>>> +        return H_PARAMETER;
>>>>>>> +
>>>>>>> +    tces = kvmppc_vm_gpa_to_hva_and_get(vcpu, tce_list,&pg);
>>>>>>> +    if (tces == ERROR_ADDR)
>>>>>>> +        return H_TOO_HARD;
>>>>>>> +
>>>>>>> +    if (vcpu->arch.tce_rm_fail == TCERM_PUTLIST)
>>>>>>> +        goto put_list_page_exit;
>>>>>>> +
>>>>>>> +    for (i = 0; i<  npages; ++i) {
>>>>>>> +        if (get_user(vcpu->arch.tce_tmp_hpas[i], tces + i)) {
>>>>>>> +            ret = H_PARAMETER;
>>>>>>> +            goto put_list_page_exit;
>>>>>>> +        }
>>>>>>> +
>>>>>>> +        ret = kvmppc_emulated_validate_tce(vcpu->arch.tce_tmp_hpas[i]);
>>>>>>> +        if (ret)
>>>>>>> +            goto put_list_page_exit;
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    for (i = 0; i<  npages; ++i)
>>>>>>> +        kvmppc_emulated_put_tce(tt, ioba + (i<<  IOMMU_PAGE_SHIFT),
>>>>>>> +                vcpu->arch.tce_tmp_hpas[i]);
>>>>>>> +put_list_page_exit:
>>>>>>> +    if (pg)
>>>>>>> +        put_page(pg);
>>>>>>> +
>>>>>>> +    if (vcpu->arch.tce_rm_fail != TCERM_NONE) {
>>>>>>> +        vcpu->arch.tce_rm_fail = TCERM_NONE;
>>>>>>> +        if (pg&&  !PageCompound(pg))
>>>>>>> +            put_page(pg); /* finish pending realmode_put_page() */
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    return ret;
>>>>>>> +}
>>>>>>> +
>>>>>>> +long kvmppc_vm_h_stuff_tce(struct kvm_vcpu *vcpu,
>>>>>>> +        unsigned long liobn, unsigned long ioba,
>>>>>>> +        unsigned long tce_value, unsigned long npages)
>>>>>>> +{
>>>>>>> +    struct kvmppc_spapr_tce_table *tt;
>>>>>>> +    long i, ret;
>>>>>>> +
>>>>>>> +    tt = kvmppc_find_tce_table(vcpu, liobn);
>>>>>>> +    /* Didn't find the liobn, put it to userspace */
>>>>>>> +    if (!tt)
>>>>>>> +        return H_TOO_HARD;
>>>>>>> +
>>>>>>> +    ++tt->stat.vm.stuff;
>>>>>>> +
>>>>>>> +    if ((ioba + (npages<<  IOMMU_PAGE_SHIFT))>  tt->window_size)
>>>>>>> +        return H_PARAMETER;
>>>>>>> +
>>>>>>> +    ret = kvmppc_emulated_validate_tce(tce_value);
>>>>>>> +    if (ret || (tce_value&  (TCE_PCI_WRITE | TCE_PCI_READ)))
>>>>>>> +        return H_PARAMETER;
>>>>>>> +
>>>>>>> +    for (i = 0; i<  npages; ++i, ioba += IOMMU_PAGE_SIZE)
>>>>>>> +        kvmppc_emulated_put_tce(tt, ioba, tce_value);
>>>>>>> +
>>>>>>> +    return H_SUCCESS;
>>>>>>> +}
>>>>>>> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c
>>>>>>> b/arch/powerpc/kvm/book3s_64_vio_hv.c
>>>>>>> index 30c2f3b..cd3e6f9 100644
>>>>>>> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
>>>>>>> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
>>>>>>> @@ -14,6 +14,7 @@
>>>>>>> *
>>>>>>> * Copyright 2010 Paul Mackerras, IBM Corp.<paulus@au1.ibm.com>
>>>>>>> * Copyright 2011 David Gibson, IBM Corporation<dwg@au1.ibm.com>
>>>>>>> + * Copyright 2013 Alexey Kardashevskiy, IBM Corporation<aik@au1.ibm.com>
>>>>>>> */
>>>>>>> 
>>>>>>> #include<linux/types.h>
>>>>>>> @@ -35,42 +36,243 @@
>>>>>>> #include<asm/ppc-opcode.h>
>>>>>>> #include<asm/kvm_host.h>
>>>>>>> #include<asm/udbg.h>
>>>>>>> +#include<asm/iommu.h>
>>>>>>> +#include<asm/tce.h>
>>>>>>> 
>>>>>>> #define TCES_PER_PAGE    (PAGE_SIZE / sizeof(u64))
>>>>>>> +#define ERROR_ADDR      (~(unsigned long)0x0)
>>>>>>> 
>>>>>>> -/* WARNING: This will be called in real-mode on HV KVM and virtual
>>>>>>> - *          mode on PR KVM
>>>>>> 
>>>>>> What's wrong with the warning?
>>>>> 
>>>>> 
>>>>> It belongs to kvmppc_h_put_tce() which is not called in virtual mode anymore.
>>>> 
>>>> I thought the comment applied to the whole file before? Hrm. Maybe I misread it then.
>>>> 
>>>>> It is technically correct for kvmppc_find_tce_table() though. Should I put
>>>>> this comment before every function which may be called from real and
>>>>> virtual modes?
>>>> 
>>>> Yes, please. Otherwise someone might stick an access to a non-linear address
>>>> in there by accident.
>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>>>> +/*
>>>>>>> + * Finds a TCE table descriptor by LIOBN
>>>>>>> */
>>>>>>> +struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(struct kvm_vcpu *vcpu,
>>>>>>> +        unsigned long liobn)
>>>>>>> +{
>>>>>>> +    struct kvmppc_spapr_tce_table *tt;
>>>>>>> +
>>>>>>> +    list_for_each_entry(tt,&vcpu->kvm->arch.spapr_tce_tables, list) {
>>>>>>> +        if (tt->liobn == liobn)
>>>>>>> +            return tt;
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    return NULL;
>>>>>>> +}
>>>>>>> +EXPORT_SYMBOL_GPL(kvmppc_find_tce_table);
>>>>>>> +
>>>>>>> +#ifdef DEBUG
>>>>>>> +/*
>>>>>>> + * Lets user mode disable realmode handlers by putting big number
>>>>>>> + * in the bottom value of LIOBN
>>>>>> 
>>>>>> What? Seriously? Just don't enable the CAP.
>>>>> 
>>>>> 
>>>>> It is under DEBUG. It really, really helps to be able to disable real mode
>>>>> handlers without reboot. Ok, no debug code, I'll remove.
>>>> 
>>>> Debug code is good, but #ifdefs are bad. For you, an #ifdef reads like
>>>> "code that doesn't do any hard when disabled". For me, #ifdefs read
>>>> "code that definitely breaks because nobody turns the #define on".
>>>> 
>>>> So please, avoid #ifdef'ed code whenever possible. Switching the CAP on and
>>>> off is a much better debug approach in this case.
>>>> 
>>>>> 
>>>>> 
>>>>>>> + */
>>>>>>> +#define kvmppc_find_tce_table(a, b) \
>>>>>>> +        ((((b)&0xffff)>10000)?NULL:kvmppc_find_tce_table((a), (b)))
>>>>>>> +#endif
>>>>>>> +
>>>>>>> +/*
>>>>>>> + * Validates TCE address.
>>>>>>> + * At the moment only flags are validated as other checks will
>>>>>>> significantly slow
>>>>>>> + * down or can make it even impossible to handle TCE requests in real mode.
>>>>>> 
>>>>>> What?
>>>>> 
>>>>> 
>>>>> What is missing here (besides good english)?
>>>> 
>>>> What badness could slip through by not validating everything?
>>> 
>>> 
>>> I cannot think of any good check which could be done in real mode and not
>>> be "more than 2 calls deep" (c) Ben. Check that the page is allocated at
>>> all? How? Don't know.
>> 
> 
>> If you say that our validation doesn't validate everything, that makes
>> me really weary.
> 
> 
> It checks that TCE does not have any bit set in bits 2..12. If they are
> set, something went very wrong. Better than nothing.
> 
> 
>> Could the guest use it to maliciously inject anything?
>> Could a missing check make our code go berserk?
> 
> 
> No. KVM does not do anything with those addresses, just puts them to the
> table and lets QEMU or a guest deal with it.
> 
> 
>> What checks exactly would you do in addition when this was virtual mode?
> 
> 
> Check that TCE is within RAM boundaries. Or check that the page was
> allocated. find_linux_pte_or_hugepte? It can fail in real mode but in
> virtual mode I can call get_user_fast_page and confirm that the address is
> ok. Not sure, did not think much about it. Compare page flags with TCE
> flags if both or neither have "write" set, this kind of stuff.
> 
> I am not really sure we need any of those checks for emulated TCE at all.
> 
> Remove the comment then?

No, extend it. Explain what we could check and that we rely on surroundings to ensure everything's fine.

> 
> 
>>>>>>> + */
>>>>>>> +long kvmppc_emulated_validate_tce(unsigned long tce)
>>>>>> 
>>>>>> I don't like the naming scheme. Please turn this around and make it
>>>>>> kvmppc_tce_validate().
>>>>> 
>>>>> 
>>>>> Oh. "Like"... Ok.
>>>> 
>>>> Yes. Like.
>>>> 
>>>>> 
>>>>> 
>>>>>>> +{
>>>>>>> +    if (tce&  ~(IOMMU_PAGE_MASK | TCE_PCI_WRITE | TCE_PCI_READ))
>>>>>>> +        return H_PARAMETER;
>>>>>>> +
>>>>>>> +    return H_SUCCESS;
>>>>>>> +}
>>>>>>> +EXPORT_SYMBOL_GPL(kvmppc_emulated_validate_tce);
>>>>>>> +
>>>>>>> +/*
>>>>>>> + * Handles TCE requests for QEMU emulated devices.
>>>>>> 
>>>>>> We still don't mention QEMU in KVM code. And does it really matter whether
>>>>>> they're emulated by QEMU? Devices could also be emulated by KVM.
>>>>>> 
>>>>>>> + * Puts guest TCE values to the table and expects QEMU to convert them
>>>>>>> + * later in a QEMU device implementation.
>>>>>>> + * Called in both real and virtual modes.
>>>>>>> + * Cannot fail so kvmppc_emulated_validate_tce must be called before it.
>>>>>>> + */
>>>>>>> +void kvmppc_emulated_put_tce(struct kvmppc_spapr_tce_table *tt,
>>>>>> 
>>>>>> kvmppc_tce_put()
>>>>>> 
>>>>>>> +        unsigned long ioba, unsigned long tce)
>>>>>>> +{
>>>>>>> +    unsigned long idx = ioba>>  SPAPR_TCE_SHIFT;
>>>>>>> +    struct page *page;
>>>>>>> +    u64 *tbl;
>>>>>>> +
>>>>>>> +    /*
>>>>>>> +     * Note on the use of page_address() in real mode,
>>>>>>> +     *
>>>>>>> +     * It is safe to use page_address() in real mode on ppc64 because
>>>>>>> +     * page_address() is always defined as lowmem_page_address()
>>>>>>> +     * which returns __va(PFN_PHYS(page_to_pfn(page))) which is arithmetial
>>>>>>> +     * operation and does not access page struct.
>>>>>>> +     *
>>>>>>> +     * Theoretically page_address() could be defined different
>>>>>>> +     * but either WANT_PAGE_VIRTUAL or HASHED_PAGE_VIRTUAL
>>>>>>> +     * should be enabled.
>>>>>>> +     * WANT_PAGE_VIRTUAL is never enabled on ppc32/ppc64,
>>>>>>> +     * HASHED_PAGE_VIRTUAL could be enabled for ppc32 only and only
>>>>>>> +     * if CONFIG_HIGHMEM is defined. As CONFIG_SPARSEMEM_VMEMMAP
>>>>>>> +     * is not expected to be enabled on ppc32, page_address()
>>>>>>> +     * is safe for ppc32 as well.
>>>>>>> +     */
>>>>>>> +#if defined(HASHED_PAGE_VIRTUAL) || defined(WANT_PAGE_VIRTUAL)
>>>>>>> +#error TODO: fix to avoid page_address() here
>>>>>>> +#endif
>>>>>> 
>>>>>> Can you extract the text above, the check and the page_address call into a
>>>>>> simple wrapper function?
>>>>> 
>>>>> 
>>>>> Is this function also too big? Sorry, I do not understand the comment.
>>>> 
>>>> All of the comment and #if here only deal with the fact that you
>>>> have a real mode hack to call page_address() that happens
>>>> to work under specific circumstances.
>>>> 
>>>> There's nothing kvmppc_tce_put() specific about this.
>>>> The page_address() code happens to get called here, sure.
>>>> But if I read the kvmppc_tce_put() function I don't care about
>>>> these details - I want to understand the code flow that ends
>>>> up writing the TCE.
>>>> 
>>>>>>> +    page = tt->pages[idx / TCES_PER_PAGE];
>>>>>>> +    tbl = (u64 *)page_address(page);
>>>>>>> +
>>>>>>> +    /* udbg_printf("tce @ %p\n",&tbl[idx % TCES_PER_PAGE]); */
>>>>>> 
>>>>>> This is not an RFC, is it?
>>>>> 
>>>>> 
>>>>> Any debug code is prohibited? Ok, I'll remove.
>>>> 
>>>> Debug code that requires code changes is prohibited, yes.
>>>> Debug code that is runtime switchable (pr_debug, trace points, etc)
>>>> are allowed.
>>> 
>>> 
>>> Is there any easy way to enable just this specific udbg_printf (not all of
>>> them at once)? Trace points do not work in real mode as we figured out.
>> 
>> You can enable pr_debug by file IIRC.
> 
> 
> On already running kernel? :-/ Wow. How?

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/dynamic-debug-howto.txt?id=HEAD


Alex
Benjamin Herrenschmidt July 11, 2013, 12:33 p.m. UTC | #8
On Thu, 2013-07-11 at 15:12 +1000, Alexey Kardashevskiy wrote:
> >> Any debug code is prohibited? Ok, I'll remove.
> > 
> > Debug code that requires code changes is prohibited, yes.
> > Debug code that is runtime switchable (pr_debug, trace points, etc)
> > are allowed.

Bollox.

$ grep DBG\( arch/powerpc/ -r | wc -l
418

Also pr_devel is not runtime switchable in normal kernels either and
still an "official" kernel interface.

> Is there any easy way to enable just this specific udbg_printf (not all of
> them at once)? Trace points do not work in real mode as we figured out.

The cleaner way to do it is to use some kind of local macro that you
enable/disable by changing a #define at the top of the function, possibly
several.

Cheers,
Ben.
Benjamin Herrenschmidt July 11, 2013, 12:38 p.m. UTC | #9
On Thu, 2013-07-11 at 12:11 +0200, Alexander Graf wrote:
> > So I must add one more ioctl to enable MULTITCE in kernel handling. Is it
> > what you are saying?
> > 
> > I can see KVM_CHECK_EXTENSION but I do not see KVM_ENABLE_EXTENSION or
> > anything like that.
> 
> KVM_ENABLE_CAP. It's how we enable sPAPR capabilities too.

But in that case I don't see the point.

Cheers,
Ben.
Benjamin Herrenschmidt July 11, 2013, 12:39 p.m. UTC | #10
On Thu, 2013-07-11 at 13:15 +0200, Alexander Graf wrote:
> And that's bad. Jeez, seriously. Don't argue this case. We enable new
> features individually unless we're 100% sure we can keep everything
> working. In this case an ENABLE_CAP doesn't hurt at all, because user
> space still needs to handle the hypercalls if it wants them anyways.
> But you get debugging for free for example.

An ENABLE_CAP is utterly pointless. More bloat. But you seem to like
it :-)

Ben.
Benjamin Herrenschmidt July 11, 2013, 12:40 p.m. UTC | #11
On Thu, 2013-07-11 at 13:15 +0200, Alexander Graf wrote:
> There are 2 ways of dealing with this:
> 
>   1) Call the ENABLE_CAP on every vcpu. That way one CPU may handle
> this hypercall in the kernel while another one may not. The same as we
> handle PAPR today.
> 
>   2) Create a new ENABLE_CAP for the vm.
> 
> I think in this case option 1 is fine - it's how we handle everything
> else already.

So, you are now asking him to chose between a gross horror or adding a
new piece of infrastructure for something that is entirely pointless to
begin with ?

Come on, give him a break. That stuff is fine as it is.

Ben.
Alexander Graf July 11, 2013, 12:51 p.m. UTC | #12
On 11.07.2013, at 14:39, Benjamin Herrenschmidt wrote:

> On Thu, 2013-07-11 at 13:15 +0200, Alexander Graf wrote:
>> And that's bad. Jeez, seriously. Don't argue this case. We enable new
>> features individually unless we're 100% sure we can keep everything
>> working. In this case an ENABLE_CAP doesn't hurt at all, because user
>> space still needs to handle the hypercalls if it wants them anyways.
>> But you get debugging for free for example.
> 
> An ENABLE_CAP is utterly pointless. More bloat. But you seem to like
> it :-)

I don't like bloat usually. But Alexey even had an #ifdef DEBUG in there to selectively disable in-kernel handling of multi-TCE. Not calling ENABLE_CAP would give him exactly that without ugly #ifdefs in the kernel.


Alex

> 
> Ben.
> 
>
Alexey Kardashevskiy July 11, 2013, 12:56 p.m. UTC | #13
On 07/11/2013 10:51 PM, Alexander Graf wrote:
> 
> On 11.07.2013, at 14:39, Benjamin Herrenschmidt wrote:
> 
>> On Thu, 2013-07-11 at 13:15 +0200, Alexander Graf wrote:
>>> And that's bad. Jeez, seriously. Don't argue this case. We enable new
>>> features individually unless we're 100% sure we can keep everything
>>> working. In this case an ENABLE_CAP doesn't hurt at all, because user
>>> space still needs to handle the hypercalls if it wants them anyways.
>>> But you get debugging for free for example.
>>
>> An ENABLE_CAP is utterly pointless. More bloat. But you seem to like
>> it :-)
> 

> I don't like bloat usually. But Alexey even had an #ifdef DEBUG in there
> to selectively disable in-kernel handling of multi-TCE. Not calling
> ENABLE_CAP would give him exactly that without ugly #ifdefs in the
> kernel.


No, it would not give m anithing. My ugly debug was to disable realmode
only and still leave virtual mode on, not to disable both real and virtual
modes. It is a lot easier to disable in kernel handling in QEMU.
Benjamin Herrenschmidt July 11, 2013, 12:58 p.m. UTC | #14
On Thu, 2013-07-11 at 14:51 +0200, Alexander Graf wrote:
> I don't like bloat usually. But Alexey even had an #ifdef DEBUG in
> there to selectively disable in-kernel handling of multi-TCE. Not
> calling ENABLE_CAP would give him exactly that without ugly #ifdefs in
> the kernel.

I don't see much point in disabling it... but ok, if that's a valuable
feature, then shoot some VM level ENABLE_CAP (please don't iterate all
VCPUs, that's gross).

Ben.
Alexander Graf July 11, 2013, 1:11 p.m. UTC | #15
On 11.07.2013, at 14:33, Benjamin Herrenschmidt wrote:

> On Thu, 2013-07-11 at 15:12 +1000, Alexey Kardashevskiy wrote:
>>>> Any debug code is prohibited? Ok, I'll remove.
>>> 
>>> Debug code that requires code changes is prohibited, yes.
>>> Debug code that is runtime switchable (pr_debug, trace points, etc)
>>> are allowed.
> 
> Bollox.
> 
> $ grep DBG\( arch/powerpc/ -r | wc -l
> 418
> 
> Also pr_devel is not runtime switchable in normal kernels either and
> still an "official" kernel interface.
> 
>> Is there any easy way to enable just this specific udbg_printf (not all of
>> them at once)? Trace points do not work in real mode as we figured out.
> 
> The cleaner way to do it is to use some kind of local macro that you
> enable/disable by changing a #define at the top of the function, possibly
> several.

If you do that in a way that doesn't bitrot, that's acceptable, yes. Something that leaves cpp compile checks working while optimizing out the debug code.


Alex
Alexey Kardashevskiy July 11, 2013, 1:13 p.m. UTC | #16
On 07/11/2013 10:58 PM, Benjamin Herrenschmidt wrote:
> On Thu, 2013-07-11 at 14:51 +0200, Alexander Graf wrote:
>> I don't like bloat usually. But Alexey even had an #ifdef DEBUG in
>> there to selectively disable in-kernel handling of multi-TCE. Not
>> calling ENABLE_CAP would give him exactly that without ugly #ifdefs in
>> the kernel.
> 
> I don't see much point in disabling it... but ok, if that's a valuable
> feature, then shoot some VM level ENABLE_CAP (please don't iterate all
> VCPUs, that's gross).

No use for me whatsoever as I only want to disable real more handlers and
keep virtual mode handlers enabled (sometime, for debug only) and this
capability is not about that - I can easily just not enable it in QEMU with
the exactly the same effect.

So please, fellas, decide whether I should iterate vcpu's or add ENABLE_CAP
per KVM. Thanks.
Alexander Graf July 11, 2013, 1:21 p.m. UTC | #17
On 11.07.2013, at 15:13, Alexey Kardashevskiy wrote:

> On 07/11/2013 10:58 PM, Benjamin Herrenschmidt wrote:
>> On Thu, 2013-07-11 at 14:51 +0200, Alexander Graf wrote:
>>> I don't like bloat usually. But Alexey even had an #ifdef DEBUG in
>>> there to selectively disable in-kernel handling of multi-TCE. Not
>>> calling ENABLE_CAP would give him exactly that without ugly #ifdefs in
>>> the kernel.
>> 
>> I don't see much point in disabling it... but ok, if that's a valuable
>> feature, then shoot some VM level ENABLE_CAP (please don't iterate all
>> VCPUs, that's gross).
> 
> No use for me whatsoever as I only want to disable real more handlers and
> keep virtual mode handlers enabled (sometime, for debug only) and this
> capability is not about that - I can easily just not enable it in QEMU with
> the exactly the same effect.
> 
> So please, fellas, decide whether I should iterate vcpu's or add ENABLE_CAP
> per KVM. Thanks.

Thinking hard about this it might actually be ok to not have an ENABLE_CAP for this, if kernel code always works properly, because from the guest's point of view nothing changes - it either gets handled by kernel or by user space. And user space either handles it or doesn't, so it's ok.

Just leave it out this time. But be very weary of adding new features without an ENABLE_CAP switch. They might be guest visible changes.


Alex
diff mbox

Patch

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 6365fef..762c703 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2362,6 +2362,31 @@  calls by the guest for that service will be passed to userspace to be
 handled.
 
 
+4.86 KVM_CAP_PPC_MULTITCE
+
+Capability: KVM_CAP_PPC_MULTITCE
+Architectures: ppc
+Type: vm
+
+This capability means the kernel is capable of handling hypercalls
+H_PUT_TCE_INDIRECT and H_STUFF_TCE without passing those into the user
+space. This significanly accelerates DMA operations for PPC KVM guests.
+The user space should expect that its handlers for these hypercalls
+are not going to be called.
+
+In order to enable H_PUT_TCE_INDIRECT and H_STUFF_TCE use in the guest,
+the user space might have to advertise it for the guest. For example,
+IBM pSeries guest starts using them if "hcall-multi-tce" is present in
+the "ibm,hypertas-functions" device-tree property.
+
+Without this capability, only H_PUT_TCE is handled by the kernel and
+therefore the use of H_PUT_TCE_INDIRECT and H_STUFF_TCE is not recommended
+unless the capability is present as passing hypercalls to the userspace
+slows operations a lot.
+
+Unlike other capabilities of this section, this one is always enabled.
+
+
 5. The kvm_run structure
 ------------------------
 
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index af326cd..20d04bd 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -180,6 +180,7 @@  struct kvmppc_spapr_tce_table {
 	struct kvm *kvm;
 	u64 liobn;
 	u32 window_size;
+	struct { struct { unsigned long put, indir, stuff; } rm, vm; } stat;
 	struct page *pages[0];
 };
 
@@ -609,6 +610,14 @@  struct kvm_vcpu_arch {
 	spinlock_t tbacct_lock;
 	u64 busy_stolen;
 	u64 busy_preempt;
+
+	unsigned long *tce_tmp_hpas;	/* TCE cache for TCE_PUT_INDIRECT hcall */
+	enum {
+		TCERM_NONE,
+		TCERM_GETPAGE,
+		TCERM_PUTTCE,
+		TCERM_PUTLIST,
+	} tce_rm_fail;			/* failed stage of request processing */
 #endif
 };
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index a5287fe..fa722a0 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -133,8 +133,20 @@  extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
 
 extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 				struct kvm_create_spapr_tce *args);
-extern long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
-			     unsigned long ioba, unsigned long tce);
+extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(
+		struct kvm_vcpu *vcpu, unsigned long liobn);
+extern long kvmppc_emulated_validate_tce(unsigned long tce);
+extern void kvmppc_emulated_put_tce(struct kvmppc_spapr_tce_table *tt,
+		unsigned long ioba, unsigned long tce);
+extern long kvmppc_vm_h_put_tce(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce);
+extern long kvmppc_vm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_list, unsigned long npages);
+extern long kvmppc_vm_h_stuff_tce(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_value, unsigned long npages);
 extern long kvm_vm_ioctl_allocate_rma(struct kvm *kvm,
 				struct kvm_allocate_rma *rma);
 extern struct kvmppc_linear_info *kvm_alloc_rma(void);
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index b2d3f3b..99bf4e5 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -14,6 +14,7 @@ 
  *
  * Copyright 2010 Paul Mackerras, IBM Corp. <paulus@au1.ibm.com>
  * Copyright 2011 David Gibson, IBM Corporation <dwg@au1.ibm.com>
+ * Copyright 2013 Alexey Kardashevskiy, IBM Corporation <aik@au1.ibm.com>
  */
 
 #include <linux/types.h>
@@ -36,8 +37,10 @@ 
 #include <asm/ppc-opcode.h>
 #include <asm/kvm_host.h>
 #include <asm/udbg.h>
+#include <asm/iommu.h>
+#include <asm/tce.h>
 
-#define TCES_PER_PAGE	(PAGE_SIZE / sizeof(u64))
+#define ERROR_ADDR      ((void *)~(unsigned long)0x0)
 
 static long kvmppc_stt_npages(unsigned long window_size)
 {
@@ -50,6 +53,20 @@  static void release_spapr_tce_table(struct kvmppc_spapr_tce_table *stt)
 	struct kvm *kvm = stt->kvm;
 	int i;
 
+#define __SV(x) stt->stat.x
+#define __SVD(x) (__SV(rm.x)?(__SV(rm.x)-__SV(vm.x)):0)
+	pr_debug("%s stat for liobn=%llx\n"
+			"--------------- realmode ----- virtmode ---\n"
+			"put_tce       %10ld     %10ld\n"
+			"put_tce_indir %10ld     %10ld\n"
+			"stuff_tce     %10ld     %10ld\n",
+			__func__, stt->liobn,
+			__SVD(put), __SV(vm.put),
+			__SVD(indir), __SV(vm.indir),
+			__SVD(stuff), __SV(vm.stuff));
+#undef __SVD
+#undef __SV
+
 	mutex_lock(&kvm->lock);
 	list_del(&stt->list);
 	for (i = 0; i < kvmppc_stt_npages(stt->window_size); i++)
@@ -148,3 +165,138 @@  fail:
 	}
 	return ret;
 }
+
+/* Converts guest physical address to host virtual address */
+static void __user *kvmppc_vm_gpa_to_hva_and_get(struct kvm_vcpu *vcpu,
+		unsigned long gpa, struct page **pg)
+{
+	unsigned long hva, gfn = gpa >> PAGE_SHIFT;
+	struct kvm_memory_slot *memslot;
+
+	memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
+	if (!memslot)
+		return ERROR_ADDR;
+
+	hva = __gfn_to_hva_memslot(memslot, gfn) + (gpa & ~PAGE_MASK);
+
+	if (get_user_pages_fast(hva & PAGE_MASK, 1, 0, pg) != 1)
+		return ERROR_ADDR;
+
+	return (void *) hva;
+}
+
+long kvmppc_vm_h_put_tce(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce)
+{
+	long ret;
+	struct kvmppc_spapr_tce_table *tt;
+
+	tt = kvmppc_find_tce_table(vcpu, liobn);
+	/* Didn't find the liobn, put it to userspace */
+	if (!tt)
+		return H_TOO_HARD;
+
+	++tt->stat.vm.put;
+
+	if (ioba >= tt->window_size)
+		return H_PARAMETER;
+
+	ret = kvmppc_emulated_validate_tce(tce);
+	if (ret)
+		return ret;
+
+	kvmppc_emulated_put_tce(tt, ioba, tce);
+
+	return H_SUCCESS;
+}
+
+long kvmppc_vm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_list, unsigned long npages)
+{
+	struct kvmppc_spapr_tce_table *tt;
+	long i, ret = H_SUCCESS;
+	unsigned long __user *tces;
+	struct page *pg = NULL;
+
+	tt = kvmppc_find_tce_table(vcpu, liobn);
+	/* Didn't find the liobn, put it to userspace */
+	if (!tt)
+		return H_TOO_HARD;
+
+	++tt->stat.vm.indir;
+
+	/*
+	 * The spec says that the maximum size of the list is 512 TCEs so
+	 * so the whole table addressed resides in 4K page
+	 */
+	if (npages > 512)
+		return H_PARAMETER;
+
+	if (tce_list & ~IOMMU_PAGE_MASK)
+		return H_PARAMETER;
+
+	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
+		return H_PARAMETER;
+
+	tces = kvmppc_vm_gpa_to_hva_and_get(vcpu, tce_list, &pg);
+	if (tces == ERROR_ADDR)
+		return H_TOO_HARD;
+
+	if (vcpu->arch.tce_rm_fail == TCERM_PUTLIST)
+		goto put_list_page_exit;
+
+	for (i = 0; i < npages; ++i) {
+		if (get_user(vcpu->arch.tce_tmp_hpas[i], tces + i)) {
+			ret = H_PARAMETER;
+			goto put_list_page_exit;
+		}
+
+		ret = kvmppc_emulated_validate_tce(vcpu->arch.tce_tmp_hpas[i]);
+		if (ret)
+			goto put_list_page_exit;
+	}
+
+	for (i = 0; i < npages; ++i)
+		kvmppc_emulated_put_tce(tt, ioba + (i << IOMMU_PAGE_SHIFT),
+				vcpu->arch.tce_tmp_hpas[i]);
+put_list_page_exit:
+	if (pg)
+		put_page(pg);
+
+	if (vcpu->arch.tce_rm_fail != TCERM_NONE) {
+		vcpu->arch.tce_rm_fail = TCERM_NONE;
+		if (pg && !PageCompound(pg))
+			put_page(pg); /* finish pending realmode_put_page() */
+	}
+
+	return ret;
+}
+
+long kvmppc_vm_h_stuff_tce(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_value, unsigned long npages)
+{
+	struct kvmppc_spapr_tce_table *tt;
+	long i, ret;
+
+	tt = kvmppc_find_tce_table(vcpu, liobn);
+	/* Didn't find the liobn, put it to userspace */
+	if (!tt)
+		return H_TOO_HARD;
+
+	++tt->stat.vm.stuff;
+
+	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
+		return H_PARAMETER;
+
+	ret = kvmppc_emulated_validate_tce(tce_value);
+	if (ret || (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ)))
+		return H_PARAMETER;
+
+	for (i = 0; i < npages; ++i, ioba += IOMMU_PAGE_SIZE)
+		kvmppc_emulated_put_tce(tt, ioba, tce_value);
+
+	return H_SUCCESS;
+}
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index 30c2f3b..cd3e6f9 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -14,6 +14,7 @@ 
  *
  * Copyright 2010 Paul Mackerras, IBM Corp. <paulus@au1.ibm.com>
  * Copyright 2011 David Gibson, IBM Corporation <dwg@au1.ibm.com>
+ * Copyright 2013 Alexey Kardashevskiy, IBM Corporation <aik@au1.ibm.com>
  */
 
 #include <linux/types.h>
@@ -35,42 +36,243 @@ 
 #include <asm/ppc-opcode.h>
 #include <asm/kvm_host.h>
 #include <asm/udbg.h>
+#include <asm/iommu.h>
+#include <asm/tce.h>
 
 #define TCES_PER_PAGE	(PAGE_SIZE / sizeof(u64))
+#define ERROR_ADDR      (~(unsigned long)0x0)
 
-/* WARNING: This will be called in real-mode on HV KVM and virtual
- *          mode on PR KVM
+/*
+ * Finds a TCE table descriptor by LIOBN
  */
+struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(struct kvm_vcpu *vcpu,
+		unsigned long liobn)
+{
+	struct kvmppc_spapr_tce_table *tt;
+
+	list_for_each_entry(tt, &vcpu->kvm->arch.spapr_tce_tables, list) {
+		if (tt->liobn == liobn)
+			return tt;
+	}
+
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(kvmppc_find_tce_table);
+
+#ifdef DEBUG
+/*
+ * Lets user mode disable realmode handlers by putting big number
+ * in the bottom value of LIOBN
+ */
+#define kvmppc_find_tce_table(a, b) \
+		((((b)&0xffff)>10000)?NULL:kvmppc_find_tce_table((a), (b)))
+#endif
+
+/*
+ * Validates TCE address.
+ * At the moment only flags are validated as other checks will significantly slow
+ * down or can make it even impossible to handle TCE requests in real mode.
+ */
+long kvmppc_emulated_validate_tce(unsigned long tce)
+{
+	if (tce & ~(IOMMU_PAGE_MASK | TCE_PCI_WRITE | TCE_PCI_READ))
+		return H_PARAMETER;
+
+	return H_SUCCESS;
+}
+EXPORT_SYMBOL_GPL(kvmppc_emulated_validate_tce);
+
+/*
+ * Handles TCE requests for QEMU emulated devices.
+ * Puts guest TCE values to the table and expects QEMU to convert them
+ * later in a QEMU device implementation.
+ * Called in both real and virtual modes.
+ * Cannot fail so kvmppc_emulated_validate_tce must be called before it.
+ */
+void kvmppc_emulated_put_tce(struct kvmppc_spapr_tce_table *tt,
+		unsigned long ioba, unsigned long tce)
+{
+	unsigned long idx = ioba >> SPAPR_TCE_SHIFT;
+	struct page *page;
+	u64 *tbl;
+
+	/*
+	 * Note on the use of page_address() in real mode,
+	 *
+	 * It is safe to use page_address() in real mode on ppc64 because
+	 * page_address() is always defined as lowmem_page_address()
+	 * which returns __va(PFN_PHYS(page_to_pfn(page))) which is arithmetial
+	 * operation and does not access page struct.
+	 *
+	 * Theoretically page_address() could be defined different
+	 * but either WANT_PAGE_VIRTUAL or HASHED_PAGE_VIRTUAL
+	 * should be enabled.
+	 * WANT_PAGE_VIRTUAL is never enabled on ppc32/ppc64,
+	 * HASHED_PAGE_VIRTUAL could be enabled for ppc32 only and only
+	 * if CONFIG_HIGHMEM is defined. As CONFIG_SPARSEMEM_VMEMMAP
+	 * is not expected to be enabled on ppc32, page_address()
+	 * is safe for ppc32 as well.
+	 */
+#if defined(HASHED_PAGE_VIRTUAL) || defined(WANT_PAGE_VIRTUAL)
+#error TODO: fix to avoid page_address() here
+#endif
+	page = tt->pages[idx / TCES_PER_PAGE];
+	tbl = (u64 *)page_address(page);
+
+	/* udbg_printf("tce @ %p\n", &tbl[idx % TCES_PER_PAGE]); */
+	tbl[idx % TCES_PER_PAGE] = tce;
+}
+EXPORT_SYMBOL_GPL(kvmppc_emulated_put_tce);
+
+#ifdef CONFIG_KVM_BOOK3S_64_HV
+/*
+ * Converts guest physical address to host physical address.
+ * Tries to increase page counter via realmode_get_page() and
+ * returns ERROR_ADDR if failed.
+ */
+static unsigned long kvmppc_rm_gpa_to_hpa_and_get(struct kvm_vcpu *vcpu,
+		unsigned long gpa, struct page **pg)
+{
+	struct kvm_memory_slot *memslot;
+	pte_t *ptep, pte;
+	unsigned long hva, hpa = ERROR_ADDR;
+	unsigned long gfn = gpa >> PAGE_SHIFT;
+	unsigned shift = 0;
+
+	memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
+	if (!memslot)
+		return ERROR_ADDR;
+
+	hva = __gfn_to_hva_memslot(memslot, gfn);
+
+	ptep = find_linux_pte_or_hugepte(vcpu->arch.pgdir, hva, &shift);
+	if (!ptep || !pte_present(*ptep))
+		return ERROR_ADDR;
+	pte = *ptep;
+
+	if (((gpa & TCE_PCI_WRITE) || pte_write(pte)) && !pte_dirty(pte))
+		return ERROR_ADDR;
+
+	if (!pte_young(pte))
+		return ERROR_ADDR;
+
+	if (!shift)
+		shift = PAGE_SHIFT;
+
+	/* Put huge pages handling to the virtual mode */
+	if (shift > PAGE_SHIFT)
+		return ERROR_ADDR;
+
+	*pg = realmode_pfn_to_page(pte_pfn(pte));
+	if (!*pg || realmode_get_page(*pg))
+		return ERROR_ADDR;
+
+	/* pte_pfn(pte) returns address aligned to pg_size */
+	hpa = (pte_pfn(pte) << PAGE_SHIFT) + (gpa & ((1 << shift) - 1));
+
+	if (unlikely(pte_val(pte) != pte_val(*ptep))) {
+		hpa = ERROR_ADDR;
+		realmode_put_page(*pg);
+		*pg = NULL;
+	}
+
+	return hpa;
+}
+
 long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		      unsigned long ioba, unsigned long tce)
 {
-	struct kvm *kvm = vcpu->kvm;
-	struct kvmppc_spapr_tce_table *stt;
-
-	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
-	/* 	    liobn, ioba, tce); */
-
-	list_for_each_entry(stt, &kvm->arch.spapr_tce_tables, list) {
-		if (stt->liobn == liobn) {
-			unsigned long idx = ioba >> SPAPR_TCE_SHIFT;
-			struct page *page;
-			u64 *tbl;
-
-			/* udbg_printf("H_PUT_TCE: liobn 0x%lx => stt=%p  window_size=0x%x\n", */
-			/* 	    liobn, stt, stt->window_size); */
-			if (ioba >= stt->window_size)
-				return H_PARAMETER;
-
-			page = stt->pages[idx / TCES_PER_PAGE];
-			tbl = (u64 *)page_address(page);
-
-			/* FIXME: Need to validate the TCE itself */
-			/* udbg_printf("tce @ %p\n", &tbl[idx % TCES_PER_PAGE]); */
-			tbl[idx % TCES_PER_PAGE] = tce;
-			return H_SUCCESS;
-		}
+	long ret;
+	struct kvmppc_spapr_tce_table *tt = kvmppc_find_tce_table(vcpu, liobn);
+
+	if (!tt)
+		return H_TOO_HARD;
+
+	++tt->stat.rm.put;
+
+	if (ioba >= tt->window_size)
+		return H_PARAMETER;
+
+	ret = kvmppc_emulated_validate_tce(tce);
+	if (!ret)
+		kvmppc_emulated_put_tce(tt, ioba, tce);
+
+	return ret;
+}
+
+long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_list,	unsigned long npages)
+{
+	struct kvmppc_spapr_tce_table *tt = kvmppc_find_tce_table(vcpu, liobn);
+	long i, ret = H_SUCCESS;
+	unsigned long tces;
+	struct page *pg = NULL;
+
+	if (!tt)
+		return H_TOO_HARD;
+
+	++tt->stat.rm.indir;
+
+	/*
+	 * The spec says that the maximum size of the list is 512 TCEs so
+	 * so the whole table addressed resides in 4K page
+	 */
+	if (npages > 512)
+		return H_PARAMETER;
+
+	if (tce_list & ~IOMMU_PAGE_MASK)
+		return H_PARAMETER;
+
+	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
+		return H_PARAMETER;
+
+	tces = kvmppc_rm_gpa_to_hpa_and_get(vcpu, tce_list, &pg);
+	if (tces == ERROR_ADDR)
+		return H_TOO_HARD;
+
+	for (i = 0; i < npages; ++i) {
+		ret = kvmppc_emulated_validate_tce(((unsigned long *)tces)[i]);
+		if (ret)
+			goto put_unlock_exit;
+	}
+
+	for (i = 0; i < npages; ++i)
+		kvmppc_emulated_put_tce(tt, ioba + (i << IOMMU_PAGE_SHIFT),
+				((unsigned long *)tces)[i]);
+
+put_unlock_exit:
+	if (!ret && pg && !PageCompound(pg) && realmode_put_page(pg)) {
+		vcpu->arch.tce_rm_fail = TCERM_PUTLIST;
+		ret = H_TOO_HARD;
 	}
 
-	/* Didn't find the liobn, punt it to userspace */
-	return H_TOO_HARD;
+	return ret;
+}
+
+long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_value, unsigned long npages)
+{
+	struct kvmppc_spapr_tce_table *tt;
+	long i, ret;
+
+	tt = kvmppc_find_tce_table(vcpu, liobn);
+	if (!tt)
+		return H_TOO_HARD;
+
+	++tt->stat.rm.stuff;
+
+	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
+		return H_PARAMETER;
+
+	ret = kvmppc_emulated_validate_tce(tce_value);
+	if (ret || (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ)))
+		return H_PARAMETER;
+
+	for (i = 0; i < npages; ++i, ioba += IOMMU_PAGE_SIZE)
+		kvmppc_emulated_put_tce(tt, ioba, tce_value);
+
+	return H_SUCCESS;
 }
+#endif /* CONFIG_KVM_BOOK3S_64_HV */
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 550f592..ac41d01 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -567,7 +567,31 @@  int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
 		if (kvmppc_xics_enabled(vcpu)) {
 			ret = kvmppc_xics_hcall(vcpu, req);
 			break;
-		} /* fallthrough */
+		}
+		return RESUME_HOST;
+	case H_PUT_TCE:
+		ret = kvmppc_vm_h_put_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
+						kvmppc_get_gpr(vcpu, 5),
+						kvmppc_get_gpr(vcpu, 6));
+		if (ret == H_TOO_HARD)
+			return RESUME_HOST;
+		break;
+	case H_PUT_TCE_INDIRECT:
+		ret = kvmppc_vm_h_put_tce_indirect(vcpu, kvmppc_get_gpr(vcpu, 4),
+						kvmppc_get_gpr(vcpu, 5),
+						kvmppc_get_gpr(vcpu, 6),
+						kvmppc_get_gpr(vcpu, 7));
+		if (ret == H_TOO_HARD)
+			return RESUME_HOST;
+		break;
+	case H_STUFF_TCE:
+		ret = kvmppc_vm_h_stuff_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
+						kvmppc_get_gpr(vcpu, 5),
+						kvmppc_get_gpr(vcpu, 6),
+						kvmppc_get_gpr(vcpu, 7));
+		if (ret == H_TOO_HARD)
+			return RESUME_HOST;
+		break;
 	default:
 		return RESUME_HOST;
 	}
@@ -958,6 +982,20 @@  struct kvm_vcpu *kvmppc_core_vcpu_create(struct kvm *kvm, unsigned int id)
 	vcpu->arch.cpu_type = KVM_CPU_3S_64;
 	kvmppc_sanity_check(vcpu);
 
+	/*
+	 * As we want to minimize the chance of having H_PUT_TCE_INDIRECT
+	 * half executed, we first read TCEs from the user, check them and
+	 * return error if something went wrong and only then put TCEs into
+	 * the TCE table.
+	 *
+	 * tce_tmp_hpas is a cache for TCEs to avoid stack allocation or
+	 * kmalloc as the whole TCE list can take up to 512 items 8 bytes
+	 * each (4096 bytes).
+	 */
+	vcpu->arch.tce_tmp_hpas = kmalloc(4096, GFP_KERNEL);
+	if (!vcpu->arch.tce_tmp_hpas)
+		goto free_vcpu;
+
 	return vcpu;
 
 free_vcpu:
@@ -980,6 +1018,7 @@  void kvmppc_core_vcpu_free(struct kvm_vcpu *vcpu)
 	unpin_vpa(vcpu->kvm, &vcpu->arch.slb_shadow);
 	unpin_vpa(vcpu->kvm, &vcpu->arch.vpa);
 	spin_unlock(&vcpu->arch.vpa_update_lock);
+	kfree(vcpu->arch.tce_tmp_hpas);
 	kvm_vcpu_uninit(vcpu);
 	kmem_cache_free(kvm_vcpu_cache, vcpu);
 }
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index b02f91e..d35554e 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -1490,6 +1490,12 @@  hcall_real_table:
 	.long	0		/* 0x11c */
 	.long	0		/* 0x120 */
 	.long	.kvmppc_h_bulk_remove - hcall_real_table
+	.long	0		/* 0x128 */
+	.long	0		/* 0x12c */
+	.long	0		/* 0x130 */
+	.long	0		/* 0x134 */
+	.long	.kvmppc_h_stuff_tce - hcall_real_table
+	.long	.kvmppc_h_put_tce_indirect - hcall_real_table
 hcall_real_table_end:
 
 ignore_hdec:
diff --git a/arch/powerpc/kvm/book3s_pr_papr.c b/arch/powerpc/kvm/book3s_pr_papr.c
index da0e0bc..edfea88 100644
--- a/arch/powerpc/kvm/book3s_pr_papr.c
+++ b/arch/powerpc/kvm/book3s_pr_papr.c
@@ -220,7 +220,38 @@  static int kvmppc_h_pr_put_tce(struct kvm_vcpu *vcpu)
 	unsigned long tce = kvmppc_get_gpr(vcpu, 6);
 	long rc;
 
-	rc = kvmppc_h_put_tce(vcpu, liobn, ioba, tce);
+	rc = kvmppc_vm_h_put_tce(vcpu, liobn, ioba, tce);
+	if (rc == H_TOO_HARD)
+		return EMULATE_FAIL;
+	kvmppc_set_gpr(vcpu, 3, rc);
+	return EMULATE_DONE;
+}
+
+static int kvmppc_h_pr_put_tce_indirect(struct kvm_vcpu *vcpu)
+{
+	unsigned long liobn = kvmppc_get_gpr(vcpu, 4);
+	unsigned long ioba = kvmppc_get_gpr(vcpu, 5);
+	unsigned long tce = kvmppc_get_gpr(vcpu, 6);
+	unsigned long npages = kvmppc_get_gpr(vcpu, 7);
+	long rc;
+
+	rc = kvmppc_vm_h_put_tce_indirect(vcpu, liobn, ioba,
+			tce, npages);
+	if (rc == H_TOO_HARD)
+		return EMULATE_FAIL;
+	kvmppc_set_gpr(vcpu, 3, rc);
+	return EMULATE_DONE;
+}
+
+static int kvmppc_h_pr_stuff_tce(struct kvm_vcpu *vcpu)
+{
+	unsigned long liobn = kvmppc_get_gpr(vcpu, 4);
+	unsigned long ioba = kvmppc_get_gpr(vcpu, 5);
+	unsigned long tce_value = kvmppc_get_gpr(vcpu, 6);
+	unsigned long npages = kvmppc_get_gpr(vcpu, 7);
+	long rc;
+
+	rc = kvmppc_vm_h_stuff_tce(vcpu, liobn, ioba, tce_value, npages);
 	if (rc == H_TOO_HARD)
 		return EMULATE_FAIL;
 	kvmppc_set_gpr(vcpu, 3, rc);
@@ -247,6 +278,10 @@  int kvmppc_h_pr(struct kvm_vcpu *vcpu, unsigned long cmd)
 		return kvmppc_h_pr_bulk_remove(vcpu);
 	case H_PUT_TCE:
 		return kvmppc_h_pr_put_tce(vcpu);
+	case H_PUT_TCE_INDIRECT:
+		return kvmppc_h_pr_put_tce_indirect(vcpu);
+	case H_STUFF_TCE:
+		return kvmppc_h_pr_stuff_tce(vcpu);
 	case H_CEDE:
 		vcpu->arch.shared->msr |= MSR_EE;
 		kvm_vcpu_block(vcpu);
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 6316ee3..ccb578b 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -394,6 +394,9 @@  int kvm_dev_ioctl_check_extension(long ext)
 	case KVM_CAP_PPC_GET_SMMU_INFO:
 		r = 1;
 		break;
+	case KVM_CAP_SPAPR_MULTITCE:
+		r = 1;
+		break;
 #endif
 	default:
 		r = 0;