
[1/4] KVM: PPC: Book3S HV: Fix physical address calculations

Message ID 1384584365-18481-2-git-send-email-paulus@samba.org
State New, archived

Commit Message

Paul Mackerras Nov. 16, 2013, 6:46 a.m. UTC
This fixes a bug in kvmppc_do_h_enter() where the physical address
for a page can be calculated incorrectly if transparent huge pages
(THP) are active.  Until THP came along, it was true that if we
encountered a large (16M) page in kvmppc_do_h_enter(), then the
associated memslot must be 16M aligned for both its guest physical
address and the userspace address, and the physical address
calculations in kvmppc_do_h_enter() assumed that.  With THP, that
is no longer true.

In the case where we are using MMU notifiers and the page size that
we get from the Linux page tables is larger than the page being mapped
by the guest, we need to fill in some low-order bits of the physical
address.  Without THP, these bits would be the same in the guest
physical address (gpa) and the host virtual address (hva).  With THP,
they can be different, and we need to use the bits from hva rather
than gpa.
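
For example (hypothetical addresses): suppose a memslot maps guest
physical address 0xa00000 at userspace address 0x3ffe91000000, and THP
has backed that 16M-aligned hva range with a huge page.  The 64k guest
page at gpa 0x1000000 then lives at hva 0x3ffe91600000, i.e. at offset
0x600000 into the huge page, whereas gpa & (pte_size - 1) is 0.  Using
the gpa bits would point the HPTE at the start of the huge page, 6M
away from the right data; the hva bits give the correct offset.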

In the case where we are not using MMU notifiers, the host physical
address we get from the memslot->arch.slot_phys[] array already
includes the low-order bits down to the PAGE_SIZE level, even if
we are using large pages.  Thus we can simplify the calculation in
this case to just add in the remaining bits in the case where
PAGE_SIZE is 64k and the guest is mapping a 4k page.
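
In sketch form, the simplified non-notifier calculation is just (pa
starts out as the slot_phys[] entry, whose low-order bits hold flags
rather than address bits):

	pa &= PAGE_MASK;	/* already correct down to PAGE_SIZE */
	pa |= gpa & ~PAGE_MASK;	/* e.g. adds 0x9000 for a 4k page at gpa ...9000 */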

The same bug exists in kvmppc_book3s_hv_page_fault().  The basic fix
is to use psize (the page size from the HPTE) rather than pte_size
(the page size from the Linux PTE) when updating the HPTE low word
in r.  That means that pfn needs to be computed to PAGE_SIZE
granularity even if the Linux PTE is a huge page PTE.  That can be
arranged simply by doing the page_to_pfn() before setting page to
the head of the compound page.  If psize is less than PAGE_SIZE,
then we need to make sure we only update the bits from PAGE_SIZE
upwards, in order not to lose any sub-page offset bits in r.
On the other hand, if psize is greater than PAGE_SIZE, we need to
make sure we don't bring in non-zero low order bits in pfn, hence
we mask (pfn << PAGE_SHIFT) with ~(psize - 1).
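
To make the masking concrete, here is a small stand-alone sketch (a
user-space toy with made-up numbers, not the kernel code), assuming a
64k-page host and the usual HPTE_R_PP0 value:

	#include <stdio.h>

	#define PAGE_SHIFT	16			/* assume a 64k-page host */
	#define PAGE_SIZE	(1UL << PAGE_SHIFT)
	#define HPTE_R_PP0	0x8000000000000000UL

	/* Same arithmetic as the fix: psize is the page size from the HPTE,
	 * pfn is at PAGE_SIZE granularity. */
	static unsigned long set_hpte_pa(unsigned long r, unsigned long pfn,
					 unsigned long psize)
	{
		if (psize < PAGE_SIZE)
			psize = PAGE_SIZE;	/* only update bits from PAGE_SIZE up */
		return (r & ~(HPTE_R_PP0 - psize)) |
		       ((pfn << PAGE_SHIFT) & ~(psize - 1));
	}

	int main(void)
	{
		/* 4k guest page in a 64k host page: the 0x3000 sub-page
		 * offset already in r survives -> 12343000 */
		printf("%lx\n", set_hpte_pa(0x3000, 0x1234, 0x1000));

		/* 16M guest page backed by a huge page: pfn points at the
		 * faulting 64k piece, the mask recovers the 16M-aligned
		 * base and the low bits of r are kept -> 1230000f0 */
		printf("%lx\n", set_hpte_pa(0xf0, 0x12345, 0x1000000));
		return 0;
	}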

Cc: stable@vger.kernel.org # v3.11+
Signed-off-by: Paul Mackerras <paulus@samba.org>
---
 arch/powerpc/kvm/book3s_64_mmu_hv.c | 12 +++++++++---
 arch/powerpc/kvm/book3s_hv_rm_mmu.c |  4 ++--
 2 files changed, 11 insertions(+), 5 deletions(-)

Comments

Alexey Kardashevskiy Dec. 6, 2013, 10:52 a.m. UTC | #1
On 11/16/2013 05:46 PM, Paul Mackerras wrote:
> This fixes a bug in kvmppc_do_h_enter() where the physical address
> for a page can be calculated incorrectly if transparent huge pages
> (THP) are active.  Until THP came along, it was true that if we
> encountered a large (16M) page in kvmppc_do_h_enter(), then the
> associated memslot must be 16M aligned for both its guest physical
> address and the userspace address, and the physical address
> calculations in kvmppc_do_h_enter() assumed that.  With THP, that
> is no longer true.
> 
> In the case where we are using MMU notifiers and the page size that
> we get from the Linux page tables is larger than the page being mapped
> by the guest, we need to fill in some low-order bits of the physical
> address.  Without THP, these bits would be the same in the guest
> physical address (gpa) and the host virtual address (hva).  With THP,
> they can be different, and we need to use the bits from hva rather
> than gpa.
> 
> In the case where we are not using MMU notifiers, the host physical
> address we get from the memslot->arch.slot_phys[] array already
> includes the low-order bits down to the PAGE_SIZE level, even if
> we are using large pages.  Thus we can simplify the calculation in
> this case to just add in the remaining bits in the case where
> PAGE_SIZE is 64k and the guest is mapping a 4k page.
> 
> The same bug exists in kvmppc_book3s_hv_page_fault().  The basic fix
> is to use psize (the page size from the HPTE) rather than pte_size
> (the page size from the Linux PTE) when updating the HPTE low word
> in r.  That means that pfn needs to be computed to PAGE_SIZE
> granularity even if the Linux PTE is a huge page PTE.  That can be
> arranged simply by doing the page_to_pfn() before setting page to
> the head of the compound page.  If psize is less than PAGE_SIZE,
> then we need to make sure we only update the bits from PAGE_SIZE
> upwards, in order not to lose any sub-page offset bits in r.
> On the other hand, if psize is greater than PAGE_SIZE, we need to
> make sure we don't bring in non-zero low order bits in pfn, hence
> we mask (pfn << PAGE_SHIFT) with ~(psize - 1).


This patch breaks VFIO with an Intel E1000E on my p5ioc2 machine (vpl2): the
guest tries to allocate MSIs via ibm,change_msi and pauses for a while (more
than 10 minutes), then continues, but the ethernet adapter does not work
anyway.  Disabling MSI in QEMU does not help either.  I'll debug on Monday,
but quick ideas are welcome.

Host kernel config, huge page bits:
# CONFIG_TRANSPARENT_HUGEPAGE is not set
CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE=y

Huge pages are not enabled in QEMU.




> Cc: stable@vger.kernel.org # v3.11+
> Signed-off-by: Paul Mackerras <paulus@samba.org>
> ---
>  arch/powerpc/kvm/book3s_64_mmu_hv.c | 12 +++++++++---
>  arch/powerpc/kvm/book3s_hv_rm_mmu.c |  4 ++--
>  2 files changed, 11 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
> index f3ff587..47bbeaf 100644
> --- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
> +++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
> @@ -665,6 +665,7 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu,
>  			return -EFAULT;
>  	} else {
>  		page = pages[0];
> +		pfn = page_to_pfn(page);
>  		if (PageHuge(page)) {
>  			page = compound_head(page);
>  			pte_size <<= compound_order(page);
> @@ -689,7 +690,6 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu,
>  			}
>  			rcu_read_unlock_sched();
>  		}
> -		pfn = page_to_pfn(page);
>  	}
>  
>  	ret = -EFAULT;
> @@ -707,8 +707,14 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu,
>  		r = (r & ~(HPTE_R_W|HPTE_R_I|HPTE_R_G)) | HPTE_R_M;
>  	}
>  
> -	/* Set the HPTE to point to pfn */
> -	r = (r & ~(HPTE_R_PP0 - pte_size)) | (pfn << PAGE_SHIFT);
> +	/*
> +	 * Set the HPTE to point to pfn.
> +	 * Since the pfn is at PAGE_SIZE granularity, make sure we
> +	 * don't mask out lower-order bits if psize < PAGE_SIZE.
> +	 */
> +	if (psize < PAGE_SIZE)
> +		psize = PAGE_SIZE;
> +	r = (r & ~(HPTE_R_PP0 - psize)) | ((pfn << PAGE_SHIFT) & ~(psize - 1));
>  	if (hpte_is_writable(r) && !write_ok)
>  		r = hpte_make_readonly(r);
>  	ret = RESUME_GUEST;
> diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
> index 9c51544..fddbf98 100644
> --- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
> +++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
> @@ -225,6 +225,7 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
>  		is_io = pa & (HPTE_R_I | HPTE_R_W);
>  		pte_size = PAGE_SIZE << (pa & KVMPPC_PAGE_ORDER_MASK);
>  		pa &= PAGE_MASK;
> +		pa |= gpa & ~PAGE_MASK;
>  	} else {
>  		/* Translate to host virtual address */
>  		hva = __gfn_to_hva_memslot(memslot, gfn);
> @@ -238,13 +239,12 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
>  				ptel = hpte_make_readonly(ptel);
>  			is_io = hpte_cache_bits(pte_val(pte));
>  			pa = pte_pfn(pte) << PAGE_SHIFT;
> +			pa |= hva & (pte_size - 1);
>  		}
>  	}
>  
>  	if (pte_size < psize)
>  		return H_PARAMETER;
> -	if (pa && pte_size > psize)
> -		pa |= gpa & (pte_size - 1);
>  
>  	ptel &= ~(HPTE_R_PP0 - psize);
>  	ptel |= pa;
>
Paul Mackerras Dec. 10, 2013, 5:38 a.m. UTC | #2
On Fri, Dec 06, 2013 at 09:52:31PM +1100, Alexey Kardashevskiy wrote:
> This patch breaks VFIO with an Intel E1000E on my p5ioc2 machine (vpl2): the
> guest tries to allocate MSIs via ibm,change_msi and pauses for a while (more
> than 10 minutes), then continues, but the ethernet adapter does not work
> anyway.  Disabling MSI in QEMU does not help either.  I'll debug on Monday,
> but quick ideas are welcome.

Looks like I'm losing some low-order address bits in kvmppc_do_h_enter
when the guest creates a 4k HPTE.

Alex, have you merged this patch in an immutable branch yet?  Do you
want an incremental patch on top, or a revised version of this patch?

Regards,
Paul.

Patch

diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index f3ff587..47bbeaf 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -665,6 +665,7 @@  int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu,
 			return -EFAULT;
 	} else {
 		page = pages[0];
+		pfn = page_to_pfn(page);
 		if (PageHuge(page)) {
 			page = compound_head(page);
 			pte_size <<= compound_order(page);
@@ -689,7 +690,6 @@  int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu,
 			}
 			rcu_read_unlock_sched();
 		}
-		pfn = page_to_pfn(page);
 	}
 
 	ret = -EFAULT;
@@ -707,8 +707,14 @@  int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu,
 		r = (r & ~(HPTE_R_W|HPTE_R_I|HPTE_R_G)) | HPTE_R_M;
 	}
 
-	/* Set the HPTE to point to pfn */
-	r = (r & ~(HPTE_R_PP0 - pte_size)) | (pfn << PAGE_SHIFT);
+	/*
+	 * Set the HPTE to point to pfn.
+	 * Since the pfn is at PAGE_SIZE granularity, make sure we
+	 * don't mask out lower-order bits if psize < PAGE_SIZE.
+	 */
+	if (psize < PAGE_SIZE)
+		psize = PAGE_SIZE;
+	r = (r & ~(HPTE_R_PP0 - psize)) | ((pfn << PAGE_SHIFT) & ~(psize - 1));
 	if (hpte_is_writable(r) && !write_ok)
 		r = hpte_make_readonly(r);
 	ret = RESUME_GUEST;
diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 9c51544..fddbf98 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -225,6 +225,7 @@  long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
 		is_io = pa & (HPTE_R_I | HPTE_R_W);
 		pte_size = PAGE_SIZE << (pa & KVMPPC_PAGE_ORDER_MASK);
 		pa &= PAGE_MASK;
+		pa |= gpa & ~PAGE_MASK;
 	} else {
 		/* Translate to host virtual address */
 		hva = __gfn_to_hva_memslot(memslot, gfn);
@@ -238,13 +239,12 @@  long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
 				ptel = hpte_make_readonly(ptel);
 			is_io = hpte_cache_bits(pte_val(pte));
 			pa = pte_pfn(pte) << PAGE_SHIFT;
+			pa |= hva & (pte_size - 1);
 		}
 	}
 
 	if (pte_size < psize)
 		return H_PARAMETER;
-	if (pa && pte_size > psize)
-		pa |= gpa & (pte_size - 1);
 
 	ptel &= ~(HPTE_R_PP0 - psize);
 	ptel |= pa;