Patchwork [Precise,pre-up] Pick k(un)map_atomic fix

login
register
mail settings
Submitter Stefan Bader
Date Dec. 1, 2011, 4:04 p.m.
Message ID <1322755457-6024-1-git-send-email-stefan.bader@canonical.com>
Download mbox | patch
Permalink /patch/128721/
State New
Headers show

Comments

Stefan Bader - Dec. 1, 2011, 4:04 p.m.
This has been verified in older releases and we carry it in Natty
and in Oneiric. Unfortunately its upstreaming into 3.2 is not
certain for rather procedural reasons. Quoting Andrew (not amused)
Morton:

"I sent this patch to the x86 maintainers two weeks ago.  It was
ignored, as were the other 11 patches I sent.  Later I will resend them
all.  If they are again ignored I will later send them yet again, and
so on."

But since it has been verified to cure rather nasty failures in the
cloud (Xen) we should put it into Precise right now. If it lands
upstream in time it just can be rebase out of existence.

Maybe it needs "UBUNTU SAUCE:" tagging...

-Stefan

From b39e4363068122a5d36a26cc656c365d2341c1d8 Mon Sep 17 00:00:00 2001
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Date: Wed, 30 Nov 2011 15:03:08 +1100
Subject: [PATCH] x86/paravirt: PTE updates in k(un)map_atomic need to be
 synchronous, regardless of lazy_mmu mode

Fix an outstanding issue that has been reported since 2.6.37.  Under a
heavy loaded machine processing "fork()" calls could crash with:

BUG: unable to handle kernel paging request at f573fc8c
IP: [<c01abc54>] swap_count_continued+0x104/0x180
*pdpt = 000000002a3b9027 *pde = 0000000001bed067 *pte = 0000000000000000
Oops: 0000 [#1] SMP
Modules linked in:
Pid: 1638, comm: apache2 Not tainted 3.0.4-linode37 #1
EIP: 0061:[<c01abc54>] EFLAGS: 00210246 CPU: 3
EIP is at swap_count_continued+0x104/0x180
.. snip..
Call Trace:
 [<c01ac222>] ? __swap_duplicate+0xc2/0x160
 [<c01040f7>] ? pte_mfn_to_pfn+0x87/0xe0
 [<c01ac2e4>] ? swap_duplicate+0x14/0x40
 [<c01a0a6b>] ? copy_pte_range+0x45b/0x500
 [<c01a0ca5>] ? copy_page_range+0x195/0x200
 [<c01328c6>] ? dup_mmap+0x1c6/0x2c0
 [<c0132cf8>] ? dup_mm+0xa8/0x130
 [<c013376a>] ? copy_process+0x98a/0xb30
 [<c013395f>] ? do_fork+0x4f/0x280
 [<c01573b3>] ? getnstimeofday+0x43/0x100
 [<c010f770>] ? sys_clone+0x30/0x40
 [<c06c048d>] ? ptregs_clone+0x15/0x48
 [<c06bfb71>] ? syscall_call+0x7/0xb

The problem is that in copy_page_range() we turn lazy mode on, and then in
swap_entry_free() we call swap_count_continued() which ends up in:

         map = kmap_atomic(page, KM_USER0) + offset;

and then later we touch *map.

Since we are running in batched mode (lazy) we don't actually set up the
PTE mappings and the kmap_atomic is not done synchronously and ends up
trying to dereference a page that has not been set.

Looking at kmap_atomic_prot_pfn(), it uses 'arch_flush_lazy_mmu_mode' and
doing the same in kmap_atomic_prot() and __kunmap_atomic() makes the problem
go away.

Interestingly, commit b8bcfe997e4615 ("x86/paravirt: remove lazy mode in
interrupts") removed part of this to fix an interrupt issue - but it went
to far and did not consider this scenario.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

BugLink: http://bugs.launchpad.net/bugs/854050
(cherry-picked from b39e4363068122a5d36a26cc656c365d2341c1d8 linux-next)
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
---
 arch/x86/mm/highmem_32.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)
Leann Ogasawara - Dec. 2, 2011, 6:53 p.m.
Applied to precise master-next.

Thanks,
Leann

On Thu, 2011-12-01 at 17:04 +0100, Stefan Bader wrote:
> This has been verified in older releases and we carry it in Natty
> and in Oneiric. Unfortunately its upstreaming into 3.2 is not
> certain for rather procedural reasons. Quoting Andrew (not amused)
> Morton:
> 
> "I sent this patch to the x86 maintainers two weeks ago.  It was
> ignored, as were the other 11 patches I sent.  Later I will resend them
> all.  If they are again ignored I will later send them yet again, and
> so on."
> 
> But since it has been verified to cure rather nasty failures in the
> cloud (Xen) we should put it into Precise right now. If it lands
> upstream in time it just can be rebase out of existence.
> 
> Maybe it needs "UBUNTU SAUCE:" tagging...
> 
> -Stefan
> 
> From b39e4363068122a5d36a26cc656c365d2341c1d8 Mon Sep 17 00:00:00 2001
> From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> Date: Wed, 30 Nov 2011 15:03:08 +1100
> Subject: [PATCH] x86/paravirt: PTE updates in k(un)map_atomic need to be
>  synchronous, regardless of lazy_mmu mode
> 
> Fix an outstanding issue that has been reported since 2.6.37.  Under a
> heavy loaded machine processing "fork()" calls could crash with:
> 
> BUG: unable to handle kernel paging request at f573fc8c
> IP: [<c01abc54>] swap_count_continued+0x104/0x180
> *pdpt = 000000002a3b9027 *pde = 0000000001bed067 *pte = 0000000000000000
> Oops: 0000 [#1] SMP
> Modules linked in:
> Pid: 1638, comm: apache2 Not tainted 3.0.4-linode37 #1
> EIP: 0061:[<c01abc54>] EFLAGS: 00210246 CPU: 3
> EIP is at swap_count_continued+0x104/0x180
> .. snip..
> Call Trace:
>  [<c01ac222>] ? __swap_duplicate+0xc2/0x160
>  [<c01040f7>] ? pte_mfn_to_pfn+0x87/0xe0
>  [<c01ac2e4>] ? swap_duplicate+0x14/0x40
>  [<c01a0a6b>] ? copy_pte_range+0x45b/0x500
>  [<c01a0ca5>] ? copy_page_range+0x195/0x200
>  [<c01328c6>] ? dup_mmap+0x1c6/0x2c0
>  [<c0132cf8>] ? dup_mm+0xa8/0x130
>  [<c013376a>] ? copy_process+0x98a/0xb30
>  [<c013395f>] ? do_fork+0x4f/0x280
>  [<c01573b3>] ? getnstimeofday+0x43/0x100
>  [<c010f770>] ? sys_clone+0x30/0x40
>  [<c06c048d>] ? ptregs_clone+0x15/0x48
>  [<c06bfb71>] ? syscall_call+0x7/0xb
> 
> The problem is that in copy_page_range() we turn lazy mode on, and then in
> swap_entry_free() we call swap_count_continued() which ends up in:
> 
>          map = kmap_atomic(page, KM_USER0) + offset;
> 
> and then later we touch *map.
> 
> Since we are running in batched mode (lazy) we don't actually set up the
> PTE mappings and the kmap_atomic is not done synchronously and ends up
> trying to dereference a page that has not been set.
> 
> Looking at kmap_atomic_prot_pfn(), it uses 'arch_flush_lazy_mmu_mode' and
> doing the same in kmap_atomic_prot() and __kunmap_atomic() makes the problem
> go away.
> 
> Interestingly, commit b8bcfe997e4615 ("x86/paravirt: remove lazy mode in
> interrupts") removed part of this to fix an interrupt issue - but it went
> to far and did not consider this scenario.
> 
> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Cc: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
> Cc: <stable@kernel.org>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> 
> BugLink: http://bugs.launchpad.net/bugs/854050
> (cherry-picked from b39e4363068122a5d36a26cc656c365d2341c1d8 linux-next)
> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
> ---
>  arch/x86/mm/highmem_32.c |    2 ++
>  1 files changed, 2 insertions(+), 0 deletions(-)
> 
> diff --git a/arch/x86/mm/highmem_32.c b/arch/x86/mm/highmem_32.c
> index b499626..f4f29b1 100644
> --- a/arch/x86/mm/highmem_32.c
> +++ b/arch/x86/mm/highmem_32.c
> @@ -45,6 +45,7 @@ void *kmap_atomic_prot(struct page *page, pgprot_t prot)
>  	vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
>  	BUG_ON(!pte_none(*(kmap_pte-idx)));
>  	set_pte(kmap_pte-idx, mk_pte(page, prot));
> +	arch_flush_lazy_mmu_mode();
>  
>  	return (void *)vaddr;
>  }
> @@ -88,6 +89,7 @@ void __kunmap_atomic(void *kvaddr)
>  		 */
>  		kpte_clear_flush(kmap_pte-idx, vaddr);
>  		kmap_atomic_idx_pop();
> +		arch_flush_lazy_mmu_mode();
>  	}
>  #ifdef CONFIG_DEBUG_HIGHMEM
>  	else {
> -- 
> 1.7.5.4
> 
>

Patch

diff --git a/arch/x86/mm/highmem_32.c b/arch/x86/mm/highmem_32.c
index b499626..f4f29b1 100644
--- a/arch/x86/mm/highmem_32.c
+++ b/arch/x86/mm/highmem_32.c
@@ -45,6 +45,7 @@  void *kmap_atomic_prot(struct page *page, pgprot_t prot)
 	vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
 	BUG_ON(!pte_none(*(kmap_pte-idx)));
 	set_pte(kmap_pte-idx, mk_pte(page, prot));
+	arch_flush_lazy_mmu_mode();
 
 	return (void *)vaddr;
 }
@@ -88,6 +89,7 @@  void __kunmap_atomic(void *kvaddr)
 		 */
 		kpte_clear_flush(kmap_pte-idx, vaddr);
 		kmap_atomic_idx_pop();
+		arch_flush_lazy_mmu_mode();
 	}
 #ifdef CONFIG_DEBUG_HIGHMEM
 	else {