Patchwork PowerPC BUG: using smp_processor_id() in preemptible code

Submitter Hugh Dickins
Date Feb. 24, 2011, 8:47 p.m.
Message ID <alpine.LSU.2.00.1102241216420.1708@sister.anvils>
Permalink /patch/84485/
State Accepted

Comments

Hugh Dickins - Feb. 24, 2011, 8:47 p.m.
On Thu, 30 Dec 2010, Benjamin Herrenschmidt wrote:
> On Wed, 2010-12-29 at 14:54 -0800, Hugh Dickins wrote:
> > With recent 2.6.37-rc, with CONFIG_PREEMPT=y CONFIG_DEBUG_PREEMPT=y
> > on the PowerPC G5, I get spammed by BUG warnings each time I swapoff:
> > 
> > BUG: using smp_processor_id() in preemptible [00000000] code: swapoff/3974
> > caller is .hpte_need_flush+0x4c/0x2e8
> > Call Trace:
> > [c0000001b4a3f830] [c00000000000f3cc] .show_stack+0x6c/0x16c (unreliable)
> > [c0000001b4a3f8e0] [c00000000023eda0] .debug_smp_processor_id+0xe4/0x11c
> > [c0000001b4a3f970] [c00000000002f2f4] .hpte_need_flush+0x4c/0x2e8
> > [c0000001b4a3fa30] [c0000000000e7ef8] .vunmap_pud_range+0x148/0x200
> > [c0000001b4a3fb10] [c0000000000e8058] .vunmap_page_range+0xa8/0xd4
> > [c0000001b4a3fbb0] [c0000000000e80a4] .free_unmap_vmap_area+0x20/0x38
> > [c0000001b4a3fc40] [c0000000000e8138] .remove_vm_area+0x7c/0xb4
> > [c0000001b4a3fcd0] [c0000000000e8308] .__vunmap+0x50/0x104
> > [c0000001b4a3fd60] [c0000000000ef3fc] .SyS_swapoff+0x59c/0x6a8
> > [c0000001b4a3fe30] [c0000000000075a8] syscall_exit+0x0/0x40
> > 
> > I notice hpte_need_flush() itself acknowledges
> >  * Must be called from within some kind of spinlock/non-preempt region...
> 
> Yes, we assume that the PTE lock is always held when modifying page
> tables...

Right, but not for the kernel page tables.  I remember from doing the
pte_offset_map_lock() work that the kernel would already be in trouble
if it did not have higher-level serialization for its pte updates, so
there was no point in taking a page_table_lock for them.

> 
> > Though I didn't actually bisect, I believe this is since Jeremy's
> > 64141da587241301ce8638cc945f8b67853156ec "vmalloc: eagerly clear ptes
> > on vunmap", which moves a call to vunmap_page_range() from one place
> > (which happened to be inside a spinlock) to another (where it's not).
> > 
> > I guess my warnings would be easily silenced by moving that call to
> > vunmap_page_range() down just inside the spinlock below it; but I'm
> > dubious that that's the right fix - it looked as if there are other
> > paths through vmalloc.c where vunmap_page_range() has been getting
> > called without preemption disabled, long before Jeremy's change,
> > just paths that I never happen to go down in my limited testing.
> > 
> > For the moment I'm using the obvious patch below to keep it quiet;
> > but I doubt that this is the right patch either.  I'm hoping that
> > ye who understand the importance of hpte_need_flush() will be best
> > able to judge what to do.  Or might there be other architectures
> > expecting to be unpreemptible there?
> 
> Well, it looks like our kernel mappings tend to take some nasty
> shortcuts with the PTE locking, which I suppose are legit but do break
> some of my assumptions there. I need to have a closer look. Thanks for
> the report !

None of us have progressed this since late in 2.6.37-rc, and now
it's late in 2.6.38-rc and we're still in the same situation.

The patch I'm currently using to suppress all the noise (I suppose
I could just turn off DEBUG_PREEMPT but that would be cheating) is an
extract from Peter's preemptible mmu_gather patches, below, but I don't
really know if it's valid to extract it in this way.

Reading back, I see Jeremy suggested moving vb_free()'s call to
vunmap_page_range() back inside vb->lock: it certainly was his moving
the call out from under that lock that brought the issue to my notice;
but it looked as if there were other paths which would give preemptible
PowerPC the same issue, just paths I happen not to go down myself. I'm
not sure, I didn't take the time to follow it up properly, expecting
further insight to arrive shortly from Ben!

And, as threatened, Jeremy has further vmalloc changes queued up in
mmotm, which certainly make the patch below inadequate, and I imagine
the vunmap_page_range() movement too.  I'm currently (well, I think most
recent mmotm doesn't even boot on my ppc) having to disable preemption
in the kernel case of apply_to_pte_range().

What would be better for 2.6.38 and 2.6.37-stable?  Moving that call to
vunmap_page_range back under vb->lock, or the partial-Peter-patch below?
And then what should be done for 2.6.39?

Hugh
Peter Zijlstra - Feb. 24, 2011, 9:07 p.m.
On Thu, 2011-02-24 at 12:47 -0800, Hugh Dickins wrote:

Lovely problem :-), benh mentioned it on IRC, but I never got around to
finding the email thread, thanks for the CC.

> What would be better for 2.6.38 and 2.6.37-stable?  Moving that call to
> vunmap_page_range back under vb->lock, or the partial-Peter-patch below?
> And then what should be done for 2.6.39?

I think you'll also need the arch/powerpc/kernel/process.c changes that
cause context switches to flush the tlb_batch queues.

> --- 2.6.38-rc5/arch/powerpc/mm/tlb_hash64.c     2010-02-24 10:52:17.000000000 -0800
> +++ linux/arch/powerpc/mm/tlb_hash64.c  2011-02-15 23:27:21.000000000 -0800
> @@ -38,13 +38,11 @@ DEFINE_PER_CPU(struct ppc64_tlb_batch, p
>   * neesd to be flushed. This function will either perform the flush
>   * immediately or will batch it up if the current CPU has an active
>   * batch on it.
> - *
> - * Must be called from within some kind of spinlock/non-preempt region...
>   */
>  void hpte_need_flush(struct mm_struct *mm, unsigned long addr,
>                      pte_t *ptep, unsigned long pte, int huge)
>  {
> -       struct ppc64_tlb_batch *batch = &__get_cpu_var(ppc64_tlb_batch);
> +       struct ppc64_tlb_batch *batch = &get_cpu_var(ppc64_tlb_batch);
>         unsigned long vsid, vaddr;
>         unsigned int psize;
>         int ssize;
> @@ -99,6 +97,7 @@ void hpte_need_flush(struct mm_struct *m
>          */
>         if (!batch->active) {
>                 flush_hash_page(vaddr, rpte, psize, ssize, 0);
> +               put_cpu_var(ppc64_tlb_batch);
>                 return;
>         }
>  
> @@ -127,6 +126,7 @@ void hpte_need_flush(struct mm_struct *m
>         batch->index = ++i;
>         if (i >= PPC64_TLB_BATCH_NR)
>                 __flush_tlb_pending(batch);
> +       put_cpu_var(ppc64_tlb_batch);
>  }
>  
>  /*
Benjamin Herrenschmidt - Feb. 24, 2011, 9:12 p.m.
On Thu, 2011-02-24 at 12:47 -0800, Hugh Dickins wrote:

> Reading back, I see Jeremy suggested moving vb_free()'s call to
> vunmap_page_range() back inside vb->lock: it certainly was his moving
> the call out from under that lock that brought the issue to my notice;
> but it looked as if there were other paths which would give preemptible
> PowerPC the same issue, just paths I happen not to go down myself. I'm
> not sure, I didn't take the time to follow it up properly, expecting
> further insight to arrive shortly from Ben!

Yeah, sorry, I've been too over extended lately...

> And, as threatened, Jeremy has further vmalloc changes queued up in
> mmotm, which certainly make the patch below inadequate, and I imagine
> the vunmap_page_range() movement too.  I'm currently (well, I think most
> recent mmotm doesn't even boot on my ppc) having to disable preemption
> in the kernel case of apply_to_pte_range().
> 
> What would be better for 2.6.38 and 2.6.37-stable?  Moving that call to
> vunmap_page_range back under vb->lock, or the partial-Peter-patch below?
> And then what should be done for 2.6.39?

Patch is fine. I should send it to Linus. It's not like we have a batch
on the vmalloc space anyway, since it doesn't do the arch_lazy_mmu stuff,
so it's really about protecting the per-cpu variable.

Cheers,
Ben.

> --- 2.6.38-rc5/arch/powerpc/mm/tlb_hash64.c	2010-02-24 10:52:17.000000000 -0800
> +++ linux/arch/powerpc/mm/tlb_hash64.c	2011-02-15 23:27:21.000000000 -0800
> @@ -38,13 +38,11 @@ DEFINE_PER_CPU(struct ppc64_tlb_batch, p
>   * neesd to be flushed. This function will either perform the flush
>   * immediately or will batch it up if the current CPU has an active
>   * batch on it.
> - *
> - * Must be called from within some kind of spinlock/non-preempt region...
>   */
>  void hpte_need_flush(struct mm_struct *mm, unsigned long addr,
>  		     pte_t *ptep, unsigned long pte, int huge)
>  {
> -	struct ppc64_tlb_batch *batch = &__get_cpu_var(ppc64_tlb_batch);
> +	struct ppc64_tlb_batch *batch = &get_cpu_var(ppc64_tlb_batch);
>  	unsigned long vsid, vaddr;
>  	unsigned int psize;
>  	int ssize;
> @@ -99,6 +97,7 @@ void hpte_need_flush(struct mm_struct *m
>  	 */
>  	if (!batch->active) {
>  		flush_hash_page(vaddr, rpte, psize, ssize, 0);
> +		put_cpu_var(ppc64_tlb_batch);
>  		return;
>  	}
>  
> @@ -127,6 +126,7 @@ void hpte_need_flush(struct mm_struct *m
>  	batch->index = ++i;
>  	if (i >= PPC64_TLB_BATCH_NR)
>  		__flush_tlb_pending(batch);
> +	put_cpu_var(ppc64_tlb_batch);
>  }
>  
>  /*
Benjamin Herrenschmidt - Feb. 24, 2011, 9:23 p.m.
On Thu, 2011-02-24 at 22:07 +0100, Peter Zijlstra wrote:
> 
> Lovely problem :-), benh mentioned it on IRC, but I never got around to
> finding the email thread, thanks for the CC.
> 
> > What would be better for 2.6.38 and 2.6.37-stable?  Moving that call to
> > vunmap_page_range back under vb->lock, or the partial-Peter-patch below?
> > And then what should be done for 2.6.39?
> 
> I think you'll also need the arch/powerpc/kernel/process.c changes that
> cause context switches to flush the tlb_batch queues.

I don't think that's needed here as there shall be no batching happening
on the vmalloc space, but it can't hurt to merge it regardless :-)

Cheers,
Ben.
Peter Zijlstra - Feb. 24, 2011, 9:27 p.m.
On Fri, 2011-02-25 at 08:23 +1100, Benjamin Herrenschmidt wrote:
> 
> I don't think that's needed here as there shall be no batching happening
> on the vmalloc space, but it can't hurt to merge it regardless :-)

Ah, due to the !batch->active thing? OK, then yeah Hugh's bit is
sufficient.

Patch

--- 2.6.38-rc5/arch/powerpc/mm/tlb_hash64.c	2010-02-24 10:52:17.000000000 -0800
+++ linux/arch/powerpc/mm/tlb_hash64.c	2011-02-15 23:27:21.000000000 -0800
@@ -38,13 +38,11 @@  DEFINE_PER_CPU(struct ppc64_tlb_batch, p
  * neesd to be flushed. This function will either perform the flush
  * immediately or will batch it up if the current CPU has an active
  * batch on it.
- *
- * Must be called from within some kind of spinlock/non-preempt region...
  */
 void hpte_need_flush(struct mm_struct *mm, unsigned long addr,
 		     pte_t *ptep, unsigned long pte, int huge)
 {
-	struct ppc64_tlb_batch *batch = &__get_cpu_var(ppc64_tlb_batch);
+	struct ppc64_tlb_batch *batch = &get_cpu_var(ppc64_tlb_batch);
 	unsigned long vsid, vaddr;
 	unsigned int psize;
 	int ssize;
@@ -99,6 +97,7 @@  void hpte_need_flush(struct mm_struct *m
 	 */
 	if (!batch->active) {
 		flush_hash_page(vaddr, rpte, psize, ssize, 0);
+		put_cpu_var(ppc64_tlb_batch);
 		return;
 	}
 
@@ -127,6 +126,7 @@  void hpte_need_flush(struct mm_struct *m
 	batch->index = ++i;
 	if (i >= PPC64_TLB_BATCH_NR)
 		__flush_tlb_pending(batch);
+	put_cpu_var(ppc64_tlb_batch);
 }
 
 /*