Patchwork [3/20] powerpc/mm: Add HW threads support to no_hash TLB management

login
register
mail settings
Submitter Benjamin Herrenschmidt
Date July 24, 2009, 9:15 a.m.
Message ID <20090724091523.8AD8CDDD1B@ozlabs.org>
Download mbox | patch
Permalink /patch/30179/
State Accepted
Delegated to: Benjamin Herrenschmidt
Headers show

Comments

Benjamin Herrenschmidt - July 24, 2009, 9:15 a.m.
The current "no hash" MMU context management code is written with
the assumption that one CPU == one TLB. This is not the case on
implementations that support HW multithreading, where several
linux CPUs can share the same TLB.

This adds some basic support for this to our context management
and our TLB flushing code.

It also cleans up the optional debugging output a bit

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---

 arch/powerpc/include/asm/cputhreads.h |   16 +++++
 arch/powerpc/mm/mmu_context_nohash.c  |   93 ++++++++++++++++++++++------------
 arch/powerpc/mm/tlb_nohash.c          |   10 ++-
 3 files changed, 86 insertions(+), 33 deletions(-)
Kumar Gala - July 31, 2009, 3:12 a.m.
On Jul 24, 2009, at 4:15 AM, Benjamin Herrenschmidt wrote:

> The current "no hash" MMU context management code is written with
> the assumption that one CPU == one TLB. This is not the case on
> implementations that support HW multithreading, where several
> linux CPUs can share the same TLB.
>
> This adds some basic support for this to our context management
> and our TLB flushing code.
>
> It also cleans up the optional debugging output a bit
>
> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> ---

I'm getting this nice oops on 32-bit book-e SMP (and I'm guessing its  
because of this patch)

Unable to handle kernel paging request for data at address 0x00000000
Faulting instruction address: 0xc0016dac
Oops: Kernel access of bad area, sig: 11 [#1]
SMP NR_CPUS=8 MPC8572 DS
Modules linked in:
NIP: c0016dac LR: c0016d58 CTR: 0000001e
REGS: eed77ce0 TRAP: 0300   Not tainted  (2.6.31-rc4-00442-gdb4c9c5)
MSR: 00021000 <ME,CE>  CR: 24288482  XER: 20000000
DEAR: 00000000, ESR: 00000000
TASK = eecfe140[1581] 'msgctl08' THREAD: eed76000 CPU: 0
GPR00: 00400000 eed77d90 eecfe140 00000000 00000000 00000001 c05bf074  
c05c0cf4
GPR08: 00000003 00000002 ff7fffff 00000000 00009b05 1004f894 c05bdd24  
00000001
GPR16: ffffffff c05ab890 c05c0ce8 c04e0f58 c04da364 c05c0000 00000000  
c04cfa04
GPR24: 00000002 00000000 00000000 c05c0cd8 00000080 00000000 ef056380  
00000017
NIP [c0016dac] switch_mmu_context+0x15c/0x520
LR [c0016d58] switch_mmu_context+0x108/0x520
Call Trace:
[eed77d90] [c0016d58] switch_mmu_context+0x108/0x520 (unreliable)
[eed77df0] [c040efec] schedule+0x2bc/0x800
[eed77e70] [c01b9268] do_msgrcv+0x198/0x420
[eed77ef0] [c01b9520] sys_msgrcv+0x30/0xa0
[eed77f10] [c0003fe8] sys_ipc+0x1a8/0x2c0
[eed77f40] [c00116c4] ret_from_syscall+0x0/0x3c
Instruction dump:
57402834 7c00f850 3920fffe 5d2a003e 397b0010 5500103a 7ceb0214 60000000
60000000 81670000 39080001 38e70004 <7c0be82e> 7c005038 7c0be92e  
81260000
---[ end trace 3c4c3106446e1bd8 ]---


- k
Kumar Gala - July 31, 2009, 3:35 a.m.
On Jul 30, 2009, at 10:12 PM, Kumar Gala wrote:

>
> On Jul 24, 2009, at 4:15 AM, Benjamin Herrenschmidt wrote:
>
>> The current "no hash" MMU context management code is written with
>> the assumption that one CPU == one TLB. This is not the case on
>> implementations that support HW multithreading, where several
>> linux CPUs can share the same TLB.
>>
>> This adds some basic support for this to our context management
>> and our TLB flushing code.
>>
>> It also cleans up the optional debugging output a bit
>>
>> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
>> ---
>
> I'm getting this nice oops on 32-bit book-e SMP (and I'm guessing  
> its because of this patch)
>
> Unable to handle kernel paging request for data at address 0x00000000
> Faulting instruction address: 0xc0016dac
> Oops: Kernel access of bad area, sig: 11 [#1]
> SMP NR_CPUS=8 MPC8572 DS
> Modules linked in:
> NIP: c0016dac LR: c0016d58 CTR: 0000001e
> REGS: eed77ce0 TRAP: 0300   Not tainted  (2.6.31-rc4-00442-gdb4c9c5)
> MSR: 00021000 <ME,CE>  CR: 24288482  XER: 20000000
> DEAR: 00000000, ESR: 00000000
> TASK = eecfe140[1581] 'msgctl08' THREAD: eed76000 CPU: 0
> GPR00: 00400000 eed77d90 eecfe140 00000000 00000000 00000001  
> c05bf074 c05c0cf4
> GPR08: 00000003 00000002 ff7fffff 00000000 00009b05 1004f894  
> c05bdd24 00000001
> GPR16: ffffffff c05ab890 c05c0ce8 c04e0f58 c04da364 c05c0000  
> 00000000 c04cfa04
> GPR24: 00000002 00000000 00000000 c05c0cd8 00000080 00000000  
> ef056380 00000017
> NIP [c0016dac] switch_mmu_context+0x15c/0x520
> LR [c0016d58] switch_mmu_context+0x108/0x520
> Call Trace:
> [eed77d90] [c0016d58] switch_mmu_context+0x108/0x520 (unreliable)
> [eed77df0] [c040efec] schedule+0x2bc/0x800
> [eed77e70] [c01b9268] do_msgrcv+0x198/0x420
> [eed77ef0] [c01b9520] sys_msgrcv+0x30/0xa0
> [eed77f10] [c0003fe8] sys_ipc+0x1a8/0x2c0
> [eed77f40] [c00116c4] ret_from_syscall+0x0/0x3c
> Instruction dump:
> 57402834 7c00f850 3920fffe 5d2a003e 397b0010 5500103a 7ceb0214  
> 60000000
> 60000000 81670000 39080001 38e70004 <7c0be82e> 7c005038 7c0be92e  
> 81260000
> ---[ end trace 3c4c3106446e1bd8 ]---


On Jul 24, 2009, at 4:15 AM, Benjamin Herrenschmidt wrote:

> @@ -247,15 +261,20 @@ void switch_mmu_context(struct mm_struct
> 	 * local TLB for it and unmark it before we use it
> 	 */
> 	if (test_bit(id, stale_map[cpu])) {
> -		pr_devel("[%d] flushing stale context %d for mm @%p !\n",
> -			 cpu, id, next);
> +		pr_hardcont(" | stale flush %d [%d..%d]",
> +			    id, cpu_first_thread_in_core(cpu),
> +			    cpu_last_thread_in_core(cpu));
> +
> 		local_flush_tlb_mm(next);
>
> 		/* XXX This clear should ultimately be part of local_flush_tlb_mm */
> -		__clear_bit(id, stale_map[cpu]);
> +		for (cpu = cpu_first_thread_in_core(cpu);
> +		     cpu <= cpu_last_thread_in_core(cpu); cpu++)
> +			__clear_bit(id, stale_map[cpu]);
> 	}

This looks a bit dodgy.  using 'cpu' as both the loop variable and  
what you are computing to determine loop start/end..

Changing this to:

unsigned int i;
...

for (i = cpu_first_thread_in_core(cpu);
	i <= cpu_last_thread_in_core(cpu); i++)
	   __clear_bit(id, stale_map[i]);

seems to clear up the oops.

- k
Benjamin Herrenschmidt - July 31, 2009, 10:29 p.m.
On Thu, 2009-07-30 at 22:35 -0500, Kumar Gala wrote:
> >               /* XXX This clear should ultimately be part of
> local_flush_tlb_mm */
> > -             __clear_bit(id, stale_map[cpu]);
> > +             for (cpu = cpu_first_thread_in_core(cpu);
> > +                  cpu <= cpu_last_thread_in_core(cpu); cpu++)
> > +                     __clear_bit(id, stale_map[cpu]);
> >       }
> 
> This looks a bit dodgy.  using 'cpu' as both the loop variable and  
> what you are computing to determine loop start/end..
> 
Hrm... I would have thought that it was still correct... do you see any
reason why the above code is wrong ? because if not we may be hitting a
gcc issue...

IE. At loop init, cpu gets clamped down to the first thread in the core,
which should be fine. Then, we compare CPU to the last thread in core
for the current CPU which should always return the same value.

So I'm very interested to know what is actually wrong, ie, either I'm
just missing something obvious, or you are just pushing a bug under the
carpet which could come back and bit us later :-)

Cheers,
Ben.
Michael Ellerman - Aug. 3, 2009, 2:03 a.m.
On Sat, 2009-08-01 at 08:29 +1000, Benjamin Herrenschmidt wrote:
> On Thu, 2009-07-30 at 22:35 -0500, Kumar Gala wrote:
> > >               /* XXX This clear should ultimately be part of
> > local_flush_tlb_mm */
> > > -             __clear_bit(id, stale_map[cpu]);
> > > +             for (cpu = cpu_first_thread_in_core(cpu);
> > > +                  cpu <= cpu_last_thread_in_core(cpu); cpu++)
> > > +                     __clear_bit(id, stale_map[cpu]);
> > >       }
> > 
> > This looks a bit dodgy.  using 'cpu' as both the loop variable and  
> > what you are computing to determine loop start/end..
> > 
> Hrm... I would have thought that it was still correct... do you see any
> reason why the above code is wrong ? because if not we may be hitting a
> gcc issue...
> 
> IE. At loop init, cpu gets clamped down to the first thread in the core,
> which should be fine. Then, we compare CPU to the last thread in core
> for the current CPU which should always return the same value.
> 
> So I'm very interested to know what is actually wrong, ie, either I'm
> just missing something obvious, or you are just pushing a bug under the
> carpet which could come back and bit us later :-)

for (cpu = cpu_first_thread_in_core(cpu);
     cpu <= cpu_last_thread_in_core(cpu); cpu++)
        __clear_bit(id, stale_map[cpu]);

==

cpu = cpu_first_thread_in_core(cpu);
while (cpu <= cpu_last_thread_in_core(cpu)) {
	__clear_bit(id, stale_map[cpu]);
	cpu++;
}

cpu = 0
cpu <= 1
cpu++ (1)
cpu <= 1
cpu++ (2)
cpu <= 3
...

:)

cheers
Kumar Gala - Aug. 3, 2009, 4:21 p.m.
On Aug 2, 2009, at 9:03 PM, Michael Ellerman wrote:

> On Sat, 2009-08-01 at 08:29 +1000, Benjamin Herrenschmidt wrote:
>> On Thu, 2009-07-30 at 22:35 -0500, Kumar Gala wrote:
>>>>              /* XXX This clear should ultimately be part of
>>> local_flush_tlb_mm */
>>>> -             __clear_bit(id, stale_map[cpu]);
>>>> +             for (cpu = cpu_first_thread_in_core(cpu);
>>>> +                  cpu <= cpu_last_thread_in_core(cpu); cpu++)
>>>> +                     __clear_bit(id, stale_map[cpu]);
>>>>      }
>>>
>>> This looks a bit dodgy.  using 'cpu' as both the loop variable and
>>> what you are computing to determine loop start/end..
>>>
>> Hrm... I would have thought that it was still correct... do you see  
>> any
>> reason why the above code is wrong ? because if not we may be  
>> hitting a
>> gcc issue...
>>
>> IE. At loop init, cpu gets clamped down to the first thread in the  
>> core,
>> which should be fine. Then, we compare CPU to the last thread in core
>> for the current CPU which should always return the same value.
>>
>> So I'm very interested to know what is actually wrong, ie, either I'm
>> just missing something obvious, or you are just pushing a bug under  
>> the
>> carpet which could come back and bit us later :-)
>
> for (cpu = cpu_first_thread_in_core(cpu);
>     cpu <= cpu_last_thread_in_core(cpu); cpu++)
>        __clear_bit(id, stale_map[cpu]);
>
> ==
>
> cpu = cpu_first_thread_in_core(cpu);
> while (cpu <= cpu_last_thread_in_core(cpu)) {
> 	__clear_bit(id, stale_map[cpu]);
> 	cpu++;
> }
>
> cpu = 0
> cpu <= 1
> cpu++ (1)
> cpu <= 1
> cpu++ (2)
> cpu <= 3
> ...

Which is pretty much what I see, in a dual core setup, I get an oops  
because we are trying to clear cpu #2 (which clearly doesn't exist)

cpu = 1
(in loop)
	clearing 1
	clearing 2
OOPS

- k
Dave Kleikamp - Aug. 3, 2009, 5:06 p.m.
On Mon, 2009-08-03 at 11:21 -0500, Kumar Gala wrote:
> On Aug 2, 2009, at 9:03 PM, Michael Ellerman wrote:
> 
> > On Sat, 2009-08-01 at 08:29 +1000, Benjamin Herrenschmidt wrote:
> >> On Thu, 2009-07-30 at 22:35 -0500, Kumar Gala wrote:
> >>>>              /* XXX This clear should ultimately be part of
> >>> local_flush_tlb_mm */
> >>>> -             __clear_bit(id, stale_map[cpu]);
> >>>> +             for (cpu = cpu_first_thread_in_core(cpu);
> >>>> +                  cpu <= cpu_last_thread_in_core(cpu); cpu++)
> >>>> +                     __clear_bit(id, stale_map[cpu]);
> >>>>      }
> >>>
> >>> This looks a bit dodgy.  using 'cpu' as both the loop variable and
> >>> what you are computing to determine loop start/end..
> >>>
> >> Hrm... I would have thought that it was still correct... do you see  
> >> any
> >> reason why the above code is wrong ? because if not we may be  
> >> hitting a
> >> gcc issue...
> >>
> >> IE. At loop init, cpu gets clamped down to the first thread in the  
> >> core,
> >> which should be fine. Then, we compare CPU to the last thread in core
> >> for the current CPU which should always return the same value.
> >>
> >> So I'm very interested to know what is actually wrong, ie, either I'm
> >> just missing something obvious, or you are just pushing a bug under  
> >> the
> >> carpet which could come back and bit us later :-)
> >
> > for (cpu = cpu_first_thread_in_core(cpu);
> >     cpu <= cpu_last_thread_in_core(cpu); cpu++)
> >        __clear_bit(id, stale_map[cpu]);
> >
> > ==
> >
> > cpu = cpu_first_thread_in_core(cpu);
> > while (cpu <= cpu_last_thread_in_core(cpu)) {
> > 	__clear_bit(id, stale_map[cpu]);
> > 	cpu++;
> > }

cpu_last_thread_in_core(cpu) is a moving target.  You want something
like:

cpu = cpu_first_thread_in_core(cpu);
last = cpu_last_thread_in_core(cpu);
while (cpu <= last) {
	__clear_bit(id, stale_map[cpu]);
	cpu++;
}

> >
> > cpu = 0
> > cpu <= 1
> > cpu++ (1)
> > cpu <= 1
> > cpu++ (2)
> > cpu <= 3
> > ...
> 
> Which is pretty much what I see, in a dual core setup, I get an oops  
> because we are trying to clear cpu #2 (which clearly doesn't exist)
> 
> cpu = 1
> (in loop)
> 	clearing 1
> 	clearing 2
> OOPS
> 
> - k
Dave Kleikamp - Aug. 3, 2009, 5:57 p.m.
On Mon, 2009-08-03 at 12:06 -0500, Dave Kleikamp wrote:
> On Mon, 2009-08-03 at 11:21 -0500, Kumar Gala wrote:
> > On Aug 2, 2009, at 9:03 PM, Michael Ellerman wrote:
> > 

> > > for (cpu = cpu_first_thread_in_core(cpu);
> > >     cpu <= cpu_last_thread_in_core(cpu); cpu++)
> > >        __clear_bit(id, stale_map[cpu]);
> > >
> > > ==
> > >
> > > cpu = cpu_first_thread_in_core(cpu);
> > > while (cpu <= cpu_last_thread_in_core(cpu)) {
> > > 	__clear_bit(id, stale_map[cpu]);
> > > 	cpu++;
> > > }
> 
> cpu_last_thread_in_core(cpu) is a moving target.  You want something
> like:
> 
> cpu = cpu_first_thread_in_core(cpu);
> last = cpu_last_thread_in_core(cpu);
> while (cpu <= last) {
> 	__clear_bit(id, stale_map[cpu]);
> 	cpu++;
> }

Or, keeping the for loop:

for (cpu = cpu_first_thread_in_core(cpu), last =
cpu_last_thread_in_core(cpu);
     cpu <= last; cpu++)
	cpu++;

> 
> > >
> > > cpu = 0
> > > cpu <= 1
> > > cpu++ (1)
> > > cpu <= 1
> > > cpu++ (2)
> > > cpu <= 3
> > > ...
> > 
> > Which is pretty much what I see, in a dual core setup, I get an oops  
> > because we are trying to clear cpu #2 (which clearly doesn't exist)
> > 
> > cpu = 1
> > (in loop)
> > 	clearing 1
> > 	clearing 2
> > OOPS
> > 
> > - k
>
Benjamin Herrenschmidt - Aug. 3, 2009, 9:03 p.m.
> for (cpu = cpu_first_thread_in_core(cpu);
>      cpu <= cpu_last_thread_in_core(cpu); cpu++)
>         __clear_bit(id, stale_map[cpu]);
> 
> ==
> 
> cpu = cpu_first_thread_in_core(cpu);
> while (cpu <= cpu_last_thread_in_core(cpu)) {
> 	__clear_bit(id, stale_map[cpu]);
> 	cpu++;
> }
> 
> cpu = 0
> cpu <= 1
> cpu++ (1)
> cpu <= 1
> cpu++ (2)
> cpu <= 3
> ...

Ah right, /me takes snow out of his eyes... indeed, the
upper bound is fubar. Hrm. Allright, we'll use a temp.

Cheers,
Ben.
Benjamin Herrenschmidt - Aug. 4, 2009, 7:22 a.m.
On Mon, 2009-08-03 at 12:57 -0500, Dave Kleikamp wrote:
> > cpu_last_thread_in_core(cpu) is a moving target.  You want something
> > like:
> > 
> > cpu = cpu_first_thread_in_core(cpu);
> > last = cpu_last_thread_in_core(cpu);
> > while (cpu <= last) {
> >       __clear_bit(id, stale_map[cpu]);
> >       cpu++;
> > }
> 
> Or, keeping the for loop:
> 
> for (cpu = cpu_first_thread_in_core(cpu), last =
> cpu_last_thread_in_core(cpu);
>      cpu <= last; cpu++)
>         cpu++;

Yeah, whatever form is good, I had a brain fart and didn't "see" that in
the end of loop, cpu would have actually crossed the boundary to the
next core and so cpu_last_thread_in_core() would change. Just some short
circuit in a neuron somewhere.

Cheers,
Ben.

Patch

--- linux-work.orig/arch/powerpc/mm/mmu_context_nohash.c	2009-07-21 12:43:27.000000000 +1000
+++ linux-work/arch/powerpc/mm/mmu_context_nohash.c	2009-07-21 12:56:16.000000000 +1000
@@ -25,10 +25,20 @@ 
  *     also clear mm->cpu_vm_mask bits when processes are migrated
  */
 
-#undef DEBUG
-#define DEBUG_STEAL_ONLY
-#undef DEBUG_MAP_CONSISTENCY
-/*#define DEBUG_CLAMP_LAST_CONTEXT   15 */
+#define DEBUG_MAP_CONSISTENCY
+#define DEBUG_CLAMP_LAST_CONTEXT   31
+//#define DEBUG_HARDER
+
+/* We don't use DEBUG because it tends to be compiled in always nowadays
+ * and this would generate way too much output
+ */
+#ifdef DEBUG_HARDER
+#define pr_hard(args...)	printk(KERN_DEBUG args)
+#define pr_hardcont(args...)	printk(KERN_CONT args)
+#else
+#define pr_hard(args...)	do { } while(0)
+#define pr_hardcont(args...)	do { } while(0)
+#endif
 
 #include <linux/kernel.h>
 #include <linux/mm.h>
@@ -71,7 +81,7 @@  static DEFINE_SPINLOCK(context_lock);
 static unsigned int steal_context_smp(unsigned int id)
 {
 	struct mm_struct *mm;
-	unsigned int cpu, max;
+	unsigned int cpu, max, i;
 
 	max = last_context - first_context;
 
@@ -89,15 +99,22 @@  static unsigned int steal_context_smp(un
 				id = first_context;
 			continue;
 		}
-		pr_devel("[%d] steal context %d from mm @%p\n",
-			 smp_processor_id(), id, mm);
+		pr_hardcont(" | steal %d from 0x%p", id, mm);
 
 		/* Mark this mm has having no context anymore */
 		mm->context.id = MMU_NO_CONTEXT;
 
-		/* Mark it stale on all CPUs that used this mm */
-		for_each_cpu(cpu, mm_cpumask(mm))
-			__set_bit(id, stale_map[cpu]);
+		/* Mark it stale on all CPUs that used this mm. For threaded
+		 * implementations, we set it on all threads on each core
+		 * represented in the mask. A future implementation will use
+		 * a core map instead but this will do for now.
+		 */
+		for_each_cpu(cpu, mm_cpumask(mm)) {
+			for (i = cpu_first_thread_in_core(cpu);
+			     i <= cpu_last_thread_in_core(cpu); i++)
+				__set_bit(id, stale_map[i]);
+			cpu = i - 1;
+		}
 		return id;
 	}
 
@@ -126,7 +143,7 @@  static unsigned int steal_context_up(uns
 	/* Pick up the victim mm */
 	mm = context_mm[id];
 
-	pr_devel("[%d] steal context %d from mm @%p\n", cpu, id, mm);
+	pr_hardcont(" | steal %d from 0x%p", id, mm);
 
 	/* Flush the TLB for that context */
 	local_flush_tlb_mm(mm);
@@ -179,19 +196,14 @@  void switch_mmu_context(struct mm_struct
 	/* No lockless fast path .. yet */
 	spin_lock(&context_lock);
 
-#ifndef DEBUG_STEAL_ONLY
-	pr_devel("[%d] activating context for mm @%p, active=%d, id=%d\n",
-		 cpu, next, next->context.active, next->context.id);
-#endif
+	pr_hard("[%d] activating context for mm @%p, active=%d, id=%d",
+		cpu, next, next->context.active, next->context.id);
 
 #ifdef CONFIG_SMP
 	/* Mark us active and the previous one not anymore */
 	next->context.active++;
 	if (prev) {
-#ifndef DEBUG_STEAL_ONLY
-		pr_devel(" old context %p active was: %d\n",
-			 prev, prev->context.active);
-#endif
+		pr_hardcont(" (old=0x%p a=%d)", prev, prev->context.active);
 		WARN_ON(prev->context.active < 1);
 		prev->context.active--;
 	}
@@ -201,8 +213,14 @@  void switch_mmu_context(struct mm_struct
 
 	/* If we already have a valid assigned context, skip all that */
 	id = next->context.id;
-	if (likely(id != MMU_NO_CONTEXT))
+	if (likely(id != MMU_NO_CONTEXT)) {
+#ifdef DEBUG_MAP_CONSISTENCY
+		if (context_mm[id] != next)
+			pr_err("MMU: mm 0x%p has id %d but context_mm[%d] says 0x%p\n",
+			       next, id, id, context_mm[id]);
+#endif
 		goto ctxt_ok;
+	}
 
 	/* We really don't have a context, let's try to acquire one */
 	id = next_context;
@@ -234,11 +252,7 @@  void switch_mmu_context(struct mm_struct
 	next_context = id + 1;
 	context_mm[id] = next;
 	next->context.id = id;
-
-#ifndef DEBUG_STEAL_ONLY
-	pr_devel("[%d] picked up new id %d, nrf is now %d\n",
-		 cpu, id, nr_free_contexts);
-#endif
+	pr_hardcont(" | new id=%d,nrf=%d", id, nr_free_contexts);
 
 	context_check_map();
  ctxt_ok:
@@ -247,15 +261,20 @@  void switch_mmu_context(struct mm_struct
 	 * local TLB for it and unmark it before we use it
 	 */
 	if (test_bit(id, stale_map[cpu])) {
-		pr_devel("[%d] flushing stale context %d for mm @%p !\n",
-			 cpu, id, next);
+		pr_hardcont(" | stale flush %d [%d..%d]",
+			    id, cpu_first_thread_in_core(cpu),
+			    cpu_last_thread_in_core(cpu));
+
 		local_flush_tlb_mm(next);
 
 		/* XXX This clear should ultimately be part of local_flush_tlb_mm */
-		__clear_bit(id, stale_map[cpu]);
+		for (cpu = cpu_first_thread_in_core(cpu);
+		     cpu <= cpu_last_thread_in_core(cpu); cpu++)
+			__clear_bit(id, stale_map[cpu]);
 	}
 
 	/* Flick the MMU and release lock */
+	pr_hardcont(" -> %d\n", id);
 	set_context(id, next->pgd);
 	spin_unlock(&context_lock);
 }
@@ -265,6 +284,8 @@  void switch_mmu_context(struct mm_struct
  */
 int init_new_context(struct task_struct *t, struct mm_struct *mm)
 {
+	pr_hard("initing context for mm @%p\n", mm);
+
 	mm->context.id = MMU_NO_CONTEXT;
 	mm->context.active = 0;
 
@@ -304,7 +325,9 @@  static int __cpuinit mmu_context_cpu_not
 					    unsigned long action, void *hcpu)
 {
 	unsigned int cpu = (unsigned int)(long)hcpu;
-
+#ifdef CONFIG_HOTPLUG_CPU
+	struct task_struct *p;
+#endif
 	/* We don't touch CPU 0 map, it's allocated at aboot and kept
 	 * around forever
 	 */
@@ -323,8 +346,16 @@  static int __cpuinit mmu_context_cpu_not
 		pr_devel("MMU: Freeing stale context map for CPU %d\n", cpu);
 		kfree(stale_map[cpu]);
 		stale_map[cpu] = NULL;
-		break;
-#endif
+
+		/* We also clear the cpu_vm_mask bits of CPUs going away */
+		read_lock(&tasklist_lock);
+		for_each_process(p) {
+			if (p->mm)
+				cpu_mask_clear_cpu(cpu, mm_cpumask(p->mm));
+		}
+		read_unlock(&tasklist_lock);
+	break;
+#endif /* CONFIG_HOTPLUG_CPU */
 	}
 	return NOTIFY_OK;
 }
Index: linux-work/arch/powerpc/include/asm/cputhreads.h
===================================================================
--- linux-work.orig/arch/powerpc/include/asm/cputhreads.h	2009-07-21 12:43:27.000000000 +1000
+++ linux-work/arch/powerpc/include/asm/cputhreads.h	2009-07-21 12:56:16.000000000 +1000
@@ -5,6 +5,15 @@ 
 
 /*
  * Mapping of threads to cores
+ *
+ * Note: This implementation is limited to a power of 2 number of
+ * threads per core and the same number for each core in the system
+ * (though it would work if some processors had less threads as long
+ * as the CPU numbers are still allocated, just not brought offline).
+ *
+ * However, the API allows for a different implementation in the future
+ * if needed, as long as you only use the functions and not the variables
+ * directly.
  */
 
 #ifdef CONFIG_SMP
@@ -67,5 +76,12 @@  static inline int cpu_first_thread_in_co
 	return cpu & ~(threads_per_core - 1);
 }
 
+static inline int cpu_last_thread_in_core(int cpu)
+{
+	return cpu | (threads_per_core - 1);
+}
+
+
+
 #endif /* _ASM_POWERPC_CPUTHREADS_H */
 
Index: linux-work/arch/powerpc/mm/tlb_nohash.c
===================================================================
--- linux-work.orig/arch/powerpc/mm/tlb_nohash.c	2009-07-21 12:43:31.000000000 +1000
+++ linux-work/arch/powerpc/mm/tlb_nohash.c	2009-07-21 12:57:21.000000000 +1000
@@ -87,6 +87,12 @@  EXPORT_SYMBOL(local_flush_tlb_page);
 
 static DEFINE_SPINLOCK(tlbivax_lock);
 
+static int mm_is_core_local(struct mm_struct *mm)
+{
+	return cpumask_subset(mm_cpumask(mm),
+			      topology_thread_cpumask(smp_processor_id()));
+}
+
 struct tlb_flush_param {
 	unsigned long addr;
 	unsigned int pid;
@@ -131,7 +137,7 @@  void flush_tlb_mm(struct mm_struct *mm)
 	pid = mm->context.id;
 	if (unlikely(pid == MMU_NO_CONTEXT))
 		goto no_context;
-	if (!cpumask_equal(mm_cpumask(mm), cpumask_of(smp_processor_id()))) {
+	if (!mm_is_core_local(mm)) {
 		struct tlb_flush_param p = { .pid = pid };
 		/* Ignores smp_processor_id() even if set. */
 		smp_call_function_many(mm_cpumask(mm),
@@ -153,7 +159,7 @@  void flush_tlb_page(struct vm_area_struc
 	if (unlikely(pid == MMU_NO_CONTEXT))
 		goto bail;
 	cpu_mask = mm_cpumask(vma->vm_mm);
-	if (!cpumask_equal(cpu_mask, cpumask_of(smp_processor_id()))) {
+	if (!mm_is_core_local(mm)) {
 		/* If broadcast tlbivax is supported, use it */
 		if (mmu_has_feature(MMU_FTR_USE_TLBIVAX_BCAST)) {
 			int lock = mmu_has_feature(MMU_FTR_LOCK_BCAST_INVAL);