2.6.31-git5 kernel boot hangs on powerpc

Message ID 4ABA2DE2.6000601@kernel.org (mailing list archive)
State Not Applicable

Commit Message

Tejun Heo Sept. 23, 2009, 2:17 p.m. UTC
Tejun Heo wrote:
>> One workaround I have found for this problem is to disable IPv6.
>> With IPv6 disabled the machine boots OK. Until a reliable solution
>> is available for this issue, I will keep IPv6 disabled in my configs.
> 
> I think it's most likely caused by some code accessing an invalid
> percpu address.  I'm currently writing up an access validator.  Should be
> done in several hours.  So, IPv6 it is.  I couldn't reproduce your
> problem here.  I'll give IPv6 a shot.

Can you please apply the attached patch and see whether anything
interesting shows up in the kernel log?

Thanks.

Comments

Sachin P. Sant Sept. 24, 2009, 7:58 a.m. UTC | #1
Tejun Heo wrote:
> Can you please apply the attached patch and see whether anything
> interesting shows up in the kernel log?
>   
Thanks Tejun for the debug patch. Attached here are the relevant logs.
The only messages related to percpu in the logs are

<6>PERCPU: Embedded 2 pages/cpu @c000000001200000 s100232 r0 d30840 u524288
<7>pcpu-alloc: s100232 r0 d30840 u524288 alloc=1*1048576
<7>pcpu-alloc: [0] 0 1 

The captured logs are with latest git.

Thanks
-Sachin
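
For readers decoding the PERCPU line quoted above: as far as I understand the mm/percpu.c code of this era, the `s`, `r`, `d` and `u` fields are the static, reserved, dynamic and unit sizes in bytes, and "pages/cpu" is their sum divided by the page size. The sketch below is a userspace illustration only. The struct and function names are hypothetical, and the 64K page size is an assumption taken from the "Page orders: ... virtual = 16" line in the log.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical holder for the four sizes printed in the PERCPU line. */
struct pcpu_layout {
	unsigned long static_size;   /* "s" field, bytes */
	unsigned long reserved_size; /* "r" field, bytes */
	unsigned long dyn_size;      /* "d" field, bytes */
	unsigned long unit_size;     /* "u" field, bytes */
};

/* Parse a string such as "s100232 r0 d30840 u524288".
 * Returns 0 on success, -1 if the string does not match. */
static int parse_pcpu_layout(const char *s, struct pcpu_layout *out)
{
	assert(s != NULL && out != NULL);
	if (sscanf(s, "s%lu r%lu d%lu u%lu",
		   &out->static_size, &out->reserved_size,
		   &out->dyn_size, &out->unit_size) != 4)
		return -1;
	return 0;
}

/* Whole pages covered by the static + reserved + dynamic areas. */
static unsigned long pcpu_pages_per_cpu(const struct pcpu_layout *l,
					unsigned long page_size)
{
	return (l->static_size + l->reserved_size + l->dyn_size) / page_size;
}
```

With the values from the log, 100232 + 0 + 30840 = 131072 bytes, which is two 64K pages and matches "Embedded 2 pages/cpu". Likewise, two units of 524288 bytes match the single 1048576-byte allocation reported as alloc=1*1048576.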
<4>Crash kernel location must be 0x2000000
<6>Reserving 256MB of memory at 32MB for crashkernel (System RAM: 4096MB)
<6>Phyp-dump disabled at boot time
<6>Using pSeries machine description
<7>Page orders: linear mapping = 16, virtual = 16, io = 12
<6>Using 1TB segments
<4>Found initrd at 0xc000000003500000:0xc000000003ccdf60
<6>bootconsole [udbg0] enabled
<6>Partition configured for 2 cpus.
<6>CPU maps initialized for 2 threads per core
<7> (thread shift is 1)
<4>Starting Linux PPC64 #2 SMP Thu Sep 24 12:59:21 IST 2009
<4>-----------------------------------------------------
<4>ppc64_pft_size                = 0x1a
<4>physicalMemorySize            = 0x100000000
<4>htab_hash_mask                = 0x7ffff
<4>-----------------------------------------------------
<6>Initializing cgroup subsys cpuset
<6>Initializing cgroup subsys cpu
<5>Linux version 2.6.31-git13-autotest (root@mpower6lp5) (gcc version 4.3.2 [gcc-4_3-branch revision 141291] (SUSE Linux) ) #2 SMP Thu Sep 24 12:59:21 IST 2009
<4>[boot]0012 Setup Arch
<7>Node 0 Memory:
<7>Node 2 Memory: 0x0-0xe0000000
<7>Node 3 Memory: 0xe0000000-0x100000000
<4>EEH: No capable adapters found
<6>PPC64 nvram contains 15360 bytes
<7>Using shared processor idle loop
<4>Zone PFN ranges:
<4>  DMA      0x00000000 -> 0x00010000
<4>  Normal   0x00010000 -> 0x00010000
<4>Movable zone start PFN for each node
<4>early_node_map[2] active PFN ranges
<4>    2: 0x00000000 -> 0x0000e000
<4>    3: 0x0000e000 -> 0x00010000
<4>Could not find start_pfn for node 0
<7>On node 0 totalpages: 0
<7>On node 2 totalpages: 57344
<7>  DMA zone: 56 pages used for memmap
<7>  DMA zone: 0 pages reserved
<7>  DMA zone: 57288 pages, LIFO batch:1
<7>On node 3 totalpages: 8192
<7>  DMA zone: 8 pages used for memmap
<7>  DMA zone: 0 pages reserved
<7>  DMA zone: 8184 pages, LIFO batch:0
<4>[boot]0015 Setup Done
<6>PERCPU: Embedded 2 pages/cpu @c000000001200000 s100232 r0 d30840 u524288
<7>pcpu-alloc: s100232 r0 d30840 u524288 alloc=1*1048576
<7>pcpu-alloc: [0] 0 1 
<4>Built 3 zonelists in Node order, mobility grouping on.  Total pages: 65472
<4>Policy zone: DMA
<5>Kernel command line: root=/dev/sda3 sysrq=8 insmod=sym53c8xx insmod=ipr crashkernel=512M-:256M xmon=early 
<6>PID hash table entries: 4096 (order: -1, 32768 bytes)
<4>freeing bootmem node 2
<4>freeing bootmem node 3
<6>Memory: 3896832k/4194304k available (9728k kernel code, 297472k reserved, 3072k data, 4291k bss, 576k init)
<6>SLUB: Genslabs=18, HWalign=128, Order=0-3, MinObjects=0, CPUs=2, Nodes=16
<6>Hierarchical RCU implementation.
<6>RCU-based detection of stalled CPUs is enabled.
<6>NR_IRQS:512
<4>[boot]0020 XICS Init
<4>[boot]0021 XICS Done
<7>pic: no ISA interrupt controller
<7>time_init: decrementer frequency = 512.000000 MHz
<7>time_init: processor frequency   = 4704.000000 MHz
<6>clocksource: timebase mult[7d0000] shift[22] registered
<7>clockevent: decrementer mult[83126e97] shift[32] cpu[0]
<4>Console: colour dummy device 80x25
<6>console [hvc0] enabled, bootconsole disabled
<6>allocated 2621440 bytes of page_cgroup
<6>please try 'cgroup_disable=memory' option if you don't want memory cgroups
<6>Security Framework initialized
<6>SELinux:  Disabled at boot.
<6>Dentry cache hash table entries: 524288 (order: 6, 4194304 bytes)
<6>Inode-cache hash table entries: 262144 (order: 5, 2097152 bytes)
<4>Mount-cache hash table entries: 4096
<6>Initializing cgroup subsys ns
<6>Initializing cgroup subsys cpuacct
<6>Initializing cgroup subsys memory
<6>Initializing cgroup subsys devices
<6>Initializing cgroup subsys freezer
<7>irq: irq 2 on host null mapped to virtual irq 16
<7>clockevent: decrementer mult[83126e97] shift[32] cpu[1]
<4>Processor 1 found.
<6>Brought up 2 CPUs
<7>Node 0 CPUs: 0-1
<7>Node 2 CPUs:
<7>Node 3 CPUs:
<7>CPU0 attaching sched-domain:
<7> domain 0: span 0-1 level SIBLING
<7>  groups: 0 (cpu_power = 589) 1 (cpu_power = 589)
<7>  domain 1: span 0-1 level CPU
<7>   groups: 0-1 (cpu_power = 1178)
<7>CPU1 attaching sched-domain:
<7> domain 0: span 0-1 level SIBLING
<7>  groups: 1 (cpu_power = 589) 0 (cpu_power = 589)
<7>  domain 1: span 0-1 level CPU
<7>   groups: 0-1 (cpu_power = 1178)
<6>NET: Registered protocol family 16
<6>IBM eBus Device Driver
<6>POWER6 performance monitor hardware support registered
<6>PCI: Probing PCI hardware
<7>PCI: Probing PCI hardware done
<4>bio: create slab <bio-0> at 0
<6>vgaarb: loaded
<6>usbcore: registered new interface driver usbfs
<6>usbcore: registered new interface driver hub
<6>usbcore: registered new device driver usb
<6>Switching to clocksource timebase
<6>NET: Registered protocol family 2
<6>IP route cache hash table entries: 32768 (order: 2, 262144 bytes)
<6>TCP established hash table entries: 131072 (order: 5, 2097152 bytes)
<6>TCP bind hash table entries: 65536 (order: 5, 2097152 bytes)
<6>TCP: Hash tables configured (established 131072 bind 65536)
<6>TCP reno registered
<6>NET: Registered protocol family 1
<6>Unpacking initramfs...
<7>Switched to high resolution mode on CPU 0
<7>Switched to high resolution mode on CPU 1
<7>irq: irq 655360 on host null mapped to virtual irq 17
<7>irq: irq 655367 on host null mapped to virtual irq 18
<6>IOMMU table initialized, virtual merging enabled
<7>irq: irq 589825 on host null mapped to virtual irq 19
<7>RTAS daemon started
<6>audit: initializing netlink socket (disabled)
<5>type=2000 audit(1253778214.210:1): initialized
<6>Kprobe smoke test started
<6>Kprobe smoke test passed successfully
<6>HugeTLB registered 16 MB page size, pre-allocated 0 pages
<6>HugeTLB registered 16 GB page size, pre-allocated 0 pages
<5>VFS: Disk quotas dquot_6.5.2
<4>Dquot-cache hash table entries: 8192 (order 0, 65536 bytes)
<6>Btrfs loaded
<6>msgmni has been set to 7608
<6>alg: No test for stdrng (krng)
<6>Block layer SCSI generic (bsg) driver version 0.4 loaded (major 254)
<6>io scheduler noop registered
<6>io scheduler anticipatory registered
<6>io scheduler deadline registered
<6>io scheduler cfq registered (default)
<6>pci_hotplug: PCI Hot Plug PCI Core version: 0.5
<6>rpaphp: RPA HOT Plug PCI Controller Driver version: 0.1
<7>vio_register_driver: driver hvc_console registering
<7>HVSI: registered 0 devices
<6>Generic RTC Driver v1.07
<6>Serial: 8250/16550 driver, 4 ports, IRQ sharing disabled
<6>pmac_zilog: 0.6 (Benjamin Herrenschmidt <benh@kernel.crashing.org>)
<6>input: Macintosh mouse button emulation as /devices/virtual/input/input0
<6>Uniform Multi-Platform E-IDE driver
<6>ide-gd driver 1.18
<6>IBM eHEA ethernet device driver (Release EHEA_0102)
<7>irq: irq 590088 on host null mapped to virtual irq 264
<6>ehea: eth0: Jumbo frames are disabled
<6>ehea: eth0 -> logical port id #2
<6>ehci_hcd: USB 2.0 'Enhanced' Host Controller (EHCI) Driver
<6>ohci_hcd: USB 1.1 'Open' Host Controller (OHCI) Driver
<6>mice: PS/2 mouse device common for all mice
<6>EDAC MC: Ver: 2.1.0 Sep 24 2009
<6>usbcore: registered new interface driver hiddev
<6>usbcore: registered new interface driver usbhid
<6>usbhid: v2.6:USB HID core driver
<6>TCP cubic registered
<6>NET: Registered protocol family 15
<4>registered taskstats version 1
<4>Freeing unused kernel memory: 576k freed
<6>SysRq : Changing Loglevel
<4>Loglevel set to 8
<5>SCSI subsystem initialized
<7>vio_register_driver: driver ibmvscsi registering
<6>ibmvscsi 30000007: SRP_VERSION: 16.a
<6>scsi0 : IBM POWER Virtual SCSI Adapter 1.5.8
<6>ibmvscsi 30000007: partner initialization complete
<6>ibmvscsi 30000007: host srp version: 16.a, host partition VIO Server (1), OS 3, max io 1048576
<6>ibmvscsi 30000007: Client reserve enabled
<6>ibmvscsi 30000007: sent SRP login
<6>ibmvscsi 30000007: SRP_LOGIN succeeded
<5>scsi 0:0:1:0: Direct-Access     AIX      VDASD            0001 PQ: 0 ANSI: 3
<5>scsi 0:0:2:0: CD-ROM            AIX      VOPTA                 PQ: 0 ANSI: 4
<6>udevd version 128 started
<5>sd 0:0:1:0: [sda] 146800640 512-byte logical blocks: (75.1 GB/70.0 GiB)
<5>sd 0:0:1:0: [sda] Write Protect is off
<7>sd 0:0:1:0: [sda] Mode Sense: 17 00 00 08
<5>sd 0:0:1:0: [sda] Cache data unavailable
<3>sd 0:0:1:0: [sda] Assuming drive cache: write through
<5>sd 0:0:1:0: [sda] Cache data unavailable
<3>sd 0:0:1:0: [sda] Assuming drive cache: write through
<6> sda: sda1 sda2 sda3
<5>sd 0:0:1:0: [sda] Cache data unavailable
<3>sd 0:0:1:0: [sda] Assuming drive cache: write through
<5>sd 0:0:1:0: [sda] Attached SCSI disk
<6>kjournald starting.  Commit interval 5 seconds
<6>EXT3 FS on sda3, internal journal
<6>EXT3-fs: mounted filesystem with writeback data mode.
<6>udevd version 128 started
<5>sd 0:0:1:0: Attached scsi generic sg0 type 0
<5>scsi 0:0:2:0: Attached scsi generic sg1 type 5
<4>sr0: scsi-1 drive
<6>Uniform CD-ROM driver Revision: 3.20
<7>sr 0:0:2:0: Attached scsi CD-ROM sr0
<6>Adding 2096320k swap on /dev/sda2.  Priority:-1 extents:1 across:2096320k 
<6>device-mapper: uevent: version 1.0.3
<6>device-mapper: ioctl: 4.15.0-ioctl (2009-04-01) initialised: dm-devel@redhat.com
<6>loop: module loaded
<6>fuse init (API version 7.13)
<7>irq: irq 33539 on host null mapped to virtual irq 259
<6>ehea: eth0: Physical port up
<6>ehea: External switch port is backup port
<7>irq: irq 33540 on host null mapped to virtual irq 260
<6>NET: Registered protocol family 10
<3>INFO: RCU detected CPU 0 stall (t=1000 jiffies)
Tejun Heo Sept. 24, 2009, 12:59 p.m. UTC | #2
Sachin Sant wrote:
> Tejun Heo wrote:
>> Can you please apply the attached patch and see whether anything
>> interesting shows up in the kernel log?
>>   
> Thanks Tejun for the debug patch. Attached here are the relevant logs.
> The only messages related to percpu in the logs are
> 
> <6>PERCPU: Embedded 2 pages/cpu @c000000001200000 s100232 r0 d30840 u524288
> <7>pcpu-alloc: s100232 r0 d30840 u524288 alloc=1*1048576
> <7>pcpu-alloc: [0] 0 1
> The captured logs are with latest git.

Hmm... that means it wasn't caused by rogue percpu pointer access.
Please wait a bit.  I'll try to reproduce it.

Thanks.
Sachin P. Sant Sept. 24, 2009, 1:23 p.m. UTC | #3
Tejun Heo wrote:
> Sachin Sant wrote:
>   
>> Tejun Heo wrote:
>>     
>>> Can you please apply the attached patch and see whether anything
>>> interesting shows up in the kernel log?
>>>   
>>>       
>> Thanks Tejun for the debug patch. Attached here are the relevant logs.
>> The only messages related to percpu in the logs are
>>
>> <6>PERCPU: Embedded 2 pages/cpu @c000000001200000 s100232 r0 d30840 u524288
>> <7>pcpu-alloc: s100232 r0 d30840 u524288 alloc=1*1048576
>> <7>pcpu-alloc: [0] 0 1
>> The captured logs are with latest git.
>>     
>
> Hmm... that means it wasn't caused by rogue percpu pointer access.
> Please wait a bit.  I'll try to reproduce it.
>   
I was able to reproduce the hang in a different way. (I still had
IPV6 disabled in my config). I executed the network namespace container
tests from LTP and could reproduce a similar hang. The top three
function calls were the same as with IPV6. Here are the traces
using xmon debugger.


Oops: System Reset, sig: 6 [#4]
SMP NR_CPUS=1024 DEBUG_PAGEALLOC NUMA pSeries
Modules linked in: quota_v2 quota_tree fuse loop dm_mod sg sd_mod crc_t10dif ibmvscsic scsi_transport_srp scsi_tgt scsi_mod
NIP: c00000000003c310 LR: c0000000000055d0 CTR: 0000000000000040
REGS: c0000000fc90f340 TRAP: 0100   Tainted: G      D     (2.6.31-git13-autotest)
MSR: 8000000000081032 <ME,IR,DR>  CR: 28004420  XER: 20000001
TASK = c00000002c408890[8753] 'check_netns_ena' THREAD: c0000000fc90c000 CPU: 2
GPR00: 00000fffffffffff c0000000fc90f5c0 c000000000b8c2a8 d00007fffff00000
GPR04: 0000000000000201 0000000000000300 d00007fffff00000 d00007fffff00000
GPR08: 0000000000000000 000007fffff00000 0000000000000000 0000000000000000
GPR12: 8000000000009032 c000000000c82a00 0000000000000001 c0000000fc90f924
GPR16: 0000000000000300 0000000000000001 c0000000fa8e2380 0000000000000000
GPR20: 0000000000010000 0000000000000001 0000000000000000 0000000000000000
GPR24: c0000000fa9c09c8 0000000000000001 0000000000000001 c0000000faef6f60
GPR28: c000000000c6b620 0000000000000000 c000000000af2aa0 c000000000c6d1b0
NIP [c00000000003c310] .hash_page+0x24/0x4bc
LR [c0000000000055d0] .do_hash_page+0x50/0x6c
Call Trace:
[c0000000fc90f5c0] [c0000000000055d0] .do_hash_page+0x50/0x6c (unreliable)
--- Exception: 301 at .memset+0x60/0xfc
    LR = .pcpu_alloc+0x718/0x8fc
[c0000000fc90f8b0] [c0000000001700dc] .pcpu_alloc+0x6a8/0x8fc (unreliable)
[c0000000fc90f9d0] [c000000000614648] .snmp_mib_init+0x54/0x9c
[c0000000fc90fa60] [c000000000614764] .ipv4_mib_init_net+0xd4/0x1e0
[c0000000fc90fb10] [c0000000005a839c] .setup_net+0x68/0x124
[c0000000fc90fbb0] [c0000000005a8ad0] .copy_net_ns+0x88/0x130
[c0000000fc90fc40] [c0000000000bd5ac] .create_new_namespaces+0x110/0x1d0
[c0000000fc90fce0] [c0000000000bd874] .unshare_nsproxy_namespaces+0x6c/0xe8
[c0000000fc90fd80] [c000000000091ee8] .SyS_unshare+0x13c/0x318
[c0000000fc90fe30] [c0000000000085b4] syscall_exit+0x0/0x40
Instruction dump:
7c0803a6 ebe1fff8 4e800020 78690100 7c0802a6 f8010010 3800ffff fa01ff80
7cb02b78 78000500 fa21ff88 fb61ffd8 <7c912378> fa41ff90 7c7b1b78 fa61ff98

As you can see, the call trace is the same as far as the top three function
calls are concerned [snmp_mib_init(), pcpu_alloc() and memset()].

The snmp_mib_init() function is :

int snmp_mib_init(void *ptr[2], size_t mibsize)
{
        BUG_ON(ptr == NULL);
        ptr[0] = __alloc_percpu(mibsize, __alignof__(unsigned long long));
        if (!ptr[0])
                goto err0;
        ptr[1] = __alloc_percpu(mibsize, __alignof__(unsigned long long));
        if (!ptr[1])
                goto err1;
        return 0;
.....

Maybe this helps.

Thanks
-Sachin
Benjamin Herrenschmidt Sept. 24, 2009, 9:05 p.m. UTC | #4
On Thu, 2009-09-24 at 18:53 +0530, Sachin Sant wrote:
> Tejun Heo wrote:
> > Sachin Sant wrote:
> >   
> >> Tejun Heo wrote:
> >>     
> >>> Can you please apply the attached patch and see whether anything
> >>> interesting shows up in the kernel log?
> >>>   
> >>>       
> >> Thanks Tejun for the debug patch. Attached here are the relevant logs.
> >> The only messages related to percpu in the logs are
> >>
> >> <6>PERCPU: Embedded 2 pages/cpu @c000000001200000 s100232 r0 d30840 u524288
> >> <7>pcpu-alloc: s100232 r0 d30840 u524288 alloc=1*1048576
> >> <7>pcpu-alloc: [0] 0 1
> >> The captured logs are with latest git.
> >>     
> >
> > Hmm... that means it wasn't caused by rogue percpu pointer access.
> > Please wait a bit.  I'll try to reproduce it.
> >   
> I was able to reproduce the hang in a different way. (I still had
> IPV6 disabled in my config). I executed the network namespace container
> tests from LTP and could reproduce a similar hang. The top three
> function calls were the same as with IPV6. Here are the traces
> using xmon debugger.
> 
> 
> Oops: System Reset, sig: 6 [#4]
> SMP NR_CPUS=1024 DEBUG_PAGEALLOC NUMA pSeries
> Modules linked in: quota_v2 quota_tree fuse loop dm_mod sg sd_mod crc_t10dif ibmvscsic scsi_transport_srp scsi_tgt scsi_mod
> NIP: c00000000003c310 LR: c0000000000055d0 CTR: 0000000000000040
> REGS: c0000000fc90f340 TRAP: 0100   Tainted: G      D     (2.6.31-git13-autotest)
> MSR: 8000000000081032 <ME,IR,DR>  CR: 28004420  XER: 20000001
> TASK = c00000002c408890[8753] 'check_netns_ena' THREAD: c0000000fc90c000 CPU: 2
> GPR00: 00000fffffffffff c0000000fc90f5c0 c000000000b8c2a8 d00007fffff00000
> GPR04: 0000000000000201 0000000000000300 d00007fffff00000 d00007fffff00000
> GPR08: 0000000000000000 000007fffff00000 0000000000000000 0000000000000000
> GPR12: 8000000000009032 c000000000c82a00 0000000000000001 c0000000fc90f924
> GPR16: 0000000000000300 0000000000000001 c0000000fa8e2380 0000000000000000
> GPR20: 0000000000010000 0000000000000001 0000000000000000 0000000000000000
> GPR24: c0000000fa9c09c8 0000000000000001 0000000000000001 c0000000faef6f60
> GPR28: c000000000c6b620 0000000000000000 c000000000af2aa0 c000000000c6d1b0
> NIP [c00000000003c310] .hash_page+0x24/0x4bc
> LR [c0000000000055d0] .do_hash_page+0x50/0x6c
> Call Trace:
> [c0000000fc90f5c0] [c0000000000055d0] .do_hash_page+0x50/0x6c (unreliable)
> --- Exception: 301 at .memset+0x60/0xfc
>     LR = .pcpu_alloc+0x718/0x8fc

So it's memsetting something, which causes it to call hash_page(), i.e.
faulting in pages (vmalloc space?). So far nothing obviously wrong...

> [c0000000fc90f8b0] [c0000000001700dc] .pcpu_alloc+0x6a8/0x8fc (unreliable)
> [c0000000fc90f9d0] [c000000000614648] .snmp_mib_init+0x54/0x9c
> [c0000000fc90fa60] [c000000000614764] .ipv4_mib_init_net+0xd4/0x1e0
> [c0000000fc90fb10] [c0000000005a839c] .setup_net+0x68/0x124
> [c0000000fc90fbb0] [c0000000005a8ad0] .copy_net_ns+0x88/0x130
> [c0000000fc90fc40] [c0000000000bd5ac] .create_new_namespaces+0x110/0x1d0
> [c0000000fc90fce0] [c0000000000bd874] .unshare_nsproxy_namespaces+0x6c/0xe8
> [c0000000fc90fd80] [c000000000091ee8] .SyS_unshare+0x13c/0x318
> [c0000000fc90fe30] [c0000000000085b4] syscall_exit+0x0/0x40
> Instruction dump:
> 7c0803a6 ebe1fff8 4e800020 78690100 7c0802a6 f8010010 3800ffff fa01ff80
> 7cb02b78 78000500 fa21ff88 fb61ffd8 <7c912378> fa41ff90 7c7b1b78 fa61ff98
> 
> As you can see, the call trace is the same as far as the top three function
> calls are concerned [snmp_mib_init(), pcpu_alloc() and memset()].
> 
> The snmp_mib_init() function is :
> 
> int snmp_mib_init(void *ptr[2], size_t mibsize)
> {
>         BUG_ON(ptr == NULL);
>         ptr[0] = __alloc_percpu(mibsize, __alignof__(unsigned long long));
>         if (!ptr[0])
>                 goto err0;
>         ptr[1] = __alloc_percpu(mibsize, __alignof__(unsigned long long));
>         if (!ptr[1])
>                 goto err1;
>         return 0;
> .....
> 
> Maybe this helps.
> 
> Thanks
> -Sachin
> 
>
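
The guess above that the fault is in vmalloc space can be cross-checked against the register dump. This is a rough sketch that assumes the ppc64 layout of this era, where the top four bits of an effective address select the region (0xc for the linear kernel mapping, 0xd for vmalloc), and that GPR03 at hash_page+0x24 still holds the effective address argument. The helper names are hypothetical and the code is plain userspace C, not kernel code.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed ppc64 constants: the region ID is the top nibble of the address. */
#define REGION_SHIFT      60
#define KERNEL_REGION_ID  0xcULL /* assumed: PAGE_OFFSET   = 0xc000000000000000 */
#define VMALLOC_REGION_ID 0xdULL /* assumed: VMALLOC_START = 0xd000000000000000 */

/* Extract the region ID from a 64-bit effective address. */
static unsigned int region_id(uint64_t ea)
{
	return (unsigned int)(ea >> REGION_SHIFT);
}

/* Nonzero if the effective address falls in the assumed vmalloc region. */
static int is_vmalloc_ea(uint64_t ea)
{
	return region_id(ea) == VMALLOC_REGION_ID;
}

/* Sanity check of the constants against the assumed region base addresses. */
static void check_region_constants(void)
{
	assert(region_id(0xc000000000000000ULL) == KERNEL_REGION_ID);
	assert(region_id(0xd000000000000000ULL) == VMALLOC_REGION_ID);
}
```

Under those assumptions, GPR03 = d00007fffff00000 from the oops falls in region 0xd, which is consistent with memset() touching a vmalloc-backed percpu area, while the stack pointer c0000000fc90f5c0 is in the linear mapping.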
Patch

Index: work/arch/ia64/include/asm/sn/arch.h
===================================================================
--- work.orig/arch/ia64/include/asm/sn/arch.h
+++ work/arch/ia64/include/asm/sn/arch.h
@@ -71,8 +71,8 @@  DECLARE_PER_CPU(struct sn_hub_info_s, __
  * Compact node ID to nasid mappings kept in the per-cpu data areas of each
  * cpu.
  */
-DECLARE_PER_CPU(short, __sn_cnodeid_to_nasid[MAX_COMPACT_NODES]);
-#define sn_cnodeid_to_nasid	(&__get_cpu_var(__sn_cnodeid_to_nasid[0]))
+DECLARE_PER_CPU(short [MAX_COMPACT_NODES], __sn_cnodeid_to_nasid);
+#define sn_cnodeid_to_nasid	(&__get_cpu_var(__sn_cnodeid_to_nasid)[0])
 
 
 extern u8 sn_partition_id;
Index: work/arch/powerpc/mm/stab.c
===================================================================
--- work.orig/arch/powerpc/mm/stab.c
+++ work/arch/powerpc/mm/stab.c
@@ -138,7 +138,7 @@  static int __ste_allocate(unsigned long
 	if (!is_kernel_addr(ea)) {
 		offset = __get_cpu_var(stab_cache_ptr);
 		if (offset < NR_STAB_CACHE_ENTRIES)
-			__get_cpu_var(stab_cache[offset++]) = stab_entry;
+			__get_cpu_var(stab_cache)[offset++] = stab_entry;
 		else
 			offset = NR_STAB_CACHE_ENTRIES+1;
 		__get_cpu_var(stab_cache_ptr) = offset;
@@ -185,7 +185,7 @@  void switch_stab(struct task_struct *tsk
 		int i;
 
 		for (i = 0; i < offset; i++) {
-			ste = stab + __get_cpu_var(stab_cache[i]);
+			ste = stab + __get_cpu_var(stab_cache)[i];
 			ste->esid_data = 0; /* invalidate entry */
 		}
 	} else {
Index: work/arch/x86/kernel/cpu/cpu_debug.c
===================================================================
--- work.orig/arch/x86/kernel/cpu/cpu_debug.c
+++ work/arch/x86/kernel/cpu/cpu_debug.c
@@ -531,7 +531,7 @@  static int cpu_create_file(unsigned cpu,
 
 	/* Already intialized */
 	if (file == CPU_INDEX_BIT)
-		if (per_cpu(cpu_arr[type].init, cpu))
+		if (per_cpu(cpu_arr, cpu)[type].init)
 			return 0;
 
 	priv = kzalloc(sizeof(*priv), GFP_KERNEL);
@@ -543,7 +543,7 @@  static int cpu_create_file(unsigned cpu,
 	priv->reg = reg;
 	priv->file = file;
 	mutex_lock(&cpu_debug_lock);
-	per_cpu(priv_arr[type], cpu) = priv;
+	per_cpu(priv_arr, cpu)[type] = priv;
 	per_cpu(cpu_priv_count, cpu)++;
 	mutex_unlock(&cpu_debug_lock);
 
@@ -552,10 +552,10 @@  static int cpu_create_file(unsigned cpu,
 				    dentry, (void *)priv, &cpu_fops);
 	else {
 		debugfs_create_file(cpu_base[type].name, S_IRUGO,
-				    per_cpu(cpu_arr[type].dentry, cpu),
+				    per_cpu(cpu_arr, cpu)[type].dentry,
 				    (void *)priv, &cpu_fops);
 		mutex_lock(&cpu_debug_lock);
-		per_cpu(cpu_arr[type].init, cpu) = 1;
+		per_cpu(cpu_arr, cpu)[type].init = 1;
 		mutex_unlock(&cpu_debug_lock);
 	}
 
@@ -615,7 +615,7 @@  static int cpu_init_allreg(unsigned cpu,
 		if (!is_typeflag_valid(cpu, cpu_base[type].flag))
 			continue;
 		cpu_dentry = debugfs_create_dir(cpu_base[type].name, dentry);
-		per_cpu(cpu_arr[type].dentry, cpu) = cpu_dentry;
+		per_cpu(cpu_arr, cpu)[type].dentry = cpu_dentry;
 
 		if (type < CPU_TSS_BIT)
 			err = cpu_init_msr(cpu, type, cpu_dentry);
@@ -677,7 +677,7 @@  static void __exit cpu_debug_exit(void)
 
 	for (cpu = 0; cpu <  nr_cpu_ids; cpu++)
 		for (i = 0; i < per_cpu(cpu_priv_count, cpu); i++)
-			kfree(per_cpu(priv_arr[i], cpu));
+			kfree(per_cpu(priv_arr, cpu)[i]);
 }
 
 module_init(cpu_debug_init);
Index: work/arch/x86/kernel/cpu/perf_event.c
===================================================================
--- work.orig/arch/x86/kernel/cpu/perf_event.c
+++ work/arch/x86/kernel/cpu/perf_event.c
@@ -1253,7 +1253,7 @@  x86_perf_event_set_period(struct perf_ev
 	if (left > x86_pmu.max_period)
 		left = x86_pmu.max_period;
 
-	per_cpu(pmc_prev_left[idx], smp_processor_id()) = left;
+	per_cpu(pmc_prev_left, smp_processor_id())[idx] = left;
 
 	/*
 	 * The hw event starts counting from this event offset,
@@ -1470,7 +1470,7 @@  void perf_event_print_debug(void)
 		rdmsrl(x86_pmu.eventsel + idx, pmc_ctrl);
 		rdmsrl(x86_pmu.perfctr  + idx, pmc_count);
 
-		prev_left = per_cpu(pmc_prev_left[idx], cpu);
+		prev_left = per_cpu(pmc_prev_left, cpu)[idx];
 
 		pr_info("CPU#%d:   gen-PMC%d ctrl:  %016llx\n",
 			cpu, idx, pmc_ctrl);
Index: work/include/asm-generic/percpu.h
===================================================================
--- work.orig/include/asm-generic/percpu.h
+++ work/include/asm-generic/percpu.h
@@ -49,13 +49,22 @@  extern unsigned long __per_cpu_offset[NR
  * established ways to produce a usable pointer from the percpu variable
  * offset.
  */
-#define per_cpu(var, cpu) \
-	(*SHIFT_PERCPU_PTR(&per_cpu_var(var), per_cpu_offset(cpu)))
-#define __get_cpu_var(var) \
-	(*SHIFT_PERCPU_PTR(&per_cpu_var(var), my_cpu_offset))
-#define __raw_get_cpu_var(var) \
-	(*SHIFT_PERCPU_PTR(&per_cpu_var(var), __my_cpu_offset))
-
+#define per_cpu(var, cpu)	(*({					\
+	typeof(&per_cpu_var(var)) __pcpu_ptr__ = &per_cpu_var(var);	\
+	unsigned int __pcpu_cpu__ = (cpu);				\
+	pcpu_verify_access(__pcpu_ptr__, __pcpu_cpu__);			\
+	SHIFT_PERCPU_PTR(__pcpu_ptr__, per_cpu_offset(__pcpu_cpu__));	\
+}))
+#define __get_cpu_var(var)	(*({					\
+	typeof(&per_cpu_var(var)) __pcpu_ptr__ = &per_cpu_var(var);	\
+	pcpu_verify_access(__pcpu_ptr__, NR_CPUS);			\
+	SHIFT_PERCPU_PTR(__pcpu_ptr__, my_cpu_offset);			\
+}))
+#define __raw_get_cpu_var(var)	(*({					\
+	typeof(&per_cpu_var(var)) __pcpu_ptr__ = &per_cpu_var(var);	\
+	pcpu_verify_access(__pcpu_ptr__, NR_CPUS);			\
+	SHIFT_PERCPU_PTR(__pcpu_ptr__, __my_cpu_offset);		\
+}))
 
 #ifdef CONFIG_HAVE_SETUP_PER_CPU_AREA
 extern void setup_per_cpu_areas(void);
Index: work/include/linux/percpu-defs.h
===================================================================
--- work.orig/include/linux/percpu-defs.h
+++ work/include/linux/percpu-defs.h
@@ -7,6 +7,12 @@ 
  */
 #define per_cpu_var(var) per_cpu__##var
 
+#ifdef CONFIG_DEBUG_VERIFY_PER_CPU
+extern void pcpu_verify_access(void *ptr, unsigned int cpu);
+#else
+#define pcpu_verify_access(ptr, cpu)	do {} while (0)
+#endif
+
 /*
  * Base implementations of per-CPU variable declarations and definitions, where
  * the section in which the variable is to be placed is provided by the
Index: work/include/linux/percpu.h
===================================================================
--- work.orig/include/linux/percpu.h
+++ work/include/linux/percpu.h
@@ -127,7 +127,12 @@  extern int __init pcpu_page_first_chunk(
  * dynamically allocated. Non-atomic access to the current CPU's
  * version should probably be combined with get_cpu()/put_cpu().
  */
-#define per_cpu_ptr(ptr, cpu)	SHIFT_PERCPU_PTR((ptr), per_cpu_offset((cpu)))
+#define per_cpu_ptr(ptr, cpu)	({					\
+	typeof(ptr) __pcpu_ptr__ = (ptr);				\
+	unsigned int __pcpu_cpu__ = (cpu);				\
+	pcpu_verify_access(__pcpu_ptr__, __pcpu_cpu__);			\
+	SHIFT_PERCPU_PTR(__pcpu_ptr__, per_cpu_offset((__pcpu_cpu__)));	\
+})
 
 extern void *__alloc_reserved_percpu(size_t size, size_t align);
 
Index: work/kernel/softirq.c
===================================================================
--- work.orig/kernel/softirq.c
+++ work/kernel/softirq.c
@@ -560,7 +560,7 @@  EXPORT_PER_CPU_SYMBOL(softirq_work_list)
 
 static void __local_trigger(struct call_single_data *cp, int softirq)
 {
-	struct list_head *head = &__get_cpu_var(softirq_work_list[softirq]);
+	struct list_head *head = &__get_cpu_var(softirq_work_list)[softirq];
 
 	list_add_tail(&cp->list, head);
 
@@ -656,13 +656,13 @@  static int __cpuinit remote_softirq_cpu_
 
 		local_irq_disable();
 		for (i = 0; i < NR_SOFTIRQS; i++) {
-			struct list_head *head = &per_cpu(softirq_work_list[i], cpu);
+			struct list_head *head = &per_cpu(softirq_work_list, cpu)[i];
 			struct list_head *local_head;
 
 			if (list_empty(head))
 				continue;
 
-			local_head = &__get_cpu_var(softirq_work_list[i]);
+			local_head = &__get_cpu_var(softirq_work_list)[i];
 			list_splice_init(head, local_head);
 			raise_softirq_irqoff(i);
 		}
@@ -688,7 +688,7 @@  void __init softirq_init(void)
 		per_cpu(tasklet_hi_vec, cpu).tail =
 			&per_cpu(tasklet_hi_vec, cpu).head;
 		for (i = 0; i < NR_SOFTIRQS; i++)
-			INIT_LIST_HEAD(&per_cpu(softirq_work_list[i], cpu));
+			INIT_LIST_HEAD(&per_cpu(softirq_work_list, cpu)[i]);
 	}
 
 	register_hotcpu_notifier(&remote_softirq_cpu_notifier);
Index: work/lib/Kconfig.debug
===================================================================
--- work.orig/lib/Kconfig.debug
+++ work/lib/Kconfig.debug
@@ -805,6 +805,21 @@  config DEBUG_BLOCK_EXT_DEVT
 
 	  Say N if you are unsure.
 
+config DEBUG_VERIFY_PER_CPU
+	bool "Verify per-cpu accesses"
+	depends on DEBUG_KERNEL
+	depends on SMP
+	help
+
+	  This option makes the percpu access macros verify the
+	  specified processor and percpu variable offset on each
+	  access.  This helps catch percpu variable access bugs,
+	  which can corrupt unrelated memory regions and are then
+	  very difficult to track down, at the cost of making percpu
+	  accesses considerably slower.
+
+	  Say N if you are unsure.
+
 config DEBUG_FORCE_WEAK_PER_CPU
 	bool "Force weak per-cpu definitions"
 	depends on DEBUG_KERNEL
@@ -820,6 +835,8 @@  config DEBUG_FORCE_WEAK_PER_CPU
 	  To ensure that generic code follows the above rules, this
 	  option forces all percpu variables to be defined as weak.
 
+	  Say N if you are unsure.
+
 config LKDTM
 	tristate "Linux Kernel Dump Test Tool Module"
 	depends on DEBUG_KERNEL
Index: work/mm/percpu.c
===================================================================
--- work.orig/mm/percpu.c
+++ work/mm/percpu.c
@@ -1241,6 +1241,118 @@  void free_percpu(void *ptr)
 }
 EXPORT_SYMBOL_GPL(free_percpu);
 
+#ifdef CONFIG_DEBUG_VERIFY_PER_CPU
+static struct pcpu_chunk *pcpu_verify_match_chunk(void *addr)
+{
+	void *first_start = pcpu_first_chunk->base_addr;
+	struct pcpu_chunk *chunk;
+	int slot;
+
+	/* is it in the first chunk? */
+	if (addr >= first_start && addr < first_start + pcpu_unit_size) {
+		/* is it in the reserved area? */
+		if (addr < first_start + pcpu_reserved_chunk_limit)
+			return pcpu_reserved_chunk;
+		return pcpu_first_chunk;
+	}
+
+	/* walk each dynamic chunk */
+	for (slot = 0; slot < pcpu_nr_slots; slot++)
+		list_for_each_entry(chunk, &pcpu_slot[slot], list)
+			if (addr >= chunk->base_addr &&
+			    addr < chunk->base_addr + pcpu_unit_size)
+				return chunk;
+	return NULL;
+}
+
+void pcpu_verify_access(void *ptr, unsigned int cpu)
+{
+	static bool verifying[NR_CPUS];
+	static int warn_limit = 10;
+	char cbuf[80], obuf[160];
+	void *addr = __pcpu_ptr_to_addr(ptr);
+	bool is_static = false;
+	struct pcpu_chunk *chunk;
+	unsigned long flags;
+	int i, addr_off, off, len, end;
+
+	/* not been initialized yet or whined enough already */
+	if (unlikely(!pcpu_first_chunk || !warn_limit))
+		return;
+
+	/* don't re-enter */
+	preempt_disable();
+	if (verifying[raw_smp_processor_id()]) {
+		preempt_enable_no_resched();
+		return;
+	}
+	verifying[raw_smp_processor_id()] = true;
+
+	cbuf[0] = '\0';
+	obuf[0] = '\0';
+
+	if (unlikely(cpu < NR_CPUS && !cpu_possible(cpu)) && warn_limit)
+		snprintf(cbuf, sizeof(cbuf), "invalid cpu %u", cpu);
+
+	/*
+	 * We can enter this function from weird places and have no
+	 * way to reliably avoid deadlock.  If lock is available, grab
+	 * it and verify.  If not, just let it go through.
+	 */
+	if (!spin_trylock_irqsave(&pcpu_lock, flags))
+		goto out;
+
+	chunk = pcpu_verify_match_chunk(addr);
+	if (!chunk) {
+		snprintf(obuf, sizeof(obuf),
+			 "no matching chunk ptr=%p addr=%p", ptr, addr);
+		goto out_unlock;
+	}
+
+	addr_off = addr - chunk->base_addr;
+	if (chunk->base_addr == pcpu_first_chunk->base_addr)
+		if (chunk == pcpu_reserved_chunk || addr_off < -chunk->map[0])
+			is_static = true;
+
+	for (i = 0, off = 0; i < chunk->map_used; i++, off = end) {
+		len = chunk->map[i];
+		end = off + abs(len);
+
+		if (addr_off == off) {
+			if (unlikely(len > 0))
+				snprintf(obuf, sizeof(obuf),
+					 "free area accessed ptr=%p addr=%p "
+					 "off=%d len=%d", ptr, addr, off, len);
+			break;
+		}
+		if (!is_static && off < addr_off && addr_off < end) {
+			snprintf(obuf, sizeof(obuf),
+				 "%sarea accessed in the middle ptr=%p "
+				 "addr=%p:%d off=%d len=%d",
+				 len > 0 ? "free " : "",
+				 ptr, addr, addr_off, off, abs(len));
+			break;
+		}
+	}
+
+out_unlock:
+	spin_unlock_irqrestore(&pcpu_lock, flags);
+out:
+	if (unlikely(cbuf[0] || obuf[0])) {
+		printk(KERN_ERR "PERCPU: %s%s%s\n",
+		       cbuf, cbuf[0] ? ", " : "", obuf);
+		dump_stack();
+		if (!--warn_limit)
+			printk(KERN_WARNING "PERCPU: access warning limit "
+			       "reached, turning off access validation\n");
+	}
+
+	verifying[raw_smp_processor_id()] = false;
+	preempt_enable_no_resched();
+}
+EXPORT_SYMBOL_GPL(pcpu_verify_access);
+#endif	/* CONFIG_DEBUG_VERIFY_PER_CPU */
+
 static inline size_t pcpu_calc_fc_sizes(size_t static_size,
 					size_t reserved_size,
 					ssize_t *dyn_sizep)
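
The core of pcpu_verify_access() in the patch above is a walk over the chunk's area map, where (judging from the "free area accessed" test on len > 0) a positive entry is a free area and a negative entry is an allocated one. The following is a simplified userspace sketch of that walk, not the kernel code. The enum and function names are made up for illustration, the is_static exemption and all locking are left out, and unlike the patch it reports an offset past the end of the map as a separate case.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical result codes for the sketch. */
enum pcpu_check {
	PCPU_OK,        /* offset is the start of an allocated area */
	PCPU_FREE_AREA, /* offset is the start of a free area */
	PCPU_MIDDLE,    /* offset falls strictly inside some area */
	PCPU_NOT_FOUND  /* offset lies beyond every area in the map */
};

/*
 * Walk a signed area map: map[i] > 0 is a free area of that many bytes,
 * map[i] < 0 is an allocated area of abs(map[i]) bytes.  Areas are laid
 * out back to back starting at offset 0.
 */
static enum pcpu_check check_offset(const int *map, int map_used, int addr_off)
{
	int i, off = 0;

	assert(map != NULL && map_used >= 0);

	for (i = 0; i < map_used; i++) {
		int len = map[i];
		int end = off + abs(len);

		if (addr_off == off)
			return len > 0 ? PCPU_FREE_AREA : PCPU_OK;
		if (off < addr_off && addr_off < end)
			return PCPU_MIDDLE;
		off = end;
	}
	return PCPU_NOT_FOUND;
}
```

For a map of { -64, 32, -128 }, offsets 0 and 96 are valid area starts, offset 64 is the start of a free area, and offset 100 lands in the middle of the last allocation. The patch turns the last two cases into the "free area accessed" and "area accessed in the middle" messages.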