Patchwork 2.6.28-rc9 panics with crashkernel=256M while booting

login
register
mail settings
Submitter Chandru
Date Jan. 8, 2009, 10:29 a.m.
Message ID <200901081559.22204.chandru@in.ibm.com>
Download mbox | patch
Permalink /patch/17320/
State Superseded, archived
Headers show

Comments

Chandru - Jan. 8, 2009, 10:29 a.m.
On Wednesday 07 January 2009 22:55:25 Dave Hansen wrote:
> 
> I'm just suggesting making your fix in the ppc code instead of in
> mm/bootmem.c.
> 

Here are the changes that helped to boot the kernel. Please review it.
Thanks,

Signed-off-by: Chandru S <chandru@linux.vnet.ibm.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
---

 arch/powerpc/mm/numa.c |   11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)
Dave Hansen - Jan. 8, 2009, 8:03 p.m.
On Thu, 2009-01-08 at 15:59 +0530, Chandru wrote:
> @@ -898,9 +899,17 @@ static void mark_reserved_regions_for_ni
>                          * if reserved region extends past active region
>                          * then trim size to active region
>                          */
> -                       if (end_pfn > node_ar.end_pfn)
> +                       if (end_pfn > node_ar.end_pfn) {
>                                 reserve_size = (node_ar.end_pfn << PAGE_SHIFT)
>                                         - (start_pfn << PAGE_SHIFT);
> +                               /*
> +                                * resize it further if the reservation could
> +                                * cross the last page in this node
> +                                */
> +                               if (PFN_UP(physbase+reserve_size) >
> +                                                node_end_pfn)
> +                                       reserve_size -= PAGE_SIZE;
> +                       }
>                         /*

Now I'm even more confused.  Could you please send a fully changelogged
patch that describes the problem, and how this fixes it?  This just
seems like an off-by-one error, which isn't what I thought we had before
at all.

I'm also horribly confused why PFN_UP is needed here.  Is 'physbase' not
page aligned?  reserve_size looks like it *has* to be.  'end_pfn' is
always (as far as I have ever seen in the kernel) the pfn of the page
after the area we are interested in and we treat it as such in that
function.  In the case of an unaligned physbase, that wouldn't be true.

Think of the case where we have a 1-byte reservation.  start_pfn will
equal end_pfn and we won't go into that while loop at *all* and won't
reserve anything.

Does 'end_pfn' need fixing?

-- Dave
Chandru - Jan. 9, 2009, 11:07 a.m.
On Friday 09 January 2009 01:33:12 Dave Hansen wrote:
> Now I'm even more confused.  Could you please send a fully changelogged
> patch that describes the problem, and how this fixes it?  This just
> seems like an off-by-one error, which isn't what I thought we had before
> at all.
> 
> I'm also horribly confused why PFN_UP is needed here.  Is 'physbase' not
> page aligned?  reserve_size looks like it *has* to be.  'end_pfn' is
> always (as far as I have ever seen in the kernel) the pfn of the page
> after the area we are interested in and we treat it as such in that
> function.  In the case of an unaligned physbase, that wouldn't be true.
> 
> Think of the case where we have a 1-byte reservation.  start_pfn will
> equal end_pfn and we won't go into that while loop at *all* and won't
> reserve anything.
> 
> Does 'end_pfn' need fixing?
> 

Attached is the console log with debug command line parameters enabled and 
with couple of more debug statements added to the code.  Please take a look at it. 

thanks,
Chandru
Using 0078209e bytes for initrd buffer
Please wait, loading kernel...
Allocated 00d00000 bytes for kernel @ 02d00000
   Elf64 kernel loaded...
Loading ramdisk...
ramdisk loaded 0078209e @ 00d00000
OF stdout device is: /vdevice/vty@30000000
Hypertas detected, assuming LPAR !
command line: root=/dev/disk/by-id/scsi-35000cca0030c7be2-part3  xmon=on crashkernel=256M@32M debug bootmem_debug numa=debug
memory layout at init:
  alloc_bottom : 0000000003880000
  alloc_top    : 0000000010000000
  alloc_top_hi : 0000000100000000
  rmo_top      : 0000000010000000
  ram_top      : 0000000100000000
Looking for displays
instantiating rtas at 0x000000000f4e0000 ... done
boot cpu hw idx 0000000000000000
copying OF device tree ...
Building dt strings...
Building dt structure...
Device tree strings 0x0000000003a90000 -> 0x0000000003a917bc
Device tree struct  0x0000000003aa0000 -> 0x0000000003ac0000
Calling quiesce ...
returning from prom_init
Reserving 256MB of memory at 32MB for crashkernel (System RAM: 4096MB)
Phyp-dump disabled at boot time
Using pSeries machine description
Page orders: linear mapping = 24, virtual = 16, io = 12
Using 1TB segments
Found initrd at 0xc000000000d00000:0xc00000000148209e
console [udbg0] enabled
Partition configured for 4 cpus.
CPU maps initialized for 2 threads per core
 (thread shift is 1)
Starting Linux PPC64 #7 SMP Fri Jan 9 04:50:05 CST 2009
-----------------------------------------------------
ppc64_pft_size                = 0x1a
physicalMemorySize            = 0x100000000
htab_hash_mask                = 0x7ffff
-----------------------------------------------------
Initializing cgroup subsys cpuset
Initializing cgroup subsys cpu
Linux version 2.6.28-1-ppc64 (root@rulerlp10) (gcc version 4.3.2 [gcc-4_3-branch revision 141291] (SUSE Linux) ) #7 SMP Fri Jan 9 04:50:05 CST 2009
[boot]0012 Setup Arch
NUMA associativity depth for CPU/Memory: 3
Node 0 Memory:
Node 5 Memory: 0x0-0x10000000
Node 7 Memory: 0x10000000-0x100000000
adding cpu 0 to node 0
node 0
NODE_DATA() = c0000000fffeda80
node 5
NODE_DATA() = c000000001f78800
start_paddr = 0
end_paddr = 10000000
bootmap_paddr = 1f60000
bootmem::init_bootmem_core nid=5 start=0 map=1f6 end=1000 mapsize=200
bootmem::mark_bootmem_node nid=5 start=0 end=1000 reserve=0 flags=0 bdata->node_min_pfn=0x0 bdata->node_low_pfn=0x1000
bootmem::__free nid=5 start=0 end=1000
reserve_bootmem :
        physbase :0x0  size:0xb70000 reserve_size=0xb70000 nid=5
        start_pfn:0x0  end_pfn:0xb7
        node_ar.start_pfn:0x0 node_ar.end_pfn:0x1000
        node->node_start_pfn:0x0 node->node_spanned_pages:4096
bootmem::mark_bootmem_node nid=5 start=0 end=b7 reserve=1 flags=0 bdata->node_min_pfn=0x0 bdata->node_low_pfn=0x1000
bootmem::__reserve nid=5 start=0 end=b7 flags=0
reserve_bootmem :
        physbase :0xd00000  size:0x78209e reserve_size=0x78209e nid=5
        start_pfn:0xd0  end_pfn:0x148
        node_ar.start_pfn:0x0 node_ar.end_pfn:0x1000
        node->node_start_pfn:0x0 node->node_spanned_pages:4096
bootmem::mark_bootmem_node nid=5 start=d0 end=149 reserve=1 flags=0 bdata->node_min_pfn=0x0 bdata->node_low_pfn=0x1000
bootmem::__reserve nid=5 start=d0 end=149 flags=0
reserve_bootmem :
        physbase :0xd00000  size:0x790000 reserve_size=0x790000 nid=5
        start_pfn:0xd0  end_pfn:0x149
        node_ar.start_pfn:0x0 node_ar.end_pfn:0x1000
        node->node_start_pfn:0x0 node->node_spanned_pages:4096
bootmem::mark_bootmem_node nid=5 start=d0 end=149 reserve=1 flags=0 bdata->node_min_pfn=0x0 bdata->node_low_pfn=0x1000
bootmem::__reserve nid=5 start=d0 end=149 flags=0
bootmem::__reserve silent double reserve of PFN d0
bootmem::__reserve silent double reserve of PFN d1
...
...
...
bootmem::__reserve silent double reserve of PFN 147
bootmem::__reserve silent double reserve of PFN 148
reserve_bootmem :
        physbase :0x1f60000  size:0x10000 reserve_size=0x10000 nid=5
        start_pfn:0x1f6  end_pfn:0x1f7
        node_ar.start_pfn:0x0 node_ar.end_pfn:0x1000
        node->node_start_pfn:0x0 node->node_spanned_pages:4096
bootmem::mark_bootmem_node nid=5 start=1f6 end=1f7 reserve=1 flags=0 bdata->node_min_pfn=0x0 bdata->node_low_pfn=0x1000
bootmem::__reserve nid=5 start=1f6 end=1f7 flags=0
reserve_bootmem :
        physbase :0x1f78800  size:0x10087800 reserve_size=0xe090000 nid=5
        start_pfn:0x1f7  end_pfn:0x1200
        node_ar.start_pfn:0x0 node_ar.end_pfn:0x1000
        node->node_start_pfn:0x0 node->node_spanned_pages:4096
bootmem::mark_bootmem_node nid=5 start=1f7 end=1001 reserve=1 flags=0 bdata->node_min_pfn=0x0 bdata->node_low_pfn=0x1000
------------[ cut here ]------------
kernel BUG at mm/bootmem.c:279!
cpu 0x0: Vector: 700 (Program Check) at [c0000000008a39d0]
    pc: c000000000676ec0: .mark_bootmem_node+0xb4/0x12c
    lr: c000000000676e98: .mark_bootmem_node+0x8c/0x12c
    sp: c0000000008a3c50
   msr: 8000000000021032
  current = 0xc0000000007fa820
  paca    = 0xc000000000962c00
    pid   = 0, comm = swapper
kernel BUG at mm/bootmem.c:279!
enter ? for help
[c0000000008a3d00] c000000000661d3c .do_init_bootmem+0xc3c/0xd78
[c0000000008a3e40] c0000000006567f0 .setup_arch+0x1a4/0x21c
[c0000000008a3ee0] c000000000650868 .start_kernel+0xdc/0x56c
[c0000000008a3f90] c0000000000083b8 .start_here_common+0x1c/0x64
0:mon>
Chandru - Jan. 15, 2009, 8:05 a.m.
On Friday 09 January 2009 16:37:24 Chandru wrote:
> On Friday 09 January 2009 01:33:12 Dave Hansen wrote:
> > Now I'm even more confused.  Could you please send a fully changelogged
> > patch that describes the problem, and how this fixes it?  This just
> > seems like an off-by-one error, which isn't what I thought we had before
> > at all.
> > 
> > I'm also horribly confused why PFN_UP is needed here.  Is 'physbase' not
> > page aligned?  reserve_size looks like it *has* to be.  'end_pfn' is
> > always (as far as I have ever seen in the kernel) the pfn of the page
> > after the area we are interested in and we treat it as such in that
> > function.  In the case of an unaligned physbase, that wouldn't be true.
> > 
> > Think of the case where we have a 1-byte reservation.  start_pfn will
> > equal end_pfn and we won't go into that while loop at *all* and won't
> > reserve anything.
> > 
> > Does 'end_pfn' need fixing?
> > 
> 
> Attached is the console log with debug command line parameters enabled and 
> with couple of more debug statements added to the code.  Please take a look at it. 
> 
> thanks,
> Chandru
> 

Hello Dave, From the debug console output, if there is anything you can add here,
pls let me know.

thanks
Chandru - Jan. 16, 2009, 12:16 p.m.
On Thursday 15 January 2009 13:35:27 Chandru wrote:
> Hello Dave, From the debug console output, if there is anything you can add
> here, pls let me know.

As we can see from the console output here,  physbase isn't page aligned when 
the panic occurs.  So we could as well send (start_pfn << PAGE_SHIFT) to 
reserve_bootmem_node() instead of physbase. your thoughts ?.

Also end_pfn in mark_reserved_region_for_nid() is defined as 

unsigned long end_pfn = ((physbase + size) >> PAGE_SHIFT);

Does this refer to the pfn after the area that we are interested in ?.  We have 
atleast two fixes here,  
1. Limit start and end to bdata->node_min_pfn  and bdata->node_low_pfn in 
reserve_bootmem_node() and add comments out in there that the caller of the 
funtion should be aware of how much are they reserving. 
2. send (start_pfn << PAGE_SHIFT) to reserve_bootmem_node() instead of 
physbase. 

Chandru
Dave Hansen - Jan. 16, 2009, 5:52 p.m.
On Fri, 2009-01-16 at 17:46 +0530, Chandru wrote:
> On Thursday 15 January 2009 13:35:27 Chandru wrote:
> > Hello Dave, From the debug console output, if there is anything you can add
> > here, pls let me know.
> 
> As we can see from the console output here,  physbase isn't page aligned when 
> the panic occurs.  So we could as well send (start_pfn << PAGE_SHIFT) to 
> reserve_bootmem_node() instead of physbase. your thoughts ?.
> 
> Also end_pfn in mark_reserved_region_for_nid() is defined as 
> 
> unsigned long end_pfn = ((physbase + size) >> PAGE_SHIFT);
> 
> Does this refer to the pfn after the area that we are interested in ?.  We have 
> atleast two fixes here,  
> 1. Limit start and end to bdata->node_min_pfn  and bdata->node_low_pfn in 
> reserve_bootmem_node() and add comments out in there that the caller of the 
> funtion should be aware of how much are they reserving. 
> 2. send (start_pfn << PAGE_SHIFT) to reserve_bootmem_node() instead of 
> physbase. 

Just looking at it, that calculation is OK.  But, there was one in your
dmesg that looked a page too long, like page 0x1001 instead of 0x1000.
I'd find out how that happened.

-- Dave
Chandru - Jan. 19, 2009, 8:11 a.m.
On Friday 16 January 2009 23:22:57 Dave Hansen wrote:
> Just looking at it, that calculation is OK.  But, there was one in your
> dmesg that looked a page too long, like page 0x1001 instead of 0x1000.
> I'd find out how that happened.

That is a result of PFN_UP() in reserve_bootmem_node() for which we hit the 
BUG_ON() eventually.   Prior to calling reserve_bootmem_node() we have...

node_ar.end_pfn = node->node_end_pfn = PFN_DOWN(physbase+reserve_size). 

Hence a PFN_UP() will raise the value of 'end'. The kernel has 
CONFIG_PPC_64K_PAGES enabled in the config. 

Chandru

Patch

--- linux-2.6.28/arch/powerpc/mm/numa.c.orig	2009-01-08 03:20:41.000000000 -0600
+++ linux-2.6.28/arch/powerpc/mm/numa.c	2009-01-08 03:50:41.000000000 -0600
@@ -16,6 +16,7 @@ 
 #include <linux/module.h>
 #include <linux/nodemask.h>
 #include <linux/cpu.h>
+#include <linux/pfn.h>
 #include <linux/notifier.h>
 #include <linux/lmb.h>
 #include <linux/of.h>
@@ -898,9 +899,17 @@  static void mark_reserved_regions_for_ni
 			 * if reserved region extends past active region
 			 * then trim size to active region
 			 */
-			if (end_pfn > node_ar.end_pfn)
+			if (end_pfn > node_ar.end_pfn) {
 				reserve_size = (node_ar.end_pfn << PAGE_SHIFT)
 					- (start_pfn << PAGE_SHIFT);
+				/*
+				 * resize it further if the reservation could
+				 * cross the last page in this node
+				 */
+				if (PFN_UP(physbase+reserve_size) >
+						 node_end_pfn)
+					reserve_size -= PAGE_SIZE;
+			}
 			/*
 			 * Only worry about *this* node, others may not
 			 * yet have valid NODE_DATA().