Patchwork OF-related boot crash in 3.3.0-rc3-00188-g3ec1e88

login
register
mail settings
Submitter Tejun Heo
Date Feb. 21, 2012, 1:05 a.m.
Message ID <20120221010537.GA15898@dhcp-172-17-108-109.mtv.corp.google.com>
Download mbox | patch
Permalink /patch/142215/
State Not Applicable
Delegated to: David Miller
Headers show

Comments

Tejun Heo - Feb. 21, 2012, 1:05 a.m.
Hello,

Meelis, can you please apply the following patch before & after the
offending commit, boot with "memblock=debug" added as kernel param and
post the boot log?  The patch will generate some offset warnings after
the commit but should work fine.

Sam, David, as I'm not familiar with the code base, is it possible to
tell which address is corrupted (zeroed, it seems)?  ie. can we add
"if (XXX == NULL) printk("%p is corrputed\n"...);" somewhere?

Thanks.

--
To unsubscribe from this list: send the line "unsubscribe sparclinux" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Meelis Roos - Feb. 22, 2012, 12:36 a.m.
> Meelis, can you please apply the following patch before & after the
> offending commit, boot with "memblock=debug" added as kernel param and
> post the boot log?  The patch will generate some offset warnings after
> the commit but should work fine.

Before the commit (v3.2-rc3-75-g0ee332c): memblock1.gz (attached)
After the commit (v3.2-rc3-76-g7bd0b0f): memblock2.gz (attached)

In addition, a third type of sparc machines breaks in a third way - V210 
and V240 just hang after telling

console [tty0] enabled, bootconsole disabled

and before calibrating the delay loop. Bisect has led to the same commit.
Sam Ravnborg - Feb. 22, 2012, 5:03 p.m.
On Mon, Feb 20, 2012 at 05:05:37PM -0800, Tejun Heo wrote:
> Hello,
> 
> Meelis, can you please apply the following patch before & after the
> offending commit, boot with "memblock=debug" added as kernel param and
> post the boot log?  The patch will generate some offset warnings after
> the commit but should work fine.
> 
> Sam, David, as I'm not familiar with the code base, is it possible to
> tell which address is corrupted (zeroed, it seems)?  ie. can we add
> "if (XXX == NULL) printk("%p is corrputed\n"...);" somewhere?

No idea - sorry. I spend most of the time with sparc32 - which I
do not even feel familiar with yet :-(

One thing I noticed while working with memblock for sparc32 (*) is that allocations
are done top-down. So we may end up allocatng memory with a considerably higher
address than we are used to.
This is obviously just a wild guess...

Meelis - do the affected boxes have any special memory configurations?
Could you try to boot with a sensible mem=xxx value to see if limiting the memory
helps.

(*) I have re-done the original patch-set and I have a quite good feeling about it.
HIGHMEM support is outstanding - I got a bit confused when I looked at x86.

But my ss5 crashes the first time I try to use the allocated memory - 
so I assume I have some silly issue somewhere. Nothing points at memblock
in this case.

	Sam
--
To unsubscribe from this list: send the line "unsubscribe sparclinux" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Meelis Roos - Feb. 22, 2012, 5:12 p.m.
> Meelis - do the affected boxes have any special memory configurations?

Nothin special to me. E3500 has 2G, V100 has 1G, V210 and V240 have 2G 
and 1.5G.

> Could you try to boot with a sensible mem=xxx value to see if limiting the memory
> helps.

Like mem=256M? Will try.
Sam Ravnborg - Feb. 22, 2012, 5:21 p.m.
On Wed, Feb 22, 2012 at 07:12:06PM +0200, Meelis Roos wrote:
> > Meelis - do the affected boxes have any special memory configurations?
> 
> Nothin special to me. E3500 has 2G, V100 has 1G, V210 and V240 have 2G 
> and 1.5G.
> 
> > Could you try to boot with a sensible mem=xxx value to see if limiting the memory
> > helps.
> 
> Like mem=256M? Will try.
Think just a little more - I do not think this will help.
I confused myself with some of the sparc32 issues I have hit.

I have looked a little at the log files you included.
The only thing that looked different was that the faulty version
had a number after "@" which is higher than 1 - where the OK always have 1.

This is "idx" in memblock_insert_region() - but I did not look closer.

	Sam
--
To unsubscribe from this list: send the line "unsubscribe sparclinux" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Meelis Roos - Feb. 22, 2012, 5:41 p.m.
> > > Could you try to boot with a sensible mem=xxx value to see if limiting the memory
> > > helps.
> > 
> > Like mem=256M? Will try.
> Think just a little more - I do not think this will help.

Tried it on the 2G V210. It changes the picture. With 2G RAM, it 
just hangs.

With mem=256M it produces a crash in strlen and of_alias_scan like in 
V100 with 1G.

mem=512M results in the same strlen error.

mem=1G results in a stranger error:

[    0.000000] Kernel panic - not syncing: ERROR: Failed to allocate 0x90 bytes below 0x0.
[    0.000000] 
[    0.000000] Call Trace:
[    0.000000]  [00000000007a6a28] memblock_alloc_base+0x28/0x38
[    0.000000]  [000000000079ca50] prom_early_alloc+0xc/0x60
[    0.000000]  [00000000007ae090] of_pdt_create_node.part.0+0x4/0xe0
[    0.000000]  [00000000007ae250] of_pdt_build_devicetree+0x30/0xa0
[    0.000000]  [000000000079c4a8] prom_build_devicetree+0x18/0x38
[    0.000000]  [00000000007a03c0] paging_init+0x59c/0x6bc
[    0.000000]  [000000000079be50] setup_arch+0xf8/0x108
[    0.000000]  [000000000079a4e8] start_kernel+0x78/0x30c
[    0.000000]  [00000000006a3e80] tlb_fixup_done+0x98/0xa0
[    0.000000]  [0000000000000000]           (null)

The working machines have 512M RAM, 834M RAM and 2G RAM so it's not just 
the amount of RAM.
Richard Mortimer - Feb. 22, 2012, 6:22 p.m.
On 22/02/2012 00:36, Meelis Roos wrote:
>> Meelis, can you please apply the following patch before&  after the
>> offending commit, boot with "memblock=debug" added as kernel param and
>> post the boot log?  The patch will generate some offset warnings after
>> the commit but should work fine.
>
> Before the commit (v3.2-rc3-75-g0ee332c): memblock1.gz (attached)
> After the commit (v3.2-rc3-76-g7bd0b0f): memblock2.gz (attached)
>
Its a long time since I regularly had to worry about SPARC boxes (not) 
booting so may be the difference between virtual & physical addresses 
but I notice that some of the addresses in the register dump have 
non-zero values in the upper 32 bits but the memblock values have zero 
in the upper half.


memblock reserved: ADD [0x0000007fcc0a40-0x0000007fcc0a4e] node 1
memblock reserved: add [0x0000007fcc0a40-000000007fcc0a4e] node 1 @767

But a similar address in the registers has fffff800 in there.

o4: fffff8007fcc0a4d

I know that there are a number of explanations why things would be 
different (32 bit acesses etc) but it could explain things plus we would 
be talking 64 bit addresses in the kernel.

Just a thought.

Richard


> In addition, a third type of sparc machines breaks in a third way - V210
> and V240 just hang after telling
>
> console [tty0] enabled, bootconsole disabled
>
> and before calibrating the delay loop. Bisect has led to the same commit.
>
--
To unsubscribe from this list: send the line "unsubscribe sparclinux" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller - Feb. 22, 2012, 8:26 p.m.
From: Richard Mortimer <richm@oldelvet.org.uk>
Date: Wed, 22 Feb 2012 18:22:36 +0000

> memblock reserved: ADD [0x0000007fcc0a40-0x0000007fcc0a4e] node 1
> memblock reserved: add [0x0000007fcc0a40-000000007fcc0a4e] node 1 @767

These are physical addresses.

> But a similar address in the registers has fffff800 in there.
> 
> o4: fffff8007fcc0a4d

All of physical memory is mapped linearly starting at 0xfffff80000000000
and this is such a virtual address.
--
To unsubscribe from this list: send the line "unsubscribe sparclinux" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch

diff --git a/mm/memblock.c b/mm/memblock.c
index 1adbef0..dccfced 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -179,9 +179,15 @@  int __init_memblock memblock_reserve_reserved_regions(void)
 
 static void __init_memblock memblock_remove_region(struct memblock_type *type, unsigned long r)
 {
-	type->total_size -= type->regions[r].size;
-	memmove(&type->regions[r], &type->regions[r + 1],
-		(type->cnt - (r + 1)) * sizeof(type->regions[r]));
+	struct memblock_region *rgn = &type->regions[r];
+
+	memblock_dbg("     memblock %s: rm  [%#016llx-%#016llx] node %d\n",
+		     memblock_type_name(type),
+		     (unsigned long long)rgn->base,
+		     (unsigned long long)rgn->base + rgn->size, rgn->nid);
+
+	type->total_size -= rgn->size;
+	memmove(rgn, rgn + 1, (type->cnt - (r + 1)) * sizeof(*rgn));
 	type->cnt--;
 
 	/* Special case for empty arrays */
@@ -317,6 +323,9 @@  static void __init_memblock memblock_insert_region(struct memblock_type *type,
 	memblock_set_region_node(rgn, nid);
 	type->cnt++;
 	type->total_size += size;
+	memblock_dbg("   memblock %s: add [%#016llx-%016llx] node %d @%d\n",
+		     memblock_type_name(type), (unsigned long long)base,
+		     (unsigned long long)base + size, nid, idx);
 }
 
 /**
@@ -342,6 +351,10 @@  static int __init_memblock memblock_add_region(struct memblock_type *type,
 	phys_addr_t end = base + memblock_cap_size(base, &size);
 	int i, nr_new;
 
+	memblock_dbg("   memblock %s: ADD [%#016llx-%#016llx] node %d\n",
+		     memblock_type_name(type), (unsigned long long)base,
+		     (unsigned long long)base + size, nid);
+
 	/* special case for empty array */
 	if (type->regions[0].size == 0) {
 		WARN_ON(type->cnt != 1 || type->total_size);
@@ -349,6 +362,8 @@  static int __init_memblock memblock_add_region(struct memblock_type *type,
 		type->regions[0].size = size;
 		memblock_set_region_node(&type->regions[0], nid);
 		type->total_size = size;
+		memblock_dbg("     memblock %s: add first entry\n",
+			     memblock_type_name(type));
 		return 0;
 	}
 repeat:
@@ -494,6 +509,10 @@  static int __init_memblock __memblock_remove(struct memblock_type *type,
 	int start_rgn, end_rgn;
 	int i, ret;
 
+	memblock_dbg("     memblock %s: RM  [%#016llx-%016llx]\n",
+		     memblock_type_name(type), (unsigned long long)base,
+		     (unsigned long long)base + size);
+
 	ret = memblock_isolate_range(type, base, size, &start_rgn, &end_rgn);
 	if (ret)
 		return ret;