Patchwork kernel crashes on Sun Fire 480R with cassini network driver

Submitter David Miller
Date Dec. 18, 2008, 3:46 a.m.
Message ID <20081217.194646.25300922.davem@davemloft.net>
Permalink /patch/14614/
State Not Applicable
Delegated to: David Miller

Comments

David Miller - Dec. 18, 2008, 3:46 a.m.
From: Hermann Lauer <Hermann.Lauer@iwr.uni-heidelberg.de>
Date: Fri, 24 Oct 2008 17:24:33 +0200

> Hello all,
> 
> On Mon, Sep 29, 2008 at 10:19:54AM +0200, Hermann Lauer wrote:
> > > while trying to run vanilla linux-2.6.26.5 or the debian etchnhalf kernel
> > > (2.6.24.x) I noticed the cassini network driver for the builtin gigabit
> > > network is unstable and brings the kernel down on a dualprocessor sparc
> > > SunFire 480R with a Hardware FATAL RESET.
> > 
> > I now noticed that even a few pings are possible with the cassini
> > network driver if you configure it statically, but then the machine again 
> > crashes with the known: 
> > ERROR: System Hardware FATAL RESET from  CPU0 CPU2
> 
> tried 2.6.27.2 today, but this hangs already at loading the kernel.
> Output is attached. Please tell me what else I can provide.

Unfortunately, "-p" doesn't do anything any more and the kernel
is stopping or crashing during the part of the boot between
when the early prom console is disabled and the real console is
set up.

The way to debug this is to manually get rid of the CON_BOOT flag
in the early prom console structure, like the following patch.

If you reboot with this it will print more information so we can
try and diagnose this further.

--
To unsubscribe from this list: send the line "unsubscribe sparclinux" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hermann Lauer - Dec. 19, 2008, 8:56 a.m.
On Wed, Dec 17, 2008 at 07:46:46PM -0800, David Miller wrote:
> > tried 2.6.27.2 today, but this hangs already at loading the kernel.
> > Output is attached. Please tell me what else I can provide.
> 
> Unfortunately, "-p" doesn't do anything any more and the kernel
> is stopping or crashing during the part of the boot between
> when the early prom console is disabled and the real console is
> set up.
> 
> The way to debug this is to manually get rid of the CON_BOOT flag
> in the early prom console structure, like the following patch.

I applied your patch to the vanilla 2.6.27.9 source and booted it;
the console output up to the hang is below. As a reminder,
2.6.26.5 did not hang.

-------------------------------------------------------------------
Sun Fire 880, No Keyboard
Copyright 2007 Sun Microsystems, Inc.  All rights reserved.
OpenBoot 4.22.34, 8192 MB memory installed, Serial #50911524.
Ethernet address 0:3:ba:8:d9:24, Host ID: 8308d924.

ERROR: OpenBoot Diagnostics failed
WARNING: Device /pci@8,600000/SUNW,qlc@2 being marked with 'status' == fail
Rebooting with command: boot
Boot device: disk  File and args:
SILO Version 1.4.13
boot:
Linux                    27.9                     LinuxOLD
boot: 27.9
Allocated 8 Megs of memory at 0x40000000 for kernel
Loaded kernel version 2.6.27
Loading initial ramdisk (5350711 bytes at 0xA000400000 phys, 0x40C00000 virt)...
\
[    0.000000] PROMLIB: Sun IEEE Boot Prom 'OBP 4.22.34 2007/07/23 13:01'
[    0.000000] PROMLIB: Root node compatible:
[    0.000000] Initializing cgroup subsys cpuset
[    0.000000] Initializing cgroup subsys cpu
[    0.000000] Linux version 2.6.27.9 (hlauer@install1) (gcc version 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)) #1 SMP Thu Dec 18 17:20:27 CET 2008
[    0.000000] console [earlyprom0] enabled
[    0.000000] ARCH: SUN4U
[    0.000000] Ethernet address: 00:03:ba:08:d9:24
[    0.000000] Kernel: Using 2 locked TLB entries for main kernel image.
[    0.000000] Remapping the kernel... done.
[    0.000000] OF stdout device is: /pci@9,700000/ebus@1/serial@1,400000:a
[    0.000000] PROM: Built device tree with 102802 bytes of memory.
[    0.000000] Top of RAM: 0xa1ffb1a000, Total RAM: 0x1ffb0e000
[    0.000000] Memory hole size: 655360MB
[    0.000000] [0000000318000000-fffff8a000c00000] page_structs=131072 node=0 entry=1120/0
[    0.000000] [0000000318000000-fffff8a001000000] page_structs=131072 node=0 entry=1121/0
[    0.000000] [0000000318700000-fffff8a001400000] page_structs=131072 node=0 entry=1122/0
[    0.000000] [0000000318700000-fffff8a001800000] page_structs=131072 node=0 entry=1123/0
[    0.000000] [0000000318e00000-fffff8a001c00000] page_structs=131072 node=0 entry=1124/0
[    0.000000] [0000000318e00000-fffff8a002000000] page_structs=131072 node=0 entry=1125/0
[    0.000000] [0000000319500000-fffff8a002400000] page_structs=131072 node=0 entry=1126/0
[    0.000000] [0000000319c00000-fffff8a002800000] page_structs=131072 node=0 entry=1127/0
[    0.000000] [0000000319c00000-fffff8a002c00000] page_structs=131072 node=0 entry=1128/0
[    0.000000] [000000031a300000-fffff8a003000000] page_structs=131072 node=0 entry=1129/0
[    0.000000] [000000031a300000-fffff8a003400000] page_structs=131072 node=0 entry=1130/0
[    0.000000] [000000031aa00000-fffff8a003800000] page_structs=131072 node=0 entry=1131/0
[    0.000000] [000000031aa00000-fffff8a003c00000] page_structs=131072 node=0 entry=1132/0
[    0.000000] [000000031b100000-fffff8a004000000] page_structs=131072 node=0 entry=1133/0
[    0.000000] Zone PFN ranges:
[    0.000000]   Normal   0x05000000 -> 0x050ffd8d
[    0.000000] Movable zone start PFN for each node
[    0.000000] early_node_map[4] active PFN ranges
[    0.000000]     0: 0x05000000 -> 0x050ff7ff
[    0.000000]     0: 0x050ff800 -> 0x050ffd09
[    0.000000]     0: 0x050ffd0b -> 0x050ffd7b
[    0.000000]     0: 0x050ffd7e -> 0x050ffd8d
[    0.000000] Booting Linux...
[    0.000000] Built 1 zonelists in Zone order, mobility grouping on.  Total pages: 1040779
[    0.000000] Kernel command line: root=/dev/sda1 ro
[    0.000000] PID hash table entries: 4096 (order: 12, 32768 bytes)
[    0.000000] clocksource: mult[640000] shift[16]
[    0.000000] clockevent: mult[28f5c28] shift[32]
[   46.108402] Console: colour dummy device 80x25
[   46.161498] console [tty0] enabled
[    0.000000] PROMLIB: Sun IEEE Boot Prom 'OBP 4.22.34 2007/07/23 13:01'
[    0.000000] PROMLIB: Root node compatible:
[    0.000000] Initializing cgroup subsys cpuset
[    0.000000] Initializing cgroup subsys cpu
[    0.000000] Linux version 2.6.27.9 (hlauer@install1) (gcc version 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)) #1 SMP Thu Dec 18 17:20:27 CET 2008
[    0.000000] console [earlyprom0] enabled
[    0.000000] ARCH: SUN4U
[    0.000000] Ethernet address: 00:03:ba:08:d9:24
[    0.000000] Kernel: Using 2 locked TLB entries for main kernel image.
[    0.000000] Remapping the kernel... done.
[    0.000000] OF stdout device is: /pci@9,700000/ebus@1/serial@1,400000:a
[    0.000000] PROM: Built device tree with 102802 bytes of memory.
[    0.000000] Top of RAM: 0xa1ffb1a000, Total RAM: 0x1ffb0e000
[    0.000000] Memory hole size: 655360MB
[    0.000000] [0000000318000000-fffff8a000c00000] page_structs=131072 node=0 entry=1120/0
[    0.000000] [0000000318000000-fffff8a001000000] page_structs=131072 node=0 entry=1121/0
[    0.000000] [0000000318700000-fffff8a001400000] page_structs=131072 node=0 entry=1122/0
[    0.000000] [0000000318700000-fffff8a001800000] page_structs=131072 node=0 entry=1123/0
[    0.000000] [0000000318e00000-fffff8a001c00000] page_structs=131072 node=0 entry=1124/0
[    0.000000] [0000000318e00000-fffff8a002000000] page_structs=131072 node=0 entry=1125/0
[    0.000000] [0000000319500000-fffff8a002400000] page_structs=131072 node=0 entry=1126/0
[    0.000000] [0000000319c00000-fffff8a002800000] page_structs=131072 node=0 entry=1127/0
[    0.000000] [0000000319c00000-fffff8a002c00000] page_structs=131072 node=0 entry=1128/0
[    0.000000] [000000031a300000-fffff8a003000000] page_structs=131072 node=0 entry=1129/0
[    0.000000] [000000031a300000-fffff8a003400000] page_structs=131072 node=0 entry=1130/0
[    0.000000] [000000031aa00000-fffff8a003800000] page_structs=131072 node=0 entry=1131/0
[    0.000000] [000000031aa00000-fffff8a003c00000] page_structs=131072 node=0 entry=1132/0
[    0.000000] [000000031b100000-fffff8a004000000] page_structs=131072 node=0 entry=1133/0
[    0.000000] Zone PFN ranges:
[    0.000000]   Normal   0x05000000 -> 0x050ffd8d
[    0.000000] Movable zone start PFN for each node
[    0.000000] early_node_map[4] active PFN ranges
[    0.000000]     0: 0x05000000 -> 0x050ff7ff
[    0.000000]     0: 0x050ff800 -> 0x050ffd09
[    0.000000]     0: 0x050ffd0b -> 0x050ffd7b
[    0.000000]     0: 0x050ffd7e -> 0x050ffd8d
[    0.000000] Booting Linux...
[    0.000000] Built 1 zonelists in Zone order, mobility grouping on.  Total pages: 1040779
[    0.000000] Kernel command line: root=/dev/sda1 ro
[    0.000000] PID hash table entries: 4096 (order: 12, 32768 bytes)
[    0.000000] clocksource: mult[640000] shift[16]
[    0.000000] clockevent: mult[28f5c28] shift[32]
[   46.108402] Console: colour dummy device 80x25
[   46.161498] console [tty0] enabled
[   49.320774] Dentry cache hash table entries: 1048576 (order: 10, 8388608 bytes)
[   49.426009] Inode-cache hash table entries: 524288 (order: 9, 4194304 bytes)
[   49.786779] Memory: 8301680k available (2928k kernel code, 1104k data, 208k init) [fffff80000000000,000000a1ffb1a000]
[   49.991734] Calibrating delay using timer specific routine.. 19.90 BogoMIPS (lpj=39810)
[   50.085944] Security Framework initialized
[   50.134751] SELinux:  Disabled at boot.
[   50.180620] Mount-cache hash table entries: 512
[   50.235257] Initializing cgroup subsys ns
[   50.282662] Initializing cgroup subsys cpuacct
[   50.335781] Initializing cgroup subsys devices
[   50.390759] CPU 0: synchronized TICK with master CPU (last diff 0 cycles, maxerr 11 cycles)
[   50.390776] Brought up 2 CPUs
[   50.391634] net_namespace: 1552 bytes
[   50.568542] NET: Registered protocol family 16
David Miller - Jan. 20, 2009, 6:25 a.m.
From: Hermann Lauer <Hermann.Lauer@iwr.uni-heidelberg.de>
Date: Fri, 19 Dec 2008 09:56:22 +0100

> On Wed, Dec 17, 2008 at 07:46:46PM -0800, David Miller wrote:
> > > tried 2.6.27.2 today, but this hangs already at loading the kernel.
> > > Output is attached. Please tell me what else I can provide.
> > 
> > Unfortunately, "-p" doesn't do anything any more and the kernel
> > is stopping or crashing during the part of the boot between
> > when the early prom console is disabled and the real console is
> > set up.
> > 
> > The way to debug this is to manually get rid of the CON_BOOT flag
> > in the early prom console structure, like the following patch.
> 
> Applied your patch to the 2.6.27.9 vanilla source and booted,
> console output up to the hang is below. Just to remember:
> 2.6.26.5 did not hang.
 ...
> [   50.390759] CPU 0: synchronized TICK with master CPU (last diff 0 cycles, maxerr 11 cycles)
> [   50.390776] Brought up 2 CPUs
> [   50.391634] net_namespace: 1552 bytes
> [   50.568542] NET: Registered protocol family 16
> 

The next thing that probably should run is the PCI controller
probe.  But I can't say that for certain, so we need more
info.

Please reboot this test kernel with the following added boot
command line options: initcall_debug=1 ignore_loglevel

Let us see the console log output that generates.

Thanks!
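
For readers unfamiliar with the option: initcall_debug makes the kernel log each init function as it is entered and again when it returns, so a hang shows up as a "calling" line with no matching "returned" line. The following is a simplified user-space model of that idea only. Every `demo_` name and the message format are assumptions made for illustration, not the kernel's actual code or output.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for an init function: returns 0 on success. */
typedef int (*demo_initcall_t)(void);

struct demo_initcall {
    const char *name;
    demo_initcall_t fn;
};

/*
 * Run each entry in order, logging before and after the call.
 * If one entry never returns, the last "calling" line in the log
 * names the function that hung. Returns the number of calls completed.
 */
static int demo_run_initcalls(const struct demo_initcall *calls, int n, FILE *log)
{
    int i;
    int ret;
    int done = 0;

    for (i = 0; i < n; i++) {
        assert(calls[i].fn != NULL);
        fprintf(log, "calling  %s\n", calls[i].name);
        fflush(log); /* push the line out before a possible hang */
        ret = calls[i].fn();
        fprintf(log, "initcall %s returned %d\n", calls[i].name, ret);
        done++;
    }
    return done;
}

/* Two trivial example init functions (hypothetical). */
static int demo_init_a(void) { return 0; }
static int demo_init_b(void) { return 0; }
```

Calling demo_run_initcalls() on a table containing demo_init_a and demo_init_b logs a calling/returned pair for each. Flushing before the call matters: without it, the line naming the hanging function could still be sitting in a buffer when the machine stops.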
Hermann Lauer - Jan. 22, 2009, 1:29 p.m.
On Mon, Jan 19, 2009 at 10:25:14PM -0800, David Miller wrote:

>  ...
> > [   50.390759] CPU 0: synchronized TICK with master CPU (last diff 0 cycles, maxerr 11 cycles)
> > [   50.390776] Brought up 2 CPUs
> > [   50.391634] net_namespace: 1552 bytes
> > [   50.568542] NET: Registered protocol family 16
> > 
> 
> The next thing that probably should run is the PCI controller
> probe.  But I can't say that for certain, so we need more
> info.
> 
> Please reboot this test kernel with the following added boot
> command line options: initcall_debug=1 ignore_loglevel

Compiled 2.6.27.12 today without the CON_BOOT flag and booted with the
options above. The hang seems to be in of_bus_driver_init, see
console output below.

I hope this helps us get closer to the cassini problem. Thanks.

Patch

diff --git a/arch/sparc64/kernel/setup.c b/arch/sparc64/kernel/setup.c
index c8b03a4..2c50796 100644
--- a/arch/sparc64/kernel/setup.c
+++ b/arch/sparc64/kernel/setup.c
@@ -82,7 +82,7 @@  unsigned long cmdline_memory_size = 0;
 static struct console prom_early_console = {
 	.name =		"earlyprom",
 	.write =	prom_console_write,
-	.flags =	CON_PRINTBUFFER | CON_BOOT | CON_ANYTIME,
+	.flags =	CON_PRINTBUFFER | CON_ANYTIME,
 	.index =	-1,
 };
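
The one-line change above drops a single bit from the console's flags word. As far as I understand it, CON_BOOT marks a console that the kernel unregisters once a real console takes over, so clearing the bit keeps the early PROM console printing through the window where the hang occurs. A rough user-space sketch of the bit manipulation follows. It is not kernel code: the `demo_` struct and helpers are hypothetical, and the numeric flag values are assumptions chosen to resemble the kernel's definitions.

```c
#include <assert.h>

/* Assumed flag values for illustration; check include/linux/console.h
 * in the tree being built for the real definitions. */
#define CON_PRINTBUFFER 1
#define CON_BOOT        8
#define CON_ANYTIME     16

/* Hypothetical stand-in for struct console, reduced to two fields. */
struct demo_console {
    const char *name;
    unsigned int flags;
};

/* Returns 1 if the console is marked as a boot console, else 0. */
static int demo_is_boot_console(const struct demo_console *con)
{
    return (con->flags & CON_BOOT) != 0;
}

/* Clears CON_BOOT and leaves every other flag bit untouched. */
static void demo_clear_boot_flag(struct demo_console *con)
{
    con->flags &= ~(unsigned int)CON_BOOT;
    assert(!demo_is_boot_console(con));
}
```

Starting from flags of CON_PRINTBUFFER | CON_BOOT | CON_ANYTIME, calling demo_clear_boot_flag() leaves CON_PRINTBUFFER | CON_ANYTIME, which matches the patched initialiser.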