sparc64 WARNING: at mm/mmap.c:2757 exit_mmap+0x13c/0x160()

Message ID 20140813.124611.767426770013677.davem@davemloft.net
State Not Applicable
Delegated to: David Miller

Commit Message

David Miller Aug. 13, 2014, 7:46 p.m. UTC
From: Meelis Roos <mroos@linux.ee>
Date: Wed, 13 Aug 2014 14:44:42 +0300 (EEST)

> Did not test current git more.

Current git fails to boot without this fix which I posted the other
day:

Comments

Meelis Roos Aug. 14, 2014, 12:20 p.m. UTC | #1
> > Did not test current git more.
> 
> Current git fails to boot without this fix which I posted the other
> day:
> 
> ====================
> [PATCH 1/2] sparc64: Do not disable interrupts in nmi_cpu_busy()

Thanks, I noticed it on sparclinux@ but did not put two and two
together. Now it seems to work on at least the T2000. Will test other
machines as I get some time.
Meelis Roos Aug. 15, 2014, 12:42 p.m. UTC | #2
> > Did not test current git more.
> 
> Current git fails to boot without this fix which I posted the other
> day:

T2000 is OK with today's git, hugepages and gcc 4.9.1.

On an otherwise successful reboot, V100 and Netra X1 now loop
indefinitely in a PROM recursive fault (3.16 hit the fault once and
continued).

Got this from one reboot of X1:
[info] Using makefile-style concurrent boot in runlevel 6.
[....] Stopping deferred execution scheduler: atd. ok
[....] Stopping MTA: exim4_listener. ok
[....] Asking all remaining processes to terminate...done.
[....] All processes ended within 4 seconds...done.
[  565.689832] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [rsyslogd:1715]
[  565.788276] Modules linked in: ipv6 loop ohci_pci ohci_hcd i2c_ali15x3 usbcore i2c_ali1535 i2ccorn
[  565.922072] CPU: 0 PID: 1715 Comm: rsyslogd Not tainted 3.16.0-10959-gf0094b2 #130
[  566.021635] task: ffffff006c772f00 ti: ffffff006c6b0000 task.ti: ffffff006c6b0000
[  566.120035] TSTATE: 0000004411001606 TPC: 00000000007895f0 TNPC: 00000000007895f4 Y: 00000000    d
[  566.249317] TPC: <put_compound_page.part.22+0x154/0x1c0>
[  566.319098] g0: 00000000004209d0 g1: 0000000000000000 g2: 0000000000000002 g3: 00000000004b0840
[  566.433415] g4: ffffff006c772f00 g5: 0000000000000008 g6: ffffff006c6b0000 g7: 0000000000000000
[  566.547817] o0: 0000000000000001 o1: 0000010000d5f818 o2: 00000000f77c2000 o3: 0000000000000001
[  566.662217] o4: ffffff006c6b3a98 o5: ffffff006c6b39dc sp: ffffff006c6b3131 ret_pc: 000000000078950
[  566.781197] RPC: <put_compound_page.part.22+0x134/0x1c0>
[  566.850994] l0: 00000000f77c2000 l1: fffffffe00000000 l2: 0000000200000000 l3: 00000000f77c1fff
[  566.965312] l4: 0000000000000000 l5: 0000000000000001 l6: 0000000000000000 l7: 0000000000000008
[  567.079714] i0: 0000010000d5f800 i1: 00000000f77c2000 i2: 0000000000000001 i3: 0000000000000000
[  567.194116] i4: 0000010000d5001c i5: 0000010000d50000 i6: ffffff006c6b31e1 i7: 000000000049aaa4
[  567.308527] I7: <get_futex_key+0x1c4/0x280>
[  567.363456] Call Trace:
[  567.395464]  [000000000049aaa4] get_futex_key+0x1c4/0x280
[  567.466332]  [000000000049ad7c] futex_wait_setup+0x1c/0xc0
[  567.538443]  [000000000049af14] futex_wait+0xf4/0x1c0
[  567.604738]  [000000000049c878] do_futex+0x138/0x240
[  567.669990]  [000000000049ce48] compat_SyS_futex+0x128/0x180
[  567.744394]  [0000000000406074] linux_sparc_syscall32+0x34/0x60

Otherwise V100 and X1 seem to survive looping git clone well with
transparent hugepages on and gcc 4.6.4.

U10 not tested yet, so no test of the PCI ROM changes yet (need to get
to the machine). Same for U5 and the RED state exceptions on reboot.

V210 has a new problem - it hangs on boot during SCSI detection:
[   34.523440] f00aba6c: ttyS0 at MMIO 0x7fe010003f8 (irq = 15, base_baud = 115387) is a 16550A
[   34.523467] Console: ttyS0 (SU)
[   43.731627] console [ttyS0] enabled
[   43.777688] f00ad5ec: ttyS1 at MMIO 0x7fe010002e8 (irq = 15, base_baud = 115387) is a 16550A
[   43.889462] PCI: Enabling device: (0002:00:02.0), cmd 147
[   43.960956] sym0: <1010-66> rev 0x1 at pci 0002:00:02.0 irq 24
[   44.039849] sym0: No NVRAM, ID 7, Fast-80, LVD, parity checking
[   44.158317] sym0: SCSI BUS has been reset.
[   44.212124] scsi host0: sym-2.2.3

Retested with today's git, same result.


I also solved my mysterious hangs of V100 - it was a simple user error 
with serial console and Break dropping me to OBP when the other end of 
the serial connection was rebooted with minicom open.

U1, U2, U5, U10, E220R, E420R later or some other day, whenever I get 
to them physically.
Meelis Roos Aug. 18, 2014, 12:30 p.m. UTC | #3
> U1, U2, U5, U10, E220R, E420R later or some other day, whenever I get 
> to them physically.

Ultra 5 is bad news with 3.17-rc1: it almost boots up, then after
starting postfix and ntpd, gets a RED state exception and continues
looping with it (before, it got RED state only after PROM reboot).

ntpd.

RED State Exception

TL=0000.0000.0000.0005 TT=0000.0000.0000.0064
   TPC=0000.0000.0042.4c80 TnPC=0000.0000.0042.4c84 TSTATE=0000.0000.1104.1407
TL=0000.0000.0000.0004 TT=0000.0000.0000.0064
   TPC=0000.0000.0042.4c80 TnPC=0000.0000.0042.4c84 TSTATE=0000.0000.1104.1407
TL=0000.0000.0000.0003 TT=0000.0000.0000.0064
   TPC=0000.0000.0042.4c80 TnPC=0000.0000.0042.4c84 TSTATE=0000.0000.1104.1407
TL=0000.0000.0000.0002 TT=0000.0000.0000.0064
   TPC=0000.0000.0042.0c80 TnPC=0000.0000.0042.0c84 TSTATE=0000.0000.1104.1407
TL=0000.0000.0000.0001 TT=0000.0000.0000.0064
   TPC=0000.0000.0044.8580 TnPC=0000.0000.0044.8584 TSTATE=0000.0000.1100.1607


RED State Exception

TL=0000.0000.0000.0005 TT=0000.0000.0000.0064
   TPC=0000.0000.f000.4c80 TnPC=0000.0000.f000.4c84 TSTATE=0000.0044.5604.1400
TL=0000.0000.0000.0004 TT=0000.0000.0000.0064
   TPC=0000.0000.f000.4c80 TnPC=0000.0000.f000.4c84 TSTATE=0000.0044.5604.1400
TL=0000.0000.0000.0003 TT=0000.0000.0000.0064
   TPC=0000.0000.f000.4c80 TnPC=0000.0000.f000.4c84 TSTATE=0000.0044.5604.1400
TL=0000.0000.0000.0002 TT=0000.0000.0000.0064
   TPC=0000.0000.f000.0c80 TnPC=0000.0000.f000.0c84 TSTATE=0000.0044.5604.1400
TL=0000.0000.0000.0001 TT=0000.0000.0000.0064
   TPC=0000.0000.f000.3a00 TnPC=0000.0000.f000.3a04 TSTATE=0000.0044.5600.0400
Aaro Koskinen Aug. 18, 2014, 5:35 p.m. UTC | #4
Hi,

On Mon, Aug 18, 2014 at 03:30:16PM +0300, Meelis Roos wrote:
> > U1, U2, U5, U10, E220R, E420R later or some other day, whenever I get 
> > to them physically.
> 
> Ultra 5 is bad news with 3.17-rc1: it almost boots up, then after
> starting postfix and ntpd, gets a RED state exception and continues
> looping with it (before, it got RED state only after PROM reboot).

My Ultra 5 is fine with 3.17-rc1 (I'm writing this mail from it),
also Ultra 10 seems to be OK based on quick test.

I'm going to run GCC 4.9.1 bootstrap & testsuite on these machines
maybe next week. Unfortunately due to summer schedules I'm a bit lost
if there are still some special patches I should try (to get rid
of $SUBJECT)? If not I'll probably try it with plain 3.17-rc2.

A.
--
To unsubscribe from this list: send the line "unsubscribe sparclinux" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller Aug. 18, 2014, 5:38 p.m. UTC | #5
From: Aaro Koskinen <aaro.koskinen@iki.fi>
Date: Mon, 18 Aug 2014 20:35:52 +0300

> Hi,
> 
> On Mon, Aug 18, 2014 at 03:30:16PM +0300, Meelis Roos wrote:
>> > U1, U2, U5, U10, E220R, E420R later or some other day, whenever I get 
>> > to them physically.
>> 
>> Ultra 5 is bad news with 3.17-rc1: it almost boots up, then after
>> starting postfix and ntpd, gets a RED state exception and continues
>> looping with it (before, it got RED state only after PROM reboot).
> 
> My Ultra 5 is fine with 3.17-rc1 (I'm writing this mail from it),
> also Ultra 10 seems to be OK based on quick test.
> 
> I'm going to run GCC 4.9.1 bootstrap & testsuite on these machines
> maybe next week. Unfortunately due to summer schedules I'm a bit lost
> if there are still some special patches I should try (to get rid
> of $SUBJECT)? If not I'll probably try it with plain 3.17-rc2.

All patches are in 3.17-rc1.

Julian Calaby Aug. 18, 2014, 11:45 p.m. UTC | #6
Hi All,

On Tue, Aug 19, 2014 at 3:35 AM, Aaro Koskinen <aaro.koskinen@iki.fi> wrote:
> Hi,
>
> On Mon, Aug 18, 2014 at 03:30:16PM +0300, Meelis Roos wrote:
>> > U1, U2, U5, U10, E220R, E420R later or some other day, whenever I get
>> > to them physically.
>>
>> Ultra 5 is bad news with 3.17-rc1: it almost boots up, then after
>> starting postfix and ntpd, gets a RED state exception and continues
>> looping with it (before, it got RED state only after PROM reboot).
>
> My Ultra 5 is fine with 3.17-rc1 (I'm writing this mail from it),
> also Ultra 10 seems to be OK based on quick test.

Stupid question: aren't the Ultra 5 and Ultra 10 essentially the same hardware?

Thanks,
Aaro Koskinen Aug. 19, 2014, 9:29 p.m. UTC | #7
Hi,

On Tue, Aug 19, 2014 at 09:45:03AM +1000, Julian Calaby wrote:
> Stupid question: aren't the Ultra 5 and Ultra 10 essentially
> the same hardware?

Basically yes, but often configurations are different (CPU speed,
memory capacity, peripherals, PROM versions).

A.
Aaro Koskinen Aug. 30, 2014, 10:27 p.m. UTC | #8
Hi,

On Mon, Aug 18, 2014 at 10:38:50AM -0700, David Miller wrote:
> All patches are in 3.17-rc1

FYI, the warning/bug still triggers with 3.17-rc2 during GCC bootstrap:

[94075.963753] ------------[ cut here ]------------
[94076.018105] WARNING: CPU: 0 PID: 17192 at /home/aaro/los/work/shared/linux-v3.17-rc2/mm/mmap.c:2766 exit_mmap+0x128/0x160()
[94076.151407] Modules linked in:
[94076.187825] CPU: 0 PID: 17192 Comm: rm Not tainted 3.17.0-rc2-ultra-los_3ec1 #1
[94076.275319] Call Trace:
[94076.304490]  [00000000004c1308] exit_mmap+0x128/0x160
[94076.364915]  [000000000045118c] mmput+0x2c/0xc0
[94076.419062]  [0000000000453cb0] do_exit+0x1b0/0x880
[94076.477387]  [0000000000454ff8] do_group_exit+0x38/0xc0
[94076.539880]  [0000000000455094] SyS_exit_group+0x14/0x20
[94076.603429]  [0000000000406074] linux_sparc_syscall32+0x34/0x60
[94076.674225] ---[ end trace b4b3ce0b3bcc0234 ]---
[94076.729446] BUG: Bad rss-counter state mm:ffffff0016898000 idx:1 val:2

A.
Patch

====================
[PATCH 1/2] sparc64: Do not disable interrupts in nmi_cpu_busy()

nmi_cpu_busy() is an SMP function call that just makes sure that all of the
cpus are spinning, using cpu cycles, while the NMI test runs.

It does not need to enable IRQs because we just care about NMIs executing,
which they will even with 'normal' IRQs disabled.

It is not legal to enable hard IRQs in an SMP cross call; in fact, this bug
triggers the BUG check in irq_work_run_list():

	BUG_ON(!irqs_disabled());

This happens because irq_work_run() is now invoked from the tail of
generic_smp_call_function_single_interrupt().

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 arch/sparc/kernel/nmi.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/sparc/kernel/nmi.c b/arch/sparc/kernel/nmi.c
index 3370945..5b1151d 100644
--- a/arch/sparc/kernel/nmi.c
+++ b/arch/sparc/kernel/nmi.c
@@ -130,7 +130,6 @@  static inline unsigned int get_nmi_count(int cpu)
 
 static __init void nmi_cpu_busy(void *data)
 {
-	local_irq_enable_in_hardirq();
 	while (endflag == 0)
 		mb();
 }
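
For readers who want to play with the idea outside the kernel, here is a
rough user-space analogy of the busy loop above. It is only a sketch under
assumptions, not kernel code: a periodic SIGALRM stands in for the NMI, the
names spin_until_ticks(), on_tick() and ticks are hypothetical, and a
volatile flag takes the place of the mb() in the original loop. The spinning
code does nothing to its own interrupt state; the asynchronous event is
delivered anyway and ends the spin.

```c
#define _XOPEN_SOURCE 700
#include <signal.h>
#include <string.h>
#include <sys/time.h>

/* User-space stand-ins (hypothetical names, not the kernel's variables). */
static volatile sig_atomic_t endflag; /* set by the handler to end the spin */
static volatile sig_atomic_t ticks;   /* timer signals seen while spinning */

/* Plays the role of the NMI: runs asynchronously while the loop spins. */
static void on_tick(int signo)
{
	(void)signo;
	if (++ticks >= 3)
		endflag = 1;
}

/*
 * Spin until the periodic timer has fired at least three times.
 * Returns the number of ticks observed, or -1 on setup failure.
 */
int spin_until_ticks(void)
{
	struct sigaction sa, old_sa;
	struct itimerval it, stop;

	endflag = 0;
	ticks = 0;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_tick;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGALRM, &sa, &old_sa) != 0)
		return -1;

	/* Fire every 10 ms, starting 10 ms from now. */
	memset(&it, 0, sizeof(it));
	it.it_value.tv_usec = 10000;
	it.it_interval.tv_usec = 10000;
	if (setitimer(ITIMER_REAL, &it, NULL) != 0) {
		sigaction(SIGALRM, &old_sa, NULL);
		return -1;
	}

	/* Busy loop: the volatile read forces a fresh load on every pass. */
	while (endflag == 0)
		;

	/* Disarm the timer and restore the previous handler. */
	memset(&stop, 0, sizeof(stop));
	setitimer(ITIMER_REAL, &stop, NULL);
	sigaction(SIGALRM, &old_sa, NULL);

	return ticks;
}
```

Calling spin_until_ticks() from a small main() should return 3 or slightly
more, depending on whether an extra timer signal arrives before the timer is
disarmed.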