diff mbox

Fix BREAK handling in sunsab serial driver

Message ID 201407271519.22272.cat.schulze@alice-dsl.net
State Changes Requested
Delegated to: David Miller
Headers show

Commit Message

Christopher Alexander Tobias Schulze July 27, 2014, 1:19 p.m. UTC
Handling of a BREAK on a serial line is currently broken in the sunsab serial
driver. Due to this defect, it is not possible to break to the firmware prompt
or trigger a Magic SysRq on these machines (which would be quite useful for
debugging kernel problems on headless machines).

This problem is caused by the fact that the count of characters to be processed
might be 0 when a BREAK interrupt is triggered. The code in the driver contains
a loop that processes the pending characters - with count set to 0 no processing
takes place at all. Unfortunately, the call of uart_handle_break() is contained
inside this loop and therefore skipped completely.

This patch moves the check for BREAK status in front of the loop and executes it
only when count == 0, i.e. when it would not be done inside the loop.

With this patch, Magic SysRq works again. However, there seems to be another
problem that causes an Oops when a register dump is requested by SysRq. This
is caused when perf_event_print_debug() is called from sysrq_handle_showregs()
in drivers/tty/sysrq.c. This patch provides no solution for this problem; my
workaround was to eliminate the call to perf_event_print_debug() in
sysrq_handle_showregs(). This problem was observed on a SunBlade 2000 with an
UltraSPARC III Cu CPU and might be specific to this CPU model (the UltraSPARC III
is the first CPU to support perf_event.).

The patch was originally developed against a 3.13 backport kernel from Debian.
Both the patch against 3.13 and a recent 3.16-rc6 are included below. Please note
that I could only test the 3.13 version as I do not have access to the affected
machine anymore.

PATCH 1 - KERNEL VERSION 3.13 #####################################################

Regards,
Alexander Schulze
--
To unsubscribe from this list: send the line "unsubscribe sparclinux" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

David Miller July 28, 2014, 10:52 p.m. UTC | #1
From: Christopher Alexander Tobias Schulze <cat.schulze@alice-dsl.net>
Date: Sun, 27 Jul 2014 15:19:21 +0200

Again, please format your subject line properly, the subsystem prefix
here could be "sunsab: ".

> Handling of a BREAK on a serial line is currently broken in the sunsab serial
> driver. Due to this defect, it is not possible to break to the firmware prompt
> or trigger a Magic SysRq on these machines (which would be quite useful for
> debugging kernel problems on headless machines).
> 
> This problem is caused by the fact that the count of characters to be processed
> might be 0 when a BREAK interrupt is triggered. The code in the driver contains
> a loop that processes the pending characters - with count set to 0 no processing
> takes place at all. Unfortunately, the call of uart_handle_break() is contained
> inside this loop and therefore skipped completely.
> 
> This patch moves the check for BREAK status in front of the loop and executes it
> only when count == 0, i.e. when it would not be done inside the loop.
> 
> With this patch, Magic SysRq works again. However, there seems to be another
> problem that causes an Oops when a register dump is requested by SysRq. This
> is caused when perf_event_print_debug() is called from sysrq_handle_showregs()
> in drivers/tty/sysrq.c. This patch provides no solution for this problem; my
> workaround was to eliminate the call to perf_event_print_debug() in
> sysrq_handle_showregs(). This problem was observed on a SunBlade 2000 with an
> UltraSPARC III Cu CPU and might be specific to this CPU model (the UltraSPARC III
> is the first CPU to support perf_event.).
> 
> The patch was originally developed against a 3.13 backport kernel from Debian.
> Both the patch against 3.13 and a recent 3.16-rc6 are included below. Please note
> that I could only test the 3.13 version as I do not have access to the affected
> machine anymore.

Please provide a proper "Singed-off-by: " as per Documentation/SubmittingPatches.

> +       if(count == 0) {

Space between "if" and "(".

> +               if(unlikely(stat->sreg.isr1 & SAB82532_ISR1_BRK)) {

Likewise.
--
To unsubscribe from this list: send the line "unsubscribe sparclinux" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller July 28, 2014, 10:54 p.m. UTC | #2
From: Christopher Alexander Tobias Schulze <cat.schulze@alice-dsl.net>
Date: Sun, 27 Jul 2014 15:19:21 +0200

> However, there seems to be another problem that causes an Oops when
> a register dump is requested by SysRq. This is caused when
> perf_event_print_debug() is called from sysrq_handle_showregs() in
> drivers/tty/sysrq.c.

Did you save the OOPS?  It should be trivial to debug with that
information.

perf_event_print_debug() doesn't do anything fancy, it disables CPU
interrupts and reads the PCR and PIC registers.

UltraSPARC-III should be using direct_pcr_ops for this, therefore:

	__asm__ __volatile__("rd %%pcr, %0" : "=r" (val));

and

	__asm__ __volatile__("rd %%pic, %0" : "=r" (val));

are the two operations it will perform.
--
To unsubscribe from this list: send the line "unsubscribe sparclinux" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christopher Alexander Tobias Schulze Aug. 3, 2014, 12:25 p.m. UTC | #3
On Tuesday 29 July 2014 00:54, David Miller wrote:
> From: Christopher Alexander Tobias Schulze <cat.schulze@alice-dsl.net>
> Date: Sun, 27 Jul 2014 15:19:21 +0200
> 
> > However, there seems to be another problem that causes an Oops when
> > a register dump is requested by SysRq. This is caused when
> > perf_event_print_debug() is called from sysrq_handle_showregs() in
> > drivers/tty/sysrq.c.
> 
> Did you save the OOPS?  It should be trivial to debug with that
> information.

We had a serial console attached to the SunBlade during debugging and
development of the patches, and serial output was logged to a file. I
think we should therefore have captured the OOPS.

Norman Beck (in CC) should be able to extract the OOPS from this log
and send it to you. (I do not have access to this SunBlade anymore
since July.)

> perf_event_print_debug() doesn't do anything fancy, it disables CPU
> interrupts and reads the PCR and PIC registers.
> 
> UltraSPARC-III should be using direct_pcr_ops for this, therefore:
> 
> 	__asm__ __volatile__("rd %%pcr, %0" : "=r" (val));
> 
> and
> 
> 	__asm__ __volatile__("rd %%pic, %0" : "=r" (val));
> 
> are the two operations it will perform.

IIRC it was a null pointer dereference in perf_event_print_debug(),
but I am not quite sure, as we quickly disabled this call to continue
working (getting BREAK on the serial console to work was top priority
then). We will have to wait for Norman to retrieve this information -
he is currently on vacation, however, so it will take another week.

Regards,
Alexander Schulze
--
To unsubscribe from this list: send the line "unsubscribe sparclinux" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller Aug. 4, 2014, 11:04 p.m. UTC | #4
From: Christopher Alexander Tobias Schulze <cat.schulze@alice-dsl.net>
Date: Sun, 3 Aug 2014 14:25:35 +0200

> IIRC it was a null pointer dereference in perf_event_print_debug(),
> but I am not quite sure, as we quickly disabled this call to continue
> working (getting BREAK on the serial console to work was top priority
> then). We will have to wait for Norman to retrieve this information -
> he is currently on vacation, however, so it will take another week.

If it's in perf_event_print_debug() there isn't much to check.

sparc_pmu is derefernced in the loops, but the first thing this
function does is check it against NULL and return immediately
if it is NULL.

pcr_ops is dereferenced also, but pcr_ops should never be NULL
if sparc_pmu is not NULL.

Hmmm... that's setup by pcr_arch_init() which I only see being
invoked by smp_cpus_done().

Is this a uniprocessor build?
--
To unsubscribe from this list: send the line "unsubscribe sparclinux" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christopher Alexander Tobias Schulze Aug. 11, 2014, 5:49 p.m. UTC | #5
On Tuesday 05 August 2014 01:04:23 David Miller wrote:
> From: Christopher Alexander Tobias Schulze <cat.schulze@alice-dsl.net>
> Date: Sun, 3 Aug 2014 14:25:35 +0200
> 
> > IIRC it was a null pointer dereference in perf_event_print_debug(),
> > but I am not quite sure, as we quickly disabled this call to continue
> > working (getting BREAK on the serial console to work was top priority
> > then). We will have to wait for Norman to retrieve this information -
> > he is currently on vacation, however, so it will take another week.
> 
> If it's in perf_event_print_debug() there isn't much to check.
> 
> sparc_pmu is derefernced in the loops, but the first thing this
> function does is check it against NULL and return immediately
> if it is NULL.
> 
> pcr_ops is dereferenced also, but pcr_ops should never be NULL
> if sparc_pmu is not NULL.
> 
> Hmmm... that's setup by pcr_arch_init() which I only see being
> invoked by smp_cpus_done().
> 
> Is this a uniprocessor build?
> 
> 

My colleague has returned from his vacation and provided me the serial
console logs which contain the stacktrace for the crash in
perf_event_print_debug() that was triggered by the register dump request
using SysRq:

[  192.934865] Unable to handle kernel NULL pointer dereference
[  193.002524] tsk->{mm,active_mm}->context = 0000000000001009
[  193.069192] tsk->{mm,active_mm}->pgd = fffffc003aec2000
[  193.131694]               \|/ ____ \|/
[  193.131694]               "@'/ .. \`@"
[  193.131694]               /_| \__/ |_\
[  193.131694]                  \__U_/
[  193.307769] swapper(0): Oops [#1]
[  193.347330] CPU: 0 PID: 0 Comm: swapper Not tainted 3.13.10 #4
[  193.417122] task: 000000000092d7e0 ti: 0000000000918000 task.ti: 0000000000918000
[  193.506717] TSTATE: 0000009980e01604 TPC: 000000000044d630 TNPC: 000000000044d634 Y: 00000000    Not tainted
[  193.624441] TPC: <perf_event_print_debug+0x50/0x100>
[  193.683796] g0: 0000000001016460 g1: 0000000000000000 g2: 0000000000000007 g3: 0000000000000001
[  193.787975] g4: 000000000092d7e0 g5: 0000000000000038 g6: 0000000000918000 g7: 0000000000000720
[  193.892144] o0: 0000000000000000 o1: 000000000089d910 o2: 0000000000000000 o3: 0000000000001fff
[  193.996313] o4: 000000000089d8a0 o5: 0000000000000010 sp: fffffc003febb231 ret_pc: 000000000044d610
[  194.104653] RPC: <perf_event_print_debug+0x30/0x100>
[  194.164018] l0: 0000000000001000 l1: 00000000ffff5511 l2: 00000000004208b0 l3: 0000000000000000
[  194.268197] l4: 0000000000001000 l5: 0000000000000004 l6: 0000000000918000 l7: 0000000080001004
[  194.372368] i0: 0000000000000070 i1: 00000000009cd800 i2: 000000000089d300 i3: 00000000009240d0
[  194.476537] i4: 000000000000000e i5: 0000000000000001 i6: fffffc003febb2f1 i7: 000000000068b2a0
[  194.580715] I7: <__handle_sysrq+0xc0/0x1a0>
[  194.630694] Call Trace:
[  194.659865]  [000000000068b2a0] __handle_sysrq+0xc0/0x1a0
[  194.724458]  [00000000006a5908] sunsab_interrupt+0x468/0x8a0
[  194.792165]  [0000000000498d18] handle_irq_event_percpu+0x78/0x220
[  194.866125]  [0000000000498ee8] handle_irq_event+0x28/0x60
[  194.931754]  [000000000049b7dc] handle_fasteoi_irq+0xfc/0x1a0
[  195.000503]  [0000000000498560] generic_handle_irq+0x40/0x60
[  195.068214]  [000000000042b5d4] handler_irq+0x94/0xc0
[  195.128636]  [00000000004208b4] tl0_irq5+0x14/0x20
[  195.185924]  [000000000042bbfc] arch_cpu_idle+0x5c/0x80
[  195.248426]  [000000000049843c] cpu_startup_entry+0x19c/0x240
[  195.317179]  [000000000099a988] start_kernel+0x4f0/0x520
[  195.380724]  [00000000007e74ac] tlb_fixup_done+0x80/0x94
[  195.444263]  [0000000000000000]           (null)
[  195.499473] Disabling lock debugging due to kernel taint
[  195.563022] Caller[000000000068b2a0]: __handle_sysrq+0xc0/0x1a0
[  195.633855] Caller[00000000006a5908]: sunsab_interrupt+0x468/0x8a0
[  195.707816] Caller[0000000000498d18]: handle_irq_event_percpu+0x78/0x220
[  195.788028] Caller[0000000000498ee8]: handle_irq_event+0x28/0x60
[  195.859905] Caller[000000000049b7dc]: handle_fasteoi_irq+0xfc/0x1a0
[  195.934905] Caller[0000000000498560]: generic_handle_irq+0x40/0x60
[  196.008866] Caller[000000000042b5d4]: handler_irq+0x94/0xc0
[  196.075534] Caller[00000000004208b4]: tl0_irq5+0x14/0x20
[  196.139079] Caller[000000000042bbb8]: arch_cpu_idle+0x18/0x80
[  196.207831] Caller[000000000049843c]: cpu_startup_entry+0x19c/0x240
[  196.282833] Caller[000000000099a988]: start_kernel+0x4f0/0x520
[  196.352626] Caller[00000000007e74ac]: tlb_fixup_done+0x80/0x94
[  196.422418] Caller[0000000000000000]:           (null)
[  196.483877] Instruction DUMP: 953f6000  ba076001  9010000a <c2584000> 9fc04000  d477a7f7  d45fa7f7  96100008  92102000 
[  196.613045] Kernel panic - not syncing: Aiee, killing interrupt handler!
[  196.693264] Press Stop-A (L1-A) to return to the boot prom

Unfortunately, this is less information than I hoped: the only new information is

[  193.624441] TPC: <perf_event_print_debug+0x50/0x100>
[  194.104653] RPC: <perf_event_print_debug+0x30/0x100>

and I am not able to provide any further details (e.g., disassembly of the code
at that location).

The machine is a uniprocessor system. The kernel configuration was taken from
Debian and slightly modified, but I am not sure whether it was based on an SMP
kernel package or not (I would guess that it was a uniprocessor kernel, however).

I hope this can provide at least some further information to you.

Regards,
Alexander Schulze
--
To unsubscribe from this list: send the line "unsubscribe sparclinux" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller Aug. 11, 2014, 10:19 p.m. UTC | #6
From: Christopher Alexander Tobias Schulze <cat.schulze@alice-dsl.net>
Date: Mon, 11 Aug 2014 19:49:48 +0200

> [  196.483877] Instruction DUMP: 953f6000  ba076001  9010000a <c2584000> 9fc04000  d477a7f7  d45fa7f7  96100008  92102000 
 ...
> and I am not able to provide any further details (e.g., disassembly of the code
> at that location).

The instruction dump is in the OOPS, as quoted above, and this disassembles
to:

   0:   95 3f 60 00     sra  %i5, 0, %o2
   4:   ba 07 60 01     inc  %i5
   8:   90 10 00 0a     mov  %o2, %o0

crash_pc:
   c:   c2 58 40 00     ldx  [ %g1 ], %g1
  10:   9f c0 40 00     call  %g1
  14:   d4 77 a7 f7     stx  %o2, [ %fp + 0x7f7 ]
  18:   d4 5f a7 f7     ldx  [ %fp + 0x7f7 ], %o2
  1c:   96 10 00 08     mov  %o0, %o3
  20:   92 10 20 00     clr  %o1

This is an indirect call, and the only indirect calls in perf_event_print_debug()
are the pcr_ops->() method invocations, that is what we are executing here, therefore
pcr_ops is NULL at this location.

It confirms my suspicions and I'll therefore work on a fix, thanks!
--
To unsubscribe from this list: send the line "unsubscribe sparclinux" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff -Naupr linux-source-3.13-orig/drivers/tty/serial/sunsab.c linux-source-3.13-patched/drivers/tty/serial/sunsab.c
--- linux-source-3.13-orig/drivers/tty/serial/sunsab.c  2014-04-14 15:48:24.000000000 +0200
+++ linux-source-3.13-patched/drivers/tty/serial/sunsab.c       2014-07-27 14:29:58.000000000 +0200
@@ -157,6 +157,14 @@  receive_chars(struct uart_sunsab_port *u
            (up->port.line == up->port.cons->index))
                saw_console_brk = 1;
 
+       if(count == 0) {
+               if(unlikely(stat->sreg.isr1 & SAB82532_ISR1_BRK)) {
+                       stat->sreg.isr0 &= ~(SAB82532_ISR0_PERR |
+                                            SAB82532_ISR0_FERR);
+                       uart_handle_break(&up->port);
+               }
+       }
+
        for (i = 0; i < count; i++) {
                unsigned char ch = buf[i], flag;

PATCH 2 - KERNEL VERSION 3.16 #####################################################

diff -Naupr linux-3.16-rc6-orig/drivers/tty/serial/sunsab.c linux-3.16-rc6-patched/drivers/tty/serial/sunsab.c
--- linux-3.16-rc6-orig/drivers/tty/serial/sunsab.c     2014-07-27 11:50:32.000000000 +0200
+++ linux-3.16-rc6-patched/drivers/tty/serial/sunsab.c  2014-07-27 14:28:12.000000000 +0200
@@ -157,6 +157,14 @@  receive_chars(struct uart_sunsab_port *u
            (up->port.line == up->port.cons->index))
                saw_console_brk = 1;
 
+       if(count == 0) {
+               if(unlikely(stat->sreg.isr1 & SAB82532_ISR1_BRK)) {
+                       stat->sreg.isr0 &= ~(SAB82532_ISR0_PERR |
+                                            SAB82532_ISR0_FERR);
+                       uart_handle_break(&up->port);
+               }
+       }
+
        for (i = 0; i < count; i++) {
                unsigned char ch = buf[i], flag;