Message ID | 4676ea34-69ce-5422-1ded-94218b89f7d9@p23q.org |
---|---|
State | Deferred |
Headers | show |
Series | serial8250 on tegra hsuart: recover from spurious interrupts due to tegra2 silicon bug | expand |
On Friday, 13 July 2018 14:32:42 MSK David R. Piegdon wrote: > Hi, > a while back I sent a few mails regarding spurious interrupts in the > UARTA (hsuart) block of the Tegra2 SoC, when using the 8250 driver for > it instead of the hsuart driver. After going down a pretty deep > debugging/testing hole, I think I found a patch that fixes the issue. So > far testing in a reboot-cycle suggests that the error frequency dropped > from >3% of all reboots to at least <0.05% of all reboots. Tests > continue to run over the weekend. > > The patch below already is a second iteration; the first did not reset > the MCR or contain the lines below '// clear interrupts'. This resulted > in no more spurious interrupts, but in a few % of spurious interrupts > that were recovered the UART block did not receive any characters any > more. So further resetting was required to fully reacquire operational > state of the UART block. > > I'd love any comments/suggestions on this! > I'm wondering whether later Tegra's have that issue as well, maybe Mikko knows? Instead of #ifdef-ing T20 in the code, there probably should be some kind of a port flag. Please send a proper patch and follow suggestions from tty/ maintainers. -- To unsubscribe from this list: send the line "unsubscribe linux-tegra" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 18.07.2018 14:33, Dmitry Osipenko wrote: > On Friday, 13 July 2018 14:32:42 MSK David R. Piegdon wrote: >> Hi, >> a while back I sent a few mails regarding spurious interrupts in the >> UARTA (hsuart) block of the Tegra2 SoC, when using the 8250 driver for >> it instead of the hsuart driver. After going down a pretty deep >> debugging/testing hole, I think I found a patch that fixes the issue. So >> far testing in a reboot-cycle suggests that the error frequency dropped >> from >3% of all reboots to at least <0.05% of all reboots. Tests >> continue to run over the weekend. >> >> The patch below already is a second iteration; the first did not reset >> the MCR or contain the lines below '// clear interrupts'. This resulted >> in no more spurious interrupts, but in a few % of spurious interrupts >> that were recovered the UART block did not receive any characters any >> more. So further resetting was required to fully reacquire operational >> state of the UART block. >> >> I'd love any comments/suggestions on this! >> > > I'm wondering whether later Tegra's have that issue as well, maybe Mikko > knows? My understanding is that the issue is on all generations. > > Instead of #ifdef-ing T20 in the code, there probably should be some kind of a > port flag. Please send a proper patch and follow suggestions from tty/ > maintainers. > > Yep - if possible to do in a tegra-specific callback, that would be nice, but I'm not sure if that's possible. Cheers, Mikko -- To unsubscribe from this list: send the line "unsubscribe linux-tegra" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 07/18/2018 12:05 PM, Mikko Perttunen wrote: > On 18.07.2018 14:33, Dmitry Osipenko wrote: >> On Friday, 13 July 2018 14:32:42 MSK David R. Piegdon wrote: >>> Hi, >>> a while back I sent a few mails regarding spurious interrupts in the >>> UARTA (hsuart) block of the Tegra2 SoC, when using the 8250 driver for >>> it instead of the hsuart driver. After going down a pretty deep >>> debugging/testing hole, I think I found a patch that fixes the issue. So >>> far testing in a reboot-cycle suggests that the error frequency dropped >>> from >3% of all reboots to at least <0.05% of all reboots. Tests >>> continue to run over the weekend. >>> >>> The patch below already is a second iteration; the first did not reset >>> the MCR or contain the lines below '// clear interrupts'. This resulted >>> in no more spurious interrupts, but in a few % of spurious interrupts >>> that were recovered the UART block did not receive any characters any >>> more. So further resetting was required to fully reacquire operational >>> state of the UART block. >>> >>> I'd love any comments/suggestions on this! >>> >> >> I'm wondering whether later Tegra's have that issue as well, maybe Mikko >> knows? > > My understanding is that the issue is on all generations. > Hi, its great to get some feedback here! Disclaimer: So far I only know that my specific hardware- and software-combination sees this issue. But I do see it across multiple devices. Also I only have Tegra2 hardware, so I am completely unsure if other Tegra CPUs are affected. *I am still waiting for confirmation from anyone else, if this is a real issue*. For all that I know, it could be that we muxed some lines wrongly which results in some core power rails being screwed, or anything like this. So if you actually see this issue, please let me know! >> >> Instead of #ifdef-ing T20 in the code, there probably should be some >> kind of a >> port flag. Please send a proper patch and follow suggestions from tty/ >> maintainers.> > Yep - if possible to do in a tegra-specific callback, that would be > nice, but I'm not sure if that's possible. Due to above reasons, this was just an initial request for comments. I will implement your suggestions over the next few days. Thanks! David -- To unsubscribe from this list: send the line "unsubscribe linux-tegra" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Considering that nobody has answered if he is affected by this too, or not, I have to assume that this only affects my platform. Maybe its a pinmuxing issue. We will have to investigate further. :-(. I thus revoke this patch. David On 07/13/2018 11:32 AM, David R. Piegdon wrote: > Hi, > a while back I sent a few mails regarding spurious interrupts in the > UARTA (hsuart) block of the Tegra2 SoC, when using the 8250 driver for > it instead of the hsuart driver. After going down a pretty deep > debugging/testing hole, I think I found a patch that fixes the issue. So > far testing in a reboot-cycle suggests that the error frequency dropped > from >3% of all reboots to at least <0.05% of all reboots. Tests > continue to run over the weekend. > > The patch below already is a second iteration; the first did not reset > the MCR or contain the lines below '// clear interrupts'. This resulted > in no more spurious interrupts, but in a few % of spurious interrupts > that were recovered the UART block did not receive any characters any > more. So further resetting was required to fully reacquire operational > state of the UART block. > > I'd love any comments/suggestions on this! > > Cheers, > > David -- To unsubscribe from this list: send the line "unsubscribe linux-tegra" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Le jeu. 2 août 2018 à 21:24, David R. Piegdon <lkml@p23q.org> a écrit : > > Considering that nobody has answered if he is affected by this too, or > not, I have to assume that this only affects my platform. Maybe its a > pinmuxing issue. We will have to investigate further. :-(. David, I've experienced this bug on a tegra20 device today (paz00). Is this the issue you have experienced ? I confirm that I don't reproduce this on every boot (even using the same kernel). Thx --- [ 22.463664] irq 282: nobody cared (try booting with the "irqpoll" option) [ 22.463678] CPU: 0 PID: 412 Comm: dbus-daemon Tainted: G C 4.18.0-0.rc7.git0.1.tegra.fc28.armv7hl #1 [ 22.463681] Hardware name: NVIDIA Tegra SoC (Flattened Device Tree) [ 22.463716] [<c0312594>] (unwind_backtrace) from [<c030cc40>] (show_stack+0x18/0x1c) [ 22.463736] [<c030cc40>] (show_stack) from [<c0ae4e78>] (dump_stack+0x80/0xa0) [ 22.463753] [<c0ae4e78>] (dump_stack) from [<c03ac0c0>] (__report_bad_irq+0x2c/0xc8) [ 22.463765] [<c03ac0c0>] (__report_bad_irq) from [<c03abfbc>] (note_interrupt+0x1ec/0x294) [ 22.463775] [<c03abfbc>] (note_interrupt) from [<c03a94c8>] (handle_irq_event_percpu+0x4c/0x5c) [ 22.463784] [<c03a94c8>] (handle_irq_event_percpu) from [<c03a9524>] (handle_irq_event+0x4c/0x70) [ 22.463795] [<c03a9524>] (handle_irq_event) from [<c03ad2cc>] (handle_fasteoi_irq+0xb8/0x13c) [ 22.463806] [<c03ad2cc>] (handle_fasteoi_irq) from [<c03a857c>] (generic_handle_irq+0x20/0x30) [ 22.463814] [<c03a857c>] (generic_handle_irq) from [<c03a8c4c>] (__handle_domain_irq+0xa4/0xbc) [ 22.463828] [<c03a8c4c>] (__handle_domain_irq) from [<c06a0658>] (gic_handle_irq+0x5c/0x88) [ 22.463840] [<c06a0658>] (gic_handle_irq) from [<c0301a0c>] (__irq_svc+0x6c/0x90) [ 22.463845] Exception stack(0xd87a1ee0 to 0xd87a1f28) [ 22.463854] 1ee0: 234d9720 00000000 1a9ef000 c113e880 c1203d00 00000000 00000202 c113e880 [ 22.463863] 1f00: ffffe000 ffffe000 0000000a b6f29000 fffffff6 d87a1f30 00000000 c0302174 [ 22.463867] 1f20: 400d0113 ffffffff [ 22.463879] [<c0301a0c>] (__irq_svc) from [<c0302174>] (__do_softirq+0x8c/0x360) [ 22.463890] [<c0302174>] (__do_softirq) from [<c035261c>] (irq_exit+0x7c/0xe4) [ 22.463899] [<c035261c>] (irq_exit) from [<c03a8c30>] (__handle_domain_irq+0x88/0xbc) [ 22.463908] [<c03a8c30>] (__handle_domain_irq) from [<c06a0658>] (gic_handle_irq+0x5c/0x88) [ 22.463917] [<c06a0658>] (gic_handle_irq) from [<c0301db0>] (__irq_usr+0x50/0x80) [ 22.463921] Exception stack(0xd87a1fb0 to 0xd87a1ff8) [ 22.463927] 1fa0: 00000069 b6f29000 bec15b0c b6e0087d [ 22.463936] 1fc0: 52671368 00000000 0000021b 00000016 b6e0086e bec15b0c b6f29000 b6f29000 [ 22.463943] 1fe0: b6dff9e0 bec15a18 b6f07770 b6f05368 200d0010 ffffffff [ 22.463946] handlers: [ 22.463960] [<504af4a6>] serial8250_interrupt [ 22.463966] Disabling IRQ #282 ---
On 08/03/2018 08:40 AM, Nicolas Chauvet wrote: > Le jeu. 2 août 2018 à 21:24, David R. Piegdon <lkml@p23q.org> a écrit : >> Considering that nobody has answered if he is affected by this too, or >> not, I have to assume that this only affects my platform. Maybe its a >> pinmuxing issue. We will have to investigate further. :-(. > David, > > I've experienced this bug on a tegra20 device today (paz00). Is this > the issue you have experienced ? > I confirm that I don't reproduce this on every boot (even using the > same kernel). Hello Nicolas, it looks very much like it, yes. I assume that you are running the serial console with the 8250 driver. The console would stutter afterwards (polling mode). Also one core runs at 100% sirq. But once getty re-opens the device (e.g. initial login or when you log out and getty gets restarted), the device uses IRQs again and works properly. For me, the bug was triggered roughly 3-5% of all reboots before the first getty pops up. Can you confirm this? Then you might want to try the patch and let me know if it helped. The patch should fix the issue most of the time. But in about 0.05% or so of all reboots the console will stop receiving characters until it is reopened. I was not able to fix this so far. But at least there is no more 100% sirq hogging of one core. Please let me know about any progress! David -- To unsubscribe from this list: send the line "unsubscribe linux-tegra" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi, On Fri, Jul 13, 2018 at 11:32:42AM +0000, David R. Piegdon wrote: > Hi, > a while back I sent a few mails regarding spurious interrupts in the > UARTA (hsuart) block of the Tegra2 SoC, when using the 8250 driver for > it instead of the hsuart driver. After going down a pretty deep > debugging/testing hole, I think I found a patch that fixes the issue. So > far testing in a reboot-cycle suggests that the error frequency dropped > from >3% of all reboots to at least <0.05% of all reboots. Tests > continue to run over the weekend. > > The patch below already is a second iteration; the first did not reset > the MCR or contain the lines below '// clear interrupts'. This resulted > in no more spurious interrupts, but in a few % of spurious interrupts > that were recovered the UART block did not receive any characters any > more. So further resetting was required to fully reacquire operational > state of the UART block. > > I'd love any comments/suggestions on this! I'd like to follow up on this ancient patch as we are using it successfully for a few years with different kernel versions on a tegra20 SOM (tamonten) now and I'm currently cleaning up our tree. David, have you done any work in regarding this issue since 2018? What would be needed to get this solution mainline? The recipient of this mail are from the initial thread [1] and a current get_maintainers.pl run. regards;rl [1] https://patchwork.ozlabs.org/project/linux-tegra/patch/4676ea34-69ce-5422-1ded-94218b89f7d9@p23q.org/ > > Cheers, > > David > > diff --git a/drivers/tty/serial/8250/8250_core.c b/drivers/tty/serial/8250/8250_core.c > index e8819aa20415..1d76eebefd4e 100644 > --- a/drivers/tty/serial/8250/8250_core.c > +++ b/drivers/tty/serial/8250/8250_core.c > @@ -140,6 +140,38 @@ static irqreturn_t serial8250_interrupt(int irq, void *dev_id) > "serial8250: too much work for irq%d\n", irq); > break; > } > + > +#ifdef CONFIG_ARCH_TEGRA_2x_SOC > + if (!handled && (port->type == PORT_TEGRA)) { > + /* > + * Fix Tegra 2 CPU silicon bug where sometimes > + * "TX holding register empty" interrupts result in a > + * bad (metastable?) state in Tegras HSUART IP core. > + * Only way to recover seems to be to reset all > + * interrupts as well as the TX queue and the MCR. > + * But we don't want to loose any outgoing characters, > + * so only do it if the RX and TX queues are empty. > + */ > + unsigned char lsr = port->serial_in(port, UART_LSR); > + const unsigned char fifo_empty_mask = > + (UART_LSR_TEMT | UART_LSR_THRE); > + if (((lsr & (UART_LSR_DR | fifo_empty_mask)) == > + fifo_empty_mask)) { > + port->serial_out(port, UART_IER, 0); > + port->serial_out(port, UART_MCR, 0); > + serial8250_clear_and_reinit_fifos(up); > + port->serial_out(port, UART_MCR, up->mcr); > + port->serial_out(port, UART_IER, up->ier); > + // clear interrupts > + serial_port_in(port, UART_LSR); > + serial_port_in(port, UART_RX); > + serial_port_in(port, UART_IIR); > + serial_port_in(port, UART_MSR); > + up->lsr_saved_flags = 0; > + up->msr_saved_flags = 0; > + } > + } > +#endif > } while (l != end); > > spin_unlock(&i->lock);
Hi, On Tue, Nov 22, 2022 at 06:47:39PM +0100, David R. Piegdon wrote: > Hallo Richard, > it's great to hear that the patch was helpful for someone other than us. > However, I no longer work on that project, or that company. I have added my old coworker Randolph to the recipients, who, I think, still maintains that platform (based on colibri t20 from toradex). Maybe he can be of help. If you come up with technical questions, I might still have a memory or two. Thanks for the feedback. As I'd love to see this mainline: Does one of you (David, Randolph) have interest/time to try to bring this mainline? If not I can try, altough I'm not into the Tegra UARTs details that deep... regards;rl > @Randolph: Cheers & I hope everything is going well! > > Yours, > David >
On Mon, 21 Nov 2022, Richard Leitner wrote: > Hi, > > On Fri, Jul 13, 2018 at 11:32:42AM +0000, David R. Piegdon wrote: > > Hi, > > a while back I sent a few mails regarding spurious interrupts in the > > UARTA (hsuart) block of the Tegra2 SoC, when using the 8250 driver for > > it instead of the hsuart driver. After going down a pretty deep > > debugging/testing hole, I think I found a patch that fixes the issue. So > > far testing in a reboot-cycle suggests that the error frequency dropped > > from >3% of all reboots to at least <0.05% of all reboots. Tests > > continue to run over the weekend. > > > > The patch below already is a second iteration; the first did not reset > > the MCR or contain the lines below '// clear interrupts'. This resulted > > in no more spurious interrupts, but in a few % of spurious interrupts > > that were recovered the UART block did not receive any characters any > > more. So further resetting was required to fully reacquire operational > > state of the UART block. > > > > I'd love any comments/suggestions on this! > > I'd like to follow up on this ancient patch as we are using it > successfully for a few years with different kernel versions on a > tegra20 SOM (tamonten) now and I'm currently cleaning up our tree. > > David, have you done any work in regarding this issue since 2018? > > What would be needed to get this solution mainline? It seems that the code would belong to ->handle_irq() rather than 8250_core. Do the affected device belong under 8250_tegra.c? If they do, then just create .handle_irq for it and detect this condition there after call to serial8250_handle_irq(). > The recipient of this mail are from the initial thread [1] and > a current get_maintainers.pl run. > > regards;rl > > [1] https://patchwork.ozlabs.org/project/linux-tegra/patch/4676ea34-69ce-5422-1ded-94218b89f7d9@p23q.org/ > > > > > Cheers, > > > > David > > > > diff --git a/drivers/tty/serial/8250/8250_core.c b/drivers/tty/serial/8250/8250_core.c > > index e8819aa20415..1d76eebefd4e 100644 > > --- a/drivers/tty/serial/8250/8250_core.c > > +++ b/drivers/tty/serial/8250/8250_core.c > > @@ -140,6 +140,38 @@ static irqreturn_t serial8250_interrupt(int irq, void *dev_id) > > "serial8250: too much work for irq%d\n", irq); > > break; > > } > > + > > +#ifdef CONFIG_ARCH_TEGRA_2x_SOC > > + if (!handled && (port->type == PORT_TEGRA)) { > > + /* > > + * Fix Tegra 2 CPU silicon bug where sometimes > > + * "TX holding register empty" interrupts result in a > > + * bad (metastable?) state in Tegras HSUART IP core. > > + * Only way to recover seems to be to reset all > > + * interrupts as well as the TX queue and the MCR. > > + * But we don't want to loose any outgoing characters, > > + * so only do it if the RX and TX queues are empty. > > + */ > > + unsigned char lsr = port->serial_in(port, UART_LSR); serial_lsr_in(), make sure you take the port's lock btw. > > + const unsigned char fifo_empty_mask = > > + (UART_LSR_TEMT | UART_LSR_THRE); > > + if (((lsr & (UART_LSR_DR | fifo_empty_mask)) == > > + fifo_empty_mask)) { uart_lsr_tx_empty(lsr) && !(lsr & UART_LSR_DR) fifo_empty_mask can be dropped.
diff --git a/drivers/tty/serial/8250/8250_core.c b/drivers/tty/serial/8250/8250_core.c index e8819aa20415..1d76eebefd4e 100644 --- a/drivers/tty/serial/8250/8250_core.c +++ b/drivers/tty/serial/8250/8250_core.c @@ -140,6 +140,38 @@ static irqreturn_t serial8250_interrupt(int irq, void *dev_id) "serial8250: too much work for irq%d\n", irq); break; } + +#ifdef CONFIG_ARCH_TEGRA_2x_SOC + if (!handled && (port->type == PORT_TEGRA)) { + /* + * Fix Tegra 2 CPU silicon bug where sometimes + * "TX holding register empty" interrupts result in a + * bad (metastable?) state in Tegras HSUART IP core. + * Only way to recover seems to be to reset all + * interrupts as well as the TX queue and the MCR. + * But we don't want to loose any outgoing characters, + * so only do it if the RX and TX queues are empty. + */ + unsigned char lsr = port->serial_in(port, UART_LSR); + const unsigned char fifo_empty_mask = + (UART_LSR_TEMT | UART_LSR_THRE); + if (((lsr & (UART_LSR_DR | fifo_empty_mask)) == + fifo_empty_mask)) { + port->serial_out(port, UART_IER, 0); + port->serial_out(port, UART_MCR, 0); + serial8250_clear_and_reinit_fifos(up); + port->serial_out(port, UART_MCR, up->mcr); + port->serial_out(port, UART_IER, up->ier); + // clear interrupts + serial_port_in(port, UART_LSR); + serial_port_in(port, UART_RX); + serial_port_in(port, UART_IIR); + serial_port_in(port, UART_MSR); + up->lsr_saved_flags = 0; + up->msr_saved_flags = 0; + } + } +#endif } while (l != end); spin_unlock(&i->lock);