diff mbox series

serial8250 on tegra hsuart: recover from spurious interrupts due to tegra2 silicon bug

Message ID 4676ea34-69ce-5422-1ded-94218b89f7d9@p23q.org
State Deferred
Headers show
Series serial8250 on tegra hsuart: recover from spurious interrupts due to tegra2 silicon bug | expand

Commit Message

David R. Piegdon July 13, 2018, 11:32 a.m. UTC
Hi,
a while back I sent a few mails regarding spurious interrupts in the
UARTA (hsuart) block of the Tegra2 SoC, when using the 8250 driver for
it instead of the hsuart driver. After going down a pretty deep
debugging/testing hole, I think I found a patch that fixes the issue. So
far testing in a reboot-cycle suggests that the error frequency dropped
from >3% of all reboots to at least <0.05% of all reboots. Tests
continue to run over the weekend.

The patch below already is a second iteration; the first did not reset
the MCR or contain the lines below '// clear interrupts'. This resulted
in no more spurious interrupts, but in a few % of spurious interrupts
that were recovered the UART block did not receive any characters any
more. So further resetting was required to fully reacquire operational
state of the UART block.

I'd love any comments/suggestions on this!

Cheers,

David

Comments

Dmitry Osipenko July 18, 2018, 11:33 a.m. UTC | #1
On Friday, 13 July 2018 14:32:42 MSK David R. Piegdon wrote:
> Hi,
> a while back I sent a few mails regarding spurious interrupts in the
> UARTA (hsuart) block of the Tegra2 SoC, when using the 8250 driver for
> it instead of the hsuart driver. After going down a pretty deep
> debugging/testing hole, I think I found a patch that fixes the issue. So
> far testing in a reboot-cycle suggests that the error frequency dropped
> from >3% of all reboots to at least <0.05% of all reboots. Tests
> continue to run over the weekend.
> 
> The patch below already is a second iteration; the first did not reset
> the MCR or contain the lines below '// clear interrupts'. This resulted
> in no more spurious interrupts, but in a few % of spurious interrupts
> that were recovered the UART block did not receive any characters any
> more. So further resetting was required to fully reacquire operational
> state of the UART block.
> 
> I'd love any comments/suggestions on this!
> 

I'm wondering whether later Tegra's have that issue as well, maybe Mikko 
knows?

Instead of #ifdef-ing T20 in the code, there probably should be some kind of a 
port flag. Please send a proper patch and follow suggestions from tty/ 
maintainers.


--
To unsubscribe from this list: send the line "unsubscribe linux-tegra" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mikko Perttunen July 18, 2018, 12:05 p.m. UTC | #2
On 18.07.2018 14:33, Dmitry Osipenko wrote:
> On Friday, 13 July 2018 14:32:42 MSK David R. Piegdon wrote:
>> Hi,
>> a while back I sent a few mails regarding spurious interrupts in the
>> UARTA (hsuart) block of the Tegra2 SoC, when using the 8250 driver for
>> it instead of the hsuart driver. After going down a pretty deep
>> debugging/testing hole, I think I found a patch that fixes the issue. So
>> far testing in a reboot-cycle suggests that the error frequency dropped
>> from >3% of all reboots to at least <0.05% of all reboots. Tests
>> continue to run over the weekend.
>>
>> The patch below already is a second iteration; the first did not reset
>> the MCR or contain the lines below '// clear interrupts'. This resulted
>> in no more spurious interrupts, but in a few % of spurious interrupts
>> that were recovered the UART block did not receive any characters any
>> more. So further resetting was required to fully reacquire operational
>> state of the UART block.
>>
>> I'd love any comments/suggestions on this!
>>
> 
> I'm wondering whether later Tegra's have that issue as well, maybe Mikko
> knows?

My understanding is that the issue is on all generations.

> 
> Instead of #ifdef-ing T20 in the code, there probably should be some kind of a
> port flag. Please send a proper patch and follow suggestions from tty/
> maintainers.
> 
> 

Yep - if possible to do in a tegra-specific callback, that would be 
nice, but I'm not sure if that's possible.

Cheers,
Mikko
--
To unsubscribe from this list: send the line "unsubscribe linux-tegra" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David R. Piegdon July 18, 2018, 12:15 p.m. UTC | #3
On 07/18/2018 12:05 PM, Mikko Perttunen wrote:
> On 18.07.2018 14:33, Dmitry Osipenko wrote:
>> On Friday, 13 July 2018 14:32:42 MSK David R. Piegdon wrote:
>>> Hi,
>>> a while back I sent a few mails regarding spurious interrupts in the
>>> UARTA (hsuart) block of the Tegra2 SoC, when using the 8250 driver for
>>> it instead of the hsuart driver. After going down a pretty deep
>>> debugging/testing hole, I think I found a patch that fixes the issue. So
>>> far testing in a reboot-cycle suggests that the error frequency dropped
>>> from >3% of all reboots to at least <0.05% of all reboots. Tests
>>> continue to run over the weekend.
>>>
>>> The patch below already is a second iteration; the first did not reset
>>> the MCR or contain the lines below '// clear interrupts'. This resulted
>>> in no more spurious interrupts, but in a few % of spurious interrupts
>>> that were recovered the UART block did not receive any characters any
>>> more. So further resetting was required to fully reacquire operational
>>> state of the UART block.
>>>
>>> I'd love any comments/suggestions on this!
>>>
>>
>> I'm wondering whether later Tegra's have that issue as well, maybe Mikko
>> knows?
> 
> My understanding is that the issue is on all generations.
> 

Hi,
its great to get some feedback here!

Disclaimer: So far I only know that my specific hardware- and
software-combination sees this issue. But I do see it across multiple
devices.
Also I only have Tegra2 hardware, so I am completely unsure if other
Tegra CPUs are affected.

*I am still waiting for confirmation from anyone else, if this is a real
issue*. For all that I know, it could be that we muxed some lines
wrongly which results in some core power rails being screwed, or
anything like this.
So if you actually see this issue, please let me know!


>>
>> Instead of #ifdef-ing T20 in the code, there probably should be some
>> kind of a
>> port flag. Please send a proper patch and follow suggestions from tty/
>> maintainers.>
> Yep - if possible to do in a tegra-specific callback, that would be
> nice, but I'm not sure if that's possible.


Due to above reasons, this was just an initial request for comments. I
will implement your suggestions over the next few days.
Thanks!

David



--
To unsubscribe from this list: send the line "unsubscribe linux-tegra" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David R. Piegdon Aug. 2, 2018, 7:23 p.m. UTC | #4
Considering that nobody has answered if he is affected by this too, or
not, I have to assume that this only affects my platform. Maybe its a
pinmuxing issue. We will have to investigate further. :-(.


I thus revoke this patch.


David


On 07/13/2018 11:32 AM, David R. Piegdon wrote:
> Hi,
> a while back I sent a few mails regarding spurious interrupts in the
> UARTA (hsuart) block of the Tegra2 SoC, when using the 8250 driver for
> it instead of the hsuart driver. After going down a pretty deep
> debugging/testing hole, I think I found a patch that fixes the issue. So
> far testing in a reboot-cycle suggests that the error frequency dropped
> from >3% of all reboots to at least <0.05% of all reboots. Tests
> continue to run over the weekend.
>
> The patch below already is a second iteration; the first did not reset
> the MCR or contain the lines below '// clear interrupts'. This resulted
> in no more spurious interrupts, but in a few % of spurious interrupts
> that were recovered the UART block did not receive any characters any
> more. So further resetting was required to fully reacquire operational
> state of the UART block.
>
> I'd love any comments/suggestions on this!
>
> Cheers,
>
> David


--
To unsubscribe from this list: send the line "unsubscribe linux-tegra" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Nicolas Chauvet Aug. 3, 2018, 8:40 a.m. UTC | #5
Le jeu. 2 août 2018 à 21:24, David R. Piegdon <lkml@p23q.org> a écrit :
>
> Considering that nobody has answered if he is affected by this too, or
> not, I have to assume that this only affects my platform. Maybe its a
> pinmuxing issue. We will have to investigate further. :-(.

David,

I've experienced this bug on a tegra20 device today (paz00). Is this
the issue you have experienced ?
I confirm that I don't reproduce this on every boot (even using the
same kernel).

Thx
---
[   22.463664] irq 282: nobody cared (try booting with the "irqpoll" option)
[   22.463678] CPU: 0 PID: 412 Comm: dbus-daemon Tainted: G         C
      4.18.0-0.rc7.git0.1.tegra.fc28.armv7hl #1
[   22.463681] Hardware name: NVIDIA Tegra SoC (Flattened Device Tree)
[   22.463716] [<c0312594>] (unwind_backtrace) from [<c030cc40>]
(show_stack+0x18/0x1c)
[   22.463736] [<c030cc40>] (show_stack) from [<c0ae4e78>]
(dump_stack+0x80/0xa0)
[   22.463753] [<c0ae4e78>] (dump_stack) from [<c03ac0c0>]
(__report_bad_irq+0x2c/0xc8)
[   22.463765] [<c03ac0c0>] (__report_bad_irq) from [<c03abfbc>]
(note_interrupt+0x1ec/0x294)
[   22.463775] [<c03abfbc>] (note_interrupt) from [<c03a94c8>]
(handle_irq_event_percpu+0x4c/0x5c)
[   22.463784] [<c03a94c8>] (handle_irq_event_percpu) from
[<c03a9524>] (handle_irq_event+0x4c/0x70)
[   22.463795] [<c03a9524>] (handle_irq_event) from [<c03ad2cc>]
(handle_fasteoi_irq+0xb8/0x13c)
[   22.463806] [<c03ad2cc>] (handle_fasteoi_irq) from [<c03a857c>]
(generic_handle_irq+0x20/0x30)
[   22.463814] [<c03a857c>] (generic_handle_irq) from [<c03a8c4c>]
(__handle_domain_irq+0xa4/0xbc)
[   22.463828] [<c03a8c4c>] (__handle_domain_irq) from [<c06a0658>]
(gic_handle_irq+0x5c/0x88)
[   22.463840] [<c06a0658>] (gic_handle_irq) from [<c0301a0c>]
(__irq_svc+0x6c/0x90)
[   22.463845] Exception stack(0xd87a1ee0 to 0xd87a1f28)
[   22.463854] 1ee0: 234d9720 00000000 1a9ef000 c113e880 c1203d00
00000000 00000202 c113e880
[   22.463863] 1f00: ffffe000 ffffe000 0000000a b6f29000 fffffff6
d87a1f30 00000000 c0302174
[   22.463867] 1f20: 400d0113 ffffffff
[   22.463879] [<c0301a0c>] (__irq_svc) from [<c0302174>]
(__do_softirq+0x8c/0x360)
[   22.463890] [<c0302174>] (__do_softirq) from [<c035261c>]
(irq_exit+0x7c/0xe4)
[   22.463899] [<c035261c>] (irq_exit) from [<c03a8c30>]
(__handle_domain_irq+0x88/0xbc)
[   22.463908] [<c03a8c30>] (__handle_domain_irq) from [<c06a0658>]
(gic_handle_irq+0x5c/0x88)
[   22.463917] [<c06a0658>] (gic_handle_irq) from [<c0301db0>]
(__irq_usr+0x50/0x80)
[   22.463921] Exception stack(0xd87a1fb0 to 0xd87a1ff8)
[   22.463927] 1fa0:                                     00000069
b6f29000 bec15b0c b6e0087d
[   22.463936] 1fc0: 52671368 00000000 0000021b 00000016 b6e0086e
bec15b0c b6f29000 b6f29000
[   22.463943] 1fe0: b6dff9e0 bec15a18 b6f07770 b6f05368 200d0010 ffffffff
[   22.463946] handlers:
[   22.463960] [<504af4a6>] serial8250_interrupt
[   22.463966] Disabling IRQ #282
---
David R. Piegdon Aug. 3, 2018, 9:45 a.m. UTC | #6
On 08/03/2018 08:40 AM, Nicolas Chauvet wrote:
> Le jeu. 2 août 2018 à 21:24, David R. Piegdon <lkml@p23q.org> a écrit :
>> Considering that nobody has answered if he is affected by this too, or
>> not, I have to assume that this only affects my platform. Maybe its a
>> pinmuxing issue. We will have to investigate further. :-(.
> David,
>
> I've experienced this bug on a tegra20 device today (paz00). Is this
> the issue you have experienced ?
> I confirm that I don't reproduce this on every boot (even using the
> same kernel).

Hello Nicolas,

it looks very much like it, yes. I assume that you are running the 
serial console with the 8250 driver.

The console would stutter afterwards (polling mode). Also one core runs 
at 100% sirq.

But once getty re-opens the device (e.g. initial login or when you log 
out and getty gets restarted), the device uses IRQs again and works 
properly.

For me, the bug was triggered roughly 3-5% of all reboots before the 
first getty pops up.

Can you confirm this?

Then you might want to try the patch and let me know if it helped.

The patch should fix the issue most of the time. But in about 0.05% or 
so of all reboots the console will stop receiving characters until it is 
reopened. I was not able to fix this so far. But at least there is no 
more 100% sirq hogging of one core.


Please let me know about any progress!

David

--
To unsubscribe from this list: send the line "unsubscribe linux-tegra" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Richard Leitner Nov. 21, 2022, 10:35 a.m. UTC | #7
Hi,

On Fri, Jul 13, 2018 at 11:32:42AM +0000, David R. Piegdon wrote:
> Hi,
> a while back I sent a few mails regarding spurious interrupts in the
> UARTA (hsuart) block of the Tegra2 SoC, when using the 8250 driver for
> it instead of the hsuart driver. After going down a pretty deep
> debugging/testing hole, I think I found a patch that fixes the issue. So
> far testing in a reboot-cycle suggests that the error frequency dropped
> from >3% of all reboots to at least <0.05% of all reboots. Tests
> continue to run over the weekend.
> 
> The patch below already is a second iteration; the first did not reset
> the MCR or contain the lines below '// clear interrupts'. This resulted
> in no more spurious interrupts, but in a few % of spurious interrupts
> that were recovered the UART block did not receive any characters any
> more. So further resetting was required to fully reacquire operational
> state of the UART block.
> 
> I'd love any comments/suggestions on this!

I'd like to follow up on this ancient patch as we are using it
successfully for a few years with different kernel versions on a
tegra20 SOM (tamonten) now and I'm currently cleaning up our tree.

David, have you done any work in regarding this issue since 2018?

What would be needed to get this solution mainline?

The recipient of this mail are from the initial thread [1] and
a current get_maintainers.pl run.

regards;rl

[1] https://patchwork.ozlabs.org/project/linux-tegra/patch/4676ea34-69ce-5422-1ded-94218b89f7d9@p23q.org/

> 
> Cheers,
> 
> David
> 
> diff --git a/drivers/tty/serial/8250/8250_core.c b/drivers/tty/serial/8250/8250_core.c
> index e8819aa20415..1d76eebefd4e 100644
> --- a/drivers/tty/serial/8250/8250_core.c
> +++ b/drivers/tty/serial/8250/8250_core.c
> @@ -140,6 +140,38 @@ static irqreturn_t serial8250_interrupt(int irq, void *dev_id)
>  				"serial8250: too much work for irq%d\n", irq);
>  			break;
>  		}
> +
> +#ifdef CONFIG_ARCH_TEGRA_2x_SOC
> +		if (!handled && (port->type == PORT_TEGRA)) {
> +			/*
> +			 * Fix Tegra 2 CPU silicon bug where sometimes
> +			 * "TX holding register empty" interrupts result in a
> +			 * bad (metastable?) state in Tegras HSUART IP core.
> +			 * Only way to recover seems to be to reset all
> +			 * interrupts as well as the TX queue and the MCR.
> +			 * But we don't want to loose any outgoing characters,
> +			 * so only do it if the RX and TX queues are empty.
> +			 */
> +			unsigned char lsr = port->serial_in(port, UART_LSR);
> +			const unsigned char fifo_empty_mask =
> +						(UART_LSR_TEMT | UART_LSR_THRE);
> +			if (((lsr & (UART_LSR_DR | fifo_empty_mask)) ==
> +							fifo_empty_mask)) {
> +				port->serial_out(port, UART_IER, 0);
> +				port->serial_out(port, UART_MCR, 0);
> +				serial8250_clear_and_reinit_fifos(up);
> +				port->serial_out(port, UART_MCR, up->mcr);
> +				port->serial_out(port, UART_IER, up->ier);
> +				// clear interrupts
> +				serial_port_in(port, UART_LSR);
> +				serial_port_in(port, UART_RX);
> +				serial_port_in(port, UART_IIR);
> +				serial_port_in(port, UART_MSR);
> +				up->lsr_saved_flags = 0;
> +				up->msr_saved_flags = 0;
> +			}
> +		}
> +#endif
>  	} while (l != end);
>  
>  	spin_unlock(&i->lock);
Richard Leitner Dec. 22, 2022, 7:07 a.m. UTC | #8
Hi,

On Tue, Nov 22, 2022 at 06:47:39PM +0100, David R. Piegdon wrote:
> Hallo Richard,
> it's great to hear that the patch was helpful for someone other than us.
> However, I no longer work on that project, or that company. I have added my old coworker Randolph to the recipients, who, I think, still maintains that platform (based on colibri t20 from toradex). Maybe he can be of help. If you come up with technical questions, I might still have a memory or two.

Thanks for the feedback.

As I'd love to see this mainline: Does one of you (David, Randolph) have
interest/time to try to bring this mainline? If not I can try, altough
I'm not into the Tegra UARTs details that deep...

regards;rl

> @Randolph: Cheers & I hope everything is going well!
> 
> Yours,
> David
>
Ilpo Järvinen Dec. 22, 2022, 10:30 a.m. UTC | #9
On Mon, 21 Nov 2022, Richard Leitner wrote:

> Hi,
> 
> On Fri, Jul 13, 2018 at 11:32:42AM +0000, David R. Piegdon wrote:
> > Hi,
> > a while back I sent a few mails regarding spurious interrupts in the
> > UARTA (hsuart) block of the Tegra2 SoC, when using the 8250 driver for
> > it instead of the hsuart driver. After going down a pretty deep
> > debugging/testing hole, I think I found a patch that fixes the issue. So
> > far testing in a reboot-cycle suggests that the error frequency dropped
> > from >3% of all reboots to at least <0.05% of all reboots. Tests
> > continue to run over the weekend.
> > 
> > The patch below already is a second iteration; the first did not reset
> > the MCR or contain the lines below '// clear interrupts'. This resulted
> > in no more spurious interrupts, but in a few % of spurious interrupts
> > that were recovered the UART block did not receive any characters any
> > more. So further resetting was required to fully reacquire operational
> > state of the UART block.
> > 
> > I'd love any comments/suggestions on this!
> 
> I'd like to follow up on this ancient patch as we are using it
> successfully for a few years with different kernel versions on a
> tegra20 SOM (tamonten) now and I'm currently cleaning up our tree.
> 
> David, have you done any work in regarding this issue since 2018?
> 
> What would be needed to get this solution mainline?

It seems that the code would belong to ->handle_irq() rather than 
8250_core. Do the affected device belong under 8250_tegra.c? If they do, 
then just create .handle_irq for it and detect this condition there after 
call to serial8250_handle_irq().

> The recipient of this mail are from the initial thread [1] and
> a current get_maintainers.pl run.
> 
> regards;rl
> 
> [1] https://patchwork.ozlabs.org/project/linux-tegra/patch/4676ea34-69ce-5422-1ded-94218b89f7d9@p23q.org/
> 
> > 
> > Cheers,
> > 
> > David
> > 
> > diff --git a/drivers/tty/serial/8250/8250_core.c b/drivers/tty/serial/8250/8250_core.c
> > index e8819aa20415..1d76eebefd4e 100644
> > --- a/drivers/tty/serial/8250/8250_core.c
> > +++ b/drivers/tty/serial/8250/8250_core.c
> > @@ -140,6 +140,38 @@ static irqreturn_t serial8250_interrupt(int irq, void *dev_id)
> >  				"serial8250: too much work for irq%d\n", irq);
> >  			break;
> >  		}
> > +
> > +#ifdef CONFIG_ARCH_TEGRA_2x_SOC
> > +		if (!handled && (port->type == PORT_TEGRA)) {
> > +			/*
> > +			 * Fix Tegra 2 CPU silicon bug where sometimes
> > +			 * "TX holding register empty" interrupts result in a
> > +			 * bad (metastable?) state in Tegras HSUART IP core.
> > +			 * Only way to recover seems to be to reset all
> > +			 * interrupts as well as the TX queue and the MCR.
> > +			 * But we don't want to loose any outgoing characters,
> > +			 * so only do it if the RX and TX queues are empty.
> > +			 */
> > +			unsigned char lsr = port->serial_in(port, UART_LSR);

serial_lsr_in(), make sure you take the port's lock btw.

> > +			const unsigned char fifo_empty_mask =
> > +						(UART_LSR_TEMT | UART_LSR_THRE);
> > +			if (((lsr & (UART_LSR_DR | fifo_empty_mask)) ==
> > +							fifo_empty_mask)) {

uart_lsr_tx_empty(lsr) && !(lsr & UART_LSR_DR)

fifo_empty_mask can be dropped.
diff mbox series

Patch

diff --git a/drivers/tty/serial/8250/8250_core.c b/drivers/tty/serial/8250/8250_core.c
index e8819aa20415..1d76eebefd4e 100644
--- a/drivers/tty/serial/8250/8250_core.c
+++ b/drivers/tty/serial/8250/8250_core.c
@@ -140,6 +140,38 @@  static irqreturn_t serial8250_interrupt(int irq, void *dev_id)
 				"serial8250: too much work for irq%d\n", irq);
 			break;
 		}
+
+#ifdef CONFIG_ARCH_TEGRA_2x_SOC
+		if (!handled && (port->type == PORT_TEGRA)) {
+			/*
+			 * Fix Tegra 2 CPU silicon bug where sometimes
+			 * "TX holding register empty" interrupts result in a
+			 * bad (metastable?) state in Tegras HSUART IP core.
+			 * Only way to recover seems to be to reset all
+			 * interrupts as well as the TX queue and the MCR.
+			 * But we don't want to loose any outgoing characters,
+			 * so only do it if the RX and TX queues are empty.
+			 */
+			unsigned char lsr = port->serial_in(port, UART_LSR);
+			const unsigned char fifo_empty_mask =
+						(UART_LSR_TEMT | UART_LSR_THRE);
+			if (((lsr & (UART_LSR_DR | fifo_empty_mask)) ==
+							fifo_empty_mask)) {
+				port->serial_out(port, UART_IER, 0);
+				port->serial_out(port, UART_MCR, 0);
+				serial8250_clear_and_reinit_fifos(up);
+				port->serial_out(port, UART_MCR, up->mcr);
+				port->serial_out(port, UART_IER, up->ier);
+				// clear interrupts
+				serial_port_in(port, UART_LSR);
+				serial_port_in(port, UART_RX);
+				serial_port_in(port, UART_IIR);
+				serial_port_in(port, UART_MSR);
+				up->lsr_saved_flags = 0;
+				up->msr_saved_flags = 0;
+			}
+		}
+#endif
 	} while (l != end);
 
 	spin_unlock(&i->lock);