Patchwork [1/1] fsldma: ignore end of segments interrupt

login
register
mail settings
Submitter Ira Snyder
Date Jan. 26, 2012, 8:58 p.m.
Message ID <1327611520-18256-1-git-send-email-iws@ovro.caltech.edu>
Download mbox | patch
Permalink /patch/138050/
State Superseded
Headers show

Comments

Ira Snyder - Jan. 26, 2012, 8:58 p.m.
The mpc8349ea has been observed to generate spurious end of segments
interrupts despite the fact that they are not enabled by this driver.
Check for them and ignore them to avoid a kernel error message.

Signed-off-by: Ira W. Snyder <iws@ovro.caltech.edu>
Cc: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dma/fsldma.c |   10 ++++++++++
 1 files changed, 10 insertions(+), 0 deletions(-)
Tabi Timur-B04825 - Feb. 16, 2012, 5:50 p.m.
On Thu, Jan 26, 2012 at 2:58 PM, Ira W. Snyder <iws@ovro.caltech.edu> wrote:
> The mpc8349ea has been observed to generate spurious end of segments
> interrupts despite the fact that they are not enabled by this driver.
> Check for them and ignore them to avoid a kernel error message.

When this happens, are there any other status bits set?  It seems
weird that there are spurious interrupts from an internal block,
especially since it's the same block on all 83xx parts.

I wonder if the EOSI bit just happens to be set when the interrupt
occurs for some other reason.
Ira Snyder - Feb. 16, 2012, 7 p.m.
On Thu, Feb 16, 2012 at 05:50:47PM +0000, Tabi Timur-B04825 wrote:
> On Thu, Jan 26, 2012 at 2:58 PM, Ira W. Snyder <iws@ovro.caltech.edu> wrote:
> > The mpc8349ea has been observed to generate spurious end of segments
> > interrupts despite the fact that they are not enabled by this driver.
> > Check for them and ignore them to avoid a kernel error message.
> 
> When this happens, are there any other status bits set?  It seems
> weird that there are spurious interrupts from an internal block,
> especially since it's the same block on all 83xx parts.
> 
> I wonder if the EOSI bit just happens to be set when the interrupt
> occurs for some other reason.
> 

I'm not sure. The fsldma irq handler only prints bits it did not handle.
There are several other bits in the driver which should never be seen,
but they are handled by the irq handler anyway. This is just a remnant
from the original Freescale code.

I have a set of 15 test boards that I can use to figure out which other
bits are set when this happens, if it is important.

I put a variation of this patch (missing the "skip tasklet if not idle"
logic) into my production boards roughly a month ago. I've gotten the
"controller not idle" error message 748 times, as compared to the
"unhandled sr 0x00000002" message 3449 times.

This leads me to believe that this occurs mostly (but not always)
concurrent with the end-of-chain interrupt.

In the last month, the "unhandled sr" error has occurred on 92 out of
120 boards in production use. The statistics are included below. On some
boards, it is much more frequent than on others. All boards have roughly
the same workload.

Another interesting tidbit from my logs: this only occurs on DMA channel
2 (the are numbered starting at 0, it is the 3rd channel). Here is an
example log message:

[3484053.821689] of:fsl-elo-dma e00082a8.dma: chan2: irq: unhandled sr 0x00000002

Thanks,
Ira

     15 serial-number-5
      1 serial-number-16
      8 serial-number-18
     16 serial-number-19
      3 serial-number-20
     21 serial-number-21
      1 serial-number-24
      1 serial-number-26
      3 serial-number-27
      2 serial-number-28
     16 serial-number-29
      4 serial-number-30
      1 serial-number-31
      4 serial-number-32
      5 serial-number-33
      1 serial-number-34
      6 serial-number-35
     18 serial-number-36
      1 serial-number-39
      1 serial-number-40
      2 serial-number-41
     10 serial-number-42
     11 serial-number-43
     32 serial-number-45
      6 serial-number-46
      4 serial-number-47
      1 serial-number-49
      6 serial-number-50
      2 serial-number-51
      4 serial-number-53
      1 serial-number-55
      1 serial-number-57
     15 serial-number-58
      1 serial-number-60
      1 serial-number-62
      1 serial-number-66
      8 serial-number-67
      2 serial-number-75
      1 serial-number-76
     11 serial-number-79
      4 serial-number-80
      8 serial-number-81
      1 serial-number-82
     11 serial-number-84
      2 serial-number-92
     20 serial-number-93
     30 serial-number-94
     19 serial-number-95
     32 serial-number-96
     73 serial-number-97
     18 serial-number-99
     57 serial-number-100
     41 serial-number-101
     28 serial-number-102
      8 serial-number-103
    132 serial-number-107
     60 serial-number-108
     55 serial-number-109
     97 serial-number-110
     18 serial-number-111
     45 serial-number-113
      6 serial-number-114
    123 serial-number-115
     27 serial-number-117
     29 serial-number-118
     12 serial-number-119
     47 serial-number-120
     74 serial-number-121
      8 serial-number-124
    128 serial-number-125
    326 serial-number-128
     84 serial-number-129
     36 serial-number-130
      2 serial-number-131
     75 serial-number-133
     64 serial-number-135
    686 serial-number-137
     97 serial-number-139
     28 serial-number-140
     82 serial-number-141
     36 serial-number-144
     31 serial-number-145
     47 serial-number-147
     60 serial-number-150
     22 serial-number-152
     36 serial-number-154
     57 serial-number-156
     68 serial-number-158
     54 serial-number-159
     37 serial-number-160
     46 serial-number-161
     14 serial-number-162
Tabi Timur-B04825 - Feb. 16, 2012, 7:34 p.m.
Ira W. Snyder wrote:
> This leads me to believe that this occurs mostly (but not always)
> concurrent with the end-of-chain interrupt.

Have you tested this on an 85xx platform?

I noticed something odd.  You're modifying fsldma_chan_irq(), which is for
DMA controllers that have per-channel IRQs.  83xx devices don't have
per-channel IRQs -- all channels on one controller have the same IRQ.
Looking at the device tree, I see that the IRQs are listed in the channel
nodes *and* in the controller node.  I don't see how we ever use the
per-controller ISR.

I wonder if the shared IRQ is the part of the cause of the interrupts
you're seeing.

> 
> In the last month, the "unhandled sr" error has occurred on 92 out of
> 120 boards in production use. The statistics are included below. On some
> boards, it is much more frequent than on others. All boards have roughly
> the same workload.
> 
> Another interesting tidbit from my logs: this only occurs on DMA channel
> 2 (the are numbered starting at 0, it is the 3rd channel). Here is an
> example log message:

What happens if you never register that channel?  That is, remove this
node from the device tree:

dma-channel@100 {
	compatible = "fsl,mpc8349-dma-channel", "fsl,elo-dma-channel";
	reg = <0x100 0x80>;
	cell-index = <2>;
	interrupt-parent = <&ipic>;
	interrupts = <71 8>;
};
Ira Snyder - Feb. 16, 2012, 7:46 p.m.
On Thu, Feb 16, 2012 at 01:34:00PM -0600, Timur Tabi wrote:
> Ira W. Snyder wrote:
> > This leads me to believe that this occurs mostly (but not always)
> > concurrent with the end-of-chain interrupt.
> 
> Have you tested this on an 85xx platform?
> 

No. I don't have the ability to connect my P2020 up to an FPGA to
recreate the DMA workload that causes this on my 8349EA. I can run the
dmatest module, if you'd like.

> I noticed something odd.  You're modifying fsldma_chan_irq(), which is for
> DMA controllers that have per-channel IRQs.  83xx devices don't have
> per-channel IRQs -- all channels on one controller have the same IRQ.
> Looking at the device tree, I see that the IRQs are listed in the channel
> nodes *and* in the controller node.  I don't see how we ever use the
> per-controller ISR.
> 

fsldma_ctrl_irq() (the per-controller irq handler) just calls through to
fsldma_chan_irq() (the per-channel irq handler).

> I wonder if the shared IRQ is the part of the cause of the interrupts
> you're seeing.
> 

My device tree is slightly modified to remove the per-controller
interrupts and interrupt-parent properties. Each individual channel has
identical interrupts and interrupt-parent properties specified.

Someone here suggested that I do that, several years ago. It has been
too long, and I do not remember who. I can reverse it, and use the
per-controller IRQ instead.

> > 
> > In the last month, the "unhandled sr" error has occurred on 92 out of
> > 120 boards in production use. The statistics are included below. On some
> > boards, it is much more frequent than on others. All boards have roughly
> > the same workload.
> > 
> > Another interesting tidbit from my logs: this only occurs on DMA channel
> > 2 (the are numbered starting at 0, it is the 3rd channel). Here is an
> > example log message:
> 
> What happens if you never register that channel?  That is, remove this
> node from the device tree:
> 
> dma-channel@100 {
> 	compatible = "fsl,mpc8349-dma-channel", "fsl,elo-dma-channel";
> 	reg = <0x100 0x80>;
> 	cell-index = <2>;
> 	interrupt-parent = <&ipic>;
> 	interrupts = <71 8>;
> };
> 

I can try that. I hunch the problem will move, as the carma-fpga driver
(see drivers/misc/carma/carma-fpga.c) will claim the 4th channel
instead.

Ira
Tabi Timur-B04825 - Feb. 16, 2012, 7:48 p.m.
Ira W. Snyder wrote:

> No. I don't have the ability to connect my P2020 up to an FPGA to
> recreate the DMA workload that causes this on my 8349EA. I can run the
> dmatest module, if you'd like.

I just want to make sure your patch doesn't break 85xx.
Ira Snyder - Feb. 17, 2012, 12:57 a.m.
On Thu, Feb 16, 2012 at 01:48:20PM -0600, Timur Tabi wrote:
> Ira W. Snyder wrote:
> 
> > No. I don't have the ability to connect my P2020 up to an FPGA to
> > recreate the DMA workload that causes this on my 8349EA. I can run the
> > dmatest module, if you'd like.
> 
> I just want to make sure your patch doesn't break 85xx.
> 

I tried both with and without this patch on my P2020 COM Express board.
With both kernels, the board locks up after 20 minutes or so, no
messages to the serial console.

I wouldn't be surprised if there are some memory problems with this
board. In any case, I don't have any reason to believe that this patch
causes any trouble: the board dies without it.

However, the patch doesn't break DMA on 85xx. If I unload the dmatest
module after 10 minutes or so, it claims to have passed many thousands
of tests without problems.

My 8349EA test boards (15 of them) have been running their normal DMA
workload plus dmatest on the unused 4th channel, all without errors, for
several hours. ~2.5 million successful tests per board, as I write this.

Ira

Patch

diff --git a/drivers/dma/fsldma.c b/drivers/dma/fsldma.c
index 8a78154..7dc9689 100644
--- a/drivers/dma/fsldma.c
+++ b/drivers/dma/fsldma.c
@@ -1052,6 +1052,16 @@  static irqreturn_t fsldma_chan_irq(int irq, void *data)
 		stat &= ~FSL_DMA_SR_EOLNI;
 	}
 
+	/*
+	 * This driver does not use this feature, therefore we shouldn't
+	 * ever see this bit set in the status register. However, it has
+	 * been observed on MPC8349EA parts.
+	 */
+	if (stat & FSL_DMA_SR_EOSI) {
+		chan_dbg(chan, "irq: End-of-Segments INT\n");
+		stat &= ~FSL_DMA_SR_EOSI;
+	}
+
 	/* check that the DMA controller is really idle */
 	if (!dma_is_idle(chan))
 		chan_err(chan, "irq: controller not idle!\n");