diff mbox

dmaengine: tegra: crash fix observed during dma client(UART) stress testing

Message ID 1462277698-11360-1-git-send-email-smohammed@nvidia.com
State Deferred
Delegated to: Jon Hunter
Headers show

Commit Message

Shardar Shariff Md May 3, 2016, 12:14 p.m. UTC
During DMA client(UART) stress testing, observed below crash:

[  167.041591] Unable to handle kernel paging request at virtual address 00100108
[  167.048818] pgd = ffffffc0de7ee000
[  167.052222] [00100108] *pgd=0000000000000000
[  167.056513] Internal error: Oops: 96000045 [#1] PREEMPT SMP
[  167.084048] Modules linked in:
[  167.087126] CPU: 0 PID: 1786 Comm: uarttest Tainted: G        W    3.10.33-gb76f6f9 #5
[  167.095040] task: ffffffc0a5ba6ac0 ti: ffffffc094380000 task.ti: ffffffc094380000
[  167.102529] PC is at tegra_dma_tasklet+0x50/0xf4
[  167.107148] LR is at tegra_dma_tasklet+0xc0/0xf4
[  167.111767] pc : [<ffffffc00044acc8>] lr : [<ffffffc00044ad38>] pstate: 800001c5
[  167.119155] sp : ffffffc094383a60
[  167.122469] x29: ffffffc094383a60 x28: 0000000000000000

Issue: UART RX channel DMA completion EOC(End of completion) interrupt
occurs and dma driver schedules tasklet() to execute callback function
and empty the cb_desc (callback descriptor). Before dma driver tasklet
runs, UART RX EORD (end of receive data) interrupt occurs. Here UART RX
ISR handler calls tegra_dma_terminate_all() and re-configures the DMA
for RX. While re-configuring, the cb_node data is re-initialized but the
cb_desc list is not emptied. Now when dma driver tasklet callback function
tries to check cb_desc and delete the cb_node (re-initialized node) kernel
crashes.

Fix: Empty the cb_desc data structure during tegra_dma_terminate_all()
routine if there are no pending transfers.

Signed-off-by: Shardar Shariff Md <smohammed@nvidia.com>
---
 drivers/dma/tegra20-apb-dma.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

Comments

Jon Hunter May 3, 2016, 2:09 p.m. UTC | #1
On 03/05/16 13:14, Shardar Shariff Md wrote:
> During DMA client(UART) stress testing, observed below crash:
> 
> [  167.041591] Unable to handle kernel paging request at virtual address 00100108
> [  167.048818] pgd = ffffffc0de7ee000
> [  167.052222] [00100108] *pgd=0000000000000000
> [  167.056513] Internal error: Oops: 96000045 [#1] PREEMPT SMP
> [  167.084048] Modules linked in:
> [  167.087126] CPU: 0 PID: 1786 Comm: uarttest Tainted: G        W    3.10.33-gb76f6f9 #5
> [  167.095040] task: ffffffc0a5ba6ac0 ti: ffffffc094380000 task.ti: ffffffc094380000
> [  167.102529] PC is at tegra_dma_tasklet+0x50/0xf4
> [  167.107148] LR is at tegra_dma_tasklet+0xc0/0xf4
> [  167.111767] pc : [<ffffffc00044acc8>] lr : [<ffffffc00044ad38>] pstate: 800001c5
> [  167.119155] sp : ffffffc094383a60
> [  167.122469] x29: ffffffc094383a60 x28: 0000000000000000

This appears to be from quite an old kernel. I assume that this is still
valid for the latest mainline?

> Issue: UART RX channel DMA completion EOC(End of completion) interrupt
> occurs and dma driver schedules tasklet() to execute callback function
> and empty the cb_desc (callback descriptor). Before dma driver tasklet
> runs, UART RX EORD (end of receive data) interrupt occurs. Here UART RX
> ISR handler calls tegra_dma_terminate_all() and re-configures the DMA
> for RX. While re-configuring, the cb_node data is re-initialized but the
> cb_desc list is not emptied. Now when dma driver tasklet callback function
> tries to check cb_desc and delete the cb_node (re-initialized node) kernel
> crashes.

I am wondering if we can simplify the description a bit here.

Is the problem that the current implementation assumes that the tasklet
will run before the next transfer has been configured? And if this does
not happen then we may request the same descriptor for the next tranfer
that is currently on the callback queue waiting for the tasklet to run?

> Fix: Empty the cb_desc data structure during tegra_dma_terminate_all()
> routine if there are no pending transfers.

Does note really describe the fix. We are emptying a list of descriptors
that have a callback pending.

I would be tempted to change the subject slightly as this is fixing a
race condition that could be seen by various different clients.

Cheers
Jon
--
To unsubscribe from this list: send the line "unsubscribe linux-tegra" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Shardar Shariff Md May 3, 2016, 2:44 p.m. UTC | #2
Thanks for review. Please find my comments inline.

> On 03/05/16 13:14, Shardar Shariff Md wrote:
> > During DMA client(UART) stress testing, observed below crash:
> >
> > [  167.041591] Unable to handle kernel paging request at virtual
> > address 00100108 [  167.048818] pgd = ffffffc0de7ee000 [  167.052222]
> > [00100108] *pgd=0000000000000000 [  167.056513] Internal error: Oops:
> > 96000045 [#1] PREEMPT SMP [  167.084048] Modules linked in:
> > [  167.087126] CPU: 0 PID: 1786 Comm: uarttest Tainted: G        W    3.10.33-
> gb76f6f9 #5
> > [  167.095040] task: ffffffc0a5ba6ac0 ti: ffffffc094380000 task.ti:
> > ffffffc094380000 [  167.102529] PC is at tegra_dma_tasklet+0x50/0xf4 [
> > 167.107148] LR is at tegra_dma_tasklet+0xc0/0xf4 [  167.111767] pc :
> > [<ffffffc00044acc8>] lr : [<ffffffc00044ad38>] pstate: 800001c5 [
> > 167.119155] sp : ffffffc094383a60 [  167.122469] x29: ffffffc094383a60
> > x28: 0000000000000000
> 
> This appears to be from quite an old kernel. I assume that this is still valid for
> the latest mainline?
[Shardar] Yes.
> 
> > Issue: UART RX channel DMA completion EOC(End of completion) interrupt
> > occurs and dma driver schedules tasklet() to execute callback function
> > and empty the cb_desc (callback descriptor). Before dma driver tasklet
> > runs, UART RX EORD (end of receive data) interrupt occurs. Here UART
> > RX ISR handler calls tegra_dma_terminate_all() and re-configures the
> > DMA for RX. While re-configuring, the cb_node data is re-initialized
> > but the cb_desc list is not emptied. Now when dma driver tasklet
> > callback function tries to check cb_desc and delete the cb_node
> > (re-initialized node) kernel crashes.
> 
> I am wondering if we can simplify the description a bit here.
> 
[Shardar] Will modify and push new patch.

> Is the problem that the current implementation assumes that the tasklet will
> run before the next transfer has been configured? And if this does not
> happen then we may request the same descriptor for the next tranfer that is
> currently on the callback queue waiting for the tasklet to run?
[Shardar] Yes.
If tasklet has not yet run and meanwhile dma client (UART RX ISR for EORD case) isr handler runs then it calls the tegra_dma_terminate_all() and queries for the bytes transferred through dma and then acknowledges the dma_desc (async_tx_ack(rx_dma_desc)). As the dma desc is acknowledged the same descriptor is allocated to the next transfer and dma_desc is initialized (where cb_node list is initialized).
Now when the tasklet runs, channel callback descriptor queue will be pointed to reinitialized cb_node. Which will cause the crash when deleting cb_node.

> 
> > Fix: Empty the cb_desc data structure during tegra_dma_terminate_all()
> > routine if there are no pending transfers.
> 
> Does note really describe the fix. We are emptying a list of descriptors that
> have a callback pending.
> 
[Shardar] Will elaborate the fix and upload new patch set.
Here dma client (UART ISR) is calling the tegra_dma_terminate_all() before the tasklet is running, so there is no need to handle the functionality of tasklet i.e execute the callback function and deleting the callback nodes from the queue. So fix is to delete the callback queue when there are no pending transfers.

> I would be tempted to change the subject slightly as this is fixing a race
> condition that could be seen by various different clients.
> 
[Shardar] Will modify it.
> Cheers
> Jon
--
To unsubscribe from this list: send the line "unsubscribe linux-tegra" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Thierry Reding May 3, 2016, 3:17 p.m. UTC | #3
On Tue, May 03, 2016 at 02:44:49PM +0000, Shardar Mohammed wrote:
> Thanks for review. Please find my comments inline.
> 
> > On 03/05/16 13:14, Shardar Shariff Md wrote:
> > > During DMA client(UART) stress testing, observed below crash:
> > >
> > > [  167.041591] Unable to handle kernel paging request at virtual
> > > address 00100108 [  167.048818] pgd = ffffffc0de7ee000 [  167.052222]
> > > [00100108] *pgd=0000000000000000 [  167.056513] Internal error: Oops:
> > > 96000045 [#1] PREEMPT SMP [  167.084048] Modules linked in:
> > > [  167.087126] CPU: 0 PID: 1786 Comm: uarttest Tainted: G        W    3.10.33-
> > gb76f6f9 #5
> > > [  167.095040] task: ffffffc0a5ba6ac0 ti: ffffffc094380000 task.ti:
> > > ffffffc094380000 [  167.102529] PC is at tegra_dma_tasklet+0x50/0xf4 [
> > > 167.107148] LR is at tegra_dma_tasklet+0xc0/0xf4 [  167.111767] pc :
> > > [<ffffffc00044acc8>] lr : [<ffffffc00044ad38>] pstate: 800001c5 [
> > > 167.119155] sp : ffffffc094383a60 [  167.122469] x29: ffffffc094383a60
> > > x28: 0000000000000000
> > 
> > This appears to be from quite an old kernel. I assume that this is still valid for
> > the latest mainline?
> [Shardar] Yes.

Please get into the habit of testing your patches against upstream
kernels. Merely testing that it applies is not enough.

All patches that are submitted for mainline inclusion must be *runtime*
tested against a mainline kernel.

Also it would be nice if you could offer steps on how to reproduce, that
way we can extend our testing to make sure we never regress.

Thierry
diff mbox

Patch

diff --git a/drivers/dma/tegra20-apb-dma.c b/drivers/dma/tegra20-apb-dma.c
index 3871f29..34bb4cd 100644
--- a/drivers/dma/tegra20-apb-dma.c
+++ b/drivers/dma/tegra20-apb-dma.c
@@ -751,10 +751,8 @@  static int tegra_dma_terminate_all(struct dma_chan *dc)
 	bool was_busy;
 
 	spin_lock_irqsave(&tdc->lock, flags);
-	if (list_empty(&tdc->pending_sg_req)) {
-		spin_unlock_irqrestore(&tdc->lock, flags);
-		return 0;
-	}
+	if (list_empty(&tdc->pending_sg_req))
+		goto empty_cblist;
 
 	if (!tdc->busy)
 		goto skip_dma_stop;
@@ -787,6 +785,7 @@  static int tegra_dma_terminate_all(struct dma_chan *dc)
 skip_dma_stop:
 	tegra_dma_abort_all(tdc);
 
+empty_cblist:
 	while (!list_empty(&tdc->cb_desc)) {
 		dma_desc  = list_first_entry(&tdc->cb_desc,
 					typeof(*dma_desc), cb_node);