Patchwork [RFC] fsldma: fix performance degradation by optimizing spinlock use.

login
register
mail settings
Submitter b29237@freescale.com
Date Nov. 22, 2011, 4:55 a.m.
Message ID <1321937705-19587-1-git-send-email-b29237@freescale.com>
Download mbox | patch
Permalink /patch/127003/
State RFC
Headers show

Comments

b29237@freescale.com - Nov. 22, 2011, 4:55 a.m.
From: Forrest Shi <b29237@freescale.com>

    dma status check function fsl_tx_status is heavily called in
    a tight loop and the desc lock in fsl_tx_status contended by
    the dma status update function. this caused the dma performance
    degrades much.

    this patch releases the lock in the fsl_tx_status function.
    I believe it has no neglect impact on the following call of
    dma_async_is_complete(...).

    we can see below three conditions will be identified as success
    a)  x < complete < use
    b)  x < complete+N < use+N
    c)  x < complete < use+N
    here complete is the completed_cookie, use is the last_used
    cookie, x is the querying cookie, N is MAX cookie

    when chan->completed_cookie is being read, the last_used may
    be incresed. Anyway it has no neglect impact on the dma status
    decision.

    Signed-off-by: Forrest Shi <xuelin.shi@freescale.com>
---
 drivers/dma/fsldma.c |    5 -----
 1 files changed, 0 insertions(+), 5 deletions(-)
Ira Snyder - Nov. 22, 2011, 6:59 p.m.
On Tue, Nov 22, 2011 at 12:55:05PM +0800, b29237@freescale.com wrote:
> From: Forrest Shi <b29237@freescale.com>
> 
>     dma status check function fsl_tx_status is heavily called in
>     a tight loop and the desc lock in fsl_tx_status contended by
>     the dma status update function. this caused the dma performance
>     degrades much.
> 
>     this patch releases the lock in the fsl_tx_status function.
>     I believe it has no neglect impact on the following call of
>     dma_async_is_complete(...).
> 
>     we can see below three conditions will be identified as success
>     a)  x < complete < use
>     b)  x < complete+N < use+N
>     c)  x < complete < use+N
>     here complete is the completed_cookie, use is the last_used
>     cookie, x is the querying cookie, N is MAX cookie
> 
>     when chan->completed_cookie is being read, the last_used may
>     be incresed. Anyway it has no neglect impact on the dma status
>     decision.
> 
>     Signed-off-by: Forrest Shi <xuelin.shi@freescale.com>
> ---
>  drivers/dma/fsldma.c |    5 -----
>  1 files changed, 0 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/dma/fsldma.c b/drivers/dma/fsldma.c
> index 8a78154..1dca56f 100644
> --- a/drivers/dma/fsldma.c
> +++ b/drivers/dma/fsldma.c
> @@ -986,15 +986,10 @@ static enum dma_status fsl_tx_status(struct dma_chan *dchan,
>  	struct fsldma_chan *chan = to_fsl_chan(dchan);
>  	dma_cookie_t last_complete;
>  	dma_cookie_t last_used;
> -	unsigned long flags;
> -
> -	spin_lock_irqsave(&chan->desc_lock, flags);
>  

This will cause a bug. See below for a detailed explanation. You need
this instead:

	/*
	 * On an SMP system, we must ensure that this CPU has seen the
	 * memory accesses performed by another CPU under the
	 * chan->desc_lock spinlock.
	 */
	smp_mb();
>  	last_complete = chan->completed_cookie;
>  	last_used = dchan->cookie;
>  
> -	spin_unlock_irqrestore(&chan->desc_lock, flags);
> -
>  	dma_set_tx_state(txstate, last_complete, last_used, 0);
>  	return dma_async_is_complete(cookie, last_complete, last_used);
>  }

Facts:
- dchan->cookie is the same member as chan->common.cookie (same memory location)
- chan->common.cookie is the "last allocated cookie for a pending transaction"
- chan->completed_cookie is the "last completed transaction"

I have replaced "dchan->cookie" with "chan->common.cookie" in the below
explanation, to keep everything referenced from the same structure.

Variable usage before your change. Everything is used locked.
- RW chan->common.cookie		(fsl_dma_tx_submit)
- R  chan->common.cookie		(fsl_tx_status)
- R  chan->completed_cookie		(fsl_tx_status)
- W  chan->completed_cookie		(dma_do_tasklet)

Variable usage after your change:
- RW chan->common.cookie		LOCKED
- R  chan->common.cookie		NO LOCK
- R  chan->completed_cookie		NO LOCK
- W  chan->completed_cookie             LOCKED

What if we assume that you have a 2 CPU system (such as a P2020). After
your changes, one possible sequence is:

=== CPU1 - allocate + submit descriptor: fsl_dma_tx_submit() ===
spin_lock_irqsave
descriptor->cookie = 20		(x in your example)
chan->common.cookie = 20	(used in your example)
spin_unlock_irqrestore

=== CPU2 - immediately calls fsl_tx_status() ===
chan->common.cookie == 19
chan->completed_cookie == 19
descriptor->cookie == 20

Since we don't have locks anymore, CPU2 may not have seen the write to
chan->common.cookie yet.

Also assume that the DMA hardware has not started processing the
transaction yet. Therefore dma_do_tasklet() has not been called, and
chan->completed_cookie has not been updated.

In this case, dma_async_is_complete() (on CPU2) returns DMA_SUCCESS,
even though the DMA operation has not succeeded. The DMA operation has
not even started yet!

The smp_mb() fixes this, since it forces CPU2 to have seen all memory
operations that happened before CPU1 released the spinlock. Spinlocks
are implicit SMP memory barriers.

Therefore, the above example becomes:
smp_mb();
chan->common.cookie == 20
chan->completed_cookie == 19
descriptor->cookie == 20

Then dma_async_is_complete() returns DMA_IN_PROGRESS, which is correct.

Thanks,
Ira
b29237@freescale.com - Nov. 24, 2011, 8:12 a.m.
Hi Ira,

Thanks for your review.

After second thought, I think your scenario may not occur.
Because the cookie 20 we query must be returned by fsl_dma_tx_submit(...) in practice. 
We never query a cookie not returned by fsl_dma_tx_submit(...).

When we call fsl_tx_status(20), the chan->common.cookie is definitely wrote as 20 and cpu2 could not read as 19.

How do you think? 

Happy Thanks Giving day,
Forrest

-----Original Message-----
From: Ira W. Snyder [mailto:iws@ovro.caltech.edu] 
Sent: 2011年11月23日 2:59
To: Shi Xuelin-B29237
Cc: dan.j.williams@intel.com; Li Yang-R58472; zw@zh-kernel.org; vinod.koul@intel.com; linuxppc-dev@lists.ozlabs.org; linux-kernel@vger.kernel.org
Subject: Re: [PATCH][RFC] fsldma: fix performance degradation by optimizing spinlock use.

On Tue, Nov 22, 2011 at 12:55:05PM +0800, b29237@freescale.com wrote:
> From: Forrest Shi <b29237@freescale.com>
> 
>     dma status check function fsl_tx_status is heavily called in
>     a tight loop and the desc lock in fsl_tx_status contended by
>     the dma status update function. this caused the dma performance
>     degrades much.
> 
>     this patch releases the lock in the fsl_tx_status function.
>     I believe it has no neglect impact on the following call of
>     dma_async_is_complete(...).
> 
>     we can see below three conditions will be identified as success
>     a)  x < complete < use
>     b)  x < complete+N < use+N
>     c)  x < complete < use+N
>     here complete is the completed_cookie, use is the last_used
>     cookie, x is the querying cookie, N is MAX cookie
> 
>     when chan->completed_cookie is being read, the last_used may
>     be incresed. Anyway it has no neglect impact on the dma status
>     decision.
> 
>     Signed-off-by: Forrest Shi <xuelin.shi@freescale.com>
> ---
>  drivers/dma/fsldma.c |    5 -----
>  1 files changed, 0 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/dma/fsldma.c b/drivers/dma/fsldma.c index 
> 8a78154..1dca56f 100644
> --- a/drivers/dma/fsldma.c
> +++ b/drivers/dma/fsldma.c
> @@ -986,15 +986,10 @@ static enum dma_status fsl_tx_status(struct dma_chan *dchan,
>  	struct fsldma_chan *chan = to_fsl_chan(dchan);
>  	dma_cookie_t last_complete;
>  	dma_cookie_t last_used;
> -	unsigned long flags;
> -
> -	spin_lock_irqsave(&chan->desc_lock, flags);
>  

This will cause a bug. See below for a detailed explanation. You need this instead:

	/*
	 * On an SMP system, we must ensure that this CPU has seen the
	 * memory accesses performed by another CPU under the
	 * chan->desc_lock spinlock.
	 */
	smp_mb();
>  	last_complete = chan->completed_cookie;
>  	last_used = dchan->cookie;
>  
> -	spin_unlock_irqrestore(&chan->desc_lock, flags);
> -
>  	dma_set_tx_state(txstate, last_complete, last_used, 0);
>  	return dma_async_is_complete(cookie, last_complete, last_used);  }

Facts:
- dchan->cookie is the same member as chan->common.cookie (same memory location)
- chan->common.cookie is the "last allocated cookie for a pending transaction"
- chan->completed_cookie is the "last completed transaction"

I have replaced "dchan->cookie" with "chan->common.cookie" in the below explanation, to keep everything referenced from the same structure.

Variable usage before your change. Everything is used locked.
- RW chan->common.cookie		(fsl_dma_tx_submit)
- R  chan->common.cookie		(fsl_tx_status)
- R  chan->completed_cookie		(fsl_tx_status)
- W  chan->completed_cookie		(dma_do_tasklet)

Variable usage after your change:
- RW chan->common.cookie		LOCKED
- R  chan->common.cookie		NO LOCK
- R  chan->completed_cookie		NO LOCK
- W  chan->completed_cookie             LOCKED

What if we assume that you have a 2 CPU system (such as a P2020). After your changes, one possible sequence is:

=== CPU1 - allocate + submit descriptor: fsl_dma_tx_submit() === spin_lock_irqsave
descriptor->cookie = 20		(x in your example)
chan->common.cookie = 20	(used in your example)
spin_unlock_irqrestore

=== CPU2 - immediately calls fsl_tx_status() ===
chan->common.cookie == 19
chan->completed_cookie == 19
descriptor->cookie == 20

Since we don't have locks anymore, CPU2 may not have seen the write to
chan->common.cookie yet.

Also assume that the DMA hardware has not started processing the transaction yet. Therefore dma_do_tasklet() has not been called, and
chan->completed_cookie has not been updated.

In this case, dma_async_is_complete() (on CPU2) returns DMA_SUCCESS, even though the DMA operation has not succeeded. The DMA operation has not even started yet!

The smp_mb() fixes this, since it forces CPU2 to have seen all memory operations that happened before CPU1 released the spinlock. Spinlocks are implicit SMP memory barriers.

Therefore, the above example becomes:
smp_mb();
chan->common.cookie == 20
chan->completed_cookie == 19
descriptor->cookie == 20

Then dma_async_is_complete() returns DMA_IN_PROGRESS, which is correct.

Thanks,
Ira
Ira Snyder - Nov. 28, 2011, 4:38 p.m.
On Thu, Nov 24, 2011 at 08:12:25AM +0000, Shi Xuelin-B29237 wrote:
> Hi Ira,
> 
> Thanks for your review.
> 
> After second thought, I think your scenario may not occur.
> Because the cookie 20 we query must be returned by fsl_dma_tx_submit(...) in practice. 
> We never query a cookie not returned by fsl_dma_tx_submit(...).
> 

I agree about this part.

> When we call fsl_tx_status(20), the chan->common.cookie is definitely wrote as 20 and cpu2 could not read as 19.
> 

This is what I don't agree about. However, I'm not an expert on CPU
cache vs. memory accesses in an multi-processor system. The section
titled "CACHE COHERENCY" in Documentation/memory-barriers.txt leads me
to believe that the scenario I described is possible.

What happens if CPU1's write of chan->common.cookie only goes into
CPU1's cache. It never makes it to main memory before CPU2 fetches the
old value of 19.

I don't think you should see any performance impact from the smp_mb()
operation.

Thanks,
Ira

> -----Original Message-----
> From: Ira W. Snyder [mailto:iws@ovro.caltech.edu] 
> Sent: 2011年11月23日 2:59
> To: Shi Xuelin-B29237
> Cc: dan.j.williams@intel.com; Li Yang-R58472; zw@zh-kernel.org; vinod.koul@intel.com; linuxppc-dev@lists.ozlabs.org; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH][RFC] fsldma: fix performance degradation by optimizing spinlock use.
> 
> On Tue, Nov 22, 2011 at 12:55:05PM +0800, b29237@freescale.com wrote:
> > From: Forrest Shi <b29237@freescale.com>
> > 
> >     dma status check function fsl_tx_status is heavily called in
> >     a tight loop and the desc lock in fsl_tx_status contended by
> >     the dma status update function. this caused the dma performance
> >     degrades much.
> > 
> >     this patch releases the lock in the fsl_tx_status function.
> >     I believe it has no neglect impact on the following call of
> >     dma_async_is_complete(...).
> > 
> >     we can see below three conditions will be identified as success
> >     a)  x < complete < use
> >     b)  x < complete+N < use+N
> >     c)  x < complete < use+N
> >     here complete is the completed_cookie, use is the last_used
> >     cookie, x is the querying cookie, N is MAX cookie
> > 
> >     when chan->completed_cookie is being read, the last_used may
> >     be incresed. Anyway it has no neglect impact on the dma status
> >     decision.
> > 
> >     Signed-off-by: Forrest Shi <xuelin.shi@freescale.com>
> > ---
> >  drivers/dma/fsldma.c |    5 -----
> >  1 files changed, 0 insertions(+), 5 deletions(-)
> > 
> > diff --git a/drivers/dma/fsldma.c b/drivers/dma/fsldma.c index 
> > 8a78154..1dca56f 100644
> > --- a/drivers/dma/fsldma.c
> > +++ b/drivers/dma/fsldma.c
> > @@ -986,15 +986,10 @@ static enum dma_status fsl_tx_status(struct dma_chan *dchan,
> >  	struct fsldma_chan *chan = to_fsl_chan(dchan);
> >  	dma_cookie_t last_complete;
> >  	dma_cookie_t last_used;
> > -	unsigned long flags;
> > -
> > -	spin_lock_irqsave(&chan->desc_lock, flags);
> >  
> 
> This will cause a bug. See below for a detailed explanation. You need this instead:
> 
> 	/*
> 	 * On an SMP system, we must ensure that this CPU has seen the
> 	 * memory accesses performed by another CPU under the
> 	 * chan->desc_lock spinlock.
> 	 */
> 	smp_mb();
> >  	last_complete = chan->completed_cookie;
> >  	last_used = dchan->cookie;
> >  
> > -	spin_unlock_irqrestore(&chan->desc_lock, flags);
> > -
> >  	dma_set_tx_state(txstate, last_complete, last_used, 0);
> >  	return dma_async_is_complete(cookie, last_complete, last_used);  }
> 
> Facts:
> - dchan->cookie is the same member as chan->common.cookie (same memory location)
> - chan->common.cookie is the "last allocated cookie for a pending transaction"
> - chan->completed_cookie is the "last completed transaction"
> 
> I have replaced "dchan->cookie" with "chan->common.cookie" in the below explanation, to keep everything referenced from the same structure.
> 
> Variable usage before your change. Everything is used locked.
> - RW chan->common.cookie		(fsl_dma_tx_submit)
> - R  chan->common.cookie		(fsl_tx_status)
> - R  chan->completed_cookie		(fsl_tx_status)
> - W  chan->completed_cookie		(dma_do_tasklet)
> 
> Variable usage after your change:
> - RW chan->common.cookie		LOCKED
> - R  chan->common.cookie		NO LOCK
> - R  chan->completed_cookie		NO LOCK
> - W  chan->completed_cookie             LOCKED
> 
> What if we assume that you have a 2 CPU system (such as a P2020). After your changes, one possible sequence is:
> 
> === CPU1 - allocate + submit descriptor: fsl_dma_tx_submit() === spin_lock_irqsave
> descriptor->cookie = 20		(x in your example)
> chan->common.cookie = 20	(used in your example)
> spin_unlock_irqrestore
> 
> === CPU2 - immediately calls fsl_tx_status() ===
> chan->common.cookie == 19
> chan->completed_cookie == 19
> descriptor->cookie == 20
> 
> Since we don't have locks anymore, CPU2 may not have seen the write to
> chan->common.cookie yet.
> 
> Also assume that the DMA hardware has not started processing the transaction yet. Therefore dma_do_tasklet() has not been called, and
> chan->completed_cookie has not been updated.
> 
> In this case, dma_async_is_complete() (on CPU2) returns DMA_SUCCESS, even though the DMA operation has not succeeded. The DMA operation has not even started yet!
> 
> The smp_mb() fixes this, since it forces CPU2 to have seen all memory operations that happened before CPU1 released the spinlock. Spinlocks are implicit SMP memory barriers.
> 
> Therefore, the above example becomes:
> smp_mb();
> chan->common.cookie == 20
> chan->completed_cookie == 19
> descriptor->cookie == 20
> 
> Then dma_async_is_complete() returns DMA_IN_PROGRESS, which is correct.
> 
> Thanks,
> Ira
> 
> 
> _______________________________________________
> Linuxppc-dev mailing list
> Linuxppc-dev@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev
Li Yang-R58472 - Nov. 29, 2011, 3:19 a.m.
> Subject: Re: [PATCH][RFC] fsldma: fix performance degradation by optimizing

> spinlock use.

> 

> On Thu, Nov 24, 2011 at 08:12:25AM +0000, Shi Xuelin-B29237 wrote:

> > Hi Ira,

> >

> > Thanks for your review.

> >

> > After second thought, I think your scenario may not occur.

> > Because the cookie 20 we query must be returned by fsl_dma_tx_submit(...) in

> practice.

> > We never query a cookie not returned by fsl_dma_tx_submit(...).

> >

> 

> I agree about this part.

> 

> > When we call fsl_tx_status(20), the chan->common.cookie is definitely wrote as

> 20 and cpu2 could not read as 19.

> >

> 

> This is what I don't agree about. However, I'm not an expert on CPU cache vs.

> memory accesses in an multi-processor system. The section titled "CACHE

> COHERENCY" in Documentation/memory-barriers.txt leads me to believe that the

> scenario I described is possible.


For Freescale PowerPC, the chip automatically takes care of cache coherency.  Even if this is a concern, spinlock can't address it.

> 

> What happens if CPU1's write of chan->common.cookie only goes into CPU1's

> cache. It never makes it to main memory before CPU2 fetches the old value of 19.

> 

> I don't think you should see any performance impact from the smp_mb()

> operation.


Smp_mb() do have impact on performance if it's in the hot path.  While it might be safer having it, I doubt it is really necessary.  If the CPU1 doesn't have the updated last_used, it's shouldn't have known there is a cookie 20 existed either.

- Leo

> 

> Thanks,

> Ira

> 

> > -----Original Message-----

> > From: Ira W. Snyder [mailto:iws@ovro.caltech.edu]

> > Sent: 2011年11月23日 2:59

> > To: Shi Xuelin-B29237

> > Cc: dan.j.williams@intel.com; Li Yang-R58472; zw@zh-kernel.org;

> > vinod.koul@intel.com; linuxppc-dev@lists.ozlabs.org;

> > linux-kernel@vger.kernel.org

> > Subject: Re: [PATCH][RFC] fsldma: fix performance degradation by optimizing

> spinlock use.

> >

> > On Tue, Nov 22, 2011 at 12:55:05PM +0800, b29237@freescale.com wrote:

> > > From: Forrest Shi <b29237@freescale.com>

> > >

> > >     dma status check function fsl_tx_status is heavily called in

> > >     a tight loop and the desc lock in fsl_tx_status contended by

> > >     the dma status update function. this caused the dma performance

> > >     degrades much.

> > >

> > >     this patch releases the lock in the fsl_tx_status function.

> > >     I believe it has no neglect impact on the following call of

> > >     dma_async_is_complete(...).

> > >

> > >     we can see below three conditions will be identified as success

> > >     a)  x < complete < use

> > >     b)  x < complete+N < use+N

> > >     c)  x < complete < use+N

> > >     here complete is the completed_cookie, use is the last_used

> > >     cookie, x is the querying cookie, N is MAX cookie

> > >

> > >     when chan->completed_cookie is being read, the last_used may

> > >     be incresed. Anyway it has no neglect impact on the dma status

> > >     decision.

> > >

> > >     Signed-off-by: Forrest Shi <xuelin.shi@freescale.com>

> > > ---

> > >  drivers/dma/fsldma.c |    5 -----

> > >  1 files changed, 0 insertions(+), 5 deletions(-)

> > >

> > > diff --git a/drivers/dma/fsldma.c b/drivers/dma/fsldma.c index

> > > 8a78154..1dca56f 100644

> > > --- a/drivers/dma/fsldma.c

> > > +++ b/drivers/dma/fsldma.c

> > > @@ -986,15 +986,10 @@ static enum dma_status fsl_tx_status(struct

> dma_chan *dchan,

> > >  	struct fsldma_chan *chan = to_fsl_chan(dchan);

> > >  	dma_cookie_t last_complete;

> > >  	dma_cookie_t last_used;

> > > -	unsigned long flags;

> > > -

> > > -	spin_lock_irqsave(&chan->desc_lock, flags);

> > >

> >

> > This will cause a bug. See below for a detailed explanation. You need this instead:

> >

> > 	/*

> > 	 * On an SMP system, we must ensure that this CPU has seen the

> > 	 * memory accesses performed by another CPU under the

> > 	 * chan->desc_lock spinlock.

> > 	 */

> > 	smp_mb();

> > >  	last_complete = chan->completed_cookie;

> > >  	last_used = dchan->cookie;

> > >

> > > -	spin_unlock_irqrestore(&chan->desc_lock, flags);

> > > -

> > >  	dma_set_tx_state(txstate, last_complete, last_used, 0);

> > >  	return dma_async_is_complete(cookie, last_complete, last_used);  }

> >

> > Facts:

> > - dchan->cookie is the same member as chan->common.cookie (same memory

> > location)

> > - chan->common.cookie is the "last allocated cookie for a pending transaction"

> > - chan->completed_cookie is the "last completed transaction"

> >

> > I have replaced "dchan->cookie" with "chan->common.cookie" in the below

> explanation, to keep everything referenced from the same structure.

> >

> > Variable usage before your change. Everything is used locked.

> > - RW chan->common.cookie		(fsl_dma_tx_submit)

> > - R  chan->common.cookie		(fsl_tx_status)

> > - R  chan->completed_cookie		(fsl_tx_status)

> > - W  chan->completed_cookie		(dma_do_tasklet)

> >

> > Variable usage after your change:

> > - RW chan->common.cookie		LOCKED

> > - R  chan->common.cookie		NO LOCK

> > - R  chan->completed_cookie		NO LOCK

> > - W  chan->completed_cookie             LOCKED

> >

> > What if we assume that you have a 2 CPU system (such as a P2020). After your

> changes, one possible sequence is:

> >

> > === CPU1 - allocate + submit descriptor: fsl_dma_tx_submit() ===

> > spin_lock_irqsave

> > descriptor->cookie = 20		(x in your example)

> > chan->common.cookie = 20	(used in your example)

> > spin_unlock_irqrestore

> >

> > === CPU2 - immediately calls fsl_tx_status() ===

> > chan->common.cookie == 19

> > chan->completed_cookie == 19

> > descriptor->cookie == 20

> >

> > Since we don't have locks anymore, CPU2 may not have seen the write to

> > chan->common.cookie yet.

> >

> > Also assume that the DMA hardware has not started processing the

> > transaction yet. Therefore dma_do_tasklet() has not been called, and

> > chan->completed_cookie has not been updated.

> >

> > In this case, dma_async_is_complete() (on CPU2) returns DMA_SUCCESS, even

> though the DMA operation has not succeeded. The DMA operation has not even

> started yet!

> >

> > The smp_mb() fixes this, since it forces CPU2 to have seen all memory operations

> that happened before CPU1 released the spinlock. Spinlocks are implicit SMP

> memory barriers.

> >

> > Therefore, the above example becomes:

> > smp_mb();

> > chan->common.cookie == 20

> > chan->completed_cookie == 19

> > descriptor->cookie == 20

> >

> > Then dma_async_is_complete() returns DMA_IN_PROGRESS, which is correct.

> >

> > Thanks,

> > Ira

> >

> >

> > _______________________________________________

> > Linuxppc-dev mailing list

> > Linuxppc-dev@lists.ozlabs.org

> > https://lists.ozlabs.org/listinfo/linuxppc-dev
b29237@freescale.com - Nov. 29, 2011, 3:41 a.m.
Hi Ira, 

see my comments below.

Thanks,
Forrest

-----Original Message-----
From: Ira W. Snyder [mailto:iws@ovro.caltech.edu] 

Sent: 2011年11月29日 0:38
To: Shi Xuelin-B29237
Cc: vinod.koul@intel.com; dan.j.williams@intel.com; linuxppc-dev@lists.ozlabs.org; Li Yang-R58472; linux-kernel@vger.kernel.org
Subject: Re: [PATCH][RFC] fsldma: fix performance degradation by optimizing spinlock use.

On Thu, Nov 24, 2011 at 08:12:25AM +0000, Shi Xuelin-B29237 wrote:
> Hi Ira,

> 

> Thanks for your review.

> 

> After second thought, I think your scenario may not occur.

> Because the cookie 20 we query must be returned by fsl_dma_tx_submit(...) in practice. 

> We never query a cookie not returned by fsl_dma_tx_submit(...).

> 


I agree about this part.

> When we call fsl_tx_status(20), the chan->common.cookie is definitely wrote as 20 and cpu2 could not read as 19.

> 


This is what I don't agree about. However, I'm not an expert on CPU cache vs. memory accesses in an multi-processor system. The section titled "CACHE COHERENCY" in Documentation/memory-barriers.txt leads me to believe that the scenario I described is possible.

 
[Shi Xuelin-B29237] If query is used without rules, your scenario may happen. But in DMA usage here, the query is used something like sequentially. Only after chan->common.cookie is updated in fsl_dma_tx_submit(...) and returned, then it could be queried. So you scenario will not happen.

What happens if CPU1's write of chan->common.cookie only goes into CPU1's cache. It never makes it to main memory before CPU2 fetches the old value of 19.
 
[Shi Xuelin-B29237] As you see, chan->common.cookie is updated in fsl_dma_tx_submit(...) and enclosed by spinlock. The spinlock implementation in PPC will guarantee the cache coherence among cores, something like you called implicit smp_mb.

I don't think you should see any performance impact from the smp_mb() operation.

[Shi Xuelin-B29237] Only add smp_mb() doesn't work. It only sync on this step, but next step is the same as just getting into this function without smp_mb operation.

Thanks,
Ira

> -----Original Message-----

> From: Ira W. Snyder [mailto:iws@ovro.caltech.edu]

> Sent: 2011年11月23日 2:59

> To: Shi Xuelin-B29237

> Cc: dan.j.williams@intel.com; Li Yang-R58472; zw@zh-kernel.org; 

> vinod.koul@intel.com; linuxppc-dev@lists.ozlabs.org; 

> linux-kernel@vger.kernel.org

> Subject: Re: [PATCH][RFC] fsldma: fix performance degradation by optimizing spinlock use.

> 

> On Tue, Nov 22, 2011 at 12:55:05PM +0800, b29237@freescale.com wrote:

> > From: Forrest Shi <b29237@freescale.com>

> > 

> >     dma status check function fsl_tx_status is heavily called in

> >     a tight loop and the desc lock in fsl_tx_status contended by

> >     the dma status update function. this caused the dma performance

> >     degrades much.

> > 

> >     this patch releases the lock in the fsl_tx_status function.

> >     I believe it has no neglect impact on the following call of

> >     dma_async_is_complete(...).

> > 

> >     we can see below three conditions will be identified as success

> >     a)  x < complete < use

> >     b)  x < complete+N < use+N

> >     c)  x < complete < use+N

> >     here complete is the completed_cookie, use is the last_used

> >     cookie, x is the querying cookie, N is MAX cookie

> > 

> >     when chan->completed_cookie is being read, the last_used may

> >     be incresed. Anyway it has no neglect impact on the dma status

> >     decision.

> > 

> >     Signed-off-by: Forrest Shi <xuelin.shi@freescale.com>

> > ---

> >  drivers/dma/fsldma.c |    5 -----

> >  1 files changed, 0 insertions(+), 5 deletions(-)

> > 

> > diff --git a/drivers/dma/fsldma.c b/drivers/dma/fsldma.c index 

> > 8a78154..1dca56f 100644

> > --- a/drivers/dma/fsldma.c

> > +++ b/drivers/dma/fsldma.c

> > @@ -986,15 +986,10 @@ static enum dma_status fsl_tx_status(struct dma_chan *dchan,

> >  	struct fsldma_chan *chan = to_fsl_chan(dchan);

> >  	dma_cookie_t last_complete;

> >  	dma_cookie_t last_used;

> > -	unsigned long flags;

> > -

> > -	spin_lock_irqsave(&chan->desc_lock, flags);

> >  

> 

> This will cause a bug. See below for a detailed explanation. You need this instead:

> 

> 	/*

> 	 * On an SMP system, we must ensure that this CPU has seen the

> 	 * memory accesses performed by another CPU under the

> 	 * chan->desc_lock spinlock.

> 	 */

> 	smp_mb();

> >  	last_complete = chan->completed_cookie;

> >  	last_used = dchan->cookie;

> >  

> > -	spin_unlock_irqrestore(&chan->desc_lock, flags);

> > -

> >  	dma_set_tx_state(txstate, last_complete, last_used, 0);

> >  	return dma_async_is_complete(cookie, last_complete, last_used);  }

> 

> Facts:

> - dchan->cookie is the same member as chan->common.cookie (same memory 

> location)

> - chan->common.cookie is the "last allocated cookie for a pending transaction"

> - chan->completed_cookie is the "last completed transaction"

> 

> I have replaced "dchan->cookie" with "chan->common.cookie" in the below explanation, to keep everything referenced from the same structure.

> 

> Variable usage before your change. Everything is used locked.

> - RW chan->common.cookie		(fsl_dma_tx_submit)

> - R  chan->common.cookie		(fsl_tx_status)

> - R  chan->completed_cookie		(fsl_tx_status)

> - W  chan->completed_cookie		(dma_do_tasklet)

> 

> Variable usage after your change:

> - RW chan->common.cookie		LOCKED

> - R  chan->common.cookie		NO LOCK

> - R  chan->completed_cookie		NO LOCK

> - W  chan->completed_cookie             LOCKED

> 

> What if we assume that you have a 2 CPU system (such as a P2020). After your changes, one possible sequence is:

> 

> === CPU1 - allocate + submit descriptor: fsl_dma_tx_submit() === 

> spin_lock_irqsave

> descriptor->cookie = 20		(x in your example)

> chan->common.cookie = 20	(used in your example)

> spin_unlock_irqrestore

> 

> === CPU2 - immediately calls fsl_tx_status() ===

> chan->common.cookie == 19

> chan->completed_cookie == 19

> descriptor->cookie == 20

> 

> Since we don't have locks anymore, CPU2 may not have seen the write to

> chan->common.cookie yet.

> 

> Also assume that the DMA hardware has not started processing the 

> transaction yet. Therefore dma_do_tasklet() has not been called, and

> chan->completed_cookie has not been updated.

> 

> In this case, dma_async_is_complete() (on CPU2) returns DMA_SUCCESS, even though the DMA operation has not succeeded. The DMA operation has not even started yet!

> 

> The smp_mb() fixes this, since it forces CPU2 to have seen all memory operations that happened before CPU1 released the spinlock. Spinlocks are implicit SMP memory barriers.

> 

> Therefore, the above example becomes:

> smp_mb();

> chan->common.cookie == 20

> chan->completed_cookie == 19

> descriptor->cookie == 20

> 

> Then dma_async_is_complete() returns DMA_IN_PROGRESS, which is correct.

> 

> Thanks,

> Ira

> 

> 

> _______________________________________________

> Linuxppc-dev mailing list

> Linuxppc-dev@lists.ozlabs.org

> https://lists.ozlabs.org/listinfo/linuxppc-dev
Ira Snyder - Nov. 29, 2011, 5:25 p.m.
On Tue, Nov 29, 2011 at 03:19:05AM +0000, Li Yang-R58472 wrote:
> > Subject: Re: [PATCH][RFC] fsldma: fix performance degradation by optimizing
> > spinlock use.
> > 
> > On Thu, Nov 24, 2011 at 08:12:25AM +0000, Shi Xuelin-B29237 wrote:
> > > Hi Ira,
> > >
> > > Thanks for your review.
> > >
> > > After second thought, I think your scenario may not occur.
> > > Because the cookie 20 we query must be returned by fsl_dma_tx_submit(...) in
> > practice.
> > > We never query a cookie not returned by fsl_dma_tx_submit(...).
> > >
> > 
> > I agree about this part.
> > 
> > > When we call fsl_tx_status(20), the chan->common.cookie is definitely wrote as
> > 20 and cpu2 could not read as 19.
> > >
> > 
> > This is what I don't agree about. However, I'm not an expert on CPU cache vs.
> > memory accesses in an multi-processor system. The section titled "CACHE
> > COHERENCY" in Documentation/memory-barriers.txt leads me to believe that the
> > scenario I described is possible.
> 
> For Freescale PowerPC, the chip automatically takes care of cache coherency.  Even if this is a concern, spinlock can't address it.
> 
> > 
> > What happens if CPU1's write of chan->common.cookie only goes into CPU1's
> > cache. It never makes it to main memory before CPU2 fetches the old value of 19.
> > 
> > I don't think you should see any performance impact from the smp_mb()
> > operation.
> 
> Smp_mb() do have impact on performance if it's in the hot path.  While it might be safer having it, I doubt it is really necessary.  If the CPU1 doesn't have the updated last_used, it's shouldn't have known there is a cookie 20 existed either.
> 

I believe that you are correct, for powerpc. However, anything outside
of arch/powerpc shouldn't assume it only runs on powerpc. I wouldn't be
surprised to see fsldma running on an iMX someday (ARM processor).

My interpretation says that the change introduces the possibility that
fsl_tx_status() returns the wrong answer for an extremely small time
window, on SMP only, based on Documentation/memory-barriers.txt. But I
can't seem convince you.

My real question is what code path is hitting this spinlock? Is it in
mainline Linux? Why is it polling rather than using callbacks to
determine DMA completion?

Thanks,
Ira

> > > -----Original Message-----
> > > From: Ira W. Snyder [mailto:iws@ovro.caltech.edu]
> > > Sent: 2011年11月23日 2:59
> > > To: Shi Xuelin-B29237
> > > Cc: dan.j.williams@intel.com; Li Yang-R58472; zw@zh-kernel.org;
> > > vinod.koul@intel.com; linuxppc-dev@lists.ozlabs.org;
> > > linux-kernel@vger.kernel.org
> > > Subject: Re: [PATCH][RFC] fsldma: fix performance degradation by optimizing
> > spinlock use.
> > >
> > > On Tue, Nov 22, 2011 at 12:55:05PM +0800, b29237@freescale.com wrote:
> > > > From: Forrest Shi <b29237@freescale.com>
> > > >
> > > >     dma status check function fsl_tx_status is heavily called in
> > > >     a tight loop and the desc lock in fsl_tx_status contended by
> > > >     the dma status update function. this caused the dma performance
> > > >     degrades much.
> > > >
> > > >     this patch releases the lock in the fsl_tx_status function.
> > > >     I believe it has no neglect impact on the following call of
> > > >     dma_async_is_complete(...).
> > > >
> > > >     we can see below three conditions will be identified as success
> > > >     a)  x < complete < use
> > > >     b)  x < complete+N < use+N
> > > >     c)  x < complete < use+N
> > > >     here complete is the completed_cookie, use is the last_used
> > > >     cookie, x is the querying cookie, N is MAX cookie
> > > >
> > > >     when chan->completed_cookie is being read, the last_used may
> > > >     be incresed. Anyway it has no neglect impact on the dma status
> > > >     decision.
> > > >
> > > >     Signed-off-by: Forrest Shi <xuelin.shi@freescale.com>
> > > > ---
> > > >  drivers/dma/fsldma.c |    5 -----
> > > >  1 files changed, 0 insertions(+), 5 deletions(-)
> > > >
> > > > diff --git a/drivers/dma/fsldma.c b/drivers/dma/fsldma.c index
> > > > 8a78154..1dca56f 100644
> > > > --- a/drivers/dma/fsldma.c
> > > > +++ b/drivers/dma/fsldma.c
> > > > @@ -986,15 +986,10 @@ static enum dma_status fsl_tx_status(struct
> > dma_chan *dchan,
> > > >  	struct fsldma_chan *chan = to_fsl_chan(dchan);
> > > >  	dma_cookie_t last_complete;
> > > >  	dma_cookie_t last_used;
> > > > -	unsigned long flags;
> > > > -
> > > > -	spin_lock_irqsave(&chan->desc_lock, flags);
> > > >
> > >
> > > This will cause a bug. See below for a detailed explanation. You need this instead:
> > >
> > > 	/*
> > > 	 * On an SMP system, we must ensure that this CPU has seen the
> > > 	 * memory accesses performed by another CPU under the
> > > 	 * chan->desc_lock spinlock.
> > > 	 */
> > > 	smp_mb();
> > > >  	last_complete = chan->completed_cookie;
> > > >  	last_used = dchan->cookie;
> > > >
> > > > -	spin_unlock_irqrestore(&chan->desc_lock, flags);
> > > > -
> > > >  	dma_set_tx_state(txstate, last_complete, last_used, 0);
> > > >  	return dma_async_is_complete(cookie, last_complete, last_used);  }
> > >
> > > Facts:
> > > - dchan->cookie is the same member as chan->common.cookie (same memory
> > > location)
> > > - chan->common.cookie is the "last allocated cookie for a pending transaction"
> > > - chan->completed_cookie is the "last completed transaction"
> > >
> > > I have replaced "dchan->cookie" with "chan->common.cookie" in the below
> > explanation, to keep everything referenced from the same structure.
> > >
> > > Variable usage before your change. Everything is used locked.
> > > - RW chan->common.cookie		(fsl_dma_tx_submit)
> > > - R  chan->common.cookie		(fsl_tx_status)
> > > - R  chan->completed_cookie		(fsl_tx_status)
> > > - W  chan->completed_cookie		(dma_do_tasklet)
> > >
> > > Variable usage after your change:
> > > - RW chan->common.cookie		LOCKED
> > > - R  chan->common.cookie		NO LOCK
> > > - R  chan->completed_cookie		NO LOCK
> > > - W  chan->completed_cookie             LOCKED
> > >
> > > What if we assume that you have a 2 CPU system (such as a P2020). After your
> > changes, one possible sequence is:
> > >
> > > === CPU1 - allocate + submit descriptor: fsl_dma_tx_submit() ===
> > > spin_lock_irqsave
> > > descriptor->cookie = 20		(x in your example)
> > > chan->common.cookie = 20	(used in your example)
> > > spin_unlock_irqrestore
> > >
> > > === CPU2 - immediately calls fsl_tx_status() ===
> > > chan->common.cookie == 19
> > > chan->completed_cookie == 19
> > > descriptor->cookie == 20
> > >
> > > Since we don't have locks anymore, CPU2 may not have seen the write to
> > > chan->common.cookie yet.
> > >
> > > Also assume that the DMA hardware has not started processing the
> > > transaction yet. Therefore dma_do_tasklet() has not been called, and
> > > chan->completed_cookie has not been updated.
> > >
> > > In this case, dma_async_is_complete() (on CPU2) returns DMA_SUCCESS, even
> > though the DMA operation has not succeeded. The DMA operation has not even
> > started yet!
> > >
> > > The smp_mb() fixes this, since it forces CPU2 to have seen all memory operations
> > that happened before CPU1 released the spinlock. Spinlocks are implicit SMP
> > memory barriers.
> > >
> > > Therefore, the above example becomes:
> > > smp_mb();
> > > chan->common.cookie == 20
> > > chan->completed_cookie == 19
> > > descriptor->cookie == 20
> > >
> > > Then dma_async_is_complete() returns DMA_IN_PROGRESS, which is correct.
> > >
> > > Thanks,
> > > Ira
> > >
> > >
> > > _______________________________________________
> > > Linuxppc-dev mailing list
> > > Linuxppc-dev@lists.ozlabs.org
> > > https://lists.ozlabs.org/listinfo/linuxppc-dev
>
Scott Wood - Nov. 29, 2011, 7:49 p.m.
On 11/28/2011 09:19 PM, Li Yang-R58472 wrote:
>> Subject: Re: [PATCH][RFC] fsldma: fix performance degradation by optimizing
>> spinlock use.
>>
>> On Thu, Nov 24, 2011 at 08:12:25AM +0000, Shi Xuelin-B29237 wrote:
>>> Hi Ira,
>>>
>>> Thanks for your review.
>>>
>>> After second thought, I think your scenario may not occur.
>>> Because the cookie 20 we query must be returned by fsl_dma_tx_submit(...) in
>> practice.
>>> We never query a cookie not returned by fsl_dma_tx_submit(...).
>>>
>>
>> I agree about this part.
>>
>>> When we call fsl_tx_status(20), the chan->common.cookie is definitely wrote as
>> 20 and cpu2 could not read as 19.
>>>
>>
>> This is what I don't agree about. However, I'm not an expert on CPU cache vs.
>> memory accesses in an multi-processor system. The section titled "CACHE
>> COHERENCY" in Documentation/memory-barriers.txt leads me to believe that the
>> scenario I described is possible.
> 
> For Freescale PowerPC, the chip automatically takes care of cache coherency.  Even if this is a concern, spinlock can't address it.

Cache coherency is not the same thing as ordering -- and spinlocks do
address ordering, because there are memory barriers in the lock
implementation.

If you're relying on some non-universal ordering guarantee that all
chips with this device make, it needs to be documented explicitly what
you're assuming and why it's valid.

-Scott
Tabi Timur-B04825 - Nov. 30, 2011, 12:08 a.m.
On Tue, Nov 29, 2011 at 11:25 AM, Ira W. Snyder <iws@ovro.caltech.edu> wrote:

> I believe that you are correct, for powerpc. However, anything outside
> of arch/powerpc shouldn't assume it only runs on powerpc. I wouldn't be
> surprised to see fsldma running on an iMX someday (ARM processor).

I certainly would.  The i.MX chips have their own DMA controller.  The
DMA controller that fsldma.c works with is unique to Freescale's 83xx,
85xx, and QorIQ chips.
b29237@freescale.com - Nov. 30, 2011, 9:57 a.m.
Hello Ira,

In drivers/dma/dmaengine.c, we have below tight loop to check DMA completion in mainline Linux:
       do {
                status = dma_async_is_tx_complete(chan, cookie, NULL, NULL);
                if (time_after_eq(jiffies, dma_sync_wait_timeout)) {
                        printk(KERN_ERR "dma_sync_wait_timeout!\n");
                        return DMA_ERROR;
                }
        } while (status == DMA_IN_PROGRESS);


Thanks,
Forrest

-----Original Message-----
From: Ira W. Snyder [mailto:iws@ovro.caltech.edu] 

Sent: 2011年11月30日 1:26
To: Li Yang-R58472
Cc: Shi Xuelin-B29237; vinod.koul@intel.com; dan.j.williams@intel.com; linuxppc-dev@lists.ozlabs.org; linux-kernel@vger.kernel.org
Subject: Re: [PATCH][RFC] fsldma: fix performance degradation by optimizing spinlock use.

On Tue, Nov 29, 2011 at 03:19:05AM +0000, Li Yang-R58472 wrote:
> > Subject: Re: [PATCH][RFC] fsldma: fix performance degradation by 

> > optimizing spinlock use.

> > 

> > On Thu, Nov 24, 2011 at 08:12:25AM +0000, Shi Xuelin-B29237 wrote:

> > > Hi Ira,

> > >

> > > Thanks for your review.

> > >

> > > After second thought, I think your scenario may not occur.

> > > Because the cookie 20 we query must be returned by 

> > > fsl_dma_tx_submit(...) in

> > practice.

> > > We never query a cookie not returned by fsl_dma_tx_submit(...).

> > >

> > 

> > I agree about this part.

> > 

> > > When we call fsl_tx_status(20), the chan->common.cookie is 

> > > definitely wrote as

> > 20 and cpu2 could not read as 19.

> > >

> > 

> > This is what I don't agree about. However, I'm not an expert on CPU cache vs.

> > memory accesses in an multi-processor system. The section titled 

> > "CACHE COHERENCY" in Documentation/memory-barriers.txt leads me to 

> > believe that the scenario I described is possible.

> 

> For Freescale PowerPC, the chip automatically takes care of cache coherency.  Even if this is a concern, spinlock can't address it.

> 

> > 

> > What happens if CPU1's write of chan->common.cookie only goes into 

> > CPU1's cache. It never makes it to main memory before CPU2 fetches the old value of 19.

> > 

> > I don't think you should see any performance impact from the 

> > smp_mb() operation.

> 

> Smp_mb() do have impact on performance if it's in the hot path.  While it might be safer having it, I doubt it is really necessary.  If the CPU1 doesn't have the updated last_used, it's shouldn't have known there is a cookie 20 existed either.

> 


I believe that you are correct, for powerpc. However, anything outside of arch/powerpc shouldn't assume it only runs on powerpc. I wouldn't be surprised to see fsldma running on an iMX someday (ARM processor).

My interpretation says that the change introduces the possibility that
fsl_tx_status() returns the wrong answer for an extremely small time window, on SMP only, based on Documentation/memory-barriers.txt. But I can't seem convince you.

My real question is what code path is hitting this spinlock? Is it in mainline Linux? Why is it polling rather than using callbacks to determine DMA completion?

Thanks,
Ira

> > > -----Original Message-----

> > > From: Ira W. Snyder [mailto:iws@ovro.caltech.edu]

> > > Sent: 2011年11月23日 2:59

> > > To: Shi Xuelin-B29237

> > > Cc: dan.j.williams@intel.com; Li Yang-R58472; zw@zh-kernel.org; 

> > > vinod.koul@intel.com; linuxppc-dev@lists.ozlabs.org; 

> > > linux-kernel@vger.kernel.org

> > > Subject: Re: [PATCH][RFC] fsldma: fix performance degradation by 

> > > optimizing

> > spinlock use.

> > >

> > > On Tue, Nov 22, 2011 at 12:55:05PM +0800, b29237@freescale.com wrote:

> > > > From: Forrest Shi <b29237@freescale.com>

> > > >

> > > >     dma status check function fsl_tx_status is heavily called in

> > > >     a tight loop and the desc lock in fsl_tx_status contended by

> > > >     the dma status update function. this caused the dma performance

> > > >     degrades much.

> > > >

> > > >     this patch releases the lock in the fsl_tx_status function.

> > > >     I believe it has no neglect impact on the following call of

> > > >     dma_async_is_complete(...).

> > > >

> > > >     we can see below three conditions will be identified as success

> > > >     a)  x < complete < use

> > > >     b)  x < complete+N < use+N

> > > >     c)  x < complete < use+N

> > > >     here complete is the completed_cookie, use is the last_used

> > > >     cookie, x is the querying cookie, N is MAX cookie

> > > >

> > > >     when chan->completed_cookie is being read, the last_used may

> > > >     be incresed. Anyway it has no neglect impact on the dma status

> > > >     decision.

> > > >

> > > >     Signed-off-by: Forrest Shi <xuelin.shi@freescale.com>

> > > > ---

> > > >  drivers/dma/fsldma.c |    5 -----

> > > >  1 files changed, 0 insertions(+), 5 deletions(-)

> > > >

> > > > diff --git a/drivers/dma/fsldma.c b/drivers/dma/fsldma.c index 

> > > > 8a78154..1dca56f 100644

> > > > --- a/drivers/dma/fsldma.c

> > > > +++ b/drivers/dma/fsldma.c

> > > > @@ -986,15 +986,10 @@ static enum dma_status 

> > > > fsl_tx_status(struct

> > dma_chan *dchan,

> > > >  	struct fsldma_chan *chan = to_fsl_chan(dchan);

> > > >  	dma_cookie_t last_complete;

> > > >  	dma_cookie_t last_used;

> > > > -	unsigned long flags;

> > > > -

> > > > -	spin_lock_irqsave(&chan->desc_lock, flags);

> > > >

> > >

> > > This will cause a bug. See below for a detailed explanation. You need this instead:

> > >

> > > 	/*

> > > 	 * On an SMP system, we must ensure that this CPU has seen the

> > > 	 * memory accesses performed by another CPU under the

> > > 	 * chan->desc_lock spinlock.

> > > 	 */

> > > 	smp_mb();

> > > >  	last_complete = chan->completed_cookie;

> > > >  	last_used = dchan->cookie;

> > > >

> > > > -	spin_unlock_irqrestore(&chan->desc_lock, flags);

> > > > -

> > > >  	dma_set_tx_state(txstate, last_complete, last_used, 0);

> > > >  	return dma_async_is_complete(cookie, last_complete, 

> > > > last_used);  }

> > >

> > > Facts:

> > > - dchan->cookie is the same member as chan->common.cookie (same 

> > > memory

> > > location)

> > > - chan->common.cookie is the "last allocated cookie for a pending transaction"

> > > - chan->completed_cookie is the "last completed transaction"

> > >

> > > I have replaced "dchan->cookie" with "chan->common.cookie" in the 

> > > below

> > explanation, to keep everything referenced from the same structure.

> > >

> > > Variable usage before your change. Everything is used locked.

> > > - RW chan->common.cookie		(fsl_dma_tx_submit)

> > > - R  chan->common.cookie		(fsl_tx_status)

> > > - R  chan->completed_cookie		(fsl_tx_status)

> > > - W  chan->completed_cookie		(dma_do_tasklet)

> > >

> > > Variable usage after your change:

> > > - RW chan->common.cookie		LOCKED

> > > - R  chan->common.cookie		NO LOCK

> > > - R  chan->completed_cookie		NO LOCK

> > > - W  chan->completed_cookie             LOCKED

> > >

> > > What if we assume that you have a 2 CPU system (such as a P2020). 

> > > After your

> > changes, one possible sequence is:

> > >

> > > === CPU1 - allocate + submit descriptor: fsl_dma_tx_submit() === 

> > > spin_lock_irqsave

> > > descriptor->cookie = 20		(x in your example)

> > > chan->common.cookie = 20	(used in your example)

> > > spin_unlock_irqrestore

> > >

> > > === CPU2 - immediately calls fsl_tx_status() ===

> > > chan->common.cookie == 19

> > > chan->completed_cookie == 19

> > > descriptor->cookie == 20

> > >

> > > Since we don't have locks anymore, CPU2 may not have seen the 

> > > write to

> > > chan->common.cookie yet.

> > >

> > > Also assume that the DMA hardware has not started processing the 

> > > transaction yet. Therefore dma_do_tasklet() has not been called, 

> > > and

> > > chan->completed_cookie has not been updated.

> > >

> > > In this case, dma_async_is_complete() (on CPU2) returns 

> > > DMA_SUCCESS, even

> > though the DMA operation has not succeeded. The DMA operation has 

> > not even started yet!

> > >

> > > The smp_mb() fixes this, since it forces CPU2 to have seen all 

> > > memory operations

> > that happened before CPU1 released the spinlock. Spinlocks are 

> > implicit SMP memory barriers.

> > >

> > > Therefore, the above example becomes:

> > > smp_mb();

> > > chan->common.cookie == 20

> > > chan->completed_cookie == 19

> > > descriptor->cookie == 20

> > >

> > > Then dma_async_is_complete() returns DMA_IN_PROGRESS, which is correct.

> > >

> > > Thanks,

> > > Ira

> > >

> > >

> > > _______________________________________________

> > > Linuxppc-dev mailing list

> > > Linuxppc-dev@lists.ozlabs.org

> > > https://lists.ozlabs.org/listinfo/linuxppc-dev

>
Ira Snyder - Nov. 30, 2011, 5:07 p.m.
On Wed, Nov 30, 2011 at 09:57:47AM +0000, Shi Xuelin-B29237 wrote:
> Hello Ira,
> 
> In drivers/dma/dmaengine.c, we have below tight loop to check DMA completion in mainline Linux:
>        do {
>                 status = dma_async_is_tx_complete(chan, cookie, NULL, NULL);
>                 if (time_after_eq(jiffies, dma_sync_wait_timeout)) {
>                         printk(KERN_ERR "dma_sync_wait_timeout!\n");
>                         return DMA_ERROR;
>                 }
>         } while (status == DMA_IN_PROGRESS);
> 

That is the body of dma_sync_wait(). It is mostly used in the raid code.
I understand that you don't want to change the raid code to use
callbacks.

In any case, I think we've strayed from the topic under consideration,
which is: can we remove this spinlock without introducing a bug.

I'm convinced that "smp_rmb()" is needed when removing the spinlock. As
noted, Documentation/memory-barriers.txt says that stores on one CPU can
be observed by another CPU in a different order.

Previously, there was an UNLOCK (in fsl_dma_tx_submit) followed by a
LOCK (in fsl_tx_status). This provided a "full barrier", forcing the
operations to complete correctly when viewed by the second CPU. From the
text:

> Therefore, from (1), (2) and (4) an UNLOCK followed by an unconditional LOCK is
> equivalent to a full barrier, but a LOCK followed by an UNLOCK is not.

Also, please read "EXAMPLES OF MEMORY BARRIER SEQUENCES" and "INTER-CPU
LOCKING BARRIER EFFECTS". Particularly, in "EXAMPLES OF MEMORY BARRIER
SEQUENCES", the text notes:

> Without intervention, CPU 2 may perceive the events on CPU 1 in some
> effectively random order, despite the write barrier issued by CPU 1:
>
> [snip diagram]
>
> And thirdly, a read barrier acts as a partial order on loads. Consider the
> following sequence of events:
>
> [snip diagram]
>
> Without intervention, CPU 2 may then choose to perceive the events on CPU 1 in
> some effectively random order, despite the write barrier issued by CPU 1:
>
> [snip diagram]
>

And so on. Please read this entire section in the document.

I can't give you an ACK on the proposed patch. To the best of my
understanding, I believe it introduces a bug. I've tried to provide as
much evidence for this belief as I can, in the form of documentation in
the kernel source tree. If you can cite some documentation that shows I
am wrong, I will happily change my mind!

Ira

> -----Original Message-----
> From: Ira W. Snyder [mailto:iws@ovro.caltech.edu] 
> Sent: 2011年11月30日 1:26
> To: Li Yang-R58472
> Cc: Shi Xuelin-B29237; vinod.koul@intel.com; dan.j.williams@intel.com; linuxppc-dev@lists.ozlabs.org; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH][RFC] fsldma: fix performance degradation by optimizing spinlock use.
> 
> On Tue, Nov 29, 2011 at 03:19:05AM +0000, Li Yang-R58472 wrote:
> > > Subject: Re: [PATCH][RFC] fsldma: fix performance degradation by 
> > > optimizing spinlock use.
> > > 
> > > On Thu, Nov 24, 2011 at 08:12:25AM +0000, Shi Xuelin-B29237 wrote:
> > > > Hi Ira,
> > > >
> > > > Thanks for your review.
> > > >
> > > > After second thought, I think your scenario may not occur.
> > > > Because the cookie 20 we query must be returned by 
> > > > fsl_dma_tx_submit(...) in
> > > practice.
> > > > We never query a cookie not returned by fsl_dma_tx_submit(...).
> > > >
> > > 
> > > I agree about this part.
> > > 
> > > > When we call fsl_tx_status(20), the chan->common.cookie is 
> > > > definitely wrote as
> > > 20 and cpu2 could not read as 19.
> > > >
> > > 
> > > This is what I don't agree about. However, I'm not an expert on CPU cache vs.
> > > memory accesses in an multi-processor system. The section titled 
> > > "CACHE COHERENCY" in Documentation/memory-barriers.txt leads me to 
> > > believe that the scenario I described is possible.
> > 
> > For Freescale PowerPC, the chip automatically takes care of cache coherency.  Even if this is a concern, spinlock can't address it.
> > 
> > > 
> > > What happens if CPU1's write of chan->common.cookie only goes into 
> > > CPU1's cache. It never makes it to main memory before CPU2 fetches the old value of 19.
> > > 
> > > I don't think you should see any performance impact from the 
> > > smp_mb() operation.
> > 
> > Smp_mb() do have impact on performance if it's in the hot path.  While it might be safer having it, I doubt it is really necessary.  If the CPU1 doesn't have the updated last_used, it's shouldn't have known there is a cookie 20 existed either.
> > 
> 
> I believe that you are correct, for powerpc. However, anything outside of arch/powerpc shouldn't assume it only runs on powerpc. I wouldn't be surprised to see fsldma running on an iMX someday (ARM processor).
> 
> My interpretation says that the change introduces the possibility that
> fsl_tx_status() returns the wrong answer for an extremely small time window, on SMP only, based on Documentation/memory-barriers.txt. But I can't seem convince you.
> 
> My real question is what code path is hitting this spinlock? Is it in mainline Linux? Why is it polling rather than using callbacks to determine DMA completion?
> 
> Thanks,
> Ira
> 
> > > > -----Original Message-----
> > > > From: Ira W. Snyder [mailto:iws@ovro.caltech.edu]
> > > > Sent: 2011年11月23日 2:59
> > > > To: Shi Xuelin-B29237
> > > > Cc: dan.j.williams@intel.com; Li Yang-R58472; zw@zh-kernel.org; 
> > > > vinod.koul@intel.com; linuxppc-dev@lists.ozlabs.org; 
> > > > linux-kernel@vger.kernel.org
> > > > Subject: Re: [PATCH][RFC] fsldma: fix performance degradation by 
> > > > optimizing
> > > spinlock use.
> > > >
> > > > On Tue, Nov 22, 2011 at 12:55:05PM +0800, b29237@freescale.com wrote:
> > > > > From: Forrest Shi <b29237@freescale.com>
> > > > >
> > > > >     dma status check function fsl_tx_status is heavily called in
> > > > >     a tight loop and the desc lock in fsl_tx_status contended by
> > > > >     the dma status update function. this caused the dma performance
> > > > >     degrades much.
> > > > >
> > > > >     this patch releases the lock in the fsl_tx_status function.
> > > > >     I believe it has no neglect impact on the following call of
> > > > >     dma_async_is_complete(...).
> > > > >
> > > > >     we can see below three conditions will be identified as success
> > > > >     a)  x < complete < use
> > > > >     b)  x < complete+N < use+N
> > > > >     c)  x < complete < use+N
> > > > >     here complete is the completed_cookie, use is the last_used
> > > > >     cookie, x is the querying cookie, N is MAX cookie
> > > > >
> > > > >     when chan->completed_cookie is being read, the last_used may
> > > > >     be incresed. Anyway it has no neglect impact on the dma status
> > > > >     decision.
> > > > >
> > > > >     Signed-off-by: Forrest Shi <xuelin.shi@freescale.com>
> > > > > ---
> > > > >  drivers/dma/fsldma.c |    5 -----
> > > > >  1 files changed, 0 insertions(+), 5 deletions(-)
> > > > >
> > > > > diff --git a/drivers/dma/fsldma.c b/drivers/dma/fsldma.c index 
> > > > > 8a78154..1dca56f 100644
> > > > > --- a/drivers/dma/fsldma.c
> > > > > +++ b/drivers/dma/fsldma.c
> > > > > @@ -986,15 +986,10 @@ static enum dma_status 
> > > > > fsl_tx_status(struct
> > > dma_chan *dchan,
> > > > >  	struct fsldma_chan *chan = to_fsl_chan(dchan);
> > > > >  	dma_cookie_t last_complete;
> > > > >  	dma_cookie_t last_used;
> > > > > -	unsigned long flags;
> > > > > -
> > > > > -	spin_lock_irqsave(&chan->desc_lock, flags);
> > > > >
> > > >
> > > > This will cause a bug. See below for a detailed explanation. You need this instead:
> > > >
> > > > 	/*
> > > > 	 * On an SMP system, we must ensure that this CPU has seen the
> > > > 	 * memory accesses performed by another CPU under the
> > > > 	 * chan->desc_lock spinlock.
> > > > 	 */
> > > > 	smp_mb();
> > > > >  	last_complete = chan->completed_cookie;
> > > > >  	last_used = dchan->cookie;
> > > > >
> > > > > -	spin_unlock_irqrestore(&chan->desc_lock, flags);
> > > > > -
> > > > >  	dma_set_tx_state(txstate, last_complete, last_used, 0);
> > > > >  	return dma_async_is_complete(cookie, last_complete, 
> > > > > last_used);  }
> > > >
> > > > Facts:
> > > > - dchan->cookie is the same member as chan->common.cookie (same 
> > > > memory
> > > > location)
> > > > - chan->common.cookie is the "last allocated cookie for a pending transaction"
> > > > - chan->completed_cookie is the "last completed transaction"
> > > >
> > > > I have replaced "dchan->cookie" with "chan->common.cookie" in the 
> > > > below
> > > explanation, to keep everything referenced from the same structure.
> > > >
> > > > Variable usage before your change. Everything is used locked.
> > > > - RW chan->common.cookie		(fsl_dma_tx_submit)
> > > > - R  chan->common.cookie		(fsl_tx_status)
> > > > - R  chan->completed_cookie		(fsl_tx_status)
> > > > - W  chan->completed_cookie		(dma_do_tasklet)
> > > >
> > > > Variable usage after your change:
> > > > - RW chan->common.cookie		LOCKED
> > > > - R  chan->common.cookie		NO LOCK
> > > > - R  chan->completed_cookie		NO LOCK
> > > > - W  chan->completed_cookie             LOCKED
> > > >
> > > > What if we assume that you have a 2 CPU system (such as a P2020). 
> > > > After your
> > > changes, one possible sequence is:
> > > >
> > > > === CPU1 - allocate + submit descriptor: fsl_dma_tx_submit() === 
> > > > spin_lock_irqsave
> > > > descriptor->cookie = 20		(x in your example)
> > > > chan->common.cookie = 20	(used in your example)
> > > > spin_unlock_irqrestore
> > > >
> > > > === CPU2 - immediately calls fsl_tx_status() ===
> > > > chan->common.cookie == 19
> > > > chan->completed_cookie == 19
> > > > descriptor->cookie == 20
> > > >
> > > > Since we don't have locks anymore, CPU2 may not have seen the 
> > > > write to
> > > > chan->common.cookie yet.
> > > >
> > > > Also assume that the DMA hardware has not started processing the 
> > > > transaction yet. Therefore dma_do_tasklet() has not been called, 
> > > > and
> > > > chan->completed_cookie has not been updated.
> > > >
> > > > In this case, dma_async_is_complete() (on CPU2) returns 
> > > > DMA_SUCCESS, even
> > > though the DMA operation has not succeeded. The DMA operation has 
> > > not even started yet!
> > > >
> > > > The smp_mb() fixes this, since it forces CPU2 to have seen all 
> > > > memory operations
> > > that happened before CPU1 released the spinlock. Spinlocks are 
> > > implicit SMP memory barriers.
> > > >
> > > > Therefore, the above example becomes:
> > > > smp_mb();
> > > > chan->common.cookie == 20
> > > > chan->completed_cookie == 19
> > > > descriptor->cookie == 20
> > > >
> > > > Then dma_async_is_complete() returns DMA_IN_PROGRESS, which is correct.
> > > >
> > > > Thanks,
> > > > Ira
> > > >
> > > >
> > > > _______________________________________________
> > > > Linuxppc-dev mailing list
> > > > Linuxppc-dev@lists.ozlabs.org
> > > > https://lists.ozlabs.org/listinfo/linuxppc-dev
> > 
>
b29237@freescale.com - Dec. 2, 2011, 3:47 a.m.
Hi Iris,

>I'm convinced that "smp_rmb()" is needed when removing the spinlock. As noted, Documentation/memory-barriers.txt says that stores on one CPU can be

>observed by another CPU in a different order.

>Previously, there was an UNLOCK (in fsl_dma_tx_submit) followed by a LOCK (in fsl_tx_status). This provided a "full barrier", forcing the operations to 

>complete correctly when viewed by the second CPU. 


I do not agree this smp_rmb() works here. Because when this smp_rmb() executed and begin to read chan->common.cookie, you still cannot avoid the order issue. Something like one is reading old value, but another CPU is updating the new value. 

My point is here the order is not important for the DMA decision.
Completed DMA tx is decided as not complete is not a big deal, because next time it will be OK.

I believe there is no case that could cause uncompleted DMA tx is decided as completed, because the fsl_tx_status is called after fsl_dma_tx_submit for a specific cookie. If you can give me an example here, I will agree with you.

Thanks,
Forrest 

-----Original Message-----
From: Ira W. Snyder [mailto:iws@ovro.caltech.edu] 

Sent: 2011年12月1日 1:08
To: Shi Xuelin-B29237
Cc: vinod.koul@intel.com; dan.j.williams@intel.com; linuxppc-dev@lists.ozlabs.org; linux-kernel@vger.kernel.org; Li Yang-R58472
Subject: Re: [PATCH][RFC] fsldma: fix performance degradation by optimizing spinlock use.

On Wed, Nov 30, 2011 at 09:57:47AM +0000, Shi Xuelin-B29237 wrote:
> Hello Ira,

> 

> In drivers/dma/dmaengine.c, we have below tight loop to check DMA completion in mainline Linux:

>        do {

>                 status = dma_async_is_tx_complete(chan, cookie, NULL, NULL);

>                 if (time_after_eq(jiffies, dma_sync_wait_timeout)) {

>                         printk(KERN_ERR "dma_sync_wait_timeout!\n");

>                         return DMA_ERROR;

>                 }

>         } while (status == DMA_IN_PROGRESS);

> 


That is the body of dma_sync_wait(). It is mostly used in the raid code.
I understand that you don't want to change the raid code to use callbacks.

In any case, I think we've strayed from the topic under consideration, which is: can we remove this spinlock without introducing a bug.

I'm convinced that "smp_rmb()" is needed when removing the spinlock. As noted, Documentation/memory-barriers.txt says that stores on one CPU can be observed by another CPU in a different order.

Previously, there was an UNLOCK (in fsl_dma_tx_submit) followed by a LOCK (in fsl_tx_status). This provided a "full barrier", forcing the operations to complete correctly when viewed by the second CPU. From the
text:

> Therefore, from (1), (2) and (4) an UNLOCK followed by an 

> unconditional LOCK is equivalent to a full barrier, but a LOCK followed by an UNLOCK is not.


Also, please read "EXAMPLES OF MEMORY BARRIER SEQUENCES" and "INTER-CPU LOCKING BARRIER EFFECTS". Particularly, in "EXAMPLES OF MEMORY BARRIER SEQUENCES", the text notes:

> Without intervention, CPU 2 may perceive the events on CPU 1 in some 

> effectively random order, despite the write barrier issued by CPU 1:

>

> [snip diagram]

>

> And thirdly, a read barrier acts as a partial order on loads. Consider 

> the following sequence of events:

>

> [snip diagram]

>

> Without intervention, CPU 2 may then choose to perceive the events on 

> CPU 1 in some effectively random order, despite the write barrier issued by CPU 1:

>

> [snip diagram]

>


And so on. Please read this entire section in the document.

I can't give you an ACK on the proposed patch. To the best of my understanding, I believe it introduces a bug. I've tried to provide as much evidence for this belief as I can, in the form of documentation in the kernel source tree. If you can cite some documentation that shows I am wrong, I will happily change my mind!

Ira

> -----Original Message-----

> From: Ira W. Snyder [mailto:iws@ovro.caltech.edu]

> Sent: 2011年11月30日 1:26

> To: Li Yang-R58472

> Cc: Shi Xuelin-B29237; vinod.koul@intel.com; dan.j.williams@intel.com; 

> linuxppc-dev@lists.ozlabs.org; linux-kernel@vger.kernel.org

> Subject: Re: [PATCH][RFC] fsldma: fix performance degradation by optimizing spinlock use.

> 

> On Tue, Nov 29, 2011 at 03:19:05AM +0000, Li Yang-R58472 wrote:

> > > Subject: Re: [PATCH][RFC] fsldma: fix performance degradation by 

> > > optimizing spinlock use.

> > > 

> > > On Thu, Nov 24, 2011 at 08:12:25AM +0000, Shi Xuelin-B29237 wrote:

> > > > Hi Ira,

> > > >

> > > > Thanks for your review.

> > > >

> > > > After second thought, I think your scenario may not occur.

> > > > Because the cookie 20 we query must be returned by

> > > > fsl_dma_tx_submit(...) in

> > > practice.

> > > > We never query a cookie not returned by fsl_dma_tx_submit(...).

> > > >

> > > 

> > > I agree about this part.

> > > 

> > > > When we call fsl_tx_status(20), the chan->common.cookie is 

> > > > definitely wrote as

> > > 20 and cpu2 could not read as 19.

> > > >

> > > 

> > > This is what I don't agree about. However, I'm not an expert on CPU cache vs.

> > > memory accesses in an multi-processor system. The section titled 

> > > "CACHE COHERENCY" in Documentation/memory-barriers.txt leads me to 

> > > believe that the scenario I described is possible.

> > 

> > For Freescale PowerPC, the chip automatically takes care of cache coherency.  Even if this is a concern, spinlock can't address it.

> > 

> > > 

> > > What happens if CPU1's write of chan->common.cookie only goes into 

> > > CPU1's cache. It never makes it to main memory before CPU2 fetches the old value of 19.

> > > 

> > > I don't think you should see any performance impact from the

> > > smp_mb() operation.

> > 

> > Smp_mb() do have impact on performance if it's in the hot path.  While it might be safer having it, I doubt it is really necessary.  If the CPU1 doesn't have the updated last_used, it's shouldn't have known there is a cookie 20 existed either.

> > 

> 

> I believe that you are correct, for powerpc. However, anything outside of arch/powerpc shouldn't assume it only runs on powerpc. I wouldn't be surprised to see fsldma running on an iMX someday (ARM processor).

> 

> My interpretation says that the change introduces the possibility that

> fsl_tx_status() returns the wrong answer for an extremely small time window, on SMP only, based on Documentation/memory-barriers.txt. But I can't seem convince you.

> 

> My real question is what code path is hitting this spinlock? Is it in mainline Linux? Why is it polling rather than using callbacks to determine DMA completion?

> 

> Thanks,

> Ira

> 

> > > > -----Original Message-----

> > > > From: Ira W. Snyder [mailto:iws@ovro.caltech.edu]

> > > > Sent: 2011年11月23日 2:59

> > > > To: Shi Xuelin-B29237

> > > > Cc: dan.j.williams@intel.com; Li Yang-R58472; zw@zh-kernel.org; 

> > > > vinod.koul@intel.com; linuxppc-dev@lists.ozlabs.org; 

> > > > linux-kernel@vger.kernel.org

> > > > Subject: Re: [PATCH][RFC] fsldma: fix performance degradation by 

> > > > optimizing

> > > spinlock use.

> > > >

> > > > On Tue, Nov 22, 2011 at 12:55:05PM +0800, b29237@freescale.com wrote:

> > > > > From: Forrest Shi <b29237@freescale.com>

> > > > >

> > > > >     dma status check function fsl_tx_status is heavily called in

> > > > >     a tight loop and the desc lock in fsl_tx_status contended by

> > > > >     the dma status update function. this caused the dma performance

> > > > >     degrades much.

> > > > >

> > > > >     this patch releases the lock in the fsl_tx_status function.

> > > > >     I believe it has no neglect impact on the following call of

> > > > >     dma_async_is_complete(...).

> > > > >

> > > > >     we can see below three conditions will be identified as success

> > > > >     a)  x < complete < use

> > > > >     b)  x < complete+N < use+N

> > > > >     c)  x < complete < use+N

> > > > >     here complete is the completed_cookie, use is the last_used

> > > > >     cookie, x is the querying cookie, N is MAX cookie

> > > > >

> > > > >     when chan->completed_cookie is being read, the last_used may

> > > > >     be incresed. Anyway it has no neglect impact on the dma status

> > > > >     decision.

> > > > >

> > > > >     Signed-off-by: Forrest Shi <xuelin.shi@freescale.com>

> > > > > ---

> > > > >  drivers/dma/fsldma.c |    5 -----

> > > > >  1 files changed, 0 insertions(+), 5 deletions(-)

> > > > >

> > > > > diff --git a/drivers/dma/fsldma.c b/drivers/dma/fsldma.c index 

> > > > > 8a78154..1dca56f 100644

> > > > > --- a/drivers/dma/fsldma.c

> > > > > +++ b/drivers/dma/fsldma.c

> > > > > @@ -986,15 +986,10 @@ static enum dma_status 

> > > > > fsl_tx_status(struct

> > > dma_chan *dchan,

> > > > >  	struct fsldma_chan *chan = to_fsl_chan(dchan);

> > > > >  	dma_cookie_t last_complete;

> > > > >  	dma_cookie_t last_used;

> > > > > -	unsigned long flags;

> > > > > -

> > > > > -	spin_lock_irqsave(&chan->desc_lock, flags);

> > > > >

> > > >

> > > > This will cause a bug. See below for a detailed explanation. You need this instead:

> > > >

> > > > 	/*

> > > > 	 * On an SMP system, we must ensure that this CPU has seen the

> > > > 	 * memory accesses performed by another CPU under the

> > > > 	 * chan->desc_lock spinlock.

> > > > 	 */

> > > > 	smp_mb();

> > > > >  	last_complete = chan->completed_cookie;

> > > > >  	last_used = dchan->cookie;

> > > > >

> > > > > -	spin_unlock_irqrestore(&chan->desc_lock, flags);

> > > > > -

> > > > >  	dma_set_tx_state(txstate, last_complete, last_used, 0);

> > > > >  	return dma_async_is_complete(cookie, last_complete, 

> > > > > last_used);  }

> > > >

> > > > Facts:

> > > > - dchan->cookie is the same member as chan->common.cookie (same 

> > > > memory

> > > > location)

> > > > - chan->common.cookie is the "last allocated cookie for a pending transaction"

> > > > - chan->completed_cookie is the "last completed transaction"

> > > >

> > > > I have replaced "dchan->cookie" with "chan->common.cookie" in 

> > > > the below

> > > explanation, to keep everything referenced from the same structure.

> > > >

> > > > Variable usage before your change. Everything is used locked.

> > > > - RW chan->common.cookie		(fsl_dma_tx_submit)

> > > > - R  chan->common.cookie		(fsl_tx_status)

> > > > - R  chan->completed_cookie		(fsl_tx_status)

> > > > - W  chan->completed_cookie		(dma_do_tasklet)

> > > >

> > > > Variable usage after your change:

> > > > - RW chan->common.cookie		LOCKED

> > > > - R  chan->common.cookie		NO LOCK

> > > > - R  chan->completed_cookie		NO LOCK

> > > > - W  chan->completed_cookie             LOCKED

> > > >

> > > > What if we assume that you have a 2 CPU system (such as a P2020). 

> > > > After your

> > > changes, one possible sequence is:

> > > >

> > > > === CPU1 - allocate + submit descriptor: fsl_dma_tx_submit() === 

> > > > spin_lock_irqsave

> > > > descriptor->cookie = 20		(x in your example)

> > > > chan->common.cookie = 20	(used in your example)

> > > > spin_unlock_irqrestore

> > > >

> > > > === CPU2 - immediately calls fsl_tx_status() ===

> > > > chan->common.cookie == 19

> > > > chan->completed_cookie == 19

> > > > descriptor->cookie == 20

> > > >

> > > > Since we don't have locks anymore, CPU2 may not have seen the 

> > > > write to

> > > > chan->common.cookie yet.

> > > >

> > > > Also assume that the DMA hardware has not started processing the 

> > > > transaction yet. Therefore dma_do_tasklet() has not been called, 

> > > > and

> > > > chan->completed_cookie has not been updated.

> > > >

> > > > In this case, dma_async_is_complete() (on CPU2) returns 

> > > > DMA_SUCCESS, even

> > > though the DMA operation has not succeeded. The DMA operation has 

> > > not even started yet!

> > > >

> > > > The smp_mb() fixes this, since it forces CPU2 to have seen all 

> > > > memory operations

> > > that happened before CPU1 released the spinlock. Spinlocks are 

> > > implicit SMP memory barriers.

> > > >

> > > > Therefore, the above example becomes:

> > > > smp_mb();

> > > > chan->common.cookie == 20

> > > > chan->completed_cookie == 19

> > > > descriptor->cookie == 20

> > > >

> > > > Then dma_async_is_complete() returns DMA_IN_PROGRESS, which is correct.

> > > >

> > > > Thanks,

> > > > Ira

> > > >

> > > >

> > > > _______________________________________________

> > > > Linuxppc-dev mailing list

> > > > Linuxppc-dev@lists.ozlabs.org

> > > > https://lists.ozlabs.org/listinfo/linuxppc-dev

> > 

>
Ira Snyder - Dec. 2, 2011, 5:13 p.m.
On Fri, Dec 02, 2011 at 03:47:27AM +0000, Shi Xuelin-B29237 wrote:
> Hi Iris,
> 
> >I'm convinced that "smp_rmb()" is needed when removing the spinlock. As noted, Documentation/memory-barriers.txt says that stores on one CPU can be
> >observed by another CPU in a different order.
> >Previously, there was an UNLOCK (in fsl_dma_tx_submit) followed by a LOCK (in fsl_tx_status). This provided a "full barrier", forcing the operations to 
> >complete correctly when viewed by the second CPU. 
> 
> I do not agree this smp_rmb() works here. Because when this smp_rmb() executed and begin to read chan->common.cookie, you still cannot avoid the order issue. Something like one is reading old value, but another CPU is updating the new value. 
> 
> My point is here the order is not important for the DMA decision.
> Completed DMA tx is decided as not complete is not a big deal, because next time it will be OK.
> 
> I believe there is no case that could cause uncompleted DMA tx is decided as completed, because the fsl_tx_status is called after fsl_dma_tx_submit for a specific cookie. If you can give me an example here, I will agree with you.
> 

According to memory-barriers.txt, writes to main memory may be observed in
any order if memory barriers are not used. This means that writes can
appear to happen in a different order than they were issued by the CPU.

Citing from the text:

> There are certain things that the Linux kernel memory barriers do not guarantee:
>
>  (*) There is no guarantee that any of the memory accesses specified before a
>      memory barrier will be _complete_ by the completion of a memory barrier
>      instruction; the barrier can be considered to draw a line in that CPU's
>      access queue that accesses of the appropriate type may not cross.

Also:

> Without intervention, CPU 2 may perceive the events on CPU 1 in some
> effectively random order, despite the write barrier issued by CPU 1:

Also:

> When dealing with CPU-CPU interactions, certain types of memory barrier should
> always be paired.  A lack of appropriate pairing is almost certainly an error.
>
> A write barrier should always be paired with a data dependency barrier or read
> barrier, though a general barrier would also be viable.

Therefore, in an SMP system, the following situation can happen.

descriptor->cookie = 2
chan->common.cookie = 1
chan->completed_cookie = 1

This occurs when CPU-A calls fsl_dma_tx_submit() and then CPU-B calls
dma_async_is_complete() ***after*** CPU-B has observed the write to
descriptor->cookie, and ***before*** before CPU-B has observed the write to
chan->common.cookie.

Remember, without barriers, CPU-B can observe CPU-A's memory accesses in
*any possible order*. Memory accesses are not guaranteed to be *complete*
by the time fsl_dma_tx_submit() returns!

With the above values, dma_async_is_complete() returns DMA_COMPLETE. This
is incorrect: the DMA is still in progress. The required invariant
chan->common.cookie >= descriptor->cookie has not been met.

By adding an smp_rmb(), I force CPU-B to stall until *both* stores in
fsl_dma_tx_submit() (descriptor->cookie and chan->common.cookie) actually
hit main memory. This avoids the above situation: all CPU's observe
descriptor->cookie and chan->common.cookie to update in sync with each
other.

Is this unclear in any way?

Please run your test with the smp_rmb() and measure the performance
impact.

Ira
b29237@freescale.com - Dec. 5, 2011, 6:11 a.m.
Hi Iris,

>Remember, without barriers, CPU-B can observe CPU-A's memory accesses in *any possible order*. Memory accesses are not guaranteed to be *complete* by 
>the time fsl_dma_tx_submit() returns!

fsl_dma_tx_submit is enclosed by spin_lock_irqsave/spin_unlock_irqrestore, when this function returns, I believe the memory access are completed. spin_unlock_irqsave is an implicit memory barrier and guaranteed this.

Thanks,
Forrest


-----Original Message-----
From: Ira W. Snyder [mailto:iws@ovro.caltech.edu] 
Sent: 2011年12月3日 1:14
To: Shi Xuelin-B29237
Cc: vinod.koul@intel.com; dan.j.williams@intel.com; linuxppc-dev@lists.ozlabs.org; linux-kernel@vger.kernel.org; Li Yang-R58472
Subject: Re: [PATCH][RFC] fsldma: fix performance degradation by optimizing spinlock use.

On Fri, Dec 02, 2011 at 03:47:27AM +0000, Shi Xuelin-B29237 wrote:
> Hi Iris,
> 
> >I'm convinced that "smp_rmb()" is needed when removing the spinlock. 
> >As noted, Documentation/memory-barriers.txt says that stores on one CPU can be observed by another CPU in a different order.
> >Previously, there was an UNLOCK (in fsl_dma_tx_submit) followed by a 
> >LOCK (in fsl_tx_status). This provided a "full barrier", forcing the operations to complete correctly when viewed by the second CPU.
> 
> I do not agree this smp_rmb() works here. Because when this smp_rmb() executed and begin to read chan->common.cookie, you still cannot avoid the order issue. Something like one is reading old value, but another CPU is updating the new value. 
> 
> My point is here the order is not important for the DMA decision.
> Completed DMA tx is decided as not complete is not a big deal, because next time it will be OK.
> 
> I believe there is no case that could cause uncompleted DMA tx is decided as completed, because the fsl_tx_status is called after fsl_dma_tx_submit for a specific cookie. If you can give me an example here, I will agree with you.
> 

According to memory-barriers.txt, writes to main memory may be observed in any order if memory barriers are not used. This means that writes can appear to happen in a different order than they were issued by the CPU.

Citing from the text:

> There are certain things that the Linux kernel memory barriers do not guarantee:
>
>  (*) There is no guarantee that any of the memory accesses specified before a
>      memory barrier will be _complete_ by the completion of a memory barrier
>      instruction; the barrier can be considered to draw a line in that CPU's
>      access queue that accesses of the appropriate type may not cross.

Also:

> Without intervention, CPU 2 may perceive the events on CPU 1 in some 
> effectively random order, despite the write barrier issued by CPU 1:

Also:

> When dealing with CPU-CPU interactions, certain types of memory 
> barrier should always be paired.  A lack of appropriate pairing is almost certainly an error.
>
> A write barrier should always be paired with a data dependency barrier 
> or read barrier, though a general barrier would also be viable.

Therefore, in an SMP system, the following situation can happen.

descriptor->cookie = 2
chan->common.cookie = 1
chan->completed_cookie = 1

This occurs when CPU-A calls fsl_dma_tx_submit() and then CPU-B calls
dma_async_is_complete() ***after*** CPU-B has observed the write to
descriptor->cookie, and ***before*** before CPU-B has observed the write 
descriptor->to
chan->common.cookie.

Remember, without barriers, CPU-B can observe CPU-A's memory accesses in *any possible order*. Memory accesses are not guaranteed to be *complete* by the time fsl_dma_tx_submit() returns!

With the above values, dma_async_is_complete() returns DMA_COMPLETE. This is incorrect: the DMA is still in progress. The required invariant
chan->common.cookie >= descriptor->cookie has not been met.

By adding an smp_rmb(), I force CPU-B to stall until *both* stores in
fsl_dma_tx_submit() (descriptor->cookie and chan->common.cookie) actually hit main memory. This avoids the above situation: all CPU's observe
descriptor->cookie and chan->common.cookie to update in sync with each
other.

Is this unclear in any way?

Please run your test with the smp_rmb() and measure the performance impact.

Ira

Patch

diff --git a/drivers/dma/fsldma.c b/drivers/dma/fsldma.c
index 8a78154..1dca56f 100644
--- a/drivers/dma/fsldma.c
+++ b/drivers/dma/fsldma.c
@@ -986,15 +986,10 @@  static enum dma_status fsl_tx_status(struct dma_chan *dchan,
 	struct fsldma_chan *chan = to_fsl_chan(dchan);
 	dma_cookie_t last_complete;
 	dma_cookie_t last_used;
-	unsigned long flags;
-
-	spin_lock_irqsave(&chan->desc_lock, flags);
 
 	last_complete = chan->completed_cookie;
 	last_used = dchan->cookie;
 
-	spin_unlock_irqrestore(&chan->desc_lock, flags);
-
 	dma_set_tx_state(txstate, last_complete, last_used, 0);
 	return dma_async_is_complete(cookie, last_complete, last_used);
 }