Message ID | 1321937705-19587-1-git-send-email-b29237@freescale.com (mailing list archive) |
---|---|
State | RFC |
Headers | show |
On Tue, Nov 22, 2011 at 12:55:05PM +0800, b29237@freescale.com wrote: > From: Forrest Shi <b29237@freescale.com> > > dma status check function fsl_tx_status is heavily called in > a tight loop and the desc lock in fsl_tx_status contended by > the dma status update function. this caused the dma performance > degrades much. > > this patch releases the lock in the fsl_tx_status function. > I believe it has no neglect impact on the following call of > dma_async_is_complete(...). > > we can see below three conditions will be identified as success > a) x < complete < use > b) x < complete+N < use+N > c) x < complete < use+N > here complete is the completed_cookie, use is the last_used > cookie, x is the querying cookie, N is MAX cookie > > when chan->completed_cookie is being read, the last_used may > be incresed. Anyway it has no neglect impact on the dma status > decision. > > Signed-off-by: Forrest Shi <xuelin.shi@freescale.com> > --- > drivers/dma/fsldma.c | 5 ----- > 1 files changed, 0 insertions(+), 5 deletions(-) > > diff --git a/drivers/dma/fsldma.c b/drivers/dma/fsldma.c > index 8a78154..1dca56f 100644 > --- a/drivers/dma/fsldma.c > +++ b/drivers/dma/fsldma.c > @@ -986,15 +986,10 @@ static enum dma_status fsl_tx_status(struct dma_chan *dchan, > struct fsldma_chan *chan = to_fsl_chan(dchan); > dma_cookie_t last_complete; > dma_cookie_t last_used; > - unsigned long flags; > - > - spin_lock_irqsave(&chan->desc_lock, flags); > This will cause a bug. See below for a detailed explanation. You need this instead: /* * On an SMP system, we must ensure that this CPU has seen the * memory accesses performed by another CPU under the * chan->desc_lock spinlock. */ smp_mb(); > last_complete = chan->completed_cookie; > last_used = dchan->cookie; > > - spin_unlock_irqrestore(&chan->desc_lock, flags); > - > dma_set_tx_state(txstate, last_complete, last_used, 0); > return dma_async_is_complete(cookie, last_complete, last_used); > } Facts: - dchan->cookie is the same member as chan->common.cookie (same memory location) - chan->common.cookie is the "last allocated cookie for a pending transaction" - chan->completed_cookie is the "last completed transaction" I have replaced "dchan->cookie" with "chan->common.cookie" in the below explanation, to keep everything referenced from the same structure. Variable usage before your change. Everything is used locked. - RW chan->common.cookie (fsl_dma_tx_submit) - R chan->common.cookie (fsl_tx_status) - R chan->completed_cookie (fsl_tx_status) - W chan->completed_cookie (dma_do_tasklet) Variable usage after your change: - RW chan->common.cookie LOCKED - R chan->common.cookie NO LOCK - R chan->completed_cookie NO LOCK - W chan->completed_cookie LOCKED What if we assume that you have a 2 CPU system (such as a P2020). After your changes, one possible sequence is: === CPU1 - allocate + submit descriptor: fsl_dma_tx_submit() === spin_lock_irqsave descriptor->cookie = 20 (x in your example) chan->common.cookie = 20 (used in your example) spin_unlock_irqrestore === CPU2 - immediately calls fsl_tx_status() === chan->common.cookie == 19 chan->completed_cookie == 19 descriptor->cookie == 20 Since we don't have locks anymore, CPU2 may not have seen the write to chan->common.cookie yet. Also assume that the DMA hardware has not started processing the transaction yet. Therefore dma_do_tasklet() has not been called, and chan->completed_cookie has not been updated. In this case, dma_async_is_complete() (on CPU2) returns DMA_SUCCESS, even though the DMA operation has not succeeded. The DMA operation has not even started yet! The smp_mb() fixes this, since it forces CPU2 to have seen all memory operations that happened before CPU1 released the spinlock. Spinlocks are implicit SMP memory barriers. Therefore, the above example becomes: smp_mb(); chan->common.cookie == 20 chan->completed_cookie == 19 descriptor->cookie == 20 Then dma_async_is_complete() returns DMA_IN_PROGRESS, which is correct. Thanks, Ira
Hi Ira, Thanks for your review. After second thought, I think your scenario may not occur. Because the cookie 20 we query must be returned by fsl_dma_tx_submit(...) in practice. We never query a cookie not returned by fsl_dma_tx_submit(...). When we call fsl_tx_status(20), the chan->common.cookie is definitely wrote as 20 and cpu2 could not read as 19. How do you think? Happy Thanks Giving day, Forrest -----Original Message----- From: Ira W. Snyder [mailto:iws@ovro.caltech.edu] Sent: 2011年11月23日 2:59 To: Shi Xuelin-B29237 Cc: dan.j.williams@intel.com; Li Yang-R58472; zw@zh-kernel.org; vinod.koul@intel.com; linuxppc-dev@lists.ozlabs.org; linux-kernel@vger.kernel.org Subject: Re: [PATCH][RFC] fsldma: fix performance degradation by optimizing spinlock use. On Tue, Nov 22, 2011 at 12:55:05PM +0800, b29237@freescale.com wrote: > From: Forrest Shi <b29237@freescale.com> > > dma status check function fsl_tx_status is heavily called in > a tight loop and the desc lock in fsl_tx_status contended by > the dma status update function. this caused the dma performance > degrades much. > > this patch releases the lock in the fsl_tx_status function. > I believe it has no neglect impact on the following call of > dma_async_is_complete(...). > > we can see below three conditions will be identified as success > a) x < complete < use > b) x < complete+N < use+N > c) x < complete < use+N > here complete is the completed_cookie, use is the last_used > cookie, x is the querying cookie, N is MAX cookie > > when chan->completed_cookie is being read, the last_used may > be incresed. Anyway it has no neglect impact on the dma status > decision. > > Signed-off-by: Forrest Shi <xuelin.shi@freescale.com> > --- > drivers/dma/fsldma.c | 5 ----- > 1 files changed, 0 insertions(+), 5 deletions(-) > > diff --git a/drivers/dma/fsldma.c b/drivers/dma/fsldma.c index > 8a78154..1dca56f 100644 > --- a/drivers/dma/fsldma.c > +++ b/drivers/dma/fsldma.c > @@ -986,15 +986,10 @@ static enum dma_status fsl_tx_status(struct dma_chan *dchan, > struct fsldma_chan *chan = to_fsl_chan(dchan); > dma_cookie_t last_complete; > dma_cookie_t last_used; > - unsigned long flags; > - > - spin_lock_irqsave(&chan->desc_lock, flags); > This will cause a bug. See below for a detailed explanation. You need this instead: /* * On an SMP system, we must ensure that this CPU has seen the * memory accesses performed by another CPU under the * chan->desc_lock spinlock. */ smp_mb(); > last_complete = chan->completed_cookie; > last_used = dchan->cookie; > > - spin_unlock_irqrestore(&chan->desc_lock, flags); > - > dma_set_tx_state(txstate, last_complete, last_used, 0); > return dma_async_is_complete(cookie, last_complete, last_used); } Facts: - dchan->cookie is the same member as chan->common.cookie (same memory location) - chan->common.cookie is the "last allocated cookie for a pending transaction" - chan->completed_cookie is the "last completed transaction" I have replaced "dchan->cookie" with "chan->common.cookie" in the below explanation, to keep everything referenced from the same structure. Variable usage before your change. Everything is used locked. - RW chan->common.cookie (fsl_dma_tx_submit) - R chan->common.cookie (fsl_tx_status) - R chan->completed_cookie (fsl_tx_status) - W chan->completed_cookie (dma_do_tasklet) Variable usage after your change: - RW chan->common.cookie LOCKED - R chan->common.cookie NO LOCK - R chan->completed_cookie NO LOCK - W chan->completed_cookie LOCKED What if we assume that you have a 2 CPU system (such as a P2020). After your changes, one possible sequence is: === CPU1 - allocate + submit descriptor: fsl_dma_tx_submit() === spin_lock_irqsave descriptor->cookie = 20 (x in your example) chan->common.cookie = 20 (used in your example) spin_unlock_irqrestore === CPU2 - immediately calls fsl_tx_status() === chan->common.cookie == 19 chan->completed_cookie == 19 descriptor->cookie == 20 Since we don't have locks anymore, CPU2 may not have seen the write to chan->common.cookie yet. Also assume that the DMA hardware has not started processing the transaction yet. Therefore dma_do_tasklet() has not been called, and chan->completed_cookie has not been updated. In this case, dma_async_is_complete() (on CPU2) returns DMA_SUCCESS, even though the DMA operation has not succeeded. The DMA operation has not even started yet! The smp_mb() fixes this, since it forces CPU2 to have seen all memory operations that happened before CPU1 released the spinlock. Spinlocks are implicit SMP memory barriers. Therefore, the above example becomes: smp_mb(); chan->common.cookie == 20 chan->completed_cookie == 19 descriptor->cookie == 20 Then dma_async_is_complete() returns DMA_IN_PROGRESS, which is correct. Thanks, Ira
On Thu, Nov 24, 2011 at 08:12:25AM +0000, Shi Xuelin-B29237 wrote: > Hi Ira, > > Thanks for your review. > > After second thought, I think your scenario may not occur. > Because the cookie 20 we query must be returned by fsl_dma_tx_submit(...) in practice. > We never query a cookie not returned by fsl_dma_tx_submit(...). > I agree about this part. > When we call fsl_tx_status(20), the chan->common.cookie is definitely wrote as 20 and cpu2 could not read as 19. > This is what I don't agree about. However, I'm not an expert on CPU cache vs. memory accesses in an multi-processor system. The section titled "CACHE COHERENCY" in Documentation/memory-barriers.txt leads me to believe that the scenario I described is possible. What happens if CPU1's write of chan->common.cookie only goes into CPU1's cache. It never makes it to main memory before CPU2 fetches the old value of 19. I don't think you should see any performance impact from the smp_mb() operation. Thanks, Ira > -----Original Message----- > From: Ira W. Snyder [mailto:iws@ovro.caltech.edu] > Sent: 2011年11月23日 2:59 > To: Shi Xuelin-B29237 > Cc: dan.j.williams@intel.com; Li Yang-R58472; zw@zh-kernel.org; vinod.koul@intel.com; linuxppc-dev@lists.ozlabs.org; linux-kernel@vger.kernel.org > Subject: Re: [PATCH][RFC] fsldma: fix performance degradation by optimizing spinlock use. > > On Tue, Nov 22, 2011 at 12:55:05PM +0800, b29237@freescale.com wrote: > > From: Forrest Shi <b29237@freescale.com> > > > > dma status check function fsl_tx_status is heavily called in > > a tight loop and the desc lock in fsl_tx_status contended by > > the dma status update function. this caused the dma performance > > degrades much. > > > > this patch releases the lock in the fsl_tx_status function. > > I believe it has no neglect impact on the following call of > > dma_async_is_complete(...). > > > > we can see below three conditions will be identified as success > > a) x < complete < use > > b) x < complete+N < use+N > > c) x < complete < use+N > > here complete is the completed_cookie, use is the last_used > > cookie, x is the querying cookie, N is MAX cookie > > > > when chan->completed_cookie is being read, the last_used may > > be incresed. Anyway it has no neglect impact on the dma status > > decision. > > > > Signed-off-by: Forrest Shi <xuelin.shi@freescale.com> > > --- > > drivers/dma/fsldma.c | 5 ----- > > 1 files changed, 0 insertions(+), 5 deletions(-) > > > > diff --git a/drivers/dma/fsldma.c b/drivers/dma/fsldma.c index > > 8a78154..1dca56f 100644 > > --- a/drivers/dma/fsldma.c > > +++ b/drivers/dma/fsldma.c > > @@ -986,15 +986,10 @@ static enum dma_status fsl_tx_status(struct dma_chan *dchan, > > struct fsldma_chan *chan = to_fsl_chan(dchan); > > dma_cookie_t last_complete; > > dma_cookie_t last_used; > > - unsigned long flags; > > - > > - spin_lock_irqsave(&chan->desc_lock, flags); > > > > This will cause a bug. See below for a detailed explanation. You need this instead: > > /* > * On an SMP system, we must ensure that this CPU has seen the > * memory accesses performed by another CPU under the > * chan->desc_lock spinlock. > */ > smp_mb(); > > last_complete = chan->completed_cookie; > > last_used = dchan->cookie; > > > > - spin_unlock_irqrestore(&chan->desc_lock, flags); > > - > > dma_set_tx_state(txstate, last_complete, last_used, 0); > > return dma_async_is_complete(cookie, last_complete, last_used); } > > Facts: > - dchan->cookie is the same member as chan->common.cookie (same memory location) > - chan->common.cookie is the "last allocated cookie for a pending transaction" > - chan->completed_cookie is the "last completed transaction" > > I have replaced "dchan->cookie" with "chan->common.cookie" in the below explanation, to keep everything referenced from the same structure. > > Variable usage before your change. Everything is used locked. > - RW chan->common.cookie (fsl_dma_tx_submit) > - R chan->common.cookie (fsl_tx_status) > - R chan->completed_cookie (fsl_tx_status) > - W chan->completed_cookie (dma_do_tasklet) > > Variable usage after your change: > - RW chan->common.cookie LOCKED > - R chan->common.cookie NO LOCK > - R chan->completed_cookie NO LOCK > - W chan->completed_cookie LOCKED > > What if we assume that you have a 2 CPU system (such as a P2020). After your changes, one possible sequence is: > > === CPU1 - allocate + submit descriptor: fsl_dma_tx_submit() === spin_lock_irqsave > descriptor->cookie = 20 (x in your example) > chan->common.cookie = 20 (used in your example) > spin_unlock_irqrestore > > === CPU2 - immediately calls fsl_tx_status() === > chan->common.cookie == 19 > chan->completed_cookie == 19 > descriptor->cookie == 20 > > Since we don't have locks anymore, CPU2 may not have seen the write to > chan->common.cookie yet. > > Also assume that the DMA hardware has not started processing the transaction yet. Therefore dma_do_tasklet() has not been called, and > chan->completed_cookie has not been updated. > > In this case, dma_async_is_complete() (on CPU2) returns DMA_SUCCESS, even though the DMA operation has not succeeded. The DMA operation has not even started yet! > > The smp_mb() fixes this, since it forces CPU2 to have seen all memory operations that happened before CPU1 released the spinlock. Spinlocks are implicit SMP memory barriers. > > Therefore, the above example becomes: > smp_mb(); > chan->common.cookie == 20 > chan->completed_cookie == 19 > descriptor->cookie == 20 > > Then dma_async_is_complete() returns DMA_IN_PROGRESS, which is correct. > > Thanks, > Ira > > > _______________________________________________ > Linuxppc-dev mailing list > Linuxppc-dev@lists.ozlabs.org > https://lists.ozlabs.org/listinfo/linuxppc-dev
> Subject: Re: [PATCH][RFC] fsldma: fix performance degradation by optimizing > spinlock use. > > On Thu, Nov 24, 2011 at 08:12:25AM +0000, Shi Xuelin-B29237 wrote: > > Hi Ira, > > > > Thanks for your review. > > > > After second thought, I think your scenario may not occur. > > Because the cookie 20 we query must be returned by fsl_dma_tx_submit(...) in > practice. > > We never query a cookie not returned by fsl_dma_tx_submit(...). > > > > I agree about this part. > > > When we call fsl_tx_status(20), the chan->common.cookie is definitely wrote as > 20 and cpu2 could not read as 19. > > > > This is what I don't agree about. However, I'm not an expert on CPU cache vs. > memory accesses in an multi-processor system. The section titled "CACHE > COHERENCY" in Documentation/memory-barriers.txt leads me to believe that the > scenario I described is possible. For Freescale PowerPC, the chip automatically takes care of cache coherency. Even if this is a concern, spinlock can't address it. > > What happens if CPU1's write of chan->common.cookie only goes into CPU1's > cache. It never makes it to main memory before CPU2 fetches the old value of 19. > > I don't think you should see any performance impact from the smp_mb() > operation. Smp_mb() do have impact on performance if it's in the hot path. While it might be safer having it, I doubt it is really necessary. If the CPU1 doesn't have the updated last_used, it's shouldn't have known there is a cookie 20 existed either. - Leo > > Thanks, > Ira > > > -----Original Message----- > > From: Ira W. Snyder [mailto:iws@ovro.caltech.edu] > > Sent: 2011年11月23日 2:59 > > To: Shi Xuelin-B29237 > > Cc: dan.j.williams@intel.com; Li Yang-R58472; zw@zh-kernel.org; > > vinod.koul@intel.com; linuxppc-dev@lists.ozlabs.org; > > linux-kernel@vger.kernel.org > > Subject: Re: [PATCH][RFC] fsldma: fix performance degradation by optimizing > spinlock use. > > > > On Tue, Nov 22, 2011 at 12:55:05PM +0800, b29237@freescale.com wrote: > > > From: Forrest Shi <b29237@freescale.com> > > > > > > dma status check function fsl_tx_status is heavily called in > > > a tight loop and the desc lock in fsl_tx_status contended by > > > the dma status update function. this caused the dma performance > > > degrades much. > > > > > > this patch releases the lock in the fsl_tx_status function. > > > I believe it has no neglect impact on the following call of > > > dma_async_is_complete(...). > > > > > > we can see below three conditions will be identified as success > > > a) x < complete < use > > > b) x < complete+N < use+N > > > c) x < complete < use+N > > > here complete is the completed_cookie, use is the last_used > > > cookie, x is the querying cookie, N is MAX cookie > > > > > > when chan->completed_cookie is being read, the last_used may > > > be incresed. Anyway it has no neglect impact on the dma status > > > decision. > > > > > > Signed-off-by: Forrest Shi <xuelin.shi@freescale.com> > > > --- > > > drivers/dma/fsldma.c | 5 ----- > > > 1 files changed, 0 insertions(+), 5 deletions(-) > > > > > > diff --git a/drivers/dma/fsldma.c b/drivers/dma/fsldma.c index > > > 8a78154..1dca56f 100644 > > > --- a/drivers/dma/fsldma.c > > > +++ b/drivers/dma/fsldma.c > > > @@ -986,15 +986,10 @@ static enum dma_status fsl_tx_status(struct > dma_chan *dchan, > > > struct fsldma_chan *chan = to_fsl_chan(dchan); > > > dma_cookie_t last_complete; > > > dma_cookie_t last_used; > > > - unsigned long flags; > > > - > > > - spin_lock_irqsave(&chan->desc_lock, flags); > > > > > > > This will cause a bug. See below for a detailed explanation. You need this instead: > > > > /* > > * On an SMP system, we must ensure that this CPU has seen the > > * memory accesses performed by another CPU under the > > * chan->desc_lock spinlock. > > */ > > smp_mb(); > > > last_complete = chan->completed_cookie; > > > last_used = dchan->cookie; > > > > > > - spin_unlock_irqrestore(&chan->desc_lock, flags); > > > - > > > dma_set_tx_state(txstate, last_complete, last_used, 0); > > > return dma_async_is_complete(cookie, last_complete, last_used); } > > > > Facts: > > - dchan->cookie is the same member as chan->common.cookie (same memory > > location) > > - chan->common.cookie is the "last allocated cookie for a pending transaction" > > - chan->completed_cookie is the "last completed transaction" > > > > I have replaced "dchan->cookie" with "chan->common.cookie" in the below > explanation, to keep everything referenced from the same structure. > > > > Variable usage before your change. Everything is used locked. > > - RW chan->common.cookie (fsl_dma_tx_submit) > > - R chan->common.cookie (fsl_tx_status) > > - R chan->completed_cookie (fsl_tx_status) > > - W chan->completed_cookie (dma_do_tasklet) > > > > Variable usage after your change: > > - RW chan->common.cookie LOCKED > > - R chan->common.cookie NO LOCK > > - R chan->completed_cookie NO LOCK > > - W chan->completed_cookie LOCKED > > > > What if we assume that you have a 2 CPU system (such as a P2020). After your > changes, one possible sequence is: > > > > === CPU1 - allocate + submit descriptor: fsl_dma_tx_submit() === > > spin_lock_irqsave > > descriptor->cookie = 20 (x in your example) > > chan->common.cookie = 20 (used in your example) > > spin_unlock_irqrestore > > > > === CPU2 - immediately calls fsl_tx_status() === > > chan->common.cookie == 19 > > chan->completed_cookie == 19 > > descriptor->cookie == 20 > > > > Since we don't have locks anymore, CPU2 may not have seen the write to > > chan->common.cookie yet. > > > > Also assume that the DMA hardware has not started processing the > > transaction yet. Therefore dma_do_tasklet() has not been called, and > > chan->completed_cookie has not been updated. > > > > In this case, dma_async_is_complete() (on CPU2) returns DMA_SUCCESS, even > though the DMA operation has not succeeded. The DMA operation has not even > started yet! > > > > The smp_mb() fixes this, since it forces CPU2 to have seen all memory operations > that happened before CPU1 released the spinlock. Spinlocks are implicit SMP > memory barriers. > > > > Therefore, the above example becomes: > > smp_mb(); > > chan->common.cookie == 20 > > chan->completed_cookie == 19 > > descriptor->cookie == 20 > > > > Then dma_async_is_complete() returns DMA_IN_PROGRESS, which is correct. > > > > Thanks, > > Ira > > > > > > _______________________________________________ > > Linuxppc-dev mailing list > > Linuxppc-dev@lists.ozlabs.org > > https://lists.ozlabs.org/listinfo/linuxppc-dev
Hi Ira, see my comments below. Thanks, Forrest -----Original Message----- From: Ira W. Snyder [mailto:iws@ovro.caltech.edu] Sent: 2011年11月29日 0:38 To: Shi Xuelin-B29237 Cc: vinod.koul@intel.com; dan.j.williams@intel.com; linuxppc-dev@lists.ozlabs.org; Li Yang-R58472; linux-kernel@vger.kernel.org Subject: Re: [PATCH][RFC] fsldma: fix performance degradation by optimizing spinlock use. On Thu, Nov 24, 2011 at 08:12:25AM +0000, Shi Xuelin-B29237 wrote: > Hi Ira, > > Thanks for your review. > > After second thought, I think your scenario may not occur. > Because the cookie 20 we query must be returned by fsl_dma_tx_submit(...) in practice. > We never query a cookie not returned by fsl_dma_tx_submit(...). > I agree about this part. > When we call fsl_tx_status(20), the chan->common.cookie is definitely wrote as 20 and cpu2 could not read as 19. > This is what I don't agree about. However, I'm not an expert on CPU cache vs. memory accesses in an multi-processor system. The section titled "CACHE COHERENCY" in Documentation/memory-barriers.txt leads me to believe that the scenario I described is possible. [Shi Xuelin-B29237] If query is used without rules, your scenario may happen. But in DMA usage here, the query is used something like sequentially. Only after chan->common.cookie is updated in fsl_dma_tx_submit(...) and returned, then it could be queried. So you scenario will not happen. What happens if CPU1's write of chan->common.cookie only goes into CPU1's cache. It never makes it to main memory before CPU2 fetches the old value of 19. [Shi Xuelin-B29237] As you see, chan->common.cookie is updated in fsl_dma_tx_submit(...) and enclosed by spinlock. The spinlock implementation in PPC will guarantee the cache coherence among cores, something like you called implicit smp_mb. I don't think you should see any performance impact from the smp_mb() operation. [Shi Xuelin-B29237] Only add smp_mb() doesn't work. It only sync on this step, but next step is the same as just getting into this function without smp_mb operation. Thanks, Ira > -----Original Message----- > From: Ira W. Snyder [mailto:iws@ovro.caltech.edu] > Sent: 2011年11月23日 2:59 > To: Shi Xuelin-B29237 > Cc: dan.j.williams@intel.com; Li Yang-R58472; zw@zh-kernel.org; > vinod.koul@intel.com; linuxppc-dev@lists.ozlabs.org; > linux-kernel@vger.kernel.org > Subject: Re: [PATCH][RFC] fsldma: fix performance degradation by optimizing spinlock use. > > On Tue, Nov 22, 2011 at 12:55:05PM +0800, b29237@freescale.com wrote: > > From: Forrest Shi <b29237@freescale.com> > > > > dma status check function fsl_tx_status is heavily called in > > a tight loop and the desc lock in fsl_tx_status contended by > > the dma status update function. this caused the dma performance > > degrades much. > > > > this patch releases the lock in the fsl_tx_status function. > > I believe it has no neglect impact on the following call of > > dma_async_is_complete(...). > > > > we can see below three conditions will be identified as success > > a) x < complete < use > > b) x < complete+N < use+N > > c) x < complete < use+N > > here complete is the completed_cookie, use is the last_used > > cookie, x is the querying cookie, N is MAX cookie > > > > when chan->completed_cookie is being read, the last_used may > > be incresed. Anyway it has no neglect impact on the dma status > > decision. > > > > Signed-off-by: Forrest Shi <xuelin.shi@freescale.com> > > --- > > drivers/dma/fsldma.c | 5 ----- > > 1 files changed, 0 insertions(+), 5 deletions(-) > > > > diff --git a/drivers/dma/fsldma.c b/drivers/dma/fsldma.c index > > 8a78154..1dca56f 100644 > > --- a/drivers/dma/fsldma.c > > +++ b/drivers/dma/fsldma.c > > @@ -986,15 +986,10 @@ static enum dma_status fsl_tx_status(struct dma_chan *dchan, > > struct fsldma_chan *chan = to_fsl_chan(dchan); > > dma_cookie_t last_complete; > > dma_cookie_t last_used; > > - unsigned long flags; > > - > > - spin_lock_irqsave(&chan->desc_lock, flags); > > > > This will cause a bug. See below for a detailed explanation. You need this instead: > > /* > * On an SMP system, we must ensure that this CPU has seen the > * memory accesses performed by another CPU under the > * chan->desc_lock spinlock. > */ > smp_mb(); > > last_complete = chan->completed_cookie; > > last_used = dchan->cookie; > > > > - spin_unlock_irqrestore(&chan->desc_lock, flags); > > - > > dma_set_tx_state(txstate, last_complete, last_used, 0); > > return dma_async_is_complete(cookie, last_complete, last_used); } > > Facts: > - dchan->cookie is the same member as chan->common.cookie (same memory > location) > - chan->common.cookie is the "last allocated cookie for a pending transaction" > - chan->completed_cookie is the "last completed transaction" > > I have replaced "dchan->cookie" with "chan->common.cookie" in the below explanation, to keep everything referenced from the same structure. > > Variable usage before your change. Everything is used locked. > - RW chan->common.cookie (fsl_dma_tx_submit) > - R chan->common.cookie (fsl_tx_status) > - R chan->completed_cookie (fsl_tx_status) > - W chan->completed_cookie (dma_do_tasklet) > > Variable usage after your change: > - RW chan->common.cookie LOCKED > - R chan->common.cookie NO LOCK > - R chan->completed_cookie NO LOCK > - W chan->completed_cookie LOCKED > > What if we assume that you have a 2 CPU system (such as a P2020). After your changes, one possible sequence is: > > === CPU1 - allocate + submit descriptor: fsl_dma_tx_submit() === > spin_lock_irqsave > descriptor->cookie = 20 (x in your example) > chan->common.cookie = 20 (used in your example) > spin_unlock_irqrestore > > === CPU2 - immediately calls fsl_tx_status() === > chan->common.cookie == 19 > chan->completed_cookie == 19 > descriptor->cookie == 20 > > Since we don't have locks anymore, CPU2 may not have seen the write to > chan->common.cookie yet. > > Also assume that the DMA hardware has not started processing the > transaction yet. Therefore dma_do_tasklet() has not been called, and > chan->completed_cookie has not been updated. > > In this case, dma_async_is_complete() (on CPU2) returns DMA_SUCCESS, even though the DMA operation has not succeeded. The DMA operation has not even started yet! > > The smp_mb() fixes this, since it forces CPU2 to have seen all memory operations that happened before CPU1 released the spinlock. Spinlocks are implicit SMP memory barriers. > > Therefore, the above example becomes: > smp_mb(); > chan->common.cookie == 20 > chan->completed_cookie == 19 > descriptor->cookie == 20 > > Then dma_async_is_complete() returns DMA_IN_PROGRESS, which is correct. > > Thanks, > Ira > > > _______________________________________________ > Linuxppc-dev mailing list > Linuxppc-dev@lists.ozlabs.org > https://lists.ozlabs.org/listinfo/linuxppc-dev
On Tue, Nov 29, 2011 at 03:19:05AM +0000, Li Yang-R58472 wrote: > > Subject: Re: [PATCH][RFC] fsldma: fix performance degradation by optimizing > > spinlock use. > > > > On Thu, Nov 24, 2011 at 08:12:25AM +0000, Shi Xuelin-B29237 wrote: > > > Hi Ira, > > > > > > Thanks for your review. > > > > > > After second thought, I think your scenario may not occur. > > > Because the cookie 20 we query must be returned by fsl_dma_tx_submit(...) in > > practice. > > > We never query a cookie not returned by fsl_dma_tx_submit(...). > > > > > > > I agree about this part. > > > > > When we call fsl_tx_status(20), the chan->common.cookie is definitely wrote as > > 20 and cpu2 could not read as 19. > > > > > > > This is what I don't agree about. However, I'm not an expert on CPU cache vs. > > memory accesses in an multi-processor system. The section titled "CACHE > > COHERENCY" in Documentation/memory-barriers.txt leads me to believe that the > > scenario I described is possible. > > For Freescale PowerPC, the chip automatically takes care of cache coherency. Even if this is a concern, spinlock can't address it. > > > > > What happens if CPU1's write of chan->common.cookie only goes into CPU1's > > cache. It never makes it to main memory before CPU2 fetches the old value of 19. > > > > I don't think you should see any performance impact from the smp_mb() > > operation. > > Smp_mb() do have impact on performance if it's in the hot path. While it might be safer having it, I doubt it is really necessary. If the CPU1 doesn't have the updated last_used, it's shouldn't have known there is a cookie 20 existed either. > I believe that you are correct, for powerpc. However, anything outside of arch/powerpc shouldn't assume it only runs on powerpc. I wouldn't be surprised to see fsldma running on an iMX someday (ARM processor). My interpretation says that the change introduces the possibility that fsl_tx_status() returns the wrong answer for an extremely small time window, on SMP only, based on Documentation/memory-barriers.txt. But I can't seem convince you. My real question is what code path is hitting this spinlock? Is it in mainline Linux? Why is it polling rather than using callbacks to determine DMA completion? Thanks, Ira > > > -----Original Message----- > > > From: Ira W. Snyder [mailto:iws@ovro.caltech.edu] > > > Sent: 2011年11月23日 2:59 > > > To: Shi Xuelin-B29237 > > > Cc: dan.j.williams@intel.com; Li Yang-R58472; zw@zh-kernel.org; > > > vinod.koul@intel.com; linuxppc-dev@lists.ozlabs.org; > > > linux-kernel@vger.kernel.org > > > Subject: Re: [PATCH][RFC] fsldma: fix performance degradation by optimizing > > spinlock use. > > > > > > On Tue, Nov 22, 2011 at 12:55:05PM +0800, b29237@freescale.com wrote: > > > > From: Forrest Shi <b29237@freescale.com> > > > > > > > > dma status check function fsl_tx_status is heavily called in > > > > a tight loop and the desc lock in fsl_tx_status contended by > > > > the dma status update function. this caused the dma performance > > > > degrades much. > > > > > > > > this patch releases the lock in the fsl_tx_status function. > > > > I believe it has no neglect impact on the following call of > > > > dma_async_is_complete(...). > > > > > > > > we can see below three conditions will be identified as success > > > > a) x < complete < use > > > > b) x < complete+N < use+N > > > > c) x < complete < use+N > > > > here complete is the completed_cookie, use is the last_used > > > > cookie, x is the querying cookie, N is MAX cookie > > > > > > > > when chan->completed_cookie is being read, the last_used may > > > > be incresed. Anyway it has no neglect impact on the dma status > > > > decision. > > > > > > > > Signed-off-by: Forrest Shi <xuelin.shi@freescale.com> > > > > --- > > > > drivers/dma/fsldma.c | 5 ----- > > > > 1 files changed, 0 insertions(+), 5 deletions(-) > > > > > > > > diff --git a/drivers/dma/fsldma.c b/drivers/dma/fsldma.c index > > > > 8a78154..1dca56f 100644 > > > > --- a/drivers/dma/fsldma.c > > > > +++ b/drivers/dma/fsldma.c > > > > @@ -986,15 +986,10 @@ static enum dma_status fsl_tx_status(struct > > dma_chan *dchan, > > > > struct fsldma_chan *chan = to_fsl_chan(dchan); > > > > dma_cookie_t last_complete; > > > > dma_cookie_t last_used; > > > > - unsigned long flags; > > > > - > > > > - spin_lock_irqsave(&chan->desc_lock, flags); > > > > > > > > > > This will cause a bug. See below for a detailed explanation. You need this instead: > > > > > > /* > > > * On an SMP system, we must ensure that this CPU has seen the > > > * memory accesses performed by another CPU under the > > > * chan->desc_lock spinlock. > > > */ > > > smp_mb(); > > > > last_complete = chan->completed_cookie; > > > > last_used = dchan->cookie; > > > > > > > > - spin_unlock_irqrestore(&chan->desc_lock, flags); > > > > - > > > > dma_set_tx_state(txstate, last_complete, last_used, 0); > > > > return dma_async_is_complete(cookie, last_complete, last_used); } > > > > > > Facts: > > > - dchan->cookie is the same member as chan->common.cookie (same memory > > > location) > > > - chan->common.cookie is the "last allocated cookie for a pending transaction" > > > - chan->completed_cookie is the "last completed transaction" > > > > > > I have replaced "dchan->cookie" with "chan->common.cookie" in the below > > explanation, to keep everything referenced from the same structure. > > > > > > Variable usage before your change. Everything is used locked. > > > - RW chan->common.cookie (fsl_dma_tx_submit) > > > - R chan->common.cookie (fsl_tx_status) > > > - R chan->completed_cookie (fsl_tx_status) > > > - W chan->completed_cookie (dma_do_tasklet) > > > > > > Variable usage after your change: > > > - RW chan->common.cookie LOCKED > > > - R chan->common.cookie NO LOCK > > > - R chan->completed_cookie NO LOCK > > > - W chan->completed_cookie LOCKED > > > > > > What if we assume that you have a 2 CPU system (such as a P2020). After your > > changes, one possible sequence is: > > > > > > === CPU1 - allocate + submit descriptor: fsl_dma_tx_submit() === > > > spin_lock_irqsave > > > descriptor->cookie = 20 (x in your example) > > > chan->common.cookie = 20 (used in your example) > > > spin_unlock_irqrestore > > > > > > === CPU2 - immediately calls fsl_tx_status() === > > > chan->common.cookie == 19 > > > chan->completed_cookie == 19 > > > descriptor->cookie == 20 > > > > > > Since we don't have locks anymore, CPU2 may not have seen the write to > > > chan->common.cookie yet. > > > > > > Also assume that the DMA hardware has not started processing the > > > transaction yet. Therefore dma_do_tasklet() has not been called, and > > > chan->completed_cookie has not been updated. > > > > > > In this case, dma_async_is_complete() (on CPU2) returns DMA_SUCCESS, even > > though the DMA operation has not succeeded. The DMA operation has not even > > started yet! > > > > > > The smp_mb() fixes this, since it forces CPU2 to have seen all memory operations > > that happened before CPU1 released the spinlock. Spinlocks are implicit SMP > > memory barriers. > > > > > > Therefore, the above example becomes: > > > smp_mb(); > > > chan->common.cookie == 20 > > > chan->completed_cookie == 19 > > > descriptor->cookie == 20 > > > > > > Then dma_async_is_complete() returns DMA_IN_PROGRESS, which is correct. > > > > > > Thanks, > > > Ira > > > > > > > > > _______________________________________________ > > > Linuxppc-dev mailing list > > > Linuxppc-dev@lists.ozlabs.org > > > https://lists.ozlabs.org/listinfo/linuxppc-dev >
On 11/28/2011 09:19 PM, Li Yang-R58472 wrote: >> Subject: Re: [PATCH][RFC] fsldma: fix performance degradation by optimizing >> spinlock use. >> >> On Thu, Nov 24, 2011 at 08:12:25AM +0000, Shi Xuelin-B29237 wrote: >>> Hi Ira, >>> >>> Thanks for your review. >>> >>> After second thought, I think your scenario may not occur. >>> Because the cookie 20 we query must be returned by fsl_dma_tx_submit(...) in >> practice. >>> We never query a cookie not returned by fsl_dma_tx_submit(...). >>> >> >> I agree about this part. >> >>> When we call fsl_tx_status(20), the chan->common.cookie is definitely wrote as >> 20 and cpu2 could not read as 19. >>> >> >> This is what I don't agree about. However, I'm not an expert on CPU cache vs. >> memory accesses in an multi-processor system. The section titled "CACHE >> COHERENCY" in Documentation/memory-barriers.txt leads me to believe that the >> scenario I described is possible. > > For Freescale PowerPC, the chip automatically takes care of cache coherency. Even if this is a concern, spinlock can't address it. Cache coherency is not the same thing as ordering -- and spinlocks do address ordering, because there are memory barriers in the lock implementation. If you're relying on some non-universal ordering guarantee that all chips with this device make, it needs to be documented explicitly what you're assuming and why it's valid. -Scott
On Tue, Nov 29, 2011 at 11:25 AM, Ira W. Snyder <iws@ovro.caltech.edu> wrote: > I believe that you are correct, for powerpc. However, anything outside > of arch/powerpc shouldn't assume it only runs on powerpc. I wouldn't be > surprised to see fsldma running on an iMX someday (ARM processor). I certainly would. The i.MX chips have their own DMA controller. The DMA controller that fsldma.c works with is unique to Freescale's 83xx, 85xx, and QorIQ chips.
Hello Ira, In drivers/dma/dmaengine.c, we have below tight loop to check DMA completion in mainline Linux: do { status = dma_async_is_tx_complete(chan, cookie, NULL, NULL); if (time_after_eq(jiffies, dma_sync_wait_timeout)) { printk(KERN_ERR "dma_sync_wait_timeout!\n"); return DMA_ERROR; } } while (status == DMA_IN_PROGRESS); Thanks, Forrest -----Original Message----- From: Ira W. Snyder [mailto:iws@ovro.caltech.edu] Sent: 2011年11月30日 1:26 To: Li Yang-R58472 Cc: Shi Xuelin-B29237; vinod.koul@intel.com; dan.j.williams@intel.com; linuxppc-dev@lists.ozlabs.org; linux-kernel@vger.kernel.org Subject: Re: [PATCH][RFC] fsldma: fix performance degradation by optimizing spinlock use. On Tue, Nov 29, 2011 at 03:19:05AM +0000, Li Yang-R58472 wrote: > > Subject: Re: [PATCH][RFC] fsldma: fix performance degradation by > > optimizing spinlock use. > > > > On Thu, Nov 24, 2011 at 08:12:25AM +0000, Shi Xuelin-B29237 wrote: > > > Hi Ira, > > > > > > Thanks for your review. > > > > > > After second thought, I think your scenario may not occur. > > > Because the cookie 20 we query must be returned by > > > fsl_dma_tx_submit(...) in > > practice. > > > We never query a cookie not returned by fsl_dma_tx_submit(...). > > > > > > > I agree about this part. > > > > > When we call fsl_tx_status(20), the chan->common.cookie is > > > definitely wrote as > > 20 and cpu2 could not read as 19. > > > > > > > This is what I don't agree about. However, I'm not an expert on CPU cache vs. > > memory accesses in an multi-processor system. The section titled > > "CACHE COHERENCY" in Documentation/memory-barriers.txt leads me to > > believe that the scenario I described is possible. > > For Freescale PowerPC, the chip automatically takes care of cache coherency. Even if this is a concern, spinlock can't address it. > > > > > What happens if CPU1's write of chan->common.cookie only goes into > > CPU1's cache. It never makes it to main memory before CPU2 fetches the old value of 19. > > > > I don't think you should see any performance impact from the > > smp_mb() operation. > > Smp_mb() do have impact on performance if it's in the hot path. While it might be safer having it, I doubt it is really necessary. If the CPU1 doesn't have the updated last_used, it's shouldn't have known there is a cookie 20 existed either. > I believe that you are correct, for powerpc. However, anything outside of arch/powerpc shouldn't assume it only runs on powerpc. I wouldn't be surprised to see fsldma running on an iMX someday (ARM processor). My interpretation says that the change introduces the possibility that fsl_tx_status() returns the wrong answer for an extremely small time window, on SMP only, based on Documentation/memory-barriers.txt. But I can't seem convince you. My real question is what code path is hitting this spinlock? Is it in mainline Linux? Why is it polling rather than using callbacks to determine DMA completion? Thanks, Ira > > > -----Original Message----- > > > From: Ira W. Snyder [mailto:iws@ovro.caltech.edu] > > > Sent: 2011年11月23日 2:59 > > > To: Shi Xuelin-B29237 > > > Cc: dan.j.williams@intel.com; Li Yang-R58472; zw@zh-kernel.org; > > > vinod.koul@intel.com; linuxppc-dev@lists.ozlabs.org; > > > linux-kernel@vger.kernel.org > > > Subject: Re: [PATCH][RFC] fsldma: fix performance degradation by > > > optimizing > > spinlock use. > > > > > > On Tue, Nov 22, 2011 at 12:55:05PM +0800, b29237@freescale.com wrote: > > > > From: Forrest Shi <b29237@freescale.com> > > > > > > > > dma status check function fsl_tx_status is heavily called in > > > > a tight loop and the desc lock in fsl_tx_status contended by > > > > the dma status update function. this caused the dma performance > > > > degrades much. > > > > > > > > this patch releases the lock in the fsl_tx_status function. > > > > I believe it has no neglect impact on the following call of > > > > dma_async_is_complete(...). > > > > > > > > we can see below three conditions will be identified as success > > > > a) x < complete < use > > > > b) x < complete+N < use+N > > > > c) x < complete < use+N > > > > here complete is the completed_cookie, use is the last_used > > > > cookie, x is the querying cookie, N is MAX cookie > > > > > > > > when chan->completed_cookie is being read, the last_used may > > > > be incresed. Anyway it has no neglect impact on the dma status > > > > decision. > > > > > > > > Signed-off-by: Forrest Shi <xuelin.shi@freescale.com> > > > > --- > > > > drivers/dma/fsldma.c | 5 ----- > > > > 1 files changed, 0 insertions(+), 5 deletions(-) > > > > > > > > diff --git a/drivers/dma/fsldma.c b/drivers/dma/fsldma.c index > > > > 8a78154..1dca56f 100644 > > > > --- a/drivers/dma/fsldma.c > > > > +++ b/drivers/dma/fsldma.c > > > > @@ -986,15 +986,10 @@ static enum dma_status > > > > fsl_tx_status(struct > > dma_chan *dchan, > > > > struct fsldma_chan *chan = to_fsl_chan(dchan); > > > > dma_cookie_t last_complete; > > > > dma_cookie_t last_used; > > > > - unsigned long flags; > > > > - > > > > - spin_lock_irqsave(&chan->desc_lock, flags); > > > > > > > > > > This will cause a bug. See below for a detailed explanation. You need this instead: > > > > > > /* > > > * On an SMP system, we must ensure that this CPU has seen the > > > * memory accesses performed by another CPU under the > > > * chan->desc_lock spinlock. > > > */ > > > smp_mb(); > > > > last_complete = chan->completed_cookie; > > > > last_used = dchan->cookie; > > > > > > > > - spin_unlock_irqrestore(&chan->desc_lock, flags); > > > > - > > > > dma_set_tx_state(txstate, last_complete, last_used, 0); > > > > return dma_async_is_complete(cookie, last_complete, > > > > last_used); } > > > > > > Facts: > > > - dchan->cookie is the same member as chan->common.cookie (same > > > memory > > > location) > > > - chan->common.cookie is the "last allocated cookie for a pending transaction" > > > - chan->completed_cookie is the "last completed transaction" > > > > > > I have replaced "dchan->cookie" with "chan->common.cookie" in the > > > below > > explanation, to keep everything referenced from the same structure. > > > > > > Variable usage before your change. Everything is used locked. > > > - RW chan->common.cookie (fsl_dma_tx_submit) > > > - R chan->common.cookie (fsl_tx_status) > > > - R chan->completed_cookie (fsl_tx_status) > > > - W chan->completed_cookie (dma_do_tasklet) > > > > > > Variable usage after your change: > > > - RW chan->common.cookie LOCKED > > > - R chan->common.cookie NO LOCK > > > - R chan->completed_cookie NO LOCK > > > - W chan->completed_cookie LOCKED > > > > > > What if we assume that you have a 2 CPU system (such as a P2020). > > > After your > > changes, one possible sequence is: > > > > > > === CPU1 - allocate + submit descriptor: fsl_dma_tx_submit() === > > > spin_lock_irqsave > > > descriptor->cookie = 20 (x in your example) > > > chan->common.cookie = 20 (used in your example) > > > spin_unlock_irqrestore > > > > > > === CPU2 - immediately calls fsl_tx_status() === > > > chan->common.cookie == 19 > > > chan->completed_cookie == 19 > > > descriptor->cookie == 20 > > > > > > Since we don't have locks anymore, CPU2 may not have seen the > > > write to > > > chan->common.cookie yet. > > > > > > Also assume that the DMA hardware has not started processing the > > > transaction yet. Therefore dma_do_tasklet() has not been called, > > > and > > > chan->completed_cookie has not been updated. > > > > > > In this case, dma_async_is_complete() (on CPU2) returns > > > DMA_SUCCESS, even > > though the DMA operation has not succeeded. The DMA operation has > > not even started yet! > > > > > > The smp_mb() fixes this, since it forces CPU2 to have seen all > > > memory operations > > that happened before CPU1 released the spinlock. Spinlocks are > > implicit SMP memory barriers. > > > > > > Therefore, the above example becomes: > > > smp_mb(); > > > chan->common.cookie == 20 > > > chan->completed_cookie == 19 > > > descriptor->cookie == 20 > > > > > > Then dma_async_is_complete() returns DMA_IN_PROGRESS, which is correct. > > > > > > Thanks, > > > Ira > > > > > > > > > _______________________________________________ > > > Linuxppc-dev mailing list > > > Linuxppc-dev@lists.ozlabs.org > > > https://lists.ozlabs.org/listinfo/linuxppc-dev >
On Wed, Nov 30, 2011 at 09:57:47AM +0000, Shi Xuelin-B29237 wrote: > Hello Ira, > > In drivers/dma/dmaengine.c, we have below tight loop to check DMA completion in mainline Linux: > do { > status = dma_async_is_tx_complete(chan, cookie, NULL, NULL); > if (time_after_eq(jiffies, dma_sync_wait_timeout)) { > printk(KERN_ERR "dma_sync_wait_timeout!\n"); > return DMA_ERROR; > } > } while (status == DMA_IN_PROGRESS); > That is the body of dma_sync_wait(). It is mostly used in the raid code. I understand that you don't want to change the raid code to use callbacks. In any case, I think we've strayed from the topic under consideration, which is: can we remove this spinlock without introducing a bug. I'm convinced that "smp_rmb()" is needed when removing the spinlock. As noted, Documentation/memory-barriers.txt says that stores on one CPU can be observed by another CPU in a different order. Previously, there was an UNLOCK (in fsl_dma_tx_submit) followed by a LOCK (in fsl_tx_status). This provided a "full barrier", forcing the operations to complete correctly when viewed by the second CPU. From the text: > Therefore, from (1), (2) and (4) an UNLOCK followed by an unconditional LOCK is > equivalent to a full barrier, but a LOCK followed by an UNLOCK is not. Also, please read "EXAMPLES OF MEMORY BARRIER SEQUENCES" and "INTER-CPU LOCKING BARRIER EFFECTS". Particularly, in "EXAMPLES OF MEMORY BARRIER SEQUENCES", the text notes: > Without intervention, CPU 2 may perceive the events on CPU 1 in some > effectively random order, despite the write barrier issued by CPU 1: > > [snip diagram] > > And thirdly, a read barrier acts as a partial order on loads. Consider the > following sequence of events: > > [snip diagram] > > Without intervention, CPU 2 may then choose to perceive the events on CPU 1 in > some effectively random order, despite the write barrier issued by CPU 1: > > [snip diagram] > And so on. Please read this entire section in the document. I can't give you an ACK on the proposed patch. To the best of my understanding, I believe it introduces a bug. I've tried to provide as much evidence for this belief as I can, in the form of documentation in the kernel source tree. If you can cite some documentation that shows I am wrong, I will happily change my mind! Ira > -----Original Message----- > From: Ira W. Snyder [mailto:iws@ovro.caltech.edu] > Sent: 2011年11月30日 1:26 > To: Li Yang-R58472 > Cc: Shi Xuelin-B29237; vinod.koul@intel.com; dan.j.williams@intel.com; linuxppc-dev@lists.ozlabs.org; linux-kernel@vger.kernel.org > Subject: Re: [PATCH][RFC] fsldma: fix performance degradation by optimizing spinlock use. > > On Tue, Nov 29, 2011 at 03:19:05AM +0000, Li Yang-R58472 wrote: > > > Subject: Re: [PATCH][RFC] fsldma: fix performance degradation by > > > optimizing spinlock use. > > > > > > On Thu, Nov 24, 2011 at 08:12:25AM +0000, Shi Xuelin-B29237 wrote: > > > > Hi Ira, > > > > > > > > Thanks for your review. > > > > > > > > After second thought, I think your scenario may not occur. > > > > Because the cookie 20 we query must be returned by > > > > fsl_dma_tx_submit(...) in > > > practice. > > > > We never query a cookie not returned by fsl_dma_tx_submit(...). > > > > > > > > > > I agree about this part. > > > > > > > When we call fsl_tx_status(20), the chan->common.cookie is > > > > definitely wrote as > > > 20 and cpu2 could not read as 19. > > > > > > > > > > This is what I don't agree about. However, I'm not an expert on CPU cache vs. > > > memory accesses in an multi-processor system. The section titled > > > "CACHE COHERENCY" in Documentation/memory-barriers.txt leads me to > > > believe that the scenario I described is possible. > > > > For Freescale PowerPC, the chip automatically takes care of cache coherency. Even if this is a concern, spinlock can't address it. > > > > > > > > What happens if CPU1's write of chan->common.cookie only goes into > > > CPU1's cache. It never makes it to main memory before CPU2 fetches the old value of 19. > > > > > > I don't think you should see any performance impact from the > > > smp_mb() operation. > > > > Smp_mb() do have impact on performance if it's in the hot path. While it might be safer having it, I doubt it is really necessary. If the CPU1 doesn't have the updated last_used, it's shouldn't have known there is a cookie 20 existed either. > > > > I believe that you are correct, for powerpc. However, anything outside of arch/powerpc shouldn't assume it only runs on powerpc. I wouldn't be surprised to see fsldma running on an iMX someday (ARM processor). > > My interpretation says that the change introduces the possibility that > fsl_tx_status() returns the wrong answer for an extremely small time window, on SMP only, based on Documentation/memory-barriers.txt. But I can't seem convince you. > > My real question is what code path is hitting this spinlock? Is it in mainline Linux? Why is it polling rather than using callbacks to determine DMA completion? > > Thanks, > Ira > > > > > -----Original Message----- > > > > From: Ira W. Snyder [mailto:iws@ovro.caltech.edu] > > > > Sent: 2011年11月23日 2:59 > > > > To: Shi Xuelin-B29237 > > > > Cc: dan.j.williams@intel.com; Li Yang-R58472; zw@zh-kernel.org; > > > > vinod.koul@intel.com; linuxppc-dev@lists.ozlabs.org; > > > > linux-kernel@vger.kernel.org > > > > Subject: Re: [PATCH][RFC] fsldma: fix performance degradation by > > > > optimizing > > > spinlock use. > > > > > > > > On Tue, Nov 22, 2011 at 12:55:05PM +0800, b29237@freescale.com wrote: > > > > > From: Forrest Shi <b29237@freescale.com> > > > > > > > > > > dma status check function fsl_tx_status is heavily called in > > > > > a tight loop and the desc lock in fsl_tx_status contended by > > > > > the dma status update function. this caused the dma performance > > > > > degrades much. > > > > > > > > > > this patch releases the lock in the fsl_tx_status function. > > > > > I believe it has no neglect impact on the following call of > > > > > dma_async_is_complete(...). > > > > > > > > > > we can see below three conditions will be identified as success > > > > > a) x < complete < use > > > > > b) x < complete+N < use+N > > > > > c) x < complete < use+N > > > > > here complete is the completed_cookie, use is the last_used > > > > > cookie, x is the querying cookie, N is MAX cookie > > > > > > > > > > when chan->completed_cookie is being read, the last_used may > > > > > be incresed. Anyway it has no neglect impact on the dma status > > > > > decision. > > > > > > > > > > Signed-off-by: Forrest Shi <xuelin.shi@freescale.com> > > > > > --- > > > > > drivers/dma/fsldma.c | 5 ----- > > > > > 1 files changed, 0 insertions(+), 5 deletions(-) > > > > > > > > > > diff --git a/drivers/dma/fsldma.c b/drivers/dma/fsldma.c index > > > > > 8a78154..1dca56f 100644 > > > > > --- a/drivers/dma/fsldma.c > > > > > +++ b/drivers/dma/fsldma.c > > > > > @@ -986,15 +986,10 @@ static enum dma_status > > > > > fsl_tx_status(struct > > > dma_chan *dchan, > > > > > struct fsldma_chan *chan = to_fsl_chan(dchan); > > > > > dma_cookie_t last_complete; > > > > > dma_cookie_t last_used; > > > > > - unsigned long flags; > > > > > - > > > > > - spin_lock_irqsave(&chan->desc_lock, flags); > > > > > > > > > > > > > This will cause a bug. See below for a detailed explanation. You need this instead: > > > > > > > > /* > > > > * On an SMP system, we must ensure that this CPU has seen the > > > > * memory accesses performed by another CPU under the > > > > * chan->desc_lock spinlock. > > > > */ > > > > smp_mb(); > > > > > last_complete = chan->completed_cookie; > > > > > last_used = dchan->cookie; > > > > > > > > > > - spin_unlock_irqrestore(&chan->desc_lock, flags); > > > > > - > > > > > dma_set_tx_state(txstate, last_complete, last_used, 0); > > > > > return dma_async_is_complete(cookie, last_complete, > > > > > last_used); } > > > > > > > > Facts: > > > > - dchan->cookie is the same member as chan->common.cookie (same > > > > memory > > > > location) > > > > - chan->common.cookie is the "last allocated cookie for a pending transaction" > > > > - chan->completed_cookie is the "last completed transaction" > > > > > > > > I have replaced "dchan->cookie" with "chan->common.cookie" in the > > > > below > > > explanation, to keep everything referenced from the same structure. > > > > > > > > Variable usage before your change. Everything is used locked. > > > > - RW chan->common.cookie (fsl_dma_tx_submit) > > > > - R chan->common.cookie (fsl_tx_status) > > > > - R chan->completed_cookie (fsl_tx_status) > > > > - W chan->completed_cookie (dma_do_tasklet) > > > > > > > > Variable usage after your change: > > > > - RW chan->common.cookie LOCKED > > > > - R chan->common.cookie NO LOCK > > > > - R chan->completed_cookie NO LOCK > > > > - W chan->completed_cookie LOCKED > > > > > > > > What if we assume that you have a 2 CPU system (such as a P2020). > > > > After your > > > changes, one possible sequence is: > > > > > > > > === CPU1 - allocate + submit descriptor: fsl_dma_tx_submit() === > > > > spin_lock_irqsave > > > > descriptor->cookie = 20 (x in your example) > > > > chan->common.cookie = 20 (used in your example) > > > > spin_unlock_irqrestore > > > > > > > > === CPU2 - immediately calls fsl_tx_status() === > > > > chan->common.cookie == 19 > > > > chan->completed_cookie == 19 > > > > descriptor->cookie == 20 > > > > > > > > Since we don't have locks anymore, CPU2 may not have seen the > > > > write to > > > > chan->common.cookie yet. > > > > > > > > Also assume that the DMA hardware has not started processing the > > > > transaction yet. Therefore dma_do_tasklet() has not been called, > > > > and > > > > chan->completed_cookie has not been updated. > > > > > > > > In this case, dma_async_is_complete() (on CPU2) returns > > > > DMA_SUCCESS, even > > > though the DMA operation has not succeeded. The DMA operation has > > > not even started yet! > > > > > > > > The smp_mb() fixes this, since it forces CPU2 to have seen all > > > > memory operations > > > that happened before CPU1 released the spinlock. Spinlocks are > > > implicit SMP memory barriers. > > > > > > > > Therefore, the above example becomes: > > > > smp_mb(); > > > > chan->common.cookie == 20 > > > > chan->completed_cookie == 19 > > > > descriptor->cookie == 20 > > > > > > > > Then dma_async_is_complete() returns DMA_IN_PROGRESS, which is correct. > > > > > > > > Thanks, > > > > Ira > > > > > > > > > > > > _______________________________________________ > > > > Linuxppc-dev mailing list > > > > Linuxppc-dev@lists.ozlabs.org > > > > https://lists.ozlabs.org/listinfo/linuxppc-dev > > >
Hi Iris, >I'm convinced that "smp_rmb()" is needed when removing the spinlock. As noted, Documentation/memory-barriers.txt says that stores on one CPU can be >observed by another CPU in a different order. >Previously, there was an UNLOCK (in fsl_dma_tx_submit) followed by a LOCK (in fsl_tx_status). This provided a "full barrier", forcing the operations to >complete correctly when viewed by the second CPU. I do not agree this smp_rmb() works here. Because when this smp_rmb() executed and begin to read chan->common.cookie, you still cannot avoid the order issue. Something like one is reading old value, but another CPU is updating the new value. My point is here the order is not important for the DMA decision. Completed DMA tx is decided as not complete is not a big deal, because next time it will be OK. I believe there is no case that could cause uncompleted DMA tx is decided as completed, because the fsl_tx_status is called after fsl_dma_tx_submit for a specific cookie. If you can give me an example here, I will agree with you. Thanks, Forrest -----Original Message----- From: Ira W. Snyder [mailto:iws@ovro.caltech.edu] Sent: 2011年12月1日 1:08 To: Shi Xuelin-B29237 Cc: vinod.koul@intel.com; dan.j.williams@intel.com; linuxppc-dev@lists.ozlabs.org; linux-kernel@vger.kernel.org; Li Yang-R58472 Subject: Re: [PATCH][RFC] fsldma: fix performance degradation by optimizing spinlock use. On Wed, Nov 30, 2011 at 09:57:47AM +0000, Shi Xuelin-B29237 wrote: > Hello Ira, > > In drivers/dma/dmaengine.c, we have below tight loop to check DMA completion in mainline Linux: > do { > status = dma_async_is_tx_complete(chan, cookie, NULL, NULL); > if (time_after_eq(jiffies, dma_sync_wait_timeout)) { > printk(KERN_ERR "dma_sync_wait_timeout!\n"); > return DMA_ERROR; > } > } while (status == DMA_IN_PROGRESS); > That is the body of dma_sync_wait(). It is mostly used in the raid code. I understand that you don't want to change the raid code to use callbacks. In any case, I think we've strayed from the topic under consideration, which is: can we remove this spinlock without introducing a bug. I'm convinced that "smp_rmb()" is needed when removing the spinlock. As noted, Documentation/memory-barriers.txt says that stores on one CPU can be observed by another CPU in a different order. Previously, there was an UNLOCK (in fsl_dma_tx_submit) followed by a LOCK (in fsl_tx_status). This provided a "full barrier", forcing the operations to complete correctly when viewed by the second CPU. From the text: > Therefore, from (1), (2) and (4) an UNLOCK followed by an > unconditional LOCK is equivalent to a full barrier, but a LOCK followed by an UNLOCK is not. Also, please read "EXAMPLES OF MEMORY BARRIER SEQUENCES" and "INTER-CPU LOCKING BARRIER EFFECTS". Particularly, in "EXAMPLES OF MEMORY BARRIER SEQUENCES", the text notes: > Without intervention, CPU 2 may perceive the events on CPU 1 in some > effectively random order, despite the write barrier issued by CPU 1: > > [snip diagram] > > And thirdly, a read barrier acts as a partial order on loads. Consider > the following sequence of events: > > [snip diagram] > > Without intervention, CPU 2 may then choose to perceive the events on > CPU 1 in some effectively random order, despite the write barrier issued by CPU 1: > > [snip diagram] > And so on. Please read this entire section in the document. I can't give you an ACK on the proposed patch. To the best of my understanding, I believe it introduces a bug. I've tried to provide as much evidence for this belief as I can, in the form of documentation in the kernel source tree. If you can cite some documentation that shows I am wrong, I will happily change my mind! Ira > -----Original Message----- > From: Ira W. Snyder [mailto:iws@ovro.caltech.edu] > Sent: 2011年11月30日 1:26 > To: Li Yang-R58472 > Cc: Shi Xuelin-B29237; vinod.koul@intel.com; dan.j.williams@intel.com; > linuxppc-dev@lists.ozlabs.org; linux-kernel@vger.kernel.org > Subject: Re: [PATCH][RFC] fsldma: fix performance degradation by optimizing spinlock use. > > On Tue, Nov 29, 2011 at 03:19:05AM +0000, Li Yang-R58472 wrote: > > > Subject: Re: [PATCH][RFC] fsldma: fix performance degradation by > > > optimizing spinlock use. > > > > > > On Thu, Nov 24, 2011 at 08:12:25AM +0000, Shi Xuelin-B29237 wrote: > > > > Hi Ira, > > > > > > > > Thanks for your review. > > > > > > > > After second thought, I think your scenario may not occur. > > > > Because the cookie 20 we query must be returned by > > > > fsl_dma_tx_submit(...) in > > > practice. > > > > We never query a cookie not returned by fsl_dma_tx_submit(...). > > > > > > > > > > I agree about this part. > > > > > > > When we call fsl_tx_status(20), the chan->common.cookie is > > > > definitely wrote as > > > 20 and cpu2 could not read as 19. > > > > > > > > > > This is what I don't agree about. However, I'm not an expert on CPU cache vs. > > > memory accesses in an multi-processor system. The section titled > > > "CACHE COHERENCY" in Documentation/memory-barriers.txt leads me to > > > believe that the scenario I described is possible. > > > > For Freescale PowerPC, the chip automatically takes care of cache coherency. Even if this is a concern, spinlock can't address it. > > > > > > > > What happens if CPU1's write of chan->common.cookie only goes into > > > CPU1's cache. It never makes it to main memory before CPU2 fetches the old value of 19. > > > > > > I don't think you should see any performance impact from the > > > smp_mb() operation. > > > > Smp_mb() do have impact on performance if it's in the hot path. While it might be safer having it, I doubt it is really necessary. If the CPU1 doesn't have the updated last_used, it's shouldn't have known there is a cookie 20 existed either. > > > > I believe that you are correct, for powerpc. However, anything outside of arch/powerpc shouldn't assume it only runs on powerpc. I wouldn't be surprised to see fsldma running on an iMX someday (ARM processor). > > My interpretation says that the change introduces the possibility that > fsl_tx_status() returns the wrong answer for an extremely small time window, on SMP only, based on Documentation/memory-barriers.txt. But I can't seem convince you. > > My real question is what code path is hitting this spinlock? Is it in mainline Linux? Why is it polling rather than using callbacks to determine DMA completion? > > Thanks, > Ira > > > > > -----Original Message----- > > > > From: Ira W. Snyder [mailto:iws@ovro.caltech.edu] > > > > Sent: 2011年11月23日 2:59 > > > > To: Shi Xuelin-B29237 > > > > Cc: dan.j.williams@intel.com; Li Yang-R58472; zw@zh-kernel.org; > > > > vinod.koul@intel.com; linuxppc-dev@lists.ozlabs.org; > > > > linux-kernel@vger.kernel.org > > > > Subject: Re: [PATCH][RFC] fsldma: fix performance degradation by > > > > optimizing > > > spinlock use. > > > > > > > > On Tue, Nov 22, 2011 at 12:55:05PM +0800, b29237@freescale.com wrote: > > > > > From: Forrest Shi <b29237@freescale.com> > > > > > > > > > > dma status check function fsl_tx_status is heavily called in > > > > > a tight loop and the desc lock in fsl_tx_status contended by > > > > > the dma status update function. this caused the dma performance > > > > > degrades much. > > > > > > > > > > this patch releases the lock in the fsl_tx_status function. > > > > > I believe it has no neglect impact on the following call of > > > > > dma_async_is_complete(...). > > > > > > > > > > we can see below three conditions will be identified as success > > > > > a) x < complete < use > > > > > b) x < complete+N < use+N > > > > > c) x < complete < use+N > > > > > here complete is the completed_cookie, use is the last_used > > > > > cookie, x is the querying cookie, N is MAX cookie > > > > > > > > > > when chan->completed_cookie is being read, the last_used may > > > > > be incresed. Anyway it has no neglect impact on the dma status > > > > > decision. > > > > > > > > > > Signed-off-by: Forrest Shi <xuelin.shi@freescale.com> > > > > > --- > > > > > drivers/dma/fsldma.c | 5 ----- > > > > > 1 files changed, 0 insertions(+), 5 deletions(-) > > > > > > > > > > diff --git a/drivers/dma/fsldma.c b/drivers/dma/fsldma.c index > > > > > 8a78154..1dca56f 100644 > > > > > --- a/drivers/dma/fsldma.c > > > > > +++ b/drivers/dma/fsldma.c > > > > > @@ -986,15 +986,10 @@ static enum dma_status > > > > > fsl_tx_status(struct > > > dma_chan *dchan, > > > > > struct fsldma_chan *chan = to_fsl_chan(dchan); > > > > > dma_cookie_t last_complete; > > > > > dma_cookie_t last_used; > > > > > - unsigned long flags; > > > > > - > > > > > - spin_lock_irqsave(&chan->desc_lock, flags); > > > > > > > > > > > > > This will cause a bug. See below for a detailed explanation. You need this instead: > > > > > > > > /* > > > > * On an SMP system, we must ensure that this CPU has seen the > > > > * memory accesses performed by another CPU under the > > > > * chan->desc_lock spinlock. > > > > */ > > > > smp_mb(); > > > > > last_complete = chan->completed_cookie; > > > > > last_used = dchan->cookie; > > > > > > > > > > - spin_unlock_irqrestore(&chan->desc_lock, flags); > > > > > - > > > > > dma_set_tx_state(txstate, last_complete, last_used, 0); > > > > > return dma_async_is_complete(cookie, last_complete, > > > > > last_used); } > > > > > > > > Facts: > > > > - dchan->cookie is the same member as chan->common.cookie (same > > > > memory > > > > location) > > > > - chan->common.cookie is the "last allocated cookie for a pending transaction" > > > > - chan->completed_cookie is the "last completed transaction" > > > > > > > > I have replaced "dchan->cookie" with "chan->common.cookie" in > > > > the below > > > explanation, to keep everything referenced from the same structure. > > > > > > > > Variable usage before your change. Everything is used locked. > > > > - RW chan->common.cookie (fsl_dma_tx_submit) > > > > - R chan->common.cookie (fsl_tx_status) > > > > - R chan->completed_cookie (fsl_tx_status) > > > > - W chan->completed_cookie (dma_do_tasklet) > > > > > > > > Variable usage after your change: > > > > - RW chan->common.cookie LOCKED > > > > - R chan->common.cookie NO LOCK > > > > - R chan->completed_cookie NO LOCK > > > > - W chan->completed_cookie LOCKED > > > > > > > > What if we assume that you have a 2 CPU system (such as a P2020). > > > > After your > > > changes, one possible sequence is: > > > > > > > > === CPU1 - allocate + submit descriptor: fsl_dma_tx_submit() === > > > > spin_lock_irqsave > > > > descriptor->cookie = 20 (x in your example) > > > > chan->common.cookie = 20 (used in your example) > > > > spin_unlock_irqrestore > > > > > > > > === CPU2 - immediately calls fsl_tx_status() === > > > > chan->common.cookie == 19 > > > > chan->completed_cookie == 19 > > > > descriptor->cookie == 20 > > > > > > > > Since we don't have locks anymore, CPU2 may not have seen the > > > > write to > > > > chan->common.cookie yet. > > > > > > > > Also assume that the DMA hardware has not started processing the > > > > transaction yet. Therefore dma_do_tasklet() has not been called, > > > > and > > > > chan->completed_cookie has not been updated. > > > > > > > > In this case, dma_async_is_complete() (on CPU2) returns > > > > DMA_SUCCESS, even > > > though the DMA operation has not succeeded. The DMA operation has > > > not even started yet! > > > > > > > > The smp_mb() fixes this, since it forces CPU2 to have seen all > > > > memory operations > > > that happened before CPU1 released the spinlock. Spinlocks are > > > implicit SMP memory barriers. > > > > > > > > Therefore, the above example becomes: > > > > smp_mb(); > > > > chan->common.cookie == 20 > > > > chan->completed_cookie == 19 > > > > descriptor->cookie == 20 > > > > > > > > Then dma_async_is_complete() returns DMA_IN_PROGRESS, which is correct. > > > > > > > > Thanks, > > > > Ira > > > > > > > > > > > > _______________________________________________ > > > > Linuxppc-dev mailing list > > > > Linuxppc-dev@lists.ozlabs.org > > > > https://lists.ozlabs.org/listinfo/linuxppc-dev > > >
On Fri, Dec 02, 2011 at 03:47:27AM +0000, Shi Xuelin-B29237 wrote: > Hi Iris, > > >I'm convinced that "smp_rmb()" is needed when removing the spinlock. As noted, Documentation/memory-barriers.txt says that stores on one CPU can be > >observed by another CPU in a different order. > >Previously, there was an UNLOCK (in fsl_dma_tx_submit) followed by a LOCK (in fsl_tx_status). This provided a "full barrier", forcing the operations to > >complete correctly when viewed by the second CPU. > > I do not agree this smp_rmb() works here. Because when this smp_rmb() executed and begin to read chan->common.cookie, you still cannot avoid the order issue. Something like one is reading old value, but another CPU is updating the new value. > > My point is here the order is not important for the DMA decision. > Completed DMA tx is decided as not complete is not a big deal, because next time it will be OK. > > I believe there is no case that could cause uncompleted DMA tx is decided as completed, because the fsl_tx_status is called after fsl_dma_tx_submit for a specific cookie. If you can give me an example here, I will agree with you. > According to memory-barriers.txt, writes to main memory may be observed in any order if memory barriers are not used. This means that writes can appear to happen in a different order than they were issued by the CPU. Citing from the text: > There are certain things that the Linux kernel memory barriers do not guarantee: > > (*) There is no guarantee that any of the memory accesses specified before a > memory barrier will be _complete_ by the completion of a memory barrier > instruction; the barrier can be considered to draw a line in that CPU's > access queue that accesses of the appropriate type may not cross. Also: > Without intervention, CPU 2 may perceive the events on CPU 1 in some > effectively random order, despite the write barrier issued by CPU 1: Also: > When dealing with CPU-CPU interactions, certain types of memory barrier should > always be paired. A lack of appropriate pairing is almost certainly an error. > > A write barrier should always be paired with a data dependency barrier or read > barrier, though a general barrier would also be viable. Therefore, in an SMP system, the following situation can happen. descriptor->cookie = 2 chan->common.cookie = 1 chan->completed_cookie = 1 This occurs when CPU-A calls fsl_dma_tx_submit() and then CPU-B calls dma_async_is_complete() ***after*** CPU-B has observed the write to descriptor->cookie, and ***before*** before CPU-B has observed the write to chan->common.cookie. Remember, without barriers, CPU-B can observe CPU-A's memory accesses in *any possible order*. Memory accesses are not guaranteed to be *complete* by the time fsl_dma_tx_submit() returns! With the above values, dma_async_is_complete() returns DMA_COMPLETE. This is incorrect: the DMA is still in progress. The required invariant chan->common.cookie >= descriptor->cookie has not been met. By adding an smp_rmb(), I force CPU-B to stall until *both* stores in fsl_dma_tx_submit() (descriptor->cookie and chan->common.cookie) actually hit main memory. This avoids the above situation: all CPU's observe descriptor->cookie and chan->common.cookie to update in sync with each other. Is this unclear in any way? Please run your test with the smp_rmb() and measure the performance impact. Ira
Hi Iris, >Remember, without barriers, CPU-B can observe CPU-A's memory accesses in *any possible order*. Memory accesses are not guaranteed to be *complete* by >the time fsl_dma_tx_submit() returns! fsl_dma_tx_submit is enclosed by spin_lock_irqsave/spin_unlock_irqrestore, when this function returns, I believe the memory access are completed. spin_unlock_irqsave is an implicit memory barrier and guaranteed this. Thanks, Forrest -----Original Message----- From: Ira W. Snyder [mailto:iws@ovro.caltech.edu] Sent: 2011年12月3日 1:14 To: Shi Xuelin-B29237 Cc: vinod.koul@intel.com; dan.j.williams@intel.com; linuxppc-dev@lists.ozlabs.org; linux-kernel@vger.kernel.org; Li Yang-R58472 Subject: Re: [PATCH][RFC] fsldma: fix performance degradation by optimizing spinlock use. On Fri, Dec 02, 2011 at 03:47:27AM +0000, Shi Xuelin-B29237 wrote: > Hi Iris, > > >I'm convinced that "smp_rmb()" is needed when removing the spinlock. > >As noted, Documentation/memory-barriers.txt says that stores on one CPU can be observed by another CPU in a different order. > >Previously, there was an UNLOCK (in fsl_dma_tx_submit) followed by a > >LOCK (in fsl_tx_status). This provided a "full barrier", forcing the operations to complete correctly when viewed by the second CPU. > > I do not agree this smp_rmb() works here. Because when this smp_rmb() executed and begin to read chan->common.cookie, you still cannot avoid the order issue. Something like one is reading old value, but another CPU is updating the new value. > > My point is here the order is not important for the DMA decision. > Completed DMA tx is decided as not complete is not a big deal, because next time it will be OK. > > I believe there is no case that could cause uncompleted DMA tx is decided as completed, because the fsl_tx_status is called after fsl_dma_tx_submit for a specific cookie. If you can give me an example here, I will agree with you. > According to memory-barriers.txt, writes to main memory may be observed in any order if memory barriers are not used. This means that writes can appear to happen in a different order than they were issued by the CPU. Citing from the text: > There are certain things that the Linux kernel memory barriers do not guarantee: > > (*) There is no guarantee that any of the memory accesses specified before a > memory barrier will be _complete_ by the completion of a memory barrier > instruction; the barrier can be considered to draw a line in that CPU's > access queue that accesses of the appropriate type may not cross. Also: > Without intervention, CPU 2 may perceive the events on CPU 1 in some > effectively random order, despite the write barrier issued by CPU 1: Also: > When dealing with CPU-CPU interactions, certain types of memory > barrier should always be paired. A lack of appropriate pairing is almost certainly an error. > > A write barrier should always be paired with a data dependency barrier > or read barrier, though a general barrier would also be viable. Therefore, in an SMP system, the following situation can happen. descriptor->cookie = 2 chan->common.cookie = 1 chan->completed_cookie = 1 This occurs when CPU-A calls fsl_dma_tx_submit() and then CPU-B calls dma_async_is_complete() ***after*** CPU-B has observed the write to descriptor->cookie, and ***before*** before CPU-B has observed the write descriptor->to chan->common.cookie. Remember, without barriers, CPU-B can observe CPU-A's memory accesses in *any possible order*. Memory accesses are not guaranteed to be *complete* by the time fsl_dma_tx_submit() returns! With the above values, dma_async_is_complete() returns DMA_COMPLETE. This is incorrect: the DMA is still in progress. The required invariant chan->common.cookie >= descriptor->cookie has not been met. By adding an smp_rmb(), I force CPU-B to stall until *both* stores in fsl_dma_tx_submit() (descriptor->cookie and chan->common.cookie) actually hit main memory. This avoids the above situation: all CPU's observe descriptor->cookie and chan->common.cookie to update in sync with each other. Is this unclear in any way? Please run your test with the smp_rmb() and measure the performance impact. Ira
diff --git a/drivers/dma/fsldma.c b/drivers/dma/fsldma.c index 8a78154..1dca56f 100644 --- a/drivers/dma/fsldma.c +++ b/drivers/dma/fsldma.c @@ -986,15 +986,10 @@ static enum dma_status fsl_tx_status(struct dma_chan *dchan, struct fsldma_chan *chan = to_fsl_chan(dchan); dma_cookie_t last_complete; dma_cookie_t last_used; - unsigned long flags; - - spin_lock_irqsave(&chan->desc_lock, flags); last_complete = chan->completed_cookie; last_used = dchan->cookie; - spin_unlock_irqrestore(&chan->desc_lock, flags); - dma_set_tx_state(txstate, last_complete, last_used, 0); return dma_async_is_complete(cookie, last_complete, last_used); }