
[v2] sata_sil24: Use memory barriers before issuing commands

Message ID 20100610160212.18091.29856.stgit@e102109-lin.cambridge.arm.com
State Not Applicable
Delegated to: David Miller

Commit Message

Catalin Marinas June 10, 2010, 4:02 p.m. UTC
The data in the cmd_block buffers may reach the main memory after the
writel() to the device ports. This patch introduces two calls to wmb()
to ensure the relative ordering.

Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Tested-by: Colin Tuckley <colin.tuckley@arm.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jeff Garzik <jeff@garzik.org>
---
 drivers/ata/sata_sil24.c |   10 ++++++++++
 1 files changed, 10 insertions(+), 0 deletions(-)


--
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Tejun Heo June 10, 2010, 4:12 p.m. UTC | #1
On 06/10/2010 06:02 PM, Catalin Marinas wrote:
> The data in the cmd_block buffers may reach the main memory after the
> writel() to the device ports. This patch introduces two calls to wmb()
> to ensure the relative ordering.
> 
> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
> Tested-by: Colin Tuckley <colin.tuckley@arm.com>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Jeff Garzik <jeff@garzik.org>

I suppose you have tested and verified that this is actually
necessary, right?  I've been looking through the docs but couldn't
find anything which described the ordering between writes to main
memory and write[bwl]()'s.  One thing that kind of bothers me is that
r/wmb()'s are for ordering memory accesses among CPUs which
participate in cache coherency protocol and although it may work right
in the above case I'm not really sure whether this is the right thing
to do.  Do you have more information on the subject?

Thanks.
Catalin Marinas June 10, 2010, 4:23 p.m. UTC | #2
On Thu, 2010-06-10 at 17:12 +0100, Tejun Heo wrote:
> On 06/10/2010 06:02 PM, Catalin Marinas wrote:
> > The data in the cmd_block buffers may reach the main memory after the
> > writel() to the device ports. This patch introduces two calls to wmb()
> > to ensure the relative ordering.
> >
> > Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
> > Tested-by: Colin Tuckley <colin.tuckley@arm.com>
> > Cc: Tejun Heo <tj@kernel.org>
> > Cc: Jeff Garzik <jeff@garzik.org>
> 
> I suppose you have tested and verified that this is actually
> necessary, right?  

Yes, otherwise we get random failures with this device on ARM.

> I've been looking through the docs but couldn't
> find anything which described the ordering between writes to main
> memory and write[bwl]()'s.  One thing that kind of bothers me is that
> r/wmb()'s are for ordering memory accesses among CPUs which
> participate in cache coherency protocol and although it may work right
> in the above case I'm not really sure whether this is the right thing
> to do.  Do you have more information on the subject?

The mb() family is not just for ordering accesses among CPUs (though it
covers that case as well). For inter-CPU ordering, we have smp_mb() and
friends. For all other cases, we have the mandatory barriers mb() and
friends, and DMA is one of those cases.

Apart from the memory-barriers.txt document, there is the Device I/O
docbook, which mentions something about DMA buffers, though it is not
very clear on which barriers to use (it says something like: just make
sure that the writes to the buffer have reached memory).

There have been past discussions on linux-arch before, so I'm cc'ing
that list again (ARM is not the only architecture with a weakly ordered
memory model).

I'm copying the patch below again for the linux-arch people that haven't
seen the beginning of the thread:


> diff --git a/drivers/ata/sata_sil24.c b/drivers/ata/sata_sil24.c
> index e925051..a5d5aff 100644
> --- a/drivers/ata/sata_sil24.c
> +++ b/drivers/ata/sata_sil24.c
> @@ -622,6 +622,11 @@ static int sil24_exec_polled_cmd(struct ata_port *ap, int pmp,
>  	irq_enabled = readl(port + PORT_IRQ_ENABLE_SET);
>  	writel(PORT_IRQ_COMPLETE | PORT_IRQ_ERROR, port + PORT_IRQ_ENABLE_CLR);
>  
> +	/*
> +	 * The barrier is required to ensure that writes to cmd_block reach
> +	 * the memory before the write to PORT_CMD_ACTIVATE.
> +	 */
> +	wmb();
>  	writel((u32)paddr, port + PORT_CMD_ACTIVATE);
>  	writel((u64)paddr >> 32, port + PORT_CMD_ACTIVATE + 4);
>  
> @@ -895,6 +900,11 @@ static unsigned int sil24_qc_issue(struct ata_queued_cmd *qc)
>  	paddr = pp->cmd_block_dma + tag * sizeof(*pp->cmd_block);
>  	activate = port + PORT_CMD_ACTIVATE + tag * 8;
>  
> +	/*
> +	 * The barrier is required to ensure that writes to cmd_block reach
> +	 * the memory before the write to PORT_CMD_ACTIVATE.
> +	 */
> +	wmb();
>  	writel((u32)paddr, activate);
>  	writel((u64)paddr >> 32, activate + 4);
>  

Thanks.
Catalin Marinas June 10, 2010, 4:42 p.m. UTC | #3
On Thu, 2010-06-10 at 17:23 +0100, Catalin Marinas wrote:
> On Thu, 2010-06-10 at 17:12 +0100, Tejun Heo wrote:
> > On 06/10/2010 06:02 PM, Catalin Marinas wrote:
> > > The data in the cmd_block buffers may reach the main memory after the
> > > writel() to the device ports. This patch introduces two calls to wmb()
> > > to ensure the relative ordering.
> > >
> > > Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
> > > Tested-by: Colin Tuckley <colin.tuckley@arm.com>
> > > Cc: Tejun Heo <tj@kernel.org>
> > > Cc: Jeff Garzik <jeff@garzik.org>
> >
> > I suppose you have tested and verified that this is actually
> > necessary, right? 
> 
> Yes, otherwise we get random failures with this device on ARM.
> 
> > I've been looking through the docs but couldn't
> > find anything which described the ordering between writes to main
> > memory and write[bwl]()'s.  One thing that kind of bothers me is that
> > r/wmb()'s are for ordering memory accesses among CPUs which
> > participate in cache coherency protocol and although it may work right
> > in the above case I'm not really sure whether this is the right thing
> > to do.  Do you have more information on the subject?
> 
> The mb() are not for ordering accesses among CPUs (though they would
> cover this case as well). For inter-CPU ordering, we have smp_mb() and
> friends. For all other cases, we have the mandatory barriers mb() and
> friends and DMA is one of them.
> 
> Apart from the memory-barriers.txt document, there is the Device I/O
> docbook which mentions something about DMA buffers, though not very
> clear on which barriers to use (something like just make sure that the
> writes to the buffer reached the memory).

It was actually DMA-API.txt (rather than the device I/O docbook), in the
description of dma_alloc_coherent().
Jeff Garzik June 10, 2010, 8:14 p.m. UTC | #4
On 06/10/2010 12:02 PM, Catalin Marinas wrote:
> The data in the cmd_block buffers may reach the main memory after the
> writel() to the device ports. This patch introduces two calls to wmb()
> to ensure the relative ordering.
>
> Signed-off-by: Catalin Marinas<catalin.marinas@arm.com>
> Tested-by: Colin Tuckley<colin.tuckley@arm.com>
> Cc: Tejun Heo<tj@kernel.org>
> Cc: Jeff Garzik<jeff@garzik.org>
> ---
>   drivers/ata/sata_sil24.c |   10 ++++++++++
>   1 files changed, 10 insertions(+), 0 deletions(-)
>
> diff --git a/drivers/ata/sata_sil24.c b/drivers/ata/sata_sil24.c
> index e925051..a5d5aff 100644
> --- a/drivers/ata/sata_sil24.c
> +++ b/drivers/ata/sata_sil24.c
> @@ -622,6 +622,11 @@ static int sil24_exec_polled_cmd(struct ata_port *ap, int pmp,
>   	irq_enabled = readl(port + PORT_IRQ_ENABLE_SET);
>   	writel(PORT_IRQ_COMPLETE | PORT_IRQ_ERROR, port + PORT_IRQ_ENABLE_CLR);
>
> +	/*
> +	 * The barrier is required to ensure that writes to cmd_block reach
> +	 * the memory before the write to PORT_CMD_ACTIVATE.
> +	 */
> +	wmb();
>   	writel((u32)paddr, port + PORT_CMD_ACTIVATE);
>   	writel((u64)paddr>>  32, port + PORT_CMD_ACTIVATE + 4);
>
> @@ -895,6 +900,11 @@ static unsigned int sil24_qc_issue(struct ata_queued_cmd *qc)
>   	paddr = pp->cmd_block_dma + tag * sizeof(*pp->cmd_block);
>   	activate = port + PORT_CMD_ACTIVATE + tag * 8;
>
> +	/*
> +	 * The barrier is required to ensure that writes to cmd_block reach
> +	 * the memory before the write to PORT_CMD_ACTIVATE.
> +	 */
> +	wmb();
>   	writel((u32)paddr, activate);
>   	writel((u64)paddr>>  32, activate + 4);
>

applied


Robert Hancock June 11, 2010, 12:43 a.m. UTC | #5
On 06/10/2010 10:23 AM, Catalin Marinas wrote:
> On Thu, 2010-06-10 at 17:12 +0100, Tejun Heo wrote:
>> On 06/10/2010 06:02 PM, Catalin Marinas wrote:
>>> The data in the cmd_block buffers may reach the main memory after the
>>> writel() to the device ports. This patch introduces two calls to wmb()
>>> to ensure the relative ordering.
>>>
>>> Signed-off-by: Catalin Marinas<catalin.marinas@arm.com>
>>> Tested-by: Colin Tuckley<colin.tuckley@arm.com>
>>> Cc: Tejun Heo<tj@kernel.org>
>>> Cc: Jeff Garzik<jeff@garzik.org>
>>
>> I suppose you have tested and verified that this is actually
>> necessary, right?
>
> Yes, otherwise we get random failures with this device on ARM.
>
>> I've been looking through the docs but couldn't
>> find anything which described the ordering between writes to main
>> memory and write[bwl]()'s.  One thing that kind of bothers me is that
>> r/wmb()'s are for ordering memory accesses among CPUs which
>> participate in cache coherency protocol and although it may work right
>> in the above case I'm not really sure whether this is the right thing
>> to do.  Do you have more information on the subject?
>
> The mb() are not for ordering accesses among CPUs (though they would
> cover this case as well). For inter-CPU ordering, we have smp_mb() and
> friends. For all other cases, we have the mandatory barriers mb() and
> friends and DMA is one of them.
>
> Apart from the memory-barriers.txt document, there is the Device I/O
> docbook which mentions something about DMA buffers, though not very
> clear on which barriers to use (something like just make sure that the
> writes to the buffer reached the memory).
>
> There were some past discussions on linux-arch before and I'm cc'ing
> this list again (ARM is not the only architecture with a weakly memory
> ordering model).
>
> I'm copying the patch below again for the linux-arch people that haven't
> seen the beginning of the thread:

My memory is fuzzy but I thought this came up before on PPC and I also 
thought the conclusion was that the platform code (for writel, etc.) 
should enforce ordering of MMIO accesses with respect to normal RAM 
accesses. (Or maybe it was just MMIO accesses with respect to each 
other?) I don't think the answer to that question has been clearly 
documented anywhere, which is somewhat unfortunate.

If the answer is that this is needed then there are likely a lot of 
other drivers in libata and elsewhere which need to be fixed as well. 
For example, I don't see any such barriers in libahci.c when I presume 
it would need them.

IMHO, it would be better for the platform code to ensure that MMIO 
access was strongly ordered with respect to each other and to RAM 
access. Drivers are just too likely to get this wrong, especially when 
x86, the most tested platform, doesn't have such issues.

>
>
>> diff --git a/drivers/ata/sata_sil24.c b/drivers/ata/sata_sil24.c
>> index e925051..a5d5aff 100644
>> --- a/drivers/ata/sata_sil24.c
>> +++ b/drivers/ata/sata_sil24.c
>> @@ -622,6 +622,11 @@ static int sil24_exec_polled_cmd(struct ata_port *ap, int pmp,
>>   	irq_enabled = readl(port + PORT_IRQ_ENABLE_SET);
>>   	writel(PORT_IRQ_COMPLETE | PORT_IRQ_ERROR, port + PORT_IRQ_ENABLE_CLR);
>>
>> +	/*
>> +	 * The barrier is required to ensure that writes to cmd_block reach
>> +	 * the memory before the write to PORT_CMD_ACTIVATE.
>> +	 */
>> +	wmb();
>>   	writel((u32)paddr, port + PORT_CMD_ACTIVATE);
>>   	writel((u64)paddr>>  32, port + PORT_CMD_ACTIVATE + 4);
>>
>> @@ -895,6 +900,11 @@ static unsigned int sil24_qc_issue(struct ata_queued_cmd *qc)
>>   	paddr = pp->cmd_block_dma + tag * sizeof(*pp->cmd_block);
>>   	activate = port + PORT_CMD_ACTIVATE + tag * 8;
>>
>> +	/*
>> +	 * The barrier is required to ensure that writes to cmd_block reach
>> +	 * the memory before the write to PORT_CMD_ACTIVATE.
>> +	 */
>> +	wmb();
>>   	writel((u32)paddr, activate);
>>   	writel((u64)paddr>>  32, activate + 4);
>>
>
> Thanks.
>

Nick Piggin June 11, 2010, 1:38 a.m. UTC | #6
On Thu, Jun 10, 2010 at 06:43:03PM -0600, Robert Hancock wrote:
> On 06/10/2010 10:23 AM, Catalin Marinas wrote:
> >On Thu, 2010-06-10 at 17:12 +0100, Tejun Heo wrote:
> >>On 06/10/2010 06:02 PM, Catalin Marinas wrote:
> >>>The data in the cmd_block buffers may reach the main memory after the
> >>>writel() to the device ports. This patch introduces two calls to wmb()
> >>>to ensure the relative ordering.
> >>>
> >>>Signed-off-by: Catalin Marinas<catalin.marinas@arm.com>
> >>>Tested-by: Colin Tuckley<colin.tuckley@arm.com>
> >>>Cc: Tejun Heo<tj@kernel.org>
> >>>Cc: Jeff Garzik<jeff@garzik.org>
> >>
> >>I suppose you have tested and verified that this is actually
> >>necessary, right?
> >
> >Yes, otherwise we get random failures with this device on ARM.
> >
> >>I've been looking through the docs but couldn't
> >>find anything which described the ordering between writes to main
> >>memory and write[bwl]()'s.  One thing that kind of bothers me is that
> >>r/wmb()'s are for ordering memory accesses among CPUs which
> >>participate in cache coherency protocol and although it may work right
> >>in the above case I'm not really sure whether this is the right thing
> >>to do.  Do you have more information on the subject?
> >
> >The mb() are not for ordering accesses among CPUs (though they would
> >cover this case as well). For inter-CPU ordering, we have smp_mb() and
> >friends. For all other cases, we have the mandatory barriers mb() and
> >friends and DMA is one of them.
> >
> >Apart from the memory-barriers.txt document, there is the Device I/O
> >docbook which mentions something about DMA buffers, though not very
> >clear on which barriers to use (something like just make sure that the
> >writes to the buffer reached the memory).
> >
> >There were some past discussions on linux-arch before and I'm cc'ing
> >this list again (ARM is not the only architecture with a weakly memory
> >ordering model).
> >
> >I'm copying the patch below again for the linux-arch people that haven't
> >seen the beginning of the thread:
> 
> My memory is fuzzy but I thought this came up before on PPC and I
> also thought the conclusion was that the platform code (for writel,
> etc.) should enforce ordering of MMIO accesses with respect to
> normal RAM accesses. (Or maybe it was just MMIO accesses with
> respect to each other?) I don't think the answer to that question
> has been clearly documented anywhere, which is somewhat unfortunate.
> 
> If the answer is that this is needed then there are likely a lot of
> other drivers in libata and elsewhere which need to be fixed as
> well. For example, I don't see any such barriers in libahci.c when I
> presume it would need them.
> 
> IMHO, it would be better for the platform code to ensure that MMIO
> access was strongly ordered with respect to each other and to RAM
> access. Drivers are just too likely to get this wrong, especially
> when x86, the most tested platform, doesn't have such issues.

The plan is to make all platforms do this: writes should be strongly
ordered with memory. That also serves to keep them inside critical
sections.

Then write?_relaxed, and memory barriers can be added if performance
is required.

FUJITA Tomonori June 11, 2010, 9:16 a.m. UTC | #7
On Fri, 11 Jun 2010 11:38:29 +1000
Nick Piggin <npiggin@suse.de> wrote:

> On Thu, Jun 10, 2010 at 06:43:03PM -0600, Robert Hancock wrote:
> > On 06/10/2010 10:23 AM, Catalin Marinas wrote:
> > >On Thu, 2010-06-10 at 17:12 +0100, Tejun Heo wrote:
> > >>On 06/10/2010 06:02 PM, Catalin Marinas wrote:
> > >>>The data in the cmd_block buffers may reach the main memory after the
> > >>>writel() to the device ports. This patch introduces two calls to wmb()
> > >>>to ensure the relative ordering.
> > >>>
> > >>>Signed-off-by: Catalin Marinas<catalin.marinas@arm.com>
> > >>>Tested-by: Colin Tuckley<colin.tuckley@arm.com>
> > >>>Cc: Tejun Heo<tj@kernel.org>
> > >>>Cc: Jeff Garzik<jeff@garzik.org>
> > >>
> > >>I suppose you have tested and verified that this is actually
> > >>necessary, right?
> > >
> > >Yes, otherwise we get random failures with this device on ARM.
> > >
> > >>I've been looking through the docs but couldn't
> > >>find anything which described the ordering between writes to main
> > >>memory and write[bwl]()'s.  One thing that kind of bothers me is that
> > >>r/wmb()'s are for ordering memory accesses among CPUs which
> > >>participate in cache coherency protocol and although it may work right
> > >>in the above case I'm not really sure whether this is the right thing
> > >>to do.  Do you have more information on the subject?
> > >
> > >The mb() are not for ordering accesses among CPUs (though they would
> > >cover this case as well). For inter-CPU ordering, we have smp_mb() and
> > >friends. For all other cases, we have the mandatory barriers mb() and
> > >friends and DMA is one of them.
> > >
> > >Apart from the memory-barriers.txt document, there is the Device I/O
> > >docbook which mentions something about DMA buffers, though not very
> > >clear on which barriers to use (something like just make sure that the
> > >writes to the buffer reached the memory).
> > >
> > >There were some past discussions on linux-arch before and I'm cc'ing
> > >this list again (ARM is not the only architecture with a weakly memory
> > >ordering model).
> > >
> > >I'm copying the patch below again for the linux-arch people that haven't
> > >seen the beginning of the thread:
> > 
> > My memory is fuzzy but I thought this came up before on PPC and I
> > also thought the conclusion was that the platform code (for writel,
> > etc.) should enforce ordering of MMIO accesses with respect to
> > normal RAM accesses. (Or maybe it was just MMIO accesses with
> > respect to each other?) I don't think the answer to that question
> > has been clearly documented anywhere, which is somewhat unfortunate.
> > 
> > If the answer is that this is needed then there are likely a lot of
> > other drivers in libata and elsewhere which need to be fixed as
> > well. For example, I don't see any such barriers in libahci.c when I
> > presume it would need them.
> > 
> > IMHO, it would be better for the platform code to ensure that MMIO
> > access was strongly ordered with respect to each other and to RAM
> > access. Drivers are just too likely to get this wrong, especially
> > when x86, the most tested platform, doesn't have such issues.
> 
> The plan is to make all platforms do this. writes should be
> strongly ordered with memory. That serves to keep them inside
> critical sections as well.

I support the plan; however, it looks like nothing has changed in any
platform since the plan was made (in 2008?).

I guess we need to add something to Documentation/ now; then we can
shout, "fix the arch's code instead of the driver". Was the consensus
that I/O writes should be strongly ordered with respect to any memory
access?
Catalin Marinas June 11, 2010, 9:41 a.m. UTC | #8
On Fri, 2010-06-11 at 02:38 +0100, Nick Piggin wrote:
> On Thu, Jun 10, 2010 at 06:43:03PM -0600, Robert Hancock wrote:
> > IMHO, it would be better for the platform code to ensure that MMIO
> > access was strongly ordered with respect to each other and to RAM
> > access. Drivers are just too likely to get this wrong, especially
> > when x86, the most tested platform, doesn't have such issues.
> 
> The plan is to make all platforms do this. writes should be
> strongly ordered with memory. That serves to keep them inside
> critical sections as well.

Are there any public references to this discussion? Maybe a
Documentation/ file (or an update to memory-barriers.txt) would be
useful.

I guess correctness takes precedence here but on ARM, the only way to
ensure relative ordering between non-cacheable writes and I/O writes is
by flushing the write buffer (and an L2 write buffer if external cache
is present). Hence the expensive mb().

The only reference of DMA buffers vs I/O I found in the DMA-API.txt
file:

        Consistent memory is memory for which a write by either the
        device or the processor can immediately be read by the processor
        or device without having to worry about caching effects. (You
        may however need to make sure to flush the processor's write
        buffers before telling devices to read that memory.)

But there is no API for "flushing the processor's write buffers". Does
this mean it should be taken care of in writel()? That would make the
I/O accessors pretty expensive on some architectures.

Thanks.
Nick Piggin June 11, 2010, 10:11 a.m. UTC | #9
On Fri, Jun 11, 2010 at 10:41:46AM +0100, Catalin Marinas wrote:
> On Fri, 2010-06-11 at 02:38 +0100, Nick Piggin wrote:
> > On Thu, Jun 10, 2010 at 06:43:03PM -0600, Robert Hancock wrote:
> > > IMHO, it would be better for the platform code to ensure that MMIO
> > > access was strongly ordered with respect to each other and to RAM
> > > access. Drivers are just too likely to get this wrong, especially
> > > when x86, the most tested platform, doesn't have such issues.
> > 
> > The plan is to make all platforms do this. writes should be
> > strongly ordered with memory. That serves to keep them inside
> > critical sections as well.
> 
> Are there any public references to this discussion? Maybe a
> Documentation/ file (or update the memory-barriers.txt one would be
> useful).

It was on the mailing list, don't have a ref off the top of my
head. Primarily between the ia64 and powerpc people and myself
IIRC.

They thought it would also be too expensive to do, but it turned out not
to be noticeable in a few simple tests. It will obviously depend on a
lot of factors...

Also I think most high performance drivers tend to have just a few
critical MMIOs, so they should be able to be identified and improved
relatively easily (relatively, as in: much more easily than trying to
find all the obscure ordering problems).

So anyway powerpc were reluctant because they try to fix it in their
spinlocks, but I demonstrated that there were drivers using mutexes
and other synchronization and found one or two broken ones in the
first place I looked.

 
> I guess correctness takes precedence here but on ARM, the only way to
> ensure relative ordering between non-cacheable writes and I/O writes is
> by flushing the write buffer (and an L2 write buffer if external cache
> is present). Hence the expensive mb().

Default IO accessors would be a little more expensive, yes.

 
> The only reference of DMA buffers vs I/O I found in the DMA-API.txt
> file:
> 
>         Consistent memory is memory for which a write by either the
>         device or the processor can immediately be read by the processor
>         or device without having to worry about caching effects. (You
>         may however need to make sure to flush the processor's write
>         buffers before telling devices to read that memory.)
> 
> But there is no API for "flushing the processor's write buffers". Does
> it mean that this should be taken care of in writel()? We would make the
> I/O accessors pretty expensive on some architectures.

The APIs for that are mb/wmb/rmb ones.


Catalin Marinas June 11, 2010, 11:04 a.m. UTC | #10
On Fri, 2010-06-11 at 11:11 +0100, Nick Piggin wrote:
> On Fri, Jun 11, 2010 at 10:41:46AM +0100, Catalin Marinas wrote:
> > On Fri, 2010-06-11 at 02:38 +0100, Nick Piggin wrote:
> > > On Thu, Jun 10, 2010 at 06:43:03PM -0600, Robert Hancock wrote:
> > > > IMHO, it would be better for the platform code to ensure that MMIO
> > > > access was strongly ordered with respect to each other and to RAM
> > > > access. Drivers are just too likely to get this wrong, especially
> > > > when x86, the most tested platform, doesn't have such issues.
> > >
> > > The plan is to make all platforms do this. writes should be
> > > strongly ordered with memory. That serves to keep them inside
> > > critical sections as well.
[...]
> Also I think most high performance drivers tend to have just a few
> critical mmios so they should be able to be identified and improved
> relatively easily (relatively, as in: much  more easily than trying to
> find all the obscure ordering problems).
> 
> So anyway powerpc were reluctant because they try to fix it in their
> spinlocks, but I demonstrated that there were drivers using mutexes
> and other synchronization and found one or two broken ones in the
> first place I looked.

On the ARM implementation we are safe with regards to spinlocks/mutexes
vs. IO accesses, no weird ordering issues here (if there would be, I
agree that it would need fixing).

> > The only reference of DMA buffers vs I/O I found in the DMA-API.txt
> > file:
> >
> >         Consistent memory is memory for which a write by either the
> >         device or the processor can immediately be read by the processor
> >         or device without having to worry about caching effects. (You
> >         may however need to make sure to flush the processor's write
> >         buffers before telling devices to read that memory.)
> >
> > But there is no API for "flushing the processor's write buffers". Does
> > it mean that this should be taken care of in writel()? We would make the
> > I/O accessors pretty expensive on some architectures.
> 
> The APIs for that are mb/wmb/rmb ones.

So if that's the API for the above case and we are strictly referring to
the sata_sil24 patch I sent - shouldn't we just add wmb() in the driver
between the write to the DMA buffer and the writel() to start the DMA
transfer? Do we need to move the wmb() to the writel() macro?
Robert Hancock June 12, 2010, 1:30 a.m. UTC | #11
On Fri, Jun 11, 2010 at 5:04 AM, Catalin Marinas
<catalin.marinas@arm.com> wrote:
> On Fri, 2010-06-11 at 11:11 +0100, Nick Piggin wrote:
>> On Fri, Jun 11, 2010 at 10:41:46AM +0100, Catalin Marinas wrote:
>> > On Fri, 2010-06-11 at 02:38 +0100, Nick Piggin wrote:
>> > > On Thu, Jun 10, 2010 at 06:43:03PM -0600, Robert Hancock wrote:
>> > > > IMHO, it would be better for the platform code to ensure that MMIO
>> > > > access was strongly ordered with respect to each other and to RAM
>> > > > access. Drivers are just too likely to get this wrong, especially
>> > > > when x86, the most tested platform, doesn't have such issues.
>> > >
>> > > The plan is to make all platforms do this. writes should be
>> > > strongly ordered with memory. That serves to keep them inside
>> > > critical sections as well.
> [...]
>> Also I think most high performance drivers tend to have just a few
>> critical mmios so they should be able to be identified and improved
>> relatively easily (relatively, as in: much  more easily than trying to
>> find all the obscure ordering problems).
>>
>> So anyway powerpc were reluctant because they try to fix it in their
>> spinlocks, but I demonstrated that there were drivers using mutexes
>> and other synchronization and found one or two broken ones in the
>> first place I looked.
>
> On the ARM implementation we are safe with regards to spinlocks/mutexes
> vs. IO accesses, no weird ordering issues here (if there would be, I
> agree that it would need fixing).
>
>> > The only reference of DMA buffers vs I/O I found in the DMA-API.txt
>> > file:
>> >
>> >         Consistent memory is memory for which a write by either the
>> >         device or the processor can immediately be read by the processor
>> >         or device without having to worry about caching effects. (You
>> >         may however need to make sure to flush the processor's write
>> >         buffers before telling devices to read that memory.)
>> >
>> > But there is no API for "flushing the processor's write buffers". Does
>> > it mean that this should be taken care of in writel()? We would make the
>> > I/O accessors pretty expensive on some architectures.
>>
>> The APIs for that are mb/wmb/rmb ones.
>
> So if that's the API for the above case and we are strictly referring to
> the sata_sil24 patch I sent - shouldn't we just add wmb() in the driver
> between the write to the DMA buffer and the writel() to start the DMA
> transfer? Do we need to move the wmb() to the writel() macro?

I think it would be best if writel, etc. did the write buffer flushing
by default. As Nick said, if there are some performance critical areas
then those can use the relaxed versions but it's safest if the default
behavior works as drivers expect.
FUJITA Tomonori June 14, 2010, 12:35 a.m. UTC | #12
On Fri, 11 Jun 2010 10:41:46 +0100
Catalin Marinas <catalin.marinas@arm.com> wrote:

> On Fri, 2010-06-11 at 02:38 +0100, Nick Piggin wrote:
> > On Thu, Jun 10, 2010 at 06:43:03PM -0600, Robert Hancock wrote:
> > > IMHO, it would be better for the platform code to ensure that MMIO
> > > access was strongly ordered with respect to each other and to RAM
> > > access. Drivers are just too likely to get this wrong, especially
> > > when x86, the most tested platform, doesn't have such issues.
> > 
> > The plan is to make all platforms do this. writes should be
> > strongly ordered with memory. That serves to keep them inside
> > critical sections as well.
> 
> Are there any public references to this discussion? Maybe a
> Documentation/ file (or update the memory-barriers.txt one would be
> useful).

I guess,

http://marc.info/?t=121185223900001&r=1&w=2
Catalin Marinas June 15, 2010, 11:10 a.m. UTC | #13
On Sat, 2010-06-12 at 02:30 +0100, Robert Hancock wrote:
> On Fri, Jun 11, 2010 at 5:04 AM, Catalin Marinas
> <catalin.marinas@arm.com> wrote:
> > On Fri, 2010-06-11 at 11:11 +0100, Nick Piggin wrote:
> >> On Fri, Jun 11, 2010 at 10:41:46AM +0100, Catalin Marinas wrote:
> >> > The only reference of DMA buffers vs I/O I found in the DMA-API.txt
> >> > file:
> >> >
> >> >         Consistent memory is memory for which a write by either the
> >> >         device or the processor can immediately be read by the processor
> >> >         or device without having to worry about caching effects. (You
> >> >         may however need to make sure to flush the processor's write
> >> >         buffers before telling devices to read that memory.)
> >> >
> >> > But there is no API for "flushing the processor's write buffers". Does
> >> > it mean that this should be taken care of in writel()? We would make the
> >> > I/O accessors pretty expensive on some architectures.
> >>
> >> The APIs for that are mb/wmb/rmb ones.
> >
> > So if that's the API for the above case and we are strictly referring to
> > the sata_sil24 patch I sent - shouldn't we just add wmb() in the driver
> > between the write to the DMA buffer and the writel() to start the DMA
> > transfer? Do we need to move the wmb() to the writel() macro?
> 
> I think it would be best if writel, etc. did the write buffer flushing
> by default. As Nick said, if there are some performance critical areas
> then those can use the relaxed versions but it's safest if the default
> behavior works as drivers expect.

I went through the past discussion pointed to by Fujita (thanks!) but I
wouldn't say it resulted in a definitive guideline on how architectures
should implement the I/O accessors.

From an ARM perspective, I would prefer to add wmb() in the drivers
where it matters - basically only those using DMA coherent buffers. A
lot of drivers already have this in place and that's already documented
in DMA-API.txt (maybe with a bit of clarification).

Some statistics - grepping drivers/ for alloc_coherent shows 285 files.
Of these, 69 already use barriers. It's not trivial to go through 200+
drivers and add barriers but I wouldn't say that's impossible.

If we go the other route of adding mb() in writel() (though I don't
prefer it), there are two additional issues:

(1) how relaxed would the "writel_relaxed" etc. accessors be? Are they
relaxed only with regards to coherent DMA buffers or relaxed with other
I/O operations as well? Can the compiler reorder them?

(2) do we go through all the drivers that currently have *mb() and
remove them? A quick grep in drivers/ shows over 1600 occurrences of
*mb().
Nick Piggin June 15, 2010, 11:31 a.m. UTC | #14
On Tue, Jun 15, 2010 at 12:10:53PM +0100, Catalin Marinas wrote:
> On Sat, 2010-06-12 at 02:30 +0100, Robert Hancock wrote:
> > > So if that's the API for the above case and we are strictly referring to
> > > the sata_sil24 patch I sent - shouldn't we just add wmb() in the driver
> > > between the write to the DMA buffer and the writel() to start the DMA
> > > transfer? Do we need to move the wmb() to the writel() macro?
> > 
> > I think it would be best if writel, etc. did the write buffer flushing
> > by default. As Nick said, if there are some performance critical areas
> > then those can use the relaxed versions but it's safest if the default
> > behavior works as drivers expect.
> 
> I went through the past discussion pointed to by Fujita (thanks!) but I
> wouldn't say it resulted in a definitive guideline on how architectures
> should implement the I/O accessors.
> 
> >From an ARM perspective, I would prefer to add wmb() in the drivers
> where it matters - basically only those using DMA coherent buffers. A
> lot of drivers already have this in place and that's already documented
> in DMA-API.txt (maybe with a bit of clarification).
> 
> Some statistics - grepping drivers/ for alloc_coherent shows 285 files.
> Of these, 69 already use barriers. It's not trivial to go through 200+
> drivers and add barriers but I wouldn't say that's impossible.

I disagree. Firstly, you will get subtle bugs that cannot be reproduced,
and situations where driver writers don't even have that architecture to
test on. Secondly, it is not a one-time audit that you can do and be done
with: code is constantly being changed and added. Driver code is going to be
written and tested and run on different archs or even different
implementations that do different things to ordering.

On the other hand, a performance slowdown should be far more reproducible
and traceable.

 
> If we go the other route of adding mb() in writel() (though I don't
> prefer it), there are two additional issues:
> 
> (1) how relaxed would the "writel_relaxed" etc. accessors be? Are they
> relaxed only with regards to coherent DMA buffers or relaxed with other
> I/O operations as well? Can the compiler reorder them?

That was up for debate. I think totally weak (like SMP ordering)
should be fine, but that may require that we need some more barriers
like io_wmb() (which just orders two writels from the same CPU).
Remember we want to restrict their use outside critical fastpaths
of important drivers -- this is a far smaller task than auditing all
accesses in all drivers.


> (2) do we go through all the drivers that currently have *mb() and
> remove them? A quick grep in drivers/ shows over 1600 occurrences of
> *mb().

No need, but the ones that matter will get updated slowly. The ones
that don't will continue to work (from an ordering point of view) on
obscure or untested hardware.

Catalin Marinas June 19, 2010, 10:32 p.m. UTC | #15
On Tue, 2010-06-15 at 12:31 +0100, Nick Piggin wrote:
> On Tue, Jun 15, 2010 at 12:10:53PM +0100, Catalin Marinas wrote:
> > On Sat, 2010-06-12 at 02:30 +0100, Robert Hancock wrote:
> > > > So if that's the API for the above case and we are strictly referring to
> > > > the sata_sil24 patch I sent - shouldn't we just add wmb() in the driver
> > > > between the write to the DMA buffer and the writel() to start the DMA
> > > > transfer? Do we need to move the wmb() to the writel() macro?
> > >
> > > I think it would be best if writel, etc. did the write buffer flushing
> > > by default. As Nick said, if there are some performance critical areas
> > > then those can use the relaxed versions but it's safest if the default
> > > behavior works as drivers expect.
> >
> > I went through the past discussion pointed to by Fujita (thanks!) but I
> > wouldn't say it resulted in a definitive guideline on how architectures
> > should implement the I/O accessors.
> >
> > From an ARM perspective, I would prefer to add wmb() in the drivers
> > where it matters - basically only those using DMA coherent buffers. A
> > lot of drivers already have this in place and that's already documented
> > in DMA-API.txt (maybe with a bit of clarification).
> >
> > Some statistics - grepping drivers/ for alloc_coherent shows 285 files.
> > Of these, 69 already use barriers. It's not trivial to go through 200+
> > drivers and add barriers but I wouldn't say that's impossible.
> 
> I disagree. Firstly, you will get subtle bugs that cannot be reproduced,
> and situations where driver writers don't even have that architecture to
> test on. Secondly, it is not a one-time audit that you can do and be done
> with: code is constantly being changed and added. Driver code is going to be
> written and tested and run on different archs or even different
> implementations that do different things to ordering.
> 
> On the other hand, a performance slowdown should be far more reproducible
> and traceable.

I need to do some benchmarks to see the overall impact on ARM. There are
many PIO drivers whose performance would be affected (and we probably use
more PIO than DMA drivers).

A solution may be to redefine the IO accessors in the dma-mapping.h file
so that you get the fully ordered ones when the DMA API is used.

Many drivers seem to convert a dma_addr_t value to u32/u64 and pass it
to the device for initiating the DMA transfer. Could we enforce an
arch-specific converter that contains the write barrier (similar to the
dma_map_* functions but without cache maintenance)?

> > If we go the other route of adding mb() in writel() (though I don't
> > prefer it), there are two additional issues:
> >
> > (1) how relaxed would the "writel_relaxed" etc. accessors be? Are they
> > relaxed only with regards to coherent DMA buffers or relaxed with other
> > I/O operations as well? Can the compiler reorder them?
> 
> That was up for debate. I think totally weak (like SMP ordering)
> should be fine, but that may require that we need some more barriers
> like io_wmb() (which just orders two writels from the same CPU).
> Remember we want to restrict their use outside critical fastpaths
> of important drivers -- this is a far smaller task than auditing all
> accesses in all drivers.

If we are to make the writel/readl on ARM fully ordered with both IO
(enforced by hardware) and uncached memory, do we add barriers on each
side of the writel/readl etc.? The common cases would require a barrier
before writel (write buffer flushing) and a barrier after readl (in case
of polling for a "DMA complete" state).

So if io_wmb() just orders two IO writes (writel_relaxed), does it mean
that we still need a mighty wmb() that orders any type of accesses (i.e.
uncached memory vs IO)? Can drivers not use the strict writel() and no
longer rely on wmb() (wondering whether we could simplify it on ARM with
fully ordered IO accessors)?

Apart from coherent DMA, what are the other uses of wmb() in Linux?
I've seen some use cases in drivers on SMP systems in relation to
interrupt routines.

Thanks.
Mark Lord June 23, 2010, 1 p.m. UTC | #16
On 10/06/10 08:43 PM, Robert Hancock wrote:
..
> My memory is fuzzy but I thought this came up before on PPC and I also
> thought the conclusion was that the platform code (for writel, etc.)
> should enforce ordering of MMIO accesses with respect to normal RAM
> accesses. (Or maybe it was just MMIO accesses with respect to each
> other?) I don't think the answer to that question has been clearly
> documented anywhere, which is somewhat unfortunate.
..

Different problem.  That discussion was for PIO reads into the page cache,
and ensuring coherency from all of that.

Whereas this patch is just ordinary low-level chipset programming,
and ensuring the descriptors are visible before issuing the "go" command.

Patch

diff --git a/drivers/ata/sata_sil24.c b/drivers/ata/sata_sil24.c
index e925051..a5d5aff 100644
--- a/drivers/ata/sata_sil24.c
+++ b/drivers/ata/sata_sil24.c
@@ -622,6 +622,11 @@  static int sil24_exec_polled_cmd(struct ata_port *ap, int pmp,
 	irq_enabled = readl(port + PORT_IRQ_ENABLE_SET);
 	writel(PORT_IRQ_COMPLETE | PORT_IRQ_ERROR, port + PORT_IRQ_ENABLE_CLR);
 
+	/*
+	 * The barrier is required to ensure that writes to cmd_block reach
+	 * the memory before the write to PORT_CMD_ACTIVATE.
+	 */
+	wmb();
 	writel((u32)paddr, port + PORT_CMD_ACTIVATE);
 	writel((u64)paddr >> 32, port + PORT_CMD_ACTIVATE + 4);
 
@@ -895,6 +900,11 @@  static unsigned int sil24_qc_issue(struct ata_queued_cmd *qc)
 	paddr = pp->cmd_block_dma + tag * sizeof(*pp->cmd_block);
 	activate = port + PORT_CMD_ACTIVATE + tag * 8;
 
+	/*
+	 * The barrier is required to ensure that writes to cmd_block reach
+	 * the memory before the write to PORT_CMD_ACTIVATE.
+	 */
+	wmb();
 	writel((u32)paddr, activate);
 	writel((u64)paddr >> 32, activate + 4);