Message ID: 20100610160212.18091.29856.stgit@e102109-lin.cambridge.arm.com
State: Not Applicable
Delegated to: David Miller
On 06/10/2010 06:02 PM, Catalin Marinas wrote:
> The data in the cmd_block buffers may reach the main memory after the
> writel() to the device ports. This patch introduces two calls to wmb()
> to ensure the relative ordering.
>
> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
> Tested-by: Colin Tuckley <colin.tuckley@arm.com>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Jeff Garzik <jeff@garzik.org>

I suppose you have tested and verified that this is actually necessary, right? I've been looking through the docs but couldn't find anything which described the ordering between writes to main memory and write[bwl]()'s. One thing that kind of bothers me is that r/wmb()'s are for ordering memory accesses among CPUs which participate in the cache coherency protocol, and although it may work right in the above case I'm not really sure whether this is the right thing to do. Do you have more information on the subject?

Thanks.
On Thu, 2010-06-10 at 17:12 +0100, Tejun Heo wrote:
> On 06/10/2010 06:02 PM, Catalin Marinas wrote:
> > The data in the cmd_block buffers may reach the main memory after the
> > writel() to the device ports. This patch introduces two calls to wmb()
> > to ensure the relative ordering.
[...]
> I suppose you have tested and verified that this is actually
> necessary, right?

Yes, otherwise we get random failures with this device on ARM.

> I've been looking through the docs but couldn't
> find anything which described the ordering between writes to main
> memory and write[bwl]()'s. One thing that kind of bothers me is that
> r/wmb()'s are for ordering memory accesses among CPUs which
> participate in cache coherency protocol and although it may work right
> in the above case I'm not really sure whether this is the right thing
> to do. Do you have more information on the subject?

The mb() and friends are not for ordering accesses among CPUs (though they would cover this case as well). For inter-CPU ordering, we have smp_mb() and friends. For all other cases, we have the mandatory barriers mb() and friends, and DMA is one of them.

Apart from the memory-barriers.txt document, there is the Device I/O docbook, which mentions something about DMA buffers, though it is not very clear on which barriers to use (something like "just make sure that the writes to the buffer reached the memory").

There were some past discussions on linux-arch before and I'm cc'ing this list again (ARM is not the only architecture with a weak memory ordering model).
I'm copying the patch below again for the linux-arch people that haven't seen the beginning of the thread:

> diff --git a/drivers/ata/sata_sil24.c b/drivers/ata/sata_sil24.c
> index e925051..a5d5aff 100644
> --- a/drivers/ata/sata_sil24.c
> +++ b/drivers/ata/sata_sil24.c
> @@ -622,6 +622,11 @@ static int sil24_exec_polled_cmd(struct ata_port *ap, int pmp,
>  	irq_enabled = readl(port + PORT_IRQ_ENABLE_SET);
>  	writel(PORT_IRQ_COMPLETE | PORT_IRQ_ERROR, port + PORT_IRQ_ENABLE_CLR);
>
> +	/*
> +	 * The barrier is required to ensure that writes to cmd_block reach
> +	 * the memory before the write to PORT_CMD_ACTIVATE.
> +	 */
> +	wmb();
>  	writel((u32)paddr, port + PORT_CMD_ACTIVATE);
>  	writel((u64)paddr >> 32, port + PORT_CMD_ACTIVATE + 4);
>
> @@ -895,6 +900,11 @@ static unsigned int sil24_qc_issue(struct ata_queued_cmd *qc)
>  	paddr = pp->cmd_block_dma + tag * sizeof(*pp->cmd_block);
>  	activate = port + PORT_CMD_ACTIVATE + tag * 8;
>
> +	/*
> +	 * The barrier is required to ensure that writes to cmd_block reach
> +	 * the memory before the write to PORT_CMD_ACTIVATE.
> +	 */
> +	wmb();
>  	writel((u32)paddr, activate);
>  	writel((u64)paddr >> 32, activate + 4);

Thanks.
On Thu, 2010-06-10 at 17:23 +0100, Catalin Marinas wrote:
> On Thu, 2010-06-10 at 17:12 +0100, Tejun Heo wrote:
[...]
> Apart from the memory-barriers.txt document, there is the Device I/O
> docbook which mentions something about DMA buffers, though not very
> clear on which barriers to use (something like just make sure that the
> writes to the buffer reached the memory).

It was actually the DMA-API.txt (rather than deviceiobook) in the description of dma_alloc_coherent().
On 06/10/2010 12:02 PM, Catalin Marinas wrote:
> The data in the cmd_block buffers may reach the main memory after the
> writel() to the device ports. This patch introduces two calls to wmb()
> to ensure the relative ordering.
>
> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
> Tested-by: Colin Tuckley <colin.tuckley@arm.com>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Jeff Garzik <jeff@garzik.org>
> ---
>  drivers/ata/sata_sil24.c |   10 ++++++++++
>  1 files changed, 10 insertions(+), 0 deletions(-)
[...]

applied

--
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On 06/10/2010 10:23 AM, Catalin Marinas wrote:
> On Thu, 2010-06-10 at 17:12 +0100, Tejun Heo wrote:
[...]
> There were some past discussions on linux-arch before and I'm cc'ing
> this list again (ARM is not the only architecture with a weak memory
> ordering model).
>
> I'm copying the patch below again for the linux-arch people that haven't
> seen the beginning of the thread:

My memory is fuzzy but I thought this came up before on PPC, and I also thought the conclusion was that the platform code (for writel, etc.) should enforce ordering of MMIO accesses with respect to normal RAM accesses. (Or maybe it was just MMIO accesses with respect to each other?) I don't think the answer to that question has been clearly documented anywhere, which is somewhat unfortunate.

If the answer is that this is needed, then there are likely a lot of other drivers in libata and elsewhere which need to be fixed as well. For example, I don't see any such barriers in libahci.c, where I presume it would need them.

IMHO, it would be better for the platform code to ensure that MMIO access was strongly ordered with respect to each other and to RAM access. Drivers are just too likely to get this wrong, especially when x86, the most tested platform, doesn't have such issues.

> Thanks.
On Thu, Jun 10, 2010 at 06:43:03PM -0600, Robert Hancock wrote:
> On 06/10/2010 10:23 AM, Catalin Marinas wrote:
[...]
> IMHO, it would be better for the platform code to ensure that MMIO
> access was strongly ordered with respect to each other and to RAM
> access. Drivers are just too likely to get this wrong, especially
> when x86, the most tested platform, doesn't have such issues.

The plan is to make all platforms do this: writes should be strongly ordered with memory. That serves to keep them inside critical sections as well. Then write?_relaxed and memory barriers can be added if performance is required.
On Fri, 11 Jun 2010 11:38:29 +1000 Nick Piggin <npiggin@suse.de> wrote:
> On Thu, Jun 10, 2010 at 06:43:03PM -0600, Robert Hancock wrote:
[...]
> > IMHO, it would be better for the platform code to ensure that MMIO
> > access was strongly ordered with respect to each other and to RAM
> > access. Drivers are just too likely to get this wrong, especially
> > when x86, the most tested platform, doesn't have such issues.
>
> The plan is to make all platforms do this. writes should be
> strongly ordered with memory. That serves to keep them inside
> critical sections as well.

I support the plan, however it looks like nothing has changed in any platform since the plan was made (in 2008?). I guess that we need to add something to Documentation/ now; then we can shout, "fix the arch's code instead of the driver". Was the consensus that MMIO writes be strongly ordered with respect to any memory access?
On Fri, 2010-06-11 at 02:38 +0100, Nick Piggin wrote:
> On Thu, Jun 10, 2010 at 06:43:03PM -0600, Robert Hancock wrote:
> > IMHO, it would be better for the platform code to ensure that MMIO
> > access was strongly ordered with respect to each other and to RAM
> > access. Drivers are just too likely to get this wrong, especially
> > when x86, the most tested platform, doesn't have such issues.
>
> The plan is to make all platforms do this. writes should be
> strongly ordered with memory. That serves to keep them inside
> critical sections as well.

Are there any public references to this discussion? Maybe a Documentation/ file (or an update to memory-barriers.txt) would be useful.

I guess correctness takes precedence here, but on ARM the only way to ensure relative ordering between non-cacheable writes and I/O writes is by flushing the write buffer (and an L2 write buffer if an external cache is present). Hence the expensive mb().

The only reference to DMA buffers vs. I/O I found is in the DMA-API.txt file:

	Consistent memory is memory for which a write by either the
	device or the processor can immediately be read by the processor
	or device without having to worry about caching effects. (You
	may however need to make sure to flush the processor's write
	buffers before telling devices to read that memory.)

But there is no API for "flushing the processor's write buffers". Does it mean that this should be taken care of in writel()? We would make the I/O accessors pretty expensive on some architectures.

Thanks.
On Fri, Jun 11, 2010 at 10:41:46AM +0100, Catalin Marinas wrote:
[...]
> Are there any public references to this discussion? Maybe a
> Documentation/ file (or update the memory-barriers.txt one would be
> useful).

It was on the mailing list, don't have a ref off the top of my head. Primarily between the ia64 and powerpc people and myself, IIRC. They thought it would also be too expensive to do, but it turned out not to be noticeable with a few simple tests. It will obviously depend on a lot of factors...

Also I think most high-performance drivers tend to have just a few critical MMIOs, so they should be able to be identified and improved relatively easily (relatively, as in: much more easily than trying to find all the obscure ordering problems).

So anyway, powerpc were reluctant because they try to fix it in their spinlocks, but I demonstrated that there were drivers using mutexes and other synchronization, and found one or two broken ones in the first place I looked.

> I guess correctness takes precedence here but on ARM, the only way to
> ensure relative ordering between non-cacheable writes and I/O writes is
> by flushing the write buffer (and an L2 write buffer if external cache
> is present). Hence the expensive mb().

Default I/O accessors would be a little more expensive, yes.

> The only reference of DMA buffers vs I/O I found in the DMA-API.txt
> file:
[...]
> But there is no API for "flushing the processor's write buffers". Does
> it mean that this should be taken care of in writel()? We would make the
> I/O accessors pretty expensive on some architectures.

The APIs for that are the mb/wmb/rmb ones.
On Fri, 2010-06-11 at 11:11 +0100, Nick Piggin wrote:
> On Fri, Jun 11, 2010 at 10:41:46AM +0100, Catalin Marinas wrote:
[...]
> Also I think most high performance drivers tend to have just a few
> critical mmios so they should be able to be identified and improved
> relatively easily (relatively, as in: much more easily than trying to
> find all the obscure ordering problems).
>
> So anyway powerpc were reluctant because they try to fix it in their
> spinlocks, but I demonstrated that there were drivers using mutexes
> and other synchronization and found one or two broken ones in the
> first place I looked.

On the ARM implementation we are safe with regard to spinlocks/mutexes vs. I/O accesses; no weird ordering issues here (if there were, I agree that it would need fixing).

> > But there is no API for "flushing the processor's write buffers". Does
> > it mean that this should be taken care of in writel()? We would make the
> > I/O accessors pretty expensive on some architectures.
>
> The APIs for that are mb/wmb/rmb ones.

So if that's the API for the above case, and we are strictly referring to the sata_sil24 patch I sent, shouldn't we just add wmb() in the driver between the write to the DMA buffer and the writel() that starts the DMA transfer? Do we need to move the wmb() into the writel() macro?
On Fri, Jun 11, 2010 at 5:04 AM, Catalin Marinas <catalin.marinas@arm.com> wrote:
[...]
> So if that's the API for the above case and we are strictly referring to
> the sata_sil24 patch I sent - shouldn't we just add wmb() in the driver
> between the write to the DMA buffer and the writel() to start the DMA
> transfer? Do we need to move the wmb() to the writel() macro?

I think it would be best if writel, etc. did the write buffer flushing by default. As Nick said, if there are some performance-critical areas then those can use the relaxed versions, but it's safest if the default behavior works as drivers expect.
On Fri, 11 Jun 2010 10:41:46 +0100 Catalin Marinas <catalin.marinas@arm.com> wrote:
[...]
> Are there any public references to this discussion? Maybe a
> Documentation/ file (or update the memory-barriers.txt one would be
> useful).

I guess, http://marc.info/?t=121185223900001&r=1&w=2
On Sat, 2010-06-12 at 02:30 +0100, Robert Hancock wrote:
> On Fri, Jun 11, 2010 at 5:04 AM, Catalin Marinas <catalin.marinas@arm.com> wrote:
[...]
> I think it would be best if writel, etc. did the write buffer flushing
> by default. As Nick said, if there are some performance critical areas
> then those can use the relaxed versions but it's safest if the default
> behavior works as drivers expect.

I went through the past discussion pointed to by Fujita (thanks!), but I wouldn't say it resulted in a definitive guideline on how architectures should implement the I/O accessors.

From an ARM perspective, I would prefer to add wmb() in the drivers where it matters: basically only those using coherent DMA buffers. A lot of drivers already have this in place, and that's already documented in DMA-API.txt (maybe with a bit of clarification).

Some statistics: grepping drivers/ for alloc_coherent shows 285 files. Of these, 69 already use barriers. It's not trivial to go through 200+ drivers and add barriers, but I wouldn't say it's impossible.

If we go the other route of adding mb() in writel() (though I don't prefer it), there are two additional issues:

(1) How relaxed would the "writel_relaxed" etc. accessors be? Are they relaxed only with regard to coherent DMA buffers, or relaxed with respect to other I/O operations as well? Can the compiler reorder them?

(2) Do we go through all the drivers that currently have *mb() and remove them? A quick grep in drivers/ shows over 1600 occurrences of *mb().
On Tue, Jun 15, 2010 at 12:10:53PM +0100, Catalin Marinas wrote: > On Sat, 2010-06-12 at 02:30 +0100, Robert Hancock wrote: > > > So if that's the API for the above case and we are strictly referring to > > > the sata_sil24 patch I sent - shouldn't we just add wmb() in the driver > > > between the write to the DMA buffer and the writel() to start the DMA > > > transfer? Do we need to move the wmb() to the writel() macro? > > > > I think it would be best if writel, etc. did the write buffer flushing > > by default. As Nick said, if there are some performance critical areas > > then those can use the relaxed versions but it's safest if the default > > behavior works as drivers expect. > > I went through the past discussion pointed to by Fujita (thanks!) but I > wouldn't say it resulted in a definitive guideline on how architectures > should implement the I/O accessors. > > >From an ARM perspective, I would prefer to add wmb() in the drivers > where it matters - basically only those using DMA coherent buffers. A > lot of drivers already have this in place and that's already documented > in DMA-API.txt (maybe with a bit of clarification). > > Some statistics - grepping drivers/ for alloc_coherent shows 285 files. > Of these, 69 already use barriers. It's not trivial to go through 200+ > drivers and add barriers but I wouldn't say that's impossible. I disagree. Firstly, you will get subtle bugs, not able to be reproduced and situations where driver writers don't even have that architecture to test on. Secondly, it is not a one-time audit and be done with it, code is constantly being changed and added. Driver code is going to be written and tested and run on different archs or even different implementations that do different things to ordering. On the other hand, a performance slowdown should be far more reproducible and traceable. 
> If we go the other route of adding mb() in writel() (though I don't
> prefer it), there are two additional issues:
>
> (1) how relaxed would the "writel_relaxed" etc. accessors be? Are they
> relaxed only with regards to coherent DMA buffers or relaxed with other
> I/O operations as well? Can the compiler reorder them?

That was up for debate. I think totally weak (like SMP ordering) should
be fine, but that may require that we need some more barriers like
io_wmb() (which just orders two writels from the same CPU). Remember we
want to restrict the relaxed accessors to the critical fastpaths of
important drivers -- auditing those is a far smaller task than auditing
all accesses in all drivers.

> (2) do we go through all the drivers that currently have *mb() and
> remove them? A quick grep in drivers/ shows over 1600 occurrences of
> *mb().

No need, but the ones that matter will get updated slowly. The ones that
don't will continue to work (from an ordering point of view) on obscure
or untested hardware.
--
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
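The split Nick describes — a fully ordered accessor that is simply the relaxed one preceded by a barrier — can be sketched like this. Again a user-space mock under stated assumptions: the `_mock` names are hypothetical, the device register is modelled as a plain volatile word, and __sync_synchronize() stands in for the barrier an architecture would fold into writel().

```c
#include <assert.h>

/* Raw device write with no ordering guarantees beyond the volatile
 * access itself -- the hypothetical writel_relaxed(). */
static void writel_relaxed_mock(unsigned int val, volatile unsigned int *addr)
{
    *addr = val;
}

/* The default, fully ordered writel(): barrier first, so earlier
 * stores to DMA buffers are visible before the device write. */
static void writel_ordered_mock(unsigned int val, volatile unsigned int *addr)
{
    __sync_synchronize();           /* the mb() folded into writel() */
    writel_relaxed_mock(val, addr);
}
```

Under this scheme drivers keep working unmodified with the ordered default, and only audited fastpaths opt into the relaxed variant.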
On Tue, 2010-06-15 at 12:31 +0100, Nick Piggin wrote:
> On Tue, Jun 15, 2010 at 12:10:53PM +0100, Catalin Marinas wrote:
> > On Sat, 2010-06-12 at 02:30 +0100, Robert Hancock wrote:
> > > > So if that's the API for the above case and we are strictly referring to
> > > > the sata_sil24 patch I sent - shouldn't we just add wmb() in the driver
> > > > between the write to the DMA buffer and the writel() to start the DMA
> > > > transfer? Do we need to move the wmb() to the writel() macro?
> > >
> > > I think it would be best if writel, etc. did the write buffer flushing
> > > by default. As Nick said, if there are some performance critical areas
> > > then those can use the relaxed versions but it's safest if the default
> > > behavior works as drivers expect.
> >
> > I went through the past discussion pointed to by Fujita (thanks!) but I
> > wouldn't say it resulted in a definitive guideline on how architectures
> > should implement the I/O accessors.
> >
> > From an ARM perspective, I would prefer to add wmb() in the drivers
> > where it matters - basically only those using DMA coherent buffers. A
> > lot of drivers already have this in place and that's already documented
> > in DMA-API.txt (maybe with a bit of clarification).
> >
> > Some statistics - grepping drivers/ for alloc_coherent shows 285 files.
> > Of these, 69 already use barriers. It's not trivial to go through 200+
> > drivers and add barriers but I wouldn't say that's impossible.
>
> I disagree. Firstly, you will get subtle bugs, not able to be reproduced
> and situations where driver writers don't even have that architecture to
> test on. Secondly, it is not a one-time audit and be done with it, code
> is constantly being changed and added. Driver code is going to be
> written and tested and run on different archs or even different
> implementations that do different things to ordering.
>
> On the other hand, a performance slowdown should be far more reproducible
> and traceable.
I need to do some benchmarks to see the overall impact on ARM. There are
many PIO drivers whose performance would be affected (and we probably
use more PIO than DMA drivers). A solution may be to redefine the I/O
accessors in the dma-mapping.h file so that you get the fully ordered
ones when the DMA API is used.

Many drivers seem to convert a dma_addr_t value to u32/u64 and pass it
to the device for initiating the DMA transfer. Could we enforce an
arch-specific converter that contains the write barrier (similar to the
dma_map_* functions but without cache maintenance)?

> > If we go the other route of adding mb() in writel() (though I don't
> > prefer it), there are two additional issues:
> >
> > (1) how relaxed would the "writel_relaxed" etc. accessors be? Are they
> > relaxed only with regards to coherent DMA buffers or relaxed with other
> > I/O operations as well? Can the compiler reorder them?
>
> That was up for debate. I think totally weak (like SMP ordering)
> should be fine, but that may require that we need some more barriers
> like io_wmb() (which just orders two writels from the same CPU).
> Remember we want to restrict their use outside critical fastpaths
> of important drivers -- this is a far smaller task than auditing all
> accesses in all drivers.

If we are to make writel/readl on ARM fully ordered with both I/O
(enforced by hardware) and uncached memory, do we add barriers on each
side of writel/readl? The common cases would require a barrier before
writel (write buffer flushing) and a barrier after readl (in case of
polling for a "DMA complete" state).

And if io_wmb() just orders two I/O writes (writel_relaxed), does that
mean we still need a mighty wmb() that orders any type of access (i.e.
uncached memory vs I/O)? Or could drivers use the strict writel() and no
longer rely on wmb() at all? (I'm wondering whether we could simplify
things on ARM with fully ordered I/O accessors.)
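The arch-specific converter Catalin floats above could look roughly like the sketch below: turning a dma_addr_t into the value handed to the device would imply the write barrier, so a driver passing the result to a relaxed accessor still gets buffer-before-doorbell ordering. Everything here is hypothetical (the name dma_addr_to_dev, the typedef, the user-space barrier) — it only illustrates the proposed shape, not an existing kernel interface.

```c
#include <assert.h>

typedef unsigned long long dma_addr_mock_t; /* stand-in for dma_addr_t */

/* Hypothetical arch hook: converting the DMA address for the device
 * flushes outstanding descriptor writes first, like dma_map_* but
 * without the cache maintenance. */
static unsigned int dma_addr_to_dev(dma_addr_mock_t addr)
{
    __sync_synchronize();           /* the implied write barrier */
    return (unsigned int)addr;      /* low 32 bits go to the register */
}
```

A driver would then write `dma_addr_to_dev(paddr)` to the activate register instead of open-coding wmb(), keeping the barrier in one arch-owned place.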
Apart from coherent DMA buffers, what are the other uses of wmb() in
Linux? I've seen some uses in drivers on SMP systems in relation to
interrupt routines.

Thanks.
On 10/06/10 08:43 PM, Robert Hancock wrote:
..
> My memory is fuzzy but I thought this came up before on PPC and I also
> thought the conclusion was that the platform code (for writel, etc.)
> should enforce ordering of MMIO accesses with respect to normal RAM
> accesses. (Or maybe it was just MMIO accesses with respect to each
> other?) I don't think the answer to that question has been clearly
> documented anywhere, which is somewhat unfortunate.
..

Different problem. That discussion was about PIO reads into the page
cache, and ensuring coherency from all of that. Whereas this patch is
just ordinary low-level chipset programming, ensuring the descriptors
are visible before issuing the "go" command.
diff --git a/drivers/ata/sata_sil24.c b/drivers/ata/sata_sil24.c
index e925051..a5d5aff 100644
--- a/drivers/ata/sata_sil24.c
+++ b/drivers/ata/sata_sil24.c
@@ -622,6 +622,11 @@ static int sil24_exec_polled_cmd(struct ata_port *ap, int pmp,
 	irq_enabled = readl(port + PORT_IRQ_ENABLE_SET);
 	writel(PORT_IRQ_COMPLETE | PORT_IRQ_ERROR, port + PORT_IRQ_ENABLE_CLR);
 
+	/*
+	 * The barrier is required to ensure that writes to cmd_block reach
+	 * the memory before the write to PORT_CMD_ACTIVATE.
+	 */
+	wmb();
 	writel((u32)paddr, port + PORT_CMD_ACTIVATE);
 	writel((u64)paddr >> 32, port + PORT_CMD_ACTIVATE + 4);
 
@@ -895,6 +900,11 @@ static unsigned int sil24_qc_issue(struct ata_queued_cmd *qc)
 	paddr = pp->cmd_block_dma + tag * sizeof(*pp->cmd_block);
 	activate = port + PORT_CMD_ACTIVATE + tag * 8;
 
+	/*
+	 * The barrier is required to ensure that writes to cmd_block reach
+	 * the memory before the write to PORT_CMD_ACTIVATE.
+	 */
+	wmb();
 	writel((u32)paddr, activate);
 	writel((u64)paddr >> 32, activate + 4);