Message ID | 201002022211.o12MB8cJ017441@imap1.linux-foundation.org |
---|---|
State | Not Applicable |
Delegated to: | David Miller |
Headers | show |
On Tue, 2010-02-02 at 14:11 -0800, akpm@linux-foundation.org wrote: > From: Catalin Marinas <catalin.marinas@arm.com> > > Depending on the direction of the transfer, flush_dcache_page() must be > called either before (ATA_TFLAG_WRITE) or after (!ATA_TFLAG_WRITE) the > data copying to avoid D-cache aliasing with user space or I-D cache > coherency issues (when reading data from an ATA device using PIO, the > kernel dirties the D-cache but there is no flush_dcache_page() required on > Harvard architectures). > > This patch allows the ARM boards to use a rootfs on CompactFlash with the > PATA platform driver. > > As Anfei Zhou mentioned in a recent patch ("flush dcache before writing > into page to avoid alias"), on some architectures there may be a > performance benefit in differentiating the flush_dcache_page() calls based > on whether the kernel or the user page needs flushing. > > IMHO, we should differentiate based on the direction (kernel reading or > writing from/to such page). In the ARM case with PIPT Harvard caches > (newer processors), the kernel reading from a page that may be mapped in > user space shouldn't need cache flushing. The kernel writing to such page > would require D-cache flushing because of coherency with the I-cache. > Currently on ARM, the latter happens in both cases. > > Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> > Cc: Jeff Garzik <jgarzik@pobox.com> > Cc: Tejun Heo <tj@kernel.org> > Cc: <stable@kernel.org> > Signed-off-by: Andrew Morton <akpm@linux-foundation.org> > --- > > drivers/ata/libata-sff.c | 6 ++++++ > 1 file changed, 6 insertions(+) > > diff -puN drivers/ata/libata-sff.c~ata-call-flush_dcache_page-around-pio-data-transfers-in-libata-affc drivers/ata/libata-sff.c > --- a/drivers/ata/libata-sff.c~ata-call-flush_dcache_page-around-pio-data-transfers-in-libata-affc > +++ a/drivers/ata/libata-sff.c > @@ -874,6 +874,9 @@ static void ata_pio_sector(struct ata_qu > > DPRINTK("data %s\n", qc->tf.flags & ATA_TFLAG_WRITE ? "write" : "read"); > > + if (do_write) > + flush_dcache_page(page); > + This looks wrong; the upper layers should already have made the page aliases coherent from user to kernel by calling flush_dcache_page (in __get_user_pages()), so the aliases should already be up to date and this flush is spurious. > if (PageHighMem(page)) { > unsigned long flags; > > @@ -893,6 +896,9 @@ static void ata_pio_sector(struct ata_qu > do_write); > } > > + if (!do_write) > + flush_dcache_page(page); > + OK, so this too looks wrong for two reasons 1. it's over flushing. Even after the write to the page by the kernel PIO, the only alias that is dirty should be the kernel, so this needs a flush_kernel_dcache_page() to empty the kernel alias. It is possible user space will have speculated over the user aliases, but there's stuff further up the block stack to bring this back into coherence. 2. If the page really is in highmem, the flush has to happen along the kernel alias, which you just lost because this flush is happening after the kunmap_atomic(), so it has to occur somewhere between the PIO operation and the kunmap. > qc->curbytes += qc->sect_size; > qc->cursg_ofs += qc->sect_size; > James -- To unsubscribe from this list: send the line "unsubscribe linux-ide" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, 02 Feb 2010 16:58:38 -0600 James Bottomley <James.Bottomley@suse.de> wrote: > On Tue, 2010-02-02 at 14:11 -0800, akpm@linux-foundation.org wrote: > > From: Catalin Marinas <catalin.marinas@arm.com> > > > > Depending on the direction of the transfer, flush_dcache_page() must be > > called either before (ATA_TFLAG_WRITE) or after (!ATA_TFLAG_WRITE) the > > data copying to avoid D-cache aliasing with user space or I-D cache > > coherency issues (when reading data from an ATA device using PIO, the > > kernel dirties the D-cache but there is no flush_dcache_page() required on > > Harvard architectures). > > > > This patch allows the ARM boards to use a rootfs on CompactFlash with the > > PATA platform driver. > > > > As Anfei Zhou mentioned in a recent patch ("flush dcache before writing > > into page to avoid alias"), on some architectures there may be a > > performance benefit in differentiating the flush_dcache_page() calls based > > on whether the kernel or the user page needs flushing. > > > > IMHO, we should differentiate based on the direction (kernel reading or > > writing from/to such page). In the ARM case with PIPT Harvard caches > > (newer processors), the kernel reading from a page that may be mapped in > > user space shouldn't need cache flushing. The kernel writing to such page > > would require D-cache flushing because of coherency with the I-cache. > > Currently on ARM, the latter happens in both cases. > > > > Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> > > Cc: Jeff Garzik <jgarzik@pobox.com> > > Cc: Tejun Heo <tj@kernel.org> > > Cc: <stable@kernel.org> > > Signed-off-by: Andrew Morton <akpm@linux-foundation.org> > > --- > > > > drivers/ata/libata-sff.c | 6 ++++++ > > 1 file changed, 6 insertions(+) > > > > diff -puN drivers/ata/libata-sff.c~ata-call-flush_dcache_page-around-pio-data-transfers-in-libata-affc drivers/ata/libata-sff.c > > --- a/drivers/ata/libata-sff.c~ata-call-flush_dcache_page-around-pio-data-transfers-in-libata-affc > > +++ a/drivers/ata/libata-sff.c > > @@ -874,6 +874,9 @@ static void ata_pio_sector(struct ata_qu > > > > DPRINTK("data %s\n", qc->tf.flags & ATA_TFLAG_WRITE ? "write" : "read"); > > > > + if (do_write) > > + flush_dcache_page(page); > > + > > This looks wrong; the upper layers should already have made the page > aliases coherent from user to kernel by calling flush_dcache_page (in > __get_user_pages()), so the aliases should already be up to date and > this flush is spurious. The upper layers don't know that the CPU touched the data! If the driver did a DMA transfer then such a flush is unneeded, so we don't do it. > > if (PageHighMem(page)) { > > unsigned long flags; > > > > @@ -893,6 +896,9 @@ static void ata_pio_sector(struct ata_qu > > do_write); > > } > > > > + if (!do_write) > > + flush_dcache_page(page); > > + > > OK, so this too looks wrong for two reasons > > 1. it's over flushing. Even after the write to the page by the > kernel PIO, the only alias that is dirty should be the kernel, > so this needs a flush_kernel_dcache_page() to empty the kernel > alias. It is possible user space will have speculated over the > user aliases, but there's stuff further up the block stack to > bring this back into coherence. > 2. If the page really is in highmem, the flush has to happen along > the kernel alias, which you just lost because this flush is > happening after the kunmap_atomic(), so it has to occur > somewhere between the PIO operation and the kunmap. > > > qc->curbytes += qc->sect_size; > > qc->cursg_ofs += qc->sect_size; > > > > James > -- To unsubscribe from this list: send the line "unsubscribe linux-ide" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 02/02/2010 06:05 PM, Andrew Morton wrote: > On Tue, 02 Feb 2010 16:58:38 -0600 > James Bottomley<James.Bottomley@suse.de> wrote: > >> On Tue, 2010-02-02 at 14:11 -0800, akpm@linux-foundation.org wrote: >>> From: Catalin Marinas<catalin.marinas@arm.com> >>> >>> Depending on the direction of the transfer, flush_dcache_page() must be >>> called either before (ATA_TFLAG_WRITE) or after (!ATA_TFLAG_WRITE) the >>> data copying to avoid D-cache aliasing with user space or I-D cache >>> coherency issues (when reading data from an ATA device using PIO, the >>> kernel dirties the D-cache but there is no flush_dcache_page() required on >>> Harvard architectures). >>> >>> This patch allows the ARM boards to use a rootfs on CompactFlash with the >>> PATA platform driver. >>> >>> As Anfei Zhou mentioned in a recent patch ("flush dcache before writing >>> into page to avoid alias"), on some architectures there may be a >>> performance benefit in differentiating the flush_dcache_page() calls based >>> on whether the kernel or the user page needs flushing. >>> >>> IMHO, we should differentiate based on the direction (kernel reading or >>> writing from/to such page). In the ARM case with PIPT Harvard caches >>> (newer processors), the kernel reading from a page that may be mapped in >>> user space shouldn't need cache flushing. The kernel writing to such page >>> would require D-cache flushing because of coherency with the I-cache. >>> Currently on ARM, the latter happens in both cases. >>> >>> Signed-off-by: Catalin Marinas<catalin.marinas@arm.com> >>> Cc: Jeff Garzik<jgarzik@pobox.com> >>> Cc: Tejun Heo<tj@kernel.org> >>> Cc:<stable@kernel.org> >>> Signed-off-by: Andrew Morton<akpm@linux-foundation.org> >>> --- >>> >>> drivers/ata/libata-sff.c | 6 ++++++ >>> 1 file changed, 6 insertions(+) >>> >>> diff -puN drivers/ata/libata-sff.c~ata-call-flush_dcache_page-around-pio-data-transfers-in-libata-affc drivers/ata/libata-sff.c >>> --- a/drivers/ata/libata-sff.c~ata-call-flush_dcache_page-around-pio-data-transfers-in-libata-affc >>> +++ a/drivers/ata/libata-sff.c >>> @@ -874,6 +874,9 @@ static void ata_pio_sector(struct ata_qu >>> >>> DPRINTK("data %s\n", qc->tf.flags& ATA_TFLAG_WRITE ? "write" : "read"); >>> >>> + if (do_write) >>> + flush_dcache_page(page); >>> + >> >> This looks wrong; the upper layers should already have made the page >> aliases coherent from user to kernel by calling flush_dcache_page (in >> __get_user_pages()), so the aliases should already be up to date and >> this flush is spurious. > > The upper layers don't know that the CPU touched the data! If the > driver did a DMA transfer then such a flush is unneeded, so we don't do > it. The patch in question only affects PIO transfers, not DMA. Data is transferred from a kernel buffer to hardware via out[bwl] via page data -> CPU register -> out[bwl] or, data is transferred from hardware to a kernel buffer via in[bwl] -> CPU register -> page data So what are the flushing rules given those conditions? Jeff -- To unsubscribe from this list: send the line "unsubscribe linux-ide" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, 2010-02-02 at 15:05 -0800, Andrew Morton wrote: > On Tue, 02 Feb 2010 16:58:38 -0600 > James Bottomley <James.Bottomley@suse.de> wrote: > > > On Tue, 2010-02-02 at 14:11 -0800, akpm@linux-foundation.org wrote: > > > From: Catalin Marinas <catalin.marinas@arm.com> > > > > > > Depending on the direction of the transfer, flush_dcache_page() must be > > > called either before (ATA_TFLAG_WRITE) or after (!ATA_TFLAG_WRITE) the > > > data copying to avoid D-cache aliasing with user space or I-D cache > > > coherency issues (when reading data from an ATA device using PIO, the > > > kernel dirties the D-cache but there is no flush_dcache_page() required on > > > Harvard architectures). > > > > > > This patch allows the ARM boards to use a rootfs on CompactFlash with the > > > PATA platform driver. > > > > > > As Anfei Zhou mentioned in a recent patch ("flush dcache before writing > > > into page to avoid alias"), on some architectures there may be a > > > performance benefit in differentiating the flush_dcache_page() calls based > > > on whether the kernel or the user page needs flushing. > > > > > > IMHO, we should differentiate based on the direction (kernel reading or > > > writing from/to such page). In the ARM case with PIPT Harvard caches > > > (newer processors), the kernel reading from a page that may be mapped in > > > user space shouldn't need cache flushing. The kernel writing to such page > > > would require D-cache flushing because of coherency with the I-cache. > > > Currently on ARM, the latter happens in both cases. > > > > > > Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> > > > Cc: Jeff Garzik <jgarzik@pobox.com> > > > Cc: Tejun Heo <tj@kernel.org> > > > Cc: <stable@kernel.org> > > > Signed-off-by: Andrew Morton <akpm@linux-foundation.org> > > > --- > > > > > > drivers/ata/libata-sff.c | 6 ++++++ > > > 1 file changed, 6 insertions(+) > > > > > > diff -puN drivers/ata/libata-sff.c~ata-call-flush_dcache_page-around-pio-data-transfers-in-libata-affc drivers/ata/libata-sff.c > > > --- a/drivers/ata/libata-sff.c~ata-call-flush_dcache_page-around-pio-data-transfers-in-libata-affc > > > +++ a/drivers/ata/libata-sff.c > > > @@ -874,6 +874,9 @@ static void ata_pio_sector(struct ata_qu > > > > > > DPRINTK("data %s\n", qc->tf.flags & ATA_TFLAG_WRITE ? "write" : "read"); > > > > > > + if (do_write) > > > + flush_dcache_page(page); > > > + > > > > This looks wrong; the upper layers should already have made the page > > aliases coherent from user to kernel by calling flush_dcache_page (in > > __get_user_pages()), so the aliases should already be up to date and > > this flush is spurious. > > The upper layers don't know that the CPU touched the data! The upper layers know to flush the user aliases. > If the > driver did a DMA transfer then such a flush is unneeded, so we don't do > it. The point is that this is a write case (transfer from kernel to device) so the underlying data should remain stable. Once __get_user_pages() has cohered the aliases, nothing should decohere them because nothing should be altering the data in the page. The danger case for VI caches is the read case because there you are altering data via the kernel aliases. James -- To unsubscribe from this list: send the line "unsubscribe linux-ide" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, 2010-02-02 at 18:14 -0500, Jeff Garzik wrote: > On 02/02/2010 06:05 PM, Andrew Morton wrote: > > On Tue, 02 Feb 2010 16:58:38 -0600 > > James Bottomley<James.Bottomley@suse.de> wrote: > > > >> On Tue, 2010-02-02 at 14:11 -0800, akpm@linux-foundation.org wrote: > >>> From: Catalin Marinas<catalin.marinas@arm.com> > >>> > >>> Depending on the direction of the transfer, flush_dcache_page() must be > >>> called either before (ATA_TFLAG_WRITE) or after (!ATA_TFLAG_WRITE) the > >>> data copying to avoid D-cache aliasing with user space or I-D cache > >>> coherency issues (when reading data from an ATA device using PIO, the > >>> kernel dirties the D-cache but there is no flush_dcache_page() required on > >>> Harvard architectures). > >>> > >>> This patch allows the ARM boards to use a rootfs on CompactFlash with the > >>> PATA platform driver. > >>> > >>> As Anfei Zhou mentioned in a recent patch ("flush dcache before writing > >>> into page to avoid alias"), on some architectures there may be a > >>> performance benefit in differentiating the flush_dcache_page() calls based > >>> on whether the kernel or the user page needs flushing. > >>> > >>> IMHO, we should differentiate based on the direction (kernel reading or > >>> writing from/to such page). In the ARM case with PIPT Harvard caches > >>> (newer processors), the kernel reading from a page that may be mapped in > >>> user space shouldn't need cache flushing. The kernel writing to such page > >>> would require D-cache flushing because of coherency with the I-cache. > >>> Currently on ARM, the latter happens in both cases. > >>> > >>> Signed-off-by: Catalin Marinas<catalin.marinas@arm.com> > >>> Cc: Jeff Garzik<jgarzik@pobox.com> > >>> Cc: Tejun Heo<tj@kernel.org> > >>> Cc:<stable@kernel.org> > >>> Signed-off-by: Andrew Morton<akpm@linux-foundation.org> > >>> --- > >>> > >>> drivers/ata/libata-sff.c | 6 ++++++ > >>> 1 file changed, 6 insertions(+) > >>> > >>> diff -puN drivers/ata/libata-sff.c~ata-call-flush_dcache_page-around-pio-data-transfers-in-libata-affc drivers/ata/libata-sff.c > >>> --- a/drivers/ata/libata-sff.c~ata-call-flush_dcache_page-around-pio-data-transfers-in-libata-affc > >>> +++ a/drivers/ata/libata-sff.c > >>> @@ -874,6 +874,9 @@ static void ata_pio_sector(struct ata_qu > >>> > >>> DPRINTK("data %s\n", qc->tf.flags& ATA_TFLAG_WRITE ? "write" : "read"); > >>> > >>> + if (do_write) > >>> + flush_dcache_page(page); > >>> + > >> > >> This looks wrong; the upper layers should already have made the page > >> aliases coherent from user to kernel by calling flush_dcache_page (in > >> __get_user_pages()), so the aliases should already be up to date and > >> this flush is spurious. > > > > The upper layers don't know that the CPU touched the data! If the > > driver did a DMA transfer then such a flush is unneeded, so we don't do > > it. > > The patch in question only affects PIO transfers, not DMA. Data is > transferred from a kernel buffer to hardware via out[bwl] via > > page data -> CPU register -> out[bwl] No flush required here ... all the aliases should be up to date after __get_user_pages() > or, data is transferred from hardware to a kernel buffer via > > in[bwl] -> CPU register -> page data This writes to the page via the kernel alias, so the kernel alias becomes dirty. In order to make the underlying page up to date, the kernel alias must be flushed this flush *must* occur while the page is mapped along the kernel alias. In order that the user aliases see the correct data, any speculation they may have done before the kernel alias was flushed must be invalidated (depending on architectures ... some architectures won't speculate reads in this case). This invalidation *should* be occurring further up the stack .. but if the architecture has implemented this invalidation in the dma ops, then flush_dcache_page() may be needed. > So what are the flushing rules given those conditions? See above. James -- To unsubscribe from this list: send the line "unsubscribe linux-ide" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From: Jeff Garzik <jgarzik@pobox.com> Date: Tue, 02 Feb 2010 18:14:40 -0500 > The patch in question only affects PIO transfers, not DMA. Data is > transferred from a kernel buffer to hardware via out[bwl] via > > page data -> CPU register -> out[bwl] > > or, data is transferred from hardware to a kernel buffer via > > in[bwl] -> CPU register -> page data > > So what are the flushing rules given those conditions? Any time you touch a page cache page with the cpu, you have to get it purged from aliasing caches before it gets mapped into userspace or similar. That's the problem on these machines. Actually, sparc64 is probably susceptible to the same problem. And it's not an issue in the IDE layer, know why? :-) In the IDE layer we have arch specific ide_*() interfaces which is where all of this PIO dma flushing stuff deserves to live. ATA should do something similar instead of randomly scattering flush_dcache_page() calls all over the place. -- To unsubscribe from this list: send the line "unsubscribe linux-ide" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> ATA should do something similar instead of randomly scattering > flush_dcache_page() calls all over the place. ATA has an sff_data_xfer callback for standard style interfaces for exactly the same reason that IDE did. That should be the only place that needs touching for most controllers (either the callback point or in the callback). The non-standard ones that do bits of PIO will need fixing in place because they are sufficiently weird they don't fit any standard format. Fortunately they are almost entirely platform specific and/or DMA. Alan -- To unsubscribe from this list: send the line "unsubscribe linux-ide" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, 2010-02-02 at 15:21 -0800, David Miller wrote: > From: Jeff Garzik <jgarzik@pobox.com> > Date: Tue, 02 Feb 2010 18:14:40 -0500 > > > The patch in question only affects PIO transfers, not DMA. Data is > > transferred from a kernel buffer to hardware via out[bwl] via > > > > page data -> CPU register -> out[bwl] > > > > or, data is transferred from hardware to a kernel buffer via > > > > in[bwl] -> CPU register -> page data > > > > So what are the flushing rules given those conditions? > > Any time you touch a page cache page with the cpu, you have > to get it purged from aliasing caches before it gets mapped > into userspace or similar. > > That's the problem on these machines. > > Actually, sparc64 is probably susceptible to the same problem. > > And it's not an issue in the IDE layer, know why? :-) > > In the IDE layer we have arch specific ide_*() interfaces which is > where all of this PIO dma flushing stuff deserves to live. > > ATA should do something similar instead of randomly scattering > flush_dcache_page() calls all over the place. Agree with this, except it shouldn't be ide specific ... any driver that does PIO (we have some examples in SCSI) is vulnerable to this issue and should have the same fix. The ide interfaces are very low level .. they're the ide_in/out[sw] APIs which aren't necessarily the most useful (some of the SCSI "PIO" stuff actually gets fed in via memory transfers to registers rather than port in/out). What it looks like we need is kmap_for_pio() and kunmap_for_pio() which take flags (or API extensions) indicating whether we're transferring to or from the device, which return the correct kernel address to the page (mapping it if it's in highmem) and perform the correct flushes/invalidates. James -- To unsubscribe from this list: send the line "unsubscribe linux-ide" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From: James Bottomley <James.Bottomley@suse.de> Date: Tue, 02 Feb 2010 17:32:48 -0600 > What it looks like we need is kmap_for_pio() and kunmap_for_pio() which > take flags (or API extensions) indicating whether we're transferring to > or from the device, which return the correct kernel address to the page > (mapping it if it's in highmem) and perform the correct > flushes/invalidates. That sounds like a good idea. -- To unsubscribe from this list: send the line "unsubscribe linux-ide" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, 2010-02-02 at 22:58 +0000, James Bottomley wrote: > On Tue, 2010-02-02 at 14:11 -0800, akpm@linux-foundation.org wrote: > > From: Catalin Marinas <catalin.marinas@arm.com> > > > > Depending on the direction of the transfer, flush_dcache_page() must be > > called either before (ATA_TFLAG_WRITE) or after (!ATA_TFLAG_WRITE) the > > data copying to avoid D-cache aliasing with user space or I-D cache > > coherency issues (when reading data from an ATA device using PIO, the > > kernel dirties the D-cache but there is no flush_dcache_page() required on > > Harvard architectures). > > > > This patch allows the ARM boards to use a rootfs on CompactFlash with the > > PATA platform driver. > > > > As Anfei Zhou mentioned in a recent patch ("flush dcache before writing > > into page to avoid alias"), on some architectures there may be a > > performance benefit in differentiating the flush_dcache_page() calls based > > on whether the kernel or the user page needs flushing. > > > > IMHO, we should differentiate based on the direction (kernel reading or > > writing from/to such page). In the ARM case with PIPT Harvard caches > > (newer processors), the kernel reading from a page that may be mapped in > > user space shouldn't need cache flushing. The kernel writing to such page > > would require D-cache flushing because of coherency with the I-cache. > > Currently on ARM, the latter happens in both cases. [...] > > diff -puN drivers/ata/libata-sff.c~ata-call-flush_dcache_page-around-pio-data-transfers-in-libata-affc drivers/ata/libata-sff.c > > --- a/drivers/ata/libata-sff.c~ata-call-flush_dcache_page-around-pio-data-transfers-in-libata-affc > > +++ a/drivers/ata/libata-sff.c > > @@ -874,6 +874,9 @@ static void ata_pio_sector(struct ata_qu > > > > DPRINTK("data %s\n", qc->tf.flags & ATA_TFLAG_WRITE ? "write" : "read"); > > > > + if (do_write) > > + flush_dcache_page(page); > > + > > This looks wrong; the upper layers should already have made the page > aliases coherent from user to kernel by calling flush_dcache_page (in > __get_user_pages()), so the aliases should already be up to date and > this flush is spurious. In my case it was working fine even without this. I just added this for completeness following the cachetlb.txt document. > > if (PageHighMem(page)) { > > unsigned long flags; > > > > @@ -893,6 +896,9 @@ static void ata_pio_sector(struct ata_qu > > do_write); > > } > > > > + if (!do_write) > > + flush_dcache_page(page); > > + > > OK, so this too looks wrong for two reasons > > 1. it's over flushing. Even after the write to the page by the > kernel PIO, the only alias that is dirty should be the kernel, > so this needs a flush_kernel_dcache_page() to empty the kernel > alias. It is possible user space will have speculated over the > user aliases, but there's stuff further up the block stack to > bring this back into coherence. In the ARM case, it's not necessarily the user D-cache aliasing (which on my current hardware does not exist) but the I and D cache coherency. With non-aliasing D-cache on ARM, the flush_kernel_dcache_page() is a no-op. If we consider the I-cache as being an alias, this function would need to be implemented. So far this issue has been handled by flush_dcache_page() (lazily). > 2. If the page really is in highmem, the flush has to happen along > the kernel alias, which you just lost because this flush is > happening after the kunmap_atomic(), so it has to occur > somewhere between the PIO operation and the kunmap. The kunmap_atomic(), at least in the ARM case, seems to do the cache flushing. My cache coherency issues show up with highmem disabled.
On Tue, 2010-02-02 at 23:32 +0000, James Bottomley wrote: > On Tue, 2010-02-02 at 15:21 -0800, David Miller wrote: > > From: Jeff Garzik <jgarzik@pobox.com> > > Date: Tue, 02 Feb 2010 18:14:40 -0500 > > > > > The patch in question only affects PIO transfers, not DMA. Data is > > > transferred from a kernel buffer to hardware via out[bwl] via > > > > > > page data -> CPU register -> out[bwl] > > > > > > or, data is transferred from hardware to a kernel buffer via > > > > > > in[bwl] -> CPU register -> page data > > > > > > So what are the flushing rules given those conditions? > > > > Any time you touch a page cache page with the cpu, you have > > to get it purged from aliasing caches before it gets mapped > > into userspace or similar. > > > > That's the problem on these machines. > > > > Actually, sparc64 is probably susceptible to the same problem. > > > > And it's not an issue in the IDE layer, know why? :-) > > > > In the IDE layer we have arch specific ide_*() interfaces which is > > where all of this PIO dma flushing stuff deserves to live. > > > > ATA should do something similar instead of randomly scattering > > flush_dcache_page() calls all over the place. > > Agree with this, except it shouldn't be ide specific ... any driver that > does PIO (we have some examples in SCSI) is vulnerable to this issue and > should have the same fix. > > The ide interfaces are very low level .. they're the ide_in/out[sw] APIs > which aren't necessarily the most useful (some of the SCSI "PIO" stuff > actually gets fed in via memory transfers to registers rather than port > in/out). > > What it looks like we need is kmap_for_pio() and kunmap_for_pio() which > take flags (or API extensions) indicating whether we're transferring to > or from the device, which return the correct kernel address to the page > (mapping it if it's in highmem) and perform the correct > flushes/invalidates. In a similar patch for dealing with a PIO USB HCD driver (ISP1760) I suggested something like pio_map_single/pio_unmap_single similar to the dma_* API. The problem with the HCD drivers is that the page information is lost when they are called. They only have access to a urb->transfer_buffer pointer and you can never know whether it is a highmem one or not. Hence I think an API that deals directly with pointers rather than page structures would be better. If you are OK with the idea, I'm happy to propose a patch on linux-arch. Thanks for the feedback so far.
On Wed, 2010-02-03 at 10:18 +0000, Catalin Marinas wrote: > On Tue, 2010-02-02 at 23:32 +0000, James Bottomley wrote: > > On Tue, 2010-02-02 at 15:21 -0800, David Miller wrote: > > > From: Jeff Garzik <jgarzik@pobox.com> > > > Date: Tue, 02 Feb 2010 18:14:40 -0500 > > > > > > > The patch in question only affects PIO transfers, not DMA. Data is > > > > transferred from a kernel buffer to hardware via out[bwl] via > > > > > > > > page data -> CPU register -> out[bwl] > > > > > > > > or, data is transferred from hardware to a kernel buffer via > > > > > > > > in[bwl] -> CPU register -> page data > > > > > > > > So what are the flushing rules given those conditions? > > > > > > Any time you touch a page cache page with the cpu, you have > > > to get it purged from aliasing caches before it gets mapped > > > into userspace or similar. > > > > > > That's the problem on these machines. > > > > > > Actually, sparc64 is probably susceptible to the same problem. > > > > > > And it's not an issue in the IDE layer, know why? :-) > > > > > > In the IDE layer we have arch specific ide_*() interfaces which is > > > where all of this PIO dma flushing stuff deserves to live. > > > > > > ATA should do something similar instead of randomly scattering > > > flush_dcache_page() calls all over the place. > > > > Agree with this, except it shouldn't be ide specific ... any driver that > > does PIO (we have some examples in SCSI) is vulnerable to this issue and > > should have the same fix. > > > > The ide interfaces are very low level .. they're the ide_in/out[sw] APIs > > which aren't necessarily the most useful (some of the SCSI "PIO" stuff > > actually gets fed in via memory transfers to registers rather than port > > in/out). > > > > What it looks like we need is kmap_for_pio() and kunmap_for_pio() which > > take flags (or API extensions) indicating whether we're transferring to > > or from the device, which return the correct kernel address to the page > > (mapping it if it's in highmem) and perform the correct > > flushes/invalidates. > > In a similar patch for dealing with a PIO USB HCD driver (ISP1760) I > suggested something like pio_map_single/pio_unmap_single similar to the > dma_* API. The problem with the HCD drivers is that the page information > is lost when they are called. They only have access to a > urb->transfer_buffer pointer and you can never know whether it is a > highmem one or not. Hence I think an API that deals directly with > pointers rather than page structures would be better. Actually, thinking about the semantics, the normal kmap has to have these flushes in the map and unmap anyway ... otherwise the page gets decohered if you do normal read/write on it. The only information we don't have in current kmap is whether we're mapping for read/write or both. So the API I'd propose is keep current kmap as is for the read and write case, and introduce a kmap_for_read and kmap_for_write (with corresponding umaps, and obviously atomics). The fix to libata looks to be just that it should kmap all the time rather than trying to fiddle with the page to see if its higmem. For kmap on a normal page, we should just return the offset map address and do all the flushing. James > If you are OK with the idea, I'm happy to propose a patch on linux-arch. > > Thanks for the feedback so far. > -- To unsubscribe from this list: send the line "unsubscribe linux-ide" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 02/03/2010 11:40 AM, James Bottomley wrote: > The fix to libata looks to be just that it should kmap all the time > rather than trying to fiddle with the page to see if its higmem. For > kmap on a normal page, we should just return the offset map address and > do all the flushing. libata tests PageHighMem() because it was measurably faster to do things the current way (which includes local_irq_save/restore, only for highmem) on boxes where it actually matters. It seems more efficient to add a flush where necessary, than unconditionally punish everyone... Jeff -- To unsubscribe from this list: send the line "unsubscribe linux-ide" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 03 Feb 2010 12:00:58 -0500 Jeff Garzik <jgarzik@pobox.com> wrote: > On 02/03/2010 11:40 AM, James Bottomley wrote: > > The fix to libata looks to be just that it should kmap all the time > > rather than trying to fiddle with the page to see if its higmem. For > > kmap on a normal page, we should just return the offset map address and > > do all the flushing. > > libata tests PageHighMem() because it was measurably faster to do things > the current way (which includes local_irq_save/restore, only for > highmem) on boxes where it actually matters. > > It seems more efficient to add a flush where necessary, than > unconditionally punish everyone... kmap_atomic() tests PageHighMem() too - it's pretty lightweight for lowmem pages. Anyway, I'd draw your attention to this claim in the changelog: "This patch allows the ARM boards to use a rootfs on CompactFlash with the PATA platform driver." Immediate-term, we should be looking for a small fix for this issue which is acceptable for 2.6.33 and 2.6.32 and earlier. -- To unsubscribe from this list: send the line "unsubscribe linux-ide" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From: James Bottomley <James.Bottomley@suse.de> Date: Wed, 03 Feb 2010 10:40:54 -0600 > Actually, thinking about the semantics, the normal kmap has to have > these flushes in the map and unmap anyway ... otherwise the page gets > decohered if you do normal read/write on it. The only information we > don't have in current kmap is whether we're mapping for read/write or > both. So the API I'd propose is keep current kmap as is for the read > and write case, and introduce a kmap_for_read and kmap_for_write (with > corresponding umaps, and obviously atomics). This would basically force non-highmem platforms to implement the kmap_*() interfaces. I understand that MIPS does this already to deal with cache aliasing. But that's currently a choice, and sparc64 for example has no need to do things that way. It would in fact be more expensive than how sparc64 currently handles cache aliasing. So I'm more in support of a new interface. Platforms like sparc64 can implement it and then get rid of the ide_*() string op cache flushes and instead we put the new API calls in the right spots. -- To unsubscribe from this list: send the line "unsubscribe linux-ide" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 02/03/2010 12:06 PM, Andrew Morton wrote: > On Wed, 03 Feb 2010 12:00:58 -0500 Jeff Garzik<jgarzik@pobox.com> wrote: > >> On 02/03/2010 11:40 AM, James Bottomley wrote: >>> The fix to libata looks to be just that it should kmap all the time >>> rather than trying to fiddle with the page to see if its higmem. For >>> kmap on a normal page, we should just return the offset map address and >>> do all the flushing. >> >> libata tests PageHighMem() because it was measurably faster to do things >> the current way (which includes local_irq_save/restore, only for >> highmem) on boxes where it actually matters. >> >> It seems more efficient to add a flush where necessary, than >> unconditionally punish everyone... > > kmap_atomic() tests PageHighMem() too - it's pretty lightweight for > lowmem pages. Note the lack of local_irq_save/restore in our code, though... These PIO xfers are __slow__, from the perspective of a CPU manufactured in the past decade; you are definitely disabling local interrupts for a long time. I suppose we could do if (high mem) local irq save kmap xfer kunmap local irq restore else kmap xfer kunmap does that solve the problem for ARM, for 2.6.33? > Anyway, I'd draw your attention to this claim in the changelog: "This > patch allows the ARM boards to use a rootfs on CompactFlash with the > PATA platform driver." Immediate-term, we should be looking for a small > fix for this issue which is acceptable for 2.6.33 and 2.6.32 and earlier. Sure... see above. hopefully one that does not punish -everybody- though. It would be sad to unconditionally slow down millions of volume platform (read: x86) users for some embedded boards. Jeff -- To unsubscribe from this list: send the line "unsubscribe linux-ide" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 03 Feb 2010 12:15:46 -0500 Jeff Garzik <jgarzik@pobox.com> wrote: > On 02/03/2010 12:06 PM, Andrew Morton wrote: > > On Wed, 03 Feb 2010 12:00:58 -0500 Jeff Garzik<jgarzik@pobox.com> wrote: > > > >> On 02/03/2010 11:40 AM, James Bottomley wrote: > >>> The fix to libata looks to be just that it should kmap all the time > >>> rather than trying to fiddle with the page to see if its higmem. For > >>> kmap on a normal page, we should just return the offset map address and > >>> do all the flushing. > >> > >> libata tests PageHighMem() because it was measurably faster to do things > >> the current way (which includes local_irq_save/restore, only for > >> highmem) on boxes where it actually matters. > >> > >> It seems more efficient to add a flush where necessary, than > >> unconditionally punish everyone... > > > > kmap_atomic() tests PageHighMem() too - it's pretty lightweight for > > lowmem pages. > > Note the lack of local_irq_save/restore in our code, though... These > PIO xfers are __slow__, from the perspective of a CPU manufactured in > the past decade; you are definitely disabling local interrupts for a > long time. I suppose we could do > > if (high mem) > local irq save > kmap > xfer > kunmap > local irq restore > else > kmap > xfer > kunmap > > does that solve the problem for ARM, for 2.6.33? > It's unclear (to me) why that code is using KM_IRQ0 at all. Can't it use a non-irq kmap slot? > > Anyway, I'd draw your attention to this claim in the changelog: "This > > patch allows the ARM boards to use a rootfs on CompactFlash with the > > PATA platform driver." Immediate-term, we should be looking for a small > > fix for this issue which is acceptable for 2.6.33 and 2.6.32 and earlier. > > Sure... see above. hopefully one that does not punish -everybody- > though. It would be sad to unconditionally slow down millions of volume > platform (read: x86) users for some embedded boards. Well. ata-call-flush_dcache_page-around-pio-data-transfers-in-libata-affc.patch is a no-op on x86. It only has an effect on architectures which implement flush_dcache_page(). And I expect flush_dcache_page() is pretty light on those architectures, when compared with a PIO-mode transfer. -- To unsubscribe from this list: send the line "unsubscribe linux-ide" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 02/03/2010 12:20 PM, Andrew Morton wrote: > On Wed, 03 Feb 2010 12:15:46 -0500 Jeff Garzik<jgarzik@pobox.com> wrote: > >> On 02/03/2010 12:06 PM, Andrew Morton wrote: >>> On Wed, 03 Feb 2010 12:00:58 -0500 Jeff Garzik<jgarzik@pobox.com> wrote: >>> >>>> On 02/03/2010 11:40 AM, James Bottomley wrote: >>>>> The fix to libata looks to be just that it should kmap all the time >>>>> rather than trying to fiddle with the page to see if its higmem. For >>>>> kmap on a normal page, we should just return the offset map address and >>>>> do all the flushing. >>>> >>>> libata tests PageHighMem() because it was measurably faster to do things >>>> the current way (which includes local_irq_save/restore, only for >>>> highmem) on boxes where it actually matters. >>>> >>>> It seems more efficient to add a flush where necessary, than >>>> unconditionally punish everyone... >>> >>> kmap_atomic() tests PageHighMem() too - it's pretty lightweight for >>> lowmem pages. >> >> Note the lack of local_irq_save/restore in our code, though... These >> PIO xfers are __slow__, from the perspective of a CPU manufactured in >> the past decade; you are definitely disabling local interrupts for a >> long time. I suppose we could do >> >> if (high mem) >> local irq save >> kmap >> xfer >> kunmap >> local irq restore >> else >> kmap >> xfer >> kunmap >> >> does that solve the problem for ARM, for 2.6.33? >> > > It's unclear (to me) why that code is using KM_IRQ0 at all. Can't it > use a non-irq kmap slot? libata may have to transfer data in response to an interrupt. That is normal interrupt-driven PIO -- although it should be noted that the code supports polled PIO as well. >>> Anyway, I'd draw your attention to this claim in the changelog: "This >>> patch allows the ARM boards to use a rootfs on CompactFlash with the >>> PATA platform driver." Immediate-term, we should be looking for a small >>> fix for this issue which is acceptable for 2.6.33 and 2.6.32 and earlier. >> >> Sure... see above. hopefully one that does not punish -everybody- >> though. It would be sad to unconditionally slow down millions of volume >> platform (read: x86) users for some embedded boards. > > Well. > ata-call-flush_dcache_page-around-pio-data-transfers-in-libata-affc.patch > is a no-op on x86. It only has an effect on architectures which > implement flush_dcache_page(). And I expect flush_dcache_page() is > pretty light on those architectures, when compared with a PIO-mode > transfer. I don't object to the patch... as long as the arch people are happy. Arch people seem to be the ones complaining, though. Jeff -- To unsubscribe from this list: send the line "unsubscribe linux-ide" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 2010-02-03 at 12:29 -0500, Jeff Garzik wrote: > On 02/03/2010 12:20 PM, Andrew Morton wrote: > > On Wed, 03 Feb 2010 12:15:46 -0500 Jeff Garzik<jgarzik@pobox.com> wrote: > > > >> On 02/03/2010 12:06 PM, Andrew Morton wrote: > >>> On Wed, 03 Feb 2010 12:00:58 -0500 Jeff Garzik<jgarzik@pobox.com> wrote: > >>> > >>>> On 02/03/2010 11:40 AM, James Bottomley wrote: > >>>>> The fix to libata looks to be just that it should kmap all the time > >>>>> rather than trying to fiddle with the page to see if its higmem. For > >>>>> kmap on a normal page, we should just return the offset map address and > >>>>> do all the flushing. > >>>> > >>>> libata tests PageHighMem() because it was measurably faster to do things > >>>> the current way (which includes local_irq_save/restore, only for > >>>> highmem) on boxes where it actually matters. > >>>> > >>>> It seems more efficient to add a flush where necessary, than > >>>> unconditionally punish everyone... > >>> > >>> kmap_atomic() tests PageHighMem() too - it's pretty lightweight for > >>> lowmem pages. > >> > >> Note the lack of local_irq_save/restore in our code, though... These > >> PIO xfers are __slow__, from the perspective of a CPU manufactured in > >> the past decade; you are definitely disabling local interrupts for a > >> long time. I suppose we could do > >> > >> if (high mem) > >> local irq save > >> kmap > >> xfer > >> kunmap > >> local irq restore > >> else > >> kmap > >> xfer > >> kunmap > >> > >> does that solve the problem for ARM, for 2.6.33? > >> > > > > It's unclear (to me) why that code is using KM_IRQ0 at all. Can't it > > use a non-irq kmap slot? > > libata may have to transfer data in response to an interrupt. That is > normal interrupt-driven PIO -- although it should be noted that the code > supports polled PIO as well. > > > >>> Anyway, I'd draw your attention to this claim in the changelog: "This > >>> patch allows the ARM boards to use a rootfs on CompactFlash with the > >>> PATA platform driver." Immediate-term, we should be looking for a small > >>> fix for this issue which is acceptable for 2.6.33 and 2.6.32 and earlier. > >> > >> Sure... see above. hopefully one that does not punish -everybody- > >> though. It would be sad to unconditionally slow down millions of volume > >> platform (read: x86) users for some embedded boards. > > > > Well. > > ata-call-flush_dcache_page-around-pio-data-transfers-in-libata-affc.patch > > is a no-op on x86. It only has an effect on architectures which > > implement flush_dcache_page(). And I expect flush_dcache_page() is > > pretty light on those architectures, when compared with a PIO-mode > > transfer. > > I don't object to the patch... as long as the arch people are happy. > Arch people seem to be the ones complaining, though Apart from the contents, which is looking like sprinkle mainline with random flushes, I'm unhappy that something which could be better implemented by thinking about the problem is now being rammed through as a must have bug fix. We got this piece of crap in the same way: commit 2d4dc890b5c8fabd818a8586607e6843c4375e62 Author: Ilya Loginov <isloginov@gmail.com> Date: Thu Nov 26 09:16:19 2009 +0100 block: add helpers to run flush_dcache_page() against a bio and a request's pages Which is another race to flush everywhere until my coherence problem goes away. This scattershot approach to coherency is really damaging in the long term because all these random flushes are going to mask real problems we may or may not have in the arch APIs ... and worse, they'll mask mostly ... there'll likely be times when the masking is incomplete and we're left with incredibly hard to debug data loss or binary crashes. James -- To unsubscribe from this list: send the line "unsubscribe linux-ide" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> if (high mem) > local irq save > kmap > xfer > kunmap > local irq restore At the very least you need to do irq save kmap copy to local buffer kunmap local irq restore xfer and the mirror image for read. Alan -- To unsubscribe from this list: send the line "unsubscribe linux-ide" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
And indeed there was a patch I proposed in 2008 for this bounce
buffer latency: See the archive
Date: Fri, 29 Feb 2008 13:51:06 +0000
From: Alan Cox <alan@lxorguk.ukuu.org.uk>
To: linux-ide@vger.kernel.org, jeff@garzik.org
Subject: [RFC PATCH] libata: PIO via bounce buffer
although it doesn't deal with the dcache coherency issue and seems to
need a little tweaking to apply due to other changes
Alan
--
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 2010-02-03 at 17:15 +0000, Jeff Garzik wrote: > On 02/03/2010 12:06 PM, Andrew Morton wrote: > > On Wed, 03 Feb 2010 12:00:58 -0500 Jeff Garzik<jgarzik@pobox.com> wrote: > > > >> On 02/03/2010 11:40 AM, James Bottomley wrote: > >>> The fix to libata looks to be just that it should kmap all the time > >>> rather than trying to fiddle with the page to see if its higmem. For > >>> kmap on a normal page, we should just return the offset map address and > >>> do all the flushing. > >> > >> libata tests PageHighMem() because it was measurably faster to do things > >> the current way (which includes local_irq_save/restore, only for > >> highmem) on boxes where it actually matters. > >> > >> It seems more efficient to add a flush where necessary, than > >> unconditionally punish everyone... > > > > kmap_atomic() tests PageHighMem() too - it's pretty lightweight for > > lowmem pages. > > Note the lack of local_irq_save/restore in our code, though... These > PIO xfers are __slow__, from the perspective of a CPU manufactured in > the past decade; you are definitely disabling local interrupts for a > long time. I suppose we could do > > if (high mem) > local irq save > kmap > xfer > kunmap > local irq restore > else > kmap > xfer > kunmap > > does that solve the problem for ARM, for 2.6.33? No, it doesn't. If the page isn't a highmem one, the kunmap doesn't do anything to the caches (and I also have highmem disabled). My logic was based on a statement in cachetlb.txt that flush_dcache_page() should be called if the kernel writes to a page cache page, hence the patch (as it was pointed out, the first call to flush_dcache_page() isn't needed).
On 02/03/2010 12:46 PM, Alan Cox wrote: > And indeed there was a patch I proposed in 2008 for this bounce > buffer latency: See the archive > > Date: Fri, 29 Feb 2008 13:51:06 +0000 > From: Alan Cox<alan@lxorguk.ukuu.org.uk> > To: linux-ide@vger.kernel.org, jeff@garzik.org > Subject: [RFC PATCH] libata: PIO via bounce buffer > > although it doesn't deal with the dcache coherency issue and seems to > need a little tweaking to apply due to other changes Yeah, I definitely liked the idea. I wonder if we could do the allocation during port_start rather than at the time of bounce? One maximally-sized buffer per device should not be too punishing on the system, IMO. In any case, it is a nice approach to pursue. Jeff -- To unsubscribe from this list: send the line "unsubscribe linux-ide" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 02/03/2010 12:49 PM, Catalin Marinas wrote: > No, it doesn't. If the page isn't a highmem one, the kunmap doesn't do > anything to the caches (and I also have highmem disabled). My logic was > based on a statement in cachetlb.txt that flush_dcache_page() should be > called if the kernel writes to a page cache page, hence the patch (as it > was pointed out, the first call to flush_dcache_page() isn't needed). Well, if those are the API rules, libata certainly must follow them... Jeff -- To unsubscribe from this list: send the line "unsubscribe linux-ide" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 03 Feb 2010 12:52:34 -0500 Jeff Garzik <jgarzik@pobox.com> wrote: > On 02/03/2010 12:46 PM, Alan Cox wrote: > > And indeed there was a patch I proposed in 2008 for this bounce > > buffer latency: See the archive > > > > Date: Fri, 29 Feb 2008 13:51:06 +0000 > > From: Alan Cox<alan@lxorguk.ukuu.org.uk> > > To: linux-ide@vger.kernel.org, jeff@garzik.org > > Subject: [RFC PATCH] libata: PIO via bounce buffer > > > > although it doesn't deal with the dcache coherency issue and seems to > > need a little tweaking to apply due to other changes > > Yeah, I definitely liked the idea. I wonder if we could do the > allocation during port_start rather than at the time of bounce? I actually got better numbers using kmalloc - no idea why - perhaps we get a hot page ? -- To unsubscribe from this list: send the line "unsubscribe linux-ide" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 02/03/2010 01:00 PM, Alan Cox wrote: > On Wed, 03 Feb 2010 12:52:34 -0500 > Jeff Garzik<jgarzik@pobox.com> wrote: > >> On 02/03/2010 12:46 PM, Alan Cox wrote: >>> And indeed there was a patch I proposed in 2008 for this bounce >>> buffer latency: See the archive >>> >>> Date: Fri, 29 Feb 2008 13:51:06 +0000 >>> From: Alan Cox<alan@lxorguk.ukuu.org.uk> >>> To: linux-ide@vger.kernel.org, jeff@garzik.org >>> Subject: [RFC PATCH] libata: PIO via bounce buffer >>> >>> although it doesn't deal with the dcache coherency issue and seems to >>> need a little tweaking to apply due to other changes >> >> Yeah, I definitely liked the idea. I wonder if we could do the >> allocation during port_start rather than at the time of bounce? > > I actually got better numbers using kmalloc - no idea why - perhaps we > get a hot page ? Seems logical. It surprises me, though, that being lockless and completely avoiding the kmalloc infrastructure wasn't a bigger win. Was this measured on UP? If your approach has the lowest cost, let's get it in... I like decisions informed by hard data. :) Jeff -- To unsubscribe from this list: send the line "unsubscribe linux-ide" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 2010-02-03 at 17:39 +0000, James Bottomley wrote: > On Wed, 2010-02-03 at 12:29 -0500, Jeff Garzik wrote: > > On 02/03/2010 12:20 PM, Andrew Morton wrote: > > > On Wed, 03 Feb 2010 12:15:46 -0500 Jeff Garzik<jgarzik@pobox.com> wrote: > > >> On 02/03/2010 12:06 PM, Andrew Morton wrote: > > >>> Anyway, I'd draw your attention to this claim in the changelog: "This > > >>> patch allows the ARM boards to use a rootfs on CompactFlash with the > > >>> PATA platform driver." Immediate-term, we should be looking for a small > > >>> fix for this issue which is acceptable for 2.6.33 and 2.6.32 and earlier. > > >> > > >> Sure... see above. hopefully one that does not punish -everybody- > > >> though. It would be sad to unconditionally slow down millions of volume > > >> platform (read: x86) users for some embedded boards. > > > > > > Well. > > > ata-call-flush_dcache_page-around-pio-data-transfers-in-libata-affc.patch > > > is a no-op on x86. It only has an effect on architectures which > > > implement flush_dcache_page(). And I expect flush_dcache_page() is > > > pretty light on those architectures, when compared with a PIO-mode > > > transfer. > > > > I don't object to the patch... as long as the arch people are happy. > > Arch people seem to be the ones complaining, though > > Apart from the contents, which is looking like sprinkle mainline with > random flushes, I'm unhappy that something which could be better > implemented by thinking about the problem is now being rammed through as > a must have bug fix. > > We got this piece of crap in the same way: > > commit 2d4dc890b5c8fabd818a8586607e6843c4375e62 > Author: Ilya Loginov <isloginov@gmail.com> > Date: Thu Nov 26 09:16:19 2009 +0100 > > block: add helpers to run flush_dcache_page() against a bio and a > request's pages > > Which is another race to flush everywhere until my coherence problem > goes away. > > This scattershot approach to coherency is really damaging in the long > term because all these random flushes are going to mask real problems we > may or may not have in the arch APIs ... and worse, they'll mask > mostly ... there'll likely be times when the masking is incomplete and > we're left with incredibly hard to debug data loss or binary crashes. I agree that this could be solved in a better way and I'm in favour of API improvements. The current API and description in cachetlb.txt suggest that flush_dcache_page() is the best approach when modifying page cache pages. AFAICT, there are two main use cases of flush_dcache_page() (could be more but that's what affects ARM): (1) filesystems modifying page cache pages (NFS, cramfs etc.) and (2) drivers doing PIO (block, HCD) that may write directly to page cache pages. Point (2) is a newer addition (as the one above) to fix coherency problems on Harvard architectures. As long as you use filesystems like cramfs that call flush_dcache_page(), the PIO block driver doesn't need to do any flushing. That's one of the reasons why MTD didn't need any for a long time. Once you start using filesystems like ext2, there is no flush_dcache_page() called by the VFS layer, so we thought about moving this into the driver (for a long time I had a dirty hack in mpage_end_io_read() to call flush_dcache_page()). If we state that flush_dcache_page() should not be used in driver code, than we need additional API for this (like pio_map_page etc. to be in line with the dma_map_* functions). This would allow architectures to implement them however they like. I can go on linux-arch with a proposed patch if this sounds acceptable. An alternative may be to force this flushing in kunmap(). Is it guaranteed that page cache pages are only accessed via kmap/kunmap? There are a few cases in the kernel where the code checks whether the page is a highmem one before using these functions. Regards.
On Thu, 2010-02-04 at 14:33 +0000, Catalin Marinas wrote: > On Wed, 2010-02-03 at 17:39 +0000, James Bottomley wrote: > > On Wed, 2010-02-03 at 12:29 -0500, Jeff Garzik wrote: > > > On 02/03/2010 12:20 PM, Andrew Morton wrote: > > > > On Wed, 03 Feb 2010 12:15:46 -0500 Jeff Garzik<jgarzik@pobox.com> wrote: > > > >> On 02/03/2010 12:06 PM, Andrew Morton wrote: > > > >>> Anyway, I'd draw your attention to this claim in the changelog: "This > > > >>> patch allows the ARM boards to use a rootfs on CompactFlash with the > > > >>> PATA platform driver." Immediate-term, we should be looking for a small > > > >>> fix for this issue which is acceptable for 2.6.33 and 2.6.32 and earlier. > > > >> > > > >> Sure... see above. hopefully one that does not punish -everybody- > > > >> though. It would be sad to unconditionally slow down millions of volume > > > >> platform (read: x86) users for some embedded boards. > > > > > > > > Well. > > > > ata-call-flush_dcache_page-around-pio-data-transfers-in-libata-affc.patch > > > > is a no-op on x86. It only has an effect on architectures which > > > > implement flush_dcache_page(). And I expect flush_dcache_page() is > > > > pretty light on those architectures, when compared with a PIO-mode > > > > transfer. > > > > > > I don't object to the patch... as long as the arch people are happy. > > > Arch people seem to be the ones complaining, though > > > > Apart from the contents, which is looking like sprinkle mainline with > > random flushes, I'm unhappy that something which could be better > > implemented by thinking about the problem is now being rammed through as > > a must have bug fix. > > > > We got this piece of crap in the same way: > > > > commit 2d4dc890b5c8fabd818a8586607e6843c4375e62 > > Author: Ilya Loginov <isloginov@gmail.com> > > Date: Thu Nov 26 09:16:19 2009 +0100 > > > > block: add helpers to run flush_dcache_page() against a bio and a > > request's pages > > > > Which is another race to flush everywhere until my coherence problem > > goes away. > > > > This scattershot approach to coherency is really damaging in the long > > term because all these random flushes are going to mask real problems we > > may or may not have in the arch APIs ... and worse, they'll mask > > mostly ... there'll likely be times when the masking is incomplete and > > we're left with incredibly hard to debug data loss or binary crashes. > > I agree that this could be solved in a better way and I'm in favour of > API improvements. The current API and description in cachetlb.txt > suggest that flush_dcache_page() is the best approach when modifying > page cache pages. > > AFAICT, there are two main use cases of flush_dcache_page() (could be > more but that's what affects ARM): (1) filesystems modifying page cache > pages (NFS, cramfs etc.) and (2) drivers doing PIO (block, HCD) that may > write directly to page cache pages. Point (2) is a newer addition (as > the one above) to fix coherency problems on Harvard architectures. > > As long as you use filesystems like cramfs that call > flush_dcache_page(), the PIO block driver doesn't need to do any > flushing. That's one of the reasons why MTD didn't need any for a long > time. Once you start using filesystems like ext2, there is no > flush_dcache_page() called by the VFS layer, so we thought about moving > this into the driver (for a long time I had a dirty hack in > mpage_end_io_read() to call flush_dcache_page()). > > If we state that flush_dcache_page() should not be used in driver code, > than we need additional API for this (like pio_map_page etc. to be in > line with the dma_map_* functions). This would allow architectures to > implement them however they like. > > I can go on linux-arch with a proposed patch if this sounds acceptable. I think I'd prefer some type of kmap_ variant. The reason is that all the semantics are going to be the same as kmap ... down to having the slots for atomic. All we need is the extra flag giving the direction of transfer and, of course, what you get back is a virtual address pointer (all the dma operations return physical address handles). > An alternative may be to force this flushing in kunmap(). Is it > guaranteed that page cache pages are only accessed via kmap/kunmap? > There are a few cases in the kernel where the code checks whether the > page is a highmem one before using these functions. So code which checks for higmem really shouldn't. However DaveM makes the valid point that this would be problematic on platforms that have no highmem. But an additional variant like kmap_pio which all platforms can intercept looks more acceptable. James -- To unsubscribe from this list: send the line "unsubscribe linux-ide" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 2010-02-04 at 15:01 +0000, James Bottomley wrote: > On Thu, 2010-02-04 at 14:33 +0000, Catalin Marinas wrote: > > On Wed, 2010-02-03 at 17:39 +0000, James Bottomley wrote: > > > On Wed, 2010-02-03 at 12:29 -0500, Jeff Garzik wrote: > > > > On 02/03/2010 12:20 PM, Andrew Morton wrote: > > > > > On Wed, 03 Feb 2010 12:15:46 -0500 Jeff Garzik<jgarzik@pobox.com> wrote: > > > > >> On 02/03/2010 12:06 PM, Andrew Morton wrote: > > > > >>> Anyway, I'd draw your attention to this claim in the changelog: "This > > > > >>> patch allows the ARM boards to use a rootfs on CompactFlash with the > > > > >>> PATA platform driver." Immediate-term, we should be looking for a small > > > > >>> fix for this issue which is acceptable for 2.6.33 and 2.6.32 and earlier. > > > > >> > > > > >> Sure... see above. hopefully one that does not punish -everybody- > > > > >> though. It would be sad to unconditionally slow down millions of volume > > > > >> platform (read: x86) users for some embedded boards. > > > > > > > > > > Well. > > > > > ata-call-flush_dcache_page-around-pio-data-transfers-in-libata-affc.patch > > > > > is a no-op on x86. It only has an effect on architectures which > > > > > implement flush_dcache_page(). And I expect flush_dcache_page() is > > > > > pretty light on those architectures, when compared with a PIO-mode > > > > > transfer. > > > > > > > > I don't object to the patch... as long as the arch people are happy. > > > > Arch people seem to be the ones complaining, though > > > > > > Apart from the contents, which is looking like sprinkle mainline with > > > random flushes, I'm unhappy that something which could be better > > > implemented by thinking about the problem is now being rammed through as > > > a must have bug fix. > > > > > > We got this piece of crap in the same way: > > > > > > commit 2d4dc890b5c8fabd818a8586607e6843c4375e62 > > > Author: Ilya Loginov <isloginov@gmail.com> > > > Date: Thu Nov 26 09:16:19 2009 +0100 > > > > > > block: add helpers to run flush_dcache_page() against a bio and a > > > request's pages > > > > > > Which is another race to flush everywhere until my coherence problem > > > goes away. > > > > > > This scattershot approach to coherency is really damaging in the long > > > term because all these random flushes are going to mask real problems we > > > may or may not have in the arch APIs ... and worse, they'll mask > > > mostly ... there'll likely be times when the masking is incomplete and > > > we're left with incredibly hard to debug data loss or binary crashes. > > > > I agree that this could be solved in a better way and I'm in favour of > > API improvements. The current API and description in cachetlb.txt > > suggest that flush_dcache_page() is the best approach when modifying > > page cache pages. > > > > AFAICT, there are two main use cases of flush_dcache_page() (could be > > more but that's what affects ARM): (1) filesystems modifying page cache > > pages (NFS, cramfs etc.) and (2) drivers doing PIO (block, HCD) that may > > write directly to page cache pages. Point (2) is a newer addition (as > > the one above) to fix coherency problems on Harvard architectures. > > > > As long as you use filesystems like cramfs that call > > flush_dcache_page(), the PIO block driver doesn't need to do any > > flushing. That's one of the reasons why MTD didn't need any for a long > > time. Once you start using filesystems like ext2, there is no > > flush_dcache_page() called by the VFS layer, so we thought about moving > > this into the driver (for a long time I had a dirty hack in > > mpage_end_io_read() to call flush_dcache_page()). > > > > If we state that flush_dcache_page() should not be used in driver code, > > than we need additional API for this (like pio_map_page etc. to be in > > line with the dma_map_* functions). This would allow architectures to > > implement them however they like. > > > > I can go on linux-arch with a proposed patch if this sounds acceptable. > > I think I'd prefer some type of kmap_ variant. The reason is that all > the semantics are going to be the same as kmap ... down to having the > slots for atomic. All we need is the extra flag giving the direction of > transfer and, of course, what you get back is a virtual address pointer > (all the dma operations return physical address handles). Would a kmap_pio API always require a page structure as argument? There is a similar situation for HCD drivers and USB mass storage which I raised recently. The idea that the USB mass storage code should do the cache flushing was rejected as this has to be pushed down to the HCD driver. However, the HCD driver doesn't get page information, only a urb->transfer_buffer pointer to which it may do either DMA or PIO (both may actually happen in the same driver as an optimisation). If we enforce the use of a kmap_pio API in the USB mass storage, it wouldn't be optimal since this code is not aware of whether the HCD driver would do PIO or DMA. Looking at the HCD drivers, there are more similarities to the DMA API like pio_map_page, pio_map_single etc. I personally don't care much about how the API is named but rather where such API should be called from. Regards.
On Thu, 2010-02-04 at 15:39 +0000, Catalin Marinas wrote: > On Thu, 2010-02-04 at 15:01 +0000, James Bottomley wrote: > > On Thu, 2010-02-04 at 14:33 +0000, Catalin Marinas wrote: > > > On Wed, 2010-02-03 at 17:39 +0000, James Bottomley wrote: > > > > On Wed, 2010-02-03 at 12:29 -0500, Jeff Garzik wrote: > > > > > On 02/03/2010 12:20 PM, Andrew Morton wrote: > > > > > > On Wed, 03 Feb 2010 12:15:46 -0500 Jeff Garzik<jgarzik@pobox.com> wrote: > > > > > >> On 02/03/2010 12:06 PM, Andrew Morton wrote: > > > > > >>> Anyway, I'd draw your attention to this claim in the changelog: "This > > > > > >>> patch allows the ARM boards to use a rootfs on CompactFlash with the > > > > > >>> PATA platform driver." Immediate-term, we should be looking for a small > > > > > >>> fix for this issue which is acceptable for 2.6.33 and 2.6.32 and earlier. > > > > > >> > > > > > >> Sure... see above. hopefully one that does not punish -everybody- > > > > > >> though. It would be sad to unconditionally slow down millions of volume > > > > > >> platform (read: x86) users for some embedded boards. > > > > > > > > > > > > Well. > > > > > > ata-call-flush_dcache_page-around-pio-data-transfers-in-libata-affc.patch > > > > > > is a no-op on x86. It only has an effect on architectures which > > > > > > implement flush_dcache_page(). And I expect flush_dcache_page() is > > > > > > pretty light on those architectures, when compared with a PIO-mode > > > > > > transfer. > > > > > > > > > > I don't object to the patch... as long as the arch people are happy. > > > > > Arch people seem to be the ones complaining, though > > > > > > > > Apart from the contents, which is looking like sprinkle mainline with > > > > random flushes, I'm unhappy that something which could be better > > > > implemented by thinking about the problem is now being rammed through as > > > > a must have bug fix. > > > > > > > > We got this piece of crap in the same way: > > > > > > > > commit 2d4dc890b5c8fabd818a8586607e6843c4375e62 > > > > Author: Ilya Loginov <isloginov@gmail.com> > > > > Date: Thu Nov 26 09:16:19 2009 +0100 > > > > > > > > block: add helpers to run flush_dcache_page() against a bio and a > > > > request's pages > > > > > > > > Which is another race to flush everywhere until my coherence problem > > > > goes away. > > > > > > > > This scattershot approach to coherency is really damaging in the long > > > > term because all these random flushes are going to mask real problems we > > > > may or may not have in the arch APIs ... and worse, they'll mask > > > > mostly ... there'll likely be times when the masking is incomplete and > > > > we're left with incredibly hard to debug data loss or binary crashes. > > > > > > I agree that this could be solved in a better way and I'm in favour of > > > API improvements. The current API and description in cachetlb.txt > > > suggest that flush_dcache_page() is the best approach when modifying > > > page cache pages. > > > > > > AFAICT, there are two main use cases of flush_dcache_page() (could be > > > more but that's what affects ARM): (1) filesystems modifying page cache > > > pages (NFS, cramfs etc.) and (2) drivers doing PIO (block, HCD) that may > > > write directly to page cache pages. Point (2) is a newer addition (as > > > the one above) to fix coherency problems on Harvard architectures. > > > > > > As long as you use filesystems like cramfs that call > > > flush_dcache_page(), the PIO block driver doesn't need to do any > > > flushing. That's one of the reasons why MTD didn't need any for a long > > > time. Once you start using filesystems like ext2, there is no > > > flush_dcache_page() called by the VFS layer, so we thought about moving > > > this into the driver (for a long time I had a dirty hack in > > > mpage_end_io_read() to call flush_dcache_page()). > > > > > > If we state that flush_dcache_page() should not be used in driver code, > > > than we need additional API for this (like pio_map_page etc. to be in > > > line with the dma_map_* functions). This would allow architectures to > > > implement them however they like. > > > > > > I can go on linux-arch with a proposed patch if this sounds acceptable. > > > > I think I'd prefer some type of kmap_ variant. The reason is that all > > the semantics are going to be the same as kmap ... down to having the > > slots for atomic. All we need is the extra flag giving the direction of > > transfer and, of course, what you get back is a virtual address pointer > > (all the dma operations return physical address handles). > > Would a kmap_pio API always require a page structure as argument? This seems a reasonable requirement. > There is a similar situation for HCD drivers and USB mass storage which > I raised recently. The idea that the USB mass storage code should do the > cache flushing was rejected as this has to be pushed down to the HCD > driver. However, the HCD driver doesn't get page information, only a > urb->transfer_buffer pointer to which it may do either DMA or PIO (both > may actually happen in the same driver as an optimisation). You'd want to do dma_map or kmap at a point after the DMA or PIO decision is made. > If we enforce the use of a kmap_pio API in the USB mass storage, it > wouldn't be optimal since this code is not aware of whether the HCD > driver would do PIO or DMA. So, if the decision isn't made until the HCD driver, then surely this belongs there. > Looking at the HCD drivers, there are more similarities to the DMA API > like pio_map_page, pio_map_single etc. I personally don't care much > about how the API is named but rather where such API should be called > from. the API usage is similar to the dma one. The point I was making is the semantics are similar to kmap because of the slot and atomic/non atomic requirements. James -- To unsubscribe from this list: send the line "unsubscribe linux-ide" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff -puN drivers/ata/libata-sff.c~ata-call-flush_dcache_page-around-pio-data-transfers-in-libata-affc drivers/ata/libata-sff.c --- a/drivers/ata/libata-sff.c~ata-call-flush_dcache_page-around-pio-data-transfers-in-libata-affc +++ a/drivers/ata/libata-sff.c @@ -874,6 +874,9 @@ static void ata_pio_sector(struct ata_qu DPRINTK("data %s\n", qc->tf.flags & ATA_TFLAG_WRITE ? "write" : "read"); + if (do_write) + flush_dcache_page(page); + if (PageHighMem(page)) { unsigned long flags; @@ -893,6 +896,9 @@ static void ata_pio_sector(struct ata_qu do_write); } + if (!do_write) + flush_dcache_page(page); + qc->curbytes += qc->sect_size; qc->cursg_ofs += qc->sect_size;