diff mbox

POWERPC: MTD: Add cached map support to physmap_of MTD driver

Message ID 1229365518-20538-1-git-send-email-tpiepho@freescale.com
State New, archived
Headers show

Commit Message

Trent Piepho Dec. 15, 2008, 6:25 p.m. UTC
The MTD system supports operation where a direct mapped flash chip is
mapped twice.  The normal mapping is a standard ioremap(), which is
non-cached and guarded on powerpc.  The second mapping is used only for
reads and can be cached and non-guarded.  Currently, only the pxa2xx
mapping driver makes use of this feature.  This patch adds support to the
physmap_of driver on PPC32 platforms for this cached mapping mode.

Because the flash chip doesn't participate in the cache coherency protocol,
it's necessary to invalidate the cache for parts of flash that are modified
with a program or erase operation.  This is platform specific, for instance
the pxa2xx driver uses an ARM specific function.  This patch adds
invalidate_dcache_icache_range() for PPC32 and uses it.  Because of XIP,
it's entirely possible that the flash might be in the icache(*), so the
existing invalidate_dcache_range() function isn't enough.

Of course, a cached mapping can increase performance if the data is read
from cache instead of flash.  But less obvious is that it can provide a
significant performance increase for cold-cache reads that still come from
flash.  It allows efficient back-to-back reads and if the flash chip &
controller support page burst mode, it allows that to be used as well.

The figures are for *cold-cache* read performance, measured on a Freescale
MPC8572 controlling a Spansion S29GL064N NOR flash chip.  With and without
the flash being mapped cached and with and without the localbus controller
being programmed to use page burst mode:

Non-cached, w/o bursts: 13.61 MB/s
Non-cached, w/ bursts:  13.61 MB/s
Cached, w/o bursts:     16.75 MB/s 23% increase
Cached, w/ bursts:      44.79 MB/s 229% increase!

Even without any cache hits, the cached mapping provides a significant
increase in performance via improved bus utilization.  Enabling burst
transfers is even more significant.

(*) The MTD device's ->point() method, which is the mechanism for
supporting mmap and XIP, only allows for mmapping the uncached region.  So
you can't actually XIP anything in the cache.  But this could be fixed.

Signed-off-by: Trent Piepho <tpiepho@freescale.com>
---
 arch/powerpc/include/asm/cacheflush.h |    2 ++
 arch/powerpc/kernel/misc_32.S         |   21 +++++++++++++++++++++
 drivers/mtd/maps/physmap_of.c         |   20 ++++++++++++++++++++
 3 files changed, 43 insertions(+), 0 deletions(-)

Comments

Josh Boyer Dec. 15, 2008, 6:30 p.m. UTC | #1
On Mon, 15 Dec 2008 10:25:18 -0800
Trent Piepho <tpiepho@freescale.com> wrote:

> The MTD system supports operation where a direct mapped flash chip is
> mapped twice.  The normal mapping is a standard ioremap(), which is
> non-cached and guarded on powerpc.  The second mapping is used only for
> reads and can be cached and non-guarded.  Currently, only the pxa2xx
> mapping driver makes use of this feature.  This patch adds support to the
> physmap_of driver on PPC32 platforms for this cached mapping mode.
> 
> Because the flash chip doesn't participate in the cache coherency protocol,
> it's necessary to invalidate the cache for parts of flash that are modified
> with a program or erase operation.  This is platform specific, for instance
> the pxa2xx driver uses an ARM specific function.  This patch adds
> invalidate_dcache_icache_range() for PPC32 and uses it.  Because of XIP,
> it's entirely possible that the flash might be in the icache(*), so the
> existing invalidate_dcache_range() function isn't enough.
> 
> Of course, a cached mapping can increase performance if the data is read
> from cache instead of flash.  But less obvious is that it can provide a
> significant performance increase for cold-cache reads that still come from
> flash.  It allows efficient back-to-back reads and if the flash chip &
> controller support page burst mode, it allows that to be used as well.
> 
> The figures are for *cold-cache* read performance, measured on a Freescale
> MPC8572 controlling a Spansion S29GL064N NOR flash chip.  With and without
> the flash being mapped cached and with and without the localbus controller
> being programmed to use page burst mode:
> 
> Non-cached, w/o bursts: 13.61 MB/s
> Non-cached, w/ bursts:  13.61 MB/s
> Cached, w/o bursts:     16.75 MB/s 23% increase
> Cached, w/ bursts:      44.79 MB/s 229% increase!
> 
> Even without any cache hits, the cached mapping provides a significant
> increase in performance via improved bus utilization.  Enabling burst
> transfers is even more significant.
> 
> (*) The MTD device's ->point() method, which is the mechanism for
> supporting mmap and XIP, only allows for mmapping the uncached region.  So
> you can't actually XIP anything in the cache.  But this could be fixed.
> 
> Signed-off-by: Trent Piepho <tpiepho@freescale.com>

Did you actually change anything in this version when compared to the
version you sent out last week?  If not, is there a reason you sent it
again without flagging it as a resend?

(Hint, the MTD community reviews patches slowly.  No comments for a
week is normal.)

josh
Trent Piepho Dec. 15, 2008, 10:56 p.m. UTC | #2
On Mon, 15 Dec 2008, Josh Boyer wrote:
>
> Did you actually change anything in this version when compared to the
> version you sent out last week?  If not, is there a reason you sent it
> again without flagging it as a resend?

I sent it out last week?  I'm trying to tie up loose ends before I leave
and thought this patch hadn't gone out yet.  No changes unless there was
some kind of last minute spelling fix in the patch comments.

At least this one has my non-Freescale address CCed so I'll get replies if
there are any.
Paul Mackerras Dec. 16, 2008, 12:21 a.m. UTC | #3
Trent Piepho writes:

> The MTD system supports operation where a direct mapped flash chip is
> mapped twice.  The normal mapping is a standard ioremap(), which is
> non-cached and guarded on powerpc.  The second mapping is used only for
> reads and can be cached and non-guarded.  Currently, only the pxa2xx
> mapping driver makes use of this feature.  This patch adds support to the
> physmap_of driver on PPC32 platforms for this cached mapping mode.

Note that having two mappings of the same physical address that differ
in cacheability is a programming error according to the PowerPC
architecture.

Do you know that the processor implementations where you want to do
this can cope with having two mappings with different cacheability?

Paul.
Trent Piepho Dec. 16, 2008, 1:11 a.m. UTC | #4
On Tue, 16 Dec 2008, Paul Mackerras wrote:
> Trent Piepho writes:
>> The MTD system supports operation where a direct mapped flash chip is
>> mapped twice.  The normal mapping is a standard ioremap(), which is
>> non-cached and guarded on powerpc.  The second mapping is used only for
>> reads and can be cached and non-guarded.  Currently, only the pxa2xx
>> mapping driver makes use of this feature.  This patch adds support to the
>> physmap_of driver on PPC32 platforms for this cached mapping mode.
>
> Note that having two mappings of the same physical address that differ
> in cacheability is a programming error according to the PowerPC
> architecture.
>
> Do you know that the processor implementations where you want to do
> this can cope with having two mappings with different cacheability?

Good question.  It worked for me, but I guess all powerpc32 can't handle
it.

Shame, as it provides a huge speed up.  I suppose an alternative would be
to map the chip twice at different physical addresses, by just configuring
the chip select to be twice the size it should be, and giving them
different cacheability.

Or changing the mapping for writes and then changing it back.  It wouldn't
be necessary to change the whole thing, just the page being written to.
Benjamin Herrenschmidt Dec. 16, 2008, 8:55 a.m. UTC | #5
On Mon, 2008-12-15 at 10:25 -0800, Trent Piepho wrote:
> The MTD system supports operation where a direct mapped flash chip is
> mapped twice.  The normal mapping is a standard ioremap(), which is
> non-cached and guarded on powerpc.  The second mapping is used only for
> reads and can be cached and non-guarded.  Currently, only the pxa2xx
> mapping driver makes use of this feature.  This patch adds support to the
> physmap_of driver on PPC32 platforms for this cached mapping mode.

This can be dangerous tho.

I think it's illegal as per the architecture to have the same physical
address mapped twice with different caching attributes.

More specifically, various processors can get very upset if you do an
uncached access that happens to hit a line in the cache.

The problem gets worsened by the fact that cores that support
speculative loads and prefetch will potentially bring anything mapped
into the cache even if it's not directly accessed.

So you have to be very careful and first verify that on whatever core
you intend to use that feature, what you are doing is indeed safe.

Cheers,
Ben.
Benjamin Herrenschmidt Dec. 16, 2008, 8:56 a.m. UTC | #6
On Mon, 2008-12-15 at 17:11 -0800, Trent Piepho wrote:
> Shame, as it provides a huge speed up.  I suppose an alternative would be
> to map the chip twice at different physical addresses, by just configuring
> the chip select to be twice the size it should be, and giving them
> different cacheability.

Nice trick. That would probably work.

> Or changing the mapping for writes and then changing it back.  It wouldn't
> be necessary to change the whole thing, just the page being written to.

Right though changing mappings can be expensive. It might be worth
looking at using fixmap for that tho, which is the fastest way to setup
and tear down mappings, especially since we can (though we don't today)
implement a bypass on those to directly load the TLB.

Cheers,
Ben.
Trent Piepho Dec. 17, 2008, 9:01 p.m. UTC | #7
On Tue, 16 Dec 2008, Benjamin Herrenschmidt wrote:
> On Mon, 2008-12-15 at 17:11 -0800, Trent Piepho wrote:
>> Shame, as it provides a huge speed up.  I suppose an alternative would be
>> to map the chip twice at different physical addresses, by just configuring
>> the chip select to be twice the size it should be, and giving them
>> different cacheability.
>
> Nice trick. That would probably work.

Thinking about it more, this is probably the way to do it.  Mapping the
same address twice appeared to worked for me, but it looks like it's a bad
thing to do.  To bad I didn't have time to finish this.

Creating two copies of the flash chip will take twice the physical address
space, but the virtual address space used is the same as mapping the chip
twice.  Since kernel virtual address space <= physical address space, there
really shouldn't be a problem with that.

Probably do something like this to the dts:

  localbus {
-	ranges = <0x1 0x0 0xe8000000 0x08000000>;
+	ranges = <0x1 0x0 0xe0000000 0x10000000>;  /* CS size x2 */
-       nor@1,0 {
+       nor@1,0x08000000 {
 		compatible = "cfi-flash";
-		reg = <0x1 0x0 0x08000000>;
+		reg = <0x1 0x08000000 0x08000000>;
+		cached-alias = <&cached_nor>;
         };
+	cached_nor: nor@1,0 {
+		compatible = "alias";
+		reg = <0x1 0 0x08000000>;
+	};
  }

Since physmap_of is an openfirmware driver, it won't be a problem to have
if look for "cached-alias" to get the range to map as cached.  The MTD
layer only supports one "map->phys" address, but I don't think this address
is used for anything on powerpc.

>> Or changing the mapping for writes and then changing it back.  It wouldn't
>> be necessary to change the whole thing, just the page being written to.
>
> Right though changing mappings can be expensive. It might be worth
> looking at using fixmap for that tho, which is the fastest way to setup
> and tear down mappings, especially since we can (though we don't today)
> implement a bypass on those to directly load the TLB.

The MTD layer appears to program flash one word at a time, so writing to
flash would mean changing maps on a per word basis.  Of course flash is
slow too so maybe the relative cost is not that much.  It takes more
modifications to MTD than the previous method.

> The problem gets worsened by the fact that cores that support
> speculative loads and prefetch will potentially bring anything mapped
> into the cache even if it's not directly accessed.

This is really the whole point of mapping it cached.  Since the cpu can
prefetch data, it's able to use more efficient back-to-back reads or page
burst mode to read a whole cache line at once.  The latter can more than
triple the read rate.
diff mbox

Patch

diff --git a/arch/powerpc/include/asm/cacheflush.h b/arch/powerpc/include/asm/cacheflush.h
index ba667a3..385c26c 100644
--- a/arch/powerpc/include/asm/cacheflush.h
+++ b/arch/powerpc/include/asm/cacheflush.h
@@ -49,6 +49,8 @@  extern void flush_dcache_range(unsigned long start, unsigned long stop);
 #ifdef CONFIG_PPC32
 extern void clean_dcache_range(unsigned long start, unsigned long stop);
 extern void invalidate_dcache_range(unsigned long start, unsigned long stop);
+extern void invalidate_dcache_icache_range(unsigned long start,
+					   unsigned long stop);
 #endif /* CONFIG_PPC32 */
 #ifdef CONFIG_PPC64
 extern void flush_inval_dcache_range(unsigned long start, unsigned long stop);
diff --git a/arch/powerpc/kernel/misc_32.S b/arch/powerpc/kernel/misc_32.S
index d108715..5e6a154 100644
--- a/arch/powerpc/kernel/misc_32.S
+++ b/arch/powerpc/kernel/misc_32.S
@@ -639,6 +639,27 @@  _GLOBAL(invalidate_dcache_range)
 	blr
 
 /*
+ * Like above, but invalidate both the D-cache and I-cache.  Used when a cached
+ * region has been modified from a source that does not participate in the cache
+ * coherency protocol.
+ *
+ * invalidate_dcache_icache_range(unsigned long start, unsigned long stop)
+ */
+_GLOBAL(invalidate_dcache_icache_range)
+	clrrwi	r3, r3, L1_CACHE_SHIFT	/* start &= ~((1<<SHIFT)-1) */
+	subf	r4,r3,r4
+	addi	r4,r4,L1_CACHE_BYTES-1
+	srwi.	r4,r4,L1_CACHE_SHIFT	/* count = (start-stop+BYTES-1)/BYTES */
+	beqlr				/* if (!count) return */
+	mtctr	r4
+1:	dcbi	0,r3
+	icbi	0,r3
+	addi	r3,r3,L1_CACHE_BYTES
+	bdnz	1b
+	sync				/* wait for [id]cbi's to get to ram */
+	blr
+
+/*
  * Flush a particular page from the data cache to RAM.
  * Note: this is necessary because the instruction cache does *not*
  * snoop from the data cache.
diff --git a/drivers/mtd/maps/physmap_of.c b/drivers/mtd/maps/physmap_of.c
index 5fcfec0..e834298 100644
--- a/drivers/mtd/maps/physmap_of.c
+++ b/drivers/mtd/maps/physmap_of.c
@@ -32,6 +32,16 @@  struct of_flash {
 #endif
 };
 
+#ifdef CONFIG_PPC32
+#include <asm/cacheflush.h>
+static void ppc32_inval_cache(struct map_info *map, unsigned long from,
+			      ssize_t len)
+{
+	invalidate_dcache_icache_range((unsigned long)map->cached + from,
+	                               (unsigned long)map->cached + from + len);
+}
+#endif
+
 #ifdef CONFIG_MTD_PARTITIONS
 #define OF_FLASH_PARTS(info)	((info)->parts)
 
@@ -106,6 +116,8 @@  static int of_flash_remove(struct of_device *dev)
 
 	if (info->map.virt)
 		iounmap(info->map.virt);
+	if (info->map.cached)
+		iounmap(info->map.cached);
 
 	if (info->res) {
 		release_resource(info->res);
@@ -205,6 +217,14 @@  static int __devinit of_flash_probe(struct of_device *dev,
 		dev_err(&dev->dev, "Failed to ioremap() flash region\n");
 		goto err_out;
 	}
+#ifdef CONFIG_PPC32
+	/* Don't use no-cache or guarded flags */
+	info->map.cached = ioremap_flags(info->map.phys, info->map.size, 0);
+	if (!info->map.cached)
+		dev_warn(&dev->dev, "Failed to ioremap() cached flash region\n");
+	else
+		info->map.inval_cache = ppc32_inval_cache;
+#endif
 
 	simple_map_init(&info->map);