[v3] powerpc/4xx: work around CHIP11 errata in a more PAGE_SIZE-friendly way

Message ID 700c59731cf97778d3a4.1226448406@localhost.localdomain (mailing list archive)
State Superseded, archived
Delegated to: Josh Boyer

Commit Message

Hollis Blanchard Nov. 12, 2008, 12:06 a.m. UTC
The current CHIP11 errata truncates the device tree memory node, and subtracts
(hardcoded) 4096 bytes. This breaks kernels with larger PAGE_SIZE, since the
bootmem allocator assumes that total memory is a multiple of PAGE_SIZE.

Instead, use a device tree memory reservation to reserve only the 256 bytes
actually affected by the errata, leaving the total memory size unaltered.

Signed-off-by: Hollis Blanchard <hollisb@us.ibm.com>

---

Changes from v2:
- David pointed out I'd duplicated the fdt_add_mem_rsv() prototype, and that
  4xx.c should directly include libfdt/libfdt.h instead.

Using large pages results in a huge performance improvement for KVM, and this
patch is required to make Ilya's large page patch work. David and/or Josh,
please apply.
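
The alignment problem described in the commit message can be checked with a little arithmetic. The sketch below is illustrative only: truncated_memsize() and is_page_multiple() are hypothetical helper names, not functions from the boot wrapper or the kernel, and the 256 MiB figure is just an example size.

```c
#include <stdint.h>

/* Old workaround: shrink the reported memory size by a hardcoded 4 KiB. */
uint64_t truncated_memsize(uint64_t memsize)
{
	return memsize - 4096;
}

/* Returns 1 if size is a whole number of pages, 0 otherwise. */
int is_page_multiple(uint64_t size, uint64_t page_size)
{
	return size % page_size == 0;
}
```

With 256 MiB of RAM, is_page_multiple(truncated_memsize(256ULL << 20), 4096) holds, but the same check with a 64 KiB page size fails, because the truncated size ends 4 KiB short of a page boundary. Leaving memsize alone and reserving the affected bytes keeps the total a page multiple for every page size.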

Comments

David Gibson Nov. 12, 2008, 12:09 a.m. UTC | #1
On Tue, Nov 11, 2008 at 06:06:46PM -0600, Hollis Blanchard wrote:
> The current CHIP11 errata truncates the device tree memory node, and subtracts
> (hardcoded) 4096 bytes. This breaks kernels with larger PAGE_SIZE, since the
> bootmem allocator assumes that total memory is a multiple of PAGE_SIZE.
> 
> Instead, use a device tree memory reservation to reserve only the 256 bytes
> actually affected by the errata, leaving the total memory size unaltered.
> 
> Signed-off-by: Hollis Blanchard <hollisb@us.ibm.com>

libfdt usage changes look fine to me.

Acked-by: David Gibson <david@gibson.dropbear.id.au>
Benjamin Herrenschmidt Nov. 12, 2008, 4:37 a.m. UTC | #2
On Tue, 2008-11-11 at 18:06 -0600, Hollis Blanchard wrote:
> The current CHIP11 errata truncates the device tree memory node, and subtracts
> (hardcoded) 4096 bytes. This breaks kernels with larger PAGE_SIZE, since the
> bootmem allocator assumes that total memory is a multiple of PAGE_SIZE.
> 
> Instead, use a device tree memory reservation to reserve only the 256 bytes
> actually affected by the errata, leaving the total memory size unaltered.
> 
> Signed-off-by: Hollis Blanchard <hollisb@us.ibm.com>

While I prefer this approach, won't it break kexec ?

I don't understand why we don't just have a bit of code in the kernel
itself that reserve that page on 44x at boot time and be done with it.

It's like we are trying to be too smart and over-engineer the solution.

Cheers,
Ben.
Josh Boyer Nov. 12, 2008, 11:31 a.m. UTC | #3
On Wed, 12 Nov 2008 15:37:43 +1100
Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:

> On Tue, 2008-11-11 at 18:06 -0600, Hollis Blanchard wrote:
> > The current CHIP11 errata truncates the device tree memory node, and subtracts
> > (hardcoded) 4096 bytes. This breaks kernels with larger PAGE_SIZE, since the
> > bootmem allocator assumes that total memory is a multiple of PAGE_SIZE.
> > 
> > Instead, use a device tree memory reservation to reserve only the 256 bytes
> > actually affected by the errata, leaving the total memory size unaltered.
> > 
> > Signed-off-by: Hollis Blanchard <hollisb@us.ibm.com>
> 
> While I prefer this approach, won't it break kexec ?

Break it how?  Particularly given that kexec doesn't work on 4xx (yet).

> I don't understand why we don't just have a bit of code in the kernel
> itself that reserve that page on 44x at boot time and be done with it.
> 
> It's like we are trying to be too smart and over-engineer the solution.

I don't think that's it.  I think it's more that we're opportunistic and
the wrapper is the easiest place to do this, given that U-Boot itself
will be doing the reserve for platforms that don't require the
wrapper.

So we could do the fixup in-kernel, but how do you do that
deterministically given that U-Boot might have already done it?

josh
Benjamin Herrenschmidt Nov. 12, 2008, 11:52 a.m. UTC | #4
On Wed, 2008-11-12 at 06:31 -0500, Josh Boyer wrote:
> On Wed, 12 Nov 2008 15:37:43 +1100
> Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> 
> > On Tue, 2008-11-11 at 18:06 -0600, Hollis Blanchard wrote:
> > > The current CHIP11 errata truncates the device tree memory node, and subtracts
> > > (hardcoded) 4096 bytes. This breaks kernels with larger PAGE_SIZE, since the
> > > bootmem allocator assumes that total memory is a multiple of PAGE_SIZE.
> > > 
> > > Instead, use a device tree memory reservation to reserve only the 256 bytes
> > > actually affected by the errata, leaving the total memory size unaltered.
> > > 
> > > Signed-off-by: Hollis Blanchard <hollisb@us.ibm.com>
> > 
> > While I prefer this approach, won't it break kexec ?
> 
> Break it how?  Particularly given that kexec doesn't work on 4xx (yet).

All right, wrong wording. It will make kexec more painful since it will
have to also create that reserved area in the target DT.

> I don't think that's it.  I think it's more that we're opportunistic and
> the wrapper is the easiest place to do this, given that U-Boot itself
> will be doing the reserve for platforms that don't require the
> wrapper.
> 
> So we could do the fixup in-kernel, but how do you do that
> deterministically given that U-Boot might have already done it?

Bah, do you know many RAM chip that will chop off the last 4K ?

I still find it a bit tricky to have memory nodes not aligned on nice
fat big boundaries tho.

Cheers,
Ben.
Hollis Blanchard Nov. 12, 2008, 3:11 p.m. UTC | #5
On Wed, 2008-11-12 at 22:52 +1100, Benjamin Herrenschmidt wrote:
> On Wed, 2008-11-12 at 06:31 -0500, Josh Boyer wrote:
> > On Wed, 12 Nov 2008 15:37:43 +1100
> > Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> > 
> > > On Tue, 2008-11-11 at 18:06 -0600, Hollis Blanchard wrote:
> > > > The current CHIP11 errata truncates the device tree memory node, and subtracts
> > > > (hardcoded) 4096 bytes. This breaks kernels with larger PAGE_SIZE, since the
> > > > bootmem allocator assumes that total memory is a multiple of PAGE_SIZE.
> > > > 
> > > > Instead, use a device tree memory reservation to reserve only the 256 bytes
> > > > actually affected by the errata, leaving the total memory size unaltered.
> > > > 
> > > > Signed-off-by: Hollis Blanchard <hollisb@us.ibm.com>
> > > 
> > > While I prefer this approach, won't it break kexec ?
> > 
> > Break it how?  Particularly given that kexec doesn't work on 4xx (yet).
> 
> All right, wrong wording. It will make kexec more painful since it will
> have to also create that reserved area in the target DT.
> 
> > I don't think that's it.  I think it's more that we're opportunistic and
> > the wrapper is the easiest place to do this, given that U-Boot itself
> > will be doing the reserve for platforms that don't require the
> > wrapper.
> > 
> > So we could do the fixup in-kernel, but how do you do that
> > deterministically given that U-Boot might have already done it?
> 
> Bah, do you know many RAM chip that will chop off the last 4K ?

Forget pages. The errata is about the last 256 bytes of physical memory.

> I still find it a bit tricky to have memory nodes not aligned on nice
> fat big boundaries tho.

I don't know what you're referring to. The patch I sent doesn't touch
memory nodes, so they are indeed still aligned on nice fat big
boundaries.

I don't think this is overengineering at all. We can't touch the last
256 bytes, so we mark it reserved, and then we won't. Altering memory
nodes is far more complicated and error-prone.
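
For readers unfamiliar with the reserve map, here is a rough sketch of what "mark it reserved" amounts to in the flattened device tree. Each reservation is a pair of 64-bit big-endian values, address then size. The names rsv_entry, put_be64(), get_be64() and errata_reservation() are hypothetical and exist only for this illustration; in the patch the actual work is done by libfdt's fdt_add_mem_rsv(), which also handles resizing the blob.

```c
#include <stdint.h>

/*
 * One entry of the memory reservation block: two 64-bit big-endian
 * fields, stored here as raw bytes so host endianness does not matter.
 */
struct rsv_entry {
	uint8_t address[8];
	uint8_t size[8];
};

/* Store v as 8 big-endian bytes. */
void put_be64(uint8_t out[8], uint64_t v)
{
	int i;

	for (i = 0; i < 8; i++)
		out[i] = (uint8_t)(v >> (56 - 8 * i));
}

/* Read 8 big-endian bytes back into a host integer. */
uint64_t get_be64(const uint8_t in[8])
{
	uint64_t v = 0;
	int i;

	for (i = 0; i < 8; i++)
		v = (v << 8) | in[i];
	return v;
}

/* Hypothetical helper: describe the last 256 bytes of memory. */
struct rsv_entry errata_reservation(uint64_t memsize)
{
	struct rsv_entry e;

	put_be64(e.address, memsize - 256);
	put_be64(e.size, 256);
	return e;
}
```

For a 256 MiB system, errata_reservation(0x10000000) yields address 0x0FFFFF00 and size 0x100, while the memory node itself still reports the full 0x10000000 bytes.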
Benjamin Herrenschmidt Nov. 12, 2008, 8:44 p.m. UTC | #6
On Wed, 2008-11-12 at 09:11 -0600, Hollis Blanchard wrote:
> Forget pages. The errata is about the last 256 bytes of physical
> memory.
> 
> > I still find it a bit tricky to have memory nodes not aligned on nice
> > fat big boundaries tho.
> 
> I don't know what you're referring to. The patch I sent doesn't touch
> memory nodes, so they are indeed still aligned on nice fat big
> boundaries.

My last comment was about the approach of modifying the memory node.

> I don't think this is overengineering at all. We can't touch the last
> 256 bytes, so we mark it reserved, and then we won't. Altering memory
> nodes is far more complicated and error-prone.

But your approach is going to be painful for kexec which will have to
duplicate that logic.

Again, why can't we just stick something in the kernel code that
reserves the last page ? It could be in prom.c or it could be called by
affected 4xx platforms by the platform code, whatever, but the reserve
map isn't really meant for that and will not be passed over from kernel
to kernel by kexec.

Ben.
Josh Boyer Nov. 12, 2008, 8:53 p.m. UTC | #7
On Thu, 13 Nov 2008 07:44:56 +1100
Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:

> On Wed, 2008-11-12 at 09:11 -0600, Hollis Blanchard wrote:
> > Forget pages. The errata is about the last 256 bytes of physical
> > memory.
> > 
> > > I still find it a bit tricky to have memory nodes not aligned on nice
> > > fat big boundaries tho.
> > 
> > I don't know what you're referring to. The patch I sent doesn't touch
> > memory nodes, so they are indeed still aligned on nice fat big
> > boundaries.
> 
> My last comment was about the approach of modifying the memory node.
> 
> > I don't think this is overengineering at all. We can't touch the last
> > 256 bytes, so we mark it reserved, and then we won't. Altering memory
> > nodes is far more complicated and error-prone.
> 
> But your approach is going to be painful for kexec which will have to
> duplicate that logic.
> 
> Again, why can't we just stick something in the kernel code that
> reserves the last page ? It could be in prom.c or it could be called by
> affected 4xx platforms by the platform code, whatever, but the reserve
> map isn't really meant for that and will not be passed over from kernel
> to kernel by kexec.

Again, because newer U-Boot is doing the fixup on memsize for us
already.  This is why it was done in the wrapper to begin with, since
it depends on the version of U-Boot that you happen to be using.

If you have a good idea on how to figure that out in-kernel, do the
fixup when needed, and not make people's eyes bleed, I'm all for it.

josh
Hollis Blanchard Nov. 13, 2008, 7:54 p.m. UTC | #8
On Thu, 2008-11-13 at 07:44 +1100, Benjamin Herrenschmidt wrote:
> 
> Again, why can't we just stick something in the kernel code that
> reserves the last page ? It could be in prom.c or it could be called by
> affected 4xx platforms by the platform code, whatever, but the reserve
> map isn't really meant for that and will not be passed over from kernel
> to kernel by kexec.

Reserving a page is overkill; only the last 256 bytes are affected. We
need to intercept at the LMB level, because allocations are already done
there, so by the time we hit bootmem it's way too late.

I simply don't see a good place to do this in the kernel. It would have
to be before the first lmb_alloc() call, which for safety would put it
inside early_init_devtree() -- along with the other lmb_reserve()
calls.[1]

However, ppc_md.probe() hasn't even been called yet, so there's no way
of knowing if we're on an affected system, unless you want to add a
special of_scan_flat_dt() call here.

I'm open to suggestions, but I don't see a better way than what I
already sent. I think the important part is to call lmb_add() for all
memory, but lmb_reserve() the last 256 bytes before lmb_alloc() happens.

It sounds like kexec must have some knowledge of the platform and device
tree already, so is this really a big deal? At any rate, this
conversation is somewhat academic, since there is no kexec on 44x... so
maybe this can be re-addressed when that becomes a real issue.

[1] This is exactly where flat device tree reservations are done, and
that's why the patch I submitted works.
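
The ordering argument above (add all memory, reserve the last 256 bytes, and only then allocate) can be illustrated with a toy model. This is not the kernel's LMB code: toy_add(), toy_reserve() and toy_alloc() are hypothetical stand-ins, and the top-down allocation policy is an assumption made here so that the hazard at the top of memory is visible.

```c
#include <stdint.h>

#define MAX_RSV  8
#define TOY_FAIL UINT64_MAX

struct toy_mem {
	uint64_t size;               /* total memory, starting at address 0 */
	uint64_t rsv_base[MAX_RSV];  /* reserved regions */
	uint64_t rsv_size[MAX_RSV];
	int nr_rsv;
};

/* Register all of memory; nothing is reserved yet. */
void toy_add(struct toy_mem *m, uint64_t size)
{
	m->size = size;
	m->nr_rsv = 0;
}

/* Mark [base, base + size) as unavailable. Returns 0 on success. */
int toy_reserve(struct toy_mem *m, uint64_t base, uint64_t size)
{
	if (m->nr_rsv == MAX_RSV)
		return -1;
	m->rsv_base[m->nr_rsv] = base;
	m->rsv_size[m->nr_rsv] = size;
	m->nr_rsv++;
	return 0;
}

/* Index of a reservation overlapping [base, base + size), or -1. */
static int find_overlap(const struct toy_mem *m, uint64_t base, uint64_t size)
{
	int i;

	for (i = 0; i < m->nr_rsv; i++)
		if (base < m->rsv_base[i] + m->rsv_size[i] &&
		    m->rsv_base[i] < base + size)
			return i;
	return -1;
}

/* Allocate size bytes, aligned to a power-of-two align, from the top down. */
uint64_t toy_alloc(struct toy_mem *m, uint64_t size, uint64_t align)
{
	uint64_t base;
	int i;

	if (size == 0 || size > m->size)
		return TOY_FAIL;
	base = (m->size - size) & ~(align - 1);
	while ((i = find_overlap(m, base, size)) >= 0) {
		if (m->rsv_base[i] < size)
			return TOY_FAIL;
		/* Step below the conflicting reservation and retry. */
		base = (m->rsv_base[i] - size) & ~(align - 1);
	}
	if (toy_reserve(m, base, size))
		return TOY_FAIL;
	return base;
}
```

In this model, a 4 KiB allocation from 256 MiB of memory lands at 0x0FFFF000 if nothing was reserved first, which covers the affected 256 bytes. If toy_reserve(&m, 0x10000000 - 256, 256) runs before the first toy_alloc(), the same request is placed at 0x0FFFE000 instead.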

Patch

diff --git a/arch/powerpc/boot/4xx.c b/arch/powerpc/boot/4xx.c
--- a/arch/powerpc/boot/4xx.c
+++ b/arch/powerpc/boot/4xx.c
@@ -20,8 +20,9 @@ 
 #include "ops.h"
 #include "reg.h"
 #include "dcr.h"
+#include "libfdt/libfdt.h"
 
-static unsigned long chip_11_errata(unsigned long memsize)
+static void chip_11_errata(unsigned long memsize)
 {
 	unsigned long pvr;
 
@@ -31,13 +32,11 @@  static unsigned long chip_11_errata(unsi
 		case 0x40000850:
 		case 0x400008d0:
 		case 0x200008d0:
-			memsize -= 4096;
+			fdt_add_mem_rsv(fdt, memsize - 256, 256);
 			break;
 		default:
 			break;
 	}
-
-	return memsize;
 }
 
 /* Read the 4xx SDRAM controller to get size of system memory. */
@@ -53,7 +52,7 @@  void ibm4xx_sdram_fixup_memsize(void)
 			memsize += SDRAM_CONFIG_BANK_SIZE(bank_config);
 	}
 
-	memsize = chip_11_errata(memsize);
+	chip_11_errata(memsize);
 	dt_fixup_memory(0, memsize);
 }
 
@@ -219,7 +218,7 @@  void ibm4xx_denali_fixup_memsize(void)
 		bank = 4; /* 4 banks */
 
 	memsize = cs * (1 << (col+row)) * bank * dpath;
-	memsize = chip_11_errata(memsize);
+	chip_11_errata(memsize);
 	dt_fixup_memory(0, memsize);
 }
 
diff --git a/arch/powerpc/boot/libfdt-wrapper.c b/arch/powerpc/boot/libfdt-wrapper.c
--- a/arch/powerpc/boot/libfdt-wrapper.c
+++ b/arch/powerpc/boot/libfdt-wrapper.c
@@ -51,7 +51,7 @@ 
 #define devp_offset_find(devp)	(((int)(devp))-1)
 #define devp_offset(devp)	(devp ? ((int)(devp))-1 : 0)
 
-static void *fdt;
+void *fdt;
 static void *buf; /* = NULL */
 
 #define EXPAND_GRANULARITY	1024
diff --git a/arch/powerpc/boot/ops.h b/arch/powerpc/boot/ops.h
--- a/arch/powerpc/boot/ops.h
+++ b/arch/powerpc/boot/ops.h
@@ -14,6 +14,7 @@ 
 #include <stddef.h>
 #include "types.h"
 #include "string.h"
+#include "libfdt_env.h"
 
 #define	COMMAND_LINE_SIZE	512
 #define	MAX_PATH_LEN		256
@@ -32,6 +33,9 @@  struct platform_ops {
 	void *	(*vmlinux_alloc)(unsigned long size);
 };
 extern struct platform_ops platform_ops;
+
+/* The device tree itself. Should almost always be accessed via dt_ops. */
+extern void *fdt;
 
 /* Device Tree operations */
 struct dt_ops {