diff mbox

[v2,1/5,ppc] Process dynamic relocations for kernel

Message ID 1320276969.3309.3.camel@treble (mailing list archive)
State Superseded
Headers show

Commit Message

Josh Poimboeuf Nov. 2, 2011, 11:36 p.m. UTC
On Tue, 2011-10-25 at 17:23 +0530, Suzuki K. Poulose wrote: 
> The following patch implements the dynamic relocation processing for
> PPC32 kernel. relocate() accepts the target virtual address and relocates
>  the kernel image to the same.

Hi Suzuki,

Thanks for the patches.  I've been testing them on a 440-based card, and
encountered TLB error exceptions because the BSS section wasn't getting
properly cleared in early_init().

It turns out that some of the instructions which were modified in
relocate() weren't then getting flushed out of the d-cache into memory.
After that, early_init() executed the stale (non-modified) instructions
for the BSS area.  Those instructions just accessed offset 0 instead of
the actual BSS-related offsets.  That resulted in BSS not getting`
zeroed.

I was able to verify this on my 440 by comparing the d-cache and i-cache
entries for the BSS-accessing instructions in early_init() using a
RISCWatch.  As I suspected, the instructions in the d-cache showed the
corrected offsets, but the i-cache showed the old, non-relocated
offsets.

To fix the issue, I wrote the following patch, applied on top of your
patches.  Suggestions and comments are welcome.



From c88ae39da0c0352f411aca8d9636990a442d47da Mon Sep 17 00:00:00 2001
From: Josh Poimboeuf <jpoimboe@linux.vnet.ibm.com>
Date: Wed, 2 Nov 2011 16:41:24 -0500
Subject: [PATCH] Flush relocated instructions from data cache

After updating instructions with relocated addresses, flush them from
the data cache and invalidate the icache line so we don't execute stale
instructions.

Signed-off-by: Josh Poimboeuf <jpoimboe@linux.vnet.ibm.com>
---
 arch/powerpc/kernel/reloc_32.S |   11 ++++++++++-
 1 files changed, 10 insertions(+), 1 deletions(-)

Comments

Suzuki Poulose Nov. 4, 2011, 8:36 a.m. UTC | #1
On 11/03/11 05:06, Josh Poimboeuf wrote:
> On Tue, 2011-10-25 at 17:23 +0530, Suzuki K. Poulose wrote:
>> The following patch implements the dynamic relocation processing for
>> PPC32 kernel. relocate() accepts the target virtual address and relocates
>>   the kernel image to the same.
>
> Hi Suzuki,
>
> Thanks for the patches.  I've been testing them on a 440-based card, and
> encountered TLB error exceptions because the BSS section wasn't getting
> properly cleared in early_init().

Thanks a lot for the testing.
>
> It turns out that some of the instructions which were modified in
> relocate() weren't then getting flushed out of the d-cache into memory.
> After that, early_init() executed the stale (non-modified) instructions
> for the BSS area.  Those instructions just accessed offset 0 instead of
> the actual BSS-related offsets.  That resulted in BSS not getting`
> zeroed.
>
> I was able to verify this on my 440 by comparing the d-cache and i-cache
> entries for the BSS-accessing instructions in early_init() using a
> RISCWatch.  As I suspected, the instructions in the d-cache showed the
> corrected offsets, but the i-cache showed the old, non-relocated
> offsets.
>
> To fix the issue, I wrote the following patch, applied on top of your
> patches.  Suggestions and comments are welcome.
>
>
>
>  From c88ae39da0c0352f411aca8d9636990a442d47da Mon Sep 17 00:00:00 2001
> From: Josh Poimboeuf<jpoimboe@linux.vnet.ibm.com>
> Date: Wed, 2 Nov 2011 16:41:24 -0500
> Subject: [PATCH] Flush relocated instructions from data cache
>
> After updating instructions with relocated addresses, flush them from
> the data cache and invalidate the icache line so we don't execute stale
> instructions.
>
> Signed-off-by: Josh Poimboeuf<jpoimboe@linux.vnet.ibm.com>
> ---
>   arch/powerpc/kernel/reloc_32.S |   11 ++++++++++-
>   1 files changed, 10 insertions(+), 1 deletions(-)
>
> diff --git a/arch/powerpc/kernel/reloc_32.S
> b/arch/powerpc/kernel/reloc_32.S
> index 045d61e..a92857d 100644
> --- a/arch/powerpc/kernel/reloc_32.S
> +++ b/arch/powerpc/kernel/reloc_32.S
> @@ -137,6 +137,9 @@ get_type:
>   	lwz	r0, 8(r9)	/* r_addend */
>   	add	r0, r0, r3	/* final addend */
>   	stwx	r0, r4, r7	/* memory[r4+r7]) = (u32)r0 */
> +	dcbst	r4,r7		/* flush dcache line to memory */
> +	sync			/* wait for flush to complete */
> +	icbi	r4,r7		/* invalidate icache line */

Doing it this way has two drawbacks :

1) Placing it here in relocate would do the flushing for each and every update.
2) I would like to keep this code as generic as possible for the PPC32 code.

Could we move this to the place from relocate is called and flush the d-cache and
i-cache entirely ?

Thanks

Suzuki
Josh Poimboeuf Nov. 7, 2011, 3:13 p.m. UTC | #2
On Fri, 2011-11-04 at 14:06 +0530, Suzuki Poulose wrote:
> On 11/03/11 05:06, Josh Poimboeuf wrote:
> > On Tue, 2011-10-25 at 17:23 +0530, Suzuki K. Poulose wrote:
> > @@ -137,6 +137,9 @@ get_type:
> >   	lwz	r0, 8(r9)	/* r_addend */
> >   	add	r0, r0, r3	/* final addend */
> >   	stwx	r0, r4, r7	/* memory[r4+r7]) = (u32)r0 */
> > +	dcbst	r4,r7		/* flush dcache line to memory */
> > +	sync			/* wait for flush to complete */
> > +	icbi	r4,r7		/* invalidate icache line */
> 
> Doing it this way has two drawbacks :
> 
> 1) Placing it here in relocate would do the flushing for each and every update.

I agree.  My kernel had around 80,000 relocations, which means 80,000
d-cache line flushes (for a 32k d-cache) and 80,000 i-cache line
invalidates (for a 32k i-cache).  Which is obviously a little overkill.
Although I didn't notice a performance hit during boot.


> 2) I would like to keep this code as generic as possible for the PPC32 code.
> 
> Could we move this to the place from relocate is called and flush the d-cache and
> i-cache entirely ?

Why not put the cache flushing code at the end of relocate?  Would some
of the other PPC32 platforms not require the cache flushing?

My PPC32 knowledge is 4xx-centric, so please feel free to rewrite the
patch as needed to accommodate other PPC32 cores.


Thanks,
Josh
David Laight Nov. 7, 2011, 3:26 p.m. UTC | #3
> On Fri, 2011-11-04 at 14:06 +0530, Suzuki Poulose wrote:
> > On 11/03/11 05:06, Josh Poimboeuf wrote:
> > > On Tue, 2011-10-25 at 17:23 +0530, Suzuki K. Poulose wrote:
> > > @@ -137,6 +137,9 @@ get_type:
> > >   	lwz	r0, 8(r9)	/* r_addend */
> > >   	add	r0, r0, r3	/* final addend */
> > >   	stwx	r0, r4, r7	/* memory[r4+r7]) = (u32)r0 */
> > > +	dcbst	r4,r7		/* flush dcache line to memory */
> > > +	sync			/* wait for flush to complete */
> > > +	icbi	r4,r7		/* invalidate icache line */
> > 
> > Doing it this way has two drawbacks :
> > 
> > 1) Placing it here in relocate would do the flushing for 
> each and every update.
> 
> I agree.  My kernel had around 80,000 relocations, which means 80,000
> d-cache line flushes (for a 32k d-cache) and 80,000 i-cache line
> invalidates (for a 32k i-cache).  Which is obviously a little 
> overkill.
> Although I didn't notice a performance hit during boot.

The I-cache invalidates shouldn't be needed, the un-relocated
code can't be in the I-cache (on the grounds that executing
it would crash the system).
A single sync at the end is probably enough as well.
I guess it is possible for the cpu to prefetch/preload
into the i-cache through the jump into the relocated code?
So maybe a full i-cache invalidate right at the end? (or
a jump indirect? - which is probably there anyway)

The d-cache will need some kind of flush, since the modified
lines have to be written out, the only time it generates
additional memeory cycles are if there are two (or more)
reloations in the same d-cache line. Otherwise the early
write-back might help!

	David
Suzuki Poulose Nov. 8, 2011, 7:11 a.m. UTC | #4
On 11/07/11 20:43, Josh Poimboeuf wrote:
> On Fri, 2011-11-04 at 14:06 +0530, Suzuki Poulose wrote:
>> On 11/03/11 05:06, Josh Poimboeuf wrote:
>>> On Tue, 2011-10-25 at 17:23 +0530, Suzuki K. Poulose wrote:
>>> @@ -137,6 +137,9 @@ get_type:
>>>    	lwz	r0, 8(r9)	/* r_addend */
>>>    	add	r0, r0, r3	/* final addend */
>>>    	stwx	r0, r4, r7	/* memory[r4+r7]) = (u32)r0 */
>>> +	dcbst	r4,r7		/* flush dcache line to memory */
>>> +	sync			/* wait for flush to complete */
>>> +	icbi	r4,r7		/* invalidate icache line */
>>
>> Doing it this way has two drawbacks :
>>
>> 1) Placing it here in relocate would do the flushing for each and every update.
>
> I agree.  My kernel had around 80,000 relocations, which means 80,000
> d-cache line flushes (for a 32k d-cache) and 80,000 i-cache line
> invalidates (for a 32k i-cache).  Which is obviously a little overkill.
> Although I didn't notice a performance hit during boot.
>
>
>> 2) I would like to keep this code as generic as possible for the PPC32 code.
>>
>> Could we move this to the place from relocate is called and flush the d-cache and
>> i-cache entirely ?
>
> Why not put the cache flushing code at the end of relocate?  Would some
> of the other PPC32 platforms not require the cache flushing?
What I was suggesting is, instead of flushing the cache in relocate(), lets do it
like:

for e.g, on 440x, (in head_44x.S :)

#ifdef CONFIG_RELOCATABLE
	...
	bl relocate

	#Flush the d-cache and invalidate the i-cache here
#endif


This would let the different platforms do the the cache invalidation in their
own way.

Btw, I didn't find an instruction to flush the entire d-cache in PPC440 manual.
We have instructions to flush only a block corresponding to an address.

However, we have 'iccci' which would invalidate the entire i-cache which, which
I think is better than 80,000 i-cache invalidates.

Kumar / Josh,

Do you have any suggestions here ?




>
> My PPC32 knowledge is 4xx-centric, so please feel free to rewrite the
> patch as needed to accommodate other PPC32 cores.

Same here :)

Thanks
Suzuki
Josh Poimboeuf Nov. 8, 2011, 4:19 p.m. UTC | #5
On Tue, 2011-11-08 at 12:41 +0530, Suzuki Poulose wrote:
> What I was suggesting is, instead of flushing the cache in relocate(), lets do it
> like:
> 
> for e.g, on 440x, (in head_44x.S :)
> 
> #ifdef CONFIG_RELOCATABLE
> 	...
> 	bl relocate
> 
> 	#Flush the d-cache and invalidate the i-cache here
> #endif
> 
> 
> This would let the different platforms do the the cache invalidation in their
> own way.
> 
> Btw, I didn't find an instruction to flush the entire d-cache in PPC440 manual.
> We have instructions to flush only a block corresponding to an address.
> 
> However, we have 'iccci' which would invalidate the entire i-cache which, which
> I think is better than 80,000 i-cache invalidates.

In misc_32.S there are already some platform-independent cache
management functions.  If we use those, then relocate() could simply
call them.  Then the different platforms calling relocate() wouldn't
have to worry about flushing/invalidating caches.

For example, there's a clean_dcache_range() function.  Given any range
twice the size of the d-cache, it should flush the entire d-cache.  But
the only drawback is that it would require the caller to know the size
of the d-cache.

Instead, I think it would be preferable to create a new clean_dcache()
(or clean_dcache_all()?) function in misc_32.S, which could call
clean_dcache_range() with the appropriate args for flushing the entire
d-cache.  relocate() could then call the platform-independent
clean_dcache().

For i-cache invalidation there's already the (incorrectly named?)
flush_instruction_cache().  It uses the appropriate platform-specific
methods (e.g. iccci for 44x) to invalidate the entire i-cache.

Suzuki, if you agree with this direction, I could work up a new patch if
needed.


Josh
diff mbox

Patch

diff --git a/arch/powerpc/kernel/reloc_32.S
b/arch/powerpc/kernel/reloc_32.S
index 045d61e..a92857d 100644
--- a/arch/powerpc/kernel/reloc_32.S
+++ b/arch/powerpc/kernel/reloc_32.S
@@ -137,6 +137,9 @@  get_type:
 	lwz	r0, 8(r9)	/* r_addend */
 	add	r0, r0, r3	/* final addend */
 	stwx	r0, r4, r7	/* memory[r4+r7]) = (u32)r0 */
+	dcbst	r4,r7		/* flush dcache line to memory */
+	sync			/* wait for flush to complete */
+	icbi	r4,r7		/* invalidate icache line */
 	b	nxtrela		/* continue */
 
 	/* R_PPC_ADDR16_HI */
@@ -177,6 +180,9 @@  lo16:
 	/* Store half word */
 store_half:
 	sthx	r0, r4, r7	/* memory[r4+r7] = (u16)r0 */
+	dcbst	r4,r7		/* flush dcache line to memory */
+	sync			/* wait for flush to complete */
+	icbi	r4,r7		/* invalidate icache line */
 
 nxtrela:
 	cmpwi	r8, 0		/* relasz = 0 ? */
@@ -185,7 +191,10 @@  nxtrela:
 	subf	r8, r6, r8	/* relasz -= relaent */
 	b	applyrela
 
-done:	blr
+done:
+	sync			/* wait for icache invalidates to complete */
+	isync			/* discard any prefetched instructions */
+	blr
 
 
 p_dyn:		.long	__dynamic_start - 0b