[RFC] Use two VSX registers to do initial copy of skiboot

Message ID 1432197635-13724-1-git-send-email-stewart@linux.vnet.ibm.com
State Deferred

Commit Message

Stewart Smith May 21, 2015, 8:40 a.m. UTC
Note my awful fiddling of MSR so that I can use VSX registers :)

In the hello_world test running in mambo, we get the following reduction
in instruction/cycle count:
Before:
20284943: ** finished running 20284942 instructions **

Single VSX register:
19687022: ** finished running 19641315 instructions **

using 2 vsx registers:
19621488: ** finished running 19575781 instructions **
19621488: ** finished running 19575781 instructions **

65534 fewer cycles & instructions than with just 1 VSX register
709161 fewer instructions than the base implementation

using 3 vsx regs:
19883634: ** finished running 19837927 instructions **

Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
---
 asm/head.S |   24 +++++++++++++++++++-----
 1 file changed, 19 insertions(+), 5 deletions(-)
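
As a cross-check of the deltas quoted in the commit message, the instruction counts can be subtracted directly. This is only a sketch: the constant and function names are made up for illustration, and the numbers are copied from the mambo output above.

```c
#include <assert.h>
#include <stdint.h>

/* Instruction counts copied from the mambo output in the commit message. */
enum {
    INSNS_BASE = 20284942,  /* original 8-byte ld/std loop */
    INSNS_VSX1 = 19641315,  /* one VSX register */
    INSNS_VSX2 = 19575781,  /* two VSX registers */
    INSNS_VSX3 = 19837927   /* three VSX registers */
};

/* Instructions saved going from `before` to `after` (negative = regression). */
static int64_t insns_saved(int64_t before, int64_t after)
{
    return before - after;
}

/* Re-derive the two deltas stated in the commit message. */
static void check_reported_deltas(void)
{
    assert(insns_saved(INSNS_VSX1, INSNS_VSX2) == 65534);
    assert(insns_saved(INSNS_BASE, INSNS_VSX2) == 709161);
    /* The three-register run is a regression relative to two registers. */
    assert(insns_saved(INSNS_VSX2, INSNS_VSX3) < 0);
}
```

The two-register saving works out to roughly 3.5% of the baseline instruction count.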

Comments

Benjamin Herrenschmidt May 21, 2015, 8:48 a.m. UTC | #1
On Thu, 2015-05-21 at 18:40 +1000, Stewart Smith wrote:
> Note my awful fiddling of MSR so that I can use VSX registers :)

Is this really worth it ? :-) I like being conservative in that code...

> In the hello_world test running in mambo, we get the following reduction
> in instruction/cycle count:
> Before:
> 20284943: ** finished running 20284942 instructions **
> 
> Single VSX register:
> 19687022: ** finished running 19641315 instructions **
> 
> using 2 vsx registers:
> 19621488: ** finished running 19575781 instructions **
> 19621488: ** finished running 19575781 instructions **
> 
> 65534 fewer cycles & instructions than just 1
> 709161 fewer than base implementation
> 
> using 3 vsx regs:
> 19883634: ** finished running 19837927 instructions **
> 
> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
> ---
>  asm/head.S |   24 +++++++++++++++++++-----
>  1 file changed, 19 insertions(+), 5 deletions(-)
> 
> diff --git a/asm/head.S b/asm/head.S
> index fd6e3fb..9b3d7bb 100644
> --- a/asm/head.S
> +++ b/asm/head.S
> @@ -314,16 +314,30 @@ boot_entry:
>  	cmpd	%r29,%r30
>  	beq	2f
>  	LOAD_IMM32(%r3, _sbss - __head)
> -	srdi	%r3,%r3,3
> +	srdi	%r3,%r3,5
>  	mtctr	%r3
> +	mfmsr	%r24
> +	mfmsr	%r4
> +	ori	%r4,%r4, (1<<13)
> +	oris	%r4,%r4, (1<<23)@h
> +	oris	%r4,%r4, (1<<25)@h
> +	mtmsr	%r4
>  	mr	%r4,%r30
>  	mr	%r15,%r30
>  	mr	%r30,%r29
> -1:	ld	%r0,0(%r4)
> -	std	%r0,0(%r29)
> -	addi	%r29,%r29,8
> -	addi	%r4,%r4,8
> +	addi	%r5,%r4,16
> +	addi	%r6,%r29,16
> +	
> +1:	lxvd2x	%vs1,%r0,%r4
> +	lxvd2x	%vs2,%r0,%r5
> +	stxvd2x	%vs1,%r0,%r29
> +	stxvd2x	%vs2,%r0,%r6
> +	addi	%r29,%r29,32
> +	addi	%r4,%r4,32
> +	addi	%r5,%r5,32
> +	addi	%r6,%r6,32
>  	bdnz	1b
> +	mtmsr	%r24
>  	sync
>  	icbi	0,%r29
>  	sync
Stewart Smith May 21, 2015, 2:07 p.m. UTC | #2
Benjamin Herrenschmidt <benh@kernel.crashing.org> writes:
> On Thu, 2015-05-21 at 18:40 +1000, Stewart Smith wrote:
>> Note my awful fiddling of MSR so that I can use VSX registers :)
>
> Is this really worth it ? :-) I like being conservative in that
> code...

Maybe, maybe not... Mikey is playing with some simulator foo and wanting
to get cycle count down. Of course, if you change the load location then
you avoid it, but then you also execute a bit differently than on real
hardware.
Michael Neuling May 22, 2015, 4:15 a.m. UTC | #3
On Fri, 2015-05-22 at 00:07 +1000, Stewart Smith wrote:
> Benjamin Herrenschmidt <benh@kernel.crashing.org> writes:
> > On Thu, 2015-05-21 at 18:40 +1000, Stewart Smith wrote:
> >> Note my awful fiddling of MSR so that I can use VSX registers :)
> >
> > Is this really worth it ? :-) I like being conservative in that
> > code...
> 
> Maybe, maybe not... Mikey is playing with some simulator foo and wanting
> to get cycle count down. Of course, if you change the load location then
> you avoid it, but then you also execute a bit differently than on real
> hardware.

So on this... If we can assume that memory is zero, we can save a lot of
cycles with not having to zero out some stuff (console, cpu stacks and
trace buffer in particular).  I can get the number of instructions
required to boot in sim down from 15M to 600K (ie 20x faster).  

benh, can we use the ipl-params/ipl-params/cec-major-type property to
mark "cold" boots as having zeroed memory, and hence skip some of these?
I'd like to do this at run time, rather than compile time.

Mikey
Benjamin Herrenschmidt May 22, 2015, 4:35 a.m. UTC | #4
On Fri, 2015-05-22 at 14:15 +1000, Michael Neuling wrote:
> So on this... If we can assume that memory is zero, we can save a lot of
> cycles with not having to zero out some stuff (console, cpu stacks and
> trace buffer in particular).  I can get the number of instructions
> required to boot in sim down from 15M to 600K (ie 20x faster).  
> 
> benh, can we use the ipl-params/ipl-params/cec-major-type property to
> mark "cold" boots as having zeroed memory, and hence skip some of these?
> I'd like to do this at run time, rather than compile time.

I don't know whether we have any guarantee from hostboot that we have
zeroed memory. In fact we don't ... HB itself has left remains behind.

We'll need a specific type to represent a sim env. with initial zeroed
memory.

Another bunch of things to consider:

 - Remove useless clearing unconditionally. For example, stacks. We only
need to clear the cpu_thread structure at the bottom and the last backlink.

 - Make clearing more efficient. Stewart gave it a good try but what
about using dcbz ?

 - Link skiboot at 0x3000_0000 so when pre-loaded there, it doesn't need
to relocate itself (skip not just copy but also relocation phase).

Cheers,
Ben.
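
For reference, the dcbz idea above could look roughly like the following C sketch. It is not skiboot code: the 128-byte cache block size (as on POWER7/POWER8), the USE_DCBZ opt-in macro and the function names are all assumptions for illustration. Without USE_DCBZ it falls back to memset so it also builds on non-Power hosts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: 128-byte cache blocks, as on POWER7/POWER8. */
#define CACHE_BLOCK 128u

/* Zero one whole cache block; p must be block-aligned. */
static inline void zero_block(void *p)
{
    assert(((uintptr_t)p & (CACHE_BLOCK - 1)) == 0);
#if defined(__powerpc64__) && defined(USE_DCBZ)   /* USE_DCBZ is hypothetical */
    __asm__ volatile("dcbz 0,%0" : : "r"(p) : "memory");
#else
    memset(p, 0, CACHE_BLOCK);                    /* portable stand-in */
#endif
}

/* Zero [start, start + len): memset for the ragged edges,
 * one zero_block() per fully covered cache block in between. */
static void clear_range(void *start, size_t len)
{
    uint8_t *p = start;
    uint8_t *end = p + len;
    /* Bytes needed to reach the next block boundary (0 if already aligned). */
    size_t head = (size_t)((0 - (uintptr_t)p) & (CACHE_BLOCK - 1));

    if (head > len)
        head = len;
    memset(p, 0, head);
    p += head;

    while ((size_t)(end - p) >= CACHE_BLOCK) {
        zero_block(p);
        p += CACHE_BLOCK;
    }
    memset(p, 0, (size_t)(end - p));
}
```

dcbz zeroes a whole cache block with a single instruction and no load, which is where the saving over a store loop would come from. It only applies to cacheable memory and to fully covered blocks, hence the separate head and tail handling.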
Stewart Smith May 22, 2015, 5:26 a.m. UTC | #5
Benjamin Herrenschmidt <benh@kernel.crashing.org> writes:
>> benh, can we use the ipl-params/ipl-params/cec-major-type property to
>> mark "cold" boots as having zeroed memory, and hence skip some of these?
>> I'd like to do this at run time, rather than compile time.
>
> I don't know whether we have any guarantee from hostboot that we have
> zeroed memory. In fact we don't ... HB itself has left remains behind.

Hrm... that's true.

> We'll need a specific type to represent a sim env. with initial zeroed
> memory.
>
> Another bunch of things to consider:
>
>  - Remove useless clearing unconditionally. For example, stacks. We only
> need to clear the cpu_thread structure at the bottom and the last
> backlink.

May be useful to (like mem poisoning) poison the stacks on boot too, may
find something exciting and annoying to debug!

>  - Make clearing more efficient. Stewart gave it a good try but what
> about using dcbz ?

yeah, that's probably what we should end up doing. I was just looking
into the implementation that ended up in linux, I'm tempted to try
something with dcbz, yeah. Or, even better, I go to the pub early and
Mikey does it and sends me patches :)

>  - Link skiboot at 0x3000_0000 so when pre-loaded there, it doesn't need
> to relocate itself (skip not just copy but also relocation phase).

That could be useful, yeah.
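
The two stack ideas discussed above (clear only what must be zero, optionally poison the rest) can be combined in one routine. A minimal sketch follows. The layout is an assumption for illustration only: a per-CPU structure at the bottom of the stack and a terminating back chain in the last 8 bytes at the top. STACK_SIZE, THREAD_SIZE and STACK_POISON are hypothetical values, not skiboot's.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define STACK_SIZE   16384u  /* hypothetical per-CPU stack size */
#define THREAD_SIZE  256u    /* hypothetical size of the per-CPU structure */
#define STACK_POISON 0x5a    /* hypothetical poison byte */

/* Prepare one stack: zero only the per-CPU structure at the bottom and
 * the final back chain at the top; optionally poison everything between. */
static void init_stack(uint8_t *stack, int poison)
{
    uint8_t *backlink = stack + STACK_SIZE - sizeof(uint64_t);

    assert(((uintptr_t)stack & 7) == 0);

    if (poison)
        memset(stack + THREAD_SIZE, STACK_POISON,
               STACK_SIZE - THREAD_SIZE - sizeof(uint64_t));

    memset(stack, 0, THREAD_SIZE);           /* per-CPU structure */
    memset(backlink, 0, sizeof(uint64_t));   /* terminates backtraces */
}
```

With poison set to 0 this touches THREAD_SIZE + 8 bytes per stack instead of the whole stack. With poisoning enabled the cost is back to a full pass over the stack, so it would presumably be a debug option.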
Patch

diff --git a/asm/head.S b/asm/head.S
index fd6e3fb..9b3d7bb 100644
--- a/asm/head.S
+++ b/asm/head.S
@@ -314,16 +314,30 @@  boot_entry:
 	cmpd	%r29,%r30
 	beq	2f
 	LOAD_IMM32(%r3, _sbss - __head)
-	srdi	%r3,%r3,3
+	srdi	%r3,%r3,5
 	mtctr	%r3
+	mfmsr	%r24
+	mfmsr	%r4
+	ori	%r4,%r4, (1<<13)
+	oris	%r4,%r4, (1<<23)@h
+	oris	%r4,%r4, (1<<25)@h
+	mtmsr	%r4
 	mr	%r4,%r30
 	mr	%r15,%r30
 	mr	%r30,%r29
-1:	ld	%r0,0(%r4)
-	std	%r0,0(%r29)
-	addi	%r29,%r29,8
-	addi	%r4,%r4,8
+	addi	%r5,%r4,16
+	addi	%r6,%r29,16
+	
+1:	lxvd2x	%vs1,%r0,%r4
+	lxvd2x	%vs2,%r0,%r5
+	stxvd2x	%vs1,%r0,%r29
+	stxvd2x	%vs2,%r0,%r6
+	addi	%r29,%r29,32
+	addi	%r4,%r4,32
+	addi	%r5,%r5,32
+	addi	%r6,%r6,32
 	bdnz	1b
+	mtmsr	%r24
 	sync
 	icbi	0,%r29
 	sync
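
For readers who do not read PowerPC assembly, here is a rough C model of a two-register copy loop of this shape. It is a sketch under assumptions, not a translation of head.S: the function name is made up, the 16-byte memcpy calls stand in for lxvd2x/stxvd2x, and the length is assumed to be a multiple of 32.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define VSX_BYTES 16u   /* one VSX register holds 128 bits */

/* Copy len bytes using two 16-byte "registers" per iteration.
 * Returns the iteration count, i.e. the value loaded into CTR. */
static size_t copy_two_lanes(uint8_t *dst, const uint8_t *src, size_t len)
{
    size_t iters = len >> 5;   /* 32 bytes per iteration */

    assert((len & 31) == 0);   /* assumption: no tail handling */

    for (size_t i = 0; i < iters; i++) {
        uint8_t lane0[VSX_BYTES], lane1[VSX_BYTES];

        memcpy(lane0, src, VSX_BYTES);              /* load, first lane */
        memcpy(lane1, src + VSX_BYTES, VSX_BYTES);  /* load, second lane */
        memcpy(dst, lane0, VSX_BYTES);              /* store, first lane */
        memcpy(dst + VSX_BYTES, lane1, VSX_BYTES);  /* store, second lane */

        src += 2 * VSX_BYTES;
        dst += 2 * VSX_BYTES;
    }
    return iters;
}
```

The invariant worth checking in any loop like this is that the pointer stride times the iteration count equals the length. With a count of len >> 5, every pointer has to advance 32 bytes per iteration; advancing by 16 would cover only about half of the image.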