Message ID | 1432197635-13724-1-git-send-email-stewart@linux.vnet.ibm.com |
---|---|
State | Deferred |
On Thu, 2015-05-21 at 18:40 +1000, Stewart Smith wrote:
> Note my awful fiddling of MSR so that I can use VSX registers :)

Is this really worth it ? :-) I like being conservative in that code...

> In the hello_world test running in mambo, we get the following reduction
> in instruction/cycle count:
> Before:
> 20284943: ** finished running 20284942 instructions **
>
> Single VSX register:
> 19687022: ** finished running 19641315 instructions **
>
> using 2 vsx registers:
> 19621488: ** finished running 19575781 instructions **
> 19621488: ** finished running 19575781 instructions **
>
> 65534 fewer cycles & instructions than just 1
> 709161 fewer than base implementation
>
> using 3 vsx regs:
> 19883634: ** finished running 19837927 instructions **
>
> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
> ---
>  asm/head.S | 24 +++++++++++++++++++-----
>  1 file changed, 19 insertions(+), 5 deletions(-)
>
> diff --git a/asm/head.S b/asm/head.S
> index fd6e3fb..9b3d7bb 100644
> --- a/asm/head.S
> +++ b/asm/head.S
> @@ -314,16 +314,30 @@ boot_entry:
>  	cmpd %r29,%r30
>  	beq 2f
>  	LOAD_IMM32(%r3, _sbss - __head)
> -	srdi %r3,%r3,3
> +	srdi %r3,%r3,5
>  	mtctr %r3
> +	mfmsr %r24
> +	mfmsr %r4
> +	oris %r4,%r4, (1<<13)@h
> +	oris %r4,%r4, (1<<23)@h
> +	oris %r4,%r4, (1<<25)@h
> +	mtmsr %r4
>  	mr %r4,%r30
>  	mr %r15,%r30
>  	mr %r30,%r29
> -1:	ld %r0,0(%r4)
> -	std %r0,0(%r29)
> -	addi %r29,%r29,8
> -	addi %r4,%r4,8
> +	addi %r5,%r4,16
> +	addi %r6,%r29,16
> +
> +1:	lxvd2x %vs1,%r0,%r4
> +	lxvd2x %vs2,%r0,%r5
> +	stxvd2x %vs1,%r0,%r29
> +	stxvd2x %vs2,%r0,%r6
> +	addi %r29,%r29,16
> +	addi %r4,%r4,16
> +	addi %r5,%r5,16
> +	addi %r6,%r6,16
>  	bdnz 1b
> +	mtmsr %r24
>  	sync
>  	icbi 0,%r29
>  	sync
Benjamin Herrenschmidt <benh@kernel.crashing.org> writes:
> On Thu, 2015-05-21 at 18:40 +1000, Stewart Smith wrote:
>> Note my awful fiddling of MSR so that I can use VSX registers :)
>
> Is this really worth it ? :-) I like being conservative in that
> code...

Maybe, maybe not... Mikey is playing with some simulator foo and wanting
to get cycle count down. Of course, if you change the load location then
you avoid it, but then you also execute a bit differently than on real
hardware.
On Fri, 2015-05-22 at 00:07 +1000, Stewart Smith wrote:
> Benjamin Herrenschmidt <benh@kernel.crashing.org> writes:
> > On Thu, 2015-05-21 at 18:40 +1000, Stewart Smith wrote:
> >> Note my awful fiddling of MSR so that I can use VSX registers :)
> >
> > Is this really worth it ? :-) I like being conservative in that
> > code...
>
> Maybe, maybe not... Mikey is playing with some simulator foo and wanting
> to get cycle count down. Of course, if you change the load location then
> you avoid it, but then you also execute a bit differently than on real
> hardware.

So on this... If we can assume that memory is zero, we can save a lot of
cycles with not having to zero out some stuff (console, cpu stacks and
trace buffer in particular). I can get the number of instructions
required to boot in sim down from 15M to 600K (ie 20x faster).

benh, can we use the ipl-params/ipl-params/cec-major-type property to
mark "cold" boots as having zeroed memory, and hence skip some of these?
I'd like to do this at run time, rather than compile time.

Mikey
On Fri, 2015-05-22 at 14:15 +1000, Michael Neuling wrote:
> So on this... If we can assume that memory is zero, we can save a lot of
> cycles with not having to zero out some stuff (console, cpu stacks and
> trace buffer in particular). I can get the number of instructions
> required to boot in sim down from 15M to 600K (ie 20x faster).
>
> benh, can we use the ipl-params/ipl-params/cec-major-type property to
> mark "cold" boots as having zeroed memory, and hence skip some of these?
> I'd like to do this at run time, rather than compile time.

I don't know whether we have any guarantee from hostboot that we have
zeroed memory. In fact we don't ... HB itself has left remains behind.

We'll need a specific type to represent a sim env. with initial zeroed
memory.

Another bunch of things to consider:

 - Remove useless clearing unconditionally. For example, stacks. We only
   need to clear the cpu_thread structure at the bottom and the last
   backlink.

 - Make clearing more efficient. Stewart gave it a good try but what
   about using dcbz ?

 - Link skiboot at 0x3000_0000 so when pre-loaded there, it doesn't need
   to relocate itself (skip not just copy but also relocation phase).

Cheers,
Ben.
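The first suggestion in the list above (clear only the cpu_thread structure at the bottom of each stack plus the last backlink, instead of the whole stack) can be sketched in C. This is an illustration under stated assumptions, not skiboot code: `SKETCH_STACK_SIZE`, `struct sketch_cpu_thread` and `sketch_init_stack()` are hypothetical names, the struct layout is a placeholder, and "last backlink" is assumed to mean the top 8 bytes of the stack region.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for illustration; the real skiboot stack size
 * and struct cpu_thread layout are different. */
#define SKETCH_STACK_SIZE 16384u

struct sketch_cpu_thread {
    uint32_t pir;
    uint32_t state;
    uint64_t scratch[6];
};

/*
 * Zero only the parts of a stack that must start out as zero:
 *  - the per-thread structure at the bottom of the region, and
 *  - the final back-chain word, assumed here to be the top 8 bytes.
 * Everything in between is left untouched.
 */
static void sketch_init_stack(unsigned char *stack, size_t size)
{
    assert(size >= sizeof(struct sketch_cpu_thread) + sizeof(uint64_t));

    memset(stack, 0, sizeof(struct sketch_cpu_thread));
    memset(stack + size - sizeof(uint64_t), 0, sizeof(uint64_t));
}
```

With these placeholder sizes, each call writes 64 bytes instead of 16384, so the saving scales with the number of CPU stacks.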
Benjamin Herrenschmidt <benh@kernel.crashing.org> writes:
>> benh, can we use the ipl-params/ipl-params/cec-major-type property to
>> mark "cold" boots as having zeroed memory, and hence skip some of these?
>> I'd like to do this at run time, rather than compile time.
>
> I don't know whether we have any guarantee from hostboot that we have
> zeroed memory. In fact we don't ... HB itself has left remains behind.

Hrm... that's true.

> We'll need a specific type to represent a sim env. with initial zeroed
> memory.
>
> Another bunch of things to consider:
>
> - Remove useless clearing unconditionally. For example, stacks. We only
>   need to clear the cpu_thread structure at the bottom and the last
>   backlink.

May be useful to (like mem poisoning) poison the stacks on boot too, may
find something exciting and annoying to debug!

> - Make clearing more efficient. Stewart gave it a good try but what
>   about using dcbz ?

yeah, that's probably what we should end up doing. I was just looking
into the implementation that ended up in linux, I'm tempted to try
something with dcbz, yeah.

Or, even better, I go to the pub early and Mikey does it and sends me
patches :)

> - Link skiboot at 0x3000_0000 so when pre-loaded there, it doesn't need
>   to relocate itself (skip not just copy but also relocation phase).

That could be useful, yeah.
diff --git a/asm/head.S b/asm/head.S
index fd6e3fb..9b3d7bb 100644
--- a/asm/head.S
+++ b/asm/head.S
@@ -314,16 +314,30 @@ boot_entry:
 	cmpd %r29,%r30
 	beq 2f
 	LOAD_IMM32(%r3, _sbss - __head)
-	srdi %r3,%r3,3
+	srdi %r3,%r3,5
 	mtctr %r3
+	mfmsr %r24
+	mfmsr %r4
+	oris %r4,%r4, (1<<13)@h
+	oris %r4,%r4, (1<<23)@h
+	oris %r4,%r4, (1<<25)@h
+	mtmsr %r4
 	mr %r4,%r30
 	mr %r15,%r30
 	mr %r30,%r29
-1:	ld %r0,0(%r4)
-	std %r0,0(%r29)
-	addi %r29,%r29,8
-	addi %r4,%r4,8
+	addi %r5,%r4,16
+	addi %r6,%r29,16
+
+1:	lxvd2x %vs1,%r0,%r4
+	lxvd2x %vs2,%r0,%r5
+	stxvd2x %vs1,%r0,%r29
+	stxvd2x %vs2,%r0,%r6
+	addi %r29,%r29,16
+	addi %r4,%r4,16
+	addi %r5,%r5,16
+	addi %r6,%r6,16
 	bdnz 1b
+	mtmsr %r24
 	sync
 	icbi 0,%r29
 	sync
Note my awful fiddling of MSR so that I can use VSX registers :)

In the hello_world test running in mambo, we get the following reduction
in instruction/cycle count:

Before:
20284943: ** finished running 20284942 instructions **

Single VSX register:
19687022: ** finished running 19641315 instructions **

using 2 vsx registers:
19621488: ** finished running 19575781 instructions **
19621488: ** finished running 19575781 instructions **

65534 fewer cycles & instructions than just 1
709161 fewer than base implementation

using 3 vsx regs:
19883634: ** finished running 19837927 instructions **

Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
---
 asm/head.S | 24 +++++++++++++++++++-----
 1 file changed, 19 insertions(+), 5 deletions(-)
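The deltas quoted in the commit message can be re-derived from the raw mambo lines. The sketch below does that in C, assuming the number before the colon on each line is a cycle count and the number inside the message is the instruction count; that reading, the `struct sim_count` type and the helper names are assumptions made for illustration, not part of the patch.

```c
#include <stdint.h>
#include <stdio.h>

/* One mambo result line, read as
 * "<cycles>: ** finished running <instructions> instructions **".
 * Treating the leading number as cycles is an assumption. */
struct sim_count {
    int64_t cycles;
    int64_t instructions;
};

/* Values copied from the commit message. */
static const struct sim_count base      = { 20284943, 20284942 };
static const struct sim_count one_vsx   = { 19687022, 19641315 };
static const struct sim_count two_vsx   = { 19621488, 19575781 };
static const struct sim_count three_vsx = { 19883634, 19837927 };

/* Positive result means `to` is cheaper than `from`. */
static int64_t cycles_saved(struct sim_count from, struct sim_count to)
{
    return from.cycles - to.cycles;
}

static int64_t instructions_saved(struct sim_count from, struct sim_count to)
{
    return from.instructions - to.instructions;
}

/* Print each variant's saving relative to the base implementation. */
static void print_savings(void)
{
    const struct sim_count variants[] = { one_vsx, two_vsx, three_vsx };
    const char *names[] = { "1 VSX reg", "2 VSX regs", "3 VSX regs" };

    for (size_t i = 0; i < sizeof(variants) / sizeof(variants[0]); i++) {
        printf("%-10s: %lld cycles, %lld instructions saved vs base\n",
               names[i],
               (long long)cycles_saved(base, variants[i]),
               (long long)instructions_saved(base, variants[i]));
    }
}
```

Under that reading, the 65534 figure holds for both cycles and instructions between the one- and two-register variants, while 709161 is the instruction delta against the base run; the three-register variant comes out slower than the two-register one.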