Message ID | 20120605120222.6722a3e3@kryten (mailing list archive) |
---|---|
State | Accepted, archived |
Commit | cf8fb5533f35709ba7e31560264b565a9c7a090f |
Delegated to: | Michael Ellerman |
Headers | show |
> +err1; dcbz r0,r3
There is no such instruction, you probably meant "dcbz 0,r3"?
Segher
On Wed, 2012-06-06 at 18:40 +0200, Segher Boessenkool wrote: > > +err1; dcbz r0,r3 > > There is no such instruction, you probably meant "dcbz 0,r3"? This reminds me... what would happen if we changed all our #define r0 0 #define r1 1 etc... to: #define r0 %r0 #define r1 %r1 ? I'm thinking it might help catch that sort of nasties (and some of them can be really nasty, such as inverting mfspr/mtspr arguments, or vs ori, etc... ). I'm sure we'd have a problem with a few macros & inline constructs but nothing we can't fix.. (Haven't tested ... still home, officially sick :-) Cheers, Ben.
On Mon, Jun 4, 2012 at 7:02 PM, Anton Blanchard <anton@samba.org> wrote: > > I blame Mikey for this. He elevated my slightly dubious testcase: > > # dd if=/dev/zero of=/dev/null bs=1M count=10000 > > to benchmark status. And naturally we need to be number 1 at creating > zeros. So lets improve __clear_user some more. > > As Paul suggests we can use dcbz for large lengths. This patch gets > the destination cacheline aligned then uses dcbz on whole cachelines. > > Before: > 10485760000 bytes (10 GB) copied, 0.414744 s, 25.3 GB/s > > After: > 10485760000 bytes (10 GB) copied, 0.268597 s, 39.0 GB/s > > 39 GB/s, a new record. > > Signed-off-by: Anton Blanchard <anton@samba.org> Besides the comments from Segher, feel free to add: Tested-by: Olof Johansson <olof@lixom.net> Acked-by: Olof Johansson <olof@lixom.net> Didn't help performance all that much on pa6t, but it didn't go down. Too low on cycles to actually analyze why at this time. -OIof
On Wed, Jun 06, 2012 at 06:40:54PM +0200, Segher Boessenkool wrote: > >+err1; dcbz r0,r3 > > There is no such instruction, you probably meant "dcbz 0,r3"? There certainly is such an instruction, though it doesn't do exactly what a naive reader might expect. Using 0 rather than r0 or %r0 improves readability but makes no difference to the assembler or the cpu. Paul.
Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote: > On Wed, 2012-06-06 at 18:40 +0200, Segher Boessenkool wrote: > > > +err1; dcbz r0,r3 > > > > There is no such instruction, you probably meant "dcbz 0,r3"? > > This reminds me... what would happen if we changed all our > > #define r0 0 > #define r1 1 > > etc... to: > > #define r0 %r0 > #define r1 %r1 > > ? > > I'm thinking it might help catch that sort of nasties (and some of them > can be really nasty, such as inverting mfspr/mtspr arguments, or vs ori, > etc... ). I'm sure we'd have a problem with a few macros & inline > constructs but nothing we can't fix.. One problem with this is when we construct the instructions, like using anything from ppc-opcode.h. eg. using PPC_POPCNTB would need to go from: PPC_POPCNTB(r3,r3) to: PPC_POPCNTB(3,3) Which is less readable IMHO. Mikey
On Thu, 2012-06-07 at 16:05 +1000, Michael Neuling wrote: > Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote: > > > On Wed, 2012-06-06 at 18:40 +0200, Segher Boessenkool wrote: > > > > +err1; dcbz r0,r3 > > > > > > There is no such instruction, you probably meant "dcbz 0,r3"? > > > > This reminds me... what would happen if we changed all our > > > > #define r0 0 > > #define r1 1 > > > > etc... to: > > > > #define r0 %r0 > > #define r1 %r1 > > > > ? > > > > I'm thinking it might help catch that sort of nasties (and some of them > > can be really nasty, such as inverting mfspr/mtspr arguments, or vs ori, > > etc... ). I'm sure we'd have a problem with a few macros & inline > > constructs but nothing we can't fix.. > > One problem with this is when we construct the instructions, like using > anything from ppc-opcode.h. eg. using PPC_POPCNTB would need to go from: > PPC_POPCNTB(r3,r3) > to: > PPC_POPCNTB(3,3) #define R(x) x #define PPC_POPCNTB(R(3), R(3)) ?? cheers
Michael Ellerman <michael@ellerman.id.au> wrote: > On Thu, 2012-06-07 at 16:05 +1000, Michael Neuling wrote: > > Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote: > > > > > On Wed, 2012-06-06 at 18:40 +0200, Segher Boessenkool wrote: > > > > > +err1; dcbz r0,r3 > > > > > > > > There is no such instruction, you probably meant "dcbz 0,r3"? > > > > > > This reminds me... what would happen if we changed all our > > > > > > #define r0 0 > > > #define r1 1 > > > > > > etc... to: > > > > > > #define r0 %r0 > > > #define r1 %r1 > > > > > > ? > > > > > > I'm thinking it might help catch that sort of nasties (and some of them > > > can be really nasty, such as inverting mfspr/mtspr arguments, or vs ori, > > > etc... ). I'm sure we'd have a problem with a few macros & inline > > > constructs but nothing we can't fix.. > > > > One problem with this is when we construct the instructions, like using > > anything from ppc-opcode.h. eg. using PPC_POPCNTB would need to go from: > > PPC_POPCNTB(r3,r3) > > to: > > PPC_POPCNTB(3,3) > > #define R(x) x #define R(x) (x) > #define PPC_POPCNTB(R(3), R(3)) Maybe, looks pretty gross but you're the maintainer! :-) Mikey
On Thu, 2012-06-07 at 16:12 +1000, Michael Neuling wrote: > Michael Ellerman <michael@ellerman.id.au> wrote: > > > On Thu, 2012-06-07 at 16:05 +1000, Michael Neuling wrote: > > > Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote: > > > > > > > On Wed, 2012-06-06 at 18:40 +0200, Segher Boessenkool wrote: > > > > > > +err1; dcbz r0,r3 > > > > > > > > > > There is no such instruction, you probably meant "dcbz 0,r3"? > > > > > > > > This reminds me... what would happen if we changed all our > > > > > > > > #define r0 0 > > > > #define r1 1 > > > > > > > > etc... to: > > > > > > > > #define r0 %r0 > > > > #define r1 %r1 > > > > > > > > ? > > > > > > > > I'm thinking it might help catch that sort of nasties (and some of them > > > > can be really nasty, such as inverting mfspr/mtspr arguments, or vs ori, > > > > etc... ). I'm sure we'd have a problem with a few macros & inline > > > > constructs but nothing we can't fix.. > > > > > > One problem with this is when we construct the instructions, like using > > > anything from ppc-opcode.h. eg. using PPC_POPCNTB would need to go from: > > > PPC_POPCNTB(r3,r3) > > > to: > > > PPC_POPCNTB(3,3) > > > > #define R(x) x > > #define R(x) (x) > > > #define PPC_POPCNTB(R(3), R(3)) > > Maybe, looks pretty gross but you're the maintainer! :-) No I am not! I agree it's fairly gross. But I'll take gross and correct over ungross and buggy. cheers
On Thu, 2012-06-07 at 16:05 +1000, Michael Neuling wrote: > One problem with this is when we construct the instructions, like using > anything from ppc-opcode.h. eg. using PPC_POPCNTB would need to go from: > PPC_POPCNTB(r3,r3) > to: > PPC_POPCNTB(3,3) > Which is less readable IMHO. Yes, I know. Not much to do about this, but it might still be worth it, how much time wasted due to mixing up or with ori in asm somewhere ? One option would be to #define R3 (or _r3) for use in those macros so we still have something nicer than just "3"... oh well. Cheers, Ben.
> -----Original Message----- > From: Linuxppc-dev > [mailto:linuxppc-dev-bounces+david.laight=aculab.com@lists.ozl abs.org] On Behalf Of Benjamin Herrenschmidt > Sent: 07 June 2012 07:40 > To: Michael Neuling > Cc: michael@ellerman.id.au; paulus@samba.org; Anton > Blanchard; olof@lixom.net; linuxppc-dev@lists.ozlabs.org > Subject: Re: [PATCH] powerpc: Optimise the 64bit optimised > __clear_user > > On Thu, 2012-06-07 at 16:05 +1000, Michael Neuling wrote: > > > One problem with this is when we construct the > instructions, like using > > anything from ppc-opcode.h. eg. using PPC_POPCNTB would > need to go from: > > PPC_POPCNTB(r3,r3) > > to: > > PPC_POPCNTB(3,3) > > Which is less readable IMHO. > > Yes, I know. Not much to do about this, but it might still be > worth it, > how much time wasted due to mixing up or with ori in asm somewhere ? > > One option would be to #define R3 (or _r3) for use in those macros so > we still have something nicer than just "3"... oh well. You then need to catch all the other bugs where a register is being used instead of a constant. Time to make the assembler require %r0 (etc) David
On 06/06/2012 10:04 PM, Paul Mackerras wrote: > On Wed, Jun 06, 2012 at 06:40:54PM +0200, Segher Boessenkool wrote: >>> +err1; dcbz r0,r3 >> >> There is no such instruction, you probably meant "dcbz 0,r3"? > > There certainly is such an instruction, though it doesn't do exactly > what a naive reader might expect. Using 0 rather than r0 or %r0 > improves readability but makes no difference to the assembler or the > cpu. The assembler can't tell that you used r0 rather than 0, because that's handled by the preprocessor, but it seems like a bug (or at least lax error checking) that it accepts %r0 there, and that objdump decodes it as "dcbz r0,r3" rather than "dcbz 0,r3". It's also odd that objdump produces output containing "r3" (not %r3) when the assembler won't directly accept it. -Scott
On Thu, 2012-06-07 at 12:51 -0500, Scott Wood wrote: > > The assembler can't tell that you used r0 rather than 0, because > that's > handled by the preprocessor, but it seems like a bug (or at least lax > error checking) that it accepts %r0 there, and that objdump decodes it > as "dcbz r0,r3" rather than "dcbz 0,r3". Well, this is the domain of bike shed painting :-) the syntax of the instruction is: dcbz RA,RB Now, of course, like other RA,RB pairs it has the semantic that if RA = 0 then b <- 0 else b <- (RA) EA <- b + (RB) If we were to be pendantic and our assembly could also enforce the use of % for registers, I suppose it would make sense to require 0 rather than %0 for those "RA" forms to make it absolutely clear that we are talking about 0 and not r0 in this specific case, I agree, but in the current shape of the asm, I don't think that's something to expect. Cheers, Ben.
Scott Wood <scottwood@freescale.com> writes:
> and that objdump decodes it as "dcbz r0,r3" rather than "dcbz 0,r3".
<http://permalink.gmane.org/gmane.comp.gnu.binutils/57870>
Andreas.
First 5 patches convert us to %r0-31. Next 10 patches make using R0-31 required.
Michael Neuling <mikey@neuling.org> writes: > Index: clone3/arch/powerpc/include/asm/ppc_asm.h > =================================================================== > --- clone3.orig/arch/powerpc/include/asm/ppc_asm.h > +++ clone3/arch/powerpc/include/asm/ppc_asm.h > @@ -178,6 +178,11 @@ END_FW_FTR_SECTION_IFSET(FW_FEATURE_SPLP > #define HMT_HIGH or 3,3,3 > #define HMT_EXTRA_HIGH or 7,7,7 # power7 only > > +#define STACKFRAMESIZE 256 > +#define STK_REG(i) (112 + ((i)-14)*8) > + > +#define STK_PARAM(i) (48 + ((i)-3)*8) That should probably be bracketed by CONFIG_PPC64, since it is only applicable to 64-bit code. Andreas.
Michael Neuling <mikey@neuling.org> writes: > /* Invalidate all TLBs */ > - PPC_TLBILX_ALL(0,0) > + PPC_TLBILX_ALL(R0,R0) The first argument is (RA|0), so 0 is zero, not r0. Andreas.
On Fri, 2012-06-08 at 14:15 +0200, Andreas Schwab wrote: > Michael Neuling <mikey@neuling.org> writes: > > > /* Invalidate all TLBs */ > > - PPC_TLBILX_ALL(0,0) > > + PPC_TLBILX_ALL(R0,R0) > > The first argument is (RA|0), so 0 is zero, not r0. The macro system we use cannot do that (it will prefix with REG_), since both arguments are registers we must use R0 in this case. Cheers, Ben.
Benjamin Herrenschmidt <benh@kernel.crashing.org> writes: > The macro system we use cannot do that (it will prefix with REG_), since > both arguments are registers we must use R0 in this case. So define a ___PPC_RA0 macro that doesn't do that. Andreas.
On Sat, 2012-06-09 at 08:53 +0200, Andreas Schwab wrote: > > > The macro system we use cannot do that (it will prefix with REG_), > since > > both arguments are registers we must use R0 in this case. > > So define a ___PPC_RA0 macro that doesn't do that. But then we lose the checking for other instructions :-) Unless we start being nasty and defining a different macro form for RA which can be 0... I'd rather not go there unless we absolutely have to... What would be nice also would be if we had a gas option to enforce the use of % for register names. We'd probably have to struggle a little bit with gcc inline asm in a case or two though. Cheers, Ben.
Benjamin Herrenschmidt <benh@kernel.crashing.org> writes: > On Sat, 2012-06-09 at 08:53 +0200, Andreas Schwab wrote: >> >> > The macro system we use cannot do that (it will prefix with REG_), >> since >> > both arguments are registers we must use R0 in this case. >> >> So define a ___PPC_RA0 macro that doesn't do that. > > But then we lose the checking for other instructions :-) ??? There is no loss of checking for instructions that do not use ___PPC_RA0. > Unless we start being nasty and defining a different macro form for RA > which can be 0... That's what ___PPC_RA0 is all about. > I'd rather not go there unless we absolutely have to... Having to use R0 for an insn that does *not* use r0 is clearly a step backwards. > What would be nice also would be if we had a gas option to enforce the > use of % for register names. If gas is ever changed that way you have to be explict about 0 vs. %r0 anyway. Andreas.
First 5 patches convert us to %r0-31. Next 12 convert make using R0-31 required in macros. Last 2 convert instructions where ra = r0 we use 0 rather than the register value (as suggested by Andreas). Version 2 adds: ra = 0 idea (as Andreas suggested) Fixes for 32bit KVM (added ppc44x_defconfig to my testing) Based on mpe's next tree which has Antons power7 copy patches which needed fixes for this
Index: linux-build/arch/powerpc/lib/string_64.S =================================================================== --- linux-build.orig/arch/powerpc/lib/string_64.S 2012-06-04 16:18:56.351604302 +1000 +++ linux-build/arch/powerpc/lib/string_64.S 2012-06-05 11:42:10.044654851 +1000 @@ -19,6 +19,12 @@ */ #include <asm/ppc_asm.h> +#include <asm/asm-offsets.h> + + .section ".toc","aw" +PPC64_CACHES: + .tc ppc64_caches[TC],ppc64_caches + .section ".text" /** * __clear_user: - Zero a block of memory in user space, with less checking. @@ -94,9 +100,14 @@ err1; stw r0,0(r3) addi r3,r3,4 3: sub r4,r4,r6 - srdi r6,r4,5 + cmpdi r4,32 + cmpdi cr1,r4,512 blt .Lshort_clear + bgt cr1,.Llong_clear + +.Lmedium_clear: + srdi r6,r4,5 mtctr r6 /* Do 32 byte chunks */ @@ -139,3 +150,53 @@ err1; stb r0,0(r3) 10: li r3,0 blr + +.Llong_clear: + ld r5,PPC64_CACHES@toc(r2) + + bf cr7*4+0,11f +err2; std r0,0(r3) + addi r3,r3,8 + addi r4,r4,-8 + + /* Destination is 16 byte aligned, need to get it cacheline aligned */ +11: lwz r7,DCACHEL1LOGLINESIZE(r5) + lwz r9,DCACHEL1LINESIZE(r5) + + /* + * With worst case alignment the long clear loop takes a minimum + * of 1 byte less than 2 cachelines. + */ + sldi r10,r9,2 + cmpd r4,r10 + blt .Lmedium_clear + + neg r6,r3 + addi r10,r9,-1 + and. r5,r6,r10 + beq 13f + + srdi r6,r5,4 + mtctr r6 + mr r8,r3 +12: +err1; std r0,0(r3) +err1; std r0,8(r3) + addi r3,r3,16 + bdnz 12b + + sub r4,r4,r5 + +13: srd r6,r4,r7 + mtctr r6 + mr r8,r3 +14: +err1; dcbz r0,r3 + add r3,r3,r9 + bdnz 14b + + and r4,r4,r10 + + cmpdi r4,32 + blt .Lshort_clear + b .Lmedium_clear
I blame Mikey for this. He elevated my slightly dubious testcase: # dd if=/dev/zero of=/dev/null bs=1M count=10000 to benchmark status. And naturally we need to be number 1 at creating zeros. So lets improve __clear_user some more. As Paul suggests we can use dcbz for large lengths. This patch gets the destination cacheline aligned then uses dcbz on whole cachelines. Before: 10485760000 bytes (10 GB) copied, 0.414744 s, 25.3 GB/s After: 10485760000 bytes (10 GB) copied, 0.268597 s, 39.0 GB/s 39 GB/s, a new record. Signed-off-by: Anton Blanchard <anton@samba.org> --- v2: handle any cacheline size