diff mbox

powerpc: Optimise the 64bit optimised __clear_user

Message ID 20120605120222.6722a3e3@kryten (mailing list archive)
State Accepted, archived
Commit cf8fb5533f35709ba7e31560264b565a9c7a090f
Delegated to: Michael Ellerman
Headers show

Commit Message

Anton Blanchard June 5, 2012, 2:02 a.m. UTC
I blame Mikey for this. He elevated my slightly dubious testcase:

# dd if=/dev/zero of=/dev/null bs=1M count=10000

to benchmark status. And naturally we need to be number 1 at creating
zeros. So lets improve __clear_user some more.

As Paul suggests we can use dcbz for large lengths. This patch gets
the destination cacheline aligned then uses dcbz on whole cachelines.

Before:
10485760000 bytes (10 GB) copied, 0.414744 s, 25.3 GB/s

After:
10485760000 bytes (10 GB) copied, 0.268597 s, 39.0 GB/s

39 GB/s, a new record.

Signed-off-by: Anton Blanchard <anton@samba.org>
---

v2: handle any cacheline size

Comments

Segher Boessenkool June 6, 2012, 4:40 p.m. UTC | #1
> +err1;	dcbz	r0,r3

There is no such instruction, you probably meant "dcbz 0,r3"?


Segher
Benjamin Herrenschmidt June 6, 2012, 9:20 p.m. UTC | #2
On Wed, 2012-06-06 at 18:40 +0200, Segher Boessenkool wrote:
> > +err1;	dcbz	r0,r3
> 
> There is no such instruction, you probably meant "dcbz 0,r3"?

This reminds me... what would happen if we changed all our

#define	r0	0
#define	r1	1

etc... to:

#define r0	%r0
#define r1	%r1

?

I'm thinking it might help catch that sort of nasties (and some of them
can be really nasty, such as inverting mfspr/mtspr arguments, or vs ori,
etc... ). I'm sure we'd have a problem with a few macros & inline
constructs but nothing we can't fix..

(Haven't tested ... still home, officially sick :-)

Cheers,
Ben.
Olof Johansson June 7, 2012, 12:30 a.m. UTC | #3
On Mon, Jun 4, 2012 at 7:02 PM, Anton Blanchard <anton@samba.org> wrote:
>
> I blame Mikey for this. He elevated my slightly dubious testcase:
>
> # dd if=/dev/zero of=/dev/null bs=1M count=10000
>
> to benchmark status. And naturally we need to be number 1 at creating
> zeros. So lets improve __clear_user some more.
>
> As Paul suggests we can use dcbz for large lengths. This patch gets
> the destination cacheline aligned then uses dcbz on whole cachelines.
>
> Before:
> 10485760000 bytes (10 GB) copied, 0.414744 s, 25.3 GB/s
>
> After:
> 10485760000 bytes (10 GB) copied, 0.268597 s, 39.0 GB/s
>
> 39 GB/s, a new record.
>
> Signed-off-by: Anton Blanchard <anton@samba.org>

Besides the comments from Segher, feel free to add:

Tested-by: Olof Johansson <olof@lixom.net>
Acked-by: Olof Johansson <olof@lixom.net>

Didn't help performance all that much on pa6t, but it didn't go down.
Too low on cycles to actually analyze why at this time.

-OIof
Paul Mackerras June 7, 2012, 3:04 a.m. UTC | #4
On Wed, Jun 06, 2012 at 06:40:54PM +0200, Segher Boessenkool wrote:
> >+err1;	dcbz	r0,r3
> 
> There is no such instruction, you probably meant "dcbz 0,r3"?

There certainly is such an instruction, though it doesn't do exactly
what a naive reader might expect.  Using 0 rather than r0 or %r0
improves readability but makes no difference to the assembler or the
cpu.

Paul.
Michael Neuling June 7, 2012, 6:05 a.m. UTC | #5
Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:

> On Wed, 2012-06-06 at 18:40 +0200, Segher Boessenkool wrote:
> > > +err1;	dcbz	r0,r3
> > 
> > There is no such instruction, you probably meant "dcbz 0,r3"?
> 
> This reminds me... what would happen if we changed all our
> 
> #define	r0	0
> #define	r1	1
> 
> etc... to:
> 
> #define r0	%r0
> #define r1	%r1
> 
> ?
> 
> I'm thinking it might help catch that sort of nasties (and some of them
> can be really nasty, such as inverting mfspr/mtspr arguments, or vs ori,
> etc... ). I'm sure we'd have a problem with a few macros & inline
> constructs but nothing we can't fix..

One problem with this is when we construct the instructions, like using
anything from ppc-opcode.h.  eg. using PPC_POPCNTB would need to go from:
    PPC_POPCNTB(r3,r3) 
to:
    PPC_POPCNTB(3,3) 
Which is less readable IMHO.

Mikey
Michael Ellerman June 7, 2012, 6:07 a.m. UTC | #6
On Thu, 2012-06-07 at 16:05 +1000, Michael Neuling wrote:
> Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> 
> > On Wed, 2012-06-06 at 18:40 +0200, Segher Boessenkool wrote:
> > > > +err1;	dcbz	r0,r3
> > > 
> > > There is no such instruction, you probably meant "dcbz 0,r3"?
> > 
> > This reminds me... what would happen if we changed all our
> > 
> > #define	r0	0
> > #define	r1	1
> > 
> > etc... to:
> > 
> > #define r0	%r0
> > #define r1	%r1
> > 
> > ?
> > 
> > I'm thinking it might help catch that sort of nasties (and some of them
> > can be really nasty, such as inverting mfspr/mtspr arguments, or vs ori,
> > etc... ). I'm sure we'd have a problem with a few macros & inline
> > constructs but nothing we can't fix..
> 
> One problem with this is when we construct the instructions, like using
> anything from ppc-opcode.h.  eg. using PPC_POPCNTB would need to go from:
>     PPC_POPCNTB(r3,r3) 
> to:
>     PPC_POPCNTB(3,3) 

#define R(x)	x

#define PPC_POPCNTB(R(3), R(3))

??

cheers
Michael Neuling June 7, 2012, 6:12 a.m. UTC | #7
Michael Ellerman <michael@ellerman.id.au> wrote:

> On Thu, 2012-06-07 at 16:05 +1000, Michael Neuling wrote:
> > Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> > 
> > > On Wed, 2012-06-06 at 18:40 +0200, Segher Boessenkool wrote:
> > > > > +err1;	dcbz	r0,r3
> > > > 
> > > > There is no such instruction, you probably meant "dcbz 0,r3"?
> > > 
> > > This reminds me... what would happen if we changed all our
> > > 
> > > #define	r0	0
> > > #define	r1	1
> > > 
> > > etc... to:
> > > 
> > > #define r0	%r0
> > > #define r1	%r1
> > > 
> > > ?
> > > 
> > > I'm thinking it might help catch that sort of nasties (and some of them
> > > can be really nasty, such as inverting mfspr/mtspr arguments, or vs ori,
> > > etc... ). I'm sure we'd have a problem with a few macros & inline
> > > constructs but nothing we can't fix..
> > 
> > One problem with this is when we construct the instructions, like using
> > anything from ppc-opcode.h.  eg. using PPC_POPCNTB would need to go from:
> >     PPC_POPCNTB(r3,r3) 
> > to:
> >     PPC_POPCNTB(3,3) 
> 
> #define R(x)	x

#define R(x)	(x) 

> #define PPC_POPCNTB(R(3), R(3))

Maybe, looks pretty gross but you're the maintainer! :-)

Mikey
Michael Ellerman June 7, 2012, 6:18 a.m. UTC | #8
On Thu, 2012-06-07 at 16:12 +1000, Michael Neuling wrote:
> Michael Ellerman <michael@ellerman.id.au> wrote:
> 
> > On Thu, 2012-06-07 at 16:05 +1000, Michael Neuling wrote:
> > > Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> > > 
> > > > On Wed, 2012-06-06 at 18:40 +0200, Segher Boessenkool wrote:
> > > > > > +err1;	dcbz	r0,r3
> > > > > 
> > > > > There is no such instruction, you probably meant "dcbz 0,r3"?
> > > > 
> > > > This reminds me... what would happen if we changed all our
> > > > 
> > > > #define	r0	0
> > > > #define	r1	1
> > > > 
> > > > etc... to:
> > > > 
> > > > #define r0	%r0
> > > > #define r1	%r1
> > > > 
> > > > ?
> > > > 
> > > > I'm thinking it might help catch that sort of nasties (and some of them
> > > > can be really nasty, such as inverting mfspr/mtspr arguments, or vs ori,
> > > > etc... ). I'm sure we'd have a problem with a few macros & inline
> > > > constructs but nothing we can't fix..
> > > 
> > > One problem with this is when we construct the instructions, like using
> > > anything from ppc-opcode.h.  eg. using PPC_POPCNTB would need to go from:
> > >     PPC_POPCNTB(r3,r3) 
> > > to:
> > >     PPC_POPCNTB(3,3) 
> > 
> > #define R(x)	x
> 
> #define R(x)	(x) 
> 
> > #define PPC_POPCNTB(R(3), R(3))
> 
> Maybe, looks pretty gross but you're the maintainer! :-)

No I am not!


I agree it's fairly gross. But I'll take gross and correct over ungross
and buggy.

cheers
Benjamin Herrenschmidt June 7, 2012, 6:39 a.m. UTC | #9
On Thu, 2012-06-07 at 16:05 +1000, Michael Neuling wrote:

> One problem with this is when we construct the instructions, like using
> anything from ppc-opcode.h.  eg. using PPC_POPCNTB would need to go from:
>     PPC_POPCNTB(r3,r3) 
> to:
>     PPC_POPCNTB(3,3) 
> Which is less readable IMHO.

Yes, I know. Not much to do about this, but it might still be worth it,
how much time wasted due to mixing up or with ori in asm somewhere ?

One option would be to #define R3 (or _r3) for use in those macros so
we still have something nicer than just "3"... oh well.

Cheers,
Ben.
David Laight June 7, 2012, 8:41 a.m. UTC | #10
> -----Original Message-----
> From: Linuxppc-dev 
> [mailto:linuxppc-dev-bounces+david.laight=aculab.com@lists.ozl
abs.org] On Behalf Of Benjamin Herrenschmidt
> Sent: 07 June 2012 07:40
> To: Michael Neuling
> Cc: michael@ellerman.id.au; paulus@samba.org; Anton 
> Blanchard; olof@lixom.net; linuxppc-dev@lists.ozlabs.org
> Subject: Re: [PATCH] powerpc: Optimise the 64bit optimised 
> __clear_user
> 
> On Thu, 2012-06-07 at 16:05 +1000, Michael Neuling wrote:
> 
> > One problem with this is when we construct the 
> instructions, like using
> > anything from ppc-opcode.h.  eg. using PPC_POPCNTB would 
> need to go from:
> >     PPC_POPCNTB(r3,r3) 
> > to:
> >     PPC_POPCNTB(3,3) 
> > Which is less readable IMHO.
> 
> Yes, I know. Not much to do about this, but it might still be 
> worth it,
> how much time wasted due to mixing up or with ori in asm somewhere ?
> 
> One option would be to #define R3 (or _r3) for use in those macros so
> we still have something nicer than just "3"... oh well.

You then need to catch all the other bugs where a register
is being used instead of a constant.
Time to make the assembler require %r0 (etc)

	David
Scott Wood June 7, 2012, 5:51 p.m. UTC | #11
On 06/06/2012 10:04 PM, Paul Mackerras wrote:
> On Wed, Jun 06, 2012 at 06:40:54PM +0200, Segher Boessenkool wrote:
>>> +err1;	dcbz	r0,r3
>>
>> There is no such instruction, you probably meant "dcbz 0,r3"?
> 
> There certainly is such an instruction, though it doesn't do exactly
> what a naive reader might expect.  Using 0 rather than r0 or %r0
> improves readability but makes no difference to the assembler or the
> cpu.

The assembler can't tell that you used r0 rather than 0, because that's
handled by the preprocessor, but it seems like a bug (or at least lax
error checking) that it accepts %r0 there, and that objdump decodes it
as "dcbz r0,r3" rather than "dcbz 0,r3".

It's also odd that objdump produces output containing "r3" (not %r3)
when the assembler won't directly accept it.

-Scott
Benjamin Herrenschmidt June 8, 2012, 12:51 a.m. UTC | #12
On Thu, 2012-06-07 at 12:51 -0500, Scott Wood wrote:
> 
> The assembler can't tell that you used r0 rather than 0, because
> that's
> handled by the preprocessor, but it seems like a bug (or at least lax
> error checking) that it accepts %r0 there, and that objdump decodes it
> as "dcbz r0,r3" rather than "dcbz 0,r3".

Well, this is the domain of bike shed painting :-) the syntax of the
instruction is:

	dcbz RA,RB

Now, of course, like other RA,RB pairs it has the semantic that

if RA = 0 then b <- 0
else           b <- (RA)
EA <- b + (RB)

If we were to be pendantic and our assembly could also enforce the use
of % for registers, I suppose it would make sense to require 0 rather
than %0 for those "RA" forms to make it absolutely clear that we are
talking about 0 and not r0 in this specific case, I agree, but in the
current shape of the asm, I don't think that's something to expect.

Cheers,
Ben.
Andreas Schwab June 8, 2012, 7:34 a.m. UTC | #13
Scott Wood <scottwood@freescale.com> writes:

> and that objdump decodes it as "dcbz r0,r3" rather than "dcbz 0,r3".

<http://permalink.gmane.org/gmane.comp.gnu.binutils/57870>

Andreas.
Michael Neuling June 8, 2012, 11:36 a.m. UTC | #14
First 5 patches convert us to %r0-31.

Next 10 patches make using R0-31 required.
Andreas Schwab June 8, 2012, 12:12 p.m. UTC | #15
Michael Neuling <mikey@neuling.org> writes:

> Index: clone3/arch/powerpc/include/asm/ppc_asm.h
> ===================================================================
> --- clone3.orig/arch/powerpc/include/asm/ppc_asm.h
> +++ clone3/arch/powerpc/include/asm/ppc_asm.h
> @@ -178,6 +178,11 @@ END_FW_FTR_SECTION_IFSET(FW_FEATURE_SPLP
>  #define HMT_HIGH	or	3,3,3
>  #define HMT_EXTRA_HIGH	or	7,7,7		# power7 only
>  
> +#define STACKFRAMESIZE 256
> +#define STK_REG(i)     (112 + ((i)-14)*8)
> +
> +#define STK_PARAM(i)	(48 + ((i)-3)*8)

That should probably be bracketed by CONFIG_PPC64, since it is only
applicable to 64-bit code.

Andreas.
Andreas Schwab June 8, 2012, 12:15 p.m. UTC | #16
Michael Neuling <mikey@neuling.org> writes:

>  	/* Invalidate all TLBs */
> -	PPC_TLBILX_ALL(0,0)
> +	PPC_TLBILX_ALL(R0,R0)

The first argument is (RA|0), so 0 is zero, not r0.

Andreas.
Benjamin Herrenschmidt June 8, 2012, 10:42 p.m. UTC | #17
On Fri, 2012-06-08 at 14:15 +0200, Andreas Schwab wrote:
> Michael Neuling <mikey@neuling.org> writes:
> 
> >  	/* Invalidate all TLBs */
> > -	PPC_TLBILX_ALL(0,0)
> > +	PPC_TLBILX_ALL(R0,R0)
> 
> The first argument is (RA|0), so 0 is zero, not r0.

The macro system we use cannot do that (it will prefix with REG_), since
both arguments are registers we must use R0 in this case.

Cheers,
Ben.
Andreas Schwab June 9, 2012, 6:53 a.m. UTC | #18
Benjamin Herrenschmidt <benh@kernel.crashing.org> writes:

> The macro system we use cannot do that (it will prefix with REG_), since
> both arguments are registers we must use R0 in this case.

So define a ___PPC_RA0 macro that doesn't do that.

Andreas.
Benjamin Herrenschmidt June 9, 2012, 7:17 a.m. UTC | #19
On Sat, 2012-06-09 at 08:53 +0200, Andreas Schwab wrote:
> 
> > The macro system we use cannot do that (it will prefix with REG_),
> since
> > both arguments are registers we must use R0 in this case.
> 
> So define a ___PPC_RA0 macro that doesn't do that.

But then we lose the checking for other instructions :-) Unless we start
being nasty and defining a different macro form for RA which can be 0...

I'd rather not go there unless we absolutely have to...

What would be nice also would be if we had a gas option to enforce the
use of % for register names. We'd probably have to struggle a little bit
with gcc inline asm in a case or two though.

Cheers,
Ben.
Andreas Schwab June 9, 2012, 9:39 a.m. UTC | #20
Benjamin Herrenschmidt <benh@kernel.crashing.org> writes:

> On Sat, 2012-06-09 at 08:53 +0200, Andreas Schwab wrote:
>> 
>> > The macro system we use cannot do that (it will prefix with REG_),
>> since
>> > both arguments are registers we must use R0 in this case.
>> 
>> So define a ___PPC_RA0 macro that doesn't do that.
>
> But then we lose the checking for other instructions :-)

??? There is no loss of checking for instructions that do not use
___PPC_RA0.

> Unless we start being nasty and defining a different macro form for RA
> which can be 0...

That's what ___PPC_RA0 is all about.

> I'd rather not go there unless we absolutely have to...

Having to use R0 for an insn that does *not* use r0 is clearly a step
backwards.

> What would be nice also would be if we had a gas option to enforce the
> use of % for register names.

If gas is ever changed that way you have to be explict about 0 vs. %r0
anyway.

Andreas.
Michael Neuling June 14, 2012, 6:15 a.m. UTC | #21
First 5 patches convert us to %r0-31.

Next 12 convert make using R0-31 required in macros.

Last 2 convert instructions where ra = r0 we use 0 rather than the
register value (as suggested by Andreas).

Version 2 adds:
  ra = 0 idea (as Andreas suggested)
  Fixes for 32bit KVM (added ppc44x_defconfig to my testing)
  Based on mpe's next tree which has Antons power7 copy patches which
    needed fixes for this
diff mbox

Patch

Index: linux-build/arch/powerpc/lib/string_64.S
===================================================================
--- linux-build.orig/arch/powerpc/lib/string_64.S	2012-06-04 16:18:56.351604302 +1000
+++ linux-build/arch/powerpc/lib/string_64.S	2012-06-05 11:42:10.044654851 +1000
@@ -19,6 +19,12 @@ 
  */
 
 #include <asm/ppc_asm.h>
+#include <asm/asm-offsets.h>
+
+	.section	".toc","aw"
+PPC64_CACHES:
+	.tc		ppc64_caches[TC],ppc64_caches
+	.section	".text"
 
 /**
  * __clear_user: - Zero a block of memory in user space, with less checking.
@@ -94,9 +100,14 @@  err1;	stw	r0,0(r3)
 	addi	r3,r3,4
 
 3:	sub	r4,r4,r6
-	srdi	r6,r4,5
+
 	cmpdi	r4,32
+	cmpdi	cr1,r4,512
 	blt	.Lshort_clear
+	bgt	cr1,.Llong_clear
+
+.Lmedium_clear:
+	srdi	r6,r4,5
 	mtctr	r6
 
 	/* Do 32 byte chunks */
@@ -139,3 +150,53 @@  err1;	stb	r0,0(r3)
 
 10:	li	r3,0
 	blr
+
+.Llong_clear:
+	ld	r5,PPC64_CACHES@toc(r2)
+
+	bf	cr7*4+0,11f
+err2;	std	r0,0(r3)
+	addi	r3,r3,8
+	addi	r4,r4,-8
+
+	/* Destination is 16 byte aligned, need to get it cacheline aligned */
+11:	lwz	r7,DCACHEL1LOGLINESIZE(r5)
+	lwz	r9,DCACHEL1LINESIZE(r5)
+
+	/*
+	 * With worst case alignment the long clear loop takes a minimum
+	 * of 1 byte less than 2 cachelines.
+	 */
+	sldi	r10,r9,2
+	cmpd	r4,r10
+	blt	.Lmedium_clear
+
+	neg	r6,r3
+	addi	r10,r9,-1
+	and.	r5,r6,r10
+	beq	13f
+
+	srdi	r6,r5,4
+	mtctr	r6
+	mr	r8,r3
+12:
+err1;	std	r0,0(r3)
+err1;	std	r0,8(r3)
+	addi	r3,r3,16
+	bdnz	12b
+
+	sub	r4,r4,r5
+
+13:	srd	r6,r4,r7
+	mtctr	r6
+	mr	r8,r3
+14:
+err1;	dcbz	r0,r3
+	add	r3,r3,r9
+	bdnz	14b
+
+	and	r4,r4,r10
+
+	cmpdi	r4,32
+	blt	.Lshort_clear
+	b	.Lmedium_clear