Optimised memset64/memset32 for powerpc

Submitted by Matthew Wilcox on March 20, 2017, 9:14 p.m.

Details

Message ID 20170320211447.GB5073@bombadil.infradead.org
State New
Headers show

Commit Message

Matthew Wilcox March 20, 2017, 9:14 p.m.
I recently introduced memset32() / memset64().  I've done implementations
for x86 & ARM; akpm has agreed to take the patchset through his tree.
Do you fancy doing a powerpc version?  Minchan Kim got a 7% performance
increase with zram from switching to the optimised version on x86.

Here's the development git tree:
http://git.infradead.org/users/willy/linux-dax.git/shortlog/refs/heads/memfill
(most recent 7 commits)

ARM probably offers the best model for you to work from; it's basically
just a case of jumping into the middle of your existing memset loop.
It was only three instructions to add support to ARM, but I don't know
PowerPC well enough to understand how your existing memset works.
I'd start with something like this ... note that you don't have to
implement memset64 on 32-bit; I only did it on ARM because it was free.
It doesn't look free for you as you only store one register each time
around the loop in the 32-bit memset implementation:

1:      stwu    r4,4(r6)
        bdnz    1b

(wouldn't you get better performance on 32-bit powerpc by unrolling that
loop like you do on 64-bit?)

Comments

Benjamin Herrenschmidt March 20, 2017, 9:23 p.m.
On Mon, 2017-03-20 at 14:14 -0700, Matthew Wilcox wrote:
> I recently introduced memset32() / memset64().  I've done implementations
> for x86 & ARM; akpm has agreed to take the patchset through his tree.
> Do you fancy doing a powerpc version?  Minchan Kim got a 7% performance
> increase with zram from switching to the optimised version on x86.

+Anton

Thanks Matthew !

> Here's the development git tree:
> http://git.infradead.org/users/willy/linux-dax.git/shortlog/refs/heads/memfill
> (most recent 7 commits)
> 
> ARM probably offers the best model for you to work from; it's basically
> just a case of jumping into the middle of your existing memset loop.
> It was only three instructions to add support to ARM, but I don't know
> PowerPC well enough to understand how your existing memset works.
> I'd start with something like this ... note that you don't have to
> implement memset64 on 32-bit; I only did it on ARM because it was free.
> It doesn't look free for you as you only store one register each time
> around the loop in the 32-bit memset implementation:
> 
> 1:      stwu    r4,4(r6)
>         bdnz    1b
> 
> (wouldn't you get better performance on 32-bit powerpc by unrolling that
> loop like you do on 64-bit?)
> 
> diff --git a/arch/powerpc/include/asm/string.h b/arch/powerpc/include/asm/string.h
> index da3cdffca440..c02392fced98 100644
> --- a/arch/powerpc/include/asm/string.h
> +++ b/arch/powerpc/include/asm/string.h
> @@ -6,6 +6,7 @@
>  #define __HAVE_ARCH_STRNCPY
>  #define __HAVE_ARCH_STRNCMP
>  #define __HAVE_ARCH_MEMSET
> +#define __HAVE_ARCH_MEMSET_PLUS
>  #define __HAVE_ARCH_MEMCPY
>  #define __HAVE_ARCH_MEMMOVE
>  #define __HAVE_ARCH_MEMCMP
> @@ -23,6 +24,18 @@ extern void * memmove(void *,const void *,__kernel_size_t);
>  extern int memcmp(const void *,const void *,__kernel_size_t);
>  extern void * memchr(const void *,int,__kernel_size_t);
>  
> +extern void *__memset32(uint32_t *, uint32_t v, __kernel_size_t);
> +static inline void *memset32(uint32_t *p, uint32_t v, __kernel_size_t n)
> +{
> > +	return __memset32(p, v, n * 4);
> +}
> +
> +extern void *__memset64(uint64_t *, uint64_t v, __kernel_size_t);
> +static inline void *memset64(uint64_t *p, uint64_t v, __kernel_size_t n)
> +{
> > +	return __memset64(p, v, n * 8);
> +}
> +
>  #endif /* __KERNEL__ */
>  
> >  #endif	/* _ASM_POWERPC_STRING_H */
Christophe Leroy March 21, 2017, 12:23 p.m.
Hi Matthew

Le 20/03/2017 à 22:14, Matthew Wilcox a écrit :
> I recently introduced memset32() / memset64().  I've done implementations
> for x86 & ARM; akpm has agreed to take the patchset through his tree.
> Do you fancy doing a powerpc version?  Minchan Kim got a 7% performance
> increase with zram from switching to the optimised version on x86.
>
> Here's the development git tree:
> http://git.infradead.org/users/willy/linux-dax.git/shortlog/refs/heads/memfill
> (most recent 7 commits)
>
> ARM probably offers the best model for you to work from; it's basically
> just a case of jumping into the middle of your existing memset loop.
> It was only three instructions to add support to ARM, but I don't know
> PowerPC well enough to understand how your existing memset works.
> I'd start with something like this ... note that you don't have to
> implement memset64 on 32-bit; I only did it on ARM because it was free.
> It doesn't look free for you as you only store one register each time
> around the loop in the 32-bit memset implementation:
>
> 1:      stwu    r4,4(r6)
>         bdnz    1b
>
> (wouldn't you get better performance on 32-bit powerpc by unrolling that
> loop like you do on 64-bit?)

In arch/powerpc/lib/copy_32.S, the implementation of memset() is 
optimised when the value to be set is zero. It makes use of the 'dcbz' 
instruction which zeroizes a complete cache line.

Not much effort has been put on optimising non-zero memset() because 
there are almost none.

Unrolling the loop could help a bit on old powerpc32s that don't have 
branch units, but on those processors the main driver is the time spent 
to do the effective write to memory, and the operations necessary to 
unroll the loop are not worth the cycle added by the branch.

On more modern powerpc32s, the branch unit implies that branches have a 
zero cost.

A simple static inline C function would probably do the job, based on 
what I get below:

void memset32(int *p, int v, unsigned int c)
{
	int i;

	for (i = 0; i < c; i++)
		*p++ = v;
}

void memset64(long long *p, long long v, unsigned int c)
{
	int i;

	for (i = 0; i < c; i++)
		*p++ = v;
}

test.o:     file format elf32-powerpc


Disassembly of section .text:

00000000 <memset32>:
    0:	2c 05 00 00 	cmpwi   r5,0
    4:	38 63 ff fc 	addi    r3,r3,-4
    8:	4d 82 00 20 	beqlr
    c:	7c a9 03 a6 	mtctr   r5
   10:	94 83 00 04 	stwu    r4,4(r3)
   14:	42 00 ff fc 	bdnz    10 <memset32+0x10>
   18:	4e 80 00 20 	blr

0000001c <memset64>:
   1c:	2c 07 00 00 	cmpwi   r7,0
   20:	7c cb 33 78 	mr      r11,r6
   24:	7c aa 2b 78 	mr      r10,r5
   28:	38 63 ff f8 	addi    r3,r3,-8
   2c:	4d 82 00 20 	beqlr
   30:	7c e9 03 a6 	mtctr   r7
   34:	95 43 00 08 	stwu    r10,8(r3)
   38:	91 63 00 04 	stw     r11,4(r3)
   3c:	42 00 ff f8 	bdnz    34 <memset64+0x18>
   40:	4e 80 00 20 	blr



Christophe

>
> diff --git a/arch/powerpc/include/asm/string.h b/arch/powerpc/include/asm/string.h
> index da3cdffca440..c02392fced98 100644
> --- a/arch/powerpc/include/asm/string.h
> +++ b/arch/powerpc/include/asm/string.h
> @@ -6,6 +6,7 @@
>  #define __HAVE_ARCH_STRNCPY
>  #define __HAVE_ARCH_STRNCMP
>  #define __HAVE_ARCH_MEMSET
> +#define __HAVE_ARCH_MEMSET_PLUS
>  #define __HAVE_ARCH_MEMCPY
>  #define __HAVE_ARCH_MEMMOVE
>  #define __HAVE_ARCH_MEMCMP
> @@ -23,6 +24,18 @@ extern void * memmove(void *,const void *,__kernel_size_t);
>  extern int memcmp(const void *,const void *,__kernel_size_t);
>  extern void * memchr(const void *,int,__kernel_size_t);
>
> +extern void *__memset32(uint32_t *, uint32_t v, __kernel_size_t);
> +static inline void *memset32(uint32_t *p, uint32_t v, __kernel_size_t n)
> +{
> +	return __memset32(p, v, n * 4);
> +}
> +
> +extern void *__memset64(uint64_t *, uint64_t v, __kernel_size_t);
> +static inline void *memset64(uint64_t *p, uint64_t v, __kernel_size_t n)
> +{
> +	return __memset64(p, v, n * 8);
> +}
> +
>  #endif /* __KERNEL__ */
>
>  #endif	/* _ASM_POWERPC_STRING_H */
>
Matthew Wilcox March 21, 2017, 1:29 p.m.
On Tue, Mar 21, 2017 at 01:23:36PM +0100, Christophe LEROY wrote:
> > It doesn't look free for you as you only store one register each time
> > around the loop in the 32-bit memset implementation:
> > 
> > 1:      stwu    r4,4(r6)
> >         bdnz    1b
> > 
> > (wouldn't you get better performance on 32-bit powerpc by unrolling that
> > loop like you do on 64-bit?)
> 
> In arch/powerpc/lib/copy_32.S, the implementation of memset() is optimised
> when the value to be set is zero. It makes use of the 'dcbz' instruction
> which zeroizes a complete cache line.
> 
> Not much effort has been put on optimising non-zero memset() because there
> are almost none.

Yes, bzero() is much more common than setting an 8-bit pattern.
And setting an 8-bit pattern is almost certainly more common than setting
a 32 or 64 bit pattern.

> Unrolling the loop could help a bit on old powerpc32s that don't have branch
> units, but on those processors the main driver is the time spent to do the
> effective write to memory, and the operations necessary to unroll the loop
> are not worth the cycle added by the branch.
> 
> On more modern powerpc32s, the branch unit implies that branches have a zero
> cost.

Fair enough.  I'm just surprised it was worth unrolling the loop on
powerpc64 and not on powerpc32 -- see mem_64.S.

> A simple static inline C function would probably do the job, based on what I
> get below:
> 
> void memset32(int *p, int v, unsigned int c)
> {
> 	int i;
> 
> 	for (i = 0; i < c; i++)
> 		*p++ = v;
> }
> 
> void memset64(long long *p, long long v, unsigned int c)
> {
> 	int i;
> 
> 	for (i = 0; i < c; i++)
> 		*p++ = v;
> }

Well, those are the generic versions in the first patch:

http://git.infradead.org/users/willy/linux-dax.git/commitdiff/538b9776ac925199969bd5af4e994da776d461e7

so if those are good enough for you guys, there's no need for you to
do anything.

Thanks for your time!
Segher Boessenkool March 21, 2017, 4:45 p.m.
On Tue, Mar 21, 2017 at 06:29:10AM -0700, Matthew Wilcox wrote:
> > Unrolling the loop could help a bit on old powerpc32s that don't have branch
> > units, but on those processors the main driver is the time spent to do the
> > effective write to memory, and the operations necessary to unroll the loop
> > are not worth the cycle added by the branch.
> > 
> > On more modern powerpc32s, the branch unit implies that branches have a zero
> > cost.
> 
> Fair enough.  I'm just surprised it was worth unrolling the loop on
> powerpc64 and not on powerpc32 -- see mem_64.S.

We can do at most one loop iteration per cycle, but we can do multiple
stores per cycle, on modern, bigger CPUs.  Many old or small CPUs have
only one load/store unit on the other hand.  There are other issues,
but that is the biggest difference.


Segher
Benjamin Herrenschmidt March 21, 2017, 9:26 p.m.
On Tue, 2017-03-21 at 06:29 -0700, Matthew Wilcox wrote:
> 
> Well, those are the generic versions in the first patch:
> 
> http://git.infradead.org/users/willy/linux-dax.git/commitdiff/538b977
> 6ac925199969bd5af4e994da776d461e7
> 
> so if those are good enough for you guys, there's no need for you to
> do anything.
> 
> Thanks for your time!

I suspect on ppc64 we can do much better, if anything moving 64-bit at
a time. Matthew, what are the main use cases of these ?

Cheers,
Ben.
Matthew Wilcox March 22, 2017, 1:18 p.m.
On Wed, Mar 22, 2017 at 08:26:12AM +1100, Benjamin Herrenschmidt wrote:
> On Tue, 2017-03-21 at 06:29 -0700, Matthew Wilcox wrote:
> > 
> > Well, those are the generic versions in the first patch:
> > 
> > http://git.infradead.org/users/willy/linux-dax.git/commitdiff/538b977
> > 6ac925199969bd5af4e994da776d461e7
> > 
> > so if those are good enough for you guys, there's no need for you to
> > do anything.
> > 
> > Thanks for your time!
> 
> I suspect on ppc64 we can do much better, if anything moving 64-bit at
> a time. Matthew, what are the main use cases of these ?

I've only converted two users so far -- zram was the initial inspiration
for this.  It notices when a page has a pattern in it which is
representable as a repetition of an 'unsigned long' (this seems to be
a relatively common thing for userspace to do -- not as common as an
entirely zero page, but common enough to be worth optimising for).  So it
may be doing an entire page worth of this to handle a page fault, or if
there's an I/O to such a page, it will be doing a multiple of 512 bytes.

The other user is sym53c8xx_2; it's an initialisation path thing, and
it saves a few bytes in the driver to call the optimised routine rather
than have its own loop to initialise the array.

I suspect we have additional places in the kernel that could use
memset32/memset64 -- look for loops which store a value which is not
dependent on the loop counter.  They're probably not performance path
though; I'd focus on zram as being the case to optimise for.

There's one other potential user I've been wondering about, which are the
various console drivers.  They use 'memsetw' to blank the entire console
or lines of the console when scrolling, but the only architecture which
ever bothered implementing an optimised version of it was Alpha.

Might be worth it on powerpc actually ... better than a loop calling
cpu_to_le16() on each iteration.  That'd complete the set with a
memset16().

Patch hide | download patch | download mbox

diff --git a/arch/powerpc/include/asm/string.h b/arch/powerpc/include/asm/string.h
index da3cdffca440..c02392fced98 100644
--- a/arch/powerpc/include/asm/string.h
+++ b/arch/powerpc/include/asm/string.h
@@ -6,6 +6,7 @@ 
 #define __HAVE_ARCH_STRNCPY
 #define __HAVE_ARCH_STRNCMP
 #define __HAVE_ARCH_MEMSET
+#define __HAVE_ARCH_MEMSET_PLUS
 #define __HAVE_ARCH_MEMCPY
 #define __HAVE_ARCH_MEMMOVE
 #define __HAVE_ARCH_MEMCMP
@@ -23,6 +24,18 @@  extern void * memmove(void *,const void *,__kernel_size_t);
 extern int memcmp(const void *,const void *,__kernel_size_t);
 extern void * memchr(const void *,int,__kernel_size_t);
 
+extern void *__memset32(uint32_t *, uint32_t v, __kernel_size_t);
+static inline void *memset32(uint32_t *p, uint32_t v, __kernel_size_t n)
+{
+	return __memset32(p, v, n * 4);
+}
+
+extern void *__memset64(uint64_t *, uint64_t v, __kernel_size_t);
+static inline void *memset64(uint64_t *p, uint64_t v, __kernel_size_t n)
+{
+	return __memset64(p, v, n * 8);
+}
+
 #endif /* __KERNEL__ */
 
 #endif	/* _ASM_POWERPC_STRING_H */