Message ID | 20180517100413.856096F938@po14934vm.idsi0.si.c-s.fr (mailing list archive)
---|---
State | Changes Requested
Series | powerpc/lib: Remove .balign inside string functions for PPC32
On Thu, 17 May 2018 12:04:13 +0200 (CEST)
Christophe Leroy <christophe.leroy@c-s.fr> wrote:

> commit 87a156fb18fe1 ("Align hot loops of some string functions")
> degraded the performance of string functions by adding useless
> nops
>
> A simple benchmark on an 8xx calling 100000x a memchr() that
> matches the first byte runs in 41668 TB ticks before this patch
> and in 35986 TB ticks after this patch. So this gives an
> improvement of approx 10%
>
> Another benchmark doing the same with a memchr() matching the 128th
> byte runs in 1011365 TB ticks before this patch and 1005682 TB ticks
> after this patch, so regardless on the number of loops, removing
> those useless nops improves the test by 5683 TB ticks.
>
> Fixes: 87a156fb18fe1 ("Align hot loops of some string functions")
> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
> ---
> Was sent already as part of a serie optimising string functions.
> Resending on itself as it is independent of the other changes in the
> serie
>
>  arch/powerpc/lib/string.S | 6 ++++++
>  1 file changed, 6 insertions(+)
>
> diff --git a/arch/powerpc/lib/string.S b/arch/powerpc/lib/string.S
> index a787776822d8..a026d8fa8a99 100644
> --- a/arch/powerpc/lib/string.S
> +++ b/arch/powerpc/lib/string.S
> @@ -23,7 +23,9 @@ _GLOBAL(strncpy)
>  	mtctr	r5
>  	addi	r6,r3,-1
>  	addi	r4,r4,-1
> +#ifdef CONFIG_PPC64
>  	.balign 16
> +#endif
> 1:	lbzu	r0,1(r4)
>  	cmpwi	0,r0,0
>  	stbu	r0,1(r6)

The ifdefs are a bit ugly, but you can't argue with the numbers. These
alignments should be IFETCH_ALIGN_BYTES, which is intended to optimise
the ifetch performance when you have such a loop (although there is
always a tradeoff for a single iteration).

Would it make sense to define that for 32-bit as well, and you could use
it here instead of the ifdefs? Small CPUs could just use 0.

Thanks,
Nick
Nicholas Piggin <npiggin@gmail.com> writes:
> On Thu, 17 May 2018 12:04:13 +0200 (CEST)
> Christophe Leroy <christophe.leroy@c-s.fr> wrote:
>
>> commit 87a156fb18fe1 ("Align hot loops of some string functions")
>> degraded the performance of string functions by adding useless
>> nops
>>
>> A simple benchmark on an 8xx calling 100000x a memchr() that
>> matches the first byte runs in 41668 TB ticks before this patch
>> and in 35986 TB ticks after this patch. So this gives an
>> improvement of approx 10%
>>
>> Another benchmark doing the same with a memchr() matching the 128th
>> byte runs in 1011365 TB ticks before this patch and 1005682 TB ticks
>> after this patch, so regardless on the number of loops, removing
>> those useless nops improves the test by 5683 TB ticks.
>>
>> Fixes: 87a156fb18fe1 ("Align hot loops of some string functions")
>> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
>> ---
>> Was sent already as part of a serie optimising string functions.
>> Resending on itself as it is independent of the other changes in the
>> serie
>>
>>  arch/powerpc/lib/string.S | 6 ++++++
>>  1 file changed, 6 insertions(+)
>>
>> diff --git a/arch/powerpc/lib/string.S b/arch/powerpc/lib/string.S
>> index a787776822d8..a026d8fa8a99 100644
>> --- a/arch/powerpc/lib/string.S
>> +++ b/arch/powerpc/lib/string.S
>> @@ -23,7 +23,9 @@ _GLOBAL(strncpy)
>>  	mtctr	r5
>>  	addi	r6,r3,-1
>>  	addi	r4,r4,-1
>> +#ifdef CONFIG_PPC64
>>  	.balign 16
>> +#endif
>> 1:	lbzu	r0,1(r4)
>>  	cmpwi	0,r0,0
>>  	stbu	r0,1(r6)
>
> The ifdefs are a bit ugly, but you can't argue with the numbers. These
> alignments should be IFETCH_ALIGN_BYTES, which is intended to optimise
> the ifetch performance when you have such a loop (although there is
> always a tradeoff for a single iteration).
>
> Would it make sense to define that for 32-bit as well, and you could use
> it here instead of the ifdefs? Small CPUs could just use 0.

Can we do it with a macro in the header, eg. like:

#ifdef CONFIG_PPC64
#define IFETCH_BALIGN	.balign IFETCH_ALIGN_BYTES
#endif

...

	addi	r4,r4,-1
	IFETCH_BALIGN
1:	lbzu	r0,1(r4)

cheers
On 17/05/2018 at 15:46, Michael Ellerman wrote:
> Nicholas Piggin <npiggin@gmail.com> writes:
>
>> On Thu, 17 May 2018 12:04:13 +0200 (CEST)
>> Christophe Leroy <christophe.leroy@c-s.fr> wrote:
>>
>>> commit 87a156fb18fe1 ("Align hot loops of some string functions")
>>> degraded the performance of string functions by adding useless
>>> nops
>>>
>>> A simple benchmark on an 8xx calling 100000x a memchr() that
>>> matches the first byte runs in 41668 TB ticks before this patch
>>> and in 35986 TB ticks after this patch. So this gives an
>>> improvement of approx 10%
>>>
>>> Another benchmark doing the same with a memchr() matching the 128th
>>> byte runs in 1011365 TB ticks before this patch and 1005682 TB ticks
>>> after this patch, so regardless on the number of loops, removing
>>> those useless nops improves the test by 5683 TB ticks.
>>>
>>> Fixes: 87a156fb18fe1 ("Align hot loops of some string functions")
>>> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
>>> ---
>>> Was sent already as part of a serie optimising string functions.
>>> Resending on itself as it is independent of the other changes in the
>>> serie
>>>
>>>  arch/powerpc/lib/string.S | 6 ++++++
>>>  1 file changed, 6 insertions(+)
>>>
>>> diff --git a/arch/powerpc/lib/string.S b/arch/powerpc/lib/string.S
>>> index a787776822d8..a026d8fa8a99 100644
>>> --- a/arch/powerpc/lib/string.S
>>> +++ b/arch/powerpc/lib/string.S
>>> @@ -23,7 +23,9 @@ _GLOBAL(strncpy)
>>>  	mtctr	r5
>>>  	addi	r6,r3,-1
>>>  	addi	r4,r4,-1
>>> +#ifdef CONFIG_PPC64
>>>  	.balign 16
>>> +#endif
>>> 1:	lbzu	r0,1(r4)
>>>  	cmpwi	0,r0,0
>>>  	stbu	r0,1(r6)
>>
>> The ifdefs are a bit ugly, but you can't argue with the numbers. These
>> alignments should be IFETCH_ALIGN_BYTES, which is intended to optimise
>> the ifetch performance when you have such a loop (although there is
>> always a tradeoff for a single iteration).
>>
>> Would it make sense to define that for 32-bit as well, and you could use
>> it here instead of the ifdefs? Small CPUs could just use 0.
>
> Can we do it with a macro in the header, eg. like:
>
> #ifdef CONFIG_PPC64
> #define IFETCH_BALIGN	.balign IFETCH_ALIGN_BYTES
> #endif
>
> ...
>
> 	addi	r4,r4,-1
> 	IFETCH_BALIGN
> 1:	lbzu	r0,1(r4)

Why not just define IFETCH_ALIGN_SHIFT for PPC32 as well in asm/cache.h,
then replace the .balign 16 by .balign IFETCH_ALIGN_BYTES (or .align
IFETCH_ALIGN_SHIFT)?

Christophe

> cheers
diff --git a/arch/powerpc/lib/string.S b/arch/powerpc/lib/string.S
index a787776822d8..a026d8fa8a99 100644
--- a/arch/powerpc/lib/string.S
+++ b/arch/powerpc/lib/string.S
@@ -23,7 +23,9 @@ _GLOBAL(strncpy)
 	mtctr	r5
 	addi	r6,r3,-1
 	addi	r4,r4,-1
+#ifdef CONFIG_PPC64
 	.balign 16
+#endif
1:	lbzu	r0,1(r4)
 	cmpwi	0,r0,0
 	stbu	r0,1(r6)
@@ -43,7 +45,9 @@ _GLOBAL(strncmp)
 	mtctr	r5
 	addi	r5,r3,-1
 	addi	r4,r4,-1
+#ifdef CONFIG_PPC64
 	.balign 16
+#endif
1:	lbzu	r3,1(r5)
 	cmpwi	1,r3,0
 	lbzu	r0,1(r4)
@@ -77,7 +81,9 @@ _GLOBAL(memchr)
 	beq-	2f
 	mtctr	r5
 	addi	r3,r3,-1
+#ifdef CONFIG_PPC64
 	.balign 16
+#endif
1:	lbzu	r0,1(r3)
 	cmpw	0,r0,r4
 	bdnzf	2,1b
commit 87a156fb18fe1 ("Align hot loops of some string functions")
degraded the performance of string functions by adding useless nops.

A simple benchmark on an 8xx calling 100000x a memchr() that matches
the first byte runs in 41668 TB ticks before this patch and in 35986
TB ticks after this patch. So this gives an improvement of approx 10%.

Another benchmark doing the same with a memchr() matching the 128th
byte runs in 1011365 TB ticks before this patch and 1005682 TB ticks
after this patch, so regardless of the number of loops, removing those
useless nops improves the test by 5683 TB ticks.

Fixes: 87a156fb18fe1 ("Align hot loops of some string functions")
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
---
Was already sent as part of a series optimising string functions.
Resending on its own as it is independent of the other changes in the
series.

 arch/powerpc/lib/string.S | 6 ++++++
 1 file changed, 6 insertions(+)