diff mbox

[4/*,v2] Optimize strchrnul more

Message ID 20150524171036.GA20947@domone
State New
Headers show

Commit Message

Ondřej Bílka May 24, 2015, 5:10 p.m. UTC
On Sun, May 24, 2015 at 06:32:14PM +0200, Ondřej Bílka wrote:
> Hi,
> this is nontrivial optimization of string inlines.
> First it decreases icache pressure as you don't need strchr.
> 

Just realized that optimization there is silly way to find terminating
zero. On x64 rawmemchr is around 50% slower than strlen so add rawmemchr
special case that does just that.
 
 	* string/bits/string2.h (strchrnul, rawmemchr): Add inline
 	  (strchr): Optimize.

Comments

Wilco May 26, 2015, 12:23 p.m. UTC | #1
> Ondřej Bílka wrote:
> On Sun, May 24, 2015 at 06:32:14PM +0200, Ondřej Bílka wrote:
> > Hi,
> > this is nontrivial optimization of string inlines.
> > First it decreases icache pressure as you don't need strchr.

It's not obvious to me that is the right thing to do. Generally it is best
to use the standard C90/C99 functions rather than infrequently used
non-standard ones. A quick grep of GLIBC shows strchr is used a lot more
than strchrnul, and 9 targets have an optimized strchr vs 5 for strchrnul.

Using strchrnul also means extra work compared to calling strchr at the
callsite (the *__r == c) as well as in strchrnul if the character is not
found (returning NULL is easier than computing the pointer to the 
terminating nul character).

(note for string/strchr.c it seems reasonable to defer to strchrnul)

> Just realized that optimization there is silly way to find terminating
> zero. On x64 rawmemchr is around 50% slower than strlen so add rawmemchr
> special case that does just that.

Yes, given GCC doesn't optimize strchr (s, 0) into s + strlen (s), I agree
we should add this to the GLIBC headers.

>  	* string/bits/string2.h (strchrnul, rawmemchr): Add inline
>  	  (strchr): Optimize.
> 
> 
> diff --git a/string/bits/string2.h b/string/bits/string2.h
> index 2fe67b3..8f1eb04 100644
> --- a/string/bits/string2.h
> +++ b/string/bits/string2.h
> @@ -108,18 +108,39 @@ __STRING2_COPY_TYPE (8);
>  #endif
> 
> 
> -/* Return pointer to C in S.  */
> -#ifndef _HAVE_STRING_ARCH_strchr
> +#ifndef _HAVE_STRING_ARCH_rawmemchr
>  extern void *__rawmemchr (const void *__s, int __c);
>  # if __GNUC_PREREQ (3, 2)
> -#  define strchr(s, c) \
> +#  define __rawmemchr(s, c) \
> +  (__extension__ (__builtin_constant_p (c) && !__builtin_constant_p (s)	      \
> +		  && (c) == '\0'					      \
> +		  ? s + strlen (s) 					      \
> +		  : __rawmemchr (s, c)))
> +# endif
> +#endif
> +
> +
> +
> +#ifndef _HAVE_STRING_ARCH_strchrnul
> +# if __GNUC_PREREQ (3, 2)
> +#  define strchrnul(s, c) \
>    (__extension__ (__builtin_constant_p (c) && !__builtin_constant_p (s)	      \
>  		  && (c) == '\0'					      \
>  		  ? (char *) __rawmemchr (s, c)				      \

Should use s + strlen (s) here too rather than rely on above define.

> -		  : __builtin_strchr (s, c)))
> +		  : strchrnul (s, c)))
>  # endif
>  #endif
> 
> +/* Return pointer to C in S.  */
> +#ifndef _HAVE_STRING_ARCH_strchr
> +# if __GNUC_PREREQ (3, 2)
> +#  define strchr(s, c) \
> +  (__extension__ ({ char *__r = strchrnul (s, c);			      \
> +		       *__r == c ? __r : NULL; }))
> +# endif
> +#endif

I think this should do the strlen optimization and only change into strchrnul
on targets that do have an optimized strchrnul implementation but not strchr.

Wilco
Ondřej Bílka May 26, 2015, 1:02 p.m. UTC | #2
On Tue, May 26, 2015 at 01:23:36PM +0100, Wilco Dijkstra wrote:
> > Ondřej Bílka wrote:
> > On Sun, May 24, 2015 at 06:32:14PM +0200, Ondřej Bílka wrote:
> > > Hi,
> > > this is nontrivial optimization of string inlines.
> > > First it decreases icache pressure as you don't need strchr.
>

I addressed these in original mail but I will reply again anyway.
 
> It's not obvious to me that is the right thing to do. Generally it is best
> to use the standard C90/C99 functions rather than infrequently used
> non-standard ones. A quick grep of GLIBC shows strchr is used a lot more
> than strchrnul, and 9 targets have an optimized strchr vs 5 for strchrnul.
> 
For optimizations its often handy to introduce new specialized functions that 
are faster. Strchrnul fits that role.

Yes there will be transition period where apps will use both strchr and
strchrnul but eventually only strchrnul would be used.

As architectures thats problem that maintainers should address. I could
add _HAVE_FAST_STRCHRNUL and maintainers would enable that when they add
implementation.

Also you need to do this way as for strchr you would need do strlen if
it returns null.

> Using strchrnul also means extra work compared to calling strchr at the
> callsite (the *__r == c) as well as in strchrnul if the character is not
> found (returning NULL is easier than computing the pointer to the 
> terminating nul character).
> 
This is false. using strchrnul is one reason for this optimization.
With recent gcc since 4.9 there would be extra branch only when user
didn't check if function returns null.

gcc does simplify pattern ((*s == '\0' ? s : NULL) == NULL) 
into *s == '\0'

Then to answer that returning NULL is easier you didn't optize your
implementation enough. You should combine char mask with null mask and
have only one check instead of two separate ones for null and character.

Also you forgotten to add cost of branch misprediction. It isn't that
terrible here, character is found only 16.35% times but still
significant.

And you need to calculate index of zero byte anyway. A character could
be also found and you need to determine if character or zero comes
first.
Ondřej Bílka May 26, 2015, 2:38 p.m. UTC | #3
On Tue, May 26, 2015 at 01:23:36PM +0100, Wilco Dijkstra wrote:
> > Ondřej Bílka wrote:
> > On Sun, May 24, 2015 at 06:32:14PM +0200, Ondřej Bílka wrote:
> > > Hi,
> > > this is nontrivial optimization of string inlines.
> > > First it decreases icache pressure as you don't need strchr.
> 
> It's not obvious to me that is the right thing to do. Generally it is best
> to use the standard C90/C99 functions rather than infrequently used
> non-standard ones. A quick grep of GLIBC shows strchr is used a lot more
> than strchrnul, and 9 targets have an optimized strchr vs 5 for strchrnul.
>
I looked at how architectures optimize strchr and that number is less.

Some of these are not optimized implementations. If you do gcc
string/strchr.c -S then you get assembly implementation. Several
architectures don't exploit any hardware capabilities, just use same
algorithm as generic so these could be deleted.

I know that generic implementation could be improved and will post a
patches.

Only two that are real optimizations and don't have strchnul are armv6
and alpha
So could you adapt armv6 one? 

Richard as you added alpha could you adapt it to strchnul? And why you
do a binary search there? Accoring to wiki alpha supports ctlz
instruction that does exactly that.


So here follows list of strchr(nul)? 

./sysdeps/powerpc/powerpc64/power7/strchr.S
./sysdeps/powerpc/powerpc64/power7/strchrnul.S
./sysdeps/powerpc/powerpc64/strchr.S
./sysdeps/powerpc/powerpc32/power7/strchr.S
./sysdeps/powerpc/powerpc32/power7/strchrnul.S
./sysdeps/powerpc/powerpc32/strchr.S

Power7 hardware support.

A generic could be replaced by one from string.h

./sysdeps/m68k/strchr.S
./sysdeps/m68k/strchrnul.S

replace by generic

./sysdeps/i386/strchr.S
./sysdeps/i386/strchrnul.S

replace by generic

./sysdeps/ia64/strchr.S

hardware dont have strchrnul

./sysdeps/sparc/sparc64/strchr.S
./sysdeps/sparc/sparc32/strchr.S

replace with generic

./sysdeps/alpha/strchr.S

harware

./sysdeps/arm/armv6/strchr.S

hardware

./sysdeps/aarch64/strchr.S
./sysdeps/aarch64/strchrnul.S

hardware
Matt Turner May 26, 2015, 4:58 p.m. UTC | #4
On Tue, May 26, 2015 at 7:38 AM, Ondřej Bílka <neleai@seznam.cz> wrote:
> On Tue, May 26, 2015 at 01:23:36PM +0100, Wilco Dijkstra wrote:
>> > Ondřej Bílka wrote:
>> > On Sun, May 24, 2015 at 06:32:14PM +0200, Ondřej Bílka wrote:
>> > > Hi,
>> > > this is nontrivial optimization of string inlines.
>> > > First it decreases icache pressure as you don't need strchr.
>>
>> It's not obvious to me that is the right thing to do. Generally it is best
>> to use the standard C90/C99 functions rather than infrequently used
>> non-standard ones. A quick grep of GLIBC shows strchr is used a lot more
>> than strchrnul, and 9 targets have an optimized strchr vs 5 for strchrnul.
>>
> I looked at how architectures optimize strchr and that number is less.
>
> Some of these are not optimized implementations. If you do gcc
> string/strchr.c -S then you get assembly implementation. Several
> architectures don't exploit any hardware capabilities, just use same
> algorithm as generic so these could be deleted.
>
> I know that generic implementation could be improved and will post a
> patches.
>
> Only two that are real optimizations and don't have strchnul are armv6
> and alpha
> So could you adapt armv6 one?
>
> Richard as you added alpha could you adapt it to strchnul? And why you
> do a binary search there? Accoring to wiki alpha supports ctlz
> instruction that does exactly that.

It's because Alphas < EV67 don't have ctlz.
Ondřej Bílka May 26, 2015, 5:06 p.m. UTC | #5
On Tue, May 26, 2015 at 09:58:01AM -0700, Matt Turner wrote:
> On Tue, May 26, 2015 at 7:38 AM, Ondřej Bílka <neleai@seznam.cz> wrote:
> > On Tue, May 26, 2015 at 01:23:36PM +0100, Wilco Dijkstra wrote:
> >> > Ondřej Bílka wrote:
> >> > On Sun, May 24, 2015 at 06:32:14PM +0200, Ondřej Bílka wrote:
> >> > > Hi,
> >> > > this is nontrivial optimization of string inlines.
> >> > > First it decreases icache pressure as you don't need strchr.
> >>
> >> It's not obvious to me that is the right thing to do. Generally it is best
> >> to use the standard C90/C99 functions rather than infrequently used
> >> non-standard ones. A quick grep of GLIBC shows strchr is used a lot more
> >> than strchrnul, and 9 targets have an optimized strchr vs 5 for strchrnul.
> >>
> > I looked at how architectures optimize strchr and that number is less.
> >
> > Some of these are not optimized implementations. If you do gcc
> > string/strchr.c -S then you get assembly implementation. Several
> > architectures don't exploit any hardware capabilities, just use same
> > algorithm as generic so these could be deleted.
> >
> > I know that generic implementation could be improved and will post a
> > patches.
> >
> > Only two that are real optimizations and don't have strchnul are armv6
> > and alpha
> > So could you adapt armv6 one?
> >
> > Richard as you added alpha could you adapt it to strchnul? And why you
> > do a binary search there? Accoring to wiki alpha supports ctlz
> > instruction that does exactly that.
> 
> It's because Alphas < EV67 don't have ctlz.

OK. You could try optimize it bit by implementing ctl with
multiplication but that likely doesn't matter.
Richard Henderson May 26, 2015, 5:18 p.m. UTC | #6
On 05/26/2015 07:38 AM, Ondřej Bílka wrote:
> Richard as you added alpha could you adapt it to strchnul? And why you
> do a binary search there? Accoring to wiki alpha supports ctlz
> instruction that does exactly that.

I suppose I could.  Does it get used often, or indeed ever?

Matt already mentioned the ev67 thing, and you can see that we do indeed have
another copy that uses it: sysdeps/alpha/alphaev67/strchr.S.


r~
Wilco May 27, 2015, 11:47 a.m. UTC | #7
> Ondřej Bílka wrote:
> On Tue, May 26, 2015 at 01:23:36PM +0100, Wilco Dijkstra wrote:
> > > Ondřej Bílka wrote:
> > > On Sun, May 24, 2015 at 06:32:14PM +0200, Ondřej Bílka wrote:
> > > > Hi,
> > > > this is nontrivial optimization of string inlines.
> > > > First it decreases icache pressure as you don't need strchr.
> >
> > It's not obvious to me that is the right thing to do. Generally it is best
> > to use the standard C90/C99 functions rather than infrequently used
> > non-standard ones. A quick grep of GLIBC shows strchr is used a lot more
> > than strchrnul, and 9 targets have an optimized strchr vs 5 for strchrnul.
> >
> I looked at how architectures optimize strchr and that number is less.
> 
> Some of these are not optimized implementations. If you do gcc
> string/strchr.c -S then you get assembly implementation. Several
> architectures don't exploit any hardware capabilities, just use same
> algorithm as generic so these could be deleted.
> 
> I know that generic implementation could be improved and will post a
> patches.

Sounds good.

> Only two that are real optimizations and don't have strchnul are armv6
> and alpha
> So could you adapt armv6 one?

Yes that shouldn't be too hard - and looking at it, it seems feasible to
tweak some extra performance out of it as well.

Wilco
diff mbox

Patch

diff --git a/string/bits/string2.h b/string/bits/string2.h
index 2fe67b3..8f1eb04 100644
--- a/string/bits/string2.h
+++ b/string/bits/string2.h
@@ -108,18 +108,39 @@  __STRING2_COPY_TYPE (8);
 #endif
 
 
-/* Return pointer to C in S.  */
-#ifndef _HAVE_STRING_ARCH_strchr
+#ifndef _HAVE_STRING_ARCH_rawmemchr
 extern void *__rawmemchr (const void *__s, int __c);
 # if __GNUC_PREREQ (3, 2)
-#  define strchr(s, c) \
+#  define __rawmemchr(s, c) \
+  (__extension__ (__builtin_constant_p (c) && !__builtin_constant_p (s)	      \
+		  && (c) == '\0'					      \
+		  ? s + strlen (s) 					      \
+		  : __rawmemchr (s, c)))
+# endif
+#endif
+
+
+
+#ifndef _HAVE_STRING_ARCH_strchrnul
+# if __GNUC_PREREQ (3, 2)
+#  define strchrnul(s, c) \
   (__extension__ (__builtin_constant_p (c) && !__builtin_constant_p (s)	      \
 		  && (c) == '\0'					      \
 		  ? (char *) __rawmemchr (s, c)				      \
-		  : __builtin_strchr (s, c)))
+		  : strchrnul (s, c)))
 # endif
 #endif
 
+/* Return pointer to C in S.  */
+#ifndef _HAVE_STRING_ARCH_strchr
+# if __GNUC_PREREQ (3, 2)
+#  define strchr(s, c) \
+  (__extension__ ({ char *__r = strchrnul (s, c);			      \
+		       *__r == c ? __r : NULL; }))
+# endif
+#endif
+
+
 /* Copy SRC to DEST, returning pointer to final NUL byte.  */
 #ifdef __USE_GNU
 # if !defined _HAVE_STRING_ARCH_stpcpy || defined _FORCE_INLINES