Improve strncpy performance further

Message ID	20150113191449.AD91B2C39DC@topped-with-meat.com
State	New
Headers	show Return-Path: <libc-alpha-return-56241-incoming=patchwork.ozlabs.org@sourceware.org> DomainKey-Signature: a=rsa-sha1; c=nofws; d=sourceware.org; h=list-id :list-unsubscribe:list-subscribe:list-archive:list-post :list-help:sender:mime-version:content-type :content-transfer-encoding:from:to:cc:subject:in-reply-to :references:message-id:date; q=dns; s=default; b=exK5EAaP8yVaHPR 9MwzRmpWZrExri1q+IiP+UVZ9ttGtDR0KnwAOiiXgVE3lW3YcvEMm2lH/yxxKV64 ih+rzEDlHhv8eweOtYTEi8A2f7J8gOrDSilnlm6sgvqDz4KttRYGtZtwHGB2/Yea z68xiVjEM6QRGHF4OJcZ8/kOB4oY= Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm Precedence: bulk Sender: libc-alpha-owner@sourceware.org MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit From: Roland McGrath <roland@hack.frob.com> To: "Wilco Dijkstra" <wdijkstr@arm.com> Cc: <libc-alpha@sourceware.org> Subject: RE: [PATCH] Improve strncpy performance further In-Reply-To: Wilco Dijkstra's message of Sunday, 11 January 2015 18:09:25 -0000 <001a01d02dc9$bd6f0370$384d0a50$@com> References: <001801d02b72$6ce0c3c0$46a24b40$@com> <20150108185812.285782C3BF6@topped-with-meat.com> <001901d02c0d$43cf9920$cb6ecb60$@com> <20150109191632.694692C3C1F@topped-with-meat.com> <001a01d02dc9$bd6f0370$384d0a50$@com> Message-Id: <20150113191449.AD91B2C39DC@topped-with-meat.com> Date: Tue, 13 Jan 2015 11:14:49 -0800 (PST)

Roland McGrath Jan. 13, 2015, 7:14 p.m. UTC

> Well, bcopy and bzero are both used in several non-test sources in GLIBC, 
> but one is always defined as __bzero, and the other as bcopy.

There are no uses of bcopy outside of tests, and portability macros in code
(possibly, or once) shared with gnulib.  There is a bcopy macro in
locale/programs/simple-hash.c, but it's not used anywhere.  I've removed
that in the patch below.

There are three uses of bzero outside of tests.  I've changed those to
memset in the patch below.  In my build (x86_64-linux-gnu, GCC 4.8.3) this
did not change the compiled code at all--so they were already being reduced
to memset or inlined directly by the compiler.

There are also two uses of bcmp outside of tests (both in old resolver
code), which is a similar case (but even easier and less useful, since it
is nothing but exactly an alias for memcmp).  I've changed those to memcmp
in the patch below.

There is a handful of uses of __bzero, all in sunrpc/ (none even in tests).
We're avoiding touching sunrpc/ when it's not thoroughly necessary, so
we'll leave those alone.

> It's unclear to me what the exact namespace rules are (or whether there is 
> even a concise description of them somewhere), however it is obvious that 
> this is not consistent. Note any rule that relies on whether a function is
> called or not from within the same library is risky unless you have 
> automated checks to catch that.

The rules per se can be described concisely, but they incorporate by
reference the set of each symbols in each standard and the dependency graph
of standards that are specified as supersets of others.  However, the
linknamespace tests should already be prepared to catch any concrete
violations.


Here's what I've just committed.  As I mentioned above, it caused no
changes to the compiled code on x86_64-linux-gnu with gcc-4.8.3, except for
the expected s/bcmp/memcmp/ changes that are trivially provably harmless.


Thanks,
Roland


2015-01-13  Roland McGrath  <roland@hack.frob.com>

	* login/logout.c (logout): Use memset rather than bzero.
	* nis/nss_compat/compat-pwd.c (getpwent_next_file): Likewise.
	* nis/nss_compat/compat-spwd.c (getspent_next_file): Likewise.
	* resolv/gethnamaddr.c (gethostbyaddr): Use memcmp rather than bcmp.
	(_gethtbyaddr): Likewise.
	* locale/programs/simple-hash.c (bcopy): Macro removed.

Wilco Jan. 14, 2015, 2:11 p.m. UTC | #1

Roland McGrath wrote:
> There are no uses of bcopy outside of tests, and portability macros in code
> (possibly, or once) shared with gnulib.  There is a bcopy macro in
> locale/programs/simple-hash.c, but it's not used anywhere.  I've removed
> that in the patch below.
> 
> There are three uses of bzero outside of tests.  I've changed those to
> memset in the patch below.  In my build (x86_64-linux-gnu, GCC 4.8.3) this
> did not change the compiled code at all--so they were already being reduced
> to memset or inlined directly by the compiler.

GCC often expands bzero into memset, but not all versions do this, and neither
do all options, so you cannot rely on this. So before your patch login/logout.c 
might actually call bzero...

We need something like this in string.h so we always optimize all calls to
standard optimized functions, irrespectively of the compiler and options used:

#ifdef __USE_MISC
# define bzero(s, n) memset (s, '\0', n)
# define __bzero(s, n) memset (s, '\0', n)
# define bcopy(src, dest, n) memmove (dest, src, n)
# define bcmp(s1, s2) memcmp(s1, s2)
#endif

Does that seem reasonable?

Now the only remaining one to deal with is mempcpy - I'd like something like 
this in string/strings2.h:

#define mempcpy(dest, src, n) __mempcpy_inline (dest, src, n)
__STRING_INLINE void *__mempcpy_inline (void *dest, void *src, size_t n)
{
  return memcpy(dest, src, n) + n;
}

If there are targets which want to keep using their optimized version, then 
the above could be guarded by a define (eg. _HAVE_OPTIMIZED_MEMPCPY) which would 
be defined by any target which has an optimized mempcpy. Or is there an easier way
to test whether a target has added an optimized version of a generic function?

> Here's what I've just committed.  As I mentioned above, it caused no
> changes to the compiled code on x86_64-linux-gnu with gcc-4.8.3, except for
> the expected s/bcmp/memcmp/ changes that are trivially provably harmless.

Thanks, that looks good.

Wilco

Joseph Myers Jan. 14, 2015, 5:24 p.m. UTC | #2

On Wed, 14 Jan 2015, Wilco Dijkstra wrote:

> Roland McGrath wrote:
> > There are no uses of bcopy outside of tests, and portability macros in code
> > (possibly, or once) shared with gnulib.  There is a bcopy macro in
> > locale/programs/simple-hash.c, but it's not used anywhere.  I've removed
> > that in the patch below.
> > 
> > There are three uses of bzero outside of tests.  I've changed those to
> > memset in the patch below.  In my build (x86_64-linux-gnu, GCC 4.8.3) this
> > did not change the compiled code at all--so they were already being reduced
> > to memset or inlined directly by the compiler.
> 
> GCC often expands bzero into memset, but not all versions do this, and neither
> do all options, so you cannot rely on this. So before your patch login/logout.c 
> might actually call bzero...

glibc only supports building with a limited set of compilers, and the 
semantic options used for the build must be those determined by glibc's 
built system (it's not valid to include -fno-builtin in CC or CFLAGS when 
configuring glibc, for example).

> We need something like this in string.h so we always optimize all calls to
> standard optimized functions, irrespectively of the compiler and options used:

We hardly *need* such an optimization, when it's only for obscure cases of 
old code and unusual compiler options.  (It may still be a good idea, if 
done correctly.)

> #ifdef __USE_MISC
> # define bzero(s, n) memset (s, '\0', n)
> # define __bzero(s, n) memset (s, '\0', n)
> # define bcopy(src, dest, n) memmove (dest, src, n)

These all need casts to void so as to return the correct types.

Roland McGrath Jan. 14, 2015, 7:32 p.m. UTC | #3

> GCC often expands bzero into memset, but not all versions do this, and
> neither do all options, so you cannot rely on this. So before your patch
> login/logout.c might actually call bzero...

Sure.

> We need something like this in string.h so we always optimize all calls to
> standard optimized functions, irrespectively of the compiler and options used:

We would need that if we wanted to do that.  But these entrypoints are all
old and deprecated.  They are only for the benefit of old code.  Any code
so old that it hasn't been touched since there were actually systems to
build it on that don't have the C89 standard functions surely has worse
performance issues than this.  Making the deprecated functions optimal only
encourages people to keep using them.

> Now the only remaining one to deal with is mempcpy - I'd like something like 
> this in string/strings2.h:

Why?  It's trivial enough for each memcpy implementation to implement
mempcpy too, and for many implementations rolling it in might save an
instruction or two over the generic addition.  It doesn't seem worth
the complexity to bother with anything in the header files.

Wilco Jan. 15, 2015, 3:48 p.m. UTC | #4

Roland McGrath wrote:
> Wilco Dijkstra wrote:
> > We need something like this in string.h so we always optimize all calls to
> > standard optimized functions, irrespectively of the compiler and options used:
> 
> We would need that if we wanted to do that.  But these entrypoints are all
> old and deprecated.  They are only for the benefit of old code.  Any code
> so old that it hasn't been touched since there were actually systems to
> build it on that don't have the C89 standard functions surely has worse
> performance issues than this.  Making the deprecated functions optimal only
> encourages people to keep using them.

Agreed, however they appear to be used in a lot of code, including benchmarks.
For example a quick grep shows there are a large number of occurrences of 
bzero and bcopy in SPEC2006.

> > Now the only remaining one to deal with is mempcpy - I'd like something like
> > this in string/strings2.h:
> 
> Why?  It's trivial enough for each memcpy implementation to implement
> mempcpy too, and for many implementations rolling it in might save an
> instruction or two over the generic addition.  It doesn't seem worth
> the complexity to bother with anything in the header files.

OK, so the goal of many of the changes I've been making is as follows:

By default GLIBC should provide the most efficient generic implementations
so that a new target is not forced to write a large number of optimized 
assembler functions in order to get reasonable performance. Additionally,
given that all targets add optimized versions of a few key functions
(such as memcpy, memset, strlen), use those whenever feasible rather than
less widely used variants.

Back to mempcpy, not only is inlining mempcpy simple and a good idea, it is
also the most efficient implementation. If you create a separate optimized
implementation of mempcpy, it requires 1-2 extra instructions and increases
pressure on caches and branch predictors. Another approach would be to set
the return value at the start of memcpy so that mempcpy can jump past it. 
This means 1 extra instruction in every memcpy invocation plus an extra
branch for mempcpy. Neither option is clearly better than just inlining. 
This ignores the additional effort to write/test mempcpy which could be 
spent on more important things. It appears most targets have not bothered 
with mempcpy as a result.

So to me adding the inline version is a no-brainer and should have been done
a long time ago.

Wilco

Ondřej Bílka Jan. 31, 2015, 8:36 p.m. UTC | #5

On Thu, Jan 15, 2015 at 03:48:47PM -0000, Wilco Dijkstra wrote:
> Roland McGrath wrote:
> > Wilco Dijkstra wrote:
> > > We need something like this in string.h so we always optimize all calls to
> > > standard optimized functions, irrespectively of the compiler and options used:
> > 
> > We would need that if we wanted to do that.  But these entrypoints are all
> > old and deprecated.  They are only for the benefit of old code.  Any code
> > so old that it hasn't been touched since there were actually systems to
> > build it on that don't have the C89 standard functions surely has worse
> > performance issues than this.  Making the deprecated functions optimal only
> > encourages people to keep using them.
> 
> Agreed, however they appear to be used in a lot of code, including benchmarks.
> For example a quick grep shows there are a large number of occurrences of 
> bzero and bcopy in SPEC2006.
>
Also gcc could optimize memset to __bzero, I will probably write patch
for x64 to save few cycles. There is omplication that gcc could use
memset return value so we need to check if its dead or create new
symbol.
 
> > > Now the only remaining one to deal with is mempcpy - I'd like something like
> > > this in string/strings2.h:
> > 
> > Why?  It's trivial enough for each memcpy implementation to implement
> > mempcpy too, and for many implementations rolling it in might save an
> > instruction or two over the generic addition.  It doesn't seem worth
> > the complexity to bother with anything in the header files.
> 
> Back to mempcpy, not only is inlining mempcpy simple and a good idea, it is
> also the most efficient implementation. If you create a separate optimized
> implementation of mempcpy, it requires 1-2 extra instructions and increases
> pressure on caches and branch predictors. Another approach would be to set

That was previously mentioned in parent thread. With separate mempcpy
you will likely pay additional 100 cycle penalty as mempcy is not called
often.

> the return value at the start of memcpy so that mempcpy can jump past it. 
> This means 1 extra instruction in every memcpy invocation plus an extra
> branch for mempcpy.

That is false. You need to copy starting memcpy fragment until you set
return value and then jump which gives no overhead to memcpy.

That could be problematic on some architectures as you need to do it
without spilling extra register.

> Neither option is clearly better than just inlining. 
> This ignores the additional effort to write/test mempcpy which could be 
> spent on more important things. It appears most targets have not bothered 
> with mempcpy as a result.

Wilco Feb. 4, 2015, 4:30 p.m. UTC | #6

> Ondřej Bílka wrote:
> On Thu, Jan 15, 2015 at 03:48:47PM -0000, Wilco Dijkstra wrote:
> > Roland McGrath wrote:
> > > Wilco Dijkstra wrote:
> > > > We need something like this in string.h so we always optimize all calls to
> > > > standard optimized functions, irrespectively of the compiler and options used:
> > >
> > > We would need that if we wanted to do that.  But these entrypoints are all
> > > old and deprecated.  They are only for the benefit of old code.  Any code
> > > so old that it hasn't been touched since there were actually systems to
> > > build it on that don't have the C89 standard functions surely has worse
> > > performance issues than this.  Making the deprecated functions optimal only
> > > encourages people to keep using them.
> >
> > Agreed, however they appear to be used in a lot of code, including benchmarks.
> > For example a quick grep shows there are a large number of occurrences of
> > bzero and bcopy in SPEC2006.
> >
> Also gcc could optimize memset to __bzero, I will probably write patch
> for x64 to save few cycles. There is omplication that gcc could use
> memset return value so we need to check if its dead or create new
> symbol.

That is certainly a good idea - I added _memclr to armcc a long time ago as 99% of
uses of memset set it to zero and don't use the return value (and the cost of
save/restore the return value inside memcpy/memset is higher than just recomputing
it on most targets).

However this means we need to first make sure all targets have a decent __bzero
implementation as otherwise you penalize everybody with an extra veneer to memset...

> > > > Now the only remaining one to deal with is mempcpy - I'd like something like
> > > > this in string/strings2.h:
> > >
> > > Why?  It's trivial enough for each memcpy implementation to implement
> > > mempcpy too, and for many implementations rolling it in might save an
> > > instruction or two over the generic addition.  It doesn't seem worth
> > > the complexity to bother with anything in the header files.
> >
> > Back to mempcpy, not only is inlining mempcpy simple and a good idea, it is
> > also the most efficient implementation. If you create a separate optimized
> > implementation of mempcpy, it requires 1-2 extra instructions and increases
> > pressure on caches and branch predictors. Another approach would be to set
> 
> That was previously mentioned in parent thread. With separate mempcpy
> you will likely pay additional 100 cycle penalty as mempcy is not called
> often.
>
> > the return value at the start of memcpy so that mempcpy can jump past it.
> > This means 1 extra instruction in every memcpy invocation plus an extra
> > branch for mempcpy.
> 
> That is false. You need to copy starting memcpy fragment until you set
> return value and then jump which gives no overhead to memcpy.

That's not how memcpy implementations work. You never set the return value
explicitly, you either don't change the destination register (which on most ABIs
also is the return value) or save/restore it on targets with few registers.
Additionally for small/medium copies you use the destination (and return value)
unchanged, so to support a different return value you need an extra instruction
to make a copy of the destination ...
 
> That could be problematic on some architectures as you need to do it
> without spilling extra register.

... which could also mean an extra spill/restore in the small/medium copy cases.

So I don't think merging mempcpy and memcpy is a good idea on any target.

Wilco

Joseph Myers Feb. 4, 2015, 5:30 p.m. UTC | #7

On Wed, 4 Feb 2015, Wilco Dijkstra wrote:

> That is certainly a good idea - I added _memclr to armcc a long time ago 
> as 99% of uses of memset set it to zero and don't use the return value 
> (and the cost of save/restore the return value inside memcpy/memset is 
> higher than just recomputing it on most targets).

On ARM there's __aeabi_memclr (and __aeabi_memclr4 and __aeabi_memclr8), 
but the __aeabi_mem* functions are mostly currently implemented as 
wrappers (__aeabi_memcpy* as aliases for __memcpy_arm in the 
armv7/multiarch case, to avoid clobbering NEON registers) and there's no 
GCC support for generating calls to those functions (which would only be 
useful with glibc if they were actually more efficient rather than 
wrappers).

Wilco Feb. 4, 2015, 6:42 p.m. UTC | #8

> Joseph Myers wrote:
> On Wed, 4 Feb 2015, Wilco Dijkstra wrote:
> 
> > That is certainly a good idea - I added _memclr to armcc a long time ago
> > as 99% of uses of memset set it to zero and don't use the return value
> > (and the cost of save/restore the return value inside memcpy/memset is
> > higher than just recomputing it on most targets).
> 
> On ARM there's __aeabi_memclr (and __aeabi_memclr4 and __aeabi_memclr8),
> but the __aeabi_mem* functions are mostly currently implemented as
> wrappers (__aeabi_memcpy* as aliases for __memcpy_arm in the
> armv7/multiarch case, to avoid clobbering NEON registers) and there's no
> GCC support for generating calls to those functions (which would only be
> useful with glibc if they were actually more efficient rather than
> wrappers).

Indeed it's unfortunate that these as well as the generic __bzero are
inefficient. Would it be possible for GCC to detect a modern GLIBC from the
headers and only emit calls to __bzero (or say a new __memclr symbol) if a
target provides an optimized version? Not sure whether there is already a way
to do this, or whether such trickery is discouraged...

Wilco

Joseph Myers Feb. 4, 2015, 6:56 p.m. UTC | #9

On Wed, 4 Feb 2015, Wilco Dijkstra wrote:

> Indeed it's unfortunate that these as well as the generic __bzero are
> inefficient. Would it be possible for GCC to detect a modern GLIBC from the
> headers and only emit calls to __bzero (or say a new __memclr symbol) if a
> target provides an optimized version? Not sure whether there is already a way
> to do this, or whether such trickery is discouraged...

There's existing support for checking the glibc version at GCC configure 
time and acting in GCC based on that (of course, that's just the minimum 
glibc version that GCC knows will be used at runtime - it doesn't take 
account of glibc upgrades).  That's sufficient if using new glibc features 
would require a GCC upgrade anyway.  If an older GCC could meaningfully 
use a newer glibc feature when glibc is upgraded - if, for example, GCC 
knows about using efficient __bzero but the set of relevant architectures 
varies with glibc version, so you want glibc headers to tell GCC whether 
__bzero is efficient for that glibc version and architecture - then you 
get into stdc-predef.h (or an architecture-specific bits/ file included 
therefrom) doing "#pragma GCC library_feature efficient___bzero" or 
similar - a new pragma to inform GCC of library features it can presume on 
the target.  (There might be other possible approaches to implementing 
this.)

Ondřej Bílka Feb. 11, 2015, 1:06 p.m. UTC | #10

On Wed, Feb 04, 2015 at 04:30:43PM -0000, Wilco Dijkstra wrote:
> > > the return value at the start of memcpy so that mempcpy can jump past it.
> > > This means 1 extra instruction in every memcpy invocation plus an extra
> > > branch for mempcpy.
> > 
> > That is false. You need to copy starting memcpy fragment until you set
> > return value and then jump which gives no overhead to memcpy.
> 
> That's not how memcpy implementations work. You never set the return value
> explicitly, you either don't change the destination register (which on most ABIs
> also is the return value) or save/restore it on targets with few registers.
> Additionally for small/medium copies you use the destination (and return value)
> unchanged, so to support a different return value you need an extra instruction
> to make a copy of the destination ...
> 

No, my description is quite explicit. You take memcpy implementation and
look at first instructions such that there is no read/write to return
register/memory after reaching that instruction.

Now for mempcpy you take memcpy as template and clone it until you reach
instruction corresponding to one described before. 
On that position you change return value and jump to corresponding
instruction in memcpy.

It is obvious this does not add extra instruction to memcpy as memcpy is
not changed.

Wilco Feb. 11, 2015, 2:51 p.m. UTC | #11

> Ondřej Bílka wrote:
> On Wed, Feb 04, 2015 at 04:30:43PM -0000, Wilco Dijkstra wrote:
> > > > the return value at the start of memcpy so that mempcpy can jump past it.
> > > > This means 1 extra instruction in every memcpy invocation plus an extra
> > > > branch for mempcpy.
> > >
> > > That is false. You need to copy starting memcpy fragment until you set
> > > return value and then jump which gives no overhead to memcpy.
> >
> > That's not how memcpy implementations work. You never set the return value
> > explicitly, you either don't change the destination register (which on most ABIs
> > also is the return value) or save/restore it on targets with few registers.
> > Additionally for small/medium copies you use the destination (and return value)
> > unchanged, so to support a different return value you need an extra instruction
> > to make a copy of the destination ...
> >
> 
> No, my description is quite explicit. You take memcpy implementation and
> look at first instructions such that there is no read/write to return
> register/memory after reaching that instruction.
>
> Now for mempcpy you take memcpy as template and clone it until you reach
> instruction corresponding to one described before.
> On that position you change return value and jump to corresponding
> instruction in memcpy.
> 
> It is obvious this does not add extra instruction to memcpy as memcpy is
> not changed.

Yes but that means if memcpy never makes a copy of the destination register 
and uses it in *all* subcases (like my memcpy does) then you end up having
to duplicate almost all of the memcpy code.

While duplicating most of memcpy is not as bad as duplicating all, the common
cases will not be cached/predicted. So the conclusion is that you either make
memcpy less efficient in order to share all of it (by using an extra register
and making a copy of the destination register early on) or duplicate most of
memcpy. Neither is a good idea.

Wilco

Improve strncpy performance further

Commit Message

Comments

Patch