Message ID | 20200703165452.GA226121@gmail.com
State      | New
Series     | x86: Add thresholds for "rep movsb/stosb" to tunables
On 7/3/20 12:54 PM, H.J. Lu wrote:
> On Fri, Jul 03, 2020 at 12:14:01PM -0400, Carlos O'Donell wrote:
>> On 7/2/20 3:08 PM, H.J. Lu wrote:
>>> On Thu, Jul 02, 2020 at 02:00:54PM -0400, Carlos O'Donell wrote:
>>>> On 6/6/20 5:51 PM, H.J. Lu wrote:
>>>>> On Fri, Jun 5, 2020 at 3:45 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>>>>>>
>>>>>> On Thu, Jun 04, 2020 at 02:00:35PM -0700, H.J. Lu wrote:
>>>>>>> On Mon, Jun 1, 2020 at 7:08 PM Carlos O'Donell <carlos@redhat.com> wrote:
>>>>>>>>
>>>>>>>> On Mon, Jun 1, 2020 at 6:44 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>>>>>>>>> Tunables are designed to pass info from user to glibc, not the other
>>>>>>>>> way around. When __libc_main is called, init_cacheinfo is never
>>>>>>>>> called. I can call init_cacheinfo from __libc_main. But there is no
>>>>>>>>> interface to update min and max values from init_cacheinfo. I don't
>>>>>>>>> think --list-tunables will work here without changes to tunables.
>>>>>>>>
>>>>>>>> You have a dynamic threshold.
>>>>>>>>
>>>>>>>> You have to tell the user what that minimum is, otherwise they can't
>>>>>>>> use the tunable reliably.
>>>>>>>>
>>>>>>>> This is the first instance of a min/max that is dynamically determined.
>>>>>>>>
>>>>>>>> You must fetch the cache info ahead of the tunable initialization, that
>>>>>>>> is you must call init_cacheinfo before __init_tunables.
>>>>>>>>
>>>>>>>> You can initialize the tunable data dynamically like this:
>>>>>>>>
>>>>>>>> /* Dynamically set the min and max of glibc.foo.bar. */
>>>>>>>> tunable_id_t id = TUNABLE_ENUM_NAME (glibc, foo, bar);
>>>>>>>> tunable_list[id].type.min = lowval;
>>>>>>>> tunable_list[id].type.max = highval;
>>>>>>>>
>>>>>>>> We do something similar for maybe_enable_malloc_check.
>>>>>>>>
>>>>>>>> Then once the tunables are parsed, and the cpu features are loaded
>>>>>>>> you can print the tunables, and the printed tunables will have meaningful
>>>>>>>> min and max values.
>>>>>>>>
>>>>>>>> If you have circular dependency, then you must process the cpu features
>>>>>>>> first without reading from the tunables, then allow the tunables to be
>>>>>>>> initialized from the system, *then* process the tunables to alter the existing
>>>>>>>> cpu feature settings.
>>>>>>>>
>>>>>>>
>>>>>>> How about this? I got
>>>>>>>
>>>>>>
>>>>>> Here is the updated patch, which depends on
>>>>>>
>>>>>> https://sourceware.org/pipermail/libc-alpha/2020-June/114820.html
>>>>>>
>>>>>> to add "%d" support to _dl_debug_vdprintf. I got
>>>>>>
>>>>>> $ ./elf/ld.so ./libc.so --list-tunables
>>>>>> glibc.elision.skip_lock_after_retries: 3 (min: -2147483648, max: 2147483647)
>>>>>> glibc.malloc.trim_threshold: 0x0 (min: 0x0, max: 0xffffffff)
>>>>>> glibc.malloc.perturb: 0 (min: 0, max: 255)
>>>>>> glibc.cpu.x86_shared_cache_size: 0x100000 (min: 0x0, max: 0xffffffff)
>>>>>> glibc.elision.tries: 3 (min: -2147483648, max: 2147483647)
>>>>>> glibc.elision.enable: 0 (min: 0, max: 1)
>>>>>> glibc.malloc.mxfast: 0x0 (min: 0x0, max: 0xffffffff)
>>>>>> glibc.elision.skip_lock_busy: 3 (min: -2147483648, max: 2147483647)
>>>>>> glibc.malloc.top_pad: 0x0 (min: 0x0, max: 0xffffffff)
>>>>>> glibc.cpu.x86_non_temporal_threshold: 0x600000 (min: 0x0, max: 0xffffffff)
>>>>>> glibc.cpu.x86_shstk:
>>>>>> glibc.cpu.hwcap_mask: 0x6 (min: 0x0, max: 0xffffffff)
>>>>>> glibc.malloc.mmap_max: 0 (min: -2147483648, max: 2147483647)
>>>>>> glibc.elision.skip_trylock_internal_abort: 3 (min: -2147483648, max: 2147483647)
>>>>>> glibc.malloc.tcache_unsorted_limit: 0x0 (min: 0x0, max: 0xffffffff)
>>>>>> glibc.cpu.x86_ibt:
>>>>>> glibc.cpu.hwcaps:
>>>>>> glibc.elision.skip_lock_internal_abort: 3 (min: -2147483648, max: 2147483647)
>>>>>> glibc.malloc.arena_max: 0x0 (min: 0x1, max: 0xffffffff)
>>>>>> glibc.malloc.mmap_threshold: 0x0 (min: 0x0, max: 0xffffffff)
>>>>>> glibc.cpu.x86_data_cache_size: 0x8000 (min: 0x0, max: 0xffffffff)
>>>>>> glibc.malloc.tcache_count: 0x0 (min: 0x0, max: 0xffffffff)
>>>>>> glibc.malloc.arena_test: 0x0 (min: 0x1, max: 0xffffffff)
>>>>>> glibc.pthread.mutex_spin_count: 100 (min: 0, max: 32767)
>>>>>> glibc.malloc.tcache_max: 0x0 (min: 0x0, max: 0xffffffff)
>>>>>> glibc.malloc.check: 0 (min: 0, max: 3)
>>>>>> $
>>>>>>
>>>>>> Ok for master?
>>>>>>
>>>>>
>>>>> Here is the updated patch. To support --list-tunables, a target should add
>>>>>
>>>>> CPPFLAGS-version.c = -DLIBC_MAIN=__libc_main_body
>>>>> CPPFLAGS-libc-main.S = -DLIBC_MAIN=__libc_main_body
>>>>>
>>>>> and start.S should be updated to define __libc_main and call
>>>>> __libc_main_body:
>>>>>
>>>>> extern void __libc_main_body (int argc, char **argv)
>>>>>   __attribute__ ((noreturn, visibility ("hidden")));
>>>>>
>>>>> when LIBC_MAIN is defined.
>>>>
>>>> I like where this patch is going, but the __libc_main wiring up means
>>>> we'll have to delay this until glibc 2.33 opens for development and
>>>> give the architectures time to fill in the required pieces of assembly.
>>>>
>>>> Can we split this into:
>>>>
>>>> (a) Minimum required to implement the feature e.g. just the tunable without
>>>> my requested changes.
>>>>
>>>> (b) A second patch which implements the --list-tunables that users can
>>>> then use to know what the values they can choose are.
>>>>
>>>> That way we can commit (a) right now, and then commit (b) when we
>>>> reopen for development?
>>>>
>>>
>>> Like this?
>>
>> Almost.
>>
>> Why do we still use a constructor?
>>
>> Why don't we accurately set the min and max?
>>
>> +#if HAVE_TUNABLES
>> +  TUNABLE_UPDATE (x86_non_temporal_threshold, long int,
>> +                  __x86_shared_non_temporal_threshold, 0,
>> +                  (long int) -1);
>> +  TUNABLE_UPDATE (x86_rep_movsb_threshold, long int,
>> +                  __x86_rep_movsb_threshold,
>> +                  minimum_rep_movsb_threshold, (long int) -1);
>> +  TUNABLE_UPDATE (x86_rep_stosb_threshold, long int,
>> +                  __x86_rep_stosb_threshold, 0, (long int) -1);
>>
>> A min and max of 0 and -1 respectively could have been set in the tunables
>> list file and are not dynamic?
>>
>> I'd expect your patch would do everything except actually implement
>> --list-tunables.
>
> Here is the followup patch which does it.
>
>>
>> We need a manual page, and I accept that showing a "lower value" will
>> have to wait for --list-tunables.
>>
>> Otherwise the patch is looking ready.
>
>
> Are these 2 patches OK for trunk?

Could you please post the patches in a distinct thread with a clear
subject, that way I know exactly what I'm applying and testing.
I'll review those ASAP so we can get something in place.
On Fri, Jul 3, 2020 at 10:43 AM Carlos O'Donell <carlos@redhat.com> wrote:
[full quote of the message above trimmed]
> Could you please post the patches in a distinct thread with a clear
> subject, that way I know exactly what I'm applying and testing.
> I'll review those ASAP so we can get something in place.

Done:

https://sourceware.org/pipermail/libc-alpha/2020-July/115759.html
diff --git a/manual/tunables.texi b/manual/tunables.texi
index ec18b10834..61edd62425 100644
--- a/manual/tunables.texi
+++ b/manual/tunables.texi
@@ -396,6 +396,20 @@ to set threshold in bytes for non temporal store.
 This tunable is specific to i386 and x86-64.
 @end deftp
 
+@deftp Tunable glibc.cpu.x86_rep_movsb_threshold
+The @code{glibc.cpu.x86_rep_movsb_threshold} tunable allows the user
+to set threshold in bytes to start using "rep movsb".
+
+This tunable is specific to i386 and x86-64.
+@end deftp
+
+@deftp Tunable glibc.cpu.x86_rep_stosb_threshold
+The @code{glibc.cpu.x86_rep_stosb_threshold} tunable allows the user
+to set threshold in bytes to start using "rep stosb".
+
+This tunable is specific to i386 and x86-64.
+@end deftp
+
 @deftp Tunable glibc.cpu.x86_ibt
 The @code{glibc.cpu.x86_ibt} tunable allows the user to control how
 indirect branch tracking (IBT) should be enabled. Accepted values are
diff --git a/sysdeps/x86/cacheinfo.c b/sysdeps/x86/cacheinfo.c
index 8c4c7f9972..bb536d96ef 100644
--- a/sysdeps/x86/cacheinfo.c
+++ b/sysdeps/x86/cacheinfo.c
@@ -41,6 +41,23 @@ long int __x86_raw_shared_cache_size attribute_hidden = 1024 * 1024;
 /* Threshold to use non temporal store. */
 long int __x86_shared_non_temporal_threshold attribute_hidden;
 
+/* Threshold to use Enhanced REP MOVSB. Since there is overhead to set
+   up REP MOVSB operation, REP MOVSB isn't faster on short data. The
+   memcpy micro benchmark in glibc shows that 2KB is the approximate
+   value above which REP MOVSB becomes faster than SSE2 optimization
+   on processors with Enhanced REP MOVSB. Since larger register size
+   can move more data with a single load and store, the threshold is
+   higher with larger register size. */
+long int __x86_rep_movsb_threshold attribute_hidden = 2048;
+
+/* Threshold to use Enhanced REP STOSB. Since there is overhead to set
+   up REP STOSB operation, REP STOSB isn't faster on short data. The
+   memset micro benchmark in glibc shows that 2KB is the approximate
+   value above which REP STOSB becomes faster on processors with
+   Enhanced REP STOSB. Since the stored value is fixed, larger register
+   size has minimal impact on threshold. */
+long int __x86_rep_stosb_threshold attribute_hidden = 2048;
+
 #ifndef __x86_64__
 /* PREFETCHW support flag for use in memory and string routines. */
 int __x86_prefetchw attribute_hidden;
@@ -117,6 +134,9 @@ init_cacheinfo (void)
   __x86_shared_non_temporal_threshold
     = cpu_features->non_temporal_threshold;
 
+  __x86_rep_movsb_threshold = cpu_features->rep_movsb_threshold;
+  __x86_rep_stosb_threshold = cpu_features->rep_stosb_threshold;
+
 #ifndef __x86_64__
   __x86_prefetchw = cpu_features->prefetchw;
 #endif
diff --git a/sysdeps/x86/cpu-features.h b/sysdeps/x86/cpu-features.h
index 3aaed33cbc..002e12e11f 100644
--- a/sysdeps/x86/cpu-features.h
+++ b/sysdeps/x86/cpu-features.h
@@ -128,6 +128,10 @@ struct cpu_features
   /* PREFETCHW support flag for use in memory and string routines. */
   unsigned long int prefetchw;
 #endif
+  /* Threshold to use "rep movsb". */
+  unsigned long int rep_movsb_threshold;
+  /* Threshold to use "rep stosb". */
+  unsigned long int rep_stosb_threshold;
 };
 
 /* Used from outside of glibc to get access to the CPU features
diff --git a/sysdeps/x86/dl-cacheinfo.c b/sysdeps/x86/dl-cacheinfo.c
index 8e2a6f552c..aff9bd1067 100644
--- a/sysdeps/x86/dl-cacheinfo.c
+++ b/sysdeps/x86/dl-cacheinfo.c
@@ -860,6 +860,31 @@ __init_cacheinfo (void)
      total shared cache size. */
   unsigned long int non_temporal_threshold = (shared * threads * 3 / 4);
 
+  /* NB: The REP MOVSB threshold must be greater than VEC_SIZE * 8. */
+  unsigned long int minimum_rep_movsb_threshold;
+  /* NB: The default REP MOVSB threshold is 2048 * (VEC_SIZE / 16). See
+     comments for __x86_rep_movsb_threshold in cacheinfo.c. */
+  unsigned long int rep_movsb_threshold;
+  if (CPU_FEATURES_ARCH_P (cpu_features, AVX512F_Usable)
+      && !CPU_FEATURES_ARCH_P (cpu_features, Prefer_No_AVX512))
+    {
+      rep_movsb_threshold = 2048 * (64 / 16);
+      minimum_rep_movsb_threshold = 64 * 8;
+    }
+  else if (CPU_FEATURES_ARCH_P (cpu_features,
+                                AVX_Fast_Unaligned_Load))
+    {
+      rep_movsb_threshold = 2048 * (32 / 16);
+      minimum_rep_movsb_threshold = 32 * 8;
+    }
+  else
+    {
+      rep_movsb_threshold = 2048 * (16 / 16);
+      minimum_rep_movsb_threshold = 16 * 8;
+    }
+  /* NB: See comments for __x86_rep_stosb_threshold in cacheinfo.c. */
+  unsigned long int rep_stosb_threshold = 2048;
+
 #if HAVE_TUNABLES
   long int tunable_size;
   tunable_size = TUNABLE_GET (x86_data_cache_size, long int, NULL);
@@ -871,11 +896,19 @@ __init_cacheinfo (void)
   tunable_size = TUNABLE_GET (x86_non_temporal_threshold, long int, NULL);
   if (tunable_size != 0)
     non_temporal_threshold = tunable_size;
+  tunable_size = TUNABLE_GET (x86_rep_movsb_threshold, long int, NULL);
+  if (tunable_size > minimum_rep_movsb_threshold)
+    rep_movsb_threshold = tunable_size;
+  tunable_size = TUNABLE_GET (x86_rep_stosb_threshold, long int, NULL);
+  if (tunable_size != 0)
+    rep_stosb_threshold = tunable_size;
 #endif
 
   cpu_features->data_cache_size = data;
   cpu_features->shared_cache_size = shared;
   cpu_features->non_temporal_threshold = non_temporal_threshold;
+  cpu_features->rep_movsb_threshold = rep_movsb_threshold;
+  cpu_features->rep_stosb_threshold = rep_stosb_threshold;
 
 #if HAVE_TUNABLES
   TUNABLE_UPDATE (x86_data_cache_size, long int,
@@ -884,5 +917,10 @@ __init_cacheinfo (void)
                   shared, 0, (long int) -1);
   TUNABLE_UPDATE (x86_non_temporal_threshold, long int,
                   non_temporal_threshold, 0, (long int) -1);
+  TUNABLE_UPDATE (x86_rep_movsb_threshold, long int,
+                  rep_movsb_threshold, minimum_rep_movsb_threshold,
+                  (long int) -1);
+  TUNABLE_UPDATE (x86_rep_stosb_threshold, long int,
+                  rep_stosb_threshold, 0, (long int) -1);
 #endif
 }
diff --git a/sysdeps/x86/dl-tunables.list b/sysdeps/x86/dl-tunables.list
index 251b926ce4..43bf6c2389 100644
--- a/sysdeps/x86/dl-tunables.list
+++ b/sysdeps/x86/dl-tunables.list
@@ -30,6 +30,12 @@ glibc {
     x86_non_temporal_threshold {
       type: SIZE_T
     }
+    x86_rep_movsb_threshold {
+      type: SIZE_T
+    }
+    x86_rep_stosb_threshold {
+      type: SIZE_T
+    }
     x86_data_cache_size {
       type: SIZE_T
     }
diff --git a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
index 74953245aa..bd5dc1a3f3 100644
--- a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
+++ b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
@@ -56,17 +56,6 @@
 # endif
 #endif
 
-/* Threshold to use Enhanced REP MOVSB. Since there is overhead to set
-   up REP MOVSB operation, REP MOVSB isn't faster on short data. The
-   memcpy micro benchmark in glibc shows that 2KB is the approximate
-   value above which REP MOVSB becomes faster than SSE2 optimization
-   on processors with Enhanced REP MOVSB. Since larger register size
-   can move more data with a single load and store, the threshold is
-   higher with larger register size. */
-#ifndef REP_MOVSB_THRESHOLD
-# define REP_MOVSB_THRESHOLD (2048 * (VEC_SIZE / 16))
-#endif
-
 #ifndef PREFETCH
 # define PREFETCH(addr) prefetcht0 addr
 #endif
@@ -253,9 +242,6 @@ L(movsb):
 	leaq	(%rsi,%rdx), %r9
 	cmpq	%r9, %rdi
 	/* Avoid slow backward REP MOVSB. */
-# if REP_MOVSB_THRESHOLD <= (VEC_SIZE * 8)
-#  error Unsupported REP_MOVSB_THRESHOLD and VEC_SIZE!
-# endif
 	jb	L(more_8x_vec_backward)
 1:
 	mov	%RDX_LP, %RCX_LP
@@ -331,7 +317,7 @@ L(between_2_3):
 
 #if defined USE_MULTIARCH && IS_IN (libc)
 L(movsb_more_2x_vec):
-	cmpq	$REP_MOVSB_THRESHOLD, %rdx
+	cmp	__x86_rep_movsb_threshold(%rip), %RDX_LP
 	ja	L(movsb)
 #endif
 L(more_2x_vec):
diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
index af2299709c..2bfc95de05 100644
--- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
+++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
@@ -58,16 +58,6 @@
 # endif
 #endif
 
-/* Threshold to use Enhanced REP STOSB. Since there is overhead to set
-   up REP STOSB operation, REP STOSB isn't faster on short data. The
-   memset micro benchmark in glibc shows that 2KB is the approximate
-   value above which REP STOSB becomes faster on processors with
-   Enhanced REP STOSB. Since the stored value is fixed, larger register
-   size has minimal impact on threshold. */
-#ifndef REP_STOSB_THRESHOLD
-# define REP_STOSB_THRESHOLD 2048
-#endif
-
 #ifndef SECTION
 # error SECTION is not defined!
 #endif
@@ -181,7 +171,7 @@ ENTRY (MEMSET_SYMBOL (__memset, unaligned_erms))
 	ret
 
 L(stosb_more_2x_vec):
-	cmpq	$REP_STOSB_THRESHOLD, %rdx
+	cmp	__x86_rep_stosb_threshold(%rip), %RDX_LP
 	ja	L(stosb)
 #endif
 L(more_2x_vec):