Message ID | 20181022223711.26910-1-hjl.tools@gmail.com |
---|---|
State | New |
Headers | show |
Series | x86: Add --enable-rdtscp-in-benchtests | expand |
On Mon, 22 Oct 2018, H.J. Lu wrote: > RDTSCP waits until all previous instructions have executed and all > previous loads are globally visible before reading the counter. RDTSC > doesn't wait until all previous instructions have been executed before > reading the counter. This patch adds --enable-rdtscp-in-benchtests to > use RDTSCP in benchtests. > > NOTE: Benchtests in RDTSCP-enabled glibc require CPUs capable of RDTSCP > instruction. All x86 processors since 2010 support RDTSCP instruction. Without implying an objection to the patch, I'd like to point out that the Linux kernel always uses "lfence; rdtsc" on Intel CPUs to obtain ordered timestamps with lowest possible overhead. LFENCE is available on all x86-64 processors as part of SSE2. On AMD CPUs the kernel also uses "lfence; rdtsc", except if it cannot setup a specific MSR to make LFENCE serializing; in that case it falls back to "mfence; rdtsc". "lfence; rdtsc" sequence is also recommended by Intel SDM documentation. Is there a specific reason that "rdtscp" is preferable for this patch? Thanks. Alexander
* H. J. Lu: > RDTSCP waits until all previous instructions have executed and all > previous loads are globally visible before reading the counter. RDTSC > doesn't wait until all previous instructions have been executed before > reading the counter. This patch adds --enable-rdtscp-in-benchtests to > use RDTSCP in benchtests. > > NOTE: Benchtests in RDTSCP-enabled glibc require CPUs capable of RDTSCP > instruction. All x86 processors since 2010 support RDTSCP instruction. Shouldn't the benchtests use clock_gettime anyway, to avoid issues in case the TSC is not synchronized across cores? Thanks, Florian
On 23/10/18 2:34 PM, Florian Weimer wrote: > Shouldn't the benchtests use clock_gettime anyway, to avoid issues in > case the TSC is not synchronized across cores? There's an option USE_CLOCK_GETTIME to make benchtests do that, but otherwise it uses the hp_timing bits by default. Siddhesh
On 10/23/18, Siddhesh Poyarekar <siddhesh@gotplt.org> wrote: > On 23/10/18 2:34 PM, Florian Weimer wrote: >> Shouldn't the benchtests use clock_gettime anyway, to avoid issues in >> case the TSC is not synchronized across cores? > > There's an option USE_CLOCK_GETTIME to make benchtests do that, but > otherwise it uses the hp_timing bits by default. I want something better that rdtsc and very low overhead since some bench tests only last a few cycles. Adding lfence may make timing data look like noise.
On 23/10/18 4:28 PM, H.J. Lu wrote: > On 10/23/18, Siddhesh Poyarekar <siddhesh@gotplt.org> wrote: >> On 23/10/18 2:34 PM, Florian Weimer wrote: >>> Shouldn't the benchtests use clock_gettime anyway, to avoid issues in >>> case the TSC is not synchronized across cores? >> >> There's an option USE_CLOCK_GETTIME to make benchtests do that, but >> otherwise it uses the hp_timing bits by default. > > I want something better that rdtsc and very low overhead since some bench > tests only last a few cycles. Adding lfence may make timing data look like > noise. Is the configure option necessary? You could just add a Makefile option in benchtests similar to USE_CLOCK_GETTIME and then document it in benchtests/README. Siddhesh
On 23/10/18 11:58, H.J. Lu wrote: > On 10/23/18, Siddhesh Poyarekar <siddhesh@gotplt.org> wrote: >> On 23/10/18 2:34 PM, Florian Weimer wrote: >>> Shouldn't the benchtests use clock_gettime anyway, to avoid issues in >>> case the TSC is not synchronized across cores? >> >> There's an option USE_CLOCK_GETTIME to make benchtests do that, but >> otherwise it uses the hp_timing bits by default. > > I want something better that rdtsc and very low overhead since some bench > tests only last a few cycles. Adding lfence may make timing data look like > noise. > ideally bench test should be fixed so clock_gettime gives stable enough results. target specific timers are not always available and their results are hard to interpret compared to a standard api that returns wall clock time in sane units.
On 23/10/18 5:24 PM, Szabolcs Nagy wrote: > target specific timers are not always available and their > results are hard to interpret compared to a standard api > that returns wall clock time in sane units. Right, that is specifically why the USE_CLOCK_GETTIME option exists. It's not the default though because comparison between architectures is not the dominant use case. Siddhesh
diff --git a/INSTALL b/INSTALL index f9c5cbb9a3..01e4f7af0c 100644 --- a/INSTALL +++ b/INSTALL @@ -153,6 +153,16 @@ if 'CFLAGS' is specified it must enable optimization. For example: library. This option hardcodes the newly built C library path in dynamic tests so that they can be invoked directly. +'--enable-rdtscp-in-benchtests' + This x86 only option enables RDTSCP instruction in benchtests. + When the GNU C Library is built with '--enable-rdtscp-in-tests', + benchtests will use RDTSCP instruction, instead of RDTSC + instruction, for high precision, low overhead timing, which + provides more precise timing data. The resulting library is + unchanged by this option. Note that benchtests in RDTSCP-enabled + the GNU C Library require CPUs capable of RDTSCP instruction. All + x86 processors since 2010 support RDTSCP instruction. + '--disable-timezone-tools' By default, timezone related utilities ('zic', 'zdump', and 'tzselect') are installed with the GNU C Library. If you are diff --git a/NEWS b/NEWS index f054dc0433..d6007ec4f5 100644 --- a/NEWS +++ b/NEWS @@ -9,6 +9,13 @@ Version 2.29 Major new features: +* The GNU C Library can now be built with --enable-rdtscp-in-benchtests + to use RDTSCP instruction in benchtests for high precision, low + overhead timing on x86 CPUs, which provides more precise timing data. + The resulting library is unchanged. Benchtests in RDTSCP-enabled glibc + require CPUs capable of RDTSCP instruction. All x86 processors since + 2010 support RDTSCP instruction. + * A new convenience target has been added for distribution maintainers to build and install all locales as directories with files. The new target is run by issuing the following command in your build tree: diff --git a/benchtests/Makefile b/benchtests/Makefile index bcd6a9c26d..16326c9e41 100644 --- a/benchtests/Makefile +++ b/benchtests/Makefile @@ -131,6 +131,10 @@ CPPFLAGS-nonlib += -DDURATION=$(BENCH_DURATION) -D_ISOMAC # HP_TIMING if it is available. ifdef USE_CLOCK_GETTIME CPPFLAGS-nonlib += -DUSE_CLOCK_GETTIME +else +ifeq (${enable-rdtscp-in-benchtests},yes) +CPPFLAGS-nonlib += -DENABLE_RDTSCP +endif endif DETAILED_OPT := diff --git a/configure b/configure index f30c31afdc..a562849490 100755 --- a/configure +++ b/configure @@ -794,6 +794,7 @@ enable_pt_chown enable_tunables enable_mathvec enable_cet +enable_rdtscp_in_benchtests with_cpu ' ac_precious_vars='build_alias @@ -1471,6 +1472,8 @@ Optional Features: depends on architecture] --enable-cet enable Intel Control-flow Enforcement Technology (CET), x86 only + --enable-rdtscp-in-benchtests + enable RDTSCP in benchtests, x86 only [default=no] Optional Packages: --with-PACKAGE[=ARG] use PACKAGE [ARG=yes] @@ -3785,6 +3788,16 @@ else fi +# Check whether --enable-rdtscp-in-benchtests was given. +if test "${enable_rdtscp_in_benchtests+set}" = set; then : + enableval=$enable_rdtscp_in_benchtests; rdtscp_in_benchtests=$enableval +else + rdtscp_in_benchtests=no +fi + +config_vars="$config_vars +enable-rdtscp-in-benchtests = $rdtscp_in_benchtests" + # We keep the original values in `$config_*' and never modify them, so we # can write them unchanged into config.make. Everything else uses # $machine, $vendor, and $os, and changes them whenever convenient. diff --git a/configure.ac b/configure.ac index e983fd8faa..d3cfb2c728 100644 --- a/configure.ac +++ b/configure.ac @@ -478,6 +478,13 @@ AC_ARG_ENABLE([cet], [enable_cet=$enableval], [enable_cet=no]) +AC_ARG_ENABLE([rdtscp-in-benchtests], + AC_HELP_STRING([--enable-rdtscp-in-benchtests], + [enable RDTSCP in benchtests, x86 only @<:@default=no@:>@]), + [rdtscp_in_benchtests=$enableval], + [rdtscp_in_benchtests=no]) +LIBC_CONFIG_VAR([enable-rdtscp-in-benchtests], [$rdtscp_in_benchtests]) + # We keep the original values in `$config_*' and never modify them, so we # can write them unchanged into config.make. Everything else uses # $machine, $vendor, and $os, and changes them whenever convenient. diff --git a/manual/install.texi b/manual/install.texi index 61178dadcd..fa3fb912f6 100644 --- a/manual/install.texi +++ b/manual/install.texi @@ -182,6 +182,15 @@ By default, dynamic tests are linked to run with the installed C library. This option hardcodes the newly built C library path in dynamic tests so that they can be invoked directly. +@item --enable-rdtscp-in-benchtests +This x86 only option enables RDTSCP instruction in benchtests. When +@theglibc{} is built with @option{--enable-rdtscp-in-tests}, benchtests +will use RDTSCP instruction, instead of RDTSC instruction, for high +precision, low overhead timing, which provides more precise timing data. +The resulting library is unchanged by this option. Note that benchtests +in RDTSCP-enabled @theglibc{} require CPUs capable of RDTSCP instruction. +All x86 processors since 2010 support RDTSCP instruction. + @item --disable-timezone-tools By default, timezone related utilities (@command{zic}, @command{zdump}, and @command{tzselect}) are installed with @theglibc{}. If you are building diff --git a/sysdeps/x86/hp-timing.h b/sysdeps/x86/hp-timing.h index 77a1360748..fd840314fa 100644 --- a/sysdeps/x86/hp-timing.h +++ b/sysdeps/x86/hp-timing.h @@ -40,7 +40,20 @@ typedef unsigned long long int hp_timing_t; NB: Use __builtin_ia32_rdtsc directly since including <x86intrin.h> makes building glibc very slow. */ -# define HP_TIMING_NOW(Var) ((Var) = __builtin_ia32_rdtsc ()) +# ifdef ENABLE_RDTSCP +/* RDTSCP waits until all previous instructions have executed and all + previous loads are globally visible before reading the counter. + RDTSC doesn't wait until all previous instructions have been executed + before reading the counter. When --enable-rdtscp-in-tests is used, + use RDTSCP in benchtests. */ +# define HP_TIMING_NOW(Var) \ + (__extension__ ({ \ + unsigned int __aux; \ + (Var) = __builtin_ia32_rdtscp (&__aux); \ + })) +# else +# define HP_TIMING_NOW(Var) ((Var) = __builtin_ia32_rdtsc ()) +# endif # include <hp-timing-common.h> #else