[0/4] x86: Improve ERMS usage on Zen3+

Message ID 20231031200925.3297456-1-adhemerval.zanella@linaro.org


Adhemerval Zanella Netto Oct. 31, 2023, 8:09 p.m. UTC
For the sizes where REP MOVSB and REP STOSB are used on Zen3+ cores, the
resulting performance is lower than with vectorized instructions (and some
input alignments show a very large performance gap, as indicated by
BZ#30995).

glibc enables ERMS on AMD cores for sizes between 2113
(rep_movsb_threshold) and the L2 cache size (rep_movsb_stop_threshold,
524288 on a Zen3 core). Using the benchmarks provided in BZ#30995, memcpy
on a Ryzen 9 5900X shows:

  Size (bytes)   Destination Alignment      Throughput (GB/s)
  2113                               0                84.2448              
  2113                              15                 4.4310
  524287                             0                57.1122 
  524287                            15                4.34671

While vectorized instructions, selected with the tunable
GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000, show:

  Size (bytes)   Destination Alignment      Throughput (GB/s)
  2113                               0               124.1830             
  2113                              15               121.8720
  524287                             0                58.3212 
  524287                            15                58.5352 

Increasing the number of concurrent jobs does show ERMS improving
relative to the vectorized instructions: the gap narrows when input
alignments are equal, although ERMS still does not reach parity with the
vectorized path.

memset shows a similar performance improvement with vectorized
instructions instead of REP STOSB. On the same machine, the default
strategy shows:

  Size (bytes)   Destination Alignment      Throughput (GB/s)
  2113                               0                68.0113            
  2113                              15                56.1880
  524287                             0               119.3670
  524287                            15               116.2590

While with GLIBC_TUNABLES=glibc.cpu.x86_rep_stosb_threshold=1000000: 

  Size (bytes)   Destination Alignment      Throughput (GB/s)
  2113                               0               133.2310
  2113                              15               132.5800
  524287                             0               112.0650
  524287                            15               118.0960

I also saw a slight performance increase on 502.gcc_r (1 copy), where
the result went from 9.82 to 9.85. This benchmark exercises both memcpy
and memset heavily.

The first patch adds a way to check whether a tunable is set (BZ 27069),
which is used in the second patch to select the best strategy. The BZ
30994 fix also adds a new tunable, glibc.cpu.x86_rep_movsb_stop_threshold,
so the caller can specify a size range in which to force ERMS usage (the
BZ 30994 discussion notes some cases where ERMS is profitable). Patch 3
disables ERMS usage for memset on Zen3+, and patch 4 slightly improves
the x86 memcpy documentation.

Adhemerval Zanella (4):
  elf: Add a way to check if tunable is set (BZ 27069)
  x86: Fix Zen3/Zen4 ERMS selection (BZ 30994)
  x86: Do not prefer ERMS for memset on Zen3+
  x86: Expand the comment on when REP STOSB is used on memset

 elf/dl-tunable-types.h                        |  1 +
 elf/dl-tunables.c                             | 40 ++++++++++
 elf/dl-tunables.h                             | 28 +++++++
 elf/dl-tunables.list                          |  1 +
 manual/tunables.texi                          |  9 +++
 scripts/gen-tunables.awk                      |  4 +-
 sysdeps/x86/dl-cacheinfo.h                    | 74 ++++++++++++-------
 sysdeps/x86/dl-tunables.list                  | 10 +++
 .../multiarch/memset-vec-unaligned-erms.S     |  4 +-
 9 files changed, 142 insertions(+), 29 deletions(-)

Comments

Sajan Karumanchi Nov. 15, 2023, 7:05 p.m. UTC | #1
Adhemerval,

We added this to our todo list, and will get back shortly after verifying the patches.

-Sajan
Adhemerval Zanella Netto Nov. 16, 2023, 6:35 p.m. UTC | #2
On 15/11/23 16:05, sajan.karumanchi@gmail.com wrote:
> Adhemerval,
> 
> We added this to our todo list, and will get back shortly after verifying the patches.
> 
> -Sajan

Thanks Sajan, let me know if you need anything else.  I only have access to a Zen3
machine, so if you could also check BZ30995 [1] it would be helpful (it is related
to Zen4 memcpy performance).

[1] https://sourceware.org/bugzilla/show_bug.cgi?id=30995
Sajan Karumanchi Feb. 5, 2024, 7:01 p.m. UTC | #3
Adhemerval,

In our extensive testing, we observed mixed results for rep-movs/stos performance with the ERMS feature enabled. 
Henceforth, we approve this patch to avoid the ERMS code path on AMD processors for better performance.

-Sajan
Adhemerval Zanella Netto Feb. 6, 2024, 1 p.m. UTC | #4
On 05/02/24 16:01, Sajan Karumanchi wrote:
> 
> Adhemerval,
> 
> In our extensive testing, we observed mixed results for rep-movs/stos performance with the ERMS feature enabled. 
> Henceforth, we approve this patch to avoid the ERMS code path on AMD processors for better performance.
> 
> -Sajan
> 
> 

Thanks for checking this out, Sajan. I will rebase with some wording fixes
in the comments and double-check that everything is ok.  If you can, please
send an Acked-by or Reviewed-by in the next comment.  I will also check with
H.J. and Noah (the x86 maintainers) to see if everything is ok.