Message ID | e6de570b-48bf-88cf-2cec-5f5a5e7821bf@huawei.com |
---|---|
State | New |
Series | x86: update REP_STOSB_THRESHOLD's default value from 2k to 1M |
On 5/23/20 9:40 AM, liqingqing wrote:
> this commitid 830566307f038387ca0af3fd327706a8d1a2f595 optimize implementation of function memset,
> and set macro REP_STOSB_THRESHOLD's default value to 2KB, when the input value is less than 2KB, the data flow is the same, and when the input value is larger than 2KB,
> this api will use STOSB instead of MOVQ
>
> but when I test this API on x86_64 platform
> and found that this default value is not appropriate for some input length. here it's the environment and result

This patch is not needed anymore since the threshold has been made a tunable: glibc.cpu.x86_rep_movsb_threshold.

Siddhesh

OK, thanks.

On 2020/12/21 12:38, Siddhesh Poyarekar wrote:
> This patch is not needed anymore since the threshold has been made a tunable: glibc.cpu.x86_rep_movsb_threshold.
>
> Siddhesh

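For anyone who wants to check the effect of a threshold on their own machine without rebuilding glibc, a rough timing sketch follows. It is not libMicro and it reports CPU time in nanoseconds rather than cycles, so its numbers are not comparable to the ones in the patch description. The function names `time_memset_ns` and `report_memset_times` are made up for this example, and the sizes simply mirror the test names used in the patch description.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: average CPU time, in nanoseconds, of one memset
   call of `size` bytes over `iters` calls.  */
double time_memset_ns(unsigned char *buf, size_t size, long iters)
{
    /* Calling through a volatile function pointer keeps the compiler from
       inlining or dropping the calls, so the libc memset is measured.  */
    void *(*volatile fill)(void *, int, size_t) = memset;

    assert(buf != NULL && iters > 0);
    fill(buf, 0, size);                 /* warm-up: fault the pages in */

    clock_t start = clock();
    for (long i = 0; i < iters; i++)
        fill(buf, 0xA5, size);
    clock_t stop = clock();

    return (double)(stop - start) / CLOCKS_PER_SEC * 1e9 / (double)iters;
}

/* Hypothetical driver: time the sizes named in the patch description.  */
void report_memset_times(FILE *out)
{
    static const struct { const char *name; size_t size; } cases[] = {
        { "memset_4k",   4u * 1024 },        { "memset_10k",  10u * 1024 },
        { "memset_36k",  36u * 1024 },       { "memset_100k", 100u * 1024 },
        { "memset_500k", 500u * 1024 },      { "memset_1m",   1024u * 1024 },
        { "memset_10m",  10u * 1024 * 1024 },
    };
    const size_t total = 64u * 1024 * 1024;  /* bytes written per case */

    for (size_t i = 0; i < sizeof cases / sizeof cases[0]; i++) {
        unsigned char *buf = malloc(cases[i].size);
        if (buf == NULL)
            continue;
        long iters = (long)(total / cases[i].size);
        fprintf(out, "%-12s %12.1f ns/call\n", cases[i].name,
                time_memset_ns(buf, cases[i].size, iters));
        free(buf);
    }
}
```

Call `report_memset_times(stdout)` from a small `main` and run the binary twice, once normally and once with the tunable set in the `GLIBC_TUNABLES` environment variable. Note that the glibc manual documents `glibc.cpu.x86_rep_stosb_threshold` separately from the `glibc.cpu.x86_rep_movsb_threshold` named in the reply; the STOSB one is the tunable that should govern memset, assuming glibc 2.32 or later.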
diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
index dcd63c92..92c08eed 100644
--- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
+++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
@@ -65,7 +65,7 @@
 Enhanced REP STOSB. Since the stored value is fixed, larger register size has minimal impact on threshold. */
 #ifndef REP_STOSB_THRESHOLD
-# define REP_STOSB_THRESHOLD 2048
+# define REP_STOSB_THRESHOLD 1048576
 #endif

 #ifndef SECTION

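To make the role of the macro concrete, here is a simplified C sketch of a threshold-based dispatch. It is an illustration only, not glibc's implementation: the real code is hand-written assembly with vector stores and more size classes. The names `choose_path`, `fill_wide_stores`, `fill_rep_stosb` and `sketch_memset` are hypothetical, and the 8-byte stores merely stand in for the MOVQ path mentioned in the patch description.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Mirrors the macro touched by the diff; override with
   -DREP_STOSB_THRESHOLD=... at build time.  */
#ifndef REP_STOSB_THRESHOLD
# define REP_STOSB_THRESHOLD 2048
#endif

enum fill_path { PATH_WIDE_STORES, PATH_REP_STOSB };

/* Assumption taken from the patch description: lengths above the
   threshold use REP STOSB, everything else uses ordinary stores.  */
enum fill_path choose_path(size_t n)
{
    return n > REP_STOSB_THRESHOLD ? PATH_REP_STOSB : PATH_WIDE_STORES;
}

/* Stand-in for the store loop: 8 bytes per store, then a byte tail.  */
void fill_wide_stores(void *dst, int c, size_t n)
{
    unsigned char *p = dst;
    uint64_t pattern = 0x0101010101010101ULL * (unsigned char)c;

    while (n >= sizeof pattern) {
        memcpy(p, &pattern, sizeof pattern);
        p += sizeof pattern;
        n -= sizeof pattern;
    }
    while (n--)
        *p++ = (unsigned char)c;
}

/* REP STOSB on x86-64 with GCC or Clang; plain byte loop elsewhere.  */
void fill_rep_stosb(void *dst, int c, size_t n)
{
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    __asm__ volatile ("rep stosb"
                      : "+D" (dst), "+c" (n)   /* RDI = dest, RCX = count */
                      : "a" (c)                /* AL  = fill byte */
                      : "memory");
#else
    unsigned char *p = dst;
    while (n--)
        *p++ = (unsigned char)c;
#endif
}

void *sketch_memset(void *dst, int c, size_t n)
{
    assert(dst != NULL || n == 0);
    if (choose_path(n) == PATH_REP_STOSB)
        fill_rep_stosb(dst, c, n);
    else
        fill_wide_stores(dst, c, n);
    return dst;
}
```

With the default of 2048, a 4096-byte fill takes the REP STOSB path; building with `-DREP_STOSB_THRESHOLD=1048576`, as the diff proposes, would keep it on the store loop.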
Commit 830566307f038387ca0af3fd327706a8d1a2f595 optimized the implementation of memset and set the default value of the macro REP_STOSB_THRESHOLD to 2KB. When the input length is less than 2KB, the data flow is unchanged; when the input length is larger than 2KB, memset uses STOSB instead of MOVQ.

However, when I tested memset on an x86_64 platform, I found that this default value is not appropriate for some input lengths. Here are the environment and the results.

test suite: libMicro-0.4.0
./memset -E -C 200 -L -S -W -N "memset_4k" -s 4k -I 250
./memset -E -C 200 -L -S -W -N "memset_4k_uc" -s 4k -u -I 400
./memset -E -C 200 -L -S -W -N "memset_1m" -s 1m -I 200000
./memset -E -C 200 -L -S -W -N "memset_10m" -s 10m -I 2000000

hardware platform:
Intel(R) Xeon(R) Gold 6266C CPU @ 3.00GHz
L1d cache: 32KB
L1i cache: 32KB
L2 cache: 1MB
L3 cache: 60MB

The result is that when the input length is between the processor's L1 data cache size and its L2 cache size, REP_STOSB_THRESHOLD=2KB reduces performance.

Test | Before this commit (cycles) | After this commit (cycles) |
---|---|---|
memset_4k | 249 | 96 |
memset_10k | 657 | 185 |
memset_36k | 2773 | 3767 |
memset_100k | 7594 | 10002 |
memset_500k | 37678 | 52149 |
memset_1m | 86780 | 108044 |
memset_10m | 1307238 | 1148994 |

Test | Before this commit (MLC cache misses, 10 sec) | After this commit (MLC cache misses, 10 sec) |
---|---|---|
memset_4k | 1,09,33,823 | 1,01,79,270 |
memset_10k | 1,23,78,958 | 1,05,41,087 |
memset_36k | 3,61,64,244 | 4,07,22,429 |
memset_100k | 8,25,33,052 | 9,31,81,253 |
memset_500k | 37,32,55,449 | 43,56,70,395 |
memset_1m | 75,16,28,239 | 88,29,90,237 |
memset_10m | 9,36,61,67,397 | 8,96,69,49,522 |

Although REP_STOSB_THRESHOLD can be changed at build time with -DREP_STOSB_THRESHOLD=xxx, I think the default value is not the best choice, because the L2 cache of most processors is much larger than 2KB. So I am submitting the patch below:

From 44314a556239a7524b5a6451025737c1bdbb1cd0 Mon Sep 17 00:00:00 2001
From: liqingqing <liqingqing3@huawei.com>
Date: Thu, 21 May 2020 11:23:06 +0800
Subject: [PATCH] update REP_STOSB_THRESHOLD's default value from 2k to 1M

The current default value of the macro REP_STOSB_THRESHOLD reduces memset performance when the input length is between the processor's L1 data cache size and its L2 cache size, so update the default value to eliminate the regression.

---
 sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
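
As a reading aid for the cycle counts in the patch description, the sketch below turns each before/after pair into a relative change. The figures are copied from the cycle table; the struct and function names (`cycle_row`, `percent_change`, `print_cycle_changes`) are made up for this example.

```c
#include <stddef.h>
#include <stdio.h>

/* One row of the cycle table from the patch description.  */
struct cycle_row {
    const char *name;
    double before;   /* cycles before commit 830566307f03... */
    double after;    /* cycles after that commit */
};

const struct cycle_row cycle_rows[] = {
    { "memset_4k",       249,      96 },
    { "memset_10k",      657,     185 },
    { "memset_36k",     2773,    3767 },
    { "memset_100k",    7594,   10002 },
    { "memset_500k",   37678,   52149 },
    { "memset_1m",     86780,  108044 },
    { "memset_10m",  1307238, 1148994 },
};

/* Relative change in percent; positive means more cycles (slower).  */
double percent_change(double before, double after)
{
    return (after - before) / before * 100.0;
}

/* Print one line per row with the signed relative change.  */
void print_cycle_changes(FILE *out)
{
    for (size_t i = 0; i < sizeof cycle_rows / sizeof cycle_rows[0]; i++)
        fprintf(out, "%-12s %+7.1f%%\n", cycle_rows[i].name,
                percent_change(cycle_rows[i].before, cycle_rows[i].after));
}
```

Calling `print_cycle_changes(stdout)` shows the sign flip the description points at: the 4k and 10k cases get faster, the 36k to 1m cases get slower, and the 10m case gets faster again.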