Message ID | 20210107162209.4047176-1-sajan.karumanchi@amd.com |
---|---|
State | New |
Series | x86: Adding an upper bound for Enhanced REP MOVSB. |
* sajan karumanchi:

> From: Sajan Karumanchi <sajan.karumanchi@amd.com>
>
> In the process of optimizing memcpy for AMD machines, we have found the
> vector move operations are outperforming enhanced REP MOVSB for data
> transfers above the L2 cache size on Zen3 architectures.
> To handle this use case, we are adding an upper bound parameter on
> enhanced REP MOVSB:'__x86_max_rep_movsb_threshold'.
> As per large-bench results, we are configuring this parameter to the
> L2 cache size for AMD machines and applicable from Zen3 architecture
> supporting the ERMS feature.
> For architectures other than AMD, it is the computed value of
> non-temporal threshold parameter.
>
> Reviewed-by: Premachandra Mallappa <premachandra.mallappa@amd.com>

Thanks for the patch. Would you be able to rebase it on top of current
master? There are some non-trivial conflicts, as far as I can see.

Florian
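The vendor-dependent choice described in the patch summary can be sketched in C as follows. This is a simplified illustration, not the glibc code: `enum cpu_vendor` and `pick_max_rep_movsb_threshold` are hypothetical names, and the sketch ignores the additional condition that the AMD value applies only from Zen3 onward.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the vendor check; glibc uses its own
   cpu_features structure instead.  */
enum cpu_vendor
{
  VENDOR_AMD,
  VENDOR_OTHER
};

/* Return the size at which REP MOVSB should stop being used.
   AMD: the L2 cache size.  Others: the non-temporal threshold.  */
size_t
pick_max_rep_movsb_threshold (enum cpu_vendor vendor, size_t l2_size,
                              size_t non_temporal_threshold)
{
  /* Sketch-only sanity check: both inputs must be known.  */
  assert (l2_size > 0 && non_temporal_threshold > 0);

  if (vendor == VENDOR_AMD)
    return l2_size;
  return non_temporal_threshold;
}
```

For example, with an assumed 512 KiB L2 cache and an assumed 3 MiB non-temporal threshold, the function returns 512 KiB for `VENDOR_AMD` and 3 MiB for `VENDOR_OTHER`.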
[AMD Public Use]

Hi Florian,

I have pushed a new patch on top of the rebased master branch.

Thanks & Regards,
Sajan K.

-----Original Message-----
From: Florian Weimer <fweimer@redhat.com>
Sent: Friday, January 8, 2021 7:33 PM
To: sajan.karumanchi--- via Libc-alpha <libc-alpha@sourceware.org>
Cc: carlos@redhat.com; hjl.tools@gmail.com; Karumanchi, Sajan <Sajan.Karumanchi@amd.com>; Mallappa, Premachandra <Premachandra.Mallappa@amd.com>
Subject: Re: [PATCH] x86: Adding an upper bound for Enhanced REP MOVSB.

[CAUTION: External Email]

* sajan karumanchi:

> From: Sajan Karumanchi <sajan.karumanchi@amd.com>
>
> In the process of optimizing memcpy for AMD machines, we have found
> the vector move operations are outperforming enhanced REP MOVSB for
> data transfers above the L2 cache size on Zen3 architectures.
> To handle this use case, we are adding an upper bound parameter on
> enhanced REP MOVSB:'__x86_max_rep_movsb_threshold'.
> As per large-bench results, we are configuring this parameter to the
> L2 cache size for AMD machines and applicable from Zen3 architecture
> supporting the ERMS feature.
> For architectures other than AMD, it is the computed value of
> non-temporal threshold parameter.
>
> Reviewed-by: Premachandra Mallappa <premachandra.mallappa@amd.com>

Thanks for the patch. Would you be able to rebase it on top of current
master? There are some non-trivial conflicts, as far as I can see.

Florian
--
Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn,
Commercial register: Amtsgericht Muenchen, HRB 153243,
Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill
* Sajan Karumanchi:

> [AMD Public Use]
>
> Hi Florian,
>
> I have pushed a new patch on top of the rebased master branch.

I've received RM ack (from Adhemerval) for the patch off-list, and I
think we should put it into the release.

However, we need another rebase. 8-( Sorry about that. Would you please
be so kind to post it?

Thanks,
Florian
On 18/01/2021 14:07, Florian Weimer via Libc-alpha wrote:
> * Sajan Karumanchi:
>
>> [AMD Public Use]
>>
>> Hi Florian,
>>
>> I have pushed a new patch on top of the rebased master branch.
>
> I've received RM ack (from Adhemerval) for the patch off-list, and I
> think we should put it into the release.

Btw, I have sent a message in private to Florian, where it should be sent
to libc-alpha.

--

This is ok for 2.33. If I understand correctly, it should affect only
memcpy performance on x86, right?

--

>
> However, we need another rebase. 8-( Sorry about that. Would you please
> be so kind to post it?
>
> Thanks,
> Florian
>
diff --git a/sysdeps/x86/cacheinfo.c b/sysdeps/x86/cacheinfo.c
index 3fb4a028d8..9d7f8992be 100644
--- a/sysdeps/x86/cacheinfo.c
+++ b/sysdeps/x86/cacheinfo.c
@@ -1,5 +1,5 @@
 /* x86_64 cache info.
-   Copyright (C) 2003-2020 Free Software Foundation, Inc.
+   Copyright (C) 2003-2021 Free Software Foundation, Inc.
    This file is part of the GNU C Library.
 
    The GNU C Library is free software; you can redistribute it and/or
@@ -533,6 +533,9 @@ long int __x86_shared_non_temporal_threshold attribute_hidden;
 /* Threshold to use Enhanced REP MOVSB.  */
 long int __x86_rep_movsb_threshold attribute_hidden = 2048;
 
+/* Threshold to stop using Enhanced REP MOVSB.  */
+long int __x86_max_rep_movsb_threshold attribute_hidden = 512 * 1024;
+
 /* Threshold to use Enhanced REP STOSB.  */
 long int __x86_rep_stosb_threshold attribute_hidden = 2048;
 
@@ -839,6 +842,11 @@ init_cacheinfo (void)
 	      /* Account for exclusive L2 and L3 caches.  */
 	      shared += core;
 	    }
+      /* ERMS feature is implemented from Zen3 architecture and it is
+	 performing poorly for data above L2 cache size. Henceforth, adding
+	 an upper bound threshold parameter to limit the usage of Enhanced
+	 REP MOVSB operations and setting its value to L2 cache size.  */
+      __x86_max_rep_movsb_threshold = core;
 	}
     }
 
@@ -909,6 +917,11 @@ init_cacheinfo (void)
   else
     __x86_rep_movsb_threshold = rep_movsb_threshold;
 
+  /* Setting the upper bound of ERMS to the known default value of
+     non-temporal threshold for architectures other than AMD.  */
+  if (cpu_features->basic.kind != arch_kind_amd)
+    __x86_max_rep_movsb_threshold = __x86_shared_non_temporal_threshold;
+
 # if HAVE_TUNABLES
   __x86_rep_stosb_threshold = cpu_features->rep_stosb_threshold;
 # endif
diff --git a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
index bd5dc1a3f3..c18eaf7ef6 100644
--- a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
+++ b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
@@ -1,5 +1,5 @@
 /* memmove/memcpy/mempcpy with unaligned load/store and rep movsb
-   Copyright (C) 2016-2020 Free Software Foundation, Inc.
+   Copyright (C) 2016-2021 Free Software Foundation, Inc.
    This file is part of the GNU C Library.
 
    The GNU C Library is free software; you can redistribute it and/or
@@ -233,7 +233,7 @@ L(return):
 	ret
 
 L(movsb):
-	cmp	__x86_shared_non_temporal_threshold(%rip), %RDX_LP
+	cmp	__x86_max_rep_movsb_threshold(%rip), %RDX_LP
 	jae	L(more_8x_vec)
 	cmpq	%rsi, %rdi
 	jb	1f
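
The effect of the changed comparison in `L(movsb)` can be illustrated with a small C sketch of the size-based dispatch. It rests on assumptions: the enum and function names are hypothetical, and the lower-bound check against `__x86_rep_movsb_threshold` is taken to happen before `L(movsb)` is reached, which this diff does not show. Only the upper-bound test (`cmp` followed by `jae L(more_8x_vec)`) comes directly from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical labels for the copy strategies; glibc has no such enum.  */
enum copy_strategy
{
  COPY_VECTOR_SMALL,  /* Below the lower ERMS threshold: vector moves.  */
  COPY_REP_MOVSB,     /* Inside the ERMS window: REP MOVSB.  */
  COPY_VECTOR_LARGE   /* At or above the upper bound: vector loop.  */
};

/* Classify a copy of N bytes given the two thresholds.  */
enum copy_strategy
choose_copy_strategy (size_t n, size_t rep_movsb_threshold,
                      size_t max_rep_movsb_threshold)
{
  /* Sketch-only precondition: the ERMS window must not be inverted.  */
  assert (rep_movsb_threshold <= max_rep_movsb_threshold);

  if (n < rep_movsb_threshold)
    return COPY_VECTOR_SMALL;
  /* Mirrors "cmp __x86_max_rep_movsb_threshold; jae L(more_8x_vec)".  */
  if (n >= max_rep_movsb_threshold)
    return COPY_VECTOR_LARGE;
  return COPY_REP_MOVSB;
}
```

With the static defaults from the patch (2048 and 512 * 1024), a 4096-byte copy lands in the REP MOVSB window, while a 1 MiB copy is routed to the vector loop.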