Message ID: 87lfsd477i.fsf@oldenburg2.str.redhat.com
State: New
Series: Optimizing hash table lookup in symbol binding
On 18/11/2019 13:58, Florian Weimer wrote:
> My primary interest was the % operator because it turns out that it
> actually shows up in some profiles stressing symbol binding during
> program lookup.  In most cases, however, the search for the right
> mapping dominates and the preceding bitmask check fails most of the
> time.  But with shallow library dependencies, we can end up in a
> situation where the division actually matters.
>
> Strategies for optimizing integer division are discussed in Hacker's
> Delight and here:
>
> <http://ridiculousfish.com/blog/posts/labor-of-division-episode-i.html>
> <http://ridiculousfish.com/blog/posts/labor-of-division-episode-iii.html>
>
> (I have written to the author to get some of the math fixed in minor
> ways, but I think the general direction is solid.)
>
> The algorithm from the first episode looks like this:

Note that _itoa.c already uses something like this.

(I think the itoa code is unnecessary: there is no reason to optimize
anything but base 10 and 16, and the compiler can do a better job at
those than trying something at runtime, but you may look at that
implementation in case it is any better.)
* Szabolcs Nagy:

> On 18/11/2019 13:58, Florian Weimer wrote:
>> My primary interest was the % operator because it turns out that it
>> actually shows up in some profiles stressing symbol binding during
>> program lookup.  In most cases, however, the search for the right
>> mapping dominates and the preceding bitmask check fails most of the
>> time.  But with shallow library dependencies, we can end up in a
>> situation where the division actually matters.
>>
>> Strategies for optimizing integer division are discussed in Hacker's
>> Delight and here:
>>
>> <http://ridiculousfish.com/blog/posts/labor-of-division-episode-i.html>
>> <http://ridiculousfish.com/blog/posts/labor-of-division-episode-iii.html>
>>
>> (I have written to the author to get some of the math fixed in minor
>> ways, but I think the general direction is solid.)
>>
>> The algorithm from the first episode looks like this:
>
> note that _itoa.c already uses something like this.
>
> (i think the itoa code is unnecessary: there is no reason
> for optimizing anything but base 10 and 16, and the compiler
> can do a better job at those than trying something at runtime,
> but you may look at that implementation if it's any better).

I don't think so.  It doesn't show how to generate magic values for
arbitrary divisors, which is the tricky part anyway.  It uses both the
Episode 1 and Episode 3 algorithms.  The uint64_t case for 32-bit
architectures is very complicated because it avoids the __umoddi3
call.  This is *not* something that current GCC would do on its own.
GCC does not even fold the division and modulo into one libcall.

For base 10, it could still result in better code to take a few
__udivdi3 calls to reduce the value into 9-digit pieces, and then use
the 32-bit code to process that efficiently.  I'm not sure if we need
anything else for glibc (only base 8, 10, 16, maybe base 2 somewhere).
People have also posted vectorized implementations of this operation,
including base 10.
But that's not really related to the hash table optimization.  8-)

Thanks,
Florian
On 11/18/19 8:58 AM, Florian Weimer wrote:
> On a second-generation AMD EPYC, I didn't see a difference at all for
> some reason.  On Cascade Lake, I see a moderate improvement for the
> dlsym test, but I don't know how realistic this microbenchmark is.
> Both patches had performance that was on par.
>
> I also tried to remove the bitmask check altogether, but it was not
> an improvement.  I suspect the bitmasks are much smaller, so
> consulting them avoids the cache misses in the full table lookup.
>
> If any of the architecture maintainers think this is worth doing, we
> can still incorporate it.

At this point you are probably straying into the realm of special-purpose
proprietary analysis programs provided by hardware vendors to get the
most performance out of a given program (stalls, cache misses,
interlocks, etc.).

Have you used perf at all to look into aspects beyond just performance?
That is, say, confirming the cache misses saved vs. the full table
lookup?

What about PGO for this case?  I often wonder in cases like this
whether, if we fed "normative" workloads into a PGO build, we could get
better results from generated code like this, but it's an entire
project just to do this.

To accept this code I'd want a microbenchmark added, and then we'd
commit the code, and ask the machine maintainers to review that nothing
got terribly worse or was out of kilter with the microbenchmark
numbers.

Thoughts?
On 20/11/2019 22:55, Carlos O'Donell wrote:
> On 11/18/19 8:58 AM, Florian Weimer wrote:
>> On a second-generation AMD EPYC, I didn't see a difference at all
>> for some reason.  On Cascade Lake, I see a moderate improvement for
>> the dlsym test, but I don't know how realistic this microbenchmark
>> is.  Both patches had performance that was on par.
>>
>> I also tried to remove the bitmask check altogether, but it was not
>> an improvement.  I suspect the bitmasks are much smaller, so
>> consulting them avoids the cache misses in the full table lookup.
>>
>> If any of the architecture maintainers think this is worth doing, we
>> can still incorporate it.
>
> At this point you are probably straying into the realm of
> special-purpose proprietary analysis programs provided by hardware
> vendors to get the most performance out of a given program (stalls,
> cache misses, interlocks, etc.).
>
> Have you used perf at all to look into aspects beyond just
> performance?  That is, say, confirming the cache misses saved vs. the
> full table lookup?
>
> What about PGO for this case?  I often wonder in cases like this
> whether, if we fed "normative" workloads into a PGO build, we could
> get better results from generated code like this, but it's an entire
> project just to do this.
>
> To accept this code I'd want a microbenchmark added, and then we'd
> commit the code, and ask the machine maintainers to review that
> nothing got terribly worse or was out of kilter with the
> microbenchmark numbers.
>
> Thoughts?

I see that such changes are hard to measure over different
architectures, and it gets even trickier with different implementations
within an architecture.  This change seems a straightforward
optimization for architectures that issue a libcall for the modulo
operation, such as arm, alpha, and hppa.  However, it seems to be a
performance regression on other architectures where the 32x32->64
multiplication *also* requires a libcall, such as csky, microblaze,
nios2, and sparcv8.

It becomes even less obvious for architectures that, depending on the
compiler flags, might generate better code, for instance armv7-a, where
the modulo is computed with udiv instead of a libcall.  And there is
even riscv64, which provides a specific instruction for this, remuw
(although I couldn't find the expected latency/throughput for it).

Also, some architectures might show different latencies across chip
versions.  For instance, the Cortex-A57 has a udiv latency of 4-20 and
an execution throughput of 1/20 - 1/4, whereas the Cortex-A55 has a
latency of 3-12 and a throughput of 1/12 - 1/3.  It might be the case
that this trick shows better performance on some chips, while on newer
ones it would be worse than a simple modulo operation.

So it would require a lot of testing and deliberation for each arch
maintainer to check whether this optimization is worthwhile, it would
add another build permutation we need to maintain, and it would need to
be rechecked on every new release to see whether implementations have
gotten better at the modulo operation.

Instead, I would focus on using a better hash table scheme, as Wilco
has suggested.  To give a crude estimate of the possible gain, in the
simple benchmark below the power-of-two rehash policy with mask ranging
showed better performance on multiple architectures, ranging from 17%
to 50%.  Even for more embedded-oriented architectures, such as sh4, I
see that avoiding a modulo hashing scheme showed about 30% better
results on insertion.

And I fully agree that we should at least have some synthetic benchmark
to evaluate it.
--

#include <iostream>
#include <unordered_map>
#include <chrono>

using hash_power2_t = std::_Hashtable<
  uint32_t,
  std::pair<const uint32_t, uint32_t>,
  std::allocator<uint32_t>,
  std::__detail::_Select1st,
  std::equal_to<uint32_t>,
  std::hash<uint32_t>,
  std::__detail::_Mask_range_hashing,
  std::__detail::_Default_ranged_hash,
  std::__detail::_Power2_rehash_policy,
  std::__ummap_traits<true>>;

using hash_mod_t = std::_Hashtable<
  uint32_t,
  std::pair<const uint32_t, uint32_t>,
  std::allocator<uint32_t>,
  std::__detail::_Select1st,
  std::equal_to<uint32_t>,
  std::hash<uint32_t>,
  std::__detail::_Mod_range_hashing,
  std::__detail::_Default_ranged_hash,
  std::__detail::_Prime_rehash_policy,
  std::__ummap_traits<true>>;

int main ()
{
  const int sz = 30000000;

  hash_mod_t hash_mod;
  hash_mod.reserve (sz);

  hash_power2_t hash_power2;
  hash_power2.reserve (sz);

  double mod_time, power2_time;

  {
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < sz; i++)
      hash_mod.emplace (std::make_pair (i, i));
    auto end = std::chrono::steady_clock::now();
    auto diff = end - start;
    mod_time = std::chrono::duration <double, std::nano> (diff).count();
    std::cout << "mod hash   : " << mod_time << " ns" << std::endl;
  }

  {
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < sz; i++)
      hash_power2.emplace (std::make_pair (i, i));
    auto end = std::chrono::steady_clock::now();
    auto diff = end - start;
    power2_time = std::chrono::duration <double, std::nano> (diff).count();
    std::cout << "power2 hash: " << power2_time << " ns" << std::endl;
  }

  std::cout << "speedup    : " << 1 / (power2_time / mod_time) << std::endl;
}
diff --git a/elf/dl-lookup.c b/elf/dl-lookup.c
index fd44cd4101..205d043717 100644
--- a/elf/dl-lookup.c
+++ b/elf/dl-lookup.c
@@ -28,6 +28,7 @@
 #include <libc-lock.h>
 #include <tls.h>
 #include <atomic.h>
+#include <divopt.h>
 #include <assert.h>
@@ -394,8 +395,18 @@ do_lookup_x (const char *undef_name, uint_fast32_t new_hash,
       if (__glibc_unlikely ((bitmask_word >> hashbit1)
			     & (bitmask_word >> hashbit2) & 1))
	{
-	  Elf32_Word bucket = map->l_gnu_buckets[new_hash
-						 % map->l_nbuckets];
+	  Elf32_Word bucket;
+	  if (map->l_nbuckets > 1)
+	    {
+	      uint32_t quotient
+		= divopt_32 (new_hash, map->l_nbuckets_multiplier,
+			     map->l_nbuckets_multiplier_shift);
+	      uint32_t remainder = new_hash - map->l_nbuckets * quotient;
+	      bucket = map->l_gnu_buckets[remainder];
+	    }
+	  else
+	    bucket = map->l_gnu_buckets[0];
+
	  if (bucket != 0)
	    {
	      const Elf32_Word *hasharr = &map->l_gnu_chain_zero[bucket];
@@ -931,6 +942,11 @@ _dl_setup_hash (struct link_map *map)
   /* Initialize MIPS xhash translation table.  */
   ELF_MACHINE_XHASH_SETUP (hash32, symbias, map);
+  if (map->l_nbuckets >= 2)
+    map->l_nbuckets_multiplier_shift
+      = precompute_divopt_32 (map->l_nbuckets,
+			      &map->l_nbuckets_multiplier);
+
   return;
 }
diff --git a/include/divopt.h b/include/divopt.h
new file mode 100644
index 0000000000..85eece8fd9
--- /dev/null
+++ b/include/divopt.h
@@ -0,0 +1,78 @@
+/* Optimization of repeated integer division.
+   Copyright (C) 2019 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#include <stdint.h>
+#include <sys/param.h>
+
+/* Precompute *MULTIPLIER for dividing by DIVISOR, which must be two
+   or larger, and return the shift count (non-negative and less than
+   32), for use with divopt_32 below.  */
+static int __attribute__ ((used))
+precompute_divopt_32 (uint32_t divisor, uint32_t *multiplier)
+{
+  if (divisor == 1)
+    {
+      *multiplier = 1;
+      return 0;
+    }
+
+  int log2 = 32 - __builtin_clz (divisor);
+
+  /* Handle powers-of-two first, so that we do not need to deal with
+     the clz corner cases below.  */
+  if (powerof2 (divisor))
+    {
+      *multiplier = 1;
+      return log2 - 2;
+    }
+
+  if (log2 != 32)
+    {
+      /* Compute ceil (2**(32 + log2) / divisor).  The
+	 most-significant bit is always set and is discarded.  */
+      *multiplier = (((uint64_t) 1 << (32 + log2)) + divisor) / divisor;
+      return log2 - 1;
+    }
+  else
+    {
+      /* Perform a long division of 2**64 + (divisor - 1) by the
+	 divisor, encoded in base-2**32, using a 64-by-32 division.
+	 Start out with the first two digits, which are (1, 0).  2**32
+	 divided by the divisor is 1 because the divisor is larger
+	 than 2**31.  This set bit is discarded.  */
+      uint64_t remainder = -divisor;
+
+      /* Combine the remainder of the first division with the third
+	 and final base 2**32 digit.  */
+      *multiplier = ((remainder << 32) | (divisor - 1)) / divisor;
+      return 31;
+    }
+}
+
+/* Return the quotient of DIVIDEND divided by the divisor that was
+   used to compute MULTIPLIER and SHIFT via precompute_divopt_32.  */
+static inline uint32_t
+divopt_32 (uint32_t dividend, uint32_t multiplier, int shift)
+{
+  /* Approximation to the quotient.  */
+  uint32_t quotient = ((uint64_t) dividend * multiplier) >> 32;
+  /* Compute (dividend + quotient) / 2 without overflow.  */
+  uint32_t temp = ((dividend - quotient) >> 1) + quotient;
+  /* The result is in the higher-order bits.  */
+  return temp >> shift;
+}
diff --git a/include/link.h b/include/link.h
index 1184201f91..b09aa81bb4 100644
--- a/include/link.h
+++ b/include/link.h
@@ -153,6 +153,8 @@ struct link_map
     /* Symbol hash table.  */
     Elf_Symndx l_nbuckets;
+    uint32_t l_nbuckets_multiplier;
+    int l_nbuckets_multiplier_shift;
     Elf32_Word l_gnu_bitmask_idxbits;
     Elf32_Word l_gnu_shift;
     const ElfW(Addr) *l_gnu_bitmask;

It only requires a 32x32->64 multiplier, which is an advantage.
However, the dependency chain in divopt_32 is fairly long, so
performance is less than optimal.

The Episode 3 approach as described is not general, so it would work
well only if we told binutils ld to avoid bucket counts that are
problematic for it.  (In both cases, we need run-time checks to avoid
the unsupported cases, but if we could work together with binutils,
this data-dependent branch should be easy to predict in practice.)

There is a variant of the Episode 3 approach which uses a 64x64->128
multiplication, for a much shorter dependency chain.
For convenience, I used the __int128 support from libgcc below, but
that could probably be avoided on 64-bit architectures with suitable
multipliers:

diff --git a/elf/dl-lookup.c b/elf/dl-lookup.c
index fd44cd4101..4d2f3b91f0 100644
--- a/elf/dl-lookup.c
+++ b/elf/dl-lookup.c
@@ -394,8 +395,17 @@ do_lookup_x (const char *undef_name, uint_fast32_t new_hash,
       if (__glibc_unlikely ((bitmask_word >> hashbit1)
			     & (bitmask_word >> hashbit2) & 1))
	{
-	  Elf32_Word bucket = map->l_gnu_buckets[new_hash
-						 % map->l_nbuckets];
+	  Elf32_Word bucket;
+	  if (powerof2 (map->l_nbuckets))
+	    bucket = map->l_gnu_buckets[new_hash & (map->l_nbuckets - 1)];
+	  else
+	    {
+	      uint32_t quotient
+		= ((unsigned __int128) map->l_nbuckets_multiplier
+		   * ((uint64_t) new_hash + 1)) >> 64;
+	      uint32_t remainder = new_hash - map->l_nbuckets * quotient;
+	      bucket = map->l_gnu_buckets[remainder];
+	    }
+
	  if (bucket != 0)
	    {
	      const Elf32_Word *hasharr = &map->l_gnu_chain_zero[bucket];
@@ -931,6 +941,10 @@ _dl_setup_hash (struct link_map *map)
   /* Initialize MIPS xhash translation table.  */
   ELF_MACHINE_XHASH_SETUP (hash32, symbias, map);
+  if (powerof2 (map->l_nbuckets))
+    map->l_nbuckets_multiplier = __builtin_ctz (map->l_nbuckets);
+  else
+    map->l_nbuckets_multiplier
+      = ((unsigned __int128) 1 << 64) / map->l_nbuckets;
   return;
 }
diff --git a/include/link.h b/include/link.h
index 1184201f91..eec4c9ef6e 100644
--- a/include/link.h
+++ b/include/link.h
@@ -153,6 +153,7 @@ struct link_map
     /* Symbol hash table.  */
     Elf_Symndx l_nbuckets;
+    uint64_t l_nbuckets_multiplier;
     Elf32_Word l_gnu_bitmask_idxbits;
     Elf32_Word l_gnu_shift;
     const ElfW(Addr) *l_gnu_bitmask;