Message ID: 20170316154908.GA575@tigerII.localdomain
State: New
On Fri, 17 Mar 2017, Sergey Senozhatsky wrote:

> c) I don't think I see what the "target milestone" is even supposed
> to mean. Sorry! glibc version? min glibc version that requires a
> backport of this fix (if there are -stable/LTS glibc releases)? etc.
> etc. etc.

See: <https://sourceware.org/glibc/wiki/Bugzilla%20Procedures>. The person
who commits a fix for a bug needs to set the target milestone of the bug
along with marking it fixed (so if 2.26 is the first release with the fix,
it needs to be set to 2.26).
On Fri, 17 Mar 2017, Sergey Senozhatsky wrote:

> I'm not sure I see Ulrich's "You do not even understand how binary
> searching works, do you? The sum can never exceed nmemb and nmemb
> obviously fits into an size_t" point. it's a bug.

As I understand it: it's a bug for an array the number of whose elements
is more than SIZE_MAX / 2.

Now, there is certainly no need for the comparison function actually to
look at the contents of the elements; it's completely valid for it to do
something magic that depends on the position of the element in the array
rather than (just) its contents. And 1-byte elements are certainly valid.
But the pointers do need to be valid - although when you enter the POSIX
context there is a strong argument that as long as the addresses are all
within the process's address space, it's OK for them to be mapped with
PROT_NONE permissions (as long as the comparison function doesn't actually
access the referenced objects).

Allocating objects occupying half or more of the address space, even with
PROT_NONE permissions, is questionable because pointer subtraction cannot
work reliably in them (so there is an argument for malloc, mmap etc. to
disallow such allocations). But it's believed, from past discussions, that
some users of 32-bit systems (possibly 32-bit processes under a 64-bit
kernel) do in fact rely on being able to make a single allocation of 2 GB
or more, and our existing practice for string functions is to consider it
a bug if a string function doesn't work correctly for >= 2 GB strings on
32-bit systems. So on the same basis, this bsearch issue should be
considered a bug.
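The SIZE_MAX / 2 condition above is the classic midpoint-overflow trap. A
minimal sketch of the two midpoint formulas (helper names are mine, not
glibc's): with 1-byte elements and bounds past SIZE_MAX / 2, the sum-based
form wraps while the difference-based form stays correct.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum-based midpoint: lo + hi can wrap around SIZE_MAX.  Unsigned
   wraparound is well-defined in C, but the resulting index is bogus.  */
static size_t
mid_classic (size_t lo, size_t hi)
{
  return (lo + hi) / 2;
}

/* Difference-based midpoint: hi - lo always fits in size_t, so no
   wraparound is possible.  */
static size_t
mid_safe (size_t lo, size_t hi)
{
  return lo + (hi - lo) / 2;
}
```

With lo = SIZE_MAX / 2 and hi = SIZE_MAX / 2 + 2 (plausible bounds inside
an array of more than SIZE_MAX / 2 one-byte elements), lo + hi wraps to 0,
so mid_classic returns 0, while mid_safe returns SIZE_MAX / 2 + 1.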
If the fix reduces performance, there's a case for making it conditional:

  if (__builtin_constant_p (__size) && __size >= 2)
    /* old code */
  else
    /* new code */

given that the issue can only apply with elements of size 1, and the
common case probably *is* that the size is constant (__builtin_constant_p
works in inline functions) and at least 2, so that would avoid affecting
the code generated in the common case at all.
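Spelled out as a compilable sketch (the function name and shape are mine;
glibc's actual inline lives in bits/stdlib-bsearch.h), the conditional
would look roughly like:

```c
#include <stddef.h>

/* Midpoint for bsearch's [__l, __u) index range.  When the compiler can
   prove the element size is a constant >= 2, the sum cannot overflow
   (such an array has at most SIZE_MAX / 2 elements), so the old, cheap
   form is kept; otherwise fall back to the overflow-safe form.  */
static size_t
__bsearch_mid (size_t __l, size_t __u, size_t __size)
{
  if (__builtin_constant_p (__size) && __size >= 2)
    return (__l + __u) / 2;           /* old code */
  else
    return __l + (__u - __l) / 2;     /* new code */
}
```

__builtin_constant_p is evaluated after inlining and constant propagation,
which is why this can work for a header inline even though __size is a
function parameter.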
On 17 Mar 2017 00:49, Sergey Senozhatsky wrote:

> On (03/16/17 14:02), Joseph Myers wrote:
> > If this fixes a user-visible bug then the ChangeLog entry needs to include
> > [BZ #N] referencing the bug filed in Bugzilla (and once the fix is in the
> > bug needs to be resolved as FIXED with appropriate target milestone set).
> > Is this bug 2753? If not, a new bug would need to be filed for it.
>
> a) um... I assume glibc Bugzilla is located at https://sourceware.org/bugzilla/
> and 2753, thus, is https://sourceware.org/bugzilla/show_bug.cgi?id=2753
> if so, then, yes looks like I'm not the first one to point that out.
> I'm not sure I see Ulrich's "You do not even understand how binary
> searching works, do you? The sum can never exceed nmemb and nmemb
> obviously fits into an size_t" point. it's a bug.

if you see a comment by Ulrich that looks to be unrelated/off, then you
can probably safely assume it is. if the bug is open still, then we
usually consider it *now* to be something worth fixing. sorry for the
hassle.
-mike
On (03/16/17 20:33), Joseph Myers wrote:
> On Fri, 17 Mar 2017, Sergey Senozhatsky wrote:
> > I'm not sure I see Ulrich's "You do not even understand how binary
> > searching works, do you? The sum can never exceed nmemb and nmemb
> > obviously fits into an size_t" point. it's a bug.

so it was 1am yesterday when I took a look at that BZ and I didn't read
it carefully (guilty). I kinda see what they meant there.

[..]
> given that the issue can only apply with elements of size 1, and the
> common case probably *is* that the size is constant (__builtin_constant_p
> works in inline functions) and at least 2, so that would avoid affecting
> the code generated in the common case at all.

the builtin __builtin_constant_p(size) suggestion sounds reasonable, since
that new middle element calculation does come at a price.

---

so I modified the test I posted earlier. moved the array to .bss, made the
array size reasonable (so there is no overflow). each ./a.out executes
bsearch() 10000000 times.

# perf stat ./old_bsearch.out >> OLD 2>&1
# perf stat ./new_bsearch.out >> NEW 2>&1
# paste OLD NEW

the results ( you guys use big monitors, right? :) ), condensed into one
table per array type. context-switches and cpu-migrations were 0 and
page-faults ~50 in every run:

== CHAR array ==

           task-clock (ms)  cycles         stalled-front  stalled-back   instructions   branches       br-misses   elapsed (s)
UNPATCHED
  run 1    490.700599       1,539,234,373  127,938,341    109,189,954    3,806,104,997  1,091,706,962  11,753,417  0.491320512
  run 2    491.312169       1,545,979,259  128,899,741    103,723,436    3,809,678,005  1,091,129,881  12,016,960  0.491822793
  run 3    490.530342       1,536,658,844  126,761,969    124,498,779    3,817,891,053  1,088,399,022  11,599,517  0.491065497
  run 4    494.313419       1,540,156,057  133,690,009    105,653,002    3,808,529,575  1,088,698,277  11,517,756  0.494938286
PATCHED
  run 1    571.728758       1,792,026,989  131,679,816     95,841,146    4,331,940,534  1,090,066,259  10,862,016  0.572367076
  run 2    576.186929       1,795,566,378  141,401,217    101,552,870    4,350,733,034  1,087,751,930  10,725,247  0.576731036
  run 3    572.092012       1,786,237,073  133,130,965     98,609,399    4,333,473,067  1,089,672,636  10,308,961  0.573038448
  run 4    570.833950       1,789,383,612  133,977,834    100,380,149    4,360,155,445  1,088,622,411  10,616,959  0.571451079

== INT array ==

           task-clock (ms)  cycles         stalled-front  stalled-back   instructions   branches       br-misses   elapsed (s)
UNPATCHED
  run 1    443.792538       1,400,310,241  112,395,786     95,467,149    3,371,853,412    966,737,955  10,749,026  0.445114068
  run 2    447.392463       1,398,350,256  121,919,785     94,562,361    3,399,240,172    964,909,971  10,618,480  0.448073517
  run 3    447.387472       1,406,720,267  123,427,413    112,720,863    3,376,115,599    970,199,916  11,041,374  0.448035542
  run 4    450.246740       1,403,767,028  124,553,251    101,045,481    3,380,728,555    973,763,119  11,511,822  0.451002497
PATCHED
  run 1    515.688363       1,623,123,844  151,631,032    108,350,044    3,854,946,185    969,559,223  11,323,207  0.516332787
  run 2    512.701616       1,606,918,412  128,876,515     94,866,954    3,876,976,133    969,090,907  10,861,681  0.513532781
  run 3    517.531654       1,616,525,247  142,913,567    114,806,324    3,868,878,104    967,881,284  11,289,517  0.518214692
  run 4    518.604218       1,622,782,703  148,974,707    112,861,212    3,865,667,812    970,109,337  11,546,829  0.519224084

-ss
On Thu, Mar 16, 2017 at 04:54:40PM -0400, Mike Frysinger wrote:
> On 17 Mar 2017 00:49, Sergey Senozhatsky wrote:
> > On (03/16/17 14:02), Joseph Myers wrote:
> > > If this fixes a user-visible bug then the ChangeLog entry needs to include
> > > [BZ #N] referencing the bug filed in Bugzilla (and once the fix is in the
> > > bug needs to be resolved as FIXED with appropriate target milestone set).
> > > Is this bug 2753? If not, a new bug would need to be filed for it.
> >
> > a) um... I assume glibc Bugzilla is located at https://sourceware.org/bugzilla/
> > and 2753, thus, is https://sourceware.org/bugzilla/show_bug.cgi?id=2753
> > if so, then, yes looks like I'm not the first one to point that out.
> > I'm not sure I see Ulrich's "You do not even understand how binary
> > searching works, do you? The sum can never exceed nmemb and nmemb
> > obviously fits into an size_t" point. it's a bug.
>
> if you see a comment by Ulrich that looks to be unrelated/off, then you
> can probably safely assume it is. if the bug is open still, then we
> usually consider it *now* to be something worth fixing.

No, this is wrong fix as I probably write before. It isn't visible bug
unless programmer is idiot who tries to sort gigabyte long array of
single bytes and use binary search on it. Just keeping 256 byte table of
positions would be more effective than sorting. For any bigger size there
isn't overflow as difference between that pointers is bounded. This fix
takes extra instruction which would harm performance for basically
everybody for nonexistent problem. If you want to fix this special-case
one byte branch to make gcc optimize it away for larger sizes.
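For the record, the "256 byte table of positions" alternative mentioned
above looks something like this (a sketch under my own naming, not
anyone's posted code): one O(n) pass replaces both the sort and every
subsequent binary search when the elements are single bytes.

```c
#include <stddef.h>
#include <string.h>

/* One O(n) pass records, for each possible byte value, where it first
   occurs and how often.  After that, lookup is O(1) - no sorting and no
   binary search needed for arrays of single-byte elements.  */
struct byte_index
{
  size_t first[256];
  size_t count[256];
};

static void
byte_index_build (struct byte_index *ix, const unsigned char *a, size_t n)
{
  memset (ix->count, 0, sizeof ix->count);
  for (size_t i = n; i-- > 0; )   /* backwards, so first[] ends up
                                     holding the lowest index */
    {
      ix->first[a[i]] = i;
      ix->count[a[i]]++;
    }
}

/* Pointer to the first occurrence of KEY in A, or NULL if absent.  */
static const unsigned char *
byte_index_find (const struct byte_index *ix, const unsigned char *a,
                 unsigned char key)
{
  return ix->count[key] ? a + ix->first[key] : NULL;
}

/* Convenience wrapper: build the index and query it once.  */
static const unsigned char *
byte_find_once (const unsigned char *a, size_t n, unsigned char key)
{
  struct byte_index ix;
  byte_index_build (&ix, a, n);
  return byte_index_find (&ix, a, key);
}
```

Note the array does not even have to be sorted for this to work, which is
part of the argument that bsearch over bytes is the wrong tool.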
On Thu, Mar 16, 2017 at 11:32 PM, Sergey Senozhatsky
<sergey.senozhatsky.work@gmail.com> wrote:
> so I modified the test I posted earlier. moved array to .bss, made array
> size reasonable (so there is no overflow). each ./a.out executes bsearch()
> 10000000 times.

Thanks for this. It might also be useful to show the codegen difference,
which (on x86-64) is

-bsearch_stock:
+bsearch_patched:
 .LFB10:
 	.cfi_startproc
 	pushq	%r15
@@ -43,9 +43,11 @@
 	cmpq	%r13, %r14
 	jnb	.L6
 .L9:
-	leaq	(%r14,%r13), %rbx
+	movq	%r13, %rbx
 	movq	(%rsp), %rdi
+	subq	%r14, %rbx
 	shrq	%rbx
+	addq	%r14, %rbx
 	movq	%rbx, %r15
 	imulq	%r12, %r15
 	addq	8(%rsp), %r15

This probably hurts by extending the critical dependency chain, as well
as by issuing more instructions. I would like to know what the codegen
difference looks like on more RISCy architectures (maybe MIPS and
AArch64?) -- I have to wonder whether a CPU that didn't have complex LEA
would wind up generating essentially the same code either way.

zw
On Tue, Mar 28, 2017 at 6:19 AM, Ondřej Bílka <neleai@seznam.cz> wrote:
> On Thu, Mar 16, 2017 at 04:54:40PM -0400, Mike Frysinger wrote:
>> if you see a comment by Ulrich that looks to be unrelated/off, then you
>> can probably safely assume it is. if the bug is open still, then we
>> usually consider it *now* to be something worth fixing.
>
> No, this is wrong fix as I probably write before. It isn't visible bug
> unless programmer is idiot who tries to sort gigabyte long array of
> single bytes and use binary search on it. Just keeping 256 byte table
> of positions would be more effective than sorting.

There's another possibility: a program might be tricked into calling
bsearch with the wrong arguments for the array it actually wants to
search. If I understand the problem correctly, under the right conditions
bsearch could return a pointer outside the bounds of the true (quite
small) array. Because of that, I think this qualifies as a must-fix bug
and as a potential security vulnerability.

> For any bigger size there isn't overflow as difference between that
> pointers is bounded.

It's unfortunate that there is no good way of making that apparent to the
compiler. The relationship among base, nmemb, size, and the address space
as a whole is just too complicated. Maybe if we were actually doing
pointer arithmetic on a nonce type - but that has its own problems.

> This fix takes extra instruction which would harm performance for
> basically everybody for nonexistent problem.

As a meta concern: it is possible to disagree with people without being
this hostile. You could have said "I don't think this is worth doing,
because it makes all uses of bsearch slower to handle a corner case that
will essentially never occur in practice" - same message, avoids
elevating your personal opinion to the status of a fact, and avoids
blowing off someone who cared enough to write a patch. Please strive to
do better in the future.

zw
On (03/28/17 11:30), Zack Weinberg wrote:
> On Thu, Mar 16, 2017 at 11:32 PM, Sergey Senozhatsky
> <sergey.senozhatsky.work@gmail.com> wrote:
> > so I modified the test I posted earlier. moved array to .bss, made array
> > size reasonable (so there is no overflow). each ./a.out executes bsearch()
> > 10000000 times.
>
> Thanks for this. It might also be useful to show the codegen
> difference, which (on x86-64) is
>
> -bsearch_stock:
> +bsearch_patched:
>  .LFB10:
>  	.cfi_startproc
>  	pushq	%r15
> @@ -43,9 +43,11 @@
>  	cmpq	%r13, %r14
>  	jnb	.L6
>  .L9:
> -	leaq	(%r14,%r13), %rbx
> +	movq	%r13, %rbx
>  	movq	(%rsp), %rdi
> +	subq	%r14, %rbx
>  	shrq	%rbx
> +	addq	%r14, %rbx
>  	movq	%rbx, %r15
>  	imulq	%r12, %r15
>  	addq	8(%rsp), %r15
>
> This probably hurts by extending the critical dependency chain, as
> well as by issuing more instructions. I would like to know what the
> codegen difference looks like on more RISCy architectures (maybe MIPS
> and AArch64?) -- I have to wonder whether a CPU that didn't have
> complex LEA would wind up generating essentially the same code either
> way.

sorry for the delay.
ARM, arm-none-eabi-gcc (Arch Repository) 6.3.0, -O2

UNPATCHED                                                       PATCHED

8368: e92d4ff8 push {r3, r4, r5, r6, r7, r8, r9, sl, fp, lr}    83e4: e92d4ff8 push {r3, r4, r5, r6, r7, r8, r9, sl, fp, lr}
836c: e2526000 subs r6, r2, #0                                  83e8: e2526000 subs r6, r2, #0
8370: e59da028 ldr sl, [sp, #40] ; 0x28                         83ec: e59da028 ldr sl, [sp, #40] ; 0x28
8374: 11a07000 movne r7, r0                                     83f0: 11a07000 movne r7, r0
8378: 11a08001 movne r8, r1                                     83f4: 11a08001 movne r8, r1
837c: 11a09003 movne r9, r3                                     83f8: 11a09003 movne r9, r3
8380: 13a05000 movne r5, #0                                     83fc: 13a05000 movne r5, #0
8384: 1a000004 bne 839c <____bsearch+0x34>                      8400: 1a000004 bne 8418 <__bsearch+0x34>
8388: ea00000f b 83cc <____bsearch+0x64>                        8404: ea00000f b 8448 <__bsearch+0x64>
838c: 0a000011 beq 83d8 <____bsearch+0x70>                      8408: 0a000011 beq 8454 <__bsearch+0x70>
8390: e2845001 add r5, r4, #1                                   840c: e2845001 add r5, r4, #1
8394: e1550006 cmp r5, r6                                       8410: e1550006 cmp r5, r6
8398: 2a00000b bcs 83cc <____bsearch+0x64>                      8414: 2a00000b bcs 8448 <__bsearch+0x64>
839c: e0854006 add r4, r5, r6                                   8418: e0464005 sub r4, r6, r5
83a0: e1a040a4 lsr r4, r4, #1                                   841c: e08540a4 add r4, r5, r4, lsr #1
83a4: e02b8499 mla fp, r9, r4, r8                               8420: e02b8499 mla fp, r9, r4, r8
83a8: e1a00007 mov r0, r7                                       8424: e1a00007 mov r0, r7
83ac: e1a0100b mov r1, fp                                       8428: e1a0100b mov r1, fp
83b0: e1a0e00f mov lr, pc                                       842c: e1a0e00f mov lr, pc
83b4: e12fff1a bx sl                                            8430: e12fff1a bx sl
83b8: e3500000 cmp r0, #0                                       8434: e3500000 cmp r0, #0
83bc: aafffff2 bge 838c <____bsearch+0x24>                      8438: aafffff2 bge 8408 <__bsearch+0x24>
83c0: e1a06004 mov r6, r4                                       843c: e1a06004 mov r6, r4
83c4: e1550006 cmp r5, r6                                       8440: e1550006 cmp r5, r6
83c8: 3afffff3 bcc 839c <____bsearch+0x34>                      8444: 3afffff3 bcc 8418 <__bsearch+0x34>
83cc: e3a00000 mov r0, #0                                       8448: e3a00000 mov r0, #0
83d0: e8bd4ff8 pop {r3, r4, r5, r6, r7, r8, r9, sl, fp, lr}     844c: e8bd4ff8 pop {r3, r4, r5, r6, r7, r8, r9, sl, fp, lr}
83d4: e12fff1e bx lr                                            8450: e12fff1e bx lr
83d8: e1a0000b mov r0, fp                                       8454: e1a0000b mov r0, fp
83dc: e8bd4ff8 pop {r3, r4, r5, r6, r7, r8, r9, sl, fp, lr}     8458: e8bd4ff8 pop {r3, r4, r5, r6, r7, r8, r9, sl, fp, lr}
83e0: e12fff1e bx lr                                            845c: e12fff1e bx lr

-ss
On Wed, Mar 29, 2017 at 7:02 PM, Sergey Senozhatsky
<sergey.senozhatsky.work@gmail.com> wrote:
> On (03/28/17 11:30), Zack Weinberg wrote:
>> This probably hurts by extending the critical dependency chain, as
>> well as by issuing more instructions. I would like to know what the
>> codegen difference looks like on more RISCy architectures (maybe MIPS
>> and AArch64?) -- I have to wonder whether a CPU that didn't have
>> complex LEA would wind up generating essentially the same code either
>> way.
>
> sorry for the delay.
>
> ARM, arm-none-eabi-gcc (Arch Repository) 6.3.0, -O2
>
> [full two-column listing trimmed; the only rows that differ:]
>
> 839c: e0854006 add r4, r5, r6        8418: e0464005 sub r4, r6, r5
> 83a0: e1a040a4 lsr r4, r4, #1        841c: e08540a4 add r4, r5, r4, lsr #1

This is just the magic here. ARM is special in that they have shifts with
their adds. AARCH64 is similar but on some microarch, doing the shift
separate from the add is better. So this shows that it is worse on those
targets for latency reasons :).

Thanks,
Andrew
On (03/29/17 21:06), Andrew Pinski wrote:
> > 839c: e0854006 add r4, r5, r6        8418: e0464005 sub r4, r6, r5
> > 83a0: e1a040a4 lsr r4, r4, #1        841c: e08540a4 add r4, r5, r4, lsr #1
>
> This is just the magic here. ARM is special in that they have shifts
> with their adds. AARCH64 is similar but on some microarch, doing the
> shift separate from the add is better. So this shows that it is worse
> on those targets for latency reasons :).

BTW, FreeBSD's kernel bsearch implementation is quite interesting and
unique. take a look:

https://github.com/freebsd/freebsd/blob/master/sys/libkern/bsearch.c

x86_64, gcc7, -O2:

  400630: 41 57                 push   %r15
  400632: 41 56                 push   %r14
  400634: 41 55                 push   %r13
  400636: 41 54                 push   %r12
  400638: 55                    push   %rbp
  400639: 53                    push   %rbx
  40063a: 48 83 ec 18           sub    $0x18,%rsp
  40063e: 48 85 d2              test   %rdx,%rdx
  400641: 48 89 7c 24 08        mov    %rdi,0x8(%rsp)
  400646: 74 53                 je     40069b <_bsearch+0x6b>
  400648: 49 89 f4              mov    %rsi,%r12
  40064b: 48 89 d3              mov    %rdx,%rbx
  40064e: 48 89 cd              mov    %rcx,%rbp
  400651: 4d 89 c5              mov    %r8,%r13
  400654: eb 1a                 jmp    400670 <_bsearch+0x40>
  400656: 66 2e 0f 1f 84 00 00  nopw   %cs:0x0(%rax,%rax,1)
  40065d: 00 00 00
  400660: 48 83 eb 01           sub    $0x1,%rbx
  400664: 4d 8d 24 2e           lea    (%r14,%rbp,1),%r12
  400668: 48 d1 eb              shr    %rbx
  40066b: 48 85 db              test   %rbx,%rbx
  40066e: 74 2b                 je     40069b <_bsearch+0x6b>
  400670: 49 89 df              mov    %rbx,%r15
  400673: 48 8b 7c 24 08        mov    0x8(%rsp),%rdi
  400678: 49 d1 ef              shr    %r15
  40067b: 4c 89 fa              mov    %r15,%rdx
  40067e: 48 0f af d5           imul   %rbp,%rdx
  400682: 4d 8d 34 14           lea    (%r12,%rdx,1),%r14
  400686: 4c 89 f6              mov    %r14,%rsi
  400689: 41 ff d5              callq  *%r13
  40068c: 83 f8 00              cmp    $0x0,%eax
  40068f: 74 0d                 je     40069e <_bsearch+0x6e>
  400691: 7f cd                 jg     400660 <_bsearch+0x30>
  400693: 4c 89 fb              mov    %r15,%rbx
  400696: 48 85 db              test   %rbx,%rbx
  400699: 75 d5                 jne    400670 <_bsearch+0x40>
  40069b: 45 31 f6              xor    %r14d,%r14d
  40069e: 48 83 c4 18           add    $0x18,%rsp
  4006a2: 4c 89 f0              mov    %r14,%rax
  4006a5: 5b                    pop    %rbx
  4006a6: 5d                    pop    %rbp
  4006a7: 41 5c                 pop    %r12
  4006a9: 41 5d                 pop    %r13
  4006ab: 41 5e                 pop    %r14
  4006ad: 41 5f                 pop    %r15
  4006af: c3                    retq

this should not overflow, as far as I can see.
perf stat (-O2, x86_64) - and it *seems* to be faster than the existing
glibc bsearch() //* I'm not stating this though *//. four runs per array
type; context-switches and cpu-migrations were 0 and page-faults ~50 in
every run:

== CHAR array ==

         task-clock (ms)  cycles         stalled-front  stalled-back  instructions   branches     br-misses   elapsed (s)
  run 1  421.053507       1,324,686,237   91,991,733    57,399,708    3,596,376,160  732,521,201   7,764,387  0.421633822
  run 2  420.720555       1,317,505,328   86,237,745    58,424,019    3,591,713,950  733,832,646   7,335,495  0.421209527
  run 3  420.446508       1,309,007,427   85,067,714    63,478,544    3,606,339,245  730,795,217   6,900,426  0.421232740
  run 4  422.715593       1,326,499,959  100,747,231    67,604,908    3,578,538,258  731,849,769   8,105,098  0.423636686

== INT array ==

         task-clock (ms)  cycles         stalled-front  stalled-back  instructions   branches     br-misses   elapsed (s)
  run 1  367.111305       1,142,267,432  186,370,432    18,145,529    3,608,624,538  729,956,547       6,567  0.367872314
  run 2  367.367983       1,146,948,361  189,862,982    16,090,029    3,610,943,628  729,705,480     239,259  0.368188108
  run 3  435.185387       1,367,508,177  317,962,731    89,442,496    3,614,939,853  731,171,118  11,699,371  0.435828955
  run 4  426.947345       1,335,309,841  299,538,314    78,271,313    3,619,103,716  734,182,321  10,304,408  0.427541178

-ss
On Thu, Mar 30, 2017 at 12:59 AM, Sergey Senozhatsky
<sergey.senozhatsky.work@gmail.com> wrote:
>
> BTW, FreeBSD's kernel bsearch implementation is quite interesting
> and unique. take a look:
>
> https://github.com/freebsd/freebsd/blob/master/sys/libkern/bsearch.c

Oh, that's clever.  I _thought_ there had to be a way to avoid the
addition.  This should indeed be immune to the overflow problem,
although we still need to have a test for it.

libstdc++-v3 appears to use the same trick: see
https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=libstdc%2B%2B-v3/include/bits/stl_algo.h;h=2cd5303a100b998f80369c0baf2cd39aedf14b52;hb=af44a97c14af3a0079590662d49a8ced98954376#l2037
(sorry for length of URL).  So we can crib from that and not have to
worry about copyrights.

As long as we're microoptimizing, I found an article
<https://realm.io/news/how-we-beat-cpp-stl-binary-search/> which, in
essence, says that because the branches after the comparison are going
to be unpredictable, you really want to be using conditional moves
instead.  It might be worth trying a little to convince GCC and LLVM
to do that.  (It's too bad that there's no __builtin_unpredictable()
to go with __builtin_expect().)

zw
On (03/30/17 09:18), Zack Weinberg wrote:
> On Thu, Mar 30, 2017 at 12:59 AM, Sergey Senozhatsky
> <sergey.senozhatsky.work@gmail.com> wrote:
> >
> > BTW, FreeBSD's kernel bsearch implementation is quite interesting
> > and unique. take a look:
> >
> > https://github.com/freebsd/freebsd/blob/master/sys/libkern/bsearch.c
>
> Oh, that's clever. I _thought_ there had to be a way to avoid the
> addition. This should indeed be immune to the overflow problem,
> although we still need to have a test for it.
>
> libstdc++-v3 appears to use the same trick: see
> https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=libstdc%2B%2B-v3/include/bits/stl_algo.h;h=2cd5303a100b998f80369c0baf2cd39aedf14b52;hb=af44a97c14af3a0079590662d49a8ced98954376#l2037
> (sorry for length of URL). So we can crib from that and not have to
> worry about copyrights.

Interesting. So a simple-minded direct re-implementation of the
libstdc++ approach may look like this:

void *
__bsearch (const void *__key, const void *__base, size_t __nmemb,
	   size_t __size, __compar_fn_t __compar)
{
  const char *base = __base, *pivot;
  size_t len = __nmemb;
  size_t half;
  int comparison;

  while (len > 0)
    {
      half = len >> 1;
      pivot = base + half * __size;
      comparison = (*__compar) (__key, pivot);
      if (comparison == 0)
	return (void *) pivot;
      if (comparison < 0)
	{
	  len = half;
	}
      else
	{
	  base = pivot + __size;
	  len = len - half - 1;
	}
    }
  return NULL;
}

It generates slightly worse code than the BSD version, due to this part:

	len = len - half - 1;

which compiles to:

	sub    $0x1,%rbx
	lea    (%r15,%rbp,1),%r12
	sub    %r14,%rbx

objdump:

  400730:	41 57                	push   %r15
  400732:	41 56                	push   %r14
  400734:	41 55                	push   %r13
  400736:	41 54                	push   %r12
  400738:	55                   	push   %rbp
  400739:	53                   	push   %rbx
  40073a:	48 83 ec 18          	sub    $0x18,%rsp
  40073e:	48 85 d2             	test   %rdx,%rdx
  400741:	48 89 7c 24 08       	mov    %rdi,0x8(%rsp)
  400746:	74 51                	je     400799 <_bsearch+0x69>
  400748:	49 89 f4             	mov    %rsi,%r12
  40074b:	48 89 d3             	mov    %rdx,%rbx
  40074e:	48 89 cd             	mov    %rcx,%rbp
  400751:	4d 89 c5             	mov    %r8,%r13
  400754:	eb 1a                	jmp    400770 <_bsearch+0x40>
  400756:	66 2e 0f 1f 84 00 00 	nopw   %cs:0x0(%rax,%rax,1)
  40075d:	00 00 00
  400760:	48 83 eb 01          	sub    $0x1,%rbx
  400764:	4d 8d 24 2f          	lea    (%r15,%rbp,1),%r12
  400768:	4c 29 f3             	sub    %r14,%rbx
  40076b:	48 85 db             	test   %rbx,%rbx
  40076e:	74 29                	je     400799 <_bsearch+0x69>
  400770:	49 89 de             	mov    %rbx,%r14
  400773:	48 8b 7c 24 08       	mov    0x8(%rsp),%rdi
  400778:	49 d1 ee             	shr    %r14
  40077b:	4d 89 f7             	mov    %r14,%r15
  40077e:	4c 0f af fd          	imul   %rbp,%r15
  400782:	4d 01 e7             	add    %r12,%r15
  400785:	4c 89 fe             	mov    %r15,%rsi
  400788:	41 ff d5             	callq  *%r13
  40078b:	85 c0                	test   %eax,%eax
  40078d:	74 0d                	je     40079c <_bsearch+0x6c>
  40078f:	79 cf                	jns    400760 <_bsearch+0x30>
  400791:	4c 89 f3             	mov    %r14,%rbx
  400794:	48 85 db             	test   %rbx,%rbx
  400797:	75 d7                	jne    400770 <_bsearch+0x40>
  400799:	45 31 ff             	xor    %r15d,%r15d
  40079c:	48 83 c4 18          	add    $0x18,%rsp
  4007a0:	4c 89 f8             	mov    %r15,%rax
  4007a3:	5b                   	pop    %rbx
  4007a4:	5d                   	pop    %rbp
  4007a5:	41 5c                	pop    %r12
  4007a7:	41 5d                	pop    %r13
  4007a9:	41 5e                	pop    %r14
  4007ab:	41 5f                	pop    %r15
  4007ad:	c3                   	retq
  4007ae:	66 90                	xchg   %ax,%ax

while something like this (basically, the BSD version):

void *
bsearch (const void *__key, const void *__base, size_t __nmemb,
	 size_t __size, __compar_fn_t __compar)
{
  const char *base = __base, *pivot;
  size_t len = __nmemb;
  int comparison;

  while (len > 0)
    {
      pivot = base + (len >> 1) * __size;
      comparison = (*__compar) (__key, pivot);
      if (comparison == 0)
	return (void *) pivot;
      if (comparison > 0)
	{
	  base = pivot + __size;
	  len--;
	}
      len >>= 1;
    }
  return NULL;
}

uses just:

	shr    %rbx

objdump:

  4006b0:	41 57                	push   %r15
  4006b2:	41 56                	push   %r14
  4006b4:	41 55                	push   %r13
  4006b6:	41 54                	push   %r12
  4006b8:	55                   	push   %rbp
  4006b9:	53                   	push   %rbx
  4006ba:	48 83 ec 18          	sub    $0x18,%rsp
  4006be:	48 85 d2             	test   %rdx,%rdx
  4006c1:	48 89 7c 24 08       	mov    %rdi,0x8(%rsp)
  4006c6:	74 53                	je     40071b <__bsearch+0x6b>
  4006c8:	49 89 f4             	mov    %rsi,%r12
  4006cb:	48 89 d3             	mov    %rdx,%rbx
  4006ce:	48 89 cd             	mov    %rcx,%rbp
  4006d1:	4d 89 c5             	mov    %r8,%r13
  4006d4:	eb 1a                	jmp    4006f0 <__bsearch+0x40>
  4006d6:	66 2e 0f 1f 84 00 00 	nopw   %cs:0x0(%rax,%rax,1)
  4006dd:	00 00 00
  4006e0:	48 83 eb 01          	sub    $0x1,%rbx
  4006e4:	4d 8d 24 2e          	lea    (%r14,%rbp,1),%r12
  4006e8:	48 d1 eb             	shr    %rbx
  4006eb:	48 85 db             	test   %rbx,%rbx
  4006ee:	74 2b                	je     40071b <__bsearch+0x6b>
  4006f0:	49 89 df             	mov    %rbx,%r15
  4006f3:	48 8b 7c 24 08       	mov    0x8(%rsp),%rdi
  4006f8:	49 d1 ef             	shr    %r15
  4006fb:	4c 89 fa             	mov    %r15,%rdx
  4006fe:	48 0f af d5          	imul   %rbp,%rdx
  400702:	4d 8d 34 14          	lea    (%r12,%rdx,1),%r14
  400706:	4c 89 f6             	mov    %r14,%rsi
  400709:	41 ff d5             	callq  *%r13
  40070c:	83 f8 00             	cmp    $0x0,%eax
  40070f:	74 0d                	je     40071e <__bsearch+0x6e>
  400711:	7f cd                	jg     4006e0 <__bsearch+0x30>
  400713:	4c 89 fb             	mov    %r15,%rbx
  400716:	48 85 db             	test   %rbx,%rbx
  400719:	75 d5                	jne    4006f0 <__bsearch+0x40>
  40071b:	45 31 f6             	xor    %r14d,%r14d
  40071e:	48 83 c4 18          	add    $0x18,%rsp
  400722:	4c 89 f0             	mov    %r14,%rax
  400725:	5b                   	pop    %rbx
  400726:	5d                   	pop    %rbp
  400727:	41 5c                	pop    %r12
  400729:	41 5d                	pop    %r13
  40072b:	41 5e                	pop    %r14
  40072d:	41 5f                	pop    %r15
  40072f:	c3                   	retq

In practice, this seems to matter a lot (gcc7, x86_64, -O2).
Four runs each, LIBSTDC++ vs. BSD:

run 1                       LIBSTDC++            BSD
  task-clock (msec)         529.563747           431.084495
  context-switches          0                    0
  cpu-migrations            0                    0
  page-faults               48                   49
  cycles                    1,660,238,680        1,359,410,967
  stalled-cycles-frontend   18,865,782 (1.14%)   112,645,472 (8.29%)
  stalled-cycles-backend    28,578,724 (1.72%)   64,645,660 (4.76%)
  instructions              3,809,381,770        3,602,947,356
  insn per cycle            2.29                 2.65
  branches                  949,275,159          730,987,355
  branch-misses             4,026 (0.00%)        10,441,749 (1.43%)
  elapsed (sec)             0.530168158          0.431656059

run 2                       LIBSTDC++            BSD
  task-clock (msec)         531.296478           436.154599
  context-switches          0                    0
  cpu-migrations            0                    0
  page-faults               50                   50
  cycles                    1,659,440,099        1,348,636,891
  stalled-cycles-frontend   19,236,667 (1.16%)   119,038,311 (8.83%)
  stalled-cycles-backend    28,226,341 (1.70%)   82,372,803 (6.11%)
  instructions              3,810,692,002        3,593,517,874
  insn per cycle            2.30                 2.66
  branches                  950,132,896          732,065,055
  branch-misses             4,383 (0.00%)        9,774,701 (1.34%)
  elapsed (sec)             0.531825605          0.436922711

run 3                       LIBSTDC++            BSD
  task-clock (msec)         529.942094           435.002430
  context-switches          0                    0
  cpu-migrations            0                    0
  page-faults               48                   49
  cycles                    1,661,178,675        1,351,752,937
  stalled-cycles-frontend   21,119,773 (1.27%)   115,705,283 (8.56%)
  stalled-cycles-backend    28,723,902 (1.73%)   73,129,341 (5.41%)
  instructions              3,806,932,705        3,607,666,833
  insn per cycle            2.29                 2.67
  branches                  949,754,552          728,361,887
  branch-misses             4,278 (0.00%)        10,571,686 (1.45%)
  elapsed (sec)             0.530759498          0.435504277

run 4                       LIBSTDC++            BSD
  task-clock (msec)         535.067404           433.108622
  context-switches          0                    0
  cpu-migrations            0                    0
  page-faults               49                   48
  cycles                    1,662,263,437        1,357,994,817
  stalled-cycles-frontend   23,904,662 (1.44%)   117,585,762 (8.66%)
  stalled-cycles-backend    30,990,320 (1.86%)   71,549,183 (5.27%)
  instructions              3,788,168,547        3,608,615,459
  insn per cycle            2.28                 2.66
  branches                  949,809,536          730,772,715
  branch-misses             5,849 (0.00%)        10,279,444 (1.41%)
  elapsed (sec)             0.535721929          0.433903691

For comparison, unpatched glibc bsearch():

Performance counter stats for './a.out':
      447.937515      task-clock:u (msec)        #    0.999 CPUs utilized
               0      context-switches:u         #    0.000 K/sec
               0      cpu-migrations:u           #    0.000 K/sec
              48      page-faults:u              #    0.107 K/sec
   1,409,322,732      cycles:u                   #    3.146 GHz                     (83.26%)
     137,696,335      stalled-cycles-frontend:u  #    9.77% frontend cycles idle    (83.26%)
     107,106,201      stalled-cycles-backend:u   #    7.60% backend cycles idle     (66.51%)
   3,390,292,254      instructions:u             #    2.41 insn per cycle
                                                 #    0.04 stalled cycles per insn  (83.26%)
     968,222,443      branches:u                 # 2161.512 M/sec                   (83.86%)
      13,048,699      branch-misses:u            #    1.35% of all branches         (83.44%)

     0.448465765 seconds time elapsed

Performance counter stats for './a.out':

      451.069138      task-clock:u (msec)        #    0.999 CPUs utilized
               0      context-switches:u         #    0.000 K/sec
               0      cpu-migrations:u           #    0.000 K/sec
              49      page-faults:u              #    0.109 K/sec
   1,406,560,677      cycles:u                   #    3.118 GHz                     (83.38%)
     132,787,307      stalled-cycles-frontend:u  #    9.44% frontend cycles idle    (83.37%)
     106,997,916      stalled-cycles-backend:u   #    7.61% backend cycles idle     (66.75%)
   3,395,472,906      instructions:u             #    2.41 insn per cycle
                                                 #    0.04 stalled cycles per insn  (83.38%)
     966,504,113      branches:u                 # 2142.696 M/sec                   (83.37%)
      11,790,044      branch-misses:u            #    1.22% of all branches         (83.39%)

     0.451607453 seconds time elapsed

Performance counter stats for './a.out':

      449.328189      task-clock:u (msec)        #    0.999 CPUs utilized
               0      context-switches:u         #    0.000 K/sec
               0      cpu-migrations:u           #    0.000 K/sec
              49      page-faults:u              #    0.109 K/sec
   1,408,416,612      cycles:u                   #    3.134 GHz                     (83.31%)
     129,766,221      stalled-cycles-frontend:u  #    9.21% frontend cycles idle    (83.31%)
     116,797,583      stalled-cycles-backend:u   #    8.29% backend cycles idle     (66.62%)
   3,397,205,219      instructions:u             #    2.41 insn per cycle
                                                 #    0.04 stalled cycles per insn  (83.31%)
     971,010,420      branches:u                 # 2161.027 M/sec                   (83.31%)
      12,745,400      branch-misses:u            #    1.31% of all branches         (83.81%)

     0.449839176 seconds time elapsed

Performance counter stats for './a.out':

      446.382825      task-clock:u (msec)        #    0.998 CPUs utilized
               0      context-switches:u         #    0.000 K/sec
               0      cpu-migrations:u           #    0.000 K/sec
              48      page-faults:u              #    0.108 K/sec
   1,401,235,764      cycles:u                   #    3.139 GHz                     (83.20%)
     135,184,574      stalled-cycles-frontend:u  #    9.65% frontend cycles idle    (83.20%)
     112,230,343      stalled-cycles-backend:u   #    8.01% backend cycles idle     (66.40%)
   3,371,533,704      instructions:u             #    2.41 insn per cycle
                                                 #    0.04 stalled cycles per insn  (83.20%)
     972,968,920      branches:u                 # 2179.674 M/sec                   (83.86%)
      12,100,921      branch-misses:u            #    1.24% of all branches         (83.39%)

     0.447111384 seconds time elapsed

> As long as we're microoptimizing, I found an article
> <https://realm.io/news/how-we-beat-cpp-stl-binary-search/> which, in
> essence, says that because the branches after the comparison are going
> to be unpredictable, you really want to be using conditional moves
> instead. It might be worth trying a little to convince GCC and LLVM
> to do that. (It's too bad that there's no __builtin_unpredictable()
> to go with __builtin_expect().)

thanks for the link.

	-ss
diff --git a/ChangeLog b/ChangeLog
index e0acd7d0c4..7142794922 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,8 @@
+2017-03-17  Sergey Senozhatsky  <sergey.senozhatsky@gmail.com>
+
+	[BZ #2753]
+	* bits/stdlib-bsearch.h: Fix integer overflow.
+
 2017-03-10  Stefan Liebler  <stli@linux.vnet.ibm.com>
 
 	* math/auto-libm-test-out-catan: Regenerated.
diff --git a/bits/stdlib-bsearch.h b/bits/stdlib-bsearch.h
index eb145381fd..5fd8a8b607 100644
--- a/bits/stdlib-bsearch.h
+++ b/bits/stdlib-bsearch.h
@@ -28,7 +28,7 @@ bsearch (const void *__key, const void *__base, size_t __nmemb, size_t __size,
   __u = __nmemb;
   while (__l < __u)
     {
-      __idx = (__l + __u) / 2;
+      __idx = __l + (__u - __l) / 2;
       __p = (void *) (((const char *) __base) + (__idx * __size));
       __comparison = (*__compar) (__key, __p);
       if (__comparison < 0)