stdlib-bsearch: middle element calculation may overflow

Submitted by Sergey Senozhatsky on March 16, 2017, 3:49 p.m.

Details

Message ID 20170316154908.GA575@tigerII.localdomain
State New

Commit Message

Sergey Senozhatsky March 16, 2017, 3:49 p.m.
Hi,

On (03/16/17 14:02), Joseph Myers wrote:
> If this fixes a user-visible bug then the ChangeLog entry needs to include 
> [BZ #N] referencing the bug filed in Bugzilla (and once the fix is in the 
> bug needs to be resolved as FIXED with appropriate target milestone set).  
> Is this bug 2753?  If not, a new bug would need to be filed for it.

a) um... I assume glibc Bugzilla is located at https://sourceware.org/bugzilla/
and 2753, thus, is https://sourceware.org/bugzilla/show_bug.cgi?id=2753
if so, then, yes looks like I'm not the first one to point that out.
I'm not sure I see Ulrich's "You do not even understand how binary
searching works, do you?  The sum can never exceed nmemb and nmemb
obviously fits into an size_t" point. it's a bug.


b) I guess I got the ChangeLog file format mostly right. well, not
entirely sure (does it have to be so complicated?  :) )


c) I don't think I see what the "target milestone" is even supposed
to mean. Sorry! glibc version? min glibc version that requires a
backport of this fix (if there are -stable/LTS glibc releases)? etc.
etc. etc.


---8<---8<----

From f4cbb4449cc8605ea5b223f2537b82224c8685e9 Mon Sep 17 00:00:00 2001
From: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Date: Fri, 17 Mar 2017 00:31:44 +0900
Subject: [PATCH] stdlib-bsearch: middle element calculation may overflow

The middle element calculation may overflow at '__l + __u' when
__l and __u are large enough. Compute the middle element from the
distance between __u and __l instead.

	[BZ #2753]
	* bits/stdlib-bsearch.h: Fix integer overflow.

Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
---
 ChangeLog             | 5 +++++
 bits/stdlib-bsearch.h | 2 +-
 2 files changed, 6 insertions(+), 1 deletion(-)

Comments

Joseph S. Myers March 16, 2017, 4:10 p.m.
On Fri, 17 Mar 2017, Sergey Senozhatsky wrote:

> c) I don't think I see what the "target milestone" is even supposed
> to mean. Sorry! glibc version? min glibc version that requires a
> backport of this fix (if there are -stable/LTS glibc releases)? etc.
> etc. etc.

See: <https://sourceware.org/glibc/wiki/Bugzilla%20Procedures>.  The 
person who commits a fix for a bug needs to set the target milestone of 
the bug along with marking it fixed (so if 2.26 is the first release with 
the fix, it needs to be set to 2.26).
Joseph S. Myers March 16, 2017, 8:33 p.m.
On Fri, 17 Mar 2017, Sergey Senozhatsky wrote:

> I'm not sure I see Ulrich's "You do not even understand how binary
> searching works, do you?  The sum can never exceed nmemb and nmemb
> obviously fits into an size_t" point. it's a bug.

As I understand it: it's a bug for an array the number of whose elements 
is more than SIZE_MAX / 2.

Now, there is certainly no need for the comparison function actually to 
look at the contents of the elements; it's completely valid for it to do 
something magic that depends on the position of the element in the array 
rather than (just) its contents.  And 1-byte elements are certainly valid.  
But the pointers do need to be valid - although when you enter the POSIX 
context there is a strong argument that as long as the addresses are all 
within the process's address space, it's OK for them to be mapped with 
PROT_NONE permissions (as long as the comparison function doesn't actually 
access the referenced objects).

Allocating objects occupying half or more of the address space, even with 
PROT_NONE permissions, is questionable because pointer subtraction cannot 
work reliably in them (so there is an argument for malloc, mmap etc. to 
disallow such allocations).  But it's believed, from past discussions, 
that some users of 32-bit systems (possibly 32-bit processes under a 
64-bit kernel) do in fact rely on being able to make a single allocation 
of 2 GB or more, and our existing practice for string functions is to 
consider it a bug if a string function doesn't work correctly for >= 2 GB 
strings on 32-bit systems.  So on the same basis, this bsearch issue 
should be considered a bug.

If the fix reduces performance, there's a case for making it conditional:

if (__builtin_constant_p (__size) && __size >= 2)
  /* old code */
else
  /* new code */

given that the issue can only apply with elements of size 1, and the 
common case probably *is* that the size is constant (__builtin_constant_p 
works in inline functions) and at least 2, so that would avoid affecting 
the code generated in the common case at all.
Mike Frysinger March 16, 2017, 8:54 p.m.
On 17 Mar 2017 00:49, Sergey Senozhatsky wrote:
> On (03/16/17 14:02), Joseph Myers wrote:
> > If this fixes a user-visible bug then the ChangeLog entry needs to include 
> > [BZ #N] referencing the bug filed in Bugzilla (and once the fix is in the 
> > bug needs to be resolved as FIXED with appropriate target milestone set).  
> > Is this bug 2753?  If not, a new bug would need to be filed for it.
> 
> a) um... I assume glibc Bugzilla is located at https://sourceware.org/bugzilla/
> and 2753, thus, is https://sourceware.org/bugzilla/show_bug.cgi?id=2753
> if so, then, yes looks like I'm not the first one to point that out.
> I'm not sure I see Ulrich's "You do not even understand how binary
> searching works, do you?  The sum can never exceed nmemb and nmemb
> obviously fits into an size_t" point. it's a bug.

if you see a comment by Ulrich that looks to be unrelated/off, then you
can probably safely assume it is.  if the bug is open still, then we
usually consider it *now* to be something worth fixing.

sorry for the hassle.
-mike
Sergey Senozhatsky March 17, 2017, 3:32 a.m.
On (03/16/17 20:33), Joseph Myers wrote:
> On Fri, 17 Mar 2017, Sergey Senozhatsky wrote:
> 
> > I'm not sure I see Ulrich's "You do not even understand how binary
> > searching works, do you?  The sum can never exceed nmemb and nmemb
> > obviously fits into an size_t" point. it's a bug.

so it was 1am yesterday when I took a look at that BZ and I didn't
read it carefully (guilty). I kinda see what they meant there.


[..]
> given that the issue can only apply with elements of size 1, and the 
> common case probably *is* that the size is constant (__builtin_constant_p 
> works in inline functions) and at least 2, so that would avoid affecting 
> the code generated in the common case at all.

the builtin __builtin_constant_p(size) suggestion sounds reasonable,
since that new middle element calculation does come at a price.


---
so I modified the test I posted earlier. moved array to .bss, made array
size reasonable (so there is no overflow). each ./a.out executes bsearch()
10000000 times.

# perf stat ./old_bsearch.out >> OLD 2>&1
# perf stat ./new_bsearch.out >> NEW 2>&1
# paste OLD NEW


the results are ( you guys use big monitors, right? :) )

			UNPATCHED											PATCHED
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

*== CHAR array ==*

 Performance counter stats for './a.out':	 						Performance counter stats for './a.out':
	
        490.700599      task-clock:u (msec)       #    0.999 CPUs utilized          	        	571.728758      task-clock:u (msec)       #    0.999 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  		                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  		                 0      cpu-migrations:u          #    0.000 K/sec                  
                53      page-faults:u             #    0.108 K/sec                  		                53      page-faults:u             #    0.093 K/sec                  
     1,539,234,373      cycles:u                  #    3.137 GHz                      (83.41%)	     1,792,026,989      cycles:u                  #    3.134 GHz                      (83.22%)
       127,938,341      stalled-cycles-frontend:u #    8.31% frontend cycles idle     (83.49%)	       131,679,816      stalled-cycles-frontend:u #    7.35% frontend cycles idle     (83.21%)
       109,189,954      stalled-cycles-backend:u  #    7.09% backend cycles idle      (66.99%)	        95,841,146      stalled-cycles-backend:u  #    5.35% backend cycles idle      (66.79%)
     3,806,104,997      instructions:u            #    2.47  insn per cycle         	     	     4,331,940,534      instructions:u            #    2.42  insn per cycle         
                                                  #    0.03  stalled cycles per insn  (83.49%)	                                                  #    0.03  stalled cycles per insn  (83.66%)
     1,091,706,962      branches:u                # 2224.792 M/sec                    (83.49%)	     1,090,066,259      branches:u                # 1906.614 M/sec                    (83.74%)
        11,753,417      branch-misses:u           #    1.08% of all branches          (83.02%)	        10,862,016      branch-misses:u           #    1.00% of all branches          (83.36%)
	
       0.491320512 seconds time elapsed	       							0.572367076 seconds time elapsed
	
	

 Performance counter stats for './a.out':	 						Performance counter stats for './a.out':
	
        491.312169      task-clock:u (msec)       #    0.999 CPUs utilized          		        576.186929      task-clock:u (msec)       #    0.999 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  		                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  		                 0      cpu-migrations:u          #    0.000 K/sec                  
                55      page-faults:u             #    0.112 K/sec                  		                54      page-faults:u             #    0.094 K/sec                  
     1,545,979,259      cycles:u                  #    3.147 GHz                      (82.91%)	     1,795,566,378      cycles:u                  #    3.116 GHz                      (83.34%)
       128,899,741      stalled-cycles-frontend:u #    8.34% frontend cycles idle     (83.47%)	       141,401,217      stalled-cycles-frontend:u #    7.88% frontend cycles idle     (83.34%)
       103,723,436      stalled-cycles-backend:u  #    6.71% backend cycles idle      (67.03%)	       101,552,870      stalled-cycles-backend:u  #    5.66% backend cycles idle      (66.68%)
     3,809,678,005      instructions:u            #    2.46  insn per cycle         	             4,350,733,034      instructions:u            #    2.42  insn per cycle         
                                                  #    0.03  stalled cycles per insn  (83.52%)	                                                  #    0.03  stalled cycles per insn  (83.34%)
     1,091,129,881      branches:u                # 2220.848 M/sec                    (83.52%)	     1,087,751,930      branches:u                # 1887.846 M/sec                    (83.34%)
        12,016,960      branch-misses:u           #    1.10% of all branches          (83.32%)	        10,725,247      branch-misses:u           #    0.99% of all branches          (83.70%)
	
       0.491822793 seconds time elapsed	       							0.576731036 seconds time elapsed
	
	

 Performance counter stats for './a.out':	 						Performance counter stats for './a.out':
	
        490.530342      task-clock:u (msec)       #    0.999 CPUs utilized          		        572.092012      task-clock:u (msec)       #    0.998 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  		                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  		                 0      cpu-migrations:u          #    0.000 K/sec                  
                54      page-faults:u             #    0.110 K/sec                  		                53      page-faults:u             #    0.093 K/sec                  
     1,536,658,844      cycles:u                  #    3.133 GHz                      (82.88%)	     1,786,237,073      cycles:u                  #    3.122 GHz                      (83.22%)
       126,761,969      stalled-cycles-frontend:u #    8.25% frontend cycles idle     (83.45%)	       133,130,965      stalled-cycles-frontend:u #    7.45% frontend cycles idle     (83.22%)
       124,498,779      stalled-cycles-backend:u  #    8.10% backend cycles idle      (66.98%)	        98,609,399      stalled-cycles-backend:u  #    5.52% backend cycles idle      (66.86%)
     3,817,891,053      instructions:u            #    2.48  insn per cycle         	             4,333,473,067      instructions:u            #    2.43  insn per cycle         
                                                  #    0.03  stalled cycles per insn  (83.49%)	                                                  #    0.03  stalled cycles per insn  (83.68%)
     1,088,399,022      branches:u                # 2218.821 M/sec                    (83.49%)	     1,089,672,636      branches:u                # 1904.716 M/sec                    (83.75%)
        11,599,517      branch-misses:u           #    1.07% of all branches          (83.44%)	        10,308,961      branch-misses:u           #    0.95% of all branches          (83.30%)
	
       0.491065497 seconds time elapsed	       							0.573038448 seconds time elapsed
	
	

 Performance counter stats for './a.out':	 						Performance counter stats for './a.out':
	
        494.313419      task-clock:u (msec)       #    0.999 CPUs utilized          		        570.833950      task-clock:u (msec)       #    0.999 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  		                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  		                 0      cpu-migrations:u          #    0.000 K/sec                  
                54      page-faults:u             #    0.109 K/sec                  		                55      page-faults:u             #    0.096 K/sec                  
     1,540,156,057      cycles:u                  #    3.116 GHz                      (83.02%)	     1,789,383,612      cycles:u                  #    3.135 GHz                      (83.19%)
       133,690,009      stalled-cycles-frontend:u #    8.68% frontend cycles idle     (83.01%)	       133,977,834      stalled-cycles-frontend:u #    7.49% frontend cycles idle     (83.17%)
       105,653,002      stalled-cycles-backend:u  #    6.86% backend cycles idle      (67.20%)	       100,380,149      stalled-cycles-backend:u  #    5.61% backend cycles idle      (66.38%)
     3,808,529,575      instructions:u            #    2.47  insn per cycle         	             4,360,155,445      instructions:u            #    2.44  insn per cycle         
                                                  #    0.04  stalled cycles per insn  (83.62%)	                                                  #    0.03  stalled cycles per insn  (83.19%)
     1,088,698,277      branches:u                # 2202.445 M/sec                    (83.62%)	     1,088,622,411      branches:u                # 1907.074 M/sec                    (83.70%)
        11,517,756      branch-misses:u           #    1.06% of all branches          (83.21%)	        10,616,959      branch-misses:u           #    0.98% of all branches          (83.58%)
	
       0.494938286 seconds time elapsed	       							0.571451079 seconds time elapsed






*== INT array ==*

 Performance counter stats for './a.out':	 						Performance counter stats for './a.out':
	
        443.792538      task-clock:u (msec)       #    0.997 CPUs utilized          		        515.688363      task-clock:u (msec)       #    0.999 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  		                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  		                 0      cpu-migrations:u          #    0.000 K/sec                  
                50      page-faults:u             #    0.113 K/sec                  		                53      page-faults:u             #    0.103 K/sec                  
     1,400,310,241      cycles:u                  #    3.155 GHz                      (83.10%)	     1,623,123,844      cycles:u                  #    3.147 GHz                      (83.13%)
       112,395,786      stalled-cycles-frontend:u #    8.03% frontend cycles idle     (83.25%)	       151,631,032      stalled-cycles-frontend:u #    9.34% frontend cycles idle     (83.13%)
        95,467,149      stalled-cycles-backend:u  #    6.82% backend cycles idle      (66.67%)	       108,350,044      stalled-cycles-backend:u  #    6.68% backend cycles idle      (66.74%)
     3,371,853,412      instructions:u            #    2.41  insn per cycle         	             3,854,946,185      instructions:u            #    2.38  insn per cycle         
                                                  #    0.03  stalled cycles per insn  (83.69%)	                                                  #    0.04  stalled cycles per insn  (83.66%)
       966,737,955      branches:u                # 2178.356 M/sec                    (83.78%)	       969,559,223      branches:u                # 1880.126 M/sec                    (83.72%)
        10,749,026      branch-misses:u           #    1.11% of all branches          (83.63%)	        11,323,207      branch-misses:u           #    1.17% of all branches          (83.49%)
	
       0.445114068 seconds time elapsed	       							0.516332787 seconds time elapsed
	

	
 Performance counter stats for './a.out':	 						Performance counter stats for './a.out':
	
        447.392463      task-clock:u (msec)       #    0.998 CPUs utilized          		        512.701616      task-clock:u (msec)       #    0.998 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  		                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  		                 0      cpu-migrations:u          #    0.000 K/sec                  
                51      page-faults:u             #    0.114 K/sec                  		                52      page-faults:u             #    0.101 K/sec                  
     1,398,350,256      cycles:u                  #    3.126 GHz                      (83.24%)	     1,606,918,412      cycles:u                  #    3.134 GHz                      (83.03%)
       121,919,785      stalled-cycles-frontend:u #    8.72% frontend cycles idle     (83.23%)	       128,876,515      stalled-cycles-frontend:u #    8.02% frontend cycles idle     (83.03%)
        94,562,361      stalled-cycles-backend:u  #    6.76% backend cycles idle      (66.50%)	        94,866,954      stalled-cycles-backend:u  #    5.90% backend cycles idle      (67.16%)
     3,399,240,172      instructions:u            #    2.43  insn per cycle         	             3,876,976,133      instructions:u            #    2.41  insn per cycle         
                                                  #    0.04  stalled cycles per insn  (83.24%)	                                                  #    0.03  stalled cycles per insn  (83.64%)
       964,909,971      branches:u                # 2156.742 M/sec                    (83.83%)	       969,090,907      branches:u                # 1890.166 M/sec                    (83.64%)
        10,618,480      branch-misses:u           #    1.10% of all branches          (83.69%)	        10,861,681      branch-misses:u           #    1.12% of all branches          (83.33%)
	
       0.448073517 seconds time elapsed	       								0.513532781 seconds time elapsed
	
	

 Performance counter stats for './a.out':	 						Performance counter stats for './a.out':
	
        447.387472      task-clock:u (msec)       #    0.999 CPUs utilized          		        517.531654      task-clock:u (msec)       #    0.999 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  		                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  		                 0      cpu-migrations:u          #    0.000 K/sec                  
                51      page-faults:u             #    0.114 K/sec                  		                53      page-faults:u             #    0.102 K/sec                  
     1,406,720,267      cycles:u                  #    3.144 GHz                      (83.24%)	     1,616,525,247      cycles:u                  #    3.124 GHz                      (83.18%)
       123,427,413      stalled-cycles-frontend:u #    8.77% frontend cycles idle     (83.24%)	       142,913,567      stalled-cycles-frontend:u #    8.84% frontend cycles idle     (83.21%)
       112,720,863      stalled-cycles-backend:u  #    8.01% backend cycles idle      (66.48%)	       114,806,324      stalled-cycles-backend:u  #    7.10% backend cycles idle      (66.82%)
     3,376,115,599      instructions:u            #    2.40  insn per cycle         	             3,868,878,104      instructions:u            #    2.39  insn per cycle         
                                                  #    0.04  stalled cycles per insn  (83.24%)	                                                  #    0.04  stalled cycles per insn  (83.69%)
       970,199,916      branches:u                # 2168.590 M/sec                    (83.24%)	       967,881,284      branches:u                # 1870.188 M/sec                    (83.79%)
        11,041,374      branch-misses:u           #    1.14% of all branches          (83.86%)	        11,289,517      branch-misses:u           #    1.17% of all branches          (83.27%)
	
       0.448035542 seconds time elapsed	       							0.518214692 seconds time elapsed
	
	

 Performance counter stats for './a.out':	 						Performance counter stats for './a.out':
	
        450.246740      task-clock:u (msec)       #    0.998 CPUs utilized          		        518.604218      task-clock:u (msec)       #    0.999 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  		                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  		                 0      cpu-migrations:u          #    0.000 K/sec                  
                52      page-faults:u             #    0.115 K/sec                  		                52      page-faults:u             #    0.100 K/sec                  
     1,403,767,028      cycles:u                  #    3.118 GHz                      (83.35%)	     1,622,782,703      cycles:u                  #    3.129 GHz                      (83.21%)
       124,553,251      stalled-cycles-frontend:u #    8.87% frontend cycles idle     (83.37%)	       148,974,707      stalled-cycles-frontend:u #    9.18% frontend cycles idle     (83.24%)
       101,045,481      stalled-cycles-backend:u  #    7.20% backend cycles idle      (66.71%)	       112,861,212      stalled-cycles-backend:u  #    6.95% backend cycles idle      (66.45%)
     3,380,728,555      instructions:u            #    2.41  insn per cycle         	             3,865,667,812      instructions:u            #    2.38  insn per cycle         
                                                  #    0.04  stalled cycles per insn  (83.37%)	                                                  #    0.04  stalled cycles per insn  (83.22%)
       973,763,119      branches:u                # 2162.732 M/sec                    (83.35%)	       970,109,337      branches:u                # 1870.616 M/sec                    (83.81%)
        11,511,822      branch-misses:u           #    1.18% of all branches          (83.54%)	        11,546,829      branch-misses:u           #    1.19% of all branches          (83.29%)
	
       0.451002497 seconds time elapsed	       							0.519224084 seconds time elapsed


	-ss
Ondrej Bilka March 28, 2017, 10:19 a.m.
On Thu, Mar 16, 2017 at 04:54:40PM -0400, Mike Frysinger wrote:
> On 17 Mar 2017 00:49, Sergey Senozhatsky wrote:
> > On (03/16/17 14:02), Joseph Myers wrote:
> > > If this fixes a user-visible bug then the ChangeLog entry needs to include 
> > > [BZ #N] referencing the bug filed in Bugzilla (and once the fix is in the 
> > > bug needs to be resolved as FIXED with appropriate target milestone set).  
> > > Is this bug 2753?  If not, a new bug would need to be filed for it.
> > 
> > a) um... I assume glibc Bugzilla is located at https://sourceware.org/bugzilla/
> > and 2753, thus, is https://sourceware.org/bugzilla/show_bug.cgi?id=2753
> > if so, then, yes looks like I'm not the first one to point that out.
> > I'm not sure I see Ulrich's "You do not even understand how binary
> > searching works, do you?  The sum can never exceed nmemb and nmemb
> > obviously fits into an size_t" point. it's a bug.
> 
> if you see a comment by Ulrich that looks to be unrelated/off, then you
> can probably safely assume it is.  if the bug is open still, then we
> usually consider it *now* to be something worth fixing.
>
No, this is wrong fix as I probably write before. It isn't visible bug
unless programmer is idiot who tries to sort gigabyte long array of
single bytes and use binary search on it. Just keeping 256 byte table of positions
would be more effective than sorting. 

For any bigger size there isn't overflow as difference between that
pointers is bounded.

This fix takes extra instruction which would harm performance for basically everybody for nonexistent problem.

If you want to fix this special-case one byte branch to make gcc
optimize it away for larger sizes.
Zack Weinberg March 28, 2017, 3:30 p.m.
On Thu, Mar 16, 2017 at 11:32 PM, Sergey Senozhatsky
<sergey.senozhatsky.work@gmail.com> wrote:
> so I modified the test I posted earlier. moved array to .bss, made array
> size reasonable (so there is no overflow). each ./a.out executes bsearch()
> 10000000 times.

Thanks for this.  It might also be useful to show the codegen
difference, which (on x86-64) is

-bsearch_stock:
+bsearch_patched:
 .LFB10:
     .cfi_startproc
     pushq    %r15
@@ -43,9 +43,11 @@
     cmpq    %r13, %r14
     jnb    .L6
 .L9:
-    leaq    (%r14,%r13), %rbx
+    movq    %r13, %rbx
     movq    (%rsp), %rdi
+    subq    %r14, %rbx
     shrq    %rbx
+    addq    %r14, %rbx
     movq    %rbx, %r15
     imulq    %r12, %r15
     addq    8(%rsp), %r15

This probably hurts by extending the critical dependency chain, as
well as by issuing more instructions.  I would like to know what the
codegen difference looks like on more RISCy architectures (maybe MIPS
and AArch64?) -- I have to wonder whether a CPU that didn't have
complex LEA would wind up generating essentially the same code either
way.

zw
Zack Weinberg March 28, 2017, 4 p.m.
On Tue, Mar 28, 2017 at 6:19 AM, Ondřej Bílka <neleai@seznam.cz> wrote:
> On Thu, Mar 16, 2017 at 04:54:40PM -0400, Mike Frysinger wrote:
>> if you see a comment by Ulrich that looks to be unrelated/off, then you
>> can probably safely assume it is.  if the bug is open still, then we
>> usually consider it *now* to be something worth fixing.
>
> No, this is wrong fix as I probably write before. It isn't visible bug
> unless programmer is idiot who tries to sort gigabyte long array of
> single bytes and use binary search on it. Just keeping 256 byte table of positions
> would be more effective than sorting.

There's another possibility: a program might be tricked into calling
bsearch with the wrong arguments for the array it actually wants to
search.  If I understand the problem correctly, under the right
conditions bsearch could return a pointer outside the bounds of the
true (quite small) array.

Because of that, I think this qualifies as a must-fix bug and as a
potential security vulnerability.

> For any bigger size there isn't overflow as difference between that
> pointers is bounded.

It's unfortunate that there is no good way of making that apparent to
the compiler.  The relationship among base, nmemb, size, and the
address space as a whole is just too complicated.  Maybe if we were
actually doing pointer arithmetic on a nonce type - but that has its
own problems.

> This fix takes extra instruction which would harm performance for basically everybody for nonexistent problem.

As a meta concern: it is possible to disagree with people without
being this hostile.  You could have said "I don't think this is worth
doing, because it makes all uses of bsearch slower to handle a corner
case that will essentially never occur in practice" - same message,
avoids elevating your personal opinion to the status of a fact, and
avoids blowing off someone who cared enough to write a patch.

Please strive to do better in the future.

zw
Sergey Senozhatsky March 30, 2017, 2:02 a.m.
On (03/28/17 11:30), Zack Weinberg wrote:
> On Thu, Mar 16, 2017 at 11:32 PM, Sergey Senozhatsky
> <sergey.senozhatsky.work@gmail.com> wrote:
> > so I modified the test I posted earlier. moved array to .bss, made array
> > size reasonable (so there is no overflow). each ./a.out executes bsearch()
> > 10000000 times.
> 
> Thanks for this.  It might also be useful to show the codegen
> difference, which (on x86-64) is
> 
> -bsearch_stock:
> +bsearch_patched:
>  .LFB10:
>      .cfi_startproc
>      pushq    %r15
> @@ -43,9 +43,11 @@
>      cmpq    %r13, %r14
>      jnb    .L6
>  .L9:
> -    leaq    (%r14,%r13), %rbx
> +    movq    %r13, %rbx
>      movq    (%rsp), %rdi
> +    subq    %r14, %rbx
>      shrq    %rbx
> +    addq    %r14, %rbx
>      movq    %rbx, %r15
>      imulq    %r12, %r15
>      addq    8(%rsp), %r15
> 
> This probably hurts by extending the critical dependency chain, as
> well as by issuing more instructions.  I would like to know what the
> codegen difference looks like on more RISCy architectures (maybe MIPS
> and AArch64?) -- I have to wonder whether a CPU that didn't have
> complex LEA would wind up generating essentially the same code either
> way.

sorry for the delay.

ARM

arm-none-eabi-gcc (Arch Repository) 6.3.0

-O2

	UNPATCHED										PATCHED

    8368:	e92d4ff8 	push	{r3, r4, r5, r6, r7, r8, r9, sl, fp, lr}     	    83e4:	e92d4ff8 	push	{r3, r4, r5, r6, r7, r8, r9, sl, fp, lr}
    836c:	e2526000 	subs	r6, r2, #0                                   	    83e8:	e2526000 	subs	r6, r2, #0
    8370:	e59da028 	ldr	sl, [sp, #40]	; 0x28                         	    83ec:	e59da028 	ldr	sl, [sp, #40]	; 0x28
    8374:	11a07000 	movne	r7, r0                                      	    83f0:	11a07000 	movne	r7, r0
    8378:	11a08001 	movne	r8, r1                                      	    83f4:	11a08001 	movne	r8, r1
    837c:	11a09003 	movne	r9, r3                                      	    83f8:	11a09003 	movne	r9, r3
    8380:	13a05000 	movne	r5, #0                                      	    83fc:	13a05000 	movne	r5, #0
    8384:	1a000004 	bne	839c <____bsearch+0x34>                       	    8400:	1a000004 	bne	8418 <__bsearch+0x34>
    8388:	ea00000f 	b	83cc <____bsearch+0x64>                        	    8404:	ea00000f 	b	8448 <__bsearch+0x64>
    838c:	0a000011 	beq	83d8 <____bsearch+0x70>                       	    8408:	0a000011 	beq	8454 <__bsearch+0x70>
    8390:	e2845001 	add	r5, r4, #1                                    	    840c:	e2845001 	add	r5, r4, #1
    8394:	e1550006 	cmp	r5, r6                                        	    8410:	e1550006 	cmp	r5, r6
    8398:	2a00000b 	bcs	83cc <____bsearch+0x64>                       	    8414:	2a00000b 	bcs	8448 <__bsearch+0x64>
    839c:	e0854006 	add	r4, r5, r6                                    	    8418:	e0464005 	sub	r4, r6, r5
    83a0:	e1a040a4 	lsr	r4, r4, #1                                    	    841c:	e08540a4 	add	r4, r5, r4, lsr #1
    83a4:	e02b8499 	mla	fp, r9, r4, r8                                	    8420:	e02b8499 	mla	fp, r9, r4, r8
    83a8:	e1a00007 	mov	r0, r7                                        	    8424:	e1a00007 	mov	r0, r7
    83ac:	e1a0100b 	mov	r1, fp                                        	    8428:	e1a0100b 	mov	r1, fp
    83b0:	e1a0e00f 	mov	lr, pc                                        	    842c:	e1a0e00f 	mov	lr, pc
    83b4:	e12fff1a 	bx	sl                                             	    8430:	e12fff1a 	bx	sl
    83b8:	e3500000 	cmp	r0, #0                                        	    8434:	e3500000 	cmp	r0, #0
    83bc:	aafffff2 	bge	838c <____bsearch+0x24>                       	    8438:	aafffff2 	bge	8408 <__bsearch+0x24>
    83c0:	e1a06004 	mov	r6, r4                                        	    843c:	e1a06004 	mov	r6, r4
    83c4:	e1550006 	cmp	r5, r6                                        	    8440:	e1550006 	cmp	r5, r6
    83c8:	3afffff3 	bcc	839c <____bsearch+0x34>                       	    8444:	3afffff3 	bcc	8418 <__bsearch+0x34>
    83cc:	e3a00000 	mov	r0, #0                                        	    8448:	e3a00000 	mov	r0, #0
    83d0:	e8bd4ff8 	pop	{r3, r4, r5, r6, r7, r8, r9, sl, fp, lr}      	    844c:	e8bd4ff8 	pop	{r3, r4, r5, r6, r7, r8, r9, sl, fp, lr}
    83d4:	e12fff1e 	bx	lr                                             	    8450:	e12fff1e 	bx	lr
    83d8:	e1a0000b 	mov	r0, fp                                        	    8454:	e1a0000b 	mov	r0, fp
    83dc:	e8bd4ff8 	pop	{r3, r4, r5, r6, r7, r8, r9, sl, fp, lr}      	    8458:	e8bd4ff8 	pop	{r3, r4, r5, r6, r7, r8, r9, sl, fp, lr}
    83e0:	e12fff1e 	bx	lr                                             	    845c:	e12fff1e 	bx	lr

	-ss
Andrew Pinski March 30, 2017, 4:06 a.m.
On Wed, Mar 29, 2017 at 7:02 PM, Sergey Senozhatsky
<sergey.senozhatsky.work@gmail.com> wrote:
> On (03/28/17 11:30), Zack Weinberg wrote:
>> On Thu, Mar 16, 2017 at 11:32 PM, Sergey Senozhatsky
>> <sergey.senozhatsky.work@gmail.com> wrote:
>> > so I modified the test I posted earlier. moved array to .bss, made array
>> > size reasonable (so there is no overflow). each ./a.out executes bsearch()
>> > 10000000 times.
>>
>> Thanks for this.  It might also be useful to show the codegen
>> difference, which (on x86-64) is
>>
>> -bsearch_stock:
>> +bsearch_patched:
>>  .LFB10:
>>      .cfi_startproc
>>      pushq    %r15
>> @@ -43,9 +43,11 @@
>>      cmpq    %r13, %r14
>>      jnb    .L6
>>  .L9:
>> -    leaq    (%r14,%r13), %rbx
>> +    movq    %r13, %rbx
>>      movq    (%rsp), %rdi
>> +    subq    %r14, %rbx
>>      shrq    %rbx
>> +    addq    %r14, %rbx
>>      movq    %rbx, %r15
>>      imulq    %r12, %r15
>>      addq    8(%rsp), %r15
>>
>> This probably hurts by extending the critical dependency chain, as
>> well as by issuing more instructions.  I would like to know what the
>> codegen difference looks like on more RISCy architectures (maybe MIPS
>> and AArch64?) -- I have to wonder whether a CPU that didn't have
>> complex LEA would wind up generating essentially the same code either
>> way.
>
> Sorry for the delay.
>
> ARM
>
> arm-none-eabi-gcc (Arch Repository) 6.3.0
>
> -O2
>
>         UNPATCHED                                                                               PATCHED
>
>     8368:       e92d4ff8        push    {r3, r4, r5, r6, r7, r8, r9, sl, fp, lr}            83e4:       e92d4ff8        push    {r3, r4, r5, r6, r7, r8, r9, sl, fp, lr}
>     836c:       e2526000        subs    r6, r2, #0                                          83e8:       e2526000        subs    r6, r2, #0
>     8370:       e59da028        ldr     sl, [sp, #40]   ; 0x28                              83ec:       e59da028        ldr     sl, [sp, #40]   ; 0x28
>     8374:       11a07000        movne   r7, r0                                              83f0:       11a07000        movne   r7, r0
>     8378:       11a08001        movne   r8, r1                                              83f4:       11a08001        movne   r8, r1
>     837c:       11a09003        movne   r9, r3                                              83f8:       11a09003        movne   r9, r3
>     8380:       13a05000        movne   r5, #0                                              83fc:       13a05000        movne   r5, #0
>     8384:       1a000004        bne     839c <____bsearch+0x34>                             8400:       1a000004        bne     8418 <__bsearch+0x34>
>     8388:       ea00000f        b       83cc <____bsearch+0x64>                             8404:       ea00000f        b       8448 <__bsearch+0x64>
>     838c:       0a000011        beq     83d8 <____bsearch+0x70>                             8408:       0a000011        beq     8454 <__bsearch+0x70>
>     8390:       e2845001        add     r5, r4, #1                                          840c:       e2845001        add     r5, r4, #1
>     8394:       e1550006        cmp     r5, r6                                              8410:       e1550006        cmp     r5, r6
>     8398:       2a00000b        bcs     83cc <____bsearch+0x64>                             8414:       2a00000b        bcs     8448 <__bsearch+0x64>
>     839c:       e0854006        add     r4, r5, r6                                          8418:       e0464005        sub     r4, r6, r5
>     83a0:       e1a040a4        lsr     r4, r4, #1                                          841c:       e08540a4        add     r4, r5, r4, lsr #1

This is just the magic here.  ARM is special in that it has shifts
combined with its adds.  AArch64 is similar, but on some microarchitectures
doing the shift separately from the add is better.  So this shows that it
is worse on those targets for latency reasons :).

Thanks,
Andrew

>     83a4:       e02b8499        mla     fp, r9, r4, r8                                      8420:       e02b8499        mla     fp, r9, r4, r8
>     83a8:       e1a00007        mov     r0, r7                                              8424:       e1a00007        mov     r0, r7
>     83ac:       e1a0100b        mov     r1, fp                                              8428:       e1a0100b        mov     r1, fp
>     83b0:       e1a0e00f        mov     lr, pc                                              842c:       e1a0e00f        mov     lr, pc
>     83b4:       e12fff1a        bx      sl                                                  8430:       e12fff1a        bx      sl
>     83b8:       e3500000        cmp     r0, #0                                              8434:       e3500000        cmp     r0, #0
>     83bc:       aafffff2        bge     838c <____bsearch+0x24>                             8438:       aafffff2        bge     8408 <__bsearch+0x24>
>     83c0:       e1a06004        mov     r6, r4                                              843c:       e1a06004        mov     r6, r4
>     83c4:       e1550006        cmp     r5, r6                                              8440:       e1550006        cmp     r5, r6
>     83c8:       3afffff3        bcc     839c <____bsearch+0x34>                             8444:       3afffff3        bcc     8418 <__bsearch+0x34>
>     83cc:       e3a00000        mov     r0, #0                                              8448:       e3a00000        mov     r0, #0
>     83d0:       e8bd4ff8        pop     {r3, r4, r5, r6, r7, r8, r9, sl, fp, lr}            844c:       e8bd4ff8        pop     {r3, r4, r5, r6, r7, r8, r9, sl, fp, lr}
>     83d4:       e12fff1e        bx      lr                                                  8450:       e12fff1e        bx      lr
>     83d8:       e1a0000b        mov     r0, fp                                              8454:       e1a0000b        mov     r0, fp
>     83dc:       e8bd4ff8        pop     {r3, r4, r5, r6, r7, r8, r9, sl, fp, lr}            8458:       e8bd4ff8        pop     {r3, r4, r5, r6, r7, r8, r9, sl, fp, lr}
>     83e0:       e12fff1e        bx      lr                                                  845c:       e12fff1e        bx      lr
>
>         -ss
Sergey Senozhatsky March 30, 2017, 4:59 a.m.
On (03/29/17 21:06), Andrew Pinski wrote:
> >     8368:       e92d4ff8        push    {r3, r4, r5, r6, r7, r8, r9, sl, fp, lr}            83e4:       e92d4ff8        push    {r3, r4, r5, r6, r7, r8, r9, sl, fp, lr}
> >     836c:       e2526000        subs    r6, r2, #0                                          83e8:       e2526000        subs    r6, r2, #0
> >     8370:       e59da028        ldr     sl, [sp, #40]   ; 0x28                              83ec:       e59da028        ldr     sl, [sp, #40]   ; 0x28
> >     8374:       11a07000        movne   r7, r0                                              83f0:       11a07000        movne   r7, r0
> >     8378:       11a08001        movne   r8, r1                                              83f4:       11a08001        movne   r8, r1
> >     837c:       11a09003        movne   r9, r3                                              83f8:       11a09003        movne   r9, r3
> >     8380:       13a05000        movne   r5, #0                                              83fc:       13a05000        movne   r5, #0
> >     8384:       1a000004        bne     839c <____bsearch+0x34>                             8400:       1a000004        bne     8418 <__bsearch+0x34>
> >     8388:       ea00000f        b       83cc <____bsearch+0x64>                             8404:       ea00000f        b       8448 <__bsearch+0x64>
> >     838c:       0a000011        beq     83d8 <____bsearch+0x70>                             8408:       0a000011        beq     8454 <__bsearch+0x70>
> >     8390:       e2845001        add     r5, r4, #1                                          840c:       e2845001        add     r5, r4, #1
> >     8394:       e1550006        cmp     r5, r6                                              8410:       e1550006        cmp     r5, r6
> >     8398:       2a00000b        bcs     83cc <____bsearch+0x64>                             8414:       2a00000b        bcs     8448 <__bsearch+0x64>
> >     839c:       e0854006        add     r4, r5, r6                                          8418:       e0464005        sub     r4, r6, r5
> >     83a0:       e1a040a4        lsr     r4, r4, #1                                          841c:       e08540a4        add     r4, r5, r4, lsr #1
> 
> This is just the magic here.  ARM is special in that it has shifts
> combined with its adds.  AArch64 is similar, but on some microarchitectures
> doing the shift separately from the add is better.  So this shows that it
> is worse on those targets for latency reasons :).

BTW, FreeBSD's kernel bsearch implementation is quite interesting
and unique. Take a look:

https://github.com/freebsd/freebsd/blob/master/sys/libkern/bsearch.c

x86_64 gcc7 -O2


  400630:	41 57                	push   %r15
  400632:	41 56                	push   %r14
  400634:	41 55                	push   %r13
  400636:	41 54                	push   %r12
  400638:	55                   	push   %rbp
  400639:	53                   	push   %rbx
  40063a:	48 83 ec 18          	sub    $0x18,%rsp
  40063e:	48 85 d2             	test   %rdx,%rdx
  400641:	48 89 7c 24 08       	mov    %rdi,0x8(%rsp)
  400646:	74 53                	je     40069b <_bsearch+0x6b>
  400648:	49 89 f4             	mov    %rsi,%r12
  40064b:	48 89 d3             	mov    %rdx,%rbx
  40064e:	48 89 cd             	mov    %rcx,%rbp
  400651:	4d 89 c5             	mov    %r8,%r13
  400654:	eb 1a                	jmp    400670 <_bsearch+0x40>
  400656:	66 2e 0f 1f 84 00 00 	nopw   %cs:0x0(%rax,%rax,1)
  40065d:	00 00 00 
  400660:	48 83 eb 01          	sub    $0x1,%rbx
  400664:	4d 8d 24 2e          	lea    (%r14,%rbp,1),%r12
  400668:	48 d1 eb             	shr    %rbx
  40066b:	48 85 db             	test   %rbx,%rbx
  40066e:	74 2b                	je     40069b <_bsearch+0x6b>
  400670:	49 89 df             	mov    %rbx,%r15
  400673:	48 8b 7c 24 08       	mov    0x8(%rsp),%rdi
  400678:	49 d1 ef             	shr    %r15
  40067b:	4c 89 fa             	mov    %r15,%rdx
  40067e:	48 0f af d5          	imul   %rbp,%rdx
  400682:	4d 8d 34 14          	lea    (%r12,%rdx,1),%r14
  400686:	4c 89 f6             	mov    %r14,%rsi
  400689:	41 ff d5             	callq  *%r13
  40068c:	83 f8 00             	cmp    $0x0,%eax
  40068f:	74 0d                	je     40069e <_bsearch+0x6e>
  400691:	7f cd                	jg     400660 <_bsearch+0x30>
  400693:	4c 89 fb             	mov    %r15,%rbx
  400696:	48 85 db             	test   %rbx,%rbx
  400699:	75 d5                	jne    400670 <_bsearch+0x40>
  40069b:	45 31 f6             	xor    %r14d,%r14d
  40069e:	48 83 c4 18          	add    $0x18,%rsp
  4006a2:	4c 89 f0             	mov    %r14,%rax
  4006a5:	5b                   	pop    %rbx
  4006a6:	5d                   	pop    %rbp
  4006a7:	41 5c                	pop    %r12
  4006a9:	41 5d                	pop    %r13
  4006ab:	41 5e                	pop    %r14
  4006ad:	41 5f                	pop    %r15
  4006af:	c3                   	retq   



This should not overflow, as far as I can see.




perf stat (-O2, x86_64)


and it *seems* to be faster than the existing glibc bsearch()
(I'm not claiming that definitively, though).


== CHAR array


 Performance counter stats for './a.out':

        421.053507      task-clock:u (msec)       #    0.999 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
                49      page-faults:u             #    0.116 K/sec                  
     1,324,686,237      cycles:u                  #    3.146 GHz                      (82.90%)
        91,991,733      stalled-cycles-frontend:u #    6.94% frontend cycles idle     (83.54%)
        57,399,708      stalled-cycles-backend:u  #    4.33% backend cycles idle      (67.23%)
     3,596,376,160      instructions:u            #    2.71  insn per cycle         
                                                  #    0.03  stalled cycles per insn  (83.61%)
       732,521,201      branches:u                # 1739.734 M/sec                    (83.61%)
         7,764,387      branch-misses:u           #    1.06% of all branches          (83.09%)

       0.421633822 seconds time elapsed


 Performance counter stats for './a.out':

        420.720555      task-clock:u (msec)       #    0.999 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
                49      page-faults:u             #    0.116 K/sec                  
     1,317,505,328      cycles:u                  #    3.132 GHz                      (82.89%)
        86,237,745      stalled-cycles-frontend:u #    6.55% frontend cycles idle     (82.89%)
        58,424,019      stalled-cycles-backend:u  #    4.43% backend cycles idle      (67.16%)
     3,591,713,950      instructions:u            #    2.73  insn per cycle         
                                                  #    0.02  stalled cycles per insn  (83.60%)
       733,832,646      branches:u                # 1744.228 M/sec                    (83.60%)
         7,335,495      branch-misses:u           #    1.00% of all branches          (83.55%)

       0.421209527 seconds time elapsed


 Performance counter stats for './a.out':

        420.446508      task-clock:u (msec)       #    0.998 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
                49      page-faults:u             #    0.117 K/sec                  
     1,309,007,427      cycles:u                  #    3.113 GHz                      (82.87%)
        85,067,714      stalled-cycles-frontend:u #    6.50% frontend cycles idle     (83.54%)
        63,478,544      stalled-cycles-backend:u  #    4.85% backend cycles idle      (67.18%)
     3,606,339,245      instructions:u            #    2.76  insn per cycle         
                                                  #    0.02  stalled cycles per insn  (83.59%)
       730,795,217      branches:u                # 1738.141 M/sec                    (83.62%)
         6,900,426      branch-misses:u           #    0.94% of all branches          (83.09%)

       0.421232740 seconds time elapsed


 Performance counter stats for './a.out':

        422.715593      task-clock:u (msec)       #    0.998 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
                47      page-faults:u             #    0.111 K/sec                  
     1,326,499,959      cycles:u                  #    3.138 GHz                      (82.97%)
       100,747,231      stalled-cycles-frontend:u #    7.59% frontend cycles idle     (82.97%)
        67,604,908      stalled-cycles-backend:u  #    5.10% backend cycles idle      (67.29%)
     3,578,538,258      instructions:u            #    2.70  insn per cycle         
                                                  #    0.03  stalled cycles per insn  (83.68%)
       731,849,769      branches:u                # 1731.305 M/sec                    (83.68%)
         8,105,098      branch-misses:u           #    1.11% of all branches          (83.22%)

       0.423636686 seconds time elapsed



== INT array


 Performance counter stats for './a.out':

        367.111305      task-clock:u (msec)       #    0.998 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
                52      page-faults:u             #    0.142 K/sec                  
     1,142,267,432      cycles:u                  #    3.112 GHz                      (82.87%)
       186,370,432      stalled-cycles-frontend:u #   16.32% frontend cycles idle     (83.52%)
        18,145,529      stalled-cycles-backend:u  #    1.59% backend cycles idle      (67.30%)
     3,608,624,538      instructions:u            #    3.16  insn per cycle         
                                                  #    0.05  stalled cycles per insn  (83.65%)
       729,956,547      branches:u                # 1988.379 M/sec                    (83.67%)
             6,567      branch-misses:u           #    0.00% of all branches          (83.31%)

       0.367872314 seconds time elapsed


 Performance counter stats for './a.out':

        367.367983      task-clock:u (msec)       #    0.998 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
                53      page-faults:u             #    0.144 K/sec                  
     1,146,948,361      cycles:u                  #    3.122 GHz                      (82.85%)
       189,862,982      stalled-cycles-frontend:u #   16.55% frontend cycles idle     (82.85%)
        16,090,029      stalled-cycles-backend:u  #    1.40% backend cycles idle      (67.34%)
     3,610,943,628      instructions:u            #    3.15  insn per cycle         
                                                  #    0.05  stalled cycles per insn  (83.71%)
       729,705,480      branches:u                # 1986.307 M/sec                    (83.68%)
           239,259      branch-misses:u           #    0.03% of all branches          (83.33%)

       0.368188108 seconds time elapsed


 Performance counter stats for './a.out':

        435.185387      task-clock:u (msec)       #    0.999 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
                52      page-faults:u             #    0.119 K/sec                  
     1,367,508,177      cycles:u                  #    3.142 GHz                      (82.77%)
       317,962,731      stalled-cycles-frontend:u #   23.25% frontend cycles idle     (83.45%)
        89,442,496      stalled-cycles-backend:u  #    6.54% backend cycles idle      (66.91%)
     3,614,939,853      instructions:u            #    2.64  insn per cycle         
                                                  #    0.09  stalled cycles per insn  (83.46%)
       731,171,118      branches:u                # 1680.137 M/sec                    (83.46%)
        11,699,371      branch-misses:u           #    1.60% of all branches          (83.41%)

       0.435828955 seconds time elapsed


 Performance counter stats for './a.out':

        426.947345      task-clock:u (msec)       #    0.999 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
                53      page-faults:u             #    0.124 K/sec                  
     1,335,309,841      cycles:u                  #    3.128 GHz                      (83.14%)
       299,538,314      stalled-cycles-frontend:u #   22.43% frontend cycles idle     (83.15%)
        78,271,313      stalled-cycles-backend:u  #    5.86% backend cycles idle      (66.28%)
     3,619,103,716      instructions:u            #    2.71  insn per cycle         
                                                  #    0.08  stalled cycles per insn  (83.14%)
       734,182,321      branches:u                # 1719.609 M/sec                    (83.81%)
        10,304,408      branch-misses:u           #    1.40% of all branches          (83.76%)

       0.427541178 seconds time elapsed


	-ss
Zack Weinberg March 30, 2017, 1:18 p.m.
On Thu, Mar 30, 2017 at 12:59 AM, Sergey Senozhatsky
<sergey.senozhatsky.work@gmail.com> wrote:
>
> BTW, FreeBSD's kernel bsearch implementation is quite interesting
> and unique. take a look:
>
> https://github.com/freebsd/freebsd/blob/master/sys/libkern/bsearch.c

Oh, that's clever.  I _thought_ there had to be a way to avoid the
addition.  This should indeed be immune to the overflow problem,
although we still need to have a test for it.

libstdc++-v3 appears to use the same trick: see
https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=libstdc%2B%2B-v3/include/bits/stl_algo.h;h=2cd5303a100b998f80369c0baf2cd39aedf14b52;hb=af44a97c14af3a0079590662d49a8ced98954376#l2037
(sorry for length of URL).  So we can crib from that and not have to
worry about copyrights.

As long as we're micro-optimizing, I found an article
<https://realm.io/news/how-we-beat-cpp-stl-binary-search/> which, in
essence, says that because the branches after the comparison are going
to be unpredictable, you really want to use conditional moves
instead.  It might be worth spending a little effort convincing GCC and
LLVM to do that.  (It's too bad that there's no __builtin_unpredictable()
to go with __builtin_expect().)
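One way such a cmov-friendly loop can look — a sketch for a plain sorted int array (illustrative only; not a drop-in bsearch replacement, and not code from this thread):

```c
#include <stddef.h>

/* Branchless-style binary search: the interval [base, base + len) is
   narrowed by a conditional step that compilers can often lower to a
   cmov instead of a poorly predicted branch. */
const int *branchless_search (const int *base, size_t len, int key)
{
  while (len > 1)
    {
      size_t half = len / 2;
      /* If the last element of the lower half is below the key,
         any match must lie in the upper part. */
      base += (base[half - 1] < key) ? half : 0;
      len -= half;
    }
  return (len != 0 && *base == key) ? base : NULL;
}
```

Note that `len -= half` always runs, so the loop length is data-independent; only the pointer bump depends on the comparison.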

zw
Sergey Senozhatsky March 31, 2017, 2:21 a.m.
On (03/30/17 09:18), Zack Weinberg wrote:
> On Thu, Mar 30, 2017 at 12:59 AM, Sergey Senozhatsky
> <sergey.senozhatsky.work@gmail.com> wrote:
> >
> > BTW, FreeBSD's kernel bsearch implementation is quite interesting
> > and unique. take a look:
> >
> > https://github.com/freebsd/freebsd/blob/master/sys/libkern/bsearch.c
> 
> Oh, that's clever.  I _thought_ there had to be a way to avoid the
> addition.  This should indeed be immune to the overflow problem,
> although we still need to have a test for it.
> 
> libstdc++-v3 appears to use the same trick: see
> https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=libstdc%2B%2B-v3/include/bits/stl_algo.h;h=2cd5303a100b998f80369c0baf2cd39aedf14b52;hb=af44a97c14af3a0079590662d49a8ced98954376#l2037
> (sorry for length of URL).  So we can crib from that and not have to
> worry about copyrights.

Interesting.

So, a simple-minded direct re-implementation of the libstdc++
approach may look like this:

void *__bsearch (const void *__key, const void *__base, size_t __nmemb, size_t __size,
		__compar_fn_t __compar)
{
	const char *base = __base, *pivot;
	size_t len = __nmemb;
	size_t half;
	int comparison;

	while (len > 0) {
		/* Index with len >> 1 only: no 'l + u' sum, so no overflow. */
		half = len >> 1;
		pivot = base + half * __size;
		comparison = (*__compar) (__key, pivot);

		if (comparison == 0)
			return (void *)pivot;

		if (comparison < 0) {
			/* Key is below the pivot: keep base, halve the range. */
			len = half;
		} else {
			/* Key is above the pivot: skip past it. */
			base = pivot + __size;
			len = len - half - 1;
		}
	}

	return NULL;
}


and it generates slightly worse code than the BSD version, due to this part

	len = len - half - 1;

which compiles to
  	sub    $0x1,%rbx
  	lea    (%r15,%rbp,1),%r12
  	sub    %r14,%rbx

objdump

  400730:	41 57                	push   %r15
  400732:	41 56                	push   %r14
  400734:	41 55                	push   %r13
  400736:	41 54                	push   %r12
  400738:	55                   	push   %rbp
  400739:	53                   	push   %rbx
  40073a:	48 83 ec 18          	sub    $0x18,%rsp
  40073e:	48 85 d2             	test   %rdx,%rdx
  400741:	48 89 7c 24 08       	mov    %rdi,0x8(%rsp)
  400746:	74 51                	je     400799 <_bsearch+0x69>
  400748:	49 89 f4             	mov    %rsi,%r12
  40074b:	48 89 d3             	mov    %rdx,%rbx
  40074e:	48 89 cd             	mov    %rcx,%rbp
  400751:	4d 89 c5             	mov    %r8,%r13
  400754:	eb 1a                	jmp    400770 <_bsearch+0x40>
  400756:	66 2e 0f 1f 84 00 00 	nopw   %cs:0x0(%rax,%rax,1)
  40075d:	00 00 00 
  400760:	48 83 eb 01          	sub    $0x1,%rbx
  400764:	4d 8d 24 2f          	lea    (%r15,%rbp,1),%r12
  400768:	4c 29 f3             	sub    %r14,%rbx
  40076b:	48 85 db             	test   %rbx,%rbx
  40076e:	74 29                	je     400799 <_bsearch+0x69>
  400770:	49 89 de             	mov    %rbx,%r14
  400773:	48 8b 7c 24 08       	mov    0x8(%rsp),%rdi
  400778:	49 d1 ee             	shr    %r14
  40077b:	4d 89 f7             	mov    %r14,%r15
  40077e:	4c 0f af fd          	imul   %rbp,%r15
  400782:	4d 01 e7             	add    %r12,%r15
  400785:	4c 89 fe             	mov    %r15,%rsi
  400788:	41 ff d5             	callq  *%r13
  40078b:	85 c0                	test   %eax,%eax
  40078d:	74 0d                	je     40079c <_bsearch+0x6c>
  40078f:	79 cf                	jns    400760 <_bsearch+0x30>
  400791:	4c 89 f3             	mov    %r14,%rbx
  400794:	48 85 db             	test   %rbx,%rbx
  400797:	75 d7                	jne    400770 <_bsearch+0x40>
  400799:	45 31 ff             	xor    %r15d,%r15d
  40079c:	48 83 c4 18          	add    $0x18,%rsp
  4007a0:	4c 89 f8             	mov    %r15,%rax
  4007a3:	5b                   	pop    %rbx
  4007a4:	5d                   	pop    %rbp
  4007a5:	41 5c                	pop    %r12
  4007a7:	41 5d                	pop    %r13
  4007a9:	41 5e                	pop    %r14
  4007ab:	41 5f                	pop    %r15
  4007ad:	c3                   	retq   
  4007ae:	66 90                	xchg   %ax,%ax


while something like this (basically, a BSD version)

void *bsearch (const void *__key, const void *__base, size_t __nmemb, size_t __size,
		__compar_fn_t __compar)
{
	const char *base = __base, *pivot;
	size_t len = __nmemb;
	int comparison;

	while (len > 0) {
		pivot = base + (len >> 1) * __size;
		comparison = (*__compar) (__key, pivot);

		if (comparison == 0)
			return (void *)pivot;

		if (comparison > 0) {
			base = pivot + __size;
			len--;
		}
		len >>= 1;
	}

	return NULL;
}

uses 
		shr    %rbx

objdump

  4006b0:	41 57                	push   %r15
  4006b2:	41 56                	push   %r14
  4006b4:	41 55                	push   %r13
  4006b6:	41 54                	push   %r12
  4006b8:	55                   	push   %rbp
  4006b9:	53                   	push   %rbx
  4006ba:	48 83 ec 18          	sub    $0x18,%rsp
  4006be:	48 85 d2             	test   %rdx,%rdx
  4006c1:	48 89 7c 24 08       	mov    %rdi,0x8(%rsp)
  4006c6:	74 53                	je     40071b <__bsearch+0x6b>
  4006c8:	49 89 f4             	mov    %rsi,%r12
  4006cb:	48 89 d3             	mov    %rdx,%rbx
  4006ce:	48 89 cd             	mov    %rcx,%rbp
  4006d1:	4d 89 c5             	mov    %r8,%r13
  4006d4:	eb 1a                	jmp    4006f0 <__bsearch+0x40>
  4006d6:	66 2e 0f 1f 84 00 00 	nopw   %cs:0x0(%rax,%rax,1)
  4006dd:	00 00 00 
  4006e0:	48 83 eb 01          	sub    $0x1,%rbx
  4006e4:	4d 8d 24 2e          	lea    (%r14,%rbp,1),%r12
  4006e8:	48 d1 eb             	shr    %rbx
  4006eb:	48 85 db             	test   %rbx,%rbx
  4006ee:	74 2b                	je     40071b <__bsearch+0x6b>
  4006f0:	49 89 df             	mov    %rbx,%r15
  4006f3:	48 8b 7c 24 08       	mov    0x8(%rsp),%rdi
  4006f8:	49 d1 ef             	shr    %r15
  4006fb:	4c 89 fa             	mov    %r15,%rdx
  4006fe:	48 0f af d5          	imul   %rbp,%rdx
  400702:	4d 8d 34 14          	lea    (%r12,%rdx,1),%r14
  400706:	4c 89 f6             	mov    %r14,%rsi
  400709:	41 ff d5             	callq  *%r13
  40070c:	83 f8 00             	cmp    $0x0,%eax
  40070f:	74 0d                	je     40071e <__bsearch+0x6e>
  400711:	7f cd                	jg     4006e0 <__bsearch+0x30>
  400713:	4c 89 fb             	mov    %r15,%rbx
  400716:	48 85 db             	test   %rbx,%rbx
  400719:	75 d5                	jne    4006f0 <__bsearch+0x40>
  40071b:	45 31 f6             	xor    %r14d,%r14d
  40071e:	48 83 c4 18          	add    $0x18,%rsp
  400722:	4c 89 f0             	mov    %r14,%rax
  400725:	5b                   	pop    %rbx
  400726:	5d                   	pop    %rbp
  400727:	41 5c                	pop    %r12
  400729:	41 5d                	pop    %r13
  40072b:	41 5e                	pop    %r14
  40072d:	41 5f                	pop    %r15
  40072f:	c3                   	retq   



In practice, this seems to matter a lot (gcc7, x86_64, -O2)


		LIBSTDC++											BSD
 Performance counter stats for './a.out':	 						Performance counter stats for './a.out':
	
        529.563747      task-clock:u (msec)       #    0.999 CPUs utilized          		        431.084495      task-clock:u (msec)       #    0.999 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  		                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  		                 0      cpu-migrations:u          #    0.000 K/sec                  
                48      page-faults:u             #    0.091 K/sec                  		                49      page-faults:u             #    0.114 K/sec                  
     1,660,238,680      cycles:u                  #    3.135 GHz                      (83.01%)	     1,359,410,967      cycles:u                  #    3.153 GHz                      (83.30%)
        18,865,782      stalled-cycles-frontend:u #    1.14% frontend cycles idle     (83.49%)	       112,645,472      stalled-cycles-frontend:u #    8.29% frontend cycles idle     (83.30%)
        28,578,724      stalled-cycles-backend:u  #    1.72% backend cycles idle      (67.15%)	        64,645,660      stalled-cycles-backend:u  #    4.76% backend cycles idle      (66.60%)
     3,809,381,770      instructions:u            #    2.29  insn per cycle         		     3,602,947,356      instructions:u            #    2.65  insn per cycle         
                                                  #    0.01  stalled cycles per insn  (83.57%)	                                                  #    0.03  stalled cycles per insn  (83.30%)
       949,275,159      branches:u                # 1792.561 M/sec                    (83.57%)	       730,987,355      branches:u                # 1695.694 M/sec                    (83.30%)
             4,026      branch-misses:u           #    0.00% of all branches          (83.16%)	        10,441,749      branch-misses:u           #    1.43% of all branches          (83.78%)
	
       0.530168158 seconds time elapsed	       							0.431656059 seconds time elapsed
	
	
 Performance counter stats for './a.out':	 						Performance counter stats for './a.out':
	
        531.296478      task-clock:u (msec)       #    0.999 CPUs utilized          		        436.154599      task-clock:u (msec)       #    0.998 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  		                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  		                 0      cpu-migrations:u          #    0.000 K/sec                  
                50      page-faults:u             #    0.094 K/sec                  		                50      page-faults:u             #    0.115 K/sec                  
     1,659,440,099      cycles:u                  #    3.123 GHz                      (83.06%)	     1,348,636,891      cycles:u                  #    3.092 GHz                      (83.46%)
        19,236,667      stalled-cycles-frontend:u #    1.16% frontend cycles idle     (83.06%)	       119,038,311      stalled-cycles-frontend:u #    8.83% frontend cycles idle     (83.47%)
        28,226,341      stalled-cycles-backend:u  #    1.70% backend cycles idle      (67.06%)	        82,372,803      stalled-cycles-backend:u  #    6.11% backend cycles idle      (67.00%)
     3,810,692,002      instructions:u            #    2.30  insn per cycle         		     3,593,517,874      instructions:u            #    2.66  insn per cycle         
                                                  #    0.01  stalled cycles per insn  (83.61%)	                                                  #    0.03  stalled cycles per insn  (83.49%)
       950,132,896      branches:u                # 1788.329 M/sec                    (83.64%)	       732,065,055      branches:u                # 1678.453 M/sec                    (83.50%)
             4,383      branch-misses:u           #    0.00% of all branches          (83.55%)	         9,774,701      branch-misses:u           #    1.34% of all branches          (82.86%)
	
       0.531825605 seconds time elapsed	       							0.436922711 seconds time elapsed
	
	
 Performance counter stats for './a.out':	 						Performance counter stats for './a.out':
	
        529.942094      task-clock:u (msec)       #    0.998 CPUs utilized          		        435.002430      task-clock:u (msec)       #    0.999 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  		                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  		                 0      cpu-migrations:u          #    0.000 K/sec                  
                48      page-faults:u             #    0.091 K/sec                  		                49      page-faults:u             #    0.113 K/sec                  
     1,661,178,675      cycles:u                  #    3.135 GHz                      (83.02%)	     1,351,752,937      cycles:u                  #    3.107 GHz                      (83.40%)
        21,119,773      stalled-cycles-frontend:u #    1.27% frontend cycles idle     (83.04%)	       115,705,283      stalled-cycles-frontend:u #    8.56% frontend cycles idle     (83.46%)
        28,723,902      stalled-cycles-backend:u  #    1.73% backend cycles idle      (67.16%)	        73,129,341      stalled-cycles-backend:u  #    5.41% backend cycles idle      (66.90%)
     3,806,932,705      instructions:u            #    2.29  insn per cycle         		     3,607,666,833      instructions:u            #    2.67  insn per cycle         
                                                  #    0.01  stalled cycles per insn  (83.58%)	                                                  #    0.03  stalled cycles per insn  (83.45%)
       949,754,552      branches:u                # 1792.186 M/sec                    (83.58%)	       728,361,887      branches:u                # 1674.386 M/sec                    (83.45%)
             4,278      branch-misses:u           #    0.00% of all branches          (83.25%)	        10,571,686      branch-misses:u           #    1.45% of all branches          (83.00%)
	
       0.530759498 seconds time elapsed	       							0.435504277 seconds time elapsed
	
	
 Performance counter stats for './a.out':	 						Performance counter stats for './a.out':
	
        535.067404      task-clock:u (msec)       #    0.999 CPUs utilized          		        433.108622      task-clock:u (msec)       #    0.998 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  		                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  		                 0      cpu-migrations:u          #    0.000 K/sec                  
                49      page-faults:u             #    0.092 K/sec                  		                48      page-faults:u             #    0.111 K/sec                  
     1,662,263,437      cycles:u                  #    3.107 GHz                      (83.21%)	     1,357,994,817      cycles:u                  #    3.135 GHz                      (83.38%)
        23,904,662      stalled-cycles-frontend:u #    1.44% frontend cycles idle     (83.18%)	       117,585,762      stalled-cycles-frontend:u #    8.66% frontend cycles idle     (82.73%)
        30,990,320      stalled-cycles-backend:u  #    1.86% backend cycles idle      (66.75%)	        71,549,183      stalled-cycles-backend:u  #    5.27% backend cycles idle      (66.75%)
     3,788,168,547      instructions:u            #    2.28  insn per cycle         		     3,608,615,459      instructions:u            #    2.66  insn per cycle         
                                                  #    0.01  stalled cycles per insn  (83.66%)	                                                  #    0.03  stalled cycles per insn  (83.38%)
       949,809,536      branches:u                # 1775.121 M/sec                    (83.74%)	       730,772,715      branches:u                # 1687.274 M/sec                    (83.38%)
             5,849      branch-misses:u           #    0.00% of all branches          (83.47%)	        10,279,444      branch-misses:u           #    1.41% of all branches          (83.90%)
	
       0.535721929 seconds time elapsed	       							0.433903691 seconds time elapsed





for comparison, the unpatched glibc bsearch():

 Performance counter stats for './a.out':

        447.937515      task-clock:u (msec)       #    0.999 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
                48      page-faults:u             #    0.107 K/sec                  
     1,409,322,732      cycles:u                  #    3.146 GHz                      (83.26%)
       137,696,335      stalled-cycles-frontend:u #    9.77% frontend cycles idle     (83.26%)
       107,106,201      stalled-cycles-backend:u  #    7.60% backend cycles idle      (66.51%)
     3,390,292,254      instructions:u            #    2.41  insn per cycle         
                                                  #    0.04  stalled cycles per insn  (83.26%)
       968,222,443      branches:u                # 2161.512 M/sec                    (83.86%)
        13,048,699      branch-misses:u           #    1.35% of all branches          (83.44%)

       0.448465765 seconds time elapsed


 Performance counter stats for './a.out':

        451.069138      task-clock:u (msec)       #    0.999 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
                49      page-faults:u             #    0.109 K/sec                  
     1,406,560,677      cycles:u                  #    3.118 GHz                      (83.38%)
       132,787,307      stalled-cycles-frontend:u #    9.44% frontend cycles idle     (83.37%)
       106,997,916      stalled-cycles-backend:u  #    7.61% backend cycles idle      (66.75%)
     3,395,472,906      instructions:u            #    2.41  insn per cycle         
                                                  #    0.04  stalled cycles per insn  (83.38%)
       966,504,113      branches:u                # 2142.696 M/sec                    (83.37%)
        11,790,044      branch-misses:u           #    1.22% of all branches          (83.39%)

       0.451607453 seconds time elapsed


 Performance counter stats for './a.out':

        449.328189      task-clock:u (msec)       #    0.999 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
                49      page-faults:u             #    0.109 K/sec                  
     1,408,416,612      cycles:u                  #    3.134 GHz                      (83.31%)
       129,766,221      stalled-cycles-frontend:u #    9.21% frontend cycles idle     (83.31%)
       116,797,583      stalled-cycles-backend:u  #    8.29% backend cycles idle      (66.62%)
     3,397,205,219      instructions:u            #    2.41  insn per cycle         
                                                  #    0.04  stalled cycles per insn  (83.31%)
       971,010,420      branches:u                # 2161.027 M/sec                    (83.31%)
        12,745,400      branch-misses:u           #    1.31% of all branches          (83.81%)

       0.449839176 seconds time elapsed


 Performance counter stats for './a.out':

        446.382825      task-clock:u (msec)       #    0.998 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
                48      page-faults:u             #    0.108 K/sec                  
     1,401,235,764      cycles:u                  #    3.139 GHz                      (83.20%)
       135,184,574      stalled-cycles-frontend:u #    9.65% frontend cycles idle     (83.20%)
       112,230,343      stalled-cycles-backend:u  #    8.01% backend cycles idle      (66.40%)
     3,371,533,704      instructions:u            #    2.41  insn per cycle         
                                                  #    0.04  stalled cycles per insn  (83.20%)
       972,968,920      branches:u                # 2179.674 M/sec                    (83.86%)
        12,100,921      branch-misses:u           #    1.24% of all branches          (83.39%)

       0.447111384 seconds time elapsed




> As long as we're microoptimizing, I found an article
> <https://realm.io/news/how-we-beat-cpp-stl-binary-search/> which, in
> essence, says that because the branches after the comparison are going
> to be unpredictable, you really want to be using conditional moves
> instead.  It might be worth trying a little to convince GCC and LLVM
> to do that.  (It's too bad that there's no __builtin_unpredictable()
> to go with __builtin_expect().)

thanks for the link.

	-ss


diff --git a/ChangeLog b/ChangeLog
index e0acd7d0c4..7142794922 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,8 @@ 
+2017-03-17  Sergey Senozhatsky  <sergey.senozhatsky@gmail.com>
+
+	[BZ #2753]
+	* bits/stdlib-bsearch.h: Fix integer overflow.
+
 2017-03-10  Stefan Liebler  <stli@linux.vnet.ibm.com>
 
 	* math/auto-libm-test-out-catan: Regenerated.
diff --git a/bits/stdlib-bsearch.h b/bits/stdlib-bsearch.h
index eb145381fd..5fd8a8b607 100644
--- a/bits/stdlib-bsearch.h
+++ b/bits/stdlib-bsearch.h
@@ -28,7 +28,7 @@  bsearch (const void *__key, const void *__base, size_t __nmemb, size_t __size,
   __u = __nmemb;
   while (__l < __u)
     {
-      __idx = (__l + __u) / 2;
+      __idx = __l + (__u - __l) / 2;
       __p = (void *) (((const char *) __base) + (__idx * __size));
       __comparison = (*__compar) (__key, __p);
       if (__comparison < 0)