
[v2] x86: Refactor and improve performance of strchr-avx2.S

Message ID 20210121001206.471283-1-goldstein.w.n@gmail.com
State New
Series [v2] x86: Refactor and improve performance of strchr-avx2.S

Commit Message

Noah Goldstein Jan. 21, 2021, 12:12 a.m. UTC
No bug. It just seemed like the performance could be improved a bit.
Observed and expected behavior are unchanged. Optimized the body of the
main loop, updated the page cross logic and optimized accordingly, and
made a few minor instruction selection modifications. No regressions in
the test suite: test-strchrnul, test-strchr, test-wcschrnul, and
test-wcschr all pass.

Author: noah <goldstein.w.n@gmail.com>
---
>>
>> No bug. Just seemed the performance could be improved a bit. Observed
>> and expected behavior are unchanged. Optimized body of main
>> loop. Updated page cross logic and optimized accordingly. Made a few
>> minor instruction selection modifications. No regressions in test
>> suite. Both test-strchrnul and test-strchr passed.
>
>Thank you very much.
>

:) this is my first patch (hopefully). It's exciting!

>> Author: noah <goldstein.w.n@gmail.com>
>> ---
>> Possible improvements fall into 5 categories roughly ordered by
>> importance:
>>
>> 1 - The main loop that handles 4 vectors at a time has 4 uops removed
>>     from it. This is at the cost of additional port pressure on ports
>>     0/1 as vpminub and vpminud can only run on those ports whereas
>>     vpor can run on ports 0/1/5. But the 4 saved uops should more than
>>     make up for that. Analysis either by latency, throughput, or
>>     benchmarks shows it's a performance improvement.
>>
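
To spell out the trick in point 1: per byte, min (b ^ c, b) is zero
exactly when b == c or b == 0, so one compare-against-zero after a tree
of byte-wise minimums replaces the per-vector cmpeq/cmpeq/or sequence.
A scalar C model (just to sanity-check the identity, not code from the
patch):

#include <assert.h>

static unsigned char
match_via_min (unsigned char b, unsigned char c)
{
  unsigned char x = (unsigned char) (b ^ c);  /* 0 iff b == c.  */
  return x < b ? x : b;                       /* Byte-wise vpminub.  */
}

static unsigned char
match_via_or (unsigned char b, unsigned char c)
{
  /* Old per-vector sequence: VPCMPEQ CHAR, VPCMPEQ null, VPOR.  */
  return (unsigned char) (-(b == c) | -(b == 0));
}

int
main (void)
{
  for (int b = 0; b < 256; b++)
    for (int c = 0; c < 256; c++)
      assert ((match_via_min (b, c) == 0) == (match_via_or (b, c) != 0));
  return 0;
}
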
>> 2 - As far as I can tell the cros_page_boundary logic was broken (or
>>     the jump label was especially confusing). The original code
>>     tested for this by checking if the load would split cache lines
>>     not pages. I don't think there are any x86_64 architectures that
>
>The original code has
>
>        /* Check if we may cross page boundary with one vector load.  */
>        andl    $(2 * VEC_SIZE - 1), %ecx
>        cmpl    $VEC_SIZE, %ecx
>        ja      L(cros_page_boundary)
>
>It is just very conservative.   It never fails to cross page boundary.
>

I see. Got it.
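
Just to confirm I am reading both versions the same way, the two checks
in C terms are roughly the following (VEC_SIZE 32, page size assumed
4096; only my reading of the asm, not code from the patch):

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VEC_SIZE  32
#define PAGE_SIZE 4096

/* v1: fires whenever the load may straddle (conservative).  */
static bool
old_check (uintptr_t p)
{
  return (p & (2 * VEC_SIZE - 1)) > VEC_SIZE;
}

/* v2: fires only when the 32-byte load actually crosses a page.  */
static bool
new_check (uintptr_t p)
{
  return (p & (PAGE_SIZE - 1)) > PAGE_SIZE - VEC_SIZE;
}

int
main (void)
{
  /* Every real page cross is also caught by the old check...  */
  for (uintptr_t p = 0; p < PAGE_SIZE; p++)
    if (new_check (p))
      assert (old_check (p));
  /* ... but the old check also fires on plain cache-line splits.  */
  assert (old_check (48) && !new_check (48));
  return 0;
}

So the old test never misses a page cross; it just also sends many
harmless cache-line splits down the slow path, which is what this patch
changes.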

>>     support both AVX2 (Haswell and newer) and don't have a minimum
>>     page size of 4kb. Given that the cost of splitting a cacheline
>>     appears to be low on CPUs that support AVX2
>>     [https://www.agner.org/optimize/blog/read.php?i=415#423] and this
>>     is a one-off I don't see it as being critical to avoid. If it is
>>     critical to avoid a cacheline split load, I think a branchless
>>     approach might be better. Thoughts? Note: Given that the check was
>>     changed to only be for page crosses, I think it is significantly
>>     less likely than before so I moved the branch target from such
>>     prime real estate. I am also unsure if there is a PAGE_SIZE define
>>     in glibc somewhere I can use instead of defining it here.
>
>Define PAGE_SIZE to 4096 is fine.
>

Got it.

>> 3 - What was originally the more_4x_vec label was removed and the
>>     code only does 3 vector blocks now. The reasoning is as follows;
>>     code only does 3 vector blocks now. The reasoning is as follows:
>>     or fall through (no page cross). The fall through case is more
>>     important to optimize for assuming the point above. In this case
>>     the incoming alignments (i.e. alignment of ptr in rdi) mod 128 can
>>     be [0 - 31], [32 - 63], [64 - 95], [96 - 127]. Doing 4 vector
>>     blocks optimizes for the [96 - 127] case so that when the main 4x loop
>>     is hit, a new 128 byte aligned segment can be started. Doing 3
>>     vector blocks optimizes for the [0 - 31] case. I generally think
>>     the string is more likely to be aligned to cache line size (or L2
>>     prefetcher cache line pairs) than at [96 - 127] bytes. An
>>     alternative would be to make that code do 5x vector blocks. This
>>     would mean that at most 2 vector blocks are wasted when
>>     realigning to 128 bytes (as opposed to 3x or 4x which can allow
>>     for 3 vector blocks to be wasted). Thoughts?
>
>I picked 4x because it was how I unrolled in other string functions.
>3x is fine if it is faster than 4x.
>

I see. What are your thoughts on doing 5x? With 5x the worst case is
that 2 vector blocks get wasted on realignment, though it adds overhead
for A) getting to the 4x loop and B) the [0 - 31], [32 - 63], and
[96 - 127] initial alignment cases.
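
For reference, this is the little model I am using to count how many
vector blocks the 4x loop re-scans for the current 3x prologue and the
old 4x one (it only mirrors the align-down-to-4*VEC_SIZE step; a sketch
of my reasoning, not code from the patch):

#include <stdio.h>

#define VEC_SIZE 32

/* Vector blocks the 4x loop re-scans, given the pointer's initial
   alignment mod 128 (rounded down to a multiple of VEC_SIZE) and the
   number of aligned vector blocks checked before the loop.  */
static int
rescanned (int align, int nvec)
{
  int reached = align + VEC_SIZE + nvec * VEC_SIZE;
  int loop_start = reached & -(4 * VEC_SIZE);  /* andq $-(4 * VEC_SIZE), %rdi */
  return (reached - loop_start) / VEC_SIZE;
}

int
main (void)
{
  for (int nvec = 3; nvec <= 4; nvec++)
    {
      printf ("%dx prologue:", nvec);
      for (int align = 0; align < 128; align += VEC_SIZE)
        printf ("  [%3d-%3d] -> %d", align, align + VEC_SIZE - 1,
                rescanned (align, nvec));
      printf ("\n");
    }
  return 0;
}

It prints 0/1/2/3 re-scanned blocks for the 3x version and 1/2/3/0 for
the 4x one across the four alignment classes; the same helper can be
pointed at a 5x variant depending on how its tail realigns.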

>> 4 - Replace the sarl using the %cl partial register with sarx. This
>>     assumes BMI2, which is Haswell and newer, but AVX2 implies Haswell
>>     and newer so I think it is safe.
>
>Need to add a BMI2 check in strchr.c:
>
>&& CPU_FEATURE_USABLE_P (cpu_features, BMI2)
>

Done.

>>
>> 5 - In the first_vec_xN return blocks change the addq from using rax
>>     as a destination to rdi as a destination. This just shortens the
>>     latency chain by 1 uop.
>
>Sounds reasonable.
>

One thing I forgot to mention in the original email: I also removed a
few unnecessary instructions on ecx (it was being realigned along with
rdi but never used).

>> Benchmarks:
>> I can post my benchmarking code in the email thread if that is
>> appreciated. I benchmarked a variety of cases with different
>> alignments, sizes, and data hotness (in L1, L2, etc.), so I can't
>> just give a single percentage-improvement number. They were also run
>> on my personal computer (icelake-client). Of my 2732 test cases,
>> 1985 saw an improvement with these changes and 747 performed better
>> with the original code. I should note, however, that my test cases
>> had a disproportionate number of cases with 4kb page crosses, which
>> as discussed I moved to a colder path.
>
>Please submit a separate patch to add your workload to
>benchtests/bench-strchr.c.
>

Got it. A few questions. First, should the benchmark have a single
return value to indicate pass/failure (for a performance regression),
or is the idea just to output times so it can be used as a tool for
future patches? Second, is the coding standard for benchmarks the same
as for production code, or are the rules more relaxed? Last, should
there be a separate benchmark for each of strchr, strchrnul, wcschr,
and wcschrnul, or is having them all in one file with #defines the
best approach?
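
In the meantime, the rough shape of my local harness is below (a
minimal standalone sketch, not benchtests/ code; the alignment/length
grid and iteration counts are just placeholders):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Call strchr through a volatile pointer so the compiler cannot hoist
   or fold the calls away.  */
static char *(*volatile strchr_fn) (const char *, int) = strchr;

static double
bench_one (size_t len, size_t align, int iters)
{
  char *buf;
  if (posix_memalign ((void **) &buf, 4096, align + len + 1) != 0)
    abort ();
  char *s = buf + align;
  memset (s, 'a', len);
  s[len - 1] = 'b';      /* Match placed at the end: scan the whole string.  */
  s[len] = '\0';

  struct timespec t0, t1;
  clock_gettime (CLOCK_MONOTONIC, &t0);
  for (int i = 0; i < iters; i++)
    if (strchr_fn (s, 'b') == NULL)
      abort ();
  clock_gettime (CLOCK_MONOTONIC, &t1);

  free (buf);
  return ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / iters;
}

int
main (void)
{
  for (size_t align = 0; align < 128; align += 32)
    for (size_t len = 32; len <= 4096; len *= 2)
      printf ("align=%3zu len=%5zu  %8.1f ns/call\n",
              align, len, bench_one (len, align, 100000));
  return 0;
}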

>> In general the effects of this change are:
>>
>> Large/medium sized strings (from any part of memory really) see a
>> 10-30% performance boost.
>> Small strings that are not page crosses see a 10-20% performance boost.
>> Small strings that are cache line splits see a 20-30% performance boost.
>> Small strings that cross a page boundary (4kb page that is) see a 10%
>> performance regression.
>>
>> No regressions in test suite. Both test-strchrnul and test-strchr
>> passed.
>
>It is also used for wcschr and wcschrnul.  Do they work correctly with
>your change?
>

Yes. I updated the commit message to indicate that those tests pass as
well, and updated my benchmark to test their performance too.

>> I would love to hear your feedback on all of these points (or any
>> that I missed).
>>
>> FSF Documentation has been signed and returned (via pub key and email
>> respectively)
 sysdeps/x86_64/multiarch/strchr-avx2.S | 177 +++++++++++++------------
 sysdeps/x86_64/multiarch/strchr.c      |   2 +
 2 files changed, 91 insertions(+), 88 deletions(-)

Patch

diff --git a/sysdeps/x86_64/multiarch/strchr-avx2.S b/sysdeps/x86_64/multiarch/strchr-avx2.S
index d416558d04..628242c483 100644
--- a/sysdeps/x86_64/multiarch/strchr-avx2.S
+++ b/sysdeps/x86_64/multiarch/strchr-avx2.S
@@ -27,10 +27,12 @@ 
 # ifdef USE_AS_WCSCHR
 #  define VPBROADCAST	vpbroadcastd
 #  define VPCMPEQ	vpcmpeqd
+#  define VPMINU	vpminud
 #  define CHAR_REG	esi
 # else
 #  define VPBROADCAST	vpbroadcastb
 #  define VPCMPEQ	vpcmpeqb
+#  define VPMINU	vpminub
 #  define CHAR_REG	sil
 # endif
 
@@ -39,7 +41,8 @@ 
 # endif
 
 # define VEC_SIZE 32
-
+# define PAGE_SIZE 4096
+    
 	.section .text.avx,"ax",@progbits
 ENTRY (STRCHR)
 	movl	%edi, %ecx
@@ -47,10 +50,10 @@  ENTRY (STRCHR)
 	vmovd	%esi, %xmm0
 	vpxor	%xmm9, %xmm9, %xmm9
 	VPBROADCAST %xmm0, %ymm0
-	/* Check if we may cross page boundary with one vector load.  */
-	andl	$(2 * VEC_SIZE - 1), %ecx
-	cmpl	$VEC_SIZE, %ecx
-	ja	L(cros_page_boundary)
+	/* Check if we cross page boundary with one vector load.  */
+	andl	$(PAGE_SIZE - 1), %ecx
+	cmpl	$(PAGE_SIZE - VEC_SIZE), %ecx
+	ja	L(cross_page_boundary)
 
 	/* Check the first VEC_SIZE bytes.  Search for both CHAR and the
 	   null byte.  */
@@ -63,45 +66,11 @@  ENTRY (STRCHR)
 	jnz	L(first_vec_x0)
 
 	/* Align data for aligned loads in the loop.  */
-	addq	$VEC_SIZE, %rdi
-	andl	$(VEC_SIZE - 1), %ecx
-	andq	$-VEC_SIZE, %rdi
-
-	jmp	L(more_4x_vec)
-
-	.p2align 4
-L(cros_page_boundary):
-	andl	$(VEC_SIZE - 1), %ecx
-	andq	$-VEC_SIZE, %rdi
-	vmovdqu	(%rdi), %ymm8
-	VPCMPEQ %ymm8, %ymm0, %ymm1
-	VPCMPEQ %ymm8, %ymm9, %ymm2
-	vpor	%ymm1, %ymm2, %ymm1
-	vpmovmskb %ymm1, %eax
-	/* Remove the leading bytes.  */
-	sarl	%cl, %eax
-	testl	%eax, %eax
-	jz	L(aligned_more)
-	/* Found CHAR or the null byte.  */
-	tzcntl	%eax, %eax
-	addq	%rcx, %rax
-# ifdef USE_AS_STRCHRNUL
-	addq	%rdi, %rax
-# else
-	xorl	%edx, %edx
-	leaq	(%rdi, %rax), %rax
-	cmp	(%rax), %CHAR_REG
-	cmovne	%rdx, %rax
-# endif
-	VZEROUPPER
-	ret
-
-	.p2align 4
+    andq	$-VEC_SIZE, %rdi
 L(aligned_more):
 	addq	$VEC_SIZE, %rdi
 
-L(more_4x_vec):
-	/* Check the first 4 * VEC_SIZE.  Only one VEC_SIZE at a time
+	/* Check the next 3 * VEC_SIZE.  Only one VEC_SIZE at a time
 	   since data is only aligned to VEC_SIZE.  */
 	vmovdqa	(%rdi), %ymm8
 	VPCMPEQ %ymm8, %ymm0, %ymm1
@@ -127,19 +96,9 @@  L(more_4x_vec):
 	testl	%eax, %eax
 	jnz	L(first_vec_x2)
 
-	vmovdqa	(VEC_SIZE * 3)(%rdi), %ymm8
-	VPCMPEQ %ymm8, %ymm0, %ymm1
-	VPCMPEQ %ymm8, %ymm9, %ymm2
-	vpor	%ymm1, %ymm2, %ymm1
-	vpmovmskb %ymm1, %eax
-	testl	%eax, %eax
-	jnz	L(first_vec_x3)
-
-	addq	$(VEC_SIZE * 4), %rdi
-
+    
 	/* Align data to 4 * VEC_SIZE.  */
-	movq	%rdi, %rcx
-	andl	$(4 * VEC_SIZE - 1), %ecx
+	addq	$(VEC_SIZE * 3), %rdi
 	andq	$-(4 * VEC_SIZE), %rdi
 
 	.p2align 4
@@ -150,34 +109,61 @@  L(loop_4x_vec):
 	vmovdqa	(VEC_SIZE * 2)(%rdi), %ymm7
 	vmovdqa	(VEC_SIZE * 3)(%rdi), %ymm8
 
-	VPCMPEQ %ymm5, %ymm0, %ymm1
-	VPCMPEQ %ymm6, %ymm0, %ymm2
-	VPCMPEQ %ymm7, %ymm0, %ymm3
-	VPCMPEQ %ymm8, %ymm0, %ymm4
-
-	VPCMPEQ %ymm5, %ymm9, %ymm5
-	VPCMPEQ %ymm6, %ymm9, %ymm6
-	VPCMPEQ %ymm7, %ymm9, %ymm7
-	VPCMPEQ %ymm8, %ymm9, %ymm8
+    /* Leaves only CHARS matching esi as 0.  */
+    vpxor %ymm5, %ymm0, %ymm1
+    vpxor %ymm6, %ymm0, %ymm2
+    vpxor %ymm7, %ymm0, %ymm3
+    vpxor %ymm8, %ymm0, %ymm4
 
-	vpor	%ymm1, %ymm5, %ymm1
-	vpor	%ymm2, %ymm6, %ymm2
-	vpor	%ymm3, %ymm7, %ymm3
-	vpor	%ymm4, %ymm8, %ymm4
+	VPMINU	%ymm1, %ymm5, %ymm1
+	VPMINU	%ymm2, %ymm6, %ymm2
+	VPMINU	%ymm3, %ymm7, %ymm3
+	VPMINU	%ymm4, %ymm8, %ymm4
 
-	vpor	%ymm1, %ymm2, %ymm5
-	vpor	%ymm3, %ymm4, %ymm6
+	VPMINU	%ymm1, %ymm2, %ymm5
+	VPMINU	%ymm3, %ymm4, %ymm6
 
-	vpor	%ymm5, %ymm6, %ymm5
+	VPMINU	%ymm5, %ymm6, %ymm5
 
+    VPCMPEQ %ymm5, %ymm9, %ymm5
 	vpmovmskb %ymm5, %eax
+    addq	$(VEC_SIZE * 4), %rdi
+    
 	testl	%eax, %eax
-	jnz	L(4x_vec_end)
+	jz	L(loop_4x_vec)
 
-	addq	$(VEC_SIZE * 4), %rdi
+    subq	$(VEC_SIZE * 4), %rdi
+
+L(4x_vec_end):
+    VPCMPEQ %ymm1, %ymm9, %ymm1
+	vpmovmskb %ymm1, %eax
+	testl	%eax, %eax
+	jnz	L(first_vec_x0)
+    VPCMPEQ %ymm2, %ymm9, %ymm2
+	vpmovmskb %ymm2, %eax
+	testl	%eax, %eax
+	jnz	L(first_vec_x1)
+    VPCMPEQ %ymm3, %ymm9, %ymm3
+	vpmovmskb %ymm3, %eax
+	testl	%eax, %eax
+	jnz	L(first_vec_x2)
+    VPCMPEQ %ymm4, %ymm9, %ymm4
+	vpmovmskb %ymm4, %eax
 
-	jmp	L(loop_4x_vec)
+	tzcntl	%eax, %eax
+# ifdef USE_AS_STRCHRNUL
+	addq	$(VEC_SIZE * 3), %rdi
+	addq	%rdi, %rax
+# else
+	xorl	%edx, %edx
+	leaq	(VEC_SIZE * 3)(%rdi, %rax), %rax
+	cmp	(%rax), %CHAR_REG
+	cmovne	%rdx, %rax
+# endif
+	VZEROUPPER
+	ret
 
+    
 	.p2align 4
 L(first_vec_x0):
 	/* Found CHAR or the null byte.  */
@@ -197,7 +183,7 @@  L(first_vec_x0):
 L(first_vec_x1):
 	tzcntl	%eax, %eax
 # ifdef USE_AS_STRCHRNUL
-	addq	$VEC_SIZE, %rax
+	addq	$VEC_SIZE, %rdi
 	addq	%rdi, %rax
 # else
 	xorl	%edx, %edx
@@ -212,7 +198,7 @@  L(first_vec_x1):
 L(first_vec_x2):
 	tzcntl	%eax, %eax
 # ifdef USE_AS_STRCHRNUL
-	addq	$(VEC_SIZE * 2), %rax
+	addq	$(VEC_SIZE * 2), %rdi
 	addq	%rdi, %rax
 # else
 	xorl	%edx, %edx
@@ -223,32 +209,47 @@  L(first_vec_x2):
 	VZEROUPPER
 	ret
 
+    /* Cold case for crossing page with first load.  */
 	.p2align 4
-L(4x_vec_end):
+L(cross_page_boundary):
+	andl	$(VEC_SIZE - 1), %ecx
+	andq	$-VEC_SIZE, %rdi
+	vmovdqu	(%rdi), %ymm8
+	VPCMPEQ %ymm8, %ymm0, %ymm1
+	VPCMPEQ %ymm8, %ymm9, %ymm2
+	vpor	%ymm1, %ymm2, %ymm1
 	vpmovmskb %ymm1, %eax
+	/* Remove the leading bits.  */
+	sarxl	%ecx, %eax, %eax
 	testl	%eax, %eax
-	jnz	L(first_vec_x0)
-	vpmovmskb %ymm2, %eax
-	testl	%eax, %eax
-	jnz	L(first_vec_x1)
-	vpmovmskb %ymm3, %eax
-	testl	%eax, %eax
-	jnz	L(first_vec_x2)
-	vpmovmskb %ymm4, %eax
+	jnz	L(cross_page_return)
+	
+    /* Second block so that the 3 other blocks from L(aligned_more)
+       will get to next 4 * VEC_SIZE alignment.  */
+    andq	$-VEC_SIZE, %rdi
+    addq	$VEC_SIZE, %rdi
+    xorl    %ecx, %ecx
+    vmovdqa	(%rdi), %ymm8
+	VPCMPEQ %ymm8, %ymm0, %ymm1
+	VPCMPEQ %ymm8, %ymm9, %ymm2
+	vpor	%ymm1, %ymm2, %ymm1
+	vpmovmskb %ymm1, %eax
 	testl	%eax, %eax
-L(first_vec_x3):
+	jz	L(aligned_more)
+
+L(cross_page_return):
 	tzcntl	%eax, %eax
+	addq	%rcx, %rax
 # ifdef USE_AS_STRCHRNUL
-	addq	$(VEC_SIZE * 3), %rax
 	addq	%rdi, %rax
 # else
 	xorl	%edx, %edx
-	leaq	(VEC_SIZE * 3)(%rdi, %rax), %rax
+	leaq	(%rdi, %rax), %rax
 	cmp	(%rax), %CHAR_REG
 	cmovne	%rdx, %rax
 # endif
 	VZEROUPPER
 	ret
-
+    
 END (STRCHR)
-#endif
+# endif
diff --git a/sysdeps/x86_64/multiarch/strchr.c b/sysdeps/x86_64/multiarch/strchr.c
index 583a152794..77ff44b8f3 100644
--- a/sysdeps/x86_64/multiarch/strchr.c
+++ b/sysdeps/x86_64/multiarch/strchr.c
@@ -37,6 +37,8 @@  IFUNC_SELECTOR (void)
 
   if (!CPU_FEATURES_ARCH_P (cpu_features, Prefer_No_VZEROUPPER)
       && CPU_FEATURE_USABLE_P (cpu_features, AVX2)
+      && CPU_FEATURE_USABLE_P (cpu_features, BMI1)
+      && CPU_FEATURE_USABLE_P (cpu_features, BMI2)
       && CPU_FEATURES_ARCH_P (cpu_features, AVX_Fast_Unaligned_Load))
     return OPTIMIZE (avx2);