Message ID: CAMe9rOqwAAwAtDEWEsdcafSfQBRQXqPjEw9gfeVAVvp9ciVryQ@mail.gmail.com
State: New
On Fri, Jun 23, 2017 at 11:33 AM, H.J. Lu <hjl.tools@gmail.com> wrote:
> On Fri, Jun 23, 2017 at 9:42 AM, Florian Weimer <fweimer@redhat.com> wrote:
>> On 06/23/2017 06:38 PM, Carlos O'Donell wrote:
>>
>>> I assume that this catches the regression by ensuring the high values of
>>> the subtraction result in an underflow which results in a positive value
>>> of the subtraction and a wrong answer?
>>
>> Yes, I thought I said so in the commit message.
>>
>>> Was this comment ever accurate?  movzwl is not a BE load.
>>
>> We used bswap, so the register contents before the comparison are in
>> big-endian format.
>>
>>>> +	orl	%edi, %eax
>>>> +	orl	%esi, %ecx
>>>> +	/* Subtraction is okay because the upper 8 bits a zero.  */
>>>
>>> s/a zero/are zero/g
>>
>> Okay, I'll fix this typo in a follow-up commit.
>
> How about this patch to turn
>
>	movzbl	-1(%rdi, %rdx), %edi
>	movzbl	-1(%rsi, %rdx), %esi
>	orl	%edi, %eax
>	orl	%esi, %ecx
>
> into
>
>	movb	-1(%rdi, %rdx), %al
>	movb	-1(%rsi, %rdx), %cl

Here is the benchmark result on Haswell.

[hjl@gnu-6 glibc-test]$ make
./test
movb  : 19937666
movzbl: 21518186
[hjl@gnu-6 glibc-test]$
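The glibc-test harness behind these numbers is not included in the thread. As a rough illustration only, a standalone microbenchmark in the same spirit could look like the sketch below; the iteration count, the RDTSC-based timing, and the exact loop bodies are assumptions, not the actual test.

/* Hypothetical standalone benchmark comparing the two tail-load
   sequences; this is not the glibc-test harness from the thread.
   Build: gcc -O2 bench.c -o test  */
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>

#define ITERS 100000000ULL

static unsigned char buf[4] = { 1, 2, 3, 4 };

int main (void)
{
  uint64_t start, cycles_movb, cycles_movzbl;
  unsigned int aux;
  uint64_t i;

  /* movb variant: merge the byte directly into the low byte of the
     accumulator, as in the patched L(between_2_3).  */
  start = __rdtscp (&aux);
  for (i = 0; i < ITERS; i++)
    __asm__ volatile ("movl $0x0100, %%eax\n\t"
                      "movb 1(%0), %%al"
                      :: "r" (buf) : "eax", "memory");
  cycles_movb = __rdtscp (&aux) - start;

  /* movzbl variant: zero-extend into a scratch register and OR it in,
     as in the original code.  */
  start = __rdtscp (&aux);
  for (i = 0; i < ITERS; i++)
    __asm__ volatile ("movl $0x0100, %%eax\n\t"
                      "movzbl 1(%0), %%edi\n\t"
                      "orl %%edi, %%eax"
                      :: "r" (buf) : "eax", "edi", "memory");
  cycles_movzbl = __rdtscp (&aux) - start;

  printf ("movb  : %llu\n", (unsigned long long) cycles_movb);
  printf ("movzbl: %llu\n", (unsigned long long) cycles_movzbl);
  return 0;
}

On the numbers quoted above, the movb sequence takes roughly 7% fewer cycles than the movzbl/orl sequence on that Haswell machine.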
On 06/23/2017 09:12 PM, H.J. Lu wrote:
>>	movzbl	-1(%rdi, %rdx), %edi
>>	movzbl	-1(%rsi, %rdx), %esi
>>	orl	%edi, %eax
>>	orl	%esi, %ecx
>>
>> into
>>
>>	movb	-1(%rdi, %rdx), %al
>>	movb	-1(%rsi, %rdx), %cl
>
> Here is the benchmark result on Haswell.
>
> [hjl@gnu-6 glibc-test]$ make
> ./test
> movb  : 19937666
> movzbl: 21518186
> [hjl@gnu-6 glibc-test]$

Interesting.  So there isn't a steep penalty for partial register writes
anymore?  Your patch is a nice improvement then.

Thanks,
Florian
On Fri, Jun 23, 2017 at 12:27 PM, Florian Weimer <fweimer@redhat.com> wrote:
> On 06/23/2017 09:12 PM, H.J. Lu wrote:
>
>>>	movzbl	-1(%rdi, %rdx), %edi
>>>	movzbl	-1(%rsi, %rdx), %esi
>>>	orl	%edi, %eax
>>>	orl	%esi, %ecx
>>>
>>> into
>>>
>>>	movb	-1(%rdi, %rdx), %al
>>>	movb	-1(%rsi, %rdx), %cl
>>
>> Here is the benchmark result on Haswell.
>>
>> [hjl@gnu-6 glibc-test]$ make
>> ./test
>> movb  : 19937666
>> movzbl: 21518186
>> [hjl@gnu-6 glibc-test]$
>
> Interesting.  So there isn't a steep penalty for partial register writes
> anymore?  Your patch is a nice improvement then.

The Intel Optimization Guide has:

  In Intel microarchitecture code name Sandy Bridge, partial register
  access is handled in hardware by inserting a micro-op that merges the
  partial register with the full register in the following cases:

  • After a write to one of the registers AH, BH, CH or DH and before a
    following read of the 2-, 4- or 8-byte form of the same register.  In
    these cases a merge micro-op is inserted.  The insertion consumes a
    full allocation cycle in which other micro-ops cannot be allocated.

  • After a micro-op with a destination register of 1 or 2 bytes, which is
    not a source of the instruction (or the register's bigger form), and
    before a following read of a 2-, 4- or 8-byte form of the same
    register.  In these cases the merge micro-op is part of the flow.

Neither case applies here on Haswell and Skylake.  I am checking in my
patch now.
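For concreteness, the two cases quoted from the guide correspond to instruction patterns like the ones below. This is an illustrative sketch only: the helper names and register choices are arbitrary and are not taken from the patch or from glibc.

/* Case 1 from the quoted text: a write to AH followed by a read of the
   4-byte form (EAX).  Per the quoted text, a merge micro-op is inserted
   and its insertion consumes a full allocation cycle.  */
static inline unsigned int
high_byte_then_full_read (unsigned int x)
{
  __asm__ ("movb $1, %%ah\n\t"   /* partial write to AH */
           "addl $0, %%eax"      /* read of EAX */
           : "+a" (x));
  return x;
}

/* Case 2 from the quoted text: a 1-byte destination (here AL) that is not
   also a source, followed by a read of EAX.  Per the quoted text, the
   merge micro-op is part of the flow rather than a separate allocation
   cycle.  */
static inline unsigned int
low_byte_load_then_full_read (const unsigned char *p, unsigned int x)
{
  __asm__ ("movb %1, %%al\n\t"   /* 1-byte load into AL */
           "addl $0, %%eax"      /* read of EAX */
           : "+a" (x)
           : "m" (*p));
  return x;
}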
From acecb3f7de4892b68ec1b464a576ee84b3f97527 Mon Sep 17 00:00:00 2001
From: "H.J. Lu" <hjl.tools@gmail.com>
Date: Fri, 23 Jun 2017 11:29:38 -0700
Subject: [PATCH] x86-64: Optimize L(between_2_3) in memcmp-avx2-movbe.S

Turn

	movzbl	-1(%rdi, %rdx), %edi
	movzbl	-1(%rsi, %rdx), %esi
	orl	%edi, %eax
	orl	%esi, %ecx

into

	movb	-1(%rdi, %rdx), %al
	movb	-1(%rsi, %rdx), %cl

	* sysdeps/x86_64/multiarch/memcmp-avx2-movbe.S (between_2_3):
	Replace movzbl and orl with movb.
---
 sysdeps/x86_64/multiarch/memcmp-avx2-movbe.S | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/sysdeps/x86_64/multiarch/memcmp-avx2-movbe.S b/sysdeps/x86_64/multiarch/memcmp-avx2-movbe.S
index 9d19210..abcc61c 100644
--- a/sysdeps/x86_64/multiarch/memcmp-avx2-movbe.S
+++ b/sysdeps/x86_64/multiarch/memcmp-avx2-movbe.S
@@ -144,11 +144,9 @@ L(between_2_3):
 	shll	$8, %ecx
 	bswap	%eax
 	bswap	%ecx
-	movzbl	-1(%rdi, %rdx), %edi
-	movzbl	-1(%rsi, %rdx), %esi
-	orl	%edi, %eax
-	orl	%esi, %ecx
-	/* Subtraction is okay because the upper 8 bits a zero.  */
+	movb	-1(%rdi, %rdx), %al
+	movb	-1(%rsi, %rdx), %cl
+	/* Subtraction is okay because the upper 8 bits are zero.  */
 	subl	%ecx, %eax
 	ret
-- 
2.9.4
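The reason a plain movb into %al/%cl can replace movzbl plus orl is that the shll/bswap sequence leaves the low byte of %eax and %ecx zero, so storing the final byte there is equivalent to ORing it in. A rough C model of what the patched L(between_2_3) computes for 2- and 3-byte inputs is sketched below; the handling of the first two bytes (the loads feeding the shll/bswap context lines) is inferred from the surrounding code and is an assumption.

#include <stddef.h>
#include <stdint.h>

/* C model of the patched L(between_2_3) path for 2 <= n <= 3.  */
static int
memcmp_between_2_3 (const unsigned char *p, const unsigned char *q, size_t n)
{
  /* movzwl/shll/bswap (inferred): place byte 0 in bits 23:16 and byte 1 in
     bits 15:8, i.e. build a big-endian value from the first two bytes.  */
  uint32_t a = ((uint32_t) p[0] << 16) | ((uint32_t) p[1] << 8);
  uint32_t b = ((uint32_t) q[0] << 16) | ((uint32_t) q[1] << 8);

  /* movb -1(%rdi, %rdx), %al / movb -1(%rsi, %rdx), %cl: drop the last
     byte (index n - 1) into bits 7:0.  For n == 2 this duplicates byte 1,
     which is harmless because both sides duplicate it the same way.  */
  a |= p[n - 1];
  b |= q[n - 1];

  /* subl %ecx, %eax: both values fit in 24 bits, so the upper 8 bits are
     zero, the subtraction cannot wrap, and its sign matches memcmp's
     contract for the first differing byte.  */
  return (int) a - (int) b;
}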