
x86_64: Optimize ffsll function code size.

Message ID 20230726160524.1955013-1-skpgkp2@gmail.com
State New
Series x86_64: Optimize ffsll function code size.

Commit Message

Sunil Pandey July 26, 2023, 4:05 p.m. UTC
The ffsll function is 17 bytes; this patch reduces it to 16 bytes.
Currently ffsll randomly regresses by ~20%, depending on how the code
gets aligned.

This patch fixes the random ffsll performance regression.
---
 sysdeps/x86_64/ffsll.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Richard Henderson July 26, 2023, 4:38 p.m. UTC | #1
On 7/26/23 09:05, Sunil K Pandey via Libc-alpha wrote:
> Ffsll function size is 17 byte, this patch optimizes size to 16 byte.
> Currently ffsll function randomly regress by ~20%, depending on how
> code get aligned.
> 
> This patch fixes ffsll function random performance regression.
> ---
>   sysdeps/x86_64/ffsll.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/sysdeps/x86_64/ffsll.c b/sysdeps/x86_64/ffsll.c
> index a1c13d4906..dbded6f0a1 100644
> --- a/sysdeps/x86_64/ffsll.c
> +++ b/sysdeps/x86_64/ffsll.c
> @@ -29,7 +29,7 @@ ffsll (long long int x)
>     long long int tmp;
>   
>     asm ("bsfq %2,%0\n"		/* Count low bits in X and store in %1.  */
> -       "cmoveq %1,%0\n"		/* If number was zero, use -1 as result.  */
> +       "cmove %k1,%k0\n"	/* If number was zero, use -1 as result.  */

This no longer produces -1, but 0xffffffff in cnt.  However, since the return type is 
'int', cnt need not be 'long long int' either.  I'm not sure why tmp exists at all, since 
cnt is the only register modified.
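A minimal sketch of that simplification (hypothetical, not the committed
code: cnt as a plain int, with -1 fed directly as an input operand instead
of through tmp):

```
int
ffsll (long long int x)
{
  int cnt;

  /* bsfq writes the bit index; cmovel substitutes -1 when x was zero.  */
  asm ("bsfq %1,%q0\n"
       "cmovel %2,%0\n"
       : "=&r" (cnt) : "rm" (x), "r" (-1));

  return cnt + 1;
}
```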


r~
H.J. Lu July 26, 2023, 4:50 p.m. UTC | #2
On Wed, Jul 26, 2023 at 9:38 AM Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> On 7/26/23 09:05, Sunil K Pandey via Libc-alpha wrote:
> > Ffsll function size is 17 byte, this patch optimizes size to 16 byte.
> > Currently ffsll function randomly regress by ~20%, depending on how
> > code get aligned.
> >
> > This patch fixes ffsll function random performance regression.
> > ---
> >   sysdeps/x86_64/ffsll.c | 2 +-
> >   1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/sysdeps/x86_64/ffsll.c b/sysdeps/x86_64/ffsll.c
> > index a1c13d4906..dbded6f0a1 100644
> > --- a/sysdeps/x86_64/ffsll.c
> > +++ b/sysdeps/x86_64/ffsll.c
> > @@ -29,7 +29,7 @@ ffsll (long long int x)
> >     long long int tmp;
> >
> >     asm ("bsfq %2,%0\n"               /* Count low bits in X and store in %1.  */
> > -       "cmoveq %1,%0\n"              /* If number was zero, use -1 as result.  */
> > +       "cmove %k1,%k0\n"     /* If number was zero, use -1 as result.  */
>
> This no longer produces -1, but 0xffffffff in cnt.  However, since the return type is
> 'int', cnt need not be 'long long int' either.  I'm not sure why tmp exists at all, since
> cnt is the only register modified.
>
>
> r~

tmp was initialized to -1 with

: "=&r" (cnt), "=r" (tmp) : "rm" (x), "1" (-1));

H.J.
Noah Goldstein July 26, 2023, 4:51 p.m. UTC | #3
On Wed, Jul 26, 2023 at 11:38 AM Richard Henderson via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> On 7/26/23 09:05, Sunil K Pandey via Libc-alpha wrote:
> > Ffsll function size is 17 byte, this patch optimizes size to 16 byte.
> > Currently ffsll function randomly regress by ~20%, depending on how
> > code get aligned.
> >
> > This patch fixes ffsll function random performance regression.
> > ---
> >   sysdeps/x86_64/ffsll.c | 2 +-
> >   1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/sysdeps/x86_64/ffsll.c b/sysdeps/x86_64/ffsll.c
> > index a1c13d4906..dbded6f0a1 100644
> > --- a/sysdeps/x86_64/ffsll.c
> > +++ b/sysdeps/x86_64/ffsll.c
> > @@ -29,7 +29,7 @@ ffsll (long long int x)
> >     long long int tmp;
> >
> >     asm ("bsfq %2,%0\n"               /* Count low bits in X and store in %1.  */
> > -       "cmoveq %1,%0\n"              /* If number was zero, use -1 as result.  */
> > +       "cmove %k1,%k0\n"     /* If number was zero, use -1 as result.  */
Alternatively just store `-1` in dst before the `bsfq` and drop the
cmov altogether.
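A hedged sketch of that variant (it assumes bsf leaves the destination
unchanged on zero input, which Intel documents as undefined; the de-facto
behavior is discussed later in this thread):

```
int
ffsll (long long int x)
{
  int cnt = -1;

  /* Preload -1; on zero input bsfq is assumed to leave it untouched.  */
  asm ("bsfq %1,%q0" : "+r" (cnt) : "rm" (x));

  return cnt + 1;
}
```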
>
> This no longer produces -1, but 0xffffffff in cnt.  However, since the return type is
> 'int', cnt need not be 'long long int' either.  I'm not sure why tmp exists at all, since
> cnt is the only register modified.
>
>
> r~
Sunil Pandey July 26, 2023, 4:51 p.m. UTC | #4
On Wed, Jul 26, 2023 at 9:38 AM Richard Henderson <
richard.henderson@linaro.org> wrote:

> On 7/26/23 09:05, Sunil K Pandey via Libc-alpha wrote:
> > Ffsll function size is 17 byte, this patch optimizes size to 16 byte.
> > Currently ffsll function randomly regress by ~20%, depending on how
> > code get aligned.
> >
> > This patch fixes ffsll function random performance regression.
> > ---
> >   sysdeps/x86_64/ffsll.c | 2 +-
> >   1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/sysdeps/x86_64/ffsll.c b/sysdeps/x86_64/ffsll.c
> > index a1c13d4906..dbded6f0a1 100644
> > --- a/sysdeps/x86_64/ffsll.c
> > +++ b/sysdeps/x86_64/ffsll.c
> > @@ -29,7 +29,7 @@ ffsll (long long int x)
> >     long long int tmp;
> >
> >     asm ("bsfq %2,%0\n"               /* Count low bits in X and store
> in %1.  */
> > -       "cmoveq %1,%0\n"              /* If number was zero, use -1 as
> result.  */
> > +       "cmove %k1,%k0\n"     /* If number was zero, use -1 as result.
> */
>
> This no longer produces -1, but 0xffffffff in cnt.  However, since the
> return type is
> 'int', cnt need not be 'long long int' either.  I'm not sure why tmp
> exists at all, since
> cnt is the only register modified.
>

Here is the exact assembly produced with this change.
./build-x86_64-linux/string/ffsll.o:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <ffsll>:
   0: ba ff ff ff ff       mov    $0xffffffff,%edx
   5: 48 0f bc c7           bsf    %rdi,%rax
   9: 0f 44 c2             cmove  %edx,%eax
   c: 83 c0 01             add    $0x1,%eax
   f: c3                   ret


>
>
> r~
>
Noah Goldstein July 26, 2023, 4:59 p.m. UTC | #5
On Wed, Jul 26, 2023 at 11:52 AM Sunil Pandey via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> On Wed, Jul 26, 2023 at 9:38 AM Richard Henderson <
> richard.henderson@linaro.org> wrote:
>
> > On 7/26/23 09:05, Sunil K Pandey via Libc-alpha wrote:
> > > Ffsll function size is 17 byte, this patch optimizes size to 16 byte.
> > > Currently ffsll function randomly regress by ~20%, depending on how
> > > code get aligned.
> > >
> > > This patch fixes ffsll function random performance regression.
> > > ---
> > >   sysdeps/x86_64/ffsll.c | 2 +-
> > >   1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/sysdeps/x86_64/ffsll.c b/sysdeps/x86_64/ffsll.c
> > > index a1c13d4906..dbded6f0a1 100644
> > > --- a/sysdeps/x86_64/ffsll.c
> > > +++ b/sysdeps/x86_64/ffsll.c
> > > @@ -29,7 +29,7 @@ ffsll (long long int x)
> > >     long long int tmp;
> > >
> > >     asm ("bsfq %2,%0\n"               /* Count low bits in X and store
> > in %1.  */
> > > -       "cmoveq %1,%0\n"              /* If number was zero, use -1 as
> > result.  */
> > > +       "cmove %k1,%k0\n"     /* If number was zero, use -1 as result.
> > */
> >
> > This no longer produces -1, but 0xffffffff in cnt.  However, since the
> > return type is
> > 'int', cnt need not be 'long long int' either.  I'm not sure why tmp
> > exists at all, since
> > cnt is the only register modified.
> >
>
> Here is the exact assembly produced with this change.
> ./build-x86_64-linux/string/ffsll.o:     file format elf64-x86-64
>
>
> Disassembly of section .text:
>
> 0000000000000000 <ffsll>:
>    0: ba ff ff ff ff       mov    $0xffffffff,%edx
>    5: 48 0f bc c7           bsf    %rdi,%rax
>    9: 0f 44 c2             cmove  %edx,%eax
>    c: 83 c0 01             add    $0x1,%eax
>    f: c3                   ret
>

FWIW it should be:
```
0000000000000000 <.text>:
   0: b8 ff ff ff ff        mov    $0xffffffff,%eax
   5: 48 0f bc c7          bsf    %rdi,%rax
   9: ff c0                inc    %eax
```

And since it's inline asm, there's no reason not to get that.
>
> >
> >
> > r~
> >
Noah Goldstein July 26, 2023, 5:04 p.m. UTC | #6
On Wed, Jul 26, 2023 at 11:06 AM Sunil K Pandey via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> Ffsll function size is 17 byte, this patch optimizes size to 16 byte.
> Currently ffsll function randomly regress by ~20%, depending on how
> code get aligned.
>
> This patch fixes ffsll function random performance regression.
> ---
>  sysdeps/x86_64/ffsll.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/sysdeps/x86_64/ffsll.c b/sysdeps/x86_64/ffsll.c
> index a1c13d4906..dbded6f0a1 100644
> --- a/sysdeps/x86_64/ffsll.c
> +++ b/sysdeps/x86_64/ffsll.c
> @@ -29,7 +29,7 @@ ffsll (long long int x)
>    long long int tmp;
>
>    asm ("bsfq %2,%0\n"          /* Count low bits in X and store in %1.  */
> -       "cmoveq %1,%0\n"                /* If number was zero, use -1 as result.  */
> +       "cmove %k1,%k0\n"       /* If number was zero, use -1 as result.  */
>         : "=&r" (cnt), "=r" (tmp) : "rm" (x), "1" (-1));
>
Note, there is technically a "bug" in this code in that we clobber
"cc" but don't specify so.  Might as well fix that.
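A sketch of the constraint list with the clobber spelled out (hedged; see
Andrew Pinski's reply below on whether it is actually needed on x86):

```
  asm ("bsfq %2,%0\n"
       "cmove %k1,%k0\n"
       : "=&r" (cnt), "=r" (tmp)
       : "rm" (x), "1" (-1)
       : "cc");           /* explicitly declare the flags clobber */
```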
>    return cnt + 1;
> --
> 2.41.0
>
Adhemerval Zanella Netto July 26, 2023, 5:11 p.m. UTC | #7
On 26/07/23 13:59, Noah Goldstein via Libc-alpha wrote:
> On Wed, Jul 26, 2023 at 11:52 AM Sunil Pandey via Libc-alpha
> <libc-alpha@sourceware.org> wrote:
>>
>> On Wed, Jul 26, 2023 at 9:38 AM Richard Henderson <
>> richard.henderson@linaro.org> wrote:
>>
>>> On 7/26/23 09:05, Sunil K Pandey via Libc-alpha wrote:
>>>> Ffsll function size is 17 byte, this patch optimizes size to 16 byte.
>>>> Currently ffsll function randomly regress by ~20%, depending on how
>>>> code get aligned.
>>>>
>>>> This patch fixes ffsll function random performance regression.
>>>> ---
>>>>   sysdeps/x86_64/ffsll.c | 2 +-
>>>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/sysdeps/x86_64/ffsll.c b/sysdeps/x86_64/ffsll.c
>>>> index a1c13d4906..dbded6f0a1 100644
>>>> --- a/sysdeps/x86_64/ffsll.c
>>>> +++ b/sysdeps/x86_64/ffsll.c
>>>> @@ -29,7 +29,7 @@ ffsll (long long int x)
>>>>     long long int tmp;
>>>>
>>>>     asm ("bsfq %2,%0\n"               /* Count low bits in X and store
>>> in %1.  */
>>>> -       "cmoveq %1,%0\n"              /* If number was zero, use -1 as
>>> result.  */
>>>> +       "cmove %k1,%k0\n"     /* If number was zero, use -1 as result.
>>> */
>>>
>>> This no longer produces -1, but 0xffffffff in cnt.  However, since the
>>> return type is
>>> 'int', cnt need not be 'long long int' either.  I'm not sure why tmp
>>> exists at all, since
>>> cnt is the only register modified.
>>>
>>
>> Here is the exact assembly produced with this change.
>> ./build-x86_64-linux/string/ffsll.o:     file format elf64-x86-64
>>
>>
>> Disassembly of section .text:
>>
>> 0000000000000000 <ffsll>:
>>    0: ba ff ff ff ff       mov    $0xffffffff,%edx
>>    5: 48 0f bc c7           bsf    %rdi,%rax
>>    9: 0f 44 c2             cmove  %edx,%eax
>>    c: 83 c0 01             add    $0x1,%eax
>>    f: c3                   ret
>>
> 
> FWIW it should be:
> ```
> 0000000000000000 <.text>:
>    0: b8 ff ff ff ff        mov    $0xffffffff,%eax
>    5: 48 0f bc c7          bsf    %rdi,%rax
>    9: ff c0                inc    %eax
> ```
> 
> And since its in inline asm no reason not to get that.

Can't we just use compiler builtins instead and remove a bunch of asm?
GCC already optimizes ffsl/ffsll to the builtin if the architecture allows
it, so I think micro-optimizing it in libc.so with arch-specific code is
really not ideal.
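For reference, the builtin-based version can be as simple as this sketch
(the actual patch is [1] below):

```
int
__ffsll (long long int x)
{
  /* GCC expands __builtin_ffsll inline on targets with an ffs pattern.  */
  return __builtin_ffsll (x);
}
```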

With gcc 13 and my patch [1] I see:

$ objdump -d ./string/ffsll.os

./string/ffsll.os:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <__ffsll>:
   0:   48 0f bc ff             bsf    %rdi,%rdi
   4:   48 c7 c0 ff ff ff ff    mov    $0xffffffffffffffff,%rax
   b:   48 0f 44 f8             cmove  %rax,%rdi
   f:   8d 47 01                lea    0x1(%rdi),%eax
  12:   c3                      ret


[1] https://patchwork.sourceware.org/project/glibc/patch/20230717143431.2075924-1-adhemerval.zanella@linaro.org/
Andrew Pinski July 26, 2023, 5:25 p.m. UTC | #8
On Wed, Jul 26, 2023 at 10:05 AM Noah Goldstein via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> On Wed, Jul 26, 2023 at 11:06 AM Sunil K Pandey via Libc-alpha
> <libc-alpha@sourceware.org> wrote:
> >
> > Ffsll function size is 17 byte, this patch optimizes size to 16 byte.
> > Currently ffsll function randomly regress by ~20%, depending on how
> > code get aligned.
> >
> > This patch fixes ffsll function random performance regression.
> > ---
> >  sysdeps/x86_64/ffsll.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/sysdeps/x86_64/ffsll.c b/sysdeps/x86_64/ffsll.c
> > index a1c13d4906..dbded6f0a1 100644
> > --- a/sysdeps/x86_64/ffsll.c
> > +++ b/sysdeps/x86_64/ffsll.c
> > @@ -29,7 +29,7 @@ ffsll (long long int x)
> >    long long int tmp;
> >
> >    asm ("bsfq %2,%0\n"          /* Count low bits in X and store in %1.  */
> > -       "cmoveq %1,%0\n"                /* If number was zero, use -1 as result.  */
> > +       "cmove %k1,%k0\n"       /* If number was zero, use -1 as result.  */
> >         : "=&r" (cnt), "=r" (tmp) : "rm" (x), "1" (-1));
> >
> Note, there is technically a "bug" in this code in that we clobber
> "cc" but don't
> specify so.
> Mightly as well fix that.

On x86, flags/cc are always considered clobbered by inline asm;
ix86_md_asm_adjust will add the clobber if there was no asm flags output.
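For illustration, a hedged sketch of the flag-output case, where GCC
models the flag explicitly and does not add the implicit clobber:

```
int
bsf_source_was_zero (long long int x)
{
  long long int dst;
  int zf;

  /* "=@ccz" captures ZF as an output, so ix86_md_asm_adjust skips the
     implicit flags clobber.  bsf sets ZF when the source is zero.  */
  asm ("bsfq %2,%1" : "=@ccz" (zf), "=r" (dst) : "rm" (x));
  (void) dst;
  return zf;
}
```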

Thanks,
Andrew Pinski

> >    return cnt + 1;
> > --
> > 2.41.0
> >
Sunil Pandey July 26, 2023, 8:43 p.m. UTC | #9
On Wed, Jul 26, 2023 at 10:00 AM Noah Goldstein <goldstein.w.n@gmail.com>
wrote:

> On Wed, Jul 26, 2023 at 11:52 AM Sunil Pandey via Libc-alpha
> <libc-alpha@sourceware.org> wrote:
> >
> > On Wed, Jul 26, 2023 at 9:38 AM Richard Henderson <
> > richard.henderson@linaro.org> wrote:
> >
> > > On 7/26/23 09:05, Sunil K Pandey via Libc-alpha wrote:
> > > > Ffsll function size is 17 byte, this patch optimizes size to 16 byte.
> > > > Currently ffsll function randomly regress by ~20%, depending on how
> > > > code get aligned.
> > > >
> > > > This patch fixes ffsll function random performance regression.
> > > > ---
> > > >   sysdeps/x86_64/ffsll.c | 2 +-
> > > >   1 file changed, 1 insertion(+), 1 deletion(-)
> > > >
> > > > diff --git a/sysdeps/x86_64/ffsll.c b/sysdeps/x86_64/ffsll.c
> > > > index a1c13d4906..dbded6f0a1 100644
> > > > --- a/sysdeps/x86_64/ffsll.c
> > > > +++ b/sysdeps/x86_64/ffsll.c
> > > > @@ -29,7 +29,7 @@ ffsll (long long int x)
> > > >     long long int tmp;
> > > >
> > > >     asm ("bsfq %2,%0\n"               /* Count low bits in X and
> store
> > > in %1.  */
> > > > -       "cmoveq %1,%0\n"              /* If number was zero, use -1
> as
> > > result.  */
> > > > +       "cmove %k1,%k0\n"     /* If number was zero, use -1 as
> result.
> > > */
> > >
> > > This no longer produces -1, but 0xffffffff in cnt.  However, since the
> > > return type is
> > > 'int', cnt need not be 'long long int' either.  I'm not sure why tmp
> > > exists at all, since
> > > cnt is the only register modified.
> > >
> >
> > Here is the exact assembly produced with this change.
> > ./build-x86_64-linux/string/ffsll.o:     file format elf64-x86-64
> >
> >
> > Disassembly of section .text:
> >
> > 0000000000000000 <ffsll>:
> >    0: ba ff ff ff ff       mov    $0xffffffff,%edx
> >    5: 48 0f bc c7           bsf    %rdi,%rax
> >    9: 0f 44 c2             cmove  %edx,%eax
> >    c: 83 c0 01             add    $0x1,%eax
> >    f: c3                   ret
> >
>
> FWIW it should be:
> ```
> 0000000000000000 <.text>:
>    0: b8 ff ff ff ff        mov    $0xffffffff,%eax
>    5: 48 0f bc c7          bsf    %rdi,%rax
>    9: ff c0                inc    %eax
> ```
>
> And since its in inline asm no reason not to get that.
>

We shouldn't remove the cmove because, per the Intel BSF instruction
manual, if the content of the source operand is 0, the content of the
destination operand is undefined.  Also, removing the cmove doesn't
provide any perf advantage.



> >
> > >
> > >
> > > r~
> > >
>
Noah Goldstein July 26, 2023, 9:05 p.m. UTC | #10
On Wed, Jul 26, 2023 at 3:44 PM Sunil Pandey <skpgkp2@gmail.com> wrote:
>
>
>
> On Wed, Jul 26, 2023 at 10:00 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>>
>> On Wed, Jul 26, 2023 at 11:52 AM Sunil Pandey via Libc-alpha
>> <libc-alpha@sourceware.org> wrote:
>> >
>> > On Wed, Jul 26, 2023 at 9:38 AM Richard Henderson <
>> > richard.henderson@linaro.org> wrote:
>> >
>> > > On 7/26/23 09:05, Sunil K Pandey via Libc-alpha wrote:
>> > > > Ffsll function size is 17 byte, this patch optimizes size to 16 byte.
>> > > > Currently ffsll function randomly regress by ~20%, depending on how
>> > > > code get aligned.
>> > > >
>> > > > This patch fixes ffsll function random performance regression.
>> > > > ---
>> > > >   sysdeps/x86_64/ffsll.c | 2 +-
>> > > >   1 file changed, 1 insertion(+), 1 deletion(-)
>> > > >
>> > > > diff --git a/sysdeps/x86_64/ffsll.c b/sysdeps/x86_64/ffsll.c
>> > > > index a1c13d4906..dbded6f0a1 100644
>> > > > --- a/sysdeps/x86_64/ffsll.c
>> > > > +++ b/sysdeps/x86_64/ffsll.c
>> > > > @@ -29,7 +29,7 @@ ffsll (long long int x)
>> > > >     long long int tmp;
>> > > >
>> > > >     asm ("bsfq %2,%0\n"               /* Count low bits in X and store
>> > > in %1.  */
>> > > > -       "cmoveq %1,%0\n"              /* If number was zero, use -1 as
>> > > result.  */
>> > > > +       "cmove %k1,%k0\n"     /* If number was zero, use -1 as result.
>> > > */
>> > >
>> > > This no longer produces -1, but 0xffffffff in cnt.  However, since the
>> > > return type is
>> > > 'int', cnt need not be 'long long int' either.  I'm not sure why tmp
>> > > exists at all, since
>> > > cnt is the only register modified.
>> > >
>> >
>> > Here is the exact assembly produced with this change.
>> > ./build-x86_64-linux/string/ffsll.o:     file format elf64-x86-64
>> >
>> >
>> > Disassembly of section .text:
>> >
>> > 0000000000000000 <ffsll>:
>> >    0: ba ff ff ff ff       mov    $0xffffffff,%edx
>> >    5: 48 0f bc c7           bsf    %rdi,%rax
>> >    9: 0f 44 c2             cmove  %edx,%eax
>> >    c: 83 c0 01             add    $0x1,%eax
>> >    f: c3                   ret
>> >
>>
>> FWIW it should be:
>> ```
>> 0000000000000000 <.text>:
>>    0: b8 ff ff ff ff        mov    $0xffffffff,%eax
>>    5: 48 0f bc c7          bsf    %rdi,%rax
>>    9: ff c0                inc    %eax
>> ```
>>
>> And since its in inline asm no reason not to get that.
>
>
> We shouldn't remove cmove because as per Intel BSF instruction manual, if the content of source operand
>  is 0, the content of the destination operand is undefined. Also removing cmove doesn't provide any perf
> advantage.

We rely on that behavior in other areas (memchr-evex512, for example).
It's been confirmed it is defined to not change the destination (like AMD).

It saves an instruction.

Either way, however, I think Adhemerval is right and we should use builtins
for this.
>
>
>>
>> >
>> > >
>> > >
>> > > r~
>> > >
Sunil Pandey July 26, 2023, 10:37 p.m. UTC | #11
On Wed, Jul 26, 2023 at 2:05 PM Noah Goldstein <goldstein.w.n@gmail.com>
wrote:

> On Wed, Jul 26, 2023 at 3:44 PM Sunil Pandey <skpgkp2@gmail.com> wrote:
> >
> >
> >
> > On Wed, Jul 26, 2023 at 10:00 AM Noah Goldstein <goldstein.w.n@gmail.com>
> wrote:
> >>
> >> On Wed, Jul 26, 2023 at 11:52 AM Sunil Pandey via Libc-alpha
> >> <libc-alpha@sourceware.org> wrote:
> >> >
> >> > On Wed, Jul 26, 2023 at 9:38 AM Richard Henderson <
> >> > richard.henderson@linaro.org> wrote:
> >> >
> >> > > On 7/26/23 09:05, Sunil K Pandey via Libc-alpha wrote:
> >> > > > Ffsll function size is 17 byte, this patch optimizes size to 16
> byte.
> >> > > > Currently ffsll function randomly regress by ~20%, depending on
> how
> >> > > > code get aligned.
> >> > > >
> >> > > > This patch fixes ffsll function random performance regression.
> >> > > > ---
> >> > > >   sysdeps/x86_64/ffsll.c | 2 +-
> >> > > >   1 file changed, 1 insertion(+), 1 deletion(-)
> >> > > >
> >> > > > diff --git a/sysdeps/x86_64/ffsll.c b/sysdeps/x86_64/ffsll.c
> >> > > > index a1c13d4906..dbded6f0a1 100644
> >> > > > --- a/sysdeps/x86_64/ffsll.c
> >> > > > +++ b/sysdeps/x86_64/ffsll.c
> >> > > > @@ -29,7 +29,7 @@ ffsll (long long int x)
> >> > > >     long long int tmp;
> >> > > >
> >> > > >     asm ("bsfq %2,%0\n"               /* Count low bits in X and
> store
> >> > > in %1.  */
> >> > > > -       "cmoveq %1,%0\n"              /* If number was zero, use
> -1 as
> >> > > result.  */
> >> > > > +       "cmove %k1,%k0\n"     /* If number was zero, use -1 as
> result.
> >> > > */
> >> > >
> >> > > This no longer produces -1, but 0xffffffff in cnt.  However, since
> the
> >> > > return type is
> >> > > 'int', cnt need not be 'long long int' either.  I'm not sure why tmp
> >> > > exists at all, since
> >> > > cnt is the only register modified.
> >> > >
> >> >
> >> > Here is the exact assembly produced with this change.
> >> > ./build-x86_64-linux/string/ffsll.o:     file format elf64-x86-64
> >> >
> >> >
> >> > Disassembly of section .text:
> >> >
> >> > 0000000000000000 <ffsll>:
> >> >    0: ba ff ff ff ff       mov    $0xffffffff,%edx
> >> >    5: 48 0f bc c7           bsf    %rdi,%rax
> >> >    9: 0f 44 c2             cmove  %edx,%eax
> >> >    c: 83 c0 01             add    $0x1,%eax
> >> >    f: c3                   ret
> >> >
> >>
> >> FWIW it should be:
> >> ```
> >> 0000000000000000 <.text>:
> >>    0: b8 ff ff ff ff        mov    $0xffffffff,%eax
> >>    5: 48 0f bc c7          bsf    %rdi,%rax
> >>    9: ff c0                inc    %eax
> >> ```
> >>
> >> And since its in inline asm no reason not to get that.
> >
> >
> > We shouldn't remove cmove because as per Intel BSF instruction manual,
> if the content of source operand
> >  is 0, the content of the destination operand is undefined. Also
> removing cmove doesn't provide any perf
> > advantage.
>
> We rely on that behavior in other areas (memchr-evex512 for example).
> It's been confirmed it is defined to not changed the destination (like
> AMD).
>
> It saves an instructions..
>
> Either way, however, I think adhemerval is right and we should use builtins
> for this.
>

A few things to notice here.

      This code is more than 20 years old; I'm not sure which Intel
processors it may be running on.
      A builtin will have the same issue unless it can produce code of 16
bytes or less.
      Also, the purpose of this patch is to fix a random, unexpected ~20%
perf loss.

>
> >
> >>
> >> >
> >> > >
> >> > >
> >> > > r~
> >> > >
>
Noah Goldstein July 27, 2023, midnight UTC | #12
On Wed, Jul 26, 2023 at 5:38 PM Sunil Pandey <skpgkp2@gmail.com> wrote:
>
>
>
> On Wed, Jul 26, 2023 at 2:05 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>>
>> On Wed, Jul 26, 2023 at 3:44 PM Sunil Pandey <skpgkp2@gmail.com> wrote:
>> >
>> >
>> >
>> > On Wed, Jul 26, 2023 at 10:00 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>> >>
>> >> On Wed, Jul 26, 2023 at 11:52 AM Sunil Pandey via Libc-alpha
>> >> <libc-alpha@sourceware.org> wrote:
>> >> >
>> >> > On Wed, Jul 26, 2023 at 9:38 AM Richard Henderson <
>> >> > richard.henderson@linaro.org> wrote:
>> >> >
>> >> > > On 7/26/23 09:05, Sunil K Pandey via Libc-alpha wrote:
>> >> > > > Ffsll function size is 17 byte, this patch optimizes size to 16 byte.
>> >> > > > Currently ffsll function randomly regress by ~20%, depending on how
>> >> > > > code get aligned.
>> >> > > >
>> >> > > > This patch fixes ffsll function random performance regression.
>> >> > > > ---
>> >> > > >   sysdeps/x86_64/ffsll.c | 2 +-
>> >> > > >   1 file changed, 1 insertion(+), 1 deletion(-)
>> >> > > >
>> >> > > > diff --git a/sysdeps/x86_64/ffsll.c b/sysdeps/x86_64/ffsll.c
>> >> > > > index a1c13d4906..dbded6f0a1 100644
>> >> > > > --- a/sysdeps/x86_64/ffsll.c
>> >> > > > +++ b/sysdeps/x86_64/ffsll.c
>> >> > > > @@ -29,7 +29,7 @@ ffsll (long long int x)
>> >> > > >     long long int tmp;
>> >> > > >
>> >> > > >     asm ("bsfq %2,%0\n"               /* Count low bits in X and store
>> >> > > in %1.  */
>> >> > > > -       "cmoveq %1,%0\n"              /* If number was zero, use -1 as
>> >> > > result.  */
>> >> > > > +       "cmove %k1,%k0\n"     /* If number was zero, use -1 as result.
>> >> > > */
>> >> > >
>> >> > > This no longer produces -1, but 0xffffffff in cnt.  However, since the
>> >> > > return type is
>> >> > > 'int', cnt need not be 'long long int' either.  I'm not sure why tmp
>> >> > > exists at all, since
>> >> > > cnt is the only register modified.
>> >> > >
>> >> >
>> >> > Here is the exact assembly produced with this change.
>> >> > ./build-x86_64-linux/string/ffsll.o:     file format elf64-x86-64
>> >> >
>> >> >
>> >> > Disassembly of section .text:
>> >> >
>> >> > 0000000000000000 <ffsll>:
>> >> >    0: ba ff ff ff ff       mov    $0xffffffff,%edx
>> >> >    5: 48 0f bc c7           bsf    %rdi,%rax
>> >> >    9: 0f 44 c2             cmove  %edx,%eax
>> >> >    c: 83 c0 01             add    $0x1,%eax
>> >> >    f: c3                   ret
>> >> >
>> >>
>> >> FWIW it should be:
>> >> ```
>> >> 0000000000000000 <.text>:
>> >>    0: b8 ff ff ff ff        mov    $0xffffffff,%eax
>> >>    5: 48 0f bc c7          bsf    %rdi,%rax
>> >>    9: ff c0                inc    %eax
>> >> ```
>> >>
>> >> And since its in inline asm no reason not to get that.
>> >
>> >
>> > We shouldn't remove cmove because as per Intel BSF instruction manual, if the content of source operand
>> >  is 0, the content of the destination operand is undefined. Also removing cmove doesn't provide any perf
>> > advantage.
>>
>> We rely on that behavior in other areas (memchr-evex512 for example).
>> It's been confirmed it is defined to not changed the destination (like AMD).
>>
>> It saves an instructions..
>>
>> Either way, however, I think adhemerval is right and we should use builtins
>> for this.
>
>
> Few things to notice here.
>
>       This code is more than 20 years old, not sure what all intel processors this code may be running.
>       Builtin will have the same issue, unless it can produce code 16 byte or less.
>       Also the purpose of this patch is to fix random unexpected ~20% perf loss.
>
Likewise for the string/memory library...
Why not just update it without the cmov?  It just seems like a waste not to
address it while we're at it.

>> >
>> >
>> >>
>> >> >
>> >> > >
>> >> > >
>> >> > > r~
>> >> > >
Cristian Rodríguez July 27, 2023, 12:31 a.m. UTC | #13
On Wed, Jul 26, 2023 at 1:12 PM Adhemerval Zanella Netto via Libc-alpha <
libc-alpha@sourceware.org> wrote:

>
>
> Can't we just use compiler builtins instead and remove a bunch of asm?
>

Yes, please. and any better code generation strategy is better directed at
gcc not the glibc.
Florian Weimer July 27, 2023, 8:16 a.m. UTC | #14
* Noah Goldstein via Libc-alpha:

> Likewise for the string/memory library....
> Why not just update it w.o cmov? Just seems like a waste not to
> address it while we're at it.

Given that it violates the spec, doing it in an inline function seems
kind of risky.

Thanks,
Florian
Alexander Monakov July 27, 2023, 11:46 a.m. UTC | #15
On Thu, 27 Jul 2023, Florian Weimer via Libc-alpha wrote:

> * Noah Goldstein via Libc-alpha:
> 
> > Likewise for the string/memory library....
> > Why not just update it w.o cmov? Just seems like a waste not to
> > address it while we're at it.
> 
> Given that it violates the spec, doing it in an inline function seems
> kind of risky.

Sorry, what inline function? The function seems to modify an implementation
in ffsll.c, nothing else.

Alexander
Florian Weimer July 27, 2023, 12:10 p.m. UTC | #16
* Alexander Monakov:

> On Thu, 27 Jul 2023, Florian Weimer via Libc-alpha wrote:
>
>> * Noah Goldstein via Libc-alpha:
>> 
>> > Likewise for the string/memory library....
>> > Why not just update it w.o cmov? Just seems like a waste not to
>> > address it while we're at it.
>> 
>> Given that it violates the spec, doing it in an inline function seems
>> kind of risky.
>
> Sorry, what inline function? The function seems to modify an implementation
> in ffsll.c, nothing else.

Yeah, sorry, I was confused.  There's a GCC built-in, right?  So the
glibc implementation probably isn't that important on x86.

Thanks,
Florian
Cristian Rodríguez July 27, 2023, 1:59 p.m. UTC | #17
On Thu, Jul 27, 2023 at 8:11 AM Florian Weimer via Libc-alpha <
libc-alpha@sourceware.org> wrote:

>
> Yeah, sorry, I was confused.  There's a GCC built-in, right?


Yes, in all targets that have an ffs pattern,,the others need libgcc
Alexander Monakov July 27, 2023, 2 p.m. UTC | #18
On Thu, 27 Jul 2023, Florian Weimer via Libc-alpha wrote:

> * Alexander Monakov:
> 
> > On Thu, 27 Jul 2023, Florian Weimer via Libc-alpha wrote:
> >
> >> * Noah Goldstein via Libc-alpha:
> >> 
> >> > Likewise for the string/memory library....
> >> > Why not just update it w.o cmov? Just seems like a waste not to
> >> > address it while we're at it.
> >> 
> >> Given that it violates the spec, doing it in an inline function seems
> >> kind of risky.
> >
> > Sorry, what inline function? The function seems to modify an implementation
> > in ffsll.c, nothing else.
> 
> Yeah, sorry, I was confused.  There's a GCC built-in, right?  So the
> glibc implementation probably isn't that important on x86.

Yep. The built-in gets expanded even at -Os, so it's quite unusual that
Glibc's implementation gets called.  I see the following possibilities:

* at -O0 (but then performance doesn't matter, presumably)
* with -fno-builtin
* when called via a function pointer
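
The function-pointer case, for example (hedged sketch):

```
#define _GNU_SOURCE
#include <string.h>

/* Taking the address forces the call through libc's ffsll;
   the inline builtin expansion cannot apply here.  */
static int (*pffsll) (long long int) = ffsll;

int
use_it (long long int x)
{
  return pffsll (x);
}
```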

Sunil, could you clear this up, please?

Alexander
Sunil Pandey July 27, 2023, 3:13 p.m. UTC | #19
On Thu, Jul 27, 2023 at 7:00 AM Alexander Monakov <amonakov@ispras.ru>
wrote:

>
> On Thu, 27 Jul 2023, Florian Weimer via Libc-alpha wrote:
>
> > * Alexander Monakov:
> >
> > > On Thu, 27 Jul 2023, Florian Weimer via Libc-alpha wrote:
> > >
> > >> * Noah Goldstein via Libc-alpha:
> > >>
> > >> > Likewise for the string/memory library....
> > >> > Why not just update it w.o cmov? Just seems like a waste not to
> > >> > address it while we're at it.
> > >>
> > >> Given that it violates the spec, doing it in an inline function seems
> > >> kind of risky.
> > >
> > > Sorry, what inline function? The function seems to modify an
> implementation
> > > in ffsll.c, nothing else.
> >
> > Yeah, sorry, I was confused.  There's a GCC built-in, right?  So the
> > glibc implementation probably isn't that important on x86.
>
> Yep. The built-in gets expanded even at -Os, so it's quite unusual that
> Glibc's
> implementation get called. I see the following possibilities:
>
> * at -O0 (but then performance doesn't matter, presumably)
> * with -fno-builtin
> * when called via a function pointer
>
> Sunil, could you clear this up, please?
>

This issue only matters if the ffsll functionality is implemented in a
function (size > 16 bytes) and the function is not inlined (it doesn't
matter whether it's implemented in C or assembly).

By default the function entry point gets aligned to a 16-byte boundary,
so the following layouts are all possible (the arithmetic is sketched
below).

64-byte aligned: no issue, as a 17-byte function will not cause another
cache line load.
48-byte aligned: ~20% regression, as a 17-byte function will trigger
another cache line load.
32-byte aligned: no issue, as a 17-byte function will not cause another
cache line load.
16-byte aligned: no issue, as a 17-byte function will not cause another
cache line load.
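
A quick sketch of that arithmetic (hypothetical helper, not part of the
patch):

```
#include <stdio.h>

int
main (void)
{
  int size = 17;   /* current ffsll body size in bytes */
  /* Entry points land on 16-byte boundaries within a 64-byte line.  */
  for (int start = 0; start < 64; start += 16)
    printf ("entry at offset %2d: %s\n", start,
            start + size > 64 ? "crosses into a second cache line"
                              : "fits in one cache line");
  return 0;
}
```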

Ffsll is one of the benchmark tests in the Phoronix Test Suite; I'm not
sure how much it matters to applications.  Lots of people are involved in
Phoronix benchmark testing/tracking, and this kind of random perf behavior
wastes their time.

Again, I'm not against GCC, but if this function exists in glibc, I don't
see any harm in fixing it.


>
> Alexander
>
Alexander Monakov July 27, 2023, 3:50 p.m. UTC | #20
On Thu, 27 Jul 2023, Sunil Pandey wrote:

> Ffsll is one of the benchmark tests in the phoronix test suite, not
> sure how much it matters to the application. Lots of people involved
> in phoronix benchmark testing/tracking and this kind of random perf
> behavior wastes their time.

Ah, I see — PTS runs Glibc benchtests (which use -fno-builtin).

> Again I'm not against GCC, but if this function exists in glibc, I don't
> see any harm in fixing it.

Sure. But as Richard seems to be pointing out, the function seems
needlessly obfuscated (even the first comment is wrong!). So if Noah's
suggestion is not accepted, I can offer the following cleaned-up
variant:

int
ffsll (long long int x)
{
  int cnt;

  asm ("bsfq %1,%q0\n"          /* Count low bits in X and store in %0.  */
       "cmovel %2,%0\n"         /* If number was zero, use -1 as result.  */
       : "=&r" (cnt) : "r" (x), "r" (-1));

  return cnt + 1;
}

(note that the initial bsfq has an undesirable false dependency on the
output operand, and Noah's variant fixes that too)

Alexander
Florian Weimer July 27, 2023, 4:24 p.m. UTC | #21
* Sunil Pandey:

> Ffsll is one of the benchmark tests in the phoronix test suite, not
> sure how much it matters to the application. Lots of people involved
> in phoronix benchmark testing/tracking and this kind of random perf
> behavior wastes their time.

That's a good point.  I've seen similar reports before (sadly I don't
recall if they were specifically about ffsll).

Regarding the mechanics of fixing it, if the instruction ordering and
sizing is so sensitive, should this be an assembler implementation
instead?  And will the fix even work for distributions that build with
--enable-cet, considering that there's going to be an additional 4-byte
NOP at the start of the function?

Thanks,
Florian
Adhemerval Zanella Netto July 27, 2023, 4:35 p.m. UTC | #22
On 27/07/23 13:24, Florian Weimer via Libc-alpha wrote:
> * Sunil Pandey:
> 
>> Ffsll is one of the benchmark tests in the phoronix test suite, not
>> sure how much it matters to the application. Lots of people involved
>> in phoronix benchmark testing/tracking and this kind of random perf
>> behavior wastes their time.
> 
> That's a good point.  I've seen similar reports before (sadly I don't
> recall if they were specifically about ffsll).
> 
> Regarding the mechanics of fixing it, if the instruction ordering and
> sizing is so sensitive, should this be an assembler implementation
> instead?  And will the fix even work for distributions that build with
> --enable-cet, considering that there's going to be an additional 4-byte
> NOP at the start of the function?
> 


Sigh... do we really need to care about this synthetic benchmark that is
exercising a fallback path, since the compiler will most likely issue the
inline builtin?  And even if this is really important, tuning function
alignment and size to fit in a cache line should be done by the compiler,
especially in the case where we can implement it with a builtin.
Sunil Pandey July 27, 2023, 4:40 p.m. UTC | #23
On Thu, Jul 27, 2023 at 9:24 AM Florian Weimer <fweimer@redhat.com> wrote:

> * Sunil Pandey:
>
> > Ffsll is one of the benchmark tests in the phoronix test suite, not
> > sure how much it matters to the application. Lots of people involved
> > in phoronix benchmark testing/tracking and this kind of random perf
> > behavior wastes their time.
>
> That's a good point.  I've seen similar reports before (sadly I don't
> recall if they were specifically about ffsll).
>
> Regarding the mechanics of fixing it, if the instruction ordering and
> sizing is so sensitive, should this be an assembler implementation
> instead?


Instruction ordering and sizing pretty much always matter.  Many
instructions can't be used in low-latency applications because their
encoding size is large.  Unfortunately, assemblers can't do much in this
case.


>   And will the fix even work for distributions that build with
> --enable-cet, considering that there's going to be an additional 4-byte
> NOP at the start of the function?
>

The extra 4 bytes from --enable-cet can affect the performance of many
small, highly optimized routines and can potentially create perf variation
depending on how a function gets loaded into memory.  But again, this is
one of the costs of using CET.


> Thanks,
> Florian
>
>
Florian Weimer July 27, 2023, 5:09 p.m. UTC | #24
* Adhemerval Zanella Netto:

> On 27/07/23 13:24, Florian Weimer via Libc-alpha wrote:
>> * Sunil Pandey:
>> 
>>> Ffsll is one of the benchmark tests in the phoronix test suite, not
>>> sure how much it matters to the application. Lots of people involved
>>> in phoronix benchmark testing/tracking and this kind of random perf
>>> behavior wastes their time.
>> 
>> That's a good point.  I've seen similar reports before (sadly I don't
>> recall if they were specifically about ffsll).
>> 
>> Regarding the mechanics of fixing it, if the instruction ordering and
>> sizing is so sensitive, should this be an assembler implementation
>> instead?  And will the fix even work for distributions that build with
>> --enable-cet, considering that there's going to be an additional 4-byte
>> NOP at the start of the function?

> Sigh... do we really need to care about this synthetic benchmark that is
> exercising a fallback path since compiler will most likely issue the
> inline builtin? And even this is really important, tune function alignment
> and size to fit on a cacheline should be done by the compiler, specially
> in the case where we can implement by using a builtin.

I think avoiding wasting people's time with spurious benchmark
differences is useful.  Compared to things like the PID cache, this one
seems pretty harmless.

Maybe we should increase function alignment to cover the CET case, too?

Thanks,
Florian
Sunil Pandey July 27, 2023, 5:25 p.m. UTC | #25
On Thu, Jul 27, 2023 at 10:09 AM Florian Weimer <fweimer@redhat.com> wrote:

> * Adhemerval Zanella Netto:
>
> > On 27/07/23 13:24, Florian Weimer via Libc-alpha wrote:
> >> * Sunil Pandey:
> >>
> >>> Ffsll is one of the benchmark tests in the phoronix test suite, not
> >>> sure how much it matters to the application. Lots of people involved
> >>> in phoronix benchmark testing/tracking and this kind of random perf
> >>> behavior wastes their time.
> >>
> >> That's a good point.  I've seen similar reports before (sadly I don't
> >> recall if they were specifically about ffsll).
> >>
> >> Regarding the mechanics of fixing it, if the instruction ordering and
> >> sizing is so sensitive, should this be an assembler implementation
> >> instead?  And will the fix even work for distributions that build with
> >> --enable-cet, considering that there's going to be an additional 4-byte
> >> NOP at the start of the function?
>
> > Sigh... do we really need to care about this synthetic benchmark that is
> > exercising a fallback path since compiler will most likely issue the
> > inline builtin? And even this is really important, tune function
> alignment
> > and size to fit on a cacheline should be done by the compiler, specially
> > in the case where we can implement by using a builtin.
>
> I think avoiding wasting people's time with spurious benchmark
> differences is useful.  Compared to things like the PID cache, this one
> seems pretty harmless.
>
> Maybe we should increase function alignment to cover the CET case, too?
>

Good point. I have seen some other cases where function alignment creates
a big perf swing.  I think aligning function entry points to the cache
line size will fix most of the alignment-related issues.  The trade-off of
bigger alignment is increased memory usage.
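
For illustration, a hedged sketch of forcing cache-line entry alignment on
a single function (equivalently, building with -falign-functions=64):

```
/* Hypothetical: align the entry point to a full 64-byte cache line.  */
__attribute__ ((aligned (64))) int
my_ffsll (long long int x)
{
  return __builtin_ffsll (x);
}
```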


> Thanks,
> Florian
>
>

Patch

diff --git a/sysdeps/x86_64/ffsll.c b/sysdeps/x86_64/ffsll.c
index a1c13d4906..dbded6f0a1 100644
--- a/sysdeps/x86_64/ffsll.c
+++ b/sysdeps/x86_64/ffsll.c
@@ -29,7 +29,7 @@  ffsll (long long int x)
   long long int tmp;
 
   asm ("bsfq %2,%0\n"		/* Count low bits in X and store in %1.  */
-       "cmoveq %1,%0\n"		/* If number was zero, use -1 as result.  */
+       "cmove %k1,%k0\n"	/* If number was zero, use -1 as result.  */
        : "=&r" (cnt), "=r" (tmp) : "rm" (x), "1" (-1));
 
   return cnt + 1;