
[v2] x86_64: Optimize ffsll function code size.

Message ID 20230731183549.2396362-1-skpgkp2@gmail.com
State New
Series [v2] x86_64: Optimize ffsll function code size.

Commit Message

Sunil Pandey July 31, 2023, 6:35 p.m. UTC
The ffsll function size is 17 bytes; this patch optimizes the size to 16 bytes.
Currently the ffsll function randomly regresses by ~20%, depending on how
the code gets aligned.

This patch fixes the random ffsll performance regression.

Changes from v1:
- Further reduce the ffsll function size to 12 bytes.
---
 sysdeps/x86_64/ffsll.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

Comments

Adhemerval Zanella Netto July 31, 2023, 8:12 p.m. UTC | #1
On 31/07/23 15:35, Sunil K Pandey via Libc-alpha wrote:
> Ffsll function size is 17 byte, this patch optimizes size to 16 byte.
> Currently ffsll function randomly regress by ~20%, depending on how
> code get aligned.
> 
> This patch fixes ffsll function random performance regression.
> 
> Changes from v1:
> - Further reduce size ffsll function size to 12 bytes.
> ---
>  sysdeps/x86_64/ffsll.c | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/sysdeps/x86_64/ffsll.c b/sysdeps/x86_64/ffsll.c
> index a1c13d4906..6a5803c7c1 100644
> --- a/sysdeps/x86_64/ffsll.c
> +++ b/sysdeps/x86_64/ffsll.c
> @@ -26,13 +26,13 @@ int
>  ffsll (long long int x)
>  {
>    long long int cnt;
> -  long long int tmp;
>  
> -  asm ("bsfq %2,%0\n"		/* Count low bits in X and store in %1.  */
> -       "cmoveq %1,%0\n"		/* If number was zero, use -1 as result.  */
> -       : "=&r" (cnt), "=r" (tmp) : "rm" (x), "1" (-1));
> +  asm ("mov $-1,%k0\n"	/* Intialize CNT to -1.  */
> +       "bsf %1,%0\n"	/* Count low bits in X and store in CNT.  */
> +       "inc %k0\n"	/* Increment CNT by 1.  */
> +       : "=&r" (cnt) : "r" (x));
>  
> -  return cnt + 1;
> +  return cnt;
>  }
>  
>  #ifndef __ILP32__



I would still prefer to just remove this arch-optimized function in favor
of compiler builtins.
Sunil Pandey July 31, 2023, 8:58 p.m. UTC | #2
On Mon, Jul 31, 2023 at 1:12 PM Adhemerval Zanella Netto <
adhemerval.zanella@linaro.org> wrote:

>
>
> On 31/07/23 15:35, Sunil K Pandey via Libc-alpha wrote:
> > Ffsll function size is 17 byte, this patch optimizes size to 16 byte.
> > Currently ffsll function randomly regress by ~20%, depending on how
> > code get aligned.
> >
> > This patch fixes ffsll function random performance regression.
> >
> > Changes from v1:
> > - Further reduce size ffsll function size to 12 bytes.
> > ---
> >  sysdeps/x86_64/ffsll.c | 10 +++++-----
> >  1 file changed, 5 insertions(+), 5 deletions(-)
> >
> > diff --git a/sysdeps/x86_64/ffsll.c b/sysdeps/x86_64/ffsll.c
> > index a1c13d4906..6a5803c7c1 100644
> > --- a/sysdeps/x86_64/ffsll.c
> > +++ b/sysdeps/x86_64/ffsll.c
> > @@ -26,13 +26,13 @@ int
> >  ffsll (long long int x)
> >  {
> >    long long int cnt;
> > -  long long int tmp;
> >
> > -  asm ("bsfq %2,%0\n"                /* Count low bits in X and store
> in %1.  */
> > -       "cmoveq %1,%0\n"              /* If number was zero, use -1 as
> result.  */
> > -       : "=&r" (cnt), "=r" (tmp) : "rm" (x), "1" (-1));
> > +  asm ("mov $-1,%k0\n"       /* Intialize CNT to -1.  */
> > +       "bsf %1,%0\n" /* Count low bits in X and store in CNT.  */
> > +       "inc %k0\n"   /* Increment CNT by 1.  */
> > +       : "=&r" (cnt) : "r" (x));
> >
> > -  return cnt + 1;
> > +  return cnt;
> >  }
> >
> >  #ifndef __ILP32__
>
>
>
> I still prefer if we can just remove this arch-optimized function in favor
> in compiler builtins.
>

Sure, the compiler builtin should replace it in the long run.
In the meantime, can it be fixed?
Adhemerval Zanella Netto July 31, 2023, 10:57 p.m. UTC | #3
On 31/07/23 17:58, Sunil Pandey wrote:
> 
> 
> On Mon, Jul 31, 2023 at 1:12 PM Adhemerval Zanella Netto <adhemerval.zanella@linaro.org <mailto:adhemerval.zanella@linaro.org>> wrote:
> 
> 
> 
>     On 31/07/23 15:35, Sunil K Pandey via Libc-alpha wrote:
>     > Ffsll function size is 17 byte, this patch optimizes size to 16 byte.
>     > Currently ffsll function randomly regress by ~20%, depending on how
>     > code get aligned.
>     >
>     > This patch fixes ffsll function random performance regression.
>     >
>     > Changes from v1:
>     > - Further reduce size ffsll function size to 12 bytes.
>     > ---
>     >  sysdeps/x86_64/ffsll.c | 10 +++++-----
>     >  1 file changed, 5 insertions(+), 5 deletions(-)
>     >
>     > diff --git a/sysdeps/x86_64/ffsll.c b/sysdeps/x86_64/ffsll.c
>     > index a1c13d4906..6a5803c7c1 100644
>     > --- a/sysdeps/x86_64/ffsll.c
>     > +++ b/sysdeps/x86_64/ffsll.c
>     > @@ -26,13 +26,13 @@ int
>     >  ffsll (long long int x)
>     >  {
>     >    long long int cnt;
>     > -  long long int tmp;
>     > 
>     > -  asm ("bsfq %2,%0\n"                /* Count low bits in X and store in %1.  */
>     > -       "cmoveq %1,%0\n"              /* If number was zero, use -1 as result.  */
>     > -       : "=&r" (cnt), "=r" (tmp) : "rm" (x), "1" (-1));
>     > +  asm ("mov $-1,%k0\n"       /* Intialize CNT to -1.  */
>     > +       "bsf %1,%0\n" /* Count low bits in X and store in CNT.  */
>     > +       "inc %k0\n"   /* Increment CNT by 1.  */
>     > +       : "=&r" (cnt) : "r" (x));
>     > 
>     > -  return cnt + 1;
>     > +  return cnt;
>     >  }
>     > 
>     >  #ifndef __ILP32__
> 
> 
> 
>     I still prefer if we can just remove this arch-optimized function in favor
>     in compiler builtins.
> 
> 
> Sure, compiler builtin should replace it in the long run.
> In the meantime, can it get fixed? 

This fix only works if the compiler does not insert anything in the prologue;
if you use CET or stack-protector-strong it might not work.  And I *really*
do not want to add another assembly optimization for a symbol that won't
be used in most real programs.

And there is already a fix to use compiler builtins [1].

[1] https://patchwork.sourceware.org/project/glibc/patch/20230717143431.2075924-1-adhemerval.zanella@linaro.org/
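
For reference, the builtin-based replacement amounts to something like the
sketch below.  This is a paraphrase, not the literal text of the linked patch,
but the gdb session later in this thread (__ffsll at ffsll.c:30) shows the
resulting function is indeed a plain call to the builtin:

  /* Sketch of a builtin-based sysdeps/x86_64/ffsll.c; a paraphrase of the
     approach in [1], not the exact file.  GCC expands __builtin_ffsll
     inline, so the whole function compiles to a handful of instructions
     (the bsf/cmove/lea sequence shown later in this thread).  */
  int
  __ffsll (long long int i)
  {
    return __builtin_ffsll (i);
  }
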
Sunil Pandey July 31, 2023, 11:44 p.m. UTC | #4
On Mon, Jul 31, 2023 at 3:57 PM Adhemerval Zanella Netto <
adhemerval.zanella@linaro.org> wrote:

>
>
> On 31/07/23 17:58, Sunil Pandey wrote:
> >
> >
> > On Mon, Jul 31, 2023 at 1:12 PM Adhemerval Zanella Netto <
> adhemerval.zanella@linaro.org <mailto:adhemerval.zanella@linaro.org>>
> wrote:
> >
> >
> >
> >     On 31/07/23 15:35, Sunil K Pandey via Libc-alpha wrote:
> >     > Ffsll function size is 17 byte, this patch optimizes size to 16
> byte.
> >     > Currently ffsll function randomly regress by ~20%, depending on how
> >     > code get aligned.
> >     >
> >     > This patch fixes ffsll function random performance regression.
> >     >
> >     > Changes from v1:
> >     > - Further reduce size ffsll function size to 12 bytes.
> >     > ---
> >     >  sysdeps/x86_64/ffsll.c | 10 +++++-----
> >     >  1 file changed, 5 insertions(+), 5 deletions(-)
> >     >
> >     > diff --git a/sysdeps/x86_64/ffsll.c b/sysdeps/x86_64/ffsll.c
> >     > index a1c13d4906..6a5803c7c1 100644
> >     > --- a/sysdeps/x86_64/ffsll.c
> >     > +++ b/sysdeps/x86_64/ffsll.c
> >     > @@ -26,13 +26,13 @@ int
> >     >  ffsll (long long int x)
> >     >  {
> >     >    long long int cnt;
> >     > -  long long int tmp;
> >     >
> >     > -  asm ("bsfq %2,%0\n"                /* Count low bits in X and
> store in %1.  */
> >     > -       "cmoveq %1,%0\n"              /* If number was zero, use
> -1 as result.  */
> >     > -       : "=&r" (cnt), "=r" (tmp) : "rm" (x), "1" (-1));
> >     > +  asm ("mov $-1,%k0\n"       /* Intialize CNT to -1.  */
> >     > +       "bsf %1,%0\n" /* Count low bits in X and store in CNT.  */
> >     > +       "inc %k0\n"   /* Increment CNT by 1.  */
> >     > +       : "=&r" (cnt) : "r" (x));
> >     >
> >     > -  return cnt + 1;
> >     > +  return cnt;
> >     >  }
> >     >
> >     >  #ifndef __ILP32__
> >
> >
> >
> >     I still prefer if we can just remove this arch-optimized function in
> favor
> >     in compiler builtins.
> >
> >
> > Sure, compiler builtin should replace it in the long run.
> > In the meantime, can it get fixed?
>
> This fix only works if compiler does not insert anything in the prologue,
> if
> you use CET or stack protector strong it might not work.  And I *really*
> do not want to add another assembly optimization to a symbol that won't
> be used in most real programs.
>

v2 will fix it, as the CET overhead is 4 bytes and the latest code size is
only 12 bytes.
So the total code size will be 16 bytes even if CET gets enabled.
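
For context, the 4-byte CET overhead mentioned here is the endbr64 instruction
that GCC emits at function entry when -fcf-protection is enabled.  A minimal
stand-alone example (hypothetical file name, not part of this patch) makes the
size easy to verify:

  /* cet_size.c -- hypothetical example, not part of this patch.
     Build with:  gcc -O2 -fcf-protection=full -S cet_size.c
     The generated assembly for identity() begins with "endbr64", a 4-byte
     instruction, before the first real instruction, so a 12-byte function
     body plus CET still fits within a 16-byte boundary.  */
  int
  identity (int x)
  {
    return x;
  }
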


> And already have a fix to use compiler builtins [1].
>

Here is code generated from the builtin patch.

(gdb) b ffsll
Breakpoint 1 at 0x10a0
(gdb) run --direct
Starting program: benchtests/bench-ffsll --direct
Breakpoint 1, __ffsll (i=66900473708975203) at ffsll.c:30
30  return __builtin_ffsll (i);
(gdb) disass
Dump of assembler code for function __ffsll:
=> 0x00007ffff7c955a0 <+0>: bsf    %rdi,%rdi
   0x00007ffff7c955a4 <+4>: mov    $0xffffffffffffffff,%rax
   0x00007ffff7c955ab <+11>: cmove  %rax,%rdi
   0x00007ffff7c955af <+15>: lea    0x1(%rdi),%eax
   0x00007ffff7c955b2 <+18>: ret
End of assembler dump.
(gdb)

It is not going to fix the problem. The random 20% variation will continue
even with the builtin patch in benchmarking.

I do not see anything wrong with using architecture features if it helps
people save valuable debugging time. After all, it is a valuable glibc
feature to be able to override a generic implementation with
architecture-specific code.


> [1]
> https://patchwork.sourceware.org/project/glibc/patch/20230717143431.2075924-1-adhemerval.zanella@linaro.org/
>
Noah Goldstein July 31, 2023, 11:54 p.m. UTC | #5
On Mon, Jul 31, 2023 at 6:44 PM Sunil Pandey via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> On Mon, Jul 31, 2023 at 3:57 PM Adhemerval Zanella Netto <
> adhemerval.zanella@linaro.org> wrote:
>
> >
> >
> > On 31/07/23 17:58, Sunil Pandey wrote:
> > >
> > >
> > > On Mon, Jul 31, 2023 at 1:12 PM Adhemerval Zanella Netto <
> > adhemerval.zanella@linaro.org <mailto:adhemerval.zanella@linaro.org>>
> > wrote:
> > >
> > >
> > >
> > >     On 31/07/23 15:35, Sunil K Pandey via Libc-alpha wrote:
> > >     > Ffsll function size is 17 byte, this patch optimizes size to 16
> > byte.
> > >     > Currently ffsll function randomly regress by ~20%, depending on how
> > >     > code get aligned.
> > >     >
> > >     > This patch fixes ffsll function random performance regression.
> > >     >
> > >     > Changes from v1:
> > >     > - Further reduce size ffsll function size to 12 bytes.
> > >     > ---
> > >     >  sysdeps/x86_64/ffsll.c | 10 +++++-----
> > >     >  1 file changed, 5 insertions(+), 5 deletions(-)
> > >     >
> > >     > diff --git a/sysdeps/x86_64/ffsll.c b/sysdeps/x86_64/ffsll.c
> > >     > index a1c13d4906..6a5803c7c1 100644
> > >     > --- a/sysdeps/x86_64/ffsll.c
> > >     > +++ b/sysdeps/x86_64/ffsll.c
> > >     > @@ -26,13 +26,13 @@ int
> > >     >  ffsll (long long int x)
> > >     >  {
> > >     >    long long int cnt;
> > >     > -  long long int tmp;
> > >     >
> > >     > -  asm ("bsfq %2,%0\n"                /* Count low bits in X and
> > store in %1.  */
> > >     > -       "cmoveq %1,%0\n"              /* If number was zero, use
> > -1 as result.  */
> > >     > -       : "=&r" (cnt), "=r" (tmp) : "rm" (x), "1" (-1));
> > >     > +  asm ("mov $-1,%k0\n"       /* Intialize CNT to -1.  */
> > >     > +       "bsf %1,%0\n" /* Count low bits in X and store in CNT.  */
> > >     > +       "inc %k0\n"   /* Increment CNT by 1.  */
> > >     > +       : "=&r" (cnt) : "r" (x));
> > >     >
> > >     > -  return cnt + 1;
> > >     > +  return cnt;
> > >     >  }
> > >     >
> > >     >  #ifndef __ILP32__
> > >
> > >
> > >
> > >     I still prefer if we can just remove this arch-optimized function in
> > favor
> > >     in compiler builtins.
> > >
> > >
> > > Sure, compiler builtin should replace it in the long run.
> > > In the meantime, can it get fixed?
> >
> > This fix only works if compiler does not insert anything in the prologue,
> > if
> > you use CET or stack protector strong it might not work.  And I *really*
> > do not want to add another assembly optimization to a symbol that won't
> > be used in most real programs.
> >
>
> v2 will fix it, as CET overhead is 4 byte and the latest code size is only
> 12 byte.
> So toal code size will be 16 byte even if CET gets enabled.
>
>
> > And already have a fix to use compiler builtins [1].
> >
>
> Here is code generated from the builtin patch.
>
> (gdb) b ffsll
> Breakpoint 1 at 0x10a0
> (gdb) run --direct
> Starting program: benchtests/bench-ffsll --direct
> Breakpoint 1, __ffsll (i=66900473708975203) at ffsll.c:30
> 30  return __builtin_ffsll (i);
> (gdb) disass
> Dump of assembler code for function __ffsll:
> => 0x00007ffff7c955a0 <+0>: bsf    %rdi,%rdi
>    0x00007ffff7c955a4 <+4>: mov    $0xffffffffffffffff,%rax
>    0x00007ffff7c955ab <+11>: cmove  %rax,%rdi
>    0x00007ffff7c955af <+15>: lea    0x1(%rdi),%eax
>    0x00007ffff7c955b2 <+18>: ret
> End of assembler dump.
> (gdb)
>
> It is not going to fix the problem. Random 20% variation will continue even
> with
> builtin patch in benchmarking.
>
> I do not see anything wrong using architecture features, if it helps
> people save their valuable debugging time. After all, its valuable
> glibc feature to override generic implementation with architecture specific
> code.
>
>
> > [1]
> > https://patchwork.sourceware.org/project/glibc/patch/20230717143431.2075924-1-adhemerval.zanella@linaro.org/
> >

For my money this patch is fine.
It's not really that important to optimize (or worth the time that's been
spent arguing), and I don't really see the harm in the asm implementation,
especially since GCC doesn't really produce good codegen for the builtin :(
Andreas Schwab Aug. 1, 2023, 6:47 a.m. UTC | #6
On Jul 31 2023, Noah Goldstein via Libc-alpha wrote:

> esp since GCC doesn't really produce good codegen for the builtin :(

So this is a GCC bug.  Please report it.
Adhemerval Zanella Netto Aug. 1, 2023, 1:46 p.m. UTC | #7
On 31/07/23 20:44, Sunil Pandey wrote:
>
> 
> It is not going to fix the problem. Random 20% variation will continue even with
> builtin patch in benchmarking.
> 
> I do not see anything wrong using architecture features, if it helps
> people save their valuable debugging time. After all, its valuable
> glibc feature to override generic implementation with architecture specific
> code.
>
Because it is a synthetic benchmark that is exercising code that should not be
stressed with the default compiler flags.  And the problem with using
arch-specific code when you already have support from the compiler to generate
optimized code is that it is just extra complexity plus maintenance.

On my system, ffsll is only used by a single program (/usr/bin/nvidia-persistenced),
which I am not sure how it is compiled because it is a closed-source one.  The ffs
function is used by a couple more (gdb, lvm, and some mesa drivers), but again far
from being a hotspot.

And as Andreas has said, the best course of action here is to fix the compiler
if it is not generating good enough code.  Fixing GCC means that we will get
any optimization for free (by using the builtin), and so will any code that
actually uses ffs/ffsl/ffsll.
Cristian Rodríguez Aug. 1, 2023, 6:25 p.m. UTC | #8
On Tue, Aug 1, 2023 at 9:47 AM Adhemerval Zanella Netto via Libc-alpha <
libc-alpha@sourceware.org> wrote:

>
>
> On 31/07/23 20:44, Sunil Pandey wrote:
>
> On my system, ffssl is only used by a single program
> (/usr/bin/nvidia-persistenced)
> which I am not sure how it is compile because it a closed source one.


It is probably built with some strict-standard setting, -ansi or whatever.


>   The ffs is
> used by a couple more (gdb, lvm, and some mesa drivers), but again far
> from being
> a hotspot.
>
https://cgit.freedesktop.org/drm/libdrm/tree/meson.build#n27 --> built
with -std=c11 instead of -std=gnu11, therefore not benefiting from the
compiler's use of builtins.
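
The effect is easy to reproduce with a small test case (hypothetical file
name; ffsll is declared by hand because strict ISO mode hides the glibc
declaration):

  /* builtin_vs_call.c -- hypothetical illustration of the -std effect.
     gcc -O2 -std=gnu11 -S builtin_vs_call.c   # ffsll may be expanded inline
     gcc -O2 -std=c11   -S builtin_vs_call.c   # ffsll stays an external call
     Outside strict ISO mode GCC recognizes ffsll as a builtin; in strict
     ISO mode the call goes to libc, so the alignment of glibc's ffsll still
     matters for code built that way.  */
  extern int ffsll (long long int x);

  int
  lowest_set_bit (long long int x)
  {
    return ffsll (x);
  }
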

And as Andreas have said, the best course of action here is to fix the
> compiler
> if it is not generating code enough code.  Fixing gcc means that we will
> get
> any optimization by free (by using the builtin) and any code that actually
> uses
> ffs/ffsl/ffsll.
>

I agree. The compiler already does this; if it is done suboptimally, the ball
is in GCC's court to fix it.
Carlos O'Donell Jan. 10, 2024, 7:19 p.m. UTC | #9
On 7/31/23 14:35, Sunil K Pandey via Libc-alpha wrote:
> Ffsll function size is 17 byte, this patch optimizes size to 16 byte.
> Currently ffsll function randomly regress by ~20%, depending on how
> code get aligned.
> 
> This patch fixes ffsll function random performance regression.

Here is my suggestion.

Commit this to glibc 2.39 as an incremental improvement that we can backport
to any active branch.

I have already approved the removal for glibc 2.40 via Adhemerval's patch.

Reviewed-by: Carlos O'Donell <carlos@redhat.com>

> Changes from v1:
> - Further reduce size ffsll function size to 12 bytes.
> ---
>  sysdeps/x86_64/ffsll.c | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/sysdeps/x86_64/ffsll.c b/sysdeps/x86_64/ffsll.c
> index a1c13d4906..6a5803c7c1 100644
> --- a/sysdeps/x86_64/ffsll.c
> +++ b/sysdeps/x86_64/ffsll.c
> @@ -26,13 +26,13 @@ int
>  ffsll (long long int x)
>  {
>    long long int cnt;
> -  long long int tmp;
>  
> -  asm ("bsfq %2,%0\n"		/* Count low bits in X and store in %1.  */
> -       "cmoveq %1,%0\n"		/* If number was zero, use -1 as result.  */
> -       : "=&r" (cnt), "=r" (tmp) : "rm" (x), "1" (-1));
> +  asm ("mov $-1,%k0\n"	/* Intialize CNT to -1.  */
> +       "bsf %1,%0\n"	/* Count low bits in X and store in CNT.  */
> +       "inc %k0\n"	/* Increment CNT by 1.  */
> +       : "=&r" (cnt) : "r" (x));
>  
> -  return cnt + 1;
> +  return cnt;
>  }
>  
>  #ifndef __ILP32__
Sunil Pandey Jan. 25, 2024, 3:10 a.m. UTC | #10
On Wed, Jan 10, 2024 at 11:19 AM Carlos O'Donell <carlos@redhat.com> wrote:

> On 7/31/23 14:35, Sunil K Pandey via Libc-alpha wrote:
> > Ffsll function size is 17 byte, this patch optimizes size to 16 byte.
> > Currently ffsll function randomly regress by ~20%, depending on how
> > code get aligned.
> >
> > This patch fixes ffsll function random performance regression.
>
> Here is my suggestion.
>
> Commit this to glibc 2.39 as an incremental improvement that we can
> backport
> to any active branch.
>
> I have already approved the removal for glibc 2.40 via Adhemerval's patch.
>
> Reviewed-by: Carlos O'Donell <carlos@redhat.com>
>
> > Changes from v1:
> > - Further reduce size ffsll function size to 12 bytes.
> > ---
> >  sysdeps/x86_64/ffsll.c | 10 +++++-----
> >  1 file changed, 5 insertions(+), 5 deletions(-)
> >
> > diff --git a/sysdeps/x86_64/ffsll.c b/sysdeps/x86_64/ffsll.c
> > index a1c13d4906..6a5803c7c1 100644
> > --- a/sysdeps/x86_64/ffsll.c
> > +++ b/sysdeps/x86_64/ffsll.c
> > @@ -26,13 +26,13 @@ int
> >  ffsll (long long int x)
> >  {
> >    long long int cnt;
> > -  long long int tmp;
> >
> > -  asm ("bsfq %2,%0\n"                /* Count low bits in X and store
> in %1.  */
> > -       "cmoveq %1,%0\n"              /* If number was zero, use -1 as
> result.  */
> > -       : "=&r" (cnt), "=r" (tmp) : "rm" (x), "1" (-1));
> > +  asm ("mov $-1,%k0\n"       /* Intialize CNT to -1.  */
> > +       "bsf %1,%0\n" /* Count low bits in X and store in CNT.  */
> > +       "inc %k0\n"   /* Increment CNT by 1.  */
> > +       : "=&r" (cnt) : "r" (x));
> >
> > -  return cnt + 1;
> > +  return cnt;
> >  }
> >
> >  #ifndef __ILP32__
>
> --
> Cheers,
> Carlos.
>

I would like to backport this patch to release branches from 2.28 to 2.38.
Any comments or objections?

--Sunil

Patch

diff --git a/sysdeps/x86_64/ffsll.c b/sysdeps/x86_64/ffsll.c
index a1c13d4906..6a5803c7c1 100644
--- a/sysdeps/x86_64/ffsll.c
+++ b/sysdeps/x86_64/ffsll.c
@@ -26,13 +26,13 @@  int
 ffsll (long long int x)
 {
   long long int cnt;
-  long long int tmp;
 
-  asm ("bsfq %2,%0\n"		/* Count low bits in X and store in %1.  */
-       "cmoveq %1,%0\n"		/* If number was zero, use -1 as result.  */
-       : "=&r" (cnt), "=r" (tmp) : "rm" (x), "1" (-1));
+  asm ("mov $-1,%k0\n"	/* Intialize CNT to -1.  */
+       "bsf %1,%0\n"	/* Count low bits in X and store in CNT.  */
+       "inc %k0\n"	/* Increment CNT by 1.  */
+       : "=&r" (cnt) : "r" (x));
 
-  return cnt + 1;
+  return cnt;
 }
 
 #ifndef __ILP32__
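
For anyone sanity-checking the trick: below is a minimal C sketch of what the
new three-instruction sequence computes.  It assumes the long-standing (though,
on Intel, architecturally undocumented) behavior that BSF leaves its
destination unchanged when the source is zero, which is what lets CNT keep the
preloaded -1; the sketch is illustration only, not part of the patch.

  /* ffsll_sketch.c -- illustration only.
     Mirrors the new asm: mov $-1 into cnt; bsf x,cnt (overwrites cnt only
     when x != 0); inc cnt.  So ffsll (0) == 0, and otherwise the result is
     1 + index of the lowest set bit.  */
  #include <stdio.h>

  static int
  ffsll_sketch (long long int x)
  {
    int cnt = -1;                       /* mov $-1,%k0 */
    if (x != 0)
      cnt = __builtin_ctzll (x);        /* bsf %1,%0   */
    return cnt + 1;                     /* inc %k0     */
  }

  int
  main (void)
  {
    printf ("%d %d %d\n", ffsll_sketch (0), ffsll_sketch (1),
            ffsll_sketch (0x10));       /* expected: 0 1 5 */
    return 0;
  }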