mbox series

[v3,0/7] bitops: let optimize out non-atomic bitops on compile-time constants

Message ID 20220617144031.2549432-1-alexandr.lobakin@intel.com
Headers show
Series bitops: let optimize out non-atomic bitops on compile-time constants | expand

Message

Alexander Lobakin June 17, 2022, 2:40 p.m. UTC
While I was working on converting some structure fields from a fixed
type to a bitmap, I started observing code size increase not only in
places where the code works with the converted structure fields, but
also where the converted vars were on the stack. That said, the
following code:

	DECLARE_BITMAP(foo, BITS_PER_LONG) = { }; // -> unsigned long foo[1];
	unsigned long bar = BIT(BAR_BIT);
	unsigned long baz = 0;

	__set_bit(FOO_BIT, foo);
	baz |= BIT(BAZ_BIT);

	BUILD_BUG_ON(!__builtin_constant_p(test_bit(FOO_BIT, foo));
	BUILD_BUG_ON(!__builtin_constant_p(bar & BAR_BIT));
	BUILD_BUG_ON(!__builtin_constant_p(baz & BAZ_BIT));

triggers the first assertion on x86_64, which means that the
compiler is unable to evaluate it to a compile-time initializer
when the architecture-specific bitop is used even if it's obvious.
I found that this is due to that many architecture-specific
non-atomic bitop implementations use inline asm or other hacks which
are faster or more robust when working with "real" variables (i.e.
fields from the structures etc.), but the compilers have no clue how
to optimize them out when called on compile-time constants.

So, in order to let the compiler optimize out such cases, expand the
test_bit() and __*_bit() definitions with a compile-time condition
check, so that they will pick the generic C non-atomic bitop
implementations when all of the arguments passed are compile-time
constants, which means that the result will be a compile-time
constant as well and the compiler will produce more efficient and
simple code in 100% cases (no changes when there's at least one
non-compile-time-constant argument).
The condition itself:

if (
__builtin_constant_p(nr) &&	/* <- bit position is constant */
__builtin_constant_p(!!addr) &&	/* <- compiler knows bitmap addr is
				      always either NULL or not */
addr &&				/* <- bitmap addr is not NULL */
__builtin_constant_p(*addr)	/* <- compiler knows the value of
				      the target bitmap */
)
	/* then pick the generic C variant
else
	/* old code path, arch-specific

I also tried __is_constexpr() as suggested by Andy, but it was
always returning 0 ('not a constant') for the 2,3 and 4th
conditions.

The savings are architecture, compiler and compiler flags dependent,
for example, on x86_64 -O2:

GCC 12: add/remove: 78/29 grow/shrink: 332/525 up/down: 31325/-61560 (-30235)
LLVM 13: add/remove: 79/76 grow/shrink: 184/537 up/down: 55076/-141892 (-86816)
LLVM 14: add/remove: 10/3 grow/shrink: 93/138 up/down: 3705/-6992 (-3287)

and ARM64 (courtesy of Mark[0]):

GCC 11: add/remove: 92/29 grow/shrink: 933/2766 up/down: 39340/-82580 (-43240)
LLVM 14: add/remove: 21/11 grow/shrink: 620/651 up/down: 12060/-15824 (-3764)

And the following:

	DECLARE_BITMAP(flags, __IP_TUNNEL_FLAG_NUM) = { };
	__be16 flags;

	__set_bit(IP_TUNNEL_CSUM_BIT, flags);

	tun_flags = cpu_to_be16(*flags & U16_MAX);

	if (test_bit(IP_TUNNEL_VTI_BIT, flags))
		tun_flags |= VTI_ISVTI;

	BUILD_BUG_ON(!__builtin_constant_p(tun_flags));

doesn't blow up anymore (which is being checked now at build time),
so that we can now e.g. use fixed bitmaps in compile-time assertions
etc.

The series has been in intel-next for a while with no reported issues.

From v2[1]:
* collect several Reviewed-bys (Andy, Yury);
* add a comment to generic_test_bit() that it is atomic-safe and
  must always stay like that (the first version of this series
  errorneously tried to change this) (Andy, Marco);
* unify the way how architectures define platform-specific bitops,
  both supporting instrumentation and not: now they define only
  'arch_' versions and asm-generic includes take care of the rest;
* micro-optimize the diffstat of 0004/0007 (__check_bitop_pr())
  (Andy);
* add compile-time tests to lib/test_bitmap to make sure everything
  works as expected on any setup (Yury).

From v1[2]:
* change 'gen_' prefixes to '_generic' to disambiguate from
  'generated' etc. (Mark);
* define a separate 'const_' set to use in the optimization to keep
  the generic test_bit() atomic-safe (Marco);
* unify arch_{test,__*}_bit() as well and include them in the type
  check;
* add more relevant and up-to-date bloat-o-meter results, including
  ARM64 (me, Mark);
* pick a couple '*-by' tags (Mark, Yury).

Also available on my open GH[3].

[0] https://lore.kernel.org/all/Yp4GQFQYD32Rs9Cw@FVFF77S0Q05N
[1] https://lore.kernel.org/linux-arch/20220610113427.908751-1-alexandr.lobakin@intel.com
[2] https://lore.kernel.org/all/20220606114908.962562-1-alexandr.lobakin@intel.com
[3] https://github.com/alobakin/linux/commits/bitops

Alexander Lobakin (7):
  ia64, processor: fix -Wincompatible-pointer-types in ia64_get_irr()
  bitops: always define asm-generic non-atomic bitops
  bitops: unify non-atomic bitops prototypes across architectures
  bitops: define const_*() versions of the non-atomics
  bitops: wrap non-atomic bitops with a transparent macro
  bitops: let optimize out non-atomic bitops on compile-time constants
  lib: test_bitmap: add compile-time optimization/evaluations assertions

 arch/alpha/include/asm/bitops.h               |  32 ++--
 arch/hexagon/include/asm/bitops.h             |  24 ++-
 arch/ia64/include/asm/bitops.h                |  42 ++---
 arch/ia64/include/asm/processor.h             |   2 +-
 arch/m68k/include/asm/bitops.h                |  49 ++++--
 arch/sh/include/asm/bitops-op32.h             |  34 ++--
 arch/sparc/include/asm/bitops_32.h            |  18 +-
 arch/sparc/lib/atomic32.c                     |  12 +-
 arch/x86/include/asm/bitops.h                 |  22 +--
 .../asm-generic/bitops/generic-non-atomic.h   | 161 ++++++++++++++++++
 .../bitops/instrumented-non-atomic.h          |  35 ++--
 include/asm-generic/bitops/non-atomic.h       | 121 +------------
 .../bitops/non-instrumented-non-atomic.h      |  16 ++
 include/linux/bitops.h                        |  50 ++++++
 lib/test_bitmap.c                             |  45 +++++
 tools/include/asm-generic/bitops/non-atomic.h |  34 ++--
 tools/include/linux/bitops.h                  |  16 ++
 17 files changed, 476 insertions(+), 237 deletions(-)
 create mode 100644 include/asm-generic/bitops/generic-non-atomic.h
 create mode 100644 include/asm-generic/bitops/non-instrumented-non-atomic.h

Comments

Geert Uytterhoeven June 20, 2022, 12:07 p.m. UTC | #1
Hi Olek,

On Fri, Jun 17, 2022 at 6:51 PM Alexander Lobakin
<alexandr.lobakin@intel.com> wrote:
> While I was working on converting some structure fields from a fixed
> type to a bitmap, I started observing code size increase not only in
> places where the code works with the converted structure fields, but
> also where the converted vars were on the stack. That said, the
> following code:
>
>         DECLARE_BITMAP(foo, BITS_PER_LONG) = { }; // -> unsigned long foo[1];
>         unsigned long bar = BIT(BAR_BIT);
>         unsigned long baz = 0;
>
>         __set_bit(FOO_BIT, foo);
>         baz |= BIT(BAZ_BIT);
>
>         BUILD_BUG_ON(!__builtin_constant_p(test_bit(FOO_BIT, foo));
>         BUILD_BUG_ON(!__builtin_constant_p(bar & BAR_BIT));
>         BUILD_BUG_ON(!__builtin_constant_p(baz & BAZ_BIT));
>
> triggers the first assertion on x86_64, which means that the
> compiler is unable to evaluate it to a compile-time initializer
> when the architecture-specific bitop is used even if it's obvious.
> I found that this is due to that many architecture-specific
> non-atomic bitop implementations use inline asm or other hacks which
> are faster or more robust when working with "real" variables (i.e.
> fields from the structures etc.), but the compilers have no clue how
> to optimize them out when called on compile-time constants.
>
> So, in order to let the compiler optimize out such cases, expand the
> test_bit() and __*_bit() definitions with a compile-time condition
> check, so that they will pick the generic C non-atomic bitop
> implementations when all of the arguments passed are compile-time
> constants, which means that the result will be a compile-time
> constant as well and the compiler will produce more efficient and
> simple code in 100% cases (no changes when there's at least one
> non-compile-time-constant argument).
> The condition itself:
>
> if (
> __builtin_constant_p(nr) &&     /* <- bit position is constant */
> __builtin_constant_p(!!addr) && /* <- compiler knows bitmap addr is
>                                       always either NULL or not */
> addr &&                         /* <- bitmap addr is not NULL */
> __builtin_constant_p(*addr)     /* <- compiler knows the value of
>                                       the target bitmap */
> )
>         /* then pick the generic C variant
> else
>         /* old code path, arch-specific
>
> I also tried __is_constexpr() as suggested by Andy, but it was
> always returning 0 ('not a constant') for the 2,3 and 4th
> conditions.
>
> The savings are architecture, compiler and compiler flags dependent,
> for example, on x86_64 -O2:
>
> GCC 12: add/remove: 78/29 grow/shrink: 332/525 up/down: 31325/-61560 (-30235)
> LLVM 13: add/remove: 79/76 grow/shrink: 184/537 up/down: 55076/-141892 (-86816)
> LLVM 14: add/remove: 10/3 grow/shrink: 93/138 up/down: 3705/-6992 (-3287)
>
> and ARM64 (courtesy of Mark[0]):
>
> GCC 11: add/remove: 92/29 grow/shrink: 933/2766 up/down: 39340/-82580 (-43240)
> LLVM 14: add/remove: 21/11 grow/shrink: 620/651 up/down: 12060/-15824 (-3764)
>
> And the following:
>
>         DECLARE_BITMAP(flags, __IP_TUNNEL_FLAG_NUM) = { };
>         __be16 flags;
>
>         __set_bit(IP_TUNNEL_CSUM_BIT, flags);
>
>         tun_flags = cpu_to_be16(*flags & U16_MAX);
>
>         if (test_bit(IP_TUNNEL_VTI_BIT, flags))
>                 tun_flags |= VTI_ISVTI;
>
>         BUILD_BUG_ON(!__builtin_constant_p(tun_flags));
>
> doesn't blow up anymore (which is being checked now at build time),
> so that we can now e.g. use fixed bitmaps in compile-time assertions
> etc.
>
> The series has been in intel-next for a while with no reported issues.
>
> From v2[1]:
> * collect several Reviewed-bys (Andy, Yury);
> * add a comment to generic_test_bit() that it is atomic-safe and
>   must always stay like that (the first version of this series
>   errorneously tried to change this) (Andy, Marco);
> * unify the way how architectures define platform-specific bitops,
>   both supporting instrumentation and not: now they define only
>   'arch_' versions and asm-generic includes take care of the rest;
> * micro-optimize the diffstat of 0004/0007 (__check_bitop_pr())
>   (Andy);
> * add compile-time tests to lib/test_bitmap to make sure everything
>   works as expected on any setup (Yury).

Thanks for the update!

Still seeing
add/remove: 49/13 grow/shrink: 280/137 up/down: 6464/-3328 (3136)

on m68k atari_defconfig (i.e.CONFIG_CC_OPTIMIZE_FOR_SIZE=y)
with gcc version 9.4.0 (Ubuntu 9.4.0-1ubuntu1~20.04).

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds
Alexander Lobakin June 20, 2022, 1:22 p.m. UTC | #2
From: Geert Uytterhoeven <geert@linux-m68k.org>
Date: Mon, 20 Jun 2022 14:07:27 +0200

> Hi Olek,

Hey!

> 
> On Fri, Jun 17, 2022 at 6:51 PM Alexander Lobakin
> <alexandr.lobakin@intel.com> wrote:
> > While I was working on converting some structure fields from a fixed
> > type to a bitmap, I started observing code size increase not only in
> > places where the code works with the converted structure fields, but
> > also where the converted vars were on the stack. That said, the
> > following code:
> >
> >         DECLARE_BITMAP(foo, BITS_PER_LONG) = { }; // -> unsigned long foo[1];
> >         unsigned long bar = BIT(BAR_BIT);
> >         unsigned long baz = 0;
> >
> >         __set_bit(FOO_BIT, foo);
> >         baz |= BIT(BAZ_BIT);
> >
> >         BUILD_BUG_ON(!__builtin_constant_p(test_bit(FOO_BIT, foo));
> >         BUILD_BUG_ON(!__builtin_constant_p(bar & BAR_BIT));
> >         BUILD_BUG_ON(!__builtin_constant_p(baz & BAZ_BIT));
> >
> > triggers the first assertion on x86_64, which means that the
> > compiler is unable to evaluate it to a compile-time initializer
> > when the architecture-specific bitop is used even if it's obvious.
> > I found that this is due to that many architecture-specific
> > non-atomic bitop implementations use inline asm or other hacks which
> > are faster or more robust when working with "real" variables (i.e.
> > fields from the structures etc.), but the compilers have no clue how
> > to optimize them out when called on compile-time constants.
> >
> > So, in order to let the compiler optimize out such cases, expand the
> > test_bit() and __*_bit() definitions with a compile-time condition
> > check, so that they will pick the generic C non-atomic bitop
> > implementations when all of the arguments passed are compile-time
> > constants, which means that the result will be a compile-time
> > constant as well and the compiler will produce more efficient and
> > simple code in 100% cases (no changes when there's at least one
> > non-compile-time-constant argument).
> > The condition itself:
> >
> > if (
> > __builtin_constant_p(nr) &&     /* <- bit position is constant */
> > __builtin_constant_p(!!addr) && /* <- compiler knows bitmap addr is
> >                                       always either NULL or not */
> > addr &&                         /* <- bitmap addr is not NULL */
> > __builtin_constant_p(*addr)     /* <- compiler knows the value of
> >                                       the target bitmap */
> > )
> >         /* then pick the generic C variant
> > else
> >         /* old code path, arch-specific
> >
> > I also tried __is_constexpr() as suggested by Andy, but it was
> > always returning 0 ('not a constant') for the 2,3 and 4th
> > conditions.
> >
> > The savings are architecture, compiler and compiler flags dependent,
> > for example, on x86_64 -O2:
> >
> > GCC 12: add/remove: 78/29 grow/shrink: 332/525 up/down: 31325/-61560 (-30235)
> > LLVM 13: add/remove: 79/76 grow/shrink: 184/537 up/down: 55076/-141892 (-86816)
> > LLVM 14: add/remove: 10/3 grow/shrink: 93/138 up/down: 3705/-6992 (-3287)
> >
> > and ARM64 (courtesy of Mark[0]):
> >
> > GCC 11: add/remove: 92/29 grow/shrink: 933/2766 up/down: 39340/-82580 (-43240)
> > LLVM 14: add/remove: 21/11 grow/shrink: 620/651 up/down: 12060/-15824 (-3764)
> >
> > And the following:
> >
> >         DECLARE_BITMAP(flags, __IP_TUNNEL_FLAG_NUM) = { };
> >         __be16 flags;
> >
> >         __set_bit(IP_TUNNEL_CSUM_BIT, flags);
> >
> >         tun_flags = cpu_to_be16(*flags & U16_MAX);
> >
> >         if (test_bit(IP_TUNNEL_VTI_BIT, flags))
> >                 tun_flags |= VTI_ISVTI;
> >
> >         BUILD_BUG_ON(!__builtin_constant_p(tun_flags));
> >
> > doesn't blow up anymore (which is being checked now at build time),
> > so that we can now e.g. use fixed bitmaps in compile-time assertions
> > etc.
> >
> > The series has been in intel-next for a while with no reported issues.
> >
> > From v2[1]:
> > * collect several Reviewed-bys (Andy, Yury);
> > * add a comment to generic_test_bit() that it is atomic-safe and
> >   must always stay like that (the first version of this series
> >   errorneously tried to change this) (Andy, Marco);
> > * unify the way how architectures define platform-specific bitops,
> >   both supporting instrumentation and not: now they define only
> >   'arch_' versions and asm-generic includes take care of the rest;
> > * micro-optimize the diffstat of 0004/0007 (__check_bitop_pr())
> >   (Andy);
> > * add compile-time tests to lib/test_bitmap to make sure everything
> >   works as expected on any setup (Yury).
> 
> Thanks for the update!
> 
> Still seeing
> add/remove: 49/13 grow/shrink: 280/137 up/down: 6464/-3328 (3136)

Meh. What about -O2 (OPTIMIZE_FOR_PERFORMANCE)? I have a thought to
make it depend on the config option above, but that would make code
behave differently, so it's not safe.
Are those 3 Kb critical for m68k machines? I'm asking because for
some embedded systems they are :)
Another thing, this could happen due to inlining rebalance. E.g.
the compiler could inline or uninline some functions due to
resolving bit{maps,ops} to compile-time constants. I was seeing
such in the past several times. Also, IIRC you already sent some
bloat-o-meter results here, and that was the case.
On the other hand, if lib/test_bitmap.o builds successfully (I
assume it does), the series works as expected.

> 
> on m68k atari_defconfig (i.e.CONFIG_CC_OPTIMIZE_FOR_SIZE=y)
> with gcc version 9.4.0 (Ubuntu 9.4.0-1ubuntu1~20.04).
> 
> Gr{oetje,eeting}s,
> 
>                         Geert
> 
> --
> Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org
> 
> In personal conversations with technical people, I call myself a hacker. But
> when I'm talking to journalists I just say "programmer" or something like that.
>                                 -- Linus Torvalds

Thanks,
Olek
Alexander Lobakin June 20, 2022, 1:51 p.m. UTC | #3
From: kernel test robot <lkp@intel.com>
Date: Sun, 19 Jun 2022 17:20:05 +0800

Also, could someone please help me with this? I don't get what went
wrong with sparse, it's not even some new code, just moving old
stuff.

> tree:   https://github.com/alobakin/linux bitops
> head:   9bd39b17ce49d350eed93a031e0da6389067013e
> commit: 521611f961a7dda92eefa26e1afd3914c06af64e [3/7] bitops: unify non-atomic bitops prototypes across architectures
> config: mips-randconfig-s031-20220619 (https://download.01.org/0day-ci/archive/20220619/202206191726.wq70mbMK-lkp@intel.com/config)
> compiler: mips64el-linux-gcc (GCC) 11.3.0
> reproduce:
>         wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
>         chmod +x ~/bin/make.cross
>         # apt-get install sparse
>         # sparse version: v0.6.4-30-g92122700-dirty
>         # https://github.com/alobakin/linux/commit/521611f961a7dda92eefa26e1afd3914c06af64e
>         git remote add alobakin https://github.com/alobakin/linux
>         git fetch --no-tags alobakin bitops
>         git checkout 521611f961a7dda92eefa26e1afd3914c06af64e
>         # save the config file
>         mkdir build_dir && cp config build_dir/.config
>         COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.3.0 make.cross C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' O=build_dir ARCH=mips SHELL=/bin/bash
> 
> If you fix the issue, kindly add following tag where applicable
> Reported-by: kernel test robot <lkp@intel.com>
> 
> 
> sparse warnings: (new ones prefixed by >>)
>    command-line: note: in included file:
>    builtin:1:9: sparse: sparse: preprocessor token __ATOMIC_ACQUIRE redefined
>    builtin:0:0: sparse: this was the original definition
>    builtin:1:9: sparse: sparse: preprocessor token __ATOMIC_SEQ_CST redefined
>    builtin:0:0: sparse: this was the original definition
>    builtin:1:9: sparse: sparse: preprocessor token __ATOMIC_ACQ_REL redefined
>    builtin:0:0: sparse: this was the original definition
>    builtin:1:9: sparse: sparse: preprocessor token __ATOMIC_RELEASE redefined
>    builtin:0:0: sparse: this was the original definition
>    block/elevator.c: note: in included file (through include/linux/bitops.h, include/linux/kernel.h):
>    include/asm-generic/bitops/generic-non-atomic.h:29:9: sparse: sparse: unreplaced symbol 'mask'
>    include/asm-generic/bitops/generic-non-atomic.h:30:9: sparse: sparse: unreplaced symbol 'p'
>    include/asm-generic/bitops/generic-non-atomic.h:32:10: sparse: sparse: unreplaced symbol 'p'
>    include/asm-generic/bitops/generic-non-atomic.h:32:16: sparse: sparse: unreplaced symbol 'mask'
>    include/asm-generic/bitops/generic-non-atomic.h:27:1: sparse: sparse: unreplaced symbol 'return'
>    include/asm-generic/bitops/generic-non-atomic.h:38:9: sparse: sparse: unreplaced symbol 'mask'
>    include/asm-generic/bitops/generic-non-atomic.h:39:9: sparse: sparse: unreplaced symbol 'p'
>    include/asm-generic/bitops/generic-non-atomic.h:41:10: sparse: sparse: unreplaced symbol 'p'
>    include/asm-generic/bitops/generic-non-atomic.h:41:16: sparse: sparse: unreplaced symbol 'mask'
>    include/asm-generic/bitops/generic-non-atomic.h:36:1: sparse: sparse: unreplaced symbol 'return'
>    include/asm-generic/bitops/generic-non-atomic.h:56:9: sparse: sparse: unreplaced symbol 'mask'
>    include/asm-generic/bitops/generic-non-atomic.h:57:9: sparse: sparse: unreplaced symbol 'p'
>    include/asm-generic/bitops/generic-non-atomic.h:59:10: sparse: sparse: unreplaced symbol 'p'
>    include/asm-generic/bitops/generic-non-atomic.h:59:15: sparse: sparse: unreplaced symbol 'mask'
>    include/asm-generic/bitops/generic-non-atomic.h:54:1: sparse: sparse: unreplaced symbol 'return'
>    include/asm-generic/bitops/generic-non-atomic.h:74:9: sparse: sparse: unreplaced symbol 'mask'
>    include/asm-generic/bitops/generic-non-atomic.h:75:9: sparse: sparse: unreplaced symbol 'p'
>    include/asm-generic/bitops/generic-non-atomic.h:76:9: sparse: sparse: unreplaced symbol 'old'
>    include/asm-generic/bitops/generic-non-atomic.h:78:10: sparse: sparse: unreplaced symbol 'p'
>    include/asm-generic/bitops/generic-non-atomic.h:78:14: sparse: sparse: unreplaced symbol 'old'
>    include/asm-generic/bitops/generic-non-atomic.h:78:20: sparse: sparse: unreplaced symbol 'mask'
>    include/asm-generic/bitops/generic-non-atomic.h:79:17: sparse: sparse: unreplaced symbol 'old'
>    include/asm-generic/bitops/generic-non-atomic.h:79:23: sparse: sparse: unreplaced symbol 'mask'
>    include/asm-generic/bitops/generic-non-atomic.h:79:9: sparse: sparse: unreplaced symbol 'return'
>    include/asm-generic/bitops/generic-non-atomic.h:72:1: sparse: sparse: unreplaced symbol 'return'
>    include/asm-generic/bitops/generic-non-atomic.h:94:9: sparse: sparse: unreplaced symbol 'mask'
>    include/asm-generic/bitops/generic-non-atomic.h:95:9: sparse: sparse: unreplaced symbol 'p'
>    include/asm-generic/bitops/generic-non-atomic.h:96:9: sparse: sparse: unreplaced symbol 'old'
>    include/asm-generic/bitops/generic-non-atomic.h:98:10: sparse: sparse: unreplaced symbol 'p'
>    include/asm-generic/bitops/generic-non-atomic.h:98:14: sparse: sparse: unreplaced symbol 'old'
>    include/asm-generic/bitops/generic-non-atomic.h:98:21: sparse: sparse: unreplaced symbol 'mask'
>    include/asm-generic/bitops/generic-non-atomic.h:99:17: sparse: sparse: unreplaced symbol 'old'
>    include/asm-generic/bitops/generic-non-atomic.h:99:23: sparse: sparse: unreplaced symbol 'mask'
>    include/asm-generic/bitops/generic-non-atomic.h:99:9: sparse: sparse: unreplaced symbol 'return'
>    include/asm-generic/bitops/generic-non-atomic.h:92:1: sparse: sparse: unreplaced symbol 'return'
>    include/asm-generic/bitops/generic-non-atomic.h:106:9: sparse: sparse: unreplaced symbol 'mask'
>    include/asm-generic/bitops/generic-non-atomic.h:107:9: sparse: sparse: unreplaced symbol 'p'
>    include/asm-generic/bitops/generic-non-atomic.h:108:9: sparse: sparse: unreplaced symbol 'old'
>    include/asm-generic/bitops/generic-non-atomic.h:110:10: sparse: sparse: unreplaced symbol 'p'
>    include/asm-generic/bitops/generic-non-atomic.h:110:14: sparse: sparse: unreplaced symbol 'old'
>    include/asm-generic/bitops/generic-non-atomic.h:110:20: sparse: sparse: unreplaced symbol 'mask'
>    include/asm-generic/bitops/generic-non-atomic.h:111:17: sparse: sparse: unreplaced symbol 'old'
>    include/asm-generic/bitops/generic-non-atomic.h:111:23: sparse: sparse: unreplaced symbol 'mask'
>    include/asm-generic/bitops/generic-non-atomic.h:111:9: sparse: sparse: unreplaced symbol 'return'
>    include/asm-generic/bitops/generic-non-atomic.h:104:1: sparse: sparse: unreplaced symbol 'return'
>    include/asm-generic/bitops/generic-non-atomic.h:127:9: sparse: sparse: unreplaced symbol 'return'
>    include/asm-generic/bitops/generic-non-atomic.h:120:1: sparse: sparse: unreplaced symbol 'return'
> >> block/elevator.c:222:9: sparse: sparse: cast from restricted req_flags_t
> 
> vim +222 block/elevator.c
> 
> 9817064b68fef7 Jens Axboe        2006-07-28  217  
> 70b3ea056f3074 Jens Axboe        2016-12-07  218  void elv_rqhash_add(struct request_queue *q, struct request *rq)
> 9817064b68fef7 Jens Axboe        2006-07-28  219  {
> b374d18a4bfce7 Jens Axboe        2008-10-31  220  	struct elevator_queue *e = q->elevator;
> 9817064b68fef7 Jens Axboe        2006-07-28  221  
> 9817064b68fef7 Jens Axboe        2006-07-28 @222  	BUG_ON(ELV_ON_HASH(rq));
> 242d98f077ac0a Sasha Levin       2012-12-17  223  	hash_add(e->hash, &rq->hash, rq_hash_key(rq));
> e806402130c9c4 Christoph Hellwig 2016-10-20  224  	rq->rq_flags |= RQF_HASHED;
> 9817064b68fef7 Jens Axboe        2006-07-28  225  }
> bd166ef183c263 Jens Axboe        2017-01-17  226  EXPORT_SYMBOL_GPL(elv_rqhash_add);
> 9817064b68fef7 Jens Axboe        2006-07-28  227  
> 
> :::::: The code at line 222 was first introduced by commit
> :::::: 9817064b68fef7e4580c6df1ea597e106b9ff88b [PATCH] elevator: move the backmerging logic into the elevator core
> 
> :::::: TO: Jens Axboe <axboe@suse.de>
> :::::: CC: Jens Axboe <axboe@nelson.home.kernel.dk>
> 
> -- 
> 0-DAY CI Kernel Test Service
> https://01.org/lkp

Thanks,
Olek
Mark Rutland June 20, 2022, 2:19 p.m. UTC | #4
On Fri, Jun 17, 2022 at 04:40:24PM +0200, Alexander Lobakin wrote:
> So, in order to let the compiler optimize out such cases, expand the
> test_bit() and __*_bit() definitions with a compile-time condition
> check, so that they will pick the generic C non-atomic bitop
> implementations when all of the arguments passed are compile-time
> constants, which means that the result will be a compile-time
> constant as well and the compiler will produce more efficient and
> simple code in 100% cases (no changes when there's at least one
> non-compile-time-constant argument).

> The savings are architecture, compiler and compiler flags dependent,
> for example, on x86_64 -O2:
> 
> GCC 12: add/remove: 78/29 grow/shrink: 332/525 up/down: 31325/-61560 (-30235)
> LLVM 13: add/remove: 79/76 grow/shrink: 184/537 up/down: 55076/-141892 (-86816)
> LLVM 14: add/remove: 10/3 grow/shrink: 93/138 up/down: 3705/-6992 (-3287)
> 
> and ARM64 (courtesy of Mark[0]):
> 
> GCC 11: add/remove: 92/29 grow/shrink: 933/2766 up/down: 39340/-82580 (-43240)
> LLVM 14: add/remove: 21/11 grow/shrink: 620/651 up/down: 12060/-15824 (-3764)

Hmm... with *this version* of the series, I'm not getting results nearly as
good as that when building defconfig atop v5.19-rc3:

  GCC 8.5.0:   add/remove: 83/49 grow/shrink: 973/1147 up/down: 32020/-47824 (-15804)
  GCC 9.3.0:   add/remove: 68/51 grow/shrink: 1167/592 up/down: 30720/-31352 (-632)
  GCC 10.3.0:  add/remove: 84/37 grow/shrink: 1711/1003 up/down: 45392/-41844 (3548)
  GCC 11.1.0:  add/remove: 88/31 grow/shrink: 1635/963 up/down: 51540/-46096 (5444)
  GCC 11.3.0:  add/remove: 89/32 grow/shrink: 1629/966 up/down: 51456/-46056 (5400)
  GCC 12.1.0:  add/remove: 84/31 grow/shrink: 1540/829 up/down: 48772/-43164 (5608)

  LLVM 12.0.1: add/remove: 118/58 grow/shrink: 437/381 up/down: 45312/-65668 (-20356)
  LLVM 13.0.1: add/remove: 35/19 grow/shrink: 416/243 up/down: 14408/-22200 (-7792)
  LLVM 14.0.0: add/remove: 42/16 grow/shrink: 415/234 up/down: 15296/-21008 (-5712)

... and that now seems to be regressing codegen with recent versions of GCC as
much as it improves it LLVM.

I'm not sure if we've improved some other code and removed the benefit between
v5.19-rc1 and v5.19-rc3, or whether something else it at play, but this doesn't
look as compelling as it did.

Overall that's mostly hidden in the Image size, due to 64K alignment and
padding requirements:

  Toolchain      Before      After       Difference

  GCC 8.5.0      36178432    36178432    0
  GCC 9.3.0      36112896    36112896    0
  GCC 10.3.0     36442624    36377088    -65536
  GCC 11.1.0     36311552    36377088    +65536
  GCC 11.3.0     36311552    36311552    0
  GCC 12.1.0     36377088    36377088    0

  LLVM 12.0.1    31418880    31418880    0
  LLVM 13.0.1    31418880    31418880    0
  LLVM 14.0.0    31218176    31218176    0

... so aside from the blip around GCC 10.3.0 and 11.1.0, there's not a massive
change overall (due to 64KiB alignment restrictions for portions of the kernel
Image).

Thanks,
Mark.
Alexander Lobakin June 20, 2022, 3:08 p.m. UTC | #5
From: Mark Rutland <mark.rutland@arm.com>
Date: Mon, 20 Jun 2022 15:19:42 +0100

> On Fri, Jun 17, 2022 at 04:40:24PM +0200, Alexander Lobakin wrote:
> > So, in order to let the compiler optimize out such cases, expand the
> > test_bit() and __*_bit() definitions with a compile-time condition
> > check, so that they will pick the generic C non-atomic bitop
> > implementations when all of the arguments passed are compile-time
> > constants, which means that the result will be a compile-time
> > constant as well and the compiler will produce more efficient and
> > simple code in 100% cases (no changes when there's at least one
> > non-compile-time-constant argument).
> 
> > The savings are architecture, compiler and compiler flags dependent,
> > for example, on x86_64 -O2:
> > 
> > GCC 12: add/remove: 78/29 grow/shrink: 332/525 up/down: 31325/-61560 (-30235)
> > LLVM 13: add/remove: 79/76 grow/shrink: 184/537 up/down: 55076/-141892 (-86816)
> > LLVM 14: add/remove: 10/3 grow/shrink: 93/138 up/down: 3705/-6992 (-3287)
> > 
> > and ARM64 (courtesy of Mark[0]):
> > 
> > GCC 11: add/remove: 92/29 grow/shrink: 933/2766 up/down: 39340/-82580 (-43240)
> > LLVM 14: add/remove: 21/11 grow/shrink: 620/651 up/down: 12060/-15824 (-3764)
> 
> Hmm... with *this version* of the series, I'm not getting results nearly as
> good as that when building defconfig atop v5.19-rc3:
> 
>   GCC 8.5.0:   add/remove: 83/49 grow/shrink: 973/1147 up/down: 32020/-47824 (-15804)
>   GCC 9.3.0:   add/remove: 68/51 grow/shrink: 1167/592 up/down: 30720/-31352 (-632)
>   GCC 10.3.0:  add/remove: 84/37 grow/shrink: 1711/1003 up/down: 45392/-41844 (3548)
>   GCC 11.1.0:  add/remove: 88/31 grow/shrink: 1635/963 up/down: 51540/-46096 (5444)
>   GCC 11.3.0:  add/remove: 89/32 grow/shrink: 1629/966 up/down: 51456/-46056 (5400)
>   GCC 12.1.0:  add/remove: 84/31 grow/shrink: 1540/829 up/down: 48772/-43164 (5608)
> 
>   LLVM 12.0.1: add/remove: 118/58 grow/shrink: 437/381 up/down: 45312/-65668 (-20356)
>   LLVM 13.0.1: add/remove: 35/19 grow/shrink: 416/243 up/down: 14408/-22200 (-7792)
>   LLVM 14.0.0: add/remove: 42/16 grow/shrink: 415/234 up/down: 15296/-21008 (-5712)
> 
> ... and that now seems to be regressing codegen with recent versions of GCC as
> much as it improves it LLVM.
> 
> I'm not sure if we've improved some other code and removed the benefit between
> v5.19-rc1 and v5.19-rc3, or whether something else it at play, but this doesn't
> look as compelling as it did.

Mostly likely it's due to that in v1 I mistakingly removed
`volatile` from gen[eric]_test_bit(), so there was an impact for
non-constant cases as well.
+5 Kb sounds bad tho. Do you have CONFIG_TEST_BITMAP enabled, does
it work? Probably the same reason as for m68k, more constant
optimization -> more aggressive inlining or inlining rebalance ->
larger code. OTOH I've no idea why sometimes compiler decides to
uninline really tiny functions where due to this patch series some
bitops have been converted to constants, like it goes on m68k.

> 
> Overall that's mostly hidden in the Image size, due to 64K alignment and
> padding requirements:
> 
>   Toolchain      Before      After       Difference
> 
>   GCC 8.5.0      36178432    36178432    0
>   GCC 9.3.0      36112896    36112896    0
>   GCC 10.3.0     36442624    36377088    -65536
>   GCC 11.1.0     36311552    36377088    +65536
>   GCC 11.3.0     36311552    36311552    0
>   GCC 12.1.0     36377088    36377088    0
> 
>   LLVM 12.0.1    31418880    31418880    0
>   LLVM 13.0.1    31418880    31418880    0
>   LLVM 14.0.0    31218176    31218176    0
> 
> ... so aside from the blip around GCC 10.3.0 and 11.1.0, there's not a massive
> change overall (due to 64KiB alignment restrictions for portions of the kernel
> Image).
> 
> Thanks,
> Mark.

Thanks,
Olek
Andy Shevchenko June 20, 2022, 3:18 p.m. UTC | #6
On Mon, Jun 20, 2022 at 4:48 PM Alexander Lobakin
<alexandr.lobakin@intel.com> wrote:
>
> From: kernel test robot <lkp@intel.com>
> Date: Sun, 19 Jun 2022 17:20:05 +0800
>
> Also, could someone please help me with this? I don't get what went
> wrong with sparse, it's not even some new code, just moving old
> stuff.
>
> > tree:   https://github.com/alobakin/linux bitops
> > head:   9bd39b17ce49d350eed93a031e0da6389067013e
> > commit: 521611f961a7dda92eefa26e1afd3914c06af64e [3/7] bitops: unify non-atomic bitops prototypes across architectures
> > config: mips-randconfig-s031-20220619 (https://download.01.org/0day-ci/archive/20220619/202206191726.wq70mbMK-lkp@intel.com/config)
> > compiler: mips64el-linux-gcc (GCC) 11.3.0
> > reproduce:
> >         wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
> >         chmod +x ~/bin/make.cross
> >         # apt-get install sparse
> >         # sparse version: v0.6.4-30-g92122700-dirty
> >         # https://github.com/alobakin/linux/commit/521611f961a7dda92eefa26e1afd3914c06af64e
> >         git remote add alobakin https://github.com/alobakin/linux
> >         git fetch --no-tags alobakin bitops
> >         git checkout 521611f961a7dda92eefa26e1afd3914c06af64e
> >         # save the config file
> >         mkdir build_dir && cp config build_dir/.config
> >         COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.3.0 make.cross C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' O=build_dir ARCH=mips SHELL=/bin/bash
> >
> > If you fix the issue, kindly add following tag where applicable
> > Reported-by: kernel test robot <lkp@intel.com>
> >
> >
> > sparse warnings: (new ones prefixed by >>)
> >    command-line: note: in included file:
> >    builtin:1:9: sparse: sparse: preprocessor token __ATOMIC_ACQUIRE redefined
> >    builtin:0:0: sparse: this was the original definition
> >    builtin:1:9: sparse: sparse: preprocessor token __ATOMIC_SEQ_CST redefined
> >    builtin:0:0: sparse: this was the original definition
> >    builtin:1:9: sparse: sparse: preprocessor token __ATOMIC_ACQ_REL redefined
> >    builtin:0:0: sparse: this was the original definition
> >    builtin:1:9: sparse: sparse: preprocessor token __ATOMIC_RELEASE redefined
> >    builtin:0:0: sparse: this was the original definition
> >    block/elevator.c: note: in included file (through include/linux/bitops.h, include/linux/kernel.h):
> >    include/asm-generic/bitops/generic-non-atomic.h:29:9: sparse: sparse: unreplaced symbol 'mask'
> >    include/asm-generic/bitops/generic-non-atomic.h:30:9: sparse: sparse: unreplaced symbol 'p'
> >    include/asm-generic/bitops/generic-non-atomic.h:32:10: sparse: sparse: unreplaced symbol 'p'
> >    include/asm-generic/bitops/generic-non-atomic.h:32:16: sparse: sparse: unreplaced symbol 'mask'
> >    include/asm-generic/bitops/generic-non-atomic.h:27:1: sparse: sparse: unreplaced symbol 'return'
> >    include/asm-generic/bitops/generic-non-atomic.h:38:9: sparse: sparse: unreplaced symbol 'mask'
> >    include/asm-generic/bitops/generic-non-atomic.h:39:9: sparse: sparse: unreplaced symbol 'p'
> >    include/asm-generic/bitops/generic-non-atomic.h:41:10: sparse: sparse: unreplaced symbol 'p'
> >    include/asm-generic/bitops/generic-non-atomic.h:41:16: sparse: sparse: unreplaced symbol 'mask'
> >    include/asm-generic/bitops/generic-non-atomic.h:36:1: sparse: sparse: unreplaced symbol 'return'
> >    include/asm-generic/bitops/generic-non-atomic.h:56:9: sparse: sparse: unreplaced symbol 'mask'
> >    include/asm-generic/bitops/generic-non-atomic.h:57:9: sparse: sparse: unreplaced symbol 'p'
> >    include/asm-generic/bitops/generic-non-atomic.h:59:10: sparse: sparse: unreplaced symbol 'p'
> >    include/asm-generic/bitops/generic-non-atomic.h:59:15: sparse: sparse: unreplaced symbol 'mask'
> >    include/asm-generic/bitops/generic-non-atomic.h:54:1: sparse: sparse: unreplaced symbol 'return'
> >    include/asm-generic/bitops/generic-non-atomic.h:74:9: sparse: sparse: unreplaced symbol 'mask'
> >    include/asm-generic/bitops/generic-non-atomic.h:75:9: sparse: sparse: unreplaced symbol 'p'
> >    include/asm-generic/bitops/generic-non-atomic.h:76:9: sparse: sparse: unreplaced symbol 'old'
> >    include/asm-generic/bitops/generic-non-atomic.h:78:10: sparse: sparse: unreplaced symbol 'p'
> >    include/asm-generic/bitops/generic-non-atomic.h:78:14: sparse: sparse: unreplaced symbol 'old'
> >    include/asm-generic/bitops/generic-non-atomic.h:78:20: sparse: sparse: unreplaced symbol 'mask'
> >    include/asm-generic/bitops/generic-non-atomic.h:79:17: sparse: sparse: unreplaced symbol 'old'
> >    include/asm-generic/bitops/generic-non-atomic.h:79:23: sparse: sparse: unreplaced symbol 'mask'
> >    include/asm-generic/bitops/generic-non-atomic.h:79:9: sparse: sparse: unreplaced symbol 'return'
> >    include/asm-generic/bitops/generic-non-atomic.h:72:1: sparse: sparse: unreplaced symbol 'return'
> >    include/asm-generic/bitops/generic-non-atomic.h:94:9: sparse: sparse: unreplaced symbol 'mask'
> >    include/asm-generic/bitops/generic-non-atomic.h:95:9: sparse: sparse: unreplaced symbol 'p'
> >    include/asm-generic/bitops/generic-non-atomic.h:96:9: sparse: sparse: unreplaced symbol 'old'
> >    include/asm-generic/bitops/generic-non-atomic.h:98:10: sparse: sparse: unreplaced symbol 'p'
> >    include/asm-generic/bitops/generic-non-atomic.h:98:14: sparse: sparse: unreplaced symbol 'old'
> >    include/asm-generic/bitops/generic-non-atomic.h:98:21: sparse: sparse: unreplaced symbol 'mask'
> >    include/asm-generic/bitops/generic-non-atomic.h:99:17: sparse: sparse: unreplaced symbol 'old'
> >    include/asm-generic/bitops/generic-non-atomic.h:99:23: sparse: sparse: unreplaced symbol 'mask'
> >    include/asm-generic/bitops/generic-non-atomic.h:99:9: sparse: sparse: unreplaced symbol 'return'
> >    include/asm-generic/bitops/generic-non-atomic.h:92:1: sparse: sparse: unreplaced symbol 'return'
> >    include/asm-generic/bitops/generic-non-atomic.h:106:9: sparse: sparse: unreplaced symbol 'mask'
> >    include/asm-generic/bitops/generic-non-atomic.h:107:9: sparse: sparse: unreplaced symbol 'p'
> >    include/asm-generic/bitops/generic-non-atomic.h:108:9: sparse: sparse: unreplaced symbol 'old'
> >    include/asm-generic/bitops/generic-non-atomic.h:110:10: sparse: sparse: unreplaced symbol 'p'
> >    include/asm-generic/bitops/generic-non-atomic.h:110:14: sparse: sparse: unreplaced symbol 'old'
> >    include/asm-generic/bitops/generic-non-atomic.h:110:20: sparse: sparse: unreplaced symbol 'mask'
> >    include/asm-generic/bitops/generic-non-atomic.h:111:17: sparse: sparse: unreplaced symbol 'old'
> >    include/asm-generic/bitops/generic-non-atomic.h:111:23: sparse: sparse: unreplaced symbol 'mask'
> >    include/asm-generic/bitops/generic-non-atomic.h:111:9: sparse: sparse: unreplaced symbol 'return'
> >    include/asm-generic/bitops/generic-non-atomic.h:104:1: sparse: sparse: unreplaced symbol 'return'
> >    include/asm-generic/bitops/generic-non-atomic.h:127:9: sparse: sparse: unreplaced symbol 'return'
> >    include/asm-generic/bitops/generic-non-atomic.h:120:1: sparse: sparse: unreplaced symbol 'return'
> > >> block/elevator.c:222:9: sparse: sparse: cast from restricted req_flags_t
> >
> > vim +222 block/elevator.c
> >
> > 9817064b68fef7 Jens Axboe        2006-07-28  217
> > 70b3ea056f3074 Jens Axboe        2016-12-07  218  void elv_rqhash_add(struct request_queue *q, struct request *rq)
> > 9817064b68fef7 Jens Axboe        2006-07-28  219  {
> > b374d18a4bfce7 Jens Axboe        2008-10-31  220      struct elevator_queue *e = q->elevator;
> > 9817064b68fef7 Jens Axboe        2006-07-28  221
> > 9817064b68fef7 Jens Axboe        2006-07-28 @222      BUG_ON(ELV_ON_HASH(rq));
> > 242d98f077ac0a Sasha Levin       2012-12-17  223      hash_add(e->hash, &rq->hash, rq_hash_key(rq));
> > e806402130c9c4 Christoph Hellwig 2016-10-20  224      rq->rq_flags |= RQF_HASHED;
> > 9817064b68fef7 Jens Axboe        2006-07-28  225  }
> > bd166ef183c263 Jens Axboe        2017-01-17  226  EXPORT_SYMBOL_GPL(elv_rqhash_add);
> > 9817064b68fef7 Jens Axboe        2006-07-28  227

It looks like a false positive for _your_ case, but if you want to fix
here is the background.

The sparse has an ability to control custom types that should never
set bits outside of the limited range. For this the special annotation
is given, i.e. __bitwise. Since the culprit type is defined that way
it means the pure integer (signed or unsigned) that comes with pure
definition can't be used in a safe way. To solve this each of such
definitions should be converted to the very same type (req_flags_t).
See serial core where some UART flags are defined in a similar way and
how code copes with that.
Alexander Lobakin June 20, 2022, 3:27 p.m. UTC | #7
From: Andy Shevchenko <andy.shevchenko@gmail.com>
Date: Mon, 20 Jun 2022 17:18:40 +0200

> On Mon, Jun 20, 2022 at 4:48 PM Alexander Lobakin
> <alexandr.lobakin@intel.com> wrote:
> >
> > From: kernel test robot <lkp@intel.com>
> > Date: Sun, 19 Jun 2022 17:20:05 +0800
> >
> > Also, could someone please help me with this? I don't get what went
> > wrong with sparse, it's not even some new code, just moving old
> > stuff.
> >
> > > tree:   https://github.com/alobakin/linux bitops
> > > head:   9bd39b17ce49d350eed93a031e0da6389067013e
> > > commit: 521611f961a7dda92eefa26e1afd3914c06af64e [3/7] bitops: unify non-atomic bitops prototypes across architectures
> > > config: mips-randconfig-s031-20220619 (https://download.01.org/0day-ci/archive/20220619/202206191726.wq70mbMK-lkp@intel.com/config)
> > > compiler: mips64el-linux-gcc (GCC) 11.3.0
> > > reproduce:
> > >         wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
> > >         chmod +x ~/bin/make.cross
> > >         # apt-get install sparse
> > >         # sparse version: v0.6.4-30-g92122700-dirty
> > >         # https://github.com/alobakin/linux/commit/521611f961a7dda92eefa26e1afd3914c06af64e
> > >         git remote add alobakin https://github.com/alobakin/linux
> > >         git fetch --no-tags alobakin bitops
> > >         git checkout 521611f961a7dda92eefa26e1afd3914c06af64e
> > >         # save the config file
> > >         mkdir build_dir && cp config build_dir/.config
> > >         COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.3.0 make.cross C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' O=build_dir ARCH=mips SHELL=/bin/bash
> > >
> > > If you fix the issue, kindly add following tag where applicable
> > > Reported-by: kernel test robot <lkp@intel.com>
> > >
> > >
> > > sparse warnings: (new ones prefixed by >>)
> > >    command-line: note: in included file:
> > >    builtin:1:9: sparse: sparse: preprocessor token __ATOMIC_ACQUIRE redefined
> > >    builtin:0:0: sparse: this was the original definition
> > >    builtin:1:9: sparse: sparse: preprocessor token __ATOMIC_SEQ_CST redefined
> > >    builtin:0:0: sparse: this was the original definition
> > >    builtin:1:9: sparse: sparse: preprocessor token __ATOMIC_ACQ_REL redefined
> > >    builtin:0:0: sparse: this was the original definition
> > >    builtin:1:9: sparse: sparse: preprocessor token __ATOMIC_RELEASE redefined
> > >    builtin:0:0: sparse: this was the original definition
> > >    block/elevator.c: note: in included file (through include/linux/bitops.h, include/linux/kernel.h):
> > >    include/asm-generic/bitops/generic-non-atomic.h:29:9: sparse: sparse: unreplaced symbol 'mask'
> > >    include/asm-generic/bitops/generic-non-atomic.h:30:9: sparse: sparse: unreplaced symbol 'p'
> > >    include/asm-generic/bitops/generic-non-atomic.h:32:10: sparse: sparse: unreplaced symbol 'p'
> > >    include/asm-generic/bitops/generic-non-atomic.h:32:16: sparse: sparse: unreplaced symbol 'mask'
> > >    include/asm-generic/bitops/generic-non-atomic.h:27:1: sparse: sparse: unreplaced symbol 'return'
> > >    include/asm-generic/bitops/generic-non-atomic.h:38:9: sparse: sparse: unreplaced symbol 'mask'
> > >    include/asm-generic/bitops/generic-non-atomic.h:39:9: sparse: sparse: unreplaced symbol 'p'
> > >    include/asm-generic/bitops/generic-non-atomic.h:41:10: sparse: sparse: unreplaced symbol 'p'
> > >    include/asm-generic/bitops/generic-non-atomic.h:41:16: sparse: sparse: unreplaced symbol 'mask'
> > >    include/asm-generic/bitops/generic-non-atomic.h:36:1: sparse: sparse: unreplaced symbol 'return'
> > >    include/asm-generic/bitops/generic-non-atomic.h:56:9: sparse: sparse: unreplaced symbol 'mask'
> > >    include/asm-generic/bitops/generic-non-atomic.h:57:9: sparse: sparse: unreplaced symbol 'p'
> > >    include/asm-generic/bitops/generic-non-atomic.h:59:10: sparse: sparse: unreplaced symbol 'p'
> > >    include/asm-generic/bitops/generic-non-atomic.h:59:15: sparse: sparse: unreplaced symbol 'mask'
> > >    include/asm-generic/bitops/generic-non-atomic.h:54:1: sparse: sparse: unreplaced symbol 'return'
> > >    include/asm-generic/bitops/generic-non-atomic.h:74:9: sparse: sparse: unreplaced symbol 'mask'
> > >    include/asm-generic/bitops/generic-non-atomic.h:75:9: sparse: sparse: unreplaced symbol 'p'
> > >    include/asm-generic/bitops/generic-non-atomic.h:76:9: sparse: sparse: unreplaced symbol 'old'
> > >    include/asm-generic/bitops/generic-non-atomic.h:78:10: sparse: sparse: unreplaced symbol 'p'
> > >    include/asm-generic/bitops/generic-non-atomic.h:78:14: sparse: sparse: unreplaced symbol 'old'
> > >    include/asm-generic/bitops/generic-non-atomic.h:78:20: sparse: sparse: unreplaced symbol 'mask'
> > >    include/asm-generic/bitops/generic-non-atomic.h:79:17: sparse: sparse: unreplaced symbol 'old'
> > >    include/asm-generic/bitops/generic-non-atomic.h:79:23: sparse: sparse: unreplaced symbol 'mask'
> > >    include/asm-generic/bitops/generic-non-atomic.h:79:9: sparse: sparse: unreplaced symbol 'return'
> > >    include/asm-generic/bitops/generic-non-atomic.h:72:1: sparse: sparse: unreplaced symbol 'return'
> > >    include/asm-generic/bitops/generic-non-atomic.h:94:9: sparse: sparse: unreplaced symbol 'mask'
> > >    include/asm-generic/bitops/generic-non-atomic.h:95:9: sparse: sparse: unreplaced symbol 'p'
> > >    include/asm-generic/bitops/generic-non-atomic.h:96:9: sparse: sparse: unreplaced symbol 'old'
> > >    include/asm-generic/bitops/generic-non-atomic.h:98:10: sparse: sparse: unreplaced symbol 'p'
> > >    include/asm-generic/bitops/generic-non-atomic.h:98:14: sparse: sparse: unreplaced symbol 'old'
> > >    include/asm-generic/bitops/generic-non-atomic.h:98:21: sparse: sparse: unreplaced symbol 'mask'
> > >    include/asm-generic/bitops/generic-non-atomic.h:99:17: sparse: sparse: unreplaced symbol 'old'
> > >    include/asm-generic/bitops/generic-non-atomic.h:99:23: sparse: sparse: unreplaced symbol 'mask'
> > >    include/asm-generic/bitops/generic-non-atomic.h:99:9: sparse: sparse: unreplaced symbol 'return'
> > >    include/asm-generic/bitops/generic-non-atomic.h:92:1: sparse: sparse: unreplaced symbol 'return'
> > >    include/asm-generic/bitops/generic-non-atomic.h:106:9: sparse: sparse: unreplaced symbol 'mask'
> > >    include/asm-generic/bitops/generic-non-atomic.h:107:9: sparse: sparse: unreplaced symbol 'p'
> > >    include/asm-generic/bitops/generic-non-atomic.h:108:9: sparse: sparse: unreplaced symbol 'old'
> > >    include/asm-generic/bitops/generic-non-atomic.h:110:10: sparse: sparse: unreplaced symbol 'p'
> > >    include/asm-generic/bitops/generic-non-atomic.h:110:14: sparse: sparse: unreplaced symbol 'old'
> > >    include/asm-generic/bitops/generic-non-atomic.h:110:20: sparse: sparse: unreplaced symbol 'mask'
> > >    include/asm-generic/bitops/generic-non-atomic.h:111:17: sparse: sparse: unreplaced symbol 'old'
> > >    include/asm-generic/bitops/generic-non-atomic.h:111:23: sparse: sparse: unreplaced symbol 'mask'
> > >    include/asm-generic/bitops/generic-non-atomic.h:111:9: sparse: sparse: unreplaced symbol 'return'
> > >    include/asm-generic/bitops/generic-non-atomic.h:104:1: sparse: sparse: unreplaced symbol 'return'
> > >    include/asm-generic/bitops/generic-non-atomic.h:127:9: sparse: sparse: unreplaced symbol 'return'
> > >    include/asm-generic/bitops/generic-non-atomic.h:120:1: sparse: sparse: unreplaced symbol 'return'
> > > >> block/elevator.c:222:9: sparse: sparse: cast from restricted req_flags_t
> > >
> > > vim +222 block/elevator.c
> > >
> > > 9817064b68fef7 Jens Axboe        2006-07-28  217
> > > 70b3ea056f3074 Jens Axboe        2016-12-07  218  void elv_rqhash_add(struct request_queue *q, struct request *rq)
> > > 9817064b68fef7 Jens Axboe        2006-07-28  219  {
> > > b374d18a4bfce7 Jens Axboe        2008-10-31  220      struct elevator_queue *e = q->elevator;
> > > 9817064b68fef7 Jens Axboe        2006-07-28  221
> > > 9817064b68fef7 Jens Axboe        2006-07-28 @222      BUG_ON(ELV_ON_HASH(rq));
> > > 242d98f077ac0a Sasha Levin       2012-12-17  223      hash_add(e->hash, &rq->hash, rq_hash_key(rq));
> > > e806402130c9c4 Christoph Hellwig 2016-10-20  224      rq->rq_flags |= RQF_HASHED;
> > > 9817064b68fef7 Jens Axboe        2006-07-28  225  }
> > > bd166ef183c263 Jens Axboe        2017-01-17  226  EXPORT_SYMBOL_GPL(elv_rqhash_add);
> > > 9817064b68fef7 Jens Axboe        2006-07-28  227
> 
> It looks like a false positive for _your_ case, but if you want to fix
> here is the background.
> 
> The sparse has an ability to control custom types that should never
> set bits outside of the limited range. For this the special annotation
> is given, i.e. __bitwise. Since the culprit type is defined that way
> it means the pure integer (signed or unsigned) that comes with pure
> definition can't be used in a safe way. To solve this each of such
> definitions should be converted to the very same type (req_flags_t).
> See serial core where some UART flags are defined in a similar way and
> how code copes with that.

Ah, I haven't looked at the req_flags_t definition and didn't know
it uses bitwise annotation. So it's a local blk layer problem (not
casting from/to bitwise), not a generic header issue. Thanks for
the hint!

> 
> -- 
> With Best Regards,
> Andy Shevchenko

Thanks,
Olek
Luc Van Oostenryck June 20, 2022, 7:21 p.m. UTC | #8
On Mon, Jun 20, 2022 at 03:51:46PM +0200, Alexander Lobakin wrote:
> From: kernel test robot <lkp@intel.com>
> Date: Sun, 19 Jun 2022 17:20:05 +0800
> 
> Also, could someone please help me with this? I don't get what went
> wrong with sparse, it's not even some new code, just moving old
> stuff.

Hi,

The first sparse's warnings (sparse: preprocessor token __ATOMIC_*) are already
fixed (and most probably, the bots have already taken the fix).
I'm working on the second ones (sparse: unreplaced symbol).

-- Luc (Sparse's maintainer)
Mark Rutland June 21, 2022, 6:03 a.m. UTC | #9
On Mon, Jun 20, 2022 at 05:08:55PM +0200, Alexander Lobakin wrote:
> From: Mark Rutland <mark.rutland@arm.com>
> Date: Mon, 20 Jun 2022 15:19:42 +0100
> 
> > On Fri, Jun 17, 2022 at 04:40:24PM +0200, Alexander Lobakin wrote:
> > 
> > > The savings are architecture, compiler and compiler flags dependent,
> > > for example, on x86_64 -O2:
> > > 
> > > GCC 12: add/remove: 78/29 grow/shrink: 332/525 up/down: 31325/-61560 (-30235)
> > > LLVM 13: add/remove: 79/76 grow/shrink: 184/537 up/down: 55076/-141892 (-86816)
> > > LLVM 14: add/remove: 10/3 grow/shrink: 93/138 up/down: 3705/-6992 (-3287)
> > > 
> > > and ARM64 (courtesy of Mark[0]):
> > > 
> > > GCC 11: add/remove: 92/29 grow/shrink: 933/2766 up/down: 39340/-82580 (-43240)
> > > LLVM 14: add/remove: 21/11 grow/shrink: 620/651 up/down: 12060/-15824 (-3764)
> > 
> > Hmm... with *this version* of the series, I'm not getting results nearly as
> > good as that when building defconfig atop v5.19-rc3:
> > 
> >   GCC 8.5.0:   add/remove: 83/49 grow/shrink: 973/1147 up/down: 32020/-47824 (-15804)
> >   GCC 9.3.0:   add/remove: 68/51 grow/shrink: 1167/592 up/down: 30720/-31352 (-632)
> >   GCC 10.3.0:  add/remove: 84/37 grow/shrink: 1711/1003 up/down: 45392/-41844 (3548)
> >   GCC 11.1.0:  add/remove: 88/31 grow/shrink: 1635/963 up/down: 51540/-46096 (5444)
> >   GCC 11.3.0:  add/remove: 89/32 grow/shrink: 1629/966 up/down: 51456/-46056 (5400)
> >   GCC 12.1.0:  add/remove: 84/31 grow/shrink: 1540/829 up/down: 48772/-43164 (5608)
> > 
> >   LLVM 12.0.1: add/remove: 118/58 grow/shrink: 437/381 up/down: 45312/-65668 (-20356)
> >   LLVM 13.0.1: add/remove: 35/19 grow/shrink: 416/243 up/down: 14408/-22200 (-7792)
> >   LLVM 14.0.0: add/remove: 42/16 grow/shrink: 415/234 up/down: 15296/-21008 (-5712)
> > 
> > ... and that now seems to be regressing codegen with recent versions of GCC as
> > much as it improves it LLVM.
> > 
> > I'm not sure if we've improved some other code and removed the benefit between
> > v5.19-rc1 and v5.19-rc3, or whether something else it at play, but this doesn't
> > look as compelling as it did.
> 
> Mostly likely it's due to that in v1 I mistakingly removed
> `volatile` from gen[eric]_test_bit(), so there was an impact for
> non-constant cases as well.
> +5 Kb sounds bad tho. Do you have CONFIG_TEST_BITMAP enabled, does
> it work? 

I didn't have it enabled, but I tried that just nw with GCC 12.1.0 and it
builds cleanly, and test_bitmap_const_eval() gets optimized away entirely. If i
remove the `static` from that, GCC generates:

| <test_bitmap_const_eval>:
|     paciasp
|     autiasp
|     ret

... which is a trivial stub.

> Probably the same reason as for m68k, more constant
> optimization -> more aggressive inlining or inlining rebalance ->
> larger code. OTOH I've no idea why sometimes compiler decides to
> uninline really tiny functions where due to this patch series some
> bitops have been converted to constants, like it goes on m68k.

Hmm.... it'd be interesting to take a look at a few architectures and see what
the common case is.

Thanks,
Mark.
Yury Norov June 21, 2022, 5:39 p.m. UTC | #10
On Mon, Jun 20, 2022 at 05:08:55PM +0200, Alexander Lobakin wrote:
> From: Mark Rutland <mark.rutland@arm.com>
> Date: Mon, 20 Jun 2022 15:19:42 +0100
> 
> > On Fri, Jun 17, 2022 at 04:40:24PM +0200, Alexander Lobakin wrote:
> > > So, in order to let the compiler optimize out such cases, expand the
> > > test_bit() and __*_bit() definitions with a compile-time condition
> > > check, so that they will pick the generic C non-atomic bitop
> > > implementations when all of the arguments passed are compile-time
> > > constants, which means that the result will be a compile-time
> > > constant as well and the compiler will produce more efficient and
> > > simple code in 100% cases (no changes when there's at least one
> > > non-compile-time-constant argument).
> > 
> > > The savings are architecture, compiler and compiler flags dependent,
> > > for example, on x86_64 -O2:
> > > 
> > > GCC 12: add/remove: 78/29 grow/shrink: 332/525 up/down: 31325/-61560 (-30235)
> > > LLVM 13: add/remove: 79/76 grow/shrink: 184/537 up/down: 55076/-141892 (-86816)
> > > LLVM 14: add/remove: 10/3 grow/shrink: 93/138 up/down: 3705/-6992 (-3287)
> > > 
> > > and ARM64 (courtesy of Mark[0]):
> > > 
> > > GCC 11: add/remove: 92/29 grow/shrink: 933/2766 up/down: 39340/-82580 (-43240)
> > > LLVM 14: add/remove: 21/11 grow/shrink: 620/651 up/down: 12060/-15824 (-3764)
> > 
> > Hmm... with *this version* of the series, I'm not getting results nearly as
> > good as that when building defconfig atop v5.19-rc3:
> > 
> >   GCC 8.5.0:   add/remove: 83/49 grow/shrink: 973/1147 up/down: 32020/-47824 (-15804)
> >   GCC 9.3.0:   add/remove: 68/51 grow/shrink: 1167/592 up/down: 30720/-31352 (-632)
> >   GCC 10.3.0:  add/remove: 84/37 grow/shrink: 1711/1003 up/down: 45392/-41844 (3548)
> >   GCC 11.1.0:  add/remove: 88/31 grow/shrink: 1635/963 up/down: 51540/-46096 (5444)
> >   GCC 11.3.0:  add/remove: 89/32 grow/shrink: 1629/966 up/down: 51456/-46056 (5400)
> >   GCC 12.1.0:  add/remove: 84/31 grow/shrink: 1540/829 up/down: 48772/-43164 (5608)
> > 
> >   LLVM 12.0.1: add/remove: 118/58 grow/shrink: 437/381 up/down: 45312/-65668 (-20356)
> >   LLVM 13.0.1: add/remove: 35/19 grow/shrink: 416/243 up/down: 14408/-22200 (-7792)
> >   LLVM 14.0.0: add/remove: 42/16 grow/shrink: 415/234 up/down: 15296/-21008 (-5712)
> > 
> > ... and that now seems to be regressing codegen with recent versions of GCC as
> > much as it improves it LLVM.
> > 
> > I'm not sure if we've improved some other code and removed the benefit between
> > v5.19-rc1 and v5.19-rc3, or whether something else it at play, but this doesn't
> > look as compelling as it did.
> 
> Mostly likely it's due to that in v1 I mistakingly removed
> `volatile` from gen[eric]_test_bit(), so there was an impact for
> non-constant cases as well.
> +5 Kb sounds bad tho. Do you have CONFIG_TEST_BITMAP enabled, does
> it work? Probably the same reason as for m68k, more constant
> optimization -> more aggressive inlining or inlining rebalance ->
> larger code. OTOH I've no idea why sometimes compiler decides to
> uninline really tiny functions where due to this patch series some
> bitops have been converted to constants, like it goes on m68k.
> 
> > 
> > Overall that's mostly hidden in the Image size, due to 64K alignment and
> > padding requirements:
> > 
> >   Toolchain      Before      After       Difference
> > 
> >   GCC 8.5.0      36178432    36178432    0
> >   GCC 9.3.0      36112896    36112896    0
> >   GCC 10.3.0     36442624    36377088    -65536
> >   GCC 11.1.0     36311552    36377088    +65536
> >   GCC 11.3.0     36311552    36311552    0
> >   GCC 12.1.0     36377088    36377088    0
> > 
> >   LLVM 12.0.1    31418880    31418880    0
> >   LLVM 13.0.1    31418880    31418880    0
> >   LLVM 14.0.0    31218176    31218176    0
> > 
> > ... so aside from the blip around GCC 10.3.0 and 11.1.0, there's not a massive
> > change overall (due to 64KiB alignment restrictions for portions of the kernel
> > Image).

I gave it a try on v5.19-rc3 for arm64 with my default GCC 11.2, and it's:
add/remove: 89/33 grow/shrink: 1629/966 up/down: 51456/-46064 (5392)

Which is not great in terms of layout size. But I don't think we should
focus too much on those numbers. The goal of the series is not to shrink
the image; the true goal is to provide more information to the compiler
in a hope that it will make a better decision regarding optimizations.

Looking at results provided by Mark, both GCC and LLVM have a tendency
to inline and use other techniques that increase the image more
aggressively in newer releases, comparing to old ones. From this
perspective, unless we find some terribly wrong behavior, I'm OK with
+5K for the Image, because I trust my compiler and believe it spent
those 5K wisely.

For the reasons said above, I think we shouldn't disable const
bitops for -Os build.

I think this series has total positive impact because it adds a lot
of information for compiler with just a few lines of code.

If no objections, I think it's good to try it in -next. Alexander,
would you like me to fix gen/generic typo in comment and take it in
bitmap-for-next, or you'd prefer to send v4?

Thanks,
Yury
Alexander Lobakin June 21, 2022, 6:51 p.m. UTC | #11
From: Yury Norov <yury.norov@gmail.com>
Date: Tue, 21 Jun 2022 10:39:44 -0700

> On Mon, Jun 20, 2022 at 05:08:55PM +0200, Alexander Lobakin wrote:
> > From: Mark Rutland <mark.rutland@arm.com>
> > Date: Mon, 20 Jun 2022 15:19:42 +0100
> > 
> > > On Fri, Jun 17, 2022 at 04:40:24PM +0200, Alexander Lobakin wrote:
> > > > So, in order to let the compiler optimize out such cases, expand the
> > > > test_bit() and __*_bit() definitions with a compile-time condition
> > > > check, so that they will pick the generic C non-atomic bitop
> > > > implementations when all of the arguments passed are compile-time
> > > > constants, which means that the result will be a compile-time
> > > > constant as well and the compiler will produce more efficient and
> > > > simple code in 100% cases (no changes when there's at least one
> > > > non-compile-time-constant argument).
> > > 
> > > > The savings are architecture, compiler and compiler flags dependent,
> > > > for example, on x86_64 -O2:
> > > > 
> > > > GCC 12: add/remove: 78/29 grow/shrink: 332/525 up/down: 31325/-61560 (-30235)
> > > > LLVM 13: add/remove: 79/76 grow/shrink: 184/537 up/down: 55076/-141892 (-86816)
> > > > LLVM 14: add/remove: 10/3 grow/shrink: 93/138 up/down: 3705/-6992 (-3287)
> > > > 
> > > > and ARM64 (courtesy of Mark[0]):
> > > > 
> > > > GCC 11: add/remove: 92/29 grow/shrink: 933/2766 up/down: 39340/-82580 (-43240)
> > > > LLVM 14: add/remove: 21/11 grow/shrink: 620/651 up/down: 12060/-15824 (-3764)
> > > 
> > > Hmm... with *this version* of the series, I'm not getting results nearly as
> > > good as that when building defconfig atop v5.19-rc3:
> > > 
> > >   GCC 8.5.0:   add/remove: 83/49 grow/shrink: 973/1147 up/down: 32020/-47824 (-15804)
> > >   GCC 9.3.0:   add/remove: 68/51 grow/shrink: 1167/592 up/down: 30720/-31352 (-632)
> > >   GCC 10.3.0:  add/remove: 84/37 grow/shrink: 1711/1003 up/down: 45392/-41844 (3548)
> > >   GCC 11.1.0:  add/remove: 88/31 grow/shrink: 1635/963 up/down: 51540/-46096 (5444)
> > >   GCC 11.3.0:  add/remove: 89/32 grow/shrink: 1629/966 up/down: 51456/-46056 (5400)
> > >   GCC 12.1.0:  add/remove: 84/31 grow/shrink: 1540/829 up/down: 48772/-43164 (5608)
> > > 
> > >   LLVM 12.0.1: add/remove: 118/58 grow/shrink: 437/381 up/down: 45312/-65668 (-20356)
> > >   LLVM 13.0.1: add/remove: 35/19 grow/shrink: 416/243 up/down: 14408/-22200 (-7792)
> > >   LLVM 14.0.0: add/remove: 42/16 grow/shrink: 415/234 up/down: 15296/-21008 (-5712)
> > > 
> > > ... and that now seems to be regressing codegen with recent versions of GCC as
> > > much as it improves it LLVM.
> > > 
> > > I'm not sure if we've improved some other code and removed the benefit between
> > > v5.19-rc1 and v5.19-rc3, or whether something else it at play, but this doesn't
> > > look as compelling as it did.
> > 
> > Mostly likely it's due to that in v1 I mistakingly removed
> > `volatile` from gen[eric]_test_bit(), so there was an impact for
> > non-constant cases as well.
> > +5 Kb sounds bad tho. Do you have CONFIG_TEST_BITMAP enabled, does
> > it work? Probably the same reason as for m68k, more constant
> > optimization -> more aggressive inlining or inlining rebalance ->
> > larger code. OTOH I've no idea why sometimes compiler decides to
> > uninline really tiny functions where due to this patch series some
> > bitops have been converted to constants, like it goes on m68k.
> > 
> > > 
> > > Overall that's mostly hidden in the Image size, due to 64K alignment and
> > > padding requirements:
> > > 
> > >   Toolchain      Before      After       Difference
> > > 
> > >   GCC 8.5.0      36178432    36178432    0
> > >   GCC 9.3.0      36112896    36112896    0
> > >   GCC 10.3.0     36442624    36377088    -65536
> > >   GCC 11.1.0     36311552    36377088    +65536
> > >   GCC 11.3.0     36311552    36311552    0
> > >   GCC 12.1.0     36377088    36377088    0
> > > 
> > >   LLVM 12.0.1    31418880    31418880    0
> > >   LLVM 13.0.1    31418880    31418880    0
> > >   LLVM 14.0.0    31218176    31218176    0
> > > 
> > > ... so aside from the blip around GCC 10.3.0 and 11.1.0, there's not a massive
> > > change overall (due to 64KiB alignment restrictions for portions of the kernel
> > > Image).
> 
> I gave it a try on v5.19-rc3 for arm64 with my default GCC 11.2, and it's:
> add/remove: 89/33 grow/shrink: 1629/966 up/down: 51456/-46064 (5392)
> 
> Which is not great in terms of layout size. But I don't think we should
> focus too much on those numbers. The goal of the series is not to shrink
> the image; the true goal is to provide more information to the compiler
> in a hope that it will make a better decision regarding optimizations.
> 
> Looking at results provided by Mark, both GCC and LLVM have a tendency
> to inline and use other techniques that increase the image more
> aggressively in newer releases, comparing to old ones. From this
> perspective, unless we find some terribly wrong behavior, I'm OK with
> +5K for the Image, because I trust my compiler and believe it spent
> those 5K wisely.
> 
> For the reasons said above, I think we shouldn't disable const
> bitops for -Os build.
> 
> I think this series has total positive impact because it adds a lot
> of information for compiler with just a few lines of code.

Right, that was the primary intention. But then I got some text size
decreases and thought this applies to any setup :)

> 
> If no objections, I think it's good to try it in -next. Alexander,
> would you like me to fix gen/generic typo in comment and take it in
> bitmap-for-next, or you'd prefer to send v4?

I'm sending v4 in a couple minutes, lkp reported that on ARC GCC
never expands mem*() builtins to plain assignments, which sucks,
but failed my compile-time tests, so I adjusted code a bit. Hope
that change will be okay for everyone, so that you could pick it.

> 
> Thanks,
> Yury

Thanks,
Olek