Message ID | 20211113023409.49472-1-hongyu.wang@intel.com |
---|---|
State | New |
Series | PR target/103069: Relax cmpxchg loop for x86 target |
On Sat, Nov 13, 2021 at 3:34 AM Hongyu Wang <hongyu.wang@intel.com> wrote:
>
> Hi,
>
> From the CPU's point of view, getting a cache line for writing is more
> expensive than reading. See Appendix A.2 Spinlock in:
>
> https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf
>
> The full compare and swap will grab the cache line exclusive and cause
> excessive cache line bouncing.
>
> The atomic_fetch_{or,xor,and,nand} builtins generate a cmpxchg loop under
> -march=x86-64 like:
>
>         movl    (%rdi), %eax
> .L2:
>         movl    %eax, %edx
>         movl    %eax, %r8d
>         orl     %esi, %edx
>         lock cmpxchgl   %edx, (%rdi)
>         jne     .L2
>         movl    %r8d, %eax
>         ret
>
> To relax the above loop, GCC should first emit a normal load, check and jump to
> .L2 if cmpxchgl may fail. Before jumping to .L2, PAUSE should be inserted to
> yield the CPU to another hyperthread and to save power, so the code is
> like
>
>         movl    (%rdi), %eax
> .L4:
>         movl    (%rdi), %ecx
>         movl    %eax, %edx
>         orl     %esi, %edx
>         cmpl    %eax, %ecx
>         jne     .L2
>         lock cmpxchgl   %edx, (%rdi)
>         jne     .L4
> .L2:
>         rep nop
>         jmp     .L4
>
> This patch adds corresponding atomic_fetch_op expanders to insert load/
> compare and pause for all the atomic logic fetch builtins. Add flag
> -mrelax-cmpxchg-loop to control whether to generate relaxed loop.
>
> Bootstrapped/regtested on x86_64-pc-linux-gnu{-m32,}.
> Ok for master?
>
> gcc/ChangeLog:
>
>         PR target/103069
>         * config/i386/i386-expand.c (ix86_expand_atomic_fetch_op_loop):
>         New expand function.
>         * config/i386/i386-options.c (ix86_target_string): Add
>         -mrelax-cmpxchg-loop flag.
>         (ix86_valid_target_attribute_inner_p): Likewise.
>         * config/i386/i386-protos.h (ix86_expand_atomic_fetch_op_loop):
>         New expand function prototype.
>         * config/i386/i386.opt: Add -mrelax-cmpxchg-loop.
>         * config/i386/sync.md (atomic_fetch_<logic><mode>): New expander
>         for SI,HI,QI modes.
>         (atomic_<logic>_fetch<mode>): Likewise.
>         (atomic_fetch_nand<mode>): Likewise.
>         (atomic_nand_fetch<mode>): Likewise.
>         (atomic_fetch_<logic><mode>): New expander for DI,TI modes.
>         (atomic_<logic>_fetch<mode>): Likewise.
>         (atomic_fetch_nand<mode>): Likewise.
>         (atomic_nand_fetch<mode>): Likewise.
>         * doc/invoke.texi: Document -mrelax-cmpxchg-loop.
>
> gcc/testsuite/ChangeLog:
>
>         PR target/103069
>         * gcc.target/i386/pr103069-1.c: New test.
>         * gcc.target/i386/pr103069-2.c: Ditto.

LGTM, with a couple of issues in the testsuite section.

Thanks,
Uros.

> ---
>  gcc/config/i386/i386-expand.c              |  77 ++++++++++++++
>  gcc/config/i386/i386-options.c             |   7 +-
>  gcc/config/i386/i386-protos.h              |   2 +
>  gcc/config/i386/i386.opt                   |   4 +
>  gcc/config/i386/sync.md                    | 117 +++++++++++++++++++++
>  gcc/doc/invoke.texi                        |   9 +-
>  gcc/testsuite/gcc.target/i386/pr103069-1.c |  35 ++++++
>  gcc/testsuite/gcc.target/i386/pr103069-2.c |  70 ++++++++++++
>  8 files changed, 319 insertions(+), 2 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr103069-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr103069-2.c
>
> diff --git a/gcc/config/i386/i386-expand.c b/gcc/config/i386/i386-expand.c
> index 088e6af2258..f8a61835d85 100644
> --- a/gcc/config/i386/i386-expand.c
> +++ b/gcc/config/i386/i386-expand.c
> @@ -23138,4 +23138,81 @@ ix86_expand_divmod_libfunc (rtx libfunc, machine_mode mode,
>    *rem_p = rem;
>  }
>
> +void ix86_expand_atomic_fetch_op_loop (rtx target, rtx mem, rtx val,
> +                                       enum rtx_code code, bool after,
> +                                       bool doubleword)
> +{
> +  rtx old_reg, new_reg, old_mem, success, oldval, new_mem;
> +  rtx_code_label *loop_label, *pause_label;
> +  machine_mode mode = GET_MODE (target);
> +
> +  old_reg = gen_reg_rtx (mode);
> +  new_reg = old_reg;
> +  loop_label = gen_label_rtx ();
> +  pause_label = gen_label_rtx ();
> +  old_mem = copy_to_reg (mem);
> +  emit_label (loop_label);
> +  emit_move_insn (old_reg, old_mem);
> +
> +  /* return value for atomic_fetch_op. */
> +  if (!after)
> +    emit_move_insn (target, old_reg);
> +
> +  if (code == NOT)
> +    {
> +      new_reg = expand_simple_binop (mode, AND, new_reg, val, NULL_RTX,
> +                                     true, OPTAB_LIB_WIDEN);
> +      new_reg = expand_simple_unop (mode, code, new_reg, NULL_RTX, true);
> +    }
> +  else
> +    new_reg = expand_simple_binop (mode, code, new_reg, val, NULL_RTX,
> +                                   true, OPTAB_LIB_WIDEN);
> +
> +  /* return value for atomic_op_fetch. */
> +  if (after)
> +    emit_move_insn (target, new_reg);
> +
> +  /* Load memory again inside loop. */
> +  new_mem = copy_to_reg (mem);
> +  /* Compare mem value with expected value. */
> +
> +  if (doubleword)
> +    {
> +      machine_mode half_mode = (mode == DImode)? SImode : DImode;
> +      rtx low_new_mem = gen_lowpart (half_mode, new_mem);
> +      rtx low_old_mem = gen_lowpart (half_mode, old_mem);
> +      rtx high_new_mem = gen_highpart (half_mode, new_mem);
> +      rtx high_old_mem = gen_highpart (half_mode, old_mem);
> +      emit_cmp_and_jump_insns (low_new_mem, low_old_mem, NE, NULL_RTX,
> +                               half_mode, 1, pause_label,
> +                               profile_probability::guessed_never ());
> +      emit_cmp_and_jump_insns (high_new_mem, high_old_mem, NE, NULL_RTX,
> +                               half_mode, 1, pause_label,
> +                               profile_probability::guessed_never ());
> +    }
> +  else
> +    emit_cmp_and_jump_insns (new_mem, old_mem, NE, NULL_RTX,
> +                             GET_MODE (old_mem), 1, pause_label,
> +                             profile_probability::guessed_never ());
> +
> +  success = NULL_RTX;
> +  oldval = old_mem;
> +  expand_atomic_compare_and_swap (&success, &oldval, mem, old_reg,
> +                                  new_reg, false, MEMMODEL_SYNC_SEQ_CST,
> +                                  MEMMODEL_RELAXED);
> +  if (oldval != old_mem)
> +    emit_move_insn (old_mem, oldval);
> +
> +  emit_cmp_and_jump_insns (success, const0_rtx, EQ, const0_rtx,
> +                           GET_MODE (success), 1, loop_label,
> +                           profile_probability::guessed_never ());
> +
> +  /* If mem is not expected, pause and loop back. */
> +  emit_label (pause_label);
> +  emit_insn (gen_pause ());
> +  emit_jump_insn (gen_jump (loop_label));
> +  emit_barrier ();
> +}
> +
> +
>  #include "gt-i386-expand.h"
> diff --git a/gcc/config/i386/i386-options.c b/gcc/config/i386/i386-options.c
> index a8cc0664f11..feff2584f41 100644
> --- a/gcc/config/i386/i386-options.c
> +++ b/gcc/config/i386/i386-options.c
> @@ -397,7 +397,8 @@ ix86_target_string (HOST_WIDE_INT isa, HOST_WIDE_INT isa2,
>      { "-mstv", MASK_STV },
>      { "-mavx256-split-unaligned-load", MASK_AVX256_SPLIT_UNALIGNED_LOAD },
>      { "-mavx256-split-unaligned-store", MASK_AVX256_SPLIT_UNALIGNED_STORE },
> -    { "-mcall-ms2sysv-xlogues", MASK_CALL_MS2SYSV_XLOGUES }
> +    { "-mcall-ms2sysv-xlogues", MASK_CALL_MS2SYSV_XLOGUES },
> +    { "-mrelax-cmpxchg-loop", MASK_RELAX_CMPXCHG_LOOP }
>    };
>
>    /* Additional flag options. */
> @@ -1092,6 +1093,10 @@ ix86_valid_target_attribute_inner_p (tree fndecl, tree args, char *p_strings[],
>      IX86_ATTR_IX86_YES ("general-regs-only",
>                          OPT_mgeneral_regs_only,
>                          OPTION_MASK_GENERAL_REGS_ONLY),
> +
> +    IX86_ATTR_YES ("relax-cmpxchg-loop",
> +                   OPT_mrelax_cmpxchg_loop,
> +                   MASK_RELAX_CMPXCHG_LOOP),
>    };
>
>    location_t loc
> diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
> index bd52450a148..7e05510c679 100644
> --- a/gcc/config/i386/i386-protos.h
> +++ b/gcc/config/i386/i386-protos.h
> @@ -217,6 +217,8 @@ extern void ix86_move_vector_high_sse_to_mmx (rtx);
>  extern void ix86_split_mmx_pack (rtx[], enum rtx_code);
>  extern void ix86_split_mmx_punpck (rtx[], bool);
>  extern void ix86_expand_avx_vzeroupper (void);
> +extern void ix86_expand_atomic_fetch_op_loop (rtx, rtx, rtx, enum rtx_code,
> +                                              bool, bool);
>
>  #ifdef TREE_CODE
>  extern void init_cumulative_args (CUMULATIVE_ARGS *, tree, rtx, tree, int);
> diff --git a/gcc/config/i386/i386.opt b/gcc/config/i386/i386.opt
> index ad366974b5b..46fad3cc038 100644
> --- a/gcc/config/i386/i386.opt
> +++ b/gcc/config/i386/i386.opt
> @@ -404,6 +404,10 @@ momit-leaf-frame-pointer
>  Target Mask(OMIT_LEAF_FRAME_POINTER) Save
>  Omit the frame pointer in leaf functions.
>
> +mrelax-cmpxchg-loop
> +Target Mask(RELAX_CMPXCHG_LOOP) Save
> +Relax cmpxchg loop for atomic_fetch_{or,xor,and,nand} by adding load and cmp before cmpxchg, execute pause and loop back to load and compare if load value is not expected.
> +
>  mpc32
>  Target RejectNegative
>  Set 80387 floating-point precision to 32-bit.
> diff --git a/gcc/config/i386/sync.md b/gcc/config/i386/sync.md
> index 05a835256bb..46048425327 100644
> --- a/gcc/config/i386/sync.md
> +++ b/gcc/config/i386/sync.md
> @@ -525,6 +525,123 @@
>     (set (reg:CCZ FLAGS_REG)
>          (unspec_volatile:CCZ [(const_int 0)] UNSPECV_CMPXCHG))])])
>
> +(define_expand "atomic_fetch_<logic><mode>"
> +  [(match_operand:SWI124 0 "register_operand")
> +   (any_logic:SWI124
> +    (match_operand:SWI124 1 "memory_operand")
> +    (match_operand:SWI124 2 "register_operand"))
> +   (match_operand:SI 3 "const_int_operand")]
> +  "TARGET_CMPXCHG && TARGET_RELAX_CMPXCHG_LOOP"
> +{
> +  ix86_expand_atomic_fetch_op_loop (operands[0], operands[1],
> +                                    operands[2], <CODE>, false,
> +                                    false);
> +  DONE;
> +})
> +
> +(define_expand "atomic_<logic>_fetch<mode>"
> +  [(match_operand:SWI124 0 "register_operand")
> +   (any_logic:SWI124
> +    (match_operand:SWI124 1 "memory_operand")
> +    (match_operand:SWI124 2 "register_operand"))
> +   (match_operand:SI 3 "const_int_operand")]
> +  "TARGET_CMPXCHG && TARGET_RELAX_CMPXCHG_LOOP"
> +{
> +  ix86_expand_atomic_fetch_op_loop (operands[0], operands[1],
> +                                    operands[2], <CODE>, true,
> +                                    false);
> +  DONE;
> +})
> +
> +(define_expand "atomic_fetch_nand<mode>"
> +  [(match_operand:SWI124 0 "register_operand")
> +   (match_operand:SWI124 1 "memory_operand")
> +   (match_operand:SWI124 2 "register_operand")
> +   (match_operand:SI 3 "const_int_operand")]
> +  "TARGET_CMPXCHG && TARGET_RELAX_CMPXCHG_LOOP"
> +{
> +  ix86_expand_atomic_fetch_op_loop (operands[0], operands[1],
> +                                    operands[2], NOT, false,
> +                                    false);
> +  DONE;
> +})
> +
> +(define_expand "atomic_nand_fetch<mode>"
> +  [(match_operand:SWI124 0 "register_operand")
> +   (match_operand:SWI124 1 "memory_operand")
> +   (match_operand:SWI124 2 "register_operand")
> +   (match_operand:SI 3 "const_int_operand")]
> +  "TARGET_CMPXCHG && TARGET_RELAX_CMPXCHG_LOOP"
> +{
> +  ix86_expand_atomic_fetch_op_loop (operands[0], operands[1],
> +                                    operands[2], NOT, true,
> +                                    false);
> +  DONE;
> +})
> +
> +(define_expand "atomic_fetch_<logic><mode>"
> +  [(match_operand:CASMODE 0 "register_operand")
> +   (any_logic:CASMODE
> +    (match_operand:CASMODE 1 "memory_operand")
> +    (match_operand:CASMODE 2 "register_operand"))
> +   (match_operand:SI 3 "const_int_operand")]
> +  "TARGET_CMPXCHG && TARGET_RELAX_CMPXCHG_LOOP"
> +{
> +  bool doubleword = (<MODE>mode == DImode && !TARGET_64BIT)
> +                    || (<MODE>mode == TImode);
> +  ix86_expand_atomic_fetch_op_loop (operands[0], operands[1],
> +                                    operands[2], <CODE>, false,
> +                                    doubleword);
> +  DONE;
> +})
> +
> +(define_expand "atomic_<logic>_fetch<mode>"
> +  [(match_operand:CASMODE 0 "register_operand")
> +   (any_logic:CASMODE
> +    (match_operand:CASMODE 1 "memory_operand")
> +    (match_operand:CASMODE 2 "register_operand"))
> +   (match_operand:SI 3 "const_int_operand")]
> +  "TARGET_CMPXCHG && TARGET_RELAX_CMPXCHG_LOOP"
> +{
> +  bool doubleword = (<MODE>mode == DImode && !TARGET_64BIT)
> +                    || (<MODE>mode == TImode);
> +  ix86_expand_atomic_fetch_op_loop (operands[0], operands[1],
> +                                    operands[2], <CODE>, true,
> +                                    doubleword);
> +  DONE;
> +})
> +
> +(define_expand "atomic_fetch_nand<mode>"
> +  [(match_operand:CASMODE 0 "register_operand")
> +   (match_operand:CASMODE 1 "memory_operand")
> +   (match_operand:CASMODE 2 "register_operand")
> +   (match_operand:SI 3 "const_int_operand")]
> +  "TARGET_CMPXCHG && TARGET_RELAX_CMPXCHG_LOOP"
> +{
> +  bool doubleword = (<MODE>mode == DImode && !TARGET_64BIT)
> +                    || (<MODE>mode == TImode);
> +  ix86_expand_atomic_fetch_op_loop (operands[0], operands[1],
> +                                    operands[2], NOT, false,
> +                                    doubleword);
> +  DONE;
> +})
> +
> +(define_expand "atomic_nand_fetch<mode>"
> +  [(match_operand:CASMODE 0 "register_operand")
> +   (match_operand:CASMODE 1 "memory_operand")
> +   (match_operand:CASMODE 2 "register_operand")
> +   (match_operand:SI 3 "const_int_operand")]
> +  "TARGET_CMPXCHG && TARGET_RELAX_CMPXCHG_LOOP"
> +{
> +  bool doubleword = (<MODE>mode == DImode && !TARGET_64BIT)
> +                    || (<MODE>mode == TImode);
> +  ix86_expand_atomic_fetch_op_loop (operands[0], operands[1],
> +                                    operands[2], NOT, true,
> +                                    doubleword);
> +  DONE;
> +})
> +
> +
>  ;; For operand 2 nonmemory_operand predicate is used instead of
>  ;; register_operand to allow combiner to better optimize atomic
>  ;; additions of constants.
> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> index 2aba4c70b44..06ecf79bc0c 100644
> --- a/gcc/doc/invoke.texi
> +++ b/gcc/doc/invoke.texi
> @@ -1419,7 +1419,7 @@ See RS/6000 and PowerPC Options.
>  -mstack-protector-guard-reg=@var{reg} @gol
>  -mstack-protector-guard-offset=@var{offset} @gol
>  -mstack-protector-guard-symbol=@var{symbol} @gol
> --mgeneral-regs-only -mcall-ms2sysv-xlogues @gol
> +-mgeneral-regs-only -mcall-ms2sysv-xlogues -mrelax-cmpxchg-loop @gol
>  -mindirect-branch=@var{choice} -mfunction-return=@var{choice} @gol
>  -mindirect-branch-register -mneeded}
>
> @@ -32259,6 +32259,13 @@ Generate code that uses only the general-purpose registers. This
>  prevents the compiler from using floating-point, vector, mask and bound
>  registers.
>
> +@item -mrelax-cmpxchg-loop
> +@opindex mrelax-cmpxchg-loop
> +Relax cmpxchg loop by emit early load and compare before cmpxchg,

... by emitting an ...

> +execute pause if load value is not expected. This reduces excessive
> +cachline bouncing when and works for all atomic logic fetch builtins
> +that generates compare and swap loop.
> +
>  @item -mindirect-branch=@var{choice}
>  @opindex mindirect-branch
>  Convert indirect call and jump with @var{choice}. The default is
> diff --git a/gcc/testsuite/gcc.target/i386/pr103069-1.c b/gcc/testsuite/gcc.target/i386/pr103069-1.c
> new file mode 100644
> index 00000000000..444485cbae9
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr103069-1.c
> @@ -0,0 +1,35 @@
> +/* PR target/103068 */
> +/* { dg-do compile } */
> +/* { dg-additional-options "-O2 -march=x86-64 -mtune=generic -mrelax-cmpxchg-loop" } */
> +/* { dg-final { scan-assembler-times "rep nop" 32 } } */

Please note that for older assemblers we emit "rep; nop" here.

> +
> +#include <stdint.h>
> +
> +#define FUNC_ATOMIC(TYPE, OP) \
> +__attribute__ ((noinline, noclone)) \
> +TYPE f_##TYPE##_##OP##_fetch (TYPE *a, TYPE b) \
> +{ \
> +  return __atomic_##OP##_fetch (a, b, __ATOMIC_RELAXED); \
> +} \
> +__attribute__ ((noinline, noclone)) \
> +TYPE f_##TYPE##_fetch_##OP (TYPE *a, TYPE b) \
> +{ \
> +  return __atomic_fetch_##OP (a, b, __ATOMIC_RELAXED); \
> +}
> +
> +FUNC_ATOMIC (int64_t, and)
> +FUNC_ATOMIC (int64_t, nand)
> +FUNC_ATOMIC (int64_t, or)
> +FUNC_ATOMIC (int64_t, xor)
> +FUNC_ATOMIC (int, and)
> +FUNC_ATOMIC (int, nand)
> +FUNC_ATOMIC (int, or)
> +FUNC_ATOMIC (int, xor)
> +FUNC_ATOMIC (short, and)
> +FUNC_ATOMIC (short, nand)
> +FUNC_ATOMIC (short, or)
> +FUNC_ATOMIC (short, xor)
> +FUNC_ATOMIC (char, and)
> +FUNC_ATOMIC (char, nand)
> +FUNC_ATOMIC (char, or)
> +FUNC_ATOMIC (char, xor)
> diff --git a/gcc/testsuite/gcc.target/i386/pr103069-2.c b/gcc/testsuite/gcc.target/i386/pr103069-2.c
> new file mode 100644
> index 00000000000..8ac824cc8e8
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr103069-2.c
> @@ -0,0 +1,70 @@
> +/* PR target/103068 */
> +/* { dg-do compile } */
> +/* { dg-additional-options "-O2 -march=x86-64 -mtune=generic" } */
> +
> +#include <stdlib.h>
> +#include "pr103069-1.c"
> +
> +#define FUNC_ATOMIC_RELAX(TYPE, OP) \
> +__attribute__ ((noinline, noclone, target ("relax-cmpxchg-loop"))) \
> +TYPE relax_##TYPE##_##OP##_fetch (TYPE *a, TYPE b) \
> +{ \
> +  return __atomic_##OP##_fetch (a, b, __ATOMIC_RELAXED); \
> +} \
> +__attribute__ ((noinline, noclone, target ("relax-cmpxchg-loop"))) \
> +TYPE relax_##TYPE##_fetch_##OP (TYPE *a, TYPE b) \
> +{ \
> +  return __atomic_fetch_##OP (a, b, __ATOMIC_RELAXED); \
> +}
> +
> +FUNC_ATOMIC_RELAX (int64_t, and)
> +FUNC_ATOMIC_RELAX (int64_t, nand)
> +FUNC_ATOMIC_RELAX (int64_t, or)
> +FUNC_ATOMIC_RELAX (int64_t, xor)
> +FUNC_ATOMIC_RELAX (int, and)
> +FUNC_ATOMIC_RELAX (int, nand)
> +FUNC_ATOMIC_RELAX (int, or)
> +FUNC_ATOMIC_RELAX (int, xor)
> +FUNC_ATOMIC_RELAX (short, and)
> +FUNC_ATOMIC_RELAX (short, nand)
> +FUNC_ATOMIC_RELAX (short, or)
> +FUNC_ATOMIC_RELAX (short, xor)
> +FUNC_ATOMIC_RELAX (char, and)
> +FUNC_ATOMIC_RELAX (char, nand)
> +FUNC_ATOMIC_RELAX (char, or)
> +FUNC_ATOMIC_RELAX (char, xor)
> +
> +#define TEST_ATOMIC_FETCH_LOGIC(TYPE, OP) \
> +{ \
> +  TYPE a = 11, b = 101, res, exp; \
> +  res = relax_##TYPE##_##OP##_fetch (&a, b); \
> +  exp = f_##TYPE##_##OP##_fetch (&a, b); \
> +  if (res != exp) \
> +    abort (); \
> +  a = 21, b = 92; \
> +  res = relax_##TYPE##_fetch_##OP (&a, b); \
> +  exp = f_##TYPE##_fetch_##OP (&a, b); \
> +  if (res != exp) \
> +    abort (); \
> +}
> +
> +int main (void)
> +{
> +  TEST_ATOMIC_FETCH_LOGIC (int64_t, and)
> +  TEST_ATOMIC_FETCH_LOGIC (int64_t, nand)
> +  TEST_ATOMIC_FETCH_LOGIC (int64_t, or)
> +  TEST_ATOMIC_FETCH_LOGIC (int64_t, xor)
> +  TEST_ATOMIC_FETCH_LOGIC (int, and)
> +  TEST_ATOMIC_FETCH_LOGIC (int, nand)
> +  TEST_ATOMIC_FETCH_LOGIC (int, or)
> +  TEST_ATOMIC_FETCH_LOGIC (int, xor)
> +  TEST_ATOMIC_FETCH_LOGIC (short, and)
> +  TEST_ATOMIC_FETCH_LOGIC (short, nand)
> +  TEST_ATOMIC_FETCH_LOGIC (short, or)
> +  TEST_ATOMIC_FETCH_LOGIC (short, xor)
> +  TEST_ATOMIC_FETCH_LOGIC (char, and)
> +  TEST_ATOMIC_FETCH_LOGIC (char, nand)
> +  TEST_ATOMIC_FETCH_LOGIC (char, or)
> +  TEST_ATOMIC_FETCH_LOGIC (char, xor)
> +  return 0;
> +}
> --
> 2.18.1
>
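For readers who want to see the idea outside the RTL expander, the relaxed loop described in the cover letter can be approximated by hand in C. This is only a sketch under stated assumptions, not the code the patch emits: `cpu_relax` and `relaxed_fetch_or` are hypothetical names, the sketch covers a 32-bit `or` only, and it reloads the expected value after the pause, which is a choice made here for the sketch. It uses the GCC `__atomic` builtins and `__builtin_ia32_pause`.

```c
#include <stdint.h>

/* Hypothetical helper: spin-wait hint.  On x86 this becomes the PAUSE
   instruction (printed as "rep nop"); on other targets it does nothing.  */
static inline void cpu_relax (void)
{
#if defined (__i386__) || defined (__x86_64__)
  __builtin_ia32_pause ();
#endif
}

/* Hypothetical hand-written analogue of a relaxed atomic_fetch_or loop.
   Returns the value *mem held just before the OR was applied.  */
static uint32_t relaxed_fetch_or (uint32_t *mem, uint32_t val)
{
  uint32_t expected = __atomic_load_n (mem, __ATOMIC_RELAXED);

  for (;;)
    {
      uint32_t desired = expected | val;

      /* Plain load first: it needs the cache line only for reading, so
         a doomed cmpxchg (which takes the line exclusive) is skipped.  */
      uint32_t current = __atomic_load_n (mem, __ATOMIC_RELAXED);

      if (current == expected
          && __atomic_compare_exchange_n (mem, &expected, desired, 0,
                                          __ATOMIC_SEQ_CST,
                                          __ATOMIC_RELAXED))
        return expected;

      /* Either the early check or the cmpxchg failed: pause, then pick
         up the latest value and try again.  */
      cpu_relax ();
      expected = __atomic_load_n (mem, __ATOMIC_RELAXED);
    }
}
```

With a single thread the loop succeeds on its first pass, so `relaxed_fetch_or (&x, 101)` on `x == 11` returns 11 and leaves `x == 111`, matching what `__atomic_fetch_or` would give.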
Thanks for your review, this is the patch I'm going to check in.

Uros Bizjak via Gcc-patches <gcc-patches@gcc.gnu.org> wrote on Mon, Nov 15, 2021 at 4:25 PM:
>
> LGTM, with a couple of issues in the testsuite section.
>
> Thanks,
> Uros.
>
> > +Relax cmpxchg loop by emit early load and compare before cmpxchg,
>
> ... by emitting an ...
>
> > +/* { dg-final { scan-assembler-times "rep nop" 32 } } */
>
> Please note that for older assemblers we emit "rep; nop" here.
diff --git a/gcc/config/i386/i386-expand.c b/gcc/config/i386/i386-expand.c
index 088e6af2258..f8a61835d85 100644
--- a/gcc/config/i386/i386-expand.c
+++ b/gcc/config/i386/i386-expand.c
@@ -23138,4 +23138,81 @@ ix86_expand_divmod_libfunc (rtx libfunc, machine_mode mode,
   *rem_p = rem;
 }
 
+void ix86_expand_atomic_fetch_op_loop (rtx target, rtx mem, rtx val,
+                                       enum rtx_code code, bool after,
+                                       bool doubleword)
+{
+  rtx old_reg, new_reg, old_mem, success, oldval, new_mem;
+  rtx_code_label *loop_label, *pause_label;
+  machine_mode mode = GET_MODE (target);
+
+  old_reg = gen_reg_rtx (mode);
+  new_reg = old_reg;
+  loop_label = gen_label_rtx ();
+  pause_label = gen_label_rtx ();
+  old_mem = copy_to_reg (mem);
+  emit_label (loop_label);
+  emit_move_insn (old_reg, old_mem);
+
+  /* return value for atomic_fetch_op. */
+  if (!after)
+    emit_move_insn (target, old_reg);
+
+  if (code == NOT)
+    {
+      new_reg = expand_simple_binop (mode, AND, new_reg, val, NULL_RTX,
+                                     true, OPTAB_LIB_WIDEN);
+      new_reg = expand_simple_unop (mode, code, new_reg, NULL_RTX, true);
+    }
+  else
+    new_reg = expand_simple_binop (mode, code, new_reg, val, NULL_RTX,
+                                   true, OPTAB_LIB_WIDEN);
+
+  /* return value for atomic_op_fetch. */
+  if (after)
+    emit_move_insn (target, new_reg);
+
+  /* Load memory again inside loop. */
+  new_mem = copy_to_reg (mem);
+  /* Compare mem value with expected value. */
+
+  if (doubleword)
+    {
+      machine_mode half_mode = (mode == DImode)? SImode : DImode;
+      rtx low_new_mem = gen_lowpart (half_mode, new_mem);
+      rtx low_old_mem = gen_lowpart (half_mode, old_mem);
+      rtx high_new_mem = gen_highpart (half_mode, new_mem);
+      rtx high_old_mem = gen_highpart (half_mode, old_mem);
+      emit_cmp_and_jump_insns (low_new_mem, low_old_mem, NE, NULL_RTX,
+                               half_mode, 1, pause_label,
+                               profile_probability::guessed_never ());
+      emit_cmp_and_jump_insns (high_new_mem, high_old_mem, NE, NULL_RTX,
+                               half_mode, 1, pause_label,
+                               profile_probability::guessed_never ());
+    }
+  else
+    emit_cmp_and_jump_insns (new_mem, old_mem, NE, NULL_RTX,
+                             GET_MODE (old_mem), 1, pause_label,
+                             profile_probability::guessed_never ());
+
+  success = NULL_RTX;
+  oldval = old_mem;
+  expand_atomic_compare_and_swap (&success, &oldval, mem, old_reg,
+                                  new_reg, false, MEMMODEL_SYNC_SEQ_CST,
+                                  MEMMODEL_RELAXED);
+  if (oldval != old_mem)
+    emit_move_insn (old_mem, oldval);
+
+  emit_cmp_and_jump_insns (success, const0_rtx, EQ, const0_rtx,
+                           GET_MODE (success), 1, loop_label,
+                           profile_probability::guessed_never ());
+
+  /* If mem is not expected, pause and loop back. */
+  emit_label (pause_label);
+  emit_insn (gen_pause ());
+  emit_jump_insn (gen_jump (loop_label));
+  emit_barrier ();
+}
+
+
 #include "gt-i386-expand.h"
diff --git a/gcc/config/i386/i386-options.c b/gcc/config/i386/i386-options.c
index a8cc0664f11..feff2584f41 100644
--- a/gcc/config/i386/i386-options.c
+++ b/gcc/config/i386/i386-options.c
@@ -397,7 +397,8 @@ ix86_target_string (HOST_WIDE_INT isa, HOST_WIDE_INT isa2,
     { "-mstv",                          MASK_STV },
     { "-mavx256-split-unaligned-load",  MASK_AVX256_SPLIT_UNALIGNED_LOAD },
     { "-mavx256-split-unaligned-store", MASK_AVX256_SPLIT_UNALIGNED_STORE },
-    { "-mcall-ms2sysv-xlogues",         MASK_CALL_MS2SYSV_XLOGUES }
+    { "-mcall-ms2sysv-xlogues",         MASK_CALL_MS2SYSV_XLOGUES },
+    { "-mrelax-cmpxchg-loop",           MASK_RELAX_CMPXCHG_LOOP }
   };
 
   /* Additional flag options. */
@@ -1092,6 +1093,10 @@ ix86_valid_target_attribute_inner_p (tree fndecl, tree args, char *p_strings[],
     IX86_ATTR_IX86_YES ("general-regs-only",
                         OPT_mgeneral_regs_only,
                         OPTION_MASK_GENERAL_REGS_ONLY),
+
+    IX86_ATTR_YES ("relax-cmpxchg-loop",
+                   OPT_mrelax_cmpxchg_loop,
+                   MASK_RELAX_CMPXCHG_LOOP),
   };
 
   location_t loc
diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
index bd52450a148..7e05510c679 100644
--- a/gcc/config/i386/i386-protos.h
+++ b/gcc/config/i386/i386-protos.h
@@ -217,6 +217,8 @@ extern void ix86_move_vector_high_sse_to_mmx (rtx);
 extern void ix86_split_mmx_pack (rtx[], enum rtx_code);
 extern void ix86_split_mmx_punpck (rtx[], bool);
 extern void ix86_expand_avx_vzeroupper (void);
+extern void ix86_expand_atomic_fetch_op_loop (rtx, rtx, rtx, enum rtx_code,
+                                              bool, bool);
 
 #ifdef TREE_CODE
 extern void init_cumulative_args (CUMULATIVE_ARGS *, tree, rtx, tree, int);
diff --git a/gcc/config/i386/i386.opt b/gcc/config/i386/i386.opt
index ad366974b5b..46fad3cc038 100644
--- a/gcc/config/i386/i386.opt
+++ b/gcc/config/i386/i386.opt
@@ -404,6 +404,10 @@ momit-leaf-frame-pointer
 Target Mask(OMIT_LEAF_FRAME_POINTER) Save
 Omit the frame pointer in leaf functions.
 
+mrelax-cmpxchg-loop
+Target Mask(RELAX_CMPXCHG_LOOP) Save
+Relax cmpxchg loop for atomic_fetch_{or,xor,and,nand} by adding load and cmp before cmpxchg, execute pause and loop back to load and compare if load value is not expected.
+
 mpc32
 Target RejectNegative
 Set 80387 floating-point precision to 32-bit.
diff --git a/gcc/config/i386/sync.md b/gcc/config/i386/sync.md
index 05a835256bb..46048425327 100644
--- a/gcc/config/i386/sync.md
+++ b/gcc/config/i386/sync.md
@@ -525,6 +525,123 @@
    (set (reg:CCZ FLAGS_REG)
         (unspec_volatile:CCZ [(const_int 0)] UNSPECV_CMPXCHG))])])
 
+(define_expand "atomic_fetch_<logic><mode>"
+  [(match_operand:SWI124 0 "register_operand")
+   (any_logic:SWI124
+    (match_operand:SWI124 1 "memory_operand")
+    (match_operand:SWI124 2 "register_operand"))
+   (match_operand:SI 3 "const_int_operand")]
+  "TARGET_CMPXCHG && TARGET_RELAX_CMPXCHG_LOOP"
+{
+  ix86_expand_atomic_fetch_op_loop (operands[0], operands[1],
+                                    operands[2], <CODE>, false,
+                                    false);
+  DONE;
+})
+
+(define_expand "atomic_<logic>_fetch<mode>"
+  [(match_operand:SWI124 0 "register_operand")
+   (any_logic:SWI124
+    (match_operand:SWI124 1 "memory_operand")
+    (match_operand:SWI124 2 "register_operand"))
+   (match_operand:SI 3 "const_int_operand")]
+  "TARGET_CMPXCHG && TARGET_RELAX_CMPXCHG_LOOP"
+{
+  ix86_expand_atomic_fetch_op_loop (operands[0], operands[1],
+                                    operands[2], <CODE>, true,
+                                    false);
+  DONE;
+})
+
+(define_expand "atomic_fetch_nand<mode>"
+  [(match_operand:SWI124 0 "register_operand")
+   (match_operand:SWI124 1 "memory_operand")
+   (match_operand:SWI124 2 "register_operand")
+   (match_operand:SI 3 "const_int_operand")]
+  "TARGET_CMPXCHG && TARGET_RELAX_CMPXCHG_LOOP"
+{
+  ix86_expand_atomic_fetch_op_loop (operands[0], operands[1],
+                                    operands[2], NOT, false,
+                                    false);
+  DONE;
+})
+
+(define_expand "atomic_nand_fetch<mode>"
+  [(match_operand:SWI124 0 "register_operand")
+   (match_operand:SWI124 1 "memory_operand")
+   (match_operand:SWI124 2 "register_operand")
+   (match_operand:SI 3 "const_int_operand")]
+  "TARGET_CMPXCHG && TARGET_RELAX_CMPXCHG_LOOP"
+{
+  ix86_expand_atomic_fetch_op_loop (operands[0], operands[1],
+                                    operands[2], NOT, true,
+                                    false);
+  DONE;
+})
+
+(define_expand "atomic_fetch_<logic><mode>"
+  [(match_operand:CASMODE 0 "register_operand")
+   (any_logic:CASMODE
+    (match_operand:CASMODE 1 "memory_operand")
+    (match_operand:CASMODE 2 "register_operand"))
+   (match_operand:SI 3 "const_int_operand")]
+  "TARGET_CMPXCHG && TARGET_RELAX_CMPXCHG_LOOP"
+{
+  bool doubleword = (<MODE>mode == DImode && !TARGET_64BIT)
+                    || (<MODE>mode == TImode);
+  ix86_expand_atomic_fetch_op_loop (operands[0], operands[1],
+                                    operands[2], <CODE>, false,
+                                    doubleword);
+  DONE;
+})
+
+(define_expand "atomic_<logic>_fetch<mode>"
+  [(match_operand:CASMODE 0 "register_operand")
+   (any_logic:CASMODE
+    (match_operand:CASMODE 1 "memory_operand")
+    (match_operand:CASMODE 2 "register_operand"))
+   (match_operand:SI 3 "const_int_operand")]
+  "TARGET_CMPXCHG && TARGET_RELAX_CMPXCHG_LOOP"
+{
+  bool doubleword = (<MODE>mode == DImode && !TARGET_64BIT)
+                    || (<MODE>mode == TImode);
+  ix86_expand_atomic_fetch_op_loop (operands[0], operands[1],
+                                    operands[2], <CODE>, true,
+                                    doubleword);
+  DONE;
+})
+
+(define_expand "atomic_fetch_nand<mode>"
+  [(match_operand:CASMODE 0 "register_operand")
+   (match_operand:CASMODE 1 "memory_operand")
+   (match_operand:CASMODE 2 "register_operand")
+   (match_operand:SI 3 "const_int_operand")]
+  "TARGET_CMPXCHG && TARGET_RELAX_CMPXCHG_LOOP"
+{
+  bool doubleword = (<MODE>mode == DImode && !TARGET_64BIT)
+                    || (<MODE>mode == TImode);
+  ix86_expand_atomic_fetch_op_loop (operands[0], operands[1],
+                                    operands[2], NOT, false,
+                                    doubleword);
+  DONE;
+})
+
+(define_expand "atomic_nand_fetch<mode>"
+  [(match_operand:CASMODE 0 "register_operand")
+   (match_operand:CASMODE 1 "memory_operand")
+   (match_operand:CASMODE 2 "register_operand")
+   (match_operand:SI 3 "const_int_operand")]
+  "TARGET_CMPXCHG && TARGET_RELAX_CMPXCHG_LOOP"
+{
+  bool doubleword = (<MODE>mode == DImode && !TARGET_64BIT)
+                    || (<MODE>mode == TImode);
+  ix86_expand_atomic_fetch_op_loop (operands[0], operands[1],
+                                    operands[2], NOT, true,
+                                    doubleword);
+  DONE;
+})
+
+
 ;; For operand 2 nonmemory_operand predicate is used instead of
 ;; register_operand to allow combiner to better optimize atomic
 ;; additions of constants.
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 2aba4c70b44..06ecf79bc0c 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -1419,7 +1419,7 @@ See RS/6000 and PowerPC Options.
 -mstack-protector-guard-reg=@var{reg} @gol
 -mstack-protector-guard-offset=@var{offset} @gol
 -mstack-protector-guard-symbol=@var{symbol} @gol
--mgeneral-regs-only -mcall-ms2sysv-xlogues @gol
+-mgeneral-regs-only -mcall-ms2sysv-xlogues -mrelax-cmpxchg-loop @gol
 -mindirect-branch=@var{choice} -mfunction-return=@var{choice} @gol
 -mindirect-branch-register -mneeded}
 
@@ -32259,6 +32259,13 @@ Generate code that uses only the general-purpose registers. This
 prevents the compiler from using floating-point, vector, mask and bound
 registers.
 
+@item -mrelax-cmpxchg-loop
+@opindex mrelax-cmpxchg-loop
+Relax cmpxchg loop by emit early load and compare before cmpxchg,
+execute pause if load value is not expected. This reduces excessive
+cachline bouncing when and works for all atomic logic fetch builtins
+that generates compare and swap loop.
+
 @item -mindirect-branch=@var{choice}
 @opindex mindirect-branch
 Convert indirect call and jump with @var{choice}. The default is
diff --git a/gcc/testsuite/gcc.target/i386/pr103069-1.c b/gcc/testsuite/gcc.target/i386/pr103069-1.c
new file mode 100644
index 00000000000..444485cbae9
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr103069-1.c
@@ -0,0 +1,35 @@
+/* PR target/103068 */
+/* { dg-do compile } */
+/* { dg-additional-options "-O2 -march=x86-64 -mtune=generic -mrelax-cmpxchg-loop" } */
+/* { dg-final { scan-assembler-times "rep nop" 32 } } */
+
+#include <stdint.h>
+
+#define FUNC_ATOMIC(TYPE, OP) \
+__attribute__ ((noinline, noclone)) \
+TYPE f_##TYPE##_##OP##_fetch (TYPE *a, TYPE b) \
+{ \
+  return __atomic_##OP##_fetch (a, b, __ATOMIC_RELAXED); \
+} \
+__attribute__ ((noinline, noclone)) \
+TYPE f_##TYPE##_fetch_##OP (TYPE *a, TYPE b) \
+{ \
+  return __atomic_fetch_##OP (a, b, __ATOMIC_RELAXED); \
+}
+
+FUNC_ATOMIC (int64_t, and)
+FUNC_ATOMIC (int64_t, nand)
+FUNC_ATOMIC (int64_t, or)
+FUNC_ATOMIC (int64_t, xor)
+FUNC_ATOMIC (int, and)
+FUNC_ATOMIC (int, nand)
+FUNC_ATOMIC (int, or)
+FUNC_ATOMIC (int, xor)
+FUNC_ATOMIC (short, and)
+FUNC_ATOMIC (short, nand)
+FUNC_ATOMIC (short, or)
+FUNC_ATOMIC (short, xor)
+FUNC_ATOMIC (char, and)
+FUNC_ATOMIC (char, nand)
+FUNC_ATOMIC (char, or)
+FUNC_ATOMIC (char, xor)
diff --git a/gcc/testsuite/gcc.target/i386/pr103069-2.c b/gcc/testsuite/gcc.target/i386/pr103069-2.c
new file mode 100644
index 00000000000..8ac824cc8e8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr103069-2.c
@@ -0,0 +1,70 @@
+/* PR target/103068 */
+/* { dg-do compile } */
+/* { dg-additional-options "-O2 -march=x86-64 -mtune=generic" } */
+
+#include <stdlib.h>
+#include "pr103069-1.c"
+
+#define FUNC_ATOMIC_RELAX(TYPE, OP) \
+__attribute__ ((noinline, noclone, target ("relax-cmpxchg-loop"))) \
+TYPE relax_##TYPE##_##OP##_fetch (TYPE *a, TYPE b) \
+{ \
+  return __atomic_##OP##_fetch (a, b, __ATOMIC_RELAXED); \
+} \
+__attribute__ ((noinline, noclone, target ("relax-cmpxchg-loop"))) \
+TYPE relax_##TYPE##_fetch_##OP (TYPE *a, TYPE b) \
+{ \
+  return __atomic_fetch_##OP (a, b, __ATOMIC_RELAXED); \
+}
+
+FUNC_ATOMIC_RELAX (int64_t, and)
+FUNC_ATOMIC_RELAX (int64_t, nand)
+FUNC_ATOMIC_RELAX (int64_t, or)
+FUNC_ATOMIC_RELAX (int64_t, xor)
+FUNC_ATOMIC_RELAX (int, and)
+FUNC_ATOMIC_RELAX (int, nand)
+FUNC_ATOMIC_RELAX (int, or)
+FUNC_ATOMIC_RELAX (int, xor)
+FUNC_ATOMIC_RELAX (short, and)
+FUNC_ATOMIC_RELAX (short, nand)
+FUNC_ATOMIC_RELAX (short, or)
+FUNC_ATOMIC_RELAX (short, xor)
+FUNC_ATOMIC_RELAX (char, and)
+FUNC_ATOMIC_RELAX (char, nand)
+FUNC_ATOMIC_RELAX (char, or)
+FUNC_ATOMIC_RELAX (char, xor)
+
+#define TEST_ATOMIC_FETCH_LOGIC(TYPE, OP) \
+{ \
+  TYPE a = 11, b = 101, res, exp; \
+  res = relax_##TYPE##_##OP##_fetch (&a, b); \
+  exp = f_##TYPE##_##OP##_fetch (&a, b); \
+  if (res != exp) \
+    abort (); \
+  a = 21, b = 92; \
+  res = relax_##TYPE##_fetch_##OP (&a, b); \
+  exp = f_##TYPE##_fetch_##OP (&a, b); \
+  if (res != exp) \
+    abort (); \
+}
+
+int main (void)
+{
+  TEST_ATOMIC_FETCH_LOGIC (int64_t, and)
+  TEST_ATOMIC_FETCH_LOGIC (int64_t, nand)
+  TEST_ATOMIC_FETCH_LOGIC (int64_t, or)
+  TEST_ATOMIC_FETCH_LOGIC (int64_t, xor)
+  TEST_ATOMIC_FETCH_LOGIC (int, and)
+  TEST_ATOMIC_FETCH_LOGIC (int, nand)
+  TEST_ATOMIC_FETCH_LOGIC (int, or)
+  TEST_ATOMIC_FETCH_LOGIC (int, xor)
+  TEST_ATOMIC_FETCH_LOGIC (short, and)
+  TEST_ATOMIC_FETCH_LOGIC (short, nand)
+  TEST_ATOMIC_FETCH_LOGIC (short, or)
+  TEST_ATOMIC_FETCH_LOGIC (short, xor)
+  TEST_ATOMIC_FETCH_LOGIC (char, and)
+  TEST_ATOMIC_FETCH_LOGIC (char, nand)
+  TEST_ATOMIC_FETCH_LOGIC (char, or)
+  TEST_ATOMIC_FETCH_LOGIC (char, xor)
+  return 0;
+}
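For readers following the idea at the C level, here is a minimal sketch of a fetch-or built from a plain load, a pause, and a compare-and-swap, in the spirit of the relaxed loop described in the patch. It is not the RTL sequence that ix86_expand_atomic_fetch_op_loop emits: the names relaxed_fetch_or and cpu_relax are hypothetical, and refreshing the expected value on the pause path is an assumption of this sketch rather than something taken from the patch.

```c
#include <stdint.h>

/* Hypothetical helper: PAUSE on x86, a no-op elsewhere.  */
static inline void cpu_relax (void)
{
#if defined (__i386__) || defined (__x86_64__)
  __builtin_ia32_pause ();
#endif
}

/* Hypothetical illustration of a relaxed cmpxchg loop for fetch-or.
   Returns the value *mem held just before the OR was applied.  */
static uint32_t relaxed_fetch_or (uint32_t *mem, uint32_t val)
{
  /* Initial load, done once before the loop.  */
  uint32_t expected = __atomic_load_n (mem, __ATOMIC_RELAXED);

  for (;;)
    {
      uint32_t desired = expected | val;

      /* Cheap read-only check: only attempt the locked cmpxchg when
         memory still holds the value the new value was computed from.  */
      uint32_t seen = __atomic_load_n (mem, __ATOMIC_RELAXED);
      if (seen != expected)
        {
          cpu_relax ();
          expected = seen;      /* Assumption: retry from the fresh value.  */
          continue;
        }

      /* Strong compare-and-swap; on failure EXPECTED is updated with the
         current memory contents and the loop retries.  */
      if (__atomic_compare_exchange_n (mem, &expected, desired, 0,
                                       __ATOMIC_SEQ_CST, __ATOMIC_RELAXED))
        return expected;
    }
}
```

The read-only check keeps the cache line in a shared state while another thread is modifying it, so the exclusive request made by lock cmpxchg is only issued when it has a chance of succeeding.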