Message ID | n99bnewfm4l.fsf@arm.com |
---|---|
State | New |
Headers | show |
On Tue, Jul 28, 2015 at 02:12:36PM +0100, Jiong Wang wrote: > > The instruction sequences for preparing argument for TLS descriptor > runtime resolver and the later function call to resolver can actually be > hoisted out of the loop. > > Currently we can't because we have exposed the hard register X0 as > destination of "set". While GCC's RTL data flow infrastructure will > skip or do very conservative assumption when hard register involved in > and thus some loop IV opportunities are missed. > > This patch add another "tlsdesc_small_pseudo_<mode>" pattern, and avoid > expose x0 to gcc generic code. > > Generally, we define a new register class FIXED_R0 which only contains register > 0, so the instruction sequences generated from the new add pattern is the same > as tlsdesc_small_<mode>, while the operand 0 is wrapped as pseudo register that > RTL IV opt can handle it. > > Ideally, we should allow operand 0 to be any pseudo register, but then > we can't model the override of x0 caused by the function call which is > hidded by the UNSPEC. > > So here, we restricting operand 0 to be x0, the override of x0 can be > reflected to the gcc. > > OK for trunk? OK. Thanks, James
On Tue, Jul 28, 2015 at 6:12 AM, Jiong Wang <jiong.wang@arm.com> wrote: > > The instruction sequences for preparing argument for TLS descriptor > runtime resolver and the later function call to resolver can actually be > hoisted out of the loop. > > Currently we can't because we have exposed the hard register X0 as > destination of "set". While GCC's RTL data flow infrastructure will > skip or do very conservative assumption when hard register involved in > and thus some loop IV opportunities are missed. > > This patch add another "tlsdesc_small_pseudo_<mode>" pattern, and avoid > expose x0 to gcc generic code. > > Generally, we define a new register class FIXED_R0 which only contains register > 0, so the instruction sequences generated from the new add pattern is the same > as tlsdesc_small_<mode>, while the operand 0 is wrapped as pseudo register that > RTL IV opt can handle it. > > Ideally, we should allow operand 0 to be any pseudo register, but then > we can't model the override of x0 caused by the function call which is > hidded by the UNSPEC. > > So here, we restricting operand 0 to be x0, the override of x0 can be > reflected to the gcc. > > OK for trunk? This patch broke ILP32 because we used mode rather than ptr_mode for the psedu . I have an idea on how to fix it (like tlsie_small_sidi case) but I still need to test it fully. This is the smallest testcase where the problem is: struct dtor_list { struct dtor_list *next; }; static __thread struct dtor_list *tls_dtor_list; __cxa_thread_atexit_impl ( struct dtor_list *new) { new->next = tls_dtor_list; tls_dtor_list = new; } Thanks, Andrew > > 2015-07-28 Ramana Radhakrishnan <ramana.radhakrishnan@arm.com> > Jiong Wang <jiong.wang@arm.com> > > gcc/ > * config/aarch64/aarch64.d (tlsdesc_small_pseudo_<mode>): New pattern. > * config/aarch64/aarch64.h (reg_class): New enumeration FIXED_REG0. > (REG_CLASS_NAMES): Likewise. > (REG_CLASS_CONTENTS): Likewise. > * config/aarch64/aarch64.c (aarch64_class_max_nregs): Likewise. > (aarch64_register_move_cost): Likewise. > (aarch64_load_symref_appropriately): Invoke the new added pattern if > possible. > * config/aarch64/constraints.md (Uc0): New constraint. > > gcc/testsuite. > * gcc.target/aarch64/tlsdesc_hoist.c: New testcase. > > -- > Regards, > Jiong >
On Fri, Sep 25, 2015 at 11:40 PM, Andrew Pinski <pinskia@gmail.com> wrote: > On Tue, Jul 28, 2015 at 6:12 AM, Jiong Wang <jiong.wang@arm.com> wrote: >> >> The instruction sequences for preparing argument for TLS descriptor >> runtime resolver and the later function call to resolver can actually be >> hoisted out of the loop. >> >> Currently we can't because we have exposed the hard register X0 as >> destination of "set". While GCC's RTL data flow infrastructure will >> skip or do very conservative assumption when hard register involved in >> and thus some loop IV opportunities are missed. >> >> This patch add another "tlsdesc_small_pseudo_<mode>" pattern, and avoid >> expose x0 to gcc generic code. >> >> Generally, we define a new register class FIXED_R0 which only contains register >> 0, so the instruction sequences generated from the new add pattern is the same >> as tlsdesc_small_<mode>, while the operand 0 is wrapped as pseudo register that >> RTL IV opt can handle it. >> >> Ideally, we should allow operand 0 to be any pseudo register, but then >> we can't model the override of x0 caused by the function call which is >> hidded by the UNSPEC. >> >> So here, we restricting operand 0 to be x0, the override of x0 can be >> reflected to the gcc. >> >> OK for trunk? > > > This patch broke ILP32 because we used mode rather than ptr_mode for > the psedu . I have an idea on how to fix it (like tlsie_small_sidi > case) but I still need to test it fully. > > This is the smallest testcase where the problem is: > struct dtor_list > { > struct dtor_list *next; > }; > static __thread struct dtor_list *tls_dtor_list; > __cxa_thread_atexit_impl ( struct dtor_list *new) > { > new->next = tls_dtor_list; > tls_dtor_list = new; > } Actually there is another bug with respect of the output too. Some of the <w>0 should have been plain x0 due to only the 64bit register is accepted in some contexts. Thanks, Andrew > > > Thanks, > Andrew > >> >> 2015-07-28 Ramana Radhakrishnan <ramana.radhakrishnan@arm.com> >> Jiong Wang <jiong.wang@arm.com> >> >> gcc/ >> * config/aarch64/aarch64.d (tlsdesc_small_pseudo_<mode>): New pattern. >> * config/aarch64/aarch64.h (reg_class): New enumeration FIXED_REG0. >> (REG_CLASS_NAMES): Likewise. >> (REG_CLASS_CONTENTS): Likewise. >> * config/aarch64/aarch64.c (aarch64_class_max_nregs): Likewise. >> (aarch64_register_move_cost): Likewise. >> (aarch64_load_symref_appropriately): Invoke the new added pattern if >> possible. >> * config/aarch64/constraints.md (Uc0): New constraint. >> >> gcc/testsuite. >> * gcc.target/aarch64/tlsdesc_hoist.c: New testcase. >> >> -- >> Regards, >> Jiong >>
Andrew Pinski writes: > On Tue, Jul 28, 2015 at 6:12 AM, Jiong Wang <jiong.wang@arm.com> wrote: >> >> The instruction sequences for preparing argument for TLS descriptor >> runtime resolver and the later function call to resolver can actually be >> hoisted out of the loop. >> >> Currently we can't because we have exposed the hard register X0 as >> destination of "set". While GCC's RTL data flow infrastructure will >> skip or do very conservative assumption when hard register involved in >> and thus some loop IV opportunities are missed. >> >> This patch add another "tlsdesc_small_pseudo_<mode>" pattern, and avoid >> expose x0 to gcc generic code. >> >> Generally, we define a new register class FIXED_R0 which only contains register >> 0, so the instruction sequences generated from the new add pattern is the same >> as tlsdesc_small_<mode>, while the operand 0 is wrapped as pseudo register that >> RTL IV opt can handle it. >> >> Ideally, we should allow operand 0 to be any pseudo register, but then >> we can't model the override of x0 caused by the function call which is >> hidded by the UNSPEC. >> >> So here, we restricting operand 0 to be x0, the override of x0 can be >> reflected to the gcc. >> >> OK for trunk? > > > This patch broke ILP32 because we used mode rather than ptr_mode for > the psedu . I have an idea on how to fix it (like tlsie_small_sidi > case) but I still need to test it fully. Have done a quick re-visit the code, the use of "mode" instead of "ptr_mode" looks OK to me. While what looks strange to me is under ILP32 symbol_ref is DImode. (symbol_ref:DI ("*.LANCHOR0") My my understanding, the symbol_ref should be SI mode under ilp32 instead of DI mode. So it's better to fix "create_block_symbol" in varasm, and we should let it use ptr_mode instead of Pmode as Pmode is used to describe the underline hardware mode instead of the mode view in C language level. > > This is the smallest testcase where the problem is: > struct dtor_list > { > struct dtor_list *next; > }; > static __thread struct dtor_list *tls_dtor_list; > __cxa_thread_atexit_impl ( struct dtor_list *new) > { > new->next = tls_dtor_list; > tls_dtor_list = new; > } > > > Thanks, > Andrew > >> >> 2015-07-28 Ramana Radhakrishnan <ramana.radhakrishnan@arm.com> >> Jiong Wang <jiong.wang@arm.com> >> >> gcc/ >> * config/aarch64/aarch64.d (tlsdesc_small_pseudo_<mode>): New pattern. >> * config/aarch64/aarch64.h (reg_class): New enumeration FIXED_REG0. >> (REG_CLASS_NAMES): Likewise. >> (REG_CLASS_CONTENTS): Likewise. >> * config/aarch64/aarch64.c (aarch64_class_max_nregs): Likewise. >> (aarch64_register_move_cost): Likewise. >> (aarch64_load_symref_appropriately): Invoke the new added pattern if >> possible. >> * config/aarch64/constraints.md (Uc0): New constraint. >> >> gcc/testsuite. >> * gcc.target/aarch64/tlsdesc_hoist.c: New testcase. >> >> -- >> Regards, >> Jiong >>
On 28 July 2015 at 14:12, Jiong Wang <jiong.wang@arm.com> wrote: > > The instruction sequences for preparing argument for TLS descriptor > runtime resolver and the later function call to resolver can actually be > hoisted out of the loop. > > Currently we can't because we have exposed the hard register X0 as > destination of "set". While GCC's RTL data flow infrastructure will > skip or do very conservative assumption when hard register involved in > and thus some loop IV opportunities are missed. This patch feels like we are botching the back end to over come a limitation in RTL IV opt. Isn't the real solution in RTL IV opt ? /Marcus > > This patch add another "tlsdesc_small_pseudo_<mode>" pattern, and avoid > expose x0 to gcc generic code. > > Generally, we define a new register class FIXED_R0 which only contains register > 0, so the instruction sequences generated from the new add pattern is the same > as tlsdesc_small_<mode>, while the operand 0 is wrapped as pseudo register that > RTL IV opt can handle it. > > Ideally, we should allow operand 0 to be any pseudo register, but then > we can't model the override of x0 caused by the function call which is > hidded by the UNSPEC. > > So here, we restricting operand 0 to be x0, the override of x0 can be > reflected to the gcc. > > OK for trunk? > > 2015-07-28 Ramana Radhakrishnan <ramana.radhakrishnan@arm.com> > Jiong Wang <jiong.wang@arm.com> > > gcc/ > * config/aarch64/aarch64.d (tlsdesc_small_pseudo_<mode>): New pattern. > * config/aarch64/aarch64.h (reg_class): New enumeration FIXED_REG0. > (REG_CLASS_NAMES): Likewise. > (REG_CLASS_CONTENTS): Likewise. > * config/aarch64/aarch64.c (aarch64_class_max_nregs): Likewise. > (aarch64_register_move_cost): Likewise. > (aarch64_load_symref_appropriately): Invoke the new added pattern if > possible. > * config/aarch64/constraints.md (Uc0): New constraint. > > gcc/testsuite. > * gcc.target/aarch64/tlsdesc_hoist.c: New testcase. > > -- > Regards, > Jiong >
Jiong Wang writes: > Andrew Pinski writes: > >> On Tue, Jul 28, 2015 at 6:12 AM, Jiong Wang <jiong.wang@arm.com> wrote: >>> >>> The instruction sequences for preparing argument for TLS descriptor >>> runtime resolver and the later function call to resolver can actually be >>> hoisted out of the loop. >>> >>> Currently we can't because we have exposed the hard register X0 as >>> destination of "set". While GCC's RTL data flow infrastructure will >>> skip or do very conservative assumption when hard register involved in >>> and thus some loop IV opportunities are missed. >>> >>> This patch add another "tlsdesc_small_pseudo_<mode>" pattern, and avoid >>> expose x0 to gcc generic code. >>> >>> Generally, we define a new register class FIXED_R0 which only contains register >>> 0, so the instruction sequences generated from the new add pattern is the same >>> as tlsdesc_small_<mode>, while the operand 0 is wrapped as pseudo register that >>> RTL IV opt can handle it. >>> >>> Ideally, we should allow operand 0 to be any pseudo register, but then >>> we can't model the override of x0 caused by the function call which is >>> hidded by the UNSPEC. >>> >>> So here, we restricting operand 0 to be x0, the override of x0 can be >>> reflected to the gcc. >>> >>> OK for trunk? >> >> >> This patch broke ILP32 because we used mode rather than ptr_mode for >> the psedu . I have an idea on how to fix it (like tlsie_small_sidi >> case) but I still need to test it fully. > > Have done a quick re-visit the code, the use of "mode" instead of > "ptr_mode" looks OK to me. > > While what looks strange to me is under ILP32 symbol_ref is DImode. > > (symbol_ref:DI ("*.LANCHOR0") > > My my understanding, the symbol_ref should be SI mode under ilp32 > instead of DI mode. > > So it's better to fix "create_block_symbol" in varasm, and we should let > it use ptr_mode instead of Pmode as Pmode is used to describe the > underline hardware mode instead of the mode view in C language level. meanwhile I revert this patch (r228211) as there is unresolved ILP32 issue.
diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h index 3851564..fb4834a 100644 --- a/gcc/config/aarch64/aarch64.h +++ b/gcc/config/aarch64/aarch64.h @@ -454,6 +454,7 @@ extern unsigned long aarch64_isa_flags; enum reg_class { NO_REGS, + FIXED_REG0, CALLER_SAVE_REGS, GENERAL_REGS, STACK_REG, @@ -469,6 +470,7 @@ enum reg_class #define REG_CLASS_NAMES \ { \ "NO_REGS", \ + "FIXED_REG0" \ "CALLER_SAVE_REGS", \ "GENERAL_REGS", \ "STACK_REG", \ @@ -481,6 +483,7 @@ enum reg_class #define REG_CLASS_CONTENTS \ { \ { 0x00000000, 0x00000000, 0x00000000 }, /* NO_REGS */ \ + { 0x00000001, 0x00000000, 0x00000000 }, /* FIXED_REG0 */ \ { 0x0007ffff, 0x00000000, 0x00000000 }, /* CALLER_SAVE_REGS */ \ { 0x7fffffff, 0x00000000, 0x00000003 }, /* GENERAL_REGS */ \ { 0x80000000, 0x00000000, 0x00000000 }, /* STACK_REG */ \ diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c index ef07e05..f1f2cab 100644 --- a/gcc/config/aarch64/aarch64.c +++ b/gcc/config/aarch64/aarch64.c @@ -1038,22 +1038,39 @@ aarch64_load_symref_appropriately (rtx dest, rtx imm, { machine_mode mode = GET_MODE (dest); rtx x0 = gen_rtx_REG (mode, R0_REGNUM); + rtx offset; rtx tp; gcc_assert (mode == Pmode || mode == ptr_mode); - /* In ILP32, the got entry is always of SImode size. Unlike - small GOT, the dest is fixed at reg 0. */ - if (TARGET_ILP32) - emit_insn (gen_tlsdesc_small_si (imm)); + if (can_create_pseudo_p ()) + { + rtx reg = gen_reg_rtx (mode); + + if (TARGET_ILP32) + emit_insn (gen_tlsdesc_small_pseudo_si (reg, imm)); + else + emit_insn (gen_tlsdesc_small_pseudo_di (reg, imm)); + + offset = reg; + } else - emit_insn (gen_tlsdesc_small_di (imm)); + { + /* In ILP32, the got entry is always of SImode size. Unlike + small GOT, the dest is fixed at reg 0. */ + if (TARGET_ILP32) + emit_insn (gen_tlsdesc_small_si (imm)); + else + emit_insn (gen_tlsdesc_small_di (imm)); + + offset = x0; + } tp = aarch64_load_tp (NULL); if (mode != Pmode) tp = gen_lowpart (mode, tp); - emit_insn (gen_rtx_SET (dest, gen_rtx_PLUS (mode, tp, x0))); + emit_insn (gen_rtx_SET (dest, gen_rtx_PLUS (mode, tp, offset))); set_unique_reg_note (get_last_insn (), REG_EQUIV, imm); return; } @@ -5099,6 +5116,7 @@ aarch64_class_max_nregs (reg_class_t regclass, machine_mode mode) aarch64_vector_mode_p (mode) ? (GET_MODE_SIZE (mode) + UNITS_PER_VREG - 1) / UNITS_PER_VREG : (GET_MODE_SIZE (mode) + UNITS_PER_WORD - 1) / UNITS_PER_WORD; + case FIXED_REG0: case STACK_REG: return 1; @@ -6948,10 +6966,10 @@ aarch64_register_move_cost (machine_mode mode, = aarch64_tune_params.regmove_cost; /* Caller save and pointer regs are equivalent to GENERAL_REGS. */ - if (to == CALLER_SAVE_REGS || to == POINTER_REGS) + if (to == CALLER_SAVE_REGS || to == POINTER_REGS || to == FIXED_REG0) to = GENERAL_REGS; - if (from == CALLER_SAVE_REGS || from == POINTER_REGS) + if (from == CALLER_SAVE_REGS || from == POINTER_REGS || from == FIXED_REG0) from = GENERAL_REGS; /* Moving between GPR and stack cost is the same as GP2GP. */ diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md index c681cf1..3fcfe55 100644 --- a/gcc/config/aarch64/aarch64.md +++ b/gcc/config/aarch64/aarch64.md @@ -4496,6 +4496,25 @@ [(set_attr "type" "call") (set_attr "length" "16")]) +;; The same as tlsdesc_small_<mode> with hard register hiding. +;; The first operand is actually x0, while we wrap it under a delicated +;; register class so that before register allocation, it's seen as pseudo +;; register. The reason for doing this is we don't expose hard register X0 +;; as the destination of set as it will cause trouble for RTL loop iv. +;; RTL loop iv will abort ongoing optimization once it finds there is hard reg +;; as destination of set. +(define_insn "tlsdesc_small_pseudo_<mode>" + [(set (match_operand:PTR 0 "register_operand" "=Uc0") + (unspec:PTR [(match_operand 1 "aarch64_valid_symref" "S")] + UNSPEC_TLSDESC)) + (clobber (reg:DI LR_REGNUM)) + (clobber (reg:CC CC_REGNUM)) + (clobber (match_scratch:DI 2 "=r"))] + "TARGET_TLS_DESC" + "adrp\\t<w>0, %A1\;ldr\\t%<w>2, [%<w>0, #%L1]\;add\\t%<w>0, %<w>0, %L1\;.tlsdesccall\\t%1\;blr\\t%2" + [(set_attr "type" "call") + (set_attr "length" "16")]) + (define_insn "stack_tie" [(set (mem:BLK (scratch)) (unspec:BLK [(match_operand:DI 0 "register_operand" "rk") diff --git a/gcc/config/aarch64/constraints.md b/gcc/config/aarch64/constraints.md index 5b189ea..b09d10d 100644 --- a/gcc/config/aarch64/constraints.md +++ b/gcc/config/aarch64/constraints.md @@ -24,6 +24,9 @@ (define_register_constraint "Ucs" "CALLER_SAVE_REGS" "@internal The caller save registers.") +(define_register_constraint "Uc0" "FIXED_REG0" + "@internal Represent X0/W0.") + (define_register_constraint "w" "FP_REGS" "Floating point and SIMD vector registers.") diff --git a/gcc/testsuite/gcc.target/aarch64/tlsdesc_hoist.c b/gcc/testsuite/gcc.target/aarch64/tlsdesc_hoist.c new file mode 100644 index 0000000..f9a6d7f --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/tlsdesc_hoist.c @@ -0,0 +1,21 @@ +/* { dg-do compile } */ +/* { dg-require-effective-target tls_native } */ +/* { dg-options "-O2 -fpic -fdump-rtl-loop2_invariant" } */ + +int cal (int, int); +__thread int tls_data; + +int +foo (int bound) +{ + int i = 0; + int sum = 0; + + for (i; i < bound; i++) + sum = cal (sum, tls_data); + + return sum; +} + +/* Insn sequences for TLS descriptor should be hoisted out of the loop. */ +/* { dg-final { scan-rtl-dump "Decided" "loop2_invariant" } } */