| Message ID | n99vbgp35kl.fsf@arm.com |
|---|---|
| State | New |
On 04/21/2015 08:24 AM, Jiong Wang wrote:
>
> Jiong Wang writes:
>
>> 2015-04-14 18:24 GMT+01:00 Jeff Law <law@redhat.com>:
>>> On 04/14/2015 10:48 AM, Steven Bosscher wrote:
>>>>> So I think this stage2/3 binary difference is acceptable?
>>>>
>>>> No, they should be identical.  If there's a difference, then there's
>>>> a bug - which, it seems, you've already found, too.
>>>
>>> Right.  And so the natural question is how to fix it.
>>>
>>> At first glance it would seem like having this new code ignore
>>> dependencies arising from debug insns would work.
>>>
>>> Which then begs the question: what happens to the debug insn -- it's
>>> certainly not going to be correct anymore if the transformation is
>>> made.
>>
>> Exactly.
>>
>> The debug_insn 2776 in my example records the base address of a local
>> array.  The new code is behaving correctly here by not shuffling the
>> operands of insns 2556 and 2557, as there is an additional reference
>> to reg:1473 from the debug insn, although the code would still execute
>> correctly if we did the transformation.
>>
>> My understanding of how to fix this:
>>
>> * Delete the out-of-date, mismatched debug_insn?  There is no
>>   guarantee of accurate debug info under -O2.
>>
>>   IMO, this debug_insn may affect the "DW_AT_location" field for the
>>   variable description of "classes" in the .debug_info section, but
>>   it's omitted in the final output already.
>>
>>   <3><38a4d>: Abbrev Number: 137 (DW_TAG_variable)
>>      <38a4f>   DW_AT_name        : (indirect string, offset: 0x18db): classes
>>      <38a53>   DW_AT_decl_file   : 1
>>      <38a54>   DW_AT_decl_line   : 548
>>      <38a56>   DW_AT_type        : <0x38cb4>
>>
>> * Update the debug_insn, if the following change is OK with the DWARF
>>   standard:
>>
>>   from
>>
>>     insn0:      reg0 = fp + reg1
>>     debug_insn: var_loc = reg0 + const_off
>>     insn1:      reg2 = reg0 + const_off
>>
>>   to
>>
>>     insn0:      reg0 = fp + const_off
>>     debug_insn: var_loc = reg0 + reg1
>>     insn1:      reg2 = reg0 + reg1
>>
>> Thanks,
>
> Attached is the new patch, which updates the debug_insn as described in
> the second solution above.
>
> Now the stage2/3 binary differences on AArch64 have gone away.
> Bootstrap OK.
>
> On AArch64, this patch finds 600+ new RTL loop invariants across
> spec2k6 float, and gives a +4.5% perf improvement on 436.cactusADM
> because four new invariants are found in the critical function
> "regex_compile".
>
> Similar improvements may be achieved on other RISC backends like
> powerpc/mips, I guess.
>
> One thing to mention: for AArch64, one minor glitch in
> aarch64_legitimize_address needs to be fixed to let this patch take
> effect; I will send out that patch later as it's a separate issue.
> Powerpc/Mips don't have this glitch in their LEGITIMIZE_ADDRESS hook,
> so they should be OK, and I verified that the base address of the
> local array in the testcase given by Seb on pr62173 is hoisted on
> ppc64 now.  I think pr62173 is fixed on those 64-bit arches by this
> patch.
>
> Thoughts?
>
> Thanks.
>
> 2015-04-21  Jiong Wang  <jiong.wang@arm.com>
>
> gcc/
>   * loop-invariant.c (find_defs): Enable DF_DU_CHAIN build.
>   (vfp_const_iv): New hash table.
>   (expensive_addr_check_p): New boolean.
>   (init_inv_motion_data): Initialize new variables.
>   (free_inv_motion_data): Release hash table.
>   (create_new_invariant): Set cheap_address to false for iv in
>   vfp_const_iv table.
>   (find_invariant_insn): Skip dependencies check for iv in vfp_const_iv
>   table.
>   (use_for_single_du): New function.
>   (reshuffle_insn_with_vfp): Likewise.
>   (find_invariants_bb): Call reshuffle_insn_with_vfp.
>
> gcc/testsuite/
>   * gcc.dg/pr62173.c: New testcase.
So ultimately isn't this just a specialized version of reassociation of
pointer arithmetic?  And if so, don't we have all the usual problems
around introducing overflow?

ISTM if this is going to go forward (ie, we somehow convince ourselves
that this kind of reassociation is OK), then should it be made to apply
to pointers in general rather than restricting it to stack, frame,
virtual-frame?

I wish I'd looked more closely at the patch the first time around so
that I could have raised these questions sooner.

Jeff
Jeff Law writes:

> On 04/21/2015 08:24 AM, Jiong Wang wrote:
> [...]
> So ultimately isn't this just a specialized version of reassociation
> of pointer arithmetic?  And if so, don't we have all the usual
> problems around introducing overflow?
>
> ISTM if this is going to go forward (ie, we somehow convince ourselves
> that this kind of reassociation is OK), then should it be made to
> apply on pointers in general rather than restricting to stack, frame,
> virtual-frame?

Jeff,

Thanks for the review.

This transformation is not reassociation of pointer arithmetic.

The idea of this patch is: given two instructions with variable values,
we may get a new instruction sequence with a fixed value by
reassociating their operands.  And currently GCC only generates such
instruction sequences for local array accesses, as far as I know.

For the statement "D.2707 = A[i]", GCC generates the following
instruction sequence:

(insn 92 91 93 6 (set (reg/f:DI 148)
        (plus:DI (reg/f:DI 64 sfp)
            (reg:DI 147 [ i ])))
     (expr_list:REG_DEAD (reg:DI 147 [ i ])
        (nil)))

(insn 93 92 94 6 (set (reg:SI 149 [ D.2707 ])
        (zero_extend:SI (mem/j:QI (plus:DI (reg/f:DI 148)
                    (const_int -16 [0xfffffffffffffff0])) [0 A S1 A8])))
     (expr_list:REG_DEAD (reg/f:DI 148)
        (nil)))

Both insn 92 and insn 93 have variable values, but "(reg/f:DI 64 sfp)"
in insn 92 and "const_int -16" in insn 93 are fixed values.  So my
patch transforms the above sequence into the following if
"apply_change_group" succeeds.  Then insn 92 has a fixed value and is
thus loop invariant.

(insn 92 88 97 5 (set (reg/f:DI 148)
        (plus:DI (reg/f:DI 64 sfp)
            (const_int -16 [0xfffffffffffffff0])))
     (nil))

(insn 93 131 94 6 (set (reg:SI 149 [ D.2707 ])
        (zero_extend:SI (mem/j:QI (plus:DI (reg/f:DI 148)
                    (reg:DI 147 [ i ])) [0 A S1 A8])))
     (expr_list:REG_DEAD (reg:DI 147 [ i ])
        (expr_list:REG_DEAD (reg/f:DI 148)
            (nil))))

This transformation is only done if the def of r148 in insn 92 is the
single def reaching the use in insn 93, and the use in insn 93 is also
the single use.
And if there were an overflow, it would also happen in the original
insn 93, so that sequence should not have been generated in the first
place.  So I think there is no need for an overflow check.

Does this explanation make sense?
Hi Matthew,

2015-04-21 15:24 GMT+01:00 Jiong Wang <jiong.wang@arm.com>:
>
> 2015-04-21  Jiong Wang  <jiong.wang@arm.com>
>
> gcc/
> * loop-invariant.c (find_defs): Enable DF_DU_CHAIN build.
> [...]
> (find_invariants_bb): Call reshuffle_insn_with_vfp.

Is it possible for you to test this patch on spec2k6 on MIPS,
especially 436.cactusADM?  I am interested in the performance impact on
other load/store machines.

Thanks.

Regards,
Jiong
> Hi Matthew,
>
> [...]
>
> Is it possible for you to test this patch on spec2k6 on MIPS,
> especially 436.cactusADM?  I am interested in the performance impact
> on other load/store machines.

I will certainly try and fit it in.  It may take me a week to get back
to you though, as I'm away on vacation for a few days.

Matthew
2015-04-28 14:56 GMT+01:00 Matthew Fortune <Matthew.Fortune@imgtec.com>:
>> [...]
>>
>> Is it possible for you to test this patch on spec2k6 on MIPS,
>> especially 436.cactusADM?  I am interested in the performance impact
>> on other load/store machines.
>
> I will certainly try and fit it in.  It may take me a week to get back
> to you though as I'm away on vacation for a few days.

Thanks very much, I appreciate that.  You can add
"-fdump-rtl-loop2_invariant"; then we can see how many new loop
invariants are found at the RTL level across spec2k6 on MIPS.

Thanks.

Regards,
Jiong
On 04/24/2015 10:18 AM, Jiong Wang wrote:
> [...]
>
> This transformation is not reassociation of pointer arithmetic.
>
> The idea of this patch is, given two instructions with variable value,
> we may get new instruction sequences with fixed value by reassociating
> their operands.  And currently GCC only generate such instruction
> sequences for local array accessing as far as I known.

Which is precisely reassociation of pointer arithmetic.  Given:

  x = a + b
  y = *(x + c)

You're generating:

  x = a + c
  y = *(x + b)

That's reassociation of pointer arithmetic.

For all kinds of reassociation we have to concern ourselves with adding
overflow where it didn't already occur.  Assuming a 32-bit architecture
we could get overflow if a is 0x7fffffff, b is -4 and c is 3:

  0x7fffffff + -4 = 0x7ffffffb
  0x7ffffffb +  3 = 0x7ffffffe

If you make the transformation you're suggesting, we get:

  0x7fffffff + 3 = 0x80000002  OVERFLOW
  0x80000002 - 4 = 0x7ffffffe

Now if you always know pointers are unsigned, then the overflow is
defined and you'd be OK.  But that's a property of the target and one
that's not well modeled within GCC (we have POINTER_EXTEND_UNSIGNED
which kind of tells us something in this space).

In addition to worrying about overflow, you have to worry about
segmented architectures with implicit segment selection -- especially
if the segment selection comes from the base register rather than the
entire effective address.  On such architectures a + b != b + a when
used in a memory reference.  The HPPA and mn103 ports immediately come
to mind; I'm sure there's at least one more, since I recall helping
someone with these issues in the past.

Anyway, given:

  x = a + b
  y = *(x + c)

it can only be transformed into:

  x = a + c
  y = *(x + b)

if and only if a + b would choose the same segment as a + c.  On the PA
that wouldn't be true for something like:

  a = 0x40000000
  b = 0x10
  c = -4

a + b is 0x40000010, while a + c is 0x3ffffffc.  Those select different
segments, even though the full address (0x4000000c) is the same.

Sadly we don't expose this aspect of target characteristics at all.

To proceed, you'd probably need ways to conditionalize these
transformations, i.e. target hooks.

Jeff
Jeff Law writes:

> [...]
>
> Now if you always know pointers are unsigned, then the overflow is
> defined and you'd be OK.  But that's a property of the target and one
> that's not well modeled within GCC (we have POINTER_EXTEND_UNSIGNED
> which kind of tells us something in this space).

I see, understood, cool!  Thanks for such a detailed explanation.

The above scenario may indeed happen for general pointer arithmetic
reassociation.

One thing may make life easier: my reassociation is restricted to the
frame pointer.  The "(plus (plus fp, index_reg) const_off)" pattern was
addressing some variable on the stack; index_reg and const_off were
parts of the stack offset of the variable, so reassociating them means
reordering the two parts of the stack offset.  There may be a way to
prove the transformation will not add extra overflow risk, especially
when index_reg is unsigned.

I understand that for general pointer arithmetic reassociation there is
a real risk: the involved operands largely come from unrelated
instructions, and with no relationship between the values of those
operands we can deduce nothing.

> In addition to worrying about overflow, you have to worry about
> segmented architectures with implicit segment selection -- especially
> if the segment selection comes from the base register rather than the
> entire effective address.

Hmm, understood!  This reminds me of something as dark as the x86
segment descriptor in protected mode...

> Sadly we don't expose this aspect of target characteristics at all.
>
> To proceed, you'd probably need ways to conditionalize these
> transformations, i.e. target hooks.
>
> Jeff
On 05/14/2015 03:13 PM, Jiong Wang wrote:
> [...]
>
> One thing may make life easier as my reassociation is restricted
> within frame pointer.  The "(plus (plus fp, index_reg) + const_off)"
> pattern was to address some variable on stack.  index_reg, const_off
> were part of the stack offset of the variable.  Reassociate them means
> reorder two parts of the stack offset.  There may be way to prove the
> transformation will not add extra overflow risk, especially when the
> index_reg is unsigned.

Given the special status of SP, FP and ARGP and a known constant part,
we can probably do something here.  More below...

> This let me recall something as dark as x86 segment descriptor in
> protected mode...

Possibly; I've actually never studied the segmented aspects of the x86,
but I'm painfully familiar with the others mentioned :(

My recollection for the segmented stuff on the PA is that we only had a
single guard page at both ends of the segment.  So we only allowed an
offset of +-4k when doing address reassociations in legitimize_address.
This was possible because we had callouts from the right places in the
RTL generators/optimizers to allow targets to rewrite address
arithmetic, so we could naturally bury the target details away from the
code generator/optimizers.

So we could possibly parameterize the transformation around similar
concepts.  The design issue here is that it introduces more target
dependencies in places where we've really wanted to avoid them.  In
theory the gimple optimizers are supposed to be target independent.
In reality some stuff bleeds into them (the one that's mentioned most
often is branch costing, but there are others).

*If* we decide to go forward with using some target hooks here, I'd be
tempted to use two.  One is effectively a tri-state: full
reassociation, limited reassociation, no reassociation.  The second
would bound the constants in the limited reassociation case.

Thoughts?

Jeff
Jeff Law writes:

> [...]
>
> *If* we decide to go forward with using some target hooks here, I'd be
> tempted to use two.  One is effectively a tri-state: full
> reassociation, limited reassociation, no reassociation.  The second
> would bound the constants in the limited reassociation case.
>
> Thoughts?

Thanks for these thoughts.

I tried, but I still can't prove this transformation will not introduce
extra pointer overflow, even given that it's reassociation with vfp,
although my first impression is that it will not introduce extra risk
in real applications.

I have done a quick check on hppa's legitimize_address.  I see that for
(plus sym_ref, const_int), if const_int is beyond +-4K, that hook will
force it into a register, and then (plus reg, reg) is always OK.

So for target hooks, my understanding of your idea is something like: a
new hook targetm.pointer_arith_reassociate (); if it returns -1 we
support full reassociation, 0 for limited, and 1 for no reassociation
at all.  The default version returns -1, as most targets are OK to do
reassociation given we can prove there is no extra overflow risk, while
a target like HPPA should define this hook to return 0 for limited
support.

Then, if targetm.pointer_arith_reassociate () returns 0, we should
further invoke a second hook, targetm.limited_reassociate_p (rtx x), to
check that the reassociated rtx 'x' meets any restrictions; for example
on HPPA the constant part shouldn't be beyond +-4K.

Not sure whether my understanding is correct.
On 05/21/2015 02:46 PM, Jiong Wang wrote:
> I tried, but I still can't prove this transformation will not
> introduce extra pointer overflow, even given that it's reassociation
> with vfp, although my first impression is that it will not introduce
> extra risk in real applications.
>
> I have done a quick check on hppa's legitimize_address.  I see that
> for (plus sym_ref, const_int), if const_int is beyond +-4K, that hook
> will force it into a register, and then (plus reg, reg) is always OK.

I'm virtually certain the PA's legitimize_address is not overflow safe.
It was written long before we started worrying about overflows in
address computations.  It was mostly concerned with trying to generate
good addressing modes without running afoul of the implicit space
register selection issues.

A SYMBOL_REF is always a valid base register.  However, as the comment
in hppa_legitimize_address notes, we might be given a MEM for something
like x[n-100000].  We don't want to rewrite that as (x-100000) + n,
even though doing so would be beneficial for LICM.

> So for target hooks, my understanding of your idea is something like:
> a new hook targetm.pointer_arith_reassociate (); if it returns -1 we
> support full reassociation, 0 for limited, and 1 for no reassociation
> at all. [...]

Right.  Rather than use magic constants, I'd suggest an enum for the
tri-state: FULL_PTR_REASSOCIATION, PARTIAL_PTR_REASSOCIATION,
NO_PTR_REASSOCIATION.

> Then, if targetm.pointer_arith_reassociate () returns 0, we should
> further invoke a second hook, targetm.limited_reassociate_p (rtx x),
> to check that the reassociated rtx 'x' meets any restrictions; for
> example on HPPA the constant part shouldn't be beyond +-4K.

Right.

Jeff
Jeff Law writes:

> Right.  Rather than use magic constants, I'd suggest an enum for the
> tri-state: FULL_PTR_REASSOCIATION, PARTIAL_PTR_REASSOCIATION,
> NO_PTR_REASSOCIATION.
>
> [...]

For the record, after Bin's recent tree-ssa-ivopt improvement
originating from PR62173, this patch is not beneficial anymore; I can't
see such re-shuffling opportunities at the RTL level now.

This patch was trying to hoist those RTL sequences generated for local
array base addresses which hadn't been hoisted out of the loop at the
tree level, while now they are handled quite well by tree-ssa-ivopt.
During both aarch64 and mips64 bootstraps, the optimization in this
patch is never triggered, while it triggered quite a few times before
Bin's tree-level fix.

I have stopped working on this patch.  Thanks for the time spent
reviewing and discussing it.
On 09/02/2015 07:36 AM, Jiong Wang wrote:
> For the record, after Bin's recent tree-ssa-ivopt improvement
> originating from PR62173, this patch is not beneficial anymore.

It happens sometimes.

> I have stopped working on this patch.  Thanks for the time spent
> reviewing and discussing it.

No problem.  Thanks for testing after Bin's changes and letting me know
that this has become a non-issue.

Jeff
diff --git a/gcc/loop-invariant.c b/gcc/loop-invariant.c
index f79b497..f70dfb0 100644
--- a/gcc/loop-invariant.c
+++ b/gcc/loop-invariant.c
@@ -203,6 +203,8 @@ typedef struct invariant *invariant_p;
 
 /* The invariants.  */
 static vec<invariant_p> invariants;
+static hash_table <pointer_hash <rtx_insn> > *vfp_const_iv;
+static bool need_expensive_addr_check_p;
 
 /* Check the size of the invariant table and realloc if necessary.  */
@@ -695,7 +697,7 @@ find_defs (struct loop *loop)
   df_remove_problem (df_chain);
   df_process_deferred_rescans ();
-  df_chain_add_problem (DF_UD_CHAIN);
+  df_chain_add_problem (DF_UD_CHAIN + DF_DU_CHAIN);
   df_set_flags (DF_RD_PRUNE_DEAD_DEFS);
   df_analyze_loop (loop);
   check_invariant_table_size ();
@@ -742,6 +744,9 @@ create_new_invariant (struct def *def, rtx_insn *insn, bitmap depends_on,
 	 See http://gcc.gnu.org/ml/gcc-patches/2009-10/msg01210.html .  */
       inv->cheap_address = address_cost (SET_SRC (set), word_mode,
 					 ADDR_SPACE_GENERIC, speed) < 3;
+
+      if (need_expensive_addr_check_p && vfp_const_iv->find (insn))
+	inv->cheap_address = false;
     }
   else
     {
@@ -952,7 +957,8 @@ find_invariant_insn (rtx_insn *insn, bool always_reached, bool always_executed)
     return;
 
   depends_on = BITMAP_ALLOC (NULL);
-  if (!check_dependencies (insn, depends_on))
+  if (!vfp_const_iv->find (insn)
+      && !check_dependencies (insn, depends_on))
     {
       BITMAP_FREE (depends_on);
       return;
@@ -1007,6 +1013,180 @@ find_invariants_insn (rtx_insn *insn, bool always_reached, bool always_executed)
   record_uses (insn);
 }
 
+/* Find the single use of def of insn.  At the same time,
+   make sure def of insn is the single def which reach the use.
+   NOTE: debug_insn can affect DF, var_location debug insn may reference
+   the same rtl expr as variable location, such debug_insn should not
+   affect the insn shuffling while we do need to update the loc of the
+   debug_insn after insn shuffling.  So here we will record such
+   debug_insn.  */
+
+static rtx_insn *
+use_for_single_du (rtx_insn *insn, rtx_insn **debug_insn, rtx *debug_expr_loc)
+{
+  struct df_insn_info *insn_info = DF_INSN_INFO_GET (insn);
+  df_ref def;
+
+  FOR_EACH_INSN_INFO_DEF (def, insn_info)
+    {
+      struct df_link *chain = DF_REF_CHAIN (def);
+
+      if (!chain)
+	return NULL;
+
+      /* For multi use, only allow the second use be debug_insn.  */
+      if (chain->next
+	  && (chain->next->next
+	      || NONDEBUG_INSN_P (DF_REF_INSN (chain->next->ref))))
+	return NULL;
+
+      if (chain->next)
+	{
+	  rtx_insn *insn = DF_REF_INSN (chain->next->ref);
+	  rtx debug_pat = PATTERN (insn);
+	  if (GET_CODE (debug_pat) != VAR_LOCATION)
+	    return NULL;
+	  else
+	    {
+	      *debug_insn = insn;
+	      *debug_expr_loc = PAT_VAR_LOCATION_LOC (debug_pat);
+	    }
+	}
+
+      df_ref use = chain->ref;
+      chain = DF_REF_CHAIN (use);
+
+      if (!chain
+	  || chain->next)
+	return NULL;
+
+      return DF_REF_INSN (use);
+    }
+
+  return NULL;
+}
+
+/* This function also try to transform
+
+     RA <- fixed_reg + RC
+     RD <- MEM (RA + const_offset)
+
+   into:
+
+     RA <- fixed_reg + const_offset
+     RD <- MEM (RA + RC)  <- pos0
+
+   If use of RA in the second insn is the single use, and the define of
+   RA in the first insn is the single def reach the second insn.
+
+   After this change, the first instruction is loop invariant.  */
+
+static void
+reshuffle_insn_with_vfp (rtx_insn *insn)
+{
+  rtx set = single_set (insn);
+
+  if (!set
+      || GET_CODE (SET_SRC (set)) != PLUS)
+    return;
+
+  rtx dest = SET_DEST (set);
+  rtx op0 = XEXP (SET_SRC (set), 0);
+  rtx op1 = XEXP (SET_SRC (set), 1);
+  rtx non_vfp_op = op1;
+
+  if (op1 == frame_pointer_rtx
+      || op1 == stack_pointer_rtx
+      || op1 == virtual_stack_vars_rtx)
+    {
+      non_vfp_op = op0;
+      std::swap (op0, op1);
+    }
+
+  if (GET_CODE (op1) == SIGN_EXTEND
+      || GET_CODE (op1) == ZERO_EXTEND)
+    op1 = XEXP (op1, 0);
+
+  if (!(REG_P (dest) && REG_P (op0) && REG_P (op1)
+	&& (op0 == frame_pointer_rtx
+	    || op0 == stack_pointer_rtx
+	    || op0 == virtual_stack_vars_rtx)))
+    return;
+
+  rtx_insn *use_insn, *debug_insn = NULL;
+  rtx debug_expr_loc;
+
+  if (!(use_insn = use_for_single_du (insn, &debug_insn, &debug_expr_loc)))
+    return;
+
+  rtx u_set = single_set (use_insn);
+
+  if (!(u_set && (REG_P (SET_DEST (u_set)) || MEM_P (SET_DEST (u_set)))))
+    return;
+
+  rtx mem_addr;
+
+  if (REG_P (SET_DEST (u_set)))
+    mem_addr = SET_SRC (u_set);
+  else
+    mem_addr = SET_DEST (u_set);
+
+  if (debug_insn && !rtx_equal_p (debug_expr_loc, mem_addr))
+    return;
+
+  if (GET_CODE (mem_addr) == ZERO_EXTEND
+      || GET_CODE (mem_addr) == SIGN_EXTEND
+      || GET_CODE (mem_addr) == TRUNCATE)
+    mem_addr = XEXP (mem_addr, 0);
+
+  if (!MEM_P (mem_addr)
+      || GET_CODE (XEXP (mem_addr, 0)) != PLUS)
+    return;
+
+  rtx *mem_plus_loc = &XEXP (mem_addr, 0);
+  rtx u_op0 = XEXP (XEXP (mem_addr, 0), 0);
+  rtx u_op1 = XEXP (XEXP (mem_addr, 0), 1);
+
+  if (!(REG_P (u_op0) && CONST_INT_P (u_op1)
+	&& REGNO (dest) == REGNO (u_op0)))
+    return;
+
+  rtx new_src = plus_constant (GET_MODE (op0), op0, INTVAL (u_op1));
+  validate_change (insn, &SET_SRC (set), new_src, 1);
+  new_src = gen_rtx_PLUS (GET_MODE (u_op0), u_op0, non_vfp_op);
+  validate_change (use_insn, mem_plus_loc, new_src, 1);
+  if (debug_insn)
+    validate_change (debug_insn, &XEXP (debug_expr_loc, 0),
+		     copy_rtx (new_src), 1);
+
+  if
(apply_change_group ()) + { + rtx_insn **slot = vfp_const_iv->find_slot (insn, INSERT); + + if (*slot) + gcc_unreachable (); + else + *slot = insn; + + need_expensive_addr_check_p = true; + + /* We should update REG_DEAD info also. */ + rtx note; + if ((note = find_reg_note (insn, REG_DEAD, non_vfp_op))) + { + remove_note (insn, note); + add_shallow_copy_of_reg_note (use_insn, note); + } + + if (dump_file) + fprintf (dump_file, + "\nRe-associate insn %d and %d for later" + " RTL loop invariant hoisting.\n", + INSN_UID (insn), INSN_UID (use_insn)); + } + + return; +} + /* Finds invariants in basic block BB. ALWAYS_REACHED is true if the basic block is always executed. ALWAYS_EXECUTED is true if the basic block is always executed, unless the program ends due to a function @@ -1022,6 +1197,8 @@ find_invariants_bb (basic_block bb, bool always_reached, bool always_executed) if (!NONDEBUG_INSN_P (insn)) continue; + reshuffle_insn_with_vfp (insn); + find_invariants_insn (insn, always_reached, always_executed); if (always_reached @@ -1647,6 +1824,9 @@ move_invariants (struct loop *loop) static void init_inv_motion_data (void) { + need_expensive_addr_check_p = false; + vfp_const_iv = new hash_table <pointer_hash <rtx_insn> > (4); + actual_stamp = 1; invariants.create (100); @@ -1682,6 +1862,8 @@ free_inv_motion_data (void) free (inv); } invariants.release (); + + delete vfp_const_iv; } /* Move the invariants out of the LOOP. */