Message ID | CABu31nPB3t+b61fSndgbFD-Ps4SXo62872f2cOyRBOi0nW22ig@mail.gmail.com |
---|---|
State | New |
Headers | show |
Yes, I agree with you. fold_rtx also needs to be extended because now it only handles the case similar as follows for shift insn: a = b op const1 c = a >> const2 for our motivational case, the second operand of the first insn is a reg instead of a const. We also need to add the truncation support for our case in simplify_binary_operation. I will send out a more official patch about fwprop extension soon. Then it may be easier to talk about its rationality. Thanks, Wei. On Wed, Feb 27, 2013 at 1:21 PM, Steven Bosscher <stevenb.gcc@gmail.com> wrote: > On Wed, Feb 27, 2013 at 7:37 PM, Wei Mi wrote: >> What do you think? > > I think you'll not be able to teach fold_rtx to perform the > transformation you want it to do without having SHIFT_COUNT_TRUNCATED > set for i386. I already tried it the other day, but GCC won't do the > truncation without knowing the insn is really a shift insn and > shift_truncation_mask returns something useful. > > Ciao! > Steven > > > Index: cse.c > =================================================================== > --- cse.c (revision 196182) > +++ cse.c (working copy) > @@ -3179,9 +3179,22 @@ fold_rtx (rtx x, rtx insn) > > switch (GET_CODE (folded_arg)) > { > + case SUBREG: > + /* If the SUBREG_REG comes in from an AND, and this is not a > + paradoxical subreg, then try to fold the SUBREG. */ > + if (REG_P (SUBREG_REG (folded_arg)) > + && ! paradoxical_subreg_p (folded_arg)) > + { > + rtx y = lookup_as_function (SUBREG_REG (folded_arg), AND); > + if (y != 0) > + y = simplify_gen_binary(AND, GET_MODE (folded_arg), > + XEXP(y, 0), XEXP(y, 1)); > + if (y != 0) > + folded_arg = y; > + } > + /* ... fall through ... */ > case MEM: > case REG: > - case SUBREG: > const_arg = equiv_constant (folded_arg); > break;
Hi, This is the fwprop extension patch which is put in order. Regression test and bootstrap pass. Please help to review its rationality. The following is a brief description what I have done in the patch. In order to make fwprop more effective in rtl optimization, we extend it to handle general expressions instead of the three cases listed in the head comment in fwprop.c. The major changes include a) We need to check propagation correctness for src exprs of def which contain mem references. Previous fwprop for the three cases above doesn't have the problem. b) We need a better cost model because the benefit is usually not so apparent as the three cases above. For a general fwprop problem, there are two possible sources where benefit comes from. The frist is the new use insn after propagation and simplification may have lower cost than itself before propagation, or propagation may create a new insn, that could be splitted or peephole optimized later and get a lower cost. The second is that if all the uses are replaced with the src of the def insn, the def insn could be deleted. So instead of check each def-use pair independently, we use DU chain to track all the uses for a def. For each def-use pair, we attempt the propagation, record the change candidate in changes[] array, but we wait to confirm the changes until all the pairs with the same def are iterated. The changes confirmation is done in the func confirm_change_group_by_cost. We only do this for fwprop. For fwprop_addr, the benefit of each change is ensured by propagation_rtx_1 using should_replace_address, so we just confirm all the changes without checking benefit again. Thanks, Wei. On Wed, Feb 27, 2013 at 1:56 PM, Wei Mi <wmi@google.com> wrote: > Yes, I agree with you. fold_rtx also needs to be extended because now > it only handles the case similar as follows for shift insn: > a = b op const1 > c = a >> const2 > for our motivational case, the second operand of the first insn is a > reg instead of a const. We also need to add the truncation support for > our case in simplify_binary_operation. > > I will send out a more official patch about fwprop extension soon. > Then it may be easier to talk about its rationality. > > Thanks, > Wei. > > On Wed, Feb 27, 2013 at 1:21 PM, Steven Bosscher <stevenb.gcc@gmail.com> wrote: >> On Wed, Feb 27, 2013 at 7:37 PM, Wei Mi wrote: >>> What do you think? >> >> I think you'll not be able to teach fold_rtx to perform the >> transformation you want it to do without having SHIFT_COUNT_TRUNCATED >> set for i386. I already tried it the other day, but GCC won't do the >> truncation without knowing the insn is really a shift insn and >> shift_truncation_mask returns something useful. >> >> Ciao! >> Steven >> >> >> Index: cse.c >> =================================================================== >> --- cse.c (revision 196182) >> +++ cse.c (working copy) >> @@ -3179,9 +3179,22 @@ fold_rtx (rtx x, rtx insn) >> >> switch (GET_CODE (folded_arg)) >> { >> + case SUBREG: >> + /* If the SUBREG_REG comes in from an AND, and this is not a >> + paradoxical subreg, then try to fold the SUBREG. */ >> + if (REG_P (SUBREG_REG (folded_arg)) >> + && ! paradoxical_subreg_p (folded_arg)) >> + { >> + rtx y = lookup_as_function (SUBREG_REG (folded_arg), AND); >> + if (y != 0) >> + y = simplify_gen_binary(AND, GET_MODE (folded_arg), >> + XEXP(y, 0), XEXP(y, 1)); >> + if (y != 0) >> + folded_arg = y; >> + } >> + /* ... fall through ... */ >> case MEM: >> case REG: >> - case SUBREG: >> const_arg = equiv_constant (folded_arg); >> break;
On 03/10/2013 11:52 PM, Wei Mi wrote: > Hi, > > This is the fwprop extension patch which is put in order. Regression > test and bootstrap pass. Please help to review its rationality. The > following is a brief description what I have done in the patch. > > In order to make fwprop more effective in rtl optimization, we extend > it to handle general expressions instead of the three cases listed in > the head comment in fwprop.c. The major changes include a) We need to > check propagation correctness for src exprs of def which contain mem > references. Previous fwprop for the three cases above doesn't have the > problem. b) We need a better cost model because the benefit is usually > not so apparent as the three cases above. > > For a general fwprop problem, there are two possible sources where > benefit comes from. The frist is the new use insn after propagation > and simplification may have lower cost than itself before propagation, > or propagation may create a new insn, that could be splitted or > peephole optimized later and get a lower cost. The second is that if > all the uses are replaced with the src of the def insn, the def insn > could be deleted. > > So instead of check each def-use pair independently, we use DU chain > to track all the uses for a def. For each def-use pair, we attempt the > propagation, record the change candidate in changes[] array, but we > wait to confirm the changes until all the pairs with the same def are > iterated. The changes confirmation is done in the func > confirm_change_group_by_cost. We only do this for fwprop. For > fwprop_addr, the benefit of each change is ensured by > propagation_rtx_1 using should_replace_address, so we just confirm all > the changes without checking benefit again. Can you please attach this to the 4.9 pending patches tracker bug. We're really focused on trying to get 4.8 out the door and this doesn't seem like suitable material for GCC 4.8. I haven't looked at the details of the patch at all yet and doubt I would prior to GCC 4.8 going out the door. Thanks, jeff
On Mon, Mar 11, 2013 at 7:10 PM, Jeff Law <law@redhat.com> wrote: > On 03/10/2013 11:52 PM, Wei Mi wrote: >> >> Hi, >> >> This is the fwprop extension patch which is put in order. Regression >> test and bootstrap pass. Please help to review its rationality. The >> following is a brief description what I have done in the patch. >> >> In order to make fwprop more effective in rtl optimization, we extend >> it to handle general expressions instead of the three cases listed in >> the head comment in fwprop.c. The major changes include a) We need to >> check propagation correctness for src exprs of def which contain mem >> references. Previous fwprop for the three cases above doesn't have the >> problem. b) We need a better cost model because the benefit is usually >> not so apparent as the three cases above. >> >> For a general fwprop problem, there are two possible sources where >> benefit comes from. The frist is the new use insn after propagation >> and simplification may have lower cost than itself before propagation, >> or propagation may create a new insn, that could be splitted or >> peephole optimized later and get a lower cost. The second is that if >> all the uses are replaced with the src of the def insn, the def insn >> could be deleted. >> >> So instead of check each def-use pair independently, we use DU chain >> to track all the uses for a def. For each def-use pair, we attempt the >> propagation, record the change candidate in changes[] array, but we >> wait to confirm the changes until all the pairs with the same def are >> iterated. The changes confirmation is done in the func >> confirm_change_group_by_cost. We only do this for fwprop. For >> fwprop_addr, the benefit of each change is ensured by >> propagation_rtx_1 using should_replace_address, so we just confirm all >> the changes without checking benefit again. > > Can you please attach this to the 4.9 pending patches tracker bug. We're > really focused on trying to get 4.8 out the door and this doesn't seem like > suitable material for GCC 4.8. > > I haven't looked at the details of the patch at all yet and doubt I would > prior to GCC 4.8 going out the door. > > Thanks, > jeff > Jeff, The world has more people than you, and with different interests. This patch was posted here for comments on the idea, and while I'm sure your feedback would be very valuable, it is no more required for discussing this patch than it is for releasing GCC 4.8. Ciao! Steven
On Mon, Mar 11, 2013 at 6:52 AM, Wei Mi wrote: > This is the fwprop extension patch which is put in order. Regression > test and bootstrap pass. Please help to review its rationality. The > following is a brief description what I have done in the patch. > > In order to make fwprop more effective in rtl optimization, we extend > it to handle general expressions instead of the three cases listed in > the head comment in fwprop.c. The major changes include a) We need to > check propagation correctness for src exprs of def which contain mem > references. Previous fwprop for the three cases above doesn't have the > problem. b) We need a better cost model because the benefit is usually > not so apparent as the three cases above. > > For a general fwprop problem, there are two possible sources where > benefit comes from. The frist is the new use insn after propagation > and simplification may have lower cost than itself before propagation, > or propagation may create a new insn, that could be splitted or > peephole optimized later and get a lower cost. The second is that if > all the uses are replaced with the src of the def insn, the def insn > could be deleted. > > So instead of check each def-use pair independently, we use DU chain > to track all the uses for a def. For each def-use pair, we attempt the > propagation, record the change candidate in changes[] array, but we > wait to confirm the changes until all the pairs with the same def are > iterated. The changes confirmation is done in the func > confirm_change_group_by_cost. We only do this for fwprop. For > fwprop_addr, the benefit of each change is ensured by > propagation_rtx_1 using should_replace_address, so we just confirm all > the changes without checking benefit again. Hello Wei Mi, So IIUC, in essence you are doing: main: FOR_EACH_BB: FOR_BB_INSNS, non-debug insns only: for each df_ref DEF operand on insn: iterate_def_uses iterate_def_uses: for each UD chain from DEF to USE(i): forward_propagate_into confirm changes by total benefit I still like the idea, but there are also still a few "design issues" to resolve. Some of the same comments as before apply: Do you really, really, reallyreally have to go so low-level as to insn splitting, peephole optimizations, and even register allocation, to get the cost right? That will almost certainly not be acceptable, and I for one would oppose such a change. It's IMHO a violation of proper engineering when your medium-to-high level code transformations have to do that. If you have strong reasons for your approach, it'd be helpful if you can explain them so that we can together look for a less intrusive solution (e.g. splitting earlier, adjusting the cost model, etc.). So things like: > + /* we may call peephole2_insns in fwprop phase to estimate how > + peephole will affect the cost of the insn transformed by fwprop. > + fwprop is done before ira phase. In that case, we simply return > + a new pseudo register. */ > + if (!strncmp (current_pass->name, "fwprop", 6)) > + return gen_reg_rtx (mode); and > Index: config/i386/i386.c > =================================================================== > --- config/i386/i386.c (revision 196270) > +++ config/i386/i386.c (working copy) > @@ -15901,8 +15901,14 @@ ix86_expand_clear (rtx dest) > { > rtx tmp; > > - /* We play register width games, which are only valid after reload. */ > - gcc_assert (reload_completed); > + /* We play register width games, which are only valid after reload. > + An exception: fwprop call peephole to estimate the change benefit, > + and peephole will call this func. That is before reload complete. > + It will not bring any problem because the peephole2_insns call is > + only used for cost estimation in fwprop, and its change will be > + abandoned immediately after the cost estimation. */ > + if (strncmp (current_pass->name, "fwprop", 6)) > + gcc_assert (reload_completed); are IMHO not OK. Note that your patch is a bit difficult to read at some points because you have included a bunch of non-changes (whitespaces fixes -- necessary cleanups but not relevant for your patch), see e.g. the changed lines that contain "lra_in_progress". Also the changes like: > static bool > -propagate_rtx_1 (rtx *px, rtx old_rtx, rtx new_rtx, int flags) > +propagate_rtx_1 (rtx *px, rtx old_rtx, rtx new_rtx, bool speed) which are quite distracting, making it harder to see what has *really* changed. You should probably just a helper function apply_change_group_num() and avoid all the apply_change_group use fixups. In fwprop.c: > + /* DF_LR_RUN_DCE is used in peephole2_insns, which is called for cost > + estimation in estimate_split_and_peephole. */ > + df_set_flags (DF_LR_RUN_DCE); > df_md_add_problem (); > df_note_add_problem (); > - df_analyze (); > + df_chain_add_problem (DF_UD_CHAIN | DF_DU_CHAIN); > df_maybe_reorganize_use_refs (DF_REF_ORDER_BY_INSN_WITH_NOTES); > + df_analyze (); you add DU and UD chains, and implicitly the RD problem, but you also already have the MD problem. I think my reaching-defs patches for GCC 4.8 make the MD problem less necessary, but you certainly don't need MD + RD + UD + DU. You've noticed so yourself: > + We think the maintainance for use_def_ref vector is not necessary, so > + we remove update_df/update_uses/update_df_init/register_active_defs. */ and it looks like you're simply avoiding the problem by queuing up changes and commit them all at the end. I don't believe that will work, you'll break the UD and DU chains and may end up with dangling pointers to free'd or removed df_refs. > + /* see when the insn is not a set */ > + if (!set) > + return false; fwprop.c was speciflcally developed to also handle multiple-set instructions, like the bits of cse.c that it tried to replace. Your patch should not change this. > +static bool > +mem_may_be_modified (rtx from, rtx to) This has "potentially slow" written all over it :-) (You're punting on any MEM for now, but someone at some point will find a reason to use alias analysis, blowing up really bad test cases like PR39326...) > +int > +reg_mentioned_num (const_rtx reg, const_rtx in Should use DF caches instead of deep-diving the pattern. Or if DF cache updates are deferred, use for_each_rtx on the pattern. > +/* Find whether the set define a return reg. */ > + > +static bool > +def_return_reg (rtx set) > +{ > + edge eg; > + edge_iterator ei; > + rtx dest = SET_DEST (set); > + > + if (!REG_P (dest)) > + return false; The return USE will also be a hard reg: + if (!REG_P (dest) || ! HARD_REGISTER_P (dest)) + return false; > + FOR_EACH_EDGE (eg, ei, EXIT_BLOCK_PTR->preds) > + if (eg->flags & EDGE_FALLTHRU) > + { > + basic_block src_bb = eg->src; > + rtx last_insn, ret_reg; > + if (EDGE_COUNT (EXIT_BLOCK_PTR->preds) == 1 single_pred_p(), but why do FOR_EACH_EDGE and then chec that there is only one pred to begin with? > + && NONJUMP_INSN_P ((last_insn = BB_END (src_bb))) > + && GET_CODE (PATTERN (last_insn)) == USE > + && GET_CODE ((ret_reg = XEXP (PATTERN (last_insn), 0))) == REG > + && REGNO (ret_reg) == REGNO (dest)) > + return true; > + } > + return false; > +} Actually this whole change makes me nervous. I don't think you should propagate into any USE at all, for return value or otherwise. > + if (def_insn) > + { > + rtx set = single_set (def_insn); > + if (set) > + def_insn_cost = set_src_cost (SET_SRC (set), speed) > + + set_src_cost (SET_DEST (set), speed) + 1; > + else > + return false; > + } As before: You'll have to deal with non-single_set insns also. > +void > +dump_cfg (FILE *file) > +{ You'll find -fdump-rtl-fwprop-graph useful, as well as brief_dump_cfg. > + if (fwprop_addr) > + return confirm_change_group_by_cost (false, > + 0, > + false); > + else > + { > + all_uses_replaced = (use_num == reg_replaced_num); > + return confirm_change_group_by_cost (all_uses_replaced, > + def_insn_cost, > + true); > + } What happens if you propagate into an insn that uses the same register twice? Will the DU chains still be valid (I don't think that's guaranteed)? Is the extra_benefit flag always applicable if all USEs of a DEF have been propagated out? What if the DEF is in an insn that is inherently necessary? Have you measured what effect this pass has on combine? Ciao! Steven
Thanks for the helpful comments! I have some replies inlined. Regards, Wei. On Mon, Mar 11, 2013 at 12:52 PM, Steven Bosscher <stevenb.gcc@gmail.com> wrote: > On Mon, Mar 11, 2013 at 6:52 AM, Wei Mi wrote: >> This is the fwprop extension patch which is put in order. Regression >> test and bootstrap pass. Please help to review its rationality. The >> following is a brief description what I have done in the patch. >> >> In order to make fwprop more effective in rtl optimization, we extend >> it to handle general expressions instead of the three cases listed in >> the head comment in fwprop.c. The major changes include a) We need to >> check propagation correctness for src exprs of def which contain mem >> references. Previous fwprop for the three cases above doesn't have the >> problem. b) We need a better cost model because the benefit is usually >> not so apparent as the three cases above. >> >> For a general fwprop problem, there are two possible sources where >> benefit comes from. The frist is the new use insn after propagation >> and simplification may have lower cost than itself before propagation, >> or propagation may create a new insn, that could be splitted or >> peephole optimized later and get a lower cost. The second is that if >> all the uses are replaced with the src of the def insn, the def insn >> could be deleted. >> >> So instead of check each def-use pair independently, we use DU chain >> to track all the uses for a def. For each def-use pair, we attempt the >> propagation, record the change candidate in changes[] array, but we >> wait to confirm the changes until all the pairs with the same def are >> iterated. The changes confirmation is done in the func >> confirm_change_group_by_cost. We only do this for fwprop. For >> fwprop_addr, the benefit of each change is ensured by >> propagation_rtx_1 using should_replace_address, so we just confirm all >> the changes without checking benefit again. > > Hello Wei Mi, > > So IIUC, in essence you are doing: > > main: > FOR_EACH_BB: > FOR_BB_INSNS, non-debug insns only: > for each df_ref DEF operand on insn: > iterate_def_uses > > iterate_def_uses: > for each UD chain from DEF to USE(i): > forward_propagate_into > confirm changes by total benefit > > I still like the idea, but there are also still a few "design issues" > to resolve. > > Some of the same comments as before apply: Do you really, really, > reallyreally have to go so low-level as to insn splitting, peephole > optimizations, and even register allocation, to get the cost right? > That will almost certainly not be acceptable, and I for one would > oppose such a change. It's IMHO a violation of proper engineering when > your medium-to-high level code transformations have to do that. If you > have strong reasons for your approach, it'd be helpful if you can > explain them so that we can together look for a less intrusive > solution (e.g. splitting earlier, adjusting the cost model, etc.). > For the motivational case, I need insn splitting to get the cost right. insn splitting is not very intrusive. All I need is to call split_insns func. I think split_insns is just a pattern matching func just like recog(), which is called at many places. Peephole is not necessary (I add it in order to find as many oppotunities as possible, but from my trace analysis, it doesn't help much). To call peephole2_insn() is indeed intrusive, because peephole assumes reg allocation is completed, I have to insert the ugly workaround below. peephole also requires setting DF_LR_RUN_DCE flag and some initialization of peep2_insn_data array. So how about keep split_insns and remove peephole in the cost estimation func? > So things like: >> + /* we may call peephole2_insns in fwprop phase to estimate how >> + peephole will affect the cost of the insn transformed by fwprop. >> + fwprop is done before ira phase. In that case, we simply return >> + a new pseudo register. */ >> + if (!strncmp (current_pass->name, "fwprop", 6)) >> + return gen_reg_rtx (mode); > > and > >> Index: config/i386/i386.c >> =================================================================== >> --- config/i386/i386.c (revision 196270) >> +++ config/i386/i386.c (working copy) >> @@ -15901,8 +15901,14 @@ ix86_expand_clear (rtx dest) >> { >> rtx tmp; >> >> - /* We play register width games, which are only valid after reload. */ >> - gcc_assert (reload_completed); >> + /* We play register width games, which are only valid after reload. >> + An exception: fwprop call peephole to estimate the change benefit, >> + and peephole will call this func. That is before reload complete. >> + It will not bring any problem because the peephole2_insns call is >> + only used for cost estimation in fwprop, and its change will be >> + abandoned immediately after the cost estimation. */ >> + if (strncmp (current_pass->name, "fwprop", 6)) >> + gcc_assert (reload_completed); > > are IMHO not OK. > They are intrusive and inserted for peephole. > Note that your patch is a bit difficult to read at some points because > you have included a bunch of non-changes (whitespaces fixes -- > necessary cleanups but not relevant for your patch), see e.g. the > changed lines that contain "lra_in_progress". Also the changes like: >> static bool >> -propagate_rtx_1 (rtx *px, rtx old_rtx, rtx new_rtx, int flags) >> +propagate_rtx_1 (rtx *px, rtx old_rtx, rtx new_rtx, bool speed) > > which are quite distracting, making it harder to see what has *really* changed. > In the old fwprop, the flags is of enum type {PR_CAN_APPEAR, PR_HANDLE_MEM, PR_OPTIMIZE_FOR_SPEED}. PR_CAN_APPEAR is used to restrict fwprop to only handle the three typical cases in the comment at the head of fwprop.c. PR_HANDLE_MEM is used for mem addr. PR_OPTIMIZE_FOR_SPEED indicates optimization is for speed or for size. New fwprop will only use PR_OPTIMIZE_FOR_SPEED, and the other two are useless. So I change the param from "int flags" to "bool speed". > You should probably just a helper function apply_change_group_num() > and avoid all the apply_change_group use fixups. > > Yes, they are distracting. I can use apply_change_group_num for easy code review, but I think to commit, extending apply_change_group and avoid creating a very similar func is more welcomed? > In fwprop.c: >> + /* DF_LR_RUN_DCE is used in peephole2_insns, which is called for cost >> + estimation in estimate_split_and_peephole. */ >> + df_set_flags (DF_LR_RUN_DCE); >> df_md_add_problem (); >> df_note_add_problem (); >> - df_analyze (); >> + df_chain_add_problem (DF_UD_CHAIN | DF_DU_CHAIN); >> df_maybe_reorganize_use_refs (DF_REF_ORDER_BY_INSN_WITH_NOTES); >> + df_analyze (); > > you add DU and UD chains, and implicitly the RD problem, but you also > already have the MD problem. I think my reaching-defs patches for GCC > 4.8 make the MD problem less necessary, but you certainly don't need > MD + RD + UD + DU. > MD problem is used to set use_def_ref and check whether a use has multiple def in the old fwprop. I reuse that part in new fwprop. With UD chain, we may remove MD problem and use_def_ref vector, and use UD chain to do the check. I will try it. > You've noticed so yourself: >> + We think the maintainance for use_def_ref vector is not necessary, so >> + we remove update_df/update_uses/update_df_init/register_active_defs. */ > > and it looks like you're simply avoiding the problem by queuing up > changes and commit them all at the end. I don't believe that will > work, you'll break the UD and DU chains and may end up with dangling > pointers to free'd or removed df_refs. > I only remove the code related with use_def_ref vector. The code to update df_refs is kept but moved to update_df in recog.c. > >> + /* see when the insn is not a set */ >> + if (!set) >> + return false; > > fwprop.c was speciflcally developed to also handle multiple-set > instructions, like the bits of cse.c that it tried to replace. Your > patch should not change this. > ok, I will remove the limitation. > >> +static bool >> +mem_may_be_modified (rtx from, rtx to) > > This has "potentially slow" written all over it :-) (You're punting on > any MEM for now, but someone at some point will find a reason to use > alias analysis, blowing up really bad test cases like PR39326...) > If someone wants to add alias analysis here, he must find out some good reasons :-). That is what I expect. > >> +int >> +reg_mentioned_num (const_rtx reg, const_rtx in > > Should use DF caches instead of deep-diving the pattern. Or if DF > cache updates are deferred, use for_each_rtx on the pattern. > I will fix it. > >> +/* Find whether the set define a return reg. */ >> + >> +static bool >> +def_return_reg (rtx set) >> +{ >> + edge eg; >> + edge_iterator ei; >> + rtx dest = SET_DEST (set); >> + >> + if (!REG_P (dest)) >> + return false; > > The return USE will also be a hard reg: > + if (!REG_P (dest) || ! HARD_REGISTER_P (dest)) > + return false; > > >> + FOR_EACH_EDGE (eg, ei, EXIT_BLOCK_PTR->preds) >> + if (eg->flags & EDGE_FALLTHRU) >> + { >> + basic_block src_bb = eg->src; >> + rtx last_insn, ret_reg; >> + if (EDGE_COUNT (EXIT_BLOCK_PTR->preds) == 1 > > single_pred_p(), but why do FOR_EACH_EDGE and then chec that there is > only one pred to begin with? My mistake. I copy the chunk of code from mode-switching. FOR_EACH_EDGE is unneeded. > >> + && NONJUMP_INSN_P ((last_insn = BB_END (src_bb))) >> + && GET_CODE (PATTERN (last_insn)) == USE >> + && GET_CODE ((ret_reg = XEXP (PATTERN (last_insn), 0))) == REG >> + && REGNO (ret_reg) == REGNO (dest)) >> + return true; >> + } >> + return false; >> +} > > Actually this whole change makes me nervous. I don't think you should > propagate into any USE at all, for return value or otherwise. Yes, reasonable. I will add the restriction. > >> + if (def_insn) >> + { >> + rtx set = single_set (def_insn); >> + if (set) >> + def_insn_cost = set_src_cost (SET_SRC (set), speed) >> + + set_src_cost (SET_DEST (set), speed) + 1; >> + else >> + return false; >> + } > > As before: You'll have to deal with non-single_set insns also. > I will remove the limitation. > >> +void >> +dump_cfg (FILE *file) >> +{ > > You'll find -fdump-rtl-fwprop-graph useful, as well as brief_dump_cfg. > Oh, thanks. > >> + if (fwprop_addr) >> + return confirm_change_group_by_cost (false, >> + 0, >> + false); >> + else >> + { >> + all_uses_replaced = (use_num == reg_replaced_num); >> + return confirm_change_group_by_cost (all_uses_replaced, >> + def_insn_cost, >> + true); >> + } > > > What happens if you propagate into an insn that uses the same register > twice? Will the DU chains still be valid (I don't think that's > guaranteed)? I think the DU chains still be valid. If propagate into the insn that uses the same register twice, the two uses will be replaced when the first use is seen (propagate_rtx_1 will propagate all the occurrances of the same reg in the use insn). When the second use is seen, the df_use and use insn in its insn_info are still available. forward_propagate_into will early return after check reg_mentioned_p (DF_REF_REG (use), parent) and find out no reg is used any more. A testcase in gcc regression testsuite shows the case: gcc/testsuite/gcc.target/i386/387-10.c > > Is the extra_benefit flag always applicable if all USEs of a DEF have > been propagated out? What if the DEF is in an insn that is inherently > necessary? > If the DEF is an insn that is inherently necessary, we may mistakenly assume it could be deleted in cost estimation. But it will not affect correctness because fwprop depends on delete_trivially_dead_insns in fwprop_done or DCE after fwprop to delete dead DEF. > Have you measured what effect this pass has on combine? I didn't do the measure. I only see a case that new fwprop does the same optimization as what combine does, which old fwprop doesn't optimize. New fwprop bring the optimization to an earlier stage. The testcase is gcc/testsuite/gcc.target/i386/387-10.c. Qualitatively, fwprop optimize the single-def multiple down uses case which combine cannot handle. But I didn't do measurement about their interactive potential. > > Ciao! > Steven
On Tue, Mar 12, 2013 at 8:18 AM, Wei Mi wrote: > For the motivational case, I need insn splitting to get the cost > right. insn splitting is not very intrusive. All I need is to call > split_insns func. It may not look very intrusive, but there's a lot happening in the back ground. You're creating a lot of new RTL, and then just throw it away again. You fake the compiler into thinking you're much deeper in the pipeline than you really are. You're assuming there are no side-effects other than that some insn gets split, but there are back ends where splitters may have side-effects. Even though I've asked twice now, you still have not explained this motivational case, except to say that there is one. *What* are you trying to do, *what* is not happening without the splits, and *what* happens if you split. Only if you explain that in a lot more detail than "I have a motivational case" then we can look into what is a proper solution. The problem with some of the splitters is that they exist to break up RTL from 'expand' to initially keep some pattern together to allow the code transformation passes to handle the pattern as one instruction. This made sense when RTL was the only intermediate representation and splitting too early would inhibit some optimizations. But I would expect most (if not all) such cases to be less relevant because of the GIMPLE middle-end work. The only splitters you can trigger are the pre-reload splitters (all the reload_completed conditions obviously can't trigger if you're splitting from fwprop). Perhaps those splitters can/should run earlier, or be made obsolete by expanding directly to the post-splitting insns. Unfortunately, it's not possible to tell for your case, because you haven't explained it yet... > So how about keep split_insns and remove peephole in the cost estimation func? I'd strongly oppose this. I do not believe this is necessary, and I think it's conceptually wrong. >> What happens if you propagate into an insn that uses the same register >> twice? Will the DU chains still be valid (I don't think that's >> guaranteed)? > > I think the DU chains still be valid. If propagate into the insn that > uses the same register twice, the two uses will be replaced when the > first use is seen (propagate_rtx_1 will propagate all the occurrances > of the same reg in the use insn). When the second use is seen, the > df_use and use insn in its insn_info are still available. > forward_propagate_into will early return after check reg_mentioned_p > (DF_REF_REG (use), parent) and find out no reg is used any more. With reg_mentioned_p you cannot verify that the DF_REF_LOC of USE is still valid. In any case, returning to the RD problem for DU/UD chains is probably a good idea, now that RD is not such a hog anymore. In effect fwprop.c would return to what it looked like before the patch of r149010. As a way forward on all of this, I'd suggest the following steps, each with a separate patch: 1. replace the MD problem with RD again, and build full DU/UD chains. 2. post all the recog changes separately, with minimum impact on the parts of the compiler you don't really change. (For apply_change_group you could even choose to overload it, or use a NUM argument with a default value -- not sure if default argument values are OK for GCC tho'.) 3. implement propagation into multiple USEs, but without the splitting and peepholing. 4. see about fixing the back end to either split earlier or expand to the desired patterns directly. Ciao! Steven
Hi, On Sat, Mar 16, 2013 at 3:48 PM, Steven Bosscher <stevenb.gcc@gmail.com> wrote: > On Tue, Mar 12, 2013 at 8:18 AM, Wei Mi wrote: >> For the motivational case, I need insn splitting to get the cost >> right. insn splitting is not very intrusive. All I need is to call >> split_insns func. > > It may not look very intrusive, but there's a lot happening in the > back ground. You're creating a lot of new RTL, and then just throw it > away again. You fake the compiler into thinking you're much deeper in > the pipeline than you really are. You're assuming there are no > side-effects other than that some insn gets split, but there are back > ends where splitters may have side-effects. Ok, then I will remove the split_insns call. > > Even though I've asked twice now, you still have not explained this > motivational case, except to say that there is one. *What* are you > trying to do, *what* is not happening without the splits, and *what* > happens if you split. Only if you explain that in a lot more detail > than "I have a motivational case" then we can look into what is a > proper solution. :-). Sorry, I didn't say it clearly. The motivational case is the one mentioned in the following posts (split_insns changes a << (b & 63) to a << b). http://gcc.gnu.org/ml/gcc/2013-01/msg00181.html http://gcc.gnu.org/ml/gcc-patches/2013-02/msg01144.html If I remove the split_insns call and related cost estimation adjustment, the fwprop 18-->22 and 18-->23 will punt, because fwprop here looks like a reverse process of cse, the total cost after fwprop change is increased. Def insn 18: Use insn 23 Use insn 22 If we include the split_insns cost estimation adjustment. extra benefit by removing def insn 18 = 5 change[0]: benefit = 0, verified - ok // The cost of insn 22 will not change after fwprop + insn splitting. change[1]: benefit = 0, verified - ok // The insn 23 is the same with insn 22 Total benefit is 5, fwprop will go on. If we remove the split_insns cost estimation adjustment. extra benefit by removing def insn 18 = 5 change[0]: benefit = -4, verified - ok // The costs of insn 22 and insn 23 will increase after fwprop. change[1]: benefit = -4, verified - ok // The insn 23 is the same with insn 22 Total benefit is -3, fwprop will punt. How about adding the (a << (b&63) ==> a << b) transformation in simplify_binary_operation_1, becuase (a << (b&63) ==> a << b) is a kind of architecture specific expr simplification? Then fwprop could do the propagation as I expect. > > The problem with some of the splitters is that they exist to break up > RTL from 'expand' to initially keep some pattern together to allow the > code transformation passes to handle the pattern as one instruction. > This made sense when RTL was the only intermediate representation and > splitting too early would inhibit some optimizations. But I would > expect most (if not all) such cases to be less relevant because of the > GIMPLE middle-end work. The only splitters you can trigger are the > pre-reload splitters (all the reload_completed conditions obviously > can't trigger if you're splitting from fwprop). Perhaps those > splitters can/should run earlier, or be made obsolete by expanding > directly to the post-splitting insns. > > Unfortunately, it's not possible to tell for your case, because you > haven't explained it yet... > > >> So how about keep split_insns and remove peephole in the cost estimation func? > > I'd strongly oppose this. I do not believe this is necessary, and I > think it's conceptually wrong. > > >>> What happens if you propagate into an insn that uses the same register >>> twice? Will the DU chains still be valid (I don't think that's >>> guaranteed)? >> >> I think the DU chains still be valid. If propagate into the insn that >> uses the same register twice, the two uses will be replaced when the >> first use is seen (propagate_rtx_1 will propagate all the occurrances >> of the same reg in the use insn). When the second use is seen, the >> df_use and use insn in its insn_info are still available. >> forward_propagate_into will early return after check reg_mentioned_p >> (DF_REF_REG (use), parent) and find out no reg is used any more. > > With reg_mentioned_p you cannot verify that the DF_REF_LOC of USE is > still valid. I think DF_REF_LOC of USE may be invalid if dangling rtx will be recycled by garbage collection very soon (I don't know when GC will happen). Although DF_REF_LOC of USE maybe invalid, the early return in forward_propagate_into ensure it will not cause any correctness problem. > > In any case, returning to the RD problem for DU/UD chains is probably > a good idea, now that RD is not such a hog anymore. In effect fwprop.c > would return to what it looked like before the patch of r149010. I remove MD problem and use DU/UD instead. > > As a way forward on all of this, I'd suggest the following steps, each > with a separate patch: Thanks for the suggestion! > 1. replace the MD problem with RD again, and build full DU/UD chains. I include patch.1 attached. > 2. post all the recog changes separately, with minimum impact on the > parts of the compiler you don't really change. (For apply_change_group > you could even choose to overload it, or use a NUM argument with a > default value -- not sure if default argument values are OK for GCC > tho'.) patch.2 attached. > 3. implement propagation into multiple USEs, but without the splitting > and peepholing. patch.3 attached. > 4. see about fixing the back end to either split earlier or expand to > the desired patterns directly. I havn't included this part. If you agree with the proposal to add the transformation (a << (b&63) ==> a << b) in simplify_binary_operation_1, I will send out another patch about it. Thanks, Wei.
On Sun, Mar 17, 2013 at 12:15 AM, Wei Mi <wmi@google.com> wrote: > Hi, > > On Sat, Mar 16, 2013 at 3:48 PM, Steven Bosscher <stevenb.gcc@gmail.com> wrote: >> On Tue, Mar 12, 2013 at 8:18 AM, Wei Mi wrote: >>> For the motivational case, I need insn splitting to get the cost >>> right. insn splitting is not very intrusive. All I need is to call >>> split_insns func. >> >> It may not look very intrusive, but there's a lot happening in the >> back ground. You're creating a lot of new RTL, and then just throw it >> away again. You fake the compiler into thinking you're much deeper in >> the pipeline than you really are. You're assuming there are no >> side-effects other than that some insn gets split, but there are back >> ends where splitters may have side-effects. > > Ok, then I will remove the split_insns call. > >> >> Even though I've asked twice now, you still have not explained this >> motivational case, except to say that there is one. *What* are you >> trying to do, *what* is not happening without the splits, and *what* >> happens if you split. Only if you explain that in a lot more detail >> than "I have a motivational case" then we can look into what is a >> proper solution. > > :-). Sorry, I didn't say it clearly. The motivational case is the one > mentioned in the following posts (split_insns changes a << (b & 63) to > a << b). > http://gcc.gnu.org/ml/gcc/2013-01/msg00181.html > http://gcc.gnu.org/ml/gcc-patches/2013-02/msg01144.html > > If I remove the split_insns call and related cost estimation > adjustment, the fwprop 18-->22 and 18-->23 will punt, because fwprop > here looks like a reverse process of cse, the total cost after fwprop > change is increased. > > Def insn 18: > Use insn 23 > Use insn 22 > > If we include the split_insns cost estimation adjustment. > extra benefit by removing def insn 18 = 5 > change[0]: benefit = 0, verified - ok // The cost of insn 22 will > not change after fwprop + insn splitting. > change[1]: benefit = 0, verified - ok // The insn 23 is the same with insn 22 > Total benefit is 5, fwprop will go on. > > If we remove the split_insns cost estimation adjustment. > extra benefit by removing def insn 18 = 5 > change[0]: benefit = -4, verified - ok // The costs of insn 22 and > insn 23 will increase after fwprop. > change[1]: benefit = -4, verified - ok // The insn 23 is the same > with insn 22 > Total benefit is -3, fwprop will punt. > > How about adding the (a << (b&63) ==> a << b) transformation in > simplify_binary_operation_1, becuase (a << (b&63) ==> a << b) is a > kind of architecture specific expr simplification? Then fwprop could > do the propagation as I expect. > >> >> The problem with some of the splitters is that they exist to break up >> RTL from 'expand' to initially keep some pattern together to allow the >> code transformation passes to handle the pattern as one instruction. >> This made sense when RTL was the only intermediate representation and >> splitting too early would inhibit some optimizations. But I would >> expect most (if not all) such cases to be less relevant because of the >> GIMPLE middle-end work. The only splitters you can trigger are the >> pre-reload splitters (all the reload_completed conditions obviously >> can't trigger if you're splitting from fwprop). Perhaps those >> splitters can/should run earlier, or be made obsolete by expanding >> directly to the post-splitting insns. >> >> Unfortunately, it's not possible to tell for your case, because you >> haven't explained it yet... >> >> >>> So how about keep split_insns and remove peephole in the cost estimation func? >> >> I'd strongly oppose this. I do not believe this is necessary, and I >> think it's conceptually wrong. >> >> >>>> What happens if you propagate into an insn that uses the same register >>>> twice? Will the DU chains still be valid (I don't think that's >>>> guaranteed)? >>> >>> I think the DU chains still be valid. If propagate into the insn that >>> uses the same register twice, the two uses will be replaced when the >>> first use is seen (propagate_rtx_1 will propagate all the occurrances >>> of the same reg in the use insn). When the second use is seen, the >>> df_use and use insn in its insn_info are still available. >>> forward_propagate_into will early return after check reg_mentioned_p >>> (DF_REF_REG (use), parent) and find out no reg is used any more. >> >> With reg_mentioned_p you cannot verify that the DF_REF_LOC of USE is >> still valid. > > I think DF_REF_LOC of USE may be invalid if dangling rtx will be > recycled by garbage collection very soon (I don't know when GC will > happen). Although DF_REF_LOC of USE maybe invalid, the early return in > forward_propagate_into ensure it will not cause any correctness > problem. > >> >> In any case, returning to the RD problem for DU/UD chains is probably >> a good idea, now that RD is not such a hog anymore. In effect fwprop.c >> would return to what it looked like before the patch of r149010. > > I remove MD problem and use DU/UD instead. > >> >> As a way forward on all of this, I'd suggest the following steps, each >> with a separate patch: > > Thanks for the suggestion! > >> 1. replace the MD problem with RD again, and build full DU/UD chains. > > I include patch.1 attached. > >> 2. post all the recog changes separately, with minimum impact on the >> parts of the compiler you don't really change. (For apply_change_group >> you could even choose to overload it, or use a NUM argument with a >> default value -- not sure if default argument values are OK for GCC >> tho'.) > > patch.2 attached. > >> 3. implement propagation into multiple USEs, but without the splitting >> and peepholing. > > patch.3 attached. > >> 4. see about fixing the back end to either split earlier or expand to >> the desired patterns directly. > > I havn't included this part. If you agree with the proposal to add the > transformation (a << (b&63) ==> a << b) in > simplify_binary_operation_1, I will send out another patch about it Sounds like you need to look into using SHIFT_COUNT_TRUNCATED and TARGET_SHIFT_TRUNCATION_MASK which you could use in simplify_binary_operation_1 without it being target specific as it uses the target hooks to find out that :). Thanks, Andrew . > > Thanks, > Wei.
This is the patch to add the shift truncation in simplify_binary_operation_1. I add a new hook TARGET_SHIFT_COUNT_TRUNCATED which uses enum rtx_code to decide whether we can do shift truncation. I didn't use TARGET_SHIFT_TRUNCATION_MASK in simplify_binary_operation_1 because it uses the macro SHIFT_COUNT_TRUNCATED. If I change SHIFT_COUNT_TRUNCATED to targetm.shift_count_truncated in TARGET_SHIFT_TRUNCATION_MASK, I need to give TARGET_SHIFT_TRUNCATION_MASK a enum rtx_code param, which wasn't trivial to get at many places in existing code. patch.1 ~ patch.4 pass regression and bootstrap on x86_64-unknown-linux-gnu. Thanks, Wei. On Sun, Mar 17, 2013 at 12:15 AM, Wei Mi <wmi@google.com> wrote: > Hi, > > On Sat, Mar 16, 2013 at 3:48 PM, Steven Bosscher <stevenb.gcc@gmail.com> wrote: >> On Tue, Mar 12, 2013 at 8:18 AM, Wei Mi wrote: >>> For the motivational case, I need insn splitting to get the cost >>> right. insn splitting is not very intrusive. All I need is to call >>> split_insns func. >> >> It may not look very intrusive, but there's a lot happening in the >> back ground. You're creating a lot of new RTL, and then just throw it >> away again. You fake the compiler into thinking you're much deeper in >> the pipeline than you really are. You're assuming there are no >> side-effects other than that some insn gets split, but there are back >> ends where splitters may have side-effects. > > Ok, then I will remove the split_insns call. > >> >> Even though I've asked twice now, you still have not explained this >> motivational case, except to say that there is one. *What* are you >> trying to do, *what* is not happening without the splits, and *what* >> happens if you split. Only if you explain that in a lot more detail >> than "I have a motivational case" then we can look into what is a >> proper solution. > > :-). Sorry, I didn't say it clearly. The motivational case is the one > mentioned in the following posts (split_insns changes a << (b & 63) to > a << b). > http://gcc.gnu.org/ml/gcc/2013-01/msg00181.html > http://gcc.gnu.org/ml/gcc-patches/2013-02/msg01144.html > > If I remove the split_insns call and related cost estimation > adjustment, the fwprop 18-->22 and 18-->23 will punt, because fwprop > here looks like a reverse process of cse, the total cost after fwprop > change is increased. > > Def insn 18: > Use insn 23 > Use insn 22 > > If we include the split_insns cost estimation adjustment. > extra benefit by removing def insn 18 = 5 > change[0]: benefit = 0, verified - ok // The cost of insn 22 will > not change after fwprop + insn splitting. > change[1]: benefit = 0, verified - ok // The insn 23 is the same with insn 22 > Total benefit is 5, fwprop will go on. > > If we remove the split_insns cost estimation adjustment. > extra benefit by removing def insn 18 = 5 > change[0]: benefit = -4, verified - ok // The costs of insn 22 and > insn 23 will increase after fwprop. > change[1]: benefit = -4, verified - ok // The insn 23 is the same > with insn 22 > Total benefit is -3, fwprop will punt. > > How about adding the (a << (b&63) ==> a << b) transformation in > simplify_binary_operation_1, becuase (a << (b&63) ==> a << b) is a > kind of architecture specific expr simplification? Then fwprop could > do the propagation as I expect. > >> >> The problem with some of the splitters is that they exist to break up >> RTL from 'expand' to initially keep some pattern together to allow the >> code transformation passes to handle the pattern as one instruction. >> This made sense when RTL was the only intermediate representation and >> splitting too early would inhibit some optimizations. But I would >> expect most (if not all) such cases to be less relevant because of the >> GIMPLE middle-end work. The only splitters you can trigger are the >> pre-reload splitters (all the reload_completed conditions obviously >> can't trigger if you're splitting from fwprop). Perhaps those >> splitters can/should run earlier, or be made obsolete by expanding >> directly to the post-splitting insns. >> >> Unfortunately, it's not possible to tell for your case, because you >> haven't explained it yet... >> >> >>> So how about keep split_insns and remove peephole in the cost estimation func? >> >> I'd strongly oppose this. I do not believe this is necessary, and I >> think it's conceptually wrong. >> >> >>>> What happens if you propagate into an insn that uses the same register >>>> twice? Will the DU chains still be valid (I don't think that's >>>> guaranteed)? >>> >>> I think the DU chains still be valid. If propagate into the insn that >>> uses the same register twice, the two uses will be replaced when the >>> first use is seen (propagate_rtx_1 will propagate all the occurrances >>> of the same reg in the use insn). When the second use is seen, the >>> df_use and use insn in its insn_info are still available. >>> forward_propagate_into will early return after check reg_mentioned_p >>> (DF_REF_REG (use), parent) and find out no reg is used any more. >> >> With reg_mentioned_p you cannot verify that the DF_REF_LOC of USE is >> still valid. > > I think DF_REF_LOC of USE may be invalid if dangling rtx will be > recycled by garbage collection very soon (I don't know when GC will > happen). Although DF_REF_LOC of USE maybe invalid, the early return in > forward_propagate_into ensure it will not cause any correctness > problem. > >> >> In any case, returning to the RD problem for DU/UD chains is probably >> a good idea, now that RD is not such a hog anymore. In effect fwprop.c >> would return to what it looked like before the patch of r149010. > > I remove MD problem and use DU/UD instead. > >> >> As a way forward on all of this, I'd suggest the following steps, each >> with a separate patch: > > Thanks for the suggestion! > >> 1. replace the MD problem with RD again, and build full DU/UD chains. > > I include patch.1 attached. > >> 2. post all the recog changes separately, with minimum impact on the >> parts of the compiler you don't really change. (For apply_change_group >> you could even choose to overload it, or use a NUM argument with a >> default value -- not sure if default argument values are OK for GCC >> tho'.) > > patch.2 attached. > >> 3. implement propagation into multiple USEs, but without the splitting >> and peepholing. > > patch.3 attached. > >> 4. see about fixing the back end to either split earlier or expand to >> the desired patterns directly. > > I havn't included this part. If you agree with the proposal to add the > transformation (a << (b&63) ==> a << b) in > simplify_binary_operation_1, I will send out another patch about it. > > Thanks, > Wei.
Hi, On Sat, 2013-03-23 at 21:18 -0700, Wei Mi wrote: > This is the patch to add the shift truncation in > simplify_binary_operation_1. I add a new hook > TARGET_SHIFT_COUNT_TRUNCATED which uses enum rtx_code to decide > whether we can do shift truncation. I didn't use > TARGET_SHIFT_TRUNCATION_MASK in simplify_binary_operation_1 because it > uses the macro SHIFT_COUNT_TRUNCATED. If I change > SHIFT_COUNT_TRUNCATED to targetm.shift_count_truncated in > TARGET_SHIFT_TRUNCATION_MASK, I need to give > TARGET_SHIFT_TRUNCATION_MASK a enum rtx_code param, which wasn't > trivial to get at many places in existing code. > During 4.8 development there was a similar issue with the TARGET_CANONICALIZE_COMPARISON hook. As a temporary solution the rtx_code has been passed as int. I think the story started here: http://gcc.gnu.org/ml/gcc-patches/2012-12/msg00379.html The conclusion regarding rtx_code ... http://gcc.gnu.org/ml/gcc-patches/2012-12/msg00646.html Maybe this should be addressed first, since now there are at least two cases where it's in the way. Cheers, Oleg > patch.1 ~ patch.4 pass regression and bootstrap on x86_64-unknown-linux-gnu. > > Thanks, > Wei. > > On Sun, Mar 17, 2013 at 12:15 AM, Wei Mi <wmi@google.com> wrote: > > Hi, > > > > On Sat, Mar 16, 2013 at 3:48 PM, Steven Bosscher <stevenb.gcc@gmail.com> wrote: > >> On Tue, Mar 12, 2013 at 8:18 AM, Wei Mi wrote: > >>> For the motivational case, I need insn splitting to get the cost > >>> right. insn splitting is not very intrusive. All I need is to call > >>> split_insns func. > >> > >> It may not look very intrusive, but there's a lot happening in the > >> back ground. You're creating a lot of new RTL, and then just throw it > >> away again. You fake the compiler into thinking you're much deeper in > >> the pipeline than you really are. You're assuming there are no > >> side-effects other than that some insn gets split, but there are back > >> ends where splitters may have side-effects. > > > > Ok, then I will remove the split_insns call. > > > >> > >> Even though I've asked twice now, you still have not explained this > >> motivational case, except to say that there is one. *What* are you > >> trying to do, *what* is not happening without the splits, and *what* > >> happens if you split. Only if you explain that in a lot more detail > >> than "I have a motivational case" then we can look into what is a > >> proper solution. > > > > :-). Sorry, I didn't say it clearly. The motivational case is the one > > mentioned in the following posts (split_insns changes a << (b & 63) to > > a << b). > > http://gcc.gnu.org/ml/gcc/2013-01/msg00181.html > > http://gcc.gnu.org/ml/gcc-patches/2013-02/msg01144.html > > > > If I remove the split_insns call and related cost estimation > > adjustment, the fwprop 18-->22 and 18-->23 will punt, because fwprop > > here looks like a reverse process of cse, the total cost after fwprop > > change is increased. > > > > Def insn 18: > > Use insn 23 > > Use insn 22 > > > > If we include the split_insns cost estimation adjustment. > > extra benefit by removing def insn 18 = 5 > > change[0]: benefit = 0, verified - ok // The cost of insn 22 will > > not change after fwprop + insn splitting. > > change[1]: benefit = 0, verified - ok // The insn 23 is the same with insn 22 > > Total benefit is 5, fwprop will go on. > > > > If we remove the split_insns cost estimation adjustment. > > extra benefit by removing def insn 18 = 5 > > change[0]: benefit = -4, verified - ok // The costs of insn 22 and > > insn 23 will increase after fwprop. > > change[1]: benefit = -4, verified - ok // The insn 23 is the same > > with insn 22 > > Total benefit is -3, fwprop will punt. > > > > How about adding the (a << (b&63) ==> a << b) transformation in > > simplify_binary_operation_1, becuase (a << (b&63) ==> a << b) is a > > kind of architecture specific expr simplification? Then fwprop could > > do the propagation as I expect. > > > >> > >> The problem with some of the splitters is that they exist to break up > >> RTL from 'expand' to initially keep some pattern together to allow the > >> code transformation passes to handle the pattern as one instruction. > >> This made sense when RTL was the only intermediate representation and > >> splitting too early would inhibit some optimizations. But I would > >> expect most (if not all) such cases to be less relevant because of the > >> GIMPLE middle-end work. The only splitters you can trigger are the > >> pre-reload splitters (all the reload_completed conditions obviously > >> can't trigger if you're splitting from fwprop). Perhaps those > >> splitters can/should run earlier, or be made obsolete by expanding > >> directly to the post-splitting insns. > >> > >> Unfortunately, it's not possible to tell for your case, because you > >> haven't explained it yet... > >> > >> > >>> So how about keep split_insns and remove peephole in the cost estimation func? > >> > >> I'd strongly oppose this. I do not believe this is necessary, and I > >> think it's conceptually wrong. > >> > >> > >>>> What happens if you propagate into an insn that uses the same register > >>>> twice? Will the DU chains still be valid (I don't think that's > >>>> guaranteed)? > >>> > >>> I think the DU chains still be valid. If propagate into the insn that > >>> uses the same register twice, the two uses will be replaced when the > >>> first use is seen (propagate_rtx_1 will propagate all the occurrances > >>> of the same reg in the use insn). When the second use is seen, the > >>> df_use and use insn in its insn_info are still available. > >>> forward_propagate_into will early return after check reg_mentioned_p > >>> (DF_REF_REG (use), parent) and find out no reg is used any more. > >> > >> With reg_mentioned_p you cannot verify that the DF_REF_LOC of USE is > >> still valid. > > > > I think DF_REF_LOC of USE may be invalid if dangling rtx will be > > recycled by garbage collection very soon (I don't know when GC will > > happen). Although DF_REF_LOC of USE maybe invalid, the early return in > > forward_propagate_into ensure it will not cause any correctness > > problem. > > > >> > >> In any case, returning to the RD problem for DU/UD chains is probably > >> a good idea, now that RD is not such a hog anymore. In effect fwprop.c > >> would return to what it looked like before the patch of r149010. > > > > I remove MD problem and use DU/UD instead. > > > >> > >> As a way forward on all of this, I'd suggest the following steps, each > >> with a separate patch: > > > > Thanks for the suggestion! > > > >> 1. replace the MD problem with RD again, and build full DU/UD chains. > > > > I include patch.1 attached. > > > >> 2. post all the recog changes separately, with minimum impact on the > >> parts of the compiler you don't really change. (For apply_change_group > >> you could even choose to overload it, or use a NUM argument with a > >> default value -- not sure if default argument values are OK for GCC > >> tho'.) > > > > patch.2 attached. > > > >> 3. implement propagation into multiple USEs, but without the splitting > >> and peepholing. > > > > patch.3 attached. > > > >> 4. see about fixing the back end to either split earlier or expand to > >> the desired patterns directly. > > > > I havn't included this part. If you agree with the proposal to add the > > transformation (a << (b&63) ==> a << b) in > > simplify_binary_operation_1, I will send out another patch about it. > > > > Thanks, > > Wei.
On Sun, Mar 24, 2013 at 5:18 AM, Wei Mi <wmi@google.com> wrote: > This is the patch to add the shift truncation in > simplify_binary_operation_1. I add a new hook > TARGET_SHIFT_COUNT_TRUNCATED which uses enum rtx_code to decide > whether we can do shift truncation. I didn't use > TARGET_SHIFT_TRUNCATION_MASK in simplify_binary_operation_1 because it > uses the macro SHIFT_COUNT_TRUNCATED. If I change > SHIFT_COUNT_TRUNCATED to targetm.shift_count_truncated in > TARGET_SHIFT_TRUNCATION_MASK, I need to give > TARGET_SHIFT_TRUNCATION_MASK a enum rtx_code param, which wasn't > trivial to get at many places in existing code. > > patch.1 ~ patch.4 pass regression and bootstrap on x86_64-unknown-linux-gnu. Doing this might prove dangerous in case some pass may later decide to use an instruction that behaves in different ways. Consider tem = 1<< (n & 255); // count truncated x = y & tem; // bittest instruction bit nr _not_ truncated so if tem is expanded to use a shift instruction which truncates the shift count the explicit and is dropped. If later combine comes around and combines the bit-test to use the bittest instruction which does not implicitely truncate the cound you have generated wrong-code. So we need to make sure any explicit truncation originally in place is kept in the RTL - which means SHIFT_COUNT_TRUNCATED should not exist at all, but instead there would be two patterns for shifts with implicit truncation - one involving the truncation (canonicalized to bitwise and) and one not involving the truncation. Richard. > Thanks, > Wei. > > On Sun, Mar 17, 2013 at 12:15 AM, Wei Mi <wmi@google.com> wrote: >> Hi, >> >> On Sat, Mar 16, 2013 at 3:48 PM, Steven Bosscher <stevenb.gcc@gmail.com> wrote: >>> On Tue, Mar 12, 2013 at 8:18 AM, Wei Mi wrote: >>>> For the motivational case, I need insn splitting to get the cost >>>> right. insn splitting is not very intrusive. All I need is to call >>>> split_insns func. >>> >>> It may not look very intrusive, but there's a lot happening in the >>> back ground. You're creating a lot of new RTL, and then just throw it >>> away again. You fake the compiler into thinking you're much deeper in >>> the pipeline than you really are. You're assuming there are no >>> side-effects other than that some insn gets split, but there are back >>> ends where splitters may have side-effects. >> >> Ok, then I will remove the split_insns call. >> >>> >>> Even though I've asked twice now, you still have not explained this >>> motivational case, except to say that there is one. *What* are you >>> trying to do, *what* is not happening without the splits, and *what* >>> happens if you split. Only if you explain that in a lot more detail >>> than "I have a motivational case" then we can look into what is a >>> proper solution. >> >> :-). Sorry, I didn't say it clearly. The motivational case is the one >> mentioned in the following posts (split_insns changes a << (b & 63) to >> a << b). >> http://gcc.gnu.org/ml/gcc/2013-01/msg00181.html >> http://gcc.gnu.org/ml/gcc-patches/2013-02/msg01144.html >> >> If I remove the split_insns call and related cost estimation >> adjustment, the fwprop 18-->22 and 18-->23 will punt, because fwprop >> here looks like a reverse process of cse, the total cost after fwprop >> change is increased. >> >> Def insn 18: >> Use insn 23 >> Use insn 22 >> >> If we include the split_insns cost estimation adjustment. >> extra benefit by removing def insn 18 = 5 >> change[0]: benefit = 0, verified - ok // The cost of insn 22 will >> not change after fwprop + insn splitting. >> change[1]: benefit = 0, verified - ok // The insn 23 is the same with insn 22 >> Total benefit is 5, fwprop will go on. >> >> If we remove the split_insns cost estimation adjustment. >> extra benefit by removing def insn 18 = 5 >> change[0]: benefit = -4, verified - ok // The costs of insn 22 and >> insn 23 will increase after fwprop. >> change[1]: benefit = -4, verified - ok // The insn 23 is the same >> with insn 22 >> Total benefit is -3, fwprop will punt. >> >> How about adding the (a << (b&63) ==> a << b) transformation in >> simplify_binary_operation_1, becuase (a << (b&63) ==> a << b) is a >> kind of architecture specific expr simplification? Then fwprop could >> do the propagation as I expect. >> >>> >>> The problem with some of the splitters is that they exist to break up >>> RTL from 'expand' to initially keep some pattern together to allow the >>> code transformation passes to handle the pattern as one instruction. >>> This made sense when RTL was the only intermediate representation and >>> splitting too early would inhibit some optimizations. But I would >>> expect most (if not all) such cases to be less relevant because of the >>> GIMPLE middle-end work. The only splitters you can trigger are the >>> pre-reload splitters (all the reload_completed conditions obviously >>> can't trigger if you're splitting from fwprop). Perhaps those >>> splitters can/should run earlier, or be made obsolete by expanding >>> directly to the post-splitting insns. >>> >>> Unfortunately, it's not possible to tell for your case, because you >>> haven't explained it yet... >>> >>> >>>> So how about keep split_insns and remove peephole in the cost estimation func? >>> >>> I'd strongly oppose this. I do not believe this is necessary, and I >>> think it's conceptually wrong. >>> >>> >>>>> What happens if you propagate into an insn that uses the same register >>>>> twice? Will the DU chains still be valid (I don't think that's >>>>> guaranteed)? >>>> >>>> I think the DU chains still be valid. If propagate into the insn that >>>> uses the same register twice, the two uses will be replaced when the >>>> first use is seen (propagate_rtx_1 will propagate all the occurrances >>>> of the same reg in the use insn). When the second use is seen, the >>>> df_use and use insn in its insn_info are still available. >>>> forward_propagate_into will early return after check reg_mentioned_p >>>> (DF_REF_REG (use), parent) and find out no reg is used any more. >>> >>> With reg_mentioned_p you cannot verify that the DF_REF_LOC of USE is >>> still valid. >> >> I think DF_REF_LOC of USE may be invalid if dangling rtx will be >> recycled by garbage collection very soon (I don't know when GC will >> happen). Although DF_REF_LOC of USE maybe invalid, the early return in >> forward_propagate_into ensure it will not cause any correctness >> problem. >> >>> >>> In any case, returning to the RD problem for DU/UD chains is probably >>> a good idea, now that RD is not such a hog anymore. In effect fwprop.c >>> would return to what it looked like before the patch of r149010. >> >> I remove MD problem and use DU/UD instead. >> >>> >>> As a way forward on all of this, I'd suggest the following steps, each >>> with a separate patch: >> >> Thanks for the suggestion! >> >>> 1. replace the MD problem with RD again, and build full DU/UD chains. >> >> I include patch.1 attached. >> >>> 2. post all the recog changes separately, with minimum impact on the >>> parts of the compiler you don't really change. (For apply_change_group >>> you could even choose to overload it, or use a NUM argument with a >>> default value -- not sure if default argument values are OK for GCC >>> tho'.) >> >> patch.2 attached. >> >>> 3. implement propagation into multiple USEs, but without the splitting >>> and peepholing. >> >> patch.3 attached. >> >>> 4. see about fixing the back end to either split earlier or expand to >>> the desired patterns directly. >> >> I havn't included this part. If you agree with the proposal to add the >> transformation (a << (b&63) ==> a << b) in >> simplify_binary_operation_1, I will send out another patch about it. >> >> Thanks, >> Wei.
On Mon, Mar 25, 2013 at 2:35 AM, Richard Biener <richard.guenther@gmail.com> wrote: > On Sun, Mar 24, 2013 at 5:18 AM, Wei Mi <wmi@google.com> wrote: >> This is the patch to add the shift truncation in >> simplify_binary_operation_1. I add a new hook >> TARGET_SHIFT_COUNT_TRUNCATED which uses enum rtx_code to decide >> whether we can do shift truncation. I didn't use >> TARGET_SHIFT_TRUNCATION_MASK in simplify_binary_operation_1 because it >> uses the macro SHIFT_COUNT_TRUNCATED. If I change >> SHIFT_COUNT_TRUNCATED to targetm.shift_count_truncated in >> TARGET_SHIFT_TRUNCATION_MASK, I need to give >> TARGET_SHIFT_TRUNCATION_MASK a enum rtx_code param, which wasn't >> trivial to get at many places in existing code. >> >> patch.1 ~ patch.4 pass regression and bootstrap on x86_64-unknown-linux-gnu. > > Doing this might prove dangerous in case some pass may later decide > to use an instruction that behaves in different ways. Consider > > tem = 1<< (n & 255); // count truncated > x = y & tem; // bittest instruction bit nr _not_ truncated > > so if tem is expanded to use a shift instruction which truncates the shift > count the explicit and is dropped. If later combine comes around and > combines the bit-test to use the bittest instruction which does not > implicitely truncate the cound you have generated wrong-code. > So it means the existing truncation pattern defined in insn split is also incorrect because the truncated shift may be combined into a bit test pattern? // The following define_insn_and_split will do shift truncation. (define_insn_and_split "*<shift_insn><mode>3_mask" [(set (match_operand:SWI48 0 "nonimmediate_operand" "=rm") (any_shiftrt:SWI48 (match_operand:SWI48 1 "nonimmediate_operand" "0") (subreg:QI (and:SI (match_operand:SI 2 "nonimmediate_operand" "c") (match_operand:SI 3 "const_int_operand" "n")) 0))) (clobber (reg:CC FLAGS_REG))] "ix86_binary_operator_ok (<CODE>, <MODE>mode, operands) && (INTVAL (operands[3]) & (GET_MODE_BITSIZE (<MODE>mode)-1)) == GET_MODE_BITSIZE (<MODE>mode)-1" "#" "&& 1" [(parallel [(set (match_dup 0) (any_shiftrt:SWI48 (match_dup 1) (match_dup 2))) (clobber (reg:CC FLAGS_REG))])] { if (can_create_pseudo_p ()) operands [2] = force_reg (SImode, operands[2]); operands[2] = simplify_gen_subreg (QImode, operands[2], SImode, 0); } [(set_attr "type" "ishift") (set_attr "mode" "<MODE>")]) > So we need to make sure any explicit truncation originally in place > is kept in the RTL - which means SHIFT_COUNT_TRUNCATED should > not exist at all, but instead there would be two patterns for shifts > with implicit truncation - one involving the truncation (canonicalized to > bitwise and) and one not involving the truncation. > > Richard. > I am trying to figure out a way not to lose the opportunity when shift truncation is not combined in a bit test pattern. Can we keep the explicit truncation in RTL, but generate truncation code in assembly? Then only shift truncation which not combined in a bit test pattershift truncationn will happen. (define_insn "*<shift_insn_and><mode>" [(set (match_operand:SWI48 0 "nonimmediate_operand" "=rm") (any_shiftrt:SWI48 (match_operand:SWI48 1 "nonimmediate_operand" "0") (subreg:QI (and:SI (match_operand:SI 2 "nonimmediate_operand" "c") (match_operand:SI 3 "const_int_operand" "n")) 0))) (clobber (reg:CC FLAGS_REG))] "ix86_binary_operator_ok (<CODE>, <MODE>mode, operands)" { if ((INTVAL (operands[3]) & (GET_MODE_BITSIZE (<MODE>mode)-1)) == GET_MODE_BITSIZE (<MODE>mode)-1) return "and\t{%3, %2|%2, %3}\n\r shift\t{%b2, %0|%0, %b2}"; else "shift\t{%2, %0|%0, %2}"; } Thanks, Wei.
> I am trying to figure out a way not to lose the opportunity when shift > truncation is not combined in a bit test pattern. Can we keep the > explicit truncation in RTL, but generate truncation code in assembly? > Then only shift truncation which not combined in a bit test > pattershift truncationn will happen. > > (define_insn "*<shift_insn_and><mode>" > [(set (match_operand:SWI48 0 "nonimmediate_operand" "=rm") > (any_shiftrt:SWI48 > (match_operand:SWI48 1 "nonimmediate_operand" "0") > (subreg:QI > (and:SI > (match_operand:SI 2 "nonimmediate_operand" "c") > (match_operand:SI 3 "const_int_operand" "n")) 0))) > (clobber (reg:CC FLAGS_REG))] > "ix86_binary_operator_ok (<CODE>, <MODE>mode, operands)" > { > if ((INTVAL (operands[3]) & (GET_MODE_BITSIZE (<MODE>mode)-1)) > == GET_MODE_BITSIZE (<MODE>mode)-1) > return "and\t{%3, %2|%2, %3}\n\r shift\t{%b2, %0|%0, %b2}"; > else > "shift\t{%2, %0|%0, %2}"; > } Sorry, rectify a mistake: { if ((INTVAL (operands[3]) & (GET_MODE_BITSIZE (<MODE>mode)-1)) == GET_MODE_BITSIZE (<MODE>mode)-1) return "shift\t{%2, %0|%0, %2}"; else return "and\t{%3, %2|%2, %3}\n\r shift\t{%b2, %0|%0, %b2}"; } Thanks, Wei.
On Mon, Mar 25, 2013 at 6:33 PM, Wei Mi <wmi@google.com> wrote: >> I am trying to figure out a way not to lose the opportunity when shift >> truncation is not combined in a bit test pattern. Can we keep the >> explicit truncation in RTL, but generate truncation code in assembly? >> Then only shift truncation which not combined in a bit test >> pattershift truncationn will happen. >> >> (define_insn "*<shift_insn_and><mode>" >> [(set (match_operand:SWI48 0 "nonimmediate_operand" "=rm") >> (any_shiftrt:SWI48 >> (match_operand:SWI48 1 "nonimmediate_operand" "0") >> (subreg:QI >> (and:SI >> (match_operand:SI 2 "nonimmediate_operand" "c") >> (match_operand:SI 3 "const_int_operand" "n")) 0))) >> (clobber (reg:CC FLAGS_REG))] >> "ix86_binary_operator_ok (<CODE>, <MODE>mode, operands)" >> { >> if ((INTVAL (operands[3]) & (GET_MODE_BITSIZE (<MODE>mode)-1)) >> == GET_MODE_BITSIZE (<MODE>mode)-1) >> return "and\t{%3, %2|%2, %3}\n\r shift\t{%b2, %0|%0, %b2}"; >> else >> "shift\t{%2, %0|%0, %2}"; >> } > > Sorry, rectify a mistake: > > { > if ((INTVAL (operands[3]) & (GET_MODE_BITSIZE (<MODE>mode)-1)) > == GET_MODE_BITSIZE (<MODE>mode)-1) > return "shift\t{%2, %0|%0, %2}"; > else > return "and\t{%3, %2|%2, %3}\n\r shift\t{%b2, %0|%0, %b2}"; > } I'm not sure the existing patterns are wrong because SHIFT_COUNT_TRUNCATED is false for x86 AFAIK, exactly because of the bit-test vs. shift instruction differences. So there is no inconsistency. The i386 backend seems to try to follow my suggestion as if SHIFT_COUNT_TRUNCATED didn't exist (well, it's false, so it technically doesn't exist for i386) and recognizes the shift with truncate with the *<shift_insn><mode>3_mask splitter. But I'm not sure why it bothers to do it with a splitter instead of just with a define_insn? Because the split code, [(parallel [(set (match_dup 0) (any_shiftrt:SWI48 (match_dup 1) (match_dup 2))) (clobber (reg:CC FLAGS_REG))])] is wrong and could be combined into a bit-test instruction. No? That is, why not have define_insn variants for shift instructions with explicit truncation? Richard. > Thanks, > Wei.
On Tue, Mar 26, 2013 at 10:14 AM, Richard Biener <richard.guenther@gmail.com> wrote: >>> I am trying to figure out a way not to lose the opportunity when shift >>> truncation is not combined in a bit test pattern. Can we keep the >>> explicit truncation in RTL, but generate truncation code in assembly? >>> Then only shift truncation which not combined in a bit test >>> pattershift truncationn will happen. >>> >>> (define_insn "*<shift_insn_and><mode>" >>> [(set (match_operand:SWI48 0 "nonimmediate_operand" "=rm") >>> (any_shiftrt:SWI48 >>> (match_operand:SWI48 1 "nonimmediate_operand" "0") >>> (subreg:QI >>> (and:SI >>> (match_operand:SI 2 "nonimmediate_operand" "c") >>> (match_operand:SI 3 "const_int_operand" "n")) 0))) >>> (clobber (reg:CC FLAGS_REG))] >>> "ix86_binary_operator_ok (<CODE>, <MODE>mode, operands)" >>> { >>> if ((INTVAL (operands[3]) & (GET_MODE_BITSIZE (<MODE>mode)-1)) >>> == GET_MODE_BITSIZE (<MODE>mode)-1) >>> return "and\t{%3, %2|%2, %3}\n\r shift\t{%b2, %0|%0, %b2}"; >>> else >>> "shift\t{%2, %0|%0, %2}"; >>> } >> >> Sorry, rectify a mistake: >> >> { >> if ((INTVAL (operands[3]) & (GET_MODE_BITSIZE (<MODE>mode)-1)) >> == GET_MODE_BITSIZE (<MODE>mode)-1) >> return "shift\t{%2, %0|%0, %2}"; >> else >> return "and\t{%3, %2|%2, %3}\n\r shift\t{%b2, %0|%0, %b2}"; >> } > > I'm not sure the existing patterns are wrong because SHIFT_COUNT_TRUNCATED > is false for x86 AFAIK, exactly because of the bit-test vs. shift instruction > differences. So there is no inconsistency. The i386 backend seems to > try to follow my suggestion as if SHIFT_COUNT_TRUNCATED didn't exist > (well, it's false, so it technically doesn't exist for i386) and recognizes > the shift with truncate with the *<shift_insn><mode>3_mask splitter. > But I'm not sure why it bothers to do it with a splitter instead of just > with a define_insn? Because the split code, > > [(parallel [(set (match_dup 0) > (any_shiftrt:SWI48 (match_dup 1) (match_dup 2))) > (clobber (reg:CC FLAGS_REG))])] > > is wrong and could be combined into a bit-test instruction. No? > > That is, why not have define_insn variants for shift instructions with > explicit truncation? You are right, the split is harmful in this case. It looks to me, that explicit truncation can be added to split patterns in the most elegant way using proposed "define_subst" infrastructure. Uros.
I am not familiar how to use define_subst, so I write a patch that changes define_insn_and_split to define_insn. bootstrapped and regression tested on x86_64-unknown-linux-gnu. A question is: after that change, Is there anyway I can make targetm.rtx_costs() aware about the truncation, .i.e the cost is only a "shift" instead of "shift + and". Thanks, Wei. On Tue, Mar 26, 2013 at 11:23 AM, Uros Bizjak <ubizjak@gmail.com> wrote: > On Tue, Mar 26, 2013 at 10:14 AM, Richard Biener > <richard.guenther@gmail.com> wrote: >>>> I am trying to figure out a way not to lose the opportunity when shift >>>> truncation is not combined in a bit test pattern. Can we keep the >>>> explicit truncation in RTL, but generate truncation code in assembly? >>>> Then only shift truncation which not combined in a bit test >>>> pattershift truncationn will happen. >>>> >>>> (define_insn "*<shift_insn_and><mode>" >>>> [(set (match_operand:SWI48 0 "nonimmediate_operand" "=rm") >>>> (any_shiftrt:SWI48 >>>> (match_operand:SWI48 1 "nonimmediate_operand" "0") >>>> (subreg:QI >>>> (and:SI >>>> (match_operand:SI 2 "nonimmediate_operand" "c") >>>> (match_operand:SI 3 "const_int_operand" "n")) 0))) >>>> (clobber (reg:CC FLAGS_REG))] >>>> "ix86_binary_operator_ok (<CODE>, <MODE>mode, operands)" >>>> { >>>> if ((INTVAL (operands[3]) & (GET_MODE_BITSIZE (<MODE>mode)-1)) >>>> == GET_MODE_BITSIZE (<MODE>mode)-1) >>>> return "and\t{%3, %2|%2, %3}\n\r shift\t{%b2, %0|%0, %b2}"; >>>> else >>>> "shift\t{%2, %0|%0, %2}"; >>>> } >>> >>> Sorry, rectify a mistake: >>> >>> { >>> if ((INTVAL (operands[3]) & (GET_MODE_BITSIZE (<MODE>mode)-1)) >>> == GET_MODE_BITSIZE (<MODE>mode)-1) >>> return "shift\t{%2, %0|%0, %2}"; >>> else >>> return "and\t{%3, %2|%2, %3}\n\r shift\t{%b2, %0|%0, %b2}"; >>> } >> >> I'm not sure the existing patterns are wrong because SHIFT_COUNT_TRUNCATED >> is false for x86 AFAIK, exactly because of the bit-test vs. shift instruction >> differences. So there is no inconsistency. The i386 backend seems to >> try to follow my suggestion as if SHIFT_COUNT_TRUNCATED didn't exist >> (well, it's false, so it technically doesn't exist for i386) and recognizes >> the shift with truncate with the *<shift_insn><mode>3_mask splitter. >> But I'm not sure why it bothers to do it with a splitter instead of just >> with a define_insn? Because the split code, >> >> [(parallel [(set (match_dup 0) >> (any_shiftrt:SWI48 (match_dup 1) (match_dup 2))) >> (clobber (reg:CC FLAGS_REG))])] >> >> is wrong and could be combined into a bit-test instruction. No? >> >> That is, why not have define_insn variants for shift instructions with >> explicit truncation? > > You are right, the split is harmful in this case. > > It looks to me, that explicit truncation can be added to split > patterns in the most elegant way using proposed "define_subst" > infrastructure. > > Uros.
On Thu, Mar 28, 2013 at 5:34 AM, Wei Mi <wmi@google.com> wrote: > I am not familiar how to use define_subst, so I write a patch that > changes define_insn_and_split to define_insn. bootstrapped and > regression tested on x86_64-unknown-linux-gnu. > > A question is: after that change, Is there anyway I can make > targetm.rtx_costs() aware about the truncation, .i.e the cost is only > a "shift" instead of "shift + and". Please also change all operand 2 predicates to "register_operand". 2013-03-27 Wei Mi <wmi@google.com> * config/i386/i386.md: Do shift truncation in define_insn instead of define_insn_and_split. Please write ChangeLog as: * config/i386/i386.md (*ashl<mode>3_mask): Rewrite as define_insn. Truncate operand 2 using %b asm operand modifier. (*<shift_insn><mode>3_mask): Ditto. (*<rotate_insn><mode>3_mask): Ditto. OK for mainline and all release branches with these changes. Thanks, Uros.
I attached the patch.4 based on r197308. r197308 changes shift-and type truncation from define_insn_and_split to define_insn. patch.4 changes ix86_rtx_costs for shift-and type rtx to get the correct cost for the result after the shift-and truncation. With the patch.1 ~ patch.4, fwprop extension could handle the motivational case 1.c attached by removing all the redundent "x & 63" operations. patch.1~patch.4 regression and bootstrap ok on x86_64-unknown-linux-gnu. Is it ok for trunk? Thanks, Wei. On Sun, Mar 17, 2013 at 12:15 AM, Wei Mi <wmi@google.com> wrote: > Hi, > > On Sat, Mar 16, 2013 at 3:48 PM, Steven Bosscher <stevenb.gcc@gmail.com> wrote: >> On Tue, Mar 12, 2013 at 8:18 AM, Wei Mi wrote: >>> For the motivational case, I need insn splitting to get the cost >>> right. insn splitting is not very intrusive. All I need is to call >>> split_insns func. >> >> It may not look very intrusive, but there's a lot happening in the >> back ground. You're creating a lot of new RTL, and then just throw it >> away again. You fake the compiler into thinking you're much deeper in >> the pipeline than you really are. You're assuming there are no >> side-effects other than that some insn gets split, but there are back >> ends where splitters may have side-effects. > > Ok, then I will remove the split_insns call. > >> >> Even though I've asked twice now, you still have not explained this >> motivational case, except to say that there is one. *What* are you >> trying to do, *what* is not happening without the splits, and *what* >> happens if you split. Only if you explain that in a lot more detail >> than "I have a motivational case" then we can look into what is a >> proper solution. > > :-). Sorry, I didn't say it clearly. The motivational case is the one > mentioned in the following posts (split_insns changes a << (b & 63) to > a << b). > http://gcc.gnu.org/ml/gcc/2013-01/msg00181.html > http://gcc.gnu.org/ml/gcc-patches/2013-02/msg01144.html > > If I remove the split_insns call and related cost estimation > adjustment, the fwprop 18-->22 and 18-->23 will punt, because fwprop > here looks like a reverse process of cse, the total cost after fwprop > change is increased. > > Def insn 18: > Use insn 23 > Use insn 22 > > If we include the split_insns cost estimation adjustment. > extra benefit by removing def insn 18 = 5 > change[0]: benefit = 0, verified - ok // The cost of insn 22 will > not change after fwprop + insn splitting. > change[1]: benefit = 0, verified - ok // The insn 23 is the same with insn 22 > Total benefit is 5, fwprop will go on. > > If we remove the split_insns cost estimation adjustment. > extra benefit by removing def insn 18 = 5 > change[0]: benefit = -4, verified - ok // The costs of insn 22 and > insn 23 will increase after fwprop. > change[1]: benefit = -4, verified - ok // The insn 23 is the same > with insn 22 > Total benefit is -3, fwprop will punt. > > How about adding the (a << (b&63) ==> a << b) transformation in > simplify_binary_operation_1, becuase (a << (b&63) ==> a << b) is a > kind of architecture specific expr simplification? Then fwprop could > do the propagation as I expect. > >> >> The problem with some of the splitters is that they exist to break up >> RTL from 'expand' to initially keep some pattern together to allow the >> code transformation passes to handle the pattern as one instruction. >> This made sense when RTL was the only intermediate representation and >> splitting too early would inhibit some optimizations. But I would >> expect most (if not all) such cases to be less relevant because of the >> GIMPLE middle-end work. The only splitters you can trigger are the >> pre-reload splitters (all the reload_completed conditions obviously >> can't trigger if you're splitting from fwprop). Perhaps those >> splitters can/should run earlier, or be made obsolete by expanding >> directly to the post-splitting insns. >> >> Unfortunately, it's not possible to tell for your case, because you >> haven't explained it yet... >> >> >>> So how about keep split_insns and remove peephole in the cost estimation func? >> >> I'd strongly oppose this. I do not believe this is necessary, and I >> think it's conceptually wrong. >> >> >>>> What happens if you propagate into an insn that uses the same register >>>> twice? Will the DU chains still be valid (I don't think that's >>>> guaranteed)? >>> >>> I think the DU chains still be valid. If propagate into the insn that >>> uses the same register twice, the two uses will be replaced when the >>> first use is seen (propagate_rtx_1 will propagate all the occurrances >>> of the same reg in the use insn). When the second use is seen, the >>> df_use and use insn in its insn_info are still available. >>> forward_propagate_into will early return after check reg_mentioned_p >>> (DF_REF_REG (use), parent) and find out no reg is used any more. >> >> With reg_mentioned_p you cannot verify that the DF_REF_LOC of USE is >> still valid. > > I think DF_REF_LOC of USE may be invalid if dangling rtx will be > recycled by garbage collection very soon (I don't know when GC will > happen). Although DF_REF_LOC of USE maybe invalid, the early return in > forward_propagate_into ensure it will not cause any correctness > problem. > >> >> In any case, returning to the RD problem for DU/UD chains is probably >> a good idea, now that RD is not such a hog anymore. In effect fwprop.c >> would return to what it looked like before the patch of r149010. > > I remove MD problem and use DU/UD instead. > >> >> As a way forward on all of this, I'd suggest the following steps, each >> with a separate patch: > > Thanks for the suggestion! > >> 1. replace the MD problem with RD again, and build full DU/UD chains. > > I include patch.1 attached. > >> 2. post all the recog changes separately, with minimum impact on the >> parts of the compiler you don't really change. (For apply_change_group >> you could even choose to overload it, or use a NUM argument with a >> default value -- not sure if default argument values are OK for GCC >> tho'.) > > patch.2 attached. > >> 3. implement propagation into multiple USEs, but without the splitting >> and peepholing. > > patch.3 attached. > >> 4. see about fixing the back end to either split earlier or expand to >> the desired patterns directly. > > I havn't included this part. If you agree with the proposal to add the > transformation (a << (b&63) ==> a << b) in > simplify_binary_operation_1, I will send out another patch about it. > > Thanks, > Wei.
1.c attached. On Mon, Apr 1, 2013 at 10:43 PM, Wei Mi <wmi@google.com> wrote: > I attached the patch.4 based on r197308. r197308 changes shift-and > type truncation from define_insn_and_split to define_insn. patch.4 > changes ix86_rtx_costs for shift-and type rtx to get the correct cost > for the result after the shift-and truncation. > > With the patch.1 ~ patch.4, fwprop extension could handle the > motivational case 1.c attached by removing all the redundent "x & 63" > operations. > > patch.1~patch.4 regression and bootstrap ok on > x86_64-unknown-linux-gnu. Is it ok for trunk? > > Thanks, > Wei. > > On Sun, Mar 17, 2013 at 12:15 AM, Wei Mi <wmi@google.com> wrote: >> Hi, >> >> On Sat, Mar 16, 2013 at 3:48 PM, Steven Bosscher <stevenb.gcc@gmail.com> wrote: >>> On Tue, Mar 12, 2013 at 8:18 AM, Wei Mi wrote: >>>> For the motivational case, I need insn splitting to get the cost >>>> right. insn splitting is not very intrusive. All I need is to call >>>> split_insns func. >>> >>> It may not look very intrusive, but there's a lot happening in the >>> back ground. You're creating a lot of new RTL, and then just throw it >>> away again. You fake the compiler into thinking you're much deeper in >>> the pipeline than you really are. You're assuming there are no >>> side-effects other than that some insn gets split, but there are back >>> ends where splitters may have side-effects. >> >> Ok, then I will remove the split_insns call. >> >>> >>> Even though I've asked twice now, you still have not explained this >>> motivational case, except to say that there is one. *What* are you >>> trying to do, *what* is not happening without the splits, and *what* >>> happens if you split. Only if you explain that in a lot more detail >>> than "I have a motivational case" then we can look into what is a >>> proper solution. >> >> :-). Sorry, I didn't say it clearly. The motivational case is the one >> mentioned in the following posts (split_insns changes a << (b & 63) to >> a << b). >> http://gcc.gnu.org/ml/gcc/2013-01/msg00181.html >> http://gcc.gnu.org/ml/gcc-patches/2013-02/msg01144.html >> >> If I remove the split_insns call and related cost estimation >> adjustment, the fwprop 18-->22 and 18-->23 will punt, because fwprop >> here looks like a reverse process of cse, the total cost after fwprop >> change is increased. >> >> Def insn 18: >> Use insn 23 >> Use insn 22 >> >> If we include the split_insns cost estimation adjustment. >> extra benefit by removing def insn 18 = 5 >> change[0]: benefit = 0, verified - ok // The cost of insn 22 will >> not change after fwprop + insn splitting. >> change[1]: benefit = 0, verified - ok // The insn 23 is the same with insn 22 >> Total benefit is 5, fwprop will go on. >> >> If we remove the split_insns cost estimation adjustment. >> extra benefit by removing def insn 18 = 5 >> change[0]: benefit = -4, verified - ok // The costs of insn 22 and >> insn 23 will increase after fwprop. >> change[1]: benefit = -4, verified - ok // The insn 23 is the same >> with insn 22 >> Total benefit is -3, fwprop will punt. >> >> How about adding the (a << (b&63) ==> a << b) transformation in >> simplify_binary_operation_1, becuase (a << (b&63) ==> a << b) is a >> kind of architecture specific expr simplification? Then fwprop could >> do the propagation as I expect. >> >>> >>> The problem with some of the splitters is that they exist to break up >>> RTL from 'expand' to initially keep some pattern together to allow the >>> code transformation passes to handle the pattern as one instruction. >>> This made sense when RTL was the only intermediate representation and >>> splitting too early would inhibit some optimizations. But I would >>> expect most (if not all) such cases to be less relevant because of the >>> GIMPLE middle-end work. The only splitters you can trigger are the >>> pre-reload splitters (all the reload_completed conditions obviously >>> can't trigger if you're splitting from fwprop). Perhaps those >>> splitters can/should run earlier, or be made obsolete by expanding >>> directly to the post-splitting insns. >>> >>> Unfortunately, it's not possible to tell for your case, because you >>> haven't explained it yet... >>> >>> >>>> So how about keep split_insns and remove peephole in the cost estimation func? >>> >>> I'd strongly oppose this. I do not believe this is necessary, and I >>> think it's conceptually wrong. >>> >>> >>>>> What happens if you propagate into an insn that uses the same register >>>>> twice? Will the DU chains still be valid (I don't think that's >>>>> guaranteed)? >>>> >>>> I think the DU chains still be valid. If propagate into the insn that >>>> uses the same register twice, the two uses will be replaced when the >>>> first use is seen (propagate_rtx_1 will propagate all the occurrances >>>> of the same reg in the use insn). When the second use is seen, the >>>> df_use and use insn in its insn_info are still available. >>>> forward_propagate_into will early return after check reg_mentioned_p >>>> (DF_REF_REG (use), parent) and find out no reg is used any more. >>> >>> With reg_mentioned_p you cannot verify that the DF_REF_LOC of USE is >>> still valid. >> >> I think DF_REF_LOC of USE may be invalid if dangling rtx will be >> recycled by garbage collection very soon (I don't know when GC will >> happen). Although DF_REF_LOC of USE maybe invalid, the early return in >> forward_propagate_into ensure it will not cause any correctness >> problem. >> >>> >>> In any case, returning to the RD problem for DU/UD chains is probably >>> a good idea, now that RD is not such a hog anymore. In effect fwprop.c >>> would return to what it looked like before the patch of r149010. >> >> I remove MD problem and use DU/UD instead. >> >>> >>> As a way forward on all of this, I'd suggest the following steps, each >>> with a separate patch: >> >> Thanks for the suggestion! >> >>> 1. replace the MD problem with RD again, and build full DU/UD chains. >> >> I include patch.1 attached. >> >>> 2. post all the recog changes separately, with minimum impact on the >>> parts of the compiler you don't really change. (For apply_change_group >>> you could even choose to overload it, or use a NUM argument with a >>> default value -- not sure if default argument values are OK for GCC >>> tho'.) >> >> patch.2 attached. >> >>> 3. implement propagation into multiple USEs, but without the splitting >>> and peepholing. >> >> patch.3 attached. >> >>> 4. see about fixing the back end to either split earlier or expand to >>> the desired patterns directly. >> >> I havn't included this part. If you agree with the proposal to add the >> transformation (a << (b&63) ==> a << b) in >> simplify_binary_operation_1, I will send out another patch about it. >> >> Thanks, >> Wei.
On Tue, Apr 2, 2013 at 7:43 AM, Wei Mi <wmi@google.com> wrote: > I attached the patch.4 based on r197308. r197308 changes shift-and > type truncation from define_insn_and_split to define_insn. patch.4 > changes ix86_rtx_costs for shift-and type rtx to get the correct cost > for the result after the shift-and truncation. > > With the patch.1 ~ patch.4, fwprop extension could handle the > motivational case 1.c attached by removing all the redundent "x & 63" > operations. > > patch.1~patch.4 regression and bootstrap ok on > x86_64-unknown-linux-gnu. Is it ok for trunk? > 2013-04-01 Wei Mi <wmi@google.com> > > * config/i386/i386.c (ix86_rtx_costs): Set proper rtx cost for > ashl<mode>3_mask, *<shift_insn><mode>3_mask and *<rotate_insn><mode>3_mask in i386.md. Patch 4 is OK for mainline and also for release branches that were changed by your previous i386.md patch. Thanks, Uros.
Index: cse.c =================================================================== --- cse.c (revision 196182) +++ cse.c (working copy) @@ -3179,9 +3179,22 @@ fold_rtx (rtx x, rtx insn) switch (GET_CODE (folded_arg)) { + case SUBREG: + /* If the SUBREG_REG comes in from an AND, and this is not a + paradoxical subreg, then try to fold the SUBREG. */ + if (REG_P (SUBREG_REG (folded_arg)) + && ! paradoxical_subreg_p (folded_arg)) + { + rtx y = lookup_as_function (SUBREG_REG (folded_arg), AND); + if (y != 0) + y = simplify_gen_binary(AND, GET_MODE (folded_arg), + XEXP(y, 0), XEXP(y, 1)); + if (y != 0) + folded_arg = y; + } + /* ... fall through ... */ case MEM: case REG: - case SUBREG: const_arg = equiv_constant (folded_arg); break;
On Wed, Feb 27, 2013 at 7:37 PM, Wei Mi wrote: > What do you think? I think you'll not be able to teach fold_rtx to perform the transformation you want it to do without having SHIFT_COUNT_TRUNCATED set for i386. I already tried it the other day, but GCC won't do the truncation without knowing the insn is really a shift insn and shift_truncation_mask returns something useful. Ciao! Steven