diff mbox

extend fwprop optimization

Message ID CABu31nPB3t+b61fSndgbFD-Ps4SXo62872f2cOyRBOi0nW22ig@mail.gmail.com
State New
Headers show

Commit Message

Steven Bosscher Feb. 27, 2013, 9:21 p.m. UTC
On Wed, Feb 27, 2013 at 7:37 PM, Wei Mi wrote:
> What do you think?

I think you'll not be able to teach fold_rtx to perform the
transformation you want it to do without having SHIFT_COUNT_TRUNCATED
set for i386. I already tried it the other day, but GCC won't do the
truncation without knowing the insn is really a shift insn and
shift_truncation_mask returns something useful.

Ciao!
Steven

Comments

Wei Mi Feb. 27, 2013, 9:56 p.m. UTC | #1
Yes, I agree with you. fold_rtx also needs to be extended because now
it only handles the case similar as follows for shift insn:
  a = b op const1
  c = a >> const2
for our motivational case, the second operand of the first insn is a
reg instead of a const. We also need to add the truncation support for
our case in simplify_binary_operation.

I will send out a more official patch about fwprop extension soon.
Then it may be easier to talk about its rationality.

Thanks,
Wei.

On Wed, Feb 27, 2013 at 1:21 PM, Steven Bosscher <stevenb.gcc@gmail.com> wrote:
> On Wed, Feb 27, 2013 at 7:37 PM, Wei Mi wrote:
>> What do you think?
>
> I think you'll not be able to teach fold_rtx to perform the
> transformation you want it to do without having SHIFT_COUNT_TRUNCATED
> set for i386. I already tried it the other day, but GCC won't do the
> truncation without knowing the insn is really a shift insn and
> shift_truncation_mask returns something useful.
>
> Ciao!
> Steven
>
>
> Index: cse.c
> ===================================================================
> --- cse.c       (revision 196182)
> +++ cse.c       (working copy)
> @@ -3179,9 +3179,22 @@ fold_rtx (rtx x, rtx insn)
>
>         switch (GET_CODE (folded_arg))
>           {
> +         case SUBREG:
> +           /* If the SUBREG_REG comes in from an AND, and this is not a
> +              paradoxical subreg, then try to fold the SUBREG.  */
> +           if (REG_P (SUBREG_REG (folded_arg))
> +               && ! paradoxical_subreg_p (folded_arg))
> +             {
> +               rtx y = lookup_as_function (SUBREG_REG (folded_arg), AND);
> +               if (y != 0)
> +                 y = simplify_gen_binary(AND, GET_MODE (folded_arg),
> +                                         XEXP(y, 0), XEXP(y, 1));
> +               if (y != 0)
> +                 folded_arg = y;
> +             }
> +           /* ... fall through ...  */
>           case MEM:
>           case REG:
> -         case SUBREG:
>             const_arg = equiv_constant (folded_arg);
>             break;
Wei Mi March 11, 2013, 5:52 a.m. UTC | #2
Hi,

This is the fwprop extension patch which is put in order. Regression
test and bootstrap pass. Please help to review its rationality. The
following is a brief description what I have done in the patch.

In order to make fwprop more effective in rtl optimization, we extend
it to handle general expressions instead of the three cases listed in
the head comment in fwprop.c. The major changes include a) We need to
check propagation correctness for src exprs of def which contain mem
references. Previous fwprop for the three cases above doesn't have the
problem. b) We need a better cost model because the benefit is usually
not so apparent as the three cases above.

For a general fwprop problem, there are two possible sources where
benefit comes from. The frist is the new use insn after propagation
and simplification may have lower cost than itself before propagation,
or propagation may create a new insn, that could be splitted or
peephole optimized later and get a lower cost. The second is that if
all the uses are replaced with the src of the def insn, the def insn
could be deleted.

So instead of check each def-use pair independently, we use DU chain
to track all the uses for a def. For each def-use pair, we attempt the
propagation, record the change candidate in changes[] array, but we
wait to confirm the changes until all the pairs with the same def are
iterated. The changes confirmation is done in the func
confirm_change_group_by_cost. We only do this for fwprop. For
fwprop_addr, the benefit of each change is ensured by
propagation_rtx_1 using should_replace_address, so we just confirm all
the changes without checking benefit again.

Thanks,
Wei.

On Wed, Feb 27, 2013 at 1:56 PM, Wei Mi <wmi@google.com> wrote:
> Yes, I agree with you. fold_rtx also needs to be extended because now
> it only handles the case similar as follows for shift insn:
>   a = b op const1
>   c = a >> const2
> for our motivational case, the second operand of the first insn is a
> reg instead of a const. We also need to add the truncation support for
> our case in simplify_binary_operation.
>
> I will send out a more official patch about fwprop extension soon.
> Then it may be easier to talk about its rationality.
>
> Thanks,
> Wei.
>
> On Wed, Feb 27, 2013 at 1:21 PM, Steven Bosscher <stevenb.gcc@gmail.com> wrote:
>> On Wed, Feb 27, 2013 at 7:37 PM, Wei Mi wrote:
>>> What do you think?
>>
>> I think you'll not be able to teach fold_rtx to perform the
>> transformation you want it to do without having SHIFT_COUNT_TRUNCATED
>> set for i386. I already tried it the other day, but GCC won't do the
>> truncation without knowing the insn is really a shift insn and
>> shift_truncation_mask returns something useful.
>>
>> Ciao!
>> Steven
>>
>>
>> Index: cse.c
>> ===================================================================
>> --- cse.c       (revision 196182)
>> +++ cse.c       (working copy)
>> @@ -3179,9 +3179,22 @@ fold_rtx (rtx x, rtx insn)
>>
>>         switch (GET_CODE (folded_arg))
>>           {
>> +         case SUBREG:
>> +           /* If the SUBREG_REG comes in from an AND, and this is not a
>> +              paradoxical subreg, then try to fold the SUBREG.  */
>> +           if (REG_P (SUBREG_REG (folded_arg))
>> +               && ! paradoxical_subreg_p (folded_arg))
>> +             {
>> +               rtx y = lookup_as_function (SUBREG_REG (folded_arg), AND);
>> +               if (y != 0)
>> +                 y = simplify_gen_binary(AND, GET_MODE (folded_arg),
>> +                                         XEXP(y, 0), XEXP(y, 1));
>> +               if (y != 0)
>> +                 folded_arg = y;
>> +             }
>> +           /* ... fall through ...  */
>>           case MEM:
>>           case REG:
>> -         case SUBREG:
>>             const_arg = equiv_constant (folded_arg);
>>             break;
Jeff Law March 11, 2013, 6:10 p.m. UTC | #3
On 03/10/2013 11:52 PM, Wei Mi wrote:
> Hi,
>
> This is the fwprop extension patch which is put in order. Regression
> test and bootstrap pass. Please help to review its rationality. The
> following is a brief description what I have done in the patch.
>
> In order to make fwprop more effective in rtl optimization, we extend
> it to handle general expressions instead of the three cases listed in
> the head comment in fwprop.c. The major changes include a) We need to
> check propagation correctness for src exprs of def which contain mem
> references. Previous fwprop for the three cases above doesn't have the
> problem. b) We need a better cost model because the benefit is usually
> not so apparent as the three cases above.
>
> For a general fwprop problem, there are two possible sources where
> benefit comes from. The frist is the new use insn after propagation
> and simplification may have lower cost than itself before propagation,
> or propagation may create a new insn, that could be splitted or
> peephole optimized later and get a lower cost. The second is that if
> all the uses are replaced with the src of the def insn, the def insn
> could be deleted.
>
> So instead of check each def-use pair independently, we use DU chain
> to track all the uses for a def. For each def-use pair, we attempt the
> propagation, record the change candidate in changes[] array, but we
> wait to confirm the changes until all the pairs with the same def are
> iterated. The changes confirmation is done in the func
> confirm_change_group_by_cost. We only do this for fwprop. For
> fwprop_addr, the benefit of each change is ensured by
> propagation_rtx_1 using should_replace_address, so we just confirm all
> the changes without checking benefit again.
Can you please attach this to the 4.9 pending patches tracker bug. 
We're really focused on trying to get 4.8 out the door and this doesn't 
seem like suitable material for GCC 4.8.

I haven't looked at the details of the patch at all yet and doubt I 
would prior to GCC 4.8 going out the door.

Thanks,
jeff
Steven Bosscher March 11, 2013, 6:16 p.m. UTC | #4
On Mon, Mar 11, 2013 at 7:10 PM, Jeff Law <law@redhat.com> wrote:
> On 03/10/2013 11:52 PM, Wei Mi wrote:
>>
>> Hi,
>>
>> This is the fwprop extension patch which is put in order. Regression
>> test and bootstrap pass. Please help to review its rationality. The
>> following is a brief description what I have done in the patch.
>>
>> In order to make fwprop more effective in rtl optimization, we extend
>> it to handle general expressions instead of the three cases listed in
>> the head comment in fwprop.c. The major changes include a) We need to
>> check propagation correctness for src exprs of def which contain mem
>> references. Previous fwprop for the three cases above doesn't have the
>> problem. b) We need a better cost model because the benefit is usually
>> not so apparent as the three cases above.
>>
>> For a general fwprop problem, there are two possible sources where
>> benefit comes from. The frist is the new use insn after propagation
>> and simplification may have lower cost than itself before propagation,
>> or propagation may create a new insn, that could be splitted or
>> peephole optimized later and get a lower cost. The second is that if
>> all the uses are replaced with the src of the def insn, the def insn
>> could be deleted.
>>
>> So instead of check each def-use pair independently, we use DU chain
>> to track all the uses for a def. For each def-use pair, we attempt the
>> propagation, record the change candidate in changes[] array, but we
>> wait to confirm the changes until all the pairs with the same def are
>> iterated. The changes confirmation is done in the func
>> confirm_change_group_by_cost. We only do this for fwprop. For
>> fwprop_addr, the benefit of each change is ensured by
>> propagation_rtx_1 using should_replace_address, so we just confirm all
>> the changes without checking benefit again.
>
> Can you please attach this to the 4.9 pending patches tracker bug. We're
> really focused on trying to get 4.8 out the door and this doesn't seem like
> suitable material for GCC 4.8.
>
> I haven't looked at the details of the patch at all yet and doubt I would
> prior to GCC 4.8 going out the door.
>
> Thanks,
> jeff
>

Jeff,

The world has more people than you, and with different interests. This
patch was posted here for comments on the idea, and while I'm sure
your feedback would be very valuable, it is no more required for
discussing this patch than it is for releasing GCC 4.8.

Ciao!
Steven
Steven Bosscher March 11, 2013, 7:52 p.m. UTC | #5
On Mon, Mar 11, 2013 at 6:52 AM, Wei Mi wrote:
> This is the fwprop extension patch which is put in order. Regression
> test and bootstrap pass. Please help to review its rationality. The
> following is a brief description what I have done in the patch.
>
> In order to make fwprop more effective in rtl optimization, we extend
> it to handle general expressions instead of the three cases listed in
> the head comment in fwprop.c. The major changes include a) We need to
> check propagation correctness for src exprs of def which contain mem
> references. Previous fwprop for the three cases above doesn't have the
> problem. b) We need a better cost model because the benefit is usually
> not so apparent as the three cases above.
>
> For a general fwprop problem, there are two possible sources where
> benefit comes from. The frist is the new use insn after propagation
> and simplification may have lower cost than itself before propagation,
> or propagation may create a new insn, that could be splitted or
> peephole optimized later and get a lower cost. The second is that if
> all the uses are replaced with the src of the def insn, the def insn
> could be deleted.
>
> So instead of check each def-use pair independently, we use DU chain
> to track all the uses for a def. For each def-use pair, we attempt the
> propagation, record the change candidate in changes[] array, but we
> wait to confirm the changes until all the pairs with the same def are
> iterated. The changes confirmation is done in the func
> confirm_change_group_by_cost. We only do this for fwprop. For
> fwprop_addr, the benefit of each change is ensured by
> propagation_rtx_1 using should_replace_address, so we just confirm all
> the changes without checking benefit again.

Hello Wei Mi,

So IIUC, in essence you are doing:

main:
  FOR_EACH_BB:
    FOR_BB_INSNS, non-debug insns only:
      for each df_ref DEF operand on insn:
        iterate_def_uses

iterate_def_uses:
  for each UD chain from DEF to USE(i):
    forward_propagate_into
  confirm changes by total benefit

I still like the idea, but there are also still a few "design issues"
to resolve.

Some of the same comments as before apply: Do you really, really,
reallyreally have to go so low-level as to insn splitting, peephole
optimizations, and even register allocation, to get the cost right?
That will almost certainly not be acceptable, and I for one would
oppose such a change. It's IMHO a violation of proper engineering when
your medium-to-high level code transformations have to do that. If you
have strong reasons for your approach, it'd be helpful if you can
explain them so that we can together look for a less intrusive
solution (e.g. splitting earlier, adjusting the cost model, etc.).

So things like:
> +  /* we may call peephole2_insns in fwprop phase to estimate how
> +     peephole will affect the cost of the insn transformed by fwprop.
> +     fwprop is done before ira phase. In that case, we simply return
> +     a new pseudo register.  */
> +  if (!strncmp (current_pass->name, "fwprop", 6))
> +    return gen_reg_rtx (mode);

and

> Index: config/i386/i386.c
> ===================================================================
> --- config/i386/i386.c        (revision 196270)
> +++ config/i386/i386.c        (working copy)
> @@ -15901,8 +15901,14 @@ ix86_expand_clear (rtx dest)
>  {
>    rtx tmp;
>
> -  /* We play register width games, which are only valid after reload.  */
> -  gcc_assert (reload_completed);
> +  /* We play register width games, which are only valid after reload.
> +     An exception: fwprop call peephole to estimate the change benefit,
> +     and peephole will call this func. That is before reload complete.
> +     It will not bring any problem because the peephole2_insns call is
> +     only used for cost estimation in fwprop, and its change will be
> +     abandoned immediately after the cost estimation.  */
> +  if (strncmp (current_pass->name, "fwprop", 6))
> +    gcc_assert (reload_completed);

are IMHO not OK.

Note that your patch is a bit difficult to read at some points because
you have included a bunch of non-changes (whitespaces fixes --
necessary cleanups but not relevant for your patch), see e.g. the
changed lines that contain "lra_in_progress". Also the changes like:
>  static bool
> -propagate_rtx_1 (rtx *px, rtx old_rtx, rtx new_rtx, int flags)
> +propagate_rtx_1 (rtx *px, rtx old_rtx, rtx new_rtx, bool speed)

which are quite distracting, making it harder to see what has *really* changed.

You should probably just a helper function apply_change_group_num()
and avoid all the apply_change_group use fixups.


In fwprop.c:
> +  /* DF_LR_RUN_DCE is used in peephole2_insns, which is called for cost
> +     estimation in estimate_split_and_peephole.  */
> +  df_set_flags (DF_LR_RUN_DCE);
>    df_md_add_problem ();
>    df_note_add_problem ();
> -  df_analyze ();
> +  df_chain_add_problem (DF_UD_CHAIN | DF_DU_CHAIN);
>    df_maybe_reorganize_use_refs (DF_REF_ORDER_BY_INSN_WITH_NOTES);
> +  df_analyze ();

you add DU and UD chains, and implicitly the RD problem, but you also
already have the MD problem. I think my reaching-defs patches for GCC
4.8 make the MD problem less necessary, but you certainly don't need
MD + RD + UD + DU.

You've noticed so yourself:
> +   We think the maintainance for use_def_ref vector is not necessary, so
> +   we remove update_df/update_uses/update_df_init/register_active_defs.  */

and it looks like you're simply avoiding the problem by queuing up
changes and commit them all at the end. I don't believe that will
work, you'll break the UD and DU chains and may end up with dangling
pointers to free'd or removed df_refs.


> +  /* see when the insn is not a set  */
> +  if (!set)
> +    return false;

fwprop.c was speciflcally developed to also handle multiple-set
instructions, like the bits of cse.c that it tried to replace. Your
patch should not change this.


> +static bool
> +mem_may_be_modified (rtx from, rtx to)

This has "potentially slow" written all over it :-) (You're punting on
any MEM for now, but someone at some point will find a reason to use
alias analysis, blowing up really bad test cases like PR39326...)


> +int
> +reg_mentioned_num (const_rtx reg, const_rtx in

Should use DF caches instead of deep-diving the pattern. Or if DF
cache updates are deferred, use for_each_rtx on the pattern.


> +/* Find whether the set define a return reg.  */
> +
> +static bool
> +def_return_reg (rtx set)
> +{
> +  edge eg;
> +  edge_iterator ei;
> +  rtx dest = SET_DEST (set);
> +
> +  if (!REG_P (dest))
> +    return false;

The return USE will also be a hard reg:
 +  if (!REG_P (dest) || ! HARD_REGISTER_P (dest))
 +    return false;


> +  FOR_EACH_EDGE (eg, ei, EXIT_BLOCK_PTR->preds)
> +    if (eg->flags & EDGE_FALLTHRU)
> +      {
> +     basic_block src_bb = eg->src;
> +     rtx last_insn, ret_reg;
> +     if (EDGE_COUNT (EXIT_BLOCK_PTR->preds) == 1

single_pred_p(), but why do FOR_EACH_EDGE and then chec that there is
only one pred to begin with?

> +         && NONJUMP_INSN_P ((last_insn = BB_END (src_bb)))
> +         && GET_CODE (PATTERN (last_insn)) == USE
> +         && GET_CODE ((ret_reg = XEXP (PATTERN (last_insn), 0))) == REG
> +         && REGNO (ret_reg) == REGNO (dest))
> +       return true;
> +      }
> +  return false;
> +}

Actually this whole change makes me nervous. I don't think you should
propagate into any USE at all, for return value or otherwise.


> +  if (def_insn)
> +  {
> +    rtx set = single_set (def_insn);
> +    if (set)
> +      def_insn_cost = set_src_cost (SET_SRC (set), speed)
> +                   + set_src_cost (SET_DEST (set), speed) + 1;
> +    else
> +      return false;
> +  }

As before: You'll have to deal with non-single_set insns also.


> +void
> +dump_cfg (FILE *file)
> +{

You'll find -fdump-rtl-fwprop-graph useful, as well as brief_dump_cfg.


> +  if (fwprop_addr)
> +     return confirm_change_group_by_cost (false,
> +                                       0,
> +                                       false);
> +  else
> +    {
> +      all_uses_replaced = (use_num == reg_replaced_num);
> +      return confirm_change_group_by_cost (all_uses_replaced,
> +                                        def_insn_cost,
> +                                        true);
> +    }


What happens if you propagate into an insn that uses the same register
twice? Will the DU chains still be valid (I don't think that's
guaranteed)?

Is the extra_benefit flag always applicable if all USEs of a DEF have
been propagated out? What if the DEF is in an insn that is inherently
necessary?

Have you measured what effect this pass has on combine?

Ciao!
Steven
Wei Mi March 12, 2013, 7:18 a.m. UTC | #6
Thanks for the helpful comments! I have some replies inlined.

Regards,
Wei.

On Mon, Mar 11, 2013 at 12:52 PM, Steven Bosscher <stevenb.gcc@gmail.com> wrote:
> On Mon, Mar 11, 2013 at 6:52 AM, Wei Mi wrote:
>> This is the fwprop extension patch which is put in order. Regression
>> test and bootstrap pass. Please help to review its rationality. The
>> following is a brief description what I have done in the patch.
>>
>> In order to make fwprop more effective in rtl optimization, we extend
>> it to handle general expressions instead of the three cases listed in
>> the head comment in fwprop.c. The major changes include a) We need to
>> check propagation correctness for src exprs of def which contain mem
>> references. Previous fwprop for the three cases above doesn't have the
>> problem. b) We need a better cost model because the benefit is usually
>> not so apparent as the three cases above.
>>
>> For a general fwprop problem, there are two possible sources where
>> benefit comes from. The frist is the new use insn after propagation
>> and simplification may have lower cost than itself before propagation,
>> or propagation may create a new insn, that could be splitted or
>> peephole optimized later and get a lower cost. The second is that if
>> all the uses are replaced with the src of the def insn, the def insn
>> could be deleted.
>>
>> So instead of check each def-use pair independently, we use DU chain
>> to track all the uses for a def. For each def-use pair, we attempt the
>> propagation, record the change candidate in changes[] array, but we
>> wait to confirm the changes until all the pairs with the same def are
>> iterated. The changes confirmation is done in the func
>> confirm_change_group_by_cost. We only do this for fwprop. For
>> fwprop_addr, the benefit of each change is ensured by
>> propagation_rtx_1 using should_replace_address, so we just confirm all
>> the changes without checking benefit again.
>
> Hello Wei Mi,
>
> So IIUC, in essence you are doing:
>
> main:
>   FOR_EACH_BB:
>     FOR_BB_INSNS, non-debug insns only:
>       for each df_ref DEF operand on insn:
>         iterate_def_uses
>
> iterate_def_uses:
>   for each UD chain from DEF to USE(i):
>     forward_propagate_into
>   confirm changes by total benefit
>
> I still like the idea, but there are also still a few "design issues"
> to resolve.
>
> Some of the same comments as before apply: Do you really, really,
> reallyreally have to go so low-level as to insn splitting, peephole
> optimizations, and even register allocation, to get the cost right?
> That will almost certainly not be acceptable, and I for one would
> oppose such a change. It's IMHO a violation of proper engineering when
> your medium-to-high level code transformations have to do that. If you
> have strong reasons for your approach, it'd be helpful if you can
> explain them so that we can together look for a less intrusive
> solution (e.g. splitting earlier, adjusting the cost model, etc.).
>

For the motivational case, I need insn splitting to get the cost
right. insn splitting is not very intrusive. All I need is to call
split_insns func. I think split_insns is just a pattern matching func
just like recog(), which is called at many places. Peephole is not
necessary (I add it in order to find as many oppotunities as possible,
but from my trace analysis, it doesn't help much). To call
peephole2_insn() is indeed intrusive, because peephole assumes reg
allocation is completed, I have to insert the ugly workaround below.
peephole also requires setting DF_LR_RUN_DCE flag and some
initialization of peep2_insn_data array.

So how about keep split_insns and remove peephole in the cost estimation func?

> So things like:
>> +  /* we may call peephole2_insns in fwprop phase to estimate how
>> +     peephole will affect the cost of the insn transformed by fwprop.
>> +     fwprop is done before ira phase. In that case, we simply return
>> +     a new pseudo register.  */
>> +  if (!strncmp (current_pass->name, "fwprop", 6))
>> +    return gen_reg_rtx (mode);
>
> and
>
>> Index: config/i386/i386.c
>> ===================================================================
>> --- config/i386/i386.c        (revision 196270)
>> +++ config/i386/i386.c        (working copy)
>> @@ -15901,8 +15901,14 @@ ix86_expand_clear (rtx dest)
>>  {
>>    rtx tmp;
>>
>> -  /* We play register width games, which are only valid after reload.  */
>> -  gcc_assert (reload_completed);
>> +  /* We play register width games, which are only valid after reload.
>> +     An exception: fwprop call peephole to estimate the change benefit,
>> +     and peephole will call this func. That is before reload complete.
>> +     It will not bring any problem because the peephole2_insns call is
>> +     only used for cost estimation in fwprop, and its change will be
>> +     abandoned immediately after the cost estimation.  */
>> +  if (strncmp (current_pass->name, "fwprop", 6))
>> +    gcc_assert (reload_completed);
>
> are IMHO not OK.
>

They are intrusive and inserted for peephole.

> Note that your patch is a bit difficult to read at some points because
> you have included a bunch of non-changes (whitespaces fixes --
> necessary cleanups but not relevant for your patch), see e.g. the
> changed lines that contain "lra_in_progress". Also the changes like:
>>  static bool
>> -propagate_rtx_1 (rtx *px, rtx old_rtx, rtx new_rtx, int flags)
>> +propagate_rtx_1 (rtx *px, rtx old_rtx, rtx new_rtx, bool speed)
>
> which are quite distracting, making it harder to see what has *really* changed.
>

In the old fwprop, the flags is of enum type {PR_CAN_APPEAR,
PR_HANDLE_MEM, PR_OPTIMIZE_FOR_SPEED}.   PR_CAN_APPEAR is used to
restrict fwprop to only handle the three typical cases in the comment
at the head of fwprop.c. PR_HANDLE_MEM is used for mem addr.
PR_OPTIMIZE_FOR_SPEED indicates optimization is for speed or for size.
New fwprop will only use PR_OPTIMIZE_FOR_SPEED, and the other two are
useless. So I change the param from "int flags" to "bool speed".

> You should probably just a helper function apply_change_group_num()
> and avoid all the apply_change_group use fixups.
>
>

Yes, they are distracting. I can use apply_change_group_num for easy
code review, but I think to commit, extending apply_change_group and
avoid creating a very similar func is more welcomed?


> In fwprop.c:
>> +  /* DF_LR_RUN_DCE is used in peephole2_insns, which is called for cost
>> +     estimation in estimate_split_and_peephole.  */
>> +  df_set_flags (DF_LR_RUN_DCE);
>>    df_md_add_problem ();
>>    df_note_add_problem ();
>> -  df_analyze ();
>> +  df_chain_add_problem (DF_UD_CHAIN | DF_DU_CHAIN);
>>    df_maybe_reorganize_use_refs (DF_REF_ORDER_BY_INSN_WITH_NOTES);
>> +  df_analyze ();
>
> you add DU and UD chains, and implicitly the RD problem, but you also
> already have the MD problem. I think my reaching-defs patches for GCC
> 4.8 make the MD problem less necessary, but you certainly don't need
> MD + RD + UD + DU.
>

MD problem is used to set use_def_ref and check whether a use has
multiple def in the old fwprop. I reuse that part in new fwprop. With
UD chain, we may remove MD problem and use_def_ref vector, and use UD
chain to do the check. I will try it.

> You've noticed so yourself:
>> +   We think the maintainance for use_def_ref vector is not necessary, so
>> +   we remove update_df/update_uses/update_df_init/register_active_defs.  */
>
> and it looks like you're simply avoiding the problem by queuing up
> changes and commit them all at the end. I don't believe that will
> work, you'll break the UD and DU chains and may end up with dangling
> pointers to free'd or removed df_refs.
>

I only remove the code related with use_def_ref vector. The code to
update df_refs is kept but moved to update_df in recog.c.

>
>> +  /* see when the insn is not a set  */
>> +  if (!set)
>> +    return false;
>
> fwprop.c was speciflcally developed to also handle multiple-set
> instructions, like the bits of cse.c that it tried to replace. Your
> patch should not change this.
>

ok, I will remove the limitation.

>
>> +static bool
>> +mem_may_be_modified (rtx from, rtx to)
>
> This has "potentially slow" written all over it :-) (You're punting on
> any MEM for now, but someone at some point will find a reason to use
> alias analysis, blowing up really bad test cases like PR39326...)
>

If someone wants to add alias analysis here, he must find out some
good reasons :-). That is what I expect.

>
>> +int
>> +reg_mentioned_num (const_rtx reg, const_rtx in
>
> Should use DF caches instead of deep-diving the pattern. Or if DF
> cache updates are deferred, use for_each_rtx on the pattern.
>

I will fix it.

>
>> +/* Find whether the set define a return reg.  */
>> +
>> +static bool
>> +def_return_reg (rtx set)
>> +{
>> +  edge eg;
>> +  edge_iterator ei;
>> +  rtx dest = SET_DEST (set);
>> +
>> +  if (!REG_P (dest))
>> +    return false;
>
> The return USE will also be a hard reg:
>  +  if (!REG_P (dest) || ! HARD_REGISTER_P (dest))
>  +    return false;
>
>
>> +  FOR_EACH_EDGE (eg, ei, EXIT_BLOCK_PTR->preds)
>> +    if (eg->flags & EDGE_FALLTHRU)
>> +      {
>> +     basic_block src_bb = eg->src;
>> +     rtx last_insn, ret_reg;
>> +     if (EDGE_COUNT (EXIT_BLOCK_PTR->preds) == 1
>
> single_pred_p(), but why do FOR_EACH_EDGE and then chec that there is
> only one pred to begin with?

My mistake. I copy the chunk of code from mode-switching.
FOR_EACH_EDGE is unneeded.

>
>> +         && NONJUMP_INSN_P ((last_insn = BB_END (src_bb)))
>> +         && GET_CODE (PATTERN (last_insn)) == USE
>> +         && GET_CODE ((ret_reg = XEXP (PATTERN (last_insn), 0))) == REG
>> +         && REGNO (ret_reg) == REGNO (dest))
>> +       return true;
>> +      }
>> +  return false;
>> +}
>
> Actually this whole change makes me nervous. I don't think you should
> propagate into any USE at all, for return value or otherwise.

Yes, reasonable. I will add the restriction.

>
>> +  if (def_insn)
>> +  {
>> +    rtx set = single_set (def_insn);
>> +    if (set)
>> +      def_insn_cost = set_src_cost (SET_SRC (set), speed)
>> +                   + set_src_cost (SET_DEST (set), speed) + 1;
>> +    else
>> +      return false;
>> +  }
>
> As before: You'll have to deal with non-single_set insns also.
>

I will remove the limitation.

>
>> +void
>> +dump_cfg (FILE *file)
>> +{
>
> You'll find -fdump-rtl-fwprop-graph useful, as well as brief_dump_cfg.
>

Oh, thanks.

>
>> +  if (fwprop_addr)
>> +     return confirm_change_group_by_cost (false,
>> +                                       0,
>> +                                       false);
>> +  else
>> +    {
>> +      all_uses_replaced = (use_num == reg_replaced_num);
>> +      return confirm_change_group_by_cost (all_uses_replaced,
>> +                                        def_insn_cost,
>> +                                        true);
>> +    }
>
>
> What happens if you propagate into an insn that uses the same register
> twice? Will the DU chains still be valid (I don't think that's
> guaranteed)?

I think the DU chains still be valid. If propagate into the insn that
uses the same register twice, the two uses will be replaced when the
first use is seen (propagate_rtx_1 will propagate all the occurrances
of the same reg in the use insn).  When the second use is seen, the
df_use and use insn in its insn_info are still available.
forward_propagate_into will early return after check reg_mentioned_p
(DF_REF_REG (use), parent) and find out no reg is used  any more.

A testcase in gcc regression testsuite shows the case:
gcc/testsuite/gcc.target/i386/387-10.c

>
> Is the extra_benefit flag always applicable if all USEs of a DEF have
> been propagated out? What if the DEF is in an insn that is inherently
> necessary?
>

If the DEF is an insn that is inherently necessary, we may mistakenly
assume it could be deleted in cost estimation. But it will not affect
correctness because fwprop depends on delete_trivially_dead_insns in
fwprop_done or DCE after fwprop to delete dead DEF.

> Have you measured what effect this pass has on combine?

I didn't do the measure. I only see a case that new fwprop does the
same optimization as what combine does, which old fwprop doesn't
optimize. New fwprop bring the optimization to an earlier stage. The
testcase is gcc/testsuite/gcc.target/i386/387-10.c.
Qualitatively, fwprop optimize the single-def multiple down uses case
which combine cannot handle. But I didn't do measurement about their
interactive potential.

>
> Ciao!
> Steven
Steven Bosscher March 16, 2013, 10:48 p.m. UTC | #7
On Tue, Mar 12, 2013 at 8:18 AM, Wei Mi wrote:
> For the motivational case, I need insn splitting to get the cost
> right. insn splitting is not very intrusive. All I need is to call
> split_insns func.

It may not look very intrusive, but there's a lot happening in the
back ground. You're creating a lot of new RTL, and then just throw it
away again. You fake the compiler into thinking you're much deeper in
the pipeline than you really are. You're assuming there are no
side-effects other than that some insn gets split, but there are back
ends where splitters may have side-effects.

Even though I've asked twice now, you still have not explained this
motivational case, except to say that there is one. *What* are you
trying to do, *what* is not happening without the splits, and *what*
happens if you split. Only if you explain that in a lot more detail
than "I have a motivational case" then we can look into what is a
proper solution.

The problem with some of the splitters is that they exist to break up
RTL from 'expand' to initially keep some pattern together to allow the
code transformation passes to handle the pattern as one instruction.
This made sense when RTL was the only intermediate representation and
splitting too early would inhibit some optimizations. But I would
expect most (if not all) such cases to be less relevant because of the
GIMPLE middle-end work. The only splitters you can trigger are the
pre-reload splitters (all the reload_completed conditions obviously
can't trigger if you're splitting from fwprop). Perhaps those
splitters can/should run earlier, or be made obsolete by expanding
directly to the post-splitting insns.

Unfortunately, it's not possible to tell for your case, because you
haven't explained it yet...


> So how about keep split_insns and remove peephole in the cost estimation func?

I'd strongly oppose this. I do not believe this is necessary, and I
think it's conceptually wrong.


>> What happens if you propagate into an insn that uses the same register
>> twice? Will the DU chains still be valid (I don't think that's
>> guaranteed)?
>
> I think the DU chains still be valid. If propagate into the insn that
> uses the same register twice, the two uses will be replaced when the
> first use is seen (propagate_rtx_1 will propagate all the occurrances
> of the same reg in the use insn).  When the second use is seen, the
> df_use and use insn in its insn_info are still available.
> forward_propagate_into will early return after check reg_mentioned_p
> (DF_REF_REG (use), parent) and find out no reg is used  any more.

With reg_mentioned_p you cannot verify that the DF_REF_LOC of USE is
still valid.

In any case, returning to the RD problem for DU/UD chains is probably
a good idea, now that RD is not such a hog anymore. In effect fwprop.c
would return to what it looked like before the patch of r149010.

As a way forward on all of this, I'd suggest the following steps, each
with a separate patch:
1. replace the MD problem with RD again, and build full DU/UD chains.
2. post all the recog changes separately, with minimum impact on the
parts of the compiler you don't really change. (For apply_change_group
you could even choose to overload it, or use a NUM argument with a
default value -- not sure if default argument values are OK for GCC
tho'.)
3. implement propagation into multiple USEs, but without the splitting
and peepholing.
4. see about fixing the back end to either split earlier or expand to
the desired patterns directly.

Ciao!
Steven
Wei Mi March 17, 2013, 7:15 a.m. UTC | #8
Hi,

On Sat, Mar 16, 2013 at 3:48 PM, Steven Bosscher <stevenb.gcc@gmail.com> wrote:
> On Tue, Mar 12, 2013 at 8:18 AM, Wei Mi wrote:
>> For the motivational case, I need insn splitting to get the cost
>> right. insn splitting is not very intrusive. All I need is to call
>> split_insns func.
>
> It may not look very intrusive, but there's a lot happening in the
> back ground. You're creating a lot of new RTL, and then just throw it
> away again. You fake the compiler into thinking you're much deeper in
> the pipeline than you really are. You're assuming there are no
> side-effects other than that some insn gets split, but there are back
> ends where splitters may have side-effects.

Ok, then I will remove the split_insns call.

>
> Even though I've asked twice now, you still have not explained this
> motivational case, except to say that there is one. *What* are you
> trying to do, *what* is not happening without the splits, and *what*
> happens if you split. Only if you explain that in a lot more detail
> than "I have a motivational case" then we can look into what is a
> proper solution.

:-). Sorry, I didn't say it clearly. The motivational case is the one
mentioned in the following posts (split_insns changes a << (b & 63) to
a << b).
http://gcc.gnu.org/ml/gcc/2013-01/msg00181.html
http://gcc.gnu.org/ml/gcc-patches/2013-02/msg01144.html

If I remove the split_insns call and related cost estimation
adjustment, the fwprop 18-->22 and 18-->23 will punt, because fwprop
here looks like a reverse process of cse, the total cost after fwprop
change is increased.

Def insn 18:
        Use insn 23
        Use insn 22

If we include the split_insns cost estimation adjustment.
  extra benefit by removing def insn 18 = 5
  change[0]: benefit = 0, verified - ok  // The cost of insn 22 will
not change after fwprop + insn splitting.
  change[1]: benefit = 0, verified - ok  // The insn 23 is the same with insn 22
Total benefit is 5, fwprop will go on.

If we remove the split_insns cost estimation adjustment.
  extra benefit by removing def insn 18 = 5
  change[0]: benefit = -4, verified - ok   // The costs of insn 22 and
insn 23 will increase after fwprop.
  change[1]: benefit = -4, verified - ok   // The insn 23 is the same
with insn 22
Total benefit is -3, fwprop will punt.

How about adding the (a << (b&63) ==> a << b) transformation in
simplify_binary_operation_1, becuase (a << (b&63) ==> a << b) is a
kind of architecture specific expr simplification? Then fwprop could
do the propagation as I expect.

>
> The problem with some of the splitters is that they exist to break up
> RTL from 'expand' to initially keep some pattern together to allow the
> code transformation passes to handle the pattern as one instruction.
> This made sense when RTL was the only intermediate representation and
> splitting too early would inhibit some optimizations. But I would
> expect most (if not all) such cases to be less relevant because of the
> GIMPLE middle-end work. The only splitters you can trigger are the
> pre-reload splitters (all the reload_completed conditions obviously
> can't trigger if you're splitting from fwprop). Perhaps those
> splitters can/should run earlier, or be made obsolete by expanding
> directly to the post-splitting insns.
>
> Unfortunately, it's not possible to tell for your case, because you
> haven't explained it yet...
>
>
>> So how about keep split_insns and remove peephole in the cost estimation func?
>
> I'd strongly oppose this. I do not believe this is necessary, and I
> think it's conceptually wrong.
>
>
>>> What happens if you propagate into an insn that uses the same register
>>> twice? Will the DU chains still be valid (I don't think that's
>>> guaranteed)?
>>
>> I think the DU chains still be valid. If propagate into the insn that
>> uses the same register twice, the two uses will be replaced when the
>> first use is seen (propagate_rtx_1 will propagate all the occurrances
>> of the same reg in the use insn).  When the second use is seen, the
>> df_use and use insn in its insn_info are still available.
>> forward_propagate_into will early return after check reg_mentioned_p
>> (DF_REF_REG (use), parent) and find out no reg is used  any more.
>
> With reg_mentioned_p you cannot verify that the DF_REF_LOC of USE is
> still valid.

I think DF_REF_LOC of USE may be invalid if dangling rtx will be
recycled by garbage collection very soon (I don't know when GC will
happen). Although DF_REF_LOC of USE maybe invalid, the early return in
forward_propagate_into ensure it will not cause any correctness
problem.

>
> In any case, returning to the RD problem for DU/UD chains is probably
> a good idea, now that RD is not such a hog anymore. In effect fwprop.c
> would return to what it looked like before the patch of r149010.

I remove MD problem and use DU/UD instead.

>
> As a way forward on all of this, I'd suggest the following steps, each
> with a separate patch:

Thanks for the suggestion!

> 1. replace the MD problem with RD again, and build full DU/UD chains.

I include patch.1 attached.

> 2. post all the recog changes separately, with minimum impact on the
> parts of the compiler you don't really change. (For apply_change_group
> you could even choose to overload it, or use a NUM argument with a
> default value -- not sure if default argument values are OK for GCC
> tho'.)

patch.2 attached.

> 3. implement propagation into multiple USEs, but without the splitting
> and peepholing.

patch.3 attached.

> 4. see about fixing the back end to either split earlier or expand to
> the desired patterns directly.

I havn't included this part. If you agree with the proposal to add the
transformation (a << (b&63) ==> a << b) in
simplify_binary_operation_1, I will send out another patch about it.

Thanks,
Wei.
Andrew Pinski March 17, 2013, 7:23 a.m. UTC | #9
On Sun, Mar 17, 2013 at 12:15 AM, Wei Mi <wmi@google.com> wrote:
> Hi,
>
> On Sat, Mar 16, 2013 at 3:48 PM, Steven Bosscher <stevenb.gcc@gmail.com> wrote:
>> On Tue, Mar 12, 2013 at 8:18 AM, Wei Mi wrote:
>>> For the motivational case, I need insn splitting to get the cost
>>> right. insn splitting is not very intrusive. All I need is to call
>>> split_insns func.
>>
>> It may not look very intrusive, but there's a lot happening in the
>> back ground. You're creating a lot of new RTL, and then just throw it
>> away again. You fake the compiler into thinking you're much deeper in
>> the pipeline than you really are. You're assuming there are no
>> side-effects other than that some insn gets split, but there are back
>> ends where splitters may have side-effects.
>
> Ok, then I will remove the split_insns call.
>
>>
>> Even though I've asked twice now, you still have not explained this
>> motivational case, except to say that there is one. *What* are you
>> trying to do, *what* is not happening without the splits, and *what*
>> happens if you split. Only if you explain that in a lot more detail
>> than "I have a motivational case" then we can look into what is a
>> proper solution.
>
> :-). Sorry, I didn't say it clearly. The motivational case is the one
> mentioned in the following posts (split_insns changes a << (b & 63) to
> a << b).
> http://gcc.gnu.org/ml/gcc/2013-01/msg00181.html
> http://gcc.gnu.org/ml/gcc-patches/2013-02/msg01144.html



>
> If I remove the split_insns call and related cost estimation
> adjustment, the fwprop 18-->22 and 18-->23 will punt, because fwprop
> here looks like a reverse process of cse, the total cost after fwprop
> change is increased.
>
> Def insn 18:
>         Use insn 23
>         Use insn 22
>
> If we include the split_insns cost estimation adjustment.
>   extra benefit by removing def insn 18 = 5
>   change[0]: benefit = 0, verified - ok  // The cost of insn 22 will
> not change after fwprop + insn splitting.
>   change[1]: benefit = 0, verified - ok  // The insn 23 is the same with insn 22
> Total benefit is 5, fwprop will go on.
>
> If we remove the split_insns cost estimation adjustment.
>   extra benefit by removing def insn 18 = 5
>   change[0]: benefit = -4, verified - ok   // The costs of insn 22 and
> insn 23 will increase after fwprop.
>   change[1]: benefit = -4, verified - ok   // The insn 23 is the same
> with insn 22
> Total benefit is -3, fwprop will punt.
>
> How about adding the (a << (b&63) ==> a << b) transformation in
> simplify_binary_operation_1, becuase (a << (b&63) ==> a << b) is a
> kind of architecture specific expr simplification? Then fwprop could
> do the propagation as I expect.
>
>>
>> The problem with some of the splitters is that they exist to break up
>> RTL from 'expand' to initially keep some pattern together to allow the
>> code transformation passes to handle the pattern as one instruction.
>> This made sense when RTL was the only intermediate representation and
>> splitting too early would inhibit some optimizations. But I would
>> expect most (if not all) such cases to be less relevant because of the
>> GIMPLE middle-end work. The only splitters you can trigger are the
>> pre-reload splitters (all the reload_completed conditions obviously
>> can't trigger if you're splitting from fwprop). Perhaps those
>> splitters can/should run earlier, or be made obsolete by expanding
>> directly to the post-splitting insns.
>>
>> Unfortunately, it's not possible to tell for your case, because you
>> haven't explained it yet...
>>
>>
>>> So how about keep split_insns and remove peephole in the cost estimation func?
>>
>> I'd strongly oppose this. I do not believe this is necessary, and I
>> think it's conceptually wrong.
>>
>>
>>>> What happens if you propagate into an insn that uses the same register
>>>> twice? Will the DU chains still be valid (I don't think that's
>>>> guaranteed)?
>>>
>>> I think the DU chains still be valid. If propagate into the insn that
>>> uses the same register twice, the two uses will be replaced when the
>>> first use is seen (propagate_rtx_1 will propagate all the occurrances
>>> of the same reg in the use insn).  When the second use is seen, the
>>> df_use and use insn in its insn_info are still available.
>>> forward_propagate_into will early return after check reg_mentioned_p
>>> (DF_REF_REG (use), parent) and find out no reg is used  any more.
>>
>> With reg_mentioned_p you cannot verify that the DF_REF_LOC of USE is
>> still valid.
>
> I think DF_REF_LOC of USE may be invalid if dangling rtx will be
> recycled by garbage collection very soon (I don't know when GC will
> happen). Although DF_REF_LOC of USE maybe invalid, the early return in
> forward_propagate_into ensure it will not cause any correctness
> problem.
>
>>
>> In any case, returning to the RD problem for DU/UD chains is probably
>> a good idea, now that RD is not such a hog anymore. In effect fwprop.c
>> would return to what it looked like before the patch of r149010.
>
> I remove MD problem and use DU/UD instead.
>
>>
>> As a way forward on all of this, I'd suggest the following steps, each
>> with a separate patch:
>
> Thanks for the suggestion!
>
>> 1. replace the MD problem with RD again, and build full DU/UD chains.
>
> I include patch.1 attached.
>
>> 2. post all the recog changes separately, with minimum impact on the
>> parts of the compiler you don't really change. (For apply_change_group
>> you could even choose to overload it, or use a NUM argument with a
>> default value -- not sure if default argument values are OK for GCC
>> tho'.)
>
> patch.2 attached.
>
>> 3. implement propagation into multiple USEs, but without the splitting
>> and peepholing.
>
> patch.3 attached.
>
>> 4. see about fixing the back end to either split earlier or expand to
>> the desired patterns directly.
>
> I havn't included this part. If you agree with the proposal to add the
> transformation (a << (b&63) ==> a << b) in
> simplify_binary_operation_1, I will send out another patch about it

Sounds like you need to look into using SHIFT_COUNT_TRUNCATED and
TARGET_SHIFT_TRUNCATION_MASK which you could use in
simplify_binary_operation_1 without it being target specific as it
uses the target hooks to find out that :).

Thanks,
Andrew

.
>
> Thanks,
> Wei.
Wei Mi March 24, 2013, 4:18 a.m. UTC | #10
This is the patch to add the shift truncation in
simplify_binary_operation_1. I add a new hook
TARGET_SHIFT_COUNT_TRUNCATED which uses enum rtx_code to decide
whether we can do shift truncation. I didn't use
TARGET_SHIFT_TRUNCATION_MASK in simplify_binary_operation_1 because it
uses the macro SHIFT_COUNT_TRUNCATED. If I change
SHIFT_COUNT_TRUNCATED to targetm.shift_count_truncated in
TARGET_SHIFT_TRUNCATION_MASK, I need to give
TARGET_SHIFT_TRUNCATION_MASK a enum rtx_code param, which wasn't
trivial to get at many places in existing code.

patch.1 ~ patch.4 pass regression and bootstrap on x86_64-unknown-linux-gnu.

Thanks,
Wei.

On Sun, Mar 17, 2013 at 12:15 AM, Wei Mi <wmi@google.com> wrote:
> Hi,
>
> On Sat, Mar 16, 2013 at 3:48 PM, Steven Bosscher <stevenb.gcc@gmail.com> wrote:
>> On Tue, Mar 12, 2013 at 8:18 AM, Wei Mi wrote:
>>> For the motivational case, I need insn splitting to get the cost
>>> right. insn splitting is not very intrusive. All I need is to call
>>> split_insns func.
>>
>> It may not look very intrusive, but there's a lot happening in the
>> back ground. You're creating a lot of new RTL, and then just throw it
>> away again. You fake the compiler into thinking you're much deeper in
>> the pipeline than you really are. You're assuming there are no
>> side-effects other than that some insn gets split, but there are back
>> ends where splitters may have side-effects.
>
> Ok, then I will remove the split_insns call.
>
>>
>> Even though I've asked twice now, you still have not explained this
>> motivational case, except to say that there is one. *What* are you
>> trying to do, *what* is not happening without the splits, and *what*
>> happens if you split. Only if you explain that in a lot more detail
>> than "I have a motivational case" then we can look into what is a
>> proper solution.
>
> :-). Sorry, I didn't say it clearly. The motivational case is the one
> mentioned in the following posts (split_insns changes a << (b & 63) to
> a << b).
> http://gcc.gnu.org/ml/gcc/2013-01/msg00181.html
> http://gcc.gnu.org/ml/gcc-patches/2013-02/msg01144.html
>
> If I remove the split_insns call and related cost estimation
> adjustment, the fwprop 18-->22 and 18-->23 will punt, because fwprop
> here looks like a reverse process of cse, the total cost after fwprop
> change is increased.
>
> Def insn 18:
>         Use insn 23
>         Use insn 22
>
> If we include the split_insns cost estimation adjustment.
>   extra benefit by removing def insn 18 = 5
>   change[0]: benefit = 0, verified - ok  // The cost of insn 22 will
> not change after fwprop + insn splitting.
>   change[1]: benefit = 0, verified - ok  // The insn 23 is the same with insn 22
> Total benefit is 5, fwprop will go on.
>
> If we remove the split_insns cost estimation adjustment.
>   extra benefit by removing def insn 18 = 5
>   change[0]: benefit = -4, verified - ok   // The costs of insn 22 and
> insn 23 will increase after fwprop.
>   change[1]: benefit = -4, verified - ok   // The insn 23 is the same
> with insn 22
> Total benefit is -3, fwprop will punt.
>
> How about adding the (a << (b&63) ==> a << b) transformation in
> simplify_binary_operation_1, becuase (a << (b&63) ==> a << b) is a
> kind of architecture specific expr simplification? Then fwprop could
> do the propagation as I expect.
>
>>
>> The problem with some of the splitters is that they exist to break up
>> RTL from 'expand' to initially keep some pattern together to allow the
>> code transformation passes to handle the pattern as one instruction.
>> This made sense when RTL was the only intermediate representation and
>> splitting too early would inhibit some optimizations. But I would
>> expect most (if not all) such cases to be less relevant because of the
>> GIMPLE middle-end work. The only splitters you can trigger are the
>> pre-reload splitters (all the reload_completed conditions obviously
>> can't trigger if you're splitting from fwprop). Perhaps those
>> splitters can/should run earlier, or be made obsolete by expanding
>> directly to the post-splitting insns.
>>
>> Unfortunately, it's not possible to tell for your case, because you
>> haven't explained it yet...
>>
>>
>>> So how about keep split_insns and remove peephole in the cost estimation func?
>>
>> I'd strongly oppose this. I do not believe this is necessary, and I
>> think it's conceptually wrong.
>>
>>
>>>> What happens if you propagate into an insn that uses the same register
>>>> twice? Will the DU chains still be valid (I don't think that's
>>>> guaranteed)?
>>>
>>> I think the DU chains still be valid. If propagate into the insn that
>>> uses the same register twice, the two uses will be replaced when the
>>> first use is seen (propagate_rtx_1 will propagate all the occurrances
>>> of the same reg in the use insn).  When the second use is seen, the
>>> df_use and use insn in its insn_info are still available.
>>> forward_propagate_into will early return after check reg_mentioned_p
>>> (DF_REF_REG (use), parent) and find out no reg is used  any more.
>>
>> With reg_mentioned_p you cannot verify that the DF_REF_LOC of USE is
>> still valid.
>
> I think DF_REF_LOC of USE may be invalid if dangling rtx will be
> recycled by garbage collection very soon (I don't know when GC will
> happen). Although DF_REF_LOC of USE maybe invalid, the early return in
> forward_propagate_into ensure it will not cause any correctness
> problem.
>
>>
>> In any case, returning to the RD problem for DU/UD chains is probably
>> a good idea, now that RD is not such a hog anymore. In effect fwprop.c
>> would return to what it looked like before the patch of r149010.
>
> I remove MD problem and use DU/UD instead.
>
>>
>> As a way forward on all of this, I'd suggest the following steps, each
>> with a separate patch:
>
> Thanks for the suggestion!
>
>> 1. replace the MD problem with RD again, and build full DU/UD chains.
>
> I include patch.1 attached.
>
>> 2. post all the recog changes separately, with minimum impact on the
>> parts of the compiler you don't really change. (For apply_change_group
>> you could even choose to overload it, or use a NUM argument with a
>> default value -- not sure if default argument values are OK for GCC
>> tho'.)
>
> patch.2 attached.
>
>> 3. implement propagation into multiple USEs, but without the splitting
>> and peepholing.
>
> patch.3 attached.
>
>> 4. see about fixing the back end to either split earlier or expand to
>> the desired patterns directly.
>
> I havn't included this part. If you agree with the proposal to add the
> transformation (a << (b&63) ==> a << b) in
> simplify_binary_operation_1, I will send out another patch about it.
>
> Thanks,
> Wei.
Oleg Endo March 24, 2013, 12:32 p.m. UTC | #11
Hi,

On Sat, 2013-03-23 at 21:18 -0700, Wei Mi wrote:
> This is the patch to add the shift truncation in
> simplify_binary_operation_1. I add a new hook
> TARGET_SHIFT_COUNT_TRUNCATED which uses enum rtx_code to decide
> whether we can do shift truncation. I didn't use
> TARGET_SHIFT_TRUNCATION_MASK in simplify_binary_operation_1 because it
> uses the macro SHIFT_COUNT_TRUNCATED. If I change
> SHIFT_COUNT_TRUNCATED to targetm.shift_count_truncated in
> TARGET_SHIFT_TRUNCATION_MASK, I need to give
> TARGET_SHIFT_TRUNCATION_MASK a enum rtx_code param, which wasn't
> trivial to get at many places in existing code.
> 

During 4.8 development there was a similar issue with the
TARGET_CANONICALIZE_COMPARISON hook.  As a temporary solution the
rtx_code has been passed as int.  I think the story started here:
http://gcc.gnu.org/ml/gcc-patches/2012-12/msg00379.html

The conclusion regarding rtx_code ...
http://gcc.gnu.org/ml/gcc-patches/2012-12/msg00646.html

Maybe this should be addressed first, since now there are at least two
cases where it's in the way.

Cheers,
Oleg



> patch.1 ~ patch.4 pass regression and bootstrap on x86_64-unknown-linux-gnu.
> 
> Thanks,
> Wei.
> 
> On Sun, Mar 17, 2013 at 12:15 AM, Wei Mi <wmi@google.com> wrote:
> > Hi,
> >
> > On Sat, Mar 16, 2013 at 3:48 PM, Steven Bosscher <stevenb.gcc@gmail.com> wrote:
> >> On Tue, Mar 12, 2013 at 8:18 AM, Wei Mi wrote:
> >>> For the motivational case, I need insn splitting to get the cost
> >>> right. insn splitting is not very intrusive. All I need is to call
> >>> split_insns func.
> >>
> >> It may not look very intrusive, but there's a lot happening in the
> >> back ground. You're creating a lot of new RTL, and then just throw it
> >> away again. You fake the compiler into thinking you're much deeper in
> >> the pipeline than you really are. You're assuming there are no
> >> side-effects other than that some insn gets split, but there are back
> >> ends where splitters may have side-effects.
> >
> > Ok, then I will remove the split_insns call.
> >
> >>
> >> Even though I've asked twice now, you still have not explained this
> >> motivational case, except to say that there is one. *What* are you
> >> trying to do, *what* is not happening without the splits, and *what*
> >> happens if you split. Only if you explain that in a lot more detail
> >> than "I have a motivational case" then we can look into what is a
> >> proper solution.
> >
> > :-). Sorry, I didn't say it clearly. The motivational case is the one
> > mentioned in the following posts (split_insns changes a << (b & 63) to
> > a << b).
> > http://gcc.gnu.org/ml/gcc/2013-01/msg00181.html
> > http://gcc.gnu.org/ml/gcc-patches/2013-02/msg01144.html
> >
> > If I remove the split_insns call and related cost estimation
> > adjustment, the fwprop 18-->22 and 18-->23 will punt, because fwprop
> > here looks like a reverse process of cse, the total cost after fwprop
> > change is increased.
> >
> > Def insn 18:
> >         Use insn 23
> >         Use insn 22
> >
> > If we include the split_insns cost estimation adjustment.
> >   extra benefit by removing def insn 18 = 5
> >   change[0]: benefit = 0, verified - ok  // The cost of insn 22 will
> > not change after fwprop + insn splitting.
> >   change[1]: benefit = 0, verified - ok  // The insn 23 is the same with insn 22
> > Total benefit is 5, fwprop will go on.
> >
> > If we remove the split_insns cost estimation adjustment.
> >   extra benefit by removing def insn 18 = 5
> >   change[0]: benefit = -4, verified - ok   // The costs of insn 22 and
> > insn 23 will increase after fwprop.
> >   change[1]: benefit = -4, verified - ok   // The insn 23 is the same
> > with insn 22
> > Total benefit is -3, fwprop will punt.
> >
> > How about adding the (a << (b&63) ==> a << b) transformation in
> > simplify_binary_operation_1, becuase (a << (b&63) ==> a << b) is a
> > kind of architecture specific expr simplification? Then fwprop could
> > do the propagation as I expect.
> >
> >>
> >> The problem with some of the splitters is that they exist to break up
> >> RTL from 'expand' to initially keep some pattern together to allow the
> >> code transformation passes to handle the pattern as one instruction.
> >> This made sense when RTL was the only intermediate representation and
> >> splitting too early would inhibit some optimizations. But I would
> >> expect most (if not all) such cases to be less relevant because of the
> >> GIMPLE middle-end work. The only splitters you can trigger are the
> >> pre-reload splitters (all the reload_completed conditions obviously
> >> can't trigger if you're splitting from fwprop). Perhaps those
> >> splitters can/should run earlier, or be made obsolete by expanding
> >> directly to the post-splitting insns.
> >>
> >> Unfortunately, it's not possible to tell for your case, because you
> >> haven't explained it yet...
> >>
> >>
> >>> So how about keep split_insns and remove peephole in the cost estimation func?
> >>
> >> I'd strongly oppose this. I do not believe this is necessary, and I
> >> think it's conceptually wrong.
> >>
> >>
> >>>> What happens if you propagate into an insn that uses the same register
> >>>> twice? Will the DU chains still be valid (I don't think that's
> >>>> guaranteed)?
> >>>
> >>> I think the DU chains still be valid. If propagate into the insn that
> >>> uses the same register twice, the two uses will be replaced when the
> >>> first use is seen (propagate_rtx_1 will propagate all the occurrances
> >>> of the same reg in the use insn).  When the second use is seen, the
> >>> df_use and use insn in its insn_info are still available.
> >>> forward_propagate_into will early return after check reg_mentioned_p
> >>> (DF_REF_REG (use), parent) and find out no reg is used  any more.
> >>
> >> With reg_mentioned_p you cannot verify that the DF_REF_LOC of USE is
> >> still valid.
> >
> > I think DF_REF_LOC of USE may be invalid if dangling rtx will be
> > recycled by garbage collection very soon (I don't know when GC will
> > happen). Although DF_REF_LOC of USE maybe invalid, the early return in
> > forward_propagate_into ensure it will not cause any correctness
> > problem.
> >
> >>
> >> In any case, returning to the RD problem for DU/UD chains is probably
> >> a good idea, now that RD is not such a hog anymore. In effect fwprop.c
> >> would return to what it looked like before the patch of r149010.
> >
> > I remove MD problem and use DU/UD instead.
> >
> >>
> >> As a way forward on all of this, I'd suggest the following steps, each
> >> with a separate patch:
> >
> > Thanks for the suggestion!
> >
> >> 1. replace the MD problem with RD again, and build full DU/UD chains.
> >
> > I include patch.1 attached.
> >
> >> 2. post all the recog changes separately, with minimum impact on the
> >> parts of the compiler you don't really change. (For apply_change_group
> >> you could even choose to overload it, or use a NUM argument with a
> >> default value -- not sure if default argument values are OK for GCC
> >> tho'.)
> >
> > patch.2 attached.
> >
> >> 3. implement propagation into multiple USEs, but without the splitting
> >> and peepholing.
> >
> > patch.3 attached.
> >
> >> 4. see about fixing the back end to either split earlier or expand to
> >> the desired patterns directly.
> >
> > I havn't included this part. If you agree with the proposal to add the
> > transformation (a << (b&63) ==> a << b) in
> > simplify_binary_operation_1, I will send out another patch about it.
> >
> > Thanks,
> > Wei.
Richard Biener March 25, 2013, 9:35 a.m. UTC | #12
On Sun, Mar 24, 2013 at 5:18 AM, Wei Mi <wmi@google.com> wrote:
> This is the patch to add the shift truncation in
> simplify_binary_operation_1. I add a new hook
> TARGET_SHIFT_COUNT_TRUNCATED which uses enum rtx_code to decide
> whether we can do shift truncation. I didn't use
> TARGET_SHIFT_TRUNCATION_MASK in simplify_binary_operation_1 because it
> uses the macro SHIFT_COUNT_TRUNCATED. If I change
> SHIFT_COUNT_TRUNCATED to targetm.shift_count_truncated in
> TARGET_SHIFT_TRUNCATION_MASK, I need to give
> TARGET_SHIFT_TRUNCATION_MASK a enum rtx_code param, which wasn't
> trivial to get at many places in existing code.
>
> patch.1 ~ patch.4 pass regression and bootstrap on x86_64-unknown-linux-gnu.

Doing this might prove dangerous in case some pass may later decide
to use an instruction that behaves in different ways.  Consider

   tem = 1<< (n & 255);   // count truncated
   x = y & tem;  // bittest instruction bit nr _not_ truncated

so if tem is expanded to use a shift instruction which truncates the shift
count the explicit and is dropped.  If later combine comes around and
combines the bit-test to use the bittest instruction which does not
implicitely truncate the cound you have generated wrong-code.

So we need to make sure any explicit truncation originally in place
is kept in the RTL - which means SHIFT_COUNT_TRUNCATED should
not exist at all, but instead there would be two patterns for shifts
with implicit truncation - one involving the truncation (canonicalized to
bitwise and) and one not involving the truncation.

Richard.

> Thanks,
> Wei.
>
> On Sun, Mar 17, 2013 at 12:15 AM, Wei Mi <wmi@google.com> wrote:
>> Hi,
>>
>> On Sat, Mar 16, 2013 at 3:48 PM, Steven Bosscher <stevenb.gcc@gmail.com> wrote:
>>> On Tue, Mar 12, 2013 at 8:18 AM, Wei Mi wrote:
>>>> For the motivational case, I need insn splitting to get the cost
>>>> right. insn splitting is not very intrusive. All I need is to call
>>>> split_insns func.
>>>
>>> It may not look very intrusive, but there's a lot happening in the
>>> back ground. You're creating a lot of new RTL, and then just throw it
>>> away again. You fake the compiler into thinking you're much deeper in
>>> the pipeline than you really are. You're assuming there are no
>>> side-effects other than that some insn gets split, but there are back
>>> ends where splitters may have side-effects.
>>
>> Ok, then I will remove the split_insns call.
>>
>>>
>>> Even though I've asked twice now, you still have not explained this
>>> motivational case, except to say that there is one. *What* are you
>>> trying to do, *what* is not happening without the splits, and *what*
>>> happens if you split. Only if you explain that in a lot more detail
>>> than "I have a motivational case" then we can look into what is a
>>> proper solution.
>>
>> :-). Sorry, I didn't say it clearly. The motivational case is the one
>> mentioned in the following posts (split_insns changes a << (b & 63) to
>> a << b).
>> http://gcc.gnu.org/ml/gcc/2013-01/msg00181.html
>> http://gcc.gnu.org/ml/gcc-patches/2013-02/msg01144.html
>>
>> If I remove the split_insns call and related cost estimation
>> adjustment, the fwprop 18-->22 and 18-->23 will punt, because fwprop
>> here looks like a reverse process of cse, the total cost after fwprop
>> change is increased.
>>
>> Def insn 18:
>>         Use insn 23
>>         Use insn 22
>>
>> If we include the split_insns cost estimation adjustment.
>>   extra benefit by removing def insn 18 = 5
>>   change[0]: benefit = 0, verified - ok  // The cost of insn 22 will
>> not change after fwprop + insn splitting.
>>   change[1]: benefit = 0, verified - ok  // The insn 23 is the same with insn 22
>> Total benefit is 5, fwprop will go on.
>>
>> If we remove the split_insns cost estimation adjustment.
>>   extra benefit by removing def insn 18 = 5
>>   change[0]: benefit = -4, verified - ok   // The costs of insn 22 and
>> insn 23 will increase after fwprop.
>>   change[1]: benefit = -4, verified - ok   // The insn 23 is the same
>> with insn 22
>> Total benefit is -3, fwprop will punt.
>>
>> How about adding the (a << (b&63) ==> a << b) transformation in
>> simplify_binary_operation_1, becuase (a << (b&63) ==> a << b) is a
>> kind of architecture specific expr simplification? Then fwprop could
>> do the propagation as I expect.
>>
>>>
>>> The problem with some of the splitters is that they exist to break up
>>> RTL from 'expand' to initially keep some pattern together to allow the
>>> code transformation passes to handle the pattern as one instruction.
>>> This made sense when RTL was the only intermediate representation and
>>> splitting too early would inhibit some optimizations. But I would
>>> expect most (if not all) such cases to be less relevant because of the
>>> GIMPLE middle-end work. The only splitters you can trigger are the
>>> pre-reload splitters (all the reload_completed conditions obviously
>>> can't trigger if you're splitting from fwprop). Perhaps those
>>> splitters can/should run earlier, or be made obsolete by expanding
>>> directly to the post-splitting insns.
>>>
>>> Unfortunately, it's not possible to tell for your case, because you
>>> haven't explained it yet...
>>>
>>>
>>>> So how about keep split_insns and remove peephole in the cost estimation func?
>>>
>>> I'd strongly oppose this. I do not believe this is necessary, and I
>>> think it's conceptually wrong.
>>>
>>>
>>>>> What happens if you propagate into an insn that uses the same register
>>>>> twice? Will the DU chains still be valid (I don't think that's
>>>>> guaranteed)?
>>>>
>>>> I think the DU chains still be valid. If propagate into the insn that
>>>> uses the same register twice, the two uses will be replaced when the
>>>> first use is seen (propagate_rtx_1 will propagate all the occurrances
>>>> of the same reg in the use insn).  When the second use is seen, the
>>>> df_use and use insn in its insn_info are still available.
>>>> forward_propagate_into will early return after check reg_mentioned_p
>>>> (DF_REF_REG (use), parent) and find out no reg is used  any more.
>>>
>>> With reg_mentioned_p you cannot verify that the DF_REF_LOC of USE is
>>> still valid.
>>
>> I think DF_REF_LOC of USE may be invalid if dangling rtx will be
>> recycled by garbage collection very soon (I don't know when GC will
>> happen). Although DF_REF_LOC of USE maybe invalid, the early return in
>> forward_propagate_into ensure it will not cause any correctness
>> problem.
>>
>>>
>>> In any case, returning to the RD problem for DU/UD chains is probably
>>> a good idea, now that RD is not such a hog anymore. In effect fwprop.c
>>> would return to what it looked like before the patch of r149010.
>>
>> I remove MD problem and use DU/UD instead.
>>
>>>
>>> As a way forward on all of this, I'd suggest the following steps, each
>>> with a separate patch:
>>
>> Thanks for the suggestion!
>>
>>> 1. replace the MD problem with RD again, and build full DU/UD chains.
>>
>> I include patch.1 attached.
>>
>>> 2. post all the recog changes separately, with minimum impact on the
>>> parts of the compiler you don't really change. (For apply_change_group
>>> you could even choose to overload it, or use a NUM argument with a
>>> default value -- not sure if default argument values are OK for GCC
>>> tho'.)
>>
>> patch.2 attached.
>>
>>> 3. implement propagation into multiple USEs, but without the splitting
>>> and peepholing.
>>
>> patch.3 attached.
>>
>>> 4. see about fixing the back end to either split earlier or expand to
>>> the desired patterns directly.
>>
>> I havn't included this part. If you agree with the proposal to add the
>> transformation (a << (b&63) ==> a << b) in
>> simplify_binary_operation_1, I will send out another patch about it.
>>
>> Thanks,
>> Wei.
Wei Mi March 25, 2013, 5:29 p.m. UTC | #13
On Mon, Mar 25, 2013 at 2:35 AM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On Sun, Mar 24, 2013 at 5:18 AM, Wei Mi <wmi@google.com> wrote:
>> This is the patch to add the shift truncation in
>> simplify_binary_operation_1. I add a new hook
>> TARGET_SHIFT_COUNT_TRUNCATED which uses enum rtx_code to decide
>> whether we can do shift truncation. I didn't use
>> TARGET_SHIFT_TRUNCATION_MASK in simplify_binary_operation_1 because it
>> uses the macro SHIFT_COUNT_TRUNCATED. If I change
>> SHIFT_COUNT_TRUNCATED to targetm.shift_count_truncated in
>> TARGET_SHIFT_TRUNCATION_MASK, I need to give
>> TARGET_SHIFT_TRUNCATION_MASK a enum rtx_code param, which wasn't
>> trivial to get at many places in existing code.
>>
>> patch.1 ~ patch.4 pass regression and bootstrap on x86_64-unknown-linux-gnu.
>
> Doing this might prove dangerous in case some pass may later decide
> to use an instruction that behaves in different ways.  Consider
>
>    tem = 1<< (n & 255);   // count truncated
>    x = y & tem;  // bittest instruction bit nr _not_ truncated
>
> so if tem is expanded to use a shift instruction which truncates the shift
> count the explicit and is dropped.  If later combine comes around and
> combines the bit-test to use the bittest instruction which does not
> implicitely truncate the cound you have generated wrong-code.
>

So it means the existing truncation pattern defined in insn split is
also incorrect because the truncated shift may be combined into a bit
test pattern?

// The following define_insn_and_split will do shift truncation.
(define_insn_and_split "*<shift_insn><mode>3_mask"
  [(set (match_operand:SWI48 0 "nonimmediate_operand" "=rm")
        (any_shiftrt:SWI48
          (match_operand:SWI48 1 "nonimmediate_operand" "0")
          (subreg:QI
            (and:SI
              (match_operand:SI 2 "nonimmediate_operand" "c")
              (match_operand:SI 3 "const_int_operand" "n")) 0)))
   (clobber (reg:CC FLAGS_REG))]
  "ix86_binary_operator_ok (<CODE>, <MODE>mode, operands)
   && (INTVAL (operands[3]) & (GET_MODE_BITSIZE (<MODE>mode)-1))
      == GET_MODE_BITSIZE (<MODE>mode)-1"
  "#"
  "&& 1"
  [(parallel [(set (match_dup 0)
                   (any_shiftrt:SWI48 (match_dup 1) (match_dup 2)))
              (clobber (reg:CC FLAGS_REG))])]
{
  if (can_create_pseudo_p ())
    operands [2] = force_reg (SImode, operands[2]);

  operands[2] = simplify_gen_subreg (QImode, operands[2], SImode, 0);
}
  [(set_attr "type" "ishift")
   (set_attr "mode" "<MODE>")])

> So we need to make sure any explicit truncation originally in place
> is kept in the RTL - which means SHIFT_COUNT_TRUNCATED should
> not exist at all, but instead there would be two patterns for shifts
> with implicit truncation - one involving the truncation (canonicalized to
> bitwise and) and one not involving the truncation.
>
> Richard.
>

I am trying to figure out a way not to lose the opportunity when shift
truncation is not combined in a bit test pattern. Can we keep the
explicit truncation in RTL, but generate truncation code in assembly?
Then only shift truncation which not combined in a bit test
pattershift truncationn will happen.

(define_insn "*<shift_insn_and><mode>"
  [(set (match_operand:SWI48 0 "nonimmediate_operand" "=rm")
        (any_shiftrt:SWI48
          (match_operand:SWI48 1 "nonimmediate_operand" "0")
          (subreg:QI
            (and:SI
              (match_operand:SI 2 "nonimmediate_operand" "c")
              (match_operand:SI 3 "const_int_operand" "n")) 0)))
   (clobber (reg:CC FLAGS_REG))]
  "ix86_binary_operator_ok (<CODE>, <MODE>mode, operands)"
{
   if ((INTVAL (operands[3]) & (GET_MODE_BITSIZE (<MODE>mode)-1))
      == GET_MODE_BITSIZE (<MODE>mode)-1)
      return "and\t{%3, %2|%2, %3}\n\r shift\t{%b2, %0|%0, %b2}";
   else
      "shift\t{%2, %0|%0, %2}";
}

Thanks,
Wei.
Wei Mi March 25, 2013, 5:33 p.m. UTC | #14
> I am trying to figure out a way not to lose the opportunity when shift
> truncation is not combined in a bit test pattern. Can we keep the
> explicit truncation in RTL, but generate truncation code in assembly?
> Then only shift truncation which not combined in a bit test
> pattershift truncationn will happen.
>
> (define_insn "*<shift_insn_and><mode>"
>   [(set (match_operand:SWI48 0 "nonimmediate_operand" "=rm")
>         (any_shiftrt:SWI48
>           (match_operand:SWI48 1 "nonimmediate_operand" "0")
>           (subreg:QI
>             (and:SI
>               (match_operand:SI 2 "nonimmediate_operand" "c")
>               (match_operand:SI 3 "const_int_operand" "n")) 0)))
>    (clobber (reg:CC FLAGS_REG))]
>   "ix86_binary_operator_ok (<CODE>, <MODE>mode, operands)"
> {
>    if ((INTVAL (operands[3]) & (GET_MODE_BITSIZE (<MODE>mode)-1))
>       == GET_MODE_BITSIZE (<MODE>mode)-1)
>       return "and\t{%3, %2|%2, %3}\n\r shift\t{%b2, %0|%0, %b2}";
>    else
>       "shift\t{%2, %0|%0, %2}";
> }

Sorry, rectify a mistake:

{
   if ((INTVAL (operands[3]) & (GET_MODE_BITSIZE (<MODE>mode)-1))
      == GET_MODE_BITSIZE (<MODE>mode)-1)
      return "shift\t{%2, %0|%0, %2}";
   else
      return "and\t{%3, %2|%2, %3}\n\r shift\t{%b2, %0|%0, %b2}";
}

Thanks,
Wei.
Richard Biener March 26, 2013, 9:14 a.m. UTC | #15
On Mon, Mar 25, 2013 at 6:33 PM, Wei Mi <wmi@google.com> wrote:
>> I am trying to figure out a way not to lose the opportunity when shift
>> truncation is not combined in a bit test pattern. Can we keep the
>> explicit truncation in RTL, but generate truncation code in assembly?
>> Then only shift truncation which not combined in a bit test
>> pattershift truncationn will happen.
>>
>> (define_insn "*<shift_insn_and><mode>"
>>   [(set (match_operand:SWI48 0 "nonimmediate_operand" "=rm")
>>         (any_shiftrt:SWI48
>>           (match_operand:SWI48 1 "nonimmediate_operand" "0")
>>           (subreg:QI
>>             (and:SI
>>               (match_operand:SI 2 "nonimmediate_operand" "c")
>>               (match_operand:SI 3 "const_int_operand" "n")) 0)))
>>    (clobber (reg:CC FLAGS_REG))]
>>   "ix86_binary_operator_ok (<CODE>, <MODE>mode, operands)"
>> {
>>    if ((INTVAL (operands[3]) & (GET_MODE_BITSIZE (<MODE>mode)-1))
>>       == GET_MODE_BITSIZE (<MODE>mode)-1)
>>       return "and\t{%3, %2|%2, %3}\n\r shift\t{%b2, %0|%0, %b2}";
>>    else
>>       "shift\t{%2, %0|%0, %2}";
>> }
>
> Sorry, rectify a mistake:
>
> {
>    if ((INTVAL (operands[3]) & (GET_MODE_BITSIZE (<MODE>mode)-1))
>       == GET_MODE_BITSIZE (<MODE>mode)-1)
>       return "shift\t{%2, %0|%0, %2}";
>    else
>       return "and\t{%3, %2|%2, %3}\n\r shift\t{%b2, %0|%0, %b2}";
> }

I'm not sure the existing patterns are wrong because SHIFT_COUNT_TRUNCATED
is false for x86 AFAIK, exactly because of the bit-test vs. shift instruction
differences.  So there is no inconsistency.  The i386 backend seems to
try to follow my suggestion as if SHIFT_COUNT_TRUNCATED didn't exist
(well, it's false, so it technically doesn't exist for i386) and recognizes
the shift with truncate with the *<shift_insn><mode>3_mask splitter.
But I'm not sure why it bothers to do it with a splitter instead of just
with a define_insn?  Because the split code,

  [(parallel [(set (match_dup 0)
                   (any_shiftrt:SWI48 (match_dup 1) (match_dup 2)))
              (clobber (reg:CC FLAGS_REG))])]

is wrong and could be combined into a bit-test instruction.  No?

That is, why not have define_insn variants for shift instructions with
explicit truncation?

Richard.


> Thanks,
> Wei.
Uros Bizjak March 26, 2013, 6:23 p.m. UTC | #16
On Tue, Mar 26, 2013 at 10:14 AM, Richard Biener
<richard.guenther@gmail.com> wrote:
>>> I am trying to figure out a way not to lose the opportunity when shift
>>> truncation is not combined in a bit test pattern. Can we keep the
>>> explicit truncation in RTL, but generate truncation code in assembly?
>>> Then only shift truncation which not combined in a bit test
>>> pattershift truncationn will happen.
>>>
>>> (define_insn "*<shift_insn_and><mode>"
>>>   [(set (match_operand:SWI48 0 "nonimmediate_operand" "=rm")
>>>         (any_shiftrt:SWI48
>>>           (match_operand:SWI48 1 "nonimmediate_operand" "0")
>>>           (subreg:QI
>>>             (and:SI
>>>               (match_operand:SI 2 "nonimmediate_operand" "c")
>>>               (match_operand:SI 3 "const_int_operand" "n")) 0)))
>>>    (clobber (reg:CC FLAGS_REG))]
>>>   "ix86_binary_operator_ok (<CODE>, <MODE>mode, operands)"
>>> {
>>>    if ((INTVAL (operands[3]) & (GET_MODE_BITSIZE (<MODE>mode)-1))
>>>       == GET_MODE_BITSIZE (<MODE>mode)-1)
>>>       return "and\t{%3, %2|%2, %3}\n\r shift\t{%b2, %0|%0, %b2}";
>>>    else
>>>       "shift\t{%2, %0|%0, %2}";
>>> }
>>
>> Sorry, rectify a mistake:
>>
>> {
>>    if ((INTVAL (operands[3]) & (GET_MODE_BITSIZE (<MODE>mode)-1))
>>       == GET_MODE_BITSIZE (<MODE>mode)-1)
>>       return "shift\t{%2, %0|%0, %2}";
>>    else
>>       return "and\t{%3, %2|%2, %3}\n\r shift\t{%b2, %0|%0, %b2}";
>> }
>
> I'm not sure the existing patterns are wrong because SHIFT_COUNT_TRUNCATED
> is false for x86 AFAIK, exactly because of the bit-test vs. shift instruction
> differences.  So there is no inconsistency.  The i386 backend seems to
> try to follow my suggestion as if SHIFT_COUNT_TRUNCATED didn't exist
> (well, it's false, so it technically doesn't exist for i386) and recognizes
> the shift with truncate with the *<shift_insn><mode>3_mask splitter.
> But I'm not sure why it bothers to do it with a splitter instead of just
> with a define_insn?  Because the split code,
>
>   [(parallel [(set (match_dup 0)
>                    (any_shiftrt:SWI48 (match_dup 1) (match_dup 2)))
>               (clobber (reg:CC FLAGS_REG))])]
>
> is wrong and could be combined into a bit-test instruction.  No?
>
> That is, why not have define_insn variants for shift instructions with
> explicit truncation?

You are right, the split is harmful in this case.

It looks to me, that explicit truncation can be added to split
patterns in the most elegant way using proposed "define_subst"
infrastructure.

Uros.
Wei Mi March 28, 2013, 4:34 a.m. UTC | #17
I am not familiar how to use define_subst, so I write a patch that
changes define_insn_and_split to define_insn. bootstrapped and
regression tested on x86_64-unknown-linux-gnu.

A question is: after that change, Is there anyway I can make
targetm.rtx_costs() aware about the truncation, .i.e the cost is only
a "shift" instead of "shift + and".

Thanks,
Wei.

On Tue, Mar 26, 2013 at 11:23 AM, Uros Bizjak <ubizjak@gmail.com> wrote:
> On Tue, Mar 26, 2013 at 10:14 AM, Richard Biener
> <richard.guenther@gmail.com> wrote:
>>>> I am trying to figure out a way not to lose the opportunity when shift
>>>> truncation is not combined in a bit test pattern. Can we keep the
>>>> explicit truncation in RTL, but generate truncation code in assembly?
>>>> Then only shift truncation which not combined in a bit test
>>>> pattershift truncationn will happen.
>>>>
>>>> (define_insn "*<shift_insn_and><mode>"
>>>>   [(set (match_operand:SWI48 0 "nonimmediate_operand" "=rm")
>>>>         (any_shiftrt:SWI48
>>>>           (match_operand:SWI48 1 "nonimmediate_operand" "0")
>>>>           (subreg:QI
>>>>             (and:SI
>>>>               (match_operand:SI 2 "nonimmediate_operand" "c")
>>>>               (match_operand:SI 3 "const_int_operand" "n")) 0)))
>>>>    (clobber (reg:CC FLAGS_REG))]
>>>>   "ix86_binary_operator_ok (<CODE>, <MODE>mode, operands)"
>>>> {
>>>>    if ((INTVAL (operands[3]) & (GET_MODE_BITSIZE (<MODE>mode)-1))
>>>>       == GET_MODE_BITSIZE (<MODE>mode)-1)
>>>>       return "and\t{%3, %2|%2, %3}\n\r shift\t{%b2, %0|%0, %b2}";
>>>>    else
>>>>       "shift\t{%2, %0|%0, %2}";
>>>> }
>>>
>>> Sorry, rectify a mistake:
>>>
>>> {
>>>    if ((INTVAL (operands[3]) & (GET_MODE_BITSIZE (<MODE>mode)-1))
>>>       == GET_MODE_BITSIZE (<MODE>mode)-1)
>>>       return "shift\t{%2, %0|%0, %2}";
>>>    else
>>>       return "and\t{%3, %2|%2, %3}\n\r shift\t{%b2, %0|%0, %b2}";
>>> }
>>
>> I'm not sure the existing patterns are wrong because SHIFT_COUNT_TRUNCATED
>> is false for x86 AFAIK, exactly because of the bit-test vs. shift instruction
>> differences.  So there is no inconsistency.  The i386 backend seems to
>> try to follow my suggestion as if SHIFT_COUNT_TRUNCATED didn't exist
>> (well, it's false, so it technically doesn't exist for i386) and recognizes
>> the shift with truncate with the *<shift_insn><mode>3_mask splitter.
>> But I'm not sure why it bothers to do it with a splitter instead of just
>> with a define_insn?  Because the split code,
>>
>>   [(parallel [(set (match_dup 0)
>>                    (any_shiftrt:SWI48 (match_dup 1) (match_dup 2)))
>>               (clobber (reg:CC FLAGS_REG))])]
>>
>> is wrong and could be combined into a bit-test instruction.  No?
>>
>> That is, why not have define_insn variants for shift instructions with
>> explicit truncation?
>
> You are right, the split is harmful in this case.
>
> It looks to me, that explicit truncation can be added to split
> patterns in the most elegant way using proposed "define_subst"
> infrastructure.
>
> Uros.
Uros Bizjak March 28, 2013, 3:49 p.m. UTC | #18
On Thu, Mar 28, 2013 at 5:34 AM, Wei Mi <wmi@google.com> wrote:
> I am not familiar how to use define_subst, so I write a patch that
> changes define_insn_and_split to define_insn. bootstrapped and
> regression tested on x86_64-unknown-linux-gnu.
>
> A question is: after that change, Is there anyway I can make
> targetm.rtx_costs() aware about the truncation, .i.e the cost is only
> a "shift" instead of "shift + and".

Please also change all operand 2 predicates to "register_operand".

2013-03-27  Wei Mi  <wmi@google.com>

	* config/i386/i386.md: Do shift truncation in define_insn
	instead of define_insn_and_split.

Please write ChangeLog as:

	* config/i386/i386.md (*ashl<mode>3_mask): Rewrite as define_insn.
	Truncate operand 2 using %b asm operand modifier.
	(*<shift_insn><mode>3_mask): Ditto.
	(*<rotate_insn><mode>3_mask): Ditto.

OK for mainline and all release branches with these changes.

Thanks,
Uros.
Wei Mi April 2, 2013, 5:43 a.m. UTC | #19
I attached the patch.4 based on r197308. r197308 changes shift-and
type truncation from define_insn_and_split to define_insn.  patch.4
changes ix86_rtx_costs for shift-and type rtx to get the correct cost
for the result after the shift-and truncation.

With the patch.1 ~ patch.4, fwprop extension could handle the
motivational case 1.c attached by removing all the redundent "x & 63"
operations.

patch.1~patch.4 regression and bootstrap ok on
x86_64-unknown-linux-gnu. Is it ok for trunk?

Thanks,
Wei.

On Sun, Mar 17, 2013 at 12:15 AM, Wei Mi <wmi@google.com> wrote:
> Hi,
>
> On Sat, Mar 16, 2013 at 3:48 PM, Steven Bosscher <stevenb.gcc@gmail.com> wrote:
>> On Tue, Mar 12, 2013 at 8:18 AM, Wei Mi wrote:
>>> For the motivational case, I need insn splitting to get the cost
>>> right. insn splitting is not very intrusive. All I need is to call
>>> split_insns func.
>>
>> It may not look very intrusive, but there's a lot happening in the
>> back ground. You're creating a lot of new RTL, and then just throw it
>> away again. You fake the compiler into thinking you're much deeper in
>> the pipeline than you really are. You're assuming there are no
>> side-effects other than that some insn gets split, but there are back
>> ends where splitters may have side-effects.
>
> Ok, then I will remove the split_insns call.
>
>>
>> Even though I've asked twice now, you still have not explained this
>> motivational case, except to say that there is one. *What* are you
>> trying to do, *what* is not happening without the splits, and *what*
>> happens if you split. Only if you explain that in a lot more detail
>> than "I have a motivational case" then we can look into what is a
>> proper solution.
>
> :-). Sorry, I didn't say it clearly. The motivational case is the one
> mentioned in the following posts (split_insns changes a << (b & 63) to
> a << b).
> http://gcc.gnu.org/ml/gcc/2013-01/msg00181.html
> http://gcc.gnu.org/ml/gcc-patches/2013-02/msg01144.html
>
> If I remove the split_insns call and related cost estimation
> adjustment, the fwprop 18-->22 and 18-->23 will punt, because fwprop
> here looks like a reverse process of cse, the total cost after fwprop
> change is increased.
>
> Def insn 18:
>         Use insn 23
>         Use insn 22
>
> If we include the split_insns cost estimation adjustment.
>   extra benefit by removing def insn 18 = 5
>   change[0]: benefit = 0, verified - ok  // The cost of insn 22 will
> not change after fwprop + insn splitting.
>   change[1]: benefit = 0, verified - ok  // The insn 23 is the same with insn 22
> Total benefit is 5, fwprop will go on.
>
> If we remove the split_insns cost estimation adjustment.
>   extra benefit by removing def insn 18 = 5
>   change[0]: benefit = -4, verified - ok   // The costs of insn 22 and
> insn 23 will increase after fwprop.
>   change[1]: benefit = -4, verified - ok   // The insn 23 is the same
> with insn 22
> Total benefit is -3, fwprop will punt.
>
> How about adding the (a << (b&63) ==> a << b) transformation in
> simplify_binary_operation_1, becuase (a << (b&63) ==> a << b) is a
> kind of architecture specific expr simplification? Then fwprop could
> do the propagation as I expect.
>
>>
>> The problem with some of the splitters is that they exist to break up
>> RTL from 'expand' to initially keep some pattern together to allow the
>> code transformation passes to handle the pattern as one instruction.
>> This made sense when RTL was the only intermediate representation and
>> splitting too early would inhibit some optimizations. But I would
>> expect most (if not all) such cases to be less relevant because of the
>> GIMPLE middle-end work. The only splitters you can trigger are the
>> pre-reload splitters (all the reload_completed conditions obviously
>> can't trigger if you're splitting from fwprop). Perhaps those
>> splitters can/should run earlier, or be made obsolete by expanding
>> directly to the post-splitting insns.
>>
>> Unfortunately, it's not possible to tell for your case, because you
>> haven't explained it yet...
>>
>>
>>> So how about keep split_insns and remove peephole in the cost estimation func?
>>
>> I'd strongly oppose this. I do not believe this is necessary, and I
>> think it's conceptually wrong.
>>
>>
>>>> What happens if you propagate into an insn that uses the same register
>>>> twice? Will the DU chains still be valid (I don't think that's
>>>> guaranteed)?
>>>
>>> I think the DU chains still be valid. If propagate into the insn that
>>> uses the same register twice, the two uses will be replaced when the
>>> first use is seen (propagate_rtx_1 will propagate all the occurrances
>>> of the same reg in the use insn).  When the second use is seen, the
>>> df_use and use insn in its insn_info are still available.
>>> forward_propagate_into will early return after check reg_mentioned_p
>>> (DF_REF_REG (use), parent) and find out no reg is used  any more.
>>
>> With reg_mentioned_p you cannot verify that the DF_REF_LOC of USE is
>> still valid.
>
> I think DF_REF_LOC of USE may be invalid if dangling rtx will be
> recycled by garbage collection very soon (I don't know when GC will
> happen). Although DF_REF_LOC of USE maybe invalid, the early return in
> forward_propagate_into ensure it will not cause any correctness
> problem.
>
>>
>> In any case, returning to the RD problem for DU/UD chains is probably
>> a good idea, now that RD is not such a hog anymore. In effect fwprop.c
>> would return to what it looked like before the patch of r149010.
>
> I remove MD problem and use DU/UD instead.
>
>>
>> As a way forward on all of this, I'd suggest the following steps, each
>> with a separate patch:
>
> Thanks for the suggestion!
>
>> 1. replace the MD problem with RD again, and build full DU/UD chains.
>
> I include patch.1 attached.
>
>> 2. post all the recog changes separately, with minimum impact on the
>> parts of the compiler you don't really change. (For apply_change_group
>> you could even choose to overload it, or use a NUM argument with a
>> default value -- not sure if default argument values are OK for GCC
>> tho'.)
>
> patch.2 attached.
>
>> 3. implement propagation into multiple USEs, but without the splitting
>> and peepholing.
>
> patch.3 attached.
>
>> 4. see about fixing the back end to either split earlier or expand to
>> the desired patterns directly.
>
> I havn't included this part. If you agree with the proposal to add the
> transformation (a << (b&63) ==> a << b) in
> simplify_binary_operation_1, I will send out another patch about it.
>
> Thanks,
> Wei.
Wei Mi April 2, 2013, 5:44 a.m. UTC | #20
1.c attached.

On Mon, Apr 1, 2013 at 10:43 PM, Wei Mi <wmi@google.com> wrote:
> I attached the patch.4 based on r197308. r197308 changes shift-and
> type truncation from define_insn_and_split to define_insn.  patch.4
> changes ix86_rtx_costs for shift-and type rtx to get the correct cost
> for the result after the shift-and truncation.
>
> With the patch.1 ~ patch.4, fwprop extension could handle the
> motivational case 1.c attached by removing all the redundent "x & 63"
> operations.
>
> patch.1~patch.4 regression and bootstrap ok on
> x86_64-unknown-linux-gnu. Is it ok for trunk?
>
> Thanks,
> Wei.
>
> On Sun, Mar 17, 2013 at 12:15 AM, Wei Mi <wmi@google.com> wrote:
>> Hi,
>>
>> On Sat, Mar 16, 2013 at 3:48 PM, Steven Bosscher <stevenb.gcc@gmail.com> wrote:
>>> On Tue, Mar 12, 2013 at 8:18 AM, Wei Mi wrote:
>>>> For the motivational case, I need insn splitting to get the cost
>>>> right. insn splitting is not very intrusive. All I need is to call
>>>> split_insns func.
>>>
>>> It may not look very intrusive, but there's a lot happening in the
>>> back ground. You're creating a lot of new RTL, and then just throw it
>>> away again. You fake the compiler into thinking you're much deeper in
>>> the pipeline than you really are. You're assuming there are no
>>> side-effects other than that some insn gets split, but there are back
>>> ends where splitters may have side-effects.
>>
>> Ok, then I will remove the split_insns call.
>>
>>>
>>> Even though I've asked twice now, you still have not explained this
>>> motivational case, except to say that there is one. *What* are you
>>> trying to do, *what* is not happening without the splits, and *what*
>>> happens if you split. Only if you explain that in a lot more detail
>>> than "I have a motivational case" then we can look into what is a
>>> proper solution.
>>
>> :-). Sorry, I didn't say it clearly. The motivational case is the one
>> mentioned in the following posts (split_insns changes a << (b & 63) to
>> a << b).
>> http://gcc.gnu.org/ml/gcc/2013-01/msg00181.html
>> http://gcc.gnu.org/ml/gcc-patches/2013-02/msg01144.html
>>
>> If I remove the split_insns call and related cost estimation
>> adjustment, the fwprop 18-->22 and 18-->23 will punt, because fwprop
>> here looks like a reverse process of cse, the total cost after fwprop
>> change is increased.
>>
>> Def insn 18:
>>         Use insn 23
>>         Use insn 22
>>
>> If we include the split_insns cost estimation adjustment.
>>   extra benefit by removing def insn 18 = 5
>>   change[0]: benefit = 0, verified - ok  // The cost of insn 22 will
>> not change after fwprop + insn splitting.
>>   change[1]: benefit = 0, verified - ok  // The insn 23 is the same with insn 22
>> Total benefit is 5, fwprop will go on.
>>
>> If we remove the split_insns cost estimation adjustment.
>>   extra benefit by removing def insn 18 = 5
>>   change[0]: benefit = -4, verified - ok   // The costs of insn 22 and
>> insn 23 will increase after fwprop.
>>   change[1]: benefit = -4, verified - ok   // The insn 23 is the same
>> with insn 22
>> Total benefit is -3, fwprop will punt.
>>
>> How about adding the (a << (b&63) ==> a << b) transformation in
>> simplify_binary_operation_1, becuase (a << (b&63) ==> a << b) is a
>> kind of architecture specific expr simplification? Then fwprop could
>> do the propagation as I expect.
>>
>>>
>>> The problem with some of the splitters is that they exist to break up
>>> RTL from 'expand' to initially keep some pattern together to allow the
>>> code transformation passes to handle the pattern as one instruction.
>>> This made sense when RTL was the only intermediate representation and
>>> splitting too early would inhibit some optimizations. But I would
>>> expect most (if not all) such cases to be less relevant because of the
>>> GIMPLE middle-end work. The only splitters you can trigger are the
>>> pre-reload splitters (all the reload_completed conditions obviously
>>> can't trigger if you're splitting from fwprop). Perhaps those
>>> splitters can/should run earlier, or be made obsolete by expanding
>>> directly to the post-splitting insns.
>>>
>>> Unfortunately, it's not possible to tell for your case, because you
>>> haven't explained it yet...
>>>
>>>
>>>> So how about keep split_insns and remove peephole in the cost estimation func?
>>>
>>> I'd strongly oppose this. I do not believe this is necessary, and I
>>> think it's conceptually wrong.
>>>
>>>
>>>>> What happens if you propagate into an insn that uses the same register
>>>>> twice? Will the DU chains still be valid (I don't think that's
>>>>> guaranteed)?
>>>>
>>>> I think the DU chains still be valid. If propagate into the insn that
>>>> uses the same register twice, the two uses will be replaced when the
>>>> first use is seen (propagate_rtx_1 will propagate all the occurrances
>>>> of the same reg in the use insn).  When the second use is seen, the
>>>> df_use and use insn in its insn_info are still available.
>>>> forward_propagate_into will early return after check reg_mentioned_p
>>>> (DF_REF_REG (use), parent) and find out no reg is used  any more.
>>>
>>> With reg_mentioned_p you cannot verify that the DF_REF_LOC of USE is
>>> still valid.
>>
>> I think DF_REF_LOC of USE may be invalid if dangling rtx will be
>> recycled by garbage collection very soon (I don't know when GC will
>> happen). Although DF_REF_LOC of USE maybe invalid, the early return in
>> forward_propagate_into ensure it will not cause any correctness
>> problem.
>>
>>>
>>> In any case, returning to the RD problem for DU/UD chains is probably
>>> a good idea, now that RD is not such a hog anymore. In effect fwprop.c
>>> would return to what it looked like before the patch of r149010.
>>
>> I remove MD problem and use DU/UD instead.
>>
>>>
>>> As a way forward on all of this, I'd suggest the following steps, each
>>> with a separate patch:
>>
>> Thanks for the suggestion!
>>
>>> 1. replace the MD problem with RD again, and build full DU/UD chains.
>>
>> I include patch.1 attached.
>>
>>> 2. post all the recog changes separately, with minimum impact on the
>>> parts of the compiler you don't really change. (For apply_change_group
>>> you could even choose to overload it, or use a NUM argument with a
>>> default value -- not sure if default argument values are OK for GCC
>>> tho'.)
>>
>> patch.2 attached.
>>
>>> 3. implement propagation into multiple USEs, but without the splitting
>>> and peepholing.
>>
>> patch.3 attached.
>>
>>> 4. see about fixing the back end to either split earlier or expand to
>>> the desired patterns directly.
>>
>> I havn't included this part. If you agree with the proposal to add the
>> transformation (a << (b&63) ==> a << b) in
>> simplify_binary_operation_1, I will send out another patch about it.
>>
>> Thanks,
>> Wei.
Uros Bizjak April 2, 2013, 6:46 a.m. UTC | #21
On Tue, Apr 2, 2013 at 7:43 AM, Wei Mi <wmi@google.com> wrote:
> I attached the patch.4 based on r197308. r197308 changes shift-and
> type truncation from define_insn_and_split to define_insn.  patch.4
> changes ix86_rtx_costs for shift-and type rtx to get the correct cost
> for the result after the shift-and truncation.
>
> With the patch.1 ~ patch.4, fwprop extension could handle the
> motivational case 1.c attached by removing all the redundent "x & 63"
> operations.
>
> patch.1~patch.4 regression and bootstrap ok on
> x86_64-unknown-linux-gnu. Is it ok for trunk?

> 2013-04-01  Wei Mi  <wmi@google.com>
>
>     * config/i386/i386.c (ix86_rtx_costs): Set proper rtx cost for
>     ashl<mode>3_mask, *<shift_insn><mode>3_mask and
    *<rotate_insn><mode>3_mask in i386.md.

Patch 4 is OK for mainline and also for release branches that were
changed by your previous i386.md patch.

Thanks,
Uros.
diff mbox

Patch

Index: cse.c
===================================================================
--- cse.c       (revision 196182)
+++ cse.c       (working copy)
@@ -3179,9 +3179,22 @@  fold_rtx (rtx x, rtx insn)

        switch (GET_CODE (folded_arg))
          {
+         case SUBREG:
+           /* If the SUBREG_REG comes in from an AND, and this is not a
+              paradoxical subreg, then try to fold the SUBREG.  */
+           if (REG_P (SUBREG_REG (folded_arg))
+               && ! paradoxical_subreg_p (folded_arg))
+             {
+               rtx y = lookup_as_function (SUBREG_REG (folded_arg), AND);
+               if (y != 0)
+                 y = simplify_gen_binary(AND, GET_MODE (folded_arg),
+                                         XEXP(y, 0), XEXP(y, 1));
+               if (y != 0)
+                 folded_arg = y;
+             }
+           /* ... fall through ...  */
          case MEM:
          case REG:
-         case SUBREG:
            const_arg = equiv_constant (folded_arg);
            break;