[C] : pr52543

Message ID	4F74CFB3.5090308@naturalbridge.com
State	New
Headers	show Return-Path: <gcc-patches-return-315827-incoming=patchwork.ozlabs.org@gcc.gnu.org> Comment: DKIM? See http://www.dkim.org Comment: DomainKeys? See http://antispam.yahoo.com/domainkeys DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws; s=default; d=gcc.gnu.org; h=Received:Received:X-SWARE-Spam-Status:X-Spam-Check-By:Received:Received:X-Google-DKIM-Signature:Received:Received:Message-ID:Date:From:User-Agent:MIME-Version:To:CC:Subject:References:In-Reply-To:Content-Type:X-Gm-Message-State:Mailing-List:Precedence:List-Id:List-Unsubscribe:List-Archive:List-Post:List-Help:Sender:Delivered-To; b=rV+CSEwjsyC4cqr7xjJnXkqwu4mqgqU2FnM2vnoPYMxzsCCMm6Q/wCWiV8qYrZ RPATGp5Tgys3HKQykgZ/fPt4SEy+zZQo1leLE2da6Rc3jAe4viH4icRqZovsObRv NOSzXgaJRUvzQEeb2jhmt1lc1Y2b0sottTBexPZ37MCls=; Message-ID: <4F74CFB3.5090308@naturalbridge.com> Date: Thu, 29 Mar 2012 17:10:11 -0400 From: Kenneth Zadeck <zadeck@naturalbridge.com> User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:11.0) Gecko/20120310 Thunderbird/11.0 MIME-Version: 1.0 To: gcc-patches <gcc-patches@gcc.gnu.org> CC: Ian Lance Taylor <iant@google.com>, Richard Sandiford <r.sandiford@uk.ibm.com>, Mike Stump <mikestump@comcast.net>, Kenneth Zadeck <Kenneth.Zadeck@NaturalBridge.com>, avr@gjlay.de Subject: [C Patch]: pr52543 References: <4F67D595.9030700@naturalbridge.com> <mcrd387n9bf.fsf@dhcp-172-18-216-180.mtv.corp.google.com> <4F6881EA.9080306@naturalbridge.com> <mcrvclzl7b7.fsf@dhcp-172-18-216-180.mtv.corp.google.com> <4F6889CC.8080302@naturalbridge.com> <mcrr4wnl6lb.fsf@dhcp-172-18-216-180.mtv.corp.google.com> In-Reply-To: <mcrr4wnl6lb.fsf@dhcp-172-18-216-180.mtv.corp.google.com> Content-Type: multipart/mixed; boundary="------------080009010904060101070302" Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk Sender: gcc-patches-owner@gcc.gnu.org

Kenneth Zadeck March 29, 2012, 9:10 p.m. UTC

This patch takes a different approach to fixing PR52543 than does the 
patch in

http://gcc.gnu.org/ml/gcc-patches/2012-03/msg00641.html

This patch transforms the lower-subreg pass(es) from unconditionally 
splitting wide moves, zero extensions, and shifts, so that it now takes 
into account the target specific costs and only does the transformations 
if it is profitable.

Unconditional splitting is a problem that not only occurs on the AVR but 
is also a problem on the ARM NEON and my private port.  Furthermore, it 
is a problem that is likely to occur on most modern larger machines 
since these machines are more likely to have fast instructions for 
moving things that are larger than word mode.

At compiler initialization time, each mode that is larger that a word 
mode is examined to determine if the cost of moving a value of that mode 
is less expensive that inserting the proper number of word sided 
moves.   If it is cheaper to split it up, a bit is set to allow moves of 
that mode to be lowered.

A similar analysis is made for the zero extensions and shifts except 
that lower subreg had been (and is still limited to only breaking up 
these operations if the target size was twice the size of word mode.)  
Also, if the analysis determines that there are no profitable 
transformations, the pass exits quickly without doing any analysis.

It is quite likely that most ports will have to be adjusted after this 
patch is accepted.   For instance, the analysis discovers that there are 
no profitable transformations to be performed on the x86-64.    Since 
this is not my platform, I have no idea if these are the correct 
settings.   But the pass uses the standard insn_rtx_cost interface and 
it is the port maintainers responsibility to not lie to the optimization 
passes so this extra work in stage one should be acceptable.

I do know from a private conversation with Richard Sandiford, that mips 
patches are likely forthcoming.

There is preprocessor controlled code that prints out the cost 
analysis.  Only a summary of this can go in the subregs dump file 
because the analysis is called from backend_init_target and so the dump 
file is not available.      But it is very useful to define LOG_COSTS 
when adjusting your port.

There is also preprocessor code that forces all of the lowering 
operations to marked as profitable.  This is useful in debugging the new 
logic.

Both of these preprocessor symbols are documented at the top of the pass.

I have tested this on an x86_64 with both the force lowering on and off 
and neither cause any regressions as well as extensive testing on my port.

Ok to commit?

Kenny

2012-03-29 Kenneth Zadeck <zadeck@naturalbridge.com>

     * toplev.c (backend_init_target): Call initializer for lower-subreg 
pass.

     * lower-subreg.c (move_modes_to_split, splitting_ashift, 
splitting_lshiftrt)
     splitting_zext, splitting_some_shifts, twice_word_mode,    
something_to_do,
     word_mode_move_cost, move_zero_cost): New static vars.
     (compute_move_cost, profitable_shift_p, init_lower_subreg): New
     functions.
     (find_pseudo_copy, resolve_simple_move): Added code to only split 
based on costs.
     (find_decomposable_subregs): Added code to mark as decomposable
     moves that are not profitable.
     (find_decomposable_shift_zext): Added code to only decompose
     shifts and zext if profitable.
     (resolve_shift_zext): Added comment.
     (decompose_multiword_subregs): Dump list of profitable
     transformations.  Add code to skip non profitable transformations.

     *rtl.h(init_lower_subreg): Added declaration.

Ramana Radhakrishnan March 30, 2012, 8:33 a.m. UTC | #1

On 29 March 2012 22:10, Kenneth Zadeck <zadeck@naturalbridge.com> wrote:
> This patch takes a different approach to fixing PR52543 than does the patch
> in
>
> http://gcc.gnu.org/ml/gcc-patches/2012-03/msg00641.html
>
> This patch transforms the lower-subreg pass(es) from unconditionally
> splitting wide moves, zero extensions, and shifts, so that it now takes into
> account the target specific costs and only does the transformations if it is
> profitable.
>
> Unconditional splitting is a problem that not only occurs on the AVR but is
> also a problem on the ARM NEON and my private port.  Furthermore, it is a
> problem that is likely to occur on most modern larger machines since these
> machines are more likely to have fast instructions for moving things that
> are larger than word mode.

Nice - this means that atleast one pending patches for subreg
style operations for neon intrinsics can go in after appropriate tweaking.
of costs. It probably requires some tweaking and benchmarking on ARM, but
the case where we saw such spills to the stack with subreg style operations is
now much improved , indicating that the existing costs infrastructure
manages to get this right atleast for this case.

Richard(S) - If you remember your PR48941 patch - after applying the
lower-subreg patch I now see far better code and what one gets out of
-fno-split-wide-types but a lot of that gratuitous spillng has gone away.

There are still too many moves between Neon registers but there are
far less moves
to the integer side and the gratuitous spilling is now gone.

old on left - new on right ( i.e. Kenneth's patch + Richard's PR48941 patch
http://patchwork.ozlabs.org/patch/130429/)

regards
Ramana
cross:                                                          cross:
        @ args = 0, pretend = 0, frame = 16                   |         @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 1, uses_anonymous_args = 0           |         @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.                                @ link register save eliminated.
        str     fp, [sp, #-4]!                                |         vldmia  r0, {d26-d27}
        add     fp, sp, #0                                    |         vldmia  r1, {d24-d25}
        sub     sp, sp, #20                                   |         vmov    q10, q13  @ v4sf
        vldmia  r0, {d16-d17}                                 |         vmov    q11, q13  @ v4sf
        vmov    q10, q8  @ v4sf                               |         vmov    q8, q12  @ v4sf
        sub     sp, sp, #48                                   |         vmov    q9, q12  @ v4sf
                                                              >         vzip.32 q10, q11
                                                              >         vzip.32 q8, q9
                                                              >         vmov    q14, q10  @ v4sf
        vmov    q12, q8  @ v4sf                                         vmov    q12, q8  @ v4sf
        add     r3, sp, #15                                   |         vmov    d21, d22  @ v2sf
        bic     r3, r3, #15                                   |         vmul.f32        d16, d29, d18
        vzip.32 q10, q12                                      |         vmul.f32        d17, d21, d24
        vstmia  r3, {d20-d21}                                 |         vmov    d19, d18  @ v2sf
        vstr    d24, [r3, #16]                                |         vmul.f32        d18, d28, d25
        vstr    d25, [r3, #24]                                |         vmls.f32        d16, d21, d25
        vldmia  r1, {d16-d17}                                 |         vmls.f32        d17, d28, d19
        vmov    q9, q8  @ v4sf                                |         vmls.f32        d18, d29, d24
        vmov    q11, q8  @ v4sf                               |         vmov    d26, d16  @ v2sf
        vzip.32 q9, q11                                       |         vmov    d27, d17  @ v2sf
        vstmia  r3, {d18-d19}                                 |         vmov    d17, d18  @ v2sf
        vstr    d22, [r3, #16]                                |         vuzp.32 d26, d27
        vstr    d23, [r3, #24]                                |         vmov    d16, d26  @ v2sf
        vmov    d25, d18  @ v2sf                              |         vmov    r0, r1, d16  @ v4sf
        vmul.f32        d17, d21, d22                         |         vmov    r2, r3, d17
        vmul.f32        d18, d24, d18                         <
        vmov    d16, d19  @ v2sf                              <
        vmul.f32        d19, d20, d19                         <
        vmls.f32        d17, d24, d16                         <
        vmls.f32        d18, d20, d22                         <
        vmls.f32        d19, d21, d25                         <
        vuzp.32 d17, d18                                      <
        vmov    d20, d17  @ v2sf                              <
        vmov    d21, d19  @ v2sf                              <
        vmov    r0, r1, d20  @ v4sf                           <
        vmov    r2, r3, d21                                   <
        add     sp, fp, #0                                    <
        ldmfd   sp!, {fp}                                     <
        bx      lr                                                      bx      lr

Richard Sandiford March 30, 2012, 10:08 a.m. UTC | #2

Kenneth Zadeck <zadeck@naturalbridge.com> writes:
> This patch takes a different approach to fixing PR52543 than does the 
> patch in
>
> http://gcc.gnu.org/ml/gcc-patches/2012-03/msg00641.html
>
> This patch transforms the lower-subreg pass(es) from unconditionally 
> splitting wide moves, zero extensions, and shifts, so that it now takes 
> into account the target specific costs and only does the transformations 
> if it is profitable.

Thanks for doing this.

> +   This pass only splits moves with modes are wider than word_mode and

...modes that are wider...

> +   ashifts, lshiftrt and zero_extensions with integer modes that are

Nitlet: lshiftrts to be consistent.  Or "ASHIFTs, LSHIFTRTs and ZERO_EXTENDs".

> +   There are two useful preprocessor defines for use by maintainers:  
> +
> +   #define LOG_COSTS
> +
> +   if you wish to see the actual cost estimates that are being used
> +   for each mode wider than word mode and the cost estimates for zero
> +   extension and the shifts.   This can be useful when port maintainers 
> +   are tuning insn rtx costs.
> +
> +   #define FORCE_LOWERING
> +
> +   if you wish to test the pass with all the transformation forced on.
> +   This can be useful for finding bugs in the transformations.

Must admit I'm not keen on these kinds of macro, but it's Ian's call.

Idea for the future (i.e. not this patch) is to have a dump file for
target initialisation.

> +/* This pass can transform 4 different operations: move, ashift,
> +   lshiftrt, and zero_extend.  There is a boolean vector for move
> +   splitting that is indexed by mode and is true for each mode that is
> +   to have its copies split.  The other three operations are only done
> +   for one mode so they are only controlled by a single boolean  .*/

As mentioned privately, whether this is profitable for shifts depends
to some extent on the shift amount.  GCC already supports targets where
this transformation would be OK for some shift amounts but not others.
So for shifts, I think this should be an array of HOST_BITS_PER_WIDE_INT
booleans rather than just one.

More comments below about how this filters through your other changes.

> +/* This pass can transform 4 different operations: move, ashift,
> +   lshiftrt, and zero_extend.  There is a boolean vector for move
> +   splitting that is indexed by mode and is true for each mode that is
> +   to have its copies split.  The other three operations are only done
> +   for one mode so they are only controlled by a single boolean  .*/
> +static bool move_modes_to_split[MAX_MACHINE_MODE];
> +
> +/* Other splitting operations to be done on the this platform.  */
> +static bool splitting_ashift;
> +static bool splitting_lshiftrt;
> +static bool splitting_zext;
> +/* The or of the previous three values.  */
> +static bool splitting_some_shifts;
> +
> +/* The integer mode that is twice the width of word_mode.  */
> +static enum machine_mode twice_word_mode;
> +
>  /* Bit N in the bitmap in element M of this array is set if there is a
>     copy from reg M to reg N.  */
>  static VEC(bitmap,heap) *reg_copy_graph;
>  
> -/* Return whether X is a simple object which we can take a word_mode
> -   subreg of.  */
> +static bool something_to_do;
> +
> +/* Precomputed cost values used to determine if lowering shift
> +   operations is profitable.  */ 
> +static int word_mode_move_cost;
> +static int move_zero_cost;
> +

All this new data should be wrapped in a target structure.
See expmed.[hc] for an example.

> +    return wide_cost > word_mode_move_cost + move_zero_cost;

Should be >=.  The purpose of this transformation isn't AIUI to improve
shifts in themselves.  It's just to stop shifts from being an artificial
barrier to lower-subreg's main "optimisation" -- which in some ways is more
of an IR simplification -- i.e., replacing subregs with regs.  We want
to do it if the costs of the two sequences are equal.  More justification
at the end.

> +  else
> +    {
> +      int narrow_cost;
> +
> +      trg = gen_rtx_REG (word_mode, FIRST_PSEUDO_REGISTER);
> +      src = gen_rtx_REG (word_mode, FIRST_PSEUDO_REGISTER);

Probably best to use FIRST_PSEUDO_REGISTER + 1 for the second (here and
elsewhere) so that we get the cost of a general 3-operand shift rather
than a 2-operand one.

> +      narrow_cost = insn_rtx_cost (gen_rtx_SET (VOIDmode, trg, 
> +						gen_rtx_fmt_ee (code, word_mode, 
> +								src, 
> +								GEN_INT (shift_amt - word_mode_size))),

Watch the long lines.  Everything should be less than 80 chars.

> +      return wide_cost > narrow_cost + move_zero_cost;

">=" here too.

> +/* Initialize pass once per execution.  This envolves determining

Once per target.  And since it can be called more than once, the
function needs to zero out the new data rather than rely on load-time
zero-initialisation.

> +      t = compute_move_cost ((enum machine_mode) i);
> +
> +#ifdef LOG_COSTS
> +      fprintf (stderr, "mode %s, move cost %d, factor %d\n", mode_name[i], t, factor);
> +#endif
> +      if (t > word_mode_move_cost * factor)

">=" here too.

> +  /* The only case here to check to see if moving the upper part with a
> +     zero is cheaper than doing the zext itself.  There is some theory
> +     that any well implemented port would just have the pattern.  */

Don't really get this comment.  These days, I don't think ports are
really expected to define double-word patterns that simply decompose
to the usual (target-independent) word-mode sequence.  Ports used to do
that because having a unified operation helped the earlier rtl optimisers,
but those cases are now handled at the gimple level.  These days it seems
better to let optabs lower the thing immediately and allow the individual
word ops (which weren't exposed at the gimple level) to get maximum
optimisation.

Maybe easiest to remove the second sentence rather than argue about it.

> +  if (insn_rtx_cost (pat, true) > word_mode_move_cost + move_zero_cost) 

">=" :-)

> @@ -173,7 +364,7 @@ find_pseudo_copy (rtx set)
>    if (HARD_REGISTER_NUM_P (rd) || HARD_REGISTER_NUM_P (rs))
>      return false;
>  
> -  if (GET_MODE_SIZE (GET_MODE (dest)) <= UNITS_PER_WORD)
> +  if (!move_modes_to_split[(int)GET_MODE (dest)])
>      return false;

Space after "(int)".

> @@ -335,6 +526,11 @@ find_decomposable_subregs (rtx *px, void
>  		bitmap_set_bit (decomposable_context, regno);
>  	      break;
>  	    case SIMPLE_MOVE:
> +	      /* If this is marked as a simple move with a mode this
> +		 large, it is because find_pseudo_copy returned false
> +		 because this was not a mode we wanted to split.  */
> +	      if (!move_modes_to_split[GET_MODE (x)])

I think you need the (int) here too.  (Should have been caught by boostrap
if so.)

> @@ -668,7 +864,7 @@ resolve_simple_move (rtx set, rtx insn)
>    orig_mode = GET_MODE (dest);
>  
>    words = (GET_MODE_SIZE (orig_mode) + UNITS_PER_WORD - 1) / UNITS_PER_WORD;
> -  if (words <= 1)
> +  if (!move_modes_to_split[(int)GET_MODE (dest)])
>      return insn;
>  
>    start_sequence ();

Space after "(int)".

>    op = SET_SRC (set);
> -  if (GET_CODE (op) != ASHIFT
> -      && GET_CODE (op) != LSHIFTRT
> -      && GET_CODE (op) != ZERO_EXTEND)
> +  mode = GET_MODE (op);
> +
> +  if (mode != twice_word_mode)
> +    return 0;
> +
> +  switch (GET_CODE (op)) {
> +  case ASHIFT:
> +    if (splitting_ashift)
> +      break;
> +    return 0;
> +    
> +  case LSHIFTRT:
> +    if (splitting_lshiftrt)
> +      break;
> +    return 0;
> +    
> +  case ZERO_EXTEND:
> +    if (splitting_zext)
> +      break;
> +    return 0;

With the boolean array change suggested above, we'd not bother with this
bit and check the cached cost information:

> @@ -961,23 +1178,26 @@ find_decomposable_shift_zext (rtx insn)
>  
>    if (GET_CODE (op) == ZERO_EXTEND)
>      {
> -      if (GET_MODE (op_operand) != word_mode
> -	  || GET_MODE_BITSIZE (GET_MODE (op)) != 2 * BITS_PER_WORD)
> +      if (GET_MODE (op_operand) != word_mode)
>  	return 0;
>      }

...here and:

>    else /* left or right shift */
>      {
> +      int shift_val;
> +
>        if (!CONST_INT_P (XEXP (op, 1))
> -	  || INTVAL (XEXP (op, 1)) < BITS_PER_WORD
> -	  || GET_MODE_BITSIZE (GET_MODE (op_operand)) != 2 * BITS_PER_WORD)
> +	  || INTVAL (XEXP (op, 1)) < BITS_PER_WORD)
> +	return 0;
> +
> +      shift_val = INTVAL (XEXP (op, 1));
> +      if (!profitable_shift_p (GET_CODE (op), shift_val))
>  	return 0;
> +
> +      bitmap_set_bit (decomposable_context, REGNO (op_operand));
>      }

...here (using the boolean array instead of profitable_shift_p).

> @@ -1094,24 +1342,41 @@ decompose_multiword_subregs (void)
>       all the insns.  */
>    {
>      unsigned int i;
> +    bool useful_modes_seen = false;
>  
>      for (i = FIRST_PSEUDO_REGISTER; i < max; ++i)
>        {
> -	if (regno_reg_rtx[i] != NULL
> -	    && GET_MODE_SIZE (GET_MODE (regno_reg_rtx[i])) > UNITS_PER_WORD)
> -	  break;
> +	if (regno_reg_rtx[i] != NULL)
> +	  {
> +	    enum machine_mode mode = GET_MODE (regno_reg_rtx[i]);
> +	    if (regno_reg_rtx[i] != NULL 
> +		&& (move_modes_to_split [mode] 
> +		    || (splitting_some_shifts && mode == twice_word_mode)))

Here I think we should just check move_modes_to_split.  We should
not split a mode for which moves get more expensive, regardless of
whether shifts are cheaper.  lower-subreg is designed to help later
passes produce better code.  Splitting a shift when the associated
move is more expensive is likely to encourage those passes to generate
more expensive move sequences instead of cheaper ones.  (Includes RA
and reload.)

Besides, I claim that it makes no sense to define a double-word
shift pattern that is inherently more expensive than what optabs would
generate without the double-word shift pattern.  So I think the ">="s in
the (modified) shift cost calculations should act as "=="s in practice
(but should still be coded as ">="s for consistency).  That is, we'll
allow the shift to be split if the cost of doing so is the same as the
original shift, but not otherwise.

Or the short version: if splitting double-word moves is expensive,
skip all the zext and shift stuff too.

Looks good to me otherwise, but it's Ian's pass.

Richard

Kenneth Zadeck March 30, 2012, 3:24 p.m. UTC | #3

>> +   There are two useful preprocessor defines for use by maintainers:
>> +
>> +   #define LOG_COSTS
>> +
>> +   if you wish to see the actual cost estimates that are being used
>> +   for each mode wider than word mode and the cost estimates for zero
>> +   extension and the shifts.   This can be useful when port maintainers
>> +   are tuning insn rtx costs.
>> +
>> +   #define FORCE_LOWERING
>> +
>> +   if you wish to test the pass with all the transformation forced on.
>> +   This can be useful for finding bugs in the transformations.
> Must admit I'm not keen on these kinds of macro, but it's Ian's call.
>
> Idea for the future (i.e. not this patch) is to have a dump file for
> target initialisation.
Imagine my horror when i did all of this as you had privately suggested 
and discovered that there was no way to log what i was doing.  This is 
good enough until someone wants to fix the general problem.

>> +/* This pass can transform 4 different operations: move, ashift,
>> +   lshiftrt, and zero_extend.  There is a boolean vector for move
>> +   splitting that is indexed by mode and is true for each mode that is
>> +   to have its copies split.  The other three operations are only done
>> +   for one mode so they are only controlled by a single boolean  .*/
> As mentioned privately, whether this is profitable for shifts depends
> to some extent on the shift amount.  GCC already supports targets where
> this transformation would be OK for some shift amounts but not others.
> So for shifts, I think this should be an array of HOST_BITS_PER_WIDE_INT
> booleans rather than just one.
>
> More comments below about how this filters through your other changes.
I think that you actually are missing what i am doing with this.   I 
look at 3 representative values that "should" discover any non 
uniformities.   If any of them are profitable, i set this bit.  Then at 
the point where i really have to pull the trigger on a real instance, i 
check the shift amount used at that spot to see if the individual shift 
is profitable.

I did this for two reasons.   One of them was that i was a little 
concerned that HOST_BITS_PER_WIDE_INT on the smallest host was not as 
big as the bitsize of word_word mode on the largest target (it could be 
but this knowledge is above my pay grade).   The other reason was did 
not see this as a common operation and checking it on demand seemed like 
the winner.



I will do everything else you mention and resubmit after i fix ramana's ice.

Richard Sandiford March 30, 2012, 3:41 p.m. UTC | #4

Kenneth Zadeck <zadeck@naturalbridge.com> writes:
>>> +   There are two useful preprocessor defines for use by maintainers:
>>> +
>>> +   #define LOG_COSTS
>>> +
>>> +   if you wish to see the actual cost estimates that are being used
>>> +   for each mode wider than word mode and the cost estimates for zero
>>> +   extension and the shifts.   This can be useful when port maintainers
>>> +   are tuning insn rtx costs.
>>> +
>>> +   #define FORCE_LOWERING
>>> +
>>> +   if you wish to test the pass with all the transformation forced on.
>>> +   This can be useful for finding bugs in the transformations.
>> Must admit I'm not keen on these kinds of macro, but it's Ian's call.
>>
>> Idea for the future (i.e. not this patch) is to have a dump file for
>> target initialisation.
> Imagine my horror when i did all of this as you had privately suggested 
> and discovered that there was no way to log what i was doing.  This is 
> good enough until someone wants to fix the general problem.
>
>>> +/* This pass can transform 4 different operations: move, ashift,
>>> +   lshiftrt, and zero_extend.  There is a boolean vector for move
>>> +   splitting that is indexed by mode and is true for each mode that is
>>> +   to have its copies split.  The other three operations are only done
>>> +   for one mode so they are only controlled by a single boolean  .*/
>> As mentioned privately, whether this is profitable for shifts depends
>> to some extent on the shift amount.  GCC already supports targets where
>> this transformation would be OK for some shift amounts but not others.
>> So for shifts, I think this should be an array of HOST_BITS_PER_WIDE_INT
>> booleans rather than just one.
>>
>> More comments below about how this filters through your other changes.
> I think that you actually are missing what i am doing with this.   I 
> look at 3 representative values that "should" discover any non 
> uniformities.   If any of them are profitable, i set this bit.  Then at 
> the point where i really have to pull the trigger on a real instance, i 
> check the shift amount used at that spot to see if the individual shift 
> is profitable.

No, I got that.  I just think it's an unnecessary complication.

> I did this for two reasons.   One of them was that i was a little 
> concerned that HOST_BITS_PER_WIDE_INT on the smallest host was not as 
> big as the bitsize of word_word mode on the largest target (it could be 
> but this knowledge is above my pay grade).

Ah, yes, sorry, I meant an array of BITS_PER_WORD booleans.  I had
HOST_WIDE_INT on the brain after Mike's patch.

> The other reason was did not see this as a common operation and
> checking it on demand seemed like the winner.

But (at least after the other changes I mentioned) these overall
booleans cut out only a very small portion of find_decomposable_shift_zext.
I.e.:

  op = SET_SRC (set);
  if (GET_CODE (op) != ASHIFT
      && GET_CODE (op) != LSHIFTRT
      && GET_CODE (op) != ZERO_EXTEND)  <-- unified booleans checked here
    return 0;

  op_operand = XEXP (op, 0);
  if (!REG_P (SET_DEST (set)) || !REG_P (op_operand)
      || HARD_REGISTER_NUM_P (REGNO (SET_DEST (set)))
      || HARD_REGISTER_NUM_P (REGNO (op_operand))
      || !SCALAR_INT_MODE_P (GET_MODE (op)))
    return 0;

  if (GET_CODE (op) == ZERO_EXTEND)
    {
      if (GET_MODE (op_operand) != word_mode
	  || GET_MODE_BITSIZE (GET_MODE (op)) != 2 * BITS_PER_WORD)
	return 0;
    }
  else /* left or right shift */
    {
      <--- specific booleans checked here
      if (!CONST_INT_P (XEXP (op, 1))
	  || INTVAL (XEXP (op, 1)) < BITS_PER_WORD
	  || GET_MODE_BITSIZE (GET_MODE (op_operand)) != 2 * BITS_PER_WORD)
	return 0;
    }

It seems better (and simpler) not to prejudge which shift amounts are
interesting and instead cache the "win or no win" flag for each value.

As I say, this is all in the context of this pass not being interesting
for modes where the split move is strictly more expensive than the
unified move, regardless of shift & zext costs.

Richard

Kenneth Zadeck March 30, 2012, 7:29 p.m. UTC | #5

ramana

i get the same failure on the trunk without my patch.

kenny

On 03/30/2012 07:36 AM, Ramana Radhakrishnan wrote:
> Hi
>
>
>> I have tested this on an x86_64 with both the force lowering on and off and
>> neither cause any regressions as well as extensive testing on my port.
>>
> So, just out of curiosity,  I decided to run this through a
> cross-build and noticed the following ICE with eglibc. I haven't had
> the time to debug this further but it does appear as though it could
> do with some more testing on some more ports and this probably needs
> some tuning as you say.
>
> $>  /work/cross-build/fsf/arm-none-linux-gnueabi/tools-lowersubregchanges-patched/bin/arm-none-linux-gnueabi-gcc
> -c -O2 ./besttry.c  -mfloat-abi=soft -march=armv5te
> ./besttry.c: In function ‘_IO_new_file_write’:
> ./besttry.c:36:1: internal compiler error: in get_loop_body, at cfgloop.c:831
>
> $>  cat besttry.c
> __extension__ typedef int __ssize_t;
> extern __thread int __libc_errno __attribute__ ((tls_model ("initial-exec")));
> struct _IO_FILE {
>    int _fileno;
>    int _flags2;
> };
> typedef struct _IO_FILE _IO_FILE;
> _IO_new_file_write (f,
>        data,
>        n)
>       _IO_FILE *f;
> {
>    __ssize_t to_do = n;
>    while (to_do>  0)
>      {
>        __ssize_t count =
>   (__builtin_expect (f->_flags2&  2, 0) ?
>    ({ unsigned int _sys_result = ({ register int _a1 asm ("r0"), _nr asm ("r7");
>          int _a3tmp = (int) ((to_do));
>          int _a2tmp = (int) ((data));
>          register int _a2 asm ("a2") = _a2tmp;
>          register int _a3 asm ("a3") = _a3tmp; _nr = ((0 + 4));
>          asm volatile ("swi	0x0	@ syscall " "SYS_ify(write)" : "=r"
> (_a1) : "r" (_nr) , "r" (_a1), "r" (_a2), "r" (_a3) : "memory"); _a1;
> });
>      if (__builtin_expect (((unsigned int) (_sys_result)>= 0xfffff001u), 0))
>        { (__libc_errno = ((-(_sys_result))));
>          _sys_result = (unsigned int) -1; }
>      (int) _sys_result; })
>    : __write (f->_fileno, data, to_do));
>        if (count<  0)
>   {
>     break;
>          }
>        to_do -= count;
>      }
> }
>
>
> Ramana
>
> On 29 March 2012 22:10, Kenneth Zadeck<zadeck@naturalbridge.com>  wrote:
>> This patch takes a different approach to fixing PR52543 than does the patch
>> in
>>
>> http://gcc.gnu.org/ml/gcc-patches/2012-03/msg00641.html
>>
>> This patch transforms the lower-subreg pass(es) from unconditionally
>> splitting wide moves, zero extensions, and shifts, so that it now takes into
>> account the target specific costs and only does the transformations if it is
>> profitable.
>>
>> Unconditional splitting is a problem that not only occurs on the AVR but is
>> also a problem on the ARM NEON and my private port.  Furthermore, it is a
>> problem that is likely to occur on most modern larger machines since these
>> machines are more likely to have fast instructions for moving things that
>> are larger than word mode.
>>
>> At compiler initialization time, each mode that is larger that a word mode
>> is examined to determine if the cost of moving a value of that mode is less
>> expensive that inserting the proper number of word sided moves.   If it is
>> cheaper to split it up, a bit is set to allow moves of that mode to be
>> lowered.
>>
>> A similar analysis is made for the zero extensions and shifts except that
>> lower subreg had been (and is still limited to only breaking up these
>> operations if the target size was twice the size of word mode.)  Also, if
>> the analysis determines that there are no profitable transformations, the
>> pass exits quickly without doing any analysis.
>>
>> It is quite likely that most ports will have to be adjusted after this patch
>> is accepted.   For instance, the analysis discovers that there are no
>> profitable transformations to be performed on the x86-64.    Since this is
>> not my platform, I have no idea if these are the correct settings.   But the
>> pass uses the standard insn_rtx_cost interface and it is the port
>> maintainers responsibility to not lie to the optimization passes so this
>> extra work in stage one should be acceptable.
>>
>> I do know from a private conversation with Richard Sandiford, that mips
>> patches are likely forthcoming.
>>
>> There is preprocessor controlled code that prints out the cost analysis.
>>   Only a summary of this can go in the subregs dump file because the analysis
>> is called from backend_init_target and so the dump file is not available.
>>     But it is very useful to define LOG_COSTS when adjusting your port.
>>
>> There is also preprocessor code that forces all of the lowering operations
>> to marked as profitable.  This is useful in debugging the new logic.
>>
>> Both of these preprocessor symbols are documented at the top of the pass.
>>
>> I have tested this on an x86_64 with both the force lowering on and off and
>> neither cause any regressions as well as extensive testing on my port.
>>
>> Ok to commit?
>>
>> Kenny
>>
>> 2012-03-29 Kenneth Zadeck<zadeck@naturalbridge.com>
>>
>>     * toplev.c (backend_init_target): Call initializer for lower-subreg pass.
>>
>>     * lower-subreg.c (move_modes_to_split, splitting_ashift,
>> splitting_lshiftrt)
>>     splitting_zext, splitting_some_shifts, twice_word_mode,
>>   something_to_do,
>>     word_mode_move_cost, move_zero_cost): New static vars.
>>     (compute_move_cost, profitable_shift_p, init_lower_subreg): New
>>     functions.
>>     (find_pseudo_copy, resolve_simple_move): Added code to only split based
>> on costs.
>>     (find_decomposable_subregs): Added code to mark as decomposable
>>     moves that are not profitable.
>>     (find_decomposable_shift_zext): Added code to only decompose
>>     shifts and zext if profitable.
>>     (resolve_shift_zext): Added comment.
>>     (decompose_multiword_subregs): Dump list of profitable
>>     transformations.  Add code to skip non profitable transformations.
>>
>>     *rtl.h(init_lower_subreg): Added declaration.
>>
>>

Ramana Radhakrishnan March 30, 2012, 10:38 p.m. UTC | #6

On 30 March 2012 20:29, Kenneth Zadeck <zadeck@naturalbridge.com> wrote:
> ramana
>
> i get the same failure on the trunk without my patch.
>

In which case I apologise and will file a bug report separately. I
should really have checked :( .

Ramana

>
> kenny
>
> On 03/30/2012 07:36 AM, Ramana Radhakrishnan wrote:
>>
>> Hi
>>
>>
>>> I have tested this on an x86_64 with both the force lowering on and off
>>> and
>>> neither cause any regressions as well as extensive testing on my port.
>>>
>> So, just out of curiosity,  I decided to run this through a
>> cross-build and noticed the following ICE with eglibc. I haven't had
>> the time to debug this further but it does appear as though it could
>> do with some more testing on some more ports and this probably needs
>> some tuning as you say.
>>
>> $>
>>  /work/cross-build/fsf/arm-none-linux-gnueabi/tools-lowersubregchanges-patched/bin/arm-none-linux-gnueabi-gcc
>> -c -O2 ./besttry.c  -mfloat-abi=soft -march=armv5te
>> ./besttry.c: In function ‘_IO_new_file_write’:
>> ./besttry.c:36:1: internal compiler error: in get_loop_body, at
>> cfgloop.c:831
>>
>> $>  cat besttry.c
>> __extension__ typedef int __ssize_t;
>> extern __thread int __libc_errno __attribute__ ((tls_model
>> ("initial-exec")));
>> struct _IO_FILE {
>>   int _fileno;
>>   int _flags2;
>> };
>> typedef struct _IO_FILE _IO_FILE;
>> _IO_new_file_write (f,
>>       data,
>>       n)
>>      _IO_FILE *f;
>> {
>>   __ssize_t to_do = n;
>>   while (to_do>  0)
>>     {
>>       __ssize_t count =
>>  (__builtin_expect (f->_flags2&  2, 0) ?
>>
>>   ({ unsigned int _sys_result = ({ register int _a1 asm ("r0"), _nr asm
>> ("r7");
>>         int _a3tmp = (int) ((to_do));
>>         int _a2tmp = (int) ((data));
>>         register int _a2 asm ("a2") = _a2tmp;
>>         register int _a3 asm ("a3") = _a3tmp; _nr = ((0 + 4));
>>         asm volatile ("swi     0x0     @ syscall " "SYS_ify(write)" : "=r"
>> (_a1) : "r" (_nr) , "r" (_a1), "r" (_a2), "r" (_a3) : "memory"); _a1;
>> });
>>     if (__builtin_expect (((unsigned int) (_sys_result)>= 0xfffff001u),
>> 0))
>>       { (__libc_errno = ((-(_sys_result))));
>>         _sys_result = (unsigned int) -1; }
>>     (int) _sys_result; })
>>   : __write (f->_fileno, data, to_do));
>>       if (count<  0)
>>  {
>>    break;
>>         }
>>       to_do -= count;
>>     }
>> }
>>
>>
>> Ramana
>>
>> On 29 March 2012 22:10, Kenneth Zadeck<zadeck@naturalbridge.com>  wrote:
>>>
>>> This patch takes a different approach to fixing PR52543 than does the
>>> patch
>>> in
>>>
>>> http://gcc.gnu.org/ml/gcc-patches/2012-03/msg00641.html
>>>
>>> This patch transforms the lower-subreg pass(es) from unconditionally
>>> splitting wide moves, zero extensions, and shifts, so that it now takes
>>> into
>>> account the target specific costs and only does the transformations if it
>>> is
>>> profitable.
>>>
>>> Unconditional splitting is a problem that not only occurs on the AVR but
>>> is
>>> also a problem on the ARM NEON and my private port.  Furthermore, it is a
>>> problem that is likely to occur on most modern larger machines since
>>> these
>>> machines are more likely to have fast instructions for moving things that
>>> are larger than word mode.
>>>
>>> At compiler initialization time, each mode that is larger that a word
>>> mode
>>> is examined to determine if the cost of moving a value of that mode is
>>> less
>>> expensive that inserting the proper number of word sided moves.   If it
>>> is
>>> cheaper to split it up, a bit is set to allow moves of that mode to be
>>> lowered.
>>>
>>> A similar analysis is made for the zero extensions and shifts except that
>>> lower subreg had been (and is still limited to only breaking up these
>>> operations if the target size was twice the size of word mode.)  Also, if
>>> the analysis determines that there are no profitable transformations, the
>>> pass exits quickly without doing any analysis.
>>>
>>> It is quite likely that most ports will have to be adjusted after this
>>> patch
>>> is accepted.   For instance, the analysis discovers that there are no
>>> profitable transformations to be performed on the x86-64.    Since this
>>> is
>>> not my platform, I have no idea if these are the correct settings.   But
>>> the
>>> pass uses the standard insn_rtx_cost interface and it is the port
>>> maintainers responsibility to not lie to the optimization passes so this
>>> extra work in stage one should be acceptable.
>>>
>>> I do know from a private conversation with Richard Sandiford, that mips
>>> patches are likely forthcoming.
>>>
>>> There is preprocessor controlled code that prints out the cost analysis.
>>>  Only a summary of this can go in the subregs dump file because the
>>> analysis
>>> is called from backend_init_target and so the dump file is not available.
>>>    But it is very useful to define LOG_COSTS when adjusting your port.
>>>
>>> There is also preprocessor code that forces all of the lowering
>>> operations
>>> to marked as profitable.  This is useful in debugging the new logic.
>>>
>>> Both of these preprocessor symbols are documented at the top of the pass.
>>>
>>> I have tested this on an x86_64 with both the force lowering on and off
>>> and
>>> neither cause any regressions as well as extensive testing on my port.
>>>
>>> Ok to commit?
>>>
>>> Kenny
>>>
>>> 2012-03-29 Kenneth Zadeck<zadeck@naturalbridge.com>
>>>
>>>    * toplev.c (backend_init_target): Call initializer for lower-subreg
>>> pass.
>>>
>>>    * lower-subreg.c (move_modes_to_split, splitting_ashift,
>>> splitting_lshiftrt)
>>>    splitting_zext, splitting_some_shifts, twice_word_mode,
>>>  something_to_do,
>>>    word_mode_move_cost, move_zero_cost): New static vars.
>>>    (compute_move_cost, profitable_shift_p, init_lower_subreg): New
>>>    functions.
>>>    (find_pseudo_copy, resolve_simple_move): Added code to only split
>>> based
>>> on costs.
>>>    (find_decomposable_subregs): Added code to mark as decomposable
>>>    moves that are not profitable.
>>>    (find_decomposable_shift_zext): Added code to only decompose
>>>    shifts and zext if profitable.
>>>    (resolve_shift_zext): Added comment.
>>>    (decompose_multiword_subregs): Dump list of profitable
>>>    transformations.  Add code to skip non profitable transformations.
>>>
>>>    *rtl.h(init_lower_subreg): Added declaration.
>>>
>>>
>

Paolo Bonzini May 10, 2012, 6:45 a.m. UTC | #7

Il 30/03/2012 12:08, Richard Sandiford ha scritto:
>> > +   There are two useful preprocessor defines for use by maintainers:  
>> > +
>> > +   #define LOG_COSTS
>> > +
>> > +   if you wish to see the actual cost estimates that are being used
>> > +   for each mode wider than word mode and the cost estimates for zero
>> > +   extension and the shifts.   This can be useful when port maintainers 
>> > +   are tuning insn rtx costs.
>> > +
>> > +   #define FORCE_LOWERING
>> > +
>> > +   if you wish to test the pass with all the transformation forced on.
>> > +   This can be useful for finding bugs in the transformations.
> Must admit I'm not keen on these kinds of macro, but it's Ian's call.

Indeed, LOG_COSTS should be (dump_flags & TDF_DETAILS) != 0, and perhaps
FORCE_LOWERING should be a -f flag (like -flower-all-subregs) or a --param.

Paolo

Paolo Bonzini May 10, 2012, 6:54 a.m. UTC | #8

Il 10/05/2012 08:45, Paolo Bonzini ha scritto:
> Il 30/03/2012 12:08, Richard Sandiford ha scritto:
>>>> +   There are two useful preprocessor defines for use by maintainers:  
>>>> +
>>>> +   #define LOG_COSTS
>>>> +
>>>> +   if you wish to see the actual cost estimates that are being used
>>>> +   for each mode wider than word mode and the cost estimates for zero
>>>> +   extension and the shifts.   This can be useful when port maintainers 
>>>> +   are tuning insn rtx costs.
>>>> +
>>>> +   #define FORCE_LOWERING
>>>> +
>>>> +   if you wish to test the pass with all the transformation forced on.
>>>> +   This can be useful for finding bugs in the transformations.
>> Must admit I'm not keen on these kinds of macro, but it's Ian's call.
> 
> Indeed, LOG_COSTS should be (dump_flags & TDF_DETAILS) != 0, and perhaps
> FORCE_LOWERING should be a -f flag (like -flower-all-subregs) or a --param.

Not sure how this got sent a month after I wrote it (and decided not to
send it). :)

Paolo

Commit Message

Comments

Patch