diff mbox

new sign/zero extension elimination pass

Message ID 4CBC698B.3080204@codesourcery.com
State New
Headers show

Commit Message

Tom de Vries Oct. 18, 2010, 3:36 p.m. UTC
I created a new sign/zero extension elimination pass.

The motivating example for this pass is:

  void f(unsigned char *p, short s, int c, int *z)
    {
      if (c)
        *z = 0;
      *p ^= (unsigned char)s;
    }

For MIPS, compilation results in the following insns.

  (set (reg/v:SI 199)
       (sign_extend:SI (subreg:HI (reg:SI 200) 2)))

  ...

  (set (reg:QI 203)
       (subreg:QI (reg/v:SI 199) 3))

These insns are the only def and the only use of reg 199, each located in a
different bb.

The sign-extension preserves the lower half of reg 200 and copies it to
reg 199, and the subreg use of reg 199 only reads the least significant byte.
The sign extension is therefore redundant (the extension part, not the copy
part), and can safely be replaced with a regcopy from reg 200 to reg 199:

  (set (reg/v:SI 199)
       (reg:SI 200))


This new pass manages to analyze this pattern, and replace the sign_extend with
a regcopy, which results in 1 less instruction in the assembly. The other passes
that eliminate sign/zero extensions do no manage to do that. Combine doesn't
work since the def and the use of reg 199 are in a different bb. Implicit_zee
does not work here since it only combines an extension with the defs of its src
operand, which is not applicable in this case.

The pass does a couple of linear scans over the insns, so it's not expensive in
terms of runtime.

See ee.c for a more detailed writeup related to motivating example, comparison
to other passes that do sign/zero extension elimination, intended effect,
implementation and limitations.

The pass has been tested on X86_64, MIPS, ARM. An earlier version of the pass
was tested on PPC as well. All on Linux host.

At the moment there are 2 known issue:
- a bug on ARM, which I'm fixing currently.
- an assert while building on MIPS. I hit the assert
    gcc_assert ((*inner_mode == *outer_mode) != (*extend != UNKNOWN));
  in loop-iv.c:get_biv_step. I disabled it and no tests fell over, but that
  needs more investigation.

I would like review comments on the general idea of the pass.

- Tom de Vries

Comments

Andrew Pinski Oct. 18, 2010, 3:42 p.m. UTC | #1
On Mon, Oct 18, 2010 at 8:36 AM, Tom de Vries <tom@codesourcery.com> wrote:
> I created a new sign/zero extension elimination pass.
>
> The motivating example for this pass is:
>
>  void f(unsigned char *p, short s, int c, int *z)
>    {
>      if (c)
>        *z = 0;
>      *p ^= (unsigned char)s;
>    }
>
> For MIPS, compilation results in the following insns.
>
>  (set (reg/v:SI 199)
>       (sign_extend:SI (subreg:HI (reg:SI 200) 2)))
>
>  ...
>
>  (set (reg:QI 203)
>       (subreg:QI (reg/v:SI 199) 3))
>
> These insns are the only def and the only use of reg 199, each located in a
> different bb.


This sounds like a job for GCSE to do.

Thanks,
Andrew Pinski
Richard Biener Oct. 18, 2010, 4:13 p.m. UTC | #2
On Mon, Oct 18, 2010 at 5:42 PM, Andrew Pinski <pinskia@gmail.com> wrote:
> On Mon, Oct 18, 2010 at 8:36 AM, Tom de Vries <tom@codesourcery.com> wrote:
>> I created a new sign/zero extension elimination pass.
>>
>> The motivating example for this pass is:
>>
>>  void f(unsigned char *p, short s, int c, int *z)
>>    {
>>      if (c)
>>        *z = 0;
>>      *p ^= (unsigned char)s;
>>    }
>>
>> For MIPS, compilation results in the following insns.
>>
>>  (set (reg/v:SI 199)
>>       (sign_extend:SI (subreg:HI (reg:SI 200) 2)))
>>
>>  ...
>>
>>  (set (reg:QI 203)
>>       (subreg:QI (reg/v:SI 199) 3))
>>
>> These insns are the only def and the only use of reg 199, each located in a
>> different bb.
>
>
> This sounds like a job for GCSE to do.

The question is why it is expanded the way it is.  On x86 I just get

;; *p_3(D) = D.2690_7;

(insn 16 15 0 (parallel [
            (set (mem:QI (reg/v/f:DI 62 [ p ]) [0 *p_3(D)+0 S1 A8])
                (xor:QI (mem:QI (reg/v/f:DI 62 [ p ]) [0 *p_3(D)+0 S1 A8])
                    (subreg:QI (reg/v:HI 63 [ s ]) 0)))
            (clobber (reg:CC 17 flags))
        ]) t.c:5 -1
     (nil))

note that the tree level already has

  D.2688_4 = *p_3(D);
  D.2689_6 = (unsigned char) s_5(D);
  D.2690_7 = D.2689_6 ^ D.2688_4;
  *p_3(D) = D.2690_7;

so no extension.

Richard.

> Thanks,
> Andrew Pinski
>
Tom de Vries Oct. 21, 2010, 9:20 a.m. UTC | #3
Richard Guenther wrote:
> On Mon, Oct 18, 2010 at 5:42 PM, Andrew Pinski <pinskia@gmail.com> wrote:
>> On Mon, Oct 18, 2010 at 8:36 AM, Tom de Vries <tom@codesourcery.com> wrote:
>>> I created a new sign/zero extension elimination pass.
>>>
>>> The motivating example for this pass is:
>>>
>>>  void f(unsigned char *p, short s, int c, int *z)
>>>    {
>>>      if (c)
>>>        *z = 0;
>>>      *p ^= (unsigned char)s;
>>>    }
>>>
>>> For MIPS, compilation results in the following insns.
>>>
>>>  (set (reg/v:SI 199)
>>>       (sign_extend:SI (subreg:HI (reg:SI 200) 2)))
>>>
>>>  ...
>>>
>>>  (set (reg:QI 203)
>>>       (subreg:QI (reg/v:SI 199) 3))
>>>
>>> These insns are the only def and the only use of reg 199, each located in a
>>> different bb.
>>
>> This sounds like a job for GCSE to do.
> 
> The question is why it is expanded the way it is.  On x86 I just get
> 
> ;; *p_3(D) = D.2690_7;
> 
> (insn 16 15 0 (parallel [
>             (set (mem:QI (reg/v/f:DI 62 [ p ]) [0 *p_3(D)+0 S1 A8])
>                 (xor:QI (mem:QI (reg/v/f:DI 62 [ p ]) [0 *p_3(D)+0 S1 A8])
>                     (subreg:QI (reg/v:HI 63 [ s ]) 0)))
>             (clobber (reg:CC 17 flags))
>         ]) t.c:5 -1
>      (nil))
> 
> note that the tree level already has
> 
>   D.2688_4 = *p_3(D);
>   D.2689_6 = (unsigned char) s_5(D);
>   D.2690_7 = D.2689_6 ^ D.2688_4;
>   *p_3(D) = D.2690_7;
> 
> so no extension.
> 
> Richard.
> 
>> Thanks,
>> Andrew Pinski
>>

The extension is generated for the 'short int s' function parameter by
assign_parm_setup_reg due to the MIPS setting of PROMOTE_MODE,
TARGET_PROMOTE_FUNCTION_MODE and TARGET_PROMOTE_PROTOTYPES. However, this
behavior is not reproducible for other targets that I tried (ARM, PPC, x86_64),
so this is probably not the best example to discuss between targets.

Maybe a better example is http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40893#c0.

The optimized tree code for MIPS is this:
...
dct2x2dc_dconly (short int[2] * d)
{
  int d1;
  int d0;
  short int D.1992;
  short unsigned int D.1991;
  short int D.1990;
  short unsigned int D.1989;
  short unsigned int D.1988;
  short unsigned int D.1987;
  int D.1986;
  short int D.1985;
  int D.1984;
  short int D.1983;
  short int[2] * D.1982;
  int D.1981;
  short int D.1980;
  int D.1979;
  short int D.1978;

<bb 2>:
  D.1978_2 = (*d_1(D))[0];
  D.1979_3 = (int) D.1978_2;
  D.1980_4 = (*d_1(D))[1];
  D.1981_5 = (int) D.1980_4;
  d0_6 = D.1981_5 + D.1979_3;
  D.1982_7 = d_1(D) + 4;
  D.1983_8 = (*D.1982_7)[0];
  D.1984_9 = (int) D.1983_8;
  D.1985_11 = (*D.1982_7)[1];
  D.1986_12 = (int) D.1985_11;
  d1_13 = D.1986_12 + D.1984_9;
  D.1987_14 = (short unsigned int) d0_6;
  D.1988_15 = (short unsigned int) d1_13;
  D.1989_16 = D.1988_15 + D.1987_14;
  D.1990_17 = (short int) D.1989_16;
  (*d_1(D))[0] = D.1990_17;
  D.1991_20 = D.1987_14 - D.1988_15;
  D.1992_21 = (short int) D.1991_20;
  (*d_1(D))[1] = D.1992_21;
  return;

}
...

The assignments:
...
  D.1987_14 = (short unsigned int) d0_6;
  D.1988_15 = (short unsigned int) d1_13;
...

are expanded into zero_extensions:
...
(insn 10 9 11 3 ext13.c:5 (set (reg:SI 204 [ D.1987+-2 ])
        (zero_extend:SI (subreg:HI (reg:SI 213) 2))) -1 (nil))

(insn 14 13 15 3 ext13.c:5 (set (reg:SI 205 [ D.1988+-2 ])
        (zero_extend:SI (subreg:HI (reg:SI 216) 2))) -1 (nil))
...
The same holds for ARM and PPC.

But for x86_64, the assignments are expanded into subreg copies:
...
(insn 9 8 10 3 (set (reg:HI 69 [ D.2798 ])
        (subreg:HI (reg:SI 78) 0)) test4.c:7 -1
     (nil))

(insn 13 12 14 3 (set (reg:HI 70 [ D.2799 ])
        (subreg:HI (reg:SI 81) 0)) test4.c:7 -1
     (nil))
...

AFAIU, this difference in behaviour is caused by the difference in PROMOTE_MODE,
which causes the short unsigned int D.1987 to live an SI reg for ARM, MIPS and
PPC, an in an HI reg for X86_64.

For x86_64, the subreg copies are combined by combine with other operations, and
don't result in extra operations.

For ARM, MIPS and PPC, the zero_extensions are not optimized away by any current
pass. Demonstrated here in the MIPS assembly:
...
	lh	$6,2($2)
	lh	$4,0($2)
	lh	$5,4($2)
	lh	$3,6($2)
	addu	$4,$6,$4
	addu	$3,$3,$5
	andi	$4,$4,0xffff    <---
	andi	$3,$3,0xffff    <---
	addu	$5,$3,$4
	subu	$3,$4,$3
	sh	$5,0($2)
	j	$31
	sh	$3,2($2)
...

The new pass removes the superfluous zero_extensions for ARM, MIPS and PPC.

Another example for which the new pass removes the superfluous extensions is
https://bugs.launchpad.net/gcc-linaro/+bug/634682.

- Tom
Paolo Bonzini Oct. 21, 2010, 10:24 a.m. UTC | #4
On 10/18/2010 05:42 PM, Andrew Pinski wrote:
>> >  For MIPS, compilation results in the following insns.
>> >
>> >    (set (reg/v:SI 199)
>> >         (sign_extend:SI (subreg:HI (reg:SI 200) 2)))
>> >
>> >    ...
>> >
>> >    (set (reg:QI 203)
>> >         (subreg:QI (reg/v:SI 199) 3))
>> >
>> >  These insns are the only def and the only use of reg 199, each located in a
>> >  different bb.
>
> This sounds like a job for GCSE to do.

Actually, fwprop should _already_ do that if assuming simplify-rtx.c 
does the simplification of (subreg:QI (sign_extend:SI (subreg:HI (reg:SI 
200) 2))) 3).

Paolo
Paolo Bonzini Oct. 21, 2010, 4:53 p.m. UTC | #5
On 10/21/2010 12:24 PM, Paolo Bonzini wrote:
> On 10/18/2010 05:42 PM, Andrew Pinski wrote:
>>> > For MIPS, compilation results in the following insns.
>>> >
>>> > (set (reg/v:SI 199)
>>> > (sign_extend:SI (subreg:HI (reg:SI 200) 2)))
>>> >
>>> > ...
>>> >
>>> > (set (reg:QI 203)
>>> > (subreg:QI (reg/v:SI 199) 3))
>>> >
>>> > These insns are the only def and the only use of reg 199, each 
>>> located in a
>>> > different bb.
>>
>> This sounds like a job for GCSE to do.
> 
> Actually, fwprop should _already_ do that if assuming simplify-rtx.c 
> does the simplification of (subreg:QI (sign_extend:SI (subreg:HI (reg:SI 
> 200) 2))) 3).

... which this code should do in simplify_subreg:

  /* Optimize SUBREG truncations of zero and sign extended values.  */
  if ((GET_CODE (op) == ZERO_EXTEND
       || GET_CODE (op) == SIGN_EXTEND)
      && GET_MODE_BITSIZE (outermode) < GET_MODE_BITSIZE (innermode))
    {
      unsigned int bitpos = subreg_lsb_1 (outermode, innermode, byte);

      /* If we're requesting the lowpart of a zero or sign extension,
         there are three possibilities.  If the outermode is the same
         as the origmode, we can omit both the extension and the subreg.
         If the outermode is not larger than the origmode, we can apply
         the truncation without the extension.  Finally, if the outermode
         is larger than the origmode, but both are integer modes, we
         can just extend to the appropriate mode.  */
      if (bitpos == 0)
        {
          enum machine_mode origmode = GET_MODE (XEXP (op, 0));
          if (outermode == origmode)
            return XEXP (op, 0);
          if (GET_MODE_BITSIZE (outermode) <= GET_MODE_BITSIZE (origmode))
            return simplify_gen_subreg (outermode, XEXP (op, 0), origmode,
                                        subreg_lowpart_offset (outermode,
                                                               origmode));
          if (SCALAR_INT_MODE_P (outermode))
            return simplify_gen_unary (GET_CODE (op), outermode,
                                       XEXP (op, 0), origmode);
        }

However, the def of pseudo 200 is "complex enough" that fwprop will not want
to propagate it unless it simplifies to a constant.

It should be enough to tell fwprop that such propagations are always fine
if the destination pseudo has one use only.  In this case register pressure
cannot increase.

Paolo
Tom de Vries Oct. 22, 2010, 8:26 a.m. UTC | #6
Paolo Bonzini wrote:
> On 10/21/2010 12:24 PM, Paolo Bonzini wrote:
>> On 10/18/2010 05:42 PM, Andrew Pinski wrote:
>>>>> For MIPS, compilation results in the following insns.
>>>>>
>>>>> (set (reg/v:SI 199)
>>>>> (sign_extend:SI (subreg:HI (reg:SI 200) 2)))
>>>>>
>>>>> ...
>>>>>
>>>>> (set (reg:QI 203)
>>>>> (subreg:QI (reg/v:SI 199) 3))
>>>>>
>>>>> These insns are the only def and the only use of reg 199, each 
>>>> located in a
>>>>> different bb.
>>> This sounds like a job for GCSE to do.
>> Actually, fwprop should _already_ do that if assuming simplify-rtx.c 
>> does the simplification of (subreg:QI (sign_extend:SI (subreg:HI (reg:SI 
>> 200) 2))) 3).
> 
> ... which this code should do in simplify_subreg:
> 
>   /* Optimize SUBREG truncations of zero and sign extended values.  */
>   if ((GET_CODE (op) == ZERO_EXTEND
>        || GET_CODE (op) == SIGN_EXTEND)
>       && GET_MODE_BITSIZE (outermode) < GET_MODE_BITSIZE (innermode))
>     {
>       unsigned int bitpos = subreg_lsb_1 (outermode, innermode, byte);
> 
>       /* If we're requesting the lowpart of a zero or sign extension,
>          there are three possibilities.  If the outermode is the same
>          as the origmode, we can omit both the extension and the subreg.
>          If the outermode is not larger than the origmode, we can apply
>          the truncation without the extension.  Finally, if the outermode
>          is larger than the origmode, but both are integer modes, we
>          can just extend to the appropriate mode.  */
>       if (bitpos == 0)
>         {
>           enum machine_mode origmode = GET_MODE (XEXP (op, 0));
>           if (outermode == origmode)
>             return XEXP (op, 0);
>           if (GET_MODE_BITSIZE (outermode) <= GET_MODE_BITSIZE (origmode))
>             return simplify_gen_subreg (outermode, XEXP (op, 0), origmode,
>                                         subreg_lowpart_offset (outermode,
>                                                                origmode));
>           if (SCALAR_INT_MODE_P (outermode))
>             return simplify_gen_unary (GET_CODE (op), outermode,
>                                        XEXP (op, 0), origmode);
>         }
> 
> However, the def of pseudo 200 is "complex enough" that fwprop will not want
> to propagate it unless it simplifies to a constant.
> 
> It should be enough to tell fwprop that such propagations are always fine
> if the destination pseudo has one use only.  In this case register pressure
> cannot increase.
> 
> Paolo

Paolo,

Thanks for the pointer to fwprop. I agree with you that for this example, it
would make sense to have this done in fwprop.

But, I still think we need the new pass. I don't see how fwprop would help in
the example mentioned in
http://gcc.gnu.org/ml/gcc-patches/2010-10/msg01796.html, where a superfluous
zero_extend:
...
(insn 10 9 11 2 ext13.c:5 (set (reg:SI 204 [ D.1987+-2 ])
        (zero_extend:SI (subreg:HI (reg:SI 213) 2))) 186 {*zero_extendhisi2}
(expr_list:REG_DEAD (reg:SI 213)
        (nil)))
...

is used by:
...
(insn 15 14 16 2 ext13.c:5 (set (reg:SI 217)
        (plus:SI (reg:SI 205 [ D.1988+-2 ])
            (reg:SI 204 [ D.1987+-2 ]))) 10 {*addsi3} (nil))

(insn 17 16 18 2 ext13.c:6 (set (reg:SI 218)
        (minus:SI (reg:SI 204 [ D.1987+-2 ])
            (reg:SI 205 [ D.1988+-2 ]))) 23 {subsi3} (expr_list:REG_DEAD (reg:SI
205 [ D.1988+-2 ])
        (expr_list:REG_DEAD (reg:SI 204 [ D.1987+-2 ])
            (nil))))
...

- Tom
Eric Botcazou Oct. 22, 2010, 8:30 a.m. UTC | #7
> This new pass manages to analyze this pattern, and replace the sign_extend
> with a regcopy, which results in 1 less instruction in the assembly. The
> other passes that eliminate sign/zero extensions do no manage to do that.
> Combine doesn't work since the def and the use of reg 199 are in a
> different bb. Implicit_zee does not work here since it only combines an
> extension with the defs of its src operand, which is not applicable in this
> case.

Was it originally written for a pre-4.3 compiler?  If not, why does it not use 
the DF framework instead of recomputing DU chains manually?

> The pass does a couple of linear scans over the insns, so it's not
> expensive in terms of runtime.

At -O2 it can do up to 5 full scans over the insns AFAICS; the head comment in 
the file has "a number of times".  This appears to be quite suboptimal and you 
rightfully mentioned it in the head comment.  Couldn't we avoid doing a full 
scan for each new round during the propagation phase, even without using a 
fully-fledged iterative algorithm?
Paolo Bonzini Oct. 22, 2010, 9:05 a.m. UTC | #8
On 10/22/2010 10:26 AM, Tom de Vries wrote:
> But, I still think we need the new pass. I don't see how fwprop would help in
> the example mentioned in
> http://gcc.gnu.org/ml/gcc-patches/2010-10/msg01796.html, where a superfluous
> zero_extend:
> ...
> (insn 10 9 11 2 ext13.c:5 (set (reg:SI 204 [ D.1987+-2 ])
>          (zero_extend:SI (subreg:HI (reg:SI 213) 2))) 186 {*zero_extendhisi2}
> (expr_list:REG_DEAD (reg:SI 213)
>          (nil)))
> ...
>
> is used by:
> ...
> (insn 15 14 16 2 ext13.c:5 (set (reg:SI 217)
>          (plus:SI (reg:SI 205 [ D.1988+-2 ])
>              (reg:SI 204 [ D.1987+-2 ]))) 10 {*addsi3} (nil))
>
> (insn 17 16 18 2 ext13.c:6 (set (reg:SI 218)
>          (minus:SI (reg:SI 204 [ D.1987+-2 ])
>              (reg:SI 205 [ D.1988+-2 ]))) 23 {subsi3} (expr_list:REG_DEAD (reg:SI
> 205 [ D.1988+-2 ])
>          (expr_list:REG_DEAD (reg:SI 204 [ D.1987+-2 ])
>              (nil))))
> ...

Yes, in this case fwprop is not powerful enough, you need to scan the 
insn stream in two directions.

Paolo
Tom de Vries Oct. 28, 2010, 6:46 p.m. UTC | #9
Eric Botcazou wrote:
>> This new pass manages to analyze this pattern, and replace the sign_extend
>> with a regcopy, which results in 1 less instruction in the assembly. The
>> other passes that eliminate sign/zero extensions do no manage to do that.
>> Combine doesn't work since the def and the use of reg 199 are in a
>> different bb. Implicit_zee does not work here since it only combines an
>> extension with the defs of its src operand, which is not applicable in this
>> case.
> 
> Was it originally written for a pre-4.3 compiler?  

No, it was written on top of a 4.5.1 compiler.

> If not, why does it not use 
> the DF framework instead of recomputing DU chains manually?
> 

The current form of the pass does not compute DU-chains. It treats all defs and
uses of the same reg as if all defs reach all uses. This is imprecise, and I'm
expecting to find an example where this is a limiting factor, but until now I
haven't.

>> The pass does a couple of linear scans over the insns, so it's not
>> expensive in terms of runtime.
> 
> At -O2 it can do up to 5 full scans over the insns AFAICS; the head comment in 
> the file has "a number of times".  This appears to be quite suboptimal and you 
> rightfully mentioned it in the head comment.  Couldn't we avoid doing a full 
> scan for each new round during the propagation phase, even without using a 
> fully-fledged iterative algorithm?
> 

If we save the use size for each use in the second scan, rather than just saving
the biggest use size for each reg, we only have to visit instructions where
propagation can happen in a propagating scan. So we can make a list of
instructions where that can happen, and iterate over those only. The second
example I mentioned has 14 insns, of which 8 would be in such a list. Actually,
currently it's 6, but it will be 8 if we also propagate over extensions, which
we don't do yet. Either way, this is already better than doing a full scan.

Another runtime improvement would be to keep track of the propagated size per
insn, which we can use to avoid processing an insn (applying the transfer
function) when it won't have an effect.

A further runtime improvement would be to move insn out the the list when
processed, and back into the list when necessary, using UD-chains. Essentially
this is already an iterative algorithm with a worklist.

The best version I can think now of is an iterative solution using DU/UD-chains,
like this:
- Calculate and store the size of each use.
- Calculate and store the biggest use of each def (using DU-chains).
- Push the progating defs where the biggest use size is smaller than word size
  on a stack.
- Pop a def.
- For the def, check if the propagated biggest use is bigger than the current
  biggest use.
  If not, back to pop a def.
  If so, store current biggest use as propagated biggest use.
- For the def, propagate the biggest use of the def to the operands.
- For each operand, if the use size was reduced, find the reaching definitions
  (using UD-chains).
- For each propagating definition, recompute the biggest use. If that one was
  reduced below the propagated biggest use, push the def on the stack.
- back to pop a def.

I think it would make sense to do either the first 2 improvements (which are
relatively easy to do on top of the current version), or to go directly for the
last version (which would mean a rewrite).

- Tom
Andrew Pinski Oct. 28, 2010, 7:03 p.m. UTC | #10
On Mon, Oct 18, 2010 at 8:36 AM, Tom de Vries <tom@codesourcery.com> wrote:
> I created a new sign/zero extension elimination pass.
>
> The motivating example for this pass is:

In the above case fwprop could do the majority of the work.  In fact
it simplifies the (subreg (zero_extend (subreg))) into (subreg) but
does not replace it.  I think you could extend fwprop to the correct
thing.

Thanks,
Andrew Pinski
Paolo Bonzini Oct. 28, 2010, 9:14 p.m. UTC | #11
On 10/22/2010 10:30 AM, Eric Botcazou wrote:
> If not, why does it not use
> the DF framework instead of recomputing DU chains manually?

Rather than on DU chains (which are a very expensive problem just 
because of its asymptotic complexity, so it's better to use it only on 
small regions such as loops), I'd be picky about usage of note_uses, 
which can be very simply replaced by a loop over the DF_INSN_USES vector.

Paolo
Paolo Bonzini Oct. 28, 2010, 9:18 p.m. UTC | #12
On 10/28/2010 08:46 PM, Tom de Vries wrote:
> The best version I can think now of is an iterative solution using DU/UD-chains,
> like this:
> - Calculate and store the size of each use.
> - Calculate and store the biggest use of each def (using DU-chains).
> - Push the progating defs where the biggest use size is smaller than word size
>    on a stack.
> - Pop a def.
> - For the def, check if the propagated biggest use is bigger than the current
>    biggest use.
>    If not, back to pop a def.
>    If so, store current biggest use as propagated biggest use.
> - For the def, propagate the biggest use of the def to the operands.
> - For each operand, if the use size was reduced, find the reaching definitions
>    (using UD-chains).
> - For each propagating definition, recompute the biggest use. If that one was
>    reduced below the propagated biggest use, push the def on the stack.
> - back to pop a def.
>

DU and UD-chains are quadratic, and the reaching definitions problem has 
very big solutions too, so they are rarely the right solution 
(unfortunately).

In addition, I wonder if this pass wouldn't introduce new undefined 
overflows, which would be a correctness problem.  If you're going for a 
rewrite, I'd do it on tree-SSA so that:

1) you can use unsigned math to avoid undefined overflow;

2) SSA will provide you with cheap DU/UD.

Hopefully, only simple redundancies will remain after expand, and fwprop 
can take care of these using Andrew's patch.

Paolo
Tom de Vries Oct. 31, 2010, 5:55 p.m. UTC | #13
Paolo Bonzini wrote:
> On 10/28/2010 08:46 PM, Tom de Vries wrote:
>> The best version I can think now of is an iterative solution using
>> DU/UD-chains,
>> like this:
>> - Calculate and store the size of each use.
>> - Calculate and store the biggest use of each def (using DU-chains).
>> - Push the progating defs where the biggest use size is smaller than
>> word size
>>    on a stack.
>> - Pop a def.
>> - For the def, check if the propagated biggest use is bigger than the
>> current
>>    biggest use.
>>    If not, back to pop a def.
>>    If so, store current biggest use as propagated biggest use.
>> - For the def, propagate the biggest use of the def to the operands.
>> - For each operand, if the use size was reduced, find the reaching
>> definitions
>>    (using UD-chains).
>> - For each propagating definition, recompute the biggest use. If that
>> one was
>>    reduced below the propagated biggest use, push the def on the stack.
>> - back to pop a def.
>>
> 
> DU and UD-chains are quadratic, and the reaching definitions problem has
> very big solutions too, so they are rarely the right solution
> (unfortunately).
> 

thanks, I overlooked that completely.

> In addition, I wonder if this pass wouldn't introduce new undefined
> overflows, which would be a correctness problem.

The pass assumes wraparound overflow for plus and minus. The documentation of
the rtl operator 'plus' states that 'plus wraps round modulo the width of m'
(similar for 'minus'). I interpreted this independently from -fwrapv and
-fstrict-overflow, am I wrong there?

>  If you're going for a
> rewrite, I'd do it on tree-SSA so that:
> 
> 1) you can use unsigned math to avoid undefined overflow;
> 
> 2) SSA will provide you with cheap DU/UD.
> 
> Hopefully, only simple redundancies will remain after expand, and fwprop
> can take care of these using Andrew's patch.
> 

I see your point about the cheap DU/UD. I think the answer whether to do it in
tree-SSA or RTL depends on the answer to my previous question. If going to
tree-SSA allows us to catch more cases, we should do that.

- Tom
Joseph Myers Oct. 31, 2010, 6:43 p.m. UTC | #14
On Sun, 31 Oct 2010, Tom de Vries wrote:

> The pass assumes wraparound overflow for plus and minus. The documentation of
> the rtl operator 'plus' states that 'plus wraps round modulo the width of m'
> (similar for 'minus'). I interpreted this independently from -fwrapv and
> -fstrict-overflow, am I wrong there?

-fwrapv and -fstrict-overflow relate only to source code, GENERIC and 
GIMPLE semantics, and do not affect RTL which is always modulo.  (This is 
true at least for addition, subtraction, multiplication and absolute 
value.  I make no claims about the semantics for division and modulo 
operations for INT_MIN and -1, but my suggestion in bug 30484 was that the 
target should specify the semantics of its RTL insns, not that they should 
depend on command-line options.  The aim is to stop the options from 
affecting GENERIC and GIMPLE semantics in future; see the 
no-undefined-overflow branch.)
Paolo Bonzini Oct. 31, 2010, 6:54 p.m. UTC | #15
On Sun, Oct 31, 2010 at 18:55, Tom de Vries <tom@codesourcery.com> wrote:
>> DU and UD-chains are quadratic, and the reaching definitions problem has
>> very big solutions too, so they are rarely the right solution
>> (unfortunately).
>
> thanks, I overlooked that completely.

No problem, I discovered it the hard way (and had to rewrite fwprop's
dataflow to work around it).

>> In addition, I wonder if this pass wouldn't introduce new undefined
>> overflows, which would be a correctness problem.
>
> The pass assumes wraparound overflow for plus and minus. The documentation of
> the rtl operator 'plus' states that 'plus wraps round modulo the width of m'
> (similar for 'minus'). I interpreted this independently from -fwrapv and
> -fstrict-overflow, am I wrong there?

I think you're right, as Joseph confirmed.  There is one occurrence of
flag_wrapv in simplify-rtx.c and zero in combine.c, so this means that
indeed RTL treats PLUS/MINUS as wrapping.  I remember seeing more but
my memory must be at fault.  Since it's so limited, the one occurrence
that is there should be removed.

>>  If you're going for a rewrite, I'd do it on tree-SSA so that:
>>
>> 1) you can use unsigned math to avoid undefined overflow;
>>
>> 2) SSA will provide you with cheap DU/UD.
>>
>> Hopefully, only simple redundancies will remain after expand, and fwprop
>> can take care of these using Andrew's patch.
>
> I see your point about the cheap DU/UD. I think the answer whether to do it in
> tree-SSA or RTL depends on the answer to my previous question. If going to
> tree-SSA allows us to catch more cases, we should do that.

If you can fix the performance issues that Eric has without a rewrite,
and if it's not quadratic, I would have no problem at all with RTL.

If you need a rewrite, tree-SSA could be a better match if only for
DU/UD and ability to use the propagation engine.  The only thing I'd
be wary of, is vectorization failures because of the more complex
trees.  (But then it may even turn out to _improve_ vectorization.

Paolo
Eric Botcazou Nov. 3, 2010, 6:45 p.m. UTC | #16
> No problem, I discovered it the hard way (and had to rewrite fwprop's
> dataflow to work around it).

The other ZEE pass uses UD/DU chains though.  The now removed SEE pass used 
them as well.

> I think you're right, as Joseph confirmed.  There is one occurrence of
> flag_wrapv in simplify-rtx.c and zero in combine.c, so this means that
> indeed RTL treats PLUS/MINUS as wrapping.  I remember seeing more but
> my memory must be at fault.  Since it's so limited, the one occurrence
> that is there should be removed.

They were introduced to fix PR rtl-optimization/23047.

> If you can fix the performance issues that Eric has without a rewrite,
> and if it's not quadratic, I would have no problem at all with RTL.

Yes, I also think that the pass makes sense at the RTL level, as it does 
something that cannot be done (easily) elsewhere.  On the other hand, it's 
unfortunate to have 2 different ZEE passes.
Eric Botcazou Nov. 3, 2010, 6:47 p.m. UTC | #17
> Rather than on DU chains (which are a very expensive problem just
> because of its asymptotic complexity, so it's better to use it only on
> small regions such as loops), I'd be picky about usage of note_uses,
> which can be very simply replaced by a loop over the DF_INSN_USES vector.

Indeed, that would already be better.
Tom de Vries Nov. 8, 2010, 9:21 p.m. UTC | #18
Paolo,

Paolo Bonzini wrote:
> > On 10/22/2010 10:30 AM, Eric Botcazou wrote:
>> >> If not, why does it not use
>> >> the DF framework instead of recomputing DU chains manually?
> >
> > Rather than on DU chains (which are a very expensive problem just
> > because of its asymptotic complexity, so it's better to use it only on
> > small regions such as loops), I'd be picky about usage of note_uses,
> > which can be very simply replaced by a loop over the DF_INSN_USES vector.
> >
> > Paolo

I just looked into using DF_INSN_USES, and I'm not sure that using that is a
good idea. There is a difference between using note_uses and DF_INSN_USES.

If there is an insn
...
(set
  (reg:SI 1)
  (plus:SI (reg:SI 2)
           (reg:SI 3)))
...

the note_uses helper function will be visited with the plus expression,
something that is used in the pass (see note_use in ee.c). Using DF_INSN_USES
will only yield (reg:SI 2) and (reg:SI 3), and does not provide the context in
which it is used.


Furthermore, I would like to know whether there is a problem with checking in
the pass into trunk in its current form. My understanding of the discussion up
until now is that the consensus is that the pass is useful, but not fully efficient.
I agree with the fact that it's possible to improve it, but I also think that
the runtime is negligible (it's an O(n) pass currently) and the benefit of
improving the pass will not be worth the effort. I will try to confirm this with
a profiling run on spec2000. If the profiling run confirms that the runtime is
negligible, is the pass (in principle) ok for trunk?
If so, I will sent out a new version with a few bug fixes that were missing in
the previous version.

Thanks,
- Tom
Andrew Pinski Nov. 8, 2010, 9:29 p.m. UTC | #19
On Mon, Oct 18, 2010 at 8:36 AM, Tom de Vries <tom@codesourcery.com> wrote:
> I created a new sign/zero extension elimination pass.
>
> The motivating example for this pass is:
>
>  void f(unsigned char *p, short s, int c, int *z)
>    {
>      if (c)
>        *z = 0;
>      *p ^= (unsigned char)s;
>    }
>
> For MIPS, compilation results in the following insns.


One more thing about the above testcase.  When I was working on
removing some more zero extensions, I noticed that
TARGET_PROMOTE_PROTOTYPES was being defined for MIPS.  This macro was
just removed from SPARC for the same reason why I was looking into
those zero extensions.  You might want to check the definition for
MIPS and refine it.

Thanks,
Andrew Pinski
Paolo Bonzini Nov. 8, 2010, 10:04 p.m. UTC | #20
On Mon, Nov 8, 2010 at 22:21, Tom de Vries <tom@codesourcery.com> wrote:
> I just looked into using DF_INSN_USES, and I'm not sure that using that is a
> good idea. There is a difference between using note_uses and DF_INSN_USES.
>
> If there is an insn
> ...
> (set
>  (reg:SI 1)
>  (plus:SI (reg:SI 2)
>           (reg:SI 3)))
> ...
>
> the note_uses helper function will be visited with the plus expression,
> something that is used in the pass (see note_use in ee.c).

This is an interesting point, thanks.

> Furthermore, I would like to know whether there is a problem with checking in
> the pass into trunk in its current form. My understanding of the discussion up
> until now is that the consensus is that the pass is useful, but not fully efficient.
> I agree with the fact that it's possible to improve it, but I also think that
> the runtime is negligible (it's an O(n) pass currently) and the benefit of
> improving the pass will not be worth the effort. I will try to confirm this with
> a profiling run on spec2000. If the profiling run confirms that the runtime is
> negligible, is the pass (in principle) ok for trunk?

I am not an RTL reviewer, so my opinion doesn't weigh too much.

Paolo
Tom de Vries Nov. 12, 2010, 8:22 a.m. UTC | #21
Eric,

Paolo Bonzini wrote:
> On Mon, Nov 8, 2010 at 22:21, Tom de Vries <tom@codesourcery.com> wrote:
>> I just looked into using DF_INSN_USES, and I'm not sure that using that is a
>> good idea. There is a difference between using note_uses and DF_INSN_USES.
>>
>> If there is an insn
>> ...
>> (set
>>  (reg:SI 1)
>>  (plus:SI (reg:SI 2)
>>           (reg:SI 3)))
>> ...
>>
>> the note_uses helper function will be visited with the plus expression,
>> something that is used in the pass (see note_use in ee.c).
> 
> This is an interesting point, thanks.
> 
>> Furthermore, I would like to know whether there is a problem with checking in
>> the pass into trunk in its current form. My understanding of the discussion up
>> until now is that the consensus is that the pass is useful, but not fully efficient.
>> I agree with the fact that it's possible to improve it, but I also think that
>> the runtime is negligible (it's an O(n) pass currently) and the benefit of
>> improving the pass will not be worth the effort. I will try to confirm this with
>> a profiling run on spec2000. If the profiling run confirms that the runtime is
>> negligible, is the pass (in principle) ok for trunk?
> 
> I am not an RTL reviewer, so my opinion doesn't weigh too much.
> 
> Paolo

I profiled the pass on spec2000:

                    -mabi=32     -mabi=64
ee-pass (usr time):     0.70         1.16
total   (usr time):   919.30       879.26
ee-pass        (%):     0.08         0.13

The pass takes 0.13% or less of the total usr runtime. Is it necessary to
improve the runtime of this pass?

Thanks,
- Tom
Eric Botcazou Nov. 13, 2010, 9:50 a.m. UTC | #22
> I profiled the pass on spec2000:
>
>                     -mabi=32     -mabi=64
> ee-pass (usr time):     0.70         1.16
> total   (usr time):   919.30       879.26
> ee-pass        (%):     0.08         0.13
>
> The pass takes 0.13% or less of the total usr runtime.

For how many hits?  What are the numbers with --param ee-max-propagate=0?

> Is it necessary to improve the runtime of this pass?

I've already given my opinion about the implementation.  The other passes in 
the compiler try hard not to rescan everything when a single bit changes; as 
currently written, yours doesn't.
diff mbox

Patch

Index: gcc/tree-pass.h
===================================================================
--- gcc/tree-pass.h	(revision 165080)
+++ gcc/tree-pass.h	(working copy)
@@ -479,6 +479,7 @@ 
 extern struct rtl_opt_pass pass_initial_value_sets;
 extern struct rtl_opt_pass pass_unshare_all_rtl;
 extern struct rtl_opt_pass pass_instantiate_virtual_regs;
+extern struct rtl_opt_pass pass_ee;
 extern struct rtl_opt_pass pass_rtl_fwprop;
 extern struct rtl_opt_pass pass_rtl_fwprop_addr;
 extern struct rtl_opt_pass pass_jump2;
Index: gcc/opts.c
===================================================================
--- gcc/opts.c	(revision 165080)
+++ gcc/opts.c	(working copy)
@@ -823,6 +823,7 @@ 
   flag_tree_switch_conversion = opt2;
   flag_ipa_cp = opt2;
   flag_ipa_sra = opt2;
+  flag_ee = opt2;
 
   /* Track fields in field-sensitive alias analysis.  */
   set_param_value ("max-fields-for-field-sensitive",
Index: gcc/timevar.def
===================================================================
--- gcc/timevar.def	(revision 165080)
+++ gcc/timevar.def	(working copy)
@@ -179,6 +179,7 @@ 
 DEFTIMEVAR (TV_VARCONST              , "varconst")
 DEFTIMEVAR (TV_LOWER_SUBREG	     , "lower subreg")
 DEFTIMEVAR (TV_JUMP                  , "jump")
+DEFTIMEVAR (TV_EE                    , "extension elimination")
 DEFTIMEVAR (TV_FWPROP                , "forward prop")
 DEFTIMEVAR (TV_CSE                   , "CSE")
 DEFTIMEVAR (TV_DCE                   , "dead code elimination")
Index: gcc/ee.c
===================================================================
--- gcc/ee.c	(revision 0)
+++ gcc/ee.c	(revision 0)
@@ -0,0 +1,907 @@ 
+/* Redundant extension elimination 
+   Copyright (C) 2010 Free Software Foundation, Inc.
+   Contributed by Tom de Vries (tom@codesourcery.com)
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify it under
+the terms of the GNU General Public License as published by the Free
+Software Foundation; either version 3, or (at your option) any later
+version.
+
+GCC is distributed in the hope that it will be useful, but WITHOUT ANY
+WARRANTY; without even the implied warranty of MERCHANTABILITY or
+FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.  If not see
+<http://www.gnu.org/licenses/>.  */
+
+/*
+
+  MOTIVATING EXAMPLE
+
+  The motivating example for this pass is:
+
+    void f(unsigned char *p, short s, int c, int *z)
+    {
+      if (c)
+        *z = 0;
+      *p ^= (unsigned char)s;
+    }
+
+  For MIPS, compilation results in the following insns.
+
+    (set (reg/v:SI 199)
+         (sign_extend:SI (subreg:HI (reg:SI 200) 2)))
+
+    ...
+
+    (set (reg:QI 203)
+         (subreg:QI (reg/v:SI 199) 3))
+
+  These insns are the only def and the only use of reg 199, each located in a
+  different bb.
+
+  The sign-extension preserves the lower half of reg 200 and copies them to 
+  reg 199, and the subreg use of reg 199 only reads the least significant byte.
+  The sign extension is therefore redundant (the extension part, not the copy 
+  part), and can safely be replaced with a regcopy from reg 200 to reg 199.
+
+
+  OTHER SIGN/ZERO EXTENSION ELIMINATION PASSES
+
+  There are other passes which eliminate sign/zero-extension: combine and
+  implicit_zee. Both attempt to eliminate extensions by combining them with
+  other instructions. The combine pass does this at bb level,
+  implicit_zee works at inter-bb level.
+
+  The combine pass combine an extension with either:
+  - all uses of the extension, or 
+  - all defs of the operand of the extension.
+  The implicit_zee pass only implements the latter.
+
+  For our motivating example, combine doesn't work since the def and the use of
+  reg 199 are in a different bb.
+
+  Implicit_zee does not work since it only combines an extension with the defs
+  of its operand.
+
+
+  INTENDED EFFECT
+
+  This pass works by removing sign/zero-extensions, or replacing them with
+  regcopies. The idea there is that the regcopy might be eliminated by a later
+  pass. In case the regcopy cannot be eliminated, it might at least be cheaper
+  than the extension.
+
+
+  IMPLEMENTATION
+
+  The pass scans a number of times over all instructions.
+  
+  The first scan collects all extensions.
+
+  The second scan registers all uses of a reg in the biggest_use array. After
+  that first scan, the biggest_use array contains the size in bits of the
+  biggest use of each reg, which allows us to find redundant extensions.
+
+  A number of backward scans, bounded by --param ee-max-propagate=<n> is done
+  in which information from the previous scan is used. As a runtime improvement,
+  also information from the current scan is used in propagation, if we know that
+  the information from the current scan is final (all uses have been 
+  registered). If there are no more changes in size of biggest use, or there are
+  no more extensions, no new scan is initiated.
+
+  After the last scan, redundant extensions are deleted or replaced.
+
+  In case that the src and dest reg of the replacement are not of the same size,
+  we do not replace with a normal regcopy, but with a truncate or with the copy
+  of a paradoxical subreg instead.
+
+
+  LIMITATIONS
+
+  The scope of the analysis is limited to an extension and its uses. The other
+  type of analysis (related to the defs of the operand of an extension) is not
+  done.
+
+  Furthermore, we do the analysis of biggest use per reg. So when determining
+  whether an extension is redundant, we take all uses of a the dest reg into
+  account, also the ones that are not uses of the extension. This could be
+  overcome by calculating the def-use chains and using those for analysis
+  instead.
+
+  Finally, there is propagation of information, but not in its most runtime
+  efficient form. In order to have that a text-book backward iterative approach
+  is needed.  */
+
+
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "tm.h"
+#include "rtl.h"
+#include "tree.h"
+#include "tm_p.h"
+#include "flags.h"
+#include "regs.h"
+#include "hard-reg-set.h"
+#include "basic-block.h"
+#include "insn-config.h"
+#include "function.h"
+#include "expr.h"
+#include "insn-attr.h"
+#include "recog.h"
+#include "toplev.h"
+#include "target.h"
+#include "timevar.h"
+#include "optabs.h"
+#include "insn-codes.h"
+#include "rtlhooks-def.h"
+#include "output.h"
+#include "params.h"
+#include "timevar.h"
+#include "tree-pass.h"
+#include "cgraph.h"
+
+#define SKIP_REG (-1)
+
+/* Array to register the biggest use of a reg, in bits.  */
+
+static int *biggest_use;
+
+/* Array to register the amount of uses of a reg.  */
+
+static int *nr_uses;
+
+/* Arrays to keep the data from the previous run available.  */
+
+static int *prev_biggest_use;
+static int *prev_nr_uses;
+
+/* Variable that indicates whether the biggest_use info has changed relative to
+   the previous run. */
+
+bool changed;
+
+/* Vector that contains the extensions in the function.  */
+
+VEC (rtx,heap) *extensions;
+
+/* Vector that contains the extensions in the function that are going to be
+   removed or replaced.  */
+
+VEC (rtx,heap) *redundant_extensions;
+
+/* Forward declarations.  */
+
+static void note_use (rtx *x, void *data);
+static bool skip_reg_p (int regno);
+
+/* Convenience macro to swap 2 variables.  */
+
+#define SWAP(T, a, b) do { T tmp; tmp = (a); (a) = (b); (b) = tmp; } while (0)
+
+/* Check whether this is a paradoxical subreg. */
+
+static bool
+paradoxical_subreg_p (rtx subreg)
+{
+  enum machine_mode subreg_mode, reg_mode;
+
+  if (GET_CODE (subreg) != SUBREG)
+    return false;
+
+  subreg_mode = GET_MODE (subreg);
+  reg_mode = GET_MODE (SUBREG_REG (subreg));
+
+  if (GET_MODE_SIZE (subreg_mode) > GET_MODE_SIZE (reg_mode))
+    return true;
+
+  return false;
+}
+
+/* Get the size and reg number of a REG or SUBREG use.  */
+
+static bool
+reg_use_p (rtx use, int *size, unsigned int *regno)
+{
+  rtx reg;
+      
+  if (REG_P (use))
+    {
+      *regno = REGNO (use);
+      *size = GET_MODE_BITSIZE (GET_MODE (use));
+      return true;
+    }
+  else if (GET_CODE (use) == SUBREG)
+    {
+      reg = SUBREG_REG (use);
+
+      if (!REG_P (reg))
+        return false;
+
+      *regno = REGNO (reg);
+
+      if (paradoxical_subreg_p (use))
+        *size = GET_MODE_BITSIZE (GET_MODE (reg));
+      else
+        *size = subreg_lsb (use) + GET_MODE_BITSIZE (GET_MODE (use));
+
+      return true;
+    }
+
+  return false;
+}
+
+/* Register the use of a reg.  */
+
+static void
+register_use (int size, unsigned int regno)
+{
+  int *current;
+  int prev;
+  
+  gcc_assert (size >= 0);
+  gcc_assert (regno < (unsigned int)max_reg_num ());
+
+  current = &biggest_use[regno];  
+
+  if (*current == SKIP_REG)
+    return;
+
+  *current = MAX (*current, size);
+  nr_uses[regno]++;
+
+  if (prev_nr_uses == NULL)
+    return;
+
+  prev = prev_biggest_use[regno];
+  if (nr_uses[regno] == prev_nr_uses[regno] && *current < prev)
+    {
+      if (dump_file)
+        fprintf (dump_file, "reg %d: size %d -> %d\n", regno, prev, *current);
+      changed = true;
+    }
+}
+
+/* Get the biggest use of a reg from a previous scan. In case all uses are
+   already registered in the current pass, use the biggest use from the current
+   pass.  */
+
+static int
+get_prev_biggest_use (unsigned int regno)
+{
+  int current, res;
+  
+  gcc_assert (!skip_reg_p (regno));
+
+  res = prev_biggest_use[regno];
+
+  current = biggest_use[regno];
+  gcc_assert (current <= res);
+  if (nr_uses[regno] == prev_nr_uses[regno] && current < res)
+    res = current;
+
+  gcc_assert (res >= 0);
+  return res;
+}
+
+/* Handle embedded uses.  */
+
+static void
+note_embedded_uses (rtx use, rtx pattern)
+{
+  const char *format_ptr;
+  int i, j;
+
+  format_ptr = GET_RTX_FORMAT (GET_CODE (use));
+  for (i = 0; i < GET_RTX_LENGTH (GET_CODE (use)); i++)
+    if (format_ptr[i] == 'e')
+      note_use (&XEXP (use, i), pattern);
+    else if (format_ptr[i] == 'E')
+      for (j = 0; j < XVECLEN (use, i); j++)
+        note_use (&XVECEXP (use, i, j), pattern);
+}
+
+/* Get the set that has use as its SRC operand.  */
+
+static rtx
+get_set (rtx use, rtx pattern)
+{
+  rtx sub;
+  int i;
+
+  if (GET_CODE (pattern) == SET && SET_SRC (pattern) == use)
+    return pattern;
+
+  if (GET_CODE (pattern) == PARALLEL)
+    for (i = 0; i < XVECLEN (pattern, 0); ++i)
+      {
+        sub = XVECEXP (pattern, 0, i);
+        if (GET_CODE (sub) == SET && SET_SRC (sub) == use)
+          return sub;
+      }
+  
+  return NULL_RTX;
+}
+
+/* Handle a restricted op use. In this context restricted means that a bit in an
+   operand influences only the same bit or more significant bits in the result.
+   The bitwise ops are a subclass, but PLUS is one as well.  */
+
+static void
+note_restricted_op_use (rtx use, unsigned int nr_operands, rtx pattern)
+{
+  unsigned int i, smallest;
+  int operand_size[2];
+  int used_size;
+  unsigned int operand_regno[2];
+  bool operand_reg[2];
+  bool operand_ignore[2];
+  rtx set;
+
+  /* Init operand_reg, operand_size, operand_regno and operand_ignore.  */
+  for (i = 0; i < nr_operands; ++i)
+    {
+      operand_reg[i] = reg_use_p (XEXP (use, i), &operand_size[i],
+                                  &operand_regno[i]);
+      operand_ignore[i] = false;
+    }
+
+  /* Handle case of reg and-masked with const.  */
+  if (GET_CODE (use) == AND && CONST_INT_P (XEXP (use, 1)) && operand_reg[0])
+    {
+      used_size =
+        HOST_BITS_PER_WIDE_INT - clz_hwi (UINTVAL (XEXP (use, 1)));
+      operand_size[0] = MIN (operand_size[0], used_size);
+    }
+
+  /* Handle case of reg or-masked with const.  */
+  if (GET_CODE (use) == IOR && CONST_INT_P (XEXP (use, 1)) && operand_reg[0])
+    {
+      used_size =
+        HOST_BITS_PER_WIDE_INT - clz_hwi (~UINTVAL (XEXP (use, 1)));
+      operand_size[0] = MIN (operand_size[0], used_size);
+    }
+
+  /* Ignore the use of a in 'a = a + b'.  */
+  set = get_set (use, pattern);
+  /* TODO: handle GET_CODE ((SET_DEST (set))) == SUBREG. */
+  if (set != NULL_RTX && REG_P (SET_DEST (set)))
+    for (i = 0; i < nr_operands; ++i)
+      operand_ignore[i] = (operand_reg[i]
+                           && (REGNO (SET_DEST (set)) == operand_regno[i]));
+
+  /* Handle the case a reg is combined with don't care bits.  */
+  if (nr_operands == 2 && operand_reg[0] && operand_reg[1]
+      && operand_size[0] != operand_size[1])
+    {
+      smallest = operand_size[0] > operand_size[1];
+
+      if (paradoxical_subreg_p (XEXP (use, smallest))
+          && !SUBREG_PROMOTED_VAR_P (XEXP (use, smallest)))
+        operand_size[1 - smallest] = operand_size[smallest];
+    }
+
+  /* TODO: handle GET_CODE ((SET_DEST (set))) == SUBREG. */
+  if (prev_biggest_use != NULL && set != NULL_RTX && REG_P (SET_DEST (set))
+      && !skip_reg_p (REGNO (SET_DEST (set))))
+    for (i = 0; i < nr_operands; ++i)
+      operand_size[i] = MIN (operand_size[i],
+                             get_prev_biggest_use (REGNO (SET_DEST (set))));
+
+  /* Register the operand use, if necessary.  */
+  for (i = 0; i < nr_operands; ++i)
+    if (!operand_reg[i])
+      note_use (&XEXP (use, i), pattern);
+    else if (!operand_ignore[i])
+      register_use (operand_size[i], operand_regno[i]);
+}
+
+/* Handle all uses noted by note_uses.  */
+
+static void
+note_use (rtx *x, void *data)
+{
+  rtx use = *x;
+  rtx pattern = (rtx)data;
+  int use_size;
+  unsigned int use_regno;
+  rtx set;
+
+  switch (GET_CODE (use))
+    {
+    case REG:
+    case SUBREG:
+      if (!reg_use_p (use, &use_size, &use_regno))
+        {
+          note_embedded_uses (use, pattern);
+          return;
+        }
+      if (prev_biggest_use != NULL)
+        {
+          set = get_set (use, pattern);
+          /* TODO: handle GET_CODE ((SET_DEST (set))) == SUBREG. */
+          if (set != NULL_RTX && REG_P (SET_DEST (set))
+              && !skip_reg_p (REGNO (SET_DEST (set))))
+            use_size = MIN (use_size,
+                            get_prev_biggest_use (REGNO (SET_DEST (set))));
+        }
+      register_use (use_size, use_regno);
+      return;
+    case IOR:
+    case AND:
+    case XOR:
+    case PLUS:
+    case MINUS:
+      note_restricted_op_use (use, 2, pattern);
+      return;
+    case NOT:
+    case NEG:
+      note_restricted_op_use (use, 1, pattern);
+      return;
+    case ASHIFT:
+      if (!reg_use_p (XEXP (use, 0), &use_size, &use_regno)
+	  || !CONST_INT_P (XEXP (use, 1))
+          || INTVAL (XEXP (use, 1)) <= 0
+          || paradoxical_subreg_p (XEXP (use, 0)))
+        {
+          note_embedded_uses (use, pattern);
+          return;
+        }
+      register_use (use_size - INTVAL (XEXP (use, 1)), use_regno);
+      return;
+    default:
+      note_embedded_uses (use, pattern);
+      return;
+    }
+}
+
+/* Check whether reg is implicitly used.  */
+
+static bool
+implicit_use_p (int regno ATTRIBUTE_UNUSED)
+{
+#ifdef EPILOGUE_USES
+  if (EPILOGUE_USES (regno))
+    return true;
+#endif
+
+#ifdef EH_USES
+  if (EH_USES (regno))
+    return true;
+#endif
+
+  return false;
+}
+
+/* Check whether reg should be skipped in analysis.  */
+
+static bool
+skip_reg_p (int regno)
+{
+  /* TODO: handle hard registers. The problem with hard registers is that 
+     a DI use of r0 can mean a 64bit use of r0 and a 32 bit use of r1.
+     We don't handle that properly.  */
+  return implicit_use_p (regno) || HARD_REGISTER_NUM_P (regno);
+}
+
+/* Note the uses of argument registers in a call.  */
+
+static void
+note_call_uses (rtx insn)
+{
+  rtx link, link_expr;
+
+  if (!CALL_P (insn))
+    return;
+
+  for (link = CALL_INSN_FUNCTION_USAGE (insn); link; link = XEXP (link, 1))
+    {
+      link_expr = XEXP (link, 0);
+
+      if (GET_CODE (link_expr) == USE)
+        note_use (&XEXP (link_expr, 0), link);
+    }
+}
+
+/* Dump the biggest uses found.  */
+
+static void
+dump_biggest_use (void)
+{
+  int i;
+
+  if (!dump_file)
+    return;
+
+  for (i = 0; i < max_reg_num (); i++)
+    if (biggest_use[i] > 0)
+      fprintf (dump_file, "reg %d: size %d\n", i, biggest_use[i]);
+
+  fprintf (dump_file, "\n");
+}
+
+/* Calculate the biggest use mode for all regs. */
+
+static void
+calculate_biggest_use (void)
+{
+  int i;
+  basic_block bb;
+  rtx insn;
+
+  /* Initialize biggest_use for all regs to 0. If a reg is used implicitly, we
+     handle that reg conservatively and set it to SKIP_REG instead.  */
+  for (i = 0; i < max_reg_num (); i++)
+    {
+      biggest_use[i] = skip_reg_p (i) ? SKIP_REG : 0;
+      nr_uses[i] = skip_reg_p (i) ? SKIP_REG : 0;
+    }
+
+  /* For all insns, call note_use for each use in insn.  */
+  FOR_EACH_BB (bb)
+    FOR_BB_INSNS_REVERSE (bb, insn)
+      {
+        if (!NONDEBUG_INSN_P (insn))
+          continue;
+
+        note_uses (&PATTERN (insn), note_use, PATTERN (insn));
+
+        if (CALL_P (insn))
+          note_call_uses (insn);
+      }
+}
+
+/* Check whether this is a sign/zero extension.  */
+
+static bool
+extension_p (rtx insn, rtx *dest, rtx *inner, int *preserved_size)
+{
+  rtx src, op0;
+
+  /* Detect set of reg.  */
+  if (GET_CODE (PATTERN (insn)) != SET)
+    return false;
+
+  src = SET_SRC (PATTERN (insn));
+  *dest = SET_DEST (PATTERN (insn));
+          
+  if (!REG_P (*dest))
+    return false;
+
+  /* Detect sign or zero extension.  */
+  if (GET_CODE (src) == ZERO_EXTEND || GET_CODE (src) == SIGN_EXTEND
+      || (GET_CODE (src) == AND && CONST_INT_P (XEXP (src, 1))))
+    {
+      op0 = XEXP (src, 0);
+
+      /* Determine amount of least significant bits preserved by operation.  */
+      if (GET_CODE (src) == AND)
+        *preserved_size = ctz_hwi (~UINTVAL (XEXP (src, 1)));
+      else
+        *preserved_size = GET_MODE_BITSIZE (GET_MODE (op0));
+
+      if (GET_CODE (op0) == SUBREG)
+        {
+          if (subreg_lsb (op0) != 0)
+            return false;
+      
+          *inner = SUBREG_REG (op0);
+          return true;
+        }
+      else if (REG_P (op0))
+        {
+          *inner = op0;
+          return true;
+        }
+    }
+
+  return false;
+}
+
+/* Find extensions and store them in the extensions vector.  */
+
+static bool
+find_extensions (void)
+{
+  basic_block bb;
+  rtx insn, dest, inner;
+  int preserved_size;
+
+  /* For all insns, call note_use for each use in insn.  */
+  FOR_EACH_BB (bb)
+    FOR_BB_INSNS (bb, insn)
+      {
+        if (!NONDEBUG_INSN_P (insn))
+          continue;
+
+        if (!extension_p (insn, &dest, &inner, &preserved_size))
+          continue;
+
+        VEC_safe_push (rtx, heap, extensions, insn);
+
+        if (dump_file)
+          fprintf (dump_file,
+                   "found extension %u with preserved size %d defining"
+                   " reg %d\n",
+                   INSN_UID (insn), preserved_size, REGNO (dest));
+      }
+
+  if (dump_file)
+    {
+      if (!VEC_empty (rtx, extensions))
+        fprintf (dump_file, "\n");
+      else
+        fprintf (dump_file, "no extensions found.\n");
+    }
+
+  return !VEC_empty (rtx, extensions);
+}
+
+/* Check whether this is a redundant sign/zero extension.  */
+
+static bool
+redundant_extension_p (rtx insn, rtx *dest, rtx *inner, int *preserved_size)
+{
+  int biggest_dest_use;
+
+  if (!extension_p (insn, dest, inner, preserved_size))
+    return false;
+
+  biggest_dest_use = biggest_use[REGNO (*dest)];
+      
+  if (biggest_dest_use == SKIP_REG)
+    return false;
+
+  if (*preserved_size < biggest_dest_use)
+    return false;
+
+  return true;
+}
+
+/* Find the redundant extensions in the extensions vector and move them to the
+   redundant_extensions vector.  */
+
+static void
+find_redundant_extensions (void)
+{
+  rtx insn, dest, inner;
+  int ix;
+  bool found = false;
+  int preserved_size;
+
+  FOR_EACH_VEC_ELT_REVERSE (rtx, extensions, ix, insn)
+    if (redundant_extension_p (insn, &dest, &inner, &preserved_size))
+      {
+        VEC_safe_push (rtx, heap, redundant_extensions, insn);
+        VEC_unordered_remove (rtx, extensions, ix);
+
+        if (dump_file)
+          fprintf (dump_file,
+                   "found superfluous extension %u with preserved size %d"
+                   " defining reg %d\n",
+                   INSN_UID (insn), preserved_size, REGNO (dest));
+        found = true;
+      }
+
+  if (dump_file && found)
+    fprintf (dump_file, "\n");
+}
+
+/* run calculate_biggest_use iteratively.  */
+
+static void
+propagate (void)
+{
+  unsigned int nr_propagate = 0;
+  unsigned int max_propagate = PARAM_VALUE (PARAM_EE_MAX_PROPAGATE);
+
+  while (nr_propagate < max_propagate && !VEC_empty (rtx, extensions))
+    {
+      ++nr_propagate;
+
+      if (dump_file)
+        fprintf (dump_file, "propagating(%u)\n", nr_propagate);
+
+      SWAP (int *, biggest_use, prev_biggest_use);
+      SWAP (int *, nr_uses, prev_nr_uses);
+
+      changed = false;
+      calculate_biggest_use ();
+      if (!changed)
+        break;
+
+      if (dump_file)
+        fprintf (dump_file, "\n");
+
+      find_redundant_extensions ();
+    }
+}
+
+/* Try to remove or replace the redundant extension.  */
+
+static void
+try_remove_or_replace_extension (rtx insn, rtx dest, rtx inner)
+{
+  rtx cp_src, cp_dest, seq, one;
+
+  /* Check whether replacement is needed.  */
+  if (dest != inner)
+    {
+      start_sequence ();
+
+      /* Determine the proper replacement operation.  */
+      if (GET_MODE (dest) == GET_MODE (inner))
+        {
+          cp_src = inner;
+          cp_dest = dest;
+        }
+      else if (GET_MODE_SIZE (GET_MODE (dest))
+               > GET_MODE_SIZE (GET_MODE (inner)))
+        {
+          emit_clobber (dest);
+          cp_src = inner;
+          cp_dest = gen_lowpart_SUBREG (GET_MODE (inner), dest);
+        }
+      else 
+        {
+          cp_src = gen_rtx_TRUNCATE (GET_MODE (dest), inner);
+          cp_dest = dest;
+        }
+
+      emit_move_insn (cp_dest, cp_src);
+
+      seq = get_insns ();
+      end_sequence ();
+
+      /* If the replacement is not supported, bail out.  */
+      for (one = seq; one != NULL_RTX; one = NEXT_INSN (one))
+        if (recog_memoized (one) < 0 && GET_CODE (PATTERN (one)) != CLOBBER)
+          return;
+
+      /* Insert the replacement.  */
+      emit_insn_before (seq, insn);
+
+      if (dump_file)
+        {
+          fprintf (dump_file, "superfluous extension %u replaced by\n",
+                   INSN_UID (insn));
+          for (one = seq; one != NULL_RTX; one = NEXT_INSN (one))
+            fprintf (dump_file, " %u", INSN_UID (seq));
+          fprintf (dump_file, "\n");
+        }
+    }
+  else
+    if (dump_file)
+      fprintf (dump_file, "superfluous extension %u removed\n", INSN_UID (insn));
+
+  /* Remove the extension.  */
+  delete_insn (insn);  
+}
+
+/* Remove the redundant extensions.  */
+
+static void
+remove_redundant_extensions (void)
+{
+  rtx insn, dest, inner;
+  int preserved_size;
+  int ix;
+
+  FOR_EACH_VEC_ELT (rtx, redundant_extensions, ix, insn)
+    {
+      extension_p (insn, &dest, &inner, &preserved_size);
+      try_remove_or_replace_extension (insn, dest, inner);
+    }
+
+  if (dump_file)
+    fprintf (dump_file, "\n");
+}
+
+/* Setup the variables at the start of the pass.  */
+
+static void
+init_pass (bool for_propagate)
+{
+  if (for_propagate)
+    {
+      prev_biggest_use = XNEWVEC (int, max_reg_num ());
+      prev_nr_uses = XNEWVEC (int, max_reg_num ());
+      return;
+    }
+
+  biggest_use = XNEWVEC (int, max_reg_num ());
+  nr_uses = XNEWVEC (int, max_reg_num ());
+
+  prev_biggest_use = NULL;
+  prev_nr_uses = NULL;
+
+  extensions = VEC_alloc (rtx, heap, 10); 
+  redundant_extensions = VEC_alloc (rtx, heap, 10);
+}
+
+/* Free the variables at the end of the pass.  */
+
+static void
+finish_pass (void)
+{
+  XDELETEVEC (biggest_use);
+  XDELETEVEC (nr_uses);
+  
+  XDELETEVEC (prev_biggest_use);
+  XDELETEVEC (prev_nr_uses);
+
+  VEC_free (rtx, heap, extensions);
+  VEC_free (rtx, heap, redundant_extensions);
+}
+
+/* Find redundant extensions and remove or replace them if possible.  */
+
+static void
+find_and_remove_redundant_extensions (void)
+{
+  init_pass (false);
+
+  if (!find_extensions ())
+    {
+      finish_pass ();
+      return;
+    }
+
+  calculate_biggest_use ();
+  dump_biggest_use ();
+
+  find_redundant_extensions ();
+
+  if (optimize >= 2 && !VEC_empty (rtx, extensions))
+    {
+      init_pass (true);
+      propagate ();
+    }
+
+  remove_redundant_extensions ();
+
+  finish_pass ();
+}
+
+/* Remove redundant extensions.  */
+
+static unsigned int
+rest_of_handle_ee (void)
+{
+  find_and_remove_redundant_extensions ();
+  return 0;
+}
+
+/* Run ee pass when flag_ee is set at optimization level > 0.  */
+
+static bool
+gate_handle_ee (void)
+{
+  return (optimize > 0 && flag_ee);
+}
+
+struct rtl_opt_pass pass_ee =
+{
+ {
+  RTL_PASS,
+  "ee",                                 /* name */
+  gate_handle_ee,                       /* gate */
+  rest_of_handle_ee,                    /* execute */
+  NULL,                                 /* sub */
+  NULL,                                 /* next */
+  0,                                    /* static_pass_number */
+  TV_EE,                                /* tv_id */
+  0,                                    /* properties_required */
+  0,                                    /* properties_provided */
+  0,                                    /* properties_destroyed */
+  0,                                    /* todo_flags_start */
+  TODO_ggc_collect |
+  TODO_dump_func |
+  TODO_verify_rtl_sharing,              /* todo_flags_finish */
+ }
+};
Index: gcc/common.opt
===================================================================
--- gcc/common.opt	(revision 165080)
+++ gcc/common.opt	(working copy)
@@ -761,6 +761,10 @@ 
 Common Report Var(flag_eliminate_dwarf2_dups)
 Perform DWARF2 duplicate elimination
 
+fextension-elimination
+Common Report Var(flag_ee) Init(0) Optimization
+Perform extension elimination
+
 fipa-sra
 Common Report Var(flag_ipa_sra) Init(0) Optimization
 Perform interprocedural reduction of aggregates
Index: gcc/Makefile.in
===================================================================
--- gcc/Makefile.in	(revision 165080)
+++ gcc/Makefile.in	(working copy)
@@ -1218,6 +1218,7 @@ 
 	dwarf2asm.o \
 	dwarf2out.o \
 	ebitmap.o \
+	ee.o \
 	emit-rtl.o \
 	et-forest.o \
 	except.o \
@@ -3107,6 +3108,11 @@ 
 web.o : web.c $(CONFIG_H) $(SYSTEM_H) coretypes.h $(TM_H) $(RTL_H) \
    hard-reg-set.h $(FLAGS_H) $(BASIC_BLOCK_H) $(FUNCTION_H) output.h $(TOPLEV_H) $(DIAGNOSTIC_CORE_H) \
    insn-config.h $(RECOG_H) $(DF_H) $(OBSTACK_H) $(TIMEVAR_H) $(TREE_PASS_H)
+ee.o : ee.c $(CONFIG_H) $(SYSTEM_H) coretypes.h $(TM_H) $(RTL_H) \
+   hard-reg-set.h $(FLAGS_H) $(BASIC_BLOCK_H) $(FUNCTION_H) output.h \
+   $(DF_H) $(TIMEVAR_H) tree-pass.h $(RECOG_H) $(EXPR_H) \
+   $(REGS_H) $(TREE_H) $(TM_P_H) insn-config.h $(INSN_ATTR_H) $(TOPLEV_H) $(DIAGNOSTIC_CORE_H) \
+   $(TARGET_H) $(OPTABS_H) insn-codes.h rtlhooks-def.h $(PARAMS_H) $(CGRAPH_H)
 implicit-zee.o : implicit-zee.c $(CONFIG_H) $(SYSTEM_H) coretypes.h $(TM_H) $(RTL_H) \
    hard-reg-set.h $(FLAGS_H) $(BASIC_BLOCK_H) $(FUNCTION_H) output.h \
    $(DF_H) $(TIMEVAR_H) tree-pass.h $(RECOG_H) $(EXPR_H) \
Index: gcc/passes.c
===================================================================
--- gcc/passes.c	(revision 165080)
+++ gcc/passes.c	(working copy)
@@ -976,6 +976,7 @@ 
       NEXT_PASS (pass_lower_subreg);
       NEXT_PASS (pass_df_initialize_opt);
       NEXT_PASS (pass_cse);
+      NEXT_PASS (pass_ee);
       NEXT_PASS (pass_rtl_fwprop);
       NEXT_PASS (pass_rtl_cprop);
       NEXT_PASS (pass_rtl_pre);
Index: gcc/params.def
===================================================================
--- gcc/params.def	(revision 165080)
+++ gcc/params.def	(working copy)
@@ -849,6 +849,12 @@ 
 	  "lto-min-partition",
 	  "Size of minimal paritition for WHOPR (in estimated instructions)",
 	  1000, 0, 0)
+
+DEFPARAM (PARAM_EE_MAX_PROPAGATE,
+	  "ee-max-propagate",
+	  "Maximum number of scans over all insn to do propagation",
+	  3, 0, 0)
+
 /*
 Local variables:
 mode:c
Index: gcc/testsuite/gcc.dg/extend-4.c
===================================================================
--- gcc/testsuite/gcc.dg/extend-4.c	(revision 0)
+++ gcc/testsuite/gcc.dg/extend-4.c	(revision 0)
@@ -0,0 +1,13 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-rtl-ee" } */
+
+unsigned char f(unsigned int a)
+{
+  unsigned int b = a & 0x10ff;
+  return b;
+}
+
+/* { dg-final { scan-rtl-dump-times "and:" 0 "ee" { target mips*-*-* } } } */
+/* { dg-final { scan-rtl-dump-times "superfluous extension \[0-9\]+ replaced" 1 "ee" { target mips*-*-* } } } */
+/* { dg-final { cleanup-rtl-dump "ee" } } */
+
Index: gcc/testsuite/gcc.dg/extend-1.c
===================================================================
--- gcc/testsuite/gcc.dg/extend-1.c	(revision 0)
+++ gcc/testsuite/gcc.dg/extend-1.c	(revision 0)
@@ -0,0 +1,13 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-rtl-ee" } */
+
+void f(unsigned char * p, short s, int c, int *z)
+{
+  if (c)
+    *z = 0;
+  *p ^= (unsigned char)s;
+}
+
+/* { dg-final { scan-rtl-dump-times "sign_extend:" 0 "ee" { target mips*-*-* } } } */
+/* { dg-final { scan-rtl-dump-times "superfluous extension \[0-9\]+ replaced" 1 "ee" { target mips*-*-* } } } */
+/* { dg-final { cleanup-rtl-dump "ee" } } */
Index: gcc/testsuite/gcc.dg/extend-2.c
===================================================================
--- gcc/testsuite/gcc.dg/extend-2.c	(revision 0)
+++ gcc/testsuite/gcc.dg/extend-2.c	(revision 0)
@@ -0,0 +1,20 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-rtl-ee" } */
+/* { dg-require-effective-target ilp32 } */
+
+void f(unsigned char * p, short *s, int c)
+{
+  short or = 0;
+  while (c)
+    {
+      or = or | s[c];
+      c --;
+    }
+  *p = (unsigned char)or;
+}
+
+/* { dg-final { scan-rtl-dump-times "zero_extend" 0 "ee" { target mips*-*-* } } } */
+/* { dg-final { scan-rtl-dump-times "sign_extend" 0 "ee" { target mips*-*-* } } } */
+/* { dg-final { scan-rtl-dump-times "superfluous extension \[0-9\]+ replaced" 2 "ee" { target mips*-*-* } } } */
+/* { dg-final { cleanup-rtl-dump "ee" } } */
+
Index: gcc/testsuite/gcc.dg/extend-2-64.c
===================================================================
--- gcc/testsuite/gcc.dg/extend-2-64.c	(revision 0)
+++ gcc/testsuite/gcc.dg/extend-2-64.c	(revision 0)
@@ -0,0 +1,20 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-rtl-ee" } */
+/* { dg-require-effective-target mips64 } */
+
+void f(unsigned char * p, short *s, int c)
+{
+  short or = 0;
+  while (c)
+    {
+      or = or | s[c];
+      c --;
+    }
+  *p = (unsigned char)or;
+}
+
+/* { dg-final { scan-rtl-dump-times "zero_extend:" 1 "ee" { target mips*-*-* } } } */
+/* { dg-final { scan-rtl-dump-times "sign_extend:" 0 "ee" { target mips*-*-* } } } */
+/* { dg-final { scan-rtl-dump-times "superfluous extension \[0-9\]+ replaced" 3 "ee" { target mips*-*-* } } } */
+/* { dg-final { cleanup-rtl-dump "ee" } } */
+
Index: gcc/testsuite/gcc.dg/extend-3.c
===================================================================
--- gcc/testsuite/gcc.dg/extend-3.c	(revision 0)
+++ gcc/testsuite/gcc.dg/extend-3.c	(revision 0)
@@ -0,0 +1,12 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-rtl-ee" } */
+
+unsigned int f(unsigned char byte)
+{
+  return byte << 25;
+}
+
+/* { dg-final { scan-rtl-dump-times "zero_extend:" 0 "ee" { target mips*-*-* } } } */
+/* { dg-final { scan-rtl-dump "superfluous extension \[0-9\]+ replaced" "ee" { target mips*-*-* } } } */
+/* { dg-final { cleanup-rtl-dump "ee" } } */
+