diff mbox

[passes.def] : Fix regression on ARM PR/61608

Message ID CAEwic4Ycv_asbomwC2gpcjDLQmJ9LO023PBFtrmKA=9BXf58ag@mail.gmail.com
State New
Headers show

Commit Message

Kai Tietz June 25, 2014, 1:35 p.m. UTC
Hello,

so there seems to be a fallout caused by moving peephole2 pass. See PR/61608.
So we need indeed 2 peephole2 passes.

ChangeLog

2014-06-25  Kai Tietz  <ktietz@redhat.com>

    PR rtl-optimization/61608
    * passes.def (peephole2): Readd peephole2 pass
    before if-after-reload pass.

Tested for arm*-none-*, i686-w64-cygwin, x86_64-unknown-linux-gnu.  Ok
for apply?

Regards,
Kai

Comments

Richard Biener June 25, 2014, 1:56 p.m. UTC | #1
On Wed, Jun 25, 2014 at 3:35 PM, Kai Tietz <ktietz70@googlemail.com> wrote:
> Hello,
>
> so there seems to be a fallout caused by moving peephole2 pass. See PR/61608.
> So we need indeed 2 peephole2 passes.
>
> ChangeLog
>
> 2014-06-25  Kai Tietz  <ktietz@redhat.com>
>
>     PR rtl-optimization/61608
>     * passes.def (peephole2): Readd peephole2 pass
>     before if-after-reload pass.
>
> Tested for arm*-none-*, i686-w64-cygwin, x86_64-unknown-linux-gnu.  Ok
> for apply?

Uh, every time I try to improve compile-time by removing passes somebody
else adds other passes back.

Please try to avoid this.

Thanks,
Richard.

> Regards,
> Kai
>
> Index: passes.def
> ===================================================================
> --- passes.def    (Revision 211971)
> +++ passes.def    (Arbeitskopie)
> @@ -396,6 +396,7 @@ along with GCC; see the file COPYING3.  If not see
>        NEXT_PASS (pass_rtl_dse2);
>        NEXT_PASS (pass_stack_adjustments);
>        NEXT_PASS (pass_jump2);
> +      NEXT_PASS (pass_peephole2);
>        NEXT_PASS (pass_if_after_reload);
>        NEXT_PASS (pass_regrename);
>        NEXT_PASS (pass_cprop_hardreg);
> @@ -407,8 +408,7 @@ along with GCC; see the file COPYING3.  If not see
>           We have a single indirect branch in the entire function
>           before duplicate-compute-gotos pass.  This vastly reduces
>           the size of the CFG.
> -         For preventing to run peephole2 pass twice, its run after
> -         the jump2 got removed.  */
> +         We need to run peephole2 pass twice. See PR/61608.  */
>        NEXT_PASS (pass_peephole2);
>        NEXT_PASS (pass_branch_target_load_optimize2);
>        NEXT_PASS (pass_leaf_regs);
Jeff Law June 25, 2014, 2:04 p.m. UTC | #2
On 06/25/14 07:35, Kai Tietz wrote:
> Hello,
>
> so there seems to be a fallout caused by moving peephole2 pass. See PR/61608.
> So we need indeed 2 peephole2 passes.
>
> ChangeLog
>
> 2014-06-25  Kai Tietz  <ktietz@redhat.com>
>
>      PR rtl-optimization/61608
>      * passes.def (peephole2): Readd peephole2 pass
>      before if-after-reload pass.
So why is the peephole not working in its current location?

Jeff
Kai Tietz June 25, 2014, 3:04 p.m. UTC | #3
2014-06-25 16:04 GMT+02:00 Jeff Law <law@redhat.com>:
> So why is the peephole not working in its current location?
>
> Jeff

Hi Jeff,

that is what I read out of dumps:

If peephole2 is executed early we see following pattern transformation:

(insn 12 11 13 2 (set (reg:CC_NOOV 100 cc)
        (compare:CC_NOOV (zero_extract:SI (reg:SI 4 r4 [orig:110 D.1414 ] [110]
)
                (const_int 1 [0x1])
                (const_int 2 [0x2]))
            (const_int 0 [0]))) arm_epilog-1.c:11 84
{*zeroextractsi_compare0_scratch}
     (expr_list:REG_DEAD (reg:SI 4 r4 [orig:110 D.1414 ] [110])
        (nil)))
(jump_insn 13 12 14 2 (set (pc)
        (if_then_else (eq (reg:CC_NOOV 100 cc)
                 (const_int 0 [0]))
             (label_ref:SI 16)
            (pc))) arm_epilog-1.c:11 236 {arm_cond_branch}
     (expr_list:REG_DEAD (reg:CC_NOOV 100 cc)
        (int_list:REG_BR_PROB 3900 (nil)))
  -> 16)

which gets replaced to:

(insn 62 11 63 2 (parallel [
            (set (reg:CC_NOOV 100 cc)
                (compare:CC_NOOV (ashift:SI (reg:SI 4 r4 [orig:110
D.1414 ] [110])
                        (const_int 29 [0x1d]))
                    (const_int 0 [0])))
            (clobber (reg:SI 4 r4))
        ]) arm_epilog-1.c:11 -1
     (nil))
(jump_insn 63 62 14 2 (set (pc)
        (if_then_else (ge (reg:CC_NOOV 100 cc)
                 (const_int 0 [0]))
             (label_ref:SI 16)
            (pc))) arm_epilog-1.c:11 -1
     (nil)
  -> 16)

If we run peephole2 pass late we see instead the following pattern,
which isn't handled by peephole2 pass anymore.

(insn 12 11 15 2 (set (reg:CC_NOOV 100 cc)
        (compare:CC_NOOV (zero_extract:SI (reg:SI 4 r4 [orig:110 D.1414 ] [110])
                (const_int 1 [0x1])
                (const_int 2 [0x2]))
            (const_int 0 [0]))) arm_epilog-1_.c:11 84 {*zeroextractsi_compare0_s
cratch}
     (expr_list:REG_DEAD (reg:SI 4 r4 [orig:110 D.1414 ] [110])
        (nil)))
(insn 15 12 22 2 (cond_exec (ne (reg:CC_NOOV 100 cc)
            (const_int 0 [0]))
        (set (reg/v:SI 2 r2 [orig:115 c ] [115])
            (plus:SI (reg/v:SI 2 r2 [orig:115 c ] [115])
                (const_int 1 [0x1])))) arm_epilog-1_.c:11 3229 {*p *arm_addsi3}
     (expr_list:REG_DEAD (reg:CC_NOOV 100 cc)
        (nil)))

So this issue might be solvable by extending ARM's peephole2 patterns?
An ARM maintainer might be able to tell.

Kai
Jeff Law June 25, 2014, 3:28 p.m. UTC | #4
On 06/25/14 09:04, Kai Tietz wrote:
> 2014-06-25 16:04 GMT+02:00 Jeff Law <law@redhat.com>:
>> So why is the peephole not working in its current location?
>>
>> Jeff
>
> Hi Jeff,
>
> that is what I read out of dumps:
>
> If peephole2 is executed early we see following pattern transformation:
[ ... ]
Ask an ARM maintainer if the new code is actually better than the old code.

It appears that with the peep2 pass moved that we actually if-convert 
the fall-thru path of the conditional and eliminate the conditional. 
Which, on the surface seems like a good thing.  It may be the case that 
we need to twiddle the test.  Not sure yet.


Jeff
Ramana Radhakrishnan June 25, 2014, 3:47 p.m. UTC | #5
[Apologies about the duplicates to folks - resending to make this hit 
the lists]

On 25/06/14 16:28, Jeff Law wrote:
> On 06/25/14 09:04, Kai Tietz wrote:
>> 2014-06-25 16:04 GMT+02:00 Jeff Law <law@redhat.com>:
>>> So why is the peephole not working in its current location?
>>>
>>> Jeff
>>
>> Hi Jeff,
>>
>> that is what I read out of dumps:
>>
>> If peephole2 is executed early we see following pattern transformation:
> [ ... ]
> Ask an ARM maintainer if the new code is actually better than the old code.
>
> It appears that with the peep2 pass moved that we actually if-convert
> the fall-thru path of the conditional and eliminate the conditional.
> Which, on the surface seems like a good thing.  It may be the case that
> we need to twiddle the test.  Not sure yet.


No, we don't need to twiddle the test. The test is good as it is today
as we caught a potential size regression. (Prima-facie the test itself
doesn't show an actual size regression because of padding but you can
see the replacement of a 2 byte instruction with a 4 byte instruction)

These peepholes that were in there gave 0.1% in CSiBe for Thumb2 as per
the original patch 
(https://gcc.gnu.org/ml/gcc-patches/2010-06/msg02518.html). The 
if-conversion happened anyway afterwards regardless of where the pattern 
was but the key here was the conversion of the 4 byte instruction to a 2 
byte instruction.

Looks like the all the peephole2's that have if_then_else's (5 patterns
today) also need to develop a cond_exec variant where possible in the
ARM backend.

It seems unfortunate that the backend has to handle each of the possible 
pattern classes that may be conditionalized by the cond-exec pass and we 
potentially maintain a list of patterns in sync with anything that can 
be conditionalized.

At first glance this looks ugly to "work around" in the backend.

regards
Ramana

>
>
> Jeff
>
>
>
Richard Henderson June 25, 2014, 3:50 p.m. UTC | #6
On 06/25/2014 08:28 AM, Jeff Law wrote:
> Ask an ARM maintainer if the new code is actually better than the old code.

It isn't.

> It appears that with the peep2 pass moved that we actually if-convert the
> fall-thru path of the conditional and eliminate the conditional. Which, on the
> surface seems like a good thing.  It may be the case that we need to twiddle
> the test.  Not sure yet.

Looking at the final code in the pr, it looks like we if-convert both ways,
just with changed condition on the test.

It looks like there are at least 3 peepholes in the arm backend that match
compare+branch with a scratch register that are affected by this.  I don't
think it's reasonable to expect a peephole to match compare + n cond_exec
insns, so running peep2 before if-conversion would seem to be called for.

Kai, why does your indirect jump peephole require pass_reorder_blocks?  It
seems to me that instead of moving peephole2 down, we could move
duplicate_computed_gotos up.


r~
Kai Tietz June 25, 2014, 4:02 p.m. UTC | #7
2014-06-25 17:50 GMT+02:00 Richard Henderson <rth@redhat.com>:
> On 06/25/2014 08:28 AM, Jeff Law wrote:
>> Ask an ARM maintainer if the new code is actually better than the old code.
>
> It isn't.
>
>> It appears that with the peep2 pass moved that we actually if-convert the
>> fall-thru path of the conditional and eliminate the conditional. Which, on the
>> surface seems like a good thing.  It may be the case that we need to twiddle
>> the test.  Not sure yet.
>
> Looking at the final code in the pr, it looks like we if-convert both ways,
> just with changed condition on the test.
>
> It looks like there are at least 3 peepholes in the arm backend that match
> compare+branch with a scratch register that are affected by this.  I don't
> think it's reasonable to expect a peephole to match compare + n cond_exec
> insns, so running peep2 before if-conversion would seem to be called for.
>
> Kai, why does your indirect jump peephole require pass_reorder_blocks?  It
> seems to me that instead of moving peephole2 down, we could move
> duplicate_computed_gotos up.
>
>
> r~

We need the peephole2 pass after reorder_blocks due we move in that
pass the jump

...
(code_label 54 52 55 5 8 "" [0 uses])
(note 55 54 56 5 [bb 5] NOTE_INSN_BASIC_BLOCK)
(jump_insn 56 55 57 5 (set (pc)
        (reg/f:DI 0 ax [orig:83 D.3469 ] [83])) 638 {*indirect_jump}
     (expr_list:REG_DEAD (reg/f:DI 0 ax [orig:83 D.3469 ] [83])
        (nil)))
(barrier 57 56 58)
....

into each bb, so it can be resolved to an indirect jump on memory.

Testcase to reproduce that is:

// PR 51840/
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

enum {
  S_atop   = 0,
  S_atop_f = 1,
  S_atop_t = 2,
  S_limit  = 3
};

enum {
  I_a_dec        = 0,
  I_a_non_zero_p = 1,
  I_jmp_if_true  = 2,
  I_exit         = 3,
  I_limit        = 4
};


typedef struct {
  uint64_t a0;
} vm_state_t;


__attribute__((noinline, noclone)) void exec_code(vm_state_t *state,
uint8_t *code) {
  static void (* volatile const vm_dispatch[S_limit][I_limit]) = {
    //dispatch for [S_atop = 0][..] with four unique destination labels
    {&&atop__a_dec,
     &&atop__a_non_zero_p,
     &&fixme,
     &&atop__exit},

    //dispatch for [S_atop_f = 1][..] with two unique destination labels
    {&&fixme,
     &&fixme,
     &&atop_f__jmp_if_true,
     &&fixme},

    //dispatch for [S_atop_t = 2][..] with two unique destination labels
    {&&fixme,
     &&fixme,
     &&atop_t__jmp_if_true,
     &&fixme}
  };

  printf("atop__a_dec:         %p\n", &&atop__a_dec);
  printf("atop__a_non_zero_p:  %p\n", &&atop__a_non_zero_p);
  printf("atop__exit:          %p\n", &&atop__exit);
  printf("atop_f__jmp_if_true: %p\n", &&atop_f__jmp_if_true);
  printf("atop_t__jmp_if_true: %p\n", &&atop_t__jmp_if_true);
  printf("fixme:               %p\n", &&fixme);

  volatile uint64_t atop = state->a0;
  volatile int64_t inst_offset=0;

  goto *(vm_dispatch[S_atop][code[inst_offset]]);

 atop__a_dec: {
    printf("atop = %ld\n", atop);
    --atop;
    volatile uint64_t next_inst = code[inst_offset + 1];
    inst_offset += 1;
    goto *(vm_dispatch[S_atop][next_inst]);
  }

 atop__a_non_zero_p: {
    volatile uint64_t next_inst = code[inst_offset + 1];
    inst_offset += 1;
    if (atop != 0) {
      goto *(vm_dispatch[S_atop_t][next_inst]);
    } else {
      goto *(vm_dispatch[S_atop_f][next_inst]);
    }
  }

 atop_f__jmp_if_true: {
    volatile uint64_t next_inst = code[inst_offset + 2];
    inst_offset += 2;
    goto *(vm_dispatch[S_atop][next_inst]);
  }

 atop_t__jmp_if_true: {
    int64_t offset = (int8_t) code[inst_offset + 1];
    uint64_t next_inst = code[inst_offset + offset];
    inst_offset += offset;
    goto *(vm_dispatch[S_atop][next_inst]);
  }

 atop__exit: {
    state->a0 = atop;
    return;
  }

 fixme: {
    printf("Dispatch logic ERROR\n");
    exit(EXIT_FAILURE);
  }
}

int main(void)
{
  vm_state_t state;
  state.a0     = 10;
  uint8_t code[] = {I_a_dec,           //decrement top of stack
            I_a_non_zero_p,    //true if top of stack is non-zero
            I_jmp_if_true, -2, //if true jump back to decrement
            I_exit};           //otherwise exit

  exec_code(&state, code);
  return EXIT_SUCCESS;
}

Regards,
Kai
Jeff Law June 25, 2014, 5:15 p.m. UTC | #8
On 06/25/14 10:02, Kai Tietz wrote:
> 2014-06-25 17:50 GMT+02:00 Richard Henderson <rth@redhat.com>:
>> On 06/25/2014 08:28 AM, Jeff Law wrote:
>>> Ask an ARM maintainer if the new code is actually better than the old code.
>>
>> It isn't.
>>
>>> It appears that with the peep2 pass moved that we actually if-convert the
>>> fall-thru path of the conditional and eliminate the conditional. Which, on the
>>> surface seems like a good thing.  It may be the case that we need to twiddle
>>> the test.  Not sure yet.
>>
>> Looking at the final code in the pr, it looks like we if-convert both ways,
>> just with changed condition on the test.
>>
>> It looks like there are at least 3 peepholes in the arm backend that match
>> compare+branch with a scratch register that are affected by this.  I don't
>> think it's reasonable to expect a peephole to match compare + n cond_exec
>> insns, so running peep2 before if-conversion would seem to be called for.
>>
>> Kai, why does your indirect jump peephole require pass_reorder_blocks?  It
>> seems to me that instead of moving peephole2 down, we could move
>> duplicate_computed_gotos up.
>>
>>
>> r~
>
> We need the peephole2 pass after reorder_blocks due we move in that
> pass the jump
>
> ...
> (code_label 54 52 55 5 8 "" [0 uses])
> (note 55 54 56 5 [bb 5] NOTE_INSN_BASIC_BLOCK)
> (jump_insn 56 55 57 5 (set (pc)
>          (reg/f:DI 0 ax [orig:83 D.3469 ] [83])) 638 {*indirect_jump}
>       (expr_list:REG_DEAD (reg/f:DI 0 ax [orig:83 D.3469 ] [83])
>          (nil)))
> (barrier 57 56 58)
> ....
>
> into each bb, so it can be resolved to an indirect jump on memory.
But I think Richard's question was can we move up 
duplicate_computed_gotos since that's the code which I'd expect to make 
this transformation.  Then you run peep2 immediately after 
duplicate_computed_gotos.


Jeff
Jeff Law June 25, 2014, 5:17 p.m. UTC | #9
On 06/25/14 09:50, Richard Henderson wrote:
> On 06/25/2014 08:28 AM, Jeff Law wrote:
>> Ask an ARM maintainer if the new code is actually better than the old code.
>
> It isn't.
>
>> It appears that with the peep2 pass moved that we actually if-convert the
>> fall-thru path of the conditional and eliminate the conditional. Which, on the
>> surface seems like a good thing.  It may be the case that we need to twiddle
>> the test.  Not sure yet.
>
> Looking at the final code in the pr, it looks like we if-convert both ways,
> just with changed condition on the test.
Yea, I came to the same conclusion when I looked back at the code in the 
BZ.   I was too focused on Kai's message and the RTL contained within 
when I wrote my comment.

>
> It looks like there are at least 3 peepholes in the arm backend that match
> compare+branch with a scratch register that are affected by this.  I don't
> think it's reasonable to expect a peephole to match compare + n cond_exec
> insns, so running peep2 before if-conversion would seem to be called for.
Agreed.

jeff
Kai Tietz June 25, 2014, 5:53 p.m. UTC | #10
2014-06-25 19:15 GMT+02:00 Jeff Law <law@redhat.com>:
> On 06/25/14 10:02, Kai Tietz wrote:
>>
>> 2014-06-25 17:50 GMT+02:00 Richard Henderson <rth@redhat.com>:
>>>
>>> On 06/25/2014 08:28 AM, Jeff Law wrote:
>>>>
>>>> Ask an ARM maintainer if the new code is actually better than the old
>>>> code.
>>>
>>>
>>> It isn't.
>>>
>>>> It appears that with the peep2 pass moved that we actually if-convert
>>>> the
>>>> fall-thru path of the conditional and eliminate the conditional. Which,
>>>> on the
>>>> surface seems like a good thing.  It may be the case that we need to
>>>> twiddle
>>>> the test.  Not sure yet.
>>>
>>>
>>> Looking at the final code in the pr, it looks like we if-convert both
>>> ways,
>>> just with changed condition on the test.
>>>
>>> It looks like there are at least 3 peepholes in the arm backend that
>>> match
>>> compare+branch with a scratch register that are affected by this.  I
>>> don't
>>> think it's reasonable to expect a peephole to match compare + n cond_exec
>>> insns, so running peep2 before if-conversion would seem to be called for.
>>>
>>> Kai, why does your indirect jump peephole require pass_reorder_blocks?
>>> It
>>> seems to me that instead of moving peephole2 down, we could move
>>> duplicate_computed_gotos up.
>>>
>>>
>>> r~
>>
>>
>> We need the peephole2 pass after reorder_blocks due we move in that
>> pass the jump
>>
>> ...
>> (code_label 54 52 55 5 8 "" [0 uses])
>> (note 55 54 56 5 [bb 5] NOTE_INSN_BASIC_BLOCK)
>> (jump_insn 56 55 57 5 (set (pc)
>>          (reg/f:DI 0 ax [orig:83 D.3469 ] [83])) 638 {*indirect_jump}
>>       (expr_list:REG_DEAD (reg/f:DI 0 ax [orig:83 D.3469 ] [83])
>>          (nil)))
>> (barrier 57 56 58)
>> ....
>>
>> into each bb, so it can be resolved to an indirect jump on memory.
>
> But I think Richard's question was can we move up duplicate_computed_gotos
> since that's the code which I'd expect to make this transformation.  Then
> you run peep2 immediately after duplicate_computed_gotos.
>
>
> Jeff
>

The pass of question on ARM isn't the duplicate_compute_gotos pass.
It is the if-after-reload pass.
I tested it by moving it.

Kai
diff mbox

Patch

Index: passes.def
===================================================================
--- passes.def    (Revision 211971)
+++ passes.def    (Arbeitskopie)
@@ -396,6 +396,7 @@  along with GCC; see the file COPYING3.  If not see
       NEXT_PASS (pass_rtl_dse2);
       NEXT_PASS (pass_stack_adjustments);
       NEXT_PASS (pass_jump2);
+      NEXT_PASS (pass_peephole2);
       NEXT_PASS (pass_if_after_reload);
       NEXT_PASS (pass_regrename);
       NEXT_PASS (pass_cprop_hardreg);
@@ -407,8 +408,7 @@  along with GCC; see the file COPYING3.  If not see
          We have a single indirect branch in the entire function
          before duplicate-compute-gotos pass.  This vastly reduces
          the size of the CFG.
-         For preventing to run peephole2 pass twice, its run after
-         the jump2 got removed.  */
+         We need to run peephole2 pass twice. See PR/61608.  */
       NEXT_PASS (pass_peephole2);
       NEXT_PASS (pass_branch_target_load_optimize2);
       NEXT_PASS (pass_leaf_regs);