[middle-end] : Fix PR 88070, ICE in create_pre_exit, at mode-switching.c:438

Message ID CAFULd4a6OpcLF6wU24YrEPL0nj2mdo11pHYPj2MTMzp_1T2ucQ@mail.gmail.com
State New
Series [middle-end] : Fix PR 88070, ICE in create_pre_exit, at mode-switching.c:438

Commit Message

Uros Bizjak Nov. 19, 2018, 7:58 p.m. UTC
Hello!

The assert in create_pre_exit at mode-switching.c expects a return
copy pair with nothing in between. However, the compiler enters the
mode switching pass with the following sequence:

(insn 19 18 16 2 (set (reg:V2SF 21 xmm0)
        (mem/c:V2SF (plus:DI (reg/f:DI 7 sp)
                (const_int -72 [0xffffffffffffffb8])) [0  S8 A64]))
"pr88070.c":8 1157 {*movv2sf_internal}
     (nil))
(insn 16 19 20 2 (set (reg:V2SF 0 ax [orig:91 <retval> ] [91])
        (reg:V2SF 0 ax [89])) "pr88070.c":8 1157 {*movv2sf_internal}
     (nil))
(insn 20 16 21 2 (unspec_volatile [
            (const_int 0 [0])
        ] UNSPECV_BLOCKAGE) "pr88070.c":8 710 {blockage}
     (nil))
(insn 21 20 23 2 (use (reg:V2SF 21 xmm0)) "pr88070.c":8 -1
     (nil))

Please note how (insn 16) interferes with the (insn 19)-(insn 21)
return copy pair.
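
For reference, the shape create_pre_exit expects to find is the return
value copy immediately followed by its use, with nothing scheduled in
between -- schematically (insn numbers and operands are illustrative):

(insn 19 ... (set (reg:V2SF 21 xmm0) ...))   ;; return value copy
(insn 21 ... (use (reg:V2SF 21 xmm0)))       ;; return value use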

The culprit is the blockage instruction (insn 20), which causes the
sched1 pass (the pre-reload scheduler) to skip marking (insn 19) as an
unmovable instruction (as an insn dependent on the return use insn),
so the scheduler is free to schedule (insn 16) between the (insn
19)-(insn 21) return copy pair.

The extra instruction is generated as a kludge in expand_function_end
to prevent instructions that may trap from being scheduled into the
function epilogue. However, the same blockage is generated under
exactly the same conditions earlier in expand_function_end. The first
blockage already prevents unwanted scheduling into the epilogue, so
there is actually no need for the second one.
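
For reference, the earlier blockage is emitted by an identical guarded
call in function.c -- the same hunk that the patch below removes from
its second, duplicated location:

  /* @@@ This is a kludge.  We want to ensure that instructions that
     may trap are not moved into the epilogue by scheduling, because
     we don't always emit unwind information for the epilogue.  */
  if (cfun->can_throw_non_call_exceptions
      && targetm_common.except_unwind_info (&global_options) != UI_SJLJ)
    emit_insn (gen_blockage ());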

Attached patch removes the kludge.

BTW: the extra blockage would crash compilation for all mode-switching
targets, including those that perform pre-reload mode switching; the
post-reload vzeroupper insertion merely trips the x86 target on a
generic middle-end problem.

2018-11-19  Uros Bizjak  <ubizjak@gmail.com>

    PR middle-end/88070
    * function.c (expand_function_end): Remove kludge that
    generates second blockage insn.

testsuite/ChangeLog:

2018-11-19  Uros Bizjak  <ubizjak@gmail.com>

    PR middle-end/88070
    * gcc.target/i386/pr88070.c: New test.

The patch was bootstrapped and regression tested on x86_64-linux-gnu
{,-m32} for all default languages, obj-c++ and go.

OK for mainline and release branches?

Uros.

Comments

Eric Botcazou Nov. 19, 2018, 10:42 p.m. UTC | #1
> The extra instruction is generated as a kludge in expand_function_end
> to prevent instructions that may trap from being scheduled into the
> function epilogue. However, the same blockage is generated under
> exactly the same conditions earlier in expand_function_end. The first
> blockage already prevents unwanted scheduling into the epilogue, so
> there is actually no need for the second one.

But there are instructions emitted after the first blockage, aren't there?

Did you check the history of the code?
Uros Bizjak Nov. 20, 2018, 12:11 a.m. UTC | #2
On Mon, Nov 19, 2018 at 11:42 PM Eric Botcazou <ebotcazou@adacore.com> wrote:
>
> > The extra instruction is generated as a kludge in expand_function_end
> > to prevent instructions that may trap from being scheduled into the
> > function epilogue. However, the same blockage is generated under
> > exactly the same conditions earlier in expand_function_end. The first
> > blockage already prevents unwanted scheduling into the epilogue, so
> > there is actually no need for the second one.
>
> But there are instructions emitted after the first blockage, aren't there?

Yes, but they don't generate exceptions. At least all memory accesses
have the /c (non-trapping) flag.

> Did you check the history of the code?

I did.

The blockage was introduced as a fix for PR14381 [1] in r79265 [2].
Later, the blockage was moved after the return label as a fix for
PR25176 [3] in r107871 [4].

After that, r122626 [5] moved the blockage after the label for the
naked return from the function. Relevant posts from the gcc-patches@
ML are at [6], [7]. However, the posts give no concrete example of how
the scheduler moves instructions from a different BB around the
blockage insn; they just show that there is a jump around the blockage
when __builtin_return is used. I was under the impression that the
scheduler is unable to move instructions over BB boundaries.

A mystery is the tree-ssa merge [8], which copies the hunk moved in
r122626 [5] back to its original position. From this revision onwards,
we emit two blockages.

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=14381
[2] https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=79265
[3] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=25176
[4] https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=107871
[5] https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=122626
[6] https://gcc.gnu.org/ml/gcc-patches/2007-02/msg01143.html
[7] https://gcc.gnu.org/ml/gcc-patches/2007-02/msg01623.html
[8] https://gcc.gnu.org/viewcvs/gcc/trunk/gcc/function.c?limit_changes=0&r1=125624&r2=125623&pathrev=125624

Uros.
Eric Botcazou Nov. 20, 2018, 7:59 a.m. UTC | #3
> The blockage was introduced as a fix for PR14381 [1] in r79265 [2].
> Later, the blockage was moved after the return label as a fix for
> PR25176 [3] in r107871 [4].
>
> After that, r122626 [5] moved the blockage after the label for the
> naked return from the function. Relevant posts from the gcc-patches@
> ML are at [6], [7]. However, the posts give no concrete example of how
> the scheduler moves instructions from a different BB around the
> blockage insn; they just show that there is a jump around the blockage
> when __builtin_return is used. I was under the impression that the
> scheduler is unable to move instructions over BB boundaries.

The scheduler works on extended basic blocks.  The [7] post gives a rather 
convincing explanation and there is a C++ testcase under PR rtl-opt/14381.

> A mystery is the tree-ssa merge [8], which copies the hunk moved in
> r122626 [5] back to its original position. From this revision onwards,
> we emit two blockages.

It's the dataflow merge, not the tree-ssa merge.  The additional blockage 
might be needed for DF.

Given that the current PR is totally artificial, I think that we need to be 
quite conservative and only do something on mainline.  And even there I'd be 
rather conservative and remove the kludge only for targets that emit unwind 
information in the epilogue (among which there is x86 I presume).
Uros Bizjak Nov. 20, 2018, 10:24 a.m. UTC | #4
On Tue, Nov 20, 2018 at 8:59 AM Eric Botcazou <ebotcazou@adacore.com> wrote:
>
> > The blockage was introduced as a fix for PR14381 [1] in r79265 [2].
> > Later, the blockage was moved after the return label as a fix for
> > PR25176 [3] in r107871 [4].
> >
> > After that, r122626 [5] moved the blockage after the label for the
> > naked return from the function. Relevant posts from the gcc-patches@
> > ML are at [6], [7]. However, the posts give no concrete example of how
> > the scheduler moves instructions from a different BB around the
> > blockage insn; they just show that there is a jump around the blockage
> > when __builtin_return is used. I was under the impression that the
> > scheduler is unable to move instructions over BB boundaries.
>
> The scheduler works on extended basic blocks.  The [7] post gives a rather
> convincing explanation and there is a C++ testcase under PR rtl-opt/14381.
>
> > A mystery is the tree-ssa merge [8], which copies the hunk moved in
> > r122626 [5] back to its original position. From this revision onwards,
> > we emit two blockages.
>
> It's the dataflow merge, not the tree-ssa merge.  The additional blockage
> might be needed for DF.
>
> Given that the current PR is totally artificial, I think that we need to be
> quite conservative and only do something on mainline.  And even there I'd be
> rather conservative and remove the kludge only for targets that emit unwind
> information in the epilogue (among which there is x86 I presume).

Hm, I think I'd rather go with a somewhat target-dependent patch:

--cut here--
diff --git a/gcc/mode-switching.c b/gcc/mode-switching.c
index 370a49e90a9c..de75efe2b6c9 100644
--- a/gcc/mode-switching.c
+++ b/gcc/mode-switching.c
@@ -252,7 +252,21 @@ create_pre_exit (int n_entities, int *entity_map, const int *num_modes)
        if (EDGE_COUNT (EXIT_BLOCK_PTR_FOR_FN (cfun)->preds) == 1
            && NONJUMP_INSN_P ((last_insn = BB_END (src_bb)))
            && GET_CODE (PATTERN (last_insn)) == USE
-           && GET_CODE ((ret_reg = XEXP (PATTERN (last_insn), 0))) == REG)
+           && GET_CODE ((ret_reg = XEXP (PATTERN (last_insn), 0))) == REG
+
+           /* x86 targets use mode-switching infrastructure to
+              conditionally insert vzeroupper instruction at the exit
+              from the function and there is no need to switch the
+              mode before the return value copy.  The vzeroupper insertion
+              pass runs after reload, so use !reload_completed as a stand-in
+              for x86 to skip the search for return value copy insn.
+
+              N.b.: the code below assumes that return copy insn
+              immediately precedes its corresponding use insn.  This
+              assumption does not hold after reload, since sched1 pass
+              can reschedule return copy insn away from its
+              corresponding use insn.  */
+           && !reload_completed)
          {
            int ret_start = REGNO (ret_reg);
            int nregs = REG_NREGS (ret_reg);
--cut here--

WDYT?

Uros.
Uros Bizjak Nov. 20, 2018, 10:54 a.m. UTC | #5
On Tue, Nov 20, 2018 at 8:59 AM Eric Botcazou <ebotcazou@adacore.com> wrote:
>
> > The blockage was introduced as a fix for PR14381 [1] in r79265 [2].
> > Later, the blockage was moved after the return label as a fix for
> > PR25176 [3] in r107871 [4].
> >
> > After that, r122626 [5] moved the blockage after the label for the
> > naked return from the function. Relevant posts from the gcc-patches@
> > ML are at [6], [7]. However, the posts give no concrete example of how
> > the scheduler moves instructions from a different BB around the
> > blockage insn; they just show that there is a jump around the blockage
> > when __builtin_return is used. I was under the impression that the
> > scheduler is unable to move instructions over BB boundaries.
>
> The scheduler works on extended basic blocks.  The [7] post gives a rather
> convincing explanation and there is a C++ testcase under PR rtl-opt/14381.

Taking into account that BB edges aren't scheduling barriers, I agree with [7].

> > A mystery is the tree-ssa merge [8], which copies the hunk moved in
> > r122626 [5] back to its original position. From this revision onwards,
> > we emit two blockages.
>
> It's the dataflow merge, not the tree-ssa merge.  The additional blockage
> might be needed for DF.

Ah yes, thanks for the correction. However, I think that, according to
the reintroduced duplicated comment, the blockage reintroduced by the
DF merge should not be needed. I'll investigate this a bit and try to
bootstrap/regtest a patch that removes the reintroduced blockage.

> Given that the current PR is totally artificial, I think that we need to be
> quite conservative and only do something on mainline.  And even there I'd be
> rather conservative and remove the kludge only for targets that emit unwind
> information in the epilogue (among which there is x86 I presume).

The testcase actually exposes a -fschedule-insns problem with x86, so
the test is not totally artificial. The idea of removing the kludge for
certain targets is tempting (the scheduler will be given some more
freedom), but I agree that it should not be removed in stage 3.

Thanks,
Uros.
Jeff Law Nov. 20, 2018, 11:46 p.m. UTC | #6
On 11/19/18 12:58 PM, Uros Bizjak wrote:
> Hello!
> 
> The assert in create_pre_exit at mode-switching.c expects a return
> copy pair with nothing in between. However, the compiler enters the
> mode switching pass with the following sequence:
> 
> (insn 19 18 16 2 (set (reg:V2SF 21 xmm0)
>         (mem/c:V2SF (plus:DI (reg/f:DI 7 sp)
>                 (const_int -72 [0xffffffffffffffb8])) [0  S8 A64]))
> "pr88070.c":8 1157 {*movv2sf_internal}
>      (nil))
> (insn 16 19 20 2 (set (reg:V2SF 0 ax [orig:91 <retval> ] [91])
>         (reg:V2SF 0 ax [89])) "pr88070.c":8 1157 {*movv2sf_internal}
>      (nil))
> (insn 20 16 21 2 (unspec_volatile [
>             (const_int 0 [0])
>         ] UNSPECV_BLOCKAGE) "pr88070.c":8 710 {blockage}
>      (nil))
> (insn 21 20 23 2 (use (reg:V2SF 21 xmm0)) "pr88070.c":8 -1
>      (nil))
So I know there's an updated patch.  But I thought it might be worth
mentioning that insn 16 here appears to be a nop-move.  Removing it
might address this instance of the problem, but I doubt it's general
enough to address any larger issues.

You still might want to investigate why it's still in the IL.

Jeff
Jeff Law Nov. 20, 2018, 11:50 p.m. UTC | #7
On 11/20/18 3:24 AM, Uros Bizjak wrote:
> On Tue, Nov 20, 2018 at 8:59 AM Eric Botcazou <ebotcazou@adacore.com> wrote:
>>
>>> The blockage was introduced as a fix for PR14381 [1] in r79265 [2].
>>> Later, the blockage was moved after the return label as a fix for
>>> PR25176 [3] in r107871 [4].
>>>
>>> After that, r122626 [5] moved the blockage after the label for the
>>> naked return from the function. Relevant posts from the gcc-patches@
>>> ML are at [6], [7]. However, the posts give no concrete example of how
>>> the scheduler moves instructions from a different BB around the
>>> blockage insn; they just show that there is a jump around the blockage
>>> when __builtin_return is used. I was under the impression that the
>>> scheduler is unable to move instructions over BB boundaries.
>>
>> The scheduler works on extended basic blocks.  The [7] post gives a rather
>> convincing explanation and there is a C++ testcase under PR rtl-opt/14381.
>>
>>> A mystery is the tree-ssa merge [8], which copies the hunk moved in
>>> r122626 [5] back to its original position. From this revision onwards,
>>> we emit two blockages.
>>
>> It's the dataflow merge, not the tree-ssa merge.  The additional blockage
>> might be needed for DF.
>>
>> Given that the current PR is totally artificial, I think that we need to be
>> quite conservative and only do something on mainline.  And even there I'd be
>> rather conservative and remove the kludge only for targets that emit unwind
>> information in the epilogue (among which there is x86 I presume).
> 
> Hm, I think I'd rather go with a somewhat target-dependent patch:
> 
> --cut here--
> diff --git a/gcc/mode-switching.c b/gcc/mode-switching.c
> index 370a49e90a9c..de75efe2b6c9 100644
> --- a/gcc/mode-switching.c
> +++ b/gcc/mode-switching.c
> @@ -252,7 +252,21 @@ create_pre_exit (int n_entities, int *entity_map, const int *num_modes)
>         if (EDGE_COUNT (EXIT_BLOCK_PTR_FOR_FN (cfun)->preds) == 1
>             && NONJUMP_INSN_P ((last_insn = BB_END (src_bb)))
>             && GET_CODE (PATTERN (last_insn)) == USE
> -           && GET_CODE ((ret_reg = XEXP (PATTERN (last_insn), 0))) == REG)
> +           && GET_CODE ((ret_reg = XEXP (PATTERN (last_insn), 0))) == REG
> +
> +           /* x86 targets use mode-switching infrastructure to
> +              conditionally insert vzeroupper instruction at the exit
> +              from the function and there is no need to switch the
> +              mode before the return value copy.  The vzeroupper insertion
> +              pass runs after reload, so use !reload_completed as a stand-in
> +              for x86 to skip the search for return value copy insn.
Note that the GCN target may well end up needing a late mode switching
pass -- it's got kind of an inverse problem to solve -- where to place
initializations of the exec register, which is needed when we want to
do scalar ops in a SIMD unit.

I thought the SH used mode switching as well.  But I can't recall if it
was run before register allocation & reload.

jeff
Uros Bizjak Nov. 21, 2018, 6:54 a.m. UTC | #8
On Wed, Nov 21, 2018 at 12:50 AM Jeff Law <law@redhat.com> wrote:

> > +           /* x86 targets use mode-switching infrastructure to
> > +              conditionally insert vzeroupper instruction at the exit
> > +              from the function and there is no need to switch the
> > +              mode before the return value copy.  The vzeroupper insertion
> > +              pass runs after reload, so use !reload_completed as a stand-in
> > +              for x86 to skip the search for return value copy insn.
> Note that the GCN target may well end up needing a late mode switching
> pass -- it's got kind of an inverse problem to solve -- where to place
> initializations of the exec register, which is needed when we want to
> do scalar ops in a SIMD unit.
>
> I thought the SH used mode switching as well.  But I can't recall if it
> was run before register allocation & reload.

SH mode switching ran before reload.

Uros.
Uros Bizjak Nov. 21, 2018, 9:48 a.m. UTC | #9
On Wed, Nov 21, 2018 at 12:46 AM Jeff Law <law@redhat.com> wrote:
>
> On 11/19/18 12:58 PM, Uros Bizjak wrote:
> > Hello!
> >
> > The assert in create_pre_exit at mode-switching.c expects a return
> > copy pair with nothing in between. However, the compiler enters the
> > mode switching pass with the following sequence:
> >
> > (insn 19 18 16 2 (set (reg:V2SF 21 xmm0)
> >         (mem/c:V2SF (plus:DI (reg/f:DI 7 sp)
> >                 (const_int -72 [0xffffffffffffffb8])) [0  S8 A64]))
> > "pr88070.c":8 1157 {*movv2sf_internal}
> >      (nil))
> > (insn 16 19 20 2 (set (reg:V2SF 0 ax [orig:91 <retval> ] [91])
> >         (reg:V2SF 0 ax [89])) "pr88070.c":8 1157 {*movv2sf_internal}
> >      (nil))
> > (insn 20 16 21 2 (unspec_volatile [
> >             (const_int 0 [0])
> >         ] UNSPECV_BLOCKAGE) "pr88070.c":8 710 {blockage}
> >      (nil))
> > (insn 21 20 23 2 (use (reg:V2SF 21 xmm0)) "pr88070.c":8 -1
> >      (nil))
> So I know there's an updated patch.  But I thought it might be worth
> mentioning that insn 16 here appears to be a nop-move.  Removing it
> might address this instance of the problem, but I doubt it's general
> enough to address any larger issues.
>
> You still might want to investigate why it's still in the IL.

Oh yes, I remember this.

These nop-moves were removed in Vlad's patch [1], [2]:

2013-10-25  Vladimir Makarov <vmakarov@redhat.com>

        ...
        * lra-spills.c (lra_final_code_change): Remove useless move insns.

This regressed the vzeroupper insertion pass [3], as reported in [4].

The functionality was later reverted in [5]:

2013-10-26  Vladimir Makarov  <vmakarov@redhat.com>

    Revert:
    2013-10-25  Vladimir Makarov  <vmakarov@redhat.com>
    * lra-spills.c (lra_final_code_change): Remove useless move insns.

IMO this can now be reintroduced, since the vzeroupper pass works in a
different way. We actually have a couple of tests in place for PR58679
[6].

[1] https://gcc.gnu.org/ml/gcc-patches/2013-10/msg02208.html
[2] https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=204079
[3] https://gcc.gnu.org/ml/gcc-patches/2013-10/msg02225.html
[4] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58679
[5] https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=204094
[6] https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=204109

Uros.
Uros Bizjak Nov. 21, 2018, 10:02 a.m. UTC | #10
On Wed, Nov 21, 2018 at 10:48 AM Uros Bizjak <ubizjak@gmail.com> wrote:
>
> On Wed, Nov 21, 2018 at 12:46 AM Jeff Law <law@redhat.com> wrote:
> >
> > On 11/19/18 12:58 PM, Uros Bizjak wrote:
> > > Hello!
> > >
> > > The assert in create_pre_exit at mode-switching.c expects a return
> > > copy pair with nothing in between. However, the compiler enters the
> > > mode switching pass with the following sequence:
> > >
> > > (insn 19 18 16 2 (set (reg:V2SF 21 xmm0)
> > >         (mem/c:V2SF (plus:DI (reg/f:DI 7 sp)
> > >                 (const_int -72 [0xffffffffffffffb8])) [0  S8 A64]))
> > > "pr88070.c":8 1157 {*movv2sf_internal}
> > >      (nil))
> > > (insn 16 19 20 2 (set (reg:V2SF 0 ax [orig:91 <retval> ] [91])
> > >         (reg:V2SF 0 ax [89])) "pr88070.c":8 1157 {*movv2sf_internal}
> > >      (nil))
> > > (insn 20 16 21 2 (unspec_volatile [
> > >             (const_int 0 [0])
> > >         ] UNSPECV_BLOCKAGE) "pr88070.c":8 710 {blockage}
> > >      (nil))
> > > (insn 21 20 23 2 (use (reg:V2SF 21 xmm0)) "pr88070.c":8 -1
> > >      (nil))
> > So I know there's an updated patch.  But I thought it might be worth
> > mentioning that insn 16 here appears to be a nop-move.  Removing it
> > might address this instance of the problem, but I doubt it's general
> > enough to address any larger issues.
> >
> > You still might want to investigate why it's still in the IL.
>
> Oh yes, I remember this.
>
> These nop-moves were removed in Vlad's patch [1], [2]:
>
> 2013-10-25  Vladimir Makarov <vmakarov@redhat.com>
>
>         ...
>         * lra-spills.c (lra_final_code_change): Remove useless move insns.
>
> This regressed the vzeroupper insertion pass [3], as reported in [4].
>
> The functionality was later reverted in [5]:
>
> 2013-10-26  Vladimir Makarov  <vmakarov@redhat.com>
>
>     Revert:
>     2013-10-25  Vladimir Makarov  <vmakarov@redhat.com>
>     * lra-spills.c (lra_final_code_change): Remove useless move insns.
>
> IMO this can now be reintroduced, since the vzeroupper pass works in
> a different way. We actually have a couple of tests in place for
> PR58679 [6].

The revert of the revert works OK for the PR58679 tests with the latest compiler.

Uros.

Patch

Index: function.c
===================================================================
--- function.c	(revision 266278)
+++ function.c	(working copy)
@@ -5447,13 +5447,6 @@  expand_function_end (void)
   if (naked_return_label)
     emit_label (naked_return_label);
 
-  /* @@@ This is a kludge.  We want to ensure that instructions that
-     may trap are not moved into the epilogue by scheduling, because
-     we don't always emit unwind information for the epilogue.  */
-  if (cfun->can_throw_non_call_exceptions
-      && targetm_common.except_unwind_info (&global_options) != UI_SJLJ)
-    emit_insn (gen_blockage ());
-
   /* If stack protection is enabled for this function, check the guard.  */
   if (crtl->stack_protect_guard && targetm.stack_protect_runtime_enabled_p ())
     stack_protect_epilogue ();
Index: testsuite/gcc.target/i386/pr88070.c
===================================================================
--- testsuite/gcc.target/i386/pr88070.c	(nonexistent)
+++ testsuite/gcc.target/i386/pr88070.c	(working copy)
@@ -0,0 +1,12 @@ 
+/* PR target/88070 */
+/* { dg-do compile } */
+/* { dg-options "-O -fexpensive-optimizations -fnon-call-exceptions -fschedule-insns -fno-dce -fno-dse -mavx" } */
+
+typedef float vfloat2 __attribute__ ((__vector_size__ (2 * sizeof (float))));
+
+vfloat2
+test1float2 (float c)
+{
+  vfloat2 v = { c, c };
+  return v;
+}
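
For anyone wanting to reproduce the ICE, compiling the new testcase
with the options from its dg-options line should be enough (a minimal
sketch; assumes an x86_64 gcc built without this fix):

gcc -S -O -fexpensive-optimizations -fnon-call-exceptions \
    -fschedule-insns -fno-dce -fno-dse -mavx pr88070.c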