
[4/4] Discussing PR83507

Message ID d669f4cceadaaa32994cda24814228ce@ispras.ru
State New
Series Addressing modulo-scheduling bugs

Commit Message

Roman Zhuykov April 16, 2019, 12:08 p.m. UTC
Unfortunately, this issue was not solved correctly.  In that example 
-fmodulo-sched-allow-regmoves is not enabled, so we should not create 
any register moves at all.

I’ll describe here everything I found while looking into the issue.  As 
a first step, I patched the trunk compiler like this:

diff --git a/gcc/modulo-sched.c b/gcc/modulo-sched.c
--- a/gcc/modulo-sched.c
+++ b/gcc/modulo-sched.c
@@ -735,6 +735,7 @@ schedule_reg_moves (partial_schedule_ptr ps)
        /* Create NREG_MOVES register moves.  */
+      gcc_assert (flag_modulo_sched_allow_regmoves);
        first_move = ps->reg_moves.length ();
        ps->reg_moves.safe_grow_cleared (first_move + nreg_moves);
        extend_node_sched_params (ps);
@@ -744,8 +745,7 @@ schedule_reg_moves (partial_schedule_ptr ps)

        /* Generate each move.  */
        old_reg = prev_reg = SET_DEST (set);
-      if (HARD_REGISTER_P (old_reg))
-	return false;
+      gcc_assert (!HARD_REGISTER_P (old_reg));

        for (i_reg_move = 0; i_reg_move < nreg_moves; i_reg_move++)
  	{
@@ -1371,7 +1371,12 @@ sms_schedule (void)
    else
      issue_rate = 1;

-  /* Initialize the scheduler.  */
+  /* Initialize the scheduler, we need live info even at O1.  */
+  if (optimize == 1)
+    {
+      df_live_add_problem ();
+      df_live_set_all_dirty ();
+    }
    setup_sched_infos ();
    haifa_sched_init ();

@@ -1473,6 +1478,8 @@ sms_schedule (void)

          if (CALL_P (insn)
              || BARRIER_P (insn)
+            || (NONDEBUG_INSN_P (insn) && extract_asm_operands (PATTERN (insn))
+                && MEM_VOLATILE_P (extract_asm_operands (PATTERN (insn))))
              || (NONDEBUG_INSN_P (insn) && !JUMP_P (insn)
                  && !single_set (insn) && GET_CODE (PATTERN (insn)) != USE
                  && !reg_mentioned_p (count_reg, insn))
@@ -1489,6 +1496,11 @@ sms_schedule (void)
  		fprintf (dump_file, "SMS loop-with-call\n");
  	      else if (BARRIER_P (insn))
  		fprintf (dump_file, "SMS loop-with-barrier\n");
+	      else if (NONDEBUG_INSN_P (insn)
+		       && extract_asm_operands (PATTERN (insn))
+		       && MEM_VOLATILE_P (extract_asm_operands
+						(PATTERN (insn))))
+		fprintf (dump_file, "SMS loop-with-volatile-asm\n");
                else if ((NONDEBUG_INSN_P (insn) && !JUMP_P (insn)
                && !single_set (insn) && GET_CODE (PATTERN (insn)) != USE))
                  fprintf (dump_file, "SMS loop-with-not-single-set\n");
@@ -1752,6 +1764,9 @@ sms_schedule (void)
    /* Release scheduler data, needed until now because of DFA.  */
    haifa_sched_finish ();
    loop_optimizer_finalize ();
+
+  if (optimize == 1)
+    df_remove_problem (df_live);
  }

  /* The SMS scheduling algorithm itself

Comments

Segher Boessenkool April 16, 2019, 3:08 p.m. UTC | #1
On Tue, Apr 16, 2019 at 03:08:23PM +0300, Roman Zhuykov wrote:
> This issue unfortunately was not solved correctly.  In that example we 
> don’t have -fmodulo-sched-allow-regmoves enabled and we should not 
> create any register moves at all.

Yes, but if we do for whatever reason, we should never create register
moves of hard registers.  Because that is incorrect (there may not be
insns to do it).  It's a separate issue.

You're extending Jakub's patch here, not replacing it, so that's fine.

> In pr84524.c we got a loop with an extended inline asm:
> asm volatile ("" : "+r" (v))
> which also gives us a “surprising” situation Alexander predicts.
> 
> For sched-deps scanner such volatile asm is a “true barrier” and we 
> create dependencies to almost all other instructions, while DF scanners 
> don’t give us such information.

There is no such thing as a "true barrier" in extended asm.  The only thing
volatile asm means is that the asm has a side effect unknown to the compiler;
this can *not* be a modification of a register or of memory contents, such
things are known by the compiler, that's what clobbers and "memory" clobber
are about.

> Maybe it is a good idea somehow re-implement modulo scheduler using only 
> one scanner instead of two, but at the moment the best thing to do is to 
> detect the situation earlier and skip such loops.

Or fix the broken code...


Segher
Roman Zhuykov April 22, 2019, 4:36 p.m. UTC | #2
> > This issue unfortunately was not solved correctly.  In that example we
> > don’t have -fmodulo-sched-allow-regmoves enabled and we should not
> > create any register moves at all.
>
> Yes, but if we do for whatever reason, we should never create register
> moves of hard registers.  Because that is incorrect (there may not be
> insns to do it).  It's a separate issue.
>
> You're extending Jakub's patch here, not replacing it, so that's fine.

Certainly, the patch contains assertions to catch both situations:
trying to create a reg_move when it is not allowed at all, and trying
to create a reg_move for a hard register.  In theory, preventing both
can easily be achieved once the SMS algorithm gets an absolutely
correct DDG as input.

> > In pr84524.c we got a loop with an extended inline asm:
> > asm volatile ("" : "+r" (v))
> > which also gives us a “surprising” situation Alexander predicts.
> >
> > For sched-deps scanner such volatile asm is a “true barrier” and we
> > create dependencies to almost all other instructions, while DF scanners
> > don’t give us such information.
>
> There is no such thing as a "true barrier" in extended asm.  The only thing
> volatile asm means is that the asm has a side effect unknown to the compiler;
> this can *not* be a modification of a register or of memory contents, such
> things are known by the compiler, that's what clobbers and "memory" clobber
> are about.

In sched-deps.c we got:
    case ASM_OPERANDS:
    case ASM_INPUT:
      {
       /* Traditional and volatile asm instructions must be considered to use
          and clobber all hard registers, all pseudo-registers and all of
          memory.  So must TRAP_IF and UNSPEC_VOLATILE operations.

          Consider for instance a volatile asm that changes the fpu rounding
          mode.  An insn should not be moved across this even if it only uses
          pseudo-regs because it might give an incorrectly rounded result.  */
       if ((code != ASM_OPERANDS || MEM_VOLATILE_P (x))
           && !DEBUG_INSN_P (insn))
         reg_pending_barrier = TRUE_BARRIER;
       ...

I understand that the mentioned “changing fpu rounding mode” example
may not be relevant in the current compiler.  But we still get
TRUE_BARRIER for all volatile asms.

> > Maybe it is a good idea somehow re-implement modulo scheduler using only
> > one scanner instead of two, but at the moment the best thing to do is to
> > detect the situation earlier and skip such loops.
>
> Or fix the broken code...

I’m not sure what you mean here, but the only alternative way to build
a correct DDG is to change parts of sched-deps.c to make its analysis
consistent with DF, and that change would affect all schedulers.  Such
a change brings two rather big problems:
1) Testing correctness and quality of the generated code in all
imaginable situations in all schedulers.
2) Potential compilation slowdown.  The sched-deps analysis implements
heuristic ideas, while DF aims to be a more accurate but slower
analysis.  Trying to keep them in sync could increase sched-deps
analysis time.

A year ago there was a relevant discussion about some unnecessary
inline asm dependencies in DF:
https://gcc.gnu.org/ml/gcc-patches/2018-04/msg01056.html

Roman
Segher Boessenkool April 23, 2019, 3:49 p.m. UTC | #3
Hi Roman,

On Mon, Apr 22, 2019 at 07:36:40PM +0300, Roman Zhuykov wrote:
> > > In pr84524.c we got a loop with an extended inline asm:
> > > asm volatile ("" : "+r" (v))
> > > which also gives us a “surprising” situation Alexander predicts.
> > >
> > > For sched-deps scanner such volatile asm is a “true barrier” and we
> > > create dependencies to almost all other instructions, while DF scanners
> > > don’t give us such information.
> >
> > There is no such thing as a "true barrier" in extended asm.  The only thing
> > volatile asm means is that the asm has a side effect unknown to the compiler;
> > this can *not* be a modification of a register or of memory contents, such
> > things are known by the compiler, that's what clobbers and "memory" clobber
> > are about.
> 
> In sched-deps.c we got:
>     case ASM_OPERANDS:
>     case ASM_INPUT:
>       {
>        /* Traditional and volatile asm instructions must be considered to use
>           and clobber all hard registers, all pseudo-registers and all of
>           memory.  So must TRAP_IF and UNSPEC_VOLATILE operations.
> 
>           Consider for instance a volatile asm that changes the fpu rounding
>           mode.  An insn should not be moved across this even if it only uses
>           pseudo-regs because it might give an incorrectly rounded result.  */
>        if ((code != ASM_OPERANDS || MEM_VOLATILE_P (x))
>            && !DEBUG_INSN_P (insn))
>          reg_pending_barrier = TRUE_BARRIER;
>        ...

This code was added in 1997 (r14770).  In 2004 the documentation was
changed to clarify how things really work (r88999):

"Note that even a volatile @code{asm} instruction can be moved relative to
other code, including across jump instructions."

(followed by an example exactly about what this means for FPU control).

> I understand that mentioned “changing fpu rounding mode” example may
> not be relevant in current compiler.  But we still got TRUE_BARRIER
> for all volatile asms.

> > > Maybe it is a good idea somehow re-implement modulo scheduler using only
> > > one scanner instead of two, but at the moment the best thing to do is to
> > > detect the situation earlier and skip such loops.
> >
> > Or fix the broken code...
> 
> I’m not sure what you mean here,

I mean have the modulo scheduler implement the correct asm semantics, not
some more restrictive thing that gets it into conflicts with DF, etc.

I don't think this will turn out to be a problem in any way.  Some invalid
asm will break, sure.


Segher
Roman Zhuykov April 29, 2019, 4:28 p.m. UTC | #4
Hi, Segher

> > > > In pr84524.c we got a loop with an extended inline asm:
> > > > asm volatile ("" : "+r" (v))
> > > > which also gives us a “surprising” situation Alexander predicts.
> > > >
> > > > For sched-deps scanner such volatile asm is a “true barrier” and we
> > > > create dependencies to almost all other instructions, while DF scanners
> > > > don’t give us such information.
> > >
> > > There is no such thing as a "true barrier" in extended asm.  The only thing
> > > volatile asm means is that the asm has a side effect unknown to the compiler;
> > > this can *not* be a modification of a register or of memory contents, such
> > > things are known by the compiler, that's what clobbers and "memory" clobber
> > > are about.
> >
> > In sched-deps.c we got:
> >     case ASM_OPERANDS:
> >     case ASM_INPUT:
> >       {
> >        /* Traditional and volatile asm instructions must be considered to use
> >           and clobber all hard registers, all pseudo-registers and all of
> >           memory.  So must TRAP_IF and UNSPEC_VOLATILE operations.
> >
> >           Consider for instance a volatile asm that changes the fpu rounding
> >           mode.  An insn should not be moved across this even if it only uses
> >           pseudo-regs because it might give an incorrectly rounded result.  */
> >        if ((code != ASM_OPERANDS || MEM_VOLATILE_P (x))
> >            && !DEBUG_INSN_P (insn))
> >          reg_pending_barrier = TRUE_BARRIER;
> >        ...
>
> This code was added in 1997 (r14770).  In 2004 the documentation was
> changed to clarify how things really work (r88999):
>
> "Note that even a volatile @code{asm} instruction can be moved relative to
> other code, including across jump instructions."
>
> (followed by an example exactly about what this means for FPU control).

Thanks for pointing to those changes!  Unfortunately, sched-deps.c has
stayed more conservative for these 15 years...
Let’s try to fix it.

> > > > Maybe it is a good idea somehow re-implement modulo scheduler using only
> > > > one scanner instead of two, but at the moment the best thing to do is to
> > > > detect the situation earlier and skip such loops.
> > >
> > > Or fix the broken code...
> >
> > I’m not sure what you mean here,
>
> I mean have the modulo scheduler implement the correct asm semantics, not
> some more restrictive thing that gets it into conflicts with DF, etc.
>
> I don't think this will turn out to be a problem in any way.  Some invalid
> asm will break, sure.

I started by applying this without any SMS changes:

diff --git a/gcc/sched-deps.c b/gcc/sched-deps.c
--- a/gcc/sched-deps.c
+++ b/gcc/sched-deps.c
@@ -2753,22 +2753,14 @@ sched_analyze_2 (struct deps_desc *deps, rtx x, rtx_insn *insn)

     case UNSPEC_VOLATILE:
       flush_pending_lists (deps, insn, true, true);
+
+      if (!DEBUG_INSN_P (insn))
+       reg_pending_barrier = TRUE_BARRIER;
       /* FALLTHRU */

     case ASM_OPERANDS:
     case ASM_INPUT:
       {
-       /* Traditional and volatile asm instructions must be considered to use
-          and clobber all hard registers, all pseudo-registers and all of
-          memory.  So must TRAP_IF and UNSPEC_VOLATILE operations.
-
-          Consider for instance a volatile asm that changes the fpu rounding
-          mode.  An insn should not be moved across this even if it only uses
-          pseudo-regs because it might give an incorrectly rounded result.  */
-       if ((code != ASM_OPERANDS || MEM_VOLATILE_P (x))
-           && !DEBUG_INSN_P (insn))
-         reg_pending_barrier = TRUE_BARRIER;
-
        /* For all ASM_OPERANDS, we must traverse the vector of input operands.
           We cannot just fall through here since then we would be confused
           by the ASM_INPUT rtx inside ASM_OPERANDS, which do not indicate

Regstrapping it on x86-64 shows some failures.  The first is in the
ms-sysv ABI test and can be solved like this:

diff --git a/gcc/testsuite/gcc.target/x86_64/abi/ms-sysv/gen.cc
b/gcc/testsuite/gcc.target/x86_64/abi/ms-sysv/gen.cc
--- a/gcc/testsuite/gcc.target/x86_64/abi/ms-sysv/gen.cc
+++ b/gcc/testsuite/gcc.target/x86_64/abi/ms-sysv/gen.cc
@@ -511,7 +511,7 @@ void make_do_test (const vector<class arg> &args,
            /* End if init_test call.  */

            if (f.get_realign () && unaligned == 1)
-             out << "  __asm__ __volatile__ (\"subq $8,%%rsp\":::\"cc\");"
+             out << "  __asm__ __volatile__ (\"subq $8,%%rsp\":::\"cc\", \"memory\");"
                  << endl;

            out << "  ret = do_test_"
@@ -525,7 +525,7 @@ void make_do_test (const vector<class arg> &args,
            out << ");" << endl;

            if (f.get_realign () && unaligned == 1)
-             out << "  __asm__ __volatile__ (\"addq $8,%%rsp\":::\"cc\");"
+             out << "  __asm__ __volatile__ (\"addq $8,%%rsp\":::\"cc\", \"memory\");"
                  << endl;

            out << "  check_results (ret);" << endl;

Here we have some asms which manipulate the stack pointer (sigh!).  It
certainly may break at any moment, but right now a “memory” clobber
fixes everything for me.

Another failure is:

  FAIL: gcc.dg/guality/pr58791-4.c   -O2  -DPREVENT_OPTIMIZATION  line pr58791-4.c:32 i == 486
  FAIL: gcc.dg/guality/pr58791-4.c   -O2  -DPREVENT_OPTIMIZATION  line pr58791-4.c:32 i2 == 487
  FAIL: gcc.dg/guality/pr58791-4.c   -O3 -g  -DPREVENT_OPTIMIZATION  line pr58791-4.c:32 i == 486
  FAIL: gcc.dg/guality/pr58791-4.c   -O3 -g  -DPREVENT_OPTIMIZATION  line pr58791-4.c:32 i2 == 487
  FAIL: gcc.dg/guality/pr58791-4.c   -Os  -DPREVENT_OPTIMIZATION  line pr58791-4.c:32 i == 486
  FAIL: gcc.dg/guality/pr58791-4.c   -Os  -DPREVENT_OPTIMIZATION  line pr58791-4.c:32 i2 == 487
  FAIL: gcc.dg/guality/pr58791-4.c   -O2 -flto -fno-use-linker-plugin -flto-partition=none  -DPREVENT_OPTIMIZATION line pr58791-4.c:32 i == 486
  FAIL: gcc.dg/guality/pr58791-4.c   -O2 -flto -fno-use-linker-plugin -flto-partition=none  -DPREVENT_OPTIMIZATION line pr58791-4.c:32 i2 == 487
  FAIL: gcc.dg/guality/pr58791-4.c   -O2 -flto -fuse-linker-plugin -fno-fat-lto-objects  -DPREVENT_OPTIMIZATION line pr58791-4.c:32 i == 486
  FAIL: gcc.dg/guality/pr58791-4.c   -O2 -flto -fuse-linker-plugin -fno-fat-lto-objects  -DPREVENT_OPTIMIZATION line pr58791-4.c:32 i2 == 487

It is connected to debug info, and I cannot solve it myself.  I am not
sure how this should work when we try to print a dead-code variable
(i2) while debugging an executable compiled at -O2 (-O3/-Os).  Jakub
created that test; he is in CC already.

I will also look at aarch64 regstrap a bit later.  But that job should
also be done for all other targets.  Segher, could you test it on
rs6000?

> > > > In pr84524.c we got a loop with an extended inline asm:
> > > > asm volatile ("" : "+r" (v))

It seems a good idea to add a “memory” clobber to the asm in this test
and recheck the DDG on aarch64 with both the SMS and sched-deps patches
applied.  I'll check it.


Roman
Segher Boessenkool April 29, 2019, 5:21 p.m. UTC | #5
Hi!

On Mon, Apr 29, 2019 at 07:28:12PM +0300, Roman Zhuykov wrote:
> > This code was added in 1997 (r14770).  In 2004 the documentation was
> > changed to clarify how things really work (r88999):
> >
> > "Note that even a volatile @code{asm} instruction can be moved relative to
> > other code, including across jump instructions."
> >
> > (followed by an example exactly about what this means for FPU control).
> 
> Thanks for pointing to that changes!  Unfortunately, sched-deps.c was
> more conservative this 15 years...
> Let’s try to fix it.

If it causes problems now, that would be a good idea yes :-)

> > I mean have the modulo scheduler implement the correct asm semantics, not
> > some more restrictive thing that gets it into conflicts with DF, etc.
> >
> > I don't think this will turn out to be a problem in any way.  Some invalid
> > asm will break, sure.
> 
> I have started with applying this without any SMS changes:
> 
> diff --git a/gcc/sched-deps.c b/gcc/sched-deps.c
> --- a/gcc/sched-deps.c
> +++ b/gcc/sched-deps.c
> @@ -2753,22 +2753,14 @@ sched_analyze_2 (struct deps_desc *deps, rtx
> x, rtx_insn *insn)
> 
>      case UNSPEC_VOLATILE:
>        flush_pending_lists (deps, insn, true, true);
> +
> +      if (!DEBUG_INSN_P (insn))
> +       reg_pending_barrier = TRUE_BARRIER;
>        /* FALLTHRU */
> 
>      case ASM_OPERANDS:
>      case ASM_INPUT:
>        {
> -       /* Traditional and volatile asm instructions must be considered to use
> -          and clobber all hard registers, all pseudo-registers and all of
> -          memory.  So must TRAP_IF and UNSPEC_VOLATILE operations.
> -
> -          Consider for instance a volatile asm that changes the fpu rounding
> -          mode.  An insn should not be moved across this even if it only uses
> -          pseudo-regs because it might give an incorrectly rounded result.  */
> -       if ((code != ASM_OPERANDS || MEM_VOLATILE_P (x))
> -           && !DEBUG_INSN_P (insn))
> -         reg_pending_barrier = TRUE_BARRIER;
> -
>         /* For all ASM_OPERANDS, we must traverse the vector of input operands.
>            We cannot just fall through here since then we would be confused
>            by the ASM_INPUT rtx inside ASM_OPERANDS, which do not indicate

UNSPEC_VOLATILE and volatile asm should have the same semantics, ideally.
One step at a time I guess :-)

> Regstrapping it on x86-64 shows some failures. First is with ms-sysv
> abi test and can be solved like this:

[ snip ]

> Here we have some asms which control stack pointer (sigh!). It
> certainly may be broken at any moment, but right now “memory” clobber
> fixes everything for me.

Makes sense.

> Another failure is:
> 
>   FAIL: gcc.dg/guality/pr58791-4.c   -O2  -DPREVENT_OPTIMIZATION  line
> pr58791-4.c:32 i == 486

[ snip ]

> It is connected to debug-info, and I cannot solve it myself. I am not
> sure how this should work when we try to print dead-code variable (i2)
> while debugging -O2 (O3/Os) compiled executable.  Jakub created that
> test, he is in CC already.

What does PREVENT_OPTIMIZATION do?  It probably needs to be made a bit
stronger.

(It seems to just add some "used"?)

> I will also look at aarch64 regstrap a bit later.  But that job should
> also be done for all other targets.  Segher, could you test it on
> rs6000?

Do you have an account on the compile farm?  It has POWER7 (BE, 32/64),
as well as POWER8 and POWER9 machines (both LE).
https://cfarm.tetaneutral.net/

But I'll do it if you want.  Just let me know.

> > > > > In pr84524.c we got a loop with an extended inline asm:
> > > > > asm volatile ("" : "+r" (v))
> 
> It seems a good idea to add “memory” clobber into asm and recheck DDG
> in this test on aarch64 with both SMS and sched-deps patches applied.
> I'll check it.


Segher
Roman Zhuykov April 30, 2019, 11:12 a.m. UTC | #6
> > > This code was added in 1997 (r14770).  In 2004 the documentation was
> > > changed to clarify how things really work (r88999):
> > >
> > > "Note that even a volatile @code{asm} instruction can be moved relative to
> > > other code, including across jump instructions."
> > >
> > > (followed by an example exactly about what this means for FPU control).
> >
> > Thanks for pointing to that changes!  Unfortunately, sched-deps.c was
> > more conservative this 15 years...
> > Let’s try to fix it.
>
> If it causes problems now, that would be a good idea yes :-)
>
> > > I mean have the modulo scheduler implement the correct asm semantics, not
> > > some more restrictive thing that gets it into conflicts with DF, etc.
> > >
> > > I don't think this will turn out to be a problem in any way.  Some invalid
> > > asm will break, sure.
> >
> > I have started with applying this without any SMS changes:
> >
> > diff --git a/gcc/sched-deps.c b/gcc/sched-deps.c
> > --- a/gcc/sched-deps.c
> > +++ b/gcc/sched-deps.c
> > @@ -2753,22 +2753,14 @@ sched_analyze_2 (struct deps_desc *deps, rtx
> > x, rtx_insn *insn)
> >
> >      case UNSPEC_VOLATILE:
> >        flush_pending_lists (deps, insn, true, true);
> > +
> > +      if (!DEBUG_INSN_P (insn))
> > +       reg_pending_barrier = TRUE_BARRIER;
> >        /* FALLTHRU */
> >
> >      case ASM_OPERANDS:
> >      case ASM_INPUT:
> >        {
> > -       /* Traditional and volatile asm instructions must be considered to use
> > -          and clobber all hard registers, all pseudo-registers and all of
> > -          memory.  So must TRAP_IF and UNSPEC_VOLATILE operations.
> > -
> > -          Consider for instance a volatile asm that changes the fpu rounding
> > -          mode.  An insn should not be moved across this even if it only uses
> > -          pseudo-regs because it might give an incorrectly rounded result.  */
> > -       if ((code != ASM_OPERANDS || MEM_VOLATILE_P (x))
> > -           && !DEBUG_INSN_P (insn))
> > -         reg_pending_barrier = TRUE_BARRIER;
> > -
> >         /* For all ASM_OPERANDS, we must traverse the vector of input operands.
> >            We cannot just fall through here since then we would be confused
> >            by the ASM_INPUT rtx inside ASM_OPERANDS, which do not indicate
>
> UNSPEC_VOLATILE and volatile asm should have the same semantics, ideally.
> One step at a time I guess :-)

When the barrier for UNSPEC_VOLATILE is also dropped, there are more
issues on x86_64:

FAIL: gcc.dg/vect/pr62021.c -flto -ffat-lto-objects execution test
FAIL: gcc.dg/vect/pr62021.c execution test
FAIL: gcc.target/i386/avx2-vect-aggressive.c execution test

I haven’t looked at those vectorization tests.

> > Regstrapping it on x86-64 shows some failures. First is with ms-sysv
> > abi test and can be solved like this:
>
> [ snip ]
>
> > Here we have some asms which control stack pointer (sigh!). It
> > certainly may be broken at any moment, but right now “memory” clobber
> > fixes everything for me.
>
> Makes sense.
>
> > Another failure is:
> >
> >   FAIL: gcc.dg/guality/pr58791-4.c   -O2  -DPREVENT_OPTIMIZATION  line
> > pr58791-4.c:32 i == 486
>
> [ snip ]
>
> > It is connected to debug-info, and I cannot solve it myself. I am not
> > sure how this should work when we try to print dead-code variable (i2)
> > while debugging -O2 (O3/Os) compiled executable.  Jakub created that
> > test, he is in CC already.
>
> What does PREVENT_OPTIMIZATION do?  It probably needs to be made a bit
> stronger.

It seems the PREVENT_OPTIMIZATION stuff influences only one test –
gcc.dg/guality/pr45882.c.  Other tests do not contain the
ATTRIBUTE_USED macro.

> (It seems to just add some "used"?)
>
> > I will also look at aarch64 regstrap a bit later.  But that job should
> > also be done for all other targets.  Segher, could you test it on
> > rs6000?
>
> Do you have an account on the compile farm?  It has POWER7 (BE, 32/64),
> as well as POWER8 and POWER9 machines (both LE).
> https://cfarm.tetaneutral.net/

Ok, I’ll use the farm; I have already filled in the “new user” form.
Aarch64 is still testing; maybe next time I will use the farm instead
of qemu-system.

I forgot to mention in the previous letter that with this patch we
also drop dependencies between volatile asms and volatile mems.  We had
an offline discussion with Alex, and it seems that since 2004 the docs
have never guaranteed such a dependency; the user should add relevant
constraints to the asm in all cases.  But somebody may have another
opinion about this tricky point.


Roman
Jeff Law April 30, 2019, 3:11 p.m. UTC | #7
On 4/29/19 10:28 AM, Roman Zhuykov wrote:
> Hi, Segher
> 
>>>>> In pr84524.c we got a loop with an extended inline asm:
>>>>> asm volatile ("" : "+r" (v))
>>>>> which also gives us a “surprising” situation Alexander predicts.
>>>>>
>>>>> For sched-deps scanner such volatile asm is a “true barrier” and we
>>>>> create dependencies to almost all other instructions, while DF scanners
>>>>> don’t give us such information.
>>>>
>>>> There is no such thing as a "true barrier" in extended asm.  The only thing
>>>> volatile asm means is that the asm has a side effect unknown to the compiler;
>>>> this can *not* be a modification of a register or of memory contents, such
>>>> things are known by the compiler, that's what clobbers and "memory" clobber
>>>> are about.
>>>
>>> In sched-deps.c we got:
>>>     case ASM_OPERANDS:
>>>     case ASM_INPUT:
>>>       {
>>>        /* Traditional and volatile asm instructions must be considered to use
>>>           and clobber all hard registers, all pseudo-registers and all of
>>>           memory.  So must TRAP_IF and UNSPEC_VOLATILE operations.
>>>
>>>           Consider for instance a volatile asm that changes the fpu rounding
>>>           mode.  An insn should not be moved across this even if it only uses
>>>           pseudo-regs because it might give an incorrectly rounded result.  */
>>>        if ((code != ASM_OPERANDS || MEM_VOLATILE_P (x))
>>>            && !DEBUG_INSN_P (insn))
>>>          reg_pending_barrier = TRUE_BARRIER;
>>>        ...
>>
>> This code was added in 1997 (r14770).  In 2004 the documentation was
>> changed to clarify how things really work (r88999):
>>
>> "Note that even a volatile @code{asm} instruction can be moved relative to
>> other code, including across jump instructions."
>>
>> (followed by an example exactly about what this means for FPU control).
> 
> Thanks for pointing to that changes!  Unfortunately, sched-deps.c was
> more conservative this 15 years...
> Let’s try to fix it.
Also note that the integration of the Haifa scheduler was a large
change.  A small detail like this could have been missed.

Jeff

Patch

diff --git a/gcc/modulo-sched.c b/gcc/modulo-sched.c
--- a/gcc/modulo-sched.c
+++ b/gcc/modulo-sched.c
@@ -735,6 +735,7 @@  schedule_reg_moves (partial_schedule_ptr ps)
         continue;

        /* Create NREG_MOVES register moves.  */
+      gcc_assert (flag_modulo_sched_allow_regmoves);
        first_move = ps->reg_moves.length ();
        ps->reg_moves.safe_grow_cleared (first_move + nreg_moves);
        extend_node_sched_params (ps);
@@ -744,8 +745,7 @@  schedule_reg_moves (partial_schedule_ptr ps)

        /* Generate each move.  */
        old_reg = prev_reg = SET_DEST (set);
-      if (HARD_REGISTER_P (old_reg))
-       return false;
+      gcc_assert (!HARD_REGISTER_P (old_reg));

        for (i_reg_move = 0; i_reg_move < nreg_moves; i_reg_move++)
         {

On current trunk this patch doesn’t give any failures on powerpc, but 
I’ve checked other platforms and found gcc.c-torture/execute/pr84524.c 
ICEing with -O1 -fmodulo-sched on aarch64.  Speaking about sms-13.c, it 
happens not to be a proper loop for modulo scheduling on current trunk.  
Changes in other passes caused this, and that file actually tests 
nothing right now.

Then I added the 8 branch to my testing, and decided first to find a 
proper way to fix sms-13.c on the 8 branch.

Without -fmodulo-sched-allow-regmoves each TRUE_DEP edge should be 
paired with another edge (sometimes ANTI_DEP or OUTPUT_DEP) in the 
opposite direction.  My assertion failure means that the DDG was odd.

As Alexander mentioned here
https://gcc.gnu.org/ml/gcc-patches/2018-04/msg00028.html
“SMS uses sched-deps for intra-loop deps, and then separately uses DF 
for cross-iteration deps, which means that it should be ready for 
surprises when the two scanners are not 100% in sync.”

The situation here is one of such surprises, and debugging shows that 
we lose one of the dependencies here.

Consider these two instructions:
(insn 30 29 31 8 (parallel [
             (set (reg:SI 142)
                 (minus:SI (reg/v:SI 133 [ x+-2 ])
                     (reg/v:SI 130 [ b+-2 ])))
             (set (reg:SI 76 ca)
                 (leu:SI (reg/v:SI 130 [ b+-2 ])
                     (reg/v:SI 133 [ x+-2 ])))
         ]) "test.c":22 107 {subfsi3_carry}
      (expr_list:REG_UNUSED (reg:SI 142)
         (nil)))

(insn 31 30 32 8 (parallel [
             (set (reg:SI 143)
                 (plus:SI (reg:SI 76 ca)
                     (const_int -1 [0xffffffffffffffff])))
             (clobber (reg:SI 76 ca))
         ]) "test.c":22 116 {subfsi3_carry_in_xx}
      (expr_list:REG_DEAD (reg:SI 76 ca)
         (expr_list:REG_UNUSED (reg:SI 76 ca)
             (nil))))

Sched-deps organizes the obvious 30->31 true dependency, and we want 
DF to help us with the 31->30 output dependency here.  CA is clobbered 
in insn 31, and if we move insn 31 forward we have to stop at insn 30 
on the next iteration.

Build_inter_loop_deps doesn’t create such an output dependency because 
the reaching-definitions “rd_bb_info->gen” set doesn’t contain the CA 
register, maybe because it is a clobber, not a set.  So we need to 
organize that dependency somehow, and I’ve found that the df_live set 
can be used here instead.

I proceed with the following change:

diff --git a/gcc/ddg.c b/gcc/ddg.c
--- a/gcc/ddg.c
+++ b/gcc/ddg.c
@@ -365,16 +365,14 @@  add_cross_iteration_register_deps (ddg_ptr g, df_ref last_def)
  static void
  build_inter_loop_deps (ddg_ptr g)
  {
-  unsigned rd_num;
-  struct df_rd_bb_info *rd_bb_info;
+  unsigned regno;
    bitmap_iterator bi;
-
-  rd_bb_info = DF_RD_BB_INFO (g->bb);
+  struct df_live_bb_info *live_bb_info = DF_LIVE_BB_INFO (g->bb);

    /* Find inter-loop register output, true and anti deps.  */
-  EXECUTE_IF_SET_IN_BITMAP (&rd_bb_info->gen, 0, rd_num, bi)
+  EXECUTE_IF_SET_IN_BITMAP (&live_bb_info->gen, 0, regno, bi)
    {
-    df_ref rd = DF_DEFS_GET (rd_num);
+    df_ref rd = df_bb_regno_last_def_find (g->bb, regno);

      add_cross_iteration_register_deps (g, rd);
    }

(Here I skipped the minor hunks which properly add/remove the df_live 
problem in modulo-sched.c when optimize == 1.)

This fix corrected sms-13.c on the 8 branch, but pr84524.c mentioned 
above was still ICEing on both trunk and the 8 branch.  No other 
failures were introduced while testing this intermediate patch on 
different branches and platforms.  The tested patch also temporarily 
contained additional checks that the only difference between the 
“rd_bb_info->gen” and “live_bb_info->gen” sets may be additional hard 
registers in the live set.

In pr84524.c we got a loop with an extended inline asm:
asm volatile ("" : "+r" (v))
which also gives us a “surprising” situation Alexander predicts.

For the sched-deps scanner such a volatile asm is a “true barrier” and 
we create dependencies to almost all other instructions, while the DF 
scanners don’t give us such information.

Maybe it is a good idea to somehow re-implement the modulo scheduler 
using only one scanner instead of two, but at the moment the best 
thing to do is to detect the situation earlier and skip such loops.

So, the last two hunks are:

@@ -1473,6 +1478,8 @@  sms_schedule (void)

          if (CALL_P (insn)
              || BARRIER_P (insn)
+            || (NONDEBUG_INSN_P (insn) && extract_asm_operands (PATTERN (insn))
+                && MEM_VOLATILE_P (extract_asm_operands (PATTERN (insn))))
              || (NONDEBUG_INSN_P (insn) && !JUMP_P (insn)
                  && !single_set (insn) && GET_CODE (PATTERN (insn)) != USE
                  && !reg_mentioned_p (count_reg, insn))
@@ -1489,6 +1496,11 @@  sms_schedule (void)
                 fprintf (dump_file, "SMS loop-with-call\n");
               else if (BARRIER_P (insn))
                 fprintf (dump_file, "SMS loop-with-barrier\n");
+             else if (NONDEBUG_INSN_P (insn)
+                      && extract_asm_operands (PATTERN (insn))
+                      && MEM_VOLATILE_P (extract_asm_operands
+                                               (PATTERN (insn))))
+               fprintf (dump_file, "SMS loop-with-volatile-asm\n");
                else if ((NONDEBUG_INSN_P (insn) && !JUMP_P (insn)
                  && !single_set (insn) && GET_CODE (PATTERN (insn)) != USE))

Patch testing is addressed in my cover letter.  Ok for trunk?

gcc/ChangeLog:

2019-04-10  Roman Zhuykov  <zhroma@ispras.ru>

	PR target/83507
	* ddg.c (build_inter_loop_deps): Use live_bb_info instead of
	rd_bb_info to search last register definitions.
	* modulo-sched.c (schedule_reg_moves): Add an assertion which
	prevents creating register moves when it is not allowed.
	(sms_schedule): Add/remove df_live problem when optimize == 1.
	Bail out when we got volatile asm inside the loop.


diff --git a/gcc/ddg.c b/gcc/ddg.c
--- a/gcc/ddg.c
+++ b/gcc/ddg.c
@@ -365,16 +365,14 @@  add_cross_iteration_register_deps (ddg_ptr g, df_ref last_def)
  static void
  build_inter_loop_deps (ddg_ptr g)
  {
-  unsigned rd_num;
-  struct df_rd_bb_info *rd_bb_info;
+  unsigned regno;
    bitmap_iterator bi;
-
-  rd_bb_info = DF_RD_BB_INFO (g->bb);
+  struct df_live_bb_info *live_bb_info = DF_LIVE_BB_INFO (g->bb);

    /* Find inter-loop register output, true and anti deps.  */
-  EXECUTE_IF_SET_IN_BITMAP (&rd_bb_info->gen, 0, rd_num, bi)
+  EXECUTE_IF_SET_IN_BITMAP (&live_bb_info->gen, 0, regno, bi)
    {
-    df_ref rd = DF_DEFS_GET (rd_num);
+    df_ref rd = df_bb_regno_last_def_find (g->bb, regno);

      add_cross_iteration_register_deps (g, rd);