diff mbox

Honnor ix86_accumulate_outgoing_args again

Message ID 20131002173249.GB12304@kam.mff.cuni.cz
State New
Headers show

Commit Message

Jan Hubicka Oct. 2, 2013, 5:32 p.m. UTC
Hi,
currently ix86_accumulate_outgoing_args is ignored on all targets except for
Solaris (that sets USE_IX86_FRAME_POINTER to true).  It seems like accidental
effect of http://gcc.gnu.org/ml/gcc-patches/2010-08/txt00102.txt that enabled
omit-frame-pointer for 32bit (I take the 64bit change was purely accidental)
probably based on the fact non-accumulate-outgoing-args was not doing well with
assynchronous unwind info.

The reason for this seems to be gone by
http://gcc.gnu.org/ml/gcc-patches/2013-03/msg00995.html

So I thing we ought to honnor accumulate-outgoing-args again and in fact
consider disabling it for generic - it is disabled for core (that may need
re-benchmarking). For all AMD targets it is currently on.  I tested disabling
it on buldozer 32bit and it seems mostly SPEC neutral for specint2000 (I am
wating for more benchmarks) with very nice code size improvements in all
benchmarks with exception of MCF with LTO (not sure at all why), with overall
reduction of 5.2% (same gain as we get for -flto aproximately)
http://gcc.opensuse.org/SPEC/CINT/sb-megrez-head-64-32o-32bit/size.html

There may be close to noise factor drops as seen in
http://gcc.opensuse.org/SPEC/CINT/sb-megrez-head-64-32o-32bit/recent.html I
will see how other tests shape and wait for multiple runs to show how much of
this is actual noise. We may consider disabling it for size optimized functions
and -O2 (and not for -O3) at least.

This patch however only remove code forcingly enabling
MASK_ACCUMULATE_OUTGOING_ARGS.  If there will be no complains, I will commit it
tomorrow.

Honza

	* i386.c (ix86_option_override_internal): Do not force
	ACCUMULATE_OUTGOING_ARGS when unwind info is generated.

Comments

Jan Hubicka Oct. 2, 2013, 10:45 p.m. UTC | #1
> So I thing we ought to honnor accumulate-outgoing-args again and in fact
> consider disabling it for generic - it is disabled for core (that may need
> re-benchmarking). For all AMD targets it is currently on.  I tested disabling
> it on buldozer 32bit and it seems mostly SPEC neutral for specint2000 (I am
> wating for more benchmarks) with very nice code size improvements in all
> benchmarks with exception of MCF with LTO (not sure at all why), with overall
> reduction of 5.2% (same gain as we get for -flto aproximately)
> http://gcc.opensuse.org/SPEC/CINT/sb-megrez-head-64-32o-32bit/size.html
Specint 2000/2006 seems quite good. (though for spec2k6 the differences in sizes
are no longer anywhere near to -flto code size reductions)

Unfortunately there is 40% regression on mgrid with -flto (and also noticeable
regression without LTO).  First thing I noticed is that we stop omitting frame
pointer in the hottest function.  This is because we see:

(set (reg/f:SI 7 sp)
    (plus:SI (reg/f:SI 7 sp)
        (const_int -8 [0xfffffffffffffff8])))

and we end up marking SP as as uneliminable in:

          /* See if this is setting the replacement hard register for
             an elimination.

             If DEST is the hard frame pointer, we do nothing because
             we assume that all assignments to the frame pointer are
             for non-local gotos and are being done at a time when
             they are valid and do not disturb anything else.  Some
             machines want to eliminate a fake argument pointer (or
             even a fake frame pointer) with either the real frame
             pointer or the stack pointer.  Assignments to the hard
             frame pointer must not prevent this elimination.  */

          for (ep = reg_eliminate;
               ep < &reg_eliminate[NUM_ELIMINABLE_REGS];
               ep++)
            if (ep->to_rtx == SET_DEST (x)
                && SET_DEST (x) != hard_frame_pointer_rtx
                && (! (SUPPORTS_STACK_ALIGNMENT && stack_realign_fp
                       && REGNO (ep->to_rtx) == STACK_POINTER_REGNUM)
                    || GET_CODE (SET_SRC (x)) != PLUS
                    || XEXP (SET_SRC (x), 0) != SET_DEST (x)
                    || ! CONST_INT_P (XEXP (SET_SRC (x), 1))))
              setup_can_eliminate (ep, false);

It is because of

                && (! (SUPPORTS_STACK_ALIGNMENT && stack_realign_fp
                       && REGNO (ep->to_rtx) == STACK_POINTER_REGNUM)

I am somewhat confused why do we need to stop eliminating.  Function is not
marked as needing drap (and in that case stack_realign_fp would be true)
What is this conditional shooting for?

Mgrid is somewhat sensitive to register pressure, so this actually may
explain the 40% difference...

Honza
H.J. Lu Oct. 2, 2013, 10:50 p.m. UTC | #2
On Wed, Oct 2, 2013 at 3:45 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> So I thing we ought to honnor accumulate-outgoing-args again and in fact
>> consider disabling it for generic - it is disabled for core (that may need
>> re-benchmarking). For all AMD targets it is currently on.  I tested disabling
>> it on buldozer 32bit and it seems mostly SPEC neutral for specint2000 (I am
>> wating for more benchmarks) with very nice code size improvements in all
>> benchmarks with exception of MCF with LTO (not sure at all why), with overall
>> reduction of 5.2% (same gain as we get for -flto aproximately)
>> http://gcc.opensuse.org/SPEC/CINT/sb-megrez-head-64-32o-32bit/size.html
> Specint 2000/2006 seems quite good. (though for spec2k6 the differences in sizes
> are no longer anywhere near to -flto code size reductions)
>
> Unfortunately there is 40% regression on mgrid with -flto (and also noticeable
> regression without LTO).  First thing I noticed is that we stop omitting frame
> pointer in the hottest function.  This is because we see:

Does it happen with both 32-bit and 64-bit?

> (set (reg/f:SI 7 sp)
>     (plus:SI (reg/f:SI 7 sp)
>         (const_int -8 [0xfffffffffffffff8])))
>
> and we end up marking SP as as uneliminable in:
>
>           /* See if this is setting the replacement hard register for
>              an elimination.
>
>              If DEST is the hard frame pointer, we do nothing because
>              we assume that all assignments to the frame pointer are
>              for non-local gotos and are being done at a time when
>              they are valid and do not disturb anything else.  Some
>              machines want to eliminate a fake argument pointer (or
>              even a fake frame pointer) with either the real frame
>              pointer or the stack pointer.  Assignments to the hard
>              frame pointer must not prevent this elimination.  */
>
>           for (ep = reg_eliminate;
>                ep < &reg_eliminate[NUM_ELIMINABLE_REGS];
>                ep++)
>             if (ep->to_rtx == SET_DEST (x)
>                 && SET_DEST (x) != hard_frame_pointer_rtx
>                 && (! (SUPPORTS_STACK_ALIGNMENT && stack_realign_fp
>                        && REGNO (ep->to_rtx) == STACK_POINTER_REGNUM)
>                     || GET_CODE (SET_SRC (x)) != PLUS
>                     || XEXP (SET_SRC (x), 0) != SET_DEST (x)
>                     || ! CONST_INT_P (XEXP (SET_SRC (x), 1))))
>               setup_can_eliminate (ep, false);
>
> It is because of
>
>                 && (! (SUPPORTS_STACK_ALIGNMENT && stack_realign_fp
>                        && REGNO (ep->to_rtx) == STACK_POINTER_REGNUM)
>
> I am somewhat confused why do we need to stop eliminating.  Function is not
> marked as needing drap (and in that case stack_realign_fp would be true)
> What is this conditional shooting for?

Why is stack_realign_fp true?
Vladimir Makarov Oct. 3, 2013, 1:10 a.m. UTC | #3
On 13-10-02 6:45 PM, Jan Hubicka wrote:
>> So I thing we ought to honnor accumulate-outgoing-args again and in fact
>> consider disabling it for generic - it is disabled for core (that may need
>> re-benchmarking). For all AMD targets it is currently on.  I tested disabling
>> it on buldozer 32bit and it seems mostly SPEC neutral for specint2000 (I am
>> wating for more benchmarks) with very nice code size improvements in all
>> benchmarks with exception of MCF with LTO (not sure at all why), with overall
>> reduction of 5.2% (same gain as we get for -flto aproximately)
>> http://gcc.opensuse.org/SPEC/CINT/sb-megrez-head-64-32o-32bit/size.html
> Specint 2000/2006 seems quite good. (though for spec2k6 the differences in sizes
> are no longer anywhere near to -flto code size reductions)
>
> Unfortunately there is 40% regression on mgrid with -flto (and also noticeable
> regression without LTO).  First thing I noticed is that we stop omitting frame
> pointer in the hottest function.  This is because we see:
>
> (set (reg/f:SI 7 sp)
>      (plus:SI (reg/f:SI 7 sp)
>          (const_int -8 [0xfffffffffffffff8])))
>
> and we end up marking SP as as uneliminable in:
>
>            /* See if this is setting the replacement hard register for
>               an elimination.
>
>               If DEST is the hard frame pointer, we do nothing because
>               we assume that all assignments to the frame pointer are
>               for non-local gotos and are being done at a time when
>               they are valid and do not disturb anything else.  Some
>               machines want to eliminate a fake argument pointer (or
>               even a fake frame pointer) with either the real frame
>               pointer or the stack pointer.  Assignments to the hard
>               frame pointer must not prevent this elimination.  */
>
>            for (ep = reg_eliminate;
>                 ep < &reg_eliminate[NUM_ELIMINABLE_REGS];
>                 ep++)
>              if (ep->to_rtx == SET_DEST (x)
>                  && SET_DEST (x) != hard_frame_pointer_rtx
>                  && (! (SUPPORTS_STACK_ALIGNMENT && stack_realign_fp
>                         && REGNO (ep->to_rtx) == STACK_POINTER_REGNUM)
>                      || GET_CODE (SET_SRC (x)) != PLUS
>                      || XEXP (SET_SRC (x), 0) != SET_DEST (x)
>                      || ! CONST_INT_P (XEXP (SET_SRC (x), 1))))
>                setup_can_eliminate (ep, false);
>
> It is because of
>
>                  && (! (SUPPORTS_STACK_ALIGNMENT && stack_realign_fp
>                         && REGNO (ep->to_rtx) == STACK_POINTER_REGNUM)
>
> I am somewhat confused why do we need to stop eliminating.  Function is not
> marked as needing drap (and in that case stack_realign_fp would be true)
> What is this conditional shooting for?
This part was added to fix

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57018

and because LRA still misses some reload functionality for elimination.  
I am a bit embarrassed: I have this thing to do for 4 months and I still 
did not start to work on it yet.  There are too much things on my plate.

As we are going to use outgoing arg accumulation, this problem is 
becoming higher priority one.
Jan Hubicka Oct. 3, 2013, 9:24 a.m. UTC | #4
> > Unfortunately there is 40% regression on mgrid with -flto (and also noticeable
> > regression without LTO).  First thing I noticed is that we stop omitting frame
> > pointer in the hottest function.  This is because we see:
> 
> Does it happen with both 32-bit and 64-bit?
No, 32bit only.
> 
> > (set (reg/f:SI 7 sp)
> >     (plus:SI (reg/f:SI 7 sp)
> >         (const_int -8 [0xfffffffffffffff8])))
> >
> > and we end up marking SP as as uneliminable in:
> >
> >           /* See if this is setting the replacement hard register for
> >              an elimination.
> >
> >              If DEST is the hard frame pointer, we do nothing because
> >              we assume that all assignments to the frame pointer are
> >              for non-local gotos and are being done at a time when
> >              they are valid and do not disturb anything else.  Some
> >              machines want to eliminate a fake argument pointer (or
> >              even a fake frame pointer) with either the real frame
> >              pointer or the stack pointer.  Assignments to the hard
> >              frame pointer must not prevent this elimination.  */
> >
> >           for (ep = reg_eliminate;
> >                ep < &reg_eliminate[NUM_ELIMINABLE_REGS];
> >                ep++)
> >             if (ep->to_rtx == SET_DEST (x)
> >                 && SET_DEST (x) != hard_frame_pointer_rtx
> >                 && (! (SUPPORTS_STACK_ALIGNMENT && stack_realign_fp
> >                        && REGNO (ep->to_rtx) == STACK_POINTER_REGNUM)
> >                     || GET_CODE (SET_SRC (x)) != PLUS
> >                     || XEXP (SET_SRC (x), 0) != SET_DEST (x)
> >                     || ! CONST_INT_P (XEXP (SET_SRC (x), 1))))
> >               setup_can_eliminate (ep, false);
> >
> > It is because of
> >
> >                 && (! (SUPPORTS_STACK_ALIGNMENT && stack_realign_fp
> >                        && REGNO (ep->to_rtx) == STACK_POINTER_REGNUM)
> >
> > I am somewhat confused why do we need to stop eliminating.  Function is not
> > marked as needing drap (and in that case stack_realign_fp would be true)
> > What is this conditional shooting for?
> 
> Why is stack_realign_fp true?
It is false, but htere is ! in the conditional.

Honza
> 
> 
> -- 
> H.J.
Jan Hubicka Oct. 3, 2013, 1:05 p.m. UTC | #5
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57018
> 
> and because LRA still misses some reload functionality for
> elimination.  I am a bit embarrassed: I have this thing to do for 4
> months and I still did not start to work on it yet.  There are too
> much things on my plate.
> 
> As we are going to use outgoing arg accumulation, this problem is
> becoming higher priority one.

Thank you,
we currently use outgoing arg accumulation always on x86_64, I plan to
re-disable arg accumulation on CPUs that handle push/pop well (i.e. have stack
engine).  This brings nice code size savings.

I wonder how much this actually comes from not omitting frame pointer in
non-leaf functions with IRA. EBP based addressing is more compact than ESP
and thus -fomit-frame-pointer is disabled with -Os.

Perhaps frame elimination can be actually decided on by register allocation?

On similar note I just benchmarked -mfpmath=sse for 32bit code and it is quite
big performance win and again causes about 5% code size regression.  I want to
propose defaulting to -mfpmath=sse for 32bit for -ffast-math and -Ofast.  (in a
way I would like to see -mfpmath=sse by default for 32bit on CPUs supporting
SSE2, but that has been voted down long time ago becuase it loses the 80bit
precision for temporaries in double/float computations).

I wonder if we can eventually make -mfpmath=sse,387 working well (I did not
bechmark it yet, but statically it still produces more spiling than -mfpmath=sse)
and/or if we can possibly decide on fpmath based on hotness of function
(at least with profile around).

Thanks for all the hard work on IRA!
Honza
diff mbox

Patch

Index: config/i386/i386.c
===================================================================
--- config/i386/i386.c	(revision 203117)
+++ config/i386/i386.c	(working copy)
@@ -3793,28 +3793,11 @@  ix86_option_override_internal (bool main
       }
 
   ix86_tune_mask = 1u << ix86_tune;
-  if ((!USE_IX86_FRAME_POINTER
-       || (x86_accumulate_outgoing_args & ix86_tune_mask))
+  if ((x86_accumulate_outgoing_args & ix86_tune_mask)
       && !(target_flags_explicit & MASK_ACCUMULATE_OUTGOING_ARGS)
       && !optimize_size)
     target_flags |= MASK_ACCUMULATE_OUTGOING_ARGS;
 
-  /* ??? Unwind info is not correct around the CFG unless either a frame
-     pointer is present or M_A_O_A is set.  Fixing this requires rewriting
-     unwind info generation to be aware of the CFG and propagating states
-     around edges.  */
-  if ((flag_unwind_tables || flag_asynchronous_unwind_tables
-       || flag_exceptions || flag_non_call_exceptions)
-      && flag_omit_frame_pointer
-      && !(target_flags & MASK_ACCUMULATE_OUTGOING_ARGS))
-    {
-      if (target_flags_explicit & MASK_ACCUMULATE_OUTGOING_ARGS)
-	warning (0, "unwind tables currently require either a frame pointer "
-		 "or %saccumulate-outgoing-args%s for correctness",
-		 prefix, suffix);
-      target_flags |= MASK_ACCUMULATE_OUTGOING_ARGS;
-    }
-
   /* If stack probes are required, the space used for large function
      arguments on the stack must also be probed, so enable
      -maccumulate-outgoing-args so this happens in the prologue.  */