
Honnor ix86_accumulate_outgoing_args again

Message ID 20131010184005.GA26449@kam.mff.cuni.cz
State New

Commit Message

Jan Hubicka Oct. 10, 2013, 6:40 p.m. UTC
Hi,
this patch makes ACCUMULATE_OUTGOING_ARGS disable itself when the function is
cold.  I did some extra testing and, to my amusement, we now seem to output
more compact unwind info when ACCUMULATE_OUTGOING_ARGS is disabled, so this
seems to be a quite consistent code size win.

We can actually do better and enable ACCUMULATE_OUTGOING_ARGS only when the
function contains hot calls.  This should also avoid the need for frame
allocation in the prologue/epilogue on the hot path.  I will look into this
incrementally.

I also noticed that we still have some tuning flags in i386.c rather than
in x86-tune.def, so I moved them there.

Testing on x86_64-linux; I will commit it once testing converges.

Honza
	* config/i386/i386.h (ACCUMULATE_OUTGOING_ARGS): Disable accumulation
	for cold functions.
	* x86-tune.def (X86_TUNE_USE_LEAVE): Update comment.
	(X86_TUNE_PUSH_MEMORY): Likewise.
	(X86_TUNE_AVX256_UNALIGNED_LOAD_OPTIMAL,
	X86_TUNE_AVX256_UNALIGNED_STORE_OPTIMAL): New.
	(X86_TUNE_ACCUMULATE_OUTGOING_ARGS, X86_TUNE_ALWAYS_FANCY_MATH_387): New.
	* i386.c (x86_accumulate_outgoing_args, x86_arch_always_fancy_math_387,
	x86_avx256_split_unaligned_load, x86_avx256_split_unaligned_store):
	Remove.
	(ix86_option_override_internal): Update to use tune features instead
	of variables.

Comments

Jan Hubicka Oct. 19, 2013, 8:30 p.m. UTC | #1
> Jan,
> 
> Does this seem reasonable to you?

Oops, sorry, I missed your email. (I was travelling and I am finishing a paper
now).
> 
> Thanks,
> Igor
> 
> > -----Original Message-----
> > From: Zamyatin, Igor
> > Sent: Tuesday, October 15, 2013 3:48 PM
> > To: Jan Hubicka
> > Subject: RE: Honnor ix86_accumulate_outgoing_args again
> > 
> > Jan,
> > 
> > Now we have following prologue in, say, phi0 routine in equake
> > 
> > 0x804aa90 1  push   %ebp
> > 0x804aa91 2  mov    %esp,%ebp
> > 0x804aa93 3  sub    $0x18,%esp
> > 0x804aa96 4  vmovsd 0x80ef7a8,%xmm0
> > 0x804aa9e 5  vmovsd 0x8(%ebp),%xmm1
> > 0x804aaa3 6  vcomisd %xmm1,%xmm0   <-- we see big stall somewhere here
> > or 1-2 instructions above
> > 
> > While earlier it was
> > 
> > 0x804abd0 1 sub    $0x2c,%esp
> > 0x804abd3 2 vmovsd 0x30(%esp),%xmm1
> > 0x804abd9 3 vmovsd 0x80efcc8,%xmm0
> > 0x804abe1 4 vcomisd %xmm1,%xmm0

Thanks for the analysis!  It is a different benchmark than the one on
Bulldozer, but apparently the same case.  Again, we used to eliminate the frame
pointer here, but IRA now doesn't.  Do you see the same regression with
-fno-omit-frame-pointer -maccumulate-outgoing-args?

I suppose this is a conflict between the push instruction, which is handled by
the stack engine, and the initialization of EBP, which isn't.  That would
explain why Bulldozer doesn't seem to care about this particular benchmark (its
stack engine seems to have a quite different design).

This is a bit of a sad situation - accumulate-outgoing-args is expensive
code-size wise, and it seems we don't really need esp with
-mno-accumulate-outgoing-args.  The non-accumulation code path was mistakenly
disabled for too long ;(

Vladimir, how much effort do you think it would take to fix the frame pointer
elimination here?
I can imagine it is quite a tricky case.  If so, I would suggest adding
m_CORE_ALL to X86_TUNE_ACCUMULATE_OUTGOING_ARGS with a comment explaining the
problem and mentioning the regressions on equake on Core and on mgrid on
Bulldozer, and opening an enhancement request for this...
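
For concreteness, that stop-gap would amount to something like the following in
x86-tune.def (a hypothetical, uncommitted sketch based on the entry in the
posted patch, with m_CORE_ALL added as suggested):

```c
/* Hypothetical workaround: keep accumulation enabled on Core chips too,
   since disabling it currently forces a frame pointer (IRA limitation)
   and regresses equake on Core and mgrid on Bulldozer.  */
DEF_TUNE (X86_TUNE_ACCUMULATE_OUTGOING_ARGS, "accumulate_outgoing_args",
          m_PPRO | m_P4_NOCONA | m_ATOM | m_SLM | m_CORE_ALL
          | m_AMD_MULTIPLE | m_GENERIC)
```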

I also wonder, if direct ESP use and push/pop instructions are causing such
noticeable issues, whether we can't "shrink wrap" this into the red zone in
64-bit compilation.  It seems that even with -maccumulate-outgoing-args,
pushing the frame allocation as late as possible in the function would be a
good idea, so that it is not close to the push/pop/call/ret.

Honza

> > 
> > Note that for Atom and SLM no regression happens because they are already
> > in x86_accumulate_outgoing_args. As for other machines  - seems now
> > (after your change) they don't get that
> > MASK_ACCUMULATE_OUTGOING_ARGS and it leads to using ebp in the
> > prologue.
> > 
> > Thanks,
> > Igor
> > 
> > -----Original Message-----
> > From: Jan Hubicka [mailto:hubicka@ucw.cz]
> > Sent: Monday, October 14, 2013 8:44 PM
> > To: Zamyatin, Igor
> > Cc: Jan Hubicka
> > Subject: Re: Honnor ix86_accumulate_outgoing_args again
> > 
> > > Jan,
> > >
> > > I see that you haven't committed this change. Any particular reason for
> > this?
> > 
> > No, seems I forgot about it.
> > >
> > > BTW, after r203171 (http://gcc.gnu.org/ml/gcc-cvs/2013-
> > 10/msg00122.html) we see lot of performance degradation on spec2000 and
> > spec2006 tests on SandyBridge and IvyBridge in 32 bits mode. E.g.
> > 183.equake got ~-15%. Have you seen it?
> > 
> > I tested this only on Bulldozer chips where I saw large regression from mgrid,
> > but curiously not from equake.  I tracked it down to frame pointer being
> > forced by IRA that Vladimir promised to look at later.
> > 
> > I assumed that arg accumulation was tested before the flags was set, so I did
> > not benchmarked chips that previously dsabled it. Perhaps things changed
> > because IRA was merged in meantime while this feature was accidentally
> > disabled.
> > 
> > It would be great if you can figure out why equake regress - in FP
> > benchmarks we do not use push/pop so perhaps we just end up emitting
> > something really stupid or we manage to confuse stack engine...
> > 
> > Honza
> > 
> > >
> > > Thanks,
> > > Igor
> > >
> > > -----Original Message-----
> > > From: gcc-patches-owner@gcc.gnu.org
> > > [mailto:gcc-patches-owner@gcc.gnu.org] On Behalf Of Jan Hubicka
> > > Sent: Thursday, October 10, 2013 10:40 PM
> > > To: Jan Hubicka
> > > Cc: Vladimir Makarov; gcc-patches@gcc.gnu.org; hjl.tools@gmail.com;
> > > ubizjak@gmail.com; rth@redhat.com;
> > Ganesh.Gopalasubramanian@amd.com
> > > Subject: Re: Honnor ix86_accumulate_outgoing_args again
> > >
> > > Hi,
> > > this patch makes ACCUMULATE_OUTGOING_ARGS to disable itself when
> > function is cold.  I did some extra testing and to my amusement we now
> > seem to output more compact unwind info when
> > ACCUMULATE_OUTGOING_ARGS is disabled, so this seems quite consistent
> > code size win.
> > >
> > > We actually can do better and enable ACCUMULATE_OUTGOING_ARGS
> > only when function contains hot calls.  This should also avoid need for frame
> > allocation in prologue/epilogue on hot path then. I will look into this
> > incrementally..
> > >
> > > I also noticed that we still have some tuning flags in i386.c rather than in
> > x86-tune.c so I moved them there.
> > >
> > > Testing x86_64-linux and will commit it once testing converge.
> > >
> > > Honza
> > > 	* config/i386/i386.h (ACCUMULATE_OUTGOING_ARGS): Disable
> > accumulation
> > > 	for cold functions.
> > > 	* x86-tune.def (X86_TUNE_USE_LEAVE): Update comment.
> > > 	(X86_TUNE_PUSH_MEMORY): Likewise.
> > > 	(X86_TUNE_AVX256_UNALIGNED_LOAD_OPTIMAL,
> > > 	X86_TUNE_AVX256_UNALIGNED_STORE_OPTIMAL): New.
> > > 	(X86_TUNE_ACCUMULATE_OUTGOING_ARGS,
> > X86_TUNE_ALWAYS_FANCY_MATH_387): New.
> > > 	* i386.c (x86_accumulate_outgoing_args,
> > x86_arch_always_fancy_math_387,
> > > 	x86_avx256_split_unaligned_load,
> > x86_avx256_split_unaligned_store):
> > > 	Remove.
> > > 	(ix86_option_override_internal): Update to use tune features
> > instead
> > > 	of variables.
> > > Index: config/i386/i386.h
> > >
> > ==========================================================
> > =========
> > > --- config/i386/i386.h	(revision 203380)
> > > +++ config/i386/i386.h	(working copy)
> > > @@ -1492,13 +1492,26 @@ enum reg_class
> > >     will be computed and placed into the variable `crtl->outgoing_args_size'.
> > >     No space will be pushed onto the stack for each call; instead, the
> > >     function prologue should increase the stack frame size by this amount.
> > > +
> > > +   In 32bit mode enabling argument accumulation results in about 5% code
> > size
> > > +   growth becuase move instructions are less compact than push.  In 64bit
> > > +   mode the difference is less drastic but visible.
> > > +
> > > +   FIXME: Unlike earlier implementations, the size of unwind info seems to
> > > +   actually grouw with accumulation.  Is that because accumulated args
> > > +   unwind info became unnecesarily bloated?
> > >
> > >     64-bit MS ABI seem to require 16 byte alignment everywhere except for
> > > -   function prologue and apilogue.  This is not possible without
> > > -   ACCUMULATE_OUTGOING_ARGS.  */
> > > +   function prologue and epilogue.  This is not possible without
> > > +   ACCUMULATE_OUTGOING_ARGS.
> > > +
> > > +   If stack probes are required, the space used for large function
> > > +   arguments on the stack must also be probed, so enable
> > > +   -maccumulate-outgoing-args so this happens in the prologue.  */
> > >
> > >  #define ACCUMULATE_OUTGOING_ARGS \
> > > -  (TARGET_ACCUMULATE_OUTGOING_ARGS || TARGET_64BIT_MS_ABI)
> > > +  ((TARGET_ACCUMULATE_OUTGOING_ARGS &&
> > optimize_function_for_speed_p (cfun)) \
> > > +   || TARGET_STACK_PROBE || TARGET_64BIT_MS_ABI)
> > >
> > >  /* If defined, a C expression whose value is nonzero when we want to use
> > PUSH
> > >     instructions to pass outgoing arguments.  */
> > > Index: config/i386/x86-tune.def
> > >
> > ==========================================================
> > =========
> > > --- config/i386/x86-tune.def	(revision 203387)
> > > +++ config/i386/x86-tune.def	(working copy)
> > > @@ -18,15 +18,13 @@ a copy of the GCC Runtime Library Except  see the
> > > files COPYING3 and COPYING.RUNTIME respectively.  If not, see
> > > <http://www.gnu.org/licenses/>.  */
> > >
> > > -/* X86_TUNE_USE_LEAVE: Leave does not affect Nocona SPEC2000 results
> > > -   negatively, so enabling for Generic64 seems like good code size
> > > -   tradeoff.  We can't enable it for 32bit generic because it does not
> > > -   work well with PPro base chips.  */
> > > +/* X86_TUNE_USE_LEAVE: Use "leave" instruction in epilogues where it
> > > +fits.  */
> > >  DEF_TUNE (X86_TUNE_USE_LEAVE, "use_leave",
> > >  	  m_386 | m_CORE_ALL | m_K6_GEODE | m_AMD_MULTIPLE |
> > m_GENERIC)
> > >
> > >  /* X86_TUNE_PUSH_MEMORY: Enable generation of "push mem"
> > instructions.
> > > -   Some chips, like 486 and Pentium have problems with these sequences..
> > */
> > > +   Some chips, like 486 and Pentium works faster with separate load
> > > +   and push instructions.  */
> > >  DEF_TUNE (X86_TUNE_PUSH_MEMORY, "push_memory",
> > >            m_386 | m_P4_NOCONA | m_CORE_ALL | m_K6_GEODE |
> > m_AMD_MULTIPLE
> > >            | m_GENERIC)
> > > @@ -210,6 +208,16 @@ DEF_TUNE
> > (X86_TUNE_SSE_UNALIGNED_LOAD_OP  DEF_TUNE
> > (X86_TUNE_SSE_UNALIGNED_STORE_OPTIMAL,
> > "sse_unaligned_store_optimal",
> > >            m_COREI7 | m_BDVER | m_SLM | m_GENERIC)
> > >
> > > +/* X86_TUNE_AVX256_UNALIGNED_LOAD_OPTIMAL: if true, unaligned
> > loads are
> > > +   split.  */
> > > +DEF_TUNE (X86_TUNE_AVX256_UNALIGNED_LOAD_OPTIMAL,
> > "256_unaligned_load_optimal",
> > > +          ~(m_COREI7 | m_GENERIC))
> > > +
> > > +/* X86_TUNE_AVX256_UNALIGNED_STORE_OPTIMAL: if true, unaligned
> > loads are
> > > +   split.  */
> > > +DEF_TUNE (X86_TUNE_AVX256_UNALIGNED_STORE_OPTIMAL,
> > "256_unaligned_load_optimal",
> > > +          ~(m_COREI7 | m_BDVER | m_GENERIC))
> > > +
> > >  /* Use packed single precision instructions where posisble.  I.e. movups
> > instead
> > >     of movupd.  */
> > >  DEF_TUNE (X86_TUNE_SSE_PACKED_SINGLE_INSN_OPTIMAL,
> > "sse_packed_single_insn_optimal", @@ -398,3 +406,24 @@ DEF_TUNE
> > (X86_TUNE_AVOID_MEM_OPND_FOR_CM
> > >     fp converts to destination register.  */  DEF_TUNE
> > (X86_TUNE_SPLIT_MEM_OPND_FOR_FP_CONVERTS,
> > "split_mem_opnd_for_fp_converts",
> > >            m_SLM)
> > > +
> > > +/* X86_TUNE_ACCUMULATE_OUTGOING_ARGS: Allocate stack space for
> > outgoing
> > > +   arguments in prologue/epilogue instead of separately for each call
> > > +   by push/pop instructions.
> > > +   This increase code size by about 5% in 32bit mode, less so in 64bit mode
> > > +   because parameters are passed in registers.  It is considerable
> > > +   win for targets without stack engine that prevents multple push
> > operations
> > > +   to happen in parallel.
> > > +
> > > +   FIXME: the flags is incorrectly enabled for amdfam10, Bulldozer,
> > > +   Bobcat and Generic.  This is because disabling it causes large
> > > +   regression on mgrid due to IRA limitation leading to unecessary
> > > +   use of the frame pointer in 32bit mode.  */ DEF_TUNE
> > > +(X86_TUNE_ACCUMULATE_OUTGOING_ARGS,
> > "accumulate_outgoing_args",
> > > +	  m_PPRO | m_P4_NOCONA | m_ATOM | m_SLM |
> > m_AMD_MULTIPLE |
> > > +m_GENERIC)
> > > +
> > > +/* X86_TUNE_ALWAYS_FANCY_MATH_387: controls use of fancy 387
> > operations,
> > > +   such as fsqrt, fprem, fsin, fcos, fsincos etc.
> > > +   Should be enabled for all targets that always has coprocesor.  */
> > > +DEF_TUNE (X86_TUNE_ALWAYS_FANCY_MATH_387,
> > "always_fancy_math_387",
> > > +          ~(m_386 | m_486))
> > > Index: config/i386/i386.c
> > >
> > ==========================================================
> > =========
> > > --- config/i386/i386.c	(revision 203380)
> > > +++ config/i386/i386.c	(working copy)
> > > @@ -1898,18 +1898,6 @@ static unsigned int initial_ix86_arch_fe
> > >    ~m_386,
> > >  };
> > >
> > > -static const unsigned int x86_accumulate_outgoing_args
> > > -  = m_PPRO | m_P4_NOCONA | m_ATOM | m_SLM | m_AMD_MULTIPLE
> > |
> > > m_GENERIC;
> > > -
> > > -static const unsigned int x86_arch_always_fancy_math_387
> > > -  = m_PENT | m_PPRO | m_P4_NOCONA | m_CORE_ALL | m_ATOM |
> > m_SLM |
> > > m_AMD_MULTIPLE | m_GENERIC;
> > > -
> > > -static const unsigned int x86_avx256_split_unaligned_load
> > > -  = m_COREI7 | m_GENERIC;
> > > -
> > > -static const unsigned int x86_avx256_split_unaligned_store
> > > -  = m_COREI7 | m_BDVER | m_GENERIC;
> > > -
> > >  /* In case the average insn count for single function invocation is
> > >     lower than this constant, emit fast (but longer) prologue and
> > >     epilogue code.  */
> > > @@ -2920,7 +2908,7 @@ static void
> > >  ix86_option_override_internal (bool main_args_p)  {
> > >    int i;
> > > -  unsigned int ix86_arch_mask, ix86_tune_mask;
> > > +  unsigned int ix86_arch_mask;
> > >    const bool ix86_tune_specified = (ix86_tune_string != NULL);
> > >    const char *prefix;
> > >    const char *suffix;
> > > @@ -3673,7 +3661,7 @@ ix86_option_override_internal (bool main
> > >
> > >    /* If the architecture always has an FPU, turn off NO_FANCY_MATH_387,
> > >       since the insns won't need emulation.  */
> > > -  if (x86_arch_always_fancy_math_387 & ix86_arch_mask)
> > > +  if (ix86_tune_features [X86_TUNE_ALWAYS_FANCY_MATH_387])
> > >      target_flags &= ~MASK_NO_FANCY_MATH_387;
> > >
> > >    /* Likewise, if the target doesn't have a 387, or we've specified @@ -
> > 3805,8 +3793,7 @@ ix86_option_override_internal (bool main
> > >  	gcc_unreachable ();
> > >        }
> > >
> > > -  ix86_tune_mask = 1u << ix86_tune;
> > > -  if ((x86_accumulate_outgoing_args & ix86_tune_mask)
> > > +  if (ix86_tune_features [X86_TUNE_ACCUMULATE_OUTGOING_ARGS]
> > >        && !(target_flags_explicit & MASK_ACCUMULATE_OUTGOING_ARGS)
> > >        && !optimize_size)
> > >      target_flags |= MASK_ACCUMULATE_OUTGOING_ARGS; @@ -3946,10
> > +3933,10 @@ ix86_option_override_internal (bool main
> > >        if (flag_expensive_optimizations
> > >  	  && !(target_flags_explicit & MASK_VZEROUPPER))
> > >  	target_flags |= MASK_VZEROUPPER;
> > > -      if ((x86_avx256_split_unaligned_load & ix86_tune_mask)
> > > +      if
> > (!ix86_tune_features[X86_TUNE_SSE_UNALIGNED_LOAD_OPTIMAL]
> > >  	  && !(target_flags_explicit &
> > MASK_AVX256_SPLIT_UNALIGNED_LOAD))
> > >  	target_flags |= MASK_AVX256_SPLIT_UNALIGNED_LOAD;
> > > -      if ((x86_avx256_split_unaligned_store & ix86_tune_mask)
> > > +      if
> > (!ix86_tune_features[X86_TUNE_SSE_UNALIGNED_STORE_OPTIMAL]
> > >  	  && !(target_flags_explicit &
> > MASK_AVX256_SPLIT_UNALIGNED_STORE))
> > >  	target_flags |= MASK_AVX256_SPLIT_UNALIGNED_STORE;
> > >        /* Enable 128-bit AVX instruction generation
Vladimir Makarov Oct. 21, 2013, 2:52 a.m. UTC | #2
On 13-10-19 4:30 PM, Jan Hubicka wrote:
>> Jan,
>>
>> Does this seem reasonable to you?
> Oops, sorry, I missed your email. (I was travelling and I am finishing a paper
> now).
>> Thanks,
>> Igor
>>
>>> -----Original Message-----
>>> From: Zamyatin, Igor
>>> Sent: Tuesday, October 15, 2013 3:48 PM
>>> To: Jan Hubicka
>>> Subject: RE: Honnor ix86_accumulate_outgoing_args again
>>>
>>> Jan,
>>>
>>> Now we have following prologue in, say, phi0 routine in equake
>>>
>>> 0x804aa90 1  push   %ebp
>>> 0x804aa91 2  mov    %esp,%ebp
>>> 0x804aa93 3  sub    $0x18,%esp
>>> 0x804aa96 4  vmovsd 0x80ef7a8,%xmm0
>>> 0x804aa9e 5  vmovsd 0x8(%ebp),%xmm1
>>> 0x804aaa3 6  vcomisd %xmm1,%xmm0   <-- we see big stall somewhere here
>>> or 1-2 instructions above
>>>
>>> While earlier it was
>>>
>>> 0x804abd0 1 sub    $0x2c,%esp
>>> 0x804abd3 2 vmovsd 0x30(%esp),%xmm1
>>> 0x804abd9 3 vmovsd 0x80efcc8,%xmm0
>>> 0x804abe1 4 vcomisd %xmm1,%xmm0
> Thanks for analysis! It is a different benchmark than for bulldozer, but
> apparently same case.  Again we used to eliminate frame pointer here but IRS
> now doesn't Do you see the same regression with -fno-omit-frame-pointer
> -maccumulate-outgoing-args?
>
> I suppose this is a conflict in between the push instruction hanled by stack
> engine and initialization of EBP that isn't.  That would explain why bulldozer
> don't seem to care about this particular benchmark (its stack engine seems to
> have quite different design).
>
> This is a bit sad situation - accumulate-outgoing-args is expensive code size
> wise and it seems we don't really need esp with -mno-accumulate-outgoing-args.
> The non-accumulation code path was mistakely disabled for too long ;(
>
> Vladimir, how much effort do you think it will be to fix the frame pointer
> elimination here?
My guess is a week.  The problem is that I am busy and having some problems
with two small projects right now which I'd like to include in gcc-4.9.

But I think this can still be fixed in stage2, as it is a PR.

Zamyatin, Igor Oct. 22, 2013, 6:52 a.m. UTC | #3
Jan,

Please see my answers below

> -----Original Message-----
> From: Jan Hubicka [mailto:hubicka@ucw.cz]
> Sent: Sunday, October 20, 2013 12:30 AM
> To: Zamyatin, Igor; gcc-patches@gcc.gnu.org; vmakarov@redhat.com
> Cc: 'Jan Hubicka'
> Subject: Re: Honnor ix86_accumulate_outgoing_args again
> 
> > Jan,
> >
> > Does this seem reasonable to you?
> 
> Oops, sorry, I missed your email. (I was travelling and I am finishing a paper
> now).
> >
> > Thanks,
> > Igor
> >
> > > -----Original Message-----
> > > From: Zamyatin, Igor
> > > Sent: Tuesday, October 15, 2013 3:48 PM
> > > To: Jan Hubicka
> > > Subject: RE: Honnor ix86_accumulate_outgoing_args again
> > >
> > > Jan,
> > >
> > > Now we have following prologue in, say, phi0 routine in equake
> > >
> > > 0x804aa90 1  push   %ebp
> > > 0x804aa91 2  mov    %esp,%ebp
> > > 0x804aa93 3  sub    $0x18,%esp
> > > 0x804aa96 4  vmovsd 0x80ef7a8,%xmm0
> > > 0x804aa9e 5  vmovsd 0x8(%ebp),%xmm1
> > > 0x804aaa3 6  vcomisd %xmm1,%xmm0   <-- we see big stall somewhere
> here
> > > or 1-2 instructions above
> > >
> > > While earlier it was
> > >
> > > 0x804abd0 1 sub    $0x2c,%esp
> > > 0x804abd3 2 vmovsd 0x30(%esp),%xmm1
> > > 0x804abd9 3 vmovsd 0x80efcc8,%xmm0
> > > 0x804abe1 4 vcomisd %xmm1,%xmm0
> 
> Thanks for analysis! It is a different benchmark than for bulldozer, but
> apparently same case.  Again we used to eliminate frame pointer here but
> IRS now doesn't Do you see the same regression with -fno-omit-frame-
> pointer -maccumulate-outgoing-args?

No, with these flags the performance is back.

> 
> I suppose this is a conflict in between the push instruction hanled by stack
> engine and initialization of EBP that isn't.  That would explain why bulldozer
> don't seem to care about this particular benchmark (its stack engine seems to
> have quite different design).
> 
> This is a bit sad situation - accumulate-outgoing-args is expensive code size
> wise and it seems we don't really need esp with -mno-accumulate-outgoing-
> args.

Could you please explain this a bit more?

Thanks,
Igor

> The non-accumulation code path was mistakely disabled for too long ;(
> 
> Vladimir, how much effort do you think it will be to fix the frame pointer
> elimination here?
> I can imagine it is a quite tricky case. If so I would suggest adding
> m_CORE_ALL to X86_TUNE_ACCUMULATE_OUTGOING_ARGS with a
> comment explaining the problem and mentioning the regression on equake
> on core and mgrid on Bulldizer and opening an enhancement request for
> this...
> 
> I also wonder if direct ESP use and push/pop instructions are causing so
> noticeable issues, I wonder if we can't "shrink wrap" this into red-zone in the
> 64bit compilation.  It seems that even with -maccumulate-outgoing-args
> pushing the frame allocation as late as possible in the function would be a
> good idea so it is not close to the push/pop/call/ret.
> 
> Honza
> 
> MASK_ACCUMULATE_OUTGOING_ARGS)
> > > >        && !optimize_size)
> > > >      target_flags |= MASK_ACCUMULATE_OUTGOING_ARGS; @@ -
> 3946,10
> > > +3933,10 @@ ix86_option_override_internal (bool main
> > > >        if (flag_expensive_optimizations
> > > >  	  && !(target_flags_explicit & MASK_VZEROUPPER))
> > > >  	target_flags |= MASK_VZEROUPPER;
> > > > -      if ((x86_avx256_split_unaligned_load & ix86_tune_mask)
> > > > +      if
> > > (!ix86_tune_features[X86_TUNE_SSE_UNALIGNED_LOAD_OPTIMAL]
> > > >  	  && !(target_flags_explicit &
> > > MASK_AVX256_SPLIT_UNALIGNED_LOAD))
> > > >  	target_flags |= MASK_AVX256_SPLIT_UNALIGNED_LOAD;
> > > > -      if ((x86_avx256_split_unaligned_store & ix86_tune_mask)
> > > > +      if
> > > (!ix86_tune_features[X86_TUNE_SSE_UNALIGNED_STORE_OPTIMAL]
> > > >  	  && !(target_flags_explicit &
> > > MASK_AVX256_SPLIT_UNALIGNED_STORE))
> > > >  	target_flags |= MASK_AVX256_SPLIT_UNALIGNED_STORE;
> > > >        /* Enable 128-bit AVX instruction generation
Florian Weimer Oct. 31, 2013, 10:28 a.m. UTC | #4
On 10/10/2013 08:40 PM, Jan Hubicka wrote:
> +   In 32bit mode enabling argument accumulation results in about 5% code size
> +   growth becuase move instructions are less compact than push.  In 64bit
> +   mode the difference is less drastic but visible.
> +
> +   FIXME: Unlike earlier implementations, the size of unwind info seems to
> +   actually grouw with accumulation.  Is that because accumulated args
> +   unwind info became unnecesarily bloated?

Several typos: "32bit" "64bit", "becuase", "grouw".  "push." should be 
"pushes.", I think.  I can't parse the question at the end.

Sorry, no comments on the actual code changes. :-/
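The paragraph Florian quotes documents the i386.h change, whose behavioral core is a single extra predicate: accumulation is honored only when the current function is optimized for speed.  A stand-alone C sketch of the new condition; the struct and field names here are illustrative stand-ins for GCC's target flags and optimize_function_for_speed_p, not the real globals:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-in for the bits consulted by the real
   ACCUMULATE_OUTGOING_ARGS macro in config/i386/i386.h.  */
struct x86_opts
{
  bool accumulate_outgoing_args;   /* TARGET_ACCUMULATE_OUTGOING_ARGS */
  bool stack_probe;                /* TARGET_STACK_PROBE */
  bool ms_abi_64;                  /* TARGET_64BIT_MS_ABI */
  bool optimize_for_speed;         /* optimize_function_for_speed_p (cfun) */
};

/* New rule: cold functions fall back to push/pop, except when stack
   probing or the 64-bit MS ABI forces accumulation anyway.  */
static bool
accumulate_outgoing_args_p (const struct x86_opts *o)
{
  return (o->accumulate_outgoing_args && o->optimize_for_speed)
	 || o->stack_probe || o->ms_abi_64;
}
```

The interesting case is the second disjunct: even a cold function keeps accumulation when the ABI or stack probing requires it, which is why the macro cannot simply test optimize_function_for_speed_p alone.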
Zamyatin, Igor Nov. 3, 2013, 9:58 a.m. UTC | #5
So, Jan, what do you think will be the best solution for stage 1?

Thanks,
Igor

> -----Original Message-----
> From: gcc-patches-owner@gcc.gnu.org [mailto:gcc-patches-
> owner@gcc.gnu.org] On Behalf Of Vladimir Makarov
> Sent: Monday, October 21, 2013 6:52 AM
> To: Jan Hubicka; Zamyatin, Igor; gcc-patches@gcc.gnu.org
> Subject: Re: Honnor ix86_accumulate_outgoing_args again
> 
> On 13-10-19 4:30 PM, Jan Hubicka wrote:
> >> Jan,
> >>
> >> Does this seem reasonable to you?
> > Oops, sorry, I missed your email. (I was travelling and I am finishing
> > a paper now).
> >> Thanks,
> >> Igor
> >>
> >>> -----Original Message-----
> >>> From: Zamyatin, Igor
> >>> Sent: Tuesday, October 15, 2013 3:48 PM
> >>> To: Jan Hubicka
> >>> Subject: RE: Honnor ix86_accumulate_outgoing_args again
> >>>
> >>> Jan,
> >>>
> >>> Now we have following prologue in, say, phi0 routine in equake
> >>>
> >>> 0x804aa90 1  push   %ebp
> >>> 0x804aa91 2  mov    %esp,%ebp
> >>> 0x804aa93 3  sub    $0x18,%esp
> >>> 0x804aa96 4  vmovsd 0x80ef7a8,%xmm0
> >>> 0x804aa9e 5  vmovsd 0x8(%ebp),%xmm1
> >>> 0x804aaa3 6  vcomisd %xmm1,%xmm0   <-- we see big stall somewhere
> here
> >>> or 1-2 instructions above
> >>>
> >>> While earlier it was
> >>>
> >>> 0x804abd0 1 sub    $0x2c,%esp
> >>> 0x804abd3 2 vmovsd 0x30(%esp),%xmm1
> >>> 0x804abd9 3 vmovsd 0x80efcc8,%xmm0
> >>> 0x804abe1 4 vcomisd %xmm1,%xmm0
> > Thanks for the analysis! It is a different benchmark than for Bulldozer,
> > but apparently the same case.  Again, we used to eliminate the frame
> > pointer here but IRA now doesn't.  Do you see the same regression with
> > -fno-omit-frame-pointer -maccumulate-outgoing-args?
> >
> > I suppose this is a conflict between the push instruction, handled by
> > the stack engine, and the initialization of EBP, which isn't.  That
> > would explain why Bulldozer doesn't seem to care about this particular
> > benchmark (its stack engine seems to have a quite different design).
> >
> > This is a bit of a sad situation - accumulate-outgoing-args is expensive
> > code-size-wise, and it seems we don't really need esp with
> > -mno-accumulate-outgoing-args.
> > The non-accumulation code path was mistakenly disabled for too long ;(
> >
> > Vladimir, how much effort do you think it will be to fix the frame
> > pointer elimination here?
> My guess is a week.  The problem is that I am busy and having some problems
> with two small projects right now which I'd like to include in gcc-4.9.
> 
> But I think this can still be fixed in stage 2, as it is a PR.
> 
> > I can imagine it is a quite tricky case.  If so, I would suggest adding
> > m_CORE_ALL to X86_TUNE_ACCUMULATE_OUTGOING_ARGS with a comment
> > explaining the problem and mentioning the regression on equake on Core
> > and mgrid on Bulldozer, and opening an enhancement request for this...
> >
> > If direct ESP use and push/pop instructions are causing such noticeable
> > issues, I also wonder whether we can't "shrink wrap" this into the
> > red zone in 64-bit compilation.  It seems that even with
> > -maccumulate-outgoing-args, pushing the frame allocation as late as
> > possible in the function would be a good idea, so it is not close to
> > the push/pop/call/ret.
> >
> >
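For completeness, the kind of function Igor's dump comes from can be approximated in a few lines.  This is a hypothetical reconstruction of the entry of equake's phi0 (the real SPEC source is not part of this thread; the constant and body are illustrative), which is enough to reproduce the prologue difference when compiled with -m32 -mavx and -m[no-]accumulate-outgoing-args:

```c
#include <assert.h>

/* A global the function compares its argument against, mirroring the
   vmovsd-from-memory / vcomisd sequence in Igor's dump.  */
static const double split = 0.5;

double
phi0 (double t)
{
  if (t < split)        /* the compare sits right after the prologue */
    return 2.0 * t;
  return 1.0;
}
```

Per the thread, with accumulation the prologue becomes push %ebp / mov %esp,%ebp / sub, and the compare right after it stalls; without accumulation the prologue is a single sub $0x2c,%esp.  The function itself computes nothing remarkable, which is exactly why the prologue dominates.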
Jakub Jelinek Nov. 12, 2013, 12:18 a.m. UTC | #6
On Thu, Oct 10, 2013 at 08:40:05PM +0200, Jan Hubicka wrote:
> --- config/i386/x86-tune.def	(revision 203387)
> +++ config/i386/x86-tune.def	(working copy)

> +/* X86_TUNE_AVX256_UNALIGNED_LOAD_OPTIMAL: if true, unaligned loads are
> +   split.  */
> +DEF_TUNE (X86_TUNE_AVX256_UNALIGNED_LOAD_OPTIMAL, "256_unaligned_load_optimal", 
> +          ~(m_COREI7 | m_GENERIC))
> +
> +/* X86_TUNE_AVX256_UNALIGNED_STORE_OPTIMAL: if true, unaligned loads are

s/loads/stores/

> +   split.  */
> +DEF_TUNE (X86_TUNE_AVX256_UNALIGNED_STORE_OPTIMAL, "256_unaligned_load_optimal", 
> +          ~(m_COREI7 | m_BDVER | m_GENERIC))

s/load/store/

Also, I wonder if we couldn't improve the generated code for
-mavx2 -mtune=generic or -march=core-avx2 -mtune=generic etc.
- m_GENERIC is included clearly because vmovup{s,d} was really bad
on SandyBridge (am I right here?), but if the ISA includes AVX2, then
the code will not run on that chip at all, so can't we override it?

> @@ -3946,10 +3933,10 @@ ix86_option_override_internal (bool main
>        if (flag_expensive_optimizations
>  	  && !(target_flags_explicit & MASK_VZEROUPPER))
>  	target_flags |= MASK_VZEROUPPER;
> -      if ((x86_avx256_split_unaligned_load & ix86_tune_mask)
> +      if (!ix86_tune_features[X86_TUNE_SSE_UNALIGNED_LOAD_OPTIMAL]

Didn't you mean to use X86_TUNE_AVX256_UNALIGNED_LOAD_OPTIMAL here?

>  	  && !(target_flags_explicit & MASK_AVX256_SPLIT_UNALIGNED_LOAD))
>  	target_flags |= MASK_AVX256_SPLIT_UNALIGNED_LOAD;
> -      if ((x86_avx256_split_unaligned_store & ix86_tune_mask)
> +      if (!ix86_tune_features[X86_TUNE_SSE_UNALIGNED_STORE_OPTIMAL]

And similarly for STORE here?

>  	  && !(target_flags_explicit & MASK_AVX256_SPLIT_UNALIGNED_STORE))
>  	target_flags |= MASK_AVX256_SPLIT_UNALIGNED_STORE;
>        /* Enable 128-bit AVX instruction generation

	Jakub
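The lookups Jakub is pointing at come from the patch's switch from ad-hoc bitmask tests to the ix86_tune_features[] table filled from x86-tune.def.  A stand-alone sketch of that conversion, with illustrative enum values and masks rather than GCC's real ones, shows why `!ix86_tune_features[...]` replaces `(mask & ix86_tune_mask)` only when the tune mask is the complement of the old one:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: a tune enum, per-CPU mask bits, and one feature whose
   DEF_TUNE mask is the complement of the old split-load mask.  */
enum tune { TUNE_386, TUNE_COREI7, TUNE_BDVER, TUNE_GENERIC, TUNE_COUNT };
enum feature { FEAT_AVX256_UNALIGNED_LOAD_OPTIMAL, FEAT_COUNT };

#define m_COREI7  (1u << TUNE_COREI7)
#define m_BDVER   (1u << TUNE_BDVER)
#define m_GENERIC (1u << TUNE_GENERIC)

/* DEF_TUNE (..., ~(m_COREI7 | m_GENERIC)): loads are "optimal" (i.e.
   not split) everywhere except Core i7 and generic tuning.  */
static const unsigned feature_mask[FEAT_COUNT] = {
  ~(m_COREI7 | m_GENERIC),
};

static bool tune_features[FEAT_COUNT];

/* Mirrors what initializing ix86_tune_features from x86-tune.def does:
   each entry records whether the current tune's bit is in the mask.  */
static void
init_tune_features (enum tune t)
{
  for (int f = 0; f < FEAT_COUNT; f++)
    tune_features[f] = (feature_mask[f] & (1u << t)) != 0;
}
```

With this model, `!tune_features[FEAT_AVX256_UNALIGNED_LOAD_OPTIMAL]` is true exactly for Core i7 and generic tuning, matching the old `x86_avx256_split_unaligned_load & ix86_tune_mask` test; using an unrelated feature index (as Jakub notes the patch accidentally does) would select the wrong CPU set.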
diff mbox

Patch

Index: config/i386/i386.h
===================================================================
--- config/i386/i386.h	(revision 203380)
+++ config/i386/i386.h	(working copy)
@@ -1492,13 +1492,26 @@  enum reg_class
    will be computed and placed into the variable `crtl->outgoing_args_size'.
    No space will be pushed onto the stack for each call; instead, the
    function prologue should increase the stack frame size by this amount.  
+
+   In 32-bit mode, enabling argument accumulation results in about 5% code
+   size growth because move instructions are less compact than pushes.  In
+   64-bit mode the difference is less drastic but visible.
+
+   FIXME: Unlike earlier implementations, the size of unwind info seems to
+   actually grow with accumulation.  Is that because the unwind info for
+   accumulated args became unnecessarily bloated?
    
    64-bit MS ABI seem to require 16 byte alignment everywhere except for
-   function prologue and apilogue.  This is not possible without
-   ACCUMULATE_OUTGOING_ARGS.  */
+   function prologue and epilogue.  This is not possible without
+   ACCUMULATE_OUTGOING_ARGS.  
+
+   If stack probes are required, the space used for large function
+   arguments on the stack must also be probed, so enable
+   -maccumulate-outgoing-args so this happens in the prologue.  */
 
 #define ACCUMULATE_OUTGOING_ARGS \
-  (TARGET_ACCUMULATE_OUTGOING_ARGS || TARGET_64BIT_MS_ABI)
+  ((TARGET_ACCUMULATE_OUTGOING_ARGS && optimize_function_for_speed_p (cfun)) \
+   || TARGET_STACK_PROBE || TARGET_64BIT_MS_ABI)
 
 /* If defined, a C expression whose value is nonzero when we want to use PUSH
    instructions to pass outgoing arguments.  */
Index: config/i386/x86-tune.def
===================================================================
--- config/i386/x86-tune.def	(revision 203387)
+++ config/i386/x86-tune.def	(working copy)
@@ -18,15 +18,13 @@  a copy of the GCC Runtime Library Except
 see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
 <http://www.gnu.org/licenses/>.  */
 
-/* X86_TUNE_USE_LEAVE: Leave does not affect Nocona SPEC2000 results
-   negatively, so enabling for Generic64 seems like good code size
-   tradeoff.  We can't enable it for 32bit generic because it does not
-   work well with PPro base chips.  */
+/* X86_TUNE_USE_LEAVE: Use "leave" instruction in epilogues where it fits.  */
 DEF_TUNE (X86_TUNE_USE_LEAVE, "use_leave", 
 	  m_386 | m_CORE_ALL | m_K6_GEODE | m_AMD_MULTIPLE | m_GENERIC)
 
 /* X86_TUNE_PUSH_MEMORY: Enable generation of "push mem" instructions.
-   Some chips, like 486 and Pentium have problems with these sequences.  */
+   Some chips, like the 486 and Pentium, work faster with separate load
+   and push instructions.  */
 DEF_TUNE (X86_TUNE_PUSH_MEMORY, "push_memory", 
           m_386 | m_P4_NOCONA | m_CORE_ALL | m_K6_GEODE | m_AMD_MULTIPLE 
           | m_GENERIC)
@@ -210,6 +208,16 @@  DEF_TUNE (X86_TUNE_SSE_UNALIGNED_LOAD_OP
 DEF_TUNE (X86_TUNE_SSE_UNALIGNED_STORE_OPTIMAL, "sse_unaligned_store_optimal",
           m_COREI7 | m_BDVER | m_SLM | m_GENERIC)
 
+/* X86_TUNE_AVX256_UNALIGNED_LOAD_OPTIMAL: if false, unaligned 256bit loads
+   are split.  */
+DEF_TUNE (X86_TUNE_AVX256_UNALIGNED_LOAD_OPTIMAL, "256_unaligned_load_optimal",
+          ~(m_COREI7 | m_GENERIC))
+
+/* X86_TUNE_AVX256_UNALIGNED_STORE_OPTIMAL: if false, unaligned 256bit stores
+   are split.  */
+DEF_TUNE (X86_TUNE_AVX256_UNALIGNED_STORE_OPTIMAL, "256_unaligned_store_optimal",
+          ~(m_COREI7 | m_BDVER | m_GENERIC))
+
 /* Use packed single precision instructions where posisble.  I.e. movups instead
    of movupd.  */
 DEF_TUNE (X86_TUNE_SSE_PACKED_SINGLE_INSN_OPTIMAL, "sse_packed_single_insn_optimal",
@@ -398,3 +406,24 @@  DEF_TUNE (X86_TUNE_AVOID_MEM_OPND_FOR_CM
    fp converts to destination register.  */
 DEF_TUNE (X86_TUNE_SPLIT_MEM_OPND_FOR_FP_CONVERTS, "split_mem_opnd_for_fp_converts",
           m_SLM)
+
+/* X86_TUNE_ACCUMULATE_OUTGOING_ARGS: Allocate stack space for outgoing
+   arguments in prologue/epilogue instead of separately for each call
+   by push/pop instructions.
+   This increases code size by about 5% in 32-bit mode, less so in 64-bit
+   mode because parameters are passed in registers.  It is a considerable
+   win for targets without a stack engine, where multiple push operations
+   cannot happen in parallel.
+
+   FIXME: the flag is incorrectly enabled for amdfam10, Bulldozer,
+   Bobcat and Generic.  This is because disabling it causes a large
+   regression on mgrid due to an IRA limitation leading to unnecessary
+   use of the frame pointer in 32-bit mode.  */
+DEF_TUNE (X86_TUNE_ACCUMULATE_OUTGOING_ARGS, "accumulate_outgoing_args", 
+	  m_PPRO | m_P4_NOCONA | m_ATOM | m_SLM | m_AMD_MULTIPLE | m_GENERIC)
+
+/* X86_TUNE_ALWAYS_FANCY_MATH_387: controls use of fancy 387 operations,
+   such as fsqrt, fprem, fsin, fcos, fsincos etc.
+   Should be enabled for all targets that always have a coprocessor.  */
+DEF_TUNE (X86_TUNE_ALWAYS_FANCY_MATH_387, "always_fancy_math_387", 
+          ~(m_386 | m_486))
Index: config/i386/i386.c
===================================================================
--- config/i386/i386.c	(revision 203380)
+++ config/i386/i386.c	(working copy)
@@ -1898,18 +1898,6 @@  static unsigned int initial_ix86_arch_fe
   ~m_386,
 };
 
-static const unsigned int x86_accumulate_outgoing_args
-  = m_PPRO | m_P4_NOCONA | m_ATOM | m_SLM | m_AMD_MULTIPLE | m_GENERIC;
-
-static const unsigned int x86_arch_always_fancy_math_387
-  = m_PENT | m_PPRO | m_P4_NOCONA | m_CORE_ALL | m_ATOM | m_SLM | m_AMD_MULTIPLE | m_GENERIC;
-
-static const unsigned int x86_avx256_split_unaligned_load
-  = m_COREI7 | m_GENERIC;
-
-static const unsigned int x86_avx256_split_unaligned_store
-  = m_COREI7 | m_BDVER | m_GENERIC;
-
 /* In case the average insn count for single function invocation is
    lower than this constant, emit fast (but longer) prologue and
    epilogue code.  */
@@ -2920,7 +2908,7 @@  static void
 ix86_option_override_internal (bool main_args_p)
 {
   int i;
-  unsigned int ix86_arch_mask, ix86_tune_mask;
+  unsigned int ix86_arch_mask;
   const bool ix86_tune_specified = (ix86_tune_string != NULL);
   const char *prefix;
   const char *suffix;
@@ -3673,7 +3661,7 @@  ix86_option_override_internal (bool main
 
   /* If the architecture always has an FPU, turn off NO_FANCY_MATH_387,
      since the insns won't need emulation.  */
-  if (x86_arch_always_fancy_math_387 & ix86_arch_mask)
+  if (ix86_tune_features [X86_TUNE_ALWAYS_FANCY_MATH_387])
     target_flags &= ~MASK_NO_FANCY_MATH_387;
 
   /* Likewise, if the target doesn't have a 387, or we've specified
@@ -3805,8 +3793,7 @@  ix86_option_override_internal (bool main
 	gcc_unreachable ();
       }
 
-  ix86_tune_mask = 1u << ix86_tune;
-  if ((x86_accumulate_outgoing_args & ix86_tune_mask)
+  if (ix86_tune_features [X86_TUNE_ACCUMULATE_OUTGOING_ARGS]
       && !(target_flags_explicit & MASK_ACCUMULATE_OUTGOING_ARGS)
       && !optimize_size)
     target_flags |= MASK_ACCUMULATE_OUTGOING_ARGS;
@@ -3946,10 +3933,10 @@  ix86_option_override_internal (bool main
       if (flag_expensive_optimizations
 	  && !(target_flags_explicit & MASK_VZEROUPPER))
 	target_flags |= MASK_VZEROUPPER;
-      if ((x86_avx256_split_unaligned_load & ix86_tune_mask)
+      if (!ix86_tune_features[X86_TUNE_AVX256_UNALIGNED_LOAD_OPTIMAL]
 	  && !(target_flags_explicit & MASK_AVX256_SPLIT_UNALIGNED_LOAD))
 	target_flags |= MASK_AVX256_SPLIT_UNALIGNED_LOAD;
-      if ((x86_avx256_split_unaligned_store & ix86_tune_mask)
+      if (!ix86_tune_features[X86_TUNE_AVX256_UNALIGNED_STORE_OPTIMAL]
 	  && !(target_flags_explicit & MASK_AVX256_SPLIT_UNALIGNED_STORE))
 	target_flags |= MASK_AVX256_SPLIT_UNALIGNED_STORE;
       /* Enable 128-bit AVX instruction generation