diff mbox

Enabling Software Prefetching by Default at -O3

Message ID D4C76825A6780047854A11E93CDE84D02F7750@SAUSEXMBP01.amd.com
State New
Headers show

Commit Message

Fang, Changpeng June 24, 2010, 9:18 p.m. UTC
Hi, 

Attached is the version of the patch that turns prefetching on at -O3 for AMD cpus
only. As discussed elsewhere in this thread, we use tri-state for -fprefetch-loop-arrays.
If this flag is not explicitly set,  (for -O3) we turn it on in gcc/config/i386/i386.c 
(override_options).

Is this OK to commit now?

Thanks, 

Changpeng

Comments

Richard Biener June 25, 2010, 9:37 a.m. UTC | #1
On Thu, 24 Jun 2010, Fang, Changpeng wrote:

> Hi, 
> 
> Attached is the version of the patch that turns prefetching on at -O3 for AMD cpus
> only. As discussed elsewhere in this thread, we use tri-state for -fprefetch-loop-arrays.
> If this flag is not explicitly set,  (for -O3) we turn it on in gcc/config/i386/i386.c 
> (override_options).
> 
> Is this OK to commit now?

Ok.

Thanks,
Richard.

> 
> Thanks, 
> 
> Changpeng
> 
> ________________________________________
> From: Mark Mitchell [mark@codesourcery.com]
> Sent: Saturday, June 19, 2010 3:05 PM
> To: Christian Borntraeger
> Cc: gcc-patches@gcc.gnu.org; H.J. Lu; Fang, Changpeng; rguenther@suse.de; sebpop@gmail.com; Zdenek Dvorak; Maxim Kuvyrkov
> Subject: Re: [PATCH] Enabling Software Prefetching by Default at -O3
> 
> Christian Borntraeger wrote:
> 
> > It also might be worth to investigate if overriding the parameters per
> > -mtune=XXX results in an overall win for -fprefetch-loop-arrays. We did
> > that on s390 since the default values were not ideal
> 
> Yes, that might be a good idea for i7.
> 
> But, in the meantime, I think we should get a version of the patch that
> turns on prefetching on AMD CPUs with -O3.  There's no reason to demand
> consistency for all CPUs and it clearly benefits the AMD CPUs.
> Changpeng, would you please submit a patch that activates this
> optimization only with tuning for AMD CPUs?
> 
> Thanks,
> 
> --
> Mark Mitchell
> CodeSourcery
> mark@codesourcery.com
> (650) 331-3385 x713
> 
>
Sebastian Pop June 25, 2010, 6:25 p.m. UTC | #2
On Fri, Jun 25, 2010 at 04:37, Richard Guenther <rguenther@suse.de> wrote:
> On Thu, 24 Jun 2010, Fang, Changpeng wrote:
>
>> Hi,
>>
>> Attached is the version of the patch that turns prefetching on at -O3 for AMD cpus
>> only. As discussed elsewhere in this thread, we use tri-state for -fprefetch-loop-arrays.
>> If this flag is not explicitly set,  (for -O3) we turn it on in gcc/config/i386/i386.c
>> (override_options).
>>
>> Is this OK to commit now?
>
> Ok.
>

Committed r161391.
Richard Biener June 26, 2010, 2:21 p.m. UTC | #3
On Fri, Jun 25, 2010 at 8:25 PM, Sebastian Pop <sebpop@gmail.com> wrote:
> On Fri, Jun 25, 2010 at 04:37, Richard Guenther <rguenther@suse.de> wrote:
>> On Thu, 24 Jun 2010, Fang, Changpeng wrote:
>>
>>> Hi,
>>>
>>> Attached is the version of the patch that turns prefetching on at -O3 for AMD cpus
>>> only. As discussed elsewhere in this thread, we use tri-state for -fprefetch-loop-arrays.
>>> If this flag is not explicitly set,  (for -O3) we turn it on in gcc/config/i386/i386.c
>>> (override_options).
>>>
>>> Is this OK to commit now?
>>
>> Ok.
>>
>
> Committed r161391.

This may have regressed scimark sparse matmult by 20% on AMD Fam8
at -O3 -ffast-math -funroll-loops -march=native.

Richard.
Richard Biener June 26, 2010, 2:50 p.m. UTC | #4
On Sat, Jun 26, 2010 at 4:21 PM, Richard Guenther
<richard.guenther@gmail.com> wrote:
> On Fri, Jun 25, 2010 at 8:25 PM, Sebastian Pop <sebpop@gmail.com> wrote:
>> On Fri, Jun 25, 2010 at 04:37, Richard Guenther <rguenther@suse.de> wrote:
>>> On Thu, 24 Jun 2010, Fang, Changpeng wrote:
>>>
>>>> Hi,
>>>>
>>>> Attached is the version of the patch that turns prefetching on at -O3 for AMD cpus
>>>> only. As discussed elsewhere in this thread, we use tri-state for -fprefetch-loop-arrays.
>>>> If this flag is not explicitly set,  (for -O3) we turn it on in gcc/config/i386/i386.c
>>>> (override_options).
>>>>
>>>> Is this OK to commit now?
>>>
>>> Ok.
>>>
>>
>> Committed r161391.
>
> This may have regressed scimark sparse matmult by 20% on AMD Fam8
> at -O3 -ffast-math -funroll-loops -march=native.

It also looks like prefetching has a very negative impact on SPEC CPU 2000 FP
in 32bit mode: http://gcc.opensuse.org/SPEC/CFP/sb-frescobaldi.suse.de-head-64-32o-32bit/recent.html
(that's Fam10).

Richard.
Jan Hubicka June 26, 2010, 2:55 p.m. UTC | #5
> On Fri, Jun 25, 2010 at 8:25 PM, Sebastian Pop <sebpop@gmail.com> wrote:
> > On Fri, Jun 25, 2010 at 04:37, Richard Guenther <rguenther@suse.de> wrote:
> >> On Thu, 24 Jun 2010, Fang, Changpeng wrote:
> >>
> >>> Hi,
> >>>
> >>> Attached is the version of the patch that turns prefetching on at -O3 for AMD cpus
> >>> only. As discussed elsewhere in this thread, we use tri-state for -fprefetch-loop-arrays.
> >>> If this flag is not explicitly set,  (for -O3) we turn it on in gcc/config/i386/i386.c
> >>> (override_options).
> >>>
> >>> Is this OK to commit now?
> >>
> >> Ok.
> >>
> >
> > Committed r161391.
> 
> This may have regressed scimark sparse matmult by 20% on AMD Fam8
> at -O3 -ffast-math -funroll-loops -march=native.
... and also about 2.6% SPECFP regression along with 8% code size growth in 32bit mode
http://gcc.opensuse.org/SPEC/CFP/sb-frescobaldi.suse.de-head-64-32o-32bit/recent.html
http://gcc.opensuse.org/SPEC/CFP/sb-frescobaldi.suse.de-head-64-32o-32bit/size.html

There was other changes today that might've caused the trouble.  In particular
my partial inlinig, Sebastians's ifcvt, Martin's ipa-cp improvements and Bernd's RA
changes.

Frescobaldi and Vangelish both test scimark and Frescobaldi does not show the regression
yet, tought it picked changes including my ipa-split code.  We might conclude that it
must be later patch that would leave Sebastian's and Martin's changes.  Those two seem
however unlikely candidates for such a flat differences everywhere IMO. 

I tested partial inlining in the form comitted to mainline on Frescobaldi on SPECint2000
and C++ benchmarks couple days ago and it dod not show the regressions, however I did
not test 32bit.  The change is quite target independent so I would be surprised if it
made 32bit code a lot worse and did not affect 64bit.

Some changes are also visible at our Itanium testers.

Hopefully we can resolve the slowdowns.  I also think that the change should be documented
in changes.html as well as we should enable prefetching with -fprofile-use
(and ensure that prefetcher uses profile info to figure out if loop has resonable trip
count for prefetching and is hot)

Honza
> 
> Richard.
Jan Hubicka June 26, 2010, 3:04 p.m. UTC | #6
> > On Fri, Jun 25, 2010 at 8:25 PM, Sebastian Pop <sebpop@gmail.com> wrote:
> > > On Fri, Jun 25, 2010 at 04:37, Richard Guenther <rguenther@suse.de> wrote:
> > >> On Thu, 24 Jun 2010, Fang, Changpeng wrote:
> > >>
> > >>> Hi,
> > >>>
> > >>> Attached is the version of the patch that turns prefetching on at -O3 for AMD cpus
> > >>> only. As discussed elsewhere in this thread, we use tri-state for -fprefetch-loop-arrays.
> > >>> If this flag is not explicitly set,  (for -O3) we turn it on in gcc/config/i386/i386.c
> > >>> (override_options).
> > >>>
> > >>> Is this OK to commit now?
> > >>
> > >> Ok.
> > >>
> > >
> > > Committed r161391.
> > 
> > This may have regressed scimark sparse matmult by 20% on AMD Fam8
> > at -O3 -ffast-math -funroll-loops -march=native.
> ... and also about 2.6% SPECFP regression along with 8% code size growth in 32bit mode
> http://gcc.opensuse.org/SPEC/CFP/sb-frescobaldi.suse.de-head-64-32o-32bit/recent.html
> http://gcc.opensuse.org/SPEC/CFP/sb-frescobaldi.suse.de-head-64-32o-32bit/size.html

I think the reason why we see regression in 32bit run only is that 32bit run use -march=native
while 64bit run uses default arch.  So it indeed looks like prefetching change.

Honza
diff mbox

Patch

From 7f48bc625b0e451dd8c05a3a3cc20f68dcaa695c Mon Sep 17 00:00:00 2001
From: Changpeng Fang <chfang@pathscale.(none)>
Date: Wed, 23 Jun 2010 17:05:59 -0700
Subject: [PATCH 3/3] Enable prefetching at -O3 for AMD cpus

	* gcc/common.opt (fprefetch-loop-arrays): Re-define
	-fprefetch-loop-arrays as a tri-state option with the
	initial value of -1.
	* gcc/tree-ssa-loop.c (gate_tree_ssa_loop_prefetch): Invoke
	prefetch pass only when flag_prefetch_loop_arrays > 0.
	* gcc/toplev.c (process_options): Note that, with tri-states,
	flag_prefetch_loop_arrays>0 means prefetching is enabled.
	* gcc/config/i386/i386.c (override_options): Enable prefetching
	at -O3 for a set of CPUs that sw prefetching is helpful.
	(software_prefetching_beneficial_p): New.  Return TRUE if
	software prefetching is beneficial for the given CPU.
---
 gcc/common.opt         |    2 +-
 gcc/config/i386/i386.c |   27 +++++++++++++++++++++++++++
 gcc/toplev.c           |    6 +++---
 gcc/tree-ssa-loop.c    |    2 +-
 4 files changed, 32 insertions(+), 5 deletions(-)

diff --git a/gcc/common.opt b/gcc/common.opt
index 4904481..74fbd1d 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -937,7 +937,7 @@  Common Report Var(flag_predictive_commoning) Optimization
 Run predictive commoning optimization.
 
 fprefetch-loop-arrays
-Common Report Var(flag_prefetch_loop_arrays) Optimization
+Common Report Var(flag_prefetch_loop_arrays) Init(-1) Optimization
 Generate prefetch instructions, if available, for arrays in loops
 
 fprofile
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 2a46f89..605e57b 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -2691,6 +2691,26 @@  ix86_target_string (int isa, int flags, const char *arch, const char *tune,
   return ret;
 }
 
+/* Return TRUE if software prefetching is beneficial for the
+   given CPU. */
+
+static bool
+software_prefetching_beneficial_p (void)
+{
+  switch (ix86_tune)
+    {
+    case PROCESSOR_GEODE:
+    case PROCESSOR_K6:
+    case PROCESSOR_ATHLON:
+    case PROCESSOR_K8:
+    case PROCESSOR_AMDFAM10:
+      return true;
+
+    default:
+      return false;
+    }
+}
+
 /* Function that is callable from the debugger to print the current
    options.  */
 void
@@ -3531,6 +3551,13 @@  override_options (bool main_args_p)
   if (!PARAM_SET_P (PARAM_L2_CACHE_SIZE))
     set_param_value ("l2-cache-size", ix86_cost->l2_cache_size);
 
+  /* Enable sw prefetching at -O3 for CPUS that prefetching is helpful.  */
+  if (flag_prefetch_loop_arrays < 0
+      && HAVE_prefetch
+      && optimize >= 3
+      && software_prefetching_beneficial_p())
+    flag_prefetch_loop_arrays = 1;
+
   /* If using typedef char *va_list, signal that __builtin_va_start (&ap, 0)
      can be optimized to ap = __builtin_next_arg (0).  */
   if (!TARGET_64BIT)
diff --git a/gcc/toplev.c b/gcc/toplev.c
index ff4c850..369820b 100644
--- a/gcc/toplev.c
+++ b/gcc/toplev.c
@@ -2078,13 +2078,13 @@  process_options (void)
     }
 
 #ifndef HAVE_prefetch
-  if (flag_prefetch_loop_arrays)
+  if (flag_prefetch_loop_arrays > 0)
     {
       warning (0, "-fprefetch-loop-arrays not supported for this target");
       flag_prefetch_loop_arrays = 0;
     }
 #else
-  if (flag_prefetch_loop_arrays && !HAVE_prefetch)
+  if (flag_prefetch_loop_arrays > 0 && !HAVE_prefetch)
     {
       warning (0, "-fprefetch-loop-arrays not supported for this target (try -march switches)");
       flag_prefetch_loop_arrays = 0;
@@ -2093,7 +2093,7 @@  process_options (void)
 
   /* This combination of options isn't handled for i386 targets and doesn't
      make much sense anyway, so don't allow it.  */
-  if (flag_prefetch_loop_arrays && optimize_size)
+  if (flag_prefetch_loop_arrays > 0 && optimize_size)
     {
       warning (0, "-fprefetch-loop-arrays is not supported with -Os");
       flag_prefetch_loop_arrays = 0;
diff --git a/gcc/tree-ssa-loop.c b/gcc/tree-ssa-loop.c
index 344cfa8..c9c5bbd 100644
--- a/gcc/tree-ssa-loop.c
+++ b/gcc/tree-ssa-loop.c
@@ -600,7 +600,7 @@  tree_ssa_loop_prefetch (void)
 static bool
 gate_tree_ssa_loop_prefetch (void)
 {
-  return flag_prefetch_loop_arrays != 0;
+  return flag_prefetch_loop_arrays > 0;
 }
 
 struct gimple_opt_pass pass_loop_prefetch =
-- 
1.6.3.3