diff mbox

Enabling Software Prefetching by Default at -O3

Message ID D4C76825A6780047854A11E93CDE84D02F7743@SAUSEXMBP01.amd.com
State New
Headers show

Commit Message

Fang, Changpeng June 15, 2010, 9:34 p.m. UTC
Hi,

This patch serves as a proposal to turn on software prefetching by default at -O3.

Software prefetching pass has been introduced into gcc for a long time. It has been
observed that the prefetch pass is quite stable now and can noticeably improve
program performance where locality is a concern. As a result, we think it is time
to enable it under -O3 so that we can get the benefit out of it.

We have collected the data that shows the impact of prefetch on cpu2006 performance
 under -O3 with gcc 4.6. There is a ~2% improvement on both integer and floating programs 
and there is no apparent degradation. Programs listed below are those that have at least
(+/-)1% variation:
434.zeusmp (+1.77%), 436.cactusADM (+2.4%), 450.soplex (+1.28%), 
459.GemsFDTD (+5.48%), 470.lbm (+31.84%), 482.sphnix3 (+1.01%), 
458.sjeng (-1.27%), 462.libquantum (+19.23%). 

The patch passed bootstrapping with -O3 -g.  We have observed 2 failues in regression tests:
(1) gcc.dg/torture/stackalign/setjmp-1.c  -O3 -fomit-frame-pointer  (internal compiler error)
This is bug 44503. It is a CFG problem associate with _builtin_prefetch and we have had
a fix for it.
(2) gcc.dg/tree-ssa/loop-19.c scan-tree-dump-times optimized "MEM.(base: &|symbol: )a," 2
With prefetching turned on, the memory count is different. We can easily fix the test case after
this patch is accepted.

We realize that prefetch in gcc still has room for improvement. But we think it is good enough
now to be turned on.

Is it  acceptable to commit this patch?

Thanks,

Changpeng

Comments

H.J. Lu June 15, 2010, 10:03 p.m. UTC | #1
On Tue, Jun 15, 2010 at 2:34 PM, Fang, Changpeng <Changpeng.Fang@amd.com> wrote:
>
> Hi,
>
> This patch serves as a proposal to turn on software prefetching by default at -O3.
>
> Software prefetching pass has been introduced into gcc for a long time. It has been
> observed that the prefetch pass is quite stable now and can noticeably improve
> program performance where locality is a concern. As a result, we think it is time
> to enable it under -O3 so that we can get the benefit out of it.
>
> We have collected the data that shows the impact of prefetch on cpu2006 performance
>  under -O3 with gcc 4.6. There is a ~2% improvement on both integer and floating programs
> and there is no apparent degradation. Programs listed below are those that have at least
> (+/-)1% variation:
> 434.zeusmp (+1.77%), 436.cactusADM (+2.4%), 450.soplex (+1.28%),
> 459.GemsFDTD (+5.48%), 470.lbm (+31.84%), 482.sphnix3 (+1.01%),
> 458.sjeng (-1.27%), 462.libquantum (+19.23%).
>
> The patch passed bootstrapping with -O3 -g.  We have observed 2 failues in regression tests:
> (1) gcc.dg/torture/stackalign/setjmp-1.c  -O3 -fomit-frame-pointer  (internal compiler error)
> This is bug 44503. It is a CFG problem associate with _builtin_prefetch and we have had
> a fix for it.
> (2) gcc.dg/tree-ssa/loop-19.c scan-tree-dump-times optimized "MEM.(base: &|symbol: )a," 2
> With prefetching turned on, the memory count is different. We can easily fix the test case after
> this patch is accepted.
>
> We realize that prefetch in gcc still has room for improvement. But we think it is good enough
> now to be turned on.
>
> Is it  acceptable to commit this patch?
>

Have you measured performance impact on Intel Core i7?
Fang, Changpeng June 15, 2010, 10:10 p.m. UTC | #2
>Have you measured performance impact on Intel Core i7?

Noy yet. I should have mentioned that the performance data I posted in based
on the test on amd-linux64 system and gcc 4.6 r160766. (which includes my
last change to fix tonto regression).

Thanks,

Changpeng
H.J. Lu June 15, 2010, 10:21 p.m. UTC | #3
On Tue, Jun 15, 2010 at 3:10 PM, Fang, Changpeng <Changpeng.Fang@amd.com> wrote:
>>Have you measured performance impact on Intel Core i7?
>
> Noy yet. I should have mentioned that the performance data I posted in based
> on the test on amd-linux64 system and gcc 4.6 r160766. (which includes my
> last change to fix tonto regression).
>

Is that rate or speed? I would like to see numbers on Intel Core i7 if you are
proposing enable it by default for x86.

Thanks.
Fang, Changpeng June 15, 2010, 10:23 p.m. UTC | #4
>On Tue, Jun 15, 2010 at 3:10 PM, Fang, Changpeng <Changpeng.Fang@amd.com> wrote:
>>>Have you measured performance impact on Intel Core i7?
>>
>> Noy yet. I should have mentioned that the performance data I posted in based
>> on the test on amd-linux64 system and gcc 4.6 r160766. (which includes my
>> last change to fix tonto regression).
>>

>Is that rate or speed? I would like to see numbers on Intel Core i7 if you are
>proposing enable it by default for x86.

 Speed run. Hopefully prefetching could benefit all targets that support sw prefetch.

Thanks,

Changpeng
Mark Mitchell June 15, 2010, 10:47 p.m. UTC | #5
Fang, Changpeng wrote:

>> Is that rate or speed? I would like to see numbers on Intel Core i7 if you are
>> proposing enable it by default for x86.

HJ, are you in position to test this on Core i7?  Since you work at
Intel, it seems like it might be easier for you to do it than for the
folks at AMD to do it?

Maxim, is this something that you could easily test on some of our
embedded hardware, assuming that the machine descriptions define the
relevant prefetch instructions?  For example, I think that ARM, MIPS,
and Power all have theses instructions in the ISA, and we could look at
CoreMark numbers to at least convince ourselves that this does no harm.

Thanks,
H.J. Lu June 15, 2010, 10:55 p.m. UTC | #6
On Tue, Jun 15, 2010 at 3:47 PM, Mark Mitchell <mark@codesourcery.com> wrote:
> Fang, Changpeng wrote:
>
>>> Is that rate or speed? I would like to see numbers on Intel Core i7 if you are
>>> proposing enable it by default for x86.
>
> HJ, are you in position to test this on Core i7?  Since you work at
> Intel, it seems like it might be easier for you to do it than for the
> folks at AMD to do it?

We never tried software prefetch since Intel processors rarely need
it.  We will try it. It will take some times.
Mark Mitchell June 15, 2010, 11:03 p.m. UTC | #7
H.J. Lu wrote:

> We never tried software prefetch since Intel processors rarely need
> it.  We will try it. It will take some times.

OK.

The other thing we can certainly do, once we get some input about how
well it does on various architectures, is to enable it conditionally.
There is no reason that -O3 has to enable the same things on all
architectures.  If prefetching is a win on AMD CPUs, then -O3 should
enable it on AMD CPUs -- but if it hurts Intel CPUs, or MIPS CPUs, or
whatever, then it shouldn't be enabled on those architectures.

So, let's see how generalizable Changpeng's numbers are to other
architectures, and then we can decide where the setting goes.  I think
there's no question from the numbers that -O3 should enable this
optimization on AMD CPUs.

Thanks,
Christian Borntraeger June 18, 2010, 8:02 a.m. UTC | #8
Am Mittwoch 16 Juni 2010, 01:03:20 schrieb Mark Mitchell:
> H.J. Lu wrote:
> 
> > We never tried software prefetch since Intel processors rarely need
> > it.  We will try it. It will take some times.
> 
> OK.
> 
> The other thing we can certainly do, once we get some input about how
> well it does on various architectures, is to enable it conditionally.
> There is no reason that -O3 has to enable the same things on all
> architectures.  If prefetching is a win on AMD CPUs, then -O3 should
> enable it on AMD CPUs -- but if it hurts Intel CPUs, or MIPS CPUs, or
> whatever, then it shouldn't be enabled on those architectures.
> 
> So, let's see how generalizable Changpeng's numbers are to other
> architectures, and then we can decide where the setting goes.  I think
> there's no question from the numbers that -O3 should enable this
> optimization on AMD CPUs.

We have also seen an overall win (around +1% for int and float) with
-fprefetch-loop-arrays on s390 and activated it on O3 (see config/s390/s390.c
override_options) for 4.6.
So we (s390) are fine with enabling loop prefetching in general for O3.

Changpeng, can you confirm that -O3 -fno-prefetch-loop-arrays does indeed
deactivate prefetching?

Christian
Andreas Krebbel June 18, 2010, 8:38 a.m. UTC | #9
Hi,

> This patch serves as a proposal to turn on software prefetching by default at -O3.

On S/390 I ran into a problem with the runtime of the prefetching pass when using
aggressive loop unrolling.  There is an algorithm with quadratic runtime regarding the
memory references which causes a very long compile time.  Christian already opened a
bugzilla for that (PR 44576). It probably would make sense to have a closer look at this
before enabling the pass by default. Perhaps we could simply limit the miss rate
computation to a certain amount of memory references?

Besides of that enabling the prefetching pass by default is ok with me. I think it would
help getting more people into improving it.

Bye,

-Andreas-
Andreas Krebbel June 18, 2010, 8:58 a.m. UTC | #10
Hi,

> This patch serves as a proposal to turn on software prefetching by default at -O3.

We probably also should remove these warnings if -fprefetch-loop-arrays hasn't been
specified explicitly:

#ifndef HAVE_prefetch
  if (flag_prefetch_loop_arrays)
    {
      warning (0, "-fprefetch-loop-arrays not supported for this target");
      flag_prefetch_loop_arrays = 0;
    }
#else
  if (flag_prefetch_loop_arrays && !HAVE_prefetch)
    {
      warning (0, "-fprefetch-loop-arrays not supported for this target (try -march
switches)");
      flag_prefetch_loop_arrays = 0;
    }
#endif

Bye,

-Andreas-
Fang, Changpeng June 18, 2010, 4:44 p.m. UTC | #11

H.J. Lu June 18, 2010, 10:07 p.m. UTC | #12
On Tue, Jun 15, 2010 at 3:55 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
> On Tue, Jun 15, 2010 at 3:47 PM, Mark Mitchell <mark@codesourcery.com> wrote:
>> Fang, Changpeng wrote:
>>
>>>> Is that rate or speed? I would like to see numbers on Intel Core i7 if you are
>>>> proposing enable it by default for x86.
>>
>> HJ, are you in position to test this on Core i7?  Since you work at
>> Intel, it seems like it might be easier for you to do it than for the
>> folks at AMD to do it?
>
> We never tried software prefetch since Intel processors rarely need
> it.  We will try it. It will take some times.
>

Here are what we got on Intel Core i7 64bit with -O3 vs -O3
-fprefetch-loop-arrays:

400.perlbench 			 -0.369004%
401.bzip2 			 -3%
403.gcc 			 -1.5748%
429.mcf 			 0.784314%
445.gobmk 			 -1.28755%
456.hmmer 			 -1.67364%
458.sjeng 			 0%
462.libquantum 			 2.57827%
464.h264ref 			 -0.806452%
471.omnetpp 			 -1.51515%
473.astar 			 -0.581395%
483.xalancbmk 			 0%
SPECint(R)_base2006 			 -0.766284%
410.bwaves 			 -1.27796%
416.gamess 			 -1.2605%
433.milc 			 0%
434.zeusmp 			 1.24481%
435.gromacs 			 -0.478469%
436.cactusADM 			 -5.07813%
437.leslie3d 			 -1.10294%
444.namd 			 0%
450.soplex 			 0.628931%
453.povray 			 0.392157%
454.calculix 			 0%
459.GemsFDTD 			 -1.51515%
465.tonto 			 0%
470.lbm 			 -1.16279%
481.wrf 			 -0.442478%
482.sphinx3 			 2.51397%
SPECfp(R)_base2006 			 -0.769231%

It doesn't help Intel Core i7.  I think it should be enabled
with -mtune=XXX on x86 where prefetch improves
performance on XXX.
Christian Borntraeger June 19, 2010, 7:04 p.m. UTC | #13
Am Samstag 19 Juni 2010, 00:07:07 schrieb H.J. Lu:
> > We never tried software prefetch since Intel processors rarely need
> > it.  We will try it. It will take some times.
> >
> 
> Here are what we got on Intel Core i7 64bit with -O3 vs -O3
> -fprefetch-loop-arrays:
> 
> 400.perlbench 			 -0.369004%
> 401.bzip2 			 -3%
> 403.gcc 			 -1.5748%
> 429.mcf 			 0.784314%
> 445.gobmk 			 -1.28755%
> 456.hmmer 			 -1.67364%
> 458.sjeng 			 0%
> 462.libquantum 			 2.57827%
> 464.h264ref 			 -0.806452%
> 471.omnetpp 			 -1.51515%
> 473.astar 			 -0.581395%
> 483.xalancbmk 			 0%
> SPECint(R)_base2006 			 -0.766284%
> 410.bwaves 			 -1.27796%
> 416.gamess 			 -1.2605%
> 433.milc 			 0%
> 434.zeusmp 			 1.24481%
> 435.gromacs 			 -0.478469%
> 436.cactusADM 			 -5.07813%
> 437.leslie3d 			 -1.10294%
> 444.namd 			 0%
> 450.soplex 			 0.628931%
> 453.povray 			 0.392157%
> 454.calculix 			 0%
> 459.GemsFDTD 			 -1.51515%
> 465.tonto 			 0%
> 470.lbm 			 -1.16279%
> 481.wrf 			 -0.442478%
> 482.sphinx3 			 2.51397%
> SPECfp(R)_base2006 			 -0.769231%
> 
> It doesn't help Intel Core i7.  I think it should be enabled
> with -mtune=XXX on x86 where prefetch improves
> performance on XXX.

It also might be worth to investigate if overriding the parameters per
-mtune=XXX results in an overall win for -fprefetch-loop-arrays. We did
that on s390 since the default values were not ideal:
e.g. we have several of these
[...]
  if (!PARAM_SET_P (PARAM_PREFETCH_MIN_INSN_TO_MEM_RATIO))
    set_param_value ("prefetch-min-insn-to-mem-ratio", 2);
  if (!PARAM_SET_P (PARAM_SIMULTANEOUS_PREFETCHES))
    set_param_value ("simultaneous-prefetches", 6);
[...]
in override_options.

e.g. if software prefetch is expensive you could make it happen less often
for core i7 or vice versa.

Christian
Mark Mitchell June 19, 2010, 8:05 p.m. UTC | #14
Christian Borntraeger wrote:

> It also might be worth to investigate if overriding the parameters per
> -mtune=XXX results in an overall win for -fprefetch-loop-arrays. We did
> that on s390 since the default values were not ideal

Yes, that might be a good idea for i7.

But, in the meantime, I think we should get a version of the patch that
turns on prefetching on AMD CPUs with -O3.  There's no reason to demand
consistency for all CPUs and it clearly benefits the AMD CPUs.
Changpeng, would you please submit a patch that activates this
optimization only with tuning for AMD CPUs?

Thanks,
Andi Kleen June 19, 2010, 8:11 p.m. UTC | #15
Mark Mitchell <mark@codesourcery.com> writes:
>
> Yes, that might be a good idea for i7.
>
> But, in the meantime, I think we should get a version of the patch that
> turns on prefetching on AMD CPUs with -O3.  There's no reason to demand
> consistency for all CPUs and it clearly benefits the AMD CPUs.
> Changpeng, would you please submit a patch that activates this
> optimization only with tuning for AMD CPUs?

On i7 it will be likely still a win if hardware prefetch is disabled
in the BIOS. Most BIOS have a setting like this.

Unfortunately this cannot be easily detected by a program.

-Andi
Fang, Changpeng June 22, 2010, 10:05 p.m. UTC | #16
Hi, Mark:

>> It also might be worth to investigate if overriding the parameters per
>> -mtune=XXX results in an overall win for -fprefetch-loop-arrays. We did
>> that on s390 since the default values were not ideal

>Yes, that might be a good idea for i7.

>But, in the meantime, I think we should get a version of the patch that
>turns on prefetching on AMD CPUs with -O3.  There's no reason to demand
>consistency for all CPUs and it clearly benefits the AMD CPUs.
>Changpeng, would you please submit a patch that activates this
>optimization only with tuning for AMD CPUs?

It seems I have a problem in setting prefetching default at -O3 only for AMD CPUs.
When we know it is AMD CPUS after the command line options have been parsed,
we don't know whether -fno-prefetch-loop-arrays has been specified in the command line
or not. If we turn on prefetching at -O3, then we could not explicitly turn it off if we want.

What I want is something like:
 if  (!OPTION_SET_P (flag_prefetch_loop_arrays))
   flag_prefetch_loop_arrays = 1;

An alternative way is parsing -m options before other options.

What do you think?

Thanks,

Changpeng
Christian Borntraeger June 23, 2010, 6:05 a.m. UTC | #17
Am Mittwoch 23 Juni 2010, 00:05:57 schrieb Fang, Changpeng:
> Hi, Mark:
> 
> >> It also might be worth to investigate if overriding the parameters per
> >> -mtune=XXX results in an overall win for -fprefetch-loop-arrays. We did
> >> that on s390 since the default values were not ideal
> 
> >Yes, that might be a good idea for i7.
> 
> >But, in the meantime, I think we should get a version of the patch that
> >turns on prefetching on AMD CPUs with -O3.  There's no reason to demand
> >consistency for all CPUs and it clearly benefits the AMD CPUs.
> >Changpeng, would you please submit a patch that activates this
> >optimization only with tuning for AMD CPUs?
> 
> It seems I have a problem in setting prefetching default at -O3 only for AMD CPUs.
> When we know it is AMD CPUS after the command line options have been parsed,
> we don't know whether -fno-prefetch-loop-arrays has been specified in the command line
> or not. If we turn on prefetching at -O3, then we could not explicitly turn it off if we want.
> 
> What I want is something like:
>  if  (!OPTION_SET_P (flag_prefetch_loop_arrays))
>    flag_prefetch_loop_arrays = 1;

I think having an OPTION_SET_P  (or maybe name that FLAG_SET_P) makes a lot of sense and
would match the PARAM_SET_P way of doing things.

Christian
Richard Biener June 23, 2010, 8:50 a.m. UTC | #18
On Wed, 23 Jun 2010, Christian Borntraeger wrote:

> Am Mittwoch 23 Juni 2010, 00:05:57 schrieb Fang, Changpeng:
> > Hi, Mark:
> > 
> > >> It also might be worth to investigate if overriding the parameters per
> > >> -mtune=XXX results in an overall win for -fprefetch-loop-arrays. We did
> > >> that on s390 since the default values were not ideal
> > 
> > >Yes, that might be a good idea for i7.
> > 
> > >But, in the meantime, I think we should get a version of the patch that
> > >turns on prefetching on AMD CPUs with -O3.  There's no reason to demand
> > >consistency for all CPUs and it clearly benefits the AMD CPUs.
> > >Changpeng, would you please submit a patch that activates this
> > >optimization only with tuning for AMD CPUs?
> > 
> > It seems I have a problem in setting prefetching default at -O3 only for AMD CPUs.
> > When we know it is AMD CPUS after the command line options have been parsed,
> > we don't know whether -fno-prefetch-loop-arrays has been specified in the command line
> > or not. If we turn on prefetching at -O3, then we could not explicitly turn it off if we want.
> > 
> > What I want is something like:
> >  if  (!OPTION_SET_P (flag_prefetch_loop_arrays))
> >    flag_prefetch_loop_arrays = 1;
> 
> I think having an OPTION_SET_P  (or maybe name that FLAG_SET_P) makes a lot of sense and
> would match the PARAM_SET_P way of doing things.

We use tri-states for this elsewhere in the compiler.

Richard.
Joseph Myers June 23, 2010, 11:22 a.m. UTC | #19
On Wed, 23 Jun 2010, Richard Guenther wrote:

> > > What I want is something like:
> > >  if  (!OPTION_SET_P (flag_prefetch_loop_arrays))
> > >    flag_prefetch_loop_arrays = 1;
> > 
> > I think having an OPTION_SET_P  (or maybe name that FLAG_SET_P) makes a lot of sense and
> > would match the PARAM_SET_P way of doing things.
> 
> We use tri-states for this elsewhere in the compiler.

We use a mixture of tri-states and separate *_set variables.  When my 
patch series for option handling and multilib selection reaches the point 
of an options structure replacing separate variables for each option, I 
expect to use a separate such structure to record which options have been 
set explicitly.  That is:

  if (!global_options_explicit.opt.prefetch_loop_arrays)
    global_options.opt.prefetch_loop_arrays = 1;

(but hopefully using pointers to the relevant structures rather than 
hardcoding global_options and global_options_explicit).
Christian Borntraeger June 23, 2010, 12:10 p.m. UTC | #20
Am Mittwoch 23 Juni 2010, 10:50:35 schrieb Richard Guenther:
> > > What I want is something like:
> > >  if  (!OPTION_SET_P (flag_prefetch_loop_arrays))
> > >    flag_prefetch_loop_arrays = 1;
> > 
> > I think having an OPTION_SET_P  (or maybe name that FLAG_SET_P) makes a lot of sense and
> > would match the PARAM_SET_P way of doing things.
> 
> We use tri-states for this elsewhere in the compiler.

So something like the following unfinished patch 
would be the (current) way to go?

Christian

Index: gcc/common.opt
===================================================================
*** gcc/common.opt.orig
--- gcc/common.opt
*************** Common Report Var(flag_predictive_common
*** 945,951 ****
  Run predictive commoning optimization.
  
  fprefetch-loop-arrays
! Common Report Var(flag_prefetch_loop_arrays) Optimization
  Generate prefetch instructions, if available, for arrays in loops
  
  fprofile
--- 945,951 ----
  Run predictive commoning optimization.
  
  fprefetch-loop-arrays
! Common Report Var(flag_prefetch_loop_arrays) Init(2) Optimization
  Generate prefetch instructions, if available, for arrays in loops
  
  fprofile
Index: gcc/config/s390/s390.c
===================================================================
*** gcc/config/s390/s390.c.orig
--- gcc/config/s390/s390.c
*************** override_options (void)
*** 1676,1682 ****
  
    /* This cannot reside in optimization_options since HAVE_prefetch
       requires the arch flags to be evaluated already.  */
!   if (HAVE_prefetch && optimize >= 3)
      flag_prefetch_loop_arrays = 1;
  }
  
--- 1676,1682 ----
  
    /* This cannot reside in optimization_options since HAVE_prefetch
       requires the arch flags to be evaluated already.  */
!   if (HAVE_prefetch && optimize >= 3 && flag_prefetch_loop_arrays == 2)
      flag_prefetch_loop_arrays = 1;
  }
  
Index: gcc/toplev.c
===================================================================
*** gcc/toplev.c.orig
--- gcc/toplev.c
*************** process_options (void)
*** 2048,2053 ****
--- 2048,2056 ----
      }
  
  #ifndef HAVE_prefetch
+   if (flag_prefetch_loop_arrays == 2)
+     flag_prefetch_loop_arrays = 0;
+ 
    if (flag_prefetch_loop_arrays)
      {
        warning (0, "-fprefetch-loop-arrays not supported for this target");
Manuel López-Ibáñez June 23, 2010, 12:29 p.m. UTC | #21
>
>  fprefetch-loop-arrays
> ! Common Report Var(flag_prefetch_loop_arrays) Init(2) Optimization

For consistency with other options, I would suggest to use Init(-1)
for an "uninitialized" option. As Joseph said, this should be done
differently in the future, but for now using -1 is the usual setting.
I didn't understand how (or if) Joseph will implement the framework to
set default values. As I understand it, Init(X) should only set the
value X to the Var if it is not set by anything else before. Then, we
can move the default values back to the *.opt files. Otherwise, such a
patch would be an improvement.

Cheers,

Manuel.
diff mbox

Patch

From 50ef9b1dd700ace9854f88814488b7807fcbae1c Mon Sep 17 00:00:00 2001
From: Changpeng Fang <chfang@pathscale.(none)>
Date: Thu, 10 Jun 2010 14:52:15 -0700
Subject: [PATCH 2/2] Enable -fprefetch-loop-arrays under -O3

	*opts.c (decode_options): Turn on flag_prefetch_loop_arrays
	under -O3.

	*doc/invoke.texi(@item -O3): Say that -O3 includes -fprefetch-loop-arrays.
	(@item -fprefetch-loop-arrays): Say that -fprefetch-loop-arrays
	is enabled by default under -O3.
---
 gcc/doc/invoke.texi |    6 ++++--
 gcc/opts.c          |    1 +
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index d8c0c22..1c28a91 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -5883,7 +5883,8 @@  invoking @option{-O2} on programs that use computed gotos.
 Optimize yet more.  @option{-O3} turns on all optimizations specified
 by @option{-O2} and also turns on the @option{-finline-functions},
 @option{-funswitch-loops}, @option{-fpredictive-commoning},
-@option{-fgcse-after-reload} and @option{-ftree-vectorize} options.
+@option{-fgcse-after-reload}, @option{-ftree-vectorize} and
+@option{-fprefetch-loop-arrays} options.
 
 @item -O0
 @opindex O0
@@ -7037,7 +7038,8 @@  memory to improve the performance of loops that access large arrays.
 This option may generate better or worse code; results are highly
 dependent on the structure of loops within the source code.
 
-Disabled at level @option{-Os}.
+This flag is enabled by default at @option{-O3} and disabled at
+level @option{-Os}.
 
 @item -fno-peephole
 @itemx -fno-peephole2
diff --git a/gcc/opts.c b/gcc/opts.c
index 8699ec3..7814341 100644
--- a/gcc/opts.c
+++ b/gcc/opts.c
@@ -951,6 +951,7 @@  decode_options (unsigned int argc, const char **argv)
   flag_unswitch_loops = opt3;
   flag_gcse_after_reload = opt3;
   flag_tree_vectorize = opt3;
+  flag_prefetch_loop_arrays = opt3;
   flag_ipa_cp_clone = opt3;
   if (flag_ipa_cp_clone)
     flag_ipa_cp = 1;
-- 
1.6.3.3