Patchwork [Loop,optimizer] : Add logic to disable certain loop optimizations on pre-/post-loops

login
register
mail settings
Submitter Jack Howarth
Date Jan. 4, 2011, 3:04 a.m.
Message ID <20110104030426.GA28556@bromo.med.uc.edu>
Download mbox | patch
Permalink /patch/77372/
State New
Headers show

Comments

Jack Howarth - Jan. 4, 2011, 3:04 a.m.
On Fri, Dec 17, 2010 at 01:14:49AM -0600, Fang, Changpeng wrote:
> Hi, Jack:
> 
> Thanks for the testing.
> 
> This patch is not supposed to slow down a program by 10% (rnflow and test_fpu).
> It would be helpful if you can provide analysis why they are slowed down.

Changpeng,
   The corrected merge against gcc trunk of...


for the proposed patch from http://gcc.gnu.org/ml/gcc-patches/2010-12/msg01344.html
eliminated the performance regressions on x86_64-apple-darwin10. I now get...

Compile Command : gfortran -ffast-math -funroll-loops -O3 %n.f90 -o %n

                   Execution Time
-m32
                 stock       patched  %increase
ac               10.59       10.59      0.0
aermod           19.49       19.13     -1.8
air               6.07        6.07      0.0
capacita         44.60       44.61      0.0
channel           1.98        1.98      0.0
doduc            31.19       31.31      0.4
fatigue           9.90       10.29      3.9
gas_dyn           4.72        4.71     -0.2
induct           13.93       13.93      0.0
linpk            15.50       15.49     -0.1
mdbx             11.28       11.26     -0.2
nf               27.62       27.58     -0.1
protein          38.70       38.60     -0.3
rnflow           24.68       24.68      0.0
test_fpu         10.13       10.13      0.0
tfft              1.92        1.92      0.0

Geometric Mean   12.06       12.08      0.2
Execution Time

-m64
                 stock       patched   %increase
ac                8.80        8.80      0.0
aermod           17.34       17.17     -1.0
air               5.48        5.52      0.7
capacita         32.38       32.50      0.4
channel           1.84        1.84      0.0
doduc            26.50       26.52      0.1
fatigue           8.35        8.33     -0.2
gas_dyn           4.30        4.29     -0.2 
induct           12.83       12.83      0.0
linpk            15.49       15.49      0.0
mdbx             11.23       11.22     -0.1
nf               30.21       30.16     -0.2
protein          34.13       32.07     -6.0
rnflow           23.18       23.19      0.0
test_fpu          8.04        8.02     -0.2
tfft              1.87        1.86     -0.5
 
Geometric Mean   10.87       10.82     -0.5
Execution Time


> 
> We did see a significant compilation time reduction for most pb05 programs.
> (I don't know why you do not have executable size data).
> 
> >I would note that intel darwin now defaults
> >to -mtune=core2 and always has defaulted to -fPIC.
> 
> I could not understand these default for darwin. My understanding is that,
> for x86_64, the default should be -mtune=generic.
> 
> Thanks,
> 
> Changpeng
> 
> ________________________________________
> From: Jack Howarth [howarth@bromo.med.uc.edu]
> Sent: Thursday, December 16, 2010 11:31 PM
> To: Fang, Changpeng
> Cc: Zdenek Dvorak; Richard Guenther; Xinliang David Li; gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops
> 
> On Thu, Dec 16, 2010 at 06:13:52PM -0600, Fang, Changpeng wrote:
> > Hi,
> >
> > Based on previous discussions, I modified the patch as such.
> >
> > If a loop is marked as non-rolling, optimize_loop_for_size_p returns TRUE
> > and optimize_loop_for_speed_p returns FALSE. All users of these two
> > functions will be affected.
> >
> > After applying the modified patch, pb05 compilation time decreases 29%, binary
> > size decreases 20%, while a small (0.5%) performance increase was found which may
> > be just noise.
> >
> > Modified patch passed bootstrapping and gcc regression tests on x86_64-unknown-linux-gnu.
> >
> > Is it OK to commit to trunk?
> 
> Changpeng,
>     On x86_64-apple-darwin10, I am finding a more severe penalty for this patch with
> the pb05 benchmarks. Using a profiledbootstrap BOOT_CFLAGS="-g -O3" build with...
> 
> Configured with: ../gcc-4.6-20101216/configure --prefix=/sw --prefix=/sw/lib/gcc4.6 --mandir=/sw/share/man --infodir=/sw/lib/gcc4.6/info --enable-languages=c,c++,fortran,objc,obj-c++,java --with-gmp=/sw --with-libiconv-prefix=/sw --with-ppl=/sw --with-cloog=/sw --with-mpc=/sw --with-system-zlib --x-includes=/usr/X11R6/include --x-libraries=/usr/X11R6/lib --program-suffix=-fsf-4.6 --enable-checking=yes --enable-cloog-backend=isl --enable-build-with-cxx
> 
> I get without the patch...
> 
> ================================================================================
> Date & Time     : 16 Dec 2010 23:36:27
> Test Name       : gfortran_lin_O3
> Compile Command : gfortran -ffast-math -funroll-loops -O3 %n.f90 -o %n
> Benchmarks      : ac aermod air capacita channel doduc fatigue gas_dyn induct linpk mdbx nf protein rnflow test_fpu tfft
> Maximum Times   :     2000.0
> Target Error %  :      0.100
> Minimum Repeats :    10
> Maximum Repeats :   100
> 
>    Benchmark   Compile  Executable   Ave Run  Number   Estim
>         Name    (secs)     (bytes)    (secs) Repeats   Err %
>    ---------   -------  ----------   ------- -------  ------
>           ac      1.76       10000      8.78      12  0.0081
>       aermod     54.94       10000     17.28      10  0.0307
>          air      3.44       10000      5.53      13  0.0734
>     capacita      2.64       10000     32.65      10  0.0096
>      channel      0.89       10000      1.84      20  0.0977
>        doduc      8.12       10000     27.00      10  0.0132
>      fatigue      3.06       10000      8.36      10  0.0104
>      gas_dyn      5.00       10000      4.30      17  0.0915
>       induct      6.24       10000     12.42      10  0.0100
>        linpk      1.02       10000     15.50      12  0.0542
>         mdbx      2.55       10000     11.24      10  0.0256
>           nf      2.90       10000     30.16      20  0.0989
>      protein      7.98       10000     33.72      10  0.0070
>       rnflow      9.34       10000     23.21      10  0.0551
>     test_fpu      6.72       10000      8.05      10  0.0426
>         tfft      0.76       10000      1.87      10  0.0597
> 
> Geometric Mean Execution Time =      10.87 seconds
> 
> and with the patch...
> 
> ================================================================================
> Date & Time     : 16 Dec 2010 21:31:06
> Test Name       : gfortran_lin_O3
> Compile Command : gfortran -ffast-math -funroll-loops -O3 %n.f90 -o %n
> Benchmarks      : ac aermod air capacita channel doduc fatigue gas_dyn induct linpk mdbx nf protein rnflow test_fpu tfft
> Maximum Times   :     2000.0
> Target Error %  :      0.100
> Minimum Repeats :    10
> Maximum Repeats :   100
> 
>    Benchmark   Compile  Executable   Ave Run  Number   Estim
>         Name    (secs)     (bytes)    (secs) Repeats   Err %
>    ---------   -------  ----------   ------- -------  ------
>           ac      1.19       10000      8.78      10  0.0099
>       aermod     47.91       10000     16.95      10  0.0123
>          air      2.85       10000      5.34      12  0.0715
>     capacita      1.63       10000     33.10      10  0.0361
>      channel      0.67       10000      1.87      10  0.0884
>        doduc      6.42       10000     27.35      10  0.0206
>      fatigue      2.10       10000      8.32      10  0.0194
>      gas_dyn      2.07       10000      4.30      17  0.0843
>       induct      5.38       10000     12.58      10  0.0088
>        linpk      0.71       10000     15.69      18  0.0796
>         mdbx      1.95       10000     11.41      10  0.0238
>           nf      1.24       10000     31.34      12  0.0991
>      protein      3.88       10000     35.13      10  0.0659
>       rnflow      4.73       10000     25.97      10  0.0629
>     test_fpu      3.66       10000      8.88      11  0.0989
>         tfft      0.52       10000      1.89      10  0.0403
> 
> Geometric Mean Execution Time =      11.09 seconds
> 
> This shows about a 2.0% performance reduction in the Geometric
> Mean Execution Time. I would note that intel darwin now defaults
> to -mtune=core2 and always has defaulted to -fPIC.
>            Jack
> 
> >
> > Thanks,
> >
> > Changpeng
> >
> >
> >
> >
> >
> >
> >
> >
> >
> > ________________________________________
> > From: Zdenek Dvorak [rakdver@kam.mff.cuni.cz]
> > Sent: Thursday, December 16, 2010 12:47 PM
> > To: Fang, Changpeng
> > Cc: Richard Guenther; Xinliang David Li; gcc-patches@gcc.gnu.org
> > Subject: Re: [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops
> >
> > Hi,
> >
> > > For prefetching of prologue or epilogue loops, we have two choices (1) prefetching not
> > > not unrolling, (2) not prefetching.  Which one do you prefer?
> >
> > it is better not to prefetch (the current placement of prefetches is not good for non-rolling
> > loops),
> >
> > Zdenek
> >
> 
> Content-Description: polyhedron1.txt
> > Comparison of Polyhedron (2005) Compile Time, Binary Size and Performance After Applying the NON-ROLLING Marking Patch
> >
> > gfortran -Ofast -funroll-loops -march=amdfam10 %n.f90 -o %n
> >     ============================================================================================================================
> > ||             |  Before Patch                     |  After Patch                       | Changes                       ||
> > ||========================================================================================================================||
> > ||  Benchmark    |  Compile    Binary Size  Run Time |  Compile   Binary Size Run Time  | Compile  Binary   Performance ||
> > ||        Name   |  Time (s)   (bytes)           (s)       |  Time(s)     (bytes)      (secs)   | Time (%) Size(%)    (%)       ||
> > ||========================================================================================================================||
> > ||          ac   |    3.36       41976            13.26    |    2.52       34424      13.21     |  -25.00   -17.99    0.38      ||
> > ||      aermod   |  103.45     1412221            44.36    |   86.55     1268861      43.08     |  -16.34   -10.15    2.97      ||
> > ||         air   |    6.11       75186            11.73    |    5.72       71090      11.56     |   -6.38    -5.45    1.47      ||
> > ||    capacita   |    6.83       91257            88.40    |    4.70       74585      88.01     |  -31.19   -18.27    0.44      ||
> > ||     channel   |    2.14       39984             6.65    |    1.84       35888       6.69     |  -14.02   -10.24   -0.60      ||
> > ||       doduc   |   12.78      198624            38.59    |   12.20      186336      38.18     |   -4.54    -6.19    1.07      ||
> > ||     fatigue   |    9.11      110008            10.15    |    5.93       92472      10.12     |  -34.91   -15.94    0.30      ||
> > ||     gas_dyn   |   15.69      149726             7.14    |    8.45      109342       6.96     |  -46.14   -26.97    2.59      ||
> > ||      induct   |   10.98      191800            20.66    |   10.62      188168      20.61     |   -3.28    -1.89    0.24      ||
> > ||       linpk   |    2.27       46073            19.03    |    1.68       33401      19.03     |  -25.99   -27.50    0.00      ||
> > ||        mdbx   |    5.63      103731            21.41    |    4.24       83251      21.33     |  -24.69   -19.74    0.38      ||
> > ||          nf   |   14.18      118451            22.88    |    5.55       60499      23.14     |  -60.86   -48.92   -1.12      ||
> > ||     protein   |   34.20      177700            47.04    |   19.16      135012      46.94     |  -43.98   -24.02    0.21      ||
> > ||      rnflow   |   42.13      283645            40.30    |   20.92      178477      40.65     |  -50.34   -37.08   -0.86      ||
> > ||    test_fpu   |   30.17      252080            14.46    |   14.44      149136      14.32     |  -52.14   -40.84    0.98      ||
> > ||        tfft   |    1.50       32450             7.71    |    1.12       26546       7.67     |  -25.33   -18.19    0.52      ||
> > ||========================================================================================================================||
> > ||       average   |                      19.57    |                          19.46     |  -29.07   -20.59    0.56      ||
> > ============================================================================================================================
> 
> Content-Description: 0001-Don-t-perform-certain-loop-optimizations-on-pre-post.patch
> > From cd8b85bba1b39e108235f44d9d07918179ff3d79 Mon Sep 17 00:00:00 2001
> > From: Changpeng Fang <chfang@houghton.(none)>
> > Date: Mon, 13 Dec 2010 12:01:49 -0800
> > Subject: [PATCH] Don't perform certain loop optimizations on pre/post loops
> >
> >       * basic-block.h (bb_flags): Add a new flag BB_HEADER_OF_NONROLLING_LOOP.
> >       * cfg.c (clear_bb_flags): Keep BB_HEADER_OF_NONROLLING marker.
> >       * cfgloop.h (mark_non_rolling_loop): New function declaration.
> >         (non_rolling_loop_p): New function declaration.
> >       * predict.c (optimize_loop_for_size_p): Return true if the loop was marked
> >         NON-ROLLING.  (optimize_loop_for_speed_p): Return false if the loop was
> >         marked NON-ROLLING.
> >       * tree-ssa-loop-manip.c (tree_transform_and_unroll_loop): Mark the
> >         non-rolling loop.
> >       * tree-ssa-loop-niter.c (mark_non_rolling_loop): Implement the new
> >         function.  (non_rolling_loop_p): Implement the new function.
> >       * tree-vect-loop-manip.c (vect_do_peeling_for_loop_bound): Mark the
> >         non-rolling loop.  (vect_do_peeling_for_alignment): Mark the non-rolling
> >         loop.
> > ---
> >  gcc/basic-block.h          |    6 +++++-
> >  gcc/cfg.c                  |    7 ++++---
> >  gcc/cfgloop.h              |    2 ++
> >  gcc/predict.c              |    6 ++++++
> >  gcc/tree-ssa-loop-manip.c  |    3 +++
> >  gcc/tree-ssa-loop-niter.c  |   20 ++++++++++++++++++++
> >  gcc/tree-vect-loop-manip.c |    8 ++++++++
> >  7 files changed, 48 insertions(+), 4 deletions(-)
> >
> > diff --git a/gcc/basic-block.h b/gcc/basic-block.h
> > index be0a1d1..850472d 100644
> > --- a/gcc/basic-block.h
> > +++ b/gcc/basic-block.h
> > @@ -245,7 +245,11 @@ enum bb_flags
> >
> >    /* Set on blocks that cannot be threaded through.
> >       Only used in cfgcleanup.c.  */
> > -  BB_NONTHREADABLE_BLOCK = 1 << 11
> > +  BB_NONTHREADABLE_BLOCK = 1 << 11,
> > +
> > +  /* Set on blocks that are headers of non-rolling loops.  */
> > +  BB_HEADER_OF_NONROLLING_LOOP = 1 << 12
> > +
> >  };
> >
> >  /* Dummy flag for convenience in the hot/cold partitioning code.  */
> > diff --git a/gcc/cfg.c b/gcc/cfg.c
> > index c8ef799..e59a637 100644
> > --- a/gcc/cfg.c
> > +++ b/gcc/cfg.c
> > @@ -425,8 +425,8 @@ redirect_edge_pred (edge e, basic_block new_pred)
> >    connect_src (e);
> >  }
> >
> > -/* Clear all basic block flags, with the exception of partitioning and
> > -   setjmp_target.  */
> > +/* Clear all basic block flags, with the exception of partitioning,
> > +   setjmp_target, and the non-rolling loop marker.  */
> >  void
> >  clear_bb_flags (void)
> >  {
> > @@ -434,7 +434,8 @@ clear_bb_flags (void)
> >
> >    FOR_BB_BETWEEN (bb, ENTRY_BLOCK_PTR, NULL, next_bb)
> >      bb->flags = (BB_PARTITION (bb)
> > -              | (bb->flags & (BB_DISABLE_SCHEDULE + BB_RTL + BB_NON_LOCAL_GOTO_TARGET)));
> > +              | (bb->flags & (BB_DISABLE_SCHEDULE + BB_RTL + BB_NON_LOCAL_GOTO_TARGET
> > +                              + BB_HEADER_OF_NONROLLING_LOOP)));
> >  }
> >
> >  /* Check the consistency of profile information.  We can't do that
> > diff --git a/gcc/cfgloop.h b/gcc/cfgloop.h
> > index bf2614e..e856a78 100644
> > --- a/gcc/cfgloop.h
> > +++ b/gcc/cfgloop.h
> > @@ -279,6 +279,8 @@ extern rtx doloop_condition_get (rtx);
> >  void estimate_numbers_of_iterations_loop (struct loop *, bool);
> >  HOST_WIDE_INT estimated_loop_iterations_int (struct loop *, bool);
> >  bool estimated_loop_iterations (struct loop *, bool, double_int *);
> > +void mark_non_rolling_loop (struct loop *);
> > +bool non_rolling_loop_p (struct loop *);
> >
> >  /* Loop manipulation.  */
> >  extern bool can_duplicate_loop_p (const struct loop *loop);
> > diff --git a/gcc/predict.c b/gcc/predict.c
> > index c691990..bf729f8 100644
> > --- a/gcc/predict.c
> > +++ b/gcc/predict.c
> > @@ -279,6 +279,9 @@ optimize_insn_for_speed_p (void)
> >  bool
> >  optimize_loop_for_size_p (struct loop *loop)
> >  {
> > +  /* Loops marked NON-ROLLING are not likely to be hot.  */
> > +  if (non_rolling_loop_p (loop))
> > +    return true;
> >    return optimize_bb_for_size_p (loop->header);
> >  }
> >
> > @@ -287,6 +290,9 @@ optimize_loop_for_size_p (struct loop *loop)
> >  bool
> >  optimize_loop_for_speed_p (struct loop *loop)
> >  {
> > +  /* Loops marked NON-ROLLING are not likely to be hot.  */
> > +  if (non_rolling_loop_p (loop))
> > +    return false;
> >    return optimize_bb_for_speed_p (loop->header);
> >  }
> >
> > diff --git a/gcc/tree-ssa-loop-manip.c b/gcc/tree-ssa-loop-manip.c
> > index 87b2c0d..bc977bb 100644
> > --- a/gcc/tree-ssa-loop-manip.c
> > +++ b/gcc/tree-ssa-loop-manip.c
> > @@ -931,6 +931,9 @@ tree_transform_and_unroll_loop (struct loop *loop, unsigned factor,
> >    gcc_assert (new_loop != NULL);
> >    update_ssa (TODO_update_ssa);
> >
> > +  /* NEW_LOOP is a non-rolling loop.  */
> > +  mark_non_rolling_loop (new_loop);
> > +
> >    /* Determine the probability of the exit edge of the unrolled loop.  */
> >    new_est_niter = est_niter / factor;
> >
> > diff --git a/gcc/tree-ssa-loop-niter.c b/gcc/tree-ssa-loop-niter.c
> > index ee85f6f..1e2e4b2 100644
> > --- a/gcc/tree-ssa-loop-niter.c
> > +++ b/gcc/tree-ssa-loop-niter.c
> > @@ -3011,6 +3011,26 @@ estimate_numbers_of_iterations (bool use_undefined_p)
> >    fold_undefer_and_ignore_overflow_warnings ();
> >  }
> >
> > +/* Mark LOOP as a non-rolling loop.  */
> > +
> > +void
> > +mark_non_rolling_loop (struct loop *loop)
> > +{
> > +  gcc_assert (loop && loop->header);
> > +  loop->header->flags |= BB_HEADER_OF_NONROLLING_LOOP;
> > +}
> > +
> > +/* Return true if LOOP is a non-rolling loop.  */
> > +
> > +bool
> > +non_rolling_loop_p (struct loop *loop)
> > +{
> > +  int masked_flags;
> > +  gcc_assert (loop && loop->header);
> > +  masked_flags = (loop->header->flags & BB_HEADER_OF_NONROLLING_LOOP);
> > +  return (masked_flags != 0);
> > +}
> > +
> >  /* Returns true if statement S1 dominates statement S2.  */
> >
> >  bool
> > diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
> > index 6ecd304..216de78 100644
> > --- a/gcc/tree-vect-loop-manip.c
> > +++ b/gcc/tree-vect-loop-manip.c
> > @@ -1938,6 +1938,10 @@ vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo, tree *ratio,
> >                                           cond_expr, cond_expr_stmt_list);
> >    gcc_assert (new_loop);
> >    gcc_assert (loop_num == loop->num);
> > +
> > +  /* NEW_LOOP is a non-rolling loop.  */
> > +  mark_non_rolling_loop (new_loop);
> > +
> >  #ifdef ENABLE_CHECKING
> >    slpeel_verify_cfg_after_peeling (loop, new_loop);
> >  #endif
> > @@ -2191,6 +2195,10 @@ vect_do_peeling_for_alignment (loop_vec_info loop_vinfo)
> >                                  th, true, NULL_TREE, NULL);
> >
> >    gcc_assert (new_loop);
> > +
> > +  /* NEW_LOOP is a non-rolling loop.  */
> > +  mark_non_rolling_loop (new_loop);
> > +
> >  #ifdef ENABLE_CHECKING
> >    slpeel_verify_cfg_after_peeling (new_loop, loop);
> >  #endif
> > --
> > 1.6.3.3
> >

Patch

Index: gcc/basic-block.h
===================================================================
--- gcc/basic-block.h	(revision 168437)
+++ gcc/basic-block.h	(working copy)
@@ -247,11 +247,14 @@ 
      Only used in cfgcleanup.c.  */
   BB_NONTHREADABLE_BLOCK = 1 << 11,
 
+  /* Set on blocks that are headers of non-rolling loops.  */
+  BB_HEADER_OF_NONROLLING_LOOP = 1 << 12,
+
   /* Set on blocks that were modified in some way.  This bit is set in
      df_set_bb_dirty, but not cleared by df_analyze, so it can be used
      to test whether a block has been modified prior to a df_analyze
      call.  */
-  BB_MODIFIED = 1 << 12
+  BB_MODIFIED = 1 << 13
 };
 
 /* Dummy flag for convenience in the hot/cold partitioning code.  */