[PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops

Message ID D4C76825A6780047854A11E93CDE84D004C17683FA@SAUSEXMBP01.amd.com
State New

Commit Message

Fang, Changpeng Dec. 13, 2010, 8:35 p.m. UTC
Hi,

The attached patch adds the logic to disable certain loop optimizations on pre-/post-loops. 

Some loop optimizations (auto-vectorization, loop unrolling, etc.) may peel a few iterations
of a loop to form pre- and/or post-loops for various purposes (alignment, loop bounds, etc.).
Currently, the GCC loop optimizer is unable to recognize that such loops will roll only a few
iterations, and it still performs optimizations on them. While this does not hurt performance
in general, it may significantly increase compilation time and code size without any performance benefit.
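
As an illustration (a made-up example, not taken from the patch), vectorizing a
simple loop by a factor of four typically produces the following shape, where
the post-loop can roll at most three times:

  void
  vadd (float *a, const float *b, const float *c, int n)
  {
    int i;
    /* Main vectorized loop: processes 4 elements per iteration
       (in reality a single vector operation).  */
    for (i = 0; i + 4 <= n; i += 4)
      {
        a[i]     = b[i]     + c[i];
        a[i + 1] = b[i + 1] + c[i + 1];
        a[i + 2] = b[i + 2] + c[i + 2];
        a[i + 3] = b[i + 3] + c[i + 3];
      }
    /* Post-loop (epilogue): handles the remaining n % 4 elements,
       i.e., at most 3 iterations.  */
    for (; i < n; i++)
      a[i] = b[i] + c[i];
  }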

This patch adds logic for the loop optimizer to recognize pre- and/or post-loops and disable
prefetching, unswitching, and loop unrolling on them. On Polyhedron with -Ofast -funroll-loops -march=amdfam10,
the patch reduces the compilation time by 28% on average and the binary size by 20% on
average (see the attached data).  Note that while the small speed improvement (0.5%) could have been
noise, the code size reduction could possibly improve performance in some cases (I-cache improvement?).

The patch passed bootstrap and gcc regression tests on x86_64-unknown-linux-gnu.

Is it OK to commit to trunk?

Thanks,

Changpeng
Effect of the pre-/post-loop patch on Polyhedron
option: gfortran -Ofast -funroll-loops -march=amdfam10

               compilation     code size      speed
              time reduction   reduction      improvement
                 (%)             (%)            (%)
-----------------------------------------------------------
          ac	-20.54		-17.15		0
      aermod	-15.93		-10.15		2.51
         air	-5.74		-5.45		-0.09
    capacita	-31.35		-18.27		0.08
     channel	-11.32		-10.24		1.22
       doduc	-4.52		-6.12		0.82
     fatigue	-34.51		-15.94		0
     gas_dyn	-45.56		-28.66		2.31
      induct	-3.1		-1.91		0.05
       linpk	-25.55		-27.5		0.26
        mdbx	-24.06		-19.74		1.27
          nf	-60.85		-48.92		-0.77
     protein	-44.73		-24.02		-0.19
      rnflow	-50.55		-36.69		0.47
    test_fpu	-52.49		-41.35		1.18
        tfft	-24.83		-18.29		0.39
-----------------------------------------------------------
     average	-28.48		-20.65		0.59

Comments

Sebastian Pop Dec. 14, 2010, 12:20 a.m. UTC | #1
On Mon, Dec 13, 2010 at 14:35, Fang, Changpeng <Changpeng.Fang@amd.com> wrote:
> Hi,
>
> The attached patch adds the logic to disable certain loop optimizations on pre-/post-loops.
>
> Some loop optimizations (auto-vectorization, loop unrolling, etc.) may peel a few iterations
> of a loop to form pre- and/or post-loops for various purposes (alignment, loop bounds, etc.).
> Currently, the GCC loop optimizer is unable to recognize that such loops will roll only a few
> iterations, and it still performs optimizations on them. While this does not hurt performance
> in general, it may significantly increase compilation time and code size without any performance benefit.
>
> This patch adds logic for the loop optimizer to recognize pre- and/or post-loops and disable
> prefetching, unswitching, and loop unrolling on them. On Polyhedron with -Ofast -funroll-loops -march=amdfam10,
> the patch reduces the compilation time by 28% on average and the binary size by 20% on
> average (see the attached data).  Note that while the small speed improvement (0.5%) could have been
> noise, the code size reduction could possibly improve performance in some cases (I-cache improvement?).
>
> The patch passed bootstrap and gcc regression tests on x86_64-unknown-linux-gnu.
>
> Is it OK to commit to trunk?

I like the way you solved this problem, but I cannot approve your patch.
I will let Richi or someone else comment on it.

Thanks for fixing this,
Sebastian
Zdenek Dvorak Dec. 14, 2010, 7:56 a.m. UTC | #2
Hi,

> The attached patch adds the logic to disable certain loop optimizations on pre-/post-loops. 
> 
> Some loop optimizations (auto-vectorization, loop unrolling, etc.) may peel a few iterations
> of a loop to form pre- and/or post-loops for various purposes (alignment, loop bounds, etc.).
> Currently, the GCC loop optimizer is unable to recognize that such loops will roll only a few
> iterations, and it still performs optimizations on them. While this does not hurt performance
> in general, it may significantly increase compilation time and code size without any performance benefit.
>
> This patch adds logic for the loop optimizer to recognize pre- and/or post-loops and disable
> prefetching, unswitching, and loop unrolling on them.

why not simply change the profile updating to correctly indicate that these loops do not roll?
That way, all the optimizations would profit, not just those aware of the new bb flag,

Zdenek
Jack Howarth Dec. 14, 2010, 2:27 p.m. UTC | #3
On Mon, Dec 13, 2010 at 02:35:35PM -0600, Fang, Changpeng wrote:
> Hi,
> 
> The attached patch adds the logic to disable certain loop optimizations on pre-/post-loops. 
> 
> Some loop optimizations (auto-vectorization, loop unrolling, etc.) may peel a few iterations
> of a loop to form pre- and/or post-loops for various purposes (alignment, loop bounds, etc.).
> Currently, the GCC loop optimizer is unable to recognize that such loops will roll only a few
> iterations, and it still performs optimizations on them. While this does not hurt performance
> in general, it may significantly increase compilation time and code size without any performance benefit.
>
> This patch adds logic for the loop optimizer to recognize pre- and/or post-loops and disable
> prefetching, unswitching, and loop unrolling on them. On Polyhedron with -Ofast -funroll-loops -march=amdfam10,
> the patch reduces the compilation time by 28% on average and the binary size by 20% on
> average (see the attached data).  Note that while the small speed improvement (0.5%) could have been
> noise, the code size reduction could possibly improve performance in some cases (I-cache improvement?).
> 
> The patch passed bootstrap and gcc regression tests on x86_64-unknown-linux-gnu.
> 
> Is it OK to commit to trunk?
> 
> Thanks,
> 
> Changpeng

Changpeng,
   On x86_64-apple-darwin10, this patch produces some regressions in the gcc testsuite.
In particular at both -m32 and -m64...

XPASS: gcc.dg/pr30957-1.c execution test
FAIL: gcc.dg/pr30957-1.c scan-rtl-dump loop2_unroll "Expanding Accumulator"

and

FAIL: gcc.dg/var-expand1.c scan-rtl-dump loop2_unroll "Expanding Accumulator"

Do you see those as well on linux?
               Jack

>    
Content-Description: polyhedron.txt
> Effect of the pre-/post-loop patch on Polyhedron
> option: gfortran -Ofast -funroll-loops -march=amdfam10
> 
>                compilation     code size      speed
>               time reduction   reduction      improvement
>                  (%)             (%)            (%)
> -----------------------------------------------------------
>           ac	-20.54		-17.15		0
>       aermod	-15.93		-10.15		2.51
>          air	-5.74		-5.45		-0.09
>     capacita	-31.35		-18.27		0.08
>      channel	-11.32		-10.24		1.22
>        doduc	-4.52		-6.12		0.82
>      fatigue	-34.51		-15.94		0
>      gas_dyn	-45.56		-28.66		2.31
>       induct	-3.1		-1.91		0.05
>        linpk	-25.55		-27.5		0.26
>         mdbx	-24.06		-19.74		1.27
>           nf	-60.85		-48.92		-0.77
>      protein	-44.73		-24.02		-0.19
>       rnflow	-50.55		-36.69		0.47
>     test_fpu	-52.49		-41.35		1.18
>         tfft	-24.83		-18.29		0.39
> -----------------------------------------------------------
>      average	-28.48		-20.65		0.59
> 

Content-Description: 0001-Don-t-perform-certain-loop-optimizations-on-pre-post.patch
> From e8636e80de4d6de8ba2dbc8f08bd2daddd02edc3 Mon Sep 17 00:00:00 2001
> From: Changpeng Fang <chfang@houghton.(none)>
> Date: Mon, 13 Dec 2010 12:01:49 -0800
> Subject: [PATCH] Don't perform certain loop optimizations on pre/post loops
> 
> 	* basic-block.h (bb_flags): Add a new flag BB_PRE_POST_LOOP_HEADER.
> 	* cfg.c (clear_bb_flags): Keep BB_PRE_POST_LOOP_HEADER marker.
> 	* cfgloop.h (mark_pre_or_post_loop): New function declaration.
> 	  (pre_or_post_loop_p): New function declaration.
> 	* loop-unroll.c (decide_unroll_runtime_iterations): Do not unroll a
> 	  pre- or post-loop.
> 	* loop-unswitch.c (unswitch_single_loop): Do not unswitch a pre- or
> 	  post-loop.
> 	* tree-ssa-loop-manip.c (tree_transform_and_unroll_loop): Mark the
> 	  post-loop.
> 	* tree-ssa-loop-niter.c (mark_pre_or_post_loop): Implement the new
> 	  function.  (pre_or_post_loop_p): Implement the new function.
> 	* tree-ssa-loop-prefetch.c (loop_prefetch_arrays): Don't prefetch
> 	  a pre- or post-loop.
> 	* tree-ssa-loop-unswitch.c (tree_ssa_unswitch_loops): Do not unswitch
> 	  a pre- or post-loop.
> 	* tree-vect-loop-manip.c (vect_do_peeling_for_loop_bound): Mark the
> 	  post-loop.  (vect_do_peeling_for_alignment): Mark the pre-loop.
> ---
>  gcc/basic-block.h            |    6 +++++-
>  gcc/cfg.c                    |    7 ++++---
>  gcc/cfgloop.h                |    2 ++
>  gcc/loop-unroll.c            |    7 +++++++
>  gcc/loop-unswitch.c          |    8 ++++++++
>  gcc/tree-ssa-loop-manip.c    |    3 +++
>  gcc/tree-ssa-loop-niter.c    |   20 ++++++++++++++++++++
>  gcc/tree-ssa-loop-prefetch.c |    7 +++++++
>  gcc/tree-ssa-loop-unswitch.c |    8 ++++++++
>  gcc/tree-vect-loop-manip.c   |    8 ++++++++
>  10 files changed, 72 insertions(+), 4 deletions(-)
> 
> diff --git a/gcc/basic-block.h b/gcc/basic-block.h
> index be0a1d1..78552fd 100644
> --- a/gcc/basic-block.h
> +++ b/gcc/basic-block.h
> @@ -245,7 +245,11 @@ enum bb_flags
>  
>    /* Set on blocks that cannot be threaded through.
>       Only used in cfgcleanup.c.  */
> -  BB_NONTHREADABLE_BLOCK = 1 << 11
> +  BB_NONTHREADABLE_BLOCK = 1 << 11,
> +
> +  /* Set on blocks that are headers of pre- or post-loops.  */
> +  BB_PRE_POST_LOOP_HEADER = 1 << 12
> +
>  };
>  
>  /* Dummy flag for convenience in the hot/cold partitioning code.  */
> diff --git a/gcc/cfg.c b/gcc/cfg.c
> index c8ef799..e9b394a 100644
> --- a/gcc/cfg.c
> +++ b/gcc/cfg.c
> @@ -425,8 +425,8 @@ redirect_edge_pred (edge e, basic_block new_pred)
>    connect_src (e);
>  }
>  
> -/* Clear all basic block flags, with the exception of partitioning and
> -   setjmp_target.  */
> +/* Clear all basic block flags, with the exception of partitioning,
> +   setjmp_target, and the pre/post loop marker.  */
>  void
>  clear_bb_flags (void)
>  {
> @@ -434,7 +434,8 @@ clear_bb_flags (void)
>  
>    FOR_BB_BETWEEN (bb, ENTRY_BLOCK_PTR, NULL, next_bb)
>      bb->flags = (BB_PARTITION (bb)
> -		 | (bb->flags & (BB_DISABLE_SCHEDULE + BB_RTL + BB_NON_LOCAL_GOTO_TARGET)));
> +		 | (bb->flags & (BB_DISABLE_SCHEDULE + BB_RTL + BB_NON_LOCAL_GOTO_TARGET
> +                                 + BB_PRE_POST_LOOP_HEADER)));
>  }
>  
>  /* Check the consistency of profile information.  We can't do that
> diff --git a/gcc/cfgloop.h b/gcc/cfgloop.h
> index bf2614e..ce848cc 100644
> --- a/gcc/cfgloop.h
> +++ b/gcc/cfgloop.h
> @@ -279,6 +279,8 @@ extern rtx doloop_condition_get (rtx);
>  void estimate_numbers_of_iterations_loop (struct loop *, bool);
>  HOST_WIDE_INT estimated_loop_iterations_int (struct loop *, bool);
>  bool estimated_loop_iterations (struct loop *, bool, double_int *);
> +void mark_pre_or_post_loop (struct loop *);
> +bool pre_or_post_loop_p (struct loop *);
>  
>  /* Loop manipulation.  */
>  extern bool can_duplicate_loop_p (const struct loop *loop);
> diff --git a/gcc/loop-unroll.c b/gcc/loop-unroll.c
> index 67d6ea0..6f095f6 100644
> --- a/gcc/loop-unroll.c
> +++ b/gcc/loop-unroll.c
> @@ -857,6 +857,13 @@ decide_unroll_runtime_iterations (struct loop *loop, int flags)
>  	fprintf (dump_file, ";; Loop iterates constant times\n");
>        return;
>      }
> + 
> +  if (pre_or_post_loop_p (loop))
> +    {
> +      if (dump_file)
> +        fprintf (dump_file, ";; Not unrolling, a pre- or post-loop\n");
> +      return;
> +    }
>  
>    /* If we have profile feedback, check whether the loop rolls.  */
>    if (loop->header->count && expected_loop_iterations (loop) < 2 * nunroll)
> diff --git a/gcc/loop-unswitch.c b/gcc/loop-unswitch.c
> index 77524d8..59373bf 100644
> --- a/gcc/loop-unswitch.c
> +++ b/gcc/loop-unswitch.c
> @@ -276,6 +276,14 @@ unswitch_single_loop (struct loop *loop, rtx cond_checked, int num)
>        return;
>      }
>  
> +  /* Pre- or post loop usually just roll a few iterations.  */
> +  if (pre_or_post_loop_p (loop))
> +    {
> +      if (dump_file)
> +	fprintf (dump_file, ";; Not unswitching, a pre- or post loop\n");
> +      return;
> +    }
> +
>    /* We must be able to duplicate loop body.  */
>    if (!can_duplicate_loop_p (loop))
>      {
> diff --git a/gcc/tree-ssa-loop-manip.c b/gcc/tree-ssa-loop-manip.c
> index 87b2c0d..f8ddbab 100644
> --- a/gcc/tree-ssa-loop-manip.c
> +++ b/gcc/tree-ssa-loop-manip.c
> @@ -931,6 +931,9 @@ tree_transform_and_unroll_loop (struct loop *loop, unsigned factor,
>    gcc_assert (new_loop != NULL);
>    update_ssa (TODO_update_ssa);
>  
> +  /* NEW_LOOP is a post-loop.  */
> +  mark_pre_or_post_loop (new_loop);
> +
>    /* Determine the probability of the exit edge of the unrolled loop.  */
>    new_est_niter = est_niter / factor;
>  
> diff --git a/gcc/tree-ssa-loop-niter.c b/gcc/tree-ssa-loop-niter.c
> index ee85f6f..33e8cc3 100644
> --- a/gcc/tree-ssa-loop-niter.c
> +++ b/gcc/tree-ssa-loop-niter.c
> @@ -3011,6 +3011,26 @@ estimate_numbers_of_iterations (bool use_undefined_p)
>    fold_undefer_and_ignore_overflow_warnings ();
>  }
>  
> +/* Mark LOOP as a pre- or post loop.  */
> +
> +void
> +mark_pre_or_post_loop (struct loop *loop)
> +{
> +  gcc_assert (loop && loop->header);
> +  loop->header->flags |= BB_PRE_POST_LOOP_HEADER;
> +}
> +
> +/* Return true if LOOP is a pre- or post loop.  */
> +
> +bool
> +pre_or_post_loop_p (struct loop *loop)
> +{
> +  int masked_flags;
> +  gcc_assert (loop && loop->header);
> +  masked_flags = (loop->header->flags & BB_PRE_POST_LOOP_HEADER);
> +  return (masked_flags != 0);
> +}
> +
>  /* Returns true if statement S1 dominates statement S2.  */
>  
>  bool
> diff --git a/gcc/tree-ssa-loop-prefetch.c b/gcc/tree-ssa-loop-prefetch.c
> index 59c65d3..5c9f640 100644
> --- a/gcc/tree-ssa-loop-prefetch.c
> +++ b/gcc/tree-ssa-loop-prefetch.c
> @@ -1793,6 +1793,13 @@ loop_prefetch_arrays (struct loop *loop)
>        return false;
>      }
>  
> +  if (pre_or_post_loop_p (loop))
> +    {
> +      if (dump_file && (dump_flags & TDF_DETAILS))
> +	fprintf (dump_file, "  Not Prefetching -- pre- or post loop\n");
> +      return false;
> +    }
> +
>    /* FIXME: the time should be weighted by the probabilities of the blocks in
>       the loop body.  */
>    time = tree_num_loop_insns (loop, &eni_time_weights);
> diff --git a/gcc/tree-ssa-loop-unswitch.c b/gcc/tree-ssa-loop-unswitch.c
> index b6b32dc..f3b8108 100644
> --- a/gcc/tree-ssa-loop-unswitch.c
> +++ b/gcc/tree-ssa-loop-unswitch.c
> @@ -88,6 +88,14 @@ tree_ssa_unswitch_loops (void)
>        if (dump_file && (dump_flags & TDF_DETAILS))
>          fprintf (dump_file, ";; Considering loop %d\n", loop->num);
>  
> +     /* Do not unswitch a pre- or post loop.  */
> +     if (pre_or_post_loop_p (loop))
> +       {
> +          if (dump_file && (dump_flags & TDF_DETAILS))
> +            fprintf (dump_file, ";; Not unswitching, a pre- or post loop\n");
> +          continue;
> +        }
> +
>        /* Do not unswitch in cold regions. */
>        if (optimize_loop_for_size_p (loop))
>          {
> diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
> index 6ecd304..9a63f7e 100644
> --- a/gcc/tree-vect-loop-manip.c
> +++ b/gcc/tree-vect-loop-manip.c
> @@ -1938,6 +1938,10 @@ vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo, tree *ratio,
>  					    cond_expr, cond_expr_stmt_list);
>    gcc_assert (new_loop);
>    gcc_assert (loop_num == loop->num);
> +  
> +  /* NEW_LOOP is a post loop.  */
> +  mark_pre_or_post_loop (new_loop);
> +
>  #ifdef ENABLE_CHECKING
>    slpeel_verify_cfg_after_peeling (loop, new_loop);
>  #endif
> @@ -2191,6 +2195,10 @@ vect_do_peeling_for_alignment (loop_vec_info loop_vinfo)
>  				   th, true, NULL_TREE, NULL);
>  
>    gcc_assert (new_loop);
> +
> +  /* NEW_LOOP is a pre-loop.  */
> +  mark_pre_or_post_loop (new_loop);
> +  
>  #ifdef ENABLE_CHECKING
>    slpeel_verify_cfg_after_peeling (new_loop, loop);
>  #endif
> -- 
> 1.6.3.3
>
Fang, Changpeng Dec. 14, 2010, 4:45 p.m. UTC | #4
No, I didn't see these failures.

The failure list from my bootstrap is the following; I see nothing relevant:

FAIL: gcc.dg/guality/pr43077-1.c  -O2 -flto -flto-partition=none  line 42 varb == 2
FAIL: gcc.dg/guality/pr43077-1.c  -O2 -flto  line 42 varb == 2
FAIL: gcc.dg/guality/sra-1.c  -O1  line 21 a.j == 14
FAIL: gcc.dg/guality/sra-1.c  -O2  line 21 a.j == 14
FAIL: gcc.dg/guality/sra-1.c  -O3 -fomit-frame-pointer  line 21 a.j == 14
FAIL: gcc.dg/guality/sra-1.c  -O3 -g  line 21 a.j == 14
FAIL: gcc.dg/guality/sra-1.c  -Os  line 21 a.j == 14
FAIL: gcc.dg/guality/vla-1.c  -O0  line 17 sizeof (a) == 6
FAIL: gcc.dg/guality/vla-1.c  -O0  line 24 sizeof (a) == 17 * sizeof (short)
FAIL: gcc.dg/guality/vla-1.c  -O1  line 24 sizeof (a) == 17 * sizeof (short)
FAIL: gcc.dg/guality/vla-1.c  -O2  line 24 sizeof (a) == 17 * sizeof (short)
FAIL: gcc.dg/guality/vla-1.c  -O3 -fomit-frame-pointer  line 24 sizeof (a) == 17 * sizeof (short)
FAIL: gcc.dg/guality/vla-1.c  -O3 -g  line 24 sizeof (a) == 17 * sizeof (short)
FAIL: gcc.dg/guality/vla-1.c  -Os  line 24 sizeof (a) == 17 * sizeof (short)
FAIL: gcc.dg/guality/vla-2.c  -O0  line 16 sizeof (a) == 5 * sizeof (int)
FAIL: gcc.dg/guality/vla-2.c  -O0  line 25 sizeof (a) == 6 * sizeof (int)
FAIL: gcc.dg/guality/vla-2.c  -O1  line 16 sizeof (a) == 5 * sizeof (int)
FAIL: gcc.dg/guality/vla-2.c  -O1  line 25 sizeof (a) == 6 * sizeof (int)
FAIL: gcc.dg/guality/vla-2.c  -O2  line 16 sizeof (a) == 5 * sizeof (int)
FAIL: gcc.dg/guality/vla-2.c  -O2  line 25 sizeof (a) == 6 * sizeof (int)
FAIL: gcc.dg/guality/vla-2.c  -O3 -fomit-frame-pointer  line 16 sizeof (a) == 5 * sizeof (int)
FAIL: gcc.dg/guality/vla-2.c  -O3 -fomit-frame-pointer  line 25 sizeof (a) == 6 * sizeof (int)
FAIL: gcc.dg/guality/vla-2.c  -O3 -g  line 16 sizeof (a) == 5 * sizeof (int)
FAIL: gcc.dg/guality/vla-2.c  -O3 -g  line 25 sizeof (a) == 6 * sizeof (int)
FAIL: gcc.dg/guality/vla-2.c  -Os  line 16 sizeof (a) == 5 * sizeof (int)
FAIL: gcc.dg/guality/vla-2.c  -Os  line 25 sizeof (a) == 6 * sizeof (int)
FAIL: g++.dg/guality/redeclaration1.C  -O0  line 17 i == 24
FAIL: g++.dg/guality/redeclaration1.C  -O1  line 17 i == 24
FAIL: g++.dg/guality/redeclaration1.C  -O2  line 17 i == 24
FAIL: g++.dg/guality/redeclaration1.C  -O3 -fomit-frame-pointer  line 17 i == 24
FAIL: g++.dg/guality/redeclaration1.C  -O3 -g  line 17 i == 24
FAIL: g++.dg/guality/redeclaration1.C  -Os  line 17 i == 24
FAIL: libmudflap.c/pass49-frag.c execution test
FAIL: libmudflap.c/pass49-frag.c output pattern test
FAIL: libmudflap.c/pass49-frag.c execution test
FAIL: libmudflap.c/pass49-frag.c output pattern test
FAIL: libmudflap.c/pass49-frag.c (-static) execution test
FAIL: libmudflap.c/pass49-frag.c (-static) output pattern test
FAIL: libmudflap.c/pass49-frag.c (-static) execution test
FAIL: libmudflap.c/pass49-frag.c (-static) output pattern test
FAIL: libmudflap.c/pass49-frag.c (-O2) execution test
FAIL: libmudflap.c/pass49-frag.c (-O2) output pattern test
FAIL: libmudflap.c/pass49-frag.c (-O2) execution test
FAIL: libmudflap.c/pass49-frag.c (-O2) output pattern test
FAIL: libmudflap.c/pass49-frag.c (-O3) execution test
FAIL: libmudflap.c/pass49-frag.c (-O3) output pattern test
FAIL: libmudflap.c/pass49-frag.c (-O3) execution test
FAIL: libmudflap.c/pass49-frag.c (-O3) output pattern test
FAIL: gcc.dg/cproj-fails-with-broken-glibc.c execution test


Thanks,

Changpeng
Fang, Changpeng Dec. 14, 2010, 7:13 p.m. UTC | #5
>why not simply change the profile updating to correctly indicate that these loops do not roll?
>That way, all the optimizations would profit, not just those aware of the new bb flag,

Maybe my understanding is not correct, but I do not feel comfortable using the profiled trip count
to guard loop optimizations. For a given program, different data sizes will result in quite different
loop trip counts.

By the way, what other optimizations do you think would benefit significantly from being
disabled for small-trip-count loops?

Thanks,

Changpeng
Zdenek Dvorak Dec. 14, 2010, 9:05 p.m. UTC | #6
Hi,

>  >why not simply change the profile updating to correctly indicate that these loops do not roll?
> >That way, all the optimizations would profit, not just those aware of the new bb flag,
> 
> Maybe my understanding is not correct. But I feel not comfortable using profile of trip count
> to guard loop optimizations. 

it is already used that way; i.e., you do not need to change anything in the optimizations, just
make sure that the edge probabilities are sensible.

> For a given program, different data sizes will result in quite different
> loop trip counts.

That should not be the case -- for the pre/post loops generated in vectorization, we know the
expected # of iterations, based on their purpose; e.g., for loops inserted so that the # of iterations
is divisible by 4, we know that the loop will iterate at most three times (and probably less), etc.

> By the way, what optimizations else do you think will benefit from disabling for small trip count
> loops, significantly? 

Anything where we check whether we should optimize for speed or code size,

Zdenek
Richard Biener Dec. 15, 2010, 2:35 a.m. UTC | #7
On Tue, Dec 14, 2010 at 10:05 PM, Zdenek Dvorak <rakdver@kam.mff.cuni.cz> wrote:
> Hi,
>
>>  >why not simply change the profile updating to correctly indicate that these loops do not roll?
>> >That way, all the optimizations would profit, not just those aware of the new bb flag,
>>
>> Maybe my understanding is not correct. But I feel not comfortable using profile of trip count
>> to guard loop optimizations.
>
> it is already used that way; i.e., you do not need to change anything in the optimizations, just
> make sure that the edge probabilities are sensible.
>
>> For a given program, different data sizes will result in quite different
>> loop trip counts.
>
> That should not be the case -- for the pre/post loops generated in vectorization, we know the
> expected # of iterations, based on their purpose; e.g., for loops inserted so that the # of iterations
> is divisible by 4, we know that the loop will iterate at most three times (and probably less), etc.
>
>> By the way, what optimizations else do you think will benefit from disabling for small trip count
>> loops, significantly?
>
> Anything where we check whether we should optimize for speed or code size,

I agree with Zdenek (without having looked at the patch so far).

Richard.

> Zdenek
>
Fang, Changpeng Dec. 15, 2010, 6:22 a.m. UTC | #8

Xinliang David Li Dec. 15, 2010, 7:54 a.m. UTC | #9
On Tue, Dec 14, 2010 at 10:22 PM, Fang, Changpeng
<Changpeng.Fang@amd.com> wrote:
>
>
> ________________________________________
> From: Richard Guenther [richard.guenther@gmail.com]
> Sent: Tuesday, December 14, 2010 8:35 PM
> To: Zdenek Dvorak
> Cc: Fang, Changpeng; gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops
>
> On Tue, Dec 14, 2010 at 10:05 PM, Zdenek Dvorak <rakdver@kam.mff.cuni.cz> wrote:
>> Hi,
>>
>>>  >why not simply change the profile updating to correctly indicate that these loops do not roll?
>>> >That way, all the optimizations would profit, not just those aware of the new bb flag,
>>>
>>> Maybe my understanding is not correct. But I feel not comfortable using profile of trip count
>>> to guard loop optimizations.
>>
>> it is already used that way; i.e., you do not need to change anything in the optimizations, just
>> make sure that the edge probabilities are sensible.
>>
>>> For a given program, different data sizes will result in quite different
>>> loop trip counts.
>>
>> That should not be the case -- for the pre/post loops generated in vectorization, we know the
>> expected # of iterations, based on their purpose; e.g., for loops inserted so that the # of iterations
>> is divisible by 4, we know that the loop will iterate at most three times (and probably less), etc.
>>
>>> By the way, what optimizations else do you think will benefit from disabling for small trip count
>>> loops, significantly?
>>
>> Anything where we check whether we should optimize for speed or code size,
>
>>I agree with Zdenek (without having looked at the patch so far).
>
> I think my patch (adding a bb flag) provides a simple and yet effective solution for the unnecessary
> code expansion problem in prefetching, unswitching, and loop unrolling. However, I don't mind
> updating the profile information for the same purpose.
>
> Now, suppose we know a loop will roll at most 3 times at runtime. How should we update the profile
> information to let expected_loop_iterations know this value?  (I got lost here on the
> edge-probability issues.)
>


With profile data, the average loop trip count is recorded in the
nb_iterations_estimate field of the loop structure. You can get it
from estimated_loop_iterations_int (loop, false).  For the
small-trip-count loops introduced by the optimization, you can use the
interface record_niter_bound (..., true, ..) to record a realistic
estimate, which may or may not be an upper bound.
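
A rough sketch of recording such a bound with these interfaces (illustrative
only; assumes the GCC 4.6-era declarations in cfgloop.h):

  /* LOOP is a freshly created pre-/post-loop known to roll only a
     couple of times.  */
  if (estimated_loop_iterations_int (loop, false) < 0)
    /* No estimate known yet: record a realistic estimate (realistic
       = true) that is not a guaranteed upper bound (upper = false).  */
    record_niter_bound (loop, double_int_two, true, false);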

In general, without FDO, gcc does not estimate loop iterations
according to the back-edge probability computed by static prediction
(predict.c).  This is less than ideal. For instance, when
builtin_expect is used to annotate the loop bound, the information
will be lost.
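
For instance (hypothetical code, not from the patch), an annotation like the
following currently does not feed the iteration estimate:

  long i;
  /* __builtin_expect returns n unchanged; the hint that n usually
     equals 8 is lost rather than feeding the niter estimate.  */
  long m = __builtin_expect (n, 8);
  for (i = 0; i < m; i++)
    a[i] = 0;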

David

> Thanks,
>
> Changpeng
>
>
>
>
>
>>Richard.
>
Zdenek Dvorak Dec. 15, 2010, 9:22 a.m. UTC | #10
Hi,

> >>>  >why not simply change the profile updating to correctly indicate that these loops do not roll?
> >>> >That way, all the optimizations would profit, not just those aware of the new bb flag,
> >>>
> >>> Maybe my understanding is not correct. But I feel not comfortable using profile of trip count
> >>> to guard loop optimizations.
> >>
> >> it is already used that way; i.e., you do not need to change anything in the optimizations, just
> >> make sure that the edge probabilities are sensible.
> >>
> >>> For a given program, different data sizes will result in quite different
> >>> loop trip counts.
> >>
> >> That should not be the case -- for the pre/post loops generated in vectorization, we know the
> >> expected # of iterations, based on their purpose; e.g., for loops inserted so that the # of iterations
> >> is divisible by 4, we know that the loop will iterate at most three times (and probably less), etc.
> >>
> >>> By the way, what optimizations else do you think will benefit from disabling for small trip count
> >>> loops, significantly?
> >>
> >> Anything where we check whether we should optimize for speed or code size,
> >
> >>I agree with Zdenek (without having looked at the patch so far).
> >
> > I think my patch (adding a bb flag) provides a simple and yet effective solution for the unnecessary
> > code expansion problem in prefetching, unswitching, and loop unrolling. However, I don't mind
> > updating the profile information for the same purpose.
> >
> > Now, suppose we know a loop will roll at most 3 times at runtime. How should we update the profile
> > information to let the expected_loop_iterations to know this value? ( I got lost here about the
> > edge probabilities issues)
> >
> 
> 
> In general, without FDO, gcc does not estimate loop iteration
> according to the back-edge probability computed by static prediction
> (predict.c).  This is less than ideal. For instance, when
> builtin_expect is used to annotate the loop bound, the information
> will be lost.

hmmm.... I forgot about this.  OK, I withdraw my objection against the patch, although
I would suggest the following changes:
-- rename BB_PRE_POST_LOOP_HEADER to something like BB_HEADER_OF_NONROLLING_LOOP,
-- in estimate_numbers_of_iterations_loop, for loops with this flag use
   record_niter_bound (loop, double_int_two, true, false)
   to make tree-level loop optimizations know that the loop does not roll
   (see the sketch below),
-- the check for the flag in loop_prefetch_arrays should not be needed, then.
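
A rough sketch of the second change (illustrative, assuming the rename above):

  /* In estimate_numbers_of_iterations_loop: a marked non-rolling loop
     is realistically expected to iterate at most twice.  */
  if (loop->header->flags & BB_HEADER_OF_NONROLLING_LOOP)
    record_niter_bound (loop, double_int_two, true, false);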

Zdenek
Fang, Changpeng Dec. 15, 2010, 4:08 p.m. UTC | #11
Hi, 
>hmmm.... I forgot about this.  OK, I withdraw my objection against the patch, although
>I would suggest the following changes:
>-- rename BB_PRE_POST_LOOP_HEADER to something like BB_HEADER_OF_NONROLLING_LOOP,

Thanks. I will do this.

>-- in estimate_numbers_of_iterations_loop, for loops with this flag use
 >  record_niter_bound (loop, double_int_two, true, false)
 >  to make tree-level loop optimizations know that the loop does not roll,
>-- the check for the flag in loop_prefetch_arrays should not be needed, then.
>Zdenek

I have a new idea about this. How about: if the flag is on, we consider the loop as "optimize for size"?
In this way, we will treat the loop as a cold region and turn off the related optimizations on it.
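
A minimal sketch of that idea (hypothetical; the stock implementation in
predict.c simply delegates to optimize_bb_for_size_p):

  bool
  optimize_loop_for_size_p (struct loop *loop)
  {
    /* Hypothetical: treat a marked non-rolling pre-/post-loop as cold,
       so that size-sensitive optimizations back off.  */
    if (pre_or_post_loop_p (loop))
      return true;
    return optimize_bb_for_size_p (loop->header);
  }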

Thanks,

Changpeng
Zdenek Dvorak Dec. 15, 2010, 4:15 p.m. UTC | #12
Hi,

> >hmmm.... I forgot about this.  OK, I withdraw my objection against the patch, although
> >I would suggest the following changes:
> >-- rename BB_PRE_POST_LOOP_HEADER to something like BB_HEADER_OF_NONROLLING_LOOP,
> 
> Thanks. I will do this.
> 
> >-- in estimate_numbers_of_iterations_loop, for loops with this flag use
>  >  record_niter_bound (loop, double_int_two, true, false)
>  >  to make tree-level loop optimizations know that the loop does not roll,
> >-- the check for the flag in loop_prefetch_arrays should not be needed, then.
> >Zdenek
> 
> I have a new idea about this. How about: if the flag is on, we consider the loop as "optimize for size"?
> In this way, we will treat the loop as a cold region and turn off the related optimizations on it.

yes, modifying optimize_loop_for_size_p is also a good idea,

Zdenek
Xinliang David Li Dec. 15, 2010, 4:52 p.m. UTC | #13
One more thing about FDO -- using the average trip count can be misleading
too -- however, if loops are multi-versioned according to trip-count value
profiling (currently missing), it will be more precise.

David

On Wed, Dec 15, 2010 at 1:22 AM, Zdenek Dvorak <rakdver@kam.mff.cuni.cz> wrote:
> Hi,
>
>> >>>  >why not simply change the profile updating to correctly indicate that these loops do not roll?
>> >>> >That way, all the optimizations would profit, not just those aware of the new bb flag,
>> >>>
>> >>> Maybe my understanding is not correct. But I feel not comfortable using profile of trip count
>> >>> to guard loop optimizations.
>> >>
>> >> it is already used that way; i.e., you do not need to change anything in the optimizations, just
>> >> make sure that the edge probabilities are sensible.
>> >>
>> >>> For a given program, different data sizes will result in quite different
>> >>> loop trip counts.
>> >>
>> >> That should not be the case -- for the pre/post loops generated in vectorization, we know the
>> >> expected # of iterations, based on their purpose; e.g., for loops inserted so that the # of iterations
>> >> is divisible by 4, we know that the loop will iterate at most three times (and probably less), etc.
>> >>
>> >>> By the way, what optimizations else do you think will benefit from disabling for small trip count
>> >>> loops, significantly?
>> >>
>> >> Anything where we check whether we should optimize for speed or code size,
>> >
>> >>I agree with Zdenek (without having looked at the patch so far).
>> >
>> > I think my patch (adding a bb flag) provides a simple and yet effective solution for the unnecessary
>> > code expansion problem in prefetching, unswitching, and loop unrolling. However, I don't mind
>> > updating the profile information for the same purpose.
>> >
>> > Now, suppose we know a loop will roll at most 3 times at runtime. How should we update the profile
>> > information to let the expected_loop_iterations to know this value? ( I got lost here about the
>> > edge probabilities issues)
>> >
>>
>>
>> In general, without FDO, gcc does not estimate loop iteration
>> according to the back-edge probability computed by static prediction
>> (predict.c).  This is less than ideal. For instance, when
>> builtin_expect is used to annotate the loop bound, the information
>> will be lost.
>
> hmmm.... I forgot about this.  OK, I withdraw my objection against the patch, although
> I would suggest the following changes:
> -- rename BB_PRE_POST_LOOP_HEADER to something like BB_HEADER_OF_NONROLLING_LOOP,
> -- in estimate_numbers_of_iterations_loop, for loops with this flag use
>   record_niter_bound (loop, double_int_two, true, false)
>   to make tree-level loop optimizations know that the loop does not roll,
> -- the check for the flag in loop_prefetch_arrays should not be needed, then.
>
> Zdenek
>
Richard Biener Dec. 16, 2010, 11:33 a.m. UTC | #14
2010/12/15 Zdenek Dvorak <rakdver@kam.mff.cuni.cz>:
> Hi,
>
>> >>>  >why not simply change the profile updating to correctly indicate that these loops do not roll?
>> >>> >That way, all the optimizations would profit, not just those aware of the new bb flag,
>> >>>
>> >>> Maybe my understanding is not correct. But I feel not comfortable using profile of trip count
>> >>> to guard loop optimizations.
>> >>
>> >> it is already used that way; i.e., you do not need to change anything in the optimizations, just
>> >> make sure that the edge probabilities are sensible.
>> >>
>> >>> For a given program, different data sizes will result in quite different
>> >>> loop trip counts.
>> >>
>> >> That should not be the case -- for the pre/post loops generated in vectorization, we know the
>> >> expected # of iterations, based on their purpose; e.g., for loops inserted so that the # of iterations
>> >> is divisible by 4, we know that the loop will iterate at most three times (and probably less), etc.
>> >>
>> >>> By the way, what optimizations else do you think will benefit from disabling for small trip count
>> >>> loops, significantly?
>> >>
>> >> Anything where we check whether we should optimize for speed or code size,
>> >
>> >>I agree with Zdenek (without having looked at the patch so far).
>> >
>> > I think my patch (adding a bb flag) provides a simple and yet effective solution for the unnecessary
>> > code expansion problem in prefetching, unswitching, and loop unrolling. However, I don't mind
>> > updating the profile information for the same purpose.
>> >
>> > Now, suppose we know a loop will roll at most 3 times at runtime. How should we update the profile
>> > information to let the expected_loop_iterations to know this value? ( I got lost here about the
>> > edge probabilities issues)
>> >
>>
>>
>> In general, without FDO, gcc does not estimate loop iteration
>> according to the back-edge probability computed by static prediction
>> (predict.c).  This is less than ideal. For instance, when
>> builtin_expect is used to annotate the loop bound, the information
>> will be lost.
>
> hmmm.... I forgot about this.  OK, I withdraw my objection against the patch, although
> I would suggest the following changes:
> -- rename BB_PRE_POST_LOOP_HEADER to something like BB_HEADER_OF_NONROLLING_LOOP,
> -- in estimate_numbers_of_iterations_loop, for loops with this flag use
>   record_niter_bound (loop, double_int_two, true, false)
>   to make tree-level loop optimizations know that the loop does not roll,
> -- the check for the flag in loop_prefetch_arrays should not be needed, then.

Btw, it would be nice if number-of-iteration analysis would figure out an
upper bound for niter for the typical prologue loops (which have exit
tests like i < niter & CST).  It's of course more difficult for epilogues,
where we'd need to figure out the exit test and increment of a preceding
loop.
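
For example (an illustrative fragment), such an exit test bounds the trip
count by the mask itself, here at most 3 iterations:

  /* The bound is niter & 3 <= 3, so niter analysis could conclude that
     the loop body executes at most 3 times.  */
  for (i = 0; i < (niter & 3); i++)
    a[i] = b[i];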

Btw, any reason why we do not use static profiles for number of iteration
estimates?  We after all _do_ use the static profile to guide the
maybe_hot/cold_bb tests.

Richard.

> Zdenek
>
Zdenek Dvorak Dec. 16, 2010, 12:09 p.m. UTC | #15
Hi,

> Btw, any reason why we do not use static profiles for number of iteration
> estimates?  We after all _do_ use the static profile to guide the
> maybe_hot/cold_bb tests.

for loops for which we cannot determine the # of iterations statically,
basically the only important predictors are PRED_LOOP_BRANCH and
PRED_LOOP_EXIT, which predict that the loop will iterate about 10 times.  So,
by using the static profile, we would just learn that every such loop is expected
to iterate 10 times, which is kind of useless,

Zdenek
Fang, Changpeng Dec. 16, 2010, 5:22 p.m. UTC | #16
My initial intention is not to unroll prologue and epilogue loops. An estimated trip count
may not be that useful for the unrolling decision. To me, unrolling a loop that has at most
3 (or 7) iterations does not make sense. RTL unrolling does not use the estimated trip
count to determine the unroll factor, and thus it may still unroll the loop 4 or 8 times if
the loop is small (in #insns). To make things simple, we just don't unroll such loops.

However, a prologue or epilogue loop may still be a hot loop, depending on the outer 
loops. It may still be beneficial to perform other optimizations on such loops, if the code
size is not expanded multiple times.

For prefetching of prologue or epilogue loops, we have two choices: (1) prefetching but
not unrolling, or (2) not prefetching.  Which one do you prefer?

Thanks,

Changpeng
Zdenek Dvorak Dec. 16, 2010, 6:47 p.m. UTC | #17
Hi,

> For prefetching of prologue or epilogue loops, we have two choices: (1) prefetching but
> not unrolling, or (2) not prefetching.  Which one do you prefer?

it is better not to prefetch (the current placement of prefetches is not good for non-rolling
loops),

Zdenek
Richard Biener Dec. 18, 2010, 8:50 p.m. UTC | #18
On Thu, Dec 16, 2010 at 6:22 PM, Fang, Changpeng <Changpeng.Fang@amd.com> wrote:
> My initial intention is not to unroll prologue and epilogue loops. An estimated trip count
> may not be that useful for the unrolling decision. To me, unrolling a loop that has at most
> 3 (or 7) iterations does not make sense. RTL unrolling does not use the estimated trip
> count to determine the unroll factor, and thus it may still unroll the loop 4 or 8 times if
> the loop is small (in #insns). To make things simple, we just don't unroll such loops.
>
> However, a prologue or epilogue loop may still be a hot loop, depending on the outer
> loops. It may still be beneficial to perform other optimizations on such loops, if the code
> size is not expanded multiple times.
>
> For prefetching of prologue or epilogue loops, we have two choices: (1) prefetching but
> not unrolling, or (2) not prefetching.  Which one do you prefer?

For small loop bodies it might make sense to completely peel the
prologue/epilogue loops (think of vectorizing doubles, where those
loops roll at most once).  It would be nice to figure out whether (or not)
loop analysis (or later jump threading) is able to do that.
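
For instance (a sketch), with doubles and a vectorization factor of two the
epilogue rolls at most once, so complete peeling reduces it to a single
guarded statement:

  void
  vadd2 (double *a, const double *b, const double *c, int n)
  {
    int i;
    /* Vectorized main loop: 2 doubles per iteration.  */
    for (i = 0; i + 2 <= n; i += 2)
      {
        a[i]     = b[i]     + c[i];
        a[i + 1] = b[i + 1] + c[i + 1];
      }
    /* Completely peeled epilogue: it can roll at most once, so it
       becomes a single guarded statement instead of a loop.  */
    if (i < n)
      a[i] = b[i] + c[i];
  }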

Richard.

> Thanks,
>
> Changpeng
>
>
>
> ________________________________________
> From: Zdenek Dvorak [rakdver@kam.mff.cuni.cz]
> Sent: Thursday, December 16, 2010 6:09 AM
> To: Richard Guenther
> Cc: Xinliang David Li; Fang, Changpeng; gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops
>
> Hi,
>
>> Btw, any reason why we do not use static profiles for number of iteration
>> estimates?  We after all _do_ use the static profile to guide the
>> maybe_hot/cold_bb tests.
>
> for loops for which we cannot determine the # of iterations statically,
> basically the only important predictors are PRED_LOOP_BRANCH and
> PRED_LOOP_EXIT, which predict that the loop will iterate about 10 times.  So,
> by using static profile, we would just learn that every such loop is expected
> to iterate 10 times, which is kind of useless,
>
> Zdenek
>
Fang, Changpeng Dec. 21, 2010, 9:24 p.m. UTC | #19
>For small loop bodies it might make sense to completely peel the
>prologue/epilogue loops (think of vectorizing doubles, where those
>loops roll at most once).  It would be nice to figure out whether (or not)
>loop analysis (or later jump threading) is able to do that.

Hi, Richard,

This is a good point. My intention is to apply the non-rolling-loop
marking approach only to loops whose trip count cannot be
determined at compile time. So, before applying the approach, we
should first check whether the loop rolls a constant number of times.

Thanks,

Changpeng






>Richard.

> Thanks,
>
> Changpeng
>
>
>
> ________________________________________
> From: Zdenek Dvorak [rakdver@kam.mff.cuni.cz]
> Sent: Thursday, December 16, 2010 6:09 AM
> To: Richard Guenther
> Cc: Xinliang David Li; Fang, Changpeng; gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops
>
> Hi,
>
>> Btw, any reason why we do not use static profiles for number of iteration
>> estimates?  We after all _do_ use the static profile to guide the
>> maybe_hot/cold_bb tests.
>
> for loops for which we cannot determine the # of iterations statically,
> basically the only important predictors are PRED_LOOP_BRANCH and
> PRED_LOOP_EXIT, which predict that the loop will iterate about 10 times.  So,
> by using static profile, we would just learn that every such loop is expected
> to iterate 10 times, which is kind of useless,
>
> Zdenek
>

Patch

From e8636e80de4d6de8ba2dbc8f08bd2daddd02edc3 Mon Sep 17 00:00:00 2001
From: Changpeng Fang <chfang@houghton.(none)>
Date: Mon, 13 Dec 2010 12:01:49 -0800
Subject: [PATCH] Don't perform certain loop optimizations on pre/post loops

	* basic-block.h (bb_flags): Add a new flag BB_PRE_POST_LOOP_HEADER.
	* cfg.c (clear_bb_flags): Keep BB_PRE_POST_LOOP_HEADER marker.
	* cfgloop.h (mark_pre_or_post_loop): New function declaration.
	  (pre_or_post_loop_p): New function declaration.
	* loop-unroll.c (decide_unroll_runtime_iterations): Do not unroll a
	  pre- or post-loop.
	* loop-unswitch.c (unswitch_single_loop): Do not unswitch a pre- or
	  post-loop.
	* tree-ssa-loop-manip.c (tree_transform_and_unroll_loop): Mark the
	  post-loop.
	* tree-ssa-loop-niter.c (mark_pre_or_post_loop): Implement the new
	  function.  (pre_or_post_loop_p): Implement the new function.
	* tree-ssa-loop-prefetch.c (loop_prefetch_arrays): Don't prefetch
	  a pre- or post-loop.
	* tree-ssa-loop-unswitch.c (tree_ssa_unswitch_loops): Do not unswitch
	  a pre- or post-loop.
	* tree-vect-loop-manip.c (vect_do_peeling_for_loop_bound): Mark the
	  post-loop.  (vect_do_peeling_for_alignment): Mark the pre-loop.
---
 gcc/basic-block.h            |    6 +++++-
 gcc/cfg.c                    |    7 ++++---
 gcc/cfgloop.h                |    2 ++
 gcc/loop-unroll.c            |    7 +++++++
 gcc/loop-unswitch.c          |    8 ++++++++
 gcc/tree-ssa-loop-manip.c    |    3 +++
 gcc/tree-ssa-loop-niter.c    |   20 ++++++++++++++++++++
 gcc/tree-ssa-loop-prefetch.c |    7 +++++++
 gcc/tree-ssa-loop-unswitch.c |    8 ++++++++
 gcc/tree-vect-loop-manip.c   |    8 ++++++++
 10 files changed, 72 insertions(+), 4 deletions(-)

diff --git a/gcc/basic-block.h b/gcc/basic-block.h
index be0a1d1..78552fd 100644
--- a/gcc/basic-block.h
+++ b/gcc/basic-block.h
@@ -245,7 +245,11 @@  enum bb_flags
 
   /* Set on blocks that cannot be threaded through.
      Only used in cfgcleanup.c.  */
-  BB_NONTHREADABLE_BLOCK = 1 << 11
+  BB_NONTHREADABLE_BLOCK = 1 << 11,
+
+  /* Set on blocks that are headers of pre- or post-loops.  */
+  BB_PRE_POST_LOOP_HEADER = 1 << 12
+
 };
 
 /* Dummy flag for convenience in the hot/cold partitioning code.  */
diff --git a/gcc/cfg.c b/gcc/cfg.c
index c8ef799..e9b394a 100644
--- a/gcc/cfg.c
+++ b/gcc/cfg.c
@@ -425,8 +425,8 @@  redirect_edge_pred (edge e, basic_block new_pred)
   connect_src (e);
 }
 
-/* Clear all basic block flags, with the exception of partitioning and
-   setjmp_target.  */
+/* Clear all basic block flags, with the exception of partitioning,
+   setjmp_target, and the pre/post loop marker.  */
 void
 clear_bb_flags (void)
 {
@@ -434,7 +434,8 @@  clear_bb_flags (void)
 
   FOR_BB_BETWEEN (bb, ENTRY_BLOCK_PTR, NULL, next_bb)
     bb->flags = (BB_PARTITION (bb)
-		 | (bb->flags & (BB_DISABLE_SCHEDULE + BB_RTL + BB_NON_LOCAL_GOTO_TARGET)));
+		 | (bb->flags & (BB_DISABLE_SCHEDULE + BB_RTL + BB_NON_LOCAL_GOTO_TARGET
+                                 + BB_PRE_POST_LOOP_HEADER)));
 }
 
 /* Check the consistency of profile information.  We can't do that
diff --git a/gcc/cfgloop.h b/gcc/cfgloop.h
index bf2614e..ce848cc 100644
--- a/gcc/cfgloop.h
+++ b/gcc/cfgloop.h
@@ -279,6 +279,8 @@  extern rtx doloop_condition_get (rtx);
 void estimate_numbers_of_iterations_loop (struct loop *, bool);
 HOST_WIDE_INT estimated_loop_iterations_int (struct loop *, bool);
 bool estimated_loop_iterations (struct loop *, bool, double_int *);
+void mark_pre_or_post_loop (struct loop *);
+bool pre_or_post_loop_p (struct loop *);
 
 /* Loop manipulation.  */
 extern bool can_duplicate_loop_p (const struct loop *loop);
diff --git a/gcc/loop-unroll.c b/gcc/loop-unroll.c
index 67d6ea0..6f095f6 100644
--- a/gcc/loop-unroll.c
+++ b/gcc/loop-unroll.c
@@ -857,6 +857,13 @@  decide_unroll_runtime_iterations (struct loop *loop, int flags)
 	fprintf (dump_file, ";; Loop iterates constant times\n");
       return;
     }
+ 
+  if (pre_or_post_loop_p (loop))
+    {
+      if (dump_file)
+        fprintf (dump_file, ";; Not unrolling, a pre- or post-loop\n");
+      return;
+    }
 
   /* If we have profile feedback, check whether the loop rolls.  */
   if (loop->header->count && expected_loop_iterations (loop) < 2 * nunroll)
diff --git a/gcc/loop-unswitch.c b/gcc/loop-unswitch.c
index 77524d8..59373bf 100644
--- a/gcc/loop-unswitch.c
+++ b/gcc/loop-unswitch.c
@@ -276,6 +276,14 @@  unswitch_single_loop (struct loop *loop, rtx cond_checked, int num)
       return;
     }
 
+  /* Pre- or post loop usually just roll a few iterations.  */
+  if (pre_or_post_loop_p (loop))
+    {
+      if (dump_file)
+	fprintf (dump_file, ";; Not unswitching, a pre- or post loop\n");
+      return;
+    }
+
   /* We must be able to duplicate loop body.  */
   if (!can_duplicate_loop_p (loop))
     {
diff --git a/gcc/tree-ssa-loop-manip.c b/gcc/tree-ssa-loop-manip.c
index 87b2c0d..f8ddbab 100644
--- a/gcc/tree-ssa-loop-manip.c
+++ b/gcc/tree-ssa-loop-manip.c
@@ -931,6 +931,9 @@  tree_transform_and_unroll_loop (struct loop *loop, unsigned factor,
   gcc_assert (new_loop != NULL);
   update_ssa (TODO_update_ssa);
 
+  /* NEW_LOOP is a post-loop.  */
+  mark_pre_or_post_loop (new_loop);
+
   /* Determine the probability of the exit edge of the unrolled loop.  */
   new_est_niter = est_niter / factor;
 
diff --git a/gcc/tree-ssa-loop-niter.c b/gcc/tree-ssa-loop-niter.c
index ee85f6f..33e8cc3 100644
--- a/gcc/tree-ssa-loop-niter.c
+++ b/gcc/tree-ssa-loop-niter.c
@@ -3011,6 +3011,26 @@  estimate_numbers_of_iterations (bool use_undefined_p)
   fold_undefer_and_ignore_overflow_warnings ();
 }
 
+/* Mark LOOP as a pre- or post loop.  */
+
+void
+mark_pre_or_post_loop (struct loop *loop)
+{
+  gcc_assert (loop && loop->header);
+  loop->header->flags |= BB_PRE_POST_LOOP_HEADER;
+}
+
+/* Return true if LOOP is a pre- or post loop.  */
+
+bool
+pre_or_post_loop_p (struct loop *loop)
+{
+  int masked_flags;
+  gcc_assert (loop && loop->header);
+  masked_flags = (loop->header->flags & BB_PRE_POST_LOOP_HEADER);
+  return (masked_flags != 0);
+}
+
 /* Returns true if statement S1 dominates statement S2.  */
 
 bool
diff --git a/gcc/tree-ssa-loop-prefetch.c b/gcc/tree-ssa-loop-prefetch.c
index 59c65d3..5c9f640 100644
--- a/gcc/tree-ssa-loop-prefetch.c
+++ b/gcc/tree-ssa-loop-prefetch.c
@@ -1793,6 +1793,13 @@  loop_prefetch_arrays (struct loop *loop)
       return false;
     }
 
+  if (pre_or_post_loop_p (loop))
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "  Not Prefetching -- pre- or post loop\n");
+      return false;
+    }
+
   /* FIXME: the time should be weighted by the probabilities of the blocks in
      the loop body.  */
   time = tree_num_loop_insns (loop, &eni_time_weights);
diff --git a/gcc/tree-ssa-loop-unswitch.c b/gcc/tree-ssa-loop-unswitch.c
index b6b32dc..f3b8108 100644
--- a/gcc/tree-ssa-loop-unswitch.c
+++ b/gcc/tree-ssa-loop-unswitch.c
@@ -88,6 +88,14 @@  tree_ssa_unswitch_loops (void)
       if (dump_file && (dump_flags & TDF_DETAILS))
         fprintf (dump_file, ";; Considering loop %d\n", loop->num);
 
+     /* Do not unswitch a pre- or post loop.  */
+     if (pre_or_post_loop_p (loop))
+       {
+          if (dump_file && (dump_flags & TDF_DETAILS))
+            fprintf (dump_file, ";; Not unswitching, a pre- or post loop\n");
+          continue;
+        }
+
       /* Do not unswitch in cold regions. */
       if (optimize_loop_for_size_p (loop))
         {
diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index 6ecd304..9a63f7e 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -1938,6 +1938,10 @@  vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo, tree *ratio,
 					    cond_expr, cond_expr_stmt_list);
   gcc_assert (new_loop);
   gcc_assert (loop_num == loop->num);
+  
+  /* NEW_LOOP is a post loop.  */
+  mark_pre_or_post_loop (new_loop);
+
 #ifdef ENABLE_CHECKING
   slpeel_verify_cfg_after_peeling (loop, new_loop);
 #endif
@@ -2191,6 +2195,10 @@  vect_do_peeling_for_alignment (loop_vec_info loop_vinfo)
 				   th, true, NULL_TREE, NULL);
 
   gcc_assert (new_loop);
+
+  /* NEW_LOOP is a pre-loop.  */
+  mark_pre_or_post_loop (new_loop);
+  
 #ifdef ENABLE_CHECKING
   slpeel_verify_cfg_after_peeling (new_loop, loop);
 #endif
-- 
1.6.3.3