
[RFC] Speed-up -fprofile-update=atomic

Message ID: 0dc8a029-696a-39b0-96cc-ab44488607eb@suse.cz
State: New

Commit Message

Martin Liška Oct. 24, 2016, 12:09 p.m. UTC
On 10/17/2016 02:03 PM, Richard Biener wrote:
> On Mon, Oct 17, 2016 at 1:46 PM, Martin Liška <mliska@suse.cz> wrote:
>> On 10/13/2016 11:43 AM, Richard Biener wrote:
>>> On Wed, Oct 12, 2016 at 3:52 PM, Martin Liška <mliska@suse.cz> wrote:
>>>> On 10/04/2016 11:45 AM, Richard Biener wrote:
>>>>> On Thu, Sep 15, 2016 at 12:00 PM, Martin Liška <mliska@suse.cz> wrote:
>>>>>> On 09/07/2016 02:09 PM, Richard Biener wrote:
>>>>>>> On Wed, Sep 7, 2016 at 1:37 PM, Martin Liška <mliska@suse.cz> wrote:
>>>>>>>> On 08/18/2016 06:06 PM, Richard Biener wrote:
>>>>>>>>> On August 18, 2016 5:54:49 PM GMT+02:00, Jakub Jelinek <jakub@redhat.com> wrote:
>>>>>>>>>> On Thu, Aug 18, 2016 at 08:51:31AM -0700, Andi Kleen wrote:
>>>>>>>>>>>> I'd prefer to make updates atomic in multi-threaded applications.
>>>>>>>>>>>> The best proxy we have for that is -pthread.
>>>>>>>>>>>>
>>>>>>>>>>>> Is it slower, most definitely, but odds are we're giving folks
>>>>>>>>>>>> garbage data otherwise, which in many ways is even worse.
>>>>>>>>>>>
>>>>>>>>>>> It will likely be catastrophically slower in some cases.
>>>>>>>>>>>
>>>>>>>>>>> Catastrophically as in too slow to be usable.
>>>>>>>>>>>
>>>>>>>>>>> An atomic instruction is a lot more expensive than a single
>>>>>>>>>> increment. Also
>>>>>>>>>>> they sometimes are really slow depending on the state of the machine.
>>>>>>>>>>
>>>>>>>>>> Can't we just have thread-local copies of all the counters (perhaps
>>>>>>>>>> using
>>>>>>>>>> __thread pointer as base) and just atomically merge at thread
>>>>>>>>>> termination?
>>>>>>>>>
>>>>>>>>> I suggested that as well but of course it'll have its own class of issues (short lived threads, so we need to somehow re-use counters from terminated threads, large number of threads and thus using too much memory for the counters)
>>>>>>>>>
>>>>>>>>> Richard.
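A minimal sketch of that suggestion, purely for illustration (not an existing
implementation): each thread keeps a private copy of the counters and merges
it atomically into the global ones when the thread terminates.

  /* Illustrative only: per-thread shadow counters merged at thread exit.
     N_COUNTERS and __gcov_counters are made-up stand-ins for the real
     per-function counter arrays.  */
  #define N_COUNTERS 64
  extern long long __gcov_counters[N_COUNTERS];
  static __thread long long local_counters[N_COUNTERS];

  static void
  merge_counters_on_thread_exit (void)
  {
    for (unsigned i = 0; i < N_COUNTERS; i++)
      if (local_counters[i] != 0)
        __atomic_fetch_add (&__gcov_counters[i], local_counters[i],
                            __ATOMIC_RELAXED);
  }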
>>>>>>>>
>>>>>>>> Hello.
>>>>>>>>
>>>>>>>> I've put writing that approach on my TODO list; let's see whether it's doable in a reasonable amount of time.
>>>>>>>>
>>>>>>>> I've just finished some measurements to illustrate slow-down of -fprofile-update=atomic approach.
>>>>>>>> All numbers are: no profile, -fprofile-generate, -fprofile-generate -fprofile-update=atomic
>>>>>>>> c-ray benchmark (utilizing 8 threads, -O3): 1.7, 15.5, 38.1s
>>>>>>>> unrar (utilizing 8 threads, -O3): 3.6, 11.6, 38s
>>>>>>>> tramp3d (1 thread, -O3): 18.0, 46.6, 168s
>>>>>>>>
>>>>>>>> So the slow-down is roughly 300% compared to -fprofile-generate. I don't have much experience with default option
>>>>>>>> selection, but these numbers can probably help.
>>>>>>>>
>>>>>>>> Thoughts?
>>>>>>>
>>>>>>> Look at the generated code for an instrumented simple loop and see that for
>>>>>>> the non-atomic updates we happily apply store-motion to the counter update
>>>>>>> and thus we only get one counter update per loop exit rather than one per
>>>>>>> loop iteration.  Now see what happens for the atomic case (I suspect you
>>>>>>> get one per iteration).
>>>>>>>
>>>>>>> I'll bet this accounts for most of the slowdown.
>>>>>>>
>>>>>>> Back in time ICC which had atomic counter updates (but using function
>>>>>>> calls - ugh!) had a > 1000% overhead with FDO for tramp3d (they also
>>>>>>> didn't have early inlining -- removing abstraction helps reducing the number
>>>>>>> of counters significantly).
>>>>>>>
>>>>>>> Richard.
>>>>>>
>>>>>> Hi.
>>>>>>
>>>>>> During Cauldron I discussed with Richi approaches to speed up ARCS
>>>>>> profile counter updates. My first attempt is to utilize TLS storage, where
>>>>>> every function accumulates its arcs counters. These are eventually added
>>>>>> (using atomic operations) to the global ones at the very end of the function.
>>>>>> Currently I rely on target support for TLS; it's questionable whether to
>>>>>> have such a requirement for -fprofile-update=atomic, or to add a new option
>>>>>> value like -fprofile-update=atomic-tls?
>>>>>>
>>>>>> Running the patch on tramp3d, compared to the previous numbers, it takes 88s to finish.
>>>>>> That is 50% of the time of the current implementation.
>>>>>>
>>>>>> Thoughts?
>>>>>
>>>>> Hmm, I thought I suggested that you can simply use automatic storage
>>>>> (which effectively
>>>>> is TLS...) for regions that are not forked or abnormally left (which
>>>>> means SESE regions
>>>>> that have no calls that eventually terminate or throw externally).
>>>>>
>>>>> So why did you end up with TLS?
>>>>
>>>> Hi.
>>>>
>>>> Using TLS does not make sense, stupid mistake ;)
>>>>
>>>> By using SESE regions, do you mean the infrastructure that is utilized
>>>> by Graphite machinery?
>>>
>>> No, just as "single-entry single-exit region" which means placing of
>>> initializations of the internal counters to zero and the updates of the
>>> actual counters is "obvious".
>>>
>>> Note that this "optimization" isn't one if the SESE region does not contain
>>> cycle(s).  Unless there is a way to do an atomic update of a bunch of
>>> counters faster than doing them separately.  This optimization will also
>>> increase register pressure (or force the internal counters to the stack).
>>> Thus selecting which counters to "optimize" and which ones to leave in place
>>> might be necessary.
>>
>> Ok, I must admit that the selection of which counters to optimize is crucial. The current implementation
>> (atomic increments at places where a BB is reached) has the advantage that it does not increase
>> register pressure and does not update a global arcs counter when the BB is not visited.
>> On the other hand, having local counters which are updated at function exit (my current implementation)
>> possibly updates counters for BBs that are never reached, and it creates a huge memory-locking hot spot
>> if multiple threads call a function very often. This is a perf report of the c-ray benchmark:
>>
>>        │             if((d = SQ(b) - 4.0 * a * c) < 0.0) return 0;
>>   0.01 │3c0:   xor    %ecx,%ecx
>>   0.00 │3c2:   mov    0x110(%rsp),%rdx
>>   0.00 │       lock   add    %rdx,__gcov0.ray_sphere
>>   7.96 │       mov    0x118(%rsp),%rdx
>>        │       lock   add    %rdx,__gcov0.ray_sphere+0x8
>>  11.39 │       mov    0x120(%rsp),%rdx
>>        │       lock   add    %rdx,__gcov0.ray_sphere+0x10
>>  11.09 │       mov    0x128(%rsp),%rdx
>>   0.00 │       lock   add    %rdx,__gcov0.ray_sphere+0x18
>>  11.27 │       mov    0x130(%rsp),%rdx
>>        │       lock   add    %rdx,__gcov0.ray_sphere+0x20
>>  11.02 │       mov    0x138(%rsp),%rdx
>>        │       lock   add    %rdx,__gcov0.ray_sphere+0x28
>>  11.46 │       mov    0x140(%rsp),%rdx
>>   0.00 │       lock   add    %rdx,__gcov0.ray_sphere+0x30
>>  11.84 │       mov    0x148(%rsp),%rdx
>>   0.00 │       lock   add    %rdx,__gcov0.ray_sphere+0x38
>>  11.57 │       mov    0x150(%rsp),%rdx
>>        │       lock   add    %rdx,__gcov0.ray_sphere+0x40
>>   6.86 │       mov    0x158(%rsp),%rdx
>>        │       lock   add    %rdx,__gcov0.ray_sphere+0x48
>>
>> My current approach does an atomic increment when maybe_hot_bb_p returns false,
>> and local counters are used otherwise. The question is how to find better places
>> to initialize and store the local counter values?
> 
> Given that the main reason for using local counters is loops, you can
> use loop information and put counter initializations in the loop
> preheader and stores on the loop exit edges (basically what loop store
> motion does w/o atomic updates).  Store motion knows to ignore any
> aliasing with counters and thus is _very_ aggressive with doing this
> (including all loop nests, but of course not across possible (abnormal)
> function exits).
> 
> While profile instrumentation is quite late, it is of course still
> before IPA inlining.  Thus loop store motion w/o atomic updates possibly
> still sees more opportunities than this.  I wonder if we might want to
> leave the optimization to the regular optimizers by only lowering the
> actual counter kind late and starting with some
> IFN_UPDATE_COVERAGE_COUNTER which passes could handle (and merge).
> 
> Anyway, I'd go for the loop trick and simply handle counter motion from
> innermost loops only.  The final update place would be the common
> post-dominator of all exits (if you want to handle multiple exits).  As
> said, you need to watch out for function termination that is not
> reflected by the CFG (external throws, exits in callees, etc.).
> 
> Richard.

Hello Richard.

I've just finished my patch, where I came up with a new internal fn (UPDATE_COVERAGE_COUNTER).
The function is generated by the profile pass and handled by the lim pass. I originally tried to
support the internal fn directly in the lim machinery, but doing a specific loop motion for it turned out to be easier to work with.
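For illustration, the intended effect on the emitted instrumentation is roughly
the following (a hand-written sketch, not actual compiler output; __gcov0_foo
and the index are made-up stand-ins for the real counter):

  /* -fprofile-update=atomic today: one atomic RMW per loop iteration.  */
  for (int i = 0; i < n; i++)
    {
      __atomic_fetch_add (&__gcov0_foo[3], 1, __ATOMIC_RELAXED);
      body (i);
    }

  /* After hoisting UPDATE_COVERAGE_COUNTER out of the loop: a cheap local
     counter inside the loop and a single (atomic) add on the exit edge.  */
  long long tmp = 0;
  for (int i = 0; i < n; i++)
    {
      tmp++;
      body (i);
    }
  __atomic_fetch_add (&__gcov0_foo[3], tmp, __ATOMIC_RELAXED);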

With the patch applied, tramp3d runs 1.8x slower with -fprofile-update=atomic compared to -fprofile-update=single.

Patch can bootstrap on ppc64le-redhat-linux and survives regression tests.
Thoughts?

Thanks,
Martin
> 
>> Ideas welcomed.
>> Thanks,
>> Martin
>>
>>>
>>> Richard.
>>>
>>>> Thanks,
>>>> Martin
>>>>
>>>>>
>>>>> Richard.
>>>>>
>>>>>> Martin
>>>>>>
>>>>>>>
>>>>>>>> Martin
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>      Jakub
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>
>>

Comments

Martin Liška Oct. 25, 2016, 2:31 p.m. UTC | #1
On 10/24/2016 03:51 PM, Richard Biener wrote:

> It's quite ad-hoc :/  The IFN will also be a memory optimization barrier
> unless you add special support for it in the alias oracle - so the
> performance measurement needs to be taken with a grain of salt (same is
> true for all atomics of course... - I have some local patches to improve
> things here).

Good, then please ping me with the patches you have and I'll integrate them.

> 
> The way you implement process_sm_for_coverage_counter is more like a
> final value replacement.  You could maybe use create_iv for the loop
> counter, or even wind up computing the final value (number of
> iterations) only after the loop, avoiding the IV completely (eventually
> SCEV cprop saves you here afterwards).

Or maybe we can simply pass loop->niter as the argument of the
UPDATE_COVERAGE_COUNTER function?
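A rough sketch of that idea, in pseudo-C (hypothetical, not what the attached
patch does; UPDATE_COVERAGE_COUNTER stands for the internal function and
__gcov0_foo for the real counter):

  /* With an IV maintained inside the loop:  */
  long long tmp = 0;
  for (int i = 0; i < n; i++)
    {
      tmp++;
      body (i);
    }
  UPDATE_COVERAGE_COUNTER (&__gcov0_foo[3], tmp);

  /* Final-value replacement: the iteration count is passed directly, no IV.  */
  for (int i = 0; i < n; i++)
    body (i);
  UPDATE_COVERAGE_COUNTER (&__gcov0_foo[3], n);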

Martin

> 
> Richard.
Richard Biener Oct. 26, 2016, 9:28 a.m. UTC | #2
On Tue, Oct 25, 2016 at 4:31 PM, Martin Liška <mliska@suse.cz> wrote:
> On 10/24/2016 03:51 PM, Richard Biener wrote:
>
>> It's quite ad-hoc :/  The IFN will also be a memory optimization
>> barrier unless you add special support
>> for it in the alias oracle - so the performance measurement needs to
>> be taken with a grain of salt
>> (same is true for all atomics of course... - I have some local patches
>> to improve things here).
>
> Good, thus please ping me with the patches you have and I'll integrate it.
>
>>
>> The way you implement process_sm_for_coverage_counter is more like a
>> final value replacement.
>> You could maybe use create_iv for the loop counter or even wind up
>> computing the final value
>> (number of iterations) only after the loop, avoiding the IV completely
>> (eventually SCEV cprop
>> saves you here afterwards).
>
> Or maybe we can basically assign loop->niter as the argument of UPDATE_COVERAGE_COUNTER
> function?

Yes, that's what I said.

Richard.

> Martin
>
>>
>> Richard.
>
Richard Biener Oct. 26, 2016, 9:31 a.m. UTC | #3
On Wed, Oct 26, 2016 at 11:28 AM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On Tue, Oct 25, 2016 at 4:31 PM, Martin Liška <mliska@suse.cz> wrote:
>> On 10/24/2016 03:51 PM, Richard Biener wrote:
>>
>>> It's quite ad-hoc :/  The IFN will also be a memory optimization
>>> barrier unless you add special support
>>> for it in the alias oracle - so the performance measurement needs to
>>> be taken with a grain of salt
>>> (same is true for all atomics of course... - I have some local patches
>>> to improve things here).
>>
>> Good, thus please ping me with the patches you have and I'll integrate it.

This is what I have in my tree (apparently only points-to changes; I
suppose general alias changes will be controversial as the builtins would
lose their "compiler memory barrier" behavior).

Richard.

>>>
>>> The way you implement process_sm_for_coverage_counter is more like a
>>> final value replacement.
>>> You could maybe use create_iv for the loop counter or even wind up
>>> computing the final value
>>> (number of iterations) only after the loop, avoiding the IV completely
>>> (eventually SCEV cprop
>>> saves you here afterwards).
>>
>> Or maybe we can basically assign loop->niter as the argument of UPDATE_COVERAGE_COUNTER
>> function?
>
> Yes, that's what I said.
>
> Richard.
>
>> Martin
>>
>>>
>>> Richard.
>>

Patch

From a9e76ff1bc8b69e23a50d8cdc9556270cf7b269d Mon Sep 17 00:00:00 2001
From: marxin <mliska@suse.cz>
Date: Fri, 21 Oct 2016 13:26:19 +0200
Subject: [PATCH] Introduce loop store motion for UPDATE_COVERAGE_COUNTER

gcc/testsuite/ChangeLog:

2016-10-24  Martin Liska  <mliska@suse.cz>

	* gcc.dg/tree-ssa/ssa-lim-11.c: Update scanned pattern.

gcc/ChangeLog:

2016-10-24  Martin Liska  <mliska@suse.cz>

	* Makefile.in: Add new file profile-expand.c.
	* internal-fn.c (expand_UPDATE_COVERAGE_COUNTER): New IFN.
	* internal-fn.def: Likewise.
	* passes.def: Add new pass profile_expand.
	* profile-expand.c: New file.
	* tree-pass.h (make_pass_profile_expand): Declare a new
	function.
	* tree-profile.c (gimple_gen_edge_profiler): Generate the new
	internal fn.
	* tree-ssa-loop-im.c (loop_suitable_for_sm): Move to header
	file.
	(move_coverage_counter_update): New function.
	(process_sm_for_coverage_counter): Likewise.
	(pass_lim::execute): Call invariant motion for
	UPDATE_COVERAGE_COUNTER internal functions.
	* tree-ssa-loop.h: Move the function here from
	tree-ssa-loop-im.c.
	* value-prof.h: Declare expand_coverage_counter_ifns.
---
 gcc/Makefile.in                            |   1 +
 gcc/internal-fn.c                          |   6 ++
 gcc/internal-fn.def                        |   2 +
 gcc/passes.def                             |   1 +
 gcc/profile-expand.c                       | 143 ++++++++++++++++++++++++++
 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-11.c |   3 +-
 gcc/tree-pass.h                            |   1 +
 gcc/tree-profile.c                         |  37 +------
 gcc/tree-ssa-loop-im.c                     | 157 +++++++++++++++++++++++++----
 gcc/tree-ssa-loop.h                        |  18 ++++
 gcc/value-prof.h                           |   1 +
 11 files changed, 315 insertions(+), 55 deletions(-)
 create mode 100644 gcc/profile-expand.c

diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index c512cd7..9bdb406 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -1404,6 +1404,7 @@  OBJS = \
 	print-rtl-function.o \
 	print-tree.o \
 	profile.o \
+	profile-expand.o \
 	real.o \
 	realmpfr.o \
 	recog.o \
diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c
index 0b32d5f..557d373 100644
--- a/gcc/internal-fn.c
+++ b/gcc/internal-fn.c
@@ -254,6 +254,12 @@  expand_FALLTHROUGH (internal_fn, gcall *call)
 	    "invalid use of attribute %<fallthrough%>");
 }
 
+static void
+expand_UPDATE_COVERAGE_COUNTER (internal_fn, gcall *)
+{
+  gcc_unreachable ();
+}
+
 /* Helper function for expand_addsub_overflow.  Return 1
    if ARG interpreted as signed in its precision is known to be always
    positive or 2 if ARG is known to be always negative, or 3 if ARG may
diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
index d4fbdb2..348fc2f 100644
--- a/gcc/internal-fn.def
+++ b/gcc/internal-fn.def
@@ -198,6 +198,8 @@  DEF_INTERNAL_FN (ATOMIC_COMPARE_EXCHANGE, ECF_LEAF | ECF_NOTHROW, NULL)
 /* To implement [[fallthrough]].  */
 DEF_INTERNAL_FN (FALLTHROUGH, ECF_LEAF | ECF_NOTHROW, NULL)
 
+DEF_INTERNAL_FN (UPDATE_COVERAGE_COUNTER, ECF_LEAF | ECF_NOTHROW, NULL)
+
 #undef DEF_INTERNAL_INT_FN
 #undef DEF_INTERNAL_FLT_FN
 #undef DEF_INTERNAL_OPTAB_FN
diff --git a/gcc/passes.def b/gcc/passes.def
index 1375254..4a22860 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -386,6 +386,7 @@  along with GCC; see the file COPYING3.  If not see
   NEXT_PASS (pass_asan_O0);
   NEXT_PASS (pass_tsan_O0);
   NEXT_PASS (pass_sanopt);
+  NEXT_PASS (pass_profile_expand);
   NEXT_PASS (pass_cleanup_eh);
   NEXT_PASS (pass_lower_resx);
   NEXT_PASS (pass_nrv);
diff --git a/gcc/profile-expand.c b/gcc/profile-expand.c
new file mode 100644
index 0000000..317fe1f
--- /dev/null
+++ b/gcc/profile-expand.c
@@ -0,0 +1,143 @@ 
+/* Profile expand pass.
+   Copyright (C) 2003-2016 Free Software Foundation, Inc.
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify it
+under the terms of the GNU General Public License as published by the
+Free Software Foundation; either version 3, or (at your option) any
+later version.
+
+GCC is distributed in the hope that it will be useful, but WITHOUT
+ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.  If not see
+<http://www.gnu.org/licenses/>.  */
+
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "memmodel.h"
+#include "backend.h"
+#include "target.h"
+#include "tree.h"
+#include "gimple.h"
+#include "cfghooks.h"
+#include "tree-pass.h"
+#include "ssa.h"
+#include "coverage.h"
+#include "varasm.h"
+#include "tree-nested.h"
+#include "gimplify.h"
+#include "gimple-iterator.h"
+#include "gimplify-me.h"
+#include "tree-cfg.h"
+#include "tree-into-ssa.h"
+#include "value-prof.h"
+#include "profile.h"
+#include "tree-cfgcleanup.h"
+#include "params.h"
+
+void
+expand_coverage_counter_ifns (void)
+{
+  basic_block bb;
+  tree f = builtin_decl_explicit (LONG_LONG_TYPE_SIZE > 32
+				  ? BUILT_IN_ATOMIC_FETCH_ADD_8:
+				  BUILT_IN_ATOMIC_FETCH_ADD_4);
+
+  FOR_EACH_BB_FN (bb, cfun)
+    {
+      gimple_stmt_iterator gsi;
+      for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+	{
+	  gimple *stmt = gsi_stmt (gsi);
+	  if (gimple_call_internal_p (stmt, IFN_UPDATE_COVERAGE_COUNTER))
+	    {
+	      tree addr = gimple_call_arg (stmt, 0);
+	      tree value = gimple_call_arg (stmt, 1);
+	      if (flag_profile_update == PROFILE_UPDATE_ATOMIC)
+		{
+		  gcall *stmt
+		    = gimple_build_call (f, 3, addr, value,
+					 build_int_cst (integer_type_node,
+							MEMMODEL_RELAXED));
+		  gsi_replace (&gsi, stmt, true);
+		}
+	      else
+		{
+		  gcc_assert (TREE_CODE (addr) == ADDR_EXPR);
+		  tree ref = TREE_OPERAND (addr, 0);
+		  tree gcov_type_tmp_var
+		    = make_temp_ssa_name (get_gcov_type (), NULL,
+					  "PROF_edge_counter");
+		  gassign *stmt1 = gimple_build_assign (gcov_type_tmp_var, ref);
+		  gcov_type_tmp_var
+		    = make_temp_ssa_name (get_gcov_type (), NULL,
+					  "PROF_edge_counter");
+		  gassign *stmt2
+		    = gimple_build_assign (gcov_type_tmp_var, PLUS_EXPR,
+					   gimple_assign_lhs (stmt1), value);
+		  gassign *stmt3
+		    = gimple_build_assign (unshare_expr (ref),
+					   gimple_assign_lhs (stmt2));
+		  gsi_insert_seq_before (&gsi, stmt1, GSI_SAME_STMT);
+		  gsi_insert_seq_before (&gsi, stmt2, GSI_SAME_STMT);
+		  gsi_replace (&gsi, stmt3, GSI_SAME_STMT);
+		}
+	    }
+	}
+    }
+}
+
+/* Profile expand pass.  */
+
+namespace {
+
+const pass_data pass_data_profile_expand =
+{
+  GIMPLE_PASS, /* type */
+  "profile_expand", /* name */
+  OPTGROUP_LOOP, /* optinfo_flags */
+  TV_LIM, /* tv_id */
+  PROP_cfg, /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  TODO_update_ssa, /* todo_flags_finish */
+};
+
+class pass_profile_expand : public gimple_opt_pass
+{
+public:
+  pass_profile_expand (gcc::context *ctxt)
+    : gimple_opt_pass (pass_data_profile_expand, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  opt_pass * clone () { return new pass_profile_expand (m_ctxt); }
+  virtual bool gate (function *) { return true; }
+  virtual unsigned int execute (function *);
+
+}; // class pass_profile_expand
+
+unsigned int
+pass_profile_expand::execute (function *)
+{
+  expand_coverage_counter_ifns ();
+
+  return 0;
+}
+
+} // anon namespace
+
+gimple_opt_pass *
+make_pass_profile_expand (gcc::context *ctxt)
+{
+  return new pass_profile_expand (ctxt);
+}
+
+
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-11.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-11.c
index e4c11aa..4c14e24 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-11.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-11.c
@@ -22,4 +22,5 @@  void access_buf(struct thread_param* p)
     }
 }
 
-/* { dg-final { scan-tree-dump-times "Executing store motion of __gcov0.access_buf\\\[\[01\]\\\] from loop 1" 2 "lim2" } } */
+/* { dg-final { scan-tree-dump-times "Executing store motion of &__gcov0.access_buf\\\[\[01\]\\\] from loop 1" 1 "lim2" } } */
+/* { dg-final { scan-tree-dump-times "Executing store motion of &__gcov0.access_buf\\\[\[01\]\\\] from loop 2" 1 "lim2" } } */
diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index 5903fde..ac919f8 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -365,6 +365,7 @@  extern gimple_opt_pass *make_pass_fix_loops (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_tree_loop (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_tree_no_loop (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_tree_loop_init (gcc::context *ctxt);
+extern gimple_opt_pass *make_pass_profile_expand (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_lim (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_tree_unswitch (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_predcom (gcc::context *ctxt);
diff --git a/gcc/tree-profile.c b/gcc/tree-profile.c
index abeee92..288d38c 100644
--- a/gcc/tree-profile.c
+++ b/gcc/tree-profile.c
@@ -252,40 +252,13 @@  gimple_init_edge_profiler (void)
 void
 gimple_gen_edge_profiler (int edgeno, edge e)
 {
-  tree one;
+  tree one = build_int_cst (gcov_type_node, 1);
 
-  one = build_int_cst (gcov_type_node, 1);
-
-  if (flag_profile_update == PROFILE_UPDATE_ATOMIC)
-    {
-      /* __atomic_fetch_add (&counter, 1, MEMMODEL_RELAXED); */
-      tree addr = tree_coverage_counter_addr (GCOV_COUNTER_ARCS, edgeno);
-      tree f = builtin_decl_explicit (LONG_LONG_TYPE_SIZE > 32
-				      ? BUILT_IN_ATOMIC_FETCH_ADD_8:
-				      BUILT_IN_ATOMIC_FETCH_ADD_4);
-      gcall *stmt = gimple_build_call (f, 3, addr, one,
-				       build_int_cst (integer_type_node,
-						      MEMMODEL_RELAXED));
-      gsi_insert_on_edge (e, stmt);
-    }
-  else
-    {
-      tree ref = tree_coverage_counter_ref (GCOV_COUNTER_ARCS, edgeno);
-      tree gcov_type_tmp_var = make_temp_ssa_name (gcov_type_node,
-						   NULL, "PROF_edge_counter");
-      gassign *stmt1 = gimple_build_assign (gcov_type_tmp_var, ref);
-      gcov_type_tmp_var = make_temp_ssa_name (gcov_type_node,
-					      NULL, "PROF_edge_counter");
-      gassign *stmt2 = gimple_build_assign (gcov_type_tmp_var, PLUS_EXPR,
-					    gimple_assign_lhs (stmt1), one);
-      gassign *stmt3 = gimple_build_assign (unshare_expr (ref),
-					    gimple_assign_lhs (stmt2));
-      gsi_insert_on_edge (e, stmt1);
-      gsi_insert_on_edge (e, stmt2);
-      gsi_insert_on_edge (e, stmt3);
-    }
+  tree addr = tree_coverage_counter_addr (GCOV_COUNTER_ARCS, edgeno);
+  gcall *stmt = gimple_build_call_internal (IFN_UPDATE_COVERAGE_COUNTER,
+					    2, addr, one);
+  gsi_insert_on_edge (e, stmt);
 }
-
 /* Emits code to get VALUE to instrument at GSI, and returns the
    variable containing the value.  */
 
diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
index 463db04..0e04250 100644
--- a/gcc/tree-ssa-loop-im.c
+++ b/gcc/tree-ssa-loop-im.c
@@ -44,6 +44,7 @@  along with GCC; see the file COPYING3.  If not see
 #include "trans-mem.h"
 #include "gimple-fold.h"
 #include "tree-scalar-evolution.h"
+#include "coverage.h"
 
 /* TODO:  Support for predicated code motion.  I.e.
 
@@ -2281,24 +2282,6 @@  find_refs_for_sm (struct loop *loop, bitmap sm_executed, bitmap refs_to_sm)
     }
 }
 
-/* Checks whether LOOP (with exits stored in EXITS array) is suitable
-   for a store motion optimization (i.e. whether we can insert statement
-   on its exits).  */
-
-static bool
-loop_suitable_for_sm (struct loop *loop ATTRIBUTE_UNUSED,
-		      vec<edge> exits)
-{
-  unsigned i;
-  edge ex;
-
-  FOR_EACH_VEC_ELT (exits, i, ex)
-    if (ex->flags & (EDGE_ABNORMAL | EDGE_EH))
-      return false;
-
-  return true;
-}
-
 /* Try to perform store motion for all memory references modified inside
    LOOP.  SM_EXECUTED is the bitmap of the memory references for that
    store motion was executed in one of the outer loops.  */
@@ -2556,6 +2539,132 @@  tree_ssa_lim (void)
   return todo;
 }
 
+/* Move coverage counter update internal function, pointed by GSI iterator,
+   out of a loop LOOP.  */
+
+static void
+move_coverage_counter_update (gimple_stmt_iterator *gsi, struct loop *loop)
+{
+  gimple *call = gsi_stmt (*gsi);
+  tree type = get_gcov_type ();
+
+  vec<edge> exits = get_loop_exit_edges (loop);
+  if (!loop_suitable_for_sm (loop, exits))
+    {
+      exits.release ();
+      return;
+    }
+
+  /* Verify that BB of the CALL statement post-dominates all exits.  */
+  for (unsigned i = 0; i < exits.length (); i++)
+    {
+      edge exit = exits[i];
+      if (!dominated_by_p (CDI_POST_DOMINATORS, call->bb, exit->src))
+	{
+	  exits.release ();
+	  return;
+	}
+    }
+
+  if (exits.is_empty ())
+    return;
+
+  edge preheader = loop_preheader_edge (loop);
+  if (!single_succ_p (preheader->src)
+      || preheader->dest != call->bb)
+    return;
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    {
+      fprintf (dump_file, "Executing store motion of ");
+      print_generic_expr (dump_file, gimple_call_arg (call, 0), 0);
+      fprintf (dump_file, " from loop %d\n", loop->num);
+    }
+
+  tree preheader_var = make_temp_ssa_name (type, NULL, "PROF_edge_counter");
+  gimple *stmt = gimple_build_assign (preheader_var, build_int_cst (type, 0));
+  gimple_stmt_iterator it = gsi_last_bb (preheader->src);
+  gsi_insert_after (&it, stmt, GSI_NEW_STMT);
+
+  tree loop_var1 = make_temp_ssa_name (type, NULL, "PROF_edge_counter");
+  tree loop_var2 = make_temp_ssa_name (type, NULL, "PROF_edge_counter");
+
+  gphi *phi = create_phi_node (loop_var1, call->bb);
+  add_phi_arg (phi, preheader_var, preheader, UNKNOWN_LOCATION);
+
+  stmt = gimple_build_assign (loop_var2, PLUS_EXPR, loop_var1,
+			      build_int_cst (type, 1));
+  gsi_insert_before (gsi, stmt, GSI_SAME_STMT);
+
+  edge e;
+  edge_iterator ei;
+  FOR_EACH_EDGE (e, ei, call->bb->preds)
+    if (e != preheader)
+      add_phi_arg (phi, loop_var2, e, UNKNOWN_LOCATION);
+
+  tree updated_value = make_temp_ssa_name (type, NULL, "PROF_edge_counter");
+
+  for (unsigned i = 0; i < exits.length (); i++)
+    {
+      edge exit = exits[i];
+      if (!dominated_by_p (CDI_DOMINATORS, exit->dest, exit->src))
+	{
+	  basic_block new_bb = split_edge (exit);
+	  set_immediate_dominator (CDI_DOMINATORS, new_bb, exit->src);
+	  e = single_pred_edge (new_bb);
+	}
+      else
+	e = exit;
+
+      basic_block bb = e->dest;
+      phi = create_phi_node (updated_value, e->dest);
+      add_phi_arg (phi, loop_var2, e, UNKNOWN_LOCATION);
+
+      tree addr = unshare_expr (gimple_call_arg (call, 0));
+      call = gimple_build_call_internal (IFN_UPDATE_COVERAGE_COUNTER,
+					 2, addr, gimple_phi_result (phi));
+
+      it = gsi_start_bb (bb);
+      gsi_insert_before (&it, call, GSI_NEW_STMT);
+    }
+
+  exits.release ();
+  gsi_remove (gsi, true);
+}
+
+/* Process store motion for coverage counter update internal function.  */
+
+static void
+process_sm_for_coverage_counter (void)
+{
+  bool has_dom_info = dom_info_available_p (CDI_POST_DOMINATORS);
+  if (!has_dom_info)
+    calculate_dominance_info (CDI_POST_DOMINATORS);
+
+  struct loop *loop;
+  FOR_EACH_LOOP (loop, LI_FROM_INNERMOST)
+    {
+      basic_block *body = get_loop_body (loop);
+      for (unsigned i = 0; i < loop->num_nodes; i++)
+	{
+	  gimple_stmt_iterator gsi;
+	  for (gsi = gsi_start_bb (body[i]); !gsi_end_p (gsi);)
+	    {
+	      gimple *stmt = gsi_stmt (gsi);
+	      if (gimple_call_internal_p (stmt, IFN_UPDATE_COVERAGE_COUNTER)
+		  && integer_onep (gimple_call_arg (stmt, 1)))
+		move_coverage_counter_update (&gsi, loop);
+
+	      if (!gsi_end_p (gsi))
+		gsi_next (&gsi);
+	    }
+	}
+    }
+
+  if (!has_dom_info)
+    free_dominance_info (CDI_POST_DOMINATORS);
+}
+
 /* Loop invariant motion pass.  */
 
 namespace {
@@ -2592,11 +2701,15 @@  pass_lim::execute (function *fun)
 {
   bool in_loop_pipeline = scev_initialized_p ();
   if (!in_loop_pipeline)
-    loop_optimizer_init (LOOPS_NORMAL | LOOPS_HAVE_RECORDED_EXITS);
+    loop_optimizer_init (LOOPS_NORMAL | LOOPS_HAVE_RECORDED_EXITS
+			 | LOOPS_HAVE_PREHEADERS);
 
-  if (number_of_loops (fun) <= 1)
-    return 0;
-  unsigned int todo = tree_ssa_lim ();
+  unsigned int todo = 0;
+  if (number_of_loops (fun) > 1)
+    todo = tree_ssa_lim ();
+
+  process_sm_for_coverage_counter ();
+  todo |= TODO_update_ssa;
 
   if (!in_loop_pipeline)
     loop_optimizer_finalize ();
diff --git a/gcc/tree-ssa-loop.h b/gcc/tree-ssa-loop.h
index b2f37ab..88bcad7 100644
--- a/gcc/tree-ssa-loop.h
+++ b/gcc/tree-ssa-loop.h
@@ -79,4 +79,22 @@  loop_containing_stmt (gimple *stmt)
   return bb->loop_father;
 }
 
+/* Checks whether LOOP (with exits stored in EXITS array) is suitable
+   for a store motion optimization (i.e. whether we can insert statement
+   on its exits).  */
+
+static inline bool
+loop_suitable_for_sm (struct loop *loop ATTRIBUTE_UNUSED,
+		      vec<edge> exits)
+{
+  unsigned i;
+  edge ex;
+
+  FOR_EACH_VEC_ELT (exits, i, ex)
+    if (ex->flags & (EDGE_ABNORMAL | EDGE_EH))
+      return false;
+
+  return true;
+}
+
 #endif /* GCC_TREE_SSA_LOOP_H */
diff --git a/gcc/value-prof.h b/gcc/value-prof.h
index 07e2b3b..cb3ae1d 100644
--- a/gcc/value-prof.h
+++ b/gcc/value-prof.h
@@ -98,6 +98,7 @@  bool check_ic_target (gcall *, struct cgraph_node *);
 /* In tree-profile.c.  */
 extern void gimple_init_edge_profiler (void);
 extern void gimple_gen_edge_profiler (int, edge);
+extern void expand_coverage_counter_ifns (void);
 extern void gimple_gen_interval_profiler (histogram_value, unsigned, unsigned);
 extern void gimple_gen_pow2_profiler (histogram_value, unsigned, unsigned);
 extern void gimple_gen_one_value_profiler (histogram_value, unsigned, unsigned);
-- 
2.10.1