
Conditional count update for fast coverage test in multi-threaded programs

Message ID CAF1bQ=Sz3B0o-sU7v4OjsJCscDaEoPoKrDEqG2cSyYpc0oXaKw@mail.gmail.com
State New

Commit Message

Rong Xu Nov. 22, 2013, 3:51 a.m. UTC
Hi,

This patch injects a condition into the instrumented code for the edge
counter update. The counter value is no longer updated once it reaches
1.
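
In effect, instead of the plain increment

  counter = counter + 1;

each instrumented edge now executes (roughly -- "counter" here stands
for the per-edge gcov counter slot)

  if (counter == 0)
    counter = counter + 1;

so each counter's cache line is written at most once per run.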

The feature is controlled by a new parameter, --param=coverage-exec_once.
It is disabled by default; set the parameter to 1 to enable it.
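
For example, in a coverage build:

  gcc -O2 --coverage --param coverage-exec_once=1 foo.c

(--coverage is just shorthand for the usual -fprofile-arcs
-ftest-coverage instrumentation.)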

This extra check usually slows the program down. For the SPEC 2006
benchmarks (all single-threaded programs), we usually see around a
20%-35% slowdown in an -O2 coverage build. For multi-threaded programs,
however, this feature is expected to improve coverage run speed,
because there is virtually no data race or false sharing when updating
the counters. The improvement can be significant for highly threaded
programs -- we are seeing a 7x speedup in coverage test runs for some
non-trivial Google applications.

Tested with bootstrap.

Thanks,

-Rong
2013-11-21  Rong Xu  <xur@google.com>

	* gcc/doc/invoke.texi (coverage-exec_once): Add documentation.
	* gcc/params.def (coverage-exec_once): New param.
	* gcc/profile.h (add_predicate_to_edge_counters): New extern decl.
	* gcc/profile.c (branch_prob): Add support for coverage-exec_once.
	* gcc/tree-profile.c (gimple_gen_edge_profiler): Ditto.
	(gimple_gen_ior_profiler): Ditto.
	(insert_if_then): Ditto.
	(add_execonce_wrapper): Ditto.
	(is_pred_instrumentation_candidate): Ditto.
	(add_predicate_to_edge_counters): Ditto.
	(cleanup_pred_instrumentation): Ditto.
	(init_pred_instrumentation): Ditto.
	(tree_profiling): Ditto.

Comments

Richard Biener Nov. 22, 2013, 12:03 p.m. UTC | #1
On Fri, Nov 22, 2013 at 4:51 AM, Rong Xu <xur@google.com> wrote:
> Hi,
>
> This patch injects a condition into the instrumented code for edge
> counter update. The counter value will not be updated after reaching
> value 1.
>
> The feature is under a new parameter --param=coverage-exec_once.
> Default is disabled and setting to 1 to enable.
>
> This extra check usually slows the program down. For SPEC 2006
> benchmarks (all single thread programs), we usually see around 20%-35%
> slow down in -O2 coverage build. This feature, however, is expected to
> improve the coverage run speed for multi-threaded programs, because
> there virtually no data race and false sharing in updating counters.
> The improvement can be significant for highly threaded programs -- we
> are seeing 7x speedup in coverage test run for some non-trivial google
> applications.
>
> Tested with bootstrap.

Err - why not simply emit

  counter = 1

for the counter update itself with that --param (I don't like a --param
for this either).

I assume that CPUs can avoid data races and false sharing for
non-changing accesses?

Richard.

> Thanks,
>
> -Rong
Rong Xu Nov. 22, 2013, 9:49 p.m. UTC | #2
On Fri, Nov 22, 2013 at 4:03 AM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On Fri, Nov 22, 2013 at 4:51 AM, Rong Xu <xur@google.com> wrote:
>> Hi,
>>
>> This patch injects a condition into the instrumented code for edge
>> counter update. The counter value will not be updated after reaching
>> value 1.
>>
>> The feature is under a new parameter --param=coverage-exec_once.
>> Default is disabled and setting to 1 to enable.
>>
>> This extra check usually slows the program down. For SPEC 2006
>> benchmarks (all single thread programs), we usually see around 20%-35%
>> slow down in -O2 coverage build. This feature, however, is expected to
>> improve the coverage run speed for multi-threaded programs, because
>> there virtually no data race and false sharing in updating counters.
>> The improvement can be significant for highly threaded programs -- we
>> are seeing 7x speedup in coverage test run for some non-trivial google
>> applications.
>>
>> Tested with bootstrap.
>
> Err - why not simply emit
>
>   counter = 1
>
> for the counter update itself with that --param (I don't like a --param
> for this either).
>
> I assume that CPUs can avoid data-races and false sharing for
> non-changing accesses?
>

I'm not aware of any CPU having this feature. I think a write to a
shared cache line invalidates all the shared copies regardless of the
value written; I cannot find any reference to the CPU checking the
value of the write. Do you have any pointer to the feature?

I just tested this implementation vs. simply setting the counter to 1,
using Google search as the benchmark: the conditional implementation is
4.5x faster. The test was done on Intel Westmere systems.

I can change the parameter to an option.

-Rong

> Richard.
>
>> Thanks,
>>
>> -Rong
Richard Biener Nov. 25, 2013, 10:11 a.m. UTC | #3
On Fri, Nov 22, 2013 at 10:49 PM, Rong Xu <xur@google.com> wrote:
> On Fri, Nov 22, 2013 at 4:03 AM, Richard Biener
> <richard.guenther@gmail.com> wrote:
>> On Fri, Nov 22, 2013 at 4:51 AM, Rong Xu <xur@google.com> wrote:
>>> Hi,
>>>
>>> This patch injects a condition into the instrumented code for edge
>>> counter update. The counter value will not be updated after reaching
>>> value 1.
>>>
>>> The feature is under a new parameter --param=coverage-exec_once.
>>> Default is disabled and setting to 1 to enable.
>>>
>>> This extra check usually slows the program down. For SPEC 2006
>>> benchmarks (all single thread programs), we usually see around 20%-35%
>>> slow down in -O2 coverage build. This feature, however, is expected to
>>> improve the coverage run speed for multi-threaded programs, because
>>> there virtually no data race and false sharing in updating counters.
>>> The improvement can be significant for highly threaded programs -- we
>>> are seeing 7x speedup in coverage test run for some non-trivial google
>>> applications.
>>>
>>> Tested with bootstrap.
>>
>> Err - why not simply emit
>>
>>   counter = 1
>>
>> for the counter update itself with that --param (I don't like a --param
>> for this either).
>>
>> I assume that CPUs can avoid data-races and false sharing for
>> non-changing accesses?
>>
>
> I'm not aware of any CPU having this feature. I think a write to the
> shared cache line to invalidate all the shared copies. I cannot find
> any reference on checking the value of the write. Do you have any
> pointer to the feature?

I don't have any pointer - but I remember seeing this in the context
of atomics, so it may only apply when using an xchg or cmpxchg
instruction.  That would make it non-portable to some extent (if you
don't want to use the atomic builtins here).

Richard.

> I just tested this implementation vs. simply setting to 1, using
> google search as the benchmark.
> This one is 4.5x faster. The test was done on Intel Westmere systems.
>
> I can change the parameter to an option.
>
> -Rong
>
>> Richard.
>>
>>> Thanks,
>>>
>>> -Rong
Rong Xu Nov. 25, 2013, 7:19 p.m. UTC | #4
On Mon, Nov 25, 2013 at 2:11 AM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On Fri, Nov 22, 2013 at 10:49 PM, Rong Xu <xur@google.com> wrote:
>> On Fri, Nov 22, 2013 at 4:03 AM, Richard Biener
>> <richard.guenther@gmail.com> wrote:
>>> On Fri, Nov 22, 2013 at 4:51 AM, Rong Xu <xur@google.com> wrote:
>>>> Hi,
>>>>
>>>> This patch injects a condition into the instrumented code for edge
>>>> counter update. The counter value will not be updated after reaching
>>>> value 1.
>>>>
>>>> The feature is under a new parameter --param=coverage-exec_once.
>>>> Default is disabled and setting to 1 to enable.
>>>>
>>>> This extra check usually slows the program down. For SPEC 2006
>>>> benchmarks (all single thread programs), we usually see around 20%-35%
>>>> slow down in -O2 coverage build. This feature, however, is expected to
>>>> improve the coverage run speed for multi-threaded programs, because
>>>> there virtually no data race and false sharing in updating counters.
>>>> The improvement can be significant for highly threaded programs -- we
>>>> are seeing 7x speedup in coverage test run for some non-trivial google
>>>> applications.
>>>>
>>>> Tested with bootstrap.
>>>
>>> Err - why not simply emit
>>>
>>>   counter = 1
>>>
>>> for the counter update itself with that --param (I don't like a --param
>>> for this either).
>>>
>>> I assume that CPUs can avoid data-races and false sharing for
>>> non-changing accesses?
>>>
>>
>> I'm not aware of any CPU having this feature. I think a write to the
>> shared cache line to invalidate all the shared copies. I cannot find
>> any reference on checking the value of the write. Do you have any
>> pointer to the feature?
>
> I don't have any pointer - but I remember seeing this in the context
> of atomics thus it may be only in the context of using a xchg
> or cmpxchg instruction.  Which would make it non-portable to
> some extent (if you don't want to use atomic builtins here).
>

cmpxchg should work here -- it's a conditional write, so the data race
and false sharing can be avoided.
I'm comparing the performance of the explicit branch vs. cmpxchg and
will report back.
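
For reference, the cmpxchg variant is roughly the following (a sketch
using the GCC atomic builtins, not necessarily what the final patch
will emit):

  gcov_type expected = 0;
  /* Write 1 only if the counter is still 0; relaxed ordering is
     sufficient for a coverage counter.  */
  __atomic_compare_exchange_n (&counter, &expected, (gcov_type) 1,
                               false, __ATOMIC_RELAXED, __ATOMIC_RELAXED);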

-Rong


> Richard.
>
>> I just tested this implementation vs. simply setting to 1, using
>> google search as the benchmark.
>> This one is 4.5x faster. The test was done on Intel Westmere systems.
>>
>> I can change the parameter to an option.
>>
>> -Rong
>>
>>> Richard.
>>>
>>>> Thanks,
>>>>
>>>> -Rong
Rong Xu Dec. 20, 2013, 10:45 p.m. UTC | #5
Here are the results using our internal benchmarks, which are a mix of
multi-threaded and single-threaded programs.
The data was collected about a month ago, but I did not get time to
send it earlier due to an unexpected trip.

cmpxchg gives the worst performance due to the memory barriers it incurs.
I'll send a patch that supports conditional_1 and unconditional_1.

--------------------------------- result -----------------------------

base: original_coverage
(1): using_conditional_1 -- using a branch (my original implementation)
(2): using_unconditional_1 -- unconditionally write 1
(3): using_cmpxchg -- use cmpxchg to write 1

Values are performance ratios where 100.0 equals the performance of the
plain -O2 build. Larger numbers are faster.
"--" means the test failed due to running too slowly.

arch        : westmere
  Benchmark           Base      (1)    (2)      (3)
---------------------------------------------------------
benchmark_1            26.4  +176.62%  +17.20%        --
benchmark_2              --    [78.4]   [12.3]        --
benchmark_3            86.3    +6.15%  +10.52%   -61.28%
benchmark_4            88.4    +6.59%  +14.26%   -68.76%
benchmark_5            89.6    +6.26%  +13.00%   -68.74%
benchmark_6            76.7   +22.28%  +29.15%   -75.31%
benchmark_7            89.0    -0.62%   +3.36%   -71.37%
benchmark_8            84.5    -1.45%   +5.27%   -74.04%
benchmark_9            81.3   +10.64%  +13.32%   -72.82%
benchmark_10           59.1   +44.71%  +14.77%   -73.24%
benchmark_11           90.3    -1.74%   +4.22%   -61.95%
benchmark_12           98.9    +0.07%   +0.48%    -6.37%
benchmark_13           74.0    -4.69%   +4.35%   -77.02%
benchmark_14           21.4  +309.92%  +63.41%   -35.82%
benchmark_15           21.4  +282.33%  +58.15%   -57.98%
benchmark_16           85.1    -7.71%   +1.65%   -60.72%
benchmark_17           81.7    +2.47%   +8.20%   -72.08%
benchmark_18           83.7    +1.59%   +3.83%   -69.33%
geometric mean                 +30.30%  +14.41%  -65.66% (incomplete)

arch        : sandybridge
  Benchmark           Base    (1)       (2)      (3)
---------------------------------------------------------
benchmark_1             --    [70.1]   [26.1]       --
benchmark_2             --    [79.1]       --       --
benchmark_3           84.3   +10.82%  +15.84%  -68.98%
benchmark_4           88.5   +10.28%  +11.35%  -75.10%
benchmark_5           89.4   +10.46%  +11.40%  -74.41%
benchmark_6           65.5   +38.52%  +44.46%  -77.97%
benchmark_7           87.7    -0.16%   +1.74%  -76.19%
benchmark_8           89.6    -4.52%   +6.29%  -78.10%
benchmark_9           79.9   +13.43%  +19.44%  -75.99%
benchmark_10          52.6   +61.53%   +8.23%  -78.41%
benchmark_11          89.9    -1.40%   +3.37%  -68.16%
benchmark_12          99.0    +1.51%   +0.63%  -10.37%
benchmark_13          74.3    -6.75%   +3.89%  -81.84%
benchmark_14          21.8  +295.76%  +19.48%  -51.58%
benchmark_15          23.5  +257.20%  +29.33%  -83.53%
benchmark_16          84.4   -10.04%   +2.39%  -68.25%
benchmark_17          81.6    +0.60%   +8.82%  -78.02%
benchmark_18          87.4    -1.14%   +9.69%  -75.88%
geometric mean               +25.64%  +11.76%  -72.96% (incomplete)

arch        : clovertown
  Benchmark         Base       (1)       (2)        (3)
--------------------------------------------------------------
benchmark_1             --     [83.4]        --         --
benchmark_2             --     [82.3]        --         --
benchmark_3           86.2     +7.58%   +13.10%    -81.74%
benchmark_4           89.4     +5.69%   +11.70%    -82.97%
benchmark_5           92.8     +4.67%    +7.48%    -80.02%
benchmark_6           78.1    +13.28%   +22.21%    -86.92%
benchmark_7           96.8     +0.25%    +5.44%    -84.94%
benchmark_8           89.1     +0.66%    +3.60%    -85.89%
benchmark_9           86.4     +8.42%    +9.95%    -82.30%
benchmark_10          59.7    +44.95%   +21.79%         --
benchmark_11          91.2     -0.29%    +4.35%    -76.05%
benchmark_12          99.0     +0.31%    -0.05%    -25.19%
benchmark_13          77.1     -3.34%    +5.75%         --
benchmark_14           8.2  +1011.27%  +104.15%     +5.56%
benchmark_15          11.7   +669.25%  +108.54%    -29.83%
benchmark_16          85.7     -7.51%    +4.43%         --
benchmark_17          87.7     +2.84%    +7.45%         --
benchmark_18          87.9     +1.59%    +3.82%    -81.11%
geometric mean                 +37.89%   +17.54%  -74.47% (incomplete)

arch        : istanbul

 Benchmark           Base    (1)       (2)       (3)
----------------------------------------------------------
benchmark_1             --    [73.2]       --         --
benchmark_2             --    [82.9]       --         --
benchmark_3           86.1    +4.56%  +11.68%    -61.04%
benchmark_4           92.0    +3.47%   +4.63%    -64.84%
benchmark_5           91.9    +4.18%   +4.90%    -64.77%
benchmark_6           73.6   +23.36%  +27.13%    -72.64%
benchmark_7           93.6    -3.57%   +4.76%    -68.54%
benchmark_8           88.9    -3.01%   +2.87%    -75.50%
benchmark_9           81.6    +9.91%  +14.10%    -69.81%
benchmark_10          69.8   +23.60%   -1.53%         --
benchmark_11          89.9    +0.34%   +4.32%    -59.62%
benchmark_12          98.7    +0.96%   +0.88%    -10.52%
benchmark_13          80.5    -8.31%   -0.55%    -75.90%
benchmark_14          17.1  +429.47%   -5.21%     -4.87%
benchmark_15          22.0  +295.70%   -1.29%    -46.36%
benchmark_16          80.0    -6.31%   +0.54%    -62.61%
benchmark_17          83.5    +4.84%  +10.71%    -70.48%
benchmark_18          90.0    -1.27%   +3.18%    -72.30%
geometric mean               +24.51%   +4.81%  -62.44% (incomplete)

On Mon, Nov 25, 2013 at 11:19 AM, Rong Xu <xur@google.com> wrote:
> On Mon, Nov 25, 2013 at 2:11 AM, Richard Biener
> <richard.guenther@gmail.com> wrote:
>> On Fri, Nov 22, 2013 at 10:49 PM, Rong Xu <xur@google.com> wrote:
>>> On Fri, Nov 22, 2013 at 4:03 AM, Richard Biener
>>> <richard.guenther@gmail.com> wrote:
>>>> On Fri, Nov 22, 2013 at 4:51 AM, Rong Xu <xur@google.com> wrote:
>>>>> Hi,
>>>>>
>>>>> This patch injects a condition into the instrumented code for edge
>>>>> counter update. The counter value will not be updated after reaching
>>>>> value 1.
>>>>>
>>>>> The feature is under a new parameter --param=coverage-exec_once.
>>>>> Default is disabled and setting to 1 to enable.
>>>>>
>>>>> This extra check usually slows the program down. For SPEC 2006
>>>>> benchmarks (all single thread programs), we usually see around 20%-35%
>>>>> slow down in -O2 coverage build. This feature, however, is expected to
>>>>> improve the coverage run speed for multi-threaded programs, because
>>>>> there virtually no data race and false sharing in updating counters.
>>>>> The improvement can be significant for highly threaded programs -- we
>>>>> are seeing 7x speedup in coverage test run for some non-trivial google
>>>>> applications.
>>>>>
>>>>> Tested with bootstrap.
>>>>
>>>> Err - why not simply emit
>>>>
>>>>   counter = 1
>>>>
>>>> for the counter update itself with that --param (I don't like a --param
>>>> for this either).
>>>>
>>>> I assume that CPUs can avoid data-races and false sharing for
>>>> non-changing accesses?
>>>>
>>>
>>> I'm not aware of any CPU having this feature. I think a write to the
>>> shared cache line to invalidate all the shared copies. I cannot find
>>> any reference on checking the value of the write. Do you have any
>>> pointer to the feature?
>>
>> I don't have any pointer - but I remember seeing this in the context
>> of atomics thus it may be only in the context of using a xchg
>> or cmpxchg instruction.  Which would make it non-portable to
>> some extent (if you don't want to use atomic builtins here).
>>
>
> cmpxchg should work here -- it's a conditional write so the data race
> /false sharing can be avoided.
> I'm comparing the performance b/w explicit branch vs cmpxchg and will
> report back.
>
> -Rong
>
>
>> Richard.
>>
>>> I just tested this implementation vs. simply setting to 1, using
>>> google search as the benchmark.
>>> This one is 4.5x faster. The test was done on Intel Westmere systems.
>>>
>>> I can change the parameter to an option.
>>>
>>> -Rong
>>>
>>>> Richard.
>>>>
>>>>> Thanks,
>>>>>
>>>>> -Rong
Richard Biener April 17, 2014, 10:08 a.m. UTC | #6
On Fri, Dec 20, 2013 at 11:45 PM, Rong Xu <xur@google.com> wrote:
> Here are the results using our internal benchmarks which are a mixed a
> multi-threaded and single-threaded programs.
> This was collected about a month ago but I did not got time to send
> due to an unexpected trip.
>
> cmpxchg gives the worst performance due to the memory barriers it incurs.
> I'll send a patch that supports conditional_1 and unconditional_1.

Bah, too bad.  Ok, so another idea is to use a non-temporal unconditional
store of 1.  According to the docs this is weakly ordered and side-steps
the cache (movnti/movntq on x86).
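
On x86-64 that would be roughly (sketch only, via the SSE2 intrinsic,
assuming the 64-bit gcov counter type):

  #include <immintrin.h>
  /* Non-temporal store of 1: weakly ordered, bypasses the cache;
     compiles to movnti.  */
  _mm_stream_si64 ((long long *) &counter, 1);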

Btw,

+@item coverage-exec_once
+Set to 1 to update each arc counter only once. This avoids false sharing
+and speeds up profile-generate run for multi-threaded programs.
+

for -fprofile-generate this certainly is a bad choice, but I can see that
it is useful for -coverage.  Also, avoid mixing - and _ here.  Using a
--param is not the proper thing for a non-developer feature, so I'd
suggest adding -fcoverage-exec-once instead.  It is supposed to help
for -coverage, right?  Not for -fprofile-generate?
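
Something along these lines in common.opt, as a sketch only (the option
name and variable are just suggestions):

  fcoverage-exec-once
  Common Report Var(flag_coverage_exec_once)
  Update each arc counter at most once during a coverage run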

Instead of using a pointer-set to record the stmts to "instrument", just
set one of the pass-local flags on the stmts (gimple_set_plf/gimple_plf;
you have to clear the flags before using them).
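
Roughly, assuming one of the generic pass-local flags (say GF_PLF_1) is
otherwise unused in this pass:

  /* When emitting the counter update, mark its final store ...  */
  gimple_set_plf (stmt3, GF_PLF_1, true);

  /* ... and test the flag when scanning for updates to wrap.  */
  if (gimple_plf (stmt, GF_PLF_1))
    ...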

Thanks,
Richard.

> --------------------------------- result -----------------------------
>
> base: original_coverage
> (1): using_conditional_1 -- using branch (my original implementation)
> (2): using_unconfitional_1 -- write straight 1
> (3): using_cmpxchg -- using compxchg write 1
>
> Values are performance ratios where 100.0 equals the performance of
> O2. Larger numbers are faster.
> "--" means the test failed due to running too slowly.
>
> arch        : westmere
>   Benchmark           Base      (1)    (2)      (3)
> ---------------------------------------------------------
> benchmark_1            26.4  +176.62%  +17.20%        --
> benchmark_2              --    [78.4]   [12.3]        --
> benchmark_3            86.3    +6.15%  +10.52%   -61.28%
> benchmark_4            88.4    +6.59%  +14.26%   -68.76%
> benchmark_5            89.6    +6.26%  +13.00%   -68.74%
> benchmark_6            76.7   +22.28%  +29.15%   -75.31%
> benchmark_7            89.0    -0.62%   +3.36%   -71.37%
> benchmark_8            84.5    -1.45%   +5.27%   -74.04%
> benchmark_9            81.3   +10.64%  +13.32%   -72.82%
> benchmark_10           59.1   +44.71%  +14.77%   -73.24%
> benchmark_11           90.3    -1.74%   +4.22%   -61.95%
> benchmark_12           98.9    +0.07%   +0.48%    -6.37%
> benchmark_13           74.0    -4.69%   +4.35%   -77.02%
> benchmark_14           21.4  +309.92%  +63.41%   -35.82%
> benchmark_15           21.4  +282.33%  +58.15%   -57.98%
> benchmark_16           85.1    -7.71%   +1.65%   -60.72%
> benchmark_17           81.7    +2.47%   +8.20%   -72.08%
> benchmark_18           83.7    +1.59%   +3.83%   -69.33%
> geometric mean                 +30.30%  +14.41%  -65.66% (incomplete)
>
> arch        : sandybridge
>   Benchmark           Base    (1)       (2)      (3)
> ---------------------------------------------------------
> benchmark_1             --    [70.1]   [26.1]       --
> benchmark_2             --    [79.1]       --       --
> benchmark_3           84.3   +10.82%  +15.84%  -68.98%
> benchmark_4           88.5   +10.28%  +11.35%  -75.10%
> benchmark_5           89.4   +10.46%  +11.40%  -74.41%
> benchmark_6           65.5   +38.52%  +44.46%  -77.97%
> benchmark_7           87.7    -0.16%   +1.74%  -76.19%
> benchmark_8           89.6    -4.52%   +6.29%  -78.10%
> benchmark_9           79.9   +13.43%  +19.44%  -75.99%
> benchmark_10          52.6   +61.53%   +8.23%  -78.41%
> benchmark_11          89.9    -1.40%   +3.37%  -68.16%
> benchmark_12          99.0    +1.51%   +0.63%  -10.37%
> benchmark_13          74.3    -6.75%   +3.89%  -81.84%
> benchmark_14          21.8  +295.76%  +19.48%  -51.58%
> benchmark_15          23.5  +257.20%  +29.33%  -83.53%
> benchmark_16          84.4   -10.04%   +2.39%  -68.25%
> benchmark_17          81.6    +0.60%   +8.82%  -78.02%
> benchmark_18          87.4    -1.14%   +9.69%  -75.88%
> geometric mean               +25.64%  +11.76%  -72.96% (incomplete)
>
> arch        : clovertown
>   Benchmark         Base       (1)       (2)        (3)
> --------------------------------------------------------------
> benchmark_1             --     [83.4]        --         --
> benchmark_2             --     [82.3]        --         --
> benchmark_3           86.2     +7.58%   +13.10%    -81.74%
> benchmark_4           89.4     +5.69%   +11.70%    -82.97%
> benchmark_5           92.8     +4.67%    +7.48%    -80.02%
> benchmark_6           78.1    +13.28%   +22.21%    -86.92%
> benchmark_7           96.8     +0.25%    +5.44%    -84.94%
> benchmark_8           89.1     +0.66%    +3.60%    -85.89%
> benchmark_9           86.4     +8.42%    +9.95%    -82.30%
> benchmark_10          59.7    +44.95%   +21.79%         --
> benchmark_11          91.2     -0.29%    +4.35%    -76.05%
> benchmark_12          99.0     +0.31%    -0.05%    -25.19%
> benchmark_14           8.2  +1011.27%  +104.15%     +5.56%
> benchmark_15          11.7   +669.25%  +108.54%    -29.83%
> benchmark_16          85.7     -7.51%    +4.43%         --
> benchmark_17          87.7     +2.84%    +7.45%         --
> benchmark_18          87.9     +1.59%    +3.82%    -81.11%
> geometric mean                 +37.89%   +17.54%  -74.47% (incomplete)
>
> arch        : istanbul
>
>  Benchmark           Base    (1)       (2)       (3)
> ----------------------------------------------------------
> benchmark_1             --    [73.2]       --         --
> benchmark_2             --    [82.9]       --         --
> benchmark_3           86.1    +4.56%  +11.68%    -61.04%
> benchmark_4           92.0    +3.47%   +4.63%    -64.84%
> benchmark_5           91.9    +4.18%   +4.90%    -64.77%
> benchmark_6           73.6   +23.36%  +27.13%    -72.64%
> benchmark_7           93.6    -3.57%   +4.76%    -68.54%
> benchmark_8           88.9    -3.01%   +2.87%    -75.50%
> benchmark_9           81.6    +9.91%  +14.10%    -69.81%
> benchmark_10          69.8   +23.60%   -1.53%         --
> benchmark_11          89.9    +0.34%   +4.32%    -59.62%
> benchmark_12          98.7    +0.96%   +0.88%    -10.52%
> benchmark_13          80.5    -8.31%   -0.55%    -75.90%
> benchmark_14          17.1  +429.47%   -5.21%     -4.87%
> benchmark_15          22.0  +295.70%   -1.29%    -46.36%
> benchmark_16          80.0    -6.31%   +0.54%    -62.61%
> benchmark_17          83.5    +4.84%  +10.71%    -70.48%
> benchmark_18          90.0    -1.27%   +3.18%    -72.30%
> geometric mean               +24.51%   +4.81%  -62.44% (incomplete)
>
> benchmark_13          77.1     -3.34%    +5.75%         --
>
> On Mon, Nov 25, 2013 at 11:19 AM, Rong Xu <xur@google.com> wrote:
>> On Mon, Nov 25, 2013 at 2:11 AM, Richard Biener
>> <richard.guenther@gmail.com> wrote:
>>> On Fri, Nov 22, 2013 at 10:49 PM, Rong Xu <xur@google.com> wrote:
>>>> On Fri, Nov 22, 2013 at 4:03 AM, Richard Biener
>>>> <richard.guenther@gmail.com> wrote:
>>>>> On Fri, Nov 22, 2013 at 4:51 AM, Rong Xu <xur@google.com> wrote:
>>>>>> Hi,
>>>>>>
>>>>>> This patch injects a condition into the instrumented code for edge
>>>>>> counter update. The counter value will not be updated after reaching
>>>>>> value 1.
>>>>>>
>>>>>> The feature is under a new parameter --param=coverage-exec_once.
>>>>>> Default is disabled and setting to 1 to enable.
>>>>>>
>>>>>> This extra check usually slows the program down. For SPEC 2006
>>>>>> benchmarks (all single thread programs), we usually see around 20%-35%
>>>>>> slow down in -O2 coverage build. This feature, however, is expected to
>>>>>> improve the coverage run speed for multi-threaded programs, because
>>>>>> there virtually no data race and false sharing in updating counters.
>>>>>> The improvement can be significant for highly threaded programs -- we
>>>>>> are seeing 7x speedup in coverage test run for some non-trivial google
>>>>>> applications.
>>>>>>
>>>>>> Tested with bootstrap.
>>>>>
>>>>> Err - why not simply emit
>>>>>
>>>>>   counter = 1
>>>>>
>>>>> for the counter update itself with that --param (I don't like a --param
>>>>> for this either).
>>>>>
>>>>> I assume that CPUs can avoid data-races and false sharing for
>>>>> non-changing accesses?
>>>>>
>>>>
>>>> I'm not aware of any CPU having this feature. I think a write to the
>>>> shared cache line to invalidate all the shared copies. I cannot find
>>>> any reference on checking the value of the write. Do you have any
>>>> pointer to the feature?
>>>
>>> I don't have any pointer - but I remember seeing this in the context
>>> of atomics thus it may be only in the context of using a xchg
>>> or cmpxchg instruction.  Which would make it non-portable to
>>> some extent (if you don't want to use atomic builtins here).
>>>
>>
>> cmpxchg should work here -- it's a conditional write so the data race
>> /false sharing can be avoided.
>> I'm comparing the performance b/w explicit branch vs cmpxchg and will
>> report back.
>>
>> -Rong
>>
>>
>>> Richard.
>>>
>>>> I just tested this implementation vs. simply setting to 1, using
>>>> google search as the benchmark.
>>>> This one is 4.5x faster. The test was done on Intel Westmere systems.
>>>>
>>>> I can change the parameter to an option.
>>>>
>>>> -Rong
>>>>
>>>>> Richard.
>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> -Rong

Patch

Index: gcc/doc/invoke.texi
===================================================================
--- gcc/doc/invoke.texi	(revision 205226)
+++ gcc/doc/invoke.texi	(working copy)
@@ -9937,6 +9937,10 @@  The default choice depends on the target.
 Set the maximum number of existing candidates that will be considered when
 seeking a basis for a new straight-line strength reduction candidate.
 
+@item coverage-exec_once
+Set to 1 to update each arc counter at most once.  This avoids false sharing
+and speeds up the profile-generate run for multi-threaded programs.
+
 @end table
 @end table
 
Index: gcc/params.def
===================================================================
--- gcc/params.def	(revision 205226)
+++ gcc/params.def	(working copy)
@@ -441,6 +441,14 @@  DEFPARAM(TRACER_MIN_BRANCH_PROBABILITY,
 	 "Stop forward growth if the probability of best edge is less than this threshold (in percent). Used when profile feedback is not available",
 	 50, 0, 100)
 
+/* This parameter enables the fast coverage test.  Once the counter flips
+   from 0 to 1, we no longer update the value.  This avoids false sharing and
+   speeds up the profile-generate run for multi-threaded programs.  */
+DEFPARAM (PARAM_COVERAGE_EXEC_ONCE,
+         "coverage-exec_once",
+         "Stop incrementing edge counts once they become 1.",
+         0, 0, 1)
+
 /* The maximum number of incoming edges to consider for crossjumping.  */
 DEFPARAM(PARAM_MAX_CROSSJUMP_EDGES,
 	 "max-crossjump-edges",
Index: gcc/profile.c
===================================================================
--- gcc/profile.c	(revision 205226)
+++ gcc/profile.c	(working copy)
@@ -67,6 +67,7 @@  along with GCC; see the file COPYING3.  If not see
 #include "cfgloop.h"
 #include "dumpfile.h"
 #include "cgraph.h"
+#include "params.h"
 
 #include "profile.h"
 
@@ -1327,6 +1328,9 @@  branch_prob (void)
 
       /* Commit changes done by instrumentation.  */
       gsi_commit_edge_inserts ();
+
+      if (PARAM_VALUE (PARAM_COVERAGE_EXEC_ONCE))
+        add_predicate_to_edge_counters ();
     }
 
   free_aux_for_edges ();
Index: gcc/profile.h
===================================================================
--- gcc/profile.h	(revision 205226)
+++ gcc/profile.h	(working copy)
@@ -46,6 +46,9 @@  extern gcov_type sum_edge_counts (vec<edge, va_gc>
 extern void init_node_map (bool);
 extern void del_node_map (void);
 
+/* Insert a predicate to edge counter update stmt.  */
+extern void add_predicate_to_edge_counters (void);
+
 extern void get_working_sets (void);
 
 /* In predict.c.  */
Index: gcc/tree-profile.c
===================================================================
--- gcc/tree-profile.c	(revision 205226)
+++ gcc/tree-profile.c	(working copy)
@@ -52,6 +52,7 @@  along with GCC; see the file COPYING3.  If not see
 #include "target.h"
 #include "tree-cfgcleanup.h"
 #include "tree-nested.h"
+#include "params.h"
 
 static GTY(()) tree gcov_type_node;
 static GTY(()) tree tree_interval_profiler_fn;
@@ -67,6 +68,11 @@  static GTY(()) tree ic_void_ptr_var;
 static GTY(()) tree ic_gcov_type_ptr_var;
 static GTY(()) tree ptr_void;
 
+/* A pointer-set recording the last statement of each block of statements
+   to which a predicate needs to be applied.  */
+static struct pointer_set_t *pred_instrumentation_candidate = NULL;
+
+
 /* Do initialization work for the edge profiler.  */
 
 /* Add code:
@@ -295,6 +301,10 @@  gimple_gen_edge_profiler (int edgeno, edge e)
   stmt2 = gimple_build_assign_with_ops (PLUS_EXPR, gcov_type_tmp_var,
 					gimple_assign_lhs (stmt1), one);
   stmt3 = gimple_build_assign (unshare_expr (ref), gimple_assign_lhs (stmt2));
+
+  if (PARAM_VALUE (PARAM_COVERAGE_EXEC_ONCE))
+    pointer_set_insert (pred_instrumentation_candidate, stmt3);
+
   gsi_insert_on_edge (e, stmt1);
   gsi_insert_on_edge (e, stmt2);
   gsi_insert_on_edge (e, stmt3);
@@ -554,6 +564,121 @@  gimple_gen_ior_profiler (histogram_value value, un
   gsi_insert_before (&gsi, call, GSI_NEW_STMT);
 }
 
+/* Insert STMT_IF around given sequence of consecutive statements in the
+   same basic block starting with STMT_START, ending with STMT_END.
+   PROB is the probability of the taken branch.  */
+
+static void
+insert_if_then (gimple stmt_start, gimple stmt_end, gimple stmt_if, int prob)
+{
+  gimple_stmt_iterator gsi;
+  basic_block bb_original, bb_before_if, bb_after_if;
+  edge e_if_taken, e_then_join, e_else;
+  int orig_frequency;
+
+  gsi = gsi_for_stmt (stmt_start);
+  gsi_insert_before (&gsi, stmt_if, GSI_SAME_STMT);
+  bb_original = gsi_bb (gsi);
+  e_if_taken = split_block (bb_original, stmt_if);
+  e_if_taken->flags &= ~EDGE_FALLTHRU;
+  e_if_taken->flags |= EDGE_TRUE_VALUE;
+  e_then_join = split_block (e_if_taken->dest, stmt_end);
+  bb_before_if = e_if_taken->src;
+  bb_after_if = e_then_join->dest;
+  e_else = make_edge (bb_before_if, bb_after_if, EDGE_FALSE_VALUE);
+  orig_frequency = bb_original->frequency;
+  e_if_taken->probability = prob;
+  e_else->probability = REG_BR_PROB_BASE - prob;
+  e_if_taken->dest->frequency = orig_frequency * prob / REG_BR_PROB_BASE;
+}
+
+/* Add a conditional stmt so that the counter update executes at most once.  */
+
+static void
+add_execonce_wrapper (gimple stmt_start, gimple stmt_end)
+{
+  tree zero, tmp1;
+  gimple stmt_if, stmt_assign;
+  gimple_stmt_iterator gsi;
+
+  tmp1 = make_temp_ssa_name (gcov_type_node, NULL, "PROF_temp");
+  stmt_assign = gimple_build_assign (tmp1, unshare_expr (gimple_assign_lhs (stmt_end)));
+
+  zero = build_int_cst (get_gcov_type (), 0);
+  stmt_if = gimple_build_cond (EQ_EXPR, tmp1, zero, NULL_TREE, NULL_TREE);
+
+  gsi = gsi_for_stmt (stmt_start);
+  gsi_insert_before (&gsi, stmt_assign, GSI_SAME_STMT);
+
+  /* Insert IF block.  */
+  insert_if_then (stmt_start, stmt_end, stmt_if, 1);
+}
+
+/* Return whether STMT is a candidate for predicate instrumentation.  */
+
+static bool
+is_pred_instrumentation_candidate (gimple stmt)
+{
+  return pointer_set_contains (pred_instrumentation_candidate, stmt);
+}
+
+/* Number of statements inserted for each edge counter increment.  */
+#define EDGE_COUNTER_STMT_COUNT 3
+
+/* Add a predicate wrapper around edge counter code in current function.  */
+
+void
+add_predicate_to_edge_counters (void)
+{
+  gimple_stmt_iterator gsi;
+  basic_block bb;
+
+  if (!PARAM_VALUE (PARAM_COVERAGE_EXEC_ONCE))
+    return;
+
+  FOR_EACH_BB_REVERSE (bb)
+    for (gsi = gsi_last_bb (bb); !gsi_end_p (gsi); gsi_prev (&gsi))
+      {
+        gimple stmt_end = gsi_stmt (gsi);
+        if (is_pred_instrumentation_candidate (stmt_end))
+          {
+            gimple stmt_beg;
+            int i;
+            int edge_counter_stmt_count = EDGE_COUNTER_STMT_COUNT;
+
+            for (i = 0; i < edge_counter_stmt_count - 1; i++)
+              gsi_prev (&gsi);
+            stmt_beg = gsi_stmt (gsi);
+            gcc_assert (stmt_beg);
+
+            add_execonce_wrapper (stmt_beg, stmt_end);
+
+            /* Reset the iterator and continue.  */
+            gsi = gsi_last_bb (bb);
+          }
+      }
+}
+
+static void
+cleanup_pred_instrumentation (void)
+{
+  /* Free the pointer set.  */
+  if (PARAM_VALUE (PARAM_COVERAGE_EXEC_ONCE))
+    {
+      pointer_set_destroy (pred_instrumentation_candidate);
+      pred_instrumentation_candidate = NULL;
+    }
+}
+
+static void
+init_pred_instrumentation (void)
+{
+  if (PARAM_VALUE (PARAM_COVERAGE_EXEC_ONCE)
+      && pred_instrumentation_candidate == 0)
+    pred_instrumentation_candidate = pointer_set_create ();
+
+}
+
 /* Profile all functions in the callgraph.  */
 
 static unsigned int
@@ -567,6 +692,8 @@  tree_profiling (void)
 
   init_node_map (true);
 
+  init_pred_instrumentation ();
+
   FOR_EACH_DEFINED_FUNCTION (node)
     {
       if (!gimple_has_body_p (node->decl))
@@ -656,6 +783,7 @@  tree_profiling (void)
   handle_missing_profiles ();
 
   del_node_map ();
+  cleanup_pred_instrumentation ();
   return 0;
 }