Message ID: CAF1bQ=Sz3B0o-sU7v4OjsJCscDaEoPoKrDEqG2cSyYpc0oXaKw@mail.gmail.com
State: New
On Fri, Nov 22, 2013 at 4:51 AM, Rong Xu <xur@google.com> wrote:
> Hi,
>
> This patch injects a condition into the instrumented code for the edge
> counter update. The counter value will not be updated after reaching
> the value 1.
>
> The feature is under a new parameter --param=coverage-exec_once.
> The default is disabled; set it to 1 to enable.
>
> This extra check usually slows the program down. For SPEC 2006
> benchmarks (all single-threaded programs), we usually see around a
> 20%-35% slowdown in an -O2 coverage build. This feature, however, is
> expected to improve coverage-run speed for multi-threaded programs,
> because there is virtually no data race and false sharing in updating
> counters. The improvement can be significant for highly threaded
> programs -- we are seeing a 7x speedup in coverage test runs for some
> non-trivial Google applications.
>
> Tested with bootstrap.

Err - why not simply emit

  counter = 1

for the counter update itself with that --param (I don't like a --param
for this either)? I assume that CPUs can avoid data races and false
sharing for non-changing accesses?

Richard.

> Thanks,
>
> -Rong
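For illustration, the three counter-update schemes discussed in this
thread can be sketched in plain C as follows; the array and function
names are hypothetical stand-ins for the GIMPLE the instrumentation
actually emits, not code from the patch:

  #include <stdint.h>

  #define NUM_EDGES 1024  /* hypothetical number of instrumented edges */

  /* Simplified stand-in for the gcov arc-counter array.  */
  static int64_t edge_counter[NUM_EDGES];

  /* (a) Default instrumentation: unconditional increment.  Every
     execution dirties the counter's cache line, so concurrent threads
     ping-pong the line (data races and false sharing).  */
  void edge_taken_default (int e)
  {
    edge_counter[e] += 1;
  }

  /* (b) The patch's scheme: test first, store only on the 0 -> 1
     transition.  After warm-up the line can stay read-shared.  */
  void edge_taken_exec_once (int e)
  {
    if (edge_counter[e] == 0)
      edge_counter[e] = 1;
  }

  /* (c) Richard's suggestion: always store 1.  No branch, but the
     store still writes the line on every execution.  */
  void edge_taken_store_one (int e)
  {
    edge_counter[e] = 1;
  }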
On Fri, Nov 22, 2013 at 4:03 AM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On Fri, Nov 22, 2013 at 4:51 AM, Rong Xu <xur@google.com> wrote:
>> [...]
>
> Err - why not simply emit
>
>   counter = 1
>
> for the counter update itself with that --param (I don't like a --param
> for this either)?
>
> I assume that CPUs can avoid data races and false sharing for
> non-changing accesses?

I'm not aware of any CPU having this feature. I think a write to the
shared cache line invalidates all the shared copies; I cannot find any
reference to the CPU checking the value of the write. Do you have any
pointer to the feature?

I just tested this implementation vs. simply setting to 1, using Google
search as the benchmark. This one is 4.5x faster. The test was done on
Intel Westmere systems.

I can change the parameter to an option.

-Rong

> Richard.
>
>> Thanks,
>>
>> -Rong
On Fri, Nov 22, 2013 at 10:49 PM, Rong Xu <xur@google.com> wrote:
> On Fri, Nov 22, 2013 at 4:03 AM, Richard Biener
> <richard.guenther@gmail.com> wrote:
>> [...]
>> I assume that CPUs can avoid data races and false sharing for
>> non-changing accesses?
>
> I'm not aware of any CPU having this feature. I think a write to the
> shared cache line invalidates all the shared copies; I cannot find any
> reference to the CPU checking the value of the write. Do you have any
> pointer to the feature?

I don't have any pointer - but I remember seeing this in the context of
atomics, thus it may be only in the context of using an xchg or cmpxchg
instruction. Which would make it non-portable to some extent (if you
don't want to use atomic builtins here).

Richard.

> I just tested this implementation vs. simply setting to 1, using Google
> search as the benchmark. This one is 4.5x faster. The test was done on
> Intel Westmere systems.
>
> I can change the parameter to an option.
>
> -Rong
On Mon, Nov 25, 2013 at 2:11 AM, Richard Biener
<richard.guenther@gmail.com> wrote:
> [...]
> I don't have any pointer - but I remember seeing this in the context of
> atomics, thus it may be only in the context of using an xchg or cmpxchg
> instruction. Which would make it non-portable to some extent (if you
> don't want to use atomic builtins here).

cmpxchg should work here -- it's a conditional write, so the data race /
false sharing can be avoided. I'm comparing the performance between an
explicit branch and cmpxchg and will report back.

-Rong
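For concreteness, the cmpxchg variant under discussion could be sketched
with GCC's __sync builtins roughly as follows; this is only an
illustration of the idea being debated, not code from the patch:

  #include <stdint.h>

  /* Hypothetical edge counter; in real instrumentation this is a slot
     in the gcov counter array.  */
  static int64_t counter;

  void edge_taken_cmpxchg (void)
  {
    /* Store 1 only if the counter is still 0.  On x86 this compiles to
       a lock cmpxchg; the full barrier it implies is what made this
       variant the slowest in the measurements reported below.  */
    __sync_bool_compare_and_swap (&counter, 0, 1);
  }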
Here are the results using our internal benchmarks, which are a mix of
multi-threaded and single-threaded programs. This was collected about a
month ago, but I did not get time to send it due to an unexpected trip.

cmpxchg gives the worst performance due to the memory barriers it incurs.
I'll send a patch that supports conditional_1 and unconditional_1.

--------------------------------- result -----------------------------

base: original_coverage
(1): using_conditional_1 -- using branch (my original implementation)
(2): using_unconditional_1 -- write straight 1
(3): using_cmpxchg -- using cmpxchg write 1

Values are performance ratios where 100.0 equals the performance of O2.
Larger numbers are faster. "--" means the test failed due to running too
slowly.

arch : westmere
Benchmark        Base        (1)        (2)        (3)
---------------------------------------------------------
benchmark_1      26.4   +176.62%    +17.20%        --
benchmark_2        --     [78.4]     [12.3]        --
benchmark_3      86.3     +6.15%    +10.52%    -61.28%
benchmark_4      88.4     +6.59%    +14.26%    -68.76%
benchmark_5      89.6     +6.26%    +13.00%    -68.74%
benchmark_6      76.7    +22.28%    +29.15%    -75.31%
benchmark_7      89.0     -0.62%     +3.36%    -71.37%
benchmark_8      84.5     -1.45%     +5.27%    -74.04%
benchmark_9      81.3    +10.64%    +13.32%    -72.82%
benchmark_10     59.1    +44.71%    +14.77%    -73.24%
benchmark_11     90.3     -1.74%     +4.22%    -61.95%
benchmark_12     98.9     +0.07%     +0.48%     -6.37%
benchmark_13     74.0     -4.69%     +4.35%    -77.02%
benchmark_14     21.4   +309.92%    +63.41%    -35.82%
benchmark_15     21.4   +282.33%    +58.15%    -57.98%
benchmark_16     85.1     -7.71%     +1.65%    -60.72%
benchmark_17     81.7     +2.47%     +8.20%    -72.08%
benchmark_18     83.7     +1.59%     +3.83%    -69.33%
geometric mean           +30.30%    +14.41%    -65.66% (incomplete)

arch : sandybridge
Benchmark        Base        (1)        (2)        (3)
---------------------------------------------------------
benchmark_1        --     [70.1]     [26.1]        --
benchmark_2        --     [79.1]         --        --
benchmark_3      84.3    +10.82%    +15.84%    -68.98%
benchmark_4      88.5    +10.28%    +11.35%    -75.10%
benchmark_5      89.4    +10.46%    +11.40%    -74.41%
benchmark_6      65.5    +38.52%    +44.46%    -77.97%
benchmark_7      87.7     -0.16%     +1.74%    -76.19%
benchmark_8      89.6     -4.52%     +6.29%    -78.10%
benchmark_9      79.9    +13.43%    +19.44%    -75.99%
benchmark_10     52.6    +61.53%     +8.23%    -78.41%
benchmark_11     89.9     -1.40%     +3.37%    -68.16%
benchmark_12     99.0     +1.51%     +0.63%    -10.37%
benchmark_13     74.3     -6.75%     +3.89%    -81.84%
benchmark_14     21.8   +295.76%    +19.48%    -51.58%
benchmark_15     23.5   +257.20%    +29.33%    -83.53%
benchmark_16     84.4    -10.04%     +2.39%    -68.25%
benchmark_17     81.6     +0.60%     +8.82%    -78.02%
benchmark_18     87.4     -1.14%     +9.69%    -75.88%
geometric mean           +25.64%    +11.76%    -72.96% (incomplete)

arch : clovertown
Benchmark        Base        (1)        (2)        (3)
---------------------------------------------------------
benchmark_1        --     [83.4]         --        --
benchmark_2        --     [82.3]         --        --
benchmark_3      86.2     +7.58%    +13.10%    -81.74%
benchmark_4      89.4     +5.69%    +11.70%    -82.97%
benchmark_5      92.8     +4.67%     +7.48%    -80.02%
benchmark_6      78.1    +13.28%    +22.21%    -86.92%
benchmark_7      96.8     +0.25%     +5.44%    -84.94%
benchmark_8      89.1     +0.66%     +3.60%    -85.89%
benchmark_9      86.4     +8.42%     +9.95%    -82.30%
benchmark_10     59.7    +44.95%    +21.79%        --
benchmark_11     91.2     -0.29%     +4.35%    -76.05%
benchmark_12     99.0     +0.31%     -0.05%    -25.19%
benchmark_14      8.2  +1011.27%   +104.15%     +5.56%
benchmark_15     11.7   +669.25%   +108.54%    -29.83%
benchmark_16     85.7     -7.51%     +4.43%        --
benchmark_17     87.7     +2.84%     +7.45%        --
benchmark_18     87.9     +1.59%     +3.82%    -81.11%
geometric mean           +37.89%    +17.54%    -74.47% (incomplete)

arch : istanbul
Benchmark        Base        (1)        (2)        (3)
---------------------------------------------------------
benchmark_1        --     [73.2]         --        --
benchmark_2        --     [82.9]         --        --
benchmark_3      86.1     +4.56%    +11.68%    -61.04%
benchmark_4      92.0     +3.47%     +4.63%    -64.84%
benchmark_5      91.9     +4.18%     +4.90%    -64.77%
benchmark_6      73.6    +23.36%    +27.13%    -72.64%
benchmark_7      93.6     -3.57%     +4.76%    -68.54%
benchmark_8      88.9     -3.01%     +2.87%    -75.50%
benchmark_9      81.6     +9.91%    +14.10%    -69.81%
benchmark_10     69.8    +23.60%     -1.53%        --
benchmark_11     89.9     +0.34%     +4.32%    -59.62%
benchmark_12     98.7     +0.96%     +0.88%    -10.52%
benchmark_13     80.5     -8.31%     -0.55%    -75.90%
benchmark_14     17.1   +429.47%     -5.21%     -4.87%
benchmark_15     22.0   +295.70%     -1.29%    -46.36%
benchmark_16     80.0     -6.31%     +0.54%    -62.61%
benchmark_17     83.5     +4.84%    +10.71%    -70.48%
benchmark_18     90.0     -1.27%     +3.18%    -72.30%
geometric mean           +24.51%     +4.81%    -62.44% (incomplete)

benchmark_13     77.1     -3.34%     +5.75%        --

On Mon, Nov 25, 2013 at 11:19 AM, Rong Xu <xur@google.com> wrote:
> [...]
> cmpxchg should work here -- it's a conditional write, so the data race /
> false sharing can be avoided. I'm comparing the performance between an
> explicit branch and cmpxchg and will report back.
>
> -Rong
On Fri, Dec 20, 2013 at 11:45 PM, Rong Xu <xur@google.com> wrote:
> Here are the results using our internal benchmarks, which are a mix of
> multi-threaded and single-threaded programs. This was collected about a
> month ago, but I did not get time to send it due to an unexpected trip.
>
> cmpxchg gives the worst performance due to the memory barriers it incurs.
> I'll send a patch that supports conditional_1 and unconditional_1.
>
> [...]

Bah, too bad. Ok, so another idea is to use a non-temporal unconditional
store of 1. According to the docs this is weakly ordered and side-steps
the cache (movnti/movntq on x86).

Btw,

+@item coverage-exec_once
+Set to 1 to update each arc counter only once. This avoids false sharing
+and speeds up profile-generate run for multi-threaded programs.
+

for -fprofile-generate this certainly is a bad choice, but I can see
that it is useful for -coverage. Also avoid mixing - and _ here. Using a
--param here is not the proper thing for a non-developer feature, thus
I'd suggest adding -fcoverage-exec-once instead. It is supposed to help
for -coverage, right? Not for -fprofile-generate?

Instead of using a pointer-set to record stmts to "instrument", just set
one of the pass-local flags on the stmts (gimple_set_plf/gimple_plf; you
have to clear the flags before using them).

Thanks,
Richard.
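On x86-64, the non-temporal store Richard floats here could be sketched
with the SSE2 streaming-store intrinsic as below. This is purely
illustrative -- no such patch appears in this thread, and the ordering
caveat in the comment is an assumption based on the documented movnti
semantics:

  #include <stdint.h>
  #include <emmintrin.h>  /* SSE2: _mm_stream_si64 compiles to movnti */

  static int64_t counter;

  void edge_taken_nontemporal (void)
  {
    /* Unconditional store of 1 through a write-combining buffer,
       bypassing the cache hierarchy.  Weakly ordered, so a fence
       (e.g. _mm_sfence) would be needed before another thread reads
       the counter back, such as at gcov dump time.  */
    _mm_stream_si64 ((long long *) &counter, 1);
  }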
Index: gcc/doc/invoke.texi
===================================================================
--- gcc/doc/invoke.texi (revision 205226)
+++ gcc/doc/invoke.texi (working copy)
@@ -9937,6 +9937,10 @@ The default choice depends on the target.
 Set the maximum number of existing candidates that will be considered
 when seeking a basis for a new straight-line strength reduction candidate.
 
+@item coverage-exec_once
+Set to 1 to update each arc counter only once. This avoids false sharing
+and speeds up profile-generate run for multi-threaded programs.
+
 @end table
 
 @end table
Index: gcc/params.def
===================================================================
--- gcc/params.def (revision 205226)
+++ gcc/params.def (working copy)
@@ -441,6 +441,14 @@ DEFPARAM(TRACER_MIN_BRANCH_PROBABILITY,
         "Stop forward growth if the probability of best edge is less than this threshold (in percent). Used when profile feedback is not available",
         50, 0, 100)
 
+/* This parameter enables the fast coverage test.  Once the counter flips
+   from 0 to 1, we no longer update the value.  This avoids false sharing
+   and speeds up the profile-generate run for multi-threaded programs.  */
+DEFPARAM (PARAM_COVERAGE_EXEC_ONCE,
+          "coverage-exec_once",
+          "Stop incrementing edge counts once they become 1.",
+          0, 0, 1)
+
 /* The maximum number of incoming edges to consider for crossjumping.  */
 DEFPARAM(PARAM_MAX_CROSSJUMP_EDGES,
         "max-crossjump-edges",
Index: gcc/profile.c
===================================================================
--- gcc/profile.c (revision 205226)
+++ gcc/profile.c (working copy)
@@ -67,6 +67,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "cfgloop.h"
 #include "dumpfile.h"
 #include "cgraph.h"
+#include "params.h"
 
 #include "profile.h"
 
@@ -1327,6 +1328,9 @@ branch_prob (void)
 
       /* Commit changes done by instrumentation.  */
       gsi_commit_edge_inserts ();
+
+      if (PARAM_VALUE (PARAM_COVERAGE_EXEC_ONCE))
+        add_predicate_to_edge_counters ();
     }
 
   free_aux_for_edges ();
Index: gcc/profile.h
===================================================================
--- gcc/profile.h (revision 205226)
+++ gcc/profile.h (working copy)
@@ -46,6 +46,9 @@ extern gcov_type sum_edge_counts (vec<edge, va_gc>
 extern void init_node_map (bool);
 extern void del_node_map (void);
 
+/* Insert a predicate to the edge counter update stmt.  */
+extern void add_predicate_to_edge_counters (void);
+
 extern void get_working_sets (void);
 
 /* In predict.c.  */
Index: gcc/tree-profile.c
===================================================================
--- gcc/tree-profile.c (revision 205226)
+++ gcc/tree-profile.c (working copy)
@@ -52,6 +52,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "target.h"
 #include "tree-cfgcleanup.h"
 #include "tree-nested.h"
+#include "params.h"
 
 static GTY(()) tree gcov_type_node;
 static GTY(()) tree tree_interval_profiler_fn;
@@ -67,6 +68,11 @@ static GTY(()) tree ic_void_ptr_var;
 static GTY(()) tree ic_gcov_type_ptr_var;
 static GTY(()) tree ptr_void;
 
+/* A pointer-set of the last statement in each block of statements that
+   need a predicate applied.  */
+static struct pointer_set_t *pred_instrumentation_candidate = NULL;
+
+
 /* Do initialization work for the edge profiler.  */
 
 /* Add code:
@@ -295,6 +301,10 @@ gimple_gen_edge_profiler (int edgeno, edge e)
   stmt2 = gimple_build_assign_with_ops (PLUS_EXPR, gcov_type_tmp_var,
                                         gimple_assign_lhs (stmt1), one);
   stmt3 = gimple_build_assign (unshare_expr (ref), gimple_assign_lhs (stmt2));
+
+  if (PARAM_VALUE (PARAM_COVERAGE_EXEC_ONCE))
+    pointer_set_insert (pred_instrumentation_candidate, stmt3);
+
   gsi_insert_on_edge (e, stmt1);
   gsi_insert_on_edge (e, stmt2);
   gsi_insert_on_edge (e, stmt3);
@@ -554,6 +564,121 @@ gimple_gen_ior_profiler (histogram_value value, un
   gsi_insert_before (&gsi, call, GSI_NEW_STMT);
 }
 
+/* Insert STMT_IF around the given sequence of consecutive statements in
+   the same basic block, starting with STMT_START and ending with
+   STMT_END.  PROB is the probability of the taken branch.  */
+
+static void
+insert_if_then (gimple stmt_start, gimple stmt_end, gimple stmt_if, int prob)
+{
+  gimple_stmt_iterator gsi;
+  basic_block bb_original, bb_before_if, bb_after_if;
+  edge e_if_taken, e_then_join, e_else;
+  int orig_frequency;
+
+  gsi = gsi_for_stmt (stmt_start);
+  gsi_insert_before (&gsi, stmt_if, GSI_SAME_STMT);
+  bb_original = gsi_bb (gsi);
+  e_if_taken = split_block (bb_original, stmt_if);
+  e_if_taken->flags &= ~EDGE_FALLTHRU;
+  e_if_taken->flags |= EDGE_TRUE_VALUE;
+  e_then_join = split_block (e_if_taken->dest, stmt_end);
+  bb_before_if = e_if_taken->src;
+  bb_after_if = e_then_join->dest;
+  e_else = make_edge (bb_before_if, bb_after_if, EDGE_FALSE_VALUE);
+  orig_frequency = bb_original->frequency;
+  e_if_taken->probability = prob;
+  e_else->probability = REG_BR_PROB_BASE - prob;
+  e_if_taken->dest->frequency = orig_frequency * (prob / REG_BR_PROB_BASE);
+}
+
+/* Add a conditional stmt so that the counter update will only execute
+   one time.  */
+
+static void
+add_execonce_wrapper (gimple stmt_start, gimple stmt_end)
+{
+  tree zero, tmp1;
+  gimple stmt_if, stmt_assign;
+  gimple_stmt_iterator gsi;
+
+  tmp1 = make_temp_ssa_name (gcov_type_node, NULL, "PROF_temp");
+  stmt_assign = gimple_build_assign (tmp1,
+                                     unshare_expr (gimple_assign_lhs (stmt_end)));
+
+  zero = build_int_cst (get_gcov_type (), 0);
+  stmt_if = gimple_build_cond (EQ_EXPR, tmp1, zero, NULL_TREE, NULL_TREE);
+
+  gsi = gsi_for_stmt (stmt_start);
+  gsi_insert_before (&gsi, stmt_assign, GSI_SAME_STMT);
+
+  /* Insert IF block.  */
+  insert_if_then (stmt_start, stmt_end, stmt_if, 1);
+}
+
+/* Return whether STMT is an instrumentation block.  */
+
+static bool
+is_pred_instrumentation_candidate (gimple stmt)
+{
+  return pointer_set_contains (pred_instrumentation_candidate, stmt);
+}
+
+/* Number of statements inserted for each edge counter increment.  */
+#define EDGE_COUNTER_STMT_COUNT 3
+
+/* Add a predicate wrapper around the edge counter code in the current
+   function.  */
+
+void
+add_predicate_to_edge_counters (void)
+{
+  gimple_stmt_iterator gsi;
+  basic_block bb;
+
+  if (!PARAM_VALUE (PARAM_COVERAGE_EXEC_ONCE))
+    return;
+
+  FOR_EACH_BB_REVERSE (bb)
+    for (gsi = gsi_last_bb (bb); !gsi_end_p (gsi); gsi_prev (&gsi))
+      {
+        gimple stmt_end = gsi_stmt (gsi);
+        if (is_pred_instrumentation_candidate (stmt_end))
+          {
+            gimple stmt_beg;
+            int i;
+            int edge_counter_stmt_count = EDGE_COUNTER_STMT_COUNT;
+
+            for (i = 0; i < edge_counter_stmt_count - 1; i++)
+              gsi_prev (&gsi);
+            stmt_beg = gsi_stmt (gsi);
+            gcc_assert (stmt_beg);
+
+            add_execonce_wrapper (stmt_beg, stmt_end);
+
+            /* Reset the iterator and continue.  */
+            gsi = gsi_last_bb (bb);
+          }
+      }
+}
+
+static void
+cleanup_pred_instrumentation (void)
+{
+  /* Free the hashmap.  */
+  if (PARAM_VALUE (PARAM_COVERAGE_EXEC_ONCE))
+    {
+      pointer_set_destroy (pred_instrumentation_candidate);
+      pred_instrumentation_candidate = NULL;
+    }
+}
+
+static void
+init_pred_instrumentation (void)
+{
+  if (PARAM_VALUE (PARAM_COVERAGE_EXEC_ONCE)
+      && pred_instrumentation_candidate == 0)
+    pred_instrumentation_candidate = pointer_set_create ();
+
+}
+
 /* Profile all functions in the callgraph.  */
 
 static unsigned int
@@ -567,6 +692,8 @@ tree_profiling (void)
 
   init_node_map (true);
 
+  init_pred_instrumentation ();
+
   FOR_EACH_DEFINED_FUNCTION (node)
     {
       if (!gimple_has_body_p (node->decl))
@@ -656,6 +783,7 @@ tree_profiling (void)
 
   handle_missing_profiles ();
   del_node_map ();
+  cleanup_pred_instrumentation ();
   return 0;
 }
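To experiment with the posted patch, a coverage build with the new
behavior enabled would look roughly like this (an illustrative
invocation; the parameter name and its 0/1 range come from the
params.def hunk above):

  # Instrument for coverage with exec-once counter updates.
  # --param coverage-exec_once=1 is the knob added by this patch;
  # the default of 0 keeps the ordinary incrementing counters.
  gcc -O2 --coverage --param coverage-exec_once=1 -o myprog myprog.c

  ./myprog        # writes myprog.gcda; arc counts are either 0 or 1
  gcov myprog.c   # reported execution counts are capped at 1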