Message ID: 20101107215423.GA3607@laptop-hbl.localdomain
State: New
Holger Blasum <hbl@sysgo.com> writes:

> Rationale: gcov is not robust in a threaded environment; the patch
> makes it more robust.
>
> In particular, in threaded environments one can observe two effects:
> (1) underreporting, when coverage counters are overwritten by another
> instance of a thread after a thread switch;
> (2) generation of arbitrary (even negative) values by subtraction when
> reconstructing the whole tree from the counters used by the spanning
> tree optimization mechanism. (Recall that subtraction is messy
> numerically.)
>
> With the current gcov framework (I think) one cannot do anything
> really generic about (1).

A better fix would be to make all the gcov data per-thread data.

This is supported on most modern OSes in a straightforward way with a
storage class modifier. This would fix all the races and always give
accurate information.

Another advantage is that it would speed up data collection with
multiple threads: no frequent cache-line bouncing on the counters when
multiple CPUs modify them in parallel.

The only change needed would be to sum the counters up explicitly on
thread destruction into a global copy that is accessed atomically.

-Andi
Hi,

> Holger Blasum <hbl@sysgo.com> writes:
>
> > Rationale: gcov is not robust in a threaded environment; the patch
> > makes it more robust.
> >
> > In particular, in threaded environments one can observe two effects:
> > (1) underreporting, when coverage counters are overwritten by another
> > instance of a thread after a thread switch;
> > (2) generation of arbitrary (even negative) values by subtraction when
> > reconstructing the whole tree from the counters used by the spanning
> > tree optimization mechanism. (Recall that subtraction is messy
> > numerically.)
> >
> > With the current gcov framework (I think) one cannot do anything
> > really generic about (1).
>
> A better fix would be to make all the gcov data per-thread data.
>
> This is supported on most modern OSes in a straightforward way with a
> storage class modifier. This would fix all the races and always give
> accurate information.
>
> Another advantage is that it would speed up data collection with
> multiple threads: no frequent cache-line bouncing on the counters when
> multiple CPUs modify them in parallel.
>
> The only change needed would be to sum the counters up explicitly on
> thread destruction into a global copy that is accessed atomically.

I had a patch implementing this a few years back:

http://gcc.gnu.org/ml/gcc-patches/2002-01/msg00195.html

which however did not make it to mainline (IIRC, there were objections
against the extra memory and runtime costs: if you have a program with
hundreds of thousands of basic blocks, you do not really want to use
the corresponding megabytes of memory per thread for profiling, and
spend time initializing them on thread start and summing them on
thread end).

Probably a better approach (using atomic primitives for manipulating
the profile) was proposed in

http://gcc.gnu.org/ml/gcc-patches/2007-03/msg00950.html

which also never got merged, it seems.

Zdenek
Zdenek Dvorak <rakdver@kam.mff.cuni.cz> writes:

> I had a patch implementing this a few years back:
>
> http://gcc.gnu.org/ml/gcc-patches/2002-01/msg00195.html
>
> which however did not make it to mainline (IIRC, there were objections
> against the extra memory and runtime costs: if you have a program with
> hundreds of thousands of basic blocks, you do not really want to use
> the corresponding megabytes of memory per thread for profiling, and
> spend time initializing them on thread start and summing them on
> thread end).

Doesn't seem like a big problem to me on modern systems. I guess one
could make it optional for the case of someone running on a
memory-constrained target.

> Probably a better approach (using atomic primitives for manipulating
> the profile) was proposed in
>
> http://gcc.gnu.org/ml/gcc-patches/2007-03/msg00950.html
>
> which also never got merged, it seems.

Atomics are not a better approach. Atomics will never scale well and
can experience catastrophic performance degradation on larger systems
due to excessive communication overhead when lots of CPUs try to
manipulate the same counters in parallel.

On many NUMA systems there is also unfair memory, which makes them even
worse.

-Andi
On Mon, Nov 8, 2010 at 12:17 PM, Andi Kleen <andi@firstfloor.org> wrote:
> Zdenek Dvorak <rakdver@kam.mff.cuni.cz> writes:
>>
>> I had a patch implementing this a few years back:
>>
>> http://gcc.gnu.org/ml/gcc-patches/2002-01/msg00195.html
>>
>> which however did not make it to mainline (IIRC, there were objections
>> against the extra memory and runtime costs: if you have a program with
>> hundreds of thousands of basic blocks, you do not really want to use
>> the corresponding megabytes of memory per thread for profiling, and
>> spend time initializing them on thread start and summing them on
>> thread end).
>
> Doesn't seem like a big problem to me on modern systems. I guess one
> could make it optional for the case of someone running on a
> memory-constrained target.
>
>> Probably a better approach (using atomic primitives for manipulating
>> the profile) was proposed in
>>
>> http://gcc.gnu.org/ml/gcc-patches/2007-03/msg00950.html
>>
>> which also never got merged, it seems.
>
> Atomics are not a better approach. Atomics will never scale well and
> can experience catastrophic performance degradation on larger systems
> due to excessive communication overhead when lots of CPUs try to
> manipulate the same counters in parallel.
>
> On many NUMA systems there is also unfair memory, which makes them
> even worse.

The initial copying and setup could be avoided by using a common
zero-initialized COW mapping for the counters (can we have COW TLS
mappings?). Won't solve the thread destruction time issue though (but
that could be done via a separate thread).

Richard.

> -Andi
>
> --
> ak@linux.intel.com -- Speaking for myself only.
Index: gcc/opts.c
===================================================================
--- gcc/opts.c	(revision 166172)
+++ gcc/opts.c	(working copy)
@@ -1973,6 +1973,10 @@
       profile_data_prefix = xstrdup (arg);
       break;

+    case OPT_fprofile_whole_graph:
+      flag_profile_whole_graph = 1;
+      break;
+
     case OPT_fprofile_use_:
       profile_data_prefix = xstrdup (arg);
       flag_profile_use = true;
Index: gcc/profile.c
===================================================================
--- gcc/profile.c	(revision 166172)
+++ gcc/profile.c	(working copy)
@@ -1019,9 +1019,10 @@
   /* Create spanning tree from basic block graph, mark each edge
      that is on the spanning tree.  We insert as many abnormal and
      critical edges as possible to minimize number of edge splits
      necessary.  */
+
+  if (!flag_profile_whole_graph)
+    find_spanning_tree (el);
-  find_spanning_tree (el);
-
   /* Fake edges that are not on the tree will not be instrumented, so
      mark them ignored.  */
   for (num_instrumented = i = 0; i < num_edges; i++)
Index: gcc/common.opt
===================================================================
--- gcc/common.opt	(revision 166172)
+++ gcc/common.opt	(working copy)
@@ -1279,6 +1279,10 @@
 Common Joined RejectNegative
 Enable common options for generating profile info for profile feedback directed optimizations, and set -fprofile-dir=

+fprofile-whole-graph
+Common Var(flag_profile_whole_graph, 0) RejectNegative
+Use whole basic block graph for profiling, do not use spanning tree optimization (more memory usage, but better conservative approximation for multithreaded execution).
+
 fprofile-use
 Common Var(flag_profile_use)
 Enable common options for performing profile feedback directed optimizations