gcov: option to instrument whole basic block graph, no use of spanning tree optimization

Message ID 20101107215423.GA3607@laptop-hbl.localdomain
State New

Commit Message

Holger Blasum Nov. 7, 2010, 9:54 p.m. UTC
Hello, 

The attached patch introduces an option to instrument the whole basic
block graph and not just its spanning tree.

Rationale: gcov is not robust in a threaded environment; the patch
    makes it more robust.

In particular, in threaded environments one can observe two effects: 
(1) underreporting, when coverage counter updates are overwritten by 
another thread after a thread switch; 
(2) generation of arbitrary (even negative) values by subtraction, when 
reconstructing the whole graph from the counters used by the spanning 
tree optimization mechanism. (Recall that subtracting mutually 
inconsistent counters can yield arbitrary results.)
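
To make (2) concrete, here is a minimal sketch (not gcov's actual code;
the edge names and counter values are made up, chosen to reproduce the
kind of artifact visible as -1067519 on line 5 of the first listing
below):

#include <stdio.h>

/* Suppose block B is entered via instrumented edge e0 and left via
   edge e1 (instrumented) and edge e2 (on the spanning tree, so not
   instrumented).  Flow conservation says that whatever enters B must
   leave it, so gcov derives count(e2) = count(e0) - count(e1).  When
   two threads do the unsynchronized count++ in parallel, each counter
   loses a different number of updates, so the subtraction can go
   negative.  */
int main (void)
{
  long long count_e0 = 182760737;  /* lost more updates ...     */
  long long count_e1 = 183828256;  /* ... than this counter did */

  long long count_e2 = count_e0 - count_e1;  /* gcov-style reconstruction */

  printf ("derived count(e2) = %lld\n", count_e2);  /* prints -1067519 */
  return 0;
}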

With the current gcov framework (I think) one cannot do anything 
really generic about (1). However, (2) can be fixed by optionally 
disabling the spanning tree optimization. 

Question: Why is fixing only (2) useful? 
Answer: One often wants the 
    property P: if a counter is non-zero, then its
    path has had some (potentially underreported but non-zero) coverage.

So fixing (2) is valuable: when (2) is fixed, P holds, 
    whereas when (2) is not fixed, P does not necessarily hold 
    (I think) because of potential cancellation effects in the subtraction. 

Moreover I think fixing (2) also gives 
    property Q: if a path has had some (potentially underreported but 
        non-zero) coverage, then its counter is non-zero (the converse of P),
    because even when concurrency loses some updates, the value that 
        overwrites the counter is itself never 0. 

The following program is used for demonstration. 

$ cat uncontrolled.c
#include <pthread.h>
#include <stdio.h>
#include <sys/io.h>

void *worker (void *args) {
	int i; 
	int j = 0;
	for (i = 0; i < 100000*1000; i++) {
		j++;
		if (j % 100000 == 0) {
			printf("j now is %d.\n", j);
		}	
	}
		
}

int main () 
{	
	iopl(3);
	pthread_t t1, t2;
	pthread_create (&t1, NULL, worker, NULL);
	pthread_create (&t2, NULL, worker, NULL);
	pthread_join (t1, NULL);
	pthread_join (t2, NULL);
	return 0;
}

 ... when run via ...
 
$ gcc -fprofile-arcs -ftest-coverage -O0 -o uncontrolled uncontrolled.c -lpthread
$ ./uncontrolled
$ gcov uncontrolled

 ... we observe effects (1) and (2)

$ cat uncontrolled.c.gcov
        -:    0:Source:uncontrolled.c
        -:    0:Graph:uncontrolled.gcno
        -:    0:Data:uncontrolled.gcda
        -:    0:Runs:1
        -:    0:Programs:1
        -:    1:#include <pthread.h>
        -:    2:#include <stdio.h>
        -:    3:#include <sys/io.h>
        -:    4:
 -1067519:    5:void *worker (void *args) {
        -:    6:	int i; 
 -1067519:    7:	int j = 0;
182760737:    8:	for (i = 0; i < 100000*1000; i++) {
182760735:    9:		j++;
182760735:   10:		if (j % 100000 == 0) {
     2000:   11:			printf("j now is %d.\n", j);
        -:   12:		}	
        -:   13:	}
        -:   14:		
        2:   15:}
        -:   16:
        1:   17:int main () 
        -:   18:{	
        1:   19:	iopl(3);
        -:   20:	pthread_t t1, t2;
        1:   21:	pthread_create (&t1, NULL, worker, NULL);
        1:   22:	pthread_create (&t2, NULL, worker, NULL);
        1:   23:	pthread_join (t1, NULL);
        1:   24:	pthread_join (t2, NULL);
        1:   25:	return 0;
        -:   26:}


When compiled with the patched version (adding -fprofile-whole-graph) this becomes:
    
$ gcc -fprofile-whole-graph -fprofile-arcs -ftest-coverage -O0 -o uncontrolled uncontrolled.c -lpthread
$ ./uncontrolled
$ gcov uncontrolled

    ... we only observe effect (1)

$ cat uncontrolled.c.gcov
        -:    0:Source:uncontrolled.c
        -:    0:Graph:uncontrolled.gcno
        -:    0:Data:uncontrolled.gcda
        -:    0:Runs:1
        -:    0:Programs:1
        -:    1:#include <pthread.h>
        -:    2:#include <stdio.h>
        -:    3:#include <sys/io.h>
        -:    4:
        2:    5:void *worker (void *args) {
        -:    6:	int i; 
        2:    7:	int j = 0;
187663616:    8:	for (i = 0; i < 100000*1000; i++) {
187879212:    9:		j++;
187879212:   10:		if (j % 100000 == 0) {
     2000:   11:			printf("j now is %d.\n", j);
        -:   12:		}	
        -:   13:	}
        -:   14:		
        2:   15:}
        -:   16:
        1:   17:int main () 
        -:   18:{	
        1:   19:	iopl(3);
        -:   20:	pthread_t t1, t2;
        1:   21:	pthread_create (&t1, NULL, worker, NULL);
        1:   22:	pthread_create (&t2, NULL, worker, NULL);
        1:   23:	pthread_join (t1, NULL);
        1:   24:	pthread_join (t2, NULL);
        1:   25:	return 0;
        -:   26:}

For further discussion see also section 4 of 
http://sysrun.haifa.il.ibm.com/hrl/greps2007/papers/gcov-on-an-embedded-system.pdf .
Actually I'm submitting this here because someone had noticed the paper and 
then asked why this isn't fixed yet :-)

Regards,

Comments

Andi Kleen Nov. 7, 2010, 11:01 p.m. UTC | #1
Holger Blasum <hbl@sysgo.com> writes:

> Rationale: gcov is not robust in a threaded environment; the patch
>     makes it more robust.
>
> In particular, in threaded environments one can observe two effects: 
> (1) underreporting, when coverage counter updates are overwritten by 
> another thread after a thread switch; 
> (2) generation of arbitrary (even negative) values by subtraction, when 
> reconstructing the whole graph from the counters used by the spanning 
> tree optimization mechanism. (Recall that subtracting mutually 
> inconsistent counters can yield arbitrary results.)
>
> With the current gcov framework (I think) one cannot do anything 
> really generic about (1). 

A better fix would be to make all the gcov data per-thread data.

This is supported on most modern OSes in a straightforward way
with a storage class modifier. This would fix all the races and 
always give accurate information.

Another advantage is that it would speed up data collection
with multiple threads -- no frequent cache-line bouncing on the
counters when multiple CPUs modify them in parallel.

The only change needed would be to sum it up explicitly on
thread destruction into a global copy that is accessed atomically.
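
A minimal sketch of the idea (assuming __thread TLS, a pthread key
destructor and the __sync builtins with 64-bit atomics; the names and
the fixed NCOUNTERS are illustrative, not the gcov runtime's):

#include <pthread.h>

#define NCOUNTERS 4  /* illustrative; really one counter per arc */

static __thread long long local_counts[NCOUNTERS]; /* per-thread, race-free */
static long long global_counts[NCOUNTERS];         /* the summed-up totals  */
static pthread_key_t counts_key;

/* Runs when a registered thread exits: fold its counts into the
   global copy with one atomic add per counter.  */
static void
flush_counts (void *unused)
{
  int i;
  (void) unused;
  for (i = 0; i < NCOUNTERS; i++)
    __sync_fetch_and_add (&global_counts[i], local_counts[i]);
}

/* Call once at process start (a real implementation would also flush
   the main thread at exit, since key destructors do not run for it). */
static void
counts_init (void)
{
  pthread_key_create (&counts_key, flush_counts);
}

/* Call once per thread at thread start so the destructor fires.  */
static void
thread_register_counts (void)
{
  pthread_setspecific (counts_key, local_counts);
}

/* The instrumented increment itself stays plain and unsynchronized: */
static inline void
count_arc (int i)
{
  local_counts[i]++;
}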

-Andi
Zdenek Dvorak Nov. 8, 2010, 10:45 a.m. UTC | #2
Hi,

> Holger Blasum <hbl@sysgo.com> writes:
> 
> > Rationale: gcov is not robust in a threaded environment; the patch
> >     makes it more robust.
> >
> > In particular, in threaded environments one can observe two effects: 
> > (1) underreporting, when coverage counter updates are overwritten by 
> > another thread after a thread switch; 
> > (2) generation of arbitrary (even negative) values by subtraction, when 
> > reconstructing the whole graph from the counters used by the spanning 
> > tree optimization mechanism. (Recall that subtracting mutually 
> > inconsistent counters can yield arbitrary results.)
> >
> > With the current gcov framework (I think) one cannot do anything 
> > really generic about (1). 
> 
> A better fix would be to make all the gcov data per-thread data.
> 
> This is supported on most modern OSes in a straightforward way
> with a storage class modifier. This would fix all the races and 
> always give accurate information.
> 
> Another advantage is that it would speed up data collection
> with multiple threads -- no frequent cache-line bouncing on the
> counters when multiple CPUs modify them in parallel.
> 
> The only change needed would be to sum it up explicitly on
> thread destruction into a global copy that is accessed atomically.

I had a patch implementing this a few years back:

http://gcc.gnu.org/ml/gcc-patches/2002-01/msg00195.html

which however did not make it to mainline (IIRC, there were objections to
the extra memory and runtime costs -- if you have a program with hundreds of
thousands of basic blocks, you do not really want to use the corresponding
megabytes of memory per thread for profiling, and to spend time initializing
them on thread start and summing them on thread end).  Probably a better
approach (using atomic primitives for manipulating the profile) was proposed in
(using atomic primitives for manipulating the profile) was proposed in

http://gcc.gnu.org/ml/gcc-patches/2007-03/msg00950.html

which also never got merged, it seems.
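
The gist of the atomic approach, as a sketch (the linked patch may
differ in its details), is to replace the plain counter update that
the instrumentation emits today with an atomic read-modify-write:

  counter++;                           /* today: racy, updates can be lost */
  __sync_fetch_and_add (&counter, 1);  /* atomic variant (__sync builtins) */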

Zdenek
Andi Kleen Nov. 8, 2010, 11:17 a.m. UTC | #3
Zdenek Dvorak <rakdver@kam.mff.cuni.cz> writes:
>
> I had a patch implementing this a few years back:
>
> http://gcc.gnu.org/ml/gcc-patches/2002-01/msg00195.html
>
> which however did not make it to mainline (IIRC, there were objections to
> the extra memory and runtime costs -- if you have a program with hundreds of
> thousands of basic blocks, you do not really want to use the corresponding
> megabytes of memory per thread for profiling, and to spend time initializing
> them on thread start and summing them on thread end).

Doesn't seem like a big problem to me on modern systems.
I guess one could make it optional for the case of someone
running on a memory-constrained target.

> Probably a better approach
> (using atomic primitives for manipulating the profile) was proposed in
>
> http://gcc.gnu.org/ml/gcc-patches/2007-03/msg00950.html
>
> which also never got merged, it seems.

Atomics are not a better approach. Atomics will never scale well and can
experience catastrophic performance degradation on larger systems due to
excessive communication overheads when many CPUs try to manipulate
the same counters in parallel.

On many NUMA systems there is also unfair memory access, which makes
them even worse.

-Andi
Richard Biener Nov. 8, 2010, 12:02 p.m. UTC | #4
On Mon, Nov 8, 2010 at 12:17 PM, Andi Kleen <andi@firstfloor.org> wrote:
> Zdenek Dvorak <rakdver@kam.mff.cuni.cz> writes:
>>
>> I had a patch implementing this a few years back:
>>
>> http://gcc.gnu.org/ml/gcc-patches/2002-01/msg00195.html
>>
>> which however did not make it to mainline (IIRC, there were objections to
>> the extra memory and runtime costs -- if you have a program with hundreds of
>> thousands of basic blocks, you do not really want to use the corresponding
>> megabytes of memory per thread for profiling, and to spend time initializing
>> them on thread start and summing them on thread end).
>
> Doesn't seem like a big problem to me on modern systems.
> I guess one could make it optional for the case of someone
> running on a memory-constrained target.
>
>> Probably a better approach
>> (using atomic primitives for manipulating the profile) was proposed in
>>
>> http://gcc.gnu.org/ml/gcc-patches/2007-03/msg00950.html
>>
>> which also never got merged, it seems.
>
> Atomics are not a better approach. Atomics will never scale well and can
> experience catastrophic performance degradation on larger systems due to
> excessive communication overheads when many CPUs try to manipulate
> the same counters in parallel.
>
> On many NUMA systems there is also unfair memory access, which makes
> them even worse.

The initial copying and setup could be avoided by using a common
zero-initialized COW mapping for the counters (can we have COW TLS
mappings?).  Won't solve the thread destruction time issue though
(but that could be done via a separate thread).
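
A sketch of such a lazily-zeroed counter area (assuming mmap with
MAP_ANONYMOUS, as on Linux/BSD; alloc_thread_counters is an invented
name): anonymous private pages are zero-filled copy-on-write, so
nothing is copied or initialized up front, and physical memory is only
allocated for the pages a thread actually dirties:

#include <stddef.h>
#include <sys/mman.h>

static long long *
alloc_thread_counters (size_t n)
{
  void *p = mmap (NULL, n * sizeof (long long),
                  PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  return p == MAP_FAILED ? NULL : (long long *) p;
}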

Richard.

> -Andi
>
> --
> ak@linux.intel.com -- Speaking for myself only.
>

Patch

Index: gcc/opts.c
===================================================================
--- gcc/opts.c	(revision 166172)
+++ gcc/opts.c	(working copy)
@@ -1973,6 +1973,10 @@ 
       profile_data_prefix = xstrdup (arg);
       break;
 
+    case OPT_fprofile_whole_graph:
+      flag_profile_whole_graph = 1;
+      break;
+
     case OPT_fprofile_use_:
       profile_data_prefix = xstrdup (arg);
       flag_profile_use = true;
Index: gcc/profile.c
===================================================================
--- gcc/profile.c	(revision 166172)
+++ gcc/profile.c	(working copy)
@@ -1019,9 +1019,9 @@ 
   /* Create spanning tree from basic block graph, mark each edge that is
      on the spanning tree.  We insert as many abnormal and critical edges
      as possible to minimize number of edge splits necessary.  */
+  if (!flag_profile_whole_graph)
+    find_spanning_tree (el);
 
-  find_spanning_tree (el);
-
   /* Fake edges that are not on the tree will not be instrumented, so
      mark them ignored.  */
   for (num_instrumented = i = 0; i < num_edges; i++)
Index: gcc/common.opt
===================================================================
--- gcc/common.opt	(revision 166172)
+++ gcc/common.opt	(working copy)
@@ -1279,6 +1279,10 @@ 
 Common Joined RejectNegative
 Enable common options for generating profile info for profile feedback directed optimizations, and set -fprofile-dir=
 
+fprofile-whole-graph
+Common Var(flag_profile_whole_graph, 0) RejectNegative
+Use the whole basic block graph for profiling; do not use the spanning tree optimization (more memory use, but a better conservative approximation for multithreaded execution).
+
 fprofile-use
 Common Var(flag_profile_use)
 Enable common options for performing profile feedback directed optimizations