diff mbox

Remove global state from gcc/tracer.c

Message ID 1369306601.26167.70.camel@surprise
State New
Headers show

Commit Message

David Malcolm May 23, 2013, 10:56 a.m. UTC
On Thu, 2013-05-23 at 07:14 +0200, Jakub Jelinek wrote:
> On Wed, May 22, 2013 at 08:45:45PM -0400, David Malcolm wrote:
> > I'm attempting to eliminate global state from the insides of gcc.
> > 
> > gcc/tracer.c has various global variables, which are only used during
> > the lifetime of the execute callback of that pass, and cleaned up at the
> > end of each invocation of the pass.
> > 
> > The attached patch introduces a class to hold the state of the pass
> > ("tracer_state"), eliminating these globals.  An instance of the state
> > is created on the stack, and all of the various "static" functions in
> > tracer.c that used the globals become member functions of the state.
> > Hence the state is passed around by the implicit "this" of the
> > tracer_state, avoiding the need to patch each individual use of a field
> > within the state, minimizing the diff.
> 
> But do we want to handle the global state this way?  This adds overhead
> to (almost?) every single function (now method) in the file (because it gets
> an extra argument).  While that might be fine for rarely executed functions,
> if it involves also hot functions called many times, where especially on
> register starved hosts it could increase register pressure, plus the
> overhead of passing the this argument everywhere, this could start to be
> noticeable.  Sure, if you plan to do that just in one pass (but, why then?),
> it might be tiny slowdown, but after you convert the hundreds of passes in
> gcc that contain global state it might become significant.
> 
> There are alternative approaches that should be considered.

I thought of a possible way of doing this, attached is a
proof-of-concept attempt.

The idea is to use (and then not use) C++'s "static" syntax for class
methods and fields.  By making that optional with a big configure-time
switch, it gives us a way of making state be either global vs on-stack,
with minimal syntax changes.  In one configuration (for building gcc as
a library) there would be implicit this-> throughout, but in the other
(for speedy binaries) it would all compile away to global state, as per
the status quo.

This assumes that doing:

   tracer_state state;
   changed = state.tail_duplicate ();

is legitimate; when using global state, "state" is empty, and the call
to
  state.tail_duplicate ()
becomes effectively:
  state::tail_duplicate ()
since it's static in that configuration.

> E.g. global state of a pass can be moved into a per-pass structure,
> and have some way how to aggregate those per pass structures together from
> all the passes in the whole compiler (that can be either manual process,
> say each pass providing its own *-passstate.h and one big header including
> all that together), or automatic ones (say gengstate or a new tool could
> create those for us from special markings in the source, say new option on
> GTY or something) and have some magic macro how to access the global state
> within the pass (thispass->fieldname ?).  Then e.g. depending on how the
> compiler would be configured and built, thispass could be just address of a
> pass struct var (i.e. essentially keep the global state as is, for
> performance reasons), or when trying to build compiler as a library (with
> -fpic overhead we probably don't want for cc1/cc1plus - we can build all the
> *.o files twice, like libtool does) thispass could expand to __thread
> pointer var dereference plus a field inside of the global compiler state
> structure it points to for the current pass.  Thus, the library version
> of the compiler would be somewhat slower (both -fpic overhead and TLS
> overhead), and would need either a few of the entrypoints tweaked to adjust
> the TLS pointer to the global state, or we could require users to just call
> a special function to make the global state current in the current thread
> before calling compiler internals.

Thanks.   Though I thought we were trying to move away from relying on
GTY parsing?   (Sorry not to be able to answer more fully yet, need to
get family ready for school...)

Dave

Comments

David Malcolm May 23, 2013, 3:59 p.m. UTC | #1
On Thu, 2013-05-23 at 06:56 -0400, David Malcolm wrote:
> On Thu, 2013-05-23 at 07:14 +0200, Jakub Jelinek wrote:
> > On Wed, May 22, 2013 at 08:45:45PM -0400, David Malcolm wrote:
> > > I'm attempting to eliminate global state from the insides of gcc.
> > > 
> > > gcc/tracer.c has various global variables, which are only used during
> > > the lifetime of the execute callback of that pass, and cleaned up at the
> > > end of each invocation of the pass.
> > > 
> > > The attached patch introduces a class to hold the state of the pass
> > > ("tracer_state"), eliminating these globals.  An instance of the state
> > > is created on the stack, and all of the various "static" functions in
> > > tracer.c that used the globals become member functions of the state.
> > > Hence the state is passed around by the implicit "this" of the
> > > tracer_state, avoiding the need to patch each individual use of a field
> > > within the state, minimizing the diff.
> > 
> > But do we want to handle the global state this way?  This adds overhead
> > to (almost?) every single function (now method) in the file (because it gets
> > an extra argument).  While that might be fine for rarely executed functions,
> > if it involves also hot functions called many times, where especially on
> > register starved hosts it could increase register pressure, plus the
> > overhead of passing the this argument everywhere, this could start to be
> > noticeable.  Sure, if you plan to do that just in one pass (but, why then?),
> > it might be tiny slowdown, but after you convert the hundreds of passes in
> > gcc that contain global state it might become significant.
> > 
> > There are alternative approaches that should be considered.
> 
> I thought of a possible way of doing this, attached is a
> proof-of-concept attempt.
> 
> The idea is to use (and then not use) C++'s "static" syntax for class
> methods and fields.  By making that optional with a big configure-time
> switch, it gives us a way of making state be either global vs on-stack,
> with minimal syntax changes.  In one configuration (for building gcc as
> a library) there would be implicit this-> throughout, but in the other
> (for speedy binaries) it would all compile away to global state, as per
> the status quo.
> 
> This assumes that doing:
> 
>    tracer_state state;
>    changed = state.tail_duplicate ();
> 
> is legitimate; when using global state, "state" is empty, and the call
> to
>   state.tail_duplicate ()
> becomes effectively:
>   state::tail_duplicate ()
> since it's static in that configuration.
> 
> > E.g. global state of a pass can be moved into a per-pass structure,
> > and have some way how to aggregate those per pass structures together from
> > all the passes in the whole compiler (that can be either manual process,
> > say each pass providing its own *-passstate.h and one big header including
> > all that together), or automatic ones (say gengstate or a new tool could
> > create those for us from special markings in the source, say new option on
> > GTY or something) and have some magic macro how to access the global state
> > within the pass (thispass->fieldname ?).  Then e.g. depending on how the
> > compiler would be configured and built, thispass could be just address of a
> > pass struct var (i.e. essentially keep the global state as is, for
> > performance reasons), or when trying to build compiler as a library (with
> > -fpic overhead we probably don't want for cc1/cc1plus - we can build all the
> > *.o files twice, like libtool does) thispass could expand to __thread
> > pointer var dereference plus a field inside of the global compiler state
> > structure it points to for the current pass.  Thus, the library version
> > of the compiler would be somewhat slower (both -fpic overhead and TLS
> > overhead), and would need either a few of the entrypoints tweaked to adjust
> > the TLS pointer to the global state, or we could require users to just call
> > a special function to make the global state current in the current thread
> > before calling compiler internals.
> 
> Thanks.   Though I thought we were trying to move away from relying on
> GTY parsing?   (Sorry not to be able to answer more fully yet, need to
> get family ready for school...)

I've warmed to your idea of having tooling to support state, and
creating a generic framework for this.  For example, consider the
(long-term) use-case of embedding GCC's code as a library inside a
multithreaded app, where each thread could be JIT-compiling say OpenGL
shader code to machine code (perhaps with some extra non-standard
passes).  To get there, I'd need to isolate *all* of GGC's state, and
when I look at, say, the garbage-collector, I shudder.

So I'm interested in writing some compile-time tooling for generic
state-management in GCC, to sidestep the conflict between speed in the
status quo single state case vs support for multiple states.

Here's a possible way of doing it:

Given e.g. gcc/foo.c containing some state:
  static some_type bar;

then replace this with:

  DEFINE_STATE(some_type, bar);

which is trivial to parse, and have a preprocessing tool autogenerate a
header in the builddir:
   gcc/foo.c.state.h
that conditionally has something like this:

#if SUPPORT_MULTIPLE_STATES
struct foo_c_state
{
  some_type bar;
};
# define bar   ctxt->x_foo_c_bar;
/* there's a    "context *ctxt;" variable somewhere, possibly
   using TLS */

/* alternatively, we could do this as:
    ctxt.x_foo_c_bar
   and make changing context be an operation involving a memcpy, so the
   single global ctxt gets its state overwritten by a copy.  */

#else /* the single state case */

/* Just a global, as before: */
some_type bar;

#endif

That way, any code that uses "bar" has equal meaning (after
preprocessing) in the !SUPPORT_MULTIPLE_STATES case to the status quo,
and thus ought to retain the performance for that use case.

Probably need a DECLARE_STATE() macro also.

(I didn't want to reuse GTY() since there's plenty of global state that
isn't GC-managed, and IIRC the plan is to move away from GTY)

I can try prototyping this, if this approach sounds reasonable.  Is this
the sort of thing that would be best done as a branch? (e.g. a feature
branch in git).

BTW, I'm not convinced that TLS is the way to go for isolating these
details: what if someone wanted to parallelize a slow pass by
introducing a thread pool, optimizing each function on a different
thread?  I think it's reasonable to require a specific API call on the
part of client code when changing context (OpenGL works this way iirc,
though it *does* use TLS).  But I suppose that if the
DECLARE/DEFINE_STATE macros are in place, it becomes easier to
experiment with various implementations of state handling.

In a similar vein, would "universe" be a better name than "context"?
"context" suggests threads to some people, whereas what I have in mind
is the ability to have multiple clients of a future gcc-as-a-library,
each client having the ability to have "parallel universes" of gcc
state; some clients might spread themselves across several threads, but
want to see the same universe.  If GTY/PCH etc are layered on top of the
state management code, then clearly you can't share ptrs between states,
and the "parallel universe" metaphor is perhaps clearer.

Hope this is constructive (and that I'm vaguely on the right track here)
Dave
Richard Henderson May 23, 2013, 7:06 p.m. UTC | #2
On 05/23/2013 03:56 AM, David Malcolm wrote:
> The idea is to use (and then not use) C++'s "static" syntax for class
> methods and fields.  By making that optional with a big configure-time
> switch, it gives us a way of making state be either global vs on-stack,
> with minimal syntax changes.

Plausible.

Another thing I should mention while you're doing all of these static function
to class member conversions is that as written we're losing target-specific
optimizations that can be done on purely local functions.  This is trivially
fixed by placing these new classes in an anonymous namespace.

> +private:
> +
> +  /* Minimal outgoing edge probability considered for superblock
> +     formation.  */
> +  STATEFUL int probability_cutoff;
> +  STATEFUL int branch_ratio_cutoff;
> +
> +  /* A bit BB->index is set if BB has already been seen, i.e. it is
> +     connected to some trace already.  */
> +  STATEFUL sbitmap bb_seen;
...
> +#if GLOBAL_STATE
> +/* Global definitions of static data.  */
> +int tracer_state::probability_cutoff;
> +int tracer_state::branch_ratio_cutoff;
> +sbitmap tracer_state::bb_seen;
> +#endif
...
> +tracer_state::tracer_state()
> +#if !GLOBAL_STATE
> +  : probability_cutoff(0),
> +    branch_ratio_cutoff(0),
> +    bb_seen()
> +#endif
> +{
> +}
> +

What I don't like about this arrangement is that it requires three repetitions
of the state variables and their initializations.  I wonder if we can do better
with

class tracer_state
{
  private:
    struct data
    {
      int probability_cutoff;
      int branch_ratio_cutoff;
      sbitmap bb_seen;

      data()
        : probability_cutoff(0),
          branch_ratio_cutoff(0)
          bb_seen()
      { }
    };

    STATEFUL data d;

  public:
    tracer_state() { }
  ...
};

#if GLOBAL_STATE
tracer_state::data tracer_state::d;
#endif

which does require accesses as "d.foo" instead of just foo, but at least we've
cut down on the boilerplate.

Though with this I wonder if we need a CONSTEXPR define to markup
tracer_state::data::data so that the global variable doesn't wind up running a
constructor at runtime?  (I.e. performs correctly if inefficiently for stage1,
but stage[23] use g++ and thus can have the c++11 extension?)

> #if SUPPORT_MULTIPLE_STATES
> struct foo_c_state
> {
>   some_type bar;
> };
> # define bar   ctxt->x_foo_c_bar;
> /* there's a    "context *ctxt;" variable somewhere, possibly
>    using TLS */

I've an idea that this will perform very badly.  With ctxt being a global (if
tls) variable, it's call clobbered.  Thus we'll wind up reloading it all the
time, which for pic involves another call to the __get_tlsaddr runtime function.

For a very heavily used pointers, we're almost certainly better off having the
data be referenced via "this", where at least the starting point is known to be
function invariant.


r~
Jakub Jelinek May 23, 2013, 7:20 p.m. UTC | #3
On Thu, May 23, 2013 at 12:06:01PM -0700, Richard Henderson wrote:
> > struct foo_c_state
> > {
> >   some_type bar;
> > };
> > # define bar   ctxt->x_foo_c_bar;
> > /* there's a    "context *ctxt;" variable somewhere, possibly
> >    using TLS */
> 
> I've an idea that this will perform very badly.  With ctxt being a global (if
> tls) variable, it's call clobbered.  Thus we'll wind up reloading it all the
> time, which for pic involves another call to the __get_tlsaddr runtime function.

If all of gcc just has one __thread pointer in it, then we can just use
tls_model ("initial-exec") for it and lower that overhead down a lot.

	Jakub
Richard Henderson May 23, 2013, 7:27 p.m. UTC | #4
On 05/23/2013 12:20 PM, Jakub Jelinek wrote:
> On Thu, May 23, 2013 at 12:06:01PM -0700, Richard Henderson wrote:
>>> struct foo_c_state
>>> {
>>>   some_type bar;
>>> };
>>> # define bar   ctxt->x_foo_c_bar;
>>> /* there's a    "context *ctxt;" variable somewhere, possibly
>>>    using TLS */
>>
>> I've an idea that this will perform very badly.  With ctxt being a global (if
>> tls) variable, it's call clobbered.  Thus we'll wind up reloading it all the
>> time, which for pic involves another call to the __get_tlsaddr runtime function.
> 
> If all of gcc just has one __thread pointer in it, then we can just use
> tls_model ("initial-exec") for it and lower that overhead down a lot.

Not available on emutls systems.  But even with initial-exec the overhead is a
lot more than having the value stored within the local stack frame.

TLS is great for passing around rarely used state, which might be needed 10
call frames down but not in between.  E.g. any state for the presumed
per-thread memory allocator.

But it's not nearly so good for pass-specific data that potentially every
function within the pass might need.


r~
Richard Henderson May 23, 2013, 8:51 p.m. UTC | #5
On 05/23/2013 12:06 PM, Richard Henderson wrote:
> Another thing I should mention while you're doing all of these static function
> to class member conversions is that as written we're losing target-specific
> optimizations that can be done on purely local functions.  This is trivially
> fixed by placing these new classes in an anonymous namespace.

At which point it occurs to me that we don't need to distinguish between static
and normal member methods, nor play with sub-state structures.  All we need to
require is that IPA constant propagation sees that the initial "this" argument
is passed a constant value, and let it propagate and eliminate.

E.g.

namespace {

class pass_state
{
  private:
    int x, y, z;

  public:
    constexpr pass_state()
      : x(0), y(0), z(0)
    { }

    void doit();

  private:
    void a();
    void b();
    void c();
};

// ...

} // anon namespace

#ifdef GLOBAL_STATE
static pass_state ps;
#endif

void toplev()
{
#ifndef GLOBAL_STATE
  pass_state ps;
#endif
  ps.doit();
}

With an example that's probably too small, I verified that gcc will eliminate
all of the this parameters with GLOBAL_STATE defined.  It's certainly something
worth investigating further, as this scheme has the least amount of boilerplate
of any so far advanced.


r~
diff mbox

Patch

diff --git a/gcc/tracer.c b/gcc/tracer.c
index 975cadb..f83ac0b 100644
--- a/gcc/tracer.c
+++ b/gcc/tracer.c
@@ -53,20 +53,74 @@ 
 static int count_insns (basic_block);
 static bool ignore_bb_p (const_basic_block);
 static bool better_p (const_edge, const_edge);
-static edge find_best_successor (basic_block);
-static edge find_best_predecessor (basic_block);
-static int find_trace (basic_block, basic_block *);
 
-/* Minimal outgoing edge probability considered for superblock formation.  */
-static int probability_cutoff;
-static int branch_ratio_cutoff;
+/* Crude testing hack for switching between:
+     global state
+   vs
+     (on-stack state plus implicit this->)
+   This would be a config option controlling the whole build, so that
+   you'd use the former for a standalone build of gcc, and the latter
+   when building the code for use as a dynamic library.  */
+#define GLOBAL_STATE 1
+
+#if GLOBAL_STATE
+/* When using global state, all methods and fields of state classes
+   become "static", so that there is effectively a single global instance
+   of the state, and there is no implicit "this->" being passed around.  */
+# define STATEFUL static
+#else
+/* When using on-stack state, all methods and fields of state classes
+   lose the "static", so that there can be multiple instances of the state
+   with an implicit "this->" everywhere the state is used.  */
+# define STATEFUL
+#endif
+
+class tracer_state
+{
+public:
+  tracer_state();
+
+  STATEFUL bool tail_duplicate ();
+
+private:
+
+  STATEFUL edge find_best_successor (basic_block);
+  STATEFUL edge find_best_predecessor (basic_block);
+  STATEFUL int find_trace (basic_block, basic_block *);
+  STATEFUL void mark_bb_seen (basic_block bb);
+  STATEFUL bool bb_seen_p (basic_block bb);
+
+private:
+
+  /* Minimal outgoing edge probability considered for superblock
+     formation.  */
+  STATEFUL int probability_cutoff;
+  STATEFUL int branch_ratio_cutoff;
+
+  /* A bit BB->index is set if BB has already been seen, i.e. it is
+     connected to some trace already.  */
+  STATEFUL sbitmap bb_seen;
 
-/* A bit BB->index is set if BB has already been seen, i.e. it is
-   connected to some trace already.  */
-sbitmap bb_seen;
+}; // tracer_state
 
-static inline void
-mark_bb_seen (basic_block bb)
+#if GLOBAL_STATE
+/* Global definitions of static data.  */
+int tracer_state::probability_cutoff;
+int tracer_state::branch_ratio_cutoff;
+sbitmap tracer_state::bb_seen;
+#endif
+
+tracer_state::tracer_state()
+#if !GLOBAL_STATE
+  : probability_cutoff(0),
+    branch_ratio_cutoff(0),
+    bb_seen()
+#endif
+{
+}
+
+inline void
+tracer_state::mark_bb_seen (basic_block bb)
 {
   unsigned int size = SBITMAP_SIZE (bb_seen);
 
@@ -76,8 +130,8 @@  mark_bb_seen (basic_block bb)
   bitmap_set_bit (bb_seen, bb->index);
 }
 
-static inline bool
-bb_seen_p (basic_block bb)
+inline bool
+tracer_state::bb_seen_p (basic_block bb)
 {
   return bitmap_bit_p (bb_seen, bb->index);
 }
@@ -138,8 +192,8 @@  better_p (const_edge e1, const_edge e2)
 
 /* Return most frequent successor of basic block BB.  */
 
-static edge
-find_best_successor (basic_block bb)
+edge
+tracer_state::find_best_successor (basic_block bb)
 {
   edge e;
   edge best = NULL;
@@ -157,8 +211,8 @@  find_best_successor (basic_block bb)
 
 /* Return most frequent predecessor of basic block BB.  */
 
-static edge
-find_best_predecessor (basic_block bb)
+edge
+tracer_state::find_best_predecessor (basic_block bb)
 {
   edge e;
   edge best = NULL;
@@ -178,8 +232,8 @@  find_best_predecessor (basic_block bb)
 /* Find the trace using bb and record it in the TRACE array.
    Return number of basic blocks recorded.  */
 
-static int
-find_trace (basic_block bb, basic_block *trace)
+int
+tracer_state::find_trace (basic_block bb, basic_block *trace)
 {
   int i = 0;
   edge e;
@@ -220,8 +274,8 @@  find_trace (basic_block bb, basic_block *trace)
 /* Look for basic blocks in frequency order, construct traces and tail duplicate
    if profitable.  */
 
-static bool
-tail_duplicate (void)
+bool
+tracer_state::tail_duplicate ()
 {
   fibnode_t *blocks = XCNEWVEC (fibnode_t, last_basic_block);
   basic_block *trace = XNEWVEC (basic_block, n_basic_blocks);
@@ -376,7 +430,8 @@  tracer (void)
     brief_dump_cfg (dump_file, dump_flags);
 
   /* Trace formation is done on the fly inside tail_duplicate */
-  changed = tail_duplicate ();
+  tracer_state state;
+  changed = state.tail_duplicate ();
   if (changed)
     {
       free_dominance_info (CDI_DOMINATORS);