[4/4] define ASM_OUTPUT_LABEL to the name of a function

Message ID 1438886869.21752.62.camel@surprise
State New

Commit Message

David Malcolm Aug. 6, 2015, 6:47 p.m. UTC
On Wed, 2015-08-05 at 16:22 -0400, Trevor Saunders wrote:
> On Wed, Aug 05, 2015 at 11:34:28AM -0400, David Malcolm wrote:
> > On Wed, 2015-08-05 at 11:28 -0400, David Malcolm wrote:
> > > On Wed, 2015-08-05 at 13:47 +0200, Richard Biener wrote:
> > > > On Wed, Aug 5, 2015 at 12:57 PM, Trevor Saunders <tbsaunde@tbsaunde.org> wrote:
> > > > > On Mon, Jul 27, 2015 at 11:06:58AM +0200, Richard Biener wrote:
> > > > >> On Sat, Jul 25, 2015 at 4:37 AM,  <tbsaunde+gcc@tbsaunde.org> wrote:
> > > > >> > From: Trevor Saunders <tbsaunde+gcc@tbsaunde.org>
> > > > >> >
> > > > >> >         * config/arc/arc.h, config/bfin/bfin.h, config/frv/frv.h,
> > > > >> >         config/ia64/ia64-protos.h, config/ia64/ia64.c, config/ia64/ia64.h,
> > > > >> >         config/lm32/lm32.h, config/mep/mep.h, config/mmix/mmix.h,
> > > > >> >         config/rs6000/rs6000.c, config/rs6000/xcoff.h, config/spu/spu.h,
> > > > >> >         config/visium/visium.h, defaults.h: Define ASM_OUTPUT_LABEL to
> > > > >> > the name of a function.
> > > > >> >         * output.h (default_output_label): New prototype.
> > > > >> >         * varasm.c (default_output_label): New function.
> > > > >> >         * vmsdbgout.c: Include tm_p.h.
> > > > >> >         * xcoffout.c: Likewise.
> > > > >>
> > > > >> Just a general remark - the GCC output machinery is known to be slow,
> > > > >> adding indirect calls might be not the very best idea without refactoring
> > > > >> some of it.
> > > > >>
> > > > >> Did you do any performance measurements for artificial testcases
> > > > >> exercising the specific bits you change?
> > > > >
> > > > > sorry about the delay, but I finally got a chance to do some perf tests
> > > > > of the first patch.  I took three test cases fold-const.ii, insn-emit.ii
> > > > > and a random .i from firefox and did 3 trials of the length of 100
> > > > > compilations.  The only non default flag was -std=gnu++11.
> > > > >
[...snip results...]
> > > > >
> > > > > So, roughly that looks to me like a range from improving by .5% to
> > > > > regressing by 1%.  I'm not sure what could cause an improvement, so I
> > > > > kind of wonder how valid these results are.
> > > > 
> > > > Hmm, indeed.  The speedup looks suspicious.
> > > > 
> > > > > Another question is how one can refactor the output machinery to be
> > > > > faster.  My first thought is to buffer text internally before calling
> > > > > stdio functions, but that seems like a giant job.
> > > > 
> > > > stdio functions are already buffering, so I don't know either.
> > > > 
> > > > But yes, going the libas route would improve things here, or for
> > > > example enhancing gas to be able to eat target binary data
> > > > without the need to encode it in printable characters...
> > > > 
> > > > .raw_data number-of-bytes
> > > > <raw data>
> > > > 
> > > > Makes it quite unparsable to editors of course ...
> > > 
> > > A middle-ground might be to do both:
> > > 
> > > .raw_data number-of-bytes
> > > <raw data>
> > 
> > Sorry, I hit "Send" too early; I meant something like this as a
> > middle-ground:
> > 
> >   .raw_data number-of-bytes
> >   <raw data>
> > 
> >   ; comment giving the formatted text
> > 
> > so that cc1 etc are doing the formatting work to make the comment, so
> > that human readers can see what the raw data is meant to be, but the
> > assembler doesn't have to do work to parse it.
> 
> well, having random bytes in the file might still screw up editors, and
> I'd kind of expect that to be slower overall since gcc still does the
> formatting, and both gcc and as do more IO.
> 
> > FWIW, I once had a go at hiding asm_out_file behind a class interface,
> > trying to build up higher-level methods on top of raw text printing.
> > Maybe that's a viable migration strategy  (I didn't finish that patch).
> 
> I was thinking about trying that, but I couldn't think of a good way to
> do it incrementally.
> 
> Trev

Attached is a patch from some experimentation, very much a
work-in-progress.

It eliminates the macro ASM_OUTPUT_LABEL in favor of calls to a method
of an "output" object:

  g_output.output_label (lab);

g_output would be a thin wrapper around asm_out_file (with the
assumption that asm_out_file never changes to point at anything else).

One idea here is to gradually replace uses of asm_out_file with methods
of g_output, giving us a possible approach for tackling the "don't
format so much and then parse it again" optimization.

Another idea here is to use templates and specialization in place of
target macros, to capture things in the type system;
g_output is actually:

  output<target_t> g_output;

which has a default implementation of output_label corresponding to the
current default ASM_OUTPUT_LABEL:

template <typename Target>
inline void
output<Target>::output_label (const char *name)
{
  assemble_name (name);
  puts (":\n");  
}

...but a specific Target traits class could have a specialization e.g.

template <>
inline void
output<target_arm>::output_label (const char *name)
{
  arm_asm_output_labelref (name);
}
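
For anyone who wants to try the pattern outside the GCC tree, here is a
minimal, self-contained sketch of it.  It is an illustration only, not
the attached patch: target_generic and target_example are hypothetical
traits classes, the ".label" syntax is made up, and a std::string stands
in for asm_out_file.

```cpp
#include <string>

// Hypothetical stand-ins for target traits classes; not real GCC types.
struct target_generic {};
struct target_example {};

// Thin wrapper that collects assembly text.  The real g_output would wrap
// asm_out_file; a std::string keeps this sketch self-contained.
template <typename Target>
class output
{
public:
  // Default behaviour, mirroring the default ASM_OUTPUT_LABEL:
  // the label name followed by ":\n".
  void output_label (const char *name)
  {
    m_text += name;
    m_text += ":\n";
  }

  const std::string &text () const { return m_text; }

private:
  std::string m_text;
};

// Explicit specialization of a single member for one target.  The choice
// is made at compile time, so no indirect call is involved.
template <>
inline void
output<target_example>::output_label (const char *name)
{
  m_text += ".label ";	// made-up syntax, purely for illustration
  m_text += name;
  m_text += "\n";
}
```

With this, output<target_generic> uses the default body and
output<target_example> picks up the specialization, with no
preprocessor involved.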

This could give us (I hope) equivalent performance to the current
macro-based approach, but without using the preprocessor, albeit adding
some C++ (the non-trivial use of templates gives me pause).

That said, I've barely tested it; posting here in the hope it's
constructive.

Dave

Comments

Richard Sandiford Aug. 6, 2015, 7:36 p.m. UTC | #1
David Malcolm <dmalcolm@redhat.com> writes:
> [...snip...]

I might be missing the point, sorry, but it sounds like this enshrines
the idea of having a single target.

An integrated assembler or tighter asm output would be nice, but when
I last checked LLVM was usually faster than GCC even when compiling to asm,
even though LLVM does use indirection (in the form of virtual functions)
for its output routines.  I don't think indirect function calls themselves
are the problem -- as long as we get the abstraction right :-)

Thanks,
Richard
Trevor Saunders Aug. 7, 2015, 4:31 a.m. UTC | #2
On Thu, Aug 06, 2015 at 08:36:36PM +0100, Richard Sandiford wrote:
> David Malcolm <dmalcolm@redhat.com> writes:
> > [...snip...]
> 
> I might be missing the point, sorry, but it sounds like this enshrines
> the idea of having a single target.

I assume you are referring to the template part?  Not totally, see
https://blog.mozilla.org/nfroyd/2014/10/30/porting-rr-to-x86-64/
for an example of building a tool that uses templates and supports
multiple targets at the same time.  That said I'm not sure I see the
advantages, and the switch statements look rather like virtual
functions.
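
A rough sketch of what that looks like, assuming two hypothetical
targets (target_a and target_b, with made-up label syntax) and a runtime
target_kind value.  The per-target code is a template, but choosing
between the instantiations at run time needs a switch, which does much
the same job as a vtable lookup:

```cpp
#include <string>

// Hypothetical set of targets that are all built into one binary.
enum target_kind { TARGET_A, TARGET_B };

// Hypothetical traits classes; the label suffixes are made up.
struct target_a { static const char *label_suffix () { return ":\n"; } };
struct target_b { static const char *label_suffix () { return "::\n"; } };

// Per-target code, written once as a template.
template <typename Target>
std::string
format_label (const char *name)
{
  return std::string (name) + Target::label_suffix ();
}

// Runtime dispatch over the instantiations.  This switch is the part that
// looks rather like a virtual function call.
inline std::string
format_label_for (target_kind kind, const char *name)
{
  switch (kind)
    {
    case TARGET_A:
      return format_label<target_a> (name);
    case TARGET_B:
      return format_label<target_b> (name);
    }
  return std::string ();
}
```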

> An integrated assembler or tighter asm output would be nice, but when
> I last checked LLVM was usually faster than GCC even when compiling to asm,
> even though LLVM does use indirection (in the form of virtual functions)
> for its output routines.  I don't think indirect function calls themselves
> are the problem -- as long as we get the abstraction right :-)

yeah, last time I looked (tbf a while ago) the C++ front end took up by
far the largest part of the time.  So it may not be terribly important,
but it would still be nice to figure out what a good design looks like.

Trev

> 
> Thanks,
> Richard
Richard Sandiford Aug. 7, 2015, 9:45 a.m. UTC | #3
Trevor Saunders <tbsaunde@tbsaunde.org> writes:
> On Thu, Aug 06, 2015 at 08:36:36PM +0100, Richard Sandiford wrote:
>> An integrated assembler or tighter asm output would be nice, but when
>> I last checked LLVM was usually faster than GCC even when compiling to asm,
>> even though LLVM does use indirection (in the form of virtual functions)
>> for its output routines.  I don't think indirect function calls themselves
>> are the problem -- as long as we get the abstraction right :-)
>
> yeah, last time I looked (tbf a while ago) the C++ front end took up by
> far the largest part of the time.  So it may not be terribly important,
> but it would still be nice to figure out what a good design looks like.

I tried getting final to output the code a large number of times.
Obviously just sticking "for (i = 0; i < n; ++i)" around something
isn't the best way of measuring performance (for all the usual reasons)
but it was interesting even so.  A lot of the time is taken in calls to
strlen and in assemble_name itself (called by ASM_OUTPUT_LABEL).
Each time we call assemble_name we do:

  real_name = targetm.strip_name_encoding (name);

  id = maybe_get_identifier (real_name);
  if (id)
    {
      tree id_orig = id;

      mark_referenced (id);
      ultimate_transparent_alias_target (&id);
      if (id != id_orig)
	name = IDENTIFIER_POINTER (id);
      gcc_assert (! TREE_CHAIN (id));
    }

Doing an identifier lookup every time we output a reference to a label
is pretty expensive.  Especially when many of the labels we're dealing
with are internal ones (basic block labels, debug labels, etc.) for which
the lookup is bound to fail.

So if compile-time for asm output is a concern, that seems like a good
place to start.  We should try harder to keep track of the identifier
behind a name (when there is one) and avoid this overhead for
internal labels.
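
To make the idea concrete, here is a small sketch of skipping the lookup
when the caller already knows the label is internal.  Everything in it is
hypothetical: identifier_table stands in for the identifier hash table,
label_ref for whatever would carry the "internal" flag, and
resolve_label for the relevant part of assemble_name.  None of these are
existing GCC interfaces.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the identifier table.  It maps a name to the
// name of its ultimate alias target and counts how often it is consulted.
struct identifier_table
{
  std::unordered_map<std::string, std::string> aliases;
  unsigned lookups = 0;

  // Returns the alias target for NAME, or a null pointer if NAME is not
  // a known identifier.
  const std::string *lookup (const std::string &name)
  {
    ++lookups;
    auto it = aliases.find (name);
    return it == aliases.end () ? nullptr : &it->second;
  }
};

// Hypothetical label reference that remembers whether the label is an
// internal one (basic block label, debug label, etc.).
struct label_ref
{
  std::string name;
  bool internal;
};

// Only pay for the identifier lookup when the label can actually be a
// user-visible identifier.
inline std::string
resolve_label (identifier_table &table, const label_ref &label)
{
  if (label.internal)
    return label.name;		// the lookup would be bound to fail
  if (const std::string *target = table.lookup (label.name))
    return *target;
  return label.name;
}
```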

Converting ASM_OUTPUT_LABEL to an indirect function call was in the
noise even with my for-loop hack.  The execution time of the hook is
dominated by assemble_name itself.  I hope patches like yours aren't
held up simply because they have the equivalent of a virtual function.

Also, although we seem to be paranoid about virtual functions and
indirect calls, it's worth remembering that on most targets every
call to fputs(_unlocked), fwrite(_unlocked) and strlen is a PLT call.
Our current code calls fputs several times for one line of assembly,
including for short strings like register names.  This is doubly
inefficient because:

(a) we could reduce the number of PLT calls by doing the buffering
    ourselves and

(b) the names of those registers are known at compile time (or at least
    at start-up time) and are short, but we call strlen() on them
    each time we write them out.

E.g. for the attached microbenchmark I get:

  Time taken, normalised to VERSION==1

  VERSION==1:  1.000
  VERSION==2:  1.377
  VERSION==3:  3.202 (1.638 with -minline-all-stringops)
  VERSION==4:  4.242 (2.921 with -minline-all-stringops)
  VERSION==5:  4.526
  VERSION==6:  4.543
  VERSION==7: 10.884

where the results for 5 vs. 6 are in the noise.

The 5->4 gain is by doing the buffering ourselves.  The 4->3 gain is for
keeping track of the string length rather than recomputing it each time.

This suggests that if we're serious about trying to speed up the asm output,
it would be worth adding an equivalent of LLVM's StringRef that pairs a
const char * string with its length.
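
As a sketch of what I mean (the names string_ref, asm_buffer and
example_reg_names are made up here, and this is not LLVM's actual
StringRef interface), the length of a literal can be captured from its
array type, so writing it out never needs strlen:

```cpp
#include <cstddef>
#include <string>

// Hypothetical pairing of a string pointer with its length.
struct string_ref
{
  const char *data;
  std::size_t length;

  // For string literals the length comes from the array type at compile
  // time.  Note: this assumes the whole array is the string, so it is
  // only suitable for literals, not for partially filled char buffers.
  template <std::size_t N>
  string_ref (const char (&literal)[N])
    : data (literal), length (N - 1) {}

  // For strings whose length the caller already knows.
  string_ref (const char *ptr, std::size_t len)
    : data (ptr), length (len) {}
};

// Hypothetical output buffer; a std::string stands in for the real one.
class asm_buffer
{
public:
  void write (string_ref s) { m_text.append (s.data, s.length); }
  const std::string &text () const { return m_text; }

private:
  std::string m_text;
};

// Register names with their lengths recorded once, rather than
// recomputed on every write.  The names are examples only.
static const string_ref example_reg_names[] = { "r0", "r1", "sp" };
```

A call such as buf.write ("mov ") then goes through the templated
constructor, so the length 4 is a compile-time constant at the call site.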

Thanks,
Richard
Trevor Saunders Aug. 7, 2015, 1:50 p.m. UTC | #4
On Fri, Aug 07, 2015 at 10:45:57AM +0100, Richard Sandiford wrote:
> Trevor Saunders <tbsaunde@tbsaunde.org> writes:
> > On Thu, Aug 06, 2015 at 08:36:36PM +0100, Richard Sandiford wrote:
> >> An integrated assembler or tighter asm output would be nice, but when
> >> I last checked LLVM was usually faster than GCC even when compiling to asm,
> >> even though LLVM does use indirection (in the form of virtual functions)
> >> for its output routines.  I don't think indirect function calls themselves
> >> are the problem -- as long as we get the abstraction right :-)
> >
> > yeah, last time I looked (tbf a while ago) the C++ front end took up by
> > far the largest part of the time.  So it may not be terribly important,
> > but it would still be nice to figure out what a good design looks like.
> 
> I tried getting final to output the code a large number of times.
> Obviously just sticking "for (i = 0; i < n; ++i)" around something
> isn't the best way of measuring performance (for all the usual reasons)
> but it was interesting even so.  A lot of the time is taken in calls to
> strlen and in assemble_name itself (called by ASM_OUTPUT_LABEL).

yeah, this data looks great.  I find it interesting that you say we
spend so much time outputting labels as opposed to instructions.

> Each time we call assemble_name we do:
> 
>   real_name = targetm.strip_name_encoding (name);
> 
>   id = maybe_get_identifier (real_name);
>   if (id)
>     {
>       tree id_orig = id;
> 
>       mark_referenced (id);
>       ultimate_transparent_alias_target (&id);
>       if (id != id_orig)
> 	name = IDENTIFIER_POINTER (id);
>       gcc_assert (! TREE_CHAIN (id));
>     }
> 
> Doing an identifier lookup every time we output a reference to a label
> is pretty expensive.  Especially when many of the labels we're dealing
> with are internal ones (basic block labels, debug labels, etc.) for which
> the lookup is bound to fail.

well, there's ASM_OUTPUT_INTERNAL_LABEL, and I think something similar
for debug labels.  I guess we don't always use those where we could.  Or
maybe the problem is we have places where we need to look at data to
find out.  Maybe it would make sense to have the generally used
output_label routine take a tree / rtx, check whether it's an internal or
debug label, and dispatch appropriately.

> So if compile-time for asm output is a concern, that seems like a good
> place to start.  We should try harder to keep track of the identifier
> behind a name (when there is one) and avoid this overhead for
> internal labels.
> 
> Converting ASM_OUTPUT_LABEL to an indirect function call was in the
> noise even with my for-loop hack.  The execution time of the hook is
> dominated by assemble_name itself.  I hope patches like yours aren't
> held up simply because they have the equivalent of a virtual function.

Well, I think it makes sense to reroll this series, but I think I'll
keep working on trying to replace these macros with something else.

> Also, although we seem to be paranoid about virtual functions and
> indirect calls, it's worth remembering that on most targets every
> call to fputs(_unlocked), fwrite(_unlocked) and strlen is a PLT call.
> Our current code calls fputs several times for one line of assembly,
> including for short strings like register names.  This is doubly
> inefficient because:
> 
> (a) we could reduce the number of PLT calls by doing the buffering
>     ourselves and

yeah, I mentioned that earlier, but it's great to have data showing it's a
win!  I think it's also probably important for enabling the other
optimizations below.

> (b) the names of those registers are known at compile time (or at least
>     at start-up time) and are short, but we call strlen() on them
>     each time we write them out.

yeah, that seems like something that should be fixed, but I'm not sure
off hand where to look for the code doing this.

> E.g. for the attached microbenchmark I get:
> 
>   Time taken, normalised to VERSION==1
> 
>   VERSION==1:  1.000
>   VERSION==2:  1.377
>   VERSION==3:  3.202 (1.638 with -minline-all-stringops)
>   VERSION==4:  4.242 (2.921 with -minline-all-stringops)
>   VERSION==5:  4.526
>   VERSION==6:  4.543
>   VERSION==7: 10.884
> 
> where the results for 5 vs. 6 are in the noise.
> 
> The 5->4 gain is by doing the buffering ourselves.  The 4->3 gain is for
> keeping track of the string length rather than recomputing it each time.
> 
> This suggests that if we're serious about trying to speed up the asm output,
> it would be worth adding an equivalent of LLVM's StringRef that pairs a
> const char * string with its length.

I've thought a tiny bit about working on that, so it's nice to have data.

Trev

> 
> Thanks,
> Richard
> 

> #define _GNU_SOURCE 1
> 
> #include <stdio.h>
> #include <string.h>
> #include <iostream>
> 
> struct S
> {
>   S () : end (buffer) {}
> 
>   ~S ()
>   {
>     fwrite_unlocked (buffer, end - buffer, 1, stdout);
>   }
> 
> #if VERSION == 3
>   void __attribute__((noinline))
> #else
>   void
> #endif
>   write (const char *x, size_t len)
>   {
>     if (__builtin_expect (buffer + sizeof (buffer) - end < len, 0))
>       {
> 	fwrite_unlocked (buffer, end - buffer, 1, stdout);
> 	end = buffer;
>       }
>     memcpy (end, x, len);
>     end += len;
>   }
> 
> #if VERSION == 1 || VERSION == 3
>   template <size_t N>
>   void
>   write (const char (&x)[N])
>   {
>     write (x, N - 1);
>   }
> #elif VERSION == 2
>   template <size_t N>
>   void __attribute__((noinline))
>   write (const char (&x)[N])
>   {
>     write (x, N - 1);
>   }
> #else
>   void __attribute__((noinline))
>   write (const char *x)
>   {
>     write (x, strlen (x));
>   }
> #endif
>   char buffer[4096];
>   char *end;
> };
> 
> int
> main ()
> {
>   S s;
>   for (int i = 0; i < 100000000; ++i)
>     {
> #if VERSION <= 4
>       s.write ("Hello!");
> #elif VERSION == 5
>       fputs_unlocked ("Hello!", stdout);
> #elif VERSION == 6
>       fwrite_unlocked ("Hello!", 6, 1, stdout);
> #elif VERSION == 7
>       std::cout << "Hello!";
> #else
> #error Please define VERSION
> #endif
>     }
>   return 0;
> }
Richard Biener Aug. 7, 2015, 8:24 p.m. UTC | #5
On August 7, 2015 3:50:33 PM GMT+02:00, Trevor Saunders <tbsaunde@tbsaunde.org> wrote:
>On Fri, Aug 07, 2015 at 10:45:57AM +0100, Richard Sandiford wrote:
>> Trevor Saunders <tbsaunde@tbsaunde.org> writes:
>> > On Thu, Aug 06, 2015 at 08:36:36PM +0100, Richard Sandiford wrote:
>> >> An integrated assembler or tighter asm output would be nice, but
>when
>> >> I last checked LLVM was usually faster than GCC even when
>compiling to asm,
>> >> even though LLVM does use indirection (in the form of virtual
>functions)
>> >> for its output routines.  I don't think indirect function calls
>themselves
>> >> are the problem -- as long as we get the abstraction right :-)
>> >
>> > yeah, last time I looked (tbf a while ago) the C++ front end took
>up by
>> > far the largest part of the time.  So it may not be terribly
>important,
>> > but it would still be nice to figure out what a good design looks
>like.
>> 
>> I tried getting final to output the code a large number of times.
>> Obviously just sticking "for (i = 0; i < n; ++i)" around something
>> isn't the best way of measuring performance (for all the usual
>reasons)
>> but it was interesting even so.  A lot of the time is taken in calls
>to
>> strlen and in assemble_name itself (called by ASM_OUTPUT_LABEL).
>
>yeah, this data looks great.  I find it interesting that you say we
>spend so much time outputting labels as opposed to instructions.
>
>> Each time we call assemble_name we do:
>> 
>>   real_name = targetm.strip_name_encoding (name);
>> 
>>   id = maybe_get_identifier (real_name);
>>   if (id)
>>     {
>>       tree id_orig = id;
>> 
>>       mark_referenced (id);
>>       ultimate_transparent_alias_target (&id);
>>       if (id != id_orig)
>> 	name = IDENTIFIER_POINTER (id);
>>       gcc_assert (! TREE_CHAIN (id));
>>     }
>> 
>> Doing an identifier lookup every time we output a reference to a
>label
>> is pretty expensive.  Especially when many of the labels we're
>dealing
>> with are internal ones (basic block labels, debug labels, etc.) for
>which
>> the lookup is bound to fail.
>
>well, there's ASm_OUTPUT_INTERNAL_LABEL, and I think something similar
>for debug labels.  I guess we don't always use those where we could. 
>Or
>maybe the problem is we have places where we need to look at data to
>find out.  Maybe it would make sense to have the generally used
>output_label routine take a tree / rtx, and check if its a internal or
>debug label and dispatch appropriately.
>
>> So if compile-time for asm output is a concern, that seems like a
>good
>> place to start.  We should try harder to keep track of the identifier
>> behind a name (when there is one) and avoid this overhead for
>> internal labels.
>> 
>> Converting ASM_OUTPUT_LABEL to an indirect function call was in the
>> noise even with my for-loop hack.  The execution time of the hook is
>> dominated by assemble_name itself.  I hope patches like yours aren't
>> held up simply because they have the equivalent of a virtual
>function.
>
>Well, I think it makes sense to reroll this series, but I think I'll
>keep working on trying to replace these macros with something else.
>
>> Also, although we seem to be paranoid about virtual functions and
>> indirect calls, it's worth remembering that on most targets every
>> call to fputs(_unlocked), fwrite(_unlocked) and strlen is a PLT call.
>> Our current code calls fputs several times for one line of assembly,
>> including for short strings like register names.  This is doubly
>> inefficient because:
>> 
>> (a) we could reduce the number of PLT calls by doing the buffering
>>     ourselves and
>
>yeah, I mentioned that earlier, but it's great to have data showing it's a
>win!  I think it's also probably important to enabling the other
>optimizations below.
>
>> (b) the names of those registers are known at compile time (or at least
>>     at start-up time) and are short, but we call strlen() on them
>>     each time we write them out.
>
>yeah, that seems like something that should be fixed, but I'm not sure
>off hand where to look for the code doing this.
>
>> E.g. for the attached microbenchmark I get:
>> 
>>   Time taken, normalised to VERSION==1
>> 
>>   VERSION==1:  1.000
>>   VERSION==2:  1.377
>>   VERSION==3:  3.202 (1.638 with -minline-all-stringops)
>>   VERSION==4:  4.242 (2.921 with -minline-all-stringops)
>>   VERSION==5:  4.526
>>   VERSION==6:  4.543
>>   VERSION==7: 10.884
>> 
>> where the results for 5 vs. 6 are in the noise.
>> 
>> The 5->4 gain is by doing the buffering ourselves.  The 4->3 gain is for
>> keeping track of the string length rather than recomputing it each time.
>> 
>> This suggests that if we're serious about trying to speed up the asm output,
>> it would be worth adding an equivalent of LLVM's StringRef that pairs a
>> const char * string with its length.
>
>I've thought a tiny bit about working on that, so it's nice to have
>data.

Tree identifiers have an embedded length.
So it's all about avoiding this target hook mangling the labels.

Richard.
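
To make the point about internal labels concrete, here is a minimal sketch
(not GCC code) of an output routine that is told what kind of label it has
and only pays for the identifier lookup when the label is a user one.  Every
name in it (label_kind, label_ref, identifier_table, lookup_identifier,
output_label_sketch) is made up for illustration, and the std::unordered_set
merely stands in for the identifier hash table consulted on the
assemble_name path.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>

// Hypothetical label classification; GCC has no such enum.
enum class label_kind { user, internal };

struct label_ref
{
  const char *name;
  label_kind kind;
};

// Stand-in for the identifier hash table.
static std::unordered_set<std::string> identifier_table = { "main", "foo" };

// Counts how often the (expensive) lookup is performed.
static int lookup_count = 0;

// Stand-in for the lookup that is currently done for every name.
static bool
lookup_identifier (const char *name)
{
  ++lookup_count;
  return identifier_table.count (name) != 0;
}

// Append "NAME:\n" to OUT, doing the lookup only when the label could
// actually have an identifier behind it.
static void
output_label_sketch (std::string &out, const label_ref &label)
{
  if (label.kind == label_kind::user)
    lookup_identifier (label.name);
  out += label.name;
  out += ":\n";
}
```

With this shape, basic block and debug labels would go straight to the
output buffer, and only labels that can name a real symbol would touch the
hash table.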

>Trev
>
>> 
>> Thanks,
>> Richard
>> 
>
>> #define _GNU_SOURCE 1
>> 
>> #include <stdio.h>
>> #include <string.h>
>> #include <iostream>
>> 
>> struct S
>> {
>>   S () : end (buffer) {}
>> 
>>   ~S ()
>>   {
>>     fwrite_unlocked (buffer, end - buffer, 1, stdout);
>>   }
>> 
>> #if VERSION == 3
>>   void __attribute__((noinline))
>> #else
>>   void
>> #endif
>>   write (const char *x, size_t len)
>>   {
>>     if (__builtin_expect (buffer + sizeof (buffer) - end < len, 0))
>>       {
>> 	fwrite_unlocked (buffer, end - buffer, 1, stdout);
>> 	end = buffer;
>>       }
>>     memcpy (end, x, len);
>>     end += len;
>>   }
>> 
>> #if VERSION == 1 || VERSION == 3
>>   template <size_t N>
>>   void
>>   write (const char (&x)[N])
>>   {
>>     write (x, N - 1);
>>   }
>> #elif VERSION == 2
>>   template <size_t N>
>>   void __attribute__((noinline))
>>   write (const char (&x)[N])
>>   {
>>     write (x, N - 1);
>>   }
>> #else
>>   void __attribute__((noinline))
>>   write (const char *x)
>>   {
>>     write (x, strlen (x));
>>   }
>> #endif
>>   char buffer[4096];
>>   char *end;
>> };
>> 
>> int
>> main ()
>> {
>>   S s;
>>   for (int i = 0; i < 100000000; ++i)
>>     {
>> #if VERSION <= 4
>>       s.write ("Hello!");
>> #elif VERSION == 5
>>       fputs_unlocked ("Hello!", stdout);
>> #elif VERSION == 6
>>       fwrite_unlocked ("Hello!", 6, 1, stdout);
>> #elif VERSION == 7
>>       std::cout << "Hello!";
>> #else
>> #error Please define VERSION
>> #endif
>>     }
>>   return 0;
>> }
Richard Sandiford Aug. 7, 2015, 9:52 p.m. UTC | #6
Richard Biener <richard.guenther@gmail.com> writes:
>>> E.g. for the attached microbenchmark I get:
>>> 
>>>   Time taken, normalised to VERSION==1
>>> 
>>>   VERSION==1:  1.000
>>>   VERSION==2:  1.377
>>>   VERSION==3:  3.202 (1.638 with -minline-all-stringops)
>>>   VERSION==4:  4.242 (2.921 with -minline-all-stringops)
>>>   VERSION==5:  4.526
>>>   VERSION==6:  4.543
>>>   VERSION==7: 10.884
>>> 
>>> where the results for 5 vs. 6 are in the noise.
>>> 
>>> The 5->4 gain is by doing the buffering ourselves.  The 4->3 gain is for
>>> keeping track of the string length rather than recomputing it each time.
>>> 
>>> This suggests that if we're serious about trying to speed up the asm output,
>>> it would be worth adding an equivalent of LLVM's StringRef that pairs a
>>> const char * string with its length.
>>
>>I've thought a tiny bit about working on that, so it's nice to have
>>data.
>
> Tree identifiers have an embedded length.
> So it's all about avoiding this target hook mangling the labels.

Yeah, and register names start out as C strings where the length
is known at compile time.  Strings that result from sprintf have
a length given by the sprintf return value.

I think in practice most strings in GCC have (or had) a known length,
and the nice thing about StringRef-like classes is that they abstract
away the source of the length.  Even if we do have to use strlen,
the class makes sure we only calculate it once per object rather than
once per use.

Thanks,
Richard
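
A rough sketch of what such a StringRef-like class could look like is below.
It is only an illustration under my own assumptions: string_ref, from_c_str
and write_ref are hypothetical names, not existing GCC or LLVM interfaces,
and a std::string is used as the output sink purely to keep the example
self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <string>

// Hypothetical class, loosely in the spirit of LLVM's StringRef.
class string_ref
{
 public:
  // String literals: the length is known at compile time (N counts the NUL).
  // Caveat: a char buffer would also match this constructor and report its
  // capacity, so buffers should use the (pointer, length) form instead.
  template <std::size_t N>
  constexpr string_ref (const char (&s)[N]) : m_ptr (s), m_len (N - 1) {}

  // Length already known, e.g. from a sprintf return value or from an
  // identifier's stored length.
  constexpr string_ref (const char *s, std::size_t len)
    : m_ptr (s), m_len (len) {}

  // Fallback when only a C string is available: strlen runs once here,
  // not once per use.
  static string_ref from_c_str (const char *s)
  {
    return string_ref (s, std::strlen (s));
  }

  const char *data () const { return m_ptr; }
  std::size_t size () const { return m_len; }

 private:
  const char *m_ptr;
  std::size_t m_len;
};

// A writer that never needs to call strlen itself.
static void
write_ref (std::string &out, string_ref s)
{
  out.append (s.data (), s.size ());
}
```

The fallback is a named function rather than a const char * constructor
because a non-template const char * overload would be preferred for string
literals, which would quietly bring the strlen back.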

Patch

From 011370666913836bd66f8b433e57780434c5aab1 Mon Sep 17 00:00:00 2001
From: David Malcolm <dmalcolm@redhat.com>
Date: Mon, 6 Jul 2015 14:53:20 -0400
Subject: [PATCH] Work-in-progress experiments with hiding asm_out_file

---
 gcc/config/elfos.h      |  2 +-
 gcc/config/i386/i386.c  | 36 +++++++++++++++---------------
 gcc/config/i386/i386.md |  8 +++----
 gcc/config/i386/sse.md  |  2 +-
 gcc/defaults.h          | 11 ----------
 gcc/dwarf2out.c         | 58 ++++++++++++++++++++++++-------------------------
 gcc/except.c            |  8 +++----
 gcc/final.c             |  3 ++-
 gcc/output.h            | 51 +++++++++++++++++++++++++++++++++++++++++++
 gcc/toplev.c            |  3 +++
 gcc/varasm.c            | 16 +++++++-------
 11 files changed, 121 insertions(+), 77 deletions(-)

diff --git a/gcc/config/elfos.h b/gcc/config/elfos.h
index bcc3870..3ea3376 100644
--- a/gcc/config/elfos.h
+++ b/gcc/config/elfos.h
@@ -333,7 +333,7 @@  see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
 	  ASM_OUTPUT_SIZE_DIRECTIVE (FILE, NAME, size);			\
 	}								\
 									\
-      ASM_OUTPUT_LABEL (FILE, NAME);					\
+      g_output.output_label (NAME);					\
     }									\
   while (0)
 
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 128c5af..fe78ffb 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -6398,7 +6398,7 @@  ix86_asm_output_function_label (FILE *asm_out_file, const char *fname,
   SUBTARGET_ASM_UNWIND_INIT (asm_out_file);
 #endif
 
-  ASM_OUTPUT_LABEL (asm_out_file, fname);
+  g_output.output_label (fname);
 
   /* Output magic byte marker, if hot-patch attribute is set.  */
   if (is_ms_hook)
@@ -9877,7 +9877,7 @@  ix86_code_end (void)
 	  fputs ("\n\t.private_extern\t", asm_out_file);
 	  assemble_name (asm_out_file, name);
 	  putc ('\n', asm_out_file);
-	  ASM_OUTPUT_LABEL (asm_out_file, name);
+	  g_output.output_label (name);
 	  DECL_WEAK (decl) = 1;
 	}
       else
@@ -9898,7 +9898,7 @@  ix86_code_end (void)
       else
 	{
 	  switch_to_section (text_section);
-	  ASM_OUTPUT_LABEL (asm_out_file, name);
+	  g_output.output_label (name);
 	}
 
       DECL_INITIAL (decl) = make_node (BLOCK);
@@ -9985,7 +9985,7 @@  output_set_got (rtx dest, rtx label)
       /* Output the Mach-O "canonical" pic base label name ("Lxx$pb") here.
          This is what will be referenced by the Mach-O PIC subsystem.  */
       if (machopic_should_output_picbase_label () || !label)
-	ASM_OUTPUT_LABEL (asm_out_file, MACHOPIC_FUNCTION_BASE_NAME);
+	g_output.output_label (MACHOPIC_FUNCTION_BASE_NAME);
 
       /* When we are restoring the pic base at the site of a nonlocal label,
          and we decided to emit the pic base above, we will still output a
@@ -11180,9 +11180,9 @@  output_adjust_stack_and_probe (rtx reg)
   xops[0] = stack_pointer_rtx;
   xops[1] = reg;
   output_asm_insn ("cmp%z0\t{%1, %0|%0, %1}", xops);
-  fputs ("\tje\t", asm_out_file);
-  assemble_name_raw (asm_out_file, end_lab);
-  fputc ('\n', asm_out_file);
+  g_output.puts ("\tje\t");
+  g_output.assemble_name_raw (end_lab);
+  g_output.puts ("\n");
 
   /* SP = SP + PROBE_INTERVAL.  */
   xops[1] = GEN_INT (PROBE_INTERVAL);
@@ -11303,9 +11303,9 @@  output_probe_stack_range (rtx reg, rtx end)
   xops[0] = reg;
   xops[1] = end;
   output_asm_insn ("cmp%z0\t{%1, %0|%0, %1}", xops);
-  fputs ("\tje\t", asm_out_file);
-  assemble_name_raw (asm_out_file, end_lab);
-  fputc ('\n', asm_out_file);
+  g_output.puts ("\tje\t");
+  g_output.assemble_name_raw (end_lab);
+  g_output.puts ("\n");
 
   /* TEST_ADDR = TEST_ADDR + PROBE_INTERVAL.  */
   xops[1] = GEN_INT (PROBE_INTERVAL);
@@ -11317,9 +11317,9 @@  output_probe_stack_range (rtx reg, rtx end)
   xops[2] = const0_rtx;
   output_asm_insn ("or%z0\t{%2, (%0,%1)|DWORD PTR [%0+%1], %2}", xops);
 
-  fprintf (asm_out_file, "\tjmp\t");
-  assemble_name_raw (asm_out_file, loop_lab);
-  fputc ('\n', asm_out_file);
+  g_output.puts ("\tjmp\t");
+  g_output.assemble_name_raw (loop_lab);
+  g_output.puts ("\n");
 
   ASM_OUTPUT_INTERNAL_LABEL (asm_out_file, end_lab);
 
@@ -14731,7 +14731,7 @@  output_pic_addr_const (FILE *file, rtx x, int code)
       /* FALLTHRU */
     case CODE_LABEL:
       ASM_GENERATE_INTERNAL_LABEL (buf, "L", CODE_LABEL_NUMBER (x));
-      assemble_name (asm_out_file, buf);
+      g_output.assemble_name (buf);
       break;
 
     case CONST_INT:
@@ -43256,16 +43256,16 @@  x86_file_start (void)
 {
   default_file_start ();
   if (TARGET_16BIT)
-    fputs ("\t.code16gcc\n", asm_out_file);
+    g_output.puts ("\t.code16gcc\n");
 #if TARGET_MACHO
   darwin_file_start ();
 #endif
   if (X86_FILE_START_VERSION_DIRECTIVE)
-    fputs ("\t.version\t\"01.01\"\n", asm_out_file);
+    g_output.puts ("\t.version\t\"01.01\"\n");
   if (X86_FILE_START_FLTUSED)
-    fputs ("\t.global\t__fltused\n", asm_out_file);
+    g_output.puts ("\t.global\t__fltused\n");
   if (ix86_asm_dialect == ASM_INTEL)
-    fputs ("\t.intel_syntax noprefix\n", asm_out_file);
+    g_output.puts ("\t.intel_syntax noprefix\n");
 }
 
 int
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 5c5c1fc..14c62a5 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -12137,7 +12137,7 @@ 
   gcc_assert (IN_RANGE (num, 1, 8));
 
   while (num--)
-    fputs ("\tnop\n", asm_out_file);
+    g_output.puts ("\tnop\n");
 
   return "";
 }
@@ -13224,11 +13224,11 @@ 
   "TARGET_64BIT"
 {
   if (!TARGET_X32)
-    fputs (ASM_BYTE "0x66\n", asm_out_file);
+    g_output.puts (ASM_BYTE "0x66\n");
   output_asm_insn
     ("lea{q}\t{%E1@tlsgd(%%rip), %%rdi|rdi, %E1@tlsgd[rip]}", operands);
-  fputs (ASM_SHORT "0x6666\n", asm_out_file);
-  fputs ("\trex64\n", asm_out_file);
+  g_output.puts (ASM_SHORT "0x6666\n");
+  g_output.puts ("\trex64\n");
   if (TARGET_SUN_TLS)
     return "call\t%p2@plt";
   return "call\t%P2";
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 0970f0e..f52dc02 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -13296,7 +13296,7 @@ 
   /* We can't use %^ here due to ASM_OUTPUT_OPCODE processing
      that requires %v to be at the beginning of the opcode name.  */
   if (Pmode != word_mode)
-    fputs ("\taddr32", asm_out_file);
+    g_output.puts ("\taddr32");
   return "%vmaskmovdqu\t{%2, %1|%1, %2}";
 }
   [(set_attr "type" "ssemov")
diff --git a/gcc/defaults.h b/gcc/defaults.h
index 9d38ba1..320c43d 100644
--- a/gcc/defaults.h
+++ b/gcc/defaults.h
@@ -136,17 +136,6 @@  see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
 #endif
 
 /* This is how to output the definition of a user-level label named
-   NAME, such as the label on variable NAME.  */
-
-#ifndef ASM_OUTPUT_LABEL
-#define ASM_OUTPUT_LABEL(FILE,NAME) \
-  do {						\
-    assemble_name ((FILE), (NAME));		\
-    fputs (":\n", (FILE));			\
-  } while (0)
-#endif
-
-/* This is how to output the definition of a user-level label named
    NAME, such as the label on a function.  */
 
 #ifndef ASM_OUTPUT_FUNCTION_LABEL
diff --git a/gcc/dwarf2out.c b/gcc/dwarf2out.c
index 2c7dc71..a25847f 100644
--- a/gcc/dwarf2out.c
+++ b/gcc/dwarf2out.c
@@ -484,7 +484,7 @@  switch_to_eh_frame_section (bool back)
 	  ASM_OUTPUT_ALIGN (asm_out_file, floor_log2 (PTR_SIZE));
 	  targetm.asm_out.globalize_label (asm_out_file,
 					   IDENTIFIER_POINTER (label));
-	  ASM_OUTPUT_LABEL (asm_out_file, IDENTIFIER_POINTER (label));
+	  g_output.output_label (IDENTIFIER_POINTER (label));
 	}
     }
 }
@@ -600,7 +600,7 @@  output_fde (dw_fde_ref fde, bool for_eh, bool second,
 			 " indicating 64-bit DWARF extension");
   dw2_asm_output_delta (for_eh ? 4 : DWARF_OFFSET_SIZE, l2, l1,
 			"FDE Length");
-  ASM_OUTPUT_LABEL (asm_out_file, l1);
+  g_output.output_label (l1);
 
   if (for_eh)
     dw2_asm_output_delta (4, l1, section_start_label, "FDE CIE offset");
@@ -703,7 +703,7 @@  output_fde (dw_fde_ref fde, bool for_eh, bool second,
   /* Pad the FDE out to an address sized boundary.  */
   ASM_OUTPUT_ALIGN (asm_out_file,
 		    floor_log2 ((for_eh ? PTR_SIZE : DWARF2_ADDR_SIZE)));
-  ASM_OUTPUT_LABEL (asm_out_file, l2);
+  g_output.output_label (l2);
 
   j += 2;
 }
@@ -790,7 +790,7 @@  output_call_frame_info (int for_eh)
   switch_to_frame_table_section (for_eh, false);
 
   ASM_GENERATE_INTERNAL_LABEL (section_start_label, FRAME_BEGIN_LABEL, for_eh);
-  ASM_OUTPUT_LABEL (asm_out_file, section_start_label);
+  g_output.output_label (section_start_label);
 
   /* Output the CIE.  */
   ASM_GENERATE_INTERNAL_LABEL (l1, CIE_AFTER_SIZE_LABEL, for_eh);
@@ -800,7 +800,7 @@  output_call_frame_info (int for_eh)
       "Initial length escape value indicating 64-bit DWARF extension");
   dw2_asm_output_delta (for_eh ? 4 : DWARF_OFFSET_SIZE, l2, l1,
 			"Length of Common Information Entry");
-  ASM_OUTPUT_LABEL (asm_out_file, l1);
+  g_output.output_label (l1);
 
   /* Now that the CIE pointer is PC-relative for EH,
      use 0 to identify the CIE.  */
@@ -926,7 +926,7 @@  output_call_frame_info (int for_eh)
   /* Pad the CIE out to an address sized boundary.  */
   ASM_OUTPUT_ALIGN (asm_out_file,
 		    floor_log2 (for_eh ? PTR_SIZE : DWARF2_ADDR_SIZE));
-  ASM_OUTPUT_LABEL (asm_out_file, l2);
+  g_output.output_label (l2);
 
   /* Loop through all of the FDE's.  */
   FOR_EACH_VEC_ELT (*fde_vec, i, fde)
@@ -1160,7 +1160,7 @@  dwarf2out_end_epilogue (unsigned int line ATTRIBUTE_UNUSED,
      function.  */
   ASM_GENERATE_INTERNAL_LABEL (label, FUNC_END_LABEL,
 			       current_function_funcdef_no);
-  ASM_OUTPUT_LABEL (asm_out_file, label);
+  g_output.output_label (label);
   fde = cfun->fde;
   gcc_assert (fde != NULL);
   if (fde->dw_fde_second_begin == NULL)
@@ -8680,7 +8680,7 @@  output_die_symbol (dw_die_ref die)
        will break.  */
     targetm.asm_out.globalize_label (asm_out_file, sym);
 
-  ASM_OUTPUT_LABEL (asm_out_file, sym);
+  g_output.output_label (sym);
 }
 
 /* Return a new location list, given the begin and end range, and the
@@ -8722,7 +8722,7 @@  output_loc_list (dw_loc_list_ref list_head)
     return;
   list_head->emitted = true;
 
-  ASM_OUTPUT_LABEL (asm_out_file, list_head->ll_symbol);
+  g_output.output_label (list_head->ll_symbol);
 
   /* Walk the location list, and output each range + expression.  */
   for (curr = list_head; curr != NULL; curr = curr->dw_loc_next)
@@ -9228,7 +9228,7 @@  output_comp_unit (dw_die_ref die, int output_if_empty)
   else
     {
       switch_to_section (debug_info_section);
-      ASM_OUTPUT_LABEL (asm_out_file, debug_info_section_label);
+      g_output.output_label (debug_info_section_label);
       info_section_emitted = true;
     }
 
@@ -9323,7 +9323,7 @@  output_skeleton_debug_sections (dw_die_ref comp_unit)
   remove_AT (comp_unit, DW_AT_language);
 
   switch_to_section (debug_skeleton_info_section);
-  ASM_OUTPUT_LABEL (asm_out_file, debug_skeleton_info_section_label);
+  g_output.output_label (debug_skeleton_info_section_label);
 
   /* Produce the skeleton compilation-unit header.  This one differs enough from
      a normal CU header that it's better not to call output_compilation_unit
@@ -9348,7 +9348,7 @@  output_skeleton_debug_sections (dw_die_ref comp_unit)
 
   /* Build the skeleton debug_abbrev section.  */
   switch_to_section (debug_skeleton_abbrev_section);
-  ASM_OUTPUT_LABEL (asm_out_file, debug_skeleton_abbrev_section_label);
+  g_output.output_label (debug_skeleton_abbrev_section_label);
 
   output_die_abbrevs (SKELETON_COMP_DIE_ABBREV, comp_unit);
 
@@ -10380,11 +10380,11 @@  output_line_info (bool prologue_only)
       "Initial length escape value indicating 64-bit DWARF extension");
   dw2_asm_output_delta (DWARF_OFFSET_SIZE, l2, l1,
 			"Length of Source Line Info");
-  ASM_OUTPUT_LABEL (asm_out_file, l1);
+  g_output.output_label (l1);
 
   dw2_asm_output_data (2, ver, "DWARF Version");
   dw2_asm_output_delta (DWARF_OFFSET_SIZE, p2, p1, "Prolog Length");
-  ASM_OUTPUT_LABEL (asm_out_file, p1);
+  g_output.output_label (p1);
 
   /* Define the architecture-dependent minimum instruction length (in bytes).
      In this implementation of DWARF, this field is used for information
@@ -10432,11 +10432,11 @@  output_line_info (bool prologue_only)
 
   /* Write out the information about the files we use.  */
   output_file_names ();
-  ASM_OUTPUT_LABEL (asm_out_file, p2);
+  g_output.output_label (p2);
   if (prologue_only)
     {
       /* Output the marker for the end of the line number info.  */
-      ASM_OUTPUT_LABEL (asm_out_file, l2);
+      g_output.output_label (l2);
       return;
     }
 
@@ -10467,7 +10467,7 @@  output_line_info (bool prologue_only)
     output_one_line_info_table (text_section_line_info);
 
   /* Output the marker for the end of the line number info.  */
-  ASM_OUTPUT_LABEL (asm_out_file, l2);
+  g_output.output_label (l2);
 }
 
 /* Given a pointer to a tree node for some base type, return a pointer to
@@ -22485,7 +22485,7 @@  dwarf2out_begin_function (tree fun)
       gcc_assert (current_function_decl == fun);
       cold_text_section = unlikely_text_section ();
       switch_to_section (cold_text_section);
-      ASM_OUTPUT_LABEL (asm_out_file, cold_text_section_label);
+      g_output.output_label (cold_text_section_label);
       switch_to_section (sec);
     }
 
@@ -23146,7 +23146,7 @@  output_macinfo (void)
 	  ASM_GENERATE_INTERNAL_LABEL (label,
 				       DEBUG_MACRO_SECTION_LABEL,
 				       ref->lineno);
-	  ASM_OUTPUT_LABEL (asm_out_file, label);
+	  g_output.output_label (label);
 	  ref->code = 0;
 	  ref->info = NULL;
 	  dw2_asm_output_data (2, 4, "DWARF macro version number");
@@ -23297,7 +23297,7 @@  dwarf2out_init (const char *filename ATTRIBUTE_UNUSED)
     vec_alloc (macinfo_table, 64);
 
   switch_to_section (text_section);
-  ASM_OUTPUT_LABEL (asm_out_file, text_section_label);
+  g_output.output_label (text_section_label);
 
   /* Make sure the line number table for .text always exists.  */
   text_section_line_info = new_line_info_table ();
@@ -23319,7 +23319,7 @@  dwarf2out_assembly_start (void)
       && dwarf2out_do_cfi_asm ()
       && (!(flag_unwind_tables || flag_exceptions)
 	  || targetm_common.except_unwind_info (&global_options) != UI_DWARF2))
-    fprintf (asm_out_file, "\t.cfi_sections\t.debug_frame\n");
+    g_output.puts ("\t.cfi_sections\t.debug_frame\n");
 }
 
 /* A helper function for dwarf2out_finish called through
@@ -23394,7 +23394,7 @@  output_indirect_string (indirect_string_node **h, void *)
   node->form = find_string_form (node);
   if (node->form == DW_FORM_strp && node->refcount > 0)
     {
-      ASM_OUTPUT_LABEL (asm_out_file, node->label);
+      g_output.output_label (node->label);
       assemble_string (node->str, strlen (node->str) + 1);
     }
 
@@ -25352,7 +25352,7 @@  dwarf2out_finish (const char *filename)
                         ranges_section_label);
 
       switch_to_section (debug_addr_section);
-      ASM_OUTPUT_LABEL (asm_out_file, debug_addr_section_label);
+      g_output.output_label (debug_addr_section_label);
       output_addr_table ();
     }
 
@@ -25367,7 +25367,7 @@  dwarf2out_finish (const char *filename)
   if (abbrev_die_table_in_use != 1)
     {
       switch_to_section (debug_abbrev_section);
-      ASM_OUTPUT_LABEL (asm_out_file, abbrev_section_label);
+      g_output.output_label (abbrev_section_label);
       output_abbrev_section ();
     }
 
@@ -25376,7 +25376,7 @@  dwarf2out_finish (const char *filename)
     {
       /* Output the location lists info.  */
       switch_to_section (debug_loc_section);
-      ASM_OUTPUT_LABEL (asm_out_file, loc_section_label);
+      g_output.output_label (loc_section_label);
       output_location_lists (comp_unit_die ());
     }
 
@@ -25399,7 +25399,7 @@  dwarf2out_finish (const char *filename)
   if (ranges_table_in_use)
     {
       switch_to_section (debug_ranges_section);
-      ASM_OUTPUT_LABEL (asm_out_file, ranges_section_label);
+      g_output.output_label (ranges_section_label);
       output_ranges ();
     }
 
@@ -25407,7 +25407,7 @@  dwarf2out_finish (const char *filename)
   if (have_macinfo)
     {
       switch_to_section (debug_macinfo_section);
-      ASM_OUTPUT_LABEL (asm_out_file, macinfo_section_label);
+      g_output.output_label (macinfo_section_label);
       output_macinfo ();
       dw2_asm_output_data (1, 0, "End compilation unit");
     }
@@ -25419,14 +25419,14 @@  dwarf2out_finish (const char *filename)
      examining the file.  This is done late so that any filenames
      used by the debug_info section are marked as 'used'.  */
   switch_to_section (debug_line_section);
-  ASM_OUTPUT_LABEL (asm_out_file, debug_line_section_label);
+  g_output.output_label (debug_line_section_label);
   if (! DWARF2_ASM_LINE_DEBUG_INFO)
     output_line_info (false);
 
   if (dwarf_split_debug_info && info_section_emitted)
     {
       switch_to_section (debug_skeleton_line_section);
-      ASM_OUTPUT_LABEL (asm_out_file, debug_skeleton_line_section_label);
+      g_output.output_label (debug_skeleton_line_section_label);
       output_line_info (true);
     }
 
diff --git a/gcc/except.c b/gcc/except.c
index d59c539..d6b1ebc 100644
--- a/gcc/except.c
+++ b/gcc/except.c
@@ -3004,7 +3004,7 @@  output_one_function_exception_table (int section)
 				   current_function_funcdef_no);
       dw2_asm_output_delta_uleb128 (ttype_label, ttype_after_disp_label,
 				    "@TType base offset");
-      ASM_OUTPUT_LABEL (asm_out_file, ttype_after_disp_label);
+      g_output.output_label (ttype_after_disp_label);
 #else
       /* Ug.  Alignment queers things.  */
       unsigned int before_disp, after_disp, last_disp, disp;
@@ -3054,12 +3054,12 @@  output_one_function_exception_table (int section)
 			       current_function_funcdef_no);
   dw2_asm_output_delta_uleb128 (cs_end_label, cs_after_size_label,
 				"Call-site table length");
-  ASM_OUTPUT_LABEL (asm_out_file, cs_after_size_label);
+  g_output.output_label (cs_after_size_label);
   if (targetm_common.except_unwind_info (&global_options) == UI_SJLJ)
     sjlj_output_call_site_table ();
   else
     dw2_output_call_site_table (cs_format, section);
-  ASM_OUTPUT_LABEL (asm_out_file, cs_end_label);
+  g_output.output_label (cs_end_label);
 #else
   dw2_asm_output_data_uleb128 (call_site_len, "Call-site table length");
   if (targetm_common.except_unwind_info (&global_options) == UI_SJLJ)
@@ -3087,7 +3087,7 @@  output_one_function_exception_table (int section)
 
 #ifdef HAVE_AS_LEB128
   if (have_tt_data)
-      ASM_OUTPUT_LABEL (asm_out_file, ttype_label);
+      g_output.output_label (ttype_label);
 #endif
 
   /* ??? Decode and interpret the data for flag_debug_asm.  */
diff --git a/gcc/final.c b/gcc/final.c
index 5d91609..23541b9 100644
--- a/gcc/final.c
+++ b/gcc/final.c
@@ -2116,7 +2116,7 @@  output_alternate_entry_point (FILE *file, rtx_insn *insn)
 #ifdef ASM_OUTPUT_TYPE_DIRECTIVE
       ASM_OUTPUT_TYPE_DIRECTIVE (file, name, "function");
 #endif
-      ASM_OUTPUT_LABEL (file, name);
+      g_output.output_label (name);
       break;
 
     case LABEL_NORMAL:
@@ -4876,3 +4876,4 @@  get_call_reg_set_usage (rtx_insn *insn, HARD_REG_SET *reg_set,
   COPY_HARD_REG_SET (*reg_set, default_set);
   return false;
 }
+
diff --git a/gcc/output.h b/gcc/output.h
index 4ce6eea..b33d23a 100644
--- a/gcc/output.h
+++ b/gcc/output.h
@@ -614,4 +614,55 @@  extern int default_address_cost (rtx, machine_mode, addr_space_t, bool);
 /* Output stack usage information.  */
 extern void output_stack_usage (void);
 
+template <typename Target>
+class output
+{
+ public:
+  void init (FILE *outfile);
+
+  int puts (const char *s) const { return fputs (s, m_outfile); }
+  FILE *get_outfile () const { return m_outfile; }
+
+  void
+  assemble_name_raw (const char *name) { ::assemble_name_raw (m_outfile,
+							      name); }
+
+  void
+  assemble_name (const char *name) { ::assemble_name (m_outfile, name); }
+
+  /* Replacement for ASM_OUTPUT_LABEL.  */
+  void output_label (const char *name);
+
+ private:
+  FILE *m_outfile;
+};
+
+template <typename Target>
+inline void
+output<Target>::init (FILE *outfile)
+{
+  m_outfile = outfile;
+}
+
+/* Default implementation of output_label.  */
+template <typename Target>
+inline void
+output<Target>::output_label (const char *name)
+{
+  assemble_name (name);
+  puts (":\n");
+}
+
+/* Hacked-up traits classes (each target would provide one).  */
+class target_pdp11
+{
+};
+
+/* ...and this typedef would be generated at configure time.  */
+typedef target_pdp11 target_t;
+
+/* File in which assembler code is being written.  */
+/* (replacement for asm_out_file) */
+extern output<target_t> g_output;
+
 #endif /* ! GCC_OUTPUT_H */
diff --git a/gcc/toplev.c b/gcc/toplev.c
index 5aaa120..33151ac 100644
--- a/gcc/toplev.c
+++ b/gcc/toplev.c
@@ -184,6 +184,7 @@  const char *user_label_prefix;
    and debugging dumps.  */
 
 FILE *asm_out_file;
+output<target_t> g_output;
 FILE *aux_info_file;
 FILE *stack_usage_file = NULL;
 
@@ -975,6 +976,8 @@  init_asm_output (const char *name)
 		     "can%'t open %qs for writing: %m", asm_file_name);
     }
 
+  g_output.init (asm_out_file);
+
   if (!flag_syntax_only)
     {
       targetm.asm_out.file_start ();
diff --git a/gcc/varasm.c b/gcc/varasm.c
index 6a4ba0b..aa11d64 100644
--- a/gcc/varasm.c
+++ b/gcc/varasm.c
@@ -1746,7 +1746,7 @@  assemble_start_function (tree decl, const char *fnname)
 
       switch_to_section (unlikely_text_section ());
       assemble_align (align);
-      ASM_OUTPUT_LABEL (asm_out_file, crtl->subsections.cold_section_label);
+      g_output.output_label (crtl->subsections.cold_section_label);
 
       /* When the function starts with a cold section, we need to explicitly
 	 align the hot section and write out the hot section label.
@@ -1756,7 +1756,7 @@  assemble_start_function (tree decl, const char *fnname)
 	{
 	  switch_to_section (text_section);
 	  assemble_align (align);
-	  ASM_OUTPUT_LABEL (asm_out_file, crtl->subsections.hot_section_label);
+	  g_output.output_label (crtl->subsections.hot_section_label);
 	  hot_label_written = true;
 	  first_function_block_is_cold = true;
 	}
@@ -1769,7 +1769,7 @@  assemble_start_function (tree decl, const char *fnname)
   switch_to_section (function_section (decl));
   if (flag_reorder_blocks_and_partition
       && !hot_label_written)
-    ASM_OUTPUT_LABEL (asm_out_file, crtl->subsections.hot_section_label);
+    g_output.output_label (crtl->subsections.hot_section_label);
 
   /* Tell assembler to move to target machine's alignment for functions.  */
   align = floor_log2 (align / BITS_PER_UNIT);
@@ -1860,12 +1860,12 @@  assemble_end_function (tree decl, const char *fnname ATTRIBUTE_UNUSED)
 					IDENTIFIER_POINTER (cold_function_name),
 					decl);
 #endif
-      ASM_OUTPUT_LABEL (asm_out_file, crtl->subsections.cold_section_end_label);
+      g_output.output_label (crtl->subsections.cold_section_end_label);
       if (first_function_block_is_cold)
 	switch_to_section (text_section);
       else
 	switch_to_section (function_section (decl));
-      ASM_OUTPUT_LABEL (asm_out_file, crtl->subsections.hot_section_end_label);
+      g_output.output_label (crtl->subsections.hot_section_end_label);
       switch_to_section (save_text_section);
     }
 }
@@ -2052,7 +2052,7 @@  assemble_variable_contents (tree decl, const char *name,
   ASM_DECLARE_OBJECT_NAME (asm_out_file, name, decl);
 #else
   /* Standard thing is just output label for the object.  */
-  ASM_OUTPUT_LABEL (asm_out_file, name);
+  g_output.output_label (name);
 #endif /* ASM_DECLARE_OBJECT_NAME */
 
   if (!dont_output_data)
@@ -2499,9 +2499,9 @@  assemble_external_libcall (rtx fun)
 /* Assemble a label named NAME.  */
 
 void
-assemble_label (FILE *file, const char *name)
+assemble_label (FILE */*file*/, const char *name)
 {
-  ASM_OUTPUT_LABEL (file, name);
+  g_output.output_label (name);
 }
 
 /* Set the symbol_referenced flag for ID.  */
-- 
1.8.5.3
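
One way the empty traits class in output.h could grow into per-target
customisation without indirect calls is sketched below.  This is a guess at
where the experiment might go, not part of the patch: target_generic,
target_odd and output_sketch are hypothetical, the "name = ." syntax belongs
to a made-up target, and a std::string stands in for the FILE * so that the
example is self-contained.

```cpp
#include <cassert>
#include <string>

// Hypothetical traits: a target using the usual "name:" label syntax.
struct target_generic
{
  static void output_label (std::string &out, const char *name)
  {
    out += name;
    out += ":\n";
  }
};

// Hypothetical traits: a made-up target whose assembler wants "name = .".
struct target_odd
{
  static void output_label (std::string &out, const char *name)
  {
    out += name;
    out += " = .\n";
  }
};

// Stand-in for the patch's output<Target>; the sink is a string here.
template <typename Target>
class output_sketch
{
 public:
  void puts (const char *s) { m_buf += s; }

  // Resolved at compile time through the traits class, so there is no
  // function pointer or virtual call on this path.
  void output_label (const char *name) { Target::output_label (m_buf, name); }

  const std::string &str () const { return m_buf; }

 private:
  std::string m_buf;
};
```

Since the target is fixed at configure time, the call into the traits class
can be inlined, which would sidestep the indirect-call concern raised
earlier in the thread.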