diff mbox

Speed up genattrtab

Message ID Pine.LNX.4.64.1006151807190.15724@wotan.suse.de
State New
Headers show

Commit Message

Michael Matz June 15, 2010, 4:26 p.m. UTC
Hi,

okay, let's try again, after many years.  Maybe this time :)

This speeds up genattrtab to be no time issue during bootstrap anymore.
Over the years I worked on many approaches to this.  My first one was to 
throttle down the optimization, then I completely removed the 
optimization, then I implemented different kinds of optimizations, then I 
combined them with the throttled down ones, and now I'm back to more or 
less only throttling the optimizations.

Obviously switching off all optimizations creates the fastest combination 
of genattrtab+compiling insn-attrtab.c.  But it has some effect on the 
overall speed of the compiler.  Not too bad as I rectified this a bit, but 
maybe too much to be acceptable to everyone.  But for all development 
during the last years I used a so modified genattrtab, it's really nice :)

Now, after much benchmarking last week I'm proposing the below patch.  It 
does not get rid of all optimizations, but throttles it significantly 
(plus reorders the order of computation so that lowering the limits 
doesn't have too much effect for small attributes).

Numbers follow.  First the architecture, then the version of genattrtab, 
then four numbers:
 gen_u == seconds to run an optimized (!) genattrtab
 st1_1 == seconds to compile generated insn-attrtab.c with an optimized 
          cc1
 big_u == seconds to compile an artificial piece of code that generates
          large functions, many loops, scheduling opportunities
 kde_u == seconds to compile kdecore.cc, a one-file variant of an older 
          version of libkdecore.

The genattrtab versions are: clean == as in SVN, try3 == no call to 
optimize_attrs, otherwise same as proposed patch, try == the proposed 
variant.  These measurements were taken on genattrtab versions that didn't 
contain the latest changes to support enum attributes, but those have no 
speed effect (I've checked for some combinations).

arch     name   gen_u  st1_u  big_u  kde_u
alpha    clean  0      0.75   43.21  32.52
alpha    try3   0      0.26   44.19  32.85
alpha    try    0      0.35   43.26  32.50
arm      clean  6      19.66  49.25  37.89
arm      try3   0      1.78   50.04  38.00
arm      try    2      2.35   49.85  38.10
crisv32  clean  0      0.21   36.17  27.33
crisv32  try3   0      0.15   36.49  27.53
crisv32  try    0      0.23   36.21  27.43
hppa     clean  0      1.11   46.77  31.97
hppa     try3   0      0.58   46.85  32.00
hppa     try    0      0.64   46.97  31.84
i386     clean  38     34.25  33.51  29.93
i386     try3   1      1.99   34.26  30.64
i386     try    6      2.31   33.78  30.12
ia64     clean  1      1.88   66.55  49.81
ia64     try3   0      0.71   67.08  50.33
ia64     try    0      0.95   66.62  49.81
mips     clean  74     17.08  51.23
mips     try3   0      1.74   52.11
mips     try    4      2.29   50.77
powerpc  clean  56     48.59  49.74  34.15
powerpc  try3   0      2.59   50.60  34.97
powerpc  try    5      2.17   49.38  34.71
s390x    clean  0      1.82   47.26  32.83
s390x    try3   0      0.62   47.63  33.75
s390x    try    0      0.78   47.41  33.64
sh       clean  0      1.46   50.99  38.09
sh       try3   0      0.68   51.05  38.30
sh       try    0      0.91   50.79  38.13
sparc    clean  0      1.11   44.21  32.98
sparc    try3   0      0.57   44.45  33.22
sparc    try    0      0.57   43.27  32.93
x86_64   clean  52     43.81  28.78  28.72
x86_64   try3   1      2.16   29.22  29.40
x86_64   try    6      2.98   28.96  28.96

(mips wasn't able to compile kdecore.cc).  This is all cross compilers to 
$arch-linux, all running on the same host machine (a x86_64-linux iCore7 
machine).  As said, I'm proposing "try", so compare the first and third 
numbers.  It will hugely help i386, mips, powerpc and x86_64, and arm a 
bit; the others aren't a problem right now anyway.  The speed difference 
of the compiler is acceptable I think, actually even speeding up the 
compiler sometimes (probably cache effects, because the .text size of 
insn-attrtab is _much_ smaller) or being in the noise.

So, if included, we go from 95 seconds to 9 seconds for x86_64 for an 
optimized cc1, the difference will be even larger for stage2 (using an 
possibly unoptimized cc1).

As said, I'm bootstrapping with variants of this since years, but of 
course I'm regstrapping this currently on x86_64-linux.  Okay for trunk?


Ciao,
Michael.

Comments

Mark Mitchell June 15, 2010, 5:19 p.m. UTC | #1
Michael Matz wrote:

> This speeds up genattrtab to be no time issue during bootstrap anymore.

> The speed difference
> of the compiler is acceptable I think, actually even speeding up the 
> compiler sometimes (probably cache effects, because the .text size of 
> insn-attrtab is _much_ smaller) or being in the noise.

I certainly see how this benefits us as GCC developers.  But, it does
look like it has a negative effect on overall compile-time, even though
in some cases it helps a bit.  The effect seems negative, in the
aggregate, though I didn't try to dump your table into a spreadsheet and
actually get a number out.  I don't think we should help ourselves at
the expense of the user's experience.

I happened to be talking to someone else this morning about this same
issue.  Would it help just to organize the Makefile in such a way that
this file is being built first, so that on a parallel machine it's not
the last thing being built?  This person observed that in many cases
make seems to wait to *start* building insn-attrtab until very late in
the build process, meaning that all the other cores are idle while it's
building.

Thanks,
Jan Hubicka June 15, 2010, 5:26 p.m. UTC | #2
> Michael Matz wrote:
> 
> > This speeds up genattrtab to be no time issue during bootstrap anymore.
> 
> > The speed difference
> > of the compiler is acceptable I think, actually even speeding up the 
> > compiler sometimes (probably cache effects, because the .text size of 
> > insn-attrtab is _much_ smaller) or being in the noise.
> 
> I certainly see how this benefits us as GCC developers.  But, it does
> look like it has a negative effect on overall compile-time, even though
> in some cases it helps a bit.  The effect seems negative, in the
> aggregate, though I didn't try to dump your table into a spreadsheet and
> actually get a number out.  I don't think we should help ourselves at
> the expense of the user's experience.
> 
> I happened to be talking to someone else this morning about this same
> issue.  Would it help just to organize the Makefile in such a way that
> this file is being built first, so that on a parallel machine it's not
> the last thing being built?  This person observed that in many cases
> make seems to wait to *start* building insn-attrtab until very late in
> the build process, meaning that all the other cores are idle while it's
> building.

In WHOPR compilation we now partition the program based on source files (this
is something I plan to revisit later), so we get partition that corresponds to
insn-attrtab.  WPA driver sorts units by size and feeds them to parallel make
and insn-attrtab is still last on finishing build by quite long shot.  This
ignore time spent by genattrtab itself.

So just reordering makefiles is not going to solve the problem.  I also think
insn-attrtab starts late because genattrtab takes long time to generate it.
insn-attrtab on i386 also has one dominating function.  With WHOPR this can
be solved by better partitioning algorithm along with function splitting (I have
prototypes of both the second I plan to contribute soon after some bugfixing).

Honza
> 
> Thanks,
> 
> -- 
> Mark Mitchell
> CodeSourcery
> mark@codesourcery.com
> (650) 331-3385 x713
Dave Korn June 15, 2010, 5:57 p.m. UTC | #3
On 15/06/2010 18:19, Mark Mitchell wrote:
> Michael Matz wrote:
> 
>> This speeds up genattrtab to be no time issue during bootstrap anymore.
> 
>> The speed difference
>> of the compiler is acceptable I think, actually even speeding up the 
>> compiler sometimes (probably cache effects, because the .text size of 
>> insn-attrtab is _much_ smaller) or being in the noise.
> 
> I certainly see how this benefits us as GCC developers.

  90 seconds out of a whole build, or did I misread something?

    cheers,
      DaveK
Mark Mitchell June 15, 2010, 6:39 p.m. UTC | #4
Jan Hubicka wrote:

> In WHOPR compilation we now partition the program based on source files (this
> is something I plan to revisit later), so we get partition that corresponds to
> insn-attrtab.  WPA driver sorts units by size and feeds them to parallel make
> and insn-attrtab is still last on finishing build by quite long shot.  This
> ignore time spent by genattrtab itself.
> 
> So just reordering makefiles is not going to solve the problem.

Well, it might solve the problem for developers not using WHOPR. :-)

I certainly don't claim it's a general solution; at some point, if you
have enouh cores, the single core compiling insn-attrtab will dominate
anyhow.  But if some easy Makefile hackery would help some developers,
I'd still argue we should do it.  And I still have a hard time with a
change that's going to make the compiler slower in order to save a
minute or two during compiler builds, especially given that test cycles
are so much longer than build cycles.
Ralf Wildenhues June 15, 2010, 7:03 p.m. UTC | #5
* Mark Mitchell wrote on Tue, Jun 15, 2010 at 08:39:47PM CEST:
> But if some easy Makefile hackery would help some developers,
> I'd still argue we should do it.

IIRC, this hackery has already been done sometime last year or
so, with reordering of objects and the order-only dependencies.
Unless the build has changed a lot (haven't checked), it is
simply not possible to improve in a generally useful way
without changing GNU make to allow for more fine-tuning.

It is typically helpful to not increase N in -jN too much over
the number of available cores.

Cheers,
Ralf
Jakub Jelinek June 15, 2010, 7:16 p.m. UTC | #6
On Tue, Jun 15, 2010 at 11:39:47AM -0700, Mark Mitchell wrote:
> Jan Hubicka wrote:
> 
> > In WHOPR compilation we now partition the program based on source files (this
> > is something I plan to revisit later), so we get partition that corresponds to
> > insn-attrtab.  WPA driver sorts units by size and feeds them to parallel make
> > and insn-attrtab is still last on finishing build by quite long shot.  This
> > ignore time spent by genattrtab itself.
> > 
> > So just reordering makefiles is not going to solve the problem.
> 
> Well, it might solve the problem for developers not using WHOPR. :-)
> 
> I certainly don't claim it's a general solution; at some point, if you
> have enouh cores, the single core compiling insn-attrtab will dominate
> anyhow.  But if some easy Makefile hackery would help some developers,
> I'd still argue we should do it.  And I still have a hard time with a
> change that's going to make the compiler slower in order to save a
> minute or two during compiler builds, especially given that test cycles
> are so much longer than build cycles.

I believe on x86_64/i686 the most time is spent in compiling
internal_dfa_insn_code, primarily because there are so many different
schedulings.
The insn is a big switch on recog_memoized, where most of the cases first
compare ix86_schedule var to some enum.  I guess it would be certainly
faster to compile to instead split the big function into separate function
for each schedule and make internal_dfa_insn_code a function pointer, would
need to be benchmarked how it would actually perform at runtime.

	Jakub
Paolo Bonzini June 15, 2010, 7:48 p.m. UTC | #7
On 06/15/2010 07:57 PM, Dave Korn wrote:
> On 15/06/2010 18:19, Mark Mitchell wrote:
>> Michael Matz wrote:
>>
>>> This speeds up genattrtab to be no time issue during bootstrap anymore.
>>
>>> The speed difference
>>> of the compiler is acceptable I think, actually even speeding up the
>>> compiler sometimes (probably cache effects, because the .text size of
>>> insn-attrtab is _much_ smaller) or being in the noise.
>>
>> I certainly see how this benefits us as GCC developers.
>
>    90 seconds out of a whole build, or did I misread something?

90 seconds out of "touch something.md; make cc1" is significant.  It's 
the most sensitive part for day-to-day backend development.

Here is Michael's data reformatted:

	big_u			kde_u		
	clean	try		clean	try	
alpha	43.21	43.26   0.12%	32.52	32.50  -0.06%
arm	49.25	49.85   1.22%	37.89	38.10   0.55%
crisv32	36.17	36.21   0.11%	27.33	27.43   0.37%
hppa	46.77	46.97   0.43%	31.97	31.84  -0.41%
i386	33.51	33.78   0.81%	29.93	30.12	 0.63%
ia64	66.55	66.62   0.11%	49.81	49.81	 0.00%
mips	51.23	50.77  -0.90%			
powerpc	49.74	49.38  -0.72%	34.15	34.71	 1.64%
s390x	47.26	47.41   0.32%	32.83	33.64	 2.47%
sh	50.99	50.79  -0.39%	38.09	38.13	 0.11%
sparc	44.21	43.27  -2.13%	32.98	32.93	-0.15%
x86_64	28.78	28.96   0.63%	28.72	28.96	 0.84%

It's a tougher call than it seems...  Unfortunately, of the four likely 
most used platforms (i386, x86_64, mips and arm) three are degraded 
consistently, and the fourth lacks one of the two data points.  That's 
indeed not too good. :-/

Compiling genattrtab at -O1 saves half of the time on x86_64, and 
cfgcleanup goes from 20 to 0.7 seconds.  So it should be worthwhile 
investigating _which_ cfgcleanup pass is spending so much time. 
Compiling insn-recog.c at -O1 instead saves 30% of the compilation time.

I used to like Michael's patches a lot but nowadays parallelization of 
the build is much better thanks to more cores in the systems.  Since 
genattrtab can run in parallel with most of the rest of the build, what 
we need to save time on is "touch x.md; make -j6 cc1" (with the 
insn-attrtab.c move-if-changed rule disabled of course).  Maybe -O1 is 
enough together with Jakub's old patch to split insn-attrtab.c into four 
or eight files.  Moreover, the "split" idea could be applied to 
insn-recog.c too.

Michael, can you get numbers for your testcases when you compile "clean" 
insn-attrtab.c and/or insn-recog.c with -O1?

Paolo
David Daney June 15, 2010, 8:50 p.m. UTC | #8
On 06/15/2010 10:57 AM, Dave Korn wrote:
> On 15/06/2010 18:19, Mark Mitchell wrote:
>> Michael Matz wrote:
>>
>>> This speeds up genattrtab to be no time issue during bootstrap anymore.
>>
>>> The speed difference
>>> of the compiler is acceptable I think, actually even speeding up the
>>> compiler sometimes (probably cache effects, because the .text size of
>>> insn-attrtab is _much_ smaller) or being in the noise.
>>
>> I certainly see how this benefits us as GCC developers.
>
>    90 seconds out of a whole build, or did I misread something?
>

For me, it is more like 20 minutes.  As Mark pointed out, the Makefile 
for some reason starts running genattrtab close to last in stages 2 and 
3.  So in my case (12 CPU mips64-linux), genattrtab takes about 50% of 
the wall-clock time to build a pass.

David Daney
Richard Biener June 16, 2010, 7:15 a.m. UTC | #9
On Tue, Jun 15, 2010 at 8:39 PM, Mark Mitchell <mark@codesourcery.com> wrote:
> Jan Hubicka wrote:
>
>> In WHOPR compilation we now partition the program based on source files (this
>> is something I plan to revisit later), so we get partition that corresponds to
>> insn-attrtab.  WPA driver sorts units by size and feeds them to parallel make
>> and insn-attrtab is still last on finishing build by quite long shot.  This
>> ignore time spent by genattrtab itself.
>>
>> So just reordering makefiles is not going to solve the problem.
>
> Well, it might solve the problem for developers not using WHOPR. :-)
>
> I certainly don't claim it's a general solution; at some point, if you
> have enouh cores, the single core compiling insn-attrtab will dominate
> anyhow.  But if some easy Makefile hackery would help some developers,
> I'd still argue we should do it.  And I still have a hard time with a
> change that's going to make the compiler slower in order to save a
> minute or two during compiler builds, especially given that test cycles
> are so much longer than build cycles.

I think the makefile ordering is already good - it is just that genattrtab
takes so much time that compiling insn-attrtab starts very late.  Michael
addresses this by speeding up genattrtab.

Richard.

> --
> Mark Mitchell
> CodeSourcery
> mark@codesourcery.com
> (650) 331-3385 x713
>
Nathan Froyd June 16, 2010, 1:21 p.m. UTC | #10
On Wed, Jun 16, 2010 at 09:15:39AM +0200, Richard Guenther wrote:
> On Tue, Jun 15, 2010 at 8:39 PM, Mark Mitchell <mark@codesourcery.com> wrote:
> > I certainly don't claim it's a general solution; at some point, if you
> > have enouh cores, the single core compiling insn-attrtab will dominate
> > anyhow.  But if some easy Makefile hackery would help some developers,
> > I'd still argue we should do it.  And I still have a hard time with a
> > change that's going to make the compiler slower in order to save a
> > minute or two during compiler builds, especially given that test cycles
> > are so much longer than build cycles.
> 
> I think the makefile ordering is already good - it is just that genattrtab
> takes so much time that compiling insn-attrtab starts very late.  Michael
> addresses this by speeding up genattrtab.

Pardon the dumb question here, but IIUC, we build insn-attrtab at each
stage.  Why not just build it once for all stages?  Do we gain anything
(besides less complicated Makefiles) by building it at every stage?  I
realize that this is not quite as nice as cutting its build time by
90%+, but moving it earlier might hide the compilation latency just as
effectively.

-Nathan
Michael Matz June 16, 2010, 1:23 p.m. UTC | #11
Hi,

On Wed, 16 Jun 2010, Nathan Froyd wrote:

> On Wed, Jun 16, 2010 at 09:15:39AM +0200, Richard Guenther wrote:
> > On Tue, Jun 15, 2010 at 8:39 PM, Mark Mitchell <mark@codesourcery.com> wrote:
> > > I certainly don't claim it's a general solution; at some point, if you
> > > have enouh cores, the single core compiling insn-attrtab will dominate
> > > anyhow.  But if some easy Makefile hackery would help some developers,
> > > I'd still argue we should do it.  And I still have a hard time with a
> > > change that's going to make the compiler slower in order to save a
> > > minute or two during compiler builds, especially given that test cycles
> > > are so much longer than build cycles.
> > 
> > I think the makefile ordering is already good - it is just that genattrtab
> > takes so much time that compiling insn-attrtab starts very late.  Michael
> > addresses this by speeding up genattrtab.
> 
> Pardon the dumb question here, but IIUC, we build insn-attrtab at each 
> stage.  Why not just build it once for all stages?

Because that's the definition of bootstrapping.  Either we bootstrap, or 
we don't.  It doesn't make sense to bootstrap only half the sources.


Ciao,
Michael.
Nathan Froyd June 16, 2010, 1:37 p.m. UTC | #12
On Wed, Jun 16, 2010 at 03:23:49PM +0200, Michael Matz wrote:
> On Wed, 16 Jun 2010, Nathan Froyd wrote:
> > On Wed, Jun 16, 2010 at 09:15:39AM +0200, Richard Guenther wrote:
> > > I think the makefile ordering is already good - it is just that genattrtab
> > > takes so much time that compiling insn-attrtab starts very late.  Michael
> > > addresses this by speeding up genattrtab.
> > 
> > Pardon the dumb question here, but IIUC, we build insn-attrtab at each 
> > stage.  Why not just build it once for all stages?
> 
> Because that's the definition of bootstrapping.  Either we bootstrap, or 
> we don't.  It doesn't make sense to bootstrap only half the sources.

I'm not asking why we compile insn-attrtab at every stage, but why we
run genattrtab at every stage.  I understand that it's nice to compile
genattrtab at each stage for a little sanity checking, but the compiler
used to compile genattrtab should have no effect on the output of
genattrtab.

Or were you already responding to "why generate insn-attrtab at every
stage"?

-Nathan
Mike Stump June 16, 2010, 2:30 p.m. UTC | #13
On Jun 16, 2010, at 6:37 AM, Nathan Froyd <froydnj@codesourcery.com> wrote:
> On Wed, Jun 16, 2010 at 03:23:49PM +0200, Michael Matz wrote:
>> On Wed, 16 Jun 2010, Nathan Froyd wrote:
>>> On Wed, Jun 16, 2010 at 09:15:39AM +0200, Richard Guenther wrote:
>>>> I think the makefile ordering is already good - it is just that genattrtab
>>>> takes so much time that compiling insn-attrtab starts very late.  Michael
>>>> addresses this by speeding up genattrtab.
>>> 
>>> Pardon the dumb question here, but IIUC, we build insn-attrtab at each 
>>> stage.  Why not just build it once for all stages?
>> 
>> Because that's the definition of bootstrapping.  Either we bootstrap, or 
>> we don't.  It doesn't make sense to bootstrap only half the sources.
> 
> I'm not asking why we compile insn-attrtab at every stage, but why we
> run genattrtab at every stage.  I understand that it's nice to compile
> genattrtab at each stage for a little sanity checking, but the compiler
> used to compile genattrtab should have no effect on the output of
> genattrtab.

That is true, unless there are bugs.  We compile with gcc , and then test to ensure there are no bugs.  In this fashion, 100% of the source that is compiled, is compiled with gcc, the very definition of bootstrap.  People that want more speed at the expense of testing, don't bootstrap.
Michael Matz June 16, 2010, 2:43 p.m. UTC | #14
Hi,

On Wed, 16 Jun 2010, Nathan Froyd wrote:

> > > Pardon the dumb question here, but IIUC, we build insn-attrtab at 
> > > each stage.  Why not just build it once for all stages?
> > 
> > Because that's the definition of bootstrapping.  Either we bootstrap, 
> > or we don't.  It doesn't make sense to bootstrap only half the 
> > sources.
> 
> I'm not asking why we compile insn-attrtab at every stage, but why we 
> run genattrtab at every stage.  I understand that it's nice to compile 
> genattrtab at each stage for a little sanity checking, but the compiler 
> used to compile genattrtab should have no effect on the output of 
> genattrtab.

If we were going that route, then we wouldn't need bootstrapping at all, 
because the compiler used to compile $random.c should have no effect on 
the output of the final cc1.

> Or were you already responding to "why generate insn-attrtab at every
> stage"?

Implicitely, yes.


Ciao,
Michael.
Jan Hubicka June 16, 2010, 3:51 p.m. UTC | #15
> Hi,
> 
> On Wed, 16 Jun 2010, Nathan Froyd wrote:
> 
> > > > Pardon the dumb question here, but IIUC, we build insn-attrtab at 
> > > > each stage.  Why not just build it once for all stages?
> > > 
> > > Because that's the definition of bootstrapping.  Either we bootstrap, 
> > > or we don't.  It doesn't make sense to bootstrap only half the 
> > > sources.
> > 
> > I'm not asking why we compile insn-attrtab at every stage, but why we 
> > run genattrtab at every stage.  I understand that it's nice to compile 
> > genattrtab at each stage for a little sanity checking, but the compiler 
> > used to compile genattrtab should have no effect on the output of 
> > genattrtab.
> 
> If we were going that route, then we wouldn't need bootstrapping at all, 
> because the compiler used to compile $random.c should have no effect on 
> the output of the final cc1.
> 
> > Or were you already responding to "why generate insn-attrtab at every
> > stage"?
> 
> Implicitely, yes.

With a lot of cores, we might actually execute getnattrtab and compilation
of stage1 insn-attrtab.c at same time and then just sanity check that genattrtab
did not change results :)

Honza
> 
> 
> Ciao,
> Michael.
Paolo Bonzini June 16, 2010, 5:03 p.m. UTC | #16
On 06/16/2010 05:51 PM, Jan Hubicka wrote:
> With a lot of cores, we might actually execute getnattrtab and compilation
> of stage1 insn-attrtab.c at same time and then just sanity check that genattrtab
> did not change results :)

That's a different story.  It doesn't break the idea of bootstrapping, 
and can be applied to all insn-* files.

Paolo
Michael Matz June 17, 2010, 11:39 a.m. UTC | #17
Hi,

On Wed, 16 Jun 2010, Paolo Bonzini wrote:

> On 06/16/2010 05:51 PM, Jan Hubicka wrote:
> > With a lot of cores, we might actually execute getnattrtab and compilation
> > of stage1 insn-attrtab.c at same time and then just sanity check that
> > genattrtab
> > did not change results :)
> 
> That's a different story.  It doesn't break the idea of bootstrapping, 
> and can be applied to all insn-* files.

But it also doesn't give any gains.  You still would have to run 
genattrtab (to get something to compare with), and you still would have to 
compile insn-attrtab.c (to check that cc1 produced the same code as last 
stage).


Ciao,
Michael.
Jan Hubicka June 17, 2010, 11:45 a.m. UTC | #18
> Hi,
> 
> On Wed, 16 Jun 2010, Paolo Bonzini wrote:
> 
> > On 06/16/2010 05:51 PM, Jan Hubicka wrote:
> > > With a lot of cores, we might actually execute getnattrtab and compilation
> > > of stage1 insn-attrtab.c at same time and then just sanity check that
> > > genattrtab
> > > did not change results :)
> > 
> > That's a different story.  It doesn't break the idea of bootstrapping, 
> > and can be applied to all insn-* files.
> 
> But it also doesn't give any gains.  You still would have to run 
> genattrtab (to get something to compare with), and you still would have to 
> compile insn-attrtab.c (to check that cc1 produced the same code as last 
> stage).

This will let you to start insn-attrtab compilation as first thing during build,
so you gain parallelizm in between genattrtab and insn-attrtab compilation for stage>1.
Obviously as the insn-attrtab compilation time is still a bottleneck in whole build
(as can be seen in whopr link), we won't solve this completely, but we will get
genattrtab runtime less relevant to the whole build time.

Honza
> 
> 
> Ciao,
> Michael.
Paolo Bonzini June 17, 2010, 11:48 a.m. UTC | #19
On 06/17/2010 01:39 PM, Michael Matz wrote:
> On Wed, 16 Jun 2010, Paolo Bonzini wrote:
>> On 06/16/2010 05:51 PM, Jan Hubicka wrote:
>>> With a lot of cores, we might actually execute getnattrtab and compilation
>>> of stage1 insn-attrtab.c at same time and then just sanity check that
>>> genattrtab
>>> did not change results :)
>>
>> That's a different story.  It doesn't break the idea of bootstrapping,
>> and can be applied to all insn-* files.
>
> But it also doesn't give any gains.  You still would have to run
> genattrtab (to get something to compare with), and you still would have to
> compile insn-attrtab.c (to check that cc1 produced the same code as last
> stage).

You can parallelize those two, so you would still get a 50% gain on a 
beefy-enough machine (pre Jakub's patch).  However, the insn_dfa_code 
split makes the serialization of genattrtab and insn-attrtab.o 
compilation much less troublesome.

Paolo
Michael Matz June 17, 2010, 11:48 a.m. UTC | #20
Hi,

On Thu, 17 Jun 2010, Jan Hubicka wrote:

> > > That's a different story.  It doesn't break the idea of 
> > > bootstrapping, and can be applied to all insn-* files.
> > 
> > But it also doesn't give any gains.  You still would have to run 
> > genattrtab (to get something to compare with), and you still would 
> > have to compile insn-attrtab.c (to check that cc1 produced the same 
> > code as last stage).
> 
> This will let you to start insn-attrtab compilation as first thing 
> during build,

And then restart if the genattrtab result turns out to be different?  Or 
just stop compilation?


Ciao,
Michael.
Jan Hubicka June 17, 2010, 11:49 a.m. UTC | #21
> Hi,
> 
> On Thu, 17 Jun 2010, Jan Hubicka wrote:
> 
> > > > That's a different story.  It doesn't break the idea of 
> > > > bootstrapping, and can be applied to all insn-* files.
> > > 
> > > But it also doesn't give any gains.  You still would have to run 
> > > genattrtab (to get something to compare with), and you still would 
> > > have to compile insn-attrtab.c (to check that cc1 produced the same 
> > > code as last stage).
> > 
> > This will let you to start insn-attrtab compilation as first thing 
> > during build,
> 
> And then restart if the genattrtab result turns out to be different?  Or 
> just stop compilation?

Stop compilation. They should be the same, right?

Honza
diff mbox

Patch

Index: genattrtab.c
===================================================================
--- genattrtab.c	(revision 160784)
+++ genattrtab.c	(working copy)
@@ -241,6 +241,7 @@  static const char *length_str;
 static const char *delay_type_str;
 static const char *delay_1_0_str;
 static const char *num_delay_slots_str;
+static const char *insn_code_str;
 
 /* Simplify an expression.  Only call the routine if there is something to
    simplify.  */
@@ -1621,6 +1622,58 @@  write_length_unit_log (void)
   printf ("EXPORTED_CONST int length_unit_log = %u;\n", length_unit_log);
 }
 
+/* Compute approximate cost of the expression.  Used to decide whether
+   expression is cheap enough for inline.  */
+static int
+attr_rtx_cost (rtx x)
+{
+  int cost = 1;
+  enum rtx_code code;
+  if (!x)
+    return 0;
+  code = GET_CODE (x);
+  switch (code)
+    {
+    case MATCH_OPERAND:
+      if (XSTR (x, 1)[0])
+	return 10;
+      else
+	return 1;
+
+    case EQ_ATTR_ALT:
+      return 1;
+
+    case EQ_ATTR:
+      /* Alternatives don't result into function call.  */
+      if (!strcmp_check (XSTR (x, 0), alternative_name)
+          || !strcmp_check (XSTR (x, 0), insn_code_str))
+	return 1;
+      else
+	return 5;
+    default:
+      {
+	int i, j;
+	const char *fmt = GET_RTX_FORMAT (code);
+	for (i = GET_RTX_LENGTH (code) - 1; i >= 0; i--)
+	  {
+	    switch (fmt[i])
+	      {
+	      case 'V':
+	      case 'E':
+		for (j = 0; j < XVECLEN (x, i); j++)
+		  cost += attr_rtx_cost (XVECEXP (x, i, j));
+		break;
+	      case 'e':
+		cost += attr_rtx_cost (XEXP (x, i));
+		break;
+	      }
+	  }
+      }
+      break;
+    }
+  return cost;
+}
+
 /* Take a COND expression and see if any of the conditions in it can be
    simplified.  If any are known true or known false for the particular insn
    code, the COND can be further simplified.
@@ -1744,6 +1797,111 @@  simplify_cond (rtx exp, int insn_code, i
   return ret;
 }
 
+/* Taken a COND expression EXP and a value AV, and returns a possibly
+   optimized variant of EXP.  This is similar to simplify_cond, but
+   instead of optimizing for just one instruction we optimize for the
+   set of instructions found in AV.  */
+
+static rtx
+simplify_cond_insn_list (rtx exp, struct attr_value *av)
+{
+  int i;
+  /* We store the desired contents here,
+     then build a new expression if they don't match EXP.  */
+  rtx defval = XEXP (exp, 1);
+  rtx new_defval = XEXP (exp, 1);
+  int len = XVECLEN (exp, 0);
+  rtx *tests;
+  int allsame = 1;
+  rtx ret;
+
+  if (av->has_asm_insn)
+    return exp;
+
+  tests = XNEWVEC (rtx, len);
+  /* This lets us free all storage allocated below, if appropriate.  */
+  obstack_finish (rtl_obstack);
+
+  memcpy (tests, XVEC (exp, 0)->elem, len * sizeof (rtx));
+
+  /* See if default value needs simplification.  */
+  if (GET_CODE (defval) == COND)
+    new_defval = simplify_cond_insn_list (defval, av);
+
+  for (i = 0; i < len; i += 2)
+    {
+      rtx newtest, newval;
+      struct insn_ent *ie;
+      int oldcost;
+
+      oldcost = attr_rtx_cost (XVECEXP (exp, 0, i));
+      tests[i] = false_rtx;
+      for (ie = av->first_insn; ie != 0; ie = ie->next)
+        {
+	  rtx orig_test = XVECEXP (exp, 0, i);
+	  gcc_assert (ie->def->insn_code >= 0);
+	  newtest = simplify_test_exp_in_temp (orig_test, ie->def->insn_code,
+	  				       ie->def->insn_index);
+	  /* We couldn't simplify the expression for this insn code,
+	     so it makes no sense to try further, as syntactically we
+	     can't shorten the test anymore.  Further, also the
+	     simplifications done up to now can be removed.  */
+          if (newtest == orig_test)
+	    {
+	      tests[i] = orig_test;
+	      break;
+	    }
+	  if (newtest != false_rtx)
+	    {
+	      newtest = attr_rtx (AND,
+				  attr_eq (insn_code_str,
+					   attr_numeral (ie->def->insn_code)),
+				  newtest);
+	      tests[i] = insert_right_side (IOR, tests[i], newtest, -2, -2);
+	    }
+        }
+
+      if (attr_rtx_cost (tests[i]) > oldcost)
+        tests[i] = XVECEXP (exp, 0, i);
+
+      /* Simplify this test.  */
+      newtest = simplify_test_exp_in_temp (tests[i], -2, -2);
+      tests[i] = newtest;
+
+      newval = tests[i + 1];
+      /* See if this value may need simplification.  */
+      if (GET_CODE (newval) == COND)
+	newval = simplify_cond_insn_list (newval, av);
+
+      tests[i + 1] = newval;
+    }
+
+  /* See if we changed anything.  */
+  if (len != XVECLEN (exp, 0) || new_defval != XEXP (exp, 1))
+    allsame = 0;
+  else
+    for (i = 0; i < len; i++)
+      if (! attr_equal_p (tests[i], XVECEXP (exp, 0, i)))
+	{
+	  allsame = 0;
+	  break;
+	}
+
+  if (allsame)
+    ret = exp;
+  else
+    {
+      rtx newexp = rtx_alloc (COND);
+
+      XVEC (newexp, 0) = rtvec_alloc (len);
+      memcpy (XVEC (newexp, 0)->elem, tests, len * sizeof (rtx));
+      XEXP (newexp, 1) = new_defval;
+      ret = newexp;
+    }
+  free (tests);
+  return ret;
+}
+
 /* Remove an insn entry from an attribute value.  */
 
 static void
@@ -2216,57 +2374,6 @@  simplify_or_tree (rtx exp, rtx *pterm, i
   return exp;
 }
 
-/* Compute approximate cost of the expression.  Used to decide whether
-   expression is cheap enough for inline.  */
-static int
-attr_rtx_cost (rtx x)
-{
-  int cost = 0;
-  enum rtx_code code;
-  if (!x)
-    return 0;
-  code = GET_CODE (x);
-  switch (code)
-    {
-    case MATCH_OPERAND:
-      if (XSTR (x, 1)[0])
-	return 10;
-      else
-	return 0;
-
-    case EQ_ATTR_ALT:
-      return 0;
-
-    case EQ_ATTR:
-      /* Alternatives don't result into function call.  */
-      if (!strcmp_check (XSTR (x, 0), alternative_name))
-	return 0;
-      else
-	return 5;
-    default:
-      {
-	int i, j;
-	const char *fmt = GET_RTX_FORMAT (code);
-	for (i = GET_RTX_LENGTH (code) - 1; i >= 0; i--)
-	  {
-	    switch (fmt[i])
-	      {
-	      case 'V':
-	      case 'E':
-		for (j = 0; j < XVECLEN (x, i); j++)
-		  cost += attr_rtx_cost (XVECEXP (x, i, j));
-		break;
-	      case 'e':
-		cost += attr_rtx_cost (XEXP (x, i));
-		break;
-	      }
-	  }
-      }
-      break;
-    }
-  return cost;
-}
-
 /* Simplify test expression and use temporary obstack in order to avoid
    memory bloat.  Use ATTR_IND_SIMPLIFIED to avoid unnecessary simplifications
    and avoid unnecessary copying if possible.  */
@@ -2599,6 +2706,25 @@  simplify_test_exp (rtx exp, int insn_cod
 	  return SIMPLIFY_TEST_EXP (newexp, insn_code, insn_index);
 	}
 
+      /* Similarly,
+	    convert (ior (and (y) (x))
+			 (and (z) (x)))
+	    to      (and (ior (y) (z))
+			 (x))
+         Note that we want the common term to stay at the end.
+       */
+
+      else if (GET_CODE (left) == AND && GET_CODE (right) == AND
+	       && attr_equal_p (XEXP (left, 1), XEXP (right, 1)))
+	{
+	  newexp = attr_rtx (IOR, XEXP (left, 0), XEXP (right, 0));
+
+	  left = newexp;
+	  right = XEXP (right, 1);
+	  newexp = attr_rtx (AND, left, right);
+	  return SIMPLIFY_TEST_EXP (newexp, insn_code, insn_index);
+	}
+
       /* See if all or all but one of the insn's alternatives are specified
 	 in this tree.  Optimize if so.  */
 
@@ -2702,6 +2828,13 @@  simplify_test_exp (rtx exp, int insn_cod
 	  break;
 	}
 
+      if (XSTR (exp, 0) == insn_code_str)
+        {
+	  if (insn_code >= 0)
+	    newexp = atoi (XSTR (exp, 1)) == insn_code ? true_rtx : false_rtx;
+	  break;
+	}
+
       /* Look at the value for this insn code in the specified attribute.
 	 We normally can replace this comparison with the condition that
 	 would give this insn the values being tested for.  */
@@ -2734,7 +2867,7 @@  simplify_test_exp (rtx exp, int insn_cod
 	      x = evaluate_eq_attr (exp, attr, av->value,
 				    insn_code, insn_index);
 	      x = SIMPLIFY_TEST_EXP (x, insn_code, insn_index);
-	      if (attr_rtx_cost(x) < 20)
+	      if (attr_rtx_cost(x) < 7)
 		return x;
 	    }
 	}
@@ -2754,6 +2887,131 @@  simplify_test_exp (rtx exp, int insn_cod
   return newexp;
 }
 
+/* Return 1 if any EQ_ATTR subexpression of P refers to ATTR,
+   otherwise return 0.  */
+
+static int
+tests_attr_p (rtx p, struct attr_desc *attr)
+{
+  const char *fmt;
+  int i, ie, j, je;
+
+  if (GET_CODE (p) == EQ_ATTR)
+    {
+      if (XSTR (p, 0) != attr->name)
+	return 0;
+      return 1;
+    }
+
+  fmt = GET_RTX_FORMAT (GET_CODE (p));
+  ie = GET_RTX_LENGTH (GET_CODE (p));
+  for (i = 0; i < ie; i++)
+    {
+      switch (*fmt++)
+	{
+	case 'e':
+	  if (tests_attr_p (XEXP (p, i), attr))
+	    return 1;
+	  break;
+
+	case 'E':
+	  je = XVECLEN (p, i);
+	  for (j = 0; j < je; ++j)
+	    if (tests_attr_p (XVECEXP (p, i, j), attr))
+	      return 1;
+	  break;
+	}
+    }
+
+  return 0;
+}
+
+/* Calculate a topological sorting of all attributes so that
+   all attributes only depend on attributes in front of it.
+   Place the result in *RET (which is a pointer to an array of
+   attr_desc pointers), and return the size of that array.  */
+
+static int
+get_attr_order (struct attr_desc ***ret)
+{
+  int i, j;
+  int num = 0;
+  struct attr_desc *attr;
+  struct attr_desc **all, **sorted;
+  char *handled;
+  for (i = 0; i < MAX_ATTRS_INDEX; i++)
+    for (attr = attrs[i]; attr; attr = attr->next)
+      num++;
+  all = XNEWVEC (struct attr_desc *, num);
+  sorted = XNEWVEC (struct attr_desc *, num);
+  handled = XCNEWVEC (char, num);
+  num = 0;
+  for (i = 0; i < MAX_ATTRS_INDEX; i++)
+    for (attr = attrs[i]; attr; attr = attr->next)
+      all[num++] = attr;
+
+  j = 0;
+  for (i = 0; i < num; i++)
+    if (all[i]->is_const)
+      handled[i] = 1, sorted[j++] = all[i];
+
+  /* We have only few attributes hence we can live with the inner
+     loop being O(n^2), unlike the normal fast variants of topological
+     sorting.  */
+  while (j < num)
+    {
+      for (i = 0; i < num; i++)
+	if (!handled[i])
+	  {
+	    /* Let's see if I depends on anything interesting.  */
+	    int k;
+	    for (k = 0; k < num; k++)
+	      if (!handled[k])
+		{
+		  struct attr_value *av;
+		  for (av = all[i]->first_value; av; av = av->next)
+		    if (av->num_insns != 0)
+		      if (tests_attr_p (av->value, all[k]))
+			break;
+
+		  if (av)
+		    /* Something in I depends on K.  */
+		    break;
+		}
+	    if (k == num)
+	      {
+		/* Nothing in I depended on anything intersting, so
+		   it's done.  */
+		handled[i] = 1;
+		sorted[j++] = all[i];
+	      }
+	  }
+    }
+
+  for (j = 0; j < num; j++)
+    {
+      struct attr_desc *attr2;
+      struct attr_value *av;
+
+      attr = sorted[j];
+      fprintf (stderr, "%s depends on: ", attr->name);
+      for (i = 0; i < MAX_ATTRS_INDEX; ++i)
+	for (attr2 = attrs[i]; attr2; attr2 = attr2->next)
+	  if (!attr2->is_const)
+	    for (av = attr->first_value; av; av = av->next)
+	      if (av->num_insns != 0)
+		if (tests_attr_p (av->value, attr2))
+		  {
+		    fprintf (stderr, "%s, ", attr2->name);
+		    break;
+		  }
+      fprintf (stderr, "\n");
+    }
+  free (all);
+  *ret = sorted;
+  return num;
+}
+
 /* Optimize the attribute lists by seeing if we can determine conditional
    values from the known values of other attributes.  This will save subroutine
    calls during the compilation.  */
@@ -2768,6 +3026,8 @@  optimize_attrs (void)
   int i;
   struct attr_value_list *ivbuf;
   struct attr_value_list *iv;
+  struct attr_desc **topsort;
+  int topnum;
 
   /* For each insn code, make a list of all the insn_ent's for it,
      for all values for all attributes.  */
@@ -2783,18 +3043,22 @@  optimize_attrs (void)
 
   iv = ivbuf = XNEWVEC (struct attr_value_list, num_insn_ents);
 
-  for (i = 0; i < MAX_ATTRS_INDEX; i++)
-    for (attr = attrs[i]; attr; attr = attr->next)
-      for (av = attr->first_value; av; av = av->next)
-	for (ie = av->first_insn; ie; ie = ie->next)
-	  {
-	    iv->attr = attr;
-	    iv->av = av;
-	    iv->ie = ie;
-	    iv->next = insn_code_values[ie->def->insn_code];
-	    insn_code_values[ie->def->insn_code] = iv;
-	    iv++;
-	  }
+  /* Create the chain of insn*attr values such that we see dependend
+     attributes after their dependencies.  As we use a stack via the
+     next pointers start from the end of the topological order.  */
+  topnum = get_attr_order (&topsort);
+  for (i = topnum - 1; i >= 0; i--)
+    for (av = topsort[i]->first_value; av; av = av->next)
+      for (ie = av->first_insn; ie; ie = ie->next)
+	{
+	  iv->attr = topsort[i];
+	  iv->av = av;
+	  iv->ie = ie;
+	  iv->next = insn_code_values[ie->def->insn_code];
+	  insn_code_values[ie->def->insn_code] = iv;
+	  iv++;
+	}
+  free (topsort);
 
   /* Sanity check on num_insn_ents.  */
   gcc_assert (iv == ivbuf + num_insn_ents);
@@ -2829,7 +3093,15 @@  optimize_attrs (void)
 	    }
 
 	  rtl_obstack = old;
-	  if (newexp != av->value)
+	  /* If we created a new value for this instruction, and it's
+	     cheaper than the old value, and overall cheap, use that
+	     one as specific value for the current instruction.
+	     The last test is to avoid exploding the get_attr_ function
+	     sizes for no much gain.  */
+	  if (newexp != av->value
+	      && attr_rtx_cost (newexp) < attr_rtx_cost (av->value)
+	      && attr_rtx_cost (newexp) < 26
+	     )
 	    {
 	      newexp = attr_copy_rtx (newexp);
 	      remove_insn_ent (av, ie);
@@ -3332,6 +3604,12 @@  write_test_expr (rtx exp, int flags)
 	  break;
 	}
 
+      if (XSTR (exp, 0) == insn_code_str)
+        {
+	  printf ("insn_code == %s", XSTR (exp, 1));
+	  break;
+	}
+
       attr = find_attr (&XSTR (exp, 0), 0);
       gcc_assert (attr);
 
@@ -3618,10 +3896,80 @@  walk_attr_value (rtx exp)
       }
 }
 
-/* Write out a function to obtain the attribute for a given INSN.  */
+/* If a subexpression of P refers to ATTR, write out C code to retrieve
+   the value of that attribute storing it in a local variable.  */
+
+static int
+write_expr_attr_cache (rtx p, struct attr_desc *attr, int indent)
+{
+  const char *fmt;
+  int i, ie, j, je;
+
+  if (GET_CODE (p) == EQ_ATTR)
+    {
+      if (XSTR (p, 0) != attr->name)
+	return 0;
+
+      write_indent (indent);
+      if (attr->enum_name)
+	printf ("  enum %s ", attr->enum_name);
+      else if (!attr->is_numeric)
+	printf ("  enum attr_%s ", attr->name);
+      else
+	printf ("  int ");
+
+      printf ("attr_%s ATTRIBUTE_UNUSED = get_attr_%s (insn);\n",
+	      attr->name, attr->name);
+      return 1;
+    }
+
+  fmt = GET_RTX_FORMAT (GET_CODE (p));
+  ie = GET_RTX_LENGTH (GET_CODE (p));
+  for (i = 0; i < ie; i++)
+    {
+      switch (*fmt++)
+	{
+	case 'e':
+	  if (write_expr_attr_cache (XEXP (p, i), attr, indent))
+	    return 1;
+	  break;
+
+	case 'E':
+	  je = XVECLEN (p, i);
+	  for (j = 0; j < je; ++j)
+	    if (write_expr_attr_cache (XVECEXP (p, i, j), attr, indent))
+	      return 1;
+	  break;
+	}
+    }
+
+  return 0;
+}
+
+/* Given a VALUE (possibly having EQ_ATTR tests in subexpression)
+   write out C code to retrieve the value of all attributes used
+   in tests embedded therein.  */
 
 static void
-write_attr_get (struct attr_desc *attr)
+write_cache_used_attr_for_value (rtx value, int indent)
+{
+  struct attr_desc *attr2;
+  int i;
+
+  for (i = 0; i < MAX_ATTRS_INDEX; ++i)
+    for (attr2 = attrs[i]; attr2; attr2 = attr2->next)
+      if (!attr2->is_const)
+	write_expr_attr_cache (value, attr2, indent);
+}
+
+/* Write out C code to calculate the value of ATTR per instruction.
+   PREFIX and SUFFIX are used to delimit the 'return' statement delivering
+   the value to our caller.  Either it will form a real return statement,
+   or an accumulator update.  */
+
+static void
+write_attr_switch (struct attr_desc *attr, int indent, const char *prefix,
+		   const char *suffix)
 {
   struct attr_value *av, *common_av;
 
@@ -3629,6 +3977,32 @@  write_attr_get (struct attr_desc *attr)
      switch we will generate.  */
   common_av = find_most_used (attr);
 
+  write_indent (indent);
+  printf ("{\n");
+  write_indent (indent);
+  printf ("  int insn_code = recog_memoized (insn);\n");
+  write_indent (indent);
+  printf ("  switch (insn_code)\n");
+  write_indent (indent);
+  printf ("    {\n");
+
+  for (av = attr->first_value; av; av = av->next)
+    if (av != common_av)
+      write_attr_case (attr, av, 1, prefix, suffix, indent + 4, true_rtx);
+
+  write_attr_case (attr, common_av, 0, prefix, suffix, indent + 4, true_rtx);
+  
+  write_indent (indent);
+  printf ("    }\n");
+  write_indent (indent);
+  printf ("}\n\n");
+}
+
+/* Write out a function to obtain the attribute for a given INSN.  */
+
+static void
+write_attr_get (struct attr_desc *attr)
+{
   /* Write out start of function, then all values with explicit `case' lines,
      then a `default', then the value with the most uses.  */
   if (attr->enum_name)
@@ -3646,6 +4020,7 @@  write_attr_get (struct attr_desc *attr)
     printf ("get_attr_%s (rtx insn ATTRIBUTE_UNUSED)\n", attr->name);
   else
     {
+      struct attr_value *av;
       printf ("get_attr_%s (void)\n", attr->name);
       printf ("{\n");
 
@@ -3656,22 +4031,13 @@  write_attr_get (struct attr_desc *attr)
 			  av->first_insn->def->insn_index);
 	else if (av->num_insns != 0)
 	  write_attr_set (attr, 2, av->value, "return", ";",
-			  true_rtx, -2, 0);
+			  true_rtx, -2, -2);
 
       printf ("}\n\n");
       return;
     }
 
-  printf ("{\n");
-  printf ("  switch (recog_memoized (insn))\n");
-  printf ("    {\n");
-
-  for (av = attr->first_value; av; av = av->next)
-    if (av != common_av)
-      write_attr_case (attr, av, 1, "return", ";", 4, true_rtx);
-
-  write_attr_case (attr, common_av, 0, "return", ";", 4, true_rtx);
-  printf ("    }\n}\n\n");
+  write_attr_switch (attr, 0, "return", ";");
 }
 
 /* Given an AND tree of known true terms (because we are inside an `if' with
@@ -3727,6 +4093,10 @@  write_attr_set (struct attr_desc *attr,
 	  rtx testexp;
 	  rtx inner_true;
 
+	  /* Reset our_known_true after some time to not accumulate
+	     too much cruft (slowing down genattrtab).  */
+	  if ((i & 31) == 0)
+	    our_known_true = known_true;
 	  testexp = eliminate_known_true (our_known_true,
 					  XVECEXP (value, 0, i),
 					  insn_code, insn_index);
@@ -3755,7 +4125,7 @@  write_attr_set (struct attr_desc *attr,
 	  write_indent (indent);
 	  printf ("%sif ", first_if ? "" : "else ");
 	  first_if = 0;
-	  write_test_expr (testexp, 0);
+	  write_test_expr (testexp, 2);
 	  printf ("\n");
 	  write_indent (indent + 2);
 	  printf ("{\n");
@@ -3820,6 +4190,7 @@  write_attr_case (struct attr_desc *attr,
 		 int write_case_lines, const char *prefix, const char *suffix,
 		 int indent, rtx known_true)
 {
+  rtx opt_val;
   if (av->num_insns == 0)
     return;
 
@@ -3843,9 +4214,33 @@  write_attr_case (struct attr_desc *attr,
       printf ("default:\n");
     }
 
+  indent += 2;
+  write_indent (indent);
+  printf ("{\n");
+
+  opt_val = av->value;
+  /* If we have multiple but only few instructions associated with
+     this value we can possibly optimize this.  */
+  if (GET_CODE (opt_val) == COND && av->num_insns > 1 && av->num_insns < 10)
+    opt_val = simplify_cond_insn_list (opt_val, av);
+  while (GET_CODE (opt_val) == COND)
+    {
+      rtx newexp2;
+      if (av->num_insns == 1)
+	newexp2 = simplify_cond (opt_val, av->first_insn->def->insn_code,
+				 av->first_insn->def->insn_index);
+      else
+	newexp2 = simplify_cond (opt_val, -2, -2);
+      if (newexp2 == opt_val)
+	break;
+      opt_val = newexp2;
+    }
+
   /* See what we have to do to output this value.  */
   must_extract = must_constrain = address_used = 0;
-  walk_attr_value (av->value);
+  walk_attr_value (opt_val);
+
+  write_cache_used_attr_for_value (opt_val, indent);
 
   if (must_constrain)
     {
@@ -3859,11 +4254,11 @@  write_attr_case (struct attr_desc *attr,
     }
 
   if (av->num_insns == 1)
-    write_attr_set (attr, indent + 2, av->value, prefix, suffix,
+    write_attr_set (attr, indent + 2, opt_val, prefix, suffix,
 		    known_true, av->first_insn->def->insn_code,
 		    av->first_insn->def->insn_index);
   else
-    write_attr_set (attr, indent + 2, av->value, prefix, suffix,
+    write_attr_set (attr, indent + 2, opt_val, prefix, suffix,
 		    known_true, -2, 0);
 
   if (strncmp (prefix, "return", 6))
@@ -3871,6 +4266,8 @@  write_attr_case (struct attr_desc *attr,
       write_indent (indent + 2);
       printf ("break;\n");
     }
+  write_indent (indent);
+  printf ("}\n");
   printf ("\n");
 }
 
@@ -3993,7 +4390,6 @@  write_eligible_delay (const char *kind)
   char str[50];
   const char *pstr;
   struct attr_desc *attr;
-  struct attr_value *av, *common_av;
   int i;
 
   /* Compute the maximum number of delay slots required.  We use the delay
@@ -4026,19 +4422,11 @@  write_eligible_delay (const char *kind)
     {
       attr = find_attr (&delay_type_str, 0);
       gcc_assert (attr);
-      common_av = find_most_used (attr);
-
-      printf ("  insn = delay_insn;\n");
-      printf ("  switch (recog_memoized (insn))\n");
-      printf ("    {\n");
 
       sprintf (str, " * %d;\n      break;", max_slots);
-      for (av = attr->first_value; av; av = av->next)
-	if (av != common_av)
-	  write_attr_case (attr, av, 1, "slot +=", str, 4, true_rtx);
 
-      write_attr_case (attr, common_av, 0, "slot +=", str, 4, true_rtx);
-      printf ("    }\n\n");
+      printf ("  insn = delay_insn;\n");
+      write_attr_switch (attr, 2, "slot +=", str);
 
       /* Ensure matched.  Otherwise, shouldn't have been called.  */
       printf ("  gcc_assert (slot >= %d);\n\n", max_slots);
@@ -4047,20 +4435,10 @@  write_eligible_delay (const char *kind)
   /* If just one type of delay slot, write simple switch.  */
   if (num_delays == 1 && max_slots == 1)
     {
-      printf ("  insn = candidate_insn;\n");
-      printf ("  switch (recog_memoized (insn))\n");
-      printf ("    {\n");
-
       attr = find_attr (&delay_1_0_str, 0);
       gcc_assert (attr);
-      common_av = find_most_used (attr);
-
-      for (av = attr->first_value; av; av = av->next)
-	if (av != common_av)
-	  write_attr_case (attr, av, 1, "return", ";", 4, true_rtx);
-
-      write_attr_case (attr, common_av, 0, "return", ";", 4, true_rtx);
-      printf ("    }\n");
+      printf ("  insn = candidate_insn;\n");
+      write_attr_switch (attr, 2, "return", ";");
     }
 
   else
@@ -4076,21 +4454,12 @@  write_eligible_delay (const char *kind)
 	  {
 	    printf ("    case %d:\n",
 		    (i / 3) + (num_delays == 1 ? 0 : delay->num * max_slots));
-	    printf ("      switch (recog_memoized (insn))\n");
-	    printf ("\t{\n");
 
 	    sprintf (str, "*%s_%d_%d", kind, delay->num, i / 3);
 	    pstr = str;
 	    attr = find_attr (&pstr, 0);
 	    gcc_assert (attr);
-	    common_av = find_most_used (attr);
-
-	    for (av = attr->first_value; av; av = av->next)
-	      if (av != common_av)
-		write_attr_case (attr, av, 1, "return", ";", 8, true_rtx);
-
-	    write_attr_case (attr, common_av, 0, "return", ";", 8, true_rtx);
-	    printf ("      }\n");
+	    write_attr_switch (attr, 6, "return", ";");
 	  }
 
       printf ("    default:\n");
@@ -4133,7 +4502,7 @@  find_attr (const char **name_p, int crea
 
   /* Before we resort to using `strcmp', see if the string address matches
      anywhere.  In most cases, it should have been canonicalized to do so.  */
-  if (name == alternative_name)
+  if (name == alternative_name || name == insn_code_str)
     return NULL;
 
   index = name[0] & (MAX_ATTRS_INDEX - 1);
@@ -4458,6 +4827,7 @@  main (int argc, char **argv)
   delay_type_str = DEF_ATTR_STRING ("*delay_type");
   delay_1_0_str = DEF_ATTR_STRING ("*delay_1_0");
   num_delay_slots_str = DEF_ATTR_STRING ("*num_delay_slots");
+  insn_code_str = DEF_ATTR_STRING ("insn_code");
 
   printf ("/* Generated automatically by the program `genattrtab'\n\
 from the machine description file `md'.  */\n\n");
@@ -4573,7 +4943,7 @@  from the machine description file `md'.
   /* Construct extra attributes for `length'.  */
   make_length_attrs ();
 
-  /* Perform any possible optimizations to speed up compilation.  */
+  /* Perform some optimizations to speed up compilation.  */
   optimize_attrs ();
 
   /* Now write out all the `gen_attr_...' routines.  Do these before the