[2/2] Fix expansion slowness of PR60291

On Fri, 21 Feb 2014, Richard Sandiford wrote:

> Richard Biener <rguenther@suse.de> writes:
> > On Fri, 21 Feb 2014, Richard Biener wrote:
> >
> >> On Fri, 21 Feb 2014, Richard Biener wrote:
> >> 
> >> > On Fri, 21 Feb 2014, Richard Biener wrote:
> >> > 
> >> > > 
> >> > > This fixes the slowness of RTL expansion in PR60291 which is caused
> >> > > by excessive collisions in mem-attr sharing.  The issue here is
> >> > > that sharing attempts happens across functions and we have a _lot_
> >> > > of functions in this testcase referencing the same lexically
> >> > > equivalent memory, for example MEM[(StgWord *)_5 + -64B].  That
> >> > > means those get the same hash value.  But they don't compare
> >> > > equal because an SSA name _5 from function A is of course not equal
> >> > > to one from function B.
> >> > > 
> >> > > The following fixes that by not doing mem-attr sharing across functions
> >> > > by clearing the mem-attrs hashtable in rest_of_clean_state.
> >> > > 
> >> > > Another fix may be to do what the comment in iterative_hash_expr
> >> > > says for SSA names:
> >> > > 
> >> > >     case SSA_NAME:
> >> > >       /* We can just compare by pointer.  */
> >> > >       return iterative_hash_host_wide_int (SSA_NAME_VERSION (t), val);
> >> > > 
> >> > > (probably blame me for changing that to hashing the SSA version).
> >> > 
> >> > It was lxo.
> >> > 
> >> > > But I'm not sure that doesn't uncover issues with various hashtables and
> >> > > walking them, generating code dependent on the order.  It's IMHO just not
> >> > > expected that you compare function-local expressions from different
> >> > > functions.
> >> > 
> >> > Same speedup result from
> >> > 
> >> > Index: gcc/tree.c
> >> > ===================================================================
> >> > --- gcc/tree.c  (revision 207960)
> >> > +++ gcc/tree.c  (working copy)
> >> > @@ -7428,7 +7428,7 @@ iterative_hash_expr (const_tree t, hashv
> >> >        }
> >> >      case SSA_NAME:
> >> >        /* We can just compare by pointer.  */
> >> > -      return iterative_hash_host_wide_int (SSA_NAME_VERSION (t), val);
> >> > +      return iterative_hash_hashval_t ((uintptr_t)t>>3, val);
> >> >      case PLACEHOLDER_EXPR:
> >> >        /* The node itself doesn't matter.  */
> >> >        return val;
> >> > 
> >> > and from
> >> > 
> >> > Index: gcc/tree.c
> >> > ===================================================================
> >> > --- gcc/tree.c  (revision 207960)
> >> > +++ gcc/tree.c  (working copy)
> >> > @@ -7428,7 +7428,9 @@ iterative_hash_expr (const_tree t, hashv
> >> >        }
> >> >      case SSA_NAME:
> >> >        /* We can just compare by pointer.  */
> >> > -      return iterative_hash_host_wide_int (SSA_NAME_VERSION (t), val);
> >> > +      return iterative_hash_host_wide_int
> >> > +             (DECL_UID (cfun->decl),
> >> > +              iterative_hash_host_wide_int (SSA_NAME_VERSION (t), val));
> >> >      case PLACEHOLDER_EXPR:
> >> >        /* The node itself doesn't matter.  */
> >> >        return val;
> >> > 
> >> > better than hashing pointers but requring cfun != NULL in this
> >> > function isn't good either.
> >> > 
> >> > At this point I'm more comfortable with clearing the hashtable
> >> > than with changing iterative_hash_expr in any way.  It's also
> >> > along the way to get rid of the hash completely.
> >> > 
> >> > Oh, btw, the speedup is going from
> >> > 
> >> >  expand                  : 481.98 (94%) usr   1.15 (17%) sys 481.94 (93%) 
> >> > wall  293891 kB (15%) ggc
> >> > 
> >> > to
> >> > 
> >> >  expand                  :   2.66 ( 7%) usr   0.13 ( 2%) sys   2.64 ( 6%) 
> >> > wall  262544 kB (13%) ggc
> >> > 
> >> > at -O0 (less dramatic slowness for -On).
> >> > 
> >> > > The other thing would be to discard mem-attr sharing alltogether,
> >> > > but that doesn't seem appropriate at this stage (but it would
> >> > > also simplify quite some code).  With only one function in RTL
> >> > > at a time that shouldn't be too bad (see several suggestions
> >> > > along that line, even with statistics).
> >> 
> >> Last statistics: http://gcc.gnu.org/ml/gcc-patches/2011-08/msg01784.html
> >
> > With the patch below to get some statistics we see that one important
> > piece of sharing not covered by above measurements is RTX copying(?).
> >
> > On the testcase for this PR I get at -O1 and without the patch
> > to clear the hashtable after each function
> >
> > 142489 mem_attrs created (142439 for new, 50 for modification)
> > 1983225 mem_attrs shared (4044 for new, 820241 for modification, 1158940 
> > by rtx copying)
> > 0 mem_attrs dropped
> >
> > and with the patch to clear after each function
> >
> > 364411 mem_attrs created (144478 for new, 219933 for modification)
> > 1761303 mem_attrs shared (2005 for new, 600358 for modification, 1158940 
> > by rtx copying)
> > 0 mem_attrs dropped
> >
> > while for dwarf2out.c I see without the clearing
> >
> > 24399 mem_attrs created (6929 for new, 17470 for modification)
> > 102676 mem_attrs shared (10878 for new, 29265 for modification, 62533 by 
> > rtx copying)
> > 16 mem_attrs dropped
> >
> > which means that completely dropping the sharing would result
> > in creating of 6929 + 17807 + 62533(!) vs. 24399 mem-attrs.
> > That's still not a lot overhead given that mem-attrs take 40 bytes
> > (3MB vs. 950kB).  There is also always the possibility to
> > explicitely ref-count mem-attrs to handle sharing by rtx
> > copying (at least cse, fwprop, combine, ira and reload copy MEMs,
> > probably some for no good reason because MEMs are not shared),
> > thus make mem-attrs unshare-on-modify.
> 
> In a thread a few years ago you talked about the possibility of going
> further and folding the attributes into the MEM itself, so avoiding
> the indirection and separate allocation:
> 
>   http://thread.gmane.org/gmane.comp.gcc.patches/244464/focus=244538
> 
> (and earlier posts in the thread).  Would that still be OK?
> I might have a go if so.

It would work for me.  Micha just brought up the easiest incremental
change though, which is

which would be equivalent to folding it into the MEM but 1) not
saving the pointer, 2) retaining the important copy_rtx sharing.

Richard.

[2/2] Fix expansion slowness of PR60291

Commit Message

Patch