diff mbox

[2/2] Fix expansion slowness of PR60291

Message ID alpine.LSU.2.11.1402211302010.27942@zhemvz.fhfr.qr
State New
Headers show

Commit Message

Richard Biener Feb. 21, 2014, 12:23 p.m. UTC
On Fri, 21 Feb 2014, Richard Biener wrote:

> On Fri, 21 Feb 2014, Richard Biener wrote:
> 
> > On Fri, 21 Feb 2014, Richard Biener wrote:
> > 
> > > 
> > > This fixes the slowness of RTL expansion in PR60291 which is caused
> > > by excessive collisions in mem-attr sharing.  The issue here is
> > > that sharing attempts happens across functions and we have a _lot_
> > > of functions in this testcase referencing the same lexically
> > > equivalent memory, for example MEM[(StgWord *)_5 + -64B].  That
> > > means those get the same hash value.  But they don't compare
> > > equal because an SSA name _5 from function A is of course not equal
> > > to one from function B.
> > > 
> > > The following fixes that by not doing mem-attr sharing across functions
> > > by clearing the mem-attrs hashtable in rest_of_clean_state.
> > > 
> > > Another fix may be to do what the comment in iterative_hash_expr
> > > says for SSA names:
> > > 
> > >     case SSA_NAME:
> > >       /* We can just compare by pointer.  */
> > >       return iterative_hash_host_wide_int (SSA_NAME_VERSION (t), val);
> > > 
> > > (probably blame me for changing that to hashing the SSA version).
> > 
> > It was lxo.
> > 
> > > But I'm not sure that doesn't uncover issues with various hashtables and
> > > walking them, generating code dependent on the order.  It's IMHO just not
> > > expected that you compare function-local expressions from different
> > > functions.
> > 
> > Same speedup result from
> > 
> > Index: gcc/tree.c
> > ===================================================================
> > --- gcc/tree.c  (revision 207960)
> > +++ gcc/tree.c  (working copy)
> > @@ -7428,7 +7428,7 @@ iterative_hash_expr (const_tree t, hashv
> >        }
> >      case SSA_NAME:
> >        /* We can just compare by pointer.  */
> > -      return iterative_hash_host_wide_int (SSA_NAME_VERSION (t), val);
> > +      return iterative_hash_hashval_t ((uintptr_t)t>>3, val);
> >      case PLACEHOLDER_EXPR:
> >        /* The node itself doesn't matter.  */
> >        return val;
> > 
> > and from
> > 
> > Index: gcc/tree.c
> > ===================================================================
> > --- gcc/tree.c  (revision 207960)
> > +++ gcc/tree.c  (working copy)
> > @@ -7428,7 +7428,9 @@ iterative_hash_expr (const_tree t, hashv
> >        }
> >      case SSA_NAME:
> >        /* We can just compare by pointer.  */
> > -      return iterative_hash_host_wide_int (SSA_NAME_VERSION (t), val);
> > +      return iterative_hash_host_wide_int
> > +             (DECL_UID (cfun->decl),
> > +              iterative_hash_host_wide_int (SSA_NAME_VERSION (t), val));
> >      case PLACEHOLDER_EXPR:
> >        /* The node itself doesn't matter.  */
> >        return val;
> > 
> > better than hashing pointers but requring cfun != NULL in this
> > function isn't good either.
> > 
> > At this point I'm more comfortable with clearing the hashtable
> > than with changing iterative_hash_expr in any way.  It's also
> > along the way to get rid of the hash completely.
> > 
> > Oh, btw, the speedup is going from
> > 
> >  expand                  : 481.98 (94%) usr   1.15 (17%) sys 481.94 (93%) 
> > wall  293891 kB (15%) ggc
> > 
> > to
> > 
> >  expand                  :   2.66 ( 7%) usr   0.13 ( 2%) sys   2.64 ( 6%) 
> > wall  262544 kB (13%) ggc
> > 
> > at -O0 (less dramatic slowness for -On).
> > 
> > > The other thing would be to discard mem-attr sharing alltogether,
> > > but that doesn't seem appropriate at this stage (but it would
> > > also simplify quite some code).  With only one function in RTL
> > > at a time that shouldn't be too bad (see several suggestions
> > > along that line, even with statistics).
> 
> Last statistics: http://gcc.gnu.org/ml/gcc-patches/2011-08/msg01784.html

With the patch below to get some statistics we see that one important
piece of sharing not covered by above measurements is RTX copying(?).

On the testcase for this PR I get at -O1 and without the patch
to clear the hashtable after each function

142489 mem_attrs created (142439 for new, 50 for modification)
1983225 mem_attrs shared (4044 for new, 820241 for modification, 1158940 
by rtx copying)
0 mem_attrs dropped

and with the patch to clear after each function

364411 mem_attrs created (144478 for new, 219933 for modification)
1761303 mem_attrs shared (2005 for new, 600358 for modification, 1158940 
by rtx copying)
0 mem_attrs dropped

while for dwarf2out.c I see without the clearing

24399 mem_attrs created (6929 for new, 17470 for modification)
102676 mem_attrs shared (10878 for new, 29265 for modification, 62533 by 
rtx copying)
16 mem_attrs dropped

which means that completely dropping the sharing would result
in creating of 6929 + 17807 + 62533(!) vs. 24399 mem-attrs.
That's still not a lot overhead given that mem-attrs take 40 bytes
(3MB vs. 950kB).  There is also always the possibility to
explicitely ref-count mem-attrs to handle sharing by rtx
copying (at least cse, fwprop, combine, ira and reload copy MEMs,
probably some for no good reason because MEMs are not shared),
thus make mem-attrs unshare-on-modify.

Richard.

Comments

Richard Sandiford Feb. 21, 2014, 12:49 p.m. UTC | #1
Richard Biener <rguenther@suse.de> writes:
> On Fri, 21 Feb 2014, Richard Biener wrote:
>
>> On Fri, 21 Feb 2014, Richard Biener wrote:
>> 
>> > On Fri, 21 Feb 2014, Richard Biener wrote:
>> > 
>> > > 
>> > > This fixes the slowness of RTL expansion in PR60291 which is caused
>> > > by excessive collisions in mem-attr sharing.  The issue here is
>> > > that sharing attempts happens across functions and we have a _lot_
>> > > of functions in this testcase referencing the same lexically
>> > > equivalent memory, for example MEM[(StgWord *)_5 + -64B].  That
>> > > means those get the same hash value.  But they don't compare
>> > > equal because an SSA name _5 from function A is of course not equal
>> > > to one from function B.
>> > > 
>> > > The following fixes that by not doing mem-attr sharing across functions
>> > > by clearing the mem-attrs hashtable in rest_of_clean_state.
>> > > 
>> > > Another fix may be to do what the comment in iterative_hash_expr
>> > > says for SSA names:
>> > > 
>> > >     case SSA_NAME:
>> > >       /* We can just compare by pointer.  */
>> > >       return iterative_hash_host_wide_int (SSA_NAME_VERSION (t), val);
>> > > 
>> > > (probably blame me for changing that to hashing the SSA version).
>> > 
>> > It was lxo.
>> > 
>> > > But I'm not sure that doesn't uncover issues with various hashtables and
>> > > walking them, generating code dependent on the order.  It's IMHO just not
>> > > expected that you compare function-local expressions from different
>> > > functions.
>> > 
>> > Same speedup result from
>> > 
>> > Index: gcc/tree.c
>> > ===================================================================
>> > --- gcc/tree.c  (revision 207960)
>> > +++ gcc/tree.c  (working copy)
>> > @@ -7428,7 +7428,7 @@ iterative_hash_expr (const_tree t, hashv
>> >        }
>> >      case SSA_NAME:
>> >        /* We can just compare by pointer.  */
>> > -      return iterative_hash_host_wide_int (SSA_NAME_VERSION (t), val);
>> > +      return iterative_hash_hashval_t ((uintptr_t)t>>3, val);
>> >      case PLACEHOLDER_EXPR:
>> >        /* The node itself doesn't matter.  */
>> >        return val;
>> > 
>> > and from
>> > 
>> > Index: gcc/tree.c
>> > ===================================================================
>> > --- gcc/tree.c  (revision 207960)
>> > +++ gcc/tree.c  (working copy)
>> > @@ -7428,7 +7428,9 @@ iterative_hash_expr (const_tree t, hashv
>> >        }
>> >      case SSA_NAME:
>> >        /* We can just compare by pointer.  */
>> > -      return iterative_hash_host_wide_int (SSA_NAME_VERSION (t), val);
>> > +      return iterative_hash_host_wide_int
>> > +             (DECL_UID (cfun->decl),
>> > +              iterative_hash_host_wide_int (SSA_NAME_VERSION (t), val));
>> >      case PLACEHOLDER_EXPR:
>> >        /* The node itself doesn't matter.  */
>> >        return val;
>> > 
>> > better than hashing pointers but requring cfun != NULL in this
>> > function isn't good either.
>> > 
>> > At this point I'm more comfortable with clearing the hashtable
>> > than with changing iterative_hash_expr in any way.  It's also
>> > along the way to get rid of the hash completely.
>> > 
>> > Oh, btw, the speedup is going from
>> > 
>> >  expand                  : 481.98 (94%) usr   1.15 (17%) sys 481.94 (93%) 
>> > wall  293891 kB (15%) ggc
>> > 
>> > to
>> > 
>> >  expand                  :   2.66 ( 7%) usr   0.13 ( 2%) sys   2.64 ( 6%) 
>> > wall  262544 kB (13%) ggc
>> > 
>> > at -O0 (less dramatic slowness for -On).
>> > 
>> > > The other thing would be to discard mem-attr sharing alltogether,
>> > > but that doesn't seem appropriate at this stage (but it would
>> > > also simplify quite some code).  With only one function in RTL
>> > > at a time that shouldn't be too bad (see several suggestions
>> > > along that line, even with statistics).
>> 
>> Last statistics: http://gcc.gnu.org/ml/gcc-patches/2011-08/msg01784.html
>
> With the patch below to get some statistics we see that one important
> piece of sharing not covered by above measurements is RTX copying(?).
>
> On the testcase for this PR I get at -O1 and without the patch
> to clear the hashtable after each function
>
> 142489 mem_attrs created (142439 for new, 50 for modification)
> 1983225 mem_attrs shared (4044 for new, 820241 for modification, 1158940 
> by rtx copying)
> 0 mem_attrs dropped
>
> and with the patch to clear after each function
>
> 364411 mem_attrs created (144478 for new, 219933 for modification)
> 1761303 mem_attrs shared (2005 for new, 600358 for modification, 1158940 
> by rtx copying)
> 0 mem_attrs dropped
>
> while for dwarf2out.c I see without the clearing
>
> 24399 mem_attrs created (6929 for new, 17470 for modification)
> 102676 mem_attrs shared (10878 for new, 29265 for modification, 62533 by 
> rtx copying)
> 16 mem_attrs dropped
>
> which means that completely dropping the sharing would result
> in creating of 6929 + 17807 + 62533(!) vs. 24399 mem-attrs.
> That's still not a lot overhead given that mem-attrs take 40 bytes
> (3MB vs. 950kB).  There is also always the possibility to
> explicitely ref-count mem-attrs to handle sharing by rtx
> copying (at least cse, fwprop, combine, ira and reload copy MEMs,
> probably some for no good reason because MEMs are not shared),
> thus make mem-attrs unshare-on-modify.

In a thread a few years ago you talked about the possibility of going
further and folding the attributes into the MEM itself, so avoiding
the indirection and separate allocation:

  http://thread.gmane.org/gmane.comp.gcc.patches/244464/focus=244538

(and earlier posts in the thread).  Would that still be OK?
I might have a go if so.

Thanks,
Richard
shafiq132 Feb. 21, 2014, 1:02 p.m. UTC | #2
Sir,

   I  have a cross compiler and I know how to cross compile a file . But I
am doing all just for glib compilation that I do not know how to do. Anyone
to guide me. or generally just inform me how can I compile a complete
library using gcc.



--
View this message in context: http://gcc.1065356.n5.nabble.com/PATCH-2-2-Fix-expansion-slowness-of-PR60291-tp1013329p1013362.html
Sent from the gcc - patches mailing list archive at Nabble.com.
diff mbox

Patch

Index: gcc/rtl.c
===================================================================
--- gcc/rtl.c	(revision 207938)
+++ gcc/rtl.c	(working copy)
@@ -326,6 +326,8 @@  copy_rtx (rtx orig)
   return copy;
 }
 
+unsigned long mem_attrs_shared_copy;
+
 /* Create a new copy of an rtx.  Only copy just one level.  */
 
 rtx
@@ -333,6 +335,8 @@  shallow_copy_rtx_stat (const_rtx orig ME
 {
   const unsigned int size = rtx_size (orig);
   rtx const copy = ggc_alloc_rtx_def_stat (size PASS_MEM_STAT);
+  if (MEM_P (orig) && MEM_ATTRS (orig))
+    mem_attrs_shared_copy++;
   return (rtx) memcpy (copy, orig, size);
 }
 
Index: gcc/emit-rtl.c
===================================================================
--- gcc/emit-rtl.c	(revision 207938)
+++ gcc/emit-rtl.c	(working copy)
@@ -290,6 +290,12 @@  mem_attrs_htab_eq (const void *x, const
   return mem_attrs_eq_p ((const mem_attrs *) x, (const mem_attrs *) y);
 }
 
+unsigned long mem_attrs_dropped;
+unsigned long mem_attrs_new;
+unsigned long mem_attrs_new_modified;
+unsigned long mem_attrs_shared;
+unsigned long mem_attrs_shared_modified;
+
 /* Set MEM's memory attributes so that they are the same as ATTRS.  */
 
 static void
@@ -300,6 +306,8 @@  set_mem_attrs (rtx mem, mem_attrs *attrs
   /* If everything is the default, we can just clear the attributes.  */
   if (mem_attrs_eq_p (attrs, mode_mem_attrs[(int) GET_MODE (mem)]))
     {
+      if (MEM_ATTRS (mem))
+	mem_attrs_dropped++;
       MEM_ATTRS (mem) = 0;
       return;
     }
@@ -309,6 +317,17 @@  set_mem_attrs (rtx mem, mem_attrs *attrs
     {
       *slot = ggc_alloc_mem_attrs ();
       memcpy (*slot, attrs, sizeof (mem_attrs));
+      if (MEM_ATTRS (mem))
+	mem_attrs_new_modified++;
+      else
+	mem_attrs_new++;
+    }
+  else
+    {
+      if (MEM_ATTRS (mem))
+	mem_attrs_shared_modified++;
+      else
+	mem_attrs_shared++;
     }
 
   MEM_ATTRS (mem) = (mem_attrs *) *slot;
Index: gcc/toplev.c
===================================================================
--- gcc/toplev.c	(revision 207938)
+++ gcc/toplev.c	(working copy)
@@ -1989,6 +2023,26 @@  toplev_main (int argc, char **argv)
   if (!exit_after_options)
     do_compile ();
 
+    {
+      extern unsigned long mem_attrs_dropped;
+      extern unsigned long mem_attrs_new;
+      extern unsigned long mem_attrs_new_modified;
+      extern unsigned long mem_attrs_shared;
+      extern unsigned long mem_attrs_shared_modified;
+      extern unsigned long mem_attrs_shared_copy;
+      fprintf (stderr, "%lu mem_attrs created (%lu for new, %lu for "
+	       "modification)\n",
+	       mem_attrs_new + mem_attrs_new_modified,
+	       mem_attrs_new, mem_attrs_new_modified);
+      fprintf (stderr, "%lu mem_attrs shared (%lu for new, %lu for "
+	       "modification, %lu by rtx copying)\n",
+	       mem_attrs_shared + mem_attrs_shared_modified
+	       + mem_attrs_shared_copy,
+	       mem_attrs_shared, mem_attrs_shared_modified,
+	       mem_attrs_shared_copy);
+      fprintf (stderr, "%lu mem_attrs dropped\n", mem_attrs_dropped);
+    }
+
   if (warningcount || errorcount || werrorcount)
     print_ignored_options ();