RFC: Faster for_each_rtx-like iterators

Message ID 87k39x48gu.fsf@talisman.default

Commit Message

Richard Sandiford May 7, 2014, 8:52 p.m. UTC
I noticed for_each_rtx showing up in profiles and thought I'd have a go
at using worklist-based iterators instead.  So far I have three:

  FOR_EACH_SUBRTX: iterates over const_rtx subrtxes of a const_rtx
  FOR_EACH_SUBRTX_VAR: iterates over rtx subrtxes of an rtx
  FOR_EACH_SUBRTX_PTR: iterates over subrtx pointers of an rtx *

with FOR_EACH_SUBRTX_PTR being the direct for_each_rtx replacement.

I made FOR_EACH_SUBRTX the "default" (unsuffixed) version because
most walks really don't modify the structure.  I think we should
encourage const_rtxes to be used wherever possible.  E.g. it might
make it easier to have non-GC storage for temporary rtxes in future.

I've locally replaced all for_each_rtx calls in the generic code with
these iterators and they make things reproducibly faster.  The speed-up
on full --enable-checking=release ./cc1 and ./cc1plus times is only about 1%,
but maybe that's enough to justify the churn.

Implementation-wise, the main observation is that most subrtxes are part
of a single contiguous sequence of "e" fields.  E.g. when compiling an
oldish combine.ii on x86_64-linux-gnu with -O2, we iterate over the
subrtxes of 7,636,542 rtxes.  Of those:

(A) 4,459,135 (58.4%) are leaf rtxes with no "e" or "E" fields,
(B) 3,133,875 (41.0%) are rtxes with a single block of "e" fields and
                      no "E" fields, and
(C)    43,532 (00.6%) are more complicated.

(A) is really a special case of (B) in which the block has zero length.
Those are the only two cases that really need to be handled inline.
The implementation does this by having a mapping from an rtx code to the
bounds of its "e" sequence, in the form of a start index and count.
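
To make the mapping concrete, here is an illustrative sketch of a few
entries, assuming the usual rtx formats ("w" for CONST_INT, "ee" for
PLUS, "eee" for IF_THEN_ELSE, "E" for PARALLEL).  The real table is
built at run time by init_rtlanal in the patch below:

  rtx_all_subrtx_bounds[CONST_INT].count = 0;        /* leaf: case (A) */
  rtx_all_subrtx_bounds[PLUS].start = 0;             /* "ee": case (B) */
  rtx_all_subrtx_bounds[PLUS].count = 2;
  rtx_all_subrtx_bounds[IF_THEN_ELSE].start = 0;     /* "eee": case (B) */
  rtx_all_subrtx_bounds[IF_THEN_ELSE].count = 3;
  rtx_all_subrtx_bounds[PARALLEL].count = UCHAR_MAX; /* "E": case (C) */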

Out of (C), the vast majority (43,509) are PARALLELs.  However, as you'd
probably expect, bloating the inline code with that case made things
slower rather than faster.

The vast majority (in fact all in the combine.ii run above) of iterations
can be done with a 16-element stack worklist.  We obviously still need a
heap fallback for the pathological cases though.

I spent a bit of time trying different iterator implementations and
seeing which produced the best code.  Specific results from that were:

- The storage used for the worklist is separate from the iterator,
  in order to avoid capturing iterator fields.

- Although the natural type of the storage would be auto_vec <..., 16>,
  that produced some overhead compared with a separate stack array and heap
  vector pointer.  With the heap vector pointer, the only overhead is an
  assignment in the constructor and an "if (x) release (x)"-style sequence
  in the destructor.  I think the extra complication over auto_vec is worth
  it because in this case the heap version is so very rarely needed.

- Several existing for_each_rtx callbacks have something like:

    if (GET_CODE (x) == CONST)
      return -1;

  or:

    if (CONSTANT_P (x))
      return -1;

  to avoid walking subrtxes of constants.  That can be done without
  extra code checks and branches by having a separate code->bound
  mapping in which all constants are treated as leaf rtxes.  This usage
  should be common enough to outweigh the cache penalty of two arrays.

  The choice between iterating over constants or not is given in the
  final parameter of the FOR_EACH_* iterator.

- The maximum number of fields in (B)-type rtxes is 3.  We get better
  code by making that explicit rather than having a general loop.

- (C) codes map to an "e" count of UCHAR_MAX, so we can use a single
  check to test for that and for cases where the stack worklist is
  too small.
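
A condensed sketch of how those last two points combine, using the
names from the patch below.  The "+ 1" is there because the first
subrtx becomes M_CURRENT rather than occupying a worklist slot:

  static const size_t LOCAL_ELEMS = 16;

  /* True if a single "e" block of COUNT subrtxes can be handled inline
     when M_END entries are already queued.  Simple codes have
     COUNT <= 3, so this only fails for them when the stack worklist is
     nearly full; complex codes have COUNT == UCHAR_MAX and always fail,
     taking the out-of-line path.  */
  static bool
  inline_case_p (size_t m_end, size_t count)
  {
    return m_end + count <= LOCAL_ELEMS + 1;
  }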

To give an example:

/* Callback for for_each_rtx, that returns 1 upon encountering a VALUE
   whose UID is greater than the int uid that D points to.  */

static int
refs_newer_value_cb (rtx *x, void *d)
{
  if (GET_CODE (*x) == VALUE && CSELIB_VAL_PTR (*x)->uid > *(int *)d)
    return 1;

  return 0;
}

/* Return TRUE if EXPR refers to a VALUE whose uid is greater than
   that of V.  */

static bool
refs_newer_value_p (rtx expr, rtx v)
{
  int minuid = CSELIB_VAL_PTR (v)->uid;

  return for_each_rtx (&expr, refs_newer_value_cb, &minuid);
}

becomes:

/* Return TRUE if EXPR refers to a VALUE whose uid is greater than
   that of V.  */

static bool
refs_newer_value_p (const_rtx expr, rtx v)
{
  int minuid = CSELIB_VAL_PTR (v)->uid;
  subrtx_iterator::array_type array;
  FOR_EACH_SUBRTX (iter, array, expr, NONCONST)
    if (GET_CODE (*iter) == VALUE && CSELIB_VAL_PTR (*iter)->uid > minuid)
      return true;
  return false;
}

The iterator also allows subrtxes of a specific rtx to be skipped;
this is the equivalent of returning -1 from a for_each_rtx callback.
It also allows the current rtx to be replaced in the worklist by
another.  E.g.:

static void
mark_constants_in_pattern (rtx insn)
{
  subrtx_iterator::array_type array;
  FOR_EACH_SUBRTX (iter, array, PATTERN (insn), ALL)
    {
      const_rtx x = *iter;
      if (GET_CODE (x) == SYMBOL_REF)
	{
	  if (CONSTANT_POOL_ADDRESS_P (x))
	    {
	      struct constant_descriptor_rtx *desc = SYMBOL_REF_CONSTANT (x);
	      if (desc->mark == 0)
		{
		  desc->mark = 1;
		  iter.substitute (desc->constant);
		}
	    }
	  else if (TREE_CONSTANT_POOL_ADDRESS_P (x))
	    {
	      tree decl = SYMBOL_REF_DECL (x);
	      if (!TREE_ASM_WRITTEN (DECL_INITIAL (decl)))
		{
		  n_deferred_constants--;
		  output_constant_def_contents (CONST_CAST_RTX (x));
		}
	    }
	}
    }
}

where the iter.substitute avoids a recursive walk.  (The CONST_CAST_RTX
is "temporary".)

Does this look like it's worth pursuing?  If so, is it OK to start
submitting now or should it wait until 4.9.1 is out?  If it is OK,
I plan to convert the ports too and get rid of for_each_rtx.

Thanks,
Richard

Comments

Mike Stump May 7, 2014, 9:31 p.m. UTC | #1
On May 7, 2014, at 1:52 PM, Richard Sandiford <rdsandiford@googlemail.com> wrote:
> 
> I've locally replaced all for_each_rtx calls in the generic code with
> these iterators and they make things reproducibly faster.  The speed-up
> on full --enable-checking=release ./cc1 and ./cc1plus times is only about 1%,
> but maybe that's enough to justify the churn.

100 1% fixes would make the compiler 100% faster.  :-)  I think 1% is
actually a really good improvement.  If you have times for -O0, it
would be interesting to see what they are.
Trevor Saunders May 8, 2014, 1:21 a.m. UTC | #2
On Wed, May 07, 2014 at 09:52:49PM +0100, Richard Sandiford wrote:
> I noticed for_each_rtx showing up in profiles and thought I'd have a go
> at using worklist-based iterators instead.  So far I have three:
> 
>   FOR_EACH_SUBRTX: iterates over const_rtx subrtxes of a const_rtx
>   FOR_EACH_SUBRTX_VAR: iterates over rtx subrtxes of an rtx
>   FOR_EACH_SUBRTX_PTR: iterates over subrtx pointers of an rtx *
> 
> with FOR_EACH_SUBRTX_PTR being the direct for_each_rtx replacement.
> 
> I made FOR_EACH_SUBRTX the "default" (unsuffixed) version because
> most walks really don't modify the structure.  I think we should
> encourage const_rtxes to be used wherever possible.  E.g. it might
> make it easier to have non-GC storage for temporary rtxes in future.
> 
> I've locally replaced all for_each_rtx calls in the generic code with
> these iterators and they make things reproducibly faster.  The speed-up
> on full --enable-checking=release ./cc1 and ./cc1plus times is only about 1%,
> but maybe that's enough to justify the churn.

seems pretty nice, and it seems like it'll make code a little more
readable too :)

> Implementation-wise, the main observation is that most subrtxes are part
> of a single contiguous sequence of "e" fields.  E.g. when compiling an
> oldish combine.ii on x86_64-linux-gnu with -O2, we iterate over the
> subrtxes of 7,636,542 rtxes.  Of those:
> 
> (A) 4,459,135 (58.4%) are leaf rtxes with no "e" or "E" fields,
> (B) 3,133,875 (41.0%) are rtxes with a single block of "e" fields and
>                       no "E" fields, and
> (C)    43,532 (00.6%) are more complicated.
> 
> (A) is really a special case of (B) in which the block has zero length.
> Those are the only two cases that really need to be handled inline.
> The implementation does this by having a mapping from an rtx code to the
> bounds of its "e" sequence, in the form of a start index and count.
> 
> Out of (C), the vast majority (43,509) are PARALLELs.  However, as you'd
> probably expect, bloating the inline code with that case made things
> slower rather than faster.
> 
> The vast majority (in fact all in the combine.ii run above) of iterations
> can be done with a 16-element stack worklist.  We obviously still need a
> heap fallback for the pathological cases though.
> 
> I spent a bit of time trying different iterator implementations and
> seeing which produced the best code.  Specific results from that were:
> 
> - The storage used for the worklist is separate from the iterator,
>   in order to avoid capturing iterator fields.
> 
> - Although the natural type of the storage would be auto_vec <..., 16>,
>   that produced some overhead compared with a separate stack array and heap
>   vector pointer.  With the heap vector pointer, the only overhead is an
>   assignment in the constructor and an "if (x) release (x)"-style sequence
>   in the destructor.  I think the extra complication over auto_vec is worth
>   it because in this case the heap version is so very rarely needed.

hm, where does the overhead come from exactly? it seems like if it's
 faster to use vec<T, va_heap, vl_embed> *foo, we should fix something
 about vectors since this isn't the only place it could matter.  does it
 matter if you use vec<T, va_heap, vl_embed> * or vec<T>?  the second
 is basically just a wrapper around the former, so I'd expect it to have
 no effect.  I'm not saying you're doing the wrong thing here, but if we
 can make generic vectors faster we probably should ;) or is the issue
 the __builtin_expect()s you can add?

> - Several existing for_each_rtx callbacks have something like:
> 
>     if (GET_CODE (x) == CONST)
>       return -1;
> 
>   or:
> 
>     if (CONSTANT_P (x))
>       return -1;
> 
>   to avoid walking subrtxes of constants.  That can be done without
>   extra code checks and branches by having a separate code->bound
>   mapping in which all constants are treated as leaf rtxes.  This usage
>   should be common enough to outweigh the cache penalty of two arrays.
> 
>   The choice between iterating over constants or not is given in the
>   final parameter of the FOR_EACH_* iterator.

less repetition \O/

> - The maximum number of fields in (B)-type rtxes is 3.  We get better
>   code by making that explicit rather than having a general loop.
> 
> - (C) codes map to an "e" count of UCHAR_MAX, so we can use a single
>   check to test for that and for cases where the stack worklist is
>   too small.

 can we use uint8_t?

> To give an example:
> 
> /* Callback for for_each_rtx, that returns 1 upon encountering a VALUE
>    whose UID is greater than the int uid that D points to.  */
> 
> static int
> refs_newer_value_cb (rtx *x, void *d)
> {
>   if (GET_CODE (*x) == VALUE && CSELIB_VAL_PTR (*x)->uid > *(int *)d)
>     return 1;
> 
>   return 0;
> }
> 
> /* Return TRUE if EXPR refers to a VALUE whose uid is greater than
>    that of V.  */
> 
> static bool
> refs_newer_value_p (rtx expr, rtx v)
> {
>   int minuid = CSELIB_VAL_PTR (v)->uid;
> 
>   return for_each_rtx (&expr, refs_newer_value_cb, &minuid);
> }
> 
> becomes:
> 
> /* Return TRUE if EXPR refers to a VALUE whose uid is greater than
>    that of V.  */
> 
> static bool
> refs_newer_value_p (const_rtx expr, rtx v)
> {
>   int minuid = CSELIB_VAL_PTR (v)->uid;
>   subrtx_iterator::array_type array;

some reason to not hide it in the macro?

>   FOR_EACH_SUBRTX (iter, array, expr, NONCONST)
>     if (GET_CODE (*iter) == VALUE && CSELIB_VAL_PTR (*iter)->uid > minuid)
>       return true;
>   return false;
> }
> 
> The iterator also allows subrtxes of a specific rtx to be skipped;
> this is the equivalent of returning -1 from a for_each_rtx callback.
> It also allows the current rtx to be replaced in the worklist by
> another.  E.g.:
> 
> static void
> mark_constants_in_pattern (rtx insn)
> {
>   subrtx_iterator::array_type array;
>   FOR_EACH_SUBRTX (iter, array, PATTERN (insn), ALL)
>     {
>       const_rtx x = *iter;
>       if (GET_CODE (x) == SYMBOL_REF)
> 	{
> 	  if (CONSTANT_POOL_ADDRESS_P (x))
> 	    {
> 	      struct constant_descriptor_rtx *desc = SYMBOL_REF_CONSTANT (x);
> 	      if (desc->mark == 0)
> 		{
> 		  desc->mark = 1;
> 		  iter.substitute (desc->constant);
> 		}
> 	    }
> 	  else if (TREE_CONSTANT_POOL_ADDRESS_P (x))
> 	    {
> 	      tree decl = SYMBOL_REF_DECL (x);
> 	      if (!TREE_ASM_WRITTEN (DECL_INITIAL (decl)))
> 		{
> 		  n_deferred_constants--;
> 		  output_constant_def_contents (CONST_CAST_RTX (x));
> 		}
> 	    }
> 	}
>     }
> }
> 
> where the iter.substitute avoids a recursive walk.  (The CONST_CAST_RTX
> is "temporary".)
> 
> Does this look like it's worth pursuing?  If so, is it OK to start
> submitting now or should it wait until 4.9.1 is out?  If it is OK,
> I plan to convert the ports too and get rid of for_each_rtx.
> 
> Thanks,
> Richard
> 
> 
> diff --git a/gcc/rtl-iter.h b/gcc/rtl-iter.h
> new file mode 100644
> index 0000000..1bc171d
> --- /dev/null
> +++ b/gcc/rtl-iter.h
> @@ -0,0 +1,293 @@
> +/* RTL iterators
> +   Copyright (C) 2014 Free Software Foundation, Inc.
> +
> +This file is part of GCC.
> +
> +GCC is free software; you can redistribute it and/or modify it under
> +the terms of the GNU General Public License as published by the Free
> +Software Foundation; either version 3, or (at your option) any later
> +version.
> +
> +GCC is distributed in the hope that it will be useful, but WITHOUT ANY
> +WARRANTY; without even the implied warranty of MERCHANTABILITY or
> +FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
> +for more details.
> +
> +You should have received a copy of the GNU General Public License
> +along with GCC; see the file COPYING3.  If not see
> +<http://www.gnu.org/licenses/>.  */
> +
> +/* This structure describes the subrtxes of an rtx as follows:
> +
> +   - if the rtx has no subrtxes, START and COUNT are both 0.
> +
> +   - if all the subrtxes of an rtx are stored in a contiguous block
> +     of XEXPs ("e"s), START is the index of the first XEXP and COUNT
> +     is the number of them.
> +
> +   - otherwise START is arbitrary and COUNT is UCHAR_MAX.  */
> +struct rtx_subrtx_bound_info {
> +  unsigned char start;
> +  unsigned char count;
> +};
> +extern rtx_subrtx_bound_info rtx_all_subrtx_bounds[];
> +extern rtx_subrtx_bound_info rtx_nonconst_subrtx_bounds[];
> +
> +/* Return true if CODE has no subrtxes.  */
> +
> +static inline bool
> +leaf_code_p (enum rtx_code code)
> +{
> +  return rtx_all_subrtx_bounds[code].count == 0;
> +}
> +
> +/* Used to iterate over subrtxes of an rtx.  T abstracts the type of
> +   access.  */
> +template <typename T>
> +class generic_subrtx_iterator
> +{
> +  static const size_t LOCAL_ELEMS = 16;
> +  typedef typename T::value_type value_type;
> +  typedef typename T::rtx_type rtx_type;
> +  typedef typename T::rtunion_type rtunion_type;
> +
> +public:
> +  struct array_type
> +  {
> +    array_type ();
> +    ~array_type ();
> +    value_type stack[LOCAL_ELEMS];
> +    vec <value_type, va_heap, vl_embed> *heap;
> +  };
> +  generic_subrtx_iterator (array_type &, value_type,
> +			   const rtx_subrtx_bound_info *);
> +  ~generic_subrtx_iterator ();
> +
> +  value_type operator * () const;
> +  bool at_end () const;
> +  void next ();
> +  void skip_subrtxes ();
> +  void substitute (value_type);
> +
> +private:
> +  /* The bounds to use for iterating over subrtxes.  */
> +  const rtx_subrtx_bound_info *m_bounds;
> +
> +  /* The storage used for the worklist.  */
> +  array_type &m_array;
> +
> +  /* The current rtx.  */
> +  value_type m_current;
> +
> +  /* The base of the current worklist.  */
> +  value_type *m_base;
> +
> +  /* The number of subrtxes in M_BASE.  */
> +  size_t m_end;
> +
> +  /* The following booleans shouldn't end up in registers or memory
> +     but just direct control flow.  */
> +
> +  /* True if the iteration is over.  */
> +  bool m_done;
> +
> +  /* True if we should skip the subrtxes of M_CURRENT.  */
> +  bool m_skip;
> +
> +  /* True if M_CURRENT has been replaced with a different rtx.  */
> +  bool m_substitute;
> +
> +  static void free_array (array_type &);
> +  static size_t add_subrtxes_to_queue (array_type &, value_type *, size_t,
> +				       rtx_type);
> +  static value_type *add_single_to_queue (array_type &, value_type *, size_t,
> +					  value_type);
> +};
> +
> +template <typename T>
> +inline generic_subrtx_iterator <T>::array_type::array_type () : heap (0) {}
> +
> +template <typename T>
> +inline generic_subrtx_iterator <T>::array_type::~array_type ()
> +{
> +  if (__builtin_expect (heap != 0, false))
> +    free_array (*this);
> +}
> +
> +/* Iterate over X and its subrtxes, in arbitrary order.  Use ARRAY to
> +   store the worklist.  We use an external array in order to avoid
> +   capturing the fields of this structure when taking the address of
> +   the array.  Use BOUNDS to find the bounds of simple "e"-string codes.  */
> +
> +template <typename T>
> +inline generic_subrtx_iterator <T>::
> +generic_subrtx_iterator (array_type &array, value_type x,
> +			 const rtx_subrtx_bound_info *bounds)
> +  : m_bounds (bounds),
> +    m_array (array),
> +    m_current (x),
> +    m_base (m_array.stack),
> +    m_end (0),
> +    m_done (false),
> +    m_skip (false),
> +    m_substitute (false)
> +{
> +}
> +
> +template <typename T>
> +inline generic_subrtx_iterator <T>::~generic_subrtx_iterator ()
> +{
> +}

 What's wrong with just letting the compiler generate that for you?

> +/* Return the current subrtx.  */
> +
> +template <typename T>
> +inline typename T::value_type
> +generic_subrtx_iterator <T>::operator * () const
> +{
> +  return m_current;
> +}
> +
> +/* Return true if the iteration has finished.  */
> +
> +template <typename T>
> +inline bool
> +generic_subrtx_iterator <T>::at_end () const
> +{
> +  return m_done;
> +}
> +
> +/* Move on to the next subrtx.  */
> +
> +template <typename T>
> +inline void
> +generic_subrtx_iterator <T>::next ()
> +{
> +  if (m_substitute)
> +    {
> +      m_substitute = false;
> +      m_skip = false;
> +      return;
> +    }
> +  if (!m_skip)
> +    {
> +      /* Add the subrtxes of M_CURRENT.  */
> +      rtx_type x = T::get_rtx (m_current);
> +      if (__builtin_expect (x != 0, true))
> +	{
> +	  enum rtx_code code = GET_CODE (x);
> +	  ssize_t count = m_bounds[code].count;
> +	  if (count > 0)
> +	    {
> +	      /* Handle the simple case of a single "e" block that is known
> +		 to fit into the current array.  */
> +	      if (__builtin_expect (m_end + count <= LOCAL_ELEMS + 1, true))
> +		{
> +		  /* Set M_CURRENT to the first subrtx and queue the rest.  */
> +		  ssize_t start = m_bounds[code].start;
> +		  rtunion_type *src = &x->u.fld[start];
> +		  if (__builtin_expect (count > 2, false))
> +		    m_base[m_end++] = T::get_value (src[2].rt_rtx);
> +		  if (count > 1)
> +		    m_base[m_end++] = T::get_value (src[1].rt_rtx);
> +		  m_current = T::get_value (src[0].rt_rtx);
> +		  return;
> +		}
> +	      /* Handle cases which aren't simple "e" sequences or where
> +		 the sequence might overrun M_BASE.  */
> +	      count = add_subrtxes_to_queue (m_array, m_base, m_end, x);
> +	      if (count > 0)
> +		{
> +		  m_end += count;
> +		  if (m_end > LOCAL_ELEMS)
> +		    m_base = m_array.heap->address ();
> +		  m_current = m_base[--m_end];
> +		  return;
> +		}
> +	    }
> +	}
> +    }
> +  else
> +    m_skip = false;
> +  if (m_end == 0)
> +    m_done = true;
> +  else
> +    m_current = m_base[--m_end];
> +}
> +
> +/* Skip the subrtxes of the current rtx.  */
> +
> +template <typename T>
> +inline void
> +generic_subrtx_iterator <T>::skip_subrtxes ()
> +{
> +  m_skip = true;
> +}
> +
> +/* Ignore the subrtxes of the current rtx and look at X instead.  */
> +
> +template <typename T>
> +inline void
> +generic_subrtx_iterator <T>::substitute (value_type x)
> +{
> +  m_substitute = true;
> +  m_current = x;
> +}
> +
> +/* Iterators for const_rtx.  */
> +struct const_rtx_accessor
> +{
> +  typedef const_rtx value_type;
> +  typedef const_rtx rtx_type;
> +  typedef const rtunion rtunion_type;
> +  static rtx_type get_rtx (value_type x) { return x; }
> +  static value_type get_value (rtx_type x) { return x; }
> +};
> +typedef generic_subrtx_iterator <const_rtx_accessor> subrtx_iterator;
> +
> +/* Iterators for non-constant rtx.  */
> +struct rtx_var_accessor
> +{
> +  typedef rtx value_type;
> +  typedef rtx rtx_type;
> +  typedef rtunion rtunion_type;
> +  static rtx_type get_rtx (value_type x) { return x; }
> +  static value_type get_value (rtx_type x) { return x; }
> +};
> +typedef generic_subrtx_iterator <rtx_var_accessor> subrtx_var_iterator;
> +
> +/* Iterators for rtx *.  */
> +struct rtx_ptr_accessor
> +{
> +  typedef rtx *value_type;
> +  typedef rtx rtx_type;
> +  typedef rtunion rtunion_type;
> +  static rtx_type get_rtx (value_type ptr) { return *ptr; }
> +  static value_type get_value (rtx_type &x) { return &x; }
> +};
> +typedef generic_subrtx_iterator <rtx_ptr_accessor> subrtx_ptr_iterator;
> +
> +#define ALL_BOUNDS rtx_all_subrtx_bounds
> +#define NONCONST_BOUNDS rtx_nonconst_subrtx_bounds
> +
> +/* Use ITER to iterate over const_rtx X and its recursive subrtxes,
> +   using subrtx_iterator::array ARRAY as the storage for the worklist.
> +   ARRAY can be reused for multiple consecutive iterations but shouldn't
> +   be shared by two concurrent iterations.  TYPE is ALL if all subrtxes
> +   are of interest or NONCONST if it is safe to ignore subrtxes of
> +   constants.  */
> +#define FOR_EACH_SUBRTX(ITER, ARRAY, X, TYPE) \
> +  for (subrtx_iterator ITER (ARRAY, X, TYPE##_BOUNDS); !ITER.at_end (); \
> +       ITER.next ())
> +
> +/* Like FOR_EACH_SUBRTX, but iterate over subrtxes of an rtx X.  */
> +#define FOR_EACH_SUBRTX_VAR(ITER, ARRAY, X, TYPE) \
> +  for (subrtx_var_iterator ITER (ARRAY, X, TYPE##_BOUNDS); !ITER.at_end (); \
> +       ITER.next ())
> +
> +/* Like FOR_EACH_SUBRTX, but iterate over subrtx pointers of rtx pointer X.
> +   For example, if X is &PATTERN (insn) and the pattern is a SET, iterate
> +   over &PATTERN (insn), &SET_DEST (PATTERN (insn)), etc.  */
> +#define FOR_EACH_SUBRTX_PTR(ITER, ARRAY, X, TYPE) \
> +  for (subrtx_ptr_iterator ITER (ARRAY, X, TYPE##_BOUNDS); !ITER.at_end (); \
> +       ITER.next ())
> diff --git a/gcc/rtlanal.c b/gcc/rtlanal.c
> index f3471b1..df52788 100644
> --- a/gcc/rtlanal.c
> +++ b/gcc/rtlanal.c
> @@ -37,6 +37,7 @@ along with GCC; see the file COPYING3.  If not see
>  #include "tree.h"
>  #include "emit-rtl.h"  /* FIXME: Can go away once crtl is moved to rtl.h.  */
>  #include "addresses.h"
> +#include "rtl-iter.h"
>  
>  /* Forward declarations */
>  static void set_of_1 (rtx, const_rtx, void *);
> @@ -62,6 +63,9 @@ static unsigned int num_sign_bit_copies1 (const_rtx, enum machine_mode, const_rt
>     -1 if a code has no such operand.  */
>  static int non_rtx_starting_operands[NUM_RTX_CODE];
>  
> +rtx_subrtx_bound_info rtx_all_subrtx_bounds[NUM_RTX_CODE];
> +rtx_subrtx_bound_info rtx_nonconst_subrtx_bounds[NUM_RTX_CODE];
> +
>  /* Truncation narrows the mode from SOURCE mode to DESTINATION mode.
>     If TARGET_MODE_REP_EXTENDED (DESTINATION, DESTINATION_REP) is
>     SIGN_EXTEND then while narrowing we also have to enforce the
> @@ -78,6 +82,93 @@ static int non_rtx_starting_operands[NUM_RTX_CODE];
>  static unsigned int
>  num_sign_bit_copies_in_rep[MAX_MODE_INT + 1][MAX_MODE_INT + 1];
>  
> +/* Store X into index I of ARRAY.  ARRAY is known to have at least I
> +   elements.  Return the new base of ARRAY.  */
> +
> +template <typename T>
> +typename T::value_type *
> +generic_subrtx_iterator <T>::add_single_to_queue (array_type &array,
> +						  value_type *base,
> +						  size_t i, value_type x)
> +{
> +  if (base == array.stack)
> +    {
> +      if (i < LOCAL_ELEMS)
> +	{
> +	  base[i] = x;
> +	  return base;
> +	}

it seems like this first little bit might be worth inlining, but I guess
you tested and it wasn't.

> +      gcc_checking_assert (i == LOCAL_ELEMS);
> +      vec_safe_grow (array.heap, i + 1);
> +      base = array.heap->address ();
> +      memcpy (base, array.stack, sizeof (array.stack));

that threw me for a sec until I realized it was just manually doing what
auto_vec does in this case.

> +      base[LOCAL_ELEMS] = x;
> +      return base;
> +    }
> +  unsigned int length = array.heap->length ();
> +  if (length > i)
> +    {
> +      gcc_checking_assert (base == array.heap->address ());
> +      base[i] = x;
> +      return base;
> +    }
> +  else
> +    {
> +      gcc_checking_assert (i == length);
> +      vec_safe_push (array.heap, x);
> +      return array.heap->address ();

this chunk seems redundant with what vec_safe_push does; do you actually
need to pass in |i| at all?

> +    }
> +}
> +
> +/* Add the subrtxes of X to worklist ARRAY, starting at END.  Return the
> +   number of elements added to the worklist.  */
> +
> +template <typename T>
> +size_t
> +generic_subrtx_iterator <T>::add_subrtxes_to_queue (array_type &array,
> +						    value_type *base,
> +						    size_t end, rtx_type x)
> +{
> +  const char *format = GET_RTX_FORMAT (GET_CODE (x));
> +  size_t orig_end = end;
> +  for (int i = 0; format[i]; ++i)
> +    if (format[i] == 'e')
> +      {
> +	value_type subx = T::get_value (x->u.fld[i].rt_rtx);
> +	if (__builtin_expect (end < LOCAL_ELEMS, true))
> +	  base[end++] = subx;
> +	else
> +	  base = add_single_to_queue (array, base, end++, subx);
> +      }
> +    else if (format[i] == 'E')
> +      {
> +	int length = GET_NUM_ELEM (x->u.fld[i].rt_rtvec);
> +	rtx *vec = x->u.fld[i].rt_rtvec->elem;
> +	if (__builtin_expect (end + length <= LOCAL_ELEMS, true))
> +	  for (int j = 0; j < length; j++)
> +	    base[end++] = T::get_value (vec[j]);
> +	else
> +	  for (int j = 0; j < length; j++)
> +	    base = add_single_to_queue (array, base, end++,
> +					T::get_value (vec[j]));
> +      }
> +  return end - orig_end;
> +}
> +
> +template <typename T>
> +void
> +generic_subrtx_iterator <T>::free_array (array_type &array)
> +{
> +  vec_free (array.heap);
> +}
> +
> +template <typename T>
> +const size_t generic_subrtx_iterator <T>::LOCAL_ELEMS;
> +
> +template class generic_subrtx_iterator <const_rtx_accessor>;
> +template class generic_subrtx_iterator <rtx_var_accessor>;
> +template class generic_subrtx_iterator <rtx_ptr_accessor>;
> +
>  /* Return 1 if the value of X is unstable
>     (would be different at a different point in the program).
>     The frame pointer, arg pointer, etc. are considered stable
> @@ -5324,8 +5415,42 @@ truncated_to_mode (enum machine_mode mode, const_rtx x)
>    return false;
>  }
>  
> +/* Return true if RTX code CODE has a single sequence of zero or more
> +   "e" operands and no rtvec operands.  Initialize its rtx_all_subrtx_bounds
> +   entry in that case.  */
> +
> +static bool
> +setup_reg_subrtx_bounds (unsigned int code)
> +{
> +  const char *format = GET_RTX_FORMAT ((enum rtx_code) code);
> +  unsigned int i = 0;
> +  for (; format[i] != 'e'; ++i)
> +    {
> +      if (!format[i])
> +	/* No subrtxes.  Leave start and count as 0.  */
> +	return true;
> +      if (format[i] == 'E' || format[i] == 'V')
> +	return false;
> +    }
> +
> +  /* Record the sequence of 'e's.  */
> +  rtx_all_subrtx_bounds[code].start = i;
> +  do
> +    ++i;
> +  while (format[i] == 'e');
> +  rtx_all_subrtx_bounds[code].count = i - rtx_all_subrtx_bounds[code].start;
> +  /* rtl-iter.h relies on this.  */
> +  gcc_checking_assert (rtx_all_subrtx_bounds[code].count <= 3);
> +
> +  for (; format[i]; ++i)
> +    if (format[i] == 'E' || format[i] == 'V' || format[i] == 'e')
> +      return false;
> +
> +  return true;
> +}
> +
>  /* Initialize non_rtx_starting_operands, which is used to speed up
> -   for_each_rtx.  */
> +   for_each_rtx, and rtx_all_subrtx_bounds.  */
>  void
>  init_rtlanal (void)
>  {
> @@ -5335,6 +5460,10 @@ init_rtlanal (void)
>        const char *format = GET_RTX_FORMAT (i);
>        const char *first = strpbrk (format, "eEV");
>        non_rtx_starting_operands[i] = first ? first - format : -1;
> +      if (!setup_reg_subrtx_bounds (i))
> +	rtx_all_subrtx_bounds[i].count = UCHAR_MAX;
> +      if (GET_RTX_CLASS (i) != RTX_CONST_OBJ)
> +	rtx_nonconst_subrtx_bounds[i] = rtx_all_subrtx_bounds[i];
>      }
>  
>    init_num_sign_bit_copies_in_rep ();
Richard Sandiford May 8, 2014, 6:25 a.m. UTC | #3
Trevor Saunders <tsaunders@mozilla.com> writes:
> On Wed, May 07, 2014 at 09:52:49PM +0100, Richard Sandiford wrote:
>> I noticed for_each_rtx showing up in profiles and thought I'd have a go
>> at using worklist-based iterators instead.  So far I have three:
>> 
>>   FOR_EACH_SUBRTX: iterates over const_rtx subrtxes of a const_rtx
>>   FOR_EACH_SUBRTX_VAR: iterates over rtx subrtxes of an rtx
>>   FOR_EACH_SUBRTX_PTR: iterates over subrtx pointers of an rtx *
>> 
>> with FOR_EACH_SUBRTX_PTR being the direct for_each_rtx replacement.
>> 
>> I made FOR_EACH_SUBRTX the "default" (unsuffixed) version because
>> most walks really don't modify the structure.  I think we should
>> encourage const_rtxes to be used wherever possible.  E.g. it might
>> make it easier to have non-GC storage for temporary rtxes in future.
>> 
>> I've locally replaced all for_each_rtx calls in the generic code with
>> these iterators and they make things reproducibly faster.  The speed-up
>> on full --enable-checking=release ./cc1 and ./cc1plus times is only about 1%,
>> but maybe that's enough to justify the churn.
>
> seems pretty nice, and it seems like it'll make code a little more
> readable too :)
>
>> Implementation-wise, the main observation is that most subrtxes are part
>> of a single contiguous sequence of "e" fields.  E.g. when compiling an
>> oldish combine.ii on x86_64-linux-gnu with -O2, we iterate over the
>> subrtxes of 7,636,542 rtxes.  Of those:
>> 
>> (A) 4,459,135 (58.4%) are leaf rtxes with no "e" or "E" fields,
>> (B) 3,133,875 (41.0%) are rtxes with a single block of "e" fields and
>>                       no "E" fields, and
>> (C)    43,532 (00.6%) are more complicated.
>> 
>> (A) is really a special case of (B) in which the block has zero length.
>> Those are the only two cases that really need to be handled inline.
>> The implementation does this by having a mapping from an rtx code to the
>> bounds of its "e" sequence, in the form of a start index and count.
>> 
>> Out of (C), the vast majority (43,509) are PARALLELs.  However, as you'd
>> probably expect, bloating the inline code with that case made things
>> slower rather than faster.
>> 
>> The vast majority (in fact all in the combine.ii run above) of iterations
>> can be done with a 16-element stack worklist.  We obviously still need a
>> heap fallback for the pathological cases though.
>> 
>> I spent a bit of time trying different iterator implementations and
>> seeing which produced the best code.  Specific results from that were:
>> 
>> - The storage used for the worklist is separate from the iterator,
>>   in order to avoid capturing iterator fields.
>> 
>> - Although the natural type of the storage would be auto_vec <..., 16>,
>>   that produced some overhead compared with a separate stack array and heap
>>   vector pointer.  With the heap vector pointer, the only overhead is an
>>   assignment in the constructor and an "if (x) release (x)"-style sequence
>>   in the destructor.  I think the extra complication over auto_vec is worth
>>   it because in this case the heap version is so very rarely needed.
>
> hm, where does the overhead come from exactly? it seems like if it's
>  faster to use vec<T, va_heap, vl_embed> *foo, we should fix something
>  about vectors since this isn't the only place it could matter.  does it
>  matter if you use vec<T, va_heap, vl_embed> * or vec<T>?  the second
>  is basically just a wrapper around the former, so I'd expect it to have
>  no effect.  I'm not saying you're doing the wrong thing here, but if we
>  can make generic vectors faster we probably should ;) or is the issue
>  the __builtin_expect()s you can add?

Part of the problem is that by having an array in the vec itself,
the other fields effectively have their address taken too.
So m_alloc, m_num and m_using_auto_storage need to be set up
and maintained on the stack, even though we're almost sure that they
will never be used.
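
Roughly, it's the difference between these two layouts (field names as
above, much simplified from vec.h):

  /* auto_vec-style: taking the address of the embedded array pins the
     whole object, so the bookkeeping fields have to be maintained on
     the stack too.  */
  struct combined_type
  {
    unsigned m_alloc;
    unsigned m_num;
    bool m_using_auto_storage;
    value_type stack[16];
  };

  /* Split form: only the plain array ever has its address taken; the
     heap pointer is the only extra state to set up and test.  */
  struct array_type
  {
    value_type stack[16];
    vec <value_type, va_heap, vl_embed> *heap;
  };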

>> - The maximum number of fields in (B)-type rtxes is 3.  We get better
>>   code by making that explicit rather than having a general loop.
>> 
>> - (C) codes map to an "e" count of UCHAR_MAX, so we can use a single
>>   check to test for that and for cases where the stack worklist is
>>   too small.
>
>  can we use uint8_t?

We don't really use that in GCC yet.  I don't mind setting a precedent
though :-)

>> To give an example:
>> 
>> /* Callback for for_each_rtx, that returns 1 upon encountering a VALUE
>>    whose UID is greater than the int uid that D points to.  */
>> 
>> static int
>> refs_newer_value_cb (rtx *x, void *d)
>> {
>>   if (GET_CODE (*x) == VALUE && CSELIB_VAL_PTR (*x)->uid > *(int *)d)
>>     return 1;
>> 
>>   return 0;
>> }
>> 
>> /* Return TRUE if EXPR refers to a VALUE whose uid is greater than
>>    that of V.  */
>> 
>> static bool
>> refs_newer_value_p (rtx expr, rtx v)
>> {
>>   int minuid = CSELIB_VAL_PTR (v)->uid;
>> 
>>   return for_each_rtx (&expr, refs_newer_value_cb, &minuid);
>> }
>> 
>> becomes:
>> 
>> /* Return TRUE if EXPR refers to a VALUE whose uid is greater than
>>    that of V.  */
>> 
>> static bool
>> refs_newer_value_p (const_rtx expr, rtx v)
>> {
>>   int minuid = CSELIB_VAL_PTR (v)->uid;
>>   subrtx_iterator::array_type array;
>
> some reason to not hide it in the macro?

I wanted to make the macro a true for loop, with no extra baggage.
AFAIK there's no way of declaring both the array and the iterator
in the first part of the for loop without making them part of the
same object (and thus capturing the iterator fields again).
A "do ... while (0)" wrapper might be too surprising.

Also, keeping the array separate allows you to reuse it between
iterations, so you only do the allocation and free once.  E.g.:

  subrtx_iterator::array_type array;
  for (rtx insn = ...; insn; insn = NEXT_INSN (insn))
    if (INSN_P (insn))
      FOR_EACH_SUBRTX (...)

>> +template <typename T>
>> +inline generic_subrtx_iterator <T>::~generic_subrtx_iterator ()
>> +{
>> +}
>
>  What's wrong with just letting the compiler generate that for you?

Bah, thanks for catching that.  It was left over from an earlier
experiment in which the destructor did do some cleanup of the array.

>> @@ -78,6 +82,93 @@ static int non_rtx_starting_operands[NUM_RTX_CODE];
>>  static unsigned int
>>  num_sign_bit_copies_in_rep[MAX_MODE_INT + 1][MAX_MODE_INT + 1];
>>  
>> +/* Store X into index I of ARRAY.  ARRAY is known to have at least I
>> +   elements.  Return the new base of ARRAY.  */
>> +
>> +template <typename T>
>> +typename T::value_type *
>> +generic_subrtx_iterator <T>::add_single_to_queue (array_type &array,
>> +						  value_type *base,
>> +						  size_t i, value_type x)
>> +{
>> +  if (base == array.stack)
>> +    {
>> +      if (i < LOCAL_ELEMS)
>> +	{
>> +	  base[i] = x;
>> +	  return base;
>> +	}
>
> it seems like this first little bit might be worth inlining, but I guess
> you tested and it wasn't.

Yeah.  The cases where an "e" or entire "E" fit within the stack buffer
are already handled inline.  So in practice we only get here if we know
we have blown or are about to blow the stack buffer.  The case it
handles is where part of an "E" field fits within the buffer and part of
it doesn't.

One option would have been to force the heap buffer to be used when
entering this function.  But that would then make the simple check:

		  if (m_end > LOCAL_ELEMS)
		    m_base = m_array.heap->address ();

unsafe.  It seemed better to make add_single_to_queue general and
handle both stack and heap pushes.

Thanks,
Richard
Richard Sandiford May 8, 2014, 6:30 a.m. UTC | #4
Richard Sandiford <rdsandiford@googlemail.com> writes:
> Trevor Saunders <tsaunders@mozilla.com> writes:
>> On Wed, May 07, 2014 at 09:52:49PM +0100, Richard Sandiford wrote:
>>> I noticed for_each_rtx showing up in profiles and thought I'd have a go
>>> at using worklist-based iterators instead.  So far I have three:
>>> 
>>>   FOR_EACH_SUBRTX: iterates over const_rtx subrtxes of a const_rtx
>>>   FOR_EACH_SUBRTX_VAR: iterates over rtx subrtxes of an rtx
>>>   FOR_EACH_SUBRTX_PTR: iterates over subrtx pointers of an rtx *
>>> 
>>> with FOR_EACH_SUBRTX_PTR being the direct for_each_rtx replacement.
>>> 
>>> I made FOR_EACH_SUBRTX the "default" (unsuffixed) version because
>>> most walks really don't modify the structure.  I think we should
>>> encourage const_rtxes to be used wherever possible.  E.g. it might
>>> make it easier to have non-GC storage for temporary rtxes in future.
>>> 
>>> I've locally replaced all for_each_rtx calls in the generic code with
>>> these iterators and they make things reproducibly faster.  The speed-up
>>> on full --enable-checking=release ./cc1 and ./cc1plus times is only about 1%,
>>> but maybe that's enough to justify the churn.
>>
>> seems pretty nice, and it seems like it'll make code a little more
>> readable too :)
>>
>>> Implementation-wise, the main observation is that most subrtxes are part
>>> of a single contiguous sequence of "e" fields.  E.g. when compiling an
>>> oldish combine.ii on x86_64-linux-gnu with -O2, we iterate over the
>>> subrtxes of 7,636,542 rtxes.  Of those:
>>> 
>>> (A) 4,459,135 (58.4%) are leaf rtxes with no "e" or "E" fields,
>>> (B) 3,133,875 (41.0%) are rtxes with a single block of "e" fields and
>>>                       no "E" fields, and
>>> (C)    43,532 (00.6%) are more complicated.
>>> 
>>> (A) is really a special case of (B) in which the block has zero length.
>>> Those are the only two cases that really need to be handled inline.
>>> The implementation does this by having a mapping from an rtx code to the
>>> bounds of its "e" sequence, in the form of a start index and count.
>>> 
>>> Out of (C), the vast majority (43,509) are PARALLELs.  However, as you'd
>>> probably expect, bloating the inline code with that case made things
>>> slower rather than faster.
>>> 
>>> The vast majority (in fact all in the combine.ii run above) of iterations
>>> can be done with a 16-element stack worklist.  We obviously still need a
>>> heap fallback for the pathological cases though.
>>> 
>>> I spent a bit of time trying different iterator implementations and
>>> seeing which produced the best code.  Specific results from that were:
>>> 
>>> - The storage used for the worklist is separate from the iterator,
>>>   in order to avoid capturing iterator fields.
>>> 
>>> - Although the natural type of the storage would be auto_vec <..., 16>,
>>>   that produced some overhead compared with a separate stack array and heap
>>>   vector pointer.  With the heap vector pointer, the only overhead is an
>>>   assignment in the constructor and an "if (x) release (x)"-style sequence
>>>   in the destructor.  I think the extra complication over auto_vec is worth
>>>   it because in this case the heap version is so very rarely needed.
>>
>> hm, where does the overhead come from exactly? it seems like if it's
>>  faster to use vec<T, va_heap, vl_embed> *foo, we should fix something
>>  about vectors since this isn't the only place it could matter.  does it
>>  matter if you use vec<T, va_heap, vl_embed> * or vec<T>?  the second
>>  is basically just a wrapper around the former, so I'd expect it to have
>>  no effect.  I'm not saying you're doing the wrong thing here, but if we
>>  can make generic vectors faster we probably should ;) or is the issue
>>  the __builtin_expect()s you can add?
>
> Part of the problem is that by having an array in the vec itself,
> the other fields effectively have their address taken too.
> So m_alloc, m_num and m_using_auto_storage need to be set up
> and maintained on the stack, even though we're almost sure that they
> will never be used.
>
>>> - The maximum number of fields in (B)-type rtxes is 3.  We get better
>>>   code by making that explicit rather than having a general loop.
>>> 
>>> - (C) codes map to an "e" count of UCHAR_MAX, so we can use a single
>>>   check to test for that and for cases where the stack worklist is
>>>   too small.
>>
>>  can we use uint8_t?
>
> We don't really use that in GCC yet.  I don't mind setting a precedent
> though :-)
>
>>> To give an example:
>>> 
>>> /* Callback for for_each_rtx, that returns 1 upon encountering a VALUE
>>>    whose UID is greater than the int uid that D points to.  */
>>> 
>>> static int
>>> refs_newer_value_cb (rtx *x, void *d)
>>> {
>>>   if (GET_CODE (*x) == VALUE && CSELIB_VAL_PTR (*x)->uid > *(int *)d)
>>>     return 1;
>>> 
>>>   return 0;
>>> }
>>> 
>>> /* Return TRUE if EXPR refers to a VALUE whose uid is greater than
>>>    that of V.  */
>>> 
>>> static bool
>>> refs_newer_value_p (rtx expr, rtx v)
>>> {
>>>   int minuid = CSELIB_VAL_PTR (v)->uid;
>>> 
>>>   return for_each_rtx (&expr, refs_newer_value_cb, &minuid);
>>> }
>>> 
>>> becomes:
>>> 
>>> /* Return TRUE if EXPR refers to a VALUE whose uid is greater than
>>>    that of V.  */
>>> 
>>> static bool
>>> refs_newer_value_p (const_rtx expr, rtx v)
>>> {
>>>   int minuid = CSELIB_VAL_PTR (v)->uid;
>>>   subrtx_iterator::array_type array;
>>
>> some reason to not hide it in the macro?
>
> I wanted to make the macro a true for loop, with no extra baggage.
> AFAIK there's no way of declaring both the array and the iterator
> in the first part of the for loop without making them part of the
> same object (and thus capturing the iterator fields again).
> A "do ... while (0)" wrapper might be too surprising.

and of course would require an "end loop" macro to hide the while (0).
Trevor Saunders May 8, 2014, 11:21 a.m. UTC | #5
On Thu, May 08, 2014 at 07:25:50AM +0100, Richard Sandiford wrote:
> Trevor Saunders <tsaunders@mozilla.com> writes:
> > On Wed, May 07, 2014 at 09:52:49PM +0100, Richard Sandiford wrote:
> >> I noticed for_each_rtx showing up in profiles and thought I'd have a go
> >> at using worklist-based iterators instead.  So far I have three:
> >> 
> >>   FOR_EACH_SUBRTX: iterates over const_rtx subrtxes of a const_rtx
> >>   FOR_EACH_SUBRTX_VAR: iterates over rtx subrtxes of an rtx
> >>   FOR_EACH_SUBRTX_PTR: iterates over subrtx pointers of an rtx *
> >> 
> >> with FOR_EACH_SUBRTX_PTR being the direct for_each_rtx replacement.
> >> 
> >> I made FOR_EACH_SUBRTX the "default" (unsuffixed) version because
> >> most walks really don't modify the structure.  I think we should
> >> encourage const_rtxes to be used wherever possible.  E.g. it might
> >> make it easier to have non-GC storage for temporary rtxes in future.
> >> 
> >> I've locally replaced all for_each_rtx calls in the generic code with
> >> these iterators and they make things reproducibly faster.  The speed-up
> >> on full --enable-checking=release ./cc1 and ./cc1plus times is only about 1%,
> >> but maybe that's enough to justify the churn.
> >
> > seems pretty nice, and it seems like it'll make code a little more
> > readable too :)
> >
> >> Implementation-wise, the main observation is that most subrtxes are part
> >> of a single contiguous sequence of "e" fields.  E.g. when compiling an
> >> oldish combine.ii on x86_64-linux-gnu with -O2, we iterate over the
> >> subrtxes of 7,636,542 rtxes.  Of those:
> >> 
> >> (A) 4,459,135 (58.4%) are leaf rtxes with no "e" or "E" fields,
> >> (B) 3,133,875 (41.0%) are rtxes with a single block of "e" fields and
> >>                       no "E" fields, and
> >> (C)    43,532 (00.6%) are more complicated.
> >> 
> >> (A) is really a special case of (B) in which the block has zero length.
> >> Those are the only two cases that really need to be handled inline.
> >> The implementation does this by having a mapping from an rtx code to the
> >> bounds of its "e" sequence, in the form of a start index and count.
> >> 
> >> Out of (C), the vast majority (43,509) are PARALLELs.  However, as you'd
> >> probably expect, bloating the inline code with that case made things
> >> slower rather than faster.
> >> 
> >> The vast majority (in fact all in the combine.ii run above) of iterations
> >> can be done with a 16-element stack worklist.  We obviously still need a
> >> heap fallback for the pathological cases though.
> >> 
> >> I spent a bit of time trying different iterator implementations and
> >> seeing which produced the best code.  Specific results from that were:
> >> 
> >> - The storage used for the worklist is separate from the iterator,
> >>   in order to avoid capturing iterator fields.
> >> 
> >> - Although the natural type of the storage would be auto_vec <..., 16>,
> >>   that produced some overhead compared with a separate stack array and heap
> >>   vector pointer.  With the heap vector pointer, the only overhead is an
> >>   assignment in the constructor and an "if (x) release (x)"-style sequence
> >>   in the destructor.  I think the extra complication over auto_vec is worth
> >>   it because in this case the heap version is so very rarely needed.
> >
> > hm, where does the overhead come from exactly? it seems like if it's
> >  faster to use vec<T, va_heap, vl_embed> *foo, we should fix something
> >  about vectors since this isn't the only place it could matter.  does it
> >  matter if you use vec<T, va_heap, vl_embed> * or vec<T>?  the second
> >  is basically just a wrapper around the former, so I'd expect it to have
> >  no effect.  I'm not saying you're doing the wrong thing here, but if we
> >  can make generic vectors faster we probably should ;) or is the issue
> >  the __builtin_expect()s you can add?
> 
> Part of the problem is that by having an array in the vec itself,
> the other fields effectively have their address taken too.
> So m_alloc, m_num and m_using_auto_storage need to be set up
> and maintained on the stack, even though we're almost sure that they
> will never be used.

ok

> >> - The maximum number of fields in (B)-type rtxes is 3.  We get better
> >>   code by making that explicit rather than having a general loop.
> >> 
> >> - (C) codes map to an "e" count of UCHAR_MAX, so we can use a single
> >>   check to test for that and for cases where the stack worklist is
> >>   too small.
> >
> >  can we use uint8_t?
> 
> We don't really use that in GCC yet.  I don't mind setting a precedent
> though :-)
> 
> >> To give an example:
> >> 
> >> /* Callback for for_each_rtx, that returns 1 upon encountering a VALUE
> >>    whose UID is greater than the int uid that D points to.  */
> >> 
> >> static int
> >> refs_newer_value_cb (rtx *x, void *d)
> >> {
> >>   if (GET_CODE (*x) == VALUE && CSELIB_VAL_PTR (*x)->uid > *(int *)d)
> >>     return 1;
> >> 
> >>   return 0;
> >> }
> >> 
> >> /* Return TRUE if EXPR refers to a VALUE whose uid is greater than
> >>    that of V.  */
> >> 
> >> static bool
> >> refs_newer_value_p (rtx expr, rtx v)
> >> {
> >>   int minuid = CSELIB_VAL_PTR (v)->uid;
> >> 
> >>   return for_each_rtx (&expr, refs_newer_value_cb, &minuid);
> >> }
> >> 
> >> becomes:
> >> 
> >> /* Return TRUE if EXPR refers to a VALUE whose uid is greater than
> >>    that of V.  */
> >> 
> >> static bool
> >> refs_newer_value_p (const_rtx expr, rtx v)
> >> {
> >>   int minuid = CSELIB_VAL_PTR (v)->uid;
> >>   subrtx_iterator::array_type array;
> >
> > some reason to not hide it in the macro?
> 
> I wanted to make the macro a true for loop, with no extra baggage.
> AFAIK there's no way of declaring both the array and the iterator
> in the first part of the for loop without making them part of the
> same object (and thus capturing the iterator fields again).
> A "do ... while (0)" wrapper might be too surprising.

yeah, for some reason I thought you could do for (T x, U y; ..) but
you seem to be right.

> Also, keeping the array separate allows you to reuse it between
> iterations, so you only do the allocation and free once.  E.g.:

fair.
Trev

> 
>   subrtx_iterator::array_type array;
>   for (rtx insn = ...; insn; insn = NEXT_INSN (insn))
>     if (INSN_P (insn))
>       FOR_EACH_SUBRTX (...)
> 
> >> +template <typename T>
> >> +inline generic_subrtx_iterator <T>::~generic_subrtx_iterator ()
> >> +{
> >> +}
> >
> >  What's wrong with just letting the compiler generate that for you?
> 
> Bah, thanks for catching that.  It was left over from an earlier
> experiment in which the destructor did do some cleanup of the array.
> 
> >> @@ -78,6 +82,93 @@ static int non_rtx_starting_operands[NUM_RTX_CODE];
> >>  static unsigned int
> >>  num_sign_bit_copies_in_rep[MAX_MODE_INT + 1][MAX_MODE_INT + 1];
> >>  
> >> +/* Store X into index I of ARRAY.  ARRAY is known to have at least I
> >> +   elements.  Return the new base of ARRAY.  */
> >> +
> >> +template <typename T>
> >> +typename T::value_type *
> >> +generic_subrtx_iterator <T>::add_single_to_queue (array_type &array,
> >> +						  value_type *base,
> >> +						  size_t i, value_type x)
> >> +{
> >> +  if (base == array.stack)
> >> +    {
> >> +      if (i < LOCAL_ELEMS)
> >> +	{
> >> +	  base[i] = x;
> >> +	  return base;
> >> +	}
> >
> > it seems like this first little bit might be worth inlining, but I guess
> > you tested and it wasn't.
> 
> Yeah.  The cases where an "e" or entire "E" fit within the stack buffer
> are already handled inline.  So in practice we only get here if we know
> we have blown or are about to blow the stack buffer.  The case it
> handles is where part of an "E" field fits within the buffer and part of
> it doesn't.
> 
> One option would have been to force the heap buffer to be used when
> entering this function.  But that would then make the simple check:
> 
> 		  if (m_end > LOCAL_ELEMS)
> 		    m_base = m_array.heap->address ();
> 
> unsafe.  It seemed better to make add_single_to_queue general and
> handle both stack and heap pushes.
> 
> Thanks,
> Richard
Jeff Law May 9, 2014, 6:18 a.m. UTC | #6
On 05/07/14 14:52, Richard Sandiford wrote:
> I noticed for_each_rtx showing up in profiles and thought I'd have a go
> at using worklist-based iterators instead.  So far I have three:
>
>    FOR_EACH_SUBRTX: iterates over const_rtx subrtxes of a const_rtx
>    FOR_EACH_SUBRTX_VAR: iterates over rtx subrtxes of an rtx
>    FOR_EACH_SUBRTX_PTR: iterates over subrtx pointers of an rtx *
>
> with FOR_EACH_SUBRTX_PTR being the direct for_each_rtx replacement.
Yea, for_each_rtx is a bit of a workhorse, so I'm not surprised it's 
showing up on some profiles.



> I've locally replaced all for_each_rtx calls in the generic code with
> these iterators and they make things reproducibly faster.  The speed-up
> on full --enable-checking=release ./cc1 and ./cc1plus times is only about 1%,
> but maybe that's enough to justify the churn.
I'd consider that enough to justify the churn, especially if it's an 
improvement we're seeing consistently rather than just something that's 
helping pathological cases.


>
> Implementation-wise, the main observation is that most subrtxes are part
> of a single contiguous sequence of "e" fields.  E.g. when compiling an
> oldish combine.ii on x86_64-linux-gnu with -O2, we iterate over the
> subrtxes of 7,636,542 rtxes.  Of those:
Yea, makes sense.  One could possibly carry this a step further and
realize that we could specialize the walkers.  i.e., rather than indexing
the rtx_code to get the rtx format, then iterating over that to find the
'e' fields, if we have a PLUS, then damn it, we ought to know how to
walk it pretty directly.
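
(A hypothetical sketch of that idea, not part of the patch: for a code
whose format is known statically, such as PLUS with format "ee", the
walk could be open-coded as

  template <typename Fn>
  static void
  walk_plus_operands (const_rtx x, Fn fn)
  {
    fn (XEXP (x, 0));  /* first "e" operand */
    fn (XEXP (x, 1));  /* second "e" operand */
  }

with one such specialization per interesting code.)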

>
> (A) 4,459,135 (58.4%) are leaf rtxes with no "e" or "E" fields,
> (B) 3,133,875 (41.0%) are rtxes with a single block of "e" fields and
>                        no "E" fields, and
> (C)    43,532 (00.6%) are more complicated.
>
> (A) is really a special case of (B) in which the block has zero length.
> Those are the only two cases that really need to be handled inline.
> The implementation does this by having a mapping from an rtx code to the
> bounds of its "e" sequence, in the form of a start index and count.
Right.  So my thought above is just extending this a bit and realizing
that we don't need an index/count for most RTXs.  That may (or may not) 
be worth the extra pain.  Your call whether or not to investigate that 
further.


>
> Does this look like it's worth pursuing?  If so, is it OK to start
> submitting now or should it wait until 4.9.1 is out?  If it is OK,
> I plan to convert the ports too and get rid of for_each_rtx.
I think it's definitely worth pursuing.  I suspect the release managers 
would definitely want to wait for 4.9.1 :-)

jeff
Richard Biener May 9, 2014, 10:41 a.m. UTC | #7
On Fri, May 9, 2014 at 8:18 AM, Jeff Law <law@redhat.com> wrote:
> On 05/07/14 14:52, Richard Sandiford wrote:
>>
>> I noticed for_each_rtx showing up in profiles and thought I'd have a go
>> at using worklist-based iterators instead.  So far I have three:
>>
>>    FOR_EACH_SUBRTX: iterates over const_rtx subrtxes of a const_rtx
>>    FOR_EACH_SUBRTX_VAR: iterates over rtx subrtxes of an rtx
>>    FOR_EACH_SUBRTX_PTR: iterates over subrtx pointers of an rtx *
>>
>> with FOR_EACH_SUBRTX_PTR being the direct for_each_rtx replacement.
>
> Yea, for_each_rtx is a bit of a workhorse, so I'm not surprised it's showing
> up on some profiles.
>
>
>
>
>> I've locally replaced all for_each_rtx calls in the generic code with
>> these iterators and they make things reproducibly faster.  The speed-up
>> on full --enable-checking=release ./cc1 and ./cc1plus times is only about
>> 1%,
>> but maybe that's enough to justify the churn.
>
> I'd consider that enough to justify the churn, especially if it's an
> improvement we're seeing consistently rather than just something that's
> helping pathological cases.
>
>
>
>>
>> Implementation-wise, the main observation is that most subrtxes are part
>> of a single contiguous sequence of "e" fields.  E.g. when compiling an
>> oldish combine.ii on x86_64-linux-gnu with -O2, we iterate over the
>> subrtxes of 7,636,542 rtxes.  Of those:
>
> Yea, makes sense.  One could possibly carry this a step further and realize
> that we could specialize the walkers.  i.e., rather than indexing the rtx_code
> to get the rtx format, then iterating over that to find the 'e' fields, if we
> have a PLUS, then damn it, we ought to know how to walk it pretty directly.
>
>
>>
>> (A) 4,459,135 (58.4%) are leaf rtxes with no "e" or "E" fields,
>> (B) 3,133,875 (41.0%) are rtxes with a single block of "e" fields and
>>                        no "E" fields, and
>> (C)    43,532 (00.6%) are more complicated.
>>
>> (A) is really a special case of (B) in which the block has zero length.
>> Those are the only two cases that really need to be handled inline.
>> The implementation does this by having a mapping from an rtx code to the
>> bounds of its "e" sequence, in the form of a start index and count.
>
> Right.  So my thought above is just extending this a bit and realizing that
> we don't need an index/count for most RTXs.  That may (or may not) be worth
> the extra pain.  Your call whether or not to investigate that further.
>
>
>
>>
>> Does this look like it's worth pursuing?  If so, is it OK to start
>> submitting now or should it wait until 4.9.1 is out?  If it is OK,
>> I plan to convert the ports too and get rid of for_each_rtx.
>
> I think it's definitely worth pursuing.  I suspect the release managers
> would definitely want to wait for 4.9.1 :-)

It's not too big a change, so no need to wait.

Thanks,
Richard.

> jeff
>
Richard Sandiford May 17, 2014, 7:33 a.m. UTC | #8
Thanks for the comments.

Jeff Law <law@redhat.com> writes:
>> Implementation-wise, the main observation is that most subrtxes are part
>> of a single contiguous sequence of "e" fields.  E.g. when compiling an
>> oldish combine.ii on x86_64-linux-gnu with -O2, we iterate over the
>> subrtxes of 7,636,542 rtxes.  Of those:
> Yea, makes sense.  One could possibly carry this a step further and
> realize that we could specialize the walkers.  i.e., rather than indexing
> the rtx_code to get the rtx format, then iterating over that to find the
> 'e' fields, if we have a PLUS, then damn it, we ought to know how to
> walk it pretty directly.

One thing I tried was to add STATIC_CONSTANT_P for specific codes to the
next () routine.  The idea was that if we enter next () after handling
a particular code, we could skip directly to the right handling for that
code.  (I tried REG specifically, since that's such a commonly-handled
case and since the leaf case should be the easiest to thread.)  It didn't
help though because it would rely on the kind of jump threading that
only happens once __builtin_constant_p has been removed.
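
Roughly the sort of thing tried, as a sketch only (the HINTED_CODE
parameter is illustrative; the patch itself has nothing like it):

  /* Hypothetical variant of next () that knows the code just handled.  */
  void next_after (enum rtx_code hinted_code)
  {
    /* STATIC_CONSTANT_P is true only if the compiler can prove the
       comparison at compile time.  */
    if (STATIC_CONSTANT_P (hinted_code == REG))
      {
        /* REG is a leaf, so there is nothing to queue; go straight
           to popping the worklist.  */
        if (m_end == 0)
          m_done = true;
        else
          m_current = m_base[--m_end];
        return;
      }
    next ();
  }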

I suppose we could put the onus on the users of the iterator to invoke
a "handle subrtxes of this code" routine once they know what the code is.
That could make things a bit ugly though.  E.g.:

  FOR_EACH_SUBRTX (iter, array, expr, NONCONST)
    if (GET_CODE (*iter) == VALUE && CSELIB_VAL_PTR (*iter)->uid > minuid)
      return true;

would become:

  FOR_EACH_SUBRTX (iter, array, expr, NONCONST)
    if (GET_CODE (*iter) == VALUE)
      {
        if (CSELIB_VAL_PTR (*iter)->uid > minuid)
          return true;
        iter.code_is (VALUE);
      }

It began to feel like premature optimisation.

>> (A) 4,459,135 (58.4%) are leaf rtxes with no "e" or "E" fields,
>> (B) 3,133,875 (41.0%) are rtxes with a single block of "e" fields and
>>                        no "E" fields, and
>> (C)    43,532 (00.6%) are more complicated.
>>
>> (A) is really a special case of (B) in which the block has zero length.
>> Those are the only two cases that really need to be handled inline.
>> The implementation does this by having a mapping from an rtx code to the
>> bounds of its "e" sequence, in the form of a start index and count.
> Right.  So my thought above is just extending this a bit and realizing 
> that we don't need an index/count for most RTXs.  That may (or may not) 
> be worth the extra pain.  Your call whether or not to investigate that 
> further.

Well, the idea was to try to keep the code small enough for inlining
to be worthwhile.  I don't think we'd want a tree of checks for specific
codes because I'd have thought it would introduce extra hard-to-predict
branches.  E.g. although PLUS and MEM are very common codes, they probably 
only occur once in most patterns.

I suppose we could organise the codes so that all the leaf rtxes come
before all the "e"s, which in turn come before all the "ee"s, etc.,
so that we can just check for code ranges.  The problem is that we'll
probably then end up with a separate bounds check (for the stack
worklist) for each length, so the code would end up bigger and
with more branches.  Also, I think the codes are currently arranged
so that the codes that are likely to appear in top-level patterns
are in the same block, and that constants are in the same block, etc.,
so that jump-table-based cases can be used where profitable.
(E.g. SET...TRAP_IF.)
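
For illustration, the range dispatch would be a chain like the following
(LAST_LEAF_CODE and LAST_UNARY_CODE are invented names; no such
boundaries exist today):

  /* Hypothetical: assumes the codes were sorted by "e" count.  */
  if (code <= LAST_LEAF_CODE)
    ;  /* No subrtxes to visit.  */
  else if (code <= LAST_UNARY_CODE)
    m_current = T::get_value (x->u.fld[0].rt_rtx);
  else
    {
      /* Two or more "e"s: each length needs its own worklist bounds
         check (omitted here), which is where the extra branches
         would come from.  */
      m_base[m_end++] = T::get_value (x->u.fld[1].rt_rtx);
      m_current = T::get_value (x->u.fld[0].rt_rtx);
    }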

Thanks,
Richard
Jeff Law May 19, 2014, 6:19 p.m. UTC | #9
On 05/17/14 01:33, Richard Sandiford wrote:
> I suppose we could put the onus on the users of the iterator to invoke
> a "handle subrtxes of this code" routine once they know what the code is.
> That could make things a bit ugly though.  E.g.:
>
>    FOR_EACH_SUBRTX (iter, array, expr, NONCONST)
>      if (GET_CODE (*iter) == VALUE && CSELIB_VAL_PTR (*iter)->uid > minuid)
>        return true;
>
> would become:
>
>    FOR_EACH_SUBRTX (iter, array, expr, NONCONST)
>      if (GET_CODE (*iter) == VALUE)
>        {
>          if (CSELIB_VAL_PTR (*iter)->uid > minuid)
>            return true;
>          iter.code_is (VALUE);
>        }
>
> It began to feel like premature optimisation.
Understood.  Thanks for poking at it.

There's something about FOR_EACH_RTX that feels like it needs a rethink, 
but I haven't managed to get my head around it yet.  I'll put it away 
for a while.

So as far as the patch itself is concerned, are there any outstanding 
issues?

jeff
Richard Sandiford May 19, 2014, 6:46 p.m. UTC | #10
Jeff Law <law@redhat.com> writes:
> On 05/17/14 01:33, Richard Sandiford wrote:
>> I suppose we could put the onus on the users of the iterator to invoke
>> a "handle subrtxes of this code" routine once they know what the code is.
>> That could make things a bit ugly though.  E.g.:
>>
>>    FOR_EACH_SUBRTX (iter, array, expr, NONCONST)
>>      if (GET_CODE (*iter) == VALUE && CSELIB_VAL_PTR (*iter)->uid > minuid)
>>        return true;
>>
>> would become:
>>
>>    FOR_EACH_SUBRTX (iter, array, expr, NONCONST)
>>      if (GET_CODE (*iter) == VALUE)
>>        {
>>          if (CSELIB_VAL_PTR (*iter)->uid > minuid)
>>            return true;
>>          iter.code_is (VALUE);
>>        }
>>
>> It began to feel like premature optimisation.
> Understood.  Thanks for poking at it.
>
> There's something about FOR_EACH_RTX that feels like it needs a rethink, 
> but I haven't managed to put my head around it yet.   I'll put it away 
> for a while.
>
> So as far as the patch itself is concerned, are there any outstanding 
> issues?

No, I just need to find time to brush off the patches and submit them
properly (looks like there are 57 in all).  I hope to do that in the next
week or so.
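
To give a flavour of the conversions, a typical for_each_rtx-style query
ends up looking something like this (a made-up example rather than one
of the 57):

  /* Return true if X or any of its subrtxes is a register.  */
  static bool
  contains_reg_p (const_rtx x)
  {
    subrtx_iterator::array_type array;
    FOR_EACH_SUBRTX (iter, array, x, ALL)
      if (REG_P (*iter))
        return true;
    return false;
  }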

Thanks for the reviews.

Richard

Patch

diff --git a/gcc/rtl-iter.h b/gcc/rtl-iter.h
new file mode 100644
index 0000000..1bc171d
--- /dev/null
+++ b/gcc/rtl-iter.h
@@ -0,0 +1,293 @@ 
+/* RTL iterators
+   Copyright (C) 2014 Free Software Foundation, Inc.
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify it under
+the terms of the GNU General Public License as published by the Free
+Software Foundation; either version 3, or (at your option) any later
+version.
+
+GCC is distributed in the hope that it will be useful, but WITHOUT ANY
+WARRANTY; without even the implied warranty of MERCHANTABILITY or
+FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.  If not see
+<http://www.gnu.org/licenses/>.  */
+
+/* This structure describes the subrtxes of an rtx as follows:
+
+   - if the rtx has no subrtxes, START and COUNT are both 0.
+
+   - if all the subrtxes of an rtx are stored in a contiguous block
+     of XEXPs ("e"s), START is the index of the first XEXP and COUNT
+     is the number of them.
+
+   - otherwise START is arbitrary and COUNT is UCHAR_MAX.  */
+struct rtx_subrtx_bound_info {
+  unsigned char start;
+  unsigned char count;
+};
+extern rtx_subrtx_bound_info rtx_all_subrtx_bounds[];
+extern rtx_subrtx_bound_info rtx_nonconst_subrtx_bounds[];
+
+/* Return true if CODE has no subrtxes.  */
+
+static inline bool
+leaf_code_p (enum rtx_code code)
+{
+  return rtx_all_subrtx_bounds[code].count == 0;
+}
+
+/* Used to iterate over subrtxes of an rtx.  T abstracts the type of
+   access.  */
+template <typename T>
+class generic_subrtx_iterator
+{
+  static const size_t LOCAL_ELEMS = 16;
+  typedef typename T::value_type value_type;
+  typedef typename T::rtx_type rtx_type;
+  typedef typename T::rtunion_type rtunion_type;
+
+public:
+  struct array_type
+  {
+    array_type ();
+    ~array_type ();
+    value_type stack[LOCAL_ELEMS];
+    vec <value_type, va_heap, vl_embed> *heap;
+  };
+  generic_subrtx_iterator (array_type &, value_type,
+			   const rtx_subrtx_bound_info *);
+  ~generic_subrtx_iterator ();
+
+  value_type operator * () const;
+  bool at_end () const;
+  void next ();
+  void skip_subrtxes ();
+  void substitute (value_type);
+
+private:
+  /* The bounds to use for iterating over subrtxes.  */
+  const rtx_subrtx_bound_info *m_bounds;
+
+  /* The storage used for the worklist.  */
+  array_type &m_array;
+
+  /* The current rtx.  */
+  value_type m_current;
+
+  /* The base of the current worklist.  */
+  value_type *m_base;
+
+  /* The number of subrtxes in M_BASE.  */
+  size_t m_end;
+
+  /* The following booleans shouldn't end up in registers or memory
+     but just direct control flow.  */
+
+  /* True if the iteration is over.  */
+  bool m_done;
+
+  /* True if we should skip the subrtxes of M_CURRENT.  */
+  bool m_skip;
+
+  /* True if M_CURRENT has been replaced with a different rtx.  */
+  bool m_substitute;
+
+  static void free_array (array_type &);
+  static size_t add_subrtxes_to_queue (array_type &, value_type *, size_t,
+				       rtx_type);
+  static value_type *add_single_to_queue (array_type &, value_type *, size_t,
+					  value_type);
+};
+
+template <typename T>
+inline generic_subrtx_iterator <T>::array_type::array_type () : heap (0) {}
+
+template <typename T>
+inline generic_subrtx_iterator <T>::array_type::~array_type ()
+{
+  if (__builtin_expect (heap != 0, false))
+    free_array (*this);
+}
+
+/* Iterate over X and its subrtxes, in arbitrary order.  Use ARRAY to
+   store the worklist.  We use an external array in order to avoid
+   capturing the fields of this structure when taking the address of
+   the array.  Use BOUNDS to find the bounds of simple "e"-string codes.  */
+
+template <typename T>
+inline generic_subrtx_iterator <T>::
+generic_subrtx_iterator (array_type &array, value_type x,
+			 const rtx_subrtx_bound_info *bounds)
+  : m_bounds (bounds),
+    m_array (array),
+    m_current (x),
+    m_base (m_array.stack),
+    m_end (0),
+    m_done (false),
+    m_skip (false),
+    m_substitute (false)
+{
+}
+
+template <typename T>
+inline generic_subrtx_iterator <T>::~generic_subrtx_iterator ()
+{
+}
+
+/* Return the current subrtx.  */
+
+template <typename T>
+inline typename T::value_type
+generic_subrtx_iterator <T>::operator * () const
+{
+  return m_current;
+}
+
+/* Return true if the iteration has finished.  */
+
+template <typename T>
+inline bool
+generic_subrtx_iterator <T>::at_end () const
+{
+  return m_done;
+}
+
+/* Move on to the next subrtx.  */
+
+template <typename T>
+inline void
+generic_subrtx_iterator <T>::next ()
+{
+  if (m_substitute)
+    {
+      m_substitute = false;
+      m_skip = false;
+      return;
+    }
+  if (!m_skip)
+    {
+      /* Add the subrtxes of M_CURRENT.  */
+      rtx_type x = T::get_rtx (m_current);
+      if (__builtin_expect (x != 0, true))
+	{
+	  enum rtx_code code = GET_CODE (x);
+	  ssize_t count = m_bounds[code].count;
+	  if (count > 0)
+	    {
+	      /* Handle the simple case of a single "e" block that is known
+		 to fit into the current array.  */
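+	      /* Only COUNT - 1 entries are pushed here, since the first
+		 subrtx becomes M_CURRENT; hence the "+ 1" below.  */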
+	      if (__builtin_expect (m_end + count <= LOCAL_ELEMS + 1, true))
+		{
+		  /* Set M_CURRENT to the first subrtx and queue the rest.  */
+		  ssize_t start = m_bounds[code].start;
+		  rtunion_type *src = &x->u.fld[start];
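+		  /* Push the second and third subrtxes (if any) in reverse
+		     order, so that they are visited in operand order when
+		     popped off the worklist.  */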
+		  if (__builtin_expect (count > 2, false))
+		    m_base[m_end++] = T::get_value (src[2].rt_rtx);
+		  if (count > 1)
+		    m_base[m_end++] = T::get_value (src[1].rt_rtx);
+		  m_current = T::get_value (src[0].rt_rtx);
+		  return;
+		}
+	      /* Handle cases which aren't simple "e" sequences or where
+		 the sequence might overrun M_BASE.  */
+	      count = add_subrtxes_to_queue (m_array, m_base, m_end, x);
+	      if (count > 0)
+		{
+		  m_end += count;
+		  if (m_end > LOCAL_ELEMS)
+		    m_base = m_array.heap->address ();
+		  m_current = m_base[--m_end];
+		  return;
+		}
+	    }
+	}
+    }
+  else
+    m_skip = false;
+  if (m_end == 0)
+    m_done = true;
+  else
+    m_current = m_base[--m_end];
+}
+
+/* Skip the subrtxes of the current rtx.  */
+
+template <typename T>
+inline void
+generic_subrtx_iterator <T>::skip_subrtxes ()
+{
+  m_skip = true;
+}
+
+/* Ignore the subrtxes of the current rtx and look at X instead.  */
+
+template <typename T>
+inline void
+generic_subrtx_iterator <T>::substitute (value_type x)
+{
+  m_substitute = true;
+  m_current = x;
+}
+
+/* Iterators for const_rtx.  */
+struct const_rtx_accessor
+{
+  typedef const_rtx value_type;
+  typedef const_rtx rtx_type;
+  typedef const rtunion rtunion_type;
+  static rtx_type get_rtx (value_type x) { return x; }
+  static value_type get_value (rtx_type x) { return x; }
+};
+typedef generic_subrtx_iterator <const_rtx_accessor> subrtx_iterator;
+
+/* Iterators for non-constant rtx.  */
+struct rtx_var_accessor
+{
+  typedef rtx value_type;
+  typedef rtx rtx_type;
+  typedef rtunion rtunion_type;
+  static rtx_type get_rtx (value_type x) { return x; }
+  static value_type get_value (rtx_type x) { return x; }
+};
+typedef generic_subrtx_iterator <rtx_var_accessor> subrtx_var_iterator;
+
+/* Iterators for rtx *.  */
+struct rtx_ptr_accessor
+{
+  typedef rtx *value_type;
+  typedef rtx rtx_type;
+  typedef rtunion rtunion_type;
+  static rtx_type get_rtx (value_type ptr) { return *ptr; }
+  static value_type get_value (rtx_type &x) { return &x; }
+};
+typedef generic_subrtx_iterator <rtx_ptr_accessor> subrtx_ptr_iterator;
+
+#define ALL_BOUNDS rtx_all_subrtx_bounds
+#define NONCONST_BOUNDS rtx_nonconst_subrtx_bounds
+
+/* Use ITER to iterate over const_rtx X and its recursive subrtxes,
+   using subrtx_iterator::array_type ARRAY as the storage for the worklist.
+   ARRAY can be reused for multiple consecutive iterations but shouldn't
+   be shared by two concurrent iterations.  TYPE is ALL if all subrtxes
+   are of interest or NONCONST if it is safe to ignore subrtxes of
+   constants.  */
+#define FOR_EACH_SUBRTX(ITER, ARRAY, X, TYPE) \
+  for (subrtx_iterator ITER (ARRAY, X, TYPE##_BOUNDS); !ITER.at_end (); \
+       ITER.next ())
+
+/* Like FOR_EACH_SUBRTX, but iterate over subrtxes of an rtx X.  */
+#define FOR_EACH_SUBRTX_VAR(ITER, ARRAY, X, TYPE) \
+  for (subrtx_var_iterator ITER (ARRAY, X, TYPE##_BOUNDS); !ITER.at_end (); \
+       ITER.next ())
+
+/* Like FOR_EACH_SUBRTX, but iterate over subrtx pointers of rtx pointer X.
+   For example, if X is &PATTERN (insn) and the pattern is a SET, iterate
+   over &PATTERN (insn), &SET_DEST (PATTERN (insn)), etc.  */
+#define FOR_EACH_SUBRTX_PTR(ITER, ARRAY, X, TYPE) \
+  for (subrtx_ptr_iterator ITER (ARRAY, X, TYPE##_BOUNDS); !ITER.at_end (); \
+       ITER.next ())
diff --git a/gcc/rtlanal.c b/gcc/rtlanal.c
index f3471b1..df52788 100644
--- a/gcc/rtlanal.c
+++ b/gcc/rtlanal.c
@@ -37,6 +37,7 @@  along with GCC; see the file COPYING3.  If not see
 #include "tree.h"
 #include "emit-rtl.h"  /* FIXME: Can go away once crtl is moved to rtl.h.  */
 #include "addresses.h"
+#include "rtl-iter.h"
 
 /* Forward declarations */
 static void set_of_1 (rtx, const_rtx, void *);
@@ -62,6 +63,9 @@  static unsigned int num_sign_bit_copies1 (const_rtx, enum machine_mode, const_rt
    -1 if a code has no such operand.  */
 static int non_rtx_starting_operands[NUM_RTX_CODE];
 
+rtx_subrtx_bound_info rtx_all_subrtx_bounds[NUM_RTX_CODE];
+rtx_subrtx_bound_info rtx_nonconst_subrtx_bounds[NUM_RTX_CODE];
+
 /* Truncation narrows the mode from SOURCE mode to DESTINATION mode.
    If TARGET_MODE_REP_EXTENDED (DESTINATION, DESTINATION_REP) is
    SIGN_EXTEND then while narrowing we also have to enforce the
@@ -78,6 +82,93 @@  static int non_rtx_starting_operands[NUM_RTX_CODE];
 static unsigned int
 num_sign_bit_copies_in_rep[MAX_MODE_INT + 1][MAX_MODE_INT + 1];
 
+/* Store X into index I of ARRAY.  ARRAY is known to have at least I
+   elements.  Return the new base of ARRAY.  */
+
+template <typename T>
+typename T::value_type *
+generic_subrtx_iterator <T>::add_single_to_queue (array_type &array,
+						  value_type *base,
+						  size_t i, value_type x)
+{
+  if (base == array.stack)
+    {
+      if (i < LOCAL_ELEMS)
+	{
+	  base[i] = x;
+	  return base;
+	}
+      gcc_checking_assert (i == LOCAL_ELEMS);
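+      /* The stack is now full; move its contents to the heap vector.  */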
+      vec_safe_grow (array.heap, i + 1);
+      base = array.heap->address ();
+      memcpy (base, array.stack, sizeof (array.stack));
+      base[LOCAL_ELEMS] = x;
+      return base;
+    }
+  unsigned int length = array.heap->length ();
+  if (length > i)
+    {
+      gcc_checking_assert (base == array.heap->address ());
+      base[i] = x;
+      return base;
+    }
+  else
+    {
+      gcc_checking_assert (i == length);
+      vec_safe_push (array.heap, x);
+      return array.heap->address ();
+    }
+}
+
+/* Add the subrtxes of X to worklist ARRAY, starting at END.  Return the
+   number of elements added to the worklist.  */
+
+template <typename T>
+size_t
+generic_subrtx_iterator <T>::add_subrtxes_to_queue (array_type &array,
+						    value_type *base,
+						    size_t end, rtx_type x)
+{
+  const char *format = GET_RTX_FORMAT (GET_CODE (x));
+  size_t orig_end = end;
+  for (int i = 0; format[i]; ++i)
+    if (format[i] == 'e')
+      {
+	value_type subx = T::get_value (x->u.fld[i].rt_rtx);
+	if (__builtin_expect (end < LOCAL_ELEMS, true))
+	  base[end++] = subx;
+	else
+	  base = add_single_to_queue (array, base, end++, subx);
+      }
+    else if (format[i] == 'E')
+      {
+	int length = GET_NUM_ELEM (x->u.fld[i].rt_rtvec);
+	rtx *vec = x->u.fld[i].rt_rtvec->elem;
+	if (__builtin_expect (end + length <= LOCAL_ELEMS, true))
+	  for (int j = 0; j < length; j++)
+	    base[end++] = T::get_value (vec[j]);
+	else
+	  for (int j = 0; j < length; j++)
+	    base = add_single_to_queue (array, base, end++,
+					T::get_value (vec[j]));
+      }
+  return end - orig_end;
+}
+
+template <typename T>
+void
+generic_subrtx_iterator <T>::free_array (array_type &array)
+{
+  vec_free (array.heap);
+}
+
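+/* Out-of-line definition of the static constant declared in rtl-iter.h.  */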
+template <typename T>
+const size_t generic_subrtx_iterator <T>::LOCAL_ELEMS;
+
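+/* Explicit instantiations for the accessor types defined in rtl-iter.h.  */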
+template class generic_subrtx_iterator <const_rtx_accessor>;
+template class generic_subrtx_iterator <rtx_var_accessor>;
+template class generic_subrtx_iterator <rtx_ptr_accessor>;
+
 /* Return 1 if the value of X is unstable
    (would be different at a different point in the program).
    The frame pointer, arg pointer, etc. are considered stable
@@ -5324,8 +5415,42 @@  truncated_to_mode (enum machine_mode mode, const_rtx x)
   return false;
 }
 
+/* Return true if RTX code CODE has a single sequence of zero or more
+   "e" operands and no rtvec operands.  Initialize its rtx_all_subrtx_bounds
+   entry in that case.  */
+
+static bool
+setup_reg_subrtx_bounds (unsigned int code)
+{
+  const char *format = GET_RTX_FORMAT ((enum rtx_code) code);
+  unsigned int i = 0;
+  for (; format[i] != 'e'; ++i)
+    {
+      if (!format[i])
+	/* No subrtxes.  Leave start and count as 0.  */
+	return true;
+      if (format[i] == 'E' || format[i] == 'V')
+	return false;
+    }
+
+  /* Record the sequence of 'e's.  */
+  rtx_all_subrtx_bounds[code].start = i;
+  do
+    ++i;
+  while (format[i] == 'e');
+  rtx_all_subrtx_bounds[code].count = i - rtx_all_subrtx_bounds[code].start;
+  /* rtl-iter.h relies on this.  */
+  gcc_checking_assert (rtx_all_subrtx_bounds[code].count <= 3);
+
+  for (; format[i]; ++i)
+    if (format[i] == 'E' || format[i] == 'V' || format[i] == 'e')
+      return false;
+
+  return true;
+}
+
 /* Initialize non_rtx_starting_operands, which is used to speed up
-   for_each_rtx.  */
+   for_each_rtx, and rtx_all_subrtx_bounds.  */
 void
 init_rtlanal (void)
 {
@@ -5335,6 +5460,10 @@  init_rtlanal (void)
       const char *format = GET_RTX_FORMAT (i);
       const char *first = strpbrk (format, "eEV");
       non_rtx_starting_operands[i] = first ? first - format : -1;
+      if (!setup_reg_subrtx_bounds (i))
+	rtx_all_subrtx_bounds[i].count = UCHAR_MAX;
+      if (GET_RTX_CLASS (i) != RTX_CONST_OBJ)
+	rtx_nonconst_subrtx_bounds[i] = rtx_all_subrtx_bounds[i];
     }
 
   init_num_sign_bit_copies_in_rep ();