
[10/11] aarch64: Add new load/store pair fusion pass.

Message ID ZVZbNwaknAdo337U@arm.com
State New
Series aarch64: Rework ldp/stp patterns, add new ldp/stp pass

Commit Message

Alex Coplan Nov. 16, 2023, 6:11 p.m. UTC
This is a v3 of the aarch64 load/store pair fusion pass.
v2 was posted here:
 - https://gcc.gnu.org/pipermail/gcc-patches/2023-October/633601.html

The main changes since v2 are as follows:

We now handle writeback opportunities as well.  E.g. for this testcase:

void foo (long *p, long *q, long x, long y)
{
  do {
    *(p++) = x;
    *(p++) = y;
  } while (p < q);
}

with the patch, we generate:

foo:
.LFB0:
        .align  3
.L2:
        stp     x2, x3, [x0], 16
        cmp     x0, x1
        bcc     .L2
        ret

instead of:

foo:
.LFB0:
        .align  3
.L2:
        str     x2, [x0], 16
        str     x3, [x0, -8]
        cmp     x0, x1
        bcc     .L2
        ret

i.e. the pass is now capable of finding load/store pair opportunities even when
one or both of the initial candidate accesses use writeback addressing.
We do this by adding a notion of canonicalizing RTL bases.  When we see a
writeback access, we record that the new base def is equivalent to the original
def plus some offset.  When tracking accesses, we then canonicalize to track
each access relative to the earliest equivalent base in the basic block.

This allows us to spot that accesses are adjacent even though they don't share
the same RTL-SSA base def.
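
In outline (a simplified sketch; the names follow the canon_base_map handling
in the new file), a writeback access defining DEF = BASE_DEF + OFF records that
fact, and any later access whose base is DEF gets re-expressed relative to
BASE_DEF:

  // Sketch: re-express this access relative to the canonical base.
  alt_base *canon_base = canon_base_map.get (base_def);
  if (canon_base)
    {
      base_def = canon_base->base;
      access_off += canon_base->offset;
    }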

Furthermore, we also add some extra logic to opportunistically fold in trailing
destructive updates of the base register used for a load/store pair.  E.g. for

void post_add (long *p, long *q, long x, long y)
{
  do {
    p[0] = x;
    p[1] = y;
    p += 2;
  } while (p < q);
}

the auto-inc-dec pass doesn't currently form any writeback accesses, and we
generate:

post_add:
.LFB0:
        .align  3
.L2:
        add     x0, x0, 16
        stp     x2, x3, [x0, -16]
        cmp     x0, x1
        bcc     .L2
        ret

but with the updated pass, we now get:

post_add:
.LFB0:
        .align  3
.L2:
        stp     x2, x3, [x0], 16
        cmp     x0, x1
        bcc     .L2
        ret
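
Such a trailing add is only folded in when its immediate would be representable
in the writeback ldp/stp form; roughly (a hypothetical helper summarizing the
checks made by the new find_trailing_add logic):

static bool
pair_writeback_offset_ok_p (HOST_WIDE_INT off, unsigned access_size)
{
  // The offset is scaled by the access size and must then fit in the
  // signed 7-bit immediate field of the writeback ldp/stp forms.
  if (off % access_size != 0)
    return false;
  off /= access_size;
  return off >= LDP_MIN_IMM && off <= LDP_MAX_IMM;
}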

Other notable changes to the pass since the last version include:
 - We switch to using the aarch64_gen_{load,store}_pair interface
   for forming the (non-writeback) pairs, allowing use of the new
   load/store pair representation added by the earlier patch.
 - The various updates to the load/store pair patterns mean that
   we no longer need to do mode canonicalization / mode unification
   in the pass, as the patterns allow arbitrary combinations of suitable modes
   of the same size.  So we remove the logic to do this (including the
   param to control the strategy).
 - Fix up classification of zero operands to make sure that these are always
   treated as GPR operands for pair discovery purposes.  This avoids us
   pairing zero operands with FPRs in the pre-RA pass, which used to lead to
   undesirable codegen involving cross-register-file moves.
 - We also remove the try_adjust_address logic from the previous iteration of
   the pass.  Since we validate all ldp/stp offsets in the pass, this only
   meant that we lost opportunities in the case that a given mem fails to
   adjust in its original mode.

Bootstrapped/regtested as a series on aarch64-linux-gnu, OK for trunk?

Thanks,
Alex

gcc/ChangeLog:

	* config.gcc: Add aarch64-ldp-fusion.o to extra_objs for aarch64; add
	aarch64-ldp-fusion.cc to target_gtfiles.
	* config/aarch64/aarch64-passes.def: Add copies of pass_ldp_fusion
	before and after RA.
	* config/aarch64/aarch64-protos.h (make_pass_ldp_fusion): Declare.
	* config/aarch64/aarch64.opt (-mearly-ldp-fusion): New.
	(-mlate-ldp-fusion): New.
	(--param=aarch64-ldp-alias-check-limit): New.
	(--param=aarch64-ldp-writeback): New.
	* config/aarch64/t-aarch64: Add rule for aarch64-ldp-fusion.o.
	* config/aarch64/aarch64-ldp-fusion.cc: New file.
---
 gcc/config.gcc                           |    4 +-
 gcc/config/aarch64/aarch64-ldp-fusion.cc | 2727 ++++++++++++++++++++++
 gcc/config/aarch64/aarch64-passes.def    |    2 +
 gcc/config/aarch64/aarch64-protos.h      |    1 +
 gcc/config/aarch64/aarch64.opt           |   23 +
 gcc/config/aarch64/t-aarch64             |    7 +
 6 files changed, 2762 insertions(+), 2 deletions(-)
 create mode 100644 gcc/config/aarch64/aarch64-ldp-fusion.cc

Comments

Richard Sandiford Nov. 22, 2023, 11:14 a.m. UTC | #1
Alex Coplan <alex.coplan@arm.com> writes:
> This is a v3 of the aarch64 load/store pair fusion pass.
> v2 was posted here:
>  - https://gcc.gnu.org/pipermail/gcc-patches/2023-October/633601.html
>
> The main changes since v2 are as follows:
>
> We now handle writeback opportunities as well.  E.g. for this testcase:
>
> void foo (long *p, long *q, long x, long y)
> {
>   do {
>     *(p++) = x;
>     *(p++) = y;
>   } while (p < q);
> }
>
> with the patch, we generate:
>
> foo:
> .LFB0:
>         .align  3
> .L2:
>         stp     x2, x3, [x0], 16
>         cmp     x0, x1
>         bcc     .L2
>         ret
>
> instead of:
>
> foo:
> .LFB0:
>         .align  3
> .L2:
>         str     x2, [x0], 16
>         str     x3, [x0, -8]
>         cmp     x0, x1
>         bcc     .L2
>         ret
>
> i.e. the pass is now capable of finding load/store pair opportunities even when
> one or both of the initial candidate accesses use writeback addressing.
> We do this by adding a notion of canonicalizing RTL bases.  When we see a
> writeback access, we record that the new base def is equivalent to the original
> def plus some offset.  When tracking accesses, we then canonicalize to track
> each access relative to the earliest equivalent base in the basic block.
>
> This allows us to spot that accesses are adjacent even though they don't share
> the same RTL-SSA base def.
>
> Furthermore, we also add some extra logic to opportunistically fold in trailing
> destructive updates of the base register used for a load/store pair.  E.g. for
>
> void post_add (long *p, long *q, long x, long y)
> {
>   do {
>     p[0] = x;
>     p[1] = y;
>     p += 2;
>   } while (p < q);
> }
>
> the auto-inc-dec pass doesn't currently form any writeback accesses, and we
> generate:
>
> post_add:
> .LFB0:
>         .align  3
> .L2:
>         add     x0, x0, 16
>         stp     x2, x3, [x0, -16]
>         cmp     x0, x1
>         bcc     .L2
>         ret
>
> but with the updated pass, we now get:
>
> post_add:
> .LFB0:
>         .align  3
> .L2:
>         stp     x2, x3, [x0], 16
>         cmp     x0, x1
>         bcc     .L2
>         ret
>
> Other notable changes to the pass since the last version include:
>  - We switch to using the aarch64_gen_{load,store}_pair interface
>    for forming the (non-writeback) pairs, allowing use of the new
>    load/store pair representation added by the earlier patch.
>  - The various updates to the load/store pair patterns mean that
>    we no longer need to do mode canonicalization / mode unification
>    in the pass, as the patterns allow arbitrary combinations of suitable modes
>    of the same size.  So we remove the logic to do this (including the
>    param to control the strategy).
>  - Fix up classification of zero operands to make sure that these are always
>    treated as GPR operands for pair discovery purposes.  This avoids us
>    pairing zero operands with FPRs in the pre-RA pass, which used to lead to
>    undesirable codegen involving cross-register-file moves.
>  - We also remove the try_adjust_address logic from the previous iteration of
>    the pass.  Since we validate all ldp/stp offsets in the pass, this only
>    meant that we lost opportunities in the case that a given mem fails to
>    adjust in its original mode.
>
> Bootstrapped/regtested as a series on aarch64-linux-gnu, OK for trunk?
>
> Thanks,
> Alex
>
> gcc/ChangeLog:
>
> 	* config.gcc: Add aarch64-ldp-fusion.o to extra_objs for aarch64; add
> 	aarch64-ldp-fusion.cc to target_gtfiles.
> 	* config/aarch64/aarch64-passes.def: Add copies of pass_ldp_fusion
> 	before and after RA.
> 	* config/aarch64/aarch64-protos.h (make_pass_ldp_fusion): Declare.
> 	* config/aarch64/aarch64.opt (-mearly-ldp-fusion): New.
> 	(-mlate-ldp-fusion): New.
> 	(--param=aarch64-ldp-alias-check-limit): New.
> 	(--param=aarch64-ldp-writeback): New.
> 	* config/aarch64/t-aarch64: Add rule for aarch64-ldp-fusion.o.
> 	* config/aarch64/aarch64-ldp-fusion.cc: New file.

Looks really good.  I'll probably need to do another pass over it,
but some initial comments below.

Main general comment is: it would be good to have more commentary.
Not "repeat the code in words" commentary, just comments that sketch
the intent or purpose of the following code, what the assumptions and
invariants are, etc.

> ---
>  gcc/config.gcc                           |    4 +-
>  gcc/config/aarch64/aarch64-ldp-fusion.cc | 2727 ++++++++++++++++++++++
>  gcc/config/aarch64/aarch64-passes.def    |    2 +
>  gcc/config/aarch64/aarch64-protos.h      |    1 +
>  gcc/config/aarch64/aarch64.opt           |   23 +
>  gcc/config/aarch64/t-aarch64             |    7 +
>  6 files changed, 2762 insertions(+), 2 deletions(-)
>  create mode 100644 gcc/config/aarch64/aarch64-ldp-fusion.cc
>
> diff --git a/gcc/config.gcc b/gcc/config.gcc
> index c1460ca354e..8b7f6b20309 100644
> --- a/gcc/config.gcc
> +++ b/gcc/config.gcc
> @@ -349,8 +349,8 @@ aarch64*-*-*)
>  	c_target_objs="aarch64-c.o"
>  	cxx_target_objs="aarch64-c.o"
>  	d_target_objs="aarch64-d.o"
> -	extra_objs="aarch64-builtins.o aarch-common.o aarch64-sve-builtins.o aarch64-sve-builtins-shapes.o aarch64-sve-builtins-base.o aarch64-sve-builtins-sve2.o cortex-a57-fma-steering.o aarch64-speculation.o falkor-tag-collision-avoidance.o aarch-bti-insert.o aarch64-cc-fusion.o"
> -	target_gtfiles="\$(srcdir)/config/aarch64/aarch64-builtins.cc \$(srcdir)/config/aarch64/aarch64-sve-builtins.h \$(srcdir)/config/aarch64/aarch64-sve-builtins.cc"
> +	extra_objs="aarch64-builtins.o aarch-common.o aarch64-sve-builtins.o aarch64-sve-builtins-shapes.o aarch64-sve-builtins-base.o aarch64-sve-builtins-sve2.o cortex-a57-fma-steering.o aarch64-speculation.o falkor-tag-collision-avoidance.o aarch-bti-insert.o aarch64-cc-fusion.o aarch64-ldp-fusion.o"
> +	target_gtfiles="\$(srcdir)/config/aarch64/aarch64-builtins.cc \$(srcdir)/config/aarch64/aarch64-sve-builtins.h \$(srcdir)/config/aarch64/aarch64-sve-builtins.cc \$(srcdir)/config/aarch64/aarch64-ldp-fusion.cc"
>  	target_has_targetm_common=yes
>  	;;
>  alpha*-*-*)
> diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc b/gcc/config/aarch64/aarch64-ldp-fusion.cc
> new file mode 100644
> index 00000000000..6ab18b9216e
> --- /dev/null
> +++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc
> @@ -0,0 +1,2727 @@
> +// LoadPair fusion optimization pass for AArch64.
> +// Copyright (C) 2023 Free Software Foundation, Inc.
> +//
> +// This file is part of GCC.
> +//
> +// GCC is free software; you can redistribute it and/or modify it
> +// under the terms of the GNU General Public License as published by
> +// the Free Software Foundation; either version 3, or (at your option)
> +// any later version.
> +//
> +// GCC is distributed in the hope that it will be useful, but
> +// WITHOUT ANY WARRANTY; without even the implied warranty of
> +// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +// General Public License for more details.
> +//
> +// You should have received a copy of the GNU General Public License
> +// along with GCC; see the file COPYING3.  If not see
> +// <http://www.gnu.org/licenses/>.
> +
> +#define INCLUDE_ALGORITHM
> +#define INCLUDE_FUNCTIONAL
> +#define INCLUDE_LIST
> +#define INCLUDE_TYPE_TRAITS
> +#include "config.h"
> +#include "system.h"
> +#include "coretypes.h"
> +#include "backend.h"
> +#include "rtl.h"
> +#include "df.h"
> +#include "rtl-ssa.h"
> +#include "cfgcleanup.h"
> +#include "tree-pass.h"
> +#include "ordered-hash-map.h"
> +#include "tree-dfa.h"
> +#include "fold-const.h"
> +#include "tree-hash-traits.h"
> +#include "print-tree.h"
> +#include "insn-attr.h"
> +
> +using namespace rtl_ssa;
> +
> +enum
> +{
> +  LDP_IMM_BITS = 7,
> +  LDP_IMM_MASK = (1 << LDP_IMM_BITS) - 1,
> +  LDP_IMM_SIGN_BIT = (1 << (LDP_IMM_BITS - 1)),
> +  LDP_MAX_IMM = LDP_IMM_SIGN_BIT - 1,
> +  LDP_MIN_IMM = -LDP_MAX_IMM - 1,
> +};

Since this isn't really an enumeration, it might be better to use
constexprs.
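
For example (illustrative only):

  constexpr int LDP_IMM_BITS = 7;
  constexpr int LDP_IMM_MASK = (1 << LDP_IMM_BITS) - 1;
  constexpr int LDP_IMM_SIGN_BIT = 1 << (LDP_IMM_BITS - 1);
  constexpr int LDP_MAX_IMM = LDP_IMM_SIGN_BIT - 1;
  constexpr int LDP_MIN_IMM = -LDP_MAX_IMM - 1;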

> +
> +// We pack these fields (load_p, fpsimd_p, and size) into an integer
> +// (LFS) which we use as part of the key into the main hash tables.
> +//
> +// The idea is that we group candidates together only if they agree on
> +// the fields below.  Candidates that disagree on any of these
> +// properties shouldn't be merged together.
> +struct lfs_fields
> +{
> +  bool load_p;
> +  bool fpsimd_p;
> +  unsigned size;
> +};
> +
> +using insn_list_t = std::list <insn_info *>;

Very minor, but it'd be good for the file to be consistent about having
or not having a space before template arguments.  The coding conventions
don't say, so either is fine.

> +using insn_iter_t = insn_list_t::iterator;
> +
> +// Information about the accesses at a given offset from a particular
> +// base.  Stored in an access_group, see below.
> +struct access_record
> +{
> +  poly_int64 offset;
> +  std::list<insn_info *> cand_insns;
> +  std::list<access_record>::iterator place;
> +
> +  access_record (poly_int64 off) : offset (off) {}
> +};
> +
> +// A group of accesses where adjacent accesses could be ldp/stp
> +// candidates.  The splay tree supports efficient insertion,
> +// while the list supports efficient iteration.
> +struct access_group
> +{
> +  splay_tree <access_record *> tree;
> +  std::list<access_record> list;
> +
> +  template<typename Alloc>
> +  inline void track (Alloc node_alloc, poly_int64 offset, insn_info *insn);
> +};
> +
> +// Information about a potential base candidate, used in try_fuse_pair.
> +// There may be zero, one, or two viable RTL bases for a given pair.
> +struct base_cand
> +{
> +  def_info *m_def;

Sorry for the trivia, but it seems odd for only this member variable
to have the "m_" prefix.  I think that's normally used for protected
and private members.

> +
> +  // FROM_INSN is -1 if the base candidate is already shared by both
> +  // candidate insns.  Otherwise it holds the index of the insn from
> +  // which the base originated.
> +  int from_insn;
> +
> +  // Initially: dataflow hazards that arise if we choose this base as
> +  // the common base register for the pair.
> +  //
> +  // Later these get narrowed, taking alias hazards into account.
> +  insn_info *hazards[2];

Might be worth expanding the comment a bit.  I wasn't sure how an
insn_info represented a hazard.  From further reading, I see it's
an insn that contains a hazard, and therefore acts as a barrier to
movement in that direction.

> +
> +  base_cand (def_info *def, int insn)
> +    : m_def (def), from_insn (insn), hazards {nullptr, nullptr} {}
> +
> +  base_cand (def_info *def) : base_cand (def, -1) {}
> +
> +  bool viable () const
> +  {
> +    return !hazards[0] || !hazards[1] || (*hazards[0] > *hazards[1]);
> +  }
> +};
> +
> +// Information about an alternate base.  For a def_info D, it may
> +// instead be expressed as D = BASE + OFFSET.
> +struct alt_base
> +{
> +  def_info *base;
> +  poly_int64 offset;
> +};
> +
> +// State used by the pass for a given basic block.
> +struct ldp_bb_info
> +{
> +  using def_hash = nofree_ptr_hash <def_info>;
> +  using expr_key_t = pair_hash <tree_operand_hash, int_hash <int, -1, -2>>;
> +  using def_key_t = pair_hash <def_hash, int_hash <int, -1, -2>>;
> +
> +  // Map of <tree base, LFS> -> access_group.
> +  ordered_hash_map <expr_key_t, access_group> expr_map;
> +
> +  // Map of <RTL-SSA def_info *, LFS> -> access_group.
> +  ordered_hash_map <def_key_t, access_group> def_map;
> +
> +  // Given the def_info for an RTL base register, express it as an offset from
> +  // some canonical base instead.
> +  //
> +  // Canonicalizing bases in this way allows us to identify adjacent accesses
> +  // even if they see different base register defs.
> +  hash_map <def_hash, alt_base> canon_base_map;
> +
> +  static const size_t obstack_alignment = sizeof (void *);
> +  bb_info *m_bb;
> +
> +  ldp_bb_info (bb_info *bb) : m_bb (bb), m_emitted_tombstone (false)
> +  {
> +    obstack_specify_allocation (&m_obstack, OBSTACK_CHUNK_SIZE,
> +				obstack_alignment, obstack_chunk_alloc,
> +				obstack_chunk_free);
> +  }
> +  ~ldp_bb_info ()
> +  {
> +    obstack_free (&m_obstack, nullptr);
> +  }
> +
> +  inline void track_access (insn_info *, bool load, rtx mem);
> +  inline void transform ();
> +  inline void cleanup_tombstones ();
> +
> +private:
> +  // Did we emit a tombstone insn for this bb?
> +  bool m_emitted_tombstone;
> +  obstack m_obstack;
> +
> +  inline splay_tree_node<access_record *> *node_alloc (access_record *);
> +
> +  template<typename Map>
> +  inline void traverse_base_map (Map &map);
> +  inline void transform_for_base (int load_size, access_group &group);
> +
> +  inline bool try_form_pairs (insn_list_t *, insn_list_t *,
> +			      bool load_p, unsigned access_size);
> +
> +  inline bool track_via_mem_expr (insn_info *, rtx mem, lfs_fields lfs);
> +};
> +
> +splay_tree_node<access_record *> *
> +ldp_bb_info::node_alloc (access_record *access)
> +{
> +  using T = splay_tree_node<access_record *>;
> +  void *addr = obstack_alloc (&m_obstack, sizeof (T));
> +  return new (addr) T (access);
> +}
> +
> +// Given a mem MEM, if the address has side effects, return a MEM that accesses
> +// the same address but without the side effects.  Otherwise, return
> +// MEM unchanged.
> +static rtx
> +drop_writeback (rtx mem)
> +{
> +  rtx addr = XEXP (mem, 0);
> +
> +  if (!side_effects_p (addr))
> +    return mem;
> +
> +  switch (GET_CODE (addr))
> +    {
> +    case PRE_MODIFY:
> +      addr = XEXP (addr, 1);
> +      break;
> +    case POST_MODIFY:
> +    case POST_INC:
> +    case POST_DEC:
> +      addr = XEXP (addr, 0);
> +      break;
> +    case PRE_INC:
> +    case PRE_DEC:
> +    {
> +      poly_int64 adjustment = GET_MODE_SIZE (GET_MODE (mem));
> +      if (GET_CODE (addr) == PRE_DEC)
> +	adjustment *= -1;
> +      addr = plus_constant (GET_MODE (addr), XEXP (addr, 0), adjustment);
> +      break;
> +    }
> +    default:
> +      gcc_unreachable ();
> +    }
> +
> +  return change_address (mem, GET_MODE (mem), addr);
> +}
> +
> +// Convenience wrapper around strip_offset that can also look
> +// through {PRE,POST}_MODIFY.
> +static rtx ldp_strip_offset (rtx mem, rtx *modify, poly_int64 *offset)
> +{
> +  gcc_checking_assert (MEM_P (mem));
> +
> +  rtx base = strip_offset (XEXP (mem, 0), offset);
> +
> +  if (side_effects_p (base))
> +    *modify = base;
> +
> +  switch (GET_CODE (base))
> +    {
> +    case PRE_MODIFY:
> +    case POST_MODIFY:
> +      base = strip_offset (XEXP (base, 1), offset);
> +      gcc_checking_assert (REG_P (base));
> +      gcc_checking_assert (rtx_equal_p (XEXP (*modify, 0), base));
> +      break;
> +    case PRE_INC:
> +    case POST_INC:
> +      base = XEXP (base, 0);
> +      *offset = GET_MODE_SIZE (GET_MODE (mem));
> +      gcc_checking_assert (REG_P (base));
> +      break;
> +    case PRE_DEC:
> +    case POST_DEC:
> +      base = XEXP (base, 0);
> +      *offset = -GET_MODE_SIZE (GET_MODE (mem));
> +      gcc_checking_assert (REG_P (base));
> +      break;
> +
> +    default:
> +      gcc_checking_assert (!side_effects_p (base));
> +    }
> +
> +  return base;
> +}

Is the first strip_offset expected to fire for the side-effects case?
If so, then I suppose the switch should be adding to the offset,
rather than overwriting it, since the original offset will be lost.

If the first strip_offset doesn't fire for side effects (my guess,
since autoinc addresses must be top-level addresses), it's probably
easier to drop the modify argument and move the strip_offset into the
default case.
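
i.e. something like this (untested sketch; callers that still need the
autoinc rtx itself can keep checking side_effects_p on XEXP (mem, 0)):

  static rtx
  ldp_strip_offset (rtx mem, poly_int64 *offset)
  {
    rtx addr = XEXP (mem, 0);
    switch (GET_CODE (addr))
      {
      case PRE_MODIFY:
      case POST_MODIFY:
        // The new base value is operand 1 (a plus).
        return strip_offset (XEXP (addr, 1), offset);
      case PRE_INC:
      case POST_INC:
        *offset = GET_MODE_SIZE (GET_MODE (mem));
        return XEXP (addr, 0);
      case PRE_DEC:
      case POST_DEC:
        *offset = -GET_MODE_SIZE (GET_MODE (mem));
        return XEXP (addr, 0);
      default:
        return strip_offset (addr, offset);
      }
  }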

> +
> +static bool
> +any_pre_modify_p (rtx x)
> +{
> +  const auto code = GET_CODE (x);
> +  return code == PRE_INC || code == PRE_DEC || code == PRE_MODIFY;
> +}
> +
> +static bool
> +any_post_modify_p (rtx x)
> +{
> +  const auto code = GET_CODE (x);
> +  return code == POST_INC || code == POST_DEC || code == POST_MODIFY;
> +}
> +
> +static bool
> +ldp_operand_mode_ok_p (machine_mode mode)

Missing function comment (also elsewhere in the file).  Here I think
the comment serves a purpose, since the question isn't really whether
an ldp is ok in the sense of valid, but ok in the sense of a good idea.

> +{
> +  const bool allow_qregs
> +    = !(aarch64_tune_params.extra_tuning_flags
> +	& AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS);
> +
> +  if (!aarch64_ldpstp_operand_mode_p (mode))
> +    return false;
> +
> +  const auto size = GET_MODE_SIZE (mode).to_constant ();
> +  if (size == 16 && !allow_qregs)
> +    return false;
> +
> +  return reload_completed || mode != E_TImode;

I think this last condition deserves a comment.  I agree the condition
is correct (because we don't know whether TImode values are natural GPRs
or FPRs), but I only remember because you mentioned it relatively recently.

E_ prefixes should generally only be used in switches though.  Same for
the rest of the file.

> +}
> +
> +static int
> +encode_lfs (lfs_fields fields)
> +{
> +  int size_log2 = exact_log2 (fields.size);
> +  gcc_checking_assert (size_log2 >= 2 && size_log2 <= 4);
> +  return ((int)fields.load_p << 3)
> +    | ((int)fields.fpsimd_p << 2)
> +    | (size_log2 - 2);
> +}
> +
> +static lfs_fields
> +decode_lfs (int lfs)
> +{
> +  bool load_p = (lfs & (1 << 3));
> +  bool fpsimd_p = (lfs & (1 << 2));
> +  unsigned size = 1U << ((lfs & 3) + 2);
> +  return { load_p, fpsimd_p, size };
> +}
> +
> +template<typename Alloc>
> +void
> +access_group::track (Alloc alloc_node, poly_int64 offset, insn_info *insn)
> +{
> +  auto insert_before = [&](std::list<access_record>::iterator after)
> +    {
> +      auto it = list.emplace (after, offset);
> +      it->cand_insns.push_back (insn);
> +      it->place = it;
> +      return &*it;
> +    };
> +
> +  if (!list.size ())
> +    {
> +      auto access = insert_before (list.end ());
> +      tree.insert_max_node (alloc_node (access));
> +      return;
> +    }
> +
> +  auto compare = [&](splay_tree_node<access_record *> *node)
> +    {
> +      return compare_sizes_for_sort (offset, node->value ()->offset);
> +    };
> +  auto result = tree.lookup (compare);
> +  splay_tree_node<access_record *> *node = tree.root ();
> +  if (result == 0)
> +    node->value ()->cand_insns.push_back (insn);
> +  else
> +    {
> +      auto it = node->value ()->place;
> +      auto after = (result > 0) ? std::next (it) : it;
> +      auto access = insert_before (after);
> +      tree.insert_child (node, result > 0, alloc_node (access));
> +    }
> +}
> +
> +bool
> +ldp_bb_info::track_via_mem_expr (insn_info *insn, rtx mem, lfs_fields lfs)
> +{
> +  if (!MEM_EXPR (mem) || !MEM_OFFSET_KNOWN_P (mem))
> +    return false;
> +
> +  poly_int64 offset;
> +  tree base_expr = get_addr_base_and_unit_offset (MEM_EXPR (mem),
> +						  &offset);
> +  if (!base_expr || !DECL_P (base_expr))
> +    return false;
> +
> +  offset += MEM_OFFSET (mem);
> +
> +  const machine_mode mem_mode = GET_MODE (mem);
> +  const HOST_WIDE_INT mem_size = GET_MODE_SIZE (mem_mode).to_constant ();
> +
> +  // Punt on misaligned offsets.
> +  if (offset.coeffs[0] & (mem_size - 1))

!multiple_p (offset, mem_size)

It's probably worth adding a comment to say that we reject unaligned
offsets because they are likely to lead to invalid LDP/STP addresses
(rather than for optimisation reasons).
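
e.g. (sketch):

  // Punt on offsets that aren't a multiple of the access size; these
  // are unlikely to form valid (unscaled) LDP/STP addresses, so this
  // is a correctness precaution rather than an optimization decision.
  if (!multiple_p (offset, mem_size))
    return false;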

> +    return false;
> +
> +  const auto key = std::make_pair (base_expr, encode_lfs (lfs));
> +  access_group &group = expr_map.get_or_insert (key, NULL);
> +  auto alloc = [&](access_record *access) { return node_alloc (access); };
> +  group.track (alloc, offset, insn);
> +
> +  if (dump_file)
> +    {
> +      fprintf (dump_file, "[bb %u] tracking insn %d via ",
> +	       m_bb->index (), insn->uid ());
> +      print_node_brief (dump_file, "mem expr", base_expr, 0);
> +      fprintf (dump_file, " [L=%d FP=%d, %smode, off=",
> +	       lfs.load_p, lfs.fpsimd_p, mode_name[mem_mode]);
> +      print_dec (offset, dump_file);
> +      fprintf (dump_file, "]\n");
> +    }
> +
> +  return true;
> +}
> +
> +// Return true if X is a constant zero operand.  N.B. this matches the
> +// {w,x}zr check in aarch64_print_operand, the logic in the predicate
> +// aarch64_stp_reg_operand, and the constraints on the pair patterns.
> +static bool const_zero_op_p (rtx x)

Nit: new line after bool.  A few instances later too.

Could we put this somewhere that's easily accessible by everything
that wants to test it?  It's cropped up a few times in the series.

> +{
> +  return x == CONST0_RTX (GET_MODE (x))
> +    || (CONST_DOUBLE_P (x) && aarch64_float_const_zero_rtx_p (x));
> +}
> +
> +void
> +ldp_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
> +{
> +  // We can't combine volatile MEMs, so punt on these.
> +  if (MEM_VOLATILE_P (mem))
> +    return;
> +
> +  // Ignore writeback accesses if the param says to do so.
> +  if (!aarch64_ldp_writeback && side_effects_p (XEXP (mem, 0)))
> +    return;
> +
> +  const machine_mode mem_mode = GET_MODE (mem);
> +  if (!ldp_operand_mode_ok_p (mem_mode))
> +    return;
> +
> +  // Note ldp_operand_mode_ok_p already rejected VL modes.
> +  const HOST_WIDE_INT mem_size = GET_MODE_SIZE (mem_mode).to_constant ();
> +
> +  rtx reg_op = XEXP (PATTERN (insn->rtl ()), !load_p);
> +
> +  // Is this an FP/SIMD access?  Note that constant zero operands
> +  // use an integer zero register ({w,x}zr).
> +  const bool fpsimd_op_p
> +    = GET_MODE_CLASS (mem_mode) != MODE_INT
> +      && (load_p || !const_zero_op_p (reg_op));
> +
> +  // N.B. we only want to segregate FP/SIMD accesses from integer accesses
> +  // before RA.
> +  const bool fpsimd_bit_p = !reload_completed && fpsimd_op_p;

But after RA, shouldn't we segregate them based on the hard register?
In other words, the natural condition after RA seems to be:

  REG_P (reg) && FP_REGNUM_P (REGNO (reg))
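
i.e. perhaps (illustrative only, reusing the names above):

  // Before RA, segregate on the mode class of the access; after RA,
  // segregate on the register file actually chosen.
  const bool fpsimd_bit_p
    = reload_completed
      ? (REG_P (reg_op) && FP_REGNUM_P (REGNO (reg_op)))
      : fpsimd_op_p;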

> +  const lfs_fields lfs = { load_p, fpsimd_bit_p, mem_size };
> +
> +  if (track_via_mem_expr (insn, mem, lfs))
> +    return;
> +
> +  poly_int64 mem_off;
> +  rtx modify = NULL_RTX;
> +  rtx base = ldp_strip_offset (mem, &modify, &mem_off);
> +  if (!REG_P (base))
> +    return;
> +
> +  // Need to calculate two (possibly different) offsets:
> +  //  - Offset at which the access occurs.
> +  //  - Offset of the new base def.
> +  poly_int64 access_off;
> +  if (modify && any_post_modify_p (modify))
> +    access_off = 0;
> +  else
> +    access_off = mem_off;
> +
> +  poly_int64 new_def_off = mem_off;
> +
> +  // Punt on accesses relative to the eliminable regs: since we don't
> +  // know the elimination offset pre-RA, we should postpone forming
> +  // pairs on such accesses until after RA.
> +  if (!reload_completed
> +      && (REGNO (base) == FRAME_POINTER_REGNUM
> +	  || REGNO (base) == ARG_POINTER_REGNUM))
> +    return;

Is this still an issue with the new representation of LDPs and STPs?
If so, it seems like there's more to it than the reason given in the
comments.

> +
> +  // Now need to find def of base register.
> +  def_info *base_def;
> +  use_info *base_use = find_access (insn->uses (), REGNO (base));
> +  gcc_assert (base_use);
> +  base_def = base_use->def ();
> +  if (!base_def)
> +    {
> +      if (dump_file)
> +	fprintf (dump_file,
> +		 "base register (regno %d) of insn %d is undefined",
> +		 REGNO (base), insn->uid ());
> +      return;
> +    }

Does the track_via_mem_expr need to happen in its current position?
I wasn't sure why it was done relatively early, given the early-outs above.

> +
> +  alt_base *canon_base = canon_base_map.get (base_def);
> +  if (canon_base)
> +    {
> +      // Express this as the combined offset from the canonical base.
> +      base_def = canon_base->base;
> +      new_def_off += canon_base->offset;
> +      access_off += canon_base->offset;
> +    }
> +
> +  if (modify)
> +    {
> +      auto def = find_access (insn->defs (), REGNO (base));
> +      gcc_assert (def);
> +
> +      // Record that DEF = BASE_DEF + MEM_OFF.
> +      if (dump_file)
> +	{
> +	  pretty_printer pp;
> +	  pp_access (&pp, def, 0);
> +	  pp_string (&pp, " = ");
> +	  pp_access (&pp, base_def, 0);
> +	  fprintf (dump_file, "[bb %u] recording %s + ",
> +		   m_bb->index (), pp_formatted_text (&pp));
> +	  print_dec (new_def_off, dump_file);
> +	  fprintf (dump_file, "\n");
> +	}
> +
> +      alt_base base_rec { base_def, new_def_off };
> +      if (canon_base_map.put (def, base_rec))
> +	gcc_unreachable (); // Base defs should be unique.
> +    }
> +
> +  // Punt on misaligned offsets.
> +  if (mem_off.coeffs[0] & (mem_size - 1))

!multiple_p here too

> +    return;
> +
> +  const auto key = std::make_pair (base_def, encode_lfs (lfs));
> +  access_group &group = def_map.get_or_insert (key, NULL);
> +  auto alloc = [&](access_record *access) { return node_alloc (access); };
> +  group.track (alloc, access_off, insn);
> +
> +  if (dump_file)
> +    {
> +      pretty_printer pp;
> +      pp_access (&pp, base_def, 0);
> +
> +      fprintf (dump_file, "[bb %u] tracking insn %d via %s",
> +	       m_bb->index (), insn->uid (), pp_formatted_text (&pp));
> +      fprintf (dump_file,
> +	       " [L=%d, WB=%d, FP=%d, %smode, off=",
> +	       lfs.load_p, !!modify, lfs.fpsimd_p, mode_name[mem_mode]);
> +      print_dec (access_off, dump_file);
> +      fprintf (dump_file, "]\n");
> +    }
> +}
> +
> +// Dummy predicate that never ignores any insns.
> +static bool no_ignore (insn_info *) { return false; }
> +
> +// Return the latest dataflow hazard before INSN.
> +//
> +// If IGNORE is non-NULL, this points to a sub-rtx which we should
> +// ignore for dataflow purposes.  This is needed when considering
> +// changing the RTL base of an access discovered through a MEM_EXPR
> +// base.
> +//
> +// N.B. we ignore any defs/uses of memory here as we deal with that
> +// separately, making use of alias disambiguation.
> +static insn_info *
> +latest_hazard_before (insn_info *insn, rtx *ignore,
> +		      insn_info *ignore_insn = nullptr)
> +{
> +  insn_info *result = nullptr;
> +
> +  // Return true if we registered the hazard.
> +  auto hazard = [&](insn_info *h) -> bool
> +    {
> +      gcc_checking_assert (*h < *insn);
> +      if (h == ignore_insn)
> +	return false;
> +
> +      if (!result || *h > *result)
> +	result = h;
> +
> +      return true;
> +    };
> +
> +  rtx pat = PATTERN (insn->rtl ());
> +  auto ignore_use = [&](use_info *u)
> +    {
> +      if (u->is_mem ())
> +	return true;
> +
> +      return !refers_to_regno_p (u->regno (), u->regno () + 1, pat, ignore);
> +    };
> +
> +  // Find defs of uses in INSN (RaW).
> +  for (auto use : insn->uses ())
> +    if (!ignore_use (use) && use->def ())
> +      hazard (use->def ()->insn ());
> +
> +  // Find previous defs (WaW) or previous uses (WaR) of defs in INSN.
> +  for (auto def : insn->defs ())
> +    {
> +      if (def->is_mem ())
> +	continue;
> +
> +      if (def->prev_def ())
> +	{
> +	  hazard (def->prev_def ()->insn ()); // WaW
> +
> +	  auto set = dyn_cast <set_info *> (def->prev_def ());
> +	  if (set && set->has_nondebug_insn_uses ())
> +	    for (auto use : set->reverse_nondebug_insn_uses ())
> +	      if (use->insn () != insn && hazard (use->insn ())) // WaR
> +		break;
> +	}
> +
> +      if (!HARD_REGISTER_NUM_P (def->regno ()))
> +	continue;
> +
> +      // Also need to check backwards for call clobbers (WaW).
> +      for (auto call_group : def->ebb ()->call_clobbers ())
> +	{
> +	  if (!call_group->clobbers (def->resource ()))
> +	    continue;
> +
> +	  auto clobber_insn = prev_call_clobbers_ignoring (*call_group,
> +							   def->insn (),
> +							   no_ignore);
> +	  if (clobber_insn)
> +	    hazard (clobber_insn);
> +	}
> +
> +    }
> +
> +  return result;
> +}
> +
> +static insn_info *
> +first_hazard_after (insn_info *insn, rtx *ignore)
> +{
> +  insn_info *result = nullptr;
> +  auto hazard = [insn, &result](insn_info *h)
> +    {
> +      gcc_checking_assert (*h > *insn);
> +      if (!result || *h < *result)
> +	result = h;
> +    };
> +
> +  rtx pat = PATTERN (insn->rtl ());
> +  auto ignore_use = [&](use_info *u)
> +    {
> +      if (u->is_mem ())
> +	return true;
> +
> +      return !refers_to_regno_p (u->regno (), u->regno () + 1, pat, ignore);
> +    };
> +
> +  for (auto def : insn->defs ())
> +    {
> +      if (def->is_mem ())
> +	continue;
> +
> +      if (def->next_def ())
> +	hazard (def->next_def ()->insn ()); // WaW
> +
> +      auto set = dyn_cast <set_info *> (def);
> +      if (set && set->has_nondebug_insn_uses ())
> +	hazard (set->first_nondebug_insn_use ()->insn ()); // RaW
> +
> +      if (!HARD_REGISTER_NUM_P (def->regno ()))
> +	continue;
> +
> +      // Also check for call clobbers of this def (WaW).
> +      for (auto call_group : def->ebb ()->call_clobbers ())
> +	{
> +	  if (!call_group->clobbers (def->resource ()))
> +	    continue;
> +
> +	  auto clobber_insn = next_call_clobbers_ignoring (*call_group,
> +							   def->insn (),
> +							   no_ignore);
> +	  if (clobber_insn)
> +	    hazard (clobber_insn);
> +	}
> +    }
> +
> +  // Find any subsequent defs of uses in INSN (WaR).
> +  for (auto use : insn->uses ())
> +    {
> +      if (ignore_use (use))
> +	continue;
> +
> +      if (use->def ())
> +	{
> +	  auto def = use->def ()->next_def ();
> +	  if (def && def->insn () == insn)
> +	    def = def->next_def ();
> +
> +	  if (def)
> +	    hazard (def->insn ());
> +	}
> +
> +      if (!HARD_REGISTER_NUM_P (use->regno ()))
> +	continue;
> +
> +      // Also need to handle call clobbers of our uses (again WaR).
> +      //
> +      // See restrict_movement_for_uses_ignoring for why we don't
> +      // need to check backwards for call clobbers.
> +      for (auto call_group : use->ebb ()->call_clobbers ())
> +	{
> +	  if (!call_group->clobbers (use->resource ()))
> +	    continue;
> +
> +	  auto clobber_insn = next_call_clobbers_ignoring (*call_group,
> +							   use->insn (),
> +							   no_ignore);
> +	  if (clobber_insn)
> +	    hazard (clobber_insn);
> +	}
> +    }
> +
> +  return result;
> +}
> +
> +
> +enum change_strategy {
> +  CHANGE,
> +  DELETE,
> +  TOMBSTONE,
> +};
> +
> +// Given a change_strategy S, convert it to a string (for output in the
> +// dump file).
> +static const char *cs_to_string (change_strategy s)
> +{
> +#define C(x) case x: return #x
> +  switch (s)
> +    {
> +      C (CHANGE);
> +      C (DELETE);
> +      C (TOMBSTONE);
> +    }
> +#undef C
> +  gcc_unreachable ();
> +}
> +
> +// TODO: should this live in RTL-SSA?
> +static bool
> +ranges_overlap_p (const insn_range_info &r1, const insn_range_info &r2)
> +{
> +  // If either range is empty, then their intersection is empty.
> +  if (!r1 || !r2)
> +    return false;
> +
> +  // When do they not overlap? When one range finishes before the other
> +  // starts, i.e. (*r1.last < *r2.first || *r2.last < *r1.first).
> +  // Inverting this, we get the below.
> +  return *r1.last >= *r2.first && *r2.last >= *r1.first;
> +}
> +
> +// Get the range of insns that def feeds.
> +static insn_range_info get_def_range (def_info *def)
> +{
> +  insn_info *last = def->next_def ()->insn ()->prev_nondebug_insn ();
> +  return { def->insn (), last };
> +}
> +
> +// Given a def (of memory), return the downwards range within which we
> +// can safely move this def.
> +static insn_range_info
> +def_downwards_move_range (def_info *def)
> +{
> +  auto range = get_def_range (def);
> +
> +  auto set = dyn_cast <set_info *> (def);
> +  if (!set || !set->has_any_uses ())
> +    return range;
> +
> +  auto use = set->first_nondebug_insn_use ();
> +  if (use)
> +    range = move_earlier_than (range, use->insn ());
> +
> +  return range;
> +}
> +
> +// Given a def (of memory), return the upwards range within which we can
> +// safely move this def.
> +static insn_range_info
> +def_upwards_move_range (def_info *def)
> +{
> +  def_info *prev = def->prev_def ();
> +  insn_range_info range { prev->insn (), def->insn () };
> +
> +  auto set = dyn_cast <set_info *> (prev);
> +  if (!set || !set->has_any_uses ())
> +    return range;
> +
> +  auto use = set->last_nondebug_insn_use ();
> +  if (use)
> +    range = move_later_than (range, use->insn ());
> +
> +  return range;
> +}
> +
> +static def_info *
> +decide_stp_strategy (change_strategy strategy[2],
> +		     insn_info *first,
> +		     insn_info *second,
> +		     const insn_range_info &move_range)
> +{
> +  strategy[0] = CHANGE;
> +  strategy[1] = DELETE;
> +
> +  unsigned viable = 0;
> +  viable |= move_range.includes (first);
> +  viable |= ((unsigned) move_range.includes (second)) << 1;
> +
> +  def_info * const defs[2] = {
> +    memory_access (first->defs ()),
> +    memory_access (second->defs ())
> +  };
> +  if (defs[0] == defs[1])
> +    viable = 3; // No intervening store, either is viable.

Can this happen?  If the first and second insns are different then their
definitions should be too.

> +
> +  if (!(viable & 1)
> +      && ranges_overlap_p (move_range, def_downwards_move_range (defs[0])))
> +    viable |= 1;
> +  if (!(viable & 2)
> +      && ranges_overlap_p (move_range, def_upwards_move_range (defs[1])))
> +    viable |= 2;
> +
> +  if (viable == 2)
> +    std::swap (strategy[0], strategy[1]);
> +  else if (!viable)
> +    // Tricky case: need to delete both accesses.
> +    strategy[0] = DELETE;
> +
> +  for (int i = 0; i < 2; i++)
> +    {
> +      if (strategy[i] != DELETE)
> +	continue;
> +
> +      // See if we can get away without a tombstone.
> +      auto set = dyn_cast <set_info *> (defs[i]);
> +      if (!set || !set->has_any_uses ())
> +	continue; // We can indeed.
> +
> +      // If both sides are viable for re-purposing, and the other store's
> +      // def doesn't have any uses, then we can delete the other store
> +      // and re-purpose this store instead.
> +      if (viable == 3)
> +	{
> +	  gcc_assert (strategy[!i] == CHANGE);
> +	  auto other_set = dyn_cast <set_info *> (defs[!i]);
> +	  if (!other_set || !other_set->has_any_uses ())
> +	    {
> +	      strategy[i] = CHANGE;
> +	      strategy[!i] = DELETE;
> +	      break;
> +	    }
> +	}
> +
> +      // Alas, we need a tombstone after all.
> +      strategy[i] = TOMBSTONE;
> +    }

I think it's a bug in RTL-SSA that we have mem defs without uses,
or it's at least something that should be changed.  So probably best
to delete the loop and stick with the result of the range comparisons.

> +
> +  for (int i = 0; i < 2; i++)
> +    if (strategy[i] == CHANGE)
> +      return defs[i];
> +
> +  return nullptr;
> +}
> +
> +static GTY(()) rtx tombstone = NULL_RTX;
> +
> +// Generate the RTL pattern for a "tombstone"; used temporarily
> +// during this pass to replace stores that are marked for deletion
> +// where we can't immediately delete the store (e.g. if there are uses
> +// hanging off its def of memory).
> +//
> +// These are deleted at the end of the pass and uses re-parented
> +// appropriately at this point.
> +static rtx
> +gen_tombstone (void)
> +{
> +  if (!tombstone)
> +    {
> +      tombstone = gen_rtx_CLOBBER (VOIDmode,
> +				   gen_rtx_MEM (BLKmode,
> +						gen_rtx_SCRATCH (Pmode)));
> +      return tombstone;
> +    }
> +
> +  return copy_rtx (tombstone);
> +}
> +
> +static bool
> +tombstone_insn_p (insn_info *insn)
> +{
> +  rtx x = tombstone ? tombstone : gen_tombstone ();
> +  return rtx_equal_p (PATTERN (insn->rtl ()), x);
> +}

It's probably safer to check by insn uid, since the pattern could
in principle occur as a pre-existing barrier.
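
For example (an illustrative alternative, not the patch's code): record the
uids of the tombstones we create and test membership instead of comparing
patterns:

  // e.g. as part of the pass/bb state:
  auto_bitmap m_tombstone_bitmap;

  // ...when turning an insn into a tombstone:
  //   bitmap_set_bit (m_tombstone_bitmap, insns[i]->uid ());

  // ...and the query becomes:
  bool tombstone_insn_p (insn_info *insn)
  {
    return bitmap_bit_p (m_tombstone_bitmap, insn->uid ());
  }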

> +
> +static machine_mode
> +aarch64_operand_mode_for_pair_mode (machine_mode mode)
> +{
> +  switch (mode)
> +    {
> +    case E_V2x4QImode:
> +      return E_SImode;
> +    case E_V2x8QImode:
> +      return E_DImode;
> +    case E_V2x16QImode:
> +      return E_V16QImode;
> +    default:
> +      gcc_unreachable ();
> +    }
> +}
> +
> +static rtx
> +filter_notes (rtx note, rtx result, bool *eh_region)
> +{
> +  for (; note; note = XEXP (note, 1))
> +    {
> +      switch (REG_NOTE_KIND (note))
> +	{
> +	  case REG_EQUAL:
> +	  case REG_EQUIV:
> +	  case REG_DEAD:
> +	  case REG_UNUSED:
> +	  case REG_NOALIAS:
> +	    // These can all be dropped.  For REG_EQU{AL,IV} they
> +	    // cannot apply to non-single_set insns, and
> +	    // REG_{DEAD,UNUSED} are re-computed by RTl-SSA, see
> +	    // rtl-ssa/changes.cc:update_notes.

AFAIK, REG_DEAD isn't recomputed, only REG_UNUSED is.  But REG_DEADs are
allowed to bit-rot.

> +	    //
> +	    // Similarly, REG_NOALIAS cannot apply to a parallel.
> +	  case REG_INC:
> +	    // When we form the pair insn, the reg update is implemented
> +	    // as just another SET in the parallel, so isn't really an
> +	    // auto-increment in the RTL sense, hence we drop the note.
> +	    break;
> +	  case REG_EH_REGION:
> +	    gcc_assert (!*eh_region);
> +	    *eh_region = true;
> +	    result = alloc_reg_note (REG_EH_REGION, XEXP (note, 0), result);
> +	    break;
> +	  case REG_CFA_DEF_CFA:
> +	  case REG_CFA_OFFSET:
> +	  case REG_CFA_RESTORE:
> +	    result = alloc_reg_note (REG_NOTE_KIND (note),
> +				     copy_rtx (XEXP (note, 0)),
> +				     result);
> +	    break;
> +	  default:
> +	    // Unexpected REG_NOTE kind.
> +	    gcc_unreachable ();

Nit: cases should be indented to the same column as the "{".

> +	}
> +    }
> +
> +  return result;
> +}
> +
> +// Ensure we have a sensible scheme for combining REG_NOTEs
> +// given two candidate insns I1 and I2.
> +static rtx
> +combine_reg_notes (insn_info *i1, insn_info *i2, rtx writeback, bool &ok)
> +{
> +  if ((writeback && find_reg_note (i1->rtl (), REG_CFA_DEF_CFA, NULL_RTX))
> +      || find_reg_note (i2->rtl (), REG_CFA_DEF_CFA, NULL_RTX))
> +    {
> +      // CFA_DEF_CFA notes apply to the first set of the PARALLEL,
> +      // so we can only preserve them in the non-writeback case, in
> +      // the case that the note is attached to the lower access.

I thought that was only true if the note didn't provide some information:

	n = XEXP (note, 0);
	if (n == NULL)
	  n = single_set (insn);
	dwarf2out_frame_debug_cfa_offset (n);
	handled_one = true;

The aarch64 notes should be self-contained, so the note expression should
never be null.

If the notes are self-contained, they should be interpreted serially.
So I'd have expected everything to work out if we preserve the original
order (including order between instructions).

It's OK to punt anyway, of course.  I just wasn't sure about the reason
in the comments.

> +      if (dump_file)
> +	fprintf (dump_file,
> +		 "(%d,%d,WB=%d): can't preserve CFA_DEF_CFA note, punting\n",
> +		 i1->uid (), i2->uid (), !!writeback);
> +      ok = false;
> +      return NULL_RTX;
> +    }
> +
> +  bool found_eh_region = false;
> +  rtx result = NULL_RTX;
> +  result = filter_notes (REG_NOTES (i1->rtl ()), result, &found_eh_region);
> +  return filter_notes (REG_NOTES (i2->rtl ()), result, &found_eh_region);
> +}
> +
> +// Given two memory accesses, at least one of which is of a writeback form,
> +// extract two non-writeback memory accesses addressed relative to the initial
> +// value of the base register, and output these in PATS.  Return an rtx that
> +// represents the overall change to the base register.
> +static rtx
> +extract_writebacks (bool load_p, rtx pats[2], int changed)

Wasn't clear at first from the comment that PATS is also where the
initial memory access insns come from.

> +{
> +  rtx base_reg = NULL_RTX;
> +  poly_int64 current_offset = 0;
> +
> +  poly_int64 offsets[2];
> +
> +  for (int i = 0; i < 2; i++)
> +    {
> +      rtx mem = XEXP (pats[i], load_p);
> +      rtx reg = XEXP (pats[i], !load_p);
> +
> +      rtx modify = NULL_RTX;
> +      poly_int64 offset;
> +      rtx this_base = ldp_strip_offset (mem, &modify, &offset);
> +      gcc_assert (REG_P (this_base));
> +      if (base_reg)
> +	gcc_assert (rtx_equal_p (base_reg, this_base));
> +      else
> +	base_reg = this_base;
> +
> +      // If we changed base for the current insn, then we already
> +      // derived the correct mem for this insn from the effective
> +      // address of the other access.
> +      if (i == changed)
> +	{
> +	  gcc_checking_assert (!modify);
> +	  offsets[i] = offset;
> +	  continue;
> +	}
> +
> +      if (modify && any_pre_modify_p (modify))
> +	current_offset += offset;
> +
> +      poly_int64 this_off = current_offset;
> +      if (!modify)
> +	this_off += offset;
> +
> +      offsets[i] = this_off;
> +      rtx new_mem = change_address (mem, GET_MODE (mem),
> +				    plus_constant (GET_MODE (base_reg),
> +						   base_reg, this_off));
> +      pats[i] = load_p
> +	? gen_rtx_SET (reg, new_mem)
> +	: gen_rtx_SET (new_mem, reg);
> +
> +      if (modify && any_post_modify_p (modify))
> +	current_offset += offset;
> +    }
> +
> +  if (known_eq (current_offset, 0))
> +    return NULL_RTX;
> +
> +  return gen_rtx_SET (base_reg, plus_constant (GET_MODE (base_reg),
> +					       base_reg, current_offset));
> +}
> +
> +static insn_info *
> +find_trailing_add (insn_info *insns[2],
> +		   const insn_range_info &pair_range,
> +		   rtx *writeback_effect,
> +		   def_info **add_def,
> +		   def_info *base_def,
> +		   poly_int64 initial_offset,
> +		   unsigned access_size)
> +{
> +  insn_info *pair_insn = insns[1];
> +
> +  def_info *def = base_def->next_def ();
> +
> +  while (def
> +	 && def->bb () == pair_insn->bb ()
> +	 && *(def->insn ()) <= *pair_insn)
> +    def = def->next_def ();

I don't understand the loop.  Why's it OK to skip over intervening defs?

> +
> +  if (!def || def->bb () != pair_insn->bb ())
> +    return nullptr;
> +
> +  insn_info *cand = def->insn ();
> +  const auto base_regno = base_def->regno ();
> +
> +  // If CAND doesn't also use our base register,
> +  // it can't destructively update it.
> +  if (!find_access (cand->uses (), base_regno))
> +    return nullptr;
> +
> +  auto rti = cand->rtl ();
> +
> +  if (!INSN_P (rti))
> +    return nullptr;
> +
> +  auto pat = PATTERN (rti);
> +  if (GET_CODE (pat) != SET)
> +    return nullptr;
> +
> +  auto dest = XEXP (pat, 0);
> +  if (!REG_P (dest) || REGNO (dest) != base_regno)
> +    return nullptr;
> +
> +  poly_int64 offset;
> +  rtx rhs_base = strip_offset (XEXP (pat, 1), &offset);
> +  if (!REG_P (rhs_base)
> +      || REGNO (rhs_base) != base_regno
> +      || !offset.is_constant ())
> +    return nullptr;
> +
> +  // If the initial base offset is zero, we can handle any add offset
> +  // (post-inc).  Otherwise, we require the offsets to match (pre-inc).
> +  if (!known_eq (initial_offset, 0) && !known_eq (offset, initial_offset))
> +    return nullptr;
> +
> +  auto off_hwi = offset.to_constant ();
> +
> +  if (off_hwi % access_size != 0)
> +    return nullptr;
> +
> +  off_hwi /= access_size;
> +
> +  if (off_hwi < LDP_MIN_IMM || off_hwi > LDP_MAX_IMM)
> +    return nullptr;
> +
> +  insn_info *pair_dst = pair_range.singleton ();
> +  gcc_assert (pair_dst);
> +
> +  auto dump_prefix = [&]()
> +    {
> +      if (!insns[0])
> +	fprintf (dump_file, "existing pair i%d: ", insns[1]->uid ());
> +      else
> +	fprintf (dump_file, "  (%d,%d)",
> +		 insns[0]->uid (), insns[1]->uid ());
> +    };
> +
> +  insn_info *hazard = latest_hazard_before (cand, nullptr, pair_insn);
> +  if (!hazard || *hazard <= *pair_dst)
> +    {
> +      if (dump_file)
> +	{
> +	  dump_prefix ();
> +	  fprintf (dump_file,
> +		   "folding in trailing add (%d) to use writeback form\n",
> +		   cand->uid ());
> +	}
> +
> +      *add_def = def;
> +      *writeback_effect = copy_rtx (pat);
> +      return cand;
> +    }
> +
> +  if (dump_file)
> +    {
> +      dump_prefix ();
> +      fprintf (dump_file,
> +	       "can't fold in trailing add (%d), hazard = %d\n",
> +	       cand->uid (), hazard->uid ());
> +    }
> +
> +  return nullptr;
> +}
> +
> +// Try and actually fuse the pair given by insns I1 and I2.
> +static bool
> +fuse_pair (bool load_p,
> +	   unsigned access_size,
> +	   int writeback,
> +	   insn_info *i1,
> +	   insn_info *i2,
> +	   base_cand &base,
> +	   const insn_range_info &move_range,
> +	   bool &emitted_tombstone_p)
> +{
> +  auto attempt = crtl->ssa->new_change_attempt ();
> +
> +  auto make_change = [&attempt](insn_info *insn)
> +    {
> +      return crtl->ssa->change_alloc <insn_change> (attempt, insn);
> +    };
> +  auto make_delete = [&attempt](insn_info *insn)
> +    {
> +      return crtl->ssa->change_alloc <insn_change> (attempt,
> +						    insn,
> +						    insn_change::DELETE);
> +    };
> +
> +  // Are we using a tombstone insn for this pair?
> +  bool have_tombstone_p = false;
> +
> +  insn_info *first = (*i1 < *i2) ? i1 : i2;
> +  insn_info *second = (first == i1) ? i2 : i1;
> +
> +  insn_info *insns[2] = { first, second };
> +
> +  auto_vec <insn_change *> changes;
> +  changes.reserve (4);
> +
> +  rtx pats[2] = {
> +    PATTERN (first->rtl ()),
> +    PATTERN (second->rtl ())
> +  };
> +
> +  use_array input_uses[2] = { first->uses (), second->uses () };
> +  def_array input_defs[2] = { first->defs (), second->defs () };
> +
> +  int changed_insn = -1;
> +  if (base.from_insn != -1)
> +    {
> +      // If we're not already using a shared base, we need
> +      // to re-write one of the accesses to use the base from
> +      // the other insn.
> +      gcc_checking_assert (base.from_insn == 0 || base.from_insn == 1);
> +      changed_insn = !base.from_insn;
> +
> +      rtx base_pat = pats[base.from_insn];
> +      rtx change_pat = pats[changed_insn];
> +      rtx base_mem = XEXP (base_pat, load_p);
> +      rtx change_mem = XEXP (change_pat, load_p);
> +
> +      const bool lower_base_p = (insns[base.from_insn] == i1);
> +      HOST_WIDE_INT adjust_amt = access_size;
> +      if (!lower_base_p)
> +	adjust_amt *= -1;
> +
> +      rtx change_reg = XEXP (change_pat, !load_p);
> +      machine_mode mode_for_mem = GET_MODE (change_mem);
> +      rtx effective_base = drop_writeback (base_mem);
> +      rtx new_mem = adjust_address_nv (effective_base,
> +				       mode_for_mem,
> +				       adjust_amt);
> +      rtx new_set = load_p
> +	? gen_rtx_SET (change_reg, new_mem)
> +	: gen_rtx_SET (new_mem, change_reg);
> +
> +      pats[changed_insn] = new_set;
> +
> +      auto keep_use = [&](use_info *u)
> +	{
> +	  return refers_to_regno_p (u->regno (), u->regno () + 1,
> +				    change_pat, &XEXP (change_pat, load_p));
> +	};
> +
> +      // Drop any uses that only occur in the old address.
> +      input_uses[changed_insn] = filter_accesses (attempt,
> +						  input_uses[changed_insn],
> +						  keep_use);
> +    }
> +
> +  rtx writeback_effect = NULL_RTX;
> +  if (writeback)
> +    writeback_effect = extract_writebacks (load_p, pats, changed_insn);
> +
> +  const auto base_regno = base.m_def->regno ();
> +
> +  if (base.from_insn == -1 && (writeback & 1))
> +    {
> +      // If the first of the candidate insns had a writeback form, we'll need to
> +      // drop the use of the updated base register from the second insn's uses.
> +      //
> +      // N.B. we needn't worry about the base register occurring as a store
> +      // operand, as we checked that there was no non-address true dependence
> +      // between the insns in try_fuse_pair.
> +      gcc_checking_assert (find_access (input_uses[1], base_regno));
> +      input_uses[1] = check_remove_regno_access (attempt,
> +						 input_uses[1],
> +						 base_regno);
> +    }
> +
> +  // Go through and drop uses that only occur in register notes,
> +  // as we won't be preserving those.
> +  for (int i = 0; i < 2; i++)
> +    {
> +      auto rti = insns[i]->rtl ();
> +      if (!REG_NOTES (rti))
> +	continue;
> +
> +      input_uses[i] = remove_note_accesses (attempt, input_uses[i]);
> +    }
> +
> +  // Edge case: if the first insn is a writeback load and the
> +  // second insn is a non-writeback load which transfers into the base
> +  // register, then we should drop the writeback altogether as the
> +  // update of the base register from the second load should prevail.
> +  //
> +  // For example:
> +  //   ldr x2, [x1], #8
> +  //   ldr x1, [x1]
> +  //   -->
> +  //   ldp x2, x1, [x1]
> +  if (writeback == 1
> +      && load_p
> +      && find_access (input_defs[1], base_regno))
> +    {
> +      if (dump_file)
> +	fprintf (dump_file,
> +		 "  ldp: i%d has wb but subsequent i%d has non-wb "
> +		 "update of base (r%d), dropping wb\n",
> +		 insns[0]->uid (), insns[1]->uid (), base_regno);
> +      gcc_assert (writeback_effect);
> +      writeback_effect = NULL_RTX;
> +    }

What guarantees that there are no other uses of the writeback result?

> +
> +  // If both of the original insns had a writeback form, then we should drop the
> +  // first def.  The second def could well have uses, but the first def should
> +  // only be used by the second insn (and we dropped that use above).

Same question here I suppose: how's the single use condition enforced?

> +  if (writeback == 3)
> +    input_defs[0] = check_remove_regno_access (attempt,
> +					       input_defs[0],
> +					       base_regno);
> +
> +  // So far the patterns have been in instruction order,
> +  // now we want them in offset order.
> +  if (i1 != first)
> +    std::swap (pats[0], pats[1]);
> +
> +  poly_int64 offsets[2];
> +  for (int i = 0; i < 2; i++)
> +    {
> +      rtx mem = XEXP (pats[i], load_p);
> +      gcc_checking_assert (MEM_P (mem));
> +      rtx base = strip_offset (XEXP (mem, 0), offsets + i);
> +      gcc_checking_assert (REG_P (base));
> +      gcc_checking_assert (base_regno == REGNO (base));
> +    }
> +
> +  insn_info *trailing_add = nullptr;
> +  if (aarch64_ldp_writeback > 1 && !writeback_effect)
> +    {
> +      def_info *add_def;
> +      trailing_add = find_trailing_add (insns, move_range, &writeback_effect,
> +					&add_def, base.m_def, offsets[0],
> +					access_size);
> +      if (trailing_add && !writeback)
> +	{
> +	  // If there was no writeback to start with, we need to preserve the
> +	  // def of the base register from the add insn.
> +	  input_defs[0] = insert_access (attempt, add_def, input_defs[0]);
> +	  gcc_assert (input_defs[0].is_valid ());

How do we avoid doing that in the writeback!=0 case?

> +	}
> +    }
> +
> +  // If either of the original insns had writeback, but the resulting
> +  // pair insn does not (can happen e.g. in the ldp edge case above, or
> +  // if the writeback effects cancel out), then drop the def(s) of the
> +  // base register as appropriate.
> +  if (!writeback_effect)
> +    for (int i = 0; i < 2; i++)
> +      if (writeback & (1 << i))
> +	input_defs[i] = check_remove_regno_access (attempt,
> +						   input_defs[i],
> +						   base_regno);

Is there any scope for simplifying the writeback logic?  There seems
to be some overlap in intent between this loop and the previous
writeback==3 handling.

> +
> +  // Now that we know what base mem we're going to use, check if it's OK
> +  // with the ldp/stp policy.
> +  rtx first_mem = XEXP (pats[0], load_p);
> +  if (!aarch64_mem_ok_with_ldpstp_policy_model (first_mem,
> +						load_p,
> +						GET_MODE (first_mem)))
> +    {
> +      if (dump_file)
> +	fprintf (dump_file, "punting on pair (%d,%d), ldp/stp policy says no\n",
> +		 i1->uid (), i2->uid ());
> +      return false;
> +    }
> +
> +  bool reg_notes_ok = true;
> +  rtx reg_notes = combine_reg_notes (i1, i2, writeback_effect, reg_notes_ok);
> +  if (!reg_notes_ok)
> +    return false;
> +
> +  rtx pair_pat;
> +  if (writeback_effect)
> +    {
> +      auto patvec = gen_rtvec (3, writeback_effect, pats[0], pats[1]);
> +      pair_pat = gen_rtx_PARALLEL (VOIDmode, patvec);
> +    }
> +  else if (load_p)
> +    pair_pat = aarch64_gen_load_pair (XEXP (pats[0], 0),
> +				      XEXP (pats[1], 0),
> +				      XEXP (pats[0], 1));
> +  else
> +    pair_pat = aarch64_gen_store_pair (XEXP (pats[0], 0),
> +				       XEXP (pats[0], 1),
> +				       XEXP (pats[1], 1));
> +
> +  insn_change *pair_change = nullptr;
> +  auto set_pair_pat = [pair_pat,reg_notes](insn_change *change) {
> +      rtx_insn *rti = change->insn ()->rtl ();
> +      gcc_assert (validate_unshare_change (rti, &PATTERN (rti), pair_pat,
> +					   true));
> +      gcc_assert (validate_change (rti, &REG_NOTES (rti),
> +				   reg_notes, true));
> +  };
> +
> +  if (load_p)
> +    {
> +      changes.quick_push (make_delete (first));
> +      pair_change = make_change (second);
> +      changes.quick_push (pair_change);
> +
> +      pair_change->move_range = move_range;
> +      pair_change->new_defs = merge_access_arrays (attempt,
> +						   input_defs[0],
> +						   input_defs[1]);
> +      gcc_assert (pair_change->new_defs.is_valid ());
> +
> +      pair_change->new_uses
> +	= merge_access_arrays (attempt,
> +			       drop_memory_access (input_uses[0]),
> +			       drop_memory_access (input_uses[1]));
> +      gcc_assert (pair_change->new_uses.is_valid ());
> +      set_pair_pat (pair_change);
> +    }
> +  else
> +    {
> +      change_strategy strategy[2];
> +      def_info *stp_def = decide_stp_strategy (strategy, first, second,
> +					       move_range);
> +      if (dump_file)
> +	{
> +	  auto cs1 = cs_to_string (strategy[0]);
> +	  auto cs2 = cs_to_string (strategy[1]);
> +	  fprintf (dump_file,
> +		   "  stp strategy for candidate insns (%d,%d): (%s,%s)\n",
> +		   insns[0]->uid (), insns[1]->uid (), cs1, cs2);
> +	  if (stp_def)
> +	    fprintf (dump_file,
> +		     "  re-using mem def from insn %d\n",
> +		     stp_def->insn ()->uid ());
> +	}
> +
> +      insn_change *change;
> +      for (int i = 0; i < 2; i++)
> +	{
> +	  switch (strategy[i])
> +	    {
> +	    case DELETE:
> +	      changes.quick_push (make_delete (insns[i]));
> +	      break;
> +	    case TOMBSTONE:
> +	    case CHANGE:
> +	      change = make_change (insns[i]);
> +	      if (strategy[i] == CHANGE)
> +		{
> +		  set_pair_pat (change);
> +		  change->new_uses = merge_access_arrays (attempt,
> +							  input_uses[0],
> +							  input_uses[1]);
> +		  auto d1 = drop_memory_access (input_defs[0]);
> +		  auto d2 = drop_memory_access (input_defs[1]);
> +		  change->new_defs = merge_access_arrays (attempt, d1, d2);
> +		  gcc_assert (change->new_defs.is_valid ());
> +		  gcc_assert (stp_def);
> +		  change->new_defs = insert_access (attempt,
> +						    stp_def,
> +						    change->new_defs);
> +		  gcc_assert (change->new_defs.is_valid ());
> +		  change->move_range = move_range;
> +		  pair_change = change;
> +		}
> +	      else
> +		{
> +		  rtx_insn *rti = insns[i]->rtl ();
> +		  gcc_assert (validate_change (rti, &PATTERN (rti),
> +					       gen_tombstone (), true));
> +		  gcc_assert (validate_change (rti, &REG_NOTES (rti),
> +					       NULL_RTX, true));
> +		  change->new_uses = use_array (nullptr, 0);
> +		  have_tombstone_p = true;
> +		}
> +	      gcc_assert (change->new_uses.is_valid ());
> +	      changes.quick_push (change);
> +	      break;
> +	    }
> +	}
> +
> +      if (!stp_def)
> +	{
> +	  // Tricky case.  Cannot re-purpose existing insns for stp.
> +	  // Need to insert new insn.
> +	  if (dump_file)
> +	    fprintf (dump_file,
> +		     "  stp fusion: cannot re-purpose candidate stores\n");
> +
> +	  auto new_insn = crtl->ssa->create_insn (attempt, INSN, pair_pat);
> +	  change = make_change (new_insn);
> +	  change->move_range = move_range;
> +	  change->new_uses = merge_access_arrays (attempt,
> +						  input_uses[0],
> +						  input_uses[1]);
> +	  gcc_assert (change->new_uses.is_valid ());
> +
> +	  auto d1 = drop_memory_access (input_defs[0]);
> +	  auto d2 = drop_memory_access (input_defs[1]);
> +	  change->new_defs = merge_access_arrays (attempt, d1, d2);
> +	  gcc_assert (change->new_defs.is_valid ());
> +
> +	  auto new_set = crtl->ssa->create_set (attempt, new_insn, memory);
> +	  change->new_defs = insert_access (attempt, new_set,
> +					    change->new_defs);
> +	  gcc_assert (change->new_defs.is_valid ());
> +	  changes.safe_insert (1, change);
> +	  pair_change = change;
> +	}
> +    }
> +
> +  if (trailing_add)
> +    changes.quick_push (make_delete (trailing_add));
> +
> +  auto n_changes = changes.length ();
> +  gcc_checking_assert (n_changes >= 2 && n_changes <= 4);
> +
> +
> +  auto is_changing = insn_is_changing (changes);
> +  for (unsigned i = 0; i < n_changes; i++)
> +    gcc_assert (rtl_ssa::restrict_movement_ignoring (*changes[i], is_changing));
> +
> +  // Check the pair pattern is recog'd.
> +  if (!rtl_ssa::recog_ignoring (attempt, *pair_change, is_changing))
> +    {
> +      if (dump_file)
> +	fprintf (dump_file, "  failed to form pair, recog failed\n");
> +
> +      // Free any reg notes we allocated.
> +      while (reg_notes)
> +	{
> +	  rtx next = XEXP (reg_notes, 1);
> +	  free_EXPR_LIST_node (reg_notes);
> +	  reg_notes = next;
> +	}
> +      cancel_changes (0);
> +      return false;
> +    }
> +
> +  gcc_assert (crtl->ssa->verify_insn_changes (changes));
> +
> +  confirm_change_group ();
> +  crtl->ssa->change_insns (changes);
> +  emitted_tombstone_p |= have_tombstone_p;
> +  return true;
> +}
> +
> +// Return true if STORE_INSN may modify mem rtx MEM.  Make sure we keep
> +// within our BUDGET for alias analysis.
> +static bool
> +store_modifies_mem_p (rtx mem, insn_info *store_insn, int &budget)
> +{
> +  if (tombstone_insn_p (store_insn))
> +    return false;
> +
> +  if (!budget)
> +    {
> +      if (dump_file)
> +	{
> +	  fprintf (dump_file,
> +		   "exceeded budget, assuming store %d aliases with mem ",
> +		   store_insn->uid ());
> +	  print_simple_rtl (dump_file, mem);
> +	  fprintf (dump_file, "\n");
> +	}
> +
> +      return true;
> +    }
> +
> +  budget--;
> +  return memory_modified_in_insn_p (mem, store_insn->rtl ());
> +}
> +
> +// Return true if LOAD may be modified by STORE.  Make sure we keep
> +// within our BUDGET for alias analysis.
> +static bool
> +load_modified_by_store_p (insn_info *load,
> +			  insn_info *store,
> +			  int &budget)
> +{
> +  gcc_checking_assert (budget >= 0);
> +
> +  if (!budget)
> +    {
> +      if (dump_file)
> +	{
> +	  fprintf (dump_file,
> +		   "exceeded budget, assuming load %d aliases with store %d\n",
> +		   load->uid (), store->uid ());
> +	}
> +      return true;
> +    }
> +
> +  // It isn't safe to re-order stores over calls.
> +  if (CALL_P (load->rtl ()))
> +    return true;
> +
> +  budget--;
> +  return modified_in_p (PATTERN (load->rtl ()), store->rtl ());

Any reason not to use memory_modified_in_p directly here too?
I'd have expected the other dependencies to be covered by the
RTL-SSA checks.
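
E.g. (completely untested, and would need an rtl-iter.h include)
something like:

  budget--;

  subrtx_iterator::array_type array;
  FOR_EACH_SUBRTX (iter, array, PATTERN (load->rtl ()), NONCONST)
    if (MEM_P (*iter) && memory_modified_in_insn_p (*iter, store->rtl ()))
      return true;

  return false;

in place of the final modified_in_p check, so that we only ask about
the MEMs in the load, using the same alias query that
store_modifies_mem_p uses for stores.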

> +}
> +
> +struct alias_walker
> +{
> +  virtual insn_info *insn () const = 0;
> +  virtual bool valid () const = 0;
> +  virtual bool conflict_p (int &budget) const = 0;
> +  virtual void advance () = 0;
> +};
> +
> +template<bool reverse>
> +class store_walker : public alias_walker
> +{
> +  using def_iter_t = typename std::conditional <reverse,
> +	reverse_def_iterator, def_iterator>::type;
> +
> +  def_iter_t def_iter;
> +  rtx cand_mem;
> +  insn_info *limit;
> +
> +public:
> +  store_walker (def_info *mem_def, rtx mem, insn_info *limit_insn) :
> +    def_iter (mem_def), cand_mem (mem), limit (limit_insn) {}
> +
> +  bool valid () const override
> +    {
> +      if (!*def_iter)
> +	return false;
> +
> +      if (reverse)
> +	return *((*def_iter)->insn ()) > *limit;
> +      else
> +	return *((*def_iter)->insn ()) < *limit;
> +    }
> +  insn_info *insn () const override { return (*def_iter)->insn (); }
> +  void advance () override { def_iter++; }
> +  bool conflict_p (int &budget) const override
> +  {
> +    return store_modifies_mem_p (cand_mem, insn (), budget);
> +  }
> +};
> +
> +template<bool reverse>
> +class load_walker : public alias_walker
> +{
> +  using def_iter_t = typename std::conditional <reverse,
> +	reverse_def_iterator, def_iterator>::type;
> +  using use_iter_t = typename std::conditional <reverse,
> +	reverse_use_iterator, nondebug_insn_use_iterator>::type;
> +
> +  def_iter_t def_iter;
> +  use_iter_t use_iter;
> +  insn_info *cand_store;
> +  insn_info *limit;
> +
> +  static use_info *start_use_chain (def_iter_t &def_iter)
> +  {
> +    set_info *set = nullptr;
> +    for (; *def_iter; def_iter++)
> +      {
> +	set = dyn_cast <set_info *> (*def_iter);
> +	if (!set)
> +	  continue;
> +
> +	use_info *use = reverse
> +	  ? set->last_nondebug_insn_use ()
> +	  : set->first_nondebug_insn_use ();
> +
> +	if (use)
> +	  return use;
> +      }
> +
> +    return nullptr;
> +  }
> +
> +public:
> +  void advance () override
> +  {
> +    use_iter++;
> +    if (*use_iter)
> +      return;
> +    def_iter++;
> +    use_iter = start_use_chain (def_iter);
> +  }
> +
> +  insn_info *insn () const override
> +  {
> +    gcc_checking_assert (*use_iter);
> +    return (*use_iter)->insn ();
> +  }
> +
> +  bool valid () const override
> +  {
> +    if (!*use_iter)
> +      return false;
> +
> +    if (reverse)
> +      return *((*use_iter)->insn ()) > *limit;
> +    else
> +      return *((*use_iter)->insn ()) < *limit;
> +  }
> +
> +  bool conflict_p (int &budget) const override
> +  {
> +    return load_modified_by_store_p (insn (), cand_store, budget);
> +  }
> +
> +  load_walker (def_info *def, insn_info *store, insn_info *limit_insn)
> +    : def_iter (def), use_iter (start_use_chain (def_iter)),
> +      cand_store (store), limit (limit_insn) {}
> +};

Could we move more of the code to the base class?  It looks like
the iteration parts are very similar.
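
E.g. (untested sketch, with made-up names) the limit member and the
direction-dependent comparison could live in a shared base:

  template<bool reverse>
  struct limited_walker : public alias_walker
  {
    insn_info *limit;
    limited_walker (insn_info *limit_insn) : limit (limit_insn) {}

    // Shared by the load and store walkers: is CAND still on our side
    // of LIMIT, given the direction we're walking in?
    bool within_limit_p (insn_info *cand) const
    {
      return reverse ? *cand > *limit : *cand < *limit;
    }
  };

with each derived walker's valid () first checking that its iterator
hasn't run out and then deferring to within_limit_p (insn ()).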

> +
> +// Process our alias_walkers in a round-robin fashion, proceeding until
> +// nothing more can be learned from alias analysis.
> +//
> +// We try to maintain the invariant that if a walker becomes invalid, we
> +// set its pointer to null.
> +static void
> +do_alias_analysis (insn_info *alias_hazards[4],
> +		   alias_walker *walkers[4],
> +		   bool load_p)
> +{
> +  const int n_walkers = 2 + (2 * !load_p);
> +  int budget = aarch64_ldp_alias_check_limit;
> +
> +  auto next_walker = [walkers,n_walkers](int current) -> int {
> +    for (int j = 1; j <= n_walkers; j++)
> +      {
> +	int idx = (current + j) % n_walkers;
> +	if (walkers[idx])
> +	  return idx;
> +      }
> +    return -1;
> +  };
> +
> +  int i = -1;
> +  for (int j = 0; j < n_walkers; j++)
> +    {
> +      alias_hazards[j] = nullptr;
> +      if (!walkers[j])
> +	continue;
> +
> +      if (!walkers[j]->valid ())
> +	walkers[j] = nullptr;
> +      else if (i == -1)
> +	i = j;
> +    }
> +
> +  while (i >= 0)
> +    {
> +      int insn_i = i % 2;
> +      int paired_i = (i & 2) + !insn_i;
> +      int pair_fst = (i & 2);
> +      int pair_snd = (i & 2) + 1;
> +
> +      if (walkers[i]->conflict_p (budget))
> +	{
> +	  alias_hazards[i] = walkers[i]->insn ();
> +
> +	  // We got an aliasing conflict for this {load,store} walker,
> +	  // so we don't need to walk any further.
> +	  walkers[i] = nullptr;
> +
> +	  // If we have a pair of alias conflicts that prevent
> +	  // forming the pair, stop.  There's no need to do further
> +	  // analysis.
> +	  if (alias_hazards[paired_i]
> +	      && (*alias_hazards[pair_fst] <= *alias_hazards[pair_snd]))
> +	    return;
> +
> +	  if (!load_p)
> +	    {
> +	      int other_pair_fst = (pair_fst ? 0 : 2);
> +	      int other_paired_i = other_pair_fst + !insn_i;
> +
> +	      int x_pair_fst = (i == pair_fst) ? i : other_paired_i;
> +	      int x_pair_snd = (i == pair_fst) ? other_paired_i : i;
> +
> +	      // Similarly, handle the case where we have a {load,store}
> +	      // or {store,load} alias hazard pair that prevents forming
> +	      // the pair.
> +	      if (alias_hazards[other_paired_i]
> +		  && *alias_hazards[x_pair_fst] <= *alias_hazards[x_pair_snd])
> +		return;
> +	    }
> +	}
> +
> +      if (walkers[i])
> +	{
> +	  walkers[i]->advance ();
> +
> +	  if (!walkers[i]->valid ())
> +	    walkers[i] = nullptr;
> +	}
> +
> +      i = next_walker (i);
> +    }
> +}
> +
> +// Return an integer where bit (1 << i) is set if INSNS[i] uses writeback
> +// addressing.
> +static int
> +get_viable_bases (insn_info *insns[2],
> +		  vec <base_cand> &base_cands,
> +		  rtx cand_mems[2],
> +		  unsigned access_size,
> +		  bool reversed)
> +{
> +  // We discovered this pair through a common base.  Need to ensure that
> +  // we have a common base register that is live at both locations.
> +  def_info *base_defs[2] = {};
> +  int writeback = 0;
> +  for (int i = 0; i < 2; i++)
> +    {
> +      const bool is_lower = (i == reversed);
> +      poly_int64 poly_off;
> +      rtx modify = NULL_RTX;
> +      rtx base = ldp_strip_offset (cand_mems[i], &modify, &poly_off);
> +      if (modify)
> +	writeback |= (1 << i);
> +
> +      if (!REG_P (base) || !poly_off.is_constant ())
> +	continue;
> +
> +      // Punt on accesses relative to eliminable regs.  Since we don't know the
> +      // elimination offset pre-RA, we should postpone forming pairs on such
> +      // accesses until after RA.
> +      if (!reload_completed
> +	  && (REGNO (base) == FRAME_POINTER_REGNUM
> +	      || REGNO (base) == ARG_POINTER_REGNUM))
> +	continue;

Same as above, it's not obvious from the comment why this is necessary.

> +
> +      HOST_WIDE_INT base_off = poly_off.to_constant ();
> +
> +      // It should be unlikely that we ever punt here, since MEM_EXPR offset
> +      // alignment should be a good proxy for register offset alignment.
> +      if (base_off % access_size != 0)
> +	{
> +	  if (dump_file)
> +	    fprintf (dump_file,
> +		     "base not viable, offset misaligned (insn %d)\n",
> +		     insns[i]->uid ());
> +	  continue;
> +	}
> +
> +      base_off /= access_size;
> +
> +      if (!is_lower)
> +	base_off--;
> +
> +      if (base_off < LDP_MIN_IMM || base_off > LDP_MAX_IMM)
> +	continue;
> +
> +      for (auto use : insns[i]->uses ())
> +	if (use->is_reg () && use->regno () == REGNO (base))
> +	  {
> +	    base_defs[i] = use->def ();
> +	    break;
> +	  }
> +    }
> +
> +  if (!base_defs[0] && !base_defs[1])
> +    {
> +      if (dump_file)
> +	fprintf (dump_file, "no viable base register for pair (%d,%d)\n",
> +		 insns[0]->uid (), insns[1]->uid ());
> +      return writeback;
> +    }
> +
> +  for (int i = 0; i < 2; i++)
> +    if ((writeback & (1 << i)) && !base_defs[i])
> +      {
> +	if (dump_file)
> +	  fprintf (dump_file, "insn %d has writeback but base isn't viable\n",
> +		   insns[i]->uid ());
> +	return writeback;
> +      }
> +
> +  if (writeback == 3
> +      && base_defs[0]->regno () != base_defs[1]->regno ())
> +    {
> +      if (dump_file)
> +	fprintf (dump_file,
> +		 "pair (%d,%d): double writeback with distinct regs (%d,%d): "
> +		 "punting\n",
> +		 insns[0]->uid (), insns[1]->uid (),
> +		 base_defs[0]->regno (), base_defs[1]->regno ());
> +      return writeback;
> +    }
> +
> +  if (base_defs[0] && base_defs[1]
> +      && base_defs[0]->regno () == base_defs[1]->regno ())
> +    {
> +      // Easy case: insns already share the same base reg.
> +      base_cands.quick_push (base_defs[0]);
> +      return writeback;
> +    }
> +
> +  // Otherwise, we know that one of the bases must change.
> +  //
> +  // Note that if there is writeback we must use the writeback base
> +  // (we know now there is exactly one).
> +  for (int i = 0; i < 2; i++)
> +    if (base_defs[i] && (!writeback || (writeback & (1 << i))))
> +      base_cands.quick_push (base_cand { base_defs[i], i });
> +
> +  return writeback;
> +}
> +
> +// Given two adjacent memory accesses of the same size, I1 and I2, try
> +// and see if we can merge them into a ldp or stp.
> +static bool
> +try_fuse_pair (bool load_p,
> +	       unsigned access_size,
> +	       insn_info *i1,
> +	       insn_info *i2,
> +	       bool &emitted_tombstone_p)
> +{
> +  if (dump_file)
> +    fprintf (dump_file, "analyzing pair (load=%d): (%d,%d)\n",
> +	     load_p, i1->uid (), i2->uid ());
> +
> +  insn_info *insns[2];
> +  bool reversed = false;
> +  if (*i1 < *i2)
> +    {
> +      insns[0] = i1;
> +      insns[1] = i2;
> +    }
> +  else
> +    {
> +      insns[0] = i2;
> +      insns[1] = i1;
> +      reversed = true;
> +    }
> +
> +  rtx cand_mems[2];
> +  rtx reg_ops[2];
> +  rtx pats[2];
> +  for (int i = 0; i < 2; i++)
> +    {
> +      pats[i] = PATTERN (insns[i]->rtl ());
> +      cand_mems[i] = XEXP (pats[i], load_p);
> +      reg_ops[i] = XEXP (pats[i], !load_p);
> +    }
> +
> +  if (load_p && reg_overlap_mentioned_p (reg_ops[0], reg_ops[1]))
> +    {
> +      if (dump_file)
> +	fprintf (dump_file,
> +		 "punting on ldp due to reg conflicts (%d,%d)\n",
> +		 insns[0]->uid (), insns[1]->uid ());
> +      return false;
> +    }
> +
> +  if (cfun->can_throw_non_call_exceptions
> +      && (find_reg_note (insns[0]->rtl (), REG_EH_REGION, NULL_RTX)
> +	  || find_reg_note (insns[1]->rtl (), REG_EH_REGION, NULL_RTX))
> +      && insn_could_throw_p (insns[0]->rtl ())
> +      && insn_could_throw_p (insns[1]->rtl ()))

I don't get the nuance of this condition.  The REG_EH_REGION part is
definitely OK, but why are the insn_could_throw_p parts needed?
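
I.e. I'd have (naively) expected:

  if (cfun->can_throw_non_call_exceptions
      && (find_reg_note (insns[0]->rtl (), REG_EH_REGION, NULL_RTX)
          || find_reg_note (insns[1]->rtl (), REG_EH_REGION, NULL_RTX)))

to be enough on its own, unless I'm missing something.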

> +    {
> +      if (dump_file)
> +	fprintf (dump_file,
> +		 "can't combine insns with EH side effects (%d,%d)\n",
> +		 insns[0]->uid (), insns[1]->uid ());
> +      return false;
> +    }
> +
> +  auto_vec <base_cand> base_cands;
> +  base_cands.reserve (2);
> +
> +  int writeback = get_viable_bases (insns, base_cands, cand_mems,
> +				    access_size, reversed);
> +  if (base_cands.is_empty ())
> +    {
> +      if (dump_file)
> +	fprintf (dump_file, "no viable base for pair (%d,%d)\n",
> +		 insns[0]->uid (), insns[1]->uid ());
> +      return false;
> +    }
> +
> +  rtx *ignore = &XEXP (pats[1], load_p);
> +  for (auto use : insns[1]->uses ())
> +    if (!use->is_mem ()
> +	&& refers_to_regno_p (use->regno (), use->regno () + 1, pats[1], ignore)
> +	&& use->def () && use->def ()->insn () == insns[0])
> +      {
> +	// N.B. we allow a true dependence on the base address, as this
> +	// happens in the case of auto-inc accesses.  Consider a post-increment
> +	// load followed by a regular indexed load, for example.
> +	if (dump_file)
> +	  fprintf (dump_file,
> +		   "%d has non-address true dependence on %d, rejecting pair\n",
> +		   insns[1]->uid (), insns[0]->uid ());
> +	return false;
> +      }
> +
> +  unsigned i = 0;
> +  while (i < base_cands.length ())
> +    {
> +      base_cand &cand = base_cands[i];
> +
> +      rtx *ignore[2] = {};
> +      for (int j = 0; j < 2; j++)
> +	if (cand.from_insn == !j)
> +	  ignore[j] = &XEXP (cand_mems[j], 0);
> +
> +      insn_info *h = first_hazard_after (insns[0], ignore[0]);
> +      if (h && *h <= *insns[1])
> +	cand.hazards[0] = h;
> +
> +      h = latest_hazard_before (insns[1], ignore[1]);
> +      if (h && *h >= *insns[0])
> +	cand.hazards[1] = h;
> +
> +      if (!cand.viable ())
> +	{
> +	  if (dump_file)
> +	    fprintf (dump_file,
> +		     "pair (%d,%d): rejecting base %d due to dataflow "
> +		     "hazards (%d,%d)\n",
> +		     insns[0]->uid (),
> +		     insns[1]->uid (),
> +		     cand.m_def->regno (),
> +		     cand.hazards[0]->uid (),
> +		     cand.hazards[1]->uid ());
> +
> +	  base_cands.ordered_remove (i);
> +	}
> +      else
> +	i++;
> +    }
> +
> +  if (base_cands.is_empty ())
> +    {
> +      if (dump_file)
> +	fprintf (dump_file,
> +		 "can't form pair (%d,%d) due to dataflow hazards\n",
> +		 insns[0]->uid (), insns[1]->uid ());
> +      return false;
> +    }
> +
> +  insn_info *alias_hazards[4] = {};
> +
> +  // First def of memory after the first insn, and last def of memory
> +  // before the second insn, respectively.
> +  def_info *mem_defs[2] = {};
> +  if (load_p)
> +    {
> +      if (!MEM_READONLY_P (cand_mems[0]))
> +	{
> +	  mem_defs[0] = memory_access (insns[0]->uses ())->def ();
> +	  gcc_checking_assert (mem_defs[0]);
> +	  mem_defs[0] = mem_defs[0]->next_def ();
> +	}
> +      if (!MEM_READONLY_P (cand_mems[1]))
> +	{
> +	  mem_defs[1] = memory_access (insns[1]->uses ())->def ();
> +	  gcc_checking_assert (mem_defs[1]);
> +	}
> +    }
> +  else
> +    {
> +      mem_defs[0] = memory_access (insns[0]->defs ())->next_def ();
> +      mem_defs[1] = memory_access (insns[1]->defs ())->prev_def ();
> +      gcc_checking_assert (mem_defs[0]);
> +      gcc_checking_assert (mem_defs[1]);
> +    }
> +
> +  store_walker<false> forward_store_walker (mem_defs[0],
> +					    cand_mems[0],
> +					    insns[1]);
> +  store_walker<true> backward_store_walker (mem_defs[1],
> +					    cand_mems[1],
> +					    insns[0]);
> +  alias_walker *walkers[4] = {};
> +  if (mem_defs[0])
> +    walkers[0] = &forward_store_walker;
> +  if (mem_defs[1])
> +    walkers[1] = &backward_store_walker;
> +
> +  if (load_p && (mem_defs[0] || mem_defs[1]))
> +    do_alias_analysis (alias_hazards, walkers, load_p);
> +  else
> +    {
> +      // We want to find any loads hanging off the first store.
> +      mem_defs[0] = memory_access (insns[0]->defs ());
> +      load_walker<false> forward_load_walker (mem_defs[0], insns[0], insns[1]);
> +      load_walker<true> backward_load_walker (mem_defs[1], insns[1], insns[0]);
> +      walkers[2] = &forward_load_walker;
> +      walkers[3] = &backward_load_walker;
> +      do_alias_analysis (alias_hazards, walkers, load_p);
> +      // Now consolidate hazards back down.
> +      if (alias_hazards[2]
> +	  && (!alias_hazards[0] || (*alias_hazards[2] < *alias_hazards[0])))
> +	alias_hazards[0] = alias_hazards[2];
> +
> +      if (alias_hazards[3]
> +	  && (!alias_hazards[1] || (*alias_hazards[3] > *alias_hazards[1])))
> +	alias_hazards[1] = alias_hazards[3];
> +    }
> +
> +  if (alias_hazards[0] && alias_hazards[1]
> +      && *alias_hazards[0] <= *alias_hazards[1])
> +    {
> +      if (dump_file)
> +	fprintf (dump_file,
> +		 "cannot form pair (%d,%d) due to alias conflicts (%d,%d)\n",
> +		 i1->uid (), i2->uid (),
> +		 alias_hazards[0]->uid (), alias_hazards[1]->uid ());
> +      return false;
> +    }
> +
> +  // Now narrow the hazards on each base candidate using
> +  // the alias hazards.
> +  i = 0;
> +  while (i < base_cands.length ())
> +    {
> +      base_cand &cand = base_cands[i];
> +      if (alias_hazards[0] && (!cand.hazards[0]
> +			       || *alias_hazards[0] < *cand.hazards[0]))
> +	cand.hazards[0] = alias_hazards[0];
> +      if (alias_hazards[1] && (!cand.hazards[1]
> +			       || *alias_hazards[1] > *cand.hazards[1]))
> +	cand.hazards[1] = alias_hazards[1];
> +
> +      if (cand.viable ())
> +	i++;
> +      else
> +	{
> +	  if (dump_file)
> +	    fprintf (dump_file, "pair (%d,%d): rejecting base %d due to "
> +				"alias/dataflow hazards (%d,%d)",
> +				insns[0]->uid (), insns[1]->uid (),
> +				cand.m_def->regno (),
> +				cand.hazards[0]->uid (),
> +				cand.hazards[1]->uid ());
> +
> +	  base_cands.ordered_remove (i);
> +	}
> +    }
> +
> +  if (base_cands.is_empty ())
> +    {
> +      if (dump_file)
> +	fprintf (dump_file,
> +		 "cannot form pair (%d,%d) due to alias/dataflow hazards",
> +		 insns[0]->uid (), insns[1]->uid ());
> +
> +      return false;
> +    }
> +
> +  base_cand *base = &base_cands[0];
> +  if (base_cands.length () > 1)
> +    {
> +      // If there are still multiple viable bases, it makes sense
> +      // to choose one that allows us to reduce register pressure,
> +      // for loads this means moving further down, for stores this
> +      // means moving further up.

Agreed, but that's only really an issue for the pre-RA pass.
Might there be reasons to prefer a different choice after RA?
(Genuine question.)

> +      gcc_checking_assert (base_cands.length () == 2);
> +      const int hazard_i = !load_p;
> +      if (base->hazards[hazard_i])
> +	{
> +	  if (!base_cands[1].hazards[hazard_i])
> +	    base = &base_cands[1];
> +	  else if (load_p
> +		   && *base_cands[1].hazards[hazard_i]
> +		      > *(base->hazards[hazard_i]))
> +	    base = &base_cands[1];
> +	  else if (!load_p
> +		   && *base_cands[1].hazards[hazard_i]
> +		      < *(base->hazards[hazard_i]))
> +	    base = &base_cands[1];
> +	}
> +    }
> +
> +  // Otherwise, hazards[0] > hazards[1].
> +  // Pair can be formed anywhere in (hazards[1], hazards[0]).
> +  insn_range_info range (insns[0], insns[1]);
> +  if (base->hazards[1])
> +    range.first = base->hazards[1];
> +  if (base->hazards[0])
> +    range.last = base->hazards[0]->prev_nondebug_insn ();
> +
> +  // Placement strategy: push loads down and pull stores up, this should
> +  // help register pressure by reducing live ranges.
> +  if (load_p)
> +    range.first = range.last;
> +  else
> +    range.last = range.first;
> +
> +  if (dump_file)
> +    {
> +      auto print_hazard = [](insn_info *i)
> +	{
> +	  if (i)
> +	    fprintf (dump_file, "%d", i->uid ());
> +	  else
> +	    fprintf (dump_file, "-");
> +	};
> +      auto print_pair = [print_hazard](insn_info **i)
> +	{
> +	  print_hazard (i[0]);
> +	  fprintf (dump_file, ",");
> +	  print_hazard (i[1]);
> +	};
> +
> +      fprintf (dump_file, "fusing pair [L=%d] (%d,%d), base=%d, hazards: (",
> +	      load_p, insns[0]->uid (), insns[1]->uid (),
> +	      base->m_def->regno ());
> +      print_pair (base->hazards);
> +      fprintf (dump_file, "), move_range: (%d,%d)\n",
> +	       range.first->uid (), range.last->uid ());
> +    }
> +
> +  return fuse_pair (load_p, access_size, writeback,
> +		    i1, i2, *base, range, emitted_tombstone_p);
> +}
> +
> +// Erase [l.begin (), i] inclusive, respecting iterator order.
> +static insn_iter_t
> +erase_prefix (insn_list_t &l, insn_iter_t i)
> +{
> +  l.erase (l.begin (), std::next (i));
> +  return l.begin ();
> +}
> +
> +static insn_iter_t
> +erase_one (insn_list_t &l, insn_iter_t i, insn_iter_t begin)
> +{
> +  auto prev_or_next = (i == begin) ? std::next (i) : std::prev (i);
> +  l.erase (i);
> +  return prev_or_next;
> +}
> +
> +static void
> +dump_insn_list (FILE *f, const insn_list_t &l)
> +{
> +  fprintf (f, "(");
> +
> +  auto i = l.begin ();
> +  auto end = l.end ();
> +
> +  if (i != end)
> +    fprintf (f, "%d", (*i)->uid ());
> +  i++;
> +
> +  for (; i != end; i++)
> +    {
> +      fprintf (f, ", %d", (*i)->uid ());
> +    }
> +
> +  fprintf (f, ")");
> +}
> +
> +DEBUG_FUNCTION void
> +debug (const insn_list_t &l)
> +{
> +  dump_insn_list (stderr, l);
> +  fprintf (stderr, "\n");
> +}
> +
> +void
> +merge_pairs (insn_iter_t l_begin,
> +	     insn_iter_t l_end,
> +	     insn_iter_t r_begin,
> +	     insn_iter_t r_end,
> +	     insn_list_t &left_list,
> +	     insn_list_t &right_list,
> +	     hash_set <insn_info *> &to_delete,
> +	     bool load_p,
> +	     unsigned access_size,
> +	     bool &emitted_tombstone_p)
> +{
> +  auto iter_l = l_begin;
> +  auto iter_r = r_begin;
> +
> +  bool result;
> +  while (l_begin != l_end && r_begin != r_end)
> +    {
> +      auto next_l = std::next (iter_l);
> +      auto next_r = std::next (iter_r);
> +      if (**iter_l < **iter_r
> +	  && next_l != l_end
> +	  && **next_l < **iter_r)
> +	{
> +	  iter_l = next_l;
> +	  continue;
> +	}
> +      else if (**iter_r < **iter_l
> +	       && next_r != r_end
> +	       && **next_r < **iter_l)
> +	{
> +	  iter_r = next_r;
> +	  continue;
> +	}
> +
> +      bool update_l = false;
> +      bool update_r = false;
> +
> +      result = try_fuse_pair (load_p, access_size,
> +			      *iter_l, *iter_r,
> +			      emitted_tombstone_p);
> +      if (result)
> +	{
> +	  update_l = update_r = true;
> +	  if (to_delete.add (*iter_r))
> +	    gcc_unreachable (); // Shouldn't get added twice.
> +
> +	  iter_l = erase_one (left_list, iter_l, l_begin);
> +	  iter_r = erase_one (right_list, iter_r, r_begin);
> +	}
> +      else
> +	{
> +	  // Here we know that the entire prefix we skipped
> +	  // over cannot merge with anything further on
> +	  // in iteration order (there are aliasing hazards
> +	  // on both sides), so delete the entire prefix.
> +	  if (**iter_l < **iter_r)
> +	    {
> +	      // Delete everything from l_begin to iter_l, inclusive.
> +	      update_l = true;
> +	      iter_l = erase_prefix (left_list, iter_l);
> +	    }
> +	  else
> +	    {
> +	      // Delete everything from r_begin to iter_r, inclusive.
> +	      update_r = true;
> +	      iter_r = erase_prefix (right_list, iter_r);
> +	    }
> +	}
> +
> +      if (update_l)
> +	{
> +	  l_begin = left_list.begin ();
> +	  l_end = left_list.end ();
> +	}
> +      if (update_r)
> +	{
> +	  r_begin = right_list.begin ();
> +	  r_end = right_list.end ();
> +	}
> +    }
> +}

Could you add some more comments here about how the iterator ranges
are used?  E.g. I wasn't sure how l_begin and r_begin differed from
left_list.begin () and right_list.begin ().

> +
> +// Given a list of insns LEFT_ORIG with all accesses adjacent to
> +// those in RIGHT_ORIG, try and form them into pairs.
> +//
> +// Return true iff we formed all the RIGHT_ORIG candidates into
> +// pairs.
> +bool
> +ldp_bb_info::try_form_pairs (insn_list_t *left_orig,
> +			     insn_list_t *right_orig,
> +			     bool load_p, unsigned access_size)
> +{
> +  // Make a copy of the right list which we can modify to
> +  // exclude candidates locally for this invocation.
> +  insn_list_t right_copy (*right_orig);
> +
> +  if (dump_file)
> +    {
> +      fprintf (dump_file, "try_form_pairs [L=%d], cand vecs ", load_p);
> +      dump_insn_list (dump_file, *left_orig);
> +      fprintf (dump_file, " x ");
> +      dump_insn_list (dump_file, right_copy);
> +      fprintf (dump_file, "\n");
> +    }
> +
> +  // List of candidate insns to delete from the original right_list
> +  // (because they were formed into a pair).
> +  hash_set <insn_info *> to_delete;
> +
> +  // Now we have a 2D matrix of candidates, traverse it to try and
> +  // find a pair of insns that are already adjacent (within the
> +  // merged list of accesses).
> +  merge_pairs (left_orig->begin (), left_orig->end (),
> +	       right_copy.begin (), right_copy.end (),
> +	       *left_orig, right_copy,
> +	       to_delete, load_p, access_size,
> +	       m_emitted_tombstone);
> +
> +  // If we formed all right candidates into pairs,
> +  // then we can skip the next iteration.
> +  if (to_delete.elements () == right_orig->size ())
> +    return true;
> +
> +  // Delete items from to_delete.
> +  auto right_iter = right_orig->begin ();
> +  auto right_end = right_orig->end ();
> +  while (right_iter != right_end)
> +    {
> +      auto right_next = std::next (right_iter);
> +
> +      if (to_delete.contains (*right_iter))
> +	{
> +	  right_orig->erase (right_iter);
> +	  right_end = right_orig->end ();
> +	}
> +
> +      right_iter = right_next;
> +    }
> +
> +  return false;
> +}
> +
> +void
> +ldp_bb_info::transform_for_base (int encoded_lfs,
> +				 access_group &group)
> +{
> +  const auto lfs = decode_lfs (encoded_lfs);
> +  const unsigned access_size = lfs.size;
> +
> +  bool skip_next = true;
> +  access_record *prev_access = nullptr;
> +
> +  for (auto &access : group.list)
> +    {
> +      if (skip_next)
> +	skip_next = false;
> +      else if (known_eq (access.offset, prev_access->offset + access_size))
> +	skip_next = try_form_pairs (&prev_access->cand_insns,
> +				    &access.cand_insns,
> +				    lfs.load_p, access_size);
> +
> +      prev_access = &access;
> +    }
> +}
> +
> +void
> +ldp_bb_info::cleanup_tombstones ()
> +{
> +  // No need to do anything if we didn't emit a tombstone insn for this bb.
> +  if (!m_emitted_tombstone)
> +    return;
> +
> +  insn_info *insn = m_bb->head_insn ();
> +  while (insn)
> +    {
> +      insn_info *next = insn->next_nondebug_insn ();
> +      if (!insn->is_real () || !tombstone_insn_p (insn))
> +	{
> +	  insn = next;
> +	  continue;
> +	}
> +
> +      auto def = memory_access (insn->defs ());
> +      auto set = dyn_cast <set_info *> (def);
> +      if (set && set->has_any_uses ())
> +	{
> +	  def_info *prev_def = def->prev_def ();
> +	  auto prev_set = dyn_cast <set_info *> (prev_def);
> +	  if (!prev_set)
> +	    gcc_unreachable (); // TODO: handle this if needed.
> +
> +	  while (set->first_use ())
> +	    crtl->ssa->reparent_use (set->first_use (), prev_set);
> +	}
> +
> +      // Now set has no uses, we can delete it.
> +      insn_change change (insn, insn_change::DELETE);
> +      crtl->ssa->change_insn (change);
> +      insn = next;
> +    }
> +}
> +
> +template<typename Map>
> +void
> +ldp_bb_info::traverse_base_map (Map &map)
> +{
> +  for (auto kv : map)
> +    {
> +      const auto &key = kv.first;
> +      auto &value = kv.second;
> +      transform_for_base (key.second, value);
> +    }
> +}
> +
> +void
> +ldp_bb_info::transform ()
> +{
> +  traverse_base_map (expr_map);
> +  traverse_base_map (def_map);
> +}
> +
> +static void
> +ldp_fusion_init ()
> +{
> +  calculate_dominance_info (CDI_DOMINATORS);
> +  df_analyze ();
> +  crtl->ssa = new rtl_ssa::function_info (cfun);
> +}
> +
> +static void
> +ldp_fusion_destroy ()
> +{
> +  if (crtl->ssa->perform_pending_updates ())
> +    cleanup_cfg (0);
> +
> +  free_dominance_info (CDI_DOMINATORS);
> +
> +  delete crtl->ssa;
> +  crtl->ssa = nullptr;
> +}
> +
> +static rtx
> +aarch64_destructure_load_pair (rtx regs[2], rtx pattern)
> +{
> +  rtx mem = NULL_RTX;
> +
> +  for (int i = 0; i < 2; i++)
> +    {
> +      rtx pat = XVECEXP (pattern, 0, i);
> +      regs[i] = XEXP (pat, 0);
> +      rtx unspec = XEXP (pat, 1);
> +      gcc_checking_assert (GET_CODE (unspec) == UNSPEC);
> +      rtx this_mem = XVECEXP (unspec, 0, 0);
> +      if (mem)
> +	gcc_checking_assert (rtx_equal_p (mem, this_mem));
> +      else
> +	{
> +	  gcc_checking_assert (MEM_P (this_mem));
> +	  mem = this_mem;
> +	}
> +    }
> +
> +  return mem;
> +}
> +
> +static rtx
> +aarch64_destructure_store_pair (rtx regs[2], rtx pattern)
> +{
> +  rtx mem = XEXP (pattern, 0);
> +  rtx unspec = XEXP (pattern, 1);
> +  gcc_checking_assert (GET_CODE (unspec) == UNSPEC);
> +  for (int i = 0; i < 2; i++)
> +    regs[i] = XVECEXP (unspec, 0, i);
> +  return mem;
> +}
> +
> +static rtx
> +aarch64_gen_writeback_pair (rtx wb_effect, rtx pair_mem, rtx regs[2],
> +			    bool load_p)
> +{
> +  auto op_mode = aarch64_operand_mode_for_pair_mode (GET_MODE (pair_mem));
> +
> +  machine_mode modes[2];
> +  for (int i = 0; i < 2; i++)
> +    {
> +      machine_mode mode = GET_MODE (regs[i]);
> +      if (load_p)
> +	gcc_checking_assert (mode != VOIDmode);
> +      else if (mode == VOIDmode)
> +	mode = op_mode;
> +
> +      modes[i] = mode;
> +    }
> +
> +  const auto op_size = GET_MODE_SIZE (modes[0]);
> +  gcc_checking_assert (known_eq (op_size, GET_MODE_SIZE (modes[1])));
> +
> +  rtx pats[2];
> +  for (int i = 0; i < 2; i++)
> +    {
> +      rtx mem = adjust_address_nv (pair_mem, modes[i], op_size * i);
> +      pats[i] = load_p
> +	? gen_rtx_SET (regs[i], mem)
> +	: gen_rtx_SET (mem, regs[i]);
> +    }
> +
> +  return gen_rtx_PARALLEL (VOIDmode,
> +			   gen_rtvec (3, wb_effect, pats[0], pats[1]));
> +}
> +
> +// Given an existing pair insn INSN, look for a trailing update of
> +// the base register which we can fold in to make this pair use
> +// a writeback addressing mode.
> +static void
> +try_promote_writeback (insn_info *insn)
> +{
> +  auto rti = insn->rtl ();
> +  const auto attr = get_attr_ldpstp (rti);
> +  if (attr == LDPSTP_NONE)
> +    return;
> +
> +  bool load_p = (attr == LDPSTP_LDP);
> +  gcc_checking_assert (load_p || attr == LDPSTP_STP);
> +
> +  rtx regs[2];
> +  rtx mem = NULL_RTX;
> +  if (load_p)
> +    mem = aarch64_destructure_load_pair (regs, PATTERN (rti));
> +  else
> +    mem = aarch64_destructure_store_pair (regs, PATTERN (rti));
> +  gcc_checking_assert (MEM_P (mem));
> +
> +  poly_int64 offset;
> +  rtx base = strip_offset (XEXP (mem, 0), &offset);
> +  gcc_assert (REG_P (base));
> +
> +  const auto access_size = GET_MODE_SIZE (GET_MODE (mem)).to_constant () / 2;
> +
> +  if (find_access (insn->defs (), REGNO (base)))
> +    {
> +      gcc_assert (load_p);
> +      if (dump_file)
> +	fprintf (dump_file,
> +		 "ldp %d clobbers base r%d, can't promote to writeback\n",
> +		 insn->uid (), REGNO (base));
> +      return;
> +    }
> +
> +  auto base_use = find_access (insn->uses (), REGNO (base));
> +  gcc_assert (base_use);
> +
> +  if (!base_use->def ())
> +    {
> +      if (dump_file)
> +	fprintf (dump_file,
> +		 "found pair (i%d, L=%d): but base r%d is upwards exposed\n",
> +		 insn->uid (), load_p, REGNO (base));
> +      return;
> +    }
> +
> +  auto base_def = base_use->def ();
> +
> +  rtx wb_effect = NULL_RTX;
> +  def_info *add_def;
> +  const insn_range_info pair_range (insn->prev_nondebug_insn ());
> +  insn_info *insns[2] = { nullptr, insn };
> +  insn_info *trailing_add = find_trailing_add (insns, pair_range, &wb_effect,
> +					       &add_def, base_def, offset,
> +					       access_size);
> +  if (!trailing_add)
> +    return;
> +
> +  auto attempt = crtl->ssa->new_change_attempt ();
> +
> +  insn_change pair_change (insn);
> +  insn_change del_change (trailing_add, insn_change::DELETE);
> +  insn_change *changes[] = { &pair_change, &del_change };
> +
> +  rtx pair_pat = aarch64_gen_writeback_pair (wb_effect, mem, regs, load_p);
> +  gcc_assert (validate_unshare_change (rti, &PATTERN (rti), pair_pat, true));
> +
> +  // The pair must gain the def of the base register from the add.
> +  pair_change.new_defs = insert_access (attempt,
> +					add_def,
> +					pair_change.new_defs);
> +  gcc_assert (pair_change.new_defs.is_valid ());
> +
> +  pair_change.move_range = insn_range_info (insn->prev_nondebug_insn ());
> +
> +  auto is_changing = insn_is_changing (changes);
> +  for (unsigned i = 0; i < ARRAY_SIZE (changes); i++)
> +    gcc_assert (rtl_ssa::restrict_movement_ignoring (*changes[i], is_changing));
> +
> +  gcc_assert (rtl_ssa::recog_ignoring (attempt, pair_change, is_changing));
> +  gcc_assert (crtl->ssa->verify_insn_changes (changes));
> +  confirm_change_group ();
> +  crtl->ssa->change_insns (changes);
> +}
> +
> +void ldp_fusion_bb (bb_info *bb)
> +{
> +  const bool track_loads
> +    = aarch64_tune_params.ldp_policy_model != AARCH64_LDP_STP_POLICY_NEVER;
> +  const bool track_stores
> +    = aarch64_tune_params.stp_policy_model != AARCH64_LDP_STP_POLICY_NEVER;
> +
> +  ldp_bb_info bb_state (bb);
> +
> +  for (auto insn : bb->nondebug_insns ())
> +    {
> +      rtx_insn *rti = insn->rtl ();
> +
> +      if (!rti || !INSN_P (rti))
> +	continue;
> +
> +      rtx pat = PATTERN (rti);
> +      if (reload_completed
> +	  && aarch64_ldp_writeback > 1
> +	  && GET_CODE (pat) == PARALLEL
> +	  && XVECLEN (pat, 0) == 2)
> +	try_promote_writeback (insn);
> +
> +      if (GET_CODE (pat) != SET)
> +	continue;
> +
> +      if (track_stores && MEM_P (XEXP (pat, 0)))
> +	bb_state.track_access (insn, false, XEXP (pat, 0));
> +      else if (track_loads && MEM_P (XEXP (pat, 1)))
> +	bb_state.track_access (insn, true, XEXP (pat, 1));
> +    }
> +
> +  bb_state.transform ();
> +  bb_state.cleanup_tombstones ();
> +}
> +
> +void ldp_fusion ()
> +{
> +  ldp_fusion_init ();
> +
> +  for (auto bb : crtl->ssa->bbs ())
> +    ldp_fusion_bb (bb);
> +
> +  ldp_fusion_destroy ();
> +}
> +
> +namespace {
> +
> +const pass_data pass_data_ldp_fusion =
> +{
> +  RTL_PASS, /* type */
> +  "ldp_fusion", /* name */
> +  OPTGROUP_NONE, /* optinfo_flags */
> +  TV_NONE, /* tv_id */
> +  0, /* properties_required */
> +  0, /* properties_provided */
> +  0, /* properties_destroyed */
> +  0, /* todo_flags_start */
> +  TODO_df_finish, /* todo_flags_finish */
> +};
> +
> +class pass_ldp_fusion : public rtl_opt_pass
> +{
> +public:
> +  pass_ldp_fusion (gcc::context *ctx)
> +    : rtl_opt_pass (pass_data_ldp_fusion, ctx)
> +    {}
> +
> +  opt_pass *clone () override { return new pass_ldp_fusion (m_ctxt); }
> +
> +  bool gate (function *) final override
> +    {
> +      if (!optimize || optimize_debug)
> +	return false;
> +
> +      // If the tuning policy says never to form ldps or stps, don't run
> +      // the pass.
> +      if ((aarch64_tune_params.ldp_policy_model
> +	   == AARCH64_LDP_STP_POLICY_NEVER)
> +	  && (aarch64_tune_params.stp_policy_model
> +	      == AARCH64_LDP_STP_POLICY_NEVER))
> +	return false;
> +
> +      if (reload_completed)
> +	return flag_aarch64_late_ldp_fusion;
> +      else
> +	return flag_aarch64_early_ldp_fusion;
> +    }
> +
> +  unsigned execute (function *) final override
> +    {
> +      ldp_fusion ();
> +      return 0;
> +    }
> +};
> +
> +} // anon namespace
> +
> +rtl_opt_pass *
> +make_pass_ldp_fusion (gcc::context *ctx)
> +{
> +  return new pass_ldp_fusion (ctx);
> +}
> +
> +#include "gt-aarch64-ldp-fusion.h"
> diff --git a/gcc/config/aarch64/aarch64-passes.def b/gcc/config/aarch64/aarch64-passes.def
> index 6ace797b738..f38c642414e 100644
> --- a/gcc/config/aarch64/aarch64-passes.def
> +++ b/gcc/config/aarch64/aarch64-passes.def
> @@ -23,3 +23,5 @@ INSERT_PASS_BEFORE (pass_reorder_blocks, 1, pass_track_speculation);
>  INSERT_PASS_AFTER (pass_machine_reorg, 1, pass_tag_collision_avoidance);
>  INSERT_PASS_BEFORE (pass_shorten_branches, 1, pass_insert_bti);
>  INSERT_PASS_AFTER (pass_if_after_combine, 1, pass_cc_fusion);
> +INSERT_PASS_BEFORE (pass_early_remat, 1, pass_ldp_fusion);
> +INSERT_PASS_BEFORE (pass_peephole2, 1, pass_ldp_fusion);
> diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
> index 2ab54f244a7..fd75aa115d1 100644
> --- a/gcc/config/aarch64/aarch64-protos.h
> +++ b/gcc/config/aarch64/aarch64-protos.h
> @@ -1055,6 +1055,7 @@ rtl_opt_pass *make_pass_track_speculation (gcc::context *);
>  rtl_opt_pass *make_pass_tag_collision_avoidance (gcc::context *);
>  rtl_opt_pass *make_pass_insert_bti (gcc::context *ctxt);
>  rtl_opt_pass *make_pass_cc_fusion (gcc::context *ctxt);
> +rtl_opt_pass *make_pass_ldp_fusion (gcc::context *);
>  
>  poly_uint64 aarch64_regmode_natural_size (machine_mode);
>  
> diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
> index f5a518202a1..a69c37ce33b 100644
> --- a/gcc/config/aarch64/aarch64.opt
> +++ b/gcc/config/aarch64/aarch64.opt
> @@ -271,6 +271,16 @@ mtrack-speculation
>  Target Var(aarch64_track_speculation)
>  Generate code to track when the CPU might be speculating incorrectly.
>  
> +mearly-ldp-fusion
> +Target Var(flag_aarch64_early_ldp_fusion) Optimization Init(1)
> +Enable the pre-RA AArch64-specific pass to fuse loads and stores into
> +ldp and stp instructions.
> +
> +mlate-ldp-fusion
> +Target Var(flag_aarch64_late_ldp_fusion) Optimization Init(1)
> +Enable the post-RA AArch64-specific pass to fuse loads and stores into
> +ldp and stp instructions.
> +
>  mstack-protector-guard=
>  Target RejectNegative Joined Enum(stack_protector_guard) Var(aarch64_stack_protector_guard) Init(SSP_GLOBAL)
>  Use given stack-protector guard.
> @@ -360,3 +370,16 @@ Enum(aarch64_ldp_stp_policy) String(never) Value(AARCH64_LDP_STP_POLICY_NEVER)
>  
>  EnumValue
>  Enum(aarch64_ldp_stp_policy) String(aligned) Value(AARCH64_LDP_STP_POLICY_ALIGNED)
> +
> +-param=aarch64-ldp-alias-check-limit=
> +Target Joined UInteger Var(aarch64_ldp_alias_check_limit) Init(8) IntegerRange(0, 65536) Param
> +Limit on number of alias checks performed when attempting to form an ldp/stp.
> +
> +-param=aarch64-ldp-writeback=
> +Target Joined UInteger Var(aarch64_ldp_writeback) Init(2) IntegerRange(0,2) Param
> +Param to control which wirteback opportunities we try to handle in the

writeback

> +load/store pair fusion pass.  A value of zero disables writeback
> +handling.  One means we try to form pairs involving one or more existing
> +individual writeback accesses where possible.  A value of two means we
> +also try to opportunistically form writeback opportunities by folding in
> +trailing destructive updates of the base register used by a pair.

Params are also documented in invoke.texi (but are allowed to change
between releases, unlike normal options).

Thanks,
Richard

> diff --git a/gcc/config/aarch64/t-aarch64 b/gcc/config/aarch64/t-aarch64
> index a9a244ab6d6..37917344a54 100644
> --- a/gcc/config/aarch64/t-aarch64
> +++ b/gcc/config/aarch64/t-aarch64
> @@ -176,6 +176,13 @@ aarch64-cc-fusion.o: $(srcdir)/config/aarch64/aarch64-cc-fusion.cc \
>  	$(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS) $(INCLUDES) \
>  		$(srcdir)/config/aarch64/aarch64-cc-fusion.cc
>  
> +aarch64-ldp-fusion.o: $(srcdir)/config/aarch64/aarch64-ldp-fusion.cc \
> +    $(CONFIG_H) $(SYSTEM_H) $(CORETYPES_H) $(BACKEND_H) $(RTL_H) $(DF_H) \
> +    $(RTL_SSA_H) cfgcleanup.h tree-pass.h ordered-hash-map.h tree-dfa.h \
> +    fold-const.h tree-hash-traits.h print-tree.h
> +	$(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS) $(INCLUDES) \
> +		$(srcdir)/config/aarch64/aarch64-ldp-fusion.cc
> +
>  comma=,
>  MULTILIB_OPTIONS    = $(subst $(comma),/, $(patsubst %, mabi=%, $(subst $(comma),$(comma)mabi=,$(TM_MULTILIB_CONFIG))))
>  MULTILIB_DIRNAMES   = $(subst $(comma), ,$(TM_MULTILIB_CONFIG))
Alex Coplan Dec. 5, 2023, 3:09 p.m. UTC | #2
On 22/11/2023 11:14, Richard Sandiford wrote:
> Alex Coplan <alex.coplan@arm.com> writes:
> > This is a v3 of the aarch64 load/store pair fusion pass.
> > [...]
> 
> Looks really good.  I'll probably need to do another pass over it,
> but some initial comments below.
> 
> Main general comment is: it would be good to have more commentary.
> Not "repeat the code in words" commentary, just comments that sketch
> the intent or purpose of the following code, what the assumptions and
> invariants are, etc.

Thanks a lot for the review.  I've tried to add more commentary in the
latest version, and tried to make sure that all functions have a
comment.

I've attached the incremental change.  Replies to your comments below.

> 
> > ---
> >  gcc/config.gcc                           |    4 +-
> >  gcc/config/aarch64/aarch64-ldp-fusion.cc | 2727 ++++++++++++++++++++++
> >  gcc/config/aarch64/aarch64-passes.def    |    2 +
> >  gcc/config/aarch64/aarch64-protos.h      |    1 +
> >  gcc/config/aarch64/aarch64.opt           |   23 +
> >  gcc/config/aarch64/t-aarch64             |    7 +
> >  6 files changed, 2762 insertions(+), 2 deletions(-)
> >  create mode 100644 gcc/config/aarch64/aarch64-ldp-fusion.cc
> >
> > diff --git a/gcc/config.gcc b/gcc/config.gcc
> > index c1460ca354e..8b7f6b20309 100644
> > --- a/gcc/config.gcc
> > +++ b/gcc/config.gcc
> > @@ -349,8 +349,8 @@ aarch64*-*-*)
> >  	c_target_objs="aarch64-c.o"
> >  	cxx_target_objs="aarch64-c.o"
> >  	d_target_objs="aarch64-d.o"
> > -	extra_objs="aarch64-builtins.o aarch-common.o aarch64-sve-builtins.o aarch64-sve-builtins-shapes.o aarch64-sve-builtins-base.o aarch64-sve-builtins-sve2.o cortex-a57-fma-steering.o aarch64-speculation.o falkor-tag-collision-avoidance.o aarch-bti-insert.o aarch64-cc-fusion.o"
> > -	target_gtfiles="\$(srcdir)/config/aarch64/aarch64-builtins.cc \$(srcdir)/config/aarch64/aarch64-sve-builtins.h \$(srcdir)/config/aarch64/aarch64-sve-builtins.cc"
> > +	extra_objs="aarch64-builtins.o aarch-common.o aarch64-sve-builtins.o aarch64-sve-builtins-shapes.o aarch64-sve-builtins-base.o aarch64-sve-builtins-sve2.o cortex-a57-fma-steering.o aarch64-speculation.o falkor-tag-collision-avoidance.o aarch-bti-insert.o aarch64-cc-fusion.o aarch64-ldp-fusion.o"
> > +	target_gtfiles="\$(srcdir)/config/aarch64/aarch64-builtins.cc \$(srcdir)/config/aarch64/aarch64-sve-builtins.h \$(srcdir)/config/aarch64/aarch64-sve-builtins.cc \$(srcdir)/config/aarch64/aarch64-ldp-fusion.cc"
> >  	target_has_targetm_common=yes
> >  	;;
> >  alpha*-*-*)
> > diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc b/gcc/config/aarch64/aarch64-ldp-fusion.cc
> > new file mode 100644
> > index 00000000000..6ab18b9216e
> > --- /dev/null
> > +++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc
> > @@ -0,0 +1,2727 @@
> > +// LoadPair fusion optimization pass for AArch64.
> > +// Copyright (C) 2023 Free Software Foundation, Inc.
> > +//
> > +// This file is part of GCC.
> > +//
> > +// GCC is free software; you can redistribute it and/or modify it
> > +// under the terms of the GNU General Public License as published by
> > +// the Free Software Foundation; either version 3, or (at your option)
> > +// any later version.
> > +//
> > +// GCC is distributed in the hope that it will be useful, but
> > +// WITHOUT ANY WARRANTY; without even the implied warranty of
> > +// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > +// General Public License for more details.
> > +//
> > +// You should have received a copy of the GNU General Public License
> > +// along with GCC; see the file COPYING3.  If not see
> > +// <http://www.gnu.org/licenses/>.
> > +
> > +#define INCLUDE_ALGORITHM
> > +#define INCLUDE_FUNCTIONAL
> > +#define INCLUDE_LIST
> > +#define INCLUDE_TYPE_TRAITS
> > +#include "config.h"
> > +#include "system.h"
> > +#include "coretypes.h"
> > +#include "backend.h"
> > +#include "rtl.h"
> > +#include "df.h"
> > +#include "rtl-ssa.h"
> > +#include "cfgcleanup.h"
> > +#include "tree-pass.h"
> > +#include "ordered-hash-map.h"
> > +#include "tree-dfa.h"
> > +#include "fold-const.h"
> > +#include "tree-hash-traits.h"
> > +#include "print-tree.h"
> > +#include "insn-attr.h"
> > +
> > +using namespace rtl_ssa;
> > +
> > +enum
> > +{
> > +  LDP_IMM_BITS = 7,
> > +  LDP_IMM_MASK = (1 << LDP_IMM_BITS) - 1,
> > +  LDP_IMM_SIGN_BIT = (1 << (LDP_IMM_BITS - 1)),
> > +  LDP_MAX_IMM = LDP_IMM_SIGN_BIT - 1,
> > +  LDP_MIN_IMM = -LDP_MAX_IMM - 1,
> > +};
> 
> Since this isn't really an enumeration, it might be better to use
> constexprs.

Done in the latest version (and noticed that LDP_IMM_MASK is now dead,
so dropped that).
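
I.e. roughly (modulo the exact types):

  constexpr int LDP_IMM_BITS = 7;
  constexpr int LDP_IMM_SIGN_BIT = (1 << (LDP_IMM_BITS - 1));
  constexpr int LDP_MAX_IMM = LDP_IMM_SIGN_BIT - 1;
  constexpr int LDP_MIN_IMM = -LDP_MAX_IMM - 1;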

> 
> > +
> > +// We pack these fields (load_p, fpsimd_p, and size) into an integer
> > +// (LFS) which we use as part of the key into the main hash tables.
> > +//
> > +// The idea is that we group candidates together only if they agree on
> > +// the fields below.  Candidates that disagree on any of these
> > +// properties shouldn't be merged together.
> > +struct lfs_fields
> > +{
> > +  bool load_p;
> > +  bool fpsimd_p;
> > +  unsigned size;
> > +};
> > +
> > +using insn_list_t = std::list <insn_info *>;
> 
> Very minor, but it'd be good for the file to be consistent about having
> or not having a space before template arguments.  The coding conventions
> don't say, so either is fine.

Done, I went without a space in the end, as I think that matches the
rtl-ssa code.

> 
> > +using insn_iter_t = insn_list_t::iterator;
> > +
> > +// Information about the accesses at a given offset from a particular
> > +// base.  Stored in an access_group, see below.
> > +struct access_record
> > +{
> > +  poly_int64 offset;
> > +  std::list<insn_info *> cand_insns;
> > +  std::list<access_record>::iterator place;
> > +
> > +  access_record (poly_int64 off) : offset (off) {}
> > +};
> > +
> > +// A group of accesses where adjacent accesses could be ldp/stp
> > +// candidates.  The splay tree supports efficient insertion,
> > +// while the list supports efficient iteration.
> > +struct access_group
> > +{
> > +  splay_tree <access_record *> tree;
> > +  std::list<access_record> list;
> > +
> > +  template<typename Alloc>
> > +  inline void track (Alloc node_alloc, poly_int64 offset, insn_info *insn);
> > +};
> > +
> > +// Information about a potential base candidate, used in try_fuse_pair.
> > +// There may be zero, one, or two viable RTL bases for a given pair.
> > +struct base_cand
> > +{
> > +  def_info *m_def;
> 
> Sorry for the trivia, but it seems odd for only this member variable
> to have the "m_" suffix.  I think that's normally used for protected
> and private members.

Fixed, thanks.

> 
> > +
> > +  // FROM_INSN is -1 if the base candidate is already shared by both
> > +  // candidate insns.  Otherwise it holds the index of the insn from
> > +  // which the base originated.
> > +  int from_insn;
> > +
> > +  // Initially: dataflow hazards that arise if we choose this base as
> > +  // the common base register for the pair.
> > +  //
> > +  // Later these get narrowed, taking alias hazards into account.
> > +  insn_info *hazards[2];
> 
> Might be worth expanding the comment a bit.  I wasn't sure how an
> insn_info represented a hazard.  From further reading, I see it's
> an insn that contains a hazard, and therefore acts as a barrier to
> movement in that direction.

I've tried to clarify this in the latest version, thanks.
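
To give a rough idea of the intent (this is a sketch rather than the
exact wording I ended up with):

  // Initially: dataflow hazards that arise if we choose this base as
  // the common base register for the pair.
  //
  // Later these get narrowed, taking alias hazards into account.
  //
  // hazards[0] is a barrier to moving the first access down towards
  // the second, and hazards[1] is a barrier to moving the second
  // access up towards the first; the base is only viable if the two
  // barriers still leave somewhere to place the pair (see viable ()).
  insn_info *hazards[2];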

> 
> > +
> > +  base_cand (def_info *def, int insn)
> > +    : m_def (def), from_insn (insn), hazards {nullptr, nullptr} {}
> > +
> > +  base_cand (def_info *def) : base_cand (def, -1) {}
> > +
> > +  bool viable () const
> > +  {
> > +    return !hazards[0] || !hazards[1] || (*hazards[0] > *hazards[1]);
> > +  }
> > +};
> > +
> > +// Information about an alternate base.  For a def_info D, it may
> > +// instead be expressed as D = BASE + OFFSET.
> > +struct alt_base
> > +{
> > +  def_info *base;
> > +  poly_int64 offset;
> > +};
> > +
> > +// State used by the pass for a given basic block.
> > +struct ldp_bb_info
> > +{
> > +  using def_hash = nofree_ptr_hash <def_info>;
> > +  using expr_key_t = pair_hash <tree_operand_hash, int_hash <int, -1, -2>>;
> > +  using def_key_t = pair_hash <def_hash, int_hash <int, -1, -2>>;
> > +
> > +  // Map of <tree base, LFS> -> access_group.
> > +  ordered_hash_map <expr_key_t, access_group> expr_map;
> > +
> > +  // Map of <RTL-SSA def_info *, LFS> -> access_group.
> > +  ordered_hash_map <def_key_t, access_group> def_map;
> > +
> > +  // Given the def_info for an RTL base register, express it as an offset from
> > +  // some canonical base instead.
> > +  //
> > +  // Canonicalizing bases in this way allows us to identify adjacent accesses
> > +  // even if they see different base register defs.
> > +  hash_map <def_hash, alt_base> canon_base_map;
> > +
> > +  static const size_t obstack_alignment = sizeof (void *);
> > +  bb_info *m_bb;
> > +
> > +  ldp_bb_info (bb_info *bb) : m_bb (bb), m_emitted_tombstone (false)
> > +  {
> > +    obstack_specify_allocation (&m_obstack, OBSTACK_CHUNK_SIZE,
> > +				obstack_alignment, obstack_chunk_alloc,
> > +				obstack_chunk_free);
> > +  }
> > +  ~ldp_bb_info ()
> > +  {
> > +    obstack_free (&m_obstack, nullptr);
> > +  }
> > +
> > +  inline void track_access (insn_info *, bool load, rtx mem);
> > +  inline void transform ();
> > +  inline void cleanup_tombstones ();
> > +
> > +private:
> > +  // Did we emit a tombstone insn for this bb?
> > +  bool m_emitted_tombstone;
> > +  obstack m_obstack;
> > +
> > +  inline splay_tree_node<access_record *> *node_alloc (access_record *);
> > +
> > +  template<typename Map>
> > +  inline void traverse_base_map (Map &map);
> > +  inline void transform_for_base (int load_size, access_group &group);
> > +
> > +  inline bool try_form_pairs (insn_list_t *, insn_list_t *,
> > +			      bool load_p, unsigned access_size);
> > +
> > +  inline bool track_via_mem_expr (insn_info *, rtx mem, lfs_fields lfs);
> > +};
> > +
> > +splay_tree_node<access_record *> *
> > +ldp_bb_info::node_alloc (access_record *access)
> > +{
> > +  using T = splay_tree_node<access_record *>;
> > +  void *addr = obstack_alloc (&m_obstack, sizeof (T));
> > +  return new (addr) T (access);
> > +}
> > +
> > +// Given a mem MEM, if the address has side effects, return a MEM that accesses
> > +// the same address but without the side effects.  Otherwise, return
> > +// MEM unchanged.
> > +static rtx
> > +drop_writeback (rtx mem)
> > +{
> > +  rtx addr = XEXP (mem, 0);
> > +
> > +  if (!side_effects_p (addr))
> > +    return mem;
> > +
> > +  switch (GET_CODE (addr))
> > +    {
> > +    case PRE_MODIFY:
> > +      addr = XEXP (addr, 1);
> > +      break;
> > +    case POST_MODIFY:
> > +    case POST_INC:
> > +    case POST_DEC:
> > +      addr = XEXP (addr, 0);
> > +      break;
> > +    case PRE_INC:
> > +    case PRE_DEC:
> > +    {
> > +      poly_int64 adjustment = GET_MODE_SIZE (GET_MODE (mem));
> > +      if (GET_CODE (addr) == PRE_DEC)
> > +	adjustment *= -1;
> > +      addr = plus_constant (GET_MODE (addr), XEXP (addr, 0), adjustment);
> > +      break;
> > +    }
> > +    default:
> > +      gcc_unreachable ();
> > +    }
> > +
> > +  return change_address (mem, GET_MODE (mem), addr);
> > +}
> > +
> > +// Convenience wrapper around strip_offset that can also look
> > +// through {PRE,POST}_MODIFY.
> > +static rtx ldp_strip_offset (rtx mem, rtx *modify, poly_int64 *offset)
> > +{
> > +  gcc_checking_assert (MEM_P (mem));
> > +
> > +  rtx base = strip_offset (XEXP (mem, 0), offset);
> > +
> > +  if (side_effects_p (base))
> > +    *modify = base;
> > +
> > +  switch (GET_CODE (base))
> > +    {
> > +    case PRE_MODIFY:
> > +    case POST_MODIFY:
> > +      base = strip_offset (XEXP (base, 1), offset);
> > +      gcc_checking_assert (REG_P (base));
> > +      gcc_checking_assert (rtx_equal_p (XEXP (*modify, 0), base));
> > +      break;
> > +    case PRE_INC:
> > +    case POST_INC:
> > +      base = XEXP (base, 0);
> > +      *offset = GET_MODE_SIZE (GET_MODE (mem));
> > +      gcc_checking_assert (REG_P (base));
> > +      break;
> > +    case PRE_DEC:
> > +    case POST_DEC:
> > +      base = XEXP (base, 0);
> > +      *offset = -GET_MODE_SIZE (GET_MODE (mem));
> > +      gcc_checking_assert (REG_P (base));
> > +      break;
> > +
> > +    default:
> > +      gcc_checking_assert (!side_effects_p (base));
> > +    }
> > +
> > +  return base;
> > +}
> 
> Is the first strip_offset expected to fire for the side-effects case?
> If so, then I suppose the switch should be adding to the offset,
> rather than overwriting it, since the original offset will be lost.
> 
> If the first strip_offset doesn't fire for side effects (my guess,
> since autoinc addresses must be top-level addresses), it's probably
> easier to drop the modify argument and move the strip_offset into the
> default case.

Yeah, you're right that the first one doesn't fire in the side-effects
case, so I've made the suggested changes.  Thanks.
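
For the record, the simplified helper ends up looking roughly like
this (just a sketch of what I have locally, so the posted version may
differ in minor details):

static rtx
ldp_strip_offset (rtx mem, poly_int64 *offset)
{
  rtx addr = XEXP (mem, 0);
  rtx base;

  switch (GET_CODE (addr))
    {
    case PRE_MODIFY:
    case POST_MODIFY:
      base = strip_offset (XEXP (addr, 1), offset);
      gcc_checking_assert (REG_P (base));
      gcc_checking_assert (rtx_equal_p (XEXP (addr, 0), base));
      break;
    case PRE_INC:
    case POST_INC:
      base = XEXP (addr, 0);
      *offset = GET_MODE_SIZE (GET_MODE (mem));
      gcc_checking_assert (REG_P (base));
      break;
    case PRE_DEC:
    case POST_DEC:
      base = XEXP (addr, 0);
      *offset = -GET_MODE_SIZE (GET_MODE (mem));
      gcc_checking_assert (REG_P (base));
      break;
    default:
      // Only the non-autoinc case needs strip_offset now.
      base = strip_offset (addr, offset);
      gcc_checking_assert (!side_effects_p (base));
    }

  return base;
}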

> 
> > +
> > +static bool
> > +any_pre_modify_p (rtx x)
> > +{
> > +  const auto code = GET_CODE (x);
> > +  return code == PRE_INC || code == PRE_DEC || code == PRE_MODIFY;
> > +}
> > +
> > +static bool
> > +any_post_modify_p (rtx x)
> > +{
> > +  const auto code = GET_CODE (x);
> > +  return code == POST_INC || code == POST_DEC || code == POST_MODIFY;
> > +}
> > +
> > +static bool
> > +ldp_operand_mode_ok_p (machine_mode mode)
> 
> Missing function comment (also elsewhere in the file).  Here I think
> the comment serves a purpose, since the question isn't really whether
> an ldp is ok in the sense of valid, but ok in the sense of a good idea.

Done.

> 
> > +{
> > +  const bool allow_qregs
> > +    = !(aarch64_tune_params.extra_tuning_flags
> > +	& AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS);
> > +
> > +  if (!aarch64_ldpstp_operand_mode_p (mode))
> > +    return false;
> > +
> > +  const auto size = GET_MODE_SIZE (mode).to_constant ();
> > +  if (size == 16 && !allow_qregs)
> > +    return false;
> > +
> > +  return reload_completed || mode != E_TImode;
> 
> I think this last condition deserves a comment.  I agree the condition
> is correct (because we don't know whether TImode values are natural GPRs
> or FPRs), but I only remember because you mentioned it relatively recently.
> 
> E_ prefixes should generally only be used in switches though.  Same for
> the rest of the file.

Done.

> 
> > +}
> > +
> > +static int
> > +encode_lfs (lfs_fields fields)
> > +{
> > +  int size_log2 = exact_log2 (fields.size);
> > +  gcc_checking_assert (size_log2 >= 2 && size_log2 <= 4);
> > +  return ((int)fields.load_p << 3)
> > +    | ((int)fields.fpsimd_p << 2)
> > +    | (size_log2 - 2);
> > +}
> > +
> > +static lfs_fields
> > +decode_lfs (int lfs)
> > +{
> > +  bool load_p = (lfs & (1 << 3));
> > +  bool fpsimd_p = (lfs & (1 << 2));
> > +  unsigned size = 1U << ((lfs & 3) + 2);
> > +  return { load_p, fpsimd_p, size };
> > +}
> > +
> > +template<typename Alloc>
> > +void
> > +access_group::track (Alloc alloc_node, poly_int64 offset, insn_info *insn)
> > +{
> > +  auto insert_before = [&](std::list<access_record>::iterator after)
> > +    {
> > +      auto it = list.emplace (after, offset);
> > +      it->cand_insns.push_back (insn);
> > +      it->place = it;
> > +      return &*it;
> > +    };
> > +
> > +  if (!list.size ())
> > +    {
> > +      auto access = insert_before (list.end ());
> > +      tree.insert_max_node (alloc_node (access));
> > +      return;
> > +    }
> > +
> > +  auto compare = [&](splay_tree_node<access_record *> *node)
> > +    {
> > +      return compare_sizes_for_sort (offset, node->value ()->offset);
> > +    };
> > +  auto result = tree.lookup (compare);
> > +  splay_tree_node<access_record *> *node = tree.root ();
> > +  if (result == 0)
> > +    node->value ()->cand_insns.push_back (insn);
> > +  else
> > +    {
> > +      auto it = node->value ()->place;
> > +      auto after = (result > 0) ? std::next (it) : it;
> > +      auto access = insert_before (after);
> > +      tree.insert_child (node, result > 0, alloc_node (access));
> > +    }
> > +}
> > +
> > +bool
> > +ldp_bb_info::track_via_mem_expr (insn_info *insn, rtx mem, lfs_fields lfs)
> > +{
> > +  if (!MEM_EXPR (mem) || !MEM_OFFSET_KNOWN_P (mem))
> > +    return false;
> > +
> > +  poly_int64 offset;
> > +  tree base_expr = get_addr_base_and_unit_offset (MEM_EXPR (mem),
> > +						  &offset);
> > +  if (!base_expr || !DECL_P (base_expr))
> > +    return false;
> > +
> > +  offset += MEM_OFFSET (mem);
> > +
> > +  const machine_mode mem_mode = GET_MODE (mem);
> > +  const HOST_WIDE_INT mem_size = GET_MODE_SIZE (mem_mode).to_constant ();
> > +
> > +  // Punt on misaligned offsets.
> > +  if (offset.coeffs[0] & (mem_size - 1))
> 
> !multiple_p (offset, mem_size)
> 
> It's probably worth adding a comment to say that we reject unaligned
> offsets because they are likely to lead to invalid LDP/STP addresses
> (rather than for optimisation reasons).

Done, thanks.

> 
> > +    return false;
> > +
> > +  const auto key = std::make_pair (base_expr, encode_lfs (lfs));
> > +  access_group &group = expr_map.get_or_insert (key, NULL);
> > +  auto alloc = [&](access_record *access) { return node_alloc (access); };
> > +  group.track (alloc, offset, insn);
> > +
> > +  if (dump_file)
> > +    {
> > +      fprintf (dump_file, "[bb %u] tracking insn %d via ",
> > +	       m_bb->index (), insn->uid ());
> > +      print_node_brief (dump_file, "mem expr", base_expr, 0);
> > +      fprintf (dump_file, " [L=%d FP=%d, %smode, off=",
> > +	       lfs.load_p, lfs.fpsimd_p, mode_name[mem_mode]);
> > +      print_dec (offset, dump_file);
> > +      fprintf (dump_file, "]\n");
> > +    }
> > +
> > +  return true;
> > +}
> > +
> > +// Return true if X is a constant zero operand.  N.B. this matches the
> > +// {w,x}zr check in aarch64_print_operand, the logic in the predicate
> > +// aarch64_stp_reg_operand, and the constraints on the pair patterns.
> > +static bool const_zero_op_p (rtx x)
> 
> Nit: new line after bool.  A few instances later too.
> 
> Could we put this somewhere that's easily accessible by everything
> that wants to test it?  It's cropped up a few times in the series.

Moved this back to the aarch64_print_operand patch earlier in the
series:
https://gcc.gnu.org/pipermail/gcc-patches/2023-December/639358.html

I then used it in aarch64_stp_reg_operand in 8/11 (as well as in the updated
version of this patch).

> 
> > +{
> > +  return x == CONST0_RTX (GET_MODE (x))
> > +    || (CONST_DOUBLE_P (x) && aarch64_float_const_zero_rtx_p (x));
> > +}
> > +
> > +void
> > +ldp_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
> > +{
> > +  // We can't combine volatile MEMs, so punt on these.
> > +  if (MEM_VOLATILE_P (mem))
> > +    return;
> > +
> > +  // Ignore writeback accesses if the param says to do so.
> > +  if (!aarch64_ldp_writeback && side_effects_p (XEXP (mem, 0)))
> > +    return;
> > +
> > +  const machine_mode mem_mode = GET_MODE (mem);
> > +  if (!ldp_operand_mode_ok_p (mem_mode))
> > +    return;
> > +
> > +  // Note ldp_operand_mode_ok_p already rejected VL modes.
> > +  const HOST_WIDE_INT mem_size = GET_MODE_SIZE (mem_mode).to_constant ();
> > +
> > +  rtx reg_op = XEXP (PATTERN (insn->rtl ()), !load_p);
> > +
> > +  // Is this an FP/SIMD access?  Note that constant zero operands
> > +  // use an integer zero register ({w,x}zr).
> > +  const bool fpsimd_op_p
> > +    = GET_MODE_CLASS (mem_mode) != MODE_INT
> > +      && (load_p || !const_zero_op_p (reg_op));
> > +
> > +  // N.B. we only want to segregate FP/SIMD accesses from integer accesses
> > +  // before RA.
> > +  const bool fpsimd_bit_p = !reload_completed && fpsimd_op_p;
> 
> But after RA, shouldn't we segregate them based on the hard register?
> In other words, the natural condition after RA seems to be:
> 
>   REG_P (reg) && FP_REGNUM_P (REGNO (reg))

Yes, that sounds sensible, and should save us a lot of failed recog
attempts.  I've made this change, thanks.
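
Concretely, the classification now looks something like the below
(sketch; with this the separate fpsimd_bit_p variable goes away):

  // Is this an FP/SIMD access?  Note that constant zero operands
  // use an integer zero register ({w,x}zr).  Pre-RA we go by the mode;
  // post-RA we go by the hard register that was actually allocated.
  const bool fpsimd_op_p
    = reload_completed
    ? (REG_P (reg_op) && FP_REGNUM_P (REGNO (reg_op)))
    : (GET_MODE_CLASS (mem_mode) != MODE_INT
       && (load_p || !const_zero_op_p (reg_op)));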

> 
> > +  const lfs_fields lfs = { load_p, fpsimd_bit_p, mem_size };
> > +
> > +  if (track_via_mem_expr (insn, mem, lfs))
> > +    return;
> > +
> > +  poly_int64 mem_off;
> > +  rtx modify = NULL_RTX;
> > +  rtx base = ldp_strip_offset (mem, &modify, &mem_off);
> > +  if (!REG_P (base))
> > +    return;
> > +
> > +  // Need to calculate two (possibly different) offsets:
> > +  //  - Offset at which the access occurs.
> > +  //  - Offset of the new base def.
> > +  poly_int64 access_off;
> > +  if (modify && any_post_modify_p (modify))
> > +    access_off = 0;
> > +  else
> > +    access_off = mem_off;
> > +
> > +  poly_int64 new_def_off = mem_off;
> > +
> > +  // Punt on accesses relative to the eliminable regs: since we don't
> > +  // know the elimination offset pre-RA, we should postpone forming
> > +  // pairs on such accesses until after RA.
> > +  if (!reload_completed
> > +      && (REGNO (base) == FRAME_POINTER_REGNUM
> > +	  || REGNO (base) == ARG_POINTER_REGNUM))
> > +    return;
> 
> Is this still an issue with the new representation of LDPs and STPs?
> If so, it seems like there's more to it than the reason given in the
> comments.

Yes, I think so.  Not because we get invalid RTL if we get a reload, but
because we generate worse code if we get a reload.  If the offsets
really are out of range after elimination, it seems best to leave it to
the peepholes that handle that out-of-range case, otherwise you end up
with separate reloads for every out-of-range stack pair, which isn't
desirable.  The peepholes ensure we only form the pairs when this is
profitable (i.e. when two or more pairs can share the new base).
Ideally we'd handle this in the (post-RA) pass eventually, but that's
for future work.

> 
> > +
> > +  // Now need to find def of base register.
> > +  def_info *base_def;
> > +  use_info *base_use = find_access (insn->uses (), REGNO (base));
> > +  gcc_assert (base_use);
> > +  base_def = base_use->def ();
> > +  if (!base_def)
> > +    {
> > +      if (dump_file)
> > +	fprintf (dump_file,
> > +		 "base register (regno %d) of insn %d is undefined",
> > +		 REGNO (base), insn->uid ());
> > +      return;
> > +    }
> 
> Does the track_via_mem_expr need to happen in its current position?
> I wasn't sure why it was done relatively early, given the early-outs above.

It needs to happen before we do any validation of the RTL base, since
we want to track accesses that have a suitable MEM_EXPR base but no
viable RTL base (we permit changing bases).

I think the early-outs above should apply regardless of what kind of
base we track, i.e. if any of the early-outs above this call fire, then
we really don't want to track the candidate.

Does that make sense?

> 
> > +
> > +  alt_base *canon_base = canon_base_map.get (base_def);
> > +  if (canon_base)
> > +    {
> > +      // Express this as the combined offset from the canonical base.
> > +      base_def = canon_base->base;
> > +      new_def_off += canon_base->offset;
> > +      access_off += canon_base->offset;
> > +    }
> > +
> > +  if (modify)
> > +    {
> > +      auto def = find_access (insn->defs (), REGNO (base));
> > +      gcc_assert (def);
> > +
> > +      // Record that DEF = BASE_DEF + MEM_OFF.
> > +      if (dump_file)
> > +	{
> > +	  pretty_printer pp;
> > +	  pp_access (&pp, def, 0);
> > +	  pp_string (&pp, " = ");
> > +	  pp_access (&pp, base_def, 0);
> > +	  fprintf (dump_file, "[bb %u] recording %s + ",
> > +		   m_bb->index (), pp_formatted_text (&pp));
> > +	  print_dec (new_def_off, dump_file);
> > +	  fprintf (dump_file, "\n");
> > +	}
> > +
> > +      alt_base base_rec { base_def, new_def_off };
> > +      if (canon_base_map.put (def, base_rec))
> > +	gcc_unreachable (); // Base defs should be unique.
> > +    }
> > +
> > +  // Punt on misaligned offsets.
> > +  if (mem_off.coeffs[0] & (mem_size - 1))
> 
> !multiple_p here too

Done, thanks.

> 
> > +    return;
> > +
> > +  const auto key = std::make_pair (base_def, encode_lfs (lfs));
> > +  access_group &group = def_map.get_or_insert (key, NULL);
> > +  auto alloc = [&](access_record *access) { return node_alloc (access); };
> > +  group.track (alloc, access_off, insn);
> > +
> > +  if (dump_file)
> > +    {
> > +      pretty_printer pp;
> > +      pp_access (&pp, base_def, 0);
> > +
> > +      fprintf (dump_file, "[bb %u] tracking insn %d via %s",
> > +	       m_bb->index (), insn->uid (), pp_formatted_text (&pp));
> > +      fprintf (dump_file,
> > +	       " [L=%d, WB=%d, FP=%d, %smode, off=",
> > +	       lfs.load_p, !!modify, lfs.fpsimd_p, mode_name[mem_mode]);
> > +      print_dec (access_off, dump_file);
> > +      fprintf (dump_file, "]\n");
> > +    }
> > +}
> > +
> > +// Dummy predicate that never ignores any insns.
> > +static bool no_ignore (insn_info *) { return false; }
> > +
> > +// Return the latest dataflow hazard before INSN.
> > +//
> > +// If IGNORE is non-NULL, this points to a sub-rtx which we should
> > +// ignore for dataflow purposes.  This is needed when considering
> > +// changing the RTL base of an access discovered through a MEM_EXPR
> > +// base.
> > +//
> > +// N.B. we ignore any defs/uses of memory here as we deal with that
> > +// separately, making use of alias disambiguation.
> > +static insn_info *
> > +latest_hazard_before (insn_info *insn, rtx *ignore,
> > +		      insn_info *ignore_insn = nullptr)
> > +{
> > +  insn_info *result = nullptr;
> > +
> > +  // Return true if we registered the hazard.
> > +  auto hazard = [&](insn_info *h) -> bool
> > +    {
> > +      gcc_checking_assert (*h < *insn);
> > +      if (h == ignore_insn)
> > +	return false;
> > +
> > +      if (!result || *h > *result)
> > +	result = h;
> > +
> > +      return true;
> > +    };
> > +
> > +  rtx pat = PATTERN (insn->rtl ());
> > +  auto ignore_use = [&](use_info *u)
> > +    {
> > +      if (u->is_mem ())
> > +	return true;
> > +
> > +      return !refers_to_regno_p (u->regno (), u->regno () + 1, pat, ignore);
> > +    };
> > +
> > +  // Find defs of uses in INSN (RaW).
> > +  for (auto use : insn->uses ())
> > +    if (!ignore_use (use) && use->def ())
> > +      hazard (use->def ()->insn ());
> > +
> > +  // Find previous defs (WaW) or previous uses (WaR) of defs in INSN.
> > +  for (auto def : insn->defs ())
> > +    {
> > +      if (def->is_mem ())
> > +	continue;
> > +
> > +      if (def->prev_def ())
> > +	{
> > +	  hazard (def->prev_def ()->insn ()); // WaW
> > +
> > +	  auto set = dyn_cast <set_info *> (def->prev_def ());
> > +	  if (set && set->has_nondebug_insn_uses ())
> > +	    for (auto use : set->reverse_nondebug_insn_uses ())
> > +	      if (use->insn () != insn && hazard (use->insn ())) // WaR
> > +		break;
> > +	}
> > +
> > +      if (!HARD_REGISTER_NUM_P (def->regno ()))
> > +	continue;
> > +
> > +      // Also need to check backwards for call clobbers (WaW).
> > +      for (auto call_group : def->ebb ()->call_clobbers ())
> > +	{
> > +	  if (!call_group->clobbers (def->resource ()))
> > +	    continue;
> > +
> > +	  auto clobber_insn = prev_call_clobbers_ignoring (*call_group,
> > +							   def->insn (),
> > +							   no_ignore);
> > +	  if (clobber_insn)
> > +	    hazard (clobber_insn);
> > +	}
> > +
> > +    }
> > +
> > +  return result;
> > +}
> > +
> > +static insn_info *
> > +first_hazard_after (insn_info *insn, rtx *ignore)
> > +{
> > +  insn_info *result = nullptr;
> > +  auto hazard = [insn, &result](insn_info *h)
> > +    {
> > +      gcc_checking_assert (*h > *insn);
> > +      if (!result || *h < *result)
> > +	result = h;
> > +    };
> > +
> > +  rtx pat = PATTERN (insn->rtl ());
> > +  auto ignore_use = [&](use_info *u)
> > +    {
> > +      if (u->is_mem ())
> > +	return true;
> > +
> > +      return !refers_to_regno_p (u->regno (), u->regno () + 1, pat, ignore);
> > +    };
> > +
> > +  for (auto def : insn->defs ())
> > +    {
> > +      if (def->is_mem ())
> > +	continue;
> > +
> > +      if (def->next_def ())
> > +	hazard (def->next_def ()->insn ()); // WaW
> > +
> > +      auto set = dyn_cast <set_info *> (def);
> > +      if (set && set->has_nondebug_insn_uses ())
> > +	hazard (set->first_nondebug_insn_use ()->insn ()); // RaW
> > +
> > +      if (!HARD_REGISTER_NUM_P (def->regno ()))
> > +	continue;
> > +
> > +      // Also check for call clobbers of this def (WaW).
> > +      for (auto call_group : def->ebb ()->call_clobbers ())
> > +	{
> > +	  if (!call_group->clobbers (def->resource ()))
> > +	    continue;
> > +
> > +	  auto clobber_insn = next_call_clobbers_ignoring (*call_group,
> > +							   def->insn (),
> > +							   no_ignore);
> > +	  if (clobber_insn)
> > +	    hazard (clobber_insn);
> > +	}
> > +    }
> > +
> > +  // Find any subsequent defs of uses in INSN (WaR).
> > +  for (auto use : insn->uses ())
> > +    {
> > +      if (ignore_use (use))
> > +	continue;
> > +
> > +      if (use->def ())
> > +	{
> > +	  auto def = use->def ()->next_def ();
> > +	  if (def && def->insn () == insn)
> > +	    def = def->next_def ();
> > +
> > +	  if (def)
> > +	    hazard (def->insn ());
> > +	}
> > +
> > +      if (!HARD_REGISTER_NUM_P (use->regno ()))
> > +	continue;
> > +
> > +      // Also need to handle call clobbers of our uses (again WaR).
> > +      //
> > +      // See restrict_movement_for_uses_ignoring for why we don't
> > +      // need to check backwards for call clobbers.
> > +      for (auto call_group : use->ebb ()->call_clobbers ())
> > +	{
> > +	  if (!call_group->clobbers (use->resource ()))
> > +	    continue;
> > +
> > +	  auto clobber_insn = next_call_clobbers_ignoring (*call_group,
> > +							   use->insn (),
> > +							   no_ignore);
> > +	  if (clobber_insn)
> > +	    hazard (clobber_insn);
> > +	}
> > +    }
> > +
> > +  return result;
> > +}
> > +
> > +
> > +enum change_strategy {
> > +  CHANGE,
> > +  DELETE,
> > +  TOMBSTONE,
> > +};
> > +
> > +// Given a change_strategy S, convert it to a string (for output in the
> > +// dump file).
> > +static const char *cs_to_string (change_strategy s)
> > +{
> > +#define C(x) case x: return #x
> > +  switch (s)
> > +    {
> > +      C (CHANGE);
> > +      C (DELETE);
> > +      C (TOMBSTONE);
> > +    }
> > +#undef C
> > +  gcc_unreachable ();
> > +}
> > +
> > +// TODO: should this live in RTL-SSA?
> > +static bool
> > +ranges_overlap_p (const insn_range_info &r1, const insn_range_info &r2)
> > +{
> > +  // If either range is empty, then their intersection is empty.
> > +  if (!r1 || !r2)
> > +    return false;
> > +
> > +  // When do they not overlap? When one range finishes before the other
> > +  // starts, i.e. (*r1.last < *r2.first || *r2.last < *r1.first).
> > +  // Inverting this, we get the below.
> > +  return *r1.last >= *r2.first && *r2.last >= *r1.first;
> > +}
> > +
> > +// Get the range of insns that def feeds.
> > +static insn_range_info get_def_range (def_info *def)
> > +{
> > +  insn_info *last = def->next_def ()->insn ()->prev_nondebug_insn ();
> > +  return { def->insn (), last };
> > +}
> > +
> > +// Given a def (of memory), return the downwards range within which we
> > +// can safely move this def.
> > +static insn_range_info
> > +def_downwards_move_range (def_info *def)
> > +{
> > +  auto range = get_def_range (def);
> > +
> > +  auto set = dyn_cast <set_info *> (def);
> > +  if (!set || !set->has_any_uses ())
> > +    return range;
> > +
> > +  auto use = set->first_nondebug_insn_use ();
> > +  if (use)
> > +    range = move_earlier_than (range, use->insn ());
> > +
> > +  return range;
> > +}
> > +
> > +// Given a def (of memory), return the upwards range within which we can
> > +// safely move this def.
> > +static insn_range_info
> > +def_upwards_move_range (def_info *def)
> > +{
> > +  def_info *prev = def->prev_def ();
> > +  insn_range_info range { prev->insn (), def->insn () };
> > +
> > +  auto set = dyn_cast <set_info *> (prev);
> > +  if (!set || !set->has_any_uses ())
> > +    return range;
> > +
> > +  auto use = set->last_nondebug_insn_use ();
> > +  if (use)
> > +    range = move_later_than (range, use->insn ());
> > +
> > +  return range;
> > +}
> > +
> > +static def_info *
> > +decide_stp_strategy (change_strategy strategy[2],
> > +		     insn_info *first,
> > +		     insn_info *second,
> > +		     const insn_range_info &move_range)
> > +{
> > +  strategy[0] = CHANGE;
> > +  strategy[1] = DELETE;
> > +
> > +  unsigned viable = 0;
> > +  viable |= move_range.includes (first);
> > +  viable |= ((unsigned) move_range.includes (second)) << 1;
> > +
> > +  def_info * const defs[2] = {
> > +    memory_access (first->defs ()),
> > +    memory_access (second->defs ())
> > +  };
> > +  if (defs[0] == defs[1])
> > +    viable = 3; // No intervening store, either is viable.
> 
> Can this happen?  If the first and second insns are different then their
> definitions should be too.

No, you're right, this can't happen.  I think this must have just been a
think-o, perhaps I was cribbing off the ldp case when I started writing
this.  Fixed, thanks.

> 
> > +
> > +  if (!(viable & 1)
> > +      && ranges_overlap_p (move_range, def_downwards_move_range (defs[0])))
> > +    viable |= 1;
> > +  if (!(viable & 2)
> > +      && ranges_overlap_p (move_range, def_upwards_move_range (defs[1])))
> > +    viable |= 2;
> > +
> > +  if (viable == 2)
> > +    std::swap (strategy[0], strategy[1]);
> > +  else if (!viable)
> > +    // Tricky case: need to delete both accesses.
> > +    strategy[0] = DELETE;
> > +
> > +  for (int i = 0; i < 2; i++)
> > +    {
> > +      if (strategy[i] != DELETE)
> > +	continue;
> > +
> > +      // See if we can get away without a tombstone.
> > +      auto set = dyn_cast <set_info *> (defs[i]);
> > +      if (!set || !set->has_any_uses ())
> > +	continue; // We can indeed.
> > +
> > +      // If both sides are viable for re-purposing, and the other store's
> > +      // def doesn't have any uses, then we can delete the other store
> > +      // and re-purpose this store instead.
> > +      if (viable == 3)
> > +	{
> > +	  gcc_assert (strategy[!i] == CHANGE);
> > +	  auto other_set = dyn_cast <set_info *> (defs[!i]);
> > +	  if (!other_set || !other_set->has_any_uses ())
> > +	    {
> > +	      strategy[i] = CHANGE;
> > +	      strategy[!i] = DELETE;
> > +	      break;
> > +	    }
> > +	}
> > +
> > +      // Alas, we need a tombstone after all.
> > +      strategy[i] = TOMBSTONE;
> > +    }
> 
> I think it's a bug in RTL-SSA that we have mem defs without uses,
> or it's at least something that should be changed.  So probably best
> to delete the loop and stick with the result of the range comparisons.

Agreed, I've dropped the loop (and all of the change_strategy stuff);
the new version is a lot simpler.  Thanks!
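
FWIW, the decision now boils down to something like the following
(this is just a sketch of the shape of the new code, not the exact
patch text): try to re-purpose whichever store's mem def can live at
the pair's position, and return null (so that fuse_pair emits a fresh
insn) if neither works.

static def_info *
decide_stp_strategy (insn_info *first,
                     insn_info *second,
                     const insn_range_info &move_range)
{
  def_info * const defs[2] = {
    memory_access (first->defs ()),
    memory_access (second->defs ())
  };

  if (move_range.includes (first)
      || ranges_overlap_p (move_range, def_downwards_move_range (defs[0])))
    return defs[0];

  if (move_range.includes (second)
      || ranges_overlap_p (move_range, def_upwards_move_range (defs[1])))
    return defs[1];

  return nullptr;
}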

> 
> > +
> > +  for (int i = 0; i < 2; i++)
> > +    if (strategy[i] == CHANGE)
> > +      return defs[i];
> > +
> > +  return nullptr;
> > +}
> > +
> > +static GTY(()) rtx tombstone = NULL_RTX;
> > +
> > +// Generate the RTL pattern for a "tombstone"; used temporarily
> > +// during this pass to replace stores that are marked for deletion
> > +// where we can't immediately delete the store (e.g. if there are uses
> > +// hanging off its def of memory).
> > +//
> > +// These are deleted at the end of the pass and uses re-parented
> > +// appropriately at this point.
> > +static rtx
> > +gen_tombstone (void)
> > +{
> > +  if (!tombstone)
> > +    {
> > +      tombstone = gen_rtx_CLOBBER (VOIDmode,
> > +				   gen_rtx_MEM (BLKmode,
> > +						gen_rtx_SCRATCH (Pmode)));
> > +      return tombstone;
> > +    }
> > +
> > +  return copy_rtx (tombstone);
> > +}
> > +
> > +static bool
> > +tombstone_insn_p (insn_info *insn)
> > +{
> > +  rtx x = tombstone ? tombstone : gen_tombstone ();
> > +  return rtx_equal_p (PATTERN (insn->rtl ()), x);
> > +}
> 
> It's probably safer to check by insn uid, since the pattern could
> in principle occur as a pre-existing barrier.

Done (using a bitmap to track the tombstone uids).
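
i.e. roughly the following (names are illustrative rather than the
exact new code; the bit gets set at the point we install the tombstone
pattern with validate_change):

// m_tombstone_bitmap is a (hypothetical) bitmap_head member of
// ldp_bb_info with a bit set for the uid of each tombstone insn we
// emitted in this bb.
bool
ldp_bb_info::tombstone_insn_p (insn_info *insn)
{
  return bitmap_bit_p (&m_tombstone_bitmap, insn->uid ());
}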

> 
> > +
> > +static machine_mode
> > +aarch64_operand_mode_for_pair_mode (machine_mode mode)
> > +{
> > +  switch (mode)
> > +    {
> > +    case E_V2x4QImode:
> > +      return E_SImode;
> > +    case E_V2x8QImode:
> > +      return E_DImode;
> > +    case E_V2x16QImode:
> > +      return E_V16QImode;
> > +    default:
> > +      gcc_unreachable ();
> > +    }
> > +}
> > +
> > +static rtx
> > +filter_notes (rtx note, rtx result, bool *eh_region)
> > +{
> > +  for (; note; note = XEXP (note, 1))
> > +    {
> > +      switch (REG_NOTE_KIND (note))
> > +	{
> > +	  case REG_EQUAL:
> > +	  case REG_EQUIV:
> > +	  case REG_DEAD:
> > +	  case REG_UNUSED:
> > +	  case REG_NOALIAS:
> > +	    // These can all be dropped.  For REG_EQU{AL,IV} they
> > +	    // cannot apply to non-single_set insns, and
> > +	    // REG_{DEAD,UNUSED} are re-computed by RTl-SSA, see
> > +	    // rtl-ssa/changes.cc:update_notes.
> 
> AFAIK, REG_DEAD isn't recomputed, only REG_UNUSED is.  But REG_DEADs are
> allowed to bit-rot.

Fixed, thanks.

> 
> > +	    //
> > +	    // Similarly, REG_NOALIAS cannot apply to a parallel.
> > +	  case REG_INC:
> > +	    // When we form the pair insn, the reg update is implemented
> > +	    // as just another SET in the parallel, so isn't really an
> > +	    // auto-increment in the RTL sense, hence we drop the note.
> > +	    break;
> > +	  case REG_EH_REGION:
> > +	    gcc_assert (!*eh_region);
> > +	    *eh_region = true;
> > +	    result = alloc_reg_note (REG_EH_REGION, XEXP (note, 0), result);
> > +	    break;
> > +	  case REG_CFA_DEF_CFA:
> > +	  case REG_CFA_OFFSET:
> > +	  case REG_CFA_RESTORE:
> > +	    result = alloc_reg_note (REG_NOTE_KIND (note),
> > +				     copy_rtx (XEXP (note, 0)),
> > +				     result);
> > +	    break;
> > +	  default:
> > +	    // Unexpected REG_NOTE kind.
> > +	    gcc_unreachable ();
> 
> Nit: cases should be indented to the same column as the "{".

Fixed.

> 
> > +	}
> > +    }
> > +
> > +  return result;
> > +}
> > +
> > +// Ensure we have a sensible scheme for combining REG_NOTEs
> > +// given two candidate insns I1 and I2.
> > +static rtx
> > +combine_reg_notes (insn_info *i1, insn_info *i2, rtx writeback, bool &ok)
> > +{
> > +  if ((writeback && find_reg_note (i1->rtl (), REG_CFA_DEF_CFA, NULL_RTX))
> > +      || find_reg_note (i2->rtl (), REG_CFA_DEF_CFA, NULL_RTX))
> > +    {
> > +      // CFA_DEF_CFA notes apply to the first set of the PARALLEL,
> > +      // so we can only preserve them in the non-writeback case, in
> > +      // the case that the note is attached to the lower access.
> 
> I thought that was only true if the note didn't provide some information:
> 
> 	n = XEXP (note, 0);
> 	if (n == NULL)
> 	  n = single_set (insn);
> 	dwarf2out_frame_debug_cfa_offset (n);
> 	handled_one = true;
> 
> The aarch64 notes should be self-contained, so the note expression should
> never be null.
> 
> If the notes are self-contained, they should be interpreted serially.
> So I'd have expected everything to work out if we preserve the original
> order (including order between instructions).
> 
> It's OK to punt anyway, of course.  I just wasn't sure about the reason
> in the comments.

I think with the new representation of ldp/stp the notes are indeed
guaranteed to be self-contained, so I've dropped this in the latest
version, thanks.

> 
> > +      if (dump_file)
> > +	fprintf (dump_file,
> > +		 "(%d,%d,WB=%d): can't preserve CFA_DEF_CFA note, punting\n",
> > +		 i1->uid (), i2->uid (), !!writeback);
> > +      ok = false;
> > +      return NULL_RTX;
> > +    }
> > +
> > +  bool found_eh_region = false;
> > +  rtx result = NULL_RTX;
> > +  result = filter_notes (REG_NOTES (i1->rtl ()), result, &found_eh_region);
> > +  return filter_notes (REG_NOTES (i2->rtl ()), result, &found_eh_region);
> > +}
> > +
> > +// Given two memory accesses, at least one of which is of a writeback form,
> > +// extract two non-writeback memory accesses addressed relative to the initial
> > +// value of the base register, and output these in PATS.  Return an rtx that
> > +// represents the overall change to the base register.
> > +static rtx
> > +extract_writebacks (bool load_p, rtx pats[2], int changed)
> 
> Wasn't clear at first from the comment that PATS is also where the
> initial memory access insns come from.

I've tweaked the comment.

> 
> > +{
> > +  rtx base_reg = NULL_RTX;
> > +  poly_int64 current_offset = 0;
> > +
> > +  poly_int64 offsets[2];
> > +
> > +  for (int i = 0; i < 2; i++)
> > +    {
> > +      rtx mem = XEXP (pats[i], load_p);
> > +      rtx reg = XEXP (pats[i], !load_p);
> > +
> > +      rtx modify = NULL_RTX;
> > +      poly_int64 offset;
> > +      rtx this_base = ldp_strip_offset (mem, &modify, &offset);
> > +      gcc_assert (REG_P (this_base));
> > +      if (base_reg)
> > +	gcc_assert (rtx_equal_p (base_reg, this_base));
> > +      else
> > +	base_reg = this_base;
> > +
> > +      // If we changed base for the current insn, then we already
> > +      // derived the correct mem for this insn from the effective
> > +      // address of the other access.
> > +      if (i == changed)
> > +	{
> > +	  gcc_checking_assert (!modify);
> > +	  offsets[i] = offset;
> > +	  continue;
> > +	}
> > +
> > +      if (modify && any_pre_modify_p (modify))
> > +	current_offset += offset;
> > +
> > +      poly_int64 this_off = current_offset;
> > +      if (!modify)
> > +	this_off += offset;
> > +
> > +      offsets[i] = this_off;
> > +      rtx new_mem = change_address (mem, GET_MODE (mem),
> > +				    plus_constant (GET_MODE (base_reg),
> > +						   base_reg, this_off));
> > +      pats[i] = load_p
> > +	? gen_rtx_SET (reg, new_mem)
> > +	: gen_rtx_SET (new_mem, reg);
> > +
> > +      if (modify && any_post_modify_p (modify))
> > +	current_offset += offset;
> > +    }
> > +
> > +  if (known_eq (current_offset, 0))
> > +    return NULL_RTX;
> > +
> > +  return gen_rtx_SET (base_reg, plus_constant (GET_MODE (base_reg),
> > +					       base_reg, current_offset));
> > +}
> > +
> > +static insn_info *
> > +find_trailing_add (insn_info *insns[2],
> > +		   const insn_range_info &pair_range,
> > +		   rtx *writeback_effect,
> > +		   def_info **add_def,
> > +		   def_info *base_def,
> > +		   poly_int64 initial_offset,
> > +		   unsigned access_size)
> > +{
> > +  insn_info *pair_insn = insns[1];
> > +
> > +  def_info *def = base_def->next_def ();
> > +
> > +  while (def
> > +	 && def->bb () == pair_insn->bb ()
> > +	 && *(def->insn ()) <= *pair_insn)
> > +    def = def->next_def ();
> 
> I don't understand the loop.  Why's it OK to skip over intervening defs?

Yeah, I've re-worked it in the latest version to drop the loop; instead
we only skip over defs that occur due to writeback.  Hopefully the
latest version is clearer.
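
The new walk is along these lines (sketch only; the details may
differ):

  // Defs of the base register coming from the candidate insns
  // themselves (i.e. writeback defs) are the only ones we step over;
  // they go away when we form the pair.  Any other intervening def
  // means there is no trailing add we can fold.
  def_info *def = base_def->next_def ();
  while (def
         && def->bb () == pair_insn->bb ()
         && (def->insn () == insns[0] || def->insn () == insns[1]))
    def = def->next_def ();

  if (!def || def->bb () != pair_insn->bb ())
    return nullptr;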

> 
> > +
> > +  if (!def || def->bb () != pair_insn->bb ())
> > +    return nullptr;
> > +
> > +  insn_info *cand = def->insn ();
> > +  const auto base_regno = base_def->regno ();
> > +
> > +  // If CAND doesn't also use our base register,
> > +  // it can't destructively update it.
> > +  if (!find_access (cand->uses (), base_regno))
> > +    return nullptr;
> > +
> > +  auto rti = cand->rtl ();
> > +
> > +  if (!INSN_P (rti))
> > +    return nullptr;
> > +
> > +  auto pat = PATTERN (rti);
> > +  if (GET_CODE (pat) != SET)
> > +    return nullptr;
> > +
> > +  auto dest = XEXP (pat, 0);
> > +  if (!REG_P (dest) || REGNO (dest) != base_regno)
> > +    return nullptr;
> > +
> > +  poly_int64 offset;
> > +  rtx rhs_base = strip_offset (XEXP (pat, 1), &offset);
> > +  if (!REG_P (rhs_base)
> > +      || REGNO (rhs_base) != base_regno
> > +      || !offset.is_constant ())
> > +    return nullptr;
> > +
> > +  // If the initial base offset is zero, we can handle any add offset
> > +  // (post-inc).  Otherwise, we require the offsets to match (pre-inc).
> > +  if (!known_eq (initial_offset, 0) && !known_eq (offset, initial_offset))
> > +    return nullptr;
> > +
> > +  auto off_hwi = offset.to_constant ();
> > +
> > +  if (off_hwi % access_size != 0)
> > +    return nullptr;
> > +
> > +  off_hwi /= access_size;
> > +
> > +  if (off_hwi < LDP_MIN_IMM || off_hwi > LDP_MAX_IMM)
> > +    return nullptr;
> > +
> > +  insn_info *pair_dst = pair_range.singleton ();
> > +  gcc_assert (pair_dst);
> > +
> > +  auto dump_prefix = [&]()
> > +    {
> > +      if (!insns[0])
> > +	fprintf (dump_file, "existing pair i%d: ", insns[1]->uid ());
> > +      else
> > +	fprintf (dump_file, "  (%d,%d)",
> > +		 insns[0]->uid (), insns[1]->uid ());
> > +    };
> > +
> > +  insn_info *hazard = latest_hazard_before (cand, nullptr, pair_insn);
> > +  if (!hazard || *hazard <= *pair_dst)
> > +    {
> > +      if (dump_file)
> > +	{
> > +	  dump_prefix ();
> > +	  fprintf (dump_file,
> > +		   "folding in trailing add (%d) to use writeback form\n",
> > +		   cand->uid ());
> > +	}
> > +
> > +      *add_def = def;
> > +      *writeback_effect = copy_rtx (pat);
> > +      return cand;
> > +    }
> > +
> > +  if (dump_file)
> > +    {
> > +      dump_prefix ();
> > +      fprintf (dump_file,
> > +	       "can't fold in trailing add (%d), hazard = %d\n",
> > +	       cand->uid (), hazard->uid ());
> > +    }
> > +
> > +  return nullptr;
> > +}
> > +
> > +// Try and actually fuse the pair given by insns I1 and I2.
> > +static bool
> > +fuse_pair (bool load_p,
> > +	   unsigned access_size,
> > +	   int writeback,
> > +	   insn_info *i1,
> > +	   insn_info *i2,
> > +	   base_cand &base,
> > +	   const insn_range_info &move_range,
> > +	   bool &emitted_tombstone_p)
> > +{
> > +  auto attempt = crtl->ssa->new_change_attempt ();
> > +
> > +  auto make_change = [&attempt](insn_info *insn)
> > +    {
> > +      return crtl->ssa->change_alloc <insn_change> (attempt, insn);
> > +    };
> > +  auto make_delete = [&attempt](insn_info *insn)
> > +    {
> > +      return crtl->ssa->change_alloc <insn_change> (attempt,
> > +						    insn,
> > +						    insn_change::DELETE);
> > +    };
> > +
> > +  // Are we using a tombstone insn for this pair?
> > +  bool have_tombstone_p = false;
> > +
> > +  insn_info *first = (*i1 < *i2) ? i1 : i2;
> > +  insn_info *second = (first == i1) ? i2 : i1;
> > +
> > +  insn_info *insns[2] = { first, second };
> > +
> > +  auto_vec <insn_change *> changes;
> > +  changes.reserve (4);
> > +
> > +  rtx pats[2] = {
> > +    PATTERN (first->rtl ()),
> > +    PATTERN (second->rtl ())
> > +  };
> > +
> > +  use_array input_uses[2] = { first->uses (), second->uses () };
> > +  def_array input_defs[2] = { first->defs (), second->defs () };
> > +
> > +  int changed_insn = -1;
> > +  if (base.from_insn != -1)
> > +    {
> > +      // If we're not already using a shared base, we need
> > +      // to re-write one of the accesses to use the base from
> > +      // the other insn.
> > +      gcc_checking_assert (base.from_insn == 0 || base.from_insn == 1);
> > +      changed_insn = !base.from_insn;
> > +
> > +      rtx base_pat = pats[base.from_insn];
> > +      rtx change_pat = pats[changed_insn];
> > +      rtx base_mem = XEXP (base_pat, load_p);
> > +      rtx change_mem = XEXP (change_pat, load_p);
> > +
> > +      const bool lower_base_p = (insns[base.from_insn] == i1);
> > +      HOST_WIDE_INT adjust_amt = access_size;
> > +      if (!lower_base_p)
> > +	adjust_amt *= -1;
> > +
> > +      rtx change_reg = XEXP (change_pat, !load_p);
> > +      machine_mode mode_for_mem = GET_MODE (change_mem);
> > +      rtx effective_base = drop_writeback (base_mem);
> > +      rtx new_mem = adjust_address_nv (effective_base,
> > +				       mode_for_mem,
> > +				       adjust_amt);
> > +      rtx new_set = load_p
> > +	? gen_rtx_SET (change_reg, new_mem)
> > +	: gen_rtx_SET (new_mem, change_reg);
> > +
> > +      pats[changed_insn] = new_set;
> > +
> > +      auto keep_use = [&](use_info *u)
> > +	{
> > +	  return refers_to_regno_p (u->regno (), u->regno () + 1,
> > +				    change_pat, &XEXP (change_pat, load_p));
> > +	};
> > +
> > +      // Drop any uses that only occur in the old address.
> > +      input_uses[changed_insn] = filter_accesses (attempt,
> > +						  input_uses[changed_insn],
> > +						  keep_use);
> > +    }
> > +
> > +  rtx writeback_effect = NULL_RTX;
> > +  if (writeback)
> > +    writeback_effect = extract_writebacks (load_p, pats, changed_insn);
> > +
> > +  const auto base_regno = base.m_def->regno ();
> > +
> > +  if (base.from_insn == -1 && (writeback & 1))
> > +    {
> > +      // If the first of the candidate insns had a writeback form, we'll need to
> > +      // drop the use of the updated base register from the second insn's uses.
> > +      //
> > +      // N.B. we needn't worry about the base register occurring as a store
> > +      // operand, as we checked that there was no non-address true dependence
> > +      // between the insns in try_fuse_pair.
> > +      gcc_checking_assert (find_access (input_uses[1], base_regno));
> > +      input_uses[1] = check_remove_regno_access (attempt,
> > +						 input_uses[1],
> > +						 base_regno);
> > +    }
> > +
> > +  // Go through and drop uses that only occur in register notes,
> > +  // as we won't be preserving those.
> > +  for (int i = 0; i < 2; i++)
> > +    {
> > +      auto rti = insns[i]->rtl ();
> > +      if (!REG_NOTES (rti))
> > +	continue;
> > +
> > +      input_uses[i] = remove_note_accesses (attempt, input_uses[i]);
> > +    }
> > +
> > +  // Edge case: if the first insn is a writeback load and the
> > +  // second insn is a non-writeback load which transfers into the base
> > +  // register, then we should drop the writeback altogether as the
> > +  // update of the base register from the second load should prevail.
> > +  //
> > +  // For example:
> > +  //   ldr x2, [x1], #8
> > +  //   ldr x1, [x1]
> > +  //   -->
> > +  //   ldp x2, x1, [x1]
> > +  if (writeback == 1
> > +      && load_p
> > +      && find_access (input_defs[1], base_regno))
> > +    {
> > +      if (dump_file)
> > +	fprintf (dump_file,
> > +		 "  ldp: i%d has wb but subsequent i%d has non-wb "
> > +		 "update of base (r%d), dropping wb\n",
> > +		 insns[0]->uid (), insns[1]->uid (), base_regno);
> > +      gcc_assert (writeback_effect);
> > +      writeback_effect = NULL_RTX;
> > +    }
> 
> What guarantees that there are no other uses of the writeback result?

I think any intervening uses of the writeback result should show up as
dataflow hazards and the pair will thus get rejected in try_fuse_pair,
before we get to this point.

> 
> > +
> > +  // If both of the original insns had a writeback form, then we should drop the
> > +  // first def.  The second def could well have uses, but the first def should
> > +  // only be used by the second insn (and we dropped that use above).
> 
> Same question here I suppose: how's the single use condition enforced?

I think the same answer applies.

> 
> > +  if (writeback == 3)
> > +    input_defs[0] = check_remove_regno_access (attempt,
> > +					       input_defs[0],
> > +					       base_regno);
> > +
> > +  // So far the patterns have been in instruction order,
> > +  // now we want them in offset order.
> > +  if (i1 != first)
> > +    std::swap (pats[0], pats[1]);
> > +
> > +  poly_int64 offsets[2];
> > +  for (int i = 0; i < 2; i++)
> > +    {
> > +      rtx mem = XEXP (pats[i], load_p);
> > +      gcc_checking_assert (MEM_P (mem));
> > +      rtx base = strip_offset (XEXP (mem, 0), offsets + i);
> > +      gcc_checking_assert (REG_P (base));
> > +      gcc_checking_assert (base_regno == REGNO (base));
> > +    }
> > +
> > +  insn_info *trailing_add = nullptr;
> > +  if (aarch64_ldp_writeback > 1 && !writeback_effect)
> > +    {
> > +      def_info *add_def;
> > +      trailing_add = find_trailing_add (insns, move_range, &writeback_effect,
> > +					&add_def, base.m_def, offsets[0],
> > +					access_size);
> > +      if (trailing_add && !writeback)
> > +	{
> > +	  // If there was no writeback to start with, we need to preserve the
> > +	  // def of the base register from the add insn.
> > +	  input_defs[0] = insert_access (attempt, add_def, input_defs[0]);
> > +	  gcc_assert (input_defs[0].is_valid ());
> 
> How do we avoid doing that in the writeback!=0 case?

I think this was an oversight, fixed in the latest version.

> 
> > +	}
> > +    }
> > +
> > +  // If either of the original insns had writeback, but the resulting
> > +  // pair insn does not (can happen e.g. in the ldp edge case above, or
> > +  // if the writeback effects cancel out), then drop the def(s) of the
> > +  // base register as appropriate.
> > +  if (!writeback_effect)
> > +    for (int i = 0; i < 2; i++)
> > +      if (writeback & (1 << i))
> > +	input_defs[i] = check_remove_regno_access (attempt,
> > +						   input_defs[i],
> > +						   base_regno);
> 
> Is there any scope for simplifying the writeback logic?  There seems
> to be some overlap in intent between this loop and the previous
> writeback==3 handling.

Yeah, I've tried to combine these two cases into a single loop in the
latest version.
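
For reference, the combined version is a single loop along these
lines (sketch):

  // If either of the original insns had writeback, but the resulting
  // pair insn does not (e.g. the ldp edge case above, or the writeback
  // effects cancel out), then drop the def(s) of the base register as
  // appropriate.
  //
  // Also drop the first def when both of the original insns had
  // writeback: the second def may well have uses, but the first def
  // should only be used by the second insn (and we dropped that use
  // above).
  for (int i = 0; i < 2; i++)
    if ((!writeback_effect && (writeback & (1 << i)))
        || (i == 0 && writeback == 3))
      input_defs[i] = check_remove_regno_access (attempt,
                                                 input_defs[i],
                                                 base_regno);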

> 
> > +
> > +  // Now that we know what base mem we're going to use, check if it's OK
> > +  // with the ldp/stp policy.
> > +  rtx first_mem = XEXP (pats[0], load_p);
> > +  if (!aarch64_mem_ok_with_ldpstp_policy_model (first_mem,
> > +						load_p,
> > +						GET_MODE (first_mem)))
> > +    {
> > +      if (dump_file)
> > +	fprintf (dump_file, "punting on pair (%d,%d), ldp/stp policy says no\n",
> > +		 i1->uid (), i2->uid ());
> > +      return false;
> > +    }
> > +
> > +  bool reg_notes_ok = true;
> > +  rtx reg_notes = combine_reg_notes (i1, i2, writeback_effect, reg_notes_ok);
> > +  if (!reg_notes_ok)
> > +    return false;
> > +
> > +  rtx pair_pat;
> > +  if (writeback_effect)
> > +    {
> > +      auto patvec = gen_rtvec (3, writeback_effect, pats[0], pats[1]);
> > +      pair_pat = gen_rtx_PARALLEL (VOIDmode, patvec);
> > +    }
> > +  else if (load_p)
> > +    pair_pat = aarch64_gen_load_pair (XEXP (pats[0], 0),
> > +				      XEXP (pats[1], 0),
> > +				      XEXP (pats[0], 1));
> > +  else
> > +    pair_pat = aarch64_gen_store_pair (XEXP (pats[0], 0),
> > +				       XEXP (pats[0], 1),
> > +				       XEXP (pats[1], 1));
> > +
> > +  insn_change *pair_change = nullptr;
> > +  auto set_pair_pat = [pair_pat,reg_notes](insn_change *change) {
> > +      rtx_insn *rti = change->insn ()->rtl ();
> > +      gcc_assert (validate_unshare_change (rti, &PATTERN (rti), pair_pat,
> > +					   true));
> > +      gcc_assert (validate_change (rti, &REG_NOTES (rti),
> > +				   reg_notes, true));
> > +  };
> > +
> > +  if (load_p)
> > +    {
> > +      changes.quick_push (make_delete (first));
> > +      pair_change = make_change (second);
> > +      changes.quick_push (pair_change);
> > +
> > +      pair_change->move_range = move_range;
> > +      pair_change->new_defs = merge_access_arrays (attempt,
> > +						   input_defs[0],
> > +						   input_defs[1]);
> > +      gcc_assert (pair_change->new_defs.is_valid ());
> > +
> > +      pair_change->new_uses
> > +	= merge_access_arrays (attempt,
> > +			       drop_memory_access (input_uses[0]),
> > +			       drop_memory_access (input_uses[1]));
> > +      gcc_assert (pair_change->new_uses.is_valid ());
> > +      set_pair_pat (pair_change);
> > +    }
> > +  else
> > +    {
> > +      change_strategy strategy[2];
> > +      def_info *stp_def = decide_stp_strategy (strategy, first, second,
> > +					       move_range);
> > +      if (dump_file)
> > +	{
> > +	  auto cs1 = cs_to_string (strategy[0]);
> > +	  auto cs2 = cs_to_string (strategy[1]);
> > +	  fprintf (dump_file,
> > +		   "  stp strategy for candidate insns (%d,%d): (%s,%s)\n",
> > +		   insns[0]->uid (), insns[1]->uid (), cs1, cs2);
> > +	  if (stp_def)
> > +	    fprintf (dump_file,
> > +		     "  re-using mem def from insn %d\n",
> > +		     stp_def->insn ()->uid ());
> > +	}
> > +
> > +      insn_change *change;
> > +      for (int i = 0; i < 2; i++)
> > +	{
> > +	  switch (strategy[i])
> > +	    {
> > +	    case DELETE:
> > +	      changes.quick_push (make_delete (insns[i]));
> > +	      break;
> > +	    case TOMBSTONE:
> > +	    case CHANGE:
> > +	      change = make_change (insns[i]);
> > +	      if (strategy[i] == CHANGE)
> > +		{
> > +		  set_pair_pat (change);
> > +		  change->new_uses = merge_access_arrays (attempt,
> > +							  input_uses[0],
> > +							  input_uses[1]);
> > +		  auto d1 = drop_memory_access (input_defs[0]);
> > +		  auto d2 = drop_memory_access (input_defs[1]);
> > +		  change->new_defs = merge_access_arrays (attempt, d1, d2);
> > +		  gcc_assert (change->new_defs.is_valid ());
> > +		  gcc_assert (stp_def);
> > +		  change->new_defs = insert_access (attempt,
> > +						    stp_def,
> > +						    change->new_defs);
> > +		  gcc_assert (change->new_defs.is_valid ());
> > +		  change->move_range = move_range;
> > +		  pair_change = change;
> > +		}
> > +	      else
> > +		{
> > +		  rtx_insn *rti = insns[i]->rtl ();
> > +		  gcc_assert (validate_change (rti, &PATTERN (rti),
> > +					       gen_tombstone (), true));
> > +		  gcc_assert (validate_change (rti, &REG_NOTES (rti),
> > +					       NULL_RTX, true));
> > +		  change->new_uses = use_array (nullptr, 0);
> > +		  have_tombstone_p = true;
> > +		}
> > +	      gcc_assert (change->new_uses.is_valid ());
> > +	      changes.quick_push (change);
> > +	      break;
> > +	    }
> > +	}
> > +
> > +      if (!stp_def)
> > +	{
> > +	  // Tricky case.  Cannot re-purpose existing insns for stp.
> > +	  // Need to insert new insn.
> > +	  if (dump_file)
> > +	    fprintf (dump_file,
> > +		     "  stp fusion: cannot re-purpose candidate stores\n");
> > +
> > +	  auto new_insn = crtl->ssa->create_insn (attempt, INSN, pair_pat);
> > +	  change = make_change (new_insn);
> > +	  change->move_range = move_range;
> > +	  change->new_uses = merge_access_arrays (attempt,
> > +						  input_uses[0],
> > +						  input_uses[1]);
> > +	  gcc_assert (change->new_uses.is_valid ());
> > +
> > +	  auto d1 = drop_memory_access (input_defs[0]);
> > +	  auto d2 = drop_memory_access (input_defs[1]);
> > +	  change->new_defs = merge_access_arrays (attempt, d1, d2);
> > +	  gcc_assert (change->new_defs.is_valid ());
> > +
> > +	  auto new_set = crtl->ssa->create_set (attempt, new_insn, memory);
> > +	  change->new_defs = insert_access (attempt, new_set,
> > +					    change->new_defs);
> > +	  gcc_assert (change->new_defs.is_valid ());
> > +	  changes.safe_insert (1, change);
> > +	  pair_change = change;
> > +	}
> > +    }
> > +
> > +  if (trailing_add)
> > +    changes.quick_push (make_delete (trailing_add));
> > +
> > +  auto n_changes = changes.length ();
> > +  gcc_checking_assert (n_changes >= 2 && n_changes <= 4);
> > +
> > +  auto is_changing = insn_is_changing (changes);
> > +  for (unsigned i = 0; i < n_changes; i++)
> > +    gcc_assert (rtl_ssa::restrict_movement_ignoring (*changes[i], is_changing));
> > +
> > +  // Check the pair pattern is recog'd.
> > +  if (!rtl_ssa::recog_ignoring (attempt, *pair_change, is_changing))
> > +    {
> > +      if (dump_file)
> > +	fprintf (dump_file, "  failed to form pair, recog failed\n");
> > +
> > +      // Free any reg notes we allocated.
> > +      while (reg_notes)
> > +	{
> > +	  rtx next = XEXP (reg_notes, 1);
> > +	  free_EXPR_LIST_node (reg_notes);
> > +	  reg_notes = next;
> > +	}
> > +      cancel_changes (0);
> > +      return false;
> > +    }
> > +
> > +  gcc_assert (crtl->ssa->verify_insn_changes (changes));
> > +
> > +  confirm_change_group ();
> > +  crtl->ssa->change_insns (changes);
> > +  emitted_tombstone_p |= have_tombstone_p;
> > +  return true;
> > +}
> > +
> > +// Return true if STORE_INSN may modify mem rtx MEM.  Make sure we keep
> > +// within our BUDGET for alias analysis.
> > +static bool
> > +store_modifies_mem_p (rtx mem, insn_info *store_insn, int &budget)
> > +{
> > +  if (tombstone_insn_p (store_insn))
> > +    return false;
> > +
> > +  if (!budget)
> > +    {
> > +      if (dump_file)
> > +	{
> > +	  fprintf (dump_file,
> > +		   "exceeded budget, assuming store %d aliases with mem ",
> > +		   store_insn->uid ());
> > +	  print_simple_rtl (dump_file, mem);
> > +	  fprintf (dump_file, "\n");
> > +	}
> > +
> > +      return true;
> > +    }
> > +
> > +  budget--;
> > +  return memory_modified_in_insn_p (mem, store_insn->rtl ());
> > +}
> > +
> > +// Return true if LOAD may be modified by STORE.  Make sure we keep
> > +// within our BUDGET for alias analysis.
> > +static bool
> > +load_modified_by_store_p (insn_info *load,
> > +			  insn_info *store,
> > +			  int &budget)
> > +{
> > +  gcc_checking_assert (budget >= 0);
> > +
> > +  if (!budget)
> > +    {
> > +      if (dump_file)
> > +	{
> > +	  fprintf (dump_file,
> > +		   "exceeded budget, assuming load %d aliases with store %d\n",
> > +		   load->uid (), store->uid ());
> > +	}
> > +      return true;
> > +    }
> > +
> > +  // It isn't safe to re-order stores over calls.
> > +  if (CALL_P (load->rtl ()))
> > +    return true;
> > +
> > +  budget--;
> > +  return modified_in_p (PATTERN (load->rtl ()), store->rtl ());
> 
> Any reason not to use memory_modified_in_p directly here too?
> I'd have expected the other dependencies to be covered by the
> RTL-SSA checks.

I think it's because in this case the load can have multiple MEMs
(e.g. it can be a PARALLEL), so I was relying on modified_in_p to find
all of them.

Would you prefer if I iterated over all the MEMs in the rtx and called
memory_modified_in_insn_p on each?
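
Something like the below, I mean (sketch; assumes rtl-iter.h is
included):

  budget--;

  // Walk all MEMs in the load's pattern and check each one against
  // the store individually.
  subrtx_iterator::array_type array;
  FOR_EACH_SUBRTX (iter, array, PATTERN (load->rtl ()), NONCONST)
    if (MEM_P (*iter) && memory_modified_in_insn_p (*iter, store->rtl ()))
      return true;

  return false;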

> 
> > +}
> > +
> > +struct alias_walker
> > +{
> > +  virtual insn_info *insn () const = 0;
> > +  virtual bool valid () const = 0;
> > +  virtual bool conflict_p (int &budget) const = 0;
> > +  virtual void advance () = 0;
> > +};
> > +
> > +template<bool reverse>
> > +class store_walker : public alias_walker
> > +{
> > +  using def_iter_t = typename std::conditional <reverse,
> > +	reverse_def_iterator, def_iterator>::type;
> > +
> > +  def_iter_t def_iter;
> > +  rtx cand_mem;
> > +  insn_info *limit;
> > +
> > +public:
> > +  store_walker (def_info *mem_def, rtx mem, insn_info *limit_insn) :
> > +    def_iter (mem_def), cand_mem (mem), limit (limit_insn) {}
> > +
> > +  bool valid () const override
> > +    {
> > +      if (!*def_iter)
> > +	return false;
> > +
> > +      if (reverse)
> > +	return *((*def_iter)->insn ()) > *limit;
> > +      else
> > +	return *((*def_iter)->insn ()) < *limit;
> > +    }
> > +  insn_info *insn () const override { return (*def_iter)->insn (); }
> > +  void advance () override { def_iter++; }
> > +  bool conflict_p (int &budget) const override
> > +  {
> > +    return store_modifies_mem_p (cand_mem, insn (), budget);
> > +  }
> > +};
> > +
> > +template<bool reverse>
> > +class load_walker : public alias_walker
> > +{
> > +  using def_iter_t = typename std::conditional <reverse,
> > +	reverse_def_iterator, def_iterator>::type;
> > +  using use_iter_t = typename std::conditional <reverse,
> > +	reverse_use_iterator, nondebug_insn_use_iterator>::type;
> > +
> > +  def_iter_t def_iter;
> > +  use_iter_t use_iter;
> > +  insn_info *cand_store;
> > +  insn_info *limit;
> > +
> > +  static use_info *start_use_chain (def_iter_t &def_iter)
> > +  {
> > +    set_info *set = nullptr;
> > +    for (; *def_iter; def_iter++)
> > +      {
> > +	set = dyn_cast <set_info *> (*def_iter);
> > +	if (!set)
> > +	  continue;
> > +
> > +	use_info *use = reverse
> > +	  ? set->last_nondebug_insn_use ()
> > +	  : set->first_nondebug_insn_use ();
> > +
> > +	if (use)
> > +	  return use;
> > +      }
> > +
> > +    return nullptr;
> > +  }
> > +
> > +public:
> > +  void advance () override
> > +  {
> > +    use_iter++;
> > +    if (*use_iter)
> > +      return;
> > +    def_iter++;
> > +    use_iter = start_use_chain (def_iter);
> > +  }
> > +
> > +  insn_info *insn () const override
> > +  {
> > +    gcc_checking_assert (*use_iter);
> > +    return (*use_iter)->insn ();
> > +  }
> > +
> > +  bool valid () const override
> > +  {
> > +    if (!*use_iter)
> > +      return false;
> > +
> > +    if (reverse)
> > +      return *((*use_iter)->insn ()) > *limit;
> > +    else
> > +      return *((*use_iter)->insn ()) < *limit;
> > +  }
> > +
> > +  bool conflict_p (int &budget) const override
> > +  {
> > +    return load_modified_by_store_p (insn (), cand_store, budget);
> > +  }
> > +
> > +  load_walker (def_info *def, insn_info *store, insn_info *limit_insn)
> > +    : def_iter (def), use_iter (start_use_chain (def_iter)),
> > +      cand_store (store), limit (limit_insn) {}
> > +};
> 
> Could we move more of the code to the base class?  It looks like
> the iteration parts are very similar.

Yeah, I've added a def_walker class which inherits from the virtual base
in the latest version.  I've then tried to move the common logic/state
up into that class.
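
For reference, the common base looks something like this (sketch,
names illustrative): the derived walkers keep their own conflict_p,
and load_walker layers the use-chain iteration on top.

template<bool reverse>
class def_walker : public alias_walker
{
protected:
  using def_iter_t = typename std::conditional<reverse,
    reverse_def_iterator, def_iterator>::type;

  def_iter_t def_iter;
  insn_info *limit;

  def_walker (def_info *def, insn_info *limit_insn)
    : def_iter (def), limit (limit_insn) {}

  // Return true if I lies strictly before the limit insn (or strictly
  // after it, when walking in reverse).
  bool in_range_p (insn_info *i) const
  {
    return reverse ? *i > *limit : *i < *limit;
  }
};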

> 
> > +
> > +// Process our alias_walkers in a round-robin fashion, proceeding until
> > +// nothing more can be learned from alias analysis.
> > +//
> > +// We try to maintain the invariant that if a walker becomes invalid, we
> > +// set its pointer to null.
> > +static void
> > +do_alias_analysis (insn_info *alias_hazards[4],
> > +		   alias_walker *walkers[4],
> > +		   bool load_p)
> > +{
> > +  const int n_walkers = 2 + (2 * !load_p);
> > +  int budget = aarch64_ldp_alias_check_limit;
> > +
> > +  auto next_walker = [walkers,n_walkers](int current) -> int {
> > +    for (int j = 1; j <= n_walkers; j++)
> > +      {
> > +	int idx = (current + j) % n_walkers;
> > +	if (walkers[idx])
> > +	  return idx;
> > +      }
> > +    return -1;
> > +  };
> > +
> > +  int i = -1;
> > +  for (int j = 0; j < n_walkers; j++)
> > +    {
> > +      alias_hazards[j] = nullptr;
> > +      if (!walkers[j])
> > +	continue;
> > +
> > +      if (!walkers[j]->valid ())
> > +	walkers[j] = nullptr;
> > +      else if (i == -1)
> > +	i = j;
> > +    }
> > +
> > +  while (i >= 0)
> > +    {
> > +      int insn_i = i % 2;
> > +      int paired_i = (i & 2) + !insn_i;
> > +      int pair_fst = (i & 2);
> > +      int pair_snd = (i & 2) + 1;
> > +
> > +      if (walkers[i]->conflict_p (budget))
> > +	{
> > +	  alias_hazards[i] = walkers[i]->insn ();
> > +
> > +	  // We got an aliasing conflict for this {load,store} walker,
> > +	  // so we don't need to walk any further.
> > +	  walkers[i] = nullptr;
> > +
> > +	  // If we have a pair of alias conflicts that prevent
> > +	  // forming the pair, stop.  There's no need to do further
> > +	  // analysis.
> > +	  if (alias_hazards[paired_i]
> > +	      && (*alias_hazards[pair_fst] <= *alias_hazards[pair_snd]))
> > +	    return;
> > +
> > +	  if (!load_p)
> > +	    {
> > +	      int other_pair_fst = (pair_fst ? 0 : 2);
> > +	      int other_paired_i = other_pair_fst + !insn_i;
> > +
> > +	      int x_pair_fst = (i == pair_fst) ? i : other_paired_i;
> > +	      int x_pair_snd = (i == pair_fst) ? other_paired_i : i;
> > +
> > +	      // Similarly, handle the case where we have a {load,store}
> > +	      // or {store,load} alias hazard pair that prevents forming
> > +	      // the pair.
> > +	      if (alias_hazards[other_paired_i]
> > +		  && *alias_hazards[x_pair_fst] <= *alias_hazards[x_pair_snd])
> > +		return;
> > +	    }
> > +	}
> > +
> > +      if (walkers[i])
> > +	{
> > +	  walkers[i]->advance ();
> > +
> > +	  if (!walkers[i]->valid ())
> > +	    walkers[i] = nullptr;
> > +	}
> > +
> > +      i = next_walker (i);
> > +    }
> > +}
> > +
> > +// Return an integer where bit (1 << i) is set if INSNS[i] uses writeback
> > +// addressing.
> > +static int
> > +get_viable_bases (insn_info *insns[2],
> > +		  vec <base_cand> &base_cands,
> > +		  rtx cand_mems[2],
> > +		  unsigned access_size,
> > +		  bool reversed)
> > +{
> > +  // We discovered this pair through a common base.  Need to ensure that
> > +  // we have a common base register that is live at both locations.
> > +  def_info *base_defs[2] = {};
> > +  int writeback = 0;
> > +  for (int i = 0; i < 2; i++)
> > +    {
> > +      const bool is_lower = (i == reversed);
> > +      poly_int64 poly_off;
> > +      rtx modify = NULL_RTX;
> > +      rtx base = ldp_strip_offset (cand_mems[i], &modify, &poly_off);
> > +      if (modify)
> > +	writeback |= (1 << i);
> > +
> > +      if (!REG_P (base) || !poly_off.is_constant ())
> > +	continue;
> > +
> > +      // Punt on accesses relative to eliminable regs.  Since we don't know the
> > +      // elimination offset pre-RA, we should postpone forming pairs on such
> > +      // accesses until after RA.
> > +      if (!reload_completed
> > +	  && (REGNO (base) == FRAME_POINTER_REGNUM
> > +	      || REGNO (base) == ARG_POINTER_REGNUM))
> > +	continue;
> 
> Same as above, it's not obvious from the comment why this is necessary.

See above.

> 
> > +
> > +      HOST_WIDE_INT base_off = poly_off.to_constant ();
> > +
> > +      // It should be unlikely that we ever punt here, since MEM_EXPR offset
> > +      // alignment should be a good proxy for register offset alignment.
> > +      if (base_off % access_size != 0)
> > +	{
> > +	  if (dump_file)
> > +	    fprintf (dump_file,
> > +		     "base not viable, offset misaligned (insn %d)\n",
> > +		     insns[i]->uid ());
> > +	  continue;
> > +	}
> > +
> > +      base_off /= access_size;
> > +
> > +      if (!is_lower)
> > +	base_off--;
> > +
> > +      if (base_off < LDP_MIN_IMM || base_off > LDP_MAX_IMM)
> > +	continue;
> > +
> > +      for (auto use : insns[i]->uses ())
> > +	if (use->is_reg () && use->regno () == REGNO (base))
> > +	  {
> > +	    base_defs[i] = use->def ();
> > +	    break;
> > +	  }
> > +    }
> > +
> > +  if (!base_defs[0] && !base_defs[1])
> > +    {
> > +      if (dump_file)
> > +	fprintf (dump_file, "no viable base register for pair (%d,%d)\n",
> > +		 insns[0]->uid (), insns[1]->uid ());
> > +      return writeback;
> > +    }
> > +
> > +  for (int i = 0; i < 2; i++)
> > +    if ((writeback & (1 << i)) && !base_defs[i])
> > +      {
> > +	if (dump_file)
> > +	  fprintf (dump_file, "insn %d has writeback but base isn't viable\n",
> > +		   insns[i]->uid ());
> > +	return writeback;
> > +      }
> > +
> > +  if (writeback == 3
> > +      && base_defs[0]->regno () != base_defs[1]->regno ())
> > +    {
> > +      if (dump_file)
> > +	fprintf (dump_file,
> > +		 "pair (%d,%d): double writeback with distinct regs (%d,%d): "
> > +		 "punting\n",
> > +		 insns[0]->uid (), insns[1]->uid (),
> > +		 base_defs[0]->regno (), base_defs[1]->regno ());
> > +      return writeback;
> > +    }
> > +
> > +  if (base_defs[0] && base_defs[1]
> > +      && base_defs[0]->regno () == base_defs[1]->regno ())
> > +    {
> > +      // Easy case: insns already share the same base reg.
> > +      base_cands.quick_push (base_defs[0]);
> > +      return writeback;
> > +    }
> > +
> > +  // Otherwise, we know that one of the bases must change.
> > +  //
> > +  // Note that if there is writeback we must use the writeback base
> > +  // (we know now there is exactly one).
> > +  for (int i = 0; i < 2; i++)
> > +    if (base_defs[i] && (!writeback || (writeback & (1 << i))))
> > +      base_cands.quick_push (base_cand { base_defs[i], i });
> > +
> > +  return writeback;
> > +}
> > +
> > +// Given two adjacent memory accesses of the same size, I1 and I2, try
> > +// and see if we can merge them into a ldp or stp.
> > +static bool
> > +try_fuse_pair (bool load_p,
> > +	       unsigned access_size,
> > +	       insn_info *i1,
> > +	       insn_info *i2,
> > +	       bool &emitted_tombstone_p)
> > +{
> > +  if (dump_file)
> > +    fprintf (dump_file, "analyzing pair (load=%d): (%d,%d)\n",
> > +	     load_p, i1->uid (), i2->uid ());
> > +
> > +  insn_info *insns[2];
> > +  bool reversed = false;
> > +  if (*i1 < *i2)
> > +    {
> > +      insns[0] = i1;
> > +      insns[1] = i2;
> > +    }
> > +  else
> > +    {
> > +      insns[0] = i2;
> > +      insns[1] = i1;
> > +      reversed = true;
> > +    }
> > +
> > +  rtx cand_mems[2];
> > +  rtx reg_ops[2];
> > +  rtx pats[2];
> > +  for (int i = 0; i < 2; i++)
> > +    {
> > +      pats[i] = PATTERN (insns[i]->rtl ());
> > +      cand_mems[i] = XEXP (pats[i], load_p);
> > +      reg_ops[i] = XEXP (pats[i], !load_p);
> > +    }
> > +
> > +  if (load_p && reg_overlap_mentioned_p (reg_ops[0], reg_ops[1]))
> > +    {
> > +      if (dump_file)
> > +	fprintf (dump_file,
> > +		 "punting on ldp due to reg conflicts (%d,%d)\n",
> > +		 insns[0]->uid (), insns[1]->uid ());
> > +      return false;
> > +    }
> > +
> > +  if (cfun->can_throw_non_call_exceptions
> > +      && (find_reg_note (insns[0]->rtl (), REG_EH_REGION, NULL_RTX)
> > +	  || find_reg_note (insns[1]->rtl (), REG_EH_REGION, NULL_RTX))
> > +      && insn_could_throw_p (insns[0]->rtl ())
> > +      && insn_could_throw_p (insns[1]->rtl ()))
> 
> I don't get the nuance of this condition.  The REG_EH_REGION part is
> definitely OK, but why are the insn_could_throw_p parts needed?

If I'm honest, this was cribbed from combine.cc:try_combine, which has:

  /* With non-call exceptions we can end up trying to combine multiple
     insns with possible EH side effects.  Make sure we can combine
     that to a single insn which means there must be at most one insn
     in the combination with an EH side effect.  */
  if (cfun->can_throw_non_call_exceptions)
    {
      if (find_reg_note (i3, REG_EH_REGION, NULL_RTX)
	  || find_reg_note (i2, REG_EH_REGION, NULL_RTX)
	  || (i1 && find_reg_note (i1, REG_EH_REGION, NULL_RTX))
	  || (i0 && find_reg_note (i0, REG_EH_REGION, NULL_RTX)))
	{
	  has_non_call_exception = true;
	  if (insn_could_throw_p (i3)
	      + insn_could_throw_p (i2)
	      + (i1 ? insn_could_throw_p (i1) : 0)
	      + (i0 ? insn_could_throw_p (i0) : 0) > 1)
	    {
	      if (dump_file && (dump_flags & TDF_DETAILS))
		fprintf (dump_file, "Can't combine multiple insns with EH "
			 "side-effects\n");
	      undo_all ();
	      return 0;
	    }
	}
    }

so I carried the condition over on the assumption that the existing combine
code was correct.

> 
> > +    {
> > +      if (dump_file)
> > +	fprintf (dump_file,
> > +		 "can't combine insns with EH side effects (%d,%d)\n",
> > +		 insns[0]->uid (), insns[1]->uid ());
> > +      return false;
> > +    }
> > +
> > +  auto_vec <base_cand> base_cands;
> > +  base_cands.reserve (2);
> > +
> > +  int writeback = get_viable_bases (insns, base_cands, cand_mems,
> > +				    access_size, reversed);
> > +  if (base_cands.is_empty ())
> > +    {
> > +      if (dump_file)
> > +	fprintf (dump_file, "no viable base for pair (%d,%d)\n",
> > +		 insns[0]->uid (), insns[1]->uid ());
> > +      return false;
> > +    }
> > +
> > +  rtx *ignore = &XEXP (pats[1], load_p);
> > +  for (auto use : insns[1]->uses ())
> > +    if (!use->is_mem ()
> > +	&& refers_to_regno_p (use->regno (), use->regno () + 1, pats[1], ignore)
> > +	&& use->def () && use->def ()->insn () == insns[0])
> > +      {
> > +	// N.B. we allow a true dependence on the base address, as this
> > +	// happens in the case of auto-inc accesses.  Consider a post-increment
> > +	// load followed by a regular indexed load, for example.
> > +	if (dump_file)
> > +	  fprintf (dump_file,
> > +		   "%d has non-address true dependence on %d, rejecting pair\n",
> > +		   insns[1]->uid (), insns[0]->uid ());
> > +	return false;
> > +      }
> > +
> > +  unsigned i = 0;
> > +  while (i < base_cands.length ())
> > +    {
> > +      base_cand &cand = base_cands[i];
> > +
> > +      rtx *ignore[2] = {};
> > +      for (int j = 0; j < 2; j++)
> > +	if (cand.from_insn == !j)
> > +	  ignore[j] = &XEXP (cand_mems[j], 0);
> > +
> > +      insn_info *h = first_hazard_after (insns[0], ignore[0]);
> > +      if (h && *h <= *insns[1])
> > +	cand.hazards[0] = h;
> > +
> > +      h = latest_hazard_before (insns[1], ignore[1]);
> > +      if (h && *h >= *insns[0])
> > +	cand.hazards[1] = h;
> > +
> > +      if (!cand.viable ())
> > +	{
> > +	  if (dump_file)
> > +	    fprintf (dump_file,
> > +		     "pair (%d,%d): rejecting base %d due to dataflow "
> > +		     "hazards (%d,%d)\n",
> > +		     insns[0]->uid (),
> > +		     insns[1]->uid (),
> > +		     cand.m_def->regno (),
> > +		     cand.hazards[0]->uid (),
> > +		     cand.hazards[1]->uid ());
> > +
> > +	  base_cands.ordered_remove (i);
> > +	}
> > +      else
> > +	i++;
> > +    }
> > +
> > +  if (base_cands.is_empty ())
> > +    {
> > +      if (dump_file)
> > +	fprintf (dump_file,
> > +		 "can't form pair (%d,%d) due to dataflow hazards\n",
> > +		 insns[0]->uid (), insns[1]->uid ());
> > +      return false;
> > +    }
> > +
> > +  insn_info *alias_hazards[4] = {};
> > +
> > +  // First def of memory after the first insn, and last def of memory
> > +  // before the second insn, respectively.
> > +  def_info *mem_defs[2] = {};
> > +  if (load_p)
> > +    {
> > +      if (!MEM_READONLY_P (cand_mems[0]))
> > +	{
> > +	  mem_defs[0] = memory_access (insns[0]->uses ())->def ();
> > +	  gcc_checking_assert (mem_defs[0]);
> > +	  mem_defs[0] = mem_defs[0]->next_def ();
> > +	}
> > +      if (!MEM_READONLY_P (cand_mems[1]))
> > +	{
> > +	  mem_defs[1] = memory_access (insns[1]->uses ())->def ();
> > +	  gcc_checking_assert (mem_defs[1]);
> > +	}
> > +    }
> > +  else
> > +    {
> > +      mem_defs[0] = memory_access (insns[0]->defs ())->next_def ();
> > +      mem_defs[1] = memory_access (insns[1]->defs ())->prev_def ();
> > +      gcc_checking_assert (mem_defs[0]);
> > +      gcc_checking_assert (mem_defs[1]);
> > +    }
> > +
> > +  store_walker<false> forward_store_walker (mem_defs[0],
> > +					    cand_mems[0],
> > +					    insns[1]);
> > +  store_walker<true> backward_store_walker (mem_defs[1],
> > +					    cand_mems[1],
> > +					    insns[0]);
> > +  alias_walker *walkers[4] = {};
> > +  if (mem_defs[0])
> > +    walkers[0] = &forward_store_walker;
> > +  if (mem_defs[1])
> > +    walkers[1] = &backward_store_walker;
> > +
> > +  if (load_p && (mem_defs[0] || mem_defs[1]))
> > +    do_alias_analysis (alias_hazards, walkers, load_p);
> > +  else
> > +    {
> > +      // We want to find any loads hanging off the first store.
> > +      mem_defs[0] = memory_access (insns[0]->defs ());
> > +      load_walker<false> forward_load_walker (mem_defs[0], insns[0], insns[1]);
> > +      load_walker<true> backward_load_walker (mem_defs[1], insns[1], insns[0]);
> > +      walkers[2] = &forward_load_walker;
> > +      walkers[3] = &backward_load_walker;
> > +      do_alias_analysis (alias_hazards, walkers, load_p);
> > +      // Now consolidate hazards back down.
> > +      if (alias_hazards[2]
> > +	  && (!alias_hazards[0] || (*alias_hazards[2] < *alias_hazards[0])))
> > +	alias_hazards[0] = alias_hazards[2];
> > +
> > +      if (alias_hazards[3]
> > +	  && (!alias_hazards[1] || (*alias_hazards[3] > *alias_hazards[1])))
> > +	alias_hazards[1] = alias_hazards[3];
> > +    }
> > +
> > +  if (alias_hazards[0] && alias_hazards[1]
> > +      && *alias_hazards[0] <= *alias_hazards[1])
> > +    {
> > +      if (dump_file)
> > +	fprintf (dump_file,
> > +		 "cannot form pair (%d,%d) due to alias conflicts (%d,%d)\n",
> > +		 i1->uid (), i2->uid (),
> > +		 alias_hazards[0]->uid (), alias_hazards[1]->uid ());
> > +      return false;
> > +    }
> > +
> > +  // Now narrow the hazards on each base candidate using
> > +  // the alias hazards.
> > +  i = 0;
> > +  while (i < base_cands.length ())
> > +    {
> > +      base_cand &cand = base_cands[i];
> > +      if (alias_hazards[0] && (!cand.hazards[0]
> > +			       || *alias_hazards[0] < *cand.hazards[0]))
> > +	cand.hazards[0] = alias_hazards[0];
> > +      if (alias_hazards[1] && (!cand.hazards[1]
> > +			       || *alias_hazards[1] > *cand.hazards[1]))
> > +	cand.hazards[1] = alias_hazards[1];
> > +
> > +      if (cand.viable ())
> > +	i++;
> > +      else
> > +	{
> > +	  if (dump_file)
> > +	    fprintf (dump_file, "pair (%d,%d): rejecting base %d due to "
> > +				"alias/dataflow hazards (%d,%d)",
> > +				insns[0]->uid (), insns[1]->uid (),
> > +				cand.m_def->regno (),
> > +				cand.hazards[0]->uid (),
> > +				cand.hazards[1]->uid ());
> > +
> > +	  base_cands.ordered_remove (i);
> > +	}
> > +    }
> > +
> > +  if (base_cands.is_empty ())
> > +    {
> > +      if (dump_file)
> > +	fprintf (dump_file,
> > +		 "cannot form pair (%d,%d) due to alias/dataflow hazards",
> > +		 insns[0]->uid (), insns[1]->uid ());
> > +
> > +      return false;
> > +    }
> > +
> > +  base_cand *base = &base_cands[0];
> > +  if (base_cands.length () > 1)
> > +    {
> > +      // If there are still multiple viable bases, it makes sense
> > +      // to choose one that allows us to reduce register pressure:
> > +      // for loads this means moving further down, for stores this
> > +      // means moving further up.
> 
> Agreed, but that's only really an issue for the pre-RA pass.
> Might there be reasons to prefer a different choice after RA?
> (Genuine question.)

It would certainly be good to investigate alternative strategies for the
post-RA pass, to see if any of them give better performance, but
unfortunately I don't think there's time to do that this cycle.

Is it OK if we stick with this strategy for now?

> 
> > +      gcc_checking_assert (base_cands.length () == 2);
> > +      const int hazard_i = !load_p;
> > +      if (base->hazards[hazard_i])
> > +	{
> > +	  if (!base_cands[1].hazards[hazard_i])
> > +	    base = &base_cands[1];
> > +	  else if (load_p
> > +		   && *base_cands[1].hazards[hazard_i]
> > +		      > *(base->hazards[hazard_i]))
> > +	    base = &base_cands[1];
> > +	  else if (!load_p
> > +		   && *base_cands[1].hazards[hazard_i]
> > +		      < *(base->hazards[hazard_i]))
> > +	    base = &base_cands[1];
> > +	}
> > +    }
> > +
> > +  // Otherwise, hazards[0] > hazards[1].
> > +  // Pair can be formed anywhere in (hazards[1], hazards[0]).
> > +  insn_range_info range (insns[0], insns[1]);
> > +  if (base->hazards[1])
> > +    range.first = base->hazards[1];
> > +  if (base->hazards[0])
> > +    range.last = base->hazards[0]->prev_nondebug_insn ();
> > +
> > +  // Placement strategy: push loads down and pull stores up; this should
> > +  // help register pressure by reducing live ranges.
> > +  if (load_p)
> > +    range.first = range.last;
> > +  else
> > +    range.last = range.first;
> > +
> > +  if (dump_file)
> > +    {
> > +      auto print_hazard = [](insn_info *i)
> > +	{
> > +	  if (i)
> > +	    fprintf (dump_file, "%d", i->uid ());
> > +	  else
> > +	    fprintf (dump_file, "-");
> > +	};
> > +      auto print_pair = [print_hazard](insn_info **i)
> > +	{
> > +	  print_hazard (i[0]);
> > +	  fprintf (dump_file, ",");
> > +	  print_hazard (i[1]);
> > +	};
> > +
> > +      fprintf (dump_file, "fusing pair [L=%d] (%d,%d), base=%d, hazards: (",
> > +	      load_p, insns[0]->uid (), insns[1]->uid (),
> > +	      base->m_def->regno ());
> > +      print_pair (base->hazards);
> > +      fprintf (dump_file, "), move_range: (%d,%d)\n",
> > +	       range.first->uid (), range.last->uid ());
> > +    }
> > +
> > +  return fuse_pair (load_p, access_size, writeback,
> > +		    i1, i2, *base, range, emitted_tombstone_p);
> > +}
> > +
> > +// Erase [l.begin (), i] inclusive, respecting iterator order.
> > +static insn_iter_t
> > +erase_prefix (insn_list_t &l, insn_iter_t i)
> > +{
> > +  l.erase (l.begin (), std::next (i));
> > +  return l.begin ();
> > +}
> > +
> > +static insn_iter_t
> > +erase_one (insn_list_t &l, insn_iter_t i, insn_iter_t begin)
> > +{
> > +  auto prev_or_next = (i == begin) ? std::next (i) : std::prev (i);
> > +  l.erase (i);
> > +  return prev_or_next;
> > +}
> > +
> > +static void
> > +dump_insn_list (FILE *f, const insn_list_t &l)
> > +{
> > +  fprintf (f, "(");
> > +
> > +  auto i = l.begin ();
> > +  auto end = l.end ();
> > +
> > +  if (i != end)
> > +    fprintf (f, "%d", (*i)->uid ());
> > +  i++;
> > +
> > +  for (; i != end; i++)
> > +    {
> > +      fprintf (f, ", %d", (*i)->uid ());
> > +    }
> > +
> > +  fprintf (f, ")");
> > +}
> > +
> > +DEBUG_FUNCTION void
> > +debug (const insn_list_t &l)
> > +{
> > +  dump_insn_list (stderr, l);
> > +  fprintf (stderr, "\n");
> > +}
> > +
> > +void
> > +merge_pairs (insn_iter_t l_begin,
> > +	     insn_iter_t l_end,
> > +	     insn_iter_t r_begin,
> > +	     insn_iter_t r_end,
> > +	     insn_list_t &left_list,
> > +	     insn_list_t &right_list,
> > +	     hash_set <insn_info *> &to_delete,
> > +	     bool load_p,
> > +	     unsigned access_size,
> > +	     bool &emitted_tombstone_p)
> > +{
> > +  auto iter_l = l_begin;
> > +  auto iter_r = r_begin;
> > +
> > +  bool result;
> > +  while (l_begin != l_end && r_begin != r_end)
> > +    {
> > +      auto next_l = std::next (iter_l);
> > +      auto next_r = std::next (iter_r);
> > +      if (**iter_l < **iter_r
> > +	  && next_l != l_end
> > +	  && **next_l < **iter_r)
> > +	{
> > +	  iter_l = next_l;
> > +	  continue;
> > +	}
> > +      else if (**iter_r < **iter_l
> > +	       && next_r != r_end
> > +	       && **next_r < **iter_l)
> > +	{
> > +	  iter_r = next_r;
> > +	  continue;
> > +	}
> > +
> > +      bool update_l = false;
> > +      bool update_r = false;
> > +
> > +      result = try_fuse_pair (load_p, access_size,
> > +			      *iter_l, *iter_r,
> > +			      emitted_tombstone_p);
> > +      if (result)
> > +	{
> > +	  update_l = update_r = true;
> > +	  if (to_delete.add (*iter_r))
> > +	    gcc_unreachable (); // Shouldn't get added twice.
> > +
> > +	  iter_l = erase_one (left_list, iter_l, l_begin);
> > +	  iter_r = erase_one (right_list, iter_r, r_begin);
> > +	}
> > +      else
> > +	{
> > +	  // Here we know that the entire prefix we skipped
> > +	  // over cannot merge with anything further on
> > +	  // in iteration order (there are aliasing hazards
> > +	  // on both sides), so delete the entire prefix.
> > +	  if (**iter_l < **iter_r)
> > +	    {
> > +	      // Delete everything from l_begin to iter_l, inclusive.
> > +	      update_l = true;
> > +	      iter_l = erase_prefix (left_list, iter_l);
> > +	    }
> > +	  else
> > +	    {
> > +	      // Delete everything from r_begin to iter_r, inclusive.
> > +	      update_r = true;
> > +	      iter_r = erase_prefix (right_list, iter_r);
> > +	    }
> > +	}
> > +
> > +      if (update_l)
> > +	{
> > +	  l_begin = left_list.begin ();
> > +	  l_end = left_list.end ();
> > +	}
> > +      if (update_r)
> > +	{
> > +	  r_begin = right_list.begin ();
> > +	  r_end = right_list.end ();
> > +	}
> > +    }
> > +}
> 
> Could you add some more comments here about how the iterator ranges
> are used?  E.g. I wasn't sure how l_begin and r_begin differed from
> left_list.begin () and right_list.begin ().

As discussed offline, this was a hold-over from a previous iteration of
the patch where we had a reverse traversal as well, and cacheing the
iterator values like this is unnecessary with the current code, so I've
removed those local variables / parameters.

This is a nice cleanup, thanks for spotting that.
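
To make that concrete, the simplified routine now reads roughly as follows
(sketch only -- it uses the erase_one/erase_prefix helpers quoted above, and
merge_pairs/try_fuse_pair are now member functions of ldp_bb_info, as per
the interdiff below, so the tombstone flag no longer needs to be threaded
through):

  void
  ldp_bb_info::merge_pairs (insn_list_t &left_list,
                            insn_list_t &right_list,
                            hash_set<insn_info *> &to_delete,
                            bool load_p,
                            unsigned access_size)
  {
    auto iter_l = left_list.begin ();
    auto iter_r = right_list.begin ();

    while (!left_list.empty () && !right_list.empty ())
      {
        auto next_l = std::next (iter_l);
        auto next_r = std::next (iter_r);
        if (**iter_l < **iter_r
            && next_l != left_list.end ()
            && **next_l < **iter_r)
          // The next left candidate is still before the current right
          // candidate, so just advance on the left.
          iter_l = next_l;
        else if (**iter_r < **iter_l
                 && next_r != right_list.end ()
                 && **next_r < **iter_l)
          // Likewise, advance on the right.
          iter_r = next_r;
        else if (try_fuse_pair (load_p, access_size, *iter_l, *iter_r))
          {
            if (to_delete.add (*iter_r))
              gcc_unreachable (); // Shouldn't get added twice.

            iter_l = erase_one (left_list, iter_l, left_list.begin ());
            iter_r = erase_one (right_list, iter_r, right_list.begin ());
          }
        else
          {
            // The skipped prefix can't pair with anything further on in
            // iteration order (there are hazards on both sides), so drop
            // the whole prefix.
            if (**iter_l < **iter_r)
              iter_l = erase_prefix (left_list, iter_l);
            else
              iter_r = erase_prefix (right_list, iter_r);
          }
      }
  }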

> 
> > +
> > +// Given a list of insns LEFT_ORIG with all accesses adjacent to
> > +// those in RIGHT_ORIG, try and form them into pairs.
> > +//
> > +// Return true iff we formed all the RIGHT_ORIG candidates into
> > +// pairs.
> > +bool
> > +ldp_bb_info::try_form_pairs (insn_list_t *left_orig,
> > +			     insn_list_t *right_orig,
> > +			     bool load_p, unsigned access_size)
> > +{
> > +  // Make a copy of the right list which we can modify to
> > +  // exclude candidates locally for this invocation.
> > +  insn_list_t right_copy (*right_orig);
> > +
> > +  if (dump_file)
> > +    {
> > +      fprintf (dump_file, "try_form_pairs [L=%d], cand vecs ", load_p);
> > +      dump_insn_list (dump_file, *left_orig);
> > +      fprintf (dump_file, " x ");
> > +      dump_insn_list (dump_file, right_copy);
> > +      fprintf (dump_file, "\n");
> > +    }
> > +
> > +  // List of candidate insns to delete from the original right_list
> > +  // (because they were formed into a pair).
> > +  hash_set <insn_info *> to_delete;
> > +
> > +  // Now we have a 2D matrix of candidates, traverse it to try and
> > +  // find a pair of insns that are already adjacent (within the
> > +  // merged list of accesses).
> > +  merge_pairs (left_orig->begin (), left_orig->end (),
> > +	       right_copy.begin (), right_copy.end (),
> > +	       *left_orig, right_copy,
> > +	       to_delete, load_p, access_size,
> > +	       m_emitted_tombstone);
> > +
> > +  // If we formed all right candidates into pairs,
> > +  // then we can skip the next iteration.
> > +  if (to_delete.elements () == right_orig->size ())
> > +    return true;
> > +
> > +  // Delete items from to_delete.
> > +  auto right_iter = right_orig->begin ();
> > +  auto right_end = right_orig->end ();
> > +  while (right_iter != right_end)
> > +    {
> > +      auto right_next = std::next (right_iter);
> > +
> > +      if (to_delete.contains (*right_iter))
> > +	{
> > +	  right_orig->erase (right_iter);
> > +	  right_end = right_orig->end ();
> > +	}
> > +
> > +      right_iter = right_next;
> > +    }
> > +
> > +  return false;
> > +}
> > +
> > +void
> > +ldp_bb_info::transform_for_base (int encoded_lfs,
> > +				 access_group &group)
> > +{
> > +  const auto lfs = decode_lfs (encoded_lfs);
> > +  const unsigned access_size = lfs.size;
> > +
> > +  bool skip_next = true;
> > +  access_record *prev_access = nullptr;
> > +
> > +  for (auto &access : group.list)
> > +    {
> > +      if (skip_next)
> > +	skip_next = false;
> > +      else if (known_eq (access.offset, prev_access->offset + access_size))
> > +	skip_next = try_form_pairs (&prev_access->cand_insns,
> > +				    &access.cand_insns,
> > +				    lfs.load_p, access_size);
> > +
> > +      prev_access = &access;
> > +    }
> > +}
> > +
> > +void
> > +ldp_bb_info::cleanup_tombstones ()
> > +{
> > +  // No need to do anything if we didn't emit a tombstone insn for this bb.
> > +  if (!m_emitted_tombstone)
> > +    return;
> > +
> > +  insn_info *insn = m_bb->head_insn ();
> > +  while (insn)
> > +    {
> > +      insn_info *next = insn->next_nondebug_insn ();
> > +      if (!insn->is_real () || !tombstone_insn_p (insn))
> > +	{
> > +	  insn = next;
> > +	  continue;
> > +	}
> > +
> > +      auto def = memory_access (insn->defs ());
> > +      auto set = dyn_cast <set_info *> (def);
> > +      if (set && set->has_any_uses ())
> > +	{
> > +	  def_info *prev_def = def->prev_def ();
> > +	  auto prev_set = dyn_cast <set_info *> (prev_def);
> > +	  if (!prev_set)
> > +	    gcc_unreachable (); // TODO: handle this if needed.
> > +
> > +	  while (set->first_use ())
> > +	    crtl->ssa->reparent_use (set->first_use (), prev_set);
> > +	}
> > +
> > +      // Now set has no uses, we can delete it.
> > +      insn_change change (insn, insn_change::DELETE);
> > +      crtl->ssa->change_insn (change);
> > +      insn = next;
> > +    }
> > +}
> > +
> > +template<typename Map>
> > +void
> > +ldp_bb_info::traverse_base_map (Map &map)
> > +{
> > +  for (auto kv : map)
> > +    {
> > +      const auto &key = kv.first;
> > +      auto &value = kv.second;
> > +      transform_for_base (key.second, value);
> > +    }
> > +}
> > +
> > +void
> > +ldp_bb_info::transform ()
> > +{
> > +  traverse_base_map (expr_map);
> > +  traverse_base_map (def_map);
> > +}
> > +
> > +static void
> > +ldp_fusion_init ()
> > +{
> > +  calculate_dominance_info (CDI_DOMINATORS);
> > +  df_analyze ();
> > +  crtl->ssa = new rtl_ssa::function_info (cfun);
> > +}
> > +
> > +static void
> > +ldp_fusion_destroy ()
> > +{
> > +  if (crtl->ssa->perform_pending_updates ())
> > +    cleanup_cfg (0);
> > +
> > +  free_dominance_info (CDI_DOMINATORS);
> > +
> > +  delete crtl->ssa;
> > +  crtl->ssa = nullptr;
> > +}
> > +
> > +static rtx
> > +aarch64_destructure_load_pair (rtx regs[2], rtx pattern)
> > +{
> > +  rtx mem = NULL_RTX;
> > +
> > +  for (int i = 0; i < 2; i++)
> > +    {
> > +      rtx pat = XVECEXP (pattern, 0, i);
> > +      regs[i] = XEXP (pat, 0);
> > +      rtx unspec = XEXP (pat, 1);
> > +      gcc_checking_assert (GET_CODE (unspec) == UNSPEC);
> > +      rtx this_mem = XVECEXP (unspec, 0, 0);
> > +      if (mem)
> > +	gcc_checking_assert (rtx_equal_p (mem, this_mem));
> > +      else
> > +	{
> > +	  gcc_checking_assert (MEM_P (this_mem));
> > +	  mem = this_mem;
> > +	}
> > +    }
> > +
> > +  return mem;
> > +}
> > +
> > +static rtx
> > +aarch64_destructure_store_pair (rtx regs[2], rtx pattern)
> > +{
> > +  rtx mem = XEXP (pattern, 0);
> > +  rtx unspec = XEXP (pattern, 1);
> > +  gcc_checking_assert (GET_CODE (unspec) == UNSPEC);
> > +  for (int i = 0; i < 2; i++)
> > +    regs[i] = XVECEXP (unspec, 0, i);
> > +  return mem;
> > +}
> > +
> > +static rtx
> > +aarch64_gen_writeback_pair (rtx wb_effect, rtx pair_mem, rtx regs[2],
> > +			    bool load_p)
> > +{
> > +  auto op_mode = aarch64_operand_mode_for_pair_mode (GET_MODE (pair_mem));
> > +
> > +  machine_mode modes[2];
> > +  for (int i = 0; i < 2; i++)
> > +    {
> > +      machine_mode mode = GET_MODE (regs[i]);
> > +      if (load_p)
> > +	gcc_checking_assert (mode != VOIDmode);
> > +      else if (mode == VOIDmode)
> > +	mode = op_mode;
> > +
> > +      modes[i] = mode;
> > +    }
> > +
> > +  const auto op_size = GET_MODE_SIZE (modes[0]);
> > +  gcc_checking_assert (known_eq (op_size, GET_MODE_SIZE (modes[1])));
> > +
> > +  rtx pats[2];
> > +  for (int i = 0; i < 2; i++)
> > +    {
> > +      rtx mem = adjust_address_nv (pair_mem, modes[i], op_size * i);
> > +      pats[i] = load_p
> > +	? gen_rtx_SET (regs[i], mem)
> > +	: gen_rtx_SET (mem, regs[i]);
> > +    }
> > +
> > +  return gen_rtx_PARALLEL (VOIDmode,
> > +			   gen_rtvec (3, wb_effect, pats[0], pats[1]));
> > +}
> > +
> > +// Given an existing pair insn INSN, look for a trailing update of
> > +// the base register which we can fold in to make this pair use
> > +// a writeback addressing mode.
> > +static void
> > +try_promote_writeback (insn_info *insn)
> > +{
> > +  auto rti = insn->rtl ();
> > +  const auto attr = get_attr_ldpstp (rti);
> > +  if (attr == LDPSTP_NONE)
> > +    return;
> > +
> > +  bool load_p = (attr == LDPSTP_LDP);
> > +  gcc_checking_assert (load_p || attr == LDPSTP_STP);
> > +
> > +  rtx regs[2];
> > +  rtx mem = NULL_RTX;
> > +  if (load_p)
> > +    mem = aarch64_destructure_load_pair (regs, PATTERN (rti));
> > +  else
> > +    mem = aarch64_destructure_store_pair (regs, PATTERN (rti));
> > +  gcc_checking_assert (MEM_P (mem));
> > +
> > +  poly_int64 offset;
> > +  rtx base = strip_offset (XEXP (mem, 0), &offset);
> > +  gcc_assert (REG_P (base));
> > +
> > +  const auto access_size = GET_MODE_SIZE (GET_MODE (mem)).to_constant () / 2;
> > +
> > +  if (find_access (insn->defs (), REGNO (base)))
> > +    {
> > +      gcc_assert (load_p);
> > +      if (dump_file)
> > +	fprintf (dump_file,
> > +		 "ldp %d clobbers base r%d, can't promote to writeback\n",
> > +		 insn->uid (), REGNO (base));
> > +      return;
> > +    }
> > +
> > +  auto base_use = find_access (insn->uses (), REGNO (base));
> > +  gcc_assert (base_use);
> > +
> > +  if (!base_use->def ())
> > +    {
> > +      if (dump_file)
> > +	fprintf (dump_file,
> > +		 "found pair (i%d, L=%d): but base r%d is upwards exposed\n",
> > +		 insn->uid (), load_p, REGNO (base));
> > +      return;
> > +    }
> > +
> > +  auto base_def = base_use->def ();
> > +
> > +  rtx wb_effect = NULL_RTX;
> > +  def_info *add_def;
> > +  const insn_range_info pair_range (insn->prev_nondebug_insn ());
> > +  insn_info *insns[2] = { nullptr, insn };
> > +  insn_info *trailing_add = find_trailing_add (insns, pair_range, &wb_effect,
> > +					       &add_def, base_def, offset,
> > +					       access_size);
> > +  if (!trailing_add)
> > +    return;
> > +
> > +  auto attempt = crtl->ssa->new_change_attempt ();
> > +
> > +  insn_change pair_change (insn);
> > +  insn_change del_change (trailing_add, insn_change::DELETE);
> > +  insn_change *changes[] = { &pair_change, &del_change };
> > +
> > +  rtx pair_pat = aarch64_gen_writeback_pair (wb_effect, mem, regs, load_p);
> > +  gcc_assert (validate_unshare_change (rti, &PATTERN (rti), pair_pat, true));
> > +
> > +  // The pair must gain the def of the base register from the add.
> > +  pair_change.new_defs = insert_access (attempt,
> > +					add_def,
> > +					pair_change.new_defs);
> > +  gcc_assert (pair_change.new_defs.is_valid ());
> > +
> > +  pair_change.move_range = insn_range_info (insn->prev_nondebug_insn ());
> > +
> > +  auto is_changing = insn_is_changing (changes);
> > +  for (unsigned i = 0; i < ARRAY_SIZE (changes); i++)
> > +    gcc_assert (rtl_ssa::restrict_movement_ignoring (*changes[i], is_changing));
> > +
> > +  gcc_assert (rtl_ssa::recog_ignoring (attempt, pair_change, is_changing));
> > +  gcc_assert (crtl->ssa->verify_insn_changes (changes));
> > +  confirm_change_group ();
> > +  crtl->ssa->change_insns (changes);
> > +}
> > +
> > +void ldp_fusion_bb (bb_info *bb)
> > +{
> > +  const bool track_loads
> > +    = aarch64_tune_params.ldp_policy_model != AARCH64_LDP_STP_POLICY_NEVER;
> > +  const bool track_stores
> > +    = aarch64_tune_params.stp_policy_model != AARCH64_LDP_STP_POLICY_NEVER;
> > +
> > +  ldp_bb_info bb_state (bb);
> > +
> > +  for (auto insn : bb->nondebug_insns ())
> > +    {
> > +      rtx_insn *rti = insn->rtl ();
> > +
> > +      if (!rti || !INSN_P (rti))
> > +	continue;
> > +
> > +      rtx pat = PATTERN (rti);
> > +      if (reload_completed
> > +	  && aarch64_ldp_writeback > 1
> > +	  && GET_CODE (pat) == PARALLEL
> > +	  && XVECLEN (pat, 0) == 2)
> > +	try_promote_writeback (insn);
> > +
> > +      if (GET_CODE (pat) != SET)
> > +	continue;
> > +
> > +      if (track_stores && MEM_P (XEXP (pat, 0)))
> > +	bb_state.track_access (insn, false, XEXP (pat, 0));
> > +      else if (track_loads && MEM_P (XEXP (pat, 1)))
> > +	bb_state.track_access (insn, true, XEXP (pat, 1));
> > +    }
> > +
> > +  bb_state.transform ();
> > +  bb_state.cleanup_tombstones ();
> > +}
> > +
> > +void ldp_fusion ()
> > +{
> > +  ldp_fusion_init ();
> > +
> > +  for (auto bb : crtl->ssa->bbs ())
> > +    ldp_fusion_bb (bb);
> > +
> > +  ldp_fusion_destroy ();
> > +}
> > +
> > +namespace {
> > +
> > +const pass_data pass_data_ldp_fusion =
> > +{
> > +  RTL_PASS, /* type */
> > +  "ldp_fusion", /* name */
> > +  OPTGROUP_NONE, /* optinfo_flags */
> > +  TV_NONE, /* tv_id */
> > +  0, /* properties_required */
> > +  0, /* properties_provided */
> > +  0, /* properties_destroyed */
> > +  0, /* todo_flags_start */
> > +  TODO_df_finish, /* todo_flags_finish */
> > +};
> > +
> > +class pass_ldp_fusion : public rtl_opt_pass
> > +{
> > +public:
> > +  pass_ldp_fusion (gcc::context *ctx)
> > +    : rtl_opt_pass (pass_data_ldp_fusion, ctx)
> > +    {}
> > +
> > +  opt_pass *clone () override { return new pass_ldp_fusion (m_ctxt); }
> > +
> > +  bool gate (function *) final override
> > +    {
> > +      if (!optimize || optimize_debug)
> > +	return false;
> > +
> > +      // If the tuning policy says never to form ldps or stps, don't run
> > +      // the pass.
> > +      if ((aarch64_tune_params.ldp_policy_model
> > +	   == AARCH64_LDP_STP_POLICY_NEVER)
> > +	  && (aarch64_tune_params.stp_policy_model
> > +	      == AARCH64_LDP_STP_POLICY_NEVER))
> > +	return false;
> > +
> > +      if (reload_completed)
> > +	return flag_aarch64_late_ldp_fusion;
> > +      else
> > +	return flag_aarch64_early_ldp_fusion;
> > +    }
> > +
> > +  unsigned execute (function *) final override
> > +    {
> > +      ldp_fusion ();
> > +      return 0;
> > +    }
> > +};
> > +
> > +} // anon namespace
> > +
> > +rtl_opt_pass *
> > +make_pass_ldp_fusion (gcc::context *ctx)
> > +{
> > +  return new pass_ldp_fusion (ctx);
> > +}
> > +
> > +#include "gt-aarch64-ldp-fusion.h"
> > diff --git a/gcc/config/aarch64/aarch64-passes.def b/gcc/config/aarch64/aarch64-passes.def
> > index 6ace797b738..f38c642414e 100644
> > --- a/gcc/config/aarch64/aarch64-passes.def
> > +++ b/gcc/config/aarch64/aarch64-passes.def
> > @@ -23,3 +23,5 @@ INSERT_PASS_BEFORE (pass_reorder_blocks, 1, pass_track_speculation);
> >  INSERT_PASS_AFTER (pass_machine_reorg, 1, pass_tag_collision_avoidance);
> >  INSERT_PASS_BEFORE (pass_shorten_branches, 1, pass_insert_bti);
> >  INSERT_PASS_AFTER (pass_if_after_combine, 1, pass_cc_fusion);
> > +INSERT_PASS_BEFORE (pass_early_remat, 1, pass_ldp_fusion);
> > +INSERT_PASS_BEFORE (pass_peephole2, 1, pass_ldp_fusion);
> > diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
> > index 2ab54f244a7..fd75aa115d1 100644
> > --- a/gcc/config/aarch64/aarch64-protos.h
> > +++ b/gcc/config/aarch64/aarch64-protos.h
> > @@ -1055,6 +1055,7 @@ rtl_opt_pass *make_pass_track_speculation (gcc::context *);
> >  rtl_opt_pass *make_pass_tag_collision_avoidance (gcc::context *);
> >  rtl_opt_pass *make_pass_insert_bti (gcc::context *ctxt);
> >  rtl_opt_pass *make_pass_cc_fusion (gcc::context *ctxt);
> > +rtl_opt_pass *make_pass_ldp_fusion (gcc::context *);
> >  
> >  poly_uint64 aarch64_regmode_natural_size (machine_mode);
> >  
> > diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
> > index f5a518202a1..a69c37ce33b 100644
> > --- a/gcc/config/aarch64/aarch64.opt
> > +++ b/gcc/config/aarch64/aarch64.opt
> > @@ -271,6 +271,16 @@ mtrack-speculation
> >  Target Var(aarch64_track_speculation)
> >  Generate code to track when the CPU might be speculating incorrectly.
> >  
> > +mearly-ldp-fusion
> > +Target Var(flag_aarch64_early_ldp_fusion) Optimization Init(1)
> > +Enable the pre-RA AArch64-specific pass to fuse loads and stores into
> > +ldp and stp instructions.
> > +
> > +mlate-ldp-fusion
> > +Target Var(flag_aarch64_late_ldp_fusion) Optimization Init(1)
> > +Enable the post-RA AArch64-specific pass to fuse loads and stores into
> > +ldp and stp instructions.
> > +
> >  mstack-protector-guard=
> >  Target RejectNegative Joined Enum(stack_protector_guard) Var(aarch64_stack_protector_guard) Init(SSP_GLOBAL)
> >  Use given stack-protector guard.
> > @@ -360,3 +370,16 @@ Enum(aarch64_ldp_stp_policy) String(never) Value(AARCH64_LDP_STP_POLICY_NEVER)
> >  
> >  EnumValue
> >  Enum(aarch64_ldp_stp_policy) String(aligned) Value(AARCH64_LDP_STP_POLICY_ALIGNED)
> > +
> > +-param=aarch64-ldp-alias-check-limit=
> > +Target Joined UInteger Var(aarch64_ldp_alias_check_limit) Init(8) IntegerRange(0, 65536) Param
> > +Limit on number of alias checks performed when attempting to form an ldp/stp.
> > +
> > +-param=aarch64-ldp-writeback=
> > +Target Joined UInteger Var(aarch64_ldp_writeback) Init(2) IntegerRange(0,2) Param
> > +Param to control which wirteback opportunities we try to handle in the
> 
> writeback

Fixed, thanks.

> 
> > +load/store pair fusion pass.  A value of zero disables writeback
> > +handling.  One means we try to form pairs involving one or more existing
> > +individual writeback accesses where possible.  A value of two means we
> > +also try to opportunistically form writeback opportunities by folding in
> > +trailing destructive updates of the base register used by a pair.
> 
> Params are also documented in invoke.texi (but are allowed to change
> between releases, unlike normal options).

Done (and also documented the main options to enable the pass(es)).

How does this version look?

Thanks,
Alex

> 
> Thanks,
> Richard
> 
> > diff --git a/gcc/config/aarch64/t-aarch64 b/gcc/config/aarch64/t-aarch64
> > index a9a244ab6d6..37917344a54 100644
> > --- a/gcc/config/aarch64/t-aarch64
> > +++ b/gcc/config/aarch64/t-aarch64
> > @@ -176,6 +176,13 @@ aarch64-cc-fusion.o: $(srcdir)/config/aarch64/aarch64-cc-fusion.cc \
> >  	$(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS) $(INCLUDES) \
> >  		$(srcdir)/config/aarch64/aarch64-cc-fusion.cc
> >  
> > +aarch64-ldp-fusion.o: $(srcdir)/config/aarch64/aarch64-ldp-fusion.cc \
> > +    $(CONFIG_H) $(SYSTEM_H) $(CORETYPES_H) $(BACKEND_H) $(RTL_H) $(DF_H) \
> > +    $(RTL_SSA_H) cfgcleanup.h tree-pass.h ordered-hash-map.h tree-dfa.h \
> > +    fold-const.h tree-hash-traits.h print-tree.h
> > +	$(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS) $(INCLUDES) \
> > +		$(srcdir)/config/aarch64/aarch64-ldp-fusion.cc
> > +
> >  comma=,
> >  MULTILIB_OPTIONS    = $(subst $(comma),/, $(patsubst %, mabi=%, $(subst $(comma),$(comma)mabi=,$(TM_MULTILIB_CONFIG))))
> >  MULTILIB_DIRNAMES   = $(subst $(comma), ,$(TM_MULTILIB_CONFIG))
diff --git a/gcc/config.gcc b/gcc/config.gcc
index 211c1929d38..9aae903913c 100644
--- a/gcc/config.gcc
+++ b/gcc/config.gcc
@@ -350,7 +350,7 @@ aarch64*-*-*)
 	cxx_target_objs="aarch64-c.o"
 	d_target_objs="aarch64-d.o"
 	extra_objs="aarch64-builtins.o aarch-common.o aarch64-sve-builtins.o aarch64-sve-builtins-shapes.o aarch64-sve-builtins-base.o aarch64-sve-builtins-sve2.o cortex-a57-fma-steering.o aarch64-speculation.o falkor-tag-collision-avoidance.o aarch-bti-insert.o aarch64-cc-fusion.o aarch64-ldp-fusion.o"
-	target_gtfiles="\$(srcdir)/config/aarch64/aarch64-builtins.cc \$(srcdir)/config/aarch64/aarch64-sve-builtins.h \$(srcdir)/config/aarch64/aarch64-sve-builtins.cc \$(srcdir)/config/aarch64/aarch64-ldp-fusion.cc"
+	target_gtfiles="\$(srcdir)/config/aarch64/aarch64-builtins.cc \$(srcdir)/config/aarch64/aarch64-sve-builtins.h \$(srcdir)/config/aarch64/aarch64-sve-builtins.cc"
 	target_has_targetm_common=yes
 	;;
 alpha*-*-*)
diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc b/gcc/config/aarch64/aarch64-ldp-fusion.cc
index 6ab18b9216e..965f92d9fc6 100644
--- a/gcc/config/aarch64/aarch64-ldp-fusion.cc
+++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc
@@ -39,14 +39,10 @@
 
 using namespace rtl_ssa;
 
-enum
-{
-  LDP_IMM_BITS = 7,
-  LDP_IMM_MASK = (1 << LDP_IMM_BITS) - 1,
-  LDP_IMM_SIGN_BIT = (1 << (LDP_IMM_BITS - 1)),
-  LDP_MAX_IMM = LDP_IMM_SIGN_BIT - 1,
-  LDP_MIN_IMM = -LDP_MAX_IMM - 1,
-};
+static constexpr HOST_WIDE_INT LDP_IMM_BITS = 7;
+static constexpr HOST_WIDE_INT LDP_IMM_SIGN_BIT = (1 << (LDP_IMM_BITS - 1));
+static constexpr HOST_WIDE_INT LDP_MAX_IMM = LDP_IMM_SIGN_BIT - 1;
+static constexpr HOST_WIDE_INT LDP_MIN_IMM = -LDP_MAX_IMM - 1;
 
 // We pack these fields (load_p, fpsimd_p, and size) into an integer
 // (LFS) which we use as part of the key into the main hash tables.
@@ -61,7 +57,7 @@ struct lfs_fields
   unsigned size;
 };
 
-using insn_list_t = std::list <insn_info *>;
+using insn_list_t = std::list<insn_info *>;
 using insn_iter_t = insn_list_t::iterator;
 
 // Information about the accesses at a given offset from a particular
@@ -80,7 +76,7 @@ struct access_record
 // while the list supports efficient iteration.
 struct access_group
 {
-  splay_tree <access_record *> tree;
+  splay_tree<access_record *> tree;
   std::list<access_record> list;
 
   template<typename Alloc>
@@ -91,24 +87,42 @@ struct access_group
 // There may be zero, one, or two viable RTL bases for a given pair.
 struct base_cand
 {
-  def_info *m_def;
+  // DEF is the def of the base register to be used by the pair.
+  def_info *def;
 
   // FROM_INSN is -1 if the base candidate is already shared by both
   // candidate insns.  Otherwise it holds the index of the insn from
   // which the base originated.
+  //
+  // In the case that the base is shared, either DEF is already used
+  // by both candidate accesses, or both accesses see different versions
+  // of the same regno, in which case DEF is the def consumed by the
+  // first candidate access.
   int from_insn;
 
-  // Initially: dataflow hazards that arise if we choose this base as
-  // the common base register for the pair.
+  // We form a pair by moving the first access down and the second access
+  // up.  To determine where to form the pair, and whether or not
+  // it is safe to form the pair, we track instructions which cannot be
+  // re-ordered past due to either dataflow or alias hazards.
+  //
+  // Since we allow changing the base used by an access, the choice of
+  // base can change which instructions act as re-ordering hazards for
+  // this pair (due to different dataflow).  We store the initial
+  // dataflow hazards for this choice of base candidate in HAZARDS.
   //
-  // Later these get narrowed, taking alias hazards into account.
+  // These hazards act as re-ordering barriers to each candidate insn
+  // respectively, in program order.
+  //
+  // Later on, when we take alias analysis into account, we narrow
+  // HAZARDS accordingly.
   insn_info *hazards[2];
 
   base_cand (def_info *def, int insn)
-    : m_def (def), from_insn (insn), hazards {nullptr, nullptr} {}
+    : def (def), from_insn (insn), hazards {nullptr, nullptr} {}
 
   base_cand (def_info *def) : base_cand (def, -1) {}
 
+  // Test if this base candidate is viable according to HAZARDS.
   bool viable () const
   {
     return !hazards[0] || !hazards[1] || (*hazards[0] > *hazards[1]);
@@ -126,22 +140,22 @@ struct alt_base
 // State used by the pass for a given basic block.
 struct ldp_bb_info
 {
-  using def_hash = nofree_ptr_hash <def_info>;
-  using expr_key_t = pair_hash <tree_operand_hash, int_hash <int, -1, -2>>;
-  using def_key_t = pair_hash <def_hash, int_hash <int, -1, -2>>;
+  using def_hash = nofree_ptr_hash<def_info>;
+  using expr_key_t = pair_hash<tree_operand_hash, int_hash<int, -1, -2>>;
+  using def_key_t = pair_hash<def_hash, int_hash<int, -1, -2>>;
 
   // Map of <tree base, LFS> -> access_group.
-  ordered_hash_map <expr_key_t, access_group> expr_map;
+  ordered_hash_map<expr_key_t, access_group> expr_map;
 
   // Map of <RTL-SSA def_info *, LFS> -> access_group.
-  ordered_hash_map <def_key_t, access_group> def_map;
+  ordered_hash_map<def_key_t, access_group> def_map;
 
   // Given the def_info for an RTL base register, express it as an offset from
   // some canonical base instead.
   //
   // Canonicalizing bases in this way allows us to identify adjacent accesses
   // even if they see different base register defs.
-  hash_map <def_hash, alt_base> canon_base_map;
+  hash_map<def_hash, alt_base> canon_base_map;
 
   static const size_t obstack_alignment = sizeof (void *);
   bb_info *m_bb;
@@ -155,6 +169,12 @@ struct ldp_bb_info
   ~ldp_bb_info ()
   {
     obstack_free (&m_obstack, nullptr);
+
+    if (m_emitted_tombstone)
+      {
+	bitmap_release (&m_tombstone_bitmap);
+	bitmap_obstack_release (&m_bitmap_obstack);
+      }
   }
 
   inline void track_access (insn_info *, bool load, rtx mem);
@@ -162,10 +182,13 @@ struct ldp_bb_info
   inline void cleanup_tombstones ();
 
 private:
-  // Did we emit a tombstone insn for this bb?
-  bool m_emitted_tombstone;
   obstack m_obstack;
 
+  // State for keeping track of tombstone insns emitted for this BB.
+  bitmap_obstack m_bitmap_obstack;
+  bitmap_head m_tombstone_bitmap;
+  bool m_emitted_tombstone;
+
   inline splay_tree_node<access_record *> *node_alloc (access_record *);
 
   template<typename Map>
@@ -175,6 +198,22 @@ private:
   inline bool try_form_pairs (insn_list_t *, insn_list_t *,
 			      bool load_p, unsigned access_size);
 
+  inline void merge_pairs (insn_list_t &, insn_list_t &,
+			   hash_set<insn_info *> &to_delete,
+			   bool load_p,
+			   unsigned access_size);
+
+  inline bool try_fuse_pair (bool load_p, unsigned access_size,
+			     insn_info *i1, insn_info *i2);
+
+  inline bool fuse_pair (bool load_p, unsigned access_size,
+			 int writeback,
+			 insn_info *i1, insn_info *i2,
+			 base_cand &base,
+			 const insn_range_info &move_range);
+
+  inline void track_tombstone (int uid);
+
   inline bool track_via_mem_expr (insn_info *, rtx mem, lfs_fields lfs);
 };
 
@@ -223,45 +262,42 @@ drop_writeback (rtx mem)
   return change_address (mem, GET_MODE (mem), addr);
 }
 
-// Convenience wrapper around strip_offset that can also look
-// through {PRE,POST}_MODIFY.
-static rtx ldp_strip_offset (rtx mem, rtx *modify, poly_int64 *offset)
+// Convenience wrapper around strip_offset that can also look through
+// RTX_AUTOINC addresses.  The interface is like strip_offset except we take a
+// MEM so that we know the mode of the access.
+static rtx ldp_strip_offset (rtx mem, poly_int64 *offset)
 {
-  gcc_checking_assert (MEM_P (mem));
-
-  rtx base = strip_offset (XEXP (mem, 0), offset);
-
-  if (side_effects_p (base))
-    *modify = base;
+  rtx addr = XEXP (mem, 0);
 
-  switch (GET_CODE (base))
+  switch (GET_CODE (addr))
     {
     case PRE_MODIFY:
     case POST_MODIFY:
-      base = strip_offset (XEXP (base, 1), offset);
-      gcc_checking_assert (REG_P (base));
-      gcc_checking_assert (rtx_equal_p (XEXP (*modify, 0), base));
+      addr = strip_offset (XEXP (addr, 1), offset);
+      gcc_checking_assert (REG_P (addr));
+      gcc_checking_assert (rtx_equal_p (XEXP (XEXP (mem, 0), 0), addr));
       break;
     case PRE_INC:
     case POST_INC:
-      base = XEXP (base, 0);
+      addr = XEXP (addr, 0);
       *offset = GET_MODE_SIZE (GET_MODE (mem));
-      gcc_checking_assert (REG_P (base));
+      gcc_checking_assert (REG_P (addr));
       break;
     case PRE_DEC:
     case POST_DEC:
-      base = XEXP (base, 0);
+      addr = XEXP (addr, 0);
       *offset = -GET_MODE_SIZE (GET_MODE (mem));
-      gcc_checking_assert (REG_P (base));
+      gcc_checking_assert (REG_P (addr));
       break;
 
     default:
-      gcc_checking_assert (!side_effects_p (base));
+      addr = strip_offset (addr, offset);
     }
 
-  return base;
+  return addr;
 }
 
+// Return true if X is a PRE_{INC,DEC,MODIFY} rtx.
 static bool
 any_pre_modify_p (rtx x)
 {
@@ -269,6 +305,7 @@ any_pre_modify_p (rtx x)
   return code == PRE_INC || code == PRE_DEC || code == PRE_MODIFY;
 }
 
+// Return true if X is a POST_{INC,DEC,MODIFY} rtx.
 static bool
 any_post_modify_p (rtx x)
 {
@@ -276,6 +313,8 @@ any_post_modify_p (rtx x)
   return code == POST_INC || code == POST_DEC || code == POST_MODIFY;
 }
 
+// Return true if we should consider forming ldp/stp insns from memory
+// accesses with operand mode MODE at this stage in compilation.
 static bool
 ldp_operand_mode_ok_p (machine_mode mode)
 {
@@ -290,9 +329,14 @@ ldp_operand_mode_ok_p (machine_mode mode)
   if (size == 16 && !allow_qregs)
     return false;
 
-  return reload_completed || mode != E_TImode;
+  // We don't pair up TImode accesses before RA because TImode is
+  // special in that it can be allocated to a pair of GPRs or a single
+  // FPR, and the RA is best placed to make that decision.
+  return reload_completed || mode != TImode;
 }
 
+// Given LFS (load_p, fpsimd_p, size) fields in FIELDS, encode these
+// into an integer for use as a hash table key.
 static int
 encode_lfs (lfs_fields fields)
 {
@@ -303,6 +347,7 @@ encode_lfs (lfs_fields fields)
     | (size_log2 - 2);
 }
 
+// Inverse of encode_lfs.
 static lfs_fields
 decode_lfs (int lfs)
 {
@@ -312,6 +357,8 @@ decode_lfs (int lfs)
   return { load_p, fpsimd_p, size };
 }
 
+// Track the access INSN at offset OFFSET in this access group.
+// ALLOC_NODE is used to allocate splay tree nodes.
 template<typename Alloc>
 void
 access_group::track (Alloc alloc_node, poly_int64 offset, insn_info *insn)
@@ -348,6 +395,9 @@ access_group::track (Alloc alloc_node, poly_int64 offset, insn_info *insn)
     }
 }
 
+// Given a candidate access INSN (with mem MEM), see if it has a suitable
+// MEM_EXPR base (i.e. a tree decl) relative to which we can track the access.
+// LFS is used as part of the key to the hash table, see track_access.
 bool
 ldp_bb_info::track_via_mem_expr (insn_info *insn, rtx mem, lfs_fields lfs)
 {
@@ -365,8 +415,10 @@ ldp_bb_info::track_via_mem_expr (insn_info *insn, rtx mem, lfs_fields lfs)
   const machine_mode mem_mode = GET_MODE (mem);
   const HOST_WIDE_INT mem_size = GET_MODE_SIZE (mem_mode).to_constant ();
 
-  // Punt on misaligned offsets.
-  if (offset.coeffs[0] & (mem_size - 1))
+  // Punt on misaligned offsets.  LDP/STP instructions require offsets to be a
+  // multiple of the access size, and we believe that misaligned offsets on
+  // MEM_EXPR bases are likely to lead to misaligned offsets w.r.t. RTL bases.
+  if (!multiple_p (offset, mem_size))
     return false;
 
   const auto key = std::make_pair (base_expr, encode_lfs (lfs));
@@ -388,15 +440,11 @@ ldp_bb_info::track_via_mem_expr (insn_info *insn, rtx mem, lfs_fields lfs)
   return true;
 }
 
-// Return true if X is a constant zero operand.  N.B. this matches the
-// {w,x}zr check in aarch64_print_operand, the logic in the predicate
-// aarch64_stp_reg_operand, and the constraints on the pair patterns.
-static bool const_zero_op_p (rtx x)
-{
-  return x == CONST0_RTX (GET_MODE (x))
-    || (CONST_DOUBLE_P (x) && aarch64_float_const_zero_rtx_p (x));
-}
-
+// Main function to begin pair discovery.  Given a memory access INSN,
+// determine whether it could be a candidate for fusing into an ldp/stp,
+// and if so, track it in the appropriate data structure for this basic
+// block.  LOAD_P is true if the access is a load, and MEM is the mem
+// rtx that occurs in INSN.
 void
 ldp_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
 {
@@ -405,7 +453,8 @@ ldp_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
     return;
 
   // Ignore writeback accesses if the param says to do so.
-  if (!aarch64_ldp_writeback && side_effects_p (XEXP (mem, 0)))
+  if (!aarch64_ldp_writeback
+      && GET_RTX_CLASS (GET_CODE (XEXP (mem, 0))) == RTX_AUTOINC)
     return;
 
   const machine_mode mem_mode = GET_MODE (mem);
@@ -417,23 +466,26 @@ ldp_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
 
   rtx reg_op = XEXP (PATTERN (insn->rtl ()), !load_p);
 
-  // Is this an FP/SIMD access?  Note that constant zero operands
-  // use an integer zero register ({w,x}zr).
+  // We want to segregate FP/SIMD accesses from GPR accesses.
+  //
+  // Before RA, we use the modes, noting that stores of constant zero
+  // operands use GPRs (even in non-integer modes).  After RA, we use
+  // the hard register numbers.
   const bool fpsimd_op_p
-    = GET_MODE_CLASS (mem_mode) != MODE_INT
-      && (load_p || !const_zero_op_p (reg_op));
+    = reload_completed
+    ? (REG_P (reg_op) && FP_REGNUM_P (REGNO (reg_op)))
+    : (GET_MODE_CLASS (mem_mode) != MODE_INT
+       && (load_p || !aarch64_const_zero_rtx_p (reg_op)));
 
-  // N.B. we only want to segregate FP/SIMD accesses from integer accesses
-  // before RA.
-  const bool fpsimd_bit_p = !reload_completed && fpsimd_op_p;
-  const lfs_fields lfs = { load_p, fpsimd_bit_p, mem_size };
+  const lfs_fields lfs = { load_p, fpsimd_op_p, mem_size };
 
   if (track_via_mem_expr (insn, mem, lfs))
     return;
 
   poly_int64 mem_off;
-  rtx modify = NULL_RTX;
-  rtx base = ldp_strip_offset (mem, &modify, &mem_off);
+  rtx addr = XEXP (mem, 0);
+  const bool autoinc_p = GET_RTX_CLASS (GET_CODE (addr)) == RTX_AUTOINC;
+  rtx base = ldp_strip_offset (mem, &mem_off);
   if (!REG_P (base))
     return;
 
@@ -441,7 +493,7 @@ ldp_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
   //  - Offset at which the access occurs.
   //  - Offset of the new base def.
   poly_int64 access_off;
-  if (modify && any_post_modify_p (modify))
+  if (autoinc_p && any_post_modify_p (addr))
     access_off = 0;
   else
     access_off = mem_off;
@@ -479,7 +531,7 @@ ldp_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
       access_off += canon_base->offset;
     }
 
-  if (modify)
+  if (autoinc_p)
     {
       auto def = find_access (insn->defs (), REGNO (base));
       gcc_assert (def);
@@ -502,8 +554,9 @@ ldp_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
 	gcc_unreachable (); // Base defs should be unique.
     }
 
-  // Punt on misaligned offsets.
-  if (mem_off.coeffs[0] & (mem_size - 1))
+  // Punt on misaligned offsets.  LDP/STP require offsets to be a multiple of
+  // the access size.
+  if (!multiple_p (mem_off, mem_size))
     return;
 
   const auto key = std::make_pair (base_def, encode_lfs (lfs));
@@ -520,7 +573,7 @@ ldp_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
 	       m_bb->index (), insn->uid (), pp_formatted_text (&pp));
       fprintf (dump_file,
 	       " [L=%d, WB=%d, FP=%d, %smode, off=",
-	       lfs.load_p, !!modify, lfs.fpsimd_p, mode_name[mem_mode]);
+	       lfs.load_p, autoinc_p, lfs.fpsimd_p, mode_name[mem_mode]);
       print_dec (access_off, dump_file);
       fprintf (dump_file, "]\n");
     }
@@ -531,13 +584,15 @@ static bool no_ignore (insn_info *) { return false; }
 
 // Return the latest dataflow hazard before INSN.
 //
-// If IGNORE is non-NULL, this points to a sub-rtx which we should
-// ignore for dataflow purposes.  This is needed when considering
-// changing the RTL base of an access discovered through a MEM_EXPR
-// base.
+// If IGNORE is non-NULL, this points to a sub-rtx which we should ignore for
+// dataflow purposes.  This is needed when considering changing the RTL base of
+// an access discovered through a MEM_EXPR base.
+//
+// If IGNORE_INSN is non-NULL, we should further ignore any hazards arising
+// from that insn.
 //
-// N.B. we ignore any defs/uses of memory here as we deal with that
-// separately, making use of alias disambiguation.
+// N.B. we ignore any defs/uses of memory here as we deal with that separately,
+// making use of alias disambiguation.
 static insn_info *
 latest_hazard_before (insn_info *insn, rtx *ignore,
 		      insn_info *ignore_insn = nullptr)
@@ -581,7 +636,7 @@ latest_hazard_before (insn_info *insn, rtx *ignore,
 	{
 	  hazard (def->prev_def ()->insn ()); // WaW
 
-	  auto set = dyn_cast <set_info *> (def->prev_def ());
+	  auto set = dyn_cast<set_info *> (def->prev_def ());
 	  if (set && set->has_nondebug_insn_uses ())
 	    for (auto use : set->reverse_nondebug_insn_uses ())
 	      if (use->insn () != insn && hazard (use->insn ())) // WaR
@@ -609,6 +664,14 @@ latest_hazard_before (insn_info *insn, rtx *ignore,
   return result;
 }
 
+// Return the first dataflow hazard after INSN.
+//
+// If IGNORE is non-NULL, this points to a sub-rtx which we should ignore for
+// dataflow purposes.  This is needed when considering changing the RTL base of
+// an access discovered through a MEM_EXPR base.
+//
+// N.B. we ignore any defs/uses of memory here as we deal with that separately,
+// making use of alias disambiguation.
 static insn_info *
 first_hazard_after (insn_info *insn, rtx *ignore)
 {
@@ -637,7 +700,7 @@ first_hazard_after (insn_info *insn, rtx *ignore)
       if (def->next_def ())
 	hazard (def->next_def ()->insn ()); // WaW
 
-      auto set = dyn_cast <set_info *> (def);
+      auto set = dyn_cast<set_info *> (def);
       if (set && set->has_nondebug_insn_uses ())
 	hazard (set->first_nondebug_insn_use ()->insn ()); // RaW
 
@@ -697,29 +760,7 @@ first_hazard_after (insn_info *insn, rtx *ignore)
   return result;
 }
 
-
-enum change_strategy {
-  CHANGE,
-  DELETE,
-  TOMBSTONE,
-};
-
-// Given a change_strategy S, convert it to a string (for output in the
-// dump file).
-static const char *cs_to_string (change_strategy s)
-{
-#define C(x) case x: return #x
-  switch (s)
-    {
-      C (CHANGE);
-      C (DELETE);
-      C (TOMBSTONE);
-    }
-#undef C
-  gcc_unreachable ();
-}
-
-// TODO: should this live in RTL-SSA?
+// Return true iff R1 and R2 overlap.
 static bool
 ranges_overlap_p (const insn_range_info &r1, const insn_range_info &r2)
 {
@@ -747,7 +788,7 @@ def_downwards_move_range (def_info *def)
 {
   auto range = get_def_range (def);
 
-  auto set = dyn_cast <set_info *> (def);
+  auto set = dyn_cast<set_info *> (def);
   if (!set || !set->has_any_uses ())
     return range;
 
@@ -766,7 +807,7 @@ def_upwards_move_range (def_info *def)
   def_info *prev = def->prev_def ();
   insn_range_info range { prev->insn (), def->insn () };
 
-  auto set = dyn_cast <set_info *> (prev);
+  auto set = dyn_cast<set_info *> (prev);
   if (!set || !set->has_any_uses ())
     return range;
 
@@ -777,121 +818,67 @@ def_upwards_move_range (def_info *def)
   return range;
 }
 
-static def_info *
-decide_stp_strategy (change_strategy strategy[2],
-		     insn_info *first,
+// Given candidate store insns FIRST and SECOND, see if we can re-purpose one
+// of them (together with its def of memory) for the stp insn.  If so, return
+// that insn.  Otherwise, return null.
+static insn_info *
+decide_stp_strategy (insn_info *first,
 		     insn_info *second,
 		     const insn_range_info &move_range)
 {
-  strategy[0] = CHANGE;
-  strategy[1] = DELETE;
-
-  unsigned viable = 0;
-  viable |= move_range.includes (first);
-  viable |= ((unsigned) move_range.includes (second)) << 1;
-
   def_info * const defs[2] = {
     memory_access (first->defs ()),
     memory_access (second->defs ())
   };
-  if (defs[0] == defs[1])
-    viable = 3; // No intervening store, either is viable.
-
-  if (!(viable & 1)
-      && ranges_overlap_p (move_range, def_downwards_move_range (defs[0])))
-    viable |= 1;
-  if (!(viable & 2)
-      && ranges_overlap_p (move_range, def_upwards_move_range (defs[1])))
-    viable |= 2;
-
-  if (viable == 2)
-    std::swap (strategy[0], strategy[1]);
-  else if (!viable)
-    // Tricky case: need to delete both accesses.
-    strategy[0] = DELETE;
-
-  for (int i = 0; i < 2; i++)
-    {
-      if (strategy[i] != DELETE)
-	continue;
-
-      // See if we can get away without a tombstone.
-      auto set = dyn_cast <set_info *> (defs[i]);
-      if (!set || !set->has_any_uses ())
-	continue; // We can indeed.
-
-      // If both sides are viable for re-purposing, and the other store's
-      // def doesn't have any uses, then we can delete the other store
-      // and re-purpose this store instead.
-      if (viable == 3)
-	{
-	  gcc_assert (strategy[!i] == CHANGE);
-	  auto other_set = dyn_cast <set_info *> (defs[!i]);
-	  if (!other_set || !other_set->has_any_uses ())
-	    {
-	      strategy[i] = CHANGE;
-	      strategy[!i] = DELETE;
-	      break;
-	    }
-	}
 
-      // Alas, we need a tombstone after all.
-      strategy[i] = TOMBSTONE;
-    }
+  if (move_range.includes (first)
+      || ranges_overlap_p (move_range, def_downwards_move_range (defs[0])))
+    return first;
 
-  for (int i = 0; i < 2; i++)
-    if (strategy[i] == CHANGE)
-      return defs[i];
+  if (move_range.includes (second)
+      || ranges_overlap_p (move_range, def_upwards_move_range (defs[1])))
+    return second;
 
   return nullptr;
 }
 
-static GTY(()) rtx tombstone = NULL_RTX;
-
-// Generate the RTL pattern for a "tombstone"; used temporarily
-// during this pass to replace stores that are marked for deletion
-// where we can't immediately delete the store (e.g. if there are uses
-// hanging off its def of memory).
+// Generate the RTL pattern for a "tombstone"; used temporarily during this pass
+// to replace stores that are marked for deletion where we can't immediately
+// delete the store (since there are uses of mem hanging off the store).
 //
-// These are deleted at the end of the pass and uses re-parented
-// appropriately at this point.
+// These are deleted at the end of the pass and uses re-parented appropriately
+// at this point.
 static rtx
 gen_tombstone (void)
 {
-  if (!tombstone)
-    {
-      tombstone = gen_rtx_CLOBBER (VOIDmode,
-				   gen_rtx_MEM (BLKmode,
-						gen_rtx_SCRATCH (Pmode)));
-      return tombstone;
-    }
-
-  return copy_rtx (tombstone);
-}
-
-static bool
-tombstone_insn_p (insn_info *insn)
-{
-  rtx x = tombstone ? tombstone : gen_tombstone ();
-  return rtx_equal_p (PATTERN (insn->rtl ()), x);
+  return gen_rtx_CLOBBER (VOIDmode,
+			  gen_rtx_MEM (BLKmode, gen_rtx_SCRATCH (Pmode)));
 }
 
+// Given a pair mode MODE, return a canonical mode to be used for a single
+// operand of such a pair.  Currently we only use this when promoting a
+// non-writeback pair into a writeback pair, as it isn't otherwise clear
+// which mode to use when storing a modeless CONST_INT.
 static machine_mode
 aarch64_operand_mode_for_pair_mode (machine_mode mode)
 {
   switch (mode)
     {
     case E_V2x4QImode:
-      return E_SImode;
+      return SImode;
     case E_V2x8QImode:
-      return E_DImode;
+      return DImode;
     case E_V2x16QImode:
-      return E_V16QImode;
+      return V16QImode;
     default:
       gcc_unreachable ();
     }
 }
 
+// Go through the reg notes rooted at NOTE, dropping those that we should drop,
+// and preserving those that we want to keep by prepending them to (and
+// returning) RESULT.  EH_REGION is used to make sure we have at most one
+// REG_EH_REGION note in the resulting list.
 static rtx
 filter_notes (rtx note, rtx result, bool *eh_region)
 {
@@ -899,37 +886,38 @@ filter_notes (rtx note, rtx result, bool *eh_region)
     {
       switch (REG_NOTE_KIND (note))
 	{
-	  case REG_EQUAL:
-	  case REG_EQUIV:
-	  case REG_DEAD:
-	  case REG_UNUSED:
-	  case REG_NOALIAS:
-	    // These can all be dropped.  For REG_EQU{AL,IV} they
-	    // cannot apply to non-single_set insns, and
-	    // REG_{DEAD,UNUSED} are re-computed by RTl-SSA, see
-	    // rtl-ssa/changes.cc:update_notes.
-	    //
-	    // Similarly, REG_NOALIAS cannot apply to a parallel.
-	  case REG_INC:
-	    // When we form the pair insn, the reg update is implemented
-	    // as just another SET in the parallel, so isn't really an
-	    // auto-increment in the RTL sense, hence we drop the note.
-	    break;
-	  case REG_EH_REGION:
-	    gcc_assert (!*eh_region);
-	    *eh_region = true;
-	    result = alloc_reg_note (REG_EH_REGION, XEXP (note, 0), result);
-	    break;
-	  case REG_CFA_DEF_CFA:
-	  case REG_CFA_OFFSET:
-	  case REG_CFA_RESTORE:
-	    result = alloc_reg_note (REG_NOTE_KIND (note),
-				     copy_rtx (XEXP (note, 0)),
-				     result);
-	    break;
-	  default:
-	    // Unexpected REG_NOTE kind.
-	    gcc_unreachable ();
+	case REG_DEAD:
+	  // REG_DEAD notes aren't required to be maintained.
+	case REG_EQUAL:
+	case REG_EQUIV:
+	case REG_UNUSED:
+	case REG_NOALIAS:
+	  // These can all be dropped.  For REG_EQU{AL,IV} they
+	  // cannot apply to non-single_set insns, and
+	  // REG_UNUSED is re-computed by RTL-SSA, see
+	  // rtl-ssa/changes.cc:update_notes.
+	  //
+	  // Similarly, REG_NOALIAS cannot apply to a parallel.
+	case REG_INC:
+	  // When we form the pair insn, the reg update is implemented
+	  // as just another SET in the parallel, so isn't really an
+	  // auto-increment in the RTL sense, hence we drop the note.
+	  break;
+	case REG_EH_REGION:
+	  gcc_assert (!*eh_region);
+	  *eh_region = true;
+	  result = alloc_reg_note (REG_EH_REGION, XEXP (note, 0), result);
+	  break;
+	case REG_CFA_DEF_CFA:
+	case REG_CFA_OFFSET:
+	case REG_CFA_RESTORE:
+	  result = alloc_reg_note (REG_NOTE_KIND (note),
+				   copy_rtx (XEXP (note, 0)),
+				   result);
+	  break;
+	default:
+	  // Unexpected REG_NOTE kind.
+	  gcc_unreachable ();
 	}
     }
 
@@ -937,34 +925,21 @@ filter_notes (rtx note, rtx result, bool *eh_region)
 }
 
 // Ensure we have a sensible scheme for combining REG_NOTEs
-// given two candidate insns I1 and I2.
+// given two candidate insns I1 and I2 where *I1 < *I2.
 static rtx
-combine_reg_notes (insn_info *i1, insn_info *i2, rtx writeback, bool &ok)
+combine_reg_notes (insn_info *i1, insn_info *i2)
 {
-  if ((writeback && find_reg_note (i1->rtl (), REG_CFA_DEF_CFA, NULL_RTX))
-      || find_reg_note (i2->rtl (), REG_CFA_DEF_CFA, NULL_RTX))
-    {
-      // CFA_DEF_CFA notes apply to the first set of the PARALLEL,
-      // so we can only preserve them in the non-writeback case, in
-      // the case that the note is attached to the lower access.
-      if (dump_file)
-	fprintf (dump_file,
-		 "(%d,%d,WB=%d): can't preserve CFA_DEF_CFA note, punting\n",
-		 i1->uid (), i2->uid (), !!writeback);
-      ok = false;
-      return NULL_RTX;
-    }
-
   bool found_eh_region = false;
   rtx result = NULL_RTX;
-  result = filter_notes (REG_NOTES (i1->rtl ()), result, &found_eh_region);
-  return filter_notes (REG_NOTES (i2->rtl ()), result, &found_eh_region);
+  result = filter_notes (REG_NOTES (i2->rtl ()), result, &found_eh_region);
+  return filter_notes (REG_NOTES (i1->rtl ()), result, &found_eh_region);
 }
 
-// Given two memory accesses, at least one of which is of a writeback form,
-// extract two non-writeback memory accesses addressed relative to the initial
-// value of the base register, and output these in PATS.  Return an rtx that
-// represents the overall change to the base register.
+// Given two memory accesses in PATS, at least one of which is of a
+// writeback form, extract two non-writeback memory accesses addressed
+// relative to the initial value of the base register, and output these
+// in PATS.  Return an rtx that represents the overall change to the
+// base register.
 static rtx
 extract_writebacks (bool load_p, rtx pats[2], int changed)
 {
@@ -978,9 +953,11 @@ extract_writebacks (bool load_p, rtx pats[2], int changed)
       rtx mem = XEXP (pats[i], load_p);
       rtx reg = XEXP (pats[i], !load_p);
 
-      rtx modify = NULL_RTX;
+      rtx addr = XEXP (mem, 0);
+      const bool autoinc_p = GET_RTX_CLASS (GET_CODE (addr)) == RTX_AUTOINC;
+
       poly_int64 offset;
-      rtx this_base = ldp_strip_offset (mem, &modify, &offset);
+      rtx this_base = ldp_strip_offset (mem, &offset);
       gcc_assert (REG_P (this_base));
       if (base_reg)
 	gcc_assert (rtx_equal_p (base_reg, this_base));
@@ -992,16 +969,16 @@ extract_writebacks (bool load_p, rtx pats[2], int changed)
       // address of the other access.
       if (i == changed)
 	{
-	  gcc_checking_assert (!modify);
+	  gcc_checking_assert (!autoinc_p);
 	  offsets[i] = offset;
 	  continue;
 	}
 
-      if (modify && any_pre_modify_p (modify))
+      if (autoinc_p && any_pre_modify_p (addr))
 	current_offset += offset;
 
       poly_int64 this_off = current_offset;
-      if (!modify)
+      if (!autoinc_p)
 	this_off += offset;
 
       offsets[i] = this_off;
@@ -1012,7 +989,7 @@ extract_writebacks (bool load_p, rtx pats[2], int changed)
 	? gen_rtx_SET (reg, new_mem)
 	: gen_rtx_SET (new_mem, reg);
 
-      if (modify && any_post_modify_p (modify))
+      if (autoinc_p && any_post_modify_p (addr))
 	current_offset += offset;
     }
 
@@ -1023,28 +1000,54 @@ extract_writebacks (bool load_p, rtx pats[2], int changed)
 					       base_reg, current_offset));
 }
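
To make the offset bookkeeping in extract_writebacks concrete, here is a
minimal standalone sketch of its accumulation loop, using plain integers and a
made-up addr_kind enum in place of poly_int64 and the RTL auto-increment
codes.  It computes the two offsets relative to the initial base value
together with the net base adjustment:

#include <cassert>

// Simplified stand-ins for the RTX address classes handled above.
enum addr_kind { ADDR_PLAIN, ADDR_PRE_MODIFY, ADDR_POST_MODIFY };

struct access
{
  addr_kind kind;
  long offset;  // Offset encoded in the address, or the increment amount.
};

// Return (in OUT) the offsets of the two accesses relative to the initial
// value of the base register, and the net change to the base register in
// *BASE_CHANGE, mirroring the loop in extract_writebacks.
static void
relative_offsets (const access a[2], long out[2], long *base_change)
{
  long current_offset = 0;
  for (int i = 0; i < 2; i++)
    {
      if (a[i].kind == ADDR_PRE_MODIFY)
        current_offset += a[i].offset;

      long this_off = current_offset;
      if (a[i].kind == ADDR_PLAIN)
        this_off += a[i].offset;

      out[i] = this_off;

      if (a[i].kind == ADDR_POST_MODIFY)
        current_offset += a[i].offset;
    }
  *base_change = current_offset;
}

int main ()
{
  // A post-increment access by 16 followed by a plain access at [base, -8]
  // touch offsets 0 and 8 from the initial base, and advance the base by 16.
  access a[2] = { { ADDR_POST_MODIFY, 16 }, { ADDR_PLAIN, -8 } };
  long off[2], base_change;
  relative_offsets (a, off, &base_change);
  assert (off[0] == 0 && off[1] == 8 && base_change == 16);
  return 0;
}
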
 
+// INSNS contains either {nullptr, pair insn} (when promoting an existing
+// non-writeback pair) or contains the candidate insns used to form the pair
+// (when fusing a new pair).
+//
+// PAIR_RANGE specifies where we want to form the final pair.
+// INITIAL_OFFSET gives the current base offset for the pair,
+// INITIAL_WRITEBACK says whether either of the initial accesses had
+// writeback.
+// ACCESS_SIZE gives the access size for a single arm of the pair.
+// BASE_DEF gives the initial def of the base register consumed by the pair.
+//
+// Given the above, this function looks for a trailing destructive update of the
+// base register.  If there is one, we choose the first such update after
+// INSNS[1] that is still in the same BB as our pair.  We return the
+// new def in *ADD_DEF and the resulting writeback effect in
+// *WRITEBACK_EFFECT.
 static insn_info *
 find_trailing_add (insn_info *insns[2],
 		   const insn_range_info &pair_range,
+		   int initial_writeback,
 		   rtx *writeback_effect,
 		   def_info **add_def,
 		   def_info *base_def,
 		   poly_int64 initial_offset,
 		   unsigned access_size)
 {
-  insn_info *pair_insn = insns[1];
+  insn_info *pair_dst = pair_range.singleton ();
+  gcc_assert (pair_dst);
 
   def_info *def = base_def->next_def ();
 
-  while (def
-	 && def->bb () == pair_insn->bb ()
-	 && *(def->insn ()) <= *pair_insn)
-    def = def->next_def ();
+  // If either of the initial pair insns had writeback, then there will
+  // be intervening defs of the base register.
+  // Skip over these.
+  for (int i = 0; i < 2; i++)
+    if (initial_writeback & (1 << i))
+      {
+	gcc_assert (def->insn () == insns[i]);
+	def = def->next_def ();
+      }
 
-  if (!def || def->bb () != pair_insn->bb ())
+  if (!def || def->bb () != pair_dst->bb ())
     return nullptr;
 
+  // DEF should now be the first def of the base register after PAIR_DST.
   insn_info *cand = def->insn ();
+  gcc_assert (*cand > *pair_dst);
+
   const auto base_regno = base_def->regno ();
 
   // If CAND doesn't also use our base register,
@@ -1087,9 +1090,6 @@ find_trailing_add (insn_info *insns[2],
   if (off_hwi < LDP_MIN_IMM || off_hwi > LDP_MAX_IMM)
     return nullptr;
 
-  insn_info *pair_dst = pair_range.singleton ();
-  gcc_assert (pair_dst);
-
   auto dump_prefix = [&]()
     {
       if (!insns[0])
@@ -1099,7 +1099,7 @@ find_trailing_add (insn_info *insns[2],
 		 insns[0]->uid (), insns[1]->uid ());
     };
 
-  insn_info *hazard = latest_hazard_before (cand, nullptr, pair_insn);
+  insn_info *hazard = latest_hazard_before (cand, nullptr, insns[1]);
   if (!hazard || *hazard <= *pair_dst)
     {
       if (dump_file)
@@ -1126,40 +1126,52 @@ find_trailing_add (insn_info *insns[2],
   return nullptr;
 }
 
+// We just emitted a tombstone with uid UID; track it in a bitmap for
+// this BB so we can easily identify it later when cleaning up tombstones.
+void
+ldp_bb_info::track_tombstone (int uid)
+{
+  if (!m_emitted_tombstone)
+    {
+      // Lazily initialize the bitmap for tracking tombstone insns.
+      bitmap_obstack_initialize (&m_bitmap_obstack);
+      bitmap_initialize (&m_tombstone_bitmap, &m_bitmap_obstack);
+      m_emitted_tombstone = true;
+    }
+
+  if (!bitmap_set_bit (&m_tombstone_bitmap, uid))
+    gcc_unreachable (); // Bit should have changed.
+}
+
 // Try and actually fuse the pair given by insns I1 and I2.
-static bool
-fuse_pair (bool load_p,
-	   unsigned access_size,
-	   int writeback,
-	   insn_info *i1,
-	   insn_info *i2,
-	   base_cand &base,
-	   const insn_range_info &move_range,
-	   bool &emitted_tombstone_p)
+bool
+ldp_bb_info::fuse_pair (bool load_p,
+			unsigned access_size,
+			int writeback,
+			insn_info *i1, insn_info *i2,
+			base_cand &base,
+			const insn_range_info &move_range)
 {
   auto attempt = crtl->ssa->new_change_attempt ();
 
   auto make_change = [&attempt](insn_info *insn)
     {
-      return crtl->ssa->change_alloc <insn_change> (attempt, insn);
+      return crtl->ssa->change_alloc<insn_change> (attempt, insn);
     };
   auto make_delete = [&attempt](insn_info *insn)
     {
-      return crtl->ssa->change_alloc <insn_change> (attempt,
-						    insn,
-						    insn_change::DELETE);
+      return crtl->ssa->change_alloc<insn_change> (attempt,
+						   insn,
+						   insn_change::DELETE);
     };
 
-  // Are we using a tombstone insn for this pair?
-  bool have_tombstone_p = false;
-
   insn_info *first = (*i1 < *i2) ? i1 : i2;
   insn_info *second = (first == i1) ? i2 : i1;
 
   insn_info *insns[2] = { first, second };
 
-  auto_vec <insn_change *> changes;
-  changes.reserve (4);
+  auto_vec<insn_change *, 4> changes (4);
+  auto_vec<int, 2> tombstone_uids (2);
 
   rtx pats[2] = {
     PATTERN (first->rtl ()),
@@ -1216,7 +1228,7 @@ fuse_pair (bool load_p,
   if (writeback)
     writeback_effect = extract_writebacks (load_p, pats, changed_insn);
 
-  const auto base_regno = base.m_def->regno ();
+  const auto base_regno = base.def->regno ();
 
   if (base.from_insn == -1 && (writeback & 1))
     {
@@ -1266,14 +1278,6 @@ fuse_pair (bool load_p,
       writeback_effect = NULL_RTX;
     }
 
-  // If both of the original insns had a writeback form, then we should drop the
-  // first def.  The second def could well have uses, but the first def should
-  // only be used by the second insn (and we dropped that use above).
-  if (writeback == 3)
-    input_defs[0] = check_remove_regno_access (attempt,
-					       input_defs[0],
-					       base_regno);
-
   // So far the patterns have been in instruction order,
   // now we want them in offset order.
   if (i1 != first)
@@ -1289,33 +1293,46 @@ fuse_pair (bool load_p,
       gcc_checking_assert (base_regno == REGNO (base));
     }
 
+  // If either of the original insns had writeback, but the resulting pair insn
+  // does not (can happen e.g. in the ldp edge case above, or if the writeback
+  // effects cancel out), then drop the def(s) of the base register as
+  // appropriate.
+  //
+  // Also drop the first def in the case that both of the original insns had
+  // writeback.  The second def could well have uses, but the first def should
+  // only be used by the second insn (and we dropped that use above).
+  for (int i = 0; i < 2; i++)
+    if ((!writeback_effect && (writeback & (1 << i)))
+	|| (i == 0 && writeback == 3))
+      input_defs[i] = check_remove_regno_access (attempt,
+						 input_defs[i],
+						 base_regno);
+
+  // If we don't currently have a writeback pair, and we don't have
+  // a load that clobbers the base register, look for a trailing destructive
+  // update of the base register and try and fold it in to make this into a
+  // writeback pair.
   insn_info *trailing_add = nullptr;
-  if (aarch64_ldp_writeback > 1 && !writeback_effect)
+  if (aarch64_ldp_writeback > 1
+      && !writeback_effect
+      && (!load_p || (!refers_to_regno_p (base_regno, base_regno + 1,
+					 XEXP (pats[0], 0), nullptr)
+		      && !refers_to_regno_p (base_regno, base_regno + 1,
+					     XEXP (pats[1], 0), nullptr))))
     {
       def_info *add_def;
-      trailing_add = find_trailing_add (insns, move_range, &writeback_effect,
-					&add_def, base.m_def, offsets[0],
+      trailing_add = find_trailing_add (insns, move_range, writeback,
+					&writeback_effect,
+					&add_def, base.def, offsets[0],
 					access_size);
-      if (trailing_add && !writeback)
+      if (trailing_add)
 	{
-	  // If there was no writeback to start with, we need to preserve the
-	  // def of the base register from the add insn.
+	  // The def of the base register from the trailing add should prevail.
 	  input_defs[0] = insert_access (attempt, add_def, input_defs[0]);
 	  gcc_assert (input_defs[0].is_valid ());
 	}
     }
 
-  // If either of the original insns had writeback, but the resulting
-  // pair insn does not (can happen e.g. in the ldp edge case above, or
-  // if the writeback effects cancel out), then drop the def(s) of the
-  // base register as appropriate.
-  if (!writeback_effect)
-    for (int i = 0; i < 2; i++)
-      if (writeback & (1 << i))
-	input_defs[i] = check_remove_regno_access (attempt,
-						   input_defs[i],
-						   base_regno);
-
   // Now that we know what base mem we're going to use, check if it's OK
   // with the ldp/stp policy.
   rtx first_mem = XEXP (pats[0], load_p);
@@ -1329,10 +1346,7 @@ fuse_pair (bool load_p,
       return false;
     }
 
-  bool reg_notes_ok = true;
-  rtx reg_notes = combine_reg_notes (i1, i2, writeback_effect, reg_notes_ok);
-  if (!reg_notes_ok)
-    return false;
+  rtx reg_notes = combine_reg_notes (first, second);
 
   rtx pair_pat;
   if (writeback_effect)
@@ -1379,68 +1393,53 @@ fuse_pair (bool load_p,
     }
   else
     {
-      change_strategy strategy[2];
-      def_info *stp_def = decide_stp_strategy (strategy, first, second,
-					       move_range);
-      if (dump_file)
-	{
-	  auto cs1 = cs_to_string (strategy[0]);
-	  auto cs2 = cs_to_string (strategy[1]);
-	  fprintf (dump_file,
-		   "  stp strategy for candidate insns (%d,%d): (%s,%s)\n",
-		   insns[0]->uid (), insns[1]->uid (), cs1, cs2);
-	  if (stp_def)
-	    fprintf (dump_file,
-		     "  re-using mem def from insn %d\n",
-		     stp_def->insn ()->uid ());
-	}
+      insn_info *store_to_change = decide_stp_strategy (first, second,
+							move_range);
+
+      if (store_to_change && dump_file)
+	fprintf (dump_file, "  stp: re-purposing store %d\n",
+		 store_to_change->uid ());
 
       insn_change *change;
       for (int i = 0; i < 2; i++)
 	{
-	  switch (strategy[i])
+	  change = make_change (insns[i]);
+	  if (insns[i] == store_to_change)
 	    {
-	    case DELETE:
-	      changes.quick_push (make_delete (insns[i]));
-	      break;
-	    case TOMBSTONE:
-	    case CHANGE:
-	      change = make_change (insns[i]);
-	      if (strategy[i] == CHANGE)
-		{
-		  set_pair_pat (change);
-		  change->new_uses = merge_access_arrays (attempt,
-							  input_uses[0],
-							  input_uses[1]);
-		  auto d1 = drop_memory_access (input_defs[0]);
-		  auto d2 = drop_memory_access (input_defs[1]);
-		  change->new_defs = merge_access_arrays (attempt, d1, d2);
-		  gcc_assert (change->new_defs.is_valid ());
-		  gcc_assert (stp_def);
-		  change->new_defs = insert_access (attempt,
-						    stp_def,
-						    change->new_defs);
-		  gcc_assert (change->new_defs.is_valid ());
-		  change->move_range = move_range;
-		  pair_change = change;
-		}
-	      else
-		{
-		  rtx_insn *rti = insns[i]->rtl ();
-		  gcc_assert (validate_change (rti, &PATTERN (rti),
-					       gen_tombstone (), true));
-		  gcc_assert (validate_change (rti, &REG_NOTES (rti),
-					       NULL_RTX, true));
-		  change->new_uses = use_array (nullptr, 0);
-		  have_tombstone_p = true;
-		}
-	      gcc_assert (change->new_uses.is_valid ());
-	      changes.quick_push (change);
-	      break;
+	      set_pair_pat (change);
+	      change->new_uses = merge_access_arrays (attempt,
+						      input_uses[0],
+						      input_uses[1]);
+	      auto d1 = drop_memory_access (input_defs[0]);
+	      auto d2 = drop_memory_access (input_defs[1]);
+	      change->new_defs = merge_access_arrays (attempt, d1, d2);
+	      gcc_assert (change->new_defs.is_valid ());
+	      def_info *stp_def = memory_access (store_to_change->defs ());
+	      change->new_defs = insert_access (attempt,
+						stp_def,
+						change->new_defs);
+	      gcc_assert (change->new_defs.is_valid ());
+	      change->move_range = move_range;
+	      pair_change = change;
 	    }
+	  else
+	    {
+	      // Note that we are turning this insn into a tombstone;
+	      // we need to keep track of these if we go ahead with the
+	      // change.
+	      tombstone_uids.quick_push (insns[i]->uid ());
+	      rtx_insn *rti = insns[i]->rtl ();
+	      gcc_assert (validate_change (rti, &PATTERN (rti),
+					   gen_tombstone (), true));
+	      gcc_assert (validate_change (rti, &REG_NOTES (rti),
+					   NULL_RTX, true));
+	      change->new_uses = use_array (nullptr, 0);
+	    }
+	  gcc_assert (change->new_uses.is_valid ());
+	  changes.quick_push (change);
 	}
 
-      if (!stp_def)
+      if (!store_to_change)
 	{
 	  // Tricky case.  Cannot re-purpose existing insns for stp.
 	  // Need to insert new insn.
@@ -1502,7 +1501,11 @@ fuse_pair (bool load_p,
 
   confirm_change_group ();
   crtl->ssa->change_insns (changes);
-  emitted_tombstone_p |= have_tombstone_p;
+
+  gcc_checking_assert (tombstone_uids.length () <= 2);
+  for (auto uid : tombstone_uids)
+    track_tombstone (uid);
+
   return true;
 }
 
@@ -1511,9 +1514,6 @@ fuse_pair (bool load_p,
 static bool
 store_modifies_mem_p (rtx mem, insn_info *store_insn, int &budget)
 {
-  if (tombstone_insn_p (store_insn))
-    return false;
-
   if (!budget)
     {
       if (dump_file)
@@ -1560,65 +1560,30 @@ load_modified_by_store_p (insn_info *load,
   return modified_in_p (PATTERN (load->rtl ()), store->rtl ());
 }
 
+// Virtual base class for load/store walkers used in alias analysis.
 struct alias_walker
 {
-  virtual insn_info *insn () const = 0;
-  virtual bool valid () const = 0;
   virtual bool conflict_p (int &budget) const = 0;
+  virtual insn_info *insn () const = 0;
+  virtual bool valid () const = 0;
   virtual void advance () = 0;
 };
 
+// Implement some common functionality used by both store_walker
+// and load_walker.
 template<bool reverse>
-class store_walker : public alias_walker
-{
-  using def_iter_t = typename std::conditional <reverse,
-	reverse_def_iterator, def_iterator>::type;
-
-  def_iter_t def_iter;
-  rtx cand_mem;
-  insn_info *limit;
-
-public:
-  store_walker (def_info *mem_def, rtx mem, insn_info *limit_insn) :
-    def_iter (mem_def), cand_mem (mem), limit (limit_insn) {}
-
-  bool valid () const override
-    {
-      if (!*def_iter)
-	return false;
-
-      if (reverse)
-	return *((*def_iter)->insn ()) > *limit;
-      else
-	return *((*def_iter)->insn ()) < *limit;
-    }
-  insn_info *insn () const override { return (*def_iter)->insn (); }
-  void advance () override { def_iter++; }
-  bool conflict_p (int &budget) const override
-  {
-    return store_modifies_mem_p (cand_mem, insn (), budget);
-  }
-};
-
-template<bool reverse>
-class load_walker : public alias_walker
+class def_walker : public alias_walker
 {
-  using def_iter_t = typename std::conditional <reverse,
+protected:
+  using def_iter_t = typename std::conditional<reverse,
 	reverse_def_iterator, def_iterator>::type;
-  using use_iter_t = typename std::conditional <reverse,
-	reverse_use_iterator, nondebug_insn_use_iterator>::type;
-
-  def_iter_t def_iter;
-  use_iter_t use_iter;
-  insn_info *cand_store;
-  insn_info *limit;
 
   static use_info *start_use_chain (def_iter_t &def_iter)
   {
     set_info *set = nullptr;
     for (; *def_iter; def_iter++)
       {
-	set = dyn_cast <set_info *> (*def_iter);
+	set = dyn_cast<set_info *> (*def_iter);
 	if (!set)
 	  continue;
 
@@ -1633,41 +1598,87 @@ class load_walker : public alias_walker
     return nullptr;
   }
 
+  def_iter_t def_iter;
+  insn_info *limit;
+  def_walker (def_info *def, insn_info *limit) :
+    def_iter (def), limit (limit) {}
+
+  virtual bool iter_valid () const { return *def_iter; }
+
 public:
-  void advance () override
+  insn_info *insn () const override { return (*def_iter)->insn (); }
+  void advance () override { def_iter++; }
+  bool valid () const override final
   {
-    use_iter++;
-    if (*use_iter)
-      return;
-    def_iter++;
-    use_iter = start_use_chain (def_iter);
+    if (!iter_valid ())
+      return false;
+
+    if (reverse)
+      return *(insn ()) > *limit;
+    else
+      return *(insn ()) < *limit;
   }
+};
 
-  insn_info *insn () const override
+// alias_walker that iterates over stores.
+template<bool reverse, typename InsnPredicate>
+class store_walker : public def_walker<reverse>
+{
+  rtx cand_mem;
+  InsnPredicate tombstone_p;
+
+public:
+  store_walker (def_info *mem_def, rtx mem, insn_info *limit_insn,
+		InsnPredicate tombstone_fn) :
+    def_walker<reverse> (mem_def, limit_insn),
+    cand_mem (mem), tombstone_p (tombstone_fn) {}
+
+  bool conflict_p (int &budget) const override final
   {
-    gcc_checking_assert (*use_iter);
-    return (*use_iter)->insn ();
+    if (tombstone_p (this->insn ()))
+      return false;
+
+    return store_modifies_mem_p (cand_mem, this->insn (), budget);
   }
+};
 
-  bool valid () const override
+// alias_walker that iterates over loads.
+template<bool reverse>
+class load_walker : public def_walker<reverse>
+{
+  using Base = def_walker<reverse>;
+  using use_iter_t = typename std::conditional<reverse,
+	reverse_use_iterator, nondebug_insn_use_iterator>::type;
+
+  use_iter_t use_iter;
+  insn_info *cand_store;
+
+  bool iter_valid () const override final { return *use_iter; }
+
+public:
+  void advance () override final
   {
-    if (!*use_iter)
-      return false;
+    use_iter++;
+    if (*use_iter)
+      return;
+    this->def_iter++;
+    use_iter = Base::start_use_chain (this->def_iter);
+  }
 
-    if (reverse)
-      return *((*use_iter)->insn ()) > *limit;
-    else
-      return *((*use_iter)->insn ()) < *limit;
+  insn_info *insn () const override final
+  {
+    return (*use_iter)->insn ();
   }
 
-  bool conflict_p (int &budget) const override
+  bool conflict_p (int &budget) const override final
   {
     return load_modified_by_store_p (insn (), cand_store, budget);
   }
 
   load_walker (def_info *def, insn_info *store, insn_info *limit_insn)
-    : def_iter (def), use_iter (start_use_chain (def_iter)),
-      cand_store (store), limit (limit_insn) {}
+    : Base (def, limit_insn),
+      use_iter (Base::start_use_chain (this->def_iter)),
+      cand_store (store) {}
 };
 
 // Process our alias_walkers in a round-robin fashion, proceeding until
@@ -1757,11 +1768,18 @@ do_alias_analysis (insn_info *alias_hazards[4],
     }
 }
 
-// Return an integer where bit (1 << i) is set if INSNS[i] uses writeback
+// Given INSNS (in program order) which are known to be adjacent, look
+// to see if either insn has a suitable RTL (register) base that we can
+// use to form a pair.  Push these to BASE_CANDS if we find any.  CAND_MEMs
+// gives the relevant mems from the candidate insns, ACCESS_SIZE gives the
+// size of a single candidate access, and REVERSED says whether the accesses
+// are inverted in offset order.
+//
+// Returns an integer where bit (1 << i) is set if INSNS[i] uses writeback
 // addressing.
 static int
 get_viable_bases (insn_info *insns[2],
-		  vec <base_cand> &base_cands,
+		  vec<base_cand> &base_cands,
 		  rtx cand_mems[2],
 		  unsigned access_size,
 		  bool reversed)
@@ -1774,9 +1792,8 @@ get_viable_bases (insn_info *insns[2],
     {
       const bool is_lower = (i == reversed);
       poly_int64 poly_off;
-      rtx modify = NULL_RTX;
-      rtx base = ldp_strip_offset (cand_mems[i], &modify, &poly_off);
-      if (modify)
+      rtx base = ldp_strip_offset (cand_mems[i], &poly_off);
+      if (GET_RTX_CLASS (GET_CODE (XEXP (cand_mems[i], 0))) == RTX_AUTOINC)
 	writeback |= (1 << i);
 
       if (!REG_P (base) || !poly_off.is_constant ())
@@ -1869,12 +1886,12 @@ get_viable_bases (insn_info *insns[2],
 
 // Given two adjacent memory accesses of the same size, I1 and I2, try
 // and see if we can merge them into a ldp or stp.
-static bool
-try_fuse_pair (bool load_p,
-	       unsigned access_size,
-	       insn_info *i1,
-	       insn_info *i2,
-	       bool &emitted_tombstone_p)
+//
+// ACCESS_SIZE gives the (common) size of a single access; LOAD_P is true
+// if the accesses are both loads, otherwise they are both stores.
+bool
+ldp_bb_info::try_fuse_pair (bool load_p, unsigned access_size,
+			    insn_info *i1, insn_info *i2)
 {
   if (dump_file)
     fprintf (dump_file, "analyzing pair (load=%d): (%d,%d)\n",
@@ -1926,8 +1943,7 @@ try_fuse_pair (bool load_p,
       return false;
     }
 
-  auto_vec <base_cand> base_cands;
-  base_cands.reserve (2);
+  auto_vec<base_cand, 2> base_cands (2);
 
   int writeback = get_viable_bases (insns, base_cands, cand_mems,
 				    access_size, reversed);
@@ -1981,7 +1997,7 @@ try_fuse_pair (bool load_p,
 		     "hazards (%d,%d)\n",
 		     insns[0]->uid (),
 		     insns[1]->uid (),
-		     cand.m_def->regno (),
+		     cand.def->regno (),
 		     cand.hazards[0]->uid (),
 		     cand.hazards[1]->uid ());
 
@@ -2027,12 +2043,17 @@ try_fuse_pair (bool load_p,
       gcc_checking_assert (mem_defs[1]);
     }
 
-  store_walker<false> forward_store_walker (mem_defs[0],
-					    cand_mems[0],
-					    insns[1]);
-  store_walker<true> backward_store_walker (mem_defs[1],
-					    cand_mems[1],
-					    insns[0]);
+  auto tombstone_p = [&](insn_info *insn) -> bool {
+    return m_emitted_tombstone
+	   && bitmap_bit_p (&m_tombstone_bitmap, insn->uid ());
+  };
+
+  store_walker<false, decltype(tombstone_p)>
+    forward_store_walker (mem_defs[0], cand_mems[0], insns[1], tombstone_p);
+
+  store_walker<true, decltype(tombstone_p)>
+    backward_store_walker (mem_defs[1], cand_mems[1], insns[0], tombstone_p);
+
   alias_walker *walkers[4] = {};
   if (mem_defs[0])
     walkers[0] = &forward_store_walker;
@@ -2092,7 +2113,7 @@ try_fuse_pair (bool load_p,
 	    fprintf (dump_file, "pair (%d,%d): rejecting base %d due to "
 				"alias/dataflow hazards (%d,%d)",
 				insns[0]->uid (), insns[1]->uid (),
-				cand.m_def->regno (),
+				cand.def->regno (),
 				cand.hazards[0]->uid (),
 				cand.hazards[1]->uid ());
 
@@ -2167,17 +2188,17 @@ try_fuse_pair (bool load_p,
 
       fprintf (dump_file, "fusing pair [L=%d] (%d,%d), base=%d, hazards: (",
 	      load_p, insns[0]->uid (), insns[1]->uid (),
-	      base->m_def->regno ());
+	      base->def->regno ());
       print_pair (base->hazards);
       fprintf (dump_file, "), move_range: (%d,%d)\n",
 	       range.first->uid (), range.last->uid ());
     }
 
   return fuse_pair (load_p, access_size, writeback,
-		    i1, i2, *base, range, emitted_tombstone_p);
+		    i1, i2, *base, range);
 }
 
-// Erase [l.begin (), i] inclusive, respecting iterator order.
+// Erase [l.begin (), i] inclusive, return the new value of l.begin ().
 static insn_iter_t
 erase_prefix (insn_list_t &l, insn_iter_t i)
 {
@@ -2185,10 +2206,12 @@ erase_prefix (insn_list_t &l, insn_iter_t i)
   return l.begin ();
 }
 
+// Remove the insn at iterator I from the list.  If it was the first insn
+// in the list, return the next one.  Otherwise, return the previous one.
 static insn_iter_t
-erase_one (insn_list_t &l, insn_iter_t i, insn_iter_t begin)
+erase_one (insn_list_t &l, insn_iter_t i)
 {
-  auto prev_or_next = (i == begin) ? std::next (i) : std::prev (i);
+  auto prev_or_next = (i == l.begin ()) ? std::next (i) : std::prev (i);
   l.erase (i);
   return prev_or_next;
 }
@@ -2206,9 +2229,7 @@ dump_insn_list (FILE *f, const insn_list_t &l)
   i++;
 
   for (; i != end; i++)
-    {
-      fprintf (f, ", %d", (*i)->uid ());
-    }
+    fprintf (f, ", %d", (*i)->uid ());
 
   fprintf (f, ")");
 }
@@ -2220,85 +2241,87 @@ debug (const insn_list_t &l)
   fprintf (stderr, "\n");
 }
 
+// LEFT_LIST and RIGHT_LIST are lists of candidate instructions
+// where all insns in LEFT_LIST are known to be adjacent to those
+// in RIGHT_LIST.
+//
+// This function traverses the resulting 2D matrix of possible pair
+// candidates and attempts to merge them into pairs.
+//
+// The algorithm is straightforward: if we consider a combined list
+// of candidates X obtained by merging LEFT_LIST and RIGHT_LIST in
+// program order, then we advance through X until we
+// reach a crossing point (where X[i] and X[i+1] come from different
+// source lists).
+//
+// At this point we know X[i] and X[i+1] are adjacent accesses, and
+// we try to fuse them into a pair.  If this succeeds, we remove X[i]
+// and X[i+1] from their original lists and continue as above.  We
+// queue the access that came from RIGHT_LIST for deletion by adding
+// it to TO_DELETE, so that we don't try and merge it in subsequent
+// iterations of transform_for_base.  See below for a description of the
+// handling in the failure case.
 void
-merge_pairs (insn_iter_t l_begin,
-	     insn_iter_t l_end,
-	     insn_iter_t r_begin,
-	     insn_iter_t r_end,
-	     insn_list_t &left_list,
-	     insn_list_t &right_list,
-	     hash_set <insn_info *> &to_delete,
-	     bool load_p,
-	     unsigned access_size,
-	     bool &emitted_tombstone_p)
+ldp_bb_info::merge_pairs (insn_list_t &left_list,
+			  insn_list_t &right_list,
+			  hash_set<insn_info *> &to_delete,
+			  bool load_p,
+			  unsigned access_size)
 {
-  auto iter_l = l_begin;
-  auto iter_r = r_begin;
+  auto iter_l = left_list.begin ();
+  auto iter_r = right_list.begin ();
 
-  bool result;
-  while (l_begin != l_end && r_begin != r_end)
+  while (!left_list.empty () && !right_list.empty ())
     {
       auto next_l = std::next (iter_l);
       auto next_r = std::next (iter_r);
       if (**iter_l < **iter_r
-	  && next_l != l_end
+	  && next_l != left_list.end ()
 	  && **next_l < **iter_r)
 	{
 	  iter_l = next_l;
 	  continue;
 	}
       else if (**iter_r < **iter_l
-	       && next_r != r_end
+	       && next_r != right_list.end ()
 	       && **next_r < **iter_l)
 	{
 	  iter_r = next_r;
 	  continue;
 	}
 
-      bool update_l = false;
-      bool update_r = false;
-
-      result = try_fuse_pair (load_p, access_size,
-			      *iter_l, *iter_r,
-			      emitted_tombstone_p);
-      if (result)
+      if (try_fuse_pair (load_p, access_size, *iter_l, *iter_r))
 	{
-	  update_l = update_r = true;
 	  if (to_delete.add (*iter_r))
 	    gcc_unreachable (); // Shouldn't get added twice.
 
-	  iter_l = erase_one (left_list, iter_l, l_begin);
-	  iter_r = erase_one (right_list, iter_r, r_begin);
+	  iter_l = erase_one (left_list, iter_l);
+	  iter_r = erase_one (right_list, iter_r);
 	}
       else
 	{
-	  // Here we know that the entire prefix we skipped
-	  // over cannot merge with anything further on
-	  // in iteration order (there are aliasing hazards
-	  // on both sides), so delete the entire prefix.
+	  // If we failed to merge the pair, then we delete the entire
+	  // prefix of insns that originated from the same source list.
+	  // The rationale for this is as follows.
+	  //
+	  // In the store case, the insns in the prefix can't be
+	  // re-ordered over each other as they are guaranteed to store
+	  // to the same location, so we're guaranteed not to lose
+	  // opportunities by doing this.
+	  //
+	  // In the load case, subsequent loads from the same location
+	  // are either redundant (in which case they should have been
+	  // cleaned up by an earlier optimization pass) or there is an
+	  // intervening aliasing hazard, in which case we can't
+	  // re-order them anyway, so provided earlier passes have
+	  // cleaned up redundant loads, we shouldn't miss opportunities
+	  // by doing this.
 	  if (**iter_l < **iter_r)
-	    {
-	      // Delete everything from l_begin to iter_l, inclusive.
-	      update_l = true;
-	      iter_l = erase_prefix (left_list, iter_l);
-	    }
+	    // Delete the prefix of left_list up to and including iter_l.
+	    iter_l = erase_prefix (left_list, iter_l);
 	  else
-	    {
-	      // Delete everything from r_begin to iter_r, inclusive.
-	      update_r = true;
-	      iter_r = erase_prefix (right_list, iter_r);
-	    }
-	}
-
-      if (update_l)
-	{
-	  l_begin = left_list.begin ();
-	  l_end = left_list.end ();
-	}
-      if (update_r)
-	{
-	  r_begin = right_list.begin ();
-	  r_end = right_list.end ();
+	    // Delete the prefix of right_list up to and including iter_r.
+	    iter_r = erase_prefix (right_list, iter_r);
 	}
     }
 }
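
The traversal above can be illustrated in isolation.  The sketch below runs
the same crossing-point walk over two sorted lists of integers standing in for
insns in program order; try_fuse is a made-up placeholder for try_fuse_pair,
and the iterator handling after an erase is simplified relative to
erase_one/erase_prefix:

#include <cstdio>
#include <iterator>
#include <list>

static bool
try_fuse (int l, int r)
{
  // Placeholder: pretend fusion succeeds only for consecutive program points.
  return r == l + 1 || l == r + 1;
}

static void
merge_pairs (std::list<int> &left, std::list<int> &right)
{
  auto iter_l = left.begin ();
  auto iter_r = right.begin ();

  while (!left.empty () && !right.empty ())
    {
      auto next_l = std::next (iter_l);
      auto next_r = std::next (iter_r);

      // Advance within one list while its next element still comes before
      // the head of the other list (no crossing point yet).
      if (*iter_l < *iter_r && next_l != left.end () && *next_l < *iter_r)
        {
          iter_l = next_l;
          continue;
        }
      if (*iter_r < *iter_l && next_r != right.end () && *next_r < *iter_l)
        {
          iter_r = next_r;
          continue;
        }

      // *iter_l and *iter_r are adjacent in the merged order.
      if (try_fuse (*iter_l, *iter_r))
        {
          printf ("fused (%d,%d)\n", *iter_l, *iter_r);
          iter_l = left.erase (iter_l);
          iter_r = right.erase (iter_r);
        }
      else
        {
          // On failure, drop the prefix that can't pair with anything later.
          if (*iter_l < *iter_r)
            iter_l = left.erase (left.begin (), std::next (iter_l));
          else
            iter_r = right.erase (right.begin (), std::next (iter_r));
        }

      if (iter_l == left.end () || iter_r == right.end ())
        break;
    }
}

int main ()
{
  std::list<int> left { 1, 4, 7 };
  std::list<int> right { 2, 5, 9 };
  merge_pairs (left, right);  // Fuses (1,2) and (4,5); (7,9) fails.
  return 0;
}
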
@@ -2328,16 +2351,12 @@ ldp_bb_info::try_form_pairs (insn_list_t *left_orig,
 
   // List of candidate insns to delete from the original right_list
   // (because they were formed into a pair).
-  hash_set <insn_info *> to_delete;
+  hash_set<insn_info *> to_delete;
 
   // Now we have a 2D matrix of candidates, traverse it to try and
   // find a pair of insns that are already adjacent (within the
   // merged list of accesses).
-  merge_pairs (left_orig->begin (), left_orig->end (),
-	       right_copy.begin (), right_copy.end (),
-	       *left_orig, right_copy,
-	       to_delete, load_p, access_size,
-	       m_emitted_tombstone);
+  merge_pairs (*left_orig, right_copy, to_delete, load_p, access_size);
 
   // If we formed all right candidates into pairs,
   // then we can skip the next iteration.
@@ -2363,6 +2382,9 @@ ldp_bb_info::try_form_pairs (insn_list_t *left_orig,
   return false;
 }
 
+// Iterate over the accesses in GROUP, looking for adjacent sets
+// of accesses.  If we find two sets of adjacent accesses, call
+// try_form_pairs.
 void
 ldp_bb_info::transform_for_base (int encoded_lfs,
 				 access_group &group)
@@ -2386,10 +2408,13 @@ ldp_bb_info::transform_for_base (int encoded_lfs,
     }
 }
 
+// If we emitted tombstone insns for this BB, iterate through the BB
+// and remove all the tombstone insns, being sure to reparent any uses
+// of mem to previous defs when we do this.
 void
 ldp_bb_info::cleanup_tombstones ()
 {
-  // No need to do anything if we didn't emit a tombstone insn for this bb.
+  // No need to do anything if we didn't emit a tombstone insn for this BB.
   if (!m_emitted_tombstone)
     return;
 
@@ -2397,20 +2422,21 @@ ldp_bb_info::cleanup_tombstones ()
   while (insn)
     {
       insn_info *next = insn->next_nondebug_insn ();
-      if (!insn->is_real () || !tombstone_insn_p (insn))
+      if (!insn->is_real ()
+	  || !bitmap_bit_p (&m_tombstone_bitmap, insn->uid ()))
 	{
 	  insn = next;
 	  continue;
 	}
 
       auto def = memory_access (insn->defs ());
-      auto set = dyn_cast <set_info *> (def);
+      auto set = dyn_cast<set_info *> (def);
       if (set && set->has_any_uses ())
 	{
 	  def_info *prev_def = def->prev_def ();
-	  auto prev_set = dyn_cast <set_info *> (prev_def);
+	  auto prev_set = dyn_cast<set_info *> (prev_def);
 	  if (!prev_set)
-	    gcc_unreachable (); // TODO: handle this if needed.
+	    gcc_unreachable ();
 
 	  while (set->first_use ())
 	    crtl->ssa->reparent_use (set->first_use (), prev_set);
@@ -2462,6 +2488,8 @@ ldp_fusion_destroy ()
   crtl->ssa = nullptr;
 }
 
+// Given a load pair insn in PATTERN, unpack the insn, storing
+// the registers in REGS and returning the mem.
 static rtx
 aarch64_destructure_load_pair (rtx regs[2], rtx pattern)
 {
@@ -2486,6 +2514,8 @@ aarch64_destructure_load_pair (rtx regs[2], rtx pattern)
   return mem;
 }
 
+// Given a store pair insn in PATTERN, unpack the insn, storing
+// the register operands in REGS, and returning the mem.
 static rtx
 aarch64_destructure_store_pair (rtx regs[2], rtx pattern)
 {
@@ -2497,6 +2527,13 @@ aarch64_destructure_store_pair (rtx regs[2], rtx pattern)
   return mem;
 }
 
+// Given a pair mem in PAIR_MEM, register operands in REGS, and an rtx
+// representing the effect of writeback on the base register in WB_EFFECT,
+// return an insn representing a writeback variant of this pair.
+// LOAD_P is true iff the pair is a load.
+//
+// This is used when promoting existing non-writeback pairs to writeback
+// variants.
 static rtx
 aarch64_gen_writeback_pair (rtx wb_effect, rtx pair_mem, rtx regs[2],
 			    bool load_p)
@@ -2587,7 +2624,7 @@ try_promote_writeback (insn_info *insn)
   def_info *add_def;
   const insn_range_info pair_range (insn->prev_nondebug_insn ());
   insn_info *insns[2] = { nullptr, insn };
-  insn_info *trailing_add = find_trailing_add (insns, pair_range, &wb_effect,
+  insn_info *trailing_add = find_trailing_add (insns, pair_range, 0, &wb_effect,
 					       &add_def, base_def, offset,
 					       access_size);
   if (!trailing_add)
@@ -2620,6 +2657,10 @@ try_promote_writeback (insn_info *insn)
   crtl->ssa->change_insns (changes);
 }
 
+// Main function for the pass.  Iterate over the insns in BB looking
+// for load/store candidates.  If running after RA, also try and promote
+// non-writeback pairs to use writeback addressing.  Then try to fuse
+// candidates into pairs.
 void ldp_fusion_bb (bb_info *bb)
 {
   const bool track_loads
@@ -2723,5 +2764,3 @@ make_pass_ldp_fusion (gcc::context *ctx)
 {
   return new pass_ldp_fusion (ctx);
 }
-
-#include "gt-aarch64-ldp-fusion.h"
diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
index a69c37ce33b..116ec1892dc 100644
--- a/gcc/config/aarch64/aarch64.opt
+++ b/gcc/config/aarch64/aarch64.opt
@@ -377,7 +377,7 @@ Limit on number of alias checks performed when attempting to form an ldp/stp.
 
 -param=aarch64-ldp-writeback=
 Target Joined UInteger Var(aarch64_ldp_writeback) Init(2) IntegerRange(0,2) Param
-Param to control which wirteback opportunities we try to handle in the
+Param to control which writeback opportunities we try to handle in the
 load/store pair fusion pass.  A value of zero disables writeback
 handling.  One means we try to form pairs involving one or more existing
 individual writeback accesses where possible.  A value of two means we
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 2b51ff304f6..fd75a4e28d7 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -798,7 +798,7 @@ Objective-C and Objective-C++ Dialects}.
 -moverride=@var{string}  -mverbose-cost-dump
 -mstack-protector-guard=@var{guard} -mstack-protector-guard-reg=@var{sysreg}
 -mstack-protector-guard-offset=@var{offset} -mtrack-speculation
--moutline-atomics }
+-moutline-atomics -mearly-ldp-fusion -mlate-ldp-fusion}
 
 @emph{Adapteva Epiphany Options}
 @gccoptlist{-mhalf-reg-file  -mprefer-short-insn-regs
@@ -16738,6 +16738,20 @@ With @option{--param=aarch64-stp-policy=never}, do not emit stp.
 With @option{--param=aarch64-stp-policy=aligned}, emit stp only if the
 source pointer is aligned to at least double the alignment of the type.
 
+@item aarch64-ldp-alias-check-limit
+Limit on the number of alias checks performed by the AArch64 load/store pair
+fusion pass when attempting to form an ldp/stp.  Higher values make the pass
+more aggressive at re-ordering loads over stores, at the expense of increased
+compile time.
+
+@item aarch64-ldp-writeback
+Param to control which writeback opportunities we try to handle in the AArch64
+load/store pair fusion pass.  A value of zero disables writeback handling.  One
+means we try to form pairs involving one or more existing individual writeback
+accesses where possible.  A value of two means we also try to opportunistically
+form writeback opportunities by folding in trailing destructive updates of the
+base register used by a pair.
+
 @item aarch64-loop-vect-issue-rate-niters
 The tuning for some AArch64 CPUs tries to take both latencies and issue
 rates into account when deciding whether a loop should be vectorized
@@ -21096,6 +21110,16 @@ Enable compiler hardening against straight line speculation (SLS).
 In addition, @samp{-mharden-sls=all} enables all SLS hardening while
 @samp{-mharden-sls=none} disables all SLS hardening.
 
+@opindex mearly-ldp-fusion
+@item -mearly-ldp-fusion
+Enable the copy of the AArch64 load/store pair fusion pass that runs before
+register allocation.  Enabled by default at @samp{-O} and above.
+
+@opindex mlate-ldp-fusion
+@item -mlate-ldp-fusion
+Enable the copy of the AArch64 load/store pair fusion pass that runs after
+register allocation.  Enabled by default at @samp{-O} and above.
+
 @opindex msve-vector-bits
 @item -msve-vector-bits=@var{bits}
 Specify the number of bits in an SVE vector register.  This option only has

Patch

diff --git a/gcc/config.gcc b/gcc/config.gcc
index c1460ca354e..8b7f6b20309 100644
--- a/gcc/config.gcc
+++ b/gcc/config.gcc
@@ -349,8 +349,8 @@  aarch64*-*-*)
 	c_target_objs="aarch64-c.o"
 	cxx_target_objs="aarch64-c.o"
 	d_target_objs="aarch64-d.o"
-	extra_objs="aarch64-builtins.o aarch-common.o aarch64-sve-builtins.o aarch64-sve-builtins-shapes.o aarch64-sve-builtins-base.o aarch64-sve-builtins-sve2.o cortex-a57-fma-steering.o aarch64-speculation.o falkor-tag-collision-avoidance.o aarch-bti-insert.o aarch64-cc-fusion.o"
-	target_gtfiles="\$(srcdir)/config/aarch64/aarch64-builtins.cc \$(srcdir)/config/aarch64/aarch64-sve-builtins.h \$(srcdir)/config/aarch64/aarch64-sve-builtins.cc"
+	extra_objs="aarch64-builtins.o aarch-common.o aarch64-sve-builtins.o aarch64-sve-builtins-shapes.o aarch64-sve-builtins-base.o aarch64-sve-builtins-sve2.o cortex-a57-fma-steering.o aarch64-speculation.o falkor-tag-collision-avoidance.o aarch-bti-insert.o aarch64-cc-fusion.o aarch64-ldp-fusion.o"
+	target_gtfiles="\$(srcdir)/config/aarch64/aarch64-builtins.cc \$(srcdir)/config/aarch64/aarch64-sve-builtins.h \$(srcdir)/config/aarch64/aarch64-sve-builtins.cc \$(srcdir)/config/aarch64/aarch64-ldp-fusion.cc"
 	target_has_targetm_common=yes
 	;;
 alpha*-*-*)
diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc b/gcc/config/aarch64/aarch64-ldp-fusion.cc
new file mode 100644
index 00000000000..6ab18b9216e
--- /dev/null
+++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc
@@ -0,0 +1,2727 @@ 
+// LoadPair fusion optimization pass for AArch64.
+// Copyright (C) 2023 Free Software Foundation, Inc.
+//
+// This file is part of GCC.
+//
+// GCC is free software; you can redistribute it and/or modify it
+// under the terms of the GNU General Public License as published by
+// the Free Software Foundation; either version 3, or (at your option)
+// any later version.
+//
+// GCC is distributed in the hope that it will be useful, but
+// WITHOUT ANY WARRANTY; without even the implied warranty of
+// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+// General Public License for more details.
+//
+// You should have received a copy of the GNU General Public License
+// along with GCC; see the file COPYING3.  If not see
+// <http://www.gnu.org/licenses/>.
+
+#define INCLUDE_ALGORITHM
+#define INCLUDE_FUNCTIONAL
+#define INCLUDE_LIST
+#define INCLUDE_TYPE_TRAITS
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "backend.h"
+#include "rtl.h"
+#include "df.h"
+#include "rtl-ssa.h"
+#include "cfgcleanup.h"
+#include "tree-pass.h"
+#include "ordered-hash-map.h"
+#include "tree-dfa.h"
+#include "fold-const.h"
+#include "tree-hash-traits.h"
+#include "print-tree.h"
+#include "insn-attr.h"
+
+using namespace rtl_ssa;
+
+enum
+{
+  LDP_IMM_BITS = 7,
+  LDP_IMM_MASK = (1 << LDP_IMM_BITS) - 1,
+  LDP_IMM_SIGN_BIT = (1 << (LDP_IMM_BITS - 1)),
+  LDP_MAX_IMM = LDP_IMM_SIGN_BIT - 1,
+  LDP_MIN_IMM = -LDP_MAX_IMM - 1,
+};
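
For reference, this is a signed 7-bit immediate in units of the access size,
i.e. -64..63 after scaling.  Below is a small standalone sketch of the
resulting byte-offset check; the scaling by the access size is an assumption
based on the ldp/stp immediate encodings, and the constants simply mirror the
enum above:

#include <cassert>

constexpr int ldp_imm_bits = 7;
constexpr long ldp_max_imm = (1L << (ldp_imm_bits - 1)) - 1;  // 63
constexpr long ldp_min_imm = -ldp_max_imm - 1;                // -64

// Does byte offset OFF fit the scaled immediate field for a pair whose
// individual accesses are ACCESS_SIZE bytes?
static bool
ldp_offset_in_range_p (long off, long access_size)
{
  if (off % access_size != 0)
    return false;
  long imm = off / access_size;
  return imm >= ldp_min_imm && imm <= ldp_max_imm;
}

int main ()
{
  // For 8-byte accesses the representable byte offsets are -512 .. 504.
  assert (ldp_offset_in_range_p (504, 8));
  assert (ldp_offset_in_range_p (-512, 8));
  assert (!ldp_offset_in_range_p (512, 8));
  assert (!ldp_offset_in_range_p (12, 8));  // Not a multiple of the size.
  return 0;
}
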
+
+// We pack these fields (load_p, fpsimd_p, and size) into an integer
+// (LFS) which we use as part of the key into the main hash tables.
+//
+// The idea is that we group candidates together only if they agree on
+// the fields below.  Candidates that disagree on any of these
+// properties shouldn't be merged together.
+struct lfs_fields
+{
+  bool load_p;
+  bool fpsimd_p;
+  unsigned size;
+};
+
+using insn_list_t = std::list <insn_info *>;
+using insn_iter_t = insn_list_t::iterator;
+
+// Information about the accesses at a given offset from a particular
+// base.  Stored in an access_group, see below.
+struct access_record
+{
+  poly_int64 offset;
+  std::list<insn_info *> cand_insns;
+  std::list<access_record>::iterator place;
+
+  access_record (poly_int64 off) : offset (off) {}
+};
+
+// A group of accesses where adjacent accesses could be ldp/stp
+// candidates.  The splay tree supports efficient insertion,
+// while the list supports efficient iteration.
+struct access_group
+{
+  splay_tree <access_record *> tree;
+  std::list<access_record> list;
+
+  template<typename Alloc>
+  inline void track (Alloc node_alloc, poly_int64 offset, insn_info *insn);
+};
+
+// Information about a potential base candidate, used in try_fuse_pair.
+// There may be zero, one, or two viable RTL bases for a given pair.
+struct base_cand
+{
+  def_info *m_def;
+
+  // FROM_INSN is -1 if the base candidate is already shared by both
+  // candidate insns.  Otherwise it holds the index of the insn from
+  // which the base originated.
+  int from_insn;
+
+  // Initially: dataflow hazards that arise if we choose this base as
+  // the common base register for the pair.
+  //
+  // Later these get narrowed, taking alias hazards into account.
+  insn_info *hazards[2];
+
+  base_cand (def_info *def, int insn)
+    : m_def (def), from_insn (insn), hazards {nullptr, nullptr} {}
+
+  base_cand (def_info *def) : base_cand (def, -1) {}
+
+  bool viable () const
+  {
+    return !hazards[0] || !hazards[1] || (*hazards[0] > *hazards[1]);
+  }
+};
+
+// Information about an alternate base.  For a def_info D, it may
+// instead be expressed as D = BASE + OFFSET.
+struct alt_base
+{
+  def_info *base;
+  poly_int64 offset;
+};
+
+// State used by the pass for a given basic block.
+struct ldp_bb_info
+{
+  using def_hash = nofree_ptr_hash <def_info>;
+  using expr_key_t = pair_hash <tree_operand_hash, int_hash <int, -1, -2>>;
+  using def_key_t = pair_hash <def_hash, int_hash <int, -1, -2>>;
+
+  // Map of <tree base, LFS> -> access_group.
+  ordered_hash_map <expr_key_t, access_group> expr_map;
+
+  // Map of <RTL-SSA def_info *, LFS> -> access_group.
+  ordered_hash_map <def_key_t, access_group> def_map;
+
+  // Given the def_info for an RTL base register, express it as an offset from
+  // some canonical base instead.
+  //
+  // Canonicalizing bases in this way allows us to identify adjacent accesses
+  // even if they see different base register defs.
+  hash_map <def_hash, alt_base> canon_base_map;
+
+  static const size_t obstack_alignment = sizeof (void *);
+  bb_info *m_bb;
+
+  ldp_bb_info (bb_info *bb) : m_bb (bb), m_emitted_tombstone (false)
+  {
+    obstack_specify_allocation (&m_obstack, OBSTACK_CHUNK_SIZE,
+				obstack_alignment, obstack_chunk_alloc,
+				obstack_chunk_free);
+  }
+  ~ldp_bb_info ()
+  {
+    obstack_free (&m_obstack, nullptr);
+  }
+
+  inline void track_access (insn_info *, bool load, rtx mem);
+  inline void transform ();
+  inline void cleanup_tombstones ();
+
+private:
+  // Did we emit a tombstone insn for this bb?
+  bool m_emitted_tombstone;
+  obstack m_obstack;
+
+  inline splay_tree_node<access_record *> *node_alloc (access_record *);
+
+  template<typename Map>
+  inline void traverse_base_map (Map &map);
+  inline void transform_for_base (int load_size, access_group &group);
+
+  inline bool try_form_pairs (insn_list_t *, insn_list_t *,
+			      bool load_p, unsigned access_size);
+
+  inline bool track_via_mem_expr (insn_info *, rtx mem, lfs_fields lfs);
+};
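
The alt_base/canon_base_map mechanism can be pictured with a toy model:
integer ids stand in for def_info pointers, and each base def produced by a
writeback access is recorded as (older base, offset), so later accesses can be
re-expressed against the earliest equivalent def.  This is only a conceptual
sketch of the idea, not the pass's data structures:

#include <cassert>
#include <map>
#include <utility>

struct alt_base_c { int base; long offset; };

// Map from a base def id to an equivalent (older base, offset) pair.
static std::map<int, alt_base_c> canon_map;

// Express the access (DEF, OFF) relative to the earliest equivalent base.
static std::pair<int, long>
canonicalize (int def, long off)
{
  auto it = canon_map.find (def);
  while (it != canon_map.end ())
    {
      off += it->second.offset;
      def = it->second.base;
      it = canon_map.find (def);
    }
  return { def, off };
}

int main ()
{
  // A post-increment access by 16 defines a new base (def 1) equal to the
  // original base (def 0) plus 16.
  canon_map[1] = { 0, 16 };

  // A later access at [def 1, -8] is therefore really at [def 0, 8], so it
  // is adjacent to an 8-byte access at [def 0, 0].
  auto canon = canonicalize (1, -8);
  assert (canon.first == 0 && canon.second == 8);
  return 0;
}
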
+
+splay_tree_node<access_record *> *
+ldp_bb_info::node_alloc (access_record *access)
+{
+  using T = splay_tree_node<access_record *>;
+  void *addr = obstack_alloc (&m_obstack, sizeof (T));
+  return new (addr) T (access);
+}
+
+// Given a mem MEM, if the address has side effects, return a MEM that accesses
+// the same address but without the side effects.  Otherwise, return
+// MEM unchanged.
+static rtx
+drop_writeback (rtx mem)
+{
+  rtx addr = XEXP (mem, 0);
+
+  if (!side_effects_p (addr))
+    return mem;
+
+  switch (GET_CODE (addr))
+    {
+    case PRE_MODIFY:
+      addr = XEXP (addr, 1);
+      break;
+    case POST_MODIFY:
+    case POST_INC:
+    case POST_DEC:
+      addr = XEXP (addr, 0);
+      break;
+    case PRE_INC:
+    case PRE_DEC:
+    {
+      poly_int64 adjustment = GET_MODE_SIZE (GET_MODE (mem));
+      if (GET_CODE (addr) == PRE_DEC)
+	adjustment *= -1;
+      addr = plus_constant (GET_MODE (addr), XEXP (addr, 0), adjustment);
+      break;
+    }
+    default:
+      gcc_unreachable ();
+    }
+
+  return change_address (mem, GET_MODE (mem), addr);
+}
+
+// Convenience wrapper around strip_offset that can also look
+// through {PRE,POST}_MODIFY.
+static rtx ldp_strip_offset (rtx mem, rtx *modify, poly_int64 *offset)
+{
+  gcc_checking_assert (MEM_P (mem));
+
+  rtx base = strip_offset (XEXP (mem, 0), offset);
+
+  if (side_effects_p (base))
+    *modify = base;
+
+  switch (GET_CODE (base))
+    {
+    case PRE_MODIFY:
+    case POST_MODIFY:
+      base = strip_offset (XEXP (base, 1), offset);
+      gcc_checking_assert (REG_P (base));
+      gcc_checking_assert (rtx_equal_p (XEXP (*modify, 0), base));
+      break;
+    case PRE_INC:
+    case POST_INC:
+      base = XEXP (base, 0);
+      *offset = GET_MODE_SIZE (GET_MODE (mem));
+      gcc_checking_assert (REG_P (base));
+      break;
+    case PRE_DEC:
+    case POST_DEC:
+      base = XEXP (base, 0);
+      *offset = -GET_MODE_SIZE (GET_MODE (mem));
+      gcc_checking_assert (REG_P (base));
+      break;
+
+    default:
+      gcc_checking_assert (!side_effects_p (base));
+    }
+
+  return base;
+}
+
+static bool
+any_pre_modify_p (rtx x)
+{
+  const auto code = GET_CODE (x);
+  return code == PRE_INC || code == PRE_DEC || code == PRE_MODIFY;
+}
+
+static bool
+any_post_modify_p (rtx x)
+{
+  const auto code = GET_CODE (x);
+  return code == POST_INC || code == POST_DEC || code == POST_MODIFY;
+}
+
+static bool
+ldp_operand_mode_ok_p (machine_mode mode)
+{
+  const bool allow_qregs
+    = !(aarch64_tune_params.extra_tuning_flags
+	& AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS);
+
+  if (!aarch64_ldpstp_operand_mode_p (mode))
+    return false;
+
+  const auto size = GET_MODE_SIZE (mode).to_constant ();
+  if (size == 16 && !allow_qregs)
+    return false;
+
+  return reload_completed || mode != E_TImode;
+}
+
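+// Encode FIELDS into a single int suitable for use as part of a hash key:
+// bit 3 is load_p, bit 2 is fpsimd_p, and bits 1:0 hold log2 (size) - 2.
+// For example, a 16-byte FP/SIMD load encodes as 0b1110.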
+static int
+encode_lfs (lfs_fields fields)
+{
+  int size_log2 = exact_log2 (fields.size);
+  gcc_checking_assert (size_log2 >= 2 && size_log2 <= 4);
+  return ((int)fields.load_p << 3)
+    | ((int)fields.fpsimd_p << 2)
+    | (size_log2 - 2);
+}
+
+static lfs_fields
+decode_lfs (int lfs)
+{
+  bool load_p = (lfs & (1 << 3));
+  bool fpsimd_p = (lfs & (1 << 2));
+  unsigned size = 1U << ((lfs & 3) + 2);
+  return { load_p, fpsimd_p, size };
+}
+
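+// Record an access to this group at OFFSET, made by INSN.  ALLOC_NODE is
+// used to allocate a new splay tree node if we aren't already tracking an
+// access at this offset.  The list of accesses is kept sorted by offset.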
+template<typename Alloc>
+void
+access_group::track (Alloc alloc_node, poly_int64 offset, insn_info *insn)
+{
+  auto insert_before = [&](std::list<access_record>::iterator after)
+    {
+      auto it = list.emplace (after, offset);
+      it->cand_insns.push_back (insn);
+      it->place = it;
+      return &*it;
+    };
+
+  if (!list.size ())
+    {
+      auto access = insert_before (list.end ());
+      tree.insert_max_node (alloc_node (access));
+      return;
+    }
+
+  auto compare = [&](splay_tree_node<access_record *> *node)
+    {
+      return compare_sizes_for_sort (offset, node->value ()->offset);
+    };
+  auto result = tree.lookup (compare);
+  splay_tree_node<access_record *> *node = tree.root ();
+  if (result == 0)
+    node->value ()->cand_insns.push_back (insn);
+  else
+    {
+      auto it = node->value ()->place;
+      auto after = (result > 0) ? std::next (it) : it;
+      auto access = insert_before (after);
+      tree.insert_child (node, result > 0, alloc_node (access));
+    }
+}
+
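+// Try to track the access to MEM in INSN via its MEM_EXPR base; this only
+// succeeds if the MEM_EXPR has a known offset from a DECL base.  Return
+// true (and record the access in expr_map) on success.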
+bool
+ldp_bb_info::track_via_mem_expr (insn_info *insn, rtx mem, lfs_fields lfs)
+{
+  if (!MEM_EXPR (mem) || !MEM_OFFSET_KNOWN_P (mem))
+    return false;
+
+  poly_int64 offset;
+  tree base_expr = get_addr_base_and_unit_offset (MEM_EXPR (mem),
+						  &offset);
+  if (!base_expr || !DECL_P (base_expr))
+    return false;
+
+  offset += MEM_OFFSET (mem);
+
+  const machine_mode mem_mode = GET_MODE (mem);
+  const HOST_WIDE_INT mem_size = GET_MODE_SIZE (mem_mode).to_constant ();
+
+  // Punt on misaligned offsets.
+  if (offset.coeffs[0] & (mem_size - 1))
+    return false;
+
+  const auto key = std::make_pair (base_expr, encode_lfs (lfs));
+  access_group &group = expr_map.get_or_insert (key, NULL);
+  auto alloc = [&](access_record *access) { return node_alloc (access); };
+  group.track (alloc, offset, insn);
+
+  if (dump_file)
+    {
+      fprintf (dump_file, "[bb %u] tracking insn %d via ",
+	       m_bb->index (), insn->uid ());
+      print_node_brief (dump_file, "mem expr", base_expr, 0);
+      fprintf (dump_file, " [L=%d FP=%d, %smode, off=",
+	       lfs.load_p, lfs.fpsimd_p, mode_name[mem_mode]);
+      print_dec (offset, dump_file);
+      fprintf (dump_file, "]\n");
+    }
+
+  return true;
+}
+
+// Return true if X is a constant zero operand.  N.B. this matches the
+// {w,x}zr check in aarch64_print_operand, the logic in the predicate
+// aarch64_stp_reg_operand, and the constraints on the pair patterns.
+static bool
+const_zero_op_p (rtx x)
+{
+  return x == CONST0_RTX (GET_MODE (x))
+    || (CONST_DOUBLE_P (x) && aarch64_float_const_zero_rtx_p (x));
+}
+
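+// Main entry point for access tracking: given INSN, a load (LOAD_P) or
+// store of MEM, work out which map should track the access (MEM_EXPR base
+// or RTL base) and record it there, canonicalizing the base register and
+// handling writeback addressing as we go.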
+void
+ldp_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
+{
+  // We can't combine volatile MEMs, so punt on these.
+  if (MEM_VOLATILE_P (mem))
+    return;
+
+  // Ignore writeback accesses if the param says to do so.
+  if (!aarch64_ldp_writeback && side_effects_p (XEXP (mem, 0)))
+    return;
+
+  const machine_mode mem_mode = GET_MODE (mem);
+  if (!ldp_operand_mode_ok_p (mem_mode))
+    return;
+
+  // Note ldp_operand_mode_ok_p already rejected VL modes.
+  const HOST_WIDE_INT mem_size = GET_MODE_SIZE (mem_mode).to_constant ();
+
+  rtx reg_op = XEXP (PATTERN (insn->rtl ()), !load_p);
+
+  // Is this an FP/SIMD access?  Note that constant zero operands
+  // use an integer zero register ({w,x}zr).
+  const bool fpsimd_op_p
+    = GET_MODE_CLASS (mem_mode) != MODE_INT
+      && (load_p || !const_zero_op_p (reg_op));
+
+  // N.B. we only want to segregate FP/SIMD accesses from integer accesses
+  // before RA.
+  const bool fpsimd_bit_p = !reload_completed && fpsimd_op_p;
+  const lfs_fields lfs = { load_p, fpsimd_bit_p, mem_size };
+
+  if (track_via_mem_expr (insn, mem, lfs))
+    return;
+
+  poly_int64 mem_off;
+  rtx modify = NULL_RTX;
+  rtx base = ldp_strip_offset (mem, &modify, &mem_off);
+  if (!REG_P (base))
+    return;
+
+  // Need to calculate two (possibly different) offsets:
+  //  - Offset at which the access occurs.
+  //  - Offset of the new base def.
+  poly_int64 access_off;
+  if (modify && any_post_modify_p (modify))
+    access_off = 0;
+  else
+    access_off = mem_off;
+
+  poly_int64 new_def_off = mem_off;
+
+  // Punt on accesses relative to the eliminable regs: since we don't
+  // know the elimination offset pre-RA, we should postpone forming
+  // pairs on such accesses until after RA.
+  if (!reload_completed
+      && (REGNO (base) == FRAME_POINTER_REGNUM
+	  || REGNO (base) == ARG_POINTER_REGNUM))
+    return;
+
+  // Now need to find def of base register.
+  def_info *base_def;
+  use_info *base_use = find_access (insn->uses (), REGNO (base));
+  gcc_assert (base_use);
+  base_def = base_use->def ();
+  if (!base_def)
+    {
+      if (dump_file)
+	fprintf (dump_file,
+		 "base register (regno %d) of insn %d is undefined\n",
+		 REGNO (base), insn->uid ());
+      return;
+    }
+
+  alt_base *canon_base = canon_base_map.get (base_def);
+  if (canon_base)
+    {
+      // Express this as the combined offset from the canonical base.
+      base_def = canon_base->base;
+      new_def_off += canon_base->offset;
+      access_off += canon_base->offset;
+    }
+
+  if (modify)
+    {
+      auto def = find_access (insn->defs (), REGNO (base));
+      gcc_assert (def);
+
+      // Record that DEF = BASE_DEF + MEM_OFF.
+      if (dump_file)
+	{
+	  pretty_printer pp;
+	  pp_access (&pp, def, 0);
+	  pp_string (&pp, " = ");
+	  pp_access (&pp, base_def, 0);
+	  fprintf (dump_file, "[bb %u] recording %s + ",
+		   m_bb->index (), pp_formatted_text (&pp));
+	  print_dec (new_def_off, dump_file);
+	  fprintf (dump_file, "\n");
+	}
+
+      alt_base base_rec { base_def, new_def_off };
+      if (canon_base_map.put (def, base_rec))
+	gcc_unreachable (); // Base defs should be unique.
+    }
+
+  // Punt on misaligned offsets.
+  if (mem_off.coeffs[0] & (mem_size - 1))
+    return;
+
+  const auto key = std::make_pair (base_def, encode_lfs (lfs));
+  access_group &group = def_map.get_or_insert (key, NULL);
+  auto alloc = [&](access_record *access) { return node_alloc (access); };
+  group.track (alloc, access_off, insn);
+
+  if (dump_file)
+    {
+      pretty_printer pp;
+      pp_access (&pp, base_def, 0);
+
+      fprintf (dump_file, "[bb %u] tracking insn %d via %s",
+	       m_bb->index (), insn->uid (), pp_formatted_text (&pp));
+      fprintf (dump_file,
+	       " [L=%d, WB=%d, FP=%d, %smode, off=",
+	       lfs.load_p, !!modify, lfs.fpsimd_p, mode_name[mem_mode]);
+      print_dec (access_off, dump_file);
+      fprintf (dump_file, "]\n");
+    }
+}
+
+// Dummy predicate that never ignores any insns.
+static bool no_ignore (insn_info *) { return false; }
+
+// Return the latest dataflow hazard before INSN.
+//
+// If IGNORE is non-NULL, this points to a sub-rtx which we should
+// ignore for dataflow purposes.  This is needed when considering
+// changing the RTL base of an access discovered through a MEM_EXPR
+// base.
+//
+// N.B. we ignore any defs/uses of memory here as we deal with that
+// separately, making use of alias disambiguation.
+static insn_info *
+latest_hazard_before (insn_info *insn, rtx *ignore,
+		      insn_info *ignore_insn = nullptr)
+{
+  insn_info *result = nullptr;
+
+  // Return true if we registered the hazard.
+  auto hazard = [&](insn_info *h) -> bool
+    {
+      gcc_checking_assert (*h < *insn);
+      if (h == ignore_insn)
+	return false;
+
+      if (!result || *h > *result)
+	result = h;
+
+      return true;
+    };
+
+  rtx pat = PATTERN (insn->rtl ());
+  auto ignore_use = [&](use_info *u)
+    {
+      if (u->is_mem ())
+	return true;
+
+      return !refers_to_regno_p (u->regno (), u->regno () + 1, pat, ignore);
+    };
+
+  // Find defs of uses in INSN (RaW).
+  for (auto use : insn->uses ())
+    if (!ignore_use (use) && use->def ())
+      hazard (use->def ()->insn ());
+
+  // Find previous defs (WaW) or previous uses (WaR) of defs in INSN.
+  for (auto def : insn->defs ())
+    {
+      if (def->is_mem ())
+	continue;
+
+      if (def->prev_def ())
+	{
+	  hazard (def->prev_def ()->insn ()); // WaW
+
+	  auto set = dyn_cast <set_info *> (def->prev_def ());
+	  if (set && set->has_nondebug_insn_uses ())
+	    for (auto use : set->reverse_nondebug_insn_uses ())
+	      if (use->insn () != insn && hazard (use->insn ())) // WaR
+		break;
+	}
+
+      if (!HARD_REGISTER_NUM_P (def->regno ()))
+	continue;
+
+      // Also need to check backwards for call clobbers (WaW).
+      for (auto call_group : def->ebb ()->call_clobbers ())
+	{
+	  if (!call_group->clobbers (def->resource ()))
+	    continue;
+
+	  auto clobber_insn = prev_call_clobbers_ignoring (*call_group,
+							   def->insn (),
+							   no_ignore);
+	  if (clobber_insn)
+	    hazard (clobber_insn);
+	}
+    }
+
+  return result;
+}
+
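+// Return the first dataflow hazard after INSN.
+//
+// If IGNORE is non-NULL, this points to a sub-rtx which we should ignore
+// for dataflow purposes.
+//
+// N.B. as with latest_hazard_before, we ignore any defs/uses of memory
+// here as we deal with that separately, making use of alias disambiguation.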
+static insn_info *
+first_hazard_after (insn_info *insn, rtx *ignore)
+{
+  insn_info *result = nullptr;
+  auto hazard = [insn, &result](insn_info *h)
+    {
+      gcc_checking_assert (*h > *insn);
+      if (!result || *h < *result)
+	result = h;
+    };
+
+  rtx pat = PATTERN (insn->rtl ());
+  auto ignore_use = [&](use_info *u)
+    {
+      if (u->is_mem ())
+	return true;
+
+      return !refers_to_regno_p (u->regno (), u->regno () + 1, pat, ignore);
+    };
+
+  for (auto def : insn->defs ())
+    {
+      if (def->is_mem ())
+	continue;
+
+      if (def->next_def ())
+	hazard (def->next_def ()->insn ()); // WaW
+
+      auto set = dyn_cast <set_info *> (def);
+      if (set && set->has_nondebug_insn_uses ())
+	hazard (set->first_nondebug_insn_use ()->insn ()); // RaW
+
+      if (!HARD_REGISTER_NUM_P (def->regno ()))
+	continue;
+
+      // Also check for call clobbers of this def (WaW).
+      for (auto call_group : def->ebb ()->call_clobbers ())
+	{
+	  if (!call_group->clobbers (def->resource ()))
+	    continue;
+
+	  auto clobber_insn = next_call_clobbers_ignoring (*call_group,
+							   def->insn (),
+							   no_ignore);
+	  if (clobber_insn)
+	    hazard (clobber_insn);
+	}
+    }
+
+  // Find any subsequent defs of uses in INSN (WaR).
+  for (auto use : insn->uses ())
+    {
+      if (ignore_use (use))
+	continue;
+
+      if (use->def ())
+	{
+	  auto def = use->def ()->next_def ();
+	  if (def && def->insn () == insn)
+	    def = def->next_def ();
+
+	  if (def)
+	    hazard (def->insn ());
+	}
+
+      if (!HARD_REGISTER_NUM_P (use->regno ()))
+	continue;
+
+      // Also need to handle call clobbers of our uses (again WaR).
+      //
+      // See restrict_movement_for_uses_ignoring for why we don't
+      // need to check backwards for call clobbers.
+      for (auto call_group : use->ebb ()->call_clobbers ())
+	{
+	  if (!call_group->clobbers (use->resource ()))
+	    continue;
+
+	  auto clobber_insn = next_call_clobbers_ignoring (*call_group,
+							   use->insn (),
+							   no_ignore);
+	  if (clobber_insn)
+	    hazard (clobber_insn);
+	}
+    }
+
+  return result;
+}
+
+enum change_strategy {
+  CHANGE,
+  DELETE,
+  TOMBSTONE,
+};
+
+// Given a change_strategy S, convert it to a string (for output in the
+// dump file).
+static const char *
+cs_to_string (change_strategy s)
+{
+#define C(x) case x: return #x
+  switch (s)
+    {
+      C (CHANGE);
+      C (DELETE);
+      C (TOMBSTONE);
+    }
+#undef C
+  gcc_unreachable ();
+}
+
+// TODO: should this live in RTL-SSA?
+static bool
+ranges_overlap_p (const insn_range_info &r1, const insn_range_info &r2)
+{
+  // If either range is empty, then their intersection is empty.
+  if (!r1 || !r2)
+    return false;
+
+  // When do they not overlap? When one range finishes before the other
+  // starts, i.e. (*r1.last < *r2.first || *r2.last < *r1.first).
+  // Inverting this, we get the below.
+  return *r1.last >= *r2.first && *r2.last >= *r1.first;
+}
+
+// Get the range of insns that DEF feeds.
+static insn_range_info
+get_def_range (def_info *def)
+{
+  insn_info *last = def->next_def ()->insn ()->prev_nondebug_insn ();
+  return { def->insn (), last };
+}
+
+// Given a def (of memory), return the downwards range within which we
+// can safely move this def.
+static insn_range_info
+def_downwards_move_range (def_info *def)
+{
+  auto range = get_def_range (def);
+
+  auto set = dyn_cast <set_info *> (def);
+  if (!set || !set->has_any_uses ())
+    return range;
+
+  auto use = set->first_nondebug_insn_use ();
+  if (use)
+    range = move_earlier_than (range, use->insn ());
+
+  return range;
+}
+
+// Given a def (of memory), return the upwards range within which we can
+// safely move this def.
+static insn_range_info
+def_upwards_move_range (def_info *def)
+{
+  def_info *prev = def->prev_def ();
+  insn_range_info range { prev->insn (), def->insn () };
+
+  auto set = dyn_cast <set_info *> (prev);
+  if (!set || !set->has_any_uses ())
+    return range;
+
+  auto use = set->last_nondebug_insn_use ();
+  if (use)
+    range = move_later_than (range, use->insn ());
+
+  return range;
+}
+
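+// Given a candidate pair of stores FIRST and SECOND and the range
+// MOVE_RANGE within which the stp can be placed, decide (in STRATEGY)
+// whether each input store should be re-purposed to hold the stp, deleted
+// outright, or replaced with a tombstone insn.  Return the def of memory
+// whose insn will be re-purposed to hold the stp, or nullptr if neither
+// store can be re-purposed (in which case the caller needs to create a
+// new insn for the stp).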
+static def_info *
+decide_stp_strategy (change_strategy strategy[2],
+		     insn_info *first,
+		     insn_info *second,
+		     const insn_range_info &move_range)
+{
+  strategy[0] = CHANGE;
+  strategy[1] = DELETE;
+
+  unsigned viable = 0;
+  viable |= move_range.includes (first);
+  viable |= ((unsigned) move_range.includes (second)) << 1;
+
+  def_info * const defs[2] = {
+    memory_access (first->defs ()),
+    memory_access (second->defs ())
+  };
+  if (defs[0] == defs[1])
+    viable = 3; // No intervening store, either is viable.
+
+  if (!(viable & 1)
+      && ranges_overlap_p (move_range, def_downwards_move_range (defs[0])))
+    viable |= 1;
+  if (!(viable & 2)
+      && ranges_overlap_p (move_range, def_upwards_move_range (defs[1])))
+    viable |= 2;
+
+  if (viable == 2)
+    std::swap (strategy[0], strategy[1]);
+  else if (!viable)
+    // Tricky case: need to delete both accesses.
+    strategy[0] = DELETE;
+
+  for (int i = 0; i < 2; i++)
+    {
+      if (strategy[i] != DELETE)
+	continue;
+
+      // See if we can get away without a tombstone.
+      auto set = dyn_cast <set_info *> (defs[i]);
+      if (!set || !set->has_any_uses ())
+	continue; // We can indeed.
+
+      // If both sides are viable for re-purposing, and the other store's
+      // def doesn't have any uses, then we can delete the other store
+      // and re-purpose this store instead.
+      if (viable == 3)
+	{
+	  gcc_assert (strategy[!i] == CHANGE);
+	  auto other_set = dyn_cast <set_info *> (defs[!i]);
+	  if (!other_set || !other_set->has_any_uses ())
+	    {
+	      strategy[i] = CHANGE;
+	      strategy[!i] = DELETE;
+	      break;
+	    }
+	}
+
+      // Alas, we need a tombstone after all.
+      strategy[i] = TOMBSTONE;
+    }
+
+  for (int i = 0; i < 2; i++)
+    if (strategy[i] == CHANGE)
+      return defs[i];
+
+  return nullptr;
+}
+
+static GTY(()) rtx tombstone = NULL_RTX;
+
+// Generate the RTL pattern for a "tombstone"; used temporarily
+// during this pass to replace stores that are marked for deletion
+// where we can't immediately delete the store (e.g. if there are uses
+// hanging off its def of memory).
+//
+// These are deleted at the end of the pass and uses re-parented
+// appropriately at this point.
+static rtx
+gen_tombstone (void)
+{
+  if (!tombstone)
+    {
+      tombstone = gen_rtx_CLOBBER (VOIDmode,
+				   gen_rtx_MEM (BLKmode,
+						gen_rtx_SCRATCH (Pmode)));
+      return tombstone;
+    }
+
+  return copy_rtx (tombstone);
+}
+
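+// Return true if INSN is a tombstone insn, i.e. if its pattern matches
+// that generated by gen_tombstone.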
+static bool
+tombstone_insn_p (insn_info *insn)
+{
+  rtx x = tombstone ? tombstone : gen_tombstone ();
+  return rtx_equal_p (PATTERN (insn->rtl ()), x);
+}
+
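+// Given a pair mode MODE, return a mode suitable for a single operand of
+// the pair, e.g. map V2x8QImode back to DImode.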
+static machine_mode
+aarch64_operand_mode_for_pair_mode (machine_mode mode)
+{
+  switch (mode)
+    {
+    case E_V2x4QImode:
+      return E_SImode;
+    case E_V2x8QImode:
+      return E_DImode;
+    case E_V2x16QImode:
+      return E_V16QImode;
+    default:
+      gcc_unreachable ();
+    }
+}
+
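+// Go through the REG_NOTES list NOTE of one of the candidate insns, adding
+// any notes that we can preserve on the pair insn to RESULT and returning
+// the new list.  Set *EH_REGION if we see a REG_EH_REGION note.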
+static rtx
+filter_notes (rtx note, rtx result, bool *eh_region)
+{
+  for (; note; note = XEXP (note, 1))
+    {
+      switch (REG_NOTE_KIND (note))
+	{
+	  case REG_EQUAL:
+	  case REG_EQUIV:
+	  case REG_DEAD:
+	  case REG_UNUSED:
+	  case REG_NOALIAS:
+	    // These can all be dropped.  For REG_EQU{AL,IV} they
+	    // cannot apply to non-single_set insns, and
+	    // REG_{DEAD,UNUSED} are re-computed by RTL-SSA, see
+	    // rtl-ssa/changes.cc:update_notes.
+	    //
+	    // Similarly, REG_NOALIAS cannot apply to a parallel.
+	  case REG_INC:
+	    // When we form the pair insn, the reg update is implemented
+	    // as just another SET in the parallel, so isn't really an
+	    // auto-increment in the RTL sense, hence we drop the note.
+	    break;
+	  case REG_EH_REGION:
+	    gcc_assert (!*eh_region);
+	    *eh_region = true;
+	    result = alloc_reg_note (REG_EH_REGION, XEXP (note, 0), result);
+	    break;
+	  case REG_CFA_DEF_CFA:
+	  case REG_CFA_OFFSET:
+	  case REG_CFA_RESTORE:
+	    result = alloc_reg_note (REG_NOTE_KIND (note),
+				     copy_rtx (XEXP (note, 0)),
+				     result);
+	    break;
+	  default:
+	    // Unexpected REG_NOTE kind.
+	    gcc_unreachable ();
+	}
+    }
+
+  return result;
+}
+
+// Combine the REG_NOTEs from the two candidate insns I1 and I2 into a
+// list of notes for the pair insn, setting OK to false (so that we punt
+// on forming the pair) if the notes can't be preserved.
+static rtx
+combine_reg_notes (insn_info *i1, insn_info *i2, rtx writeback, bool &ok)
+{
+  if ((writeback && find_reg_note (i1->rtl (), REG_CFA_DEF_CFA, NULL_RTX))
+      || find_reg_note (i2->rtl (), REG_CFA_DEF_CFA, NULL_RTX))
+    {
+      // CFA_DEF_CFA notes apply to the first set of the PARALLEL, so we
+      // can only preserve them in the non-writeback case, and only when
+      // the note is attached to the lower access.
+      if (dump_file)
+	fprintf (dump_file,
+		 "(%d,%d,WB=%d): can't preserve CFA_DEF_CFA note, punting\n",
+		 i1->uid (), i2->uid (), !!writeback);
+      ok = false;
+      return NULL_RTX;
+    }
+
+  bool found_eh_region = false;
+  rtx result = NULL_RTX;
+  result = filter_notes (REG_NOTES (i1->rtl ()), result, &found_eh_region);
+  return filter_notes (REG_NOTES (i2->rtl ()), result, &found_eh_region);
+}
+
+// Given two memory accesses, at least one of which is of a writeback form,
+// extract two non-writeback memory accesses addressed relative to the initial
+// value of the base register, and output these in PATS.  Return an rtx that
+// represents the overall change to the base register.
+static rtx
+extract_writebacks (bool load_p, rtx pats[2], int changed)
+{
+  rtx base_reg = NULL_RTX;
+  poly_int64 current_offset = 0;
+
+  poly_int64 offsets[2];
+
+  for (int i = 0; i < 2; i++)
+    {
+      rtx mem = XEXP (pats[i], load_p);
+      rtx reg = XEXP (pats[i], !load_p);
+
+      rtx modify = NULL_RTX;
+      poly_int64 offset;
+      rtx this_base = ldp_strip_offset (mem, &modify, &offset);
+      gcc_assert (REG_P (this_base));
+      if (base_reg)
+	gcc_assert (rtx_equal_p (base_reg, this_base));
+      else
+	base_reg = this_base;
+
+      // If we changed base for the current insn, then we already
+      // derived the correct mem for this insn from the effective
+      // address of the other access.
+      if (i == changed)
+	{
+	  gcc_checking_assert (!modify);
+	  offsets[i] = offset;
+	  continue;
+	}
+
+      if (modify && any_pre_modify_p (modify))
+	current_offset += offset;
+
+      poly_int64 this_off = current_offset;
+      if (!modify)
+	this_off += offset;
+
+      offsets[i] = this_off;
+      rtx new_mem = change_address (mem, GET_MODE (mem),
+				    plus_constant (GET_MODE (base_reg),
+						   base_reg, this_off));
+      pats[i] = load_p
+	? gen_rtx_SET (reg, new_mem)
+	: gen_rtx_SET (new_mem, reg);
+
+      if (modify && any_post_modify_p (modify))
+	current_offset += offset;
+    }
+
+  if (known_eq (current_offset, 0))
+    return NULL_RTX;
+
+  return gen_rtx_SET (base_reg, plus_constant (GET_MODE (base_reg),
+					       base_reg, current_offset));
+}
+
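+// Try to find a trailing destructive update (add) of the base register
+// that we can fold into the candidate pair to form a writeback pair.
+// INSNS[1] is the insn that will hold the pair (INSNS[0] may be null,
+// e.g. for an existing pair), PAIR_RANGE is the range in which the pair
+// will be placed, BASE_DEF is the def of the base register used by the
+// pair, INITIAL_OFFSET is the offset of the lower access, and ACCESS_SIZE
+// is the size of a single access.  On success, return the add insn,
+// setting *WRITEBACK_EFFECT to its pattern and *ADD_DEF to its def of the
+// base register; otherwise, return nullptr.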
+static insn_info *
+find_trailing_add (insn_info *insns[2],
+		   const insn_range_info &pair_range,
+		   rtx *writeback_effect,
+		   def_info **add_def,
+		   def_info *base_def,
+		   poly_int64 initial_offset,
+		   unsigned access_size)
+{
+  insn_info *pair_insn = insns[1];
+
+  def_info *def = base_def->next_def ();
+
+  while (def
+	 && def->bb () == pair_insn->bb ()
+	 && *(def->insn ()) <= *pair_insn)
+    def = def->next_def ();
+
+  if (!def || def->bb () != pair_insn->bb ())
+    return nullptr;
+
+  insn_info *cand = def->insn ();
+  const auto base_regno = base_def->regno ();
+
+  // If CAND doesn't also use our base register,
+  // it can't destructively update it.
+  if (!find_access (cand->uses (), base_regno))
+    return nullptr;
+
+  auto rti = cand->rtl ();
+
+  if (!INSN_P (rti))
+    return nullptr;
+
+  auto pat = PATTERN (rti);
+  if (GET_CODE (pat) != SET)
+    return nullptr;
+
+  auto dest = XEXP (pat, 0);
+  if (!REG_P (dest) || REGNO (dest) != base_regno)
+    return nullptr;
+
+  poly_int64 offset;
+  rtx rhs_base = strip_offset (XEXP (pat, 1), &offset);
+  if (!REG_P (rhs_base)
+      || REGNO (rhs_base) != base_regno
+      || !offset.is_constant ())
+    return nullptr;
+
+  // If the initial base offset is zero, we can handle any add offset
+  // (post-inc).  Otherwise, we require the offsets to match (pre-inc).
+  if (!known_eq (initial_offset, 0) && !known_eq (offset, initial_offset))
+    return nullptr;
+
+  auto off_hwi = offset.to_constant ();
+
+  if (off_hwi % access_size != 0)
+    return nullptr;
+
+  off_hwi /= access_size;
+
+  if (off_hwi < LDP_MIN_IMM || off_hwi > LDP_MAX_IMM)
+    return nullptr;
+
+  insn_info *pair_dst = pair_range.singleton ();
+  gcc_assert (pair_dst);
+
+  auto dump_prefix = [&]()
+    {
+      if (!insns[0])
+	fprintf (dump_file, "existing pair i%d: ", insns[1]->uid ());
+      else
+	fprintf (dump_file, "  (%d,%d)",
+		 insns[0]->uid (), insns[1]->uid ());
+    };
+
+  insn_info *hazard = latest_hazard_before (cand, nullptr, pair_insn);
+  if (!hazard || *hazard <= *pair_dst)
+    {
+      if (dump_file)
+	{
+	  dump_prefix ();
+	  fprintf (dump_file,
+		   "folding in trailing add (%d) to use writeback form\n",
+		   cand->uid ());
+	}
+
+      *add_def = def;
+      *writeback_effect = copy_rtx (pat);
+      return cand;
+    }
+
+  if (dump_file)
+    {
+      dump_prefix ();
+      fprintf (dump_file,
+	       "can't fold in trailing add (%d), hazard = %d\n",
+	       cand->uid (), hazard->uid ());
+    }
+
+  return nullptr;
+}
+
+// Try and actually fuse the pair given by insns I1 and I2.
+static bool
+fuse_pair (bool load_p,
+	   unsigned access_size,
+	   int writeback,
+	   insn_info *i1,
+	   insn_info *i2,
+	   base_cand &base,
+	   const insn_range_info &move_range,
+	   bool &emitted_tombstone_p)
+{
+  auto attempt = crtl->ssa->new_change_attempt ();
+
+  auto make_change = [&attempt](insn_info *insn)
+    {
+      return crtl->ssa->change_alloc <insn_change> (attempt, insn);
+    };
+  auto make_delete = [&attempt](insn_info *insn)
+    {
+      return crtl->ssa->change_alloc <insn_change> (attempt,
+						    insn,
+						    insn_change::DELETE);
+    };
+
+  // Are we using a tombstone insn for this pair?
+  bool have_tombstone_p = false;
+
+  insn_info *first = (*i1 < *i2) ? i1 : i2;
+  insn_info *second = (first == i1) ? i2 : i1;
+
+  insn_info *insns[2] = { first, second };
+
+  auto_vec <insn_change *> changes;
+  changes.reserve (4);
+
+  rtx pats[2] = {
+    PATTERN (first->rtl ()),
+    PATTERN (second->rtl ())
+  };
+
+  use_array input_uses[2] = { first->uses (), second->uses () };
+  def_array input_defs[2] = { first->defs (), second->defs () };
+
+  int changed_insn = -1;
+  if (base.from_insn != -1)
+    {
+      // If we're not already using a shared base, we need
+      // to re-write one of the accesses to use the base from
+      // the other insn.
+      gcc_checking_assert (base.from_insn == 0 || base.from_insn == 1);
+      changed_insn = !base.from_insn;
+
+      rtx base_pat = pats[base.from_insn];
+      rtx change_pat = pats[changed_insn];
+      rtx base_mem = XEXP (base_pat, load_p);
+      rtx change_mem = XEXP (change_pat, load_p);
+
+      const bool lower_base_p = (insns[base.from_insn] == i1);
+      HOST_WIDE_INT adjust_amt = access_size;
+      if (!lower_base_p)
+	adjust_amt *= -1;
+
+      rtx change_reg = XEXP (change_pat, !load_p);
+      machine_mode mode_for_mem = GET_MODE (change_mem);
+      rtx effective_base = drop_writeback (base_mem);
+      rtx new_mem = adjust_address_nv (effective_base,
+				       mode_for_mem,
+				       adjust_amt);
+      rtx new_set = load_p
+	? gen_rtx_SET (change_reg, new_mem)
+	: gen_rtx_SET (new_mem, change_reg);
+
+      pats[changed_insn] = new_set;
+
+      auto keep_use = [&](use_info *u)
+	{
+	  return refers_to_regno_p (u->regno (), u->regno () + 1,
+				    change_pat, &XEXP (change_pat, load_p));
+	};
+
+      // Drop any uses that only occur in the old address.
+      input_uses[changed_insn] = filter_accesses (attempt,
+						  input_uses[changed_insn],
+						  keep_use);
+    }
+
+  rtx writeback_effect = NULL_RTX;
+  if (writeback)
+    writeback_effect = extract_writebacks (load_p, pats, changed_insn);
+
+  const auto base_regno = base.m_def->regno ();
+
+  if (base.from_insn == -1 && (writeback & 1))
+    {
+      // If the first of the candidate insns had a writeback form, we'll need to
+      // drop the use of the updated base register from the second insn's uses.
+      //
+      // N.B. we needn't worry about the base register occurring as a store
+      // operand, as we checked that there was no non-address true dependence
+      // between the insns in try_fuse_pair.
+      gcc_checking_assert (find_access (input_uses[1], base_regno));
+      input_uses[1] = check_remove_regno_access (attempt,
+						 input_uses[1],
+						 base_regno);
+    }
+
+  // Go through and drop uses that only occur in register notes,
+  // as we won't be preserving those.
+  for (int i = 0; i < 2; i++)
+    {
+      auto rti = insns[i]->rtl ();
+      if (!REG_NOTES (rti))
+	continue;
+
+      input_uses[i] = remove_note_accesses (attempt, input_uses[i]);
+    }
+
+  // Edge case: if the first insn is a writeback load and the
+  // second insn is a non-writeback load which transfers into the base
+  // register, then we should drop the writeback altogether as the
+  // update of the base register from the second load should prevail.
+  //
+  // For example:
+  //   ldr x2, [x1], #8
+  //   ldr x1, [x1]
+  //   -->
+  //   ldp x2, x1, [x1]
+  if (writeback == 1
+      && load_p
+      && find_access (input_defs[1], base_regno))
+    {
+      if (dump_file)
+	fprintf (dump_file,
+		 "  ldp: i%d has wb but subsequent i%d has non-wb "
+		 "update of base (r%d), dropping wb\n",
+		 insns[0]->uid (), insns[1]->uid (), base_regno);
+      gcc_assert (writeback_effect);
+      writeback_effect = NULL_RTX;
+    }
+
+  // If both of the original insns had a writeback form, then we should drop the
+  // first def.  The second def could well have uses, but the first def should
+  // only be used by the second insn (and we dropped that use above).
+  if (writeback == 3)
+    input_defs[0] = check_remove_regno_access (attempt,
+					       input_defs[0],
+					       base_regno);
+
+  // So far the patterns have been in instruction order; now we want them
+  // in offset order.
+  if (i1 != first)
+    std::swap (pats[0], pats[1]);
+
+  poly_int64 offsets[2];
+  for (int i = 0; i < 2; i++)
+    {
+      rtx mem = XEXP (pats[i], load_p);
+      gcc_checking_assert (MEM_P (mem));
+      rtx base = strip_offset (XEXP (mem, 0), offsets + i);
+      gcc_checking_assert (REG_P (base));
+      gcc_checking_assert (base_regno == REGNO (base));
+    }
+
+  insn_info *trailing_add = nullptr;
+  if (aarch64_ldp_writeback > 1 && !writeback_effect)
+    {
+      def_info *add_def;
+      trailing_add = find_trailing_add (insns, move_range, &writeback_effect,
+					&add_def, base.m_def, offsets[0],
+					access_size);
+      if (trailing_add && !writeback)
+	{
+	  // If there was no writeback to start with, we need to preserve the
+	  // def of the base register from the add insn.
+	  input_defs[0] = insert_access (attempt, add_def, input_defs[0]);
+	  gcc_assert (input_defs[0].is_valid ());
+	}
+    }
+
+  // If either of the original insns had writeback, but the resulting
+  // pair insn does not (can happen e.g. in the ldp edge case above, or
+  // if the writeback effects cancel out), then drop the def(s) of the
+  // base register as appropriate.
+  if (!writeback_effect)
+    for (int i = 0; i < 2; i++)
+      if (writeback & (1 << i))
+	input_defs[i] = check_remove_regno_access (attempt,
+						   input_defs[i],
+						   base_regno);
+
+  // Now that we know what base mem we're going to use, check if it's OK
+  // with the ldp/stp policy.
+  rtx first_mem = XEXP (pats[0], load_p);
+  if (!aarch64_mem_ok_with_ldpstp_policy_model (first_mem,
+						load_p,
+						GET_MODE (first_mem)))
+    {
+      if (dump_file)
+	fprintf (dump_file, "punting on pair (%d,%d), ldp/stp policy says no\n",
+		 i1->uid (), i2->uid ());
+      return false;
+    }
+
+  bool reg_notes_ok = true;
+  rtx reg_notes = combine_reg_notes (i1, i2, writeback_effect, reg_notes_ok);
+  if (!reg_notes_ok)
+    return false;
+
+  rtx pair_pat;
+  if (writeback_effect)
+    {
+      auto patvec = gen_rtvec (3, writeback_effect, pats[0], pats[1]);
+      pair_pat = gen_rtx_PARALLEL (VOIDmode, patvec);
+    }
+  else if (load_p)
+    pair_pat = aarch64_gen_load_pair (XEXP (pats[0], 0),
+				      XEXP (pats[1], 0),
+				      XEXP (pats[0], 1));
+  else
+    pair_pat = aarch64_gen_store_pair (XEXP (pats[0], 0),
+				       XEXP (pats[0], 1),
+				       XEXP (pats[1], 1));
+
+  insn_change *pair_change = nullptr;
+  auto set_pair_pat = [pair_pat,reg_notes](insn_change *change) {
+      rtx_insn *rti = change->insn ()->rtl ();
+      gcc_assert (validate_unshare_change (rti, &PATTERN (rti), pair_pat,
+					   true));
+      gcc_assert (validate_change (rti, &REG_NOTES (rti),
+				   reg_notes, true));
+  };
+
+  if (load_p)
+    {
+      changes.quick_push (make_delete (first));
+      pair_change = make_change (second);
+      changes.quick_push (pair_change);
+
+      pair_change->move_range = move_range;
+      pair_change->new_defs = merge_access_arrays (attempt,
+						   input_defs[0],
+						   input_defs[1]);
+      gcc_assert (pair_change->new_defs.is_valid ());
+
+      pair_change->new_uses
+	= merge_access_arrays (attempt,
+			       drop_memory_access (input_uses[0]),
+			       drop_memory_access (input_uses[1]));
+      gcc_assert (pair_change->new_uses.is_valid ());
+      set_pair_pat (pair_change);
+    }
+  else
+    {
+      change_strategy strategy[2];
+      def_info *stp_def = decide_stp_strategy (strategy, first, second,
+					       move_range);
+      if (dump_file)
+	{
+	  auto cs1 = cs_to_string (strategy[0]);
+	  auto cs2 = cs_to_string (strategy[1]);
+	  fprintf (dump_file,
+		   "  stp strategy for candidate insns (%d,%d): (%s,%s)\n",
+		   insns[0]->uid (), insns[1]->uid (), cs1, cs2);
+	  if (stp_def)
+	    fprintf (dump_file,
+		     "  re-using mem def from insn %d\n",
+		     stp_def->insn ()->uid ());
+	}
+
+      insn_change *change;
+      for (int i = 0; i < 2; i++)
+	{
+	  switch (strategy[i])
+	    {
+	    case DELETE:
+	      changes.quick_push (make_delete (insns[i]));
+	      break;
+	    case TOMBSTONE:
+	    case CHANGE:
+	      change = make_change (insns[i]);
+	      if (strategy[i] == CHANGE)
+		{
+		  set_pair_pat (change);
+		  change->new_uses = merge_access_arrays (attempt,
+							  input_uses[0],
+							  input_uses[1]);
+		  auto d1 = drop_memory_access (input_defs[0]);
+		  auto d2 = drop_memory_access (input_defs[1]);
+		  change->new_defs = merge_access_arrays (attempt, d1, d2);
+		  gcc_assert (change->new_defs.is_valid ());
+		  gcc_assert (stp_def);
+		  change->new_defs = insert_access (attempt,
+						    stp_def,
+						    change->new_defs);
+		  gcc_assert (change->new_defs.is_valid ());
+		  change->move_range = move_range;
+		  pair_change = change;
+		}
+	      else
+		{
+		  rtx_insn *rti = insns[i]->rtl ();
+		  gcc_assert (validate_change (rti, &PATTERN (rti),
+					       gen_tombstone (), true));
+		  gcc_assert (validate_change (rti, &REG_NOTES (rti),
+					       NULL_RTX, true));
+		  change->new_uses = use_array (nullptr, 0);
+		  have_tombstone_p = true;
+		}
+	      gcc_assert (change->new_uses.is_valid ());
+	      changes.quick_push (change);
+	      break;
+	    }
+	}
+
+      if (!stp_def)
+	{
+	  // Tricky case.  Cannot re-purpose existing insns for stp.
+	  // Need to insert new insn.
+	  if (dump_file)
+	    fprintf (dump_file,
+		     "  stp fusion: cannot re-purpose candidate stores\n");
+
+	  auto new_insn = crtl->ssa->create_insn (attempt, INSN, pair_pat);
+	  change = make_change (new_insn);
+	  change->move_range = move_range;
+	  change->new_uses = merge_access_arrays (attempt,
+						  input_uses[0],
+						  input_uses[1]);
+	  gcc_assert (change->new_uses.is_valid ());
+
+	  auto d1 = drop_memory_access (input_defs[0]);
+	  auto d2 = drop_memory_access (input_defs[1]);
+	  change->new_defs = merge_access_arrays (attempt, d1, d2);
+	  gcc_assert (change->new_defs.is_valid ());
+
+	  auto new_set = crtl->ssa->create_set (attempt, new_insn, memory);
+	  change->new_defs = insert_access (attempt, new_set,
+					    change->new_defs);
+	  gcc_assert (change->new_defs.is_valid ());
+	  changes.safe_insert (1, change);
+	  pair_change = change;
+	}
+    }
+
+  if (trailing_add)
+    changes.quick_push (make_delete (trailing_add));
+
+  auto n_changes = changes.length ();
+  gcc_checking_assert (n_changes >= 2 && n_changes <= 4);
+
+  auto is_changing = insn_is_changing (changes);
+  for (unsigned i = 0; i < n_changes; i++)
+    gcc_assert (rtl_ssa::restrict_movement_ignoring (*changes[i], is_changing));
+
+  // Check the pair pattern is recog'd.
+  if (!rtl_ssa::recog_ignoring (attempt, *pair_change, is_changing))
+    {
+      if (dump_file)
+	fprintf (dump_file, "  failed to form pair, recog failed\n");
+
+      // Free any reg notes we allocated.
+      while (reg_notes)
+	{
+	  rtx next = XEXP (reg_notes, 1);
+	  free_EXPR_LIST_node (reg_notes);
+	  reg_notes = next;
+	}
+      cancel_changes (0);
+      return false;
+    }
+
+  gcc_assert (crtl->ssa->verify_insn_changes (changes));
+
+  confirm_change_group ();
+  crtl->ssa->change_insns (changes);
+  emitted_tombstone_p |= have_tombstone_p;
+  return true;
+}
+
+// Return true if STORE_INSN may modify mem rtx MEM.  Make sure we keep
+// within our BUDGET for alias analysis.
+static bool
+store_modifies_mem_p (rtx mem, insn_info *store_insn, int &budget)
+{
+  if (tombstone_insn_p (store_insn))
+    return false;
+
+  if (!budget)
+    {
+      if (dump_file)
+	{
+	  fprintf (dump_file,
+		   "exceeded budget, assuming store %d aliases with mem ",
+		   store_insn->uid ());
+	  print_simple_rtl (dump_file, mem);
+	  fprintf (dump_file, "\n");
+	}
+
+      return true;
+    }
+
+  budget--;
+  return memory_modified_in_insn_p (mem, store_insn->rtl ());
+}
+
+// Return true if LOAD may be modified by STORE.  Make sure we keep
+// within our BUDGET for alias analysis.
+static bool
+load_modified_by_store_p (insn_info *load,
+			  insn_info *store,
+			  int &budget)
+{
+  gcc_checking_assert (budget >= 0);
+
+  if (!budget)
+    {
+      if (dump_file)
+	{
+	  fprintf (dump_file,
+		   "exceeded budget, assuming load %d aliases with store %d\n",
+		   load->uid (), store->uid ());
+	}
+      return true;
+    }
+
+  // It isn't safe to re-order stores over calls (note that the "load"
+  // insn here may in fact be a call that uses memory).
+  if (CALL_P (load->rtl ()))
+    return true;
+
+  budget--;
+  return modified_in_p (PATTERN (load->rtl ()), store->rtl ());
+}
+
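+// Abstract interface used by do_alias_analysis: a walker enumerates insns
+// that might alias one of the candidate accesses (walking forwards or
+// backwards from the candidate pair), and conflict_p reports whether the
+// current insn conflicts with the access we care about.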
+struct alias_walker
+{
+  virtual insn_info *insn () const = 0;
+  virtual bool valid () const = 0;
+  virtual bool conflict_p (int &budget) const = 0;
+  virtual void advance () = 0;
+};
+
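+// Implementation of the alias_walker interface that walks over store insns
+// (defs of memory), checking whether each store might modify the candidate
+// mem CAND_MEM.  REVERSE selects the walking direction; LIMIT bounds the
+// walk.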
+template<bool reverse>
+class store_walker : public alias_walker
+{
+  using def_iter_t = typename std::conditional <reverse,
+	reverse_def_iterator, def_iterator>::type;
+
+  def_iter_t def_iter;
+  rtx cand_mem;
+  insn_info *limit;
+
+public:
+  store_walker (def_info *mem_def, rtx mem, insn_info *limit_insn) :
+    def_iter (mem_def), cand_mem (mem), limit (limit_insn) {}
+
+  bool valid () const override
+    {
+      if (!*def_iter)
+	return false;
+
+      if (reverse)
+	return *((*def_iter)->insn ()) > *limit;
+      else
+	return *((*def_iter)->insn ()) < *limit;
+    }
+  insn_info *insn () const override { return (*def_iter)->insn (); }
+  void advance () override { def_iter++; }
+  bool conflict_p (int &budget) const override
+  {
+    return store_modifies_mem_p (cand_mem, insn (), budget);
+  }
+};
+
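+// Implementation of the alias_walker interface that walks over load insns
+// (uses hanging off defs of memory), checking whether each load might be
+// modified by the candidate store CAND_STORE.  REVERSE selects the walking
+// direction; LIMIT bounds the walk.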
+template<bool reverse>
+class load_walker : public alias_walker
+{
+  using def_iter_t = typename std::conditional <reverse,
+	reverse_def_iterator, def_iterator>::type;
+  using use_iter_t = typename std::conditional <reverse,
+	reverse_use_iterator, nondebug_insn_use_iterator>::type;
+
+  def_iter_t def_iter;
+  use_iter_t use_iter;
+  insn_info *cand_store;
+  insn_info *limit;
+
+  static use_info *start_use_chain (def_iter_t &def_iter)
+  {
+    set_info *set = nullptr;
+    for (; *def_iter; def_iter++)
+      {
+	set = dyn_cast <set_info *> (*def_iter);
+	if (!set)
+	  continue;
+
+	use_info *use = reverse
+	  ? set->last_nondebug_insn_use ()
+	  : set->first_nondebug_insn_use ();
+
+	if (use)
+	  return use;
+      }
+
+    return nullptr;
+  }
+
+public:
+  void advance () override
+  {
+    use_iter++;
+    if (*use_iter)
+      return;
+    def_iter++;
+    use_iter = start_use_chain (def_iter);
+  }
+
+  insn_info *insn () const override
+  {
+    gcc_checking_assert (*use_iter);
+    return (*use_iter)->insn ();
+  }
+
+  bool valid () const override
+  {
+    if (!*use_iter)
+      return false;
+
+    if (reverse)
+      return *((*use_iter)->insn ()) > *limit;
+    else
+      return *((*use_iter)->insn ()) < *limit;
+  }
+
+  bool conflict_p (int &budget) const override
+  {
+    return load_modified_by_store_p (insn (), cand_store, budget);
+  }
+
+  load_walker (def_info *def, insn_info *store, insn_info *limit_insn)
+    : def_iter (def), use_iter (start_use_chain (def_iter)),
+      cand_store (store), limit (limit_insn) {}
+};
+
+// Process our alias_walkers in a round-robin fashion, proceeding until
+// nothing more can be learned from alias analysis.
+//
+// We try to maintain the invariant that if a walker becomes invalid, we
+// set its pointer to null.
+static void
+do_alias_analysis (insn_info *alias_hazards[4],
+		   alias_walker *walkers[4],
+		   bool load_p)
+{
+  const int n_walkers = 2 + (2 * !load_p);
+  int budget = aarch64_ldp_alias_check_limit;
+
+  auto next_walker = [walkers,n_walkers](int current) -> int {
+    for (int j = 1; j <= n_walkers; j++)
+      {
+	int idx = (current + j) % n_walkers;
+	if (walkers[idx])
+	  return idx;
+      }
+    return -1;
+  };
+
+  int i = -1;
+  for (int j = 0; j < n_walkers; j++)
+    {
+      alias_hazards[j] = nullptr;
+      if (!walkers[j])
+	continue;
+
+      if (!walkers[j]->valid ())
+	walkers[j] = nullptr;
+      else if (i == -1)
+	i = j;
+    }
+
+  while (i >= 0)
+    {
+      int insn_i = i % 2;
+      int paired_i = (i & 2) + !insn_i;
+      int pair_fst = (i & 2);
+      int pair_snd = (i & 2) + 1;
+
+      if (walkers[i]->conflict_p (budget))
+	{
+	  alias_hazards[i] = walkers[i]->insn ();
+
+	  // We got an aliasing conflict for this {load,store} walker,
+	  // so we don't need to walk any further.
+	  walkers[i] = nullptr;
+
+	  // If we have a pair of alias conflicts that prevent
+	  // forming the pair, stop.  There's no need to do further
+	  // analysis.
+	  if (alias_hazards[paired_i]
+	      && (*alias_hazards[pair_fst] <= *alias_hazards[pair_snd]))
+	    return;
+
+	  if (!load_p)
+	    {
+	      int other_pair_fst = (pair_fst ? 0 : 2);
+	      int other_paired_i = other_pair_fst + !insn_i;
+
+	      int x_pair_fst = (i == pair_fst) ? i : other_paired_i;
+	      int x_pair_snd = (i == pair_fst) ? other_paired_i : i;
+
+	      // Similarly, handle the case where we have a {load,store}
+	      // or {store,load} alias hazard pair that prevents forming
+	      // the pair.
+	      if (alias_hazards[other_paired_i]
+		  && *alias_hazards[x_pair_fst] <= *alias_hazards[x_pair_snd])
+		return;
+	    }
+	}
+
+      if (walkers[i])
+	{
+	  walkers[i]->advance ();
+
+	  if (!walkers[i]->valid ())
+	    walkers[i] = nullptr;
+	}
+
+      i = next_walker (i);
+    }
+}
+
+// Given the candidate pair of insns INSNS and their mems CAND_MEMS,
+// collect the viable base candidates into BASE_CANDS.  Return an integer
+// in which bit (1 << i) is set if INSNS[i] uses writeback addressing.
+static int
+get_viable_bases (insn_info *insns[2],
+		  vec <base_cand> &base_cands,
+		  rtx cand_mems[2],
+		  unsigned access_size,
+		  bool reversed)
+{
+  // We discovered this pair through a common base.  Need to ensure that
+  // we have a common base register that is live at both locations.
+  def_info *base_defs[2] = {};
+  int writeback = 0;
+  for (int i = 0; i < 2; i++)
+    {
+      const bool is_lower = (i == reversed);
+      poly_int64 poly_off;
+      rtx modify = NULL_RTX;
+      rtx base = ldp_strip_offset (cand_mems[i], &modify, &poly_off);
+      if (modify)
+	writeback |= (1 << i);
+
+      if (!REG_P (base) || !poly_off.is_constant ())
+	continue;
+
+      // Punt on accesses relative to eliminable regs.  Since we don't know the
+      // elimination offset pre-RA, we should postpone forming pairs on such
+      // accesses until after RA.
+      if (!reload_completed
+	  && (REGNO (base) == FRAME_POINTER_REGNUM
+	      || REGNO (base) == ARG_POINTER_REGNUM))
+	continue;
+
+      HOST_WIDE_INT base_off = poly_off.to_constant ();
+
+      // It should be unlikely that we ever punt here, since MEM_EXPR offset
+      // alignment should be a good proxy for register offset alignment.
+      if (base_off % access_size != 0)
+	{
+	  if (dump_file)
+	    fprintf (dump_file,
+		     "base not viable, offset misaligned (insn %d)\n",
+		     insns[i]->uid ());
+	  continue;
+	}
+
+      base_off /= access_size;
+
+      if (!is_lower)
+	base_off--;
+
+      if (base_off < LDP_MIN_IMM || base_off > LDP_MAX_IMM)
+	continue;
+
+      for (auto use : insns[i]->uses ())
+	if (use->is_reg () && use->regno () == REGNO (base))
+	  {
+	    base_defs[i] = use->def ();
+	    break;
+	  }
+    }
+
+  if (!base_defs[0] && !base_defs[1])
+    {
+      if (dump_file)
+	fprintf (dump_file, "no viable base register for pair (%d,%d)\n",
+		 insns[0]->uid (), insns[1]->uid ());
+      return writeback;
+    }
+
+  for (int i = 0; i < 2; i++)
+    if ((writeback & (1 << i)) && !base_defs[i])
+      {
+	if (dump_file)
+	  fprintf (dump_file, "insn %d has writeback but base isn't viable\n",
+		   insns[i]->uid ());
+	return writeback;
+      }
+
+  if (writeback == 3
+      && base_defs[0]->regno () != base_defs[1]->regno ())
+    {
+      if (dump_file)
+	fprintf (dump_file,
+		 "pair (%d,%d): double writeback with distinct regs (%d,%d): "
+		 "punting\n",
+		 insns[0]->uid (), insns[1]->uid (),
+		 base_defs[0]->regno (), base_defs[1]->regno ());
+      return writeback;
+    }
+
+  if (base_defs[0] && base_defs[1]
+      && base_defs[0]->regno () == base_defs[1]->regno ())
+    {
+      // Easy case: insns already share the same base reg.
+      base_cands.quick_push (base_defs[0]);
+      return writeback;
+    }
+
+  // Otherwise, we know that one of the bases must change.
+  //
+  // Note that if there is writeback we must use the writeback base
+  // (we know now there is exactly one).
+  for (int i = 0; i < 2; i++)
+    if (base_defs[i] && (!writeback || (writeback & (1 << i))))
+      base_cands.quick_push (base_cand { base_defs[i], i });
+
+  return writeback;
+}
+
+// Given two adjacent memory accesses of the same size, I1 and I2, try
+// and see if we can merge them into a ldp or stp.
+static bool
+try_fuse_pair (bool load_p,
+	       unsigned access_size,
+	       insn_info *i1,
+	       insn_info *i2,
+	       bool &emitted_tombstone_p)
+{
+  if (dump_file)
+    fprintf (dump_file, "analyzing pair (load=%d): (%d,%d)\n",
+	     load_p, i1->uid (), i2->uid ());
+
+  insn_info *insns[2];
+  bool reversed = false;
+  if (*i1 < *i2)
+    {
+      insns[0] = i1;
+      insns[1] = i2;
+    }
+  else
+    {
+      insns[0] = i2;
+      insns[1] = i1;
+      reversed = true;
+    }
+
+  rtx cand_mems[2];
+  rtx reg_ops[2];
+  rtx pats[2];
+  for (int i = 0; i < 2; i++)
+    {
+      pats[i] = PATTERN (insns[i]->rtl ());
+      cand_mems[i] = XEXP (pats[i], load_p);
+      reg_ops[i] = XEXP (pats[i], !load_p);
+    }
+
+  if (load_p && reg_overlap_mentioned_p (reg_ops[0], reg_ops[1]))
+    {
+      if (dump_file)
+	fprintf (dump_file,
+		 "punting on ldp due to reg conflicts (%d,%d)\n",
+		 insns[0]->uid (), insns[1]->uid ());
+      return false;
+    }
+
+  if (cfun->can_throw_non_call_exceptions
+      && (find_reg_note (insns[0]->rtl (), REG_EH_REGION, NULL_RTX)
+	  || find_reg_note (insns[1]->rtl (), REG_EH_REGION, NULL_RTX))
+      && insn_could_throw_p (insns[0]->rtl ())
+      && insn_could_throw_p (insns[1]->rtl ()))
+    {
+      if (dump_file)
+	fprintf (dump_file,
+		 "can't combine insns with EH side effects (%d,%d)\n",
+		 insns[0]->uid (), insns[1]->uid ());
+      return false;
+    }
+
+  auto_vec <base_cand> base_cands;
+  base_cands.reserve (2);
+
+  int writeback = get_viable_bases (insns, base_cands, cand_mems,
+				    access_size, reversed);
+  if (base_cands.is_empty ())
+    {
+      if (dump_file)
+	fprintf (dump_file, "no viable base for pair (%d,%d)\n",
+		 insns[0]->uid (), insns[1]->uid ());
+      return false;
+    }
+
+  rtx *ignore = &XEXP (pats[1], load_p);
+  for (auto use : insns[1]->uses ())
+    if (!use->is_mem ()
+	&& refers_to_regno_p (use->regno (), use->regno () + 1, pats[1], ignore)
+	&& use->def () && use->def ()->insn () == insns[0])
+      {
+	// N.B. we allow a true dependence on the base address, as this
+	// happens in the case of auto-inc accesses.  Consider a post-increment
+	// load followed by a regular indexed load, for example.
+	if (dump_file)
+	  fprintf (dump_file,
+		   "%d has non-address true dependence on %d, rejecting pair\n",
+		   insns[1]->uid (), insns[0]->uid ());
+	return false;
+      }
+
+  unsigned i = 0;
+  while (i < base_cands.length ())
+    {
+      base_cand &cand = base_cands[i];
+
+      rtx *ignore[2] = {};
+      for (int j = 0; j < 2; j++)
+	if (cand.from_insn == !j)
+	  ignore[j] = &XEXP (cand_mems[j], 0);
+
+      insn_info *h = first_hazard_after (insns[0], ignore[0]);
+      if (h && *h <= *insns[1])
+	cand.hazards[0] = h;
+
+      h = latest_hazard_before (insns[1], ignore[1]);
+      if (h && *h >= *insns[0])
+	cand.hazards[1] = h;
+
+      if (!cand.viable ())
+	{
+	  if (dump_file)
+	    fprintf (dump_file,
+		     "pair (%d,%d): rejecting base %d due to dataflow "
+		     "hazards (%d,%d)\n",
+		     insns[0]->uid (),
+		     insns[1]->uid (),
+		     cand.m_def->regno (),
+		     cand.hazards[0]->uid (),
+		     cand.hazards[1]->uid ());
+
+	  base_cands.ordered_remove (i);
+	}
+      else
+	i++;
+    }
+
+  if (base_cands.is_empty ())
+    {
+      if (dump_file)
+	fprintf (dump_file,
+		 "can't form pair (%d,%d) due to dataflow hazards\n",
+		 insns[0]->uid (), insns[1]->uid ());
+      return false;
+    }
+
+  insn_info *alias_hazards[4] = {};
+
+  // First def of memory after the first insn, and last def of memory
+  // before the second insn, respectively.
+  def_info *mem_defs[2] = {};
+  if (load_p)
+    {
+      if (!MEM_READONLY_P (cand_mems[0]))
+	{
+	  mem_defs[0] = memory_access (insns[0]->uses ())->def ();
+	  gcc_checking_assert (mem_defs[0]);
+	  mem_defs[0] = mem_defs[0]->next_def ();
+	}
+      if (!MEM_READONLY_P (cand_mems[1]))
+	{
+	  mem_defs[1] = memory_access (insns[1]->uses ())->def ();
+	  gcc_checking_assert (mem_defs[1]);
+	}
+    }
+  else
+    {
+      mem_defs[0] = memory_access (insns[0]->defs ())->next_def ();
+      mem_defs[1] = memory_access (insns[1]->defs ())->prev_def ();
+      gcc_checking_assert (mem_defs[0]);
+      gcc_checking_assert (mem_defs[1]);
+    }
+
+  store_walker<false> forward_store_walker (mem_defs[0],
+					    cand_mems[0],
+					    insns[1]);
+  store_walker<true> backward_store_walker (mem_defs[1],
+					    cand_mems[1],
+					    insns[0]);
+  alias_walker *walkers[4] = {};
+  if (mem_defs[0])
+    walkers[0] = &forward_store_walker;
+  if (mem_defs[1])
+    walkers[1] = &backward_store_walker;
+
+  if (load_p && (mem_defs[0] || mem_defs[1]))
+    do_alias_analysis (alias_hazards, walkers, load_p);
+  else
+    {
+      // We want to find any loads hanging off the first store.
+      mem_defs[0] = memory_access (insns[0]->defs ());
+      load_walker<false> forward_load_walker (mem_defs[0], insns[0], insns[1]);
+      load_walker<true> backward_load_walker (mem_defs[1], insns[1], insns[0]);
+      walkers[2] = &forward_load_walker;
+      walkers[3] = &backward_load_walker;
+      do_alias_analysis (alias_hazards, walkers, load_p);
+      // Now consolidate hazards back down.
+      if (alias_hazards[2]
+	  && (!alias_hazards[0] || (*alias_hazards[2] < *alias_hazards[0])))
+	alias_hazards[0] = alias_hazards[2];
+
+      if (alias_hazards[3]
+	  && (!alias_hazards[1] || (*alias_hazards[3] > *alias_hazards[1])))
+	alias_hazards[1] = alias_hazards[3];
+    }
+
+  if (alias_hazards[0] && alias_hazards[1]
+      && *alias_hazards[0] <= *alias_hazards[1])
+    {
+      if (dump_file)
+	fprintf (dump_file,
+		 "cannot form pair (%d,%d) due to alias conflicts (%d,%d)\n",
+		 i1->uid (), i2->uid (),
+		 alias_hazards[0]->uid (), alias_hazards[1]->uid ());
+      return false;
+    }
+
+  // Now narrow the hazards on each base candidate using
+  // the alias hazards.
+  i = 0;
+  while (i < base_cands.length ())
+    {
+      base_cand &cand = base_cands[i];
+      if (alias_hazards[0] && (!cand.hazards[0]
+			       || *alias_hazards[0] < *cand.hazards[0]))
+	cand.hazards[0] = alias_hazards[0];
+      if (alias_hazards[1] && (!cand.hazards[1]
+			       || *alias_hazards[1] > *cand.hazards[1]))
+	cand.hazards[1] = alias_hazards[1];
+
+      if (cand.viable ())
+	i++;
+      else
+	{
+	  if (dump_file)
+	    fprintf (dump_file, "pair (%d,%d): rejecting base %d due to "
+				"alias/dataflow hazards (%d,%d)\n",
+				insns[0]->uid (), insns[1]->uid (),
+				cand.m_def->regno (),
+				cand.hazards[0]->uid (),
+				cand.hazards[1]->uid ());
+
+	  base_cands.ordered_remove (i);
+	}
+    }
+
+  if (base_cands.is_empty ())
+    {
+      if (dump_file)
+	fprintf (dump_file,
+		 "cannot form pair (%d,%d) due to alias/dataflow hazards\n",
+		 insns[0]->uid (), insns[1]->uid ());
+
+      return false;
+    }
+
+  base_cand *base = &base_cands[0];
+  if (base_cands.length () > 1)
+    {
+      // If there are still multiple viable bases, it makes sense to
+      // choose one that allows us to reduce register pressure: for
+      // loads this means moving further down; for stores it means
+      // moving further up.
+      gcc_checking_assert (base_cands.length () == 2);
+      const int hazard_i = !load_p;
+      if (base->hazards[hazard_i])
+	{
+	  if (!base_cands[1].hazards[hazard_i])
+	    base = &base_cands[1];
+	  else if (load_p
+		   && *base_cands[1].hazards[hazard_i]
+		      > *(base->hazards[hazard_i]))
+	    base = &base_cands[1];
+	  else if (!load_p
+		   && *base_cands[1].hazards[hazard_i]
+		      < *(base->hazards[hazard_i]))
+	    base = &base_cands[1];
+	}
+    }
+
+  // Since BASE is viable, either one of its hazards is absent or
+  // hazards[0] comes strictly after hazards[1], so the pair can be
+  // formed anywhere in the open range (hazards[1], hazards[0]).
+  insn_range_info range (insns[0], insns[1]);
+  if (base->hazards[1])
+    range.first = base->hazards[1];
+  if (base->hazards[0])
+    range.last = base->hazards[0]->prev_nondebug_insn ();
+
+  // Placement strategy: push loads down and pull stores up; this should
+  // help register pressure by reducing live ranges.
+  if (load_p)
+    range.first = range.last;
+  else
+    range.last = range.first;
+
+  if (dump_file)
+    {
+      auto print_hazard = [](insn_info *i)
+	{
+	  if (i)
+	    fprintf (dump_file, "%d", i->uid ());
+	  else
+	    fprintf (dump_file, "-");
+	};
+      auto print_pair = [print_hazard](insn_info **i)
+	{
+	  print_hazard (i[0]);
+	  fprintf (dump_file, ",");
+	  print_hazard (i[1]);
+	};
+
+      fprintf (dump_file, "fusing pair [L=%d] (%d,%d), base=%d, hazards: (",
+	      load_p, insns[0]->uid (), insns[1]->uid (),
+	      base->m_def->regno ());
+      print_pair (base->hazards);
+      fprintf (dump_file, "), move_range: (%d,%d)\n",
+	       range.first->uid (), range.last->uid ());
+    }
+
+  return fuse_pair (load_p, access_size, writeback,
+		    i1, i2, *base, range, emitted_tombstone_p);
+}
+
+// Erase the prefix [L.begin (), I] (inclusive) from L, returning the new
+// begin iterator.
+static insn_iter_t
+erase_prefix (insn_list_t &l, insn_iter_t i)
+{
+  l.erase (l.begin (), std::next (i));
+  return l.begin ();
+}
+
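+// Erase the single element pointed to by I from L, returning an iterator
+// to an adjacent remaining element so that iteration can continue.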
+static insn_iter_t
+erase_one (insn_list_t &l, insn_iter_t i, insn_iter_t begin)
+{
+  auto prev_or_next = (i == begin) ? std::next (i) : std::prev (i);
+  l.erase (i);
+  return prev_or_next;
+}
+
+static void
+dump_insn_list (FILE *f, const insn_list_t &l)
+{
+  fprintf (f, "(");
+
+  auto i = l.begin ();
+  auto end = l.end ();
+
+  if (i != end)
+    fprintf (f, "%d", (*i)->uid ());
+  i++;
+
+  for (; i != end; i++)
+    {
+      fprintf (f, ", %d", (*i)->uid ());
+    }
+
+  fprintf (f, ")");
+}
+
+DEBUG_FUNCTION void
+debug (const insn_list_t &l)
+{
+  dump_insn_list (stderr, l);
+  fprintf (stderr, "\n");
+}
+
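+// Given the sorted candidate lists LEFT_LIST and RIGHT_LIST for a pair of
+// adjacent offsets, walk the two lists in parallel (between the given
+// iterator bounds) and call try_fuse_pair on adjacent candidates.  Insns
+// from RIGHT_LIST that get fused are added to TO_DELETE so that the caller
+// can remove them from the original list.  LOAD_P and ACCESS_SIZE describe
+// the accesses, and EMITTED_TOMBSTONE_P is set if any fusion emitted a
+// tombstone insn.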
+void
+merge_pairs (insn_iter_t l_begin,
+	     insn_iter_t l_end,
+	     insn_iter_t r_begin,
+	     insn_iter_t r_end,
+	     insn_list_t &left_list,
+	     insn_list_t &right_list,
+	     hash_set <insn_info *> &to_delete,
+	     bool load_p,
+	     unsigned access_size,
+	     bool &emitted_tombstone_p)
+{
+  auto iter_l = l_begin;
+  auto iter_r = r_begin;
+
+  bool result;
+  while (l_begin != l_end && r_begin != r_end)
+    {
+      auto next_l = std::next (iter_l);
+      auto next_r = std::next (iter_r);
+      if (**iter_l < **iter_r
+	  && next_l != l_end
+	  && **next_l < **iter_r)
+	{
+	  iter_l = next_l;
+	  continue;
+	}
+      else if (**iter_r < **iter_l
+	       && next_r != r_end
+	       && **next_r < **iter_l)
+	{
+	  iter_r = next_r;
+	  continue;
+	}
+
+      bool update_l = false;
+      bool update_r = false;
+
+      result = try_fuse_pair (load_p, access_size,
+			      *iter_l, *iter_r,
+			      emitted_tombstone_p);
+      if (result)
+	{
+	  update_l = update_r = true;
+	  if (to_delete.add (*iter_r))
+	    gcc_unreachable (); // Shouldn't get added twice.
+
+	  iter_l = erase_one (left_list, iter_l, l_begin);
+	  iter_r = erase_one (right_list, iter_r, r_begin);
+	}
+      else
+	{
+	  // Here we know that the entire prefix we skipped
+	  // over cannot merge with anything further on
+	  // in iteration order (there are aliasing hazards
+	  // on both sides), so delete the entire prefix.
+	  if (**iter_l < **iter_r)
+	    {
+	      // Delete everything from l_begin to iter_l, inclusive.
+	      update_l = true;
+	      iter_l = erase_prefix (left_list, iter_l);
+	    }
+	  else
+	    {
+	      // Delete everything from r_begin to iter_r, inclusive.
+	      update_r = true;
+	      iter_r = erase_prefix (right_list, iter_r);
+	    }
+	}
+
+      if (update_l)
+	{
+	  l_begin = left_list.begin ();
+	  l_end = left_list.end ();
+	}
+      if (update_r)
+	{
+	  r_begin = right_list.begin ();
+	  r_end = right_list.end ();
+	}
+    }
+}
+
+// Given a list of insns LEFT_ORIG with all accesses adjacent to
+// those in RIGHT_ORIG, try to form them into pairs.
+//
+// Return true iff we formed all the RIGHT_ORIG candidates into
+// pairs.
+bool
+ldp_bb_info::try_form_pairs (insn_list_t *left_orig,
+			     insn_list_t *right_orig,
+			     bool load_p, unsigned access_size)
+{
+  // Make a copy of the right list which we can modify to
+  // exclude candidates locally for this invocation.
+  insn_list_t right_copy (*right_orig);
+
+  if (dump_file)
+    {
+      fprintf (dump_file, "try_form_pairs [L=%d], cand vecs ", load_p);
+      dump_insn_list (dump_file, *left_orig);
+      fprintf (dump_file, " x ");
+      dump_insn_list (dump_file, right_copy);
+      fprintf (dump_file, "\n");
+    }
+
+  // Set of insns to remove from RIGHT_ORIG (because they were formed
+  // into a pair).
+  hash_set <insn_info *> to_delete;
+
+  // Now that we have a 2D matrix of candidates, traverse it to try to
+  // find pairs of insns that are already adjacent (within the merged
+  // list of accesses).
+  merge_pairs (left_orig->begin (), left_orig->end (),
+	       right_copy.begin (), right_copy.end (),
+	       *left_orig, right_copy,
+	       to_delete, load_p, access_size,
+	       m_emitted_tombstone);
+
+  // If we formed all right candidates into pairs,
+  // then we can skip the next iteration.
+  if (to_delete.elements () == right_orig->size ())
+    return true;
+
+  // Remove the insns recorded in to_delete from RIGHT_ORIG.
+  auto right_iter = right_orig->begin ();
+  auto right_end = right_orig->end ();
+  while (right_iter != right_end)
+    {
+      auto right_next = std::next (right_iter);
+
+      if (to_delete.contains (*right_iter))
+	{
+	  right_orig->erase (right_iter);
+	  right_end = right_orig->end ();
+	}
+
+      right_iter = right_next;
+    }
+
+  return false;
+}
+
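+// Iterate over the access records in GROUP, and for each pair of records
+// whose offsets are exactly one access size apart, try to form load/store
+// pairs from their candidate insns.  ENCODED_LFS determines whether we are
+// looking at loads or stores and gives the access size.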
+void
+ldp_bb_info::transform_for_base (int encoded_lfs,
+				 access_group &group)
+{
+  const auto lfs = decode_lfs (encoded_lfs);
+  const unsigned access_size = lfs.size;
+
+  bool skip_next = true;
+  access_record *prev_access = nullptr;
+
+  for (auto &access : group.list)
+    {
+      if (skip_next)
+	skip_next = false;
+      else if (known_eq (access.offset, prev_access->offset + access_size))
+	skip_next = try_form_pairs (&prev_access->cand_insns,
+				    &access.cand_insns,
+				    lfs.load_p, access_size);
+
+      prev_access = &access;
+    }
+}
+
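+// If we emitted any tombstone insns for this bb, walk through the bb and
+// delete them, reparenting any remaining uses of their memory defs to the
+// previous definitions.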
+void
+ldp_bb_info::cleanup_tombstones ()
+{
+  // No need to do anything if we didn't emit a tombstone insn for this bb.
+  if (!m_emitted_tombstone)
+    return;
+
+  insn_info *insn = m_bb->head_insn ();
+  while (insn)
+    {
+      insn_info *next = insn->next_nondebug_insn ();
+      if (!insn->is_real () || !tombstone_insn_p (insn))
+	{
+	  insn = next;
+	  continue;
+	}
+
+      auto def = memory_access (insn->defs ());
+      auto set = dyn_cast <set_info *> (def);
+      if (set && set->has_any_uses ())
+	{
+	  def_info *prev_def = def->prev_def ();
+	  auto prev_set = dyn_cast <set_info *> (prev_def);
+	  if (!prev_set)
+	    gcc_unreachable (); // TODO: handle this if needed.
+
+	  while (set->first_use ())
+	    crtl->ssa->reparent_use (set->first_use (), prev_set);
+	}
+
+      // Now that the set has no uses, we can delete the insn.
+      insn_change change (insn, insn_change::DELETE);
+      crtl->ssa->change_insn (change);
+      insn = next;
+    }
+}
+
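+// For each access group in MAP, try to transform its accesses into
+// load/store pairs via transform_for_base.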
+template<typename Map>
+void
+ldp_bb_info::traverse_base_map (Map &map)
+{
+  for (auto kv : map)
+    {
+      const auto &key = kv.first;
+      auto &value = kv.second;
+      transform_for_base (key.second, value);
+    }
+}
+
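+// Transform the accesses we tracked in this bb, iterating over both base
+// maps (expr_map and def_map).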
+void
+ldp_bb_info::transform ()
+{
+  traverse_base_map (expr_map);
+  traverse_base_map (def_map);
+}
+
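+// Initialize global state for the pass: compute dominance info, run df
+// analysis, and build the RTL-SSA form for the current function.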
+static void
+ldp_fusion_init ()
+{
+  calculate_dominance_info (CDI_DOMINATORS);
+  df_analyze ();
+  crtl->ssa = new rtl_ssa::function_info (cfun);
+}
+
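+// Tear down the RTL-SSA form for the current function, performing any
+// pending insn updates (and cleaning up the CFG if necessary), then free
+// the dominance info.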
+static void
+ldp_fusion_destroy ()
+{
+  if (crtl->ssa->perform_pending_updates ())
+    cleanup_cfg (0);
+
+  free_dominance_info (CDI_DOMINATORS);
+
+  delete crtl->ssa;
+  crtl->ssa = nullptr;
+}
+
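+// Given the PATTERN of an existing load pair insn, extract the two
+// destination registers into REGS and return the (shared) mem read by
+// the pair.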
+static rtx
+aarch64_destructure_load_pair (rtx regs[2], rtx pattern)
+{
+  rtx mem = NULL_RTX;
+
+  for (int i = 0; i < 2; i++)
+    {
+      rtx pat = XVECEXP (pattern, 0, i);
+      regs[i] = XEXP (pat, 0);
+      rtx unspec = XEXP (pat, 1);
+      gcc_checking_assert (GET_CODE (unspec) == UNSPEC);
+      rtx this_mem = XVECEXP (unspec, 0, 0);
+      if (mem)
+	gcc_checking_assert (rtx_equal_p (mem, this_mem));
+      else
+	{
+	  gcc_checking_assert (MEM_P (this_mem));
+	  mem = this_mem;
+	}
+    }
+
+  return mem;
+}
+
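+// Likewise, but for a store pair: extract the two source registers into
+// REGS and return the mem written by the pair.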
+static rtx
+aarch64_destructure_store_pair (rtx regs[2], rtx pattern)
+{
+  rtx mem = XEXP (pattern, 0);
+  rtx unspec = XEXP (pattern, 1);
+  gcc_checking_assert (GET_CODE (unspec) == UNSPEC);
+  for (int i = 0; i < 2; i++)
+    regs[i] = XVECEXP (unspec, 0, i);
+  return mem;
+}
+
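+// Given an rtx WB_EFFECT describing the update of the base register, the
+// pair mem PAIR_MEM, and the transfer registers REGS, return a writeback
+// variant of the pair: a PARALLEL of the base-register update and the two
+// individual sets.  LOAD_P is true if this is a load pair.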
+static rtx
+aarch64_gen_writeback_pair (rtx wb_effect, rtx pair_mem, rtx regs[2],
+			    bool load_p)
+{
+  auto op_mode = aarch64_operand_mode_for_pair_mode (GET_MODE (pair_mem));
+
+  machine_mode modes[2];
+  for (int i = 0; i < 2; i++)
+    {
+      machine_mode mode = GET_MODE (regs[i]);
+      if (load_p)
+	gcc_checking_assert (mode != VOIDmode);
+      else if (mode == VOIDmode)
+	mode = op_mode;
+
+      modes[i] = mode;
+    }
+
+  const auto op_size = GET_MODE_SIZE (modes[0]);
+  gcc_checking_assert (known_eq (op_size, GET_MODE_SIZE (modes[1])));
+
+  rtx pats[2];
+  for (int i = 0; i < 2; i++)
+    {
+      rtx mem = adjust_address_nv (pair_mem, modes[i], op_size * i);
+      pats[i] = load_p
+	? gen_rtx_SET (regs[i], mem)
+	: gen_rtx_SET (mem, regs[i]);
+    }
+
+  return gen_rtx_PARALLEL (VOIDmode,
+			   gen_rtvec (3, wb_effect, pats[0], pats[1]));
+}
+
+// Given an existing pair insn INSN, look for a trailing update of
+// the base register which we can fold in to make this pair use
+// a writeback addressing mode.
+static void
+try_promote_writeback (insn_info *insn)
+{
+  auto rti = insn->rtl ();
+  const auto attr = get_attr_ldpstp (rti);
+  if (attr == LDPSTP_NONE)
+    return;
+
+  bool load_p = (attr == LDPSTP_LDP);
+  gcc_checking_assert (load_p || attr == LDPSTP_STP);
+
+  rtx regs[2];
+  rtx mem = NULL_RTX;
+  if (load_p)
+    mem = aarch64_destructure_load_pair (regs, PATTERN (rti));
+  else
+    mem = aarch64_destructure_store_pair (regs, PATTERN (rti));
+  gcc_checking_assert (MEM_P (mem));
+
+  poly_int64 offset;
+  rtx base = strip_offset (XEXP (mem, 0), &offset);
+  gcc_assert (REG_P (base));
+
+  const auto access_size = GET_MODE_SIZE (GET_MODE (mem)).to_constant () / 2;
+
+  if (find_access (insn->defs (), REGNO (base)))
+    {
+      gcc_assert (load_p);
+      if (dump_file)
+	fprintf (dump_file,
+		 "ldp %d clobbers base r%d, can't promote to writeback\n",
+		 insn->uid (), REGNO (base));
+      return;
+    }
+
+  auto base_use = find_access (insn->uses (), REGNO (base));
+  gcc_assert (base_use);
+
+  if (!base_use->def ())
+    {
+      if (dump_file)
+	fprintf (dump_file,
+		 "found pair (i%d, L=%d): but base r%d is upwards exposed\n",
+		 insn->uid (), load_p, REGNO (base));
+      return;
+    }
+
+  auto base_def = base_use->def ();
+
+  rtx wb_effect = NULL_RTX;
+  def_info *add_def;
+  const insn_range_info pair_range (insn->prev_nondebug_insn ());
+  insn_info *insns[2] = { nullptr, insn };
+  insn_info *trailing_add = find_trailing_add (insns, pair_range, &wb_effect,
+					       &add_def, base_def, offset,
+					       access_size);
+  if (!trailing_add)
+    return;
+
+  auto attempt = crtl->ssa->new_change_attempt ();
+
+  insn_change pair_change (insn);
+  insn_change del_change (trailing_add, insn_change::DELETE);
+  insn_change *changes[] = { &pair_change, &del_change };
+
+  rtx pair_pat = aarch64_gen_writeback_pair (wb_effect, mem, regs, load_p);
+  gcc_assert (validate_unshare_change (rti, &PATTERN (rti), pair_pat, true));
+
+  // The pair must gain the def of the base register from the add.
+  pair_change.new_defs = insert_access (attempt,
+					add_def,
+					pair_change.new_defs);
+  gcc_assert (pair_change.new_defs.is_valid ());
+
+  pair_change.move_range = insn_range_info (insn->prev_nondebug_insn ());
+
+  auto is_changing = insn_is_changing (changes);
+  for (unsigned i = 0; i < ARRAY_SIZE (changes); i++)
+    gcc_assert (rtl_ssa::restrict_movement_ignoring (*changes[i], is_changing));
+
+  gcc_assert (rtl_ssa::recog_ignoring (attempt, pair_change, is_changing));
+  gcc_assert (crtl->ssa->verify_insn_changes (changes));
+  confirm_change_group ();
+  crtl->ssa->change_insns (changes);
+}
+
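+// Run load/store pair fusion over a single basic block BB: for existing
+// pairs (post-RA, when permitted by the aarch64-ldp-writeback param), try
+// to fold in a trailing update of the base register; then track candidate
+// loads and stores and try to transform them into pairs.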
+void
+ldp_fusion_bb (bb_info *bb)
+{
+  const bool track_loads
+    = aarch64_tune_params.ldp_policy_model != AARCH64_LDP_STP_POLICY_NEVER;
+  const bool track_stores
+    = aarch64_tune_params.stp_policy_model != AARCH64_LDP_STP_POLICY_NEVER;
+
+  ldp_bb_info bb_state (bb);
+
+  for (auto insn : bb->nondebug_insns ())
+    {
+      rtx_insn *rti = insn->rtl ();
+
+      if (!rti || !INSN_P (rti))
+	continue;
+
+      rtx pat = PATTERN (rti);
+      if (reload_completed
+	  && aarch64_ldp_writeback > 1
+	  && GET_CODE (pat) == PARALLEL
+	  && XVECLEN (pat, 0) == 2)
+	try_promote_writeback (insn);
+
+      if (GET_CODE (pat) != SET)
+	continue;
+
+      if (track_stores && MEM_P (XEXP (pat, 0)))
+	bb_state.track_access (insn, false, XEXP (pat, 0));
+      else if (track_loads && MEM_P (XEXP (pat, 1)))
+	bb_state.track_access (insn, true, XEXP (pat, 1));
+    }
+
+  bb_state.transform ();
+  bb_state.cleanup_tombstones ();
+}
+
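+// Main entry point for the pass: set up RTL-SSA, run pair fusion over each
+// basic block, then tear the state down again.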
+void
+ldp_fusion ()
+{
+  ldp_fusion_init ();
+
+  for (auto bb : crtl->ssa->bbs ())
+    ldp_fusion_bb (bb);
+
+  ldp_fusion_destroy ();
+}
+
+namespace {
+
+const pass_data pass_data_ldp_fusion =
+{
+  RTL_PASS, /* type */
+  "ldp_fusion", /* name */
+  OPTGROUP_NONE, /* optinfo_flags */
+  TV_NONE, /* tv_id */
+  0, /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  TODO_df_finish, /* todo_flags_finish */
+};
+
+class pass_ldp_fusion : public rtl_opt_pass
+{
+public:
+  pass_ldp_fusion (gcc::context *ctx)
+    : rtl_opt_pass (pass_data_ldp_fusion, ctx)
+    {}
+
+  opt_pass *clone () override { return new pass_ldp_fusion (m_ctxt); }
+
+  bool gate (function *) final override
+    {
+      if (!optimize || optimize_debug)
+	return false;
+
+      // If the tuning policy says never to form ldps or stps, don't run
+      // the pass.
+      if ((aarch64_tune_params.ldp_policy_model
+	   == AARCH64_LDP_STP_POLICY_NEVER)
+	  && (aarch64_tune_params.stp_policy_model
+	      == AARCH64_LDP_STP_POLICY_NEVER))
+	return false;
+
+      if (reload_completed)
+	return flag_aarch64_late_ldp_fusion;
+      else
+	return flag_aarch64_early_ldp_fusion;
+    }
+
+  unsigned execute (function *) final override
+    {
+      ldp_fusion ();
+      return 0;
+    }
+};
+
+} // anon namespace
+
+rtl_opt_pass *
+make_pass_ldp_fusion (gcc::context *ctx)
+{
+  return new pass_ldp_fusion (ctx);
+}
+
+#include "gt-aarch64-ldp-fusion.h"
diff --git a/gcc/config/aarch64/aarch64-passes.def b/gcc/config/aarch64/aarch64-passes.def
index 6ace797b738..f38c642414e 100644
--- a/gcc/config/aarch64/aarch64-passes.def
+++ b/gcc/config/aarch64/aarch64-passes.def
@@ -23,3 +23,5 @@  INSERT_PASS_BEFORE (pass_reorder_blocks, 1, pass_track_speculation);
 INSERT_PASS_AFTER (pass_machine_reorg, 1, pass_tag_collision_avoidance);
 INSERT_PASS_BEFORE (pass_shorten_branches, 1, pass_insert_bti);
 INSERT_PASS_AFTER (pass_if_after_combine, 1, pass_cc_fusion);
+INSERT_PASS_BEFORE (pass_early_remat, 1, pass_ldp_fusion);
+INSERT_PASS_BEFORE (pass_peephole2, 1, pass_ldp_fusion);
diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index 2ab54f244a7..fd75aa115d1 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -1055,6 +1055,7 @@  rtl_opt_pass *make_pass_track_speculation (gcc::context *);
 rtl_opt_pass *make_pass_tag_collision_avoidance (gcc::context *);
 rtl_opt_pass *make_pass_insert_bti (gcc::context *ctxt);
 rtl_opt_pass *make_pass_cc_fusion (gcc::context *ctxt);
+rtl_opt_pass *make_pass_ldp_fusion (gcc::context *);
 
 poly_uint64 aarch64_regmode_natural_size (machine_mode);
 
diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
index f5a518202a1..a69c37ce33b 100644
--- a/gcc/config/aarch64/aarch64.opt
+++ b/gcc/config/aarch64/aarch64.opt
@@ -271,6 +271,16 @@  mtrack-speculation
 Target Var(aarch64_track_speculation)
 Generate code to track when the CPU might be speculating incorrectly.
 
+mearly-ldp-fusion
+Target Var(flag_aarch64_early_ldp_fusion) Optimization Init(1)
+Enable the pre-RA AArch64-specific pass to fuse loads and stores into
+ldp and stp instructions.
+
+mlate-ldp-fusion
+Target Var(flag_aarch64_late_ldp_fusion) Optimization Init(1)
+Enable the post-RA AArch64-specific pass to fuse loads and stores into
+ldp and stp instructions.
+
 mstack-protector-guard=
 Target RejectNegative Joined Enum(stack_protector_guard) Var(aarch64_stack_protector_guard) Init(SSP_GLOBAL)
 Use given stack-protector guard.
@@ -360,3 +370,16 @@  Enum(aarch64_ldp_stp_policy) String(never) Value(AARCH64_LDP_STP_POLICY_NEVER)
 
 EnumValue
 Enum(aarch64_ldp_stp_policy) String(aligned) Value(AARCH64_LDP_STP_POLICY_ALIGNED)
+
+-param=aarch64-ldp-alias-check-limit=
+Target Joined UInteger Var(aarch64_ldp_alias_check_limit) Init(8) IntegerRange(0, 65536) Param
+Limit on number of alias checks performed when attempting to form an ldp/stp.
+
+-param=aarch64-ldp-writeback=
+Target Joined UInteger Var(aarch64_ldp_writeback) Init(2) IntegerRange(0,2) Param
+Param to control which writeback opportunities we try to handle in the
+load/store pair fusion pass.  A value of zero disables writeback
+handling.  One means we try to form pairs involving one or more existing
+individual writeback accesses where possible.  A value of two means we
+also try to opportunistically form writeback accesses by folding in
+trailing destructive updates of the base register used by a pair.
diff --git a/gcc/config/aarch64/t-aarch64 b/gcc/config/aarch64/t-aarch64
index a9a244ab6d6..37917344a54 100644
--- a/gcc/config/aarch64/t-aarch64
+++ b/gcc/config/aarch64/t-aarch64
@@ -176,6 +176,13 @@  aarch64-cc-fusion.o: $(srcdir)/config/aarch64/aarch64-cc-fusion.cc \
 	$(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS) $(INCLUDES) \
 		$(srcdir)/config/aarch64/aarch64-cc-fusion.cc
 
+aarch64-ldp-fusion.o: $(srcdir)/config/aarch64/aarch64-ldp-fusion.cc \
+    $(CONFIG_H) $(SYSTEM_H) $(CORETYPES_H) $(BACKEND_H) $(RTL_H) $(DF_H) \
+    $(RTL_SSA_H) cfgcleanup.h tree-pass.h ordered-hash-map.h tree-dfa.h \
+    fold-const.h tree-hash-traits.h print-tree.h
+	$(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS) $(INCLUDES) \
+		$(srcdir)/config/aarch64/aarch64-ldp-fusion.cc
+
 comma=,
 MULTILIB_OPTIONS    = $(subst $(comma),/, $(patsubst %, mabi=%, $(subst $(comma),$(comma)mabi=,$(TM_MULTILIB_CONFIG))))
 MULTILIB_DIRNAMES   = $(subst $(comma), ,$(TM_MULTILIB_CONFIG))