diff mbox series

[V2] rs6000: New pass for replacement of adjacent loads fusion (lxv).

Message ID e87b33e1-5862-42c2-8785-5c58c8cffaec@linux.ibm.com
State New
Headers show
Series [V2] rs6000: New pass for replacement of adjacent loads fusion (lxv). | expand

Commit Message

Ajit Agarwal Jan. 21, 2024, 2:27 p.m. UTC
Hello All:

New pass to replace adjacent memory addresses lxv with lxvp.
Added common infrastructure for load store fusion for
different targets.

Common routines are refactored in fusion-common.h.

AARCH64 load/store fusion pass is not changed with the 
common infrastructure.

For AARCH64 architectures just include "fusion-common.h"
and target dependent code can be added to that.


Alex/Richard:

If you would like me to add for AARCH64 I can do that for AARCH64.

If you would like to do that is fine with me.

Bootstrapped and regtested with powerpc64-linux-gnu.

Improvement in performance is seen with Spec 2017 spec FP benchmarks.

Thanks & Regards
Ajit

rs6000: New  pass for replacement of adjacent lxv with lxvp.

New pass to replace adjacent memory addresses lxv with lxvp.
Added common infrastructure for load store fusion for
different targets.

Common routines are refactored in fusion-common.h.

2024-01-21  Ajit Kumar Agarwal  <aagarwa1@linux.ibm.com>

gcc/ChangeLog:

	* config/rs6000/rs6000-passes.def: New vecload pass
	before pass_early_remat.
	* config/rs6000/rs6000-vecload-opt.cc: Add new pass.
	* config.gcc: Add new executable.
	* config/rs6000/rs6000-protos.h: Add new prototype for vecload
	pass.
	* config/rs6000/rs6000.cc: Add new prototype for vecload pass.
	* config/rs6000/t-rs6000: Add new rule.
	* fusion-common.h: Add common infrastructure for load store
	fusion that can be shared across different architectures.
	* emit-rtl.cc: Modify assert code.

gcc/testsuite/ChangeLog:

	* g++.target/powerpc/vecload.C: New test.
	* g++.target/powerpc/vecload1.C: New test.
	* gcc.target/powerpc/mma-builtin-1.c: Modify test.
---
 gcc/config.gcc                                |    4 +-
 gcc/config/rs6000/rs6000-passes.def           |    3 +
 gcc/config/rs6000/rs6000-protos.h             |    1 +
 gcc/config/rs6000/rs6000-vecload-opt.cc       | 1186 ++++++++++++++++
 gcc/config/rs6000/rs6000.cc                   |    1 +
 gcc/config/rs6000/t-rs6000                    |    5 +
 gcc/emit-rtl.cc                               |   14 +-
 gcc/fusion-common.h                           | 1195 +++++++++++++++++
 gcc/testsuite/g++.target/powerpc/vecload.C    |   15 +
 gcc/testsuite/g++.target/powerpc/vecload1.C   |   22 +
 .../gcc.target/powerpc/mma-builtin-1.c        |    4 +-
 11 files changed, 2433 insertions(+), 17 deletions(-)
 create mode 100644 gcc/config/rs6000/rs6000-vecload-opt.cc
 create mode 100644 gcc/fusion-common.h
 create mode 100644 gcc/testsuite/g++.target/powerpc/vecload.C
 create mode 100644 gcc/testsuite/g++.target/powerpc/vecload1.C

Comments

Alex Coplan Jan. 24, 2024, 4:43 p.m. UTC | #1
Hi Ajit,

On 21/01/2024 19:57, Ajit Agarwal wrote:
> 
> Hello All:
> 
> New pass to replace adjacent memory addresses lxv with lxvp.
> Added common infrastructure for load store fusion for
> different targets.

Thanks for this, it would be nice to see the load/store pair pass
generalized to multiple targets.

I assume you are targeting GCC 15 for this, as we are in stage 4 at
the moment?

> 
> Common routines are refactored in fusion-common.h.
> 
> AARCH64 load/store fusion pass is not changed with the 
> common infrastructure.

I think any patch to generalize the load/store pair fusion pass should
update the aarch64 code at the same time to use the generic
infrastructure, instead of duplicating the code.

As a general comment, I think we should move as much of the code as
possible to target-independent code, with only the bits that are truly
target-specific (e.g. deciding which modes to allow for a load/store
pair operand) in target code.

In terms of structuring the interface between generic code and target
code, I think it would be pragmatic to use a class with (in some cases,
pure) virtual functions that can be overriden by targets to implement
any target-specific behaviour.

IMO the generic class should be implemented in its own .cc instead of
using a header-only approach.  The target code would then define a
derived class which overrides the virtual functions (where necessary)
declared in the generic class, and then instantiate the derived class to
create a target-customized instance of the pass.

A more traditional GCC approach would be to use optabs and target hooks
to customize the behaviour of the pass to handle target-specific
aspects, but:
 - Target hooks are quite heavyweight, and we'd potentially have to add
   quite a few hooks just for one pass that (at least initially) will
   only be used by a couple of targets.
 - Using classes allows both sides to easily maintain their own state
   and share that state where appropriate.

Nit on naming: I understand you want to move away from ldp_fusion, but
how about pair_fusion or mem_pair_fusion instead of just "fusion" as a
base name?  IMO just "fusion" isn't very clear as to what the pass is
trying to achieve.

In general the code could do with a lot more commentary to explain the
rationale for various things / explain the high-level intent of the
code.

Unfortunately I'm not familiar with the DF framework (I've only really
worked with RTL-SSA for the aarch64 pass), so I haven't commented on the
use of that framework, but it would be nice if what you're trying to do
could be done using RTL-SSA instead of using DF directly.

Hopefully Richard S can chime in on those aspects.

My main concerns with the patch at the moment (apart from the code
duplication) is that it looks like:

 - The patch removes alias analysis from try_fuse_pair, which is unsafe.
 - The patch tries to make its own RTL changes inside
   rs6000_gen_load_pair, but it should let fuse_pair make those changes
   using RTL-SSA instead.

I've left some more specific (but still mostly high-level) comments below.

> 
> For AARCH64 architectures just include "fusion-common.h"
> and target dependent code can be added to that.
> 
> 
> Alex/Richard:
> 
> If you would like me to add for AARCH64 I can do that for AARCH64.
> 
> If you would like to do that is fine with me.
> 
> Bootstrapped and regtested with powerpc64-linux-gnu.
> 
> Improvement in performance is seen with Spec 2017 spec FP benchmarks.
> 
> Thanks & Regards
> Ajit
> 
> rs6000: New  pass for replacement of adjacent lxv with lxvp.

Are you looking to handle stores eventually, out of interest?  Looking
at rs6000-vecload-opt.cc:fusion_bb it looks like you're just handling
loads at the moment.

> 
> New pass to replace adjacent memory addresses lxv with lxvp.
> Added common infrastructure for load store fusion for
> different targets.
> 
> Common routines are refactored in fusion-common.h.

I've just done a very quick scan through this file as it mostly just
looks to be idential to existing code in aarch64-ldp-fusion.cc.

> 
> 2024-01-21  Ajit Kumar Agarwal  <aagarwa1@linux.ibm.com>
> 
> gcc/ChangeLog:
> 
> 	* config/rs6000/rs6000-passes.def: New vecload pass
> 	before pass_early_remat.
> 	* config/rs6000/rs6000-vecload-opt.cc: Add new pass.
> 	* config.gcc: Add new executable.
> 	* config/rs6000/rs6000-protos.h: Add new prototype for vecload
> 	pass.
> 	* config/rs6000/rs6000.cc: Add new prototype for vecload pass.
> 	* config/rs6000/t-rs6000: Add new rule.
> 	* fusion-common.h: Add common infrastructure for load store
> 	fusion that can be shared across different architectures.
> 	* emit-rtl.cc: Modify assert code.
> 
> gcc/testsuite/ChangeLog:
> 
> 	* g++.target/powerpc/vecload.C: New test.
> 	* g++.target/powerpc/vecload1.C: New test.
> 	* gcc.target/powerpc/mma-builtin-1.c: Modify test.
> ---
>  gcc/config.gcc                                |    4 +-
>  gcc/config/rs6000/rs6000-passes.def           |    3 +
>  gcc/config/rs6000/rs6000-protos.h             |    1 +
>  gcc/config/rs6000/rs6000-vecload-opt.cc       | 1186 ++++++++++++++++
>  gcc/config/rs6000/rs6000.cc                   |    1 +
>  gcc/config/rs6000/t-rs6000                    |    5 +
>  gcc/emit-rtl.cc                               |   14 +-
>  gcc/fusion-common.h                           | 1195 +++++++++++++++++
>  gcc/testsuite/g++.target/powerpc/vecload.C    |   15 +
>  gcc/testsuite/g++.target/powerpc/vecload1.C   |   22 +
>  .../gcc.target/powerpc/mma-builtin-1.c        |    4 +-
>  11 files changed, 2433 insertions(+), 17 deletions(-)
>  create mode 100644 gcc/config/rs6000/rs6000-vecload-opt.cc
>  create mode 100644 gcc/fusion-common.h
>  create mode 100644 gcc/testsuite/g++.target/powerpc/vecload.C
>  create mode 100644 gcc/testsuite/g++.target/powerpc/vecload1.C
> 
> diff --git a/gcc/config.gcc b/gcc/config.gcc
> index 00355509c92..9bff42cf830 100644
> --- a/gcc/config.gcc
> +++ b/gcc/config.gcc
> @@ -518,7 +518,7 @@ or1k*-*-*)
>  	;;
>  powerpc*-*-*)
>  	cpu_type=rs6000
> -	extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o"
> +	extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o rs6000-vecload-opt.o"
>  	extra_objs="${extra_objs} rs6000-call.o rs6000-pcrel-opt.o"
>  	extra_objs="${extra_objs} rs6000-builtins.o rs6000-builtin.o"
>  	extra_headers="ppc-asm.h altivec.h htmintrin.h htmxlintrin.h"
> @@ -555,7 +555,7 @@ riscv*)
>  	;;
>  rs6000*-*-*)
>  	extra_options="${extra_options} g.opt fused-madd.opt rs6000/rs6000-tables.opt"
> -	extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o"
> +	extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o rs6000-vecload-opt.o"
>  	extra_objs="${extra_objs} rs6000-call.o rs6000-pcrel-opt.o"
>  	target_gtfiles="$target_gtfiles \$(srcdir)/config/rs6000/rs6000-logue.cc \$(srcdir)/config/rs6000/rs6000-call.cc"
>  	target_gtfiles="$target_gtfiles \$(srcdir)/config/rs6000/rs6000-pcrel-opt.cc"
> diff --git a/gcc/config/rs6000/rs6000-passes.def b/gcc/config/rs6000/rs6000-passes.def
> index 46a0d0b8c56..eb4a65ebe10 100644
> --- a/gcc/config/rs6000/rs6000-passes.def
> +++ b/gcc/config/rs6000/rs6000-passes.def
> @@ -28,6 +28,9 @@ along with GCC; see the file COPYING3.  If not see
>       The power8 does not have instructions that automaticaly do the byte swaps
>       for loads and stores.  */
>    INSERT_PASS_BEFORE (pass_cse, 1, pass_analyze_swaps);
> +  /* Pass to replace adjacent memory addresses lxv instruction with lxvp
> +     instruction.  */
> +  INSERT_PASS_BEFORE (pass_early_remat, 1, pass_analyze_vecload);
>  
>    /* Pass to do the PCREL_OPT optimization that combines the load of an
>       external symbol's address along with a single load or store using that
> diff --git a/gcc/config/rs6000/rs6000-protos.h b/gcc/config/rs6000/rs6000-protos.h
> index 09a57a806fa..f0a9f36602e 100644
> --- a/gcc/config/rs6000/rs6000-protos.h
> +++ b/gcc/config/rs6000/rs6000-protos.h
> @@ -343,6 +343,7 @@ namespace gcc { class context; }
>  class rtl_opt_pass;
>  
>  extern rtl_opt_pass *make_pass_analyze_swaps (gcc::context *);
> +extern rtl_opt_pass *make_pass_analyze_vecload (gcc::context *);
>  extern rtl_opt_pass *make_pass_pcrel_opt (gcc::context *);
>  extern bool rs6000_sum_of_two_registers_p (const_rtx expr);
>  extern bool rs6000_quadword_masked_address_p (const_rtx exp);
> diff --git a/gcc/config/rs6000/rs6000-vecload-opt.cc b/gcc/config/rs6000/rs6000-vecload-opt.cc
> new file mode 100644
> index 00000000000..cebadad97d8
> --- /dev/null
> +++ b/gcc/config/rs6000/rs6000-vecload-opt.cc
> @@ -0,0 +1,1186 @@
> +/* Subroutines used to replace lxv with lxvp
> +   for TARGET_POWER10 and TARGET_VSX,
> +
> +   Copyright (C) 2020-2023 Free Software Foundation, Inc.
> +   Contributed by Ajit Kumar Agarwal <aagarwa1@linux.ibm.com>.
> +
> +   This file is part of GCC.
> +
> +   GCC is free software; you can redistribute it and/or modify it
> +   under the terms of the GNU General Public License as published
> +   by the Free Software Foundation; either version 3, or (at your
> +   option) any later version.
> +
> +   GCC is distributed in the hope that it will be useful, but WITHOUT
> +   ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
> +   or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public
> +   License for more details.
> +
> +   You should have received a copy of the GNU General Public License
> +   along with GCC; see the file COPYING3.  If not see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +#define IN_TARGET_CODE 1
> +#define INCLUDE_ALGORITHM
> +#define INCLUDE_FUNCTIONAL
> +#define INCLUDE_LIST
> +#define INCLUDE_TYPE_TRAITS
> +#include "config.h"
> +#include "system.h"
> +#include "coretypes.h"
> +#include "backend.h"
> +#include "target.h"
> +#include "rtl.h"
> +#include "memmodel.h"
> +#include "emit-rtl.h"
> +#include "tree-pass.h"
> +#include "df.h"
> +#include "dumpfile.h"
> +#include "rs6000-internal.h"
> +#include "rs6000-protos.h"
> +#include "fusion-common.h"
> +
> +/* Return false if dependent rtx LOC is SUBREG.  */
> +static bool
> +is_feasible (rtx_insn *insn)

Can you explain the rationale for this function?
I'm not really sure what it's trying to achieve.

> +{
> +  df_ref use;
> +  df_insn_info *insn_info = DF_INSN_INFO_GET (insn);
> +  FOR_EACH_INSN_INFO_DEF (use, insn_info)
> +    {
> +      struct df_link *def_link = DF_REF_CHAIN (use);
> +      if (!def_link || !def_link->ref || DF_REF_IS_ARTIFICIAL (def_link->ref))
> +	continue;
> +      while (def_link && def_link->ref)
> +	{
> +	  rtx *loc = DF_REF_LOC (def_link->ref);
> +	  if (!loc || *loc == NULL_RTX)
> +	    return false;
> +	  if (GET_CODE (*loc) == SUBREG)
> +	    return false;
> +	  def_link = def_link->next;
> +	}
> +     }
> +  return true;
> +}
> +
> +/* df_scan_rescan the unspec instruction where operands
> +   are reversed.  */
> +void set_rescan_for_unspec (rtx_insn *insn)
> +{
> +  df_ref use;
> +  df_insn_info *insn_info = DF_INSN_INFO_GET (insn);
> +  rtx_insn *select_insn2;
> +  FOR_EACH_INSN_INFO_DEF (use, insn_info)
> +    {
> +      struct df_link *def_link = DF_REF_CHAIN (use);
> +      while (def_link && def_link->ref)
> +	{
> +	  select_insn2 = DF_REF_INSN (def_link->ref);
> +	  rtx set = single_set (select_insn2);
> +
> +	  if (set == NULL_RTX)
> +	    return;
> +
> +	  if (set != NULL_RTX)
> +	    {
> +	      rtx op0 = SET_SRC (set);
> +	      if (GET_CODE (op0) != UNSPEC)
> +		return;
> +
> +	      if (GET_CODE (op0) == VEC_SELECT
> +		  && GET_CODE (XEXP (op0, 1)) == PARALLEL)
> +		return;
> +
> +	      if (GET_CODE (op0) == UNSPEC)
> +		df_insn_rescan (select_insn2);
> +	    }
> +	   def_link = def_link->next;
> +	}
> +     }
> +}
> +
> +/* Return dependent UNSPEC instruction.  */
> +rtx_insn *get_rtx_UNSPEC (rtx_insn *insn)
> +{
> +  df_ref use;
> +  df_insn_info *insn_info = DF_INSN_INFO_GET (insn);
> +  rtx_insn *select_insn2;
> +  FOR_EACH_INSN_INFO_DEF (use, insn_info)
> +    {
> +      struct df_link *def_link = DF_REF_CHAIN (use);
> +      while (def_link && def_link->ref)
> +	{
> +	  select_insn2 = DF_REF_INSN (def_link->ref);
> +	  rtx set = single_set (select_insn2);
> +
> +	  if (set == NULL_RTX)
> +	    return 0;
> +
> +	  if (set != NULL_RTX)

This condition is always true due to the early return above.

> +	    {
> +	      rtx op0 = SET_SRC (set);
> +
> +	      if (GET_CODE (op0) == UNSPEC)
> +		return select_insn2;
> +	    }
> +	   def_link = def_link->next;
> +	}
> +     }
> +  return 0;
> +}
> +
> +/* Replace identified lxv with lxvp.
> +   Bail out if following condition are true:
> +
> +   - dependent instruction of load is vec_select instruction,

Please can you explain the rationale for this condition?

Do I understand correctly that you want to bail out if a use of a
register loaded by either insn occurs in a vec_select rtx?  If so, why?

AIUI, since we shouldn't change register dataflow in the pass, I wonder
if you could check this condition once in track_access, instead of
checking it for every pair formation attempt?

> +
> +   - machine mode of unspec is not same as machine mode
> +     of lxv instruction.
> +
> +   - dependent instruction is not unspec.

Likewise, please explain the rationale for this.

> +
> +   - Source operand of unspec is eq instruction.  */

Same here.

> +
> +rtx
> +rs6000_gen_load_pair (rtx_insn *insn1, rtx_insn *insn2)
> +{

I think this function is named in a rather misleading way, and I think
it is trying to do too much at once.

Based on the name of the function, I would expect it to just produce the
appropriate RTL pattern for a load pair given the two candidate loads.
But it actually tries to do a bunch of analysis to determine if the pair
can be formed, and then inserts a new insn and deletes the original
candidate insns.

Any changing of the RTL should be done by RTL-SSA in fuse_pair.

> +  rtx body = PATTERN (insn1);
> +  rtx src_exp = SET_SRC (body);
> +  rtx dest_exp = SET_DEST (body);
> +  rtx lxv;
> +  rtx insn2_body = PATTERN (insn2);
> +
> +  rtx insn2_src_exp = SET_SRC (insn2_body);
> +
> +  if (GET_MODE (src_exp) != GET_MODE (SET_SRC (insn2_body)))
> +    return NULL_RTX;

IIUC this should hold from generic RTL rules, so shouldn't be needed.

> +
> +  if (GET_MODE (dest_exp) == TImode)
> +    return NULL_RTX;
> +
> +  if (!ALTIVEC_OR_VSX_VECTOR_MODE (GET_MODE (dest_exp)))
> +    return NULL_RTX;

Validation of the modes should have happened much earlier (in
track_access), so shouldn't be necessary at this point.

> +
> +  if (!is_feasible (insn1))
> +    return NULL_RTX;
> +
> +  if (!is_feasible (insn2))
> +    return NULL_RTX;

I can't really comment on this as I don't understand the intent behind
is_feasible, but AFAICT it seems to do some dataflow analysis (looking
at uses/defs).  That should ideally have been done at the latest in
try_fuse_pair (before we get to fuse_pair), and shouldn't be done when
we're trying to generate the final RTL pattern for the pair.

Given that these seem to be properties of the indiviudal insns it might
even be possible to check this much earlier (when first considering the
individual accesses in track_access).

> +
> +  for (rtx note = REG_NOTES (insn1); note; note = XEXP (note, 1))
> +    if (REG_NOTE_KIND (note) == REG_EQUAL
> +	|| REG_NOTE_KIND (note) == REG_EQUIV)
> +      return NULL_RTX;

Can you explain why you're punting if the candidate insns have these
notes?  Did you see a performance improvement with that change?  It
might be something we want to consider doing for aarch64 too if running
before RA (in which case we could do it in the generic code).

> +
> +  int no_dep = 0;
> +  df_ref  use;
> +  df_insn_info *insn_info = DF_INSN_INFO_GET (insn1);
> +  rtx_insn *select_insn2;
> +
> +  FOR_EACH_INSN_INFO_DEF (use, insn_info)
> +    {
> +      struct df_link *def_link = DF_REF_CHAIN (use);
> +      while (def_link && def_link->ref)
> +	{
> +	  select_insn2 = DF_REF_INSN (def_link->ref);
> +	  rtx set = single_set (select_insn2);
> +
> +	  if (set == NULL_RTX)
> +	    return NULL_RTX;
> +
> +	  if (set != NULL_RTX)

As above, I think this condition is redundant due to the early return
immediately above.

> +	    {
> +	      rtx op0 = SET_SRC (set);
> +
> +	      if (GET_CODE (op0) != UNSPEC)
> +		return NULL_RTX;
> +
> +	      if (GET_CODE (op0) == VEC_SELECT
> +		  && GET_CODE (XEXP (op0, 1)) == PARALLEL)
> +		return NULL_RTX;
> +
> +	      if (GET_CODE (op0) == UNSPEC)
> +		{
> +		  if (GET_MODE (op0) != XOmode
> +		      && GET_MODE (op0) != GET_MODE (dest_exp))
> +		    return NULL_RTX;
> +
> +		  int nvecs = XVECLEN (op0, 0);
> +		  for (int i = 0; i < nvecs; i++)
> +		    {
> +		      rtx op;
> +		      op = XVECEXP (op0, 0, i);
> +
> +		      if (GET_MODE (op) == OOmode)
> +			return NULL_RTX;
> +		      if (GET_CODE (op) == EQ)
> +			return NULL_RTX;
> +		    }
> +		}
> +	       ++no_dep;
> +	     }
> +	   def_link = def_link->next;
> +	}
> +     }
> +
> +  rtx_insn *insn = get_rtx_UNSPEC (insn1);
> +
> +  if (insn && insn == get_rtx_UNSPEC (insn2) && no_dep == 1)
> +    return NULL_RTX;
> +
> +  int regoff;
> +  rtx src;
> +  rtx addr = XEXP (src_exp, 0);
> +
> +  if (GET_CODE (addr) == PLUS
> +      && XEXP (addr, 1) && CONST_INT_P (XEXP (addr, 1)))
> +    {
> +      regoff = 0;
> +      src = simplify_gen_subreg (GET_MODE (dest_exp),
> +				 dest_exp, GET_MODE (dest_exp),
> +				 regoff);
> +    }
> +  else
> +    {
> +      regoff = INTVAL (CONST0_RTX (SImode)) *
> +		       GET_MODE_SIZE (GET_MODE (dest_exp));
> +      src = simplify_gen_subreg (GET_MODE (dest_exp),
> +				 dest_exp, GET_MODE (dest_exp),
> +				 regoff);
> +    }
> +  insn_info = DF_INSN_INFO_GET (insn2);
> +  FOR_EACH_INSN_INFO_DEF (use, insn_info)
> +    {
> +      struct df_link *def_link = DF_REF_CHAIN (use);
> +      if (!def_link || !def_link->ref || DF_REF_IS_ARTIFICIAL (def_link->ref))
> +	continue;
> +      while (def_link && def_link->ref)
> +	{
> +	  rtx *loc = DF_REF_LOC (def_link->ref);
> +	  *loc =  src;
> +	  def_link = def_link->next;
> +	}
> +     }
> +
> +  int regoff1;
> +  rtx src1;
> +  addr = XEXP (insn2_src_exp, 0);
> +
> +  if (GET_CODE (addr) == PLUS
> +      && XEXP (addr, 1)
> +      && CONST_INT_P (XEXP(addr, 1)))
> +    {
> +      regoff1 = 16;
> +      src1 = simplify_gen_subreg (GET_MODE (dest_exp),
> +				  dest_exp, GET_MODE (dest_exp),
> +				  regoff1);
> +    }
> +  else
> +    {
> +      regoff1 = INTVAL (CONST0_RTX (SImode)) * GET_MODE_SIZE (GET_MODE (dest_exp));//V16QImode);
> +      src1 = simplify_gen_subreg (GET_MODE (dest_exp),
> +				  dest_exp, GET_MODE (dest_exp),
> +				  regoff1);
> +    }
> +
> +  insn_info = DF_INSN_INFO_GET (insn1);
> +  FOR_EACH_INSN_INFO_DEF (use, insn_info)
> +    {
> +      struct df_link *def_link = DF_REF_CHAIN (use);
> +      if (!def_link || !def_link->ref || DF_REF_IS_ARTIFICIAL (def_link->ref))
> +	continue;
> +      while (def_link && def_link->ref)
> +	{
> +	  rtx *loc = DF_REF_LOC (def_link->ref);
> +	  PUT_MODE_RAW (*loc, OOmode);
> +	  *loc = src1;
> +	  def_link = def_link->next;
> +	}
> +     }
> +
> +  set_rescan_for_unspec (insn1);
> +  set_rescan_for_unspec (insn2);
> +  df_insn_rescan (insn1);
> +  df_insn_rescan (insn2);
> +
> +  PUT_MODE_RAW (src_exp, OOmode);
> +  PUT_MODE_RAW (dest_exp, OOmode);

I don't think you should be updating the original RTL directly like
this, but instead return a new pattern which will ultimately get passed
to validate_unshare_change in fuse_pair (i.e. let fuse_pair update the
RTL).

> +  lxv = gen_movoo (dest_exp, src_exp);
> +  rtx_insn *new_insn = emit_insn_before (lxv,  insn1);

Likewise, instead of emitting a new insn, this should let fuse_pair
perform the change.

> +  set_block_for_insn (new_insn, BLOCK_FOR_INSN (insn1));
> +  df_insn_rescan (new_insn);
> +
> +  if (dump_file)
> +    {
> +      unsigned int new_uid = INSN_UID (new_insn);
> +      fprintf (dump_file, "Replacing lxv %d with lxvp  %d\n",
> +			  INSN_UID (insn1), new_uid);
> +      print_rtl_single (dump_file, new_insn);
> +      print_rtl_single (dump_file, insn1);
> +      print_rtl_single (dump_file, insn2);
> +
> +    }
> +
> +  df_insn_delete (insn1);
> +  remove_insn (insn1);
> +  df_insn_delete (insn2);
> +  remove_insn (insn2);
> +  insn1->set_deleted ();
> +  insn2->set_deleted ();

Likewise, I don't think you should be deleting the original insns here,
but you should let RTL-SSA do this in the generic code instead (i.e. in
fuse_pair).

> +  return lxv;
> +}
> +
> +// LEFT_LIST and RIGHT_LIST are lists of candidate instructions where all insns
> +// in LEFT_LIST are known to be adjacent to those in RIGHT_LIST.
> +//
> +// This function traverses the resulting 2D matrix of possible pair candidates
> +// and attempts to merge them into pairs.
> +//
> +// The algorithm is straightforward: if we consider a combined list of
> +// candidates X obtained by merging LEFT_LIST and RIGHT_LIST in program order,
> +// then we advance through X until we reach a crossing point (where X[i] and
> +// X[i+1] come from different source lists).
> +//
> +// At this point we know X[i] and X[i+1] are adjacent accesses, and we try to
> +// fuse them into a pair.  If this succeeds, we remove X[i] and X[i+1] from
> +// their original lists and continue as above.
> +//
> +// In the failure case, we advance through the source list containing X[i] and
> +// continue as above (proceeding to the next crossing point).
> +//
> +// The rationale for skipping over groups of consecutive candidates from the
> +// same source list is as follows:
> +//
> +// In the store case, the insns in the group can't be re-ordered over each
> +// other as they are guaranteed to store to the same location, so we're
> +// guaranteed not to lose opportunities by doing this.
> +//
> +// In the load case, subsequent loads from the same location are either
> +// redundant (in which case they should have been cleaned up by an earlier
> +// optimization pass) or there is an intervening aliasing hazard, in which case
> +// we can't re-order them anyway, so provided earlier passes have cleaned up
> +// redundant loads, we shouldn't miss opportunities by doing this.
> +void
> +fusion_bb_info::merge_pairs (insn_list_t &left_list,
> +			  insn_list_t &right_list,
> +			  bool load_p,
> +			  unsigned access_size)
> +{

This function should be in the generic code, not in target code.
Was there a reason you moved it to target code?

When moving code from aarch64-ldp-fusion.cc, I think it would be better
if you left the code style unchanged and only made changes where there
is a need for a functional change.  For example ...

> +  if (dump_file)
> +    {
> +      fprintf (dump_file, "merge_pairs [L=%d], cand vecs ", load_p);
> +      dump_insn_list (dump_file, left_list);
> +      fprintf (dump_file, " x ");
> +      dump_insn_list (dump_file, right_list);
> +      fprintf (dump_file, "\n");
> +    }
> +
> +  auto iter_l = left_list.begin ();
> +  auto iter_r = right_list.begin ();
> +  while (iter_l != left_list.end () && iter_r != right_list.end ())
> +    {
> +      auto left = *iter_l;
> +      auto right = *iter_r;

... here you introduce new left and right locals, but AFAICT there isn't
a need for this (no functional change).

> +
> +      auto next_l = std::next (iter_l);
> +      auto next_r = std::next (iter_r);
> +
> +      if ((*left) < (*right)
> +	  && next_l != (left_list.end ())
> +	  && **next_l < *right)
> +	iter_l = next_l;
> +      else if ((*right) < (*left)
> +	       && next_r != (right_list.end ())
> +	       && **next_r < *left)
> +	iter_r = next_r;
> +      else if (try_fuse_pair (load_p, access_size, *iter_l, *iter_r)) //left, right))
> +	{
> +	  left_list.erase (iter_l);
> +	  iter_l = next_l;
> +	  right_list.erase (iter_r);
> +	  iter_r = next_r;
> +	}
> +      else if (*left  < *right)
> +	iter_l = next_l;
> +      else
> +	iter_r = next_r;
> +    }
> + 
> +}
> +
> +
> +
> +// Main function to begin pair discovery.  Given a memory access INSN,
> +// determine whether it could be a candidate for fusing into an lxvp/stvxp,
> +// and if so, track it in the appropriate data structure for this basic
> +// block.  LOAD_P is true if the access is a load, and MEM is the mem
> +// rtx that occurs in INSN.
> +void
> +fusion_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
> +{

This function should also be in generic code, not in target code.
I can see that you want to customize a target-specific aspect ...

> +  // We can't combine volatile MEMs, so punt on these.
> +  if (MEM_VOLATILE_P (mem))
> +    return;
> +
> +  // Ignore writeback accesses if the param says to do so.
> +  if (GET_RTX_CLASS (GET_CODE (XEXP (mem, 0))) == RTX_AUTOINC)
> +    return;
> +
> +  const machine_mode mem_mode = GET_MODE (mem);
> +
> +  rtx reg_op = XEXP (PATTERN (insn->rtl ()), !load_p);
> +
> +  if (!ALTIVEC_OR_VSX_VECTOR_MODE (GET_MODE (reg_op)))
> +    return;

... here.  But I think this should be done either using a target hook
or ideally using a virtual function which is overriden by target code.

For rs6000, would it be enough to validate the mode of the mem instead
of looking at the mode of reg_op?  Then we could have a single function,
e.g.:

bool pair_fusion::pair_operand_mode_ok_p (machine_mode mode);

which for rs6000 would return:

ALTIVEC_OR_VSX_VECTOR_MODE (mode);

and for aarch64 would return:

ldp_operand_mode_ok_p (mode);

and in track_access (from generic code), you would do:

if (!pair_operand_mode_ok_p (mem_mode))
  return;

note that, at least for aarch64, the reason that we validate the mem mode
instead of the register mode is that the register operand may be a modeless
const_int in the store case.

> +
> +  const bool fpsimd_op_p = (ALTIVEC_OR_VSX_VECTOR_MODE (GET_MODE (reg_op)));

IIUC, this boolean is always true due to the above check.  So it is
better to just set it to a constant (doesn't matter which) if you only
want to handle vector modes for rs6000.  As above, this should be done
through a virtual function, where for aarch64 we return the condition in
aarch64-ldp-fusion.cc:

  const bool fpsimd_op_p
    = reload_completed
    ? (REG_P (reg_op) && FP_REGNUM_P (REGNO (reg_op)))
    : (GET_MODE_CLASS (mem_mode) != MODE_INT
       && (load_p || !aarch64_const_zero_rtx_p (reg_op)));

and for rs6000 you can just return false (assuming you don't want to segregate
accesses using this bit).

> +
> +  const HOST_WIDE_INT mem_size = GET_MODE_SIZE (mem_mode);
> +  const lfs_fields lfs = { load_p, fpsimd_op_p, mem_size };
> +
> +  if (track_via_mem_expr (insn, mem, lfs))
> +    return;
> +
> +  poly_int64 mem_off;
> +  rtx addr = XEXP (mem, 0);
> +  const bool autoinc_p = GET_RTX_CLASS (GET_CODE (addr)) == RTX_AUTOINC;
> +  rtx base = load_strip_offset (mem, &mem_off);
> +  if (!REG_P (base))
> +    return;
> +
> +  // Need to calculate two (possibly different) offsets:
> +  //  - Offset at which the access occurs.
> +  //  - Offset of the new base def.
> +  poly_int64 access_off;
> +  if (autoinc_p && any_post_modify_p (addr))
> +    access_off = 0;
> +  else
> +    access_off = mem_off;
> +
> +  poly_int64 new_def_off = mem_off;
> +
> +  // Punt on accesses relative to eliminable regs.  Since we don't know the
> +  // elimination offset pre-RA, we should postpone forming pairs on such
> +  // accesses until after RA.
> +  //
> +  // As it stands, addresses with offsets in range for LDR but not
> +  // in range for LDP/STP are currently reloaded inefficiently,
> +  // ending up with a separate base register for each pair.
> +  //
> +  // In theory LRA should make use of
> +  // targetm.legitimize_address_displacement to promote sharing of
> +  // bases among multiple (nearby) address reloads, but the current
> +  // LRA code returns early from process_address_1 for operands that
> +  // satisfy "m", even if they don't satisfy the real (relaxed) address
> +  // constraint; this early return means we never get to the code
> +  // that calls targetm.legitimize_address_displacement.
> +  //
> +  // So for now, it's better to punt when we can't be sure that the
> +  // offset is in range for LDP/STP.  Out-of-range cases can then be
> +  // handled after RA by the out-of-range LDP/STP peepholes.  Eventually, it
> +  // would be nice to handle known out-of-range opportunities in the
> +  // pass itself (for stack accesses, this would be in the post-RA pass).
> +  if (!reload_completed
> +      && (REGNO (base) == FRAME_POINTER_REGNUM
> +	  || REGNO (base) == ARG_POINTER_REGNUM))
> +    return;
> +
> +  // Now need to find def of base register.
> +  use_info *base_use = find_access (insn->uses (), REGNO (base));
> +  gcc_assert (base_use);
> +  def_info *base_def = base_use->def ();
> +  if (!base_def)
> +    {
> +      if (dump_file)
> +	fprintf (dump_file,
> +		 "base register (regno %d) of insn %d is undefined",
> +		 REGNO (base), insn->uid ());
> +      return;
> +    }
> +
> +  alt_base *canon_base = canon_base_map.get (base_def);
> +  if (canon_base)
> +    {
> +      // Express this as the combined offset from the canonical base.
> +      base_def = canon_base->base;
> +      new_def_off += canon_base->offset;
> +      access_off += canon_base->offset;
> +    }
> +
> +  if (autoinc_p)
> +    {
> +      auto def = find_access (insn->defs (), REGNO (base));
> +      gcc_assert (def);
> +
> +      // Record that DEF = BASE_DEF + MEM_OFF.
> +      if (dump_file)
> +	{
> +	  pretty_printer pp;
> +	  pp_access (&pp, def, 0);
> +	  pp_string (&pp, " = ");
> +	  pp_access (&pp, base_def, 0);
> +	  fprintf (dump_file, "[bb %u] recording %s + ",
> +		   m_bb->index (), pp_formatted_text (&pp));
> +	  print_dec (new_def_off, dump_file);
> +	  fprintf (dump_file, "\n");
> +	}
> +
> +      alt_base base_rec { base_def, new_def_off };
> +      if (canon_base_map.put (def, base_rec))
> +	gcc_unreachable (); // Base defs should be unique.
> +    }
> +
> +  // Punt on misaligned offsets.  LDP/STP require offsets to be a multiple of
> +  // the access size.
> +  if (!multiple_p (mem_off, mem_size))
> +    return;
> +
> +  const auto key = std::make_pair (base_def, encode_lfs (lfs));
> +  access_group &group = def_map.get_or_insert (key, NULL);
> +  auto alloc = [&](access_record *access) { return node_alloc (access); };
> +  group.track (alloc, access_off, insn);
> +  if (dump_file)
> +    {
> +      pretty_printer pp;
> +      pp_access (&pp, base_def, 0);
> +
> +      fprintf (dump_file, "[bb %u] tracking insn %d via %s",
> +	       m_bb->index (), insn->uid (), pp_formatted_text (&pp));
> +      fprintf (dump_file,
> +	       " [L=%d, WB=%d, FP=%d, %smode, off=",
> +	       lfs.load_p, autoinc_p, lfs.fpsimd_p, mode_name[mem_mode]);
> +      print_dec (access_off, dump_file);
> +      fprintf (dump_file, "]\n");
> +    }
> +}
> +
> +
> +// Given two adjacent memory accesses of the same size, I1 and I2, try
> +// and see if we can merge them into a lxvp or stvxp.
> +//
> +// ACCESS_SIZE gives the (common) size of a single access, LOAD_P is true
> +// if the accesses are both loads, otherwise they are both stores.
> +bool
> +fusion_bb_info::try_fuse_pair (bool load_p, unsigned access_size,
> +			    insn_info *i1, insn_info *i2)

Likewise, I think this function should be in generic code, not target code.

> +{
> +  if (dump_file)
> +    fprintf (dump_file, "analyzing pair (load=%d): (%d,%d)\n",
> +	     load_p, i1->uid (), i2->uid ());
> +
> +  insn_info *insns[2];
> +  bool reversed = false;
> +  if (*i1 < *i2)
> +    {
> +      insns[0] = i1;
> +      insns[1] = i2;
> +    }
> +  else
> +    {
> +      insns[0] = i2;
> +      insns[1] = i1;
> +      reversed = true;
> +    }
> +
> +  rtx cand_mems[2];
> +  rtx reg_ops[2];
> +  rtx pats[2];
> +  for (int i = 0; i < 2; i++)
> +    {
> +      pats[i] = PATTERN (insns[i]->rtl ());
> +      cand_mems[i] = XEXP (pats[i], load_p);
> +      reg_ops[i] = XEXP (pats[i], !load_p);
> +    }
> +
> +  if (load_p && reg_overlap_mentioned_p (reg_ops[0], reg_ops[1]))
> +    {
> +      if (dump_file)
> +	fprintf (dump_file,
> +		 "punting on lxvp due to reg conflcits (%d,%d)\n",

I understand that you might not want to use "ldp" in the generic code,
but how about "punting on load pair" instead?

> +		 insns[0]->uid (), insns[1]->uid ());
> +      return false;
> +    }
> +
> +  if (cfun->can_throw_non_call_exceptions
> +      && find_reg_note (insns[0]->rtl (), REG_EH_REGION, NULL_RTX)
> +      && find_reg_note (insns[1]->rtl (), REG_EH_REGION, NULL_RTX))
> +    {
> +      if (dump_file)
> +	fprintf (dump_file,
> +		 "can't combine insns with EH side effects (%d,%d)\n",
> +		 insns[0]->uid (), insns[1]->uid ());
> +      return false;
> +    }
> +
> +  auto_vec<base_cand, 2> base_cands (2);
> +
> +  int writeback = get_viable_bases (insns, base_cands, cand_mems,
> +				    access_size, reversed);
> +  if (base_cands.is_empty ())
> +    {
> +      if (dump_file)
> +	fprintf (dump_file, "no viable base for pair (%d,%d)\n",
> +		 insns[0]->uid (), insns[1]->uid ());
> +      return false;
> +    }
> +
> +  // Punt on frame-related insns with writeback.  We probably won't see
> +  // these in practice, but this is conservative and ensures we don't
> +  // have to worry about these later on.
> +  if (writeback && (RTX_FRAME_RELATED_P (i1->rtl ())
> +		    || RTX_FRAME_RELATED_P (i2->rtl ())))
> +    {
> +      if (dump_file)
> +	fprintf (dump_file,
> +		 "rejecting pair (%d,%d): frame-related insn with writeback\n",
> +		 i1->uid (), i2->uid ());
> +      return false;
> +    }
> +
> +  rtx *ignore = &XEXP (pats[1], load_p);
> +  for (auto use : insns[1]->uses ())
> +    if (!use->is_mem ()
> +	&& refers_to_regno_p (use->regno (), use->regno () + 1, pats[1], ignore)
> +	&& use->def () && use->def ()->insn () == insns[0])
> +      {
> +	// N.B. we allow a true dependence on the base address, as this
> +	// happens in the case of auto-inc accesses.  Consider a post-increment
> +	// load followed by a regular indexed load, for example.
> +	if (dump_file)
> +	  fprintf (dump_file,
> +		   "%d has non-address true dependence on %d, rejecting pair\n",
> +		   insns[1]->uid (), insns[0]->uid ());
> +	return false;
> +      }
> +
> +  unsigned i = 0;
> +  while (i < base_cands.length ())
> +    {
> +      base_cand &cand = base_cands[i];
> +
> +      rtx *ignore[2] = {};
> +      for (int j = 0; j < 2; j++)
> +	if (cand.from_insn == !j)
> +	  ignore[j] = &XEXP (cand_mems[j], 0);
> +
> +      insn_info *h = first_hazard_after (insns[0], ignore[0]);
> +      if (h && *h <= *insns[1])
> +	cand.hazards[0] = h;
> +
> +      h = latest_hazard_before (insns[1], ignore[1]);
> +      if (h && *h >= *insns[0])
> +	cand.hazards[1] = h;
> +
> +      if (!cand.viable ())
> +	{
> +	  if (dump_file)
> +	    fprintf (dump_file,
> +		     "pair (%d,%d): rejecting base %d due to dataflow "
> +		     "hazards (%d,%d)\n",
> +		     insns[0]->uid (),
> +		     insns[1]->uid (),
> +		     cand.def->regno (),
> +		     cand.hazards[0]->uid (),
> +		     cand.hazards[1]->uid ());
> +
> +	  base_cands.ordered_remove (i);
> +	}
> +      else
> +	i++;
> +    }
> +
> +  if (base_cands.is_empty ())
> +    {
> +      if (dump_file)
> +	fprintf (dump_file,
> +		 "can't form pair (%d,%d) due to dataflow hazards\n",
> +		 insns[0]->uid (), insns[1]->uid ());
> +      return false;
> +    }
> +
> +  //insn_info *alias_hazards[4] = {};

It looks like you've just commented out / removed the alias analysis (I don't
see a call to do_alias_analysis).  How is that supposed to work?

> +
> +  // First def of memory after the first insn, and last def of memory
> +  // before the second insn, respectively.
> +  def_info *mem_defs[2] = {};
> +  if (load_p)
> +    {
> +      if (!MEM_READONLY_P (cand_mems[0]))
> +	{
> +	  mem_defs[0] = memory_access (insns[0]->uses ())->def ();
> +	  gcc_checking_assert (mem_defs[0]);
> +	  mem_defs[0] = mem_defs[0]->next_def ();
> +	}
> +      if (!MEM_READONLY_P (cand_mems[1]))
> +	{
> +	  mem_defs[1] = memory_access (insns[1]->uses ())->def ();
> +	  gcc_checking_assert (mem_defs[1]);
> +	}
> +    }
> +  else
> +    {
> +      mem_defs[0] = memory_access (insns[0]->defs ())->next_def ();
> +      mem_defs[1] = memory_access (insns[1]->defs ())->prev_def ();
> +      gcc_checking_assert (mem_defs[0]);
> +      gcc_checking_assert (mem_defs[1]);
> +    }
> +
> +  //auto tombstone_p = [&](insn_info *insn) -> bool {
> +   // return m_emitted_tombstone
> +//	   && bitmap_bit_p (&m_tombstone_bitmap, insn->uid ());
> + // };
> +
> +  if (base_cands.is_empty ())
> +    {
> +      if (dump_file)
> +	fprintf (dump_file,
> +		 "cannot form pair (%d,%d) due to alias/dataflow hazards",
> +		 insns[0]->uid (), insns[1]->uid ());
> +
> +      return false;
> +    }
> +
> +  base_cand *base = &base_cands[0];
> +  insn_range_info range (insns[0], insns[1]);
> +  // If the second insn can throw, narrow the move range to exactly that insn.
> +  // This prevents us trying to move the second insn from the end of the BB.
> +  if (cfun->can_throw_non_call_exceptions
> +      && find_reg_note (insns[1]->rtl (), REG_EH_REGION, NULL_RTX))
> +    {
> +      gcc_assert (range.includes (insns[1]));
> +      range = insn_range_info (insns[1]);
> +    }
> +
> +  // Placement strategy: push loads down and pull stores up, this should
> +  // help register pressure by reducing live ranges.
> +  if (load_p)
> +    range.first = range.last;
> +  else
> +    range.last = range.first;
> +  return fuse_pair (load_p, access_size, writeback,
> +		    i1, i2, *base, range);
> +}
> +// Try and actually fuse the pair given by insns I1 and I2.
> +//
> +// Here we've done enough analysis to know this is safe, we only
> +// reject the pair at this stage if either the tuning policy says to,
> +// or recog fails on the final pair insn.
> +//
> +// LOAD_P is true for loads, ACCESS_SIZE gives the access size of each
> +// candidate insn.  Bit i of WRITEBACK is set if the ith insn (in program
> +// order) uses writeback.
> +//
> +// BASE gives the chosen base candidate for the pair and MOVE_RANGE is
> +// a singleton range which says where to place the pair.
> +bool
> +fusion_bb_info::fuse_pair (bool load_p,
> +			unsigned access_size,
> +			int writeback,
> +			insn_info *i1, insn_info *i2,
> +			base_cand &base,
> +			const insn_range_info &move_range)
> +{

Again, I think this function should be in generic code, not in target
code.

> +  auto attempt = crtl->ssa->new_change_attempt ();
> +
> +  auto make_change = [&attempt](insn_info *insn)
> +    {
> +      return crtl->ssa->change_alloc<insn_change> (attempt, insn);
> +    };
> +  auto make_delete = [&attempt](insn_info *insn)
> +    {
> +      return crtl->ssa->change_alloc<insn_change> (attempt,
> +						   insn,
> +						   insn_change::DELETE);
> +    };
> +
> +  if (*i1 > *i2)
> +    return false;
> +
> +  insn_info *first = (*i1 < *i2) ? i1 : i2;
> +  insn_info *second = (first == i1) ? i2 : i1;
> +
> +  insn_info *insns[2] = { first, second };
> +
> +  auto_vec<insn_change *, 4> changes (4);
> +  auto_vec<int, 2> tombstone_uids (2);
> +
> +  rtx pats[2] = {
> +    PATTERN (first->rtl ()),
> +    PATTERN (second->rtl ())
> +  };
> +
> +  use_array input_uses[2] = { first->uses (), second->uses () };
> +  def_array input_defs[2] = { first->defs (), second->defs () };
> +
> +  int changed_insn = -1;
> +  if (base.from_insn != -1)
> +    {
> +      // If we're not already using a shared base, we need
> +      // to re-write one of the accesses to use the base from
> +      // the other insn.
> +      gcc_checking_assert (base.from_insn == 0 || base.from_insn == 1);
> +      changed_insn = !base.from_insn;
> +
> +      rtx base_pat = pats[base.from_insn];
> +      rtx change_pat = pats[changed_insn];
> +      rtx base_mem = XEXP (base_pat, load_p);
> +      rtx change_mem = XEXP (change_pat, load_p);
> +
> +      const bool lower_base_p = (insns[base.from_insn] == i1);
> +      HOST_WIDE_INT adjust_amt = access_size;
> +      if (!lower_base_p)
> +	adjust_amt *= -1;
> +
> +      rtx change_reg = XEXP (change_pat, !load_p);
> +      machine_mode mode_for_mem = GET_MODE (change_mem);
> +      rtx effective_base = drop_writeback (base_mem);
> +      rtx new_mem = adjust_address_nv (effective_base,
> +				       mode_for_mem,
> +				       adjust_amt);
> +      rtx new_set = load_p
> +	? gen_rtx_SET (change_reg, new_mem)
> +	: gen_rtx_SET (new_mem, change_reg);
> +
> +      pats[changed_insn] = new_set;
> +
> +      auto keep_use = [&](use_info *u)
> +	{
> +	  return refers_to_regno_p (u->regno (), u->regno () + 1,
> +				    change_pat, &XEXP (change_pat, load_p));
> +	};
> +
> +      // Drop any uses that only occur in the old address.
> +      input_uses[changed_insn] = filter_accesses (attempt,
> +						  input_uses[changed_insn],
> +						  keep_use);
> +    }
> +
> +  rtx writeback_effect = NULL_RTX;
> +  if (writeback)
> +    writeback_effect = extract_writebacks (load_p, pats, changed_insn);
> +
> +  const auto base_regno = base.def->regno ();
> +
> +  if (base.from_insn == -1 && (writeback & 1))
> +    {
> +      // If the first of the candidate insns had a writeback form, we'll need to
> +      // drop the use of the updated base register from the second insn's uses.
> +      //
> +      // N.B. we needn't worry about the base register occurring as a store
> +      // operand, as we checked that there was no non-address true dependence
> +      // between the insns in try_fuse_pair.
> +      gcc_checking_assert (find_access (input_uses[1], base_regno));
> +      input_uses[1] = check_remove_regno_access (attempt,
> +						 input_uses[1],
> +						 base_regno);
> +    }
> +
> +  // Go through and drop uses that only occur in register notes,
> +  // as we won't be preserving those.
> +  for (int i = 0; i < 2; i++)
> +    {
> +      auto rti = insns[i]->rtl ();
> +      if (!REG_NOTES (rti))
> +	continue;
> +
> +      input_uses[i] = remove_note_accesses (attempt, input_uses[i]);
> +    }
> +
> +  // So far the patterns have been in instruction order,
> +  // now we want them in offset order.
> +  if (i1 != first)
> +    std::swap (pats[0], pats[1]);
> +
> +  poly_int64 offsets[2];
> +  for (int i = 0; i < 2; i++)
> +    {
> +      rtx mem = XEXP (pats[i], load_p);
> +      gcc_checking_assert (MEM_P (mem));
> +      rtx base = strip_offset (XEXP (mem, 0), offsets + i);
> +      gcc_checking_assert (REG_P (base));
> +      gcc_checking_assert (base_regno == REGNO (base));
> +    }
> +
> +  // If either of the original insns had writeback, but the resulting pair insn
> +  // does not (can happen e.g. in the lxvp edge case above, or if the writeback
> +  // effects cancel out), then drop the def(s) of the base register as
> +  // appropriate.
> +  //
> +  // Also drop the first def in the case that both of the original insns had
> +  // writeback.  The second def could well have uses, but the first def should
> +  // only be used by the second insn (and we dropped that use above).
> +  for (int i = 0; i < 2; i++)
> +    if ((!writeback_effect && (writeback & (1 << i)))
> +	|| (i == 0 && writeback == 3))
> +      input_defs[i] = check_remove_regno_access (attempt,
> +						 input_defs[i],
> +						 base_regno);
> +
> +  // If we don't currently have a writeback pair, and we don't have
> +  // a load that clobbers the base register, look for a trailing destructive
> +  // update of the base register and try and fold it in to make this into a
> +  // writeback pair.
> +  insn_info *trailing_add = nullptr;

I can see you've dropped the call to find_trailing_add here.

If you don't want to handle writeback pairs, that's fine, but I think
that should be controlled with a target-controlled virtual function.

Effectively you can just replace uses of the param aarch64_ldp_writeback with a
call to such a function, then the target can control which writeback
opportunities we handle, if any.  For aarch64 this would just return
the result of the param for the time being.

> +
> +  rtx reg_notes = combine_reg_notes (first, second, load_p);
> +
> +  rtx pair_pat = NULL_RTX;
> +
> +  if (load_p) {
> +    pair_pat = rs6000_gen_load_pair ((first->rtl ()), (second->rtl ()));
> +    if (pair_pat == NULL_RTX)
> +      return false;
> +  }
> +  insn_change *pair_change = nullptr;
> +  auto set_pair_pat = [pair_pat,reg_notes](insn_change *change) {
> +    rtx_insn *rti = change->insn ()->rtl ();
> +    validate_unshare_change (rti, &PATTERN (rti), pair_pat, true);
> +    validate_change (rti, &REG_NOTES (rti), reg_notes, true);
> +  };
> +
> +  if (load_p)
> +    {
> +      changes.quick_push (make_delete (first));
> +      pair_change = make_change (second);
> +      changes.quick_push (pair_change);
> +
> +      pair_change->move_range = move_range;
> +      pair_change->new_defs = merge_access_arrays (attempt,
> +						   input_defs[0],
> +						   input_defs[1]);
> +      gcc_assert (pair_change->new_defs.is_valid ());
> +
> +      pair_change->new_uses
> +	= merge_access_arrays (attempt,
> +			       drop_memory_access (input_uses[0]),
> +			       drop_memory_access (input_uses[1]));
> +      gcc_assert (pair_change->new_uses.is_valid ());
> +      set_pair_pat (pair_change);
> +    }
> +  else
> +    {
> +      insn_info *store_to_change = decide_store_strategy (first, second,
> +							move_range);
> +
> +      if (store_to_change && dump_file)
> +	fprintf (dump_file, "  stvx: re-purposing store %d\n",
> +		 store_to_change->uid ());
> +
> +      insn_change *change;
> +      for (int i = 0; i < 2; i++)
> +	{
> +	  change = make_change (insns[i]);
> +	  if (insns[i] == store_to_change)
> +	    {
> +	      set_pair_pat (change);
> +	      change->new_uses = merge_access_arrays (attempt,
> +						      input_uses[0],
> +						      input_uses[1]);
> +	      auto d1 = drop_memory_access (input_defs[0]);
> +	      auto d2 = drop_memory_access (input_defs[1]);
> +	      change->new_defs = merge_access_arrays (attempt, d1, d2);
> +	      gcc_assert (change->new_defs.is_valid ());
> +	      def_info *stxvp_def = memory_access (store_to_change->defs ());
> +	      change->new_defs = insert_access (attempt,
> +						stxvp_def,
> +						change->new_defs);
> +	      gcc_assert (change->new_defs.is_valid ());
> +	      change->move_range = move_range;
> +	      pair_change = change;
> +	    }
> +	  else
> +	    {
> +	      // Note that we are turning this insn into a tombstone,
> +	      // we need to keep track of these if we go ahead with the
> +	      // change.
> +	      tombstone_uids.quick_push (insns[i]->uid ());
> +	      rtx_insn *rti = insns[i]->rtl ();
> +	      validate_change (rti, &PATTERN (rti), gen_tombstone (), true);
> +	      validate_change (rti, &REG_NOTES (rti), NULL_RTX, true);
> +	      change->new_uses = use_array (nullptr, 0);
> +	    }
> +	  gcc_assert (change->new_uses.is_valid ());
> +	  changes.quick_push (change);
> +	}
> +
> +      if (!store_to_change)
> +	{
> +	  // Tricky case.  Cannot re-purpose existing insns for stp.
> +	  // Need to insert new insn.
> +	  if (dump_file)
> +	    fprintf (dump_file,
> +		     "  stp fusion: cannot re-purpose candidate stores\n");
> +
> +	  auto new_insn = crtl->ssa->create_insn (attempt, INSN, pair_pat);
> +	  change = make_change (new_insn);
> +	  change->move_range = move_range;
> +	  change->new_uses = merge_access_arrays (attempt,
> +						  input_uses[0],
> +						  input_uses[1]);
> +	  gcc_assert (change->new_uses.is_valid ());
> +
> +	  auto d1 = drop_memory_access (input_defs[0]);
> +	  auto d2 = drop_memory_access (input_defs[1]);
> +	  change->new_defs = merge_access_arrays (attempt, d1, d2);
> +	  gcc_assert (change->new_defs.is_valid ());
> +
> +	  auto new_set = crtl->ssa->create_set (attempt, new_insn, memory);
> +	  change->new_defs = insert_access (attempt, new_set,
> +					    change->new_defs);
> +	  gcc_assert (change->new_defs.is_valid ());
> +	  changes.safe_insert (1, change);
> +	  pair_change = change;
> +	}
> +    }
> +
> +  if (trailing_add)
> +    changes.quick_push (make_delete (trailing_add));
> +
> +  auto n_changes = changes.length ();
> +  gcc_checking_assert (n_changes >= 2 && n_changes <= 4);
> +
> +  gcc_assert (crtl->ssa->verify_insn_changes (changes));
> +
> +  cancel_changes (0);

Can you explain why you changed this to unconditionally cancel the
changes instead of letting RTL-SSA perform them?

> +
> +  confirm_change_group ();
> +
> +  gcc_checking_assert (tombstone_uids.length () <= 2);
> +  for (auto uid : tombstone_uids)
> +    track_tombstone (uid);
> +
> +  return true;
> +}
> +
> +void fusion_bb (bb_info *bb)
> +{

Likewise, I think this function should be in generic code, not in target
code.  Note for aarch64 we have:

  const bool track_loads
    = aarch64_tune_params.ldp_policy_model != AARCH64_LDP_STP_POLICY_NEVER;
  const bool track_stores
    = aarch64_tune_params.stp_policy_model != AARCH64_LDP_STP_POLICY_NEVER;

I think these can just be hidden behind target-controlled virtual
functions, so you'd have something like:

bool pair_fusion::should_fuse_stores ();
bool pair_fusion::should_fuse_loads ();

(or a single function that returns multiple bitfields if you like).

then for aarch64 you would return:

aarch64_tune_params.ldp_policy_model != AARCH64_LDP_STP_POLICY_NEVER;

for the load case, and for rs6000 you would return true (just based on
what's below).  Presumably since you don't handle stores below, you'd
want to return false in the store case for rs6000 code (at least
initially).

As I mentioned above, the aarch64 version of this function also tries
to promote existing pairs to writeback pairs, again that would be
controlled by the same function call that replaces uses of
the param aarch64_ldp_writeback.

> +  fusion_bb_info bb_state (bb);
> +
> +  for (auto insn : bb->nondebug_insns ())
> +    {
> +      rtx_insn *rti = insn->rtl ();
> +
> +      if (!rti || !INSN_P (rti))
> +	continue;
> +
> +      rtx pat = PATTERN (rti);
> +
> +      if (GET_CODE (pat) != SET)
> +	continue;
> +
> +      if (MEM_P (XEXP (pat, 1)))
> +	bb_state.track_access (insn, true, XEXP (pat, 1));
> +    }
> +
> +  bb_state.transform ();
> +  bb_state.cleanup_tombstones ();
> +}
> +
> +static void
> +fusion_init ()
> +{
> +  df_set_flags (DF_RD_PRUNE_DEAD_DEFS);
> +  df_chain_add_problem (DF_DU_CHAIN | DF_UD_CHAIN);
> +  df_analyze ();
> +  df_set_flags (DF_DEFER_INSN_RESCAN);
> +  /* Rebuild ud- and du-chains.  */
> +  df_remove_problem (df_chain);
> +  df_process_deferred_rescans ();
> +  df_set_flags (DF_RD_PRUNE_DEAD_DEFS);
> +  df_chain_add_problem (DF_DU_CHAIN | DF_UD_CHAIN);
> +  df_analyze ();
> +  df_set_flags (DF_DEFER_INSN_RESCAN);

Is there a reason you can't use RTL-SSA for the def-use information?
IMO it would be cleaner to use RTL-SSA as the existing aarch64 pass does
(and the refactored generic code presumably will do).

> +  calculate_dominance_info (CDI_DOMINATORS);
> +  crtl->ssa = new rtl_ssa::function_info (cfun);
> +}
> +
> +static void
> +fusion_destroy ()
> +{
> +  if (crtl->ssa->perform_pending_updates ())
> +    cleanup_cfg (0);
> +
> +  free_dominance_info (CDI_DOMINATORS);
> +  delete crtl->ssa;
> +  crtl->ssa = nullptr;
> +}
> +
> +// Iterate over the accesses in GROUP, looking for adjacent sets
> +// of accesses.  If we find two sets of adjacent accesses, call
> +// merge_pairs.
> +void
> +fusion_bb_info::transform_for_base (int encoded_lfs,
> +				 access_group &group)

Again, I think this function should be in generic code.

> +{
> +  const auto lfs = decode_lfs (encoded_lfs);
> +  const unsigned access_size = lfs.size;
> +
> +  bool skip_next = true;
> +  access_record *prev_access = nullptr;
> +
> +  for (auto &access : group.list)
> +    {
> +      if (skip_next)
> +	skip_next = false;
> +      else if (known_eq (access.offset, prev_access->offset + access_size))
> +	{
> +	  merge_pairs (prev_access->cand_insns,
> +		       access.cand_insns,
> +		       lfs.load_p,
> +		       access_size);
> +	  skip_next = true;
> +	}
> +      prev_access = &access;
> +    }
> +}
> +
> +void rs6000_analyze_vecload ()
> +{
> +  fusion_init ();
> +
> +  for (auto bb : crtl->ssa->bbs ())
> +    fusion_bb (bb);
> +
> +  fusion_destroy ();
> +}
> +
> +const pass_data pass_data_analyze_vecload =
> +{
> +  RTL_PASS, /* type */
> +  "vecload", /* name */
> +  OPTGROUP_NONE, /* optinfo_flags */
> +  TV_NONE, /* tv_id */
> +  0, /* properties_required */
> +  0, /* properties_provided */
> +  0, /* properties_destroyed */
> +  0, /* todo_flags_start */
> +  TODO_df_finish, /* todo_flags_finish */
> +};
> +
> +class pass_analyze_vecload : public rtl_opt_pass
> +{
> +public:
> +  pass_analyze_vecload(gcc::context *ctxt)
> +    : rtl_opt_pass(pass_data_analyze_vecload, ctxt)
> +  {}
> +
> +  /* opt_pass methods: */
> +  bool gate (function *)
> +    {
> +      return (optimize > 0 && TARGET_VSX && TARGET_POWER10);
> +    }
> +
> +  unsigned int execute (function *) final override
> +    {
> +      rs6000_analyze_vecload ();
> +      return 0;
> +    }
> +}; // class pass_analyze_vecload
> +
> +rtl_opt_pass *
> +make_pass_analyze_vecload (gcc::context *ctxt)
> +{
> +  return new pass_analyze_vecload (ctxt);
> +}
> +
> diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
> index 5d975dab921..e36946c44ea 100644
> --- a/gcc/config/rs6000/rs6000.cc
> +++ b/gcc/config/rs6000/rs6000.cc
> @@ -1178,6 +1178,7 @@ static bool rs6000_secondary_reload_move (enum rs6000_reg_type,
>  					  secondary_reload_info *,
>  					  bool);
>  rtl_opt_pass *make_pass_analyze_swaps (gcc::context*);
> +rtl_opt_pass *make_pass_analyze_vecload (gcc::context*);
>  
>  /* Hash table stuff for keeping track of TOC entries.  */
>  
> diff --git a/gcc/config/rs6000/t-rs6000 b/gcc/config/rs6000/t-rs6000
> index b3ce09d523b..7664830f8ab 100644
> --- a/gcc/config/rs6000/t-rs6000
> +++ b/gcc/config/rs6000/t-rs6000
> @@ -35,6 +35,11 @@ rs6000-p8swap.o: $(srcdir)/config/rs6000/rs6000-p8swap.cc
>  	$(COMPILE) $<
>  	$(POSTCOMPILE)
>  
> +rs6000-vecload-opt.o: $(srcdir)/config/rs6000/rs6000-vecload-opt.cc
> +	$(COMPILE) $<
> +	$(POSTCOMPILE)
> +
> +
>  rs6000-d.o: $(srcdir)/config/rs6000/rs6000-d.cc
>  	$(COMPILE) $<
>  	$(POSTCOMPILE)
> diff --git a/gcc/emit-rtl.cc b/gcc/emit-rtl.cc
> index 1856fa4884f..ffc47a6eaa0 100644
> --- a/gcc/emit-rtl.cc
> +++ b/gcc/emit-rtl.cc
> @@ -921,7 +921,7 @@ validate_subreg (machine_mode omode, machine_mode imode,
>      return false;
>  
>    /* The subreg offset cannot be outside the inner object.  */
> -  if (maybe_ge (offset, isize))
> +  if (maybe_gt (offset, isize))
>      return false;

Can you explain why this change is needed?

>  
>    poly_uint64 regsize = REGMODE_NATURAL_SIZE (imode);
> @@ -3035,18 +3035,6 @@ verify_rtx_sharing (rtx orig, rtx insn)
>        break;
>      }
>  
> -  /* This rtx may not be shared.  If it has already been seen,
> -     replace it with a copy of itself.  */
> -  if (flag_checking && RTX_FLAG (x, used))
> -    {
> -      error ("invalid rtl sharing found in the insn");
> -      debug_rtx (insn);
> -      error ("shared rtx");
> -      debug_rtx (x);
> -      internal_error ("internal consistency failure");
> -    }
> -  gcc_assert (!RTX_FLAG (x, used));
> -

I think the need to make this change is likely indicative of a problem
in the pass, and the pass should be fixed instead of changing
verify_rtx_sharing.

>    RTX_FLAG (x, used) = 1;
>  
>    /* Now scan the subexpressions recursively.  */
> diff --git a/gcc/fusion-common.h b/gcc/fusion-common.h
> new file mode 100644
> index 00000000000..82f1d6ef032
> --- /dev/null
> +++ b/gcc/fusion-common.h
> @@ -0,0 +1,1195 @@
> +// Common infrastructure for Load/Store Pair fusion optimization pass.
> +// Copyright (C) 2023-2024 Free Software Foundation, Inc.
> +//
> +// This file is part of GCC.
> +//
> +// GCC is free software; you can redistribute it and/or modify it
> +// under the terms of the GNU General Public License as published by
> +// the Free Software Foundation; either version 3, or (at your option)
> +// any later version.
> +//
> +// GCC is distributed in the hope that it will be useful, but
> +// WITHOUT ANY WARRANTY; without even the implied warranty of
> +// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +// General Public License for more details.
> +//
> +// You should have received a copy of the GNU General Public License
> +// along with GCC; see the file COPYING3.  If not see
> +// <http://www.gnu.org/licenses/>.
> +
> +#define INCLUDE_ALGORITHM
> +#define INCLUDE_FUNCTIONAL
> +#define INCLUDE_LIST
> +#define INCLUDE_TYPE_TRAITS
> +#include "config.h"
> +#include "system.h"
> +#include "coretypes.h"
> +#include "backend.h"
> +#include "rtl.h"
> +#include "df.h"
> +#include "rtl-iter.h"
> +#include "rtl-ssa.h"
> +#include "cfgcleanup.h"
> +#include "tree-pass.h"
> +#include "ordered-hash-map.h"
> +#include "tree-dfa.h"
> +#include "fold-const.h"
> +#include "tree-hash-traits.h"
> +#include "print-tree.h"
> +#include "insn-attr.h"
> +
> +using namespace rtl_ssa;
> +
> +static constexpr HOST_WIDE_INT LDP_IMM_BITS = 7;
> +static constexpr HOST_WIDE_INT LDP_IMM_SIGN_BIT = (1 << (LDP_IMM_BITS - 1));
> +static constexpr HOST_WIDE_INT LDP_MAX_IMM = LDP_IMM_SIGN_BIT - 1;
> +static constexpr HOST_WIDE_INT LDP_MIN_IMM = -LDP_MAX_IMM - 1;

Note we would not want these aarch64-specific constants in generic code,
they would need to be target-configurable (in fact we might just want a
virtual function / target hook to validate offsets instead).

> +
> +// We pack these fields (load_p, fpsimd_p, and size) into an integer
> +// (LFS) which we use as part of the key into the main hash tables.
> +//
> +// The idea is that we group candidates together only if they agree on
> +// the fields below.  Candidates that disagree on any of these
> +// properties shouldn't be merged together.
> +struct lfs_fields
> +{
> +  bool load_p;
> +  bool fpsimd_p;
> +  unsigned size;
> +};
> +
> +using insn_list_t = std::list<insn_info *>;
> +using insn_iter_t = insn_list_t::iterator;
> +
> +// Information about the accesses at a given offset from a particular
> +// base.  Stored in an access_group, see below.
> +struct access_record
> +{
> +  poly_int64 offset;
> +  std::list<insn_info *> cand_insns;
> +  std::list<access_record>::iterator place;
> +
> +  access_record (poly_int64 off) : offset (off) {}
> +};
> +
> +// A group of accesses where adjacent accesses could be ldp/stp
> +// or lxvp/stxvp candidates.  The splay tree supports efficient

Perhaps better to spell out "load pair or store pair" instead of listing
the various instruction mnenomics for different targets.

> +// insertion, while the list supports efficient iteration.
> +struct access_group
> +{
> +  splay_tree<access_record *> tree;
> +  std::list<access_record> list;
> +
> +  template<typename Alloc>
> +  inline void track (Alloc node_alloc, poly_int64 offset, insn_info *insn);
> +};
> +
> +// Information about a potential base candidate, used in try_fuse_pair.
> +// There may be zero, one, or two viable RTL bases for a given pair.
> +struct base_cand
> +{
> +  // DEF is the def of the base register to be used by the pair.
> +  def_info *def;
> +
> +  // FROM_INSN is -1 if the base candidate is already shared by both
> +  // candidate insns.  Otherwise it holds the index of the insn from
> +  // which the base originated.
> +  //
> +  // In the case that the base is shared, either DEF is already used
> +  // by both candidate accesses, or both accesses see different versions
> +  // of the same regno, in which case DEF is the def consumed by the
> +  // first candidate access.
> +  int from_insn;
> +
> +  // To form a pair, we do so by moving the first access down and the second
> +  // access up.  To determine where to form the pair, and whether or not
> +  // it is safe to form the pair, we track instructions which cannot be
> +  // re-ordered past due to either dataflow or alias hazards.
> +  //
> +  // Since we allow changing the base used by an access, the choice of
> +  // base can change which instructions act as re-ordering hazards for
> +  // this pair (due to different dataflow).  We store the initial
> +  // dataflow hazards for this choice of base candidate in HAZARDS.
> +  //
> +  // These hazards act as re-ordering barriers to each candidate insn
> +  // respectively, in program order.
> +  //
> +  // Later on, when we take alias analysis into account, we narrow
> +  // HAZARDS accordingly.
> +  insn_info *hazards[2];
> +
> +  base_cand (def_info *def, int insn)
> +    : def (def), from_insn (insn), hazards {nullptr, nullptr} {}
> +
> +  base_cand (def_info *def) : base_cand (def, -1) {}
> +
> +  // Test if this base candidate is viable according to HAZARDS.
> +  bool viable () const
> +  {
> +    return !hazards[0] || !hazards[1] || (*hazards[0] > *hazards[1]);
> +  }
> +};
> +
> +// Information about an alternate base.  For a def_info D, it may
> +// instead be expressed as D = BASE + OFFSET.
> +struct alt_base
> +{
> +  def_info *base;
> +  poly_int64 offset;
> +};
> +
> +// State used by the pass for a given basic block.
> +struct fusion_bb_info
> +{
> +  using def_hash = nofree_ptr_hash<def_info>;
> +  using expr_key_t = pair_hash<tree_operand_hash, int_hash<int, -1, -2>>;
> +  using def_key_t = pair_hash<def_hash, int_hash<int, -1, -2>>;
> +
> +  // Map of <tree base, LFS> -> access_group.
> +  ordered_hash_map<expr_key_t, access_group> expr_map;
> +
> +  // Map of <RTL-SSA def_info *, LFS> -> access_group.
> +  ordered_hash_map<def_key_t, access_group> def_map;
> +
> +  // Given the def_info for an RTL base register, express it as an offset from
> +  // some canonical base instead.
> +  //
> +  // Canonicalizing bases in this way allows us to identify adjacent accesses
> +  // even if they see different base register defs.
> +  hash_map<def_hash, alt_base> canon_base_map;
> +
> +  static const size_t obstack_alignment = sizeof (void *);
> +  bb_info *m_bb;
> +
> +  fusion_bb_info (bb_info *bb) : m_bb (bb), m_emitted_tombstone (false)
> +  {
> +    obstack_specify_allocation (&m_obstack, OBSTACK_CHUNK_SIZE,
> +				obstack_alignment, obstack_chunk_alloc,
> +				obstack_chunk_free);
> +  }
> +  ~fusion_bb_info ()
> +  {
> +    obstack_free (&m_obstack, nullptr);
> +
> +    if (m_emitted_tombstone)
> +      {
> +	bitmap_release (&m_tombstone_bitmap);
> +	bitmap_obstack_release (&m_bitmap_obstack);
> +      }
> +  }
> +
> +  inline void track_access (insn_info *, bool load, rtx mem);
> +  inline void transform ();
> +  inline void cleanup_tombstones ();
> +
> +private:
> +  obstack m_obstack;
> +
> +  // State for keeping track of tombstone insns emitted for this BB.
> +  bitmap_obstack m_bitmap_obstack;
> +  bitmap_head m_tombstone_bitmap;
> +  bool m_emitted_tombstone;
> +
> +  inline splay_tree_node<access_record *> *node_alloc (access_record *);
> +
> +  template<typename Map>
> +  inline void traverse_base_map (Map &map);
> +  inline void transform_for_base (int load_size, access_group &group);
> +
> +   void merge_pairs (insn_list_t &, insn_list_t &,
> +			   bool load_p, unsigned access_size);
> +
> +   bool try_fuse_pair (bool load_p, unsigned access_size,
> +			     insn_info *i1, insn_info *i2);
> +
> +   bool fuse_pair (bool load_p, unsigned access_size,
> +			 int writeback,
> +			 insn_info *i1, insn_info *i2,
> +			 base_cand &base,
> +			 const insn_range_info &move_range);
> +
> +  inline void track_tombstone (int uid);
> +
> +  inline bool track_via_mem_expr (insn_info *, rtx mem, lfs_fields lfs);
> +};
> +
> +splay_tree_node<access_record *> *fusion_bb_info::node_alloc (access_record *access)
> +{
> +  using T = splay_tree_node<access_record *>;
> +  void *addr = obstack_alloc (&m_obstack, sizeof (T));
> +  return new (addr) T (access);
> +}
> +
> +// Given a mem MEM, if the address has side effects, return a MEM that accesses
> +// the same address but without the side effects.  Otherwise, return
> +// MEM unchanged.
> +static rtx
> +drop_writeback (rtx mem)
> +{
> +  rtx addr = XEXP (mem, 0);
> +
> +  if (!side_effects_p (addr))
> +    return mem;
> +
> +  switch (GET_CODE (addr))
> +    {
> +    case PRE_MODIFY:
> +      addr = XEXP (addr, 1);
> +      break;
> +    case POST_MODIFY:
> +    case POST_INC:
> +    case POST_DEC:
> +      addr = XEXP (addr, 0);
> +      break;
> +    case PRE_INC:
> +    case PRE_DEC:
> +    {
> +      poly_int64 adjustment = GET_MODE_SIZE (GET_MODE (mem));
> +      if (GET_CODE (addr) == PRE_DEC)
> +	adjustment *= -1;
> +      addr = plus_constant (GET_MODE (addr), XEXP (addr, 0), adjustment);
> +      break;
> +    }
> +    default:
> +      gcc_unreachable ();
> +    }
> +
> +  return change_address (mem, GET_MODE (mem), addr);
> +}
> +
> +// Convenience wrapper around strip_offset that can also look through
> +// RTX_AUTOINC addresses.  The interface is like strip_offset except we take a
> +// MEM so that we know the mode of the access.
> +static rtx
> +load_strip_offset (rtx mem, poly_int64 *offset)

Nit: naming.  How about pair_fusion_strip_offset instead?

> +{
> +  rtx addr = XEXP (mem, 0);
> +
> +  switch (GET_CODE (addr))
> +    {
> +    case PRE_MODIFY:
> +    case POST_MODIFY:
> +      addr = strip_offset (XEXP (addr, 1), offset);
> +      gcc_checking_assert (REG_P (addr));
> +      gcc_checking_assert (rtx_equal_p (XEXP (XEXP (mem, 0), 0), addr));
> +      break;
> +    case PRE_INC:
> +    case POST_INC:
> +      addr = XEXP (addr, 0);
> +      *offset = GET_MODE_SIZE (GET_MODE (mem));
> +      gcc_checking_assert (REG_P (addr));
> +      break;
> +    case PRE_DEC:
> +    case POST_DEC:
> +      addr = XEXP (addr, 0);
> +      *offset = -GET_MODE_SIZE (GET_MODE (mem));
> +      gcc_checking_assert (REG_P (addr));
> +      break;
> +
> +    default:
> +      addr = strip_offset (addr, offset);
> +    }
> +
> +  return addr;
> +}
> +
> +// Return true if X is a PRE_{INC,DEC,MODIFY} rtx.
> +static bool
> +any_pre_modify_p (rtx x)
> +{
> +  const auto code = GET_CODE (x);
> +  return code == PRE_INC || code == PRE_DEC || code == PRE_MODIFY;
> +}
> +
> +// Return true if X is a POST_{INC,DEC,MODIFY} rtx.
> +static bool
> +any_post_modify_p (rtx x)
> +{
> +  const auto code = GET_CODE (x);
> +  return code == POST_INC || code == POST_DEC || code == POST_MODIFY;
> +}
> +// Given LFS (load_p, fpsimd_p, size) fields in FIELDS, encode these
> +// into an integer for use as a hash table key.
> +static int
> +encode_lfs (lfs_fields fields)
> +{
> +  int size_log2 = exact_log2 (fields.size);
> +  //gcc_checking_assert (size_log2 >= 2 && size_log2 <= 4);
> +  return ((int)fields.load_p << 3)
> +    | ((int)fields.fpsimd_p << 2)
> +    | (size_log2 - 2);
> +}
> +
> +// Inverse of encode_lfs.
> +static lfs_fields
> +decode_lfs (int lfs)
> +{
> +  bool load_p = (lfs & (1 << 3));
> +  bool fpsimd_p = (lfs & (1 << 2));
> +  unsigned size = 1U << ((lfs & 3) + 2);
> +  return { load_p, fpsimd_p, size };
> +}
> +// Track the access INSN at offset OFFSET in this access group.
> +// ALLOC_NODE is used to allocate splay tree nodes.
> +template<typename Alloc>
> +void
> +access_group::track (Alloc alloc_node, poly_int64 offset, insn_info *insn)
> +{
> +  auto insert_before = [&](std::list<access_record>::iterator after)
> +    {
> +      auto it = list.emplace (after, offset);
> +      it->cand_insns.push_back (insn);
> +      it->place = it;
> +      return &*it;
> +    };
> +
> +  if (!list.size ())
> +    {
> +      auto access = insert_before (list.end ());
> +      tree.insert_max_node (alloc_node (access));
> +      return;
> +    }
> +
> +  auto compare = [&](splay_tree_node<access_record *> *node)
> +    {
> +      return compare_sizes_for_sort (offset, node->value ()->offset);
> +    };
> +  auto result = tree.lookup (compare);
> +  splay_tree_node<access_record *> *node = tree.root ();
> +  if (result == 0)
> +    node->value ()->cand_insns.push_back (insn);
> +  else
> +    {
> +      auto it = node->value ()->place;
> +      auto after = (result > 0) ? std::next (it) : it;
> +      auto access = insert_before (after);
> +      tree.insert_child (node, result > 0, alloc_node (access));
> +    }
> +}
> +// Given a candidate access INSN (with mem MEM), see if it has a suitable
> +// MEM_EXPR base (i.e. a tree decl) relative to which we can track the access.
> +// LFS is used as part of the key to the hash table, see track_access.
> +bool
> +fusion_bb_info::track_via_mem_expr (insn_info *insn, rtx mem, lfs_fields lfs)
> +{
> +  if (!MEM_EXPR (mem) || !MEM_OFFSET_KNOWN_P (mem))
> +    return false;
> +
> +  poly_int64 offset;
> +  tree base_expr = get_addr_base_and_unit_offset (MEM_EXPR (mem),
> +						  &offset);
> +  if (!base_expr || !DECL_P (base_expr))
> +    return false;
> +
> +  offset += MEM_OFFSET (mem);
> +
> +  const machine_mode mem_mode = GET_MODE (mem);
> +  const HOST_WIDE_INT mem_size = GET_MODE_SIZE (mem_mode);//.to_constant ();

Why the .to_constant () comment here?

I hope these comments are useful.  Please feel free to reach out if you have any
questions / get stuck with anything.  I'm often around on IRC.

Thanks,
Alex

> +
> +  // Punt on misaligned offsets.  LDP/STP instructions require offsets to be a
> +  // multiple of the access size, and we believe that misaligned offsets on
> +  // MEM_EXPR bases are likely to lead to misaligned offsets w.r.t. RTL bases.
> +  if (!multiple_p (offset, mem_size))
> +    return false;
> +
> +  const auto key = std::make_pair (base_expr, encode_lfs (lfs));
> +  access_group &group = expr_map.get_or_insert (key, NULL);
> +  auto alloc = [&](access_record *access) { return node_alloc (access); };
> +  group.track (alloc, offset, insn);
> +
> +  if (dump_file)
> +    {
> +      fprintf (dump_file, "[bb %u] tracking insn %d via ",
> +	       m_bb->index (), insn->uid ());
> +      print_node_brief (dump_file, "mem expr", base_expr, 0);
> +      fprintf (dump_file, " [L=%d FP=%d, %smode, off=",
> +	       lfs.load_p, lfs.fpsimd_p, mode_name[mem_mode]);
> +      print_dec (offset, dump_file);
> +      fprintf (dump_file, "]\n");
> +    }
> +
> +  return true;
> +}
> +
> +// Dummy predicate that never ignores any insns.
> +static bool no_ignore (insn_info *) { return false; }
> +
> +// Return the latest dataflow hazard before INSN.
> +//
> +// If IGNORE is non-NULL, this points to a sub-rtx which we should ignore for
> +// dataflow purposes.  This is needed when considering changing the RTL base of
> +// an access discovered through a MEM_EXPR base.
> +//
> +// If IGNORE_INSN is non-NULL, we should further ignore any hazards arising
> +// from that insn.
> +//
> +// N.B. we ignore any defs/uses of memory here as we deal with that separately,
> +// making use of alias disambiguation.
> +static insn_info *
> +latest_hazard_before (insn_info *insn, rtx *ignore,
> +		      insn_info *ignore_insn = nullptr)
> +{
> +  insn_info *result = nullptr;
> +
> +  // If the insn can throw then it is at the end of a BB and we can't
> +  // move it, model this by recording a hazard in the previous insn
> +  // which will prevent moving the insn up.
> +  if (cfun->can_throw_non_call_exceptions
> +      && find_reg_note (insn->rtl (), REG_EH_REGION, NULL_RTX))
> +    return insn->prev_nondebug_insn ();
> +
> +  // Return true if we registered the hazard.
> +  auto hazard = [&](insn_info *h) -> bool
> +    {
> +      gcc_checking_assert (*h < *insn);
> +      if (h == ignore_insn)
> +	return false;
> +
> +      if (!result || *h > *result)
> +	result = h;
> +
> +      return true;
> +    };
> +
> +  rtx pat = PATTERN (insn->rtl ());
> +  auto ignore_use = [&](use_info *u)
> +    {
> +      if (u->is_mem ())
> +	return true;
> +
> +      return !refers_to_regno_p (u->regno (), u->regno () + 1, pat, ignore);
> +    };
> +
> +  // Find defs of uses in INSN (RaW).
> +  for (auto use : insn->uses ())
> +    if (!ignore_use (use) && use->def ())
> +      hazard (use->def ()->insn ());
> +
> +  // Find previous defs (WaW) or previous uses (WaR) of defs in INSN.
> +  for (auto def : insn->defs ())
> +    {
> +      if (def->is_mem ())
> +	continue;
> +
> +      if (def->prev_def ())
> +	{
> +	  hazard (def->prev_def ()->insn ()); // WaW
> +
> +	  auto set = dyn_cast<set_info *> (def->prev_def ());
> +	  if (set && set->has_nondebug_insn_uses ())
> +	    for (auto use : set->reverse_nondebug_insn_uses ())
> +	      if (use->insn () != insn && hazard (use->insn ())) // WaR
> +		break;
> +	}
> +
> +      if (!HARD_REGISTER_NUM_P (def->regno ()))
> +	continue;
> +
> +      // Also need to check backwards for call clobbers (WaW).
> +      for (auto call_group : def->ebb ()->call_clobbers ())
> +	{
> +	  if (!call_group->clobbers (def->resource ()))
> +	    continue;
> +
> +	  auto clobber_insn = prev_call_clobbers_ignoring (*call_group,
> +							   def->insn (),
> +							   no_ignore);
> +	  if (clobber_insn)
> +	    hazard (clobber_insn);
> +	}
> +
> +    }
> +
> +  return result;
> +}
> +
> +// Return the first dataflow hazard after INSN.
> +//
> +// If IGNORE is non-NULL, this points to a sub-rtx which we should ignore for
> +// dataflow purposes.  This is needed when considering changing the RTL base of
> +// an access discovered through a MEM_EXPR base.
> +//
> +// N.B. we ignore any defs/uses of memory here as we deal with that separately,
> +// making use of alias disambiguation.
> +static insn_info *
> +first_hazard_after (insn_info *insn, rtx *ignore)
> +{
> +  insn_info *result = nullptr;
> +  auto hazard = [insn, &result](insn_info *h)
> +    {
> +      gcc_checking_assert (*h > *insn);
> +      if (!result || *h < *result)
> +	result = h;
> +    };
> +
> +  rtx pat = PATTERN (insn->rtl ());
> +  auto ignore_use = [&](use_info *u)
> +    {
> +      if (u->is_mem ())
> +	return true;
> +
> +      return !refers_to_regno_p (u->regno (), u->regno () + 1, pat, ignore);
> +    };
> +
> +  for (auto def : insn->defs ())
> +    {
> +      if (def->is_mem ())
> +	continue;
> +
> +      if (def->next_def ())
> +	hazard (def->next_def ()->insn ()); // WaW
> +
> +      auto set = dyn_cast<set_info *> (def);
> +      if (set && set->has_nondebug_insn_uses ())
> +	hazard (set->first_nondebug_insn_use ()->insn ()); // RaW
> +
> +      if (!HARD_REGISTER_NUM_P (def->regno ()))
> +	continue;
> +
> +      // Also check for call clobbers of this def (WaW).
> +      for (auto call_group : def->ebb ()->call_clobbers ())
> +	{
> +	  if (!call_group->clobbers (def->resource ()))
> +	    continue;
> +
> +	  auto clobber_insn = next_call_clobbers_ignoring (*call_group,
> +							   def->insn (),
> +							   no_ignore);
> +	  if (clobber_insn)
> +	    hazard (clobber_insn);
> +	}
> +    }
> +
> +  // Find any subsequent defs of uses in INSN (WaR).
> +  for (auto use : insn->uses ())
> +    {
> +      if (ignore_use (use))
> +	continue;
> +
> +      if (use->def ())
> +	{
> +	  auto def = use->def ()->next_def ();
> +	  if (def && def->insn () == insn)
> +	    def = def->next_def ();
> +
> +	  if (def)
> +	    hazard (def->insn ());
> +	}
> +
> +      if (!HARD_REGISTER_NUM_P (use->regno ()))
> +	continue;
> +
> +      // Also need to handle call clobbers of our uses (again WaR).
> +      //
> +      // See restrict_movement_for_uses_ignoring for why we don't
> +      // need to check backwards for call clobbers.
> +      for (auto call_group : use->ebb ()->call_clobbers ())
> +	{
> +	  if (!call_group->clobbers (use->resource ()))
> +	    continue;
> +
> +	  auto clobber_insn = next_call_clobbers_ignoring (*call_group,
> +							   use->insn (),
> +							   no_ignore);
> +	  if (clobber_insn)
> +	    hazard (clobber_insn);
> +	}
> +    }
> +
> +  return result;
> +}
> +
> +// Return true iff R1 and R2 overlap.
> +static bool
> +ranges_overlap_p (const insn_range_info &r1, const insn_range_info &r2)
> +{
> +  // If either range is empty, then their intersection is empty.
> +  if (!r1 || !r2)
> +    return false;
> +
> +  // When do they not overlap? When one range finishes before the other
> +  // starts, i.e. (*r1.last < *r2.first || *r2.last < *r1.first).
> +  // Inverting this, we get the below.
> +  return *r1.last >= *r2.first && *r2.last >= *r1.first;
> +}
> +
> +// Get the range of insns that def feeds.
> +static insn_range_info get_def_range (def_info *def)
> +{
> +  insn_info *last = def->next_def ()->insn ()->prev_nondebug_insn ();
> +  return { def->insn (), last };
> +}
> +
> +// Given a def (of memory), return the downwards range within which we
> +// can safely move this def.
> +static insn_range_info
> +def_downwards_move_range (def_info *def)
> +{
> +  auto range = get_def_range (def);
> +
> +  auto set = dyn_cast<set_info *> (def);
> +  if (!set || !set->has_any_uses ())
> +    return range;
> +
> +  auto use = set->first_nondebug_insn_use ();
> +  if (use)
> +    range = move_earlier_than (range, use->insn ());
> +
> +  return range;
> +}
> +
> +// Given a def (of memory), return the upwards range within which we can
> +// safely move this def.
> +static insn_range_info
> +def_upwards_move_range (def_info *def)
> +{
> +  def_info *prev = def->prev_def ();
> +  insn_range_info range { prev->insn (), def->insn () };
> +
> +  auto set = dyn_cast<set_info *> (prev);
> +  if (!set || !set->has_any_uses ())
> +    return range;
> +
> +  auto use = set->last_nondebug_insn_use ();
> +  if (use)
> +    range = move_later_than (range, use->insn ());
> +
> +  return range;
> +}
> +
> +// Given candidate store insns FIRST and SECOND, see if we can re-purpose one
> +// of them (together with its def of memory) for the stp/stxvp insn. If so,
> +//  return that insn.  Otherwise, return null.
> +static insn_info *
> +decide_store_strategy (insn_info *first,
> +		     insn_info *second,
> +		     const insn_range_info &move_range)
> +{
> +  def_info * const defs[2] = {
> +    memory_access (first->defs ()),
> +    memory_access (second->defs ())
> +  };
> +
> +  if (move_range.includes (first)
> +      || ranges_overlap_p (move_range, def_downwards_move_range (defs[0])))
> +    return first;
> +
> +  if (move_range.includes (second)
> +      || ranges_overlap_p (move_range, def_upwards_move_range (defs[1])))
> +    return second;
> +
> +  return nullptr;
> +}
> +
> +// Generate the RTL pattern for a "tombstone"; used temporarily during this pass
> +// to replace stores that are marked for deletion where we can't immediately
> +// delete the store (since there are uses of mem hanging off the store).
> +//
> +// These are deleted at the end of the pass and uses re-parented appropriately
> +// at this point.
> +static rtx
> +gen_tombstone (void)
> +{
> +  return gen_rtx_CLOBBER (VOIDmode,
> +			  gen_rtx_MEM (BLKmode, gen_rtx_SCRATCH (Pmode)));
> +}
> +// Go through the reg notes rooted at NOTE, dropping those that we should drop,
> +// and preserving those that we want to keep by prepending them to (and
> +// returning) RESULT.  EH_REGION is used to make sure we have at most one
> +// REG_EH_REGION note in the resulting list.  FR_EXPR is used to return any
> +// REG_FRAME_RELATED_EXPR note we find, as these can need special handling in
> +// combine_reg_notes.
> +static rtx
> +filter_notes (rtx note, rtx result, bool *eh_region, rtx *fr_expr)
> +{
> +  for (; note; note = XEXP (note, 1))
> +    {
> +      switch (REG_NOTE_KIND (note))
> +	{
> +	case REG_DEAD:
> +	  // REG_DEAD notes aren't required to be maintained.
> +	case REG_EQUAL:
> +	case REG_EQUIV:
> +	case REG_UNUSED:
> +	case REG_NOALIAS:
> +	  // These can all be dropped.  For REG_EQU{AL,IV} they cannot apply to
> +	  // non-single_set insns, and REG_UNUSED is re-computed by RTl-SSA, see
> +	  // rtl-ssa/changes.cc:update_notes.
> +	  //
> +	  // Similarly, REG_NOALIAS cannot apply to a parallel.
> +	case REG_INC:
> +	  // When we form the pair insn, the reg update is implemented
> +	  // as just another SET in the parallel, so isn't really an
> +	  // auto-increment in the RTL sense, hence we drop the note.
> +	  break;
> +	case REG_EH_REGION:
> +	  gcc_assert (!*eh_region);
> +	  *eh_region = true;
> +	  result = alloc_reg_note (REG_EH_REGION, XEXP (note, 0), result);
> +	  break;
> +	case REG_CFA_DEF_CFA:
> +	case REG_CFA_OFFSET:
> +	case REG_CFA_RESTORE:
> +	  result = alloc_reg_note (REG_NOTE_KIND (note),
> +				   copy_rtx (XEXP (note, 0)),
> +				   result);
> +	  break;
> +	case REG_FRAME_RELATED_EXPR:
> +	  gcc_assert (!*fr_expr);
> +	  *fr_expr = copy_rtx (XEXP (note, 0));
> +	  break;
> +	default:
> +	  // Unexpected REG_NOTE kind.
> +	  gcc_unreachable ();
> +	}
> +    }
> +
> +  return result;
> +}
> +
> +// Return the notes that should be attached to a combination of I1 and I2, where
> +// *I1 < *I2.  LOAD_P is true for loads.
> +static rtx
> +combine_reg_notes (insn_info *i1, insn_info *i2, bool load_p)
> +{
> +  // Temporary storage for REG_FRAME_RELATED_EXPR notes.
> +  rtx fr_expr[2] = {};
> +
> +  bool found_eh_region = false;
> +  rtx result = NULL_RTX;
> +  result = filter_notes (REG_NOTES (i2->rtl ()), result,
> +			 &found_eh_region, fr_expr);
> +  result = filter_notes (REG_NOTES (i1->rtl ()), result,
> +			 &found_eh_region, fr_expr + 1);
> +
> +  if (!load_p)
> +    {
> +      // Simple frame-related sp-relative saves don't need CFI notes, but when
> +      // we combine them into an stp/stxvp we will need a CFI note as dwarf2cfi
> +      //  can't interpret the unspec pair representation directly.
> +      if (RTX_FRAME_RELATED_P (i1->rtl ()) && !fr_expr[0])
> +	fr_expr[0] = copy_rtx (PATTERN (i1->rtl ()));
> +      if (RTX_FRAME_RELATED_P (i2->rtl ()) && !fr_expr[1])
> +	fr_expr[1] = copy_rtx (PATTERN (i2->rtl ()));
> +    }
> +
> +  rtx fr_pat = NULL_RTX;
> +  if (fr_expr[0] && fr_expr[1])
> +    {
> +      // Combining two frame-related insns, need to construct
> +      // a REG_FRAME_RELATED_EXPR note which represents the combined
> +      // operation.
> +      RTX_FRAME_RELATED_P (fr_expr[1]) = 1;
> +      fr_pat = gen_rtx_PARALLEL (VOIDmode,
> +				 gen_rtvec (2, fr_expr[0], fr_expr[1]));
> +    }
> +  else
> +    fr_pat = fr_expr[0] ? fr_expr[0] : fr_expr[1];
> +
> +  if (fr_pat)
> +    result = alloc_reg_note (REG_FRAME_RELATED_EXPR,
> +			     fr_pat, result);
> +
> +  return result;
> +}
> +
> +// Given two memory accesses in PATS, at least one of which is of a
> +// writeback form, extract two non-writeback memory accesses addressed
> +// relative to the initial value of the base register, and output these
> +// in PATS.  Return an rtx that represents the overall change to the
> +// base register.
> +static rtx
> +extract_writebacks (bool load_p, rtx pats[2], int changed)
> +{
> +  rtx base_reg = NULL_RTX;
> +  poly_int64 current_offset = 0;
> +
> +  poly_int64 offsets[2];
> +
> +  for (int i = 0; i < 2; i++)
> +    {
> +      rtx mem = XEXP (pats[i], load_p);
> +      rtx reg = XEXP (pats[i], !load_p);
> +
> +      rtx addr = XEXP (mem, 0);
> +      const bool autoinc_p = GET_RTX_CLASS (GET_CODE (addr)) == RTX_AUTOINC;
> +
> +      poly_int64 offset;
> +      rtx this_base = load_strip_offset (mem, &offset);
> +      gcc_assert (REG_P (this_base));
> +      if (base_reg)
> +	gcc_assert (rtx_equal_p (base_reg, this_base));
> +      else
> +	base_reg = this_base;
> +
> +      // If we changed base for the current insn, then we already
> +      // derived the correct mem for this insn from the effective
> +      // address of the other access.
> +      if (i == changed)
> +	{
> +	  gcc_checking_assert (!autoinc_p);
> +	  offsets[i] = offset;
> +	  continue;
> +	}
> +
> +      if (autoinc_p && any_pre_modify_p (addr))
> +	current_offset += offset;
> +
> +      poly_int64 this_off = current_offset;
> +      if (!autoinc_p)
> +	this_off += offset;
> +
> +      offsets[i] = this_off;
> +      rtx new_mem = change_address (mem, GET_MODE (mem),
> +				    plus_constant (GET_MODE (base_reg),
> +						   base_reg, this_off));
> +      pats[i] = load_p
> +	? gen_rtx_SET (reg, new_mem)
> +	: gen_rtx_SET (new_mem, reg);
> +
> +      if (autoinc_p && any_post_modify_p (addr))
> +	current_offset += offset;
> +    }
> +
> +  if (known_eq (current_offset, 0))
> +    return NULL_RTX;
> +
> +  return gen_rtx_SET (base_reg, plus_constant (GET_MODE (base_reg),
> +					       base_reg, current_offset));
> +}
> +
> +// We just emitted a tombstone with uid UID, track it in a bitmap for
> +// this BB so we can easily identify it later when cleaning up tombstones.
> +void
> +fusion_bb_info::track_tombstone (int uid)
> +{
> +  if (!m_emitted_tombstone)
> +    {
> +      // Lazily initialize the bitmap for tracking tombstone insns.
> +      bitmap_obstack_initialize (&m_bitmap_obstack);
> +      bitmap_initialize (&m_tombstone_bitmap, &m_bitmap_obstack);
> +      m_emitted_tombstone = true;
> +    }
> +
> +  if (!bitmap_set_bit (&m_tombstone_bitmap, uid))
> +    gcc_unreachable (); // Bit should have changed.
> +}
> +
> +// Virtual base class for load/store walkers used in alias analysis.
> +struct alias_walker
> +{
> +  virtual bool conflict_p (int &budget) const = 0;
> +  virtual insn_info *insn () const = 0;
> +  virtual bool valid () const  = 0;
> +  virtual void advance () = 0;
> +};
> +
> +// Implement some common functionality used by both store_walker
> +// and load_walker.
> +template<bool reverse>
> +class def_walker : public alias_walker
> +{
> +protected:
> +  using def_iter_t = typename std::conditional<reverse,
> +	reverse_def_iterator, def_iterator>::type;
> +
> +  static use_info *start_use_chain (def_iter_t &def_iter)
> +  {
> +    set_info *set = nullptr;
> +    for (; *def_iter; def_iter++)
> +      {
> +	set = dyn_cast<set_info *> (*def_iter);
> +	if (!set)
> +	  continue;
> +
> +	use_info *use = reverse
> +	  ? set->last_nondebug_insn_use ()
> +	  : set->first_nondebug_insn_use ();
> +
> +	if (use)
> +	  return use;
> +      }
> +
> +    return nullptr;
> +  }
> +
> +  def_iter_t def_iter;
> +  insn_info *limit;
> +  def_walker (def_info *def, insn_info *limit) :
> +    def_iter (def), limit (limit) {}
> +
> +  virtual bool iter_valid () const { return *def_iter; }
> +
> +public:
> +  insn_info *insn () const override { return (*def_iter)->insn (); }
> +  void advance () override { def_iter++; }
> +  bool valid () const override final
> +  {
> +    if (!iter_valid ())
> +      return false;
> +
> +    if (reverse)
> +      return *(insn ()) > *limit;
> +    else
> +      return *(insn ()) < *limit;
> +  }
> +};
> +
> +// alias_walker that iterates over stores.
> +template<bool reverse, typename InsnPredicate>
> +class store_walker : public def_walker<reverse>
> +{
> +  rtx cand_mem;
> +  InsnPredicate tombstone_p;
> +
> +public:
> +  store_walker (def_info *mem_def, rtx mem, insn_info *limit_insn,
> +		InsnPredicate tombstone_fn) :
> +    def_walker<reverse> (mem_def, limit_insn),
> +    cand_mem (mem), tombstone_p (tombstone_fn) {}
> +
> +  bool conflict_p (int &budget) const override final
> +  {
> +    if (tombstone_p (this->insn ()))
> +      return false;
> +
> +    return store_modifies_mem_p (cand_mem, this->insn (), budget);
> +  }
> +};
> +
> +// alias_walker that iterates over loads.
> +template<bool reverse>
> +class load_walker : public def_walker<reverse>
> +{
> +  using Base = def_walker<reverse>;
> +  using use_iter_t = typename std::conditional<reverse,
> +	reverse_use_iterator, nondebug_insn_use_iterator>::type;
> +
> +  use_iter_t use_iter;
> +  insn_info *cand_store;
> +
> +  bool iter_valid () const override final { return *use_iter; }
> +
> +public:
> +  void advance () override final
> +  {
> +    use_iter++;
> +    if (*use_iter)
> +      return;
> +    this->def_iter++;
> +    use_iter = Base::start_use_chain (this->def_iter);
> +  }
> +
> +  insn_info *insn () const override final
> +  {
> +    return (*use_iter)->insn ();
> +  }
> +
> +  bool conflict_p (int &budget) const override final
> +  {
> +    return load_modified_by_store_p (insn (), cand_store, budget);
> +  }
> +
> +  load_walker (def_info *def, insn_info *store, insn_info *limit_insn)
> +    : Base (def, limit_insn),
> +      use_iter (Base::start_use_chain (this->def_iter)),
> +      cand_store (store) {}
> +};
> +
> +// Given INSNS (in program order) which are known to be adjacent, look
> +// to see if either insn has a suitable RTL (register) base that we can
> +// use to form a pair.  Push these to BASE_CANDS if we find any.  CAND_MEMs
> +// gives the relevant mems from the candidate insns, ACCESS_SIZE gives the
> +// size of a single candidate access, and REVERSED says whether the accesses
> +// are inverted in offset order.
> +//
> +// Returns an integer where bit (1 << i) is set if INSNS[i] uses writeback
> +// addressing.
> +static int
> +get_viable_bases (insn_info *insns[2],
> +		  vec<base_cand> &base_cands,
> +		  rtx cand_mems[2],
> +		  unsigned access_size,
> +		  bool reversed)
> +{
> +  // We discovered this pair through a common base.  Need to ensure that
> +  // we have a common base register that is live at both locations.
> +  def_info *base_defs[2] = {};
> +  int writeback = 0;
> +  for (int i = 0; i < 2; i++)
> +    {
> +      const bool is_lower = (i == reversed);
> +      poly_int64 poly_off;
> +      rtx base = load_strip_offset (cand_mems[i], &poly_off);
> +      if (GET_RTX_CLASS (GET_CODE (XEXP (cand_mems[i], 0))) == RTX_AUTOINC)
> +	writeback |= (1 << i);
> +
> +      if (!REG_P (base) || !poly_off.is_constant ())
> +	continue;
> +
> +      // Punt on accesses relative to eliminable regs.  See the comment in
> +      // fusion_bb_info::track_access for a detailed explanation of this.
> +      if (!reload_completed
> +	  && (REGNO (base) == FRAME_POINTER_REGNUM
> +	      || REGNO (base) == ARG_POINTER_REGNUM))
> +	continue;
> +
> +      HOST_WIDE_INT base_off = poly_off.to_constant ();
> +
> +      // It should be unlikely that we ever punt here, since MEM_EXPR offset
> +      // alignment should be a good proxy for register offset alignment.
> +      if (base_off % access_size != 0)
> +	{
> +	  if (dump_file)
> +	    fprintf (dump_file,
> +		     "base not viable, offset misaligned (insn %d)\n",
> +		     insns[i]->uid ());
> +	  continue;
> +	}
> +
> +      base_off /= access_size;
> +
> +      if (!is_lower)
> +	base_off--;
> +
> +      if (base_off < LDP_MIN_IMM || base_off > LDP_MAX_IMM)
> +	continue;
> +
> +      use_info *use = find_access (insns[i]->uses (), REGNO (base));
> +      gcc_assert (use);
> +      base_defs[i] = use->def ();
> +    }
> +
> +  if (!base_defs[0] && !base_defs[1])
> +    {
> +      if (dump_file)
> +	fprintf (dump_file, "no viable base register for pair (%d,%d)\n",
> +		 insns[0]->uid (), insns[1]->uid ());
> +      return writeback;
> +    }
> +
> +  for (int i = 0; i < 2; i++)
> +    if ((writeback & (1 << i)) && !base_defs[i])
> +      {
> +	if (dump_file)
> +	  fprintf (dump_file, "insn %d has writeback but base isn't viable\n",
> +		   insns[i]->uid ());
> +	return writeback;
> +      }
> +
> +  if (writeback == 3
> +      && base_defs[0]->regno () != base_defs[1]->regno ())
> +    {
> +      if (dump_file)
> +	fprintf (dump_file,
> +		 "pair (%d,%d): double writeback with distinct regs (%d,%d): "
> +		 "punting\n",
> +		 insns[0]->uid (), insns[1]->uid (),
> +		 base_defs[0]->regno (), base_defs[1]->regno ());
> +      return writeback;
> +    }
> +
> +  if (base_defs[0] && base_defs[1]
> +      && base_defs[0]->regno () == base_defs[1]->regno ())
> +    {
> +      // Easy case: insns already share the same base reg.
> +      base_cands.quick_push (base_defs[0]);
> +      return writeback;
> +    }
> +
> +  // Otherwise, we know that one of the bases must change.
> +  //
> +  // Note that if there is writeback we must use the writeback base
> +  // (we know now there is exactly one).
> +  for (int i = 0; i < 2; i++)
> +    if (base_defs[i] && (!writeback || (writeback & (1 << i))))
> +      base_cands.quick_push (base_cand { base_defs[i], i });
> +
> +  return writeback;
> +}
> +
> +static void
> +dump_insn_list (FILE *f, const insn_list_t &l)
> +{
> +  fprintf (f, "(");
> +
> +  auto i = l.begin ();
> +  auto end = l.end ();
> +
> +  if (i != end)
> +    fprintf (f, "%d", (*i)->uid ());
> +  i++;
> +
> +  for (; i != end; i++)
> +    fprintf (f, ", %d", (*i)->uid ());
> +
> +  fprintf (f, ")");
> +}
> +
> +DEBUG_FUNCTION void
> +debug (const insn_list_t &l)
> +{
> +  dump_insn_list (stderr, l);
> +  fprintf (stderr, "\n");
> +}
> +
> +// If we emitted tombstone insns for this BB, iterate through the BB
> +// and remove all the tombstone insns, being sure to reparent any uses
> +// of mem to previous defs when we do this.
> +void
> +fusion_bb_info::cleanup_tombstones ()
> +{
> +  // No need to do anything if we didn't emit a tombstone insn for this BB.
> +  if (!m_emitted_tombstone)
> +    return;
> +
> +  insn_info *insn = m_bb->head_insn ();
> +  while (insn)
> +    {
> +      insn_info *next = insn->next_nondebug_insn ();
> +      if (!insn->is_real ()
> +	  || !bitmap_bit_p (&m_tombstone_bitmap, insn->uid ()))
> +	{
> +	  insn = next;
> +	  continue;
> +	}
> +
> +      auto def = memory_access (insn->defs ());
> +      auto set = dyn_cast<set_info *> (def);
> +      if (set && set->has_any_uses ())
> +	{
> +	  def_info *prev_def = def->prev_def ();
> +	  auto prev_set = dyn_cast<set_info *> (prev_def);
> +	  if (!prev_set)
> +	    gcc_unreachable ();
> +
> +	  while (set->first_use ())
> +	    crtl->ssa->reparent_use (set->first_use (), prev_set);
> +	}
> +
> +      // Now set has no uses, we can delete it.
> +      insn_change change (insn, insn_change::DELETE);
> +      //df_insn_rescan (insn->rtl());
> +      crtl->ssa->change_insn (change);
> +      insn = next;
> +    }
> +}
> +
> +template<typename Map>
> +void
> +fusion_bb_info::traverse_base_map (Map &map)
> +{
> +  for (auto kv : map)
> +    {
> +      const auto &key = kv.first;
> +      auto &value = kv.second;
> +      transform_for_base (key.second, value);
> +    }
> +}
> +
> +void
> +fusion_bb_info::transform ()
> +{
> +  traverse_base_map (expr_map);
> +  traverse_base_map (def_map);
> +}
> diff --git a/gcc/testsuite/g++.target/powerpc/vecload.C b/gcc/testsuite/g++.target/powerpc/vecload.C
> new file mode 100644
> index 00000000000..c523572cf3c
> --- /dev/null
> +++ b/gcc/testsuite/g++.target/powerpc/vecload.C
> @@ -0,0 +1,15 @@
> +/* { dg-do compile } */ 
> +/* { dg-require-effective-target power10_ok } */
> +/* { dg-options "-mdejagnu-cpu=power10 -O2" } */ 
> +
> +#include <altivec.h>
> +
> +void
> +foo (__vector_quad *dst, vector unsigned char *ptr, vector unsigned char src)
> +{
> +  __vector_quad acc;
> +  __builtin_mma_xvf32ger(&acc, src, ptr[0]);
> +  __builtin_mma_xvf32gerpp(&acc, src, ptr[1]);
> +  *dst = acc;
> +}
> +/* { dg-final { scan-assembler {\mlxvp\M} } } */
> diff --git a/gcc/testsuite/g++.target/powerpc/vecload1.C b/gcc/testsuite/g++.target/powerpc/vecload1.C
> new file mode 100644
> index 00000000000..d10ff0cdf36
> --- /dev/null
> +++ b/gcc/testsuite/g++.target/powerpc/vecload1.C
> @@ -0,0 +1,22 @@
> +/* { dg-do compile } */
> +/* { dg-require-effective-target power10_ok } */
> +/* { dg-options "-mdejagnu-cpu=power10 -O2" } */
> +
> +#include <altivec.h>
> +	
> +void
> +foo2 ()
> +{
> +  __vector_quad *dst1;
> +  __vector_quad *dst2;
> +  vector unsigned char src;
> +  __vector_quad acc;
> +  vector unsigned char *ptr;
> +  __builtin_mma_xvf32ger(&acc, src, ptr[0]);
> +  __builtin_mma_xvf32gerpp(&acc, src, ptr[1]);
> +  *dst1 = acc;
> +  __builtin_mma_xvf32ger(&acc, src, ptr[2]);
> +  __builtin_mma_xvf32gerpp(&acc, src, ptr[3]);
> +  *dst2 = acc;
> +}
> +/* { dg-final { scan-assembler {\mlxvp\M} } } */
> diff --git a/gcc/testsuite/gcc.target/powerpc/mma-builtin-1.c b/gcc/testsuite/gcc.target/powerpc/mma-builtin-1.c
> index 69ee826e1be..ae29127f954 100644
> --- a/gcc/testsuite/gcc.target/powerpc/mma-builtin-1.c
> +++ b/gcc/testsuite/gcc.target/powerpc/mma-builtin-1.c
> @@ -258,8 +258,8 @@ foo13b (__vector_quad *dst, __vector_quad *src, vec_t *vec)
>    dst[13] = acc;
>  }
>  
> -/* { dg-final { scan-assembler-times {\mlxv\M} 40 } } */
> -/* { dg-final { scan-assembler-times {\mlxvp\M} 12 } } */
> +/* { dg-final { scan-assembler-times {\mlxv\M} 0 } } */
> +/* { dg-final { scan-assembler-times {\mlxvp\M} 32 } } */
>  /* { dg-final { scan-assembler-times {\mstxvp\M} 40 } } */
>  /* { dg-final { scan-assembler-times {\mxxmfacc\M} 20 } } */
>  /* { dg-final { scan-assembler-times {\mxxmtacc\M} 6 } } */
> -- 
> 2.39.3
>
Michael Meissner Jan. 30, 2024, 9:45 a.m. UTC | #2
On Sun, Jan 21, 2024 at 07:57:54PM +0530, Ajit Agarwal wrote:
> 
> Hello All:
> 
> New pass to replace adjacent memory addresses lxv with lxvp.
> Added common infrastructure for load store fusion for
> different targets.
> 
> Common routines are refactored in fusion-common.h.
> 
> AARCH64 load/store fusion pass is not changed with the 
> common infrastructure.
> 
> For AARCH64 architectures just include "fusion-common.h"
> and target dependent code can be added to that.
> 
> 
> Alex/Richard:
> 
> If you would like me to add for AARCH64 I can do that for AARCH64.
> 
> If you would like to do that is fine with me.
> 
> Bootstrapped and regtested with powerpc64-linux-gnu.
> 
> Improvement in performance is seen with Spec 2017 spec FP benchmarks.

This patch is a lot better than the previous patch in that it generates fewer
extra instructions, and just replaces some of the load vector instructions with
load vector pair.

In compiling Spec 2017 with it, I see the following results:

Benchmarks that generate lxvp instead of lxv:

	500.perlbench_r		replace 10 LXVs with  5 LXVPs
	502.gcc_r		replace  2 LXVs with  1 LXVPs
	510.parest_r		replace 28 LXVs with 14 LXVPs
	511.povray_r		replace  4 LXVs with  2 LXVPs
	521.wrf_r		replace 12 LXVs with  6 LXVPs
	527.cam4_r		replace 12 LXVs with  6 LXVPs
	557.xz_r		replace 10 LXVs with  5 LXVPs

A few of the benchmarks generated a different number of NOPs, based on how
prefixed addresses were generated.  I tend to feel this is minor compared to
the others.

	 507.cactuBSSN_r	 17 fewer alignment NOPs
	 520.omnetpp_r		231 more  alignment NOPs
	 523.xalancbmk_r	246 fewer alignment NOPs
	 531.deepsjeng_r	  2 more  alignment NOPs
	 541.leela_r		 28 more  alignment NOPs
	 549.fotonik3d_r	 27 more  alignment NOPs
	 554.roms_r		  8 more  alignment NOPs

However there were three benchmarks where the code regressed.  In particular,
it looks like there are more load and store vectors to the stack, so it
indicates more spilling is going on.

	525.x264_r		16 more  stack spills, but  84 LXVPs
	526.blender_r		 4 more  stack spills, but 149 LXVPs

One benchmark actually generated fewer stack spills as well as generating
LXVPs.

	538.imagick_r		11 fewer stack spills, and  26 LXVPs

Note, these are changes to the static instructions generated.  It does not
evaluate whether the changes help/hurt performance.
Ajit Agarwal Jan. 31, 2024, 10:12 a.m. UTC | #3
Hello Alex:

Thanks for your valuable review comments.

I am incorporating the comments and would send the patch with rs6000 and
AARCH64 changes.

Thanks & Regards
Ajit

On 24/01/24 10:13 pm, Alex Coplan wrote:
> Hi Ajit,
> 
> On 21/01/2024 19:57, Ajit Agarwal wrote:
>>
>> Hello All:
>>
>> New pass to replace adjacent memory addresses lxv with lxvp.
>> Added common infrastructure for load store fusion for
>> different targets.
> 
> Thanks for this, it would be nice to see the load/store pair pass
> generalized to multiple targets.
> 
> I assume you are targeting GCC 15 for this, as we are in stage 4 at
> the moment?
> 
>>
>> Common routines are refactored in fusion-common.h.
>>
>> AARCH64 load/store fusion pass is not changed with the 
>> common infrastructure.
> 
> I think any patch to generalize the load/store pair fusion pass should
> update the aarch64 code at the same time to use the generic
> infrastructure, instead of duplicating the code.
> 
> As a general comment, I think we should move as much of the code as
> possible to target-independent code, with only the bits that are truly
> target-specific (e.g. deciding which modes to allow for a load/store
> pair operand) in target code.
> 
> In terms of structuring the interface between generic code and target
> code, I think it would be pragmatic to use a class with (in some cases,
> pure) virtual functions that can be overriden by targets to implement
> any target-specific behaviour.
> 
> IMO the generic class should be implemented in its own .cc instead of
> using a header-only approach.  The target code would then define a
> derived class which overrides the virtual functions (where necessary)
> declared in the generic class, and then instantiate the derived class to
> create a target-customized instance of the pass.
> 
> A more traditional GCC approach would be to use optabs and target hooks
> to customize the behaviour of the pass to handle target-specific
> aspects, but:
>  - Target hooks are quite heavyweight, and we'd potentially have to add
>    quite a few hooks just for one pass that (at least initially) will
>    only be used by a couple of targets.
>  - Using classes allows both sides to easily maintain their own state
>    and share that state where appropriate.
> 
> Nit on naming: I understand you want to move away from ldp_fusion, but
> how about pair_fusion or mem_pair_fusion instead of just "fusion" as a
> base name?  IMO just "fusion" isn't very clear as to what the pass is
> trying to achieve.
> 
> In general the code could do with a lot more commentary to explain the
> rationale for various things / explain the high-level intent of the
> code.
> 
> Unfortunately I'm not familiar with the DF framework (I've only really
> worked with RTL-SSA for the aarch64 pass), so I haven't commented on the
> use of that framework, but it would be nice if what you're trying to do
> could be done using RTL-SSA instead of using DF directly.
> 
> Hopefully Richard S can chime in on those aspects.
> 
> My main concerns with the patch at the moment (apart from the code
> duplication) is that it looks like:
> 
>  - The patch removes alias analysis from try_fuse_pair, which is unsafe.
>  - The patch tries to make its own RTL changes inside
>    rs6000_gen_load_pair, but it should let fuse_pair make those changes
>    using RTL-SSA instead.
> 
> I've left some more specific (but still mostly high-level) comments below.
> 
>>
>> For AARCH64 architectures just include "fusion-common.h"
>> and target dependent code can be added to that.
>>
>>
>> Alex/Richard:
>>
>> If you would like me to add for AARCH64 I can do that for AARCH64.
>>
>> If you would like to do that is fine with me.
>>
>> Bootstrapped and regtested with powerpc64-linux-gnu.
>>
>> Improvement in performance is seen with Spec 2017 spec FP benchmarks.
>>
>> Thanks & Regards
>> Ajit
>>
>> rs6000: New  pass for replacement of adjacent lxv with lxvp.
> 
> Are you looking to handle stores eventually, out of interest?  Looking
> at rs6000-vecload-opt.cc:fusion_bb it looks like you're just handling
> loads at the moment.
> 
>>
>> New pass to replace adjacent memory addresses lxv with lxvp.
>> Added common infrastructure for load store fusion for
>> different targets.
>>
>> Common routines are refactored in fusion-common.h.
> 
> I've just done a very quick scan through this file as it mostly just
> looks to be idential to existing code in aarch64-ldp-fusion.cc.
> 
>>
>> 2024-01-21  Ajit Kumar Agarwal  <aagarwa1@linux.ibm.com>
>>
>> gcc/ChangeLog:
>>
>> 	* config/rs6000/rs6000-passes.def: New vecload pass
>> 	before pass_early_remat.
>> 	* config/rs6000/rs6000-vecload-opt.cc: Add new pass.
>> 	* config.gcc: Add new executable.
>> 	* config/rs6000/rs6000-protos.h: Add new prototype for vecload
>> 	pass.
>> 	* config/rs6000/rs6000.cc: Add new prototype for vecload pass.
>> 	* config/rs6000/t-rs6000: Add new rule.
>> 	* fusion-common.h: Add common infrastructure for load store
>> 	fusion that can be shared across different architectures.
>> 	* emit-rtl.cc: Modify assert code.
>>
>> gcc/testsuite/ChangeLog:
>>
>> 	* g++.target/powerpc/vecload.C: New test.
>> 	* g++.target/powerpc/vecload1.C: New test.
>> 	* gcc.target/powerpc/mma-builtin-1.c: Modify test.
>> ---
>>  gcc/config.gcc                                |    4 +-
>>  gcc/config/rs6000/rs6000-passes.def           |    3 +
>>  gcc/config/rs6000/rs6000-protos.h             |    1 +
>>  gcc/config/rs6000/rs6000-vecload-opt.cc       | 1186 ++++++++++++++++
>>  gcc/config/rs6000/rs6000.cc                   |    1 +
>>  gcc/config/rs6000/t-rs6000                    |    5 +
>>  gcc/emit-rtl.cc                               |   14 +-
>>  gcc/fusion-common.h                           | 1195 +++++++++++++++++
>>  gcc/testsuite/g++.target/powerpc/vecload.C    |   15 +
>>  gcc/testsuite/g++.target/powerpc/vecload1.C   |   22 +
>>  .../gcc.target/powerpc/mma-builtin-1.c        |    4 +-
>>  11 files changed, 2433 insertions(+), 17 deletions(-)
>>  create mode 100644 gcc/config/rs6000/rs6000-vecload-opt.cc
>>  create mode 100644 gcc/fusion-common.h
>>  create mode 100644 gcc/testsuite/g++.target/powerpc/vecload.C
>>  create mode 100644 gcc/testsuite/g++.target/powerpc/vecload1.C
>>
>> diff --git a/gcc/config.gcc b/gcc/config.gcc
>> index 00355509c92..9bff42cf830 100644
>> --- a/gcc/config.gcc
>> +++ b/gcc/config.gcc
>> @@ -518,7 +518,7 @@ or1k*-*-*)
>>  	;;
>>  powerpc*-*-*)
>>  	cpu_type=rs6000
>> -	extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o"
>> +	extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o rs6000-vecload-opt.o"
>>  	extra_objs="${extra_objs} rs6000-call.o rs6000-pcrel-opt.o"
>>  	extra_objs="${extra_objs} rs6000-builtins.o rs6000-builtin.o"
>>  	extra_headers="ppc-asm.h altivec.h htmintrin.h htmxlintrin.h"
>> @@ -555,7 +555,7 @@ riscv*)
>>  	;;
>>  rs6000*-*-*)
>>  	extra_options="${extra_options} g.opt fused-madd.opt rs6000/rs6000-tables.opt"
>> -	extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o"
>> +	extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o rs6000-vecload-opt.o"
>>  	extra_objs="${extra_objs} rs6000-call.o rs6000-pcrel-opt.o"
>>  	target_gtfiles="$target_gtfiles \$(srcdir)/config/rs6000/rs6000-logue.cc \$(srcdir)/config/rs6000/rs6000-call.cc"
>>  	target_gtfiles="$target_gtfiles \$(srcdir)/config/rs6000/rs6000-pcrel-opt.cc"
>> diff --git a/gcc/config/rs6000/rs6000-passes.def b/gcc/config/rs6000/rs6000-passes.def
>> index 46a0d0b8c56..eb4a65ebe10 100644
>> --- a/gcc/config/rs6000/rs6000-passes.def
>> +++ b/gcc/config/rs6000/rs6000-passes.def
>> @@ -28,6 +28,9 @@ along with GCC; see the file COPYING3.  If not see
>>       The power8 does not have instructions that automaticaly do the byte swaps
>>       for loads and stores.  */
>>    INSERT_PASS_BEFORE (pass_cse, 1, pass_analyze_swaps);
>> +  /* Pass to replace adjacent memory addresses lxv instruction with lxvp
>> +     instruction.  */
>> +  INSERT_PASS_BEFORE (pass_early_remat, 1, pass_analyze_vecload);
>>  
>>    /* Pass to do the PCREL_OPT optimization that combines the load of an
>>       external symbol's address along with a single load or store using that
>> diff --git a/gcc/config/rs6000/rs6000-protos.h b/gcc/config/rs6000/rs6000-protos.h
>> index 09a57a806fa..f0a9f36602e 100644
>> --- a/gcc/config/rs6000/rs6000-protos.h
>> +++ b/gcc/config/rs6000/rs6000-protos.h
>> @@ -343,6 +343,7 @@ namespace gcc { class context; }
>>  class rtl_opt_pass;
>>  
>>  extern rtl_opt_pass *make_pass_analyze_swaps (gcc::context *);
>> +extern rtl_opt_pass *make_pass_analyze_vecload (gcc::context *);
>>  extern rtl_opt_pass *make_pass_pcrel_opt (gcc::context *);
>>  extern bool rs6000_sum_of_two_registers_p (const_rtx expr);
>>  extern bool rs6000_quadword_masked_address_p (const_rtx exp);
>> diff --git a/gcc/config/rs6000/rs6000-vecload-opt.cc b/gcc/config/rs6000/rs6000-vecload-opt.cc
>> new file mode 100644
>> index 00000000000..cebadad97d8
>> --- /dev/null
>> +++ b/gcc/config/rs6000/rs6000-vecload-opt.cc
>> @@ -0,0 +1,1186 @@
>> +/* Subroutines used to replace lxv with lxvp
>> +   for TARGET_POWER10 and TARGET_VSX,
>> +
>> +   Copyright (C) 2020-2023 Free Software Foundation, Inc.
>> +   Contributed by Ajit Kumar Agarwal <aagarwa1@linux.ibm.com>.
>> +
>> +   This file is part of GCC.
>> +
>> +   GCC is free software; you can redistribute it and/or modify it
>> +   under the terms of the GNU General Public License as published
>> +   by the Free Software Foundation; either version 3, or (at your
>> +   option) any later version.
>> +
>> +   GCC is distributed in the hope that it will be useful, but WITHOUT
>> +   ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
>> +   or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public
>> +   License for more details.
>> +
>> +   You should have received a copy of the GNU General Public License
>> +   along with GCC; see the file COPYING3.  If not see
>> +   <http://www.gnu.org/licenses/>.  */
>> +
>> +#define IN_TARGET_CODE 1
>> +#define INCLUDE_ALGORITHM
>> +#define INCLUDE_FUNCTIONAL
>> +#define INCLUDE_LIST
>> +#define INCLUDE_TYPE_TRAITS
>> +#include "config.h"
>> +#include "system.h"
>> +#include "coretypes.h"
>> +#include "backend.h"
>> +#include "target.h"
>> +#include "rtl.h"
>> +#include "memmodel.h"
>> +#include "emit-rtl.h"
>> +#include "tree-pass.h"
>> +#include "df.h"
>> +#include "dumpfile.h"
>> +#include "rs6000-internal.h"
>> +#include "rs6000-protos.h"
>> +#include "fusion-common.h"
>> +
>> +/* Return false if dependent rtx LOC is SUBREG.  */
>> +static bool
>> +is_feasible (rtx_insn *insn)
> 
> Can you explain the rationale for this function?
> I'm not really sure what it's trying to achieve.
> 
>> +{
>> +  df_ref use;
>> +  df_insn_info *insn_info = DF_INSN_INFO_GET (insn);
>> +  FOR_EACH_INSN_INFO_DEF (use, insn_info)
>> +    {
>> +      struct df_link *def_link = DF_REF_CHAIN (use);
>> +      if (!def_link || !def_link->ref || DF_REF_IS_ARTIFICIAL (def_link->ref))
>> +	continue;
>> +      while (def_link && def_link->ref)
>> +	{
>> +	  rtx *loc = DF_REF_LOC (def_link->ref);
>> +	  if (!loc || *loc == NULL_RTX)
>> +	    return false;
>> +	  if (GET_CODE (*loc) == SUBREG)
>> +	    return false;
>> +	  def_link = def_link->next;
>> +	}
>> +     }
>> +  return true;
>> +}
>> +
>> +/* df_scan_rescan the unspec instruction where operands
>> +   are reversed.  */
>> +void set_rescan_for_unspec (rtx_insn *insn)
>> +{
>> +  df_ref use;
>> +  df_insn_info *insn_info = DF_INSN_INFO_GET (insn);
>> +  rtx_insn *select_insn2;
>> +  FOR_EACH_INSN_INFO_DEF (use, insn_info)
>> +    {
>> +      struct df_link *def_link = DF_REF_CHAIN (use);
>> +      while (def_link && def_link->ref)
>> +	{
>> +	  select_insn2 = DF_REF_INSN (def_link->ref);
>> +	  rtx set = single_set (select_insn2);
>> +
>> +	  if (set == NULL_RTX)
>> +	    return;
>> +
>> +	  if (set != NULL_RTX)
>> +	    {
>> +	      rtx op0 = SET_SRC (set);
>> +	      if (GET_CODE (op0) != UNSPEC)
>> +		return;
>> +
>> +	      if (GET_CODE (op0) == VEC_SELECT
>> +		  && GET_CODE (XEXP (op0, 1)) == PARALLEL)
>> +		return;
>> +
>> +	      if (GET_CODE (op0) == UNSPEC)
>> +		df_insn_rescan (select_insn2);
>> +	    }
>> +	   def_link = def_link->next;
>> +	}
>> +     }
>> +}
>> +
>> +/* Return dependent UNSPEC instruction.  */
>> +rtx_insn *get_rtx_UNSPEC (rtx_insn *insn)
>> +{
>> +  df_ref use;
>> +  df_insn_info *insn_info = DF_INSN_INFO_GET (insn);
>> +  rtx_insn *select_insn2;
>> +  FOR_EACH_INSN_INFO_DEF (use, insn_info)
>> +    {
>> +      struct df_link *def_link = DF_REF_CHAIN (use);
>> +      while (def_link && def_link->ref)
>> +	{
>> +	  select_insn2 = DF_REF_INSN (def_link->ref);
>> +	  rtx set = single_set (select_insn2);
>> +
>> +	  if (set == NULL_RTX)
>> +	    return 0;
>> +
>> +	  if (set != NULL_RTX)
> 
> This condition is always true due to the early return above.
> 
>> +	    {
>> +	      rtx op0 = SET_SRC (set);
>> +
>> +	      if (GET_CODE (op0) == UNSPEC)
>> +		return select_insn2;
>> +	    }
>> +	   def_link = def_link->next;
>> +	}
>> +     }
>> +  return 0;
>> +}
>> +
>> +/* Replace identified lxv with lxvp.
>> +   Bail out if following condition are true:
>> +
>> +   - dependent instruction of load is vec_select instruction,
> 
> Please can you explain the rationale for this condition?
> 
> Do I understand correctly that you want to bail out if a use of a
> register loaded by either insn occurs in a vec_select rtx?  If so, why?
> 
> AIUI, since we shouldn't change register dataflow in the pass, I wonder
> if you could check this condition once in track_access, instead of
> checking it for every pair formation attempt?
> 
>> +
>> +   - machine mode of unspec is not same as machine mode
>> +     of lxv instruction.
>> +
>> +   - dependent instruction is not unspec.
> 
> Likewise, please explain the rationale for this.
> 
>> +
>> +   - Source operand of unspec is eq instruction.  */
> 
> Same here.
> 
>> +
>> +rtx
>> +rs6000_gen_load_pair (rtx_insn *insn1, rtx_insn *insn2)
>> +{
> 
> I think this function is named in a rather misleading way, and I think
> it is trying to do too much at once.
> 
> Based on the name of the function, I would expect it to just produce the
> appropriate RTL pattern for a load pair given the two candidate loads.
> But it actually tries to do a bunch of analysis to determine if the pair
> can be formed, and then inserts a new insn and deletes the original
> candidate insns.
> 
> Any changing of the RTL should be done by RTL-SSA in fuse_pair.
> 
>> +  rtx body = PATTERN (insn1);
>> +  rtx src_exp = SET_SRC (body);
>> +  rtx dest_exp = SET_DEST (body);
>> +  rtx lxv;
>> +  rtx insn2_body = PATTERN (insn2);
>> +
>> +  rtx insn2_src_exp = SET_SRC (insn2_body);
>> +
>> +  if (GET_MODE (src_exp) != GET_MODE (SET_SRC (insn2_body)))
>> +    return NULL_RTX;
> 
> IIUC this should hold from generic RTL rules, so shouldn't be needed.
> 
>> +
>> +  if (GET_MODE (dest_exp) == TImode)
>> +    return NULL_RTX;
>> +
>> +  if (!ALTIVEC_OR_VSX_VECTOR_MODE (GET_MODE (dest_exp)))
>> +    return NULL_RTX;
> 
> Validation of the modes should have happened much earlier (in
> track_access), so shouldn't be necessary at this point.
> 
>> +
>> +  if (!is_feasible (insn1))
>> +    return NULL_RTX;
>> +
>> +  if (!is_feasible (insn2))
>> +    return NULL_RTX;
> 
> I can't really comment on this as I don't understand the intent behind
> is_feasible, but AFAICT it seems to do some dataflow analysis (looking
> at uses/defs).  That should ideally have been done at the latest in
> try_fuse_pair (before we get to fuse_pair), and shouldn't be done when
> we're trying to generate the final RTL pattern for the pair.
> 
> Given that these seem to be properties of the indiviudal insns it might
> even be possible to check this much earlier (when first considering the
> individual accesses in track_access).
> 
>> +
>> +  for (rtx note = REG_NOTES (insn1); note; note = XEXP (note, 1))
>> +    if (REG_NOTE_KIND (note) == REG_EQUAL
>> +	|| REG_NOTE_KIND (note) == REG_EQUIV)
>> +      return NULL_RTX;
> 
> Can you explain why you're punting if the candidate insns have these
> notes?  Did you see a performance improvement with that change?  It
> might be something we want to consider doing for aarch64 too if running
> before RA (in which case we could do it in the generic code).
> 
>> +
>> +  int no_dep = 0;
>> +  df_ref  use;
>> +  df_insn_info *insn_info = DF_INSN_INFO_GET (insn1);
>> +  rtx_insn *select_insn2;
>> +
>> +  FOR_EACH_INSN_INFO_DEF (use, insn_info)
>> +    {
>> +      struct df_link *def_link = DF_REF_CHAIN (use);
>> +      while (def_link && def_link->ref)
>> +	{
>> +	  select_insn2 = DF_REF_INSN (def_link->ref);
>> +	  rtx set = single_set (select_insn2);
>> +
>> +	  if (set == NULL_RTX)
>> +	    return NULL_RTX;
>> +
>> +	  if (set != NULL_RTX)
> 
> As above, I think this condition is redundant due to the early return
> immediately above.
> 
>> +	    {
>> +	      rtx op0 = SET_SRC (set);
>> +
>> +	      if (GET_CODE (op0) != UNSPEC)
>> +		return NULL_RTX;
>> +
>> +	      if (GET_CODE (op0) == VEC_SELECT
>> +		  && GET_CODE (XEXP (op0, 1)) == PARALLEL)
>> +		return NULL_RTX;
>> +
>> +	      if (GET_CODE (op0) == UNSPEC)
>> +		{
>> +		  if (GET_MODE (op0) != XOmode
>> +		      && GET_MODE (op0) != GET_MODE (dest_exp))
>> +		    return NULL_RTX;
>> +
>> +		  int nvecs = XVECLEN (op0, 0);
>> +		  for (int i = 0; i < nvecs; i++)
>> +		    {
>> +		      rtx op;
>> +		      op = XVECEXP (op0, 0, i);
>> +
>> +		      if (GET_MODE (op) == OOmode)
>> +			return NULL_RTX;
>> +		      if (GET_CODE (op) == EQ)
>> +			return NULL_RTX;
>> +		    }
>> +		}
>> +	       ++no_dep;
>> +	     }
>> +	   def_link = def_link->next;
>> +	}
>> +     }
>> +
>> +  rtx_insn *insn = get_rtx_UNSPEC (insn1);
>> +
>> +  if (insn && insn == get_rtx_UNSPEC (insn2) && no_dep == 1)
>> +    return NULL_RTX;
>> +
>> +  int regoff;
>> +  rtx src;
>> +  rtx addr = XEXP (src_exp, 0);
>> +
>> +  if (GET_CODE (addr) == PLUS
>> +      && XEXP (addr, 1) && CONST_INT_P (XEXP (addr, 1)))
>> +    {
>> +      regoff = 0;
>> +      src = simplify_gen_subreg (GET_MODE (dest_exp),
>> +				 dest_exp, GET_MODE (dest_exp),
>> +				 regoff);
>> +    }
>> +  else
>> +    {
>> +      regoff = INTVAL (CONST0_RTX (SImode)) *
>> +		       GET_MODE_SIZE (GET_MODE (dest_exp));
>> +      src = simplify_gen_subreg (GET_MODE (dest_exp),
>> +				 dest_exp, GET_MODE (dest_exp),
>> +				 regoff);
>> +    }
>> +  insn_info = DF_INSN_INFO_GET (insn2);
>> +  FOR_EACH_INSN_INFO_DEF (use, insn_info)
>> +    {
>> +      struct df_link *def_link = DF_REF_CHAIN (use);
>> +      if (!def_link || !def_link->ref || DF_REF_IS_ARTIFICIAL (def_link->ref))
>> +	continue;
>> +      while (def_link && def_link->ref)
>> +	{
>> +	  rtx *loc = DF_REF_LOC (def_link->ref);
>> +	  *loc =  src;
>> +	  def_link = def_link->next;
>> +	}
>> +     }
>> +
>> +  int regoff1;
>> +  rtx src1;
>> +  addr = XEXP (insn2_src_exp, 0);
>> +
>> +  if (GET_CODE (addr) == PLUS
>> +      && XEXP (addr, 1)
>> +      && CONST_INT_P (XEXP(addr, 1)))
>> +    {
>> +      regoff1 = 16;
>> +      src1 = simplify_gen_subreg (GET_MODE (dest_exp),
>> +				  dest_exp, GET_MODE (dest_exp),
>> +				  regoff1);
>> +    }
>> +  else
>> +    {
>> +      regoff1 = INTVAL (CONST0_RTX (SImode)) * GET_MODE_SIZE (GET_MODE (dest_exp));//V16QImode);
>> +      src1 = simplify_gen_subreg (GET_MODE (dest_exp),
>> +				  dest_exp, GET_MODE (dest_exp),
>> +				  regoff1);
>> +    }
>> +
>> +  insn_info = DF_INSN_INFO_GET (insn1);
>> +  FOR_EACH_INSN_INFO_DEF (use, insn_info)
>> +    {
>> +      struct df_link *def_link = DF_REF_CHAIN (use);
>> +      if (!def_link || !def_link->ref || DF_REF_IS_ARTIFICIAL (def_link->ref))
>> +	continue;
>> +      while (def_link && def_link->ref)
>> +	{
>> +	  rtx *loc = DF_REF_LOC (def_link->ref);
>> +	  PUT_MODE_RAW (*loc, OOmode);
>> +	  *loc = src1;
>> +	  def_link = def_link->next;
>> +	}
>> +     }
>> +
>> +  set_rescan_for_unspec (insn1);
>> +  set_rescan_for_unspec (insn2);
>> +  df_insn_rescan (insn1);
>> +  df_insn_rescan (insn2);
>> +
>> +  PUT_MODE_RAW (src_exp, OOmode);
>> +  PUT_MODE_RAW (dest_exp, OOmode);
> 
> I don't think you should be updating the original RTL directly like
> this, but instead return a new pattern which will ultimately get passed
> to validate_unshare_change in fuse_pair (i.e. let fuse_pair update the
> RTL).
> 
>> +  lxv = gen_movoo (dest_exp, src_exp);
>> +  rtx_insn *new_insn = emit_insn_before (lxv,  insn1);
> 
> Likewise, instead of emitting a new insn, this should let fuse_pair
> perform the change.
> 
>> +  set_block_for_insn (new_insn, BLOCK_FOR_INSN (insn1));
>> +  df_insn_rescan (new_insn);
>> +
>> +  if (dump_file)
>> +    {
>> +      unsigned int new_uid = INSN_UID (new_insn);
>> +      fprintf (dump_file, "Replacing lxv %d with lxvp  %d\n",
>> +			  INSN_UID (insn1), new_uid);
>> +      print_rtl_single (dump_file, new_insn);
>> +      print_rtl_single (dump_file, insn1);
>> +      print_rtl_single (dump_file, insn2);
>> +
>> +    }
>> +
>> +  df_insn_delete (insn1);
>> +  remove_insn (insn1);
>> +  df_insn_delete (insn2);
>> +  remove_insn (insn2);
>> +  insn1->set_deleted ();
>> +  insn2->set_deleted ();
> 
> Likewise, I don't think you should be deleting the original insns here,
> but you should let RTL-SSA do this in the generic code instead (i.e. in
> fuse_pair).
> 
>> +  return lxv;
>> +}
>> +
>> +// LEFT_LIST and RIGHT_LIST are lists of candidate instructions where all insns
>> +// in LEFT_LIST are known to be adjacent to those in RIGHT_LIST.
>> +//
>> +// This function traverses the resulting 2D matrix of possible pair candidates
>> +// and attempts to merge them into pairs.
>> +//
>> +// The algorithm is straightforward: if we consider a combined list of
>> +// candidates X obtained by merging LEFT_LIST and RIGHT_LIST in program order,
>> +// then we advance through X until we reach a crossing point (where X[i] and
>> +// X[i+1] come from different source lists).
>> +//
>> +// At this point we know X[i] and X[i+1] are adjacent accesses, and we try to
>> +// fuse them into a pair.  If this succeeds, we remove X[i] and X[i+1] from
>> +// their original lists and continue as above.
>> +//
>> +// In the failure case, we advance through the source list containing X[i] and
>> +// continue as above (proceeding to the next crossing point).
>> +//
>> +// The rationale for skipping over groups of consecutive candidates from the
>> +// same source list is as follows:
>> +//
>> +// In the store case, the insns in the group can't be re-ordered over each
>> +// other as they are guaranteed to store to the same location, so we're
>> +// guaranteed not to lose opportunities by doing this.
>> +//
>> +// In the load case, subsequent loads from the same location are either
>> +// redundant (in which case they should have been cleaned up by an earlier
>> +// optimization pass) or there is an intervening aliasing hazard, in which case
>> +// we can't re-order them anyway, so provided earlier passes have cleaned up
>> +// redundant loads, we shouldn't miss opportunities by doing this.
>> +void
>> +fusion_bb_info::merge_pairs (insn_list_t &left_list,
>> +			  insn_list_t &right_list,
>> +			  bool load_p,
>> +			  unsigned access_size)
>> +{
> 
> This function should be in the generic code, not in target code.
> Was there a reason you moved it to target code?
> 
> When moving code from aarch64-ldp-fusion.cc, I think it would be better
> if you left the code style unchanged and only made changes where there
> is a need for a functional change.  For example ...
> 
>> +  if (dump_file)
>> +    {
>> +      fprintf (dump_file, "merge_pairs [L=%d], cand vecs ", load_p);
>> +      dump_insn_list (dump_file, left_list);
>> +      fprintf (dump_file, " x ");
>> +      dump_insn_list (dump_file, right_list);
>> +      fprintf (dump_file, "\n");
>> +    }
>> +
>> +  auto iter_l = left_list.begin ();
>> +  auto iter_r = right_list.begin ();
>> +  while (iter_l != left_list.end () && iter_r != right_list.end ())
>> +    {
>> +      auto left = *iter_l;
>> +      auto right = *iter_r;
> 
> ... here you introduce new left and right locals, but AFAICT there isn't
> a need for this (no functional change).
> 
>> +
>> +      auto next_l = std::next (iter_l);
>> +      auto next_r = std::next (iter_r);
>> +
>> +      if ((*left) < (*right)
>> +	  && next_l != (left_list.end ())
>> +	  && **next_l < *right)
>> +	iter_l = next_l;
>> +      else if ((*right) < (*left)
>> +	       && next_r != (right_list.end ())
>> +	       && **next_r < *left)
>> +	iter_r = next_r;
>> +      else if (try_fuse_pair (load_p, access_size, *iter_l, *iter_r)) //left, right))
>> +	{
>> +	  left_list.erase (iter_l);
>> +	  iter_l = next_l;
>> +	  right_list.erase (iter_r);
>> +	  iter_r = next_r;
>> +	}
>> +      else if (*left  < *right)
>> +	iter_l = next_l;
>> +      else
>> +	iter_r = next_r;
>> +    }
>> + 
>> +}
>> +
>> +
>> +
>> +// Main function to begin pair discovery.  Given a memory access INSN,
>> +// determine whether it could be a candidate for fusing into an lxvp/stvxp,
>> +// and if so, track it in the appropriate data structure for this basic
>> +// block.  LOAD_P is true if the access is a load, and MEM is the mem
>> +// rtx that occurs in INSN.
>> +void
>> +fusion_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
>> +{
> 
> This function should also be in generic code, not in target code.
> I can see that you want to customize a target-specific aspect ...
> 
>> +  // We can't combine volatile MEMs, so punt on these.
>> +  if (MEM_VOLATILE_P (mem))
>> +    return;
>> +
>> +  // Ignore writeback accesses if the param says to do so.
>> +  if (GET_RTX_CLASS (GET_CODE (XEXP (mem, 0))) == RTX_AUTOINC)
>> +    return;
>> +
>> +  const machine_mode mem_mode = GET_MODE (mem);
>> +
>> +  rtx reg_op = XEXP (PATTERN (insn->rtl ()), !load_p);
>> +
>> +  if (!ALTIVEC_OR_VSX_VECTOR_MODE (GET_MODE (reg_op)))
>> +    return;
> 
> ... here.  But I think this should be done either using a target hook
> or ideally using a virtual function which is overriden by target code.
> 
> For rs6000, would it be enough to validate the mode of the mem instead
> of looking at the mode of reg_op?  Then we could have a single function,
> e.g.:
> 
> bool pair_fusion::pair_operand_mode_ok_p (machine_mode mode);
> 
> which for rs6000 would return:
> 
> ALTIVEC_OR_VSX_VECTOR_MODE (mode);
> 
> and for aarch64 would return:
> 
> ldp_operand_mode_ok_p (mode);
> 
> and in track_access (from generic code), you would do:
> 
> if (!pair_operand_mode_ok_p (mem_mode))
>   return;
> 
> note that, at least for aarch64, the reason that we validate the mem mode
> instead of the register mode is that the register operand may be a modeless
> const_int in the store case.
> 
>> +
>> +  const bool fpsimd_op_p = (ALTIVEC_OR_VSX_VECTOR_MODE (GET_MODE (reg_op)));
> 
> IIUC, this boolean is always true due to the above check.  So it is
> better to just set it to a constant (doesn't matter which) if you only
> want to handle vector modes for rs6000.  As above, this should be done
> through a virtual function, where for aarch64 we return the condition in
> aarch64-ldp-fusion.cc:
> 
>   const bool fpsimd_op_p
>     = reload_completed
>     ? (REG_P (reg_op) && FP_REGNUM_P (REGNO (reg_op)))
>     : (GET_MODE_CLASS (mem_mode) != MODE_INT
>        && (load_p || !aarch64_const_zero_rtx_p (reg_op)));
> 
> and for rs6000 you can just return false (assuming you don't want to segregate
> accesses using this bit).
> 
>> +
>> +  const HOST_WIDE_INT mem_size = GET_MODE_SIZE (mem_mode);
>> +  const lfs_fields lfs = { load_p, fpsimd_op_p, mem_size };
>> +
>> +  if (track_via_mem_expr (insn, mem, lfs))
>> +    return;
>> +
>> +  poly_int64 mem_off;
>> +  rtx addr = XEXP (mem, 0);
>> +  const bool autoinc_p = GET_RTX_CLASS (GET_CODE (addr)) == RTX_AUTOINC;
>> +  rtx base = load_strip_offset (mem, &mem_off);
>> +  if (!REG_P (base))
>> +    return;
>> +
>> +  // Need to calculate two (possibly different) offsets:
>> +  //  - Offset at which the access occurs.
>> +  //  - Offset of the new base def.
>> +  poly_int64 access_off;
>> +  if (autoinc_p && any_post_modify_p (addr))
>> +    access_off = 0;
>> +  else
>> +    access_off = mem_off;
>> +
>> +  poly_int64 new_def_off = mem_off;
>> +
>> +  // Punt on accesses relative to eliminable regs.  Since we don't know the
>> +  // elimination offset pre-RA, we should postpone forming pairs on such
>> +  // accesses until after RA.
>> +  //
>> +  // As it stands, addresses with offsets in range for LDR but not
>> +  // in range for LDP/STP are currently reloaded inefficiently,
>> +  // ending up with a separate base register for each pair.
>> +  //
>> +  // In theory LRA should make use of
>> +  // targetm.legitimize_address_displacement to promote sharing of
>> +  // bases among multiple (nearby) address reloads, but the current
>> +  // LRA code returns early from process_address_1 for operands that
>> +  // satisfy "m", even if they don't satisfy the real (relaxed) address
>> +  // constraint; this early return means we never get to the code
>> +  // that calls targetm.legitimize_address_displacement.
>> +  //
>> +  // So for now, it's better to punt when we can't be sure that the
>> +  // offset is in range for LDP/STP.  Out-of-range cases can then be
>> +  // handled after RA by the out-of-range LDP/STP peepholes.  Eventually, it
>> +  // would be nice to handle known out-of-range opportunities in the
>> +  // pass itself (for stack accesses, this would be in the post-RA pass).
>> +  if (!reload_completed
>> +      && (REGNO (base) == FRAME_POINTER_REGNUM
>> +	  || REGNO (base) == ARG_POINTER_REGNUM))
>> +    return;
>> +
>> +  // Now need to find def of base register.
>> +  use_info *base_use = find_access (insn->uses (), REGNO (base));
>> +  gcc_assert (base_use);
>> +  def_info *base_def = base_use->def ();
>> +  if (!base_def)
>> +    {
>> +      if (dump_file)
>> +	fprintf (dump_file,
>> +		 "base register (regno %d) of insn %d is undefined",
>> +		 REGNO (base), insn->uid ());
>> +      return;
>> +    }
>> +
>> +  alt_base *canon_base = canon_base_map.get (base_def);
>> +  if (canon_base)
>> +    {
>> +      // Express this as the combined offset from the canonical base.
>> +      base_def = canon_base->base;
>> +      new_def_off += canon_base->offset;
>> +      access_off += canon_base->offset;
>> +    }
>> +
>> +  if (autoinc_p)
>> +    {
>> +      auto def = find_access (insn->defs (), REGNO (base));
>> +      gcc_assert (def);
>> +
>> +      // Record that DEF = BASE_DEF + MEM_OFF.
>> +      if (dump_file)
>> +	{
>> +	  pretty_printer pp;
>> +	  pp_access (&pp, def, 0);
>> +	  pp_string (&pp, " = ");
>> +	  pp_access (&pp, base_def, 0);
>> +	  fprintf (dump_file, "[bb %u] recording %s + ",
>> +		   m_bb->index (), pp_formatted_text (&pp));
>> +	  print_dec (new_def_off, dump_file);
>> +	  fprintf (dump_file, "\n");
>> +	}
>> +
>> +      alt_base base_rec { base_def, new_def_off };
>> +      if (canon_base_map.put (def, base_rec))
>> +	gcc_unreachable (); // Base defs should be unique.
>> +    }
>> +
>> +  // Punt on misaligned offsets.  LDP/STP require offsets to be a multiple of
>> +  // the access size.
>> +  if (!multiple_p (mem_off, mem_size))
>> +    return;
>> +
>> +  const auto key = std::make_pair (base_def, encode_lfs (lfs));
>> +  access_group &group = def_map.get_or_insert (key, NULL);
>> +  auto alloc = [&](access_record *access) { return node_alloc (access); };
>> +  group.track (alloc, access_off, insn);
>> +  if (dump_file)
>> +    {
>> +      pretty_printer pp;
>> +      pp_access (&pp, base_def, 0);
>> +
>> +      fprintf (dump_file, "[bb %u] tracking insn %d via %s",
>> +	       m_bb->index (), insn->uid (), pp_formatted_text (&pp));
>> +      fprintf (dump_file,
>> +	       " [L=%d, WB=%d, FP=%d, %smode, off=",
>> +	       lfs.load_p, autoinc_p, lfs.fpsimd_p, mode_name[mem_mode]);
>> +      print_dec (access_off, dump_file);
>> +      fprintf (dump_file, "]\n");
>> +    }
>> +}
>> +
>> +
>> +// Given two adjacent memory accesses of the same size, I1 and I2, try
>> +// and see if we can merge them into a lxvp or stvxp.
>> +//
>> +// ACCESS_SIZE gives the (common) size of a single access, LOAD_P is true
>> +// if the accesses are both loads, otherwise they are both stores.
>> +bool
>> +fusion_bb_info::try_fuse_pair (bool load_p, unsigned access_size,
>> +			    insn_info *i1, insn_info *i2)
> 
> Likewise, I think this function should be in generic code, not target code.
> 
>> +{
>> +  if (dump_file)
>> +    fprintf (dump_file, "analyzing pair (load=%d): (%d,%d)\n",
>> +	     load_p, i1->uid (), i2->uid ());
>> +
>> +  insn_info *insns[2];
>> +  bool reversed = false;
>> +  if (*i1 < *i2)
>> +    {
>> +      insns[0] = i1;
>> +      insns[1] = i2;
>> +    }
>> +  else
>> +    {
>> +      insns[0] = i2;
>> +      insns[1] = i1;
>> +      reversed = true;
>> +    }
>> +
>> +  rtx cand_mems[2];
>> +  rtx reg_ops[2];
>> +  rtx pats[2];
>> +  for (int i = 0; i < 2; i++)
>> +    {
>> +      pats[i] = PATTERN (insns[i]->rtl ());
>> +      cand_mems[i] = XEXP (pats[i], load_p);
>> +      reg_ops[i] = XEXP (pats[i], !load_p);
>> +    }
>> +
>> +  if (load_p && reg_overlap_mentioned_p (reg_ops[0], reg_ops[1]))
>> +    {
>> +      if (dump_file)
>> +	fprintf (dump_file,
>> +		 "punting on lxvp due to reg conflcits (%d,%d)\n",
> 
> I understand that you might not want to use "ldp" in the generic code,
> but how about "punting on load pair" instead?
> 
>> +		 insns[0]->uid (), insns[1]->uid ());
>> +      return false;
>> +    }
>> +
>> +  if (cfun->can_throw_non_call_exceptions
>> +      && find_reg_note (insns[0]->rtl (), REG_EH_REGION, NULL_RTX)
>> +      && find_reg_note (insns[1]->rtl (), REG_EH_REGION, NULL_RTX))
>> +    {
>> +      if (dump_file)
>> +	fprintf (dump_file,
>> +		 "can't combine insns with EH side effects (%d,%d)\n",
>> +		 insns[0]->uid (), insns[1]->uid ());
>> +      return false;
>> +    }
>> +
>> +  auto_vec<base_cand, 2> base_cands (2);
>> +
>> +  int writeback = get_viable_bases (insns, base_cands, cand_mems,
>> +				    access_size, reversed);
>> +  if (base_cands.is_empty ())
>> +    {
>> +      if (dump_file)
>> +	fprintf (dump_file, "no viable base for pair (%d,%d)\n",
>> +		 insns[0]->uid (), insns[1]->uid ());
>> +      return false;
>> +    }
>> +
>> +  // Punt on frame-related insns with writeback.  We probably won't see
>> +  // these in practice, but this is conservative and ensures we don't
>> +  // have to worry about these later on.
>> +  if (writeback && (RTX_FRAME_RELATED_P (i1->rtl ())
>> +		    || RTX_FRAME_RELATED_P (i2->rtl ())))
>> +    {
>> +      if (dump_file)
>> +	fprintf (dump_file,
>> +		 "rejecting pair (%d,%d): frame-related insn with writeback\n",
>> +		 i1->uid (), i2->uid ());
>> +      return false;
>> +    }
>> +
>> +  rtx *ignore = &XEXP (pats[1], load_p);
>> +  for (auto use : insns[1]->uses ())
>> +    if (!use->is_mem ()
>> +	&& refers_to_regno_p (use->regno (), use->regno () + 1, pats[1], ignore)
>> +	&& use->def () && use->def ()->insn () == insns[0])
>> +      {
>> +	// N.B. we allow a true dependence on the base address, as this
>> +	// happens in the case of auto-inc accesses.  Consider a post-increment
>> +	// load followed by a regular indexed load, for example.
>> +	if (dump_file)
>> +	  fprintf (dump_file,
>> +		   "%d has non-address true dependence on %d, rejecting pair\n",
>> +		   insns[1]->uid (), insns[0]->uid ());
>> +	return false;
>> +      }
>> +
>> +  unsigned i = 0;
>> +  while (i < base_cands.length ())
>> +    {
>> +      base_cand &cand = base_cands[i];
>> +
>> +      rtx *ignore[2] = {};
>> +      for (int j = 0; j < 2; j++)
>> +	if (cand.from_insn == !j)
>> +	  ignore[j] = &XEXP (cand_mems[j], 0);
>> +
>> +      insn_info *h = first_hazard_after (insns[0], ignore[0]);
>> +      if (h && *h <= *insns[1])
>> +	cand.hazards[0] = h;
>> +
>> +      h = latest_hazard_before (insns[1], ignore[1]);
>> +      if (h && *h >= *insns[0])
>> +	cand.hazards[1] = h;
>> +
>> +      if (!cand.viable ())
>> +	{
>> +	  if (dump_file)
>> +	    fprintf (dump_file,
>> +		     "pair (%d,%d): rejecting base %d due to dataflow "
>> +		     "hazards (%d,%d)\n",
>> +		     insns[0]->uid (),
>> +		     insns[1]->uid (),
>> +		     cand.def->regno (),
>> +		     cand.hazards[0]->uid (),
>> +		     cand.hazards[1]->uid ());
>> +
>> +	  base_cands.ordered_remove (i);
>> +	}
>> +      else
>> +	i++;
>> +    }
>> +
>> +  if (base_cands.is_empty ())
>> +    {
>> +      if (dump_file)
>> +	fprintf (dump_file,
>> +		 "can't form pair (%d,%d) due to dataflow hazards\n",
>> +		 insns[0]->uid (), insns[1]->uid ());
>> +      return false;
>> +    }
>> +
>> +  //insn_info *alias_hazards[4] = {};
> 
> It looks like you've just commented out / removed the alias analysis (I don't
> see a call to do_alias_analysis).  How is that supposed to work?
> 
>> +
>> +  // First def of memory after the first insn, and last def of memory
>> +  // before the second insn, respectively.
>> +  def_info *mem_defs[2] = {};
>> +  if (load_p)
>> +    {
>> +      if (!MEM_READONLY_P (cand_mems[0]))
>> +	{
>> +	  mem_defs[0] = memory_access (insns[0]->uses ())->def ();
>> +	  gcc_checking_assert (mem_defs[0]);
>> +	  mem_defs[0] = mem_defs[0]->next_def ();
>> +	}
>> +      if (!MEM_READONLY_P (cand_mems[1]))
>> +	{
>> +	  mem_defs[1] = memory_access (insns[1]->uses ())->def ();
>> +	  gcc_checking_assert (mem_defs[1]);
>> +	}
>> +    }
>> +  else
>> +    {
>> +      mem_defs[0] = memory_access (insns[0]->defs ())->next_def ();
>> +      mem_defs[1] = memory_access (insns[1]->defs ())->prev_def ();
>> +      gcc_checking_assert (mem_defs[0]);
>> +      gcc_checking_assert (mem_defs[1]);
>> +    }
>> +
>> +  //auto tombstone_p = [&](insn_info *insn) -> bool {
>> +   // return m_emitted_tombstone
>> +//	   && bitmap_bit_p (&m_tombstone_bitmap, insn->uid ());
>> + // };
>> +
>> +  if (base_cands.is_empty ())
>> +    {
>> +      if (dump_file)
>> +	fprintf (dump_file,
>> +		 "cannot form pair (%d,%d) due to alias/dataflow hazards",
>> +		 insns[0]->uid (), insns[1]->uid ());
>> +
>> +      return false;
>> +    }
>> +
>> +  base_cand *base = &base_cands[0];
>> +  insn_range_info range (insns[0], insns[1]);
>> +  // If the second insn can throw, narrow the move range to exactly that insn.
>> +  // This prevents us trying to move the second insn from the end of the BB.
>> +  if (cfun->can_throw_non_call_exceptions
>> +      && find_reg_note (insns[1]->rtl (), REG_EH_REGION, NULL_RTX))
>> +    {
>> +      gcc_assert (range.includes (insns[1]));
>> +      range = insn_range_info (insns[1]);
>> +    }
>> +
>> +  // Placement strategy: push loads down and pull stores up, this should
>> +  // help register pressure by reducing live ranges.
>> +  if (load_p)
>> +    range.first = range.last;
>> +  else
>> +    range.last = range.first;
>> +  return fuse_pair (load_p, access_size, writeback,
>> +		    i1, i2, *base, range);
>> +}
>> +// Try and actually fuse the pair given by insns I1 and I2.
>> +//
>> +// Here we've done enough analysis to know this is safe, we only
>> +// reject the pair at this stage if either the tuning policy says to,
>> +// or recog fails on the final pair insn.
>> +//
>> +// LOAD_P is true for loads, ACCESS_SIZE gives the access size of each
>> +// candidate insn.  Bit i of WRITEBACK is set if the ith insn (in program
>> +// order) uses writeback.
>> +//
>> +// BASE gives the chosen base candidate for the pair and MOVE_RANGE is
>> +// a singleton range which says where to place the pair.
>> +bool
>> +fusion_bb_info::fuse_pair (bool load_p,
>> +			unsigned access_size,
>> +			int writeback,
>> +			insn_info *i1, insn_info *i2,
>> +			base_cand &base,
>> +			const insn_range_info &move_range)
>> +{
> 
> Again, I think this function should be in generic code, not in target
> code.
> 
>> +  auto attempt = crtl->ssa->new_change_attempt ();
>> +
>> +  auto make_change = [&attempt](insn_info *insn)
>> +    {
>> +      return crtl->ssa->change_alloc<insn_change> (attempt, insn);
>> +    };
>> +  auto make_delete = [&attempt](insn_info *insn)
>> +    {
>> +      return crtl->ssa->change_alloc<insn_change> (attempt,
>> +						   insn,
>> +						   insn_change::DELETE);
>> +    };
>> +
>> +  if (*i1 > *i2)
>> +    return false;
>> +
>> +  insn_info *first = (*i1 < *i2) ? i1 : i2;
>> +  insn_info *second = (first == i1) ? i2 : i1;
>> +
>> +  insn_info *insns[2] = { first, second };
>> +
>> +  auto_vec<insn_change *, 4> changes (4);
>> +  auto_vec<int, 2> tombstone_uids (2);
>> +
>> +  rtx pats[2] = {
>> +    PATTERN (first->rtl ()),
>> +    PATTERN (second->rtl ())
>> +  };
>> +
>> +  use_array input_uses[2] = { first->uses (), second->uses () };
>> +  def_array input_defs[2] = { first->defs (), second->defs () };
>> +
>> +  int changed_insn = -1;
>> +  if (base.from_insn != -1)
>> +    {
>> +      // If we're not already using a shared base, we need
>> +      // to re-write one of the accesses to use the base from
>> +      // the other insn.
>> +      gcc_checking_assert (base.from_insn == 0 || base.from_insn == 1);
>> +      changed_insn = !base.from_insn;
>> +
>> +      rtx base_pat = pats[base.from_insn];
>> +      rtx change_pat = pats[changed_insn];
>> +      rtx base_mem = XEXP (base_pat, load_p);
>> +      rtx change_mem = XEXP (change_pat, load_p);
>> +
>> +      const bool lower_base_p = (insns[base.from_insn] == i1);
>> +      HOST_WIDE_INT adjust_amt = access_size;
>> +      if (!lower_base_p)
>> +	adjust_amt *= -1;
>> +
>> +      rtx change_reg = XEXP (change_pat, !load_p);
>> +      machine_mode mode_for_mem = GET_MODE (change_mem);
>> +      rtx effective_base = drop_writeback (base_mem);
>> +      rtx new_mem = adjust_address_nv (effective_base,
>> +				       mode_for_mem,
>> +				       adjust_amt);
>> +      rtx new_set = load_p
>> +	? gen_rtx_SET (change_reg, new_mem)
>> +	: gen_rtx_SET (new_mem, change_reg);
>> +
>> +      pats[changed_insn] = new_set;
>> +
>> +      auto keep_use = [&](use_info *u)
>> +	{
>> +	  return refers_to_regno_p (u->regno (), u->regno () + 1,
>> +				    change_pat, &XEXP (change_pat, load_p));
>> +	};
>> +
>> +      // Drop any uses that only occur in the old address.
>> +      input_uses[changed_insn] = filter_accesses (attempt,
>> +						  input_uses[changed_insn],
>> +						  keep_use);
>> +    }
>> +
>> +  rtx writeback_effect = NULL_RTX;
>> +  if (writeback)
>> +    writeback_effect = extract_writebacks (load_p, pats, changed_insn);
>> +
>> +  const auto base_regno = base.def->regno ();
>> +
>> +  if (base.from_insn == -1 && (writeback & 1))
>> +    {
>> +      // If the first of the candidate insns had a writeback form, we'll need to
>> +      // drop the use of the updated base register from the second insn's uses.
>> +      //
>> +      // N.B. we needn't worry about the base register occurring as a store
>> +      // operand, as we checked that there was no non-address true dependence
>> +      // between the insns in try_fuse_pair.
>> +      gcc_checking_assert (find_access (input_uses[1], base_regno));
>> +      input_uses[1] = check_remove_regno_access (attempt,
>> +						 input_uses[1],
>> +						 base_regno);
>> +    }
>> +
>> +  // Go through and drop uses that only occur in register notes,
>> +  // as we won't be preserving those.
>> +  for (int i = 0; i < 2; i++)
>> +    {
>> +      auto rti = insns[i]->rtl ();
>> +      if (!REG_NOTES (rti))
>> +	continue;
>> +
>> +      input_uses[i] = remove_note_accesses (attempt, input_uses[i]);
>> +    }
>> +
>> +  // So far the patterns have been in instruction order,
>> +  // now we want them in offset order.
>> +  if (i1 != first)
>> +    std::swap (pats[0], pats[1]);
>> +
>> +  poly_int64 offsets[2];
>> +  for (int i = 0; i < 2; i++)
>> +    {
>> +      rtx mem = XEXP (pats[i], load_p);
>> +      gcc_checking_assert (MEM_P (mem));
>> +      rtx base = strip_offset (XEXP (mem, 0), offsets + i);
>> +      gcc_checking_assert (REG_P (base));
>> +      gcc_checking_assert (base_regno == REGNO (base));
>> +    }
>> +
>> +  // If either of the original insns had writeback, but the resulting pair insn
>> +  // does not (can happen e.g. in the lxvp edge case above, or if the writeback
>> +  // effects cancel out), then drop the def(s) of the base register as
>> +  // appropriate.
>> +  //
>> +  // Also drop the first def in the case that both of the original insns had
>> +  // writeback.  The second def could well have uses, but the first def should
>> +  // only be used by the second insn (and we dropped that use above).
>> +  for (int i = 0; i < 2; i++)
>> +    if ((!writeback_effect && (writeback & (1 << i)))
>> +	|| (i == 0 && writeback == 3))
>> +      input_defs[i] = check_remove_regno_access (attempt,
>> +						 input_defs[i],
>> +						 base_regno);
>> +
>> +  // If we don't currently have a writeback pair, and we don't have
>> +  // a load that clobbers the base register, look for a trailing destructive
>> +  // update of the base register and try and fold it in to make this into a
>> +  // writeback pair.
>> +  insn_info *trailing_add = nullptr;
> 
> I can see you've dropped the call to find_trailing_add here.
> 
> If you don't want to handle writeback pairs, that's fine, but I think
> that should be controlled with a target-controlled virtual function.
> 
> Effectively you can just replace uses of the param aarch64_ldp_writeback with a
> call to such a function, then the target can control which writeback
> opportunities we handle, if any.  For aarch64 this would just return
> the result of the param for the time being.
> 
>> +
>> +  rtx reg_notes = combine_reg_notes (first, second, load_p);
>> +
>> +  rtx pair_pat = NULL_RTX;
>> +
>> +  if (load_p) {
>> +    pair_pat = rs6000_gen_load_pair ((first->rtl ()), (second->rtl ()));
>> +    if (pair_pat == NULL_RTX)
>> +      return false;
>> +  }
>> +  insn_change *pair_change = nullptr;
>> +  auto set_pair_pat = [pair_pat,reg_notes](insn_change *change) {
>> +    rtx_insn *rti = change->insn ()->rtl ();
>> +    validate_unshare_change (rti, &PATTERN (rti), pair_pat, true);
>> +    validate_change (rti, &REG_NOTES (rti), reg_notes, true);
>> +  };
>> +
>> +  if (load_p)
>> +    {
>> +      changes.quick_push (make_delete (first));
>> +      pair_change = make_change (second);
>> +      changes.quick_push (pair_change);
>> +
>> +      pair_change->move_range = move_range;
>> +      pair_change->new_defs = merge_access_arrays (attempt,
>> +						   input_defs[0],
>> +						   input_defs[1]);
>> +      gcc_assert (pair_change->new_defs.is_valid ());
>> +
>> +      pair_change->new_uses
>> +	= merge_access_arrays (attempt,
>> +			       drop_memory_access (input_uses[0]),
>> +			       drop_memory_access (input_uses[1]));
>> +      gcc_assert (pair_change->new_uses.is_valid ());
>> +      set_pair_pat (pair_change);
>> +    }
>> +  else
>> +    {
>> +      insn_info *store_to_change = decide_store_strategy (first, second,
>> +							move_range);
>> +
>> +      if (store_to_change && dump_file)
>> +	fprintf (dump_file, "  stvx: re-purposing store %d\n",
>> +		 store_to_change->uid ());
>> +
>> +      insn_change *change;
>> +      for (int i = 0; i < 2; i++)
>> +	{
>> +	  change = make_change (insns[i]);
>> +	  if (insns[i] == store_to_change)
>> +	    {
>> +	      set_pair_pat (change);
>> +	      change->new_uses = merge_access_arrays (attempt,
>> +						      input_uses[0],
>> +						      input_uses[1]);
>> +	      auto d1 = drop_memory_access (input_defs[0]);
>> +	      auto d2 = drop_memory_access (input_defs[1]);
>> +	      change->new_defs = merge_access_arrays (attempt, d1, d2);
>> +	      gcc_assert (change->new_defs.is_valid ());
>> +	      def_info *stxvp_def = memory_access (store_to_change->defs ());
>> +	      change->new_defs = insert_access (attempt,
>> +						stxvp_def,
>> +						change->new_defs);
>> +	      gcc_assert (change->new_defs.is_valid ());
>> +	      change->move_range = move_range;
>> +	      pair_change = change;
>> +	    }
>> +	  else
>> +	    {
>> +	      // Note that we are turning this insn into a tombstone,
>> +	      // we need to keep track of these if we go ahead with the
>> +	      // change.
>> +	      tombstone_uids.quick_push (insns[i]->uid ());
>> +	      rtx_insn *rti = insns[i]->rtl ();
>> +	      validate_change (rti, &PATTERN (rti), gen_tombstone (), true);
>> +	      validate_change (rti, &REG_NOTES (rti), NULL_RTX, true);
>> +	      change->new_uses = use_array (nullptr, 0);
>> +	    }
>> +	  gcc_assert (change->new_uses.is_valid ());
>> +	  changes.quick_push (change);
>> +	}
>> +
>> +      if (!store_to_change)
>> +	{
>> +	  // Tricky case.  Cannot re-purpose existing insns for stp.
>> +	  // Need to insert new insn.
>> +	  if (dump_file)
>> +	    fprintf (dump_file,
>> +		     "  stp fusion: cannot re-purpose candidate stores\n");
>> +
>> +	  auto new_insn = crtl->ssa->create_insn (attempt, INSN, pair_pat);
>> +	  change = make_change (new_insn);
>> +	  change->move_range = move_range;
>> +	  change->new_uses = merge_access_arrays (attempt,
>> +						  input_uses[0],
>> +						  input_uses[1]);
>> +	  gcc_assert (change->new_uses.is_valid ());
>> +
>> +	  auto d1 = drop_memory_access (input_defs[0]);
>> +	  auto d2 = drop_memory_access (input_defs[1]);
>> +	  change->new_defs = merge_access_arrays (attempt, d1, d2);
>> +	  gcc_assert (change->new_defs.is_valid ());
>> +
>> +	  auto new_set = crtl->ssa->create_set (attempt, new_insn, memory);
>> +	  change->new_defs = insert_access (attempt, new_set,
>> +					    change->new_defs);
>> +	  gcc_assert (change->new_defs.is_valid ());
>> +	  changes.safe_insert (1, change);
>> +	  pair_change = change;
>> +	}
>> +    }
>> +
>> +  if (trailing_add)
>> +    changes.quick_push (make_delete (trailing_add));
>> +
>> +  auto n_changes = changes.length ();
>> +  gcc_checking_assert (n_changes >= 2 && n_changes <= 4);
>> +
>> +  gcc_assert (crtl->ssa->verify_insn_changes (changes));
>> +
>> +  cancel_changes (0);
> 
> Can you explain why you changed this to unconditionally cancel the
> changes instead of letting RTL-SSA perform them?
> 
>> +
>> +  confirm_change_group ();
>> +
>> +  gcc_checking_assert (tombstone_uids.length () <= 2);
>> +  for (auto uid : tombstone_uids)
>> +    track_tombstone (uid);
>> +
>> +  return true;
>> +}
>> +
>> +void fusion_bb (bb_info *bb)
>> +{
> 
> Likewise, I think this function should be in generic code, not in target
> code.  Note for aarch64 we have:
> 
>   const bool track_loads
>     = aarch64_tune_params.ldp_policy_model != AARCH64_LDP_STP_POLICY_NEVER;
>   const bool track_stores
>     = aarch64_tune_params.stp_policy_model != AARCH64_LDP_STP_POLICY_NEVER;
> 
> I think these can just be hidden behind target-controlled virtual
> functions, so you'd have something like:
> 
> bool pair_fusion::should_fuse_stores ();
> bool pair_fusion::should_fuse_loads ();
> 
> (or a single function that returns multiple bitfields if you like).
> 
> then for aarch64 you would return:
> 
> aarch64_tune_params.ldp_policy_model != AARCH64_LDP_STP_POLICY_NEVER;
> 
> for the load case, and for rs6000 you would return true (just based on
> what's below).  Presumably since you don't handle stores below, you'd
> want to return false in the store case for rs6000 code (at least
> initially).
> 
> As I mentioned above, the aarch64 version of this function also tries
> to promote existing pairs to writeback pairs, again that would be
> controlled by the same function call that replaces uses of
> the param aarch64_ldp_writeback.
> 
>> +  fusion_bb_info bb_state (bb);
>> +
>> +  for (auto insn : bb->nondebug_insns ())
>> +    {
>> +      rtx_insn *rti = insn->rtl ();
>> +
>> +      if (!rti || !INSN_P (rti))
>> +	continue;
>> +
>> +      rtx pat = PATTERN (rti);
>> +
>> +      if (GET_CODE (pat) != SET)
>> +	continue;
>> +
>> +      if (MEM_P (XEXP (pat, 1)))
>> +	bb_state.track_access (insn, true, XEXP (pat, 1));
>> +    }
>> +
>> +  bb_state.transform ();
>> +  bb_state.cleanup_tombstones ();
>> +}
>> +
>> +static void
>> +fusion_init ()
>> +{
>> +  df_set_flags (DF_RD_PRUNE_DEAD_DEFS);
>> +  df_chain_add_problem (DF_DU_CHAIN | DF_UD_CHAIN);
>> +  df_analyze ();
>> +  df_set_flags (DF_DEFER_INSN_RESCAN);
>> +  /* Rebuild ud- and du-chains.  */
>> +  df_remove_problem (df_chain);
>> +  df_process_deferred_rescans ();
>> +  df_set_flags (DF_RD_PRUNE_DEAD_DEFS);
>> +  df_chain_add_problem (DF_DU_CHAIN | DF_UD_CHAIN);
>> +  df_analyze ();
>> +  df_set_flags (DF_DEFER_INSN_RESCAN);
> 
> Is there a reason you can't use RTL-SSA for the def-use information?
> IMO it would be cleaner to use RTL-SSA as the existing aarch64 pass does
> (and the refactored generic code presumably will do).
> 
>> +  calculate_dominance_info (CDI_DOMINATORS);
>> +  crtl->ssa = new rtl_ssa::function_info (cfun);
>> +}
>> +
>> +static void
>> +fusion_destroy ()
>> +{
>> +  if (crtl->ssa->perform_pending_updates ())
>> +    cleanup_cfg (0);
>> +
>> +  free_dominance_info (CDI_DOMINATORS);
>> +  delete crtl->ssa;
>> +  crtl->ssa = nullptr;
>> +}
>> +
>> +// Iterate over the accesses in GROUP, looking for adjacent sets
>> +// of accesses.  If we find two sets of adjacent accesses, call
>> +// merge_pairs.
>> +void
>> +fusion_bb_info::transform_for_base (int encoded_lfs,
>> +				 access_group &group)
> 
> Again, I think this function should be in generic code.
> 
>> +{
>> +  const auto lfs = decode_lfs (encoded_lfs);
>> +  const unsigned access_size = lfs.size;
>> +
>> +  bool skip_next = true;
>> +  access_record *prev_access = nullptr;
>> +
>> +  for (auto &access : group.list)
>> +    {
>> +      if (skip_next)
>> +	skip_next = false;
>> +      else if (known_eq (access.offset, prev_access->offset + access_size))
>> +	{
>> +	  merge_pairs (prev_access->cand_insns,
>> +		       access.cand_insns,
>> +		       lfs.load_p,
>> +		       access_size);
>> +	  skip_next = true;
>> +	}
>> +      prev_access = &access;
>> +    }
>> +}
>> +
>> +void rs6000_analyze_vecload ()
>> +{
>> +  fusion_init ();
>> +
>> +  for (auto bb : crtl->ssa->bbs ())
>> +    fusion_bb (bb);
>> +
>> +  fusion_destroy ();
>> +}
>> +
>> +const pass_data pass_data_analyze_vecload =
>> +{
>> +  RTL_PASS, /* type */
>> +  "vecload", /* name */
>> +  OPTGROUP_NONE, /* optinfo_flags */
>> +  TV_NONE, /* tv_id */
>> +  0, /* properties_required */
>> +  0, /* properties_provided */
>> +  0, /* properties_destroyed */
>> +  0, /* todo_flags_start */
>> +  TODO_df_finish, /* todo_flags_finish */
>> +};
>> +
>> +class pass_analyze_vecload : public rtl_opt_pass
>> +{
>> +public:
>> +  pass_analyze_vecload(gcc::context *ctxt)
>> +    : rtl_opt_pass(pass_data_analyze_vecload, ctxt)
>> +  {}
>> +
>> +  /* opt_pass methods: */
>> +  bool gate (function *)
>> +    {
>> +      return (optimize > 0 && TARGET_VSX && TARGET_POWER10);
>> +    }
>> +
>> +  unsigned int execute (function *) final override
>> +    {
>> +      rs6000_analyze_vecload ();
>> +      return 0;
>> +    }
>> +}; // class pass_analyze_vecload
>> +
>> +rtl_opt_pass *
>> +make_pass_analyze_vecload (gcc::context *ctxt)
>> +{
>> +  return new pass_analyze_vecload (ctxt);
>> +}
>> +
>> diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
>> index 5d975dab921..e36946c44ea 100644
>> --- a/gcc/config/rs6000/rs6000.cc
>> +++ b/gcc/config/rs6000/rs6000.cc
>> @@ -1178,6 +1178,7 @@ static bool rs6000_secondary_reload_move (enum rs6000_reg_type,
>>  					  secondary_reload_info *,
>>  					  bool);
>>  rtl_opt_pass *make_pass_analyze_swaps (gcc::context*);
>> +rtl_opt_pass *make_pass_analyze_vecload (gcc::context*);
>>  
>>  /* Hash table stuff for keeping track of TOC entries.  */
>>  
>> diff --git a/gcc/config/rs6000/t-rs6000 b/gcc/config/rs6000/t-rs6000
>> index b3ce09d523b..7664830f8ab 100644
>> --- a/gcc/config/rs6000/t-rs6000
>> +++ b/gcc/config/rs6000/t-rs6000
>> @@ -35,6 +35,11 @@ rs6000-p8swap.o: $(srcdir)/config/rs6000/rs6000-p8swap.cc
>>  	$(COMPILE) $<
>>  	$(POSTCOMPILE)
>>  
>> +rs6000-vecload-opt.o: $(srcdir)/config/rs6000/rs6000-vecload-opt.cc
>> +	$(COMPILE) $<
>> +	$(POSTCOMPILE)
>> +
>> +
>>  rs6000-d.o: $(srcdir)/config/rs6000/rs6000-d.cc
>>  	$(COMPILE) $<
>>  	$(POSTCOMPILE)
>> diff --git a/gcc/emit-rtl.cc b/gcc/emit-rtl.cc
>> index 1856fa4884f..ffc47a6eaa0 100644
>> --- a/gcc/emit-rtl.cc
>> +++ b/gcc/emit-rtl.cc
>> @@ -921,7 +921,7 @@ validate_subreg (machine_mode omode, machine_mode imode,
>>      return false;
>>  
>>    /* The subreg offset cannot be outside the inner object.  */
>> -  if (maybe_ge (offset, isize))
>> +  if (maybe_gt (offset, isize))
>>      return false;
> 
> Can you explain why this change is needed?
> 
>>  
>>    poly_uint64 regsize = REGMODE_NATURAL_SIZE (imode);
>> @@ -3035,18 +3035,6 @@ verify_rtx_sharing (rtx orig, rtx insn)
>>        break;
>>      }
>>  
>> -  /* This rtx may not be shared.  If it has already been seen,
>> -     replace it with a copy of itself.  */
>> -  if (flag_checking && RTX_FLAG (x, used))
>> -    {
>> -      error ("invalid rtl sharing found in the insn");
>> -      debug_rtx (insn);
>> -      error ("shared rtx");
>> -      debug_rtx (x);
>> -      internal_error ("internal consistency failure");
>> -    }
>> -  gcc_assert (!RTX_FLAG (x, used));
>> -
> 
> I think the need to make this change is likely indicative of a problem
> in the pass, and the pass should be fixed instead of changing
> verify_rtx_sharing.
> 
>>    RTX_FLAG (x, used) = 1;
>>  
>>    /* Now scan the subexpressions recursively.  */
>> diff --git a/gcc/fusion-common.h b/gcc/fusion-common.h
>> new file mode 100644
>> index 00000000000..82f1d6ef032
>> --- /dev/null
>> +++ b/gcc/fusion-common.h
>> @@ -0,0 +1,1195 @@
>> +// Common infrastructure for Load/Store Pair fusion optimization pass.
>> +// Copyright (C) 2023-2024 Free Software Foundation, Inc.
>> +//
>> +// This file is part of GCC.
>> +//
>> +// GCC is free software; you can redistribute it and/or modify it
>> +// under the terms of the GNU General Public License as published by
>> +// the Free Software Foundation; either version 3, or (at your option)
>> +// any later version.
>> +//
>> +// GCC is distributed in the hope that it will be useful, but
>> +// WITHOUT ANY WARRANTY; without even the implied warranty of
>> +// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>> +// General Public License for more details.
>> +//
>> +// You should have received a copy of the GNU General Public License
>> +// along with GCC; see the file COPYING3.  If not see
>> +// <http://www.gnu.org/licenses/>.
>> +
>> +#define INCLUDE_ALGORITHM
>> +#define INCLUDE_FUNCTIONAL
>> +#define INCLUDE_LIST
>> +#define INCLUDE_TYPE_TRAITS
>> +#include "config.h"
>> +#include "system.h"
>> +#include "coretypes.h"
>> +#include "backend.h"
>> +#include "rtl.h"
>> +#include "df.h"
>> +#include "rtl-iter.h"
>> +#include "rtl-ssa.h"
>> +#include "cfgcleanup.h"
>> +#include "tree-pass.h"
>> +#include "ordered-hash-map.h"
>> +#include "tree-dfa.h"
>> +#include "fold-const.h"
>> +#include "tree-hash-traits.h"
>> +#include "print-tree.h"
>> +#include "insn-attr.h"
>> +
>> +using namespace rtl_ssa;
>> +
>> +static constexpr HOST_WIDE_INT LDP_IMM_BITS = 7;
>> +static constexpr HOST_WIDE_INT LDP_IMM_SIGN_BIT = (1 << (LDP_IMM_BITS - 1));
>> +static constexpr HOST_WIDE_INT LDP_MAX_IMM = LDP_IMM_SIGN_BIT - 1;
>> +static constexpr HOST_WIDE_INT LDP_MIN_IMM = -LDP_MAX_IMM - 1;
> 
> Note we would not want these aarch64-specific constants in generic code,
> they would need to be target-configurable (in fact we might just want a
> virtual function / target hook to validate offsets instead).
> 
>> +
>> +// We pack these fields (load_p, fpsimd_p, and size) into an integer
>> +// (LFS) which we use as part of the key into the main hash tables.
>> +//
>> +// The idea is that we group candidates together only if they agree on
>> +// the fields below.  Candidates that disagree on any of these
>> +// properties shouldn't be merged together.
>> +struct lfs_fields
>> +{
>> +  bool load_p;
>> +  bool fpsimd_p;
>> +  unsigned size;
>> +};
>> +
>> +using insn_list_t = std::list<insn_info *>;
>> +using insn_iter_t = insn_list_t::iterator;
>> +
>> +// Information about the accesses at a given offset from a particular
>> +// base.  Stored in an access_group, see below.
>> +struct access_record
>> +{
>> +  poly_int64 offset;
>> +  std::list<insn_info *> cand_insns;
>> +  std::list<access_record>::iterator place;
>> +
>> +  access_record (poly_int64 off) : offset (off) {}
>> +};
>> +
>> +// A group of accesses where adjacent accesses could be ldp/stp
>> +// or lxvp/stxvp candidates.  The splay tree supports efficient
> 
> Perhaps better to spell out "load pair or store pair" instead of listing
> the various instruction mnenomics for different targets.
> 
>> +// insertion, while the list supports efficient iteration.
>> +struct access_group
>> +{
>> +  splay_tree<access_record *> tree;
>> +  std::list<access_record> list;
>> +
>> +  template<typename Alloc>
>> +  inline void track (Alloc node_alloc, poly_int64 offset, insn_info *insn);
>> +};
>> +
>> +// Information about a potential base candidate, used in try_fuse_pair.
>> +// There may be zero, one, or two viable RTL bases for a given pair.
>> +struct base_cand
>> +{
>> +  // DEF is the def of the base register to be used by the pair.
>> +  def_info *def;
>> +
>> +  // FROM_INSN is -1 if the base candidate is already shared by both
>> +  // candidate insns.  Otherwise it holds the index of the insn from
>> +  // which the base originated.
>> +  //
>> +  // In the case that the base is shared, either DEF is already used
>> +  // by both candidate accesses, or both accesses see different versions
>> +  // of the same regno, in which case DEF is the def consumed by the
>> +  // first candidate access.
>> +  int from_insn;
>> +
>> +  // To form a pair, we do so by moving the first access down and the second
>> +  // access up.  To determine where to form the pair, and whether or not
>> +  // it is safe to form the pair, we track instructions which cannot be
>> +  // re-ordered past due to either dataflow or alias hazards.
>> +  //
>> +  // Since we allow changing the base used by an access, the choice of
>> +  // base can change which instructions act as re-ordering hazards for
>> +  // this pair (due to different dataflow).  We store the initial
>> +  // dataflow hazards for this choice of base candidate in HAZARDS.
>> +  //
>> +  // These hazards act as re-ordering barriers to each candidate insn
>> +  // respectively, in program order.
>> +  //
>> +  // Later on, when we take alias analysis into account, we narrow
>> +  // HAZARDS accordingly.
>> +  insn_info *hazards[2];
>> +
>> +  base_cand (def_info *def, int insn)
>> +    : def (def), from_insn (insn), hazards {nullptr, nullptr} {}
>> +
>> +  base_cand (def_info *def) : base_cand (def, -1) {}
>> +
>> +  // Test if this base candidate is viable according to HAZARDS.
>> +  bool viable () const
>> +  {
>> +    return !hazards[0] || !hazards[1] || (*hazards[0] > *hazards[1]);
>> +  }
>> +};
>> +
>> +// Information about an alternate base.  For a def_info D, it may
>> +// instead be expressed as D = BASE + OFFSET.
>> +struct alt_base
>> +{
>> +  def_info *base;
>> +  poly_int64 offset;
>> +};
>> +
>> +// State used by the pass for a given basic block.
>> +struct fusion_bb_info
>> +{
>> +  using def_hash = nofree_ptr_hash<def_info>;
>> +  using expr_key_t = pair_hash<tree_operand_hash, int_hash<int, -1, -2>>;
>> +  using def_key_t = pair_hash<def_hash, int_hash<int, -1, -2>>;
>> +
>> +  // Map of <tree base, LFS> -> access_group.
>> +  ordered_hash_map<expr_key_t, access_group> expr_map;
>> +
>> +  // Map of <RTL-SSA def_info *, LFS> -> access_group.
>> +  ordered_hash_map<def_key_t, access_group> def_map;
>> +
>> +  // Given the def_info for an RTL base register, express it as an offset from
>> +  // some canonical base instead.
>> +  //
>> +  // Canonicalizing bases in this way allows us to identify adjacent accesses
>> +  // even if they see different base register defs.
>> +  hash_map<def_hash, alt_base> canon_base_map;
>> +
>> +  static const size_t obstack_alignment = sizeof (void *);
>> +  bb_info *m_bb;
>> +
>> +  fusion_bb_info (bb_info *bb) : m_bb (bb), m_emitted_tombstone (false)
>> +  {
>> +    obstack_specify_allocation (&m_obstack, OBSTACK_CHUNK_SIZE,
>> +				obstack_alignment, obstack_chunk_alloc,
>> +				obstack_chunk_free);
>> +  }
>> +  ~fusion_bb_info ()
>> +  {
>> +    obstack_free (&m_obstack, nullptr);
>> +
>> +    if (m_emitted_tombstone)
>> +      {
>> +	bitmap_release (&m_tombstone_bitmap);
>> +	bitmap_obstack_release (&m_bitmap_obstack);
>> +      }
>> +  }
>> +
>> +  inline void track_access (insn_info *, bool load, rtx mem);
>> +  inline void transform ();
>> +  inline void cleanup_tombstones ();
>> +
>> +private:
>> +  obstack m_obstack;
>> +
>> +  // State for keeping track of tombstone insns emitted for this BB.
>> +  bitmap_obstack m_bitmap_obstack;
>> +  bitmap_head m_tombstone_bitmap;
>> +  bool m_emitted_tombstone;
>> +
>> +  inline splay_tree_node<access_record *> *node_alloc (access_record *);
>> +
>> +  template<typename Map>
>> +  inline void traverse_base_map (Map &map);
>> +  inline void transform_for_base (int load_size, access_group &group);
>> +
>> +   void merge_pairs (insn_list_t &, insn_list_t &,
>> +			   bool load_p, unsigned access_size);
>> +
>> +   bool try_fuse_pair (bool load_p, unsigned access_size,
>> +			     insn_info *i1, insn_info *i2);
>> +
>> +   bool fuse_pair (bool load_p, unsigned access_size,
>> +			 int writeback,
>> +			 insn_info *i1, insn_info *i2,
>> +			 base_cand &base,
>> +			 const insn_range_info &move_range);
>> +
>> +  inline void track_tombstone (int uid);
>> +
>> +  inline bool track_via_mem_expr (insn_info *, rtx mem, lfs_fields lfs);
>> +};
>> +
>> +splay_tree_node<access_record *> *fusion_bb_info::node_alloc (access_record *access)
>> +{
>> +  using T = splay_tree_node<access_record *>;
>> +  void *addr = obstack_alloc (&m_obstack, sizeof (T));
>> +  return new (addr) T (access);
>> +}
>> +
>> +// Given a mem MEM, if the address has side effects, return a MEM that accesses
>> +// the same address but without the side effects.  Otherwise, return
>> +// MEM unchanged.
>> +static rtx
>> +drop_writeback (rtx mem)
>> +{
>> +  rtx addr = XEXP (mem, 0);
>> +
>> +  if (!side_effects_p (addr))
>> +    return mem;
>> +
>> +  switch (GET_CODE (addr))
>> +    {
>> +    case PRE_MODIFY:
>> +      addr = XEXP (addr, 1);
>> +      break;
>> +    case POST_MODIFY:
>> +    case POST_INC:
>> +    case POST_DEC:
>> +      addr = XEXP (addr, 0);
>> +      break;
>> +    case PRE_INC:
>> +    case PRE_DEC:
>> +    {
>> +      poly_int64 adjustment = GET_MODE_SIZE (GET_MODE (mem));
>> +      if (GET_CODE (addr) == PRE_DEC)
>> +	adjustment *= -1;
>> +      addr = plus_constant (GET_MODE (addr), XEXP (addr, 0), adjustment);
>> +      break;
>> +    }
>> +    default:
>> +      gcc_unreachable ();
>> +    }
>> +
>> +  return change_address (mem, GET_MODE (mem), addr);
>> +}
>> +
>> +// Convenience wrapper around strip_offset that can also look through
>> +// RTX_AUTOINC addresses.  The interface is like strip_offset except we take a
>> +// MEM so that we know the mode of the access.
>> +static rtx
>> +load_strip_offset (rtx mem, poly_int64 *offset)
> 
> Nit: naming.  How about pair_fusion_strip_offset instead?
> 
>> +{
>> +  rtx addr = XEXP (mem, 0);
>> +
>> +  switch (GET_CODE (addr))
>> +    {
>> +    case PRE_MODIFY:
>> +    case POST_MODIFY:
>> +      addr = strip_offset (XEXP (addr, 1), offset);
>> +      gcc_checking_assert (REG_P (addr));
>> +      gcc_checking_assert (rtx_equal_p (XEXP (XEXP (mem, 0), 0), addr));
>> +      break;
>> +    case PRE_INC:
>> +    case POST_INC:
>> +      addr = XEXP (addr, 0);
>> +      *offset = GET_MODE_SIZE (GET_MODE (mem));
>> +      gcc_checking_assert (REG_P (addr));
>> +      break;
>> +    case PRE_DEC:
>> +    case POST_DEC:
>> +      addr = XEXP (addr, 0);
>> +      *offset = -GET_MODE_SIZE (GET_MODE (mem));
>> +      gcc_checking_assert (REG_P (addr));
>> +      break;
>> +
>> +    default:
>> +      addr = strip_offset (addr, offset);
>> +    }
>> +
>> +  return addr;
>> +}
>> +
>> +// Return true if X is a PRE_{INC,DEC,MODIFY} rtx.
>> +static bool
>> +any_pre_modify_p (rtx x)
>> +{
>> +  const auto code = GET_CODE (x);
>> +  return code == PRE_INC || code == PRE_DEC || code == PRE_MODIFY;
>> +}
>> +
>> +// Return true if X is a POST_{INC,DEC,MODIFY} rtx.
>> +static bool
>> +any_post_modify_p (rtx x)
>> +{
>> +  const auto code = GET_CODE (x);
>> +  return code == POST_INC || code == POST_DEC || code == POST_MODIFY;
>> +}
>> +// Given LFS (load_p, fpsimd_p, size) fields in FIELDS, encode these
>> +// into an integer for use as a hash table key.
>> +static int
>> +encode_lfs (lfs_fields fields)
>> +{
>> +  int size_log2 = exact_log2 (fields.size);
>> +  //gcc_checking_assert (size_log2 >= 2 && size_log2 <= 4);
>> +  return ((int)fields.load_p << 3)
>> +    | ((int)fields.fpsimd_p << 2)
>> +    | (size_log2 - 2);
>> +}
>> +
>> +// Inverse of encode_lfs.
>> +static lfs_fields
>> +decode_lfs (int lfs)
>> +{
>> +  bool load_p = (lfs & (1 << 3));
>> +  bool fpsimd_p = (lfs & (1 << 2));
>> +  unsigned size = 1U << ((lfs & 3) + 2);
>> +  return { load_p, fpsimd_p, size };
>> +}
>> +// Track the access INSN at offset OFFSET in this access group.
>> +// ALLOC_NODE is used to allocate splay tree nodes.
>> +template<typename Alloc>
>> +void
>> +access_group::track (Alloc alloc_node, poly_int64 offset, insn_info *insn)
>> +{
>> +  auto insert_before = [&](std::list<access_record>::iterator after)
>> +    {
>> +      auto it = list.emplace (after, offset);
>> +      it->cand_insns.push_back (insn);
>> +      it->place = it;
>> +      return &*it;
>> +    };
>> +
>> +  if (!list.size ())
>> +    {
>> +      auto access = insert_before (list.end ());
>> +      tree.insert_max_node (alloc_node (access));
>> +      return;
>> +    }
>> +
>> +  auto compare = [&](splay_tree_node<access_record *> *node)
>> +    {
>> +      return compare_sizes_for_sort (offset, node->value ()->offset);
>> +    };
>> +  auto result = tree.lookup (compare);
>> +  splay_tree_node<access_record *> *node = tree.root ();
>> +  if (result == 0)
>> +    node->value ()->cand_insns.push_back (insn);
>> +  else
>> +    {
>> +      auto it = node->value ()->place;
>> +      auto after = (result > 0) ? std::next (it) : it;
>> +      auto access = insert_before (after);
>> +      tree.insert_child (node, result > 0, alloc_node (access));
>> +    }
>> +}
>> +// Given a candidate access INSN (with mem MEM), see if it has a suitable
>> +// MEM_EXPR base (i.e. a tree decl) relative to which we can track the access.
>> +// LFS is used as part of the key to the hash table, see track_access.
>> +bool
>> +fusion_bb_info::track_via_mem_expr (insn_info *insn, rtx mem, lfs_fields lfs)
>> +{
>> +  if (!MEM_EXPR (mem) || !MEM_OFFSET_KNOWN_P (mem))
>> +    return false;
>> +
>> +  poly_int64 offset;
>> +  tree base_expr = get_addr_base_and_unit_offset (MEM_EXPR (mem),
>> +						  &offset);
>> +  if (!base_expr || !DECL_P (base_expr))
>> +    return false;
>> +
>> +  offset += MEM_OFFSET (mem);
>> +
>> +  const machine_mode mem_mode = GET_MODE (mem);
>> +  const HOST_WIDE_INT mem_size = GET_MODE_SIZE (mem_mode);//.to_constant ();
> 
> Why the .to_constant () comment here?
> 
> I hope these comments are useful.  Please feel free to reach out if you have any
> questions / get stuck with anything.  I'm often around on IRC.
> 
> Thanks,
> Alex
> 
>> +
>> +  // Punt on misaligned offsets.  LDP/STP instructions require offsets to be a
>> +  // multiple of the access size, and we believe that misaligned offsets on
>> +  // MEM_EXPR bases are likely to lead to misaligned offsets w.r.t. RTL bases.
>> +  if (!multiple_p (offset, mem_size))
>> +    return false;
>> +
>> +  const auto key = std::make_pair (base_expr, encode_lfs (lfs));
>> +  access_group &group = expr_map.get_or_insert (key, NULL);
>> +  auto alloc = [&](access_record *access) { return node_alloc (access); };
>> +  group.track (alloc, offset, insn);
>> +
>> +  if (dump_file)
>> +    {
>> +      fprintf (dump_file, "[bb %u] tracking insn %d via ",
>> +	       m_bb->index (), insn->uid ());
>> +      print_node_brief (dump_file, "mem expr", base_expr, 0);
>> +      fprintf (dump_file, " [L=%d FP=%d, %smode, off=",
>> +	       lfs.load_p, lfs.fpsimd_p, mode_name[mem_mode]);
>> +      print_dec (offset, dump_file);
>> +      fprintf (dump_file, "]\n");
>> +    }
>> +
>> +  return true;
>> +}
>> +
>> +// Dummy predicate that never ignores any insns.
>> +static bool no_ignore (insn_info *) { return false; }
>> +
>> +// Return the latest dataflow hazard before INSN.
>> +//
>> +// If IGNORE is non-NULL, this points to a sub-rtx which we should ignore for
>> +// dataflow purposes.  This is needed when considering changing the RTL base of
>> +// an access discovered through a MEM_EXPR base.
>> +//
>> +// If IGNORE_INSN is non-NULL, we should further ignore any hazards arising
>> +// from that insn.
>> +//
>> +// N.B. we ignore any defs/uses of memory here as we deal with that separately,
>> +// making use of alias disambiguation.
>> +static insn_info *
>> +latest_hazard_before (insn_info *insn, rtx *ignore,
>> +		      insn_info *ignore_insn = nullptr)
>> +{
>> +  insn_info *result = nullptr;
>> +
>> +  // If the insn can throw then it is at the end of a BB and we can't
>> +  // move it, model this by recording a hazard in the previous insn
>> +  // which will prevent moving the insn up.
>> +  if (cfun->can_throw_non_call_exceptions
>> +      && find_reg_note (insn->rtl (), REG_EH_REGION, NULL_RTX))
>> +    return insn->prev_nondebug_insn ();
>> +
>> +  // Return true if we registered the hazard.
>> +  auto hazard = [&](insn_info *h) -> bool
>> +    {
>> +      gcc_checking_assert (*h < *insn);
>> +      if (h == ignore_insn)
>> +	return false;
>> +
>> +      if (!result || *h > *result)
>> +	result = h;
>> +
>> +      return true;
>> +    };
>> +
>> +  rtx pat = PATTERN (insn->rtl ());
>> +  auto ignore_use = [&](use_info *u)
>> +    {
>> +      if (u->is_mem ())
>> +	return true;
>> +
>> +      return !refers_to_regno_p (u->regno (), u->regno () + 1, pat, ignore);
>> +    };
>> +
>> +  // Find defs of uses in INSN (RaW).
>> +  for (auto use : insn->uses ())
>> +    if (!ignore_use (use) && use->def ())
>> +      hazard (use->def ()->insn ());
>> +
>> +  // Find previous defs (WaW) or previous uses (WaR) of defs in INSN.
>> +  for (auto def : insn->defs ())
>> +    {
>> +      if (def->is_mem ())
>> +	continue;
>> +
>> +      if (def->prev_def ())
>> +	{
>> +	  hazard (def->prev_def ()->insn ()); // WaW
>> +
>> +	  auto set = dyn_cast<set_info *> (def->prev_def ());
>> +	  if (set && set->has_nondebug_insn_uses ())
>> +	    for (auto use : set->reverse_nondebug_insn_uses ())
>> +	      if (use->insn () != insn && hazard (use->insn ())) // WaR
>> +		break;
>> +	}
>> +
>> +      if (!HARD_REGISTER_NUM_P (def->regno ()))
>> +	continue;
>> +
>> +      // Also need to check backwards for call clobbers (WaW).
>> +      for (auto call_group : def->ebb ()->call_clobbers ())
>> +	{
>> +	  if (!call_group->clobbers (def->resource ()))
>> +	    continue;
>> +
>> +	  auto clobber_insn = prev_call_clobbers_ignoring (*call_group,
>> +							   def->insn (),
>> +							   no_ignore);
>> +	  if (clobber_insn)
>> +	    hazard (clobber_insn);
>> +	}
>> +
>> +    }
>> +
>> +  return result;
>> +}
>> +
>> +// Return the first dataflow hazard after INSN.
>> +//
>> +// If IGNORE is non-NULL, this points to a sub-rtx which we should ignore for
>> +// dataflow purposes.  This is needed when considering changing the RTL base of
>> +// an access discovered through a MEM_EXPR base.
>> +//
>> +// N.B. we ignore any defs/uses of memory here as we deal with that separately,
>> +// making use of alias disambiguation.
>> +static insn_info *
>> +first_hazard_after (insn_info *insn, rtx *ignore)
>> +{
>> +  insn_info *result = nullptr;
>> +  auto hazard = [insn, &result](insn_info *h)
>> +    {
>> +      gcc_checking_assert (*h > *insn);
>> +      if (!result || *h < *result)
>> +	result = h;
>> +    };
>> +
>> +  rtx pat = PATTERN (insn->rtl ());
>> +  auto ignore_use = [&](use_info *u)
>> +    {
>> +      if (u->is_mem ())
>> +	return true;
>> +
>> +      return !refers_to_regno_p (u->regno (), u->regno () + 1, pat, ignore);
>> +    };
>> +
>> +  for (auto def : insn->defs ())
>> +    {
>> +      if (def->is_mem ())
>> +	continue;
>> +
>> +      if (def->next_def ())
>> +	hazard (def->next_def ()->insn ()); // WaW
>> +
>> +      auto set = dyn_cast<set_info *> (def);
>> +      if (set && set->has_nondebug_insn_uses ())
>> +	hazard (set->first_nondebug_insn_use ()->insn ()); // RaW
>> +
>> +      if (!HARD_REGISTER_NUM_P (def->regno ()))
>> +	continue;
>> +
>> +      // Also check for call clobbers of this def (WaW).
>> +      for (auto call_group : def->ebb ()->call_clobbers ())
>> +	{
>> +	  if (!call_group->clobbers (def->resource ()))
>> +	    continue;
>> +
>> +	  auto clobber_insn = next_call_clobbers_ignoring (*call_group,
>> +							   def->insn (),
>> +							   no_ignore);
>> +	  if (clobber_insn)
>> +	    hazard (clobber_insn);
>> +	}
>> +    }
>> +
>> +  // Find any subsequent defs of uses in INSN (WaR).
>> +  for (auto use : insn->uses ())
>> +    {
>> +      if (ignore_use (use))
>> +	continue;
>> +
>> +      if (use->def ())
>> +	{
>> +	  auto def = use->def ()->next_def ();
>> +	  if (def && def->insn () == insn)
>> +	    def = def->next_def ();
>> +
>> +	  if (def)
>> +	    hazard (def->insn ());
>> +	}
>> +
>> +      if (!HARD_REGISTER_NUM_P (use->regno ()))
>> +	continue;
>> +
>> +      // Also need to handle call clobbers of our uses (again WaR).
>> +      //
>> +      // See restrict_movement_for_uses_ignoring for why we don't
>> +      // need to check backwards for call clobbers.
>> +      for (auto call_group : use->ebb ()->call_clobbers ())
>> +	{
>> +	  if (!call_group->clobbers (use->resource ()))
>> +	    continue;
>> +
>> +	  auto clobber_insn = next_call_clobbers_ignoring (*call_group,
>> +							   use->insn (),
>> +							   no_ignore);
>> +	  if (clobber_insn)
>> +	    hazard (clobber_insn);
>> +	}
>> +    }
>> +
>> +  return result;
>> +}
>> +
>> +// Return true iff R1 and R2 overlap.
>> +static bool
>> +ranges_overlap_p (const insn_range_info &r1, const insn_range_info &r2)
>> +{
>> +  // If either range is empty, then their intersection is empty.
>> +  if (!r1 || !r2)
>> +    return false;
>> +
>> +  // When do they not overlap? When one range finishes before the other
>> +  // starts, i.e. (*r1.last < *r2.first || *r2.last < *r1.first).
>> +  // Inverting this, we get the below.
>> +  return *r1.last >= *r2.first && *r2.last >= *r1.first;
>> +}
>> +
>> +// Get the range of insns that def feeds.
>> +static insn_range_info get_def_range (def_info *def)
>> +{
>> +  insn_info *last = def->next_def ()->insn ()->prev_nondebug_insn ();
>> +  return { def->insn (), last };
>> +}
>> +
>> +// Given a def (of memory), return the downwards range within which we
>> +// can safely move this def.
>> +static insn_range_info
>> +def_downwards_move_range (def_info *def)
>> +{
>> +  auto range = get_def_range (def);
>> +
>> +  auto set = dyn_cast<set_info *> (def);
>> +  if (!set || !set->has_any_uses ())
>> +    return range;
>> +
>> +  auto use = set->first_nondebug_insn_use ();
>> +  if (use)
>> +    range = move_earlier_than (range, use->insn ());
>> +
>> +  return range;
>> +}
>> +
>> +// Given a def (of memory), return the upwards range within which we can
>> +// safely move this def.
>> +static insn_range_info
>> +def_upwards_move_range (def_info *def)
>> +{
>> +  def_info *prev = def->prev_def ();
>> +  insn_range_info range { prev->insn (), def->insn () };
>> +
>> +  auto set = dyn_cast<set_info *> (prev);
>> +  if (!set || !set->has_any_uses ())
>> +    return range;
>> +
>> +  auto use = set->last_nondebug_insn_use ();
>> +  if (use)
>> +    range = move_later_than (range, use->insn ());
>> +
>> +  return range;
>> +}
>> +
>> +// Given candidate store insns FIRST and SECOND, see if we can re-purpose one
>> +// of them (together with its def of memory) for the stp/stxvp insn. If so,
>> +//  return that insn.  Otherwise, return null.
>> +static insn_info *
>> +decide_store_strategy (insn_info *first,
>> +		     insn_info *second,
>> +		     const insn_range_info &move_range)
>> +{
>> +  def_info * const defs[2] = {
>> +    memory_access (first->defs ()),
>> +    memory_access (second->defs ())
>> +  };
>> +
>> +  if (move_range.includes (first)
>> +      || ranges_overlap_p (move_range, def_downwards_move_range (defs[0])))
>> +    return first;
>> +
>> +  if (move_range.includes (second)
>> +      || ranges_overlap_p (move_range, def_upwards_move_range (defs[1])))
>> +    return second;
>> +
>> +  return nullptr;
>> +}
>> +
>> +// Generate the RTL pattern for a "tombstone"; used temporarily during this pass
>> +// to replace stores that are marked for deletion where we can't immediately
>> +// delete the store (since there are uses of mem hanging off the store).
>> +//
>> +// These are deleted at the end of the pass and uses re-parented appropriately
>> +// at this point.
>> +static rtx
>> +gen_tombstone (void)
>> +{
>> +  return gen_rtx_CLOBBER (VOIDmode,
>> +			  gen_rtx_MEM (BLKmode, gen_rtx_SCRATCH (Pmode)));
>> +}
>> +// Go through the reg notes rooted at NOTE, dropping those that we should drop,
>> +// and preserving those that we want to keep by prepending them to (and
>> +// returning) RESULT.  EH_REGION is used to make sure we have at most one
>> +// REG_EH_REGION note in the resulting list.  FR_EXPR is used to return any
>> +// REG_FRAME_RELATED_EXPR note we find, as these can need special handling in
>> +// combine_reg_notes.
>> +static rtx
>> +filter_notes (rtx note, rtx result, bool *eh_region, rtx *fr_expr)
>> +{
>> +  for (; note; note = XEXP (note, 1))
>> +    {
>> +      switch (REG_NOTE_KIND (note))
>> +	{
>> +	case REG_DEAD:
>> +	  // REG_DEAD notes aren't required to be maintained.
>> +	case REG_EQUAL:
>> +	case REG_EQUIV:
>> +	case REG_UNUSED:
>> +	case REG_NOALIAS:
>> +	  // These can all be dropped.  For REG_EQU{AL,IV} they cannot apply to
>> +	  // non-single_set insns, and REG_UNUSED is re-computed by RTl-SSA, see
>> +	  // rtl-ssa/changes.cc:update_notes.
>> +	  //
>> +	  // Similarly, REG_NOALIAS cannot apply to a parallel.
>> +	case REG_INC:
>> +	  // When we form the pair insn, the reg update is implemented
>> +	  // as just another SET in the parallel, so isn't really an
>> +	  // auto-increment in the RTL sense, hence we drop the note.
>> +	  break;
>> +	case REG_EH_REGION:
>> +	  gcc_assert (!*eh_region);
>> +	  *eh_region = true;
>> +	  result = alloc_reg_note (REG_EH_REGION, XEXP (note, 0), result);
>> +	  break;
>> +	case REG_CFA_DEF_CFA:
>> +	case REG_CFA_OFFSET:
>> +	case REG_CFA_RESTORE:
>> +	  result = alloc_reg_note (REG_NOTE_KIND (note),
>> +				   copy_rtx (XEXP (note, 0)),
>> +				   result);
>> +	  break;
>> +	case REG_FRAME_RELATED_EXPR:
>> +	  gcc_assert (!*fr_expr);
>> +	  *fr_expr = copy_rtx (XEXP (note, 0));
>> +	  break;
>> +	default:
>> +	  // Unexpected REG_NOTE kind.
>> +	  gcc_unreachable ();
>> +	}
>> +    }
>> +
>> +  return result;
>> +}
>> +
>> +// Return the notes that should be attached to a combination of I1 and I2, where
>> +// *I1 < *I2.  LOAD_P is true for loads.
>> +static rtx
>> +combine_reg_notes (insn_info *i1, insn_info *i2, bool load_p)
>> +{
>> +  // Temporary storage for REG_FRAME_RELATED_EXPR notes.
>> +  rtx fr_expr[2] = {};
>> +
>> +  bool found_eh_region = false;
>> +  rtx result = NULL_RTX;
>> +  result = filter_notes (REG_NOTES (i2->rtl ()), result,
>> +			 &found_eh_region, fr_expr);
>> +  result = filter_notes (REG_NOTES (i1->rtl ()), result,
>> +			 &found_eh_region, fr_expr + 1);
>> +
>> +  if (!load_p)
>> +    {
>> +      // Simple frame-related sp-relative saves don't need CFI notes, but when
>> +      // we combine them into an stp/stxvp we will need a CFI note as dwarf2cfi
>> +      //  can't interpret the unspec pair representation directly.
>> +      if (RTX_FRAME_RELATED_P (i1->rtl ()) && !fr_expr[0])
>> +	fr_expr[0] = copy_rtx (PATTERN (i1->rtl ()));
>> +      if (RTX_FRAME_RELATED_P (i2->rtl ()) && !fr_expr[1])
>> +	fr_expr[1] = copy_rtx (PATTERN (i2->rtl ()));
>> +    }
>> +
>> +  rtx fr_pat = NULL_RTX;
>> +  if (fr_expr[0] && fr_expr[1])
>> +    {
>> +      // Combining two frame-related insns, need to construct
>> +      // a REG_FRAME_RELATED_EXPR note which represents the combined
>> +      // operation.
>> +      RTX_FRAME_RELATED_P (fr_expr[1]) = 1;
>> +      fr_pat = gen_rtx_PARALLEL (VOIDmode,
>> +				 gen_rtvec (2, fr_expr[0], fr_expr[1]));
>> +    }
>> +  else
>> +    fr_pat = fr_expr[0] ? fr_expr[0] : fr_expr[1];
>> +
>> +  if (fr_pat)
>> +    result = alloc_reg_note (REG_FRAME_RELATED_EXPR,
>> +			     fr_pat, result);
>> +
>> +  return result;
>> +}
>> +
>> +// Given two memory accesses in PATS, at least one of which is of a
>> +// writeback form, extract two non-writeback memory accesses addressed
>> +// relative to the initial value of the base register, and output these
>> +// in PATS.  Return an rtx that represents the overall change to the
>> +// base register.
>> +static rtx
>> +extract_writebacks (bool load_p, rtx pats[2], int changed)
>> +{
>> +  rtx base_reg = NULL_RTX;
>> +  poly_int64 current_offset = 0;
>> +
>> +  poly_int64 offsets[2];
>> +
>> +  for (int i = 0; i < 2; i++)
>> +    {
>> +      rtx mem = XEXP (pats[i], load_p);
>> +      rtx reg = XEXP (pats[i], !load_p);
>> +
>> +      rtx addr = XEXP (mem, 0);
>> +      const bool autoinc_p = GET_RTX_CLASS (GET_CODE (addr)) == RTX_AUTOINC;
>> +
>> +      poly_int64 offset;
>> +      rtx this_base = load_strip_offset (mem, &offset);
>> +      gcc_assert (REG_P (this_base));
>> +      if (base_reg)
>> +	gcc_assert (rtx_equal_p (base_reg, this_base));
>> +      else
>> +	base_reg = this_base;
>> +
>> +      // If we changed base for the current insn, then we already
>> +      // derived the correct mem for this insn from the effective
>> +      // address of the other access.
>> +      if (i == changed)
>> +	{
>> +	  gcc_checking_assert (!autoinc_p);
>> +	  offsets[i] = offset;
>> +	  continue;
>> +	}
>> +
>> +      if (autoinc_p && any_pre_modify_p (addr))
>> +	current_offset += offset;
>> +
>> +      poly_int64 this_off = current_offset;
>> +      if (!autoinc_p)
>> +	this_off += offset;
>> +
>> +      offsets[i] = this_off;
>> +      rtx new_mem = change_address (mem, GET_MODE (mem),
>> +				    plus_constant (GET_MODE (base_reg),
>> +						   base_reg, this_off));
>> +      pats[i] = load_p
>> +	? gen_rtx_SET (reg, new_mem)
>> +	: gen_rtx_SET (new_mem, reg);
>> +
>> +      if (autoinc_p && any_post_modify_p (addr))
>> +	current_offset += offset;
>> +    }
>> +
>> +  if (known_eq (current_offset, 0))
>> +    return NULL_RTX;
>> +
>> +  return gen_rtx_SET (base_reg, plus_constant (GET_MODE (base_reg),
>> +					       base_reg, current_offset));
>> +}
>> +
>> +// We just emitted a tombstone with uid UID, track it in a bitmap for
>> +// this BB so we can easily identify it later when cleaning up tombstones.
>> +void
>> +fusion_bb_info::track_tombstone (int uid)
>> +{
>> +  if (!m_emitted_tombstone)
>> +    {
>> +      // Lazily initialize the bitmap for tracking tombstone insns.
>> +      bitmap_obstack_initialize (&m_bitmap_obstack);
>> +      bitmap_initialize (&m_tombstone_bitmap, &m_bitmap_obstack);
>> +      m_emitted_tombstone = true;
>> +    }
>> +
>> +  if (!bitmap_set_bit (&m_tombstone_bitmap, uid))
>> +    gcc_unreachable (); // Bit should have changed.
>> +}
>> +
>> +// Virtual base class for load/store walkers used in alias analysis.
>> +struct alias_walker
>> +{
>> +  virtual bool conflict_p (int &budget) const = 0;
>> +  virtual insn_info *insn () const = 0;
>> +  virtual bool valid () const  = 0;
>> +  virtual void advance () = 0;
>> +};
>> +
>> +// Implement some common functionality used by both store_walker
>> +// and load_walker.
>> +template<bool reverse>
>> +class def_walker : public alias_walker
>> +{
>> +protected:
>> +  using def_iter_t = typename std::conditional<reverse,
>> +	reverse_def_iterator, def_iterator>::type;
>> +
>> +  static use_info *start_use_chain (def_iter_t &def_iter)
>> +  {
>> +    set_info *set = nullptr;
>> +    for (; *def_iter; def_iter++)
>> +      {
>> +	set = dyn_cast<set_info *> (*def_iter);
>> +	if (!set)
>> +	  continue;
>> +
>> +	use_info *use = reverse
>> +	  ? set->last_nondebug_insn_use ()
>> +	  : set->first_nondebug_insn_use ();
>> +
>> +	if (use)
>> +	  return use;
>> +      }
>> +
>> +    return nullptr;
>> +  }
>> +
>> +  def_iter_t def_iter;
>> +  insn_info *limit;
>> +  def_walker (def_info *def, insn_info *limit) :
>> +    def_iter (def), limit (limit) {}
>> +
>> +  virtual bool iter_valid () const { return *def_iter; }
>> +
>> +public:
>> +  insn_info *insn () const override { return (*def_iter)->insn (); }
>> +  void advance () override { def_iter++; }
>> +  bool valid () const override final
>> +  {
>> +    if (!iter_valid ())
>> +      return false;
>> +
>> +    if (reverse)
>> +      return *(insn ()) > *limit;
>> +    else
>> +      return *(insn ()) < *limit;
>> +  }
>> +};
>> +
>> +// alias_walker that iterates over stores.
>> +template<bool reverse, typename InsnPredicate>
>> +class store_walker : public def_walker<reverse>
>> +{
>> +  rtx cand_mem;
>> +  InsnPredicate tombstone_p;
>> +
>> +public:
>> +  store_walker (def_info *mem_def, rtx mem, insn_info *limit_insn,
>> +		InsnPredicate tombstone_fn) :
>> +    def_walker<reverse> (mem_def, limit_insn),
>> +    cand_mem (mem), tombstone_p (tombstone_fn) {}
>> +
>> +  bool conflict_p (int &budget) const override final
>> +  {
>> +    if (tombstone_p (this->insn ()))
>> +      return false;
>> +
>> +    return store_modifies_mem_p (cand_mem, this->insn (), budget);
>> +  }
>> +};
>> +
>> +// alias_walker that iterates over loads.
>> +template<bool reverse>
>> +class load_walker : public def_walker<reverse>
>> +{
>> +  using Base = def_walker<reverse>;
>> +  using use_iter_t = typename std::conditional<reverse,
>> +	reverse_use_iterator, nondebug_insn_use_iterator>::type;
>> +
>> +  use_iter_t use_iter;
>> +  insn_info *cand_store;
>> +
>> +  bool iter_valid () const override final { return *use_iter; }
>> +
>> +public:
>> +  void advance () override final
>> +  {
>> +    use_iter++;
>> +    if (*use_iter)
>> +      return;
>> +    this->def_iter++;
>> +    use_iter = Base::start_use_chain (this->def_iter);
>> +  }
>> +
>> +  insn_info *insn () const override final
>> +  {
>> +    return (*use_iter)->insn ();
>> +  }
>> +
>> +  bool conflict_p (int &budget) const override final
>> +  {
>> +    return load_modified_by_store_p (insn (), cand_store, budget);
>> +  }
>> +
>> +  load_walker (def_info *def, insn_info *store, insn_info *limit_insn)
>> +    : Base (def, limit_insn),
>> +      use_iter (Base::start_use_chain (this->def_iter)),
>> +      cand_store (store) {}
>> +};
>> +
>> +// Given INSNS (in program order) which are known to be adjacent, look
>> +// to see if either insn has a suitable RTL (register) base that we can
>> +// use to form a pair.  Push these to BASE_CANDS if we find any.  CAND_MEMs
>> +// gives the relevant mems from the candidate insns, ACCESS_SIZE gives the
>> +// size of a single candidate access, and REVERSED says whether the accesses
>> +// are inverted in offset order.
>> +//
>> +// Returns an integer where bit (1 << i) is set if INSNS[i] uses writeback
>> +// addressing.
>> +static int
>> +get_viable_bases (insn_info *insns[2],
>> +		  vec<base_cand> &base_cands,
>> +		  rtx cand_mems[2],
>> +		  unsigned access_size,
>> +		  bool reversed)
>> +{
>> +  // We discovered this pair through a common base.  Need to ensure that
>> +  // we have a common base register that is live at both locations.
>> +  def_info *base_defs[2] = {};
>> +  int writeback = 0;
>> +  for (int i = 0; i < 2; i++)
>> +    {
>> +      const bool is_lower = (i == reversed);
>> +      poly_int64 poly_off;
>> +      rtx base = load_strip_offset (cand_mems[i], &poly_off);
>> +      if (GET_RTX_CLASS (GET_CODE (XEXP (cand_mems[i], 0))) == RTX_AUTOINC)
>> +	writeback |= (1 << i);
>> +
>> +      if (!REG_P (base) || !poly_off.is_constant ())
>> +	continue;
>> +
>> +      // Punt on accesses relative to eliminable regs.  See the comment in
>> +      // fusion_bb_info::track_access for a detailed explanation of this.
>> +      if (!reload_completed
>> +	  && (REGNO (base) == FRAME_POINTER_REGNUM
>> +	      || REGNO (base) == ARG_POINTER_REGNUM))
>> +	continue;
>> +
>> +      HOST_WIDE_INT base_off = poly_off.to_constant ();
>> +
>> +      // It should be unlikely that we ever punt here, since MEM_EXPR offset
>> +      // alignment should be a good proxy for register offset alignment.
>> +      if (base_off % access_size != 0)
>> +	{
>> +	  if (dump_file)
>> +	    fprintf (dump_file,
>> +		     "base not viable, offset misaligned (insn %d)\n",
>> +		     insns[i]->uid ());
>> +	  continue;
>> +	}
>> +
>> +      base_off /= access_size;
>> +
>> +      if (!is_lower)
>> +	base_off--;
>> +
>> +      if (base_off < LDP_MIN_IMM || base_off > LDP_MAX_IMM)
>> +	continue;
>> +
>> +      use_info *use = find_access (insns[i]->uses (), REGNO (base));
>> +      gcc_assert (use);
>> +      base_defs[i] = use->def ();
>> +    }
>> +
>> +  if (!base_defs[0] && !base_defs[1])
>> +    {
>> +      if (dump_file)
>> +	fprintf (dump_file, "no viable base register for pair (%d,%d)\n",
>> +		 insns[0]->uid (), insns[1]->uid ());
>> +      return writeback;
>> +    }
>> +
>> +  for (int i = 0; i < 2; i++)
>> +    if ((writeback & (1 << i)) && !base_defs[i])
>> +      {
>> +	if (dump_file)
>> +	  fprintf (dump_file, "insn %d has writeback but base isn't viable\n",
>> +		   insns[i]->uid ());
>> +	return writeback;
>> +      }
>> +
>> +  if (writeback == 3
>> +      && base_defs[0]->regno () != base_defs[1]->regno ())
>> +    {
>> +      if (dump_file)
>> +	fprintf (dump_file,
>> +		 "pair (%d,%d): double writeback with distinct regs (%d,%d): "
>> +		 "punting\n",
>> +		 insns[0]->uid (), insns[1]->uid (),
>> +		 base_defs[0]->regno (), base_defs[1]->regno ());
>> +      return writeback;
>> +    }
>> +
>> +  if (base_defs[0] && base_defs[1]
>> +      && base_defs[0]->regno () == base_defs[1]->regno ())
>> +    {
>> +      // Easy case: insns already share the same base reg.
>> +      base_cands.quick_push (base_defs[0]);
>> +      return writeback;
>> +    }
>> +
>> +  // Otherwise, we know that one of the bases must change.
>> +  //
>> +  // Note that if there is writeback we must use the writeback base
>> +  // (we know now there is exactly one).
>> +  for (int i = 0; i < 2; i++)
>> +    if (base_defs[i] && (!writeback || (writeback & (1 << i))))
>> +      base_cands.quick_push (base_cand { base_defs[i], i });
>> +
>> +  return writeback;
>> +}
>> +
>> +static void
>> +dump_insn_list (FILE *f, const insn_list_t &l)
>> +{
>> +  fprintf (f, "(");
>> +
>> +  auto i = l.begin ();
>> +  auto end = l.end ();
>> +
>> +  if (i != end)
>> +    fprintf (f, "%d", (*i)->uid ());
>> +  i++;
>> +
>> +  for (; i != end; i++)
>> +    fprintf (f, ", %d", (*i)->uid ());
>> +
>> +  fprintf (f, ")");
>> +}
>> +
>> +DEBUG_FUNCTION void
>> +debug (const insn_list_t &l)
>> +{
>> +  dump_insn_list (stderr, l);
>> +  fprintf (stderr, "\n");
>> +}
>> +
>> +// If we emitted tombstone insns for this BB, iterate through the BB
>> +// and remove all the tombstone insns, being sure to reparent any uses
>> +// of mem to previous defs when we do this.
>> +void
>> +fusion_bb_info::cleanup_tombstones ()
>> +{
>> +  // No need to do anything if we didn't emit a tombstone insn for this BB.
>> +  if (!m_emitted_tombstone)
>> +    return;
>> +
>> +  insn_info *insn = m_bb->head_insn ();
>> +  while (insn)
>> +    {
>> +      insn_info *next = insn->next_nondebug_insn ();
>> +      if (!insn->is_real ()
>> +	  || !bitmap_bit_p (&m_tombstone_bitmap, insn->uid ()))
>> +	{
>> +	  insn = next;
>> +	  continue;
>> +	}
>> +
>> +      auto def = memory_access (insn->defs ());
>> +      auto set = dyn_cast<set_info *> (def);
>> +      if (set && set->has_any_uses ())
>> +	{
>> +	  def_info *prev_def = def->prev_def ();
>> +	  auto prev_set = dyn_cast<set_info *> (prev_def);
>> +	  if (!prev_set)
>> +	    gcc_unreachable ();
>> +
>> +	  while (set->first_use ())
>> +	    crtl->ssa->reparent_use (set->first_use (), prev_set);
>> +	}
>> +
>> +      // Now set has no uses, we can delete it.
>> +      insn_change change (insn, insn_change::DELETE);
>> +      //df_insn_rescan (insn->rtl());
>> +      crtl->ssa->change_insn (change);
>> +      insn = next;
>> +    }
>> +}
>> +
>> +template<typename Map>
>> +void
>> +fusion_bb_info::traverse_base_map (Map &map)
>> +{
>> +  for (auto kv : map)
>> +    {
>> +      const auto &key = kv.first;
>> +      auto &value = kv.second;
>> +      transform_for_base (key.second, value);
>> +    }
>> +}
>> +
>> +void
>> +fusion_bb_info::transform ()
>> +{
>> +  traverse_base_map (expr_map);
>> +  traverse_base_map (def_map);
>> +}
>> diff --git a/gcc/testsuite/g++.target/powerpc/vecload.C b/gcc/testsuite/g++.target/powerpc/vecload.C
>> new file mode 100644
>> index 00000000000..c523572cf3c
>> --- /dev/null
>> +++ b/gcc/testsuite/g++.target/powerpc/vecload.C
>> @@ -0,0 +1,15 @@
>> +/* { dg-do compile } */ 
>> +/* { dg-require-effective-target power10_ok } */
>> +/* { dg-options "-mdejagnu-cpu=power10 -O2" } */ 
>> +
>> +#include <altivec.h>
>> +
>> +void
>> +foo (__vector_quad *dst, vector unsigned char *ptr, vector unsigned char src)
>> +{
>> +  __vector_quad acc;
>> +  __builtin_mma_xvf32ger(&acc, src, ptr[0]);
>> +  __builtin_mma_xvf32gerpp(&acc, src, ptr[1]);
>> +  *dst = acc;
>> +}
>> +/* { dg-final { scan-assembler {\mlxvp\M} } } */
>> diff --git a/gcc/testsuite/g++.target/powerpc/vecload1.C b/gcc/testsuite/g++.target/powerpc/vecload1.C
>> new file mode 100644
>> index 00000000000..d10ff0cdf36
>> --- /dev/null
>> +++ b/gcc/testsuite/g++.target/powerpc/vecload1.C
>> @@ -0,0 +1,22 @@
>> +/* { dg-do compile } */
>> +/* { dg-require-effective-target power10_ok } */
>> +/* { dg-options "-mdejagnu-cpu=power10 -O2" } */
>> +
>> +#include <altivec.h>
>> +	
>> +void
>> +foo2 ()
>> +{
>> +  __vector_quad *dst1;
>> +  __vector_quad *dst2;
>> +  vector unsigned char src;
>> +  __vector_quad acc;
>> +  vector unsigned char *ptr;
>> +  __builtin_mma_xvf32ger(&acc, src, ptr[0]);
>> +  __builtin_mma_xvf32gerpp(&acc, src, ptr[1]);
>> +  *dst1 = acc;
>> +  __builtin_mma_xvf32ger(&acc, src, ptr[2]);
>> +  __builtin_mma_xvf32gerpp(&acc, src, ptr[3]);
>> +  *dst2 = acc;
>> +}
>> +/* { dg-final { scan-assembler {\mlxvp\M} } } */
>> diff --git a/gcc/testsuite/gcc.target/powerpc/mma-builtin-1.c b/gcc/testsuite/gcc.target/powerpc/mma-builtin-1.c
>> index 69ee826e1be..ae29127f954 100644
>> --- a/gcc/testsuite/gcc.target/powerpc/mma-builtin-1.c
>> +++ b/gcc/testsuite/gcc.target/powerpc/mma-builtin-1.c
>> @@ -258,8 +258,8 @@ foo13b (__vector_quad *dst, __vector_quad *src, vec_t *vec)
>>    dst[13] = acc;
>>  }
>>  
>> -/* { dg-final { scan-assembler-times {\mlxv\M} 40 } } */
>> -/* { dg-final { scan-assembler-times {\mlxvp\M} 12 } } */
>> +/* { dg-final { scan-assembler-times {\mlxv\M} 0 } } */
>> +/* { dg-final { scan-assembler-times {\mlxvp\M} 32 } } */
>>  /* { dg-final { scan-assembler-times {\mstxvp\M} 40 } } */
>>  /* { dg-final { scan-assembler-times {\mxxmfacc\M} 20 } } */
>>  /* { dg-final { scan-assembler-times {\mxxmtacc\M} 6 } } */
>> -- 
>> 2.39.3
>>
Ajit Agarwal Feb. 14, 2024, 1:41 p.m. UTC | #4
Hello Alex:

On 24/01/24 10:13 pm, Alex Coplan wrote:
> Hi Ajit,
> 
> On 21/01/2024 19:57, Ajit Agarwal wrote:
>>
>> Hello All:
>>
>> New pass to replace adjacent memory addresses lxv with lxvp.
>> Added common infrastructure for load store fusion for
>> different targets.
> 
> Thanks for this, it would be nice to see the load/store pair pass
> generalized to multiple targets.
> 
> I assume you are targeting GCC 15 for this, as we are in stage 4 at
> the moment?
> 
>>
>> Common routines are refactored in fusion-common.h.
>>
>> AARCH64 load/store fusion pass is not changed with the 
>> common infrastructure.
> 
> I think any patch to generalize the load/store pair fusion pass should
> update the aarch64 code at the same time to use the generic
> infrastructure, instead of duplicating the code.
> 
> As a general comment, I think we should move as much of the code as
> possible to target-independent code, with only the bits that are truly
> target-specific (e.g. deciding which modes to allow for a load/store
> pair operand) in target code.
> 
> In terms of structuring the interface between generic code and target
> code, I think it would be pragmatic to use a class with (in some cases,
> pure) virtual functions that can be overriden by targets to implement
> any target-specific behaviour.
> 
> IMO the generic class should be implemented in its own .cc instead of
> using a header-only approach.  The target code would then define a
> derived class which overrides the virtual functions (where necessary)
> declared in the generic class, and then instantiate the derived class to
> create a target-customized instance of the pass.

Incorporated the above comments in the recent patch sent.
> 
> A more traditional GCC approach would be to use optabs and target hooks
> to customize the behaviour of the pass to handle target-specific
> aspects, but:
>  - Target hooks are quite heavyweight, and we'd potentially have to add
>    quite a few hooks just for one pass that (at least initially) will
>    only be used by a couple of targets.
>  - Using classes allows both sides to easily maintain their own state
>    and share that state where appropriate.
> 
> Nit on naming: I understand you want to move away from ldp_fusion, but
> how about pair_fusion or mem_pair_fusion instead of just "fusion" as a
> base name?  IMO just "fusion" isn't very clear as to what the pass is
> trying to achieve.
> 

I have made it pair_fusion.

> In general the code could do with a lot more commentary to explain the
> rationale for various things / explain the high-level intent of the
> code.
> 
> Unfortunately I'm not familiar with the DF framework (I've only really
> worked with RTL-SSA for the aarch64 pass), so I haven't commented on the
> use of that framework, but it would be nice if what you're trying to do
> could be done using RTL-SSA instead of using DF directly.
>

I have used rtl-ssa DEF-USE at many places in the recent patch.
But DF framework is useful as it populates a pointer rtx through
DF_REF_LOC and then we can easily modify. This is missing in
rtl-ssa pass and wherever LOC is required to change I have 
used DF framework in the recent patch.

 
> Hopefully Richard S can chime in on those aspects.
> 
> My main concerns with the patch at the moment (apart from the code
> duplication) is that it looks like:
> 
>  - The patch removes alias analysis from try_fuse_pair, which is unsafe.
>  - The patch tries to make its own RTL changes inside
>    rs6000_gen_load_pair, but it should let fuse_pair make those changes
>    using RTL-SSA instead.
>

My mistake that I have remove alias analysis from try_fuse_pair.
In recent patch I kept all the code in the aarch64-ldp-fusion
intact except organizing the generic and target dependent code
through pure virtual functions.
 
> I've left some more specific (but still mostly high-level) comments below.
> 
>>
>> For AARCH64 architectures just include "fusion-common.h"
>> and target dependent code can be added to that.
>>
>>
>> Alex/Richard:
>>
>> If you would like me to add for AARCH64 I can do that for AARCH64.
>>
>> If you would like to do that is fine with me.
>>
>> Bootstrapped and regtested with powerpc64-linux-gnu.
>>
>> Improvement in performance is seen with Spec 2017 spec FP benchmarks.
>>
>> Thanks & Regards
>> Ajit
>>
>> rs6000: New  pass for replacement of adjacent lxv with lxvp.
> 
> Are you looking to handle stores eventually, out of interest?  Looking
> at rs6000-vecload-opt.cc:fusion_bb it looks like you're just handling
> loads at the moment.
> 
>>

I have included store fusion also in the recent patch.

>> New pass to replace adjacent memory addresses lxv with lxvp.
>> Added common infrastructure for load store fusion for
>> different targets.
>>
>> Common routines are refactored in fusion-common.h.
> 
> I've just done a very quick scan through this file as it mostly just
> looks to be idential to existing code in aarch64-ldp-fusion.cc.
> 
>>
>> 2024-01-21  Ajit Kumar Agarwal  <aagarwa1@linux.ibm.com>
>>
>> gcc/ChangeLog:
>>
>> 	* config/rs6000/rs6000-passes.def: New vecload pass
>> 	before pass_early_remat.
>> 	* config/rs6000/rs6000-vecload-opt.cc: Add new pass.
>> 	* config.gcc: Add new executable.
>> 	* config/rs6000/rs6000-protos.h: Add new prototype for vecload
>> 	pass.
>> 	* config/rs6000/rs6000.cc: Add new prototype for vecload pass.
>> 	* config/rs6000/t-rs6000: Add new rule.
>> 	* fusion-common.h: Add common infrastructure for load store
>> 	fusion that can be shared across different architectures.
>> 	* emit-rtl.cc: Modify assert code.
>>
>> gcc/testsuite/ChangeLog:
>>
>> 	* g++.target/powerpc/vecload.C: New test.
>> 	* g++.target/powerpc/vecload1.C: New test.
>> 	* gcc.target/powerpc/mma-builtin-1.c: Modify test.
>> ---
>>  gcc/config.gcc                                |    4 +-
>>  gcc/config/rs6000/rs6000-passes.def           |    3 +
>>  gcc/config/rs6000/rs6000-protos.h             |    1 +
>>  gcc/config/rs6000/rs6000-vecload-opt.cc       | 1186 ++++++++++++++++
>>  gcc/config/rs6000/rs6000.cc                   |    1 +
>>  gcc/config/rs6000/t-rs6000                    |    5 +
>>  gcc/emit-rtl.cc                               |   14 +-
>>  gcc/fusion-common.h                           | 1195 +++++++++++++++++
>>  gcc/testsuite/g++.target/powerpc/vecload.C    |   15 +
>>  gcc/testsuite/g++.target/powerpc/vecload1.C   |   22 +
>>  .../gcc.target/powerpc/mma-builtin-1.c        |    4 +-
>>  11 files changed, 2433 insertions(+), 17 deletions(-)
>>  create mode 100644 gcc/config/rs6000/rs6000-vecload-opt.cc
>>  create mode 100644 gcc/fusion-common.h
>>  create mode 100644 gcc/testsuite/g++.target/powerpc/vecload.C
>>  create mode 100644 gcc/testsuite/g++.target/powerpc/vecload1.C
>>
>> diff --git a/gcc/config.gcc b/gcc/config.gcc
>> index 00355509c92..9bff42cf830 100644
>> --- a/gcc/config.gcc
>> +++ b/gcc/config.gcc
>> @@ -518,7 +518,7 @@ or1k*-*-*)
>>  	;;
>>  powerpc*-*-*)
>>  	cpu_type=rs6000
>> -	extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o"
>> +	extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o rs6000-vecload-opt.o"
>>  	extra_objs="${extra_objs} rs6000-call.o rs6000-pcrel-opt.o"
>>  	extra_objs="${extra_objs} rs6000-builtins.o rs6000-builtin.o"
>>  	extra_headers="ppc-asm.h altivec.h htmintrin.h htmxlintrin.h"
>> @@ -555,7 +555,7 @@ riscv*)
>>  	;;
>>  rs6000*-*-*)
>>  	extra_options="${extra_options} g.opt fused-madd.opt rs6000/rs6000-tables.opt"
>> -	extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o"
>> +	extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o rs6000-vecload-opt.o"
>>  	extra_objs="${extra_objs} rs6000-call.o rs6000-pcrel-opt.o"
>>  	target_gtfiles="$target_gtfiles \$(srcdir)/config/rs6000/rs6000-logue.cc \$(srcdir)/config/rs6000/rs6000-call.cc"
>>  	target_gtfiles="$target_gtfiles \$(srcdir)/config/rs6000/rs6000-pcrel-opt.cc"
>> diff --git a/gcc/config/rs6000/rs6000-passes.def b/gcc/config/rs6000/rs6000-passes.def
>> index 46a0d0b8c56..eb4a65ebe10 100644
>> --- a/gcc/config/rs6000/rs6000-passes.def
>> +++ b/gcc/config/rs6000/rs6000-passes.def
>> @@ -28,6 +28,9 @@ along with GCC; see the file COPYING3.  If not see
>>       The power8 does not have instructions that automaticaly do the byte swaps
>>       for loads and stores.  */
>>    INSERT_PASS_BEFORE (pass_cse, 1, pass_analyze_swaps);
>> +  /* Pass to replace adjacent memory addresses lxv instruction with lxvp
>> +     instruction.  */
>> +  INSERT_PASS_BEFORE (pass_early_remat, 1, pass_analyze_vecload);
>>  
>>    /* Pass to do the PCREL_OPT optimization that combines the load of an
>>       external symbol's address along with a single load or store using that
>> diff --git a/gcc/config/rs6000/rs6000-protos.h b/gcc/config/rs6000/rs6000-protos.h
>> index 09a57a806fa..f0a9f36602e 100644
>> --- a/gcc/config/rs6000/rs6000-protos.h
>> +++ b/gcc/config/rs6000/rs6000-protos.h
>> @@ -343,6 +343,7 @@ namespace gcc { class context; }
>>  class rtl_opt_pass;
>>  
>>  extern rtl_opt_pass *make_pass_analyze_swaps (gcc::context *);
>> +extern rtl_opt_pass *make_pass_analyze_vecload (gcc::context *);
>>  extern rtl_opt_pass *make_pass_pcrel_opt (gcc::context *);
>>  extern bool rs6000_sum_of_two_registers_p (const_rtx expr);
>>  extern bool rs6000_quadword_masked_address_p (const_rtx exp);
>> diff --git a/gcc/config/rs6000/rs6000-vecload-opt.cc b/gcc/config/rs6000/rs6000-vecload-opt.cc
>> new file mode 100644
>> index 00000000000..cebadad97d8
>> --- /dev/null
>> +++ b/gcc/config/rs6000/rs6000-vecload-opt.cc
>> @@ -0,0 +1,1186 @@
>> +/* Subroutines used to replace lxv with lxvp
>> +   for TARGET_POWER10 and TARGET_VSX,
>> +
>> +   Copyright (C) 2020-2023 Free Software Foundation, Inc.
>> +   Contributed by Ajit Kumar Agarwal <aagarwa1@linux.ibm.com>.
>> +
>> +   This file is part of GCC.
>> +
>> +   GCC is free software; you can redistribute it and/or modify it
>> +   under the terms of the GNU General Public License as published
>> +   by the Free Software Foundation; either version 3, or (at your
>> +   option) any later version.
>> +
>> +   GCC is distributed in the hope that it will be useful, but WITHOUT
>> +   ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
>> +   or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public
>> +   License for more details.
>> +
>> +   You should have received a copy of the GNU General Public License
>> +   along with GCC; see the file COPYING3.  If not see
>> +   <http://www.gnu.org/licenses/>.  */
>> +
>> +#define IN_TARGET_CODE 1
>> +#define INCLUDE_ALGORITHM
>> +#define INCLUDE_FUNCTIONAL
>> +#define INCLUDE_LIST
>> +#define INCLUDE_TYPE_TRAITS
>> +#include "config.h"
>> +#include "system.h"
>> +#include "coretypes.h"
>> +#include "backend.h"
>> +#include "target.h"
>> +#include "rtl.h"
>> +#include "memmodel.h"
>> +#include "emit-rtl.h"
>> +#include "tree-pass.h"
>> +#include "df.h"
>> +#include "dumpfile.h"
>> +#include "rs6000-internal.h"
>> +#include "rs6000-protos.h"
>> +#include "fusion-common.h"
>> +
>> +/* Return false if dependent rtx LOC is SUBREG.  */
>> +static bool
>> +is_feasible (rtx_insn *insn)
> 
> Can you explain the rationale for this function?
> I'm not really sure what it's trying to achieve.
> 
>> +{
>> +  df_ref use;
>> +  df_insn_info *insn_info = DF_INSN_INFO_GET (insn);
>> +  FOR_EACH_INSN_INFO_DEF (use, insn_info)
>> +    {
>> +      struct df_link *def_link = DF_REF_CHAIN (use);
>> +      if (!def_link || !def_link->ref || DF_REF_IS_ARTIFICIAL (def_link->ref))
>> +	continue;
>> +      while (def_link && def_link->ref)
>> +	{
>> +	  rtx *loc = DF_REF_LOC (def_link->ref);
>> +	  if (!loc || *loc == NULL_RTX)
>> +	    return false;
>> +	  if (GET_CODE (*loc) == SUBREG)
>> +	    return false;
>> +	  def_link = def_link->next;
>> +	}
>> +     }
>> +  return true;
>> +}
>> +

I have removed this check from the recent patch.
>> +/* df_scan_rescan the unspec instruction where operands
>> +   are reversed.  */
>> +void set_rescan_for_unspec (rtx_insn *insn)
>> +{
>> +  df_ref use;
>> +  df_insn_info *insn_info = DF_INSN_INFO_GET (insn);
>> +  rtx_insn *select_insn2;
>> +  FOR_EACH_INSN_INFO_DEF (use, insn_info)
>> +    {
>> +      struct df_link *def_link = DF_REF_CHAIN (use);
>> +      while (def_link && def_link->ref)
>> +	{
>> +	  select_insn2 = DF_REF_INSN (def_link->ref);
>> +	  rtx set = single_set (select_insn2);
>> +
>> +	  if (set == NULL_RTX)
>> +	    return;
>> +
>> +	  if (set != NULL_RTX)
>> +	    {
>> +	      rtx op0 = SET_SRC (set);
>> +	      if (GET_CODE (op0) != UNSPEC)
>> +		return;
>> +
>> +	      if (GET_CODE (op0) == VEC_SELECT
>> +		  && GET_CODE (XEXP (op0, 1)) == PARALLEL)
>> +		return;
>> +
>> +	      if (GET_CODE (op0) == UNSPEC)
>> +		df_insn_rescan (select_insn2);
>> +	    }
>> +	   def_link = def_link->next;
>> +	}
>> +     }
>> +}
>> +
>> +/* Return dependent UNSPEC instruction.  */
>> +rtx_insn *get_rtx_UNSPEC (rtx_insn *insn)
>> +{
>> +  df_ref use;
>> +  df_insn_info *insn_info = DF_INSN_INFO_GET (insn);
>> +  rtx_insn *select_insn2;
>> +  FOR_EACH_INSN_INFO_DEF (use, insn_info)
>> +    {
>> +      struct df_link *def_link = DF_REF_CHAIN (use);
>> +      while (def_link && def_link->ref)
>> +	{
>> +	  select_insn2 = DF_REF_INSN (def_link->ref);
>> +	  rtx set = single_set (select_insn2);
>> +
>> +	  if (set == NULL_RTX)
>> +	    return 0;
>> +
>> +	  if (set != NULL_RTX)
> 
> This condition is always true due to the early return above.
> 

I have made this change in recent patch.

>> +	    {
>> +	      rtx op0 = SET_SRC (set);
>> +
>> +	      if (GET_CODE (op0) == UNSPEC)
>> +		return select_insn2;
>> +	    }
>> +	   def_link = def_link->next;
>> +	}
>> +     }
>> +  return 0;
>> +}
>> +
>> +/* Replace identified lxv with lxvp.
>> +   Bail out if following condition are true:
>> +
>> +   - dependent instruction of load is vec_select instruction,
> 
> Please can you explain the rationale for this condition?
> 
> Do I understand correctly that you want to bail out if a use of a
> register loaded by either insn occurs in a vec_select rtx?  If so, why?
>

I have removed vec_select check from load fusion only check I have 
provided in target of load in UNSPEC this where we require load
vector pair to be generated.


 
> AIUI, since we shouldn't change register dataflow in the pass, I wonder
> if you could check this condition once in track_access, instead of
> checking it for every pair formation attempt?
>

I have added the check in the wrapper code where we call track_access.
 
>> +
>> +   - machine mode of unspec is not same as machine mode
>> +     of lxv instruction.
>> +
>> +   - dependent instruction is not unspec.
> 
> Likewise, please explain the rationale for this.
> 
>> +
>> +   - Source operand of unspec is eq instruction.  */
> 
> Same here.
> 
>> +
>> +rtx
>> +rs6000_gen_load_pair (rtx_insn *insn1, rtx_insn *insn2)
>> +{
> 
> I think this function is named in a rather misleading way, and I think
> it is trying to do too much at once.
> 
> Based on the name of the function, I would expect it to just produce the
> appropriate RTL pattern for a load pair given the two candidate loads.
> But it actually tries to do a bunch of analysis to determine if the pair
> can be formed, and then inserts a new insn and deletes the original
> candidate insns.
> 

I have made changes in the recent patch as you have mentioned above.

> Any changing of the RTL should be done by RTL-SSA in fuse_pair.
> 
>> +  rtx body = PATTERN (insn1);
>> +  rtx src_exp = SET_SRC (body);
>> +  rtx dest_exp = SET_DEST (body);
>> +  rtx lxv;
>> +  rtx insn2_body = PATTERN (insn2);
>> +
>> +  rtx insn2_src_exp = SET_SRC (insn2_body);
>> +
>> +  if (GET_MODE (src_exp) != GET_MODE (SET_SRC (insn2_body)))
>> +    return NULL_RTX;
> 
> IIUC this should hold from generic RTL rules, so shouldn't be needed.
> 
>> +
>> +  if (GET_MODE (dest_exp) == TImode)
>> +    return NULL_RTX;
>> +
>> +  if (!ALTIVEC_OR_VSX_VECTOR_MODE (GET_MODE (dest_exp)))
>> +    return NULL_RTX;
> 
> Validation of the modes should have happened much earlier (in
> track_access), so shouldn't be necessary at this point.
>

I did that.
 
>> +
>> +  if (!is_feasible (insn1))
>> +    return NULL_RTX;
>> +
>> +  if (!is_feasible (insn2))
>> +    return NULL_RTX;
> 
> I can't really comment on this as I don't understand the intent behind
> is_feasible, but AFAICT it seems to do some dataflow analysis (looking
> at uses/defs).  That should ideally have been done at the latest in
> try_fuse_pair (before we get to fuse_pair), and shouldn't be done when
> we're trying to generate the final RTL pattern for the pair.
> 
> Given that these seem to be properties of the indiviudal insns it might
> even be possible to check this much earlier (when first considering the
> individual accesses in track_access).
> 

Most of the check that were used in this patch are removed and moved 
some in earlier part of code in the recent patch.
>> +
>> +  for (rtx note = REG_NOTES (insn1); note; note = XEXP (note, 1))
>> +    if (REG_NOTE_KIND (note) == REG_EQUAL
>> +	|| REG_NOTE_KIND (note) == REG_EQUIV)
>> +      return NULL_RTX;
> 
> Can you explain why you're punting if the candidate insns have these
> notes?  Did you see a performance improvement with that change?  It
> might be something we want to consider doing for aarch64 too if running
> before RA (in which case we could do it in the generic code).
> 

Yes we found that it is not useful in spec 2017  benchmarks.

>> +
>> +  int no_dep = 0;
>> +  df_ref  use;
>> +  df_insn_info *insn_info = DF_INSN_INFO_GET (insn1);
>> +  rtx_insn *select_insn2;
>> +
>> +  FOR_EACH_INSN_INFO_DEF (use, insn_info)
>> +    {
>> +      struct df_link *def_link = DF_REF_CHAIN (use);
>> +      while (def_link && def_link->ref)
>> +	{
>> +	  select_insn2 = DF_REF_INSN (def_link->ref);
>> +	  rtx set = single_set (select_insn2);
>> +
>> +	  if (set == NULL_RTX)
>> +	    return NULL_RTX;
>> +
>> +	  if (set != NULL_RTX)
> 
> As above, I think this condition is redundant due to the early return
> immediately above.
> 

I did that.

>> +	    {
>> +	      rtx op0 = SET_SRC (set);
>> +
>> +	      if (GET_CODE (op0) != UNSPEC)
>> +		return NULL_RTX;
>> +
>> +	      if (GET_CODE (op0) == VEC_SELECT
>> +		  && GET_CODE (XEXP (op0, 1)) == PARALLEL)
>> +		return NULL_RTX;
>> +
>> +	      if (GET_CODE (op0) == UNSPEC)
>> +		{
>> +		  if (GET_MODE (op0) != XOmode
>> +		      && GET_MODE (op0) != GET_MODE (dest_exp))
>> +		    return NULL_RTX;
>> +
>> +		  int nvecs = XVECLEN (op0, 0);
>> +		  for (int i = 0; i < nvecs; i++)
>> +		    {
>> +		      rtx op;
>> +		      op = XVECEXP (op0, 0, i);
>> +
>> +		      if (GET_MODE (op) == OOmode)
>> +			return NULL_RTX;
>> +		      if (GET_CODE (op) == EQ)
>> +			return NULL_RTX;
>> +		    }
>> +		}
>> +	       ++no_dep;
>> +	     }
>> +	   def_link = def_link->next;
>> +	}
>> +     }
>> +
>> +  rtx_insn *insn = get_rtx_UNSPEC (insn1);
>> +
>> +  if (insn && insn == get_rtx_UNSPEC (insn2) && no_dep == 1)
>> +    return NULL_RTX;
>> +
>> +  int regoff;
>> +  rtx src;
>> +  rtx addr = XEXP (src_exp, 0);
>> +
>> +  if (GET_CODE (addr) == PLUS
>> +      && XEXP (addr, 1) && CONST_INT_P (XEXP (addr, 1)))
>> +    {
>> +      regoff = 0;
>> +      src = simplify_gen_subreg (GET_MODE (dest_exp),
>> +				 dest_exp, GET_MODE (dest_exp),
>> +				 regoff);
>> +    }
>> +  else
>> +    {
>> +      regoff = INTVAL (CONST0_RTX (SImode)) *
>> +		       GET_MODE_SIZE (GET_MODE (dest_exp));
>> +      src = simplify_gen_subreg (GET_MODE (dest_exp),
>> +				 dest_exp, GET_MODE (dest_exp),
>> +				 regoff);
>> +    }
>> +  insn_info = DF_INSN_INFO_GET (insn2);
>> +  FOR_EACH_INSN_INFO_DEF (use, insn_info)
>> +    {
>> +      struct df_link *def_link = DF_REF_CHAIN (use);
>> +      if (!def_link || !def_link->ref || DF_REF_IS_ARTIFICIAL (def_link->ref))
>> +	continue;
>> +      while (def_link && def_link->ref)
>> +	{
>> +	  rtx *loc = DF_REF_LOC (def_link->ref);
>> +	  *loc =  src;
>> +	  def_link = def_link->next;
>> +	}
>> +     }
>> +
>> +  int regoff1;
>> +  rtx src1;
>> +  addr = XEXP (insn2_src_exp, 0);
>> +
>> +  if (GET_CODE (addr) == PLUS
>> +      && XEXP (addr, 1)
>> +      && CONST_INT_P (XEXP(addr, 1)))
>> +    {
>> +      regoff1 = 16;
>> +      src1 = simplify_gen_subreg (GET_MODE (dest_exp),
>> +				  dest_exp, GET_MODE (dest_exp),
>> +				  regoff1);
>> +    }
>> +  else
>> +    {
>> +      regoff1 = INTVAL (CONST0_RTX (SImode)) * GET_MODE_SIZE (GET_MODE (dest_exp));//V16QImode);
>> +      src1 = simplify_gen_subreg (GET_MODE (dest_exp),
>> +				  dest_exp, GET_MODE (dest_exp),
>> +				  regoff1);
>> +    }
>> +
>> +  insn_info = DF_INSN_INFO_GET (insn1);
>> +  FOR_EACH_INSN_INFO_DEF (use, insn_info)
>> +    {
>> +      struct df_link *def_link = DF_REF_CHAIN (use);
>> +      if (!def_link || !def_link->ref || DF_REF_IS_ARTIFICIAL (def_link->ref))
>> +	continue;
>> +      while (def_link && def_link->ref)
>> +	{
>> +	  rtx *loc = DF_REF_LOC (def_link->ref);
>> +	  PUT_MODE_RAW (*loc, OOmode);
>> +	  *loc = src1;
>> +	  def_link = def_link->next;
>> +	}
>> +     }
>> +
>> +  set_rescan_for_unspec (insn1);
>> +  set_rescan_for_unspec (insn2);
>> +  df_insn_rescan (insn1);
>> +  df_insn_rescan (insn2);
>> +
>> +  PUT_MODE_RAW (src_exp, OOmode);
>> +  PUT_MODE_RAW (dest_exp, OOmode);
> 
> I don't think you should be updating the original RTL directly like
> this, but instead return a new pattern which will ultimately get passed
> to validate_unshare_change in fuse_pair (i.e. let fuse_pair update the
> RTL).
>

I did that.

 
>> +  lxv = gen_movoo (dest_exp, src_exp);
>> +  rtx_insn *new_insn = emit_insn_before (lxv,  insn1);
> 
> Likewise, instead of emitting a new insn, this should let fuse_pair
> perform the change.
> 
>> +  set_block_for_insn (new_insn, BLOCK_FOR_INSN (insn1));
>> +  df_insn_rescan (new_insn);
>> +
>> +  if (dump_file)
>> +    {
>> +      unsigned int new_uid = INSN_UID (new_insn);
>> +      fprintf (dump_file, "Replacing lxv %d with lxvp  %d\n",
>> +			  INSN_UID (insn1), new_uid);
>> +      print_rtl_single (dump_file, new_insn);
>> +      print_rtl_single (dump_file, insn1);
>> +      print_rtl_single (dump_file, insn2);
>> +
>> +    }
>> +
>> +  df_insn_delete (insn1);
>> +  remove_insn (insn1);
>> +  df_insn_delete (insn2);
>> +  remove_insn (insn2);
>> +  insn1->set_deleted ();
>> +  insn2->set_deleted ();
> 
> Likewise, I don't think you should be deleting the original insns here,
> but you should let RTL-SSA do this in the generic code instead (i.e. in
> fuse_pair).
>

I did that in recent patch.
 
>> +  return lxv;
>> +}
>> +
>> +// LEFT_LIST and RIGHT_LIST are lists of candidate instructions where all insns
>> +// in LEFT_LIST are known to be adjacent to those in RIGHT_LIST.
>> +//
>> +// This function traverses the resulting 2D matrix of possible pair candidates
>> +// and attempts to merge them into pairs.
>> +//
>> +// The algorithm is straightforward: if we consider a combined list of
>> +// candidates X obtained by merging LEFT_LIST and RIGHT_LIST in program order,
>> +// then we advance through X until we reach a crossing point (where X[i] and
>> +// X[i+1] come from different source lists).
>> +//
>> +// At this point we know X[i] and X[i+1] are adjacent accesses, and we try to
>> +// fuse them into a pair.  If this succeeds, we remove X[i] and X[i+1] from
>> +// their original lists and continue as above.
>> +//
>> +// In the failure case, we advance through the source list containing X[i] and
>> +// continue as above (proceeding to the next crossing point).
>> +//
>> +// The rationale for skipping over groups of consecutive candidates from the
>> +// same source list is as follows:
>> +//
>> +// In the store case, the insns in the group can't be re-ordered over each
>> +// other as they are guaranteed to store to the same location, so we're
>> +// guaranteed not to lose opportunities by doing this.
>> +//
>> +// In the load case, subsequent loads from the same location are either
>> +// redundant (in which case they should have been cleaned up by an earlier
>> +// optimization pass) or there is an intervening aliasing hazard, in which case
>> +// we can't re-order them anyway, so provided earlier passes have cleaned up
>> +// redundant loads, we shouldn't miss opportunities by doing this.
>> +void
>> +fusion_bb_info::merge_pairs (insn_list_t &left_list,
>> +			  insn_list_t &right_list,
>> +			  bool load_p,
>> +			  unsigned access_size)
>> +{
> 
> This function should be in the generic code, not in target code.
> Was there a reason you moved it to target code?
> 
Yes I have moved in generic code.

> When moving code from aarch64-ldp-fusion.cc, I think it would be better
> if you left the code style unchanged and only made changes where there
> is a need for a functional change.  For example ...
> 
>> +  if (dump_file)
>> +    {
>> +      fprintf (dump_file, "merge_pairs [L=%d], cand vecs ", load_p);
>> +      dump_insn_list (dump_file, left_list);
>> +      fprintf (dump_file, " x ");
>> +      dump_insn_list (dump_file, right_list);
>> +      fprintf (dump_file, "\n");
>> +    }
>> +
>> +  auto iter_l = left_list.begin ();
>> +  auto iter_r = right_list.begin ();
>> +  while (iter_l != left_list.end () && iter_r != right_list.end ())
>> +    {
>> +      auto left = *iter_l;
>> +      auto right = *iter_r;
> 
> ... here you introduce new left and right locals, but AFAICT there isn't
> a need for this (no functional change).
> 
>> +
>> +      auto next_l = std::next (iter_l);
>> +      auto next_r = std::next (iter_r);
>> +
>> +      if ((*left) < (*right)
>> +	  && next_l != (left_list.end ())
>> +	  && **next_l < *right)
>> +	iter_l = next_l;
>> +      else if ((*right) < (*left)
>> +	       && next_r != (right_list.end ())
>> +	       && **next_r < *left)
>> +	iter_r = next_r;
>> +      else if (try_fuse_pair (load_p, access_size, *iter_l, *iter_r)) //left, right))
>> +	{
>> +	  left_list.erase (iter_l);
>> +	  iter_l = next_l;
>> +	  right_list.erase (iter_r);
>> +	  iter_r = next_r;
>> +	}
>> +      else if (*left  < *right)
>> +	iter_l = next_l;
>> +      else
>> +	iter_r = next_r;
>> +    }
>> + 
>> +}
>> +
>> +
>> +
>> +// Main function to begin pair discovery.  Given a memory access INSN,
>> +// determine whether it could be a candidate for fusing into an lxvp/stvxp,
>> +// and if so, track it in the appropriate data structure for this basic
>> +// block.  LOAD_P is true if the access is a load, and MEM is the mem
>> +// rtx that occurs in INSN.
>> +void
>> +fusion_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
>> +{
> 
> This function should also be in generic code, not in target code.
> I can see that you want to customize a target-specific aspect ...
> 
>> +  // We can't combine volatile MEMs, so punt on these.
>> +  if (MEM_VOLATILE_P (mem))
>> +    return;
>> +
>> +  // Ignore writeback accesses if the param says to do so.
>> +  if (GET_RTX_CLASS (GET_CODE (XEXP (mem, 0))) == RTX_AUTOINC)
>> +    return;
>> +
>> +  const machine_mode mem_mode = GET_MODE (mem);
>> +
>> +  rtx reg_op = XEXP (PATTERN (insn->rtl ()), !load_p);
>> +
>> +  if (!ALTIVEC_OR_VSX_VECTOR_MODE (GET_MODE (reg_op)))
>> +    return;
> 
> ... here.  But I think this should be done either using a target hook
> or ideally using a virtual function which is overriden by target code.
>

Yes I did in pure virtual function in target code.

 
> For rs6000, would it be enough to validate the mode of the mem instead
> of looking at the mode of reg_op?  Then we could have a single function,
> e.g.:
> 
> bool pair_fusion::pair_operand_mode_ok_p (machine_mode mode);
> 
> which for rs6000 would return:
> 
> ALTIVEC_OR_VSX_VECTOR_MODE (mode);
> 
> and for aarch64 would return:
> 
> ldp_operand_mode_ok_p (mode);
> 
> and in track_access (from generic code), you would do:
> 
> if (!pair_operand_mode_ok_p (mem_mode))
>   return;
> 
Yes I did that.

> note that, at least for aarch64, the reason that we validate the mem mode
> instead of the register mode is that the register operand may be a modeless
> const_int in the store case.
> 
>> +
>> +  const bool fpsimd_op_p = (ALTIVEC_OR_VSX_VECTOR_MODE (GET_MODE (reg_op)));
> 
> IIUC, this boolean is always true due to the above check.  So it is
> better to just set it to a constant (doesn't matter which) if you only
> want to handle vector modes for rs6000.  As above, this should be done
> through a virtual function, where for aarch64 we return the condition in
> aarch64-ldp-fusion.cc:
> 
>   const bool fpsimd_op_p
>     = reload_completed
>     ? (REG_P (reg_op) && FP_REGNUM_P (REGNO (reg_op)))
>     : (GET_MODE_CLASS (mem_mode) != MODE_INT
>        && (load_p || !aarch64_const_zero_rtx_p (reg_op)));
> 
> and for rs6000 you can just return false (assuming you don't want to segregate
> accesses using this bit).
> 
>> +
>> +  const HOST_WIDE_INT mem_size = GET_MODE_SIZE (mem_mode);
>> +  const lfs_fields lfs = { load_p, fpsimd_op_p, mem_size };
>> +
>> +  if (track_via_mem_expr (insn, mem, lfs))
>> +    return;
>> +
>> +  poly_int64 mem_off;
>> +  rtx addr = XEXP (mem, 0);
>> +  const bool autoinc_p = GET_RTX_CLASS (GET_CODE (addr)) == RTX_AUTOINC;
>> +  rtx base = load_strip_offset (mem, &mem_off);
>> +  if (!REG_P (base))
>> +    return;
>> +
>> +  // Need to calculate two (possibly different) offsets:
>> +  //  - Offset at which the access occurs.
>> +  //  - Offset of the new base def.
>> +  poly_int64 access_off;
>> +  if (autoinc_p && any_post_modify_p (addr))
>> +    access_off = 0;
>> +  else
>> +    access_off = mem_off;
>> +
>> +  poly_int64 new_def_off = mem_off;
>> +
>> +  // Punt on accesses relative to eliminable regs.  Since we don't know the
>> +  // elimination offset pre-RA, we should postpone forming pairs on such
>> +  // accesses until after RA.
>> +  //
>> +  // As it stands, addresses with offsets in range for LDR but not
>> +  // in range for LDP/STP are currently reloaded inefficiently,
>> +  // ending up with a separate base register for each pair.
>> +  //
>> +  // In theory LRA should make use of
>> +  // targetm.legitimize_address_displacement to promote sharing of
>> +  // bases among multiple (nearby) address reloads, but the current
>> +  // LRA code returns early from process_address_1 for operands that
>> +  // satisfy "m", even if they don't satisfy the real (relaxed) address
>> +  // constraint; this early return means we never get to the code
>> +  // that calls targetm.legitimize_address_displacement.
>> +  //
>> +  // So for now, it's better to punt when we can't be sure that the
>> +  // offset is in range for LDP/STP.  Out-of-range cases can then be
>> +  // handled after RA by the out-of-range LDP/STP peepholes.  Eventually, it
>> +  // would be nice to handle known out-of-range opportunities in the
>> +  // pass itself (for stack accesses, this would be in the post-RA pass).
>> +  if (!reload_completed
>> +      && (REGNO (base) == FRAME_POINTER_REGNUM
>> +	  || REGNO (base) == ARG_POINTER_REGNUM))
>> +    return;
>> +
>> +  // Now need to find def of base register.
>> +  use_info *base_use = find_access (insn->uses (), REGNO (base));
>> +  gcc_assert (base_use);
>> +  def_info *base_def = base_use->def ();
>> +  if (!base_def)
>> +    {
>> +      if (dump_file)
>> +	fprintf (dump_file,
>> +		 "base register (regno %d) of insn %d is undefined",
>> +		 REGNO (base), insn->uid ());
>> +      return;
>> +    }
>> +
>> +  alt_base *canon_base = canon_base_map.get (base_def);
>> +  if (canon_base)
>> +    {
>> +      // Express this as the combined offset from the canonical base.
>> +      base_def = canon_base->base;
>> +      new_def_off += canon_base->offset;
>> +      access_off += canon_base->offset;
>> +    }
>> +
>> +  if (autoinc_p)
>> +    {
>> +      auto def = find_access (insn->defs (), REGNO (base));
>> +      gcc_assert (def);
>> +
>> +      // Record that DEF = BASE_DEF + MEM_OFF.
>> +      if (dump_file)
>> +	{
>> +	  pretty_printer pp;
>> +	  pp_access (&pp, def, 0);
>> +	  pp_string (&pp, " = ");
>> +	  pp_access (&pp, base_def, 0);
>> +	  fprintf (dump_file, "[bb %u] recording %s + ",
>> +		   m_bb->index (), pp_formatted_text (&pp));
>> +	  print_dec (new_def_off, dump_file);
>> +	  fprintf (dump_file, "\n");
>> +	}
>> +
>> +      alt_base base_rec { base_def, new_def_off };
>> +      if (canon_base_map.put (def, base_rec))
>> +	gcc_unreachable (); // Base defs should be unique.
>> +    }
>> +
>> +  // Punt on misaligned offsets.  LDP/STP require offsets to be a multiple of
>> +  // the access size.
>> +  if (!multiple_p (mem_off, mem_size))
>> +    return;
>> +
>> +  const auto key = std::make_pair (base_def, encode_lfs (lfs));
>> +  access_group &group = def_map.get_or_insert (key, NULL);
>> +  auto alloc = [&](access_record *access) { return node_alloc (access); };
>> +  group.track (alloc, access_off, insn);
>> +  if (dump_file)
>> +    {
>> +      pretty_printer pp;
>> +      pp_access (&pp, base_def, 0);
>> +
>> +      fprintf (dump_file, "[bb %u] tracking insn %d via %s",
>> +	       m_bb->index (), insn->uid (), pp_formatted_text (&pp));
>> +      fprintf (dump_file,
>> +	       " [L=%d, WB=%d, FP=%d, %smode, off=",
>> +	       lfs.load_p, autoinc_p, lfs.fpsimd_p, mode_name[mem_mode]);
>> +      print_dec (access_off, dump_file);
>> +      fprintf (dump_file, "]\n");
>> +    }
>> +}
>> +
>> +
>> +// Given two adjacent memory accesses of the same size, I1 and I2, try
>> +// and see if we can merge them into a lxvp or stvxp.
>> +//
>> +// ACCESS_SIZE gives the (common) size of a single access, LOAD_P is true
>> +// if the accesses are both loads, otherwise they are both stores.
>> +bool
>> +fusion_bb_info::try_fuse_pair (bool load_p, unsigned access_size,
>> +			    insn_info *i1, insn_info *i2)
> 
> Likewise, I think this function should be in generic code, not target code.
> 
>> +{
>> +  if (dump_file)
>> +    fprintf (dump_file, "analyzing pair (load=%d): (%d,%d)\n",
>> +	     load_p, i1->uid (), i2->uid ());
>> +
>> +  insn_info *insns[2];
>> +  bool reversed = false;
>> +  if (*i1 < *i2)
>> +    {
>> +      insns[0] = i1;
>> +      insns[1] = i2;
>> +    }
>> +  else
>> +    {
>> +      insns[0] = i2;
>> +      insns[1] = i1;
>> +      reversed = true;
>> +    }
>> +
>> +  rtx cand_mems[2];
>> +  rtx reg_ops[2];
>> +  rtx pats[2];
>> +  for (int i = 0; i < 2; i++)
>> +    {
>> +      pats[i] = PATTERN (insns[i]->rtl ());
>> +      cand_mems[i] = XEXP (pats[i], load_p);
>> +      reg_ops[i] = XEXP (pats[i], !load_p);
>> +    }
>> +
>> +  if (load_p && reg_overlap_mentioned_p (reg_ops[0], reg_ops[1]))
>> +    {
>> +      if (dump_file)
>> +	fprintf (dump_file,
>> +		 "punting on lxvp due to reg conflcits (%d,%d)\n",
> 
> I understand that you might not want to use "ldp" in the generic code,
> but how about "punting on load pair" instead?
> 
>> +		 insns[0]->uid (), insns[1]->uid ());
>> +      return false;
>> +    }
>> +
>> +  if (cfun->can_throw_non_call_exceptions
>> +      && find_reg_note (insns[0]->rtl (), REG_EH_REGION, NULL_RTX)
>> +      && find_reg_note (insns[1]->rtl (), REG_EH_REGION, NULL_RTX))
>> +    {
>> +      if (dump_file)
>> +	fprintf (dump_file,
>> +		 "can't combine insns with EH side effects (%d,%d)\n",
>> +		 insns[0]->uid (), insns[1]->uid ());
>> +      return false;
>> +    }
>> +
>> +  auto_vec<base_cand, 2> base_cands (2);
>> +
>> +  int writeback = get_viable_bases (insns, base_cands, cand_mems,
>> +				    access_size, reversed);
>> +  if (base_cands.is_empty ())
>> +    {
>> +      if (dump_file)
>> +	fprintf (dump_file, "no viable base for pair (%d,%d)\n",
>> +		 insns[0]->uid (), insns[1]->uid ());
>> +      return false;
>> +    }
>> +
>> +  // Punt on frame-related insns with writeback.  We probably won't see
>> +  // these in practice, but this is conservative and ensures we don't
>> +  // have to worry about these later on.
>> +  if (writeback && (RTX_FRAME_RELATED_P (i1->rtl ())
>> +		    || RTX_FRAME_RELATED_P (i2->rtl ())))
>> +    {
>> +      if (dump_file)
>> +	fprintf (dump_file,
>> +		 "rejecting pair (%d,%d): frame-related insn with writeback\n",
>> +		 i1->uid (), i2->uid ());
>> +      return false;
>> +    }
>> +
>> +  rtx *ignore = &XEXP (pats[1], load_p);
>> +  for (auto use : insns[1]->uses ())
>> +    if (!use->is_mem ()
>> +	&& refers_to_regno_p (use->regno (), use->regno () + 1, pats[1], ignore)
>> +	&& use->def () && use->def ()->insn () == insns[0])
>> +      {
>> +	// N.B. we allow a true dependence on the base address, as this
>> +	// happens in the case of auto-inc accesses.  Consider a post-increment
>> +	// load followed by a regular indexed load, for example.
>> +	if (dump_file)
>> +	  fprintf (dump_file,
>> +		   "%d has non-address true dependence on %d, rejecting pair\n",
>> +		   insns[1]->uid (), insns[0]->uid ());
>> +	return false;
>> +      }
>> +
>> +  unsigned i = 0;
>> +  while (i < base_cands.length ())
>> +    {
>> +      base_cand &cand = base_cands[i];
>> +
>> +      rtx *ignore[2] = {};
>> +      for (int j = 0; j < 2; j++)
>> +	if (cand.from_insn == !j)
>> +	  ignore[j] = &XEXP (cand_mems[j], 0);
>> +
>> +      insn_info *h = first_hazard_after (insns[0], ignore[0]);
>> +      if (h && *h <= *insns[1])
>> +	cand.hazards[0] = h;
>> +
>> +      h = latest_hazard_before (insns[1], ignore[1]);
>> +      if (h && *h >= *insns[0])
>> +	cand.hazards[1] = h;
>> +
>> +      if (!cand.viable ())
>> +	{
>> +	  if (dump_file)
>> +	    fprintf (dump_file,
>> +		     "pair (%d,%d): rejecting base %d due to dataflow "
>> +		     "hazards (%d,%d)\n",
>> +		     insns[0]->uid (),
>> +		     insns[1]->uid (),
>> +		     cand.def->regno (),
>> +		     cand.hazards[0]->uid (),
>> +		     cand.hazards[1]->uid ());
>> +
>> +	  base_cands.ordered_remove (i);
>> +	}
>> +      else
>> +	i++;
>> +    }
>> +
>> +  if (base_cands.is_empty ())
>> +    {
>> +      if (dump_file)
>> +	fprintf (dump_file,
>> +		 "can't form pair (%d,%d) due to dataflow hazards\n",
>> +		 insns[0]->uid (), insns[1]->uid ());
>> +      return false;
>> +    }
>> +
>> +  //insn_info *alias_hazards[4] = {};
> 
> It looks like you've just commented out / removed the alias analysis (I don't
> see a call to do_alias_analysis).  How is that supposed to work?
>

Yes my mistake I kept that in recent patch.
 
>> +
>> +  // First def of memory after the first insn, and last def of memory
>> +  // before the second insn, respectively.
>> +  def_info *mem_defs[2] = {};
>> +  if (load_p)
>> +    {
>> +      if (!MEM_READONLY_P (cand_mems[0]))
>> +	{
>> +	  mem_defs[0] = memory_access (insns[0]->uses ())->def ();
>> +	  gcc_checking_assert (mem_defs[0]);
>> +	  mem_defs[0] = mem_defs[0]->next_def ();
>> +	}
>> +      if (!MEM_READONLY_P (cand_mems[1]))
>> +	{
>> +	  mem_defs[1] = memory_access (insns[1]->uses ())->def ();
>> +	  gcc_checking_assert (mem_defs[1]);
>> +	}
>> +    }
>> +  else
>> +    {
>> +      mem_defs[0] = memory_access (insns[0]->defs ())->next_def ();
>> +      mem_defs[1] = memory_access (insns[1]->defs ())->prev_def ();
>> +      gcc_checking_assert (mem_defs[0]);
>> +      gcc_checking_assert (mem_defs[1]);
>> +    }
>> +
>> +  //auto tombstone_p = [&](insn_info *insn) -> bool {
>> +   // return m_emitted_tombstone
>> +//	   && bitmap_bit_p (&m_tombstone_bitmap, insn->uid ());
>> + // };
>> +
>> +  if (base_cands.is_empty ())
>> +    {
>> +      if (dump_file)
>> +	fprintf (dump_file,
>> +		 "cannot form pair (%d,%d) due to alias/dataflow hazards",
>> +		 insns[0]->uid (), insns[1]->uid ());
>> +
>> +      return false;
>> +    }
>> +
>> +  base_cand *base = &base_cands[0];
>> +  insn_range_info range (insns[0], insns[1]);
>> +  // If the second insn can throw, narrow the move range to exactly that insn.
>> +  // This prevents us trying to move the second insn from the end of the BB.
>> +  if (cfun->can_throw_non_call_exceptions
>> +      && find_reg_note (insns[1]->rtl (), REG_EH_REGION, NULL_RTX))
>> +    {
>> +      gcc_assert (range.includes (insns[1]));
>> +      range = insn_range_info (insns[1]);
>> +    }
>> +
>> +  // Placement strategy: push loads down and pull stores up, this should
>> +  // help register pressure by reducing live ranges.
>> +  if (load_p)
>> +    range.first = range.last;
>> +  else
>> +    range.last = range.first;
>> +  return fuse_pair (load_p, access_size, writeback,
>> +		    i1, i2, *base, range);
>> +}
>> +// Try and actually fuse the pair given by insns I1 and I2.
>> +//
>> +// Here we've done enough analysis to know this is safe, we only
>> +// reject the pair at this stage if either the tuning policy says to,
>> +// or recog fails on the final pair insn.
>> +//
>> +// LOAD_P is true for loads, ACCESS_SIZE gives the access size of each
>> +// candidate insn.  Bit i of WRITEBACK is set if the ith insn (in program
>> +// order) uses writeback.
>> +//
>> +// BASE gives the chosen base candidate for the pair and MOVE_RANGE is
>> +// a singleton range which says where to place the pair.
>> +bool
>> +fusion_bb_info::fuse_pair (bool load_p,
>> +			unsigned access_size,
>> +			int writeback,
>> +			insn_info *i1, insn_info *i2,
>> +			base_cand &base,
>> +			const insn_range_info &move_range)
>> +{
> 
> Again, I think this function should be in generic code, not in target
> code.

Yes I moved it to generic code.
> 
>> +  auto attempt = crtl->ssa->new_change_attempt ();
>> +
>> +  auto make_change = [&attempt](insn_info *insn)
>> +    {
>> +      return crtl->ssa->change_alloc<insn_change> (attempt, insn);
>> +    };
>> +  auto make_delete = [&attempt](insn_info *insn)
>> +    {
>> +      return crtl->ssa->change_alloc<insn_change> (attempt,
>> +						   insn,
>> +						   insn_change::DELETE);
>> +    };
>> +
>> +  if (*i1 > *i2)
>> +    return false;
>> +
>> +  insn_info *first = (*i1 < *i2) ? i1 : i2;
>> +  insn_info *second = (first == i1) ? i2 : i1;
>> +
>> +  insn_info *insns[2] = { first, second };
>> +
>> +  auto_vec<insn_change *, 4> changes (4);
>> +  auto_vec<int, 2> tombstone_uids (2);
>> +
>> +  rtx pats[2] = {
>> +    PATTERN (first->rtl ()),
>> +    PATTERN (second->rtl ())
>> +  };
>> +
>> +  use_array input_uses[2] = { first->uses (), second->uses () };
>> +  def_array input_defs[2] = { first->defs (), second->defs () };
>> +
>> +  int changed_insn = -1;
>> +  if (base.from_insn != -1)
>> +    {
>> +      // If we're not already using a shared base, we need
>> +      // to re-write one of the accesses to use the base from
>> +      // the other insn.
>> +      gcc_checking_assert (base.from_insn == 0 || base.from_insn == 1);
>> +      changed_insn = !base.from_insn;
>> +
>> +      rtx base_pat = pats[base.from_insn];
>> +      rtx change_pat = pats[changed_insn];
>> +      rtx base_mem = XEXP (base_pat, load_p);
>> +      rtx change_mem = XEXP (change_pat, load_p);
>> +
>> +      const bool lower_base_p = (insns[base.from_insn] == i1);
>> +      HOST_WIDE_INT adjust_amt = access_size;
>> +      if (!lower_base_p)
>> +	adjust_amt *= -1;
>> +
>> +      rtx change_reg = XEXP (change_pat, !load_p);
>> +      machine_mode mode_for_mem = GET_MODE (change_mem);
>> +      rtx effective_base = drop_writeback (base_mem);
>> +      rtx new_mem = adjust_address_nv (effective_base,
>> +				       mode_for_mem,
>> +				       adjust_amt);
>> +      rtx new_set = load_p
>> +	? gen_rtx_SET (change_reg, new_mem)
>> +	: gen_rtx_SET (new_mem, change_reg);
>> +
>> +      pats[changed_insn] = new_set;
>> +
>> +      auto keep_use = [&](use_info *u)
>> +	{
>> +	  return refers_to_regno_p (u->regno (), u->regno () + 1,
>> +				    change_pat, &XEXP (change_pat, load_p));
>> +	};
>> +
>> +      // Drop any uses that only occur in the old address.
>> +      input_uses[changed_insn] = filter_accesses (attempt,
>> +						  input_uses[changed_insn],
>> +						  keep_use);
>> +    }
>> +
>> +  rtx writeback_effect = NULL_RTX;
>> +  if (writeback)
>> +    writeback_effect = extract_writebacks (load_p, pats, changed_insn);
>> +
>> +  const auto base_regno = base.def->regno ();
>> +
>> +  if (base.from_insn == -1 && (writeback & 1))
>> +    {
>> +      // If the first of the candidate insns had a writeback form, we'll need to
>> +      // drop the use of the updated base register from the second insn's uses.
>> +      //
>> +      // N.B. we needn't worry about the base register occurring as a store
>> +      // operand, as we checked that there was no non-address true dependence
>> +      // between the insns in try_fuse_pair.
>> +      gcc_checking_assert (find_access (input_uses[1], base_regno));
>> +      input_uses[1] = check_remove_regno_access (attempt,
>> +						 input_uses[1],
>> +						 base_regno);
>> +    }
>> +
>> +  // Go through and drop uses that only occur in register notes,
>> +  // as we won't be preserving those.
>> +  for (int i = 0; i < 2; i++)
>> +    {
>> +      auto rti = insns[i]->rtl ();
>> +      if (!REG_NOTES (rti))
>> +	continue;
>> +
>> +      input_uses[i] = remove_note_accesses (attempt, input_uses[i]);
>> +    }
>> +
>> +  // So far the patterns have been in instruction order,
>> +  // now we want them in offset order.
>> +  if (i1 != first)
>> +    std::swap (pats[0], pats[1]);
>> +
>> +  poly_int64 offsets[2];
>> +  for (int i = 0; i < 2; i++)
>> +    {
>> +      rtx mem = XEXP (pats[i], load_p);
>> +      gcc_checking_assert (MEM_P (mem));
>> +      rtx base = strip_offset (XEXP (mem, 0), offsets + i);
>> +      gcc_checking_assert (REG_P (base));
>> +      gcc_checking_assert (base_regno == REGNO (base));
>> +    }
>> +
>> +  // If either of the original insns had writeback, but the resulting pair insn
>> +  // does not (can happen e.g. in the lxvp edge case above, or if the writeback
>> +  // effects cancel out), then drop the def(s) of the base register as
>> +  // appropriate.
>> +  //
>> +  // Also drop the first def in the case that both of the original insns had
>> +  // writeback.  The second def could well have uses, but the first def should
>> +  // only be used by the second insn (and we dropped that use above).
>> +  for (int i = 0; i < 2; i++)
>> +    if ((!writeback_effect && (writeback & (1 << i)))
>> +	|| (i == 0 && writeback == 3))
>> +      input_defs[i] = check_remove_regno_access (attempt,
>> +						 input_defs[i],
>> +						 base_regno);
>> +
>> +  // If we don't currently have a writeback pair, and we don't have
>> +  // a load that clobbers the base register, look for a trailing destructive
>> +  // update of the base register and try and fold it in to make this into a
>> +  // writeback pair.
>> +  insn_info *trailing_add = nullptr;
> 
> I can see you've dropped the call to find_trailing_add here.
> 
> If you don't want to handle writeback pairs, that's fine, but I think
> that should be controlled with a target-controlled virtual function.
> 
> Effectively you can just replace uses of the param aarch64_ldp_writeback with a
> call to such a function, then the target can control which writeback
> opportunities we handle, if any.  For aarch64 this would just return
> the result of the param for the time being.
> 

Yes I did that.
>> +
>> +  rtx reg_notes = combine_reg_notes (first, second, load_p);
>> +
>> +  rtx pair_pat = NULL_RTX;
>> +
>> +  if (load_p) {
>> +    pair_pat = rs6000_gen_load_pair ((first->rtl ()), (second->rtl ()));
>> +    if (pair_pat == NULL_RTX)
>> +      return false;
>> +  }
>> +  insn_change *pair_change = nullptr;
>> +  auto set_pair_pat = [pair_pat,reg_notes](insn_change *change) {
>> +    rtx_insn *rti = change->insn ()->rtl ();
>> +    validate_unshare_change (rti, &PATTERN (rti), pair_pat, true);
>> +    validate_change (rti, &REG_NOTES (rti), reg_notes, true);
>> +  };
>> +
>> +  if (load_p)
>> +    {
>> +      changes.quick_push (make_delete (first));
>> +      pair_change = make_change (second);
>> +      changes.quick_push (pair_change);
>> +
>> +      pair_change->move_range = move_range;
>> +      pair_change->new_defs = merge_access_arrays (attempt,
>> +						   input_defs[0],
>> +						   input_defs[1]);
>> +      gcc_assert (pair_change->new_defs.is_valid ());
>> +
>> +      pair_change->new_uses
>> +	= merge_access_arrays (attempt,
>> +			       drop_memory_access (input_uses[0]),
>> +			       drop_memory_access (input_uses[1]));
>> +      gcc_assert (pair_change->new_uses.is_valid ());
>> +      set_pair_pat (pair_change);
>> +    }
>> +  else
>> +    {
>> +      insn_info *store_to_change = decide_store_strategy (first, second,
>> +							move_range);
>> +
>> +      if (store_to_change && dump_file)
>> +	fprintf (dump_file, "  stvx: re-purposing store %d\n",
>> +		 store_to_change->uid ());
>> +
>> +      insn_change *change;
>> +      for (int i = 0; i < 2; i++)
>> +	{
>> +	  change = make_change (insns[i]);
>> +	  if (insns[i] == store_to_change)
>> +	    {
>> +	      set_pair_pat (change);
>> +	      change->new_uses = merge_access_arrays (attempt,
>> +						      input_uses[0],
>> +						      input_uses[1]);
>> +	      auto d1 = drop_memory_access (input_defs[0]);
>> +	      auto d2 = drop_memory_access (input_defs[1]);
>> +	      change->new_defs = merge_access_arrays (attempt, d1, d2);
>> +	      gcc_assert (change->new_defs.is_valid ());
>> +	      def_info *stxvp_def = memory_access (store_to_change->defs ());
>> +	      change->new_defs = insert_access (attempt,
>> +						stxvp_def,
>> +						change->new_defs);
>> +	      gcc_assert (change->new_defs.is_valid ());
>> +	      change->move_range = move_range;
>> +	      pair_change = change;
>> +	    }
>> +	  else
>> +	    {
>> +	      // Note that we are turning this insn into a tombstone,
>> +	      // we need to keep track of these if we go ahead with the
>> +	      // change.
>> +	      tombstone_uids.quick_push (insns[i]->uid ());
>> +	      rtx_insn *rti = insns[i]->rtl ();
>> +	      validate_change (rti, &PATTERN (rti), gen_tombstone (), true);
>> +	      validate_change (rti, &REG_NOTES (rti), NULL_RTX, true);
>> +	      change->new_uses = use_array (nullptr, 0);
>> +	    }
>> +	  gcc_assert (change->new_uses.is_valid ());
>> +	  changes.quick_push (change);
>> +	}
>> +
>> +      if (!store_to_change)
>> +	{
>> +	  // Tricky case.  Cannot re-purpose existing insns for stp.
>> +	  // Need to insert new insn.
>> +	  if (dump_file)
>> +	    fprintf (dump_file,
>> +		     "  stp fusion: cannot re-purpose candidate stores\n");
>> +
>> +	  auto new_insn = crtl->ssa->create_insn (attempt, INSN, pair_pat);
>> +	  change = make_change (new_insn);
>> +	  change->move_range = move_range;
>> +	  change->new_uses = merge_access_arrays (attempt,
>> +						  input_uses[0],
>> +						  input_uses[1]);
>> +	  gcc_assert (change->new_uses.is_valid ());
>> +
>> +	  auto d1 = drop_memory_access (input_defs[0]);
>> +	  auto d2 = drop_memory_access (input_defs[1]);
>> +	  change->new_defs = merge_access_arrays (attempt, d1, d2);
>> +	  gcc_assert (change->new_defs.is_valid ());
>> +
>> +	  auto new_set = crtl->ssa->create_set (attempt, new_insn, memory);
>> +	  change->new_defs = insert_access (attempt, new_set,
>> +					    change->new_defs);
>> +	  gcc_assert (change->new_defs.is_valid ());
>> +	  changes.safe_insert (1, change);
>> +	  pair_change = change;
>> +	}
>> +    }
>> +
>> +  if (trailing_add)
>> +    changes.quick_push (make_delete (trailing_add));
>> +
>> +  auto n_changes = changes.length ();
>> +  gcc_checking_assert (n_changes >= 2 && n_changes <= 4);
>> +
>> +  gcc_assert (crtl->ssa->verify_insn_changes (changes));
>> +
>> +  cancel_changes (0);
> 
> Can you explain why you changed this to unconditionally cancel the
> changes instead of letting RTL-SSA perform them?
> 
>> +
>> +  confirm_change_group ();
>> +
>> +  gcc_checking_assert (tombstone_uids.length () <= 2);
>> +  for (auto uid : tombstone_uids)
>> +    track_tombstone (uid);
>> +
>> +  return true;
>> +}
>> +
>> +void fusion_bb (bb_info *bb)
>> +{
> 
> Likewise, I think this function should be in generic code, not in target
> code.  Note for aarch64 we have:
> 
>   const bool track_loads
>     = aarch64_tune_params.ldp_policy_model != AARCH64_LDP_STP_POLICY_NEVER;
>   const bool track_stores
>     = aarch64_tune_params.stp_policy_model != AARCH64_LDP_STP_POLICY_NEVER;
> 
> I think these can just be hidden behind target-controlled virtual
> functions, so you'd have something like:
> 
> bool pair_fusion::should_fuse_stores ();
> bool pair_fusion::should_fuse_loads ();
> 

I kept transform_for_base in target code with pure virtual
function and we dont need to use these in generic code.

As this is different in rs6000 and aarch64 target.

> (or a single function that returns multiple bitfields if you like).
> 
> then for aarch64 you would return:
> 
> aarch64_tune_params.ldp_policy_model != AARCH64_LDP_STP_POLICY_NEVER;
> 
> for the load case, and for rs6000 you would return true (just based on
> what's below).  Presumably since you don't handle stores below, you'd
> want to return false in the store case for rs6000 code (at least
> initially).
> 
> As I mentioned above, the aarch64 version of this function also tries
> to promote existing pairs to writeback pairs, again that would be
> controlled by the same function call that replaces uses of
> the param aarch64_ldp_writeback.
>

Yes I did that.
 
>> +  fusion_bb_info bb_state (bb);
>> +
>> +  for (auto insn : bb->nondebug_insns ())
>> +    {
>> +      rtx_insn *rti = insn->rtl ();
>> +
>> +      if (!rti || !INSN_P (rti))
>> +	continue;
>> +
>> +      rtx pat = PATTERN (rti);
>> +
>> +      if (GET_CODE (pat) != SET)
>> +	continue;
>> +
>> +      if (MEM_P (XEXP (pat, 1)))
>> +	bb_state.track_access (insn, true, XEXP (pat, 1));
>> +    }
>> +
>> +  bb_state.transform ();
>> +  bb_state.cleanup_tombstones ();
>> +}
>> +
>> +static void
>> +fusion_init ()
>> +{
>> +  df_set_flags (DF_RD_PRUNE_DEAD_DEFS);
>> +  df_chain_add_problem (DF_DU_CHAIN | DF_UD_CHAIN);
>> +  df_analyze ();
>> +  df_set_flags (DF_DEFER_INSN_RESCAN);
>> +  /* Rebuild ud- and du-chains.  */
>> +  df_remove_problem (df_chain);
>> +  df_process_deferred_rescans ();
>> +  df_set_flags (DF_RD_PRUNE_DEAD_DEFS);
>> +  df_chain_add_problem (DF_DU_CHAIN | DF_UD_CHAIN);
>> +  df_analyze ();
>> +  df_set_flags (DF_DEFER_INSN_RESCAN);
> 
> Is there a reason you can't use RTL-SSA for the def-use information?
> IMO it would be cleaner to use RTL-SSA as the existing aarch64 pass does
> (and the refactored generic code presumably will do).
> 
Yes I have used rtl-ssa at most of the places in rs6000 target.
But where require to modify rtx * loc then DF_REF_LOC while
generation subreg in order to generate sequential registers
for pairs.
>> +  calculate_dominance_info (CDI_DOMINATORS);
>> +  crtl->ssa = new rtl_ssa::function_info (cfun);
>> +}
>> +
>> +static void
>> +fusion_destroy ()
>> +{
>> +  if (crtl->ssa->perform_pending_updates ())
>> +    cleanup_cfg (0);
>> +
>> +  free_dominance_info (CDI_DOMINATORS);
>> +  delete crtl->ssa;
>> +  crtl->ssa = nullptr;
>> +}
>> +
>> +// Iterate over the accesses in GROUP, looking for adjacent sets
>> +// of accesses.  If we find two sets of adjacent accesses, call
>> +// merge_pairs.
>> +void
>> +fusion_bb_info::transform_for_base (int encoded_lfs,
>> +				 access_group &group)
> 
> Again, I think this function should be in generic code.
> 

Yes I moved it to generic code.
>> +{
>> +  const auto lfs = decode_lfs (encoded_lfs);
>> +  const unsigned access_size = lfs.size;
>> +
>> +  bool skip_next = true;
>> +  access_record *prev_access = nullptr;
>> +
>> +  for (auto &access : group.list)
>> +    {
>> +      if (skip_next)
>> +	skip_next = false;
>> +      else if (known_eq (access.offset, prev_access->offset + access_size))
>> +	{
>> +	  merge_pairs (prev_access->cand_insns,
>> +		       access.cand_insns,
>> +		       lfs.load_p,
>> +		       access_size);
>> +	  skip_next = true;
>> +	}
>> +      prev_access = &access;
>> +    }
>> +}
>> +
>> +void rs6000_analyze_vecload ()
>> +{
>> +  fusion_init ();
>> +
>> +  for (auto bb : crtl->ssa->bbs ())
>> +    fusion_bb (bb);
>> +
>> +  fusion_destroy ();
>> +}
>> +
>> +const pass_data pass_data_analyze_vecload =
>> +{
>> +  RTL_PASS, /* type */
>> +  "vecload", /* name */
>> +  OPTGROUP_NONE, /* optinfo_flags */
>> +  TV_NONE, /* tv_id */
>> +  0, /* properties_required */
>> +  0, /* properties_provided */
>> +  0, /* properties_destroyed */
>> +  0, /* todo_flags_start */
>> +  TODO_df_finish, /* todo_flags_finish */
>> +};
>> +
>> +class pass_analyze_vecload : public rtl_opt_pass
>> +{
>> +public:
>> +  pass_analyze_vecload(gcc::context *ctxt)
>> +    : rtl_opt_pass(pass_data_analyze_vecload, ctxt)
>> +  {}
>> +
>> +  /* opt_pass methods: */
>> +  bool gate (function *)
>> +    {
>> +      return (optimize > 0 && TARGET_VSX && TARGET_POWER10);
>> +    }
>> +
>> +  unsigned int execute (function *) final override
>> +    {
>> +      rs6000_analyze_vecload ();
>> +      return 0;
>> +    }
>> +}; // class pass_analyze_vecload
>> +
>> +rtl_opt_pass *
>> +make_pass_analyze_vecload (gcc::context *ctxt)
>> +{
>> +  return new pass_analyze_vecload (ctxt);
>> +}
>> +
>> diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
>> index 5d975dab921..e36946c44ea 100644
>> --- a/gcc/config/rs6000/rs6000.cc
>> +++ b/gcc/config/rs6000/rs6000.cc
>> @@ -1178,6 +1178,7 @@ static bool rs6000_secondary_reload_move (enum rs6000_reg_type,
>>  					  secondary_reload_info *,
>>  					  bool);
>>  rtl_opt_pass *make_pass_analyze_swaps (gcc::context*);
>> +rtl_opt_pass *make_pass_analyze_vecload (gcc::context*);
>>  
>>  /* Hash table stuff for keeping track of TOC entries.  */
>>  
>> diff --git a/gcc/config/rs6000/t-rs6000 b/gcc/config/rs6000/t-rs6000
>> index b3ce09d523b..7664830f8ab 100644
>> --- a/gcc/config/rs6000/t-rs6000
>> +++ b/gcc/config/rs6000/t-rs6000
>> @@ -35,6 +35,11 @@ rs6000-p8swap.o: $(srcdir)/config/rs6000/rs6000-p8swap.cc
>>  	$(COMPILE) $<
>>  	$(POSTCOMPILE)
>>  
>> +rs6000-vecload-opt.o: $(srcdir)/config/rs6000/rs6000-vecload-opt.cc
>> +	$(COMPILE) $<
>> +	$(POSTCOMPILE)
>> +
>> +
>>  rs6000-d.o: $(srcdir)/config/rs6000/rs6000-d.cc
>>  	$(COMPILE) $<
>>  	$(POSTCOMPILE)
>> diff --git a/gcc/emit-rtl.cc b/gcc/emit-rtl.cc
>> index 1856fa4884f..ffc47a6eaa0 100644
>> --- a/gcc/emit-rtl.cc
>> +++ b/gcc/emit-rtl.cc
>> @@ -921,7 +921,7 @@ validate_subreg (machine_mode omode, machine_mode imode,
>>      return false;
>>  
>>    /* The subreg offset cannot be outside the inner object.  */
>> -  if (maybe_ge (offset, isize))
>> +  if (maybe_gt (offset, isize))
>>      return false;
> 
> Can you explain why this change is needed?
> 

This is required in rs6000 target where we generate the subreg
with offset 16 from OO mode (256 bit) to 128 bit vector modes.
Otherwise it segfaults.

>>  
>>    poly_uint64 regsize = REGMODE_NATURAL_SIZE (imode);
>> @@ -3035,18 +3035,6 @@ verify_rtx_sharing (rtx orig, rtx insn)
>>        break;
>>      }
>>  
>> -  /* This rtx may not be shared.  If it has already been seen,
>> -     replace it with a copy of itself.  */
>> -  if (flag_checking && RTX_FLAG (x, used))
>> -    {
>> -      error ("invalid rtl sharing found in the insn");
>> -      debug_rtx (insn);
>> -      error ("shared rtx");
>> -      debug_rtx (x);
>> -      internal_error ("internal consistency failure");
>> -    }
>> -  gcc_assert (!RTX_FLAG (x, used));
>> -
>

I have removed this modification to assert in recent patch.

 
> I think the need to make this change is likely indicative of a problem
> in the pass, and the pass should be fixed instead of changing
> verify_rtx_sharing.
> 
>>    RTX_FLAG (x, used) = 1;
>>

Yes I did that.
  
>>    /* Now scan the subexpressions recursively.  */
>> diff --git a/gcc/fusion-common.h b/gcc/fusion-common.h
>> new file mode 100644
>> index 00000000000..82f1d6ef032
>> --- /dev/null
>> +++ b/gcc/fusion-common.h
>> @@ -0,0 +1,1195 @@
>> +// Common infrastructure for Load/Store Pair fusion optimization pass.
>> +// Copyright (C) 2023-2024 Free Software Foundation, Inc.
>> +//
>> +// This file is part of GCC.
>> +//
>> +// GCC is free software; you can redistribute it and/or modify it
>> +// under the terms of the GNU General Public License as published by
>> +// the Free Software Foundation; either version 3, or (at your option)
>> +// any later version.
>> +//
>> +// GCC is distributed in the hope that it will be useful, but
>> +// WITHOUT ANY WARRANTY; without even the implied warranty of
>> +// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>> +// General Public License for more details.
>> +//
>> +// You should have received a copy of the GNU General Public License
>> +// along with GCC; see the file COPYING3.  If not see
>> +// <http://www.gnu.org/licenses/>.
>> +
>> +#define INCLUDE_ALGORITHM
>> +#define INCLUDE_FUNCTIONAL
>> +#define INCLUDE_LIST
>> +#define INCLUDE_TYPE_TRAITS
>> +#include "config.h"
>> +#include "system.h"
>> +#include "coretypes.h"
>> +#include "backend.h"
>> +#include "rtl.h"
>> +#include "df.h"
>> +#include "rtl-iter.h"
>> +#include "rtl-ssa.h"
>> +#include "cfgcleanup.h"
>> +#include "tree-pass.h"
>> +#include "ordered-hash-map.h"
>> +#include "tree-dfa.h"
>> +#include "fold-const.h"
>> +#include "tree-hash-traits.h"
>> +#include "print-tree.h"
>> +#include "insn-attr.h"
>> +
>> +using namespace rtl_ssa;
>> +
>> +static constexpr HOST_WIDE_INT LDP_IMM_BITS = 7;
>> +static constexpr HOST_WIDE_INT LDP_IMM_SIGN_BIT = (1 << (LDP_IMM_BITS - 1));
>> +static constexpr HOST_WIDE_INT LDP_MAX_IMM = LDP_IMM_SIGN_BIT - 1;
>> +static constexpr HOST_WIDE_INT LDP_MIN_IMM = -LDP_MAX_IMM - 1;
> 
> Note we would not want these aarch64-specific constants in generic code,
> they would need to be target-configurable (in fact we might just want a
> virtual function / target hook to validate offsets instead).
> 
>> +
>> +// We pack these fields (load_p, fpsimd_p, and size) into an integer
>> +// (LFS) which we use as part of the key into the main hash tables.
>> +//
>> +// The idea is that we group candidates together only if they agree on
>> +// the fields below.  Candidates that disagree on any of these
>> +// properties shouldn't be merged together.
>> +struct lfs_fields
>> +{
>> +  bool load_p;
>> +  bool fpsimd_p;
>> +  unsigned size;
>> +};
>> +
>> +using insn_list_t = std::list<insn_info *>;
>> +using insn_iter_t = insn_list_t::iterator;
>> +
>> +// Information about the accesses at a given offset from a particular
>> +// base.  Stored in an access_group, see below.
>> +struct access_record
>> +{
>> +  poly_int64 offset;
>> +  std::list<insn_info *> cand_insns;
>> +  std::list<access_record>::iterator place;
>> +
>> +  access_record (poly_int64 off) : offset (off) {}
>> +};
>> +
>> +// A group of accesses where adjacent accesses could be ldp/stp
>> +// or lxvp/stxvp candidates.  The splay tree supports efficient
> 
> Perhaps better to spell out "load pair or store pair" instead of listing
> the various instruction mnenomics for different targets.
> 
Yes I did that.

>> +// insertion, while the list supports efficient iteration.
>> +struct access_group
>> +{
>> +  splay_tree<access_record *> tree;
>> +  std::list<access_record> list;
>> +
>> +  template<typename Alloc>
>> +  inline void track (Alloc node_alloc, poly_int64 offset, insn_info *insn);
>> +};
>> +
>> +// Information about a potential base candidate, used in try_fuse_pair.
>> +// There may be zero, one, or two viable RTL bases for a given pair.
>> +struct base_cand
>> +{
>> +  // DEF is the def of the base register to be used by the pair.
>> +  def_info *def;
>> +
>> +  // FROM_INSN is -1 if the base candidate is already shared by both
>> +  // candidate insns.  Otherwise it holds the index of the insn from
>> +  // which the base originated.
>> +  //
>> +  // In the case that the base is shared, either DEF is already used
>> +  // by both candidate accesses, or both accesses see different versions
>> +  // of the same regno, in which case DEF is the def consumed by the
>> +  // first candidate access.
>> +  int from_insn;
>> +
>> +  // To form a pair, we do so by moving the first access down and the second
>> +  // access up.  To determine where to form the pair, and whether or not
>> +  // it is safe to form the pair, we track instructions which cannot be
>> +  // re-ordered past due to either dataflow or alias hazards.
>> +  //
>> +  // Since we allow changing the base used by an access, the choice of
>> +  // base can change which instructions act as re-ordering hazards for
>> +  // this pair (due to different dataflow).  We store the initial
>> +  // dataflow hazards for this choice of base candidate in HAZARDS.
>> +  //
>> +  // These hazards act as re-ordering barriers to each candidate insn
>> +  // respectively, in program order.
>> +  //
>> +  // Later on, when we take alias analysis into account, we narrow
>> +  // HAZARDS accordingly.
>> +  insn_info *hazards[2];
>> +
>> +  base_cand (def_info *def, int insn)
>> +    : def (def), from_insn (insn), hazards {nullptr, nullptr} {}
>> +
>> +  base_cand (def_info *def) : base_cand (def, -1) {}
>> +
>> +  // Test if this base candidate is viable according to HAZARDS.
>> +  bool viable () const
>> +  {
>> +    return !hazards[0] || !hazards[1] || (*hazards[0] > *hazards[1]);
>> +  }
>> +};
>> +
>> +// Information about an alternate base.  For a def_info D, it may
>> +// instead be expressed as D = BASE + OFFSET.
>> +struct alt_base
>> +{
>> +  def_info *base;
>> +  poly_int64 offset;
>> +};
>> +
>> +// State used by the pass for a given basic block.
>> +struct fusion_bb_info
>> +{
>> +  using def_hash = nofree_ptr_hash<def_info>;
>> +  using expr_key_t = pair_hash<tree_operand_hash, int_hash<int, -1, -2>>;
>> +  using def_key_t = pair_hash<def_hash, int_hash<int, -1, -2>>;
>> +
>> +  // Map of <tree base, LFS> -> access_group.
>> +  ordered_hash_map<expr_key_t, access_group> expr_map;
>> +
>> +  // Map of <RTL-SSA def_info *, LFS> -> access_group.
>> +  ordered_hash_map<def_key_t, access_group> def_map;
>> +
>> +  // Given the def_info for an RTL base register, express it as an offset from
>> +  // some canonical base instead.
>> +  //
>> +  // Canonicalizing bases in this way allows us to identify adjacent accesses
>> +  // even if they see different base register defs.
>> +  hash_map<def_hash, alt_base> canon_base_map;
>> +
>> +  static const size_t obstack_alignment = sizeof (void *);
>> +  bb_info *m_bb;
>> +
>> +  fusion_bb_info (bb_info *bb) : m_bb (bb), m_emitted_tombstone (false)
>> +  {
>> +    obstack_specify_allocation (&m_obstack, OBSTACK_CHUNK_SIZE,
>> +				obstack_alignment, obstack_chunk_alloc,
>> +				obstack_chunk_free);
>> +  }
>> +  ~fusion_bb_info ()
>> +  {
>> +    obstack_free (&m_obstack, nullptr);
>> +
>> +    if (m_emitted_tombstone)
>> +      {
>> +	bitmap_release (&m_tombstone_bitmap);
>> +	bitmap_obstack_release (&m_bitmap_obstack);
>> +      }
>> +  }
>> +
>> +  inline void track_access (insn_info *, bool load, rtx mem);
>> +  inline void transform ();
>> +  inline void cleanup_tombstones ();
>> +
>> +private:
>> +  obstack m_obstack;
>> +
>> +  // State for keeping track of tombstone insns emitted for this BB.
>> +  bitmap_obstack m_bitmap_obstack;
>> +  bitmap_head m_tombstone_bitmap;
>> +  bool m_emitted_tombstone;
>> +
>> +  inline splay_tree_node<access_record *> *node_alloc (access_record *);
>> +
>> +  template<typename Map>
>> +  inline void traverse_base_map (Map &map);
>> +  inline void transform_for_base (int load_size, access_group &group);
>> +
>> +   void merge_pairs (insn_list_t &, insn_list_t &,
>> +			   bool load_p, unsigned access_size);
>> +
>> +   bool try_fuse_pair (bool load_p, unsigned access_size,
>> +			     insn_info *i1, insn_info *i2);
>> +
>> +   bool fuse_pair (bool load_p, unsigned access_size,
>> +			 int writeback,
>> +			 insn_info *i1, insn_info *i2,
>> +			 base_cand &base,
>> +			 const insn_range_info &move_range);
>> +
>> +  inline void track_tombstone (int uid);
>> +
>> +  inline bool track_via_mem_expr (insn_info *, rtx mem, lfs_fields lfs);
>> +};
>> +
>> +splay_tree_node<access_record *> *fusion_bb_info::node_alloc (access_record *access)
>> +{
>> +  using T = splay_tree_node<access_record *>;
>> +  void *addr = obstack_alloc (&m_obstack, sizeof (T));
>> +  return new (addr) T (access);
>> +}
>> +
>> +// Given a mem MEM, if the address has side effects, return a MEM that accesses
>> +// the same address but without the side effects.  Otherwise, return
>> +// MEM unchanged.
>> +static rtx
>> +drop_writeback (rtx mem)
>> +{
>> +  rtx addr = XEXP (mem, 0);
>> +
>> +  if (!side_effects_p (addr))
>> +    return mem;
>> +
>> +  switch (GET_CODE (addr))
>> +    {
>> +    case PRE_MODIFY:
>> +      addr = XEXP (addr, 1);
>> +      break;
>> +    case POST_MODIFY:
>> +    case POST_INC:
>> +    case POST_DEC:
>> +      addr = XEXP (addr, 0);
>> +      break;
>> +    case PRE_INC:
>> +    case PRE_DEC:
>> +    {
>> +      poly_int64 adjustment = GET_MODE_SIZE (GET_MODE (mem));
>> +      if (GET_CODE (addr) == PRE_DEC)
>> +	adjustment *= -1;
>> +      addr = plus_constant (GET_MODE (addr), XEXP (addr, 0), adjustment);
>> +      break;
>> +    }
>> +    default:
>> +      gcc_unreachable ();
>> +    }
>> +
>> +  return change_address (mem, GET_MODE (mem), addr);
>> +}
>> +
>> +// Convenience wrapper around strip_offset that can also look through
>> +// RTX_AUTOINC addresses.  The interface is like strip_offset except we take a
>> +// MEM so that we know the mode of the access.
>> +static rtx
>> +load_strip_offset (rtx mem, poly_int64 *offset)
> 
> Nit: naming.  How about pair_fusion_strip_offset instead?
> 
Yes I did that.

>> +{
>> +  rtx addr = XEXP (mem, 0);
>> +
>> +  switch (GET_CODE (addr))
>> +    {
>> +    case PRE_MODIFY:
>> +    case POST_MODIFY:
>> +      addr = strip_offset (XEXP (addr, 1), offset);
>> +      gcc_checking_assert (REG_P (addr));
>> +      gcc_checking_assert (rtx_equal_p (XEXP (XEXP (mem, 0), 0), addr));
>> +      break;
>> +    case PRE_INC:
>> +    case POST_INC:
>> +      addr = XEXP (addr, 0);
>> +      *offset = GET_MODE_SIZE (GET_MODE (mem));
>> +      gcc_checking_assert (REG_P (addr));
>> +      break;
>> +    case PRE_DEC:
>> +    case POST_DEC:
>> +      addr = XEXP (addr, 0);
>> +      *offset = -GET_MODE_SIZE (GET_MODE (mem));
>> +      gcc_checking_assert (REG_P (addr));
>> +      break;
>> +
>> +    default:
>> +      addr = strip_offset (addr, offset);
>> +    }
>> +
>> +  return addr;
>> +}
>> +
>> +// Return true if X is a PRE_{INC,DEC,MODIFY} rtx.
>> +static bool
>> +any_pre_modify_p (rtx x)
>> +{
>> +  const auto code = GET_CODE (x);
>> +  return code == PRE_INC || code == PRE_DEC || code == PRE_MODIFY;
>> +}
>> +
>> +// Return true if X is a POST_{INC,DEC,MODIFY} rtx.
>> +static bool
>> +any_post_modify_p (rtx x)
>> +{
>> +  const auto code = GET_CODE (x);
>> +  return code == POST_INC || code == POST_DEC || code == POST_MODIFY;
>> +}
>> +// Given LFS (load_p, fpsimd_p, size) fields in FIELDS, encode these
>> +// into an integer for use as a hash table key.
>> +static int
>> +encode_lfs (lfs_fields fields)
>> +{
>> +  int size_log2 = exact_log2 (fields.size);
>> +  //gcc_checking_assert (size_log2 >= 2 && size_log2 <= 4);
>> +  return ((int)fields.load_p << 3)
>> +    | ((int)fields.fpsimd_p << 2)
>> +    | (size_log2 - 2);
>> +}
>> +
>> +// Inverse of encode_lfs.
>> +static lfs_fields
>> +decode_lfs (int lfs)
>> +{
>> +  bool load_p = (lfs & (1 << 3));
>> +  bool fpsimd_p = (lfs & (1 << 2));
>> +  unsigned size = 1U << ((lfs & 3) + 2);
>> +  return { load_p, fpsimd_p, size };
>> +}
>> +// Track the access INSN at offset OFFSET in this access group.
>> +// ALLOC_NODE is used to allocate splay tree nodes.
>> +template<typename Alloc>
>> +void
>> +access_group::track (Alloc alloc_node, poly_int64 offset, insn_info *insn)
>> +{
>> +  auto insert_before = [&](std::list<access_record>::iterator after)
>> +    {
>> +      auto it = list.emplace (after, offset);
>> +      it->cand_insns.push_back (insn);
>> +      it->place = it;
>> +      return &*it;
>> +    };
>> +
>> +  if (!list.size ())
>> +    {
>> +      auto access = insert_before (list.end ());
>> +      tree.insert_max_node (alloc_node (access));
>> +      return;
>> +    }
>> +
>> +  auto compare = [&](splay_tree_node<access_record *> *node)
>> +    {
>> +      return compare_sizes_for_sort (offset, node->value ()->offset);
>> +    };
>> +  auto result = tree.lookup (compare);
>> +  splay_tree_node<access_record *> *node = tree.root ();
>> +  if (result == 0)
>> +    node->value ()->cand_insns.push_back (insn);
>> +  else
>> +    {
>> +      auto it = node->value ()->place;
>> +      auto after = (result > 0) ? std::next (it) : it;
>> +      auto access = insert_before (after);
>> +      tree.insert_child (node, result > 0, alloc_node (access));
>> +    }
>> +}
>> +// Given a candidate access INSN (with mem MEM), see if it has a suitable
>> +// MEM_EXPR base (i.e. a tree decl) relative to which we can track the access.
>> +// LFS is used as part of the key to the hash table, see track_access.
>> +bool
>> +fusion_bb_info::track_via_mem_expr (insn_info *insn, rtx mem, lfs_fields lfs)
>> +{
>> +  if (!MEM_EXPR (mem) || !MEM_OFFSET_KNOWN_P (mem))
>> +    return false;
>> +
>> +  poly_int64 offset;
>> +  tree base_expr = get_addr_base_and_unit_offset (MEM_EXPR (mem),
>> +						  &offset);
>> +  if (!base_expr || !DECL_P (base_expr))
>> +    return false;
>> +
>> +  offset += MEM_OFFSET (mem);
>> +
>> +  const machine_mode mem_mode = GET_MODE (mem);
>> +  const HOST_WIDE_INT mem_size = GET_MODE_SIZE (mem_mode);//.to_constant ();
> 
> Why the .to_constant () comment here?
> 
Yes I have uncommented that.

Please review the recent patch as I have incorporated all the 
above comments in recent patch that I have sent.

Thanks & Regards
Ajit
> I hope these comments are useful.  Please feel free to reach out if you have any
> questions / get stuck with anything.  I'm often around on IRC.
> 
> Thanks,
> Alex
> 
>> +
>> +  // Punt on misaligned offsets.  LDP/STP instructions require offsets to be a
>> +  // multiple of the access size, and we believe that misaligned offsets on
>> +  // MEM_EXPR bases are likely to lead to misaligned offsets w.r.t. RTL bases.
>> +  if (!multiple_p (offset, mem_size))
>> +    return false;
>> +
>> +  const auto key = std::make_pair (base_expr, encode_lfs (lfs));
>> +  access_group &group = expr_map.get_or_insert (key, NULL);
>> +  auto alloc = [&](access_record *access) { return node_alloc (access); };
>> +  group.track (alloc, offset, insn);
>> +
>> +  if (dump_file)
>> +    {
>> +      fprintf (dump_file, "[bb %u] tracking insn %d via ",
>> +	       m_bb->index (), insn->uid ());
>> +      print_node_brief (dump_file, "mem expr", base_expr, 0);
>> +      fprintf (dump_file, " [L=%d FP=%d, %smode, off=",
>> +	       lfs.load_p, lfs.fpsimd_p, mode_name[mem_mode]);
>> +      print_dec (offset, dump_file);
>> +      fprintf (dump_file, "]\n");
>> +    }
>> +
>> +  return true;
>> +}
>> +
>> +// Dummy predicate that never ignores any insns.
>> +static bool no_ignore (insn_info *) { return false; }
>> +
>> +// Return the latest dataflow hazard before INSN.
>> +//
>> +// If IGNORE is non-NULL, this points to a sub-rtx which we should ignore for
>> +// dataflow purposes.  This is needed when considering changing the RTL base of
>> +// an access discovered through a MEM_EXPR base.
>> +//
>> +// If IGNORE_INSN is non-NULL, we should further ignore any hazards arising
>> +// from that insn.
>> +//
>> +// N.B. we ignore any defs/uses of memory here as we deal with that separately,
>> +// making use of alias disambiguation.
>> +static insn_info *
>> +latest_hazard_before (insn_info *insn, rtx *ignore,
>> +		      insn_info *ignore_insn = nullptr)
>> +{
>> +  insn_info *result = nullptr;
>> +
>> +  // If the insn can throw then it is at the end of a BB and we can't
>> +  // move it, model this by recording a hazard in the previous insn
>> +  // which will prevent moving the insn up.
>> +  if (cfun->can_throw_non_call_exceptions
>> +      && find_reg_note (insn->rtl (), REG_EH_REGION, NULL_RTX))
>> +    return insn->prev_nondebug_insn ();
>> +
>> +  // Return true if we registered the hazard.
>> +  auto hazard = [&](insn_info *h) -> bool
>> +    {
>> +      gcc_checking_assert (*h < *insn);
>> +      if (h == ignore_insn)
>> +	return false;
>> +
>> +      if (!result || *h > *result)
>> +	result = h;
>> +
>> +      return true;
>> +    };
>> +
>> +  rtx pat = PATTERN (insn->rtl ());
>> +  auto ignore_use = [&](use_info *u)
>> +    {
>> +      if (u->is_mem ())
>> +	return true;
>> +
>> +      return !refers_to_regno_p (u->regno (), u->regno () + 1, pat, ignore);
>> +    };
>> +
>> +  // Find defs of uses in INSN (RaW).
>> +  for (auto use : insn->uses ())
>> +    if (!ignore_use (use) && use->def ())
>> +      hazard (use->def ()->insn ());
>> +
>> +  // Find previous defs (WaW) or previous uses (WaR) of defs in INSN.
>> +  for (auto def : insn->defs ())
>> +    {
>> +      if (def->is_mem ())
>> +	continue;
>> +
>> +      if (def->prev_def ())
>> +	{
>> +	  hazard (def->prev_def ()->insn ()); // WaW
>> +
>> +	  auto set = dyn_cast<set_info *> (def->prev_def ());
>> +	  if (set && set->has_nondebug_insn_uses ())
>> +	    for (auto use : set->reverse_nondebug_insn_uses ())
>> +	      if (use->insn () != insn && hazard (use->insn ())) // WaR
>> +		break;
>> +	}
>> +
>> +      if (!HARD_REGISTER_NUM_P (def->regno ()))
>> +	continue;
>> +
>> +      // Also need to check backwards for call clobbers (WaW).
>> +      for (auto call_group : def->ebb ()->call_clobbers ())
>> +	{
>> +	  if (!call_group->clobbers (def->resource ()))
>> +	    continue;
>> +
>> +	  auto clobber_insn = prev_call_clobbers_ignoring (*call_group,
>> +							   def->insn (),
>> +							   no_ignore);
>> +	  if (clobber_insn)
>> +	    hazard (clobber_insn);
>> +	}
>> +
>> +    }
>> +
>> +  return result;
>> +}
>> +
>> +// Return the first dataflow hazard after INSN.
>> +//
>> +// If IGNORE is non-NULL, this points to a sub-rtx which we should ignore for
>> +// dataflow purposes.  This is needed when considering changing the RTL base of
>> +// an access discovered through a MEM_EXPR base.
>> +//
>> +// N.B. we ignore any defs/uses of memory here as we deal with that separately,
>> +// making use of alias disambiguation.
>> +static insn_info *
>> +first_hazard_after (insn_info *insn, rtx *ignore)
>> +{
>> +  insn_info *result = nullptr;
>> +  auto hazard = [insn, &result](insn_info *h)
>> +    {
>> +      gcc_checking_assert (*h > *insn);
>> +      if (!result || *h < *result)
>> +	result = h;
>> +    };
>> +
>> +  rtx pat = PATTERN (insn->rtl ());
>> +  auto ignore_use = [&](use_info *u)
>> +    {
>> +      if (u->is_mem ())
>> +	return true;
>> +
>> +      return !refers_to_regno_p (u->regno (), u->regno () + 1, pat, ignore);
>> +    };
>> +
>> +  for (auto def : insn->defs ())
>> +    {
>> +      if (def->is_mem ())
>> +	continue;
>> +
>> +      if (def->next_def ())
>> +	hazard (def->next_def ()->insn ()); // WaW
>> +
>> +      auto set = dyn_cast<set_info *> (def);
>> +      if (set && set->has_nondebug_insn_uses ())
>> +	hazard (set->first_nondebug_insn_use ()->insn ()); // RaW
>> +
>> +      if (!HARD_REGISTER_NUM_P (def->regno ()))
>> +	continue;
>> +
>> +      // Also check for call clobbers of this def (WaW).
>> +      for (auto call_group : def->ebb ()->call_clobbers ())
>> +	{
>> +	  if (!call_group->clobbers (def->resource ()))
>> +	    continue;
>> +
>> +	  auto clobber_insn = next_call_clobbers_ignoring (*call_group,
>> +							   def->insn (),
>> +							   no_ignore);
>> +	  if (clobber_insn)
>> +	    hazard (clobber_insn);
>> +	}
>> +    }
>> +
>> +  // Find any subsequent defs of uses in INSN (WaR).
>> +  for (auto use : insn->uses ())
>> +    {
>> +      if (ignore_use (use))
>> +	continue;
>> +
>> +      if (use->def ())
>> +	{
>> +	  auto def = use->def ()->next_def ();
>> +	  if (def && def->insn () == insn)
>> +	    def = def->next_def ();
>> +
>> +	  if (def)
>> +	    hazard (def->insn ());
>> +	}
>> +
>> +      if (!HARD_REGISTER_NUM_P (use->regno ()))
>> +	continue;
>> +
>> +      // Also need to handle call clobbers of our uses (again WaR).
>> +      //
>> +      // See restrict_movement_for_uses_ignoring for why we don't
>> +      // need to check backwards for call clobbers.
>> +      for (auto call_group : use->ebb ()->call_clobbers ())
>> +	{
>> +	  if (!call_group->clobbers (use->resource ()))
>> +	    continue;
>> +
>> +	  auto clobber_insn = next_call_clobbers_ignoring (*call_group,
>> +							   use->insn (),
>> +							   no_ignore);
>> +	  if (clobber_insn)
>> +	    hazard (clobber_insn);
>> +	}
>> +    }
>> +
>> +  return result;
>> +}
>> +
>> +// Return true iff R1 and R2 overlap.
>> +static bool
>> +ranges_overlap_p (const insn_range_info &r1, const insn_range_info &r2)
>> +{
>> +  // If either range is empty, then their intersection is empty.
>> +  if (!r1 || !r2)
>> +    return false;
>> +
>> +  // When do they not overlap? When one range finishes before the other
>> +  // starts, i.e. (*r1.last < *r2.first || *r2.last < *r1.first).
>> +  // Inverting this, we get the below.
>> +  return *r1.last >= *r2.first && *r2.last >= *r1.first;
>> +}
>> +
>> +// Get the range of insns that def feeds.
>> +static insn_range_info get_def_range (def_info *def)
>> +{
>> +  insn_info *last = def->next_def ()->insn ()->prev_nondebug_insn ();
>> +  return { def->insn (), last };
>> +}
>> +
>> +// Given a def (of memory), return the downwards range within which we
>> +// can safely move this def.
>> +static insn_range_info
>> +def_downwards_move_range (def_info *def)
>> +{
>> +  auto range = get_def_range (def);
>> +
>> +  auto set = dyn_cast<set_info *> (def);
>> +  if (!set || !set->has_any_uses ())
>> +    return range;
>> +
>> +  auto use = set->first_nondebug_insn_use ();
>> +  if (use)
>> +    range = move_earlier_than (range, use->insn ());
>> +
>> +  return range;
>> +}
>> +
>> +// Given a def (of memory), return the upwards range within which we can
>> +// safely move this def.
>> +static insn_range_info
>> +def_upwards_move_range (def_info *def)
>> +{
>> +  def_info *prev = def->prev_def ();
>> +  insn_range_info range { prev->insn (), def->insn () };
>> +
>> +  auto set = dyn_cast<set_info *> (prev);
>> +  if (!set || !set->has_any_uses ())
>> +    return range;
>> +
>> +  auto use = set->last_nondebug_insn_use ();
>> +  if (use)
>> +    range = move_later_than (range, use->insn ());
>> +
>> +  return range;
>> +}
>> +
>> +// Given candidate store insns FIRST and SECOND, see if we can re-purpose one
>> +// of them (together with its def of memory) for the stp/stxvp insn. If so,
>> +//  return that insn.  Otherwise, return null.
>> +static insn_info *
>> +decide_store_strategy (insn_info *first,
>> +		     insn_info *second,
>> +		     const insn_range_info &move_range)
>> +{
>> +  def_info * const defs[2] = {
>> +    memory_access (first->defs ()),
>> +    memory_access (second->defs ())
>> +  };
>> +
>> +  if (move_range.includes (first)
>> +      || ranges_overlap_p (move_range, def_downwards_move_range (defs[0])))
>> +    return first;
>> +
>> +  if (move_range.includes (second)
>> +      || ranges_overlap_p (move_range, def_upwards_move_range (defs[1])))
>> +    return second;
>> +
>> +  return nullptr;
>> +}
>> +
>> +// Generate the RTL pattern for a "tombstone"; used temporarily during this pass
>> +// to replace stores that are marked for deletion where we can't immediately
>> +// delete the store (since there are uses of mem hanging off the store).
>> +//
>> +// These are deleted at the end of the pass and uses re-parented appropriately
>> +// at this point.
>> +static rtx
>> +gen_tombstone (void)
>> +{
>> +  return gen_rtx_CLOBBER (VOIDmode,
>> +			  gen_rtx_MEM (BLKmode, gen_rtx_SCRATCH (Pmode)));
>> +}
>> +// Go through the reg notes rooted at NOTE, dropping those that we should drop,
>> +// and preserving those that we want to keep by prepending them to (and
>> +// returning) RESULT.  EH_REGION is used to make sure we have at most one
>> +// REG_EH_REGION note in the resulting list.  FR_EXPR is used to return any
>> +// REG_FRAME_RELATED_EXPR note we find, as these can need special handling in
>> +// combine_reg_notes.
>> +static rtx
>> +filter_notes (rtx note, rtx result, bool *eh_region, rtx *fr_expr)
>> +{
>> +  for (; note; note = XEXP (note, 1))
>> +    {
>> +      switch (REG_NOTE_KIND (note))
>> +	{
>> +	case REG_DEAD:
>> +	  // REG_DEAD notes aren't required to be maintained.
>> +	case REG_EQUAL:
>> +	case REG_EQUIV:
>> +	case REG_UNUSED:
>> +	case REG_NOALIAS:
>> +	  // These can all be dropped.  For REG_EQU{AL,IV} they cannot apply to
>> +	  // non-single_set insns, and REG_UNUSED is re-computed by RTl-SSA, see
>> +	  // rtl-ssa/changes.cc:update_notes.
>> +	  //
>> +	  // Similarly, REG_NOALIAS cannot apply to a parallel.
>> +	case REG_INC:
>> +	  // When we form the pair insn, the reg update is implemented
>> +	  // as just another SET in the parallel, so isn't really an
>> +	  // auto-increment in the RTL sense, hence we drop the note.
>> +	  break;
>> +	case REG_EH_REGION:
>> +	  gcc_assert (!*eh_region);
>> +	  *eh_region = true;
>> +	  result = alloc_reg_note (REG_EH_REGION, XEXP (note, 0), result);
>> +	  break;
>> +	case REG_CFA_DEF_CFA:
>> +	case REG_CFA_OFFSET:
>> +	case REG_CFA_RESTORE:
>> +	  result = alloc_reg_note (REG_NOTE_KIND (note),
>> +				   copy_rtx (XEXP (note, 0)),
>> +				   result);
>> +	  break;
>> +	case REG_FRAME_RELATED_EXPR:
>> +	  gcc_assert (!*fr_expr);
>> +	  *fr_expr = copy_rtx (XEXP (note, 0));
>> +	  break;
>> +	default:
>> +	  // Unexpected REG_NOTE kind.
>> +	  gcc_unreachable ();
>> +	}
>> +    }
>> +
>> +  return result;
>> +}
>> +
>> +// Return the notes that should be attached to a combination of I1 and I2, where
>> +// *I1 < *I2.  LOAD_P is true for loads.
>> +static rtx
>> +combine_reg_notes (insn_info *i1, insn_info *i2, bool load_p)
>> +{
>> +  // Temporary storage for REG_FRAME_RELATED_EXPR notes.
>> +  rtx fr_expr[2] = {};
>> +
>> +  bool found_eh_region = false;
>> +  rtx result = NULL_RTX;
>> +  result = filter_notes (REG_NOTES (i2->rtl ()), result,
>> +			 &found_eh_region, fr_expr);
>> +  result = filter_notes (REG_NOTES (i1->rtl ()), result,
>> +			 &found_eh_region, fr_expr + 1);
>> +
>> +  if (!load_p)
>> +    {
>> +      // Simple frame-related sp-relative saves don't need CFI notes, but when
>> +      // we combine them into an stp/stxvp we will need a CFI note as dwarf2cfi
>> +      //  can't interpret the unspec pair representation directly.
>> +      if (RTX_FRAME_RELATED_P (i1->rtl ()) && !fr_expr[0])
>> +	fr_expr[0] = copy_rtx (PATTERN (i1->rtl ()));
>> +      if (RTX_FRAME_RELATED_P (i2->rtl ()) && !fr_expr[1])
>> +	fr_expr[1] = copy_rtx (PATTERN (i2->rtl ()));
>> +    }
>> +
>> +  rtx fr_pat = NULL_RTX;
>> +  if (fr_expr[0] && fr_expr[1])
>> +    {
>> +      // Combining two frame-related insns, need to construct
>> +      // a REG_FRAME_RELATED_EXPR note which represents the combined
>> +      // operation.
>> +      RTX_FRAME_RELATED_P (fr_expr[1]) = 1;
>> +      fr_pat = gen_rtx_PARALLEL (VOIDmode,
>> +				 gen_rtvec (2, fr_expr[0], fr_expr[1]));
>> +    }
>> +  else
>> +    fr_pat = fr_expr[0] ? fr_expr[0] : fr_expr[1];
>> +
>> +  if (fr_pat)
>> +    result = alloc_reg_note (REG_FRAME_RELATED_EXPR,
>> +			     fr_pat, result);
>> +
>> +  return result;
>> +}
>> +
>> +// Given two memory accesses in PATS, at least one of which is of a
>> +// writeback form, extract two non-writeback memory accesses addressed
>> +// relative to the initial value of the base register, and output these
>> +// in PATS.  Return an rtx that represents the overall change to the
>> +// base register.
>> +static rtx
>> +extract_writebacks (bool load_p, rtx pats[2], int changed)
>> +{
>> +  rtx base_reg = NULL_RTX;
>> +  poly_int64 current_offset = 0;
>> +
>> +  poly_int64 offsets[2];
>> +
>> +  for (int i = 0; i < 2; i++)
>> +    {
>> +      rtx mem = XEXP (pats[i], load_p);
>> +      rtx reg = XEXP (pats[i], !load_p);
>> +
>> +      rtx addr = XEXP (mem, 0);
>> +      const bool autoinc_p = GET_RTX_CLASS (GET_CODE (addr)) == RTX_AUTOINC;
>> +
>> +      poly_int64 offset;
>> +      rtx this_base = load_strip_offset (mem, &offset);
>> +      gcc_assert (REG_P (this_base));
>> +      if (base_reg)
>> +	gcc_assert (rtx_equal_p (base_reg, this_base));
>> +      else
>> +	base_reg = this_base;
>> +
>> +      // If we changed base for the current insn, then we already
>> +      // derived the correct mem for this insn from the effective
>> +      // address of the other access.
>> +      if (i == changed)
>> +	{
>> +	  gcc_checking_assert (!autoinc_p);
>> +	  offsets[i] = offset;
>> +	  continue;
>> +	}
>> +
>> +      if (autoinc_p && any_pre_modify_p (addr))
>> +	current_offset += offset;
>> +
>> +      poly_int64 this_off = current_offset;
>> +      if (!autoinc_p)
>> +	this_off += offset;
>> +
>> +      offsets[i] = this_off;
>> +      rtx new_mem = change_address (mem, GET_MODE (mem),
>> +				    plus_constant (GET_MODE (base_reg),
>> +						   base_reg, this_off));
>> +      pats[i] = load_p
>> +	? gen_rtx_SET (reg, new_mem)
>> +	: gen_rtx_SET (new_mem, reg);
>> +
>> +      if (autoinc_p && any_post_modify_p (addr))
>> +	current_offset += offset;
>> +    }
>> +
>> +  if (known_eq (current_offset, 0))
>> +    return NULL_RTX;
>> +
>> +  return gen_rtx_SET (base_reg, plus_constant (GET_MODE (base_reg),
>> +					       base_reg, current_offset));
>> +}
>> +
>> +// We just emitted a tombstone with uid UID, track it in a bitmap for
>> +// this BB so we can easily identify it later when cleaning up tombstones.
>> +void
>> +fusion_bb_info::track_tombstone (int uid)
>> +{
>> +  if (!m_emitted_tombstone)
>> +    {
>> +      // Lazily initialize the bitmap for tracking tombstone insns.
>> +      bitmap_obstack_initialize (&m_bitmap_obstack);
>> +      bitmap_initialize (&m_tombstone_bitmap, &m_bitmap_obstack);
>> +      m_emitted_tombstone = true;
>> +    }
>> +
>> +  if (!bitmap_set_bit (&m_tombstone_bitmap, uid))
>> +    gcc_unreachable (); // Bit should have changed.
>> +}
>> +
>> +// Virtual base class for load/store walkers used in alias analysis.
>> +struct alias_walker
>> +{
>> +  virtual bool conflict_p (int &budget) const = 0;
>> +  virtual insn_info *insn () const = 0;
>> +  virtual bool valid () const  = 0;
>> +  virtual void advance () = 0;
>> +};
>> +
>> +// Implement some common functionality used by both store_walker
>> +// and load_walker.
>> +template<bool reverse>
>> +class def_walker : public alias_walker
>> +{
>> +protected:
>> +  using def_iter_t = typename std::conditional<reverse,
>> +	reverse_def_iterator, def_iterator>::type;
>> +
>> +  static use_info *start_use_chain (def_iter_t &def_iter)
>> +  {
>> +    set_info *set = nullptr;
>> +    for (; *def_iter; def_iter++)
>> +      {
>> +	set = dyn_cast<set_info *> (*def_iter);
>> +	if (!set)
>> +	  continue;
>> +
>> +	use_info *use = reverse
>> +	  ? set->last_nondebug_insn_use ()
>> +	  : set->first_nondebug_insn_use ();
>> +
>> +	if (use)
>> +	  return use;
>> +      }
>> +
>> +    return nullptr;
>> +  }
>> +
>> +  def_iter_t def_iter;
>> +  insn_info *limit;
>> +  def_walker (def_info *def, insn_info *limit) :
>> +    def_iter (def), limit (limit) {}
>> +
>> +  virtual bool iter_valid () const { return *def_iter; }
>> +
>> +public:
>> +  insn_info *insn () const override { return (*def_iter)->insn (); }
>> +  void advance () override { def_iter++; }
>> +  bool valid () const override final
>> +  {
>> +    if (!iter_valid ())
>> +      return false;
>> +
>> +    if (reverse)
>> +      return *(insn ()) > *limit;
>> +    else
>> +      return *(insn ()) < *limit;
>> +  }
>> +};
>> +
>> +// alias_walker that iterates over stores.
>> +template<bool reverse, typename InsnPredicate>
>> +class store_walker : public def_walker<reverse>
>> +{
>> +  rtx cand_mem;
>> +  InsnPredicate tombstone_p;
>> +
>> +public:
>> +  store_walker (def_info *mem_def, rtx mem, insn_info *limit_insn,
>> +		InsnPredicate tombstone_fn) :
>> +    def_walker<reverse> (mem_def, limit_insn),
>> +    cand_mem (mem), tombstone_p (tombstone_fn) {}
>> +
>> +  bool conflict_p (int &budget) const override final
>> +  {
>> +    if (tombstone_p (this->insn ()))
>> +      return false;
>> +
>> +    return store_modifies_mem_p (cand_mem, this->insn (), budget);
>> +  }
>> +};
>> +
>> +// alias_walker that iterates over loads.
>> +template<bool reverse>
>> +class load_walker : public def_walker<reverse>
>> +{
>> +  using Base = def_walker<reverse>;
>> +  using use_iter_t = typename std::conditional<reverse,
>> +	reverse_use_iterator, nondebug_insn_use_iterator>::type;
>> +
>> +  use_iter_t use_iter;
>> +  insn_info *cand_store;
>> +
>> +  bool iter_valid () const override final { return *use_iter; }
>> +
>> +public:
>> +  void advance () override final
>> +  {
>> +    use_iter++;
>> +    if (*use_iter)
>> +      return;
>> +    this->def_iter++;
>> +    use_iter = Base::start_use_chain (this->def_iter);
>> +  }
>> +
>> +  insn_info *insn () const override final
>> +  {
>> +    return (*use_iter)->insn ();
>> +  }
>> +
>> +  bool conflict_p (int &budget) const override final
>> +  {
>> +    return load_modified_by_store_p (insn (), cand_store, budget);
>> +  }
>> +
>> +  load_walker (def_info *def, insn_info *store, insn_info *limit_insn)
>> +    : Base (def, limit_insn),
>> +      use_iter (Base::start_use_chain (this->def_iter)),
>> +      cand_store (store) {}
>> +};
>> +
>> +// Given INSNS (in program order) which are known to be adjacent, look
>> +// to see if either insn has a suitable RTL (register) base that we can
>> +// use to form a pair.  Push these to BASE_CANDS if we find any.  CAND_MEMs
>> +// gives the relevant mems from the candidate insns, ACCESS_SIZE gives the
>> +// size of a single candidate access, and REVERSED says whether the accesses
>> +// are inverted in offset order.
>> +//
>> +// Returns an integer where bit (1 << i) is set if INSNS[i] uses writeback
>> +// addressing.
>> +static int
>> +get_viable_bases (insn_info *insns[2],
>> +		  vec<base_cand> &base_cands,
>> +		  rtx cand_mems[2],
>> +		  unsigned access_size,
>> +		  bool reversed)
>> +{
>> +  // We discovered this pair through a common base.  Need to ensure that
>> +  // we have a common base register that is live at both locations.
>> +  def_info *base_defs[2] = {};
>> +  int writeback = 0;
>> +  for (int i = 0; i < 2; i++)
>> +    {
>> +      const bool is_lower = (i == reversed);
>> +      poly_int64 poly_off;
>> +      rtx base = load_strip_offset (cand_mems[i], &poly_off);
>> +      if (GET_RTX_CLASS (GET_CODE (XEXP (cand_mems[i], 0))) == RTX_AUTOINC)
>> +	writeback |= (1 << i);
>> +
>> +      if (!REG_P (base) || !poly_off.is_constant ())
>> +	continue;
>> +
>> +      // Punt on accesses relative to eliminable regs.  See the comment in
>> +      // fusion_bb_info::track_access for a detailed explanation of this.
>> +      if (!reload_completed
>> +	  && (REGNO (base) == FRAME_POINTER_REGNUM
>> +	      || REGNO (base) == ARG_POINTER_REGNUM))
>> +	continue;
>> +
>> +      HOST_WIDE_INT base_off = poly_off.to_constant ();
>> +
>> +      // It should be unlikely that we ever punt here, since MEM_EXPR offset
>> +      // alignment should be a good proxy for register offset alignment.
>> +      if (base_off % access_size != 0)
>> +	{
>> +	  if (dump_file)
>> +	    fprintf (dump_file,
>> +		     "base not viable, offset misaligned (insn %d)\n",
>> +		     insns[i]->uid ());
>> +	  continue;
>> +	}
>> +
>> +      base_off /= access_size;
>> +
>> +      if (!is_lower)
>> +	base_off--;
>> +
>> +      if (base_off < LDP_MIN_IMM || base_off > LDP_MAX_IMM)
>> +	continue;
>> +
>> +      use_info *use = find_access (insns[i]->uses (), REGNO (base));
>> +      gcc_assert (use);
>> +      base_defs[i] = use->def ();
>> +    }
>> +
>> +  if (!base_defs[0] && !base_defs[1])
>> +    {
>> +      if (dump_file)
>> +	fprintf (dump_file, "no viable base register for pair (%d,%d)\n",
>> +		 insns[0]->uid (), insns[1]->uid ());
>> +      return writeback;
>> +    }
>> +
>> +  for (int i = 0; i < 2; i++)
>> +    if ((writeback & (1 << i)) && !base_defs[i])
>> +      {
>> +	if (dump_file)
>> +	  fprintf (dump_file, "insn %d has writeback but base isn't viable\n",
>> +		   insns[i]->uid ());
>> +	return writeback;
>> +      }
>> +
>> +  if (writeback == 3
>> +      && base_defs[0]->regno () != base_defs[1]->regno ())
>> +    {
>> +      if (dump_file)
>> +	fprintf (dump_file,
>> +		 "pair (%d,%d): double writeback with distinct regs (%d,%d): "
>> +		 "punting\n",
>> +		 insns[0]->uid (), insns[1]->uid (),
>> +		 base_defs[0]->regno (), base_defs[1]->regno ());
>> +      return writeback;
>> +    }
>> +
>> +  if (base_defs[0] && base_defs[1]
>> +      && base_defs[0]->regno () == base_defs[1]->regno ())
>> +    {
>> +      // Easy case: insns already share the same base reg.
>> +      base_cands.quick_push (base_defs[0]);
>> +      return writeback;
>> +    }
>> +
>> +  // Otherwise, we know that one of the bases must change.
>> +  //
>> +  // Note that if there is writeback we must use the writeback base
>> +  // (we know now there is exactly one).
>> +  for (int i = 0; i < 2; i++)
>> +    if (base_defs[i] && (!writeback || (writeback & (1 << i))))
>> +      base_cands.quick_push (base_cand { base_defs[i], i });
>> +
>> +  return writeback;
>> +}
>> +
>> +static void
>> +dump_insn_list (FILE *f, const insn_list_t &l)
>> +{
>> +  fprintf (f, "(");
>> +
>> +  auto i = l.begin ();
>> +  auto end = l.end ();
>> +
>> +  if (i != end)
>> +    fprintf (f, "%d", (*i)->uid ());
>> +  i++;
>> +
>> +  for (; i != end; i++)
>> +    fprintf (f, ", %d", (*i)->uid ());
>> +
>> +  fprintf (f, ")");
>> +}
>> +
>> +DEBUG_FUNCTION void
>> +debug (const insn_list_t &l)
>> +{
>> +  dump_insn_list (stderr, l);
>> +  fprintf (stderr, "\n");
>> +}
>> +
>> +// If we emitted tombstone insns for this BB, iterate through the BB
>> +// and remove all the tombstone insns, being sure to reparent any uses
>> +// of mem to previous defs when we do this.
>> +void
>> +fusion_bb_info::cleanup_tombstones ()
>> +{
>> +  // No need to do anything if we didn't emit a tombstone insn for this BB.
>> +  if (!m_emitted_tombstone)
>> +    return;
>> +
>> +  insn_info *insn = m_bb->head_insn ();
>> +  while (insn)
>> +    {
>> +      insn_info *next = insn->next_nondebug_insn ();
>> +      if (!insn->is_real ()
>> +	  || !bitmap_bit_p (&m_tombstone_bitmap, insn->uid ()))
>> +	{
>> +	  insn = next;
>> +	  continue;
>> +	}
>> +
>> +      auto def = memory_access (insn->defs ());
>> +      auto set = dyn_cast<set_info *> (def);
>> +      if (set && set->has_any_uses ())
>> +	{
>> +	  def_info *prev_def = def->prev_def ();
>> +	  auto prev_set = dyn_cast<set_info *> (prev_def);
>> +	  if (!prev_set)
>> +	    gcc_unreachable ();
>> +
>> +	  while (set->first_use ())
>> +	    crtl->ssa->reparent_use (set->first_use (), prev_set);
>> +	}
>> +
>> +      // Now set has no uses, we can delete it.
>> +      insn_change change (insn, insn_change::DELETE);
>> +      //df_insn_rescan (insn->rtl());
>> +      crtl->ssa->change_insn (change);
>> +      insn = next;
>> +    }
>> +}
>> +
>> +template<typename Map>
>> +void
>> +fusion_bb_info::traverse_base_map (Map &map)
>> +{
>> +  for (auto kv : map)
>> +    {
>> +      const auto &key = kv.first;
>> +      auto &value = kv.second;
>> +      transform_for_base (key.second, value);
>> +    }
>> +}
>> +
>> +void
>> +fusion_bb_info::transform ()
>> +{
>> +  traverse_base_map (expr_map);
>> +  traverse_base_map (def_map);
>> +}
>> diff --git a/gcc/testsuite/g++.target/powerpc/vecload.C b/gcc/testsuite/g++.target/powerpc/vecload.C
>> new file mode 100644
>> index 00000000000..c523572cf3c
>> --- /dev/null
>> +++ b/gcc/testsuite/g++.target/powerpc/vecload.C
>> @@ -0,0 +1,15 @@
>> +/* { dg-do compile } */ 
>> +/* { dg-require-effective-target power10_ok } */
>> +/* { dg-options "-mdejagnu-cpu=power10 -O2" } */ 
>> +
>> +#include <altivec.h>
>> +
>> +void
>> +foo (__vector_quad *dst, vector unsigned char *ptr, vector unsigned char src)
>> +{
>> +  __vector_quad acc;
>> +  __builtin_mma_xvf32ger(&acc, src, ptr[0]);
>> +  __builtin_mma_xvf32gerpp(&acc, src, ptr[1]);
>> +  *dst = acc;
>> +}
>> +/* { dg-final { scan-assembler {\mlxvp\M} } } */
>> diff --git a/gcc/testsuite/g++.target/powerpc/vecload1.C b/gcc/testsuite/g++.target/powerpc/vecload1.C
>> new file mode 100644
>> index 00000000000..d10ff0cdf36
>> --- /dev/null
>> +++ b/gcc/testsuite/g++.target/powerpc/vecload1.C
>> @@ -0,0 +1,22 @@
>> +/* { dg-do compile } */
>> +/* { dg-require-effective-target power10_ok } */
>> +/* { dg-options "-mdejagnu-cpu=power10 -O2" } */
>> +
>> +#include <altivec.h>
>> +	
>> +void
>> +foo2 ()
>> +{
>> +  __vector_quad *dst1;
>> +  __vector_quad *dst2;
>> +  vector unsigned char src;
>> +  __vector_quad acc;
>> +  vector unsigned char *ptr;
>> +  __builtin_mma_xvf32ger(&acc, src, ptr[0]);
>> +  __builtin_mma_xvf32gerpp(&acc, src, ptr[1]);
>> +  *dst1 = acc;
>> +  __builtin_mma_xvf32ger(&acc, src, ptr[2]);
>> +  __builtin_mma_xvf32gerpp(&acc, src, ptr[3]);
>> +  *dst2 = acc;
>> +}
>> +/* { dg-final { scan-assembler {\mlxvp\M} } } */
>> diff --git a/gcc/testsuite/gcc.target/powerpc/mma-builtin-1.c b/gcc/testsuite/gcc.target/powerpc/mma-builtin-1.c
>> index 69ee826e1be..ae29127f954 100644
>> --- a/gcc/testsuite/gcc.target/powerpc/mma-builtin-1.c
>> +++ b/gcc/testsuite/gcc.target/powerpc/mma-builtin-1.c
>> @@ -258,8 +258,8 @@ foo13b (__vector_quad *dst, __vector_quad *src, vec_t *vec)
>>    dst[13] = acc;
>>  }
>>  
>> -/* { dg-final { scan-assembler-times {\mlxv\M} 40 } } */
>> -/* { dg-final { scan-assembler-times {\mlxvp\M} 12 } } */
>> +/* { dg-final { scan-assembler-times {\mlxv\M} 0 } } */
>> +/* { dg-final { scan-assembler-times {\mlxvp\M} 32 } } */
>>  /* { dg-final { scan-assembler-times {\mstxvp\M} 40 } } */
>>  /* { dg-final { scan-assembler-times {\mxxmfacc\M} 20 } } */
>>  /* { dg-final { scan-assembler-times {\mxxmtacc\M} 6 } } */
>> -- 
>> 2.39.3
>>
Richard Sandiford Feb. 14, 2024, 5:15 p.m. UTC | #5
Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>> diff --git a/gcc/emit-rtl.cc b/gcc/emit-rtl.cc
>>> index 1856fa4884f..ffc47a6eaa0 100644
>>> --- a/gcc/emit-rtl.cc
>>> +++ b/gcc/emit-rtl.cc
>>> @@ -921,7 +921,7 @@ validate_subreg (machine_mode omode, machine_mode imode,
>>>      return false;
>>>  
>>>    /* The subreg offset cannot be outside the inner object.  */
>>> -  if (maybe_ge (offset, isize))
>>> +  if (maybe_gt (offset, isize))
>>>      return false;
>> 
>> Can you explain why this change is needed?
>> 
>
> This is required in rs6000 target where we generate the subreg
> with offset 16 from OO mode (256 bit) to 128 bit vector modes.
> Otherwise it segfaults.

Could you go into more detail?  Why does that subreg lead to a segfault?

In itself, a 16-byte subreg at byte offset 16 into a 32-byte pair is pretty
standard.  AArch64 uses this too for its vector load/store pairs (and for
structure pairs more generally).

Thanks,
Richard
Ajit Agarwal Feb. 14, 2024, 7:44 p.m. UTC | #6
Hello Richard:


On 14/02/24 10:45 pm, Richard Sandiford wrote:
> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>>> diff --git a/gcc/emit-rtl.cc b/gcc/emit-rtl.cc
>>>> index 1856fa4884f..ffc47a6eaa0 100644
>>>> --- a/gcc/emit-rtl.cc
>>>> +++ b/gcc/emit-rtl.cc
>>>> @@ -921,7 +921,7 @@ validate_subreg (machine_mode omode, machine_mode imode,
>>>>      return false;
>>>>  
>>>>    /* The subreg offset cannot be outside the inner object.  */
>>>> -  if (maybe_ge (offset, isize))
>>>> +  if (maybe_gt (offset, isize))
>>>>      return false;
>>>
>>> Can you explain why this change is needed?
>>>
>>
>> This is required in rs6000 target where we generate the subreg
>> with offset 16 from OO mode (256 bit) to 128 bit vector modes.
>> Otherwise it segfaults.
> 
> Could you go into more detail?  Why does that subreg lead to a segfault?
> 
> In itself, a 16-byte subreg at byte offset 16 into a 32-byte pair is pretty
> standard.  AArch64 uses this too for its vector load/store pairs (and for
> structure pairs more generally).
> 

If we want to create subreg V16QI (reg OO R) 16) imode is V16QI (isize = 16) and offset 
is 16. maybe_ge (offset, isize) return true and validate_subreg returns false;

Hence above subreg is not generated and we generate incorrect code.

Thats why I have modified to maybe_gt (offset, isize).

Thanks & Regards
Ajit
> Thanks,
> Richard
Richard Sandiford Feb. 14, 2024, 8:51 p.m. UTC | #7
Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
> Hello Richard:
>
>
> On 14/02/24 10:45 pm, Richard Sandiford wrote:
>> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>>>> diff --git a/gcc/emit-rtl.cc b/gcc/emit-rtl.cc
>>>>> index 1856fa4884f..ffc47a6eaa0 100644
>>>>> --- a/gcc/emit-rtl.cc
>>>>> +++ b/gcc/emit-rtl.cc
>>>>> @@ -921,7 +921,7 @@ validate_subreg (machine_mode omode, machine_mode imode,
>>>>>      return false;
>>>>>  
>>>>>    /* The subreg offset cannot be outside the inner object.  */
>>>>> -  if (maybe_ge (offset, isize))
>>>>> +  if (maybe_gt (offset, isize))
>>>>>      return false;
>>>>
>>>> Can you explain why this change is needed?
>>>>
>>>
>>> This is required in rs6000 target where we generate the subreg
>>> with offset 16 from OO mode (256 bit) to 128 bit vector modes.
>>> Otherwise it segfaults.
>> 
>> Could you go into more detail?  Why does that subreg lead to a segfault?
>> 
>> In itself, a 16-byte subreg at byte offset 16 into a 32-byte pair is pretty
>> standard.  AArch64 uses this too for its vector load/store pairs (and for
>> structure pairs more generally).
>> 
>
> If we want to create subreg V16QI (reg OO R) 16) imode is V16QI (isize = 16) and offset 
> is 16. maybe_ge (offset, isize) return true and validate_subreg returns false;

isize is supposed to be the size of the "inner mode", which in this
case is OO.  Since OO is a 32-bit mode, I would expect isize to be 32
rather than 16.  Is that not the case?

Or is the problem that something is trying to take a subreg of a subreg?
If so, that is only valid in certain cases.  It isn't for example valid
to use a subreg operation to move between (subreg:V16QI (reg:OO X) 16)
and (subreg:V16QI (reg:OO X) 0).

Thanks,
Richard
Ajit Agarwal Feb. 14, 2024, 9:16 p.m. UTC | #8
Hello Richard:

On 15/02/24 2:21 am, Richard Sandiford wrote:
> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>> Hello Richard:
>>
>>
>> On 14/02/24 10:45 pm, Richard Sandiford wrote:
>>> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>>>>> diff --git a/gcc/emit-rtl.cc b/gcc/emit-rtl.cc
>>>>>> index 1856fa4884f..ffc47a6eaa0 100644
>>>>>> --- a/gcc/emit-rtl.cc
>>>>>> +++ b/gcc/emit-rtl.cc
>>>>>> @@ -921,7 +921,7 @@ validate_subreg (machine_mode omode, machine_mode imode,
>>>>>>      return false;
>>>>>>  
>>>>>>    /* The subreg offset cannot be outside the inner object.  */
>>>>>> -  if (maybe_ge (offset, isize))
>>>>>> +  if (maybe_gt (offset, isize))
>>>>>>      return false;
>>>>>
>>>>> Can you explain why this change is needed?
>>>>>
>>>>
>>>> This is required in rs6000 target where we generate the subreg
>>>> with offset 16 from OO mode (256 bit) to 128 bit vector modes.
>>>> Otherwise it segfaults.
>>>
>>> Could you go into more detail?  Why does that subreg lead to a segfault?
>>>
>>> In itself, a 16-byte subreg at byte offset 16 into a 32-byte pair is pretty
>>> standard.  AArch64 uses this too for its vector load/store pairs (and for
>>> structure pairs more generally).
>>>
>>
>> If we want to create subreg V16QI (reg OO R) 16) imode is V16QI (isize = 16) and offset 
>> is 16. maybe_ge (offset, isize) return true and validate_subreg returns false;
> 
> isize is supposed to be the size of the "inner mode", which in this
> case is OO.  Since OO is a 32-bit mode, I would expect isize to be 32
> rather than 16.  Is that not the case?
> 
> Or is the problem that something is trying to take a subreg of a subreg?
> If so, that is only valid in certain cases.  It isn't for example valid
> to use a subreg operation to move between (subreg:V16QI (reg:OO X) 16)
> and (subreg:V16QI (reg:OO X) 0).
>

The above changes are not required. emit-rtl.cc changes are not required anymore
as I have fixed in rs6000 target fusion code while fixing the modes
of src and dest as same for SET rtx as you have suggested for REG_UNUSED
issues.

Thanks a lot for your help.

Thanks & Regards
Ajit
 
> Thanks,
> Richard
diff mbox series

Patch

diff --git a/gcc/config.gcc b/gcc/config.gcc
index 00355509c92..9bff42cf830 100644
--- a/gcc/config.gcc
+++ b/gcc/config.gcc
@@ -518,7 +518,7 @@  or1k*-*-*)
 	;;
 powerpc*-*-*)
 	cpu_type=rs6000
-	extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o"
+	extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o rs6000-vecload-opt.o"
 	extra_objs="${extra_objs} rs6000-call.o rs6000-pcrel-opt.o"
 	extra_objs="${extra_objs} rs6000-builtins.o rs6000-builtin.o"
 	extra_headers="ppc-asm.h altivec.h htmintrin.h htmxlintrin.h"
@@ -555,7 +555,7 @@  riscv*)
 	;;
 rs6000*-*-*)
 	extra_options="${extra_options} g.opt fused-madd.opt rs6000/rs6000-tables.opt"
-	extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o"
+	extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o rs6000-vecload-opt.o"
 	extra_objs="${extra_objs} rs6000-call.o rs6000-pcrel-opt.o"
 	target_gtfiles="$target_gtfiles \$(srcdir)/config/rs6000/rs6000-logue.cc \$(srcdir)/config/rs6000/rs6000-call.cc"
 	target_gtfiles="$target_gtfiles \$(srcdir)/config/rs6000/rs6000-pcrel-opt.cc"
diff --git a/gcc/config/rs6000/rs6000-passes.def b/gcc/config/rs6000/rs6000-passes.def
index 46a0d0b8c56..eb4a65ebe10 100644
--- a/gcc/config/rs6000/rs6000-passes.def
+++ b/gcc/config/rs6000/rs6000-passes.def
@@ -28,6 +28,9 @@  along with GCC; see the file COPYING3.  If not see
      The power8 does not have instructions that automaticaly do the byte swaps
      for loads and stores.  */
   INSERT_PASS_BEFORE (pass_cse, 1, pass_analyze_swaps);
+  /* Pass to replace adjacent memory addresses lxv instruction with lxvp
+     instruction.  */
+  INSERT_PASS_BEFORE (pass_early_remat, 1, pass_analyze_vecload);
 
   /* Pass to do the PCREL_OPT optimization that combines the load of an
      external symbol's address along with a single load or store using that
diff --git a/gcc/config/rs6000/rs6000-protos.h b/gcc/config/rs6000/rs6000-protos.h
index 09a57a806fa..f0a9f36602e 100644
--- a/gcc/config/rs6000/rs6000-protos.h
+++ b/gcc/config/rs6000/rs6000-protos.h
@@ -343,6 +343,7 @@  namespace gcc { class context; }
 class rtl_opt_pass;
 
 extern rtl_opt_pass *make_pass_analyze_swaps (gcc::context *);
+extern rtl_opt_pass *make_pass_analyze_vecload (gcc::context *);
 extern rtl_opt_pass *make_pass_pcrel_opt (gcc::context *);
 extern bool rs6000_sum_of_two_registers_p (const_rtx expr);
 extern bool rs6000_quadword_masked_address_p (const_rtx exp);
diff --git a/gcc/config/rs6000/rs6000-vecload-opt.cc b/gcc/config/rs6000/rs6000-vecload-opt.cc
new file mode 100644
index 00000000000..cebadad97d8
--- /dev/null
+++ b/gcc/config/rs6000/rs6000-vecload-opt.cc
@@ -0,0 +1,1186 @@ 
+/* Subroutines used to replace lxv with lxvp
+   for TARGET_POWER10 and TARGET_VSX,
+
+   Copyright (C) 2020-2023 Free Software Foundation, Inc.
+   Contributed by Ajit Kumar Agarwal <aagarwa1@linux.ibm.com>.
+
+   This file is part of GCC.
+
+   GCC is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published
+   by the Free Software Foundation; either version 3, or (at your
+   option) any later version.
+
+   GCC is distributed in the hope that it will be useful, but WITHOUT
+   ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
+   or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public
+   License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with GCC; see the file COPYING3.  If not see
+   <http://www.gnu.org/licenses/>.  */
+
+#define IN_TARGET_CODE 1
+#define INCLUDE_ALGORITHM
+#define INCLUDE_FUNCTIONAL
+#define INCLUDE_LIST
+#define INCLUDE_TYPE_TRAITS
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "backend.h"
+#include "target.h"
+#include "rtl.h"
+#include "memmodel.h"
+#include "emit-rtl.h"
+#include "tree-pass.h"
+#include "df.h"
+#include "dumpfile.h"
+#include "rs6000-internal.h"
+#include "rs6000-protos.h"
+#include "fusion-common.h"
+
+/* Return false if dependent rtx LOC is SUBREG.  */
+static bool
+is_feasible (rtx_insn *insn)
+{
+  df_ref use;
+  df_insn_info *insn_info = DF_INSN_INFO_GET (insn);
+  FOR_EACH_INSN_INFO_DEF (use, insn_info)
+    {
+      struct df_link *def_link = DF_REF_CHAIN (use);
+      if (!def_link || !def_link->ref || DF_REF_IS_ARTIFICIAL (def_link->ref))
+	continue;
+      while (def_link && def_link->ref)
+	{
+	  rtx *loc = DF_REF_LOC (def_link->ref);
+	  if (!loc || *loc == NULL_RTX)
+	    return false;
+	  if (GET_CODE (*loc) == SUBREG)
+	    return false;
+	  def_link = def_link->next;
+	}
+     }
+  return true;
+}
+
+/* df_scan_rescan the unspec instruction where operands
+   are reversed.  */
+void set_rescan_for_unspec (rtx_insn *insn)
+{
+  df_ref use;
+  df_insn_info *insn_info = DF_INSN_INFO_GET (insn);
+  rtx_insn *select_insn2;
+  FOR_EACH_INSN_INFO_DEF (use, insn_info)
+    {
+      struct df_link *def_link = DF_REF_CHAIN (use);
+      while (def_link && def_link->ref)
+	{
+	  select_insn2 = DF_REF_INSN (def_link->ref);
+	  rtx set = single_set (select_insn2);
+
+	  if (set == NULL_RTX)
+	    return;
+
+	  if (set != NULL_RTX)
+	    {
+	      rtx op0 = SET_SRC (set);
+	      if (GET_CODE (op0) != UNSPEC)
+		return;
+
+	      if (GET_CODE (op0) == VEC_SELECT
+		  && GET_CODE (XEXP (op0, 1)) == PARALLEL)
+		return;
+
+	      if (GET_CODE (op0) == UNSPEC)
+		df_insn_rescan (select_insn2);
+	    }
+	   def_link = def_link->next;
+	}
+     }
+}
+
+/* Return dependent UNSPEC instruction.  */
+rtx_insn *get_rtx_UNSPEC (rtx_insn *insn)
+{
+  df_ref use;
+  df_insn_info *insn_info = DF_INSN_INFO_GET (insn);
+  rtx_insn *select_insn2;
+  FOR_EACH_INSN_INFO_DEF (use, insn_info)
+    {
+      struct df_link *def_link = DF_REF_CHAIN (use);
+      while (def_link && def_link->ref)
+	{
+	  select_insn2 = DF_REF_INSN (def_link->ref);
+	  rtx set = single_set (select_insn2);
+
+	  if (set == NULL_RTX)
+	    return 0;
+
+	  if (set != NULL_RTX)
+	    {
+	      rtx op0 = SET_SRC (set);
+
+	      if (GET_CODE (op0) == UNSPEC)
+		return select_insn2;
+	    }
+	   def_link = def_link->next;
+	}
+     }
+  return 0;
+}
+
+/* Replace identified lxv with lxvp.
+   Bail out if following condition are true:
+
+   - dependent instruction of load is vec_select instruction,
+
+   - machine mode of unspec is not same as machine mode
+     of lxv instruction.
+
+   - dependent instruction is not unspec.
+
+   - Source operand of unspec is eq instruction.  */
+
+rtx
+rs6000_gen_load_pair (rtx_insn *insn1, rtx_insn *insn2)
+{
+  rtx body = PATTERN (insn1);
+  rtx src_exp = SET_SRC (body);
+  rtx dest_exp = SET_DEST (body);
+  rtx lxv;
+  rtx insn2_body = PATTERN (insn2);
+
+  rtx insn2_src_exp = SET_SRC (insn2_body);
+
+  if (GET_MODE (src_exp) != GET_MODE (SET_SRC (insn2_body)))
+    return NULL_RTX;
+
+  if (GET_MODE (dest_exp) == TImode)
+    return NULL_RTX;
+
+  if (!ALTIVEC_OR_VSX_VECTOR_MODE (GET_MODE (dest_exp)))
+    return NULL_RTX;
+
+  if (!is_feasible (insn1))
+    return NULL_RTX;
+
+  if (!is_feasible (insn2))
+    return NULL_RTX;
+
+  for (rtx note = REG_NOTES (insn1); note; note = XEXP (note, 1))
+    if (REG_NOTE_KIND (note) == REG_EQUAL
+	|| REG_NOTE_KIND (note) == REG_EQUIV)
+      return NULL_RTX;
+
+  int no_dep = 0;
+  df_ref  use;
+  df_insn_info *insn_info = DF_INSN_INFO_GET (insn1);
+  rtx_insn *select_insn2;
+
+  FOR_EACH_INSN_INFO_DEF (use, insn_info)
+    {
+      struct df_link *def_link = DF_REF_CHAIN (use);
+      while (def_link && def_link->ref)
+	{
+	  select_insn2 = DF_REF_INSN (def_link->ref);
+	  rtx set = single_set (select_insn2);
+
+	  if (set == NULL_RTX)
+	    return NULL_RTX;
+
+	  if (set != NULL_RTX)
+	    {
+	      rtx op0 = SET_SRC (set);
+
+	      if (GET_CODE (op0) != UNSPEC)
+		return NULL_RTX;
+
+	      if (GET_CODE (op0) == VEC_SELECT
+		  && GET_CODE (XEXP (op0, 1)) == PARALLEL)
+		return NULL_RTX;
+
+	      if (GET_CODE (op0) == UNSPEC)
+		{
+		  if (GET_MODE (op0) != XOmode
+		      && GET_MODE (op0) != GET_MODE (dest_exp))
+		    return NULL_RTX;
+
+		  int nvecs = XVECLEN (op0, 0);
+		  for (int i = 0; i < nvecs; i++)
+		    {
+		      rtx op;
+		      op = XVECEXP (op0, 0, i);
+
+		      if (GET_MODE (op) == OOmode)
+			return NULL_RTX;
+		      if (GET_CODE (op) == EQ)
+			return NULL_RTX;
+		    }
+		}
+	       ++no_dep;
+	     }
+	   def_link = def_link->next;
+	}
+     }
+
+  rtx_insn *insn = get_rtx_UNSPEC (insn1);
+
+  if (insn && insn == get_rtx_UNSPEC (insn2) && no_dep == 1)
+    return NULL_RTX;
+
+  int regoff;
+  rtx src;
+  rtx addr = XEXP (src_exp, 0);
+
+  if (GET_CODE (addr) == PLUS
+      && XEXP (addr, 1) && CONST_INT_P (XEXP (addr, 1)))
+    {
+      regoff = 0;
+      src = simplify_gen_subreg (GET_MODE (dest_exp),
+				 dest_exp, GET_MODE (dest_exp),
+				 regoff);
+    }
+  else
+    {
+      regoff = INTVAL (CONST0_RTX (SImode)) *
+		       GET_MODE_SIZE (GET_MODE (dest_exp));
+      src = simplify_gen_subreg (GET_MODE (dest_exp),
+				 dest_exp, GET_MODE (dest_exp),
+				 regoff);
+    }
+  insn_info = DF_INSN_INFO_GET (insn2);
+  FOR_EACH_INSN_INFO_DEF (use, insn_info)
+    {
+      struct df_link *def_link = DF_REF_CHAIN (use);
+      if (!def_link || !def_link->ref || DF_REF_IS_ARTIFICIAL (def_link->ref))
+	continue;
+      while (def_link && def_link->ref)
+	{
+	  rtx *loc = DF_REF_LOC (def_link->ref);
+	  *loc =  src;
+	  def_link = def_link->next;
+	}
+     }
+
+  int regoff1;
+  rtx src1;
+  addr = XEXP (insn2_src_exp, 0);
+
+  if (GET_CODE (addr) == PLUS
+      && XEXP (addr, 1)
+      && CONST_INT_P (XEXP(addr, 1)))
+    {
+      regoff1 = 16;
+      src1 = simplify_gen_subreg (GET_MODE (dest_exp),
+				  dest_exp, GET_MODE (dest_exp),
+				  regoff1);
+    }
+  else
+    {
+      regoff1 = INTVAL (CONST0_RTX (SImode)) * GET_MODE_SIZE (GET_MODE (dest_exp));//V16QImode);
+      src1 = simplify_gen_subreg (GET_MODE (dest_exp),
+				  dest_exp, GET_MODE (dest_exp),
+				  regoff1);
+    }
+
+  insn_info = DF_INSN_INFO_GET (insn1);
+  FOR_EACH_INSN_INFO_DEF (use, insn_info)
+    {
+      struct df_link *def_link = DF_REF_CHAIN (use);
+      if (!def_link || !def_link->ref || DF_REF_IS_ARTIFICIAL (def_link->ref))
+	continue;
+      while (def_link && def_link->ref)
+	{
+	  rtx *loc = DF_REF_LOC (def_link->ref);
+	  PUT_MODE_RAW (*loc, OOmode);
+	  *loc = src1;
+	  def_link = def_link->next;
+	}
+     }
+
+  set_rescan_for_unspec (insn1);
+  set_rescan_for_unspec (insn2);
+  df_insn_rescan (insn1);
+  df_insn_rescan (insn2);
+
+  PUT_MODE_RAW (src_exp, OOmode);
+  PUT_MODE_RAW (dest_exp, OOmode);
+  lxv = gen_movoo (dest_exp, src_exp);
+  rtx_insn *new_insn = emit_insn_before (lxv,  insn1);
+  set_block_for_insn (new_insn, BLOCK_FOR_INSN (insn1));
+  df_insn_rescan (new_insn);
+
+  if (dump_file)
+    {
+      unsigned int new_uid = INSN_UID (new_insn);
+      fprintf (dump_file, "Replacing lxv %d with lxvp  %d\n",
+			  INSN_UID (insn1), new_uid);
+      print_rtl_single (dump_file, new_insn);
+      print_rtl_single (dump_file, insn1);
+      print_rtl_single (dump_file, insn2);
+
+    }
+
+  df_insn_delete (insn1);
+  remove_insn (insn1);
+  df_insn_delete (insn2);
+  remove_insn (insn2);
+  insn1->set_deleted ();
+  insn2->set_deleted ();
+  return lxv;
+}
+
+// LEFT_LIST and RIGHT_LIST are lists of candidate instructions where all insns
+// in LEFT_LIST are known to be adjacent to those in RIGHT_LIST.
+//
+// This function traverses the resulting 2D matrix of possible pair candidates
+// and attempts to merge them into pairs.
+//
+// The algorithm is straightforward: if we consider a combined list of
+// candidates X obtained by merging LEFT_LIST and RIGHT_LIST in program order,
+// then we advance through X until we reach a crossing point (where X[i] and
+// X[i+1] come from different source lists).
+//
+// At this point we know X[i] and X[i+1] are adjacent accesses, and we try to
+// fuse them into a pair.  If this succeeds, we remove X[i] and X[i+1] from
+// their original lists and continue as above.
+//
+// In the failure case, we advance through the source list containing X[i] and
+// continue as above (proceeding to the next crossing point).
+//
+// The rationale for skipping over groups of consecutive candidates from the
+// same source list is as follows:
+//
+// In the store case, the insns in the group can't be re-ordered over each
+// other as they are guaranteed to store to the same location, so we're
+// guaranteed not to lose opportunities by doing this.
+//
+// In the load case, subsequent loads from the same location are either
+// redundant (in which case they should have been cleaned up by an earlier
+// optimization pass) or there is an intervening aliasing hazard, in which case
+// we can't re-order them anyway, so provided earlier passes have cleaned up
+// redundant loads, we shouldn't miss opportunities by doing this.
+void
+fusion_bb_info::merge_pairs (insn_list_t &left_list,
+			  insn_list_t &right_list,
+			  bool load_p,
+			  unsigned access_size)
+{
+  if (dump_file)
+    {
+      fprintf (dump_file, "merge_pairs [L=%d], cand vecs ", load_p);
+      dump_insn_list (dump_file, left_list);
+      fprintf (dump_file, " x ");
+      dump_insn_list (dump_file, right_list);
+      fprintf (dump_file, "\n");
+    }
+
+  auto iter_l = left_list.begin ();
+  auto iter_r = right_list.begin ();
+  while (iter_l != left_list.end () && iter_r != right_list.end ())
+    {
+      auto left = *iter_l;
+      auto right = *iter_r;
+
+      auto next_l = std::next (iter_l);
+      auto next_r = std::next (iter_r);
+
+      if ((*left) < (*right)
+	  && next_l != (left_list.end ())
+	  && **next_l < *right)
+	iter_l = next_l;
+      else if ((*right) < (*left)
+	       && next_r != (right_list.end ())
+	       && **next_r < *left)
+	iter_r = next_r;
+      else if (try_fuse_pair (load_p, access_size, *iter_l, *iter_r)) //left, right))
+	{
+	  left_list.erase (iter_l);
+	  iter_l = next_l;
+	  right_list.erase (iter_r);
+	  iter_r = next_r;
+	}
+      else if (*left  < *right)
+	iter_l = next_l;
+      else
+	iter_r = next_r;
+    }
+ 
+}
+
+
+
+// Main function to begin pair discovery.  Given a memory access INSN,
+// determine whether it could be a candidate for fusing into an lxvp/stvxp,
+// and if so, track it in the appropriate data structure for this basic
+// block.  LOAD_P is true if the access is a load, and MEM is the mem
+// rtx that occurs in INSN.
+void
+fusion_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
+{
+  // We can't combine volatile MEMs, so punt on these.
+  if (MEM_VOLATILE_P (mem))
+    return;
+
+  // Ignore writeback accesses if the param says to do so.
+  if (GET_RTX_CLASS (GET_CODE (XEXP (mem, 0))) == RTX_AUTOINC)
+    return;
+
+  const machine_mode mem_mode = GET_MODE (mem);
+
+  rtx reg_op = XEXP (PATTERN (insn->rtl ()), !load_p);
+
+  if (!ALTIVEC_OR_VSX_VECTOR_MODE (GET_MODE (reg_op)))
+    return;
+
+  const bool fpsimd_op_p = (ALTIVEC_OR_VSX_VECTOR_MODE (GET_MODE (reg_op)));
+
+  const HOST_WIDE_INT mem_size = GET_MODE_SIZE (mem_mode);
+  const lfs_fields lfs = { load_p, fpsimd_op_p, mem_size };
+
+  if (track_via_mem_expr (insn, mem, lfs))
+    return;
+
+  poly_int64 mem_off;
+  rtx addr = XEXP (mem, 0);
+  const bool autoinc_p = GET_RTX_CLASS (GET_CODE (addr)) == RTX_AUTOINC;
+  rtx base = load_strip_offset (mem, &mem_off);
+  if (!REG_P (base))
+    return;
+
+  // Need to calculate two (possibly different) offsets:
+  //  - Offset at which the access occurs.
+  //  - Offset of the new base def.
+  poly_int64 access_off;
+  if (autoinc_p && any_post_modify_p (addr))
+    access_off = 0;
+  else
+    access_off = mem_off;
+
+  poly_int64 new_def_off = mem_off;
+
+  // Punt on accesses relative to eliminable regs.  Since we don't know the
+  // elimination offset pre-RA, we should postpone forming pairs on such
+  // accesses until after RA.
+  //
+  // As it stands, addresses with offsets in range for LDR but not
+  // in range for LDP/STP are currently reloaded inefficiently,
+  // ending up with a separate base register for each pair.
+  //
+  // In theory LRA should make use of
+  // targetm.legitimize_address_displacement to promote sharing of
+  // bases among multiple (nearby) address reloads, but the current
+  // LRA code returns early from process_address_1 for operands that
+  // satisfy "m", even if they don't satisfy the real (relaxed) address
+  // constraint; this early return means we never get to the code
+  // that calls targetm.legitimize_address_displacement.
+  //
+  // So for now, it's better to punt when we can't be sure that the
+  // offset is in range for LDP/STP.  Out-of-range cases can then be
+  // handled after RA by the out-of-range LDP/STP peepholes.  Eventually, it
+  // would be nice to handle known out-of-range opportunities in the
+  // pass itself (for stack accesses, this would be in the post-RA pass).
+  if (!reload_completed
+      && (REGNO (base) == FRAME_POINTER_REGNUM
+	  || REGNO (base) == ARG_POINTER_REGNUM))
+    return;
+
+  // Now need to find def of base register.
+  use_info *base_use = find_access (insn->uses (), REGNO (base));
+  gcc_assert (base_use);
+  def_info *base_def = base_use->def ();
+  if (!base_def)
+    {
+      if (dump_file)
+	fprintf (dump_file,
+		 "base register (regno %d) of insn %d is undefined",
+		 REGNO (base), insn->uid ());
+      return;
+    }
+
+  alt_base *canon_base = canon_base_map.get (base_def);
+  if (canon_base)
+    {
+      // Express this as the combined offset from the canonical base.
+      base_def = canon_base->base;
+      new_def_off += canon_base->offset;
+      access_off += canon_base->offset;
+    }
+
+  if (autoinc_p)
+    {
+      auto def = find_access (insn->defs (), REGNO (base));
+      gcc_assert (def);
+
+      // Record that DEF = BASE_DEF + MEM_OFF.
+      if (dump_file)
+	{
+	  pretty_printer pp;
+	  pp_access (&pp, def, 0);
+	  pp_string (&pp, " = ");
+	  pp_access (&pp, base_def, 0);
+	  fprintf (dump_file, "[bb %u] recording %s + ",
+		   m_bb->index (), pp_formatted_text (&pp));
+	  print_dec (new_def_off, dump_file);
+	  fprintf (dump_file, "\n");
+	}
+
+      alt_base base_rec { base_def, new_def_off };
+      if (canon_base_map.put (def, base_rec))
+	gcc_unreachable (); // Base defs should be unique.
+    }
+
+  // Punt on misaligned offsets.  LDP/STP require offsets to be a multiple of
+  // the access size.
+  if (!multiple_p (mem_off, mem_size))
+    return;
+
+  const auto key = std::make_pair (base_def, encode_lfs (lfs));
+  access_group &group = def_map.get_or_insert (key, NULL);
+  auto alloc = [&](access_record *access) { return node_alloc (access); };
+  group.track (alloc, access_off, insn);
+  if (dump_file)
+    {
+      pretty_printer pp;
+      pp_access (&pp, base_def, 0);
+
+      fprintf (dump_file, "[bb %u] tracking insn %d via %s",
+	       m_bb->index (), insn->uid (), pp_formatted_text (&pp));
+      fprintf (dump_file,
+	       " [L=%d, WB=%d, FP=%d, %smode, off=",
+	       lfs.load_p, autoinc_p, lfs.fpsimd_p, mode_name[mem_mode]);
+      print_dec (access_off, dump_file);
+      fprintf (dump_file, "]\n");
+    }
+}
+
+
+// Given two adjacent memory accesses of the same size, I1 and I2, try
+// and see if we can merge them into a lxvp or stvxp.
+//
+// ACCESS_SIZE gives the (common) size of a single access, LOAD_P is true
+// if the accesses are both loads, otherwise they are both stores.
+bool
+fusion_bb_info::try_fuse_pair (bool load_p, unsigned access_size,
+			    insn_info *i1, insn_info *i2)
+{
+  if (dump_file)
+    fprintf (dump_file, "analyzing pair (load=%d): (%d,%d)\n",
+	     load_p, i1->uid (), i2->uid ());
+
+  insn_info *insns[2];
+  bool reversed = false;
+  if (*i1 < *i2)
+    {
+      insns[0] = i1;
+      insns[1] = i2;
+    }
+  else
+    {
+      insns[0] = i2;
+      insns[1] = i1;
+      reversed = true;
+    }
+
+  rtx cand_mems[2];
+  rtx reg_ops[2];
+  rtx pats[2];
+  for (int i = 0; i < 2; i++)
+    {
+      pats[i] = PATTERN (insns[i]->rtl ());
+      cand_mems[i] = XEXP (pats[i], load_p);
+      reg_ops[i] = XEXP (pats[i], !load_p);
+    }
+
+  if (load_p && reg_overlap_mentioned_p (reg_ops[0], reg_ops[1]))
+    {
+      if (dump_file)
+	fprintf (dump_file,
+		 "punting on lxvp due to reg conflcits (%d,%d)\n",
+		 insns[0]->uid (), insns[1]->uid ());
+      return false;
+    }
+
+  if (cfun->can_throw_non_call_exceptions
+      && find_reg_note (insns[0]->rtl (), REG_EH_REGION, NULL_RTX)
+      && find_reg_note (insns[1]->rtl (), REG_EH_REGION, NULL_RTX))
+    {
+      if (dump_file)
+	fprintf (dump_file,
+		 "can't combine insns with EH side effects (%d,%d)\n",
+		 insns[0]->uid (), insns[1]->uid ());
+      return false;
+    }
+
+  auto_vec<base_cand, 2> base_cands (2);
+
+  int writeback = get_viable_bases (insns, base_cands, cand_mems,
+				    access_size, reversed);
+  if (base_cands.is_empty ())
+    {
+      if (dump_file)
+	fprintf (dump_file, "no viable base for pair (%d,%d)\n",
+		 insns[0]->uid (), insns[1]->uid ());
+      return false;
+    }
+
+  // Punt on frame-related insns with writeback.  We probably won't see
+  // these in practice, but this is conservative and ensures we don't
+  // have to worry about these later on.
+  if (writeback && (RTX_FRAME_RELATED_P (i1->rtl ())
+		    || RTX_FRAME_RELATED_P (i2->rtl ())))
+    {
+      if (dump_file)
+	fprintf (dump_file,
+		 "rejecting pair (%d,%d): frame-related insn with writeback\n",
+		 i1->uid (), i2->uid ());
+      return false;
+    }
+
+  rtx *ignore = &XEXP (pats[1], load_p);
+  for (auto use : insns[1]->uses ())
+    if (!use->is_mem ()
+	&& refers_to_regno_p (use->regno (), use->regno () + 1, pats[1], ignore)
+	&& use->def () && use->def ()->insn () == insns[0])
+      {
+	// N.B. we allow a true dependence on the base address, as this
+	// happens in the case of auto-inc accesses.  Consider a post-increment
+	// load followed by a regular indexed load, for example.
+	if (dump_file)
+	  fprintf (dump_file,
+		   "%d has non-address true dependence on %d, rejecting pair\n",
+		   insns[1]->uid (), insns[0]->uid ());
+	return false;
+      }
+
+  unsigned i = 0;
+  while (i < base_cands.length ())
+    {
+      base_cand &cand = base_cands[i];
+
+      rtx *ignore[2] = {};
+      for (int j = 0; j < 2; j++)
+	if (cand.from_insn == !j)
+	  ignore[j] = &XEXP (cand_mems[j], 0);
+
+      insn_info *h = first_hazard_after (insns[0], ignore[0]);
+      if (h && *h <= *insns[1])
+	cand.hazards[0] = h;
+
+      h = latest_hazard_before (insns[1], ignore[1]);
+      if (h && *h >= *insns[0])
+	cand.hazards[1] = h;
+
+      if (!cand.viable ())
+	{
+	  if (dump_file)
+	    fprintf (dump_file,
+		     "pair (%d,%d): rejecting base %d due to dataflow "
+		     "hazards (%d,%d)\n",
+		     insns[0]->uid (),
+		     insns[1]->uid (),
+		     cand.def->regno (),
+		     cand.hazards[0]->uid (),
+		     cand.hazards[1]->uid ());
+
+	  base_cands.ordered_remove (i);
+	}
+      else
+	i++;
+    }
+
+  if (base_cands.is_empty ())
+    {
+      if (dump_file)
+	fprintf (dump_file,
+		 "can't form pair (%d,%d) due to dataflow hazards\n",
+		 insns[0]->uid (), insns[1]->uid ());
+      return false;
+    }
+
+  //insn_info *alias_hazards[4] = {};
+
+  // First def of memory after the first insn, and last def of memory
+  // before the second insn, respectively.
+  def_info *mem_defs[2] = {};
+  if (load_p)
+    {
+      if (!MEM_READONLY_P (cand_mems[0]))
+	{
+	  mem_defs[0] = memory_access (insns[0]->uses ())->def ();
+	  gcc_checking_assert (mem_defs[0]);
+	  mem_defs[0] = mem_defs[0]->next_def ();
+	}
+      if (!MEM_READONLY_P (cand_mems[1]))
+	{
+	  mem_defs[1] = memory_access (insns[1]->uses ())->def ();
+	  gcc_checking_assert (mem_defs[1]);
+	}
+    }
+  else
+    {
+      mem_defs[0] = memory_access (insns[0]->defs ())->next_def ();
+      mem_defs[1] = memory_access (insns[1]->defs ())->prev_def ();
+      gcc_checking_assert (mem_defs[0]);
+      gcc_checking_assert (mem_defs[1]);
+    }
+
+  //auto tombstone_p = [&](insn_info *insn) -> bool {
+   // return m_emitted_tombstone
+//	   && bitmap_bit_p (&m_tombstone_bitmap, insn->uid ());
+ // };
+
+  if (base_cands.is_empty ())
+    {
+      if (dump_file)
+	fprintf (dump_file,
+		 "cannot form pair (%d,%d) due to alias/dataflow hazards",
+		 insns[0]->uid (), insns[1]->uid ());
+
+      return false;
+    }
+
+  base_cand *base = &base_cands[0];
+  insn_range_info range (insns[0], insns[1]);
+  // If the second insn can throw, narrow the move range to exactly that insn.
+  // This prevents us trying to move the second insn from the end of the BB.
+  if (cfun->can_throw_non_call_exceptions
+      && find_reg_note (insns[1]->rtl (), REG_EH_REGION, NULL_RTX))
+    {
+      gcc_assert (range.includes (insns[1]));
+      range = insn_range_info (insns[1]);
+    }
+
+  // Placement strategy: push loads down and pull stores up, this should
+  // help register pressure by reducing live ranges.
+  if (load_p)
+    range.first = range.last;
+  else
+    range.last = range.first;
+  return fuse_pair (load_p, access_size, writeback,
+		    i1, i2, *base, range);
+}
+// Try and actually fuse the pair given by insns I1 and I2.
+//
+// Here we've done enough analysis to know this is safe, we only
+// reject the pair at this stage if either the tuning policy says to,
+// or recog fails on the final pair insn.
+//
+// LOAD_P is true for loads, ACCESS_SIZE gives the access size of each
+// candidate insn.  Bit i of WRITEBACK is set if the ith insn (in program
+// order) uses writeback.
+//
+// BASE gives the chosen base candidate for the pair and MOVE_RANGE is
+// a singleton range which says where to place the pair.
+bool
+fusion_bb_info::fuse_pair (bool load_p,
+			unsigned access_size,
+			int writeback,
+			insn_info *i1, insn_info *i2,
+			base_cand &base,
+			const insn_range_info &move_range)
+{
+  auto attempt = crtl->ssa->new_change_attempt ();
+
+  auto make_change = [&attempt](insn_info *insn)
+    {
+      return crtl->ssa->change_alloc<insn_change> (attempt, insn);
+    };
+  auto make_delete = [&attempt](insn_info *insn)
+    {
+      return crtl->ssa->change_alloc<insn_change> (attempt,
+						   insn,
+						   insn_change::DELETE);
+    };
+
+  if (*i1 > *i2)
+    return false;
+
+  insn_info *first = (*i1 < *i2) ? i1 : i2;
+  insn_info *second = (first == i1) ? i2 : i1;
+
+  insn_info *insns[2] = { first, second };
+
+  auto_vec<insn_change *, 4> changes (4);
+  auto_vec<int, 2> tombstone_uids (2);
+
+  rtx pats[2] = {
+    PATTERN (first->rtl ()),
+    PATTERN (second->rtl ())
+  };
+
+  use_array input_uses[2] = { first->uses (), second->uses () };
+  def_array input_defs[2] = { first->defs (), second->defs () };
+
+  int changed_insn = -1;
+  if (base.from_insn != -1)
+    {
+      // If we're not already using a shared base, we need
+      // to re-write one of the accesses to use the base from
+      // the other insn.
+      gcc_checking_assert (base.from_insn == 0 || base.from_insn == 1);
+      changed_insn = !base.from_insn;
+
+      rtx base_pat = pats[base.from_insn];
+      rtx change_pat = pats[changed_insn];
+      rtx base_mem = XEXP (base_pat, load_p);
+      rtx change_mem = XEXP (change_pat, load_p);
+
+      const bool lower_base_p = (insns[base.from_insn] == i1);
+      HOST_WIDE_INT adjust_amt = access_size;
+      if (!lower_base_p)
+	adjust_amt *= -1;
+
+      rtx change_reg = XEXP (change_pat, !load_p);
+      machine_mode mode_for_mem = GET_MODE (change_mem);
+      rtx effective_base = drop_writeback (base_mem);
+      rtx new_mem = adjust_address_nv (effective_base,
+				       mode_for_mem,
+				       adjust_amt);
+      rtx new_set = load_p
+	? gen_rtx_SET (change_reg, new_mem)
+	: gen_rtx_SET (new_mem, change_reg);
+
+      pats[changed_insn] = new_set;
+
+      auto keep_use = [&](use_info *u)
+	{
+	  return refers_to_regno_p (u->regno (), u->regno () + 1,
+				    change_pat, &XEXP (change_pat, load_p));
+	};
+
+      // Drop any uses that only occur in the old address.
+      input_uses[changed_insn] = filter_accesses (attempt,
+						  input_uses[changed_insn],
+						  keep_use);
+    }
+
+  rtx writeback_effect = NULL_RTX;
+  if (writeback)
+    writeback_effect = extract_writebacks (load_p, pats, changed_insn);
+
+  const auto base_regno = base.def->regno ();
+
+  if (base.from_insn == -1 && (writeback & 1))
+    {
+      // If the first of the candidate insns had a writeback form, we'll need to
+      // drop the use of the updated base register from the second insn's uses.
+      //
+      // N.B. we needn't worry about the base register occurring as a store
+      // operand, as we checked that there was no non-address true dependence
+      // between the insns in try_fuse_pair.
+      gcc_checking_assert (find_access (input_uses[1], base_regno));
+      input_uses[1] = check_remove_regno_access (attempt,
+						 input_uses[1],
+						 base_regno);
+    }
+
+  // Go through and drop uses that only occur in register notes,
+  // as we won't be preserving those.
+  for (int i = 0; i < 2; i++)
+    {
+      auto rti = insns[i]->rtl ();
+      if (!REG_NOTES (rti))
+	continue;
+
+      input_uses[i] = remove_note_accesses (attempt, input_uses[i]);
+    }
+
+  // So far the patterns have been in instruction order,
+  // now we want them in offset order.
+  if (i1 != first)
+    std::swap (pats[0], pats[1]);
+
+  poly_int64 offsets[2];
+  for (int i = 0; i < 2; i++)
+    {
+      rtx mem = XEXP (pats[i], load_p);
+      gcc_checking_assert (MEM_P (mem));
+      rtx base = strip_offset (XEXP (mem, 0), offsets + i);
+      gcc_checking_assert (REG_P (base));
+      gcc_checking_assert (base_regno == REGNO (base));
+    }
+
+  // If either of the original insns had writeback, but the resulting pair insn
+  // does not (can happen e.g. in the lxvp edge case above, or if the writeback
+  // effects cancel out), then drop the def(s) of the base register as
+  // appropriate.
+  //
+  // Also drop the first def in the case that both of the original insns had
+  // writeback.  The second def could well have uses, but the first def should
+  // only be used by the second insn (and we dropped that use above).
+  for (int i = 0; i < 2; i++)
+    if ((!writeback_effect && (writeback & (1 << i)))
+	|| (i == 0 && writeback == 3))
+      input_defs[i] = check_remove_regno_access (attempt,
+						 input_defs[i],
+						 base_regno);
+
+  // If we don't currently have a writeback pair, and we don't have
+  // a load that clobbers the base register, look for a trailing destructive
+  // update of the base register and try and fold it in to make this into a
+  // writeback pair.
+  insn_info *trailing_add = nullptr;
+
+  rtx reg_notes = combine_reg_notes (first, second, load_p);
+
+  rtx pair_pat = NULL_RTX;
+
+  if (load_p) {
+    pair_pat = rs6000_gen_load_pair ((first->rtl ()), (second->rtl ()));
+    if (pair_pat == NULL_RTX)
+      return false;
+  }
+  insn_change *pair_change = nullptr;
+  auto set_pair_pat = [pair_pat,reg_notes](insn_change *change) {
+    rtx_insn *rti = change->insn ()->rtl ();
+    validate_unshare_change (rti, &PATTERN (rti), pair_pat, true);
+    validate_change (rti, &REG_NOTES (rti), reg_notes, true);
+  };
+
+  if (load_p)
+    {
+      changes.quick_push (make_delete (first));
+      pair_change = make_change (second);
+      changes.quick_push (pair_change);
+
+      pair_change->move_range = move_range;
+      pair_change->new_defs = merge_access_arrays (attempt,
+						   input_defs[0],
+						   input_defs[1]);
+      gcc_assert (pair_change->new_defs.is_valid ());
+
+      pair_change->new_uses
+	= merge_access_arrays (attempt,
+			       drop_memory_access (input_uses[0]),
+			       drop_memory_access (input_uses[1]));
+      gcc_assert (pair_change->new_uses.is_valid ());
+      set_pair_pat (pair_change);
+    }
+  else
+    {
+      insn_info *store_to_change = decide_store_strategy (first, second,
+							move_range);
+
+      if (store_to_change && dump_file)
+	fprintf (dump_file, "  stvx: re-purposing store %d\n",
+		 store_to_change->uid ());
+
+      insn_change *change;
+      for (int i = 0; i < 2; i++)
+	{
+	  change = make_change (insns[i]);
+	  if (insns[i] == store_to_change)
+	    {
+	      set_pair_pat (change);
+	      change->new_uses = merge_access_arrays (attempt,
+						      input_uses[0],
+						      input_uses[1]);
+	      auto d1 = drop_memory_access (input_defs[0]);
+	      auto d2 = drop_memory_access (input_defs[1]);
+	      change->new_defs = merge_access_arrays (attempt, d1, d2);
+	      gcc_assert (change->new_defs.is_valid ());
+	      def_info *stxvp_def = memory_access (store_to_change->defs ());
+	      change->new_defs = insert_access (attempt,
+						stxvp_def,
+						change->new_defs);
+	      gcc_assert (change->new_defs.is_valid ());
+	      change->move_range = move_range;
+	      pair_change = change;
+	    }
+	  else
+	    {
+	      // Note that we are turning this insn into a tombstone,
+	      // we need to keep track of these if we go ahead with the
+	      // change.
+	      tombstone_uids.quick_push (insns[i]->uid ());
+	      rtx_insn *rti = insns[i]->rtl ();
+	      validate_change (rti, &PATTERN (rti), gen_tombstone (), true);
+	      validate_change (rti, &REG_NOTES (rti), NULL_RTX, true);
+	      change->new_uses = use_array (nullptr, 0);
+	    }
+	  gcc_assert (change->new_uses.is_valid ());
+	  changes.quick_push (change);
+	}
+
+      if (!store_to_change)
+	{
+	  // Tricky case.  Cannot re-purpose existing insns for stp.
+	  // Need to insert new insn.
+	  if (dump_file)
+	    fprintf (dump_file,
+		     "  stp fusion: cannot re-purpose candidate stores\n");
+
+	  auto new_insn = crtl->ssa->create_insn (attempt, INSN, pair_pat);
+	  change = make_change (new_insn);
+	  change->move_range = move_range;
+	  change->new_uses = merge_access_arrays (attempt,
+						  input_uses[0],
+						  input_uses[1]);
+	  gcc_assert (change->new_uses.is_valid ());
+
+	  auto d1 = drop_memory_access (input_defs[0]);
+	  auto d2 = drop_memory_access (input_defs[1]);
+	  change->new_defs = merge_access_arrays (attempt, d1, d2);
+	  gcc_assert (change->new_defs.is_valid ());
+
+	  auto new_set = crtl->ssa->create_set (attempt, new_insn, memory);
+	  change->new_defs = insert_access (attempt, new_set,
+					    change->new_defs);
+	  gcc_assert (change->new_defs.is_valid ());
+	  changes.safe_insert (1, change);
+	  pair_change = change;
+	}
+    }
+
+  if (trailing_add)
+    changes.quick_push (make_delete (trailing_add));
+
+  auto n_changes = changes.length ();
+  gcc_checking_assert (n_changes >= 2 && n_changes <= 4);
+
+  gcc_assert (crtl->ssa->verify_insn_changes (changes));
+
+  cancel_changes (0);
+
+  confirm_change_group ();
+
+  gcc_checking_assert (tombstone_uids.length () <= 2);
+  for (auto uid : tombstone_uids)
+    track_tombstone (uid);
+
+  return true;
+}
+
+void fusion_bb (bb_info *bb)
+{
+  fusion_bb_info bb_state (bb);
+
+  for (auto insn : bb->nondebug_insns ())
+    {
+      rtx_insn *rti = insn->rtl ();
+
+      if (!rti || !INSN_P (rti))
+	continue;
+
+      rtx pat = PATTERN (rti);
+
+      if (GET_CODE (pat) != SET)
+	continue;
+
+      if (MEM_P (XEXP (pat, 1)))
+	bb_state.track_access (insn, true, XEXP (pat, 1));
+    }
+
+  bb_state.transform ();
+  bb_state.cleanup_tombstones ();
+}
+
+static void
+fusion_init ()
+{
+  df_set_flags (DF_RD_PRUNE_DEAD_DEFS);
+  df_chain_add_problem (DF_DU_CHAIN | DF_UD_CHAIN);
+  df_analyze ();
+  df_set_flags (DF_DEFER_INSN_RESCAN);
+  /* Rebuild ud- and du-chains.  */
+  df_remove_problem (df_chain);
+  df_process_deferred_rescans ();
+  df_set_flags (DF_RD_PRUNE_DEAD_DEFS);
+  df_chain_add_problem (DF_DU_CHAIN | DF_UD_CHAIN);
+  df_analyze ();
+  df_set_flags (DF_DEFER_INSN_RESCAN);
+  calculate_dominance_info (CDI_DOMINATORS);
+  crtl->ssa = new rtl_ssa::function_info (cfun);
+}
+
+static void
+fusion_destroy ()
+{
+  if (crtl->ssa->perform_pending_updates ())
+    cleanup_cfg (0);
+
+  free_dominance_info (CDI_DOMINATORS);
+  delete crtl->ssa;
+  crtl->ssa = nullptr;
+}
+
+// Iterate over the accesses in GROUP, looking for adjacent sets
+// of accesses.  If we find two sets of adjacent accesses, call
+// merge_pairs.
+void
+fusion_bb_info::transform_for_base (int encoded_lfs,
+				 access_group &group)
+{
+  const auto lfs = decode_lfs (encoded_lfs);
+  const unsigned access_size = lfs.size;
+
+  bool skip_next = true;
+  access_record *prev_access = nullptr;
+
+  for (auto &access : group.list)
+    {
+      if (skip_next)
+	skip_next = false;
+      else if (known_eq (access.offset, prev_access->offset + access_size))
+	{
+	  merge_pairs (prev_access->cand_insns,
+		       access.cand_insns,
+		       lfs.load_p,
+		       access_size);
+	  skip_next = true;
+	}
+      prev_access = &access;
+    }
+}
+
+void rs6000_analyze_vecload ()
+{
+  fusion_init ();
+
+  for (auto bb : crtl->ssa->bbs ())
+    fusion_bb (bb);
+
+  fusion_destroy ();
+}
+
+const pass_data pass_data_analyze_vecload =
+{
+  RTL_PASS, /* type */
+  "vecload", /* name */
+  OPTGROUP_NONE, /* optinfo_flags */
+  TV_NONE, /* tv_id */
+  0, /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  TODO_df_finish, /* todo_flags_finish */
+};
+
+class pass_analyze_vecload : public rtl_opt_pass
+{
+public:
+  pass_analyze_vecload(gcc::context *ctxt)
+    : rtl_opt_pass(pass_data_analyze_vecload, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  bool gate (function *)
+    {
+      return (optimize > 0 && TARGET_VSX && TARGET_POWER10);
+    }
+
+  unsigned int execute (function *) final override
+    {
+      rs6000_analyze_vecload ();
+      return 0;
+    }
+}; // class pass_analyze_vecload
+
+rtl_opt_pass *
+make_pass_analyze_vecload (gcc::context *ctxt)
+{
+  return new pass_analyze_vecload (ctxt);
+}
+
diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
index 5d975dab921..e36946c44ea 100644
--- a/gcc/config/rs6000/rs6000.cc
+++ b/gcc/config/rs6000/rs6000.cc
@@ -1178,6 +1178,7 @@  static bool rs6000_secondary_reload_move (enum rs6000_reg_type,
 					  secondary_reload_info *,
 					  bool);
 rtl_opt_pass *make_pass_analyze_swaps (gcc::context*);
+rtl_opt_pass *make_pass_analyze_vecload (gcc::context*);
 
 /* Hash table stuff for keeping track of TOC entries.  */
 
diff --git a/gcc/config/rs6000/t-rs6000 b/gcc/config/rs6000/t-rs6000
index b3ce09d523b..7664830f8ab 100644
--- a/gcc/config/rs6000/t-rs6000
+++ b/gcc/config/rs6000/t-rs6000
@@ -35,6 +35,11 @@  rs6000-p8swap.o: $(srcdir)/config/rs6000/rs6000-p8swap.cc
 	$(COMPILE) $<
 	$(POSTCOMPILE)
 
+rs6000-vecload-opt.o: $(srcdir)/config/rs6000/rs6000-vecload-opt.cc
+	$(COMPILE) $<
+	$(POSTCOMPILE)
+
+
 rs6000-d.o: $(srcdir)/config/rs6000/rs6000-d.cc
 	$(COMPILE) $<
 	$(POSTCOMPILE)
diff --git a/gcc/emit-rtl.cc b/gcc/emit-rtl.cc
index 1856fa4884f..ffc47a6eaa0 100644
--- a/gcc/emit-rtl.cc
+++ b/gcc/emit-rtl.cc
@@ -921,7 +921,7 @@  validate_subreg (machine_mode omode, machine_mode imode,
     return false;
 
   /* The subreg offset cannot be outside the inner object.  */
-  if (maybe_ge (offset, isize))
+  if (maybe_gt (offset, isize))
     return false;
 
   poly_uint64 regsize = REGMODE_NATURAL_SIZE (imode);
@@ -3035,18 +3035,6 @@  verify_rtx_sharing (rtx orig, rtx insn)
       break;
     }
 
-  /* This rtx may not be shared.  If it has already been seen,
-     replace it with a copy of itself.  */
-  if (flag_checking && RTX_FLAG (x, used))
-    {
-      error ("invalid rtl sharing found in the insn");
-      debug_rtx (insn);
-      error ("shared rtx");
-      debug_rtx (x);
-      internal_error ("internal consistency failure");
-    }
-  gcc_assert (!RTX_FLAG (x, used));
-
   RTX_FLAG (x, used) = 1;
 
   /* Now scan the subexpressions recursively.  */
diff --git a/gcc/fusion-common.h b/gcc/fusion-common.h
new file mode 100644
index 00000000000..82f1d6ef032
--- /dev/null
+++ b/gcc/fusion-common.h
@@ -0,0 +1,1195 @@ 
+// Common infrastructure for Load/Store Pair fusion optimization pass.
+// Copyright (C) 2023-2024 Free Software Foundation, Inc.
+//
+// This file is part of GCC.
+//
+// GCC is free software; you can redistribute it and/or modify it
+// under the terms of the GNU General Public License as published by
+// the Free Software Foundation; either version 3, or (at your option)
+// any later version.
+//
+// GCC is distributed in the hope that it will be useful, but
+// WITHOUT ANY WARRANTY; without even the implied warranty of
+// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+// General Public License for more details.
+//
+// You should have received a copy of the GNU General Public License
+// along with GCC; see the file COPYING3.  If not see
+// <http://www.gnu.org/licenses/>.
+
+#define INCLUDE_ALGORITHM
+#define INCLUDE_FUNCTIONAL
+#define INCLUDE_LIST
+#define INCLUDE_TYPE_TRAITS
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "backend.h"
+#include "rtl.h"
+#include "df.h"
+#include "rtl-iter.h"
+#include "rtl-ssa.h"
+#include "cfgcleanup.h"
+#include "tree-pass.h"
+#include "ordered-hash-map.h"
+#include "tree-dfa.h"
+#include "fold-const.h"
+#include "tree-hash-traits.h"
+#include "print-tree.h"
+#include "insn-attr.h"
+
+using namespace rtl_ssa;
+
+static constexpr HOST_WIDE_INT LDP_IMM_BITS = 7;
+static constexpr HOST_WIDE_INT LDP_IMM_SIGN_BIT = (1 << (LDP_IMM_BITS - 1));
+static constexpr HOST_WIDE_INT LDP_MAX_IMM = LDP_IMM_SIGN_BIT - 1;
+static constexpr HOST_WIDE_INT LDP_MIN_IMM = -LDP_MAX_IMM - 1;
+
+// We pack these fields (load_p, fpsimd_p, and size) into an integer
+// (LFS) which we use as part of the key into the main hash tables.
+//
+// The idea is that we group candidates together only if they agree on
+// the fields below.  Candidates that disagree on any of these
+// properties shouldn't be merged together.
+struct lfs_fields
+{
+  bool load_p;
+  bool fpsimd_p;
+  unsigned size;
+};
+
+using insn_list_t = std::list<insn_info *>;
+using insn_iter_t = insn_list_t::iterator;
+
+// Information about the accesses at a given offset from a particular
+// base.  Stored in an access_group, see below.
+struct access_record
+{
+  poly_int64 offset;
+  std::list<insn_info *> cand_insns;
+  std::list<access_record>::iterator place;
+
+  access_record (poly_int64 off) : offset (off) {}
+};
+
+// A group of accesses where adjacent accesses could be ldp/stp
+// or lxvp/stxvp candidates.  The splay tree supports efficient
+// insertion, while the list supports efficient iteration.
+struct access_group
+{
+  splay_tree<access_record *> tree;
+  std::list<access_record> list;
+
+  template<typename Alloc>
+  inline void track (Alloc node_alloc, poly_int64 offset, insn_info *insn);
+};
+
+// Information about a potential base candidate, used in try_fuse_pair.
+// There may be zero, one, or two viable RTL bases for a given pair.
+struct base_cand
+{
+  // DEF is the def of the base register to be used by the pair.
+  def_info *def;
+
+  // FROM_INSN is -1 if the base candidate is already shared by both
+  // candidate insns.  Otherwise it holds the index of the insn from
+  // which the base originated.
+  //
+  // In the case that the base is shared, either DEF is already used
+  // by both candidate accesses, or both accesses see different versions
+  // of the same regno, in which case DEF is the def consumed by the
+  // first candidate access.
+  int from_insn;
+
+  // To form a pair, we do so by moving the first access down and the second
+  // access up.  To determine where to form the pair, and whether or not
+  // it is safe to form the pair, we track instructions which cannot be
+  // re-ordered past due to either dataflow or alias hazards.
+  //
+  // Since we allow changing the base used by an access, the choice of
+  // base can change which instructions act as re-ordering hazards for
+  // this pair (due to different dataflow).  We store the initial
+  // dataflow hazards for this choice of base candidate in HAZARDS.
+  //
+  // These hazards act as re-ordering barriers to each candidate insn
+  // respectively, in program order.
+  //
+  // Later on, when we take alias analysis into account, we narrow
+  // HAZARDS accordingly.
+  insn_info *hazards[2];
+
+  base_cand (def_info *def, int insn)
+    : def (def), from_insn (insn), hazards {nullptr, nullptr} {}
+
+  base_cand (def_info *def) : base_cand (def, -1) {}
+
+  // Test if this base candidate is viable according to HAZARDS.
+  bool viable () const
+  {
+    return !hazards[0] || !hazards[1] || (*hazards[0] > *hazards[1]);
+  }
+};
+
+// Information about an alternate base.  For a def_info D, it may
+// instead be expressed as D = BASE + OFFSET.
+struct alt_base
+{
+  def_info *base;
+  poly_int64 offset;
+};
+
+// State used by the pass for a given basic block.
+struct fusion_bb_info
+{
+  using def_hash = nofree_ptr_hash<def_info>;
+  using expr_key_t = pair_hash<tree_operand_hash, int_hash<int, -1, -2>>;
+  using def_key_t = pair_hash<def_hash, int_hash<int, -1, -2>>;
+
+  // Map of <tree base, LFS> -> access_group.
+  ordered_hash_map<expr_key_t, access_group> expr_map;
+
+  // Map of <RTL-SSA def_info *, LFS> -> access_group.
+  ordered_hash_map<def_key_t, access_group> def_map;
+
+  // Given the def_info for an RTL base register, express it as an offset from
+  // some canonical base instead.
+  //
+  // Canonicalizing bases in this way allows us to identify adjacent accesses
+  // even if they see different base register defs.
+  hash_map<def_hash, alt_base> canon_base_map;
+
+  static const size_t obstack_alignment = sizeof (void *);
+  bb_info *m_bb;
+
+  fusion_bb_info (bb_info *bb) : m_bb (bb), m_emitted_tombstone (false)
+  {
+    obstack_specify_allocation (&m_obstack, OBSTACK_CHUNK_SIZE,
+				obstack_alignment, obstack_chunk_alloc,
+				obstack_chunk_free);
+  }
+  ~fusion_bb_info ()
+  {
+    obstack_free (&m_obstack, nullptr);
+
+    if (m_emitted_tombstone)
+      {
+	bitmap_release (&m_tombstone_bitmap);
+	bitmap_obstack_release (&m_bitmap_obstack);
+      }
+  }
+
+  inline void track_access (insn_info *, bool load, rtx mem);
+  inline void transform ();
+  inline void cleanup_tombstones ();
+
+private:
+  obstack m_obstack;
+
+  // State for keeping track of tombstone insns emitted for this BB.
+  bitmap_obstack m_bitmap_obstack;
+  bitmap_head m_tombstone_bitmap;
+  bool m_emitted_tombstone;
+
+  inline splay_tree_node<access_record *> *node_alloc (access_record *);
+
+  template<typename Map>
+  inline void traverse_base_map (Map &map);
+  inline void transform_for_base (int load_size, access_group &group);
+
+   void merge_pairs (insn_list_t &, insn_list_t &,
+			   bool load_p, unsigned access_size);
+
+   bool try_fuse_pair (bool load_p, unsigned access_size,
+			     insn_info *i1, insn_info *i2);
+
+   bool fuse_pair (bool load_p, unsigned access_size,
+			 int writeback,
+			 insn_info *i1, insn_info *i2,
+			 base_cand &base,
+			 const insn_range_info &move_range);
+
+  inline void track_tombstone (int uid);
+
+  inline bool track_via_mem_expr (insn_info *, rtx mem, lfs_fields lfs);
+};
+
+splay_tree_node<access_record *> *fusion_bb_info::node_alloc (access_record *access)
+{
+  using T = splay_tree_node<access_record *>;
+  void *addr = obstack_alloc (&m_obstack, sizeof (T));
+  return new (addr) T (access);
+}
+
+// Given a mem MEM, if the address has side effects, return a MEM that accesses
+// the same address but without the side effects.  Otherwise, return
+// MEM unchanged.
+static rtx
+drop_writeback (rtx mem)
+{
+  rtx addr = XEXP (mem, 0);
+
+  if (!side_effects_p (addr))
+    return mem;
+
+  switch (GET_CODE (addr))
+    {
+    case PRE_MODIFY:
+      addr = XEXP (addr, 1);
+      break;
+    case POST_MODIFY:
+    case POST_INC:
+    case POST_DEC:
+      addr = XEXP (addr, 0);
+      break;
+    case PRE_INC:
+    case PRE_DEC:
+    {
+      poly_int64 adjustment = GET_MODE_SIZE (GET_MODE (mem));
+      if (GET_CODE (addr) == PRE_DEC)
+	adjustment *= -1;
+      addr = plus_constant (GET_MODE (addr), XEXP (addr, 0), adjustment);
+      break;
+    }
+    default:
+      gcc_unreachable ();
+    }
+
+  return change_address (mem, GET_MODE (mem), addr);
+}
+
+// Convenience wrapper around strip_offset that can also look through
+// RTX_AUTOINC addresses.  The interface is like strip_offset except we take a
+// MEM so that we know the mode of the access.
+static rtx
+load_strip_offset (rtx mem, poly_int64 *offset)
+{
+  rtx addr = XEXP (mem, 0);
+
+  switch (GET_CODE (addr))
+    {
+    case PRE_MODIFY:
+    case POST_MODIFY:
+      addr = strip_offset (XEXP (addr, 1), offset);
+      gcc_checking_assert (REG_P (addr));
+      gcc_checking_assert (rtx_equal_p (XEXP (XEXP (mem, 0), 0), addr));
+      break;
+    case PRE_INC:
+    case POST_INC:
+      addr = XEXP (addr, 0);
+      *offset = GET_MODE_SIZE (GET_MODE (mem));
+      gcc_checking_assert (REG_P (addr));
+      break;
+    case PRE_DEC:
+    case POST_DEC:
+      addr = XEXP (addr, 0);
+      *offset = -GET_MODE_SIZE (GET_MODE (mem));
+      gcc_checking_assert (REG_P (addr));
+      break;
+
+    default:
+      addr = strip_offset (addr, offset);
+    }
+
+  return addr;
+}
+
+// Return true if X is a PRE_{INC,DEC,MODIFY} rtx.
+static bool
+any_pre_modify_p (rtx x)
+{
+  const auto code = GET_CODE (x);
+  return code == PRE_INC || code == PRE_DEC || code == PRE_MODIFY;
+}
+
+// Return true if X is a POST_{INC,DEC,MODIFY} rtx.
+static bool
+any_post_modify_p (rtx x)
+{
+  const auto code = GET_CODE (x);
+  return code == POST_INC || code == POST_DEC || code == POST_MODIFY;
+}
+// Given LFS (load_p, fpsimd_p, size) fields in FIELDS, encode these
+// into an integer for use as a hash table key.
+static int
+encode_lfs (lfs_fields fields)
+{
+  int size_log2 = exact_log2 (fields.size);
+  //gcc_checking_assert (size_log2 >= 2 && size_log2 <= 4);
+  return ((int)fields.load_p << 3)
+    | ((int)fields.fpsimd_p << 2)
+    | (size_log2 - 2);
+}
+
+// Inverse of encode_lfs.
+static lfs_fields
+decode_lfs (int lfs)
+{
+  bool load_p = (lfs & (1 << 3));
+  bool fpsimd_p = (lfs & (1 << 2));
+  unsigned size = 1U << ((lfs & 3) + 2);
+  return { load_p, fpsimd_p, size };
+}
+// Track the access INSN at offset OFFSET in this access group.
+// ALLOC_NODE is used to allocate splay tree nodes.
+template<typename Alloc>
+void
+access_group::track (Alloc alloc_node, poly_int64 offset, insn_info *insn)
+{
+  auto insert_before = [&](std::list<access_record>::iterator after)
+    {
+      auto it = list.emplace (after, offset);
+      it->cand_insns.push_back (insn);
+      it->place = it;
+      return &*it;
+    };
+
+  if (!list.size ())
+    {
+      auto access = insert_before (list.end ());
+      tree.insert_max_node (alloc_node (access));
+      return;
+    }
+
+  auto compare = [&](splay_tree_node<access_record *> *node)
+    {
+      return compare_sizes_for_sort (offset, node->value ()->offset);
+    };
+  auto result = tree.lookup (compare);
+  splay_tree_node<access_record *> *node = tree.root ();
+  if (result == 0)
+    node->value ()->cand_insns.push_back (insn);
+  else
+    {
+      auto it = node->value ()->place;
+      auto after = (result > 0) ? std::next (it) : it;
+      auto access = insert_before (after);
+      tree.insert_child (node, result > 0, alloc_node (access));
+    }
+}
+// Given a candidate access INSN (with mem MEM), see if it has a suitable
+// MEM_EXPR base (i.e. a tree decl) relative to which we can track the access.
+// LFS is used as part of the key to the hash table, see track_access.
+bool
+fusion_bb_info::track_via_mem_expr (insn_info *insn, rtx mem, lfs_fields lfs)
+{
+  if (!MEM_EXPR (mem) || !MEM_OFFSET_KNOWN_P (mem))
+    return false;
+
+  poly_int64 offset;
+  tree base_expr = get_addr_base_and_unit_offset (MEM_EXPR (mem),
+						  &offset);
+  if (!base_expr || !DECL_P (base_expr))
+    return false;
+
+  offset += MEM_OFFSET (mem);
+
+  const machine_mode mem_mode = GET_MODE (mem);
+  const HOST_WIDE_INT mem_size = GET_MODE_SIZE (mem_mode);//.to_constant ();
+
+  // Punt on misaligned offsets.  LDP/STP instructions require offsets to be a
+  // multiple of the access size, and we believe that misaligned offsets on
+  // MEM_EXPR bases are likely to lead to misaligned offsets w.r.t. RTL bases.
+  if (!multiple_p (offset, mem_size))
+    return false;
+
+  const auto key = std::make_pair (base_expr, encode_lfs (lfs));
+  access_group &group = expr_map.get_or_insert (key, NULL);
+  auto alloc = [&](access_record *access) { return node_alloc (access); };
+  group.track (alloc, offset, insn);
+
+  if (dump_file)
+    {
+      fprintf (dump_file, "[bb %u] tracking insn %d via ",
+	       m_bb->index (), insn->uid ());
+      print_node_brief (dump_file, "mem expr", base_expr, 0);
+      fprintf (dump_file, " [L=%d FP=%d, %smode, off=",
+	       lfs.load_p, lfs.fpsimd_p, mode_name[mem_mode]);
+      print_dec (offset, dump_file);
+      fprintf (dump_file, "]\n");
+    }
+
+  return true;
+}
+
+// Dummy predicate that never ignores any insns.
+static bool no_ignore (insn_info *) { return false; }
+
+// Return the latest dataflow hazard before INSN.
+//
+// If IGNORE is non-NULL, this points to a sub-rtx which we should ignore for
+// dataflow purposes.  This is needed when considering changing the RTL base of
+// an access discovered through a MEM_EXPR base.
+//
+// If IGNORE_INSN is non-NULL, we should further ignore any hazards arising
+// from that insn.
+//
+// N.B. we ignore any defs/uses of memory here as we deal with that separately,
+// making use of alias disambiguation.
+static insn_info *
+latest_hazard_before (insn_info *insn, rtx *ignore,
+		      insn_info *ignore_insn = nullptr)
+{
+  insn_info *result = nullptr;
+
+  // If the insn can throw then it is at the end of a BB and we can't
+  // move it, model this by recording a hazard in the previous insn
+  // which will prevent moving the insn up.
+  if (cfun->can_throw_non_call_exceptions
+      && find_reg_note (insn->rtl (), REG_EH_REGION, NULL_RTX))
+    return insn->prev_nondebug_insn ();
+
+  // Return true if we registered the hazard.
+  auto hazard = [&](insn_info *h) -> bool
+    {
+      gcc_checking_assert (*h < *insn);
+      if (h == ignore_insn)
+	return false;
+
+      if (!result || *h > *result)
+	result = h;
+
+      return true;
+    };
+
+  rtx pat = PATTERN (insn->rtl ());
+  auto ignore_use = [&](use_info *u)
+    {
+      if (u->is_mem ())
+	return true;
+
+      return !refers_to_regno_p (u->regno (), u->regno () + 1, pat, ignore);
+    };
+
+  // Find defs of uses in INSN (RaW).
+  for (auto use : insn->uses ())
+    if (!ignore_use (use) && use->def ())
+      hazard (use->def ()->insn ());
+
+  // Find previous defs (WaW) or previous uses (WaR) of defs in INSN.
+  for (auto def : insn->defs ())
+    {
+      if (def->is_mem ())
+	continue;
+
+      if (def->prev_def ())
+	{
+	  hazard (def->prev_def ()->insn ()); // WaW
+
+	  auto set = dyn_cast<set_info *> (def->prev_def ());
+	  if (set && set->has_nondebug_insn_uses ())
+	    for (auto use : set->reverse_nondebug_insn_uses ())
+	      if (use->insn () != insn && hazard (use->insn ())) // WaR
+		break;
+	}
+
+      if (!HARD_REGISTER_NUM_P (def->regno ()))
+	continue;
+
+      // Also need to check backwards for call clobbers (WaW).
+      for (auto call_group : def->ebb ()->call_clobbers ())
+	{
+	  if (!call_group->clobbers (def->resource ()))
+	    continue;
+
+	  auto clobber_insn = prev_call_clobbers_ignoring (*call_group,
+							   def->insn (),
+							   no_ignore);
+	  if (clobber_insn)
+	    hazard (clobber_insn);
+	}
+
+    }
+
+  return result;
+}
+
+// Return the first dataflow hazard after INSN.
+//
+// If IGNORE is non-NULL, this points to a sub-rtx which we should ignore for
+// dataflow purposes.  This is needed when considering changing the RTL base of
+// an access discovered through a MEM_EXPR base.
+//
+// N.B. we ignore any defs/uses of memory here as we deal with that separately,
+// making use of alias disambiguation.
+static insn_info *
+first_hazard_after (insn_info *insn, rtx *ignore)
+{
+  insn_info *result = nullptr;
+  auto hazard = [insn, &result](insn_info *h)
+    {
+      gcc_checking_assert (*h > *insn);
+      if (!result || *h < *result)
+	result = h;
+    };
+
+  rtx pat = PATTERN (insn->rtl ());
+  auto ignore_use = [&](use_info *u)
+    {
+      if (u->is_mem ())
+	return true;
+
+      return !refers_to_regno_p (u->regno (), u->regno () + 1, pat, ignore);
+    };
+
+  for (auto def : insn->defs ())
+    {
+      if (def->is_mem ())
+	continue;
+
+      if (def->next_def ())
+	hazard (def->next_def ()->insn ()); // WaW
+
+      auto set = dyn_cast<set_info *> (def);
+      if (set && set->has_nondebug_insn_uses ())
+	hazard (set->first_nondebug_insn_use ()->insn ()); // RaW
+
+      if (!HARD_REGISTER_NUM_P (def->regno ()))
+	continue;
+
+      // Also check for call clobbers of this def (WaW).
+      for (auto call_group : def->ebb ()->call_clobbers ())
+	{
+	  if (!call_group->clobbers (def->resource ()))
+	    continue;
+
+	  auto clobber_insn = next_call_clobbers_ignoring (*call_group,
+							   def->insn (),
+							   no_ignore);
+	  if (clobber_insn)
+	    hazard (clobber_insn);
+	}
+    }
+
+  // Find any subsequent defs of uses in INSN (WaR).
+  for (auto use : insn->uses ())
+    {
+      if (ignore_use (use))
+	continue;
+
+      if (use->def ())
+	{
+	  auto def = use->def ()->next_def ();
+	  if (def && def->insn () == insn)
+	    def = def->next_def ();
+
+	  if (def)
+	    hazard (def->insn ());
+	}
+
+      if (!HARD_REGISTER_NUM_P (use->regno ()))
+	continue;
+
+      // Also need to handle call clobbers of our uses (again WaR).
+      //
+      // See restrict_movement_for_uses_ignoring for why we don't
+      // need to check backwards for call clobbers.
+      for (auto call_group : use->ebb ()->call_clobbers ())
+	{
+	  if (!call_group->clobbers (use->resource ()))
+	    continue;
+
+	  auto clobber_insn = next_call_clobbers_ignoring (*call_group,
+							   use->insn (),
+							   no_ignore);
+	  if (clobber_insn)
+	    hazard (clobber_insn);
+	}
+    }
+
+  return result;
+}
+
+// Return true iff R1 and R2 overlap.
+static bool
+ranges_overlap_p (const insn_range_info &r1, const insn_range_info &r2)
+{
+  // If either range is empty, then their intersection is empty.
+  if (!r1 || !r2)
+    return false;
+
+  // When do they not overlap? When one range finishes before the other
+  // starts, i.e. (*r1.last < *r2.first || *r2.last < *r1.first).
+  // Inverting this, we get the below.
+  return *r1.last >= *r2.first && *r2.last >= *r1.first;
+}
+
+// Get the range of insns that def feeds.
+static insn_range_info get_def_range (def_info *def)
+{
+  insn_info *last = def->next_def ()->insn ()->prev_nondebug_insn ();
+  return { def->insn (), last };
+}
+
+// Given a def (of memory), return the downwards range within which we
+// can safely move this def.
+static insn_range_info
+def_downwards_move_range (def_info *def)
+{
+  auto range = get_def_range (def);
+
+  auto set = dyn_cast<set_info *> (def);
+  if (!set || !set->has_any_uses ())
+    return range;
+
+  auto use = set->first_nondebug_insn_use ();
+  if (use)
+    range = move_earlier_than (range, use->insn ());
+
+  return range;
+}
+
+// Given a def (of memory), return the upwards range within which we can
+// safely move this def.
+static insn_range_info
+def_upwards_move_range (def_info *def)
+{
+  def_info *prev = def->prev_def ();
+  insn_range_info range { prev->insn (), def->insn () };
+
+  auto set = dyn_cast<set_info *> (prev);
+  if (!set || !set->has_any_uses ())
+    return range;
+
+  auto use = set->last_nondebug_insn_use ();
+  if (use)
+    range = move_later_than (range, use->insn ());
+
+  return range;
+}
+
+// Given candidate store insns FIRST and SECOND, see if we can re-purpose one
+// of them (together with its def of memory) for the stp/stxvp insn. If so,
+//  return that insn.  Otherwise, return null.
+static insn_info *
+decide_store_strategy (insn_info *first,
+		     insn_info *second,
+		     const insn_range_info &move_range)
+{
+  def_info * const defs[2] = {
+    memory_access (first->defs ()),
+    memory_access (second->defs ())
+  };
+
+  if (move_range.includes (first)
+      || ranges_overlap_p (move_range, def_downwards_move_range (defs[0])))
+    return first;
+
+  if (move_range.includes (second)
+      || ranges_overlap_p (move_range, def_upwards_move_range (defs[1])))
+    return second;
+
+  return nullptr;
+}
+
+// Generate the RTL pattern for a "tombstone"; used temporarily during this pass
+// to replace stores that are marked for deletion where we can't immediately
+// delete the store (since there are uses of mem hanging off the store).
+//
+// These are deleted at the end of the pass and uses re-parented appropriately
+// at this point.
+static rtx
+gen_tombstone (void)
+{
+  return gen_rtx_CLOBBER (VOIDmode,
+			  gen_rtx_MEM (BLKmode, gen_rtx_SCRATCH (Pmode)));
+}
+// Go through the reg notes rooted at NOTE, dropping those that we should drop,
+// and preserving those that we want to keep by prepending them to (and
+// returning) RESULT.  EH_REGION is used to make sure we have at most one
+// REG_EH_REGION note in the resulting list.  FR_EXPR is used to return any
+// REG_FRAME_RELATED_EXPR note we find, as these can need special handling in
+// combine_reg_notes.
+static rtx
+filter_notes (rtx note, rtx result, bool *eh_region, rtx *fr_expr)
+{
+  for (; note; note = XEXP (note, 1))
+    {
+      switch (REG_NOTE_KIND (note))
+	{
+	case REG_DEAD:
+	  // REG_DEAD notes aren't required to be maintained.
+	case REG_EQUAL:
+	case REG_EQUIV:
+	case REG_UNUSED:
+	case REG_NOALIAS:
+	  // These can all be dropped.  For REG_EQU{AL,IV} they cannot apply to
+	  // non-single_set insns, and REG_UNUSED is re-computed by RTl-SSA, see
+	  // rtl-ssa/changes.cc:update_notes.
+	  //
+	  // Similarly, REG_NOALIAS cannot apply to a parallel.
+	case REG_INC:
+	  // When we form the pair insn, the reg update is implemented
+	  // as just another SET in the parallel, so isn't really an
+	  // auto-increment in the RTL sense, hence we drop the note.
+	  break;
+	case REG_EH_REGION:
+	  gcc_assert (!*eh_region);
+	  *eh_region = true;
+	  result = alloc_reg_note (REG_EH_REGION, XEXP (note, 0), result);
+	  break;
+	case REG_CFA_DEF_CFA:
+	case REG_CFA_OFFSET:
+	case REG_CFA_RESTORE:
+	  result = alloc_reg_note (REG_NOTE_KIND (note),
+				   copy_rtx (XEXP (note, 0)),
+				   result);
+	  break;
+	case REG_FRAME_RELATED_EXPR:
+	  gcc_assert (!*fr_expr);
+	  *fr_expr = copy_rtx (XEXP (note, 0));
+	  break;
+	default:
+	  // Unexpected REG_NOTE kind.
+	  gcc_unreachable ();
+	}
+    }
+
+  return result;
+}
+
+// Return the notes that should be attached to a combination of I1 and I2, where
+// *I1 < *I2.  LOAD_P is true for loads.
+static rtx
+combine_reg_notes (insn_info *i1, insn_info *i2, bool load_p)
+{
+  // Temporary storage for REG_FRAME_RELATED_EXPR notes.
+  rtx fr_expr[2] = {};
+
+  bool found_eh_region = false;
+  rtx result = NULL_RTX;
+  result = filter_notes (REG_NOTES (i2->rtl ()), result,
+			 &found_eh_region, fr_expr);
+  result = filter_notes (REG_NOTES (i1->rtl ()), result,
+			 &found_eh_region, fr_expr + 1);
+
+  if (!load_p)
+    {
+      // Simple frame-related sp-relative saves don't need CFI notes, but when
+      // we combine them into an stp/stxvp we will need a CFI note as dwarf2cfi
+      //  can't interpret the unspec pair representation directly.
+      if (RTX_FRAME_RELATED_P (i1->rtl ()) && !fr_expr[0])
+	fr_expr[0] = copy_rtx (PATTERN (i1->rtl ()));
+      if (RTX_FRAME_RELATED_P (i2->rtl ()) && !fr_expr[1])
+	fr_expr[1] = copy_rtx (PATTERN (i2->rtl ()));
+    }
+
+  rtx fr_pat = NULL_RTX;
+  if (fr_expr[0] && fr_expr[1])
+    {
+      // Combining two frame-related insns, need to construct
+      // a REG_FRAME_RELATED_EXPR note which represents the combined
+      // operation.
+      RTX_FRAME_RELATED_P (fr_expr[1]) = 1;
+      fr_pat = gen_rtx_PARALLEL (VOIDmode,
+				 gen_rtvec (2, fr_expr[0], fr_expr[1]));
+    }
+  else
+    fr_pat = fr_expr[0] ? fr_expr[0] : fr_expr[1];
+
+  if (fr_pat)
+    result = alloc_reg_note (REG_FRAME_RELATED_EXPR,
+			     fr_pat, result);
+
+  return result;
+}
+
+// Given two memory accesses in PATS, at least one of which is of a
+// writeback form, extract two non-writeback memory accesses addressed
+// relative to the initial value of the base register, and output these
+// in PATS.  Return an rtx that represents the overall change to the
+// base register.
+static rtx
+extract_writebacks (bool load_p, rtx pats[2], int changed)
+{
+  rtx base_reg = NULL_RTX;
+  poly_int64 current_offset = 0;
+
+  poly_int64 offsets[2];
+
+  for (int i = 0; i < 2; i++)
+    {
+      rtx mem = XEXP (pats[i], load_p);
+      rtx reg = XEXP (pats[i], !load_p);
+
+      rtx addr = XEXP (mem, 0);
+      const bool autoinc_p = GET_RTX_CLASS (GET_CODE (addr)) == RTX_AUTOINC;
+
+      poly_int64 offset;
+      rtx this_base = load_strip_offset (mem, &offset);
+      gcc_assert (REG_P (this_base));
+      if (base_reg)
+	gcc_assert (rtx_equal_p (base_reg, this_base));
+      else
+	base_reg = this_base;
+
+      // If we changed base for the current insn, then we already
+      // derived the correct mem for this insn from the effective
+      // address of the other access.
+      if (i == changed)
+	{
+	  gcc_checking_assert (!autoinc_p);
+	  offsets[i] = offset;
+	  continue;
+	}
+
+      if (autoinc_p && any_pre_modify_p (addr))
+	current_offset += offset;
+
+      poly_int64 this_off = current_offset;
+      if (!autoinc_p)
+	this_off += offset;
+
+      offsets[i] = this_off;
+      rtx new_mem = change_address (mem, GET_MODE (mem),
+				    plus_constant (GET_MODE (base_reg),
+						   base_reg, this_off));
+      pats[i] = load_p
+	? gen_rtx_SET (reg, new_mem)
+	: gen_rtx_SET (new_mem, reg);
+
+      if (autoinc_p && any_post_modify_p (addr))
+	current_offset += offset;
+    }
+
+  if (known_eq (current_offset, 0))
+    return NULL_RTX;
+
+  return gen_rtx_SET (base_reg, plus_constant (GET_MODE (base_reg),
+					       base_reg, current_offset));
+}
+
+// We just emitted a tombstone with uid UID, track it in a bitmap for
+// this BB so we can easily identify it later when cleaning up tombstones.
+void
+fusion_bb_info::track_tombstone (int uid)
+{
+  if (!m_emitted_tombstone)
+    {
+      // Lazily initialize the bitmap for tracking tombstone insns.
+      bitmap_obstack_initialize (&m_bitmap_obstack);
+      bitmap_initialize (&m_tombstone_bitmap, &m_bitmap_obstack);
+      m_emitted_tombstone = true;
+    }
+
+  if (!bitmap_set_bit (&m_tombstone_bitmap, uid))
+    gcc_unreachable (); // Bit should have changed.
+}
+
+// Virtual base class for load/store walkers used in alias analysis.
+struct alias_walker
+{
+  virtual bool conflict_p (int &budget) const = 0;
+  virtual insn_info *insn () const = 0;
+  virtual bool valid () const  = 0;
+  virtual void advance () = 0;
+};
+
+// Implement some common functionality used by both store_walker
+// and load_walker.
+template<bool reverse>
+class def_walker : public alias_walker
+{
+protected:
+  using def_iter_t = typename std::conditional<reverse,
+	reverse_def_iterator, def_iterator>::type;
+
+  static use_info *start_use_chain (def_iter_t &def_iter)
+  {
+    set_info *set = nullptr;
+    for (; *def_iter; def_iter++)
+      {
+	set = dyn_cast<set_info *> (*def_iter);
+	if (!set)
+	  continue;
+
+	use_info *use = reverse
+	  ? set->last_nondebug_insn_use ()
+	  : set->first_nondebug_insn_use ();
+
+	if (use)
+	  return use;
+      }
+
+    return nullptr;
+  }
+
+  def_iter_t def_iter;
+  insn_info *limit;
+  def_walker (def_info *def, insn_info *limit) :
+    def_iter (def), limit (limit) {}
+
+  virtual bool iter_valid () const { return *def_iter; }
+
+public:
+  insn_info *insn () const override { return (*def_iter)->insn (); }
+  void advance () override { def_iter++; }
+  bool valid () const override final
+  {
+    if (!iter_valid ())
+      return false;
+
+    if (reverse)
+      return *(insn ()) > *limit;
+    else
+      return *(insn ()) < *limit;
+  }
+};
+
+// alias_walker that iterates over stores.
+template<bool reverse, typename InsnPredicate>
+class store_walker : public def_walker<reverse>
+{
+  rtx cand_mem;
+  InsnPredicate tombstone_p;
+
+public:
+  store_walker (def_info *mem_def, rtx mem, insn_info *limit_insn,
+		InsnPredicate tombstone_fn) :
+    def_walker<reverse> (mem_def, limit_insn),
+    cand_mem (mem), tombstone_p (tombstone_fn) {}
+
+  bool conflict_p (int &budget) const override final
+  {
+    if (tombstone_p (this->insn ()))
+      return false;
+
+    return store_modifies_mem_p (cand_mem, this->insn (), budget);
+  }
+};
+
+// alias_walker that iterates over loads.
+template<bool reverse>
+class load_walker : public def_walker<reverse>
+{
+  using Base = def_walker<reverse>;
+  using use_iter_t = typename std::conditional<reverse,
+	reverse_use_iterator, nondebug_insn_use_iterator>::type;
+
+  use_iter_t use_iter;
+  insn_info *cand_store;
+
+  bool iter_valid () const override final { return *use_iter; }
+
+public:
+  void advance () override final
+  {
+    use_iter++;
+    if (*use_iter)
+      return;
+    this->def_iter++;
+    use_iter = Base::start_use_chain (this->def_iter);
+  }
+
+  insn_info *insn () const override final
+  {
+    return (*use_iter)->insn ();
+  }
+
+  bool conflict_p (int &budget) const override final
+  {
+    return load_modified_by_store_p (insn (), cand_store, budget);
+  }
+
+  load_walker (def_info *def, insn_info *store, insn_info *limit_insn)
+    : Base (def, limit_insn),
+      use_iter (Base::start_use_chain (this->def_iter)),
+      cand_store (store) {}
+};
+
+// Given INSNS (in program order) which are known to be adjacent, look
+// to see if either insn has a suitable RTL (register) base that we can
+// use to form a pair.  Push these to BASE_CANDS if we find any.  CAND_MEMs
+// gives the relevant mems from the candidate insns, ACCESS_SIZE gives the
+// size of a single candidate access, and REVERSED says whether the accesses
+// are inverted in offset order.
+//
+// Returns an integer where bit (1 << i) is set if INSNS[i] uses writeback
+// addressing.
+static int
+get_viable_bases (insn_info *insns[2],
+		  vec<base_cand> &base_cands,
+		  rtx cand_mems[2],
+		  unsigned access_size,
+		  bool reversed)
+{
+  // We discovered this pair through a common base.  Need to ensure that
+  // we have a common base register that is live at both locations.
+  def_info *base_defs[2] = {};
+  int writeback = 0;
+  for (int i = 0; i < 2; i++)
+    {
+      const bool is_lower = (i == reversed);
+      poly_int64 poly_off;
+      rtx base = load_strip_offset (cand_mems[i], &poly_off);
+      if (GET_RTX_CLASS (GET_CODE (XEXP (cand_mems[i], 0))) == RTX_AUTOINC)
+	writeback |= (1 << i);
+
+      if (!REG_P (base) || !poly_off.is_constant ())
+	continue;
+
+      // Punt on accesses relative to eliminable regs.  See the comment in
+      // fusion_bb_info::track_access for a detailed explanation of this.
+      if (!reload_completed
+	  && (REGNO (base) == FRAME_POINTER_REGNUM
+	      || REGNO (base) == ARG_POINTER_REGNUM))
+	continue;
+
+      HOST_WIDE_INT base_off = poly_off.to_constant ();
+
+      // It should be unlikely that we ever punt here, since MEM_EXPR offset
+      // alignment should be a good proxy for register offset alignment.
+      if (base_off % access_size != 0)
+	{
+	  if (dump_file)
+	    fprintf (dump_file,
+		     "base not viable, offset misaligned (insn %d)\n",
+		     insns[i]->uid ());
+	  continue;
+	}
+
+      base_off /= access_size;
+
+      if (!is_lower)
+	base_off--;
+
+      if (base_off < LDP_MIN_IMM || base_off > LDP_MAX_IMM)
+	continue;
+
+      use_info *use = find_access (insns[i]->uses (), REGNO (base));
+      gcc_assert (use);
+      base_defs[i] = use->def ();
+    }
+
+  if (!base_defs[0] && !base_defs[1])
+    {
+      if (dump_file)
+	fprintf (dump_file, "no viable base register for pair (%d,%d)\n",
+		 insns[0]->uid (), insns[1]->uid ());
+      return writeback;
+    }
+
+  for (int i = 0; i < 2; i++)
+    if ((writeback & (1 << i)) && !base_defs[i])
+      {
+	if (dump_file)
+	  fprintf (dump_file, "insn %d has writeback but base isn't viable\n",
+		   insns[i]->uid ());
+	return writeback;
+      }
+
+  if (writeback == 3
+      && base_defs[0]->regno () != base_defs[1]->regno ())
+    {
+      if (dump_file)
+	fprintf (dump_file,
+		 "pair (%d,%d): double writeback with distinct regs (%d,%d): "
+		 "punting\n",
+		 insns[0]->uid (), insns[1]->uid (),
+		 base_defs[0]->regno (), base_defs[1]->regno ());
+      return writeback;
+    }
+
+  if (base_defs[0] && base_defs[1]
+      && base_defs[0]->regno () == base_defs[1]->regno ())
+    {
+      // Easy case: insns already share the same base reg.
+      base_cands.quick_push (base_defs[0]);
+      return writeback;
+    }
+
+  // Otherwise, we know that one of the bases must change.
+  //
+  // Note that if there is writeback we must use the writeback base
+  // (we know now there is exactly one).
+  for (int i = 0; i < 2; i++)
+    if (base_defs[i] && (!writeback || (writeback & (1 << i))))
+      base_cands.quick_push (base_cand { base_defs[i], i });
+
+  return writeback;
+}
+
+static void
+dump_insn_list (FILE *f, const insn_list_t &l)
+{
+  fprintf (f, "(");
+
+  auto i = l.begin ();
+  auto end = l.end ();
+
+  if (i != end)
+    fprintf (f, "%d", (*i)->uid ());
+  i++;
+
+  for (; i != end; i++)
+    fprintf (f, ", %d", (*i)->uid ());
+
+  fprintf (f, ")");
+}
+
+DEBUG_FUNCTION void
+debug (const insn_list_t &l)
+{
+  dump_insn_list (stderr, l);
+  fprintf (stderr, "\n");
+}
+
+// If we emitted tombstone insns for this BB, iterate through the BB
+// and remove all the tombstone insns, being sure to reparent any uses
+// of mem to previous defs when we do this.
+void
+fusion_bb_info::cleanup_tombstones ()
+{
+  // No need to do anything if we didn't emit a tombstone insn for this BB.
+  if (!m_emitted_tombstone)
+    return;
+
+  insn_info *insn = m_bb->head_insn ();
+  while (insn)
+    {
+      insn_info *next = insn->next_nondebug_insn ();
+      if (!insn->is_real ()
+	  || !bitmap_bit_p (&m_tombstone_bitmap, insn->uid ()))
+	{
+	  insn = next;
+	  continue;
+	}
+
+      auto def = memory_access (insn->defs ());
+      auto set = dyn_cast<set_info *> (def);
+      if (set && set->has_any_uses ())
+	{
+	  def_info *prev_def = def->prev_def ();
+	  auto prev_set = dyn_cast<set_info *> (prev_def);
+	  if (!prev_set)
+	    gcc_unreachable ();
+
+	  while (set->first_use ())
+	    crtl->ssa->reparent_use (set->first_use (), prev_set);
+	}
+
+      // Now set has no uses, we can delete it.
+      insn_change change (insn, insn_change::DELETE);
+      //df_insn_rescan (insn->rtl());
+      crtl->ssa->change_insn (change);
+      insn = next;
+    }
+}
+
+template<typename Map>
+void
+fusion_bb_info::traverse_base_map (Map &map)
+{
+  for (auto kv : map)
+    {
+      const auto &key = kv.first;
+      auto &value = kv.second;
+      transform_for_base (key.second, value);
+    }
+}
+
+void
+fusion_bb_info::transform ()
+{
+  traverse_base_map (expr_map);
+  traverse_base_map (def_map);
+}
diff --git a/gcc/testsuite/g++.target/powerpc/vecload.C b/gcc/testsuite/g++.target/powerpc/vecload.C
new file mode 100644
index 00000000000..c523572cf3c
--- /dev/null
+++ b/gcc/testsuite/g++.target/powerpc/vecload.C
@@ -0,0 +1,15 @@ 
+/* { dg-do compile } */ 
+/* { dg-require-effective-target power10_ok } */
+/* { dg-options "-mdejagnu-cpu=power10 -O2" } */ 
+
+#include <altivec.h>
+
+void
+foo (__vector_quad *dst, vector unsigned char *ptr, vector unsigned char src)
+{
+  __vector_quad acc;
+  __builtin_mma_xvf32ger(&acc, src, ptr[0]);
+  __builtin_mma_xvf32gerpp(&acc, src, ptr[1]);
+  *dst = acc;
+}
+/* { dg-final { scan-assembler {\mlxvp\M} } } */
diff --git a/gcc/testsuite/g++.target/powerpc/vecload1.C b/gcc/testsuite/g++.target/powerpc/vecload1.C
new file mode 100644
index 00000000000..d10ff0cdf36
--- /dev/null
+++ b/gcc/testsuite/g++.target/powerpc/vecload1.C
@@ -0,0 +1,22 @@ 
+/* { dg-do compile } */
+/* { dg-require-effective-target power10_ok } */
+/* { dg-options "-mdejagnu-cpu=power10 -O2" } */
+
+#include <altivec.h>
+	
+void
+foo2 ()
+{
+  __vector_quad *dst1;
+  __vector_quad *dst2;
+  vector unsigned char src;
+  __vector_quad acc;
+  vector unsigned char *ptr;
+  __builtin_mma_xvf32ger(&acc, src, ptr[0]);
+  __builtin_mma_xvf32gerpp(&acc, src, ptr[1]);
+  *dst1 = acc;
+  __builtin_mma_xvf32ger(&acc, src, ptr[2]);
+  __builtin_mma_xvf32gerpp(&acc, src, ptr[3]);
+  *dst2 = acc;
+}
+/* { dg-final { scan-assembler {\mlxvp\M} } } */
diff --git a/gcc/testsuite/gcc.target/powerpc/mma-builtin-1.c b/gcc/testsuite/gcc.target/powerpc/mma-builtin-1.c
index 69ee826e1be..ae29127f954 100644
--- a/gcc/testsuite/gcc.target/powerpc/mma-builtin-1.c
+++ b/gcc/testsuite/gcc.target/powerpc/mma-builtin-1.c
@@ -258,8 +258,8 @@  foo13b (__vector_quad *dst, __vector_quad *src, vec_t *vec)
   dst[13] = acc;
 }
 
-/* { dg-final { scan-assembler-times {\mlxv\M} 40 } } */
-/* { dg-final { scan-assembler-times {\mlxvp\M} 12 } } */
+/* { dg-final { scan-assembler-times {\mlxv\M} 0 } } */
+/* { dg-final { scan-assembler-times {\mlxvp\M} 32 } } */
 /* { dg-final { scan-assembler-times {\mstxvp\M} 40 } } */
 /* { dg-final { scan-assembler-times {\mxxmfacc\M} 20 } } */
 /* { dg-final { scan-assembler-times {\mxxmtacc\M} 6 } } */