diff mbox series

[V4,1/3] aarch64: Place target independent and dependent changed code in one file

Message ID bc960b42-6bdf-4b42-8ca9-536ab8be81e5@linux.ibm.com
State New
Headers show
Series [V4,1/3] aarch64: Place target independent and dependent changed code in one file | expand

Commit Message

Ajit Agarwal April 5, 2024, 8:23 a.m. UTC
Hello Alex/Richard:

All review comments are incorporated.

The common infrastructure of the load/store pair fusion pass is divided
into target-independent and target-dependent code.

The target-independent code is generic code that uses pure virtual
functions as the interface between the target-independent and
target-dependent parts.

The target-dependent code implements those pure virtual functions for
the aarch64 target and calls into the target-independent code.

Thanks & Regards
Ajit


aarch64: Place target independent and dependent changed code in one file

The common infrastructure of the load/store pair fusion pass is divided
into target-independent and target-dependent code.

The target-independent code is generic code that uses pure virtual
functions as the interface between the target-independent and
target-dependent parts.

The target-dependent code implements those pure virtual functions for
the aarch64 target and calls into the target-independent code.

2024-04-06  Ajit Kumar Agarwal  <aagarwa1@linux.ibm.com>

gcc/ChangeLog:

	* config/aarch64/aarch64-ldp-fusion.cc: Place target
	independent and dependent changed code.
---
 gcc/config/aarch64/aarch64-ldp-fusion.cc | 371 +++++++++++++++--------
 1 file changed, 249 insertions(+), 122 deletions(-)

Comments

Alex Coplan April 5, 2024, 4:33 p.m. UTC | #1
On 05/04/2024 13:53, Ajit Agarwal wrote:
> Hello Alex/Richard:
> 
> All review comments are incorporated.

Thanks, I was kind-of expecting you to also send the renaming patch as a
preparatory patch as we discussed.

Sorry for another meta comment, but: I think the reason that the Linaro
CI isn't running tests on your patches is actually because you're
sending 1/3 of a series but not sending the rest of the series.

So please can you either send this as an individual preparatory patch
(not marked as a series) or if you're going to send a series (e.g. with
a preparatory rename patch as 1/2 and this as 2/2) then send the entire
series when you make updates.  That way the CI should test your patches,
which would be helpful.

> 
> The common infrastructure of the load/store pair fusion pass is divided
> into target-independent and target-dependent code.
> 
> The target-independent code is generic code that uses pure virtual
> functions as the interface between the target-independent and
> target-dependent parts.
> 
> The target-dependent code implements those pure virtual functions for
> the aarch64 target and calls into the target-independent code.
> 
> Thanks & Regards
> Ajit
> 
> 
> aarch64: Place target independent and dependent changed code in one file
> 
> The common infrastructure of the load/store pair fusion pass is divided
> into target-independent and target-dependent code.
> 
> The target-independent code is generic code that uses pure virtual
> functions as the interface between the target-independent and
> target-dependent parts.
> 
> The target-dependent code implements those pure virtual functions for
> the aarch64 target and calls into the target-independent code.
> 
> 2024-04-06  Ajit Kumar Agarwal  <aagarwa1@linux.ibm.com>
> 
> gcc/ChangeLog:
> 
> 	* config/aarch64/aarch64-ldp-fusion.cc: Place target
> 	independent and dependent changed code.

You're going to need a proper ChangeLog eventually, but I guess there's
no need for that right now.

> ---
>  gcc/config/aarch64/aarch64-ldp-fusion.cc | 371 +++++++++++++++--------
>  1 file changed, 249 insertions(+), 122 deletions(-)
> 
> diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc b/gcc/config/aarch64/aarch64-ldp-fusion.cc
> index 22ed95eb743..cb21b514ef7 100644
> --- a/gcc/config/aarch64/aarch64-ldp-fusion.cc
> +++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc
> @@ -138,8 +138,122 @@ struct alt_base
>    poly_int64 offset;
>  };
>  
> +// Virtual base class for load/store walkers used in alias analysis.
> +struct alias_walker
> +{
> +  virtual bool conflict_p (int &budget) const = 0;
> +  virtual insn_info *insn () const = 0;
> +  virtual bool valid () const  = 0;

Heh, looking at this made me realise there is a whitespace bug here in
the existing code (double space after const).  Sorry about that!  I'll
push an obvious fix for that.

> +  virtual void advance () = 0;
> +};
> +
> +struct pair_fusion {
> +
> +  pair_fusion () {};

This ctor looks pointless at the moment.  Perhaps instead we could put
the contents of ldp_fusion_init in here and then delete that function?
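
For example (an untested sketch, assuming ldp_fusion_init keeps its
current contents):

  pair_fusion ()
  {
    calculate_dominance_info (CDI_DOMINATORS);
    df_analyze ();
    crtl->ssa = new rtl_ssa::function_info (cfun);
  }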

> +  virtual bool fpsimd_op_p (rtx reg_op, machine_mode mem_mode,
> +			       bool load_p) = 0;

Please can we have comments above each of these virtual functions
describing any parameters, what the purpose of the hook is, and the
interpretation of the return value?  This will serve as the
documentation for other targets that want to make use of the pass.

It might make sense to have a default-false implementation for
fpsimd_op_p, especially if you don't want to make use of this bit for
rs6000.
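
E.g. something like:

  virtual bool fpsimd_op_p (rtx, machine_mode, bool)
  {
    return false;
  }

so that targets which don't segregate FP/SIMD accesses don't need to
override it at all.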

> +
> +  virtual bool pair_operand_mode_ok_p (machine_mode mode) = 0;
> +  virtual bool pair_trailing_writeback_p () = 0;

Sorry for the run-around, but: I think this and
handle_writeback_opportunities () should be the same function, either
returning an enum or taking an enum and returning a boolean.

At a minimum they should have more similar sounding names.
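
One possible shape (the names here are just placeholders):

  enum class writeback { NONE, EXISTING, ALL };

  // How much writeback handling does the target want?  NONE: ignore
  // writeback accesses entirely; EXISTING: pair up existing writeback
  // accesses; ALL: also try to fold trailing base-register updates in.
  virtual writeback handle_writeback_opportunities () = 0;

with pair_trailing_writeback_p () becoming a check for writeback::ALL
at the call sites.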

> +  virtual bool pair_reg_operand_ok_p (bool load_p, rtx reg_op,
> +				      machine_mode mem_mode) = 0;
> +  virtual int pair_mem_alias_check_limit () = 0;
> +  virtual bool handle_writeback_opportunities () = 0 ;
> +  virtual bool pair_mem_ok_with_policy (rtx first_mem, bool load_p,
> +					machine_mode mode) = 0;
> +  virtual rtx gen_mem_pair (rtx *pats,  rtx writeback,

Nit: excess whitespace after pats,

> +			    bool load_p) = 0;
> +  virtual bool pair_mem_promote_writeback_p (rtx pat) = 0;
> +  virtual bool track_load_p () = 0;
> +  virtual bool track_store_p () = 0;

I think it would probably make more sense for these two to have
default-true implementations rather than being pure virtual functions.

Also, sorry for the bikeshedding, but please can we keep the plural
names (so track_loads_p and track_stores_p).
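
I.e.:

  virtual bool track_loads_p () { return true; }
  virtual bool track_stores_p () { return true; }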

> +  virtual bool cand_insns_empty_p (std::list<insn_info *> &insns) = 0;

Why does this need to be virtualised?  I would expect it to just be
insns.empty () on all targets.
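
I.e. the caller can just keep the direct call from the existing code:

  skip_next = access.cand_insns.empty ();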

> +  virtual bool pair_mem_in_range_p (HOST_WIDE_INT off) = 0;
> +  void ldp_fusion_bb (bb_info *bb);

Just to check, are you planning to rename this function in a separate
patch?

> +
> +  ~pair_fusion() { }

As with the ctor, perhaps we could put the contents of ldp_fusion_destroy
in here and then delete that function?

> +};
> +
> +struct aarch64_pair_fusion : public pair_fusion
> +{
> +public:

Nit: there shouldn't be a need for this as public is already the default
for structs.

> +  aarch64_pair_fusion () : pair_fusion () {};

This ctor looks pointless, the default one should do, so can we remove
this?

> +  bool fpsimd_op_p (rtx reg_op, machine_mode mem_mode,
> +		    bool load_p) override final
> +  {
> +    const bool fpsimd_op_p
> +      = reload_completed
> +      ? (REG_P (reg_op) && FP_REGNUM_P (REGNO (reg_op)))
> +      : (GET_MODE_CLASS (mem_mode) != MODE_INT
> +	 && (load_p || !aarch64_const_zero_rtx_p (reg_op)));
> +    return fpsimd_op_p;

Can this just directly return the boolean expression instead of storing
it into a variable first?

> +  }
> +
> +  bool pair_mem_promote_writeback_p (rtx pat)
> +  {
> +     if (reload_completed
> +	 && aarch64_ldp_writeback > 1
> +	 && GET_CODE (pat) == PARALLEL
> +	 && XVECLEN (pat, 0) == 2)
> +	return true;
> +
> +     return false;
> +  }
> +
> +  bool pair_mem_ok_with_policy (rtx first_mem, bool load_p,
> +				machine_mode mode)
> +  {
> +    return aarch64_mem_ok_with_ldpstp_policy_model (first_mem,
> +						     load_p,
> +						     mode);
> +  }
> +  bool pair_operand_mode_ok_p (machine_mode mode);

Why have you implemented this (one-line function) out-of-line but the
others are implemented inline?  Please be consistent.

> +
> +  rtx gen_mem_pair (rtx *pats, rtx writeback, bool load_p);
> +
> +  bool pair_trailing_writeback_p  ()
> +  {
> +    return aarch64_ldp_writeback > 1;
> +  }
> +  bool pair_reg_operand_ok_p (bool load_p, rtx reg_op,
> +			      machine_mode mem_mode)
> +  {
> +    return (load_p
> +	     ? aarch64_ldp_reg_operand (reg_op, mem_mode)
> +	     : aarch64_stp_reg_operand (reg_op, mem_mode));
> +  }
> +  int pair_mem_alias_check_limit ()
> +  {
> +    return aarch64_ldp_alias_check_limit;
> +  }
> +  bool handle_writeback_opportunities ()
> +  {
> +    return aarch64_ldp_writeback;
> +  }
> +  bool track_load_p ()
> +  {
> +    const bool track_loads
> +      = aarch64_tune_params.ldp_policy_model != AARCH64_LDP_STP_POLICY_NEVER;
> +    return track_loads;
> +  }

As above, can this just return the boolean expression directly without
the temporary?

> +  bool track_store_p ()
> +  {
> +    const bool track_stores
> +      = aarch64_tune_params.stp_policy_model != AARCH64_LDP_STP_POLICY_NEVER;
> +    return track_stores;
> +  }

Likewise.

> +  bool cand_insns_empty_p (std::list<insn_info *> &insns)
> +  {
> +    return insns.empty();
> +  }

As mentioned on the declaration, I don't see why this needs to be
virtualised.

> +  bool pair_mem_in_range_p (HOST_WIDE_INT off)
> +  {
> +    return (off < LDP_MIN_IMM || off > LDP_MAX_IMM);
> +  }

Hmm, this looks to test exactly the opposite condition from what the
name suggests, i.e. the implementation looks like what I would expect
for a function called pair_mem_out_of_range_p.

Please can you invert the condition and fix up callers appropriately?

This suggests you're not really thinking about the code you're writing;
please try to think a bit more carefully before posting patches.
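
I.e. something like:

  bool pair_mem_in_range_p (HOST_WIDE_INT off)
  {
    return off >= LDP_MIN_IMM && off <= LDP_MAX_IMM;
  }

with the callers then testing !pair_mem_in_range_p (...) where they
currently call it directly.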

> +};
> +
> +
>  // State used by the pass for a given basic block.
> -struct ldp_bb_info
> +struct pair_fusion_bb_info
>  {
>    using def_hash = nofree_ptr_hash<def_info>;
>    using expr_key_t = pair_hash<tree_operand_hash, int_hash<int, -1, -2>>;
> @@ -160,14 +274,17 @@ struct ldp_bb_info
>  
>    static const size_t obstack_alignment = sizeof (void *);
>    bb_info *m_bb;
> +  pair_fusion *bb_state;

Can we call this m_pass instead of bb_state?  The pair_fusion class is
explicitly at the pass level rather than the bb level.

Also I think this (and indeed m_bb, although that's pre-existing) could
be private members.  Not sure there's a good reason for making them
public.

It might be slightly nicer to use a reference for m_pass unless there's
a good reason to use a pointer, although I don't feel strongly about
this.
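
I.e.:

  pair_fusion &m_pass;

initialized in the ctor's member-init list.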

>  
> -  ldp_bb_info (bb_info *bb) : m_bb (bb), m_emitted_tombstone (false)
> +  pair_fusion_bb_info (bb_info *bb,
> +		      aarch64_pair_fusion *d) : m_bb (bb),

This generic class needs to take the generic pass type, i.e. just a
pair_fusion * rather than an aarch64_pair_fusion *.

> +		      bb_state (d), m_emitted_tombstone (false)
>    {
>      obstack_specify_allocation (&m_obstack, OBSTACK_CHUNK_SIZE,
>  				obstack_alignment, obstack_chunk_alloc,
>  				obstack_chunk_free);
>    }
> -  ~ldp_bb_info ()
> +  ~pair_fusion_bb_info ()
>    {
>      obstack_free (&m_obstack, nullptr);
>  
> @@ -177,10 +294,32 @@ struct ldp_bb_info
>  	bitmap_obstack_release (&m_bitmap_obstack);
>        }
>    }
> +  void track_access (insn_info *, bool load, rtx mem);
> +  void transform ();
> +  void cleanup_tombstones ();

I think most (or indeed all?) of the changes here shouldn't be needed.
AFAICS, there shouldn't be a need to:
 - drop inline from these functions.
 - change the accessibility of any members.

Please can you drop these changes or explain why they're needed (from
here until the end of the class).

> +  void merge_pairs (insn_list_t &, insn_list_t &,
> +		    bool load_p, unsigned access_size);
> +  void transform_for_base (int load_size, access_group &group);
> +
> +  bool try_fuse_pair (bool load_p, unsigned access_size,
> +			     insn_info *i1, insn_info *i2);
> +
> +  bool fuse_pair (bool load_p, unsigned access_size,
> +		  int writeback,
> +		  insn_info *i1, insn_info *i2,
> +		  base_cand &base,
> +		  const insn_range_info &move_range);
>  
> -  inline void track_access (insn_info *, bool load, rtx mem);
> -  inline void transform ();
> -  inline void cleanup_tombstones ();
> +  void do_alias_analysis (insn_info *alias_hazards[4],
> +			  alias_walker *walkers[4],
> +			  bool load_p);
> +
> +  void track_tombstone (int uid);
> +
> +  bool track_via_mem_expr (insn_info *, rtx mem, lfs_fields lfs);
> +
> +  template<typename Map>
> +    void traverse_base_map (Map &map);
>  
>  private:
>    obstack m_obstack;
> @@ -191,30 +330,32 @@ private:
>    bool m_emitted_tombstone;
>  
>    inline splay_tree_node<access_record *> *node_alloc (access_record *);
> +};
>  
> -  template<typename Map>
> -  inline void traverse_base_map (Map &map);
> -  inline void transform_for_base (int load_size, access_group &group);
> -
> -  inline void merge_pairs (insn_list_t &, insn_list_t &,
> -			   bool load_p, unsigned access_size);
> -
> -  inline bool try_fuse_pair (bool load_p, unsigned access_size,
> -			     insn_info *i1, insn_info *i2);
> -
> -  inline bool fuse_pair (bool load_p, unsigned access_size,
> -			 int writeback,
> -			 insn_info *i1, insn_info *i2,
> -			 base_cand &base,
> -			 const insn_range_info &move_range);
> -
> -  inline void track_tombstone (int uid);
> +rtx aarch64_pair_fusion::gen_mem_pair (rtx *pats,

Style nit: newline after rtx.

> +		  rtx writeback,
> +		  bool load_p)
> +  {
> +    rtx pair_pat;
>  
> -  inline bool track_via_mem_expr (insn_info *, rtx mem, lfs_fields lfs);
> -};
> +    if (writeback)
> +      {
> +	auto patvec = gen_rtvec (3, writeback, pats[0], pats[1]);
> +	pair_pat = gen_rtx_PARALLEL (VOIDmode, patvec);
> +      }
> +    else if (load_p)
> +      pair_pat = aarch64_gen_load_pair (XEXP (pats[0], 0),
> +					XEXP (pats[1], 0),
> +					XEXP (pats[0], 1));
> +    else
> +      pair_pat = aarch64_gen_store_pair (XEXP (pats[0], 0),
> +					 XEXP (pats[0], 1),
> +					 XEXP (pats[1], 1));
> +     return pair_pat;

Can the assignments just be direct returns here, please?  Then
get rid of the temporary.
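
I.e.:

  if (writeback)
    {
      auto patvec = gen_rtvec (3, writeback, pats[0], pats[1]);
      return gen_rtx_PARALLEL (VOIDmode, patvec);
    }
  if (load_p)
    return aarch64_gen_load_pair (XEXP (pats[0], 0),
				  XEXP (pats[1], 0),
				  XEXP (pats[0], 1));
  return aarch64_gen_store_pair (XEXP (pats[0], 0),
				 XEXP (pats[0], 1),
				 XEXP (pats[1], 1));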

> +  }
>  
>  splay_tree_node<access_record *> *
> -ldp_bb_info::node_alloc (access_record *access)
> +pair_fusion_bb_info::node_alloc (access_record *access)
>  {
>    using T = splay_tree_node<access_record *>;
>    void *addr = obstack_alloc (&m_obstack, sizeof (T));
> @@ -262,7 +403,7 @@ drop_writeback (rtx mem)
>  // RTX_AUTOINC addresses.  The interface is like strip_offset except we take a
>  // MEM so that we know the mode of the access.
>  static rtx
> -ldp_strip_offset (rtx mem, poly_int64 *offset)
> +pair_mem_strip_offset (rtx mem, poly_int64 *offset)

I thought we discussed splitting the renaming into a separate patch?

>  {
>    rtx addr = XEXP (mem, 0);
>  
> @@ -332,6 +473,12 @@ ldp_operand_mode_ok_p (machine_mode mode)
>    return reload_completed || mode != TImode;
>  }
>  
> +bool
> +aarch64_pair_fusion::pair_operand_mode_ok_p (machine_mode mode)
> +{
> +  return ldp_operand_mode_ok_p (mode);
> +}
> +
>  // Given LFS (load_p, fpsimd_p, size) fields in FIELDS, encode these
>  // into an integer for use as a hash table key.
>  static int
> @@ -396,7 +543,7 @@ access_group::track (Alloc alloc_node, poly_int64 offset, insn_info *insn)
>  // MEM_EXPR base (i.e. a tree decl) relative to which we can track the access.
>  // LFS is used as part of the key to the hash table, see track_access.
>  bool
> -ldp_bb_info::track_via_mem_expr (insn_info *insn, rtx mem, lfs_fields lfs)
> +pair_fusion_bb_info::track_via_mem_expr (insn_info *insn, rtx mem, lfs_fields lfs)

Again, please do any renaming in a separate patch.

>  {
>    if (!MEM_EXPR (mem) || !MEM_OFFSET_KNOWN_P (mem))
>      return false;
> @@ -412,9 +559,10 @@ ldp_bb_info::track_via_mem_expr (insn_info *insn, rtx mem, lfs_fields lfs)
>    const machine_mode mem_mode = GET_MODE (mem);
>    const HOST_WIDE_INT mem_size = GET_MODE_SIZE (mem_mode).to_constant ();
>  
> -  // Punt on misaligned offsets.  LDP/STP instructions require offsets to be a
> -  // multiple of the access size, and we believe that misaligned offsets on
> -  // MEM_EXPR bases are likely to lead to misaligned offsets w.r.t. RTL bases.
> +  // Punt on misaligned offsets.  Paired memory access instructions require
> +  // offsets to be a multiple of the access size, and we believe that
> +  // misaligned offsets on MEM_EXPR bases are likely to lead to misaligned
> +  // offsets w.r.t. RTL bases.
>    if (!multiple_p (offset, mem_size))
>      return false;
>  
> @@ -438,46 +586,38 @@ ldp_bb_info::track_via_mem_expr (insn_info *insn, rtx mem, lfs_fields lfs)
>  }
>  
>  // Main function to begin pair discovery.  Given a memory access INSN,
> -// determine whether it could be a candidate for fusing into an ldp/stp,
> +// determine whether it could be a candidate for fusing into an pair mem,

s/an pair mem/a paired access/

>  // and if so, track it in the appropriate data structure for this basic
>  // block.  LOAD_P is true if the access is a load, and MEM is the mem
>  // rtx that occurs in INSN.
>  void
> -ldp_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
> +pair_fusion_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
>  {
>    // We can't combine volatile MEMs, so punt on these.
>    if (MEM_VOLATILE_P (mem))
>      return;
>  
> -  // Ignore writeback accesses if the param says to do so.
> -  if (!aarch64_ldp_writeback
> +  // Ignore writeback accesses if the param says to do so

This comment needs updating, I guess something like "if the hook says to
do so".  But also please don't delete the full stop.

> +  if (!bb_state->handle_writeback_opportunities ()
>        && GET_RTX_CLASS (GET_CODE (XEXP (mem, 0))) == RTX_AUTOINC)
>      return;
>  
>    const machine_mode mem_mode = GET_MODE (mem);
> -  if (!ldp_operand_mode_ok_p (mem_mode))
> +

I don't think we need the extra whitespace here.

> +  if (!bb_state->pair_operand_mode_ok_p (mem_mode))
>      return;
>  
>    rtx reg_op = XEXP (PATTERN (insn->rtl ()), !load_p);
>  
> -  // Ignore the access if the register operand isn't suitable for ldp/stp.
> -  if (load_p
> -      ? !aarch64_ldp_reg_operand (reg_op, mem_mode)
> -      : !aarch64_stp_reg_operand (reg_op, mem_mode))
> +  if (!bb_state->pair_reg_operand_ok_p (load_p, reg_op, mem_mode))
>      return;
> -
>    // We want to segregate FP/SIMD accesses from GPR accesses.
>    //
>    // Before RA, we use the modes, noting that stores of constant zero
>    // operands use GPRs (even in non-integer modes).  After RA, we use
>    // the hard register numbers.
> -  const bool fpsimd_op_p
> -    = reload_completed
> -    ? (REG_P (reg_op) && FP_REGNUM_P (REGNO (reg_op)))
> -    : (GET_MODE_CLASS (mem_mode) != MODE_INT
> -       && (load_p || !aarch64_const_zero_rtx_p (reg_op)));
> -
> -  // Note ldp_operand_mode_ok_p already rejected VL modes.
> +  const bool fpsimd_op_p = bb_state->fpsimd_op_p (reg_op, mem_mode, load_p);
> +  // Note pair_operand_mode_ok_p already rejected VL modes.
>    const HOST_WIDE_INT mem_size = GET_MODE_SIZE (mem_mode).to_constant ();
>    const lfs_fields lfs = { load_p, fpsimd_op_p, mem_size };
>  
> @@ -487,7 +627,7 @@ ldp_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
>    poly_int64 mem_off;
>    rtx addr = XEXP (mem, 0);
>    const bool autoinc_p = GET_RTX_CLASS (GET_CODE (addr)) == RTX_AUTOINC;
> -  rtx base = ldp_strip_offset (mem, &mem_off);
> +  rtx base = pair_mem_strip_offset (mem, &mem_off);

Again, let's split the renaming off into a separate patch.

>    if (!REG_P (base))
>      return;
>  
> @@ -506,8 +646,8 @@ ldp_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
>    // elimination offset pre-RA, we should postpone forming pairs on such
>    // accesses until after RA.
>    //
> -  // As it stands, addresses with offsets in range for LDR but not
> -  // in range for LDP/STP are currently reloaded inefficiently,
> +  // As it stands, addresses in range for an individual load/store but not
> +  // for a paired access are currently reloaded inefficiently,
>    // ending up with a separate base register for each pair.
>    //
>    // In theory LRA should make use of
> @@ -519,8 +659,8 @@ ldp_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
>    // that calls targetm.legitimize_address_displacement.
>    //
>    // So for now, it's better to punt when we can't be sure that the
> -  // offset is in range for LDP/STP.  Out-of-range cases can then be
> -  // handled after RA by the out-of-range LDP/STP peepholes.  Eventually, it
> +  // offset is in range for paired access.  Out-of-range cases can then be
> +  // handled after RA by the out-of-range PAIR MEM  peepholes.  Eventually, it
>    // would be nice to handle known out-of-range opportunities in the
>    // pass itself (for stack accesses, this would be in the post-RA pass).
>    if (!reload_completed
> @@ -573,8 +713,8 @@ ldp_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
>  	gcc_unreachable (); // Base defs should be unique.
>      }
>  
> -  // Punt on misaligned offsets.  LDP/STP require offsets to be a multiple of
> -  // the access size.
> +  // Punt on misaligned offsets.  Paired memory accesses require offsets
> +  // to be a multiple of the access size.
>    if (!multiple_p (mem_off, mem_size))
>      return;
>  
> @@ -614,7 +754,7 @@ static bool no_ignore (insn_info *) { return false; }
>  // making use of alias disambiguation.
>  static insn_info *
>  latest_hazard_before (insn_info *insn, rtx *ignore,
> -		      insn_info *ignore_insn = nullptr)
> +		      insn_info *ignore_insn = 0)

Is this change needed?

>  {
>    insn_info *result = nullptr;
>  
> @@ -1150,7 +1290,7 @@ extract_writebacks (bool load_p, rtx pats[2], int changed)
>        const bool autoinc_p = GET_RTX_CLASS (GET_CODE (addr)) == RTX_AUTOINC;
>  
>        poly_int64 offset;
> -      rtx this_base = ldp_strip_offset (mem, &offset);
> +      rtx this_base = pair_mem_strip_offset (mem, &offset);
>        gcc_assert (REG_P (this_base));
>        if (base_reg)
>  	gcc_assert (rtx_equal_p (base_reg, this_base));
> @@ -1286,7 +1426,11 @@ find_trailing_add (insn_info *insns[2],
>  
>    off_hwi /= access_size;
>  
> -  if (off_hwi < LDP_MIN_IMM || off_hwi > LDP_MAX_IMM)
> +  pair_fusion *pfuse;
> +  aarch64_pair_fusion derived;
> +  pfuse = &derived;

Huh? We'll want to use the same instance of the pass structure (pair_fusion)
that gets created at the top level, so that should get passed around everywhere.

In the case of find_trailing_add we probably don't want to add any more
parameters (as Segher noted there are already too many), so perhaps the
easiest thing would be to make find_trailing_add itself a (non-virtual)
member function of pair_fusion.

Then here we can just query pair_mem_in_range_p directly.

For try_promote_writeback itself you can either make it a non-virtual
member function of pair_fusion or just pass a reference to the pass
object in when you call it.

Note also that you don't have to create a pointer of the parent type in
order to call a virtual function on a derived object, you can just call
it directly.
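
With find_trailing_add a member of pair_fusion, this check just
becomes:

  if (!pair_mem_in_range_p (off_hwi))
    return nullptr;

(after the inversion discussed above) with no local instantiation
needed.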

> +
> +  if (pfuse->pair_mem_in_range_p (off_hwi))
>      return nullptr;

Note of course this condition needs inverting when you invert the
implementation.

>  
>    auto dump_prefix = [&]()
> @@ -1328,7 +1472,7 @@ find_trailing_add (insn_info *insns[2],
>  // We just emitted a tombstone with uid UID, track it in a bitmap for
>  // this BB so we can easily identify it later when cleaning up tombstones.
>  void
> -ldp_bb_info::track_tombstone (int uid)
> +pair_fusion_bb_info::track_tombstone (int uid)

Again, please split the renaming off.  Same with other instances below.

>  {
>    if (!m_emitted_tombstone)
>      {
> @@ -1528,7 +1672,7 @@ fixup_debug_uses (obstack_watermark &attempt,
>  	  gcc_checking_assert (GET_RTX_CLASS (GET_CODE (XEXP (mem, 0)))
>  			       == RTX_AUTOINC);
>  
> -	  base = ldp_strip_offset (mem, &offset);
> +	  base = pair_mem_strip_offset (mem, &offset);
>  	  gcc_checking_assert (REG_P (base) && REGNO (base) == base_regno);
>  	}
>        fixup_debug_use (attempt, use, def, base, offset);
> @@ -1664,7 +1808,7 @@ fixup_debug_uses (obstack_watermark &attempt,
>  // BASE gives the chosen base candidate for the pair and MOVE_RANGE is
>  // a singleton range which says where to place the pair.
>  bool
> -ldp_bb_info::fuse_pair (bool load_p,
> +pair_fusion_bb_info::fuse_pair (bool load_p,
>  			unsigned access_size,
>  			int writeback,
>  			insn_info *i1, insn_info *i2,
> @@ -1800,7 +1944,7 @@ ldp_bb_info::fuse_pair (bool load_p,
>      {
>        if (dump_file)
>  	fprintf (dump_file,
> -		 "  ldp: i%d has wb but subsequent i%d has non-wb "
> +		 "  load pair: i%d has wb but subsequent i%d has non-wb "
>  		 "update of base (r%d), dropping wb\n",
>  		 insns[0]->uid (), insns[1]->uid (), base_regno);
>        gcc_assert (writeback_effect);
> @@ -1823,7 +1967,7 @@ ldp_bb_info::fuse_pair (bool load_p,
>      }
>  
>    // If either of the original insns had writeback, but the resulting pair insn
> -  // does not (can happen e.g. in the ldp edge case above, or if the writeback
> +  // does not (can happen e.g. in the load pair edge case above, or if the writeback
>    // effects cancel out), then drop the def(s) of the base register as
>    // appropriate.
>    //
> @@ -1842,7 +1986,7 @@ ldp_bb_info::fuse_pair (bool load_p,
>    // update of the base register and try and fold it in to make this into a
>    // writeback pair.
>    insn_info *trailing_add = nullptr;
> -  if (aarch64_ldp_writeback > 1
> +  if (bb_state->pair_trailing_writeback_p ()
>        && !writeback_effect
>        && (!load_p || (!refers_to_regno_p (base_regno, base_regno + 1,
>  					 XEXP (pats[0], 0), nullptr)
> @@ -1863,14 +2007,14 @@ ldp_bb_info::fuse_pair (bool load_p,
>      }
>  
>    // Now that we know what base mem we're going to use, check if it's OK
> -  // with the ldp/stp policy.
> +  // with the pair mem policy.
>    rtx first_mem = XEXP (pats[0], load_p);
> -  if (!aarch64_mem_ok_with_ldpstp_policy_model (first_mem,
> -						load_p,
> -						GET_MODE (first_mem)))
> +  if (!bb_state->pair_mem_ok_with_policy (first_mem,
> +			  load_p,
> +			  GET_MODE (first_mem)))
>      {
>        if (dump_file)
> -	fprintf (dump_file, "punting on pair (%d,%d), ldp/stp policy says no\n",
> +	fprintf (dump_file, "punting on pair (%d,%d), pair mem  policy says no\n",
>  		 i1->uid (), i2->uid ());
>        return false;
>      }
> @@ -1878,21 +2022,10 @@ ldp_bb_info::fuse_pair (bool load_p,
>    rtx reg_notes = combine_reg_notes (first, second, load_p);
>  
>    rtx pair_pat;
> -  if (writeback_effect)
> -    {
> -      auto patvec = gen_rtvec (3, writeback_effect, pats[0], pats[1]);
> -      pair_pat = gen_rtx_PARALLEL (VOIDmode, patvec);
> -    }
> -  else if (load_p)
> -    pair_pat = aarch64_gen_load_pair (XEXP (pats[0], 0),
> -				      XEXP (pats[1], 0),
> -				      XEXP (pats[0], 1));
> -  else
> -    pair_pat = aarch64_gen_store_pair (XEXP (pats[0], 0),
> -				       XEXP (pats[0], 1),
> -				       XEXP (pats[1], 1));
>  
> +  pair_pat = bb_state->gen_mem_pair (pats, writeback_effect, load_p);
>    insn_change *pair_change = nullptr;
> +
>    auto set_pair_pat = [pair_pat,reg_notes](insn_change *change) {
>      rtx_insn *rti = change->insn ()->rtl ();
>      validate_unshare_change (rti, &PATTERN (rti), pair_pat, true);
> @@ -2133,15 +2266,6 @@ load_modified_by_store_p (insn_info *load,
>    return false;
>  }
>  
> -// Virtual base class for load/store walkers used in alias analysis.
> -struct alias_walker
> -{
> -  virtual bool conflict_p (int &budget) const = 0;
> -  virtual insn_info *insn () const = 0;
> -  virtual bool valid () const  = 0;
> -  virtual void advance () = 0;
> -};
> -
>  // Implement some common functionality used by both store_walker
>  // and load_walker.
>  template<bool reverse>
> @@ -2259,13 +2383,13 @@ public:
>  //
>  // We try to maintain the invariant that if a walker becomes invalid, we
>  // set its pointer to null.
> -static void
> -do_alias_analysis (insn_info *alias_hazards[4],
> +void
> +pair_fusion_bb_info::do_alias_analysis (insn_info *alias_hazards[4],

It would probably make more sense for this to be a (non-virtual) member
function of the pair_fusion class, since it doesn't need to access the
BB-specific state.
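
I.e. declare it as a non-virtual member:

  // In class pair_fusion:
  void do_alias_analysis (insn_info *alias_hazards[4],
			  alias_walker *walkers[4],
			  bool load_p);

so the budget below can come straight from pair_mem_alias_check_limit ()
without going through bb_state.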

>  		   alias_walker *walkers[4],
>  		   bool load_p)
>  {
>    const int n_walkers = 2 + (2 * !load_p);
> -  int budget = aarch64_ldp_alias_check_limit;
> +  int budget = bb_state->pair_mem_alias_check_limit ();
>  
>    auto next_walker = [walkers,n_walkers](int current) -> int {
>      for (int j = 1; j <= n_walkers; j++)
> @@ -2365,7 +2489,7 @@ get_viable_bases (insn_info *insns[2],
>      {
>        const bool is_lower = (i == reversed);
>        poly_int64 poly_off;
> -      rtx base = ldp_strip_offset (cand_mems[i], &poly_off);
> +      rtx base = pair_mem_strip_offset (cand_mems[i], &poly_off);
>        if (GET_RTX_CLASS (GET_CODE (XEXP (cand_mems[i], 0))) == RTX_AUTOINC)
>  	writeback |= (1 << i);
>  
> @@ -2373,7 +2497,7 @@ get_viable_bases (insn_info *insns[2],
>  	continue;
>  
>        // Punt on accesses relative to eliminable regs.  See the comment in
> -      // ldp_bb_info::track_access for a detailed explanation of this.
> +      // pair_fusion_bb_info::track_access for a detailed explanation of this.
>        if (!reload_completed
>  	  && (REGNO (base) == FRAME_POINTER_REGNUM
>  	      || REGNO (base) == ARG_POINTER_REGNUM))
> @@ -2397,7 +2521,11 @@ get_viable_bases (insn_info *insns[2],
>        if (!is_lower)
>  	base_off--;
>  
> -      if (base_off < LDP_MIN_IMM || base_off > LDP_MAX_IMM)
> +      pair_fusion *pfuse;
> +      aarch64_pair_fusion derived;
> +      pfuse = &derived;

Again, please don't do this but instead use the same instance
everywhere.

> +
> +      if (pfuse->pair_mem_in_range_p (base_off))
>  	continue;
>  
>        use_info *use = find_access (insns[i]->uses (), REGNO (base));
> @@ -2454,12 +2582,12 @@ get_viable_bases (insn_info *insns[2],
>  }
>  
>  // Given two adjacent memory accesses of the same size, I1 and I2, try
> -// and see if we can merge them into a ldp or stp.
> +// and see if we can merge them into a paired accesses load and store.

"into a load pair or store pair" or "into a paired access" if you
prefer.

>  //
>  // ACCESS_SIZE gives the (common) size of a single access, LOAD_P is true
>  // if the accesses are both loads, otherwise they are both stores.
>  bool
> -ldp_bb_info::try_fuse_pair (bool load_p, unsigned access_size,
> +pair_fusion_bb_info::try_fuse_pair (bool load_p, unsigned access_size,
>  			    insn_info *i1, insn_info *i2)
>  {
>    if (dump_file)
> @@ -2494,7 +2622,7 @@ ldp_bb_info::try_fuse_pair (bool load_p, unsigned access_size,
>      {
>        if (dump_file)
>  	fprintf (dump_file,
> -		 "punting on ldp due to reg conflcits (%d,%d)\n",
> +		 "punting on pair mem load  due to reg conflcits (%d,%d)\n",
>  		 insns[0]->uid (), insns[1]->uid ());
>        return false;
>      }
> @@ -2843,7 +2971,7 @@ debug (const insn_list_t &l)
>  // we can't re-order them anyway, so provided earlier passes have cleaned up
>  // redundant loads, we shouldn't miss opportunities by doing this.
>  void
> -ldp_bb_info::merge_pairs (insn_list_t &left_list,
> +pair_fusion_bb_info::merge_pairs (insn_list_t &left_list,
>  			  insn_list_t &right_list,
>  			  bool load_p,
>  			  unsigned access_size)
> @@ -2890,8 +3018,8 @@ ldp_bb_info::merge_pairs (insn_list_t &left_list,
>  // of accesses.  If we find two sets of adjacent accesses, call
>  // merge_pairs.
>  void
> -ldp_bb_info::transform_for_base (int encoded_lfs,
> -				 access_group &group)
> +pair_fusion_bb_info::transform_for_base (int encoded_lfs,
> +					 access_group &group)
>  {
>    const auto lfs = decode_lfs (encoded_lfs);
>    const unsigned access_size = lfs.size;
> @@ -2909,7 +3037,7 @@ ldp_bb_info::transform_for_base (int encoded_lfs,
>  		       access.cand_insns,
>  		       lfs.load_p,
>  		       access_size);
> -	  skip_next = access.cand_insns.empty ();
> +	  skip_next = bb_state->cand_insns_empty_p (access.cand_insns);

As above, why is this needed?

>  	}
>        prev_access = &access;
>      }
> @@ -2919,7 +3047,7 @@ ldp_bb_info::transform_for_base (int encoded_lfs,
>  // and remove all the tombstone insns, being sure to reparent any uses
>  // of mem to previous defs when we do this.
>  void
> -ldp_bb_info::cleanup_tombstones ()
> +pair_fusion_bb_info::cleanup_tombstones ()
>  {
>    // No need to do anything if we didn't emit a tombstone insn for this BB.
>    if (!m_emitted_tombstone)
> @@ -2947,7 +3075,7 @@ ldp_bb_info::cleanup_tombstones ()
>  
>  template<typename Map>
>  void
> -ldp_bb_info::traverse_base_map (Map &map)
> +pair_fusion_bb_info::traverse_base_map (Map &map)
>  {
>    for (auto kv : map)
>      {
> @@ -2958,7 +3086,7 @@ ldp_bb_info::traverse_base_map (Map &map)
>  }
>  
>  void
> -ldp_bb_info::transform ()
> +pair_fusion_bb_info::transform ()
>  {
>    traverse_base_map (expr_map);
>    traverse_base_map (def_map);
> @@ -3167,14 +3295,13 @@ try_promote_writeback (insn_info *insn)
>  // for load/store candidates.  If running after RA, also try and promote
>  // non-writeback pairs to use writeback addressing.  Then try to fuse
>  // candidates into pairs.
> -void ldp_fusion_bb (bb_info *bb)
> +void pair_fusion::ldp_fusion_bb (bb_info *bb)
>  {
> -  const bool track_loads
> -    = aarch64_tune_params.ldp_policy_model != AARCH64_LDP_STP_POLICY_NEVER;
> -  const bool track_stores
> -    = aarch64_tune_params.stp_policy_model != AARCH64_LDP_STP_POLICY_NEVER;
> +  const bool track_loads = track_load_p ();
> +  const bool track_stores = track_store_p ();
>  
> -  ldp_bb_info bb_state (bb);

This:

> +  aarch64_pair_fusion derived;

can be deleted and then:

> +  pair_fusion_bb_info bb_info (bb, &derived);

can just be:

  pair_fusion_bb_info bb_info (bb, this);

(or you can pass *this if you make bb_info take a reference).

I don't think there's a particular need to change the variable name
(bb_state -> bb_info).  I chose the former because it doesn't clash
with the RTL-SSA structure of the same name as the latter.

>  
>    for (auto insn : bb->nondebug_insns ())
>      {
> @@ -3184,31 +3311,31 @@ void ldp_fusion_bb (bb_info *bb)
>  	continue;
>  
>        rtx pat = PATTERN (rti);
> -      if (reload_completed
> -	  && aarch64_ldp_writeback > 1
> -	  && GET_CODE (pat) == PARALLEL
> -	  && XVECLEN (pat, 0) == 2)
> +      if (pair_mem_promote_writeback_p (pat))
>  	try_promote_writeback (insn);

It looks like try_promote_writeback itself will need some further work
to make it target-independent.  I suppose this check:

  auto rti = insn->rtl ();
  const auto attr = get_attr_ldpstp (rti);
  if (attr == LDPSTP_NONE)
    return;

  bool load_p = (attr == LDPSTP_LDP);
  gcc_checking_assert (load_p || attr == LDPSTP_STP);

will need to become part of the pair_mem_promote_writeback_p hook that you
added, potentially changing it to return a boolean for load_p.

Then I guess we will need hooks for destructuring the pair insn and
another hook to wrap aarch64_gen_writeback_pair.
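
One possible shape for the reworked hook (just a sketch):

  // Return true if INSN is a pair insn whose writeback promotion we
  // should attempt, setting LOAD_P according to whether it loads.
  virtual bool pair_mem_promote_writeback_p (insn_info *insn, rtx pat,
					     bool &load_p) = 0;

where the aarch64 implementation performs the get_attr_ldpstp check
quoted above.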

>  
>        if (GET_CODE (pat) != SET)
>  	continue;
>  
>        if (track_stores && MEM_P (XEXP (pat, 0)))
> -	bb_state.track_access (insn, false, XEXP (pat, 0));
> +	bb_info.track_access (insn, false, XEXP (pat, 0));
>        else if (track_loads && MEM_P (XEXP (pat, 1)))
> -	bb_state.track_access (insn, true, XEXP (pat, 1));
> +	bb_info.track_access (insn, true, XEXP (pat, 1));
>      }
>  
> -  bb_state.transform ();
> -  bb_state.cleanup_tombstones ();
> +  bb_info.transform ();
> +  bb_info.cleanup_tombstones ();
>  }
>  
>  void ldp_fusion ()
>  {
>    ldp_fusion_init ();
> +  pair_fusion *pfuse;
> +  aarch64_pair_fusion derived;
> +  pfuse = &derived;

This is indeed the one place where I think it is acceptable to
instantiate aarch64_pair_fusion.  But again there's no need to create a
pointer to the parent class, just call any function you like directly.

>  
>    for (auto bb : crtl->ssa->bbs ())
> -    ldp_fusion_bb (bb);
> +    pfuse->ldp_fusion_bb (bb);

I think even the code to iterate over bbs should itself be a member
function of pair_fusion (say "run") and then that becomes part of the
generic code.

So this function would just become:

aarch64_pair_fusion pass;
pass.run ();

and could be inlined into the caller.

Perhaps you could also add an early return like:

  if (!track_loads_p () && !track_stores_p ())
    return;

in pair_fusion::run () and then remove the corresponding code from
pass_ldp_fusion::gate?
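
Roughly (a sketch; the bb function is still due a rename):

  void
  pair_fusion::run ()
  {
    if (!track_loads_p () && !track_stores_p ())
      return;

    for (auto bb : crtl->ssa->bbs ())
      ldp_fusion_bb (bb);
  }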

>  
>    ldp_fusion_destroy ();
>  }
> -- 
> 2.39.3
> 

Thanks,
Alex
Ajit Agarwal April 9, 2024, noon UTC | #2
On 05/04/24 10:03 pm, Alex Coplan wrote:
> On 05/04/2024 13:53, Ajit Agarwal wrote:
>> Hello Alex/Richard:
>>
>> All review comments are incorporated.
> 
> Thanks, I was kind-of expecting you to also send the renaming patch as a
> preparatory patch as we discussed.
> 
> Sorry for another meta comment, but: I think the reason that the Linaro
> CI isn't running tests on your patches is actually because you're
> sending 1/3 of a series but not sending the rest of the series.
> 
> So please can you either send this as an individual preparatory patch
> (not marked as a series) or if you're going to send a series (e.g. with
> a preparatory rename patch as 1/2 and this as 2/2) then send the entire
> series when you make updates.  That way the CI should test your patches,
> which would be helpful.
>

Addressed.
 
>>
>> The common infrastructure of the load/store pair fusion pass is divided
>> into target-independent and target-dependent code.
>>
>> The target-independent code is generic code that uses pure virtual
>> functions as the interface between the target-independent and
>> target-dependent parts.
>>
>> The target-dependent code implements those pure virtual functions for
>> the aarch64 target and calls into the target-independent code.
>>
>> Thanks & Regards
>> Ajit
>>
>>
>> aarch64: Place target independent and dependent changed code in one file
>>
>> The common infrastructure of the load/store pair fusion pass is divided
>> into target-independent and target-dependent code.
>>
>> The target-independent code is generic code that uses pure virtual
>> functions as the interface between the target-independent and
>> target-dependent parts.
>>
>> The target-dependent code implements those pure virtual functions for
>> the aarch64 target and calls into the target-independent code.
>>
>> 2024-04-06  Ajit Kumar Agarwal  <aagarwa1@linux.ibm.com>
>>
>> gcc/ChangeLog:
>>
>> 	* config/aarch64/aarch64-ldp-fusion.cc: Place target
>> 	independent and dependent changed code.
> 
> You're going to need a proper ChangeLog eventually, but I guess there's
> no need for that right now.
> 
>> ---
>>  gcc/config/aarch64/aarch64-ldp-fusion.cc | 371 +++++++++++++++--------
>>  1 file changed, 249 insertions(+), 122 deletions(-)
>>
>> diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc b/gcc/config/aarch64/aarch64-ldp-fusion.cc
>> index 22ed95eb743..cb21b514ef7 100644
>> --- a/gcc/config/aarch64/aarch64-ldp-fusion.cc
>> +++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc
>> @@ -138,8 +138,122 @@ struct alt_base
>>    poly_int64 offset;
>>  };
>>  
>> +// Virtual base class for load/store walkers used in alias analysis.
>> +struct alias_walker
>> +{
>> +  virtual bool conflict_p (int &budget) const = 0;
>> +  virtual insn_info *insn () const = 0;
>> +  virtual bool valid () const  = 0;
> 
> Heh, looking at this made me realise there is a whitespace bug here in
> the existing code (double space after const).  Sorry about that!  I'll
> push an obvious fix for that.
> 
>> +  virtual void advance () = 0;
>> +};
>> +
>> +struct pair_fusion {
>> +
>> +  pair_fusion () {};
> 
> This ctor looks pointless at the moment.  Perhaps instead we could put
> the contents of ldp_fusion_init in here and then delete that function?
> 

Addressed.

>> +  virtual bool fpsimd_op_p (rtx reg_op, machine_mode mem_mode,
>> +			       bool load_p) = 0;
> 
> Please can we have comments above each of these virtual functions
> describing any parameters, what the purpose of the hook is, and the
> interpretation of the return value?  This will serve as the
> documentation for other targets that want to make use of the pass.
> 
> It might make sense to have a default-false implementation for
> fpsimd_op_p, especially if you don't want to make use of this bit for
> rs6000.
>

Addressed.
 
>> +
>> +  virtual bool pair_operand_mode_ok_p (machine_mode mode) = 0;
>> +  virtual bool pair_trailing_writeback_p () = 0;
> 
> Sorry for the run-around, but: I think this and
> handle_writeback_opportunities () should be the same function, either
> returning an enum or taking an enum and returning a boolean.
> 
> At a minimum they should have more similar sounding names.
> 

Addressed.

>> +  virtual bool pair_reg_operand_ok_p (bool load_p, rtx reg_op,
>> +				      machine_mode mem_mode) = 0;
>> +  virtual int pair_mem_alias_check_limit () = 0;
>> +  virtual bool handle_writeback_opportunities () = 0 ;
>> +  virtual bool pair_mem_ok_with_policy (rtx first_mem, bool load_p,
>> +					machine_mode mode) = 0;
>> +  virtual rtx gen_mem_pair (rtx *pats,  rtx writeback,
> 
> Nit: excess whitespace after pats,
> 
>> +			    bool load_p) = 0;
>> +  virtual bool pair_mem_promote_writeback_p (rtx pat) = 0;
>> +  virtual bool track_load_p () = 0;
>> +  virtual bool track_store_p () = 0;
> 
> I think it would probably make more sense for these two to have
> default-true implementations rather than being pure virtual functions.
> 
> Also, sorry for the bikeshedding, but please can we keep the plural
> names (so track_loads_p and track_stores_p).
> 
>> +  virtual bool cand_insns_empty_p (std::list<insn_info *> &insns) = 0;
> 
> Why does this need to be virtualised?  I would expect it to just be
> insns.empty () on all targets.
> 
>> +  virtual bool pair_mem_in_range_p (HOST_WIDE_INT off) = 0;
>> +  void ldp_fusion_bb (bb_info *bb);
> 
> Just to check, are you planning to rename this function in a separate
> patch?
> 

Yes.
>> +
>> +  ~pair_fusion() { }
> 
> As with the ctor, perhaps we could put the contents of ldp_fusion_destroy
> in here and then delete that function?
> 

Addressed.
>> +};
>> +
>> +struct aarch64_pair_fusion : public pair_fusion
>> +{
>> +public:
> 

Addressed.
> Nit: there shouldn't be a need for this as public is already the default
> for structs.
> 
>> +  aarch64_pair_fusion () : pair_fusion () {};
> 
> This ctor looks pointless, the default one should do, so can we remove
> this?
> 

Addressed.
>> +  bool fpsimd_op_p (rtx reg_op, machine_mode mem_mode,
>> +		    bool load_p) override final
>> +  {
>> +    const bool fpsimd_op_p
>> +      = reload_completed
>> +      ? (REG_P (reg_op) && FP_REGNUM_P (REGNO (reg_op)))
>> +      : (GET_MODE_CLASS (mem_mode) != MODE_INT
>> +	 && (load_p || !aarch64_const_zero_rtx_p (reg_op)));
>> +    return fpsimd_op_p;
> 
> Can this just directly return the boolean expression instead of storing
> it into a variable first?
> 
Addressed.
>> +  }
>> +
>> +  bool pair_mem_promote_writeback_p (rtx pat)
>> +  {
>> +     if (reload_completed
>> +	 && aarch64_ldp_writeback > 1
>> +	 && GET_CODE (pat) == PARALLEL
>> +	 && XVECLEN (pat, 0) == 2)
>> +	return true;
>> +
>> +     return false;
>> +  }
>> +
>> +  bool pair_mem_ok_with_policy (rtx first_mem, bool load_p,
>> +				machine_mode mode)
>> +  {
>> +    return aarch64_mem_ok_with_ldpstp_policy_model (first_mem,
>> +						     load_p,
>> +						     mode);
>> +  }
>> +  bool pair_operand_mode_ok_p (machine_mode mode);
> 
> Why have you implemented this (one-line function) out-of-line but the
> others are implemented inline?  Please be consistent.
> 

Addressed.
>> +
>> +  rtx gen_mem_pair (rtx *pats, rtx writeback, bool load_p);
>> +
>> +  bool pair_trailing_writeback_p  ()
>> +  {
>> +    return aarch64_ldp_writeback > 1;
>> +  }
>> +  bool pair_reg_operand_ok_p (bool load_p, rtx reg_op,
>> +			      machine_mode mem_mode)
>> +  {
>> +    return (load_p
>> +	     ? aarch64_ldp_reg_operand (reg_op, mem_mode)
>> +	     : aarch64_stp_reg_operand (reg_op, mem_mode));
>> +  }
>> +  int pair_mem_alias_check_limit ()
>> +  {
>> +    return aarch64_ldp_alias_check_limit;
>> +  }
>> +  bool handle_writeback_opportunities ()
>> +  {
>> +    return aarch64_ldp_writeback;
>> +  }
>> +  bool track_load_p ()
>> +  {
>> +    const bool track_loads
>> +      = aarch64_tune_params.ldp_policy_model != AARCH64_LDP_STP_POLICY_NEVER;
>> +    return track_loads;
>> +  }
> 

Addressed.
> As above, can this just return the boolean expression directly without
> the temporary?
> 
>> +  bool track_store_p ()
>> +  {
>> +    const bool track_stores
>> +      = aarch64_tune_params.stp_policy_model != AARCH64_LDP_STP_POLICY_NEVER;
>> +    return track_stores;
>> +  }
> 

Addressed.
> Likewise.
> 
>> +  bool cand_insns_empty_p (std::list<insn_info *> &insns)
>> +  {
>> +    return insns.empty();
>> +  }
> 
> As mentioned on the declaration, I don't see why this needs to be
> virtualised.
> 
>> +  bool pair_mem_in_range_p (HOST_WIDE_INT off)
>> +  {
>> +    return (off < LDP_MIN_IMM || off > LDP_MAX_IMM);
>> +  }
> 
> Hmm, this looks to test exactly the opposite condition from what the
> name suggests, i.e. the implementation looks like what I would expect
> for a function called pair_mem_out_of_range_p.
> 
> Please can you invert the condition and fix up callers appropriately?
> 
> This suggests you're not really thinking about the code you're writing;
> please try to think a bit more carefully before posting patches.
> 

Sorry, my mistake; I overlooked it.
>> +};
>> +
>> +
>>  // State used by the pass for a given basic block.
>> -struct ldp_bb_info
>> +struct pair_fusion_bb_info
>>  {
>>    using def_hash = nofree_ptr_hash<def_info>;
>>    using expr_key_t = pair_hash<tree_operand_hash, int_hash<int, -1, -2>>;
>> @@ -160,14 +274,17 @@ struct ldp_bb_info
>>  
>>    static const size_t obstack_alignment = sizeof (void *);
>>    bb_info *m_bb;
>> +  pair_fusion *bb_state;
> 

Addressed.

> Can we call this m_pass instead of bb_state?  The pair_fusion class is
> explicitly at the pass level rather than the bb level.
> 
> Also I think this (and indeed m_bb, although that's pre-existing) could
> be private members.  Not sure there's a good reason for making them
> public.
>

Addressed.
 
> It might be slightly nicer to use a reference for m_pass unless there's
> a good reason to use a pointer, although I don't feel strongly about
> this.
> 
>>  
>> -  ldp_bb_info (bb_info *bb) : m_bb (bb), m_emitted_tombstone (false)
>> +  pair_fusion_bb_info (bb_info *bb,
>> +		      aarch64_pair_fusion *d) : m_bb (bb),
> 
> This generic class needs to take the generic pass type, i.e. just a
> pair_fusion * rather than an aarch64_pair_fusion *.
> 
>> +		      bb_state (d), m_emitted_tombstone (false)
>>    {
>>      obstack_specify_allocation (&m_obstack, OBSTACK_CHUNK_SIZE,
>>  				obstack_alignment, obstack_chunk_alloc,
>>  				obstack_chunk_free);
>>    }
>> -  ~ldp_bb_info ()
>> +  ~pair_fusion_bb_info ()
>>    {
>>      obstack_free (&m_obstack, nullptr);
>>  
>> @@ -177,10 +294,32 @@ struct ldp_bb_info
>>  	bitmap_obstack_release (&m_bitmap_obstack);
>>        }
>>    }
>> +  void track_access (insn_info *, bool load, rtx mem);
>> +  void transform ();
>> +  void cleanup_tombstones ();
> 
> I think most (or indeed all?) of the changes here shouldn't be needed.
> AFAICS, there shouldn't be a need to:
>  - drop inline from these functions.
>  - change the accessibility of any members.
> 

Addressed.
> Please can you drop these changes or explain why they're needed (from
> here until the end of the class).
> 
>> +  void merge_pairs (insn_list_t &, insn_list_t &,
>> +		    bool load_p, unsigned access_size);
>> +  void transform_for_base (int load_size, access_group &group);
>> +
>> +  bool try_fuse_pair (bool load_p, unsigned access_size,
>> +			     insn_info *i1, insn_info *i2);
>> +
>> +  bool fuse_pair (bool load_p, unsigned access_size,
>> +		  int writeback,
>> +		  insn_info *i1, insn_info *i2,
>> +		  base_cand &base,
>> +		  const insn_range_info &move_range);
>>  
>> -  inline void track_access (insn_info *, bool load, rtx mem);
>> -  inline void transform ();
>> -  inline void cleanup_tombstones ();
>> +  void do_alias_analysis (insn_info *alias_hazards[4],
>> +			  alias_walker *walkers[4],
>> +			  bool load_p);
>> +
>> +  void track_tombstone (int uid);
>> +
>> +  bool track_via_mem_expr (insn_info *, rtx mem, lfs_fields lfs);
>> +
>> +  template<typename Map>
>> +    void traverse_base_map (Map &map);
>>  
>>  private:
>>    obstack m_obstack;
>> @@ -191,30 +330,32 @@ private:
>>    bool m_emitted_tombstone;
>>  
>>    inline splay_tree_node<access_record *> *node_alloc (access_record *);
>> +};
>>  
>> -  template<typename Map>
>> -  inline void traverse_base_map (Map &map);
>> -  inline void transform_for_base (int load_size, access_group &group);
>> -
>> -  inline void merge_pairs (insn_list_t &, insn_list_t &,
>> -			   bool load_p, unsigned access_size);
>> -
>> -  inline bool try_fuse_pair (bool load_p, unsigned access_size,
>> -			     insn_info *i1, insn_info *i2);
>> -
>> -  inline bool fuse_pair (bool load_p, unsigned access_size,
>> -			 int writeback,
>> -			 insn_info *i1, insn_info *i2,
>> -			 base_cand &base,
>> -			 const insn_range_info &move_range);
>> -
>> -  inline void track_tombstone (int uid);
>> +rtx aarch64_pair_fusion::gen_mem_pair (rtx *pats,
> 
> Style nit: newline after rtx.
> 
Addressed.

>> +		  rtx writeback,
>> +		  bool load_p)
>> +  {
>> +    rtx pair_pat;
>>  
>> -  inline bool track_via_mem_expr (insn_info *, rtx mem, lfs_fields lfs);
>> -};
>> +    if (writeback)
>> +      {
>> +	auto patvec = gen_rtvec (3, writeback, pats[0], pats[1]);
>> +	pair_pat = gen_rtx_PARALLEL (VOIDmode, patvec);
>> +      }
>> +    else if (load_p)
>> +      pair_pat = aarch64_gen_load_pair (XEXP (pats[0], 0),
>> +					XEXP (pats[1], 0),
>> +					XEXP (pats[0], 1));
>> +    else
>> +      pair_pat = aarch64_gen_store_pair (XEXP (pats[0], 0),
>> +					 XEXP (pats[0], 1),
>> +					 XEXP (pats[1], 1));
>> +     return pair_pat;
> 
> Can the assignments just be direct returns here, please?  Then
> get rid of the temporary.
>

Addressed.
 
>> +  }
>>  
>>  splay_tree_node<access_record *> *
>> -ldp_bb_info::node_alloc (access_record *access)
>> +pair_fusion_bb_info::node_alloc (access_record *access)
>>  {
>>    using T = splay_tree_node<access_record *>;
>>    void *addr = obstack_alloc (&m_obstack, sizeof (T));
>> @@ -262,7 +403,7 @@ drop_writeback (rtx mem)
>>  // RTX_AUTOINC addresses.  The interface is like strip_offset except we take a
>>  // MEM so that we know the mode of the access.
>>  static rtx
>> -ldp_strip_offset (rtx mem, poly_int64 *offset)
>> +pair_mem_strip_offset (rtx mem, poly_int64 *offset)
> 
> I thought we discussed splitting the renaming into a separate patch?
> 
>>  {
>>    rtx addr = XEXP (mem, 0);
>>  
>> @@ -332,6 +473,12 @@ ldp_operand_mode_ok_p (machine_mode mode)
>>    return reload_completed || mode != TImode;
>>  }
>>  
>> +bool
>> +aarch64_pair_fusion::pair_operand_mode_ok_p (machine_mode mode)
>> +{
>> +  return ldp_operand_mode_ok_p (mode);
>> +}
>> +
>>  // Given LFS (load_p, fpsimd_p, size) fields in FIELDS, encode these
>>  // into an integer for use as a hash table key.
>>  static int
>> @@ -396,7 +543,7 @@ access_group::track (Alloc alloc_node, poly_int64 offset, insn_info *insn)
>>  // MEM_EXPR base (i.e. a tree decl) relative to which we can track the access.
>>  // LFS is used as part of the key to the hash table, see track_access.
>>  bool
>> -ldp_bb_info::track_via_mem_expr (insn_info *insn, rtx mem, lfs_fields lfs)
>> +pair_fusion_bb_info::track_via_mem_expr (insn_info *insn, rtx mem, lfs_fields lfs)
> 
> Again, please do any renaming in a separate patch.
> 

Addressed.
>>  {
>>    if (!MEM_EXPR (mem) || !MEM_OFFSET_KNOWN_P (mem))
>>      return false;
>> @@ -412,9 +559,10 @@ ldp_bb_info::track_via_mem_expr (insn_info *insn, rtx mem, lfs_fields lfs)
>>    const machine_mode mem_mode = GET_MODE (mem);
>>    const HOST_WIDE_INT mem_size = GET_MODE_SIZE (mem_mode).to_constant ();
>>  
>> -  // Punt on misaligned offsets.  LDP/STP instructions require offsets to be a
>> -  // multiple of the access size, and we believe that misaligned offsets on
>> -  // MEM_EXPR bases are likely to lead to misaligned offsets w.r.t. RTL bases.
>> +  // Punt on misaligned offsets.  Paired memory access instructions require
>> +  // offsets to be a multiple of the access size, and we believe that
>> +  // misaligned offsets on MEM_EXPR bases are likely to lead to misaligned
>> +  // offsets w.r.t. RTL bases.
>>    if (!multiple_p (offset, mem_size))
>>      return false;
>>  
>> @@ -438,46 +586,38 @@ ldp_bb_info::track_via_mem_expr (insn_info *insn, rtx mem, lfs_fields lfs)
>>  }
>>  
>>  // Main function to begin pair discovery.  Given a memory access INSN,
>> -// determine whether it could be a candidate for fusing into an ldp/stp,
>> +// determine whether it could be a candidate for fusing into an pair mem,
> 
> s/an pair mem/a paired access/
> 
>>  // and if so, track it in the appropriate data structure for this basic
>>  // block.  LOAD_P is true if the access is a load, and MEM is the mem
>>  // rtx that occurs in INSN.
>>  void
>> -ldp_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
>> +pair_fusion_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
>>  {
>>    // We can't combine volatile MEMs, so punt on these.
>>    if (MEM_VOLATILE_P (mem))
>>      return;
>>  
>> -  // Ignore writeback accesses if the param says to do so.
>> -  if (!aarch64_ldp_writeback
>> +  // Ignore writeback accesses if the param says to do so
> 

Addressed.
> This comment needs updating, I guess something like "if the hook says to
> do so".  But also please don't delete the full stop.
> 
>> +  if (!bb_state->handle_writeback_opportunities ()
>>        && GET_RTX_CLASS (GET_CODE (XEXP (mem, 0))) == RTX_AUTOINC)
>>      return;
>>  
>>    const machine_mode mem_mode = GET_MODE (mem);
>> -  if (!ldp_operand_mode_ok_p (mem_mode))
>> +
> 
> I don't think we need the extra whitespace here.
> 
>> +  if (!bb_state->pair_operand_mode_ok_p (mem_mode))
>>      return;
>>  
>>    rtx reg_op = XEXP (PATTERN (insn->rtl ()), !load_p);
>>  
>> -  // Ignore the access if the register operand isn't suitable for ldp/stp.
>> -  if (load_p
>> -      ? !aarch64_ldp_reg_operand (reg_op, mem_mode)
>> -      : !aarch64_stp_reg_operand (reg_op, mem_mode))
>> +  if (!bb_state->pair_reg_operand_ok_p (load_p, reg_op, mem_mode))
>>      return;
>> -
>>    // We want to segregate FP/SIMD accesses from GPR accesses.
>>    //
>>    // Before RA, we use the modes, noting that stores of constant zero
>>    // operands use GPRs (even in non-integer modes).  After RA, we use
>>    // the hard register numbers.
>> -  const bool fpsimd_op_p
>> -    = reload_completed
>> -    ? (REG_P (reg_op) && FP_REGNUM_P (REGNO (reg_op)))
>> -    : (GET_MODE_CLASS (mem_mode) != MODE_INT
>> -       && (load_p || !aarch64_const_zero_rtx_p (reg_op)));
>> -
>> -  // Note ldp_operand_mode_ok_p already rejected VL modes.
>> +  const bool fpsimd_op_p = bb_state->fpsimd_op_p (reg_op, mem_mode, load_p);
>> +  // Note pair_operand_mode_ok_p already rejected VL modes.
>>    const HOST_WIDE_INT mem_size = GET_MODE_SIZE (mem_mode).to_constant ();
>>    const lfs_fields lfs = { load_p, fpsimd_op_p, mem_size };
>>  
>> @@ -487,7 +627,7 @@ ldp_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
>>    poly_int64 mem_off;
>>    rtx addr = XEXP (mem, 0);
>>    const bool autoinc_p = GET_RTX_CLASS (GET_CODE (addr)) == RTX_AUTOINC;
>> -  rtx base = ldp_strip_offset (mem, &mem_off);
>> +  rtx base = pair_mem_strip_offset (mem, &mem_off);
> 
> Again, let's split the renaming off into a separate patch.
> 

Addressed.
>>    if (!REG_P (base))
>>      return;
>>  
>> @@ -506,8 +646,8 @@ ldp_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
>>    // elimination offset pre-RA, we should postpone forming pairs on such
>>    // accesses until after RA.
>>    //
>> -  // As it stands, addresses with offsets in range for LDR but not
>> -  // in range for LDP/STP are currently reloaded inefficiently,
>> +  // As it stands, addresses in range for an individual load/store but not
>> +  // for a paired access are currently reloaded inefficiently,
>>    // ending up with a separate base register for each pair.
>>    //
>>    // In theory LRA should make use of
>> @@ -519,8 +659,8 @@ ldp_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
>>    // that calls targetm.legitimize_address_displacement.
>>    //
>>    // So for now, it's better to punt when we can't be sure that the
>> -  // offset is in range for LDP/STP.  Out-of-range cases can then be
>> -  // handled after RA by the out-of-range LDP/STP peepholes.  Eventually, it
>> +  // offset is in range for paired access.  Out-of-range cases can then be
>> +  // handled after RA by the out-of-range PAIR MEM  peepholes.  Eventually, it
>>    // would be nice to handle known out-of-range opportunities in the
>>    // pass itself (for stack accesses, this would be in the post-RA pass).
>>    if (!reload_completed
>> @@ -573,8 +713,8 @@ ldp_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
>>  	gcc_unreachable (); // Base defs should be unique.
>>      }
>>  
>> -  // Punt on misaligned offsets.  LDP/STP require offsets to be a multiple of
>> -  // the access size.
>> +  // Punt on misaligned offsets.  Paired memory accesses require offsets
>> +  // to be a multiple of the access size.
>>    if (!multiple_p (mem_off, mem_size))
>>      return;
>>  
>> @@ -614,7 +754,7 @@ static bool no_ignore (insn_info *) { return false; }
>>  // making use of alias disambiguation.
>>  static insn_info *
>>  latest_hazard_before (insn_info *insn, rtx *ignore,
>> -		      insn_info *ignore_insn = nullptr)
>> +		      insn_info *ignore_insn = 0)
> 
> Is this change needed?
> 

Addressed.
>>  {
>>    insn_info *result = nullptr;
>>  
>> @@ -1150,7 +1290,7 @@ extract_writebacks (bool load_p, rtx pats[2], int changed)
>>        const bool autoinc_p = GET_RTX_CLASS (GET_CODE (addr)) == RTX_AUTOINC;
>>  
>>        poly_int64 offset;
>> -      rtx this_base = ldp_strip_offset (mem, &offset);
>> +      rtx this_base = pair_mem_strip_offset (mem, &offset);
>>        gcc_assert (REG_P (this_base));
>>        if (base_reg)
>>  	gcc_assert (rtx_equal_p (base_reg, this_base));
>> @@ -1286,7 +1426,11 @@ find_trailing_add (insn_info *insns[2],
>>  
>>    off_hwi /= access_size;
>>  
>> -  if (off_hwi < LDP_MIN_IMM || off_hwi > LDP_MAX_IMM)
>> +  pair_fusion *pfuse;
>> +  aarch64_pair_fusion derived;
>> +  pfuse = &derived;
> 
> Huh? We'll want to use the same instance of the pass structure (pair_fusion)
> that gets created at the top level, so that should get passed around everywhere.
> 
> In the case of find_trailing_add we probably don't want to add any more
> parameters (as Segher noted there are already too many), so perhaps the
> easiest thing would be to make find_trailing_add itself a (non-virtual)
> member function of pair_fusion.
> 
> Then here we can just query pair_mem_in_range_p directly.
> 
> For try_promote_writeback itself you can either make it a non-virtual
> member function of pair_fusion or just pass a reference to the pass
> object in when you call it.
> 
> Note also that you don't have to create a pointer of the parent type in
> order to call a virtual function on a derived object, you can just call
> it directly.
> 

Addressed.
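
To be concrete: find_trailing_add is now a (non-virtual) member function
of pair_fusion, keeping its existing parameter list, so it can query the
hook directly, and try_promote_writeback is likewise reached via the pass
object.  Roughly (a sketch only, names may still change in the next
version):

  insn_info *
  pair_fusion::find_trailing_add (insn_info *insns[2],
                                  const insn_range_info &pair_range,
                                  int initial_writeback,
                                  rtx *writeback_effect,
                                  def_info **add_def,
                                  def_info *base_def,
                                  poly_int64 initial_offset,
                                  unsigned access_size)

so the offset check below becomes a direct call to pair_mem_in_range_p.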
>> +
>> +  if (pfuse->pair_mem_in_range_p (off_hwi))
>>      return nullptr;
> 
> Note of course this condition needs inverting when you invert the
> implementation.
> 

Addressed.
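
With the implementation inverted as suggested, the check here becomes
(sketch):

  if (!pair_mem_in_range_p (off_hwi))
    return nullptr;

with the aarch64 implementation now reading:

  bool pair_mem_in_range_p (HOST_WIDE_INT off)
  {
    return off >= LDP_MIN_IMM && off <= LDP_MAX_IMM;
  }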
>>  
>>    auto dump_prefix = [&]()
>> @@ -1328,7 +1472,7 @@ find_trailing_add (insn_info *insns[2],
>>  // We just emitted a tombstone with uid UID, track it in a bitmap for
>>  // this BB so we can easily identify it later when cleaning up tombstones.
>>  void
>> -ldp_bb_info::track_tombstone (int uid)
>> +pair_fusion_bb_info::track_tombstone (int uid)
> 
> Again, please split the renaming off.  Same with other instances below.
> 
>>  {
>>    if (!m_emitted_tombstone)
>>      {
>> @@ -1528,7 +1672,7 @@ fixup_debug_uses (obstack_watermark &attempt,
>>  	  gcc_checking_assert (GET_RTX_CLASS (GET_CODE (XEXP (mem, 0)))
>>  			       == RTX_AUTOINC);
>>  
>> -	  base = ldp_strip_offset (mem, &offset);
>> +	  base = pair_mem_strip_offset (mem, &offset);
>>  	  gcc_checking_assert (REG_P (base) && REGNO (base) == base_regno);
>>  	}
>>        fixup_debug_use (attempt, use, def, base, offset);
>> @@ -1664,7 +1808,7 @@ fixup_debug_uses (obstack_watermark &attempt,
>>  // BASE gives the chosen base candidate for the pair and MOVE_RANGE is
>>  // a singleton range which says where to place the pair.
>>  bool
>> -ldp_bb_info::fuse_pair (bool load_p,
>> +pair_fusion_bb_info::fuse_pair (bool load_p,
>>  			unsigned access_size,
>>  			int writeback,
>>  			insn_info *i1, insn_info *i2,
>> @@ -1800,7 +1944,7 @@ ldp_bb_info::fuse_pair (bool load_p,
>>      {
>>        if (dump_file)
>>  	fprintf (dump_file,
>> -		 "  ldp: i%d has wb but subsequent i%d has non-wb "
>> +		 "  load pair: i%d has wb but subsequent i%d has non-wb "
>>  		 "update of base (r%d), dropping wb\n",
>>  		 insns[0]->uid (), insns[1]->uid (), base_regno);
>>        gcc_assert (writeback_effect);
>> @@ -1823,7 +1967,7 @@ ldp_bb_info::fuse_pair (bool load_p,
>>      }
>>  
>>    // If either of the original insns had writeback, but the resulting pair insn
>> -  // does not (can happen e.g. in the ldp edge case above, or if the writeback
>> +  // does not (can happen e.g. in the load pair edge case above, or if the writeback
>>    // effects cancel out), then drop the def(s) of the base register as
>>    // appropriate.
>>    //
>> @@ -1842,7 +1986,7 @@ ldp_bb_info::fuse_pair (bool load_p,
>>    // update of the base register and try and fold it in to make this into a
>>    // writeback pair.
>>    insn_info *trailing_add = nullptr;
>> -  if (aarch64_ldp_writeback > 1
>> +  if (bb_state->pair_trailing_writeback_p ()
>>        && !writeback_effect
>>        && (!load_p || (!refers_to_regno_p (base_regno, base_regno + 1,
>>  					 XEXP (pats[0], 0), nullptr)
>> @@ -1863,14 +2007,14 @@ ldp_bb_info::fuse_pair (bool load_p,
>>      }
>>  
>>    // Now that we know what base mem we're going to use, check if it's OK
>> -  // with the ldp/stp policy.
>> +  // with the pair mem policy.
>>    rtx first_mem = XEXP (pats[0], load_p);
>> -  if (!aarch64_mem_ok_with_ldpstp_policy_model (first_mem,
>> -						load_p,
>> -						GET_MODE (first_mem)))
>> +  if (!bb_state->pair_mem_ok_with_policy (first_mem,
>> +			  load_p,
>> +			  GET_MODE (first_mem)))
>>      {
>>        if (dump_file)
>> -	fprintf (dump_file, "punting on pair (%d,%d), ldp/stp policy says no\n",
>> +	fprintf (dump_file, "punting on pair (%d,%d), pair mem  policy says no\n",
>>  		 i1->uid (), i2->uid ());
>>        return false;
>>      }
>> @@ -1878,21 +2022,10 @@ ldp_bb_info::fuse_pair (bool load_p,
>>    rtx reg_notes = combine_reg_notes (first, second, load_p);
>>  
>>    rtx pair_pat;
>> -  if (writeback_effect)
>> -    {
>> -      auto patvec = gen_rtvec (3, writeback_effect, pats[0], pats[1]);
>> -      pair_pat = gen_rtx_PARALLEL (VOIDmode, patvec);
>> -    }
>> -  else if (load_p)
>> -    pair_pat = aarch64_gen_load_pair (XEXP (pats[0], 0),
>> -				      XEXP (pats[1], 0),
>> -				      XEXP (pats[0], 1));
>> -  else
>> -    pair_pat = aarch64_gen_store_pair (XEXP (pats[0], 0),
>> -				       XEXP (pats[0], 1),
>> -				       XEXP (pats[1], 1));
>>  
>> +  pair_pat = bb_state->gen_mem_pair (pats, writeback_effect, load_p);
>>    insn_change *pair_change = nullptr;
>> +
>>    auto set_pair_pat = [pair_pat,reg_notes](insn_change *change) {
>>      rtx_insn *rti = change->insn ()->rtl ();
>>      validate_unshare_change (rti, &PATTERN (rti), pair_pat, true);
>> @@ -2133,15 +2266,6 @@ load_modified_by_store_p (insn_info *load,
>>    return false;
>>  }
>>  
>> -// Virtual base class for load/store walkers used in alias analysis.
>> -struct alias_walker
>> -{
>> -  virtual bool conflict_p (int &budget) const = 0;
>> -  virtual insn_info *insn () const = 0;
>> -  virtual bool valid () const  = 0;
>> -  virtual void advance () = 0;
>> -};
>> -
>>  // Implement some common functionality used by both store_walker
>>  // and load_walker.
>>  template<bool reverse>
>> @@ -2259,13 +2383,13 @@ public:
>>  //
>>  // We try to maintain the invariant that if a walker becomes invalid, we
>>  // set its pointer to null.
>> -static void
>> -do_alias_analysis (insn_info *alias_hazards[4],
>> +void
>> +pair_fusion_bb_info::do_alias_analysis (insn_info *alias_hazards[4],
> 
> It would probably make more sense for this to be a (non-virtual) member
> function of the pair_fusion class, since it doesn't need to access the
> BB-specific state.
> 

Addressed.
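
It is now a (non-virtual) member of pair_fusion; apart from querying the
alias-check-limit hook directly, the body is unchanged (sketch):

  void
  pair_fusion::do_alias_analysis (insn_info *alias_hazards[4],
                                  alias_walker *walkers[4],
                                  bool load_p)
  {
    const int n_walkers = 2 + (2 * !load_p);
    int budget = pair_mem_alias_check_limit ();
    ...
  }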
>>  		   alias_walker *walkers[4],
>>  		   bool load_p)
>>  {
>>    const int n_walkers = 2 + (2 * !load_p);
>> -  int budget = aarch64_ldp_alias_check_limit;
>> +  int budget = bb_state->pair_mem_alias_check_limit ();
>>  
>>    auto next_walker = [walkers,n_walkers](int current) -> int {
>>      for (int j = 1; j <= n_walkers; j++)
>> @@ -2365,7 +2489,7 @@ get_viable_bases (insn_info *insns[2],
>>      {
>>        const bool is_lower = (i == reversed);
>>        poly_int64 poly_off;
>> -      rtx base = ldp_strip_offset (cand_mems[i], &poly_off);
>> +      rtx base = pair_mem_strip_offset (cand_mems[i], &poly_off);
>>        if (GET_RTX_CLASS (GET_CODE (XEXP (cand_mems[i], 0))) == RTX_AUTOINC)
>>  	writeback |= (1 << i);
>>  
>> @@ -2373,7 +2497,7 @@ get_viable_bases (insn_info *insns[2],
>>  	continue;
>>  
>>        // Punt on accesses relative to eliminable regs.  See the comment in
>> -      // ldp_bb_info::track_access for a detailed explanation of this.
>> +      // pair_fusion_bb_info::track_access for a detailed explanation of this.
>>        if (!reload_completed
>>  	  && (REGNO (base) == FRAME_POINTER_REGNUM
>>  	      || REGNO (base) == ARG_POINTER_REGNUM))
>> @@ -2397,7 +2521,11 @@ get_viable_bases (insn_info *insns[2],
>>        if (!is_lower)
>>  	base_off--;
>>  
>> -      if (base_off < LDP_MIN_IMM || base_off > LDP_MAX_IMM)
>> +      pair_fusion *pfuse;
>> +      aarch64_pair_fusion derived;
>> +      pfuse = &derived;
> 
> Again, please don't do this but instead use the same instance
> everywhere.
> 

Addressed.
>> +
>> +      if (pfuse->pair_mem_in_range_p (base_off))
>>  	continue;
>>  
>>        use_info *use = find_access (insns[i]->uses (), REGNO (base));
>> @@ -2454,12 +2582,12 @@ get_viable_bases (insn_info *insns[2],
>>  }
>>  
>>  // Given two adjacent memory accesses of the same size, I1 and I2, try
>> -// and see if we can merge them into a ldp or stp.
>> +// and see if we can merge them into a paired accesses load and store.
> 
> "into a load pair or store pair" or "into a paired access" if you
> prefer.
> 

Addressed.
>>  //
>>  // ACCESS_SIZE gives the (common) size of a single access, LOAD_P is true
>>  // if the accesses are both loads, otherwise they are both stores.
>>  bool
>> -ldp_bb_info::try_fuse_pair (bool load_p, unsigned access_size,
>> +pair_fusion_bb_info::try_fuse_pair (bool load_p, unsigned access_size,
>>  			    insn_info *i1, insn_info *i2)
>>  {
>>    if (dump_file)
>> @@ -2494,7 +2622,7 @@ ldp_bb_info::try_fuse_pair (bool load_p, unsigned access_size,
>>      {
>>        if (dump_file)
>>  	fprintf (dump_file,
>> -		 "punting on ldp due to reg conflcits (%d,%d)\n",
>> +		 "punting on pair mem load  due to reg conflcits (%d,%d)\n",
>>  		 insns[0]->uid (), insns[1]->uid ());
>>        return false;
>>      }
>> @@ -2843,7 +2971,7 @@ debug (const insn_list_t &l)
>>  // we can't re-order them anyway, so provided earlier passes have cleaned up
>>  // redundant loads, we shouldn't miss opportunities by doing this.
>>  void
>> -ldp_bb_info::merge_pairs (insn_list_t &left_list,
>> +pair_fusion_bb_info::merge_pairs (insn_list_t &left_list,
>>  			  insn_list_t &right_list,
>>  			  bool load_p,
>>  			  unsigned access_size)
>> @@ -2890,8 +3018,8 @@ ldp_bb_info::merge_pairs (insn_list_t &left_list,
>>  // of accesses.  If we find two sets of adjacent accesses, call
>>  // merge_pairs.
>>  void
>> -ldp_bb_info::transform_for_base (int encoded_lfs,
>> -				 access_group &group)
>> +pair_fusion_bb_info::transform_for_base (int encoded_lfs,
>> +					 access_group &group)
>>  {
>>    const auto lfs = decode_lfs (encoded_lfs);
>>    const unsigned access_size = lfs.size;
>> @@ -2909,7 +3037,7 @@ ldp_bb_info::transform_for_base (int encoded_lfs,
>>  		       access.cand_insns,
>>  		       lfs.load_p,
>>  		       access_size);
>> -	  skip_next = access.cand_insns.empty ();
>> +	  skip_next = bb_state->cand_insns_empty_p (access.cand_insns);
> 
> As above, why is this needed?

For rs6000 we want to always return true here: load/store pairs
to be merged with sizes 8/16, 16/32 and 32/64 occur for rs6000,
and we only want load/store pairs of 8/16 and 32/64.  That's why
we want to always return true for rs6000, to skip pairs as above.

> 
>>  	}
>>        prev_access = &access;
>>      }
>> @@ -2919,7 +3047,7 @@ ldp_bb_info::transform_for_base (int encoded_lfs,
>>  // and remove all the tombstone insns, being sure to reparent any uses
>>  // of mem to previous defs when we do this.
>>  void
>> -ldp_bb_info::cleanup_tombstones ()
>> +pair_fusion_bb_info::cleanup_tombstones ()
>>  {
>>    // No need to do anything if we didn't emit a tombstone insn for this BB.
>>    if (!m_emitted_tombstone)
>> @@ -2947,7 +3075,7 @@ ldp_bb_info::cleanup_tombstones ()
>>  
>>  template<typename Map>
>>  void
>> -ldp_bb_info::traverse_base_map (Map &map)
>> +pair_fusion_bb_info::traverse_base_map (Map &map)
>>  {
>>    for (auto kv : map)
>>      {
>> @@ -2958,7 +3086,7 @@ ldp_bb_info::traverse_base_map (Map &map)
>>  }
>>  
>>  void
>> -ldp_bb_info::transform ()
>> +pair_fusion_bb_info::transform ()
>>  {
>>    traverse_base_map (expr_map);
>>    traverse_base_map (def_map);
>> @@ -3167,14 +3295,13 @@ try_promote_writeback (insn_info *insn)
>>  // for load/store candidates.  If running after RA, also try and promote
>>  // non-writeback pairs to use writeback addressing.  Then try to fuse
>>  // candidates into pairs.
>> -void ldp_fusion_bb (bb_info *bb)
>> +void pair_fusion::ldp_fusion_bb (bb_info *bb)
>>  {
>> -  const bool track_loads
>> -    = aarch64_tune_params.ldp_policy_model != AARCH64_LDP_STP_POLICY_NEVER;
>> -  const bool track_stores
>> -    = aarch64_tune_params.stp_policy_model != AARCH64_LDP_STP_POLICY_NEVER;
>> +  const bool track_loads = track_load_p ();
>> +  const bool track_stores = track_store_p ();
>>  
>> -  ldp_bb_info bb_state (bb);
> 
> This:
> 
>> +  aarch64_pair_fusion derived;
> 
> can be deleted and then:
> 
>> +  pair_fusion_bb_info bb_info (bb, &derived);
> 
> can just be:
> 
>   pair_fusion_bb_info bb_info (bb, this);
> 
> (or you can pass *this if you make bb_info take a reference).
> 
> I don't think there's a particular need to change the variable name
> (bb_state -> bb_info).  I chose the former because it doesn't clash
> with the RTL-SSA structure of the same name as the latter.
> 

Addressed.
>>  
>>    for (auto insn : bb->nondebug_insns ())
>>      {
>> @@ -3184,31 +3311,31 @@ void ldp_fusion_bb (bb_info *bb)
>>  	continue;
>>  
>>        rtx pat = PATTERN (rti);
>> -      if (reload_completed
>> -	  && aarch64_ldp_writeback > 1
>> -	  && GET_CODE (pat) == PARALLEL
>> -	  && XVECLEN (pat, 0) == 2)
>> +      if (pair_mem_promote_writeback_p (pat))
>>  	try_promote_writeback (insn);
> 
> It looks like try_promote_writeback itself will need some further work
> to make it target-independent.  I suppose this check:
> 
>   auto rti = insn->rtl ();
>   const auto attr = get_attr_ldpstp (rti);
>   if (attr == LDPSTP_NONE)
>     return;
> 
>   bool load_p = (attr == LDPSTP_LDP);
>   gcc_checking_assert (load_p || attr == LDPSTP_STP);
> 
> will need to become part of the pair_mem_promote_writeback_p hook that you
> added, potentially changing it to return a boolean for load_p.
> 
> Then I guess we will need hooks for destructuring the pair insn and
> another hook to wrap aarch64_gen_writeback_pair.
> 

Addressed.
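
pair_mem_promote_writeback_p now does the attribute check itself and
returns load_p by reference; the aarch64 implementation looks roughly
like this (a sketch, the exact signature may still change):

  bool
  aarch64_pair_fusion::pair_mem_promote_writeback_p (insn_info *insn,
                                                     rtx pat, bool &load_p)
  {
    if (reload_completed
        && aarch64_ldp_writeback > 1
        && GET_CODE (pat) == PARALLEL
        && XVECLEN (pat, 0) == 2)
      {
        auto rti = insn->rtl ();
        const auto attr = get_attr_ldpstp (rti);
        if (attr == LDPSTP_NONE)
          return false;

        load_p = (attr == LDPSTP_LDP);
        gcc_checking_assert (load_p || attr == LDPSTP_STP);
        return true;
      }
    return false;
  }

with separate hooks wrapping the pair destructuring and
aarch64_gen_writeback_pair as you suggest.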
>>  
>>        if (GET_CODE (pat) != SET)
>>  	continue;
>>  
>>        if (track_stores && MEM_P (XEXP (pat, 0)))
>> -	bb_state.track_access (insn, false, XEXP (pat, 0));
>> +	bb_info.track_access (insn, false, XEXP (pat, 0));
>>        else if (track_loads && MEM_P (XEXP (pat, 1)))
>> -	bb_state.track_access (insn, true, XEXP (pat, 1));
>> +	bb_info.track_access (insn, true, XEXP (pat, 1));
>>      }
>>  
>> -  bb_state.transform ();
>> -  bb_state.cleanup_tombstones ();
>> +  bb_info.transform ();
>> +  bb_info.cleanup_tombstones ();
>>  }
>>  
>>  void ldp_fusion ()
>>  {
>>    ldp_fusion_init ();
>> +  pair_fusion *pfuse;
>> +  aarch64_pair_fusion derived;
>> +  pfuse = &derived;
> 
> This is indeed the one place where I think it is acceptable to
> instantiate aarch64_pair_fusion.  But again there's no need to create a
> pointer to the parent class, just call any function you like directly.
> 

Addressed.
>>  
>>    for (auto bb : crtl->ssa->bbs ())
>> -    ldp_fusion_bb (bb);
>> +    pfuse->ldp_fusion_bb (bb);
> 
> I think even the code to iterate over bbs should itself be a member
> function of pair_fusion (say "run") and then that becomes part of the
> generic code.
> 
> So this function would just become:
> 
> aarch64_pair_fusion pass;
> pass.run ();
> 
> and could be inlined into the caller.
> 

Addressed.
> Perhaps you could also add an early return like:
> 
>   if (!track_loads_p () && !track_stores_p ())
>     return;
> 
> in pair_fusion::run () and then remove the corresponding code from
> pass_ldp_fusion::gate?
> 

Addressed.
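
With both changes, the driver becomes (sketch):

  void
  pair_fusion::run ()
  {
    if (!track_loads_p () && !track_stores_p ())
      return;

    for (auto bb : crtl->ssa->bbs ())
      ldp_fusion_bb (bb);
  }

and the caller reduces to:

  aarch64_pair_fusion pass;
  pass.run ();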

>>  
>>    ldp_fusion_destroy ();
>>  }
>> -- 
>> 2.39.3
>>
> 
> Thanks,
> Alex

Thanks & Regards
Ajit
Alex Coplan April 9, 2024, 1:59 p.m. UTC | #3
On 09/04/2024 17:30, Ajit Agarwal wrote:
> 
> 
> On 05/04/24 10:03 pm, Alex Coplan wrote:
> > On 05/04/2024 13:53, Ajit Agarwal wrote:
> >> Hello Alex/Richard:
> >>
> >> All review comments are incorporated.
> > 
> > Thanks, I was kind-of expecting you to also send the renaming patch as a
> > preparatory patch as we discussed.
> > 
> > Sorry for another meta comment, but: I think the reason that the Linaro
> > CI isn't running tests on your patches is actually because you're
> > sending 1/3 of a series but not sending the rest of the series.
> > 
> > So please can you either send this as an individual preparatory patch
> > (not marked as a series) or if you're going to send a series (e.g. with
> > a preparatory rename patch as 1/2 and this as 2/2) then send the entire
> > series when you make updates.  That way the CI should test your patches,
> > which would be helpful.
> >
> 
> Addressed.
>  
> >>
> >> Common infrastructure of load store pair fusion is divided into target
> >> independent and target dependent changed code.
> >>
> >> Target independent code is the Generic code with pure virtual function
> >> to interface betwwen target independent and dependent code.
> >>
> >> Target dependent code is the implementation of pure virtual function for
> >> aarch64 target and the call to target independent code.
> >>
> >> Thanks & Regards
> >> Ajit
> >>
> >>
> >> aarch64: Place target independent and dependent changed code in one file
> >>
> >> Common infrastructure of load store pair fusion is divided into target
> >> independent and target dependent changed code.
> >>
> >> Target independent code is the Generic code with pure virtual function
> >> to interface betwwen target independent and dependent code.
> >>
> >> Target dependent code is the implementation of pure virtual function for
> >> aarch64 target and the call to target independent code.
> >>
> >> 2024-04-06  Ajit Kumar Agarwal  <aagarwa1@linux.ibm.com>
> >>
> >> gcc/ChangeLog:
> >>
> >> 	* config/aarch64/aarch64-ldp-fusion.cc: Place target
> >> 	independent and dependent changed code.
> > 
> > You're going to need a proper ChangeLog eventually, but I guess there's
> > no need for that right now.
> > 
> >> ---
> >>  gcc/config/aarch64/aarch64-ldp-fusion.cc | 371 +++++++++++++++--------
> >>  1 file changed, 249 insertions(+), 122 deletions(-)
> >>
> >> diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc b/gcc/config/aarch64/aarch64-ldp-fusion.cc
> >> index 22ed95eb743..cb21b514ef7 100644
> >> --- a/gcc/config/aarch64/aarch64-ldp-fusion.cc
> >> +++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc
> >> @@ -138,8 +138,122 @@ struct alt_base
> >>    poly_int64 offset;
> >>  };
> >>  
> >> +// Virtual base class for load/store walkers used in alias analysis.
> >> +struct alias_walker
> >> +{
> >> +  virtual bool conflict_p (int &budget) const = 0;
> >> +  virtual insn_info *insn () const = 0;
> >> +  virtual bool valid () const  = 0;
> > 
> > Heh, looking at this made me realise there is a whitespace bug here in
> > the existing code (double space after const).  Sorry about that!  I'll
> > push an obvious fix for that.
> > 
> >> +  virtual void advance () = 0;
> >> +};
> >> +
> >> +struct pair_fusion {
> >> +
> >> +  pair_fusion () {};
> > 
> > This ctor looks pointless at the moment.  Perhaps instead we could put
> > the contents of ldp_fusion_init in here and then delete that function?
> > 
> 
> Addressed.
> 
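For the avoidance of doubt, I'd expect the ctor and dtor to end up
looking something like the following (writing this from memory, so
please double-check against ldp_fusion_init / ldp_fusion_destroy):

  pair_fusion ()
  {
    calculate_dominance_info (CDI_DOMINATORS);
    df_analyze ();
    crtl->ssa = new rtl_ssa::function_info (cfun);
  }

  ~pair_fusion ()
  {
    if (crtl->ssa->perform_pending_updates ())
      cleanup_cfg (0);

    free_dominance_info (CDI_DOMINATORS);

    delete crtl->ssa;
    crtl->ssa = nullptr;
  }
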
> >> +  virtual bool fpsimd_op_p (rtx reg_op, machine_mode mem_mode,
> >> +			       bool load_p) = 0;
> > 
> > Please can we have comments above each of these virtual functions
> > describing any parameters, what the purpose of the hook is, and the
> > interpretation of the return value?  This will serve as the
> > documentation for other targets that want to make use of the pass.
> > 
> > It might make sense to have a default-false implementation for
> > fpsimd_op_p, especially if you don't want to make use of this bit for
> > rs6000.
> >
> 
> Addressed.
>  
> >> +
> >> +  virtual bool pair_operand_mode_ok_p (machine_mode mode) = 0;
> >> +  virtual bool pair_trailing_writeback_p () = 0;
> > 
> > Sorry for the run-around, but: I think this and
> > handle_writeback_opportunities () should be the same function, either
> > returning an enum or taking an enum and returning a boolean.
> > 
> > At a minimum they should have more similar sounding names.
> > 
> 
> Addressed.
> 
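To be concrete about the enum option, I had in mind something like this
(just a sketch, the names are up for discussion):

  enum class writeback { NEVER, EXISTING, ALL };
  virtual enum writeback handle_writeback_opportunities () = 0;

i.e. a single hook that both gates writeback handling and says whether
to try forming new trailing base-register updates.
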
> >> +  virtual bool pair_reg_operand_ok_p (bool load_p, rtx reg_op,
> >> +				      machine_mode mem_mode) = 0;
> >> +  virtual int pair_mem_alias_check_limit () = 0;
> >> +  virtual bool handle_writeback_opportunities () = 0 ;
> >> +  virtual bool pair_mem_ok_with_policy (rtx first_mem, bool load_p,
> >> +					machine_mode mode) = 0;
> >> +  virtual rtx gen_mem_pair (rtx *pats,  rtx writeback,
> > 
> > Nit: excess whitespace after pats,
> > 
> >> +			    bool load_p) = 0;
> >> +  virtual bool pair_mem_promote_writeback_p (rtx pat) = 0;
> >> +  virtual bool track_load_p () = 0;
> >> +  virtual bool track_store_p () = 0;
> > 
> > I think it would probably make more sense for these two to have
> > default-true implementations rather than being pure virtual functions.
> > 
> > Also, sorry for the bikeshedding, but please can we keep the plural
> > names (so track_loads_p and track_stores_p).
> > 
> >> +  virtual bool cand_insns_empty_p (std::list<insn_info *> &insns) = 0;
> > 
> > Why does this need to be virtualised?  I would expect it to
> > just be insns.empty () on all targets.
> > 
> >> +  virtual bool pair_mem_in_range_p (HOST_WIDE_INT off) = 0;
> >> +  void ldp_fusion_bb (bb_info *bb);
> > 
> > Just to check, are you planning to rename this function in a separate
> > patch?
> > 
> 
> Yes.
> >> +
> >> +  ~pair_fusion() { }
> > 
> > As with the ctor, perhaps we could put the contents of ldp_fusion_destroy
> > in here and then delete that function?
> > 
> 
> Addressed.
> >> +};
> >> +
> >> +struct aarch64_pair_fusion : public pair_fusion
> >> +{
> >> +public:
> > 
> 
> Addressed.
> > Nit: there shouldn't be a need for this as public is already the default
> > for structs.
> > 
> >> +  aarch64_pair_fusion () : pair_fusion () {};
> > 
> > This ctor looks pointless, the default one should do, so can we remove
> > this?
> > 
> 
> Addressed.
> >> +  bool fpsimd_op_p (rtx reg_op, machine_mode mem_mode,
> >> +		    bool load_p) override final
> >> +  {
> >> +    const bool fpsimd_op_p
> >> +      = reload_completed
> >> +      ? (REG_P (reg_op) && FP_REGNUM_P (REGNO (reg_op)))
> >> +      : (GET_MODE_CLASS (mem_mode) != MODE_INT
> >> +	 && (load_p || !aarch64_const_zero_rtx_p (reg_op)));
> >> +    return fpsimd_op_p;
> > 
> > Can this just directly return the boolean expression instead of storing
> > it into a variable first?
> > 
> Addressed.
> >> +  }
> >> +
> >> +  bool pair_mem_promote_writeback_p (rtx pat)
> >> +  {
> >> +     if (reload_completed
> >> +	 && aarch64_ldp_writeback > 1
> >> +	 && GET_CODE (pat) == PARALLEL
> >> +	 && XVECLEN (pat, 0) == 2)
> >> +	return true;
> >> +
> >> +     return false;
> >> +  }
> >> +
> >> +  bool pair_mem_ok_with_policy (rtx first_mem, bool load_p,
> >> +				machine_mode mode)
> >> +  {
> >> +    return aarch64_mem_ok_with_ldpstp_policy_model (first_mem,
> >> +						     load_p,
> >> +						     mode);
> >> +  }
> >> +  bool pair_operand_mode_ok_p (machine_mode mode);
> > 
> > Why have you implemented this (one-line function) out-of-line but the
> > others are implemented inline?  Please be consistent.
> > 
> 
> Addressed.
> >> +
> >> +  rtx gen_mem_pair (rtx *pats, rtx writeback, bool load_p);
> >> +
> >> +  bool pair_trailing_writeback_p  ()
> >> +  {
> >> +    return aarch64_ldp_writeback > 1;
> >> +  }
> >> +  bool pair_reg_operand_ok_p (bool load_p, rtx reg_op,
> >> +			      machine_mode mem_mode)
> >> +  {
> >> +    return (load_p
> >> +	     ? aarch64_ldp_reg_operand (reg_op, mem_mode)
> >> +	     : aarch64_stp_reg_operand (reg_op, mem_mode));
> >> +  }
> >> +  int pair_mem_alias_check_limit ()
> >> +  {
> >> +    return aarch64_ldp_alias_check_limit;
> >> +  }
> >> +  bool handle_writeback_opportunities ()
> >> +  {
> >> +    return aarch64_ldp_writeback;
> >> +  }
> >> +  bool track_load_p ()
> >> +  {
> >> +    const bool track_loads
> >> +      = aarch64_tune_params.ldp_policy_model != AARCH64_LDP_STP_POLICY_NEVER;
> >> +    return track_loads;
> >> +  }
> > 
> 
> Addressed.
> > As above, can this just return the boolean expression directly without
> > the temporary?
> > 
> >> +  bool track_store_p ()
> >> +  {
> >> +    const bool track_stores
> >> +      = aarch64_tune_params.stp_policy_model != AARCH64_LDP_STP_POLICY_NEVER;
> >> +    return track_stores;
> >> +  }
> > 
> 
> Addressed.
> > Likewise.
> > 
> >> +  bool cand_insns_empty_p (std::list<insn_info *> &insns)
> >> +  {
> >> +    return insns.empty();
> >> +  }
> > 
> > As mentioned on the declaration, I don't see why this needs to be
> > virtualised.
> > 
> >> +  bool pair_mem_in_range_p (HOST_WIDE_INT off)
> >> +  {
> >> +    return (off < LDP_MIN_IMM || off > LDP_MAX_IMM);
> >> +  }
> > 
> > Hmm, this looks to test exactly the opposite condition from what the
> > name suggests, i.e. the implementation looks like what I would expect
> > for a function called pair_mem_out_of_range_p.
> > 
> > Please can you invert the condition and fix up callers appropriately?
> > 
> > This suggests you're not really thinking about the code you're writing;
> > please try to think a bit more carefully before posting patches.
> > 
> 
> Sorry for my mistake; I overlooked it.
> >> +};
> >> +
> >> +
> >>  // State used by the pass for a given basic block.
> >> -struct ldp_bb_info
> >> +struct pair_fusion_bb_info
> >>  {
> >>    using def_hash = nofree_ptr_hash<def_info>;
> >>    using expr_key_t = pair_hash<tree_operand_hash, int_hash<int, -1, -2>>;
> >> @@ -160,14 +274,17 @@ struct ldp_bb_info
> >>  
> >>    static const size_t obstack_alignment = sizeof (void *);
> >>    bb_info *m_bb;
> >> +  pair_fusion *bb_state;
> > 
> 
> Addressed.
> 
> > Can we call this m_pass instead of bb_state?  The pair_fusion class is
> > explicitly at the pass level rather than the bb level.
> > 
> > Also I think this (and indeed m_bb, although that's pre-existing) could
> > be private members.  Not sure there's a good reason for making them
> > public.
> >
> 
> Addressed.
>  
> > It might be slightly nicer to use a reference for m_pass unless there's
> > a good reason to use a pointer, although I don't feel strongly about
> > this.
> > 
> >>  
> >> -  ldp_bb_info (bb_info *bb) : m_bb (bb), m_emitted_tombstone (false)
> >> +  pair_fusion_bb_info (bb_info *bb,
> >> +		      aarch64_pair_fusion *d) : m_bb (bb),
> > 
> > This generic class needs to take the generic pass type, i.e. just a
> > pair_fusion * rather than an aarch64_pair_fusion *.
> > 
> >> +		      bb_state (d), m_emitted_tombstone (false)
> >>    {
> >>      obstack_specify_allocation (&m_obstack, OBSTACK_CHUNK_SIZE,
> >>  				obstack_alignment, obstack_chunk_alloc,
> >>  				obstack_chunk_free);
> >>    }
> >> -  ~ldp_bb_info ()
> >> +  ~pair_fusion_bb_info ()
> >>    {
> >>      obstack_free (&m_obstack, nullptr);
> >>  
> >> @@ -177,10 +294,32 @@ struct ldp_bb_info
> >>  	bitmap_obstack_release (&m_bitmap_obstack);
> >>        }
> >>    }
> >> +  void track_access (insn_info *, bool load, rtx mem);
> >> +  void transform ();
> >> +  void cleanup_tombstones ();
> > 
> > I think most (or indeed all?) of the changes here shouldn't be needed.
> > AFAICS, there shouldn't be a need to:
> >  - drop inline from these functions.
> >  - change the accessibility of any members.
> > 
> 
> Addressed.
> > Please can you drop these changes or explain why they're needed (from
> > here until the end of the class).
> > 
> >> +  void merge_pairs (insn_list_t &, insn_list_t &,
> >> +		    bool load_p, unsigned access_size);
> >> +  void transform_for_base (int load_size, access_group &group);
> >> +
> >> +  bool try_fuse_pair (bool load_p, unsigned access_size,
> >> +			     insn_info *i1, insn_info *i2);
> >> +
> >> +  bool fuse_pair (bool load_p, unsigned access_size,
> >> +		  int writeback,
> >> +		  insn_info *i1, insn_info *i2,
> >> +		  base_cand &base,
> >> +		  const insn_range_info &move_range);
> >>  
> >> -  inline void track_access (insn_info *, bool load, rtx mem);
> >> -  inline void transform ();
> >> -  inline void cleanup_tombstones ();
> >> +  void do_alias_analysis (insn_info *alias_hazards[4],
> >> +			  alias_walker *walkers[4],
> >> +			  bool load_p);
> >> +
> >> +  void track_tombstone (int uid);
> >> +
> >> +  bool track_via_mem_expr (insn_info *, rtx mem, lfs_fields lfs);
> >> +
> >> +  template<typename Map>
> >> +    void traverse_base_map (Map &map);
> >>  
> >>  private:
> >>    obstack m_obstack;
> >> @@ -191,30 +330,32 @@ private:
> >>    bool m_emitted_tombstone;
> >>  
> >>    inline splay_tree_node<access_record *> *node_alloc (access_record *);
> >> +};
> >>  
> >> -  template<typename Map>
> >> -  inline void traverse_base_map (Map &map);
> >> -  inline void transform_for_base (int load_size, access_group &group);
> >> -
> >> -  inline void merge_pairs (insn_list_t &, insn_list_t &,
> >> -			   bool load_p, unsigned access_size);
> >> -
> >> -  inline bool try_fuse_pair (bool load_p, unsigned access_size,
> >> -			     insn_info *i1, insn_info *i2);
> >> -
> >> -  inline bool fuse_pair (bool load_p, unsigned access_size,
> >> -			 int writeback,
> >> -			 insn_info *i1, insn_info *i2,
> >> -			 base_cand &base,
> >> -			 const insn_range_info &move_range);
> >> -
> >> -  inline void track_tombstone (int uid);
> >> +rtx aarch64_pair_fusion::gen_mem_pair (rtx *pats,
> > 
> > Style nit: newline after rtx.
> > 
> Addressed.
> 
> >> +		  rtx writeback,
> >> +		  bool load_p)
> >> +  {
> >> +    rtx pair_pat;
> >>  
> >> -  inline bool track_via_mem_expr (insn_info *, rtx mem, lfs_fields lfs);
> >> -};
> >> +    if (writeback)
> >> +      {
> >> +	auto patvec = gen_rtvec (3, writeback, pats[0], pats[1]);
> >> +	pair_pat = gen_rtx_PARALLEL (VOIDmode, patvec);
> >> +      }
> >> +    else if (load_p)
> >> +      pair_pat = aarch64_gen_load_pair (XEXP (pats[0], 0),
> >> +					XEXP (pats[1], 0),
> >> +					XEXP (pats[0], 1));
> >> +    else
> >> +      pair_pat = aarch64_gen_store_pair (XEXP (pats[0], 0),
> >> +					 XEXP (pats[0], 1),
> >> +					 XEXP (pats[1], 1));
> >> +     return pair_pat;
> > 
> > Can the assignments just be direct returns here, please?  Then
> > get rid of the temporary.
> >
> 
> Addressed.
>  
> >> +  }
> >>  
> >>  splay_tree_node<access_record *> *
> >> -ldp_bb_info::node_alloc (access_record *access)
> >> +pair_fusion_bb_info::node_alloc (access_record *access)
> >>  {
> >>    using T = splay_tree_node<access_record *>;
> >>    void *addr = obstack_alloc (&m_obstack, sizeof (T));
> >> @@ -262,7 +403,7 @@ drop_writeback (rtx mem)
> >>  // RTX_AUTOINC addresses.  The interface is like strip_offset except we take a
> >>  // MEM so that we know the mode of the access.
> >>  static rtx
> >> -ldp_strip_offset (rtx mem, poly_int64 *offset)
> >> +pair_mem_strip_offset (rtx mem, poly_int64 *offset)
> > 
> > I thought we discussed splitting the renaming into a separate patch?
> > 
> >>  {
> >>    rtx addr = XEXP (mem, 0);
> >>  
> >> @@ -332,6 +473,12 @@ ldp_operand_mode_ok_p (machine_mode mode)
> >>    return reload_completed || mode != TImode;
> >>  }
> >>  
> >> +bool
> >> +aarch64_pair_fusion::pair_operand_mode_ok_p (machine_mode mode)
> >> +{
> >> +  return ldp_operand_mode_ok_p (mode);
> >> +}
> >> +
> >>  // Given LFS (load_p, fpsimd_p, size) fields in FIELDS, encode these
> >>  // into an integer for use as a hash table key.
> >>  static int
> >> @@ -396,7 +543,7 @@ access_group::track (Alloc alloc_node, poly_int64 offset, insn_info *insn)
> >>  // MEM_EXPR base (i.e. a tree decl) relative to which we can track the access.
> >>  // LFS is used as part of the key to the hash table, see track_access.
> >>  bool
> >> -ldp_bb_info::track_via_mem_expr (insn_info *insn, rtx mem, lfs_fields lfs)
> >> +pair_fusion_bb_info::track_via_mem_expr (insn_info *insn, rtx mem, lfs_fields lfs)
> > 
> > Again, please do any renaming in a separate patch.
> > 
> 
> Addressed.
> >>  {
> >>    if (!MEM_EXPR (mem) || !MEM_OFFSET_KNOWN_P (mem))
> >>      return false;
> >> @@ -412,9 +559,10 @@ ldp_bb_info::track_via_mem_expr (insn_info *insn, rtx mem, lfs_fields lfs)
> >>    const machine_mode mem_mode = GET_MODE (mem);
> >>    const HOST_WIDE_INT mem_size = GET_MODE_SIZE (mem_mode).to_constant ();
> >>  
> >> -  // Punt on misaligned offsets.  LDP/STP instructions require offsets to be a
> >> -  // multiple of the access size, and we believe that misaligned offsets on
> >> -  // MEM_EXPR bases are likely to lead to misaligned offsets w.r.t. RTL bases.
> >> +  // Punt on misaligned offsets.  Paired memory access instructions require
> >> +  // offsets to be a multiple of the access size, and we believe that
> >> +  // misaligned offsets on MEM_EXPR bases are likely to lead to misaligned
> >> +  // offsets w.r.t. RTL bases.
> >>    if (!multiple_p (offset, mem_size))
> >>      return false;
> >>  
> >> @@ -438,46 +586,38 @@ ldp_bb_info::track_via_mem_expr (insn_info *insn, rtx mem, lfs_fields lfs)
> >>  }
> >>  
> >>  // Main function to begin pair discovery.  Given a memory access INSN,
> >> -// determine whether it could be a candidate for fusing into an ldp/stp,
> >> +// determine whether it could be a candidate for fusing into an pair mem,
> > 
> > s/an pair mem/a paired access/
> > 
> >>  // and if so, track it in the appropriate data structure for this basic
> >>  // block.  LOAD_P is true if the access is a load, and MEM is the mem
> >>  // rtx that occurs in INSN.
> >>  void
> >> -ldp_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
> >> +pair_fusion_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
> >>  {
> >>    // We can't combine volatile MEMs, so punt on these.
> >>    if (MEM_VOLATILE_P (mem))
> >>      return;
> >>  
> >> -  // Ignore writeback accesses if the param says to do so.
> >> -  if (!aarch64_ldp_writeback
> >> +  // Ignore writeback accesses if the param says to do so
> > 
> > This comment needs updating, I guess something like "if the hook says to
> > do so".  But also please don't delete the full stop.
> > 
> 
> Addressed.
> >> +  if (!bb_state->handle_writeback_opportunities ()
> >>        && GET_RTX_CLASS (GET_CODE (XEXP (mem, 0))) == RTX_AUTOINC)
> >>      return;
> >>  
> >>    const machine_mode mem_mode = GET_MODE (mem);
> >> -  if (!ldp_operand_mode_ok_p (mem_mode))
> >> +
> > 
> > I don't think we need the extra whitespace here.
> > 
> >> +  if (!bb_state->pair_operand_mode_ok_p (mem_mode))
> >>      return;
> >>  
> >>    rtx reg_op = XEXP (PATTERN (insn->rtl ()), !load_p);
> >>  
> >> -  // Ignore the access if the register operand isn't suitable for ldp/stp.
> >> -  if (load_p
> >> -      ? !aarch64_ldp_reg_operand (reg_op, mem_mode)
> >> -      : !aarch64_stp_reg_operand (reg_op, mem_mode))
> >> +  if (!bb_state->pair_reg_operand_ok_p (load_p, reg_op, mem_mode))
> >>      return;
> >> -
> >>    // We want to segregate FP/SIMD accesses from GPR accesses.
> >>    //
> >>    // Before RA, we use the modes, noting that stores of constant zero
> >>    // operands use GPRs (even in non-integer modes).  After RA, we use
> >>    // the hard register numbers.
> >> -  const bool fpsimd_op_p
> >> -    = reload_completed
> >> -    ? (REG_P (reg_op) && FP_REGNUM_P (REGNO (reg_op)))
> >> -    : (GET_MODE_CLASS (mem_mode) != MODE_INT
> >> -       && (load_p || !aarch64_const_zero_rtx_p (reg_op)));
> >> -
> >> -  // Note ldp_operand_mode_ok_p already rejected VL modes.
> >> +  const bool fpsimd_op_p = bb_state->fpsimd_op_p (reg_op, mem_mode, load_p);
> >> +  // Note pair_operand_mode_ok_p already rejected VL modes.
> >>    const HOST_WIDE_INT mem_size = GET_MODE_SIZE (mem_mode).to_constant ();
> >>    const lfs_fields lfs = { load_p, fpsimd_op_p, mem_size };
> >>  
> >> @@ -487,7 +627,7 @@ ldp_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
> >>    poly_int64 mem_off;
> >>    rtx addr = XEXP (mem, 0);
> >>    const bool autoinc_p = GET_RTX_CLASS (GET_CODE (addr)) == RTX_AUTOINC;
> >> -  rtx base = ldp_strip_offset (mem, &mem_off);
> >> +  rtx base = pair_mem_strip_offset (mem, &mem_off);
> > 
> > Again, let's split the renaming off into a separate patch.
> > 
> 
> Addressed.
> >>    if (!REG_P (base))
> >>      return;
> >>  
> >> @@ -506,8 +646,8 @@ ldp_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
> >>    // elimination offset pre-RA, we should postpone forming pairs on such
> >>    // accesses until after RA.
> >>    //
> >> -  // As it stands, addresses with offsets in range for LDR but not
> >> -  // in range for LDP/STP are currently reloaded inefficiently,
> >> +  // As it stands, addresses in range for an individual load/store but not
> >> +  // for a paired access are currently reloaded inefficiently,
> >>    // ending up with a separate base register for each pair.
> >>    //
> >>    // In theory LRA should make use of
> >> @@ -519,8 +659,8 @@ ldp_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
> >>    // that calls targetm.legitimize_address_displacement.
> >>    //
> >>    // So for now, it's better to punt when we can't be sure that the
> >> -  // offset is in range for LDP/STP.  Out-of-range cases can then be
> >> -  // handled after RA by the out-of-range LDP/STP peepholes.  Eventually, it
> >> +  // offset is in range for paired access.  Out-of-range cases can then be
> >> +  // handled after RA by the out-of-range PAIR MEM  peepholes.  Eventually, it
> >>    // would be nice to handle known out-of-range opportunities in the
> >>    // pass itself (for stack accesses, this would be in the post-RA pass).
> >>    if (!reload_completed
> >> @@ -573,8 +713,8 @@ ldp_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
> >>  	gcc_unreachable (); // Base defs should be unique.
> >>      }
> >>  
> >> -  // Punt on misaligned offsets.  LDP/STP require offsets to be a multiple of
> >> -  // the access size.
> >> +  // Punt on misaligned offsets.  Paired memory accesses require offsets
> >> +  // to be a multiple of the access size.
> >>    if (!multiple_p (mem_off, mem_size))
> >>      return;
> >>  
> >> @@ -614,7 +754,7 @@ static bool no_ignore (insn_info *) { return false; }
> >>  // making use of alias disambiguation.
> >>  static insn_info *
> >>  latest_hazard_before (insn_info *insn, rtx *ignore,
> >> -		      insn_info *ignore_insn = nullptr)
> >> +		      insn_info *ignore_insn = 0)
> > 
> > Is this change needed?
> > 
> 
> Addressed.
> >>  {
> >>    insn_info *result = nullptr;
> >>  
> >> @@ -1150,7 +1290,7 @@ extract_writebacks (bool load_p, rtx pats[2], int changed)
> >>        const bool autoinc_p = GET_RTX_CLASS (GET_CODE (addr)) == RTX_AUTOINC;
> >>  
> >>        poly_int64 offset;
> >> -      rtx this_base = ldp_strip_offset (mem, &offset);
> >> +      rtx this_base = pair_mem_strip_offset (mem, &offset);
> >>        gcc_assert (REG_P (this_base));
> >>        if (base_reg)
> >>  	gcc_assert (rtx_equal_p (base_reg, this_base));
> >> @@ -1286,7 +1426,11 @@ find_trailing_add (insn_info *insns[2],
> >>  
> >>    off_hwi /= access_size;
> >>  
> >> -  if (off_hwi < LDP_MIN_IMM || off_hwi > LDP_MAX_IMM)
> >> +  pair_fusion *pfuse;
> >> +  aarch64_pair_fusion derived;
> >> +  pfuse = &derived;
> > 
> > Huh? We'll want to use the same instance of the pass structure (pair_fusion)
> > that gets created at the top level, so that should get passed around everywhere.
> > 
> > In the case of find_trailing_add we probably don't want to add any more
> > parameters (as Segher noted there are already too many), so perhaps the
> > easiest thing would be to make find_trailing_add itself a (non-virtual)
> > member function of pair_fusion.
> > 
> > Then here we can just query pair_mem_in_range_p directly.
> > 
> > For try_promote_writeback itself you can either make it a non-virtual
> > member function of pair_fusion or just pass a reference to the pass
> > object in when you call it.
> > 
> > Note also that you don't have to create a pointer of the parent type in
> > order to call a virtual function on a derived object, you can just call
> > it directly.
> > 
> 
> Addressed.
> >> +
> >> +  if (pfuse->pair_mem_in_range_p (off_hwi))
> >>      return nullptr;
> > 
> > Note of course this condition needs inverting when you invert the
> > implementation.
> > 
> 
> Addressed.
> >>  
> >>    auto dump_prefix = [&]()
> >> @@ -1328,7 +1472,7 @@ find_trailing_add (insn_info *insns[2],
> >>  // We just emitted a tombstone with uid UID, track it in a bitmap for
> >>  // this BB so we can easily identify it later when cleaning up tombstones.
> >>  void
> >> -ldp_bb_info::track_tombstone (int uid)
> >> +pair_fusion_bb_info::track_tombstone (int uid)
> > 
> > Again, please split the renaming off.  Same with other instances below.
> > 
> >>  {
> >>    if (!m_emitted_tombstone)
> >>      {
> >> @@ -1528,7 +1672,7 @@ fixup_debug_uses (obstack_watermark &attempt,
> >>  	  gcc_checking_assert (GET_RTX_CLASS (GET_CODE (XEXP (mem, 0)))
> >>  			       == RTX_AUTOINC);
> >>  
> >> -	  base = ldp_strip_offset (mem, &offset);
> >> +	  base = pair_mem_strip_offset (mem, &offset);
> >>  	  gcc_checking_assert (REG_P (base) && REGNO (base) == base_regno);
> >>  	}
> >>        fixup_debug_use (attempt, use, def, base, offset);
> >> @@ -1664,7 +1808,7 @@ fixup_debug_uses (obstack_watermark &attempt,
> >>  // BASE gives the chosen base candidate for the pair and MOVE_RANGE is
> >>  // a singleton range which says where to place the pair.
> >>  bool
> >> -ldp_bb_info::fuse_pair (bool load_p,
> >> +pair_fusion_bb_info::fuse_pair (bool load_p,
> >>  			unsigned access_size,
> >>  			int writeback,
> >>  			insn_info *i1, insn_info *i2,
> >> @@ -1800,7 +1944,7 @@ ldp_bb_info::fuse_pair (bool load_p,
> >>      {
> >>        if (dump_file)
> >>  	fprintf (dump_file,
> >> -		 "  ldp: i%d has wb but subsequent i%d has non-wb "
> >> +		 "  load pair: i%d has wb but subsequent i%d has non-wb "
> >>  		 "update of base (r%d), dropping wb\n",
> >>  		 insns[0]->uid (), insns[1]->uid (), base_regno);
> >>        gcc_assert (writeback_effect);
> >> @@ -1823,7 +1967,7 @@ ldp_bb_info::fuse_pair (bool load_p,
> >>      }
> >>  
> >>    // If either of the original insns had writeback, but the resulting pair insn
> >> -  // does not (can happen e.g. in the ldp edge case above, or if the writeback
> >> +  // does not (can happen e.g. in the load pair edge case above, or if the writeback
> >>    // effects cancel out), then drop the def(s) of the base register as
> >>    // appropriate.
> >>    //
> >> @@ -1842,7 +1986,7 @@ ldp_bb_info::fuse_pair (bool load_p,
> >>    // update of the base register and try and fold it in to make this into a
> >>    // writeback pair.
> >>    insn_info *trailing_add = nullptr;
> >> -  if (aarch64_ldp_writeback > 1
> >> +  if (bb_state->pair_trailing_writeback_p ()
> >>        && !writeback_effect
> >>        && (!load_p || (!refers_to_regno_p (base_regno, base_regno + 1,
> >>  					 XEXP (pats[0], 0), nullptr)
> >> @@ -1863,14 +2007,14 @@ ldp_bb_info::fuse_pair (bool load_p,
> >>      }
> >>  
> >>    // Now that we know what base mem we're going to use, check if it's OK
> >> -  // with the ldp/stp policy.
> >> +  // with the pair mem policy.
> >>    rtx first_mem = XEXP (pats[0], load_p);
> >> -  if (!aarch64_mem_ok_with_ldpstp_policy_model (first_mem,
> >> -						load_p,
> >> -						GET_MODE (first_mem)))
> >> +  if (!bb_state->pair_mem_ok_with_policy (first_mem,
> >> +			  load_p,
> >> +			  GET_MODE (first_mem)))
> >>      {
> >>        if (dump_file)
> >> -	fprintf (dump_file, "punting on pair (%d,%d), ldp/stp policy says no\n",
> >> +	fprintf (dump_file, "punting on pair (%d,%d), pair mem  policy says no\n",
> >>  		 i1->uid (), i2->uid ());
> >>        return false;
> >>      }
> >> @@ -1878,21 +2022,10 @@ ldp_bb_info::fuse_pair (bool load_p,
> >>    rtx reg_notes = combine_reg_notes (first, second, load_p);
> >>  
> >>    rtx pair_pat;
> >> -  if (writeback_effect)
> >> -    {
> >> -      auto patvec = gen_rtvec (3, writeback_effect, pats[0], pats[1]);
> >> -      pair_pat = gen_rtx_PARALLEL (VOIDmode, patvec);
> >> -    }
> >> -  else if (load_p)
> >> -    pair_pat = aarch64_gen_load_pair (XEXP (pats[0], 0),
> >> -				      XEXP (pats[1], 0),
> >> -				      XEXP (pats[0], 1));
> >> -  else
> >> -    pair_pat = aarch64_gen_store_pair (XEXP (pats[0], 0),
> >> -				       XEXP (pats[0], 1),
> >> -				       XEXP (pats[1], 1));
> >>  
> >> +  pair_pat = bb_state->gen_mem_pair (pats, writeback_effect, load_p);
> >>    insn_change *pair_change = nullptr;
> >> +
> >>    auto set_pair_pat = [pair_pat,reg_notes](insn_change *change) {
> >>      rtx_insn *rti = change->insn ()->rtl ();
> >>      validate_unshare_change (rti, &PATTERN (rti), pair_pat, true);
> >> @@ -2133,15 +2266,6 @@ load_modified_by_store_p (insn_info *load,
> >>    return false;
> >>  }
> >>  
> >> -// Virtual base class for load/store walkers used in alias analysis.
> >> -struct alias_walker
> >> -{
> >> -  virtual bool conflict_p (int &budget) const = 0;
> >> -  virtual insn_info *insn () const = 0;
> >> -  virtual bool valid () const  = 0;
> >> -  virtual void advance () = 0;
> >> -};
> >> -
> >>  // Implement some common functionality used by both store_walker
> >>  // and load_walker.
> >>  template<bool reverse>
> >> @@ -2259,13 +2383,13 @@ public:
> >>  //
> >>  // We try to maintain the invariant that if a walker becomes invalid, we
> >>  // set its pointer to null.
> >> -static void
> >> -do_alias_analysis (insn_info *alias_hazards[4],
> >> +void
> >> +pair_fusion_bb_info::do_alias_analysis (insn_info *alias_hazards[4],
> > 
> > It would probably make more sense for this to be a (non-virtual) member
> > function of the pair_fusion class, since it doesn't need to access the
> > BB-specific state.
> > 
> 
> Addressed.
> >>  		   alias_walker *walkers[4],
> >>  		   bool load_p)
> >>  {
> >>    const int n_walkers = 2 + (2 * !load_p);
> >> -  int budget = aarch64_ldp_alias_check_limit;
> >> +  int budget = bb_state->pair_mem_alias_check_limit ();
> >>  
> >>    auto next_walker = [walkers,n_walkers](int current) -> int {
> >>      for (int j = 1; j <= n_walkers; j++)
> >> @@ -2365,7 +2489,7 @@ get_viable_bases (insn_info *insns[2],
> >>      {
> >>        const bool is_lower = (i == reversed);
> >>        poly_int64 poly_off;
> >> -      rtx base = ldp_strip_offset (cand_mems[i], &poly_off);
> >> +      rtx base = pair_mem_strip_offset (cand_mems[i], &poly_off);
> >>        if (GET_RTX_CLASS (GET_CODE (XEXP (cand_mems[i], 0))) == RTX_AUTOINC)
> >>  	writeback |= (1 << i);
> >>  
> >> @@ -2373,7 +2497,7 @@ get_viable_bases (insn_info *insns[2],
> >>  	continue;
> >>  
> >>        // Punt on accesses relative to eliminable regs.  See the comment in
> >> -      // ldp_bb_info::track_access for a detailed explanation of this.
> >> +      // pair_fusion_bb_info::track_access for a detailed explanation of this.
> >>        if (!reload_completed
> >>  	  && (REGNO (base) == FRAME_POINTER_REGNUM
> >>  	      || REGNO (base) == ARG_POINTER_REGNUM))
> >> @@ -2397,7 +2521,11 @@ get_viable_bases (insn_info *insns[2],
> >>        if (!is_lower)
> >>  	base_off--;
> >>  
> >> -      if (base_off < LDP_MIN_IMM || base_off > LDP_MAX_IMM)
> >> +      pair_fusion *pfuse;
> >> +      aarch64_pair_fusion derived;
> >> +      pfuse = &derived;
> > 
> > Again, please don't do this but instead use the same instance
> > everywhere.
> > 
> 
> Addressed.
> >> +
> >> +      if (pfuse->pair_mem_in_range_p (base_off))
> >>  	continue;
> >>  
> >>        use_info *use = find_access (insns[i]->uses (), REGNO (base));
> >> @@ -2454,12 +2582,12 @@ get_viable_bases (insn_info *insns[2],
> >>  }
> >>  
> >>  // Given two adjacent memory accesses of the same size, I1 and I2, try
> >> -// and see if we can merge them into a ldp or stp.
> >> +// and see if we can merge them into a paired accesses load and store.
> > 
> > "into a load pair or store pair" or "into a paired access" if you
> > prefer.
> > 
> 
> Addressed.
> >>  //
> >>  // ACCESS_SIZE gives the (common) size of a single access, LOAD_P is true
> >>  // if the accesses are both loads, otherwise they are both stores.
> >>  bool
> >> -ldp_bb_info::try_fuse_pair (bool load_p, unsigned access_size,
> >> +pair_fusion_bb_info::try_fuse_pair (bool load_p, unsigned access_size,
> >>  			    insn_info *i1, insn_info *i2)
> >>  {
> >>    if (dump_file)
> >> @@ -2494,7 +2622,7 @@ ldp_bb_info::try_fuse_pair (bool load_p, unsigned access_size,
> >>      {
> >>        if (dump_file)
> >>  	fprintf (dump_file,
> >> -		 "punting on ldp due to reg conflcits (%d,%d)\n",
> >> +		 "punting on pair mem load  due to reg conflcits (%d,%d)\n",
> >>  		 insns[0]->uid (), insns[1]->uid ());
> >>        return false;
> >>      }
> >> @@ -2843,7 +2971,7 @@ debug (const insn_list_t &l)
> >>  // we can't re-order them anyway, so provided earlier passes have cleaned up
> >>  // redundant loads, we shouldn't miss opportunities by doing this.
> >>  void
> >> -ldp_bb_info::merge_pairs (insn_list_t &left_list,
> >> +pair_fusion_bb_info::merge_pairs (insn_list_t &left_list,
> >>  			  insn_list_t &right_list,
> >>  			  bool load_p,
> >>  			  unsigned access_size)
> >> @@ -2890,8 +3018,8 @@ ldp_bb_info::merge_pairs (insn_list_t &left_list,
> >>  // of accesses.  If we find two sets of adjacent accesses, call
> >>  // merge_pairs.
> >>  void
> >> -ldp_bb_info::transform_for_base (int encoded_lfs,
> >> -				 access_group &group)
> >> +pair_fusion_bb_info::transform_for_base (int encoded_lfs,
> >> +					 access_group &group)
> >>  {
> >>    const auto lfs = decode_lfs (encoded_lfs);
> >>    const unsigned access_size = lfs.size;
> >> @@ -2909,7 +3037,7 @@ ldp_bb_info::transform_for_base (int encoded_lfs,
> >>  		       access.cand_insns,
> >>  		       lfs.load_p,
> >>  		       access_size);
> >> -	  skip_next = access.cand_insns.empty ();
> >> +	  skip_next = bb_state->cand_insns_empty_p (access.cand_insns);
> > 
> > As above, why is this needed?
> 
> For rs6000 we want to always return true here: load/store pairs
> to be merged with sizes 8/16, 16/32 and 32/64 occur for rs6000,
> and we only want load/store pairs of 8/16 and 32/64.  That's why
> we want to always return true for rs6000, to skip pairs as above.

Hmm, sorry, I'm not sure I follow.  Are you saying that for rs6000 you have
load/store pair instructions where the two arms of the access are storing
operands of different sizes?  Or something else?

As it stands the logic is to skip the next iteration only if we
exhausted all the candidate insns for the current access.  In the case
that we didn't exhaust all such candidates, then the idea is that when
access becomes prev_access, we can attempt to use those candidates as
the "left-hand side" of a pair in the next iteration since we failed to
use them as the "right-hand side" of a pair in the current iteration.
I don't see why you wouldn't want that behaviour.  Please can you
explain?
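
For reference, the loop in question is doing (simplified):

  for (auto &access : group.list)
    {
      if (skip_next)
        skip_next = false;
      else if (known_eq (access.offset, prev_access->offset + access_size))
        {
          merge_pairs (prev_access->cand_insns, access.cand_insns,
                       lfs.load_p, access_size);
          skip_next = access.cand_insns.empty ();
        }
      prev_access = &access;
    }

so any candidates left in access.cand_insns get considered again as the
left-hand side on the next iteration.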

Thanks,
Alex

> 
> > 
> >>  	}
> >>        prev_access = &access;
> >>      }
> >> @@ -2919,7 +3047,7 @@ ldp_bb_info::transform_for_base (int encoded_lfs,
> >>  // and remove all the tombstone insns, being sure to reparent any uses
> >>  // of mem to previous defs when we do this.
> >>  void
> >> -ldp_bb_info::cleanup_tombstones ()
> >> +pair_fusion_bb_info::cleanup_tombstones ()
> >>  {
> >>    // No need to do anything if we didn't emit a tombstone insn for this BB.
> >>    if (!m_emitted_tombstone)
> >> @@ -2947,7 +3075,7 @@ ldp_bb_info::cleanup_tombstones ()
> >>  
> >>  template<typename Map>
> >>  void
> >> -ldp_bb_info::traverse_base_map (Map &map)
> >> +pair_fusion_bb_info::traverse_base_map (Map &map)
> >>  {
> >>    for (auto kv : map)
> >>      {
> >> @@ -2958,7 +3086,7 @@ ldp_bb_info::traverse_base_map (Map &map)
> >>  }
> >>  
> >>  void
> >> -ldp_bb_info::transform ()
> >> +pair_fusion_bb_info::transform ()
> >>  {
> >>    traverse_base_map (expr_map);
> >>    traverse_base_map (def_map);
> >> @@ -3167,14 +3295,13 @@ try_promote_writeback (insn_info *insn)
> >>  // for load/store candidates.  If running after RA, also try and promote
> >>  // non-writeback pairs to use writeback addressing.  Then try to fuse
> >>  // candidates into pairs.
> >> -void ldp_fusion_bb (bb_info *bb)
> >> +void pair_fusion::ldp_fusion_bb (bb_info *bb)
> >>  {
> >> -  const bool track_loads
> >> -    = aarch64_tune_params.ldp_policy_model != AARCH64_LDP_STP_POLICY_NEVER;
> >> -  const bool track_stores
> >> -    = aarch64_tune_params.stp_policy_model != AARCH64_LDP_STP_POLICY_NEVER;
> >> +  const bool track_loads = track_load_p ();
> >> +  const bool track_stores = track_store_p ();
> >>  
> >> -  ldp_bb_info bb_state (bb);
> > 
> > This:
> > 
> >> +  aarch64_pair_fusion derived;
> > 
> > can be deleted and then:
> > 
> >> +  pair_fusion_bb_info bb_info (bb, &derived);
> > 
> > can just be:
> > 
> >   pair_fusion_bb_info bb_info (bb, this);
> > 
> > (or you can pass *this if you make bb_info take a reference).
> > 
> > I don't think there's a particular need to change the variable name
> > (bb_state -> bb_info).  I chose the former because it doesn't clash
> > with the RTL-SSA structure of the same name as the latter.
> > 
> 
> Addressed.
> >>  
> >>    for (auto insn : bb->nondebug_insns ())
> >>      {
> >> @@ -3184,31 +3311,31 @@ void ldp_fusion_bb (bb_info *bb)
> >>  	continue;
> >>  
> >>        rtx pat = PATTERN (rti);
> >> -      if (reload_completed
> >> -	  && aarch64_ldp_writeback > 1
> >> -	  && GET_CODE (pat) == PARALLEL
> >> -	  && XVECLEN (pat, 0) == 2)
> >> +      if (pair_mem_promote_writeback_p (pat))
> >>  	try_promote_writeback (insn);
> > 
> > It looks like try_promote_writeback itself will need some further work
> > to make it target-independent.  I suppose this check:
> > 
> >   auto rti = insn->rtl ();
> >   const auto attr = get_attr_ldpstp (rti);
> >   if (attr == LDPSTP_NONE)
> >     return;
> > 
> >   bool load_p = (attr == LDPSTP_LDP);
> >   gcc_checking_assert (load_p || attr == LDPSTP_STP);
> > 
> > will need to become part of the pair_mem_promote_writeback_p hook that you
> > added, potentially changing it to return a boolean for load_p.
> > 
> > Then I guess we will need hooks for destructuring the pair insn and
> > another hook to wrap aarch64_gen_writeback_pair.
> > 
> 
> Addressed.
> >>  
> >>        if (GET_CODE (pat) != SET)
> >>  	continue;
> >>  
> >>        if (track_stores && MEM_P (XEXP (pat, 0)))
> >> -	bb_state.track_access (insn, false, XEXP (pat, 0));
> >> +	bb_info.track_access (insn, false, XEXP (pat, 0));
> >>        else if (track_loads && MEM_P (XEXP (pat, 1)))
> >> -	bb_state.track_access (insn, true, XEXP (pat, 1));
> >> +	bb_info.track_access (insn, true, XEXP (pat, 1));
> >>      }
> >>  
> >> -  bb_state.transform ();
> >> -  bb_state.cleanup_tombstones ();
> >> +  bb_info.transform ();
> >> +  bb_info.cleanup_tombstones ();
> >>  }
> >>  
> >>  void ldp_fusion ()
> >>  {
> >>    ldp_fusion_init ();
> >> +  pair_fusion *pfuse;
> >> +  aarch64_pair_fusion derived;
> >> +  pfuse = &derived;
> > 
> > This is indeed the one place where I think it is acceptable to
> > instantiate aarch64_pair_fusion.  But again there's no need to create a
> > pointer to the parent class, just call any function you like directly.
> > 
> 
> Addressed.
> >>  
> >>    for (auto bb : crtl->ssa->bbs ())
> >> -    ldp_fusion_bb (bb);
> >> +    pfuse->ldp_fusion_bb (bb);
> > 
> > I think even the code to iterate over bbs should itself be a member
> > function of pair_fusion (say "run") and then that becomes part of the
> > generic code.
> > 
> > So this function would just become:
> > 
> > aarch64_pair_fusion pass;
> > pass.run ();
> > 
> > and could be inlined into the caller.
> > 
> 
> Addressed.
> > Perhaps you could also add an early return like:
> > 
> >   if (!track_loads_p () && !track_stores_p ())
> >     return;
> > 
> > in pair_fusion::run () and then remove the corresponding code from
> > pass_ldp_fusion::gate?
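> >
> > i.e. roughly (a sketch, assuming the init/destroy work has moved into
> > the pair_fusion ctor/dtor as suggested earlier):
> >
> >   void
> >   pair_fusion::run ()
> >   {
> >     if (!track_loads_p () && !track_stores_p ())
> >       return;
> >
> >     for (auto bb : crtl->ssa->bbs ())
> >       ldp_fusion_bb (bb);
> >   }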
> > 
> 
> Addressed.
> 
> >>  
> >>    ldp_fusion_destroy ();
> >>  }
> >> -- 
> >> 2.39.3
> >>
> > 
> > Thanks,
> > Alex
> 
> Thanks & Regards
> Ajit
Ajit Agarwal April 9, 2024, 2:31 p.m. UTC | #4
Hello Alex:

On 09/04/24 7:29 pm, Alex Coplan wrote:
> On 09/04/2024 17:30, Ajit Agarwal wrote:
>>
>>
>> On 05/04/24 10:03 pm, Alex Coplan wrote:
>>> On 05/04/2024 13:53, Ajit Agarwal wrote:
>>>> Hello Alex/Richard:
>>>>
>>>> All review comments are incorporated.
>>>
>>> Thanks, I was kind-of expecting you to also send the renaming patch as a
>>> preparatory patch as we discussed.
>>>
>>> Sorry for another meta comment, but: I think the reason that the Linaro
>>> CI isn't running tests on your patches is actually because you're
>>> sending 1/3 of a series but not sending the rest of the series.
>>>
>>> So please can you either send this as an individual preparatory patch
>>> (not marked as a series) or if you're going to send a series (e.g. with
>>> a preparatory rename patch as 1/2 and this as 2/2) then send the entire
>>> series when you make updates.  That way the CI should test your patches,
>>> which would be helpful.
>>>
>>
>> Addressed.
>>  
>>>>
>>>> Common infrastructure of load store pair fusion is divided into target
>>>> independent and target dependent changed code.
>>>>
>>>> Target independent code is the Generic code with pure virtual function
>>>> to interface between target independent and dependent code.
>>>>
>>>> Target dependent code is the implementation of pure virtual function for
>>>> aarch64 target and the call to target independent code.
>>>>
>>>> Thanks & Regards
>>>> Ajit
>>>>
>>>>
>>>> aarch64: Place target independent and dependent changed code in one file
>>>>
>>>> Common infrastructure of load store pair fusion is divided into target
>>>> independent and target dependent changed code.
>>>>
>>>> Target independent code is the Generic code with pure virtual function
>>>> to interface between target independent and dependent code.
>>>>
>>>> Target dependent code is the implementation of pure virtual function for
>>>> aarch64 target and the call to target independent code.
>>>>
>>>> 2024-04-06  Ajit Kumar Agarwal  <aagarwa1@linux.ibm.com>
>>>>
>>>> gcc/ChangeLog:
>>>>
>>>> 	* config/aarch64/aarch64-ldp-fusion.cc: Place target
>>>> 	independent and dependent changed code.
>>>
>>> You're going to need a proper ChangeLog eventually, but I guess there's
>>> no need for that right now.
>>>
>>>> ---
>>>>  gcc/config/aarch64/aarch64-ldp-fusion.cc | 371 +++++++++++++++--------
>>>>  1 file changed, 249 insertions(+), 122 deletions(-)
>>>>
>>>> diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc b/gcc/config/aarch64/aarch64-ldp-fusion.cc
>>>> index 22ed95eb743..cb21b514ef7 100644
>>>> --- a/gcc/config/aarch64/aarch64-ldp-fusion.cc
>>>> +++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc
>>>> @@ -138,8 +138,122 @@ struct alt_base
>>>>    poly_int64 offset;
>>>>  };
>>>>  
>>>> +// Virtual base class for load/store walkers used in alias analysis.
>>>> +struct alias_walker
>>>> +{
>>>> +  virtual bool conflict_p (int &budget) const = 0;
>>>> +  virtual insn_info *insn () const = 0;
>>>> +  virtual bool valid () const  = 0;
>>>
>>> Heh, looking at this made me realise there is a whitespace bug here in
>>> the existing code (double space after const).  Sorry about that!  I'll
>>> push an obvious fix for that.
>>>
>>>> +  virtual void advance () = 0;
>>>> +};
>>>> +
>>>> +struct pair_fusion {
>>>> +
>>>> +  pair_fusion () {};
>>>
>>> This ctor looks pointless at the moment.  Perhaps instead we could put
>>> the contents of ldp_fusion_init in here and then delete that function?
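> >>>
> >>> i.e. something like this sketch (the body here is just what I recall
> >>> ldp_fusion_init currently doing, so please double-check the details):
> >>>
> >>>   pair_fusion ()
> >>>   {
> >>>     calculate_dominance_info (CDI_DOMINATORS);
> >>>     df_analyze ();
> >>>     crtl->ssa = new rtl_ssa::function_info (cfun);
> >>>   }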
>>>
>>
>> Addressed.
>>
>>>> +  virtual bool fpsimd_op_p (rtx reg_op, machine_mode mem_mode,
>>>> +			       bool load_p) = 0;
>>>
>>> Please can we have comments above each of these virtual functions
>>> describing any parameters, what the purpose of the hook is, and the
>>> interpretation of the return value?  This will serve as the
>>> documentation for other targets that want to make use of the pass.
>>>
>>> It might make sense to have a default-false implementation for
>>> fpsimd_op_p, especially if you don't want to make use of this bit for
>>> rs6000.
>>>
>>
>> Addressed.
>>  
>>>> +
>>>> +  virtual bool pair_operand_mode_ok_p (machine_mode mode) = 0;
>>>> +  virtual bool pair_trailing_writeback_p () = 0;
>>>
>>> Sorry for the run-around, but: I think this and
>>> handle_writeback_opportunities () should be the same function, either
>>> returning an enum or taking an enum and returning a boolean.
>>>
>>> At a minimum they should have more similar sounding names.
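> >>>
> >>> E.g. one possible shape (the names here are invented):
> >>>
> >>>   // NEVER: no writeback pairs at all.
> >>>   // EXISTING: only pair up existing auto-inc accesses.
> >>>   // ALL: additionally try to fold in a trailing destructive update
> >>>   // of the base register to form a writeback pair.
> >>>   enum class writeback { NEVER, EXISTING, ALL };
> >>>
> >>>   virtual writeback handle_writeback_opportunities () = 0;
> >>>
> >>> with the aarch64 implementation mapping the 0/1/2 values of
> >>> aarch64_ldp_writeback onto those enumerators.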
>>>
>>
>> Addressed.
>>
>>>> +  virtual bool pair_reg_operand_ok_p (bool load_p, rtx reg_op,
>>>> +				      machine_mode mem_mode) = 0;
>>>> +  virtual int pair_mem_alias_check_limit () = 0;
>>>> +  virtual bool handle_writeback_opportunities () = 0 ;
>>>> +  virtual bool pair_mem_ok_with_policy (rtx first_mem, bool load_p,
>>>> +					machine_mode mode) = 0;
>>>> +  virtual rtx gen_mem_pair (rtx *pats,  rtx writeback,
>>>
>>> Nit: excess whitespace after pats,
>>>
>>>> +			    bool load_p) = 0;
>>>> +  virtual bool pair_mem_promote_writeback_p (rtx pat) = 0;
>>>> +  virtual bool track_load_p () = 0;
>>>> +  virtual bool track_store_p () = 0;
>>>
>>> I think it would probably make more sense for these two to have
>>> default-true implementations rather than being pure virtual functions.
>>>
>>> Also, sorry for the bikeshedding, but please can we keep the plural
>>> names (so track_loads_p and track_stores_p).
>>>
>>>> +  virtual bool cand_insns_empty_p (std::list<insn_info *> &insns) = 0;
>>>
> >>> Why does this need to be virtualised?  I would expect it to
>>> just be insns.empty () on all targets.
>>>
>>>> +  virtual bool pair_mem_in_range_p (HOST_WIDE_INT off) = 0;
>>>> +  void ldp_fusion_bb (bb_info *bb);
>>>
>>> Just to check, are you planning to rename this function in a separate
>>> patch?
>>>
>>
>> Yes.
>>>> +
>>>> +  ~pair_fusion() { }
>>>
>>> As with the ctor, perhaps we could put the contents of ldp_fusion_destroy
>>> in here and then delete that function?
>>>
>>
>> Addressed.
>>>> +};
>>>> +
>>>> +struct aarch64_pair_fusion : public pair_fusion
>>>> +{
>>>> +public:
>>>
>>
>> Addressed.
>>> Nit: there shouldn't be a need for this as public is already the default
>>> for structs.
>>>
>>>> +  aarch64_pair_fusion () : pair_fusion () {};
>>>
>>> This ctor looks pointless, the default one should do, so can we remove
>>> this?
>>>
>>
>> Addressed.
>>>> +  bool fpsimd_op_p (rtx reg_op, machine_mode mem_mode,
>>>> +		    bool load_p) override final
>>>> +  {
>>>> +    const bool fpsimd_op_p
>>>> +      = reload_completed
>>>> +      ? (REG_P (reg_op) && FP_REGNUM_P (REGNO (reg_op)))
>>>> +      : (GET_MODE_CLASS (mem_mode) != MODE_INT
>>>> +	 && (load_p || !aarch64_const_zero_rtx_p (reg_op)));
>>>> +    return fpsimd_op_p;
>>>
>>> Can this just directly return the boolean expression instead of storing
>>> it into a variable first?
>>>
>> Addressed.
>>>> +  }
>>>> +
>>>> +  bool pair_mem_promote_writeback_p (rtx pat)
>>>> +  {
>>>> +     if (reload_completed
>>>> +	 && aarch64_ldp_writeback > 1
>>>> +	 && GET_CODE (pat) == PARALLEL
>>>> +	 && XVECLEN (pat, 0) == 2)
>>>> +	return true;
>>>> +
>>>> +     return false;
>>>> +  }
>>>> +
>>>> +  bool pair_mem_ok_with_policy (rtx first_mem, bool load_p,
>>>> +				machine_mode mode)
>>>> +  {
>>>> +    return aarch64_mem_ok_with_ldpstp_policy_model (first_mem,
>>>> +						     load_p,
>>>> +						     mode);
>>>> +  }
>>>> +  bool pair_operand_mode_ok_p (machine_mode mode);
>>>
>>> Why have you implemented this (one-line function) out-of-line but the
>>> others are implemented inline?  Please be consistent.
>>>
>>
>> Addressed.
>>>> +
>>>> +  rtx gen_mem_pair (rtx *pats, rtx writeback, bool load_p);
>>>> +
>>>> +  bool pair_trailing_writeback_p  ()
>>>> +  {
>>>> +    return aarch64_ldp_writeback > 1;
>>>> +  }
>>>> +  bool pair_reg_operand_ok_p (bool load_p, rtx reg_op,
>>>> +			      machine_mode mem_mode)
>>>> +  {
>>>> +    return (load_p
>>>> +	     ? aarch64_ldp_reg_operand (reg_op, mem_mode)
>>>> +	     : aarch64_stp_reg_operand (reg_op, mem_mode));
>>>> +  }
>>>> +  int pair_mem_alias_check_limit ()
>>>> +  {
>>>> +    return aarch64_ldp_alias_check_limit;
>>>> +  }
>>>> +  bool handle_writeback_opportunities ()
>>>> +  {
>>>> +    return aarch64_ldp_writeback;
>>>> +  }
>>>> +  bool track_load_p ()
>>>> +  {
>>>> +    const bool track_loads
>>>> +      = aarch64_tune_params.ldp_policy_model != AARCH64_LDP_STP_POLICY_NEVER;
>>>> +    return track_loads;
>>>> +  }
>>>
>>
>> Addressed.
>>> As above, can this just return the boolean expression directly without
>>> the temporary?
>>>
>>>> +  bool track_store_p ()
>>>> +  {
>>>> +    const bool track_stores
>>>> +      = aarch64_tune_params.stp_policy_model != AARCH64_LDP_STP_POLICY_NEVER;
>>>> +    return track_stores;
>>>> +  }
>>>
>>
>> Addressed.
>>> Likewise.
>>>
>>>> +  bool cand_insns_empty_p (std::list<insn_info *> &insns)
>>>> +  {
>>>> +    return insns.empty();
>>>> +  }
>>>
>>> As mentioned on the declaration, I don't see why this needs to be
>>> virtualised.
>>>
>>>> +  bool pair_mem_in_range_p (HOST_WIDE_INT off)
>>>> +  {
>>>> +    return (off < LDP_MIN_IMM || off > LDP_MAX_IMM);
>>>> +  }
>>>
>>> Hmm, this looks to test exactly the opposite condition from what the
>>> name suggests, i.e. the implementation looks like what I would expect
>>> for a function called pair_mem_out_of_range_p.
>>>
>>> Please can you invert the condition and fix up callers appropriately?
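> >>>
> >>> i.e. I'd expect:
> >>>
> >>>   bool pair_mem_in_range_p (HOST_WIDE_INT off)
> >>>   {
> >>>     return off >= LDP_MIN_IMM && off <= LDP_MAX_IMM;
> >>>   }
> >>>
> >>> with the callers then testing !pair_mem_in_range_p (off).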
>>>
>>> This suggests you're not really thinking about the code you're writing,
>>> please try to think a bit more carefully before posting patches.
>>>
>>
>> Sorry, my mistake; I had overlooked it.
>>>> +};
>>>> +
>>>> +
>>>>  // State used by the pass for a given basic block.
>>>> -struct ldp_bb_info
>>>> +struct pair_fusion_bb_info
>>>>  {
>>>>    using def_hash = nofree_ptr_hash<def_info>;
>>>>    using expr_key_t = pair_hash<tree_operand_hash, int_hash<int, -1, -2>>;
>>>> @@ -160,14 +274,17 @@ struct ldp_bb_info
>>>>  
>>>>    static const size_t obstack_alignment = sizeof (void *);
>>>>    bb_info *m_bb;
>>>> +  pair_fusion *bb_state;
>>>
>>
>> Addressed.
>>
>>> Can we call this m_pass instead of bb_state?  The pair_fusion class is
>>> explicitly at the pass level rather than the bb level.
>>>
>>> Also I think this (and indeed m_bb, although that's pre-existing) could
>>> be private members.  Not sure there's a good reason for making them
>>> public.
>>>
>>
>> Addressed.
>>  
>>> It might be slightly nicer to use a reference for m_pass unless there's
>>> a good reason to use a pointer, although I don't feel strongly about
>>> this.
>>>
>>>>  
>>>> -  ldp_bb_info (bb_info *bb) : m_bb (bb), m_emitted_tombstone (false)
>>>> +  pair_fusion_bb_info (bb_info *bb,
>>>> +		      aarch64_pair_fusion *d) : m_bb (bb),
>>>
>>> This generic class needs to take the generic pass type, i.e. just a
>>> pair_fusion * rather than an aarch64_pair_fusion *.
>>>
>>>> +		      bb_state (d), m_emitted_tombstone (false)
>>>>    {
>>>>      obstack_specify_allocation (&m_obstack, OBSTACK_CHUNK_SIZE,
>>>>  				obstack_alignment, obstack_chunk_alloc,
>>>>  				obstack_chunk_free);
>>>>    }
>>>> -  ~ldp_bb_info ()
>>>> +  ~pair_fusion_bb_info ()
>>>>    {
>>>>      obstack_free (&m_obstack, nullptr);
>>>>  
>>>> @@ -177,10 +294,32 @@ struct ldp_bb_info
>>>>  	bitmap_obstack_release (&m_bitmap_obstack);
>>>>        }
>>>>    }
>>>> +  void track_access (insn_info *, bool load, rtx mem);
>>>> +  void transform ();
>>>> +  void cleanup_tombstones ();
>>>
>>> I think most (or indeed all?) of the changes here shouldn't be needed.
>>> AFAICS, there shouldn't be a need to:
>>>  - drop inline from these functions.
>>>  - change the accessibility of any members.
>>>
>>
>> Addressed.
>>> Please can you drop these changes or explain why they're needed (from
>>> here until the end of the class).
>>>
>>>> +  void merge_pairs (insn_list_t &, insn_list_t &,
>>>> +		    bool load_p, unsigned access_size);
>>>> +  void transform_for_base (int load_size, access_group &group);
>>>> +
>>>> +  bool try_fuse_pair (bool load_p, unsigned access_size,
>>>> +			     insn_info *i1, insn_info *i2);
>>>> +
>>>> +  bool fuse_pair (bool load_p, unsigned access_size,
>>>> +		  int writeback,
>>>> +		  insn_info *i1, insn_info *i2,
>>>> +		  base_cand &base,
>>>> +		  const insn_range_info &move_range);
>>>>  
>>>> -  inline void track_access (insn_info *, bool load, rtx mem);
>>>> -  inline void transform ();
>>>> -  inline void cleanup_tombstones ();
>>>> +  void do_alias_analysis (insn_info *alias_hazards[4],
>>>> +			  alias_walker *walkers[4],
>>>> +			  bool load_p);
>>>> +
>>>> +  void track_tombstone (int uid);
>>>> +
>>>> +  bool track_via_mem_expr (insn_info *, rtx mem, lfs_fields lfs);
>>>> +
>>>> +  template<typename Map>
>>>> +    void traverse_base_map (Map &map);
>>>>  
>>>>  private:
>>>>    obstack m_obstack;
>>>> @@ -191,30 +330,32 @@ private:
>>>>    bool m_emitted_tombstone;
>>>>  
>>>>    inline splay_tree_node<access_record *> *node_alloc (access_record *);
>>>> +};
>>>>  
>>>> -  template<typename Map>
>>>> -  inline void traverse_base_map (Map &map);
>>>> -  inline void transform_for_base (int load_size, access_group &group);
>>>> -
>>>> -  inline void merge_pairs (insn_list_t &, insn_list_t &,
>>>> -			   bool load_p, unsigned access_size);
>>>> -
>>>> -  inline bool try_fuse_pair (bool load_p, unsigned access_size,
>>>> -			     insn_info *i1, insn_info *i2);
>>>> -
>>>> -  inline bool fuse_pair (bool load_p, unsigned access_size,
>>>> -			 int writeback,
>>>> -			 insn_info *i1, insn_info *i2,
>>>> -			 base_cand &base,
>>>> -			 const insn_range_info &move_range);
>>>> -
>>>> -  inline void track_tombstone (int uid);
>>>> +rtx aarch64_pair_fusion::gen_mem_pair (rtx *pats,
>>>
>>> Style nit: newline after rtx.
>>>
>> Addressed.
>>
>>>> +		  rtx writeback,
>>>> +		  bool load_p)
>>>> +  {
>>>> +    rtx pair_pat;
>>>>  
>>>> -  inline bool track_via_mem_expr (insn_info *, rtx mem, lfs_fields lfs);
>>>> -};
>>>> +    if (writeback)
>>>> +      {
>>>> +	auto patvec = gen_rtvec (3, writeback, pats[0], pats[1]);
>>>> +	pair_pat = gen_rtx_PARALLEL (VOIDmode, patvec);
>>>> +      }
>>>> +    else if (load_p)
>>>> +      pair_pat = aarch64_gen_load_pair (XEXP (pats[0], 0),
>>>> +					XEXP (pats[1], 0),
>>>> +					XEXP (pats[0], 1));
>>>> +    else
>>>> +      pair_pat = aarch64_gen_store_pair (XEXP (pats[0], 0),
>>>> +					 XEXP (pats[0], 1),
>>>> +					 XEXP (pats[1], 1));
>>>> +     return pair_pat;
>>>
>>> Can the assignments just be direct returns here, please?  Then
>>> get rid of the temporary.
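> >>>
> >>> i.e. a sketch of the same function with the temporary removed:
> >>>
> >>>   rtx
> >>>   aarch64_pair_fusion::gen_mem_pair (rtx *pats, rtx writeback,
> >>>                                      bool load_p)
> >>>   {
> >>>     if (writeback)
> >>>       {
> >>>         auto patvec = gen_rtvec (3, writeback, pats[0], pats[1]);
> >>>         return gen_rtx_PARALLEL (VOIDmode, patvec);
> >>>       }
> >>>     if (load_p)
> >>>       return aarch64_gen_load_pair (XEXP (pats[0], 0),
> >>>                                     XEXP (pats[1], 0),
> >>>                                     XEXP (pats[0], 1));
> >>>     return aarch64_gen_store_pair (XEXP (pats[0], 0),
> >>>                                    XEXP (pats[0], 1),
> >>>                                    XEXP (pats[1], 1));
> >>>   }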
>>>
>>
>> Addressed.
>>  
>>>> +  }
>>>>  
>>>>  splay_tree_node<access_record *> *
>>>> -ldp_bb_info::node_alloc (access_record *access)
>>>> +pair_fusion_bb_info::node_alloc (access_record *access)
>>>>  {
>>>>    using T = splay_tree_node<access_record *>;
>>>>    void *addr = obstack_alloc (&m_obstack, sizeof (T));
>>>> @@ -262,7 +403,7 @@ drop_writeback (rtx mem)
>>>>  // RTX_AUTOINC addresses.  The interface is like strip_offset except we take a
>>>>  // MEM so that we know the mode of the access.
>>>>  static rtx
>>>> -ldp_strip_offset (rtx mem, poly_int64 *offset)
>>>> +pair_mem_strip_offset (rtx mem, poly_int64 *offset)
>>>
>>> I thought we discussed splitting the renaming into a separate patch?
>>>
>>>>  {
>>>>    rtx addr = XEXP (mem, 0);
>>>>  
>>>> @@ -332,6 +473,12 @@ ldp_operand_mode_ok_p (machine_mode mode)
>>>>    return reload_completed || mode != TImode;
>>>>  }
>>>>  
>>>> +bool
>>>> +aarch64_pair_fusion::pair_operand_mode_ok_p (machine_mode mode)
>>>> +{
>>>> +  return ldp_operand_mode_ok_p (mode);
>>>> +}
>>>> +
>>>>  // Given LFS (load_p, fpsimd_p, size) fields in FIELDS, encode these
>>>>  // into an integer for use as a hash table key.
>>>>  static int
>>>> @@ -396,7 +543,7 @@ access_group::track (Alloc alloc_node, poly_int64 offset, insn_info *insn)
>>>>  // MEM_EXPR base (i.e. a tree decl) relative to which we can track the access.
>>>>  // LFS is used as part of the key to the hash table, see track_access.
>>>>  bool
>>>> -ldp_bb_info::track_via_mem_expr (insn_info *insn, rtx mem, lfs_fields lfs)
>>>> +pair_fusion_bb_info::track_via_mem_expr (insn_info *insn, rtx mem, lfs_fields lfs)
>>>
>>> Again, please do any renaming in a separate patch.
>>>
>>
>> Addressed.
>>>>  {
>>>>    if (!MEM_EXPR (mem) || !MEM_OFFSET_KNOWN_P (mem))
>>>>      return false;
>>>> @@ -412,9 +559,10 @@ ldp_bb_info::track_via_mem_expr (insn_info *insn, rtx mem, lfs_fields lfs)
>>>>    const machine_mode mem_mode = GET_MODE (mem);
>>>>    const HOST_WIDE_INT mem_size = GET_MODE_SIZE (mem_mode).to_constant ();
>>>>  
>>>> -  // Punt on misaligned offsets.  LDP/STP instructions require offsets to be a
>>>> -  // multiple of the access size, and we believe that misaligned offsets on
>>>> -  // MEM_EXPR bases are likely to lead to misaligned offsets w.r.t. RTL bases.
>>>> +  // Punt on misaligned offsets.  Paired memory access instructions require
>>>> +  // offsets to be a multiple of the access size, and we believe that
>>>> +  // misaligned offsets on MEM_EXPR bases are likely to lead to misaligned
>>>> +  // offsets w.r.t. RTL bases.
>>>>    if (!multiple_p (offset, mem_size))
>>>>      return false;
>>>>  
>>>> @@ -438,46 +586,38 @@ ldp_bb_info::track_via_mem_expr (insn_info *insn, rtx mem, lfs_fields lfs)
>>>>  }
>>>>  
>>>>  // Main function to begin pair discovery.  Given a memory access INSN,
>>>> -// determine whether it could be a candidate for fusing into an ldp/stp,
>>>> +// determine whether it could be a candidate for fusing into an pair mem,
>>>
>>> s/an pair mem/a paired access/
>>>
>>>>  // and if so, track it in the appropriate data structure for this basic
>>>>  // block.  LOAD_P is true if the access is a load, and MEM is the mem
>>>>  // rtx that occurs in INSN.
>>>>  void
>>>> -ldp_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
>>>> +pair_fusion_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
>>>>  {
>>>>    // We can't combine volatile MEMs, so punt on these.
>>>>    if (MEM_VOLATILE_P (mem))
>>>>      return;
>>>>  
>>>> -  // Ignore writeback accesses if the param says to do so.
>>>> -  if (!aarch64_ldp_writeback
>>>> +  // Ignore writeback accesses if the param says to do so
>>>
>>
>> Addressed.
>>> This comment needs updating, I guess something like "if the hook says to
>>> do so".  But also please don't delete the full stop.
>>>
>>>> +  if (!bb_state->handle_writeback_opportunities ()
>>>>        && GET_RTX_CLASS (GET_CODE (XEXP (mem, 0))) == RTX_AUTOINC)
>>>>      return;
>>>>  
>>>>    const machine_mode mem_mode = GET_MODE (mem);
>>>> -  if (!ldp_operand_mode_ok_p (mem_mode))
>>>> +
>>>
>>> I don't think we need the extra whitespace here.
>>>
>>>> +  if (!bb_state->pair_operand_mode_ok_p (mem_mode))
>>>>      return;
>>>>  
>>>>    rtx reg_op = XEXP (PATTERN (insn->rtl ()), !load_p);
>>>>  
>>>> -  // Ignore the access if the register operand isn't suitable for ldp/stp.
>>>> -  if (load_p
>>>> -      ? !aarch64_ldp_reg_operand (reg_op, mem_mode)
>>>> -      : !aarch64_stp_reg_operand (reg_op, mem_mode))
>>>> +  if (!bb_state->pair_reg_operand_ok_p (load_p, reg_op, mem_mode))
>>>>      return;
>>>> -
>>>>    // We want to segregate FP/SIMD accesses from GPR accesses.
>>>>    //
>>>>    // Before RA, we use the modes, noting that stores of constant zero
>>>>    // operands use GPRs (even in non-integer modes).  After RA, we use
>>>>    // the hard register numbers.
>>>> -  const bool fpsimd_op_p
>>>> -    = reload_completed
>>>> -    ? (REG_P (reg_op) && FP_REGNUM_P (REGNO (reg_op)))
>>>> -    : (GET_MODE_CLASS (mem_mode) != MODE_INT
>>>> -       && (load_p || !aarch64_const_zero_rtx_p (reg_op)));
>>>> -
>>>> -  // Note ldp_operand_mode_ok_p already rejected VL modes.
>>>> +  const bool fpsimd_op_p = bb_state->fpsimd_op_p (reg_op, mem_mode, load_p);
>>>> +  // Note pair_operand_mode_ok_p already rejected VL modes.
>>>>    const HOST_WIDE_INT mem_size = GET_MODE_SIZE (mem_mode).to_constant ();
>>>>    const lfs_fields lfs = { load_p, fpsimd_op_p, mem_size };
>>>>  
>>>> @@ -487,7 +627,7 @@ ldp_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
>>>>    poly_int64 mem_off;
>>>>    rtx addr = XEXP (mem, 0);
>>>>    const bool autoinc_p = GET_RTX_CLASS (GET_CODE (addr)) == RTX_AUTOINC;
>>>> -  rtx base = ldp_strip_offset (mem, &mem_off);
>>>> +  rtx base = pair_mem_strip_offset (mem, &mem_off);
>>>
>>> Again, let's split the renaming off into a separate patch.
>>>
>>
>> Addressed.
>>>>    if (!REG_P (base))
>>>>      return;
>>>>  
>>>> @@ -506,8 +646,8 @@ ldp_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
>>>>    // elimination offset pre-RA, we should postpone forming pairs on such
>>>>    // accesses until after RA.
>>>>    //
>>>> -  // As it stands, addresses with offsets in range for LDR but not
>>>> -  // in range for LDP/STP are currently reloaded inefficiently,
>>>> +  // As it stands, addresses in range for an individual load/store but not
>>>> +  // for a paired access are currently reloaded inefficiently,
>>>>    // ending up with a separate base register for each pair.
>>>>    //
>>>>    // In theory LRA should make use of
>>>> @@ -519,8 +659,8 @@ ldp_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
>>>>    // that calls targetm.legitimize_address_displacement.
>>>>    //
>>>>    // So for now, it's better to punt when we can't be sure that the
>>>> -  // offset is in range for LDP/STP.  Out-of-range cases can then be
>>>> -  // handled after RA by the out-of-range LDP/STP peepholes.  Eventually, it
>>>> +  // offset is in range for paired access.  Out-of-range cases can then be
>>>> +  // handled after RA by the out-of-range PAIR MEM  peepholes.  Eventually, it
>>>>    // would be nice to handle known out-of-range opportunities in the
>>>>    // pass itself (for stack accesses, this would be in the post-RA pass).
>>>>    if (!reload_completed
>>>> @@ -573,8 +713,8 @@ ldp_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
>>>>  	gcc_unreachable (); // Base defs should be unique.
>>>>      }
>>>>  
>>>> -  // Punt on misaligned offsets.  LDP/STP require offsets to be a multiple of
>>>> -  // the access size.
>>>> +  // Punt on misaligned offsets.  Paired memory accesses require offsets
>>>> +  // to be a multiple of the access size.
>>>>    if (!multiple_p (mem_off, mem_size))
>>>>      return;
>>>>  
>>>> @@ -614,7 +754,7 @@ static bool no_ignore (insn_info *) { return false; }
>>>>  // making use of alias disambiguation.
>>>>  static insn_info *
>>>>  latest_hazard_before (insn_info *insn, rtx *ignore,
>>>> -		      insn_info *ignore_insn = nullptr)
>>>> +		      insn_info *ignore_insn = 0)
>>>
>>> Is this change needed?
>>>
>>
>> Addressed.
>>>>  {
>>>>    insn_info *result = nullptr;
>>>>  
>>>> @@ -1150,7 +1290,7 @@ extract_writebacks (bool load_p, rtx pats[2], int changed)
>>>>        const bool autoinc_p = GET_RTX_CLASS (GET_CODE (addr)) == RTX_AUTOINC;
>>>>  
>>>>        poly_int64 offset;
>>>> -      rtx this_base = ldp_strip_offset (mem, &offset);
>>>> +      rtx this_base = pair_mem_strip_offset (mem, &offset);
>>>>        gcc_assert (REG_P (this_base));
>>>>        if (base_reg)
>>>>  	gcc_assert (rtx_equal_p (base_reg, this_base));
>>>> @@ -1286,7 +1426,11 @@ find_trailing_add (insn_info *insns[2],
>>>>  
>>>>    off_hwi /= access_size;
>>>>  
>>>> -  if (off_hwi < LDP_MIN_IMM || off_hwi > LDP_MAX_IMM)
>>>> +  pair_fusion *pfuse;
>>>> +  aarch64_pair_fusion derived;
>>>> +  pfuse = &derived;
>>>
>>> Huh? We'll want to use the same instance of the pass structure (pair_fusion)
>>> that gets created at the top level, so that should get passed around everywhere.
>>>
>>> In the case of find_trailing_add we probably don't want to add any more
>>> parameters (as Segher noted there are already too many), so perhaps the
>>> easiest thing would be to make find_trailing_add itself a (non-virtual)
>>> member function of pair_fusion.
>>>
>>> Then here we can just query pair_mem_in_range_p directly.
>>>
>>> For try_promote_writeback itself you can either make it a non-virtual
>>> member function of pair_fusion or just pass a reference to the pass
>>> object in when you call it.
>>>
>>> Note also that you don't have to create a pointer of the parent type in
>>> order to call a virtual function on a derived object, you can just call
>>> it directly.
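> >>>
> >>> i.e. once find_trailing_add is a member of pair_fusion, this whole
> >>> block just becomes:
> >>>
> >>>   if (!pair_mem_in_range_p (off_hwi))
> >>>     return nullptr;
> >>>
> >>> (using the inverted sense of the predicate suggested above).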
>>>
>>
>> Addressed.
>>>> +
>>>> +  if (pfuse->pair_mem_in_range_p (off_hwi))
>>>>      return nullptr;
>>>
>>> Note of course this condition needs inverting when you invert the
>>> implementation.
>>>
>>
>> Addressed.
>>>>  
>>>>    auto dump_prefix = [&]()
>>>> @@ -1328,7 +1472,7 @@ find_trailing_add (insn_info *insns[2],
>>>>  // We just emitted a tombstone with uid UID, track it in a bitmap for
>>>>  // this BB so we can easily identify it later when cleaning up tombstones.
>>>>  void
>>>> -ldp_bb_info::track_tombstone (int uid)
>>>> +pair_fusion_bb_info::track_tombstone (int uid)
>>>
>>> Again, please split the renaming off.  Same with other instances below.
>>>
>>>>  {
>>>>    if (!m_emitted_tombstone)
>>>>      {
>>>> @@ -1528,7 +1672,7 @@ fixup_debug_uses (obstack_watermark &attempt,
>>>>  	  gcc_checking_assert (GET_RTX_CLASS (GET_CODE (XEXP (mem, 0)))
>>>>  			       == RTX_AUTOINC);
>>>>  
>>>> -	  base = ldp_strip_offset (mem, &offset);
>>>> +	  base = pair_mem_strip_offset (mem, &offset);
>>>>  	  gcc_checking_assert (REG_P (base) && REGNO (base) == base_regno);
>>>>  	}
>>>>        fixup_debug_use (attempt, use, def, base, offset);
>>>> @@ -1664,7 +1808,7 @@ fixup_debug_uses (obstack_watermark &attempt,
>>>>  // BASE gives the chosen base candidate for the pair and MOVE_RANGE is
>>>>  // a singleton range which says where to place the pair.
>>>>  bool
>>>> -ldp_bb_info::fuse_pair (bool load_p,
>>>> +pair_fusion_bb_info::fuse_pair (bool load_p,
>>>>  			unsigned access_size,
>>>>  			int writeback,
>>>>  			insn_info *i1, insn_info *i2,
>>>> @@ -1800,7 +1944,7 @@ ldp_bb_info::fuse_pair (bool load_p,
>>>>      {
>>>>        if (dump_file)
>>>>  	fprintf (dump_file,
>>>> -		 "  ldp: i%d has wb but subsequent i%d has non-wb "
>>>> +		 "  load pair: i%d has wb but subsequent i%d has non-wb "
>>>>  		 "update of base (r%d), dropping wb\n",
>>>>  		 insns[0]->uid (), insns[1]->uid (), base_regno);
>>>>        gcc_assert (writeback_effect);
>>>> @@ -1823,7 +1967,7 @@ ldp_bb_info::fuse_pair (bool load_p,
>>>>      }
>>>>  
>>>>    // If either of the original insns had writeback, but the resulting pair insn
>>>> -  // does not (can happen e.g. in the ldp edge case above, or if the writeback
>>>> +  // does not (can happen e.g. in the load pair edge case above, or if the writeback
>>>>    // effects cancel out), then drop the def(s) of the base register as
>>>>    // appropriate.
>>>>    //
>>>> @@ -1842,7 +1986,7 @@ ldp_bb_info::fuse_pair (bool load_p,
>>>>    // update of the base register and try and fold it in to make this into a
>>>>    // writeback pair.
>>>>    insn_info *trailing_add = nullptr;
>>>> -  if (aarch64_ldp_writeback > 1
>>>> +  if (bb_state->pair_trailing_writeback_p ()
>>>>        && !writeback_effect
>>>>        && (!load_p || (!refers_to_regno_p (base_regno, base_regno + 1,
>>>>  					 XEXP (pats[0], 0), nullptr)
>>>> @@ -1863,14 +2007,14 @@ ldp_bb_info::fuse_pair (bool load_p,
>>>>      }
>>>>  
>>>>    // Now that we know what base mem we're going to use, check if it's OK
>>>> -  // with the ldp/stp policy.
>>>> +  // with the pair mem policy.
>>>>    rtx first_mem = XEXP (pats[0], load_p);
>>>> -  if (!aarch64_mem_ok_with_ldpstp_policy_model (first_mem,
>>>> -						load_p,
>>>> -						GET_MODE (first_mem)))
>>>> +  if (!bb_state->pair_mem_ok_with_policy (first_mem,
>>>> +			  load_p,
>>>> +			  GET_MODE (first_mem)))
>>>>      {
>>>>        if (dump_file)
>>>> -	fprintf (dump_file, "punting on pair (%d,%d), ldp/stp policy says no\n",
>>>> +	fprintf (dump_file, "punting on pair (%d,%d), pair mem  policy says no\n",
>>>>  		 i1->uid (), i2->uid ());
>>>>        return false;
>>>>      }
>>>> @@ -1878,21 +2022,10 @@ ldp_bb_info::fuse_pair (bool load_p,
>>>>    rtx reg_notes = combine_reg_notes (first, second, load_p);
>>>>  
>>>>    rtx pair_pat;
>>>> -  if (writeback_effect)
>>>> -    {
>>>> -      auto patvec = gen_rtvec (3, writeback_effect, pats[0], pats[1]);
>>>> -      pair_pat = gen_rtx_PARALLEL (VOIDmode, patvec);
>>>> -    }
>>>> -  else if (load_p)
>>>> -    pair_pat = aarch64_gen_load_pair (XEXP (pats[0], 0),
>>>> -				      XEXP (pats[1], 0),
>>>> -				      XEXP (pats[0], 1));
>>>> -  else
>>>> -    pair_pat = aarch64_gen_store_pair (XEXP (pats[0], 0),
>>>> -				       XEXP (pats[0], 1),
>>>> -				       XEXP (pats[1], 1));
>>>>  
>>>> +  pair_pat = bb_state->gen_mem_pair (pats, writeback_effect, load_p);
>>>>    insn_change *pair_change = nullptr;
>>>> +
>>>>    auto set_pair_pat = [pair_pat,reg_notes](insn_change *change) {
>>>>      rtx_insn *rti = change->insn ()->rtl ();
>>>>      validate_unshare_change (rti, &PATTERN (rti), pair_pat, true);
>>>> @@ -2133,15 +2266,6 @@ load_modified_by_store_p (insn_info *load,
>>>>    return false;
>>>>  }
>>>>  
>>>> -// Virtual base class for load/store walkers used in alias analysis.
>>>> -struct alias_walker
>>>> -{
>>>> -  virtual bool conflict_p (int &budget) const = 0;
>>>> -  virtual insn_info *insn () const = 0;
>>>> -  virtual bool valid () const  = 0;
>>>> -  virtual void advance () = 0;
>>>> -};
>>>> -
>>>>  // Implement some common functionality used by both store_walker
>>>>  // and load_walker.
>>>>  template<bool reverse>
>>>> @@ -2259,13 +2383,13 @@ public:
>>>>  //
>>>>  // We try to maintain the invariant that if a walker becomes invalid, we
>>>>  // set its pointer to null.
>>>> -static void
>>>> -do_alias_analysis (insn_info *alias_hazards[4],
>>>> +void
>>>> +pair_fusion_bb_info::do_alias_analysis (insn_info *alias_hazards[4],
>>>
>>> It would probably make more sense for this to be a (non-virtual) member
>>> function of the pair_fusion class, since it doesn't need to access the
>>> BB-specific state.
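> >>>
> >>> i.e. (sketch):
> >>>
> >>>   void
> >>>   pair_fusion::do_alias_analysis (insn_info *alias_hazards[4],
> >>>                                   alias_walker *walkers[4],
> >>>                                   bool load_p)
> >>>
> >>> at which point the budget initialisation can call the virtual
> >>> function directly:
> >>>
> >>>   int budget = pair_mem_alias_check_limit ();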
>>>
>>
>> Addressed.
>>>>  		   alias_walker *walkers[4],
>>>>  		   bool load_p)
>>>>  {
>>>>    const int n_walkers = 2 + (2 * !load_p);
>>>> -  int budget = aarch64_ldp_alias_check_limit;
>>>> +  int budget = bb_state->pair_mem_alias_check_limit ();
>>>>  
>>>>    auto next_walker = [walkers,n_walkers](int current) -> int {
>>>>      for (int j = 1; j <= n_walkers; j++)
>>>> @@ -2365,7 +2489,7 @@ get_viable_bases (insn_info *insns[2],
>>>>      {
>>>>        const bool is_lower = (i == reversed);
>>>>        poly_int64 poly_off;
>>>> -      rtx base = ldp_strip_offset (cand_mems[i], &poly_off);
>>>> +      rtx base = pair_mem_strip_offset (cand_mems[i], &poly_off);
>>>>        if (GET_RTX_CLASS (GET_CODE (XEXP (cand_mems[i], 0))) == RTX_AUTOINC)
>>>>  	writeback |= (1 << i);
>>>>  
>>>> @@ -2373,7 +2497,7 @@ get_viable_bases (insn_info *insns[2],
>>>>  	continue;
>>>>  
>>>>        // Punt on accesses relative to eliminable regs.  See the comment in
>>>> -      // ldp_bb_info::track_access for a detailed explanation of this.
>>>> +      // pair_fusion_bb_info::track_access for a detailed explanation of this.
>>>>        if (!reload_completed
>>>>  	  && (REGNO (base) == FRAME_POINTER_REGNUM
>>>>  	      || REGNO (base) == ARG_POINTER_REGNUM))
>>>> @@ -2397,7 +2521,11 @@ get_viable_bases (insn_info *insns[2],
>>>>        if (!is_lower)
>>>>  	base_off--;
>>>>  
>>>> -      if (base_off < LDP_MIN_IMM || base_off > LDP_MAX_IMM)
>>>> +      pair_fusion *pfuse;
>>>> +      aarch64_pair_fusion derived;
>>>> +      pfuse = &derived;
>>>
>>> Again, please don't do this but instead use the same instance
>>> everywhere.
>>>
>>
>> Addressed.
>>>> +
>>>> +      if (pfuse->pair_mem_in_range_p (base_off))
>>>>  	continue;
>>>>  
>>>>        use_info *use = find_access (insns[i]->uses (), REGNO (base));
>>>> @@ -2454,12 +2582,12 @@ get_viable_bases (insn_info *insns[2],
>>>>  }
>>>>  
>>>>  // Given two adjacent memory accesses of the same size, I1 and I2, try
>>>> -// and see if we can merge them into a ldp or stp.
>>>> +// and see if we can merge them into a paired accesses load and store.
>>>
>>> "into a load pair or store pair" or "into a paired access" if you
>>> prefer.
>>>
>>
>> Addressed.
>>>>  //
>>>>  // ACCESS_SIZE gives the (common) size of a single access, LOAD_P is true
>>>>  // if the accesses are both loads, otherwise they are both stores.
>>>>  bool
>>>> -ldp_bb_info::try_fuse_pair (bool load_p, unsigned access_size,
>>>> +pair_fusion_bb_info::try_fuse_pair (bool load_p, unsigned access_size,
>>>>  			    insn_info *i1, insn_info *i2)
>>>>  {
>>>>    if (dump_file)
>>>> @@ -2494,7 +2622,7 @@ ldp_bb_info::try_fuse_pair (bool load_p, unsigned access_size,
>>>>      {
>>>>        if (dump_file)
>>>>  	fprintf (dump_file,
>>>> -		 "punting on ldp due to reg conflcits (%d,%d)\n",
>>>> +		 "punting on pair mem load  due to reg conflcits (%d,%d)\n",
>>>>  		 insns[0]->uid (), insns[1]->uid ());
>>>>        return false;
>>>>      }
>>>> @@ -2843,7 +2971,7 @@ debug (const insn_list_t &l)
>>>>  // we can't re-order them anyway, so provided earlier passes have cleaned up
>>>>  // redundant loads, we shouldn't miss opportunities by doing this.
>>>>  void
>>>> -ldp_bb_info::merge_pairs (insn_list_t &left_list,
>>>> +pair_fusion_bb_info::merge_pairs (insn_list_t &left_list,
>>>>  			  insn_list_t &right_list,
>>>>  			  bool load_p,
>>>>  			  unsigned access_size)
>>>> @@ -2890,8 +3018,8 @@ ldp_bb_info::merge_pairs (insn_list_t &left_list,
>>>>  // of accesses.  If we find two sets of adjacent accesses, call
>>>>  // merge_pairs.
>>>>  void
>>>> -ldp_bb_info::transform_for_base (int encoded_lfs,
>>>> -				 access_group &group)
>>>> +pair_fusion_bb_info::transform_for_base (int encoded_lfs,
>>>> +					 access_group &group)
>>>>  {
>>>>    const auto lfs = decode_lfs (encoded_lfs);
>>>>    const unsigned access_size = lfs.size;
>>>> @@ -2909,7 +3037,7 @@ ldp_bb_info::transform_for_base (int encoded_lfs,
>>>>  		       access.cand_insns,
>>>>  		       lfs.load_p,
>>>>  		       access_size);
>>>> -	  skip_next = access.cand_insns.empty ();
>>>> +	  skip_next = bb_state->cand_insns_empty_p (access.cand_insns);
>>>
>>> As above, why is this needed?
>>
>> For rs6000 we want this to always return true: the pass currently
>> forms load/store pair candidates at offsets 8/16, 16/32 and 32/64,
>> but we only want the pairs at 8/16 and 32/64.  That's why, for
>> rs6000, we always return true, so that the overlapping pairs are
>> skipped as above.
> 
> Hmm, sorry, I'm not sure I follow.  Are you saying that for rs6000 you have
> load/store pair instructions where the two arms of the access are storing
> operands of different sizes?  Or something else?
> 
> As it stands the logic is to skip the next iteration only if we
> exhausted all the candidate insns for the current access.  In the case
> that we didn't exhaust all such candidates, then the idea is that when
> access becomes prev_access, we can attempt to use those candidates as
> the "left-hand side" of a pair in the next iteration since we failed to
> use them as the "right-hand side" of a pair in the current iteration.
> I don't see why you wouldn't want that behaviour.  Please can you
> explain?
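> 
> For reference, the loop in question looks roughly like this
> (simplified from transform_for_base):
> 
>   bool skip_next = true;
>   access_record *prev_access = nullptr;
> 
>   for (auto &access : group.list)
>     {
>       if (skip_next)
>         skip_next = false;
>       else if (known_eq (access.offset, prev_access->offset + access_size))
>         {
>           merge_pairs (prev_access->cand_insns, access.cand_insns,
>                        lfs.load_p, access_size);
>           skip_next = access.cand_insns.empty ();
>         }
>       prev_access = &access;
>     }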
> 

In merge_pairs we get two load candidates: one load from offset 0 and
the other from offset 16.  Then in the next iteration we get the load
from offset 16 and another load from offset 32.  In the iteration
after that we get the load from offset 32 and another load from
offset 48.

For example, currently we get the load candidates as follows:

pairs:

load from offset 0.
load from offset 16.

next pairs:

load from offset 16.
load from offset 32.

next pairs:

load from offset 32.
load from offset 48.

Instead, on rs6000 we should get:

pairs:

load from offset 0.
load from offset 16.

next pairs:

load from offset 32.
load from offset 48.

Thanks & Regards
Ajit
> Thanks,
> Alex
> 
>>
>>>
>>>>  	}
>>>>        prev_access = &access;
>>>>      }
>>>> @@ -2919,7 +3047,7 @@ ldp_bb_info::transform_for_base (int encoded_lfs,
>>>>  // and remove all the tombstone insns, being sure to reparent any uses
>>>>  // of mem to previous defs when we do this.
>>>>  void
>>>> -ldp_bb_info::cleanup_tombstones ()
>>>> +pair_fusion_bb_info::cleanup_tombstones ()
>>>>  {
>>>>    // No need to do anything if we didn't emit a tombstone insn for this BB.
>>>>    if (!m_emitted_tombstone)
>>>> @@ -2947,7 +3075,7 @@ ldp_bb_info::cleanup_tombstones ()
>>>>  
>>>>  template<typename Map>
>>>>  void
>>>> -ldp_bb_info::traverse_base_map (Map &map)
>>>> +pair_fusion_bb_info::traverse_base_map (Map &map)
>>>>  {
>>>>    for (auto kv : map)
>>>>      {
>>>> @@ -2958,7 +3086,7 @@ ldp_bb_info::traverse_base_map (Map &map)
>>>>  }
>>>>  
>>>>  void
>>>> -ldp_bb_info::transform ()
>>>> +pair_fusion_bb_info::transform ()
>>>>  {
>>>>    traverse_base_map (expr_map);
>>>>    traverse_base_map (def_map);
>>>> @@ -3167,14 +3295,13 @@ try_promote_writeback (insn_info *insn)
>>>>  // for load/store candidates.  If running after RA, also try and promote
>>>>  // non-writeback pairs to use writeback addressing.  Then try to fuse
>>>>  // candidates into pairs.
>>>> -void ldp_fusion_bb (bb_info *bb)
>>>> +void pair_fusion::ldp_fusion_bb (bb_info *bb)
>>>>  {
>>>> -  const bool track_loads
>>>> -    = aarch64_tune_params.ldp_policy_model != AARCH64_LDP_STP_POLICY_NEVER;
>>>> -  const bool track_stores
>>>> -    = aarch64_tune_params.stp_policy_model != AARCH64_LDP_STP_POLICY_NEVER;
>>>> +  const bool track_loads = track_load_p ();
>>>> +  const bool track_stores = track_store_p ();
>>>>  
>>>> -  ldp_bb_info bb_state (bb);
>>>
>>> This:
>>>
>>>> +  aarch64_pair_fusion derived;
>>>
>>> can be deleted and then:
>>>
>>>> +  pair_fusion_bb_info bb_info (bb, &derived);
>>>
>>> can just be:
>>>
>>>   pair_fusion_bb_info bb_info (bb, this);
>>>
>>> (or you can pass *this if you make bb_info take a reference).
>>>
>>> I don't think there's a particular need to change the variable name
>>> (bb_state -> bb_info).  I chose the former because it doesn't clash
>>> with the RTL-SSA structure of the same name as the latter.
>>>
>>
>> Addressed.
>>>>  
>>>>    for (auto insn : bb->nondebug_insns ())
>>>>      {
>>>> @@ -3184,31 +3311,31 @@ void ldp_fusion_bb (bb_info *bb)
>>>>  	continue;
>>>>  
>>>>        rtx pat = PATTERN (rti);
>>>> -      if (reload_completed
>>>> -	  && aarch64_ldp_writeback > 1
>>>> -	  && GET_CODE (pat) == PARALLEL
>>>> -	  && XVECLEN (pat, 0) == 2)
>>>> +      if (pair_mem_promote_writeback_p (pat))
>>>>  	try_promote_writeback (insn);
>>>
>>> It looks like try_promote_writeback itself will need some further work
>>> to make it target-independent.  I suppose this check:
>>>
>>>   auto rti = insn->rtl ();
>>>   const auto attr = get_attr_ldpstp (rti);
>>>   if (attr == LDPSTP_NONE)
>>>     return;
>>>
>>>   bool load_p = (attr == LDPSTP_LDP);
>>>   gcc_checking_assert (load_p || attr == LDPSTP_STP);
>>>
>>> will need to become part of the pair_mem_promote_writeback_p hook that you
>>> added, potentially changing it to return a boolean for load_p.
>>>
>>> Then I guess we will need hooks for destructuring the pair insn and
>>> another hook to wrap aarch64_gen_writeback_pair.
>>>
>>
>> Addressed.
>>>>  
>>>>        if (GET_CODE (pat) != SET)
>>>>  	continue;
>>>>  
>>>>        if (track_stores && MEM_P (XEXP (pat, 0)))
>>>> -	bb_state.track_access (insn, false, XEXP (pat, 0));
>>>> +	bb_info.track_access (insn, false, XEXP (pat, 0));
>>>>        else if (track_loads && MEM_P (XEXP (pat, 1)))
>>>> -	bb_state.track_access (insn, true, XEXP (pat, 1));
>>>> +	bb_info.track_access (insn, true, XEXP (pat, 1));
>>>>      }
>>>>  
>>>> -  bb_state.transform ();
>>>> -  bb_state.cleanup_tombstones ();
>>>> +  bb_info.transform ();
>>>> +  bb_info.cleanup_tombstones ();
>>>>  }
>>>>  
>>>>  void ldp_fusion ()
>>>>  {
>>>>    ldp_fusion_init ();
>>>> +  pair_fusion *pfuse;
>>>> +  aarch64_pair_fusion derived;
>>>> +  pfuse = &derived;
>>>
>>> This is indeed the one place where I think it is acceptable to
>>> instantiate aarch64_pair_fusion.  But again there's no need to create a
>>> pointer to the parent class, just call any function you like directly.
>>>
>>
>> Addressed.
>>>>  
>>>>    for (auto bb : crtl->ssa->bbs ())
>>>> -    ldp_fusion_bb (bb);
>>>> +    pfuse->ldp_fusion_bb (bb);
>>>
>>> I think even the code to iterate over bbs should itself be a member
>>> function of pair_fusion (say "run") and then that becomes part of the
>>> generic code.
>>>
>>> So this function would just become:
>>>
>>> aarch64_pair_fusion pass;
>>> pass.run ();
>>>
>>> and could be inlined into the caller.
>>>
>>
>> Addressed.
>>> Perhaps you could also add an early return like:
>>>
>>>   if (!track_loads_p () && !track_stores_p ())
>>>     return;
>>>
>>> in pair_fusion::run () and then remove the corresponding code from
>>> pass_ldp_fusion::gate?
>>>
>>
>> Addressed.
>>
>>>>  
>>>>    ldp_fusion_destroy ();
>>>>  }
>>>> -- 
>>>> 2.39.3
>>>>
>>>
>>> Thanks,
>>> Alex
>>
>> Thanks & Regards
>> Ajit
Alex Coplan April 9, 2024, 3:09 p.m. UTC | #5
On 09/04/2024 20:01, Ajit Agarwal wrote:
> Hello Alex:
> 
> On 09/04/24 7:29 pm, Alex Coplan wrote:
> > On 09/04/2024 17:30, Ajit Agarwal wrote:
> >>
> >>
> >> On 05/04/24 10:03 pm, Alex Coplan wrote:
> >>> On 05/04/2024 13:53, Ajit Agarwal wrote:
> >>>> Hello Alex/Richard:
> >>>>
> >>>> All review comments are incorporated.
> >>>
> >>> Thanks, I was kind-of expecting you to also send the renaming patch as a
> >>> preparatory patch as we discussed.
> >>>
> >>> Sorry for another meta comment, but: I think the reason that the Linaro
> >>> CI isn't running tests on your patches is actually because you're
> >>> sending 1/3 of a series but not sending the rest of the series.
> >>>
> >>> So please can you either send this as an individual preparatory patch
> >>> (not marked as a series) or if you're going to send a series (e.g. with
> >>> a preparatory rename patch as 1/2 and this as 2/2) then send the entire
> >>> series when you make updates.  That way the CI should test your patches,
> >>> which would be helpful.
> >>>
> >>
> >> Addressed.
> >>  
> >>>>
> >>>> Common infrastructure of load store pair fusion is divided into target
> >>>> independent and target dependent changed code.
> >>>>
> >>>> Target independent code is the Generic code with pure virtual function
> >>>> to interface between target independent and dependent code.
> >>>>
> >>>> Target dependent code is the implementation of pure virtual function for
> >>>> aarch64 target and the call to target independent code.
> >>>>
> >>>> Thanks & Regards
> >>>> Ajit
> >>>>
> >>>>
> >>>> aarch64: Place target independent and dependent changed code in one file
> >>>>
> >>>> Common infrastructure of load store pair fusion is divided into target
> >>>> independent and target dependent changed code.
> >>>>
> >>>> Target independent code is the Generic code with pure virtual function
> >>>> to interface between target independent and dependent code.
> >>>>
> >>>> Target dependent code is the implementation of pure virtual function for
> >>>> aarch64 target and the call to target independent code.
> >>>>
> >>>> 2024-04-06  Ajit Kumar Agarwal  <aagarwa1@linux.ibm.com>
> >>>>
> >>>> gcc/ChangeLog:
> >>>>
> >>>> 	* config/aarch64/aarch64-ldp-fusion.cc: Place target
> >>>> 	independent and dependent changed code.
> >>>
> >>> You're going to need a proper ChangeLog eventually, but I guess there's
> >>> no need for that right now.
> >>>
> >>>> ---
> >>>>  gcc/config/aarch64/aarch64-ldp-fusion.cc | 371 +++++++++++++++--------
> >>>>  1 file changed, 249 insertions(+), 122 deletions(-)
> >>>>
> >>>> diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc b/gcc/config/aarch64/aarch64-ldp-fusion.cc
> >>>> index 22ed95eb743..cb21b514ef7 100644
> >>>> --- a/gcc/config/aarch64/aarch64-ldp-fusion.cc
> >>>> +++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc
> >>>> @@ -138,8 +138,122 @@ struct alt_base
> >>>>    poly_int64 offset;
> >>>>  };
> >>>>  
> >>>> +// Virtual base class for load/store walkers used in alias analysis.
> >>>> +struct alias_walker
> >>>> +{
> >>>> +  virtual bool conflict_p (int &budget) const = 0;
> >>>> +  virtual insn_info *insn () const = 0;
> >>>> +  virtual bool valid () const  = 0;
> >>>
> >>> Heh, looking at this made me realise there is a whitespace bug here in
> >>> the existing code (double space after const).  Sorry about that!  I'll
> >>> push an obvious fix for that.
> >>>
> >>>> +  virtual void advance () = 0;
> >>>> +};
> >>>> +
> >>>> +struct pair_fusion {
> >>>> +
> >>>> +  pair_fusion () {};
> >>>
> >>> This ctor looks pointless at the moment.  Perhaps instead we could put
> >>> the contents of ldp_fusion_init in here and then delete that function?
> >>>
> >>
> >> Addressed.
> >>
> >>>> +  virtual bool fpsimd_op_p (rtx reg_op, machine_mode mem_mode,
> >>>> +			       bool load_p) = 0;
> >>>
> >>> Please can we have comments above each of these virtual functions
> >>> describing any parameters, what the purpose of the hook is, and the
> >>> interpretation of the return value?  This will serve as the
> >>> documentation for other targets that want to make use of the pass.
> >>>
> >>> It might make sense to have a default-false implementation for
> >>> fpsimd_op_p, especially if you don't want to make use of this bit for
> >>> rs6000.
> >>>
> >>
> >> Addressed.
> >>  
> >>>> +
> >>>> +  virtual bool pair_operand_mode_ok_p (machine_mode mode) = 0;
> >>>> +  virtual bool pair_trailing_writeback_p () = 0;
> >>>
> >>> Sorry for the run-around, but: I think this and
> >>> handle_writeback_opportunities () should be the same function, either
> >>> returning an enum or taking an enum and returning a boolean.
> >>>
> >>> At a minimum they should have more similar sounding names.
> >>>
> >>
> >> Addressed.
> >>
> >>>> +  virtual bool pair_reg_operand_ok_p (bool load_p, rtx reg_op,
> >>>> +				      machine_mode mem_mode) = 0;
> >>>> +  virtual int pair_mem_alias_check_limit () = 0;
> >>>> +  virtual bool handle_writeback_opportunities () = 0 ;
> >>>> +  virtual bool pair_mem_ok_with_policy (rtx first_mem, bool load_p,
> >>>> +					machine_mode mode) = 0;
> >>>> +  virtual rtx gen_mem_pair (rtx *pats,  rtx writeback,
> >>>
> >>> Nit: excess whitespace after pats,
> >>>
> >>>> +			    bool load_p) = 0;
> >>>> +  virtual bool pair_mem_promote_writeback_p (rtx pat) = 0;
> >>>> +  virtual bool track_load_p () = 0;
> >>>> +  virtual bool track_store_p () = 0;
> >>>
> >>> I think it would probably make more sense for these two to have
> >>> default-true implementations rather than being pure virtual functions.
> >>>
> >>> Also, sorry for the bikeshedding, but please can we keep the plural
> >>> names (so track_loads_p and track_stores_p).
> >>>
> >>>> +  virtual bool cand_insns_empty_p (std::list<insn_info *> &insns) = 0;
> >>>
> >>> Why does this need to be virtualised?  I would expect it to
> >>> just be insns.empty () on all targets.
> >>>
> >>>> +  virtual bool pair_mem_in_range_p (HOST_WIDE_INT off) = 0;
> >>>> +  void ldp_fusion_bb (bb_info *bb);
> >>>
> >>> Just to check, are you planning to rename this function in a separate
> >>> patch?
> >>>
> >>
> >> Yes.
> >>>> +
> >>>> +  ~pair_fusion() { }
> >>>
> >>> As with the ctor, perhaps we could put the contents of ldp_fusion_destroy
> >>> in here and then delete that function?
> >>>
> >>
> >> Addressed.
> >>>> +};
> >>>> +
> >>>> +struct aarch64_pair_fusion : public pair_fusion
> >>>> +{
> >>>> +public:
> >>>
> >>
> >> Addressed.
> >>> Nit: there shouldn't be a need for this as public is already the default
> >>> for structs.
> >>>
> >>>> +  aarch64_pair_fusion () : pair_fusion () {};
> >>>
> >>> This ctor looks pointless, the default one should do, so can we remove
> >>> this?
> >>>
> >>
> >> Addressed.
> >>>> +  bool fpsimd_op_p (rtx reg_op, machine_mode mem_mode,
> >>>> +		    bool load_p) override final
> >>>> +  {
> >>>> +    const bool fpsimd_op_p
> >>>> +      = reload_completed
> >>>> +      ? (REG_P (reg_op) && FP_REGNUM_P (REGNO (reg_op)))
> >>>> +      : (GET_MODE_CLASS (mem_mode) != MODE_INT
> >>>> +	 && (load_p || !aarch64_const_zero_rtx_p (reg_op)));
> >>>> +    return fpsimd_op_p;
> >>>
> >>> Can this just directly return the boolean expression instead of storing
> >>> it into a variable first?
> >>>
> >> Addressed.
> >>>> +  }
> >>>> +
> >>>> +  bool pair_mem_promote_writeback_p (rtx pat)
> >>>> +  {
> >>>> +     if (reload_completed
> >>>> +	 && aarch64_ldp_writeback > 1
> >>>> +	 && GET_CODE (pat) == PARALLEL
> >>>> +	 && XVECLEN (pat, 0) == 2)
> >>>> +	return true;
> >>>> +
> >>>> +     return false;
> >>>> +  }
> >>>> +
> >>>> +  bool pair_mem_ok_with_policy (rtx first_mem, bool load_p,
> >>>> +				machine_mode mode)
> >>>> +  {
> >>>> +    return aarch64_mem_ok_with_ldpstp_policy_model (first_mem,
> >>>> +						     load_p,
> >>>> +						     mode);
> >>>> +  }
> >>>> +  bool pair_operand_mode_ok_p (machine_mode mode);
> >>>
> >>> Why have you implemented this (one-line function) out-of-line but the
> >>> others are implemented inline?  Please be consistent.
> >>>
> >>
> >> Addressed.
> >>>> +
> >>>> +  rtx gen_mem_pair (rtx *pats, rtx writeback, bool load_p);
> >>>> +
> >>>> +  bool pair_trailing_writeback_p  ()
> >>>> +  {
> >>>> +    return aarch64_ldp_writeback > 1;
> >>>> +  }
> >>>> +  bool pair_reg_operand_ok_p (bool load_p, rtx reg_op,
> >>>> +			      machine_mode mem_mode)
> >>>> +  {
> >>>> +    return (load_p
> >>>> +	     ? aarch64_ldp_reg_operand (reg_op, mem_mode)
> >>>> +	     : aarch64_stp_reg_operand (reg_op, mem_mode));
> >>>> +  }
> >>>> +  int pair_mem_alias_check_limit ()
> >>>> +  {
> >>>> +    return aarch64_ldp_alias_check_limit;
> >>>> +  }
> >>>> +  bool handle_writeback_opportunities ()
> >>>> +  {
> >>>> +    return aarch64_ldp_writeback;
> >>>> +  }
> >>>> +  bool track_load_p ()
> >>>> +  {
> >>>> +    const bool track_loads
> >>>> +      = aarch64_tune_params.ldp_policy_model != AARCH64_LDP_STP_POLICY_NEVER;
> >>>> +    return track_loads;
> >>>> +  }
> >>>
> >>
> >> Addressed.
> >>> As above, can this just return the boolean expression directly without
> >>> the temporary?
> >>>
> >>>> +  bool track_store_p ()
> >>>> +  {
> >>>> +    const bool track_stores
> >>>> +      = aarch64_tune_params.stp_policy_model != AARCH64_LDP_STP_POLICY_NEVER;
> >>>> +    return track_stores;
> >>>> +  }
> >>>
> >>
> >> Addressed.
> >>> Likewise.
> >>>
> >>>> +  bool cand_insns_empty_p (std::list<insn_info *> &insns)
> >>>> +  {
> >>>> +    return insns.empty();
> >>>> +  }
> >>>
> >>> As mentioned on the declaration, I don't see why this needs to be
> >>> virtualised.
> >>>
> >>>> +  bool pair_mem_in_range_p (HOST_WIDE_INT off)
> >>>> +  {
> >>>> +    return (off < LDP_MIN_IMM || off > LDP_MAX_IMM);
> >>>> +  }
> >>>
> >>> Hmm, this looks to test exactly the opposite condition from what the
> >>> name suggests, i.e. the implementation looks like what I would expect
> >>> for a function called pair_mem_out_of_range_p.
> >>>
> >>> Please can you invert the condition and fix up callers appropriately?
> >>>
> >>> This suggests you're not really thinking about the code you're writing,
> >>> please try to think a bit more carefully before posting patches.
> >>>
> >>
> >> Sorry, my mistake; I had overlooked it.
> >>>> +};
> >>>> +
> >>>> +
> >>>>  // State used by the pass for a given basic block.
> >>>> -struct ldp_bb_info
> >>>> +struct pair_fusion_bb_info
> >>>>  {
> >>>>    using def_hash = nofree_ptr_hash<def_info>;
> >>>>    using expr_key_t = pair_hash<tree_operand_hash, int_hash<int, -1, -2>>;
> >>>> @@ -160,14 +274,17 @@ struct ldp_bb_info
> >>>>  
> >>>>    static const size_t obstack_alignment = sizeof (void *);
> >>>>    bb_info *m_bb;
> >>>> +  pair_fusion *bb_state;
> >>>
> >>
> >> Addressed.
> >>
> >>> Can we call this m_pass instead of bb_state?  The pair_fusion class is
> >>> explicitly at the pass level rather than the bb level.
> >>>
> >>> Also I think this (and indeed m_bb, although that's pre-existing) could
> >>> be private members.  Not sure there's a good reason for making them
> >>> public.
> >>>
> >>
> >> Addressed.
> >>  
> >>> It might be slightly nicer to use a reference for m_pass unless there's
> >>> a good reason to use a pointer, although I don't feel strongly about
> >>> this.
> >>>
> >>>>  
> >>>> -  ldp_bb_info (bb_info *bb) : m_bb (bb), m_emitted_tombstone (false)
> >>>> +  pair_fusion_bb_info (bb_info *bb,
> >>>> +		      aarch64_pair_fusion *d) : m_bb (bb),
> >>>
> >>> This generic class needs to take the generic pass type, i.e. just a
> >>> pair_fusion * rather than an aarch64_pair_fusion *.
> >>>
> >>>> +		      bb_state (d), m_emitted_tombstone (false)
> >>>>    {
> >>>>      obstack_specify_allocation (&m_obstack, OBSTACK_CHUNK_SIZE,
> >>>>  				obstack_alignment, obstack_chunk_alloc,
> >>>>  				obstack_chunk_free);
> >>>>    }
> >>>> -  ~ldp_bb_info ()
> >>>> +  ~pair_fusion_bb_info ()
> >>>>    {
> >>>>      obstack_free (&m_obstack, nullptr);
> >>>>  
> >>>> @@ -177,10 +294,32 @@ struct ldp_bb_info
> >>>>  	bitmap_obstack_release (&m_bitmap_obstack);
> >>>>        }
> >>>>    }
> >>>> +  void track_access (insn_info *, bool load, rtx mem);
> >>>> +  void transform ();
> >>>> +  void cleanup_tombstones ();
> >>>
> >>> I think most (or indeed all?) of the changes here shouldn't be needed.
> >>> AFAICS, there shouldn't be a need to:
> >>>  - drop inline from these functions.
> >>>  - change the accessibility of any members.
> >>>
> >>
> >> Addressed.
> >>> Please can you drop these changes or explain why they're needed (from
> >>> here until the end of the class).
> >>>
> >>>> +  void merge_pairs (insn_list_t &, insn_list_t &,
> >>>> +		    bool load_p, unsigned access_size);
> >>>> +  void transform_for_base (int load_size, access_group &group);
> >>>> +
> >>>> +  bool try_fuse_pair (bool load_p, unsigned access_size,
> >>>> +			     insn_info *i1, insn_info *i2);
> >>>> +
> >>>> +  bool fuse_pair (bool load_p, unsigned access_size,
> >>>> +		  int writeback,
> >>>> +		  insn_info *i1, insn_info *i2,
> >>>> +		  base_cand &base,
> >>>> +		  const insn_range_info &move_range);
> >>>>  
> >>>> -  inline void track_access (insn_info *, bool load, rtx mem);
> >>>> -  inline void transform ();
> >>>> -  inline void cleanup_tombstones ();
> >>>> +  void do_alias_analysis (insn_info *alias_hazards[4],
> >>>> +			  alias_walker *walkers[4],
> >>>> +			  bool load_p);
> >>>> +
> >>>> +  void track_tombstone (int uid);
> >>>> +
> >>>> +  bool track_via_mem_expr (insn_info *, rtx mem, lfs_fields lfs);
> >>>> +
> >>>> +  template<typename Map>
> >>>> +    void traverse_base_map (Map &map);
> >>>>  
> >>>>  private:
> >>>>    obstack m_obstack;
> >>>> @@ -191,30 +330,32 @@ private:
> >>>>    bool m_emitted_tombstone;
> >>>>  
> >>>>    inline splay_tree_node<access_record *> *node_alloc (access_record *);
> >>>> +};
> >>>>  
> >>>> -  template<typename Map>
> >>>> -  inline void traverse_base_map (Map &map);
> >>>> -  inline void transform_for_base (int load_size, access_group &group);
> >>>> -
> >>>> -  inline void merge_pairs (insn_list_t &, insn_list_t &,
> >>>> -			   bool load_p, unsigned access_size);
> >>>> -
> >>>> -  inline bool try_fuse_pair (bool load_p, unsigned access_size,
> >>>> -			     insn_info *i1, insn_info *i2);
> >>>> -
> >>>> -  inline bool fuse_pair (bool load_p, unsigned access_size,
> >>>> -			 int writeback,
> >>>> -			 insn_info *i1, insn_info *i2,
> >>>> -			 base_cand &base,
> >>>> -			 const insn_range_info &move_range);
> >>>> -
> >>>> -  inline void track_tombstone (int uid);
> >>>> +rtx aarch64_pair_fusion::gen_mem_pair (rtx *pats,
> >>>
> >>> Style nit: newline after rtx.
> >>>
> >> Addressed.
> >>
> >>>> +		  rtx writeback,
> >>>> +		  bool load_p)
> >>>> +  {
> >>>> +    rtx pair_pat;
> >>>>  
> >>>> -  inline bool track_via_mem_expr (insn_info *, rtx mem, lfs_fields lfs);
> >>>> -};
> >>>> +    if (writeback)
> >>>> +      {
> >>>> +	auto patvec = gen_rtvec (3, writeback, pats[0], pats[1]);
> >>>> +	pair_pat = gen_rtx_PARALLEL (VOIDmode, patvec);
> >>>> +      }
> >>>> +    else if (load_p)
> >>>> +      pair_pat = aarch64_gen_load_pair (XEXP (pats[0], 0),
> >>>> +					XEXP (pats[1], 0),
> >>>> +					XEXP (pats[0], 1));
> >>>> +    else
> >>>> +      pair_pat = aarch64_gen_store_pair (XEXP (pats[0], 0),
> >>>> +					 XEXP (pats[0], 1),
> >>>> +					 XEXP (pats[1], 1));
> >>>> +     return pair_pat;
> >>>
> >>> Can the assignments just be direct returns here, please?  Then
> >>> get rid of the temporary.
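> >>>
> >>> I.e. (sketch):
> >>>
> >>>   if (writeback)
> >>>     {
> >>>       auto patvec = gen_rtvec (3, writeback, pats[0], pats[1]);
> >>>       return gen_rtx_PARALLEL (VOIDmode, patvec);
> >>>     }
> >>>   if (load_p)
> >>>     return aarch64_gen_load_pair (XEXP (pats[0], 0), XEXP (pats[1], 0),
> >>>                                   XEXP (pats[0], 1));
> >>>   return aarch64_gen_store_pair (XEXP (pats[0], 0), XEXP (pats[0], 1),
> >>>                                  XEXP (pats[1], 1));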
> >>>
> >>
> >> Addressed.
> >>  
> >>>> +  }
> >>>>  
> >>>>  splay_tree_node<access_record *> *
> >>>> -ldp_bb_info::node_alloc (access_record *access)
> >>>> +pair_fusion_bb_info::node_alloc (access_record *access)
> >>>>  {
> >>>>    using T = splay_tree_node<access_record *>;
> >>>>    void *addr = obstack_alloc (&m_obstack, sizeof (T));
> >>>> @@ -262,7 +403,7 @@ drop_writeback (rtx mem)
> >>>>  // RTX_AUTOINC addresses.  The interface is like strip_offset except we take a
> >>>>  // MEM so that we know the mode of the access.
> >>>>  static rtx
> >>>> -ldp_strip_offset (rtx mem, poly_int64 *offset)
> >>>> +pair_mem_strip_offset (rtx mem, poly_int64 *offset)
> >>>
> >>> I thought we discussed splitting the renaming into a separate patch?
> >>>
> >>>>  {
> >>>>    rtx addr = XEXP (mem, 0);
> >>>>  
> >>>> @@ -332,6 +473,12 @@ ldp_operand_mode_ok_p (machine_mode mode)
> >>>>    return reload_completed || mode != TImode;
> >>>>  }
> >>>>  
> >>>> +bool
> >>>> +aarch64_pair_fusion::pair_operand_mode_ok_p (machine_mode mode)
> >>>> +{
> >>>> +  return ldp_operand_mode_ok_p (mode);
> >>>> +}
> >>>> +
> >>>>  // Given LFS (load_p, fpsimd_p, size) fields in FIELDS, encode these
> >>>>  // into an integer for use as a hash table key.
> >>>>  static int
> >>>> @@ -396,7 +543,7 @@ access_group::track (Alloc alloc_node, poly_int64 offset, insn_info *insn)
> >>>>  // MEM_EXPR base (i.e. a tree decl) relative to which we can track the access.
> >>>>  // LFS is used as part of the key to the hash table, see track_access.
> >>>>  bool
> >>>> -ldp_bb_info::track_via_mem_expr (insn_info *insn, rtx mem, lfs_fields lfs)
> >>>> +pair_fusion_bb_info::track_via_mem_expr (insn_info *insn, rtx mem, lfs_fields lfs)
> >>>
> >>> Again, please do any renaming in a separate patch.
> >>>
> >>
> >> Addressed.
> >>>>  {
> >>>>    if (!MEM_EXPR (mem) || !MEM_OFFSET_KNOWN_P (mem))
> >>>>      return false;
> >>>> @@ -412,9 +559,10 @@ ldp_bb_info::track_via_mem_expr (insn_info *insn, rtx mem, lfs_fields lfs)
> >>>>    const machine_mode mem_mode = GET_MODE (mem);
> >>>>    const HOST_WIDE_INT mem_size = GET_MODE_SIZE (mem_mode).to_constant ();
> >>>>  
> >>>> -  // Punt on misaligned offsets.  LDP/STP instructions require offsets to be a
> >>>> -  // multiple of the access size, and we believe that misaligned offsets on
> >>>> -  // MEM_EXPR bases are likely to lead to misaligned offsets w.r.t. RTL bases.
> >>>> +  // Punt on misaligned offsets.  Paired memory access instructions require
> >>>> +  // offsets to be a multiple of the access size, and we believe that
> >>>> +  // misaligned offsets on MEM_EXPR bases are likely to lead to misaligned
> >>>> +  // offsets w.r.t. RTL bases.
> >>>>    if (!multiple_p (offset, mem_size))
> >>>>      return false;
> >>>>  
> >>>> @@ -438,46 +586,38 @@ ldp_bb_info::track_via_mem_expr (insn_info *insn, rtx mem, lfs_fields lfs)
> >>>>  }
> >>>>  
> >>>>  // Main function to begin pair discovery.  Given a memory access INSN,
> >>>> -// determine whether it could be a candidate for fusing into an ldp/stp,
> >>>> +// determine whether it could be a candidate for fusing into an pair mem,
> >>>
> >>> s/an pair mem/a paired access/
> >>>
> >>>>  // and if so, track it in the appropriate data structure for this basic
> >>>>  // block.  LOAD_P is true if the access is a load, and MEM is the mem
> >>>>  // rtx that occurs in INSN.
> >>>>  void
> >>>> -ldp_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
> >>>> +pair_fusion_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
> >>>>  {
> >>>>    // We can't combine volatile MEMs, so punt on these.
> >>>>    if (MEM_VOLATILE_P (mem))
> >>>>      return;
> >>>>  
> >>>> -  // Ignore writeback accesses if the param says to do so.
> >>>> -  if (!aarch64_ldp_writeback
> >>>> +  // Ignore writeback accesses if the param says to do so
> >>>
> >>
> >> Addressed.
> >>> This comment needs updating, I guess something like "if the hook says to
> >>> do so".  But also please don't delete the full stop.
> >>>
> >>>> +  if (!bb_state->handle_writeback_opportunities ()
> >>>>        && GET_RTX_CLASS (GET_CODE (XEXP (mem, 0))) == RTX_AUTOINC)
> >>>>      return;
> >>>>  
> >>>>    const machine_mode mem_mode = GET_MODE (mem);
> >>>> -  if (!ldp_operand_mode_ok_p (mem_mode))
> >>>> +
> >>>
> >>> I don't think we need the extra whitespace here.
> >>>
> >>>> +  if (!bb_state->pair_operand_mode_ok_p (mem_mode))
> >>>>      return;
> >>>>  
> >>>>    rtx reg_op = XEXP (PATTERN (insn->rtl ()), !load_p);
> >>>>  
> >>>> -  // Ignore the access if the register operand isn't suitable for ldp/stp.
> >>>> -  if (load_p
> >>>> -      ? !aarch64_ldp_reg_operand (reg_op, mem_mode)
> >>>> -      : !aarch64_stp_reg_operand (reg_op, mem_mode))
> >>>> +  if (!bb_state->pair_reg_operand_ok_p (load_p, reg_op, mem_mode))
> >>>>      return;
> >>>> -
> >>>>    // We want to segregate FP/SIMD accesses from GPR accesses.
> >>>>    //
> >>>>    // Before RA, we use the modes, noting that stores of constant zero
> >>>>    // operands use GPRs (even in non-integer modes).  After RA, we use
> >>>>    // the hard register numbers.
> >>>> -  const bool fpsimd_op_p
> >>>> -    = reload_completed
> >>>> -    ? (REG_P (reg_op) && FP_REGNUM_P (REGNO (reg_op)))
> >>>> -    : (GET_MODE_CLASS (mem_mode) != MODE_INT
> >>>> -       && (load_p || !aarch64_const_zero_rtx_p (reg_op)));
> >>>> -
> >>>> -  // Note ldp_operand_mode_ok_p already rejected VL modes.
> >>>> +  const bool fpsimd_op_p = bb_state->fpsimd_op_p (reg_op, mem_mode, load_p);
> >>>> +  // Note pair_operand_mode_ok_p already rejected VL modes.
> >>>>    const HOST_WIDE_INT mem_size = GET_MODE_SIZE (mem_mode).to_constant ();
> >>>>    const lfs_fields lfs = { load_p, fpsimd_op_p, mem_size };
> >>>>  
> >>>> @@ -487,7 +627,7 @@ ldp_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
> >>>>    poly_int64 mem_off;
> >>>>    rtx addr = XEXP (mem, 0);
> >>>>    const bool autoinc_p = GET_RTX_CLASS (GET_CODE (addr)) == RTX_AUTOINC;
> >>>> -  rtx base = ldp_strip_offset (mem, &mem_off);
> >>>> +  rtx base = pair_mem_strip_offset (mem, &mem_off);
> >>>
> >>> Again, let's split the renaming off into a separate patch.
> >>>
> >>
> >> Addressed.
> >>>>    if (!REG_P (base))
> >>>>      return;
> >>>>  
> >>>> @@ -506,8 +646,8 @@ ldp_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
> >>>>    // elimination offset pre-RA, we should postpone forming pairs on such
> >>>>    // accesses until after RA.
> >>>>    //
> >>>> -  // As it stands, addresses with offsets in range for LDR but not
> >>>> -  // in range for LDP/STP are currently reloaded inefficiently,
> >>>> +  // As it stands, addresses in range for an individual load/store but not
> >>>> +  // for a paired access are currently reloaded inefficiently,
> >>>>    // ending up with a separate base register for each pair.
> >>>>    //
> >>>>    // In theory LRA should make use of
> >>>> @@ -519,8 +659,8 @@ ldp_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
> >>>>    // that calls targetm.legitimize_address_displacement.
> >>>>    //
> >>>>    // So for now, it's better to punt when we can't be sure that the
> >>>> -  // offset is in range for LDP/STP.  Out-of-range cases can then be
> >>>> -  // handled after RA by the out-of-range LDP/STP peepholes.  Eventually, it
> >>>> +  // offset is in range for paired access.  Out-of-range cases can then be
> >>>> +  // handled after RA by the out-of-range PAIR MEM  peepholes.  Eventually, it
> >>>>    // would be nice to handle known out-of-range opportunities in the
> >>>>    // pass itself (for stack accesses, this would be in the post-RA pass).
> >>>>    if (!reload_completed
> >>>> @@ -573,8 +713,8 @@ ldp_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
> >>>>  	gcc_unreachable (); // Base defs should be unique.
> >>>>      }
> >>>>  
> >>>> -  // Punt on misaligned offsets.  LDP/STP require offsets to be a multiple of
> >>>> -  // the access size.
> >>>> +  // Punt on misaligned offsets.  Paired memory accesses require offsets
> >>>> +  // to be a multiple of the access size.
> >>>>    if (!multiple_p (mem_off, mem_size))
> >>>>      return;
> >>>>  
> >>>> @@ -614,7 +754,7 @@ static bool no_ignore (insn_info *) { return false; }
> >>>>  // making use of alias disambiguation.
> >>>>  static insn_info *
> >>>>  latest_hazard_before (insn_info *insn, rtx *ignore,
> >>>> -		      insn_info *ignore_insn = nullptr)
> >>>> +		      insn_info *ignore_insn = 0)
> >>>
> >>> Is this change needed?
> >>>
> >>
> >> Addressed.
> >>>>  {
> >>>>    insn_info *result = nullptr;
> >>>>  
> >>>> @@ -1150,7 +1290,7 @@ extract_writebacks (bool load_p, rtx pats[2], int changed)
> >>>>        const bool autoinc_p = GET_RTX_CLASS (GET_CODE (addr)) == RTX_AUTOINC;
> >>>>  
> >>>>        poly_int64 offset;
> >>>> -      rtx this_base = ldp_strip_offset (mem, &offset);
> >>>> +      rtx this_base = pair_mem_strip_offset (mem, &offset);
> >>>>        gcc_assert (REG_P (this_base));
> >>>>        if (base_reg)
> >>>>  	gcc_assert (rtx_equal_p (base_reg, this_base));
> >>>> @@ -1286,7 +1426,11 @@ find_trailing_add (insn_info *insns[2],
> >>>>  
> >>>>    off_hwi /= access_size;
> >>>>  
> >>>> -  if (off_hwi < LDP_MIN_IMM || off_hwi > LDP_MAX_IMM)
> >>>> +  pair_fusion *pfuse;
> >>>> +  aarch64_pair_fusion derived;
> >>>> +  pfuse = &derived;
> >>>
> >>> Huh? We'll want to use the same instance of the pass structure (pair_fusion)
> >>> that gets created at the top level, so that should get passed around everywhere.
> >>>
> >>> In the case of find_trailing_add we probably don't want to add any more
> >>> parameters (as Segher noted there are already too many), so perhaps the
> >>> easiest thing would be to make find_trailing_add itself a (non-virtual)
> >>> member function of pair_fusion.
> >>>
> >>> Then here we can just query pair_mem_in_range_p directly.
> >>>
> >>> For try_promote_writeback itself you can either make it a non-virtual
> >>> member function of pair_fusion or just pass a reference to the pass
> >>> object in when you call it.
> >>>
> >>> Note also that you don't have to create a pointer of the parent type in
> >>> order to call a virtual function on a derived object, you can just call
> >>> it directly.
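> >>>
> >>> I.e. once find_trailing_add is a member function of pair_fusion, this
> >>> would just become (sketch, with the polarity fixed as noted below):
> >>>
> >>>   if (!pair_mem_in_range_p (off_hwi))
> >>>     return nullptr;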
> >>>
> >>
> >> Addressed.
> >>>> +
> >>>> +  if (pfuse->pair_mem_in_range_p (off_hwi))
> >>>>      return nullptr;
> >>>
> >>> Note of course this condition needs inverting when you invert the
> >>> implementation.
> >>>
> >>
> >> Addressed.
> >>>>  
> >>>>    auto dump_prefix = [&]()
> >>>> @@ -1328,7 +1472,7 @@ find_trailing_add (insn_info *insns[2],
> >>>>  // We just emitted a tombstone with uid UID, track it in a bitmap for
> >>>>  // this BB so we can easily identify it later when cleaning up tombstones.
> >>>>  void
> >>>> -ldp_bb_info::track_tombstone (int uid)
> >>>> +pair_fusion_bb_info::track_tombstone (int uid)
> >>>
> >>> Again, please split the renaming off.  Same with other instances below.
> >>>
> >>>>  {
> >>>>    if (!m_emitted_tombstone)
> >>>>      {
> >>>> @@ -1528,7 +1672,7 @@ fixup_debug_uses (obstack_watermark &attempt,
> >>>>  	  gcc_checking_assert (GET_RTX_CLASS (GET_CODE (XEXP (mem, 0)))
> >>>>  			       == RTX_AUTOINC);
> >>>>  
> >>>> -	  base = ldp_strip_offset (mem, &offset);
> >>>> +	  base = pair_mem_strip_offset (mem, &offset);
> >>>>  	  gcc_checking_assert (REG_P (base) && REGNO (base) == base_regno);
> >>>>  	}
> >>>>        fixup_debug_use (attempt, use, def, base, offset);
> >>>> @@ -1664,7 +1808,7 @@ fixup_debug_uses (obstack_watermark &attempt,
> >>>>  // BASE gives the chosen base candidate for the pair and MOVE_RANGE is
> >>>>  // a singleton range which says where to place the pair.
> >>>>  bool
> >>>> -ldp_bb_info::fuse_pair (bool load_p,
> >>>> +pair_fusion_bb_info::fuse_pair (bool load_p,
> >>>>  			unsigned access_size,
> >>>>  			int writeback,
> >>>>  			insn_info *i1, insn_info *i2,
> >>>> @@ -1800,7 +1944,7 @@ ldp_bb_info::fuse_pair (bool load_p,
> >>>>      {
> >>>>        if (dump_file)
> >>>>  	fprintf (dump_file,
> >>>> -		 "  ldp: i%d has wb but subsequent i%d has non-wb "
> >>>> +		 "  load pair: i%d has wb but subsequent i%d has non-wb "
> >>>>  		 "update of base (r%d), dropping wb\n",
> >>>>  		 insns[0]->uid (), insns[1]->uid (), base_regno);
> >>>>        gcc_assert (writeback_effect);
> >>>> @@ -1823,7 +1967,7 @@ ldp_bb_info::fuse_pair (bool load_p,
> >>>>      }
> >>>>  
> >>>>    // If either of the original insns had writeback, but the resulting pair insn
> >>>> -  // does not (can happen e.g. in the ldp edge case above, or if the writeback
> >>>> +  // does not (can happen e.g. in the load pair edge case above, or if the writeback
> >>>>    // effects cancel out), then drop the def(s) of the base register as
> >>>>    // appropriate.
> >>>>    //
> >>>> @@ -1842,7 +1986,7 @@ ldp_bb_info::fuse_pair (bool load_p,
> >>>>    // update of the base register and try and fold it in to make this into a
> >>>>    // writeback pair.
> >>>>    insn_info *trailing_add = nullptr;
> >>>> -  if (aarch64_ldp_writeback > 1
> >>>> +  if (bb_state->pair_trailing_writeback_p ()
> >>>>        && !writeback_effect
> >>>>        && (!load_p || (!refers_to_regno_p (base_regno, base_regno + 1,
> >>>>  					 XEXP (pats[0], 0), nullptr)
> >>>> @@ -1863,14 +2007,14 @@ ldp_bb_info::fuse_pair (bool load_p,
> >>>>      }
> >>>>  
> >>>>    // Now that we know what base mem we're going to use, check if it's OK
> >>>> -  // with the ldp/stp policy.
> >>>> +  // with the pair mem policy.
> >>>>    rtx first_mem = XEXP (pats[0], load_p);
> >>>> -  if (!aarch64_mem_ok_with_ldpstp_policy_model (first_mem,
> >>>> -						load_p,
> >>>> -						GET_MODE (first_mem)))
> >>>> +  if (!bb_state->pair_mem_ok_with_policy (first_mem,
> >>>> +			  load_p,
> >>>> +			  GET_MODE (first_mem)))
> >>>>      {
> >>>>        if (dump_file)
> >>>> -	fprintf (dump_file, "punting on pair (%d,%d), ldp/stp policy says no\n",
> >>>> +	fprintf (dump_file, "punting on pair (%d,%d), pair mem  policy says no\n",
> >>>>  		 i1->uid (), i2->uid ());
> >>>>        return false;
> >>>>      }
> >>>> @@ -1878,21 +2022,10 @@ ldp_bb_info::fuse_pair (bool load_p,
> >>>>    rtx reg_notes = combine_reg_notes (first, second, load_p);
> >>>>  
> >>>>    rtx pair_pat;
> >>>> -  if (writeback_effect)
> >>>> -    {
> >>>> -      auto patvec = gen_rtvec (3, writeback_effect, pats[0], pats[1]);
> >>>> -      pair_pat = gen_rtx_PARALLEL (VOIDmode, patvec);
> >>>> -    }
> >>>> -  else if (load_p)
> >>>> -    pair_pat = aarch64_gen_load_pair (XEXP (pats[0], 0),
> >>>> -				      XEXP (pats[1], 0),
> >>>> -				      XEXP (pats[0], 1));
> >>>> -  else
> >>>> -    pair_pat = aarch64_gen_store_pair (XEXP (pats[0], 0),
> >>>> -				       XEXP (pats[0], 1),
> >>>> -				       XEXP (pats[1], 1));
> >>>>  
> >>>> +  pair_pat = bb_state->gen_mem_pair (pats, writeback_effect, load_p);
> >>>>    insn_change *pair_change = nullptr;
> >>>> +
> >>>>    auto set_pair_pat = [pair_pat,reg_notes](insn_change *change) {
> >>>>      rtx_insn *rti = change->insn ()->rtl ();
> >>>>      validate_unshare_change (rti, &PATTERN (rti), pair_pat, true);
> >>>> @@ -2133,15 +2266,6 @@ load_modified_by_store_p (insn_info *load,
> >>>>    return false;
> >>>>  }
> >>>>  
> >>>> -// Virtual base class for load/store walkers used in alias analysis.
> >>>> -struct alias_walker
> >>>> -{
> >>>> -  virtual bool conflict_p (int &budget) const = 0;
> >>>> -  virtual insn_info *insn () const = 0;
> >>>> -  virtual bool valid () const  = 0;
> >>>> -  virtual void advance () = 0;
> >>>> -};
> >>>> -
> >>>>  // Implement some common functionality used by both store_walker
> >>>>  // and load_walker.
> >>>>  template<bool reverse>
> >>>> @@ -2259,13 +2383,13 @@ public:
> >>>>  //
> >>>>  // We try to maintain the invariant that if a walker becomes invalid, we
> >>>>  // set its pointer to null.
> >>>> -static void
> >>>> -do_alias_analysis (insn_info *alias_hazards[4],
> >>>> +void
> >>>> +pair_fusion_bb_info::do_alias_analysis (insn_info *alias_hazards[4],
> >>>
> >>> It would probably make more sense for this to be a (non-virtual) member
> >>> function of the pair_fusion class, since it doesn't need to access the
> >>> BB-specific state.
> >>>
> >>
> >> Addressed.
> >>>>  		   alias_walker *walkers[4],
> >>>>  		   bool load_p)
> >>>>  {
> >>>>    const int n_walkers = 2 + (2 * !load_p);
> >>>> -  int budget = aarch64_ldp_alias_check_limit;
> >>>> +  int budget = bb_state->pair_mem_alias_check_limit ();
> >>>>  
> >>>>    auto next_walker = [walkers,n_walkers](int current) -> int {
> >>>>      for (int j = 1; j <= n_walkers; j++)
> >>>> @@ -2365,7 +2489,7 @@ get_viable_bases (insn_info *insns[2],
> >>>>      {
> >>>>        const bool is_lower = (i == reversed);
> >>>>        poly_int64 poly_off;
> >>>> -      rtx base = ldp_strip_offset (cand_mems[i], &poly_off);
> >>>> +      rtx base = pair_mem_strip_offset (cand_mems[i], &poly_off);
> >>>>        if (GET_RTX_CLASS (GET_CODE (XEXP (cand_mems[i], 0))) == RTX_AUTOINC)
> >>>>  	writeback |= (1 << i);
> >>>>  
> >>>> @@ -2373,7 +2497,7 @@ get_viable_bases (insn_info *insns[2],
> >>>>  	continue;
> >>>>  
> >>>>        // Punt on accesses relative to eliminable regs.  See the comment in
> >>>> -      // ldp_bb_info::track_access for a detailed explanation of this.
> >>>> +      // pair_fusion_bb_info::track_access for a detailed explanation of this.
> >>>>        if (!reload_completed
> >>>>  	  && (REGNO (base) == FRAME_POINTER_REGNUM
> >>>>  	      || REGNO (base) == ARG_POINTER_REGNUM))
> >>>> @@ -2397,7 +2521,11 @@ get_viable_bases (insn_info *insns[2],
> >>>>        if (!is_lower)
> >>>>  	base_off--;
> >>>>  
> >>>> -      if (base_off < LDP_MIN_IMM || base_off > LDP_MAX_IMM)
> >>>> +      pair_fusion *pfuse;
> >>>> +      aarch64_pair_fusion derived;
> >>>> +      pfuse = &derived;
> >>>
> >>> Again, please don't do this but instead use the same instance
> >>> everywhere.
> >>>
> >>
> >> Addressed.
> >>>> +
> >>>> +      if (pfuse->pair_mem_in_range_p (base_off))
> >>>>  	continue;
> >>>>  
> >>>>        use_info *use = find_access (insns[i]->uses (), REGNO (base));
> >>>> @@ -2454,12 +2582,12 @@ get_viable_bases (insn_info *insns[2],
> >>>>  }
> >>>>  
> >>>>  // Given two adjacent memory accesses of the same size, I1 and I2, try
> >>>> -// and see if we can merge them into a ldp or stp.
> >>>> +// and see if we can merge them into a paired accesses load and store.
> >>>
> >>> "into a load pair or store pair" or "into a paired access" if you
> >>> prefer.
> >>>
> >>
> >> Addressed.
> >>>>  //
> >>>>  // ACCESS_SIZE gives the (common) size of a single access, LOAD_P is true
> >>>>  // if the accesses are both loads, otherwise they are both stores.
> >>>>  bool
> >>>> -ldp_bb_info::try_fuse_pair (bool load_p, unsigned access_size,
> >>>> +pair_fusion_bb_info::try_fuse_pair (bool load_p, unsigned access_size,
> >>>>  			    insn_info *i1, insn_info *i2)
> >>>>  {
> >>>>    if (dump_file)
> >>>> @@ -2494,7 +2622,7 @@ ldp_bb_info::try_fuse_pair (bool load_p, unsigned access_size,
> >>>>      {
> >>>>        if (dump_file)
> >>>>  	fprintf (dump_file,
> >>>> -		 "punting on ldp due to reg conflcits (%d,%d)\n",
> >>>> +		 "punting on pair mem load  due to reg conflcits (%d,%d)\n",
> >>>>  		 insns[0]->uid (), insns[1]->uid ());
> >>>>        return false;
> >>>>      }
> >>>> @@ -2843,7 +2971,7 @@ debug (const insn_list_t &l)
> >>>>  // we can't re-order them anyway, so provided earlier passes have cleaned up
> >>>>  // redundant loads, we shouldn't miss opportunities by doing this.
> >>>>  void
> >>>> -ldp_bb_info::merge_pairs (insn_list_t &left_list,
> >>>> +pair_fusion_bb_info::merge_pairs (insn_list_t &left_list,
> >>>>  			  insn_list_t &right_list,
> >>>>  			  bool load_p,
> >>>>  			  unsigned access_size)
> >>>> @@ -2890,8 +3018,8 @@ ldp_bb_info::merge_pairs (insn_list_t &left_list,
> >>>>  // of accesses.  If we find two sets of adjacent accesses, call
> >>>>  // merge_pairs.
> >>>>  void
> >>>> -ldp_bb_info::transform_for_base (int encoded_lfs,
> >>>> -				 access_group &group)
> >>>> +pair_fusion_bb_info::transform_for_base (int encoded_lfs,
> >>>> +					 access_group &group)
> >>>>  {
> >>>>    const auto lfs = decode_lfs (encoded_lfs);
> >>>>    const unsigned access_size = lfs.size;
> >>>> @@ -2909,7 +3037,7 @@ ldp_bb_info::transform_for_base (int encoded_lfs,
> >>>>  		       access.cand_insns,
> >>>>  		       lfs.load_p,
> >>>>  		       access_size);
> >>>> -	  skip_next = access.cand_insns.empty ();
> >>>> +	  skip_next = bb_state->cand_insns_empty_p (access.cand_insns);
> >>>
> >>> As above, why is this needed?
> >>
> >> For rs6000 we want to always return true: load/store pairs merging
> >> offsets 8/16, 16/32 and 32/64 are occurring for rs6000, but we only
> >> want the 8/16 and 32/64 pairs.  That's why we want to always return
> >> true for rs6000, to skip the overlapping pairs as above.
> > 
> > Hmm, sorry, I'm not sure I follow.  Are you saying that for rs6000 you have
> > load/store pair instructions where the two arms of the access are storing
> > operands of different sizes?  Or something else?
> > 
> > As it stands the logic is to skip the next iteration only if we
> > exhausted all the candidate insns for the current access.  In the case
> > that we didn't exhaust all such candidates, then the idea is that when
> > access becomes prev_access, we can attempt to use those candidates as
> > the "left-hand side" of a pair in the next iteration since we failed to
> > use them as the "right-hand side" of a pair in the current iteration.
> > I don't see why you wouldn't want that behaviour.  Please can you
> > explain?
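> >
> > I.e. the loop body is effectively (rough sketch, eliding the exact
> > adjacency test):
> >
> >   if (skip_next)
> >     skip_next = false;
> >   else if (/* access adjacent to *prev_access */)
> >     {
> >       merge_pairs (prev_access->cand_insns, access.cand_insns,
> >                    lfs.load_p, access_size);
> >       skip_next = access.cand_insns.empty ();
> >     }
> >   prev_access = &access;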
> > 
> 
> In merge_pairs we get the two load candidates: one load from offset 0 and
> the other load from offset 16.  Then in the next iteration we get a load
> from offset 16 and another load from offset 32.  In the next iteration
> we get a load from offset 32 and another load from offset 48.
> 
> For example:
> 
> Currently we get the load candidates as follows.
> 
> pairs:
> 
> load from offset 0.
> load from offset 16.
> 
> next pairs:
> 
> load from offset 16.
> load from offset 32.
> 
> next pairs:
> 
> load from offset 32.
> load from offset 48.
> 
> Instead in rs6000 we should get:
> 
> pairs:
> 
> load from offset 0.
> load from offset 16.
> 
> next pairs:
> 
> load from offset 32.
> load from offset 48.

Hmm, so then I guess my question is: why wouldn't you consider merging
the pair with offsets (16,32) for rs6000?  Is it because you have a
stricter alignment requirement on the base pair offsets (i.e. they have
to be a multiple of 32 when the operand size is 16)?  So the pair
offsets have to be a multiple of the entire pair size rather than a
single operand size?

If that is the case then I think it would be better to introduce a
virtual function (say pair_offset_alignment_ok_p) that vets the base
offset of the pair (prev_access->offset in transform_for_base).  I guess
it would also take access_size as a parameter and for aarch64 it should
check:

  multiple_p (offset, access_size)

and for rs6000 it could check:

  multiple_p (offset, access_size * 2)

and we would then incorporate a call to that predicate in the else if
condition of transform_for_base.

It would have the same limitation whereby we assume that MEM_EXPR offset
alignment is a good proxy for RTL offset alignment, but we already
depend on that with the multiple_p check in track_via_mem_expr.
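
Concretely, I'd expect something like this rough sketch (the hook name
and the rs6000 override are illustrative, not from the patch):

  // Return true if OFF is a suitably-aligned base offset for a pair
  // of accesses that are each ACCESS_SIZE bytes wide.
  virtual bool pair_offset_alignment_ok_p (poly_int64 off,
                                           unsigned access_size)
  {
    return multiple_p (off, access_size);
  }

with the rs6000 override instead checking
multiple_p (off, access_size * 2), and transform_for_base only calling
merge_pairs when the predicate holds for prev_access->offset.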

Thanks,
Alex

> 
> Thanks & Regards
> Ajit
> > Thanks,
> > Alex
> > 
> >>
> >>>
> >>>>  	}
> >>>>        prev_access = &access;
> >>>>      }
> >>>> @@ -2919,7 +3047,7 @@ ldp_bb_info::transform_for_base (int encoded_lfs,
> >>>>  // and remove all the tombstone insns, being sure to reparent any uses
> >>>>  // of mem to previous defs when we do this.
> >>>>  void
> >>>> -ldp_bb_info::cleanup_tombstones ()
> >>>> +pair_fusion_bb_info::cleanup_tombstones ()
> >>>>  {
> >>>>    // No need to do anything if we didn't emit a tombstone insn for this BB.
> >>>>    if (!m_emitted_tombstone)
> >>>> @@ -2947,7 +3075,7 @@ ldp_bb_info::cleanup_tombstones ()
> >>>>  
> >>>>  template<typename Map>
> >>>>  void
> >>>> -ldp_bb_info::traverse_base_map (Map &map)
> >>>> +pair_fusion_bb_info::traverse_base_map (Map &map)
> >>>>  {
> >>>>    for (auto kv : map)
> >>>>      {
> >>>> @@ -2958,7 +3086,7 @@ ldp_bb_info::traverse_base_map (Map &map)
> >>>>  }
> >>>>  
> >>>>  void
> >>>> -ldp_bb_info::transform ()
> >>>> +pair_fusion_bb_info::transform ()
> >>>>  {
> >>>>    traverse_base_map (expr_map);
> >>>>    traverse_base_map (def_map);
> >>>> @@ -3167,14 +3295,13 @@ try_promote_writeback (insn_info *insn)
> >>>>  // for load/store candidates.  If running after RA, also try and promote
> >>>>  // non-writeback pairs to use writeback addressing.  Then try to fuse
> >>>>  // candidates into pairs.
> >>>> -void ldp_fusion_bb (bb_info *bb)
> >>>> +void pair_fusion::ldp_fusion_bb (bb_info *bb)
> >>>>  {
> >>>> -  const bool track_loads
> >>>> -    = aarch64_tune_params.ldp_policy_model != AARCH64_LDP_STP_POLICY_NEVER;
> >>>> -  const bool track_stores
> >>>> -    = aarch64_tune_params.stp_policy_model != AARCH64_LDP_STP_POLICY_NEVER;
> >>>> +  const bool track_loads = track_load_p ();
> >>>> +  const bool track_stores = track_store_p ();
> >>>>  
> >>>> -  ldp_bb_info bb_state (bb);
> >>>
> >>> This:
> >>>
> >>>> +  aarch64_pair_fusion derived;
> >>>
> >>> can be deleted and then:
> >>>
> >>>> +  pair_fusion_bb_info bb_info (bb, &derived);
> >>>
> >>> can just be:
> >>>
> >>>   pair_fusion_bb_info bb_info (bb, this);
> >>>
> >>> (or you can pass *this if you make bb_info take a reference).
> >>>
> >>> I don't think there's a particular need to change the variable name
> >>> (bb_state -> bb_info).  I chose the former because it doens't clash
> >>> with the RTL-SSA structure of the same name as the latter.
> >>>
> >>
> >> Addressed.
> >>>>  
> >>>>    for (auto insn : bb->nondebug_insns ())
> >>>>      {
> >>>> @@ -3184,31 +3311,31 @@ void ldp_fusion_bb (bb_info *bb)
> >>>>  	continue;
> >>>>  
> >>>>        rtx pat = PATTERN (rti);
> >>>> -      if (reload_completed
> >>>> -	  && aarch64_ldp_writeback > 1
> >>>> -	  && GET_CODE (pat) == PARALLEL
> >>>> -	  && XVECLEN (pat, 0) == 2)
> >>>> +      if (pair_mem_promote_writeback_p (pat))
> >>>>  	try_promote_writeback (insn);
> >>>
> >>> It looks like try_promote_writeback itself will need some further work
> >>> to make it target-independent.  I suppose this check:
> >>>
> >>>   auto rti = insn->rtl ();
> >>>   const auto attr = get_attr_ldpstp (rti);
> >>>   if (attr == LDPSTP_NONE)
> >>>     return;
> >>>
> >>>   bool load_p = (attr == LDPSTP_LDP);
> >>>   gcc_checking_assert (load_p || attr == LDPSTP_STP);
> >>>
> >>> will need to become part of the pair_mem_promote_writeback_p hook that you
> >>> added, potentially changing it to return a boolean for load_p.
> >>>
> >>> Then I guess we will need hooks for destructuring the pair insn and
> >>> another hook to wrap aarch64_gen_writeback_pair.
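> >>>
> >>> E.g. the hook could become something like this rough sketch (the
> >>> out-parameter signature is just one option):
> >>>
> >>>   bool pair_mem_promote_writeback_p (insn_info *insn, rtx pat,
> >>>                                      bool &load_p) override
> >>>   {
> >>>     if (!reload_completed
> >>>         || aarch64_ldp_writeback <= 1
> >>>         || GET_CODE (pat) != PARALLEL
> >>>         || XVECLEN (pat, 0) != 2)
> >>>       return false;
> >>>
> >>>     const auto attr = get_attr_ldpstp (insn->rtl ());
> >>>     if (attr == LDPSTP_NONE)
> >>>       return false;
> >>>
> >>>     load_p = (attr == LDPSTP_LDP);
> >>>     return true;
> >>>   }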
> >>>
> >>
> >> Addressed.
> >>>>  
> >>>>        if (GET_CODE (pat) != SET)
> >>>>  	continue;
> >>>>  
> >>>>        if (track_stores && MEM_P (XEXP (pat, 0)))
> >>>> -	bb_state.track_access (insn, false, XEXP (pat, 0));
> >>>> +	bb_info.track_access (insn, false, XEXP (pat, 0));
> >>>>        else if (track_loads && MEM_P (XEXP (pat, 1)))
> >>>> -	bb_state.track_access (insn, true, XEXP (pat, 1));
> >>>> +	bb_info.track_access (insn, true, XEXP (pat, 1));
> >>>>      }
> >>>>  
> >>>> -  bb_state.transform ();
> >>>> -  bb_state.cleanup_tombstones ();
> >>>> +  bb_info.transform ();
> >>>> +  bb_info.cleanup_tombstones ();
> >>>>  }
> >>>>  
> >>>>  void ldp_fusion ()
> >>>>  {
> >>>>    ldp_fusion_init ();
> >>>> +  pair_fusion *pfuse;
> >>>> +  aarch64_pair_fusion derived;
> >>>> +  pfuse = &derived;
> >>>
> >>> This is indeed the one place where I think it is acceptable to
> >>> instantiate aarch64_pair_fusion.  But again there's no need to create a
> >>> pointer to the parent class, just call any function you like directly.
> >>>
> >>
> >> Addressed.
> >>>>  
> >>>>    for (auto bb : crtl->ssa->bbs ())
> >>>> -    ldp_fusion_bb (bb);
> >>>> +    pfuse->ldp_fusion_bb (bb);
> >>>
> >>> I think even the code to iterate over bbs should itself be a member
> >>> function of pair_fusion (say "run") and then that becomes part of the
> >>> generic code.
> >>>
> >>> So this function would just become:
> >>>
> >>> aarch64_pair_fusion pass;
> >>> pass.run ();
> >>>
> >>> and could be inlined into the caller.
> >>>
> >>
> >> Addressed.
> >>> Perhaps you could also add an early return like:
> >>>
> >>>   if (!track_loads_p () && !track_stores_p ())
> >>>     return;
> >>>
> >>> in pair_fusion::run () and then remove the corresponding code from
> >>> pass_ldp_fusion::gate?
> >>>
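> >>> Roughly (sketch):
> >>>
> >>>   void
> >>>   pair_fusion::run ()
> >>>   {
> >>>     if (!track_loads_p () && !track_stores_p ())
> >>>       return;
> >>>
> >>>     for (auto bb : crtl->ssa->bbs ())
> >>>       ldp_fusion_bb (bb);
> >>>   }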
> >>
> >> Addressed.
> >>
> >>>>  
> >>>>    ldp_fusion_destroy ();
> >>>>  }
> >>>> -- 
> >>>> 2.39.3
> >>>>
> >>>
> >>> Thanks,
> >>> Alex
> >>
> >> Thanks & Regards
> >> Ajit
Ajit Agarwal April 9, 2024, 3:29 p.m. UTC | #6
Hello Alex:

On 09/04/24 8:39 pm, Alex Coplan wrote:
> On 09/04/2024 20:01, Ajit Agarwal wrote:
>> Hello Alex:
>>
>> On 09/04/24 7:29 pm, Alex Coplan wrote:
>>> On 09/04/2024 17:30, Ajit Agarwal wrote:
>>>>
>>>>
>>>> On 05/04/24 10:03 pm, Alex Coplan wrote:
>>>>> On 05/04/2024 13:53, Ajit Agarwal wrote:
>>>>>> Hello Alex/Richard:
>>>>>>
>>>>>> All review comments are incorporated.
>>>>>
>>>>> Thanks, I was kind-of expecting you to also send the renaming patch as a
>>>>> preparatory patch as we discussed.
>>>>>
>>>>> Sorry for another meta comment, but: I think the reason that the Linaro
>>>>> CI isn't running tests on your patches is actually because you're
>>>>> sending 1/3 of a series but not sending the rest of the series.
>>>>>
>>>>> So please can you either send this as an individual preparatory patch
>>>>> (not marked as a series) or if you're going to send a series (e.g. with
>>>>> a preparatory rename patch as 1/2 and this as 2/2) then send the entire
>>>>> series when you make updates.  That way the CI should test your patches,
>>>>> which would be helpful.
>>>>>
>>>>
>>>> Addressed.
>>>>  
>>>>>>
>>>>>> Common infrastructure of load store pair fusion is divided into target
>>>>>> independent and target dependent changed code.
>>>>>>
>>>>>> Target independent code is the Generic code with pure virtual function
>>>>>> to interface between target independent and dependent code.
>>>>>>
>>>>>> Target dependent code is the implementation of pure virtual function for
>>>>>> aarch64 target and the call to target independent code.
>>>>>>
>>>>>> Thanks & Regards
>>>>>> Ajit
>>>>>>
>>>>>>
>>>>>> aarch64: Place target independent and dependent changed code in one file
>>>>>>
>>>>>> Common infrastructure of load store pair fusion is divided into target
>>>>>> independent and target dependent changed code.
>>>>>>
>>>>>> Target independent code is the Generic code with pure virtual function
>>>>>> to interface between target independent and dependent code.
>>>>>>
>>>>>> Target dependent code is the implementation of pure virtual function for
>>>>>> aarch64 target and the call to target independent code.
>>>>>>
>>>>>> 2024-04-06  Ajit Kumar Agarwal  <aagarwa1@linux.ibm.com>
>>>>>>
>>>>>> gcc/ChangeLog:
>>>>>>
>>>>>> 	* config/aarch64/aarch64-ldp-fusion.cc: Place target
>>>>>> 	independent and dependent changed code.
>>>>>
>>>>> You're going to need a proper ChangeLog eventually, but I guess there's
>>>>> no need for that right now.
>>>>>
>>>>>> ---
>>>>>>  gcc/config/aarch64/aarch64-ldp-fusion.cc | 371 +++++++++++++++--------
>>>>>>  1 file changed, 249 insertions(+), 122 deletions(-)
>>>>>>
>>>>>> diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc b/gcc/config/aarch64/aarch64-ldp-fusion.cc
>>>>>> index 22ed95eb743..cb21b514ef7 100644
>>>>>> --- a/gcc/config/aarch64/aarch64-ldp-fusion.cc
>>>>>> +++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc
>>>>>> @@ -138,8 +138,122 @@ struct alt_base
>>>>>>    poly_int64 offset;
>>>>>>  };
>>>>>>  
>>>>>> +// Virtual base class for load/store walkers used in alias analysis.
>>>>>> +struct alias_walker
>>>>>> +{
>>>>>> +  virtual bool conflict_p (int &budget) const = 0;
>>>>>> +  virtual insn_info *insn () const = 0;
>>>>>> +  virtual bool valid () const  = 0;
>>>>>
>>>>> Heh, looking at this made me realise there is a whitespace bug here in
>>>>> the existing code (double space after const).  Sorry about that!  I'll
>>>>> push an obvious fix for that.
>>>>>
>>>>>> +  virtual void advance () = 0;
>>>>>> +};
>>>>>> +
>>>>>> +struct pair_fusion {
>>>>>> +
>>>>>> +  pair_fusion () {};
>>>>>
>>>>> This ctor looks pointless at the moment.  Perhaps instead we could put
>>>>> the contents of ldp_fusion_init in here and then delete that function?
>>>>>
>>>>
>>>> Addressed.
>>>>
>>>>>> +  virtual bool fpsimd_op_p (rtx reg_op, machine_mode mem_mode,
>>>>>> +			       bool load_p) = 0;
>>>>>
>>>>> Please can we have comments above each of these virtual functions
>>>>> describing any parameters, what the purpose of the hook is, and the
>>>>> interpretation of the return value?  This will serve as the
>>>>> documentation for other targets that want to make use of the pass.
>>>>>
>>>>> It might make sense to have a default-false implementation for
>>>>> fpsimd_op_p, especially if you don't want to make use of this bit for
>>>>> rs6000.
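>>>>>
>>>>> I.e. something like (sketch; the comment wording is illustrative):
>>>>>
>>>>>   // Return true if REG_OP, the register operand of an access in mode
>>>>>   // MEM_MODE, should be treated as an FP/SIMD access rather than a
>>>>>   // GPR access.  LOAD_P is true iff the access is a load.
>>>>>   virtual bool fpsimd_op_p (rtx, machine_mode, bool) { return false; }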
>>>>>
>>>>
>>>> Addressed.
>>>>  
>>>>>> +
>>>>>> +  virtual bool pair_operand_mode_ok_p (machine_mode mode) = 0;
>>>>>> +  virtual bool pair_trailing_writeback_p () = 0;
>>>>>
>>>>> Sorry for the run-around, but: I think this and
>>>>> handle_writeback_opportunities () should be the same function, either
>>>>> returning an enum or taking an enum and returning a boolean.
>>>>>
>>>>> At a minimum they should have more similar sounding names.
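>>>>>
>>>>> E.g. one option would be (rough sketch, enum and names purely
>>>>> illustrative):
>>>>>
>>>>>   enum class writeback { NEVER, EXISTING, ALL };
>>>>>   virtual writeback handle_writeback_opportunities () = 0;
>>>>>
>>>>> with callers then comparing against the relevant enumerator instead
>>>>> of calling two separate predicates.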
>>>>>
>>>>
>>>> Addressed.
>>>>
>>>>>> +  virtual bool pair_reg_operand_ok_p (bool load_p, rtx reg_op,
>>>>>> +				      machine_mode mem_mode) = 0;
>>>>>> +  virtual int pair_mem_alias_check_limit () = 0;
>>>>>> +  virtual bool handle_writeback_opportunities () = 0 ;
>>>>>> +  virtual bool pair_mem_ok_with_policy (rtx first_mem, bool load_p,
>>>>>> +					machine_mode mode) = 0;
>>>>>> +  virtual rtx gen_mem_pair (rtx *pats,  rtx writeback,
>>>>>
>>>>> Nit: excess whitespace after pats,
>>>>>
>>>>>> +			    bool load_p) = 0;
>>>>>> +  virtual bool pair_mem_promote_writeback_p (rtx pat) = 0;
>>>>>> +  virtual bool track_load_p () = 0;
>>>>>> +  virtual bool track_store_p () = 0;
>>>>>
>>>>> I think it would probably make more sense for these two to have
>>>>> default-true implementations rather than being pure virtual functions.
>>>>>
>>>>> Also, sorry for the bikeshedding, but please can we keep the plural
>>>>> names (so track_loads_p and track_stores_p).
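>>>>>
>>>>> I.e. in the base class (sketch):
>>>>>
>>>>>   virtual bool track_loads_p () { return true; }
>>>>>   virtual bool track_stores_p () { return true; }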
>>>>>
>>>>>> +  virtual bool cand_insns_empty_p (std::list<insn_info *> &insns) = 0;
>>>>>
>>>>> Why does this need to be virtualised?  I would expect it to
>>>>> just be insns.empty () on all targets.
>>>>>
>>>>>> +  virtual bool pair_mem_in_range_p (HOST_WIDE_INT off) = 0;
>>>>>> +  void ldp_fusion_bb (bb_info *bb);
>>>>>
>>>>> Just to check, are you planning to rename this function in a separate
>>>>> patch?
>>>>>
>>>>
>>>> Yes.
>>>>>> +
>>>>>> +  ~pair_fusion() { }
>>>>>
>>>>> As with the ctor, perhaps we could put the contents of ldp_fusion_destroy
>>>>> in here and then delete that function?
>>>>>
>>>>
>>>> Addressed.
>>>>>> +};
>>>>>> +
>>>>>> +struct aarch64_pair_fusion : public pair_fusion
>>>>>> +{
>>>>>> +public:
>>>>>
>>>>
>>>> Addressed.
>>>>> Nit: there shouldn't be a need for this as public is already the default
>>>>> for structs.
>>>>>
>>>>>> +  aarch64_pair_fusion () : pair_fusion () {};
>>>>>
>>>>> This ctor looks pointless, the default one should do, so can we remove
>>>>> this?
>>>>>
>>>>
>>>> Addressed.
>>>>>> +  bool fpsimd_op_p (rtx reg_op, machine_mode mem_mode,
>>>>>> +		    bool load_p) override final
>>>>>> +  {
>>>>>> +    const bool fpsimd_op_p
>>>>>> +      = reload_completed
>>>>>> +      ? (REG_P (reg_op) && FP_REGNUM_P (REGNO (reg_op)))
>>>>>> +      : (GET_MODE_CLASS (mem_mode) != MODE_INT
>>>>>> +	 && (load_p || !aarch64_const_zero_rtx_p (reg_op)));
>>>>>> +    return fpsimd_op_p;
>>>>>
>>>>> Can this just directly return the boolean expression instead of storing
>>>>> it into a variable first?
>>>>>
>>>> Addressed.
>>>>>> +  }
>>>>>> +
>>>>>> +  bool pair_mem_promote_writeback_p (rtx pat)
>>>>>> +  {
>>>>>> +     if (reload_completed
>>>>>> +	 && aarch64_ldp_writeback > 1
>>>>>> +	 && GET_CODE (pat) == PARALLEL
>>>>>> +	 && XVECLEN (pat, 0) == 2)
>>>>>> +	return true;
>>>>>> +
>>>>>> +     return false;
>>>>>> +  }
>>>>>> +
>>>>>> +  bool pair_mem_ok_with_policy (rtx first_mem, bool load_p,
>>>>>> +				machine_mode mode)
>>>>>> +  {
>>>>>> +    return aarch64_mem_ok_with_ldpstp_policy_model (first_mem,
>>>>>> +						     load_p,
>>>>>> +						     mode);
>>>>>> +  }
>>>>>> +  bool pair_operand_mode_ok_p (machine_mode mode);
>>>>>
>>>>> Why have you implemented this (one-line function) out-of-line but the
>>>>> others are implemented inline?  Please be consistent.
>>>>>
>>>>
>>>> Addressed.
>>>>>> +
>>>>>> +  rtx gen_mem_pair (rtx *pats, rtx writeback, bool load_p);
>>>>>> +
>>>>>> +  bool pair_trailing_writeback_p  ()
>>>>>> +  {
>>>>>> +    return aarch64_ldp_writeback > 1;
>>>>>> +  }
>>>>>> +  bool pair_reg_operand_ok_p (bool load_p, rtx reg_op,
>>>>>> +			      machine_mode mem_mode)
>>>>>> +  {
>>>>>> +    return (load_p
>>>>>> +	     ? aarch64_ldp_reg_operand (reg_op, mem_mode)
>>>>>> +	     : aarch64_stp_reg_operand (reg_op, mem_mode));
>>>>>> +  }
>>>>>> +  int pair_mem_alias_check_limit ()
>>>>>> +  {
>>>>>> +    return aarch64_ldp_alias_check_limit;
>>>>>> +  }
>>>>>> +  bool handle_writeback_opportunities ()
>>>>>> +  {
>>>>>> +    return aarch64_ldp_writeback;
>>>>>> +  }
>>>>>> +  bool track_load_p ()
>>>>>> +  {
>>>>>> +    const bool track_loads
>>>>>> +      = aarch64_tune_params.ldp_policy_model != AARCH64_LDP_STP_POLICY_NEVER;
>>>>>> +    return track_loads;
>>>>>> +  }
>>>>>
>>>>
>>>> Addressed.
>>>>> As above, can this just return the boolean expression directly without
>>>>> the temporary?
>>>>>
>>>>>> +  bool track_store_p ()
>>>>>> +  {
>>>>>> +    const bool track_stores
>>>>>> +      = aarch64_tune_params.stp_policy_model != AARCH64_LDP_STP_POLICY_NEVER;
>>>>>> +    return track_stores;
>>>>>> +  }
>>>>>
>>>>
>>>> Addressed.
>>>>> Likewise.
>>>>>
>>>>>> +  bool cand_insns_empty_p (std::list<insn_info *> &insns)
>>>>>> +  {
>>>>>> +    return insns.empty();
>>>>>> +  }
>>>>>
>>>>> As mentioned on the declaration, I don't see why this needs to be
>>>>> virtualised.
>>>>>
>>>>>> +  bool pair_mem_in_range_p (HOST_WIDE_INT off)
>>>>>> +  {
>>>>>> +    return (off < LDP_MIN_IMM || off > LDP_MAX_IMM);
>>>>>> +  }
>>>>>
>>>>> Hmm, this looks to test exactly the opposite condition from what the
>>>>> name suggests, i.e. the implementation looks like what I would expect
>>>>> for a function called pair_mem_out_of_range_p.
>>>>>
>>>>> Please can you invert the condition and fix up callers appropriately?
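>>>>>
>>>>> I.e. (sketch of the fixed version):
>>>>>
>>>>>   bool pair_mem_in_range_p (HOST_WIDE_INT off)
>>>>>   {
>>>>>     return off >= LDP_MIN_IMM && off <= LDP_MAX_IMM;
>>>>>   }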
>>>>>
>>>>> This suggests you're not really thinking about the code you're writing,
>>>>> please try to think a bit more carefully before posting patches.
>>>>>
>>>>
>>>> Sorry, my mistake; I had overlooked it.
>>>>>> +};
>>>>>> +
>>>>>> +
>>>>>>  // State used by the pass for a given basic block.
>>>>>> -struct ldp_bb_info
>>>>>> +struct pair_fusion_bb_info
>>>>>>  {
>>>>>>    using def_hash = nofree_ptr_hash<def_info>;
>>>>>>    using expr_key_t = pair_hash<tree_operand_hash, int_hash<int, -1, -2>>;
>>>>>> @@ -160,14 +274,17 @@ struct ldp_bb_info
>>>>>>  
>>>>>>    static const size_t obstack_alignment = sizeof (void *);
>>>>>>    bb_info *m_bb;
>>>>>> +  pair_fusion *bb_state;
>>>>>
>>>>
>>>> Addressed.
>>>>
>>>>> Can we call this m_pass instead of bb_state?  The pair_fusion class is
>>>>> explicitly at the pass level rather than the bb level.
>>>>>
>>>>> Also I think this (and indeed m_bb, although that's pre-existing) could
>>>>> be private members.  Not sure there's a good reason for making them
>>>>> public.
>>>>>
>>>>
>>>> Addressed.
>>>>  
>>>>> It might be slightly nicer to use a reference for m_pass unless there's
>>>>> a good reason to use a pointer, although I don't feel strongly about
>>>>> this.
>>>>>
>>>>>>  
>>>>>> -  ldp_bb_info (bb_info *bb) : m_bb (bb), m_emitted_tombstone (false)
>>>>>> +  pair_fusion_bb_info (bb_info *bb,
>>>>>> +		      aarch64_pair_fusion *d) : m_bb (bb),
>>>>>
>>>>> This generic class needs to take the generic pass type, i.e. just a
>>>>> pair_fusion * rather than an aarch64_pair_fusion *.
>>>>>
>>>>>> +		      bb_state (d), m_emitted_tombstone (false)
>>>>>>    {
>>>>>>      obstack_specify_allocation (&m_obstack, OBSTACK_CHUNK_SIZE,
>>>>>>  				obstack_alignment, obstack_chunk_alloc,
>>>>>>  				obstack_chunk_free);
>>>>>>    }
>>>>>> -  ~ldp_bb_info ()
>>>>>> +  ~pair_fusion_bb_info ()
>>>>>>    {
>>>>>>      obstack_free (&m_obstack, nullptr);
>>>>>>  
>>>>>> @@ -177,10 +294,32 @@ struct ldp_bb_info
>>>>>>  	bitmap_obstack_release (&m_bitmap_obstack);
>>>>>>        }
>>>>>>    }
>>>>>> +  void track_access (insn_info *, bool load, rtx mem);
>>>>>> +  void transform ();
>>>>>> +  void cleanup_tombstones ();
>>>>>
>>>>> I think most (or indeed all?) of the changes here shouldn't be needed.
>>>>> AFAICS, there shouldn't be a need to:
>>>>>  - drop inline from these functions.
>>>>>  - change the accessibility of any members.
>>>>>
>>>>
>>>> Addressed.
>>>>> Plese can you drop these changes or explain why they're needed (from
>>>>> here until the end of the class).
>>>>>
>>>>>> +  void merge_pairs (insn_list_t &, insn_list_t &,
>>>>>> +		    bool load_p, unsigned access_size);
>>>>>> +  void transform_for_base (int load_size, access_group &group);
>>>>>> +
>>>>>> +  bool try_fuse_pair (bool load_p, unsigned access_size,
>>>>>> +			     insn_info *i1, insn_info *i2);
>>>>>> +
>>>>>> +  bool fuse_pair (bool load_p, unsigned access_size,
>>>>>> +		  int writeback,
>>>>>> +		  insn_info *i1, insn_info *i2,
>>>>>> +		  base_cand &base,
>>>>>> +		  const insn_range_info &move_range);
>>>>>>  
>>>>>> -  inline void track_access (insn_info *, bool load, rtx mem);
>>>>>> -  inline void transform ();
>>>>>> -  inline void cleanup_tombstones ();
>>>>>> +  void do_alias_analysis (insn_info *alias_hazards[4],
>>>>>> +			  alias_walker *walkers[4],
>>>>>> +			  bool load_p);
>>>>>> +
>>>>>> +  void track_tombstone (int uid);
>>>>>> +
>>>>>> +  bool track_via_mem_expr (insn_info *, rtx mem, lfs_fields lfs);
>>>>>> +
>>>>>> +  template<typename Map>
>>>>>> +    void traverse_base_map (Map &map);
>>>>>>  
>>>>>>  private:
>>>>>>    obstack m_obstack;
>>>>>> @@ -191,30 +330,32 @@ private:
>>>>>>    bool m_emitted_tombstone;
>>>>>>  
>>>>>>    inline splay_tree_node<access_record *> *node_alloc (access_record *);
>>>>>> +};
>>>>>>  
>>>>>> -  template<typename Map>
>>>>>> -  inline void traverse_base_map (Map &map);
>>>>>> -  inline void transform_for_base (int load_size, access_group &group);
>>>>>> -
>>>>>> -  inline void merge_pairs (insn_list_t &, insn_list_t &,
>>>>>> -			   bool load_p, unsigned access_size);
>>>>>> -
>>>>>> -  inline bool try_fuse_pair (bool load_p, unsigned access_size,
>>>>>> -			     insn_info *i1, insn_info *i2);
>>>>>> -
>>>>>> -  inline bool fuse_pair (bool load_p, unsigned access_size,
>>>>>> -			 int writeback,
>>>>>> -			 insn_info *i1, insn_info *i2,
>>>>>> -			 base_cand &base,
>>>>>> -			 const insn_range_info &move_range);
>>>>>> -
>>>>>> -  inline void track_tombstone (int uid);
>>>>>> +rtx aarch64_pair_fusion::gen_mem_pair (rtx *pats,
>>>>>
>>>>> Style nit: newline after rtx.
>>>>>
>>>> Addressed.
>>>>
>>>>>> +		  rtx writeback,
>>>>>> +		  bool load_p)
>>>>>> +  {
>>>>>> +    rtx pair_pat;
>>>>>>  
>>>>>> -  inline bool track_via_mem_expr (insn_info *, rtx mem, lfs_fields lfs);
>>>>>> -};
>>>>>> +    if (writeback)
>>>>>> +      {
>>>>>> +	auto patvec = gen_rtvec (3, writeback, pats[0], pats[1]);
>>>>>> +	pair_pat = gen_rtx_PARALLEL (VOIDmode, patvec);
>>>>>> +      }
>>>>>> +    else if (load_p)
>>>>>> +      pair_pat = aarch64_gen_load_pair (XEXP (pats[0], 0),
>>>>>> +					XEXP (pats[1], 0),
>>>>>> +					XEXP (pats[0], 1));
>>>>>> +    else
>>>>>> +      pair_pat = aarch64_gen_store_pair (XEXP (pats[0], 0),
>>>>>> +					 XEXP (pats[0], 1),
>>>>>> +					 XEXP (pats[1], 1));
>>>>>> +     return pair_pat;
>>>>>
>>>>> Can the assignments just be direct returns here, please?  Then
>>>>> get rid of the temporary.
>>>>>
>>>>
>>>> Addressed.
>>>>  
>>>>>> +  }
>>>>>>  
>>>>>>  splay_tree_node<access_record *> *
>>>>>> -ldp_bb_info::node_alloc (access_record *access)
>>>>>> +pair_fusion_bb_info::node_alloc (access_record *access)
>>>>>>  {
>>>>>>    using T = splay_tree_node<access_record *>;
>>>>>>    void *addr = obstack_alloc (&m_obstack, sizeof (T));
>>>>>> @@ -262,7 +403,7 @@ drop_writeback (rtx mem)
>>>>>>  // RTX_AUTOINC addresses.  The interface is like strip_offset except we take a
>>>>>>  // MEM so that we know the mode of the access.
>>>>>>  static rtx
>>>>>> -ldp_strip_offset (rtx mem, poly_int64 *offset)
>>>>>> +pair_mem_strip_offset (rtx mem, poly_int64 *offset)
>>>>>
>>>>> I thought we discussed splitting the renaming into a separate patch?
>>>>>
>>>>>>  {
>>>>>>    rtx addr = XEXP (mem, 0);
>>>>>>  
>>>>>> @@ -332,6 +473,12 @@ ldp_operand_mode_ok_p (machine_mode mode)
>>>>>>    return reload_completed || mode != TImode;
>>>>>>  }
>>>>>>  
>>>>>> +bool
>>>>>> +aarch64_pair_fusion::pair_operand_mode_ok_p (machine_mode mode)
>>>>>> +{
>>>>>> +  return ldp_operand_mode_ok_p (mode);
>>>>>> +}
>>>>>> +
>>>>>>  // Given LFS (load_p, fpsimd_p, size) fields in FIELDS, encode these
>>>>>>  // into an integer for use as a hash table key.
>>>>>>  static int
>>>>>> @@ -396,7 +543,7 @@ access_group::track (Alloc alloc_node, poly_int64 offset, insn_info *insn)
>>>>>>  // MEM_EXPR base (i.e. a tree decl) relative to which we can track the access.
>>>>>>  // LFS is used as part of the key to the hash table, see track_access.
>>>>>>  bool
>>>>>> -ldp_bb_info::track_via_mem_expr (insn_info *insn, rtx mem, lfs_fields lfs)
>>>>>> +pair_fusion_bb_info::track_via_mem_expr (insn_info *insn, rtx mem, lfs_fields lfs)
>>>>>
>>>>> Again, please do any renaming in a separate patch.
>>>>>
>>>>
>>>> Addressed.
>>>>>>  {
>>>>>>    if (!MEM_EXPR (mem) || !MEM_OFFSET_KNOWN_P (mem))
>>>>>>      return false;
>>>>>> @@ -412,9 +559,10 @@ ldp_bb_info::track_via_mem_expr (insn_info *insn, rtx mem, lfs_fields lfs)
>>>>>>    const machine_mode mem_mode = GET_MODE (mem);
>>>>>>    const HOST_WIDE_INT mem_size = GET_MODE_SIZE (mem_mode).to_constant ();
>>>>>>  
>>>>>> -  // Punt on misaligned offsets.  LDP/STP instructions require offsets to be a
>>>>>> -  // multiple of the access size, and we believe that misaligned offsets on
>>>>>> -  // MEM_EXPR bases are likely to lead to misaligned offsets w.r.t. RTL bases.
>>>>>> +  // Punt on misaligned offsets.  Paired memory access instructions require
>>>>>> +  // offsets to be a multiple of the access size, and we believe that
>>>>>> +  // misaligned offsets on MEM_EXPR bases are likely to lead to misaligned
>>>>>> +  // offsets w.r.t. RTL bases.
>>>>>>    if (!multiple_p (offset, mem_size))
>>>>>>      return false;
>>>>>>  
>>>>>> @@ -438,46 +586,38 @@ ldp_bb_info::track_via_mem_expr (insn_info *insn, rtx mem, lfs_fields lfs)
>>>>>>  }
>>>>>>  
>>>>>>  // Main function to begin pair discovery.  Given a memory access INSN,
>>>>>> -// determine whether it could be a candidate for fusing into an ldp/stp,
>>>>>> +// determine whether it could be a candidate for fusing into an pair mem,
>>>>>
>>>>> s/an pair mem/a paired access/
>>>>>
>>>>>>  // and if so, track it in the appropriate data structure for this basic
>>>>>>  // block.  LOAD_P is true if the access is a load, and MEM is the mem
>>>>>>  // rtx that occurs in INSN.
>>>>>>  void
>>>>>> -ldp_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
>>>>>> +pair_fusion_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
>>>>>>  {
>>>>>>    // We can't combine volatile MEMs, so punt on these.
>>>>>>    if (MEM_VOLATILE_P (mem))
>>>>>>      return;
>>>>>>  
>>>>>> -  // Ignore writeback accesses if the param says to do so.
>>>>>> -  if (!aarch64_ldp_writeback
>>>>>> +  // Ignore writeback accesses if the param says to do so
>>>>>
>>>>
>>>> Addressed.
>>>>> This comment needs updating, I guess something like "if the hook says to
>>>>> do so".  But also please don't delete the full stop.
>>>>>
>>>>>> +  if (!bb_state->handle_writeback_opportunities ()
>>>>>>        && GET_RTX_CLASS (GET_CODE (XEXP (mem, 0))) == RTX_AUTOINC)
>>>>>>      return;
>>>>>>  
>>>>>>    const machine_mode mem_mode = GET_MODE (mem);
>>>>>> -  if (!ldp_operand_mode_ok_p (mem_mode))
>>>>>> +
>>>>>
>>>>> I don't think we need the extra whitespace here.
>>>>>
>>>>>> +  if (!bb_state->pair_operand_mode_ok_p (mem_mode))
>>>>>>      return;
>>>>>>  
>>>>>>    rtx reg_op = XEXP (PATTERN (insn->rtl ()), !load_p);
>>>>>>  
>>>>>> -  // Ignore the access if the register operand isn't suitable for ldp/stp.
>>>>>> -  if (load_p
>>>>>> -      ? !aarch64_ldp_reg_operand (reg_op, mem_mode)
>>>>>> -      : !aarch64_stp_reg_operand (reg_op, mem_mode))
>>>>>> +  if (!bb_state->pair_reg_operand_ok_p (load_p, reg_op, mem_mode))
>>>>>>      return;
>>>>>> -
>>>>>>    // We want to segregate FP/SIMD accesses from GPR accesses.
>>>>>>    //
>>>>>>    // Before RA, we use the modes, noting that stores of constant zero
>>>>>>    // operands use GPRs (even in non-integer modes).  After RA, we use
>>>>>>    // the hard register numbers.
>>>>>> -  const bool fpsimd_op_p
>>>>>> -    = reload_completed
>>>>>> -    ? (REG_P (reg_op) && FP_REGNUM_P (REGNO (reg_op)))
>>>>>> -    : (GET_MODE_CLASS (mem_mode) != MODE_INT
>>>>>> -       && (load_p || !aarch64_const_zero_rtx_p (reg_op)));
>>>>>> -
>>>>>> -  // Note ldp_operand_mode_ok_p already rejected VL modes.
>>>>>> +  const bool fpsimd_op_p = bb_state->fpsimd_op_p (reg_op, mem_mode, load_p);
>>>>>> +  // Note pair_operand_mode_ok_p already rejected VL modes.
>>>>>>    const HOST_WIDE_INT mem_size = GET_MODE_SIZE (mem_mode).to_constant ();
>>>>>>    const lfs_fields lfs = { load_p, fpsimd_op_p, mem_size };
>>>>>>  
>>>>>> @@ -487,7 +627,7 @@ ldp_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
>>>>>>    poly_int64 mem_off;
>>>>>>    rtx addr = XEXP (mem, 0);
>>>>>>    const bool autoinc_p = GET_RTX_CLASS (GET_CODE (addr)) == RTX_AUTOINC;
>>>>>> -  rtx base = ldp_strip_offset (mem, &mem_off);
>>>>>> +  rtx base = pair_mem_strip_offset (mem, &mem_off);
>>>>>
>>>>> Again, let's split the renaming off into a separate patch.
>>>>>
>>>>
>>>> Addressed.
>>>>>>    if (!REG_P (base))
>>>>>>      return;
>>>>>>  
>>>>>> @@ -506,8 +646,8 @@ ldp_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
>>>>>>    // elimination offset pre-RA, we should postpone forming pairs on such
>>>>>>    // accesses until after RA.
>>>>>>    //
>>>>>> -  // As it stands, addresses with offsets in range for LDR but not
>>>>>> -  // in range for LDP/STP are currently reloaded inefficiently,
>>>>>> +  // As it stands, addresses in range for an individual load/store but not
>>>>>> +  // for a paired access are currently reloaded inefficiently,
>>>>>>    // ending up with a separate base register for each pair.
>>>>>>    //
>>>>>>    // In theory LRA should make use of
>>>>>> @@ -519,8 +659,8 @@ ldp_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
>>>>>>    // that calls targetm.legitimize_address_displacement.
>>>>>>    //
>>>>>>    // So for now, it's better to punt when we can't be sure that the
>>>>>> -  // offset is in range for LDP/STP.  Out-of-range cases can then be
>>>>>> -  // handled after RA by the out-of-range LDP/STP peepholes.  Eventually, it
>>>>>> +  // offset is in range for paired access.  Out-of-range cases can then be
>>>>>> +  // handled after RA by the out-of-range PAIR MEM  peepholes.  Eventually, it
>>>>>>    // would be nice to handle known out-of-range opportunities in the
>>>>>>    // pass itself (for stack accesses, this would be in the post-RA pass).
>>>>>>    if (!reload_completed
>>>>>> @@ -573,8 +713,8 @@ ldp_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
>>>>>>  	gcc_unreachable (); // Base defs should be unique.
>>>>>>      }
>>>>>>  
>>>>>> -  // Punt on misaligned offsets.  LDP/STP require offsets to be a multiple of
>>>>>> -  // the access size.
>>>>>> +  // Punt on misaligned offsets.  Paired memory accesses require offsets
>>>>>> +  // to be a multiple of the access size.
>>>>>>    if (!multiple_p (mem_off, mem_size))
>>>>>>      return;
>>>>>>  
>>>>>> @@ -614,7 +754,7 @@ static bool no_ignore (insn_info *) { return false; }
>>>>>>  // making use of alias disambiguation.
>>>>>>  static insn_info *
>>>>>>  latest_hazard_before (insn_info *insn, rtx *ignore,
>>>>>> -		      insn_info *ignore_insn = nullptr)
>>>>>> +		      insn_info *ignore_insn = 0)
>>>>>
>>>>> Is this change needed?
>>>>>
>>>>
>>>> Addressed.
>>>>>>  {
>>>>>>    insn_info *result = nullptr;
>>>>>>  
>>>>>> @@ -1150,7 +1290,7 @@ extract_writebacks (bool load_p, rtx pats[2], int changed)
>>>>>>        const bool autoinc_p = GET_RTX_CLASS (GET_CODE (addr)) == RTX_AUTOINC;
>>>>>>  
>>>>>>        poly_int64 offset;
>>>>>> -      rtx this_base = ldp_strip_offset (mem, &offset);
>>>>>> +      rtx this_base = pair_mem_strip_offset (mem, &offset);
>>>>>>        gcc_assert (REG_P (this_base));
>>>>>>        if (base_reg)
>>>>>>  	gcc_assert (rtx_equal_p (base_reg, this_base));
>>>>>> @@ -1286,7 +1426,11 @@ find_trailing_add (insn_info *insns[2],
>>>>>>  
>>>>>>    off_hwi /= access_size;
>>>>>>  
>>>>>> -  if (off_hwi < LDP_MIN_IMM || off_hwi > LDP_MAX_IMM)
>>>>>> +  pair_fusion *pfuse;
>>>>>> +  aarch64_pair_fusion derived;
>>>>>> +  pfuse = &derived;
>>>>>
>>>>> Huh? We'll want to use the same instance of the pass structure (pair_fusion)
>>>>> that gets created at the top level, so that should get passed around everywhere.
>>>>>
>>>>> In the case of find_trailing_add we probably don't want to add any more
>>>>> parameters (as Segher noted there are already too many), so perhaps the
>>>>> easiest thing would be to make find_trailing_add itself a (non-virtual)
>>>>> member function of pair_fusion.
>>>>>
>>>>> Then here we can just query pair_mem_in_range_p directly.
>>>>>
>>>>> For try_promote_writeback itself you can either make it a non-virtual
>>>>> member function of pair_fusion or just pass a reference to the pass
>>>>> object in when you call it.
>>>>>
>>>>> Note also that you don't have to create a pointer of the parent type in
>>>>> order to call a virtual function on a derived object, you can just call
>>>>> it directly.
>>>>>
>>>>
>>>> Addressed.
>>>>>> +
>>>>>> +  if (pfuse->pair_mem_in_range_p (off_hwi))
>>>>>>      return nullptr;
>>>>>
>>>>> Note of course this condition needs inverting when you invert the
>>>>> implementation.
>>>>>
>>>>
>>>> Addressed.
>>>>>>  
>>>>>>    auto dump_prefix = [&]()
>>>>>> @@ -1328,7 +1472,7 @@ find_trailing_add (insn_info *insns[2],
>>>>>>  // We just emitted a tombstone with uid UID, track it in a bitmap for
>>>>>>  // this BB so we can easily identify it later when cleaning up tombstones.
>>>>>>  void
>>>>>> -ldp_bb_info::track_tombstone (int uid)
>>>>>> +pair_fusion_bb_info::track_tombstone (int uid)
>>>>>
>>>>> Again, please split the renaming off.  Same with other instances below.
>>>>>
>>>>>>  {
>>>>>>    if (!m_emitted_tombstone)
>>>>>>      {
>>>>>> @@ -1528,7 +1672,7 @@ fixup_debug_uses (obstack_watermark &attempt,
>>>>>>  	  gcc_checking_assert (GET_RTX_CLASS (GET_CODE (XEXP (mem, 0)))
>>>>>>  			       == RTX_AUTOINC);
>>>>>>  
>>>>>> -	  base = ldp_strip_offset (mem, &offset);
>>>>>> +	  base = pair_mem_strip_offset (mem, &offset);
>>>>>>  	  gcc_checking_assert (REG_P (base) && REGNO (base) == base_regno);
>>>>>>  	}
>>>>>>        fixup_debug_use (attempt, use, def, base, offset);
>>>>>> @@ -1664,7 +1808,7 @@ fixup_debug_uses (obstack_watermark &attempt,
>>>>>>  // BASE gives the chosen base candidate for the pair and MOVE_RANGE is
>>>>>>  // a singleton range which says where to place the pair.
>>>>>>  bool
>>>>>> -ldp_bb_info::fuse_pair (bool load_p,
>>>>>> +pair_fusion_bb_info::fuse_pair (bool load_p,
>>>>>>  			unsigned access_size,
>>>>>>  			int writeback,
>>>>>>  			insn_info *i1, insn_info *i2,
>>>>>> @@ -1800,7 +1944,7 @@ ldp_bb_info::fuse_pair (bool load_p,
>>>>>>      {
>>>>>>        if (dump_file)
>>>>>>  	fprintf (dump_file,
>>>>>> -		 "  ldp: i%d has wb but subsequent i%d has non-wb "
>>>>>> +		 "  load pair: i%d has wb but subsequent i%d has non-wb "
>>>>>>  		 "update of base (r%d), dropping wb\n",
>>>>>>  		 insns[0]->uid (), insns[1]->uid (), base_regno);
>>>>>>        gcc_assert (writeback_effect);
>>>>>> @@ -1823,7 +1967,7 @@ ldp_bb_info::fuse_pair (bool load_p,
>>>>>>      }
>>>>>>  
>>>>>>    // If either of the original insns had writeback, but the resulting pair insn
>>>>>> -  // does not (can happen e.g. in the ldp edge case above, or if the writeback
>>>>>> +  // does not (can happen e.g. in the load pair edge case above, or if the writeback
>>>>>>    // effects cancel out), then drop the def(s) of the base register as
>>>>>>    // appropriate.
>>>>>>    //
>>>>>> @@ -1842,7 +1986,7 @@ ldp_bb_info::fuse_pair (bool load_p,
>>>>>>    // update of the base register and try and fold it in to make this into a
>>>>>>    // writeback pair.
>>>>>>    insn_info *trailing_add = nullptr;
>>>>>> -  if (aarch64_ldp_writeback > 1
>>>>>> +  if (bb_state->pair_trailing_writeback_p ()
>>>>>>        && !writeback_effect
>>>>>>        && (!load_p || (!refers_to_regno_p (base_regno, base_regno + 1,
>>>>>>  					 XEXP (pats[0], 0), nullptr)
>>>>>> @@ -1863,14 +2007,14 @@ ldp_bb_info::fuse_pair (bool load_p,
>>>>>>      }
>>>>>>  
>>>>>>    // Now that we know what base mem we're going to use, check if it's OK
>>>>>> -  // with the ldp/stp policy.
>>>>>> +  // with the pair mem policy.
>>>>>>    rtx first_mem = XEXP (pats[0], load_p);
>>>>>> -  if (!aarch64_mem_ok_with_ldpstp_policy_model (first_mem,
>>>>>> -						load_p,
>>>>>> -						GET_MODE (first_mem)))
>>>>>> +  if (!bb_state->pair_mem_ok_with_policy (first_mem,
>>>>>> +			  load_p,
>>>>>> +			  GET_MODE (first_mem)))
>>>>>>      {
>>>>>>        if (dump_file)
>>>>>> -	fprintf (dump_file, "punting on pair (%d,%d), ldp/stp policy says no\n",
>>>>>> +	fprintf (dump_file, "punting on pair (%d,%d), pair mem policy says no\n",
>>>>>>  		 i1->uid (), i2->uid ());
>>>>>>        return false;
>>>>>>      }
>>>>>> @@ -1878,21 +2022,10 @@ ldp_bb_info::fuse_pair (bool load_p,
>>>>>>    rtx reg_notes = combine_reg_notes (first, second, load_p);
>>>>>>  
>>>>>>    rtx pair_pat;
>>>>>> -  if (writeback_effect)
>>>>>> -    {
>>>>>> -      auto patvec = gen_rtvec (3, writeback_effect, pats[0], pats[1]);
>>>>>> -      pair_pat = gen_rtx_PARALLEL (VOIDmode, patvec);
>>>>>> -    }
>>>>>> -  else if (load_p)
>>>>>> -    pair_pat = aarch64_gen_load_pair (XEXP (pats[0], 0),
>>>>>> -				      XEXP (pats[1], 0),
>>>>>> -				      XEXP (pats[0], 1));
>>>>>> -  else
>>>>>> -    pair_pat = aarch64_gen_store_pair (XEXP (pats[0], 0),
>>>>>> -				       XEXP (pats[0], 1),
>>>>>> -				       XEXP (pats[1], 1));
>>>>>>  
>>>>>> +  pair_pat = bb_state->gen_mem_pair (pats, writeback_effect, load_p);
>>>>>>    insn_change *pair_change = nullptr;
>>>>>> +
>>>>>>    auto set_pair_pat = [pair_pat,reg_notes](insn_change *change) {
>>>>>>      rtx_insn *rti = change->insn ()->rtl ();
>>>>>>      validate_unshare_change (rti, &PATTERN (rti), pair_pat, true);
>>>>>> @@ -2133,15 +2266,6 @@ load_modified_by_store_p (insn_info *load,
>>>>>>    return false;
>>>>>>  }
>>>>>>  
>>>>>> -// Virtual base class for load/store walkers used in alias analysis.
>>>>>> -struct alias_walker
>>>>>> -{
>>>>>> -  virtual bool conflict_p (int &budget) const = 0;
>>>>>> -  virtual insn_info *insn () const = 0;
>>>>>> -  virtual bool valid () const  = 0;
>>>>>> -  virtual void advance () = 0;
>>>>>> -};
>>>>>> -
>>>>>>  // Implement some common functionality used by both store_walker
>>>>>>  // and load_walker.
>>>>>>  template<bool reverse>
>>>>>> @@ -2259,13 +2383,13 @@ public:
>>>>>>  //
>>>>>>  // We try to maintain the invariant that if a walker becomes invalid, we
>>>>>>  // set its pointer to null.
>>>>>> -static void
>>>>>> -do_alias_analysis (insn_info *alias_hazards[4],
>>>>>> +void
>>>>>> +pair_fusion_bb_info::do_alias_analysis (insn_info *alias_hazards[4],
>>>>>
>>>>> It would probably make more sense for this to be a (non-virtual) member
>>>>> function of the pair_fusion class, since it doesn't need to access the
>>>>> BB-specific state.
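>>>>>
>>>>> i.e. (sketch):
>>>>>
>>>>>   void
>>>>>   pair_fusion::do_alias_analysis (insn_info *alias_hazards[4],
>>>>>                                   alias_walker *walkers[4],
>>>>>                                   bool load_p)
>>>>>   {
>>>>>     const int n_walkers = 2 + (2 * !load_p);
>>>>>     int budget = pair_mem_alias_check_limit ();
>>>>>     ...
>>>>>   }
>>>>>
>>>>> so the alias-check limit is queried directly rather than going
>>>>> through the BB-level state.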
>>>>>
>>>>
>>>> Addressed.
>>>>>>  		   alias_walker *walkers[4],
>>>>>>  		   bool load_p)
>>>>>>  {
>>>>>>    const int n_walkers = 2 + (2 * !load_p);
>>>>>> -  int budget = aarch64_ldp_alias_check_limit;
>>>>>> +  int budget = bb_state->pair_mem_alias_check_limit ();
>>>>>>  
>>>>>>    auto next_walker = [walkers,n_walkers](int current) -> int {
>>>>>>      for (int j = 1; j <= n_walkers; j++)
>>>>>> @@ -2365,7 +2489,7 @@ get_viable_bases (insn_info *insns[2],
>>>>>>      {
>>>>>>        const bool is_lower = (i == reversed);
>>>>>>        poly_int64 poly_off;
>>>>>> -      rtx base = ldp_strip_offset (cand_mems[i], &poly_off);
>>>>>> +      rtx base = pair_mem_strip_offset (cand_mems[i], &poly_off);
>>>>>>        if (GET_RTX_CLASS (GET_CODE (XEXP (cand_mems[i], 0))) == RTX_AUTOINC)
>>>>>>  	writeback |= (1 << i);
>>>>>>  
>>>>>> @@ -2373,7 +2497,7 @@ get_viable_bases (insn_info *insns[2],
>>>>>>  	continue;
>>>>>>  
>>>>>>        // Punt on accesses relative to eliminable regs.  See the comment in
>>>>>> -      // ldp_bb_info::track_access for a detailed explanation of this.
>>>>>> +      // pair_fusion_bb_info::track_access for a detailed explanation of this.
>>>>>>        if (!reload_completed
>>>>>>  	  && (REGNO (base) == FRAME_POINTER_REGNUM
>>>>>>  	      || REGNO (base) == ARG_POINTER_REGNUM))
>>>>>> @@ -2397,7 +2521,11 @@ get_viable_bases (insn_info *insns[2],
>>>>>>        if (!is_lower)
>>>>>>  	base_off--;
>>>>>>  
>>>>>> -      if (base_off < LDP_MIN_IMM || base_off > LDP_MAX_IMM)
>>>>>> +      pair_fusion *pfuse;
>>>>>> +      aarch64_pair_fusion derived;
>>>>>> +      pfuse = &derived;
>>>>>
>>>>> Again, please don't do this but instead use the same instance
>>>>> everywhere.
>>>>>
>>>>
>>>> Addressed.
>>>>>> +
>>>>>> +      if (pfuse->pair_mem_in_range_p (base_off))
>>>>>>  	continue;
>>>>>>  
>>>>>>        use_info *use = find_access (insns[i]->uses (), REGNO (base));
>>>>>> @@ -2454,12 +2582,12 @@ get_viable_bases (insn_info *insns[2],
>>>>>>  }
>>>>>>  
>>>>>>  // Given two adjacent memory accesses of the same size, I1 and I2, try
>>>>>> -// and see if we can merge them into a ldp or stp.
>>>>>> +// and see if we can merge them into a paired accesses load and store.
>>>>>
>>>>> "into a load pair or store pair" or "into a paired access" if you
>>>>> prefer.
>>>>>
>>>>
>>>> Addressed.
>>>>>>  //
>>>>>>  // ACCESS_SIZE gives the (common) size of a single access, LOAD_P is true
>>>>>>  // if the accesses are both loads, otherwise they are both stores.
>>>>>>  bool
>>>>>> -ldp_bb_info::try_fuse_pair (bool load_p, unsigned access_size,
>>>>>> +pair_fusion_bb_info::try_fuse_pair (bool load_p, unsigned access_size,
>>>>>>  			    insn_info *i1, insn_info *i2)
>>>>>>  {
>>>>>>    if (dump_file)
>>>>>> @@ -2494,7 +2622,7 @@ ldp_bb_info::try_fuse_pair (bool load_p, unsigned access_size,
>>>>>>      {
>>>>>>        if (dump_file)
>>>>>>  	fprintf (dump_file,
>>>>>> -		 "punting on ldp due to reg conflcits (%d,%d)\n",
>>>>>> +		 "punting on pair mem load due to reg conflicts (%d,%d)\n",
>>>>>>  		 insns[0]->uid (), insns[1]->uid ());
>>>>>>        return false;
>>>>>>      }
>>>>>> @@ -2843,7 +2971,7 @@ debug (const insn_list_t &l)
>>>>>>  // we can't re-order them anyway, so provided earlier passes have cleaned up
>>>>>>  // redundant loads, we shouldn't miss opportunities by doing this.
>>>>>>  void
>>>>>> -ldp_bb_info::merge_pairs (insn_list_t &left_list,
>>>>>> +pair_fusion_bb_info::merge_pairs (insn_list_t &left_list,
>>>>>>  			  insn_list_t &right_list,
>>>>>>  			  bool load_p,
>>>>>>  			  unsigned access_size)
>>>>>> @@ -2890,8 +3018,8 @@ ldp_bb_info::merge_pairs (insn_list_t &left_list,
>>>>>>  // of accesses.  If we find two sets of adjacent accesses, call
>>>>>>  // merge_pairs.
>>>>>>  void
>>>>>> -ldp_bb_info::transform_for_base (int encoded_lfs,
>>>>>> -				 access_group &group)
>>>>>> +pair_fusion_bb_info::transform_for_base (int encoded_lfs,
>>>>>> +					 access_group &group)
>>>>>>  {
>>>>>>    const auto lfs = decode_lfs (encoded_lfs);
>>>>>>    const unsigned access_size = lfs.size;
>>>>>> @@ -2909,7 +3037,7 @@ ldp_bb_info::transform_for_base (int encoded_lfs,
>>>>>>  		       access.cand_insns,
>>>>>>  		       lfs.load_p,
>>>>>>  		       access_size);
>>>>>> -	  skip_next = access.cand_insns.empty ();
>>>>>> +	  skip_next = bb_state->cand_insns_empty_p (access.cand_insns);
>>>>>
>>>>> As above, why is this needed?
>>>>
>>>> For rs6000 we want to always return true, as the load/store pairs
>>>> to be merged currently occur at offsets 8/16, 16/32 and 32/64 for
>>>> rs6000, whereas we only want pairs at 8/16 and 32/64.  That's why we
>>>> want to always return true for rs6000, to skip pairs as above.
>>>
>>> Hmm, sorry, I'm not sure I follow.  Are you saying that for rs6000 you have
>>> load/store pair instructions where the two arms of the access are storing
>>> operands of different sizes?  Or something else?
>>>
>>> As it stands the logic is to skip the next iteration only if we
>>> exhausted all the candidate insns for the current access.  In the case
>>> that we didn't exhaust all such candidates, then the idea is that when
>>> access becomes prev_access, we can attempt to use those candidates as
>>> the "left-hand side" of a pair in the next iteration since we failed to
>>> use them as the "right-hand side" of a pair in the current iteration.
>>> I don't see why you wouldn't want that behaviour.  Please can you
>>> explain?
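>>>
>>> To spell out that logic (paraphrasing the loop in transform_for_base):
>>>
>>>   if (skip_next)
>>>     skip_next = false;
>>>   else if (known_eq (access.offset, prev_access->offset + access_size))
>>>     {
>>>       merge_pairs (prev_access->cand_insns, access.cand_insns,
>>>                    lfs.load_p, access_size);
>>>       skip_next = access.cand_insns.empty ();
>>>     }
>>>   prev_access = &access;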
>>>
>>
>> In merge_pairs we get the two load candidates: one load from offset 0
>> and the other load from offset 16.  Then in the next iteration we get
>> a load from offset 16 and the other from offset 32.  In the next
>> iteration we get a load from offset 32 and the other from offset 48.
>>
>> For example:
>>
>> Currently we get the load candidates as follows.
>>
>> pairs:
>>
>> load from 0th offset.
>> load from 16th offset.
>>
>> next pairs:
>>
>> load from 16th offset.
>> load from 32nd offset.
>>
>> next pairs:
>>
>> load from 32nd offset.
>> load from 48th offset.
>>
>> Instead in rs6000 we should get:
>>
>> pairs:
>>
>> load from 0th offset.
>> load from 16th offset.
>>
>> next pairs:
>>
>> load from 32nd offset.
>> load from 48th offset.
> 
> Hmm, so then I guess my question is: why wouldn't you consider merging
> the pair with offsets (16,32) for rs6000?  Is it because you have a
> stricter alignment requirement on the base pair offsets (i.e. they have
> to be a multiple of 32 when the operand size is 16)?  So the pair
> offsets have to be a multiple of the entire pair size rather than a
> single operand size?

We get one load pair at a certain program point with offsets (0,16)
and another load pair at a different point with offsets (32,48).

The current implementation forms pairs from loads at offsets (0,16),
(16,32) and (32,48).

But in rs6000 we want the load pairs to be merged only as (0,16) and
(32,48): for (0,16) we want to replace the loads with an lxvp at
offset 0, and for (32,48) with an lxvp at offset 32.

In the current case it would merge an lxvp at offset 0 and an lxvp
at offset 16, then an lxvp at offset 32 and an lxvp at offset 48,
which is incorrect in our case: in the (16,32) case the load at
offset 16 will not load into an even register and will break rs6000.

lxvp should load into even registers; the loaded value is then in
the even register and even register + 1 (which is odd).

Thanks & Regards
Ajit
> If that is the case then I think it would be better to introduce a
> virtual function (say pair_offset_alignment_ok_p) that vets the base
> offset of the pair (prev_access->offset in transform_for_base).  I guess
> it would also take access_size as a parameter and for aarch64 it should
> check:
> 
>   multiple_p (offset, access_size)
> 
> and for rs6000 it could check:
> 
>   multiple_p (offset, access_size * 2)
> 
> and we would then incorporate a call to that predicate in the else if
> condition of tranform_for_base.
> 
> It would have the same limitation whereby we assume that MEM_EXPR offset
> alignment is a good proxy for RTL offset alignment, but we already
> depend on that with the multiple_p check in track_via_mem_expr.
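> 
> As a sketch (untested, and the name is up for bikeshedding), the
> overrides would be:
> 
>   // aarch64: pair offsets must be a multiple of a single access.
>   bool pair_offset_alignment_ok_p (poly_int64 offset,
>                                    unsigned access_size) override
>   {
>     return multiple_p (offset, access_size);
>   }
> 
>   // rs6000: the pair must start at a multiple of the full pair size.
>   bool pair_offset_alignment_ok_p (poly_int64 offset,
>                                    unsigned access_size) override
>   {
>     return multiple_p (offset, access_size * 2);
>   }
> 
> with transform_for_base only calling merge_pairs when the predicate
> holds for prev_access->offset.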
> 
> Thanks,
> Alex
> 
>>
>> Thanks & Regards
>> Ajit
>>> Thanks,
>>> Alex
>>>
>>>>
>>>>>
>>>>>>  	}
>>>>>>        prev_access = &access;
>>>>>>      }
>>>>>> @@ -2919,7 +3047,7 @@ ldp_bb_info::transform_for_base (int encoded_lfs,
>>>>>>  // and remove all the tombstone insns, being sure to reparent any uses
>>>>>>  // of mem to previous defs when we do this.
>>>>>>  void
>>>>>> -ldp_bb_info::cleanup_tombstones ()
>>>>>> +pair_fusion_bb_info::cleanup_tombstones ()
>>>>>>  {
>>>>>>    // No need to do anything if we didn't emit a tombstone insn for this BB.
>>>>>>    if (!m_emitted_tombstone)
>>>>>> @@ -2947,7 +3075,7 @@ ldp_bb_info::cleanup_tombstones ()
>>>>>>  
>>>>>>  template<typename Map>
>>>>>>  void
>>>>>> -ldp_bb_info::traverse_base_map (Map &map)
>>>>>> +pair_fusion_bb_info::traverse_base_map (Map &map)
>>>>>>  {
>>>>>>    for (auto kv : map)
>>>>>>      {
>>>>>> @@ -2958,7 +3086,7 @@ ldp_bb_info::traverse_base_map (Map &map)
>>>>>>  }
>>>>>>  
>>>>>>  void
>>>>>> -ldp_bb_info::transform ()
>>>>>> +pair_fusion_bb_info::transform ()
>>>>>>  {
>>>>>>    traverse_base_map (expr_map);
>>>>>>    traverse_base_map (def_map);
>>>>>> @@ -3167,14 +3295,13 @@ try_promote_writeback (insn_info *insn)
>>>>>>  // for load/store candidates.  If running after RA, also try and promote
>>>>>>  // non-writeback pairs to use writeback addressing.  Then try to fuse
>>>>>>  // candidates into pairs.
>>>>>> -void ldp_fusion_bb (bb_info *bb)
>>>>>> +void pair_fusion::ldp_fusion_bb (bb_info *bb)
>>>>>>  {
>>>>>> -  const bool track_loads
>>>>>> -    = aarch64_tune_params.ldp_policy_model != AARCH64_LDP_STP_POLICY_NEVER;
>>>>>> -  const bool track_stores
>>>>>> -    = aarch64_tune_params.stp_policy_model != AARCH64_LDP_STP_POLICY_NEVER;
>>>>>> +  const bool track_loads = track_load_p ();
>>>>>> +  const bool track_stores = track_store_p ();
>>>>>>  
>>>>>> -  ldp_bb_info bb_state (bb);
>>>>>
>>>>> This:
>>>>>
>>>>>> +  aarch64_pair_fusion derived;
>>>>>
>>>>> can be deleted and then:
>>>>>
>>>>>> +  pair_fusion_bb_info bb_info (bb, &derived);
>>>>>
>>>>> can just be:
>>>>>
>>>>>   pair_fusion_bb_info bb_info (bb, this);
>>>>>
>>>>> (or you can pass *this if you make bb_info take a reference).
>>>>>
>>>>> I don't think there's a particular need to change the variable name
>>>>> (bb_state -> bb_info).  I chose the former because it doens't clash
>>>>> with the RTL-SSA structure of the same name as the latter.
>>>>>
>>>>
>>>> Addressed.
>>>>>>  
>>>>>>    for (auto insn : bb->nondebug_insns ())
>>>>>>      {
>>>>>> @@ -3184,31 +3311,31 @@ void ldp_fusion_bb (bb_info *bb)
>>>>>>  	continue;
>>>>>>  
>>>>>>        rtx pat = PATTERN (rti);
>>>>>> -      if (reload_completed
>>>>>> -	  && aarch64_ldp_writeback > 1
>>>>>> -	  && GET_CODE (pat) == PARALLEL
>>>>>> -	  && XVECLEN (pat, 0) == 2)
>>>>>> +      if (pair_mem_promote_writeback_p (pat))
>>>>>>  	try_promote_writeback (insn);
>>>>>
>>>>> It looks like try_promote_writeback itself will need some further work
>>>>> to make it target-independent.  I suppose this check:
>>>>>
>>>>>   auto rti = insn->rtl ();
>>>>>   const auto attr = get_attr_ldpstp (rti);
>>>>>   if (attr == LDPSTP_NONE)
>>>>>     return;
>>>>>
>>>>>   bool load_p = (attr == LDPSTP_LDP);
>>>>>   gcc_checking_assert (load_p || attr == LDPSTP_STP);
>>>>>
>>>>> will need to become part of the pair_mem_promote_writeback_p hook that you
>>>>> added, potentially changing it to return a boolean for load_p.
>>>>>
>>>>> Then I guess we will need hooks for destructuring the pair insn and
>>>>> another hook to wrap aarch64_gen_writeback_pair.
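>>>>>
>>>>> i.e. something like this sketch, where the exact signature is of
>>>>> course up for discussion:
>>>>>
>>>>>   // Return true if INSN is a candidate for writeback promotion,
>>>>>   // setting *LOAD_P according to whether it is a load pair.
>>>>>   virtual bool pair_mem_promote_writeback_p (insn_info *insn,
>>>>>                                              rtx pat,
>>>>>                                              bool *load_p) = 0;
>>>>>
>>>>> with the aarch64 override doing the get_attr_ldpstp check quoted
>>>>> above.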
>>>>>
>>>>
>>>> Addressed.
>>>>>>  
>>>>>>        if (GET_CODE (pat) != SET)
>>>>>>  	continue;
>>>>>>  
>>>>>>        if (track_stores && MEM_P (XEXP (pat, 0)))
>>>>>> -	bb_state.track_access (insn, false, XEXP (pat, 0));
>>>>>> +	bb_info.track_access (insn, false, XEXP (pat, 0));
>>>>>>        else if (track_loads && MEM_P (XEXP (pat, 1)))
>>>>>> -	bb_state.track_access (insn, true, XEXP (pat, 1));
>>>>>> +	bb_info.track_access (insn, true, XEXP (pat, 1));
>>>>>>      }
>>>>>>  
>>>>>> -  bb_state.transform ();
>>>>>> -  bb_state.cleanup_tombstones ();
>>>>>> +  bb_info.transform ();
>>>>>> +  bb_info.cleanup_tombstones ();
>>>>>>  }
>>>>>>  
>>>>>>  void ldp_fusion ()
>>>>>>  {
>>>>>>    ldp_fusion_init ();
>>>>>> +  pair_fusion *pfuse;
>>>>>> +  aarch64_pair_fusion derived;
>>>>>> +  pfuse = &derived;
>>>>>
>>>>> This is indeed the one place where I think it is acceptable to
>>>>> instantiate aarch64_pair_fusion.  But again there's no need to create a
>>>>> pointer to the parent class, just call any function you like directly.
>>>>>
>>>>
>>>> Addressed.
>>>>>>  
>>>>>>    for (auto bb : crtl->ssa->bbs ())
>>>>>> -    ldp_fusion_bb (bb);
>>>>>> +    pfuse->ldp_fusion_bb (bb);
>>>>>
>>>>> I think even the code to iterate over bbs should itself be a member
>>>>> function of pair_fusion (say "run") and then that becomes part of the
>>>>> generic code.
>>>>>
>>>>> So this function would just become:
>>>>>
>>>>> aarch64_pair_fusion pass;
>>>>> pass.run ();
>>>>>
>>>>> and could be inlined into the caller.
>>>>>
>>>>
>>>> Addressed.
>>>>> Perhaps you could also add an early return like:
>>>>>
>>>>>   if (!track_loads_p () && !track_stores_p ())
>>>>>     return;
>>>>>
>>>>> in pair_fusion::run () and then remove the corresponding code from
>>>>> pass_ldp_fusion::gate?
>>>>>
>>>>
>>>> Addressed.
>>>>
>>>>>>  
>>>>>>    ldp_fusion_destroy ();
>>>>>>  }
>>>>>> -- 
>>>>>> 2.39.3
>>>>>>
>>>>>
>>>>> Thanks,
>>>>> Alex
>>>>
>>>> Thanks & Regards
>>>> Ajit
Alex Coplan April 10, 2024, 8:12 a.m. UTC | #7
Hi Ajit,

On 09/04/2024 20:59, Ajit Agarwal wrote:
> Hello Alex:
> 
> On 09/04/24 8:39 pm, Alex Coplan wrote:
> > On 09/04/2024 20:01, Ajit Agarwal wrote:
> >> Hello Alex:
> >>
> >> On 09/04/24 7:29 pm, Alex Coplan wrote:
> >>> On 09/04/2024 17:30, Ajit Agarwal wrote:
> >>>>
> >>>>
> >>>> On 05/04/24 10:03 pm, Alex Coplan wrote:
> >>>>> On 05/04/2024 13:53, Ajit Agarwal wrote:
> >>>>>> Hello Alex/Richard:
> >>>>>>
> >>>>>> All review comments are incorporated.
> >>>>>
> >>>>> Thanks, I was kind-of expecting you to also send the renaming patch as a
> >>>>> preparatory patch as we discussed.
> >>>>>
> >>>>> Sorry for another meta comment, but: I think the reason that the Linaro
> >>>>> CI isn't running tests on your patches is actually because you're
> >>>>> sending 1/3 of a series but not sending the rest of the series.
> >>>>>
> >>>>> So please can you either send this as an individual preparatory patch
> >>>>> (not marked as a series) or if you're going to send a series (e.g. with
> >>>>> a preparatory rename patch as 1/2 and this as 2/2) then send the entire
> >>>>> series when you make updates.  That way the CI should test your patches,
> >>>>> which would be helpful.
> >>>>>
> >>>>
> >>>> Addressed.
> >>>>  
> >>>>>>
> >>>>>> Common infrastructure of load store pair fusion is divided into target
> >>>>>> independent and target dependent changed code.
> >>>>>>
> >>>>>> Target independent code is the Generic code with pure virtual function
> >>>>>> to interface between target independent and dependent code.
> >>>>>>
> >>>>>> Target dependent code is the implementation of pure virtual function for
> >>>>>> aarch64 target and the call to target independent code.
> >>>>>>
> >>>>>> Thanks & Regards
> >>>>>> Ajit
> >>>>>>
> >>>>>>
> >>>>>> aarch64: Place target independent and dependent changed code in one file
> >>>>>>
> >>>>>> Common infrastructure of load store pair fusion is divided into target
> >>>>>> independent and target dependent changed code.
> >>>>>>
> >>>>>> Target independent code is the Generic code with pure virtual function
> >>>>>> to interface between target independent and dependent code.
> >>>>>>
> >>>>>> Target dependent code is the implementation of pure virtual function for
> >>>>>> aarch64 target and the call to target independent code.
> >>>>>>
> >>>>>> 2024-04-06  Ajit Kumar Agarwal  <aagarwa1@linux.ibm.com>
> >>>>>>
> >>>>>> gcc/ChangeLog:
> >>>>>>
> >>>>>> 	* config/aarch64/aarch64-ldp-fusion.cc: Place target
> >>>>>> 	independent and dependent changed code.
> >>>>>
> >>>>> You're going to need a proper ChangeLog eventually, but I guess there's
> >>>>> no need for that right now.
> >>>>>
> >>>>>> ---
> >>>>>>  gcc/config/aarch64/aarch64-ldp-fusion.cc | 371 +++++++++++++++--------
> >>>>>>  1 file changed, 249 insertions(+), 122 deletions(-)
> >>>>>>
> >>>>>> diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc b/gcc/config/aarch64/aarch64-ldp-fusion.cc
> >>>>>> index 22ed95eb743..cb21b514ef7 100644
> >>>>>> --- a/gcc/config/aarch64/aarch64-ldp-fusion.cc
> >>>>>> +++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc
> >>>>>> @@ -138,8 +138,122 @@ struct alt_base
> >>>>>>    poly_int64 offset;
> >>>>>>  };
> >>>>>>  
> >>>>>> +// Virtual base class for load/store walkers used in alias analysis.
> >>>>>> +struct alias_walker
> >>>>>> +{
> >>>>>> +  virtual bool conflict_p (int &budget) const = 0;
> >>>>>> +  virtual insn_info *insn () const = 0;
> >>>>>> +  virtual bool valid () const  = 0;
> >>>>>
> >>>>> Heh, looking at this made me realise there is a whitespace bug here in
> >>>>> the existing code (double space after const).  Sorry about that!  I'll
> >>>>> push an obvious fix for that.
> >>>>>
> >>>>>> +  virtual void advance () = 0;
> >>>>>> +};
> >>>>>> +
> >>>>>> +struct pair_fusion {
> >>>>>> +
> >>>>>> +  pair_fusion () {};
> >>>>>
> >>>>> This ctor looks pointless at the moment.  Perhaps instead we could put
> >>>>> the contents of ldp_fusion_init in here and then delete that function?
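> >>>>>
> >>>>> i.e. (sketch):
> >>>>>
> >>>>>   pair_fusion ()
> >>>>>   {
> >>>>>     // Body of the current ldp_fusion_init, i.e. the dominance
> >>>>>     // info, df_analyze and rtl-ssa setup.
> >>>>>   }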
> >>>>>
> >>>>
> >>>> Addressed.
> >>>>
> >>>>>> +  virtual bool fpsimd_op_p (rtx reg_op, machine_mode mem_mode,
> >>>>>> +			       bool load_p) = 0;
> >>>>>
> >>>>> Please can we have comments above each of these virtual functions
> >>>>> describing any parameters, what the purpose of the hook is, and the
> >>>>> interpretation of the return value?  This will serve as the
> >>>>> documentation for other targets that want to make use of the pass.
> >>>>>
> >>>>> It might make sense to have a default-false implementation for
> >>>>> fpsimd_op_p, especially if you don't want to make use of this bit for
> >>>>> rs6000.
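> >>>>>
> >>>>> e.g. (sketch):
> >>>>>
> >>>>>   // Return true if this is an FP/SIMD access that should be
> >>>>>   // segregated from GPR accesses.  Default: no segregation.
> >>>>>   virtual bool fpsimd_op_p (rtx, machine_mode, bool)
> >>>>>   {
> >>>>>     return false;
> >>>>>   }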
> >>>>>
> >>>>
> >>>> Addressed.
> >>>>  
> >>>>>> +
> >>>>>> +  virtual bool pair_operand_mode_ok_p (machine_mode mode) = 0;
> >>>>>> +  virtual bool pair_trailing_writeback_p () = 0;
> >>>>>
> >>>>> Sorry for the run-around, but: I think this and
> >>>>> handle_writeback_opportunities () should be the same function, either
> >>>>> returning an enum or taking an enum and returning a boolean.
> >>>>>
> >>>>> At a minimum they should have more similar sounding names.
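> >>>>>
> >>>>> e.g. a sketch of the taking-an-enum variant (names purely
> >>>>> illustrative):
> >>>>>
> >>>>>   enum class writeback {
> >>>>>     EXISTING,  // Can we handle pre-existing writeback accesses?
> >>>>>     ALL        // Can we also fold in trailing base updates?
> >>>>>   };
> >>>>>
> >>>>>   virtual bool handle_writeback_opportunities (writeback which) = 0;
> >>>>>
> >>>>> where the aarch64 implementation would check
> >>>>> aarch64_ldp_writeback > 0 or aarch64_ldp_writeback > 1 as
> >>>>> appropriate.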
> >>>>>
> >>>>
> >>>> Addressed.
> >>>>
> >>>>>> +  virtual bool pair_reg_operand_ok_p (bool load_p, rtx reg_op,
> >>>>>> +				      machine_mode mem_mode) = 0;
> >>>>>> +  virtual int pair_mem_alias_check_limit () = 0;
> >>>>>> +  virtual bool handle_writeback_opportunities () = 0 ;
> >>>>>> +  virtual bool pair_mem_ok_with_policy (rtx first_mem, bool load_p,
> >>>>>> +					machine_mode mode) = 0;
> >>>>>> +  virtual rtx gen_mem_pair (rtx *pats,  rtx writeback,
> >>>>>
> >>>>> Nit: excess whitespace after pats,
> >>>>>
> >>>>>> +			    bool load_p) = 0;
> >>>>>> +  virtual bool pair_mem_promote_writeback_p (rtx pat) = 0;
> >>>>>> +  virtual bool track_load_p () = 0;
> >>>>>> +  virtual bool track_store_p () = 0;
> >>>>>
> >>>>> I think it would probably make more sense for these two to have
> >>>>> default-true implementations rather than being pure virtual functions.
> >>>>>
> >>>>> Also, sorry for the bikeshedding, but please can we keep the plural
> >>>>> names (so track_loads_p and track_stores_p).
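> >>>>>
> >>>>> i.e. in the base class (sketch):
> >>>>>
> >>>>>   virtual bool track_loads_p () { return true; }
> >>>>>   virtual bool track_stores_p () { return true; }
> >>>>>
> >>>>> with aarch64 overriding both to consult the tuning policy.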
> >>>>>
> >>>>>> +  virtual bool cand_insns_empty_p (std::list<insn_info *> &insns) = 0;
> >>>>>
> >>>>> Why does this need to be virtualised?  I would it expect it to
> >>>>> just be insns.empty () on all targets.
> >>>>>
> >>>>>> +  virtual bool pair_mem_in_range_p (HOST_WIDE_INT off) = 0;
> >>>>>> +  void ldp_fusion_bb (bb_info *bb);
> >>>>>
> >>>>> Just to check, are you planning to rename this function in a separate
> >>>>> patch?
> >>>>>
> >>>>
> >>>> Yes.
> >>>>>> +
> >>>>>> +  ~pair_fusion() { }
> >>>>>
> >>>>> As with the ctor, perhaps we could put the contents of ldp_fusion_destroy
> >>>>> in here and then delete that function?
> >>>>>
> >>>>
> >>>> Addressed.
> >>>>>> +};
> >>>>>> +
> >>>>>> +struct aarch64_pair_fusion : public pair_fusion
> >>>>>> +{
> >>>>>> +public:
> >>>>>
> >>>>
> >>>> Addressed.
> >>>>> Nit: there shouldn't be a need for this as public is already the default
> >>>>> for structs.
> >>>>>
> >>>>>> +  aarch64_pair_fusion () : pair_fusion () {};
> >>>>>
> >>>>> This ctor looks pointless, the default one should do, so can we remove
> >>>>> this?
> >>>>>
> >>>>
> >>>> Addressed.
> >>>>>> +  bool fpsimd_op_p (rtx reg_op, machine_mode mem_mode,
> >>>>>> +		    bool load_p) override final
> >>>>>> +  {
> >>>>>> +    const bool fpsimd_op_p
> >>>>>> +      = reload_completed
> >>>>>> +      ? (REG_P (reg_op) && FP_REGNUM_P (REGNO (reg_op)))
> >>>>>> +      : (GET_MODE_CLASS (mem_mode) != MODE_INT
> >>>>>> +	 && (load_p || !aarch64_const_zero_rtx_p (reg_op)));
> >>>>>> +    return fpsimd_op_p;
> >>>>>
> >>>>> Can this just directly return the boolean expression instead of storing
> >>>>> it into a variable first?
> >>>>>
> >>>> Addressed.
> >>>>>> +  }
> >>>>>> +
> >>>>>> +  bool pair_mem_promote_writeback_p (rtx pat)
> >>>>>> +  {
> >>>>>> +     if (reload_completed
> >>>>>> +	 && aarch64_ldp_writeback > 1
> >>>>>> +	 && GET_CODE (pat) == PARALLEL
> >>>>>> +	 && XVECLEN (pat, 0) == 2)
> >>>>>> +	return true;
> >>>>>> +
> >>>>>> +     return false;
> >>>>>> +  }
> >>>>>> +
> >>>>>> +  bool pair_mem_ok_with_policy (rtx first_mem, bool load_p,
> >>>>>> +				machine_mode mode)
> >>>>>> +  {
> >>>>>> +    return aarch64_mem_ok_with_ldpstp_policy_model (first_mem,
> >>>>>> +						     load_p,
> >>>>>> +						     mode);
> >>>>>> +  }
> >>>>>> +  bool pair_operand_mode_ok_p (machine_mode mode);
> >>>>>
> >>>>> Why have you implemented this (one-line function) out-of-line but the
> >>>>> others are implemented inline?  Please be consistent.
> >>>>>
> >>>>
> >>>> Addressed.
> >>>>>> +
> >>>>>> +  rtx gen_mem_pair (rtx *pats, rtx writeback, bool load_p);
> >>>>>> +
> >>>>>> +  bool pair_trailing_writeback_p  ()
> >>>>>> +  {
> >>>>>> +    return aarch64_ldp_writeback > 1;
> >>>>>> +  }
> >>>>>> +  bool pair_reg_operand_ok_p (bool load_p, rtx reg_op,
> >>>>>> +			      machine_mode mem_mode)
> >>>>>> +  {
> >>>>>> +    return (load_p
> >>>>>> +	     ? aarch64_ldp_reg_operand (reg_op, mem_mode)
> >>>>>> +	     : aarch64_stp_reg_operand (reg_op, mem_mode));
> >>>>>> +  }
> >>>>>> +  int pair_mem_alias_check_limit ()
> >>>>>> +  {
> >>>>>> +    return aarch64_ldp_alias_check_limit;
> >>>>>> +  }
> >>>>>> +  bool handle_writeback_opportunities ()
> >>>>>> +  {
> >>>>>> +    return aarch64_ldp_writeback;
> >>>>>> +  }
> >>>>>> +  bool track_load_p ()
> >>>>>> +  {
> >>>>>> +    const bool track_loads
> >>>>>> +      = aarch64_tune_params.ldp_policy_model != AARCH64_LDP_STP_POLICY_NEVER;
> >>>>>> +    return track_loads;
> >>>>>> +  }
> >>>>>
> >>>>
> >>>> Addressed.
> >>>>> As above, can this just return the boolean expression directly without
> >>>>> the temporary?
> >>>>>
> >>>>>> +  bool track_store_p ()
> >>>>>> +  {
> >>>>>> +    const bool track_stores
> >>>>>> +      = aarch64_tune_params.stp_policy_model != AARCH64_LDP_STP_POLICY_NEVER;
> >>>>>> +    return track_stores;
> >>>>>> +  }
> >>>>>
> >>>>
> >>>> Addressed.
> >>>>> Likewise.
> >>>>>
> >>>>>> +  bool cand_insns_empty_p (std::list<insn_info *> &insns)
> >>>>>> +  {
> >>>>>> +    return insns.empty();
> >>>>>> +  }
> >>>>>
> >>>>> As mentioned on the declaration, I don't see why this needs to be
> >>>>> virtualised.
> >>>>>
> >>>>>> +  bool pair_mem_in_range_p (HOST_WIDE_INT off)
> >>>>>> +  {
> >>>>>> +    return (off < LDP_MIN_IMM || off > LDP_MAX_IMM);
> >>>>>> +  }
> >>>>>
> >>>>> Hmm, this looks to test exactly the opposite condition from what the
> >>>>> name suggests, i.e. the implementation looks like what I would expect
> >>>>> for a function called pair_mem_out_of_range_p.
> >>>>>
> >>>>> Please can you invert the condition and fix up callers appropriately?
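> >>>>>
> >>>>> i.e. (sketch):
> >>>>>
> >>>>>   bool pair_mem_in_range_p (HOST_WIDE_INT off)
> >>>>>   {
> >>>>>     return off >= LDP_MIN_IMM && off <= LDP_MAX_IMM;
> >>>>>   }
> >>>>>
> >>>>> so that callers then punt on !pair_mem_in_range_p (...).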
> >>>>>
> >>>>> This suggests you're not really thinking about the code you're writing,
> >>>>> please try to think a bit more carefully before posting patches.
> >>>>>
> >>>>
> >>>> Sorry for my mistake; I had overlooked it.
> >>>>>> +};
> >>>>>> +
> >>>>>> +
> >>>>>>  // State used by the pass for a given basic block.
> >>>>>> -struct ldp_bb_info
> >>>>>> +struct pair_fusion_bb_info
> >>>>>>  {
> >>>>>>    using def_hash = nofree_ptr_hash<def_info>;
> >>>>>>    using expr_key_t = pair_hash<tree_operand_hash, int_hash<int, -1, -2>>;
> >>>>>> @@ -160,14 +274,17 @@ struct ldp_bb_info
> >>>>>>  
> >>>>>>    static const size_t obstack_alignment = sizeof (void *);
> >>>>>>    bb_info *m_bb;
> >>>>>> +  pair_fusion *bb_state;
> >>>>>
> >>>>
> >>>> Addressed.
> >>>>
> >>>>> Can we call this m_pass instead of bb_state?  The pair_fusion class is
> >>>>> explicitly at the pass level rather than the bb level.
> >>>>>
> >>>>> Also I think this (and indeed m_bb, although that's pre-existing) could
> >>>>> be private members.  Not sure there's a good reason for making them
> >>>>> public.
> >>>>>
> >>>>
> >>>> Addressed.
> >>>>  
> >>>>> It might be slightly nicer to use a reference for m_pass unless there's
> >>>>> a good reason to use a pointer, although I don't feel strongly about
> >>>>> this.
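> >>>>>
> >>>>> e.g. with the reference variant (sketch):
> >>>>>
> >>>>>   pair_fusion_bb_info (bb_info *bb, pair_fusion &pass)
> >>>>>     : m_bb (bb), m_pass (pass), m_emitted_tombstone (false)
> >>>>>   {
> >>>>>     // Obstack setup as before.
> >>>>>   }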
> >>>>>
> >>>>>>  
> >>>>>> -  ldp_bb_info (bb_info *bb) : m_bb (bb), m_emitted_tombstone (false)
> >>>>>> +  pair_fusion_bb_info (bb_info *bb,
> >>>>>> +		      aarch64_pair_fusion *d) : m_bb (bb),
> >>>>>
> >>>>> This generic class needs to take the generic pass type, i.e. just a
> >>>>> pair_fusion * rather than an aarch64_pair_fusion *.
> >>>>>
> >>>>>> +		      bb_state (d), m_emitted_tombstone (false)
> >>>>>>    {
> >>>>>>      obstack_specify_allocation (&m_obstack, OBSTACK_CHUNK_SIZE,
> >>>>>>  				obstack_alignment, obstack_chunk_alloc,
> >>>>>>  				obstack_chunk_free);
> >>>>>>    }
> >>>>>> -  ~ldp_bb_info ()
> >>>>>> +  ~pair_fusion_bb_info ()
> >>>>>>    {
> >>>>>>      obstack_free (&m_obstack, nullptr);
> >>>>>>  
> >>>>>> @@ -177,10 +294,32 @@ struct ldp_bb_info
> >>>>>>  	bitmap_obstack_release (&m_bitmap_obstack);
> >>>>>>        }
> >>>>>>    }
> >>>>>> +  void track_access (insn_info *, bool load, rtx mem);
> >>>>>> +  void transform ();
> >>>>>> +  void cleanup_tombstones ();
> >>>>>
> >>>>> I think most (or indeed all?) of the changes here shouldn't be needed.
> >>>>> AFAICS, there shouldn't be a need to:
> >>>>>  - drop inline from these functions.
> >>>>>  - change the accessibility of any members.
> >>>>>
> >>>>
> >>>> Addressed.
> >>>>> Plese can you drop these changes or explain why they're needed (from
> >>>>> here until the end of the class).
> >>>>>
> >>>>>> +  void merge_pairs (insn_list_t &, insn_list_t &,
> >>>>>> +		    bool load_p, unsigned access_size);
> >>>>>> +  void transform_for_base (int load_size, access_group &group);
> >>>>>> +
> >>>>>> +  bool try_fuse_pair (bool load_p, unsigned access_size,
> >>>>>> +			     insn_info *i1, insn_info *i2);
> >>>>>> +
> >>>>>> +  bool fuse_pair (bool load_p, unsigned access_size,
> >>>>>> +		  int writeback,
> >>>>>> +		  insn_info *i1, insn_info *i2,
> >>>>>> +		  base_cand &base,
> >>>>>> +		  const insn_range_info &move_range);
> >>>>>>  
> >>>>>> -  inline void track_access (insn_info *, bool load, rtx mem);
> >>>>>> -  inline void transform ();
> >>>>>> -  inline void cleanup_tombstones ();
> >>>>>> +  void do_alias_analysis (insn_info *alias_hazards[4],
> >>>>>> +			  alias_walker *walkers[4],
> >>>>>> +			  bool load_p);
> >>>>>> +
> >>>>>> +  void track_tombstone (int uid);
> >>>>>> +
> >>>>>> +  bool track_via_mem_expr (insn_info *, rtx mem, lfs_fields lfs);
> >>>>>> +
> >>>>>> +  template<typename Map>
> >>>>>> +    void traverse_base_map (Map &map);
> >>>>>>  
> >>>>>>  private:
> >>>>>>    obstack m_obstack;
> >>>>>> @@ -191,30 +330,32 @@ private:
> >>>>>>    bool m_emitted_tombstone;
> >>>>>>  
> >>>>>>    inline splay_tree_node<access_record *> *node_alloc (access_record *);
> >>>>>> +};
> >>>>>>  
> >>>>>> -  template<typename Map>
> >>>>>> -  inline void traverse_base_map (Map &map);
> >>>>>> -  inline void transform_for_base (int load_size, access_group &group);
> >>>>>> -
> >>>>>> -  inline void merge_pairs (insn_list_t &, insn_list_t &,
> >>>>>> -			   bool load_p, unsigned access_size);
> >>>>>> -
> >>>>>> -  inline bool try_fuse_pair (bool load_p, unsigned access_size,
> >>>>>> -			     insn_info *i1, insn_info *i2);
> >>>>>> -
> >>>>>> -  inline bool fuse_pair (bool load_p, unsigned access_size,
> >>>>>> -			 int writeback,
> >>>>>> -			 insn_info *i1, insn_info *i2,
> >>>>>> -			 base_cand &base,
> >>>>>> -			 const insn_range_info &move_range);
> >>>>>> -
> >>>>>> -  inline void track_tombstone (int uid);
> >>>>>> +rtx aarch64_pair_fusion::gen_mem_pair (rtx *pats,
> >>>>>
> >>>>> Style nit: newline after rtx.
> >>>>>
> >>>> Addressed.
> >>>>
> >>>>>> +		  rtx writeback,
> >>>>>> +		  bool load_p)
> >>>>>> +  {
> >>>>>> +    rtx pair_pat;
> >>>>>>  
> >>>>>> -  inline bool track_via_mem_expr (insn_info *, rtx mem, lfs_fields lfs);
> >>>>>> -};
> >>>>>> +    if (writeback)
> >>>>>> +      {
> >>>>>> +	auto patvec = gen_rtvec (3, writeback, pats[0], pats[1]);
> >>>>>> +	pair_pat = gen_rtx_PARALLEL (VOIDmode, patvec);
> >>>>>> +      }
> >>>>>> +    else if (load_p)
> >>>>>> +      pair_pat = aarch64_gen_load_pair (XEXP (pats[0], 0),
> >>>>>> +					XEXP (pats[1], 0),
> >>>>>> +					XEXP (pats[0], 1));
> >>>>>> +    else
> >>>>>> +      pair_pat = aarch64_gen_store_pair (XEXP (pats[0], 0),
> >>>>>> +					 XEXP (pats[0], 1),
> >>>>>> +					 XEXP (pats[1], 1));
> >>>>>> +     return pair_pat;
> >>>>>
> >>>>> Can the assignments just be direct returns here, please?  Then
> >>>>> get rid of the temporary.
> >>>>>
> >>>>
> >>>> Addressed.
> >>>>  
> >>>>>> +  }
> >>>>>>  
> >>>>>>  splay_tree_node<access_record *> *
> >>>>>> -ldp_bb_info::node_alloc (access_record *access)
> >>>>>> +pair_fusion_bb_info::node_alloc (access_record *access)
> >>>>>>  {
> >>>>>>    using T = splay_tree_node<access_record *>;
> >>>>>>    void *addr = obstack_alloc (&m_obstack, sizeof (T));
> >>>>>> @@ -262,7 +403,7 @@ drop_writeback (rtx mem)
> >>>>>>  // RTX_AUTOINC addresses.  The interface is like strip_offset except we take a
> >>>>>>  // MEM so that we know the mode of the access.
> >>>>>>  static rtx
> >>>>>> -ldp_strip_offset (rtx mem, poly_int64 *offset)
> >>>>>> +pair_mem_strip_offset (rtx mem, poly_int64 *offset)
> >>>>>
> >>>>> I thought we discussed splitting the renaming into a separate patch?
> >>>>>
> >>>>>>  {
> >>>>>>    rtx addr = XEXP (mem, 0);
> >>>>>>  
> >>>>>> @@ -332,6 +473,12 @@ ldp_operand_mode_ok_p (machine_mode mode)
> >>>>>>    return reload_completed || mode != TImode;
> >>>>>>  }
> >>>>>>  
> >>>>>> +bool
> >>>>>> +aarch64_pair_fusion::pair_operand_mode_ok_p (machine_mode mode)
> >>>>>> +{
> >>>>>> +  return ldp_operand_mode_ok_p (mode);
> >>>>>> +}
> >>>>>> +
> >>>>>>  // Given LFS (load_p, fpsimd_p, size) fields in FIELDS, encode these
> >>>>>>  // into an integer for use as a hash table key.
> >>>>>>  static int
> >>>>>> @@ -396,7 +543,7 @@ access_group::track (Alloc alloc_node, poly_int64 offset, insn_info *insn)
> >>>>>>  // MEM_EXPR base (i.e. a tree decl) relative to which we can track the access.
> >>>>>>  // LFS is used as part of the key to the hash table, see track_access.
> >>>>>>  bool
> >>>>>> -ldp_bb_info::track_via_mem_expr (insn_info *insn, rtx mem, lfs_fields lfs)
> >>>>>> +pair_fusion_bb_info::track_via_mem_expr (insn_info *insn, rtx mem, lfs_fields lfs)
> >>>>>
> >>>>> Again, please do any renaming in a separate patch.
> >>>>>
> >>>>
> >>>> Addressed.
> >>>>>>  {
> >>>>>>    if (!MEM_EXPR (mem) || !MEM_OFFSET_KNOWN_P (mem))
> >>>>>>      return false;
> >>>>>> @@ -412,9 +559,10 @@ ldp_bb_info::track_via_mem_expr (insn_info *insn, rtx mem, lfs_fields lfs)
> >>>>>>    const machine_mode mem_mode = GET_MODE (mem);
> >>>>>>    const HOST_WIDE_INT mem_size = GET_MODE_SIZE (mem_mode).to_constant ();
> >>>>>>  
> >>>>>> -  // Punt on misaligned offsets.  LDP/STP instructions require offsets to be a
> >>>>>> -  // multiple of the access size, and we believe that misaligned offsets on
> >>>>>> -  // MEM_EXPR bases are likely to lead to misaligned offsets w.r.t. RTL bases.
> >>>>>> +  // Punt on misaligned offsets.  Paired memory access instructions require
> >>>>>> +  // offsets to be a multiple of the access size, and we believe that
> >>>>>> +  // misaligned offsets on MEM_EXPR bases are likely to lead to misaligned
> >>>>>> +  // offsets w.r.t. RTL bases.
> >>>>>>    if (!multiple_p (offset, mem_size))
> >>>>>>      return false;
> >>>>>>  
> >>>>>> @@ -438,46 +586,38 @@ ldp_bb_info::track_via_mem_expr (insn_info *insn, rtx mem, lfs_fields lfs)
> >>>>>>  }
> >>>>>>  
> >>>>>>  // Main function to begin pair discovery.  Given a memory access INSN,
> >>>>>> -// determine whether it could be a candidate for fusing into an ldp/stp,
> >>>>>> +// determine whether it could be a candidate for fusing into an pair mem,
> >>>>>
> >>>>> s/an pair mem/a paired access/
> >>>>>
> >>>>>>  // and if so, track it in the appropriate data structure for this basic
> >>>>>>  // block.  LOAD_P is true if the access is a load, and MEM is the mem
> >>>>>>  // rtx that occurs in INSN.
> >>>>>>  void
> >>>>>> -ldp_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
> >>>>>> +pair_fusion_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
> >>>>>>  {
> >>>>>>    // We can't combine volatile MEMs, so punt on these.
> >>>>>>    if (MEM_VOLATILE_P (mem))
> >>>>>>      return;
> >>>>>>  
> >>>>>> -  // Ignore writeback accesses if the param says to do so.
> >>>>>> -  if (!aarch64_ldp_writeback
> >>>>>> +  // Ignore writeback accesses if the param says to do so
> >>>>>
> >>>>
> >>>> Addressed.
> >>>>> This comment needs updating, I guess something like "if the hook says to
> >>>>> do so".  But also please don't delete the full stop.
> >>>>>
> >>>>>> +  if (!bb_state->handle_writeback_opportunities ()
> >>>>>>        && GET_RTX_CLASS (GET_CODE (XEXP (mem, 0))) == RTX_AUTOINC)
> >>>>>>      return;
> >>>>>>  
> >>>>>>    const machine_mode mem_mode = GET_MODE (mem);
> >>>>>> -  if (!ldp_operand_mode_ok_p (mem_mode))
> >>>>>> +
> >>>>>
> >>>>> I don't think we need the extra whitespace here.
> >>>>>
> >>>>>> +  if (!bb_state->pair_operand_mode_ok_p (mem_mode))
> >>>>>>      return;
> >>>>>>  
> >>>>>>    rtx reg_op = XEXP (PATTERN (insn->rtl ()), !load_p);
> >>>>>>  
> >>>>>> -  // Ignore the access if the register operand isn't suitable for ldp/stp.
> >>>>>> -  if (load_p
> >>>>>> -      ? !aarch64_ldp_reg_operand (reg_op, mem_mode)
> >>>>>> -      : !aarch64_stp_reg_operand (reg_op, mem_mode))
> >>>>>> +  if (!bb_state->pair_reg_operand_ok_p (load_p, reg_op, mem_mode))
> >>>>>>      return;
> >>>>>> -
> >>>>>>    // We want to segregate FP/SIMD accesses from GPR accesses.
> >>>>>>    //
> >>>>>>    // Before RA, we use the modes, noting that stores of constant zero
> >>>>>>    // operands use GPRs (even in non-integer modes).  After RA, we use
> >>>>>>    // the hard register numbers.
> >>>>>> -  const bool fpsimd_op_p
> >>>>>> -    = reload_completed
> >>>>>> -    ? (REG_P (reg_op) && FP_REGNUM_P (REGNO (reg_op)))
> >>>>>> -    : (GET_MODE_CLASS (mem_mode) != MODE_INT
> >>>>>> -       && (load_p || !aarch64_const_zero_rtx_p (reg_op)));
> >>>>>> -
> >>>>>> -  // Note ldp_operand_mode_ok_p already rejected VL modes.
> >>>>>> +  const bool fpsimd_op_p = bb_state->fpsimd_op_p (reg_op, mem_mode, load_p);
> >>>>>> +  // Note pair_operand_mode_ok_p already rejected VL modes.
> >>>>>>    const HOST_WIDE_INT mem_size = GET_MODE_SIZE (mem_mode).to_constant ();
> >>>>>>    const lfs_fields lfs = { load_p, fpsimd_op_p, mem_size };
> >>>>>>  
> >>>>>> @@ -487,7 +627,7 @@ ldp_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
> >>>>>>    poly_int64 mem_off;
> >>>>>>    rtx addr = XEXP (mem, 0);
> >>>>>>    const bool autoinc_p = GET_RTX_CLASS (GET_CODE (addr)) == RTX_AUTOINC;
> >>>>>> -  rtx base = ldp_strip_offset (mem, &mem_off);
> >>>>>> +  rtx base = pair_mem_strip_offset (mem, &mem_off);
> >>>>>
> >>>>> Again, let's split the renaming off into a separate patch.
> >>>>>
> >>>>
> >>>> Addressed.
> >>>>>>    if (!REG_P (base))
> >>>>>>      return;
> >>>>>>  
> >>>>>> @@ -506,8 +646,8 @@ ldp_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
> >>>>>>    // elimination offset pre-RA, we should postpone forming pairs on such
> >>>>>>    // accesses until after RA.
> >>>>>>    //
> >>>>>> -  // As it stands, addresses with offsets in range for LDR but not
> >>>>>> -  // in range for LDP/STP are currently reloaded inefficiently,
> >>>>>> +  // As it stands, addresses in range for an individual load/store but not
> >>>>>> +  // for a paired access are currently reloaded inefficiently,
> >>>>>>    // ending up with a separate base register for each pair.
> >>>>>>    //
> >>>>>>    // In theory LRA should make use of
> >>>>>> @@ -519,8 +659,8 @@ ldp_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
> >>>>>>    // that calls targetm.legitimize_address_displacement.
> >>>>>>    //
> >>>>>>    // So for now, it's better to punt when we can't be sure that the
> >>>>>> -  // offset is in range for LDP/STP.  Out-of-range cases can then be
> >>>>>> -  // handled after RA by the out-of-range LDP/STP peepholes.  Eventually, it
> >>>>>> +  // offset is in range for a paired access.  Out-of-range cases can then be
> >>>>>> +  // handled after RA by the out-of-range pair mem peepholes.  Eventually, it
> >>>>>>    // would be nice to handle known out-of-range opportunities in the
> >>>>>>    // pass itself (for stack accesses, this would be in the post-RA pass).
> >>>>>>    if (!reload_completed
> >>>>>> @@ -573,8 +713,8 @@ ldp_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
> >>>>>>  	gcc_unreachable (); // Base defs should be unique.
> >>>>>>      }
> >>>>>>  
> >>>>>> -  // Punt on misaligned offsets.  LDP/STP require offsets to be a multiple of
> >>>>>> -  // the access size.
> >>>>>> +  // Punt on misaligned offsets.  Paired memory accesses require offsets
> >>>>>> +  // to be a multiple of the access size.
> >>>>>>    if (!multiple_p (mem_off, mem_size))
> >>>>>>      return;
> >>>>>>  
> >>>>>> @@ -614,7 +754,7 @@ static bool no_ignore (insn_info *) { return false; }
> >>>>>>  // making use of alias disambiguation.
> >>>>>>  static insn_info *
> >>>>>>  latest_hazard_before (insn_info *insn, rtx *ignore,
> >>>>>> -		      insn_info *ignore_insn = nullptr)
> >>>>>> +		      insn_info *ignore_insn = 0)
> >>>>>
> >>>>> Is this change needed?
> >>>>>
> >>>>
> >>>> Addressed.
> >>>>>>  {
> >>>>>>    insn_info *result = nullptr;
> >>>>>>  
> >>>>>> @@ -1150,7 +1290,7 @@ extract_writebacks (bool load_p, rtx pats[2], int changed)
> >>>>>>        const bool autoinc_p = GET_RTX_CLASS (GET_CODE (addr)) == RTX_AUTOINC;
> >>>>>>  
> >>>>>>        poly_int64 offset;
> >>>>>> -      rtx this_base = ldp_strip_offset (mem, &offset);
> >>>>>> +      rtx this_base = pair_mem_strip_offset (mem, &offset);
> >>>>>>        gcc_assert (REG_P (this_base));
> >>>>>>        if (base_reg)
> >>>>>>  	gcc_assert (rtx_equal_p (base_reg, this_base));
> >>>>>> @@ -1286,7 +1426,11 @@ find_trailing_add (insn_info *insns[2],
> >>>>>>  
> >>>>>>    off_hwi /= access_size;
> >>>>>>  
> >>>>>> -  if (off_hwi < LDP_MIN_IMM || off_hwi > LDP_MAX_IMM)
> >>>>>> +  pair_fusion *pfuse;
> >>>>>> +  aarch64_pair_fusion derived;
> >>>>>> +  pfuse = &derived;
> >>>>>
> >>>>> Huh? We'll want to use the same instance of the pass structure (pair_fusion)
> >>>>> that gets created at the top level, so that should get passed around everywhere.
> >>>>>
> >>>>> In the case of find_trailing_add we probably don't want to add any more
> >>>>> parameters (as Segher noted there are already too many), so perhaps the
> >>>>> easiest thing would be to make find_trailing_add itself a (non-virtual)
> >>>>> member function of pair_fusion.
> >>>>>
> >>>>> Then here we can just query pair_mem_in_range_p directly.
> >>>>>
> >>>>> For try_promote_writeback itself you can either make it a non-virtual
> >>>>> member function of pair_fusion or just pass a reference to the pass
> >>>>> object in when you call it.
> >>>>>
> >>>>> Note also that you don't have to create a pointer of the parent type in
> >>>>> order to call a virtual function on a derived object, you can just call
> >>>>> it directly.
> >>>>>
> >>>>
> >>>> Addressed.
> >>>>>> +
> >>>>>> +  if (pfuse->pair_mem_in_range_p (off_hwi))
> >>>>>>      return nullptr;
> >>>>>
> >>>>> Note of course this condition needs inverting when you invert the
> >>>>> implementation.
> >>>>>
> >>>>
> >>>> Addressed.
> >>>>>>  
> >>>>>>    auto dump_prefix = [&]()
> >>>>>> @@ -1328,7 +1472,7 @@ find_trailing_add (insn_info *insns[2],
> >>>>>>  // We just emitted a tombstone with uid UID, track it in a bitmap for
> >>>>>>  // this BB so we can easily identify it later when cleaning up tombstones.
> >>>>>>  void
> >>>>>> -ldp_bb_info::track_tombstone (int uid)
> >>>>>> +pair_fusion_bb_info::track_tombstone (int uid)
> >>>>>
> >>>>> Again, please split the renaming off.  Same with other instances below.
> >>>>>
> >>>>>>  {
> >>>>>>    if (!m_emitted_tombstone)
> >>>>>>      {
> >>>>>> @@ -1528,7 +1672,7 @@ fixup_debug_uses (obstack_watermark &attempt,
> >>>>>>  	  gcc_checking_assert (GET_RTX_CLASS (GET_CODE (XEXP (mem, 0)))
> >>>>>>  			       == RTX_AUTOINC);
> >>>>>>  
> >>>>>> -	  base = ldp_strip_offset (mem, &offset);
> >>>>>> +	  base = pair_mem_strip_offset (mem, &offset);
> >>>>>>  	  gcc_checking_assert (REG_P (base) && REGNO (base) == base_regno);
> >>>>>>  	}
> >>>>>>        fixup_debug_use (attempt, use, def, base, offset);
> >>>>>> @@ -1664,7 +1808,7 @@ fixup_debug_uses (obstack_watermark &attempt,
> >>>>>>  // BASE gives the chosen base candidate for the pair and MOVE_RANGE is
> >>>>>>  // a singleton range which says where to place the pair.
> >>>>>>  bool
> >>>>>> -ldp_bb_info::fuse_pair (bool load_p,
> >>>>>> +pair_fusion_bb_info::fuse_pair (bool load_p,
> >>>>>>  			unsigned access_size,
> >>>>>>  			int writeback,
> >>>>>>  			insn_info *i1, insn_info *i2,
> >>>>>> @@ -1800,7 +1944,7 @@ ldp_bb_info::fuse_pair (bool load_p,
> >>>>>>      {
> >>>>>>        if (dump_file)
> >>>>>>  	fprintf (dump_file,
> >>>>>> -		 "  ldp: i%d has wb but subsequent i%d has non-wb "
> >>>>>> +		 "  load pair: i%d has wb but subsequent i%d has non-wb "
> >>>>>>  		 "update of base (r%d), dropping wb\n",
> >>>>>>  		 insns[0]->uid (), insns[1]->uid (), base_regno);
> >>>>>>        gcc_assert (writeback_effect);
> >>>>>> @@ -1823,7 +1967,7 @@ ldp_bb_info::fuse_pair (bool load_p,
> >>>>>>      }
> >>>>>>  
> >>>>>>    // If either of the original insns had writeback, but the resulting pair insn
> >>>>>> -  // does not (can happen e.g. in the ldp edge case above, or if the writeback
> >>>>>> +  // does not (can happen e.g. in the load pair edge case above, or if the writeback
> >>>>>>    // effects cancel out), then drop the def(s) of the base register as
> >>>>>>    // appropriate.
> >>>>>>    //
> >>>>>> @@ -1842,7 +1986,7 @@ ldp_bb_info::fuse_pair (bool load_p,
> >>>>>>    // update of the base register and try and fold it in to make this into a
> >>>>>>    // writeback pair.
> >>>>>>    insn_info *trailing_add = nullptr;
> >>>>>> -  if (aarch64_ldp_writeback > 1
> >>>>>> +  if (bb_state->pair_trailing_writeback_p ()
> >>>>>>        && !writeback_effect
> >>>>>>        && (!load_p || (!refers_to_regno_p (base_regno, base_regno + 1,
> >>>>>>  					 XEXP (pats[0], 0), nullptr)
> >>>>>> @@ -1863,14 +2007,14 @@ ldp_bb_info::fuse_pair (bool load_p,
> >>>>>>      }
> >>>>>>  
> >>>>>>    // Now that we know what base mem we're going to use, check if it's OK
> >>>>>> -  // with the ldp/stp policy.
> >>>>>> +  // with the pair mem policy.
> >>>>>>    rtx first_mem = XEXP (pats[0], load_p);
> >>>>>> -  if (!aarch64_mem_ok_with_ldpstp_policy_model (first_mem,
> >>>>>> -						load_p,
> >>>>>> -						GET_MODE (first_mem)))
> >>>>>> +  if (!bb_state->pair_mem_ok_with_policy (first_mem,
> >>>>>> +			  load_p,
> >>>>>> +			  GET_MODE (first_mem)))
> >>>>>>      {
> >>>>>>        if (dump_file)
> >>>>>> -	fprintf (dump_file, "punting on pair (%d,%d), ldp/stp policy says no\n",
> >>>>>> +	fprintf (dump_file, "punting on pair (%d,%d), pair mem policy says no\n",
> >>>>>>  		 i1->uid (), i2->uid ());
> >>>>>>        return false;
> >>>>>>      }
> >>>>>> @@ -1878,21 +2022,10 @@ ldp_bb_info::fuse_pair (bool load_p,
> >>>>>>    rtx reg_notes = combine_reg_notes (first, second, load_p);
> >>>>>>  
> >>>>>>    rtx pair_pat;
> >>>>>> -  if (writeback_effect)
> >>>>>> -    {
> >>>>>> -      auto patvec = gen_rtvec (3, writeback_effect, pats[0], pats[1]);
> >>>>>> -      pair_pat = gen_rtx_PARALLEL (VOIDmode, patvec);
> >>>>>> -    }
> >>>>>> -  else if (load_p)
> >>>>>> -    pair_pat = aarch64_gen_load_pair (XEXP (pats[0], 0),
> >>>>>> -				      XEXP (pats[1], 0),
> >>>>>> -				      XEXP (pats[0], 1));
> >>>>>> -  else
> >>>>>> -    pair_pat = aarch64_gen_store_pair (XEXP (pats[0], 0),
> >>>>>> -				       XEXP (pats[0], 1),
> >>>>>> -				       XEXP (pats[1], 1));
> >>>>>>  
> >>>>>> +  pair_pat = bb_state->gen_mem_pair (pats, writeback_effect, load_p);
> >>>>>>    insn_change *pair_change = nullptr;
> >>>>>> +
> >>>>>>    auto set_pair_pat = [pair_pat,reg_notes](insn_change *change) {
> >>>>>>      rtx_insn *rti = change->insn ()->rtl ();
> >>>>>>      validate_unshare_change (rti, &PATTERN (rti), pair_pat, true);
> >>>>>> @@ -2133,15 +2266,6 @@ load_modified_by_store_p (insn_info *load,
> >>>>>>    return false;
> >>>>>>  }
> >>>>>>  
> >>>>>> -// Virtual base class for load/store walkers used in alias analysis.
> >>>>>> -struct alias_walker
> >>>>>> -{
> >>>>>> -  virtual bool conflict_p (int &budget) const = 0;
> >>>>>> -  virtual insn_info *insn () const = 0;
> >>>>>> -  virtual bool valid () const  = 0;
> >>>>>> -  virtual void advance () = 0;
> >>>>>> -};
> >>>>>> -
> >>>>>>  // Implement some common functionality used by both store_walker
> >>>>>>  // and load_walker.
> >>>>>>  template<bool reverse>
> >>>>>> @@ -2259,13 +2383,13 @@ public:
> >>>>>>  //
> >>>>>>  // We try to maintain the invariant that if a walker becomes invalid, we
> >>>>>>  // set its pointer to null.
> >>>>>> -static void
> >>>>>> -do_alias_analysis (insn_info *alias_hazards[4],
> >>>>>> +void
> >>>>>> +pair_fusion_bb_info::do_alias_analysis (insn_info *alias_hazards[4],
> >>>>>
> >>>>> It would probably make more sense for this to be a (non-virtual) member
> >>>>> function of the pair_fusion class, since it doesn't need to access the
> >>>>> BB-specific state.
> >>>>>
> >>>>
> >>>> Addressed.
> >>>>>>  		   alias_walker *walkers[4],
> >>>>>>  		   bool load_p)
> >>>>>>  {
> >>>>>>    const int n_walkers = 2 + (2 * !load_p);
> >>>>>> -  int budget = aarch64_ldp_alias_check_limit;
> >>>>>> +  int budget = bb_state->pair_mem_alias_check_limit ();
> >>>>>>  
> >>>>>>    auto next_walker = [walkers,n_walkers](int current) -> int {
> >>>>>>      for (int j = 1; j <= n_walkers; j++)
> >>>>>> @@ -2365,7 +2489,7 @@ get_viable_bases (insn_info *insns[2],
> >>>>>>      {
> >>>>>>        const bool is_lower = (i == reversed);
> >>>>>>        poly_int64 poly_off;
> >>>>>> -      rtx base = ldp_strip_offset (cand_mems[i], &poly_off);
> >>>>>> +      rtx base = pair_mem_strip_offset (cand_mems[i], &poly_off);
> >>>>>>        if (GET_RTX_CLASS (GET_CODE (XEXP (cand_mems[i], 0))) == RTX_AUTOINC)
> >>>>>>  	writeback |= (1 << i);
> >>>>>>  
> >>>>>> @@ -2373,7 +2497,7 @@ get_viable_bases (insn_info *insns[2],
> >>>>>>  	continue;
> >>>>>>  
> >>>>>>        // Punt on accesses relative to eliminable regs.  See the comment in
> >>>>>> -      // ldp_bb_info::track_access for a detailed explanation of this.
> >>>>>> +      // pair_fusion_bb_info::track_access for a detailed explanation of this.
> >>>>>>        if (!reload_completed
> >>>>>>  	  && (REGNO (base) == FRAME_POINTER_REGNUM
> >>>>>>  	      || REGNO (base) == ARG_POINTER_REGNUM))
> >>>>>> @@ -2397,7 +2521,11 @@ get_viable_bases (insn_info *insns[2],
> >>>>>>        if (!is_lower)
> >>>>>>  	base_off--;
> >>>>>>  
> >>>>>> -      if (base_off < LDP_MIN_IMM || base_off > LDP_MAX_IMM)
> >>>>>> +      pair_fusion *pfuse;
> >>>>>> +      aarch64_pair_fusion derived;
> >>>>>> +      pfuse = &derived;
> >>>>>
> >>>>> Again, please don't do this but instead use the same instance
> >>>>> everywhere.
> >>>>>
> >>>>
> >>>> Addressed.
> >>>>>> +
> >>>>>> +      if (pfuse->pair_mem_in_range_p (base_off))
> >>>>>>  	continue;
> >>>>>>  
> >>>>>>        use_info *use = find_access (insns[i]->uses (), REGNO (base));
> >>>>>> @@ -2454,12 +2582,12 @@ get_viable_bases (insn_info *insns[2],
> >>>>>>  }
> >>>>>>  
> >>>>>>  // Given two adjacent memory accesses of the same size, I1 and I2, try
> >>>>>> -// and see if we can merge them into a ldp or stp.
> >>>>>> +// and see if we can merge them into a paired accesses load and store.
> >>>>>
> >>>>> "into a load pair or store pair" or "into a paired access" if you
> >>>>> prefer.
> >>>>>
> >>>>
> >>>> Addressed.
> >>>>>>  //
> >>>>>>  // ACCESS_SIZE gives the (common) size of a single access, LOAD_P is true
> >>>>>>  // if the accesses are both loads, otherwise they are both stores.
> >>>>>>  bool
> >>>>>> -ldp_bb_info::try_fuse_pair (bool load_p, unsigned access_size,
> >>>>>> +pair_fusion_bb_info::try_fuse_pair (bool load_p, unsigned access_size,
> >>>>>>  			    insn_info *i1, insn_info *i2)
> >>>>>>  {
> >>>>>>    if (dump_file)
> >>>>>> @@ -2494,7 +2622,7 @@ ldp_bb_info::try_fuse_pair (bool load_p, unsigned access_size,
> >>>>>>      {
> >>>>>>        if (dump_file)
> >>>>>>  	fprintf (dump_file,
> >>>>>> -		 "punting on ldp due to reg conflcits (%d,%d)\n",
> >>>>>> +		 "punting on pair mem load due to reg conflicts (%d,%d)\n",
> >>>>>>  		 insns[0]->uid (), insns[1]->uid ());
> >>>>>>        return false;
> >>>>>>      }
> >>>>>> @@ -2843,7 +2971,7 @@ debug (const insn_list_t &l)
> >>>>>>  // we can't re-order them anyway, so provided earlier passes have cleaned up
> >>>>>>  // redundant loads, we shouldn't miss opportunities by doing this.
> >>>>>>  void
> >>>>>> -ldp_bb_info::merge_pairs (insn_list_t &left_list,
> >>>>>> +pair_fusion_bb_info::merge_pairs (insn_list_t &left_list,
> >>>>>>  			  insn_list_t &right_list,
> >>>>>>  			  bool load_p,
> >>>>>>  			  unsigned access_size)
> >>>>>> @@ -2890,8 +3018,8 @@ ldp_bb_info::merge_pairs (insn_list_t &left_list,
> >>>>>>  // of accesses.  If we find two sets of adjacent accesses, call
> >>>>>>  // merge_pairs.
> >>>>>>  void
> >>>>>> -ldp_bb_info::transform_for_base (int encoded_lfs,
> >>>>>> -				 access_group &group)
> >>>>>> +pair_fusion_bb_info::transform_for_base (int encoded_lfs,
> >>>>>> +					 access_group &group)
> >>>>>>  {
> >>>>>>    const auto lfs = decode_lfs (encoded_lfs);
> >>>>>>    const unsigned access_size = lfs.size;
> >>>>>> @@ -2909,7 +3037,7 @@ ldp_bb_info::transform_for_base (int encoded_lfs,
> >>>>>>  		       access.cand_insns,
> >>>>>>  		       lfs.load_p,
> >>>>>>  		       access_size);
> >>>>>> -	  skip_next = access.cand_insns.empty ();
> >>>>>> +	  skip_next = bb_state->cand_insns_empty_p (access.cand_insns);
> >>>>>
> >>>>> As above, why is this needed?
> >>>>
> >>>> For rs6000 we want to always return true, as load store pairs
> >>>> that are to be merged with 8/16, 16/32, 32/64 occur for rs6000,
> >>>> and we want load store pairs at 8/16 and 32/64.  That's why we
> >>>> want to always return true for rs6000, to skip pairs as above.
> >>>
> >>> Hmm, sorry, I'm not sure I follow.  Are you saying that for rs6000 you have
> >>> load/store pair instructions where the two arms of the access are storing
> >>> operands of different sizes?  Or something else?
> >>>
> >>> As it stands the logic is to skip the next iteration only if we
> >>> exhausted all the candidate insns for the current access.  In the case
> >>> that we didn't exhaust all such candidates, then the idea is that when
> >>> access becomes prev_access, we can attempt to use those candidates as
> >>> the "left-hand side" of a pair in the next iteration since we failed to
> >>> use them as the "right-hand side" of a pair in the current iteration.
> >>> I don't see why you wouldn't want that behaviour.  Please can you
> >>> explain?
> >>>
> >>
> >> In merge_pair we get the 2 load candidates, one load from offset 0 and
> >> the other load from offset 16.  Then in the next iteration we get a load
> >> from offset 16 and the other load from offset 32.  In the next iteration
> >> we get a load from offset 32 and the other load from offset 48.
> >>
> >> For example:
> >>
> >> Currently we get the load candidates as follows.
> >>
> >> pairs:
> >>
> >> load from 0th offset.
> >> load from 16th offset.
> >>
> >> next pairs:
> >>
> >> load from 16th offset.
> >> load from 32nd offset.
> >>
> >> next pairs:
> >>
> >> load from 32nd offset
> >> load from 48th offset.
> >>
> >> Instead in rs6000 we should get:
> >>
> >> pairs:
> >>
> >> load from 0th offset
> >> load from 16th offset.
> >>
> >> next pairs:
> >>
> >> load from 32nd offset
> >> load from 48th offset.
> > 
> > Hmm, so then I guess my question is: why wouldn't you consider merging
> > the pair with offsets (16,32) for rs6000?  Is it because you have a
> > stricter alignment requirement on the base pair offsets (i.e. they have
> > to be a multiple of 32 when the operand size is 16)?  So the pair
> > offsets have to be a multiple of the entire pair size rather than a
> > single operand size?
> 
> We get a load pair at a certain program point with (0,16), and at
> another program point we get a load pair (32,48).
> 
> In the current implementation it takes the load offsets as (0,16),
> (16,32), (32,48).
> 
> But in rs6000 we want the load pairs to be merged at different points
> as (0,16) and (32,48).  For (0,16) we want to replace the loads with
> lxvp at offset 0, and the other loads (32,48) with lxvp at offset 32.
> 
> In the current case it would merge into lxvp at offset 0 and lxvp at
> offset 16, then lxvp at offset 32 and lxvp at offset 48, which is
> incorrect in our case: in the (16,32) case the load at offset 16 will
> not load from an even register and breaks for rs6000.

Sorry, I think I'm still missing something here.  Why does the address offset
affect the parity of the transfer register?  ISTM they needn't be related at
all (and indeed we can't even know the parity of the transfer register before
RA, but perhaps you're only intending on running the pass after RA?)

How do you plan on handling this even-odd requirement for rs6000?

Thanks,
Alex

> 
> lxvp should load from an even register, and the loaded value will
> then be in the even register and even register + 1 (which is odd).
> 
> Thanks & Regards
> Ajit
> > If that is the case then I think it would be better to introduce a
> > virtual function (say pair_offset_alignment_ok_p) that vets the base
> > offset of the pair (prev_access->offset in transform_for_base).  I guess
> > it would also take access_size as a parameter and for aarch64 it should
> > check:
> > 
> >   multiple_p (offset, access_size)
> > 
> > and for rs6000 it could check:
> > 
> >   multiple_p (offset, access_size * 2)
> > 
> > and we would then incorporate a call to that predicate in the else if
> > condition of transform_for_base.
> > 
> > It would have the same limitation whereby we assume that MEM_EXPR offset
> > alignment is a good proxy for RTL offset alignment, but we already
> > depend on that with the multiple_p check in track_via_mem_expr.
> > 
> > Thanks,
> > Alex
> > 
> >>
> >> Thanks & Regards
> >> Ajit
> >>> Thanks,
> >>> Alex
> >>>
> >>>>
> >>>>>
> >>>>>>  	}
> >>>>>>        prev_access = &access;
> >>>>>>      }
> >>>>>> @@ -2919,7 +3047,7 @@ ldp_bb_info::transform_for_base (int encoded_lfs,
> >>>>>>  // and remove all the tombstone insns, being sure to reparent any uses
> >>>>>>  // of mem to previous defs when we do this.
> >>>>>>  void
> >>>>>> -ldp_bb_info::cleanup_tombstones ()
> >>>>>> +pair_fusion_bb_info::cleanup_tombstones ()
> >>>>>>  {
> >>>>>>    // No need to do anything if we didn't emit a tombstone insn for this BB.
> >>>>>>    if (!m_emitted_tombstone)
> >>>>>> @@ -2947,7 +3075,7 @@ ldp_bb_info::cleanup_tombstones ()
> >>>>>>  
> >>>>>>  template<typename Map>
> >>>>>>  void
> >>>>>> -ldp_bb_info::traverse_base_map (Map &map)
> >>>>>> +pair_fusion_bb_info::traverse_base_map (Map &map)
> >>>>>>  {
> >>>>>>    for (auto kv : map)
> >>>>>>      {
> >>>>>> @@ -2958,7 +3086,7 @@ ldp_bb_info::traverse_base_map (Map &map)
> >>>>>>  }
> >>>>>>  
> >>>>>>  void
> >>>>>> -ldp_bb_info::transform ()
> >>>>>> +pair_fusion_bb_info::transform ()
> >>>>>>  {
> >>>>>>    traverse_base_map (expr_map);
> >>>>>>    traverse_base_map (def_map);
> >>>>>> @@ -3167,14 +3295,13 @@ try_promote_writeback (insn_info *insn)
> >>>>>>  // for load/store candidates.  If running after RA, also try and promote
> >>>>>>  // non-writeback pairs to use writeback addressing.  Then try to fuse
> >>>>>>  // candidates into pairs.
> >>>>>> -void ldp_fusion_bb (bb_info *bb)
> >>>>>> +void pair_fusion::ldp_fusion_bb (bb_info *bb)
> >>>>>>  {
> >>>>>> -  const bool track_loads
> >>>>>> -    = aarch64_tune_params.ldp_policy_model != AARCH64_LDP_STP_POLICY_NEVER;
> >>>>>> -  const bool track_stores
> >>>>>> -    = aarch64_tune_params.stp_policy_model != AARCH64_LDP_STP_POLICY_NEVER;
> >>>>>> +  const bool track_loads = track_load_p ();
> >>>>>> +  const bool track_stores = track_store_p ();
> >>>>>>  
> >>>>>> -  ldp_bb_info bb_state (bb);
> >>>>>
> >>>>> This:
> >>>>>
> >>>>>> +  aarch64_pair_fusion derived;
> >>>>>
> >>>>> can be deleted and then:
> >>>>>
> >>>>>> +  pair_fusion_bb_info bb_info (bb, &derived);
> >>>>>
> >>>>> can just be:
> >>>>>
> >>>>>   pair_fusion_bb_info bb_info (bb, this);
> >>>>>
> >>>>> (or you can pass *this if you make bb_info take a reference).
> >>>>>
> >>>>> I don't think there's a particular need to change the variable name
> >>>>> (bb_state -> bb_info).  I chose the former because it doesn't clash
> >>>>> with the RTL-SSA structure of the same name as the latter.
> >>>>>
> >>>>
> >>>> Addressed.
> >>>>>>  
> >>>>>>    for (auto insn : bb->nondebug_insns ())
> >>>>>>      {
> >>>>>> @@ -3184,31 +3311,31 @@ void ldp_fusion_bb (bb_info *bb)
> >>>>>>  	continue;
> >>>>>>  
> >>>>>>        rtx pat = PATTERN (rti);
> >>>>>> -      if (reload_completed
> >>>>>> -	  && aarch64_ldp_writeback > 1
> >>>>>> -	  && GET_CODE (pat) == PARALLEL
> >>>>>> -	  && XVECLEN (pat, 0) == 2)
> >>>>>> +      if (pair_mem_promote_writeback_p (pat))
> >>>>>>  	try_promote_writeback (insn);
> >>>>>
> >>>>> It looks like try_promote_writeback itself will need some further work
> >>>>> to make it target-independent.  I suppose this check:
> >>>>>
> >>>>>   auto rti = insn->rtl ();
> >>>>>   const auto attr = get_attr_ldpstp (rti);
> >>>>>   if (attr == LDPSTP_NONE)
> >>>>>     return;
> >>>>>
> >>>>>   bool load_p = (attr == LDPSTP_LDP);
> >>>>>   gcc_checking_assert (load_p || attr == LDPSTP_STP);
> >>>>>
> >>>>> will need to become part of the pair_mem_promote_writeback_p hook that you
> >>>>> added, potentially changing it to return a boolean for load_p.
> >>>>>
> >>>>> Then I guess we will need hooks for destructuring the pair insn and
> >>>>> another hook to wrap aarch64_gen_writeback_pair.
> >>>>>
> >>>>
> >>>> Addressed.
> >>>>>>  
> >>>>>>        if (GET_CODE (pat) != SET)
> >>>>>>  	continue;
> >>>>>>  
> >>>>>>        if (track_stores && MEM_P (XEXP (pat, 0)))
> >>>>>> -	bb_state.track_access (insn, false, XEXP (pat, 0));
> >>>>>> +	bb_info.track_access (insn, false, XEXP (pat, 0));
> >>>>>>        else if (track_loads && MEM_P (XEXP (pat, 1)))
> >>>>>> -	bb_state.track_access (insn, true, XEXP (pat, 1));
> >>>>>> +	bb_info.track_access (insn, true, XEXP (pat, 1));
> >>>>>>      }
> >>>>>>  
> >>>>>> -  bb_state.transform ();
> >>>>>> -  bb_state.cleanup_tombstones ();
> >>>>>> +  bb_info.transform ();
> >>>>>> +  bb_info.cleanup_tombstones ();
> >>>>>>  }
> >>>>>>  
> >>>>>>  void ldp_fusion ()
> >>>>>>  {
> >>>>>>    ldp_fusion_init ();
> >>>>>> +  pair_fusion *pfuse;
> >>>>>> +  aarch64_pair_fusion derived;
> >>>>>> +  pfuse = &derived;
> >>>>>
> >>>>> This is indeed the one place where I think it is acceptable to
> >>>>> instantiate aarch64_pair_fusion.  But again there's no need to create a
> >>>>> pointer to the parent class, just call any function you like directly.
> >>>>>
> >>>>
> >>>> Addressed.
> >>>>>>  
> >>>>>>    for (auto bb : crtl->ssa->bbs ())
> >>>>>> -    ldp_fusion_bb (bb);
> >>>>>> +    pfuse->ldp_fusion_bb (bb);
> >>>>>
> >>>>> I think even the code to iterate over bbs should itself be a member
> >>>>> function of pair_fusion (say "run") and then that becomes part of the
> >>>>> generic code.
> >>>>>
> >>>>> So this function would just become:
> >>>>>
> >>>>> aarch64_pair_fusion pass;
> >>>>> pass.run ();
> >>>>>
> >>>>> and could be inlined into the caller.
> >>>>>
> >>>>
> >>>> Addressed.
> >>>>> Perhaps you could also add an early return like:
> >>>>>
> >>>>>   if (!track_loads_p () && !track_stores_p ())
> >>>>>     return;
> >>>>>
> >>>>> in pair_fusion::run () and then remove the corresponding code from
> >>>>> pass_ldp_fusion::gate?
> >>>>>
> >>>>
> >>>> Addressed.
> >>>>
> >>>>>>  
> >>>>>>    ldp_fusion_destroy ();
> >>>>>>  }
> >>>>>> -- 
> >>>>>> 2.39.3
> >>>>>>
> >>>>>
> >>>>> Thanks,
> >>>>> Alex
> >>>>
> >>>> Thanks & Regards
> >>>> Ajit
Ajit Agarwal April 10, 2024, 10:01 a.m. UTC | #8
Hello Alex:

On 10/04/24 1:42 pm, Alex Coplan wrote:
> Hi Ajit,
> 
> On 09/04/2024 20:59, Ajit Agarwal wrote:
>> Hello Alex:
>>
>> On 09/04/24 8:39 pm, Alex Coplan wrote:
>>> On 09/04/2024 20:01, Ajit Agarwal wrote:
>>>> Hello Alex:
>>>>
>>>> On 09/04/24 7:29 pm, Alex Coplan wrote:
>>>>> On 09/04/2024 17:30, Ajit Agarwal wrote:
>>>>>>
>>>>>>
>>>>>> On 05/04/24 10:03 pm, Alex Coplan wrote:
>>>>>>> On 05/04/2024 13:53, Ajit Agarwal wrote:
>>>>>>>> Hello Alex/Richard:
>>>>>>>>
>>>>>>>> All review comments are incorporated.
>>>>>>>
>>>>>>> Thanks, I was kind-of expecting you to also send the renaming patch as a
>>>>>>> preparatory patch as we discussed.
>>>>>>>
>>>>>>> Sorry for another meta comment, but: I think the reason that the Linaro
>>>>>>> CI isn't running tests on your patches is actually because you're
>>>>>>> sending 1/3 of a series but not sending the rest of the series.
>>>>>>>
>>>>>>> So please can you either send this as an individual preparatory patch
>>>>>>> (not marked as a series) or if you're going to send a series (e.g. with
>>>>>>> a preparatory rename patch as 1/2 and this as 2/2) then send the entire
>>>>>>> series when you make updates.  That way the CI should test your patches,
>>>>>>> which would be helpful.
>>>>>>>
>>>>>>
>>>>>> Addressed.
>>>>>>  
>>>>>>>>
>>>>>>>> Common infrastructure of load store pair fusion is divided into target
>>>>>>>> independent and target dependent changed code.
>>>>>>>>
>>>>>>>> Target independent code is the Generic code with pure virtual function
>>>>>>>> to interface between target independent and dependent code.
>>>>>>>>
>>>>>>>> Target dependent code is the implementation of pure virtual function for
>>>>>>>> aarch64 target and the call to target independent code.
>>>>>>>>
>>>>>>>> Thanks & Regards
>>>>>>>> Ajit
>>>>>>>>
>>>>>>>>
>>>>>>>> aarch64: Place target independent and dependent changed code in one file
>>>>>>>>
>>>>>>>> Common infrastructure of load store pair fusion is divided into target
>>>>>>>> independent and target dependent changed code.
>>>>>>>>
>>>>>>>> Target independent code is the Generic code with pure virtual function
>>>>>>>> to interface between target independent and dependent code.
>>>>>>>>
>>>>>>>> Target dependent code is the implementation of pure virtual function for
>>>>>>>> aarch64 target and the call to target independent code.
>>>>>>>>
>>>>>>>> 2024-04-06  Ajit Kumar Agarwal  <aagarwa1@linux.ibm.com>
>>>>>>>>
>>>>>>>> gcc/ChangeLog:
>>>>>>>>
>>>>>>>> 	* config/aarch64/aarch64-ldp-fusion.cc: Place target
>>>>>>>> 	independent and dependent changed code.
>>>>>>>
>>>>>>> You're going to need a proper ChangeLog eventually, but I guess there's
>>>>>>> no need for that right now.
>>>>>>>
>>>>>>>> ---
>>>>>>>>  gcc/config/aarch64/aarch64-ldp-fusion.cc | 371 +++++++++++++++--------
>>>>>>>>  1 file changed, 249 insertions(+), 122 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc b/gcc/config/aarch64/aarch64-ldp-fusion.cc
>>>>>>>> index 22ed95eb743..cb21b514ef7 100644
>>>>>>>> --- a/gcc/config/aarch64/aarch64-ldp-fusion.cc
>>>>>>>> +++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc
>>>>>>>> @@ -138,8 +138,122 @@ struct alt_base
>>>>>>>>    poly_int64 offset;
>>>>>>>>  };
>>>>>>>>  
>>>>>>>> +// Virtual base class for load/store walkers used in alias analysis.
>>>>>>>> +struct alias_walker
>>>>>>>> +{
>>>>>>>> +  virtual bool conflict_p (int &budget) const = 0;
>>>>>>>> +  virtual insn_info *insn () const = 0;
>>>>>>>> +  virtual bool valid () const  = 0;
>>>>>>>
>>>>>>> Heh, looking at this made me realise there is a whitespace bug here in
>>>>>>> the existing code (double space after const).  Sorry about that!  I'll
>>>>>>> push an obvious fix for that.
>>>>>>>
>>>>>>>> +  virtual void advance () = 0;
>>>>>>>> +};
>>>>>>>> +
>>>>>>>> +struct pair_fusion {
>>>>>>>> +
>>>>>>>> +  pair_fusion () {};
>>>>>>>
>>>>>>> This ctor looks pointless at the moment.  Perhaps instead we could put
>>>>>>> the contents of ldp_fusion_init in here and then delete that function?
>>>>>>>
>>>>>>
>>>>>> Addressed.
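
The ctor and dtor now take over from ldp_fusion_init/ldp_fusion_destroy.
Roughly, as a sketch (this assumes those helpers keep their current
bodies; I haven't reproduced them exactly here):

  pair_fusion::pair_fusion ()
  {
    calculate_dominance_info (CDI_DOMINATORS);
    df_analyze ();
    crtl->ssa = new rtl_ssa::function_info (cfun);
  }

  pair_fusion::~pair_fusion ()
  {
    if (crtl->ssa->perform_pending_updates ())
      cleanup_cfg (0);
    free_dominance_info (CDI_DOMINATORS);
    delete crtl->ssa;
    crtl->ssa = nullptr;
  }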
>>>>>>
>>>>>>>> +  virtual bool fpsimd_op_p (rtx reg_op, machine_mode mem_mode,
>>>>>>>> +			       bool load_p) = 0;
>>>>>>>
>>>>>>> Please can we have comments above each of these virtual functions
>>>>>>> describing any parameters, what the purpose of the hook is, and the
>>>>>>> interpretation of the return value?  This will serve as the
>>>>>>> documentation for other targets that want to make use of the pass.
>>>>>>>
>>>>>>> It might make sense to have a default-false implementation for
>>>>>>> fpsimd_op_p, especially if you don't want to make use of this bit for
>>>>>>> rs6000.
>>>>>>>
>>>>>>
>>>>>> Addressed.
>>>>>>  
>>>>>>>> +
>>>>>>>> +  virtual bool pair_operand_mode_ok_p (machine_mode mode) = 0;
>>>>>>>> +  virtual bool pair_trailing_writeback_p () = 0;
>>>>>>>
>>>>>>> Sorry for the run-around, but: I think this and
>>>>>>> handle_writeback_opportunities () should be the same function, either
>>>>>>> returning an enum or taking an enum and returning a boolean.
>>>>>>>
>>>>>>> At a minimum they should have more similar sounding names.
>>>>>>>
>>>>>>
>>>>>> Addressed.
>>>>>>
>>>>>>>> +  virtual bool pair_reg_operand_ok_p (bool load_p, rtx reg_op,
>>>>>>>> +				      machine_mode mem_mode) = 0;
>>>>>>>> +  virtual int pair_mem_alias_check_limit () = 0;
>>>>>>>> +  virtual bool handle_writeback_opportunities () = 0 ;
>>>>>>>> +  virtual bool pair_mem_ok_with_policy (rtx first_mem, bool load_p,
>>>>>>>> +					machine_mode mode) = 0;
>>>>>>>> +  virtual rtx gen_mem_pair (rtx *pats,  rtx writeback,
>>>>>>>
>>>>>>> Nit: excess whitespace after pats,
>>>>>>>
>>>>>>>> +			    bool load_p) = 0;
>>>>>>>> +  virtual bool pair_mem_promote_writeback_p (rtx pat) = 0;
>>>>>>>> +  virtual bool track_load_p () = 0;
>>>>>>>> +  virtual bool track_store_p () = 0;
>>>>>>>
>>>>>>> I think it would probably make more sense for these two to have
>>>>>>> default-true implementations rather than being pure virtual functions.
>>>>>>>
>>>>>>> Also, sorry for the bikeshedding, but please can we keep the plural
>>>>>>> names (so track_loads_p and track_stores_p).
>>>>>>>
>>>>>>>> +  virtual bool cand_insns_empty_p (std::list<insn_info *> &insns) = 0;
>>>>>>>
>>>>>>> Why does this need to be virtualised?  I would expect it to
>>>>>>> just be insns.empty () on all targets.
>>>>>>>
>>>>>>>> +  virtual bool pair_mem_in_range_p (HOST_WIDE_INT off) = 0;
>>>>>>>> +  void ldp_fusion_bb (bb_info *bb);
>>>>>>>
>>>>>>> Just to check, are you planning to rename this function in a separate
>>>>>>> patch?
>>>>>>>
>>>>>>
>>>>>> Yes.
>>>>>>>> +
>>>>>>>> +  ~pair_fusion() { }
>>>>>>>
>>>>>>> As with the ctor, perhaps we could put the contents of ldp_fusion_destroy
>>>>>>> in here and then delete that function?
>>>>>>>
>>>>>>
>>>>>> Addressed.
>>>>>>>> +};
>>>>>>>> +
>>>>>>>> +struct aarch64_pair_fusion : public pair_fusion
>>>>>>>> +{
>>>>>>>> +public:
>>>>>>>
>>>>>>
>>>>>> Addressed.
>>>>>>> Nit: there shouldn't be a need for this as public is already the default
>>>>>>> for structs.
>>>>>>>
>>>>>>>> +  aarch64_pair_fusion () : pair_fusion () {};
>>>>>>>
>>>>>>> This ctor looks pointless, the default one should do, so can we remove
>>>>>>> this?
>>>>>>>
>>>>>>
>>>>>> Addressed.
>>>>>>>> +  bool fpsimd_op_p (rtx reg_op, machine_mode mem_mode,
>>>>>>>> +		    bool load_p) override final
>>>>>>>> +  {
>>>>>>>> +    const bool fpsimd_op_p
>>>>>>>> +      = reload_completed
>>>>>>>> +      ? (REG_P (reg_op) && FP_REGNUM_P (REGNO (reg_op)))
>>>>>>>> +      : (GET_MODE_CLASS (mem_mode) != MODE_INT
>>>>>>>> +	 && (load_p || !aarch64_const_zero_rtx_p (reg_op)));
>>>>>>>> +    return fpsimd_op_p;
>>>>>>>
>>>>>>> Can this just directly return the boolean expression instead of storing
>>>>>>> it into a variable first?
>>>>>>>
>>>>>> Addressed.
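
That is, it now returns the boolean expression directly:

  bool fpsimd_op_p (rtx reg_op, machine_mode mem_mode,
                    bool load_p) override final
  {
    return (reload_completed
            ? (REG_P (reg_op) && FP_REGNUM_P (REGNO (reg_op)))
            : (GET_MODE_CLASS (mem_mode) != MODE_INT
               && (load_p || !aarch64_const_zero_rtx_p (reg_op))));
  }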
>>>>>>>> +  }
>>>>>>>> +
>>>>>>>> +  bool pair_mem_promote_writeback_p (rtx pat)
>>>>>>>> +  {
>>>>>>>> +     if (reload_completed
>>>>>>>> +	 && aarch64_ldp_writeback > 1
>>>>>>>> +	 && GET_CODE (pat) == PARALLEL
>>>>>>>> +	 && XVECLEN (pat, 0) == 2)
>>>>>>>> +	return true;
>>>>>>>> +
>>>>>>>> +     return false;
>>>>>>>> +  }
>>>>>>>> +
>>>>>>>> +  bool pair_mem_ok_with_policy (rtx first_mem, bool load_p,
>>>>>>>> +				machine_mode mode)
>>>>>>>> +  {
>>>>>>>> +    return aarch64_mem_ok_with_ldpstp_policy_model (first_mem,
>>>>>>>> +						     load_p,
>>>>>>>> +						     mode);
>>>>>>>> +  }
>>>>>>>> +  bool pair_operand_mode_ok_p (machine_mode mode);
>>>>>>>
>>>>>>> Why have you implemented this (one-line function) out-of-line but the
>>>>>>> others are implemented inline?  Please be consistent.
>>>>>>>
>>>>>>
>>>>>> Addressed.
>>>>>>>> +
>>>>>>>> +  rtx gen_mem_pair (rtx *pats, rtx writeback, bool load_p);
>>>>>>>> +
>>>>>>>> +  bool pair_trailing_writeback_p  ()
>>>>>>>> +  {
>>>>>>>> +    return aarch64_ldp_writeback > 1;
>>>>>>>> +  }
>>>>>>>> +  bool pair_reg_operand_ok_p (bool load_p, rtx reg_op,
>>>>>>>> +			      machine_mode mem_mode)
>>>>>>>> +  {
>>>>>>>> +    return (load_p
>>>>>>>> +	     ? aarch64_ldp_reg_operand (reg_op, mem_mode)
>>>>>>>> +	     : aarch64_stp_reg_operand (reg_op, mem_mode));
>>>>>>>> +  }
>>>>>>>> +  int pair_mem_alias_check_limit ()
>>>>>>>> +  {
>>>>>>>> +    return aarch64_ldp_alias_check_limit;
>>>>>>>> +  }
>>>>>>>> +  bool handle_writeback_opportunities ()
>>>>>>>> +  {
>>>>>>>> +    return aarch64_ldp_writeback;
>>>>>>>> +  }
>>>>>>>> +  bool track_load_p ()
>>>>>>>> +  {
>>>>>>>> +    const bool track_loads
>>>>>>>> +      = aarch64_tune_params.ldp_policy_model != AARCH64_LDP_STP_POLICY_NEVER;
>>>>>>>> +    return track_loads;
>>>>>>>> +  }
>>>>>>>
>>>>>>
>>>>>> Addressed.
>>>>>>> As above, can this just return the boolean expression directly without
>>>>>>> the temporary?
>>>>>>>
>>>>>>>> +  bool track_store_p ()
>>>>>>>> +  {
>>>>>>>> +    const bool track_stores
>>>>>>>> +      = aarch64_tune_params.stp_policy_model != AARCH64_LDP_STP_POLICY_NEVER;
>>>>>>>> +    return track_stores;
>>>>>>>> +  }
>>>>>>>
>>>>>>
>>>>>> Addressed.
>>>>>>> Likewise.
>>>>>>>
>>>>>>>> +  bool cand_insns_empty_p (std::list<insn_info *> &insns)
>>>>>>>> +  {
>>>>>>>> +    return insns.empty();
>>>>>>>> +  }
>>>>>>>
>>>>>>> As mentioned on the declaration, I don't see why this needs to be
>>>>>>> virtualised.
>>>>>>>
>>>>>>>> +  bool pair_mem_in_range_p (HOST_WIDE_INT off)
>>>>>>>> +  {
>>>>>>>> +    return (off < LDP_MIN_IMM || off > LDP_MAX_IMM);
>>>>>>>> +  }
>>>>>>>
>>>>>>> Hmm, this looks to test exactly the opposite condition from what the
>>>>>>> name suggests, i.e. the implementation looks like what I would expect
>>>>>>> for a function called pair_mem_out_of_range_p.
>>>>>>>
>>>>>>> Please can you invert the condition and fix up callers appropriately?
>>>>>>>
>>>>>>> This suggests you're not really thinking about the code you're writing,
>>>>>>> please try to think a bit more carefully before posting patches.
>>>>>>>
>>>>>>
>>>>>> Sorry for my mistake as I have overlooked it.
>>>>>>>> +};
>>>>>>>> +
>>>>>>>> +
>>>>>>>>  // State used by the pass for a given basic block.
>>>>>>>> -struct ldp_bb_info
>>>>>>>> +struct pair_fusion_bb_info
>>>>>>>>  {
>>>>>>>>    using def_hash = nofree_ptr_hash<def_info>;
>>>>>>>>    using expr_key_t = pair_hash<tree_operand_hash, int_hash<int, -1, -2>>;
>>>>>>>> @@ -160,14 +274,17 @@ struct ldp_bb_info
>>>>>>>>  
>>>>>>>>    static const size_t obstack_alignment = sizeof (void *);
>>>>>>>>    bb_info *m_bb;
>>>>>>>> +  pair_fusion *bb_state;
>>>>>>>
>>>>>>
>>>>>> Addressed.
>>>>>>
>>>>>>> Can we call this m_pass instead of bb_state?  The pair_fusion class is
>>>>>>> explicitly at the pass level rather than the bb level.
>>>>>>>
>>>>>>> Also I think this (and indeed m_bb, although that's pre-existing) could
>>>>>>> be private members.  Not sure there's a good reason for making them
>>>>>>> public.
>>>>>>>
>>>>>>
>>>>>> Addressed.
>>>>>>  
>>>>>>> It might be slightly nicer to use a reference for m_pass unless there's
>>>>>>> a good reason to use a pointer, although I don't feel strongly about
>>>>>>> this.
>>>>>>>
>>>>>>>>  
>>>>>>>> -  ldp_bb_info (bb_info *bb) : m_bb (bb), m_emitted_tombstone (false)
>>>>>>>> +  pair_fusion_bb_info (bb_info *bb,
>>>>>>>> +		      aarch64_pair_fusion *d) : m_bb (bb),
>>>>>>>
>>>>>>> This generic class needs to take the generic pass type, i.e. just a
>>>>>>> pair_fusion * rather than an aarch64_pair_fusion *.
>>>>>>>
>>>>>>>> +		      bb_state (d), m_emitted_tombstone (false)
>>>>>>>>    {
>>>>>>>>      obstack_specify_allocation (&m_obstack, OBSTACK_CHUNK_SIZE,
>>>>>>>>  				obstack_alignment, obstack_chunk_alloc,
>>>>>>>>  				obstack_chunk_free);
>>>>>>>>    }
>>>>>>>> -  ~ldp_bb_info ()
>>>>>>>> +  ~pair_fusion_bb_info ()
>>>>>>>>    {
>>>>>>>>      obstack_free (&m_obstack, nullptr);
>>>>>>>>  
>>>>>>>> @@ -177,10 +294,32 @@ struct ldp_bb_info
>>>>>>>>  	bitmap_obstack_release (&m_bitmap_obstack);
>>>>>>>>        }
>>>>>>>>    }
>>>>>>>> +  void track_access (insn_info *, bool load, rtx mem);
>>>>>>>> +  void transform ();
>>>>>>>> +  void cleanup_tombstones ();
>>>>>>>
>>>>>>> I think most (or indeed all?) of the changes here shouldn't be needed.
>>>>>>> AFAICS, there shouldn't be a need to:
>>>>>>>  - drop inline from these functions.
>>>>>>>  - change the accessibility of any members.
>>>>>>>
>>>>>>
>>>>>> Addressed.
>>>>>>> Plese can you drop these changes or explain why they're needed (from
>>>>>>> here until the end of the class).
>>>>>>>
>>>>>>>> +  void merge_pairs (insn_list_t &, insn_list_t &,
>>>>>>>> +		    bool load_p, unsigned access_size);
>>>>>>>> +  void transform_for_base (int load_size, access_group &group);
>>>>>>>> +
>>>>>>>> +  bool try_fuse_pair (bool load_p, unsigned access_size,
>>>>>>>> +			     insn_info *i1, insn_info *i2);
>>>>>>>> +
>>>>>>>> +  bool fuse_pair (bool load_p, unsigned access_size,
>>>>>>>> +		  int writeback,
>>>>>>>> +		  insn_info *i1, insn_info *i2,
>>>>>>>> +		  base_cand &base,
>>>>>>>> +		  const insn_range_info &move_range);
>>>>>>>>  
>>>>>>>> -  inline void track_access (insn_info *, bool load, rtx mem);
>>>>>>>> -  inline void transform ();
>>>>>>>> -  inline void cleanup_tombstones ();
>>>>>>>> +  void do_alias_analysis (insn_info *alias_hazards[4],
>>>>>>>> +			  alias_walker *walkers[4],
>>>>>>>> +			  bool load_p);
>>>>>>>> +
>>>>>>>> +  void track_tombstone (int uid);
>>>>>>>> +
>>>>>>>> +  bool track_via_mem_expr (insn_info *, rtx mem, lfs_fields lfs);
>>>>>>>> +
>>>>>>>> +  template<typename Map>
>>>>>>>> +    void traverse_base_map (Map &map);
>>>>>>>>  
>>>>>>>>  private:
>>>>>>>>    obstack m_obstack;
>>>>>>>> @@ -191,30 +330,32 @@ private:
>>>>>>>>    bool m_emitted_tombstone;
>>>>>>>>  
>>>>>>>>    inline splay_tree_node<access_record *> *node_alloc (access_record *);
>>>>>>>> +};
>>>>>>>>  
>>>>>>>> -  template<typename Map>
>>>>>>>> -  inline void traverse_base_map (Map &map);
>>>>>>>> -  inline void transform_for_base (int load_size, access_group &group);
>>>>>>>> -
>>>>>>>> -  inline void merge_pairs (insn_list_t &, insn_list_t &,
>>>>>>>> -			   bool load_p, unsigned access_size);
>>>>>>>> -
>>>>>>>> -  inline bool try_fuse_pair (bool load_p, unsigned access_size,
>>>>>>>> -			     insn_info *i1, insn_info *i2);
>>>>>>>> -
>>>>>>>> -  inline bool fuse_pair (bool load_p, unsigned access_size,
>>>>>>>> -			 int writeback,
>>>>>>>> -			 insn_info *i1, insn_info *i2,
>>>>>>>> -			 base_cand &base,
>>>>>>>> -			 const insn_range_info &move_range);
>>>>>>>> -
>>>>>>>> -  inline void track_tombstone (int uid);
>>>>>>>> +rtx aarch64_pair_fusion::gen_mem_pair (rtx *pats,
>>>>>>>
>>>>>>> Style nit: newline after rtx.
>>>>>>>
>>>>>> Addressed.
>>>>>>
>>>>>>>> +		  rtx writeback,
>>>>>>>> +		  bool load_p)
>>>>>>>> +  {
>>>>>>>> +    rtx pair_pat;
>>>>>>>>  
>>>>>>>> -  inline bool track_via_mem_expr (insn_info *, rtx mem, lfs_fields lfs);
>>>>>>>> -};
>>>>>>>> +    if (writeback)
>>>>>>>> +      {
>>>>>>>> +	auto patvec = gen_rtvec (3, writeback, pats[0], pats[1]);
>>>>>>>> +	pair_pat = gen_rtx_PARALLEL (VOIDmode, patvec);
>>>>>>>> +      }
>>>>>>>> +    else if (load_p)
>>>>>>>> +      pair_pat = aarch64_gen_load_pair (XEXP (pats[0], 0),
>>>>>>>> +					XEXP (pats[1], 0),
>>>>>>>> +					XEXP (pats[0], 1));
>>>>>>>> +    else
>>>>>>>> +      pair_pat = aarch64_gen_store_pair (XEXP (pats[0], 0),
>>>>>>>> +					 XEXP (pats[0], 1),
>>>>>>>> +					 XEXP (pats[1], 1));
>>>>>>>> +     return pair_pat;
>>>>>>>
>>>>>>> Can the assignments just be direct returns here, please?  Then
>>>>>>> get rid of the temporary.
>>>>>>>
>>>>>>
>>>>>> Addressed.
>>>>>>  
>>>>>>>> +  }
>>>>>>>>  
>>>>>>>>  splay_tree_node<access_record *> *
>>>>>>>> -ldp_bb_info::node_alloc (access_record *access)
>>>>>>>> +pair_fusion_bb_info::node_alloc (access_record *access)
>>>>>>>>  {
>>>>>>>>    using T = splay_tree_node<access_record *>;
>>>>>>>>    void *addr = obstack_alloc (&m_obstack, sizeof (T));
>>>>>>>> @@ -262,7 +403,7 @@ drop_writeback (rtx mem)
>>>>>>>>  // RTX_AUTOINC addresses.  The interface is like strip_offset except we take a
>>>>>>>>  // MEM so that we know the mode of the access.
>>>>>>>>  static rtx
>>>>>>>> -ldp_strip_offset (rtx mem, poly_int64 *offset)
>>>>>>>> +pair_mem_strip_offset (rtx mem, poly_int64 *offset)
>>>>>>>
>>>>>>> I thought we discussed splitting the renaming into a separate patch?
>>>>>>>
>>>>>>>>  {
>>>>>>>>    rtx addr = XEXP (mem, 0);
>>>>>>>>  
>>>>>>>> @@ -332,6 +473,12 @@ ldp_operand_mode_ok_p (machine_mode mode)
>>>>>>>>    return reload_completed || mode != TImode;
>>>>>>>>  }
>>>>>>>>  
>>>>>>>> +bool
>>>>>>>> +aarch64_pair_fusion::pair_operand_mode_ok_p (machine_mode mode)
>>>>>>>> +{
>>>>>>>> +  return ldp_operand_mode_ok_p (mode);
>>>>>>>> +}
>>>>>>>> +
>>>>>>>>  // Given LFS (load_p, fpsimd_p, size) fields in FIELDS, encode these
>>>>>>>>  // into an integer for use as a hash table key.
>>>>>>>>  static int
>>>>>>>> @@ -396,7 +543,7 @@ access_group::track (Alloc alloc_node, poly_int64 offset, insn_info *insn)
>>>>>>>>  // MEM_EXPR base (i.e. a tree decl) relative to which we can track the access.
>>>>>>>>  // LFS is used as part of the key to the hash table, see track_access.
>>>>>>>>  bool
>>>>>>>> -ldp_bb_info::track_via_mem_expr (insn_info *insn, rtx mem, lfs_fields lfs)
>>>>>>>> +pair_fusion_bb_info::track_via_mem_expr (insn_info *insn, rtx mem, lfs_fields lfs)
>>>>>>>
>>>>>>> Again, please do any renaming in a separate patch.
>>>>>>>
>>>>>>
>>>>>> Addressed.
>>>>>>>>  {
>>>>>>>>    if (!MEM_EXPR (mem) || !MEM_OFFSET_KNOWN_P (mem))
>>>>>>>>      return false;
>>>>>>>> @@ -412,9 +559,10 @@ ldp_bb_info::track_via_mem_expr (insn_info *insn, rtx mem, lfs_fields lfs)
>>>>>>>>    const machine_mode mem_mode = GET_MODE (mem);
>>>>>>>>    const HOST_WIDE_INT mem_size = GET_MODE_SIZE (mem_mode).to_constant ();
>>>>>>>>  
>>>>>>>> -  // Punt on misaligned offsets.  LDP/STP instructions require offsets to be a
>>>>>>>> -  // multiple of the access size, and we believe that misaligned offsets on
>>>>>>>> -  // MEM_EXPR bases are likely to lead to misaligned offsets w.r.t. RTL bases.
>>>>>>>> +  // Punt on misaligned offsets.  Paired memory access instructions require
>>>>>>>> +  // offsets to be a multiple of the access size, and we believe that
>>>>>>>> +  // misaligned offsets on MEM_EXPR bases are likely to lead to misaligned
>>>>>>>> +  // offsets w.r.t. RTL bases.
>>>>>>>>    if (!multiple_p (offset, mem_size))
>>>>>>>>      return false;
>>>>>>>>  
>>>>>>>> @@ -438,46 +586,38 @@ ldp_bb_info::track_via_mem_expr (insn_info *insn, rtx mem, lfs_fields lfs)
>>>>>>>>  }
>>>>>>>>  
>>>>>>>>  // Main function to begin pair discovery.  Given a memory access INSN,
>>>>>>>> -// determine whether it could be a candidate for fusing into an ldp/stp,
>>>>>>>> +// determine whether it could be a candidate for fusing into an pair mem,
>>>>>>>
>>>>>>> s/an pair mem/a paired access/
>>>>>>>
>>>>>>>>  // and if so, track it in the appropriate data structure for this basic
>>>>>>>>  // block.  LOAD_P is true if the access is a load, and MEM is the mem
>>>>>>>>  // rtx that occurs in INSN.
>>>>>>>>  void
>>>>>>>> -ldp_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
>>>>>>>> +pair_fusion_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
>>>>>>>>  {
>>>>>>>>    // We can't combine volatile MEMs, so punt on these.
>>>>>>>>    if (MEM_VOLATILE_P (mem))
>>>>>>>>      return;
>>>>>>>>  
>>>>>>>> -  // Ignore writeback accesses if the param says to do so.
>>>>>>>> -  if (!aarch64_ldp_writeback
>>>>>>>> +  // Ignore writeback accesses if the param says to do so
>>>>>>>
>>>>>>
>>>>>> Addressed.
>>>>>>> This comment needs updating, I guess something like "if the hook says to
>>>>>>> do so".  But also please don't delete the full stop.
>>>>>>>
>>>>>>>> +  if (!bb_state->handle_writeback_opportunities ()
>>>>>>>>        && GET_RTX_CLASS (GET_CODE (XEXP (mem, 0))) == RTX_AUTOINC)
>>>>>>>>      return;
>>>>>>>>  
>>>>>>>>    const machine_mode mem_mode = GET_MODE (mem);
>>>>>>>> -  if (!ldp_operand_mode_ok_p (mem_mode))
>>>>>>>> +
>>>>>>>
>>>>>>> I don't think we need the extra whitespace here.
>>>>>>>
>>>>>>>> +  if (!bb_state->pair_operand_mode_ok_p (mem_mode))
>>>>>>>>      return;
>>>>>>>>  
>>>>>>>>    rtx reg_op = XEXP (PATTERN (insn->rtl ()), !load_p);
>>>>>>>>  
>>>>>>>> -  // Ignore the access if the register operand isn't suitable for ldp/stp.
>>>>>>>> -  if (load_p
>>>>>>>> -      ? !aarch64_ldp_reg_operand (reg_op, mem_mode)
>>>>>>>> -      : !aarch64_stp_reg_operand (reg_op, mem_mode))
>>>>>>>> +  if (!bb_state->pair_reg_operand_ok_p (load_p, reg_op, mem_mode))
>>>>>>>>      return;
>>>>>>>> -
>>>>>>>>    // We want to segregate FP/SIMD accesses from GPR accesses.
>>>>>>>>    //
>>>>>>>>    // Before RA, we use the modes, noting that stores of constant zero
>>>>>>>>    // operands use GPRs (even in non-integer modes).  After RA, we use
>>>>>>>>    // the hard register numbers.
>>>>>>>> -  const bool fpsimd_op_p
>>>>>>>> -    = reload_completed
>>>>>>>> -    ? (REG_P (reg_op) && FP_REGNUM_P (REGNO (reg_op)))
>>>>>>>> -    : (GET_MODE_CLASS (mem_mode) != MODE_INT
>>>>>>>> -       && (load_p || !aarch64_const_zero_rtx_p (reg_op)));
>>>>>>>> -
>>>>>>>> -  // Note ldp_operand_mode_ok_p already rejected VL modes.
>>>>>>>> +  const bool fpsimd_op_p = bb_state->fpsimd_op_p (reg_op, mem_mode, load_p);
>>>>>>>> +  // Note pair_operand_mode_ok_p already rejected VL modes.
>>>>>>>>    const HOST_WIDE_INT mem_size = GET_MODE_SIZE (mem_mode).to_constant ();
>>>>>>>>    const lfs_fields lfs = { load_p, fpsimd_op_p, mem_size };
>>>>>>>>  
>>>>>>>> @@ -487,7 +627,7 @@ ldp_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
>>>>>>>>    poly_int64 mem_off;
>>>>>>>>    rtx addr = XEXP (mem, 0);
>>>>>>>>    const bool autoinc_p = GET_RTX_CLASS (GET_CODE (addr)) == RTX_AUTOINC;
>>>>>>>> -  rtx base = ldp_strip_offset (mem, &mem_off);
>>>>>>>> +  rtx base = pair_mem_strip_offset (mem, &mem_off);
>>>>>>>
>>>>>>> Again, let's split the renaming off into a separate patch.
>>>>>>>
>>>>>>
>>>>>> Addressed.
>>>>>>>>    if (!REG_P (base))
>>>>>>>>      return;
>>>>>>>>  
>>>>>>>> @@ -506,8 +646,8 @@ ldp_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
>>>>>>>>    // elimination offset pre-RA, we should postpone forming pairs on such
>>>>>>>>    // accesses until after RA.
>>>>>>>>    //
>>>>>>>> -  // As it stands, addresses with offsets in range for LDR but not
>>>>>>>> -  // in range for LDP/STP are currently reloaded inefficiently,
>>>>>>>> +  // As it stands, addresses in range for an individual load/store but not
>>>>>>>> +  // for a paired access are currently reloaded inefficiently,
>>>>>>>>    // ending up with a separate base register for each pair.
>>>>>>>>    //
>>>>>>>>    // In theory LRA should make use of
>>>>>>>> @@ -519,8 +659,8 @@ ldp_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
>>>>>>>>    // that calls targetm.legitimize_address_displacement.
>>>>>>>>    //
>>>>>>>>    // So for now, it's better to punt when we can't be sure that the
>>>>>>>> -  // offset is in range for LDP/STP.  Out-of-range cases can then be
>>>>>>>> -  // handled after RA by the out-of-range LDP/STP peepholes.  Eventually, it
>>>>>>>> +  // offset is in range for paired access.  Out-of-range cases can then be
>>>>>>>> +  // handled after RA by the out-of-range PAIR MEM  peepholes.  Eventually, it
>>>>>>>>    // would be nice to handle known out-of-range opportunities in the
>>>>>>>>    // pass itself (for stack accesses, this would be in the post-RA pass).
>>>>>>>>    if (!reload_completed
>>>>>>>> @@ -573,8 +713,8 @@ ldp_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
>>>>>>>>  	gcc_unreachable (); // Base defs should be unique.
>>>>>>>>      }
>>>>>>>>  
>>>>>>>> -  // Punt on misaligned offsets.  LDP/STP require offsets to be a multiple of
>>>>>>>> -  // the access size.
>>>>>>>> +  // Punt on misaligned offsets.  Paired memory accesses require offsets
>>>>>>>> +  // to be a multiple of the access size.
>>>>>>>>    if (!multiple_p (mem_off, mem_size))
>>>>>>>>      return;
>>>>>>>>  
>>>>>>>> @@ -614,7 +754,7 @@ static bool no_ignore (insn_info *) { return false; }
>>>>>>>>  // making use of alias disambiguation.
>>>>>>>>  static insn_info *
>>>>>>>>  latest_hazard_before (insn_info *insn, rtx *ignore,
>>>>>>>> -		      insn_info *ignore_insn = nullptr)
>>>>>>>> +		      insn_info *ignore_insn = 0)
>>>>>>>
>>>>>>> Is this change needed?
>>>>>>>
>>>>>>
>>>>>> Addressed.
>>>>>>>>  {
>>>>>>>>    insn_info *result = nullptr;
>>>>>>>>  
>>>>>>>> @@ -1150,7 +1290,7 @@ extract_writebacks (bool load_p, rtx pats[2], int changed)
>>>>>>>>        const bool autoinc_p = GET_RTX_CLASS (GET_CODE (addr)) == RTX_AUTOINC;
>>>>>>>>  
>>>>>>>>        poly_int64 offset;
>>>>>>>> -      rtx this_base = ldp_strip_offset (mem, &offset);
>>>>>>>> +      rtx this_base = pair_mem_strip_offset (mem, &offset);
>>>>>>>>        gcc_assert (REG_P (this_base));
>>>>>>>>        if (base_reg)
>>>>>>>>  	gcc_assert (rtx_equal_p (base_reg, this_base));
>>>>>>>> @@ -1286,7 +1426,11 @@ find_trailing_add (insn_info *insns[2],
>>>>>>>>  
>>>>>>>>    off_hwi /= access_size;
>>>>>>>>  
>>>>>>>> -  if (off_hwi < LDP_MIN_IMM || off_hwi > LDP_MAX_IMM)
>>>>>>>> +  pair_fusion *pfuse;
>>>>>>>> +  aarch64_pair_fusion derived;
>>>>>>>> +  pfuse = &derived;
>>>>>>>
>>>>>>> Huh? We'll want to use the same instance of the pass structure (pair_fusion)
>>>>>>> that gets created at the top level, so that should get passed around everywhere.
>>>>>>>
>>>>>>> In the case of find_trailing_add we probably don't want to add any more
>>>>>>> parameters (as Segher noted there are already too many), so perhaps the
>>>>>>> easiest thing would be to make find_trailing_add itself a (non-virtual)
>>>>>>> member function of pair_fusion.
>>>>>>>
>>>>>>> Then here we can just query pair_mem_in_range_p directly.
>>>>>>>
>>>>>>> For try_promote_writeback itself you can either make it a non-virtual
>>>>>>> member function of pair_fusion or just pass a reference to the pass
>>>>>>> object in when you call it.
>>>>>>>
>>>>>>> Note also that you don't have to create a pointer of the parent type in
>>>>>>> order to call a virtual function on a derived object, you can just call
>>>>>>> it directly.
>>>>>>>
>>>>>>
>>>>>> Addressed.
>>>>>>>> +
>>>>>>>> +  if (pfuse->pair_mem_in_range_p (off_hwi))
>>>>>>>>      return nullptr;
>>>>>>>
>>>>>>> Note of course this condition needs inverting when you invert the
>>>>>>> implementation.
>>>>>>>
>>>>>>
>>>>>> Addressed.
>>>>>>>>  
>>>>>>>>    auto dump_prefix = [&]()
>>>>>>>> @@ -1328,7 +1472,7 @@ find_trailing_add (insn_info *insns[2],
>>>>>>>>  // We just emitted a tombstone with uid UID, track it in a bitmap for
>>>>>>>>  // this BB so we can easily identify it later when cleaning up tombstones.
>>>>>>>>  void
>>>>>>>> -ldp_bb_info::track_tombstone (int uid)
>>>>>>>> +pair_fusion_bb_info::track_tombstone (int uid)
>>>>>>>
>>>>>>> Again, please split the renaming off.  Same with other instances below.
>>>>>>>
>>>>>>>>  {
>>>>>>>>    if (!m_emitted_tombstone)
>>>>>>>>      {
>>>>>>>> @@ -1528,7 +1672,7 @@ fixup_debug_uses (obstack_watermark &attempt,
>>>>>>>>  	  gcc_checking_assert (GET_RTX_CLASS (GET_CODE (XEXP (mem, 0)))
>>>>>>>>  			       == RTX_AUTOINC);
>>>>>>>>  
>>>>>>>> -	  base = ldp_strip_offset (mem, &offset);
>>>>>>>> +	  base = pair_mem_strip_offset (mem, &offset);
>>>>>>>>  	  gcc_checking_assert (REG_P (base) && REGNO (base) == base_regno);
>>>>>>>>  	}
>>>>>>>>        fixup_debug_use (attempt, use, def, base, offset);
>>>>>>>> @@ -1664,7 +1808,7 @@ fixup_debug_uses (obstack_watermark &attempt,
>>>>>>>>  // BASE gives the chosen base candidate for the pair and MOVE_RANGE is
>>>>>>>>  // a singleton range which says where to place the pair.
>>>>>>>>  bool
>>>>>>>> -ldp_bb_info::fuse_pair (bool load_p,
>>>>>>>> +pair_fusion_bb_info::fuse_pair (bool load_p,
>>>>>>>>  			unsigned access_size,
>>>>>>>>  			int writeback,
>>>>>>>>  			insn_info *i1, insn_info *i2,
>>>>>>>> @@ -1800,7 +1944,7 @@ ldp_bb_info::fuse_pair (bool load_p,
>>>>>>>>      {
>>>>>>>>        if (dump_file)
>>>>>>>>  	fprintf (dump_file,
>>>>>>>> -		 "  ldp: i%d has wb but subsequent i%d has non-wb "
>>>>>>>> +		 "  load pair: i%d has wb but subsequent i%d has non-wb "
>>>>>>>>  		 "update of base (r%d), dropping wb\n",
>>>>>>>>  		 insns[0]->uid (), insns[1]->uid (), base_regno);
>>>>>>>>        gcc_assert (writeback_effect);
>>>>>>>> @@ -1823,7 +1967,7 @@ ldp_bb_info::fuse_pair (bool load_p,
>>>>>>>>      }
>>>>>>>>  
>>>>>>>>    // If either of the original insns had writeback, but the resulting pair insn
>>>>>>>> -  // does not (can happen e.g. in the ldp edge case above, or if the writeback
>>>>>>>> +  // does not (can happen e.g. in the load pair edge case above, or if the writeback
>>>>>>>>    // effects cancel out), then drop the def(s) of the base register as
>>>>>>>>    // appropriate.
>>>>>>>>    //
>>>>>>>> @@ -1842,7 +1986,7 @@ ldp_bb_info::fuse_pair (bool load_p,
>>>>>>>>    // update of the base register and try and fold it in to make this into a
>>>>>>>>    // writeback pair.
>>>>>>>>    insn_info *trailing_add = nullptr;
>>>>>>>> -  if (aarch64_ldp_writeback > 1
>>>>>>>> +  if (bb_state->pair_trailing_writeback_p ()
>>>>>>>>        && !writeback_effect
>>>>>>>>        && (!load_p || (!refers_to_regno_p (base_regno, base_regno + 1,
>>>>>>>>  					 XEXP (pats[0], 0), nullptr)
>>>>>>>> @@ -1863,14 +2007,14 @@ ldp_bb_info::fuse_pair (bool load_p,
>>>>>>>>      }
>>>>>>>>  
>>>>>>>>    // Now that we know what base mem we're going to use, check if it's OK
>>>>>>>> -  // with the ldp/stp policy.
>>>>>>>> +  // with the pair mem policy.
>>>>>>>>    rtx first_mem = XEXP (pats[0], load_p);
>>>>>>>> -  if (!aarch64_mem_ok_with_ldpstp_policy_model (first_mem,
>>>>>>>> -						load_p,
>>>>>>>> -						GET_MODE (first_mem)))
>>>>>>>> +  if (!bb_state->pair_mem_ok_with_policy (first_mem,
>>>>>>>> +			  load_p,
>>>>>>>> +			  GET_MODE (first_mem)))
>>>>>>>>      {
>>>>>>>>        if (dump_file)
>>>>>>>> -	fprintf (dump_file, "punting on pair (%d,%d), ldp/stp policy says no\n",
>>>>>>>> +	fprintf (dump_file, "punting on pair (%d,%d), pair mem  policy says no\n",
>>>>>>>>  		 i1->uid (), i2->uid ());
>>>>>>>>        return false;
>>>>>>>>      }
>>>>>>>> @@ -1878,21 +2022,10 @@ ldp_bb_info::fuse_pair (bool load_p,
>>>>>>>>    rtx reg_notes = combine_reg_notes (first, second, load_p);
>>>>>>>>  
>>>>>>>>    rtx pair_pat;
>>>>>>>> -  if (writeback_effect)
>>>>>>>> -    {
>>>>>>>> -      auto patvec = gen_rtvec (3, writeback_effect, pats[0], pats[1]);
>>>>>>>> -      pair_pat = gen_rtx_PARALLEL (VOIDmode, patvec);
>>>>>>>> -    }
>>>>>>>> -  else if (load_p)
>>>>>>>> -    pair_pat = aarch64_gen_load_pair (XEXP (pats[0], 0),
>>>>>>>> -				      XEXP (pats[1], 0),
>>>>>>>> -				      XEXP (pats[0], 1));
>>>>>>>> -  else
>>>>>>>> -    pair_pat = aarch64_gen_store_pair (XEXP (pats[0], 0),
>>>>>>>> -				       XEXP (pats[0], 1),
>>>>>>>> -				       XEXP (pats[1], 1));
>>>>>>>>  
>>>>>>>> +  pair_pat = bb_state->gen_mem_pair (pats, writeback_effect, load_p);
>>>>>>>>    insn_change *pair_change = nullptr;
>>>>>>>> +
>>>>>>>>    auto set_pair_pat = [pair_pat,reg_notes](insn_change *change) {
>>>>>>>>      rtx_insn *rti = change->insn ()->rtl ();
>>>>>>>>      validate_unshare_change (rti, &PATTERN (rti), pair_pat, true);
>>>>>>>> @@ -2133,15 +2266,6 @@ load_modified_by_store_p (insn_info *load,
>>>>>>>>    return false;
>>>>>>>>  }
>>>>>>>>  
>>>>>>>> -// Virtual base class for load/store walkers used in alias analysis.
>>>>>>>> -struct alias_walker
>>>>>>>> -{
>>>>>>>> -  virtual bool conflict_p (int &budget) const = 0;
>>>>>>>> -  virtual insn_info *insn () const = 0;
>>>>>>>> -  virtual bool valid () const  = 0;
>>>>>>>> -  virtual void advance () = 0;
>>>>>>>> -};
>>>>>>>> -
>>>>>>>>  // Implement some common functionality used by both store_walker
>>>>>>>>  // and load_walker.
>>>>>>>>  template<bool reverse>
>>>>>>>> @@ -2259,13 +2383,13 @@ public:
>>>>>>>>  //
>>>>>>>>  // We try to maintain the invariant that if a walker becomes invalid, we
>>>>>>>>  // set its pointer to null.
>>>>>>>> -static void
>>>>>>>> -do_alias_analysis (insn_info *alias_hazards[4],
>>>>>>>> +void
>>>>>>>> +pair_fusion_bb_info::do_alias_analysis (insn_info *alias_hazards[4],
>>>>>>>
>>>>>>> It would probably make more sense for this to be a (non-virtual) member
>>>>>>> function of the pair_fusion class, since it doesn't need to access the
>>>>>>> BB-specific state.
>>>>>>>
>>>>>>
>>>>>> Addressed.
>>>>>>>>  		   alias_walker *walkers[4],
>>>>>>>>  		   bool load_p)
>>>>>>>>  {
>>>>>>>>    const int n_walkers = 2 + (2 * !load_p);
>>>>>>>> -  int budget = aarch64_ldp_alias_check_limit;
>>>>>>>> +  int budget = bb_state->pair_mem_alias_check_limit ();
>>>>>>>>  
>>>>>>>>    auto next_walker = [walkers,n_walkers](int current) -> int {
>>>>>>>>      for (int j = 1; j <= n_walkers; j++)
>>>>>>>> @@ -2365,7 +2489,7 @@ get_viable_bases (insn_info *insns[2],
>>>>>>>>      {
>>>>>>>>        const bool is_lower = (i == reversed);
>>>>>>>>        poly_int64 poly_off;
>>>>>>>> -      rtx base = ldp_strip_offset (cand_mems[i], &poly_off);
>>>>>>>> +      rtx base = pair_mem_strip_offset (cand_mems[i], &poly_off);
>>>>>>>>        if (GET_RTX_CLASS (GET_CODE (XEXP (cand_mems[i], 0))) == RTX_AUTOINC)
>>>>>>>>  	writeback |= (1 << i);
>>>>>>>>  
>>>>>>>> @@ -2373,7 +2497,7 @@ get_viable_bases (insn_info *insns[2],
>>>>>>>>  	continue;
>>>>>>>>  
>>>>>>>>        // Punt on accesses relative to eliminable regs.  See the comment in
>>>>>>>> -      // ldp_bb_info::track_access for a detailed explanation of this.
>>>>>>>> +      // pair_fusion_bb_info::track_access for a detailed explanation of this.
>>>>>>>>        if (!reload_completed
>>>>>>>>  	  && (REGNO (base) == FRAME_POINTER_REGNUM
>>>>>>>>  	      || REGNO (base) == ARG_POINTER_REGNUM))
>>>>>>>> @@ -2397,7 +2521,11 @@ get_viable_bases (insn_info *insns[2],
>>>>>>>>        if (!is_lower)
>>>>>>>>  	base_off--;
>>>>>>>>  
>>>>>>>> -      if (base_off < LDP_MIN_IMM || base_off > LDP_MAX_IMM)
>>>>>>>> +      pair_fusion *pfuse;
>>>>>>>> +      aarch64_pair_fusion derived;
>>>>>>>> +      pfuse = &derived;
>>>>>>>
>>>>>>> Again, please don't do this but instead use the same instance
>>>>>>> everywhere.
>>>>>>>
>>>>>>
>>>>>> Addressed.
>>>>>>>> +
>>>>>>>> +      if (pfuse->pair_mem_in_range_p (base_off))
>>>>>>>>  	continue;
>>>>>>>>  
>>>>>>>>        use_info *use = find_access (insns[i]->uses (), REGNO (base));
>>>>>>>> @@ -2454,12 +2582,12 @@ get_viable_bases (insn_info *insns[2],
>>>>>>>>  }
>>>>>>>>  
>>>>>>>>  // Given two adjacent memory accesses of the same size, I1 and I2, try
>>>>>>>> -// and see if we can merge them into a ldp or stp.
>>>>>>>> +// and see if we can merge them into a paired accesses load and store.
>>>>>>>
>>>>>>> "into a load pair or store pair" or "into a paired access" if you
>>>>>>> prefer.
>>>>>>>
>>>>>>
>>>>>> Addressed.
>>>>>>>>  //
>>>>>>>>  // ACCESS_SIZE gives the (common) size of a single access, LOAD_P is true
>>>>>>>>  // if the accesses are both loads, otherwise they are both stores.
>>>>>>>>  bool
>>>>>>>> -ldp_bb_info::try_fuse_pair (bool load_p, unsigned access_size,
>>>>>>>> +pair_fusion_bb_info::try_fuse_pair (bool load_p, unsigned access_size,
>>>>>>>>  			    insn_info *i1, insn_info *i2)
>>>>>>>>  {
>>>>>>>>    if (dump_file)
>>>>>>>> @@ -2494,7 +2622,7 @@ ldp_bb_info::try_fuse_pair (bool load_p, unsigned access_size,
>>>>>>>>      {
>>>>>>>>        if (dump_file)
>>>>>>>>  	fprintf (dump_file,
>>>>>>>> -		 "punting on ldp due to reg conflcits (%d,%d)\n",
>>>>>>>> +		 "punting on pair mem load  due to reg conflcits (%d,%d)\n",
>>>>>>>>  		 insns[0]->uid (), insns[1]->uid ());
>>>>>>>>        return false;
>>>>>>>>      }
>>>>>>>> @@ -2843,7 +2971,7 @@ debug (const insn_list_t &l)
>>>>>>>>  // we can't re-order them anyway, so provided earlier passes have cleaned up
>>>>>>>>  // redundant loads, we shouldn't miss opportunities by doing this.
>>>>>>>>  void
>>>>>>>> -ldp_bb_info::merge_pairs (insn_list_t &left_list,
>>>>>>>> +pair_fusion_bb_info::merge_pairs (insn_list_t &left_list,
>>>>>>>>  			  insn_list_t &right_list,
>>>>>>>>  			  bool load_p,
>>>>>>>>  			  unsigned access_size)
>>>>>>>> @@ -2890,8 +3018,8 @@ ldp_bb_info::merge_pairs (insn_list_t &left_list,
>>>>>>>>  // of accesses.  If we find two sets of adjacent accesses, call
>>>>>>>>  // merge_pairs.
>>>>>>>>  void
>>>>>>>> -ldp_bb_info::transform_for_base (int encoded_lfs,
>>>>>>>> -				 access_group &group)
>>>>>>>> +pair_fusion_bb_info::transform_for_base (int encoded_lfs,
>>>>>>>> +					 access_group &group)
>>>>>>>>  {
>>>>>>>>    const auto lfs = decode_lfs (encoded_lfs);
>>>>>>>>    const unsigned access_size = lfs.size;
>>>>>>>> @@ -2909,7 +3037,7 @@ ldp_bb_info::transform_for_base (int encoded_lfs,
>>>>>>>>  		       access.cand_insns,
>>>>>>>>  		       lfs.load_p,
>>>>>>>>  		       access_size);
>>>>>>>> -	  skip_next = access.cand_insns.empty ();
>>>>>>>> +	  skip_next = bb_state->cand_insns_empty_p (access.cand_insns);
>>>>>>>
>>>>>>> As above, why is this needed?
>>>>>>
>>>>>> For rs6000 we want to always return true, as load store pairs
>>>>>> that are to be merged with 8/16, 16/32, 32/64 occur for rs6000,
>>>>>> and we want load store pairs at 8/16 and 32/64.  That's why we
>>>>>> want to always return true for rs6000, to skip pairs as above.
>>>>>
>>>>> Hmm, sorry, I'm not sure I follow.  Are you saying that for rs6000 you have
>>>>> load/store pair instructions where the two arms of the access are storing
>>>>> operands of different sizes?  Or something else?
>>>>>
>>>>> As it stands the logic is to skip the next iteration only if we
>>>>> exhausted all the candidate insns for the current access.  In the case
>>>>> that we didn't exhaust all such candidates, then the idea is that when
>>>>> access becomes prev_access, we can attempt to use those candidates as
>>>>> the "left-hand side" of a pair in the next iteration since we failed to
>>>>> use them as the "right-hand side" of a pair in the current iteration.
>>>>> I don't see why you wouldn't want that behaviour.  Please can you
>>>>> explain?
>>>>>
>>>>
>>>> In merge_pair we get the 2 load candidates, one load from offset 0 and
>>>> the other load from offset 16.  Then in the next iteration we get a load
>>>> from offset 16 and the other load from offset 32.  In the next iteration
>>>> we get a load from offset 32 and the other load from offset 48.
>>>>
>>>> For example:
>>>>
>>>> Currently we get the load candidates as follows.
>>>>
>>>> pairs:
>>>>
>>>> load from 0th offset.
>>>> load from 16th offset.
>>>>
>>>> next pairs:
>>>>
>>>> load from 16th offset.
>>>> load from 32nd offset.
>>>>
>>>> next pairs:
>>>>
>>>> load from 32nd offset
>>>> load from 48th offset.
>>>>
>>>> Instead in rs6000 we should get:
>>>>
>>>> pairs:
>>>>
>>>> load from 0th offset
>>>> load from 16th offset.
>>>>
>>>> next pairs:
>>>>
>>>> load from 32nd offset
>>>> load from 48th offset.
>>>
>>> Hmm, so then I guess my question is: why wouldn't you consider merging
>>> the pair with offsets (16,32) for rs6000?  Is it because you have a
>>> stricter alignment requirement on the base pair offsets (i.e. they have
>>> to be a multiple of 32 when the operand size is 16)?  So the pair
>>> offsets have to be a multiple of the entire pair size rather than a
>>> single operand size?
>>
>> We get a load pair at a certain program point with (0,16), and at
>> another program point we get a load pair (32,48).
>>
>> In the current implementation it takes the load offsets as (0,16),
>> (16,32), (32,48).
>>
>> But in rs6000 we want the load pairs to be merged at different points
>> as (0,16) and (32,48).  For (0,16) we want to replace the loads with
>> lxvp at offset 0, and the other loads (32,48) with lxvp at offset 32.
>>
>> In the current case it would merge into lxvp at offset 0 and lxvp at
>> offset 16, then lxvp at offset 32 and lxvp at offset 48, which is
>> incorrect in our case: in the (16,32) case the load at offset 16 will
>> not load from an even register and breaks for rs6000.
> 
> Sorry, I think I'm still missing something here.  Why does the address offset
> affect the parity of the transfer register?  ISTM they needn't be related at
> all (and indeed we can't even know the parity of the transfer register before
> RA, but perhaps you're only intending on running the pass after RA?)
> 

We have a load pair with (0,16) wherein these loads are adjacent and
replaced with lxvp.

The semantics of the lxvp instruction are that it loads the adjacent
pair into an even register and even register + 1.

We replace the above load pair with the lxvp instruction, and then we
don't need to merge (16,32), as (0,16) is already merged; instead we
merge (32,48).

Yes, you are correct: the address offset doesn't affect the parity of
the transfer register, as we are doing the fusion pass before RA.

>>> If that is the case then I think it would be better to introduce a
>>> virtual function (say pair_offset_alignment_ok_p) that vets the base
>>> offset of the pair (prev_access->offset in transform_for_base).  I guess
>>> it would also take access_size as a parameter and for aarch64 it should
>>> check:
>>>
>>>   multiple_p (offset, access_size)
>>>
>>> and for rs6000 it could check:
>>>
>>>   multiple_p (offset, access_size * 2)
>>>
>>> and we would then incorporate a call to that predicate in the else if
>>> condition of transform_for_base.
>>>
>>> It would have the same limitation whereby we assume that MEM_EXPR offset
>>> alignment is a good proxy for RTL offset alignment, but we already
>>> depend on that with the multiple_p check in track_via_mem_expr.
>>>
I have addressed the above hooks and they worked fine with both rs6000
and aarch64.  I will send a subsequent patch shortly that addresses the
above.
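
Concretely, what I now have looks roughly like this (a sketch; the
parameter types may still change, and rs6000_pair_fusion is just a
placeholder name for the rs6000 subclass):

  // Return true if the base offset OFFSET is a suitably aligned start
  // for a pair whose operands each have size ACCESS_SIZE.
  virtual bool pair_offset_alignment_ok_p (poly_int64 offset,
                                           unsigned access_size) = 0;

  bool
  aarch64_pair_fusion::pair_offset_alignment_ok_p (poly_int64 offset,
                                                   unsigned access_size)
  {
    return multiple_p (offset, access_size);
  }

  bool
  rs6000_pair_fusion::pair_offset_alignment_ok_p (poly_int64 offset,
                                                  unsigned access_size)
  {
    // lxvp pairs must start at a multiple of the whole pair size.
    return multiple_p (offset, access_size * 2);
  }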

> How do you plan on handling this even-odd requirement for rs6000?
> 

We plan to handle it with a V16QI subreg at offset 0 and a V16QI subreg
at offset 16 to generate the register pair, and that's what we generate
and implement in the rs6000 target code.
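
Roughly, the idea is the following sketch (using OOmode for the 256-bit
pair value is an assumption on my side here):

  /* PAIR_REG holds the value loaded by lxvp; the two V16QI halves are
     then just subregs at byte offsets 0 and 16.  */
  rtx pair_reg = gen_reg_rtx (OOmode);
  rtx lo = simplify_gen_subreg (V16QImode, pair_reg, OOmode, 0);
  rtx hi = simplify_gen_subreg (V16QImode, pair_reg, OOmode, 16);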

Thanks & Regards
Ajit 
> Thanks,
> Alex
> 
>>
>> lxvp should load from an even register, and the loaded value will
>> then be in the even register and even register + 1 (which is odd).
>>
>> Thanks & Regards
>> Ajit
>>> If that is the case then I think it would be better to introduce a
>>> virtual function (say pair_offset_alignment_ok_p) that vets the base
>>> offset of the pair (prev_access->offset in transform_for_base).  I guess
>>> it would also take access_size as a parameter and for aarch64 it should
>>> check:
>>>
>>>   multiple_p (offset, access_size)
>>>
>>> and for rs6000 it could check:
>>>
>>>   multiple_p (offset, access_size * 2)
>>>
>>> and we would then incorporate a call to that predicate in the else if
>>> condition of transform_for_base.
>>>
>>> It would have the same limitation whereby we assume that MEM_EXPR offset
>>> alignment is a good proxy for RTL offset alignment, but we already
>>> depend on that with the multiple_p check in track_via_mem_expr.
>>>
>>> Thanks,
>>> Alex
>>>
>>>>
>>>> Thanks & Regards
>>>> Ajit
>>>>> Thanks,
>>>>> Alex
>>>>>
>>>>>>
>>>>>>>
>>>>>>>>  	}
>>>>>>>>        prev_access = &access;
>>>>>>>>      }
>>>>>>>> @@ -2919,7 +3047,7 @@ ldp_bb_info::transform_for_base (int encoded_lfs,
>>>>>>>>  // and remove all the tombstone insns, being sure to reparent any uses
>>>>>>>>  // of mem to previous defs when we do this.
>>>>>>>>  void
>>>>>>>> -ldp_bb_info::cleanup_tombstones ()
>>>>>>>> +pair_fusion_bb_info::cleanup_tombstones ()
>>>>>>>>  {
>>>>>>>>    // No need to do anything if we didn't emit a tombstone insn for this BB.
>>>>>>>>    if (!m_emitted_tombstone)
>>>>>>>> @@ -2947,7 +3075,7 @@ ldp_bb_info::cleanup_tombstones ()
>>>>>>>>  
>>>>>>>>  template<typename Map>
>>>>>>>>  void
>>>>>>>> -ldp_bb_info::traverse_base_map (Map &map)
>>>>>>>> +pair_fusion_bb_info::traverse_base_map (Map &map)
>>>>>>>>  {
>>>>>>>>    for (auto kv : map)
>>>>>>>>      {
>>>>>>>> @@ -2958,7 +3086,7 @@ ldp_bb_info::traverse_base_map (Map &map)
>>>>>>>>  }
>>>>>>>>  
>>>>>>>>  void
>>>>>>>> -ldp_bb_info::transform ()
>>>>>>>> +pair_fusion_bb_info::transform ()
>>>>>>>>  {
>>>>>>>>    traverse_base_map (expr_map);
>>>>>>>>    traverse_base_map (def_map);
>>>>>>>> @@ -3167,14 +3295,13 @@ try_promote_writeback (insn_info *insn)
>>>>>>>>  // for load/store candidates.  If running after RA, also try and promote
>>>>>>>>  // non-writeback pairs to use writeback addressing.  Then try to fuse
>>>>>>>>  // candidates into pairs.
>>>>>>>> -void ldp_fusion_bb (bb_info *bb)
>>>>>>>> +void pair_fusion::ldp_fusion_bb (bb_info *bb)
>>>>>>>>  {
>>>>>>>> -  const bool track_loads
>>>>>>>> -    = aarch64_tune_params.ldp_policy_model != AARCH64_LDP_STP_POLICY_NEVER;
>>>>>>>> -  const bool track_stores
>>>>>>>> -    = aarch64_tune_params.stp_policy_model != AARCH64_LDP_STP_POLICY_NEVER;
>>>>>>>> +  const bool track_loads = track_load_p ();
>>>>>>>> +  const bool track_stores = track_store_p ();
>>>>>>>>  
>>>>>>>> -  ldp_bb_info bb_state (bb);
>>>>>>>
>>>>>>> This:
>>>>>>>
>>>>>>>> +  aarch64_pair_fusion derived;
>>>>>>>
>>>>>>> can be deleted and then:
>>>>>>>
>>>>>>>> +  pair_fusion_bb_info bb_info (bb, &derived);
>>>>>>>
>>>>>>> can just be:
>>>>>>>
>>>>>>>   pair_fusion_bb_info bb_info (bb, this);
>>>>>>>
>>>>>>> (or you can pass *this if you make bb_info take a reference).
>>>>>>>
>>>>>>> I don't think there's a particular need to change the variable name
>>>>>>> (bb_state -> bb_info).  I chose the former because it doesn't clash
>>>>>>> with the RTL-SSA structure of the same name as the latter.
>>>>>>>
>>>>>>
>>>>>> Addressed.
>>>>>>>>  
>>>>>>>>    for (auto insn : bb->nondebug_insns ())
>>>>>>>>      {
>>>>>>>> @@ -3184,31 +3311,31 @@ void ldp_fusion_bb (bb_info *bb)
>>>>>>>>  	continue;
>>>>>>>>  
>>>>>>>>        rtx pat = PATTERN (rti);
>>>>>>>> -      if (reload_completed
>>>>>>>> -	  && aarch64_ldp_writeback > 1
>>>>>>>> -	  && GET_CODE (pat) == PARALLEL
>>>>>>>> -	  && XVECLEN (pat, 0) == 2)
>>>>>>>> +      if (pair_mem_promote_writeback_p (pat))
>>>>>>>>  	try_promote_writeback (insn);
>>>>>>>
>>>>>>> It looks like try_promote_writeback itself will need some further work
>>>>>>> to make it target-independent.  I suppose this check:
>>>>>>>
>>>>>>>   auto rti = insn->rtl ();
>>>>>>>   const auto attr = get_attr_ldpstp (rti);
>>>>>>>   if (attr == LDPSTP_NONE)
>>>>>>>     return;
>>>>>>>
>>>>>>>   bool load_p = (attr == LDPSTP_LDP);
>>>>>>>   gcc_checking_assert (load_p || attr == LDPSTP_STP);
>>>>>>>
>>>>>>> will need to become part of the pair_mem_promote_writeback_p hook that you
>>>>>>> added, potentially changing it to return a boolean for load_p.
>>>>>>>
>>>>>>> Then I guess we will need hooks for destructuring the pair insn and
>>>>>>> another hook to wrap aarch64_gen_writeback_pair.
>>>>>>>
>>>>>>
>>>>>> Addressed.
>>>>>>>>  
>>>>>>>>        if (GET_CODE (pat) != SET)
>>>>>>>>  	continue;
>>>>>>>>  
>>>>>>>>        if (track_stores && MEM_P (XEXP (pat, 0)))
>>>>>>>> -	bb_state.track_access (insn, false, XEXP (pat, 0));
>>>>>>>> +	bb_info.track_access (insn, false, XEXP (pat, 0));
>>>>>>>>        else if (track_loads && MEM_P (XEXP (pat, 1)))
>>>>>>>> -	bb_state.track_access (insn, true, XEXP (pat, 1));
>>>>>>>> +	bb_info.track_access (insn, true, XEXP (pat, 1));
>>>>>>>>      }
>>>>>>>>  
>>>>>>>> -  bb_state.transform ();
>>>>>>>> -  bb_state.cleanup_tombstones ();
>>>>>>>> +  bb_info.transform ();
>>>>>>>> +  bb_info.cleanup_tombstones ();
>>>>>>>>  }
>>>>>>>>  
>>>>>>>>  void ldp_fusion ()
>>>>>>>>  {
>>>>>>>>    ldp_fusion_init ();
>>>>>>>> +  pair_fusion *pfuse;
>>>>>>>> +  aarch64_pair_fusion derived;
>>>>>>>> +  pfuse = &derived;
>>>>>>>
>>>>>>> This is indeed the one place where I think it is acceptable to
>>>>>>> instantiate aarch64_pair_fusion.  But again there's no need to create a
>>>>>>> pointer to the parent class, just call any function you like directly.
>>>>>>>
>>>>>>
>>>>>> Addressed.
>>>>>>>>  
>>>>>>>>    for (auto bb : crtl->ssa->bbs ())
>>>>>>>> -    ldp_fusion_bb (bb);
>>>>>>>> +    pfuse->ldp_fusion_bb (bb);
>>>>>>>
>>>>>>> I think even the code to iterate over bbs should itself be a member
>>>>>>> function of pair_fusion (say "run") and then that becomes part of the
>>>>>>> generic code.
>>>>>>>
>>>>>>> So this function would just become:
>>>>>>>
>>>>>>> aarch64_pair_fusion pass;
>>>>>>> pass.run ();
>>>>>>>
>>>>>>> and could be inlined into the caller.
>>>>>>>
>>>>>>
>>>>>> Addressed.
>>>>>>> Perhaps you could also add an early return like:
>>>>>>>
>>>>>>>   if (!track_loads_p () && !track_stores_p ())
>>>>>>>     return;
>>>>>>>
>>>>>>> in pair_fusion::run () and then remove the corresponding code from
>>>>>>> pass_ldp_fusion::gate?
>>>>>>>
>>>>>>
>>>>>> Addressed.
>>>>>>
>>>>>>>>  
>>>>>>>>    ldp_fusion_destroy ();
>>>>>>>>  }
>>>>>>>> -- 
>>>>>>>> 2.39.3
>>>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Alex
>>>>>>
>>>>>> Thanks & Regards
>>>>>> Ajit
Alex Coplan April 10, 2024, 2:22 p.m. UTC | #9
Hi Ajit,

On 10/04/2024 15:31, Ajit Agarwal wrote:
> Hello Alex:
> 
> On 10/04/24 1:42 pm, Alex Coplan wrote:
> > Hi Ajit,
> > 
> > On 09/04/2024 20:59, Ajit Agarwal wrote:
> >> Hello Alex:
> >>
> >> On 09/04/24 8:39 pm, Alex Coplan wrote:
> >>> On 09/04/2024 20:01, Ajit Agarwal wrote:
> >>>> Hello Alex:
> >>>>
> >>>> On 09/04/24 7:29 pm, Alex Coplan wrote:
> >>>>> On 09/04/2024 17:30, Ajit Agarwal wrote:
> >>>>>>
> >>>>>>
> >>>>>> On 05/04/24 10:03 pm, Alex Coplan wrote:
> >>>>>>> On 05/04/2024 13:53, Ajit Agarwal wrote:
> >>>>>>>> Hello Alex/Richard:
> >>>>>>>>
> >>>>>>>> All review comments are incorporated.
<snip>
> >>>>>>>> @@ -2890,8 +3018,8 @@ ldp_bb_info::merge_pairs (insn_list_t &left_list,
> >>>>>>>>  // of accesses.  If we find two sets of adjacent accesses, call
> >>>>>>>>  // merge_pairs.
> >>>>>>>>  void
> >>>>>>>> -ldp_bb_info::transform_for_base (int encoded_lfs,
> >>>>>>>> -				 access_group &group)
> >>>>>>>> +pair_fusion_bb_info::transform_for_base (int encoded_lfs,
> >>>>>>>> +					 access_group &group)
> >>>>>>>>  {
> >>>>>>>>    const auto lfs = decode_lfs (encoded_lfs);
> >>>>>>>>    const unsigned access_size = lfs.size;
> >>>>>>>> @@ -2909,7 +3037,7 @@ ldp_bb_info::transform_for_base (int encoded_lfs,
> >>>>>>>>  		       access.cand_insns,
> >>>>>>>>  		       lfs.load_p,
> >>>>>>>>  		       access_size);
> >>>>>>>> -	  skip_next = access.cand_insns.empty ();
> >>>>>>>> +	  skip_next = bb_state->cand_insns_empty_p (access.cand_insns);
> >>>>>>>
> >>>>>>> As above, why is this needed?
> >>>>>>
> >>>>>> For rs6000 we want to always return true, as load store pairs
> >>>>>> that are to be merged with 8/16, 16/32, 32/64 are occurring for rs6000,
> >>>>>> and we want load store pairs at 8/16 and 32/64. That's why we want
> >>>>>> to always return true for rs6000, to skip pairs as above.
> >>>>>
> >>>>> Hmm, sorry, I'm not sure I follow.  Are you saying that for rs6000 you have
> >>>>> load/store pair instructions where the two arms of the access are storing
> >>>>> operands of different sizes?  Or something else?
> >>>>>
> >>>>> As it stands the logic is to skip the next iteration only if we
> >>>>> exhausted all the candidate insns for the current access.  In the case
> >>>>> that we didn't exhaust all such candidates, then the idea is that when
> >>>>> access becomes prev_access, we can attempt to use those candidates as
> >>>>> the "left-hand side" of a pair in the next iteration since we failed to
> >>>>> use them as the "right-hand side" of a pair in the current iteration.
> >>>>> I don't see why you wouldn't want that behaviour.  Please can you
> >>>>> explain?
> >>>>>
> >>>>
> >>>> In merge_pairs we get the two load candidates: one load from offset 0 and
> >>>> the other load from offset 16. Then in the next iteration we get a load
> >>>> from offset 16 and the other load from offset 32. In the next iteration
> >>>> we get a load from offset 32 and the other load from offset 48.
> >>>>
> >>>> For example:
> >>>>
> >>>> Currently we get the load candidates as follows.
> >>>>
> >>>> pairs:
> >>>>
> >>>> load from 0th offset.
> >>>> load from 16th offset.
> >>>>
> >>>> next pairs:
> >>>>
> >>>> load from 16th offset.
> >>>> load from 32nd offset.
> >>>>
> >>>> next pairs:
> >>>>
> >>>> load from 32nd offset.
> >>>> load from 48th offset.
> >>>>
> >>>> Instead in rs6000 we should get:
> >>>>
> >>>> pairs:
> >>>>
> >>>> load from 0th offset.
> >>>> load from 16th offset.
> >>>>
> >>>> next pairs:
> >>>>
> >>>> load from 32nd offset.
> >>>> load from 48th offset.
> >>>
> >>> Hmm, so then I guess my question is: why wouldn't you consider merging
> >>> the pair with offsets (16,32) for rs6000?  Is it because you have a
> >>> stricter alignment requirement on the base pair offsets (i.e. they have
> >>> to be a multiple of 32 when the operand size is 16)?  So the pair
> >>> offsets have to be a multiple of the entire pair size rather than a
> >>> single operand size?
> >>
> >> We get a load pair at a certain program point with (0,16), and at another
> >> program point we get a load pair (32, 48).
> >>
> >> In the current implementation it takes the load offsets as (0, 16),
> >> (16, 32), (32, 48).
> >>
> >> But in rs6000 we want the load pairs to be merged at different points,
> >> as (0,16) and (32, 48): for (0,16) we want to replace the loads with
> >> lxvp at offset 0, and the other loads (32, 48) with lxvp at offset 32.
> >>
> >> In the current case it will merge into lxvp at offset 0 and lxvp at
> >> offset 16, then lxvp at offset 32 and lxvp at offset 48, which is
> >> incorrect in our case, as in the (16,32) case the 16 offset will not
> >> load from an even register and breaks for rs6000.
> > 
> > Sorry, I think I'm still missing something here.  Why does the address offset
> > affect the parity of the transfer register?  ISTM they needn't be related at
> > all (and indeed we can't even know the parity of the transfer register before
> > RA, but perhaps you're only intending on running the pass after RA?)
> > 
> 
> >> We have a load pair with (0,16) wherein these loads are adjacent and
> >> replaced with lxvp.
> >>
> >> The semantics of the lxvp instruction are that it loads an adjacent load
> >> pair into an even register and even_register + 1.
> >>
> >> We replace the above load pair with the lxvp instruction, and then we
> >> don't need to merge (16,32), as (0, 16) is already merged; instead
> >> we merge (32,48).

Ok, but the existing logic should already account for this.  I.e. if we
successfully merge (0,16), then we don't attempt to merge (16,32).  We'd only
attempt to merge (16,32) if the merge of (0,16) failed (for whatever reason).
So I don't see that there's anything specific to lxvp that requires this logic
to change, _unless_ you have a stricter alignment requirement on the offsets as
I mentioned before.
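
To spell that out, here is a tiny standalone sketch (plain C++, not GCC
code) of the pairing walk over accesses at offsets 0, 16, 32, 48, on the
assumption that every merge succeeds:

  #include <cstdio>
  #include <vector>

  int main ()
  {
    std::vector<int> offsets = {0, 16, 32, 48};
    bool skip_next = false;
    for (size_t i = 1; i < offsets.size (); i++)
      {
        if (skip_next)
          {
            // The previous access was consumed as the second arm of a
            // pair, so it can't start a new one.
            skip_next = false;
            continue;
          }
        // Try to fuse (offsets[i-1], offsets[i]); assume it succeeds,
        // i.e. all candidate insns are used up.
        printf ("merge (%d,%d)\n", offsets[i - 1], offsets[i]);
        skip_next = true;
      }
    return 0;
  }

This prints merge (0,16) followed by merge (32,48), i.e. exactly the
pairing you want for lxvp.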

> 
> Yes, you are correct, the address offset doesn't affect the parity of
> the register transfer, as we are doing the fusion pass before RA.
> >>> If that is the case then I think it would be better to introduce a
> >>> virtual function (say pair_offset_alignment_ok_p) that vets the base
> >>> offset of the pair (prev_access->offset in transform_for_base).  I guess
> >>> it would also take access_size as a parameter and for aarch64 it should
> >>> check:
> >>>
> >>>   multiple_p (offset, access_size)
> >>>
> >>> and for rs6000 it could check:
> >>>
> >>>   multiple_p (offset, access_size * 2)
> >>>
> >>> and we would then incorporate a call to that predicate in the else if
> >>> condition of transform_for_base.
> >>>
> >>> It would have the same limitation whereby we assume that MEM_EXPR offset
> >>> alignment is a good proxy for RTL offset alignment, but we already
> >>> depend on that with the multiple_p check in track_via_mem_expr.
> >>>
> I have addressed the above hooks and it worked fine with both rs6000
> and aarch64. I am sending a subsequent patch shortly that addresses the
> above.
> 
> > How do you plan on handling this even-odd requirement for rs6000?
> > 
> 
> We plan to handle this with V16QI subreg: 0 and V16QI subreg: 16 to
> generate the register pair, and that's what we generate and implement
> in the rs6000 target code.

Ah, this is coming back to me now.  Sorry, I should have remembered this from
the previous discussion with Richard S.

Apologies for going on a slight tangent here, but if you're running
before RA are you planning to create a new OImode pseudo register for
the lxvp insn and then somehow update uses of the old transfer registers
to replace them with subregs of that OImode pseudo?  Or do you just plan
on replacing the individual loads with moves (instead of deleting them)?
I guess the latter would be simpler and might work better in the
presence of hard regs.

Thanks,
Alex

<snip>
Ajit Agarwal April 10, 2024, 6:18 p.m. UTC | #10
Hello Alex:

On 10/04/24 7:52 pm, Alex Coplan wrote:
> Hi Ajit,
> 
> On 10/04/2024 15:31, Ajit Agarwal wrote:
>> Hello Alex:
>>
>> On 10/04/24 1:42 pm, Alex Coplan wrote:
>>> Hi Ajit,
>>>
>>> On 09/04/2024 20:59, Ajit Agarwal wrote:
>>>> Hello Alex:
>>>>
>>>> On 09/04/24 8:39 pm, Alex Coplan wrote:
>>>>> On 09/04/2024 20:01, Ajit Agarwal wrote:
>>>>>> Hello Alex:
>>>>>>
>>>>>> On 09/04/24 7:29 pm, Alex Coplan wrote:
>>>>>>> On 09/04/2024 17:30, Ajit Agarwal wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 05/04/24 10:03 pm, Alex Coplan wrote:
>>>>>>>>> On 05/04/2024 13:53, Ajit Agarwal wrote:
>>>>>>>>>> Hello Alex/Richard:
>>>>>>>>>>
>>>>>>>>>> All review comments are incorporated.
> <snip>
>>>>>>>>>> @@ -2890,8 +3018,8 @@ ldp_bb_info::merge_pairs (insn_list_t &left_list,
>>>>>>>>>>  // of accesses.  If we find two sets of adjacent accesses, call
>>>>>>>>>>  // merge_pairs.
>>>>>>>>>>  void
>>>>>>>>>> -ldp_bb_info::transform_for_base (int encoded_lfs,
>>>>>>>>>> -				 access_group &group)
>>>>>>>>>> +pair_fusion_bb_info::transform_for_base (int encoded_lfs,
>>>>>>>>>> +					 access_group &group)
>>>>>>>>>>  {
>>>>>>>>>>    const auto lfs = decode_lfs (encoded_lfs);
>>>>>>>>>>    const unsigned access_size = lfs.size;
>>>>>>>>>> @@ -2909,7 +3037,7 @@ ldp_bb_info::transform_for_base (int encoded_lfs,
>>>>>>>>>>  		       access.cand_insns,
>>>>>>>>>>  		       lfs.load_p,
>>>>>>>>>>  		       access_size);
>>>>>>>>>> -	  skip_next = access.cand_insns.empty ();
>>>>>>>>>> +	  skip_next = bb_state->cand_insns_empty_p (access.cand_insns);
>>>>>>>>>
>>>>>>>>> As above, why is this needed?
>>>>>>>>
>>>>>>>> For rs6000 we want to always return true, as load store pairs
>>>>>>>> that are to be merged with 8/16, 16/32, 32/64 are occurring for rs6000,
>>>>>>>> and we want load store pairs at 8/16 and 32/64. That's why we want
>>>>>>>> to always return true for rs6000, to skip pairs as above.
>>>>>>>
>>>>>>> Hmm, sorry, I'm not sure I follow.  Are you saying that for rs6000 you have
>>>>>>> load/store pair instructions where the two arms of the access are storing
>>>>>>> operands of different sizes?  Or something else?
>>>>>>>
>>>>>>> As it stands the logic is to skip the next iteration only if we
>>>>>>> exhausted all the candidate insns for the current access.  In the case
>>>>>>> that we didn't exhaust all such candidates, then the idea is that when
>>>>>>> access becomes prev_access, we can attempt to use those candidates as
>>>>>>> the "left-hand side" of a pair in the next iteration since we failed to
>>>>>>> use them as the "right-hand side" of a pair in the current iteration.
>>>>>>> I don't see why you wouldn't want that behaviour.  Please can you
>>>>>>> explain?
>>>>>>>
>>>>>>
>>>>>> In merge_pairs we get the two load candidates: one load from offset 0 and
>>>>>> the other load from offset 16. Then in the next iteration we get a load
>>>>>> from offset 16 and the other load from offset 32. In the next iteration
>>>>>> we get a load from offset 32 and the other load from offset 48.
>>>>>>
>>>>>> For example:
>>>>>>
>>>>>> Currently we get the load candidates as follows.
>>>>>>
>>>>>> pairs:
>>>>>>
>>>>>> load from 0th offset.
>>>>>> load from 16th offset.
>>>>>>
>>>>>> next pairs:
>>>>>>
>>>>>> load from 16th offset.
>>>>>> load from 32nd offset.
>>>>>>
>>>>>> next pairs:
>>>>>>
>>>>>> load from 32nd offset.
>>>>>> load from 48th offset.
>>>>>>
>>>>>> Instead in rs6000 we should get:
>>>>>>
>>>>>> pairs:
>>>>>>
>>>>>> load from 0th offset.
>>>>>> load from 16th offset.
>>>>>>
>>>>>> next pairs:
>>>>>>
>>>>>> load from 32nd offset.
>>>>>> load from 48th offset.
>>>>>
>>>>> Hmm, so then I guess my question is: why wouldn't you consider merging
>>>>> the pair with offsets (16,32) for rs6000?  Is it because you have a
>>>>> stricter alignment requirement on the base pair offsets (i.e. they have
>>>>> to be a multiple of 32 when the operand size is 16)?  So the pair
>>>>> offsets have to be a multiple of the entire pair size rather than a
>>>>> single operand size?
>>>>
>>>> We get a load pair at a certain program point with (0,16), and at another
>>>> program point we get a load pair (32, 48).
>>>>
>>>> In the current implementation it takes the load offsets as (0, 16),
>>>> (16, 32), (32, 48).
>>>>
>>>> But in rs6000 we want the load pairs to be merged at different points,
>>>> as (0,16) and (32, 48): for (0,16) we want to replace the loads with
>>>> lxvp at offset 0, and the other loads (32, 48) with lxvp at offset 32.
>>>>
>>>> In the current case it will merge into lxvp at offset 0 and lxvp at
>>>> offset 16, then lxvp at offset 32 and lxvp at offset 48, which is
>>>> incorrect in our case, as in the (16,32) case the 16 offset will not
>>>> load from an even register and breaks for rs6000.
>>>
>>> Sorry, I think I'm still missing something here.  Why does the address offset
>>> affect the parity of the transfer register?  ISTM they needn't be related at
>>> all (and indeed we can't even know the parity of the transfer register before
>>> RA, but perhaps you're only intending on running the pass after RA?)
>>>
>>
>> We have a load pair with (0,16) wherein these loads are adjacent and
>> replaced with lxvp.
>>
>> The semantics of the lxvp instruction are that it loads an adjacent load
>> pair into an even register and even_register + 1.
>>
>> We replace the above load pair with the lxvp instruction, and then we
>> don't need to merge (16,32), as (0, 16) is already merged; instead
>> we merge (32,48).
> 
> Ok, but the existing logic should already account for this.  I.e. if we
> successfully merge (0,16), then we don't attempt to merge (16,32).  We'd only
> attempt to merge (16,32) if the merge of (0,16) failed (for whatever reason).
> So I don't see that there's anything specific to lxvp that requires this logic
> to change, _unless_ you have a stricter alignment requirement on the offsets as
> I mentioned before.
> 

Thanks for the suggestion. It worked for rs6000 as well with the current changes.
Sorry for the confusion.

>>
>> Yes, you are correct, the address offset doesn't affect the parity of
>> the register transfer, as we are doing the fusion pass before RA.
>>>>> If that is the case then I think it would be better to introduce a
>>>>> virtual function (say pair_offset_alignment_ok_p) that vets the base
>>>>> offset of the pair (prev_access->offset in transform_for_base).  I guess
>>>>> it would also take access_size as a parameter and for aarch64 it should
>>>>> check:
>>>>>
>>>>>   multiple_p (offset, access_size)
>>>>>
>>>>> and for rs6000 it could check:
>>>>>
>>>>>   multiple_p (offset, access_size * 2)
>>>>>
>>>>> and we would then incorporate a call to that predicate in the else if
>>>>> condition of transform_for_base.
>>>>>
>>>>> It would have the same limitation whereby we assume that MEM_EXPR offset
>>>>> alignment is a good proxy for RTL offset alignment, but we already
>>>>> depend on that with the multiple_p check in track_via_mem_expr.
>>>>>
>> I have addressed the above hooks and it worked fine with both rs6000
>> and aarch64. I am sending a subsequent patch shortly that addresses the
>> above.
>>
>>> How do you plan on handling this even-odd requirement for rs6000?
>>>
>>
>> We plan to handle this with V16QI subreg: 0 and V16QI subreg: 16 to
>> generate the register pair, and that's what we generate and implement
>> in the rs6000 target code.
> 
> Ah, this is coming back to me now.  Sorry, I should have remembered this from
> the previous discussion with Richard S.
> 
> Apologies for going on a slight tangent here, but if you're running
> before RA are you planning to create a new OImode pseudo register for
> the lxvp insn and then somehow update uses of the old transfer registers
> to replace them with subregs of that OImode pseudo?  

Yes, I do the same as you have mentioned. We generate register pairs
in 256-bit mode, with two 128-bit subregs at offsets 0 and 16.

> Or do you just plan
> on replacing the individual loads with moves (instead of deleting them)?
> I guess the latter would be simpler and might work better in the
> presence of hard regs.
>

Would you mind explaining how to generate register pairs with lxvp by
replacing loads with moves?

Thanks & Regards
Ajit
 
> Thanks,
> Alex
<snip>
Alex Coplan April 11, 2024, 2:25 p.m. UTC | #11
On 10/04/2024 23:48, Ajit Agarwal wrote:
> Hello Alex:
> 
> On 10/04/24 7:52 pm, Alex Coplan wrote:
> > Hi Ajit,
> > 
> > On 10/04/2024 15:31, Ajit Agarwal wrote:
> >> Hello Alex:
> >>
> >> On 10/04/24 1:42 pm, Alex Coplan wrote:
> >>> Hi Ajit,
> >>>
> >>> On 09/04/2024 20:59, Ajit Agarwal wrote:
> >>>> Hello Alex:
> >>>>
> >>>> On 09/04/24 8:39 pm, Alex Coplan wrote:
> >>>>> On 09/04/2024 20:01, Ajit Agarwal wrote:
> >>>>>> Hello Alex:
> >>>>>>
> >>>>>> On 09/04/24 7:29 pm, Alex Coplan wrote:
> >>>>>>> On 09/04/2024 17:30, Ajit Agarwal wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On 05/04/24 10:03 pm, Alex Coplan wrote:
> >>>>>>>>> On 05/04/2024 13:53, Ajit Agarwal wrote:
> >>>>>>>>>> Hello Alex/Richard:
> >>>>>>>>>>
> >>>>>>>>>> All review comments are incorporated.
> > <snip>
> >>>>>>>>>> @@ -2890,8 +3018,8 @@ ldp_bb_info::merge_pairs (insn_list_t &left_list,
> >>>>>>>>>>  // of accesses.  If we find two sets of adjacent accesses, call
> >>>>>>>>>>  // merge_pairs.
> >>>>>>>>>>  void
> >>>>>>>>>> -ldp_bb_info::transform_for_base (int encoded_lfs,
> >>>>>>>>>> -				 access_group &group)
> >>>>>>>>>> +pair_fusion_bb_info::transform_for_base (int encoded_lfs,
> >>>>>>>>>> +					 access_group &group)
> >>>>>>>>>>  {
> >>>>>>>>>>    const auto lfs = decode_lfs (encoded_lfs);
> >>>>>>>>>>    const unsigned access_size = lfs.size;
> >>>>>>>>>> @@ -2909,7 +3037,7 @@ ldp_bb_info::transform_for_base (int encoded_lfs,
> >>>>>>>>>>  		       access.cand_insns,
> >>>>>>>>>>  		       lfs.load_p,
> >>>>>>>>>>  		       access_size);
> >>>>>>>>>> -	  skip_next = access.cand_insns.empty ();
> >>>>>>>>>> +	  skip_next = bb_state->cand_insns_empty_p (access.cand_insns);
> >>>>>>>>>
> >>>>>>>>> As above, why is this needed?
> >>>>>>>>
> >>>>>>>> For rs6000 we want to always return true, as load store pairs
> >>>>>>>> that are to be merged with 8/16, 16/32, 32/64 are occurring for rs6000,
> >>>>>>>> and we want load store pairs at 8/16 and 32/64. That's why we want
> >>>>>>>> to always return true for rs6000, to skip pairs as above.
> >>>>>>>
> >>>>>>> Hmm, sorry, I'm not sure I follow.  Are you saying that for rs6000 you have
> >>>>>>> load/store pair instructions where the two arms of the access are storing
> >>>>>>> operands of different sizes?  Or something else?
> >>>>>>>
> >>>>>>> As it stands the logic is to skip the next iteration only if we
> >>>>>>> exhausted all the candidate insns for the current access.  In the case
> >>>>>>> that we didn't exhaust all such candidates, then the idea is that when
> >>>>>>> access becomes prev_access, we can attempt to use those candidates as
> >>>>>>> the "left-hand side" of a pair in the next iteration since we failed to
> >>>>>>> use them as the "right-hand side" of a pair in the current iteration.
> >>>>>>> I don't see why you wouldn't want that behaviour.  Please can you
> >>>>>>> explain?
> >>>>>>>
> >>>>>>
> >>>>>> In merge_pairs we get the two load candidates: one load from offset 0 and
> >>>>>> the other load from offset 16. Then in the next iteration we get a load
> >>>>>> from offset 16 and the other load from offset 32. In the next iteration
> >>>>>> we get a load from offset 32 and the other load from offset 48.
> >>>>>>
> >>>>>> For example:
> >>>>>>
> >>>>>> Currently we get the load candidates as follows.
> >>>>>>
> >>>>>> pairs:
> >>>>>>
> >>>>>> load from 0th offset.
> >>>>>> load from 16th offset.
> >>>>>>
> >>>>>> next pairs:
> >>>>>>
> >>>>>> load from 16th offset.
> >>>>>> load from 32nd offset.
> >>>>>>
> >>>>>> next pairs:
> >>>>>>
> >>>>>> load from 32nd offset.
> >>>>>> load from 48th offset.
> >>>>>>
> >>>>>> Instead in rs6000 we should get:
> >>>>>>
> >>>>>> pairs:
> >>>>>>
> >>>>>> load from 0th offset.
> >>>>>> load from 16th offset.
> >>>>>>
> >>>>>> next pairs:
> >>>>>>
> >>>>>> load from 32nd offset.
> >>>>>> load from 48th offset.
> >>>>>
> >>>>> Hmm, so then I guess my question is: why wouldn't you consider merging
> >>>>> the pair with offsets (16,32) for rs6000?  Is it because you have a
> >>>>> stricter alignment requirement on the base pair offsets (i.e. they have
> >>>>> to be a multiple of 32 when the operand size is 16)?  So the pair
> >>>>> offsets have to be a multiple of the entire pair size rather than a
> >>>>> single operand size?
> >>>>
> >>>> We get a load pair at a certain program point with (0,16), and at another
> >>>> program point we get a load pair (32, 48).
> >>>>
> >>>> In the current implementation it takes the load offsets as (0, 16),
> >>>> (16, 32), (32, 48).
> >>>>
> >>>> But in rs6000 we want the load pairs to be merged at different points,
> >>>> as (0,16) and (32, 48): for (0,16) we want to replace the loads with
> >>>> lxvp at offset 0, and the other loads (32, 48) with lxvp at offset 32.
> >>>>
> >>>> In the current case it will merge into lxvp at offset 0 and lxvp at
> >>>> offset 16, then lxvp at offset 32 and lxvp at offset 48, which is
> >>>> incorrect in our case, as in the (16,32) case the 16 offset will not
> >>>> load from an even register and breaks for rs6000.
> >>>
> >>> Sorry, I think I'm still missing something here.  Why does the address offset
> >>> affect the parity of the transfer register?  ISTM they needn't be related at
> >>> all (and indeed we can't even know the parity of the transfer register before
> >>> RA, but perhaps you're only intending on running the pass after RA?)
> >>>
> >>
> >> We have a load pair with (0,16) wherein these loads are adjacent and
> >> replaced with lxvp.
> >>
> >> The semantics of the lxvp instruction are that it loads an adjacent load
> >> pair into an even register and even_register + 1.
> >>
> >> We replace the above load pair with the lxvp instruction, and then we
> >> don't need to merge (16,32), as (0, 16) is already merged; instead
> >> we merge (32,48).
> > 
> > Ok, but the existing logic should already account for this.  I.e. if we
> > successfully merge (0,16), then we don't attempt to merge (16,32).  We'd only
> > attempt to merge (16,32) if the merge of (0,16) failed (for whatever reason).
> > So I don't see that there's anything specific to lxvp that requires this logic
> > to change, _unless_ you have a stricter alignment requirement on the offsets as
> > I mentioned before.
> > 
> 
> Thanks for the suggestion. It worked for rs6000 as well with the current changes.
> Sorry for the confusion.

Alright, glad we got to the bottom of this!

> 
> >>
> >> Yes, you are correct, the address offset doesn't affect the parity of
> >> the register transfer, as we are doing the fusion pass before RA.
> >>>>> If that is the case then I think it would be better to introduce a
> >>>>> virtual function (say pair_offset_alignment_ok_p) that vets the base
> >>>>> offset of the pair (prev_access->offset in transform_for_base).  I guess
> >>>>> it would also take access_size as a parameter and for aarch64 it should
> >>>>> check:
> >>>>>
> >>>>>   multiple_p (offset, access_size)
> >>>>>
> >>>>> and for rs6000 it could check:
> >>>>>
> >>>>>   multiple_p (offset, access_size * 2)
> >>>>>
> >>>>> and we would then incorporate a call to that predicate in the else if
> >>>>> condition of transform_for_base.
> >>>>>
> >>>>> It would have the same limitation whereby we assume that MEM_EXPR offset
> >>>>> alignment is a good proxy for RTL offset alignment, but we already
> >>>>> depend on that with the multiple_p check in track_via_mem_expr.
> >>>>>
> >> I have addressed the above hooks and it worked fine with both rs6000
> >> and aarch64. I am sending a subsequent patch shortly that addresses the
> >> above.
> >>
> >>> How do you plan on handling this even-odd requirement for rs6000?
> >>>
> >>
> >> We plan to handle this with V16QI subreg: 0 and V16QI subreg: 16 to
> >> generate the register pair, and that's what we generate and implement
> >> in the rs6000 target code.
> > 
> > Ah, this is coming back to me now.  Sorry, I should have remembered this from
> > the previous discussion with Richard S.
> > 
> > Apologies for going on a slight tangent here, but if you're running
> > before RA are you planning to create a new OImode pseudo register for
> > the lxvp insn and then somehow update uses of the old transfer registers
> > to replace them with subregs of that OImode pseudo?  
> 
> Yes, I do the same as you have mentioned. We generate register pairs
> in 256-bit mode, with two 128-bit subregs at offsets 0 and 16.
> 
> > Or do you just plan
> > on replacing the individual loads with moves (instead of deleting them)?
> > I guess the latter would be simpler and might work better in the
> > presence of hard regs.
> >
> 
> Would you mind explaining how to generate register pairs with lxvp by
> replacing loads with moves?

Yeah, so suppose you have something like:

(set (reg:V16QI v1) (mem:V16QI addr))
(set (reg:V16QI v2) (mem:V16QI addr+16))

then when you insert the lxvp you can (logically) replace the
original load instructions with moves from the appropriate subreg, as
follows:

(set (reg:OI pair-pseudo) (mem:OI addr)) ; lxvp
(set (reg:V16QI v1) (subreg:V16QI (reg:OI pair-pseudo) 0))
(set (reg:V16QI v2) (subreg:V16QI (reg:OI pair-pseudo) 16))

Now, I'm not sure off-hand if this exact combination of subregs and mode
changes is valid, but hopefully you get the idea.  The benefit of this
approach is that it keeps the transformation local and is more
compile-time efficient (we don't have to look through all mentions of
v1/v2 and replace them with subregs).  I'd hope that the RA can then
clean up any redundant moves (especially in the case that v1/v2 are
pseudos).

That would mean that you don't need all the grubbing around with DF
looking for occurrences of the transfer regs.  Instead we'd just need
some logic to insert the extra moves after the lxvp for rs6000, and I
think this might fit more cleanly into the current pass.
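
Roughly (an illustrative sketch only; pair_dest, v1/v2 and load1/load2
are placeholder names for the new OImode pseudo, the original transfer
regs and the original load insns):

  rtx lo = gen_rtx_SUBREG (V16QImode, pair_dest, 0);
  rtx hi = gen_rtx_SUBREG (V16QImode, pair_dest, 16);
  /* Turn the original loads into reg-reg moves in place; the RA
     should be able to coalesce these when v1/v2 are pseudos.  */
  validate_change (load1, &PATTERN (load1), gen_rtx_SET (v1, lo), false);
  validate_change (load2, &PATTERN (load2), gen_rtx_SET (v2, hi), false);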

Does that make sense?  In any case, this shouldn't really affect the
preparatory aarch64 patch because we were planning to defer adding any
hooks that are only needed for rs6000 from the initial aarch64/generic
split.

Thanks,
Alex

<snip>
Ajit Agarwal April 12, 2024, 2:32 p.m. UTC | #12
Hello Alex:

On 11/04/24 7:55 pm, Alex Coplan wrote:
> On 10/04/2024 23:48, Ajit Agarwal wrote:
>> Hello Alex:
>>
>> On 10/04/24 7:52 pm, Alex Coplan wrote:
>>> Hi Ajit,
>>>
>>> On 10/04/2024 15:31, Ajit Agarwal wrote:
>>>> Hello Alex:
>>>>
>>>> On 10/04/24 1:42 pm, Alex Coplan wrote:
>>>>> Hi Ajit,
>>>>>
>>>>> On 09/04/2024 20:59, Ajit Agarwal wrote:
>>>>>> Hello Alex:
>>>>>>
>>>>>> On 09/04/24 8:39 pm, Alex Coplan wrote:
>>>>>>> On 09/04/2024 20:01, Ajit Agarwal wrote:
>>>>>>>> Hello Alex:
>>>>>>>>
>>>>>>>> On 09/04/24 7:29 pm, Alex Coplan wrote:
>>>>>>>>> On 09/04/2024 17:30, Ajit Agarwal wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 05/04/24 10:03 pm, Alex Coplan wrote:
>>>>>>>>>>> On 05/04/2024 13:53, Ajit Agarwal wrote:
>>>>>>>>>>>> Hello Alex/Richard:
>>>>>>>>>>>>
>>>>>>>>>>>> All review comments are incorporated.
>>> <snip>
>>>>>>>>>>>> @@ -2890,8 +3018,8 @@ ldp_bb_info::merge_pairs (insn_list_t &left_list,
>>>>>>>>>>>>  // of accesses.  If we find two sets of adjacent accesses, call
>>>>>>>>>>>>  // merge_pairs.
>>>>>>>>>>>>  void
>>>>>>>>>>>> -ldp_bb_info::transform_for_base (int encoded_lfs,
>>>>>>>>>>>> -				 access_group &group)
>>>>>>>>>>>> +pair_fusion_bb_info::transform_for_base (int encoded_lfs,
>>>>>>>>>>>> +					 access_group &group)
>>>>>>>>>>>>  {
>>>>>>>>>>>>    const auto lfs = decode_lfs (encoded_lfs);
>>>>>>>>>>>>    const unsigned access_size = lfs.size;
>>>>>>>>>>>> @@ -2909,7 +3037,7 @@ ldp_bb_info::transform_for_base (int encoded_lfs,
>>>>>>>>>>>>  		       access.cand_insns,
>>>>>>>>>>>>  		       lfs.load_p,
>>>>>>>>>>>>  		       access_size);
>>>>>>>>>>>> -	  skip_next = access.cand_insns.empty ();
>>>>>>>>>>>> +	  skip_next = bb_state->cand_insns_empty_p (access.cand_insns);
>>>>>>>>>>>
>>>>>>>>>>> As above, why is this needed?
>>>>>>>>>>
>>>>>>>>>> For rs6000 we want to always return true, as load store pairs
>>>>>>>>>> that are to be merged with 8/16, 16/32, 32/64 are occurring for rs6000,
>>>>>>>>>> and we want load store pairs at 8/16 and 32/64. That's why we want
>>>>>>>>>> to always return true for rs6000, to skip pairs as above.
>>>>>>>>>
>>>>>>>>> Hmm, sorry, I'm not sure I follow.  Are you saying that for rs6000 you have
>>>>>>>>> load/store pair instructions where the two arms of the access are storing
>>>>>>>>> operands of different sizes?  Or something else?
>>>>>>>>>
>>>>>>>>> As it stands the logic is to skip the next iteration only if we
>>>>>>>>> exhausted all the candidate insns for the current access.  In the case
>>>>>>>>> that we didn't exhaust all such candidates, then the idea is that when
>>>>>>>>> access becomes prev_access, we can attempt to use those candidates as
>>>>>>>>> the "left-hand side" of a pair in the next iteration since we failed to
>>>>>>>>> use them as the "right-hand side" of a pair in the current iteration.
>>>>>>>>> I don't see why you wouldn't want that behaviour.  Please can you
>>>>>>>>> explain?
>>>>>>>>>
>>>>>>>>
>>>>>>>> In merge_pairs we get the two load candidates: one load from offset 0 and
>>>>>>>> the other load from offset 16. Then in the next iteration we get a load
>>>>>>>> from offset 16 and the other load from offset 32. In the next iteration
>>>>>>>> we get a load from offset 32 and the other load from offset 48.
>>>>>>>>
>>>>>>>> For example:
>>>>>>>>
>>>>>>>> Currently we get the load candidates as follows.
>>>>>>>>
>>>>>>>> pairs:
>>>>>>>>
>>>>>>>> load from 0th offset.
>>>>>>>> load from 16th offset.
>>>>>>>>
>>>>>>>> next pairs:
>>>>>>>>
>>>>>>>> load from 16th offset.
>>>>>>>> load from 32nd offset.
>>>>>>>>
>>>>>>>> next pairs:
>>>>>>>>
>>>>>>>> load from 32nd offset.
>>>>>>>> load from 48th offset.
>>>>>>>>
>>>>>>>> Instead in rs6000 we should get:
>>>>>>>>
>>>>>>>> pairs:
>>>>>>>>
>>>>>>>> load from 0th offset.
>>>>>>>> load from 16th offset.
>>>>>>>>
>>>>>>>> next pairs:
>>>>>>>>
>>>>>>>> load from 32nd offset.
>>>>>>>> load from 48th offset.
>>>>>>>
>>>>>>> Hmm, so then I guess my question is: why wouldn't you consider merging
>>>>>>> the pair with offsets (16,32) for rs6000?  Is it because you have a
>>>>>>> stricter alignment requirement on the base pair offsets (i.e. they have
>>>>>>> to be a multiple of 32 when the operand size is 16)?  So the pair
>>>>>>> offsets have to be a multiple of the entire pair size rather than a
>>>>>>> single operand size?
>>>>>>
>>>>>> We get a load pair at a certain program point with (0,16), and at another
>>>>>> program point we get a load pair (32, 48).
>>>>>>
>>>>>> In the current implementation it takes the load offsets as (0, 16),
>>>>>> (16, 32), (32, 48).
>>>>>>
>>>>>> But in rs6000 we want the load pairs to be merged at different points,
>>>>>> as (0,16) and (32, 48): for (0,16) we want to replace the loads with
>>>>>> lxvp at offset 0, and the other loads (32, 48) with lxvp at offset 32.
>>>>>>
>>>>>> In the current case it will merge into lxvp at offset 0 and lxvp at
>>>>>> offset 16, then lxvp at offset 32 and lxvp at offset 48, which is
>>>>>> incorrect in our case, as in the (16,32) case the 16 offset will not
>>>>>> load from an even register and breaks for rs6000.
>>>>>
>>>>> Sorry, I think I'm still missing something here.  Why does the address offset
>>>>> affect the parity of the transfer register?  ISTM they needn't be related at
>>>>> all (and indeed we can't even know the parity of the transfer register before
>>>>> RA, but perhaps you're only intending on running the pass after RA?)
>>>>>
>>>>
>>>> We have a load pair with (0,16) wherein these loads are adjacent and
>>>> replaced with lxvp.
>>>>
>>>> The semantics of the lxvp instruction are that it loads the adjacent
>>>> load pair into an even register and even_register + 1.
>>>>
>>>> We replace the above load pair with the lxvp instruction, and then we
>>>> don't need to merge (16,32) as (0,16) is already merged; instead we
>>>> merge (32,48).
>>>
>>> Ok, but the existing logic should already account for this.  I.e. if we
>>> successfully merge (0,16), then we don't attempt to merge (16,32).  We'd only
>>> attempt to merge (16,32) if the merge of (0,16) failed (for whatever reason).
>>> So I don't see that there's anything specific to lxvp that requires this logic
>>> to change, _unless_ you have a stricter alignment requirement on the offsets as
>>> I mentioned before.
>>>
>>
>> Thanks for the suggestion. It also worked for rs6000 with the current changes.
>> Sorry for the confusion.
> 
> Alright, glad we got to the bottom of this!

Thanks.
> 
>>
>>>>
>>>> Yes, you are correct: the address offset doesn't affect the parity of
>>>> the register transfer, as we are doing the fusion pass before RA.
>>>>>>> If that is the case then I think it would be better to introduce a
>>>>>>> virtual function (say pair_offset_alignment_ok_p) that vets the base
>>>>>>> offset of the pair (prev_access->offset in transform_for_base).  I guess
>>>>>>> it would also take access_size as a parameter and for aarch64 it should
>>>>>>> check:
>>>>>>>
>>>>>>>   multiple_p (offset, access_size)
>>>>>>>
>>>>>>> and for rs6000 it could check:
>>>>>>>
>>>>>>>   multiple_p (offset, access_size * 2)
>>>>>>>
>>>>>>> and we would then incorporate a call to that predicate in the else if
>>>>>>> condition of transform_for_base.
>>>>>>>
>>>>>>> It would have the same limitation whereby we assume that MEM_EXPR offset
>>>>>>> alignment is a good proxy for RTL offset alignment, but we already
>>>>>>> depend on that with the multiple_p check in track_via_mem_expr.
>>>>>>>
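
For reference, a rough sketch of what such a hook could look like
(hypothetical: rs6000_pair_fusion and the exact signature are my
assumptions here, not the committed code):

  /* Generic interface: is OFFSET an acceptable base offset for a pair
     whose operands are each ACCESS_SIZE bytes?  */
  virtual bool pair_offset_alignment_ok_p (poly_int64 offset,
                                           unsigned access_size) = 0;

  bool
  aarch64_pair_fusion::pair_offset_alignment_ok_p (poly_int64 offset,
                                                   unsigned access_size)
  {
    /* aarch64: aligned to a single operand is enough.  */
    return multiple_p (offset, access_size);
  }

  bool
  rs6000_pair_fusion::pair_offset_alignment_ok_p (poly_int64 offset,
                                                  unsigned access_size)
  {
    /* rs6000: lxvp wants the base offset aligned to the whole pair.  */
    return multiple_p (offset, access_size * 2);
  }
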
>>>> I have addressed the above hooks and it worked fine with both rs6000
>>>> and aarch64.  I am sending a subsequent patch shortly that addresses
>>>> the above.
>>>>
>>>>> How do you plan on handling this even-odd requirement for rs6000?
>>>>>
>>>>
>>>> We plan to handle it with V16QI subreg:0 and V16QI subreg:16 to
>>>> generate the register pair, and that's what we generate and implement
>>>> in the rs6000 target code.
>>>
>>> Ah, this is coming back to me now.  Sorry, I should have remembered this from
>>> the previous discussion with Richard S.
>>>
>>> Apologies for going on a slight tangent here, but if you're running
>>> before RA are you planning to create a new OImode pseudo register for
>>> the lxvp insn and then somehow update uses of the old transfer registers
>>> to replace them with subregs of that OImode pseudo?  
>>
>> Yes, I do the same as you have mentioned.  We generate register pairs
>> in a 256-bit mode with two subregs of 128-bit mode at offsets 0 and 16.
>>
>>> Or do you just plan
>>> on replacing the individual loads with moves (instead of deleting them)?
>>> I guess the latter would be simpler and might work better in the
>>> presence of hard regs.
>>>
>>
>> Would you mind explaining how to generate register pairs with lxvp by
>> replacing loads with moves?
> 
> Yeah, so suppose you have something like:
> 
> (set (reg:V16QI v1) (mem:V16QI addr))
> (set (reg:V16QI v2) (mem:V16QI addr+16))
> 
> then when you insert the lxvp you can then (logically) replace the
> original load instructions with moves from the appropriate subreg, as
> follows:
> 
> (set (reg:OI pair-pseudo) (mem:OI addr)) ; lxvp
> (set (reg:V16QI v1) (subreg:V16QI (reg:OI pair-pseudo) 0))
> (set (reg:V16QI v2) (subreg:V16QI (reg:OI pair-pseudo) 16))
> 

Any pseudo created with gen_rtx_REG, like
gen_rtx_REG (OOmode, REGNO (dest_exp)), errors out
with an unrecognized insn in LRA.

If I create the pseudo with gen_reg_rtx (OOmode), it errors
out because the new pseudo register is not found in
change->new_defs.

Also, sequential register pairs are not generated by
the register allocator.

That's why I haven't used the above method, though I also
thought it through.

Please let me know what you think.

Thanks & Regards
Ajit
> now I'm not sure off-hand if this exact combination of subregs and mode
> changes is valid, but hopefully you get the idea.  The benefit of this
> approach is that it keeps the transformation local and is more
> compile-time efficient (we don't have to look through all mentions of
> v1/v2 and replace them with subregs).  I'd hope that the RA can then
> clean up any redundant moves (especially in the case that v1/v2 are
> pseudos).
> 
> That would mean that you don't need all the grubbing around with DF
> looking for occurrences of the transfer regs.  Instead we'd just need
> some logic to insert the extra moves after the lxvp for rs6000, and I
> think this might fit more cleanly into the current pass.
> 
> Does that make sense?  In any case, this shouldn't really affect the
> preparatory aarch64 patch because we were planning to defer adding any
> hooks that are only needed for rs6000 from the initial aarch64/generic
> split.
> 
> Thanks,
> Alex
> 
>>
>> Thanks & Regards
>> Ajit
>>  
>>> Thanks,
>>> Alex
>>>
>>>>
>>>> Thanks & Regards
>>>> Ajit 
>>>>> Thanks,
>>>>> Alex
>>>>>
>>>>>>
>>>>>> lxvp should load from even registers, and then the loaded value will
>>>>>> be in an even register and even register + 1 (which is odd).
>>>>>>
>>>>>> Thanks & Regards
>>>>>> Ajit
>>>>>>> <snip>
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Alex
>>>>>>>
>>>>>>>>
>>>>>>>> Thanks & Regards
>>>>>>>> Ajit
>>>>>>>>> Thanks,
>>>>>>>>> Alex
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>  	}
>>>>>>>>>>>>        prev_access = &access;
>>>>>>>>>>>>      }
>>>>>>>>>>>> @@ -2919,7 +3047,7 @@ ldp_bb_info::transform_for_base (int encoded_lfs,
>>>>>>>>>>>>  // and remove all the tombstone insns, being sure to reparent any uses
>>>>>>>>>>>>  // of mem to previous defs when we do this.
>>>>>>>>>>>>  void
>>>>>>>>>>>> -ldp_bb_info::cleanup_tombstones ()
>>>>>>>>>>>> +pair_fusion_bb_info::cleanup_tombstones ()
>>>>>>>>>>>>  {
>>>>>>>>>>>>    // No need to do anything if we didn't emit a tombstone insn for this BB.
>>>>>>>>>>>>    if (!m_emitted_tombstone)
>>>>>>>>>>>> @@ -2947,7 +3075,7 @@ ldp_bb_info::cleanup_tombstones ()
>>>>>>>>>>>>  
>>>>>>>>>>>>  template<typename Map>
>>>>>>>>>>>>  void
>>>>>>>>>>>> -ldp_bb_info::traverse_base_map (Map &map)
>>>>>>>>>>>> +pair_fusion_bb_info::traverse_base_map (Map &map)
>>>>>>>>>>>>  {
>>>>>>>>>>>>    for (auto kv : map)
>>>>>>>>>>>>      {
>>>>>>>>>>>> @@ -2958,7 +3086,7 @@ ldp_bb_info::traverse_base_map (Map &map)
>>>>>>>>>>>>  }
>>>>>>>>>>>>  
>>>>>>>>>>>>  void
>>>>>>>>>>>> -ldp_bb_info::transform ()
>>>>>>>>>>>> +pair_fusion_bb_info::transform ()
>>>>>>>>>>>>  {
>>>>>>>>>>>>    traverse_base_map (expr_map);
>>>>>>>>>>>>    traverse_base_map (def_map);
>>>>>>>>>>>> @@ -3167,14 +3295,13 @@ try_promote_writeback (insn_info *insn)
>>>>>>>>>>>>  // for load/store candidates.  If running after RA, also try and promote
>>>>>>>>>>>>  // non-writeback pairs to use writeback addressing.  Then try to fuse
>>>>>>>>>>>>  // candidates into pairs.
>>>>>>>>>>>> -void ldp_fusion_bb (bb_info *bb)
>>>>>>>>>>>> +void pair_fusion::ldp_fusion_bb (bb_info *bb)
>>>>>>>>>>>>  {
>>>>>>>>>>>> -  const bool track_loads
>>>>>>>>>>>> -    = aarch64_tune_params.ldp_policy_model != AARCH64_LDP_STP_POLICY_NEVER;
>>>>>>>>>>>> -  const bool track_stores
>>>>>>>>>>>> -    = aarch64_tune_params.stp_policy_model != AARCH64_LDP_STP_POLICY_NEVER;
>>>>>>>>>>>> +  const bool track_loads = track_load_p ();
>>>>>>>>>>>> +  const bool track_stores = track_store_p ();
>>>>>>>>>>>>  
>>>>>>>>>>>> -  ldp_bb_info bb_state (bb);
>>>>>>>>>>>
>>>>>>>>>>> This:
>>>>>>>>>>>
>>>>>>>>>>>> +  aarch64_pair_fusion derived;
>>>>>>>>>>>
>>>>>>>>>>> can be deleted and then:
>>>>>>>>>>>
>>>>>>>>>>>> +  pair_fusion_bb_info bb_info (bb, &derived);
>>>>>>>>>>>
>>>>>>>>>>> can just be:
>>>>>>>>>>>
>>>>>>>>>>>   pair_fusion_bb_info bb_info (bb, this);
>>>>>>>>>>>
>>>>>>>>>>> (or you can pass *this if you make bb_info take a reference).
>>>>>>>>>>>
>>>>>>>>>>> I don't think there's a particular need to change the variable name
>>>>>>>>>>> (bb_state -> bb_info).  I chose the former because it doesn't clash
>>>>>>>>>>> with the RTL-SSA structure of the same name as the latter.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Addressed.
>>>>>>>>>>>>  
>>>>>>>>>>>>    for (auto insn : bb->nondebug_insns ())
>>>>>>>>>>>>      {
>>>>>>>>>>>> @@ -3184,31 +3311,31 @@ void ldp_fusion_bb (bb_info *bb)
>>>>>>>>>>>>  	continue;
>>>>>>>>>>>>  
>>>>>>>>>>>>        rtx pat = PATTERN (rti);
>>>>>>>>>>>> -      if (reload_completed
>>>>>>>>>>>> -	  && aarch64_ldp_writeback > 1
>>>>>>>>>>>> -	  && GET_CODE (pat) == PARALLEL
>>>>>>>>>>>> -	  && XVECLEN (pat, 0) == 2)
>>>>>>>>>>>> +      if (pair_mem_promote_writeback_p (pat))
>>>>>>>>>>>>  	try_promote_writeback (insn);
>>>>>>>>>>>
>>>>>>>>>>> It looks like try_promote_writeback itself will need some further work
>>>>>>>>>>> to make it target-independent.  I suppose this check:
>>>>>>>>>>>
>>>>>>>>>>>   auto rti = insn->rtl ();
>>>>>>>>>>>   const auto attr = get_attr_ldpstp (rti);
>>>>>>>>>>>   if (attr == LDPSTP_NONE)
>>>>>>>>>>>     return;
>>>>>>>>>>>
>>>>>>>>>>>   bool load_p = (attr == LDPSTP_LDP);
>>>>>>>>>>>   gcc_checking_assert (load_p || attr == LDPSTP_STP);
>>>>>>>>>>>
>>>>>>>>>>> will need to become part of the pair_mem_promote_writeback_p hook that you
>>>>>>>>>>> added, potentially changing it to return a boolean for load_p.
>>>>>>>>>>>
>>>>>>>>>>> Then I guess we will need hooks for destructuring the pair insn and
>>>>>>>>>>> another hook to wrap aarch64_gen_writeback_pair.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Addressed.
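
A sketch of how that could look, assembled from the quoted aarch64 code
above (the hook signature itself is an assumption):

  virtual bool pair_mem_promote_writeback_p (insn_info *insn, rtx pat,
                                             bool &load_p) = 0;

  bool
  aarch64_pair_fusion::pair_mem_promote_writeback_p (insn_info *insn,
                                                     rtx pat, bool &load_p)
  {
    if (!(reload_completed
          && aarch64_ldp_writeback > 1
          && GET_CODE (pat) == PARALLEL
          && XVECLEN (pat, 0) == 2))
      return false;

    /* Hoisted from try_promote_writeback: identify load vs store.  */
    const auto attr = get_attr_ldpstp (insn->rtl ());
    if (attr == LDPSTP_NONE)
      return false;

    load_p = (attr == LDPSTP_LDP);
    gcc_checking_assert (load_p || attr == LDPSTP_STP);
    return true;
  }
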
>>>>>>>>>>>>  
>>>>>>>>>>>>        if (GET_CODE (pat) != SET)
>>>>>>>>>>>>  	continue;
>>>>>>>>>>>>  
>>>>>>>>>>>>        if (track_stores && MEM_P (XEXP (pat, 0)))
>>>>>>>>>>>> -	bb_state.track_access (insn, false, XEXP (pat, 0));
>>>>>>>>>>>> +	bb_info.track_access (insn, false, XEXP (pat, 0));
>>>>>>>>>>>>        else if (track_loads && MEM_P (XEXP (pat, 1)))
>>>>>>>>>>>> -	bb_state.track_access (insn, true, XEXP (pat, 1));
>>>>>>>>>>>> +	bb_info.track_access (insn, true, XEXP (pat, 1));
>>>>>>>>>>>>      }
>>>>>>>>>>>>  
>>>>>>>>>>>> -  bb_state.transform ();
>>>>>>>>>>>> -  bb_state.cleanup_tombstones ();
>>>>>>>>>>>> +  bb_info.transform ();
>>>>>>>>>>>> +  bb_info.cleanup_tombstones ();
>>>>>>>>>>>>  }
>>>>>>>>>>>>  
>>>>>>>>>>>>  void ldp_fusion ()
>>>>>>>>>>>>  {
>>>>>>>>>>>>    ldp_fusion_init ();
>>>>>>>>>>>> +  pair_fusion *pfuse;
>>>>>>>>>>>> +  aarch64_pair_fusion derived;
>>>>>>>>>>>> +  pfuse = &derived;
>>>>>>>>>>>
>>>>>>>>>>> This is indeed the one place where I think it is acceptable to
>>>>>>>>>>> instantiate aarch64_pair_fusion.  But again there's no need to create a
>>>>>>>>>>> pointer to the parent class, just call any function you like directly.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Addressed.
>>>>>>>>>>>>  
>>>>>>>>>>>>    for (auto bb : crtl->ssa->bbs ())
>>>>>>>>>>>> -    ldp_fusion_bb (bb);
>>>>>>>>>>>> +    pfuse->ldp_fusion_bb (bb);
>>>>>>>>>>>
>>>>>>>>>>> I think even the code to iterate over bbs should itself be a member
>>>>>>>>>>> function of pair_fusion (say "run") and then that becomes part of the
>>>>>>>>>>> generic code.
>>>>>>>>>>>
>>>>>>>>>>> So this function would just become:
>>>>>>>>>>>
>>>>>>>>>>> aarch64_pair_fusion pass;
>>>>>>>>>>> pass.run ();
>>>>>>>>>>>
>>>>>>>>>>> and could be inlined into the caller.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Addressed.
>>>>>>>>>>> Perhaps you could also add an early return like:
>>>>>>>>>>>
>>>>>>>>>>>   if (!track_loads_p () && !track_stores_p ())
>>>>>>>>>>>     return;
>>>>>>>>>>>
>>>>>>>>>>> in pair_fusion::run () and then remove the corresponding code from
>>>>>>>>>>> pass_ldp_fusion::gate?
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Addressed.
>>>>>>>>>>
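
Pulling these suggestions together, the generic driver might end up
looking roughly like this (a sketch, not the committed code):

  void
  pair_fusion::run ()
  {
    if (!track_load_p () && !track_store_p ())
      return;

    ldp_fusion_init ();
    for (auto bb : crtl->ssa->bbs ())
      ldp_fusion_bb (bb);
    ldp_fusion_destroy ();
  }

so that ldp_fusion () reduces to instantiating aarch64_pair_fusion and
calling run () on it.
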
>>>>>>>>>>>>  
>>>>>>>>>>>>    ldp_fusion_destroy ();
>>>>>>>>>>>>  }
>>>>>>>>>>>> -- 
>>>>>>>>>>>> 2.39.3
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Alex
>>>>>>>>>>
>>>>>>>>>> Thanks & Regards
>>>>>>>>>> Ajit
Alex Coplan April 12, 2024, 2:45 p.m. UTC | #13
On 12/04/2024 20:02, Ajit Agarwal wrote:
> Hello Alex:
> 
> On 11/04/24 7:55 pm, Alex Coplan wrote:
> > On 10/04/2024 23:48, Ajit Agarwal wrote:
> >> Hello Alex:
> >>
> >> On 10/04/24 7:52 pm, Alex Coplan wrote:
> >>> Hi Ajit,
> >>>
> >>> On 10/04/2024 15:31, Ajit Agarwal wrote:
> >>>> Hello Alex:
> >>>>
> >>>> On 10/04/24 1:42 pm, Alex Coplan wrote:
> >>>>> Hi Ajit,
> >>>>>
> >>>>> On 09/04/2024 20:59, Ajit Agarwal wrote:
> >>>>>> Hello Alex:
> >>>>>>
> >>>>>> On 09/04/24 8:39 pm, Alex Coplan wrote:
> >>>>>>> On 09/04/2024 20:01, Ajit Agarwal wrote:
> >>>>>>>> Hello Alex:
> >>>>>>>>
> >>>>>>>> On 09/04/24 7:29 pm, Alex Coplan wrote:
> >>>>>>>>> On 09/04/2024 17:30, Ajit Agarwal wrote:
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On 05/04/24 10:03 pm, Alex Coplan wrote:
> >>>>>>>>>>> On 05/04/2024 13:53, Ajit Agarwal wrote:
> >>>>>>>>>>>> Hello Alex/Richard:
> >>>>>>>>>>>>
> >>>>>>>>>>>> All review comments are incorporated.
> >>> <snip>
> <snip>
> >> Would you mind explaining how to generate register pairs with lxvp by
> >> replacing loads with moves?
> > 
> > Yeah, so suppose you have something like:
> > 
> > (set (reg:V16QI v1) (mem:V16QI addr))
> > (set (reg:V16QI v2) (mem:V16QI addr+16))
> > 
> > then when you insert the lxvp you can then (logically) replace the
> > original load instructions with moves from the appropriate subreg, as
> > follows:
> > 
> > (set (reg:OI pair-pseudo) (mem:OI addr)) ; lxvp
> > (set (reg:V16QI v1) (subreg:V16QI (reg:OI pair-pseudo) 0))
> > (set (reg:V16QI v2) (subreg:V16QI (reg:OI pair-pseudo) 16))
> > 
> 
> Any pseudo created with gen_rtx_REG, like
> gen_rtx_REG (OOmode, REGNO (dest_exp)), errors out
> with an unrecognized insn in LRA.

I'm not surprised that goes wrong: you can't just create a new REG
rtx in a different mode and reuse the regno of an existing pseudo.
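
Concretely, the distinction is something like this (just a sketch, with
OOmode standing in for the pair mode and dest_exp taken from your example):

  /* Reuses dest_exp's regno in a different mode: not a valid pseudo.  */
  rtx bad = gen_rtx_REG (OOmode, REGNO (dest_exp));
  /* A fresh pseudo in the pair mode is what's needed.  */
  rtx pair = gen_reg_rtx (OOmode);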

> 
> If I create the pseudo with gen_reg_rtx (OOmode), it errors
> out because the new pseudo register is not found in
> change->new_defs.

Yeah, I suppose you'd need to add an RTL-SSA def for the new pseudo.
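
Very roughly, something like the following (a sketch only: pair_insn,
pair_reg and attempt are placeholders, and the create_set/insert_access
usage is an assumption based on how the pass builds other new defs):

  insn_change change (pair_insn);
  /* Tell RTL-SSA about the definition of the new pair pseudo
     (assumed API).  */
  set_info *def = crtl->ssa->create_set (attempt, pair_insn,
                                         full_register (REGNO (pair_reg)));
  change.new_defs = insert_access (attempt, def, change.new_defs);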

> 
> Also, sequential register pairs are not generated by
> the register allocator.

So how do you get the sequential pairs with your current approach?  My
understanding was that what I suggested above doesn't really change what
you're doing w.r.t the lxvp insn itself, but maybe I didn't properly
understand the approach taken in your initial patchset.

Thanks,
Alex

> <snip>
Ajit Agarwal April 12, 2024, 5:32 p.m. UTC | #14
Hello Alex:

On 12/04/24 8:15 pm, Alex Coplan wrote:
> <snip>
>>>> Would you mind explaining how to generate register pairs with lxvp by
>>>> replacing loads with moves?
>>>
>>> Yeah, so suppose you have something like:
>>>
>>> (set (reg:V16QI v1) (mem:V16QI addr))
>>> (set (reg:V16QI v2) (mem:V16QI addr+16))
>>>
>>> then when you insert the lxvp you can then (logically) replace the
>>> original load instructions with moves from the appropriate subreg, as
>>> follows:
>>>
>>> (set (reg:OI pair-pseudo) (mem:OI addr)) ; lxvp
>>> (set (reg:V16QI v1) (subreg:V16QI (reg:OI pair-pseudo) 0))
>>> (set (reg:V16QI v2) (subreg:V16QI (reg:OI pair-pseudo) 16))
>>>
>>
>> Any pseudo created with gen_rtx_REG, like
>> gen_rtx_REG (OOmode, REGNO (dest_exp)), errors out
>> with an unrecognized insn in LRA.
> 
> I'm not surprised that goes wrong: you can't just create a new REG
> rtx in a different mode and reuse the regno of an existing pseudo.
> 
>>
>> If I create the pseudo with gen_reg_rtx (OOmode), it errors
>> out because the new pseudo register is not found in
>> change->new_defs.
> 
> Yeah, I suppose you'd need to add an RTL-SSA def for the new pseudo.
> 

Would you mind explaining how I can add an RTL-SSA def for the
new pseudo?

>>
>> Also, sequential register pairs are not generated by
>> the register allocator.
> 
> So how do you get the sequential pairs with your current approach?  My
> understanding was that what I suggested above doesn't really change what
> you're doing w.r.t the lxvp insn itself, but maybe I didn't properly
> understand the approach taken in your initial patchset.
> 

I generate (set (reg:OO pair-pseudo) (mem:OO addr)) ; lxvp
and then at the use point of pair-pseudo I generate the following:

(subreg:V16QI (reg:OO pair-pseudo) 0)
(subreg:V16QI (reg:OO pair-pseudo) 16)
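
In plain-RTL terms that sequence is roughly (a sketch ignoring the
RTL-SSA bookkeeping; pair_mem, v1 and v2 stand for the real operands):

  rtx pair = gen_reg_rtx (OOmode);  /* Fresh 256-bit pair pseudo.  */
  /* The lxvp itself.  */
  emit_insn (gen_rtx_SET (pair, pair_mem));
  /* Feed the original uses from the two halves of the pair.  */
  emit_insn (gen_rtx_SET (v1, gen_rtx_SUBREG (V16QImode, pair, 0)));
  emit_insn (gen_rtx_SET (v2, gen_rtx_SUBREG (V16QImode, pair, 16)));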

Thanks & Regards
Ajit
> <snip>
Ajit Agarwal April 14, 2024, 4:59 p.m. UTC | #15
Hello Alex:

On 12/04/24 11:02 pm, Ajit Agarwal wrote:
> Hello Alex:
> 
> On 12/04/24 8:15 pm, Alex Coplan wrote:
>> On 12/04/2024 20:02, Ajit Agarwal wrote:
>>> Hello Alex:
>>>
>>> On 11/04/24 7:55 pm, Alex Coplan wrote:
>>>> On 10/04/2024 23:48, Ajit Agarwal wrote:
>>>>> Hello Alex:
>>>>>
>>>>> On 10/04/24 7:52 pm, Alex Coplan wrote:
>>>>>> Hi Ajit,
>>>>>>
>>>>>> On 10/04/2024 15:31, Ajit Agarwal wrote:
>>>>>>> Hello Alex:
>>>>>>>
>>>>>>> On 10/04/24 1:42 pm, Alex Coplan wrote:
>>>>>>>> Hi Ajit,
>>>>>>>>
>>>>>>>> On 09/04/2024 20:59, Ajit Agarwal wrote:
>>>>>>>>> Hello Alex:
>>>>>>>>>
>>>>>>>>> On 09/04/24 8:39 pm, Alex Coplan wrote:
>>>>>>>>>> On 09/04/2024 20:01, Ajit Agarwal wrote:
>>>>>>>>>>> Hello Alex:
>>>>>>>>>>>
>>>>>>>>>>> On 09/04/24 7:29 pm, Alex Coplan wrote:
>>>>>>>>>>>> On 09/04/2024 17:30, Ajit Agarwal wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 05/04/24 10:03 pm, Alex Coplan wrote:
>>>>>>>>>>>>>> On 05/04/2024 13:53, Ajit Agarwal wrote:
>>>>>>>>>>>>>>> Hello Alex/Richard:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> All review comments are incorporated.
>>>>>> <snip>
>>>>>>>>>>>>>>> @@ -2890,8 +3018,8 @@ ldp_bb_info::merge_pairs (insn_list_t &left_list,
>>>>>>>>>>>>>>>  // of accesses.  If we find two sets of adjacent accesses, call
>>>>>>>>>>>>>>>  // merge_pairs.
>>>>>>>>>>>>>>>  void
>>>>>>>>>>>>>>> -ldp_bb_info::transform_for_base (int encoded_lfs,
>>>>>>>>>>>>>>> -				 access_group &group)
>>>>>>>>>>>>>>> +pair_fusion_bb_info::transform_for_base (int encoded_lfs,
>>>>>>>>>>>>>>> +					 access_group &group)
>>>>>>>>>>>>>>>  {
>>>>>>>>>>>>>>>    const auto lfs = decode_lfs (encoded_lfs);
>>>>>>>>>>>>>>>    const unsigned access_size = lfs.size;
>>>>>>>>>>>>>>> @@ -2909,7 +3037,7 @@ ldp_bb_info::transform_for_base (int encoded_lfs,
>>>>>>>>>>>>>>>  		       access.cand_insns,
>>>>>>>>>>>>>>>  		       lfs.load_p,
>>>>>>>>>>>>>>>  		       access_size);
>>>>>>>>>>>>>>> -	  skip_next = access.cand_insns.empty ();
>>>>>>>>>>>>>>> +	  skip_next = bb_state->cand_insns_empty_p (access.cand_insns);
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> As above, why is this needed?
>>>>>>>>>>>>>
>>>>>>>>>>>>> For rs6000 we want to return always true. as load store pair
>>>>>>>>>>>>> that are to be merged with 8/16 16/32 32/64 is occuring for rs6000.
>>>>>>>>>>>>> And we want load store pair to 8/16 32/64. Thats why we want
>>>>>>>>>>>>> to generate always true for rs6000 to skip pairs as above.
>>>>>>>>>>>>
>>>>>>>>>>>> Hmm, sorry, I'm not sure I follow.  Are you saying that for rs6000 you have
>>>>>>>>>>>> load/store pair instructions where the two arms of the access are storing
>>>>>>>>>>>> operands of different sizes?  Or something else?
>>>>>>>>>>>>
>>>>>>>>>>>> As it stands the logic is to skip the next iteration only if we
>>>>>>>>>>>> exhausted all the candidate insns for the current access.  In the case
>>>>>>>>>>>> that we didn't exhaust all such candidates, then the idea is that when
>>>>>>>>>>>> access becomes prev_access, we can attempt to use those candidates as
>>>>>>>>>>>> the "left-hand side" of a pair in the next iteration since we failed to
>>>>>>>>>>>> use them as the "right-hand side" of a pair in the current iteration.
>>>>>>>>>>>> I don't see why you wouldn't want that behaviour.  Please can you
>>>>>>>>>>>> explain?
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> In merge_pairs we get the two load candidates: one load from
>>>>>>>>>>> offset 0 and the other load from offset 16.  Then in the next
>>>>>>>>>>> iteration we get a load from offset 16 and the other load from
>>>>>>>>>>> offset 32.  In the next iteration we get a load from offset 32
>>>>>>>>>>> and the other load from offset 48.
>>>>>>>>>>>
>>>>>>>>>>> For example:
>>>>>>>>>>>
>>>>>>>>>>> Currently we get the load candidates as follows.
>>>>>>>>>>>
>>>>>>>>>>> pairs:
>>>>>>>>>>>
>>>>>>>>>>> load from 0th offset.
>>>>>>>>>>> load from 16th offset.
>>>>>>>>>>>
>>>>>>>>>>> next pairs:
>>>>>>>>>>>
>>>>>>>>>>> load from 16th offset.
>>>>>>>>>>> load from 32th offset.
>>>>>>>>>>>
>>>>>>>>>>> next pairs:
>>>>>>>>>>>
>>>>>>>>>>> load from 32th offset
>>>>>>>>>>> load from 48th offset.
>>>>>>>>>>>
>>>>>>>>>>> Instead in rs6000 we should get:
>>>>>>>>>>>
>>>>>>>>>>> pairs:
>>>>>>>>>>>
>>>>>>>>>>> load from 0th offset
>>>>>>>>>>> load from 16th offset.
>>>>>>>>>>>
>>>>>>>>>>> next pairs:
>>>>>>>>>>>
>>>>>>>>>>> load from 32th offset
>>>>>>>>>>> load from 48th offset.
>>>>>>>>>>
>>>>>>>>>> Hmm, so then I guess my question is: why wouldn't you consider merging
>>>>>>>>>> the pair with offsets (16,32) for rs6000?  Is it because you have a
>>>>>>>>>> stricter alignment requirement on the base pair offsets (i.e. they have
>>>>>>>>>> to be a multiple of 32 when the operand size is 16)?  So the pair
>>>>>>>>>> offsets have to be a multiple of the entire pair size rather than a
>>>>>>>>>> single operand size> 
>>>>>>>>>
>>>>>>>>> We get a load pair at one program point with (0,16), and at another
>>>>>>>>> program point we get the load pair (32,48).
>>>>>>>>>
>>>>>>>>> The current implementation takes the load offsets as (0,16),
>>>>>>>>> (16,32), (32,48).
>>>>>>>>>
>>>>>>>>> But in rs6000 we want the load pairs to be merged at different
>>>>>>>>> points, as (0,16) and (32,48): for (0,16) we want to replace the
>>>>>>>>> loads with lxvp at offset 0, and the other loads (32,48) with lxvp
>>>>>>>>> at offset 32.
>>>>>>>>>
>>>>>>>>> In the current case it will merge into lxvp with offset 0 and lxvp
>>>>>>>>> with offset 16, then lxvp with offset 32 and lxvp with offset 48,
>>>>>>>>> which is incorrect in our case, as in the (16,32) case the load at
>>>>>>>>> offset 16 will not be from an even register and breaks for rs6000.
>>>>>>>>
>>>>>>>> Sorry, I think I'm still missing something here.  Why does the address offset
>>>>>>>> affect the parity of the transfer register?  ISTM they needn't be related at
>>>>>>>> all (and indeed we can't even know the parity of the transfer register before
>>>>>>>> RA, but perhaps you're only intending on running the pass after RA?)
>>>>>>>>
>>>>>>>
>>>>>>> We have load pair with (0,16) wherein these loads are adjacent and
>>>>>>> replaced with lxvp.
>>>>>>>
>>>>>>> The semantics of the lxvp instruction are that it loads the adjacent
>>>>>>> pair into an even register and even_register + 1.
>>>>>>>
>>>>>>> We replace the above load pair with the lxvp instruction, and then we
>>>>>>> don't need to merge (16,32) as (0,16) is already merged; instead
>>>>>>> we merge (32,48).
>>>>>>
>>>>>> Ok, but the existing logic should already account for this.  I.e. if we
>>>>>> successfully merge (0,16), then we don't attempt to merge (16,32).  We'd only
>>>>>> attempt to merge (16,32) if the merge of (0,16) failed (for whatever reason).
>>>>>> So I don't see that there's anything specific to lxvp that requires this logic
>>>>>> to change, _unless_ you have a stricter alignment requirement on the offsets as
>>>>>> I mentioned before.
>>>>>>
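(For reference, the existing generic driver already encodes this
behaviour: in transform_for_base the next iteration is only skipped
when a successful merge exhausted the current access's candidates,
roughly:

  merge_pairs (prev_access->cand_insns, access.cand_insns,
	       lfs.load_p, access_size);
  skip_next = access.cand_insns.empty ();

so once (0,16) is fused, the offset-16 candidates are consumed and the
(16,32) combination is never attempted.)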
>>>>>
>>>>> Thanks for the suggestion.  It worked for rs6000 too with the current changes.
>>>>> Sorry for the confusion.
>>>>
>>>> Alright, glad we got to the bottom of this!
>>>
>>> Thanks.
>>>>
>>>>>
>>>>>>>
>>>>>>> Yes, you are correct: the address offset doesn't affect the parity of
>>>>>>> the register transfer, as we are running the fusion pass before RA.
>>>>>>>
>>>>>>>>>> If that is the case then I think it would be better to introduce a
>>>>>>>>>> virtual function (say pair_offset_alignment_ok_p) that vets the base
>>>>>>>>>> offset of the pair (prev_access->offset in transform_for_base).  I guess
>>>>>>>>>> it would also take access_size as a parameter and for aarch64 it should
>>>>>>>>>> check:
>>>>>>>>>>
>>>>>>>>>>   multiple_p (offset, access_size)
>>>>>>>>>>
>>>>>>>>>> and for rs6000 it could check:
>>>>>>>>>>
>>>>>>>>>>   multiple_p (offset, access_size * 2)
>>>>>>>>>>
>>>>>>>>>> and we would then incorporate a call to that predicate in the else if
>>>>>>>>>> condition of transform_for_base.
>>>>>>>>>>
>>>>>>>>>> It would have the same limitation whereby we assume that MEM_EXPR offset
>>>>>>>>>> alignment is a good proxy for RTL offset alignment, but we already
>>>>>>>>>> depend on that with the multiple_p check in track_via_mem_expr.
>>>>>>>>>>
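A minimal sketch of the suggested hook (rs6000_pair_fusion and the
exact signature are illustrative assumptions, not code from this
series):

  virtual bool pair_offset_alignment_ok_p (poly_int64 offset,
					   unsigned access_size) = 0;

  // aarch64: LDP/STP offsets need only be aligned to one operand.
  bool
  aarch64_pair_fusion::pair_offset_alignment_ok_p (poly_int64 offset,
						   unsigned access_size)
  {
    return multiple_p (offset, access_size);
  }

  // rs6000: lxvp/stxvp offsets must be aligned to the whole pair.
  bool
  rs6000_pair_fusion::pair_offset_alignment_ok_p (poly_int64 offset,
						  unsigned access_size)
  {
    return multiple_p (offset, access_size * 2);
  }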
>>>>>>> I have addressed the above hooks and it worked fine with both rs6000
>>>>>>> and aarch64.  I am sending a subsequent patch shortly that addresses
>>>>>>> the above.
>>>>>>>
>>>>>>>> How do you plan on handling this even-odd requirement for rs6000?
>>>>>>>>
>>>>>>>
>>>>>>> We plan to handle it with V16QI subreg 0 and V16QI subreg 16 to
>>>>>>> generate the register pair; that's what we generate and implement
>>>>>>> in the rs6000 target code.
>>>>>>
>>>>>> Ah, this is coming back to me now.  Sorry, I should have remembered this from
>>>>>> the previous discussion with Richard S.
>>>>>>
>>>>>> Apologies for going on a slight tangent here, but if you're running
>>>>>> before RA are you planning to create a new OImode pseudo register for
>>>>>> the lxvp insn and then somehow update uses of the old transfer registers
>>>>>> to replace them with subregs of that OImode pseudo?  
>>>>>
>>>>> Yes, I do the same as you have mentioned.  We generate register pairs
>>>>> in a 256-bit mode with two subregs in 128-bit modes at offsets 0 and
>>>>> 16.
>>>>>
>>>>> Or do you just plan
>>>>>> or replacing the individual loads with moves (instead of deleting them)?
>>>>>> I guess the latter would be simpler and might work better in the
>>>>>> presence of hard regs.
>>>>>>
>>>>>
>>>>> Would you mind explaining how to generate register pairs with lxvp by
>>>>> replacing loads with moves?
>>>>
>>>> Yeah, so suppose you have something like:
>>>>
>>>> (set (reg:V16QI v1) (mem:V16QI addr))
>>>> (set (reg:V16QI v2) (mem:V16QI addr+16))
>>>>
>>>> then when you insert the lxvp you can then (logically) replace the
>>>> original load instructions with moves from the appropriate subreg, as
>>>> follows:
>>>>
>>>> (set (reg:OI pair-pseudo) (mem:OI addr)) ; lxvp
>>>> (set (reg:V16QI v1) (subreg:V16QI (reg:OI pair-pseudo) 0))
>>>> (set (reg:V16QI v2) (subreg:V16QI (reg:OI pair-pseudo) 16))
>>>>
>>>
>>> Any pseudo created with gen_rtx_REG, like
>>> gen_rtx_REG (OOmode, REGNO (dest_exp)), errors out
>>> with an unrecognized insn in LRA.
>>
>> I'm not surprised that goes wrong: you can't just create a new REG
>> rtx in a different mode and reuse the regno of an existing pseudo.
>>
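In other words, the difference is roughly (sketch only, where dest is
an existing 128-bit pseudo from one of the original loads):

  /* Invalid: reuses an existing pseudo's REGNO in a wider mode.  */
  rtx bad_pair = gen_rtx_REG (OOmode, REGNO (dest));

  /* OK: allocate a fresh pseudo in the wider mode.  */
  rtx pair = gen_reg_rtx (OOmode);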

Thanks for the suggestion.
>>>
>>> If I create the pseudo with gen_reg_rtx (OOmode), it errors
>>> out because the pseudo register is not found in
>>> change->new_defs.
>>
>> Yeah, I suppose you'd need to add an RTL-SSA def for the new pseudo.
>>
> 
> Would you mind explaining how I can add an RTL-SSA def for the
> new pseudo?

I have added an RTL-SSA def for the new pseudo.  With that I can
get the register pairs correctly.
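For the record, the fix has roughly the following shape (a sketch from
memory; treat the create_set/insert_access calls as assumptions about
the RTL-SSA helpers rather than exact signatures):

  /* Fresh 256-bit pseudo for the lxvp destination.  */
  rtx pair = gen_reg_rtx (OOmode);
  insn_change change (lxvp_insn);
  /* Record a def of the new pseudo on the lxvp insn so that the
     insn_change validates.  */
  auto *pair_def = crtl->ssa->create_set (attempt, lxvp_insn,
					  { GET_MODE (pair), REGNO (pair) });
  change.new_defs = insert_access (attempt, pair_def, change.new_defs);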
> 
>>>
>>> Also, sequential register pairs are not generated by the
>>> register allocator.
>>
>> So how do you get the sequential pairs with your current approach?  My
>> understanding was that what I suggested above doesn't really change what
>> you're doing w.r.t the lxvp insn itself, but maybe I didn't properly
>> understand the approach taken in your initial patchset.
>>
> 
> I generate (set (reg:OO pair-pseudo) (mem:OO addr)) ; lxvp
> and then at the use points of pair-pseudo generate the following:
> 
> (subreg:V16QI (reg:OO pair-pseudo) 0)
> (subreg:V16QI (reg:OO pair-pseudo) 16)
> 

Generating RTL as you have suggested, I get the register pairs
correctly, but we get extra moves that would impact performance.

Please let me know what you think.

Thanks & Regards
Ajit
> Thanks & Regards
> Ajit
>> Thanks,
>> Alex
>>
>>>
>>> That's why I haven't used the above method, though I had also
>>> thought it through.
>>>
>>> Please let me know what you think.
>>>
>>> Thanks & Regards
>>> Ajit 
>>>> now I'm not sure off-hand if this exact combination of subregs and mode
>>>> changes is valid, but hopefully you get the idea.  The benefit of this
>>>> approach is that it keeps the transformation local and is more
>>>> compile-time efficient (we don't have to look through all mentions of
>>>> v1/v2 and replace them with subregs).  I'd hope that the RA can then
>>>> clean up any redundant moves (especially in the case that v1/v2 are
>>>> pseudos).
>>>>
>>>> That would mean that you don't need all the grubbing around with DF
>>>> looking for occurrences of the transfer regs.  Instead we'd just need
>>>> some logic to insert the extra moves after the lxvp for rs6000, and I
>>>> think this might fit more cleanly into the current pass.
>>>>
>>>> Does that make sense?  In any case, this shouldn't really affect the
>>>> preparatory aarch64 patch because we were planning to defer adding any
>>>> hooks that are only needed for rs6000 from the initial aarch64/generic
>>>> split.
>>>>
>>>> Thanks,
>>>> Alex
>>>>
>>>>>
>>>>> Thanks & Regards
>>>>> Ajit
>>>>>  
>>>>>> Thanks,
>>>>>> Alex
>>>>>>
>>>>>>>
>>>>>>> Thanks & Regards
>>>>>>> Ajit 
>>>>>>>> Thanks,
>>>>>>>> Alex
>>>>>>>>
>>>>>>>>>
>>>>>>>>> lxvp should load into an even register, and the loaded value will
>>>>>>>>> be in the even register and even register + 1 (which is odd).
>>>>>>>>>
>>>>>>>>> Thanks & Regards
>>>>>>>>> Ajit
>>>>>>>>>> If that is the case then I think it would be better to introduce a
>>>>>>>>>> virtual function (say pair_offset_alignment_ok_p) that vets the base
>>>>>>>>>> offset of the pair (prev_access->offset in transform_for_base).  I guess
>>>>>>>>>> it would also take access_size as a parameter and for aarch64 it should
>>>>>>>>>> check:
>>>>>>>>>>
>>>>>>>>>>   multiple_p (offset, access_size)
>>>>>>>>>>
>>>>>>>>>> and for rs6000 it could check:
>>>>>>>>>>
>>>>>>>>>>   multiple_p (offset, access_size * 2)
>>>>>>>>>>
>>>>>>>>>> and we would then incorporate a call to that predicate in the else if
>>>>>>>>>> condition of transform_for_base.
>>>>>>>>>>
>>>>>>>>>> It would have the same limitation whereby we assume that MEM_EXPR offset
>>>>>>>>>> alignment is a good proxy for RTL offset alignment, but we already
>>>>>>>>>> depend on that with the multiple_p check in track_via_mem_expr.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Alex
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Thanks & Regards
>>>>>>>>>>> Ajit
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Alex
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>  	}
>>>>>>>>>>>>>>>        prev_access = &access;
>>>>>>>>>>>>>>>      }
>>>>>>>>>>>>>>> @@ -2919,7 +3047,7 @@ ldp_bb_info::transform_for_base (int encoded_lfs,
>>>>>>>>>>>>>>>  // and remove all the tombstone insns, being sure to reparent any uses
>>>>>>>>>>>>>>>  // of mem to previous defs when we do this.
>>>>>>>>>>>>>>>  void
>>>>>>>>>>>>>>> -ldp_bb_info::cleanup_tombstones ()
>>>>>>>>>>>>>>> +pair_fusion_bb_info::cleanup_tombstones ()
>>>>>>>>>>>>>>>  {
>>>>>>>>>>>>>>>    // No need to do anything if we didn't emit a tombstone insn for this BB.
>>>>>>>>>>>>>>>    if (!m_emitted_tombstone)
>>>>>>>>>>>>>>> @@ -2947,7 +3075,7 @@ ldp_bb_info::cleanup_tombstones ()
>>>>>>>>>>>>>>>  
>>>>>>>>>>>>>>>  template<typename Map>
>>>>>>>>>>>>>>>  void
>>>>>>>>>>>>>>> -ldp_bb_info::traverse_base_map (Map &map)
>>>>>>>>>>>>>>> +pair_fusion_bb_info::traverse_base_map (Map &map)
>>>>>>>>>>>>>>>  {
>>>>>>>>>>>>>>>    for (auto kv : map)
>>>>>>>>>>>>>>>      {
>>>>>>>>>>>>>>> @@ -2958,7 +3086,7 @@ ldp_bb_info::traverse_base_map (Map &map)
>>>>>>>>>>>>>>>  }
>>>>>>>>>>>>>>>  
>>>>>>>>>>>>>>>  void
>>>>>>>>>>>>>>> -ldp_bb_info::transform ()
>>>>>>>>>>>>>>> +pair_fusion_bb_info::transform ()
>>>>>>>>>>>>>>>  {
>>>>>>>>>>>>>>>    traverse_base_map (expr_map);
>>>>>>>>>>>>>>>    traverse_base_map (def_map);
>>>>>>>>>>>>>>> @@ -3167,14 +3295,13 @@ try_promote_writeback (insn_info *insn)
>>>>>>>>>>>>>>>  // for load/store candidates.  If running after RA, also try and promote
>>>>>>>>>>>>>>>  // non-writeback pairs to use writeback addressing.  Then try to fuse
>>>>>>>>>>>>>>>  // candidates into pairs.
>>>>>>>>>>>>>>> -void ldp_fusion_bb (bb_info *bb)
>>>>>>>>>>>>>>> +void pair_fusion::ldp_fusion_bb (bb_info *bb)
>>>>>>>>>>>>>>>  {
>>>>>>>>>>>>>>> -  const bool track_loads
>>>>>>>>>>>>>>> -    = aarch64_tune_params.ldp_policy_model != AARCH64_LDP_STP_POLICY_NEVER;
>>>>>>>>>>>>>>> -  const bool track_stores
>>>>>>>>>>>>>>> -    = aarch64_tune_params.stp_policy_model != AARCH64_LDP_STP_POLICY_NEVER;
>>>>>>>>>>>>>>> +  const bool track_loads = track_load_p ();
>>>>>>>>>>>>>>> +  const bool track_stores = track_store_p ();
>>>>>>>>>>>>>>>  
>>>>>>>>>>>>>>> -  ldp_bb_info bb_state (bb);
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> +  aarch64_pair_fusion derived;
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> can be deleted and then:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> +  pair_fusion_bb_info bb_info (bb, &derived);
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> can just be:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>   pair_fusion_bb_info bb_info (bb, this);
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> (or you can pass *this if you make bb_info take a reference).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I don't think there's a particular need to change the variable name
>>>>>>>>>>>>>> (bb_state -> bb_info).  I chose the former because it doesn't clash
>>>>>>>>>>>>>> with the RTL-SSA structure of the same name as the latter.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Addressed.
>>>>>>>>>>>>>>>  
>>>>>>>>>>>>>>>    for (auto insn : bb->nondebug_insns ())
>>>>>>>>>>>>>>>      {
>>>>>>>>>>>>>>> @@ -3184,31 +3311,31 @@ void ldp_fusion_bb (bb_info *bb)
>>>>>>>>>>>>>>>  	continue;
>>>>>>>>>>>>>>>  
>>>>>>>>>>>>>>>        rtx pat = PATTERN (rti);
>>>>>>>>>>>>>>> -      if (reload_completed
>>>>>>>>>>>>>>> -	  && aarch64_ldp_writeback > 1
>>>>>>>>>>>>>>> -	  && GET_CODE (pat) == PARALLEL
>>>>>>>>>>>>>>> -	  && XVECLEN (pat, 0) == 2)
>>>>>>>>>>>>>>> +      if (pair_mem_promote_writeback_p (pat))
>>>>>>>>>>>>>>>  	try_promote_writeback (insn);
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> It looks like try_promote_writeback itself will need some further work
>>>>>>>>>>>>>> to make it target-independent.  I suppose this check:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>   auto rti = insn->rtl ();
>>>>>>>>>>>>>>   const auto attr = get_attr_ldpstp (rti);
>>>>>>>>>>>>>>   if (attr == LDPSTP_NONE)
>>>>>>>>>>>>>>     return;
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>   bool load_p = (attr == LDPSTP_LDP);
>>>>>>>>>>>>>>   gcc_checking_assert (load_p || attr == LDPSTP_STP);
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> will need to become part of the pair_mem_promote_writeback_p hook that you
>>>>>>>>>>>>>> added, potentially changing it to return a boolean for load_p.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Then I guess we will need hooks for destructuring the pair insn and
>>>>>>>>>>>>>> another hook to wrap aarch64_gen_writeback_pair.
>>>>>>>>>>>>>>
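One possible shape for that hook (illustrative only, folding the
attribute check in and returning load_p by reference):

  bool
  aarch64_pair_fusion::pair_mem_promote_writeback_p (insn_info *insn,
						     bool &load_p)
  {
    if (!reload_completed || aarch64_ldp_writeback <= 1)
      return false;

    rtx pat = PATTERN (insn->rtl ());
    if (GET_CODE (pat) != PARALLEL || XVECLEN (pat, 0) != 2)
      return false;

    const auto attr = get_attr_ldpstp (insn->rtl ());
    if (attr == LDPSTP_NONE)
      return false;

    load_p = (attr == LDPSTP_LDP);
    return true;
  }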
>>>>>>>>>>>>>
>>>>>>>>>>>>> Addressed.
>>>>>>>>>>>>>>>  
>>>>>>>>>>>>>>>        if (GET_CODE (pat) != SET)
>>>>>>>>>>>>>>>  	continue;
>>>>>>>>>>>>>>>  
>>>>>>>>>>>>>>>        if (track_stores && MEM_P (XEXP (pat, 0)))
>>>>>>>>>>>>>>> -	bb_state.track_access (insn, false, XEXP (pat, 0));
>>>>>>>>>>>>>>> +	bb_info.track_access (insn, false, XEXP (pat, 0));
>>>>>>>>>>>>>>>        else if (track_loads && MEM_P (XEXP (pat, 1)))
>>>>>>>>>>>>>>> -	bb_state.track_access (insn, true, XEXP (pat, 1));
>>>>>>>>>>>>>>> +	bb_info.track_access (insn, true, XEXP (pat, 1));
>>>>>>>>>>>>>>>      }
>>>>>>>>>>>>>>>  
>>>>>>>>>>>>>>> -  bb_state.transform ();
>>>>>>>>>>>>>>> -  bb_state.cleanup_tombstones ();
>>>>>>>>>>>>>>> +  bb_info.transform ();
>>>>>>>>>>>>>>> +  bb_info.cleanup_tombstones ();
>>>>>>>>>>>>>>>  }
>>>>>>>>>>>>>>>  
>>>>>>>>>>>>>>>  void ldp_fusion ()
>>>>>>>>>>>>>>>  {
>>>>>>>>>>>>>>>    ldp_fusion_init ();
>>>>>>>>>>>>>>> +  pair_fusion *pfuse;
>>>>>>>>>>>>>>> +  aarch64_pair_fusion derived;
>>>>>>>>>>>>>>> +  pfuse = &derived;
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This is indeed the one place where I think it is acceptable to
>>>>>>>>>>>>>> instantiate aarch64_pair_fusion.  But again there's no need to create a
>>>>>>>>>>>>>> pointer to the parent class, just call any function you like directly.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Addressed.
>>>>>>>>>>>>>>>  
>>>>>>>>>>>>>>>    for (auto bb : crtl->ssa->bbs ())
>>>>>>>>>>>>>>> -    ldp_fusion_bb (bb);
>>>>>>>>>>>>>>> +    pfuse->ldp_fusion_bb (bb);
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I think even the code to iterate over bbs should itself be a member
>>>>>>>>>>>>>> function of pair_fusion (say "run") and then that becomes part of the
>>>>>>>>>>>>>> generic code.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So this function would just become:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> aarch64_pair_fusion pass;
>>>>>>>>>>>>>> pass.run ();
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> and could be inlined into the caller.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Addressed.
>>>>>>>>>>>>>> Perhaps you could also add an early return like:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>   if (!track_loads_p () && !track_stores_p ())
>>>>>>>>>>>>>>     return;
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> in pair_fusion::run () and then remove the corresponding code from
>>>>>>>>>>>>>> pass_ldp_fusion::gate?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Addressed.
>>>>>>>>>>>>>
>>>>>>>>>>>>>>>  
>>>>>>>>>>>>>>>    ldp_fusion_destroy ();
>>>>>>>>>>>>>>>  }
>>>>>>>>>>>>>>> -- 
>>>>>>>>>>>>>>> 2.39.3
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Alex
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks & Regards
>>>>>>>>>>>>> Ajit
Ajit Agarwal April 22, 2024, 7:31 a.m. UTC | #16
Hello Alex:

On 14/04/24 10:29 pm, Ajit Agarwal wrote:
> <snip>
>>>
>>> So how do you get the sequential pairs with your current approach?  My
>>> understanding was that what I suggested above doesn't really change what
>>> you're doing w.r.t the lxvp insn itself, but maybe I didn't properly
>>> understand the approach taken in your initial patchset.
>>>
>>
>> I generate (set (reg:OO pair-pseudo) (mem:OO addr)) ; lxvp
>> and then at the use points of pair-pseudo generate the following:
>> 
>> (subreg:V16QI (reg:OO pair-pseudo) 0)
>> (subreg:V16QI (reg:OO pair-pseudo) 16)
>>
> 
> Generating RTL as you have suggested, I get the register pairs
> correctly, but we get extra moves that would impact performance.
> 
> Please let me know what you think.
> 

For the below testcase:

#include <altivec.h>

void
foo (__vector_quad *dst, vector unsigned char *ptr, vector unsigned char src)
{
  __vector_quad acc;
  __builtin_mma_xvf32ger(&acc, src, ptr[0]);
  __builtin_mma_xvf32gerpp(&acc, src, ptr[1]);
  *dst = acc;
}


My earlier implementation without moves generates the following assembly
for the above testcase:

.LFB0:
        .cfi_startproc
        .localentry     _Z3fooPu13__vector_quadPDv16_hS1_,1
        lxvp %vs32,0(%r4)
        xvf32ger 0,%vs34,%vs33
        xvf32gerpp 0,%vs34,%vs32
        xxmfacc 0
        stxvp %vs2,0(%r3)
        stxvp %vs0,32(%r3)
        blr


With the moves you suggested, I get the following generated assembly:
.LFB0:
        .cfi_startproc
        .localentry     _Z3fooPu13__vector_quadPDv16_hS1_,1
        lxvp %vs0,0(%r4)
        xxlor %vs33,%vs1,%vs1
        xxlor %vs32,%vs0,%vs0
        xvf32ger 0,%vs34,%vs33
        xvf32gerpp 0,%vs34,%vs32
        xxmfacc 0
        stxvp %vs2,0(%r3)
        stxvp %vs0,32(%r3)
        blr


As you can see, with the register moves two extra xxlor instructions
are generated.

Please let me know what you think.

Thanks & Regards
Ajit
> <snip>
diff mbox series

Patch

diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc b/gcc/config/aarch64/aarch64-ldp-fusion.cc
index 22ed95eb743..cb21b514ef7 100644
--- a/gcc/config/aarch64/aarch64-ldp-fusion.cc
+++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc
@@ -138,8 +138,122 @@  struct alt_base
   poly_int64 offset;
 };
 
+// Virtual base class for load/store walkers used in alias analysis.
+struct alias_walker
+{
+  virtual bool conflict_p (int &budget) const = 0;
+  virtual insn_info *insn () const = 0;
+  virtual bool valid () const  = 0;
+  virtual void advance () = 0;
+};
+
+struct pair_fusion
+{
+  pair_fusion () {}
+  virtual bool fpsimd_op_p (rtx reg_op, machine_mode mem_mode,
+			       bool load_p) = 0;
+
+  virtual bool pair_operand_mode_ok_p (machine_mode mode) = 0;
+  virtual bool pair_trailing_writeback_p () = 0;
+  virtual bool pair_reg_operand_ok_p (bool load_p, rtx reg_op,
+				      machine_mode mem_mode) = 0;
+  virtual int pair_mem_alias_check_limit () = 0;
+  virtual bool handle_writeback_opportunities () = 0;
+  virtual bool pair_mem_ok_with_policy (rtx first_mem, bool load_p,
+					machine_mode mode) = 0;
+  virtual rtx gen_mem_pair (rtx *pats, rtx writeback,
+			    bool load_p) = 0;
+  virtual bool pair_mem_promote_writeback_p (rtx pat) = 0;
+  virtual bool track_load_p () = 0;
+  virtual bool track_store_p () = 0;
+  virtual bool cand_insns_empty_p (std::list<insn_info *> &insns) = 0;
+  virtual bool pair_mem_in_range_p (HOST_WIDE_INT off) = 0;
+  void ldp_fusion_bb (bb_info *bb);
+
+  ~pair_fusion () {}
+};
+
+struct aarch64_pair_fusion : public pair_fusion
+{
+public:
+  aarch64_pair_fusion () : pair_fusion () {}
+  bool fpsimd_op_p (rtx reg_op, machine_mode mem_mode,
+		    bool load_p) override final
+  {
+    const bool fpsimd_op_p
+      = reload_completed
+      ? (REG_P (reg_op) && FP_REGNUM_P (REGNO (reg_op)))
+      : (GET_MODE_CLASS (mem_mode) != MODE_INT
+	 && (load_p || !aarch64_const_zero_rtx_p (reg_op)));
+    return fpsimd_op_p;
+  }
+
+  bool pair_mem_promote_writeback_p (rtx pat)
+  {
+    if (reload_completed
+	&& aarch64_ldp_writeback > 1
+	&& GET_CODE (pat) == PARALLEL
+	&& XVECLEN (pat, 0) == 2)
+      return true;
+
+    return false;
+  }
+
+  bool pair_mem_ok_with_policy (rtx first_mem, bool load_p,
+				machine_mode mode)
+  {
+    return aarch64_mem_ok_with_ldpstp_policy_model (first_mem,
+						     load_p,
+						     mode);
+  }
+  bool pair_operand_mode_ok_p (machine_mode mode);
+
+  rtx gen_mem_pair (rtx *pats, rtx writeback, bool load_p);
+
+  bool pair_trailing_writeback_p ()
+  {
+    return aarch64_ldp_writeback > 1;
+  }
+  bool pair_reg_operand_ok_p (bool load_p, rtx reg_op,
+			      machine_mode mem_mode)
+  {
+    return (load_p
+	     ? aarch64_ldp_reg_operand (reg_op, mem_mode)
+	     : aarch64_stp_reg_operand (reg_op, mem_mode));
+  }
+  int pair_mem_alias_check_limit ()
+  {
+    return aarch64_ldp_alias_check_limit;
+  }
+  bool handle_writeback_opportunities ()
+  {
+    return aarch64_ldp_writeback;
+  }
+  bool track_load_p ()
+  {
+    const bool track_loads
+      = aarch64_tune_params.ldp_policy_model != AARCH64_LDP_STP_POLICY_NEVER;
+    return track_loads;
+  }
+  bool track_store_p ()
+  {
+    const bool track_stores
+      = aarch64_tune_params.stp_policy_model != AARCH64_LDP_STP_POLICY_NEVER;
+    return track_stores;
+  }
+  bool cand_insns_empty_p (std::list<insn_info *> &insns)
+  {
+    return insns.empty ();
+  }
+  bool pair_mem_in_range_p (HOST_WIDE_INT off)
+  {
+    return (off < LDP_MIN_IMM || off > LDP_MAX_IMM);
+  }
+};
+
+
 // State used by the pass for a given basic block.
-struct ldp_bb_info
+struct pair_fusion_bb_info
 {
   using def_hash = nofree_ptr_hash<def_info>;
   using expr_key_t = pair_hash<tree_operand_hash, int_hash<int, -1, -2>>;
@@ -160,14 +274,17 @@  struct ldp_bb_info
 
   static const size_t obstack_alignment = sizeof (void *);
   bb_info *m_bb;
+  pair_fusion *bb_state;
 
-  ldp_bb_info (bb_info *bb) : m_bb (bb), m_emitted_tombstone (false)
+  pair_fusion_bb_info (bb_info *bb,
+		      aarch64_pair_fusion *d) : m_bb (bb),
+		      bb_state (d), m_emitted_tombstone (false)
   {
     obstack_specify_allocation (&m_obstack, OBSTACK_CHUNK_SIZE,
 				obstack_alignment, obstack_chunk_alloc,
 				obstack_chunk_free);
   }
-  ~ldp_bb_info ()
+  ~pair_fusion_bb_info ()
   {
     obstack_free (&m_obstack, nullptr);
 
@@ -177,10 +294,32 @@  struct ldp_bb_info
 	bitmap_obstack_release (&m_bitmap_obstack);
       }
   }
+  void track_access (insn_info *, bool load, rtx mem);
+  void transform ();
+  void cleanup_tombstones ();
+  void merge_pairs (insn_list_t &, insn_list_t &,
+		    bool load_p, unsigned access_size);
+  void transform_for_base (int load_size, access_group &group);
+
+  bool try_fuse_pair (bool load_p, unsigned access_size,
+			     insn_info *i1, insn_info *i2);
+
+  bool fuse_pair (bool load_p, unsigned access_size,
+		  int writeback,
+		  insn_info *i1, insn_info *i2,
+		  base_cand &base,
+		  const insn_range_info &move_range);
 
-  inline void track_access (insn_info *, bool load, rtx mem);
-  inline void transform ();
-  inline void cleanup_tombstones ();
+  void do_alias_analysis (insn_info *alias_hazards[4],
+			  alias_walker *walkers[4],
+			  bool load_p);
+
+  void track_tombstone (int uid);
+
+  bool track_via_mem_expr (insn_info *, rtx mem, lfs_fields lfs);
+
+  template<typename Map>
+  void traverse_base_map (Map &map);
 
 private:
   obstack m_obstack;
@@ -191,30 +330,32 @@  private:
   bool m_emitted_tombstone;
 
   inline splay_tree_node<access_record *> *node_alloc (access_record *);
+};
 
-  template<typename Map>
-  inline void traverse_base_map (Map &map);
-  inline void transform_for_base (int load_size, access_group &group);
-
-  inline void merge_pairs (insn_list_t &, insn_list_t &,
-			   bool load_p, unsigned access_size);
-
-  inline bool try_fuse_pair (bool load_p, unsigned access_size,
-			     insn_info *i1, insn_info *i2);
-
-  inline bool fuse_pair (bool load_p, unsigned access_size,
-			 int writeback,
-			 insn_info *i1, insn_info *i2,
-			 base_cand &base,
-			 const insn_range_info &move_range);
-
-  inline void track_tombstone (int uid);
+rtx
+aarch64_pair_fusion::gen_mem_pair (rtx *pats, rtx writeback, bool load_p)
+{
+  rtx pair_pat;
 
-  inline bool track_via_mem_expr (insn_info *, rtx mem, lfs_fields lfs);
-};
+  if (writeback)
+    {
+      auto patvec = gen_rtvec (3, writeback, pats[0], pats[1]);
+      pair_pat = gen_rtx_PARALLEL (VOIDmode, patvec);
+    }
+  else if (load_p)
+    pair_pat = aarch64_gen_load_pair (XEXP (pats[0], 0),
+				      XEXP (pats[1], 0),
+				      XEXP (pats[0], 1));
+  else
+    pair_pat = aarch64_gen_store_pair (XEXP (pats[0], 0),
+				       XEXP (pats[0], 1),
+				       XEXP (pats[1], 1));
+  return pair_pat;
+}
 
 splay_tree_node<access_record *> *
-ldp_bb_info::node_alloc (access_record *access)
+pair_fusion_bb_info::node_alloc (access_record *access)
 {
   using T = splay_tree_node<access_record *>;
   void *addr = obstack_alloc (&m_obstack, sizeof (T));
@@ -262,7 +403,7 @@  drop_writeback (rtx mem)
 // RTX_AUTOINC addresses.  The interface is like strip_offset except we take a
 // MEM so that we know the mode of the access.
 static rtx
-ldp_strip_offset (rtx mem, poly_int64 *offset)
+pair_mem_strip_offset (rtx mem, poly_int64 *offset)
 {
   rtx addr = XEXP (mem, 0);
 
@@ -332,6 +473,12 @@  ldp_operand_mode_ok_p (machine_mode mode)
   return reload_completed || mode != TImode;
 }
 
+bool
+aarch64_pair_fusion::pair_operand_mode_ok_p (machine_mode mode)
+{
+  return ldp_operand_mode_ok_p (mode);
+}
+
 // Given LFS (load_p, fpsimd_p, size) fields in FIELDS, encode these
 // into an integer for use as a hash table key.
 static int
@@ -396,7 +543,7 @@  access_group::track (Alloc alloc_node, poly_int64 offset, insn_info *insn)
 // MEM_EXPR base (i.e. a tree decl) relative to which we can track the access.
 // LFS is used as part of the key to the hash table, see track_access.
 bool
-ldp_bb_info::track_via_mem_expr (insn_info *insn, rtx mem, lfs_fields lfs)
+pair_fusion_bb_info::track_via_mem_expr (insn_info *insn, rtx mem, lfs_fields lfs)
 {
   if (!MEM_EXPR (mem) || !MEM_OFFSET_KNOWN_P (mem))
     return false;
@@ -412,9 +559,10 @@  ldp_bb_info::track_via_mem_expr (insn_info *insn, rtx mem, lfs_fields lfs)
   const machine_mode mem_mode = GET_MODE (mem);
   const HOST_WIDE_INT mem_size = GET_MODE_SIZE (mem_mode).to_constant ();
 
-  // Punt on misaligned offsets.  LDP/STP instructions require offsets to be a
-  // multiple of the access size, and we believe that misaligned offsets on
-  // MEM_EXPR bases are likely to lead to misaligned offsets w.r.t. RTL bases.
+  // Punt on misaligned offsets.  Paired memory access instructions require
+  // offsets to be a multiple of the access size, and we believe that
+  // misaligned offsets on MEM_EXPR bases are likely to lead to misaligned
+  // offsets w.r.t. RTL bases.
   if (!multiple_p (offset, mem_size))
     return false;
 
@@ -438,46 +586,38 @@  ldp_bb_info::track_via_mem_expr (insn_info *insn, rtx mem, lfs_fields lfs)
 }
 
 // Main function to begin pair discovery.  Given a memory access INSN,
-// determine whether it could be a candidate for fusing into an ldp/stp,
+// determine whether it could be a candidate for fusing into a paired access,
 // and if so, track it in the appropriate data structure for this basic
 // block.  LOAD_P is true if the access is a load, and MEM is the mem
 // rtx that occurs in INSN.
 void
-ldp_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
+pair_fusion_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
 {
   // We can't combine volatile MEMs, so punt on these.
   if (MEM_VOLATILE_P (mem))
     return;
 
-  // Ignore writeback accesses if the param says to do so.
-  if (!aarch64_ldp_writeback
+  // Ignore writeback accesses if the param says to do so.
+  if (!bb_state->handle_writeback_opportunities ()
       && GET_RTX_CLASS (GET_CODE (XEXP (mem, 0))) == RTX_AUTOINC)
     return;
 
   const machine_mode mem_mode = GET_MODE (mem);
-  if (!ldp_operand_mode_ok_p (mem_mode))
+
+  if (!bb_state->pair_operand_mode_ok_p (mem_mode))
     return;
 
   rtx reg_op = XEXP (PATTERN (insn->rtl ()), !load_p);
 
-  // Ignore the access if the register operand isn't suitable for ldp/stp.
-  if (load_p
-      ? !aarch64_ldp_reg_operand (reg_op, mem_mode)
-      : !aarch64_stp_reg_operand (reg_op, mem_mode))
+  if (!bb_state->pair_reg_operand_ok_p (load_p, reg_op, mem_mode))
     return;
-
   // We want to segregate FP/SIMD accesses from GPR accesses.
   //
   // Before RA, we use the modes, noting that stores of constant zero
   // operands use GPRs (even in non-integer modes).  After RA, we use
   // the hard register numbers.
-  const bool fpsimd_op_p
-    = reload_completed
-    ? (REG_P (reg_op) && FP_REGNUM_P (REGNO (reg_op)))
-    : (GET_MODE_CLASS (mem_mode) != MODE_INT
-       && (load_p || !aarch64_const_zero_rtx_p (reg_op)));
-
-  // Note ldp_operand_mode_ok_p already rejected VL modes.
+  const bool fpsimd_op_p = bb_state->fpsimd_op_p (reg_op, mem_mode, load_p);
+  // Note pair_operand_mode_ok_p already rejected VL modes.
   const HOST_WIDE_INT mem_size = GET_MODE_SIZE (mem_mode).to_constant ();
   const lfs_fields lfs = { load_p, fpsimd_op_p, mem_size };
 
@@ -487,7 +627,7 @@  ldp_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
   poly_int64 mem_off;
   rtx addr = XEXP (mem, 0);
   const bool autoinc_p = GET_RTX_CLASS (GET_CODE (addr)) == RTX_AUTOINC;
-  rtx base = ldp_strip_offset (mem, &mem_off);
+  rtx base = pair_mem_strip_offset (mem, &mem_off);
   if (!REG_P (base))
     return;
 
@@ -506,8 +646,8 @@  ldp_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
   // elimination offset pre-RA, we should postpone forming pairs on such
   // accesses until after RA.
   //
-  // As it stands, addresses with offsets in range for LDR but not
-  // in range for LDP/STP are currently reloaded inefficiently,
+  // As it stands, addresses in range for an individual load/store but not
+  // for a paired access are currently reloaded inefficiently,
   // ending up with a separate base register for each pair.
   //
   // In theory LRA should make use of
@@ -519,8 +659,8 @@  ldp_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
   // that calls targetm.legitimize_address_displacement.
   //
   // So for now, it's better to punt when we can't be sure that the
-  // offset is in range for LDP/STP.  Out-of-range cases can then be
-  // handled after RA by the out-of-range LDP/STP peepholes.  Eventually, it
+  // offset is in range for a paired access.  Out-of-range cases can then be
+  // handled after RA by the out-of-range pair mem peepholes.  Eventually, it
   // would be nice to handle known out-of-range opportunities in the
   // pass itself (for stack accesses, this would be in the post-RA pass).
   if (!reload_completed
@@ -573,8 +713,8 @@  ldp_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
 	gcc_unreachable (); // Base defs should be unique.
     }
 
-  // Punt on misaligned offsets.  LDP/STP require offsets to be a multiple of
-  // the access size.
+  // Punt on misaligned offsets.  Paired memory accesses require offsets
+  // to be a multiple of the access size.
   if (!multiple_p (mem_off, mem_size))
     return;
 
@@ -614,7 +754,7 @@  static bool no_ignore (insn_info *) { return false; }
 // making use of alias disambiguation.
 static insn_info *
 latest_hazard_before (insn_info *insn, rtx *ignore,
-		      insn_info *ignore_insn = nullptr)
+		      insn_info *ignore_insn = 0)
 {
   insn_info *result = nullptr;
 
@@ -1150,7 +1290,7 @@  extract_writebacks (bool load_p, rtx pats[2], int changed)
       const bool autoinc_p = GET_RTX_CLASS (GET_CODE (addr)) == RTX_AUTOINC;
 
       poly_int64 offset;
-      rtx this_base = ldp_strip_offset (mem, &offset);
+      rtx this_base = pair_mem_strip_offset (mem, &offset);
       gcc_assert (REG_P (this_base));
       if (base_reg)
 	gcc_assert (rtx_equal_p (base_reg, this_base));
@@ -1286,7 +1426,11 @@  find_trailing_add (insn_info *insns[2],
 
   off_hwi /= access_size;
 
-  if (off_hwi < LDP_MIN_IMM || off_hwi > LDP_MAX_IMM)
+  pair_fusion *pfuse;
+  aarch64_pair_fusion derived;
+  pfuse = &derived;
+
+  if (pfuse->pair_mem_in_range_p (off_hwi))
     return nullptr;
 
   auto dump_prefix = [&]()
@@ -1328,7 +1472,7 @@  find_trailing_add (insn_info *insns[2],
 // We just emitted a tombstone with uid UID, track it in a bitmap for
 // this BB so we can easily identify it later when cleaning up tombstones.
 void
-ldp_bb_info::track_tombstone (int uid)
+pair_fusion_bb_info::track_tombstone (int uid)
 {
   if (!m_emitted_tombstone)
     {
@@ -1528,7 +1672,7 @@  fixup_debug_uses (obstack_watermark &attempt,
 	  gcc_checking_assert (GET_RTX_CLASS (GET_CODE (XEXP (mem, 0)))
 			       == RTX_AUTOINC);
 
-	  base = ldp_strip_offset (mem, &offset);
+	  base = pair_mem_strip_offset (mem, &offset);
 	  gcc_checking_assert (REG_P (base) && REGNO (base) == base_regno);
 	}
       fixup_debug_use (attempt, use, def, base, offset);
@@ -1664,7 +1808,7 @@  fixup_debug_uses (obstack_watermark &attempt,
 // BASE gives the chosen base candidate for the pair and MOVE_RANGE is
 // a singleton range which says where to place the pair.
 bool
-ldp_bb_info::fuse_pair (bool load_p,
+pair_fusion_bb_info::fuse_pair (bool load_p,
 			unsigned access_size,
 			int writeback,
 			insn_info *i1, insn_info *i2,
@@ -1800,7 +1944,7 @@  ldp_bb_info::fuse_pair (bool load_p,
     {
       if (dump_file)
 	fprintf (dump_file,
-		 "  ldp: i%d has wb but subsequent i%d has non-wb "
+		 "  load pair: i%d has wb but subsequent i%d has non-wb "
 		 "update of base (r%d), dropping wb\n",
 		 insns[0]->uid (), insns[1]->uid (), base_regno);
       gcc_assert (writeback_effect);
@@ -1823,7 +1967,7 @@  ldp_bb_info::fuse_pair (bool load_p,
     }
 
   // If either of the original insns had writeback, but the resulting pair insn
-  // does not (can happen e.g. in the ldp edge case above, or if the writeback
-  // effects cancel out), then drop the def(s) of the base register as
-  // appropriate.
+  // does not (can happen e.g. in the load pair edge case above, or if the
+  // writeback effects cancel out), then drop the def(s) of the base
+  // register as appropriate.
   //
@@ -1842,7 +1986,7 @@  ldp_bb_info::fuse_pair (bool load_p,
   // update of the base register and try and fold it in to make this into a
   // writeback pair.
   insn_info *trailing_add = nullptr;
-  if (aarch64_ldp_writeback > 1
+  if (bb_state->pair_trailing_writeback_p ()
       && !writeback_effect
       && (!load_p || (!refers_to_regno_p (base_regno, base_regno + 1,
 					 XEXP (pats[0], 0), nullptr)
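
pair_trailing_writeback_p here replaces a direct read of the aarch64
tunable; a sketch of the override, assuming the hook simply exposes the
existing variable:

  // Sketch: whether to try folding a trailing base-register update
  // into a writeback pair, per the aarch64_ldp_writeback tunable.
  bool
  aarch64_pair_fusion::pair_trailing_writeback_p ()
  {
    return aarch64_ldp_writeback > 1;
  }
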
@@ -1863,14 +2007,14 @@  ldp_bb_info::fuse_pair (bool load_p,
     }
 
   // Now that we know what base mem we're going to use, check if it's OK
-  // with the ldp/stp policy.
+  // with the pair mem policy.
   rtx first_mem = XEXP (pats[0], load_p);
-  if (!aarch64_mem_ok_with_ldpstp_policy_model (first_mem,
-						load_p,
-						GET_MODE (first_mem)))
+  if (!bb_state->pair_mem_ok_with_policy (first_mem,
+					  load_p,
+					  GET_MODE (first_mem)))
     {
       if (dump_file)
-	fprintf (dump_file, "punting on pair (%d,%d), ldp/stp policy says no\n",
+	fprintf (dump_file, "punting on pair (%d,%d), pair mem policy says no\n",
 		 i1->uid (), i2->uid ());
       return false;
     }
@@ -1878,21 +2022,10 @@  ldp_bb_info::fuse_pair (bool load_p,
   rtx reg_notes = combine_reg_notes (first, second, load_p);
 
-  rtx pair_pat;
-  if (writeback_effect)
-    {
-      auto patvec = gen_rtvec (3, writeback_effect, pats[0], pats[1]);
-      pair_pat = gen_rtx_PARALLEL (VOIDmode, patvec);
-    }
-  else if (load_p)
-    pair_pat = aarch64_gen_load_pair (XEXP (pats[0], 0),
-				      XEXP (pats[1], 0),
-				      XEXP (pats[0], 1));
-  else
-    pair_pat = aarch64_gen_store_pair (XEXP (pats[0], 0),
-				       XEXP (pats[0], 1),
-				       XEXP (pats[1], 1));
 
+  rtx pair_pat = bb_state->gen_mem_pair (pats, writeback_effect, load_p);
   insn_change *pair_change = nullptr;
   auto set_pair_pat = [pair_pat,reg_notes](insn_change *change) {
     rtx_insn *rti = change->insn ()->rtl ();
     validate_unshare_change (rti, &PATTERN (rti), pair_pat, true);
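
The if/else chain deleted above moves verbatim into the aarch64
override of gen_mem_pair, so the generic code no longer names any
aarch64 pattern builders.  A sketch of that override (the body is
exactly the logic removed in this hunk):

  // Sketch: generate the pair pattern, wrapping both patterns in a
  // PARALLEL with the writeback effect when one is present.
  rtx
  aarch64_pair_fusion::gen_mem_pair (rtx *pats, rtx writeback_effect,
                                     bool load_p)
  {
    if (writeback_effect)
      {
        auto patvec = gen_rtvec (3, writeback_effect, pats[0], pats[1]);
        return gen_rtx_PARALLEL (VOIDmode, patvec);
      }
    if (load_p)
      return aarch64_gen_load_pair (XEXP (pats[0], 0),
                                    XEXP (pats[1], 0),
                                    XEXP (pats[0], 1));
    return aarch64_gen_store_pair (XEXP (pats[0], 0),
                                   XEXP (pats[0], 1),
                                   XEXP (pats[1], 1));
  }
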
@@ -2133,15 +2266,6 @@  load_modified_by_store_p (insn_info *load,
   return false;
 }
 
-// Virtual base class for load/store walkers used in alias analysis.
-struct alias_walker
-{
-  virtual bool conflict_p (int &budget) const = 0;
-  virtual insn_info *insn () const = 0;
-  virtual bool valid () const  = 0;
-  virtual void advance () = 0;
-};
-
 // Implement some common functionality used by both store_walker
 // and load_walker.
 template<bool reverse>
@@ -2259,13 +2383,13 @@  public:
 //
 // We try to maintain the invariant that if a walker becomes invalid, we
 // set its pointer to null.
-static void
-do_alias_analysis (insn_info *alias_hazards[4],
-		   alias_walker *walkers[4],
-		   bool load_p)
+void
+pair_fusion_bb_info::do_alias_analysis (insn_info *alias_hazards[4],
+					alias_walker *walkers[4],
+					bool load_p)
 {
   const int n_walkers = 2 + (2 * !load_p);
-  int budget = aarch64_ldp_alias_check_limit;
+  int budget = bb_state->pair_mem_alias_check_limit ();
 
   auto next_walker = [walkers,n_walkers](int current) -> int {
     for (int j = 1; j <= n_walkers; j++)
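
The alias-analysis budget becomes a hook as well; its aarch64 override
can simply return the existing --param value (sketch):

  // Sketch: alias check budget, from the existing aarch64 param.
  int
  aarch64_pair_fusion::pair_mem_alias_check_limit ()
  {
    return aarch64_ldp_alias_check_limit;
  }
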
@@ -2365,7 +2489,7 @@  get_viable_bases (insn_info *insns[2],
     {
       const bool is_lower = (i == reversed);
       poly_int64 poly_off;
-      rtx base = ldp_strip_offset (cand_mems[i], &poly_off);
+      rtx base = pair_mem_strip_offset (cand_mems[i], &poly_off);
       if (GET_RTX_CLASS (GET_CODE (XEXP (cand_mems[i], 0))) == RTX_AUTOINC)
 	writeback |= (1 << i);
 
@@ -2373,7 +2497,7 @@  get_viable_bases (insn_info *insns[2],
 	continue;
 
       // Punt on accesses relative to eliminable regs.  See the comment in
-      // ldp_bb_info::track_access for a detailed explanation of this.
+      // pair_fusion_bb_info::track_access for a detailed explanation of this.
       if (!reload_completed
 	  && (REGNO (base) == FRAME_POINTER_REGNUM
 	      || REGNO (base) == ARG_POINTER_REGNUM))
@@ -2397,7 +2521,11 @@  get_viable_bases (insn_info *insns[2],
       if (!is_lower)
 	base_off--;
 
-      if (base_off < LDP_MIN_IMM || base_off > LDP_MAX_IMM)
+      aarch64_pair_fusion derived;
+      pair_fusion *pfuse = &derived;
+
+      if (!pfuse->pair_mem_in_range_p (base_off))
 	continue;
 
       use_info *use = find_access (insns[i]->uses (), REGNO (base));
@@ -2454,12 +2582,12 @@  get_viable_bases (insn_info *insns[2],
 }
 
 // Given two adjacent memory accesses of the same size, I1 and I2, try
-// and see if we can merge them into a ldp or stp.
+// and see if we can merge them into a load pair or store pair.
 //
 // ACCESS_SIZE gives the (common) size of a single access, LOAD_P is true
 // if the accesses are both loads, otherwise they are both stores.
 bool
-ldp_bb_info::try_fuse_pair (bool load_p, unsigned access_size,
-			    insn_info *i1, insn_info *i2)
+pair_fusion_bb_info::try_fuse_pair (bool load_p, unsigned access_size,
+				    insn_info *i1, insn_info *i2)
 {
   if (dump_file)
@@ -2494,7 +2622,7 @@  ldp_bb_info::try_fuse_pair (bool load_p, unsigned access_size,
     {
       if (dump_file)
 	fprintf (dump_file,
-		 "punting on ldp due to reg conflcits (%d,%d)\n",
+		 "punting on pair mem load  due to reg conflcits (%d,%d)\n",
 		 insns[0]->uid (), insns[1]->uid ());
       return false;
     }
@@ -2843,7 +2971,7 @@  debug (const insn_list_t &l)
 // we can't re-order them anyway, so provided earlier passes have cleaned up
 // redundant loads, we shouldn't miss opportunities by doing this.
 void
-ldp_bb_info::merge_pairs (insn_list_t &left_list,
-			  insn_list_t &right_list,
-			  bool load_p,
-			  unsigned access_size)
+pair_fusion_bb_info::merge_pairs (insn_list_t &left_list,
+				  insn_list_t &right_list,
+				  bool load_p,
+				  unsigned access_size)
@@ -2890,8 +3018,8 @@  ldp_bb_info::merge_pairs (insn_list_t &left_list,
 // of accesses.  If we find two sets of adjacent accesses, call
 // merge_pairs.
 void
-ldp_bb_info::transform_for_base (int encoded_lfs,
-				 access_group &group)
+pair_fusion_bb_info::transform_for_base (int encoded_lfs,
+					 access_group &group)
 {
   const auto lfs = decode_lfs (encoded_lfs);
   const unsigned access_size = lfs.size;
@@ -2909,7 +3037,7 @@  ldp_bb_info::transform_for_base (int encoded_lfs,
 		       access.cand_insns,
 		       lfs.load_p,
 		       access_size);
-	  skip_next = access.cand_insns.empty ();
+	  skip_next = bb_state->cand_insns_empty_p (access.cand_insns);
 	}
       prev_access = &access;
     }
@@ -2919,7 +3047,7 @@  ldp_bb_info::transform_for_base (int encoded_lfs,
 // and remove all the tombstone insns, being sure to reparent any uses
 // of mem to previous defs when we do this.
 void
-ldp_bb_info::cleanup_tombstones ()
+pair_fusion_bb_info::cleanup_tombstones ()
 {
   // No need to do anything if we didn't emit a tombstone insn for this BB.
   if (!m_emitted_tombstone)
@@ -2947,7 +3075,7 @@  ldp_bb_info::cleanup_tombstones ()
 
 template<typename Map>
 void
-ldp_bb_info::traverse_base_map (Map &map)
+pair_fusion_bb_info::traverse_base_map (Map &map)
 {
   for (auto kv : map)
     {
@@ -2958,7 +3086,7 @@  ldp_bb_info::traverse_base_map (Map &map)
 }
 
 void
-ldp_bb_info::transform ()
+pair_fusion_bb_info::transform ()
 {
   traverse_base_map (expr_map);
   traverse_base_map (def_map);
@@ -3167,14 +3295,13 @@  try_promote_writeback (insn_info *insn)
 // for load/store candidates.  If running after RA, also try and promote
 // non-writeback pairs to use writeback addressing.  Then try to fuse
 // candidates into pairs.
-void ldp_fusion_bb (bb_info *bb)
+void pair_fusion::ldp_fusion_bb (bb_info *bb)
 {
-  const bool track_loads
-    = aarch64_tune_params.ldp_policy_model != AARCH64_LDP_STP_POLICY_NEVER;
-  const bool track_stores
-    = aarch64_tune_params.stp_policy_model != AARCH64_LDP_STP_POLICY_NEVER;
+  const bool track_loads = track_load_p ();
+  const bool track_stores = track_store_p ();
 
-  ldp_bb_info bb_state (bb);
+  pair_fusion_bb_info bb_info (bb, this);
 
   for (auto insn : bb->nondebug_insns ())
     {
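
track_load_p/track_store_p capture the tuning checks deleted above; a
sketch of their aarch64 overrides, reusing the removed conditions:

  // Sketch: whether this target wants loads (resp. stores) tracked,
  // per the tuning policy the old code checked inline.
  bool
  aarch64_pair_fusion::track_load_p ()
  {
    return aarch64_tune_params.ldp_policy_model
           != AARCH64_LDP_STP_POLICY_NEVER;
  }

  bool
  aarch64_pair_fusion::track_store_p ()
  {
    return aarch64_tune_params.stp_policy_model
           != AARCH64_LDP_STP_POLICY_NEVER;
  }
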
@@ -3184,31 +3311,31 @@  void ldp_fusion_bb (bb_info *bb)
 	continue;
 
       rtx pat = PATTERN (rti);
-      if (reload_completed
-	  && aarch64_ldp_writeback > 1
-	  && GET_CODE (pat) == PARALLEL
-	  && XVECLEN (pat, 0) == 2)
+      if (pair_mem_promote_writeback_p (pat))
 	try_promote_writeback (insn);
 
       if (GET_CODE (pat) != SET)
 	continue;
 
       if (track_stores && MEM_P (XEXP (pat, 0)))
-	bb_state.track_access (insn, false, XEXP (pat, 0));
+	bb_info.track_access (insn, false, XEXP (pat, 0));
       else if (track_loads && MEM_P (XEXP (pat, 1)))
-	bb_state.track_access (insn, true, XEXP (pat, 1));
+	bb_info.track_access (insn, true, XEXP (pat, 1));
     }
 
-  bb_state.transform ();
-  bb_state.cleanup_tombstones ();
+  bb_info.transform ();
+  bb_info.cleanup_tombstones ();
 }
 
 void ldp_fusion ()
 {
   ldp_fusion_init ();
+  aarch64_pair_fusion derived;
+  pair_fusion *pfuse = &derived;
 
   for (auto bb : crtl->ssa->bbs ())
-    ldp_fusion_bb (bb);
+    pfuse->ldp_fusion_bb (bb);
 
   ldp_fusion_destroy ();
 }
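
Finally, pair_mem_promote_writeback_p in the loop above stands in for
the four-part condition this patch removes from ldp_fusion_bb; a sketch
of the aarch64 override, folding in the reload_completed check that
used to guard the call:

  // Sketch: recognize a post-RA two-element PARALLEL (an existing
  // non-writeback pair) as a candidate for writeback promotion.
  bool
  aarch64_pair_fusion::pair_mem_promote_writeback_p (rtx pat)
  {
    return reload_completed
           && aarch64_ldp_writeback > 1
           && GET_CODE (pat) == PARALLEL
           && XVECLEN (pat, 0) == 2;
  }
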