
[09/11] aarch64: Rewrite non-writeback ldp/stp patterns

Message ID ZVZbB0KHxzDkN0ci@arm.com
State New
Series aarch64: Rework ldp/stp patterns, add new ldp/stp pass

Commit Message

Alex Coplan Nov. 16, 2023, 6:10 p.m. UTC
This patch overhauls the load/store pair patterns with two main goals:

1. Fixing a correctness issue (the current patterns are not RA-friendly).
2. Allowing more flexibility in which operand modes are supported, and which
   combinations of modes are allowed in the two arms of the load/store pair,
   while reducing the number of patterns required both in the source and in
   the generated code.

The correctness issue (1) arises because the current patterns have two
independent memory operands that are tied together only by a predicate on
the insns.  Since LRA only looks at the constraints, one of the memory
operands can get reloaded without the other one being changed, leaving the
insn unrecognizable after reload.
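
For reference, here is a condensed form of one of the old-style patterns
(load_pair_sw_<SX:mode><SX2:mode>, deleted below).  Note that nothing in
the operands themselves ties operand 3 to operand 1; the relationship
lives only in the insn condition, which LRA doesn't consult:

(define_insn "load_pair_sw_<SX:mode><SX2:mode>"
  [(set (match_operand:SX 0 "register_operand")
	(match_operand:SX 1 "aarch64_mem_pair_operand"))
   (set (match_operand:SX2 2 "register_operand")
	(match_operand:SX2 3 "memory_operand"))]
  "rtx_equal_p (XEXP (operands[3], 0),
		plus_constant (Pmode,
			       XEXP (operands[1], 0),
			       GET_MODE_SIZE (<SX:MODE>mode)))"
  ...)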

We fix this issue by changing the patterns such that they only ever have one
memory operand representing the entire pair.  For the store case, we use an
unspec to logically concatenate the register operands before storing them.
For the load case, we use unspecs to extract the "lanes" from the pair mem,
with the second occurrence of the mem matched using a match_dup (so that,
as far as the RA is concerned, there is still really only one memory
operand).
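
Condensed from the new patterns in the diff below, the store and load
shapes are as follows (constraints and alternatives omitted):

(define_insn "*store_pair_<ldst_sz>"
  [(set (match_operand:<VPAIR> 0 "aarch64_mem_pair_lanes_operand")
	(unspec:<VPAIR>
	  [(match_operand:GPI 1 "aarch64_stp_reg_operand")
	   (match_operand:GPI 2 "aarch64_stp_reg_operand")] UNSPEC_STP))]
  ...)

(define_insn "*load_pair_<ldst_sz>"
  [(set (match_operand:GPI 0 "aarch64_ldp_reg_operand")
	(unspec [
	  (match_operand:<VPAIR> 1 "aarch64_mem_pair_lanes_operand")
	] UNSPEC_LDP_FST))
   (set (match_operand:GPI 2 "aarch64_ldp_reg_operand")
	(unspec [
	  (match_dup 1)
	] UNSPEC_LDP_SND))]
  ...)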

In terms of the modes used for the pair memory operands, we canonicalize
these to V2x4QImode, V2x8QImode, and V2x16QImode.  These modes have not
only the correct size but also the correct alignment requirement for a
memory operand representing an entire load/store pair.  Unlike the other
two, V2x4QImode didn't previously exist, so it had to be added as part of
this patch.
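
The definition (from the aarch64-modes.def hunk below) gives the mode the
size of two 4-byte accesses with 4-byte alignment:

/* V2x4QImode.  Used in load/store pair patterns.  */
VECTOR_MODE_WITH_PREFIX (V2x, INT, QI, 4, 5);
ADJUST_NUNITS (V2x4QI, 8);
ADJUST_ALIGNMENT (V2x4QI, 4);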

As with the previous patch generalizing the writeback patterns, this
patch aims to be flexible in the combinations of modes supported by the
patterns, without the large number of generated patterns that using
distinct mode iterators for each arm would require.

The new scheme means we only need a single (generated) pattern for each
load/store operation of a given operand size.  For the 4-byte and 8-byte
operand cases, we use the GPI iterator to synthesize the two patterns.
The 16-byte case is implemented as a separate pattern in the source (due
to only having a single possible alternative).
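
The mapping from operand mode to pair memory mode is done by the new
VPAIR mode attribute (from the iterators.md hunk below), so iterating the
pattern over GPI yields exactly the SImode and DImode variants:

;; Load/store pair mode.
(define_mode_attr VPAIR [(SI "V2x4QI") (DI "V2x8QI")])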

Since the UNSPEC patterns can't be interpreted by the dwarf2cfi code,
we add REG_CFA_OFFSET notes to the store pair insns emitted by
aarch64_save_callee_saves, so that correct CFI information can still be
generated.  Furthermore, we now unconditionally generate these CFA
notes on frame-related insns emitted by aarch64_save_callee_saves.
This is done in case the load/store pair pass later forms these saves
into pairs, in which case the notes would be needed.
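
Condensed from the aarch64.cc hunk below, this amounts to describing each
save explicitly relative to the CFA base:

rtx cfa_mem = gen_frame_mem (mode,
			     plus_constant (Pmode, cfa_base, cfa_offset));
...
if (frame_related_p)
  add_reg_note (insn, REG_CFA_OFFSET, gen_rtx_SET (cfa_mem, reg));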

We also adjust the ldp/stp peepholes to generate the new form.  This is
done by switching the generation to use the
aarch64_gen_{load,store}_pair interface, making it easier to change the
form in the future if needed.  (The upcoming aarch64 load/store pair
pass also makes use of this interface.)
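
Each peephole now discards its matched insns and calls the new helper,
which swaps the operands where necessary and emits the new form
(condensed from the aarch64-ldpstp.md changes below):

  "aarch64_operands_ok_for_ldpstp (operands, true, <MODE>mode)"
  [(const_int 0)]
{
  aarch64_finish_ldpstp_peephole (operands, true);
  DONE;
})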

This patch also adds an "ldpstp" attribute to the non-writeback
load/store pair patterns, which is used by the post-RA load/store pair
pass to identify existing patterns and see if they can be promoted to
writeback variants.
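
The attribute defaults to "none", and the new patterns set it to "ldp" or
"stp" as appropriate (from the aarch64.md hunk below):

(define_attr "ldpstp" "ldp,stp,none" (const_string "none"))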

One potential concern with using unspecs for the patterns is that it can block
optimization by the generic RTL passes.  This patch series tries to mitigate
this in two ways:
 1. The pre-RA load/store pair pass runs very late in the pre-RA pipeline.
 2. A later patch in the series adjusts the aarch64 mem{cpy,set} expansion to
    emit individual loads/stores instead of ldp/stp.  These should then be
    formed back into load/store pairs much later in the RTL pipeline by the
    new load/store pair pass.

Bootstrapped/regtested on aarch64-linux-gnu, OK for trunk?

Thanks,
Alex

gcc/ChangeLog:

	* config/aarch64/aarch64-ldpstp.md: Abstract ldp/stp
	representation from peepholes, allowing use of new form.
	* config/aarch64/aarch64-modes.def (V2x4QImode): Define.
	* config/aarch64/aarch64-protos.h
	(aarch64_finish_ldpstp_peephole): Declare.
	(aarch64_swap_ldrstr_operands): Delete declaration.
	(aarch64_gen_load_pair): Declare.
	(aarch64_gen_store_pair): Declare.
	* config/aarch64/aarch64-simd.md (load_pair<DREG:mode><DREG2:mode>):
	Delete.
	(vec_store_pair<DREG:mode><DREG2:mode>): Delete.
	(load_pair<VQ:mode><VQ2:mode>): Delete.
	(vec_store_pair<VQ:mode><VQ2:mode>): Delete.
	* config/aarch64/aarch64.cc (aarch64_pair_mode_for_mode): New.
	(aarch64_gen_store_pair): Adjust to use new unspec form of stp.
	Drop second mem from parameters.
	(aarch64_gen_load_pair): Likewise.
	(aarch64_pair_mem_from_base): New.
	(aarch64_save_callee_saves): Emit REG_CFA_OFFSET notes for
	frame-related saves.  Adjust call to aarch64_gen_store_pair.
	(aarch64_restore_callee_saves): Adjust calls to
	aarch64_gen_load_pair to account for change in interface.
	(aarch64_process_components): Likewise.
	(aarch64_classify_address): Handle 32-byte pair mems in
	LDP_STP_N case.
	(aarch64_print_operand): Likewise.
	(aarch64_copy_one_block_and_progress_pointers): Adjust calls to
	account for change in aarch64_gen_{load,store}_pair interface.
	(aarch64_set_one_block_and_progress_pointer): Likewise.
	(aarch64_finish_ldpstp_peephole): New.
	(aarch64_gen_adjusted_ldpstp): Adjust to use generation helper.
	* config/aarch64/aarch64.md (ldpstp): New attribute.
	(load_pair_sw_<SX:mode><SX2:mode>): Delete.
	(load_pair_dw_<DX:mode><DX2:mode>): Delete.
	(load_pair_dw_<TX:mode><TX2:mode>): Delete.
	(*load_pair_<ldst_sz>): New.
	(*load_pair_16): New.
	(store_pair_sw_<SX:mode><SX2:mode>): Delete.
	(store_pair_dw_<DX:mode><DX2:mode>): Delete.
	(store_pair_dw_<TX:mode><TX2:mode>): Delete.
	(*store_pair_<ldst_sz>): New.
	(*store_pair_16): New.
	(*load_pair_extendsidi2_aarch64): Adjust to use new form.
	(*load_pair_zero_extendsidi2_aarch64): Likewise.
	* config/aarch64/iterators.md (VPAIR): New.
	* config/aarch64/predicates.md (aarch64_mem_pair_operand): Change to
	a special predicate derived from aarch64_mem_pair_operator.
---
 gcc/config/aarch64/aarch64-ldpstp.md |  66 +++----
 gcc/config/aarch64/aarch64-modes.def |   6 +-
 gcc/config/aarch64/aarch64-protos.h  |   5 +-
 gcc/config/aarch64/aarch64-simd.md   |  60 -------
 gcc/config/aarch64/aarch64.cc        | 257 +++++++++++++++------------
 gcc/config/aarch64/aarch64.md        | 188 +++++++++-----------
 gcc/config/aarch64/iterators.md      |   3 +
 gcc/config/aarch64/predicates.md     |  10 +-
 8 files changed, 270 insertions(+), 325 deletions(-)

Comments

Richard Sandiford Nov. 21, 2023, 4:04 p.m. UTC | #1
Alex Coplan <alex.coplan@arm.com> writes:
> [...]
>
> diff --git a/gcc/config/aarch64/aarch64-ldpstp.md b/gcc/config/aarch64/aarch64-ldpstp.md
> index 1ee7c73ff0c..dc39af85254 100644
> --- a/gcc/config/aarch64/aarch64-ldpstp.md
> +++ b/gcc/config/aarch64/aarch64-ldpstp.md
> @@ -24,10 +24,10 @@ (define_peephole2
>     (set (match_operand:GPI 2 "register_operand" "")
>  	(match_operand:GPI 3 "memory_operand" ""))]
>    "aarch64_operands_ok_for_ldpstp (operands, true, <MODE>mode)"
> -  [(parallel [(set (match_dup 0) (match_dup 1))
> -	      (set (match_dup 2) (match_dup 3))])]
> +  [(const_int 0)]
>  {
> -  aarch64_swap_ldrstr_operands (operands, true);
> +  aarch64_finish_ldpstp_peephole (operands, true);
> +  DONE;
>  })
>  
>  (define_peephole2
> @@ -36,10 +36,10 @@ (define_peephole2
>     (set (match_operand:GPI 2 "memory_operand" "")
>  	(match_operand:GPI 3 "aarch64_reg_or_zero" ""))]
>    "aarch64_operands_ok_for_ldpstp (operands, false, <MODE>mode)"
> -  [(parallel [(set (match_dup 0) (match_dup 1))
> -	      (set (match_dup 2) (match_dup 3))])]
> +  [(const_int 0)]
>  {
> -  aarch64_swap_ldrstr_operands (operands, false);
> +  aarch64_finish_ldpstp_peephole (operands, false);
> +  DONE;
>  })
>  
>  (define_peephole2
> @@ -48,10 +48,10 @@ (define_peephole2
>     (set (match_operand:GPF 2 "register_operand" "")
>  	(match_operand:GPF 3 "memory_operand" ""))]
>    "aarch64_operands_ok_for_ldpstp (operands, true, <MODE>mode)"
> -  [(parallel [(set (match_dup 0) (match_dup 1))
> -	      (set (match_dup 2) (match_dup 3))])]
> +  [(const_int 0)]
>  {
> -  aarch64_swap_ldrstr_operands (operands, true);
> +  aarch64_finish_ldpstp_peephole (operands, true);
> +  DONE;
>  })
>  
>  (define_peephole2
> @@ -60,10 +60,10 @@ (define_peephole2
>     (set (match_operand:GPF 2 "memory_operand" "")
>  	(match_operand:GPF 3 "aarch64_reg_or_fp_zero" ""))]
>    "aarch64_operands_ok_for_ldpstp (operands, false, <MODE>mode)"
> -  [(parallel [(set (match_dup 0) (match_dup 1))
> -	      (set (match_dup 2) (match_dup 3))])]
> +  [(const_int 0)]
>  {
> -  aarch64_swap_ldrstr_operands (operands, false);
> +  aarch64_finish_ldpstp_peephole (operands, false);
> +  DONE;
>  })
>  
>  (define_peephole2
> @@ -72,10 +72,10 @@ (define_peephole2
>     (set (match_operand:DREG2 2 "register_operand" "")
>  	(match_operand:DREG2 3 "memory_operand" ""))]
>    "aarch64_operands_ok_for_ldpstp (operands, true, <DREG:MODE>mode)"
> -  [(parallel [(set (match_dup 0) (match_dup 1))
> -	      (set (match_dup 2) (match_dup 3))])]
> +  [(const_int 0)]
>  {
> -  aarch64_swap_ldrstr_operands (operands, true);
> +  aarch64_finish_ldpstp_peephole (operands, true);
> +  DONE;
>  })
>  
>  (define_peephole2
> @@ -84,10 +84,10 @@ (define_peephole2
>     (set (match_operand:DREG2 2 "memory_operand" "")
>  	(match_operand:DREG2 3 "register_operand" ""))]
>    "aarch64_operands_ok_for_ldpstp (operands, false, <DREG:MODE>mode)"
> -  [(parallel [(set (match_dup 0) (match_dup 1))
> -	      (set (match_dup 2) (match_dup 3))])]
> +  [(const_int 0)]
>  {
> -  aarch64_swap_ldrstr_operands (operands, false);
> +  aarch64_finish_ldpstp_peephole (operands, false);
> +  DONE;
>  })
>  
>  (define_peephole2
> @@ -99,10 +99,10 @@ (define_peephole2
>     && aarch64_operands_ok_for_ldpstp (operands, true, <VQ:MODE>mode)
>     && (aarch64_tune_params.extra_tuning_flags
>  	& AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS) == 0"
> -  [(parallel [(set (match_dup 0) (match_dup 1))
> -	      (set (match_dup 2) (match_dup 3))])]
> +  [(const_int 0)]
>  {
> -  aarch64_swap_ldrstr_operands (operands, true);
> +  aarch64_finish_ldpstp_peephole (operands, true);
> +  DONE;
>  })
>  
>  (define_peephole2
> @@ -114,10 +114,10 @@ (define_peephole2
>     && aarch64_operands_ok_for_ldpstp (operands, false, <VQ:MODE>mode)
>     && (aarch64_tune_params.extra_tuning_flags
>  	& AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS) == 0"
> -  [(parallel [(set (match_dup 0) (match_dup 1))
> -	      (set (match_dup 2) (match_dup 3))])]
> +  [(const_int 0)]
>  {
> -  aarch64_swap_ldrstr_operands (operands, false);
> +  aarch64_finish_ldpstp_peephole (operands, false);
> +  DONE;
>  })
>  
>  
> @@ -129,10 +129,10 @@ (define_peephole2
>     (set (match_operand:DI 2 "register_operand" "")
>  	(sign_extend:DI (match_operand:SI 3 "memory_operand" "")))]
>    "aarch64_operands_ok_for_ldpstp (operands, true, SImode)"
> -  [(parallel [(set (match_dup 0) (sign_extend:DI (match_dup 1)))
> -	      (set (match_dup 2) (sign_extend:DI (match_dup 3)))])]
> +  [(const_int 0)]
>  {
> -  aarch64_swap_ldrstr_operands (operands, true);
> +  aarch64_finish_ldpstp_peephole (operands, true, SIGN_EXTEND);
> +  DONE;
>  })
>  
>  (define_peephole2
> @@ -141,10 +141,10 @@ (define_peephole2
>     (set (match_operand:DI 2 "register_operand" "")
>  	(zero_extend:DI (match_operand:SI 3 "memory_operand" "")))]
>    "aarch64_operands_ok_for_ldpstp (operands, true, SImode)"
> -  [(parallel [(set (match_dup 0) (zero_extend:DI (match_dup 1)))
> -	      (set (match_dup 2) (zero_extend:DI (match_dup 3)))])]
> +  [(const_int 0)]
>  {
> -  aarch64_swap_ldrstr_operands (operands, true);
> +  aarch64_finish_ldpstp_peephole (operands, true, ZERO_EXTEND);
> +  DONE;
>  })
>  
>  ;; Handle storing of a floating point zero with integer data.
> @@ -163,10 +163,10 @@ (define_peephole2
>     (set (match_operand:<FCVT_TARGET> 2 "memory_operand" "")
>  	(match_operand:<FCVT_TARGET> 3 "aarch64_reg_zero_or_fp_zero" ""))]
>    "aarch64_operands_ok_for_ldpstp (operands, false, <V_INT_EQUIV>mode)"
> -  [(parallel [(set (match_dup 0) (match_dup 1))
> -	      (set (match_dup 2) (match_dup 3))])]
> +  [(const_int 0)]
>  {
> -  aarch64_swap_ldrstr_operands (operands, false);
> +  aarch64_finish_ldpstp_peephole (operands, false);
> +  DONE;
>  })
>  
>  ;; Handle consecutive load/store whose offset is out of the range
> diff --git a/gcc/config/aarch64/aarch64-modes.def b/gcc/config/aarch64/aarch64-modes.def
> index 6b4f4e17dd5..1e0d770f72f 100644
> --- a/gcc/config/aarch64/aarch64-modes.def
> +++ b/gcc/config/aarch64/aarch64-modes.def
> @@ -93,9 +93,13 @@ INT_MODE (XI, 64);
>  
>  /* V8DI mode.  */
>  VECTOR_MODE_WITH_PREFIX (V, INT, DI, 8, 5);
> -
>  ADJUST_ALIGNMENT (V8DI, 8);
>  
> +/* V2x4QImode.  Used in load/store pair patterns.  */
> +VECTOR_MODE_WITH_PREFIX (V2x, INT, QI, 4, 5);
> +ADJUST_NUNITS (V2x4QI, 8);
> +ADJUST_ALIGNMENT (V2x4QI, 4);
> +
>  /* Define Advanced SIMD modes for structures of 2, 3 and 4 d-registers.  */
>  #define ADV_SIMD_D_REG_STRUCT_MODES(NVECS, VB, VH, VS, VD) \
>    VECTOR_MODES_WITH_PREFIX (V##NVECS##x, INT, 8, 3); \
> diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
> index e463fd5c817..2ab54f244a7 100644
> --- a/gcc/config/aarch64/aarch64-protos.h
> +++ b/gcc/config/aarch64/aarch64-protos.h
> @@ -967,6 +967,8 @@ void aarch64_split_compare_and_swap (rtx op[]);
>  void aarch64_split_atomic_op (enum rtx_code, rtx, rtx, rtx, rtx, rtx, rtx);
>  
>  bool aarch64_gen_adjusted_ldpstp (rtx *, bool, machine_mode, RTX_CODE);
> +void aarch64_finish_ldpstp_peephole (rtx *, bool,
> +				     enum rtx_code = (enum rtx_code)0);
>  
>  void aarch64_expand_sve_vec_cmp_int (rtx, rtx_code, rtx, rtx);
>  bool aarch64_expand_sve_vec_cmp_float (rtx, rtx_code, rtx, rtx, bool);
> @@ -1022,8 +1024,9 @@ bool aarch64_mergeable_load_pair_p (machine_mode, rtx, rtx);
>  bool aarch64_operands_ok_for_ldpstp (rtx *, bool, machine_mode);
>  bool aarch64_operands_adjust_ok_for_ldpstp (rtx *, bool, machine_mode);
>  bool aarch64_mem_ok_with_ldpstp_policy_model (rtx, bool, machine_mode);
> -void aarch64_swap_ldrstr_operands (rtx *, bool);
>  bool aarch64_ldpstp_operand_mode_p (machine_mode);
> +rtx aarch64_gen_load_pair (rtx, rtx, rtx, enum rtx_code = (enum rtx_code)0);
> +rtx aarch64_gen_store_pair (rtx, rtx, rtx);
>  
>  extern void aarch64_asm_output_pool_epilogue (FILE *, const char *,
>  					      tree, HOST_WIDE_INT);
> diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
> index c6f2d582837..6f5080ab030 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -231,38 +231,6 @@ (define_insn "aarch64_store_lane0<mode>"
>    [(set_attr "type" "neon_store1_1reg<q>")]
>  )
>  
> -(define_insn "load_pair<DREG:mode><DREG2:mode>"
> -  [(set (match_operand:DREG 0 "register_operand")
> -	(match_operand:DREG 1 "aarch64_mem_pair_operand"))
> -   (set (match_operand:DREG2 2 "register_operand")
> -	(match_operand:DREG2 3 "memory_operand"))]
> -  "TARGET_FLOAT
> -   && rtx_equal_p (XEXP (operands[3], 0),
> -		   plus_constant (Pmode,
> -				  XEXP (operands[1], 0),
> -				  GET_MODE_SIZE (<DREG:MODE>mode)))"
> -  {@ [ cons: =0 , 1   , =2 , 3 ; attrs: type ]
> -     [ w        , Ump , w  , m ; neon_ldp    ] ldp\t%d0, %d2, %z1
> -     [ r        , Ump , r  , m ; load_16     ] ldp\t%x0, %x2, %z1
> -  }
> -)
> -
> -(define_insn "vec_store_pair<DREG:mode><DREG2:mode>"
> -  [(set (match_operand:DREG 0 "aarch64_mem_pair_operand")
> -	(match_operand:DREG 1 "register_operand"))
> -   (set (match_operand:DREG2 2 "memory_operand")
> -	(match_operand:DREG2 3 "register_operand"))]
> -  "TARGET_FLOAT
> -   && rtx_equal_p (XEXP (operands[2], 0),
> -		   plus_constant (Pmode,
> -				  XEXP (operands[0], 0),
> -				  GET_MODE_SIZE (<DREG:MODE>mode)))"
> -  {@ [ cons: =0 , 1 , =2 , 3 ; attrs: type ]
> -     [ Ump      , w , m  , w ; neon_stp    ] stp\t%d1, %d3, %z0
> -     [ Ump      , r , m  , r ; store_16    ] stp\t%x1, %x3, %z0
> -  }
> -)
> -
>  (define_insn "aarch64_simd_stp<mode>"
>    [(set (match_operand:VP_2E 0 "aarch64_mem_pair_lanes_operand")
>  	(vec_duplicate:VP_2E (match_operand:<VEL> 1 "register_operand")))]
> @@ -273,34 +241,6 @@ (define_insn "aarch64_simd_stp<mode>"
>    }
>  )
>  
> -(define_insn "load_pair<VQ:mode><VQ2:mode>"
> -  [(set (match_operand:VQ 0 "register_operand" "=w")
> -	(match_operand:VQ 1 "aarch64_mem_pair_operand" "Ump"))
> -   (set (match_operand:VQ2 2 "register_operand" "=w")
> -	(match_operand:VQ2 3 "memory_operand" "m"))]
> -  "TARGET_FLOAT
> -    && rtx_equal_p (XEXP (operands[3], 0),
> -		    plus_constant (Pmode,
> -			       XEXP (operands[1], 0),
> -			       GET_MODE_SIZE (<VQ:MODE>mode)))"
> -  "ldp\\t%q0, %q2, %z1"
> -  [(set_attr "type" "neon_ldp_q")]
> -)
> -
> -(define_insn "vec_store_pair<VQ:mode><VQ2:mode>"
> -  [(set (match_operand:VQ 0 "aarch64_mem_pair_operand" "=Ump")
> -	(match_operand:VQ 1 "register_operand" "w"))
> -   (set (match_operand:VQ2 2 "memory_operand" "=m")
> -	(match_operand:VQ2 3 "register_operand" "w"))]
> -  "TARGET_FLOAT
> -   && rtx_equal_p (XEXP (operands[2], 0),
> -		   plus_constant (Pmode,
> -				  XEXP (operands[0], 0),
> -				  GET_MODE_SIZE (<VQ:MODE>mode)))"
> -  "stp\\t%q1, %q3, %z0"
> -  [(set_attr "type" "neon_stp_q")]
> -)
> -
>  (define_expand "@aarch64_split_simd_mov<mode>"
>    [(set (match_operand:VQMOV 0)
>  	(match_operand:VQMOV 1))]
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index ccf081d2a16..1f6094bf1bc 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -9056,59 +9056,81 @@ aarch64_pop_regs (unsigned regno1, unsigned regno2, HOST_WIDE_INT adjustment,
>      }
>  }
>  
> -/* Generate and return a store pair instruction of mode MODE to store
> -   register REG1 to MEM1 and register REG2 to MEM2.  */
> +static machine_mode
> +aarch64_pair_mode_for_mode (machine_mode mode)
> +{
> +  if (known_eq (GET_MODE_SIZE (mode), 4))
> +    return E_V2x4QImode;
> +  else if (known_eq (GET_MODE_SIZE (mode), 8))
> +    return E_V2x8QImode;
> +  else if (known_eq (GET_MODE_SIZE (mode), 16))
> +    return E_V2x16QImode;
> +  else
> +    gcc_unreachable ();
> +}

Missing function comment.  There should be no need to use E_ outside switches.

>  
>  static rtx
> -aarch64_gen_store_pair (machine_mode mode, rtx mem1, rtx reg1, rtx mem2,
> -			rtx reg2)
> +aarch64_pair_mem_from_base (rtx mem)
>  {
> -  switch (mode)
> -    {
> -    case E_DImode:
> -      return gen_store_pair_dw_didi (mem1, reg1, mem2, reg2);
> -
> -    case E_DFmode:
> -      return gen_store_pair_dw_dfdf (mem1, reg1, mem2, reg2);
> -
> -    case E_TFmode:
> -      return gen_store_pair_dw_tftf (mem1, reg1, mem2, reg2);
> +  auto pair_mode = aarch64_pair_mode_for_mode (GET_MODE (mem));
> +  mem = adjust_bitfield_address_nv (mem, pair_mode, 0);
> +  gcc_assert (aarch64_mem_pair_lanes_operand (mem, pair_mode));
> +  return mem;
> +}
>  
> -    case E_V4SImode:
> -      return gen_vec_store_pairv4siv4si (mem1, reg1, mem2, reg2);
> +/* Generate and return a store pair instruction to store REG1 and REG2
> +   into memory starting at BASE_MEM.  All three rtxes should have modes of the
> +   same size.  */
>  
> -    case E_V16QImode:
> -      return gen_vec_store_pairv16qiv16qi (mem1, reg1, mem2, reg2);
> +rtx
> +aarch64_gen_store_pair (rtx base_mem, rtx reg1, rtx reg2)
> +{
> +  rtx pair_mem = aarch64_pair_mem_from_base (base_mem);
>  
> -    default:
> -      gcc_unreachable ();
> -    }
> +  return gen_rtx_SET (pair_mem,
> +		      gen_rtx_UNSPEC (GET_MODE (pair_mem),
> +				      gen_rtvec (2, reg1, reg2),
> +				      UNSPEC_STP));
>  }
>  
> -/* Generate and regurn a load pair isntruction of mode MODE to load register
> -   REG1 from MEM1 and register REG2 from MEM2.  */
> +/* Generate and return a load pair instruction to load a pair of
> +   registers starting at BASE_MEM into REG1 and REG2.  If CODE is
> +   UNKNOWN, all three rtxes should have modes of the same size.
> +   Otherwise, CODE is {SIGN,ZERO}_EXTEND, base_mem should be in SImode,
> +   and REG{1,2} should be in DImode.  */
>  
> -static rtx
> -aarch64_gen_load_pair (machine_mode mode, rtx reg1, rtx mem1, rtx reg2,
> -		       rtx mem2)
> +rtx
> +aarch64_gen_load_pair (rtx reg1, rtx reg2, rtx base_mem, enum rtx_code code)
>  {
> -  switch (mode)
> -    {
> -    case E_DImode:
> -      return gen_load_pair_dw_didi (reg1, mem1, reg2, mem2);
> +  rtx pair_mem = aarch64_pair_mem_from_base (base_mem);
>  
> -    case E_DFmode:
> -      return gen_load_pair_dw_dfdf (reg1, mem1, reg2, mem2);
> -
> -    case E_TFmode:
> -      return gen_load_pair_dw_tftf (reg1, mem1, reg2, mem2);
> +  const bool any_extend_p = (code == ZERO_EXTEND || code == SIGN_EXTEND);
> +  if (any_extend_p)
> +    {
> +      gcc_checking_assert (GET_MODE (base_mem) == SImode);
> +      gcc_checking_assert (GET_MODE (reg1) == DImode);
> +      gcc_checking_assert (GET_MODE (reg2) == DImode);

Not just a personal preference: I think single asserts with && are
preferred.
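
For instance, the same checks folded into a single assert:

gcc_checking_assert (GET_MODE (base_mem) == SImode
		     && GET_MODE (reg1) == DImode
		     && GET_MODE (reg2) == DImode);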

> +    }
> +  else
> +    gcc_assert (code == UNKNOWN);
> +
> +  rtx unspecs[2] = {
> +    gen_rtx_UNSPEC (any_extend_p ? SImode : GET_MODE (reg1),
> +		    gen_rtvec (1, pair_mem),
> +		    UNSPEC_LDP_FST),
> +    gen_rtx_UNSPEC (any_extend_p ? SImode : GET_MODE (reg2),

IIUC, the unspec modes could both be GET_MODE (base_mem)

> +		    gen_rtvec (1, copy_rtx (pair_mem)),
> +		    UNSPEC_LDP_SND)
> +  };
>  
> -    case E_V4SImode:
> -      return gen_load_pairv4siv4si (reg1, mem1, reg2, mem2);
> +  if (any_extend_p)
> +    for (int i = 0; i < 2; i++)
> +      unspecs[i] = gen_rtx_fmt_e (code, DImode, unspecs[i]);
>  
> -    default:
> -      gcc_unreachable ();
> -    }
> +  return gen_rtx_PARALLEL (VOIDmode,
> +			   gen_rtvec (2,
> +				      gen_rtx_SET (reg1, unspecs[0]),
> +				      gen_rtx_SET (reg2, unspecs[1])));
>  }
>  
>  /* Return TRUE if return address signing should be enabled for the current
> @@ -9321,8 +9343,19 @@ aarch64_save_callee_saves (poly_int64 bytes_below_sp,
>  	  offset -= fp_offset;
>  	}
>        rtx mem = gen_frame_mem (mode, plus_constant (Pmode, base_rtx, offset));
> -      bool need_cfa_note_p = (base_rtx != stack_pointer_rtx);
>  
> +      rtx cfa_base = stack_pointer_rtx;
> +      poly_int64 cfa_offset = sp_offset;

I don't think we need both cfa_offset and sp_offset.  sp_offset in the
current code only exists for CFI purposes.

> +
> +      if (hard_fp_valid_p && frame_pointer_needed)
> +	{
> +	  cfa_base = hard_frame_pointer_rtx;
> +	  cfa_offset += (bytes_below_sp - frame.bytes_below_hard_fp);
> +	}
> +
> +      rtx cfa_mem = gen_frame_mem (mode,
> +				   plus_constant (Pmode,
> +						  cfa_base, cfa_offset));
>        unsigned int regno2;
>        if (!aarch64_sve_mode_p (mode)
>  	  && i + 1 < regs.size ()
> @@ -9331,45 +9364,37 @@ aarch64_save_callee_saves (poly_int64 bytes_below_sp,
>  		       frame.reg_offset[regno2] - frame.reg_offset[regno]))
>  	{
>  	  rtx reg2 = gen_rtx_REG (mode, regno2);
> -	  rtx mem2;
>  
>  	  offset += GET_MODE_SIZE (mode);
> -	  mem2 = gen_frame_mem (mode, plus_constant (Pmode, base_rtx, offset));
> -	  insn = emit_insn (aarch64_gen_store_pair (mode, mem, reg, mem2,
> -						    reg2));
> -
> -	  /* The first part of a frame-related parallel insn is
> -	     always assumed to be relevant to the frame
> -	     calculations; subsequent parts, are only
> -	     frame-related if explicitly marked.  */
> +	  insn = emit_insn (aarch64_gen_store_pair (mem, reg, reg2));
> +
>  	  if (aarch64_emit_cfi_for_reg_p (regno2))
>  	    {
> -	      if (need_cfa_note_p)
> -		aarch64_add_cfa_expression (insn, reg2, stack_pointer_rtx,
> -					    sp_offset + GET_MODE_SIZE (mode));
> -	      else
> -		RTX_FRAME_RELATED_P (XVECEXP (PATTERN (insn), 0, 1)) = 1;
> +	      rtx cfa_mem2 = adjust_address_nv (cfa_mem,
> +						Pmode,
> +						GET_MODE_SIZE (mode));

Think this should use gen_frame_mem directly, rather than moving beyond
the bounds of the original mem.

> +	      add_reg_note (insn, REG_CFA_OFFSET,
> +			    gen_rtx_SET (cfa_mem2, reg2));
>  	    }
>  
>  	  regno = regno2;
>  	  ++i;
>  	}
>        else if (mode == VNx2DImode && BYTES_BIG_ENDIAN)
> -	{
> -	  insn = emit_insn (gen_aarch64_pred_mov (mode, mem, ptrue, reg));
> -	  need_cfa_note_p = true;
> -	}
> +	insn = emit_insn (gen_aarch64_pred_mov (mode, mem, ptrue, reg));
>        else if (aarch64_sve_mode_p (mode))
>  	insn = emit_insn (gen_rtx_SET (mem, reg));
>        else
>  	insn = emit_move_insn (mem, reg);
>  
>        RTX_FRAME_RELATED_P (insn) = frame_related_p;
> -      if (frame_related_p && need_cfa_note_p)
> -	aarch64_add_cfa_expression (insn, reg, stack_pointer_rtx, sp_offset);
> +
> +      if (frame_related_p)
> +	add_reg_note (insn, REG_CFA_OFFSET, gen_rtx_SET (cfa_mem, reg));

For the record, I might need to add back some CFA_EXPRESSIONs for
locally-streaming SME functions, to ensure that the CFI code doesn't
aggregate SVE saves across a change in the VG DWARF register.
But it's probably easier to do that once the patch is in,
since having a note on all insns will help to ensure consistency.

>      }
>  }
>  
> +

Stray extra whitespace.

>  /* Emit code to restore the callee registers in REGS, ignoring pop candidates
>     and any other registers that are handled separately.  Write the appropriate
>     REG_CFA_RESTORE notes into CFI_OPS.
> @@ -9425,12 +9450,7 @@ aarch64_restore_callee_saves (poly_int64 bytes_below_sp,
>  		       frame.reg_offset[regno2] - frame.reg_offset[regno]))
>  	{
>  	  rtx reg2 = gen_rtx_REG (mode, regno2);
> -	  rtx mem2;
> -
> -	  offset += GET_MODE_SIZE (mode);
> -	  mem2 = gen_frame_mem (mode, plus_constant (Pmode, base_rtx, offset));
> -	  emit_insn (aarch64_gen_load_pair (mode, reg, mem, reg2, mem2));
> -
> +	  emit_insn (aarch64_gen_load_pair (reg, reg2, mem));
>  	  *cfi_ops = alloc_reg_note (REG_CFA_RESTORE, reg2, *cfi_ops);
>  	  regno = regno2;
>  	  ++i;
> @@ -9762,9 +9782,9 @@ aarch64_process_components (sbitmap components, bool prologue_p)
>  			     : gen_rtx_SET (reg2, mem2);
>  
>        if (prologue_p)
> -	insn = emit_insn (aarch64_gen_store_pair (mode, mem, reg, mem2, reg2));
> +	insn = emit_insn (aarch64_gen_store_pair (mem, reg, reg2));
>        else
> -	insn = emit_insn (aarch64_gen_load_pair (mode, reg, mem, reg2, mem2));
> +	insn = emit_insn (aarch64_gen_load_pair (reg, reg2, mem));
>  
>        if (frame_related_p || frame_related2_p)
>  	{
> @@ -10983,12 +11003,18 @@ aarch64_classify_address (struct aarch64_address_info *info,
>       mode of the corresponding addressing mode is half of that.  */
>    if (type == ADDR_QUERY_LDP_STP_N)
>      {
> -      if (known_eq (GET_MODE_SIZE (mode), 16))
> +      if (known_eq (GET_MODE_SIZE (mode), 32))
> +	mode = V16QImode;
> +      else if (known_eq (GET_MODE_SIZE (mode), 16))
>  	mode = DFmode;
>        else if (known_eq (GET_MODE_SIZE (mode), 8))
>  	mode = SFmode;
>        else
>  	return false;
> +
> +      /* This isn't really an Advanced SIMD struct mode, but a mode
> +	 used to represent the complete mem in a load/store pair.  */
> +      advsimd_struct_p = false;
>      }
>  
>    bool allow_reg_index_p = (!load_store_pair_p
> @@ -12609,7 +12635,8 @@ aarch64_print_operand (FILE *f, rtx x, int code)
>  	if (!MEM_P (x)
>  	    || (code == 'y'
>  		&& maybe_ne (GET_MODE_SIZE (mode), 8)
> -		&& maybe_ne (GET_MODE_SIZE (mode), 16)))
> +		&& maybe_ne (GET_MODE_SIZE (mode), 16)
> +		&& maybe_ne (GET_MODE_SIZE (mode), 32)))
>  	  {
>  	    output_operand_lossage ("invalid operand for '%%%c'", code);
>  	    return;
> @@ -25431,10 +25458,8 @@ aarch64_copy_one_block_and_progress_pointers (rtx *src, rtx *dst,
>        *src = adjust_address (*src, mode, 0);
>        *dst = adjust_address (*dst, mode, 0);
>        /* Emit the memcpy.  */
> -      emit_insn (aarch64_gen_load_pair (mode, reg1, *src, reg2,
> -					aarch64_progress_pointer (*src)));
> -      emit_insn (aarch64_gen_store_pair (mode, *dst, reg1,
> -					 aarch64_progress_pointer (*dst), reg2));
> +      emit_insn (aarch64_gen_load_pair (reg1, reg2, *src));
> +      emit_insn (aarch64_gen_store_pair (*dst, reg1, reg2));
>        /* Move the pointers forward.  */
>        *src = aarch64_move_pointer (*src, 32);
>        *dst = aarch64_move_pointer (*dst, 32);
> @@ -25613,8 +25638,7 @@ aarch64_set_one_block_and_progress_pointer (rtx src, rtx *dst,
>        /* "Cast" the *dst to the correct mode.  */
>        *dst = adjust_address (*dst, mode, 0);
>        /* Emit the memset.  */
> -      emit_insn (aarch64_gen_store_pair (mode, *dst, src,
> -					 aarch64_progress_pointer (*dst), src));
> +      emit_insn (aarch64_gen_store_pair (*dst, src, src));
>  
>        /* Move the pointers forward.  */
>        *dst = aarch64_move_pointer (*dst, 32);
> @@ -26812,6 +26836,22 @@ aarch64_swap_ldrstr_operands (rtx* operands, bool load)
>      }
>  }
>  
> +void
> +aarch64_finish_ldpstp_peephole (rtx *operands, bool load_p, enum rtx_code code)

Missing function comment.

> +{
> +  aarch64_swap_ldrstr_operands (operands, load_p);
> +
> +  if (load_p)
> +    emit_insn (aarch64_gen_load_pair (operands[0], operands[2],
> +				      operands[1], code));
> +  else
> +    {
> +      gcc_assert (code == UNKNOWN);
> +      emit_insn (aarch64_gen_store_pair (operands[0], operands[1],
> +					 operands[3]));
> +    }
> +}
> +
>  /* Taking X and Y to be HOST_WIDE_INT pointers, return the result of a
>     comparison between the two.  */
>  int
> @@ -26993,8 +27033,8 @@ bool
>  aarch64_gen_adjusted_ldpstp (rtx *operands, bool load,
>  			     machine_mode mode, RTX_CODE code)
>  {
> -  rtx base, offset_1, offset_3, t1, t2;
> -  rtx mem_1, mem_2, mem_3, mem_4;
> +  rtx base, offset_1, offset_3;
> +  rtx mem_1, mem_2;
>    rtx temp_operands[8];
>    HOST_WIDE_INT off_val_1, off_val_3, base_off, new_off_1, new_off_3,
>  		stp_off_upper_limit, stp_off_lower_limit, msize;
> @@ -27019,21 +27059,17 @@ aarch64_gen_adjusted_ldpstp (rtx *operands, bool load,
>    if (load)
>      {
>        mem_1 = copy_rtx (temp_operands[1]);
> -      mem_2 = copy_rtx (temp_operands[3]);
> -      mem_3 = copy_rtx (temp_operands[5]);
> -      mem_4 = copy_rtx (temp_operands[7]);
> +      mem_2 = copy_rtx (temp_operands[5]);
>      }
>    else
>      {
>        mem_1 = copy_rtx (temp_operands[0]);
> -      mem_2 = copy_rtx (temp_operands[2]);
> -      mem_3 = copy_rtx (temp_operands[4]);
> -      mem_4 = copy_rtx (temp_operands[6]);
> +      mem_2 = copy_rtx (temp_operands[4]);
>        gcc_assert (code == UNKNOWN);
>      }
>  
>    extract_base_offset_in_addr (mem_1, &base, &offset_1);
> -  extract_base_offset_in_addr (mem_3, &base, &offset_3);
> +  extract_base_offset_in_addr (mem_2, &base, &offset_3);

mem_2 with offset_3 feels a bit awkward.  Might be worth using mem_3 instead,
so that the memory and register numbers are in sync.

I suppose we still need Ump for the extending loads, is that right?
Are there any other uses left?

Thanks,
Richard

>    gcc_assert (base != NULL_RTX && offset_1 != NULL_RTX
>  	      && offset_3 != NULL_RTX);
>  
> @@ -27097,63 +27133,48 @@ aarch64_gen_adjusted_ldpstp (rtx *operands, bool load,
>    replace_equiv_address_nv (mem_1, plus_constant (Pmode, operands[8],
>  						  new_off_1), true);
>    replace_equiv_address_nv (mem_2, plus_constant (Pmode, operands[8],
> -						  new_off_1 + msize), true);
> -  replace_equiv_address_nv (mem_3, plus_constant (Pmode, operands[8],
>  						  new_off_3), true);
> -  replace_equiv_address_nv (mem_4, plus_constant (Pmode, operands[8],
> -						  new_off_3 + msize), true);
>  
>    if (!aarch64_mem_pair_operand (mem_1, mode)
> -      || !aarch64_mem_pair_operand (mem_3, mode))
> +      || !aarch64_mem_pair_operand (mem_2, mode))
>      return false;
>  
> -  if (code == ZERO_EXTEND)
> -    {
> -      mem_1 = gen_rtx_ZERO_EXTEND (DImode, mem_1);
> -      mem_2 = gen_rtx_ZERO_EXTEND (DImode, mem_2);
> -      mem_3 = gen_rtx_ZERO_EXTEND (DImode, mem_3);
> -      mem_4 = gen_rtx_ZERO_EXTEND (DImode, mem_4);
> -    }
> -  else if (code == SIGN_EXTEND)
> -    {
> -      mem_1 = gen_rtx_SIGN_EXTEND (DImode, mem_1);
> -      mem_2 = gen_rtx_SIGN_EXTEND (DImode, mem_2);
> -      mem_3 = gen_rtx_SIGN_EXTEND (DImode, mem_3);
> -      mem_4 = gen_rtx_SIGN_EXTEND (DImode, mem_4);
> -    }
> -
>    if (load)
>      {
>        operands[0] = temp_operands[0];
>        operands[1] = mem_1;
>        operands[2] = temp_operands[2];
> -      operands[3] = mem_2;
>        operands[4] = temp_operands[4];
> -      operands[5] = mem_3;
> +      operands[5] = mem_2;
>        operands[6] = temp_operands[6];
> -      operands[7] = mem_4;
>      }
>    else
>      {
>        operands[0] = mem_1;
>        operands[1] = temp_operands[1];
> -      operands[2] = mem_2;
>        operands[3] = temp_operands[3];
> -      operands[4] = mem_3;
> +      operands[4] = mem_2;
>        operands[5] = temp_operands[5];
> -      operands[6] = mem_4;
>        operands[7] = temp_operands[7];
>      }
>  
>    /* Emit adjusting instruction.  */
>    emit_insn (gen_rtx_SET (operands[8], plus_constant (DImode, base, base_off)));
>    /* Emit ldp/stp instructions.  */
> -  t1 = gen_rtx_SET (operands[0], operands[1]);
> -  t2 = gen_rtx_SET (operands[2], operands[3]);
> -  emit_insn (gen_rtx_PARALLEL (VOIDmode, gen_rtvec (2, t1, t2)));
> -  t1 = gen_rtx_SET (operands[4], operands[5]);
> -  t2 = gen_rtx_SET (operands[6], operands[7]);
> -  emit_insn (gen_rtx_PARALLEL (VOIDmode, gen_rtvec (2, t1, t2)));
> +  if (load)
> +    {
> +      emit_insn (aarch64_gen_load_pair (operands[0], operands[2],
> +					operands[1], code));
> +      emit_insn (aarch64_gen_load_pair (operands[4], operands[6],
> +					operands[5], code));
> +    }
> +  else
> +    {
> +      emit_insn (aarch64_gen_store_pair (operands[0], operands[1],
> +					 operands[3]));
> +      emit_insn (aarch64_gen_store_pair (operands[4], operands[5],
> +					 operands[7]));
> +    }
>    return true;
>  }
>  
> diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
> index c92a51690c5..ffb6b0ba749 100644
> --- a/gcc/config/aarch64/aarch64.md
> +++ b/gcc/config/aarch64/aarch64.md
> @@ -175,6 +175,9 @@ (define_c_enum "unspec" [
>      UNSPEC_GOTSMALLTLS
>      UNSPEC_GOTTINYPIC
>      UNSPEC_GOTTINYTLS
> +    UNSPEC_STP
> +    UNSPEC_LDP_FST
> +    UNSPEC_LDP_SND
>      UNSPEC_LD1
>      UNSPEC_LD2
>      UNSPEC_LD2_DREG
> @@ -453,6 +456,11 @@ (define_attr "predicated" "yes,no" (const_string "no"))
>  ;; may chose to hold the tracking state encoded in SP.
>  (define_attr "speculation_barrier" "true,false" (const_string "false"))
>  
> +;; Attribute used to identify load pair and store pair instructions.
> +;; Currently the attribute is only applied to the non-writeback ldp/stp
> +;; patterns.
> +(define_attr "ldpstp" "ldp,stp,none" (const_string "none"))
> +
>  ;; -------------------------------------------------------------------
>  ;; Pipeline descriptions and scheduling
>  ;; -------------------------------------------------------------------
> @@ -1735,100 +1743,62 @@ (define_expand "setmemdi"
>    FAIL;
>  })
>  
> -;; Operands 1 and 3 are tied together by the final condition; so we allow
> -;; fairly lax checking on the second memory operation.
> -(define_insn "load_pair_sw_<SX:mode><SX2:mode>"
> -  [(set (match_operand:SX 0 "register_operand")
> -	(match_operand:SX 1 "aarch64_mem_pair_operand"))
> -   (set (match_operand:SX2 2 "register_operand")
> -	(match_operand:SX2 3 "memory_operand"))]
> -   "rtx_equal_p (XEXP (operands[3], 0),
> -		 plus_constant (Pmode,
> -				XEXP (operands[1], 0),
> -				GET_MODE_SIZE (<SX:MODE>mode)))"
> -  {@ [ cons: =0 , 1   , =2 , 3 ; attrs: type     , arch ]
> -     [ r        , Ump , r  , m ; load_8          , *    ] ldp\t%w0, %w2, %z1
> -     [ w        , Ump , w  , m ; neon_load1_2reg , fp   ] ldp\t%s0, %s2, %z1
> -  }
> -)
> -
> -;; Storing different modes that can still be merged
> -(define_insn "load_pair_dw_<DX:mode><DX2:mode>"
> -  [(set (match_operand:DX 0 "register_operand")
> -	(match_operand:DX 1 "aarch64_mem_pair_operand"))
> -   (set (match_operand:DX2 2 "register_operand")
> -	(match_operand:DX2 3 "memory_operand"))]
> -   "rtx_equal_p (XEXP (operands[3], 0),
> -		 plus_constant (Pmode,
> -				XEXP (operands[1], 0),
> -				GET_MODE_SIZE (<DX:MODE>mode)))"
> -  {@ [ cons: =0 , 1   , =2 , 3 ; attrs: type     , arch ]
> -     [ r        , Ump , r  , m ; load_16         , *    ] ldp\t%x0, %x2, %z1
> -     [ w        , Ump , w  , m ; neon_load1_2reg , fp   ] ldp\t%d0, %d2, %z1
> -  }
> -)
> -
> -(define_insn "load_pair_dw_<TX:mode><TX2:mode>"
> -  [(set (match_operand:TX 0 "register_operand" "=w")
> -	(match_operand:TX 1 "aarch64_mem_pair_operand" "Ump"))
> -   (set (match_operand:TX2 2 "register_operand" "=w")
> -	(match_operand:TX2 3 "memory_operand" "m"))]
> -   "TARGET_SIMD
> -    && rtx_equal_p (XEXP (operands[3], 0),
> -		    plus_constant (Pmode,
> -				   XEXP (operands[1], 0),
> -				   GET_MODE_SIZE (<TX:MODE>mode)))"
> -  "ldp\\t%q0, %q2, %z1"
> +(define_insn "*load_pair_<ldst_sz>"
> +  [(set (match_operand:GPI 0 "aarch64_ldp_reg_operand")
> +	(unspec [
> +	  (match_operand:<VPAIR> 1 "aarch64_mem_pair_lanes_operand")
> +	] UNSPEC_LDP_FST))
> +   (set (match_operand:GPI 2 "aarch64_ldp_reg_operand")
> +	(unspec [
> +	  (match_dup 1)
> +	] UNSPEC_LDP_SND))]
> +  ""
> +  {@ [cons: =0, 1,   =2; attrs: type,	   arch]
> +     [	     r, Umn,  r; load_<ldpstp_sz>, *   ] ldp\t%<w>0, %<w>2, %y1
> +     [	     w, Umn,  w; neon_load1_2reg,  fp  ] ldp\t%<v>0, %<v>2, %y1
> +  }
> +  [(set_attr "ldpstp" "ldp")]
> +)
> +
> +(define_insn "*load_pair_16"
> +  [(set (match_operand:TI 0 "aarch64_ldp_reg_operand" "=w")
> +	(unspec [
> +	  (match_operand:V2x16QI 1 "aarch64_mem_pair_lanes_operand" "Umn")
> +	] UNSPEC_LDP_FST))
> +   (set (match_operand:TI 2 "aarch64_ldp_reg_operand" "=w")
> +	(unspec [
> +	  (match_dup 1)
> +	] UNSPEC_LDP_SND))]
> +  "TARGET_FLOAT"
> +  "ldp\\t%q0, %q2, %y1"
>    [(set_attr "type" "neon_ldp_q")
> -   (set_attr "fp" "yes")]
> -)
> -
> -;; Operands 0 and 2 are tied together by the final condition; so we allow
> -;; fairly lax checking on the second memory operation.
> -(define_insn "store_pair_sw_<SX:mode><SX2:mode>"
> -  [(set (match_operand:SX 0 "aarch64_mem_pair_operand")
> -	(match_operand:SX 1 "aarch64_reg_zero_or_fp_zero"))
> -   (set (match_operand:SX2 2 "memory_operand")
> -	(match_operand:SX2 3 "aarch64_reg_zero_or_fp_zero"))]
> -   "rtx_equal_p (XEXP (operands[2], 0),
> -		 plus_constant (Pmode,
> -				XEXP (operands[0], 0),
> -				GET_MODE_SIZE (<SX:MODE>mode)))"
> -  {@ [ cons: =0 , 1   , =2 , 3   ; attrs: type      , arch ]
> -     [ Ump      , rYZ , m  , rYZ ; store_8          , *    ] stp\t%w1, %w3, %z0
> -     [ Ump      , w   , m  , w   ; neon_store1_2reg , fp   ] stp\t%s1, %s3, %z0
> -  }
> -)
> -
> -;; Storing different modes that can still be merged
> -(define_insn "store_pair_dw_<DX:mode><DX2:mode>"
> -  [(set (match_operand:DX 0 "aarch64_mem_pair_operand")
> -	(match_operand:DX 1 "aarch64_reg_zero_or_fp_zero"))
> -   (set (match_operand:DX2 2 "memory_operand")
> -	(match_operand:DX2 3 "aarch64_reg_zero_or_fp_zero"))]
> -   "rtx_equal_p (XEXP (operands[2], 0),
> -		 plus_constant (Pmode,
> -				XEXP (operands[0], 0),
> -				GET_MODE_SIZE (<DX:MODE>mode)))"
> -  {@ [ cons: =0 , 1   , =2 , 3   ; attrs: type      , arch ]
> -     [ Ump      , rYZ , m  , rYZ ; store_16         , *    ] stp\t%x1, %x3, %z0
> -     [ Ump      , w   , m  , w   ; neon_store1_2reg , fp   ] stp\t%d1, %d3, %z0
> -  }
> -)
> -
> -(define_insn "store_pair_dw_<TX:mode><TX2:mode>"
> -  [(set (match_operand:TX 0 "aarch64_mem_pair_operand" "=Ump")
> -	(match_operand:TX 1 "register_operand" "w"))
> -   (set (match_operand:TX2 2 "memory_operand" "=m")
> -	(match_operand:TX2 3 "register_operand" "w"))]
> -   "TARGET_SIMD &&
> -    rtx_equal_p (XEXP (operands[2], 0),
> -		 plus_constant (Pmode,
> -				XEXP (operands[0], 0),
> -				GET_MODE_SIZE (TFmode)))"
> -  "stp\\t%q1, %q3, %z0"
> +   (set_attr "fp" "yes")
> +   (set_attr "ldpstp" "ldp")]
> +)
> +
> +(define_insn "*store_pair_<ldst_sz>"
> +  [(set (match_operand:<VPAIR> 0 "aarch64_mem_pair_lanes_operand")
> +	(unspec:<VPAIR>
> +	  [(match_operand:GPI 1 "aarch64_stp_reg_operand")
> +	   (match_operand:GPI 2 "aarch64_stp_reg_operand")] UNSPEC_STP))]
> +  ""
> +  {@ [cons:  =0,   1,   2; attrs: type      , arch]
> +     [	    Umn, rYZ, rYZ; store_<ldpstp_sz>, *   ] stp\t%<w>1, %<w>2, %y0
> +     [	    Umn,   w,   w; neon_store1_2reg , fp  ] stp\t%<v>1, %<v>2, %y0
> +  }
> +  [(set_attr "ldpstp" "stp")]
> +)
> +
> +(define_insn "*store_pair_16"
> +  [(set (match_operand:V2x16QI 0 "aarch64_mem_pair_lanes_operand" "=Umn")
> +	(unspec:V2x16QI
> +	  [(match_operand:TI 1 "aarch64_ldp_reg_operand" "w")
> +	   (match_operand:TI 2 "aarch64_ldp_reg_operand" "w")] UNSPEC_STP))]
> +  "TARGET_FLOAT"
> +  "stp\t%q1, %q2, %y0"
>    [(set_attr "type" "neon_stp_q")
> -   (set_attr "fp" "yes")]
> +   (set_attr "fp" "yes")
> +   (set_attr "ldpstp" "stp")]
>  )
>  
>  ;; Writeback load/store pair patterns.
> @@ -2074,14 +2044,15 @@ (define_insn "*extendsidi2_aarch64"
>  
>  (define_insn "*load_pair_extendsidi2_aarch64"
>    [(set (match_operand:DI 0 "register_operand" "=r")
> -	(sign_extend:DI (match_operand:SI 1 "aarch64_mem_pair_operand" "Ump")))
> +	(sign_extend:DI (unspec:SI [
> +	  (match_operand:V2x4QI 1 "aarch64_mem_pair_lanes_operand" "Umn")
> +	] UNSPEC_LDP_FST)))
>     (set (match_operand:DI 2 "register_operand" "=r")
> -	(sign_extend:DI (match_operand:SI 3 "memory_operand" "m")))]
> -  "rtx_equal_p (XEXP (operands[3], 0),
> -		plus_constant (Pmode,
> -			       XEXP (operands[1], 0),
> -			       GET_MODE_SIZE (SImode)))"
> -  "ldpsw\\t%0, %2, %z1"
> +	(sign_extend:DI (unspec:SI [
> +	  (match_dup 1)
> +	] UNSPEC_LDP_SND)))]
> +  ""
> +  "ldpsw\\t%0, %2, %y1"
>    [(set_attr "type" "load_8")]
>  )
>  
> @@ -2101,16 +2072,17 @@ (define_insn "*zero_extendsidi2_aarch64"
>  
>  (define_insn "*load_pair_zero_extendsidi2_aarch64"
>    [(set (match_operand:DI 0 "register_operand")
> -	(zero_extend:DI (match_operand:SI 1 "aarch64_mem_pair_operand")))
> +	(zero_extend:DI (unspec:SI [
> +	  (match_operand:V2x4QI 1 "aarch64_mem_pair_lanes_operand")
> +	] UNSPEC_LDP_FST)))
>     (set (match_operand:DI 2 "register_operand")
> -	(zero_extend:DI (match_operand:SI 3 "memory_operand")))]
> -  "rtx_equal_p (XEXP (operands[3], 0),
> -		plus_constant (Pmode,
> -			       XEXP (operands[1], 0),
> -			       GET_MODE_SIZE (SImode)))"
> -  {@ [ cons: =0 , 1   , =2 , 3 ; attrs: type     , arch ]
> -     [ r        , Ump , r  , m ; load_8          , *    ] ldp\t%w0, %w2, %z1
> -     [ w        , Ump , w  , m ; neon_load1_2reg , fp   ] ldp\t%s0, %s2, %z1
> +	(zero_extend:DI (unspec:SI [
> +	  (match_dup 1)
> +	] UNSPEC_LDP_SND)))]
> +  ""
> +  {@ [ cons: =0 , 1   , =2; attrs: type    , arch]
> +     [ r	, Umn , r ; load_8	   , *   ] ldp\t%w0, %w2, %y1
> +     [ w	, Umn , w ; neon_load1_2reg, fp  ] ldp\t%s0, %s2, %y1
>    }
>  )
>  
> diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md
> index a920de99ffc..fd8dd6db349 100644
> --- a/gcc/config/aarch64/iterators.md
> +++ b/gcc/config/aarch64/iterators.md
> @@ -1435,6 +1435,9 @@ (define_mode_attr VDBL [(V8QI "V16QI") (V4HI "V8HI")
>  			(SI   "V2SI")  (SF   "V2SF")
>  			(DI   "V2DI")  (DF   "V2DF")])
>  
> +;; Load/store pair mode.
> +(define_mode_attr VPAIR [(SI "V2x4QI") (DI "V2x8QI")])
> +
>  ;; Register suffix for double-length mode.
>  (define_mode_attr Vdtype [(V4HF "8h") (V2SF "4s")])
>  
> diff --git a/gcc/config/aarch64/predicates.md b/gcc/config/aarch64/predicates.md
> index b647e5af7c6..80f2e03d8de 100644
> --- a/gcc/config/aarch64/predicates.md
> +++ b/gcc/config/aarch64/predicates.md
> @@ -266,10 +266,12 @@ (define_special_predicate "aarch64_mem_pair_operator"
>        (match_test "known_eq (GET_MODE_SIZE (mode),
>  			     GET_MODE_SIZE (GET_MODE (op)))"))))
>  
> -(define_predicate "aarch64_mem_pair_operand"
> -  (and (match_code "mem")
> -       (match_test "aarch64_legitimate_address_p (mode, XEXP (op, 0), false,
> -						  ADDR_QUERY_LDP_STP)")))
> +;; Like aarch64_mem_pair_operator, but additionally check the
> +;; address is suitable.
> +(define_special_predicate "aarch64_mem_pair_operand"
> +  (and (match_operand 0 "aarch64_mem_pair_operator")
> +       (match_test "aarch64_legitimate_address_p (GET_MODE (op), XEXP (op, 0),
> +						  false, ADDR_QUERY_LDP_STP)")))
>  
>  (define_predicate "pmode_plus_operator"
>    (and (match_code "plus")
Alex Coplan Dec. 5, 2023, 11:01 a.m. UTC | #2
Thanks for the review, I've posted a v2 here which addresses this feedback:
https://gcc.gnu.org/pipermail/gcc-patches/2023-December/639361.html

> > +  aarch64_finish_ldpstp_peephole (operands, true);
> > +  DONE;
> >  })
> >  
> >  (define_peephole2
> > @@ -60,10 +60,10 @@ (define_peephole2
> >     (set (match_operand:GPF 2 "memory_operand" "")
> >  	(match_operand:GPF 3 "aarch64_reg_or_fp_zero" ""))]
> >    "aarch64_operands_ok_for_ldpstp (operands, false, <MODE>mode)"
> > -  [(parallel [(set (match_dup 0) (match_dup 1))
> > -	      (set (match_dup 2) (match_dup 3))])]
> > +  [(const_int 0)]
> >  {
> > -  aarch64_swap_ldrstr_operands (operands, false);
> > +  aarch64_finish_ldpstp_peephole (operands, false);
> > +  DONE;
> >  })
> >  
> >  (define_peephole2
> > @@ -72,10 +72,10 @@ (define_peephole2
> >     (set (match_operand:DREG2 2 "register_operand" "")
> >  	(match_operand:DREG2 3 "memory_operand" ""))]
> >    "aarch64_operands_ok_for_ldpstp (operands, true, <DREG:MODE>mode)"
> > -  [(parallel [(set (match_dup 0) (match_dup 1))
> > -	      (set (match_dup 2) (match_dup 3))])]
> > +  [(const_int 0)]
> >  {
> > -  aarch64_swap_ldrstr_operands (operands, true);
> > +  aarch64_finish_ldpstp_peephole (operands, true);
> > +  DONE;
> >  })
> >  
> >  (define_peephole2
> > @@ -84,10 +84,10 @@ (define_peephole2
> >     (set (match_operand:DREG2 2 "memory_operand" "")
> >  	(match_operand:DREG2 3 "register_operand" ""))]
> >    "aarch64_operands_ok_for_ldpstp (operands, false, <DREG:MODE>mode)"
> > -  [(parallel [(set (match_dup 0) (match_dup 1))
> > -	      (set (match_dup 2) (match_dup 3))])]
> > +  [(const_int 0)]
> >  {
> > -  aarch64_swap_ldrstr_operands (operands, false);
> > +  aarch64_finish_ldpstp_peephole (operands, false);
> > +  DONE;
> >  })
> >  
> >  (define_peephole2
> > @@ -99,10 +99,10 @@ (define_peephole2
> >     && aarch64_operands_ok_for_ldpstp (operands, true, <VQ:MODE>mode)
> >     && (aarch64_tune_params.extra_tuning_flags
> >  	& AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS) == 0"
> > -  [(parallel [(set (match_dup 0) (match_dup 1))
> > -	      (set (match_dup 2) (match_dup 3))])]
> > +  [(const_int 0)]
> >  {
> > -  aarch64_swap_ldrstr_operands (operands, true);
> > +  aarch64_finish_ldpstp_peephole (operands, true);
> > +  DONE;
> >  })
> >  
> >  (define_peephole2
> > @@ -114,10 +114,10 @@ (define_peephole2
> >     && aarch64_operands_ok_for_ldpstp (operands, false, <VQ:MODE>mode)
> >     && (aarch64_tune_params.extra_tuning_flags
> >  	& AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS) == 0"
> > -  [(parallel [(set (match_dup 0) (match_dup 1))
> > -	      (set (match_dup 2) (match_dup 3))])]
> > +  [(const_int 0)]
> >  {
> > -  aarch64_swap_ldrstr_operands (operands, false);
> > +  aarch64_finish_ldpstp_peephole (operands, false);
> > +  DONE;
> >  })
> >  
> >  
> > @@ -129,10 +129,10 @@ (define_peephole2
> >     (set (match_operand:DI 2 "register_operand" "")
> >  	(sign_extend:DI (match_operand:SI 3 "memory_operand" "")))]
> >    "aarch64_operands_ok_for_ldpstp (operands, true, SImode)"
> > -  [(parallel [(set (match_dup 0) (sign_extend:DI (match_dup 1)))
> > -	      (set (match_dup 2) (sign_extend:DI (match_dup 3)))])]
> > +  [(const_int 0)]
> >  {
> > -  aarch64_swap_ldrstr_operands (operands, true);
> > +  aarch64_finish_ldpstp_peephole (operands, true, SIGN_EXTEND);
> > +  DONE;
> >  })
> >  
> >  (define_peephole2
> > @@ -141,10 +141,10 @@ (define_peephole2
> >     (set (match_operand:DI 2 "register_operand" "")
> >  	(zero_extend:DI (match_operand:SI 3 "memory_operand" "")))]
> >    "aarch64_operands_ok_for_ldpstp (operands, true, SImode)"
> > -  [(parallel [(set (match_dup 0) (zero_extend:DI (match_dup 1)))
> > -	      (set (match_dup 2) (zero_extend:DI (match_dup 3)))])]
> > +  [(const_int 0)]
> >  {
> > -  aarch64_swap_ldrstr_operands (operands, true);
> > +  aarch64_finish_ldpstp_peephole (operands, true, ZERO_EXTEND);
> > +  DONE;
> >  })
> >  
> >  ;; Handle storing of a floating point zero with integer data.
> > @@ -163,10 +163,10 @@ (define_peephole2
> >     (set (match_operand:<FCVT_TARGET> 2 "memory_operand" "")
> >  	(match_operand:<FCVT_TARGET> 3 "aarch64_reg_zero_or_fp_zero" ""))]
> >    "aarch64_operands_ok_for_ldpstp (operands, false, <V_INT_EQUIV>mode)"
> > -  [(parallel [(set (match_dup 0) (match_dup 1))
> > -	      (set (match_dup 2) (match_dup 3))])]
> > +  [(const_int 0)]
> >  {
> > -  aarch64_swap_ldrstr_operands (operands, false);
> > +  aarch64_finish_ldpstp_peephole (operands, false);
> > +  DONE;
> >  })
> >  
> >  ;; Handle consecutive load/store whose offset is out of the range
> > diff --git a/gcc/config/aarch64/aarch64-modes.def b/gcc/config/aarch64/aarch64-modes.def
> > index 6b4f4e17dd5..1e0d770f72f 100644
> > --- a/gcc/config/aarch64/aarch64-modes.def
> > +++ b/gcc/config/aarch64/aarch64-modes.def
> > @@ -93,9 +93,13 @@ INT_MODE (XI, 64);
> >  
> >  /* V8DI mode.  */
> >  VECTOR_MODE_WITH_PREFIX (V, INT, DI, 8, 5);
> > -
> >  ADJUST_ALIGNMENT (V8DI, 8);
> >  
> > +/* V2x4QImode.  Used in load/store pair patterns.  */
> > +VECTOR_MODE_WITH_PREFIX (V2x, INT, QI, 4, 5);
> > +ADJUST_NUNITS (V2x4QI, 8);
> > +ADJUST_ALIGNMENT (V2x4QI, 4);
> > +
> >  /* Define Advanced SIMD modes for structures of 2, 3 and 4 d-registers.  */
> >  #define ADV_SIMD_D_REG_STRUCT_MODES(NVECS, VB, VH, VS, VD) \
> >    VECTOR_MODES_WITH_PREFIX (V##NVECS##x, INT, 8, 3); \
> > diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
> > index e463fd5c817..2ab54f244a7 100644
> > --- a/gcc/config/aarch64/aarch64-protos.h
> > +++ b/gcc/config/aarch64/aarch64-protos.h
> > @@ -967,6 +967,8 @@ void aarch64_split_compare_and_swap (rtx op[]);
> >  void aarch64_split_atomic_op (enum rtx_code, rtx, rtx, rtx, rtx, rtx, rtx);
> >  
> >  bool aarch64_gen_adjusted_ldpstp (rtx *, bool, machine_mode, RTX_CODE);
> > +void aarch64_finish_ldpstp_peephole (rtx *, bool,
> > +				     enum rtx_code = (enum rtx_code)0);
> >  
> >  void aarch64_expand_sve_vec_cmp_int (rtx, rtx_code, rtx, rtx);
> >  bool aarch64_expand_sve_vec_cmp_float (rtx, rtx_code, rtx, rtx, bool);
> > @@ -1022,8 +1024,9 @@ bool aarch64_mergeable_load_pair_p (machine_mode, rtx, rtx);
> >  bool aarch64_operands_ok_for_ldpstp (rtx *, bool, machine_mode);
> >  bool aarch64_operands_adjust_ok_for_ldpstp (rtx *, bool, machine_mode);
> >  bool aarch64_mem_ok_with_ldpstp_policy_model (rtx, bool, machine_mode);
> > -void aarch64_swap_ldrstr_operands (rtx *, bool);
> >  bool aarch64_ldpstp_operand_mode_p (machine_mode);
> > +rtx aarch64_gen_load_pair (rtx, rtx, rtx, enum rtx_code = (enum rtx_code)0);
> > +rtx aarch64_gen_store_pair (rtx, rtx, rtx);
> >  
> >  extern void aarch64_asm_output_pool_epilogue (FILE *, const char *,
> >  					      tree, HOST_WIDE_INT);
> > diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
> > index c6f2d582837..6f5080ab030 100644
> > --- a/gcc/config/aarch64/aarch64-simd.md
> > +++ b/gcc/config/aarch64/aarch64-simd.md
> > @@ -231,38 +231,6 @@ (define_insn "aarch64_store_lane0<mode>"
> >    [(set_attr "type" "neon_store1_1reg<q>")]
> >  )
> >  
> > -(define_insn "load_pair<DREG:mode><DREG2:mode>"
> > -  [(set (match_operand:DREG 0 "register_operand")
> > -	(match_operand:DREG 1 "aarch64_mem_pair_operand"))
> > -   (set (match_operand:DREG2 2 "register_operand")
> > -	(match_operand:DREG2 3 "memory_operand"))]
> > -  "TARGET_FLOAT
> > -   && rtx_equal_p (XEXP (operands[3], 0),
> > -		   plus_constant (Pmode,
> > -				  XEXP (operands[1], 0),
> > -				  GET_MODE_SIZE (<DREG:MODE>mode)))"
> > -  {@ [ cons: =0 , 1   , =2 , 3 ; attrs: type ]
> > -     [ w        , Ump , w  , m ; neon_ldp    ] ldp\t%d0, %d2, %z1
> > -     [ r        , Ump , r  , m ; load_16     ] ldp\t%x0, %x2, %z1
> > -  }
> > -)
> > -
> > -(define_insn "vec_store_pair<DREG:mode><DREG2:mode>"
> > -  [(set (match_operand:DREG 0 "aarch64_mem_pair_operand")
> > -	(match_operand:DREG 1 "register_operand"))
> > -   (set (match_operand:DREG2 2 "memory_operand")
> > -	(match_operand:DREG2 3 "register_operand"))]
> > -  "TARGET_FLOAT
> > -   && rtx_equal_p (XEXP (operands[2], 0),
> > -		   plus_constant (Pmode,
> > -				  XEXP (operands[0], 0),
> > -				  GET_MODE_SIZE (<DREG:MODE>mode)))"
> > -  {@ [ cons: =0 , 1 , =2 , 3 ; attrs: type ]
> > -     [ Ump      , w , m  , w ; neon_stp    ] stp\t%d1, %d3, %z0
> > -     [ Ump      , r , m  , r ; store_16    ] stp\t%x1, %x3, %z0
> > -  }
> > -)
> > -
> >  (define_insn "aarch64_simd_stp<mode>"
> >    [(set (match_operand:VP_2E 0 "aarch64_mem_pair_lanes_operand")
> >  	(vec_duplicate:VP_2E (match_operand:<VEL> 1 "register_operand")))]
> > @@ -273,34 +241,6 @@ (define_insn "aarch64_simd_stp<mode>"
> >    }
> >  )
> >  
> > -(define_insn "load_pair<VQ:mode><VQ2:mode>"
> > -  [(set (match_operand:VQ 0 "register_operand" "=w")
> > -	(match_operand:VQ 1 "aarch64_mem_pair_operand" "Ump"))
> > -   (set (match_operand:VQ2 2 "register_operand" "=w")
> > -	(match_operand:VQ2 3 "memory_operand" "m"))]
> > -  "TARGET_FLOAT
> > -    && rtx_equal_p (XEXP (operands[3], 0),
> > -		    plus_constant (Pmode,
> > -			       XEXP (operands[1], 0),
> > -			       GET_MODE_SIZE (<VQ:MODE>mode)))"
> > -  "ldp\\t%q0, %q2, %z1"
> > -  [(set_attr "type" "neon_ldp_q")]
> > -)
> > -
> > -(define_insn "vec_store_pair<VQ:mode><VQ2:mode>"
> > -  [(set (match_operand:VQ 0 "aarch64_mem_pair_operand" "=Ump")
> > -	(match_operand:VQ 1 "register_operand" "w"))
> > -   (set (match_operand:VQ2 2 "memory_operand" "=m")
> > -	(match_operand:VQ2 3 "register_operand" "w"))]
> > -  "TARGET_FLOAT
> > -   && rtx_equal_p (XEXP (operands[2], 0),
> > -		   plus_constant (Pmode,
> > -				  XEXP (operands[0], 0),
> > -				  GET_MODE_SIZE (<VQ:MODE>mode)))"
> > -  "stp\\t%q1, %q3, %z0"
> > -  [(set_attr "type" "neon_stp_q")]
> > -)
> > -
> >  (define_expand "@aarch64_split_simd_mov<mode>"
> >    [(set (match_operand:VQMOV 0)
> >  	(match_operand:VQMOV 1))]
> > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> > index ccf081d2a16..1f6094bf1bc 100644
> > --- a/gcc/config/aarch64/aarch64.cc
> > +++ b/gcc/config/aarch64/aarch64.cc
> > @@ -9056,59 +9056,81 @@ aarch64_pop_regs (unsigned regno1, unsigned regno2, HOST_WIDE_INT adjustment,
> >      }
> >  }
> >  
> > -/* Generate and return a store pair instruction of mode MODE to store
> > -   register REG1 to MEM1 and register REG2 to MEM2.  */
> > +static machine_mode
> > +aarch64_pair_mode_for_mode (machine_mode mode)
> > +{
> > +  if (known_eq (GET_MODE_SIZE (mode), 4))
> > +    return E_V2x4QImode;
> > +  else if (known_eq (GET_MODE_SIZE (mode), 8))
> > +    return E_V2x8QImode;
> > +  else if (known_eq (GET_MODE_SIZE (mode), 16))
> > +    return E_V2x16QImode;
> > +  else
> > +    gcc_unreachable ();
> > +}
> 
> Missing function comment.  There should be no need to use E_ outside switches.

Fixed, thanks.
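
For reference, the fixed helper now reads roughly like this (sketch: the
body is as posted above minus the E_ prefixes, but the comment wording
here is mine rather than the committed one):

  /* Return the pair mode to use for a memory operand representing a
     load/store pair whose individual accesses have mode MODE.  */

  static machine_mode
  aarch64_pair_mode_for_mode (machine_mode mode)
  {
    if (known_eq (GET_MODE_SIZE (mode), 4))
      return V2x4QImode;
    else if (known_eq (GET_MODE_SIZE (mode), 8))
      return V2x8QImode;
    else if (known_eq (GET_MODE_SIZE (mode), 16))
      return V2x16QImode;
    else
      gcc_unreachable ();
  }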

> 
> >  
> >  static rtx
> > -aarch64_gen_store_pair (machine_mode mode, rtx mem1, rtx reg1, rtx mem2,
> > -			rtx reg2)
> > +aarch64_pair_mem_from_base (rtx mem)
> >  {
> > -  switch (mode)
> > -    {
> > -    case E_DImode:
> > -      return gen_store_pair_dw_didi (mem1, reg1, mem2, reg2);
> > -
> > -    case E_DFmode:
> > -      return gen_store_pair_dw_dfdf (mem1, reg1, mem2, reg2);
> > -
> > -    case E_TFmode:
> > -      return gen_store_pair_dw_tftf (mem1, reg1, mem2, reg2);
> > +  auto pair_mode = aarch64_pair_mode_for_mode (GET_MODE (mem));
> > +  mem = adjust_bitfield_address_nv (mem, pair_mode, 0);
> > +  gcc_assert (aarch64_mem_pair_lanes_operand (mem, pair_mode));
> > +  return mem;
> > +}
> >  
> > -    case E_V4SImode:
> > -      return gen_vec_store_pairv4siv4si (mem1, reg1, mem2, reg2);
> > +/* Generate and return a store pair instruction to store REG1 and REG2
> > +   into memory starting at BASE_MEM.  All three rtxes should have modes of the
> > +   same size.  */
> >  
> > -    case E_V16QImode:
> > -      return gen_vec_store_pairv16qiv16qi (mem1, reg1, mem2, reg2);
> > +rtx
> > +aarch64_gen_store_pair (rtx base_mem, rtx reg1, rtx reg2)
> > +{
> > +  rtx pair_mem = aarch64_pair_mem_from_base (base_mem);
> >  
> > -    default:
> > -      gcc_unreachable ();
> > -    }
> > +  return gen_rtx_SET (pair_mem,
> > +		      gen_rtx_UNSPEC (GET_MODE (pair_mem),
> > +				      gen_rtvec (2, reg1, reg2),
> > +				      UNSPEC_STP));
> >  }
> >  
> > -/* Generate and regurn a load pair isntruction of mode MODE to load register
> > -   REG1 from MEM1 and register REG2 from MEM2.  */
> > +/* Generate and return a load pair instruction to load a pair of
> > +   registers starting at BASE_MEM into REG1 and REG2.  If CODE is
> > +   UNKNOWN, all three rtxes should have modes of the same size.
> > +   Otherwise, CODE is {SIGN,ZERO}_EXTEND, base_mem should be in SImode,
> > +   and REG{1,2} should be in DImode.  */
> >  
> > -static rtx
> > -aarch64_gen_load_pair (machine_mode mode, rtx reg1, rtx mem1, rtx reg2,
> > -		       rtx mem2)
> > +rtx
> > +aarch64_gen_load_pair (rtx reg1, rtx reg2, rtx base_mem, enum rtx_code code)
> >  {
> > -  switch (mode)
> > -    {
> > -    case E_DImode:
> > -      return gen_load_pair_dw_didi (reg1, mem1, reg2, mem2);
> > +  rtx pair_mem = aarch64_pair_mem_from_base (base_mem);
> >  
> > -    case E_DFmode:
> > -      return gen_load_pair_dw_dfdf (reg1, mem1, reg2, mem2);
> > -
> > -    case E_TFmode:
> > -      return gen_load_pair_dw_tftf (reg1, mem1, reg2, mem2);
> > +  const bool any_extend_p = (code == ZERO_EXTEND || code == SIGN_EXTEND);
> > +  if (any_extend_p)
> > +    {
> > +      gcc_checking_assert (GET_MODE (base_mem) == SImode);
> > +      gcc_checking_assert (GET_MODE (reg1) == DImode);
> > +      gcc_checking_assert (GET_MODE (reg2) == DImode);
> 
> Not a personal preference, but I think single asserts with && are
> preferred.

Ah, that's a shame.  Different asserts allow you to see which one failed from
the backtrace.  Anyway, I've collapsed these in the latest version.
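
Concretely, the collapsed form is just:

  gcc_checking_assert (GET_MODE (base_mem) == SImode
                       && GET_MODE (reg1) == DImode
                       && GET_MODE (reg2) == DImode);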

> 
> > +    }
> > +  else
> > +    gcc_assert (code == UNKNOWN);
> > +
> > +  rtx unspecs[2] = {
> > +    gen_rtx_UNSPEC (any_extend_p ? SImode : GET_MODE (reg1),
> > +		    gen_rtvec (1, pair_mem),
> > +		    UNSPEC_LDP_FST),
> > +    gen_rtx_UNSPEC (any_extend_p ? SImode : GET_MODE (reg2),
> 
> IIUC, the unspec modes could both be GET_MODE (base_mem)

I don't think so.  In the non-extending case we allow pairs loading to
registers in distinct modes, provided the modes are of the same size.
So I think we should respect the modes of the registers, and allow the
unspec to hide the mode change.  Does that make sense?
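
To illustrate the point, a hypothetical caller in the non-extending case
might do something like the following (sketch only; base_mem here is
assumed to be a DImode mem satisfying the pair constraints):

  rtx reg1 = gen_reg_rtx (DImode);
  rtx reg2 = gen_reg_rtx (DFmode);
  /* aarch64_pair_mem_from_base widens BASE_MEM to V2x8QImode.  Each
     lane unspec then takes the mode of its own destination register,
     so the DImode/DFmode difference is hidden inside the unspec
     rather than needing an explicit subreg or mode change on the
     mem.  */
  emit_insn (aarch64_gen_load_pair (reg1, reg2, base_mem));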

> 
> > +		    gen_rtvec (1, copy_rtx (pair_mem)),
> > +		    UNSPEC_LDP_SND)
> > +  };
> >  
> > -    case E_V4SImode:
> > -      return gen_load_pairv4siv4si (reg1, mem1, reg2, mem2);
> > +  if (any_extend_p)
> > +    for (int i = 0; i < 2; i++)
> > +      unspecs[i] = gen_rtx_fmt_e (code, DImode, unspecs[i]);
> >  
> > -    default:
> > -      gcc_unreachable ();
> > -    }
> > +  return gen_rtx_PARALLEL (VOIDmode,
> > +			   gen_rtvec (2,
> > +				      gen_rtx_SET (reg1, unspecs[0]),
> > +				      gen_rtx_SET (reg2, unspecs[1])));
> >  }
> >  
> >  /* Return TRUE if return address signing should be enabled for the current
> > @@ -9321,8 +9343,19 @@ aarch64_save_callee_saves (poly_int64 bytes_below_sp,
> >  	  offset -= fp_offset;
> >  	}
> >        rtx mem = gen_frame_mem (mode, plus_constant (Pmode, base_rtx, offset));
> > -      bool need_cfa_note_p = (base_rtx != stack_pointer_rtx);
> >  
> > +      rtx cfa_base = stack_pointer_rtx;
> > +      poly_int64 cfa_offset = sp_offset;
> 
> I don't think we need both cfa_offset and sp_offset.  sp_offset in the
> current code only exists for CFI purposes.

Fixed, thanks.
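
Concretely, the new version looks roughly like this (sketch; I'm
assuming sp_offset is simply reused, per your suggestion):

  rtx cfa_base = stack_pointer_rtx;
  if (hard_fp_valid_p && frame_pointer_needed)
    {
      cfa_base = hard_frame_pointer_rtx;
      sp_offset += bytes_below_sp - frame.bytes_below_hard_fp;
    }
  rtx cfa_mem = gen_frame_mem (mode,
                               plus_constant (Pmode, cfa_base,
                                              sp_offset));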

> 
> > +
> > +      if (hard_fp_valid_p && frame_pointer_needed)
> > +	{
> > +	  cfa_base = hard_frame_pointer_rtx;
> > +	  cfa_offset += (bytes_below_sp - frame.bytes_below_hard_fp);
> > +	}
> > +
> > +      rtx cfa_mem = gen_frame_mem (mode,
> > +				   plus_constant (Pmode,
> > +						  cfa_base, cfa_offset));
> >        unsigned int regno2;
> >        if (!aarch64_sve_mode_p (mode)
> >  	  && i + 1 < regs.size ()
> > @@ -9331,45 +9364,37 @@ aarch64_save_callee_saves (poly_int64 bytes_below_sp,
> >  		       frame.reg_offset[regno2] - frame.reg_offset[regno]))
> >  	{
> >  	  rtx reg2 = gen_rtx_REG (mode, regno2);
> > -	  rtx mem2;
> >  
> >  	  offset += GET_MODE_SIZE (mode);
> > -	  mem2 = gen_frame_mem (mode, plus_constant (Pmode, base_rtx, offset));
> > -	  insn = emit_insn (aarch64_gen_store_pair (mode, mem, reg, mem2,
> > -						    reg2));
> > -
> > -	  /* The first part of a frame-related parallel insn is
> > -	     always assumed to be relevant to the frame
> > -	     calculations; subsequent parts, are only
> > -	     frame-related if explicitly marked.  */
> > +	  insn = emit_insn (aarch64_gen_store_pair (mem, reg, reg2));
> > +
> >  	  if (aarch64_emit_cfi_for_reg_p (regno2))
> >  	    {
> > -	      if (need_cfa_note_p)
> > -		aarch64_add_cfa_expression (insn, reg2, stack_pointer_rtx,
> > -					    sp_offset + GET_MODE_SIZE (mode));
> > -	      else
> > -		RTX_FRAME_RELATED_P (XVECEXP (PATTERN (insn), 0, 1)) = 1;
> > +	      rtx cfa_mem2 = adjust_address_nv (cfa_mem,
> > +						Pmode,
> > +						GET_MODE_SIZE (mode));
> 
> Think this should use get_frame_mem directly, rather than moving beyond
> the bounds of the original mem.

Done.
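
i.e. roughly the following, building the second mem directly rather
than offsetting cfa_mem beyond its bounds (sketch, following on from
the sp_offset change above):

  rtx cfa_mem2 = gen_frame_mem (mode,
                                plus_constant (Pmode, cfa_base,
                                               sp_offset
                                               + GET_MODE_SIZE (mode)));
  add_reg_note (insn, REG_CFA_OFFSET, gen_rtx_SET (cfa_mem2, reg2));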

> 
> > +	      add_reg_note (insn, REG_CFA_OFFSET,
> > +			    gen_rtx_SET (cfa_mem2, reg2));
> >  	    }
> >  
> >  	  regno = regno2;
> >  	  ++i;
> >  	}
> >        else if (mode == VNx2DImode && BYTES_BIG_ENDIAN)
> > -	{
> > -	  insn = emit_insn (gen_aarch64_pred_mov (mode, mem, ptrue, reg));
> > -	  need_cfa_note_p = true;
> > -	}
> > +	insn = emit_insn (gen_aarch64_pred_mov (mode, mem, ptrue, reg));
> >        else if (aarch64_sve_mode_p (mode))
> >  	insn = emit_insn (gen_rtx_SET (mem, reg));
> >        else
> >  	insn = emit_move_insn (mem, reg);
> >  
> >        RTX_FRAME_RELATED_P (insn) = frame_related_p;
> > -      if (frame_related_p && need_cfa_note_p)
> > -	aarch64_add_cfa_expression (insn, reg, stack_pointer_rtx, sp_offset);
> > +
> > +      if (frame_related_p)
> > +	add_reg_note (insn, REG_CFA_OFFSET, gen_rtx_SET (cfa_mem, reg));
> 
> For the record, I might need to add back some CFA_EXPRESSIONs for
> locally-streaming SME functions, to ensure that the CFI code doesn't
> aggregate SVE saves across a change in the VG DWARF register.
> But it's probably easier to do that once the patch is in,
> since having a note on all insns will help to ensure consistency.
> 
> >      }
> >  }
> >  
> > +
> 
> Stray extra whitespace.

Fixed.

> 
> >  /* Emit code to restore the callee registers in REGS, ignoring pop candidates
> >     and any other registers that are handled separately.  Write the appropriate
> >     REG_CFA_RESTORE notes into CFI_OPS.
> > @@ -9425,12 +9450,7 @@ aarch64_restore_callee_saves (poly_int64 bytes_below_sp,
> >  		       frame.reg_offset[regno2] - frame.reg_offset[regno]))
> >  	{
> >  	  rtx reg2 = gen_rtx_REG (mode, regno2);
> > -	  rtx mem2;
> > -
> > -	  offset += GET_MODE_SIZE (mode);
> > -	  mem2 = gen_frame_mem (mode, plus_constant (Pmode, base_rtx, offset));
> > -	  emit_insn (aarch64_gen_load_pair (mode, reg, mem, reg2, mem2));
> > -
> > +	  emit_insn (aarch64_gen_load_pair (reg, reg2, mem));
> >  	  *cfi_ops = alloc_reg_note (REG_CFA_RESTORE, reg2, *cfi_ops);
> >  	  regno = regno2;
> >  	  ++i;
> > @@ -9762,9 +9782,9 @@ aarch64_process_components (sbitmap components, bool prologue_p)
> >  			     : gen_rtx_SET (reg2, mem2);
> >  
> >        if (prologue_p)
> > -	insn = emit_insn (aarch64_gen_store_pair (mode, mem, reg, mem2, reg2));
> > +	insn = emit_insn (aarch64_gen_store_pair (mem, reg, reg2));
> >        else
> > -	insn = emit_insn (aarch64_gen_load_pair (mode, reg, mem, reg2, mem2));
> > +	insn = emit_insn (aarch64_gen_load_pair (reg, reg2, mem));
> >  
> >        if (frame_related_p || frame_related2_p)
> >  	{
> > @@ -10983,12 +11003,18 @@ aarch64_classify_address (struct aarch64_address_info *info,
> >       mode of the corresponding addressing mode is half of that.  */
> >    if (type == ADDR_QUERY_LDP_STP_N)
> >      {
> > -      if (known_eq (GET_MODE_SIZE (mode), 16))
> > +      if (known_eq (GET_MODE_SIZE (mode), 32))
> > +	mode = V16QImode;
> > +      else if (known_eq (GET_MODE_SIZE (mode), 16))
> >  	mode = DFmode;
> >        else if (known_eq (GET_MODE_SIZE (mode), 8))
> >  	mode = SFmode;
> >        else
> >  	return false;
> > +
> > +      /* This isn't really an Advanced SIMD struct mode, but a mode
> > +	 used to represent the complete mem in a load/store pair.  */
> > +      advsimd_struct_p = false;
> >      }
> >  
> >    bool allow_reg_index_p = (!load_store_pair_p
> > @@ -12609,7 +12635,8 @@ aarch64_print_operand (FILE *f, rtx x, int code)
> >  	if (!MEM_P (x)
> >  	    || (code == 'y'
> >  		&& maybe_ne (GET_MODE_SIZE (mode), 8)
> > -		&& maybe_ne (GET_MODE_SIZE (mode), 16)))
> > +		&& maybe_ne (GET_MODE_SIZE (mode), 16)
> > +		&& maybe_ne (GET_MODE_SIZE (mode), 32)))
> >  	  {
> >  	    output_operand_lossage ("invalid operand for '%%%c'", code);
> >  	    return;
> > @@ -25431,10 +25458,8 @@ aarch64_copy_one_block_and_progress_pointers (rtx *src, rtx *dst,
> >        *src = adjust_address (*src, mode, 0);
> >        *dst = adjust_address (*dst, mode, 0);
> >        /* Emit the memcpy.  */
> > -      emit_insn (aarch64_gen_load_pair (mode, reg1, *src, reg2,
> > -					aarch64_progress_pointer (*src)));
> > -      emit_insn (aarch64_gen_store_pair (mode, *dst, reg1,
> > -					 aarch64_progress_pointer (*dst), reg2));
> > +      emit_insn (aarch64_gen_load_pair (reg1, reg2, *src));
> > +      emit_insn (aarch64_gen_store_pair (*dst, reg1, reg2));
> >        /* Move the pointers forward.  */
> >        *src = aarch64_move_pointer (*src, 32);
> >        *dst = aarch64_move_pointer (*dst, 32);
> > @@ -25613,8 +25638,7 @@ aarch64_set_one_block_and_progress_pointer (rtx src, rtx *dst,
> >        /* "Cast" the *dst to the correct mode.  */
> >        *dst = adjust_address (*dst, mode, 0);
> >        /* Emit the memset.  */
> > -      emit_insn (aarch64_gen_store_pair (mode, *dst, src,
> > -					 aarch64_progress_pointer (*dst), src));
> > +      emit_insn (aarch64_gen_store_pair (*dst, src, src));
> >  
> >        /* Move the pointers forward.  */
> >        *dst = aarch64_move_pointer (*dst, 32);
> > @@ -26812,6 +26836,22 @@ aarch64_swap_ldrstr_operands (rtx* operands, bool load)
> >      }
> >  }
> >  
> > +void
> > +aarch64_finish_ldpstp_peephole (rtx *operands, bool load_p, enum rtx_code code)
> 
> Missing function comment.

Fixed.
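
For reference, the added comment reads roughly as follows (sketch; the
exact wording may differ in the posted version):

  /* Emit a load/store pair instruction from the peephole2 OPERANDS,
     reordering the operands if necessary.  LOAD_P is true iff the
     operands describe a load pair.  CODE gives the extension
     ({SIGN,ZERO}_EXTEND) applied by an extending load pair, and is
     otherwise UNKNOWN (zero).  */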

> 
> > +{
> > +  aarch64_swap_ldrstr_operands (operands, load_p);
> > +
> > +  if (load_p)
> > +    emit_insn (aarch64_gen_load_pair (operands[0], operands[2],
> > +				      operands[1], code));
> > +  else
> > +    {
> > +      gcc_assert (code == UNKNOWN);
> > +      emit_insn (aarch64_gen_store_pair (operands[0], operands[1],
> > +					 operands[3]));
> > +    }
> > +}
> > +
> >  /* Taking X and Y to be HOST_WIDE_INT pointers, return the result of a
> >     comparison between the two.  */
> >  int
> > @@ -26993,8 +27033,8 @@ bool
> >  aarch64_gen_adjusted_ldpstp (rtx *operands, bool load,
> >  			     machine_mode mode, RTX_CODE code)
> >  {
> > -  rtx base, offset_1, offset_3, t1, t2;
> > -  rtx mem_1, mem_2, mem_3, mem_4;
> > +  rtx base, offset_1, offset_3;
> > +  rtx mem_1, mem_2;
> >    rtx temp_operands[8];
> >    HOST_WIDE_INT off_val_1, off_val_3, base_off, new_off_1, new_off_3,
> >  		stp_off_upper_limit, stp_off_lower_limit, msize;
> > @@ -27019,21 +27059,17 @@ aarch64_gen_adjusted_ldpstp (rtx *operands, bool load,
> >    if (load)
> >      {
> >        mem_1 = copy_rtx (temp_operands[1]);
> > -      mem_2 = copy_rtx (temp_operands[3]);
> > -      mem_3 = copy_rtx (temp_operands[5]);
> > -      mem_4 = copy_rtx (temp_operands[7]);
> > +      mem_2 = copy_rtx (temp_operands[5]);
> >      }
> >    else
> >      {
> >        mem_1 = copy_rtx (temp_operands[0]);
> > -      mem_2 = copy_rtx (temp_operands[2]);
> > -      mem_3 = copy_rtx (temp_operands[4]);
> > -      mem_4 = copy_rtx (temp_operands[6]);
> > +      mem_2 = copy_rtx (temp_operands[4]);
> >        gcc_assert (code == UNKNOWN);
> >      }
> >  
> >    extract_base_offset_in_addr (mem_1, &base, &offset_1);
> > -  extract_base_offset_in_addr (mem_3, &base, &offset_3);
> > +  extract_base_offset_in_addr (mem_2, &base, &offset_3);
> 
> mem_2 with offset_3 feels a bit awkward.  Might be worth using mem_3 instead,
> so that the memory and register numbers are in sync.

I went with mem_1 and mem_2 for now.  I think it looks fairly consistent with
that change, WDYT?

> 
> I suppose we still need Ump for the extending loads, is that right?
> Are there any other uses left?

There is a use of satisfies_constraint_Ump in aarch64_process_components, but
that's it.

How does the new version look?

Thanks,
Alex

> 
> Thanks,
> Richard
> 
> >    gcc_assert (base != NULL_RTX && offset_1 != NULL_RTX
> >  	      && offset_3 != NULL_RTX);
> >  
> > @@ -27097,63 +27133,48 @@ aarch64_gen_adjusted_ldpstp (rtx *operands, bool load,
> >    replace_equiv_address_nv (mem_1, plus_constant (Pmode, operands[8],
> >  						  new_off_1), true);
> >    replace_equiv_address_nv (mem_2, plus_constant (Pmode, operands[8],
> > -						  new_off_1 + msize), true);
> > -  replace_equiv_address_nv (mem_3, plus_constant (Pmode, operands[8],
> >  						  new_off_3), true);
> > -  replace_equiv_address_nv (mem_4, plus_constant (Pmode, operands[8],
> > -						  new_off_3 + msize), true);
> >  
> >    if (!aarch64_mem_pair_operand (mem_1, mode)
> > -      || !aarch64_mem_pair_operand (mem_3, mode))
> > +      || !aarch64_mem_pair_operand (mem_2, mode))
> >      return false;
> >  
> > -  if (code == ZERO_EXTEND)
> > -    {
> > -      mem_1 = gen_rtx_ZERO_EXTEND (DImode, mem_1);
> > -      mem_2 = gen_rtx_ZERO_EXTEND (DImode, mem_2);
> > -      mem_3 = gen_rtx_ZERO_EXTEND (DImode, mem_3);
> > -      mem_4 = gen_rtx_ZERO_EXTEND (DImode, mem_4);
> > -    }
> > -  else if (code == SIGN_EXTEND)
> > -    {
> > -      mem_1 = gen_rtx_SIGN_EXTEND (DImode, mem_1);
> > -      mem_2 = gen_rtx_SIGN_EXTEND (DImode, mem_2);
> > -      mem_3 = gen_rtx_SIGN_EXTEND (DImode, mem_3);
> > -      mem_4 = gen_rtx_SIGN_EXTEND (DImode, mem_4);
> > -    }
> > -
> >    if (load)
> >      {
> >        operands[0] = temp_operands[0];
> >        operands[1] = mem_1;
> >        operands[2] = temp_operands[2];
> > -      operands[3] = mem_2;
> >        operands[4] = temp_operands[4];
> > -      operands[5] = mem_3;
> > +      operands[5] = mem_2;
> >        operands[6] = temp_operands[6];
> > -      operands[7] = mem_4;
> >      }
> >    else
> >      {
> >        operands[0] = mem_1;
> >        operands[1] = temp_operands[1];
> > -      operands[2] = mem_2;
> >        operands[3] = temp_operands[3];
> > -      operands[4] = mem_3;
> > +      operands[4] = mem_2;
> >        operands[5] = temp_operands[5];
> > -      operands[6] = mem_4;
> >        operands[7] = temp_operands[7];
> >      }
> >  
> >    /* Emit adjusting instruction.  */
> >    emit_insn (gen_rtx_SET (operands[8], plus_constant (DImode, base, base_off)));
> >    /* Emit ldp/stp instructions.  */
> > -  t1 = gen_rtx_SET (operands[0], operands[1]);
> > -  t2 = gen_rtx_SET (operands[2], operands[3]);
> > -  emit_insn (gen_rtx_PARALLEL (VOIDmode, gen_rtvec (2, t1, t2)));
> > -  t1 = gen_rtx_SET (operands[4], operands[5]);
> > -  t2 = gen_rtx_SET (operands[6], operands[7]);
> > -  emit_insn (gen_rtx_PARALLEL (VOIDmode, gen_rtvec (2, t1, t2)));
> > +  if (load)
> > +    {
> > +      emit_insn (aarch64_gen_load_pair (operands[0], operands[2],
> > +					operands[1], code));
> > +      emit_insn (aarch64_gen_load_pair (operands[4], operands[6],
> > +					operands[5], code));
> > +    }
> > +  else
> > +    {
> > +      emit_insn (aarch64_gen_store_pair (operands[0], operands[1],
> > +					 operands[3]));
> > +      emit_insn (aarch64_gen_store_pair (operands[4], operands[5],
> > +					 operands[7]));
> > +    }
> >    return true;
> >  }
> >  
> > diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
> > index c92a51690c5..ffb6b0ba749 100644
> > --- a/gcc/config/aarch64/aarch64.md
> > +++ b/gcc/config/aarch64/aarch64.md
> > @@ -175,6 +175,9 @@ (define_c_enum "unspec" [
> >      UNSPEC_GOTSMALLTLS
> >      UNSPEC_GOTTINYPIC
> >      UNSPEC_GOTTINYTLS
> > +    UNSPEC_STP
> > +    UNSPEC_LDP_FST
> > +    UNSPEC_LDP_SND
> >      UNSPEC_LD1
> >      UNSPEC_LD2
> >      UNSPEC_LD2_DREG
> > @@ -453,6 +456,11 @@ (define_attr "predicated" "yes,no" (const_string "no"))
> >  ;; may chose to hold the tracking state encoded in SP.
> >  (define_attr "speculation_barrier" "true,false" (const_string "false"))
> >  
> > +;; Attribute used to identify load pair and store pair instructions.
> > +;; Currently the attribute is only applied to the non-writeback ldp/stp
> > +;; patterns.
> > +(define_attr "ldpstp" "ldp,stp,none" (const_string "none"))
> > +
> >  ;; -------------------------------------------------------------------
> >  ;; Pipeline descriptions and scheduling
> >  ;; -------------------------------------------------------------------
> > @@ -1735,100 +1743,62 @@ (define_expand "setmemdi"
> >    FAIL;
> >  })
> >  
> > -;; Operands 1 and 3 are tied together by the final condition; so we allow
> > -;; fairly lax checking on the second memory operation.
> > -(define_insn "load_pair_sw_<SX:mode><SX2:mode>"
> > -  [(set (match_operand:SX 0 "register_operand")
> > -	(match_operand:SX 1 "aarch64_mem_pair_operand"))
> > -   (set (match_operand:SX2 2 "register_operand")
> > -	(match_operand:SX2 3 "memory_operand"))]
> > -   "rtx_equal_p (XEXP (operands[3], 0),
> > -		 plus_constant (Pmode,
> > -				XEXP (operands[1], 0),
> > -				GET_MODE_SIZE (<SX:MODE>mode)))"
> > -  {@ [ cons: =0 , 1   , =2 , 3 ; attrs: type     , arch ]
> > -     [ r        , Ump , r  , m ; load_8          , *    ] ldp\t%w0, %w2, %z1
> > -     [ w        , Ump , w  , m ; neon_load1_2reg , fp   ] ldp\t%s0, %s2, %z1
> > -  }
> > -)
> > -
> > -;; Storing different modes that can still be merged
> > -(define_insn "load_pair_dw_<DX:mode><DX2:mode>"
> > -  [(set (match_operand:DX 0 "register_operand")
> > -	(match_operand:DX 1 "aarch64_mem_pair_operand"))
> > -   (set (match_operand:DX2 2 "register_operand")
> > -	(match_operand:DX2 3 "memory_operand"))]
> > -   "rtx_equal_p (XEXP (operands[3], 0),
> > -		 plus_constant (Pmode,
> > -				XEXP (operands[1], 0),
> > -				GET_MODE_SIZE (<DX:MODE>mode)))"
> > -  {@ [ cons: =0 , 1   , =2 , 3 ; attrs: type     , arch ]
> > -     [ r        , Ump , r  , m ; load_16         , *    ] ldp\t%x0, %x2, %z1
> > -     [ w        , Ump , w  , m ; neon_load1_2reg , fp   ] ldp\t%d0, %d2, %z1
> > -  }
> > -)
> > -
> > -(define_insn "load_pair_dw_<TX:mode><TX2:mode>"
> > -  [(set (match_operand:TX 0 "register_operand" "=w")
> > -	(match_operand:TX 1 "aarch64_mem_pair_operand" "Ump"))
> > -   (set (match_operand:TX2 2 "register_operand" "=w")
> > -	(match_operand:TX2 3 "memory_operand" "m"))]
> > -   "TARGET_SIMD
> > -    && rtx_equal_p (XEXP (operands[3], 0),
> > -		    plus_constant (Pmode,
> > -				   XEXP (operands[1], 0),
> > -				   GET_MODE_SIZE (<TX:MODE>mode)))"
> > -  "ldp\\t%q0, %q2, %z1"
> > +(define_insn "*load_pair_<ldst_sz>"
> > +  [(set (match_operand:GPI 0 "aarch64_ldp_reg_operand")
> > +	(unspec [
> > +	  (match_operand:<VPAIR> 1 "aarch64_mem_pair_lanes_operand")
> > +	] UNSPEC_LDP_FST))
> > +   (set (match_operand:GPI 2 "aarch64_ldp_reg_operand")
> > +	(unspec [
> > +	  (match_dup 1)
> > +	] UNSPEC_LDP_SND))]
> > +  ""
> > +  {@ [cons: =0, 1,   =2; attrs: type,	   arch]
> > +     [	     r, Umn,  r; load_<ldpstp_sz>, *   ] ldp\t%<w>0, %<w>2, %y1
> > +     [	     w, Umn,  w; neon_load1_2reg,  fp  ] ldp\t%<v>0, %<v>2, %y1
> > +  }
> > +  [(set_attr "ldpstp" "ldp")]
> > +)
> > +
> > +(define_insn "*load_pair_16"
> > +  [(set (match_operand:TI 0 "aarch64_ldp_reg_operand" "=w")
> > +	(unspec [
> > +	  (match_operand:V2x16QI 1 "aarch64_mem_pair_lanes_operand" "Umn")
> > +	] UNSPEC_LDP_FST))
> > +   (set (match_operand:TI 2 "aarch64_ldp_reg_operand" "=w")
> > +	(unspec [
> > +	  (match_dup 1)
> > +	] UNSPEC_LDP_SND))]
> > +  "TARGET_FLOAT"
> > +  "ldp\\t%q0, %q2, %y1"
> >    [(set_attr "type" "neon_ldp_q")
> > -   (set_attr "fp" "yes")]
> > -)
> > -
> > -;; Operands 0 and 2 are tied together by the final condition; so we allow
> > -;; fairly lax checking on the second memory operation.
> > -(define_insn "store_pair_sw_<SX:mode><SX2:mode>"
> > -  [(set (match_operand:SX 0 "aarch64_mem_pair_operand")
> > -	(match_operand:SX 1 "aarch64_reg_zero_or_fp_zero"))
> > -   (set (match_operand:SX2 2 "memory_operand")
> > -	(match_operand:SX2 3 "aarch64_reg_zero_or_fp_zero"))]
> > -   "rtx_equal_p (XEXP (operands[2], 0),
> > -		 plus_constant (Pmode,
> > -				XEXP (operands[0], 0),
> > -				GET_MODE_SIZE (<SX:MODE>mode)))"
> > -  {@ [ cons: =0 , 1   , =2 , 3   ; attrs: type      , arch ]
> > -     [ Ump      , rYZ , m  , rYZ ; store_8          , *    ] stp\t%w1, %w3, %z0
> > -     [ Ump      , w   , m  , w   ; neon_store1_2reg , fp   ] stp\t%s1, %s3, %z0
> > -  }
> > -)
> > -
> > -;; Storing different modes that can still be merged
> > -(define_insn "store_pair_dw_<DX:mode><DX2:mode>"
> > -  [(set (match_operand:DX 0 "aarch64_mem_pair_operand")
> > -	(match_operand:DX 1 "aarch64_reg_zero_or_fp_zero"))
> > -   (set (match_operand:DX2 2 "memory_operand")
> > -	(match_operand:DX2 3 "aarch64_reg_zero_or_fp_zero"))]
> > -   "rtx_equal_p (XEXP (operands[2], 0),
> > -		 plus_constant (Pmode,
> > -				XEXP (operands[0], 0),
> > -				GET_MODE_SIZE (<DX:MODE>mode)))"
> > -  {@ [ cons: =0 , 1   , =2 , 3   ; attrs: type      , arch ]
> > -     [ Ump      , rYZ , m  , rYZ ; store_16         , *    ] stp\t%x1, %x3, %z0
> > -     [ Ump      , w   , m  , w   ; neon_store1_2reg , fp   ] stp\t%d1, %d3, %z0
> > -  }
> > -)
> > -
> > -(define_insn "store_pair_dw_<TX:mode><TX2:mode>"
> > -  [(set (match_operand:TX 0 "aarch64_mem_pair_operand" "=Ump")
> > -	(match_operand:TX 1 "register_operand" "w"))
> > -   (set (match_operand:TX2 2 "memory_operand" "=m")
> > -	(match_operand:TX2 3 "register_operand" "w"))]
> > -   "TARGET_SIMD &&
> > -    rtx_equal_p (XEXP (operands[2], 0),
> > -		 plus_constant (Pmode,
> > -				XEXP (operands[0], 0),
> > -				GET_MODE_SIZE (TFmode)))"
> > -  "stp\\t%q1, %q3, %z0"
> > +   (set_attr "fp" "yes")
> > +   (set_attr "ldpstp" "ldp")]
> > +)
> > +
> > +(define_insn "*store_pair_<ldst_sz>"
> > +  [(set (match_operand:<VPAIR> 0 "aarch64_mem_pair_lanes_operand")
> > +	(unspec:<VPAIR>
> > +	  [(match_operand:GPI 1 "aarch64_stp_reg_operand")
> > +	   (match_operand:GPI 2 "aarch64_stp_reg_operand")] UNSPEC_STP))]
> > +  ""
> > +  {@ [cons:  =0,   1,   2; attrs: type      , arch]
> > +     [	    Umn, rYZ, rYZ; store_<ldpstp_sz>, *   ] stp\t%<w>1, %<w>2, %y0
> > +     [	    Umn,   w,   w; neon_store1_2reg , fp  ] stp\t%<v>1, %<v>2, %y0
> > +  }
> > +  [(set_attr "ldpstp" "stp")]
> > +)
> > +
> > +(define_insn "*store_pair_16"
> > +  [(set (match_operand:V2x16QI 0 "aarch64_mem_pair_lanes_operand" "=Umn")
> > +	(unspec:V2x16QI
> > +	  [(match_operand:TI 1 "aarch64_ldp_reg_operand" "w")
> > +	   (match_operand:TI 2 "aarch64_ldp_reg_operand" "w")] UNSPEC_STP))]
> > +  "TARGET_FLOAT"
> > +  "stp\t%q1, %q2, %y0"
> >    [(set_attr "type" "neon_stp_q")
> > -   (set_attr "fp" "yes")]
> > +   (set_attr "fp" "yes")
> > +   (set_attr "ldpstp" "stp")]
> >  )
> >  
> >  ;; Writeback load/store pair patterns.
> > @@ -2074,14 +2044,15 @@ (define_insn "*extendsidi2_aarch64"
> >  
> >  (define_insn "*load_pair_extendsidi2_aarch64"
> >    [(set (match_operand:DI 0 "register_operand" "=r")
> > -	(sign_extend:DI (match_operand:SI 1 "aarch64_mem_pair_operand" "Ump")))
> > +	(sign_extend:DI (unspec:SI [
> > +	  (match_operand:V2x4QI 1 "aarch64_mem_pair_lanes_operand" "Umn")
> > +	] UNSPEC_LDP_FST)))
> >     (set (match_operand:DI 2 "register_operand" "=r")
> > -	(sign_extend:DI (match_operand:SI 3 "memory_operand" "m")))]
> > -  "rtx_equal_p (XEXP (operands[3], 0),
> > -		plus_constant (Pmode,
> > -			       XEXP (operands[1], 0),
> > -			       GET_MODE_SIZE (SImode)))"
> > -  "ldpsw\\t%0, %2, %z1"
> > +	(sign_extend:DI (unspec:SI [
> > +	  (match_dup 1)
> > +	] UNSPEC_LDP_SND)))]
> > +  ""
> > +  "ldpsw\\t%0, %2, %y1"
> >    [(set_attr "type" "load_8")]
> >  )
> >  
> > @@ -2101,16 +2072,17 @@ (define_insn "*zero_extendsidi2_aarch64"
> >  
> >  (define_insn "*load_pair_zero_extendsidi2_aarch64"
> >    [(set (match_operand:DI 0 "register_operand")
> > -	(zero_extend:DI (match_operand:SI 1 "aarch64_mem_pair_operand")))
> > +	(zero_extend:DI (unspec:SI [
> > +	  (match_operand:V2x4QI 1 "aarch64_mem_pair_lanes_operand")
> > +	] UNSPEC_LDP_FST)))
> >     (set (match_operand:DI 2 "register_operand")
> > -	(zero_extend:DI (match_operand:SI 3 "memory_operand")))]
> > -  "rtx_equal_p (XEXP (operands[3], 0),
> > -		plus_constant (Pmode,
> > -			       XEXP (operands[1], 0),
> > -			       GET_MODE_SIZE (SImode)))"
> > -  {@ [ cons: =0 , 1   , =2 , 3 ; attrs: type     , arch ]
> > -     [ r        , Ump , r  , m ; load_8          , *    ] ldp\t%w0, %w2, %z1
> > -     [ w        , Ump , w  , m ; neon_load1_2reg , fp   ] ldp\t%s0, %s2, %z1
> > +	(zero_extend:DI (unspec:SI [
> > +	  (match_dup 1)
> > +	] UNSPEC_LDP_SND)))]
> > +  ""
> > +  {@ [ cons: =0 , 1   , =2; attrs: type    , arch]
> > +     [ r	, Umn , r ; load_8	   , *   ] ldp\t%w0, %w2, %y1
> > +     [ w	, Umn , w ; neon_load1_2reg, fp  ] ldp\t%s0, %s2, %y1
> >    }
> >  )
> >  
> > diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md
> > index a920de99ffc..fd8dd6db349 100644
> > --- a/gcc/config/aarch64/iterators.md
> > +++ b/gcc/config/aarch64/iterators.md
> > @@ -1435,6 +1435,9 @@ (define_mode_attr VDBL [(V8QI "V16QI") (V4HI "V8HI")
> >  			(SI   "V2SI")  (SF   "V2SF")
> >  			(DI   "V2DI")  (DF   "V2DF")])
> >  
> > +;; Load/store pair mode.
> > +(define_mode_attr VPAIR [(SI "V2x4QI") (DI "V2x8QI")])
> > +
> >  ;; Register suffix for double-length mode.
> >  (define_mode_attr Vdtype [(V4HF "8h") (V2SF "4s")])
> >  
> > diff --git a/gcc/config/aarch64/predicates.md b/gcc/config/aarch64/predicates.md
> > index b647e5af7c6..80f2e03d8de 100644
> > --- a/gcc/config/aarch64/predicates.md
> > +++ b/gcc/config/aarch64/predicates.md
> > @@ -266,10 +266,12 @@ (define_special_predicate "aarch64_mem_pair_operator"
> >        (match_test "known_eq (GET_MODE_SIZE (mode),
> >  			     GET_MODE_SIZE (GET_MODE (op)))"))))
> >  
> > -(define_predicate "aarch64_mem_pair_operand"
> > -  (and (match_code "mem")
> > -       (match_test "aarch64_legitimate_address_p (mode, XEXP (op, 0), false,
> > -						  ADDR_QUERY_LDP_STP)")))
> > +;; Like aarch64_mem_pair_operator, but additionally check the
> > +;; address is suitable.
> > +(define_special_predicate "aarch64_mem_pair_operand"
> > +  (and (match_operand 0 "aarch64_mem_pair_operator")
> > +       (match_test "aarch64_legitimate_address_p (GET_MODE (op), XEXP (op, 0),
> > +						  false, ADDR_QUERY_LDP_STP)")))
> >  
> >  (define_predicate "pmode_plus_operator"
> >    (and (match_code "plus")
diff mbox series

Patch

diff --git a/gcc/config/aarch64/aarch64-ldpstp.md b/gcc/config/aarch64/aarch64-ldpstp.md
index 1ee7c73ff0c..dc39af85254 100644
--- a/gcc/config/aarch64/aarch64-ldpstp.md
+++ b/gcc/config/aarch64/aarch64-ldpstp.md
@@ -24,10 +24,10 @@  (define_peephole2
    (set (match_operand:GPI 2 "register_operand" "")
 	(match_operand:GPI 3 "memory_operand" ""))]
   "aarch64_operands_ok_for_ldpstp (operands, true, <MODE>mode)"
-  [(parallel [(set (match_dup 0) (match_dup 1))
-	      (set (match_dup 2) (match_dup 3))])]
+  [(const_int 0)]
 {
-  aarch64_swap_ldrstr_operands (operands, true);
+  aarch64_finish_ldpstp_peephole (operands, true);
+  DONE;
 })
 
 (define_peephole2
@@ -36,10 +36,10 @@  (define_peephole2
    (set (match_operand:GPI 2 "memory_operand" "")
 	(match_operand:GPI 3 "aarch64_reg_or_zero" ""))]
   "aarch64_operands_ok_for_ldpstp (operands, false, <MODE>mode)"
-  [(parallel [(set (match_dup 0) (match_dup 1))
-	      (set (match_dup 2) (match_dup 3))])]
+  [(const_int 0)]
 {
-  aarch64_swap_ldrstr_operands (operands, false);
+  aarch64_finish_ldpstp_peephole (operands, false);
+  DONE;
 })
 
 (define_peephole2
@@ -48,10 +48,10 @@  (define_peephole2
    (set (match_operand:GPF 2 "register_operand" "")
 	(match_operand:GPF 3 "memory_operand" ""))]
   "aarch64_operands_ok_for_ldpstp (operands, true, <MODE>mode)"
-  [(parallel [(set (match_dup 0) (match_dup 1))
-	      (set (match_dup 2) (match_dup 3))])]
+  [(const_int 0)]
 {
-  aarch64_swap_ldrstr_operands (operands, true);
+  aarch64_finish_ldpstp_peephole (operands, true);
+  DONE;
 })
 
 (define_peephole2
@@ -60,10 +60,10 @@  (define_peephole2
    (set (match_operand:GPF 2 "memory_operand" "")
 	(match_operand:GPF 3 "aarch64_reg_or_fp_zero" ""))]
   "aarch64_operands_ok_for_ldpstp (operands, false, <MODE>mode)"
-  [(parallel [(set (match_dup 0) (match_dup 1))
-	      (set (match_dup 2) (match_dup 3))])]
+  [(const_int 0)]
 {
-  aarch64_swap_ldrstr_operands (operands, false);
+  aarch64_finish_ldpstp_peephole (operands, false);
+  DONE;
 })
 
 (define_peephole2
@@ -72,10 +72,10 @@  (define_peephole2
    (set (match_operand:DREG2 2 "register_operand" "")
 	(match_operand:DREG2 3 "memory_operand" ""))]
   "aarch64_operands_ok_for_ldpstp (operands, true, <DREG:MODE>mode)"
-  [(parallel [(set (match_dup 0) (match_dup 1))
-	      (set (match_dup 2) (match_dup 3))])]
+  [(const_int 0)]
 {
-  aarch64_swap_ldrstr_operands (operands, true);
+  aarch64_finish_ldpstp_peephole (operands, true);
+  DONE;
 })
 
 (define_peephole2
@@ -84,10 +84,10 @@  (define_peephole2
    (set (match_operand:DREG2 2 "memory_operand" "")
 	(match_operand:DREG2 3 "register_operand" ""))]
   "aarch64_operands_ok_for_ldpstp (operands, false, <DREG:MODE>mode)"
-  [(parallel [(set (match_dup 0) (match_dup 1))
-	      (set (match_dup 2) (match_dup 3))])]
+  [(const_int 0)]
 {
-  aarch64_swap_ldrstr_operands (operands, false);
+  aarch64_finish_ldpstp_peephole (operands, false);
+  DONE;
 })
 
 (define_peephole2
@@ -99,10 +99,10 @@  (define_peephole2
    && aarch64_operands_ok_for_ldpstp (operands, true, <VQ:MODE>mode)
    && (aarch64_tune_params.extra_tuning_flags
 	& AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS) == 0"
-  [(parallel [(set (match_dup 0) (match_dup 1))
-	      (set (match_dup 2) (match_dup 3))])]
+  [(const_int 0)]
 {
-  aarch64_swap_ldrstr_operands (operands, true);
+  aarch64_finish_ldpstp_peephole (operands, true);
+  DONE;
 })
 
 (define_peephole2
@@ -114,10 +114,10 @@  (define_peephole2
    && aarch64_operands_ok_for_ldpstp (operands, false, <VQ:MODE>mode)
    && (aarch64_tune_params.extra_tuning_flags
 	& AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS) == 0"
-  [(parallel [(set (match_dup 0) (match_dup 1))
-	      (set (match_dup 2) (match_dup 3))])]
+  [(const_int 0)]
 {
-  aarch64_swap_ldrstr_operands (operands, false);
+  aarch64_finish_ldpstp_peephole (operands, false);
+  DONE;
 })
 
 
@@ -129,10 +129,10 @@  (define_peephole2
    (set (match_operand:DI 2 "register_operand" "")
 	(sign_extend:DI (match_operand:SI 3 "memory_operand" "")))]
   "aarch64_operands_ok_for_ldpstp (operands, true, SImode)"
-  [(parallel [(set (match_dup 0) (sign_extend:DI (match_dup 1)))
-	      (set (match_dup 2) (sign_extend:DI (match_dup 3)))])]
+  [(const_int 0)]
 {
-  aarch64_swap_ldrstr_operands (operands, true);
+  aarch64_finish_ldpstp_peephole (operands, true, SIGN_EXTEND);
+  DONE;
 })
 
 (define_peephole2
@@ -141,10 +141,10 @@  (define_peephole2
    (set (match_operand:DI 2 "register_operand" "")
 	(zero_extend:DI (match_operand:SI 3 "memory_operand" "")))]
   "aarch64_operands_ok_for_ldpstp (operands, true, SImode)"
-  [(parallel [(set (match_dup 0) (zero_extend:DI (match_dup 1)))
-	      (set (match_dup 2) (zero_extend:DI (match_dup 3)))])]
+  [(const_int 0)]
 {
-  aarch64_swap_ldrstr_operands (operands, true);
+  aarch64_finish_ldpstp_peephole (operands, true, ZERO_EXTEND);
+  DONE;
 })
 
 ;; Handle storing of a floating point zero with integer data.
@@ -163,10 +163,10 @@  (define_peephole2
    (set (match_operand:<FCVT_TARGET> 2 "memory_operand" "")
 	(match_operand:<FCVT_TARGET> 3 "aarch64_reg_zero_or_fp_zero" ""))]
   "aarch64_operands_ok_for_ldpstp (operands, false, <V_INT_EQUIV>mode)"
-  [(parallel [(set (match_dup 0) (match_dup 1))
-	      (set (match_dup 2) (match_dup 3))])]
+  [(const_int 0)]
 {
-  aarch64_swap_ldrstr_operands (operands, false);
+  aarch64_finish_ldpstp_peephole (operands, false);
+  DONE;
 })
 
 ;; Handle consecutive load/store whose offset is out of the range
diff --git a/gcc/config/aarch64/aarch64-modes.def b/gcc/config/aarch64/aarch64-modes.def
index 6b4f4e17dd5..1e0d770f72f 100644
--- a/gcc/config/aarch64/aarch64-modes.def
+++ b/gcc/config/aarch64/aarch64-modes.def
@@ -93,9 +93,13 @@  INT_MODE (XI, 64);
 
 /* V8DI mode.  */
 VECTOR_MODE_WITH_PREFIX (V, INT, DI, 8, 5);
-
 ADJUST_ALIGNMENT (V8DI, 8);
 
+/* V2x4QImode.  Used in load/store pair patterns.  */
+VECTOR_MODE_WITH_PREFIX (V2x, INT, QI, 4, 5);
+ADJUST_NUNITS (V2x4QI, 8);
+ADJUST_ALIGNMENT (V2x4QI, 4);
+
 /* Define Advanced SIMD modes for structures of 2, 3 and 4 d-registers.  */
 #define ADV_SIMD_D_REG_STRUCT_MODES(NVECS, VB, VH, VS, VD) \
   VECTOR_MODES_WITH_PREFIX (V##NVECS##x, INT, 8, 3); \
diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index e463fd5c817..2ab54f244a7 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -967,6 +967,8 @@  void aarch64_split_compare_and_swap (rtx op[]);
 void aarch64_split_atomic_op (enum rtx_code, rtx, rtx, rtx, rtx, rtx, rtx);
 
 bool aarch64_gen_adjusted_ldpstp (rtx *, bool, machine_mode, RTX_CODE);
+void aarch64_finish_ldpstp_peephole (rtx *, bool,
+				     enum rtx_code = (enum rtx_code)0);
 
 void aarch64_expand_sve_vec_cmp_int (rtx, rtx_code, rtx, rtx);
 bool aarch64_expand_sve_vec_cmp_float (rtx, rtx_code, rtx, rtx, bool);
@@ -1022,8 +1024,9 @@  bool aarch64_mergeable_load_pair_p (machine_mode, rtx, rtx);
 bool aarch64_operands_ok_for_ldpstp (rtx *, bool, machine_mode);
 bool aarch64_operands_adjust_ok_for_ldpstp (rtx *, bool, machine_mode);
 bool aarch64_mem_ok_with_ldpstp_policy_model (rtx, bool, machine_mode);
-void aarch64_swap_ldrstr_operands (rtx *, bool);
 bool aarch64_ldpstp_operand_mode_p (machine_mode);
+rtx aarch64_gen_load_pair (rtx, rtx, rtx, enum rtx_code = (enum rtx_code)0);
+rtx aarch64_gen_store_pair (rtx, rtx, rtx);
 
 extern void aarch64_asm_output_pool_epilogue (FILE *, const char *,
 					      tree, HOST_WIDE_INT);
diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index c6f2d582837..6f5080ab030 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -231,38 +231,6 @@  (define_insn "aarch64_store_lane0<mode>"
   [(set_attr "type" "neon_store1_1reg<q>")]
 )
 
-(define_insn "load_pair<DREG:mode><DREG2:mode>"
-  [(set (match_operand:DREG 0 "register_operand")
-	(match_operand:DREG 1 "aarch64_mem_pair_operand"))
-   (set (match_operand:DREG2 2 "register_operand")
-	(match_operand:DREG2 3 "memory_operand"))]
-  "TARGET_FLOAT
-   && rtx_equal_p (XEXP (operands[3], 0),
-		   plus_constant (Pmode,
-				  XEXP (operands[1], 0),
-				  GET_MODE_SIZE (<DREG:MODE>mode)))"
-  {@ [ cons: =0 , 1   , =2 , 3 ; attrs: type ]
-     [ w        , Ump , w  , m ; neon_ldp    ] ldp\t%d0, %d2, %z1
-     [ r        , Ump , r  , m ; load_16     ] ldp\t%x0, %x2, %z1
-  }
-)
-
-(define_insn "vec_store_pair<DREG:mode><DREG2:mode>"
-  [(set (match_operand:DREG 0 "aarch64_mem_pair_operand")
-	(match_operand:DREG 1 "register_operand"))
-   (set (match_operand:DREG2 2 "memory_operand")
-	(match_operand:DREG2 3 "register_operand"))]
-  "TARGET_FLOAT
-   && rtx_equal_p (XEXP (operands[2], 0),
-		   plus_constant (Pmode,
-				  XEXP (operands[0], 0),
-				  GET_MODE_SIZE (<DREG:MODE>mode)))"
-  {@ [ cons: =0 , 1 , =2 , 3 ; attrs: type ]
-     [ Ump      , w , m  , w ; neon_stp    ] stp\t%d1, %d3, %z0
-     [ Ump      , r , m  , r ; store_16    ] stp\t%x1, %x3, %z0
-  }
-)
-
 (define_insn "aarch64_simd_stp<mode>"
   [(set (match_operand:VP_2E 0 "aarch64_mem_pair_lanes_operand")
 	(vec_duplicate:VP_2E (match_operand:<VEL> 1 "register_operand")))]
@@ -273,34 +241,6 @@  (define_insn "aarch64_simd_stp<mode>"
   }
 )
 
-(define_insn "load_pair<VQ:mode><VQ2:mode>"
-  [(set (match_operand:VQ 0 "register_operand" "=w")
-	(match_operand:VQ 1 "aarch64_mem_pair_operand" "Ump"))
-   (set (match_operand:VQ2 2 "register_operand" "=w")
-	(match_operand:VQ2 3 "memory_operand" "m"))]
-  "TARGET_FLOAT
-    && rtx_equal_p (XEXP (operands[3], 0),
-		    plus_constant (Pmode,
-			       XEXP (operands[1], 0),
-			       GET_MODE_SIZE (<VQ:MODE>mode)))"
-  "ldp\\t%q0, %q2, %z1"
-  [(set_attr "type" "neon_ldp_q")]
-)
-
-(define_insn "vec_store_pair<VQ:mode><VQ2:mode>"
-  [(set (match_operand:VQ 0 "aarch64_mem_pair_operand" "=Ump")
-	(match_operand:VQ 1 "register_operand" "w"))
-   (set (match_operand:VQ2 2 "memory_operand" "=m")
-	(match_operand:VQ2 3 "register_operand" "w"))]
-  "TARGET_FLOAT
-   && rtx_equal_p (XEXP (operands[2], 0),
-		   plus_constant (Pmode,
-				  XEXP (operands[0], 0),
-				  GET_MODE_SIZE (<VQ:MODE>mode)))"
-  "stp\\t%q1, %q3, %z0"
-  [(set_attr "type" "neon_stp_q")]
-)
-
 (define_expand "@aarch64_split_simd_mov<mode>"
   [(set (match_operand:VQMOV 0)
 	(match_operand:VQMOV 1))]
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index ccf081d2a16..1f6094bf1bc 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -9056,59 +9056,81 @@  aarch64_pop_regs (unsigned regno1, unsigned regno2, HOST_WIDE_INT adjustment,
     }
 }
 
-/* Generate and return a store pair instruction of mode MODE to store
-   register REG1 to MEM1 and register REG2 to MEM2.  */
+static machine_mode
+aarch64_pair_mode_for_mode (machine_mode mode)
+{
+  if (known_eq (GET_MODE_SIZE (mode), 4))
+    return E_V2x4QImode;
+  else if (known_eq (GET_MODE_SIZE (mode), 8))
+    return E_V2x8QImode;
+  else if (known_eq (GET_MODE_SIZE (mode), 16))
+    return E_V2x16QImode;
+  else
+    gcc_unreachable ();
+}
 
 static rtx
-aarch64_gen_store_pair (machine_mode mode, rtx mem1, rtx reg1, rtx mem2,
-			rtx reg2)
+aarch64_pair_mem_from_base (rtx mem)
 {
-  switch (mode)
-    {
-    case E_DImode:
-      return gen_store_pair_dw_didi (mem1, reg1, mem2, reg2);
-
-    case E_DFmode:
-      return gen_store_pair_dw_dfdf (mem1, reg1, mem2, reg2);
-
-    case E_TFmode:
-      return gen_store_pair_dw_tftf (mem1, reg1, mem2, reg2);
+  auto pair_mode = aarch64_pair_mode_for_mode (GET_MODE (mem));
+  mem = adjust_bitfield_address_nv (mem, pair_mode, 0);
+  gcc_assert (aarch64_mem_pair_lanes_operand (mem, pair_mode));
+  return mem;
+}
 
-    case E_V4SImode:
-      return gen_vec_store_pairv4siv4si (mem1, reg1, mem2, reg2);
+/* Generate and return a store pair instruction to store REG1 and REG2
+   into memory starting at BASE_MEM.  All three rtxes should have modes of the
+   same size.  */
 
-    case E_V16QImode:
-      return gen_vec_store_pairv16qiv16qi (mem1, reg1, mem2, reg2);
+rtx
+aarch64_gen_store_pair (rtx base_mem, rtx reg1, rtx reg2)
+{
+  rtx pair_mem = aarch64_pair_mem_from_base (base_mem);
 
-    default:
-      gcc_unreachable ();
-    }
+  return gen_rtx_SET (pair_mem,
+		      gen_rtx_UNSPEC (GET_MODE (pair_mem),
+				      gen_rtvec (2, reg1, reg2),
+				      UNSPEC_STP));
 }
 
-/* Generate and regurn a load pair isntruction of mode MODE to load register
-   REG1 from MEM1 and register REG2 from MEM2.  */
+/* Generate and return a load pair instruction to load a pair of
+   registers starting at BASE_MEM into REG1 and REG2.  If CODE is
+   UNKNOWN, all three rtxes should have modes of the same size.
+   Otherwise, CODE is {SIGN,ZERO}_EXTEND, base_mem should be in SImode,
+   and REG{1,2} should be in DImode.  */
 
-static rtx
-aarch64_gen_load_pair (machine_mode mode, rtx reg1, rtx mem1, rtx reg2,
-		       rtx mem2)
+rtx
+aarch64_gen_load_pair (rtx reg1, rtx reg2, rtx base_mem, enum rtx_code code)
 {
-  switch (mode)
-    {
-    case E_DImode:
-      return gen_load_pair_dw_didi (reg1, mem1, reg2, mem2);
+  rtx pair_mem = aarch64_pair_mem_from_base (base_mem);
 
-    case E_DFmode:
-      return gen_load_pair_dw_dfdf (reg1, mem1, reg2, mem2);
-
-    case E_TFmode:
-      return gen_load_pair_dw_tftf (reg1, mem1, reg2, mem2);
+  const bool any_extend_p = (code == ZERO_EXTEND || code == SIGN_EXTEND);
+  if (any_extend_p)
+    {
+      gcc_checking_assert (GET_MODE (base_mem) == SImode);
+      gcc_checking_assert (GET_MODE (reg1) == DImode);
+      gcc_checking_assert (GET_MODE (reg2) == DImode);
+    }
+  else
+    gcc_assert (code == UNKNOWN);
+
+  rtx unspecs[2] = {
+    gen_rtx_UNSPEC (any_extend_p ? SImode : GET_MODE (reg1),
+		    gen_rtvec (1, pair_mem),
+		    UNSPEC_LDP_FST),
+    gen_rtx_UNSPEC (any_extend_p ? SImode : GET_MODE (reg2),
+		    gen_rtvec (1, copy_rtx (pair_mem)),
+		    UNSPEC_LDP_SND)
+  };
 
-    case E_V4SImode:
-      return gen_load_pairv4siv4si (reg1, mem1, reg2, mem2);
+  if (any_extend_p)
+    for (int i = 0; i < 2; i++)
+      unspecs[i] = gen_rtx_fmt_e (code, DImode, unspecs[i]);
 
-    default:
-      gcc_unreachable ();
-    }
+  return gen_rtx_PARALLEL (VOIDmode,
+			   gen_rtvec (2,
+				      gen_rtx_SET (reg1, unspecs[0]),
+				      gen_rtx_SET (reg2, unspecs[1])));
 }
 
 /* Return TRUE if return address signing should be enabled for the current
@@ -9321,8 +9343,19 @@  aarch64_save_callee_saves (poly_int64 bytes_below_sp,
 	  offset -= fp_offset;
 	}
       rtx mem = gen_frame_mem (mode, plus_constant (Pmode, base_rtx, offset));
-      bool need_cfa_note_p = (base_rtx != stack_pointer_rtx);
 
+      rtx cfa_base = stack_pointer_rtx;
+      poly_int64 cfa_offset = sp_offset;
+
+      if (hard_fp_valid_p && frame_pointer_needed)
+	{
+	  cfa_base = hard_frame_pointer_rtx;
+	  cfa_offset += (bytes_below_sp - frame.bytes_below_hard_fp);
+	}
+
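+      /* Describe the save slot relative to the CFA for the notes below.  */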
+      rtx cfa_mem = gen_frame_mem (mode,
+				   plus_constant (Pmode,
+						  cfa_base, cfa_offset));
       unsigned int regno2;
       if (!aarch64_sve_mode_p (mode)
 	  && i + 1 < regs.size ()
@@ -9331,45 +9364,37 @@  aarch64_save_callee_saves (poly_int64 bytes_below_sp,
 		       frame.reg_offset[regno2] - frame.reg_offset[regno]))
 	{
 	  rtx reg2 = gen_rtx_REG (mode, regno2);
-	  rtx mem2;
 
 	  offset += GET_MODE_SIZE (mode);
-	  mem2 = gen_frame_mem (mode, plus_constant (Pmode, base_rtx, offset));
-	  insn = emit_insn (aarch64_gen_store_pair (mode, mem, reg, mem2,
-						    reg2));
-
-	  /* The first part of a frame-related parallel insn is
-	     always assumed to be relevant to the frame
-	     calculations; subsequent parts, are only
-	     frame-related if explicitly marked.  */
+	  insn = emit_insn (aarch64_gen_store_pair (mem, reg, reg2));
+
 	  if (aarch64_emit_cfi_for_reg_p (regno2))
 	    {
-	      if (need_cfa_note_p)
-		aarch64_add_cfa_expression (insn, reg2, stack_pointer_rtx,
-					    sp_offset + GET_MODE_SIZE (mode));
-	      else
-		RTX_FRAME_RELATED_P (XVECEXP (PATTERN (insn), 0, 1)) = 1;
+	      rtx cfa_mem2 = adjust_address_nv (cfa_mem,
+						Pmode,
+						GET_MODE_SIZE (mode));
+	      add_reg_note (insn, REG_CFA_OFFSET,
+			    gen_rtx_SET (cfa_mem2, reg2));
 	    }
 
 	  regno = regno2;
 	  ++i;
 	}
       else if (mode == VNx2DImode && BYTES_BIG_ENDIAN)
-	{
-	  insn = emit_insn (gen_aarch64_pred_mov (mode, mem, ptrue, reg));
-	  need_cfa_note_p = true;
-	}
+	insn = emit_insn (gen_aarch64_pred_mov (mode, mem, ptrue, reg));
       else if (aarch64_sve_mode_p (mode))
 	insn = emit_insn (gen_rtx_SET (mem, reg));
       else
 	insn = emit_move_insn (mem, reg);
 
       RTX_FRAME_RELATED_P (insn) = frame_related_p;
-      if (frame_related_p && need_cfa_note_p)
-	aarch64_add_cfa_expression (insn, reg, stack_pointer_rtx, sp_offset);
+
+      if (frame_related_p)
+	add_reg_note (insn, REG_CFA_OFFSET, gen_rtx_SET (cfa_mem, reg));
     }
 }
 
 /* Emit code to restore the callee registers in REGS, ignoring pop candidates
    and any other registers that are handled separately.  Write the appropriate
    REG_CFA_RESTORE notes into CFI_OPS.
@@ -9425,12 +9450,7 @@  aarch64_restore_callee_saves (poly_int64 bytes_below_sp,
 		       frame.reg_offset[regno2] - frame.reg_offset[regno]))
 	{
 	  rtx reg2 = gen_rtx_REG (mode, regno2);
-	  rtx mem2;
-
-	  offset += GET_MODE_SIZE (mode);
-	  mem2 = gen_frame_mem (mode, plus_constant (Pmode, base_rtx, offset));
-	  emit_insn (aarch64_gen_load_pair (mode, reg, mem, reg2, mem2));
-
+	  emit_insn (aarch64_gen_load_pair (reg, reg2, mem));
 	  *cfi_ops = alloc_reg_note (REG_CFA_RESTORE, reg2, *cfi_ops);
 	  regno = regno2;
 	  ++i;
@@ -9762,9 +9782,9 @@  aarch64_process_components (sbitmap components, bool prologue_p)
 			     : gen_rtx_SET (reg2, mem2);
 
       if (prologue_p)
-	insn = emit_insn (aarch64_gen_store_pair (mode, mem, reg, mem2, reg2));
+	insn = emit_insn (aarch64_gen_store_pair (mem, reg, reg2));
       else
-	insn = emit_insn (aarch64_gen_load_pair (mode, reg, mem, reg2, mem2));
+	insn = emit_insn (aarch64_gen_load_pair (reg, reg2, mem));
 
       if (frame_related_p || frame_related2_p)
 	{
@@ -10983,12 +11003,18 @@  aarch64_classify_address (struct aarch64_address_info *info,
      mode of the corresponding addressing mode is half of that.  */
   if (type == ADDR_QUERY_LDP_STP_N)
     {
-      if (known_eq (GET_MODE_SIZE (mode), 16))
+      if (known_eq (GET_MODE_SIZE (mode), 32))
+	mode = V16QImode;
+      else if (known_eq (GET_MODE_SIZE (mode), 16))
 	mode = DFmode;
       else if (known_eq (GET_MODE_SIZE (mode), 8))
 	mode = SFmode;
       else
 	return false;
+
+      /* This isn't really an Advanced SIMD struct mode, but a mode
+	 used to represent the complete mem in a load/store pair.  */
+      advsimd_struct_p = false;
     }
 
   bool allow_reg_index_p = (!load_store_pair_p
@@ -12609,7 +12635,8 @@  aarch64_print_operand (FILE *f, rtx x, int code)
 	if (!MEM_P (x)
 	    || (code == 'y'
 		&& maybe_ne (GET_MODE_SIZE (mode), 8)
-		&& maybe_ne (GET_MODE_SIZE (mode), 16)))
+		&& maybe_ne (GET_MODE_SIZE (mode), 16)
+		&& maybe_ne (GET_MODE_SIZE (mode), 32)))
 	  {
 	    output_operand_lossage ("invalid operand for '%%%c'", code);
 	    return;
@@ -25431,10 +25458,8 @@  aarch64_copy_one_block_and_progress_pointers (rtx *src, rtx *dst,
       *src = adjust_address (*src, mode, 0);
       *dst = adjust_address (*dst, mode, 0);
       /* Emit the memcpy.  */
-      emit_insn (aarch64_gen_load_pair (mode, reg1, *src, reg2,
-					aarch64_progress_pointer (*src)));
-      emit_insn (aarch64_gen_store_pair (mode, *dst, reg1,
-					 aarch64_progress_pointer (*dst), reg2));
+      emit_insn (aarch64_gen_load_pair (reg1, reg2, *src));
+      emit_insn (aarch64_gen_store_pair (*dst, reg1, reg2));
       /* Move the pointers forward.  */
       *src = aarch64_move_pointer (*src, 32);
       *dst = aarch64_move_pointer (*dst, 32);
@@ -25613,8 +25638,7 @@  aarch64_set_one_block_and_progress_pointer (rtx src, rtx *dst,
       /* "Cast" the *dst to the correct mode.  */
       *dst = adjust_address (*dst, mode, 0);
       /* Emit the memset.  */
-      emit_insn (aarch64_gen_store_pair (mode, *dst, src,
-					 aarch64_progress_pointer (*dst), src));
+      emit_insn (aarch64_gen_store_pair (*dst, src, src));
 
       /* Move the pointers forward.  */
       *dst = aarch64_move_pointer (*dst, 32);
@@ -26812,6 +26836,22 @@  aarch64_swap_ldrstr_operands (rtx* operands, bool load)
     }
 }
 
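+/* Helper for the load/store pair peepholes: canonicalize the operand order
+   with aarch64_swap_ldrstr_operands, then emit the combined access: a load
+   pair if LOAD_P, otherwise a store pair.  CODE gives the extension applied
+   by an extending load pair; it must be UNKNOWN for stores.  */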
+void
+aarch64_finish_ldpstp_peephole (rtx *operands, bool load_p, enum rtx_code code)
+{
+  aarch64_swap_ldrstr_operands (operands, load_p);
+
+  if (load_p)
+    emit_insn (aarch64_gen_load_pair (operands[0], operands[2],
+				      operands[1], code));
+  else
+    {
+      gcc_assert (code == UNKNOWN);
+      emit_insn (aarch64_gen_store_pair (operands[0], operands[1],
+					 operands[3]));
+    }
+}
+
 /* Taking X and Y to be HOST_WIDE_INT pointers, return the result of a
    comparison between the two.  */
 int
@@ -26993,8 +27033,8 @@  bool
 aarch64_gen_adjusted_ldpstp (rtx *operands, bool load,
 			     machine_mode mode, RTX_CODE code)
 {
-  rtx base, offset_1, offset_3, t1, t2;
-  rtx mem_1, mem_2, mem_3, mem_4;
+  rtx base, offset_1, offset_3;
+  rtx mem_1, mem_2;
   rtx temp_operands[8];
   HOST_WIDE_INT off_val_1, off_val_3, base_off, new_off_1, new_off_3,
 		stp_off_upper_limit, stp_off_lower_limit, msize;
@@ -27019,21 +27059,17 @@  aarch64_gen_adjusted_ldpstp (rtx *operands, bool load,
   if (load)
     {
       mem_1 = copy_rtx (temp_operands[1]);
-      mem_2 = copy_rtx (temp_operands[3]);
-      mem_3 = copy_rtx (temp_operands[5]);
-      mem_4 = copy_rtx (temp_operands[7]);
+      mem_2 = copy_rtx (temp_operands[5]);
     }
   else
     {
       mem_1 = copy_rtx (temp_operands[0]);
-      mem_2 = copy_rtx (temp_operands[2]);
-      mem_3 = copy_rtx (temp_operands[4]);
-      mem_4 = copy_rtx (temp_operands[6]);
+      mem_2 = copy_rtx (temp_operands[4]);
       gcc_assert (code == UNKNOWN);
     }
 
   extract_base_offset_in_addr (mem_1, &base, &offset_1);
-  extract_base_offset_in_addr (mem_3, &base, &offset_3);
+  extract_base_offset_in_addr (mem_2, &base, &offset_3);
   gcc_assert (base != NULL_RTX && offset_1 != NULL_RTX
 	      && offset_3 != NULL_RTX);
 
@@ -27097,63 +27133,48 @@  aarch64_gen_adjusted_ldpstp (rtx *operands, bool load,
   replace_equiv_address_nv (mem_1, plus_constant (Pmode, operands[8],
 						  new_off_1), true);
   replace_equiv_address_nv (mem_2, plus_constant (Pmode, operands[8],
-						  new_off_1 + msize), true);
-  replace_equiv_address_nv (mem_3, plus_constant (Pmode, operands[8],
 						  new_off_3), true);
-  replace_equiv_address_nv (mem_4, plus_constant (Pmode, operands[8],
-						  new_off_3 + msize), true);
 
   if (!aarch64_mem_pair_operand (mem_1, mode)
-      || !aarch64_mem_pair_operand (mem_3, mode))
+      || !aarch64_mem_pair_operand (mem_2, mode))
     return false;
 
-  if (code == ZERO_EXTEND)
-    {
-      mem_1 = gen_rtx_ZERO_EXTEND (DImode, mem_1);
-      mem_2 = gen_rtx_ZERO_EXTEND (DImode, mem_2);
-      mem_3 = gen_rtx_ZERO_EXTEND (DImode, mem_3);
-      mem_4 = gen_rtx_ZERO_EXTEND (DImode, mem_4);
-    }
-  else if (code == SIGN_EXTEND)
-    {
-      mem_1 = gen_rtx_SIGN_EXTEND (DImode, mem_1);
-      mem_2 = gen_rtx_SIGN_EXTEND (DImode, mem_2);
-      mem_3 = gen_rtx_SIGN_EXTEND (DImode, mem_3);
-      mem_4 = gen_rtx_SIGN_EXTEND (DImode, mem_4);
-    }
-
   if (load)
     {
       operands[0] = temp_operands[0];
       operands[1] = mem_1;
       operands[2] = temp_operands[2];
-      operands[3] = mem_2;
       operands[4] = temp_operands[4];
-      operands[5] = mem_3;
+      operands[5] = mem_2;
       operands[6] = temp_operands[6];
-      operands[7] = mem_4;
     }
   else
     {
       operands[0] = mem_1;
       operands[1] = temp_operands[1];
-      operands[2] = mem_2;
       operands[3] = temp_operands[3];
-      operands[4] = mem_3;
+      operands[4] = mem_2;
       operands[5] = temp_operands[5];
-      operands[6] = mem_4;
       operands[7] = temp_operands[7];
     }
 
   /* Emit adjusting instruction.  */
   emit_insn (gen_rtx_SET (operands[8], plus_constant (DImode, base, base_off)));
   /* Emit ldp/stp instructions.  */
-  t1 = gen_rtx_SET (operands[0], operands[1]);
-  t2 = gen_rtx_SET (operands[2], operands[3]);
-  emit_insn (gen_rtx_PARALLEL (VOIDmode, gen_rtvec (2, t1, t2)));
-  t1 = gen_rtx_SET (operands[4], operands[5]);
-  t2 = gen_rtx_SET (operands[6], operands[7]);
-  emit_insn (gen_rtx_PARALLEL (VOIDmode, gen_rtvec (2, t1, t2)));
+  if (load)
+    {
+      emit_insn (aarch64_gen_load_pair (operands[0], operands[2],
+					operands[1], code));
+      emit_insn (aarch64_gen_load_pair (operands[4], operands[6],
+					operands[5], code));
+    }
+  else
+    {
+      emit_insn (aarch64_gen_store_pair (operands[0], operands[1],
+					 operands[3]));
+      emit_insn (aarch64_gen_store_pair (operands[4], operands[5],
+					 operands[7]));
+    }
   return true;
 }
 
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index c92a51690c5..ffb6b0ba749 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -175,6 +175,9 @@  (define_c_enum "unspec" [
     UNSPEC_GOTSMALLTLS
     UNSPEC_GOTTINYPIC
     UNSPEC_GOTTINYTLS
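+    ;; Unspecs for the non-writeback load/store pair patterns below:
+    ;; UNSPEC_STP wraps the two registers of a store pair, while
+    ;; UNSPEC_LDP_FST and UNSPEC_LDP_SND extract the first and second
+    ;; values of a load pair from the single pair mem.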
+    UNSPEC_STP
+    UNSPEC_LDP_FST
+    UNSPEC_LDP_SND
     UNSPEC_LD1
     UNSPEC_LD2
     UNSPEC_LD2_DREG
@@ -453,6 +456,11 @@  (define_attr "predicated" "yes,no" (const_string "no"))
 ;; may chose to hold the tracking state encoded in SP.
 (define_attr "speculation_barrier" "true,false" (const_string "false"))
 
+;; Attribute used to identify load pair and store pair instructions.
+;; Currently the attribute is only applied to the non-writeback ldp/stp
+;; patterns.
+(define_attr "ldpstp" "ldp,stp,none" (const_string "none"))
+
 ;; -------------------------------------------------------------------
 ;; Pipeline descriptions and scheduling
 ;; -------------------------------------------------------------------
@@ -1735,100 +1743,62 @@  (define_expand "setmemdi"
   FAIL;
 })
 
-;; Operands 1 and 3 are tied together by the final condition; so we allow
-;; fairly lax checking on the second memory operation.
-(define_insn "load_pair_sw_<SX:mode><SX2:mode>"
-  [(set (match_operand:SX 0 "register_operand")
-	(match_operand:SX 1 "aarch64_mem_pair_operand"))
-   (set (match_operand:SX2 2 "register_operand")
-	(match_operand:SX2 3 "memory_operand"))]
-   "rtx_equal_p (XEXP (operands[3], 0),
-		 plus_constant (Pmode,
-				XEXP (operands[1], 0),
-				GET_MODE_SIZE (<SX:MODE>mode)))"
-  {@ [ cons: =0 , 1   , =2 , 3 ; attrs: type     , arch ]
-     [ r        , Ump , r  , m ; load_8          , *    ] ldp\t%w0, %w2, %z1
-     [ w        , Ump , w  , m ; neon_load1_2reg , fp   ] ldp\t%s0, %s2, %z1
-  }
-)
-
-;; Storing different modes that can still be merged
-(define_insn "load_pair_dw_<DX:mode><DX2:mode>"
-  [(set (match_operand:DX 0 "register_operand")
-	(match_operand:DX 1 "aarch64_mem_pair_operand"))
-   (set (match_operand:DX2 2 "register_operand")
-	(match_operand:DX2 3 "memory_operand"))]
-   "rtx_equal_p (XEXP (operands[3], 0),
-		 plus_constant (Pmode,
-				XEXP (operands[1], 0),
-				GET_MODE_SIZE (<DX:MODE>mode)))"
-  {@ [ cons: =0 , 1   , =2 , 3 ; attrs: type     , arch ]
-     [ r        , Ump , r  , m ; load_16         , *    ] ldp\t%x0, %x2, %z1
-     [ w        , Ump , w  , m ; neon_load1_2reg , fp   ] ldp\t%d0, %d2, %z1
-  }
-)
-
-(define_insn "load_pair_dw_<TX:mode><TX2:mode>"
-  [(set (match_operand:TX 0 "register_operand" "=w")
-	(match_operand:TX 1 "aarch64_mem_pair_operand" "Ump"))
-   (set (match_operand:TX2 2 "register_operand" "=w")
-	(match_operand:TX2 3 "memory_operand" "m"))]
-   "TARGET_SIMD
-    && rtx_equal_p (XEXP (operands[3], 0),
-		    plus_constant (Pmode,
-				   XEXP (operands[1], 0),
-				   GET_MODE_SIZE (<TX:MODE>mode)))"
-  "ldp\\t%q0, %q2, %z1"
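+;; The non-writeback load pair patterns use a single mem operand covering
+;; both transfers, with lane unspecs extracting the individual values.
+;; For example, a DImode load pair takes the form:
+;;
+;;   [(set (reg:DI x0) (unspec:DI [(mem:V2x8QI ...)] UNSPEC_LDP_FST))
+;;    (set (reg:DI x1) (unspec:DI [(mem:V2x8QI ...)] UNSPEC_LDP_SND))]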
+(define_insn "*load_pair_<ldst_sz>"
+  [(set (match_operand:GPI 0 "aarch64_ldp_reg_operand")
+	(unspec [
+	  (match_operand:<VPAIR> 1 "aarch64_mem_pair_lanes_operand")
+	] UNSPEC_LDP_FST))
+   (set (match_operand:GPI 2 "aarch64_ldp_reg_operand")
+	(unspec [
+	  (match_dup 1)
+	] UNSPEC_LDP_SND))]
+  ""
+  {@ [cons: =0, 1,   =2; attrs: type,	   arch]
+     [	     r, Umn,  r; load_<ldpstp_sz>, *   ] ldp\t%<w>0, %<w>2, %y1
+     [	     w, Umn,  w; neon_load1_2reg,  fp  ] ldp\t%<v>0, %<v>2, %y1
+  }
+  [(set_attr "ldpstp" "ldp")]
+)
+
+(define_insn "*load_pair_16"
+  [(set (match_operand:TI 0 "aarch64_ldp_reg_operand" "=w")
+	(unspec [
+	  (match_operand:V2x16QI 1 "aarch64_mem_pair_lanes_operand" "Umn")
+	] UNSPEC_LDP_FST))
+   (set (match_operand:TI 2 "aarch64_ldp_reg_operand" "=w")
+	(unspec [
+	  (match_dup 1)
+	] UNSPEC_LDP_SND))]
+  "TARGET_FLOAT"
+  "ldp\\t%q0, %q2, %y1"
   [(set_attr "type" "neon_ldp_q")
-   (set_attr "fp" "yes")]
-)
-
-;; Operands 0 and 2 are tied together by the final condition; so we allow
-;; fairly lax checking on the second memory operation.
-(define_insn "store_pair_sw_<SX:mode><SX2:mode>"
-  [(set (match_operand:SX 0 "aarch64_mem_pair_operand")
-	(match_operand:SX 1 "aarch64_reg_zero_or_fp_zero"))
-   (set (match_operand:SX2 2 "memory_operand")
-	(match_operand:SX2 3 "aarch64_reg_zero_or_fp_zero"))]
-   "rtx_equal_p (XEXP (operands[2], 0),
-		 plus_constant (Pmode,
-				XEXP (operands[0], 0),
-				GET_MODE_SIZE (<SX:MODE>mode)))"
-  {@ [ cons: =0 , 1   , =2 , 3   ; attrs: type      , arch ]
-     [ Ump      , rYZ , m  , rYZ ; store_8          , *    ] stp\t%w1, %w3, %z0
-     [ Ump      , w   , m  , w   ; neon_store1_2reg , fp   ] stp\t%s1, %s3, %z0
-  }
-)
-
-;; Storing different modes that can still be merged
-(define_insn "store_pair_dw_<DX:mode><DX2:mode>"
-  [(set (match_operand:DX 0 "aarch64_mem_pair_operand")
-	(match_operand:DX 1 "aarch64_reg_zero_or_fp_zero"))
-   (set (match_operand:DX2 2 "memory_operand")
-	(match_operand:DX2 3 "aarch64_reg_zero_or_fp_zero"))]
-   "rtx_equal_p (XEXP (operands[2], 0),
-		 plus_constant (Pmode,
-				XEXP (operands[0], 0),
-				GET_MODE_SIZE (<DX:MODE>mode)))"
-  {@ [ cons: =0 , 1   , =2 , 3   ; attrs: type      , arch ]
-     [ Ump      , rYZ , m  , rYZ ; store_16         , *    ] stp\t%x1, %x3, %z0
-     [ Ump      , w   , m  , w   ; neon_store1_2reg , fp   ] stp\t%d1, %d3, %z0
-  }
-)
-
-(define_insn "store_pair_dw_<TX:mode><TX2:mode>"
-  [(set (match_operand:TX 0 "aarch64_mem_pair_operand" "=Ump")
-	(match_operand:TX 1 "register_operand" "w"))
-   (set (match_operand:TX2 2 "memory_operand" "=m")
-	(match_operand:TX2 3 "register_operand" "w"))]
-   "TARGET_SIMD &&
-    rtx_equal_p (XEXP (operands[2], 0),
-		 plus_constant (Pmode,
-				XEXP (operands[0], 0),
-				GET_MODE_SIZE (TFmode)))"
-  "stp\\t%q1, %q3, %z0"
+   (set_attr "fp" "yes")
+   (set_attr "ldpstp" "ldp")]
+)
+
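+;; Store pairs instead set the whole pair mem to an unspec wrapping the two
+;; register operands.  For example, a DImode store pair takes the form:
+;;
+;;   (set (mem:V2x8QI ...)
+;;        (unspec:V2x8QI [(reg:DI x0) (reg:DI x1)] UNSPEC_STP))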
+(define_insn "*store_pair_<ldst_sz>"
+  [(set (match_operand:<VPAIR> 0 "aarch64_mem_pair_lanes_operand")
+	(unspec:<VPAIR>
+	  [(match_operand:GPI 1 "aarch64_stp_reg_operand")
+	   (match_operand:GPI 2 "aarch64_stp_reg_operand")] UNSPEC_STP))]
+  ""
+  {@ [cons:  =0,   1,   2; attrs: type      , arch]
+     [	    Umn, rYZ, rYZ; store_<ldpstp_sz>, *   ] stp\t%<w>1, %<w>2, %y0
+     [	    Umn,   w,   w; neon_store1_2reg , fp  ] stp\t%<v>1, %<v>2, %y0
+  }
+  [(set_attr "ldpstp" "stp")]
+)
+
+(define_insn "*store_pair_16"
+  [(set (match_operand:V2x16QI 0 "aarch64_mem_pair_lanes_operand" "=Umn")
+	(unspec:V2x16QI
+	  [(match_operand:TI 1 "aarch64_ldp_reg_operand" "w")
+	   (match_operand:TI 2 "aarch64_ldp_reg_operand" "w")] UNSPEC_STP))]
+  "TARGET_FLOAT"
+  "stp\t%q1, %q2, %y0"
   [(set_attr "type" "neon_stp_q")
-   (set_attr "fp" "yes")]
+   (set_attr "fp" "yes")
+   (set_attr "ldpstp" "stp")]
 )
 
 ;; Writeback load/store pair patterns.
@@ -2074,14 +2044,15 @@  (define_insn "*extendsidi2_aarch64"
 
 (define_insn "*load_pair_extendsidi2_aarch64"
   [(set (match_operand:DI 0 "register_operand" "=r")
-	(sign_extend:DI (match_operand:SI 1 "aarch64_mem_pair_operand" "Ump")))
+	(sign_extend:DI (unspec:SI [
+	  (match_operand:V2x4QI 1 "aarch64_mem_pair_lanes_operand" "Umn")
+	] UNSPEC_LDP_FST)))
    (set (match_operand:DI 2 "register_operand" "=r")
-	(sign_extend:DI (match_operand:SI 3 "memory_operand" "m")))]
-  "rtx_equal_p (XEXP (operands[3], 0),
-		plus_constant (Pmode,
-			       XEXP (operands[1], 0),
-			       GET_MODE_SIZE (SImode)))"
-  "ldpsw\\t%0, %2, %z1"
+	(sign_extend:DI (unspec:SI [
+	  (match_dup 1)
+	] UNSPEC_LDP_SND)))]
+  ""
+  "ldpsw\\t%0, %2, %y1"
   [(set_attr "type" "load_8")]
 )
 
@@ -2101,16 +2072,17 @@  (define_insn "*zero_extendsidi2_aarch64"
 
 (define_insn "*load_pair_zero_extendsidi2_aarch64"
   [(set (match_operand:DI 0 "register_operand")
-	(zero_extend:DI (match_operand:SI 1 "aarch64_mem_pair_operand")))
+	(zero_extend:DI (unspec:SI [
+	  (match_operand:V2x4QI 1 "aarch64_mem_pair_lanes_operand")
+	] UNSPEC_LDP_FST)))
    (set (match_operand:DI 2 "register_operand")
-	(zero_extend:DI (match_operand:SI 3 "memory_operand")))]
-  "rtx_equal_p (XEXP (operands[3], 0),
-		plus_constant (Pmode,
-			       XEXP (operands[1], 0),
-			       GET_MODE_SIZE (SImode)))"
-  {@ [ cons: =0 , 1   , =2 , 3 ; attrs: type     , arch ]
-     [ r        , Ump , r  , m ; load_8          , *    ] ldp\t%w0, %w2, %z1
-     [ w        , Ump , w  , m ; neon_load1_2reg , fp   ] ldp\t%s0, %s2, %z1
+	(zero_extend:DI (unspec:SI [
+	  (match_dup 1)
+	] UNSPEC_LDP_SND)))]
+  ""
+  {@ [ cons: =0 , 1   , =2; attrs: type    , arch]
+     [ r	, Umn , r ; load_8	   , *   ] ldp\t%w0, %w2, %y1
+     [ w	, Umn , w ; neon_load1_2reg, fp  ] ldp\t%s0, %s2, %y1
   }
 )
 
diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md
index a920de99ffc..fd8dd6db349 100644
--- a/gcc/config/aarch64/iterators.md
+++ b/gcc/config/aarch64/iterators.md
@@ -1435,6 +1435,9 @@  (define_mode_attr VDBL [(V8QI "V16QI") (V4HI "V8HI")
 			(SI   "V2SI")  (SF   "V2SF")
 			(DI   "V2DI")  (DF   "V2DF")])
 
+;; The pair mode to use for a given load/store transfer mode.
+(define_mode_attr VPAIR [(SI "V2x4QI") (DI "V2x8QI")])
+
 ;; Register suffix for double-length mode.
 (define_mode_attr Vdtype [(V4HF "8h") (V2SF "4s")])
 
diff --git a/gcc/config/aarch64/predicates.md b/gcc/config/aarch64/predicates.md
index b647e5af7c6..80f2e03d8de 100644
--- a/gcc/config/aarch64/predicates.md
+++ b/gcc/config/aarch64/predicates.md
@@ -266,10 +266,12 @@  (define_special_predicate "aarch64_mem_pair_operator"
       (match_test "known_eq (GET_MODE_SIZE (mode),
 			     GET_MODE_SIZE (GET_MODE (op)))"))))
 
-(define_predicate "aarch64_mem_pair_operand"
-  (and (match_code "mem")
-       (match_test "aarch64_legitimate_address_p (mode, XEXP (op, 0), false,
-						  ADDR_QUERY_LDP_STP)")))
+;; Like aarch64_mem_pair_operator, but additionally check that the
+;; address is suitable.
+(define_special_predicate "aarch64_mem_pair_operand"
+  (and (match_operand 0 "aarch64_mem_pair_operator")
+       (match_test "aarch64_legitimate_address_p (GET_MODE (op), XEXP (op, 0),
+						  false, ADDR_QUERY_LDP_STP)")))
 
 (define_predicate "pmode_plus_operator"
   (and (match_code "plus")