diff mbox series

[V2] introduce light expander sra

Message ID 20231027015036.3868319-1-guojiufu@linux.ibm.com
State New
Headers show
Series [V2] introduce light expander sra | expand

Commit Message

Jiufu Guo Oct. 27, 2023, 1:50 a.m. UTC
Hi,

Compare with previous version:
https://gcc.gnu.org/pipermail/gcc-patches/2023-October/632399.html
This verion supports TI/VEC mode of the access.

There are a few PRs (meta-bug PR101926) on various targets.
The root causes of them are similar: the aggeragte param/
returns are passed by multi-registers, but they are stored
to stack from registers first; and then, access the 
parameter through stack slot.

A general idea to enhance this: accessing the aggregate
parameters/returns directly through incoming/outgoing
scalar registers.  This idea would be a kind of SRA.

This experimental patch for light-expander-sra contains
below parts:

a. Check if the parameters/returns are ok/profitable to
   scalarize, and set the incoming/outgoing registers(
   pseudos) for the parameter/return.
  - This is done in "expand_function_start", after the
    incoming/outgoing hard registers are determined for the
    paramter/return.
    The scalarized registers are recorded in DECL_RTL for
    the parameter/return in parallel form.
  - At the time when setting DECL_RTL, "scalarizable_aggregate"
    is called to check the accesses are ok/profitable to
    scalarize.
    We can continue to enhance this function, to support
    more cases.  For example:
    - 'reverse storage order'.
    - 'writing to parameter'/'overlap accesses'.

b. When expanding the accesses of the parameters/returns,
   according to the info of the access(e.g. bitpos,bitsize,
   mode), the scalar(pseudos) can be figured out to expand
   the access.  This may happen when expand below accesses:
  - The component access of a parameter: "_1 = arg.f1".
    Or whole parameter access: rhs of "_2 = arg"
  - The assignment to a return val:
    "D.xx = yy; or D.xx.f = zz" where D.xx occurs on return
    stmt.
  - This is mainly done in expr.cc(expand_expr_real_1, and
    expand_assignment).  Function "extract_sub_member" is
    used to figure out the scalar rtxs(pseudos).

Besides the above two parts, some work are done in the GIMPLE
tree:  collect sra candidates for parameters/returns, and
collect the SRA access info.
This is mainly done at the beginning of the expander pass.
Below are two major items of this part.
 - Collect light-expand-sra candidates.
  Each parameter is checked if it has the proper aggregate
  type.  Collect return val (VAR_P) on each return stmts if
  the function is returning via registers.  
  This is implemented in expand_sra::collect_sra_candidates. 

 - Build/collect/manage all the access on the candidates.
  The function "scan_function" is used to do this work, it
  goes through all basicblocks, and all interesting stmts (
  phi, return, assign, call, asm) are checked.
  If there is an interesting expression (e.g. COMPONENT_REF
  or PARM_DECL), then record the required info for the access
  (e.g. pos, size, type, base).
  And if it is risky to do SRA, the candidates may be removed.
  e.g. address-taken and accessed via memory.
  "foo(struct S arg) {bar (&arg);}"

This patch is tested on ppc64{,le} and x86_64.
Is this ok for trunk?

BR,
Jeff (Jiufu Guo)

	PR target/65421

gcc/ChangeLog:

	* cfgexpand.cc (struct access): New class.
	(struct expand_sra): New class.
	(expand_sra::collect_sra_candidates): New member function.
	(expand_sra::add_sra_candidate): Likewise.
	(expand_sra::build_access): Likewise.
	(expand_sra::analyze_phi): Likewise.
	(expand_sra::analyze_assign): Likewise.
	(expand_sra::visit_base): Likewise.
	(expand_sra::protect_mem_access_in_stmt): Likewise.
	(expand_sra::expand_sra):  Class constructor.
	(expand_sra::~expand_sra): Class destructor.
	(expand_sra::scalarizable_access):  New member function.
	(expand_sra::scalarizable_accesses):  Likewise.
	(scalarizable_aggregate):  New function.
	(set_scalar_rtx_for_returns):  New function.
	(expand_value_return): Updated.
	(expand_debug_expr): Updated.
	(pass_expand::execute): Updated to use expand_sra.
	* cfgexpand.h (scalarizable_aggregate): New declare.
	(set_scalar_rtx_for_returns): New declare.
	* expr.cc (expand_assignment): Updated.
	(expand_constructor): Updated.
	(query_position_in_parallel): New function.
	(extract_sub_member): New function.
	(expand_expr_real_1): Updated.
	* expr.h (query_position_in_parallel): New declare.
	* function.cc (assign_parm_setup_block): Updated.
	(assign_parms): Updated.
	(expand_function_start): Updated.
	* tree-sra.h (struct sra_base_access): New class.
	(struct sra_default_analyzer): New class.
	(scan_function): New template function.
	* var-tracking.cc (track_loc_p): Updated.

gcc/testsuite/ChangeLog:

	* g++.target/powerpc/pr102024.C: Updated
	* gcc.target/powerpc/pr108073.c: New test.
	* gcc.target/powerpc/pr65421-1.c: New test.
	* gcc.target/powerpc/pr65421-2.c: New test.

---
 gcc/cfgexpand.cc                             | 352 ++++++++++++++++++-
 gcc/cfgexpand.h                              |   2 +
 gcc/expr.cc                                  | 179 +++++++++-
 gcc/expr.h                                   |   3 +
 gcc/function.cc                              |  36 +-
 gcc/tree-sra.h                               |  76 ++++
 gcc/var-tracking.cc                          |   3 +-
 gcc/testsuite/g++.target/powerpc/pr102024.C  |   2 +-
 gcc/testsuite/gcc.target/i386/pr20020-2.c    |   5 +
 gcc/testsuite/gcc.target/powerpc/pr108073.c  |  29 ++
 gcc/testsuite/gcc.target/powerpc/pr65421-1.c |   6 +
 gcc/testsuite/gcc.target/powerpc/pr65421-2.c |  32 ++
 12 files changed, 718 insertions(+), 7 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/powerpc/pr108073.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/pr65421-1.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/pr65421-2.c

Comments

Jiufu Guo Nov. 17, 2023, 8:21 a.m. UTC | #1
Hi,

I would like to have a ping for this patch.

There are some aspects(TODOs) that can be improved for this patch.
I had some statistics to see if these aspects occur often, by
checking the gcc source code(including test suite) and spec2017.
- Reverse storage order. This only occurs in very few tests that
  are using the attribute.
- Writing to parameters. In this kind of case only ~12 hits in the
  gcc source code. (some hits in one Go file.)
- Overlapping access to parameters. For overlapping reading, it
  is supported already. For writing overlapping reading, it is
  not very common, because writing to parameter is rare.
- Bitfields extracting from parameter mult-registers. This occurs
  twice only in gcc code (and hit in one 'go' file:h2_bundle.go),
  6 files in the test suite, and 1 hit in spec2017.

I'm thinking of enhancing the patch incrementally.
Thanks for the comments in advance.

BR,
Jeff (Jiufu Guo)

Jiufu Guo <guojiufu@linux.ibm.com> writes:

> Hi,
>
> Compare with previous version:
> https://gcc.gnu.org/pipermail/gcc-patches/2023-October/632399.html
> This verion supports TI/VEC mode of the access.
>
> There are a few PRs (meta-bug PR101926) on various targets.
> The root causes of them are similar: the aggeragte param/
> returns are passed by multi-registers, but they are stored
> to stack from registers first; and then, access the 
> parameter through stack slot.
>
> A general idea to enhance this: accessing the aggregate
> parameters/returns directly through incoming/outgoing
> scalar registers.  This idea would be a kind of SRA.
>
> This experimental patch for light-expander-sra contains
> below parts:
>
> a. Check if the parameters/returns are ok/profitable to
>    scalarize, and set the incoming/outgoing registers(
>    pseudos) for the parameter/return.
>   - This is done in "expand_function_start", after the
>     incoming/outgoing hard registers are determined for the
>     paramter/return.
>     The scalarized registers are recorded in DECL_RTL for
>     the parameter/return in parallel form.
>   - At the time when setting DECL_RTL, "scalarizable_aggregate"
>     is called to check the accesses are ok/profitable to
>     scalarize.
>     We can continue to enhance this function, to support
>     more cases.  For example:
>     - 'reverse storage order'.
>     - 'writing to parameter'/'overlap accesses'.
>
> b. When expanding the accesses of the parameters/returns,
>    according to the info of the access(e.g. bitpos,bitsize,
>    mode), the scalar(pseudos) can be figured out to expand
>    the access.  This may happen when expand below accesses:
>   - The component access of a parameter: "_1 = arg.f1".
>     Or whole parameter access: rhs of "_2 = arg"
>   - The assignment to a return val:
>     "D.xx = yy; or D.xx.f = zz" where D.xx occurs on return
>     stmt.
>   - This is mainly done in expr.cc(expand_expr_real_1, and
>     expand_assignment).  Function "extract_sub_member" is
>     used to figure out the scalar rtxs(pseudos).
>
> Besides the above two parts, some work are done in the GIMPLE
> tree:  collect sra candidates for parameters/returns, and
> collect the SRA access info.
> This is mainly done at the beginning of the expander pass.
> Below are two major items of this part.
>  - Collect light-expand-sra candidates.
>   Each parameter is checked if it has the proper aggregate
>   type.  Collect return val (VAR_P) on each return stmts if
>   the function is returning via registers.  
>   This is implemented in expand_sra::collect_sra_candidates. 
>
>  - Build/collect/manage all the access on the candidates.
>   The function "scan_function" is used to do this work, it
>   goes through all basicblocks, and all interesting stmts (
>   phi, return, assign, call, asm) are checked.
>   If there is an interesting expression (e.g. COMPONENT_REF
>   or PARM_DECL), then record the required info for the access
>   (e.g. pos, size, type, base).
>   And if it is risky to do SRA, the candidates may be removed.
>   e.g. address-taken and accessed via memory.
>   "foo(struct S arg) {bar (&arg);}"
>
> This patch is tested on ppc64{,le} and x86_64.
> Is this ok for trunk?
>
> BR,
> Jeff (Jiufu Guo)
>
> 	PR target/65421
>
> gcc/ChangeLog:
>
> 	* cfgexpand.cc (struct access): New class.
> 	(struct expand_sra): New class.
> 	(expand_sra::collect_sra_candidates): New member function.
> 	(expand_sra::add_sra_candidate): Likewise.
> 	(expand_sra::build_access): Likewise.
> 	(expand_sra::analyze_phi): Likewise.
> 	(expand_sra::analyze_assign): Likewise.
> 	(expand_sra::visit_base): Likewise.
> 	(expand_sra::protect_mem_access_in_stmt): Likewise.
> 	(expand_sra::expand_sra):  Class constructor.
> 	(expand_sra::~expand_sra): Class destructor.
> 	(expand_sra::scalarizable_access):  New member function.
> 	(expand_sra::scalarizable_accesses):  Likewise.
> 	(scalarizable_aggregate):  New function.
> 	(set_scalar_rtx_for_returns):  New function.
> 	(expand_value_return): Updated.
> 	(expand_debug_expr): Updated.
> 	(pass_expand::execute): Updated to use expand_sra.
> 	* cfgexpand.h (scalarizable_aggregate): New declare.
> 	(set_scalar_rtx_for_returns): New declare.
> 	* expr.cc (expand_assignment): Updated.
> 	(expand_constructor): Updated.
> 	(query_position_in_parallel): New function.
> 	(extract_sub_member): New function.
> 	(expand_expr_real_1): Updated.
> 	* expr.h (query_position_in_parallel): New declare.
> 	* function.cc (assign_parm_setup_block): Updated.
> 	(assign_parms): Updated.
> 	(expand_function_start): Updated.
> 	* tree-sra.h (struct sra_base_access): New class.
> 	(struct sra_default_analyzer): New class.
> 	(scan_function): New template function.
> 	* var-tracking.cc (track_loc_p): Updated.
>
> gcc/testsuite/ChangeLog:
>
> 	* g++.target/powerpc/pr102024.C: Updated
> 	* gcc.target/powerpc/pr108073.c: New test.
> 	* gcc.target/powerpc/pr65421-1.c: New test.
> 	* gcc.target/powerpc/pr65421-2.c: New test.
>
> ---
>  gcc/cfgexpand.cc                             | 352 ++++++++++++++++++-
>  gcc/cfgexpand.h                              |   2 +
>  gcc/expr.cc                                  | 179 +++++++++-
>  gcc/expr.h                                   |   3 +
>  gcc/function.cc                              |  36 +-
>  gcc/tree-sra.h                               |  76 ++++
>  gcc/var-tracking.cc                          |   3 +-
>  gcc/testsuite/g++.target/powerpc/pr102024.C  |   2 +-
>  gcc/testsuite/gcc.target/i386/pr20020-2.c    |   5 +
>  gcc/testsuite/gcc.target/powerpc/pr108073.c  |  29 ++
>  gcc/testsuite/gcc.target/powerpc/pr65421-1.c |   6 +
>  gcc/testsuite/gcc.target/powerpc/pr65421-2.c |  32 ++
>  12 files changed, 718 insertions(+), 7 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/pr108073.c
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/pr65421-1.c
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/pr65421-2.c
>
> diff --git a/gcc/cfgexpand.cc b/gcc/cfgexpand.cc
> index 4262703a138..ef99ca8ac13 100644
> --- a/gcc/cfgexpand.cc
> +++ b/gcc/cfgexpand.cc
> @@ -74,6 +74,7 @@ along with GCC; see the file COPYING3.  If not see
>  #include "output.h"
>  #include "builtins.h"
>  #include "opts.h"
> +#include "tree-sra.h"
>  
>  /* Some systems use __main in a way incompatible with its use in gcc, in these
>     cases use the macros NAME__MAIN to give a quoted symbol and SYMBOL__MAIN to
> @@ -97,6 +98,343 @@ static bool defer_stack_allocation (tree, bool);
>  
>  static void record_alignment_for_reg_var (unsigned int);
>  
> +/* For light SRA in expander about paramaters and returns.  */
> +struct access : public sra_base_access
> +{
> +};
> +
> +typedef struct access *access_p;
> +
> +struct expand_sra : public sra_default_analyzer
> +{
> +  /* Construct/destruct resources, e.g. sra candidates.  */
> +  expand_sra ();
> +  ~expand_sra ();
> +
> +  /* No actions for pre_analyze_stmt, analyze_return.  */
> +
> +  /* Overwrite phi,call,asm analyzations.  */
> +  void analyze_phi (gphi *s);
> +
> +  /* TODO: Check accesses on call/asm.  */
> +  void analyze_call (gcall *s) { protect_mem_access_in_stmt (s); };
> +  void analyze_asm (gasm *s) { protect_mem_access_in_stmt (s); };
> +
> +  /* Check access of SRA on assignment.  */
> +  void analyze_assign (gassign *);
> +
> +  /* Check if the accesses of BASE(parameter or return) are
> +     scalarizable, according to the incoming/outgoing REGS. */
> +  bool scalarizable_accesses (tree base, rtx regs);
> +
> +private:
> +  /* Collect the parameter and returns to check if they are suitable for
> +     scalarization.  */
> +  bool collect_sra_candidates (void);
> +
> +  /* Return true if VAR is added as a candidate for SRA.  */
> +  bool add_sra_candidate (tree var);
> +
> +  /* Return true if EXPR has interesting sra access, and created access,
> +     return false otherwise.  */
> +  access_p build_access (tree expr, bool write);
> +
> +  /* Check if the access ACC is scalarizable.  REGS is the incoming/outgoing
> +     registers which the access is based on. */
> +  bool scalarizable_access (access_p acc, rtx regs, bool is_parm);
> +
> +  /* If there is risk (stored/loaded or addr taken),
> +     disqualify the sra candidates in the un-interesting STMT. */
> +  void protect_mem_access_in_stmt (gimple *stmt);
> +
> +  /* Callback of walk_stmt_load_store_addr_ops, used to remove
> +     unscalarizable accesses.  */
> +  static bool visit_base (gimple *, tree op, tree, void *data);
> +
> +  /* Base (tree) -> Vector (vec<access_p> *) map.  */
> +  hash_map<tree, auto_vec<access_p> > *base_access_vec;
> +};
> +
> +bool
> +expand_sra::collect_sra_candidates (void)
> +{
> +  bool ret = false;
> +
> +  /* Collect parameters.  */
> +  for (tree parm = DECL_ARGUMENTS (current_function_decl); parm;
> +       parm = DECL_CHAIN (parm))
> +    ret |= add_sra_candidate (parm);
> +
> +  /* Collect VARs on returns.  */
> +  if (DECL_RESULT (current_function_decl))
> +    {
> +      edge_iterator ei;
> +      edge e;
> +      FOR_EACH_EDGE (e, ei, EXIT_BLOCK_PTR_FOR_FN (cfun)->preds)
> +	if (greturn *r = safe_dyn_cast<greturn *> (*gsi_last_bb (e->src)))
> +	  {
> +	    tree val = gimple_return_retval (r);
> +	    /* To sclaraized the return, the return value should be only
> +	       writen (except this return stmt).  Using 'true(write)' to
> +	       pretend the access only be 'writen'. */
> +	    if (val && VAR_P (val))
> +	      ret |= add_sra_candidate (val) && build_access (val, true);
> +	  }
> +    }
> +
> +  return ret;
> +}
> +
> +bool
> +expand_sra::add_sra_candidate (tree var)
> +{
> +  tree type = TREE_TYPE (var);
> +
> +  if (!AGGREGATE_TYPE_P (type) || !tree_fits_shwi_p (TYPE_SIZE (type))
> +      || tree_to_shwi (TYPE_SIZE (type)) == 0 || TREE_THIS_VOLATILE (var)
> +      || is_va_list_type (type))
> +    return false;
> +
> +  base_access_vec->get_or_insert (var);
> +
> +  return true;
> +}
> +
> +access_p
> +expand_sra::build_access (tree expr, bool write)
> +{
> +  enum tree_code code = TREE_CODE (expr);
> +  if (code != VAR_DECL && code != PARM_DECL && code != COMPONENT_REF
> +      && code != ARRAY_REF && code != ARRAY_RANGE_REF)
> +    return NULL;
> +
> +  HOST_WIDE_INT offset, size;
> +  bool reverse;
> +  tree base = get_ref_base_and_extent_hwi (expr, &offset, &size, &reverse);
> +  if (!base || !DECL_P (base))
> +    return NULL;
> +
> +  vec<access_p> *access_vec = base_access_vec->get (base);
> +  if (!access_vec)
> +    return NULL;
> +
> +  /* TODO: support reverse. */
> +  if (reverse || size <= 0 || offset + size > tree_to_shwi (DECL_SIZE (base)))
> +    {
> +      base_access_vec->remove (base);
> +      return NULL;
> +    }
> +
> +  struct access *access = XNEWVEC (struct access, 1);
> +
> +  memset (access, 0, sizeof (struct access));
> +  access->offset = offset;
> +  access->size = size;
> +  access->expr = expr;
> +  access->write = write;
> +  access->reverse = reverse;
> +
> +  access_vec->safe_push (access);
> +
> +  return access;
> +}
> +
> +/* Function protect_mem_access_in_stmt removes the SRA candidates if
> +   there is addr-taken on the candidate in the STMT.  */
> +
> +void
> +expand_sra::analyze_phi (gphi *stmt)
> +{
> +  if (base_access_vec && !base_access_vec->is_empty ())
> +    walk_stmt_load_store_addr_ops (stmt, this, NULL, NULL, visit_base);
> +}
> +
> +void
> +expand_sra::analyze_assign (gassign *stmt)
> +{
> +  if (!base_access_vec || base_access_vec->is_empty ())
> +    return;
> +
> +  if (gimple_assign_single_p (stmt) && !gimple_clobber_p (stmt))
> +    {
> +      tree rhs = gimple_assign_rhs1 (stmt);
> +      tree lhs = gimple_assign_lhs (stmt);
> +      bool res_r = build_access (rhs, false);
> +      bool res_l = build_access (lhs, true);
> +
> +      if (res_l || res_r)
> +	return;
> +    }
> +
> +  protect_mem_access_in_stmt (stmt);
> +}
> +
> +/* Callback of walk_stmt_load_store_addr_ops, used to remove
> +   unscalarizable accesses.  Called by protect_mem_access_in_stmt.  */
> +
> +bool
> +expand_sra::visit_base (gimple *, tree op, tree, void *data)
> +{
> +  op = get_base_address (op);
> +  if (op && DECL_P (op))
> +    {
> +      expand_sra *p = (expand_sra *) data;
> +      p->base_access_vec->remove (op);
> +    }
> +  return false;
> +}
> +
> +/* Function protect_mem_access_in_stmt removes the SRA candidates if
> +   there is store/load/addr-taken on the candidate in the STMT.
> +
> +   For some statements, which SRA does not care about, if there are
> +   possible memory operation on the SRA candidates, it would be risky
> +   to scalarize it.  */
> +
> +void
> +expand_sra::protect_mem_access_in_stmt (gimple *stmt)
> +{
> +  if (base_access_vec && !base_access_vec->is_empty ())
> +    walk_stmt_load_store_addr_ops (stmt, this, visit_base, visit_base,
> +				   visit_base);
> +}
> +
> +expand_sra::expand_sra () : base_access_vec (NULL)
> +{
> +  if (optimize <= 0)
> +    return;
> +
> +  base_access_vec = new hash_map<tree, auto_vec<access_p> >;
> +  collect_sra_candidates ();
> +}
> +
> +expand_sra::~expand_sra ()
> +{
> +  if (optimize <= 0)
> +    return;
> +
> +  delete base_access_vec;
> +}
> +
> +bool
> +expand_sra::scalarizable_access (access_p acc, rtx regs, bool is_parm)
> +{
> +  /* Now only support reading from parms
> +     or writing to returns.  */
> +  if (is_parm && acc->write)
> +    return false;
> +  if (!is_parm && !acc->write)
> +    return false;
> +
> +  /* Compute the position of the access in the parallel regs.  */
> +  int start_index = -1;
> +  int end_index = -1;
> +  HOST_WIDE_INT left_bits = 0;
> +  HOST_WIDE_INT right_bits = 0;
> +  query_position_in_parallel (acc->offset, acc->size, regs, start_index,
> +			      end_index, left_bits, right_bits);
> +
> +  /* Invalid access possition: padding or outof bound.  */
> +  if (start_index < 0 || end_index < 0)
> +    return false;
> +
> +  machine_mode expr_mode = TYPE_MODE (TREE_TYPE (acc->expr));
> +  /* Need multi-registers in a parallel for the access.  */
> +  if (expr_mode == BLKmode || end_index > start_index)
> +    {
> +      if (left_bits || right_bits)
> +	return false;
> +      if (expr_mode == BLKmode)
> +	return true;
> +
> +      /* For large modes, only support TI/VECTOR in mult-registers. */
> +      if (known_gt (acc->size, GET_MODE_BITSIZE (word_mode)))
> +	return expr_mode == TImode || VECTOR_MODE_P (expr_mode);
> +      return true;
> +    }
> +
> +  gcc_assert (end_index == start_index);
> +
> +  /* Just need one reg for the access.  */
> +  if (left_bits == 0 && right_bits == 0)
> +    return true;
> +
> +  scalar_int_mode imode;
> +  /* Need to extract bits from the reg for the access.  */
> +  return !acc->write && int_mode_for_mode (expr_mode).exists (&imode);
> +}
> +
> +/* Now, the base (parm/return) is scalarizable, only if all
> +   accesses of the BASE are scalariable.
> +
> +   This function need to be updated, to support more complicate
> +   cases, like:
> +   - Some access are scalarizable, but some are not.
> +   - Access is writing to a parameter.
> +   - Writing accesses are overlap with multi-accesses.   */
> +
> +bool
> +expand_sra::scalarizable_accesses (tree base, rtx regs)
> +{
> +  if (!base_access_vec)
> +    return false;
> +  vec<access_p> *access_vec = base_access_vec->get (base);
> +  if (!access_vec)
> +    return false;
> +  if (access_vec->is_empty ())
> +    return false;
> +
> +  bool is_parm = TREE_CODE (base) == PARM_DECL;
> +  int n = access_vec->length ();
> +  int cur_access_index = 0;
> +  for (; cur_access_index < n; cur_access_index++)
> +    if (!scalarizable_access ((*access_vec)[cur_access_index], regs, is_parm))
> +      break;
> +
> +  /* It is ok if all access are scalarizable.  */
> +  if (cur_access_index == n)
> +    return true;
> +
> +  base_access_vec->remove (base);
> +  return false;
> +}
> +
> +static expand_sra *current_sra = NULL;
> +
> +/* Check if the PARAM (or return) is scalarizable.
> +
> +   This interface is used in expand_function_start
> +   to check sra possiblity for parmeters. */
> +
> +bool
> +scalarizable_aggregate (tree parm, rtx regs)
> +{
> +  if (!current_sra)
> +    return false;
> +  return current_sra->scalarizable_accesses (parm, regs);
> +}
> +
> +/* Check if interesting returns, and if they are scalarizable,
> +   set DECL_RTL as scalar registers.
> +
> +   This interface is used in expand_function_start
> +   when outgoing registers are determinded for DECL_RESULT.  */
> +
> +void
> +set_scalar_rtx_for_returns ()
> +{
> +  rtx res = DECL_RTL (DECL_RESULT (current_function_decl));
> +  edge_iterator ei;
> +  edge e;
> +  FOR_EACH_EDGE (e, ei, EXIT_BLOCK_PTR_FOR_FN (cfun)->preds)
> +    if (greturn *r = safe_dyn_cast<greturn *> (*gsi_last_bb (e->src)))
> +      {
> +	tree val = gimple_return_retval (r);
> +	if (val && VAR_P (val) && scalarizable_aggregate (val, res))
> +	  SET_DECL_RTL (val, res);
> +      }
> +}
> +
>  /* Return an expression tree corresponding to the RHS of GIMPLE
>     statement STMT.  */
>  
> @@ -3707,7 +4045,7 @@ expand_value_return (rtx val)
>  
>    tree decl = DECL_RESULT (current_function_decl);
>    rtx return_reg = DECL_RTL (decl);
> -  if (return_reg != val)
> +  if (!rtx_equal_p (return_reg, val))
>      {
>        tree funtype = TREE_TYPE (current_function_decl);
>        tree type = TREE_TYPE (decl);
> @@ -4423,6 +4761,12 @@ expand_debug_expr (tree exp)
>    addr_space_t as;
>    scalar_int_mode op0_mode, op1_mode, addr_mode;
>  
> +  /* TODO: Enable to debug expand-sra optimized parm/returns.  */
> +  tree base = get_base_address (exp);
> +  if ((TREE_CODE (base) == PARM_DECL || (VAR_P (base) && DECL_RTL_SET_P (base)))
> +      && GET_CODE (DECL_RTL (base)) == PARALLEL)
> +    return NULL_RTX;
> +
>    switch (TREE_CODE_CLASS (TREE_CODE (exp)))
>      {
>      case tcc_expression:
> @@ -6628,6 +6972,10 @@ pass_expand::execute (function *fun)
>    auto_bitmap forced_stack_vars;
>    discover_nonconstant_array_refs (forced_stack_vars);
>  
> +  /* Enable light-expander-sra.  */
> +  current_sra = new expand_sra;
> +  scan_function (cfun, *current_sra);
> +
>    /* Make sure all values used by the optimization passes have sane
>       defaults.  */
>    reg_renumber = 0;
> @@ -7056,6 +7404,8 @@ pass_expand::execute (function *fun)
>        loop_optimizer_finalize ();
>      }
>  
> +  delete current_sra;
> +  current_sra = NULL;
>    timevar_pop (TV_POST_EXPAND);
>  
>    return 0;
> diff --git a/gcc/cfgexpand.h b/gcc/cfgexpand.h
> index 0e551f6cfd3..3415c217708 100644
> --- a/gcc/cfgexpand.h
> +++ b/gcc/cfgexpand.h
> @@ -24,5 +24,7 @@ extern tree gimple_assign_rhs_to_tree (gimple *);
>  extern HOST_WIDE_INT estimated_stack_frame_size (struct cgraph_node *);
>  extern void set_parm_rtl (tree, rtx);
>  
> +extern bool scalarizable_aggregate (tree, rtx);
> +extern void set_scalar_rtx_for_returns ();
>  
>  #endif /* GCC_CFGEXPAND_H */
> diff --git a/gcc/expr.cc b/gcc/expr.cc
> index 763bd82c59f..5ba26e0ef52 100644
> --- a/gcc/expr.cc
> +++ b/gcc/expr.cc
> @@ -5618,7 +5618,10 @@ expand_assignment (tree to, tree from, bool nontemporal)
>       Assignment of an array element at a constant index, and assignment of
>       an array element in an unaligned packed structure field, has the same
>       problem.  Same for (partially) storing into a non-memory object.  */
> -  if (handled_component_p (to)
> +  if ((handled_component_p (to)
> +       && !(VAR_P (get_base_address (to))
> +	    && DECL_RTL_SET_P (get_base_address (to))
> +	    && GET_CODE (DECL_RTL (get_base_address (to))) == PARALLEL))
>        || (TREE_CODE (to) == MEM_REF
>  	  && (REF_REVERSE_STORAGE_ORDER (to)
>  	      || mem_ref_refers_to_non_mem_p (to)))
> @@ -8909,6 +8912,19 @@ expand_constructor (tree exp, rtx target, enum expand_modifier modifier,
>        && ! mostly_zeros_p (exp))
>      return NULL_RTX;
>  
> +  if (target && GET_CODE (target) == PARALLEL && all_zeros_p (exp))
> +    {
> +      int length = XVECLEN (target, 0);
> +      int start = XEXP (XVECEXP (target, 0, 0), 0) ? 0 : 1;
> +      for (int i = start; i < length; i++)
> +	{
> +	  rtx dst = XEXP (XVECEXP (target, 0, i), 0);
> +	  rtx zero = CONST0_RTX (GET_MODE (dst));
> +	  emit_move_insn (dst, zero);
> +	}
> +      return target;
> +    }
> +
>    /* Handle calls that pass values in multiple non-contiguous
>       locations.  The Irix 6 ABI has examples of this.  */
>    if (target == 0 || ! safe_from_p (target, exp, 1)
> @@ -10621,6 +10637,157 @@ stmt_is_replaceable_p (gimple *stmt)
>    return false;
>  }
>  
> +/* In the parallel rtx register series REGS, compute the position for given
> +   {BITPOS, BITSIZE}.
> +   START_INDEX, END_INDEX, LEFT_BITS and RIGHT_BITS are computed outputs.  */
> +
> +void
> +query_position_in_parallel (HOST_WIDE_INT bitpos, HOST_WIDE_INT bitsize,
> +			    rtx regs, int &start_index, int &end_index,
> +			    HOST_WIDE_INT &left_bits, HOST_WIDE_INT &right_bits)
> +{
> +  int cur_index = XEXP (XVECEXP (regs, 0, 0), 0) ? 0 : 1;
> +  for (; cur_index < XVECLEN (regs, 0); cur_index++)
> +    {
> +      rtx slot = XVECEXP (regs, 0, cur_index);
> +      HOST_WIDE_INT off = UINTVAL (XEXP (slot, 1)) * BITS_PER_UNIT;
> +      machine_mode mode = GET_MODE (XEXP (slot, 0));
> +      HOST_WIDE_INT size = GET_MODE_BITSIZE (mode).to_constant ();
> +      if (off <= bitpos && off + size > bitpos)
> +	{
> +	  start_index = cur_index;
> +	  left_bits = bitpos - off;
> +	}
> +      if (off + size >= bitpos + bitsize)
> +	{
> +	  end_index = cur_index;
> +	  right_bits = off + size - (bitpos + bitsize);
> +	  break;
> +	}
> +    }
> +}
> +
> +static rtx
> +extract_sub_member (rtx regs, HOST_WIDE_INT bitpos, HOST_WIDE_INT bitsize,
> +		    tree expr)
> +{
> +  int start_index = -1;
> +  int end_index = -1;
> +  HOST_WIDE_INT left_bits = 0;
> +  HOST_WIDE_INT right_bits = 0;
> +  query_position_in_parallel (bitpos, bitsize, regs, start_index, end_index,
> +			      left_bits, right_bits);
> +
> +  machine_mode expr_mode = TYPE_MODE (TREE_TYPE (expr));
> +  if (end_index > start_index || expr_mode == BLKmode)
> +    {
> +      /* TImode in multi-registers.  */
> +      if (expr_mode == TImode)
> +	{
> +	  rtx res = gen_reg_rtx (expr_mode);
> +	  HOST_WIDE_INT start;
> +	  start = UINTVAL (XEXP (XVECEXP (regs, 0, start_index), 1));
> +	  for (int index = start_index; index <= end_index; index++)
> +	    {
> +	      rtx reg = XEXP (XVECEXP (regs, 0, index), 0);
> +	      machine_mode mode = GET_MODE (reg);
> +	      HOST_WIDE_INT off;
> +	      off = UINTVAL (XEXP (XVECEXP (regs, 0, index), 1)) - start;
> +	      rtx sub = simplify_gen_subreg (mode, res, expr_mode, off);
> +	      emit_move_insn (sub, reg);
> +	    }
> +	  return res;
> +	}
> +
> +      /* Vector in multi-registers.  */
> +      if (VECTOR_MODE_P (expr_mode))
> +	{
> +	  rtvec vector = rtvec_alloc (end_index - start_index + 1);
> +	  machine_mode emode;
> +	  emode = GET_MODE (XEXP (XVECEXP (regs, 0, start_index), 0));
> +	  for (int index = start_index; index <= end_index; index++)
> +	    {
> +	      rtx reg = XEXP (XVECEXP (regs, 0, index), 0);
> +	      gcc_assert (emode == GET_MODE (reg));
> +	      RTVEC_ELT (vector, index - start_index) = reg;
> +	    }
> +	  scalar_int_mode imode;
> +	  machine_mode vmode;
> +	  int nunits = end_index - start_index + 1;
> +	  if (!(int_mode_for_mode (emode).exists (&imode)
> +		&& mode_for_vector (imode, nunits).exists (&vmode)))
> +	    gcc_unreachable ();
> +
> +	  insn_code icode;
> +	  icode = convert_optab_handler (vec_init_optab, vmode, imode);
> +	  rtx res = gen_reg_rtx (vmode);
> +	  emit_insn (GEN_FCN (icode) (res, gen_rtx_PARALLEL (vmode, vector)));
> +	  if (expr_mode == vmode)
> +	    return res;
> +	  return simplify_gen_subreg (expr_mode, res, vmode, 0);
> +	}
> +
> +      /* Need multi-registers in a parallel for the access.  */
> +      int num_words = end_index - start_index + 1;
> +      rtx *tmps = XALLOCAVEC (rtx, num_words);
> +
> +      int pos = 0;
> +      HOST_WIDE_INT start;
> +      start = UINTVAL (XEXP (XVECEXP (regs, 0, start_index), 1));
> +      /* Extract whole registers.  */
> +      for (; pos < num_words; pos++)
> +	{
> +	  int index = start_index + pos;
> +	  rtx reg = XEXP (XVECEXP (regs, 0, index), 0);
> +	  machine_mode mode = GET_MODE (reg);
> +	  HOST_WIDE_INT off;
> +	  off = UINTVAL (XEXP (XVECEXP (regs, 0, index), 1)) - start;
> +	  tmps[pos] = gen_rtx_EXPR_LIST (mode, reg, GEN_INT (off));
> +	}
> +
> +      rtx res = gen_rtx_PARALLEL (expr_mode, gen_rtvec_v (pos, tmps));
> +      return res;
> +    }
> +
> +  gcc_assert (end_index == start_index);
> +
> +  /* Just need one reg for the access.  */
> +  if (left_bits == 0 && right_bits == 0)
> +    {
> +      rtx reg = XEXP (XVECEXP (regs, 0, start_index), 0);
> +      if (GET_MODE (reg) != expr_mode)
> +	reg = gen_lowpart (expr_mode, reg);
> +      return reg;
> +    }
> +
> +  /* Need to extract bitfield part reg for the access.
> +     left_bits != 0 or right_bits != 0 */
> +  rtx reg = XEXP (XVECEXP (regs, 0, start_index), 0);
> +  bool sgn = TYPE_UNSIGNED (TREE_TYPE (expr));
> +  scalar_int_mode imode;
> +  if (!int_mode_for_mode (expr_mode).exists (&imode))
> +    {
> +      gcc_assert (false);
> +      return NULL_RTX;
> +    }
> +
> +  machine_mode mode = GET_MODE (reg);
> +  bool reverse = false;
> +  rtx bfld = extract_bit_field (reg, bitsize, left_bits, sgn, NULL_RTX, mode,
> +				imode, reverse, NULL);
> +
> +  if (GET_MODE (bfld) != imode)
> +    bfld = gen_lowpart (imode, bfld);
> +
> +  if (expr_mode == imode)
> +    return bfld;
> +
> +  /* expr_mode != imode, e.g. SF != SI.  */
> +  rtx result = gen_reg_rtx (imode);
> +  emit_move_insn (result, bfld);
> +  return gen_lowpart (expr_mode, result);
> +}
> +
>  rtx
>  expand_expr_real_1 (tree exp, rtx target, machine_mode tmode,
>  		    enum expand_modifier modifier, rtx *alt_rtl,
> @@ -11498,6 +11665,16 @@ expand_expr_real_1 (tree exp, rtx target, machine_mode tmode,
>  	  = expand_expr_real (tem, tem_target, VOIDmode, tem_modifier, NULL,
>  			      true);
>  
> +	/* It is scalarizable access on param which is passed by registers.  */
> +	if (GET_CODE (op0) == PARALLEL
> +	    && (TREE_CODE (tem) == PARM_DECL || VAR_P (tem)))
> +	  {
> +	    HOST_WIDE_INT pos, size;
> +	    size = bitsize.to_constant ();
> +	    pos = bitpos.to_constant ();
> +	    return extract_sub_member (op0, pos, size, exp);
> +	  }
> +
>  	/* If the field has a mode, we want to access it in the
>  	   field's mode, not the computed mode.
>  	   If a MEM has VOIDmode (external with incomplete type),
> diff --git a/gcc/expr.h b/gcc/expr.h
> index 2a172867fdb..8a9332aaad6 100644
> --- a/gcc/expr.h
> +++ b/gcc/expr.h
> @@ -362,5 +362,8 @@ extern rtx expr_size (tree);
>  
>  extern bool mem_ref_refers_to_non_mem_p (tree);
>  extern bool non_mem_decl_p (tree);
> +extern void query_position_in_parallel (HOST_WIDE_INT, HOST_WIDE_INT, rtx,
> +					int &, int &, HOST_WIDE_INT &,
> +					HOST_WIDE_INT &);
>  
>  #endif /* GCC_EXPR_H */
> diff --git a/gcc/function.cc b/gcc/function.cc
> index afb0b33da9e..518250b2728 100644
> --- a/gcc/function.cc
> +++ b/gcc/function.cc
> @@ -3107,8 +3107,29 @@ assign_parm_setup_block (struct assign_parm_data_all *all,
>  	  emit_move_insn (mem, entry_parm);
>  	}
>        else
> -	move_block_from_reg (REGNO (entry_parm), mem,
> -			     size_stored / UNITS_PER_WORD);
> +	{
> +	  int regno = REGNO (entry_parm);
> +	  int nregs = size_stored / UNITS_PER_WORD;
> +	  rtx *tmps = XALLOCAVEC (rtx, nregs);
> +	  machine_mode mode = word_mode;
> +	  HOST_WIDE_INT word_size = GET_MODE_SIZE (mode).to_constant ();
> +	  for (int i = 0; i < nregs; i++)
> +	    {
> +	      rtx reg = gen_rtx_REG (mode, regno + i);
> +	      rtx off = GEN_INT (word_size * i);
> +	      tmps[i] = gen_rtx_EXPR_LIST (VOIDmode, reg, off);
> +	    }
> +
> +	  rtx regs = gen_rtx_PARALLEL (BLKmode, gen_rtvec_v (nregs, tmps));
> +	  if (scalarizable_aggregate (parm, regs))
> +	    {
> +	      rtx pseudos = gen_group_rtx (regs);
> +	      emit_group_move (pseudos, regs);
> +	      stack_parm = pseudos;
> +	    }
> +	  else
> +	    move_block_from_reg (regno, mem, nregs);
> +	}
>      }
>    else if (data->stack_parm == 0 && !TYPE_EMPTY_P (data->arg.type))
>      {
> @@ -3710,7 +3731,15 @@ assign_parms (tree fndecl)
>  
>        assign_parm_adjust_stack_rtl (&data);
>  
> -      if (assign_parm_setup_block_p (&data))
> +      rtx incoming = DECL_INCOMING_RTL (parm);
> +      if (GET_CODE (incoming) == PARALLEL
> +	  && scalarizable_aggregate (parm, incoming))
> +	{
> +	  rtx pseudos = gen_group_rtx (incoming);
> +	  emit_group_move (pseudos, incoming);
> +	  set_parm_rtl (parm, pseudos);
> +	}
> +      else if (assign_parm_setup_block_p (&data))
>  	assign_parm_setup_block (&all, parm, &data);
>        else if (data.arg.pass_by_reference || use_register_for_decl (parm))
>  	assign_parm_setup_reg (&all, parm, &data);
> @@ -5128,6 +5157,7 @@ expand_function_start (tree subr)
>  	    {
>  	      gcc_assert (GET_CODE (hard_reg) == PARALLEL);
>  	      set_parm_rtl (res, gen_group_rtx (hard_reg));
> +	      set_scalar_rtx_for_returns ();
>  	    }
>  	}
>  
> diff --git a/gcc/tree-sra.h b/gcc/tree-sra.h
> index f20266c4622..df3071ccf6e 100644
> --- a/gcc/tree-sra.h
> +++ b/gcc/tree-sra.h
> @@ -19,6 +19,82 @@ You should have received a copy of the GNU General Public License
>  along with GCC; see the file COPYING3.  If not see
>  <http://www.gnu.org/licenses/>.  */
>  
> +struct sra_base_access
> +{
> +  /* Values returned by get_ref_base_and_extent, indicates the
> +     OFFSET, SIZE and BASE of the access.  */
> +  HOST_WIDE_INT offset;
> +  HOST_WIDE_INT size;
> +
> +  /* The context expression of this access.  */
> +  tree expr;
> +
> +  /* Indicates this is a write access.  */
> +  bool write : 1;
> +
> +  /* Indicates if this access is made in reverse storage order.  */
> +  bool reverse : 1;
> +};
> +
> +/* Default template for sra_scan_function.  */
> +
> +struct sra_default_analyzer
> +{
> +  /* Template analyze functions.  */
> +  void analyze_phi (gphi *){};
> +  void pre_analyze_stmt (gimple *){};
> +  void analyze_return (greturn *){};
> +  void analyze_assign (gassign *){};
> +  void analyze_call (gcall *){};
> +  void analyze_asm (gasm *){};
> +  void analyze_default_stmt (gimple *){};
> +};
> +
> +/* Scan function and look for interesting expressions.  */
> +
> +template <typename analyzer>
> +void
> +scan_function (struct function *fun, analyzer &a)
> +{
> +  basic_block bb;
> +  FOR_EACH_BB_FN (bb, fun)
> +    {
> +      for (gphi_iterator gsi = gsi_start_phis (bb); !gsi_end_p (gsi);
> +	   gsi_next (&gsi))
> +	a.analyze_phi (gsi.phi ());
> +
> +      for (gimple_stmt_iterator gsi = gsi_start_bb (bb); !gsi_end_p (gsi);
> +	   gsi_next (&gsi))
> +	{
> +	  gimple *stmt = gsi_stmt (gsi);
> +	  a.pre_analyze_stmt (stmt);
> +
> +	  switch (gimple_code (stmt))
> +	    {
> +	    case GIMPLE_RETURN:
> +	      a.analyze_return (as_a<greturn *> (stmt));
> +	      break;
> +
> +	    case GIMPLE_ASSIGN:
> +	      a.analyze_assign (as_a<gassign *> (stmt));
> +	      break;
> +
> +	    case GIMPLE_CALL:
> +	      a.analyze_call (as_a<gcall *> (stmt));
> +	      break;
> +
> +	    case GIMPLE_ASM:
> +	      a.analyze_asm (as_a<gasm *> (stmt));
> +	      break;
> +
> +	    default:
> +	      a.analyze_default_stmt (stmt);
> +	      break;
> +	    }
> +	}
> +    }
> +}
> +
>  bool type_internals_preclude_sra_p (tree type, const char **msg);
>  
>  /* Return true iff TYPE is stdarg va_list type (which early SRA and IPA-SRA
> diff --git a/gcc/var-tracking.cc b/gcc/var-tracking.cc
> index d8dafa5481a..7fc801f2612 100644
> --- a/gcc/var-tracking.cc
> +++ b/gcc/var-tracking.cc
> @@ -5352,7 +5352,8 @@ track_loc_p (rtx loc, tree expr, poly_int64 offset, bool store_reg_p,
>       because the real and imaginary parts are represented as separate
>       pseudo registers, even if the whole complex value fits into one
>       hard register.  */
> -  if ((paradoxical_subreg_p (mode, DECL_MODE (expr))
> +  if (((DECL_MODE (expr) != BLKmode
> +	&& paradoxical_subreg_p (mode, DECL_MODE (expr)))
>         || (store_reg_p
>  	   && !COMPLEX_MODE_P (DECL_MODE (expr))
>  	   && hard_regno_nregs (REGNO (loc), DECL_MODE (expr)) == 1))
> diff --git a/gcc/testsuite/g++.target/powerpc/pr102024.C b/gcc/testsuite/g++.target/powerpc/pr102024.C
> index 769585052b5..c8995cae707 100644
> --- a/gcc/testsuite/g++.target/powerpc/pr102024.C
> +++ b/gcc/testsuite/g++.target/powerpc/pr102024.C
> @@ -5,7 +5,7 @@
>  // Test that a zero-width bit field in an otherwise homogeneous aggregate
>  // generates a psabi warning and passes arguments in GPRs.
>  
> -// { dg-final { scan-assembler-times {\mstd\M} 4 } }
> +// { dg-final { scan-assembler-times {\mmtvsrd\M} 4 } }
>  
>  struct a_thing
>  {
> diff --git a/gcc/testsuite/gcc.target/i386/pr20020-2.c b/gcc/testsuite/gcc.target/i386/pr20020-2.c
> index fa8cb2528c5..723f1826630 100644
> --- a/gcc/testsuite/gcc.target/i386/pr20020-2.c
> +++ b/gcc/testsuite/gcc.target/i386/pr20020-2.c
> @@ -15,10 +15,15 @@ struct shared_ptr_struct
>  };
>  typedef struct shared_ptr_struct sptr_t;
>  
> +void foo (sptr_t *);
> +
>  void
>  copy_sptr (sptr_t *dest, sptr_t src)
>  {
>    *dest = src;
> +
> +  /* Prevent 'src' to be scalarized as registers.  */
> +  foo (&src);
>  }
>  
>  /* { dg-final { scan-rtl-dump "\\\(set \\\(reg:TI \[0-9\]*" "expand" } } */
> diff --git a/gcc/testsuite/gcc.target/powerpc/pr108073.c b/gcc/testsuite/gcc.target/powerpc/pr108073.c
> new file mode 100644
> index 00000000000..293bf93fb9a
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/pr108073.c
> @@ -0,0 +1,29 @@
> +/* { dg-do run } */
> +/* { dg-options "-O2 -save-temps" } */
> +
> +typedef struct DF {double a[4]; short s1; short s2; short s3; short s4; } DF;
> +typedef struct SF {float a[4]; int i1; int i2; } SF;
> +
> +/* { dg-final { scan-assembler-times {\mmtvsrd|mtvsrws\M} 3 {target { has_arch_ppc64 && has_arch_pwr8 } } } } */
> +/* { dg-final { scan-assembler-not {\mlwz\M} {target { has_arch_ppc64 && has_arch_pwr8 } } } } */
> +/* { dg-final { scan-assembler-not {\mlhz\M} {target { has_arch_ppc64 && has_arch_pwr8 } } } } */
> +short  __attribute__ ((noipa)) foo_hi (DF a, int flag){if (flag == 2)return a.s2+a.s3;return 0;}
> +int  __attribute__ ((noipa)) foo_si (SF a, int flag){if (flag == 2)return a.i2+a.i1;return 0;}
> +double __attribute__ ((noipa)) foo_df (DF arg, int flag){if (flag == 2)return arg.a[3];else return 0.0;}
> +float  __attribute__ ((noipa)) foo_sf (SF arg, int flag){if (flag == 2)return arg.a[2]; return 0;}
> +float  __attribute__ ((noipa)) foo_sf1 (SF arg, int flag){if (flag == 2)return arg.a[1];return 0;}
> +
> +DF gdf = {{1.0,2.0,3.0,4.0}, 1, 2, 3, 4};
> +SF gsf = {{1.0f,2.0f,3.0f,4.0f}, 1, 2};
> +
> +int main()
> +{
> +  if (!(foo_hi (gdf, 2) == 5 && foo_si (gsf, 2) == 3 && foo_df (gdf, 2) == 4.0
> +	&& foo_sf (gsf, 2) == 3.0 && foo_sf1 (gsf, 2) == 2.0))
> +    __builtin_abort ();
> +  if (!(foo_hi (gdf, 1) == 0 && foo_si (gsf, 1) == 0 && foo_df (gdf, 1) == 0
> +	&& foo_sf (gsf, 1) == 0 && foo_sf1 (gsf, 1) == 0))
> +    __builtin_abort ();
> +  return 0;
> +}
> +
> diff --git a/gcc/testsuite/gcc.target/powerpc/pr65421-1.c b/gcc/testsuite/gcc.target/powerpc/pr65421-1.c
> new file mode 100644
> index 00000000000..4e1f87f7939
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/pr65421-1.c
> @@ -0,0 +1,6 @@
> +/* PR target/65421 */
> +/* { dg-options "-O2" } */
> +
> +typedef struct LARGE {double a[4]; int arr[32];} LARGE;
> +LARGE foo (LARGE a){return a;}
> +/* { dg-final { scan-assembler-times {\mmemcpy\M} 1 } } */
> diff --git a/gcc/testsuite/gcc.target/powerpc/pr65421-2.c b/gcc/testsuite/gcc.target/powerpc/pr65421-2.c
> new file mode 100644
> index 00000000000..8a8e1a0e996
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/pr65421-2.c
> @@ -0,0 +1,32 @@
> +/* PR target/65421 */
> +/* { dg-options "-O2" } */
> +/* { dg-require-effective-target powerpc_elfv2 } */
> +/* { dg-require-effective-target has_arch_ppc64 } */
> +
> +typedef struct FLOATS
> +{
> +  double a[3];
> +} FLOATS;
> +
> +/* 3 lfd after returns also optimized */
> +/* FLOATS ret_arg_pt (FLOATS *a){return *a;} */
> +
> +/* 3 stfd */
> +void st_arg (FLOATS a, FLOATS *p) {*p = a;}
> +/* { dg-final { scan-assembler-times {\mstfd\M} 3 } } */
> +
> +/* blr */
> +FLOATS ret_arg (FLOATS a) {return a;}
> +
> +typedef struct MIX
> +{
> +  double a[2];
> +  long l;
> +} MIX;
> +
> +/* std 3 param regs to return slot */
> +MIX ret_arg1 (MIX a) {return a;}
> +/* { dg-final { scan-assembler-times {\mstd\M} 3 } } */
> +
> +/* count insns */
> +/* { dg-final { scan-assembler-times {(?n)^\s+[a-z]} 9 } } */
Richard Biener Nov. 23, 2023, 2:12 p.m. UTC | #2
On Fri, Oct 27, 2023 at 3:51 AM Jiufu Guo <guojiufu@linux.ibm.com> wrote:
>
> Hi,
>
> Compare with previous version:
> https://gcc.gnu.org/pipermail/gcc-patches/2023-October/632399.html
> This verion supports TI/VEC mode of the access.
>
> There are a few PRs (meta-bug PR101926) on various targets.
> The root causes of them are similar: the aggeragte param/
> returns are passed by multi-registers, but they are stored
> to stack from registers first; and then, access the
> parameter through stack slot.
>
> A general idea to enhance this: accessing the aggregate
> parameters/returns directly through incoming/outgoing
> scalar registers.  This idea would be a kind of SRA.
>
> This experimental patch for light-expander-sra contains
> below parts:
>
> a. Check if the parameters/returns are ok/profitable to
>    scalarize, and set the incoming/outgoing registers(
>    pseudos) for the parameter/return.
>   - This is done in "expand_function_start", after the
>     incoming/outgoing hard registers are determined for the
>     paramter/return.
>     The scalarized registers are recorded in DECL_RTL for
>     the parameter/return in parallel form.
>   - At the time when setting DECL_RTL, "scalarizable_aggregate"
>     is called to check the accesses are ok/profitable to
>     scalarize.
>     We can continue to enhance this function, to support
>     more cases.  For example:
>     - 'reverse storage order'.
>     - 'writing to parameter'/'overlap accesses'.
>
> b. When expanding the accesses of the parameters/returns,
>    according to the info of the access(e.g. bitpos,bitsize,
>    mode), the scalar(pseudos) can be figured out to expand
>    the access.  This may happen when expand below accesses:
>   - The component access of a parameter: "_1 = arg.f1".
>     Or whole parameter access: rhs of "_2 = arg"
>   - The assignment to a return val:
>     "D.xx = yy; or D.xx.f = zz" where D.xx occurs on return
>     stmt.
>   - This is mainly done in expr.cc(expand_expr_real_1, and
>     expand_assignment).  Function "extract_sub_member" is
>     used to figure out the scalar rtxs(pseudos).
>
> Besides the above two parts, some work are done in the GIMPLE
> tree:  collect sra candidates for parameters/returns, and
> collect the SRA access info.
> This is mainly done at the beginning of the expander pass.
> Below are two major items of this part.
>  - Collect light-expand-sra candidates.
>   Each parameter is checked if it has the proper aggregate
>   type.  Collect return val (VAR_P) on each return stmts if
>   the function is returning via registers.
>   This is implemented in expand_sra::collect_sra_candidates.
>
>  - Build/collect/manage all the access on the candidates.
>   The function "scan_function" is used to do this work, it
>   goes through all basicblocks, and all interesting stmts (
>   phi, return, assign, call, asm) are checked.
>   If there is an interesting expression (e.g. COMPONENT_REF
>   or PARM_DECL), then record the required info for the access
>   (e.g. pos, size, type, base).
>   And if it is risky to do SRA, the candidates may be removed.
>   e.g. address-taken and accessed via memory.
>   "foo(struct S arg) {bar (&arg);}"
>
> This patch is tested on ppc64{,le} and x86_64.
> Is this ok for trunk?
>
> BR,
> Jeff (Jiufu Guo)
>
>         PR target/65421
>
> gcc/ChangeLog:
>
>         * cfgexpand.cc (struct access): New class.
>         (struct expand_sra): New class.
>         (expand_sra::collect_sra_candidates): New member function.
>         (expand_sra::add_sra_candidate): Likewise.
>         (expand_sra::build_access): Likewise.
>         (expand_sra::analyze_phi): Likewise.
>         (expand_sra::analyze_assign): Likewise.
>         (expand_sra::visit_base): Likewise.
>         (expand_sra::protect_mem_access_in_stmt): Likewise.
>         (expand_sra::expand_sra):  Class constructor.
>         (expand_sra::~expand_sra): Class destructor.
>         (expand_sra::scalarizable_access):  New member function.
>         (expand_sra::scalarizable_accesses):  Likewise.
>         (scalarizable_aggregate):  New function.
>         (set_scalar_rtx_for_returns):  New function.
>         (expand_value_return): Updated.
>         (expand_debug_expr): Updated.
>         (pass_expand::execute): Updated to use expand_sra.
>         * cfgexpand.h (scalarizable_aggregate): New declare.
>         (set_scalar_rtx_for_returns): New declare.
>         * expr.cc (expand_assignment): Updated.
>         (expand_constructor): Updated.
>         (query_position_in_parallel): New function.
>         (extract_sub_member): New function.
>         (expand_expr_real_1): Updated.
>         * expr.h (query_position_in_parallel): New declare.
>         * function.cc (assign_parm_setup_block): Updated.
>         (assign_parms): Updated.
>         (expand_function_start): Updated.
>         * tree-sra.h (struct sra_base_access): New class.
>         (struct sra_default_analyzer): New class.
>         (scan_function): New template function.
>         * var-tracking.cc (track_loc_p): Updated.
>
> gcc/testsuite/ChangeLog:
>
>         * g++.target/powerpc/pr102024.C: Updated
>         * gcc.target/powerpc/pr108073.c: New test.
>         * gcc.target/powerpc/pr65421-1.c: New test.
>         * gcc.target/powerpc/pr65421-2.c: New test.
>
> ---
>  gcc/cfgexpand.cc                             | 352 ++++++++++++++++++-
>  gcc/cfgexpand.h                              |   2 +
>  gcc/expr.cc                                  | 179 +++++++++-
>  gcc/expr.h                                   |   3 +
>  gcc/function.cc                              |  36 +-
>  gcc/tree-sra.h                               |  76 ++++
>  gcc/var-tracking.cc                          |   3 +-
>  gcc/testsuite/g++.target/powerpc/pr102024.C  |   2 +-
>  gcc/testsuite/gcc.target/i386/pr20020-2.c    |   5 +
>  gcc/testsuite/gcc.target/powerpc/pr108073.c  |  29 ++
>  gcc/testsuite/gcc.target/powerpc/pr65421-1.c |   6 +
>  gcc/testsuite/gcc.target/powerpc/pr65421-2.c |  32 ++
>  12 files changed, 718 insertions(+), 7 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/pr108073.c
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/pr65421-1.c
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/pr65421-2.c
>
> diff --git a/gcc/cfgexpand.cc b/gcc/cfgexpand.cc
> index 4262703a138..ef99ca8ac13 100644
> --- a/gcc/cfgexpand.cc
> +++ b/gcc/cfgexpand.cc
> @@ -74,6 +74,7 @@ along with GCC; see the file COPYING3.  If not see
>  #include "output.h"
>  #include "builtins.h"
>  #include "opts.h"
> +#include "tree-sra.h"
>
>  /* Some systems use __main in a way incompatible with its use in gcc, in these
>     cases use the macros NAME__MAIN to give a quoted symbol and SYMBOL__MAIN to
> @@ -97,6 +98,343 @@ static bool defer_stack_allocation (tree, bool);
>
>  static void record_alignment_for_reg_var (unsigned int);
>
> +/* For light SRA in expander about paramaters and returns.  */
> +struct access : public sra_base_access
> +{
> +};
> +
> +typedef struct access *access_p;
> +
> +struct expand_sra : public sra_default_analyzer
> +{
> +  /* Construct/destruct resources, e.g. sra candidates.  */
> +  expand_sra ();
> +  ~expand_sra ();
> +
> +  /* No actions for pre_analyze_stmt, analyze_return.  */
> +
> +  /* Overwrite phi,call,asm analyzations.  */
> +  void analyze_phi (gphi *s);
> +
> +  /* TODO: Check accesses on call/asm.  */
> +  void analyze_call (gcall *s) { protect_mem_access_in_stmt (s); };
> +  void analyze_asm (gasm *s) { protect_mem_access_in_stmt (s); };
> +
> +  /* Check access of SRA on assignment.  */
> +  void analyze_assign (gassign *);
> +
> +  /* Check if the accesses of BASE(parameter or return) are
> +     scalarizable, according to the incoming/outgoing REGS. */
> +  bool scalarizable_accesses (tree base, rtx regs);
> +
> +private:
> +  /* Collect the parameter and returns to check if they are suitable for
> +     scalarization.  */
> +  bool collect_sra_candidates (void);
> +
> +  /* Return true if VAR is added as a candidate for SRA.  */
> +  bool add_sra_candidate (tree var);
> +
> +  /* Return true if EXPR has interesting sra access, and created access,
> +     return false otherwise.  */
> +  access_p build_access (tree expr, bool write);
> +
> +  /* Check if the access ACC is scalarizable.  REGS is the incoming/outgoing
> +     registers which the access is based on. */
> +  bool scalarizable_access (access_p acc, rtx regs, bool is_parm);
> +
> +  /* If there is risk (stored/loaded or addr taken),
> +     disqualify the sra candidates in the un-interesting STMT. */
> +  void protect_mem_access_in_stmt (gimple *stmt);
> +
> +  /* Callback of walk_stmt_load_store_addr_ops, used to remove
> +     unscalarizable accesses.  */
> +  static bool visit_base (gimple *, tree op, tree, void *data);
> +
> +  /* Base (tree) -> Vector (vec<access_p> *) map.  */
> +  hash_map<tree, auto_vec<access_p> > *base_access_vec;
> +};
> +
> +bool
> +expand_sra::collect_sra_candidates (void)
> +{
> +  bool ret = false;
> +
> +  /* Collect parameters.  */
> +  for (tree parm = DECL_ARGUMENTS (current_function_decl); parm;
> +       parm = DECL_CHAIN (parm))
> +    ret |= add_sra_candidate (parm);
> +
> +  /* Collect VARs on returns.  */
> +  if (DECL_RESULT (current_function_decl))
> +    {
> +      edge_iterator ei;
> +      edge e;
> +      FOR_EACH_EDGE (e, ei, EXIT_BLOCK_PTR_FOR_FN (cfun)->preds)
> +       if (greturn *r = safe_dyn_cast<greturn *> (*gsi_last_bb (e->src)))
> +         {
> +           tree val = gimple_return_retval (r);
> +           /* To sclaraized the return, the return value should be only
> +              writen (except this return stmt).  Using 'true(write)' to
> +              pretend the access only be 'writen'. */
> +           if (val && VAR_P (val))
> +             ret |= add_sra_candidate (val) && build_access (val, true);
> +         }
> +    }
> +
> +  return ret;
> +}
> +
> +bool
> +expand_sra::add_sra_candidate (tree var)
> +{
> +  tree type = TREE_TYPE (var);
> +
> +  if (!AGGREGATE_TYPE_P (type) || !tree_fits_shwi_p (TYPE_SIZE (type))
> +      || tree_to_shwi (TYPE_SIZE (type)) == 0 || TREE_THIS_VOLATILE (var)
> +      || is_va_list_type (type))
> +    return false;
> +
> +  base_access_vec->get_or_insert (var);
> +
> +  return true;
> +}
> +
> +access_p
> +expand_sra::build_access (tree expr, bool write)
> +{
> +  enum tree_code code = TREE_CODE (expr);
> +  if (code != VAR_DECL && code != PARM_DECL && code != COMPONENT_REF
> +      && code != ARRAY_REF && code != ARRAY_RANGE_REF)
> +    return NULL;
> +
> +  HOST_WIDE_INT offset, size;
> +  bool reverse;
> +  tree base = get_ref_base_and_extent_hwi (expr, &offset, &size, &reverse);
> +  if (!base || !DECL_P (base))
> +    return NULL;
> +
> +  vec<access_p> *access_vec = base_access_vec->get (base);
> +  if (!access_vec)
> +    return NULL;
> +
> +  /* TODO: support reverse. */
> +  if (reverse || size <= 0 || offset + size > tree_to_shwi (DECL_SIZE (base)))
> +    {
> +      base_access_vec->remove (base);
> +      return NULL;
> +    }
> +
> +  struct access *access = XNEWVEC (struct access, 1);
> +
> +  memset (access, 0, sizeof (struct access));
> +  access->offset = offset;
> +  access->size = size;
> +  access->expr = expr;
> +  access->write = write;
> +  access->reverse = reverse;
> +
> +  access_vec->safe_push (access);
> +
> +  return access;
> +}

I still think there's SRA code we can share for producing
accesses from a stmt.

> +
> +/* Function protect_mem_access_in_stmt removes the SRA candidates if
> +   there is addr-taken on the candidate in the STMT.  */
> +
> +void
> +expand_sra::analyze_phi (gphi *stmt)
> +{
> +  if (base_access_vec && !base_access_vec->is_empty ())
> +    walk_stmt_load_store_addr_ops (stmt, this, NULL, NULL, visit_base);
> +}
> +
> +void
> +expand_sra::analyze_assign (gassign *stmt)
> +{
> +  if (!base_access_vec || base_access_vec->is_empty ())
> +    return;
> +
> +  if (gimple_assign_single_p (stmt) && !gimple_clobber_p (stmt))
> +    {
> +      tree rhs = gimple_assign_rhs1 (stmt);
> +      tree lhs = gimple_assign_lhs (stmt);
> +      bool res_r = build_access (rhs, false);
> +      bool res_l = build_access (lhs, true);
> +
> +      if (res_l || res_r)
> +       return;
> +    }
> +
> +  protect_mem_access_in_stmt (stmt);
> +}
> +
> +/* Callback of walk_stmt_load_store_addr_ops, used to remove
> +   unscalarizable accesses.  Called by protect_mem_access_in_stmt.  */
> +
> +bool
> +expand_sra::visit_base (gimple *, tree op, tree, void *data)
> +{
> +  op = get_base_address (op);
> +  if (op && DECL_P (op))
> +    {
> +      expand_sra *p = (expand_sra *) data;
> +      p->base_access_vec->remove (op);
> +    }
> +  return false;
> +}
> +
> +/* Function protect_mem_access_in_stmt removes the SRA candidates if
> +   there is store/load/addr-taken on the candidate in the STMT.
> +
> +   For some statements, which SRA does not care about, if there are
> +   possible memory operation on the SRA candidates, it would be risky
> +   to scalarize it.  */
> +
> +void
> +expand_sra::protect_mem_access_in_stmt (gimple *stmt)
> +{
> +  if (base_access_vec && !base_access_vec->is_empty ())
> +    walk_stmt_load_store_addr_ops (stmt, this, visit_base, visit_base,
> +                                  visit_base);
> +}
> +
> +expand_sra::expand_sra () : base_access_vec (NULL)
> +{
> +  if (optimize <= 0)
> +    return;
> +
> +  base_access_vec = new hash_map<tree, auto_vec<access_p> >;
> +  collect_sra_candidates ();
> +}
> +
> +expand_sra::~expand_sra ()
> +{
> +  if (optimize <= 0)
> +    return;
> +
> +  delete base_access_vec;
> +}
> +
> +bool
> +expand_sra::scalarizable_access (access_p acc, rtx regs, bool is_parm)
> +{
> +  /* Now only support reading from parms
> +     or writing to returns.  */
> +  if (is_parm && acc->write)
> +    return false;
> +  if (!is_parm && !acc->write)
> +    return false;
> +
> +  /* Compute the position of the access in the parallel regs.  */
> +  int start_index = -1;
> +  int end_index = -1;
> +  HOST_WIDE_INT left_bits = 0;
> +  HOST_WIDE_INT right_bits = 0;
> +  query_position_in_parallel (acc->offset, acc->size, regs, start_index,
> +                             end_index, left_bits, right_bits);
> +
> +  /* Invalid access possition: padding or outof bound.  */
> +  if (start_index < 0 || end_index < 0)
> +    return false;
> +
> +  machine_mode expr_mode = TYPE_MODE (TREE_TYPE (acc->expr));
> +  /* Need multi-registers in a parallel for the access.  */
> +  if (expr_mode == BLKmode || end_index > start_index)
> +    {
> +      if (left_bits || right_bits)
> +       return false;
> +      if (expr_mode == BLKmode)
> +       return true;
> +
> +      /* For large modes, only support TI/VECTOR in mult-registers. */
> +      if (known_gt (acc->size, GET_MODE_BITSIZE (word_mode)))
> +       return expr_mode == TImode || VECTOR_MODE_P (expr_mode);
> +      return true;
> +    }
> +
> +  gcc_assert (end_index == start_index);
> +
> +  /* Just need one reg for the access.  */
> +  if (left_bits == 0 && right_bits == 0)
> +    return true;
> +
> +  scalar_int_mode imode;
> +  /* Need to extract bits from the reg for the access.  */
> +  return !acc->write && int_mode_for_mode (expr_mode).exists (&imode);
> +}
> +
> +/* Now, the base (parm/return) is scalarizable, only if all
> +   accesses of the BASE are scalariable.
> +
> +   This function need to be updated, to support more complicate
> +   cases, like:
> +   - Some access are scalarizable, but some are not.
> +   - Access is writing to a parameter.
> +   - Writing accesses are overlap with multi-accesses.   */
> +
> +bool
> +expand_sra::scalarizable_accesses (tree base, rtx regs)
> +{
> +  if (!base_access_vec)
> +    return false;
> +  vec<access_p> *access_vec = base_access_vec->get (base);
> +  if (!access_vec)
> +    return false;
> +  if (access_vec->is_empty ())
> +    return false;
> +
> +  bool is_parm = TREE_CODE (base) == PARM_DECL;
> +  int n = access_vec->length ();
> +  int cur_access_index = 0;
> +  for (; cur_access_index < n; cur_access_index++)
> +    if (!scalarizable_access ((*access_vec)[cur_access_index], regs, is_parm))
> +      break;
> +
> +  /* It is ok if all access are scalarizable.  */
> +  if (cur_access_index == n)
> +    return true;
> +
> +  base_access_vec->remove (base);
> +  return false;
> +}
> +
> +static expand_sra *current_sra = NULL;
> +
> +/* Check if the PARAM (or return) is scalarizable.
> +
> +   This interface is used in expand_function_start
> +   to check sra possiblity for parmeters. */
> +
> +bool
> +scalarizable_aggregate (tree parm, rtx regs)
> +{
> +  if (!current_sra)
> +    return false;
> +  return current_sra->scalarizable_accesses (parm, regs);
> +}
> +
> +/* Check if interesting returns, and if they are scalarizable,
> +   set DECL_RTL as scalar registers.
> +
> +   This interface is used in expand_function_start
> +   when outgoing registers are determinded for DECL_RESULT.  */
> +
> +void
> +set_scalar_rtx_for_returns ()
> +{
> +  rtx res = DECL_RTL (DECL_RESULT (current_function_decl));
> +  edge_iterator ei;
> +  edge e;
> +  FOR_EACH_EDGE (e, ei, EXIT_BLOCK_PTR_FOR_FN (cfun)->preds)
> +    if (greturn *r = safe_dyn_cast<greturn *> (*gsi_last_bb (e->src)))
> +      {
> +       tree val = gimple_return_retval (r);
> +       if (val && VAR_P (val) && scalarizable_aggregate (val, res))
> +         SET_DECL_RTL (val, res);
> +      }
> +}
> +
>  /* Return an expression tree corresponding to the RHS of GIMPLE
>     statement STMT.  */
>
> @@ -3707,7 +4045,7 @@ expand_value_return (rtx val)
>
>    tree decl = DECL_RESULT (current_function_decl);
>    rtx return_reg = DECL_RTL (decl);
> -  if (return_reg != val)
> +  if (!rtx_equal_p (return_reg, val))
>      {
>        tree funtype = TREE_TYPE (current_function_decl);
>        tree type = TREE_TYPE (decl);
> @@ -4423,6 +4761,12 @@ expand_debug_expr (tree exp)
>    addr_space_t as;
>    scalar_int_mode op0_mode, op1_mode, addr_mode;
>
> +  /* TODO: Enable to debug expand-sra optimized parm/returns.  */
> +  tree base = get_base_address (exp);
> +  if ((TREE_CODE (base) == PARM_DECL || (VAR_P (base) && DECL_RTL_SET_P (base)))
> +      && GET_CODE (DECL_RTL (base)) == PARALLEL)
> +    return NULL_RTX;
> +
>    switch (TREE_CODE_CLASS (TREE_CODE (exp)))
>      {
>      case tcc_expression:
> @@ -6628,6 +6972,10 @@ pass_expand::execute (function *fun)
>    auto_bitmap forced_stack_vars;
>    discover_nonconstant_array_refs (forced_stack_vars);
>
> +  /* Enable light-expander-sra.  */
> +  current_sra = new expand_sra;
> +  scan_function (cfun, *current_sra);
> +
>    /* Make sure all values used by the optimization passes have sane
>       defaults.  */
>    reg_renumber = 0;
> @@ -7056,6 +7404,8 @@ pass_expand::execute (function *fun)
>        loop_optimizer_finalize ();
>      }
>
> +  delete current_sra;
> +  current_sra = NULL;
>    timevar_pop (TV_POST_EXPAND);
>
>    return 0;
> diff --git a/gcc/cfgexpand.h b/gcc/cfgexpand.h
> index 0e551f6cfd3..3415c217708 100644
> --- a/gcc/cfgexpand.h
> +++ b/gcc/cfgexpand.h
> @@ -24,5 +24,7 @@ extern tree gimple_assign_rhs_to_tree (gimple *);
>  extern HOST_WIDE_INT estimated_stack_frame_size (struct cgraph_node *);
>  extern void set_parm_rtl (tree, rtx);
>
> +extern bool scalarizable_aggregate (tree, rtx);
> +extern void set_scalar_rtx_for_returns ();
>
>  #endif /* GCC_CFGEXPAND_H */
> diff --git a/gcc/expr.cc b/gcc/expr.cc
> index 763bd82c59f..5ba26e0ef52 100644
> --- a/gcc/expr.cc
> +++ b/gcc/expr.cc
> @@ -5618,7 +5618,10 @@ expand_assignment (tree to, tree from, bool nontemporal)
>       Assignment of an array element at a constant index, and assignment of
>       an array element in an unaligned packed structure field, has the same
>       problem.  Same for (partially) storing into a non-memory object.  */
> -  if (handled_component_p (to)
> +  if ((handled_component_p (to)
> +       && !(VAR_P (get_base_address (to))

get_base_address is pretty expensive so please cache the result.

> +           && DECL_RTL_SET_P (get_base_address (to))
> +           && GET_CODE (DECL_RTL (get_base_address (to))) == PARALLEL))

I think that's possibly fragile in case a PARALLEL can also appear by
other means here.  I also think it's the wrong thing to map a structural
reference to a separate RTL expression, instead we should rewrite the
GIMPLE accordingly with artificial decls which would also be the way
to preserve debug info.

In principle "expander SRA" isn't different from regular SRA apart from
the fact that it handles parameter and return variables and heuristically
includes the ABI (does it?) in decomposition.

So the only important difference is that we cannot represent the
change for the incoming arguments or the outgoing regs in GIMPLE
but we should only have to special case _those_ bits, not all the
other accesses we can rewrite.

We could even use the actual SRA pass in a special mode right
before RTL expansion (or close to it) and invent a special (but ugly)
representation for the incoming/outgoing part, say by using
new internal functions present between the SRA pass and RTL
expansion:

struct X foo (struct X x)
{
  int D.1234 = .ARG_PART (x, offset, size);
  int D.1235 = .ARG_PART (x, offset, size);
  D.2345 = .SET_ARG_PARTS (D.1234, offset, size, D.1235, offset, size);
  return D.2345;
}

I'm sure that's not an ideal representation though but my point about
modifying the actual accesses stands.  Finding a working and not
too ugly way to represent the data flow between the original PARM_DECL
and the pieces used and the same for the return is key here unless
we want to rely on on-the-side data and nothing disrupting things between
that SRA and actual expansion (it would work to call the passes functions
directly from cfgexpand.cc of course - that's what you do as well).

Sorry for the very late reply here.

Thanks,
Richard.

>        || (TREE_CODE (to) == MEM_REF
>           && (REF_REVERSE_STORAGE_ORDER (to)
>               || mem_ref_refers_to_non_mem_p (to)))
> @@ -8909,6 +8912,19 @@ expand_constructor (tree exp, rtx target, enum expand_modifier modifier,
>        && ! mostly_zeros_p (exp))
>      return NULL_RTX;
>
> +  if (target && GET_CODE (target) == PARALLEL && all_zeros_p (exp))
> +    {
> +      int length = XVECLEN (target, 0);
> +      int start = XEXP (XVECEXP (target, 0, 0), 0) ? 0 : 1;
> +      for (int i = start; i < length; i++)
> +       {
> +         rtx dst = XEXP (XVECEXP (target, 0, i), 0);
> +         rtx zero = CONST0_RTX (GET_MODE (dst));
> +         emit_move_insn (dst, zero);
> +       }
> +      return target;
> +    }
> +
>    /* Handle calls that pass values in multiple non-contiguous
>       locations.  The Irix 6 ABI has examples of this.  */
>    if (target == 0 || ! safe_from_p (target, exp, 1)
> @@ -10621,6 +10637,157 @@ stmt_is_replaceable_p (gimple *stmt)
>    return false;
>  }
>
> +/* In the parallel rtx register series REGS, compute the position for given
> +   {BITPOS, BITSIZE}.
> +   START_INDEX, END_INDEX, LEFT_BITS and RIGHT_BITS are computed outputs.  */
> +
> +void
> +query_position_in_parallel (HOST_WIDE_INT bitpos, HOST_WIDE_INT bitsize,
> +                           rtx regs, int &start_index, int &end_index,
> +                           HOST_WIDE_INT &left_bits, HOST_WIDE_INT &right_bits)
> +{
> +  int cur_index = XEXP (XVECEXP (regs, 0, 0), 0) ? 0 : 1;
> +  for (; cur_index < XVECLEN (regs, 0); cur_index++)
> +    {
> +      rtx slot = XVECEXP (regs, 0, cur_index);
> +      HOST_WIDE_INT off = UINTVAL (XEXP (slot, 1)) * BITS_PER_UNIT;
> +      machine_mode mode = GET_MODE (XEXP (slot, 0));
> +      HOST_WIDE_INT size = GET_MODE_BITSIZE (mode).to_constant ();
> +      if (off <= bitpos && off + size > bitpos)
> +       {
> +         start_index = cur_index;
> +         left_bits = bitpos - off;
> +       }
> +      if (off + size >= bitpos + bitsize)
> +       {
> +         end_index = cur_index;
> +         right_bits = off + size - (bitpos + bitsize);
> +         break;
> +       }
> +    }
> +}
> +
> +static rtx
> +extract_sub_member (rtx regs, HOST_WIDE_INT bitpos, HOST_WIDE_INT bitsize,
> +                   tree expr)
> +{
> +  int start_index = -1;
> +  int end_index = -1;
> +  HOST_WIDE_INT left_bits = 0;
> +  HOST_WIDE_INT right_bits = 0;
> +  query_position_in_parallel (bitpos, bitsize, regs, start_index, end_index,
> +                             left_bits, right_bits);
> +
> +  machine_mode expr_mode = TYPE_MODE (TREE_TYPE (expr));
> +  if (end_index > start_index || expr_mode == BLKmode)
> +    {
> +      /* TImode in multi-registers.  */
> +      if (expr_mode == TImode)
> +       {
> +         rtx res = gen_reg_rtx (expr_mode);
> +         HOST_WIDE_INT start;
> +         start = UINTVAL (XEXP (XVECEXP (regs, 0, start_index), 1));
> +         for (int index = start_index; index <= end_index; index++)
> +           {
> +             rtx reg = XEXP (XVECEXP (regs, 0, index), 0);
> +             machine_mode mode = GET_MODE (reg);
> +             HOST_WIDE_INT off;
> +             off = UINTVAL (XEXP (XVECEXP (regs, 0, index), 1)) - start;
> +             rtx sub = simplify_gen_subreg (mode, res, expr_mode, off);
> +             emit_move_insn (sub, reg);
> +           }
> +         return res;
> +       }
> +
> +      /* Vector in multi-registers.  */
> +      if (VECTOR_MODE_P (expr_mode))
> +       {
> +         rtvec vector = rtvec_alloc (end_index - start_index + 1);
> +         machine_mode emode;
> +         emode = GET_MODE (XEXP (XVECEXP (regs, 0, start_index), 0));
> +         for (int index = start_index; index <= end_index; index++)
> +           {
> +             rtx reg = XEXP (XVECEXP (regs, 0, index), 0);
> +             gcc_assert (emode == GET_MODE (reg));
> +             RTVEC_ELT (vector, index - start_index) = reg;
> +           }
> +         scalar_int_mode imode;
> +         machine_mode vmode;
> +         int nunits = end_index - start_index + 1;
> +         if (!(int_mode_for_mode (emode).exists (&imode)
> +               && mode_for_vector (imode, nunits).exists (&vmode)))
> +           gcc_unreachable ();
> +
> +         insn_code icode;
> +         icode = convert_optab_handler (vec_init_optab, vmode, imode);
> +         rtx res = gen_reg_rtx (vmode);
> +         emit_insn (GEN_FCN (icode) (res, gen_rtx_PARALLEL (vmode, vector)));
> +         if (expr_mode == vmode)
> +           return res;
> +         return simplify_gen_subreg (expr_mode, res, vmode, 0);
> +       }
> +
> +      /* Need multi-registers in a parallel for the access.  */
> +      int num_words = end_index - start_index + 1;
> +      rtx *tmps = XALLOCAVEC (rtx, num_words);
> +
> +      int pos = 0;
> +      HOST_WIDE_INT start;
> +      start = UINTVAL (XEXP (XVECEXP (regs, 0, start_index), 1));
> +      /* Extract whole registers.  */
> +      for (; pos < num_words; pos++)
> +       {
> +         int index = start_index + pos;
> +         rtx reg = XEXP (XVECEXP (regs, 0, index), 0);
> +         machine_mode mode = GET_MODE (reg);
> +         HOST_WIDE_INT off;
> +         off = UINTVAL (XEXP (XVECEXP (regs, 0, index), 1)) - start;
> +         tmps[pos] = gen_rtx_EXPR_LIST (mode, reg, GEN_INT (off));
> +       }
> +
> +      rtx res = gen_rtx_PARALLEL (expr_mode, gen_rtvec_v (pos, tmps));
> +      return res;
> +    }
> +
> +  gcc_assert (end_index == start_index);
> +
> +  /* Just need one reg for the access.  */
> +  if (left_bits == 0 && right_bits == 0)
> +    {
> +      rtx reg = XEXP (XVECEXP (regs, 0, start_index), 0);
> +      if (GET_MODE (reg) != expr_mode)
> +       reg = gen_lowpart (expr_mode, reg);
> +      return reg;
> +    }
> +
> +  /* Need to extract bitfield part reg for the access.
> +     left_bits != 0 or right_bits != 0 */
> +  rtx reg = XEXP (XVECEXP (regs, 0, start_index), 0);
> +  bool sgn = TYPE_UNSIGNED (TREE_TYPE (expr));
> +  scalar_int_mode imode;
> +  if (!int_mode_for_mode (expr_mode).exists (&imode))
> +    {
> +      gcc_assert (false);
> +      return NULL_RTX;
> +    }
> +
> +  machine_mode mode = GET_MODE (reg);
> +  bool reverse = false;
> +  rtx bfld = extract_bit_field (reg, bitsize, left_bits, sgn, NULL_RTX, mode,
> +                               imode, reverse, NULL);
> +
> +  if (GET_MODE (bfld) != imode)
> +    bfld = gen_lowpart (imode, bfld);
> +
> +  if (expr_mode == imode)
> +    return bfld;
> +
> +  /* expr_mode != imode, e.g. SF != SI.  */
> +  rtx result = gen_reg_rtx (imode);
> +  emit_move_insn (result, bfld);
> +  return gen_lowpart (expr_mode, result);
> +}
> +
>  rtx
>  expand_expr_real_1 (tree exp, rtx target, machine_mode tmode,
>                     enum expand_modifier modifier, rtx *alt_rtl,
> @@ -11498,6 +11665,16 @@ expand_expr_real_1 (tree exp, rtx target, machine_mode tmode,
>           = expand_expr_real (tem, tem_target, VOIDmode, tem_modifier, NULL,
>                               true);
>
> +       /* It is scalarizable access on param which is passed by registers.  */
> +       if (GET_CODE (op0) == PARALLEL
> +           && (TREE_CODE (tem) == PARM_DECL || VAR_P (tem)))
> +         {
> +           HOST_WIDE_INT pos, size;
> +           size = bitsize.to_constant ();
> +           pos = bitpos.to_constant ();
> +           return extract_sub_member (op0, pos, size, exp);
> +         }
> +
>         /* If the field has a mode, we want to access it in the
>            field's mode, not the computed mode.
>            If a MEM has VOIDmode (external with incomplete type),
> diff --git a/gcc/expr.h b/gcc/expr.h
> index 2a172867fdb..8a9332aaad6 100644
> --- a/gcc/expr.h
> +++ b/gcc/expr.h
> @@ -362,5 +362,8 @@ extern rtx expr_size (tree);
>
>  extern bool mem_ref_refers_to_non_mem_p (tree);
>  extern bool non_mem_decl_p (tree);
> +extern void query_position_in_parallel (HOST_WIDE_INT, HOST_WIDE_INT, rtx,
> +                                       int &, int &, HOST_WIDE_INT &,
> +                                       HOST_WIDE_INT &);
>
>  #endif /* GCC_EXPR_H */
> diff --git a/gcc/function.cc b/gcc/function.cc
> index afb0b33da9e..518250b2728 100644
> --- a/gcc/function.cc
> +++ b/gcc/function.cc
> @@ -3107,8 +3107,29 @@ assign_parm_setup_block (struct assign_parm_data_all *all,
>           emit_move_insn (mem, entry_parm);
>         }
>        else
> -       move_block_from_reg (REGNO (entry_parm), mem,
> -                            size_stored / UNITS_PER_WORD);
> +       {
> +         int regno = REGNO (entry_parm);
> +         int nregs = size_stored / UNITS_PER_WORD;
> +         rtx *tmps = XALLOCAVEC (rtx, nregs);
> +         machine_mode mode = word_mode;
> +         HOST_WIDE_INT word_size = GET_MODE_SIZE (mode).to_constant ();
> +         for (int i = 0; i < nregs; i++)
> +           {
> +             rtx reg = gen_rtx_REG (mode, regno + i);
> +             rtx off = GEN_INT (word_size * i);
> +             tmps[i] = gen_rtx_EXPR_LIST (VOIDmode, reg, off);
> +           }
> +
> +         rtx regs = gen_rtx_PARALLEL (BLKmode, gen_rtvec_v (nregs, tmps));
> +         if (scalarizable_aggregate (parm, regs))
> +           {
> +             rtx pseudos = gen_group_rtx (regs);
> +             emit_group_move (pseudos, regs);
> +             stack_parm = pseudos;
> +           }
> +         else
> +           move_block_from_reg (regno, mem, nregs);
> +       }
>      }
>    else if (data->stack_parm == 0 && !TYPE_EMPTY_P (data->arg.type))
>      {
> @@ -3710,7 +3731,15 @@ assign_parms (tree fndecl)
>
>        assign_parm_adjust_stack_rtl (&data);
>
> -      if (assign_parm_setup_block_p (&data))
> +      rtx incoming = DECL_INCOMING_RTL (parm);
> +      if (GET_CODE (incoming) == PARALLEL
> +         && scalarizable_aggregate (parm, incoming))
> +       {
> +         rtx pseudos = gen_group_rtx (incoming);
> +         emit_group_move (pseudos, incoming);
> +         set_parm_rtl (parm, pseudos);
> +       }
> +      else if (assign_parm_setup_block_p (&data))
>         assign_parm_setup_block (&all, parm, &data);
>        else if (data.arg.pass_by_reference || use_register_for_decl (parm))
>         assign_parm_setup_reg (&all, parm, &data);
> @@ -5128,6 +5157,7 @@ expand_function_start (tree subr)
>             {
>               gcc_assert (GET_CODE (hard_reg) == PARALLEL);
>               set_parm_rtl (res, gen_group_rtx (hard_reg));
> +             set_scalar_rtx_for_returns ();
>             }
>         }
>
> diff --git a/gcc/tree-sra.h b/gcc/tree-sra.h
> index f20266c4622..df3071ccf6e 100644
> --- a/gcc/tree-sra.h
> +++ b/gcc/tree-sra.h
> @@ -19,6 +19,82 @@ You should have received a copy of the GNU General Public License
>  along with GCC; see the file COPYING3.  If not see
>  <http://www.gnu.org/licenses/>.  */
>
> +struct sra_base_access
> +{
> +  /* Values returned by get_ref_base_and_extent, indicates the
> +     OFFSET, SIZE and BASE of the access.  */
> +  HOST_WIDE_INT offset;
> +  HOST_WIDE_INT size;
> +
> +  /* The context expression of this access.  */
> +  tree expr;
> +
> +  /* Indicates this is a write access.  */
> +  bool write : 1;
> +
> +  /* Indicates if this access is made in reverse storage order.  */
> +  bool reverse : 1;
> +};
> +
> +/* Default template for sra_scan_function.  */
> +
> +struct sra_default_analyzer
> +{
> +  /* Template analyze functions.  */
> +  void analyze_phi (gphi *){};
> +  void pre_analyze_stmt (gimple *){};
> +  void analyze_return (greturn *){};
> +  void analyze_assign (gassign *){};
> +  void analyze_call (gcall *){};
> +  void analyze_asm (gasm *){};
> +  void analyze_default_stmt (gimple *){};
> +};
> +
> +/* Scan function and look for interesting expressions.  */
> +
> +template <typename analyzer>
> +void
> +scan_function (struct function *fun, analyzer &a)
> +{
> +  basic_block bb;
> +  FOR_EACH_BB_FN (bb, fun)
> +    {
> +      for (gphi_iterator gsi = gsi_start_phis (bb); !gsi_end_p (gsi);
> +          gsi_next (&gsi))
> +       a.analyze_phi (gsi.phi ());
> +
> +      for (gimple_stmt_iterator gsi = gsi_start_bb (bb); !gsi_end_p (gsi);
> +          gsi_next (&gsi))
> +       {
> +         gimple *stmt = gsi_stmt (gsi);
> +         a.pre_analyze_stmt (stmt);
> +
> +         switch (gimple_code (stmt))
> +           {
> +           case GIMPLE_RETURN:
> +             a.analyze_return (as_a<greturn *> (stmt));
> +             break;
> +
> +           case GIMPLE_ASSIGN:
> +             a.analyze_assign (as_a<gassign *> (stmt));
> +             break;
> +
> +           case GIMPLE_CALL:
> +             a.analyze_call (as_a<gcall *> (stmt));
> +             break;
> +
> +           case GIMPLE_ASM:
> +             a.analyze_asm (as_a<gasm *> (stmt));
> +             break;
> +
> +           default:
> +             a.analyze_default_stmt (stmt);
> +             break;
> +           }
> +       }
> +    }
> +}
> +
>  bool type_internals_preclude_sra_p (tree type, const char **msg);
>
>  /* Return true iff TYPE is stdarg va_list type (which early SRA and IPA-SRA
> diff --git a/gcc/var-tracking.cc b/gcc/var-tracking.cc
> index d8dafa5481a..7fc801f2612 100644
> --- a/gcc/var-tracking.cc
> +++ b/gcc/var-tracking.cc
> @@ -5352,7 +5352,8 @@ track_loc_p (rtx loc, tree expr, poly_int64 offset, bool store_reg_p,
>       because the real and imaginary parts are represented as separate
>       pseudo registers, even if the whole complex value fits into one
>       hard register.  */
> -  if ((paradoxical_subreg_p (mode, DECL_MODE (expr))
> +  if (((DECL_MODE (expr) != BLKmode
> +       && paradoxical_subreg_p (mode, DECL_MODE (expr)))
>         || (store_reg_p
>            && !COMPLEX_MODE_P (DECL_MODE (expr))
>            && hard_regno_nregs (REGNO (loc), DECL_MODE (expr)) == 1))
> diff --git a/gcc/testsuite/g++.target/powerpc/pr102024.C b/gcc/testsuite/g++.target/powerpc/pr102024.C
> index 769585052b5..c8995cae707 100644
> --- a/gcc/testsuite/g++.target/powerpc/pr102024.C
> +++ b/gcc/testsuite/g++.target/powerpc/pr102024.C
> @@ -5,7 +5,7 @@
>  // Test that a zero-width bit field in an otherwise homogeneous aggregate
>  // generates a psabi warning and passes arguments in GPRs.
>
> -// { dg-final { scan-assembler-times {\mstd\M} 4 } }
> +// { dg-final { scan-assembler-times {\mmtvsrd\M} 4 } }
>
>  struct a_thing
>  {
> diff --git a/gcc/testsuite/gcc.target/i386/pr20020-2.c b/gcc/testsuite/gcc.target/i386/pr20020-2.c
> index fa8cb2528c5..723f1826630 100644
> --- a/gcc/testsuite/gcc.target/i386/pr20020-2.c
> +++ b/gcc/testsuite/gcc.target/i386/pr20020-2.c
> @@ -15,10 +15,15 @@ struct shared_ptr_struct
>  };
>  typedef struct shared_ptr_struct sptr_t;
>
> +void foo (sptr_t *);
> +
>  void
>  copy_sptr (sptr_t *dest, sptr_t src)
>  {
>    *dest = src;
> +
> +  /* Prevent 'src' to be scalarized as registers.  */
> +  foo (&src);
>  }
>
>  /* { dg-final { scan-rtl-dump "\\\(set \\\(reg:TI \[0-9\]*" "expand" } } */
> diff --git a/gcc/testsuite/gcc.target/powerpc/pr108073.c b/gcc/testsuite/gcc.target/powerpc/pr108073.c
> new file mode 100644
> index 00000000000..293bf93fb9a
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/pr108073.c
> @@ -0,0 +1,29 @@
> +/* { dg-do run } */
> +/* { dg-options "-O2 -save-temps" } */
> +
> +typedef struct DF {double a[4]; short s1; short s2; short s3; short s4; } DF;
> +typedef struct SF {float a[4]; int i1; int i2; } SF;
> +
> +/* { dg-final { scan-assembler-times {\mmtvsrd|mtvsrws\M} 3 {target { has_arch_ppc64 && has_arch_pwr8 } } } } */
> +/* { dg-final { scan-assembler-not {\mlwz\M} {target { has_arch_ppc64 && has_arch_pwr8 } } } } */
> +/* { dg-final { scan-assembler-not {\mlhz\M} {target { has_arch_ppc64 && has_arch_pwr8 } } } } */
> +short  __attribute__ ((noipa)) foo_hi (DF a, int flag){if (flag == 2)return a.s2+a.s3;return 0;}
> +int  __attribute__ ((noipa)) foo_si (SF a, int flag){if (flag == 2)return a.i2+a.i1;return 0;}
> +double __attribute__ ((noipa)) foo_df (DF arg, int flag){if (flag == 2)return arg.a[3];else return 0.0;}
> +float  __attribute__ ((noipa)) foo_sf (SF arg, int flag){if (flag == 2)return arg.a[2]; return 0;}
> +float  __attribute__ ((noipa)) foo_sf1 (SF arg, int flag){if (flag == 2)return arg.a[1];return 0;}
> +
> +DF gdf = {{1.0,2.0,3.0,4.0}, 1, 2, 3, 4};
> +SF gsf = {{1.0f,2.0f,3.0f,4.0f}, 1, 2};
> +
> +int main()
> +{
> +  if (!(foo_hi (gdf, 2) == 5 && foo_si (gsf, 2) == 3 && foo_df (gdf, 2) == 4.0
> +       && foo_sf (gsf, 2) == 3.0 && foo_sf1 (gsf, 2) == 2.0))
> +    __builtin_abort ();
> +  if (!(foo_hi (gdf, 1) == 0 && foo_si (gsf, 1) == 0 && foo_df (gdf, 1) == 0
> +       && foo_sf (gsf, 1) == 0 && foo_sf1 (gsf, 1) == 0))
> +    __builtin_abort ();
> +  return 0;
> +}
> +
> diff --git a/gcc/testsuite/gcc.target/powerpc/pr65421-1.c b/gcc/testsuite/gcc.target/powerpc/pr65421-1.c
> new file mode 100644
> index 00000000000..4e1f87f7939
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/pr65421-1.c
> @@ -0,0 +1,6 @@
> +/* PR target/65421 */
> +/* { dg-options "-O2" } */
> +
> +typedef struct LARGE {double a[4]; int arr[32];} LARGE;
> +LARGE foo (LARGE a){return a;}
> +/* { dg-final { scan-assembler-times {\mmemcpy\M} 1 } } */
> diff --git a/gcc/testsuite/gcc.target/powerpc/pr65421-2.c b/gcc/testsuite/gcc.target/powerpc/pr65421-2.c
> new file mode 100644
> index 00000000000..8a8e1a0e996
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/pr65421-2.c
> @@ -0,0 +1,32 @@
> +/* PR target/65421 */
> +/* { dg-options "-O2" } */
> +/* { dg-require-effective-target powerpc_elfv2 } */
> +/* { dg-require-effective-target has_arch_ppc64 } */
> +
> +typedef struct FLOATS
> +{
> +  double a[3];
> +} FLOATS;
> +
> +/* 3 lfd after returns also optimized */
> +/* FLOATS ret_arg_pt (FLOATS *a){return *a;} */
> +
> +/* 3 stfd */
> +void st_arg (FLOATS a, FLOATS *p) {*p = a;}
> +/* { dg-final { scan-assembler-times {\mstfd\M} 3 } } */
> +
> +/* blr */
> +FLOATS ret_arg (FLOATS a) {return a;}
> +
> +typedef struct MIX
> +{
> +  double a[2];
> +  long l;
> +} MIX;
> +
> +/* std 3 param regs to return slot */
> +MIX ret_arg1 (MIX a) {return a;}
> +/* { dg-final { scan-assembler-times {\mstd\M} 3 } } */
> +
> +/* count insns */
> +/* { dg-final { scan-assembler-times {(?n)^\s+[a-z]} 9 } } */
> --
> 2.25.1
>
Jiufu Guo Nov. 28, 2023, 2:59 a.m. UTC | #3
Hi,

Thanks so much for your helpful review!

Richard Biener <richard.guenther@gmail.com> writes:

> On Fri, Oct 27, 2023 at 3:51 AM Jiufu Guo <guojiufu@linux.ibm.com> wrote:
>>
>> Hi,
>>
>> Compare with previous version:
>> https://gcc.gnu.org/pipermail/gcc-patches/2023-October/632399.html
>> This verion supports TI/VEC mode of the access.
>>
>> There are a few PRs (meta-bug PR101926) on various targets.
>> The root causes of them are similar: the aggeragte param/
>> returns are passed by multi-registers, but they are stored
>> to stack from registers first; and then, access the
>> parameter through stack slot.
>>
>> A general idea to enhance this: accessing the aggregate
>> parameters/returns directly through incoming/outgoing
>> scalar registers.  This idea would be a kind of SRA.
>>
>> This experimental patch for light-expander-sra contains
>> below parts:
>>
>> a. Check if the parameters/returns are ok/profitable to
>>    scalarize, and set the incoming/outgoing registers(
>>    pseudos) for the parameter/return.
>>   - This is done in "expand_function_start", after the
>>     incoming/outgoing hard registers are determined for the
>>     paramter/return.
>>     The scalarized registers are recorded in DECL_RTL for
>>     the parameter/return in parallel form.
>>   - At the time when setting DECL_RTL, "scalarizable_aggregate"
>>     is called to check the accesses are ok/profitable to
>>     scalarize.
>>     We can continue to enhance this function, to support
>>     more cases.  For example:
>>     - 'reverse storage order'.
>>     - 'writing to parameter'/'overlap accesses'.
>>
>> b. When expanding the accesses of the parameters/returns,
>>    according to the info of the access(e.g. bitpos,bitsize,
>>    mode), the scalar(pseudos) can be figured out to expand
>>    the access.  This may happen when expand below accesses:
>>   - The component access of a parameter: "_1 = arg.f1".
>>     Or whole parameter access: rhs of "_2 = arg"
>>   - The assignment to a return val:
>>     "D.xx = yy; or D.xx.f = zz" where D.xx occurs on return
>>     stmt.
>>   - This is mainly done in expr.cc(expand_expr_real_1, and
>>     expand_assignment).  Function "extract_sub_member" is
>>     used to figure out the scalar rtxs(pseudos).
>>
>> Besides the above two parts, some work are done in the GIMPLE
>> tree:  collect sra candidates for parameters/returns, and
>> collect the SRA access info.
>> This is mainly done at the beginning of the expander pass.
>> Below are two major items of this part.
>>  - Collect light-expand-sra candidates.
>>   Each parameter is checked if it has the proper aggregate
>>   type.  Collect return val (VAR_P) on each return stmts if
>>   the function is returning via registers.
>>   This is implemented in expand_sra::collect_sra_candidates.
>>
>>  - Build/collect/manage all the access on the candidates.
>>   The function "scan_function" is used to do this work, it
>>   goes through all basicblocks, and all interesting stmts (
>>   phi, return, assign, call, asm) are checked.
>>   If there is an interesting expression (e.g. COMPONENT_REF
>>   or PARM_DECL), then record the required info for the access
>>   (e.g. pos, size, type, base).
>>   And if it is risky to do SRA, the candidates may be removed.
>>   e.g. address-taken and accessed via memory.
>>   "foo(struct S arg) {bar (&arg);}"
>>
>> This patch is tested on ppc64{,le} and x86_64.
>> Is this ok for trunk?
>>
>> BR,
>> Jeff (Jiufu Guo)
>>
>>         PR target/65421
>>
>> gcc/ChangeLog:
>>
>>         * cfgexpand.cc (struct access): New class.
>>         (struct expand_sra): New class.
>>         (expand_sra::collect_sra_candidates): New member function.
>>         (expand_sra::add_sra_candidate): Likewise.
>>         (expand_sra::build_access): Likewise.
>>         (expand_sra::analyze_phi): Likewise.
>>         (expand_sra::analyze_assign): Likewise.
>>         (expand_sra::visit_base): Likewise.
>>         (expand_sra::protect_mem_access_in_stmt): Likewise.
>>         (expand_sra::expand_sra):  Class constructor.
>>         (expand_sra::~expand_sra): Class destructor.
>>         (expand_sra::scalarizable_access):  New member function.
>>         (expand_sra::scalarizable_accesses):  Likewise.
>>         (scalarizable_aggregate):  New function.
>>         (set_scalar_rtx_for_returns):  New function.
>>         (expand_value_return): Updated.
>>         (expand_debug_expr): Updated.
>>         (pass_expand::execute): Updated to use expand_sra.
>>         * cfgexpand.h (scalarizable_aggregate): New declare.
>>         (set_scalar_rtx_for_returns): New declare.
>>         * expr.cc (expand_assignment): Updated.
>>         (expand_constructor): Updated.
>>         (query_position_in_parallel): New function.
>>         (extract_sub_member): New function.
>>         (expand_expr_real_1): Updated.
>>         * expr.h (query_position_in_parallel): New declare.
>>         * function.cc (assign_parm_setup_block): Updated.
>>         (assign_parms): Updated.
>>         (expand_function_start): Updated.
>>         * tree-sra.h (struct sra_base_access): New class.
>>         (struct sra_default_analyzer): New class.
>>         (scan_function): New template function.
>>         * var-tracking.cc (track_loc_p): Updated.
>>
>> gcc/testsuite/ChangeLog:
>>
>>         * g++.target/powerpc/pr102024.C: Updated
>>         * gcc.target/powerpc/pr108073.c: New test.
>>         * gcc.target/powerpc/pr65421-1.c: New test.
>>         * gcc.target/powerpc/pr65421-2.c: New test.
>>
>> ---
>>  gcc/cfgexpand.cc                             | 352 ++++++++++++++++++-
>>  gcc/cfgexpand.h                              |   2 +
>>  gcc/expr.cc                                  | 179 +++++++++-
>>  gcc/expr.h                                   |   3 +
>>  gcc/function.cc                              |  36 +-
>>  gcc/tree-sra.h                               |  76 ++++
>>  gcc/var-tracking.cc                          |   3 +-
>>  gcc/testsuite/g++.target/powerpc/pr102024.C  |   2 +-
>>  gcc/testsuite/gcc.target/i386/pr20020-2.c    |   5 +
>>  gcc/testsuite/gcc.target/powerpc/pr108073.c  |  29 ++
>>  gcc/testsuite/gcc.target/powerpc/pr65421-1.c |   6 +
>>  gcc/testsuite/gcc.target/powerpc/pr65421-2.c |  32 ++
>>  12 files changed, 718 insertions(+), 7 deletions(-)
>>  create mode 100644 gcc/testsuite/gcc.target/powerpc/pr108073.c
>>  create mode 100644 gcc/testsuite/gcc.target/powerpc/pr65421-1.c
>>  create mode 100644 gcc/testsuite/gcc.target/powerpc/pr65421-2.c
>>
>> diff --git a/gcc/cfgexpand.cc b/gcc/cfgexpand.cc
>> index 4262703a138..ef99ca8ac13 100644
>> --- a/gcc/cfgexpand.cc
>> +++ b/gcc/cfgexpand.cc
>> @@ -74,6 +74,7 @@ along with GCC; see the file COPYING3.  If not see
>>  #include "output.h"
>>  #include "builtins.h"
>>  #include "opts.h"
>> +#include "tree-sra.h"
>>
>>  /* Some systems use __main in a way incompatible with its use in gcc, in these
>>     cases use the macros NAME__MAIN to give a quoted symbol and SYMBOL__MAIN to
>> @@ -97,6 +98,343 @@ static bool defer_stack_allocation (tree, bool);
>>
>>  static void record_alignment_for_reg_var (unsigned int);
>>
>> +/* For light SRA in expander about paramaters and returns.  */
>> +struct access : public sra_base_access
>> +{
>> +};
>> +
>> +typedef struct access *access_p;
>> +
>> +struct expand_sra : public sra_default_analyzer
>> +{
>> +  /* Construct/destruct resources, e.g. sra candidates.  */
>> +  expand_sra ();
>> +  ~expand_sra ();
>> +
>> +  /* No actions for pre_analyze_stmt, analyze_return.  */
>> +
>> +  /* Overwrite phi,call,asm analyzations.  */
>> +  void analyze_phi (gphi *s);
>> +
>> +  /* TODO: Check accesses on call/asm.  */
>> +  void analyze_call (gcall *s) { protect_mem_access_in_stmt (s); };
>> +  void analyze_asm (gasm *s) { protect_mem_access_in_stmt (s); };
>> +
>> +  /* Check access of SRA on assignment.  */
>> +  void analyze_assign (gassign *);
>> +
>> +  /* Check if the accesses of BASE(parameter or return) are
>> +     scalarizable, according to the incoming/outgoing REGS. */
>> +  bool scalarizable_accesses (tree base, rtx regs);
>> +
>> +private:
>> +  /* Collect the parameter and returns to check if they are suitable for
>> +     scalarization.  */
>> +  bool collect_sra_candidates (void);
>> +
>> +  /* Return true if VAR is added as a candidate for SRA.  */
>> +  bool add_sra_candidate (tree var);
>> +
>> +  /* Return true if EXPR has interesting sra access, and created access,
>> +     return false otherwise.  */
>> +  access_p build_access (tree expr, bool write);
>> +
>> +  /* Check if the access ACC is scalarizable.  REGS is the incoming/outgoing
>> +     registers which the access is based on. */
>> +  bool scalarizable_access (access_p acc, rtx regs, bool is_parm);
>> +
>> +  /* If there is risk (stored/loaded or addr taken),
>> +     disqualify the sra candidates in the un-interesting STMT. */
>> +  void protect_mem_access_in_stmt (gimple *stmt);
>> +
>> +  /* Callback of walk_stmt_load_store_addr_ops, used to remove
>> +     unscalarizable accesses.  */
>> +  static bool visit_base (gimple *, tree op, tree, void *data);
>> +
>> +  /* Base (tree) -> Vector (vec<access_p> *) map.  */
>> +  hash_map<tree, auto_vec<access_p> > *base_access_vec;
>> +};
>> +
>> +bool
>> +expand_sra::collect_sra_candidates (void)
>> +{
>> +  bool ret = false;
>> +
>> +  /* Collect parameters.  */
>> +  for (tree parm = DECL_ARGUMENTS (current_function_decl); parm;
>> +       parm = DECL_CHAIN (parm))
>> +    ret |= add_sra_candidate (parm);
>> +
>> +  /* Collect VARs on returns.  */
>> +  if (DECL_RESULT (current_function_decl))
>> +    {
>> +      edge_iterator ei;
>> +      edge e;
>> +      FOR_EACH_EDGE (e, ei, EXIT_BLOCK_PTR_FOR_FN (cfun)->preds)
>> +       if (greturn *r = safe_dyn_cast<greturn *> (*gsi_last_bb (e->src)))
>> +         {
>> +           tree val = gimple_return_retval (r);
>> +           /* To sclaraized the return, the return value should be only
>> +              writen (except this return stmt).  Using 'true(write)' to
>> +              pretend the access only be 'writen'. */
>> +           if (val && VAR_P (val))
>> +             ret |= add_sra_candidate (val) && build_access (val, true);
>> +         }
>> +    }
>> +
>> +  return ret;
>> +}
>> +
>> +bool
>> +expand_sra::add_sra_candidate (tree var)
>> +{
>> +  tree type = TREE_TYPE (var);
>> +
>> +  if (!AGGREGATE_TYPE_P (type) || !tree_fits_shwi_p (TYPE_SIZE (type))
>> +      || tree_to_shwi (TYPE_SIZE (type)) == 0 || TREE_THIS_VOLATILE (var)
>> +      || is_va_list_type (type))
>> +    return false;
>> +
>> +  base_access_vec->get_or_insert (var);
>> +
>> +  return true;
>> +}
>> +
>> +access_p
>> +expand_sra::build_access (tree expr, bool write)
>> +{
>> +  enum tree_code code = TREE_CODE (expr);
>> +  if (code != VAR_DECL && code != PARM_DECL && code != COMPONENT_REF
>> +      && code != ARRAY_REF && code != ARRAY_RANGE_REF)
>> +    return NULL;
>> +
>> +  HOST_WIDE_INT offset, size;
>> +  bool reverse;
>> +  tree base = get_ref_base_and_extent_hwi (expr, &offset, &size, &reverse);
>> +  if (!base || !DECL_P (base))
>> +    return NULL;
>> +
>> +  vec<access_p> *access_vec = base_access_vec->get (base);
>> +  if (!access_vec)
>> +    return NULL;
>> +
>> +  /* TODO: support reverse. */
>> +  if (reverse || size <= 0 || offset + size > tree_to_shwi (DECL_SIZE (base)))
>> +    {
>> +      base_access_vec->remove (base);
>> +      return NULL;
>> +    }
>> +
>> +  struct access *access = XNEWVEC (struct access, 1);
>> +
>> +  memset (access, 0, sizeof (struct access));
>> +  access->offset = offset;
>> +  access->size = size;
>> +  access->expr = expr;
>> +  access->write = write;
>> +  access->reverse = reverse;
>> +
>> +  access_vec->safe_push (access);
>> +
>> +  return access;
>> +}
>
> I still think there's SRA code we can share for producing
> accesses from a stmt.

Yeap, there is similar code in existing SRAs. To common the code is a
todo item. 
One potential issue, for the example: "analyze/scan expr and
build/create access", there are some small/scatter (e.g. the order of
if conditions) difference may cause different behavior. This may cause
the common code un-pretty.

>
>> +
>> +/* Function protect_mem_access_in_stmt removes the SRA candidates if
>> +   there is addr-taken on the candidate in the STMT.  */
>> +
>> +void
>> +expand_sra::analyze_phi (gphi *stmt)
>> +{
>> +  if (base_access_vec && !base_access_vec->is_empty ())
>> +    walk_stmt_load_store_addr_ops (stmt, this, NULL, NULL, visit_base);
>> +}
>> +
>> +void
>> +expand_sra::analyze_assign (gassign *stmt)
>> +{
>> +  if (!base_access_vec || base_access_vec->is_empty ())
>> +    return;
>> +
>> +  if (gimple_assign_single_p (stmt) && !gimple_clobber_p (stmt))
>> +    {
>> +      tree rhs = gimple_assign_rhs1 (stmt);
>> +      tree lhs = gimple_assign_lhs (stmt);
>> +      bool res_r = build_access (rhs, false);
>> +      bool res_l = build_access (lhs, true);
>> +
>> +      if (res_l || res_r)
>> +       return;
>> +    }
>> +
>> +  protect_mem_access_in_stmt (stmt);
>> +}
>> +
>> +/* Callback of walk_stmt_load_store_addr_ops, used to remove
>> +   unscalarizable accesses.  Called by protect_mem_access_in_stmt.  */
>> +
>> +bool
>> +expand_sra::visit_base (gimple *, tree op, tree, void *data)
>> +{
>> +  op = get_base_address (op);
>> +  if (op && DECL_P (op))
>> +    {
>> +      expand_sra *p = (expand_sra *) data;
>> +      p->base_access_vec->remove (op);
>> +    }
>> +  return false;
>> +}
>> +
>> +/* Function protect_mem_access_in_stmt removes the SRA candidates if
>> +   there is store/load/addr-taken on the candidate in the STMT.
>> +
>> +   For some statements, which SRA does not care about, if there are
>> +   possible memory operation on the SRA candidates, it would be risky
>> +   to scalarize it.  */
>> +
>> +void
>> +expand_sra::protect_mem_access_in_stmt (gimple *stmt)
>> +{
>> +  if (base_access_vec && !base_access_vec->is_empty ())
>> +    walk_stmt_load_store_addr_ops (stmt, this, visit_base, visit_base,
>> +                                  visit_base);
>> +}
>> +
>> +expand_sra::expand_sra () : base_access_vec (NULL)
>> +{
>> +  if (optimize <= 0)
>> +    return;
>> +
>> +  base_access_vec = new hash_map<tree, auto_vec<access_p> >;
>> +  collect_sra_candidates ();
>> +}
>> +
>> +expand_sra::~expand_sra ()
>> +{
>> +  if (optimize <= 0)
>> +    return;
>> +
>> +  delete base_access_vec;
>> +}
>> +
>> +bool
>> +expand_sra::scalarizable_access (access_p acc, rtx regs, bool is_parm)
>> +{
>> +  /* Now only support reading from parms
>> +     or writing to returns.  */
>> +  if (is_parm && acc->write)
>> +    return false;
>> +  if (!is_parm && !acc->write)
>> +    return false;
>> +
>> +  /* Compute the position of the access in the parallel regs.  */
>> +  int start_index = -1;
>> +  int end_index = -1;
>> +  HOST_WIDE_INT left_bits = 0;
>> +  HOST_WIDE_INT right_bits = 0;
>> +  query_position_in_parallel (acc->offset, acc->size, regs, start_index,
>> +                             end_index, left_bits, right_bits);
>> +
>> +  /* Invalid access possition: padding or outof bound.  */
>> +  if (start_index < 0 || end_index < 0)
>> +    return false;
>> +
>> +  machine_mode expr_mode = TYPE_MODE (TREE_TYPE (acc->expr));
>> +  /* Need multi-registers in a parallel for the access.  */
>> +  if (expr_mode == BLKmode || end_index > start_index)
>> +    {
>> +      if (left_bits || right_bits)
>> +       return false;
>> +      if (expr_mode == BLKmode)
>> +       return true;
>> +
>> +      /* For large modes, only support TI/VECTOR in mult-registers. */
>> +      if (known_gt (acc->size, GET_MODE_BITSIZE (word_mode)))
>> +       return expr_mode == TImode || VECTOR_MODE_P (expr_mode);
>> +      return true;
>> +    }
>> +
>> +  gcc_assert (end_index == start_index);
>> +
>> +  /* Just need one reg for the access.  */
>> +  if (left_bits == 0 && right_bits == 0)
>> +    return true;
>> +
>> +  scalar_int_mode imode;
>> +  /* Need to extract bits from the reg for the access.  */
>> +  return !acc->write && int_mode_for_mode (expr_mode).exists (&imode);
>> +}
>> +
>> +/* Now, the base (parm/return) is scalarizable, only if all
>> +   accesses of the BASE are scalariable.
>> +
>> +   This function need to be updated, to support more complicate
>> +   cases, like:
>> +   - Some access are scalarizable, but some are not.
>> +   - Access is writing to a parameter.
>> +   - Writing accesses are overlap with multi-accesses.   */
>> +
>> +bool
>> +expand_sra::scalarizable_accesses (tree base, rtx regs)
>> +{
>> +  if (!base_access_vec)
>> +    return false;
>> +  vec<access_p> *access_vec = base_access_vec->get (base);
>> +  if (!access_vec)
>> +    return false;
>> +  if (access_vec->is_empty ())
>> +    return false;
>> +
>> +  bool is_parm = TREE_CODE (base) == PARM_DECL;
>> +  int n = access_vec->length ();
>> +  int cur_access_index = 0;
>> +  for (; cur_access_index < n; cur_access_index++)
>> +    if (!scalarizable_access ((*access_vec)[cur_access_index], regs, is_parm))
>> +      break;
>> +
>> +  /* It is ok if all access are scalarizable.  */
>> +  if (cur_access_index == n)
>> +    return true;
>> +
>> +  base_access_vec->remove (base);
>> +  return false;
>> +}
>> +
>> +static expand_sra *current_sra = NULL;
>> +
>> +/* Check if the PARAM (or return) is scalarizable.
>> +
>> +   This interface is used in expand_function_start
>> +   to check sra possiblity for parmeters. */
>> +
>> +bool
>> +scalarizable_aggregate (tree parm, rtx regs)
>> +{
>> +  if (!current_sra)
>> +    return false;
>> +  return current_sra->scalarizable_accesses (parm, regs);
>> +}
>> +
>> +/* Check if interesting returns, and if they are scalarizable,
>> +   set DECL_RTL as scalar registers.
>> +
>> +   This interface is used in expand_function_start
>> +   when outgoing registers are determinded for DECL_RESULT.  */
>> +
>> +void
>> +set_scalar_rtx_for_returns ()
>> +{
>> +  rtx res = DECL_RTL (DECL_RESULT (current_function_decl));
>> +  edge_iterator ei;
>> +  edge e;
>> +  FOR_EACH_EDGE (e, ei, EXIT_BLOCK_PTR_FOR_FN (cfun)->preds)
>> +    if (greturn *r = safe_dyn_cast<greturn *> (*gsi_last_bb (e->src)))
>> +      {
>> +       tree val = gimple_return_retval (r);
>> +       if (val && VAR_P (val) && scalarizable_aggregate (val, res))
>> +         SET_DECL_RTL (val, res);
>> +      }
>> +}
>> +
>>  /* Return an expression tree corresponding to the RHS of GIMPLE
>>     statement STMT.  */
>>
>> @@ -3707,7 +4045,7 @@ expand_value_return (rtx val)
>>
>>    tree decl = DECL_RESULT (current_function_decl);
>>    rtx return_reg = DECL_RTL (decl);
>> -  if (return_reg != val)
>> +  if (!rtx_equal_p (return_reg, val))
>>      {
>>        tree funtype = TREE_TYPE (current_function_decl);
>>        tree type = TREE_TYPE (decl);
>> @@ -4423,6 +4761,12 @@ expand_debug_expr (tree exp)
>>    addr_space_t as;
>>    scalar_int_mode op0_mode, op1_mode, addr_mode;
>>
>> +  /* TODO: Enable to debug expand-sra optimized parm/returns.  */
>> +  tree base = get_base_address (exp);
>> +  if ((TREE_CODE (base) == PARM_DECL || (VAR_P (base) && DECL_RTL_SET_P (base)))
>> +      && GET_CODE (DECL_RTL (base)) == PARALLEL)
>> +    return NULL_RTX;
>> +
>>    switch (TREE_CODE_CLASS (TREE_CODE (exp)))
>>      {
>>      case tcc_expression:
>> @@ -6628,6 +6972,10 @@ pass_expand::execute (function *fun)
>>    auto_bitmap forced_stack_vars;
>>    discover_nonconstant_array_refs (forced_stack_vars);
>>
>> +  /* Enable light-expander-sra.  */
>> +  current_sra = new expand_sra;
>> +  scan_function (cfun, *current_sra);
>> +
>>    /* Make sure all values used by the optimization passes have sane
>>       defaults.  */
>>    reg_renumber = 0;
>> @@ -7056,6 +7404,8 @@ pass_expand::execute (function *fun)
>>        loop_optimizer_finalize ();
>>      }
>>
>> +  delete current_sra;
>> +  current_sra = NULL;
>>    timevar_pop (TV_POST_EXPAND);
>>
>>    return 0;
>> diff --git a/gcc/cfgexpand.h b/gcc/cfgexpand.h
>> index 0e551f6cfd3..3415c217708 100644
>> --- a/gcc/cfgexpand.h
>> +++ b/gcc/cfgexpand.h
>> @@ -24,5 +24,7 @@ extern tree gimple_assign_rhs_to_tree (gimple *);
>>  extern HOST_WIDE_INT estimated_stack_frame_size (struct cgraph_node *);
>>  extern void set_parm_rtl (tree, rtx);
>>
>> +extern bool scalarizable_aggregate (tree, rtx);
>> +extern void set_scalar_rtx_for_returns ();
>>
>>  #endif /* GCC_CFGEXPAND_H */
>> diff --git a/gcc/expr.cc b/gcc/expr.cc
>> index 763bd82c59f..5ba26e0ef52 100644
>> --- a/gcc/expr.cc
>> +++ b/gcc/expr.cc
>> @@ -5618,7 +5618,10 @@ expand_assignment (tree to, tree from, bool nontemporal)
>>       Assignment of an array element at a constant index, and assignment of
>>       an array element in an unaligned packed structure field, has the same
>>       problem.  Same for (partially) storing into a non-memory object.  */
>> -  if (handled_component_p (to)
>> +  if ((handled_component_p (to)
>> +       && !(VAR_P (get_base_address (to))
>
> get_base_address is pretty expensive so please cache the result.

Thanks for pointing out this!

>
>> +           && DECL_RTL_SET_P (get_base_address (to))
>> +           && GET_CODE (DECL_RTL (get_base_address (to))) == PARALLEL))
>
> I think that's possibly fragile in case a PARALLEL can also appear by
> other means here.  I also think it's the wrong thing to map a structural

Understand your concern. The current trunk code uses PARALLEL to carry
the incoming/outgoing registers for aggregate, so this patch reuse this
character. While it is possible that PARALLEL is used by other means.
To make it safe, adding an additional flag may be an idea. (While this
would not needed if using your suggestions below).

> GIMPLE accordingly with artificial decls which would also be the way
> to preserve debug info.
>
> In principle "expander SRA" isn't different from regular SRA apart from
> the fact that it handles parameter and return variables and heuristically
> includes the ABI (does it?) in decomposition.
>
> So the only important difference is that we cannot represent the
> change for the incoming arguments or the outgoing regs in GIMPLE
> but we should only have to special case _those_ bits, not all the
> other accesses we can rewrite.
>
> We could even use the actual SRA pass in a special mode right
> before RTL expansion (or close to it) and invent a special (but ugly)
> representation for the incoming/outgoing part, say by using
> new internal functions present between the SRA pass and RTL
> expansion:
>
Previously, I tried the method to rerun tree-sra just before the
"optimized" pass once, and find that some enhancement woudl be needed to
make it work: for example, the parameters analyzing seems not work for
some cases.  While the two major reasons for me to combine the
'light-sra' inside expander:
1. We may just use the reg/pseudos based on incoming/outgoing registers
to access the struct.  Then, no need to create the replacements like
tree-sra and ipa-sra.

2. For small structures which may have int mode, RTL passes (start from
expander) are using the int reg to access the struct.  It would be
better to avoid struct/member access at the beggining of expeander.
It seems these issues can also be relieved by the idea of introducing
"new internal function". We just need to expand the internal function
in any expected way.

> struct X foo (struct X x)
> {
>   int D.1234 = .ARG_PART (x, offset, size);
>   int D.1235 = .ARG_PART (x, offset, size);
>   D.2345 = .SET_ARG_PARTS (D.1234, offset, size, D.1235, offset, size);
>   return D.2345;
> }

This code expresses two major functionalities: One is extracting fields
from aggregate (mainly for parameters); and another one is constructing
an aggregate from the value of elements (mainly for returns).

>
> I'm sure that's not an ideal representation though but my point about
> modifying the actual accesses stands.  Finding a working and not
> too ugly way to represent the data flow between the original PARM_DECL
> and the pieces used and the same for the return is key here unless
> we want to rely on on-the-side data and nothing disrupting things between
> that SRA and actual expansion (it would work to call the passes functions
> directly from cfgexpand.cc of course - that's what you do as well).

Thanks for your suggestions!

>
> Sorry for the very late reply here.


BR,
Jeff (Jiufu Guo)

>
> Thanks,
> Richard.
>
>>        || (TREE_CODE (to) == MEM_REF
>>           && (REF_REVERSE_STORAGE_ORDER (to)
>>               || mem_ref_refers_to_non_mem_p (to)))
>> @@ -8909,6 +8912,19 @@ expand_constructor (tree exp, rtx target, enum expand_modifier modifier,
>>        && ! mostly_zeros_p (exp))
>>      return NULL_RTX;
>>
>> +  if (target && GET_CODE (target) == PARALLEL && all_zeros_p (exp))
>> +    {
>> +      int length = XVECLEN (target, 0);
>> +      int start = XEXP (XVECEXP (target, 0, 0), 0) ? 0 : 1;
>> +      for (int i = start; i < length; i++)
>> +       {
>> +         rtx dst = XEXP (XVECEXP (target, 0, i), 0);
>> +         rtx zero = CONST0_RTX (GET_MODE (dst));
>> +         emit_move_insn (dst, zero);
>> +       }
>> +      return target;
>> +    }
>> +
>>    /* Handle calls that pass values in multiple non-contiguous
>>       locations.  The Irix 6 ABI has examples of this.  */
>>    if (target == 0 || ! safe_from_p (target, exp, 1)
>> @@ -10621,6 +10637,157 @@ stmt_is_replaceable_p (gimple *stmt)
>>    return false;
>>  }
>>
>> +/* In the parallel rtx register series REGS, compute the position for given
>> +   {BITPOS, BITSIZE}.
>> +   START_INDEX, END_INDEX, LEFT_BITS and RIGHT_BITS are computed outputs.  */
>> +
>> +void
>> +query_position_in_parallel (HOST_WIDE_INT bitpos, HOST_WIDE_INT bitsize,
>> +                           rtx regs, int &start_index, int &end_index,
>> +                           HOST_WIDE_INT &left_bits, HOST_WIDE_INT &right_bits)
>> +{
>> +  int cur_index = XEXP (XVECEXP (regs, 0, 0), 0) ? 0 : 1;
>> +  for (; cur_index < XVECLEN (regs, 0); cur_index++)
>> +    {
>> +      rtx slot = XVECEXP (regs, 0, cur_index);
>> +      HOST_WIDE_INT off = UINTVAL (XEXP (slot, 1)) * BITS_PER_UNIT;
>> +      machine_mode mode = GET_MODE (XEXP (slot, 0));
>> +      HOST_WIDE_INT size = GET_MODE_BITSIZE (mode).to_constant ();
>> +      if (off <= bitpos && off + size > bitpos)
>> +       {
>> +         start_index = cur_index;
>> +         left_bits = bitpos - off;
>> +       }
>> +      if (off + size >= bitpos + bitsize)
>> +       {
>> +         end_index = cur_index;
>> +         right_bits = off + size - (bitpos + bitsize);
>> +         break;
>> +       }
>> +    }
>> +}
>> +
>> +static rtx
>> +extract_sub_member (rtx regs, HOST_WIDE_INT bitpos, HOST_WIDE_INT bitsize,
>> +                   tree expr)
>> +{
>> +  int start_index = -1;
>> +  int end_index = -1;
>> +  HOST_WIDE_INT left_bits = 0;
>> +  HOST_WIDE_INT right_bits = 0;
>> +  query_position_in_parallel (bitpos, bitsize, regs, start_index, end_index,
>> +                             left_bits, right_bits);
>> +
>> +  machine_mode expr_mode = TYPE_MODE (TREE_TYPE (expr));
>> +  if (end_index > start_index || expr_mode == BLKmode)
>> +    {
>> +      /* TImode in multi-registers.  */
>> +      if (expr_mode == TImode)
>> +       {
>> +         rtx res = gen_reg_rtx (expr_mode);
>> +         HOST_WIDE_INT start;
>> +         start = UINTVAL (XEXP (XVECEXP (regs, 0, start_index), 1));
>> +         for (int index = start_index; index <= end_index; index++)
>> +           {
>> +             rtx reg = XEXP (XVECEXP (regs, 0, index), 0);
>> +             machine_mode mode = GET_MODE (reg);
>> +             HOST_WIDE_INT off;
>> +             off = UINTVAL (XEXP (XVECEXP (regs, 0, index), 1)) - start;
>> +             rtx sub = simplify_gen_subreg (mode, res, expr_mode, off);
>> +             emit_move_insn (sub, reg);
>> +           }
>> +         return res;
>> +       }
>> +
>> +      /* Vector in multi-registers.  */
>> +      if (VECTOR_MODE_P (expr_mode))
>> +       {
>> +         rtvec vector = rtvec_alloc (end_index - start_index + 1);
>> +         machine_mode emode;
>> +         emode = GET_MODE (XEXP (XVECEXP (regs, 0, start_index), 0));
>> +         for (int index = start_index; index <= end_index; index++)
>> +           {
>> +             rtx reg = XEXP (XVECEXP (regs, 0, index), 0);
>> +             gcc_assert (emode == GET_MODE (reg));
>> +             RTVEC_ELT (vector, index - start_index) = reg;
>> +           }
>> +         scalar_int_mode imode;
>> +         machine_mode vmode;
>> +         int nunits = end_index - start_index + 1;
>> +         if (!(int_mode_for_mode (emode).exists (&imode)
>> +               && mode_for_vector (imode, nunits).exists (&vmode)))
>> +           gcc_unreachable ();
>> +
>> +         insn_code icode;
>> +         icode = convert_optab_handler (vec_init_optab, vmode, imode);
>> +         rtx res = gen_reg_rtx (vmode);
>> +         emit_insn (GEN_FCN (icode) (res, gen_rtx_PARALLEL (vmode, vector)));
>> +         if (expr_mode == vmode)
>> +           return res;
>> +         return simplify_gen_subreg (expr_mode, res, vmode, 0);
>> +       }
>> +
>> +      /* Need multi-registers in a parallel for the access.  */
>> +      int num_words = end_index - start_index + 1;
>> +      rtx *tmps = XALLOCAVEC (rtx, num_words);
>> +
>> +      int pos = 0;
>> +      HOST_WIDE_INT start;
>> +      start = UINTVAL (XEXP (XVECEXP (regs, 0, start_index), 1));
>> +      /* Extract whole registers.  */
>> +      for (; pos < num_words; pos++)
>> +       {
>> +         int index = start_index + pos;
>> +         rtx reg = XEXP (XVECEXP (regs, 0, index), 0);
>> +         machine_mode mode = GET_MODE (reg);
>> +         HOST_WIDE_INT off;
>> +         off = UINTVAL (XEXP (XVECEXP (regs, 0, index), 1)) - start;
>> +         tmps[pos] = gen_rtx_EXPR_LIST (mode, reg, GEN_INT (off));
>> +       }
>> +
>> +      rtx res = gen_rtx_PARALLEL (expr_mode, gen_rtvec_v (pos, tmps));
>> +      return res;
>> +    }
>> +
>> +  gcc_assert (end_index == start_index);
>> +
>> +  /* Just need one reg for the access.  */
>> +  if (left_bits == 0 && right_bits == 0)
>> +    {
>> +      rtx reg = XEXP (XVECEXP (regs, 0, start_index), 0);
>> +      if (GET_MODE (reg) != expr_mode)
>> +       reg = gen_lowpart (expr_mode, reg);
>> +      return reg;
>> +    }
>> +
>> +  /* Need to extract bitfield part reg for the access.
>> +     left_bits != 0 or right_bits != 0 */
>> +  rtx reg = XEXP (XVECEXP (regs, 0, start_index), 0);
>> +  bool sgn = TYPE_UNSIGNED (TREE_TYPE (expr));
>> +  scalar_int_mode imode;
>> +  if (!int_mode_for_mode (expr_mode).exists (&imode))
>> +    {
>> +      gcc_assert (false);
>> +      return NULL_RTX;
>> +    }
>> +
>> +  machine_mode mode = GET_MODE (reg);
>> +  bool reverse = false;
>> +  rtx bfld = extract_bit_field (reg, bitsize, left_bits, sgn, NULL_RTX, mode,
>> +                               imode, reverse, NULL);
>> +
>> +  if (GET_MODE (bfld) != imode)
>> +    bfld = gen_lowpart (imode, bfld);
>> +
>> +  if (expr_mode == imode)
>> +    return bfld;
>> +
>> +  /* expr_mode != imode, e.g. SF != SI.  */
>> +  rtx result = gen_reg_rtx (imode);
>> +  emit_move_insn (result, bfld);
>> +  return gen_lowpart (expr_mode, result);
>> +}
>> +
>>  rtx
>>  expand_expr_real_1 (tree exp, rtx target, machine_mode tmode,
>>                     enum expand_modifier modifier, rtx *alt_rtl,
>> @@ -11498,6 +11665,16 @@ expand_expr_real_1 (tree exp, rtx target, machine_mode tmode,
>>           = expand_expr_real (tem, tem_target, VOIDmode, tem_modifier, NULL,
>>                               true);
>>
>> +       /* It is scalarizable access on param which is passed by registers.  */
>> +       if (GET_CODE (op0) == PARALLEL
>> +           && (TREE_CODE (tem) == PARM_DECL || VAR_P (tem)))
>> +         {
>> +           HOST_WIDE_INT pos, size;
>> +           size = bitsize.to_constant ();
>> +           pos = bitpos.to_constant ();
>> +           return extract_sub_member (op0, pos, size, exp);
>> +         }
>> +
>>         /* If the field has a mode, we want to access it in the
>>            field's mode, not the computed mode.
>>            If a MEM has VOIDmode (external with incomplete type),
>> diff --git a/gcc/expr.h b/gcc/expr.h
>> index 2a172867fdb..8a9332aaad6 100644
>> --- a/gcc/expr.h
>> +++ b/gcc/expr.h
>> @@ -362,5 +362,8 @@ extern rtx expr_size (tree);
>>
>>  extern bool mem_ref_refers_to_non_mem_p (tree);
>>  extern bool non_mem_decl_p (tree);
>> +extern void query_position_in_parallel (HOST_WIDE_INT, HOST_WIDE_INT, rtx,
>> +                                       int &, int &, HOST_WIDE_INT &,
>> +                                       HOST_WIDE_INT &);
>>
>>  #endif /* GCC_EXPR_H */
>> diff --git a/gcc/function.cc b/gcc/function.cc
>> index afb0b33da9e..518250b2728 100644
>> --- a/gcc/function.cc
>> +++ b/gcc/function.cc
>> @@ -3107,8 +3107,29 @@ assign_parm_setup_block (struct assign_parm_data_all *all,
>>           emit_move_insn (mem, entry_parm);
>>         }
>>        else
>> -       move_block_from_reg (REGNO (entry_parm), mem,
>> -                            size_stored / UNITS_PER_WORD);
>> +       {
>> +         int regno = REGNO (entry_parm);
>> +         int nregs = size_stored / UNITS_PER_WORD;
>> +         rtx *tmps = XALLOCAVEC (rtx, nregs);
>> +         machine_mode mode = word_mode;
>> +         HOST_WIDE_INT word_size = GET_MODE_SIZE (mode).to_constant ();
>> +         for (int i = 0; i < nregs; i++)
>> +           {
>> +             rtx reg = gen_rtx_REG (mode, regno + i);
>> +             rtx off = GEN_INT (word_size * i);
>> +             tmps[i] = gen_rtx_EXPR_LIST (VOIDmode, reg, off);
>> +           }
>> +
>> +         rtx regs = gen_rtx_PARALLEL (BLKmode, gen_rtvec_v (nregs, tmps));
>> +         if (scalarizable_aggregate (parm, regs))
>> +           {
>> +             rtx pseudos = gen_group_rtx (regs);
>> +             emit_group_move (pseudos, regs);
>> +             stack_parm = pseudos;
>> +           }
>> +         else
>> +           move_block_from_reg (regno, mem, nregs);
>> +       }
>>      }
>>    else if (data->stack_parm == 0 && !TYPE_EMPTY_P (data->arg.type))
>>      {
>> @@ -3710,7 +3731,15 @@ assign_parms (tree fndecl)
>>
>>        assign_parm_adjust_stack_rtl (&data);
>>
>> -      if (assign_parm_setup_block_p (&data))
>> +      rtx incoming = DECL_INCOMING_RTL (parm);
>> +      if (GET_CODE (incoming) == PARALLEL
>> +         && scalarizable_aggregate (parm, incoming))
>> +       {
>> +         rtx pseudos = gen_group_rtx (incoming);
>> +         emit_group_move (pseudos, incoming);
>> +         set_parm_rtl (parm, pseudos);
>> +       }
>> +      else if (assign_parm_setup_block_p (&data))
>>         assign_parm_setup_block (&all, parm, &data);
>>        else if (data.arg.pass_by_reference || use_register_for_decl (parm))
>>         assign_parm_setup_reg (&all, parm, &data);
>> @@ -5128,6 +5157,7 @@ expand_function_start (tree subr)
>>             {
>>               gcc_assert (GET_CODE (hard_reg) == PARALLEL);
>>               set_parm_rtl (res, gen_group_rtx (hard_reg));
>> +             set_scalar_rtx_for_returns ();
>>             }
>>         }
>>
>> diff --git a/gcc/tree-sra.h b/gcc/tree-sra.h
>> index f20266c4622..df3071ccf6e 100644
>> --- a/gcc/tree-sra.h
>> +++ b/gcc/tree-sra.h
>> @@ -19,6 +19,82 @@ You should have received a copy of the GNU General Public License
>>  along with GCC; see the file COPYING3.  If not see
>>  <http://www.gnu.org/licenses/>.  */
>>
>> +struct sra_base_access
>> +{
>> +  /* Values returned by get_ref_base_and_extent, indicates the
>> +     OFFSET, SIZE and BASE of the access.  */
>> +  HOST_WIDE_INT offset;
>> +  HOST_WIDE_INT size;
>> +
>> +  /* The context expression of this access.  */
>> +  tree expr;
>> +
>> +  /* Indicates this is a write access.  */
>> +  bool write : 1;
>> +
>> +  /* Indicates if this access is made in reverse storage order.  */
>> +  bool reverse : 1;
>> +};
>> +
>> +/* Default template for sra_scan_function.  */
>> +
>> +struct sra_default_analyzer
>> +{
>> +  /* Template analyze functions.  */
>> +  void analyze_phi (gphi *){};
>> +  void pre_analyze_stmt (gimple *){};
>> +  void analyze_return (greturn *){};
>> +  void analyze_assign (gassign *){};
>> +  void analyze_call (gcall *){};
>> +  void analyze_asm (gasm *){};
>> +  void analyze_default_stmt (gimple *){};
>> +};
>> +
>> +/* Scan function and look for interesting expressions.  */
>> +
>> +template <typename analyzer>
>> +void
>> +scan_function (struct function *fun, analyzer &a)
>> +{
>> +  basic_block bb;
>> +  FOR_EACH_BB_FN (bb, fun)
>> +    {
>> +      for (gphi_iterator gsi = gsi_start_phis (bb); !gsi_end_p (gsi);
>> +          gsi_next (&gsi))
>> +       a.analyze_phi (gsi.phi ());
>> +
>> +      for (gimple_stmt_iterator gsi = gsi_start_bb (bb); !gsi_end_p (gsi);
>> +          gsi_next (&gsi))
>> +       {
>> +         gimple *stmt = gsi_stmt (gsi);
>> +         a.pre_analyze_stmt (stmt);
>> +
>> +         switch (gimple_code (stmt))
>> +           {
>> +           case GIMPLE_RETURN:
>> +             a.analyze_return (as_a<greturn *> (stmt));
>> +             break;
>> +
>> +           case GIMPLE_ASSIGN:
>> +             a.analyze_assign (as_a<gassign *> (stmt));
>> +             break;
>> +
>> +           case GIMPLE_CALL:
>> +             a.analyze_call (as_a<gcall *> (stmt));
>> +             break;
>> +
>> +           case GIMPLE_ASM:
>> +             a.analyze_asm (as_a<gasm *> (stmt));
>> +             break;
>> +
>> +           default:
>> +             a.analyze_default_stmt (stmt);
>> +             break;
>> +           }
>> +       }
>> +    }
>> +}
>> +
>>  bool type_internals_preclude_sra_p (tree type, const char **msg);
>>
>>  /* Return true iff TYPE is stdarg va_list type (which early SRA and IPA-SRA
>> diff --git a/gcc/var-tracking.cc b/gcc/var-tracking.cc
>> index d8dafa5481a..7fc801f2612 100644
>> --- a/gcc/var-tracking.cc
>> +++ b/gcc/var-tracking.cc
>> @@ -5352,7 +5352,8 @@ track_loc_p (rtx loc, tree expr, poly_int64 offset, bool store_reg_p,
>>       because the real and imaginary parts are represented as separate
>>       pseudo registers, even if the whole complex value fits into one
>>       hard register.  */
>> -  if ((paradoxical_subreg_p (mode, DECL_MODE (expr))
>> +  if (((DECL_MODE (expr) != BLKmode
>> +       && paradoxical_subreg_p (mode, DECL_MODE (expr)))
>>         || (store_reg_p
>>            && !COMPLEX_MODE_P (DECL_MODE (expr))
>>            && hard_regno_nregs (REGNO (loc), DECL_MODE (expr)) == 1))
>> diff --git a/gcc/testsuite/g++.target/powerpc/pr102024.C b/gcc/testsuite/g++.target/powerpc/pr102024.C
>> index 769585052b5..c8995cae707 100644
>> --- a/gcc/testsuite/g++.target/powerpc/pr102024.C
>> +++ b/gcc/testsuite/g++.target/powerpc/pr102024.C
>> @@ -5,7 +5,7 @@
>>  // Test that a zero-width bit field in an otherwise homogeneous aggregate
>>  // generates a psabi warning and passes arguments in GPRs.
>>
>> -// { dg-final { scan-assembler-times {\mstd\M} 4 } }
>> +// { dg-final { scan-assembler-times {\mmtvsrd\M} 4 } }
>>
>>  struct a_thing
>>  {
>> diff --git a/gcc/testsuite/gcc.target/i386/pr20020-2.c b/gcc/testsuite/gcc.target/i386/pr20020-2.c
>> index fa8cb2528c5..723f1826630 100644
>> --- a/gcc/testsuite/gcc.target/i386/pr20020-2.c
>> +++ b/gcc/testsuite/gcc.target/i386/pr20020-2.c
>> @@ -15,10 +15,15 @@ struct shared_ptr_struct
>>  };
>>  typedef struct shared_ptr_struct sptr_t;
>>
>> +void foo (sptr_t *);
>> +
>>  void
>>  copy_sptr (sptr_t *dest, sptr_t src)
>>  {
>>    *dest = src;
>> +
>> +  /* Prevent 'src' to be scalarized as registers.  */
>> +  foo (&src);
>>  }
>>
>>  /* { dg-final { scan-rtl-dump "\\\(set \\\(reg:TI \[0-9\]*" "expand" } } */
>> diff --git a/gcc/testsuite/gcc.target/powerpc/pr108073.c b/gcc/testsuite/gcc.target/powerpc/pr108073.c
>> new file mode 100644
>> index 00000000000..293bf93fb9a
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/powerpc/pr108073.c
>> @@ -0,0 +1,29 @@
>> +/* { dg-do run } */
>> +/* { dg-options "-O2 -save-temps" } */
>> +
>> +typedef struct DF {double a[4]; short s1; short s2; short s3; short s4; } DF;
>> +typedef struct SF {float a[4]; int i1; int i2; } SF;
>> +
>> +/* { dg-final { scan-assembler-times {\mmtvsrd|mtvsrws\M} 3 {target { has_arch_ppc64 && has_arch_pwr8 } } } } */
>> +/* { dg-final { scan-assembler-not {\mlwz\M} {target { has_arch_ppc64 && has_arch_pwr8 } } } } */
>> +/* { dg-final { scan-assembler-not {\mlhz\M} {target { has_arch_ppc64 && has_arch_pwr8 } } } } */
>> +short  __attribute__ ((noipa)) foo_hi (DF a, int flag){if (flag == 2)return a.s2+a.s3;return 0;}
>> +int  __attribute__ ((noipa)) foo_si (SF a, int flag){if (flag == 2)return a.i2+a.i1;return 0;}
>> +double __attribute__ ((noipa)) foo_df (DF arg, int flag){if (flag == 2)return arg.a[3];else return 0.0;}
>> +float  __attribute__ ((noipa)) foo_sf (SF arg, int flag){if (flag == 2)return arg.a[2]; return 0;}
>> +float  __attribute__ ((noipa)) foo_sf1 (SF arg, int flag){if (flag == 2)return arg.a[1];return 0;}
>> +
>> +DF gdf = {{1.0,2.0,3.0,4.0}, 1, 2, 3, 4};
>> +SF gsf = {{1.0f,2.0f,3.0f,4.0f}, 1, 2};
>> +
>> +int main()
>> +{
>> +  if (!(foo_hi (gdf, 2) == 5 && foo_si (gsf, 2) == 3 && foo_df (gdf, 2) == 4.0
>> +       && foo_sf (gsf, 2) == 3.0 && foo_sf1 (gsf, 2) == 2.0))
>> +    __builtin_abort ();
>> +  if (!(foo_hi (gdf, 1) == 0 && foo_si (gsf, 1) == 0 && foo_df (gdf, 1) == 0
>> +       && foo_sf (gsf, 1) == 0 && foo_sf1 (gsf, 1) == 0))
>> +    __builtin_abort ();
>> +  return 0;
>> +}
>> +
>> diff --git a/gcc/testsuite/gcc.target/powerpc/pr65421-1.c b/gcc/testsuite/gcc.target/powerpc/pr65421-1.c
>> new file mode 100644
>> index 00000000000..4e1f87f7939
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/powerpc/pr65421-1.c
>> @@ -0,0 +1,6 @@
>> +/* PR target/65421 */
>> +/* { dg-options "-O2" } */
>> +
>> +typedef struct LARGE {double a[4]; int arr[32];} LARGE;
>> +LARGE foo (LARGE a){return a;}
>> +/* { dg-final { scan-assembler-times {\mmemcpy\M} 1 } } */
>> diff --git a/gcc/testsuite/gcc.target/powerpc/pr65421-2.c b/gcc/testsuite/gcc.target/powerpc/pr65421-2.c
>> new file mode 100644
>> index 00000000000..8a8e1a0e996
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/powerpc/pr65421-2.c
>> @@ -0,0 +1,32 @@
>> +/* PR target/65421 */
>> +/* { dg-options "-O2" } */
>> +/* { dg-require-effective-target powerpc_elfv2 } */
>> +/* { dg-require-effective-target has_arch_ppc64 } */
>> +
>> +typedef struct FLOATS
>> +{
>> +  double a[3];
>> +} FLOATS;
>> +
>> +/* 3 lfd after returns also optimized */
>> +/* FLOATS ret_arg_pt (FLOATS *a){return *a;} */
>> +
>> +/* 3 stfd */
>> +void st_arg (FLOATS a, FLOATS *p) {*p = a;}
>> +/* { dg-final { scan-assembler-times {\mstfd\M} 3 } } */
>> +
>> +/* blr */
>> +FLOATS ret_arg (FLOATS a) {return a;}
>> +
>> +typedef struct MIX
>> +{
>> +  double a[2];
>> +  long l;
>> +} MIX;
>> +
>> +/* std 3 param regs to return slot */
>> +MIX ret_arg1 (MIX a) {return a;}
>> +/* { dg-final { scan-assembler-times {\mstd\M} 3 } } */
>> +
>> +/* count insns */
>> +/* { dg-final { scan-assembler-times {(?n)^\s+[a-z]} 9 } } */
>> --
>> 2.25.1
>>
Richard Biener Nov. 28, 2023, 10:52 a.m. UTC | #4
On Tue, 28 Nov 2023, Jiufu Guo wrote:

> 
> Hi,
> 
> Thanks so much for your helpful review!
> 
> Richard Biener <richard.guenther@gmail.com> writes:
> 
> > On Fri, Oct 27, 2023 at 3:51?AM Jiufu Guo <guojiufu@linux.ibm.com> wrote:
> >>
> >> Hi,
> >>
> >> Compare with previous version:
> >> https://gcc.gnu.org/pipermail/gcc-patches/2023-October/632399.html
> >> This verion supports TI/VEC mode of the access.
> >>
> >> There are a few PRs (meta-bug PR101926) on various targets.
> >> The root causes of them are similar: the aggeragte param/
> >> returns are passed by multi-registers, but they are stored
> >> to stack from registers first; and then, access the
> >> parameter through stack slot.
> >>
> >> A general idea to enhance this: accessing the aggregate
> >> parameters/returns directly through incoming/outgoing
> >> scalar registers.  This idea would be a kind of SRA.
> >>
> >> This experimental patch for light-expander-sra contains
> >> below parts:
> >>
> >> a. Check if the parameters/returns are ok/profitable to
> >>    scalarize, and set the incoming/outgoing registers(
> >>    pseudos) for the parameter/return.
> >>   - This is done in "expand_function_start", after the
> >>     incoming/outgoing hard registers are determined for the
> >>     paramter/return.
> >>     The scalarized registers are recorded in DECL_RTL for
> >>     the parameter/return in parallel form.
> >>   - At the time when setting DECL_RTL, "scalarizable_aggregate"
> >>     is called to check the accesses are ok/profitable to
> >>     scalarize.
> >>     We can continue to enhance this function, to support
> >>     more cases.  For example:
> >>     - 'reverse storage order'.
> >>     - 'writing to parameter'/'overlap accesses'.
> >>
> >> b. When expanding the accesses of the parameters/returns,
> >>    according to the info of the access(e.g. bitpos,bitsize,
> >>    mode), the scalar(pseudos) can be figured out to expand
> >>    the access.  This may happen when expand below accesses:
> >>   - The component access of a parameter: "_1 = arg.f1".
> >>     Or whole parameter access: rhs of "_2 = arg"
> >>   - The assignment to a return val:
> >>     "D.xx = yy; or D.xx.f = zz" where D.xx occurs on return
> >>     stmt.
> >>   - This is mainly done in expr.cc(expand_expr_real_1, and
> >>     expand_assignment).  Function "extract_sub_member" is
> >>     used to figure out the scalar rtxs(pseudos).
> >>
> >> Besides the above two parts, some work are done in the GIMPLE
> >> tree:  collect sra candidates for parameters/returns, and
> >> collect the SRA access info.
> >> This is mainly done at the beginning of the expander pass.
> >> Below are two major items of this part.
> >>  - Collect light-expand-sra candidates.
> >>   Each parameter is checked if it has the proper aggregate
> >>   type.  Collect return val (VAR_P) on each return stmts if
> >>   the function is returning via registers.
> >>   This is implemented in expand_sra::collect_sra_candidates.
> >>
> >>  - Build/collect/manage all the access on the candidates.
> >>   The function "scan_function" is used to do this work, it
> >>   goes through all basicblocks, and all interesting stmts (
> >>   phi, return, assign, call, asm) are checked.
> >>   If there is an interesting expression (e.g. COMPONENT_REF
> >>   or PARM_DECL), then record the required info for the access
> >>   (e.g. pos, size, type, base).
> >>   And if it is risky to do SRA, the candidates may be removed.
> >>   e.g. address-taken and accessed via memory.
> >>   "foo(struct S arg) {bar (&arg);}"
> >>
> >> This patch is tested on ppc64{,le} and x86_64.
> >> Is this ok for trunk?
> >>
> >> BR,
> >> Jeff (Jiufu Guo)
> >>
> >>         PR target/65421
> >>
> >> gcc/ChangeLog:
> >>
> >>         * cfgexpand.cc (struct access): New class.
> >>         (struct expand_sra): New class.
> >>         (expand_sra::collect_sra_candidates): New member function.
> >>         (expand_sra::add_sra_candidate): Likewise.
> >>         (expand_sra::build_access): Likewise.
> >>         (expand_sra::analyze_phi): Likewise.
> >>         (expand_sra::analyze_assign): Likewise.
> >>         (expand_sra::visit_base): Likewise.
> >>         (expand_sra::protect_mem_access_in_stmt): Likewise.
> >>         (expand_sra::expand_sra):  Class constructor.
> >>         (expand_sra::~expand_sra): Class destructor.
> >>         (expand_sra::scalarizable_access):  New member function.
> >>         (expand_sra::scalarizable_accesses):  Likewise.
> >>         (scalarizable_aggregate):  New function.
> >>         (set_scalar_rtx_for_returns):  New function.
> >>         (expand_value_return): Updated.
> >>         (expand_debug_expr): Updated.
> >>         (pass_expand::execute): Updated to use expand_sra.
> >>         * cfgexpand.h (scalarizable_aggregate): New declare.
> >>         (set_scalar_rtx_for_returns): New declare.
> >>         * expr.cc (expand_assignment): Updated.
> >>         (expand_constructor): Updated.
> >>         (query_position_in_parallel): New function.
> >>         (extract_sub_member): New function.
> >>         (expand_expr_real_1): Updated.
> >>         * expr.h (query_position_in_parallel): New declare.
> >>         * function.cc (assign_parm_setup_block): Updated.
> >>         (assign_parms): Updated.
> >>         (expand_function_start): Updated.
> >>         * tree-sra.h (struct sra_base_access): New class.
> >>         (struct sra_default_analyzer): New class.
> >>         (scan_function): New template function.
> >>         * var-tracking.cc (track_loc_p): Updated.
> >>
> >> gcc/testsuite/ChangeLog:
> >>
> >>         * g++.target/powerpc/pr102024.C: Updated
> >>         * gcc.target/powerpc/pr108073.c: New test.
> >>         * gcc.target/powerpc/pr65421-1.c: New test.
> >>         * gcc.target/powerpc/pr65421-2.c: New test.
> >>
> >> ---
> >>  gcc/cfgexpand.cc                             | 352 ++++++++++++++++++-
> >>  gcc/cfgexpand.h                              |   2 +
> >>  gcc/expr.cc                                  | 179 +++++++++-
> >>  gcc/expr.h                                   |   3 +
> >>  gcc/function.cc                              |  36 +-
> >>  gcc/tree-sra.h                               |  76 ++++
> >>  gcc/var-tracking.cc                          |   3 +-
> >>  gcc/testsuite/g++.target/powerpc/pr102024.C  |   2 +-
> >>  gcc/testsuite/gcc.target/i386/pr20020-2.c    |   5 +
> >>  gcc/testsuite/gcc.target/powerpc/pr108073.c  |  29 ++
> >>  gcc/testsuite/gcc.target/powerpc/pr65421-1.c |   6 +
> >>  gcc/testsuite/gcc.target/powerpc/pr65421-2.c |  32 ++
> >>  12 files changed, 718 insertions(+), 7 deletions(-)
> >>  create mode 100644 gcc/testsuite/gcc.target/powerpc/pr108073.c
> >>  create mode 100644 gcc/testsuite/gcc.target/powerpc/pr65421-1.c
> >>  create mode 100644 gcc/testsuite/gcc.target/powerpc/pr65421-2.c
> >>
> >> diff --git a/gcc/cfgexpand.cc b/gcc/cfgexpand.cc
> >> index 4262703a138..ef99ca8ac13 100644
> >> --- a/gcc/cfgexpand.cc
> >> +++ b/gcc/cfgexpand.cc
> >> @@ -74,6 +74,7 @@ along with GCC; see the file COPYING3.  If not see
> >>  #include "output.h"
> >>  #include "builtins.h"
> >>  #include "opts.h"
> >> +#include "tree-sra.h"
> >>
> >>  /* Some systems use __main in a way incompatible with its use in gcc, in these
> >>     cases use the macros NAME__MAIN to give a quoted symbol and SYMBOL__MAIN to
> >> @@ -97,6 +98,343 @@ static bool defer_stack_allocation (tree, bool);
> >>
> >>  static void record_alignment_for_reg_var (unsigned int);
> >>
> >> +/* For light SRA in expander about paramaters and returns.  */
> >> +struct access : public sra_base_access
> >> +{
> >> +};
> >> +
> >> +typedef struct access *access_p;
> >> +
> >> +struct expand_sra : public sra_default_analyzer
> >> +{
> >> +  /* Construct/destruct resources, e.g. sra candidates.  */
> >> +  expand_sra ();
> >> +  ~expand_sra ();
> >> +
> >> +  /* No actions for pre_analyze_stmt, analyze_return.  */
> >> +
> >> +  /* Overwrite phi,call,asm analyzations.  */
> >> +  void analyze_phi (gphi *s);
> >> +
> >> +  /* TODO: Check accesses on call/asm.  */
> >> +  void analyze_call (gcall *s) { protect_mem_access_in_stmt (s); };
> >> +  void analyze_asm (gasm *s) { protect_mem_access_in_stmt (s); };
> >> +
> >> +  /* Check access of SRA on assignment.  */
> >> +  void analyze_assign (gassign *);
> >> +
> >> +  /* Check if the accesses of BASE(parameter or return) are
> >> +     scalarizable, according to the incoming/outgoing REGS. */
> >> +  bool scalarizable_accesses (tree base, rtx regs);
> >> +
> >> +private:
> >> +  /* Collect the parameter and returns to check if they are suitable for
> >> +     scalarization.  */
> >> +  bool collect_sra_candidates (void);
> >> +
> >> +  /* Return true if VAR is added as a candidate for SRA.  */
> >> +  bool add_sra_candidate (tree var);
> >> +
> >> +  /* Return true if EXPR has interesting sra access, and created access,
> >> +     return false otherwise.  */
> >> +  access_p build_access (tree expr, bool write);
> >> +
> >> +  /* Check if the access ACC is scalarizable.  REGS is the incoming/outgoing
> >> +     registers which the access is based on. */
> >> +  bool scalarizable_access (access_p acc, rtx regs, bool is_parm);
> >> +
> >> +  /* If there is risk (stored/loaded or addr taken),
> >> +     disqualify the sra candidates in the un-interesting STMT. */
> >> +  void protect_mem_access_in_stmt (gimple *stmt);
> >> +
> >> +  /* Callback of walk_stmt_load_store_addr_ops, used to remove
> >> +     unscalarizable accesses.  */
> >> +  static bool visit_base (gimple *, tree op, tree, void *data);
> >> +
> >> +  /* Base (tree) -> Vector (vec<access_p> *) map.  */
> >> +  hash_map<tree, auto_vec<access_p> > *base_access_vec;
> >> +};
> >> +
> >> +bool
> >> +expand_sra::collect_sra_candidates (void)
> >> +{
> >> +  bool ret = false;
> >> +
> >> +  /* Collect parameters.  */
> >> +  for (tree parm = DECL_ARGUMENTS (current_function_decl); parm;
> >> +       parm = DECL_CHAIN (parm))
> >> +    ret |= add_sra_candidate (parm);
> >> +
> >> +  /* Collect VARs on returns.  */
> >> +  if (DECL_RESULT (current_function_decl))
> >> +    {
> >> +      edge_iterator ei;
> >> +      edge e;
> >> +      FOR_EACH_EDGE (e, ei, EXIT_BLOCK_PTR_FOR_FN (cfun)->preds)
> >> +       if (greturn *r = safe_dyn_cast<greturn *> (*gsi_last_bb (e->src)))
> >> +         {
> >> +           tree val = gimple_return_retval (r);
> >> +           /* To sclaraized the return, the return value should be only
> >> +              writen (except this return stmt).  Using 'true(write)' to
> >> +              pretend the access only be 'writen'. */
> >> +           if (val && VAR_P (val))
> >> +             ret |= add_sra_candidate (val) && build_access (val, true);
> >> +         }
> >> +    }
> >> +
> >> +  return ret;
> >> +}
> >> +
> >> +bool
> >> +expand_sra::add_sra_candidate (tree var)
> >> +{
> >> +  tree type = TREE_TYPE (var);
> >> +
> >> +  if (!AGGREGATE_TYPE_P (type) || !tree_fits_shwi_p (TYPE_SIZE (type))
> >> +      || tree_to_shwi (TYPE_SIZE (type)) == 0 || TREE_THIS_VOLATILE (var)
> >> +      || is_va_list_type (type))
> >> +    return false;
> >> +
> >> +  base_access_vec->get_or_insert (var);
> >> +
> >> +  return true;
> >> +}
> >> +
> >> +access_p
> >> +expand_sra::build_access (tree expr, bool write)
> >> +{
> >> +  enum tree_code code = TREE_CODE (expr);
> >> +  if (code != VAR_DECL && code != PARM_DECL && code != COMPONENT_REF
> >> +      && code != ARRAY_REF && code != ARRAY_RANGE_REF)
> >> +    return NULL;
> >> +
> >> +  HOST_WIDE_INT offset, size;
> >> +  bool reverse;
> >> +  tree base = get_ref_base_and_extent_hwi (expr, &offset, &size, &reverse);
> >> +  if (!base || !DECL_P (base))
> >> +    return NULL;
> >> +
> >> +  vec<access_p> *access_vec = base_access_vec->get (base);
> >> +  if (!access_vec)
> >> +    return NULL;
> >> +
> >> +  /* TODO: support reverse. */
> >> +  if (reverse || size <= 0 || offset + size > tree_to_shwi (DECL_SIZE (base)))
> >> +    {
> >> +      base_access_vec->remove (base);
> >> +      return NULL;
> >> +    }
> >> +
> >> +  struct access *access = XNEWVEC (struct access, 1);
> >> +
> >> +  memset (access, 0, sizeof (struct access));
> >> +  access->offset = offset;
> >> +  access->size = size;
> >> +  access->expr = expr;
> >> +  access->write = write;
> >> +  access->reverse = reverse;
> >> +
> >> +  access_vec->safe_push (access);
> >> +
> >> +  return access;
> >> +}
> >
> > I still think there's SRA code we can share for producing
> > accesses from a stmt.
> 
> Yeap, there is similar code in existing SRAs. To common the code is a
> todo item. 
> One potential issue, for the example: "analyze/scan expr and
> build/create access", there are some small/scatter (e.g. the order of
> if conditions) difference may cause different behavior. This may cause
> the common code un-pretty.
> 
> >
> >> +
> >> +/* Function protect_mem_access_in_stmt removes the SRA candidates if
> >> +   there is addr-taken on the candidate in the STMT.  */
> >> +
> >> +void
> >> +expand_sra::analyze_phi (gphi *stmt)
> >> +{
> >> +  if (base_access_vec && !base_access_vec->is_empty ())
> >> +    walk_stmt_load_store_addr_ops (stmt, this, NULL, NULL, visit_base);
> >> +}
> >> +
> >> +void
> >> +expand_sra::analyze_assign (gassign *stmt)
> >> +{
> >> +  if (!base_access_vec || base_access_vec->is_empty ())
> >> +    return;
> >> +
> >> +  if (gimple_assign_single_p (stmt) && !gimple_clobber_p (stmt))
> >> +    {
> >> +      tree rhs = gimple_assign_rhs1 (stmt);
> >> +      tree lhs = gimple_assign_lhs (stmt);
> >> +      bool res_r = build_access (rhs, false);
> >> +      bool res_l = build_access (lhs, true);
> >> +
> >> +      if (res_l || res_r)
> >> +       return;
> >> +    }
> >> +
> >> +  protect_mem_access_in_stmt (stmt);
> >> +}
> >> +
> >> +/* Callback of walk_stmt_load_store_addr_ops, used to remove
> >> +   unscalarizable accesses.  Called by protect_mem_access_in_stmt.  */
> >> +
> >> +bool
> >> +expand_sra::visit_base (gimple *, tree op, tree, void *data)
> >> +{
> >> +  op = get_base_address (op);
> >> +  if (op && DECL_P (op))
> >> +    {
> >> +      expand_sra *p = (expand_sra *) data;
> >> +      p->base_access_vec->remove (op);
> >> +    }
> >> +  return false;
> >> +}
> >> +
> >> +/* Function protect_mem_access_in_stmt removes the SRA candidates if
> >> +   there is store/load/addr-taken on the candidate in the STMT.
> >> +
> >> +   For some statements, which SRA does not care about, if there are
> >> +   possible memory operation on the SRA candidates, it would be risky
> >> +   to scalarize it.  */
> >> +
> >> +void
> >> +expand_sra::protect_mem_access_in_stmt (gimple *stmt)
> >> +{
> >> +  if (base_access_vec && !base_access_vec->is_empty ())
> >> +    walk_stmt_load_store_addr_ops (stmt, this, visit_base, visit_base,
> >> +                                  visit_base);
> >> +}
> >> +
> >> +expand_sra::expand_sra () : base_access_vec (NULL)
> >> +{
> >> +  if (optimize <= 0)
> >> +    return;
> >> +
> >> +  base_access_vec = new hash_map<tree, auto_vec<access_p> >;
> >> +  collect_sra_candidates ();
> >> +}
> >> +
> >> +expand_sra::~expand_sra ()
> >> +{
> >> +  if (optimize <= 0)
> >> +    return;
> >> +
> >> +  delete base_access_vec;
> >> +}
> >> +
> >> +bool
> >> +expand_sra::scalarizable_access (access_p acc, rtx regs, bool is_parm)
> >> +{
> >> +  /* Now only support reading from parms
> >> +     or writing to returns.  */
> >> +  if (is_parm && acc->write)
> >> +    return false;
> >> +  if (!is_parm && !acc->write)
> >> +    return false;
> >> +
> >> +  /* Compute the position of the access in the parallel regs.  */
> >> +  int start_index = -1;
> >> +  int end_index = -1;
> >> +  HOST_WIDE_INT left_bits = 0;
> >> +  HOST_WIDE_INT right_bits = 0;
> >> +  query_position_in_parallel (acc->offset, acc->size, regs, start_index,
> >> +                             end_index, left_bits, right_bits);
> >> +
> >> +  /* Invalid access possition: padding or outof bound.  */
> >> +  if (start_index < 0 || end_index < 0)
> >> +    return false;
> >> +
> >> +  machine_mode expr_mode = TYPE_MODE (TREE_TYPE (acc->expr));
> >> +  /* Need multi-registers in a parallel for the access.  */
> >> +  if (expr_mode == BLKmode || end_index > start_index)
> >> +    {
> >> +      if (left_bits || right_bits)
> >> +       return false;
> >> +      if (expr_mode == BLKmode)
> >> +       return true;
> >> +
> >> +      /* For large modes, only support TI/VECTOR in mult-registers. */
> >> +      if (known_gt (acc->size, GET_MODE_BITSIZE (word_mode)))
> >> +       return expr_mode == TImode || VECTOR_MODE_P (expr_mode);
> >> +      return true;
> >> +    }
> >> +
> >> +  gcc_assert (end_index == start_index);
> >> +
> >> +  /* Just need one reg for the access.  */
> >> +  if (left_bits == 0 && right_bits == 0)
> >> +    return true;
> >> +
> >> +  scalar_int_mode imode;
> >> +  /* Need to extract bits from the reg for the access.  */
> >> +  return !acc->write && int_mode_for_mode (expr_mode).exists (&imode);
> >> +}
> >> +
> >> +/* Now, the base (parm/return) is scalarizable, only if all
> >> +   accesses of the BASE are scalariable.
> >> +
> >> +   This function need to be updated, to support more complicate
> >> +   cases, like:
> >> +   - Some access are scalarizable, but some are not.
> >> +   - Access is writing to a parameter.
> >> +   - Writing accesses are overlap with multi-accesses.   */
> >> +
> >> +bool
> >> +expand_sra::scalarizable_accesses (tree base, rtx regs)
> >> +{
> >> +  if (!base_access_vec)
> >> +    return false;
> >> +  vec<access_p> *access_vec = base_access_vec->get (base);
> >> +  if (!access_vec)
> >> +    return false;
> >> +  if (access_vec->is_empty ())
> >> +    return false;
> >> +
> >> +  bool is_parm = TREE_CODE (base) == PARM_DECL;
> >> +  int n = access_vec->length ();
> >> +  int cur_access_index = 0;
> >> +  for (; cur_access_index < n; cur_access_index++)
> >> +    if (!scalarizable_access ((*access_vec)[cur_access_index], regs, is_parm))
> >> +      break;
> >> +
> >> +  /* It is ok if all access are scalarizable.  */
> >> +  if (cur_access_index == n)
> >> +    return true;
> >> +
> >> +  base_access_vec->remove (base);
> >> +  return false;
> >> +}
> >> +
> >> +static expand_sra *current_sra = NULL;
> >> +
> >> +/* Check if the PARAM (or return) is scalarizable.
> >> +
> >> +   This interface is used in expand_function_start
> >> +   to check sra possiblity for parmeters. */
> >> +
> >> +bool
> >> +scalarizable_aggregate (tree parm, rtx regs)
> >> +{
> >> +  if (!current_sra)
> >> +    return false;
> >> +  return current_sra->scalarizable_accesses (parm, regs);
> >> +}
> >> +
> >> +/* Check if interesting returns, and if they are scalarizable,
> >> +   set DECL_RTL as scalar registers.
> >> +
> >> +   This interface is used in expand_function_start
> >> +   when outgoing registers are determinded for DECL_RESULT.  */
> >> +
> >> +void
> >> +set_scalar_rtx_for_returns ()
> >> +{
> >> +  rtx res = DECL_RTL (DECL_RESULT (current_function_decl));
> >> +  edge_iterator ei;
> >> +  edge e;
> >> +  FOR_EACH_EDGE (e, ei, EXIT_BLOCK_PTR_FOR_FN (cfun)->preds)
> >> +    if (greturn *r = safe_dyn_cast<greturn *> (*gsi_last_bb (e->src)))
> >> +      {
> >> +       tree val = gimple_return_retval (r);
> >> +       if (val && VAR_P (val) && scalarizable_aggregate (val, res))
> >> +         SET_DECL_RTL (val, res);
> >> +      }
> >> +}
> >> +
> >>  /* Return an expression tree corresponding to the RHS of GIMPLE
> >>     statement STMT.  */
> >>
> >> @@ -3707,7 +4045,7 @@ expand_value_return (rtx val)
> >>
> >>    tree decl = DECL_RESULT (current_function_decl);
> >>    rtx return_reg = DECL_RTL (decl);
> >> -  if (return_reg != val)
> >> +  if (!rtx_equal_p (return_reg, val))
> >>      {
> >>        tree funtype = TREE_TYPE (current_function_decl);
> >>        tree type = TREE_TYPE (decl);
> >> @@ -4423,6 +4761,12 @@ expand_debug_expr (tree exp)
> >>    addr_space_t as;
> >>    scalar_int_mode op0_mode, op1_mode, addr_mode;
> >>
> >> +  /* TODO: Enable to debug expand-sra optimized parm/returns.  */
> >> +  tree base = get_base_address (exp);
> >> +  if ((TREE_CODE (base) == PARM_DECL || (VAR_P (base) && DECL_RTL_SET_P (base)))
> >> +      && GET_CODE (DECL_RTL (base)) == PARALLEL)
> >> +    return NULL_RTX;
> >> +
> >>    switch (TREE_CODE_CLASS (TREE_CODE (exp)))
> >>      {
> >>      case tcc_expression:
> >> @@ -6628,6 +6972,10 @@ pass_expand::execute (function *fun)
> >>    auto_bitmap forced_stack_vars;
> >>    discover_nonconstant_array_refs (forced_stack_vars);
> >>
> >> +  /* Enable light-expander-sra.  */
> >> +  current_sra = new expand_sra;
> >> +  scan_function (cfun, *current_sra);
> >> +
> >>    /* Make sure all values used by the optimization passes have sane
> >>       defaults.  */
> >>    reg_renumber = 0;
> >> @@ -7056,6 +7404,8 @@ pass_expand::execute (function *fun)
> >>        loop_optimizer_finalize ();
> >>      }
> >>
> >> +  delete current_sra;
> >> +  current_sra = NULL;
> >>    timevar_pop (TV_POST_EXPAND);
> >>
> >>    return 0;
> >> diff --git a/gcc/cfgexpand.h b/gcc/cfgexpand.h
> >> index 0e551f6cfd3..3415c217708 100644
> >> --- a/gcc/cfgexpand.h
> >> +++ b/gcc/cfgexpand.h
> >> @@ -24,5 +24,7 @@ extern tree gimple_assign_rhs_to_tree (gimple *);
> >>  extern HOST_WIDE_INT estimated_stack_frame_size (struct cgraph_node *);
> >>  extern void set_parm_rtl (tree, rtx);
> >>
> >> +extern bool scalarizable_aggregate (tree, rtx);
> >> +extern void set_scalar_rtx_for_returns ();
> >>
> >>  #endif /* GCC_CFGEXPAND_H */
> >> diff --git a/gcc/expr.cc b/gcc/expr.cc
> >> index 763bd82c59f..5ba26e0ef52 100644
> >> --- a/gcc/expr.cc
> >> +++ b/gcc/expr.cc
> >> @@ -5618,7 +5618,10 @@ expand_assignment (tree to, tree from, bool nontemporal)
> >>       Assignment of an array element at a constant index, and assignment of
> >>       an array element in an unaligned packed structure field, has the same
> >>       problem.  Same for (partially) storing into a non-memory object.  */
> >> -  if (handled_component_p (to)
> >> +  if ((handled_component_p (to)
> >> +       && !(VAR_P (get_base_address (to))
> >
> > get_base_address is pretty expensive so please cache the result.
> 
> Thanks for pointing out this!
> 
> >
> >> +           && DECL_RTL_SET_P (get_base_address (to))
> >> +           && GET_CODE (DECL_RTL (get_base_address (to))) == PARALLEL))
> >
> > I think that's possibly fragile in case a PARALLEL can also appear by
> > other means here.  I also think it's the wrong thing to map a structural
> 
> Understand your concern. The current trunk code uses PARALLEL to carry
> the incoming/outgoing registers for aggregate, so this patch reuse this
> character. While it is possible that PARALLEL is used by other means.
> To make it safe, adding an additional flag may be an idea. (While this
> would not needed if using your suggestions below).
> 
> > GIMPLE accordingly with artificial decls which would also be the way
> > to preserve debug info.
> >
> > In principle "expander SRA" isn't different from regular SRA apart from
> > the fact that it handles parameter and return variables and heuristically
> > includes the ABI (does it?) in decomposition.
> >
> > So the only important difference is that we cannot represent the
> > change for the incoming arguments or the outgoing regs in GIMPLE
> > but we should only have to special case _those_ bits, not all the
> > other accesses we can rewrite.
> >
> > We could even use the actual SRA pass in a special mode right
> > before RTL expansion (or close to it) and invent a special (but ugly)
> > representation for the incoming/outgoing part, say by using
> > new internal functions present between the SRA pass and RTL
> > expansion:
> >
> Previously, I tried the method to rerun tree-sra just before the
> "optimized" pass once, and find that some enhancement woudl be needed to
> make it work: for example, the parameters analyzing seems not work for
> some cases.  While the two major reasons for me to combine the
> 'light-sra' inside expander:
> 1. We may just use the reg/pseudos based on incoming/outgoing registers
> to access the struct.  Then, no need to create the replacements like
> tree-sra and ipa-sra.
> 
> 2. For small structures which may have int mode, RTL passes (start from
> expander) are using the int reg to access the struct.  It would be
> better to avoid struct/member access at the beggining of expeander.
> It seems these issues can also be relieved by the idea of introducing
> "new internal function". We just need to expand the internal function
> in any expected way.
> 
> > struct X foo (struct X x)
> > {
> >   int D.1234 = .ARG_PART (x, offset, size);
> >   int D.1235 = .ARG_PART (x, offset, size);
> >   D.2345 = .SET_ARG_PARTS (D.1234, offset, size, D.1235, offset, size);
> >   return D.2345;
> > }
> 
> This code expresses two major functionalities: One is extracting fields
> from aggregate (mainly for parameters); and another one is constructing
> an aggregate from the value of elements (mainly for returns).

Yes.  In principle BIT_FIELD_REF could be used for the former and
a CONSTRUCTOR used for the latter but I've used internal functions
to tightly control how we expand these bits - in fact I think for
all practical cases we care about the SRA pass should have made
sure they expand to plain (register) moves, that is, the pieces
1:1 correspond to incoming argument registers or outgoing return
value registers.

Richard.
diff mbox series

Patch

diff --git a/gcc/cfgexpand.cc b/gcc/cfgexpand.cc
index 4262703a138..ef99ca8ac13 100644
--- a/gcc/cfgexpand.cc
+++ b/gcc/cfgexpand.cc
@@ -74,6 +74,7 @@  along with GCC; see the file COPYING3.  If not see
 #include "output.h"
 #include "builtins.h"
 #include "opts.h"
+#include "tree-sra.h"
 
 /* Some systems use __main in a way incompatible with its use in gcc, in these
    cases use the macros NAME__MAIN to give a quoted symbol and SYMBOL__MAIN to
@@ -97,6 +98,343 @@  static bool defer_stack_allocation (tree, bool);
 
 static void record_alignment_for_reg_var (unsigned int);
 
+/* For light SRA in expander about paramaters and returns.  */
+struct access : public sra_base_access
+{
+};
+
+typedef struct access *access_p;
+
+struct expand_sra : public sra_default_analyzer
+{
+  /* Construct/destruct resources, e.g. sra candidates.  */
+  expand_sra ();
+  ~expand_sra ();
+
+  /* No actions for pre_analyze_stmt, analyze_return.  */
+
+  /* Overwrite phi,call,asm analyzations.  */
+  void analyze_phi (gphi *s);
+
+  /* TODO: Check accesses on call/asm.  */
+  void analyze_call (gcall *s) { protect_mem_access_in_stmt (s); };
+  void analyze_asm (gasm *s) { protect_mem_access_in_stmt (s); };
+
+  /* Check access of SRA on assignment.  */
+  void analyze_assign (gassign *);
+
+  /* Check if the accesses of BASE(parameter or return) are
+     scalarizable, according to the incoming/outgoing REGS. */
+  bool scalarizable_accesses (tree base, rtx regs);
+
+private:
+  /* Collect the parameter and returns to check if they are suitable for
+     scalarization.  */
+  bool collect_sra_candidates (void);
+
+  /* Return true if VAR is added as a candidate for SRA.  */
+  bool add_sra_candidate (tree var);
+
+  /* Return true if EXPR has interesting sra access, and created access,
+     return false otherwise.  */
+  access_p build_access (tree expr, bool write);
+
+  /* Check if the access ACC is scalarizable.  REGS is the incoming/outgoing
+     registers which the access is based on. */
+  bool scalarizable_access (access_p acc, rtx regs, bool is_parm);
+
+  /* If there is risk (stored/loaded or addr taken),
+     disqualify the sra candidates in the un-interesting STMT. */
+  void protect_mem_access_in_stmt (gimple *stmt);
+
+  /* Callback of walk_stmt_load_store_addr_ops, used to remove
+     unscalarizable accesses.  */
+  static bool visit_base (gimple *, tree op, tree, void *data);
+
+  /* Base (tree) -> Vector (vec<access_p> *) map.  */
+  hash_map<tree, auto_vec<access_p> > *base_access_vec;
+};
+
+bool
+expand_sra::collect_sra_candidates (void)
+{
+  bool ret = false;
+
+  /* Collect parameters.  */
+  for (tree parm = DECL_ARGUMENTS (current_function_decl); parm;
+       parm = DECL_CHAIN (parm))
+    ret |= add_sra_candidate (parm);
+
+  /* Collect VARs on returns.  */
+  if (DECL_RESULT (current_function_decl))
+    {
+      edge_iterator ei;
+      edge e;
+      FOR_EACH_EDGE (e, ei, EXIT_BLOCK_PTR_FOR_FN (cfun)->preds)
+	if (greturn *r = safe_dyn_cast<greturn *> (*gsi_last_bb (e->src)))
+	  {
+	    tree val = gimple_return_retval (r);
+	    /* To sclaraized the return, the return value should be only
+	       writen (except this return stmt).  Using 'true(write)' to
+	       pretend the access only be 'writen'. */
+	    if (val && VAR_P (val))
+	      ret |= add_sra_candidate (val) && build_access (val, true);
+	  }
+    }
+
+  return ret;
+}
+
+bool
+expand_sra::add_sra_candidate (tree var)
+{
+  tree type = TREE_TYPE (var);
+
+  if (!AGGREGATE_TYPE_P (type) || !tree_fits_shwi_p (TYPE_SIZE (type))
+      || tree_to_shwi (TYPE_SIZE (type)) == 0 || TREE_THIS_VOLATILE (var)
+      || is_va_list_type (type))
+    return false;
+
+  base_access_vec->get_or_insert (var);
+
+  return true;
+}
+
+access_p
+expand_sra::build_access (tree expr, bool write)
+{
+  enum tree_code code = TREE_CODE (expr);
+  if (code != VAR_DECL && code != PARM_DECL && code != COMPONENT_REF
+      && code != ARRAY_REF && code != ARRAY_RANGE_REF)
+    return NULL;
+
+  HOST_WIDE_INT offset, size;
+  bool reverse;
+  tree base = get_ref_base_and_extent_hwi (expr, &offset, &size, &reverse);
+  if (!base || !DECL_P (base))
+    return NULL;
+
+  vec<access_p> *access_vec = base_access_vec->get (base);
+  if (!access_vec)
+    return NULL;
+
+  /* TODO: support reverse. */
+  if (reverse || size <= 0 || offset + size > tree_to_shwi (DECL_SIZE (base)))
+    {
+      base_access_vec->remove (base);
+      return NULL;
+    }
+
+  struct access *access = XNEWVEC (struct access, 1);
+
+  memset (access, 0, sizeof (struct access));
+  access->offset = offset;
+  access->size = size;
+  access->expr = expr;
+  access->write = write;
+  access->reverse = reverse;
+
+  access_vec->safe_push (access);
+
+  return access;
+}
+
+/* Function protect_mem_access_in_stmt removes the SRA candidates if
+   there is addr-taken on the candidate in the STMT.  */
+
+void
+expand_sra::analyze_phi (gphi *stmt)
+{
+  if (base_access_vec && !base_access_vec->is_empty ())
+    walk_stmt_load_store_addr_ops (stmt, this, NULL, NULL, visit_base);
+}
+
+void
+expand_sra::analyze_assign (gassign *stmt)
+{
+  if (!base_access_vec || base_access_vec->is_empty ())
+    return;
+
+  if (gimple_assign_single_p (stmt) && !gimple_clobber_p (stmt))
+    {
+      tree rhs = gimple_assign_rhs1 (stmt);
+      tree lhs = gimple_assign_lhs (stmt);
+      bool res_r = build_access (rhs, false);
+      bool res_l = build_access (lhs, true);
+
+      if (res_l || res_r)
+	return;
+    }
+
+  protect_mem_access_in_stmt (stmt);
+}
+
+/* Callback of walk_stmt_load_store_addr_ops, used to remove
+   unscalarizable accesses.  Called by protect_mem_access_in_stmt.  */
+
+bool
+expand_sra::visit_base (gimple *, tree op, tree, void *data)
+{
+  op = get_base_address (op);
+  if (op && DECL_P (op))
+    {
+      expand_sra *p = (expand_sra *) data;
+      p->base_access_vec->remove (op);
+    }
+  return false;
+}
+
+/* Function protect_mem_access_in_stmt removes the SRA candidates if
+   there is store/load/addr-taken on the candidate in the STMT.
+
+   For some statements, which SRA does not care about, if there are
+   possible memory operation on the SRA candidates, it would be risky
+   to scalarize it.  */
+
+void
+expand_sra::protect_mem_access_in_stmt (gimple *stmt)
+{
+  if (base_access_vec && !base_access_vec->is_empty ())
+    walk_stmt_load_store_addr_ops (stmt, this, visit_base, visit_base,
+				   visit_base);
+}
+
+expand_sra::expand_sra () : base_access_vec (NULL)
+{
+  if (optimize <= 0)
+    return;
+
+  base_access_vec = new hash_map<tree, auto_vec<access_p> >;
+  collect_sra_candidates ();
+}
+
+expand_sra::~expand_sra ()
+{
+  if (optimize <= 0)
+    return;
+
+  delete base_access_vec;
+}
+
+bool
+expand_sra::scalarizable_access (access_p acc, rtx regs, bool is_parm)
+{
+  /* Now only support reading from parms
+     or writing to returns.  */
+  if (is_parm && acc->write)
+    return false;
+  if (!is_parm && !acc->write)
+    return false;
+
+  /* Compute the position of the access in the parallel regs.  */
+  int start_index = -1;
+  int end_index = -1;
+  HOST_WIDE_INT left_bits = 0;
+  HOST_WIDE_INT right_bits = 0;
+  query_position_in_parallel (acc->offset, acc->size, regs, start_index,
+			      end_index, left_bits, right_bits);
+
+  /* Invalid access possition: padding or outof bound.  */
+  if (start_index < 0 || end_index < 0)
+    return false;
+
+  machine_mode expr_mode = TYPE_MODE (TREE_TYPE (acc->expr));
+  /* Need multi-registers in a parallel for the access.  */
+  if (expr_mode == BLKmode || end_index > start_index)
+    {
+      if (left_bits || right_bits)
+	return false;
+      if (expr_mode == BLKmode)
+	return true;
+
+      /* For large modes, only support TI/VECTOR in mult-registers. */
+      if (known_gt (acc->size, GET_MODE_BITSIZE (word_mode)))
+	return expr_mode == TImode || VECTOR_MODE_P (expr_mode);
+      return true;
+    }
+
+  gcc_assert (end_index == start_index);
+
+  /* Just need one reg for the access.  */
+  if (left_bits == 0 && right_bits == 0)
+    return true;
+
+  scalar_int_mode imode;
+  /* Need to extract bits from the reg for the access.  */
+  return !acc->write && int_mode_for_mode (expr_mode).exists (&imode);
+}
+
+/* Now, the base (parm/return) is scalarizable, only if all
+   accesses of the BASE are scalariable.
+
+   This function need to be updated, to support more complicate
+   cases, like:
+   - Some access are scalarizable, but some are not.
+   - Access is writing to a parameter.
+   - Writing accesses are overlap with multi-accesses.   */
+
+bool
+expand_sra::scalarizable_accesses (tree base, rtx regs)
+{
+  if (!base_access_vec)
+    return false;
+  vec<access_p> *access_vec = base_access_vec->get (base);
+  if (!access_vec)
+    return false;
+  if (access_vec->is_empty ())
+    return false;
+
+  bool is_parm = TREE_CODE (base) == PARM_DECL;
+  int n = access_vec->length ();
+  int cur_access_index = 0;
+  for (; cur_access_index < n; cur_access_index++)
+    if (!scalarizable_access ((*access_vec)[cur_access_index], regs, is_parm))
+      break;
+
+  /* It is ok if all access are scalarizable.  */
+  if (cur_access_index == n)
+    return true;
+
+  base_access_vec->remove (base);
+  return false;
+}
+
+static expand_sra *current_sra = NULL;
+
+/* Check if the PARAM (or return) is scalarizable.
+
+   This interface is used in expand_function_start
+   to check sra possiblity for parmeters. */
+
+bool
+scalarizable_aggregate (tree parm, rtx regs)
+{
+  if (!current_sra)
+    return false;
+  return current_sra->scalarizable_accesses (parm, regs);
+}
+
+/* Check if interesting returns, and if they are scalarizable,
+   set DECL_RTL as scalar registers.
+
+   This interface is used in expand_function_start
+   when outgoing registers are determinded for DECL_RESULT.  */
+
+void
+set_scalar_rtx_for_returns ()
+{
+  rtx res = DECL_RTL (DECL_RESULT (current_function_decl));
+  edge_iterator ei;
+  edge e;
+  FOR_EACH_EDGE (e, ei, EXIT_BLOCK_PTR_FOR_FN (cfun)->preds)
+    if (greturn *r = safe_dyn_cast<greturn *> (*gsi_last_bb (e->src)))
+      {
+	tree val = gimple_return_retval (r);
+	if (val && VAR_P (val) && scalarizable_aggregate (val, res))
+	  SET_DECL_RTL (val, res);
+      }
+}
+
 /* Return an expression tree corresponding to the RHS of GIMPLE
    statement STMT.  */
 
@@ -3707,7 +4045,7 @@  expand_value_return (rtx val)
 
   tree decl = DECL_RESULT (current_function_decl);
   rtx return_reg = DECL_RTL (decl);
-  if (return_reg != val)
+  if (!rtx_equal_p (return_reg, val))
     {
       tree funtype = TREE_TYPE (current_function_decl);
       tree type = TREE_TYPE (decl);
@@ -4423,6 +4761,12 @@  expand_debug_expr (tree exp)
   addr_space_t as;
   scalar_int_mode op0_mode, op1_mode, addr_mode;
 
+  /* TODO: Enable to debug expand-sra optimized parm/returns.  */
+  tree base = get_base_address (exp);
+  if ((TREE_CODE (base) == PARM_DECL || (VAR_P (base) && DECL_RTL_SET_P (base)))
+      && GET_CODE (DECL_RTL (base)) == PARALLEL)
+    return NULL_RTX;
+
   switch (TREE_CODE_CLASS (TREE_CODE (exp)))
     {
     case tcc_expression:
@@ -6628,6 +6972,10 @@  pass_expand::execute (function *fun)
   auto_bitmap forced_stack_vars;
   discover_nonconstant_array_refs (forced_stack_vars);
 
+  /* Enable light-expander-sra.  */
+  current_sra = new expand_sra;
+  scan_function (cfun, *current_sra);
+
   /* Make sure all values used by the optimization passes have sane
      defaults.  */
   reg_renumber = 0;
@@ -7056,6 +7404,8 @@  pass_expand::execute (function *fun)
       loop_optimizer_finalize ();
     }
 
+  delete current_sra;
+  current_sra = NULL;
   timevar_pop (TV_POST_EXPAND);
 
   return 0;
diff --git a/gcc/cfgexpand.h b/gcc/cfgexpand.h
index 0e551f6cfd3..3415c217708 100644
--- a/gcc/cfgexpand.h
+++ b/gcc/cfgexpand.h
@@ -24,5 +24,7 @@  extern tree gimple_assign_rhs_to_tree (gimple *);
 extern HOST_WIDE_INT estimated_stack_frame_size (struct cgraph_node *);
 extern void set_parm_rtl (tree, rtx);
 
+extern bool scalarizable_aggregate (tree, rtx);
+extern void set_scalar_rtx_for_returns ();
 
 #endif /* GCC_CFGEXPAND_H */
diff --git a/gcc/expr.cc b/gcc/expr.cc
index 763bd82c59f..5ba26e0ef52 100644
--- a/gcc/expr.cc
+++ b/gcc/expr.cc
@@ -5618,7 +5618,10 @@  expand_assignment (tree to, tree from, bool nontemporal)
      Assignment of an array element at a constant index, and assignment of
      an array element in an unaligned packed structure field, has the same
      problem.  Same for (partially) storing into a non-memory object.  */
-  if (handled_component_p (to)
+  if ((handled_component_p (to)
+       && !(VAR_P (get_base_address (to))
+	    && DECL_RTL_SET_P (get_base_address (to))
+	    && GET_CODE (DECL_RTL (get_base_address (to))) == PARALLEL))
       || (TREE_CODE (to) == MEM_REF
 	  && (REF_REVERSE_STORAGE_ORDER (to)
 	      || mem_ref_refers_to_non_mem_p (to)))
@@ -8909,6 +8912,19 @@  expand_constructor (tree exp, rtx target, enum expand_modifier modifier,
       && ! mostly_zeros_p (exp))
     return NULL_RTX;
 
+  if (target && GET_CODE (target) == PARALLEL && all_zeros_p (exp))
+    {
+      int length = XVECLEN (target, 0);
+      int start = XEXP (XVECEXP (target, 0, 0), 0) ? 0 : 1;
+      for (int i = start; i < length; i++)
+	{
+	  rtx dst = XEXP (XVECEXP (target, 0, i), 0);
+	  rtx zero = CONST0_RTX (GET_MODE (dst));
+	  emit_move_insn (dst, zero);
+	}
+      return target;
+    }
+
   /* Handle calls that pass values in multiple non-contiguous
      locations.  The Irix 6 ABI has examples of this.  */
   if (target == 0 || ! safe_from_p (target, exp, 1)
@@ -10621,6 +10637,157 @@  stmt_is_replaceable_p (gimple *stmt)
   return false;
 }
 
+/* In the parallel rtx register series REGS, compute the position for given
+   {BITPOS, BITSIZE}.
+   START_INDEX, END_INDEX, LEFT_BITS and RIGHT_BITS are computed outputs.  */
+
+void
+query_position_in_parallel (HOST_WIDE_INT bitpos, HOST_WIDE_INT bitsize,
+			    rtx regs, int &start_index, int &end_index,
+			    HOST_WIDE_INT &left_bits, HOST_WIDE_INT &right_bits)
+{
+  int cur_index = XEXP (XVECEXP (regs, 0, 0), 0) ? 0 : 1;
+  for (; cur_index < XVECLEN (regs, 0); cur_index++)
+    {
+      rtx slot = XVECEXP (regs, 0, cur_index);
+      HOST_WIDE_INT off = UINTVAL (XEXP (slot, 1)) * BITS_PER_UNIT;
+      machine_mode mode = GET_MODE (XEXP (slot, 0));
+      HOST_WIDE_INT size = GET_MODE_BITSIZE (mode).to_constant ();
+      if (off <= bitpos && off + size > bitpos)
+	{
+	  start_index = cur_index;
+	  left_bits = bitpos - off;
+	}
+      if (off + size >= bitpos + bitsize)
+	{
+	  end_index = cur_index;
+	  right_bits = off + size - (bitpos + bitsize);
+	  break;
+	}
+    }
+}
+
+static rtx
+extract_sub_member (rtx regs, HOST_WIDE_INT bitpos, HOST_WIDE_INT bitsize,
+		    tree expr)
+{
+  int start_index = -1;
+  int end_index = -1;
+  HOST_WIDE_INT left_bits = 0;
+  HOST_WIDE_INT right_bits = 0;
+  query_position_in_parallel (bitpos, bitsize, regs, start_index, end_index,
+			      left_bits, right_bits);
+
+  machine_mode expr_mode = TYPE_MODE (TREE_TYPE (expr));
+  if (end_index > start_index || expr_mode == BLKmode)
+    {
+      /* TImode in multi-registers.  */
+      if (expr_mode == TImode)
+	{
+	  rtx res = gen_reg_rtx (expr_mode);
+	  HOST_WIDE_INT start;
+	  start = UINTVAL (XEXP (XVECEXP (regs, 0, start_index), 1));
+	  for (int index = start_index; index <= end_index; index++)
+	    {
+	      rtx reg = XEXP (XVECEXP (regs, 0, index), 0);
+	      machine_mode mode = GET_MODE (reg);
+	      HOST_WIDE_INT off;
+	      off = UINTVAL (XEXP (XVECEXP (regs, 0, index), 1)) - start;
+	      rtx sub = simplify_gen_subreg (mode, res, expr_mode, off);
+	      emit_move_insn (sub, reg);
+	    }
+	  return res;
+	}
+
+      /* Vector in multi-registers.  */
+      if (VECTOR_MODE_P (expr_mode))
+	{
+	  rtvec vector = rtvec_alloc (end_index - start_index + 1);
+	  machine_mode emode;
+	  emode = GET_MODE (XEXP (XVECEXP (regs, 0, start_index), 0));
+	  for (int index = start_index; index <= end_index; index++)
+	    {
+	      rtx reg = XEXP (XVECEXP (regs, 0, index), 0);
+	      gcc_assert (emode == GET_MODE (reg));
+	      RTVEC_ELT (vector, index - start_index) = reg;
+	    }
+	  scalar_int_mode imode;
+	  machine_mode vmode;
+	  int nunits = end_index - start_index + 1;
+	  if (!(int_mode_for_mode (emode).exists (&imode)
+		&& mode_for_vector (imode, nunits).exists (&vmode)))
+	    gcc_unreachable ();
+
+	  insn_code icode;
+	  icode = convert_optab_handler (vec_init_optab, vmode, imode);
+	  rtx res = gen_reg_rtx (vmode);
+	  emit_insn (GEN_FCN (icode) (res, gen_rtx_PARALLEL (vmode, vector)));
+	  if (expr_mode == vmode)
+	    return res;
+	  return simplify_gen_subreg (expr_mode, res, vmode, 0);
+	}
+
+      /* Need multi-registers in a parallel for the access.  */
+      int num_words = end_index - start_index + 1;
+      rtx *tmps = XALLOCAVEC (rtx, num_words);
+
+      int pos = 0;
+      HOST_WIDE_INT start;
+      start = UINTVAL (XEXP (XVECEXP (regs, 0, start_index), 1));
+      /* Extract whole registers.  */
+      for (; pos < num_words; pos++)
+	{
+	  int index = start_index + pos;
+	  rtx reg = XEXP (XVECEXP (regs, 0, index), 0);
+	  machine_mode mode = GET_MODE (reg);
+	  HOST_WIDE_INT off;
+	  off = UINTVAL (XEXP (XVECEXP (regs, 0, index), 1)) - start;
+	  tmps[pos] = gen_rtx_EXPR_LIST (mode, reg, GEN_INT (off));
+	}
+
+      rtx res = gen_rtx_PARALLEL (expr_mode, gen_rtvec_v (pos, tmps));
+      return res;
+    }
+
+  gcc_assert (end_index == start_index);
+
+  /* Just need one reg for the access.  */
+  if (left_bits == 0 && right_bits == 0)
+    {
+      rtx reg = XEXP (XVECEXP (regs, 0, start_index), 0);
+      if (GET_MODE (reg) != expr_mode)
+	reg = gen_lowpart (expr_mode, reg);
+      return reg;
+    }
+
+  /* Need to extract bitfield part reg for the access.
+     left_bits != 0 or right_bits != 0 */
+  rtx reg = XEXP (XVECEXP (regs, 0, start_index), 0);
+  bool sgn = TYPE_UNSIGNED (TREE_TYPE (expr));
+  scalar_int_mode imode;
+  if (!int_mode_for_mode (expr_mode).exists (&imode))
+    {
+      gcc_assert (false);
+      return NULL_RTX;
+    }
+
+  machine_mode mode = GET_MODE (reg);
+  bool reverse = false;
+  rtx bfld = extract_bit_field (reg, bitsize, left_bits, sgn, NULL_RTX, mode,
+				imode, reverse, NULL);
+
+  if (GET_MODE (bfld) != imode)
+    bfld = gen_lowpart (imode, bfld);
+
+  if (expr_mode == imode)
+    return bfld;
+
+  /* expr_mode != imode, e.g. SF != SI.  */
+  rtx result = gen_reg_rtx (imode);
+  emit_move_insn (result, bfld);
+  return gen_lowpart (expr_mode, result);
+}
+
 rtx
 expand_expr_real_1 (tree exp, rtx target, machine_mode tmode,
 		    enum expand_modifier modifier, rtx *alt_rtl,
@@ -11498,6 +11665,16 @@  expand_expr_real_1 (tree exp, rtx target, machine_mode tmode,
 	  = expand_expr_real (tem, tem_target, VOIDmode, tem_modifier, NULL,
 			      true);
 
+	/* It is scalarizable access on param which is passed by registers.  */
+	if (GET_CODE (op0) == PARALLEL
+	    && (TREE_CODE (tem) == PARM_DECL || VAR_P (tem)))
+	  {
+	    HOST_WIDE_INT pos, size;
+	    size = bitsize.to_constant ();
+	    pos = bitpos.to_constant ();
+	    return extract_sub_member (op0, pos, size, exp);
+	  }
+
 	/* If the field has a mode, we want to access it in the
 	   field's mode, not the computed mode.
 	   If a MEM has VOIDmode (external with incomplete type),
diff --git a/gcc/expr.h b/gcc/expr.h
index 2a172867fdb..8a9332aaad6 100644
--- a/gcc/expr.h
+++ b/gcc/expr.h
@@ -362,5 +362,8 @@  extern rtx expr_size (tree);
 
 extern bool mem_ref_refers_to_non_mem_p (tree);
 extern bool non_mem_decl_p (tree);
+extern void query_position_in_parallel (HOST_WIDE_INT, HOST_WIDE_INT, rtx,
+					int &, int &, HOST_WIDE_INT &,
+					HOST_WIDE_INT &);
 
 #endif /* GCC_EXPR_H */
diff --git a/gcc/function.cc b/gcc/function.cc
index afb0b33da9e..518250b2728 100644
--- a/gcc/function.cc
+++ b/gcc/function.cc
@@ -3107,8 +3107,29 @@  assign_parm_setup_block (struct assign_parm_data_all *all,
 	  emit_move_insn (mem, entry_parm);
 	}
       else
-	move_block_from_reg (REGNO (entry_parm), mem,
-			     size_stored / UNITS_PER_WORD);
+	{
+	  int regno = REGNO (entry_parm);
+	  int nregs = size_stored / UNITS_PER_WORD;
+	  rtx *tmps = XALLOCAVEC (rtx, nregs);
+	  machine_mode mode = word_mode;
+	  HOST_WIDE_INT word_size = GET_MODE_SIZE (mode).to_constant ();
+	  for (int i = 0; i < nregs; i++)
+	    {
+	      rtx reg = gen_rtx_REG (mode, regno + i);
+	      rtx off = GEN_INT (word_size * i);
+	      tmps[i] = gen_rtx_EXPR_LIST (VOIDmode, reg, off);
+	    }
+
+	  rtx regs = gen_rtx_PARALLEL (BLKmode, gen_rtvec_v (nregs, tmps));
+	  if (scalarizable_aggregate (parm, regs))
+	    {
+	      rtx pseudos = gen_group_rtx (regs);
+	      emit_group_move (pseudos, regs);
+	      stack_parm = pseudos;
+	    }
+	  else
+	    move_block_from_reg (regno, mem, nregs);
+	}
     }
   else if (data->stack_parm == 0 && !TYPE_EMPTY_P (data->arg.type))
     {
@@ -3710,7 +3731,15 @@  assign_parms (tree fndecl)
 
       assign_parm_adjust_stack_rtl (&data);
 
-      if (assign_parm_setup_block_p (&data))
+      rtx incoming = DECL_INCOMING_RTL (parm);
+      if (GET_CODE (incoming) == PARALLEL
+	  && scalarizable_aggregate (parm, incoming))
+	{
+	  rtx pseudos = gen_group_rtx (incoming);
+	  emit_group_move (pseudos, incoming);
+	  set_parm_rtl (parm, pseudos);
+	}
+      else if (assign_parm_setup_block_p (&data))
 	assign_parm_setup_block (&all, parm, &data);
       else if (data.arg.pass_by_reference || use_register_for_decl (parm))
 	assign_parm_setup_reg (&all, parm, &data);
@@ -5128,6 +5157,7 @@  expand_function_start (tree subr)
 	    {
 	      gcc_assert (GET_CODE (hard_reg) == PARALLEL);
 	      set_parm_rtl (res, gen_group_rtx (hard_reg));
+	      set_scalar_rtx_for_returns ();
 	    }
 	}
 
diff --git a/gcc/tree-sra.h b/gcc/tree-sra.h
index f20266c4622..df3071ccf6e 100644
--- a/gcc/tree-sra.h
+++ b/gcc/tree-sra.h
@@ -19,6 +19,82 @@  You should have received a copy of the GNU General Public License
 along with GCC; see the file COPYING3.  If not see
 <http://www.gnu.org/licenses/>.  */
 
+struct sra_base_access
+{
+  /* Values returned by get_ref_base_and_extent, indicates the
+     OFFSET, SIZE and BASE of the access.  */
+  HOST_WIDE_INT offset;
+  HOST_WIDE_INT size;
+
+  /* The context expression of this access.  */
+  tree expr;
+
+  /* Indicates this is a write access.  */
+  bool write : 1;
+
+  /* Indicates if this access is made in reverse storage order.  */
+  bool reverse : 1;
+};
+
+/* Default template for sra_scan_function.  */
+
+struct sra_default_analyzer
+{
+  /* Template analyze functions.  */
+  void analyze_phi (gphi *){};
+  void pre_analyze_stmt (gimple *){};
+  void analyze_return (greturn *){};
+  void analyze_assign (gassign *){};
+  void analyze_call (gcall *){};
+  void analyze_asm (gasm *){};
+  void analyze_default_stmt (gimple *){};
+};
+
+/* Scan function and look for interesting expressions.  */
+
+template <typename analyzer>
+void
+scan_function (struct function *fun, analyzer &a)
+{
+  basic_block bb;
+  FOR_EACH_BB_FN (bb, fun)
+    {
+      for (gphi_iterator gsi = gsi_start_phis (bb); !gsi_end_p (gsi);
+	   gsi_next (&gsi))
+	a.analyze_phi (gsi.phi ());
+
+      for (gimple_stmt_iterator gsi = gsi_start_bb (bb); !gsi_end_p (gsi);
+	   gsi_next (&gsi))
+	{
+	  gimple *stmt = gsi_stmt (gsi);
+	  a.pre_analyze_stmt (stmt);
+
+	  switch (gimple_code (stmt))
+	    {
+	    case GIMPLE_RETURN:
+	      a.analyze_return (as_a<greturn *> (stmt));
+	      break;
+
+	    case GIMPLE_ASSIGN:
+	      a.analyze_assign (as_a<gassign *> (stmt));
+	      break;
+
+	    case GIMPLE_CALL:
+	      a.analyze_call (as_a<gcall *> (stmt));
+	      break;
+
+	    case GIMPLE_ASM:
+	      a.analyze_asm (as_a<gasm *> (stmt));
+	      break;
+
+	    default:
+	      a.analyze_default_stmt (stmt);
+	      break;
+	    }
+	}
+    }
+}
+
 bool type_internals_preclude_sra_p (tree type, const char **msg);
 
 /* Return true iff TYPE is stdarg va_list type (which early SRA and IPA-SRA
diff --git a/gcc/var-tracking.cc b/gcc/var-tracking.cc
index d8dafa5481a..7fc801f2612 100644
--- a/gcc/var-tracking.cc
+++ b/gcc/var-tracking.cc
@@ -5352,7 +5352,8 @@  track_loc_p (rtx loc, tree expr, poly_int64 offset, bool store_reg_p,
      because the real and imaginary parts are represented as separate
      pseudo registers, even if the whole complex value fits into one
      hard register.  */
-  if ((paradoxical_subreg_p (mode, DECL_MODE (expr))
+  if (((DECL_MODE (expr) != BLKmode
+	&& paradoxical_subreg_p (mode, DECL_MODE (expr)))
        || (store_reg_p
 	   && !COMPLEX_MODE_P (DECL_MODE (expr))
 	   && hard_regno_nregs (REGNO (loc), DECL_MODE (expr)) == 1))
diff --git a/gcc/testsuite/g++.target/powerpc/pr102024.C b/gcc/testsuite/g++.target/powerpc/pr102024.C
index 769585052b5..c8995cae707 100644
--- a/gcc/testsuite/g++.target/powerpc/pr102024.C
+++ b/gcc/testsuite/g++.target/powerpc/pr102024.C
@@ -5,7 +5,7 @@ 
 // Test that a zero-width bit field in an otherwise homogeneous aggregate
 // generates a psabi warning and passes arguments in GPRs.
 
-// { dg-final { scan-assembler-times {\mstd\M} 4 } }
+// { dg-final { scan-assembler-times {\mmtvsrd\M} 4 } }
 
 struct a_thing
 {
diff --git a/gcc/testsuite/gcc.target/i386/pr20020-2.c b/gcc/testsuite/gcc.target/i386/pr20020-2.c
index fa8cb2528c5..723f1826630 100644
--- a/gcc/testsuite/gcc.target/i386/pr20020-2.c
+++ b/gcc/testsuite/gcc.target/i386/pr20020-2.c
@@ -15,10 +15,15 @@  struct shared_ptr_struct
 };
 typedef struct shared_ptr_struct sptr_t;
 
+void foo (sptr_t *);
+
 void
 copy_sptr (sptr_t *dest, sptr_t src)
 {
   *dest = src;
+
+  /* Prevent 'src' to be scalarized as registers.  */
+  foo (&src);
 }
 
 /* { dg-final { scan-rtl-dump "\\\(set \\\(reg:TI \[0-9\]*" "expand" } } */
diff --git a/gcc/testsuite/gcc.target/powerpc/pr108073.c b/gcc/testsuite/gcc.target/powerpc/pr108073.c
new file mode 100644
index 00000000000..293bf93fb9a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/pr108073.c
@@ -0,0 +1,29 @@ 
+/* { dg-do run } */
+/* { dg-options "-O2 -save-temps" } */
+
+typedef struct DF {double a[4]; short s1; short s2; short s3; short s4; } DF;
+typedef struct SF {float a[4]; int i1; int i2; } SF;
+
+/* { dg-final { scan-assembler-times {\mmtvsrd|mtvsrws\M} 3 {target { has_arch_ppc64 && has_arch_pwr8 } } } } */
+/* { dg-final { scan-assembler-not {\mlwz\M} {target { has_arch_ppc64 && has_arch_pwr8 } } } } */
+/* { dg-final { scan-assembler-not {\mlhz\M} {target { has_arch_ppc64 && has_arch_pwr8 } } } } */
+short  __attribute__ ((noipa)) foo_hi (DF a, int flag){if (flag == 2)return a.s2+a.s3;return 0;}
+int  __attribute__ ((noipa)) foo_si (SF a, int flag){if (flag == 2)return a.i2+a.i1;return 0;}
+double __attribute__ ((noipa)) foo_df (DF arg, int flag){if (flag == 2)return arg.a[3];else return 0.0;}
+float  __attribute__ ((noipa)) foo_sf (SF arg, int flag){if (flag == 2)return arg.a[2]; return 0;}
+float  __attribute__ ((noipa)) foo_sf1 (SF arg, int flag){if (flag == 2)return arg.a[1];return 0;}
+
+DF gdf = {{1.0,2.0,3.0,4.0}, 1, 2, 3, 4};
+SF gsf = {{1.0f,2.0f,3.0f,4.0f}, 1, 2};
+
+int main()
+{
+  if (!(foo_hi (gdf, 2) == 5 && foo_si (gsf, 2) == 3 && foo_df (gdf, 2) == 4.0
+	&& foo_sf (gsf, 2) == 3.0 && foo_sf1 (gsf, 2) == 2.0))
+    __builtin_abort ();
+  if (!(foo_hi (gdf, 1) == 0 && foo_si (gsf, 1) == 0 && foo_df (gdf, 1) == 0
+	&& foo_sf (gsf, 1) == 0 && foo_sf1 (gsf, 1) == 0))
+    __builtin_abort ();
+  return 0;
+}
+
diff --git a/gcc/testsuite/gcc.target/powerpc/pr65421-1.c b/gcc/testsuite/gcc.target/powerpc/pr65421-1.c
new file mode 100644
index 00000000000..4e1f87f7939
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/pr65421-1.c
@@ -0,0 +1,6 @@ 
+/* PR target/65421 */
+/* { dg-options "-O2" } */
+
+typedef struct LARGE {double a[4]; int arr[32];} LARGE;
+LARGE foo (LARGE a){return a;}
+/* { dg-final { scan-assembler-times {\mmemcpy\M} 1 } } */
diff --git a/gcc/testsuite/gcc.target/powerpc/pr65421-2.c b/gcc/testsuite/gcc.target/powerpc/pr65421-2.c
new file mode 100644
index 00000000000..8a8e1a0e996
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/pr65421-2.c
@@ -0,0 +1,32 @@ 
+/* PR target/65421 */
+/* { dg-options "-O2" } */
+/* { dg-require-effective-target powerpc_elfv2 } */
+/* { dg-require-effective-target has_arch_ppc64 } */
+
+typedef struct FLOATS
+{
+  double a[3];
+} FLOATS;
+
+/* 3 lfd after returns also optimized */
+/* FLOATS ret_arg_pt (FLOATS *a){return *a;} */
+
+/* 3 stfd */
+void st_arg (FLOATS a, FLOATS *p) {*p = a;}
+/* { dg-final { scan-assembler-times {\mstfd\M} 3 } } */
+
+/* blr */
+FLOATS ret_arg (FLOATS a) {return a;}
+
+typedef struct MIX
+{
+  double a[2];
+  long l;
+} MIX;
+
+/* std 3 param regs to return slot */
+MIX ret_arg1 (MIX a) {return a;}
+/* { dg-final { scan-assembler-times {\mstd\M} 3 } } */
+
+/* count insns */
+/* { dg-final { scan-assembler-times {(?n)^\s+[a-z]} 9 } } */