
Improve store merging to handle load+store or bitwise logicals (PR tree-optimization/78821, take 2)

Message ID 20171102173641.GL14653@tucnak
State New
Series Improve store merging to handle load+store or bitwise logicals (PR tree-optimization/78821, take 2)

Commit Message

Jakub Jelinek Nov. 2, 2017, 5:36 p.m. UTC
On Thu, Nov 02, 2017 at 03:38:45PM +0000, Kyrill Tkachov wrote:
> this looks great!  I have a couple of comments.
> * Can you please extend the file comment for gimple-ssa-store-merging.c?
> Currently it mostly describes how we merge constants together.  Once we
> start accepting non-constant members we should mention it in there.

The following updated patch introduces the #define and updates the comments.
I'll do the BIT_NOT_EXPR work incrementally.

BTW, I've finished the statistics gathering from combined x86_64 and
i686-linux bootstraps.  With my recent gimple-ssa-store-merging.c changes
(the bitfield handling etc.) reverted, the split_stores.length () and
orig_num_stmts counts at the end of each successful output_merged_store
were (summed over all cases):
integer_cst	199245	413294
with the recent changes plus this patch:
integer_cst	215274	442134
mem_ref		16943	35369
bit_and_expr	37	88
bit_ior_expr	19	46
bit_xor_expr	27	58
I think the integer_cst numbers without/with this patch should be roughly
the same.
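
As a concrete illustration (sketch only, not taken from the new testcases,
and whether the merge actually triggers depends on target alignment and the
usual params), the kind of source-level sequence the extended pass can now
turn into one wider load, an optional bitwise op and one wider store is:

  struct S { unsigned char a, b, c, d; };
  struct S s, t;

  void
  copy_bytes (void)
  {
    /* Byte-wise copy; mergeable into a single 4-byte load followed by
       a single 4-byte store.  */
    t.a = s.a;
    t.b = s.b;
    t.c = s.c;
    t.d = s.d;
  }

  void
  xor_bytes (void)
  {
    /* Byte-wise load, xor with constants, store; mergeable into one
       4-byte load, one xor with the combined immediate and one store.  */
    t.a = s.a ^ 1;
    t.b = s.b ^ 2;
    t.c = s.c ^ 3;
    t.d = s.d ^ 4;
  }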

2017-11-02  Jakub Jelinek  <jakub@redhat.com>

	PR tree-optimization/78821
	* gimple-ssa-store-merging.c: Update the file comment.
	(MAX_STORE_ALIAS_CHECKS): Define.
	(struct store_operand_info): New type.
	(store_operand_info::store_operand_info): New constructor.
	(struct store_immediate_info): Add rhs_code and ops data members.
	(store_immediate_info::store_immediate_info): Add rhscode, op0r
	and op1r arguments to the ctor, initialize corresponding data members.
	(struct merged_store_group): Add load_align_base and load_align
	data members.
	(merged_store_group::merged_store_group): Initialize them.
	(merged_store_group::do_merge): Update them.
	(merged_store_group::apply_stores): Pick the constant for
	encode_tree_to_bitpos from one of the two operands, or skip
	encode_tree_to_bitpos if neither operand is a constant.
	(class pass_store_merging): Add process_store method decl.  Remove
	bool argument from terminate_all_aliasing_chains method decl.
	(pass_store_merging::terminate_all_aliasing_chains): Remove
	var_offset_p argument and corresponding handling.
	(stmts_may_clobber_ref_p): New function.
	(compatible_load_p): New function.
	(imm_store_chain_info::coalesce_immediate_stores): Terminate group
	if there is overlap and rhs_code is not INTEGER_CST.  For
	non-overlapping stores terminate group if rhs is not mergeable.
	(get_alias_type_for_stmts): Change first argument from
	auto_vec<gimple *> & to vec<gimple *> &.  Add IS_LOAD, CLIQUEP and
	BASEP arguments.  If IS_LOAD is true, look at rhs1 of the stmts
	instead of lhs.  Compute *CLIQUEP and *BASEP in addition to the
	alias type.
	(get_location_for_stmts): Change first argument from
	auto_vec<gimple *> & to vec<gimple *> &.
	(struct split_store): Remove orig_stmts data member, add orig_stores.
	(split_store::split_store): Create orig_stores rather than orig_stmts.
	(find_constituent_stmts): Renamed to ...
	(find_constituent_stores): ... this.  Change second argument from
	vec<gimple *> * to vec<store_immediate_info *> *, push pointers
	to info structures rather than the statements.
	(split_group): Rename ALLOW_UNALIGNED argument to
	ALLOW_UNALIGNED_STORE, add ALLOW_UNALIGNED_LOAD argument and handle
	it.  Adjust find_constituent_stores caller.
	(imm_store_chain_info::output_merged_store): Handle rhs_code other
	than INTEGER_CST, adjust split_group, get_alias_type_for_stmts and
	get_location_for_stmts callers.  Set MR_DEPENDENCE_CLIQUE and
	MR_DEPENDENCE_BASE on the MEM_REFs if they are the same in all stores.
	(mem_valid_for_store_merging): New function.
	(handled_load): New function.
	(pass_store_merging::process_store): New method.
	(pass_store_merging::execute): Use process_store method.  Adjust
	terminate_all_aliasing_chains caller.

	* gcc.dg/store_merging_13.c: New test.
	* gcc.dg/store_merging_14.c: New test.



	Jakub

Comments

Richard Biener Nov. 3, 2017, 1:14 p.m. UTC | #1
On Thu, 2 Nov 2017, Jakub Jelinek wrote:

> [...]
> 
> --- gcc/gimple-ssa-store-merging.c.jj	2017-11-01 22:49:18.123965696 +0100
> +++ gcc/gimple-ssa-store-merging.c	2017-11-02 17:24:04.236317245 +0100
> @@ -19,7 +19,8 @@
>     <http://www.gnu.org/licenses/>.  */
>  
>  /* The purpose of this pass is to combine multiple memory stores of
> -   constant values to consecutive memory locations into fewer wider stores.
> +   constant values, values loaded from memory or bitwise operations
> +   on those to consecutive memory locations into fewer wider stores.
>     For example, if we have a sequence peforming four byte stores to
>     consecutive memory locations:
>     [p     ] := imm1;
> @@ -29,21 +30,49 @@
>     we can transform this into a single 4-byte store if the target supports it:
>    [p] := imm1:imm2:imm3:imm4 //concatenated immediates according to endianness.
>  
> +   Or:
> +   [p     ] := [q     ];
> +   [p + 1B] := [q + 1B];
> +   [p + 2B] := [q + 2B];
> +   [p + 3B] := [q + 3B];
> +   if there is no overlap can be transformed into a single 4-byte
> +   load followed by single 4-byte store.
> +
> +   Or:
> +   [p     ] := [q     ] ^ imm1;
> +   [p + 1B] := [q + 1B] ^ imm2;
> +   [p + 2B] := [q + 2B] ^ imm3;
> +   [p + 3B] := [q + 3B] ^ imm4;
> +   if there is no overlap can be transformed into a single 4-byte
> +   load, xored with imm1:imm2:imm3:imm4 and stored using a single 4-byte store.
> +
>     The algorithm is applied to each basic block in three phases:
>  
> -   1) Scan through the basic block recording constant assignments to
> +   1) Scan through the basic block recording assignments to
>     destinations that can be expressed as a store to memory of a certain size
> -   at a certain bit offset.  Record store chains to different bases in a
> -   hash_map (m_stores) and make sure to terminate such chains when appropriate
> -   (for example when when the stored values get used subsequently).
> +   at a certain bit offset from expressions we can handle.  For bit-fields
> +   we also note the surrounding bit region, bits that could be stored in
> +   a read-modify-write operation when storing the bit-field.  Record store
> +   chains to different bases in a hash_map (m_stores) and make sure to
> +   terminate such chains when appropriate (for example when when the stored
> +   values get used subsequently).
>     These stores can be a result of structure element initializers, array stores
>     etc.  A store_immediate_info object is recorded for every such store.
>     Record as many such assignments to a single base as possible until a
>     statement that interferes with the store sequence is encountered.
> +   Each store has up to 2 operands, which can be an immediate constant
> +   or a memory load, from which the value to be stored can be computed.
> +   At most one of the operands can be a constant.  The operands are recorded
> +   in store_operand_info struct.
>  
>     2) Analyze the chain of stores recorded in phase 1) (i.e. the vector of
>     store_immediate_info objects) and coalesce contiguous stores into
> -   merged_store_group objects.
> +   merged_store_group objects.  For bit-fields stores, we don't need to
> +   require the stores to be contiguous, just their surrounding bit regions
> +   have to be contiguous.  If the expression being stored is different
> +   between adjacent stores, such as one store storing a constant and
> +   following storing a value loaded from memory, or if the loaded memory
> +   objects are not adjacent, a new merged_store_group is created as well.
>  
>     For example, given the stores:
>     [p     ] := 0;
> @@ -134,8 +163,35 @@
>  #define MAX_STORE_BITSIZE (BITS_PER_WORD)
>  #define MAX_STORE_BYTES (MAX_STORE_BITSIZE / BITS_PER_UNIT)
>  
> +/* Limit to bound the number of aliasing checks for loads with the same
> +   vuse as the corresponding store.  */
> +#define MAX_STORE_ALIAS_CHECKS 64
> +
>  namespace {
>  
> +/* Struct recording one operand for the store, which is either a constant,
> +   then VAL represents the constant and all the other fields are zero,
> +   or a memory load, then VAL represents the reference, BASE_ADDR is non-NULL
> +   and the other fields also reflect the memory load.  */
> +
> +struct store_operand_info
> +{
> +  tree val;
> +  tree base_addr;
> +  unsigned HOST_WIDE_INT bitsize;
> +  unsigned HOST_WIDE_INT bitpos;
> +  unsigned HOST_WIDE_INT bitregion_start;
> +  unsigned HOST_WIDE_INT bitregion_end;
> +  gimple *stmt;
> +  store_operand_info ();
> +};
> +
> +store_operand_info::store_operand_info ()
> +  : val (NULL_TREE), base_addr (NULL_TREE), bitsize (0), bitpos (0),
> +    bitregion_start (0), bitregion_end (0), stmt (NULL)
> +{
> +}
> +
>  /* Struct recording the information about a single store of an immediate
>     to memory.  These are created in the first phase and coalesced into
>     merged_store_group objects in the second phase.  */
> @@ -149,9 +205,17 @@ struct store_immediate_info
>    unsigned HOST_WIDE_INT bitregion_end;
>    gimple *stmt;
>    unsigned int order;
> +  /* INTEGER_CST for constant stores, MEM_REF for memory copy or
> +     BIT_*_EXPR for logical bitwise operation.  */
> +  enum tree_code rhs_code;
> +  /* Operands.  For BIT_*_EXPR rhs_code both operands are used, otherwise
> +     just the first one.  */
> +  store_operand_info ops[2];
>    store_immediate_info (unsigned HOST_WIDE_INT, unsigned HOST_WIDE_INT,
>  			unsigned HOST_WIDE_INT, unsigned HOST_WIDE_INT,
> -			gimple *, unsigned int);
> +			gimple *, unsigned int, enum tree_code,
> +			const store_operand_info &,
> +			const store_operand_info &);
>  };
>  
>  store_immediate_info::store_immediate_info (unsigned HOST_WIDE_INT bs,
> @@ -159,11 +223,22 @@ store_immediate_info::store_immediate_in
>  					    unsigned HOST_WIDE_INT brs,
>  					    unsigned HOST_WIDE_INT bre,
>  					    gimple *st,
> -					    unsigned int ord)
> +					    unsigned int ord,
> +					    enum tree_code rhscode,
> +					    const store_operand_info &op0r,
> +					    const store_operand_info &op1r)
>    : bitsize (bs), bitpos (bp), bitregion_start (brs), bitregion_end (bre),
> -    stmt (st), order (ord)
> +    stmt (st), order (ord), rhs_code (rhscode)
> +#if __cplusplus >= 201103L
> +    , ops { op0r, op1r }
> +{
> +}
> +#else
>  {
> +  ops[0] = op0r;
> +  ops[1] = op1r;
>  }
> +#endif
>  
>  /* Struct representing a group of stores to contiguous memory locations.
>     These are produced by the second phase (coalescing) and consumed in the
> @@ -178,8 +253,10 @@ struct merged_store_group
>    /* The size of the allocated memory for val and mask.  */
>    unsigned HOST_WIDE_INT buf_size;
>    unsigned HOST_WIDE_INT align_base;
> +  unsigned HOST_WIDE_INT load_align_base[2];
>  
>    unsigned int align;
> +  unsigned int load_align[2];
>    unsigned int first_order;
>    unsigned int last_order;
>  
> @@ -576,6 +653,20 @@ merged_store_group::merged_store_group (
>    get_object_alignment_1 (gimple_assign_lhs (info->stmt),
>  			  &align, &align_bitpos);
>    align_base = start - align_bitpos;
> +  for (int i = 0; i < 2; ++i)
> +    {
> +      store_operand_info &op = info->ops[i];
> +      if (op.base_addr == NULL_TREE)
> +	{
> +	  load_align[i] = 0;
> +	  load_align_base[i] = 0;
> +	}
> +      else
> +	{
> +	  get_object_alignment_1 (op.val, &load_align[i], &align_bitpos);
> +	  load_align_base[i] = op.bitpos - align_bitpos;
> +	}
> +    }
>    stores.create (1);
>    stores.safe_push (info);
>    last_stmt = info->stmt;
> @@ -608,6 +699,19 @@ merged_store_group::do_merge (store_imme
>        align = this_align;
>        align_base = info->bitpos - align_bitpos;
>      }
> +  for (int i = 0; i < 2; ++i)
> +    {
> +      store_operand_info &op = info->ops[i];
> +      if (!op.base_addr)
> +	continue;
> +
> +      get_object_alignment_1 (op.val, &this_align, &align_bitpos);
> +      if (this_align > load_align[i])
> +	{
> +	  load_align[i] = this_align;
> +	  load_align_base[i] = op.bitpos - align_bitpos;
> +	}
> +    }
>  
>    gimple *stmt = info->stmt;
>    stores.safe_push (info);
> @@ -682,16 +786,21 @@ merged_store_group::apply_stores ()
>    FOR_EACH_VEC_ELT (stores, i, info)
>      {
>        unsigned int pos_in_buffer = info->bitpos - bitregion_start;
> -      bool ret = encode_tree_to_bitpos (gimple_assign_rhs1 (info->stmt),
> -					val, info->bitsize,
> -					pos_in_buffer, buf_size);
> -      if (dump_file && (dump_flags & TDF_DETAILS))
> +      tree cst = NULL_TREE;
> +      if (info->ops[0].val && info->ops[0].base_addr == NULL_TREE)
> +	cst = info->ops[0].val;
> +      else if (info->ops[1].val && info->ops[1].base_addr == NULL_TREE)
> +	cst = info->ops[1].val;
> +      bool ret = true;
> +      if (cst)
> +	ret = encode_tree_to_bitpos (cst, val, info->bitsize,
> +				     pos_in_buffer, buf_size);
> +      if (cst && dump_file && (dump_flags & TDF_DETAILS))
>  	{
>  	  if (ret)
>  	    {
>  	      fprintf (dump_file, "After writing ");
> -	      print_generic_expr (dump_file,
> -				  gimple_assign_rhs1 (info->stmt), 0);
> +	      print_generic_expr (dump_file, cst, 0);
>  	      fprintf (dump_file, " of size " HOST_WIDE_INT_PRINT_DEC
>  			" at position %d the merged region contains:\n",
>  			info->bitsize, pos_in_buffer);
> @@ -799,9 +908,10 @@ private:
>       decisions when going out of SSA).  */
>    imm_store_chain_info *m_stores_head;
>  
> +  void process_store (gimple *);
>    bool terminate_and_process_all_chains ();
>    bool terminate_all_aliasing_chains (imm_store_chain_info **,
> -				      bool, gimple *);
> +				      gimple *);
>    bool terminate_and_release_chain (imm_store_chain_info *);
>  }; // class pass_store_merging
>  
> @@ -831,7 +941,6 @@ pass_store_merging::terminate_and_proces
>  bool
>  pass_store_merging::terminate_all_aliasing_chains (imm_store_chain_info
>  						     **chain_info,
> -						   bool var_offset_p,
>  						   gimple *stmt)
>  {
>    bool ret = false;
> @@ -845,37 +954,21 @@ pass_store_merging::terminate_all_aliasi
>       of a chain.  */
>    if (chain_info)
>      {
> -      /* We have a chain at BASE and we're writing to [BASE + <variable>].
> -	 This can interfere with any of the stores so terminate
> -	 the chain.  */
> -      if (var_offset_p)
> +      store_immediate_info *info;
> +      unsigned int i;
> +      FOR_EACH_VEC_ELT ((*chain_info)->m_store_info, i, info)
>  	{
> -	  terminate_and_release_chain (*chain_info);
> -	  ret = true;
> -	}
> -      /* Otherwise go through every store in the chain to see if it
> -	 aliases with any of them.  */
> -      else
> -	{
> -	  store_immediate_info *info;
> -	  unsigned int i;
> -	  FOR_EACH_VEC_ELT ((*chain_info)->m_store_info, i, info)
> +	  if (ref_maybe_used_by_stmt_p (stmt, gimple_assign_lhs (info->stmt))
> +	      || stmt_may_clobber_ref_p (stmt, gimple_assign_lhs (info->stmt)))
>  	    {
> -	      if (ref_maybe_used_by_stmt_p (stmt,
> -					    gimple_assign_lhs (info->stmt))
> -		  || stmt_may_clobber_ref_p (stmt,
> -					     gimple_assign_lhs (info->stmt)))
> +	      if (dump_file && (dump_flags & TDF_DETAILS))
>  		{
> -		  if (dump_file && (dump_flags & TDF_DETAILS))
> -		    {
> -		      fprintf (dump_file,
> -			       "stmt causes chain termination:\n");
> -		      print_gimple_stmt (dump_file, stmt, 0);
> -		    }
> -		  terminate_and_release_chain (*chain_info);
> -		  ret = true;
> -		  break;
> +		  fprintf (dump_file, "stmt causes chain termination:\n");
> +		  print_gimple_stmt (dump_file, stmt, 0);
>  		}
> +	      terminate_and_release_chain (*chain_info);
> +	      ret = true;
> +	      break;
>  	    }
>  	}
>      }
> @@ -920,6 +1013,109 @@ pass_store_merging::terminate_and_releas
>    return ret;
>  }
>  
> +/* Return true if stmts in between FIRST (inclusive) and LAST (exclusive)
> +   may clobber REF.  FIRST and LAST must be in the same basic block and
> +   have non-NULL vdef.  */
> +
> +bool
> +stmts_may_clobber_ref_p (gimple *first, gimple *last, tree ref)
> +{
> +  ao_ref r;
> +  ao_ref_init (&r, ref);
> +  unsigned int count = 0;
> +  tree vop = gimple_vdef (last);
> +  gimple *stmt;
> +
> +  gcc_checking_assert (gimple_bb (first) == gimple_bb (last));

EBB would probably work as well, so we should assert that we do not
end up visiting a PHI in the loop?

Maybe instead assert here that first has a VDEF and that the
bb of first dominates that of last?

> +  do
> +    {
> +      stmt = SSA_NAME_DEF_STMT (vop);

Thus gcc_assert (gimple_code (stmt) != GIMPLE_PHI)?  OTOH we'll ICE
at the gimple_vuse () call for PHIs, so the assert would be for
documentation purposes only (EBB).
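
For concreteness, a sketch (not part of the patch) of where such a
documentation-only assert could sit, right after fetching the defining
statement:

      stmt = SSA_NAME_DEF_STMT (vop);
      /* Within a single (extended) basic block the vdef walk from LAST
         back to FIRST should never reach a PHI; documentation only,
         since gimple_vuse () would ICE on a PHI anyway.  */
      gcc_checking_assert (gimple_code (stmt) != GIMPLE_PHI);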

> +      if (stmt_may_clobber_ref_p_1 (stmt, &r))
> +	return true;
> +      /* Avoid quadratic compile time by bounding the number of checks
> +	 we perform.  */
> +      if (++count > MAX_STORE_ALIAS_CHECKS)
> +	return true;
> +      vop = gimple_vuse (stmt);
> +    }
> +  while (stmt != first);
> +  return false;
> +}
> +
> +/* Return true if INFO->ops[IDX] is mergeable with the
> +   corresponding loads already in MERGED_STORE group.
> +   BASE_ADDR is the base address of the whole store group.  */
> +
> +bool
> +compatible_load_p (merged_store_group *merged_store,
> +		   store_immediate_info *info,
> +		   tree base_addr, int idx)
> +{
> +  store_immediate_info *infof = merged_store->stores[0];
> +  if (!info->ops[idx].base_addr
> +      || (info->ops[idx].bitpos - infof->ops[idx].bitpos
> +	  != info->bitpos - infof->bitpos)
> +      || !operand_equal_p (info->ops[idx].base_addr,
> +			   infof->ops[idx].base_addr, 0))
> +    return false;
> +
> +  store_immediate_info *infol = merged_store->stores.last ();
> +  tree load_vuse = gimple_vuse (info->ops[idx].stmt);
> +  /* In this case all vuses should be the same, e.g.
> +     _1 = s.a; _2 = s.b; _3 = _1 | 1; t.a = _3; _4 = _2 | 2; t.b = _4;
> +     or
> +     _1 = s.a; _2 = s.b; t.a = _1; t.b = _2;
> +     and we can emit the coalesced load next to any of those loads.  */
> +  if (gimple_vuse (infof->ops[idx].stmt) == load_vuse
> +      && gimple_vuse (infol->ops[idx].stmt) == load_vuse)
> +    return true;
> +
> +  /* Otherwise, at least for now require that the load has the same
> +     vuse as the store.  See following examples.  */
> +  if (gimple_vuse (info->stmt) != load_vuse)
> +    return false;
> +
> +  if (gimple_vuse (infof->stmt) != gimple_vuse (infof->ops[idx].stmt)
> +      || (infof != infol
> +	  && gimple_vuse (infol->stmt) != gimple_vuse (infol->ops[idx].stmt)))
> +    return false;
> +
> +  /* If the load is from the same location as the store, already
> +     the construction of the immediate chain info guarantees no intervening
> +     stores, so no further checks are needed.  Example:
> +     _1 = s.a; _2 = _1 & -7; s.a = _2; _3 = s.b; _4 = _3 & -7; s.b = _4;  */
> +  if (info->ops[idx].bitpos == info->bitpos
> +      && operand_equal_p (info->ops[idx].base_addr, base_addr, 0))
> +    return true;
> +
> +  gimple *first = merged_store->first_stmt;
> +  gimple *last = merged_store->last_stmt;
> +  unsigned int i;
> +  store_immediate_info *infoc;
> +  if (info->order < merged_store->first_order)
> +    {
> +      FOR_EACH_VEC_ELT (merged_store->stores, i, infoc)
> +	if (stmts_may_clobber_ref_p (info->stmt, first, infoc->ops[idx].val))
> +	  return false;
> +      first = info->stmt;
> +    }
> +  else if (info->order > merged_store->last_order)
> +    {
> +      FOR_EACH_VEC_ELT (merged_store->stores, i, infoc)
> +	if (stmts_may_clobber_ref_p (last, info->stmt, infoc->ops[idx].val))
> +	  return false;
> +      last = info->stmt;
> +    }
> +  if (stmts_may_clobber_ref_p (first, last, info->ops[idx].val))
> +    return false;

Can you comment on what you check in this block?  It first checks
all stmts (but not info->stmt itself if it is after last!?) against
all stores that would be added when adding 'info'.  Then it checks
from the new first to last against the newly added stmt (again
excluding that stmt if it was added last).
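
As a C-level illustration of the situation these checks guard against
(hypothetical example, not from the testsuite): if a statement between
the loads and the widened store may clobber the loaded locations, the
coalesced load must not be used, so compatible_load_p re-checks the
statements in the extended range with stmts_may_clobber_ref_p:

  struct S { unsigned char a, b; };
  struct S s, t;
  unsigned char *p;

  void
  f (void)
  {
    t.a = s.a;  /* load of s.a feeding the first store        */
    *p = 1;     /* possible clobber of s if *p points into s  */
    t.b = s.b;  /* load of s.b feeding the second store       */
  }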

> +
> +  /* Otherwise, we are looking for:
> +     _1 = s.a; _2 = _1 ^ 15; t.a = _2; _3 = s.b; _4 = _3 ^ 15; t.b = _4;
> +     or
> +     _1 = s.a; t.a = _1; _2 = s.b; t.b = _2;  */
> +  return true;
> +}
> +
>  /* Go through the candidate stores recorded in m_store_info and merge them
>     into merged_store_group objects recorded into m_merged_store_groups
>     representing the widened stores.  Return true if coalescing was successful
> @@ -967,32 +1163,56 @@ imm_store_chain_info::coalesce_immediate
>        if (IN_RANGE (start, merged_store->start,
>  		    merged_store->start + merged_store->width - 1))
>  	{
> -	  merged_store->merge_overlapping (info);
> -	  continue;
> +	  /* Only allow overlapping stores of constants.  */
> +	  if (info->rhs_code == INTEGER_CST
> +	      && merged_store->stores[0]->rhs_code == INTEGER_CST)
> +	    {
> +	      merged_store->merge_overlapping (info);
> +	      continue;
> +	    }
> +	}
> +      /* |---store 1---||---store 2---|
> +	 This store is consecutive to the previous one.
> +	 Merge it into the current store group.  There can be gaps in between
> +	 the stores, but there can't be gaps in between bitregions.  */
> +      else if (info->bitregion_start <= merged_store->bitregion_end
> +	       && info->rhs_code == merged_store->stores[0]->rhs_code)
> +	{
> +	  store_immediate_info *infof = merged_store->stores[0];
> +
> +	  /* All the rhs_code ops that take 2 operands are commutative,
> +	     swap the operands if it could make the operands compatible.  */
> +	  if (infof->ops[0].base_addr
> +	      && infof->ops[1].base_addr
> +	      && info->ops[0].base_addr
> +	      && info->ops[1].base_addr
> +	      && (info->ops[1].bitpos - infof->ops[0].bitpos
> +		  == info->bitpos - infof->bitpos)
> +	      && operand_equal_p (info->ops[1].base_addr,
> +				  infof->ops[0].base_addr, 0))
> +	    std::swap (info->ops[0], info->ops[1]);
> +	  if ((!infof->ops[0].base_addr
> +	       || compatible_load_p (merged_store, info, base_addr, 0))
> +	      && (!infof->ops[1].base_addr
> +		  || compatible_load_p (merged_store, info, base_addr, 1)))
> +	    {
> +	      merged_store->merge_into (info);
> +	      continue;
> +	    }
>  	}
>  
>        /* |---store 1---| <gap> |---store 2---|.
> -	 Gap between stores.  Start a new group if there are any gaps
> -	 between bitregions.  */
> -      if (info->bitregion_start > merged_store->bitregion_end)
> -	{
> -	  /* Try to apply all the stores recorded for the group to determine
> -	     the bitpattern they write and discard it if that fails.
> -	     This will also reject single-store groups.  */
> -	  if (!merged_store->apply_stores ())
> -	    delete merged_store;
> -	  else
> -	    m_merged_store_groups.safe_push (merged_store);
> +	 Gap between stores or the rhs not compatible.  Start a new group.  */
>  
> -	  merged_store = new merged_store_group (info);
> -
> -	  continue;
> -	}
> +      /* Try to apply all the stores recorded for the group to determine
> +	 the bitpattern they write and discard it if that fails.
> +	 This will also reject single-store groups.  */
> +      if (!merged_store->apply_stores ())
> +	delete merged_store;
> +      else
> +	m_merged_store_groups.safe_push (merged_store);
>  
> -      /* |---store 1---||---store 2---|
> -	 This store is consecutive to the previous one.
> -	 Merge it into the current store group.  */
> -       merged_store->merge_into (info);
> +      merged_store = new merged_store_group (info);
>      }
>  
>    /* Record or discard the last store group.  */
> @@ -1014,35 +1234,57 @@ imm_store_chain_info::coalesce_immediate
>    return success;
>  }
>  
> -/* Return the type to use for the merged stores described by STMTS.
> -   This is needed to get the alias sets right.  */
> +/* Return the type to use for the merged stores or loads described by STMTS.
> +   This is needed to get the alias sets right.  If IS_LOAD, look for rhs,
> +   otherwise lhs.  Additionally set *CLIQUEP and *BASEP to MR_DEPENDENCE_*
> +   of the MEM_REFs if any.  */
>  
>  static tree
> -get_alias_type_for_stmts (auto_vec<gimple *> &stmts)
> +get_alias_type_for_stmts (vec<gimple *> &stmts, bool is_load,
> +			  unsigned short *cliquep, unsigned short *basep)
>  {
>    gimple *stmt;
>    unsigned int i;
> -  tree lhs = gimple_assign_lhs (stmts[0]);
> -  tree type = reference_alias_ptr_type (lhs);
> +  tree type = NULL_TREE;
> +  tree ret = NULL_TREE;
> +  *cliquep = 0;
> +  *basep = 0;
>  
>    FOR_EACH_VEC_ELT (stmts, i, stmt)
>      {
> -      if (i == 0)
> -	continue;
> +      tree ref = is_load ? gimple_assign_rhs1 (stmt)
> +			 : gimple_assign_lhs (stmt);
> +      tree type1 = reference_alias_ptr_type (ref);
> +      tree base = get_base_address (ref);
>  
> -      lhs = gimple_assign_lhs (stmt);
> -      tree type1 = reference_alias_ptr_type (lhs);
> +      if (i == 0)
> +	{
> +	  if (TREE_CODE (base) == MEM_REF)
> +	    {
> +	      *cliquep = MR_DEPENDENCE_CLIQUE (base);
> +	      *basep = MR_DEPENDENCE_BASE (base);
> +	    }
> +	  ret = type = type1;
> +	  continue;
> +	}
>        if (!alias_ptr_types_compatible_p (type, type1))
> -	return ptr_type_node;
> +	ret = ptr_type_node;
> +      if (TREE_CODE (base) != MEM_REF
> +	  || *cliquep != MR_DEPENDENCE_CLIQUE (base)
> +	  || *basep != MR_DEPENDENCE_BASE (base))
> +	{
> +	  *cliquep = 0;
> +	  *basep = 0;
> +	}
>      }
> -  return type;
> +  return ret;
>  }
>  
>  /* Return the location_t information we can find among the statements
>     in STMTS.  */
>  
>  static location_t
> -get_location_for_stmts (auto_vec<gimple *> &stmts)
> +get_location_for_stmts (vec<gimple *> &stmts)
>  {
>    gimple *stmt;
>    unsigned int i;
> @@ -1062,7 +1304,7 @@ struct split_store
>    unsigned HOST_WIDE_INT bytepos;
>    unsigned HOST_WIDE_INT size;
>    unsigned HOST_WIDE_INT align;
> -  auto_vec<gimple *> orig_stmts;
> +  auto_vec<store_immediate_info *> orig_stores;
>    /* True if there is a single orig stmt covering the whole split store.  */
>    bool orig;
>    split_store (unsigned HOST_WIDE_INT, unsigned HOST_WIDE_INT,
> @@ -1076,21 +1318,20 @@ split_store::split_store (unsigned HOST_
>  			  unsigned HOST_WIDE_INT al)
>  			  : bytepos (bp), size (sz), align (al), orig (false)
>  {
> -  orig_stmts.create (0);
> +  orig_stores.create (0);
>  }
>  
> -/* Record all statements corresponding to stores in GROUP that write to
> -   the region starting at BITPOS and is of size BITSIZE.  Record such
> -   statements in STMTS if non-NULL.  The stores in GROUP must be sorted by
> -   bitposition.  Return INFO if there is exactly one original store
> -   in the range.  */
> +/* Record all stores in GROUP that write to the region starting at BITPOS and
> +   is of size BITSIZE.  Record infos for such statements in STORES if
> +   non-NULL.  The stores in GROUP must be sorted by bitposition.  Return INFO
> +   if there is exactly one original store in the range.  */
>  
>  static store_immediate_info *
> -find_constituent_stmts (struct merged_store_group *group,
> -			vec<gimple *> *stmts,
> -			unsigned int *first,
> -			unsigned HOST_WIDE_INT bitpos,
> -			unsigned HOST_WIDE_INT bitsize)
> +find_constituent_stores (struct merged_store_group *group,
> +			 vec<store_immediate_info *> *stores,
> +			 unsigned int *first,
> +			 unsigned HOST_WIDE_INT bitpos,
> +			 unsigned HOST_WIDE_INT bitsize)
>  {
>    store_immediate_info *info, *ret = NULL;
>    unsigned int i;
> @@ -1119,9 +1360,9 @@ find_constituent_stmts (struct merged_st
>        if (stmt_start >= end)
>  	return ret;
>  
> -      if (stmts)
> +      if (stores)
>  	{
> -	  stmts->safe_push (info->stmt);
> +	  stores->safe_push (info);
>  	  if (ret)
>  	    {
>  	      ret = NULL;
> @@ -1143,11 +1384,14 @@ find_constituent_stmts (struct merged_st
>     This is to separate the splitting strategy from the statement
>     building/emission/linking done in output_merged_store.
>     Return number of new stores.
> +   If ALLOW_UNALIGNED_STORE is false, then all stores must be aligned.
> +   If ALLOW_UNALIGNED_LOAD is false, then all loads must be aligned.
>     If SPLIT_STORES is NULL, it is just a dry run to count number of
>     new stores.  */
>  
>  static unsigned int
> -split_group (merged_store_group *group, bool allow_unaligned,
> +split_group (merged_store_group *group, bool allow_unaligned_store,
> +	     bool allow_unaligned_load,
>  	     vec<struct split_store *> *split_stores)
>  {
>    unsigned HOST_WIDE_INT pos = group->bitregion_start;
> @@ -1155,6 +1399,7 @@ split_group (merged_store_group *group,
>    unsigned HOST_WIDE_INT bytepos = pos / BITS_PER_UNIT;
>    unsigned HOST_WIDE_INT group_align = group->align;
>    unsigned HOST_WIDE_INT align_base = group->align_base;
> +  unsigned HOST_WIDE_INT group_load_align = group_align;
>  
>    gcc_assert ((size % BITS_PER_UNIT == 0) && (pos % BITS_PER_UNIT == 0));
>  
> @@ -1162,9 +1407,14 @@ split_group (merged_store_group *group,
>    unsigned HOST_WIDE_INT try_pos = bytepos;
>    group->stores.qsort (sort_by_bitpos);
>  
> +  if (!allow_unaligned_load)
> +    for (int i = 0; i < 2; ++i)
> +      if (group->load_align[i])
> +	group_load_align = MIN (group_load_align, group->load_align[i]);
> +
>    while (size > 0)
>      {
> -      if ((allow_unaligned || group_align <= BITS_PER_UNIT)
> +      if ((allow_unaligned_store || group_align <= BITS_PER_UNIT)
>  	  && group->mask[try_pos - bytepos] == (unsigned char) ~0U)
>  	{
>  	  /* Skip padding bytes.  */
> @@ -1180,10 +1430,34 @@ split_group (merged_store_group *group,
>        unsigned HOST_WIDE_INT align = group_align;
>        if (align_bitpos)
>  	align = least_bit_hwi (align_bitpos);
> -      if (!allow_unaligned)
> +      if (!allow_unaligned_store)
>  	try_size = MIN (try_size, align);
> +      if (!allow_unaligned_load)
> +	{
> +	  /* If we can't do or don't want to do unaligned stores
> +	     as well as loads, we need to take the loads into account
> +	     as well.  */
> +	  unsigned HOST_WIDE_INT load_align = group_load_align;
> +	  align_bitpos = (try_bitpos - align_base) & (load_align - 1);
> +	  if (align_bitpos)
> +	    load_align = least_bit_hwi (align_bitpos);
> +	  for (int i = 0; i < 2; ++i)
> +	    if (group->load_align[i])
> +	      {
> +		align_bitpos = try_bitpos - group->stores[0]->bitpos;
> +		align_bitpos += group->stores[0]->ops[i].bitpos;
> +		align_bitpos -= group->load_align_base[i];
> +		align_bitpos &= (group_load_align - 1);
> +		if (align_bitpos)
> +		  {
> +		    unsigned HOST_WIDE_INT a = least_bit_hwi (align_bitpos);
> +		    load_align = MIN (load_align, a);
> +		  }
> +	      }
> +	  try_size = MIN (try_size, load_align);
> +	}
>        store_immediate_info *info
> -	= find_constituent_stmts (group, NULL, &first, try_bitpos, try_size);
> +	= find_constituent_stores (group, NULL, &first, try_bitpos, try_size);
>        if (info)
>  	{
>  	  /* If there is just one original statement for the range, see if
> @@ -1191,8 +1465,8 @@ split_group (merged_store_group *group,
>  	     than try_size.  */
>  	  unsigned HOST_WIDE_INT stmt_end
>  	    = ROUND_UP (info->bitpos + info->bitsize, BITS_PER_UNIT);
> -	  info = find_constituent_stmts (group, NULL, &first, try_bitpos,
> -					 stmt_end - try_bitpos);
> +	  info = find_constituent_stores (group, NULL, &first, try_bitpos,
> +					  stmt_end - try_bitpos);
>  	  if (info && info->bitpos >= try_bitpos)
>  	    {
>  	      try_size = stmt_end - try_bitpos;
> @@ -1221,7 +1495,7 @@ split_group (merged_store_group *group,
>        nonmasked *= BITS_PER_UNIT;
>        while (nonmasked <= try_size / 2)
>  	try_size /= 2;
> -      if (!allow_unaligned && group_align > BITS_PER_UNIT)
> +      if (!allow_unaligned_store && group_align > BITS_PER_UNIT)
>  	{
>  	  /* Now look for whole padding bytes at the start of that bitsize.  */
>  	  unsigned int try_bytesize = try_size / BITS_PER_UNIT, masked;
> @@ -1252,8 +1526,8 @@ split_group (merged_store_group *group,
>  	{
>  	  struct split_store *store
>  	    = new split_store (try_pos, try_size, align);
> -	  info = find_constituent_stmts (group, &store->orig_stmts,
> -	  				 &first, try_bitpos, try_size);
> +	  info = find_constituent_stores (group, &store->orig_stores,
> +					  &first, try_bitpos, try_size);
>  	  if (info
>  	      && info->bitpos >= try_bitpos
>  	      && info->bitpos + info->bitsize <= try_bitpos + try_size)
> @@ -1288,19 +1562,23 @@ imm_store_chain_info::output_merged_stor
>  
>    auto_vec<struct split_store *, 32> split_stores;
>    split_stores.create (0);
> -  bool allow_unaligned
> +  bool allow_unaligned_store
>      = !STRICT_ALIGNMENT && PARAM_VALUE (PARAM_STORE_MERGING_ALLOW_UNALIGNED);
> -  if (allow_unaligned)
> +  bool allow_unaligned_load = allow_unaligned_store;
> +  if (allow_unaligned_store)
>      {
>        /* If unaligned stores are allowed, see how many stores we'd emit
>  	 for unaligned and how many stores we'd emit for aligned stores.
>  	 Only use unaligned stores if it allows fewer stores than aligned.  */
> -      unsigned aligned_cnt = split_group (group, false, NULL);
> -      unsigned unaligned_cnt = split_group (group, true, NULL);
> +      unsigned aligned_cnt
> +	= split_group (group, false, allow_unaligned_load, NULL);
> +      unsigned unaligned_cnt
> +	= split_group (group, true, allow_unaligned_load, NULL);
>        if (aligned_cnt <= unaligned_cnt)
> -	allow_unaligned = false;
> +	allow_unaligned_store = false;
>      }
> -  split_group (group, allow_unaligned, &split_stores);
> +  split_group (group, allow_unaligned_store, allow_unaligned_load,
> +	       &split_stores);
>  
>    if (split_stores.length () >= orig_num_stmts)
>      {
> @@ -1323,9 +1601,37 @@ imm_store_chain_info::output_merged_stor
>    gimple *stmt = NULL;
>    split_store *split_store;
>    unsigned int i;
> -
> +  auto_vec<gimple *, 32> orig_stmts;
>    tree addr = force_gimple_operand_1 (unshare_expr (base_addr), &seq,
>  				      is_gimple_mem_ref_addr, NULL_TREE);
> +
> +  tree load_addr[2] = { NULL_TREE, NULL_TREE };
> +  gimple_seq load_seq[2] = { NULL, NULL };
> +  gimple_stmt_iterator load_gsi[2] = { gsi_none (), gsi_none () };
> +  for (int j = 0; j < 2; ++j)
> +    {
> +      store_operand_info &op = group->stores[0]->ops[j];
> +      if (op.base_addr == NULL_TREE)
> +	continue;
> +
> +      store_immediate_info *infol = group->stores.last ();
> +      if (gimple_vuse (op.stmt) == gimple_vuse (infol->ops[j].stmt))
> +	{
> +	  load_gsi[j] = gsi_for_stmt (op.stmt);
> +	  load_addr[j]
> +	    = force_gimple_operand_1 (unshare_expr (op.base_addr),
> +				      &load_seq[j], is_gimple_mem_ref_addr,
> +				      NULL_TREE);
> +	}
> +      else if (operand_equal_p (base_addr, op.base_addr, 0))
> +	load_addr[j] = addr;
> +      else
> +	load_addr[j]
> +	  = force_gimple_operand_1 (unshare_expr (op.base_addr),
> +				    &seq, is_gimple_mem_ref_addr,
> +				    NULL_TREE);
> +    }
> +
>    FOR_EACH_VEC_ELT (split_stores, i, split_store)
>      {
>        unsigned HOST_WIDE_INT try_size = split_store->size;
> @@ -1337,27 +1643,144 @@ imm_store_chain_info::output_merged_stor
>  	{
>  	  /* If there is just a single constituent store which covers
>  	     the whole area, just reuse the lhs and rhs.  */
> -	  dest = gimple_assign_lhs (split_store->orig_stmts[0]);
> -	  src = gimple_assign_rhs1 (split_store->orig_stmts[0]);
> -	  loc = gimple_location (split_store->orig_stmts[0]);
> +	  gimple *orig_stmt = split_store->orig_stores[0]->stmt;
> +	  dest = gimple_assign_lhs (orig_stmt);
> +	  src = gimple_assign_rhs1 (orig_stmt);
> +	  loc = gimple_location (orig_stmt);
>  	}
>        else
>  	{
> +	  store_immediate_info *info;
> +	  unsigned short clique, base;
> +	  unsigned int k;
> +	  FOR_EACH_VEC_ELT (split_store->orig_stores, k, info)
> +	    orig_stmts.safe_push (info->stmt);
>  	  tree offset_type
> -	    = get_alias_type_for_stmts (split_store->orig_stmts);
> -	  loc = get_location_for_stmts (split_store->orig_stmts);
> +	    = get_alias_type_for_stmts (orig_stmts, false, &clique, &base);
> +	  loc = get_location_for_stmts (orig_stmts);
> +	  orig_stmts.truncate (0);
>  
>  	  tree int_type = build_nonstandard_integer_type (try_size, UNSIGNED);
>  	  int_type = build_aligned_type (int_type, align);
>  	  dest = fold_build2 (MEM_REF, int_type, addr,
>  			      build_int_cst (offset_type, try_pos));
> -	  src = native_interpret_expr (int_type,
> -				       group->val + try_pos - start_byte_pos,
> -				       group->buf_size);
> +	  if (TREE_CODE (dest) == MEM_REF)
> +	    {
> +	      MR_DEPENDENCE_CLIQUE (dest) = clique;
> +	      MR_DEPENDENCE_BASE (dest) = base;
> +	    }
> +
>  	  tree mask
>  	    = native_interpret_expr (int_type,
>  				     group->mask + try_pos - start_byte_pos,
>  				     group->buf_size);
> +
> +	  tree ops[2];
> +	  for (int j = 0;
> +	       j < 1 + (split_store->orig_stores[0]->ops[1].val != NULL_TREE);
> +	       ++j)
> +	    {
> +	      store_operand_info &op = split_store->orig_stores[0]->ops[j];
> +	      if (op.base_addr)
> +		{
> +		  FOR_EACH_VEC_ELT (split_store->orig_stores, k, info)
> +		    orig_stmts.safe_push (info->ops[j].stmt);
> +
> +		  offset_type = get_alias_type_for_stmts (orig_stmts, true,
> +							  &clique, &base);
> +		  location_t load_loc = get_location_for_stmts (orig_stmts);
> +		  orig_stmts.truncate (0);
> +
> +		  unsigned HOST_WIDE_INT load_align = group->load_align[j];
> +		  unsigned HOST_WIDE_INT align_bitpos
> +		    = (try_pos * BITS_PER_UNIT
> +		       - split_store->orig_stores[0]->bitpos
> +		       + op.bitpos) & (load_align - 1);
> +		  if (align_bitpos)
> +		    load_align = least_bit_hwi (align_bitpos);
> +
> +		  tree load_int_type
> +		    = build_nonstandard_integer_type (try_size, UNSIGNED);
> +		  load_int_type
> +		    = build_aligned_type (load_int_type, load_align);
> +
> +		  unsigned HOST_WIDE_INT load_pos
> +		    = (try_pos * BITS_PER_UNIT
> +		       - split_store->orig_stores[0]->bitpos
> +		       + op.bitpos) / BITS_PER_UNIT;
> +		  ops[j] = fold_build2 (MEM_REF, load_int_type, load_addr[j],
> +					build_int_cst (offset_type, load_pos));
> +		  if (TREE_CODE (ops[j]) == MEM_REF)
> +		    {
> +		      MR_DEPENDENCE_CLIQUE (ops[j]) = clique;
> +		      MR_DEPENDENCE_BASE (ops[j]) = base;
> +		    }
> +		  if (!integer_zerop (mask))
> +		    /* The load might load some bits (that will be masked off
> +		       later on) uninitialized, avoid -W*uninitialized
> +		       warnings in that case.  */
> +		    TREE_NO_WARNING (ops[j]) = 1;
> +
> +		  stmt = gimple_build_assign (make_ssa_name (int_type),
> +					      ops[j]);
> +		  gimple_set_location (stmt, load_loc);
> +		  if (gsi_bb (load_gsi[j]))
> +		    {
> +		      gimple_set_vuse (stmt, gimple_vuse (op.stmt));
> +		      gimple_seq_add_stmt_without_update (&load_seq[j], stmt);
> +		    }
> +		  else
> +		    {
> +		      gimple_set_vuse (stmt, new_vuse);
> +		      gimple_seq_add_stmt_without_update (&seq, stmt);
> +		    }
> +		  ops[j] = gimple_assign_lhs (stmt);
> +		}
> +	      else
> +		ops[j] = native_interpret_expr (int_type,
> +						group->val + try_pos
> +						- start_byte_pos,
> +						group->buf_size);
> +	    }
> +
> +	  switch (split_store->orig_stores[0]->rhs_code)
> +	    {
> +	    case BIT_AND_EXPR:
> +	    case BIT_IOR_EXPR:
> +	    case BIT_XOR_EXPR:
> +	      FOR_EACH_VEC_ELT (split_store->orig_stores, k, info)
> +		{
> +		  tree rhs1 = gimple_assign_rhs1 (info->stmt);
> +		  orig_stmts.safe_push (SSA_NAME_DEF_STMT (rhs1));
> +		}
> +	      location_t bit_loc;
> +	      bit_loc = get_location_for_stmts (orig_stmts);
> +	      orig_stmts.truncate (0);
> +
> +	      stmt
> +		= gimple_build_assign (make_ssa_name (int_type),
> +				       split_store->orig_stores[0]->rhs_code,
> +				       ops[0], ops[1]);
> +	      gimple_set_location (stmt, bit_loc);
> +	      /* If there is just one load and there is a separate
> +		 load_seq[0], emit the bitwise op right after it.  */
> +	      if (load_addr[1] == NULL_TREE && gsi_bb (load_gsi[0]))
> +		gimple_seq_add_stmt_without_update (&load_seq[0], stmt);
> +	      /* Otherwise, if at least one load is in seq, we need to
> +		 emit the bitwise op right before the store.  If there
> +		 are two loads and are emitted somewhere else, it would
> +		 be better to emit the bitwise op as early as possible;
> +		 we don't track where that would be possible right now
> +		 though.  */
> +	      else
> +		gimple_seq_add_stmt_without_update (&seq, stmt);
> +	      src = gimple_assign_lhs (stmt);
> +	      break;
> +	    default:
> +	      src = ops[0];
> +	      break;
> +	    }
> +
>  	  if (!integer_zerop (mask))
>  	    {
>  	      tree tem = make_ssa_name (int_type);
> @@ -1382,9 +1805,21 @@ imm_store_chain_info::output_merged_stor
>  	      gimple_seq_add_stmt_without_update (&seq, stmt);
>  	      tem = gimple_assign_lhs (stmt);
>  
> -	      src = wide_int_to_tree (int_type,
> -				      wi::bit_and_not (wi::to_wide (src),
> -						       wi::to_wide (mask)));
> +	      if (TREE_CODE (src) == INTEGER_CST)
> +		src = wide_int_to_tree (int_type,
> +					wi::bit_and_not (wi::to_wide (src),
> +							 wi::to_wide (mask)));
> +	      else
> +		{
> +		  tree nmask
> +		    = wide_int_to_tree (int_type,
> +					wi::bit_not (wi::to_wide (mask)));
> +		  stmt = gimple_build_assign (make_ssa_name (int_type),
> +					      BIT_AND_EXPR, src, nmask);
> +		  gimple_set_location (stmt, loc);
> +		  gimple_seq_add_stmt_without_update (&seq, stmt);
> +		  src = gimple_assign_lhs (stmt);
> +		}
>  	      stmt = gimple_build_assign (make_ssa_name (int_type),
>  					  BIT_IOR_EXPR, tem, src);
>  	      gimple_set_location (stmt, loc);
> @@ -1422,6 +1857,9 @@ imm_store_chain_info::output_merged_stor
>  	print_gimple_seq (dump_file, seq, 0, TDF_VOPS | TDF_MEMSYMS);
>      }
>    gsi_insert_seq_after (&last_gsi, seq, GSI_SAME_STMT);
> +  for (int j = 0; j < 2; ++j)
> +    if (load_seq[j])
> +      gsi_insert_seq_after (&load_gsi[j], load_seq[j], GSI_SAME_STMT);
>  
>    return true;
>  }
> @@ -1520,10 +1958,290 @@ rhs_valid_for_store_merging_p (tree rhs)
>  			     GET_MODE_SIZE (TYPE_MODE (TREE_TYPE (rhs)))) != 0;
>  }
>  
> +/* If MEM is a memory reference usable for store merging (either as
> +   store destination or for loads), return the non-NULL base_addr
> +   and set *PBITSIZE, *PBITPOS, *PBITREGION_START and *PBITREGION_END.
> +   Otherwise return NULL, *PBITPOS should be still valid even for that
> +   case.  */
> +
> +static tree
> +mem_valid_for_store_merging (tree mem, unsigned HOST_WIDE_INT *pbitsize,
> +			     unsigned HOST_WIDE_INT *pbitpos,
> +			     unsigned HOST_WIDE_INT *pbitregion_start,
> +			     unsigned HOST_WIDE_INT *pbitregion_end)
> +{
> +  HOST_WIDE_INT bitsize;
> +  HOST_WIDE_INT bitpos;
> +  unsigned HOST_WIDE_INT bitregion_start = 0;
> +  unsigned HOST_WIDE_INT bitregion_end = 0;
> +  machine_mode mode;
> +  int unsignedp = 0, reversep = 0, volatilep = 0;
> +  tree offset;
> +  tree base_addr = get_inner_reference (mem, &bitsize, &bitpos, &offset, &mode,
> +					&unsignedp, &reversep, &volatilep);
> +  *pbitsize = bitsize;
> +  if (bitsize == 0)
> +    return NULL_TREE;
> +
> +  if (TREE_CODE (mem) == COMPONENT_REF
> +      && DECL_BIT_FIELD_TYPE (TREE_OPERAND (mem, 1)))
> +    {
> +      get_bit_range (&bitregion_start, &bitregion_end, mem, &bitpos, &offset);
> +      if (bitregion_end)
> +	++bitregion_end;
> +    }
> +
> +  if (reversep)
> +    return NULL_TREE;
> +
> +  /* We do not want to rewrite TARGET_MEM_REFs.  */
> +  if (TREE_CODE (base_addr) == TARGET_MEM_REF)
> +    return NULL_TREE;
> +  /* In some cases get_inner_reference may return a
> +     MEM_REF [ptr + byteoffset].  For the purposes of this pass
> +     canonicalize the base_addr to MEM_REF [ptr] and take
> +     byteoffset into account in the bitpos.  This occurs in
> +     PR 23684 and this way we can catch more chains.  */
> +  else if (TREE_CODE (base_addr) == MEM_REF)
> +    {
> +      offset_int bit_off, byte_off = mem_ref_offset (base_addr);
> +      bit_off = byte_off << LOG2_BITS_PER_UNIT;
> +      bit_off += bitpos;
> +      if (!wi::neg_p (bit_off) && wi::fits_shwi_p (bit_off))
> +	{
> +	  bitpos = bit_off.to_shwi ();
> +	  if (bitregion_end)
> +	    {
> +	      bit_off = byte_off << LOG2_BITS_PER_UNIT;
> +	      bit_off += bitregion_start;
> +	      if (wi::fits_uhwi_p (bit_off))
> +		{
> +		  bitregion_start = bit_off.to_uhwi ();
> +		  bit_off = byte_off << LOG2_BITS_PER_UNIT;
> +		  bit_off += bitregion_end;
> +		  if (wi::fits_uhwi_p (bit_off))
> +		    bitregion_end = bit_off.to_uhwi ();
> +		  else
> +		    bitregion_end = 0;
> +		}
> +	      else
> +		bitregion_end = 0;
> +	    }
> +	}
> +      else
> +	return NULL_TREE;
> +      base_addr = TREE_OPERAND (base_addr, 0);
> +    }
> +  /* get_inner_reference returns the base object, get at its
> +     address now.  */
> +  else
> +    {
> +      if (bitpos < 0)
> +	return NULL_TREE;
> +      base_addr = build_fold_addr_expr (base_addr);
> +    }
> +
> +  if (!bitregion_end)
> +    {
> +      bitregion_start = ROUND_DOWN (bitpos, BITS_PER_UNIT);
> +      bitregion_end = ROUND_UP (bitpos + bitsize, BITS_PER_UNIT);
> +    }
> +
> +  if (offset != NULL_TREE)
> +    {
> +      /* If the access is variable offset then a base decl has to be
> +	 address-taken to be able to emit pointer-based stores to it.
> +	 ???  We might be able to get away with re-using the original
> +	 base up to the first variable part and then wrapping that inside
> +	 a BIT_FIELD_REF.  */

Yes, that's what I'd generally recommend...  OTOH it can get quite
fugly, but it only has to survive until RTL expansion...

As an extra sanity check I'd rather have all refs share a common
base (operand_equal_p-ish).  But I guess that's what will usually
happen anyway.  The alias-ptr-type trick will be tricky to do
here as well (you have to go down to the base MEM_REF, wrap
a decl if there's no MEM_REF and adjust the offset type).
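
For illustration, a variable-offset store of the kind the ??? note is
about (hypothetical example):

  struct S { unsigned char a, b, c, d; };
  struct S arr[16];

  void
  g (int i)
  {
    /* The base decl "arr" need not be address-taken; re-using "arr[i]"
       up to the variable index and wrapping it in a BIT_FIELD_REF would
       avoid requiring TREE_ADDRESSABLE here.  */
    arr[i].a = 1;
    arr[i].b = 2;
    arr[i].c = 3;
    arr[i].d = 4;
  }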

> +      tree base = get_base_address (base_addr);
> +      if (! base
> +	  || (DECL_P (base) && ! TREE_ADDRESSABLE (base)))
> +	return NULL_TREE;
> +
> +      base_addr = build2 (POINTER_PLUS_EXPR, TREE_TYPE (base_addr),
> +			  base_addr, offset);
> +    }
> +
> +  *pbitsize = bitsize;
> +  *pbitpos = bitpos;
> +  *pbitregion_start = bitregion_start;
> +  *pbitregion_end = bitregion_end;
> +  return base_addr;
> +}
> +
> +/* Return true if STMT is a load that can be used for store merging.
> +   In that case fill in *OP.  BITSIZE, BITPOS, BITREGION_START and
> +   BITREGION_END are properties of the corresponding store.  */
> +
> +static bool
> +handled_load (gimple *stmt, store_operand_info *op,
> +	      unsigned HOST_WIDE_INT bitsize, unsigned HOST_WIDE_INT bitpos,
> +	      unsigned HOST_WIDE_INT bitregion_start,
> +	      unsigned HOST_WIDE_INT bitregion_end)
> +{
> +  if (!is_gimple_assign (stmt) || !gimple_vuse (stmt))
> +    return false;
> +  if (gimple_assign_load_p (stmt)
> +      && !stmt_can_throw_internal (stmt)
> +      && !gimple_has_volatile_ops (stmt))
> +    {
> +      tree mem = gimple_assign_rhs1 (stmt);
> +      op->base_addr
> +	= mem_valid_for_store_merging (mem, &op->bitsize, &op->bitpos,
> +				       &op->bitregion_start,
> +				       &op->bitregion_end);
> +      if (op->base_addr != NULL_TREE
> +	  && op->bitsize == bitsize
> +	  && ((op->bitpos - bitpos) % BITS_PER_UNIT) == 0
> +	  && op->bitpos - op->bitregion_start >= bitpos - bitregion_start
> +	  && op->bitregion_end - op->bitpos >= bitregion_end - bitpos)
> +	{
> +	  op->stmt = stmt;
> +	  op->val = mem;
> +	  return true;
> +	}
> +    }
> +  return false;
> +}
> +
> +/* Record the store STMT for store merging optimization if it can be
> +   optimized.  */
> +
> +void
> +pass_store_merging::process_store (gimple *stmt)
> +{
> +  tree lhs = gimple_assign_lhs (stmt);
> +  tree rhs = gimple_assign_rhs1 (stmt);
> +  unsigned HOST_WIDE_INT bitsize, bitpos;
> +  unsigned HOST_WIDE_INT bitregion_start;
> +  unsigned HOST_WIDE_INT bitregion_end;
> +  tree base_addr
> +    = mem_valid_for_store_merging (lhs, &bitsize, &bitpos,
> +				   &bitregion_start, &bitregion_end);
> +  if (bitsize == 0)
> +    return;
> +
> +  bool invalid = (base_addr == NULL_TREE
> +		  || ((bitsize > MAX_BITSIZE_MODE_ANY_INT)
> +		       && (TREE_CODE (rhs) != INTEGER_CST)));
> +  enum tree_code rhs_code = ERROR_MARK;
> +  store_operand_info ops[2];
> +  if (invalid)
> +    ;
> +  else if (rhs_valid_for_store_merging_p (rhs))
> +    {
> +      rhs_code = INTEGER_CST;
> +      ops[0].val = rhs;
> +    }
> +  else if (TREE_CODE (rhs) != SSA_NAME)
> +    invalid = true;
> +  else
> +    {
> +      gimple *def_stmt = SSA_NAME_DEF_STMT (rhs), *def_stmt1, *def_stmt2;
> +      if (!is_gimple_assign (def_stmt))
> +	invalid = true;
> +      else if (handled_load (def_stmt, &ops[0], bitsize, bitpos,
> +			     bitregion_start, bitregion_end))
> +	rhs_code = MEM_REF;
> +      else
> +	switch ((rhs_code = gimple_assign_rhs_code (def_stmt)))
> +	  {
> +	  case BIT_AND_EXPR:
> +	  case BIT_IOR_EXPR:
> +	  case BIT_XOR_EXPR:
> +	    tree rhs1, rhs2;
> +	    rhs1 = gimple_assign_rhs1 (def_stmt);
> +	    rhs2 = gimple_assign_rhs2 (def_stmt);
> +	    invalid = true;
> +	    if (TREE_CODE (rhs1) != SSA_NAME)
> +	      break;
> +	    def_stmt1 = SSA_NAME_DEF_STMT (rhs1);
> +	    if (!is_gimple_assign (def_stmt1)
> +		|| !handled_load (def_stmt1, &ops[0], bitsize, bitpos,
> +				  bitregion_start, bitregion_end))
> +	      break;
> +	    if (rhs_valid_for_store_merging_p (rhs2))
> +	      ops[1].val = rhs2;
> +	    else if (TREE_CODE (rhs2) != SSA_NAME)
> +	      break;
> +	    else
> +	      {
> +		def_stmt2 = SSA_NAME_DEF_STMT (rhs2);
> +		if (!is_gimple_assign (def_stmt2))
> +		  break;
> +		else if (!handled_load (def_stmt2, &ops[1], bitsize, bitpos,
> +					bitregion_start, bitregion_end))
> +		  break;
> +	      }
> +	    invalid = false;
> +	    break;
> +	  default:
> +	    invalid = true;
> +	    break;
> +	  }

Given the style of processing we can end up doing more work than
necessary when following non-single-use chains here, no?
Would it make sense to restrict this to single uses, or do we
handle any case of non-single uses?  When extending things to
allow an intermediate swap or general permute we could handle
a byte/word splat, for example.
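
For example (hypothetical), a chain whose intermediate loads have other
uses, so the analysis done while following the chain from the stores may
be partly wasted:

  struct S { unsigned char a, b; };
  struct S s, t;
  extern void sink (unsigned char, unsigned char);

  void
  h (void)
  {
    unsigned char x = s.a;  /* also used by sink (), i.e. not single-use */
    unsigned char y = s.b;
    t.a = x;
    t.b = y;
    sink (x, y);
  }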

Otherwise the patch looks good.  Quite a few parts of the changes
seem to be due to splitting code out into functions and refactoring.

Thanks,
Richard.

> +    }
> +
> +  struct imm_store_chain_info **chain_info = NULL;
> +  if (base_addr)
> +    chain_info = m_stores.get (base_addr);
> +
> +  if (invalid)
> +    {
> +      terminate_all_aliasing_chains (chain_info, stmt);
> +      return;
> +    }
> +
> +  store_immediate_info *info;
> +  if (chain_info)
> +    {
> +      unsigned int ord = (*chain_info)->m_store_info.length ();
> +      info = new store_immediate_info (bitsize, bitpos, bitregion_start,
> +				       bitregion_end, stmt, ord, rhs_code,
> +				       ops[0], ops[1]);
> +      if (dump_file && (dump_flags & TDF_DETAILS))
> +	{
> +	  fprintf (dump_file, "Recording immediate store from stmt:\n");
> +	  print_gimple_stmt (dump_file, stmt, 0);
> +	}
> +      (*chain_info)->m_store_info.safe_push (info);
> +      /* If we reach the limit of stores to merge in a chain terminate and
> +	 process the chain now.  */
> +      if ((*chain_info)->m_store_info.length ()
> +	  == (unsigned int) PARAM_VALUE (PARAM_MAX_STORES_TO_MERGE))
> +	{
> +	  if (dump_file && (dump_flags & TDF_DETAILS))
> +	    fprintf (dump_file,
> +		     "Reached maximum number of statements to merge:\n");
> +	  terminate_and_release_chain (*chain_info);
> +	}
> +      return;
> +    }
> +
> +  /* Store aliases any existing chain?  */
> +  terminate_all_aliasing_chains (chain_info, stmt);
> +  /* Start a new chain.  */
> +  struct imm_store_chain_info *new_chain
> +    = new imm_store_chain_info (m_stores_head, base_addr);
> +  info = new store_immediate_info (bitsize, bitpos, bitregion_start,
> +				   bitregion_end, stmt, 0, rhs_code,
> +				   ops[0], ops[1]);
> +  new_chain->m_store_info.safe_push (info);
> +  m_stores.put (base_addr, new_chain);
> +  if (dump_file && (dump_flags & TDF_DETAILS))
> +    {
> +      fprintf (dump_file, "Starting new chain with statement:\n");
> +      print_gimple_stmt (dump_file, stmt, 0);
> +      fprintf (dump_file, "The base object is:\n");
> +      print_generic_expr (dump_file, base_addr);
> +      fprintf (dump_file, "\n");
> +    }
> +}
> +
>  /* Entry point for the pass.  Go over each basic block recording chains of
> -  immediate stores.  Upon encountering a terminating statement (as defined
> -  by stmt_terminates_chain_p) process the recorded stores and emit the widened
> -  variants.  */
> +   immediate stores.  Upon encountering a terminating statement (as defined
> +   by stmt_terminates_chain_p) process the recorded stores and emit the widened
> +   variants.  */
>  
>  unsigned int
>  pass_store_merging::execute (function *fun)
> @@ -1573,175 +2291,9 @@ pass_store_merging::execute (function *f
>  	  if (gimple_assign_single_p (stmt) && gimple_vdef (stmt)
>  	      && !stmt_can_throw_internal (stmt)
>  	      && lhs_valid_for_store_merging_p (gimple_assign_lhs (stmt)))
> -	    {
> -	      tree lhs = gimple_assign_lhs (stmt);
> -	      tree rhs = gimple_assign_rhs1 (stmt);
> -
> -	      HOST_WIDE_INT bitsize, bitpos;
> -	      unsigned HOST_WIDE_INT bitregion_start = 0;
> -	      unsigned HOST_WIDE_INT bitregion_end = 0;
> -	      machine_mode mode;
> -	      int unsignedp = 0, reversep = 0, volatilep = 0;
> -	      tree offset, base_addr;
> -	      base_addr
> -		= get_inner_reference (lhs, &bitsize, &bitpos, &offset, &mode,
> -				       &unsignedp, &reversep, &volatilep);
> -	      if (TREE_CODE (lhs) == COMPONENT_REF
> -		  && DECL_BIT_FIELD_TYPE (TREE_OPERAND (lhs, 1)))
> -		{
> -		  get_bit_range (&bitregion_start, &bitregion_end, lhs,
> -				 &bitpos, &offset);
> -		  if (bitregion_end)
> -		    ++bitregion_end;
> -		}
> -	      if (bitsize == 0)
> -		continue;
> -
> -	      /* As a future enhancement we could handle stores with the same
> -		 base and offset.  */
> -	      bool invalid = reversep
> -			     || ((bitsize > MAX_BITSIZE_MODE_ANY_INT)
> -				  && (TREE_CODE (rhs) != INTEGER_CST))
> -			     || !rhs_valid_for_store_merging_p (rhs);
> -
> -	      /* We do not want to rewrite TARGET_MEM_REFs.  */
> -	      if (TREE_CODE (base_addr) == TARGET_MEM_REF)
> -		invalid = true;
> -	      /* In some cases get_inner_reference may return a
> -		 MEM_REF [ptr + byteoffset].  For the purposes of this pass
> -		 canonicalize the base_addr to MEM_REF [ptr] and take
> -		 byteoffset into account in the bitpos.  This occurs in
> -		 PR 23684 and this way we can catch more chains.  */
> -	      else if (TREE_CODE (base_addr) == MEM_REF)
> -		{
> -		  offset_int bit_off, byte_off = mem_ref_offset (base_addr);
> -		  bit_off = byte_off << LOG2_BITS_PER_UNIT;
> -		  bit_off += bitpos;
> -		  if (!wi::neg_p (bit_off) && wi::fits_shwi_p (bit_off))
> -		    {
> -		      bitpos = bit_off.to_shwi ();
> -		      if (bitregion_end)
> -			{
> -			  bit_off = byte_off << LOG2_BITS_PER_UNIT;
> -			  bit_off += bitregion_start;
> -			  if (wi::fits_uhwi_p (bit_off))
> -			    {
> -			      bitregion_start = bit_off.to_uhwi ();
> -			      bit_off = byte_off << LOG2_BITS_PER_UNIT;
> -			      bit_off += bitregion_end;
> -			      if (wi::fits_uhwi_p (bit_off))
> -				bitregion_end = bit_off.to_uhwi ();
> -			      else
> -				bitregion_end = 0;
> -			    }
> -			  else
> -			    bitregion_end = 0;
> -			}
> -		    }
> -		  else
> -		    invalid = true;
> -		  base_addr = TREE_OPERAND (base_addr, 0);
> -		}
> -	      /* get_inner_reference returns the base object, get at its
> -	         address now.  */
> -	      else
> -		{
> -		  if (bitpos < 0)
> -		    invalid = true;
> -		  base_addr = build_fold_addr_expr (base_addr);
> -		}
> -
> -	      if (!bitregion_end)
> -		{
> -		  bitregion_start = ROUND_DOWN (bitpos, BITS_PER_UNIT);
> -		  bitregion_end = ROUND_UP (bitpos + bitsize, BITS_PER_UNIT);
> -		}
> -
> -	      if (! invalid
> -		  && offset != NULL_TREE)
> -		{
> -		  /* If the access is variable offset then a base
> -		     decl has to be address-taken to be able to
> -		     emit pointer-based stores to it.
> -		     ???  We might be able to get away with
> -		     re-using the original base up to the first
> -		     variable part and then wrapping that inside
> -		     a BIT_FIELD_REF.  */
> -		  tree base = get_base_address (base_addr);
> -		  if (! base
> -		      || (DECL_P (base)
> -			  && ! TREE_ADDRESSABLE (base)))
> -		    invalid = true;
> -		  else
> -		    base_addr = build2 (POINTER_PLUS_EXPR,
> -					TREE_TYPE (base_addr),
> -					base_addr, offset);
> -		}
> -
> -	      struct imm_store_chain_info **chain_info
> -		= m_stores.get (base_addr);
> -
> -	      if (!invalid)
> -		{
> -		  store_immediate_info *info;
> -		  if (chain_info)
> -		    {
> -		      unsigned int ord = (*chain_info)->m_store_info.length ();
> -		      info = new store_immediate_info (bitsize, bitpos,
> -						       bitregion_start,
> -						       bitregion_end,
> -						       stmt, ord);
> -		      if (dump_file && (dump_flags & TDF_DETAILS))
> -			{
> -			  fprintf (dump_file,
> -				   "Recording immediate store from stmt:\n");
> -			  print_gimple_stmt (dump_file, stmt, 0);
> -			}
> -		      (*chain_info)->m_store_info.safe_push (info);
> -		      /* If we reach the limit of stores to merge in a chain
> -			 terminate and process the chain now.  */
> -		      if ((*chain_info)->m_store_info.length ()
> -			   == (unsigned int)
> -			      PARAM_VALUE (PARAM_MAX_STORES_TO_MERGE))
> -			{
> -			  if (dump_file && (dump_flags & TDF_DETAILS))
> -			    fprintf (dump_file,
> -				 "Reached maximum number of statements"
> -				 " to merge:\n");
> -			  terminate_and_release_chain (*chain_info);
> -			}
> -		      continue;
> -		    }
> -
> -		  /* Store aliases any existing chain?  */
> -		  terminate_all_aliasing_chains (chain_info, false, stmt);
> -		  /* Start a new chain.  */
> -		  struct imm_store_chain_info *new_chain
> -		    = new imm_store_chain_info (m_stores_head, base_addr);
> -		  info = new store_immediate_info (bitsize, bitpos,
> -						   bitregion_start,
> -						   bitregion_end,
> -						   stmt, 0);
> -		  new_chain->m_store_info.safe_push (info);
> -		  m_stores.put (base_addr, new_chain);
> -		  if (dump_file && (dump_flags & TDF_DETAILS))
> -		    {
> -		      fprintf (dump_file,
> -			       "Starting new chain with statement:\n");
> -		      print_gimple_stmt (dump_file, stmt, 0);
> -		      fprintf (dump_file, "The base object is:\n");
> -		      print_generic_expr (dump_file, base_addr);
> -		      fprintf (dump_file, "\n");
> -		    }
> -		}
> -	      else
> -		terminate_all_aliasing_chains (chain_info,
> -					       offset != NULL_TREE, stmt);
> -
> -	      continue;
> -	    }
> -
> -	  terminate_all_aliasing_chains (NULL, false, stmt);
> +	    process_store (stmt);
> +	  else
> +	    terminate_all_aliasing_chains (NULL, stmt);
>  	}
>        terminate_and_process_all_chains ();
>      }
> --- gcc/testsuite/gcc.dg/store_merging_13.c.jj	2017-11-02 08:50:03.544226508 +0100
> +++ gcc/testsuite/gcc.dg/store_merging_13.c	2017-11-02 08:50:03.544226508 +0100
> @@ -0,0 +1,157 @@
> +/* { dg-do compile } */
> +/* { dg-require-effective-target store_merge } */
> +/* { dg-options "-O2 -fdump-tree-store-merging" } */
> +
> +struct S { unsigned char a, b; unsigned short c; unsigned char d, e, f, g; unsigned long long h; };
> +
> +__attribute__((noipa)) void
> +f1 (struct S *p)
> +{
> +  p->a = 1;
> +  p->b = 2;
> +  p->c = 3;
> +  p->d = 4;
> +  p->e = 5;
> +  p->f = 6;
> +  p->g = 7;
> +}
> +
> +__attribute__((noipa)) void
> +f2 (struct S *__restrict p, struct S *__restrict q)
> +{
> +  p->a = q->a;
> +  p->b = q->b;
> +  p->c = q->c;
> +  p->d = q->d;
> +  p->e = q->e;
> +  p->f = q->f;
> +  p->g = q->g;
> +}
> +
> +__attribute__((noipa)) void
> +f3 (struct S *p, struct S *q)
> +{
> +  unsigned char pa = q->a;
> +  unsigned char pb = q->b;
> +  unsigned short pc = q->c;
> +  unsigned char pd = q->d;
> +  unsigned char pe = q->e;
> +  unsigned char pf = q->f;
> +  unsigned char pg = q->g;
> +  p->a = pa;
> +  p->b = pb;
> +  p->c = pc;
> +  p->d = pd;
> +  p->e = pe;
> +  p->f = pf;
> +  p->g = pg;
> +}
> +
> +__attribute__((noipa)) void
> +f4 (struct S *p, struct S *q)
> +{
> +  unsigned char pa = p->a | q->a;
> +  unsigned char pb = p->b | q->b;
> +  unsigned short pc = p->c | q->c;
> +  unsigned char pd = p->d | q->d;
> +  unsigned char pe = p->e | q->e;
> +  unsigned char pf = p->f | q->f;
> +  unsigned char pg = p->g | q->g;
> +  p->a = pa;
> +  p->b = pb;
> +  p->c = pc;
> +  p->d = pd;
> +  p->e = pe;
> +  p->f = pf;
> +  p->g = pg;
> +}
> +
> +__attribute__((noipa)) void
> +f5 (struct S *p, struct S *q)
> +{
> +  unsigned char pa = p->a & q->a;
> +  unsigned char pb = p->b & q->b;
> +  unsigned short pc = p->c & q->c;
> +  unsigned char pd = p->d & q->d;
> +  unsigned char pe = p->e & q->e;
> +  unsigned char pf = p->f & q->f;
> +  unsigned char pg = p->g & q->g;
> +  p->a = pa;
> +  p->b = pb;
> +  p->c = pc;
> +  p->d = pd;
> +  p->e = pe;
> +  p->f = pf;
> +  p->g = pg;
> +}
> +
> +__attribute__((noipa)) void
> +f6 (struct S *p, struct S *q)
> +{
> +  unsigned char pa = p->a ^ q->a;
> +  unsigned char pb = p->b ^ q->b;
> +  unsigned short pc = p->c ^ q->c;
> +  unsigned char pd = p->d ^ q->d;
> +  unsigned char pe = p->e ^ q->e;
> +  unsigned char pf = p->f ^ q->f;
> +  unsigned char pg = p->g ^ q->g;
> +  p->a = pa;
> +  p->b = pb;
> +  p->c = pc;
> +  p->d = pd;
> +  p->e = pe;
> +  p->f = pf;
> +  p->g = pg;
> +}
> +
> +struct S s = { 20, 21, 22, 23, 24, 25, 26, 27 };
> +struct S t = { 0x71, 0x72, 0x7f04, 0x78, 0x31, 0x32, 0x34, 0xf1f2f3f4f5f6f7f8ULL };
> +struct S u = { 28, 29, 30, 31, 32, 33, 34, 35 };
> +struct S v = { 36, 37, 38, 39, 40, 41, 42, 43 };
> +
> +int
> +main ()
> +{
> +  asm volatile ("" : : : "memory");
> +  f1 (&s);
> +  asm volatile ("" : : : "memory");
> +  if (s.a != 1 || s.b != 2 || s.c != 3 || s.d != 4
> +      || s.e != 5 || s.f != 6 || s.g != 7 || s.h != 27)
> +    __builtin_abort ();
> +  f2 (&s, &u);
> +  asm volatile ("" : : : "memory");
> +  if (s.a != 28 || s.b != 29 || s.c != 30 || s.d != 31
> +      || s.e != 32 || s.f != 33 || s.g != 34 || s.h != 27)
> +    __builtin_abort ();
> +  f3 (&s, &v);
> +  asm volatile ("" : : : "memory");
> +  if (s.a != 36 || s.b != 37 || s.c != 38 || s.d != 39
> +      || s.e != 40 || s.f != 41 || s.g != 42 || s.h != 27)
> +    __builtin_abort ();
> +  f4 (&s, &t);
> +  asm volatile ("" : : : "memory");
> +  if (s.a != (36 | 0x71) || s.b != (37 | 0x72)
> +      || s.c != (38 | 0x7f04) || s.d != (39 | 0x78)
> +      || s.e != (40 | 0x31) || s.f != (41 | 0x32)
> +      || s.g != (42 | 0x34) || s.h != 27)
> +    __builtin_abort ();
> +  f3 (&s, &u);
> +  f5 (&s, &t);
> +  asm volatile ("" : : : "memory");
> +  if (s.a != (28 & 0x71) || s.b != (29 & 0x72)
> +      || s.c != (30 & 0x7f04) || s.d != (31 & 0x78)
> +      || s.e != (32 & 0x31) || s.f != (33 & 0x32)
> +      || s.g != (34 & 0x34) || s.h != 27)
> +    __builtin_abort ();
> +  f2 (&s, &v);
> +  f6 (&s, &t);
> +  asm volatile ("" : : : "memory");
> +  if (s.a != (36 ^ 0x71) || s.b != (37 ^ 0x72)
> +      || s.c != (38 ^ 0x7f04) || s.d != (39 ^ 0x78)
> +      || s.e != (40 ^ 0x31) || s.f != (41 ^ 0x32)
> +      || s.g != (42 ^ 0x34) || s.h != 27)
> +    __builtin_abort ();
> +  return 0;
> +}
> +
> +/* { dg-final { scan-tree-dump-times "Merging successful" 6 "store-merging" } } */
> --- gcc/testsuite/gcc.dg/store_merging_14.c.jj	2017-11-02 08:50:03.544226508 +0100
> +++ gcc/testsuite/gcc.dg/store_merging_14.c	2017-11-02 10:35:51.000000000 +0100
> @@ -0,0 +1,157 @@
> +/* { dg-do compile } */
> +/* { dg-require-effective-target store_merge } */
> +/* { dg-options "-O2 -fdump-tree-store-merging" } */
> +
> +struct S { unsigned int i : 8, a : 7, b : 7, j : 10, c : 15, d : 7, e : 10, f : 7, g : 9, k : 16; unsigned long long h; };
> +
> +__attribute__((noipa)) void
> +f1 (struct S *p)
> +{
> +  p->a = 1;
> +  p->b = 2;
> +  p->c = 3;
> +  p->d = 4;
> +  p->e = 5;
> +  p->f = 6;
> +  p->g = 7;
> +}
> +
> +__attribute__((noipa)) void
> +f2 (struct S *__restrict p, struct S *__restrict q)
> +{
> +  p->a = q->a;
> +  p->b = q->b;
> +  p->c = q->c;
> +  p->d = q->d;
> +  p->e = q->e;
> +  p->f = q->f;
> +  p->g = q->g;
> +}
> +
> +__attribute__((noipa)) void
> +f3 (struct S *p, struct S *q)
> +{
> +  unsigned char pa = q->a;
> +  unsigned char pb = q->b;
> +  unsigned short pc = q->c;
> +  unsigned char pd = q->d;
> +  unsigned short pe = q->e;
> +  unsigned char pf = q->f;
> +  unsigned short pg = q->g;
> +  p->a = pa;
> +  p->b = pb;
> +  p->c = pc;
> +  p->d = pd;
> +  p->e = pe;
> +  p->f = pf;
> +  p->g = pg;
> +}
> +
> +__attribute__((noipa)) void
> +f4 (struct S *p, struct S *q)
> +{
> +  unsigned char pa = p->a | q->a;
> +  unsigned char pb = p->b | q->b;
> +  unsigned short pc = p->c | q->c;
> +  unsigned char pd = p->d | q->d;
> +  unsigned short pe = p->e | q->e;
> +  unsigned char pf = p->f | q->f;
> +  unsigned short pg = p->g | q->g;
> +  p->a = pa;
> +  p->b = pb;
> +  p->c = pc;
> +  p->d = pd;
> +  p->e = pe;
> +  p->f = pf;
> +  p->g = pg;
> +}
> +
> +__attribute__((noipa)) void
> +f5 (struct S *p, struct S *q)
> +{
> +  unsigned char pa = p->a & q->a;
> +  unsigned char pb = p->b & q->b;
> +  unsigned short pc = p->c & q->c;
> +  unsigned char pd = p->d & q->d;
> +  unsigned short pe = p->e & q->e;
> +  unsigned char pf = p->f & q->f;
> +  unsigned short pg = p->g & q->g;
> +  p->a = pa;
> +  p->b = pb;
> +  p->c = pc;
> +  p->d = pd;
> +  p->e = pe;
> +  p->f = pf;
> +  p->g = pg;
> +}
> +
> +__attribute__((noipa)) void
> +f6 (struct S *p, struct S *q)
> +{
> +  unsigned char pa = p->a ^ q->a;
> +  unsigned char pb = p->b ^ q->b;
> +  unsigned short pc = p->c ^ q->c;
> +  unsigned char pd = p->d ^ q->d;
> +  unsigned short pe = p->e ^ q->e;
> +  unsigned char pf = p->f ^ q->f;
> +  unsigned short pg = p->g ^ q->g;
> +  p->a = pa;
> +  p->b = pb;
> +  p->c = pc;
> +  p->d = pd;
> +  p->e = pe;
> +  p->f = pf;
> +  p->g = pg;
> +}
> +
> +struct S s = { 72, 20, 21, 73, 22, 23, 24, 25, 26, 74, 27 };
> +struct S t = { 75, 0x71, 0x72, 76, 0x7f04, 0x78, 0x31, 0x32, 0x34, 77, 0xf1f2f3f4f5f6f7f8ULL };
> +struct S u = { 78, 28, 29, 79, 30, 31, 32, 33, 34, 80, 35 };
> +struct S v = { 81, 36, 37, 82, 38, 39, 40, 41, 42, 83, 43 };
> +
> +int
> +main ()
> +{
> +  asm volatile ("" : : : "memory");
> +  f1 (&s);
> +  asm volatile ("" : : : "memory");
> +  if (s.i != 72 || s.a != 1 || s.b != 2 || s.j != 73 || s.c != 3 || s.d != 4
> +      || s.e != 5 || s.f != 6 || s.g != 7 || s.k != 74 || s.h != 27)
> +    __builtin_abort ();
> +  f2 (&s, &u);
> +  asm volatile ("" : : : "memory");
> +  if (s.i != 72 || s.a != 28 || s.b != 29 || s.j != 73 || s.c != 30 || s.d != 31
> +      || s.e != 32 || s.f != 33 || s.g != 34 || s.k != 74 || s.h != 27)
> +    __builtin_abort ();
> +  f3 (&s, &v);
> +  asm volatile ("" : : : "memory");
> +  if (s.i != 72 || s.a != 36 || s.b != 37 || s.j != 73 || s.c != 38 || s.d != 39
> +      || s.e != 40 || s.f != 41 || s.g != 42 || s.k != 74 || s.h != 27)
> +    __builtin_abort ();
> +  f4 (&s, &t);
> +  asm volatile ("" : : : "memory");
> +  if (s.i != 72 || s.a != (36 | 0x71) || s.b != (37 | 0x72) || s.j != 73
> +      || s.c != (38 | 0x7f04) || s.d != (39 | 0x78)
> +      || s.e != (40 | 0x31) || s.f != (41 | 0x32)
> +      || s.g != (42 | 0x34) || s.k != 74 || s.h != 27)
> +    __builtin_abort ();
> +  f3 (&s, &u);
> +  f5 (&s, &t);
> +  asm volatile ("" : : : "memory");
> +  if (s.i != 72 || s.a != (28 & 0x71) || s.b != (29 & 0x72) || s.j != 73
> +      || s.c != (30 & 0x7f04) || s.d != (31 & 0x78)
> +      || s.e != (32 & 0x31) || s.f != (33 & 0x32)
> +      || s.g != (34 & 0x34) || s.k != 74 || s.h != 27)
> +    __builtin_abort ();
> +  f2 (&s, &v);
> +  f6 (&s, &t);
> +  asm volatile ("" : : : "memory");
> +  if (s.i != 72 || s.a != (36 ^ 0x71) || s.b != (37 ^ 0x72) || s.j != 73
> +      || s.c != (38 ^ 0x7f04) || s.d != (39 ^ 0x78)
> +      || s.e != (40 ^ 0x31) || s.f != (41 ^ 0x32)
> +      || s.g != (42 ^ 0x34) || s.k != 74 || s.h != 27)
> +    __builtin_abort ();
> +  return 0;
> +}
> +
> +/* { dg-final { scan-tree-dump-times "Merging successful" 6 "store-merging" } } */
> 
> 
> 	Jakub
> 
>
Jakub Jelinek Nov. 3, 2017, 2:04 p.m. UTC | #2
On Fri, Nov 03, 2017 at 02:14:39PM +0100, Richard Biener wrote:
> > +/* Return true if stmts in between FIRST (inclusive) and LAST (exclusive)
> > +   may clobber REF.  FIRST and LAST must be in the same basic block and
> > +   have non-NULL vdef.  */
> > +
> > +bool
> > +stmts_may_clobber_ref_p (gimple *first, gimple *last, tree ref)
> > +{
> > +  ao_ref r;
> > +  ao_ref_init (&r, ref);
> > +  unsigned int count = 0;
> > +  tree vop = gimple_vdef (last);
> > +  gimple *stmt;
> > +
> > +  gcc_checking_assert (gimple_bb (first) == gimple_bb (last));
> 
> EBB would probably work as well, thus we should assert we do not
> end up visiting a PHI in the loop?

For a general purpose routine, sure; this one is in an anonymous namespace
and meant for use only in this pass.  And there it only checks stores from
the same store group and any other stores intermixed between those.
The pass, at least right now, resets all of its state at the end of each
basic block, so gimple_bb (first) == gimple_bb (last) is indeed always
guaranteed.  If we ever extend it such that we don't have this guarantee,
this assert will fail and then of course it should be adjusted to handle
whatever is needed.  But do we need to do that right now?

Note that extending store merging to handle groups of stores within an EBB
doesn't look useful, as then not all stores would be unconditional.
What could make sense is if we have e.g. a diamond
     |
    bb1
   /  \
  bb2 bb3
   \  /
    bb4
     |
and the stores are in bb1 and bb4 and no stores in bb2 or bb3 can alias
with them.  But then we'd likely need a full-blown walk_aliased_vdefs
walk for this...

> > +  gimple *first = merged_store->first_stmt;
> > +  gimple *last = merged_store->last_stmt;
> > +  unsigned int i;
> > +  store_immediate_info *infoc;
> > +  if (info->order < merged_store->first_order)
> > +    {
> > +      FOR_EACH_VEC_ELT (merged_store->stores, i, infoc)
> > +	if (stmts_may_clobber_ref_p (info->stmt, first, infoc->ops[idx].val))
> > +	  return false;
> > +      first = info->stmt;
> > +    }
> > +  else if (info->order > merged_store->last_order)
> > +    {
> > +      FOR_EACH_VEC_ELT (merged_store->stores, i, infoc)
> > +	if (stmts_may_clobber_ref_p (last, info->stmt, infoc->ops[idx].val))
> > +	  return false;
> > +      last = info->stmt;
> > +    }
> > +  if (stmts_may_clobber_ref_p (first, last, info->ops[idx].val))
> > +    return false;
> 
> Can you comment on what you check in this block?  It first checks
> all stmts (but not info->stmt itself if it is after last!?) 
> against
> all stores that would be added when adding 'info'.  Then it checks
> from new first to last against the newly added stmt (again
> excluding that stmt if it was added last).

The stmts_may_clobber_ref_p routine doesn't check aliasing on the last
stmt, only on the first stmt and stmts in between.

Previous iterations have checked, for each infoc in merged_store->stores,
that merged_store->first_stmt and the stmts in between it and
merged_store->last_stmt don't clobber any of the infoc->ops[idx].val
references, and we want to maintain that invariant if we add another store
to the group.  So, if we are about to extend the range between first_stmt
and last_stmt, we need to check all the earlier refs against the stmts
we've added to the range.  Note that at this point the stores are sorted
by bitpos, not by their order within the basic block, so it is possible
that a store with a higher bitpos extends the range to earlier or later
stmts.

And finally, the if (stmts_may_clobber_ref_p (first, last, info->ops[idx].val))
checks the reference we are adding against the whole new range.
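
A made-up example (not one of the testcases in the patch) of when that
extension happens:

struct S { unsigned char a, b, c, d; };

void
foo (struct S *p, struct S *q)
{
  /* Coalescing sorts the stores by bitpos, so the group is started from
     the p->a store, and the p->b store, which is earlier in the basic
     block (info->order smaller than first_order), extends the
     [first_stmt, last_stmt] range backwards; at that point the q->a load
     of the already grouped store has to be checked against the stmts the
     range newly covers.  */
  p->b = q->b;
  p->a = q->a;
  p->c = q->c;
  p->d = q->d;
}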

> > +  if (offset != NULL_TREE)
> > +    {
> > +      /* If the access is variable offset then a base decl has to be
> > +	 address-taken to be able to emit pointer-based stores to it.
> > +	 ???  We might be able to get away with re-using the original
> > +	 base up to the first variable part and then wrapping that inside
> > +	 a BIT_FIELD_REF.  */
> 
> Yes, that's what I'd generally recommend...  OTOH it can get quite
> fugly but it only has to survive until RTL expansion...

This is a preexisting comment I've just moved from the
pass_store_merging::execute method into a helper function (the method grew
too much and needed too deep indentation, and furthermore I wanted to use
the code for the loads too).  I haven't really changed anything there.

> As extra sanity check I'd rather have that all refs share a common
> base (operand-equal-p'ish).  But I guess that's what usually will
> happen anyways.  The alias-ptr-type trick will be tricky to do
> here as well (you have to go down to the base MEM_REF, wrap
> a decl if there's no MEM_REF and adjust the offset type).

For the aliasing, I have an untested incremental patch; I need to finish
testcases for it, then test it, and then I can post it.

> Given the style of processing we can end up doing more work than
> necessary when following ! single-use chains here, no?
> Would it make sense to restrict this to single uses, or do we
> handle any case of ! single-uses?  When extending things to
> allow an intermediate swap or general permute we could handle
> a byte/word splat, for example.

Single-use vs. multiple uses is something I've thought about, but I don't
know whether it is better to require single uses or not (or only sometimes,
under some condition?).  Say we have:
  _1 = t.a;
  _2 = t.b;
  _3 = t.c;
  _4 = t.d;
  s.a = _1;
  s.b = _2;
  s.c = _3;
  s.d = _4;
  use (_1, _2, _3, _4);
it will likely be shorter and maybe faster if we do:
  _1 = t.a;
  _2 = t.b;
  _3 = t.c;
  _4 = t.d;
  _5 = load[t.a...t.d];
  store[s.a...s.d] = _5;
  use (_1, _2, _3, _4);
If there are just 2 stores combined together, having
  _1 = t.a; _2 = t.b; _5 = load[t.a...t.b]; store[t.a...t.b] = _5;
  use (_1, _2);
will be as many stmts as before.  And if there is a BIT_*_EXPR, we would be
adding not just loads, but also the bitwise ops.

So, if you want, I can at least for now add has_single_use checks
in all the spots where it follows SSA_NAME_DEF_STMT.
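
E.g. (sketch only; the actual hunks may differ) the rhs2 handling above
would become:

	    else if (TREE_CODE (rhs2) != SSA_NAME || !has_single_use (rhs2))
	      break;

and similarly for rhs, rhs1 and the plain MEM_REF rhs case.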

> Otherwise the patch looks good.  Quite some parts of the changes
> seem to be due to splitting out stuff into functions and refactoring.

Indeed, the mem_valid_for_store_merging and pass_store_merging::process_store
functions are mostly code move + reindent + tweaks.

Here are hand-made diffs (made with -upbd) between the old portions of
pass_store_merging::execute corresponding to the above two functions and
the new functions, in case they are of any help.

@@ -1,33 +1,36 @@
-	      HOST_WIDE_INT bitsize, bitpos;
+static tree
+mem_valid_for_store_merging (tree mem, unsigned HOST_WIDE_INT *pbitsize,
+			     unsigned HOST_WIDE_INT *pbitpos,
+			     unsigned HOST_WIDE_INT *pbitregion_start,
+			     unsigned HOST_WIDE_INT *pbitregion_end)
+{
+  HOST_WIDE_INT bitsize;
+  HOST_WIDE_INT bitpos;
 	      unsigned HOST_WIDE_INT bitregion_start = 0;
 	      unsigned HOST_WIDE_INT bitregion_end = 0;
 	      machine_mode mode;
 	      int unsignedp = 0, reversep = 0, volatilep = 0;
-	      tree offset, base_addr;
-	      base_addr
-		= get_inner_reference (lhs, &bitsize, &bitpos, &offset, &mode,
+  tree offset;
+  tree base_addr = get_inner_reference (mem, &bitsize, &bitpos, &offset, &mode,
 				       &unsignedp, &reversep, &volatilep);
-	      if (TREE_CODE (lhs) == COMPONENT_REF
-		  && DECL_BIT_FIELD_TYPE (TREE_OPERAND (lhs, 1)))
+  *pbitsize = bitsize;
+  if (bitsize == 0)
+    return NULL_TREE;
+
+  if (TREE_CODE (mem) == COMPONENT_REF
+      && DECL_BIT_FIELD_TYPE (TREE_OPERAND (mem, 1)))
 		{
-		  get_bit_range (&bitregion_start, &bitregion_end, lhs,
-				 &bitpos, &offset);
+      get_bit_range (&bitregion_start, &bitregion_end, mem, &bitpos, &offset);
 		  if (bitregion_end)
 		    ++bitregion_end;
 		}
-	      if (bitsize == 0)
-		continue;
 
-	      /* As a future enhancement we could handle stores with the same
-		 base and offset.  */
-	      bool invalid = reversep
-			     || ((bitsize > MAX_BITSIZE_MODE_ANY_INT)
-				  && (TREE_CODE (rhs) != INTEGER_CST))
-			     || !rhs_valid_for_store_merging_p (rhs);
+  if (reversep)
+    return NULL_TREE;
 
 	      /* We do not want to rewrite TARGET_MEM_REFs.  */
 	      if (TREE_CODE (base_addr) == TARGET_MEM_REF)
-		invalid = true;
+    return NULL_TREE;
 	      /* In some cases get_inner_reference may return a
 		 MEM_REF [ptr + byteoffset].  For the purposes of this pass
 		 canonicalize the base_addr to MEM_REF [ptr] and take
@@ -60,7 +63,7 @@
 			}
 		    }
 		  else
-		    invalid = true;
+	return NULL_TREE;
 		  base_addr = TREE_OPERAND (base_addr, 0);
 		}
 	      /* get_inner_reference returns the base object, get at its
@@ -68,7 +71,7 @@
 	      else
 		{
 		  if (bitpos < 0)
-		    invalid = true;
+	return NULL_TREE;
 		  base_addr = build_fold_addr_expr (base_addr);
 		}
 
@@ -78,23 +81,25 @@
 		  bitregion_end = ROUND_UP (bitpos + bitsize, BITS_PER_UNIT);
 		}
 
-	      if (! invalid
-		  && offset != NULL_TREE)
+  if (offset != NULL_TREE)
 		{
-		  /* If the access is variable offset then a base
-		     decl has to be address-taken to be able to
-		     emit pointer-based stores to it.
-		     ???  We might be able to get away with
-		     re-using the original base up to the first
-		     variable part and then wrapping that inside
+      /* If the access is variable offset then a base decl has to be
+	 address-taken to be able to emit pointer-based stores to it.
+	 ???  We might be able to get away with re-using the original
+	 base up to the first variable part and then wrapping that inside
 		     a BIT_FIELD_REF.  */
 		  tree base = get_base_address (base_addr);
 		  if (! base
-		      || (DECL_P (base)
-			  && ! TREE_ADDRESSABLE (base)))
-		    invalid = true;
-		  else
-		    base_addr = build2 (POINTER_PLUS_EXPR,
-					TREE_TYPE (base_addr),
+	  || (DECL_P (base) && ! TREE_ADDRESSABLE (base)))
+	return NULL_TREE;
+
+      base_addr = build2 (POINTER_PLUS_EXPR, TREE_TYPE (base_addr),
 					base_addr, offset);
 		}
+
+  *pbitsize = bitsize;
+  *pbitpos = bitpos;
+  *pbitregion_start = bitregion_start;
+  *pbitregion_end = bitregion_end;
+  return base_addr;
+}

------

@@ -1,71 +1,129 @@
+void
+pass_store_merging::process_store (gimple *stmt)
+{
 	      tree lhs = gimple_assign_lhs (stmt);
 	      tree rhs = gimple_assign_rhs1 (stmt);
+  unsigned HOST_WIDE_INT bitsize, bitpos;
+  unsigned HOST_WIDE_INT bitregion_start;
+  unsigned HOST_WIDE_INT bitregion_end;
+  tree base_addr
+    = mem_valid_for_store_merging (lhs, &bitsize, &bitpos,
+				   &bitregion_start, &bitregion_end);
+  if (bitsize == 0)
+    return;
 
-	      /* As a future enhancement we could handle stores with the same
-		 base and offset.  */
-	      bool invalid = reversep
+  bool invalid = (base_addr == NULL_TREE
 			     || ((bitsize > MAX_BITSIZE_MODE_ANY_INT)
-				  && (TREE_CODE (rhs) != INTEGER_CST))
-			     || !rhs_valid_for_store_merging_p (rhs);
+		       && (TREE_CODE (rhs) != INTEGER_CST)));
+  enum tree_code rhs_code = ERROR_MARK;
+  store_operand_info ops[2];
+  if (invalid)
+    ;
+  else if (rhs_valid_for_store_merging_p (rhs))
+    {
+      rhs_code = INTEGER_CST;
+      ops[0].val = rhs;
+    }
+  else if (TREE_CODE (rhs) != SSA_NAME)
+    invalid = true;
+  else
+    {
+      gimple *def_stmt = SSA_NAME_DEF_STMT (rhs), *def_stmt1, *def_stmt2;
+      if (!is_gimple_assign (def_stmt))
+	invalid = true;
+      else if (handled_load (def_stmt, &ops[0], bitsize, bitpos,
+			     bitregion_start, bitregion_end))
+	rhs_code = MEM_REF;
+      else
+	switch ((rhs_code = gimple_assign_rhs_code (def_stmt)))
+	  {
+	  case BIT_AND_EXPR:
+	  case BIT_IOR_EXPR:
+	  case BIT_XOR_EXPR:
+	    tree rhs1, rhs2;
+	    rhs1 = gimple_assign_rhs1 (def_stmt);
+	    rhs2 = gimple_assign_rhs2 (def_stmt);
+	    invalid = true;
+	    if (TREE_CODE (rhs1) != SSA_NAME)
+	      break;
+	    def_stmt1 = SSA_NAME_DEF_STMT (rhs1);
+	    if (!is_gimple_assign (def_stmt1)
+		|| !handled_load (def_stmt1, &ops[0], bitsize, bitpos,
+				  bitregion_start, bitregion_end))
+	      break;
+	    if (rhs_valid_for_store_merging_p (rhs2))
+	      ops[1].val = rhs2;
+	    else if (TREE_CODE (rhs2) != SSA_NAME)
+	      break;
+	    else
+	      {
+		def_stmt2 = SSA_NAME_DEF_STMT (rhs2);
+		if (!is_gimple_assign (def_stmt2))
+		  break;
+		else if (!handled_load (def_stmt2, &ops[1], bitsize, bitpos,
+					bitregion_start, bitregion_end))
+		  break;
+	      }
+	    invalid = false;
+	    break;
+	  default:
+	    invalid = true;
+	    break;
+	  }
+    }
 
-	      struct imm_store_chain_info **chain_info
-		= m_stores.get (base_addr);
+  struct imm_store_chain_info **chain_info = NULL;
+  if (base_addr)
+    chain_info = m_stores.get (base_addr);
 
-	      if (!invalid)
+  if (invalid)
 		{
+      terminate_all_aliasing_chains (chain_info, stmt);
+      return;
+    }
+
 		  store_immediate_info *info;
 		  if (chain_info)
 		    {
 		      unsigned int ord = (*chain_info)->m_store_info.length ();
-		      info = new store_immediate_info (bitsize, bitpos,
-						       bitregion_start,
-						       bitregion_end,
-						       stmt, ord);
+      info = new store_immediate_info (bitsize, bitpos, bitregion_start,
+				       bitregion_end, stmt, ord, rhs_code,
+				       ops[0], ops[1]);
 		      if (dump_file && (dump_flags & TDF_DETAILS))
 			{
-			  fprintf (dump_file,
-				   "Recording immediate store from stmt:\n");
+	  fprintf (dump_file, "Recording immediate store from stmt:\n");
 			  print_gimple_stmt (dump_file, stmt, 0);
 			}
 		      (*chain_info)->m_store_info.safe_push (info);
-		      /* If we reach the limit of stores to merge in a chain
-			 terminate and process the chain now.  */
+      /* If we reach the limit of stores to merge in a chain terminate and
+	 process the chain now.  */
 		      if ((*chain_info)->m_store_info.length ()
-			   == (unsigned int)
-			      PARAM_VALUE (PARAM_MAX_STORES_TO_MERGE))
+	  == (unsigned int) PARAM_VALUE (PARAM_MAX_STORES_TO_MERGE))
 			{
 			  if (dump_file && (dump_flags & TDF_DETAILS))
 			    fprintf (dump_file,
-				 "Reached maximum number of statements"
-				 " to merge:\n");
+		     "Reached maximum number of statements to merge:\n");
 			  terminate_and_release_chain (*chain_info);
 			}
-		      continue;
+      return;
 		    }
 
 		  /* Store aliases any existing chain?  */
-		  terminate_all_aliasing_chains (chain_info, false, stmt);
+  terminate_all_aliasing_chains (chain_info, stmt);
 		  /* Start a new chain.  */
 		  struct imm_store_chain_info *new_chain
 		    = new imm_store_chain_info (m_stores_head, base_addr);
-		  info = new store_immediate_info (bitsize, bitpos,
-						   bitregion_start,
-						   bitregion_end,
-						   stmt, 0);
+  info = new store_immediate_info (bitsize, bitpos, bitregion_start,
+				   bitregion_end, stmt, 0, rhs_code,
+				   ops[0], ops[1]);
 		  new_chain->m_store_info.safe_push (info);
 		  m_stores.put (base_addr, new_chain);
 		  if (dump_file && (dump_flags & TDF_DETAILS))
 		    {
-		      fprintf (dump_file,
-			       "Starting new chain with statement:\n");
+      fprintf (dump_file, "Starting new chain with statement:\n");
 		      print_gimple_stmt (dump_file, stmt, 0);
 		      fprintf (dump_file, "The base object is:\n");
 		      print_generic_expr (dump_file, base_addr);
 		      fprintf (dump_file, "\n");
 		    }
-		}
-	      else
-		terminate_all_aliasing_chains (chain_info,
-					       offset != NULL_TREE, stmt);
-
-	      continue;
+}


	Jakub
Richard Biener Nov. 3, 2017, 2:09 p.m. UTC | #3
On Fri, 3 Nov 2017, Jakub Jelinek wrote:

> On Fri, Nov 03, 2017 at 02:14:39PM +0100, Richard Biener wrote:
> > > +/* Return true if stmts in between FIRST (inclusive) and LAST (exclusive)
> > > +   may clobber REF.  FIRST and LAST must be in the same basic block and
> > > +   have non-NULL vdef.  */
> > > +
> > > +bool
> > > +stmts_may_clobber_ref_p (gimple *first, gimple *last, tree ref)
> > > +{
> > > +  ao_ref r;
> > > +  ao_ref_init (&r, ref);
> > > +  unsigned int count = 0;
> > > +  tree vop = gimple_vdef (last);
> > > +  gimple *stmt;
> > > +
> > > +  gcc_checking_assert (gimple_bb (first) == gimple_bb (last));
> > 
> > EBB would probably work as well, thus we should assert we do not
> > end up visiting a PHI in the loop?
> 
> For a general purpose routine sure, this one is in anonymous namespace
> and meant for use in this pass.  And there it is only checking stores from
> the same store group and any other stores intermixed between those.
> The pass at least right now is resetting all of its state at the end of
> basic blocks, so gimple_bb (first) == gimple_bb (last) is indeed always
> guaranteed.  If we ever extend it such that we don't have this guarantee,
> then this assert would fail and then of course it should be adjusted to
> handle whatever is needed.  But do we need to do that right now?

No, we don't.  Just wondered about the assert and the real limitation
of the implementation.

> Note extending store-merging to handle groups of stores within EBB
> doesn't look useful, then not all stores would be unconditional.

Yes.

> What could make sense is if we have e.g. a diamond
>      |
>     bb1
>    /  \
>   bb2 bb3
>    \  /
>     bb4
>      |
> and stores are in bb1 and bb4 and no stores in bb2 or bb3 can alias
> with those.  But then we'd likely need full-blown walk_aliased_vdefs
> for this...

Yes.

> > > +  gimple *first = merged_store->first_stmt;
> > > +  gimple *last = merged_store->last_stmt;
> > > +  unsigned int i;
> > > +  store_immediate_info *infoc;
> > > +  if (info->order < merged_store->first_order)
> > > +    {
> > > +      FOR_EACH_VEC_ELT (merged_store->stores, i, infoc)
> > > +	if (stmts_may_clobber_ref_p (info->stmt, first, infoc->ops[idx].val))
> > > +	  return false;
> > > +      first = info->stmt;
> > > +    }
> > > +  else if (info->order > merged_store->last_order)
> > > +    {
> > > +      FOR_EACH_VEC_ELT (merged_store->stores, i, infoc)
> > > +	if (stmts_may_clobber_ref_p (last, info->stmt, infoc->ops[idx].val))
> > > +	  return false;
> > > +      last = info->stmt;
> > > +    }
> > > +  if (stmts_may_clobber_ref_p (first, last, info->ops[idx].val))
> > > +    return false;
> > 
> > Can you comment on what you check in this block?  It first checks
> > all stmts (but not info->stmt itself if it is after last!?) 
> > against
> > all stores that would be added when adding 'info'.  Then it checks
> > from new first to last against the newly added stmt (again
> > excluding that stmt if it was added last).
> 
> The stmts_may_clobber_ref_p routine doesn't check aliasing on the last
> stmt, only on the first stmt and stmts in between.
> 
> Previous iterations have checked FOR_EACH_VEC_ELT (merged_store->stores, i, infoc)
> that merged_store->first_stmt and stmts in between that and 
> merged_store->last_stmt don't clobber any of the infoc->ops[idx].val
> references and we want to maintain that invariant if we add another store to
> the group.  So, if we are about to extend the range between first_stmt and
> last_stmt, then we need to check all the earlier refs on the stmts we've
> added to the range.  Note that the stores are sorted by bitpos, not by
> their order within the basic block at this point, so it is possible that a
> store with a higher bitpos extends to earlier stmts or later stmts.
> 
> And finally the if (stmts_may_clobber_ref_p (first, last, info->ops[idx].val))
> is checking the reference we are adding against the whole new range.
> 
> > > +  if (offset != NULL_TREE)
> > > +    {
> > > +      /* If the access is variable offset then a base decl has to be
> > > +	 address-taken to be able to emit pointer-based stores to it.
> > > +	 ???  We might be able to get away with re-using the original
> > > +	 base up to the first variable part and then wrapping that inside
> > > +	 a BIT_FIELD_REF.  */
> > 
> > Yes, that's what I'd generally recommend...  OTOH it can get quite
> > fugly but it only has to survive until RTL expansion...
> 
> This is an preexisting comment I've just moved around from the
> pass_store_merging::execute method into a helper function (it grew too much
> and needed too big indentation and furthermore I wanted to use it for the
> loads too).  Haven't really changed anything on that.
> 
> > As extra sanity check I'd rather have that all refs share a common
> > base (operand-equal-p'ish).  But I guess that's what usually will
> > happen anyways.  The alias-ptr-type trick will be tricky to do
> > here as well (you have to go down to the base MEM_REF, wrap
> > a decl if there's no MEM_REF and adjust the offset type).
> 
> For the aliasing, I have an untested incremental patch, need to finish
> testcases for that, then test and then I can post it.
> 
> > given the style of processing we can end up doing more than
> > necessary work when following ! single-use chains here, no?
> > Would it make sense to restrict this to single-uses or do we
> > handle any case of ! single-uses?  When extending things to
> > allow an intermediate swap or general permute we could handle
> > a byte/word-splat for example.
> 
> single-use vs. multiple uses is something I've thought about, but don't
> know whether it is better to require single-use or not (or sometimes,
> under some condition?).  Say if we have:
>   _1 = t.a;
>   _2 = t.b;
>   _3 = t.c;
>   _4 = t.d;
>   s.a = _1;
>   s.b = _2;
>   s.c = _3;
>   s.d = _4;
>   use (_1, _2, _3, _4);
> it will likely be shorter and maybe faster if we do:
>   _1 = t.a;
>   _2 = t.b;
>   _3 = t.c;
>   _4 = t.d;
>   _5 = load[t.a...t.d];
>   store[s.a...s.d] = _5;
>   use (_1, _2, _3, _4);
> If there are just 2 stores combined together, having
>   _1 = t.a; _2 = t.b; _5 = load[t.a...t.b]; store[t.a...t.b] = _5;
>   use (_1, _2);
> will be as many stmts as before.  And if there is BIT_*_EXPR, we can be
> adding not just loads, but also the bitwise ops.
> 
> So, if you want, I can at least for now add has_single_use checks
> in all the spots where it follows SSA_NAME_DEF_STMT.
> 
> > Otherwise the patch looks good.  Quite some parts of the changes
> > seem to be due to splitting out stuff into functions and refactoring.
> 
> Indeed, the mem_valid_for_store_merging and pass_store_merging::process_store
> functions are mostly code move + reindent + tweaks.
> 
> Here are hand made diffs between old portions of pass_store_merging::execute
> corresponding to the above 2 functions and those new functions with -upbd
> if it is of any help.
> 
> @@ -1,33 +1,36 @@
> -	      HOST_WIDE_INT bitsize, bitpos;
> +static tree
> +mem_valid_for_store_merging (tree mem, unsigned HOST_WIDE_INT *pbitsize,
> +			     unsigned HOST_WIDE_INT *pbitpos,
> +			     unsigned HOST_WIDE_INT *pbitregion_start,
> +			     unsigned HOST_WIDE_INT *pbitregion_end)
> +{
> +  HOST_WIDE_INT bitsize;
> +  HOST_WIDE_INT bitpos;
>  	      unsigned HOST_WIDE_INT bitregion_start = 0;
>  	      unsigned HOST_WIDE_INT bitregion_end = 0;
>  	      machine_mode mode;
>  	      int unsignedp = 0, reversep = 0, volatilep = 0;
> -	      tree offset, base_addr;
> -	      base_addr
> -		= get_inner_reference (lhs, &bitsize, &bitpos, &offset, &mode,
> +  tree offset;
> +  tree base_addr = get_inner_reference (mem, &bitsize, &bitpos, &offset, &mode,
>  				       &unsignedp, &reversep, &volatilep);
> -	      if (TREE_CODE (lhs) == COMPONENT_REF
> -		  && DECL_BIT_FIELD_TYPE (TREE_OPERAND (lhs, 1)))
> +  *pbitsize = bitsize;
> +  if (bitsize == 0)
> +    return NULL_TREE;
> +
> +  if (TREE_CODE (mem) == COMPONENT_REF
> +      && DECL_BIT_FIELD_TYPE (TREE_OPERAND (mem, 1)))
>  		{
> -		  get_bit_range (&bitregion_start, &bitregion_end, lhs,
> -				 &bitpos, &offset);
> +      get_bit_range (&bitregion_start, &bitregion_end, mem, &bitpos, &offset);
>  		  if (bitregion_end)
>  		    ++bitregion_end;
>  		}
> -	      if (bitsize == 0)
> -		continue;
>  
> -	      /* As a future enhancement we could handle stores with the same
> -		 base and offset.  */
> -	      bool invalid = reversep
> -			     || ((bitsize > MAX_BITSIZE_MODE_ANY_INT)
> -				  && (TREE_CODE (rhs) != INTEGER_CST))
> -			     || !rhs_valid_for_store_merging_p (rhs);
> +  if (reversep)
> +    return NULL_TREE;
>  
>  	      /* We do not want to rewrite TARGET_MEM_REFs.  */
>  	      if (TREE_CODE (base_addr) == TARGET_MEM_REF)
> -		invalid = true;
> +    return NULL_TREE;
>  	      /* In some cases get_inner_reference may return a
>  		 MEM_REF [ptr + byteoffset].  For the purposes of this pass
>  		 canonicalize the base_addr to MEM_REF [ptr] and take
> @@ -60,7 +63,7 @@
>  			}
>  		    }
>  		  else
> -		    invalid = true;
> +	return NULL_TREE;
>  		  base_addr = TREE_OPERAND (base_addr, 0);
>  		}
>  	      /* get_inner_reference returns the base object, get at its
> @@ -68,7 +71,7 @@
>  	      else
>  		{
>  		  if (bitpos < 0)
> -		    invalid = true;
> +	return NULL_TREE;
>  		  base_addr = build_fold_addr_expr (base_addr);
>  		}
>  
> @@ -78,23 +81,25 @@
>  		  bitregion_end = ROUND_UP (bitpos + bitsize, BITS_PER_UNIT);
>  		}
>  
> -	      if (! invalid
> -		  && offset != NULL_TREE)
> +  if (offset != NULL_TREE)
>  		{
> -		  /* If the access is variable offset then a base
> -		     decl has to be address-taken to be able to
> -		     emit pointer-based stores to it.
> -		     ???  We might be able to get away with
> -		     re-using the original base up to the first
> -		     variable part and then wrapping that inside
> +      /* If the access is variable offset then a base decl has to be
> +	 address-taken to be able to emit pointer-based stores to it.
> +	 ???  We might be able to get away with re-using the original
> +	 base up to the first variable part and then wrapping that inside
>  		     a BIT_FIELD_REF.  */
>  		  tree base = get_base_address (base_addr);
>  		  if (! base
> -		      || (DECL_P (base)
> -			  && ! TREE_ADDRESSABLE (base)))
> -		    invalid = true;
> -		  else
> -		    base_addr = build2 (POINTER_PLUS_EXPR,
> -					TREE_TYPE (base_addr),
> +	  || (DECL_P (base) && ! TREE_ADDRESSABLE (base)))
> +	return NULL_TREE;
> +
> +      base_addr = build2 (POINTER_PLUS_EXPR, TREE_TYPE (base_addr),
>  					base_addr, offset);
>  		}
> +
> +  *pbitsize = bitsize;
> +  *pbitpos = bitpos;
> +  *pbitregion_start = bitregion_start;
> +  *pbitregion_end = bitregion_end;
> +  return base_addr;
> +}
> 
> ------
> 
> @@ -1,71 +1,129 @@
> +void
> +pass_store_merging::process_store (gimple *stmt)
> +{
>  	      tree lhs = gimple_assign_lhs (stmt);
>  	      tree rhs = gimple_assign_rhs1 (stmt);
> +  unsigned HOST_WIDE_INT bitsize, bitpos;
> +  unsigned HOST_WIDE_INT bitregion_start;
> +  unsigned HOST_WIDE_INT bitregion_end;
> +  tree base_addr
> +    = mem_valid_for_store_merging (lhs, &bitsize, &bitpos,
> +				   &bitregion_start, &bitregion_end);
> +  if (bitsize == 0)
> +    return;
>  
> -	      /* As a future enhancement we could handle stores with the same
> -		 base and offset.  */
> -	      bool invalid = reversep
> +  bool invalid = (base_addr == NULL_TREE
>  			     || ((bitsize > MAX_BITSIZE_MODE_ANY_INT)
> -				  && (TREE_CODE (rhs) != INTEGER_CST))
> -			     || !rhs_valid_for_store_merging_p (rhs);
> +		       && (TREE_CODE (rhs) != INTEGER_CST)));
> +  enum tree_code rhs_code = ERROR_MARK;
> +  store_operand_info ops[2];
> +  if (invalid)
> +    ;
> +  else if (rhs_valid_for_store_merging_p (rhs))
> +    {
> +      rhs_code = INTEGER_CST;
> +      ops[0].val = rhs;
> +    }
> +  else if (TREE_CODE (rhs) != SSA_NAME)
> +    invalid = true;
> +  else
> +    {
> +      gimple *def_stmt = SSA_NAME_DEF_STMT (rhs), *def_stmt1, *def_stmt2;
> +      if (!is_gimple_assign (def_stmt))
> +	invalid = true;
> +      else if (handled_load (def_stmt, &ops[0], bitsize, bitpos,
> +			     bitregion_start, bitregion_end))
> +	rhs_code = MEM_REF;
> +      else
> +	switch ((rhs_code = gimple_assign_rhs_code (def_stmt)))
> +	  {
> +	  case BIT_AND_EXPR:
> +	  case BIT_IOR_EXPR:
> +	  case BIT_XOR_EXPR:
> +	    tree rhs1, rhs2;
> +	    rhs1 = gimple_assign_rhs1 (def_stmt);
> +	    rhs2 = gimple_assign_rhs2 (def_stmt);
> +	    invalid = true;
> +	    if (TREE_CODE (rhs1) != SSA_NAME)
> +	      break;
> +	    def_stmt1 = SSA_NAME_DEF_STMT (rhs1);
> +	    if (!is_gimple_assign (def_stmt1)
> +		|| !handled_load (def_stmt1, &ops[0], bitsize, bitpos,
> +				  bitregion_start, bitregion_end))
> +	      break;
> +	    if (rhs_valid_for_store_merging_p (rhs2))
> +	      ops[1].val = rhs2;
> +	    else if (TREE_CODE (rhs2) != SSA_NAME)
> +	      break;
> +	    else
> +	      {
> +		def_stmt2 = SSA_NAME_DEF_STMT (rhs2);
> +		if (!is_gimple_assign (def_stmt2))
> +		  break;
> +		else if (!handled_load (def_stmt2, &ops[1], bitsize, bitpos,
> +					bitregion_start, bitregion_end))
> +		  break;
> +	      }
> +	    invalid = false;
> +	    break;
> +	  default:
> +	    invalid = true;
> +	    break;
> +	  }
> +    }
>  
> -	      struct imm_store_chain_info **chain_info
> -		= m_stores.get (base_addr);
> +  struct imm_store_chain_info **chain_info = NULL;
> +  if (base_addr)
> +    chain_info = m_stores.get (base_addr);
>  
> -	      if (!invalid)
> +  if (invalid)
>  		{
> +      terminate_all_aliasing_chains (chain_info, stmt);
> +      return;
> +    }
> +
>  		  store_immediate_info *info;
>  		  if (chain_info)
>  		    {
>  		      unsigned int ord = (*chain_info)->m_store_info.length ();
> -		      info = new store_immediate_info (bitsize, bitpos,
> -						       bitregion_start,
> -						       bitregion_end,
> -						       stmt, ord);
> +      info = new store_immediate_info (bitsize, bitpos, bitregion_start,
> +				       bitregion_end, stmt, ord, rhs_code,
> +				       ops[0], ops[1]);
>  		      if (dump_file && (dump_flags & TDF_DETAILS))
>  			{
> -			  fprintf (dump_file,
> -				   "Recording immediate store from stmt:\n");
> +	  fprintf (dump_file, "Recording immediate store from stmt:\n");
>  			  print_gimple_stmt (dump_file, stmt, 0);
>  			}
>  		      (*chain_info)->m_store_info.safe_push (info);
> -		      /* If we reach the limit of stores to merge in a chain
> -			 terminate and process the chain now.  */
> +      /* If we reach the limit of stores to merge in a chain terminate and
> +	 process the chain now.  */
>  		      if ((*chain_info)->m_store_info.length ()
> -			   == (unsigned int)
> -			      PARAM_VALUE (PARAM_MAX_STORES_TO_MERGE))
> +	  == (unsigned int) PARAM_VALUE (PARAM_MAX_STORES_TO_MERGE))
>  			{
>  			  if (dump_file && (dump_flags & TDF_DETAILS))
>  			    fprintf (dump_file,
> -				 "Reached maximum number of statements"
> -				 " to merge:\n");
> +		     "Reached maximum number of statements to merge:\n");
>  			  terminate_and_release_chain (*chain_info);
>  			}
> -		      continue;
> +      return;
>  		    }
>  
>  		  /* Store aliases any existing chain?  */
> -		  terminate_all_aliasing_chains (chain_info, false, stmt);
> +  terminate_all_aliasing_chains (chain_info, stmt);
>  		  /* Start a new chain.  */
>  		  struct imm_store_chain_info *new_chain
>  		    = new imm_store_chain_info (m_stores_head, base_addr);
> -		  info = new store_immediate_info (bitsize, bitpos,
> -						   bitregion_start,
> -						   bitregion_end,
> -						   stmt, 0);
> +  info = new store_immediate_info (bitsize, bitpos, bitregion_start,
> +				   bitregion_end, stmt, 0, rhs_code,
> +				   ops[0], ops[1]);
>  		  new_chain->m_store_info.safe_push (info);
>  		  m_stores.put (base_addr, new_chain);
>  		  if (dump_file && (dump_flags & TDF_DETAILS))
>  		    {
> -		      fprintf (dump_file,
> -			       "Starting new chain with statement:\n");
> +      fprintf (dump_file, "Starting new chain with statement:\n");
>  		      print_gimple_stmt (dump_file, stmt, 0);
>  		      fprintf (dump_file, "The base object is:\n");
>  		      print_generic_expr (dump_file, base_addr);
>  		      fprintf (dump_file, "\n");
>  		    }
> -		}
> -	      else
> -		terminate_all_aliasing_chains (chain_info,
> -					       offset != NULL_TREE, stmt);
> -
> -	      continue;
> +}
> 
> 
> 	Jakub
> 
>
Jakub Jelinek Nov. 3, 2017, 7:17 p.m. UTC | #4
On Fri, Nov 03, 2017 at 03:04:18PM +0100, Jakub Jelinek wrote:
> single-use vs. multiple uses is something I've thought about, but don't
> know whether it is better to require single-use or not (or sometimes,
> under some condition?).  Say if we have:

So, here is what I've committed in the end after bootstrapping/regtesting
it on x86_64-linux and i686-linux; the only changes from the earlier patch
are comment updates and the addition of has_single_use checks.

In those bootstraps/regtests, the number of integer_cst stores was, as
expected, the same, and so was the number of bit_*_expr cases, but the
has_single_use restriction apparently matters a lot for memory copying
(rhs_code MEM_REF).
Without this patch the new/orig store counts are:
16943   35369
and with the patch:
12111   24911
So, perhaps we'll need to do something smarter (approximate how many of the
original loads would be kept and how many new loads/stores we'd need to add
in order to get rid of how many original stores), or allow multiple uses
for the MEM_REF rhs_code only and require a single use for anything else.
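
I.e. roughly a helper like this made-up one (names and the exact counting
are just a sketch, not part of the committed patch) when deciding whether
to keep a merged group:

/* Sketch only: accept a merged group only if it actually shrinks the
   statement count.  Loads whose SSA names have uses outside the group
   have to stay either way, so only single-use loads count as removed.  */
static bool
merging_profitable_p (unsigned int orig_stores,
		      unsigned int orig_single_use_loads,
		      unsigned int new_stores, unsigned int new_loads)
{
  unsigned int removed = orig_stores + orig_single_use_loads;
  unsigned int added = new_stores + new_loads;
  return added < removed;
}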

2017-11-03  Jakub Jelinek  <jakub@redhat.com>

	PR tree-optimization/78821
	* gimple-ssa-store-merging.c: Update the file comment.
	(MAX_STORE_ALIAS_CHECKS): Define.
	(struct store_operand_info): New type.
	(store_operand_info::store_operand_info): New constructor.
	(struct store_immediate_info): Add rhs_code and ops data members.
	(store_immediate_info::store_immediate_info): Add rhscode, op0r
	and op1r arguments to the ctor, initialize corresponding data members.
	(struct merged_store_group): Add load_align_base and load_align
	data members.
	(merged_store_group::merged_store_group): Initialize them.
	(merged_store_group::do_merge): Update them.
	(merged_store_group::apply_stores): Pick the constant for
	encode_tree_to_bitpos from one of the two operands, or skip
	encode_tree_to_bitpos if neither operand is a constant.
	(class pass_store_merging): Add process_store method decl.  Remove
	bool argument from terminate_all_aliasing_chains method decl.
	(pass_store_merging::terminate_all_aliasing_chains): Remove
	var_offset_p argument and corresponding handling.
	(stmts_may_clobber_ref_p): New function.
	(compatible_load_p): New function.
	(imm_store_chain_info::coalesce_immediate_stores): Terminate group
	if there is overlap and rhs_code is not INTEGER_CST.  For
	non-overlapping stores terminate group if rhs is not mergeable.
	(get_alias_type_for_stmts): Change first argument from
	auto_vec<gimple *> & to vec<gimple *> &.  Add IS_LOAD, CLIQUEP and
	BASEP arguments.  If IS_LOAD is true, look at rhs1 of the stmts
	instead of lhs.  Compute *CLIQUEP and *BASEP in addition to the
	alias type.
	(get_location_for_stmts): Change first argument from
	auto_vec<gimple *> & to vec<gimple *> &.
	(struct split_store): Remove orig_stmts data member, add orig_stores.
	(split_store::split_store): Create orig_stores rather than orig_stmts.
	(find_constituent_stmts): Renamed to ...
	(find_constituent_stores): ... this.  Change second argument from
	vec<gimple *> * to vec<store_immediate_info *> *, push pointers
	to info structures rather than the statements.
	(split_group): Rename ALLOW_UNALIGNED argument to
	ALLOW_UNALIGNED_STORE, add ALLOW_UNALIGNED_LOAD argument and handle
	it.  Adjust find_constituent_stores caller.
	(imm_store_chain_info::output_merged_store): Handle rhs_code other
	than INTEGER_CST, adjust split_group, get_alias_type_for_stmts and
	get_location_for_stmts callers.  Set MR_DEPENDENCE_CLIQUE and
	MR_DEPENDENCE_BASE on the MEM_REFs if they are the same in all stores.
	(mem_valid_for_store_merging): New function.
	(handled_load): New function.
	(pass_store_merging::process_store): New method.
	(pass_store_merging::execute): Use process_store method.  Adjust
	terminate_all_aliasing_chains caller.

	* gcc.dg/store_merging_13.c: New test.
	* gcc.dg/store_merging_14.c: New test.

--- gcc/gimple-ssa-store-merging.c.jj	2017-11-03 15:37:02.869561500 +0100
+++ gcc/gimple-ssa-store-merging.c	2017-11-03 16:15:15.059282459 +0100
@@ -19,7 +19,8 @@
    <http://www.gnu.org/licenses/>.  */
 
 /* The purpose of this pass is to combine multiple memory stores of
-   constant values to consecutive memory locations into fewer wider stores.
+   constant values, values loaded from memory or bitwise operations
+   on those to consecutive memory locations into fewer wider stores.
    For example, if we have a sequence peforming four byte stores to
    consecutive memory locations:
    [p     ] := imm1;
@@ -29,21 +30,49 @@
    we can transform this into a single 4-byte store if the target supports it:
   [p] := imm1:imm2:imm3:imm4 //concatenated immediates according to endianness.
 
+   Or:
+   [p     ] := [q     ];
+   [p + 1B] := [q + 1B];
+   [p + 2B] := [q + 2B];
+   [p + 3B] := [q + 3B];
+   if there is no overlap can be transformed into a single 4-byte
+   load followed by single 4-byte store.
+
+   Or:
+   [p     ] := [q     ] ^ imm1;
+   [p + 1B] := [q + 1B] ^ imm2;
+   [p + 2B] := [q + 2B] ^ imm3;
+   [p + 3B] := [q + 3B] ^ imm4;
+   if there is no overlap can be transformed into a single 4-byte
+   load, xored with imm1:imm2:imm3:imm4 and stored using a single 4-byte store.
+
    The algorithm is applied to each basic block in three phases:
 
-   1) Scan through the basic block recording constant assignments to
+   1) Scan through the basic block recording assignments to
    destinations that can be expressed as a store to memory of a certain size
-   at a certain bit offset.  Record store chains to different bases in a
-   hash_map (m_stores) and make sure to terminate such chains when appropriate
-   (for example when when the stored values get used subsequently).
+   at a certain bit offset from expressions we can handle.  For bit-fields
+   we also note the surrounding bit region, bits that could be stored in
+   a read-modify-write operation when storing the bit-field.  Record store
+   chains to different bases in a hash_map (m_stores) and make sure to
+   terminate such chains when appropriate (for example when when the stored
+   values get used subsequently).
    These stores can be a result of structure element initializers, array stores
    etc.  A store_immediate_info object is recorded for every such store.
    Record as many such assignments to a single base as possible until a
    statement that interferes with the store sequence is encountered.
+   Each store has up to 2 operands, which can be an immediate constant
+   or a memory load, from which the value to be stored can be computed.
+   At most one of the operands can be a constant.  The operands are recorded
+   in store_operand_info struct.
 
    2) Analyze the chain of stores recorded in phase 1) (i.e. the vector of
    store_immediate_info objects) and coalesce contiguous stores into
-   merged_store_group objects.
+   merged_store_group objects.  For bit-fields stores, we don't need to
+   require the stores to be contiguous, just their surrounding bit regions
+   have to be contiguous.  If the expression being stored is different
+   between adjacent stores, such as one store storing a constant and
+   following storing a value loaded from memory, or if the loaded memory
+   objects are not adjacent, a new merged_store_group is created as well.
 
    For example, given the stores:
    [p     ] := 0;
@@ -134,8 +163,35 @@
 #define MAX_STORE_BITSIZE (BITS_PER_WORD)
 #define MAX_STORE_BYTES (MAX_STORE_BITSIZE / BITS_PER_UNIT)
 
+/* Limit to bound the number of aliasing checks for loads with the same
+   vuse as the corresponding store.  */
+#define MAX_STORE_ALIAS_CHECKS 64
+
 namespace {
 
+/* Struct recording one operand for the store, which is either a constant,
+   then VAL represents the constant and all the other fields are zero,
+   or a memory load, then VAL represents the reference, BASE_ADDR is non-NULL
+   and the other fields also reflect the memory load.  */
+
+struct store_operand_info
+{
+  tree val;
+  tree base_addr;
+  unsigned HOST_WIDE_INT bitsize;
+  unsigned HOST_WIDE_INT bitpos;
+  unsigned HOST_WIDE_INT bitregion_start;
+  unsigned HOST_WIDE_INT bitregion_end;
+  gimple *stmt;
+  store_operand_info ();
+};
+
+store_operand_info::store_operand_info ()
+  : val (NULL_TREE), base_addr (NULL_TREE), bitsize (0), bitpos (0),
+    bitregion_start (0), bitregion_end (0), stmt (NULL)
+{
+}
+
 /* Struct recording the information about a single store of an immediate
    to memory.  These are created in the first phase and coalesced into
    merged_store_group objects in the second phase.  */
@@ -149,9 +205,17 @@ struct store_immediate_info
   unsigned HOST_WIDE_INT bitregion_end;
   gimple *stmt;
   unsigned int order;
+  /* INTEGER_CST for constant stores, MEM_REF for memory copy or
+     BIT_*_EXPR for logical bitwise operation.  */
+  enum tree_code rhs_code;
+  /* Operands.  For BIT_*_EXPR rhs_code both operands are used, otherwise
+     just the first one.  */
+  store_operand_info ops[2];
   store_immediate_info (unsigned HOST_WIDE_INT, unsigned HOST_WIDE_INT,
 			unsigned HOST_WIDE_INT, unsigned HOST_WIDE_INT,
-			gimple *, unsigned int);
+			gimple *, unsigned int, enum tree_code,
+			const store_operand_info &,
+			const store_operand_info &);
 };
 
 store_immediate_info::store_immediate_info (unsigned HOST_WIDE_INT bs,
@@ -159,11 +223,22 @@ store_immediate_info::store_immediate_in
 					    unsigned HOST_WIDE_INT brs,
 					    unsigned HOST_WIDE_INT bre,
 					    gimple *st,
-					    unsigned int ord)
+					    unsigned int ord,
+					    enum tree_code rhscode,
+					    const store_operand_info &op0r,
+					    const store_operand_info &op1r)
   : bitsize (bs), bitpos (bp), bitregion_start (brs), bitregion_end (bre),
-    stmt (st), order (ord)
+    stmt (st), order (ord), rhs_code (rhscode)
+#if __cplusplus >= 201103L
+    , ops { op0r, op1r }
+{
+}
+#else
 {
+  ops[0] = op0r;
+  ops[1] = op1r;
 }
+#endif
 
 /* Struct representing a group of stores to contiguous memory locations.
    These are produced by the second phase (coalescing) and consumed in the
@@ -178,8 +253,10 @@ struct merged_store_group
   /* The size of the allocated memory for val and mask.  */
   unsigned HOST_WIDE_INT buf_size;
   unsigned HOST_WIDE_INT align_base;
+  unsigned HOST_WIDE_INT load_align_base[2];
 
   unsigned int align;
+  unsigned int load_align[2];
   unsigned int first_order;
   unsigned int last_order;
 
@@ -576,6 +653,20 @@ merged_store_group::merged_store_group (
   get_object_alignment_1 (gimple_assign_lhs (info->stmt),
 			  &align, &align_bitpos);
   align_base = start - align_bitpos;
+  for (int i = 0; i < 2; ++i)
+    {
+      store_operand_info &op = info->ops[i];
+      if (op.base_addr == NULL_TREE)
+	{
+	  load_align[i] = 0;
+	  load_align_base[i] = 0;
+	}
+      else
+	{
+	  get_object_alignment_1 (op.val, &load_align[i], &align_bitpos);
+	  load_align_base[i] = op.bitpos - align_bitpos;
+	}
+    }
   stores.create (1);
   stores.safe_push (info);
   last_stmt = info->stmt;
@@ -608,6 +699,19 @@ merged_store_group::do_merge (store_imme
       align = this_align;
       align_base = info->bitpos - align_bitpos;
     }
+  for (int i = 0; i < 2; ++i)
+    {
+      store_operand_info &op = info->ops[i];
+      if (!op.base_addr)
+	continue;
+
+      get_object_alignment_1 (op.val, &this_align, &align_bitpos);
+      if (this_align > load_align[i])
+	{
+	  load_align[i] = this_align;
+	  load_align_base[i] = op.bitpos - align_bitpos;
+	}
+    }
 
   gimple *stmt = info->stmt;
   stores.safe_push (info);
@@ -682,16 +786,21 @@ merged_store_group::apply_stores ()
   FOR_EACH_VEC_ELT (stores, i, info)
     {
       unsigned int pos_in_buffer = info->bitpos - bitregion_start;
-      bool ret = encode_tree_to_bitpos (gimple_assign_rhs1 (info->stmt),
-					val, info->bitsize,
-					pos_in_buffer, buf_size);
-      if (dump_file && (dump_flags & TDF_DETAILS))
+      tree cst = NULL_TREE;
+      if (info->ops[0].val && info->ops[0].base_addr == NULL_TREE)
+	cst = info->ops[0].val;
+      else if (info->ops[1].val && info->ops[1].base_addr == NULL_TREE)
+	cst = info->ops[1].val;
+      bool ret = true;
+      if (cst)
+	ret = encode_tree_to_bitpos (cst, val, info->bitsize,
+				     pos_in_buffer, buf_size);
+      if (cst && dump_file && (dump_flags & TDF_DETAILS))
 	{
 	  if (ret)
 	    {
 	      fprintf (dump_file, "After writing ");
-	      print_generic_expr (dump_file,
-				  gimple_assign_rhs1 (info->stmt), 0);
+	      print_generic_expr (dump_file, cst, 0);
 	      fprintf (dump_file, " of size " HOST_WIDE_INT_PRINT_DEC
 			" at position %d the merged region contains:\n",
 			info->bitsize, pos_in_buffer);
@@ -799,9 +908,10 @@ private:
      decisions when going out of SSA).  */
   imm_store_chain_info *m_stores_head;
 
+  void process_store (gimple *);
   bool terminate_and_process_all_chains ();
   bool terminate_all_aliasing_chains (imm_store_chain_info **,
-				      bool, gimple *);
+				      gimple *);
   bool terminate_and_release_chain (imm_store_chain_info *);
 }; // class pass_store_merging
 
@@ -831,7 +941,6 @@ pass_store_merging::terminate_and_proces
 bool
 pass_store_merging::terminate_all_aliasing_chains (imm_store_chain_info
 						     **chain_info,
-						   bool var_offset_p,
 						   gimple *stmt)
 {
   bool ret = false;
@@ -845,37 +954,21 @@ pass_store_merging::terminate_all_aliasi
      of a chain.  */
   if (chain_info)
     {
-      /* We have a chain at BASE and we're writing to [BASE + <variable>].
-	 This can interfere with any of the stores so terminate
-	 the chain.  */
-      if (var_offset_p)
-	{
-	  terminate_and_release_chain (*chain_info);
-	  ret = true;
-	}
-      /* Otherwise go through every store in the chain to see if it
-	 aliases with any of them.  */
-      else
+      store_immediate_info *info;
+      unsigned int i;
+      FOR_EACH_VEC_ELT ((*chain_info)->m_store_info, i, info)
 	{
-	  store_immediate_info *info;
-	  unsigned int i;
-	  FOR_EACH_VEC_ELT ((*chain_info)->m_store_info, i, info)
+	  if (ref_maybe_used_by_stmt_p (stmt, gimple_assign_lhs (info->stmt))
+	      || stmt_may_clobber_ref_p (stmt, gimple_assign_lhs (info->stmt)))
 	    {
-	      if (ref_maybe_used_by_stmt_p (stmt,
-					    gimple_assign_lhs (info->stmt))
-		  || stmt_may_clobber_ref_p (stmt,
-					     gimple_assign_lhs (info->stmt)))
+	      if (dump_file && (dump_flags & TDF_DETAILS))
 		{
-		  if (dump_file && (dump_flags & TDF_DETAILS))
-		    {
-		      fprintf (dump_file,
-			       "stmt causes chain termination:\n");
-		      print_gimple_stmt (dump_file, stmt, 0);
-		    }
-		  terminate_and_release_chain (*chain_info);
-		  ret = true;
-		  break;
+		  fprintf (dump_file, "stmt causes chain termination:\n");
+		  print_gimple_stmt (dump_file, stmt, 0);
 		}
+	      terminate_and_release_chain (*chain_info);
+	      ret = true;
+	      break;
 	    }
 	}
     }
@@ -920,6 +1013,125 @@ pass_store_merging::terminate_and_releas
   return ret;
 }
 
+/* Return true if stmts in between FIRST (inclusive) and LAST (exclusive)
+   may clobber REF.  FIRST and LAST must be in the same basic block and
+   have non-NULL vdef.  */
+
+bool
+stmts_may_clobber_ref_p (gimple *first, gimple *last, tree ref)
+{
+  ao_ref r;
+  ao_ref_init (&r, ref);
+  unsigned int count = 0;
+  tree vop = gimple_vdef (last);
+  gimple *stmt;
+
+  gcc_checking_assert (gimple_bb (first) == gimple_bb (last));
+  do
+    {
+      stmt = SSA_NAME_DEF_STMT (vop);
+      if (stmt_may_clobber_ref_p_1 (stmt, &r))
+	return true;
+      /* Avoid quadratic compile time by bounding the number of checks
+	 we perform.  */
+      if (++count > MAX_STORE_ALIAS_CHECKS)
+	return true;
+      vop = gimple_vuse (stmt);
+    }
+  while (stmt != first);
+  return false;
+}
+
+/* Return true if INFO->ops[IDX] is mergeable with the
+   corresponding loads already in MERGED_STORE group.
+   BASE_ADDR is the base address of the whole store group.  */
+
+bool
+compatible_load_p (merged_store_group *merged_store,
+		   store_immediate_info *info,
+		   tree base_addr, int idx)
+{
+  store_immediate_info *infof = merged_store->stores[0];
+  if (!info->ops[idx].base_addr
+      || (info->ops[idx].bitpos - infof->ops[idx].bitpos
+	  != info->bitpos - infof->bitpos)
+      || !operand_equal_p (info->ops[idx].base_addr,
+			   infof->ops[idx].base_addr, 0))
+    return false;
+
+  store_immediate_info *infol = merged_store->stores.last ();
+  tree load_vuse = gimple_vuse (info->ops[idx].stmt);
+  /* In this case all vuses should be the same, e.g.
+     _1 = s.a; _2 = s.b; _3 = _1 | 1; t.a = _3; _4 = _2 | 2; t.b = _4;
+     or
+     _1 = s.a; _2 = s.b; t.a = _1; t.b = _2;
+     and we can emit the coalesced load next to any of those loads.  */
+  if (gimple_vuse (infof->ops[idx].stmt) == load_vuse
+      && gimple_vuse (infol->ops[idx].stmt) == load_vuse)
+    return true;
+
+  /* Otherwise, at least for now require that the load has the same
+     vuse as the store.  See following examples.  */
+  if (gimple_vuse (info->stmt) != load_vuse)
+    return false;
+
+  if (gimple_vuse (infof->stmt) != gimple_vuse (infof->ops[idx].stmt)
+      || (infof != infol
+	  && gimple_vuse (infol->stmt) != gimple_vuse (infol->ops[idx].stmt)))
+    return false;
+
+  /* If the load is from the same location as the store, the construction
+     of the immediate chain info already guarantees there are no intervening
+     stores, so no further checks are needed.  Example:
+     _1 = s.a; _2 = _1 & -7; s.a = _2; _3 = s.b; _4 = _3 & -7; s.b = _4;  */
+  if (info->ops[idx].bitpos == info->bitpos
+      && operand_equal_p (info->ops[idx].base_addr, base_addr, 0))
+    return true;
+
+  /* Otherwise, we need to punt if any of the loads can be clobbered by any
+     of the stores in the group, or any other stores in between those.
+     Previous calls to compatible_load_p ensured that for all the
+     merged_store->stores IDX loads, no stmts starting with
+     merged_store->first_stmt and ending right before merged_store->last_stmt
+     clobber those loads.  */
+  gimple *first = merged_store->first_stmt;
+  gimple *last = merged_store->last_stmt;
+  unsigned int i;
+  store_immediate_info *infoc;
+  /* The stores are sorted by increasing store bitpos, so if the info->stmt
+     store comes before the first load seen so far, we'll be changing
+     merged_store->first_stmt.  In that case we need to give up if
+     any of the earlier processed loads are clobbered by the stmts in the
+     new range.  */
+  if (info->order < merged_store->first_order)
+    {
+      FOR_EACH_VEC_ELT (merged_store->stores, i, infoc)
+	if (stmts_may_clobber_ref_p (info->stmt, first, infoc->ops[idx].val))
+	  return false;
+      first = info->stmt;
+    }
+  /* Similarly, we could change merged_store->last_stmt, so ensure
+     in that case no stmts in the new range clobber any of the earlier
+     processed loads.  */
+  else if (info->order > merged_store->last_order)
+    {
+      FOR_EACH_VEC_ELT (merged_store->stores, i, infoc)
+	if (stmts_may_clobber_ref_p (last, info->stmt, infoc->ops[idx].val))
+	  return false;
+      last = info->stmt;
+    }
+  /* And finally, we'd be adding a new load to the set, ensure it isn't
+     clobbered in the new range.  */
+  if (stmts_may_clobber_ref_p (first, last, info->ops[idx].val))
+    return false;
+
+  /* Otherwise, we are looking for:
+     _1 = s.a; _2 = _1 ^ 15; t.a = _2; _3 = s.b; _4 = _3 ^ 15; t.b = _4;
+     or
+     _1 = s.a; t.a = _1; _2 = s.b; t.b = _2;  */
+  return true;
+}
+
 /* Go through the candidate stores recorded in m_store_info and merge them
    into merged_store_group objects recorded into m_merged_store_groups
    representing the widened stores.  Return true if coalescing was successful
@@ -967,32 +1179,56 @@ imm_store_chain_info::coalesce_immediate
       if (IN_RANGE (start, merged_store->start,
 		    merged_store->start + merged_store->width - 1))
 	{
-	  merged_store->merge_overlapping (info);
-	  continue;
+	  /* Only allow overlapping stores of constants.  */
+	  if (info->rhs_code == INTEGER_CST
+	      && merged_store->stores[0]->rhs_code == INTEGER_CST)
+	    {
+	      merged_store->merge_overlapping (info);
+	      continue;
+	    }
+	}
+      /* |---store 1---||---store 2---|
+	 This store is consecutive to the previous one.
+	 Merge it into the current store group.  There can be gaps in between
+	 the stores, but there can't be gaps in between bitregions.  */
+      else if (info->bitregion_start <= merged_store->bitregion_end
+	       && info->rhs_code == merged_store->stores[0]->rhs_code)
+	{
+	  store_immediate_info *infof = merged_store->stores[0];
+
+	  /* All the rhs_code ops that take 2 operands are commutative,
+	     swap the operands if it could make the operands compatible.  */
+	  if (infof->ops[0].base_addr
+	      && infof->ops[1].base_addr
+	      && info->ops[0].base_addr
+	      && info->ops[1].base_addr
+	      && (info->ops[1].bitpos - infof->ops[0].bitpos
+		  == info->bitpos - infof->bitpos)
+	      && operand_equal_p (info->ops[1].base_addr,
+				  infof->ops[0].base_addr, 0))
+	    std::swap (info->ops[0], info->ops[1]);
+	  if ((!infof->ops[0].base_addr
+	       || compatible_load_p (merged_store, info, base_addr, 0))
+	      && (!infof->ops[1].base_addr
+		  || compatible_load_p (merged_store, info, base_addr, 1)))
+	    {
+	      merged_store->merge_into (info);
+	      continue;
+	    }
 	}
 
       /* |---store 1---| <gap> |---store 2---|.
-	 Gap between stores.  Start a new group if there are any gaps
-	 between bitregions.  */
-      if (info->bitregion_start > merged_store->bitregion_end)
-	{
-	  /* Try to apply all the stores recorded for the group to determine
-	     the bitpattern they write and discard it if that fails.
-	     This will also reject single-store groups.  */
-	  if (!merged_store->apply_stores ())
-	    delete merged_store;
-	  else
-	    m_merged_store_groups.safe_push (merged_store);
-
-	  merged_store = new merged_store_group (info);
+	 Gap between stores or the rhs not compatible.  Start a new group.  */
 
-	  continue;
-	}
+      /* Try to apply all the stores recorded for the group to determine
+	 the bitpattern they write and discard it if that fails.
+	 This will also reject single-store groups.  */
+      if (!merged_store->apply_stores ())
+	delete merged_store;
+      else
+	m_merged_store_groups.safe_push (merged_store);
 
-      /* |---store 1---||---store 2---|
-	 This store is consecutive to the previous one.
-	 Merge it into the current store group.  */
-       merged_store->merge_into (info);
+      merged_store = new merged_store_group (info);
     }
 
   /* Record or discard the last store group.  */
@@ -1014,35 +1250,57 @@ imm_store_chain_info::coalesce_immediate
   return success;
 }
 
-/* Return the type to use for the merged stores described by STMTS.
-   This is needed to get the alias sets right.  */
+/* Return the type to use for the merged stores or loads described by STMTS.
+   This is needed to get the alias sets right.  If IS_LOAD, look at the rhs
+   of the stmts, otherwise at the lhs.  Additionally set *CLIQUEP and *BASEP
+   to the MR_DEPENDENCE_* of the MEM_REFs, if any.  */
 
 static tree
-get_alias_type_for_stmts (auto_vec<gimple *> &stmts)
+get_alias_type_for_stmts (vec<gimple *> &stmts, bool is_load,
+			  unsigned short *cliquep, unsigned short *basep)
 {
   gimple *stmt;
   unsigned int i;
-  tree lhs = gimple_assign_lhs (stmts[0]);
-  tree type = reference_alias_ptr_type (lhs);
+  tree type = NULL_TREE;
+  tree ret = NULL_TREE;
+  *cliquep = 0;
+  *basep = 0;
 
   FOR_EACH_VEC_ELT (stmts, i, stmt)
     {
-      if (i == 0)
-	continue;
+      tree ref = is_load ? gimple_assign_rhs1 (stmt)
+			 : gimple_assign_lhs (stmt);
+      tree type1 = reference_alias_ptr_type (ref);
+      tree base = get_base_address (ref);
 
-      lhs = gimple_assign_lhs (stmt);
-      tree type1 = reference_alias_ptr_type (lhs);
+      if (i == 0)
+	{
+	  if (TREE_CODE (base) == MEM_REF)
+	    {
+	      *cliquep = MR_DEPENDENCE_CLIQUE (base);
+	      *basep = MR_DEPENDENCE_BASE (base);
+	    }
+	  ret = type = type1;
+	  continue;
+	}
       if (!alias_ptr_types_compatible_p (type, type1))
-	return ptr_type_node;
+	ret = ptr_type_node;
+      if (TREE_CODE (base) != MEM_REF
+	  || *cliquep != MR_DEPENDENCE_CLIQUE (base)
+	  || *basep != MR_DEPENDENCE_BASE (base))
+	{
+	  *cliquep = 0;
+	  *basep = 0;
+	}
     }
-  return type;
+  return ret;
 }
 
 /* Return the location_t information we can find among the statements
    in STMTS.  */
 
 static location_t
-get_location_for_stmts (auto_vec<gimple *> &stmts)
+get_location_for_stmts (vec<gimple *> &stmts)
 {
   gimple *stmt;
   unsigned int i;
@@ -1062,7 +1320,7 @@ struct split_store
   unsigned HOST_WIDE_INT bytepos;
   unsigned HOST_WIDE_INT size;
   unsigned HOST_WIDE_INT align;
-  auto_vec<gimple *> orig_stmts;
+  auto_vec<store_immediate_info *> orig_stores;
   /* True if there is a single orig stmt covering the whole split store.  */
   bool orig;
   split_store (unsigned HOST_WIDE_INT, unsigned HOST_WIDE_INT,
@@ -1076,21 +1334,20 @@ split_store::split_store (unsigned HOST_
 			  unsigned HOST_WIDE_INT al)
 			  : bytepos (bp), size (sz), align (al), orig (false)
 {
-  orig_stmts.create (0);
+  orig_stores.create (0);
 }
 
-/* Record all statements corresponding to stores in GROUP that write to
-   the region starting at BITPOS and is of size BITSIZE.  Record such
-   statements in STMTS if non-NULL.  The stores in GROUP must be sorted by
-   bitposition.  Return INFO if there is exactly one original store
-   in the range.  */
+/* Record all stores in GROUP that write to the region starting at BITPOS
+   and of size BITSIZE.  Record infos for such statements in STORES if
+   non-NULL.  The stores in GROUP must be sorted by bitposition.  Return INFO
+   if there is exactly one original store in the range.  */
 
 static store_immediate_info *
-find_constituent_stmts (struct merged_store_group *group,
-			vec<gimple *> *stmts,
-			unsigned int *first,
-			unsigned HOST_WIDE_INT bitpos,
-			unsigned HOST_WIDE_INT bitsize)
+find_constituent_stores (struct merged_store_group *group,
+			 vec<store_immediate_info *> *stores,
+			 unsigned int *first,
+			 unsigned HOST_WIDE_INT bitpos,
+			 unsigned HOST_WIDE_INT bitsize)
 {
   store_immediate_info *info, *ret = NULL;
   unsigned int i;
@@ -1119,9 +1376,9 @@ find_constituent_stmts (struct merged_st
       if (stmt_start >= end)
 	return ret;
 
-      if (stmts)
+      if (stores)
 	{
-	  stmts->safe_push (info->stmt);
+	  stores->safe_push (info);
 	  if (ret)
 	    {
 	      ret = NULL;
@@ -1143,11 +1400,14 @@ find_constituent_stmts (struct merged_st
    This is to separate the splitting strategy from the statement
    building/emission/linking done in output_merged_store.
    Return number of new stores.
+   If ALLOW_UNALIGNED_STORE is false, then all stores must be aligned.
+   If ALLOW_UNALIGNED_LOAD is false, then all loads must be aligned.
    If SPLIT_STORES is NULL, it is just a dry run to count number of
    new stores.  */
 
 static unsigned int
-split_group (merged_store_group *group, bool allow_unaligned,
+split_group (merged_store_group *group, bool allow_unaligned_store,
+	     bool allow_unaligned_load,
 	     vec<struct split_store *> *split_stores)
 {
   unsigned HOST_WIDE_INT pos = group->bitregion_start;
@@ -1155,6 +1415,7 @@ split_group (merged_store_group *group,
   unsigned HOST_WIDE_INT bytepos = pos / BITS_PER_UNIT;
   unsigned HOST_WIDE_INT group_align = group->align;
   unsigned HOST_WIDE_INT align_base = group->align_base;
+  unsigned HOST_WIDE_INT group_load_align = group_align;
 
   gcc_assert ((size % BITS_PER_UNIT == 0) && (pos % BITS_PER_UNIT == 0));
 
@@ -1162,9 +1423,14 @@ split_group (merged_store_group *group,
   unsigned HOST_WIDE_INT try_pos = bytepos;
   group->stores.qsort (sort_by_bitpos);
 
+  if (!allow_unaligned_load)
+    for (int i = 0; i < 2; ++i)
+      if (group->load_align[i])
+	group_load_align = MIN (group_load_align, group->load_align[i]);
+
   while (size > 0)
     {
-      if ((allow_unaligned || group_align <= BITS_PER_UNIT)
+      if ((allow_unaligned_store || group_align <= BITS_PER_UNIT)
 	  && group->mask[try_pos - bytepos] == (unsigned char) ~0U)
 	{
 	  /* Skip padding bytes.  */
@@ -1180,10 +1446,34 @@ split_group (merged_store_group *group,
       unsigned HOST_WIDE_INT align = group_align;
       if (align_bitpos)
 	align = least_bit_hwi (align_bitpos);
-      if (!allow_unaligned)
+      if (!allow_unaligned_store)
 	try_size = MIN (try_size, align);
+      if (!allow_unaligned_load)
+	{
+	  /* If we can't do or don't want to do unaligned loads (in
+	     addition to unaligned stores), we need to take the load
+	     alignment into account as well.  */
+	  unsigned HOST_WIDE_INT load_align = group_load_align;
+	  align_bitpos = (try_bitpos - align_base) & (load_align - 1);
+	  if (align_bitpos)
+	    load_align = least_bit_hwi (align_bitpos);
+	  for (int i = 0; i < 2; ++i)
+	    if (group->load_align[i])
+	      {
+		align_bitpos = try_bitpos - group->stores[0]->bitpos;
+		align_bitpos += group->stores[0]->ops[i].bitpos;
+		align_bitpos -= group->load_align_base[i];
+		align_bitpos &= (group_load_align - 1);
+		if (align_bitpos)
+		  {
+		    unsigned HOST_WIDE_INT a = least_bit_hwi (align_bitpos);
+		    load_align = MIN (load_align, a);
+		  }
+	      }
+	  try_size = MIN (try_size, load_align);
+	}
       store_immediate_info *info
-	= find_constituent_stmts (group, NULL, &first, try_bitpos, try_size);
+	= find_constituent_stores (group, NULL, &first, try_bitpos, try_size);
       if (info)
 	{
 	  /* If there is just one original statement for the range, see if
@@ -1191,8 +1481,8 @@ split_group (merged_store_group *group,
 	     than try_size.  */
 	  unsigned HOST_WIDE_INT stmt_end
 	    = ROUND_UP (info->bitpos + info->bitsize, BITS_PER_UNIT);
-	  info = find_constituent_stmts (group, NULL, &first, try_bitpos,
-					 stmt_end - try_bitpos);
+	  info = find_constituent_stores (group, NULL, &first, try_bitpos,
+					  stmt_end - try_bitpos);
 	  if (info && info->bitpos >= try_bitpos)
 	    {
 	      try_size = stmt_end - try_bitpos;
@@ -1221,7 +1511,7 @@ split_group (merged_store_group *group,
       nonmasked *= BITS_PER_UNIT;
       while (nonmasked <= try_size / 2)
 	try_size /= 2;
-      if (!allow_unaligned && group_align > BITS_PER_UNIT)
+      if (!allow_unaligned_store && group_align > BITS_PER_UNIT)
 	{
 	  /* Now look for whole padding bytes at the start of that bitsize.  */
 	  unsigned int try_bytesize = try_size / BITS_PER_UNIT, masked;
@@ -1252,8 +1542,8 @@ split_group (merged_store_group *group,
 	{
 	  struct split_store *store
 	    = new split_store (try_pos, try_size, align);
-	  info = find_constituent_stmts (group, &store->orig_stmts,
-	  				 &first, try_bitpos, try_size);
+	  info = find_constituent_stores (group, &store->orig_stores,
+					  &first, try_bitpos, try_size);
 	  if (info
 	      && info->bitpos >= try_bitpos
 	      && info->bitpos + info->bitsize <= try_bitpos + try_size)
@@ -1288,19 +1578,23 @@ imm_store_chain_info::output_merged_stor
 
   auto_vec<struct split_store *, 32> split_stores;
   split_stores.create (0);
-  bool allow_unaligned
+  bool allow_unaligned_store
     = !STRICT_ALIGNMENT && PARAM_VALUE (PARAM_STORE_MERGING_ALLOW_UNALIGNED);
-  if (allow_unaligned)
+  bool allow_unaligned_load = allow_unaligned_store;
+  if (allow_unaligned_store)
     {
       /* If unaligned stores are allowed, see how many stores we'd emit
 	 for unaligned and how many stores we'd emit for aligned stores.
 	 Only use unaligned stores if it allows fewer stores than aligned.  */
-      unsigned aligned_cnt = split_group (group, false, NULL);
-      unsigned unaligned_cnt = split_group (group, true, NULL);
+      unsigned aligned_cnt
+	= split_group (group, false, allow_unaligned_load, NULL);
+      unsigned unaligned_cnt
+	= split_group (group, true, allow_unaligned_load, NULL);
       if (aligned_cnt <= unaligned_cnt)
-	allow_unaligned = false;
+	allow_unaligned_store = false;
     }
-  split_group (group, allow_unaligned, &split_stores);
+  split_group (group, allow_unaligned_store, allow_unaligned_load,
+	       &split_stores);
 
   if (split_stores.length () >= orig_num_stmts)
     {
@@ -1323,9 +1617,37 @@ imm_store_chain_info::output_merged_stor
   gimple *stmt = NULL;
   split_store *split_store;
   unsigned int i;
-
+  auto_vec<gimple *, 32> orig_stmts;
   tree addr = force_gimple_operand_1 (unshare_expr (base_addr), &seq,
 				      is_gimple_mem_ref_addr, NULL_TREE);
+
+  tree load_addr[2] = { NULL_TREE, NULL_TREE };
+  gimple_seq load_seq[2] = { NULL, NULL };
+  gimple_stmt_iterator load_gsi[2] = { gsi_none (), gsi_none () };
+  for (int j = 0; j < 2; ++j)
+    {
+      store_operand_info &op = group->stores[0]->ops[j];
+      if (op.base_addr == NULL_TREE)
+	continue;
+
+      store_immediate_info *infol = group->stores.last ();
+      if (gimple_vuse (op.stmt) == gimple_vuse (infol->ops[j].stmt))
+	{
+	  load_gsi[j] = gsi_for_stmt (op.stmt);
+	  load_addr[j]
+	    = force_gimple_operand_1 (unshare_expr (op.base_addr),
+				      &load_seq[j], is_gimple_mem_ref_addr,
+				      NULL_TREE);
+	}
+      else if (operand_equal_p (base_addr, op.base_addr, 0))
+	load_addr[j] = addr;
+      else
+	load_addr[j]
+	  = force_gimple_operand_1 (unshare_expr (op.base_addr),
+				    &seq, is_gimple_mem_ref_addr,
+				    NULL_TREE);
+    }
+
   FOR_EACH_VEC_ELT (split_stores, i, split_store)
     {
       unsigned HOST_WIDE_INT try_size = split_store->size;
@@ -1337,27 +1659,144 @@ imm_store_chain_info::output_merged_stor
 	{
 	  /* If there is just a single constituent store which covers
 	     the whole area, just reuse the lhs and rhs.  */
-	  dest = gimple_assign_lhs (split_store->orig_stmts[0]);
-	  src = gimple_assign_rhs1 (split_store->orig_stmts[0]);
-	  loc = gimple_location (split_store->orig_stmts[0]);
+	  gimple *orig_stmt = split_store->orig_stores[0]->stmt;
+	  dest = gimple_assign_lhs (orig_stmt);
+	  src = gimple_assign_rhs1 (orig_stmt);
+	  loc = gimple_location (orig_stmt);
 	}
       else
 	{
+	  store_immediate_info *info;
+	  unsigned short clique, base;
+	  unsigned int k;
+	  FOR_EACH_VEC_ELT (split_store->orig_stores, k, info)
+	    orig_stmts.safe_push (info->stmt);
 	  tree offset_type
-	    = get_alias_type_for_stmts (split_store->orig_stmts);
-	  loc = get_location_for_stmts (split_store->orig_stmts);
+	    = get_alias_type_for_stmts (orig_stmts, false, &clique, &base);
+	  loc = get_location_for_stmts (orig_stmts);
+	  orig_stmts.truncate (0);
 
 	  tree int_type = build_nonstandard_integer_type (try_size, UNSIGNED);
 	  int_type = build_aligned_type (int_type, align);
 	  dest = fold_build2 (MEM_REF, int_type, addr,
 			      build_int_cst (offset_type, try_pos));
-	  src = native_interpret_expr (int_type,
-				       group->val + try_pos - start_byte_pos,
-				       group->buf_size);
+	  if (TREE_CODE (dest) == MEM_REF)
+	    {
+	      MR_DEPENDENCE_CLIQUE (dest) = clique;
+	      MR_DEPENDENCE_BASE (dest) = base;
+	    }
+
 	  tree mask
 	    = native_interpret_expr (int_type,
 				     group->mask + try_pos - start_byte_pos,
 				     group->buf_size);
+
+	  tree ops[2];
+	  for (int j = 0;
+	       j < 1 + (split_store->orig_stores[0]->ops[1].val != NULL_TREE);
+	       ++j)
+	    {
+	      store_operand_info &op = split_store->orig_stores[0]->ops[j];
+	      if (op.base_addr)
+		{
+		  FOR_EACH_VEC_ELT (split_store->orig_stores, k, info)
+		    orig_stmts.safe_push (info->ops[j].stmt);
+
+		  offset_type = get_alias_type_for_stmts (orig_stmts, true,
+							  &clique, &base);
+		  location_t load_loc = get_location_for_stmts (orig_stmts);
+		  orig_stmts.truncate (0);
+
+		  unsigned HOST_WIDE_INT load_align = group->load_align[j];
+		  unsigned HOST_WIDE_INT align_bitpos
+		    = (try_pos * BITS_PER_UNIT
+		       - split_store->orig_stores[0]->bitpos
+		       + op.bitpos) & (load_align - 1);
+		  if (align_bitpos)
+		    load_align = least_bit_hwi (align_bitpos);
+
+		  tree load_int_type
+		    = build_nonstandard_integer_type (try_size, UNSIGNED);
+		  load_int_type
+		    = build_aligned_type (load_int_type, load_align);
+
+		  unsigned HOST_WIDE_INT load_pos
+		    = (try_pos * BITS_PER_UNIT
+		       - split_store->orig_stores[0]->bitpos
+		       + op.bitpos) / BITS_PER_UNIT;
+		  ops[j] = fold_build2 (MEM_REF, load_int_type, load_addr[j],
+					build_int_cst (offset_type, load_pos));
+		  if (TREE_CODE (ops[j]) == MEM_REF)
+		    {
+		      MR_DEPENDENCE_CLIQUE (ops[j]) = clique;
+		      MR_DEPENDENCE_BASE (ops[j]) = base;
+		    }
+		  if (!integer_zerop (mask))
+		    /* The load might load some bits (that will be masked off
+		       later on) uninitialized, avoid -W*uninitialized
+		       warnings in that case.  */
+		    TREE_NO_WARNING (ops[j]) = 1;
+
+		  stmt = gimple_build_assign (make_ssa_name (int_type),
+					      ops[j]);
+		  gimple_set_location (stmt, load_loc);
+		  if (gsi_bb (load_gsi[j]))
+		    {
+		      gimple_set_vuse (stmt, gimple_vuse (op.stmt));
+		      gimple_seq_add_stmt_without_update (&load_seq[j], stmt);
+		    }
+		  else
+		    {
+		      gimple_set_vuse (stmt, new_vuse);
+		      gimple_seq_add_stmt_without_update (&seq, stmt);
+		    }
+		  ops[j] = gimple_assign_lhs (stmt);
+		}
+	      else
+		ops[j] = native_interpret_expr (int_type,
+						group->val + try_pos
+						- start_byte_pos,
+						group->buf_size);
+	    }
+
+	  switch (split_store->orig_stores[0]->rhs_code)
+	    {
+	    case BIT_AND_EXPR:
+	    case BIT_IOR_EXPR:
+	    case BIT_XOR_EXPR:
+	      FOR_EACH_VEC_ELT (split_store->orig_stores, k, info)
+		{
+		  tree rhs1 = gimple_assign_rhs1 (info->stmt);
+		  orig_stmts.safe_push (SSA_NAME_DEF_STMT (rhs1));
+		}
+	      location_t bit_loc;
+	      bit_loc = get_location_for_stmts (orig_stmts);
+	      orig_stmts.truncate (0);
+
+	      stmt
+		= gimple_build_assign (make_ssa_name (int_type),
+				       split_store->orig_stores[0]->rhs_code,
+				       ops[0], ops[1]);
+	      gimple_set_location (stmt, bit_loc);
+	      /* If there is just one load and there is a separate
+		 load_seq[0], emit the bitwise op right after it.  */
+	      if (load_addr[1] == NULL_TREE && gsi_bb (load_gsi[0]))
+		gimple_seq_add_stmt_without_update (&load_seq[0], stmt);
+	      /* Otherwise, if at least one load is in seq, we need to
+		 emit the bitwise op right before the store.  If there
+		 are two loads and they are emitted somewhere else, it would
+		 be better to emit the bitwise op as early as possible;
+		 we don't track where that would be possible right now
+		 though.  */
+	      else
+		gimple_seq_add_stmt_without_update (&seq, stmt);
+	      src = gimple_assign_lhs (stmt);
+	      break;
+	    default:
+	      src = ops[0];
+	      break;
+	    }
+
 	  if (!integer_zerop (mask))
 	    {
 	      tree tem = make_ssa_name (int_type);
@@ -1382,9 +1821,21 @@ imm_store_chain_info::output_merged_stor
 	      gimple_seq_add_stmt_without_update (&seq, stmt);
 	      tem = gimple_assign_lhs (stmt);
 
-	      src = wide_int_to_tree (int_type,
-				      wi::bit_and_not (wi::to_wide (src),
-						       wi::to_wide (mask)));
+	      if (TREE_CODE (src) == INTEGER_CST)
+		src = wide_int_to_tree (int_type,
+					wi::bit_and_not (wi::to_wide (src),
+							 wi::to_wide (mask)));
+	      else
+		{
+		  tree nmask
+		    = wide_int_to_tree (int_type,
+					wi::bit_not (wi::to_wide (mask)));
+		  stmt = gimple_build_assign (make_ssa_name (int_type),
+					      BIT_AND_EXPR, src, nmask);
+		  gimple_set_location (stmt, loc);
+		  gimple_seq_add_stmt_without_update (&seq, stmt);
+		  src = gimple_assign_lhs (stmt);
+		}
 	      stmt = gimple_build_assign (make_ssa_name (int_type),
 					  BIT_IOR_EXPR, tem, src);
 	      gimple_set_location (stmt, loc);
@@ -1422,6 +1873,9 @@ imm_store_chain_info::output_merged_stor
 	print_gimple_seq (dump_file, seq, 0, TDF_VOPS | TDF_MEMSYMS);
     }
   gsi_insert_seq_after (&last_gsi, seq, GSI_SAME_STMT);
+  for (int j = 0; j < 2; ++j)
+    if (load_seq[j])
+      gsi_insert_seq_after (&load_gsi[j], load_seq[j], GSI_SAME_STMT);
 
   return true;
 }
@@ -1520,10 +1974,290 @@ rhs_valid_for_store_merging_p (tree rhs)
 			     GET_MODE_SIZE (TYPE_MODE (TREE_TYPE (rhs)))) != 0;
 }
 
+/* If MEM is a memory reference usable for store merging (either as
+   store destination or for loads), return the non-NULL base_addr
+   and set *PBITSIZE, *PBITPOS, *PBITREGION_START and *PBITREGION_END.
+   Otherwise return NULL, but *PBITSIZE is still valid even in that
+   case.  */
+
+static tree
+mem_valid_for_store_merging (tree mem, unsigned HOST_WIDE_INT *pbitsize,
+			     unsigned HOST_WIDE_INT *pbitpos,
+			     unsigned HOST_WIDE_INT *pbitregion_start,
+			     unsigned HOST_WIDE_INT *pbitregion_end)
+{
+  HOST_WIDE_INT bitsize;
+  HOST_WIDE_INT bitpos;
+  unsigned HOST_WIDE_INT bitregion_start = 0;
+  unsigned HOST_WIDE_INT bitregion_end = 0;
+  machine_mode mode;
+  int unsignedp = 0, reversep = 0, volatilep = 0;
+  tree offset;
+  tree base_addr = get_inner_reference (mem, &bitsize, &bitpos, &offset, &mode,
+					&unsignedp, &reversep, &volatilep);
+  *pbitsize = bitsize;
+  if (bitsize == 0)
+    return NULL_TREE;
+
+  if (TREE_CODE (mem) == COMPONENT_REF
+      && DECL_BIT_FIELD_TYPE (TREE_OPERAND (mem, 1)))
+    {
+      get_bit_range (&bitregion_start, &bitregion_end, mem, &bitpos, &offset);
+      if (bitregion_end)
+	++bitregion_end;
+    }
+
+  if (reversep)
+    return NULL_TREE;
+
+  /* We do not want to rewrite TARGET_MEM_REFs.  */
+  if (TREE_CODE (base_addr) == TARGET_MEM_REF)
+    return NULL_TREE;
+  /* In some cases get_inner_reference may return a
+     MEM_REF [ptr + byteoffset].  For the purposes of this pass
+     canonicalize the base_addr to MEM_REF [ptr] and take
+     byteoffset into account in the bitpos.  This occurs in
+     PR 23684 and this way we can catch more chains.  */
+  else if (TREE_CODE (base_addr) == MEM_REF)
+    {
+      offset_int bit_off, byte_off = mem_ref_offset (base_addr);
+      bit_off = byte_off << LOG2_BITS_PER_UNIT;
+      bit_off += bitpos;
+      if (!wi::neg_p (bit_off) && wi::fits_shwi_p (bit_off))
+	{
+	  bitpos = bit_off.to_shwi ();
+	  if (bitregion_end)
+	    {
+	      bit_off = byte_off << LOG2_BITS_PER_UNIT;
+	      bit_off += bitregion_start;
+	      if (wi::fits_uhwi_p (bit_off))
+		{
+		  bitregion_start = bit_off.to_uhwi ();
+		  bit_off = byte_off << LOG2_BITS_PER_UNIT;
+		  bit_off += bitregion_end;
+		  if (wi::fits_uhwi_p (bit_off))
+		    bitregion_end = bit_off.to_uhwi ();
+		  else
+		    bitregion_end = 0;
+		}
+	      else
+		bitregion_end = 0;
+	    }
+	}
+      else
+	return NULL_TREE;
+      base_addr = TREE_OPERAND (base_addr, 0);
+    }
+  /* get_inner_reference returns the base object, get at its
+     address now.  */
+  else
+    {
+      if (bitpos < 0)
+	return NULL_TREE;
+      base_addr = build_fold_addr_expr (base_addr);
+    }
+
+  if (!bitregion_end)
+    {
+      bitregion_start = ROUND_DOWN (bitpos, BITS_PER_UNIT);
+      bitregion_end = ROUND_UP (bitpos + bitsize, BITS_PER_UNIT);
+    }
+
+  if (offset != NULL_TREE)
+    {
+      /* If the access is variable offset then a base decl has to be
+	 address-taken to be able to emit pointer-based stores to it.
+	 ???  We might be able to get away with re-using the original
+	 base up to the first variable part and then wrapping that inside
+	 a BIT_FIELD_REF.  */
+      tree base = get_base_address (base_addr);
+      if (! base
+	  || (DECL_P (base) && ! TREE_ADDRESSABLE (base)))
+	return NULL_TREE;
+
+      base_addr = build2 (POINTER_PLUS_EXPR, TREE_TYPE (base_addr),
+			  base_addr, offset);
+    }
+
+  *pbitsize = bitsize;
+  *pbitpos = bitpos;
+  *pbitregion_start = bitregion_start;
+  *pbitregion_end = bitregion_end;
+  return base_addr;
+}
+
+/* Return true if STMT is a load that can be used for store merging.
+   In that case fill in *OP.  BITSIZE, BITPOS, BITREGION_START and
+   BITREGION_END are properties of the corresponding store.  */
+
+static bool
+handled_load (gimple *stmt, store_operand_info *op,
+	      unsigned HOST_WIDE_INT bitsize, unsigned HOST_WIDE_INT bitpos,
+	      unsigned HOST_WIDE_INT bitregion_start,
+	      unsigned HOST_WIDE_INT bitregion_end)
+{
+  if (!is_gimple_assign (stmt) || !gimple_vuse (stmt))
+    return false;
+  if (gimple_assign_load_p (stmt)
+      && !stmt_can_throw_internal (stmt)
+      && !gimple_has_volatile_ops (stmt))
+    {
+      tree mem = gimple_assign_rhs1 (stmt);
+      op->base_addr
+	= mem_valid_for_store_merging (mem, &op->bitsize, &op->bitpos,
+				       &op->bitregion_start,
+				       &op->bitregion_end);
+      if (op->base_addr != NULL_TREE
+	  && op->bitsize == bitsize
+	  && ((op->bitpos - bitpos) % BITS_PER_UNIT) == 0
+	  && op->bitpos - op->bitregion_start >= bitpos - bitregion_start
+	  && op->bitregion_end - op->bitpos >= bitregion_end - bitpos)
+	{
+	  op->stmt = stmt;
+	  op->val = mem;
+	  return true;
+	}
+    }
+  return false;
+}
+
+/* Record the store STMT for store merging optimization if it can be
+   optimized.  */
+
+void
+pass_store_merging::process_store (gimple *stmt)
+{
+  tree lhs = gimple_assign_lhs (stmt);
+  tree rhs = gimple_assign_rhs1 (stmt);
+  unsigned HOST_WIDE_INT bitsize, bitpos;
+  unsigned HOST_WIDE_INT bitregion_start;
+  unsigned HOST_WIDE_INT bitregion_end;
+  tree base_addr
+    = mem_valid_for_store_merging (lhs, &bitsize, &bitpos,
+				   &bitregion_start, &bitregion_end);
+  if (bitsize == 0)
+    return;
+
+  bool invalid = (base_addr == NULL_TREE
+		  || ((bitsize > MAX_BITSIZE_MODE_ANY_INT)
+		       && (TREE_CODE (rhs) != INTEGER_CST)));
+  enum tree_code rhs_code = ERROR_MARK;
+  store_operand_info ops[2];
+  if (invalid)
+    ;
+  else if (rhs_valid_for_store_merging_p (rhs))
+    {
+      rhs_code = INTEGER_CST;
+      ops[0].val = rhs;
+    }
+  else if (TREE_CODE (rhs) != SSA_NAME || !has_single_use (rhs))
+    invalid = true;
+  else
+    {
+      gimple *def_stmt = SSA_NAME_DEF_STMT (rhs), *def_stmt1, *def_stmt2;
+      if (!is_gimple_assign (def_stmt))
+	invalid = true;
+      else if (handled_load (def_stmt, &ops[0], bitsize, bitpos,
+			     bitregion_start, bitregion_end))
+	rhs_code = MEM_REF;
+      else
+	switch ((rhs_code = gimple_assign_rhs_code (def_stmt)))
+	  {
+	  case BIT_AND_EXPR:
+	  case BIT_IOR_EXPR:
+	  case BIT_XOR_EXPR:
+	    tree rhs1, rhs2;
+	    rhs1 = gimple_assign_rhs1 (def_stmt);
+	    rhs2 = gimple_assign_rhs2 (def_stmt);
+	    invalid = true;
+	    if (TREE_CODE (rhs1) != SSA_NAME || !has_single_use (rhs1))
+	      break;
+	    def_stmt1 = SSA_NAME_DEF_STMT (rhs1);
+	    if (!is_gimple_assign (def_stmt1)
+		|| !handled_load (def_stmt1, &ops[0], bitsize, bitpos,
+				  bitregion_start, bitregion_end))
+	      break;
+	    if (rhs_valid_for_store_merging_p (rhs2))
+	      ops[1].val = rhs2;
+	    else if (TREE_CODE (rhs2) != SSA_NAME || !has_single_use (rhs2))
+	      break;
+	    else
+	      {
+		def_stmt2 = SSA_NAME_DEF_STMT (rhs2);
+		if (!is_gimple_assign (def_stmt2))
+		  break;
+		else if (!handled_load (def_stmt2, &ops[1], bitsize, bitpos,
+					bitregion_start, bitregion_end))
+		  break;
+	      }
+	    invalid = false;
+	    break;
+	  default:
+	    invalid = true;
+	    break;
+	  }
+    }
+
+  struct imm_store_chain_info **chain_info = NULL;
+  if (base_addr)
+    chain_info = m_stores.get (base_addr);
+
+  if (invalid)
+    {
+      terminate_all_aliasing_chains (chain_info, stmt);
+      return;
+    }
+
+  store_immediate_info *info;
+  if (chain_info)
+    {
+      unsigned int ord = (*chain_info)->m_store_info.length ();
+      info = new store_immediate_info (bitsize, bitpos, bitregion_start,
+				       bitregion_end, stmt, ord, rhs_code,
+				       ops[0], ops[1]);
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	{
+	  fprintf (dump_file, "Recording immediate store from stmt:\n");
+	  print_gimple_stmt (dump_file, stmt, 0);
+	}
+      (*chain_info)->m_store_info.safe_push (info);
+      /* If we reach the limit of stores to merge in a chain terminate and
+	 process the chain now.  */
+      if ((*chain_info)->m_store_info.length ()
+	  == (unsigned int) PARAM_VALUE (PARAM_MAX_STORES_TO_MERGE))
+	{
+	  if (dump_file && (dump_flags & TDF_DETAILS))
+	    fprintf (dump_file,
+		     "Reached maximum number of statements to merge:\n");
+	  terminate_and_release_chain (*chain_info);
+	}
+      return;
+    }
+
+  /* Store aliases any existing chain?  */
+  terminate_all_aliasing_chains (chain_info, stmt);
+  /* Start a new chain.  */
+  struct imm_store_chain_info *new_chain
+    = new imm_store_chain_info (m_stores_head, base_addr);
+  info = new store_immediate_info (bitsize, bitpos, bitregion_start,
+				   bitregion_end, stmt, 0, rhs_code,
+				   ops[0], ops[1]);
+  new_chain->m_store_info.safe_push (info);
+  m_stores.put (base_addr, new_chain);
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    {
+      fprintf (dump_file, "Starting new chain with statement:\n");
+      print_gimple_stmt (dump_file, stmt, 0);
+      fprintf (dump_file, "The base object is:\n");
+      print_generic_expr (dump_file, base_addr);
+      fprintf (dump_file, "\n");
+    }
+}
+
 /* Entry point for the pass.  Go over each basic block recording chains of
-  immediate stores.  Upon encountering a terminating statement (as defined
-  by stmt_terminates_chain_p) process the recorded stores and emit the widened
-  variants.  */
+   immediate stores.  Upon encountering a terminating statement (as defined
+   by stmt_terminates_chain_p) process the recorded stores and emit the widened
+   variants.  */
 
 unsigned int
 pass_store_merging::execute (function *fun)
@@ -1573,175 +2307,9 @@ pass_store_merging::execute (function *f
 	  if (gimple_assign_single_p (stmt) && gimple_vdef (stmt)
 	      && !stmt_can_throw_internal (stmt)
 	      && lhs_valid_for_store_merging_p (gimple_assign_lhs (stmt)))
-	    {
-	      tree lhs = gimple_assign_lhs (stmt);
-	      tree rhs = gimple_assign_rhs1 (stmt);
-
-	      HOST_WIDE_INT bitsize, bitpos;
-	      unsigned HOST_WIDE_INT bitregion_start = 0;
-	      unsigned HOST_WIDE_INT bitregion_end = 0;
-	      machine_mode mode;
-	      int unsignedp = 0, reversep = 0, volatilep = 0;
-	      tree offset, base_addr;
-	      base_addr
-		= get_inner_reference (lhs, &bitsize, &bitpos, &offset, &mode,
-				       &unsignedp, &reversep, &volatilep);
-	      if (TREE_CODE (lhs) == COMPONENT_REF
-		  && DECL_BIT_FIELD_TYPE (TREE_OPERAND (lhs, 1)))
-		{
-		  get_bit_range (&bitregion_start, &bitregion_end, lhs,
-				 &bitpos, &offset);
-		  if (bitregion_end)
-		    ++bitregion_end;
-		}
-	      if (bitsize == 0)
-		continue;
-
-	      /* As a future enhancement we could handle stores with the same
-		 base and offset.  */
-	      bool invalid = reversep
-			     || ((bitsize > MAX_BITSIZE_MODE_ANY_INT)
-				  && (TREE_CODE (rhs) != INTEGER_CST))
-			     || !rhs_valid_for_store_merging_p (rhs);
-
-	      /* We do not want to rewrite TARGET_MEM_REFs.  */
-	      if (TREE_CODE (base_addr) == TARGET_MEM_REF)
-		invalid = true;
-	      /* In some cases get_inner_reference may return a
-		 MEM_REF [ptr + byteoffset].  For the purposes of this pass
-		 canonicalize the base_addr to MEM_REF [ptr] and take
-		 byteoffset into account in the bitpos.  This occurs in
-		 PR 23684 and this way we can catch more chains.  */
-	      else if (TREE_CODE (base_addr) == MEM_REF)
-		{
-		  offset_int bit_off, byte_off = mem_ref_offset (base_addr);
-		  bit_off = byte_off << LOG2_BITS_PER_UNIT;
-		  bit_off += bitpos;
-		  if (!wi::neg_p (bit_off) && wi::fits_shwi_p (bit_off))
-		    {
-		      bitpos = bit_off.to_shwi ();
-		      if (bitregion_end)
-			{
-			  bit_off = byte_off << LOG2_BITS_PER_UNIT;
-			  bit_off += bitregion_start;
-			  if (wi::fits_uhwi_p (bit_off))
-			    {
-			      bitregion_start = bit_off.to_uhwi ();
-			      bit_off = byte_off << LOG2_BITS_PER_UNIT;
-			      bit_off += bitregion_end;
-			      if (wi::fits_uhwi_p (bit_off))
-				bitregion_end = bit_off.to_uhwi ();
-			      else
-				bitregion_end = 0;
-			    }
-			  else
-			    bitregion_end = 0;
-			}
-		    }
-		  else
-		    invalid = true;
-		  base_addr = TREE_OPERAND (base_addr, 0);
-		}
-	      /* get_inner_reference returns the base object, get at its
-	         address now.  */
-	      else
-		{
-		  if (bitpos < 0)
-		    invalid = true;
-		  base_addr = build_fold_addr_expr (base_addr);
-		}
-
-	      if (!bitregion_end)
-		{
-		  bitregion_start = ROUND_DOWN (bitpos, BITS_PER_UNIT);
-		  bitregion_end = ROUND_UP (bitpos + bitsize, BITS_PER_UNIT);
-		}
-
-	      if (! invalid
-		  && offset != NULL_TREE)
-		{
-		  /* If the access is variable offset then a base
-		     decl has to be address-taken to be able to
-		     emit pointer-based stores to it.
-		     ???  We might be able to get away with
-		     re-using the original base up to the first
-		     variable part and then wrapping that inside
-		     a BIT_FIELD_REF.  */
-		  tree base = get_base_address (base_addr);
-		  if (! base
-		      || (DECL_P (base)
-			  && ! TREE_ADDRESSABLE (base)))
-		    invalid = true;
-		  else
-		    base_addr = build2 (POINTER_PLUS_EXPR,
-					TREE_TYPE (base_addr),
-					base_addr, offset);
-		}
-
-	      struct imm_store_chain_info **chain_info
-		= m_stores.get (base_addr);
-
-	      if (!invalid)
-		{
-		  store_immediate_info *info;
-		  if (chain_info)
-		    {
-		      unsigned int ord = (*chain_info)->m_store_info.length ();
-		      info = new store_immediate_info (bitsize, bitpos,
-						       bitregion_start,
-						       bitregion_end,
-						       stmt, ord);
-		      if (dump_file && (dump_flags & TDF_DETAILS))
-			{
-			  fprintf (dump_file,
-				   "Recording immediate store from stmt:\n");
-			  print_gimple_stmt (dump_file, stmt, 0);
-			}
-		      (*chain_info)->m_store_info.safe_push (info);
-		      /* If we reach the limit of stores to merge in a chain
-			 terminate and process the chain now.  */
-		      if ((*chain_info)->m_store_info.length ()
-			   == (unsigned int)
-			      PARAM_VALUE (PARAM_MAX_STORES_TO_MERGE))
-			{
-			  if (dump_file && (dump_flags & TDF_DETAILS))
-			    fprintf (dump_file,
-				 "Reached maximum number of statements"
-				 " to merge:\n");
-			  terminate_and_release_chain (*chain_info);
-			}
-		      continue;
-		    }
-
-		  /* Store aliases any existing chain?  */
-		  terminate_all_aliasing_chains (chain_info, false, stmt);
-		  /* Start a new chain.  */
-		  struct imm_store_chain_info *new_chain
-		    = new imm_store_chain_info (m_stores_head, base_addr);
-		  info = new store_immediate_info (bitsize, bitpos,
-						   bitregion_start,
-						   bitregion_end,
-						   stmt, 0);
-		  new_chain->m_store_info.safe_push (info);
-		  m_stores.put (base_addr, new_chain);
-		  if (dump_file && (dump_flags & TDF_DETAILS))
-		    {
-		      fprintf (dump_file,
-			       "Starting new chain with statement:\n");
-		      print_gimple_stmt (dump_file, stmt, 0);
-		      fprintf (dump_file, "The base object is:\n");
-		      print_generic_expr (dump_file, base_addr);
-		      fprintf (dump_file, "\n");
-		    }
-		}
-	      else
-		terminate_all_aliasing_chains (chain_info,
-					       offset != NULL_TREE, stmt);
-
-	      continue;
-	    }
-
-	  terminate_all_aliasing_chains (NULL, false, stmt);
+	    process_store (stmt);
+	  else
+	    terminate_all_aliasing_chains (NULL, stmt);
 	}
       terminate_and_process_all_chains ();
     }
--- gcc/testsuite/gcc.dg/store_merging_13.c.jj	2017-11-02 08:50:03.544226508 +0100
+++ gcc/testsuite/gcc.dg/store_merging_13.c	2017-11-02 08:50:03.544226508 +0100
@@ -0,0 +1,157 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target store_merge } */
+/* { dg-options "-O2 -fdump-tree-store-merging" } */
+
+struct S { unsigned char a, b; unsigned short c; unsigned char d, e, f, g; unsigned long long h; };
+
+__attribute__((noipa)) void
+f1 (struct S *p)
+{
+  p->a = 1;
+  p->b = 2;
+  p->c = 3;
+  p->d = 4;
+  p->e = 5;
+  p->f = 6;
+  p->g = 7;
+}
+
+__attribute__((noipa)) void
+f2 (struct S *__restrict p, struct S *__restrict q)
+{
+  p->a = q->a;
+  p->b = q->b;
+  p->c = q->c;
+  p->d = q->d;
+  p->e = q->e;
+  p->f = q->f;
+  p->g = q->g;
+}
+
+__attribute__((noipa)) void
+f3 (struct S *p, struct S *q)
+{
+  unsigned char pa = q->a;
+  unsigned char pb = q->b;
+  unsigned short pc = q->c;
+  unsigned char pd = q->d;
+  unsigned char pe = q->e;
+  unsigned char pf = q->f;
+  unsigned char pg = q->g;
+  p->a = pa;
+  p->b = pb;
+  p->c = pc;
+  p->d = pd;
+  p->e = pe;
+  p->f = pf;
+  p->g = pg;
+}
+
+__attribute__((noipa)) void
+f4 (struct S *p, struct S *q)
+{
+  unsigned char pa = p->a | q->a;
+  unsigned char pb = p->b | q->b;
+  unsigned short pc = p->c | q->c;
+  unsigned char pd = p->d | q->d;
+  unsigned char pe = p->e | q->e;
+  unsigned char pf = p->f | q->f;
+  unsigned char pg = p->g | q->g;
+  p->a = pa;
+  p->b = pb;
+  p->c = pc;
+  p->d = pd;
+  p->e = pe;
+  p->f = pf;
+  p->g = pg;
+}
+
+__attribute__((noipa)) void
+f5 (struct S *p, struct S *q)
+{
+  unsigned char pa = p->a & q->a;
+  unsigned char pb = p->b & q->b;
+  unsigned short pc = p->c & q->c;
+  unsigned char pd = p->d & q->d;
+  unsigned char pe = p->e & q->e;
+  unsigned char pf = p->f & q->f;
+  unsigned char pg = p->g & q->g;
+  p->a = pa;
+  p->b = pb;
+  p->c = pc;
+  p->d = pd;
+  p->e = pe;
+  p->f = pf;
+  p->g = pg;
+}
+
+__attribute__((noipa)) void
+f6 (struct S *p, struct S *q)
+{
+  unsigned char pa = p->a ^ q->a;
+  unsigned char pb = p->b ^ q->b;
+  unsigned short pc = p->c ^ q->c;
+  unsigned char pd = p->d ^ q->d;
+  unsigned char pe = p->e ^ q->e;
+  unsigned char pf = p->f ^ q->f;
+  unsigned char pg = p->g ^ q->g;
+  p->a = pa;
+  p->b = pb;
+  p->c = pc;
+  p->d = pd;
+  p->e = pe;
+  p->f = pf;
+  p->g = pg;
+}
+
+struct S s = { 20, 21, 22, 23, 24, 25, 26, 27 };
+struct S t = { 0x71, 0x72, 0x7f04, 0x78, 0x31, 0x32, 0x34, 0xf1f2f3f4f5f6f7f8ULL };
+struct S u = { 28, 29, 30, 31, 32, 33, 34, 35 };
+struct S v = { 36, 37, 38, 39, 40, 41, 42, 43 };
+
+int
+main ()
+{
+  asm volatile ("" : : : "memory");
+  f1 (&s);
+  asm volatile ("" : : : "memory");
+  if (s.a != 1 || s.b != 2 || s.c != 3 || s.d != 4
+      || s.e != 5 || s.f != 6 || s.g != 7 || s.h != 27)
+    __builtin_abort ();
+  f2 (&s, &u);
+  asm volatile ("" : : : "memory");
+  if (s.a != 28 || s.b != 29 || s.c != 30 || s.d != 31
+      || s.e != 32 || s.f != 33 || s.g != 34 || s.h != 27)
+    __builtin_abort ();
+  f3 (&s, &v);
+  asm volatile ("" : : : "memory");
+  if (s.a != 36 || s.b != 37 || s.c != 38 || s.d != 39
+      || s.e != 40 || s.f != 41 || s.g != 42 || s.h != 27)
+    __builtin_abort ();
+  f4 (&s, &t);
+  asm volatile ("" : : : "memory");
+  if (s.a != (36 | 0x71) || s.b != (37 | 0x72)
+      || s.c != (38 | 0x7f04) || s.d != (39 | 0x78)
+      || s.e != (40 | 0x31) || s.f != (41 | 0x32)
+      || s.g != (42 | 0x34) || s.h != 27)
+    __builtin_abort ();
+  f3 (&s, &u);
+  f5 (&s, &t);
+  asm volatile ("" : : : "memory");
+  if (s.a != (28 & 0x71) || s.b != (29 & 0x72)
+      || s.c != (30 & 0x7f04) || s.d != (31 & 0x78)
+      || s.e != (32 & 0x31) || s.f != (33 & 0x32)
+      || s.g != (34 & 0x34) || s.h != 27)
+    __builtin_abort ();
+  f2 (&s, &v);
+  f6 (&s, &t);
+  asm volatile ("" : : : "memory");
+  if (s.a != (36 ^ 0x71) || s.b != (37 ^ 0x72)
+      || s.c != (38 ^ 0x7f04) || s.d != (39 ^ 0x78)
+      || s.e != (40 ^ 0x31) || s.f != (41 ^ 0x32)
+      || s.g != (42 ^ 0x34) || s.h != 27)
+    __builtin_abort ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Merging successful" 6 "store-merging" } } */
--- gcc/testsuite/gcc.dg/store_merging_14.c.jj	2017-11-02 08:50:03.544226508 +0100
+++ gcc/testsuite/gcc.dg/store_merging_14.c	2017-11-02 10:35:51.000000000 +0100
@@ -0,0 +1,157 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target store_merge } */
+/* { dg-options "-O2 -fdump-tree-store-merging" } */
+
+struct S { unsigned int i : 8, a : 7, b : 7, j : 10, c : 15, d : 7, e : 10, f : 7, g : 9, k : 16; unsigned long long h; };
+
+__attribute__((noipa)) void
+f1 (struct S *p)
+{
+  p->a = 1;
+  p->b = 2;
+  p->c = 3;
+  p->d = 4;
+  p->e = 5;
+  p->f = 6;
+  p->g = 7;
+}
+
+__attribute__((noipa)) void
+f2 (struct S *__restrict p, struct S *__restrict q)
+{
+  p->a = q->a;
+  p->b = q->b;
+  p->c = q->c;
+  p->d = q->d;
+  p->e = q->e;
+  p->f = q->f;
+  p->g = q->g;
+}
+
+__attribute__((noipa)) void
+f3 (struct S *p, struct S *q)
+{
+  unsigned char pa = q->a;
+  unsigned char pb = q->b;
+  unsigned short pc = q->c;
+  unsigned char pd = q->d;
+  unsigned short pe = q->e;
+  unsigned char pf = q->f;
+  unsigned short pg = q->g;
+  p->a = pa;
+  p->b = pb;
+  p->c = pc;
+  p->d = pd;
+  p->e = pe;
+  p->f = pf;
+  p->g = pg;
+}
+
+__attribute__((noipa)) void
+f4 (struct S *p, struct S *q)
+{
+  unsigned char pa = p->a | q->a;
+  unsigned char pb = p->b | q->b;
+  unsigned short pc = p->c | q->c;
+  unsigned char pd = p->d | q->d;
+  unsigned short pe = p->e | q->e;
+  unsigned char pf = p->f | q->f;
+  unsigned short pg = p->g | q->g;
+  p->a = pa;
+  p->b = pb;
+  p->c = pc;
+  p->d = pd;
+  p->e = pe;
+  p->f = pf;
+  p->g = pg;
+}
+
+__attribute__((noipa)) void
+f5 (struct S *p, struct S *q)
+{
+  unsigned char pa = p->a & q->a;
+  unsigned char pb = p->b & q->b;
+  unsigned short pc = p->c & q->c;
+  unsigned char pd = p->d & q->d;
+  unsigned short pe = p->e & q->e;
+  unsigned char pf = p->f & q->f;
+  unsigned short pg = p->g & q->g;
+  p->a = pa;
+  p->b = pb;
+  p->c = pc;
+  p->d = pd;
+  p->e = pe;
+  p->f = pf;
+  p->g = pg;
+}
+
+__attribute__((noipa)) void
+f6 (struct S *p, struct S *q)
+{
+  unsigned char pa = p->a ^ q->a;
+  unsigned char pb = p->b ^ q->b;
+  unsigned short pc = p->c ^ q->c;
+  unsigned char pd = p->d ^ q->d;
+  unsigned short pe = p->e ^ q->e;
+  unsigned char pf = p->f ^ q->f;
+  unsigned short pg = p->g ^ q->g;
+  p->a = pa;
+  p->b = pb;
+  p->c = pc;
+  p->d = pd;
+  p->e = pe;
+  p->f = pf;
+  p->g = pg;
+}
+
+struct S s = { 72, 20, 21, 73, 22, 23, 24, 25, 26, 74, 27 };
+struct S t = { 75, 0x71, 0x72, 76, 0x7f04, 0x78, 0x31, 0x32, 0x34, 77, 0xf1f2f3f4f5f6f7f8ULL };
+struct S u = { 78, 28, 29, 79, 30, 31, 32, 33, 34, 80, 35 };
+struct S v = { 81, 36, 37, 82, 38, 39, 40, 41, 42, 83, 43 };
+
+int
+main ()
+{
+  asm volatile ("" : : : "memory");
+  f1 (&s);
+  asm volatile ("" : : : "memory");
+  if (s.i != 72 || s.a != 1 || s.b != 2 || s.j != 73 || s.c != 3 || s.d != 4
+      || s.e != 5 || s.f != 6 || s.g != 7 || s.k != 74 || s.h != 27)
+    __builtin_abort ();
+  f2 (&s, &u);
+  asm volatile ("" : : : "memory");
+  if (s.i != 72 || s.a != 28 || s.b != 29 || s.j != 73 || s.c != 30 || s.d != 31
+      || s.e != 32 || s.f != 33 || s.g != 34 || s.k != 74 || s.h != 27)
+    __builtin_abort ();
+  f3 (&s, &v);
+  asm volatile ("" : : : "memory");
+  if (s.i != 72 || s.a != 36 || s.b != 37 || s.j != 73 || s.c != 38 || s.d != 39
+      || s.e != 40 || s.f != 41 || s.g != 42 || s.k != 74 || s.h != 27)
+    __builtin_abort ();
+  f4 (&s, &t);
+  asm volatile ("" : : : "memory");
+  if (s.i != 72 || s.a != (36 | 0x71) || s.b != (37 | 0x72) || s.j != 73
+      || s.c != (38 | 0x7f04) || s.d != (39 | 0x78)
+      || s.e != (40 | 0x31) || s.f != (41 | 0x32)
+      || s.g != (42 | 0x34) || s.k != 74 || s.h != 27)
+    __builtin_abort ();
+  f3 (&s, &u);
+  f5 (&s, &t);
+  asm volatile ("" : : : "memory");
+  if (s.i != 72 || s.a != (28 & 0x71) || s.b != (29 & 0x72) || s.j != 73
+      || s.c != (30 & 0x7f04) || s.d != (31 & 0x78)
+      || s.e != (32 & 0x31) || s.f != (33 & 0x32)
+      || s.g != (34 & 0x34) || s.k != 74 || s.h != 27)
+    __builtin_abort ();
+  f2 (&s, &v);
+  f6 (&s, &t);
+  asm volatile ("" : : : "memory");
+  if (s.i != 72 || s.a != (36 ^ 0x71) || s.b != (37 ^ 0x72) || s.j != 73
+      || s.c != (38 ^ 0x7f04) || s.d != (39 ^ 0x78)
+      || s.e != (40 ^ 0x31) || s.f != (41 ^ 0x32)
+      || s.g != (42 ^ 0x34) || s.k != 74 || s.h != 27)
+    __builtin_abort ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Merging successful" 6 "store-merging" } } */


	Jakub
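
P.S. To make the intended effect a bit more concrete, here is a hand-written
sketch (not actual compiler output) of what f4 from the new store_merging_13.c
test should boil down to after the pass on a 64-bit target, assuming the usual
layout where the fields a..g of struct S occupy its first 8 bytes without
padding: the seven narrow load/load/or/store cycles become two wide loads,
one wide BIT_IOR_EXPR and a single wide store.

#include <string.h>

/* Illustration only; struct S as in store_merging_13.c.  */
struct S { unsigned char a, b; unsigned short c;
	   unsigned char d, e, f, g; unsigned long long h; };

void
f4_sketch (struct S *p, struct S *q)
{
  unsigned long long x, y;
  memcpy (&x, p, sizeof x);	/* one 64-bit load from *p */
  memcpy (&y, q, sizeof y);	/* one 64-bit load from *q */
  x |= y;			/* one wide bitwise OR */
  memcpy (p, &x, sizeof x);	/* one 64-bit store back to *p */
}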
Richard Biener Nov. 3, 2017, 7:39 p.m. UTC | #5
On November 3, 2017 8:17:30 PM GMT+01:00, Jakub Jelinek <jakub@redhat.com> wrote:
>On Fri, Nov 03, 2017 at 03:04:18PM +0100, Jakub Jelinek wrote:
>> single-use vs. multiple uses is something I've thought about, but
>don't
>> know whether it is better to require single-use or not (or sometimes,
>> under some condition?).  Say if we have:
>
>So, here is what I've committed in the end after
>bootstrapping/regtesting
>it on x86_64-linux and i686-linux, the only changes from the earlier
>patch
>were comments and addition of has_single_use checks.
>
>In those bootstraps/regtests, the number of integer_cst stores were
>expectedly the same, and so were the number of bit_*_expr cases, but
>it apparently matters a lot for the memory copying (rhs_code MEM_REF).
>Without this patch new/orig stores:
>16943   35369
>and with the patch:
>12111   24911
>So, perhaps we'll need to do something smarter (approximate how many
>original loads would be kept and how many new loads/stores we'd need to
>add
>to get rid of how many original stores).
>Or allow multiple uses for the MEM_REF rhs_code only and for anything
>else
>require single use.

Probably interesting to look at the individual cases. But yes, it should be factored into the cost model somehow. 
It's possibly also increasing register pressure. 

Richard. 

>2017-11-03  Jakub Jelinek  <jakub@redhat.com>
>
>	PR tree-optimization/78821
>	* gimple-ssa-store-merging.c: Update the file comment.
>	(MAX_STORE_ALIAS_CHECKS): Define.
>	(struct store_operand_info): New type.
>	(store_operand_info::store_operand_info): New constructor.
>	(struct store_immediate_info): Add rhs_code and ops data members.
>	(store_immediate_info::store_immediate_info): Add rhscode, op0r
>	and op1r arguments to the ctor, initialize corresponding data members.
>	(struct merged_store_group): Add load_align_base and load_align
>	data members.
>	(merged_store_group::merged_store_group): Initialize them.
>	(merged_store_group::do_merge): Update them.
>	(merged_store_group::apply_stores): Pick the constant for
>	encode_tree_to_bitpos from one of the two operands, or skip
>	encode_tree_to_bitpos if neither operand is a constant.
>	(class pass_store_merging): Add process_store method decl.  Remove
>	bool argument from terminate_all_aliasing_chains method decl.
>	(pass_store_merging::terminate_all_aliasing_chains): Remove
>	var_offset_p argument and corresponding handling.
>	(stmts_may_clobber_ref_p): New function.
>	(compatible_load_p): New function.
>	(imm_store_chain_info::coalesce_immediate_stores): Terminate group
>	if there is overlap and rhs_code is not INTEGER_CST.  For
>	non-overlapping stores terminate group if rhs is not mergeable.
>	(get_alias_type_for_stmts): Change first argument from
>	auto_vec<gimple *> & to vec<gimple *> &.  Add IS_LOAD, CLIQUEP and
>	BASEP arguments.  If IS_LOAD is true, look at rhs1 of the stmts
>	instead of lhs.  Compute *CLIQUEP and *BASEP in addition to the
>	alias type.
>	(get_location_for_stmts): Change first argument from
>	auto_vec<gimple *> & to vec<gimple *> &.
>	(struct split_store): Remove orig_stmts data member, add orig_stores.
>	(split_store::split_store): Create orig_stores rather than orig_stmts.
>	(find_constituent_stmts): Renamed to ...
>	(find_constituent_stores): ... this.  Change second argument from
>	vec<gimple *> * to vec<store_immediate_info *> *, push pointers
>	to info structures rather than the statements.
>	(split_group): Rename ALLOW_UNALIGNED argument to
>	ALLOW_UNALIGNED_STORE, add ALLOW_UNALIGNED_LOAD argument and handle
>	it.  Adjust find_constituent_stores caller.
>	(imm_store_chain_info::output_merged_store): Handle rhs_code other
>	than INTEGER_CST, adjust split_group, get_alias_type_for_stmts and
>	get_location_for_stmts callers.  Set MR_DEPENDENCE_CLIQUE and
>	MR_DEPENDENCE_BASE on the MEM_REFs if they are the same in all stores.
>	(mem_valid_for_store_merging): New function.
>	(handled_load): New function.
>	(pass_store_merging::process_store): New method.
>	(pass_store_merging::execute): Use process_store method.  Adjust
>	terminate_all_aliasing_chains caller.
>
>	* gcc.dg/store_merging_13.c: New test.
>	* gcc.dg/store_merging_14.c: New test.
>
>--- gcc/gimple-ssa-store-merging.c.jj	2017-11-03 15:37:02.869561500
>+0100
>+++ gcc/gimple-ssa-store-merging.c	2017-11-03 16:15:15.059282459 +0100
>@@ -19,7 +19,8 @@
>    <http://www.gnu.org/licenses/>.  */
> 
> /* The purpose of this pass is to combine multiple memory stores of
>-   constant values to consecutive memory locations into fewer wider
>stores.
>+   constant values, values loaded from memory or bitwise operations
>+   on those to consecutive memory locations into fewer wider stores.
>    For example, if we have a sequence peforming four byte stores to
>    consecutive memory locations:
>    [p     ] := imm1;
>@@ -29,21 +30,49 @@
>we can transform this into a single 4-byte store if the target supports
>it:
>[p] := imm1:imm2:imm3:imm4 //concatenated immediates according to
>endianness.
> 
>+   Or:
>+   [p     ] := [q     ];
>+   [p + 1B] := [q + 1B];
>+   [p + 2B] := [q + 2B];
>+   [p + 3B] := [q + 3B];
>+   if there is no overlap can be transformed into a single 4-byte
>+   load followed by single 4-byte store.
>+
>+   Or:
>+   [p     ] := [q     ] ^ imm1;
>+   [p + 1B] := [q + 1B] ^ imm2;
>+   [p + 2B] := [q + 2B] ^ imm3;
>+   [p + 3B] := [q + 3B] ^ imm4;
>+   if there is no overlap can be transformed into a single 4-byte
>+   load, xored with imm1:imm2:imm3:imm4 and stored using a single
>4-byte store.
>+
>    The algorithm is applied to each basic block in three phases:
> 
>-   1) Scan through the basic block recording constant assignments to
>+   1) Scan through the basic block recording assignments to
>destinations that can be expressed as a store to memory of a certain
>size
>-   at a certain bit offset.  Record store chains to different bases in
>a
>-   hash_map (m_stores) and make sure to terminate such chains when
>appropriate
>-   (for example when when the stored values get used subsequently).
>+   at a certain bit offset from expressions we can handle.  For
>bit-fields
>+   we also note the surrounding bit region, bits that could be stored
>in
>+   a read-modify-write operation when storing the bit-field.  Record
>store
>+   chains to different bases in a hash_map (m_stores) and make sure to
>+   terminate such chains when appropriate (for example when when the
>stored
>+   values get used subsequently).
>These stores can be a result of structure element initializers, array
>stores
>  etc.  A store_immediate_info object is recorded for every such store.
>   Record as many such assignments to a single base as possible until a
>    statement that interferes with the store sequence is encountered.
>+   Each store has up to 2 operands, which can be an immediate constant
>+   or a memory load, from which the value to be stored can be
>computed.
>+   At most one of the operands can be a constant.  The operands are
>recorded
>+   in store_operand_info struct.
> 
>2) Analyze the chain of stores recorded in phase 1) (i.e. the vector of
>    store_immediate_info objects) and coalesce contiguous stores into
>-   merged_store_group objects.
>+   merged_store_group objects.  For bit-fields stores, we don't need
>to
>+   require the stores to be contiguous, just their surrounding bit
>regions
>+   have to be contiguous.  If the expression being stored is different
>+   between adjacent stores, such as one store storing a constant and
>+   following storing a value loaded from memory, or if the loaded
>memory
>+   objects are not adjacent, a new merged_store_group is created as
>well.
> 
>    For example, given the stores:
>    [p     ] := 0;
>@@ -134,8 +163,35 @@
> #define MAX_STORE_BITSIZE (BITS_PER_WORD)
> #define MAX_STORE_BYTES (MAX_STORE_BITSIZE / BITS_PER_UNIT)
> 
>+/* Limit to bound the number of aliasing checks for loads with the
>same
>+   vuse as the corresponding store.  */
>+#define MAX_STORE_ALIAS_CHECKS 64
>+
> namespace {
> 
>+/* Struct recording one operand for the store, which is either a
>constant,
>+   then VAL represents the constant and all the other fields are zero,
>+   or a memory load, then VAL represents the reference, BASE_ADDR is
>non-NULL
>+   and the other fields also reflect the memory load.  */
>+
>+struct store_operand_info
>+{
>+  tree val;
>+  tree base_addr;
>+  unsigned HOST_WIDE_INT bitsize;
>+  unsigned HOST_WIDE_INT bitpos;
>+  unsigned HOST_WIDE_INT bitregion_start;
>+  unsigned HOST_WIDE_INT bitregion_end;
>+  gimple *stmt;
>+  store_operand_info ();
>+};
>+
>+store_operand_info::store_operand_info ()
>+  : val (NULL_TREE), base_addr (NULL_TREE), bitsize (0), bitpos (0),
>+    bitregion_start (0), bitregion_end (0), stmt (NULL)
>+{
>+}
>+
>/* Struct recording the information about a single store of an
>immediate
>    to memory.  These are created in the first phase and coalesced into
>    merged_store_group objects in the second phase.  */
>@@ -149,9 +205,17 @@ struct store_immediate_info
>   unsigned HOST_WIDE_INT bitregion_end;
>   gimple *stmt;
>   unsigned int order;
>+  /* INTEGER_CST for constant stores, MEM_REF for memory copy or
>+     BIT_*_EXPR for logical bitwise operation.  */
>+  enum tree_code rhs_code;
>+  /* Operands.  For BIT_*_EXPR rhs_code both operands are used,
>otherwise
>+     just the first one.  */
>+  store_operand_info ops[2];
>  store_immediate_info (unsigned HOST_WIDE_INT, unsigned HOST_WIDE_INT,
> 			unsigned HOST_WIDE_INT, unsigned HOST_WIDE_INT,
>-			gimple *, unsigned int);
>+			gimple *, unsigned int, enum tree_code,
>+			const store_operand_info &,
>+			const store_operand_info &);
> };
> 
> store_immediate_info::store_immediate_info (unsigned HOST_WIDE_INT bs,
>@@ -159,11 +223,22 @@ store_immediate_info::store_immediate_in
> 					    unsigned HOST_WIDE_INT brs,
> 					    unsigned HOST_WIDE_INT bre,
> 					    gimple *st,
>-					    unsigned int ord)
>+					    unsigned int ord,
>+					    enum tree_code rhscode,
>+					    const store_operand_info &op0r,
>+					    const store_operand_info &op1r)
>: bitsize (bs), bitpos (bp), bitregion_start (brs), bitregion_end
>(bre),
>-    stmt (st), order (ord)
>+    stmt (st), order (ord), rhs_code (rhscode)
>+#if __cplusplus >= 201103L
>+    , ops { op0r, op1r }
>+{
>+}
>+#else
> {
>+  ops[0] = op0r;
>+  ops[1] = op1r;
> }
>+#endif
> 
>/* Struct representing a group of stores to contiguous memory
>locations.
>These are produced by the second phase (coalescing) and consumed in the
>@@ -178,8 +253,10 @@ struct merged_store_group
>   /* The size of the allocated memory for val and mask.  */
>   unsigned HOST_WIDE_INT buf_size;
>   unsigned HOST_WIDE_INT align_base;
>+  unsigned HOST_WIDE_INT load_align_base[2];
> 
>   unsigned int align;
>+  unsigned int load_align[2];
>   unsigned int first_order;
>   unsigned int last_order;
> 
>@@ -576,6 +653,20 @@ merged_store_group::merged_store_group (
>   get_object_alignment_1 (gimple_assign_lhs (info->stmt),
> 			  &align, &align_bitpos);
>   align_base = start - align_bitpos;
>+  for (int i = 0; i < 2; ++i)
>+    {
>+      store_operand_info &op = info->ops[i];
>+      if (op.base_addr == NULL_TREE)
>+	{
>+	  load_align[i] = 0;
>+	  load_align_base[i] = 0;
>+	}
>+      else
>+	{
>+	  get_object_alignment_1 (op.val, &load_align[i], &align_bitpos);
>+	  load_align_base[i] = op.bitpos - align_bitpos;
>+	}
>+    }
>   stores.create (1);
>   stores.safe_push (info);
>   last_stmt = info->stmt;
>@@ -608,6 +699,19 @@ merged_store_group::do_merge (store_imme
>       align = this_align;
>       align_base = info->bitpos - align_bitpos;
>     }
>+  for (int i = 0; i < 2; ++i)
>+    {
>+      store_operand_info &op = info->ops[i];
>+      if (!op.base_addr)
>+	continue;
>+
>+      get_object_alignment_1 (op.val, &this_align, &align_bitpos);
>+      if (this_align > load_align[i])
>+	{
>+	  load_align[i] = this_align;
>+	  load_align_base[i] = op.bitpos - align_bitpos;
>+	}
>+    }
> 
>   gimple *stmt = info->stmt;
>   stores.safe_push (info);
>@@ -682,16 +786,21 @@ merged_store_group::apply_stores ()
>   FOR_EACH_VEC_ELT (stores, i, info)
>     {
>       unsigned int pos_in_buffer = info->bitpos - bitregion_start;
>-      bool ret = encode_tree_to_bitpos (gimple_assign_rhs1
>(info->stmt),
>-					val, info->bitsize,
>-					pos_in_buffer, buf_size);
>-      if (dump_file && (dump_flags & TDF_DETAILS))
>+      tree cst = NULL_TREE;
>+      if (info->ops[0].val && info->ops[0].base_addr == NULL_TREE)
>+	cst = info->ops[0].val;
>+      else if (info->ops[1].val && info->ops[1].base_addr ==
>NULL_TREE)
>+	cst = info->ops[1].val;
>+      bool ret = true;
>+      if (cst)
>+	ret = encode_tree_to_bitpos (cst, val, info->bitsize,
>+				     pos_in_buffer, buf_size);
>+      if (cst && dump_file && (dump_flags & TDF_DETAILS))
> 	{
> 	  if (ret)
> 	    {
> 	      fprintf (dump_file, "After writing ");
>-	      print_generic_expr (dump_file,
>-				  gimple_assign_rhs1 (info->stmt), 0);
>+	      print_generic_expr (dump_file, cst, 0);
> 	      fprintf (dump_file, " of size " HOST_WIDE_INT_PRINT_DEC
> 			" at position %d the merged region contains:\n",
> 			info->bitsize, pos_in_buffer);
>@@ -799,9 +908,10 @@ private:
>      decisions when going out of SSA).  */
>   imm_store_chain_info *m_stores_head;
> 
>+  void process_store (gimple *);
>   bool terminate_and_process_all_chains ();
>   bool terminate_all_aliasing_chains (imm_store_chain_info **,
>-				      bool, gimple *);
>+				      gimple *);
>   bool terminate_and_release_chain (imm_store_chain_info *);
> }; // class pass_store_merging
> 
>@@ -831,7 +941,6 @@ pass_store_merging::terminate_and_proces
> bool
>pass_store_merging::terminate_all_aliasing_chains (imm_store_chain_info
> 						     **chain_info,
>-						   bool var_offset_p,
> 						   gimple *stmt)
> {
>   bool ret = false;
>@@ -845,37 +954,21 @@ pass_store_merging::terminate_all_aliasi
>      of a chain.  */
>   if (chain_info)
>     {
>-      /* We have a chain at BASE and we're writing to [BASE +
><variable>].
>-	 This can interfere with any of the stores so terminate
>-	 the chain.  */
>-      if (var_offset_p)
>-	{
>-	  terminate_and_release_chain (*chain_info);
>-	  ret = true;
>-	}
>-      /* Otherwise go through every store in the chain to see if it
>-	 aliases with any of them.  */
>-      else
>+      store_immediate_info *info;
>+      unsigned int i;
>+      FOR_EACH_VEC_ELT ((*chain_info)->m_store_info, i, info)
> 	{
>-	  store_immediate_info *info;
>-	  unsigned int i;
>-	  FOR_EACH_VEC_ELT ((*chain_info)->m_store_info, i, info)
>+	  if (ref_maybe_used_by_stmt_p (stmt, gimple_assign_lhs (info->stmt))
>+	      || stmt_may_clobber_ref_p (stmt, gimple_assign_lhs
>(info->stmt)))
> 	    {
>-	      if (ref_maybe_used_by_stmt_p (stmt,
>-					    gimple_assign_lhs (info->stmt))
>-		  || stmt_may_clobber_ref_p (stmt,
>-					     gimple_assign_lhs (info->stmt)))
>+	      if (dump_file && (dump_flags & TDF_DETAILS))
> 		{
>-		  if (dump_file && (dump_flags & TDF_DETAILS))
>-		    {
>-		      fprintf (dump_file,
>-			       "stmt causes chain termination:\n");
>-		      print_gimple_stmt (dump_file, stmt, 0);
>-		    }
>-		  terminate_and_release_chain (*chain_info);
>-		  ret = true;
>-		  break;
>+		  fprintf (dump_file, "stmt causes chain termination:\n");
>+		  print_gimple_stmt (dump_file, stmt, 0);
> 		}
>+	      terminate_and_release_chain (*chain_info);
>+	      ret = true;
>+	      break;
> 	    }
> 	}
>     }
>@@ -920,6 +1013,125 @@ pass_store_merging::terminate_and_releas
>   return ret;
> }
> 
>+/* Return true if stmts in between FIRST (inclusive) and LAST
>(exclusive)
>+   may clobber REF.  FIRST and LAST must be in the same basic block
>and
>+   have non-NULL vdef.  */
>+
>+bool
>+stmts_may_clobber_ref_p (gimple *first, gimple *last, tree ref)
>+{
>+  ao_ref r;
>+  ao_ref_init (&r, ref);
>+  unsigned int count = 0;
>+  tree vop = gimple_vdef (last);
>+  gimple *stmt;
>+
>+  gcc_checking_assert (gimple_bb (first) == gimple_bb (last));
>+  do
>+    {
>+      stmt = SSA_NAME_DEF_STMT (vop);
>+      if (stmt_may_clobber_ref_p_1 (stmt, &r))
>+	return true;
>+      /* Avoid quadratic compile time by bounding the number of checks
>+	 we perform.  */
>+      if (++count > MAX_STORE_ALIAS_CHECKS)
>+	return true;
>+      vop = gimple_vuse (stmt);
>+    }
>+  while (stmt != first);
>+  return false;
>+}
>+
>+/* Return true if INFO->ops[IDX] is mergeable with the
>+   corresponding loads already in MERGED_STORE group.
>+   BASE_ADDR is the base address of the whole store group.  */
>+
>+bool
>+compatible_load_p (merged_store_group *merged_store,
>+		   store_immediate_info *info,
>+		   tree base_addr, int idx)
>+{
>+  store_immediate_info *infof = merged_store->stores[0];
>+  if (!info->ops[idx].base_addr
>+      || (info->ops[idx].bitpos - infof->ops[idx].bitpos
>+	  != info->bitpos - infof->bitpos)
>+      || !operand_equal_p (info->ops[idx].base_addr,
>+			   infof->ops[idx].base_addr, 0))
>+    return false;
>+
>+  store_immediate_info *infol = merged_store->stores.last ();
>+  tree load_vuse = gimple_vuse (info->ops[idx].stmt);
>+  /* In this case all vuses should be the same, e.g.
>+     _1 = s.a; _2 = s.b; _3 = _1 | 1; t.a = _3; _4 = _2 | 2; t.b = _4;
>+     or
>+     _1 = s.a; _2 = s.b; t.a = _1; t.b = _2;
>+     and we can emit the coalesced load next to any of those loads. 
>*/
>+  if (gimple_vuse (infof->ops[idx].stmt) == load_vuse
>+      && gimple_vuse (infol->ops[idx].stmt) == load_vuse)
>+    return true;
>+
>+  /* Otherwise, at least for now require that the load has the same
>+     vuse as the store.  See following examples.  */
>+  if (gimple_vuse (info->stmt) != load_vuse)
>+    return false;
>+
>+  if (gimple_vuse (infof->stmt) != gimple_vuse (infof->ops[idx].stmt)
>+      || (infof != infol
>+	  && gimple_vuse (infol->stmt) != gimple_vuse
>(infol->ops[idx].stmt)))
>+    return false;
>+
>+  /* If the load is from the same location as the store, already
>+     the construction of the immediate chain info guarantees no
>intervening
>+     stores, so no further checks are needed.  Example:
>+     _1 = s.a; _2 = _1 & -7; s.a = _2; _3 = s.b; _4 = _3 & -7; s.b =
>_4;  */
>+  if (info->ops[idx].bitpos == info->bitpos
>+      && operand_equal_p (info->ops[idx].base_addr, base_addr, 0))
>+    return true;
>+
>+  /* Otherwise, we need to punt if any of the loads can be clobbered
>by any
>+     of the stores in the group, or any other stores in between those.
>+     Previous calls to compatible_load_p ensured that for all the
>+     merged_store->stores IDX loads, no stmts starting with
>+     merged_store->first_stmt and ending right before
>merged_store->last_stmt
>+     clobbers those loads.  */
>+  gimple *first = merged_store->first_stmt;
>+  gimple *last = merged_store->last_stmt;
>+  unsigned int i;
>+  store_immediate_info *infoc;
>+  /* The stores are sorted by increasing store bitpos, so if
>info->stmt store
>+     comes before the so far first load, we'll be changing
>+     merged_store->first_stmt.  In that case we need to give up if
>+     any of the earlier processed loads clobber with the stmts in the
>new
>+     range.  */
>+  if (info->order < merged_store->first_order)
>+    {
>+      FOR_EACH_VEC_ELT (merged_store->stores, i, infoc)
>+	if (stmts_may_clobber_ref_p (info->stmt, first, infoc->ops[idx].val))
>+	  return false;
>+      first = info->stmt;
>+    }
>+  /* Similarly, we could change merged_store->last_stmt, so ensure
>+     in that case no stmts in the new range clobber any of the earlier
>+     processed loads.  */
>+  else if (info->order > merged_store->last_order)
>+    {
>+      FOR_EACH_VEC_ELT (merged_store->stores, i, infoc)
>+	if (stmts_may_clobber_ref_p (last, info->stmt, infoc->ops[idx].val))
>+	  return false;
>+      last = info->stmt;
>+    }
>+  /* And finally, we'd be adding a new load to the set, ensure it
>isn't
>+     clobbered in the new range.  */
>+  if (stmts_may_clobber_ref_p (first, last, info->ops[idx].val))
>+    return false;
>+
>+  /* Otherwise, we are looking for:
>+     _1 = s.a; _2 = _1 ^ 15; t.a = _2; _3 = s.b; _4 = _3 ^ 15; t.b =
>_4;
>+     or
>+     _1 = s.a; t.a = _1; _2 = s.b; t.b = _2;  */
>+  return true;
>+}
>+
>/* Go through the candidate stores recorded in m_store_info and merge
>them
>    into merged_store_group objects recorded into m_merged_store_groups
>representing the widened stores.  Return true if coalescing was
>successful
>@@ -967,32 +1179,56 @@ imm_store_chain_info::coalesce_immediate
>       if (IN_RANGE (start, merged_store->start,
> 		    merged_store->start + merged_store->width - 1))
> 	{
>-	  merged_store->merge_overlapping (info);
>-	  continue;
>+	  /* Only allow overlapping stores of constants.  */
>+	  if (info->rhs_code == INTEGER_CST
>+	      && merged_store->stores[0]->rhs_code == INTEGER_CST)
>+	    {
>+	      merged_store->merge_overlapping (info);
>+	      continue;
>+	    }
>+	}
>+      /* |---store 1---||---store 2---|
>+	 This store is consecutive to the previous one.
>+	 Merge it into the current store group.  There can be gaps in between
>+	 the stores, but there can't be gaps in between bitregions.  */
>+      else if (info->bitregion_start <= merged_store->bitregion_end
>+	       && info->rhs_code == merged_store->stores[0]->rhs_code)
>+	{
>+	  store_immediate_info *infof = merged_store->stores[0];
>+
>+	  /* All the rhs_code ops that take 2 operands are commutative,
>+	     swap the operands if it could make the operands compatible.  */
>+	  if (infof->ops[0].base_addr
>+	      && infof->ops[1].base_addr
>+	      && info->ops[0].base_addr
>+	      && info->ops[1].base_addr
>+	      && (info->ops[1].bitpos - infof->ops[0].bitpos
>+		  == info->bitpos - infof->bitpos)
>+	      && operand_equal_p (info->ops[1].base_addr,
>+				  infof->ops[0].base_addr, 0))
>+	    std::swap (info->ops[0], info->ops[1]);
>+	  if ((!infof->ops[0].base_addr
>+	       || compatible_load_p (merged_store, info, base_addr, 0))
>+	      && (!infof->ops[1].base_addr
>+		  || compatible_load_p (merged_store, info, base_addr, 1)))
>+	    {
>+	      merged_store->merge_into (info);
>+	      continue;
>+	    }
> 	}
> 
>       /* |---store 1---| <gap> |---store 2---|.
>-	 Gap between stores.  Start a new group if there are any gaps
>-	 between bitregions.  */
>-      if (info->bitregion_start > merged_store->bitregion_end)
>-	{
>-	  /* Try to apply all the stores recorded for the group to determine
>-	     the bitpattern they write and discard it if that fails.
>-	     This will also reject single-store groups.  */
>-	  if (!merged_store->apply_stores ())
>-	    delete merged_store;
>-	  else
>-	    m_merged_store_groups.safe_push (merged_store);
>-
>-	  merged_store = new merged_store_group (info);
>+	 Gap between stores or the rhs not compatible.  Start a new group. 
>*/
> 
>-	  continue;
>-	}
>+      /* Try to apply all the stores recorded for the group to
>determine
>+	 the bitpattern they write and discard it if that fails.
>+	 This will also reject single-store groups.  */
>+      if (!merged_store->apply_stores ())
>+	delete merged_store;
>+      else
>+	m_merged_store_groups.safe_push (merged_store);
> 
>-      /* |---store 1---||---store 2---|
>-	 This store is consecutive to the previous one.
>-	 Merge it into the current store group.  */
>-       merged_store->merge_into (info);
>+      merged_store = new merged_store_group (info);
>     }
> 
>   /* Record or discard the last store group.  */
>@@ -1014,35 +1250,57 @@ imm_store_chain_info::coalesce_immediate
>   return success;
> }
> 
>-/* Return the type to use for the merged stores described by STMTS.
>-   This is needed to get the alias sets right.  */
>+/* Return the type to use for the merged stores or loads described by
>STMTS.
>+   This is needed to get the alias sets right.  If IS_LOAD, look for
>rhs,
>+   otherwise lhs.  Additionally set *CLIQUEP and *BASEP to
>MR_DEPENDENCE_*
>+   of the MEM_REFs if any.  */
> 
> static tree
>-get_alias_type_for_stmts (auto_vec<gimple *> &stmts)
>+get_alias_type_for_stmts (vec<gimple *> &stmts, bool is_load,
>+			  unsigned short *cliquep, unsigned short *basep)
> {
>   gimple *stmt;
>   unsigned int i;
>-  tree lhs = gimple_assign_lhs (stmts[0]);
>-  tree type = reference_alias_ptr_type (lhs);
>+  tree type = NULL_TREE;
>+  tree ret = NULL_TREE;
>+  *cliquep = 0;
>+  *basep = 0;
> 
>   FOR_EACH_VEC_ELT (stmts, i, stmt)
>     {
>-      if (i == 0)
>-	continue;
>+      tree ref = is_load ? gimple_assign_rhs1 (stmt)
>+			 : gimple_assign_lhs (stmt);
>+      tree type1 = reference_alias_ptr_type (ref);
>+      tree base = get_base_address (ref);
> 
>-      lhs = gimple_assign_lhs (stmt);
>-      tree type1 = reference_alias_ptr_type (lhs);
>+      if (i == 0)
>+	{
>+	  if (TREE_CODE (base) == MEM_REF)
>+	    {
>+	      *cliquep = MR_DEPENDENCE_CLIQUE (base);
>+	      *basep = MR_DEPENDENCE_BASE (base);
>+	    }
>+	  ret = type = type1;
>+	  continue;
>+	}
>       if (!alias_ptr_types_compatible_p (type, type1))
>-	return ptr_type_node;
>+	ret = ptr_type_node;
>+      if (TREE_CODE (base) != MEM_REF
>+	  || *cliquep != MR_DEPENDENCE_CLIQUE (base)
>+	  || *basep != MR_DEPENDENCE_BASE (base))
>+	{
>+	  *cliquep = 0;
>+	  *basep = 0;
>+	}
>     }
>-  return type;
>+  return ret;
> }
> 
> /* Return the location_t information we can find among the statements
>    in STMTS.  */
> 
> static location_t
>-get_location_for_stmts (auto_vec<gimple *> &stmts)
>+get_location_for_stmts (vec<gimple *> &stmts)
> {
>   gimple *stmt;
>   unsigned int i;
>@@ -1062,7 +1320,7 @@ struct split_store
>   unsigned HOST_WIDE_INT bytepos;
>   unsigned HOST_WIDE_INT size;
>   unsigned HOST_WIDE_INT align;
>-  auto_vec<gimple *> orig_stmts;
>+  auto_vec<store_immediate_info *> orig_stores;
>/* True if there is a single orig stmt covering the whole split store. 
>*/
>   bool orig;
>   split_store (unsigned HOST_WIDE_INT, unsigned HOST_WIDE_INT,
>@@ -1076,21 +1334,20 @@ split_store::split_store (unsigned HOST_
> 			  unsigned HOST_WIDE_INT al)
> 			  : bytepos (bp), size (sz), align (al), orig (false)
> {
>-  orig_stmts.create (0);
>+  orig_stores.create (0);
> }
> 
>-/* Record all statements corresponding to stores in GROUP that write
>to
>-   the region starting at BITPOS and is of size BITSIZE.  Record such
>-   statements in STMTS if non-NULL.  The stores in GROUP must be
>sorted by
>-   bitposition.  Return INFO if there is exactly one original store
>-   in the range.  */
>+/* Record all stores in GROUP that write to the region starting at
>BITPOS and
>+   is of size BITSIZE.  Record infos for such statements in STORES if
>+   non-NULL.  The stores in GROUP must be sorted by bitposition. 
>Return INFO
>+   if there is exactly one original store in the range.  */
> 
> static store_immediate_info *
>-find_constituent_stmts (struct merged_store_group *group,
>-			vec<gimple *> *stmts,
>-			unsigned int *first,
>-			unsigned HOST_WIDE_INT bitpos,
>-			unsigned HOST_WIDE_INT bitsize)
>+find_constituent_stores (struct merged_store_group *group,
>+			 vec<store_immediate_info *> *stores,
>+			 unsigned int *first,
>+			 unsigned HOST_WIDE_INT bitpos,
>+			 unsigned HOST_WIDE_INT bitsize)
> {
>   store_immediate_info *info, *ret = NULL;
>   unsigned int i;
>@@ -1119,9 +1376,9 @@ find_constituent_stmts (struct merged_st
>       if (stmt_start >= end)
> 	return ret;
> 
>-      if (stmts)
>+      if (stores)
> 	{
>-	  stmts->safe_push (info->stmt);
>+	  stores->safe_push (info);
> 	  if (ret)
> 	    {
> 	      ret = NULL;
>@@ -1143,11 +1400,14 @@ find_constituent_stmts (struct merged_st
>    This is to separate the splitting strategy from the statement
>    building/emission/linking done in output_merged_store.
>    Return number of new stores.
>+   If ALLOW_UNALIGNED_STORE is false, then all stores must be aligned.
>+   If ALLOW_UNALIGNED_LOAD is false, then all loads must be aligned.
>    If SPLIT_STORES is NULL, it is just a dry run to count number of
>    new stores.  */
> 
> static unsigned int
>-split_group (merged_store_group *group, bool allow_unaligned,
>+split_group (merged_store_group *group, bool allow_unaligned_store,
>+	     bool allow_unaligned_load,
> 	     vec<struct split_store *> *split_stores)
> {
>   unsigned HOST_WIDE_INT pos = group->bitregion_start;
>@@ -1155,6 +1415,7 @@ split_group (merged_store_group *group,
>   unsigned HOST_WIDE_INT bytepos = pos / BITS_PER_UNIT;
>   unsigned HOST_WIDE_INT group_align = group->align;
>   unsigned HOST_WIDE_INT align_base = group->align_base;
>+  unsigned HOST_WIDE_INT group_load_align = group_align;
> 
>gcc_assert ((size % BITS_PER_UNIT == 0) && (pos % BITS_PER_UNIT == 0));
> 
>@@ -1162,9 +1423,14 @@ split_group (merged_store_group *group,
>   unsigned HOST_WIDE_INT try_pos = bytepos;
>   group->stores.qsort (sort_by_bitpos);
> 
>+  if (!allow_unaligned_load)
>+    for (int i = 0; i < 2; ++i)
>+      if (group->load_align[i])
>+	group_load_align = MIN (group_load_align, group->load_align[i]);
>+
>   while (size > 0)
>     {
>-      if ((allow_unaligned || group_align <= BITS_PER_UNIT)
>+      if ((allow_unaligned_store || group_align <= BITS_PER_UNIT)
> 	  && group->mask[try_pos - bytepos] == (unsigned char) ~0U)
> 	{
> 	  /* Skip padding bytes.  */
>@@ -1180,10 +1446,34 @@ split_group (merged_store_group *group,
>       unsigned HOST_WIDE_INT align = group_align;
>       if (align_bitpos)
> 	align = least_bit_hwi (align_bitpos);
>-      if (!allow_unaligned)
>+      if (!allow_unaligned_store)
> 	try_size = MIN (try_size, align);
>+      if (!allow_unaligned_load)
>+	{
>+	  /* If we can't do or don't want to do unaligned stores
>+	     as well as loads, we need to take the loads into account
>+	     as well.  */
>+	  unsigned HOST_WIDE_INT load_align = group_load_align;
>+	  align_bitpos = (try_bitpos - align_base) & (load_align - 1);
>+	  if (align_bitpos)
>+	    load_align = least_bit_hwi (align_bitpos);
>+	  for (int i = 0; i < 2; ++i)
>+	    if (group->load_align[i])
>+	      {
>+		align_bitpos = try_bitpos - group->stores[0]->bitpos;
>+		align_bitpos += group->stores[0]->ops[i].bitpos;
>+		align_bitpos -= group->load_align_base[i];
>+		align_bitpos &= (group_load_align - 1);
>+		if (align_bitpos)
>+		  {
>+		    unsigned HOST_WIDE_INT a = least_bit_hwi (align_bitpos);
>+		    load_align = MIN (load_align, a);
>+		  }
>+	      }
>+	  try_size = MIN (try_size, load_align);
>+	}
>       store_immediate_info *info
>-	= find_constituent_stmts (group, NULL, &first, try_bitpos, try_size);
>+	= find_constituent_stores (group, NULL, &first, try_bitpos,
>try_size);
>       if (info)
> 	{
> 	  /* If there is just one original statement for the range, see if
>@@ -1191,8 +1481,8 @@ split_group (merged_store_group *group,
> 	     than try_size.  */
> 	  unsigned HOST_WIDE_INT stmt_end
> 	    = ROUND_UP (info->bitpos + info->bitsize, BITS_PER_UNIT);
>-	  info = find_constituent_stmts (group, NULL, &first, try_bitpos,
>-					 stmt_end - try_bitpos);
>+	  info = find_constituent_stores (group, NULL, &first, try_bitpos,
>+					  stmt_end - try_bitpos);
> 	  if (info && info->bitpos >= try_bitpos)
> 	    {
> 	      try_size = stmt_end - try_bitpos;
>@@ -1221,7 +1511,7 @@ split_group (merged_store_group *group,
>       nonmasked *= BITS_PER_UNIT;
>       while (nonmasked <= try_size / 2)
> 	try_size /= 2;
>-      if (!allow_unaligned && group_align > BITS_PER_UNIT)
>+      if (!allow_unaligned_store && group_align > BITS_PER_UNIT)
> 	{
>	  /* Now look for whole padding bytes at the start of that bitsize. 
>*/
> 	  unsigned int try_bytesize = try_size / BITS_PER_UNIT, masked;
>@@ -1252,8 +1542,8 @@ split_group (merged_store_group *group,
> 	{
> 	  struct split_store *store
> 	    = new split_store (try_pos, try_size, align);
>-	  info = find_constituent_stmts (group, &store->orig_stmts,
>-	  				 &first, try_bitpos, try_size);
>+	  info = find_constituent_stores (group, &store->orig_stores,
>+					  &first, try_bitpos, try_size);
> 	  if (info
> 	      && info->bitpos >= try_bitpos
> 	      && info->bitpos + info->bitsize <= try_bitpos + try_size)
>@@ -1288,19 +1578,23 @@ imm_store_chain_info::output_merged_stor
> 
>   auto_vec<struct split_store *, 32> split_stores;
>   split_stores.create (0);
>-  bool allow_unaligned
>+  bool allow_unaligned_store
>= !STRICT_ALIGNMENT && PARAM_VALUE
>(PARAM_STORE_MERGING_ALLOW_UNALIGNED);
>-  if (allow_unaligned)
>+  bool allow_unaligned_load = allow_unaligned_store;
>+  if (allow_unaligned_store)
>     {
>      /* If unaligned stores are allowed, see how many stores we'd emit
> 	 for unaligned and how many stores we'd emit for aligned stores.
>	 Only use unaligned stores if it allows fewer stores than aligned.  */
>-      unsigned aligned_cnt = split_group (group, false, NULL);
>-      unsigned unaligned_cnt = split_group (group, true, NULL);
>+      unsigned aligned_cnt
>+	= split_group (group, false, allow_unaligned_load, NULL);
>+      unsigned unaligned_cnt
>+	= split_group (group, true,
diff mbox series

Patch

--- gcc/gimple-ssa-store-merging.c.jj	2017-11-01 22:49:18.123965696 +0100
+++ gcc/gimple-ssa-store-merging.c	2017-11-02 17:24:04.236317245 +0100
@@ -19,7 +19,8 @@ 
    <http://www.gnu.org/licenses/>.  */
 
 /* The purpose of this pass is to combine multiple memory stores of
-   constant values to consecutive memory locations into fewer wider stores.
+   constant values, values loaded from memory or bitwise operations
+   on those to consecutive memory locations into fewer wider stores.
    For example, if we have a sequence peforming four byte stores to
    consecutive memory locations:
    [p     ] := imm1;
@@ -29,21 +30,49 @@ 
    we can transform this into a single 4-byte store if the target supports it:
   [p] := imm1:imm2:imm3:imm4 //concatenated immediates according to endianness.
 
+   Or:
+   [p     ] := [q     ];
+   [p + 1B] := [q + 1B];
+   [p + 2B] := [q + 2B];
+   [p + 3B] := [q + 3B];
+   if there is no overlap can be transformed into a single 4-byte
+   load followed by single 4-byte store.
+
+   Or:
+   [p     ] := [q     ] ^ imm1;
+   [p + 1B] := [q + 1B] ^ imm2;
+   [p + 2B] := [q + 2B] ^ imm3;
+   [p + 3B] := [q + 3B] ^ imm4;
+   if there is no overlap can be transformed into a single 4-byte
+   load, xored with imm1:imm2:imm3:imm4 and stored using a single 4-byte store.
+
    The algorithm is applied to each basic block in three phases:
 
-   1) Scan through the basic block recording constant assignments to
+   1) Scan through the basic block recording assignments to
    destinations that can be expressed as a store to memory of a certain size
-   at a certain bit offset.  Record store chains to different bases in a
-   hash_map (m_stores) and make sure to terminate such chains when appropriate
-   (for example when when the stored values get used subsequently).
+   at a certain bit offset from expressions we can handle.  For bit-fields
+   we also note the surrounding bit region, bits that could be stored in
+   a read-modify-write operation when storing the bit-field.  Record store
+   chains to different bases in a hash_map (m_stores) and make sure to
+   terminate such chains when appropriate (for example when when the stored
+   values get used subsequently).
    These stores can be a result of structure element initializers, array stores
    etc.  A store_immediate_info object is recorded for every such store.
    Record as many such assignments to a single base as possible until a
    statement that interferes with the store sequence is encountered.
+   Each store has up to 2 operands, which can be an immediate constant
+   or a memory load, from which the value to be stored can be computed.
+   At most one of the operands can be a constant.  The operands are recorded
+   in store_operand_info struct.
 
    2) Analyze the chain of stores recorded in phase 1) (i.e. the vector of
    store_immediate_info objects) and coalesce contiguous stores into
-   merged_store_group objects.
+   merged_store_group objects.  For bit-fields stores, we don't need to
+   require the stores to be contiguous, just their surrounding bit regions
+   have to be contiguous.  If the expression being stored is different
+   between adjacent stores, such as one store storing a constant and
+   following storing a value loaded from memory, or if the loaded memory
+   objects are not adjacent, a new merged_store_group is created as well.
 
    For example, given the stores:
    [p     ] := 0;
@@ -134,8 +163,35 @@ 
 #define MAX_STORE_BITSIZE (BITS_PER_WORD)
 #define MAX_STORE_BYTES (MAX_STORE_BITSIZE / BITS_PER_UNIT)
 
+/* Limit to bound the number of aliasing checks for loads with the same
+   vuse as the corresponding store.  */
+#define MAX_STORE_ALIAS_CHECKS 64
+
 namespace {
 
+/* Struct recording one operand for the store, which is either a constant,
+   then VAL represents the constant and all the other fields are zero,
+   or a memory load, then VAL represents the reference, BASE_ADDR is non-NULL
+   and the other fields also reflect the memory load.  */
+
+struct store_operand_info
+{
+  tree val;
+  tree base_addr;
+  unsigned HOST_WIDE_INT bitsize;
+  unsigned HOST_WIDE_INT bitpos;
+  unsigned HOST_WIDE_INT bitregion_start;
+  unsigned HOST_WIDE_INT bitregion_end;
+  gimple *stmt;
+  store_operand_info ();
+};
+
+store_operand_info::store_operand_info ()
+  : val (NULL_TREE), base_addr (NULL_TREE), bitsize (0), bitpos (0),
+    bitregion_start (0), bitregion_end (0), stmt (NULL)
+{
+}
+
 /* Struct recording the information about a single store of an immediate
    to memory.  These are created in the first phase and coalesced into
    merged_store_group objects in the second phase.  */
@@ -149,9 +205,17 @@  struct store_immediate_info
   unsigned HOST_WIDE_INT bitregion_end;
   gimple *stmt;
   unsigned int order;
+  /* INTEGER_CST for constant stores, MEM_REF for memory copy or
+     BIT_*_EXPR for logical bitwise operation.  */
+  enum tree_code rhs_code;
+  /* Operands.  For BIT_*_EXPR rhs_code both operands are used, otherwise
+     just the first one.  */
+  store_operand_info ops[2];
   store_immediate_info (unsigned HOST_WIDE_INT, unsigned HOST_WIDE_INT,
 			unsigned HOST_WIDE_INT, unsigned HOST_WIDE_INT,
-			gimple *, unsigned int);
+			gimple *, unsigned int, enum tree_code,
+			const store_operand_info &,
+			const store_operand_info &);
 };
 
 store_immediate_info::store_immediate_info (unsigned HOST_WIDE_INT bs,
@@ -159,11 +223,22 @@  store_immediate_info::store_immediate_in
 					    unsigned HOST_WIDE_INT brs,
 					    unsigned HOST_WIDE_INT bre,
 					    gimple *st,
-					    unsigned int ord)
+					    unsigned int ord,
+					    enum tree_code rhscode,
+					    const store_operand_info &op0r,
+					    const store_operand_info &op1r)
   : bitsize (bs), bitpos (bp), bitregion_start (brs), bitregion_end (bre),
-    stmt (st), order (ord)
+    stmt (st), order (ord), rhs_code (rhscode)
+#if __cplusplus >= 201103L
+    , ops { op0r, op1r }
+{
+}
+#else
 {
+  ops[0] = op0r;
+  ops[1] = op1r;
 }
+#endif
 
 /* Struct representing a group of stores to contiguous memory locations.
    These are produced by the second phase (coalescing) and consumed in the
@@ -178,8 +253,10 @@  struct merged_store_group
   /* The size of the allocated memory for val and mask.  */
   unsigned HOST_WIDE_INT buf_size;
   unsigned HOST_WIDE_INT align_base;
+  unsigned HOST_WIDE_INT load_align_base[2];
 
   unsigned int align;
+  unsigned int load_align[2];
   unsigned int first_order;
   unsigned int last_order;
 
@@ -576,6 +653,20 @@  merged_store_group::merged_store_group (
   get_object_alignment_1 (gimple_assign_lhs (info->stmt),
 			  &align, &align_bitpos);
   align_base = start - align_bitpos;
+  for (int i = 0; i < 2; ++i)
+    {
+      store_operand_info &op = info->ops[i];
+      if (op.base_addr == NULL_TREE)
+	{
+	  load_align[i] = 0;
+	  load_align_base[i] = 0;
+	}
+      else
+	{
+	  get_object_alignment_1 (op.val, &load_align[i], &align_bitpos);
+	  load_align_base[i] = op.bitpos - align_bitpos;
+	}
+    }
   stores.create (1);
   stores.safe_push (info);
   last_stmt = info->stmt;
@@ -608,6 +699,19 @@  merged_store_group::do_merge (store_imme
       align = this_align;
       align_base = info->bitpos - align_bitpos;
     }
+  for (int i = 0; i < 2; ++i)
+    {
+      store_operand_info &op = info->ops[i];
+      if (!op.base_addr)
+	continue;
+
+      get_object_alignment_1 (op.val, &this_align, &align_bitpos);
+      if (this_align > load_align[i])
+	{
+	  load_align[i] = this_align;
+	  load_align_base[i] = op.bitpos - align_bitpos;
+	}
+    }
 
   gimple *stmt = info->stmt;
   stores.safe_push (info);
@@ -682,16 +786,21 @@  merged_store_group::apply_stores ()
   FOR_EACH_VEC_ELT (stores, i, info)
     {
       unsigned int pos_in_buffer = info->bitpos - bitregion_start;
-      bool ret = encode_tree_to_bitpos (gimple_assign_rhs1 (info->stmt),
-					val, info->bitsize,
-					pos_in_buffer, buf_size);
-      if (dump_file && (dump_flags & TDF_DETAILS))
+      tree cst = NULL_TREE;
+      if (info->ops[0].val && info->ops[0].base_addr == NULL_TREE)
+	cst = info->ops[0].val;
+      else if (info->ops[1].val && info->ops[1].base_addr == NULL_TREE)
+	cst = info->ops[1].val;
+      bool ret = true;
+      if (cst)
+	ret = encode_tree_to_bitpos (cst, val, info->bitsize,
+				     pos_in_buffer, buf_size);
+      if (cst && dump_file && (dump_flags & TDF_DETAILS))
 	{
 	  if (ret)
 	    {
 	      fprintf (dump_file, "After writing ");
-	      print_generic_expr (dump_file,
-				  gimple_assign_rhs1 (info->stmt), 0);
+	      print_generic_expr (dump_file, cst, 0);
 	      fprintf (dump_file, " of size " HOST_WIDE_INT_PRINT_DEC
 			" at position %d the merged region contains:\n",
 			info->bitsize, pos_in_buffer);
@@ -799,9 +908,10 @@  private:
      decisions when going out of SSA).  */
   imm_store_chain_info *m_stores_head;
 
+  void process_store (gimple *);
   bool terminate_and_process_all_chains ();
   bool terminate_all_aliasing_chains (imm_store_chain_info **,
-				      bool, gimple *);
+				      gimple *);
   bool terminate_and_release_chain (imm_store_chain_info *);
 }; // class pass_store_merging
 
@@ -831,7 +941,6 @@  pass_store_merging::terminate_and_proces
 bool
 pass_store_merging::terminate_all_aliasing_chains (imm_store_chain_info
 						     **chain_info,
-						   bool var_offset_p,
 						   gimple *stmt)
 {
   bool ret = false;
@@ -845,37 +954,21 @@  pass_store_merging::terminate_all_aliasi
      of a chain.  */
   if (chain_info)
     {
-      /* We have a chain at BASE and we're writing to [BASE + <variable>].
-	 This can interfere with any of the stores so terminate
-	 the chain.  */
-      if (var_offset_p)
+      store_immediate_info *info;
+      unsigned int i;
+      FOR_EACH_VEC_ELT ((*chain_info)->m_store_info, i, info)
 	{
-	  terminate_and_release_chain (*chain_info);
-	  ret = true;
-	}
-      /* Otherwise go through every store in the chain to see if it
-	 aliases with any of them.  */
-      else
-	{
-	  store_immediate_info *info;
-	  unsigned int i;
-	  FOR_EACH_VEC_ELT ((*chain_info)->m_store_info, i, info)
+	  if (ref_maybe_used_by_stmt_p (stmt, gimple_assign_lhs (info->stmt))
+	      || stmt_may_clobber_ref_p (stmt, gimple_assign_lhs (info->stmt)))
 	    {
-	      if (ref_maybe_used_by_stmt_p (stmt,
-					    gimple_assign_lhs (info->stmt))
-		  || stmt_may_clobber_ref_p (stmt,
-					     gimple_assign_lhs (info->stmt)))
+	      if (dump_file && (dump_flags & TDF_DETAILS))
 		{
-		  if (dump_file && (dump_flags & TDF_DETAILS))
-		    {
-		      fprintf (dump_file,
-			       "stmt causes chain termination:\n");
-		      print_gimple_stmt (dump_file, stmt, 0);
-		    }
-		  terminate_and_release_chain (*chain_info);
-		  ret = true;
-		  break;
+		  fprintf (dump_file, "stmt causes chain termination:\n");
+		  print_gimple_stmt (dump_file, stmt, 0);
 		}
+	      terminate_and_release_chain (*chain_info);
+	      ret = true;
+	      break;
 	    }
 	}
     }
@@ -920,6 +1013,109 @@  pass_store_merging::terminate_and_releas
   return ret;
 }
 
+/* Return true if stmts in between FIRST (inclusive) and LAST (exclusive)
+   may clobber REF.  FIRST and LAST must be in the same basic block and
+   have non-NULL vdef.  */
+
+bool
+stmts_may_clobber_ref_p (gimple *first, gimple *last, tree ref)
+{
+  ao_ref r;
+  ao_ref_init (&r, ref);
+  unsigned int count = 0;
+  tree vop = gimple_vdef (last);
+  gimple *stmt;
+
+  gcc_checking_assert (gimple_bb (first) == gimple_bb (last));
+  do
+    {
+      stmt = SSA_NAME_DEF_STMT (vop);
+      if (stmt_may_clobber_ref_p_1 (stmt, &r))
+	return true;
+      /* Avoid quadratic compile time by bounding the number of checks
+	 we perform.  */
+      if (++count > MAX_STORE_ALIAS_CHECKS)
+	return true;
+      vop = gimple_vuse (stmt);
+    }
+  while (stmt != first);
+  return false;
+}
+
+/* Return true if INFO->ops[IDX] is mergeable with the
+   corresponding loads already in MERGED_STORE group.
+   BASE_ADDR is the base address of the whole store group.  */
+
+bool
+compatible_load_p (merged_store_group *merged_store,
+		   store_immediate_info *info,
+		   tree base_addr, int idx)
+{
+  store_immediate_info *infof = merged_store->stores[0];
+  if (!info->ops[idx].base_addr
+      || (info->ops[idx].bitpos - infof->ops[idx].bitpos
+	  != info->bitpos - infof->bitpos)
+      || !operand_equal_p (info->ops[idx].base_addr,
+			   infof->ops[idx].base_addr, 0))
+    return false;
+
+  store_immediate_info *infol = merged_store->stores.last ();
+  tree load_vuse = gimple_vuse (info->ops[idx].stmt);
+  /* In this case all vuses should be the same, e.g.
+     _1 = s.a; _2 = s.b; _3 = _1 | 1; t.a = _3; _4 = _2 | 2; t.b = _4;
+     or
+     _1 = s.a; _2 = s.b; t.a = _1; t.b = _2;
+     and we can emit the coalesced load next to any of those loads.  */
+  if (gimple_vuse (infof->ops[idx].stmt) == load_vuse
+      && gimple_vuse (infol->ops[idx].stmt) == load_vuse)
+    return true;
+
+  /* Otherwise, at least for now require that the load has the same
+     vuse as the store.  See following examples.  */
+  if (gimple_vuse (info->stmt) != load_vuse)
+    return false;
+
+  if (gimple_vuse (infof->stmt) != gimple_vuse (infof->ops[idx].stmt)
+      || (infof != infol
+	  && gimple_vuse (infol->stmt) != gimple_vuse (infol->ops[idx].stmt)))
+    return false;
+
+  /* If the load is from the same location as the store, already
+     the construction of the immediate chain info guarantees no intervening
+     stores, so no further checks are needed.  Example:
+     _1 = s.a; _2 = _1 & -7; s.a = _2; _3 = s.b; _4 = _3 & -7; s.b = _4;  */
+  if (info->ops[idx].bitpos == info->bitpos
+      && operand_equal_p (info->ops[idx].base_addr, base_addr, 0))
+    return true;
+
+  gimple *first = merged_store->first_stmt;
+  gimple *last = merged_store->last_stmt;
+  unsigned int i;
+  store_immediate_info *infoc;
+  if (info->order < merged_store->first_order)
+    {
+      FOR_EACH_VEC_ELT (merged_store->stores, i, infoc)
+	if (stmts_may_clobber_ref_p (info->stmt, first, infoc->ops[idx].val))
+	  return false;
+      first = info->stmt;
+    }
+  else if (info->order > merged_store->last_order)
+    {
+      FOR_EACH_VEC_ELT (merged_store->stores, i, infoc)
+	if (stmts_may_clobber_ref_p (last, info->stmt, infoc->ops[idx].val))
+	  return false;
+      last = info->stmt;
+    }
+  if (stmts_may_clobber_ref_p (first, last, info->ops[idx].val))
+    return false;
+
+  /* Otherwise, we are looking for:
+     _1 = s.a; _2 = _1 ^ 15; t.a = _2; _3 = s.b; _4 = _3 ^ 15; t.b = _4;
+     or
+     _1 = s.a; t.a = _1; _2 = s.b; t.b = _2;  */
+  return true;
+}
+
 /* Go through the candidate stores recorded in m_store_info and merge them
    into merged_store_group objects recorded into m_merged_store_groups
    representing the widened stores.  Return true if coalescing was successful
@@ -967,32 +1163,56 @@  imm_store_chain_info::coalesce_immediate
       if (IN_RANGE (start, merged_store->start,
 		    merged_store->start + merged_store->width - 1))
 	{
-	  merged_store->merge_overlapping (info);
-	  continue;
+	  /* Only allow overlapping stores of constants.  */
+	  if (info->rhs_code == INTEGER_CST
+	      && merged_store->stores[0]->rhs_code == INTEGER_CST)
+	    {
+	      merged_store->merge_overlapping (info);
+	      continue;
+	    }
+	}
+      /* |---store 1---||---store 2---|
+	 This store is consecutive to the previous one.
+	 Merge it into the current store group.  There can be gaps in between
+	 the stores, but there can't be gaps in between bitregions.  */
+      else if (info->bitregion_start <= merged_store->bitregion_end
+	       && info->rhs_code == merged_store->stores[0]->rhs_code)
+	{
+	  store_immediate_info *infof = merged_store->stores[0];
+
+	  /* All the rhs_code ops that take 2 operands are commutative,
+	     swap the operands if it could make the operands compatible.  */
+	  if (infof->ops[0].base_addr
+	      && infof->ops[1].base_addr
+	      && info->ops[0].base_addr
+	      && info->ops[1].base_addr
+	      && (info->ops[1].bitpos - infof->ops[0].bitpos
+		  == info->bitpos - infof->bitpos)
+	      && operand_equal_p (info->ops[1].base_addr,
+				  infof->ops[0].base_addr, 0))
+	    std::swap (info->ops[0], info->ops[1]);
+	  if ((!infof->ops[0].base_addr
+	       || compatible_load_p (merged_store, info, base_addr, 0))
+	      && (!infof->ops[1].base_addr
+		  || compatible_load_p (merged_store, info, base_addr, 1)))
+	    {
+	      merged_store->merge_into (info);
+	      continue;
+	    }
 	}
 
       /* |---store 1---| <gap> |---store 2---|.
-	 Gap between stores.  Start a new group if there are any gaps
-	 between bitregions.  */
-      if (info->bitregion_start > merged_store->bitregion_end)
-	{
-	  /* Try to apply all the stores recorded for the group to determine
-	     the bitpattern they write and discard it if that fails.
-	     This will also reject single-store groups.  */
-	  if (!merged_store->apply_stores ())
-	    delete merged_store;
-	  else
-	    m_merged_store_groups.safe_push (merged_store);
+	 Gap between stores or the rhs not compatible.  Start a new group.  */
 
-	  merged_store = new merged_store_group (info);
-
-	  continue;
-	}
+      /* Try to apply all the stores recorded for the group to determine
+	 the bitpattern they write and discard it if that fails.
+	 This will also reject single-store groups.  */
+      if (!merged_store->apply_stores ())
+	delete merged_store;
+      else
+	m_merged_store_groups.safe_push (merged_store);
 
-      /* |---store 1---||---store 2---|
-	 This store is consecutive to the previous one.
-	 Merge it into the current store group.  */
-       merged_store->merge_into (info);
+      merged_store = new merged_store_group (info);
     }
 
   /* Record or discard the last store group.  */
@@ -1014,35 +1234,57 @@  imm_store_chain_info::coalesce_immediate
   return success;
 }
 
-/* Return the type to use for the merged stores described by STMTS.
-   This is needed to get the alias sets right.  */
+/* Return the type to use for the merged stores or loads described by STMTS.
+   This is needed to get the alias sets right.  If IS_LOAD, look for rhs,
+   otherwise lhs.  Additionally set *CLIQUEP and *BASEP to MR_DEPENDENCE_*
+   of the MEM_REFs if any.  */
 
 static tree
-get_alias_type_for_stmts (auto_vec<gimple *> &stmts)
+get_alias_type_for_stmts (vec<gimple *> &stmts, bool is_load,
+			  unsigned short *cliquep, unsigned short *basep)
 {
   gimple *stmt;
   unsigned int i;
-  tree lhs = gimple_assign_lhs (stmts[0]);
-  tree type = reference_alias_ptr_type (lhs);
+  tree type = NULL_TREE;
+  tree ret = NULL_TREE;
+  *cliquep = 0;
+  *basep = 0;
 
   FOR_EACH_VEC_ELT (stmts, i, stmt)
     {
-      if (i == 0)
-	continue;
+      tree ref = is_load ? gimple_assign_rhs1 (stmt)
+			 : gimple_assign_lhs (stmt);
+      tree type1 = reference_alias_ptr_type (ref);
+      tree base = get_base_address (ref);
 
-      lhs = gimple_assign_lhs (stmt);
-      tree type1 = reference_alias_ptr_type (lhs);
+      if (i == 0)
+	{
+	  if (TREE_CODE (base) == MEM_REF)
+	    {
+	      *cliquep = MR_DEPENDENCE_CLIQUE (base);
+	      *basep = MR_DEPENDENCE_BASE (base);
+	    }
+	  ret = type = type1;
+	  continue;
+	}
       if (!alias_ptr_types_compatible_p (type, type1))
-	return ptr_type_node;
+	ret = ptr_type_node;
+      if (TREE_CODE (base) != MEM_REF
+	  || *cliquep != MR_DEPENDENCE_CLIQUE (base)
+	  || *basep != MR_DEPENDENCE_BASE (base))
+	{
+	  *cliquep = 0;
+	  *basep = 0;
+	}
     }
-  return type;
+  return ret;
 }
 
 /* Return the location_t information we can find among the statements
    in STMTS.  */
 
 static location_t
-get_location_for_stmts (auto_vec<gimple *> &stmts)
+get_location_for_stmts (vec<gimple *> &stmts)
 {
   gimple *stmt;
   unsigned int i;
@@ -1062,7 +1304,7 @@  struct split_store
   unsigned HOST_WIDE_INT bytepos;
   unsigned HOST_WIDE_INT size;
   unsigned HOST_WIDE_INT align;
-  auto_vec<gimple *> orig_stmts;
+  auto_vec<store_immediate_info *> orig_stores;
   /* True if there is a single orig stmt covering the whole split store.  */
   bool orig;
   split_store (unsigned HOST_WIDE_INT, unsigned HOST_WIDE_INT,
@@ -1076,21 +1318,20 @@  split_store::split_store (unsigned HOST_
 			  unsigned HOST_WIDE_INT al)
 			  : bytepos (bp), size (sz), align (al), orig (false)
 {
-  orig_stmts.create (0);
+  orig_stores.create (0);
 }
 
-/* Record all statements corresponding to stores in GROUP that write to
-   the region starting at BITPOS and is of size BITSIZE.  Record such
-   statements in STMTS if non-NULL.  The stores in GROUP must be sorted by
-   bitposition.  Return INFO if there is exactly one original store
-   in the range.  */
+/* Record all stores in GROUP that write to the region starting at BITPOS and
+   is of size BITSIZE.  Record infos for such statements in STORES if
+   non-NULL.  The stores in GROUP must be sorted by bitposition.  Return INFO
+   if there is exactly one original store in the range.  */
 
 static store_immediate_info *
-find_constituent_stmts (struct merged_store_group *group,
-			vec<gimple *> *stmts,
-			unsigned int *first,
-			unsigned HOST_WIDE_INT bitpos,
-			unsigned HOST_WIDE_INT bitsize)
+find_constituent_stores (struct merged_store_group *group,
+			 vec<store_immediate_info *> *stores,
+			 unsigned int *first,
+			 unsigned HOST_WIDE_INT bitpos,
+			 unsigned HOST_WIDE_INT bitsize)
 {
   store_immediate_info *info, *ret = NULL;
   unsigned int i;
@@ -1119,9 +1360,9 @@  find_constituent_stmts (struct merged_st
       if (stmt_start >= end)
 	return ret;
 
-      if (stmts)
+      if (stores)
 	{
-	  stmts->safe_push (info->stmt);
+	  stores->safe_push (info);
 	  if (ret)
 	    {
 	      ret = NULL;
@@ -1143,11 +1384,14 @@  find_constituent_stmts (struct merged_st
    This is to separate the splitting strategy from the statement
    building/emission/linking done in output_merged_store.
    Return number of new stores.
+   If ALLOW_UNALIGNED_STORE is false, then all stores must be aligned.
+   If ALLOW_UNALIGNED_LOAD is false, then all loads must be aligned.
    If SPLIT_STORES is NULL, it is just a dry run to count number of
    new stores.  */
 
 static unsigned int
-split_group (merged_store_group *group, bool allow_unaligned,
+split_group (merged_store_group *group, bool allow_unaligned_store,
+	     bool allow_unaligned_load,
 	     vec<struct split_store *> *split_stores)
 {
   unsigned HOST_WIDE_INT pos = group->bitregion_start;
@@ -1155,6 +1399,7 @@  split_group (merged_store_group *group,
   unsigned HOST_WIDE_INT bytepos = pos / BITS_PER_UNIT;
   unsigned HOST_WIDE_INT group_align = group->align;
   unsigned HOST_WIDE_INT align_base = group->align_base;
+  unsigned HOST_WIDE_INT group_load_align = group_align;
 
   gcc_assert ((size % BITS_PER_UNIT == 0) && (pos % BITS_PER_UNIT == 0));
 
@@ -1162,9 +1407,14 @@  split_group (merged_store_group *group,
   unsigned HOST_WIDE_INT try_pos = bytepos;
   group->stores.qsort (sort_by_bitpos);
 
+  if (!allow_unaligned_load)
+    for (int i = 0; i < 2; ++i)
+      if (group->load_align[i])
+	group_load_align = MIN (group_load_align, group->load_align[i]);
+
   while (size > 0)
     {
-      if ((allow_unaligned || group_align <= BITS_PER_UNIT)
+      if ((allow_unaligned_store || group_align <= BITS_PER_UNIT)
 	  && group->mask[try_pos - bytepos] == (unsigned char) ~0U)
 	{
 	  /* Skip padding bytes.  */
@@ -1180,10 +1430,34 @@  split_group (merged_store_group *group,
       unsigned HOST_WIDE_INT align = group_align;
       if (align_bitpos)
 	align = least_bit_hwi (align_bitpos);
-      if (!allow_unaligned)
+      if (!allow_unaligned_store)
 	try_size = MIN (try_size, align);
+      if (!allow_unaligned_load)
+	{
+	  /* If we can't do or don't want to do unaligned stores
+	     as well as loads, we need to take the loads into account
+	     as well.  */
+	  unsigned HOST_WIDE_INT load_align = group_load_align;
+	  align_bitpos = (try_bitpos - align_base) & (load_align - 1);
+	  if (align_bitpos)
+	    load_align = least_bit_hwi (align_bitpos);
+	  for (int i = 0; i < 2; ++i)
+	    if (group->load_align[i])
+	      {
+		align_bitpos = try_bitpos - group->stores[0]->bitpos;
+		align_bitpos += group->stores[0]->ops[i].bitpos;
+		align_bitpos -= group->load_align_base[i];
+		align_bitpos &= (group_load_align - 1);
+		if (align_bitpos)
+		  {
+		    unsigned HOST_WIDE_INT a = least_bit_hwi (align_bitpos);
+		    load_align = MIN (load_align, a);
+		  }
+	      }
+	  try_size = MIN (try_size, load_align);
+	}
       store_immediate_info *info
-	= find_constituent_stmts (group, NULL, &first, try_bitpos, try_size);
+	= find_constituent_stores (group, NULL, &first, try_bitpos, try_size);
       if (info)
 	{
 	  /* If there is just one original statement for the range, see if
@@ -1191,8 +1465,8 @@  split_group (merged_store_group *group,
 	     than try_size.  */
 	  unsigned HOST_WIDE_INT stmt_end
 	    = ROUND_UP (info->bitpos + info->bitsize, BITS_PER_UNIT);
-	  info = find_constituent_stmts (group, NULL, &first, try_bitpos,
-					 stmt_end - try_bitpos);
+	  info = find_constituent_stores (group, NULL, &first, try_bitpos,
+					  stmt_end - try_bitpos);
 	  if (info && info->bitpos >= try_bitpos)
 	    {
 	      try_size = stmt_end - try_bitpos;
@@ -1221,7 +1495,7 @@  split_group (merged_store_group *group,
       nonmasked *= BITS_PER_UNIT;
       while (nonmasked <= try_size / 2)
 	try_size /= 2;
-      if (!allow_unaligned && group_align > BITS_PER_UNIT)
+      if (!allow_unaligned_store && group_align > BITS_PER_UNIT)
 	{
 	  /* Now look for whole padding bytes at the start of that bitsize.  */
 	  unsigned int try_bytesize = try_size / BITS_PER_UNIT, masked;
@@ -1252,8 +1526,8 @@  split_group (merged_store_group *group,
 	{
 	  struct split_store *store
 	    = new split_store (try_pos, try_size, align);
-	  info = find_constituent_stmts (group, &store->orig_stmts,
-	  				 &first, try_bitpos, try_size);
+	  info = find_constituent_stores (group, &store->orig_stores,
+					  &first, try_bitpos, try_size);
 	  if (info
 	      && info->bitpos >= try_bitpos
 	      && info->bitpos + info->bitsize <= try_bitpos + try_size)
@@ -1288,19 +1562,23 @@  imm_store_chain_info::output_merged_stor
 
   auto_vec<struct split_store *, 32> split_stores;
   split_stores.create (0);
-  bool allow_unaligned
+  bool allow_unaligned_store
     = !STRICT_ALIGNMENT && PARAM_VALUE (PARAM_STORE_MERGING_ALLOW_UNALIGNED);
-  if (allow_unaligned)
+  bool allow_unaligned_load = allow_unaligned_store;
+  if (allow_unaligned_store)
     {
       /* If unaligned stores are allowed, see how many stores we'd emit
 	 for unaligned and how many stores we'd emit for aligned stores.
 	 Only use unaligned stores if it allows fewer stores than aligned.  */
-      unsigned aligned_cnt = split_group (group, false, NULL);
-      unsigned unaligned_cnt = split_group (group, true, NULL);
+      unsigned aligned_cnt
+	= split_group (group, false, allow_unaligned_load, NULL);
+      unsigned unaligned_cnt
+	= split_group (group, true, allow_unaligned_load, NULL);
       if (aligned_cnt <= unaligned_cnt)
-	allow_unaligned = false;
+	allow_unaligned_store = false;
     }
-  split_group (group, allow_unaligned, &split_stores);
+  split_group (group, allow_unaligned_store, allow_unaligned_load,
+	       &split_stores);
 
   if (split_stores.length () >= orig_num_stmts)
     {
@@ -1323,9 +1601,37 @@  imm_store_chain_info::output_merged_stor
   gimple *stmt = NULL;
   split_store *split_store;
   unsigned int i;
-
+  auto_vec<gimple *, 32> orig_stmts;
   tree addr = force_gimple_operand_1 (unshare_expr (base_addr), &seq,
 				      is_gimple_mem_ref_addr, NULL_TREE);
+
+  tree load_addr[2] = { NULL_TREE, NULL_TREE };
+  gimple_seq load_seq[2] = { NULL, NULL };
+  gimple_stmt_iterator load_gsi[2] = { gsi_none (), gsi_none () };
+  for (int j = 0; j < 2; ++j)
+    {
+      store_operand_info &op = group->stores[0]->ops[j];
+      if (op.base_addr == NULL_TREE)
+	continue;
+
+      store_immediate_info *infol = group->stores.last ();
+      if (gimple_vuse (op.stmt) == gimple_vuse (infol->ops[j].stmt))
+	{
+	  load_gsi[j] = gsi_for_stmt (op.stmt);
+	  load_addr[j]
+	    = force_gimple_operand_1 (unshare_expr (op.base_addr),
+				      &load_seq[j], is_gimple_mem_ref_addr,
+				      NULL_TREE);
+	}
+      else if (operand_equal_p (base_addr, op.base_addr, 0))
+	load_addr[j] = addr;
+      else
+	load_addr[j]
+	  = force_gimple_operand_1 (unshare_expr (op.base_addr),
+				    &seq, is_gimple_mem_ref_addr,
+				    NULL_TREE);
+    }
+
   FOR_EACH_VEC_ELT (split_stores, i, split_store)
     {
       unsigned HOST_WIDE_INT try_size = split_store->size;
@@ -1337,27 +1643,144 @@  imm_store_chain_info::output_merged_stor
 	{
 	  /* If there is just a single constituent store which covers
 	     the whole area, just reuse the lhs and rhs.  */
-	  dest = gimple_assign_lhs (split_store->orig_stmts[0]);
-	  src = gimple_assign_rhs1 (split_store->orig_stmts[0]);
-	  loc = gimple_location (split_store->orig_stmts[0]);
+	  gimple *orig_stmt = split_store->orig_stores[0]->stmt;
+	  dest = gimple_assign_lhs (orig_stmt);
+	  src = gimple_assign_rhs1 (orig_stmt);
+	  loc = gimple_location (orig_stmt);
 	}
       else
 	{
+	  store_immediate_info *info;
+	  unsigned short clique, base;
+	  unsigned int k;
+	  FOR_EACH_VEC_ELT (split_store->orig_stores, k, info)
+	    orig_stmts.safe_push (info->stmt);
 	  tree offset_type
-	    = get_alias_type_for_stmts (split_store->orig_stmts);
-	  loc = get_location_for_stmts (split_store->orig_stmts);
+	    = get_alias_type_for_stmts (orig_stmts, false, &clique, &base);
+	  loc = get_location_for_stmts (orig_stmts);
+	  orig_stmts.truncate (0);
 
 	  tree int_type = build_nonstandard_integer_type (try_size, UNSIGNED);
 	  int_type = build_aligned_type (int_type, align);
 	  dest = fold_build2 (MEM_REF, int_type, addr,
 			      build_int_cst (offset_type, try_pos));
-	  src = native_interpret_expr (int_type,
-				       group->val + try_pos - start_byte_pos,
-				       group->buf_size);
+	  if (TREE_CODE (dest) == MEM_REF)
+	    {
+	      MR_DEPENDENCE_CLIQUE (dest) = clique;
+	      MR_DEPENDENCE_BASE (dest) = base;
+	    }
+
 	  tree mask
 	    = native_interpret_expr (int_type,
 				     group->mask + try_pos - start_byte_pos,
 				     group->buf_size);
+
+	  tree ops[2];
+	  for (int j = 0;
+	       j < 1 + (split_store->orig_stores[0]->ops[1].val != NULL_TREE);
+	       ++j)
+	    {
+	      store_operand_info &op = split_store->orig_stores[0]->ops[j];
+	      if (op.base_addr)
+		{
+		  FOR_EACH_VEC_ELT (split_store->orig_stores, k, info)
+		    orig_stmts.safe_push (info->ops[j].stmt);
+
+		  offset_type = get_alias_type_for_stmts (orig_stmts, true,
+							  &clique, &base);
+		  location_t load_loc = get_location_for_stmts (orig_stmts);
+		  orig_stmts.truncate (0);
+
+		  unsigned HOST_WIDE_INT load_align = group->load_align[j];
+		  unsigned HOST_WIDE_INT align_bitpos
+		    = (try_pos * BITS_PER_UNIT
+		       - split_store->orig_stores[0]->bitpos
+		       + op.bitpos) & (load_align - 1);
+		  if (align_bitpos)
+		    load_align = least_bit_hwi (align_bitpos);
+
+		  tree load_int_type
+		    = build_nonstandard_integer_type (try_size, UNSIGNED);
+		  load_int_type
+		    = build_aligned_type (load_int_type, load_align);
+
+		  unsigned HOST_WIDE_INT load_pos
+		    = (try_pos * BITS_PER_UNIT
+		       - split_store->orig_stores[0]->bitpos
+		       + op.bitpos) / BITS_PER_UNIT;
+		  ops[j] = fold_build2 (MEM_REF, load_int_type, load_addr[j],
+					build_int_cst (offset_type, load_pos));
+		  if (TREE_CODE (ops[j]) == MEM_REF)
+		    {
+		      MR_DEPENDENCE_CLIQUE (ops[j]) = clique;
+		      MR_DEPENDENCE_BASE (ops[j]) = base;
+		    }
+		  if (!integer_zerop (mask))
+		    /* The load might read bits that are uninitialized (they
+		       will be masked off later on); avoid -W*uninitialized
+		       warnings in that case.  */
+		    TREE_NO_WARNING (ops[j]) = 1;
+
+		  stmt = gimple_build_assign (make_ssa_name (int_type),
+					      ops[j]);
+		  gimple_set_location (stmt, load_loc);
+		  if (gsi_bb (load_gsi[j]))
+		    {
+		      gimple_set_vuse (stmt, gimple_vuse (op.stmt));
+		      gimple_seq_add_stmt_without_update (&load_seq[j], stmt);
+		    }
+		  else
+		    {
+		      gimple_set_vuse (stmt, new_vuse);
+		      gimple_seq_add_stmt_without_update (&seq, stmt);
+		    }
+		  ops[j] = gimple_assign_lhs (stmt);
+		}
+	      else
+		ops[j] = native_interpret_expr (int_type,
+						group->val + try_pos
+						- start_byte_pos,
+						group->buf_size);
+	    }
+
+	  switch (split_store->orig_stores[0]->rhs_code)
+	    {
+	    case BIT_AND_EXPR:
+	    case BIT_IOR_EXPR:
+	    case BIT_XOR_EXPR:
+	      FOR_EACH_VEC_ELT (split_store->orig_stores, k, info)
+		{
+		  tree rhs1 = gimple_assign_rhs1 (info->stmt);
+		  orig_stmts.safe_push (SSA_NAME_DEF_STMT (rhs1));
+		}
+	      location_t bit_loc;
+	      bit_loc = get_location_for_stmts (orig_stmts);
+	      orig_stmts.truncate (0);
+
+	      stmt
+		= gimple_build_assign (make_ssa_name (int_type),
+				       split_store->orig_stores[0]->rhs_code,
+				       ops[0], ops[1]);
+	      gimple_set_location (stmt, bit_loc);
+	      /* If there is just one load and there is a separate
+		 load_seq[0], emit the bitwise op right after it.  */
+	      if (load_addr[1] == NULL_TREE && gsi_bb (load_gsi[0]))
+		gimple_seq_add_stmt_without_update (&load_seq[0], stmt);
+	      /* Otherwise, if at least one load is in seq, we need to
+		 emit the bitwise op right before the store.  If there are
+		 two loads and they are emitted somewhere else, it would be
+		 better to emit the bitwise op as early as possible; we don't
+		 track where that would be possible right now though.  */
+	      else
+		gimple_seq_add_stmt_without_update (&seq, stmt);
+	      src = gimple_assign_lhs (stmt);
+	      break;
+	    default:
+	      src = ops[0];
+	      break;
+	    }
+
 	  if (!integer_zerop (mask))
 	    {
 	      tree tem = make_ssa_name (int_type);
@@ -1382,9 +1805,21 @@  imm_store_chain_info::output_merged_stor
 	      gimple_seq_add_stmt_without_update (&seq, stmt);
 	      tem = gimple_assign_lhs (stmt);
 
-	      src = wide_int_to_tree (int_type,
-				      wi::bit_and_not (wi::to_wide (src),
-						       wi::to_wide (mask)));
+	      if (TREE_CODE (src) == INTEGER_CST)
+		src = wide_int_to_tree (int_type,
+					wi::bit_and_not (wi::to_wide (src),
+							 wi::to_wide (mask)));
+	      else
+		{
+		  tree nmask
+		    = wide_int_to_tree (int_type,
+					wi::bit_not (wi::to_wide (mask)));
+		  stmt = gimple_build_assign (make_ssa_name (int_type),
+					      BIT_AND_EXPR, src, nmask);
+		  gimple_set_location (stmt, loc);
+		  gimple_seq_add_stmt_without_update (&seq, stmt);
+		  src = gimple_assign_lhs (stmt);
+		}
 	      stmt = gimple_build_assign (make_ssa_name (int_type),
 					  BIT_IOR_EXPR, tem, src);
 	      gimple_set_location (stmt, loc);
@@ -1422,6 +1857,9 @@  imm_store_chain_info::output_merged_stor
 	print_gimple_seq (dump_file, seq, 0, TDF_VOPS | TDF_MEMSYMS);
     }
   gsi_insert_seq_after (&last_gsi, seq, GSI_SAME_STMT);
+  for (int j = 0; j < 2; ++j)
+    if (load_seq[j])
+      gsi_insert_seq_after (&load_gsi[j], load_seq[j], GSI_SAME_STMT);
 
   return true;
 }
@@ -1520,10 +1958,290 @@  rhs_valid_for_store_merging_p (tree rhs)
 			     GET_MODE_SIZE (TYPE_MODE (TREE_TYPE (rhs)))) != 0;
 }
 
+/* If MEM is a memory reference usable for store merging (either as
+   store destination or for loads), return the non-NULL base_addr
+   and set *PBITSIZE, *PBITPOS, *PBITREGION_START and *PBITREGION_END.
+   Otherwise return NULL; *PBITSIZE is still valid even in that
+   case.  */
+
+static tree
+mem_valid_for_store_merging (tree mem, unsigned HOST_WIDE_INT *pbitsize,
+			     unsigned HOST_WIDE_INT *pbitpos,
+			     unsigned HOST_WIDE_INT *pbitregion_start,
+			     unsigned HOST_WIDE_INT *pbitregion_end)
+{
+  HOST_WIDE_INT bitsize;
+  HOST_WIDE_INT bitpos;
+  unsigned HOST_WIDE_INT bitregion_start = 0;
+  unsigned HOST_WIDE_INT bitregion_end = 0;
+  machine_mode mode;
+  int unsignedp = 0, reversep = 0, volatilep = 0;
+  tree offset;
+  tree base_addr = get_inner_reference (mem, &bitsize, &bitpos, &offset, &mode,
+					&unsignedp, &reversep, &volatilep);
+  *pbitsize = bitsize;
+  if (bitsize == 0)
+    return NULL_TREE;
+
+  if (TREE_CODE (mem) == COMPONENT_REF
+      && DECL_BIT_FIELD_TYPE (TREE_OPERAND (mem, 1)))
+    {
+      get_bit_range (&bitregion_start, &bitregion_end, mem, &bitpos, &offset);
+      if (bitregion_end)
+	++bitregion_end;
+    }
+
+  if (reversep)
+    return NULL_TREE;
+
+  /* We do not want to rewrite TARGET_MEM_REFs.  */
+  if (TREE_CODE (base_addr) == TARGET_MEM_REF)
+    return NULL_TREE;
+  /* In some cases get_inner_reference may return a
+     MEM_REF [ptr + byteoffset].  For the purposes of this pass
+     canonicalize the base_addr to MEM_REF [ptr] and take
+     byteoffset into account in the bitpos.  This occurs in
+     PR 23684 and this way we can catch more chains.  */
+  else if (TREE_CODE (base_addr) == MEM_REF)
+    {
+      offset_int bit_off, byte_off = mem_ref_offset (base_addr);
+      bit_off = byte_off << LOG2_BITS_PER_UNIT;
+      bit_off += bitpos;
+      if (!wi::neg_p (bit_off) && wi::fits_shwi_p (bit_off))
+	{
+	  bitpos = bit_off.to_shwi ();
+	  if (bitregion_end)
+	    {
+	      bit_off = byte_off << LOG2_BITS_PER_UNIT;
+	      bit_off += bitregion_start;
+	      if (wi::fits_uhwi_p (bit_off))
+		{
+		  bitregion_start = bit_off.to_uhwi ();
+		  bit_off = byte_off << LOG2_BITS_PER_UNIT;
+		  bit_off += bitregion_end;
+		  if (wi::fits_uhwi_p (bit_off))
+		    bitregion_end = bit_off.to_uhwi ();
+		  else
+		    bitregion_end = 0;
+		}
+	      else
+		bitregion_end = 0;
+	    }
+	}
+      else
+	return NULL_TREE;
+      base_addr = TREE_OPERAND (base_addr, 0);
+    }
+  /* get_inner_reference returns the base object; take its
+     address now.  */
+  else
+    {
+      if (bitpos < 0)
+	return NULL_TREE;
+      base_addr = build_fold_addr_expr (base_addr);
+    }
+
+  if (!bitregion_end)
+    {
+      bitregion_start = ROUND_DOWN (bitpos, BITS_PER_UNIT);
+      bitregion_end = ROUND_UP (bitpos + bitsize, BITS_PER_UNIT);
+    }
+
+  if (offset != NULL_TREE)
+    {
+      /* If the access is at a variable offset, then a base decl has
+	 to be address-taken to be able to emit pointer-based stores to it.
+	 ???  We might be able to get away with re-using the original
+	 base up to the first variable part and then wrapping that inside
+	 a BIT_FIELD_REF.  */
+      tree base = get_base_address (base_addr);
+      if (! base
+	  || (DECL_P (base) && ! TREE_ADDRESSABLE (base)))
+	return NULL_TREE;
+
+      base_addr = build2 (POINTER_PLUS_EXPR, TREE_TYPE (base_addr),
+			  base_addr, offset);
+    }
+
+  *pbitsize = bitsize;
+  *pbitpos = bitpos;
+  *pbitregion_start = bitregion_start;
+  *pbitregion_end = bitregion_end;
+  return base_addr;
+}
+
+/* Return true if STMT is a load that can be used for store merging.
+   In that case fill in *OP.  BITSIZE, BITPOS, BITREGION_START and
+   BITREGION_END are properties of the corresponding store.  */
+
+static bool
+handled_load (gimple *stmt, store_operand_info *op,
+	      unsigned HOST_WIDE_INT bitsize, unsigned HOST_WIDE_INT bitpos,
+	      unsigned HOST_WIDE_INT bitregion_start,
+	      unsigned HOST_WIDE_INT bitregion_end)
+{
+  if (!is_gimple_assign (stmt) || !gimple_vuse (stmt))
+    return false;
+  if (gimple_assign_load_p (stmt)
+      && !stmt_can_throw_internal (stmt)
+      && !gimple_has_volatile_ops (stmt))
+    {
+      tree mem = gimple_assign_rhs1 (stmt);
+      op->base_addr
+	= mem_valid_for_store_merging (mem, &op->bitsize, &op->bitpos,
+				       &op->bitregion_start,
+				       &op->bitregion_end);
+      if (op->base_addr != NULL_TREE
+	  && op->bitsize == bitsize
+	  && ((op->bitpos - bitpos) % BITS_PER_UNIT) == 0
+	  && op->bitpos - op->bitregion_start >= bitpos - bitregion_start
+	  && op->bitregion_end - op->bitpos >= bitregion_end - bitpos)
+	{
+	  op->stmt = stmt;
+	  op->val = mem;
+	  return true;
+	}
+    }
+  return false;
+}
+
+/* Record the store STMT for store merging optimization if it can be
+   optimized.  */
+
+void
+pass_store_merging::process_store (gimple *stmt)
+{
+  tree lhs = gimple_assign_lhs (stmt);
+  tree rhs = gimple_assign_rhs1 (stmt);
+  unsigned HOST_WIDE_INT bitsize, bitpos;
+  unsigned HOST_WIDE_INT bitregion_start;
+  unsigned HOST_WIDE_INT bitregion_end;
+  tree base_addr
+    = mem_valid_for_store_merging (lhs, &bitsize, &bitpos,
+				   &bitregion_start, &bitregion_end);
+  if (bitsize == 0)
+    return;
+
+  bool invalid = (base_addr == NULL_TREE
+		  || ((bitsize > MAX_BITSIZE_MODE_ANY_INT)
+		       && (TREE_CODE (rhs) != INTEGER_CST)));
+  enum tree_code rhs_code = ERROR_MARK;
+  store_operand_info ops[2];
+  if (invalid)
+    ;
+  else if (rhs_valid_for_store_merging_p (rhs))
+    {
+      rhs_code = INTEGER_CST;
+      ops[0].val = rhs;
+    }
+  else if (TREE_CODE (rhs) != SSA_NAME)
+    invalid = true;
+  else
+    {
+      gimple *def_stmt = SSA_NAME_DEF_STMT (rhs), *def_stmt1, *def_stmt2;
+      if (!is_gimple_assign (def_stmt))
+	invalid = true;
+      else if (handled_load (def_stmt, &ops[0], bitsize, bitpos,
+			     bitregion_start, bitregion_end))
+	rhs_code = MEM_REF;
+      else
+	switch ((rhs_code = gimple_assign_rhs_code (def_stmt)))
+	  {
+	  case BIT_AND_EXPR:
+	  case BIT_IOR_EXPR:
+	  case BIT_XOR_EXPR:
+	    tree rhs1, rhs2;
+	    rhs1 = gimple_assign_rhs1 (def_stmt);
+	    rhs2 = gimple_assign_rhs2 (def_stmt);
+	    invalid = true;
+	    if (TREE_CODE (rhs1) != SSA_NAME)
+	      break;
+	    def_stmt1 = SSA_NAME_DEF_STMT (rhs1);
+	    if (!is_gimple_assign (def_stmt1)
+		|| !handled_load (def_stmt1, &ops[0], bitsize, bitpos,
+				  bitregion_start, bitregion_end))
+	      break;
+	    if (rhs_valid_for_store_merging_p (rhs2))
+	      ops[1].val = rhs2;
+	    else if (TREE_CODE (rhs2) != SSA_NAME)
+	      break;
+	    else
+	      {
+		def_stmt2 = SSA_NAME_DEF_STMT (rhs2);
+		if (!is_gimple_assign (def_stmt2))
+		  break;
+		else if (!handled_load (def_stmt2, &ops[1], bitsize, bitpos,
+					bitregion_start, bitregion_end))
+		  break;
+	      }
+	    invalid = false;
+	    break;
+	  default:
+	    invalid = true;
+	    break;
+	  }
+    }
+
+  struct imm_store_chain_info **chain_info = NULL;
+  if (base_addr)
+    chain_info = m_stores.get (base_addr);
+
+  if (invalid)
+    {
+      terminate_all_aliasing_chains (chain_info, stmt);
+      return;
+    }
+
+  store_immediate_info *info;
+  if (chain_info)
+    {
+      unsigned int ord = (*chain_info)->m_store_info.length ();
+      info = new store_immediate_info (bitsize, bitpos, bitregion_start,
+				       bitregion_end, stmt, ord, rhs_code,
+				       ops[0], ops[1]);
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	{
+	  fprintf (dump_file, "Recording immediate store from stmt:\n");
+	  print_gimple_stmt (dump_file, stmt, 0);
+	}
+      (*chain_info)->m_store_info.safe_push (info);
+      /* If we reach the limit of stores to merge in a chain, terminate
+	 and process the chain now.  */
+      if ((*chain_info)->m_store_info.length ()
+	  == (unsigned int) PARAM_VALUE (PARAM_MAX_STORES_TO_MERGE))
+	{
+	  if (dump_file && (dump_flags & TDF_DETAILS))
+	    fprintf (dump_file,
+		     "Reached maximum number of statements to merge:\n");
+	  terminate_and_release_chain (*chain_info);
+	}
+      return;
+    }
+
+  /* Store aliases any existing chain?  */
+  terminate_all_aliasing_chains (chain_info, stmt);
+  /* Start a new chain.  */
+  struct imm_store_chain_info *new_chain
+    = new imm_store_chain_info (m_stores_head, base_addr);
+  info = new store_immediate_info (bitsize, bitpos, bitregion_start,
+				   bitregion_end, stmt, 0, rhs_code,
+				   ops[0], ops[1]);
+  new_chain->m_store_info.safe_push (info);
+  m_stores.put (base_addr, new_chain);
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    {
+      fprintf (dump_file, "Starting new chain with statement:\n");
+      print_gimple_stmt (dump_file, stmt, 0);
+      fprintf (dump_file, "The base object is:\n");
+      print_generic_expr (dump_file, base_addr);
+      fprintf (dump_file, "\n");
+    }
+}
+
 /* Entry point for the pass.  Go over each basic block recording chains of
-  immediate stores.  Upon encountering a terminating statement (as defined
-  by stmt_terminates_chain_p) process the recorded stores and emit the widened
-  variants.  */
+   immediate stores.  Upon encountering a terminating statement (as defined
+   by stmt_terminates_chain_p) process the recorded stores and emit the widened
+   variants.  */
 
 unsigned int
 pass_store_merging::execute (function *fun)
@@ -1573,175 +2291,9 @@  pass_store_merging::execute (function *f
 	  if (gimple_assign_single_p (stmt) && gimple_vdef (stmt)
 	      && !stmt_can_throw_internal (stmt)
 	      && lhs_valid_for_store_merging_p (gimple_assign_lhs (stmt)))
-	    {
-	      tree lhs = gimple_assign_lhs (stmt);
-	      tree rhs = gimple_assign_rhs1 (stmt);
-
-	      HOST_WIDE_INT bitsize, bitpos;
-	      unsigned HOST_WIDE_INT bitregion_start = 0;
-	      unsigned HOST_WIDE_INT bitregion_end = 0;
-	      machine_mode mode;
-	      int unsignedp = 0, reversep = 0, volatilep = 0;
-	      tree offset, base_addr;
-	      base_addr
-		= get_inner_reference (lhs, &bitsize, &bitpos, &offset, &mode,
-				       &unsignedp, &reversep, &volatilep);
-	      if (TREE_CODE (lhs) == COMPONENT_REF
-		  && DECL_BIT_FIELD_TYPE (TREE_OPERAND (lhs, 1)))
-		{
-		  get_bit_range (&bitregion_start, &bitregion_end, lhs,
-				 &bitpos, &offset);
-		  if (bitregion_end)
-		    ++bitregion_end;
-		}
-	      if (bitsize == 0)
-		continue;
-
-	      /* As a future enhancement we could handle stores with the same
-		 base and offset.  */
-	      bool invalid = reversep
-			     || ((bitsize > MAX_BITSIZE_MODE_ANY_INT)
-				  && (TREE_CODE (rhs) != INTEGER_CST))
-			     || !rhs_valid_for_store_merging_p (rhs);
-
-	      /* We do not want to rewrite TARGET_MEM_REFs.  */
-	      if (TREE_CODE (base_addr) == TARGET_MEM_REF)
-		invalid = true;
-	      /* In some cases get_inner_reference may return a
-		 MEM_REF [ptr + byteoffset].  For the purposes of this pass
-		 canonicalize the base_addr to MEM_REF [ptr] and take
-		 byteoffset into account in the bitpos.  This occurs in
-		 PR 23684 and this way we can catch more chains.  */
-	      else if (TREE_CODE (base_addr) == MEM_REF)
-		{
-		  offset_int bit_off, byte_off = mem_ref_offset (base_addr);
-		  bit_off = byte_off << LOG2_BITS_PER_UNIT;
-		  bit_off += bitpos;
-		  if (!wi::neg_p (bit_off) && wi::fits_shwi_p (bit_off))
-		    {
-		      bitpos = bit_off.to_shwi ();
-		      if (bitregion_end)
-			{
-			  bit_off = byte_off << LOG2_BITS_PER_UNIT;
-			  bit_off += bitregion_start;
-			  if (wi::fits_uhwi_p (bit_off))
-			    {
-			      bitregion_start = bit_off.to_uhwi ();
-			      bit_off = byte_off << LOG2_BITS_PER_UNIT;
-			      bit_off += bitregion_end;
-			      if (wi::fits_uhwi_p (bit_off))
-				bitregion_end = bit_off.to_uhwi ();
-			      else
-				bitregion_end = 0;
-			    }
-			  else
-			    bitregion_end = 0;
-			}
-		    }
-		  else
-		    invalid = true;
-		  base_addr = TREE_OPERAND (base_addr, 0);
-		}
-	      /* get_inner_reference returns the base object, get at its
-	         address now.  */
-	      else
-		{
-		  if (bitpos < 0)
-		    invalid = true;
-		  base_addr = build_fold_addr_expr (base_addr);
-		}
-
-	      if (!bitregion_end)
-		{
-		  bitregion_start = ROUND_DOWN (bitpos, BITS_PER_UNIT);
-		  bitregion_end = ROUND_UP (bitpos + bitsize, BITS_PER_UNIT);
-		}
-
-	      if (! invalid
-		  && offset != NULL_TREE)
-		{
-		  /* If the access is variable offset then a base
-		     decl has to be address-taken to be able to
-		     emit pointer-based stores to it.
-		     ???  We might be able to get away with
-		     re-using the original base up to the first
-		     variable part and then wrapping that inside
-		     a BIT_FIELD_REF.  */
-		  tree base = get_base_address (base_addr);
-		  if (! base
-		      || (DECL_P (base)
-			  && ! TREE_ADDRESSABLE (base)))
-		    invalid = true;
-		  else
-		    base_addr = build2 (POINTER_PLUS_EXPR,
-					TREE_TYPE (base_addr),
-					base_addr, offset);
-		}
-
-	      struct imm_store_chain_info **chain_info
-		= m_stores.get (base_addr);
-
-	      if (!invalid)
-		{
-		  store_immediate_info *info;
-		  if (chain_info)
-		    {
-		      unsigned int ord = (*chain_info)->m_store_info.length ();
-		      info = new store_immediate_info (bitsize, bitpos,
-						       bitregion_start,
-						       bitregion_end,
-						       stmt, ord);
-		      if (dump_file && (dump_flags & TDF_DETAILS))
-			{
-			  fprintf (dump_file,
-				   "Recording immediate store from stmt:\n");
-			  print_gimple_stmt (dump_file, stmt, 0);
-			}
-		      (*chain_info)->m_store_info.safe_push (info);
-		      /* If we reach the limit of stores to merge in a chain
-			 terminate and process the chain now.  */
-		      if ((*chain_info)->m_store_info.length ()
-			   == (unsigned int)
-			      PARAM_VALUE (PARAM_MAX_STORES_TO_MERGE))
-			{
-			  if (dump_file && (dump_flags & TDF_DETAILS))
-			    fprintf (dump_file,
-				 "Reached maximum number of statements"
-				 " to merge:\n");
-			  terminate_and_release_chain (*chain_info);
-			}
-		      continue;
-		    }
-
-		  /* Store aliases any existing chain?  */
-		  terminate_all_aliasing_chains (chain_info, false, stmt);
-		  /* Start a new chain.  */
-		  struct imm_store_chain_info *new_chain
-		    = new imm_store_chain_info (m_stores_head, base_addr);
-		  info = new store_immediate_info (bitsize, bitpos,
-						   bitregion_start,
-						   bitregion_end,
-						   stmt, 0);
-		  new_chain->m_store_info.safe_push (info);
-		  m_stores.put (base_addr, new_chain);
-		  if (dump_file && (dump_flags & TDF_DETAILS))
-		    {
-		      fprintf (dump_file,
-			       "Starting new chain with statement:\n");
-		      print_gimple_stmt (dump_file, stmt, 0);
-		      fprintf (dump_file, "The base object is:\n");
-		      print_generic_expr (dump_file, base_addr);
-		      fprintf (dump_file, "\n");
-		    }
-		}
-	      else
-		terminate_all_aliasing_chains (chain_info,
-					       offset != NULL_TREE, stmt);
-
-	      continue;
-	    }
-
-	  terminate_all_aliasing_chains (NULL, false, stmt);
+	    process_store (stmt);
+	  else
+	    terminate_all_aliasing_chains (NULL, stmt);
 	}
       terminate_and_process_all_chains ();
     }
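
As a C-level illustration of the new load+store handling (a rough sketch,
not part of the patch -- the real output is GIMPLE built in
output_merged_store, and the access size/alignment is picked by
split_group): for a byte-wise struct copy like f2/f3 in the new
store_merging_13.c testcase below, the seven narrow loads and stores are
expected to collapse into a single wide load feeding a single wide store,
roughly:

struct S { unsigned char a, b; unsigned short c;
	   unsigned char d, e, f, g; unsigned long long h; };

/* Hypothetical name; members a..g occupy bytes 0..7 of struct S, so one
   8-byte access covers them all (h, at offset 8, is untouched).  */
void
copy_merged (struct S *__restrict p, struct S *__restrict q)
{
  unsigned long long tmp;
  __builtin_memcpy (&tmp, q, sizeof tmp);	/* one wide load  */
  __builtin_memcpy (p, &tmp, sizeof tmp);	/* one wide store */
}
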
--- gcc/testsuite/gcc.dg/store_merging_13.c.jj	2017-11-02 08:50:03.544226508 +0100
+++ gcc/testsuite/gcc.dg/store_merging_13.c	2017-11-02 08:50:03.544226508 +0100
@@ -0,0 +1,157 @@ 
+/* { dg-do compile } */
+/* { dg-require-effective-target store_merge } */
+/* { dg-options "-O2 -fdump-tree-store-merging" } */
+
+struct S { unsigned char a, b; unsigned short c; unsigned char d, e, f, g; unsigned long long h; };
+
+__attribute__((noipa)) void
+f1 (struct S *p)
+{
+  p->a = 1;
+  p->b = 2;
+  p->c = 3;
+  p->d = 4;
+  p->e = 5;
+  p->f = 6;
+  p->g = 7;
+}
+
+__attribute__((noipa)) void
+f2 (struct S *__restrict p, struct S *__restrict q)
+{
+  p->a = q->a;
+  p->b = q->b;
+  p->c = q->c;
+  p->d = q->d;
+  p->e = q->e;
+  p->f = q->f;
+  p->g = q->g;
+}
+
+__attribute__((noipa)) void
+f3 (struct S *p, struct S *q)
+{
+  unsigned char pa = q->a;
+  unsigned char pb = q->b;
+  unsigned short pc = q->c;
+  unsigned char pd = q->d;
+  unsigned char pe = q->e;
+  unsigned char pf = q->f;
+  unsigned char pg = q->g;
+  p->a = pa;
+  p->b = pb;
+  p->c = pc;
+  p->d = pd;
+  p->e = pe;
+  p->f = pf;
+  p->g = pg;
+}
+
+__attribute__((noipa)) void
+f4 (struct S *p, struct S *q)
+{
+  unsigned char pa = p->a | q->a;
+  unsigned char pb = p->b | q->b;
+  unsigned short pc = p->c | q->c;
+  unsigned char pd = p->d | q->d;
+  unsigned char pe = p->e | q->e;
+  unsigned char pf = p->f | q->f;
+  unsigned char pg = p->g | q->g;
+  p->a = pa;
+  p->b = pb;
+  p->c = pc;
+  p->d = pd;
+  p->e = pe;
+  p->f = pf;
+  p->g = pg;
+}
+
+__attribute__((noipa)) void
+f5 (struct S *p, struct S *q)
+{
+  unsigned char pa = p->a & q->a;
+  unsigned char pb = p->b & q->b;
+  unsigned short pc = p->c & q->c;
+  unsigned char pd = p->d & q->d;
+  unsigned char pe = p->e & q->e;
+  unsigned char pf = p->f & q->f;
+  unsigned char pg = p->g & q->g;
+  p->a = pa;
+  p->b = pb;
+  p->c = pc;
+  p->d = pd;
+  p->e = pe;
+  p->f = pf;
+  p->g = pg;
+}
+
+__attribute__((noipa)) void
+f6 (struct S *p, struct S *q)
+{
+  unsigned char pa = p->a ^ q->a;
+  unsigned char pb = p->b ^ q->b;
+  unsigned short pc = p->c ^ q->c;
+  unsigned char pd = p->d ^ q->d;
+  unsigned char pe = p->e ^ q->e;
+  unsigned char pf = p->f ^ q->f;
+  unsigned char pg = p->g ^ q->g;
+  p->a = pa;
+  p->b = pb;
+  p->c = pc;
+  p->d = pd;
+  p->e = pe;
+  p->f = pf;
+  p->g = pg;
+}
+
+struct S s = { 20, 21, 22, 23, 24, 25, 26, 27 };
+struct S t = { 0x71, 0x72, 0x7f04, 0x78, 0x31, 0x32, 0x34, 0xf1f2f3f4f5f6f7f8ULL };
+struct S u = { 28, 29, 30, 31, 32, 33, 34, 35 };
+struct S v = { 36, 37, 38, 39, 40, 41, 42, 43 };
+
+int
+main ()
+{
+  asm volatile ("" : : : "memory");
+  f1 (&s);
+  asm volatile ("" : : : "memory");
+  if (s.a != 1 || s.b != 2 || s.c != 3 || s.d != 4
+      || s.e != 5 || s.f != 6 || s.g != 7 || s.h != 27)
+    __builtin_abort ();
+  f2 (&s, &u);
+  asm volatile ("" : : : "memory");
+  if (s.a != 28 || s.b != 29 || s.c != 30 || s.d != 31
+      || s.e != 32 || s.f != 33 || s.g != 34 || s.h != 27)
+    __builtin_abort ();
+  f3 (&s, &v);
+  asm volatile ("" : : : "memory");
+  if (s.a != 36 || s.b != 37 || s.c != 38 || s.d != 39
+      || s.e != 40 || s.f != 41 || s.g != 42 || s.h != 27)
+    __builtin_abort ();
+  f4 (&s, &t);
+  asm volatile ("" : : : "memory");
+  if (s.a != (36 | 0x71) || s.b != (37 | 0x72)
+      || s.c != (38 | 0x7f04) || s.d != (39 | 0x78)
+      || s.e != (40 | 0x31) || s.f != (41 | 0x32)
+      || s.g != (42 | 0x34) || s.h != 27)
+    __builtin_abort ();
+  f3 (&s, &u);
+  f5 (&s, &t);
+  asm volatile ("" : : : "memory");
+  if (s.a != (28 & 0x71) || s.b != (29 & 0x72)
+      || s.c != (30 & 0x7f04) || s.d != (31 & 0x78)
+      || s.e != (32 & 0x31) || s.f != (33 & 0x32)
+      || s.g != (34 & 0x34) || s.h != 27)
+    __builtin_abort ();
+  f2 (&s, &v);
+  f6 (&s, &t);
+  asm volatile ("" : : : "memory");
+  if (s.a != (36 ^ 0x71) || s.b != (37 ^ 0x72)
+      || s.c != (38 ^ 0x7f04) || s.d != (39 ^ 0x78)
+      || s.e != (40 ^ 0x31) || s.f != (41 ^ 0x32)
+      || s.g != (42 ^ 0x34) || s.h != 27)
+    __builtin_abort ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Merging successful" 6 "store-merging" } } */
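
Similarly, for the bitwise cases (f4/f5/f6 in store_merging_13.c above)
the intent is two wide loads, one wide bitwise operation and one wide
store.  A rough C-level equivalent of what f4 should turn into (sketch
only, hypothetical function name):

struct S { unsigned char a, b; unsigned short c;
	   unsigned char d, e, f, g; unsigned long long h; };

/* One 8-byte load from p, one from q, a single BIT_IOR_EXPR and one
   8-byte store back to p; members a..g only, h is left alone.  */
void
f4_merged (struct S *p, struct S *q)
{
  unsigned long long x, y;
  __builtin_memcpy (&x, p, sizeof x);
  __builtin_memcpy (&y, q, sizeof y);
  x |= y;
  __builtin_memcpy (p, &x, sizeof x);
}
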
--- gcc/testsuite/gcc.dg/store_merging_14.c.jj	2017-11-02 08:50:03.544226508 +0100
+++ gcc/testsuite/gcc.dg/store_merging_14.c	2017-11-02 10:35:51.000000000 +0100
@@ -0,0 +1,157 @@ 
+/* { dg-do compile } */
+/* { dg-require-effective-target store_merge } */
+/* { dg-options "-O2 -fdump-tree-store-merging" } */
+
+struct S { unsigned int i : 8, a : 7, b : 7, j : 10, c : 15, d : 7, e : 10, f : 7, g : 9, k : 16; unsigned long long h; };
+
+__attribute__((noipa)) void
+f1 (struct S *p)
+{
+  p->a = 1;
+  p->b = 2;
+  p->c = 3;
+  p->d = 4;
+  p->e = 5;
+  p->f = 6;
+  p->g = 7;
+}
+
+__attribute__((noipa)) void
+f2 (struct S *__restrict p, struct S *__restrict q)
+{
+  p->a = q->a;
+  p->b = q->b;
+  p->c = q->c;
+  p->d = q->d;
+  p->e = q->e;
+  p->f = q->f;
+  p->g = q->g;
+}
+
+__attribute__((noipa)) void
+f3 (struct S *p, struct S *q)
+{
+  unsigned char pa = q->a;
+  unsigned char pb = q->b;
+  unsigned short pc = q->c;
+  unsigned char pd = q->d;
+  unsigned short pe = q->e;
+  unsigned char pf = q->f;
+  unsigned short pg = q->g;
+  p->a = pa;
+  p->b = pb;
+  p->c = pc;
+  p->d = pd;
+  p->e = pe;
+  p->f = pf;
+  p->g = pg;
+}
+
+__attribute__((noipa)) void
+f4 (struct S *p, struct S *q)
+{
+  unsigned char pa = p->a | q->a;
+  unsigned char pb = p->b | q->b;
+  unsigned short pc = p->c | q->c;
+  unsigned char pd = p->d | q->d;
+  unsigned short pe = p->e | q->e;
+  unsigned char pf = p->f | q->f;
+  unsigned short pg = p->g | q->g;
+  p->a = pa;
+  p->b = pb;
+  p->c = pc;
+  p->d = pd;
+  p->e = pe;
+  p->f = pf;
+  p->g = pg;
+}
+
+__attribute__((noipa)) void
+f5 (struct S *p, struct S *q)
+{
+  unsigned char pa = p->a & q->a;
+  unsigned char pb = p->b & q->b;
+  unsigned short pc = p->c & q->c;
+  unsigned char pd = p->d & q->d;
+  unsigned short pe = p->e & q->e;
+  unsigned char pf = p->f & q->f;
+  unsigned short pg = p->g & q->g;
+  p->a = pa;
+  p->b = pb;
+  p->c = pc;
+  p->d = pd;
+  p->e = pe;
+  p->f = pf;
+  p->g = pg;
+}
+
+__attribute__((noipa)) void
+f6 (struct S *p, struct S *q)
+{
+  unsigned char pa = p->a ^ q->a;
+  unsigned char pb = p->b ^ q->b;
+  unsigned short pc = p->c ^ q->c;
+  unsigned char pd = p->d ^ q->d;
+  unsigned short pe = p->e ^ q->e;
+  unsigned char pf = p->f ^ q->f;
+  unsigned short pg = p->g ^ q->g;
+  p->a = pa;
+  p->b = pb;
+  p->c = pc;
+  p->d = pd;
+  p->e = pe;
+  p->f = pf;
+  p->g = pg;
+}
+
+struct S s = { 72, 20, 21, 73, 22, 23, 24, 25, 26, 74, 27 };
+struct S t = { 75, 0x71, 0x72, 76, 0x7f04, 0x78, 0x31, 0x32, 0x34, 77, 0xf1f2f3f4f5f6f7f8ULL };
+struct S u = { 78, 28, 29, 79, 30, 31, 32, 33, 34, 80, 35 };
+struct S v = { 81, 36, 37, 82, 38, 39, 40, 41, 42, 83, 43 };
+
+int
+main ()
+{
+  asm volatile ("" : : : "memory");
+  f1 (&s);
+  asm volatile ("" : : : "memory");
+  if (s.i != 72 || s.a != 1 || s.b != 2 || s.j != 73 || s.c != 3 || s.d != 4
+      || s.e != 5 || s.f != 6 || s.g != 7 || s.k != 74 || s.h != 27)
+    __builtin_abort ();
+  f2 (&s, &u);
+  asm volatile ("" : : : "memory");
+  if (s.i != 72 || s.a != 28 || s.b != 29 || s.j != 73 || s.c != 30 || s.d != 31
+      || s.e != 32 || s.f != 33 || s.g != 34 || s.k != 74 || s.h != 27)
+    __builtin_abort ();
+  f3 (&s, &v);
+  asm volatile ("" : : : "memory");
+  if (s.i != 72 || s.a != 36 || s.b != 37 || s.j != 73 || s.c != 38 || s.d != 39
+      || s.e != 40 || s.f != 41 || s.g != 42 || s.k != 74 || s.h != 27)
+    __builtin_abort ();
+  f4 (&s, &t);
+  asm volatile ("" : : : "memory");
+  if (s.i != 72 || s.a != (36 | 0x71) || s.b != (37 | 0x72) || s.j != 73
+      || s.c != (38 | 0x7f04) || s.d != (39 | 0x78)
+      || s.e != (40 | 0x31) || s.f != (41 | 0x32)
+      || s.g != (42 | 0x34) || s.k != 74 || s.h != 27)
+    __builtin_abort ();
+  f3 (&s, &u);
+  f5 (&s, &t);
+  asm volatile ("" : : : "memory");
+  if (s.i != 72 || s.a != (28 & 0x71) || s.b != (29 & 0x72) || s.j != 73
+      || s.c != (30 & 0x7f04) || s.d != (31 & 0x78)
+      || s.e != (32 & 0x31) || s.f != (33 & 0x32)
+      || s.g != (34 & 0x34) || s.k != 74 || s.h != 27)
+    __builtin_abort ();
+  f2 (&s, &v);
+  f6 (&s, &t);
+  asm volatile ("" : : : "memory");
+  if (s.i != 72 || s.a != (36 ^ 0x71) || s.b != (37 ^ 0x72) || s.j != 73
+      || s.c != (38 ^ 0x7f04) || s.d != (39 ^ 0x78)
+      || s.e != (40 ^ 0x31) || s.f != (41 ^ 0x32)
+      || s.g != (42 ^ 0x34) || s.k != 74 || s.h != 27)
+    __builtin_abort ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Merging successful" 6 "store-merging" } } */
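
The bit-field variant above additionally exercises the masking path in
output_merged_store: the i, j and k fields are not written by f1..f6, so
the merged store has to load the old wide value, keep the untouched bits
via group->mask, mask the new value with the complement and IOR the two
before storing.  A minimal sketch of that read-modify-write shape (helper
name invented; MASK stands for the bit positions *not* covered by the
recorded stores):

static unsigned long long
merge_masked (unsigned long long old_wide, unsigned long long new_wide,
	      unsigned long long mask)
{
  /* (old & mask) preserves the untouched bit-fields, (new & ~mask)
     supplies the merged value; their IOR is what gets stored back.  */
  return (old_wide & mask) | (new_wide & ~mask);
}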