diff mbox series

Assorted store-merging improvements (PR middle-end/22141)

Message ID 20171027183731.GJ14653@tucnak
State New
Headers show
Series Assorted store-merging improvements (PR middle-end/22141) | expand

Commit Message

Jakub Jelinek Oct. 27, 2017, 6:37 p.m. UTC
Hi!

The following patch attempts to improve store merging, for the time being
it still only optimizes constant stores to adjacent memory.

The biggest improvement is handling bitfields, it uses the get_bit_range
helper to find the bounds of what can be modified when modifying the bitfield
and instead of requiring all the stores to be adjacent it now only requires
that their bitregion_* regions are adjacent.  If get_bit_range fails (e.g. for
non-C/C++ languages), then it still rounds the boundaries down and up to whole
bytes, as any kind of changes within the byte affect the rest.  At the end,
if there are any gaps in between the stored values, the old value is loaded from
memory (had to set TREE_NO_WARNING on it, so that uninit doesn't complain)
masked with mask, ored with the constant maked with negation of mask and stored,
pretty much what the expansion emits.  As incremental improvement, perhaps we
could emit in some cases where all the stored bitfields in one load/store set
are adjacent a BIT_INSERT_EXPR instead of doing the and/or.

Another improvement is related to alignment handling, previously the code
was using get_object_alignment, which say for the store_merging_11.c testcase
has to return 8-bit alignment, as the whole struct is 64-bit aligned,
but the first store is 1 byte after that.  The old code would then on
targets that don't allow unaligned stores or have them slow emit just byte
stores (many of them).  The patch uses get_object_alignment_1, so that we
get both the maximum known alignment and misalign, and computes alignment
for that for every bitpos we try, such that for stores starting with
1 byte after 64-bit alignment we get 1 byte store, then 2 byte and then 4 byte
and then 8 byte.

Another improvement is for targets that allow unaligned stores, the new
code performs a dry run on split_group and if it determines that aligned
stores are as many or fewer as unaligned stores, it prefers aligned stores.
E.g. for the case in store_merging_11.c, where ptr is 64-bit aligned and
we store 15 bytes, unaligned stores and unpatched gcc would choose to
do an 8 byte, then 4 byte, then 2 byte and then one byte store.
Aligned stores, 1, 2, 4, 8 in that order are also 4, so it is better
to do those.

The patch also attempts to reuse original stores (well, just their lhs/rhs1),
if we choose a split store that has a single original insn in it.  That way
we don't lose ARRAY_REFs/COMPONENT_REFs etc. unnecessarily.  Furthermore, if
there is a larger original store than the maximum we try (wordsize), e.g. when
there is originally 8 byte long long store followed by 1 byte store followed by
1 byte store, on 32-bit targets we'd previously try to split it into 4 byte
store, 4 byte store and 2 byte store, figure out that is 3 stores like previously
and give up.  With the patch, if we see a single original larger store at the
bitpos we want, we just reuse that store, so we get in that case an 8 byte
original store (lhs/rhs1) followed by 2 byte store.

In find_constituent_stmts it optimizes by not walking unnecessarily group->stores
entries that are already known to be before the bitpos we ask.  It fixes the comparisons
which were off by one, so previously it often chose more original stores than were really
in the split store.

Another change is that output_merged_store used to emit the new stores into a sequence
and if it found out there are too many, released all ssa names and failed.
That seems to be unnecessary to me, because we know before entering the loop
how many split stores there are, so can just fail at that point and so only start
emitting something when we have decided to do the replacement.

I had to disable store merging in the g++.dg/pr71694.C testcase, but that is just
because the testcase doesn't test what it should.  In my understanding, it wants to
verify that the c.c store isn't using 32-bit RMW, because it would create data race
for c.d.  But it stores both c.c and c.d next to each other, so even when c.c's
bitregion is the first 3 bytes and c.d's bitregion is the following byte, we are
then touching bytes in both of the regions and thus a RMW cycle for the whole
32-bit word is fine, as c.d is written, it will store the new value and ignore the
old value of the c.d byte.  What is wrong, but store merging doesn't emit, is what
we emitted before, i.e. 32-bit RMW that just stored c.c, followed by c.d store.
Another thread could have stored value1 into c.d, we'd R it by 32-bit read, modify,
while another thread stored value2 into c.d, then we W the 32-bit word and thus
introduce a value1 store into c.d, then another thread reads it and finds
a value1 instead of expected value2.  Finally we store into c.d value3.
So, alternative to -fno-store-merging in the testcase would be probably separate
functions where one would store to c.c and another one to c.d, then we can make
sure neither store is using movl.  Though, it probably still should only look at
movl stores or loads, other moves are fine.

The included store_merging_10.c improves on x86_64 from:
 	movzbl	(%rdi), %eax
-	andl	$-19, %eax
+	andl	$-32, %eax
 	orl	$13, %eax
 	movb	%al, (%rdi)
in foo and
-	orb	$1, (%rdi)
 	movl	(%rdi), %eax
-	andl	$-131071, %eax
+	andl	$2147352576, %eax
+	orl	$1, %eax
 	movl	%eax, (%rdi)
-	shrl	$24, %eax
-	andl	$127, %eax
-	movb	%al, 3(%rdi)
in bar.  foo is something combine.c managed to optimize too, but bar it couldn't.
In store_merging_11.c on x86_64, bar is the same and foo changed:
-	movabsq	$578437695752115969, %rax
-	movl	$0, 9(%rdi)
-	movb	$0, 15(%rdi)
-	movq	%rax, 1(%rdi)
-	xorl	%eax, %eax
-	movw	%ax, 13(%rdi)
+	movl	$23, %eax
+	movb	$1, 1(%rdi)
+	movl	$117835012, 4(%rdi)
+	movw	%ax, 2(%rdi)
+	movq	$8, 8(%rdi)
which is not only shorter, but all the stores are aligned.
On ppc64le in store_merging_10.c the difference is:
-	lwz 9,0(3)
+	lbz 9,0(3)
 	rlwinm 9,9,0,0,26
 	ori 9,9,0xd
-	stw 9,0(3)
+	stb 9,0(3)
in foo and
 	lwz 9,0(3)
+	rlwinm 9,9,0,1,14
 	ori 9,9,0x1
-	rlwinm 9,9,0,31,14
-	rlwinm 9,9,0,1,31
 	stw 9,0(3)
in bar, and store_merging_11.c the difference is:
-	lis 8,0x807
-	li 9,0
-	ori 8,8,0x605
-	li 10,0
-	sldi 8,8,32
-	stw 9,9(3)
-	sth 9,13(3)
-	oris 8,8,0x400
-	stb 10,15(3)
-	ori 8,8,0x1701
-	mtvsrd 0,8
-	stfd 0,1(3)
+	lis 9,0x706
+	li 7,1
+	li 8,23
+	ori 9,9,0x504
+	li 10,8
+	stb 7,1(3)
+	sth 8,2(3)
+	stw 9,4(3)
+	std 10,8(3)
in foo and no changes in bar.

What the patch doesn't implement yet, but could be also possible for
allow_unaligned case is in store_merging_11.c when we are storing 15 bytes
store 8 bytes at offset 1 and 8 bytes at offset 8 (i.e. create two
overlapping stores, in this case one aligned and one unaligned).

Bootstrapped/regtested on x86_64-linux, i686-linux and powerpc64le-linux,
ok for trunk?

2017-10-27  Jakub Jelinek  <jakub@redhat.com>

	PR middle-end/22141
	* gimple-ssa-store-merging.c: Include rtl.h and expr.h.
	(struct store_immediate_info): Add bitregion_start and bitregion_end
	fields.
	(store_immediate_info::store_immediate_info): Add brs and bre
	arguments and initialize bitregion_{start,end} from those.
	(struct merged_store_group): Add bitregion_start, bitregion_end,
	align_base and mask fields.  Drop unnecessary struct keyword from
	struct store_immediate_info.  Add do_merge method.
	(clear_bit_region_be): Use memset instead of loop storing zeros.
	(merged_store_group::do_merge): New method.
	(merged_store_group::merge_into): Use do_merge.  Allow gaps in between
	stores as long as the surrounding bitregions have no gaps.
	(merged_store_group::merge_overlapping): Use do_merge.
	(merged_store_group::apply_stores): Test that bitregion_{start,end}
	is byte aligned, rather than requiring that start and width are
	byte aligned.  Drop unnecessary struct keyword from
	struct store_immediate_info.  Allocate and populate also mask array.
	Make start of the arrays relative to bitregion_start rather than
	start and size them according to bitregion_{end,start} difference.
	(struct imm_store_chain_info): Drop unnecessary struct keyword from
	struct store_immediate_info.
	(pass_store_merging::gate): Punt if BITS_PER_UNIT or CHAR_BIT is not 8.
	(pass_store_merging::terminate_all_aliasing_chains): Drop unnecessary
	struct keyword from struct store_immediate_info.
	(imm_store_chain_info::coalesce_immediate_stores): Allow gaps in
	between stores as long as the surrounding bitregions have no gaps.
	Formatting fixes.
	(struct split_store): Add orig non-static data member.
	(split_store::split_store): Initialize orig to false.
	(find_constituent_stmts): Return store_immediate_info *, non-NULL
	if there is exactly a single original stmt.  Change stmts argument
	to pointer from reference, if NULL, don't push anything to it.  Add
	first argument, use it to optimize skipping over orig stmts that
	are known to be before bitpos already.  Simplify.
	(split_group): Return unsigned int count how many stores are or
	would be needed rather than a bool.  Add allow_unaligned argument.
	Change split_stores argument from reference to pointer, if NULL,
	only do a dry run computing how many stores would be produced.
	Rewritten algorithm to use both alignment and misalign if
	!allow_unaligned and handle bitfield stores with gaps.
	(imm_store_chain_info::output_merged_store): Set start_byte_pos
	from bitregion_start instead of start.  Compute allow_unaligned
	here, if true, do 2 split_group dry runs to compute which one
	produces fewer stores and prefer aligned if equal.  Punt if
	new count is bigger or equal than original before emitting any
	statements, rather than during that.  Remove no longer needed
	new_ssa_names tracking.  Replace num_stmts with
	split_stores.length ().  Use 32-bit stack allocated entries
	in split_stores auto_vec.  Try to reuse original store lhs/rhs1
	if possible.  Handle bitfields with gaps.
	(pass_store_merging::execute): Ignore bitsize == 0 stores.
	Compute bitregion_{start,end} for the stores and construct
	store_immediate_info with that.  Formatting fixes.

	* gcc.dg/store_merging_10.c: New test.
	* gcc.dg/store_merging_11.c: New test.
	* gcc.dg/store_merging_12.c: New test.
	* g++.dg/pr71694.C: Add -fno-store-merging to dg-options.



	Jakub

Comments

Richard Biener Oct. 30, 2017, 10:46 a.m. UTC | #1
On Fri, 27 Oct 2017, Jakub Jelinek wrote:

> Hi!
> 
> The following patch attempts to improve store merging, for the time being
> it still only optimizes constant stores to adjacent memory.
> 
> The biggest improvement is handling bitfields, it uses the get_bit_range
> helper to find the bounds of what can be modified when modifying the bitfield
> and instead of requiring all the stores to be adjacent it now only requires
> that their bitregion_* regions are adjacent.  If get_bit_range fails (e.g. for
> non-C/C++ languages), then it still rounds the boundaries down and up to whole
> bytes, as any kind of changes within the byte affect the rest.  At the end,
> if there are any gaps in between the stored values, the old value is loaded from
> memory (had to set TREE_NO_WARNING on it, so that uninit doesn't complain)
> masked with mask, ored with the constant maked with negation of mask and stored,
> pretty much what the expansion emits.  As incremental improvement, perhaps we
> could emit in some cases where all the stored bitfields in one load/store set
> are adjacent a BIT_INSERT_EXPR instead of doing the and/or.
> 
> Another improvement is related to alignment handling, previously the code
> was using get_object_alignment, which say for the store_merging_11.c testcase
> has to return 8-bit alignment, as the whole struct is 64-bit aligned,
> but the first store is 1 byte after that.  The old code would then on
> targets that don't allow unaligned stores or have them slow emit just byte
> stores (many of them).  The patch uses get_object_alignment_1, so that we
> get both the maximum known alignment and misalign, and computes alignment
> for that for every bitpos we try, such that for stores starting with
> 1 byte after 64-bit alignment we get 1 byte store, then 2 byte and then 4 byte
> and then 8 byte.
> 
> Another improvement is for targets that allow unaligned stores, the new
> code performs a dry run on split_group and if it determines that aligned
> stores are as many or fewer as unaligned stores, it prefers aligned stores.
> E.g. for the case in store_merging_11.c, where ptr is 64-bit aligned and
> we store 15 bytes, unaligned stores and unpatched gcc would choose to
> do an 8 byte, then 4 byte, then 2 byte and then one byte store.
> Aligned stores, 1, 2, 4, 8 in that order are also 4, so it is better
> to do those.
> 
> The patch also attempts to reuse original stores (well, just their lhs/rhs1),
> if we choose a split store that has a single original insn in it.  That way
> we don't lose ARRAY_REFs/COMPONENT_REFs etc. unnecessarily.  Furthermore, if
> there is a larger original store than the maximum we try (wordsize), e.g. when
> there is originally 8 byte long long store followed by 1 byte store followed by
> 1 byte store, on 32-bit targets we'd previously try to split it into 4 byte
> store, 4 byte store and 2 byte store, figure out that is 3 stores like previously
> and give up.  With the patch, if we see a single original larger store at the
> bitpos we want, we just reuse that store, so we get in that case an 8 byte
> original store (lhs/rhs1) followed by 2 byte store.
> 
> In find_constituent_stmts it optimizes by not walking unnecessarily group->stores
> entries that are already known to be before the bitpos we ask.  It fixes the comparisons
> which were off by one, so previously it often chose more original stores than were really
> in the split store.
> 
> Another change is that output_merged_store used to emit the new stores into a sequence
> and if it found out there are too many, released all ssa names and failed.
> That seems to be unnecessary to me, because we know before entering the loop
> how many split stores there are, so can just fail at that point and so only start
> emitting something when we have decided to do the replacement.
> 
> I had to disable store merging in the g++.dg/pr71694.C testcase, but that is just
> because the testcase doesn't test what it should.  In my understanding, it wants to
> verify that the c.c store isn't using 32-bit RMW, because it would create data race
> for c.d.  But it stores both c.c and c.d next to each other, so even when c.c's
> bitregion is the first 3 bytes and c.d's bitregion is the following byte, we are
> then touching bytes in both of the regions and thus a RMW cycle for the whole
> 32-bit word is fine, as c.d is written, it will store the new value and ignore the
> old value of the c.d byte.  What is wrong, but store merging doesn't emit, is what
> we emitted before, i.e. 32-bit RMW that just stored c.c, followed by c.d store.
> Another thread could have stored value1 into c.d, we'd R it by 32-bit read, modify,
> while another thread stored value2 into c.d, then we W the 32-bit word and thus
> introduce a value1 store into c.d, then another thread reads it and finds
> a value1 instead of expected value2.  Finally we store into c.d value3.
> So, alternative to -fno-store-merging in the testcase would be probably separate
> functions where one would store to c.c and another one to c.d, then we can make
> sure neither store is using movl.  Though, it probably still should only look at
> movl stores or loads, other moves are fine.
> 
> The included store_merging_10.c improves on x86_64 from:
>  	movzbl	(%rdi), %eax
> -	andl	$-19, %eax
> +	andl	$-32, %eax
>  	orl	$13, %eax
>  	movb	%al, (%rdi)
> in foo and
> -	orb	$1, (%rdi)
>  	movl	(%rdi), %eax
> -	andl	$-131071, %eax
> +	andl	$2147352576, %eax
> +	orl	$1, %eax
>  	movl	%eax, (%rdi)
> -	shrl	$24, %eax
> -	andl	$127, %eax
> -	movb	%al, 3(%rdi)
> in bar.  foo is something combine.c managed to optimize too, but bar it couldn't.
> In store_merging_11.c on x86_64, bar is the same and foo changed:
> -	movabsq	$578437695752115969, %rax
> -	movl	$0, 9(%rdi)
> -	movb	$0, 15(%rdi)
> -	movq	%rax, 1(%rdi)
> -	xorl	%eax, %eax
> -	movw	%ax, 13(%rdi)
> +	movl	$23, %eax
> +	movb	$1, 1(%rdi)
> +	movl	$117835012, 4(%rdi)
> +	movw	%ax, 2(%rdi)
> +	movq	$8, 8(%rdi)
> which is not only shorter, but all the stores are aligned.
> On ppc64le in store_merging_10.c the difference is:
> -	lwz 9,0(3)
> +	lbz 9,0(3)
>  	rlwinm 9,9,0,0,26
>  	ori 9,9,0xd
> -	stw 9,0(3)
> +	stb 9,0(3)
> in foo and
>  	lwz 9,0(3)
> +	rlwinm 9,9,0,1,14
>  	ori 9,9,0x1
> -	rlwinm 9,9,0,31,14
> -	rlwinm 9,9,0,1,31
>  	stw 9,0(3)
> in bar, and store_merging_11.c the difference is:
> -	lis 8,0x807
> -	li 9,0
> -	ori 8,8,0x605
> -	li 10,0
> -	sldi 8,8,32
> -	stw 9,9(3)
> -	sth 9,13(3)
> -	oris 8,8,0x400
> -	stb 10,15(3)
> -	ori 8,8,0x1701
> -	mtvsrd 0,8
> -	stfd 0,1(3)
> +	lis 9,0x706
> +	li 7,1
> +	li 8,23
> +	ori 9,9,0x504
> +	li 10,8
> +	stb 7,1(3)
> +	sth 8,2(3)
> +	stw 9,4(3)
> +	std 10,8(3)
> in foo and no changes in bar.
> 
> What the patch doesn't implement yet, but could be also possible for
> allow_unaligned case is in store_merging_11.c when we are storing 15 bytes
> store 8 bytes at offset 1 and 8 bytes at offset 8 (i.e. create two
> overlapping stores, in this case one aligned and one unaligned).
> 
> Bootstrapped/regtested on x86_64-linux, i686-linux and powerpc64le-linux,
> ok for trunk?

Ok.

Thanks,
Richard.

> 2017-10-27  Jakub Jelinek  <jakub@redhat.com>
> 
> 	PR middle-end/22141
> 	* gimple-ssa-store-merging.c: Include rtl.h and expr.h.
> 	(struct store_immediate_info): Add bitregion_start and bitregion_end
> 	fields.
> 	(store_immediate_info::store_immediate_info): Add brs and bre
> 	arguments and initialize bitregion_{start,end} from those.
> 	(struct merged_store_group): Add bitregion_start, bitregion_end,
> 	align_base and mask fields.  Drop unnecessary struct keyword from
> 	struct store_immediate_info.  Add do_merge method.
> 	(clear_bit_region_be): Use memset instead of loop storing zeros.
> 	(merged_store_group::do_merge): New method.
> 	(merged_store_group::merge_into): Use do_merge.  Allow gaps in between
> 	stores as long as the surrounding bitregions have no gaps.
> 	(merged_store_group::merge_overlapping): Use do_merge.
> 	(merged_store_group::apply_stores): Test that bitregion_{start,end}
> 	is byte aligned, rather than requiring that start and width are
> 	byte aligned.  Drop unnecessary struct keyword from
> 	struct store_immediate_info.  Allocate and populate also mask array.
> 	Make start of the arrays relative to bitregion_start rather than
> 	start and size them according to bitregion_{end,start} difference.
> 	(struct imm_store_chain_info): Drop unnecessary struct keyword from
> 	struct store_immediate_info.
> 	(pass_store_merging::gate): Punt if BITS_PER_UNIT or CHAR_BIT is not 8.
> 	(pass_store_merging::terminate_all_aliasing_chains): Drop unnecessary
> 	struct keyword from struct store_immediate_info.
> 	(imm_store_chain_info::coalesce_immediate_stores): Allow gaps in
> 	between stores as long as the surrounding bitregions have no gaps.
> 	Formatting fixes.
> 	(struct split_store): Add orig non-static data member.
> 	(split_store::split_store): Initialize orig to false.
> 	(find_constituent_stmts): Return store_immediate_info *, non-NULL
> 	if there is exactly a single original stmt.  Change stmts argument
> 	to pointer from reference, if NULL, don't push anything to it.  Add
> 	first argument, use it to optimize skipping over orig stmts that
> 	are known to be before bitpos already.  Simplify.
> 	(split_group): Return unsigned int count how many stores are or
> 	would be needed rather than a bool.  Add allow_unaligned argument.
> 	Change split_stores argument from reference to pointer, if NULL,
> 	only do a dry run computing how many stores would be produced.
> 	Rewritten algorithm to use both alignment and misalign if
> 	!allow_unaligned and handle bitfield stores with gaps.
> 	(imm_store_chain_info::output_merged_store): Set start_byte_pos
> 	from bitregion_start instead of start.  Compute allow_unaligned
> 	here, if true, do 2 split_group dry runs to compute which one
> 	produces fewer stores and prefer aligned if equal.  Punt if
> 	new count is bigger or equal than original before emitting any
> 	statements, rather than during that.  Remove no longer needed
> 	new_ssa_names tracking.  Replace num_stmts with
> 	split_stores.length ().  Use 32-bit stack allocated entries
> 	in split_stores auto_vec.  Try to reuse original store lhs/rhs1
> 	if possible.  Handle bitfields with gaps.
> 	(pass_store_merging::execute): Ignore bitsize == 0 stores.
> 	Compute bitregion_{start,end} for the stores and construct
> 	store_immediate_info with that.  Formatting fixes.
> 
> 	* gcc.dg/store_merging_10.c: New test.
> 	* gcc.dg/store_merging_11.c: New test.
> 	* gcc.dg/store_merging_12.c: New test.
> 	* g++.dg/pr71694.C: Add -fno-store-merging to dg-options.
> 
> --- gcc/gimple-ssa-store-merging.c.jj	2017-10-27 14:16:26.074249585 +0200
> +++ gcc/gimple-ssa-store-merging.c	2017-10-27 18:10:18.212762773 +0200
> @@ -126,6 +126,8 @@
>  #include "tree-eh.h"
>  #include "target.h"
>  #include "gimplify-me.h"
> +#include "rtl.h"
> +#include "expr.h"	/* For get_bit_range.  */
>  #include "selftest.h"
>  
>  /* The maximum size (in bits) of the stores this pass should generate.  */
> @@ -142,17 +144,24 @@ struct store_immediate_info
>  {
>    unsigned HOST_WIDE_INT bitsize;
>    unsigned HOST_WIDE_INT bitpos;
> +  unsigned HOST_WIDE_INT bitregion_start;
> +  /* This is one past the last bit of the bit region.  */
> +  unsigned HOST_WIDE_INT bitregion_end;
>    gimple *stmt;
>    unsigned int order;
>    store_immediate_info (unsigned HOST_WIDE_INT, unsigned HOST_WIDE_INT,
> +			unsigned HOST_WIDE_INT, unsigned HOST_WIDE_INT,
>  			gimple *, unsigned int);
>  };
>  
>  store_immediate_info::store_immediate_info (unsigned HOST_WIDE_INT bs,
>  					    unsigned HOST_WIDE_INT bp,
> +					    unsigned HOST_WIDE_INT brs,
> +					    unsigned HOST_WIDE_INT bre,
>  					    gimple *st,
>  					    unsigned int ord)
> -  : bitsize (bs), bitpos (bp), stmt (st), order (ord)
> +  : bitsize (bs), bitpos (bp), bitregion_start (brs), bitregion_end (bre),
> +    stmt (st), order (ord)
>  {
>  }
>  
> @@ -164,26 +173,32 @@ struct merged_store_group
>  {
>    unsigned HOST_WIDE_INT start;
>    unsigned HOST_WIDE_INT width;
> -  /* The size of the allocated memory for val.  */
> +  unsigned HOST_WIDE_INT bitregion_start;
> +  unsigned HOST_WIDE_INT bitregion_end;
> +  /* The size of the allocated memory for val and mask.  */
>    unsigned HOST_WIDE_INT buf_size;
> +  unsigned HOST_WIDE_INT align_base;
>  
>    unsigned int align;
>    unsigned int first_order;
>    unsigned int last_order;
>  
> -  auto_vec<struct store_immediate_info *> stores;
> +  auto_vec<store_immediate_info *> stores;
>    /* We record the first and last original statements in the sequence because
>       we'll need their vuse/vdef and replacement position.  It's easier to keep
>       track of them separately as 'stores' is reordered by apply_stores.  */
>    gimple *last_stmt;
>    gimple *first_stmt;
>    unsigned char *val;
> +  unsigned char *mask;
>  
>    merged_store_group (store_immediate_info *);
>    ~merged_store_group ();
>    void merge_into (store_immediate_info *);
>    void merge_overlapping (store_immediate_info *);
>    bool apply_stores ();
> +private:
> +  void do_merge (store_immediate_info *);
>  };
>  
>  /* Debug helper.  Dump LEN elements of byte array PTR to FD in hex.  */
> @@ -287,8 +302,7 @@ clear_bit_region_be (unsigned char *ptr,
>  	   && len > BITS_PER_UNIT)
>      {
>        unsigned int nbytes = len / BITS_PER_UNIT;
> -      for (unsigned int i = 0; i < nbytes; i++)
> -	ptr[i] = 0U;
> +      memset (ptr, 0, nbytes);
>        if (len % BITS_PER_UNIT != 0)
>  	clear_bit_region_be (ptr + nbytes, BITS_PER_UNIT - 1,
>  			     len % BITS_PER_UNIT);
> @@ -549,10 +563,16 @@ merged_store_group::merged_store_group (
>  {
>    start = info->bitpos;
>    width = info->bitsize;
> +  bitregion_start = info->bitregion_start;
> +  bitregion_end = info->bitregion_end;
>    /* VAL has memory allocated for it in apply_stores once the group
>       width has been finalized.  */
>    val = NULL;
> -  align = get_object_alignment (gimple_assign_lhs (info->stmt));
> +  mask = NULL;
> +  unsigned HOST_WIDE_INT align_bitpos = 0;
> +  get_object_alignment_1 (gimple_assign_lhs (info->stmt),
> +			  &align, &align_bitpos);
> +  align_base = start - align_bitpos;
>    stores.create (1);
>    stores.safe_push (info);
>    last_stmt = info->stmt;
> @@ -568,18 +588,24 @@ merged_store_group::~merged_store_group
>      XDELETEVEC (val);
>  }
>  
> -/* Merge a store recorded by INFO into this merged store.
> -   The store is not overlapping with the existing recorded
> -   stores.  */
> -
> +/* Helper method for merge_into and merge_overlapping to do
> +   the common part.  */
>  void
> -merged_store_group::merge_into (store_immediate_info *info)
> +merged_store_group::do_merge (store_immediate_info *info)
>  {
> -  unsigned HOST_WIDE_INT wid = info->bitsize;
> -  /* Make sure we're inserting in the position we think we're inserting.  */
> -  gcc_assert (info->bitpos == start + width);
> +  bitregion_start = MIN (bitregion_start, info->bitregion_start);
> +  bitregion_end = MAX (bitregion_end, info->bitregion_end);
> +
> +  unsigned int this_align;
> +  unsigned HOST_WIDE_INT align_bitpos = 0;
> +  get_object_alignment_1 (gimple_assign_lhs (info->stmt),
> +			  &this_align, &align_bitpos);
> +  if (this_align > align)
> +    {
> +      align = this_align;
> +      align_base = info->bitpos - align_bitpos;
> +    }
>  
> -  width += wid;
>    gimple *stmt = info->stmt;
>    stores.safe_push (info);
>    if (info->order > last_order)
> @@ -594,6 +620,22 @@ merged_store_group::merge_into (store_im
>      }
>  }
>  
> +/* Merge a store recorded by INFO into this merged store.
> +   The store is not overlapping with the existing recorded
> +   stores.  */
> +
> +void
> +merged_store_group::merge_into (store_immediate_info *info)
> +{
> +  unsigned HOST_WIDE_INT wid = info->bitsize;
> +  /* Make sure we're inserting in the position we think we're inserting.  */
> +  gcc_assert (info->bitpos >= start + width
> +	      && info->bitregion_start <= bitregion_end);
> +
> +  width += wid;
> +  do_merge (info);
> +}
> +
>  /* Merge a store described by INFO into this merged store.
>     INFO overlaps in some way with the current store (i.e. it's not contiguous
>     which is handled by merged_store_group::merge_into).  */
> @@ -601,23 +643,11 @@ merged_store_group::merge_into (store_im
>  void
>  merged_store_group::merge_overlapping (store_immediate_info *info)
>  {
> -  gimple *stmt = info->stmt;
> -  stores.safe_push (info);
> -
>    /* If the store extends the size of the group, extend the width.  */
> -  if ((info->bitpos + info->bitsize) > (start + width))
> +  if (info->bitpos + info->bitsize > start + width)
>      width += info->bitpos + info->bitsize - (start + width);
>  
> -  if (info->order > last_order)
> -    {
> -      last_order = info->order;
> -      last_stmt = stmt;
> -    }
> -  else if (info->order < first_order)
> -    {
> -      first_order = info->order;
> -      first_stmt = stmt;
> -    }
> +  do_merge (info);
>  }
>  
>  /* Go through all the recorded stores in this group in program order and
> @@ -627,27 +657,28 @@ merged_store_group::merge_overlapping (s
>  bool
>  merged_store_group::apply_stores ()
>  {
> -  /* The total width of the stores must add up to a whole number of bytes
> -     and start at a byte boundary.  We don't support emitting bitfield
> -     references for now.  Also, make sure we have more than one store
> -     in the group, otherwise we cannot merge anything.  */
> -  if (width % BITS_PER_UNIT != 0
> -      || start % BITS_PER_UNIT != 0
> +  /* Make sure we have more than one store in the group, otherwise we cannot
> +     merge anything.  */
> +  if (bitregion_start % BITS_PER_UNIT != 0
> +      || bitregion_end % BITS_PER_UNIT != 0
>        || stores.length () == 1)
>      return false;
>  
>    stores.qsort (sort_by_order);
> -  struct store_immediate_info *info;
> +  store_immediate_info *info;
>    unsigned int i;
>    /* Create a buffer of a size that is 2 times the number of bytes we're
>       storing.  That way native_encode_expr can write power-of-2-sized
>       chunks without overrunning.  */
> -  buf_size = 2 * (ROUND_UP (width, BITS_PER_UNIT) / BITS_PER_UNIT);
> -  val = XCNEWVEC (unsigned char, buf_size);
> +  buf_size = 2 * ((bitregion_end - bitregion_start) / BITS_PER_UNIT);
> +  val = XNEWVEC (unsigned char, 2 * buf_size);
> +  mask = val + buf_size;
> +  memset (val, 0, buf_size);
> +  memset (mask, ~0U, buf_size);
>  
>    FOR_EACH_VEC_ELT (stores, i, info)
>      {
> -      unsigned int pos_in_buffer = info->bitpos - start;
> +      unsigned int pos_in_buffer = info->bitpos - bitregion_start;
>        bool ret = encode_tree_to_bitpos (gimple_assign_rhs1 (info->stmt),
>  					val, info->bitsize,
>  					pos_in_buffer, buf_size);
> @@ -668,6 +699,11 @@ merged_store_group::apply_stores ()
>          }
>        if (!ret)
>  	return false;
> +      unsigned char *m = mask + (pos_in_buffer / BITS_PER_UNIT);
> +      if (BYTES_BIG_ENDIAN)
> +	clear_bit_region_be (m, pos_in_buffer % BITS_PER_UNIT, info->bitsize);
> +      else
> +	clear_bit_region (m, pos_in_buffer % BITS_PER_UNIT, info->bitsize);
>      }
>    return true;
>  }
> @@ -682,7 +718,7 @@ struct imm_store_chain_info
>       See pass_store_merging::m_stores_head for more rationale.  */
>    imm_store_chain_info *next, **pnxp;
>    tree base_addr;
> -  auto_vec<struct store_immediate_info *> m_store_info;
> +  auto_vec<store_immediate_info *> m_store_info;
>    auto_vec<merged_store_group *> m_merged_store_groups;
>  
>    imm_store_chain_info (imm_store_chain_info *&inspt, tree b_a)
> @@ -730,11 +766,16 @@ public:
>    {
>    }
>  
> -  /* Pass not supported for PDP-endianness.  */
> +  /* Pass not supported for PDP-endianness, nor for insane hosts
> +     or target character sizes where native_{encode,interpret}_expr
> +     doesn't work properly.  */
>    virtual bool
>    gate (function *)
>    {
> -    return flag_store_merging && (WORDS_BIG_ENDIAN == BYTES_BIG_ENDIAN);
> +    return flag_store_merging
> +	   && WORDS_BIG_ENDIAN == BYTES_BIG_ENDIAN
> +	   && CHAR_BIT == 8
> +	   && BITS_PER_UNIT == 8;
>    }
>  
>    virtual unsigned int execute (function *);
> @@ -811,7 +852,7 @@ pass_store_merging::terminate_all_aliasi
>  	 aliases with any of them.  */
>        else
>  	{
> -	  struct store_immediate_info *info;
> +	  store_immediate_info *info;
>  	  unsigned int i;
>  	  FOR_EACH_VEC_ELT ((*chain_info)->m_store_info, i, info)
>  	    {
> @@ -926,8 +967,9 @@ imm_store_chain_info::coalesce_immediate
>  	}
>  
>        /* |---store 1---| <gap> |---store 2---|.
> -	 Gap between stores.  Start a new group.  */
> -      if (start != merged_store->start + merged_store->width)
> +	 Gap between stores.  Start a new group if there are any gaps
> +	 between bitregions.  */
> +      if (info->bitregion_start > merged_store->bitregion_end)
>  	{
>  	  /* Try to apply all the stores recorded for the group to determine
>  	     the bitpattern they write and discard it if that fails.
> @@ -948,11 +990,11 @@ imm_store_chain_info::coalesce_immediate
>         merged_store->merge_into (info);
>      }
>  
> -    /* Record or discard the last store group.  */
> -    if (!merged_store->apply_stores ())
> -      delete merged_store;
> -    else
> -      m_merged_store_groups.safe_push (merged_store);
> +  /* Record or discard the last store group.  */
> +  if (!merged_store->apply_stores ())
> +    delete merged_store;
> +  else
> +    m_merged_store_groups.safe_push (merged_store);
>  
>    gcc_assert (m_merged_store_groups.length () <= m_store_info.length ());
>    bool success
> @@ -961,8 +1003,8 @@ imm_store_chain_info::coalesce_immediate
>  
>    if (success && dump_file)
>      fprintf (dump_file, "Coalescing successful!\n"
> -			 "Merged into %u stores\n",
> -		m_merged_store_groups.length ());
> +			"Merged into %u stores\n",
> +	     m_merged_store_groups.length ());
>  
>    return success;
>  }
> @@ -1016,6 +1058,8 @@ struct split_store
>    unsigned HOST_WIDE_INT size;
>    unsigned HOST_WIDE_INT align;
>    auto_vec<gimple *> orig_stmts;
> +  /* True if there is a single orig stmt covering the whole split store.  */
> +  bool orig;
>    split_store (unsigned HOST_WIDE_INT, unsigned HOST_WIDE_INT,
>  	       unsigned HOST_WIDE_INT);
>  };
> @@ -1025,100 +1069,198 @@ struct split_store
>  split_store::split_store (unsigned HOST_WIDE_INT bp,
>  			  unsigned HOST_WIDE_INT sz,
>  			  unsigned HOST_WIDE_INT al)
> -			  : bytepos (bp), size (sz), align (al)
> +			  : bytepos (bp), size (sz), align (al), orig (false)
>  {
>    orig_stmts.create (0);
>  }
>  
>  /* Record all statements corresponding to stores in GROUP that write to
>     the region starting at BITPOS and is of size BITSIZE.  Record such
> -   statements in STMTS.  The stores in GROUP must be sorted by
> -   bitposition.  */
> +   statements in STMTS if non-NULL.  The stores in GROUP must be sorted by
> +   bitposition.  Return INFO if there is exactly one original store
> +   in the range.  */
>  
> -static void
> +static store_immediate_info *
>  find_constituent_stmts (struct merged_store_group *group,
> -			 auto_vec<gimple *> &stmts,
> -			 unsigned HOST_WIDE_INT bitpos,
> -			 unsigned HOST_WIDE_INT bitsize)
> +			vec<gimple *> *stmts,
> +			unsigned int *first,
> +			unsigned HOST_WIDE_INT bitpos,
> +			unsigned HOST_WIDE_INT bitsize)
>  {
> -  struct store_immediate_info *info;
> +  store_immediate_info *info, *ret = NULL;
>    unsigned int i;
> +  bool second = false;
> +  bool update_first = true;
>    unsigned HOST_WIDE_INT end = bitpos + bitsize;
> -  FOR_EACH_VEC_ELT (group->stores, i, info)
> +  for (i = *first; group->stores.iterate (i, &info); ++i)
>      {
>        unsigned HOST_WIDE_INT stmt_start = info->bitpos;
>        unsigned HOST_WIDE_INT stmt_end = stmt_start + info->bitsize;
> -      if (stmt_end < bitpos)
> -	continue;
> +      if (stmt_end <= bitpos)
> +	{
> +	  /* BITPOS passed to this function never decreases from within the
> +	     same split_group call, so optimize and don't scan info records
> +	     which are known to end before or at BITPOS next time.
> +	     Only do it if all stores before this one also pass this.  */
> +	  if (update_first)
> +	    *first = i + 1;
> +	  continue;
> +	}
> +      else
> +	update_first = false;
> +
>        /* The stores in GROUP are ordered by bitposition so if we're past
> -	  the region for this group return early.  */
> -      if (stmt_start > end)
> -	return;
> -
> -      if (IN_RANGE (stmt_start, bitpos, bitpos + bitsize)
> -	  || IN_RANGE (stmt_end, bitpos, end)
> -	  /* The statement writes a region that completely encloses the region
> -	     that this group writes.  Unlikely to occur but let's
> -	     handle it.  */
> -	  || IN_RANGE (bitpos, stmt_start, stmt_end))
> -	stmts.safe_push (info->stmt);
> +	 the region for this group return early.  */
> +      if (stmt_start >= end)
> +	return ret;
> +
> +      if (stmts)
> +	{
> +	  stmts->safe_push (info->stmt);
> +	  if (ret)
> +	    {
> +	      ret = NULL;
> +	      second = true;
> +	    }
> +	}
> +      else if (ret)
> +	return NULL;
> +      if (!second)
> +	ret = info;
>      }
> +  return ret;
>  }
>  
>  /* Split a merged store described by GROUP by populating the SPLIT_STORES
> -   vector with split_store structs describing the byte offset (from the base),
> -   the bit size and alignment of each store as well as the original statements
> -   involved in each such split group.
> +   vector (if non-NULL) with split_store structs describing the byte offset
> +   (from the base), the bit size and alignment of each store as well as the
> +   original statements involved in each such split group.
>     This is to separate the splitting strategy from the statement
>     building/emission/linking done in output_merged_store.
> -   At the moment just start with the widest possible size and keep emitting
> -   the widest we can until we have emitted all the bytes, halving the size
> -   when appropriate.  */
> -
> -static bool
> -split_group (merged_store_group *group,
> -	     auto_vec<struct split_store *> &split_stores)
> +   Return number of new stores.
> +   If SPLIT_STORES is NULL, it is just a dry run to count number of
> +   new stores.  */
> +
> +static unsigned int
> +split_group (merged_store_group *group, bool allow_unaligned,
> +	     vec<struct split_store *> *split_stores)
>  {
> -  unsigned HOST_WIDE_INT pos = group->start;
> -  unsigned HOST_WIDE_INT size = group->width;
> +  unsigned HOST_WIDE_INT pos = group->bitregion_start;
> +  unsigned HOST_WIDE_INT size = group->bitregion_end - pos;
>    unsigned HOST_WIDE_INT bytepos = pos / BITS_PER_UNIT;
> -  unsigned HOST_WIDE_INT align = group->align;
> +  unsigned HOST_WIDE_INT group_align = group->align;
> +  unsigned HOST_WIDE_INT align_base = group->align_base;
>  
> -  /* We don't handle partial bitfields for now.  We shouldn't have
> -     reached this far.  */
>    gcc_assert ((size % BITS_PER_UNIT == 0) && (pos % BITS_PER_UNIT == 0));
>  
> -  bool allow_unaligned
> -    = !STRICT_ALIGNMENT && PARAM_VALUE (PARAM_STORE_MERGING_ALLOW_UNALIGNED);
> -
> -  unsigned int try_size = MAX_STORE_BITSIZE;
> -  while (try_size > size
> -	 || (!allow_unaligned
> -	     && try_size > align))
> -    {
> -      try_size /= 2;
> -      if (try_size < BITS_PER_UNIT)
> -	return false;
> -    }
> -
> +  unsigned int ret = 0, first = 0;
>    unsigned HOST_WIDE_INT try_pos = bytepos;
>    group->stores.qsort (sort_by_bitpos);
>  
>    while (size > 0)
>      {
> -      struct split_store *store = new split_store (try_pos, try_size, align);
> +      if ((allow_unaligned || group_align <= BITS_PER_UNIT)
> +	  && group->mask[try_pos - bytepos] == (unsigned char) ~0U)
> +	{
> +	  /* Skip padding bytes.  */
> +	  ++try_pos;
> +	  size -= BITS_PER_UNIT;
> +	  continue;
> +	}
> +
>        unsigned HOST_WIDE_INT try_bitpos = try_pos * BITS_PER_UNIT;
> -      find_constituent_stmts (group, store->orig_stmts, try_bitpos, try_size);
> -      split_stores.safe_push (store);
> +      unsigned int try_size = MAX_STORE_BITSIZE, nonmasked;
> +      unsigned HOST_WIDE_INT align_bitpos
> +	= (try_bitpos - align_base) & (group_align - 1);
> +      unsigned HOST_WIDE_INT align = group_align;
> +      if (align_bitpos)
> +	align = least_bit_hwi (align_bitpos);
> +      if (!allow_unaligned)
> +	try_size = MIN (try_size, align);
> +      store_immediate_info *info
> +	= find_constituent_stmts (group, NULL, &first, try_bitpos, try_size);
> +      if (info)
> +	{
> +	  /* If there is just one original statement for the range, see if
> +	     we can just reuse the original store which could be even larger
> +	     than try_size.  */
> +	  unsigned HOST_WIDE_INT stmt_end
> +	    = ROUND_UP (info->bitpos + info->bitsize, BITS_PER_UNIT);
> +	  info = find_constituent_stmts (group, NULL, &first, try_bitpos,
> +					 stmt_end - try_bitpos);
> +	  if (info && info->bitpos >= try_bitpos)
> +	    {
> +	      try_size = stmt_end - try_bitpos;
> +	      goto found;
> +	    }
> +	}
>  
> -      try_pos += try_size / BITS_PER_UNIT;
> +      /* Approximate store bitsize for the case when there are no padding
> +	 bits.  */
> +      while (try_size > size)
> +	try_size /= 2;
> +      /* Now look for whole padding bytes at the end of that bitsize.  */
> +      for (nonmasked = try_size / BITS_PER_UNIT; nonmasked > 0; --nonmasked)
> +	if (group->mask[try_pos - bytepos + nonmasked - 1]
> +	    != (unsigned char) ~0U)
> +	  break;
> +      if (nonmasked == 0)
> +	{
> +	  /* If entire try_size range is padding, skip it.  */
> +	  try_pos += try_size / BITS_PER_UNIT;
> +	  size -= try_size;
> +	  continue;
> +	}
> +      /* Otherwise try to decrease try_size if second half, last 3 quarters
> +	 etc. are padding.  */
> +      nonmasked *= BITS_PER_UNIT;
> +      while (nonmasked <= try_size / 2)
> +	try_size /= 2;
> +      if (!allow_unaligned && group_align > BITS_PER_UNIT)
> +	{
> +	  /* Now look for whole padding bytes at the start of that bitsize.  */
> +	  unsigned int try_bytesize = try_size / BITS_PER_UNIT, masked;
> +	  for (masked = 0; masked < try_bytesize; ++masked)
> +	    if (group->mask[try_pos - bytepos + masked] != (unsigned char) ~0U)
> +	      break;
> +	  masked *= BITS_PER_UNIT;
> +	  gcc_assert (masked < try_size);
> +	  if (masked >= try_size / 2)
> +	    {
> +	      while (masked >= try_size / 2)
> +		{
> +		  try_size /= 2;
> +		  try_pos += try_size / BITS_PER_UNIT;
> +		  size -= try_size;
> +		  masked -= try_size;
> +		}
> +	      /* Need to recompute the alignment, so just retry at the new
> +		 position.  */
> +	      continue;
> +	    }
> +	}
>  
> +    found:
> +      ++ret;
> +
> +      if (split_stores)
> +	{
> +	  struct split_store *store
> +	    = new split_store (try_pos, try_size, align);
> +	  info = find_constituent_stmts (group, &store->orig_stmts,
> +	  				 &first, try_bitpos, try_size);
> +	  if (info
> +	      && info->bitpos >= try_bitpos
> +	      && info->bitpos + info->bitsize <= try_bitpos + try_size)
> +	    store->orig = true;
> +	  split_stores->safe_push (store);
> +	}
> +
> +      try_pos += try_size / BITS_PER_UNIT;
>        size -= try_size;
> -      align = try_size;
> -      while (size < try_size)
> -	try_size /= 2;
>      }
> -  return true;
> +
> +  return ret;
>  }
>  
>  /* Given a merged store group GROUP output the widened version of it.
> @@ -1132,31 +1274,50 @@ split_group (merged_store_group *group,
>  bool
>  imm_store_chain_info::output_merged_store (merged_store_group *group)
>  {
> -  unsigned HOST_WIDE_INT start_byte_pos = group->start / BITS_PER_UNIT;
> +  unsigned HOST_WIDE_INT start_byte_pos
> +    = group->bitregion_start / BITS_PER_UNIT;
>  
>    unsigned int orig_num_stmts = group->stores.length ();
>    if (orig_num_stmts < 2)
>      return false;
>  
> -  auto_vec<struct split_store *> split_stores;
> +  auto_vec<struct split_store *, 32> split_stores;
>    split_stores.create (0);
> -  if (!split_group (group, split_stores))
> -    return false;
> +  bool allow_unaligned
> +    = !STRICT_ALIGNMENT && PARAM_VALUE (PARAM_STORE_MERGING_ALLOW_UNALIGNED);
> +  if (allow_unaligned)
> +    {
> +      /* If unaligned stores are allowed, see how many stores we'd emit
> +	 for unaligned and how many stores we'd emit for aligned stores.
> +	 Only use unaligned stores if it allows fewer stores than aligned.  */
> +      unsigned aligned_cnt = split_group (group, false, NULL);
> +      unsigned unaligned_cnt = split_group (group, true, NULL);
> +      if (aligned_cnt <= unaligned_cnt)
> +	allow_unaligned = false;
> +    }
> +  split_group (group, allow_unaligned, &split_stores);
> +
> +  if (split_stores.length () >= orig_num_stmts)
> +    {
> +      /* We didn't manage to reduce the number of statements.  Bail out.  */
> +      if (dump_file && (dump_flags & TDF_DETAILS))
> +	{
> +	  fprintf (dump_file, "Exceeded original number of stmts (%u)."
> +			      "  Not profitable to emit new sequence.\n",
> +		   orig_num_stmts);
> +	}
> +      return false;
> +    }
>  
>    gimple_stmt_iterator last_gsi = gsi_for_stmt (group->last_stmt);
>    gimple_seq seq = NULL;
> -  unsigned int num_stmts = 0;
>    tree last_vdef, new_vuse;
>    last_vdef = gimple_vdef (group->last_stmt);
>    new_vuse = gimple_vuse (group->last_stmt);
>  
>    gimple *stmt = NULL;
> -  /* The new SSA names created.  Keep track of them so that we can free them
> -     if we decide to not use the new sequence.  */
> -  auto_vec<tree> new_ssa_names;
>    split_store *split_store;
>    unsigned int i;
> -  bool fail = false;
>  
>    tree addr = force_gimple_operand_1 (unshare_expr (base_addr), &seq,
>  				      is_gimple_mem_ref_addr, NULL_TREE);
> @@ -1165,48 +1326,76 @@ imm_store_chain_info::output_merged_stor
>        unsigned HOST_WIDE_INT try_size = split_store->size;
>        unsigned HOST_WIDE_INT try_pos = split_store->bytepos;
>        unsigned HOST_WIDE_INT align = split_store->align;
> -      tree offset_type = get_alias_type_for_stmts (split_store->orig_stmts);
> -      location_t loc = get_location_for_stmts (split_store->orig_stmts);
> -
> -      tree int_type = build_nonstandard_integer_type (try_size, UNSIGNED);
> -      int_type = build_aligned_type (int_type, align);
> -      tree dest = fold_build2 (MEM_REF, int_type, addr,
> -			       build_int_cst (offset_type, try_pos));
> -
> -      tree src = native_interpret_expr (int_type,
> -					group->val + try_pos - start_byte_pos,
> -					group->buf_size);
> +      tree dest, src;
> +      location_t loc;
> +      if (split_store->orig)
> +	{
> +	  /* If there is just a single constituent store which covers
> +	     the whole area, just reuse the lhs and rhs.  */
> +	  dest = gimple_assign_lhs (split_store->orig_stmts[0]);
> +	  src = gimple_assign_rhs1 (split_store->orig_stmts[0]);
> +	  loc = gimple_location (split_store->orig_stmts[0]);
> +	}
> +      else
> +	{
> +	  tree offset_type
> +	    = get_alias_type_for_stmts (split_store->orig_stmts);
> +	  loc = get_location_for_stmts (split_store->orig_stmts);
> +
> +	  tree int_type = build_nonstandard_integer_type (try_size, UNSIGNED);
> +	  int_type = build_aligned_type (int_type, align);
> +	  dest = fold_build2 (MEM_REF, int_type, addr,
> +			      build_int_cst (offset_type, try_pos));
> +	  src = native_interpret_expr (int_type,
> +				       group->val + try_pos - start_byte_pos,
> +				       group->buf_size);
> +	  tree mask
> +	    = native_interpret_expr (int_type,
> +				     group->mask + try_pos - start_byte_pos,
> +				     group->buf_size);
> +	  if (!integer_zerop (mask))
> +	    {
> +	      tree tem = make_ssa_name (int_type);
> +	      tree load_src = unshare_expr (dest);
> +	      /* The load might load some or all bits uninitialized,
> +		 avoid -W*uninitialized warnings in that case.
> +		 As optimization, it would be nice if all the bits are
> +		 provably uninitialized (no stores at all yet or previous
> +		 store a CLOBBER) we'd optimize away the load and replace
> +		 it e.g. with 0.  */
> +	      TREE_NO_WARNING (load_src) = 1;
> +	      stmt = gimple_build_assign (tem, load_src);
> +	      gimple_set_location (stmt, loc);
> +	      gimple_set_vuse (stmt, new_vuse);
> +	      gimple_seq_add_stmt_without_update (&seq, stmt);
> +
> +	      /* FIXME: If there is a single chunk of zero bits in mask,
> +		 perhaps use BIT_INSERT_EXPR instead?  */
> +	      stmt = gimple_build_assign (make_ssa_name (int_type),
> +					  BIT_AND_EXPR, tem, mask);
> +	      gimple_set_location (stmt, loc);
> +	      gimple_seq_add_stmt_without_update (&seq, stmt);
> +	      tem = gimple_assign_lhs (stmt);
> +
> +	      src = wide_int_to_tree (int_type,
> +				      wi::bit_and_not (wi::to_wide (src),
> +						       wi::to_wide (mask)));
> +	      stmt = gimple_build_assign (make_ssa_name (int_type),
> +					  BIT_IOR_EXPR, tem, src);
> +	      gimple_set_location (stmt, loc);
> +	      gimple_seq_add_stmt_without_update (&seq, stmt);
> +	      src = gimple_assign_lhs (stmt);
> +	    }
> +	}
>  
>        stmt = gimple_build_assign (dest, src);
>        gimple_set_location (stmt, loc);
>        gimple_set_vuse (stmt, new_vuse);
>        gimple_seq_add_stmt_without_update (&seq, stmt);
>  
> -      /* We didn't manage to reduce the number of statements.  Bail out.  */
> -      if (++num_stmts == orig_num_stmts)
> -	{
> -	  if (dump_file && (dump_flags & TDF_DETAILS))
> -	    {
> -	      fprintf (dump_file, "Exceeded original number of stmts (%u)."
> -				  "  Not profitable to emit new sequence.\n",
> -		       orig_num_stmts);
> -	    }
> -	  unsigned int ssa_count;
> -	  tree ssa_name;
> -	  /* Don't forget to cleanup the temporary SSA names.  */
> -	  FOR_EACH_VEC_ELT (new_ssa_names, ssa_count, ssa_name)
> -	    release_ssa_name (ssa_name);
> -
> -	  fail = true;
> -	  break;
> -	}
> -
>        tree new_vdef;
>        if (i < split_stores.length () - 1)
> -	{
> -	  new_vdef = make_ssa_name (gimple_vop (cfun), stmt);
> -	  new_ssa_names.safe_push (new_vdef);
> -	}
> +	new_vdef = make_ssa_name (gimple_vop (cfun), stmt);
>        else
>  	new_vdef = last_vdef;
>  
> @@ -1218,15 +1407,12 @@ imm_store_chain_info::output_merged_stor
>    FOR_EACH_VEC_ELT (split_stores, i, split_store)
>      delete split_store;
>  
> -  if (fail)
> -    return false;
> -
>    gcc_assert (seq);
>    if (dump_file)
>      {
>        fprintf (dump_file,
>  	       "New sequence of %u stmts to replace old one of %u stmts\n",
> -	       num_stmts, orig_num_stmts);
> +	       split_stores.length (), orig_num_stmts);
>        if (dump_flags & TDF_DETAILS)
>  	print_gimple_seq (dump_file, seq, 0, TDF_VOPS | TDF_MEMSYMS);
>      }
> @@ -1387,12 +1573,25 @@ pass_store_merging::execute (function *f
>  	      tree rhs = gimple_assign_rhs1 (stmt);
>  
>  	      HOST_WIDE_INT bitsize, bitpos;
> +	      unsigned HOST_WIDE_INT bitregion_start = 0;
> +	      unsigned HOST_WIDE_INT bitregion_end = 0;
>  	      machine_mode mode;
>  	      int unsignedp = 0, reversep = 0, volatilep = 0;
>  	      tree offset, base_addr;
>  	      base_addr
>  		= get_inner_reference (lhs, &bitsize, &bitpos, &offset, &mode,
>  				       &unsignedp, &reversep, &volatilep);
> +	      if (TREE_CODE (lhs) == COMPONENT_REF
> +		  && DECL_BIT_FIELD_TYPE (TREE_OPERAND (lhs, 1)))
> +		{
> +		  get_bit_range (&bitregion_start, &bitregion_end, lhs,
> +				 &bitpos, &offset);
> +		  if (bitregion_end)
> +		    ++bitregion_end;
> +		}
> +	      if (bitsize == 0)
> +		continue;
> +
>  	      /* As a future enhancement we could handle stores with the same
>  		 base and offset.  */
>  	      bool invalid = reversep
> @@ -1414,7 +1613,26 @@ pass_store_merging::execute (function *f
>  		  bit_off = byte_off << LOG2_BITS_PER_UNIT;
>  		  bit_off += bitpos;
>  		  if (!wi::neg_p (bit_off) && wi::fits_shwi_p (bit_off))
> -		    bitpos = bit_off.to_shwi ();
> +		    {
> +		      bitpos = bit_off.to_shwi ();
> +		      if (bitregion_end)
> +			{
> +			  bit_off = byte_off << LOG2_BITS_PER_UNIT;
> +			  bit_off += bitregion_start;
> +			  if (wi::fits_uhwi_p (bit_off))
> +			    {
> +			      bitregion_start = bit_off.to_uhwi ();
> +			      bit_off = byte_off << LOG2_BITS_PER_UNIT;
> +			      bit_off += bitregion_end;
> +			      if (wi::fits_uhwi_p (bit_off))
> +				bitregion_end = bit_off.to_uhwi ();
> +			      else
> +				bitregion_end = 0;
> +			    }
> +			  else
> +			    bitregion_end = 0;
> +			}
> +		    }
>  		  else
>  		    invalid = true;
>  		  base_addr = TREE_OPERAND (base_addr, 0);
> @@ -1428,6 +1646,12 @@ pass_store_merging::execute (function *f
>  		  base_addr = build_fold_addr_expr (base_addr);
>  		}
>  
> +	      if (!bitregion_end)
> +		{
> +		  bitregion_start = ROUND_DOWN (bitpos, BITS_PER_UNIT);
> +		  bitregion_end = ROUND_UP (bitpos + bitsize, BITS_PER_UNIT);
> +		}
> +
>  	      if (! invalid
>  		  && offset != NULL_TREE)
>  		{
> @@ -1457,9 +1681,11 @@ pass_store_merging::execute (function *f
>  		  store_immediate_info *info;
>  		  if (chain_info)
>  		    {
> -		      info = new store_immediate_info (
> -			bitsize, bitpos, stmt,
> -			(*chain_info)->m_store_info.length ());
> +		      unsigned int ord = (*chain_info)->m_store_info.length ();
> +		      info = new store_immediate_info (bitsize, bitpos,
> +						       bitregion_start,
> +						       bitregion_end,
> +						       stmt, ord);
>  		      if (dump_file && (dump_flags & TDF_DETAILS))
>  			{
>  			  fprintf (dump_file,
> @@ -1488,6 +1714,8 @@ pass_store_merging::execute (function *f
>  		  struct imm_store_chain_info *new_chain
>  		    = new imm_store_chain_info (m_stores_head, base_addr);
>  		  info = new store_immediate_info (bitsize, bitpos,
> +						   bitregion_start,
> +						   bitregion_end,
>  						   stmt, 0);
>  		  new_chain->m_store_info.safe_push (info);
>  		  m_stores.put (base_addr, new_chain);
> --- gcc/testsuite/gcc.dg/store_merging_10.c.jj	2017-10-27 14:52:29.724755656 +0200
> +++ gcc/testsuite/gcc.dg/store_merging_10.c	2017-10-27 14:52:29.724755656 +0200
> @@ -0,0 +1,56 @@
> +/* { dg-do compile } */
> +/* { dg-require-effective-target store_merge } */
> +/* { dg-options "-O2 -fdump-tree-store-merging" } */
> +
> +struct S {
> +  unsigned int b1:1;
> +  unsigned int b2:1;
> +  unsigned int b3:1;
> +  unsigned int b4:1;
> +  unsigned int b5:1;
> +  unsigned int b6:27;
> +};
> +
> +struct T {
> +  unsigned int b1:1;
> +  unsigned int b2:16;
> +  unsigned int b3:14;
> +  unsigned int b4:1;
> +};
> +
> +__attribute__((noipa)) void
> +foo (struct S *x)
> +{
> +  x->b1 = 1;
> +  x->b2 = 0;
> +  x->b3 = 1;
> +  x->b4 = 1;
> +  x->b5 = 0;
> +}
> +
> +__attribute__((noipa)) void
> +bar (struct T *x)
> +{
> +  x->b1 = 1;
> +  x->b2 = 0;
> +  x->b4 = 0;
> +}
> +
> +struct S s = { 0, 1, 0, 0, 1, 0x3a5f05a };
> +struct T t = { 0, 0xf5af, 0x3a5a, 1 };
> +
> +int
> +main ()
> +{
> +  asm volatile ("" : : : "memory");
> +  foo (&s);
> +  bar (&t);
> +  asm volatile ("" : : : "memory");
> +  if (s.b1 != 1 || s.b2 != 0 || s.b3 != 1 || s.b4 != 1 || s.b5 != 0 || s.b6 != 0x3a5f05a)
> +    __builtin_abort ();
> +  if (t.b1 != 1 || t.b2 != 0 || t.b3 != 0x3a5a || t.b4 != 0)
> +    __builtin_abort ();
> +  return 0;
> +}
> +
> +/* { dg-final { scan-tree-dump-times "Merging successful" 2 "store-merging" } } */
> --- gcc/testsuite/gcc.dg/store_merging_11.c.jj	2017-10-27 14:52:29.725755644 +0200
> +++ gcc/testsuite/gcc.dg/store_merging_11.c	2017-10-27 14:52:29.725755644 +0200
> @@ -0,0 +1,47 @@
> +/* { dg-do compile } */
> +/* { dg-require-effective-target store_merge } */
> +/* { dg-options "-O2 -fdump-tree-store-merging" } */
> +
> +struct S { unsigned char b[2]; unsigned short c; unsigned char d[4]; unsigned long e; };
> +
> +__attribute__((noipa)) void
> +foo (struct S *p)
> +{
> +  p->b[1] = 1;
> +  p->c = 23;
> +  p->d[0] = 4;
> +  p->d[1] = 5;
> +  p->d[2] = 6;
> +  p->d[3] = 7;
> +  p->e = 8;
> +}
> +
> +__attribute__((noipa)) void
> +bar (struct S *p)
> +{
> +  p->b[1] = 9;
> +  p->c = 112;
> +  p->d[0] = 10;
> +  p->d[1] = 11;
> +}
> +
> +struct S s = { { 30, 31 }, 32, { 33, 34, 35, 36 }, 37 };
> +
> +int
> +main ()
> +{
> +  asm volatile ("" : : : "memory");
> +  foo (&s);
> +  asm volatile ("" : : : "memory");
> +  if (s.b[0] != 30 || s.b[1] != 1 || s.c != 23 || s.d[0] != 4 || s.d[1] != 5
> +      || s.d[2] != 6 || s.d[3] != 7 || s.e != 8)
> +    __builtin_abort ();
> +  bar (&s);
> +  asm volatile ("" : : : "memory");
> +  if (s.b[0] != 30 || s.b[1] != 9 || s.c != 112 || s.d[0] != 10 || s.d[1] != 11
> +      || s.d[2] != 6 || s.d[3] != 7 || s.e != 8)
> +    __builtin_abort ();
> +  return 0;
> +}
> +
> +/* { dg-final { scan-tree-dump-times "Merging successful" 2 "store-merging" } } */
> --- gcc/testsuite/gcc.dg/store_merging_12.c.jj	2017-10-27 15:00:20.046976487 +0200
> +++ gcc/testsuite/gcc.dg/store_merging_12.c	2017-10-27 14:59:56.000000000 +0200
> @@ -0,0 +1,11 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -Wall" } */
> +
> +struct S { unsigned int b1:1, b2:1, b3:1, b4:1, b5:1, b6:27; };
> +void bar (struct S *);
> +void foo (int x)
> +{
> +  struct S s;
> +  s.b2 = 1; s.b3 = 0; s.b4 = 1; s.b5 = 0; s.b1 = x; s.b6 = x;	/* { dg-bogus "is used uninitialized in this function" } */
> +  bar (&s);
> +}
> --- gcc/testsuite/g++.dg/pr71694.C.jj	2016-12-16 11:24:32.000000000 +0100
> +++ gcc/testsuite/g++.dg/pr71694.C	2017-10-27 16:53:09.278596219 +0200
> @@ -1,5 +1,5 @@
>  /* { dg-do compile } */
> -/* { dg-options "-O2" } */
> +/* { dg-options "-O2 -fno-store-merging" } */
>  
>  struct B {
>      B() {}
> 
> 
> 	Jakub
> 
>
diff mbox series

Patch

--- gcc/gimple-ssa-store-merging.c.jj	2017-10-27 14:16:26.074249585 +0200
+++ gcc/gimple-ssa-store-merging.c	2017-10-27 18:10:18.212762773 +0200
@@ -126,6 +126,8 @@ 
 #include "tree-eh.h"
 #include "target.h"
 #include "gimplify-me.h"
+#include "rtl.h"
+#include "expr.h"	/* For get_bit_range.  */
 #include "selftest.h"
 
 /* The maximum size (in bits) of the stores this pass should generate.  */
@@ -142,17 +144,24 @@  struct store_immediate_info
 {
   unsigned HOST_WIDE_INT bitsize;
   unsigned HOST_WIDE_INT bitpos;
+  unsigned HOST_WIDE_INT bitregion_start;
+  /* This is one past the last bit of the bit region.  */
+  unsigned HOST_WIDE_INT bitregion_end;
   gimple *stmt;
   unsigned int order;
   store_immediate_info (unsigned HOST_WIDE_INT, unsigned HOST_WIDE_INT,
+			unsigned HOST_WIDE_INT, unsigned HOST_WIDE_INT,
 			gimple *, unsigned int);
 };
 
 store_immediate_info::store_immediate_info (unsigned HOST_WIDE_INT bs,
 					    unsigned HOST_WIDE_INT bp,
+					    unsigned HOST_WIDE_INT brs,
+					    unsigned HOST_WIDE_INT bre,
 					    gimple *st,
 					    unsigned int ord)
-  : bitsize (bs), bitpos (bp), stmt (st), order (ord)
+  : bitsize (bs), bitpos (bp), bitregion_start (brs), bitregion_end (bre),
+    stmt (st), order (ord)
 {
 }
 
@@ -164,26 +173,32 @@  struct merged_store_group
 {
   unsigned HOST_WIDE_INT start;
   unsigned HOST_WIDE_INT width;
-  /* The size of the allocated memory for val.  */
+  unsigned HOST_WIDE_INT bitregion_start;
+  unsigned HOST_WIDE_INT bitregion_end;
+  /* The size of the allocated memory for val and mask.  */
   unsigned HOST_WIDE_INT buf_size;
+  unsigned HOST_WIDE_INT align_base;
 
   unsigned int align;
   unsigned int first_order;
   unsigned int last_order;
 
-  auto_vec<struct store_immediate_info *> stores;
+  auto_vec<store_immediate_info *> stores;
   /* We record the first and last original statements in the sequence because
      we'll need their vuse/vdef and replacement position.  It's easier to keep
      track of them separately as 'stores' is reordered by apply_stores.  */
   gimple *last_stmt;
   gimple *first_stmt;
   unsigned char *val;
+  unsigned char *mask;
 
   merged_store_group (store_immediate_info *);
   ~merged_store_group ();
   void merge_into (store_immediate_info *);
   void merge_overlapping (store_immediate_info *);
   bool apply_stores ();
+private:
+  void do_merge (store_immediate_info *);
 };
 
 /* Debug helper.  Dump LEN elements of byte array PTR to FD in hex.  */
@@ -287,8 +302,7 @@  clear_bit_region_be (unsigned char *ptr,
 	   && len > BITS_PER_UNIT)
     {
       unsigned int nbytes = len / BITS_PER_UNIT;
-      for (unsigned int i = 0; i < nbytes; i++)
-	ptr[i] = 0U;
+      memset (ptr, 0, nbytes);
       if (len % BITS_PER_UNIT != 0)
 	clear_bit_region_be (ptr + nbytes, BITS_PER_UNIT - 1,
 			     len % BITS_PER_UNIT);
@@ -549,10 +563,16 @@  merged_store_group::merged_store_group (
 {
   start = info->bitpos;
   width = info->bitsize;
+  bitregion_start = info->bitregion_start;
+  bitregion_end = info->bitregion_end;
   /* VAL has memory allocated for it in apply_stores once the group
      width has been finalized.  */
   val = NULL;
-  align = get_object_alignment (gimple_assign_lhs (info->stmt));
+  mask = NULL;
+  unsigned HOST_WIDE_INT align_bitpos = 0;
+  get_object_alignment_1 (gimple_assign_lhs (info->stmt),
+			  &align, &align_bitpos);
+  align_base = start - align_bitpos;
   stores.create (1);
   stores.safe_push (info);
   last_stmt = info->stmt;
@@ -568,18 +588,24 @@  merged_store_group::~merged_store_group
     XDELETEVEC (val);
 }
 
-/* Merge a store recorded by INFO into this merged store.
-   The store is not overlapping with the existing recorded
-   stores.  */
-
+/* Helper method for merge_into and merge_overlapping to do
+   the common part.  */
 void
-merged_store_group::merge_into (store_immediate_info *info)
+merged_store_group::do_merge (store_immediate_info *info)
 {
-  unsigned HOST_WIDE_INT wid = info->bitsize;
-  /* Make sure we're inserting in the position we think we're inserting.  */
-  gcc_assert (info->bitpos == start + width);
+  bitregion_start = MIN (bitregion_start, info->bitregion_start);
+  bitregion_end = MAX (bitregion_end, info->bitregion_end);
+
+  unsigned int this_align;
+  unsigned HOST_WIDE_INT align_bitpos = 0;
+  get_object_alignment_1 (gimple_assign_lhs (info->stmt),
+			  &this_align, &align_bitpos);
+  if (this_align > align)
+    {
+      align = this_align;
+      align_base = info->bitpos - align_bitpos;
+    }
 
-  width += wid;
   gimple *stmt = info->stmt;
   stores.safe_push (info);
   if (info->order > last_order)
@@ -594,6 +620,22 @@  merged_store_group::merge_into (store_im
     }
 }
 
+/* Merge a store recorded by INFO into this merged store.
+   The store is not overlapping with the existing recorded
+   stores.  */
+
+void
+merged_store_group::merge_into (store_immediate_info *info)
+{
+  unsigned HOST_WIDE_INT wid = info->bitsize;
+  /* Make sure we're inserting in the position we think we're inserting.  */
+  gcc_assert (info->bitpos >= start + width
+	      && info->bitregion_start <= bitregion_end);
+
+  width += wid;
+  do_merge (info);
+}
+
 /* Merge a store described by INFO into this merged store.
    INFO overlaps in some way with the current store (i.e. it's not contiguous
    which is handled by merged_store_group::merge_into).  */
@@ -601,23 +643,11 @@  merged_store_group::merge_into (store_im
 void
 merged_store_group::merge_overlapping (store_immediate_info *info)
 {
-  gimple *stmt = info->stmt;
-  stores.safe_push (info);
-
   /* If the store extends the size of the group, extend the width.  */
-  if ((info->bitpos + info->bitsize) > (start + width))
+  if (info->bitpos + info->bitsize > start + width)
     width += info->bitpos + info->bitsize - (start + width);
 
-  if (info->order > last_order)
-    {
-      last_order = info->order;
-      last_stmt = stmt;
-    }
-  else if (info->order < first_order)
-    {
-      first_order = info->order;
-      first_stmt = stmt;
-    }
+  do_merge (info);
 }
 
 /* Go through all the recorded stores in this group in program order and
@@ -627,27 +657,28 @@  merged_store_group::merge_overlapping (s
 bool
 merged_store_group::apply_stores ()
 {
-  /* The total width of the stores must add up to a whole number of bytes
-     and start at a byte boundary.  We don't support emitting bitfield
-     references for now.  Also, make sure we have more than one store
-     in the group, otherwise we cannot merge anything.  */
-  if (width % BITS_PER_UNIT != 0
-      || start % BITS_PER_UNIT != 0
+  /* Make sure we have more than one store in the group, otherwise we cannot
+     merge anything.  */
+  if (bitregion_start % BITS_PER_UNIT != 0
+      || bitregion_end % BITS_PER_UNIT != 0
       || stores.length () == 1)
     return false;
 
   stores.qsort (sort_by_order);
-  struct store_immediate_info *info;
+  store_immediate_info *info;
   unsigned int i;
   /* Create a buffer of a size that is 2 times the number of bytes we're
      storing.  That way native_encode_expr can write power-of-2-sized
      chunks without overrunning.  */
-  buf_size = 2 * (ROUND_UP (width, BITS_PER_UNIT) / BITS_PER_UNIT);
-  val = XCNEWVEC (unsigned char, buf_size);
+  buf_size = 2 * ((bitregion_end - bitregion_start) / BITS_PER_UNIT);
+  val = XNEWVEC (unsigned char, 2 * buf_size);
+  mask = val + buf_size;
+  memset (val, 0, buf_size);
+  memset (mask, ~0U, buf_size);
 
   FOR_EACH_VEC_ELT (stores, i, info)
     {
-      unsigned int pos_in_buffer = info->bitpos - start;
+      unsigned int pos_in_buffer = info->bitpos - bitregion_start;
       bool ret = encode_tree_to_bitpos (gimple_assign_rhs1 (info->stmt),
 					val, info->bitsize,
 					pos_in_buffer, buf_size);
@@ -668,6 +699,11 @@  merged_store_group::apply_stores ()
         }
       if (!ret)
 	return false;
+      unsigned char *m = mask + (pos_in_buffer / BITS_PER_UNIT);
+      if (BYTES_BIG_ENDIAN)
+	clear_bit_region_be (m, pos_in_buffer % BITS_PER_UNIT, info->bitsize);
+      else
+	clear_bit_region (m, pos_in_buffer % BITS_PER_UNIT, info->bitsize);
     }
   return true;
 }
@@ -682,7 +718,7 @@  struct imm_store_chain_info
      See pass_store_merging::m_stores_head for more rationale.  */
   imm_store_chain_info *next, **pnxp;
   tree base_addr;
-  auto_vec<struct store_immediate_info *> m_store_info;
+  auto_vec<store_immediate_info *> m_store_info;
   auto_vec<merged_store_group *> m_merged_store_groups;
 
   imm_store_chain_info (imm_store_chain_info *&inspt, tree b_a)
@@ -730,11 +766,16 @@  public:
   {
   }
 
-  /* Pass not supported for PDP-endianness.  */
+  /* Pass not supported for PDP-endianness, nor for insane hosts
+     or target character sizes where native_{encode,interpret}_expr
+     doesn't work properly.  */
   virtual bool
   gate (function *)
   {
-    return flag_store_merging && (WORDS_BIG_ENDIAN == BYTES_BIG_ENDIAN);
+    return flag_store_merging
+	   && WORDS_BIG_ENDIAN == BYTES_BIG_ENDIAN
+	   && CHAR_BIT == 8
+	   && BITS_PER_UNIT == 8;
   }
 
   virtual unsigned int execute (function *);
@@ -811,7 +852,7 @@  pass_store_merging::terminate_all_aliasi
 	 aliases with any of them.  */
       else
 	{
-	  struct store_immediate_info *info;
+	  store_immediate_info *info;
 	  unsigned int i;
 	  FOR_EACH_VEC_ELT ((*chain_info)->m_store_info, i, info)
 	    {
@@ -926,8 +967,9 @@  imm_store_chain_info::coalesce_immediate
 	}
 
       /* |---store 1---| <gap> |---store 2---|.
-	 Gap between stores.  Start a new group.  */
-      if (start != merged_store->start + merged_store->width)
+	 Gap between stores.  Start a new group if there are any gaps
+	 between bitregions.  */
+      if (info->bitregion_start > merged_store->bitregion_end)
 	{
 	  /* Try to apply all the stores recorded for the group to determine
 	     the bitpattern they write and discard it if that fails.
@@ -948,11 +990,11 @@  imm_store_chain_info::coalesce_immediate
        merged_store->merge_into (info);
     }
 
-    /* Record or discard the last store group.  */
-    if (!merged_store->apply_stores ())
-      delete merged_store;
-    else
-      m_merged_store_groups.safe_push (merged_store);
+  /* Record or discard the last store group.  */
+  if (!merged_store->apply_stores ())
+    delete merged_store;
+  else
+    m_merged_store_groups.safe_push (merged_store);
 
   gcc_assert (m_merged_store_groups.length () <= m_store_info.length ());
   bool success
@@ -961,8 +1003,8 @@  imm_store_chain_info::coalesce_immediate
 
   if (success && dump_file)
     fprintf (dump_file, "Coalescing successful!\n"
-			 "Merged into %u stores\n",
-		m_merged_store_groups.length ());
+			"Merged into %u stores\n",
+	     m_merged_store_groups.length ());
 
   return success;
 }
@@ -1016,6 +1058,8 @@  struct split_store
   unsigned HOST_WIDE_INT size;
   unsigned HOST_WIDE_INT align;
   auto_vec<gimple *> orig_stmts;
+  /* True if there is a single orig stmt covering the whole split store.  */
+  bool orig;
   split_store (unsigned HOST_WIDE_INT, unsigned HOST_WIDE_INT,
 	       unsigned HOST_WIDE_INT);
 };
@@ -1025,100 +1069,198 @@  struct split_store
 split_store::split_store (unsigned HOST_WIDE_INT bp,
 			  unsigned HOST_WIDE_INT sz,
 			  unsigned HOST_WIDE_INT al)
-			  : bytepos (bp), size (sz), align (al)
+			  : bytepos (bp), size (sz), align (al), orig (false)
 {
   orig_stmts.create (0);
 }
 
 /* Record all statements corresponding to stores in GROUP that write to
    the region starting at BITPOS and is of size BITSIZE.  Record such
-   statements in STMTS.  The stores in GROUP must be sorted by
-   bitposition.  */
+   statements in STMTS if non-NULL.  The stores in GROUP must be sorted by
+   bitposition.  Return INFO if there is exactly one original store
+   in the range.  */
 
-static void
+static store_immediate_info *
 find_constituent_stmts (struct merged_store_group *group,
-			 auto_vec<gimple *> &stmts,
-			 unsigned HOST_WIDE_INT bitpos,
-			 unsigned HOST_WIDE_INT bitsize)
+			vec<gimple *> *stmts,
+			unsigned int *first,
+			unsigned HOST_WIDE_INT bitpos,
+			unsigned HOST_WIDE_INT bitsize)
 {
-  struct store_immediate_info *info;
+  store_immediate_info *info, *ret = NULL;
   unsigned int i;
+  bool second = false;
+  bool update_first = true;
   unsigned HOST_WIDE_INT end = bitpos + bitsize;
-  FOR_EACH_VEC_ELT (group->stores, i, info)
+  for (i = *first; group->stores.iterate (i, &info); ++i)
     {
       unsigned HOST_WIDE_INT stmt_start = info->bitpos;
       unsigned HOST_WIDE_INT stmt_end = stmt_start + info->bitsize;
-      if (stmt_end < bitpos)
-	continue;
+      if (stmt_end <= bitpos)
+	{
+	  /* BITPOS passed to this function never decreases from within the
+	     same split_group call, so optimize and don't scan info records
+	     which are known to end before or at BITPOS next time.
+	     Only do it if all stores before this one also pass this.  */
+	  if (update_first)
+	    *first = i + 1;
+	  continue;
+	}
+      else
+	update_first = false;
+
       /* The stores in GROUP are ordered by bitposition so if we're past
-	  the region for this group return early.  */
-      if (stmt_start > end)
-	return;
-
-      if (IN_RANGE (stmt_start, bitpos, bitpos + bitsize)
-	  || IN_RANGE (stmt_end, bitpos, end)
-	  /* The statement writes a region that completely encloses the region
-	     that this group writes.  Unlikely to occur but let's
-	     handle it.  */
-	  || IN_RANGE (bitpos, stmt_start, stmt_end))
-	stmts.safe_push (info->stmt);
+	 the region for this group return early.  */
+      if (stmt_start >= end)
+	return ret;
+
+      if (stmts)
+	{
+	  stmts->safe_push (info->stmt);
+	  if (ret)
+	    {
+	      ret = NULL;
+	      second = true;
+	    }
+	}
+      else if (ret)
+	return NULL;
+      if (!second)
+	ret = info;
     }
+  return ret;
 }
 
 /* Split a merged store described by GROUP by populating the SPLIT_STORES
-   vector with split_store structs describing the byte offset (from the base),
-   the bit size and alignment of each store as well as the original statements
-   involved in each such split group.
+   vector (if non-NULL) with split_store structs describing the byte offset
+   (from the base), the bit size and alignment of each store as well as the
+   original statements involved in each such split group.
    This is to separate the splitting strategy from the statement
    building/emission/linking done in output_merged_store.
-   At the moment just start with the widest possible size and keep emitting
-   the widest we can until we have emitted all the bytes, halving the size
-   when appropriate.  */
-
-static bool
-split_group (merged_store_group *group,
-	     auto_vec<struct split_store *> &split_stores)
+   Return number of new stores.
+   If SPLIT_STORES is NULL, it is just a dry run to count number of
+   new stores.  */
+
+static unsigned int
+split_group (merged_store_group *group, bool allow_unaligned,
+	     vec<struct split_store *> *split_stores)
 {
-  unsigned HOST_WIDE_INT pos = group->start;
-  unsigned HOST_WIDE_INT size = group->width;
+  unsigned HOST_WIDE_INT pos = group->bitregion_start;
+  unsigned HOST_WIDE_INT size = group->bitregion_end - pos;
   unsigned HOST_WIDE_INT bytepos = pos / BITS_PER_UNIT;
-  unsigned HOST_WIDE_INT align = group->align;
+  unsigned HOST_WIDE_INT group_align = group->align;
+  unsigned HOST_WIDE_INT align_base = group->align_base;
 
-  /* We don't handle partial bitfields for now.  We shouldn't have
-     reached this far.  */
   gcc_assert ((size % BITS_PER_UNIT == 0) && (pos % BITS_PER_UNIT == 0));
 
-  bool allow_unaligned
-    = !STRICT_ALIGNMENT && PARAM_VALUE (PARAM_STORE_MERGING_ALLOW_UNALIGNED);
-
-  unsigned int try_size = MAX_STORE_BITSIZE;
-  while (try_size > size
-	 || (!allow_unaligned
-	     && try_size > align))
-    {
-      try_size /= 2;
-      if (try_size < BITS_PER_UNIT)
-	return false;
-    }
-
+  unsigned int ret = 0, first = 0;
   unsigned HOST_WIDE_INT try_pos = bytepos;
   group->stores.qsort (sort_by_bitpos);
 
   while (size > 0)
     {
-      struct split_store *store = new split_store (try_pos, try_size, align);
+      if ((allow_unaligned || group_align <= BITS_PER_UNIT)
+	  && group->mask[try_pos - bytepos] == (unsigned char) ~0U)
+	{
+	  /* Skip padding bytes.  */
+	  ++try_pos;
+	  size -= BITS_PER_UNIT;
+	  continue;
+	}
+
       unsigned HOST_WIDE_INT try_bitpos = try_pos * BITS_PER_UNIT;
-      find_constituent_stmts (group, store->orig_stmts, try_bitpos, try_size);
-      split_stores.safe_push (store);
+      unsigned int try_size = MAX_STORE_BITSIZE, nonmasked;
+      unsigned HOST_WIDE_INT align_bitpos
+	= (try_bitpos - align_base) & (group_align - 1);
+      unsigned HOST_WIDE_INT align = group_align;
+      if (align_bitpos)
+	align = least_bit_hwi (align_bitpos);
+      if (!allow_unaligned)
+	try_size = MIN (try_size, align);
+      store_immediate_info *info
+	= find_constituent_stmts (group, NULL, &first, try_bitpos, try_size);
+      if (info)
+	{
+	  /* If there is just one original statement for the range, see if
+	     we can just reuse the original store which could be even larger
+	     than try_size.  */
+	  unsigned HOST_WIDE_INT stmt_end
+	    = ROUND_UP (info->bitpos + info->bitsize, BITS_PER_UNIT);
+	  info = find_constituent_stmts (group, NULL, &first, try_bitpos,
+					 stmt_end - try_bitpos);
+	  if (info && info->bitpos >= try_bitpos)
+	    {
+	      try_size = stmt_end - try_bitpos;
+	      goto found;
+	    }
+	}
 
-      try_pos += try_size / BITS_PER_UNIT;
+      /* Approximate store bitsize for the case when there are no padding
+	 bits.  */
+      while (try_size > size)
+	try_size /= 2;
+      /* Now look for whole padding bytes at the end of that bitsize.  */
+      for (nonmasked = try_size / BITS_PER_UNIT; nonmasked > 0; --nonmasked)
+	if (group->mask[try_pos - bytepos + nonmasked - 1]
+	    != (unsigned char) ~0U)
+	  break;
+      if (nonmasked == 0)
+	{
+	  /* If entire try_size range is padding, skip it.  */
+	  try_pos += try_size / BITS_PER_UNIT;
+	  size -= try_size;
+	  continue;
+	}
+      /* Otherwise try to decrease try_size if second half, last 3 quarters
+	 etc. are padding.  */
+      nonmasked *= BITS_PER_UNIT;
+      while (nonmasked <= try_size / 2)
+	try_size /= 2;
+      if (!allow_unaligned && group_align > BITS_PER_UNIT)
+	{
+	  /* Now look for whole padding bytes at the start of that bitsize.  */
+	  unsigned int try_bytesize = try_size / BITS_PER_UNIT, masked;
+	  for (masked = 0; masked < try_bytesize; ++masked)
+	    if (group->mask[try_pos - bytepos + masked] != (unsigned char) ~0U)
+	      break;
+	  masked *= BITS_PER_UNIT;
+	  gcc_assert (masked < try_size);
+	  if (masked >= try_size / 2)
+	    {
+	      while (masked >= try_size / 2)
+		{
+		  try_size /= 2;
+		  try_pos += try_size / BITS_PER_UNIT;
+		  size -= try_size;
+		  masked -= try_size;
+		}
+	      /* Need to recompute the alignment, so just retry at the new
+		 position.  */
+	      continue;
+	    }
+	}
 
+    found:
+      ++ret;
+
+      if (split_stores)
+	{
+	  struct split_store *store
+	    = new split_store (try_pos, try_size, align);
+	  info = find_constituent_stmts (group, &store->orig_stmts,
+	  				 &first, try_bitpos, try_size);
+	  if (info
+	      && info->bitpos >= try_bitpos
+	      && info->bitpos + info->bitsize <= try_bitpos + try_size)
+	    store->orig = true;
+	  split_stores->safe_push (store);
+	}
+
+      try_pos += try_size / BITS_PER_UNIT;
       size -= try_size;
-      align = try_size;
-      while (size < try_size)
-	try_size /= 2;
     }
-  return true;
+
+  return ret;
 }
 
 /* Given a merged store group GROUP output the widened version of it.
@@ -1132,31 +1274,50 @@  split_group (merged_store_group *group,
 bool
 imm_store_chain_info::output_merged_store (merged_store_group *group)
 {
-  unsigned HOST_WIDE_INT start_byte_pos = group->start / BITS_PER_UNIT;
+  unsigned HOST_WIDE_INT start_byte_pos
+    = group->bitregion_start / BITS_PER_UNIT;
 
   unsigned int orig_num_stmts = group->stores.length ();
   if (orig_num_stmts < 2)
     return false;
 
-  auto_vec<struct split_store *> split_stores;
+  auto_vec<struct split_store *, 32> split_stores;
   split_stores.create (0);
-  if (!split_group (group, split_stores))
-    return false;
+  bool allow_unaligned
+    = !STRICT_ALIGNMENT && PARAM_VALUE (PARAM_STORE_MERGING_ALLOW_UNALIGNED);
+  if (allow_unaligned)
+    {
+      /* If unaligned stores are allowed, see how many stores we'd emit
+	 for unaligned and how many stores we'd emit for aligned stores.
+	 Only use unaligned stores if it allows fewer stores than aligned.  */
+      unsigned aligned_cnt = split_group (group, false, NULL);
+      unsigned unaligned_cnt = split_group (group, true, NULL);
+      if (aligned_cnt <= unaligned_cnt)
+	allow_unaligned = false;
+    }
+  split_group (group, allow_unaligned, &split_stores);
+
+  if (split_stores.length () >= orig_num_stmts)
+    {
+      /* We didn't manage to reduce the number of statements.  Bail out.  */
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	{
+	  fprintf (dump_file, "Exceeded original number of stmts (%u)."
+			      "  Not profitable to emit new sequence.\n",
+		   orig_num_stmts);
+	}
+      return false;
+    }
 
   gimple_stmt_iterator last_gsi = gsi_for_stmt (group->last_stmt);
   gimple_seq seq = NULL;
-  unsigned int num_stmts = 0;
   tree last_vdef, new_vuse;
   last_vdef = gimple_vdef (group->last_stmt);
   new_vuse = gimple_vuse (group->last_stmt);
 
   gimple *stmt = NULL;
-  /* The new SSA names created.  Keep track of them so that we can free them
-     if we decide to not use the new sequence.  */
-  auto_vec<tree> new_ssa_names;
   split_store *split_store;
   unsigned int i;
-  bool fail = false;
 
   tree addr = force_gimple_operand_1 (unshare_expr (base_addr), &seq,
 				      is_gimple_mem_ref_addr, NULL_TREE);
@@ -1165,48 +1326,76 @@  imm_store_chain_info::output_merged_stor
       unsigned HOST_WIDE_INT try_size = split_store->size;
       unsigned HOST_WIDE_INT try_pos = split_store->bytepos;
       unsigned HOST_WIDE_INT align = split_store->align;
-      tree offset_type = get_alias_type_for_stmts (split_store->orig_stmts);
-      location_t loc = get_location_for_stmts (split_store->orig_stmts);
-
-      tree int_type = build_nonstandard_integer_type (try_size, UNSIGNED);
-      int_type = build_aligned_type (int_type, align);
-      tree dest = fold_build2 (MEM_REF, int_type, addr,
-			       build_int_cst (offset_type, try_pos));
-
-      tree src = native_interpret_expr (int_type,
-					group->val + try_pos - start_byte_pos,
-					group->buf_size);
+      tree dest, src;
+      location_t loc;
+      if (split_store->orig)
+	{
+	  /* If there is just a single constituent store which covers
+	     the whole area, just reuse the lhs and rhs.  */
+	  dest = gimple_assign_lhs (split_store->orig_stmts[0]);
+	  src = gimple_assign_rhs1 (split_store->orig_stmts[0]);
+	  loc = gimple_location (split_store->orig_stmts[0]);
+	}
+      else
+	{
+	  tree offset_type
+	    = get_alias_type_for_stmts (split_store->orig_stmts);
+	  loc = get_location_for_stmts (split_store->orig_stmts);
+
+	  tree int_type = build_nonstandard_integer_type (try_size, UNSIGNED);
+	  int_type = build_aligned_type (int_type, align);
+	  dest = fold_build2 (MEM_REF, int_type, addr,
+			      build_int_cst (offset_type, try_pos));
+	  src = native_interpret_expr (int_type,
+				       group->val + try_pos - start_byte_pos,
+				       group->buf_size);
+	  tree mask
+	    = native_interpret_expr (int_type,
+				     group->mask + try_pos - start_byte_pos,
+				     group->buf_size);
+	  if (!integer_zerop (mask))
+	    {
+	      tree tem = make_ssa_name (int_type);
+	      tree load_src = unshare_expr (dest);
+	      /* The load might load some or all bits uninitialized,
+		 avoid -W*uninitialized warnings in that case.
+		 As optimization, it would be nice if all the bits are
+		 provably uninitialized (no stores at all yet or previous
+		 store a CLOBBER) we'd optimize away the load and replace
+		 it e.g. with 0.  */
+	      TREE_NO_WARNING (load_src) = 1;
+	      stmt = gimple_build_assign (tem, load_src);
+	      gimple_set_location (stmt, loc);
+	      gimple_set_vuse (stmt, new_vuse);
+	      gimple_seq_add_stmt_without_update (&seq, stmt);
+
+	      /* FIXME: If there is a single chunk of zero bits in mask,
+		 perhaps use BIT_INSERT_EXPR instead?  */
+	      stmt = gimple_build_assign (make_ssa_name (int_type),
+					  BIT_AND_EXPR, tem, mask);
+	      gimple_set_location (stmt, loc);
+	      gimple_seq_add_stmt_without_update (&seq, stmt);
+	      tem = gimple_assign_lhs (stmt);
+
+	      src = wide_int_to_tree (int_type,
+				      wi::bit_and_not (wi::to_wide (src),
+						       wi::to_wide (mask)));
+	      stmt = gimple_build_assign (make_ssa_name (int_type),
+					  BIT_IOR_EXPR, tem, src);
+	      gimple_set_location (stmt, loc);
+	      gimple_seq_add_stmt_without_update (&seq, stmt);
+	      src = gimple_assign_lhs (stmt);
+	    }
+	}
 
       stmt = gimple_build_assign (dest, src);
       gimple_set_location (stmt, loc);
       gimple_set_vuse (stmt, new_vuse);
       gimple_seq_add_stmt_without_update (&seq, stmt);
 
-      /* We didn't manage to reduce the number of statements.  Bail out.  */
-      if (++num_stmts == orig_num_stmts)
-	{
-	  if (dump_file && (dump_flags & TDF_DETAILS))
-	    {
-	      fprintf (dump_file, "Exceeded original number of stmts (%u)."
-				  "  Not profitable to emit new sequence.\n",
-		       orig_num_stmts);
-	    }
-	  unsigned int ssa_count;
-	  tree ssa_name;
-	  /* Don't forget to cleanup the temporary SSA names.  */
-	  FOR_EACH_VEC_ELT (new_ssa_names, ssa_count, ssa_name)
-	    release_ssa_name (ssa_name);
-
-	  fail = true;
-	  break;
-	}
-
       tree new_vdef;
       if (i < split_stores.length () - 1)
-	{
-	  new_vdef = make_ssa_name (gimple_vop (cfun), stmt);
-	  new_ssa_names.safe_push (new_vdef);
-	}
+	new_vdef = make_ssa_name (gimple_vop (cfun), stmt);
       else
 	new_vdef = last_vdef;
 
@@ -1218,15 +1407,12 @@  imm_store_chain_info::output_merged_stor
   FOR_EACH_VEC_ELT (split_stores, i, split_store)
     delete split_store;
 
-  if (fail)
-    return false;
-
   gcc_assert (seq);
   if (dump_file)
     {
       fprintf (dump_file,
 	       "New sequence of %u stmts to replace old one of %u stmts\n",
-	       num_stmts, orig_num_stmts);
+	       split_stores.length (), orig_num_stmts);
       if (dump_flags & TDF_DETAILS)
 	print_gimple_seq (dump_file, seq, 0, TDF_VOPS | TDF_MEMSYMS);
     }
@@ -1387,12 +1573,25 @@  pass_store_merging::execute (function *f
 	      tree rhs = gimple_assign_rhs1 (stmt);
 
 	      HOST_WIDE_INT bitsize, bitpos;
+	      unsigned HOST_WIDE_INT bitregion_start = 0;
+	      unsigned HOST_WIDE_INT bitregion_end = 0;
 	      machine_mode mode;
 	      int unsignedp = 0, reversep = 0, volatilep = 0;
 	      tree offset, base_addr;
 	      base_addr
 		= get_inner_reference (lhs, &bitsize, &bitpos, &offset, &mode,
 				       &unsignedp, &reversep, &volatilep);
+	      if (TREE_CODE (lhs) == COMPONENT_REF
+		  && DECL_BIT_FIELD_TYPE (TREE_OPERAND (lhs, 1)))
+		{
+		  get_bit_range (&bitregion_start, &bitregion_end, lhs,
+				 &bitpos, &offset);
+		  if (bitregion_end)
+		    ++bitregion_end;
+		}
+	      if (bitsize == 0)
+		continue;
+
 	      /* As a future enhancement we could handle stores with the same
 		 base and offset.  */
 	      bool invalid = reversep
@@ -1414,7 +1613,26 @@  pass_store_merging::execute (function *f
 		  bit_off = byte_off << LOG2_BITS_PER_UNIT;
 		  bit_off += bitpos;
 		  if (!wi::neg_p (bit_off) && wi::fits_shwi_p (bit_off))
-		    bitpos = bit_off.to_shwi ();
+		    {
+		      bitpos = bit_off.to_shwi ();
+		      if (bitregion_end)
+			{
+			  bit_off = byte_off << LOG2_BITS_PER_UNIT;
+			  bit_off += bitregion_start;
+			  if (wi::fits_uhwi_p (bit_off))
+			    {
+			      bitregion_start = bit_off.to_uhwi ();
+			      bit_off = byte_off << LOG2_BITS_PER_UNIT;
+			      bit_off += bitregion_end;
+			      if (wi::fits_uhwi_p (bit_off))
+				bitregion_end = bit_off.to_uhwi ();
+			      else
+				bitregion_end = 0;
+			    }
+			  else
+			    bitregion_end = 0;
+			}
+		    }
 		  else
 		    invalid = true;
 		  base_addr = TREE_OPERAND (base_addr, 0);
@@ -1428,6 +1646,12 @@  pass_store_merging::execute (function *f
 		  base_addr = build_fold_addr_expr (base_addr);
 		}
 
+	      if (!bitregion_end)
+		{
+		  bitregion_start = ROUND_DOWN (bitpos, BITS_PER_UNIT);
+		  bitregion_end = ROUND_UP (bitpos + bitsize, BITS_PER_UNIT);
+		}
+
 	      if (! invalid
 		  && offset != NULL_TREE)
 		{
@@ -1457,9 +1681,11 @@  pass_store_merging::execute (function *f
 		  store_immediate_info *info;
 		  if (chain_info)
 		    {
-		      info = new store_immediate_info (
-			bitsize, bitpos, stmt,
-			(*chain_info)->m_store_info.length ());
+		      unsigned int ord = (*chain_info)->m_store_info.length ();
+		      info = new store_immediate_info (bitsize, bitpos,
+						       bitregion_start,
+						       bitregion_end,
+						       stmt, ord);
 		      if (dump_file && (dump_flags & TDF_DETAILS))
 			{
 			  fprintf (dump_file,
@@ -1488,6 +1714,8 @@  pass_store_merging::execute (function *f
 		  struct imm_store_chain_info *new_chain
 		    = new imm_store_chain_info (m_stores_head, base_addr);
 		  info = new store_immediate_info (bitsize, bitpos,
+						   bitregion_start,
+						   bitregion_end,
 						   stmt, 0);
 		  new_chain->m_store_info.safe_push (info);
 		  m_stores.put (base_addr, new_chain);
--- gcc/testsuite/gcc.dg/store_merging_10.c.jj	2017-10-27 14:52:29.724755656 +0200
+++ gcc/testsuite/gcc.dg/store_merging_10.c	2017-10-27 14:52:29.724755656 +0200
@@ -0,0 +1,56 @@ 
+/* { dg-do compile } */
+/* { dg-require-effective-target store_merge } */
+/* { dg-options "-O2 -fdump-tree-store-merging" } */
+
+struct S {
+  unsigned int b1:1;
+  unsigned int b2:1;
+  unsigned int b3:1;
+  unsigned int b4:1;
+  unsigned int b5:1;
+  unsigned int b6:27;
+};
+
+struct T {
+  unsigned int b1:1;
+  unsigned int b2:16;
+  unsigned int b3:14;
+  unsigned int b4:1;
+};
+
+__attribute__((noipa)) void
+foo (struct S *x)
+{
+  x->b1 = 1;
+  x->b2 = 0;
+  x->b3 = 1;
+  x->b4 = 1;
+  x->b5 = 0;
+}
+
+__attribute__((noipa)) void
+bar (struct T *x)
+{
+  x->b1 = 1;
+  x->b2 = 0;
+  x->b4 = 0;
+}
+
+struct S s = { 0, 1, 0, 0, 1, 0x3a5f05a };
+struct T t = { 0, 0xf5af, 0x3a5a, 1 };
+
+int
+main ()
+{
+  asm volatile ("" : : : "memory");
+  foo (&s);
+  bar (&t);
+  asm volatile ("" : : : "memory");
+  if (s.b1 != 1 || s.b2 != 0 || s.b3 != 1 || s.b4 != 1 || s.b5 != 0 || s.b6 != 0x3a5f05a)
+    __builtin_abort ();
+  if (t.b1 != 1 || t.b2 != 0 || t.b3 != 0x3a5a || t.b4 != 0)
+    __builtin_abort ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Merging successful" 2 "store-merging" } } */
--- gcc/testsuite/gcc.dg/store_merging_11.c.jj	2017-10-27 14:52:29.725755644 +0200
+++ gcc/testsuite/gcc.dg/store_merging_11.c	2017-10-27 14:52:29.725755644 +0200
@@ -0,0 +1,47 @@ 
+/* { dg-do compile } */
+/* { dg-require-effective-target store_merge } */
+/* { dg-options "-O2 -fdump-tree-store-merging" } */
+
+struct S { unsigned char b[2]; unsigned short c; unsigned char d[4]; unsigned long e; };
+
+__attribute__((noipa)) void
+foo (struct S *p)
+{
+  p->b[1] = 1;
+  p->c = 23;
+  p->d[0] = 4;
+  p->d[1] = 5;
+  p->d[2] = 6;
+  p->d[3] = 7;
+  p->e = 8;
+}
+
+__attribute__((noipa)) void
+bar (struct S *p)
+{
+  p->b[1] = 9;
+  p->c = 112;
+  p->d[0] = 10;
+  p->d[1] = 11;
+}
+
+struct S s = { { 30, 31 }, 32, { 33, 34, 35, 36 }, 37 };
+
+int
+main ()
+{
+  asm volatile ("" : : : "memory");
+  foo (&s);
+  asm volatile ("" : : : "memory");
+  if (s.b[0] != 30 || s.b[1] != 1 || s.c != 23 || s.d[0] != 4 || s.d[1] != 5
+      || s.d[2] != 6 || s.d[3] != 7 || s.e != 8)
+    __builtin_abort ();
+  bar (&s);
+  asm volatile ("" : : : "memory");
+  if (s.b[0] != 30 || s.b[1] != 9 || s.c != 112 || s.d[0] != 10 || s.d[1] != 11
+      || s.d[2] != 6 || s.d[3] != 7 || s.e != 8)
+    __builtin_abort ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Merging successful" 2 "store-merging" } } */
--- gcc/testsuite/gcc.dg/store_merging_12.c.jj	2017-10-27 15:00:20.046976487 +0200
+++ gcc/testsuite/gcc.dg/store_merging_12.c	2017-10-27 14:59:56.000000000 +0200
@@ -0,0 +1,11 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -Wall" } */
+
+struct S { unsigned int b1:1, b2:1, b3:1, b4:1, b5:1, b6:27; };
+void bar (struct S *);
+void foo (int x)
+{
+  struct S s;
+  s.b2 = 1; s.b3 = 0; s.b4 = 1; s.b5 = 0; s.b1 = x; s.b6 = x;	/* { dg-bogus "is used uninitialized in this function" } */
+  bar (&s);
+}
--- gcc/testsuite/g++.dg/pr71694.C.jj	2016-12-16 11:24:32.000000000 +0100
+++ gcc/testsuite/g++.dg/pr71694.C	2017-10-27 16:53:09.278596219 +0200
@@ -1,5 +1,5 @@ 
 /* { dg-do compile } */
-/* { dg-options "-O2" } */
+/* { dg-options "-O2 -fno-store-merging" } */
 
 struct B {
     B() {}