diff mbox series

[RFC] Run store-merging pass once more before pass fre/pre

Message ID 20200218092739.6359-1-luoxhu@linux.ibm.com
State New
Headers show
Series [RFC] Run store-merging pass once more before pass fre/pre | expand

Commit Message

Xionghu Luo Feb. 18, 2020, 9:27 a.m. UTC
Store-merging pass should run twice, the reason is pass fre/pre will do
some kind of optimizations to instructions by:
  1. Converting the load from address to load from function arguments
  (store_merging_30.c:foo1).
  2. Converting the byte access to BIT_FIELD_REF(store_merging_30.c:foo2).
  3. Other bitfield combinations or potential interference optimizations etc.
These optimizations will break the store chain, store-merging pass fails
to catch such kind of pattern so stores are not merged in middle end,
then consecutive stb/sth instructions(should be merged to stw) are emitted
finally.

And why not directly move store-merging pass(numbered 194) just before
fre1(numbered 35) is for case store_merging_14.c, 5 merges are done by
store_merging1, and 4 merges are done fore store_merge2. So, keep the
original store_merge as store_merge2 as store merge may be still available
after other pass optimizations.  Most of the 30 store_merging_N.c test
case dg-final pass name would be updated from store-merging to
store-merging1 once this RFC patch idea got confirmed.
Any comments?  Thanks.

PS:
Before this patch, store_merging_30.c.035t.fre1:

... foo1:
Inserted _13 = (short unsigned int) counters_new_5(D);
Replaced tmp.D.2912.D.2911.D.2910.D.2909.inuse with _13 in all uses of
_1 = tmp.D.2912.D.2911.D.2910.D.2909.inuse;
Removing dead stmt _1 = tmp.D.2912.D.2911.D.2910.D.2909.inuse;
... foo2:
Inserted _17 = BIT_FIELD_REF <_1, 8, 16>;
Replaced tmp.D.2926.D.2925.D.2924.D.2923.objects with _17 in all uses of
_3 = tmp.D.2926.D.2925.D.2924.D.2923.objects;
Removing dead stmt _3 = tmp.D.2926.D.2925.D.2924.D.2923.objects;

foo1 asm:
rldicl 9,4,48,48
sth 4,0(3)
sth 9,2(3)
blr

With this patch(similar for foo2):

stw r4,0(r3)
blr

gcc/ChangeLog

2020-02-18  Xiong Hu Luo  <luoxhu@linux.ibm.com>

	Part of PR middle-end/71509
	gimple-ssa-store-merging.c (clone): New.
	passes.def (pass_store_merging): New.

gcc/testsuite/ChangeLog

2020-02-18  Xiong Hu Luo  <luoxhu@linux.ibm.com>

	Part of PR middle-end/71509
	testsuite/gcc.dg/store_merging_14.c: Update.
	testsuite/gcc.dg/store_merging_30.c: New.
---
 gcc/gimple-ssa-store-merging.c          |  2 +
 gcc/passes.def                          |  1 +
 gcc/testsuite/gcc.dg/store_merging_14.c |  3 +-
 gcc/testsuite/gcc.dg/store_merging_30.c | 86 +++++++++++++++++++++++++
 4 files changed, 91 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.dg/store_merging_30.c

Comments

Jakub Jelinek Feb. 18, 2020, 9:49 a.m. UTC | #1
On Tue, Feb 18, 2020 at 03:27:39AM -0600, Xionghu Luo wrote:
> Store-merging pass should run twice, the reason is pass fre/pre will do
> some kind of optimizations to instructions by:
>   1. Converting the load from address to load from function arguments
>   (store_merging_30.c:foo1).
>   2. Converting the byte access to BIT_FIELD_REF(store_merging_30.c:foo2).
>   3. Other bitfield combinations or potential interference optimizations etc.
> These optimizations will break the store chain, store-merging pass fails
> to catch such kind of pattern so stores are not merged in middle end,
> then consecutive stb/sth instructions(should be merged to stw) are emitted
> finally.

This seems to be way too early, won't it e.g. stand in a way for SRA because
it will make harder to break those appart again?  Or anything that uses
alias info and store-merging needs to use a conservative alias set?
For constant stores into bitfields followed by loads I'm working on something
ATM (PR93582).

	Jakub
Richard Biener Feb. 18, 2020, 9:57 a.m. UTC | #2
On Tue, 18 Feb 2020, Xionghu Luo wrote:

> Store-merging pass should run twice, the reason is pass fre/pre will do
> some kind of optimizations to instructions by:
>   1. Converting the load from address to load from function arguments
>   (store_merging_30.c:foo1).
>   2. Converting the byte access to BIT_FIELD_REF(store_merging_30.c:foo2).
>   3. Other bitfield combinations or potential interference optimizations etc.
> These optimizations will break the store chain, store-merging pass fails
> to catch such kind of pattern so stores are not merged in middle end,
> then consecutive stb/sth instructions(should be merged to stw) are emitted
> finally.
> 
> And why not directly move store-merging pass(numbered 194) just before
> fre1(numbered 35) is for case store_merging_14.c, 5 merges are done by
> store_merging1, and 4 merges are done fore store_merge2. So, keep the
> original store_merge as store_merge2 as store merge may be still available
> after other pass optimizations.  Most of the 30 store_merging_N.c test
> case dg-final pass name would be updated from store-merging to
> store-merging1 once this RFC patch idea got confirmed.
> Any comments?  Thanks.

Generally trying to solve a pass ordering issue by re-ordering/duplicating
passes might help a single testcase but will surely pessimize others.
So that's a no-go.

What the testcase shows is that store-merging needs to operate
similar to bswap when trying to find a "common" source for a combined
store.  That is, it should appropriately follow defs.  For foo2 I expect
it to be not too difficult, for foo1 I'd say we miss a FRE opportunity
here (but Jakub is working on that one IIRC).

Richard.

> PS:
> Before this patch, store_merging_30.c.035t.fre1:
> 
> ... foo1:
> Inserted _13 = (short unsigned int) counters_new_5(D);
> Replaced tmp.D.2912.D.2911.D.2910.D.2909.inuse with _13 in all uses of
> _1 = tmp.D.2912.D.2911.D.2910.D.2909.inuse;
> Removing dead stmt _1 = tmp.D.2912.D.2911.D.2910.D.2909.inuse;
> ... foo2:
> Inserted _17 = BIT_FIELD_REF <_1, 8, 16>;
> Replaced tmp.D.2926.D.2925.D.2924.D.2923.objects with _17 in all uses of
> _3 = tmp.D.2926.D.2925.D.2924.D.2923.objects;
> Removing dead stmt _3 = tmp.D.2926.D.2925.D.2924.D.2923.objects;
> 
> foo1 asm:
> rldicl 9,4,48,48
> sth 4,0(3)
> sth 9,2(3)
> blr
> 
> With this patch(similar for foo2):
> 
> stw r4,0(r3)
> blr
> 
> gcc/ChangeLog
> 
> 2020-02-18  Xiong Hu Luo  <luoxhu@linux.ibm.com>
> 
> 	Part of PR middle-end/71509
> 	gimple-ssa-store-merging.c (clone): New.
> 	passes.def (pass_store_merging): New.
> 
> gcc/testsuite/ChangeLog
> 
> 2020-02-18  Xiong Hu Luo  <luoxhu@linux.ibm.com>
> 
> 	Part of PR middle-end/71509
> 	testsuite/gcc.dg/store_merging_14.c: Update.
> 	testsuite/gcc.dg/store_merging_30.c: New.
> ---
>  gcc/gimple-ssa-store-merging.c          |  2 +
>  gcc/passes.def                          |  1 +
>  gcc/testsuite/gcc.dg/store_merging_14.c |  3 +-
>  gcc/testsuite/gcc.dg/store_merging_30.c | 86 +++++++++++++++++++++++++
>  4 files changed, 91 insertions(+), 1 deletion(-)
>  create mode 100644 gcc/testsuite/gcc.dg/store_merging_30.c
> 
> diff --git a/gcc/gimple-ssa-store-merging.c b/gcc/gimple-ssa-store-merging.c
> index 8371323ef4a..9a5bd49fc3a 100644
> --- a/gcc/gimple-ssa-store-merging.c
> +++ b/gcc/gimple-ssa-store-merging.c
> @@ -2156,6 +2156,8 @@ public:
>    {
>    }
>  
> +  opt_pass * clone () { return new pass_store_merging (m_ctxt); }
> +
>    /* Pass not supported for PDP-endian, nor for insane hosts or
>       target character sizes where native_{encode,interpret}_expr
>       doesn't work properly.  */
> diff --git a/gcc/passes.def b/gcc/passes.def
> index 2bf2cb78fc5..e531531cb14 100644
> --- a/gcc/passes.def
> +++ b/gcc/passes.def
> @@ -85,6 +85,7 @@ along with GCC; see the file COPYING3.  If not see
>  	  /* pass_build_ealias is a dummy pass that ensures that we
>  	     execute TODO_rebuild_alias at this point.  */
>  	  NEXT_PASS (pass_build_ealias);
> +	  NEXT_PASS (pass_store_merging);
>  	  NEXT_PASS (pass_fre, true /* may_iterate */);
>  	  NEXT_PASS (pass_early_vrp);
>  	  NEXT_PASS (pass_merge_phi);
> diff --git a/gcc/testsuite/gcc.dg/store_merging_14.c b/gcc/testsuite/gcc.dg/store_merging_14.c
> index 9310aaf3489..bd120d18ac6 100644
> --- a/gcc/testsuite/gcc.dg/store_merging_14.c
> +++ b/gcc/testsuite/gcc.dg/store_merging_14.c
> @@ -214,4 +214,5 @@ main ()
>    return 0;
>  }
>  
> -/* { dg-final { scan-tree-dump-times "Merging successful" 9 "store-merging" } } */
> +/* { dg-final { scan-tree-dump-times "Merging successful" 5 "store-merging1" } } */
> +/* { dg-final { scan-tree-dump-times "Merging successful" 4 "store-merging2" } } */
> diff --git a/gcc/testsuite/gcc.dg/store_merging_30.c b/gcc/testsuite/gcc.dg/store_merging_30.c
> new file mode 100644
> index 00000000000..71369c3b196
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/store_merging_30.c
> @@ -0,0 +1,86 @@
> +/* { dg-do run } */
> +/* { dg-require-effective-target store_merge } */
> +/* { dg-options "-O2 -fdump-tree-store-merging" } */
> +
> +typedef unsigned int atomic_t;
> +
> +struct page
> +{
> +  union
> +  {
> +    unsigned long counters;
> +    struct
> +    {
> +      union
> +      {
> +	struct
> +	{
> +	  unsigned inuse : 16;
> +	  unsigned objects : 15;
> +	  unsigned frozen : 1;
> +	};
> +      };
> +    };
> +  };
> +};
> +
> +struct page2
> +{
> +  union
> +  {
> +    unsigned counters;
> +    struct
> +    {
> +      union
> +      {
> +	struct
> +	{
> +	  unsigned inuse : 16;
> +	  unsigned objects : 8;
> +	  unsigned frozen : 8;
> +	};
> +      };
> +    };
> +  };
> +};
> +
> +__attribute__((noipa)) void
> +foo1 (struct page *page, unsigned long counters_new)
> +{
> +        struct page tmp;
> +        tmp.counters = counters_new;
> +        page->inuse   = tmp.inuse;
> +        page->objects = tmp.objects;
> +        page->frozen  = tmp.frozen;
> +}
> +
> +__attribute__((noipa)) void
> +foo2 (struct page2 *page2, unsigned long counters_new)
> +{
> +        struct page2 tmp;
> +        tmp.counters = counters_new;
> +        page2->inuse   = tmp.inuse;
> +        page2->objects = tmp.objects;
> +        page2->frozen  = tmp.frozen;
> +}
> +
> +int main ()
> +{
> +  struct page page;
> +  foo1 (&page, 0x12345678ABCDEFFEUL);
> +
> +  asm volatile ("" : : : "memory");
> +  if (page.frozen != 1 || page.objects != 0x2bcd || page.inuse != 0xeffe)
> +    __builtin_abort ();
> +
> +  struct page2 page2;
> +  foo2 (&page2, 0x12345678ABCDEFFEUL);
> +
> +  asm volatile ("" : : : "memory");
> +
> +  if (page2.frozen != 0xab || page2.objects != 0xcd || page2.inuse != 0xeffe)
> +    __builtin_abort ();
> +  return 0;
> +}
> +
> +/* { dg-final { scan-tree-dump-times "Merging successful" 2 "store-merging1" } } */
>
Xionghu Luo Feb. 26, 2020, 8:09 a.m. UTC | #3
On 2020/2/18 17:57, Richard Biener wrote:
> On Tue, 18 Feb 2020, Xionghu Luo wrote:
> 
>> Store-merging pass should run twice, the reason is pass fre/pre will do
>> some kind of optimizations to instructions by:
>>    1. Converting the load from address to load from function arguments
>>    (store_merging_30.c:foo1).
>>    2. Converting the byte access to BIT_FIELD_REF(store_merging_30.c:foo2).
>>    3. Other bitfield combinations or potential interference optimizations etc.
>> These optimizations will break the store chain, store-merging pass fails
>> to catch such kind of pattern so stores are not merged in middle end,
>> then consecutive stb/sth instructions(should be merged to stw) are emitted
>> finally.
>>
>> And why not directly move store-merging pass(numbered 194) just before
>> fre1(numbered 35) is for case store_merging_14.c, 5 merges are done by
>> store_merging1, and 4 merges are done fore store_merge2. So, keep the
>> original store_merge as store_merge2 as store merge may be still available
>> after other pass optimizations.  Most of the 30 store_merging_N.c test
>> case dg-final pass name would be updated from store-merging to
>> store-merging1 once this RFC patch idea got confirmed.
>> Any comments?  Thanks.
> 
> Generally trying to solve a pass ordering issue by re-ordering/duplicating
> passes might help a single testcase but will surely pessimize others.
> So that's a no-go.
> 
> What the testcase shows is that store-merging needs to operate
> similar to bswap when trying to find a "common" source for a combined
> store.  That is, it should appropriately follow defs.  For foo2 I expect
> it to be not too difficult, for foo1 I'd say we miss a FRE opportunity
> here (but Jakub is working on that one IIRC).

Thanks Richard, not sure about my understanding and please correct if any.

I tried Jukub's latest patch of "sccvn: Handle bitfields in push_partial_def".
Got to know fre pass checks the load instruction's vuse chain and do the constant
bitfield combines in push_partial_def, then native_encode_expr/native_interpret_expr
can decode and encode the constant content and shift/combine the data.  
This should be based on one change to my test case(by adding return page->counters;)
to trigger the fre pass push all SSA name's partial_defs.  Currently, for SSA variables,
this encode/interpret is not supported yet, I suppose this is the opportunity you mean.
As this case's input is not constant, so Jukub's patch doesn't fix it.

struct page
{
  union
  {
    unsigned counters;
    struct
    {
      union
      {
        struct
        {
          unsigned inuse : 16;
          unsigned inuse2 : 8;
          unsigned objects : 5;
          unsigned frozen : 3;
        };
      };
    };
  };
};

unsigned
foo1 (struct page *page, unsigned long counters_new)
{
        struct page tmp;
        tmp.counters = counters_new;
        page->inuse   = tmp.inuse;
        page->inuse2  = tmp.inuse2;
        page->objects = tmp.objects;
        page->frozen  = tmp.frozen;
        return page->counters;
}

If "return page->counters;" is removed, this case won't match the fre's current code path
in vn_reference_lookup_3 without a load(_14 = page_9(D)->D.2912.counters) to walk all the vuses.
So it seems not a fre opportunity but exactly a store merging issue(Also checked llvm that it 
doesn't generate byte store and short store so it only produces 1 stw for all patterns).

Optimize this case in pass store-merging is reasonable, just as you said,
"trying to find a "common" source for a combined store.", it requires break
store-merging pass's behavior: so far store-merging pass track all store
instruction's RHS and stops when RHS is a load instruction with <base_addr, bitsize,
bitpos, bitregion_start, bitregion_end> extracted, convert instructions and bitfield
instructions generated by pass fre are ignored in pass_store_merging::process_store.
To teach the process_store capture real common source requires refactoring handled_load
to continue tracking even RHS is a load instruction and support the convert  instructions
and bitfield_instructions, this seems to be a big change.

Regarding to Jakub's comments, merging could reduce many byte stores and half stores
to improve performance for this type of case.  There is already an 033t.esra running before,
and not sure whether SRA should replace such kind of bitfield operations.
Adding a store-merging pass is so simple and many passes run more than once...
So which would be best for this optimization, please?  Thanks again :)

Xionghu

> 
> Richard.
> 
>> PS:
>> Before this patch, store_merging_30.c.035t.fre1:
>>
>> ... foo1:
>> Inserted _13 = (short unsigned int) counters_new_5(D);
>> Replaced tmp.D.2912.D.2911.D.2910.D.2909.inuse with _13 in all uses of
>> _1 = tmp.D.2912.D.2911.D.2910.D.2909.inuse;
>> Removing dead stmt _1 = tmp.D.2912.D.2911.D.2910.D.2909.inuse;
>> ... foo2:
>> Inserted _17 = BIT_FIELD_REF <_1, 8, 16>;
>> Replaced tmp.D.2926.D.2925.D.2924.D.2923.objects with _17 in all uses of
>> _3 = tmp.D.2926.D.2925.D.2924.D.2923.objects;
>> Removing dead stmt _3 = tmp.D.2926.D.2925.D.2924.D.2923.objects;
>>
>> foo1 asm:
>> rldicl 9,4,48,48
>> sth 4,0(3)
>> sth 9,2(3)
>> blr
>>
>> With this patch(similar for foo2):
>>
>> stw r4,0(r3)
>> blr
>>
>> gcc/ChangeLog
>>
>> 2020-02-18  Xiong Hu Luo  <luoxhu@linux.ibm.com>
>>
>> 	Part of PR middle-end/71509
>> 	gimple-ssa-store-merging.c (clone): New.
>> 	passes.def (pass_store_merging): New.
>>
>> gcc/testsuite/ChangeLog
>>
>> 2020-02-18  Xiong Hu Luo  <luoxhu@linux.ibm.com>
>>
>> 	Part of PR middle-end/71509
>> 	testsuite/gcc.dg/store_merging_14.c: Update.
>> 	testsuite/gcc.dg/store_merging_30.c: New.
>> ---
>>   gcc/gimple-ssa-store-merging.c          |  2 +
>>   gcc/passes.def                          |  1 +
>>   gcc/testsuite/gcc.dg/store_merging_14.c |  3 +-
>>   gcc/testsuite/gcc.dg/store_merging_30.c | 86 +++++++++++++++++++++++++
>>   4 files changed, 91 insertions(+), 1 deletion(-)
>>   create mode 100644 gcc/testsuite/gcc.dg/store_merging_30.c
>>
>> diff --git a/gcc/gimple-ssa-store-merging.c b/gcc/gimple-ssa-store-merging.c
>> index 8371323ef4a..9a5bd49fc3a 100644
>> --- a/gcc/gimple-ssa-store-merging.c
>> +++ b/gcc/gimple-ssa-store-merging.c
>> @@ -2156,6 +2156,8 @@ public:
>>     {
>>     }
>>   
>> +  opt_pass * clone () { return new pass_store_merging (m_ctxt); }
>> +
>>     /* Pass not supported for PDP-endian, nor for insane hosts or
>>        target character sizes where native_{encode,interpret}_expr
>>        doesn't work properly.  */
>> diff --git a/gcc/passes.def b/gcc/passes.def
>> index 2bf2cb78fc5..e531531cb14 100644
>> --- a/gcc/passes.def
>> +++ b/gcc/passes.def
>> @@ -85,6 +85,7 @@ along with GCC; see the file COPYING3.  If not see
>>   	  /* pass_build_ealias is a dummy pass that ensures that we
>>   	     execute TODO_rebuild_alias at this point.  */
>>   	  NEXT_PASS (pass_build_ealias);
>> +	  NEXT_PASS (pass_store_merging);
>>   	  NEXT_PASS (pass_fre, true /* may_iterate */);
>>   	  NEXT_PASS (pass_early_vrp);
>>   	  NEXT_PASS (pass_merge_phi);
>> diff --git a/gcc/testsuite/gcc.dg/store_merging_14.c b/gcc/testsuite/gcc.dg/store_merging_14.c
>> index 9310aaf3489..bd120d18ac6 100644
>> --- a/gcc/testsuite/gcc.dg/store_merging_14.c
>> +++ b/gcc/testsuite/gcc.dg/store_merging_14.c
>> @@ -214,4 +214,5 @@ main ()
>>     return 0;
>>   }
>>   
>> -/* { dg-final { scan-tree-dump-times "Merging successful" 9 "store-merging" } } */
>> +/* { dg-final { scan-tree-dump-times "Merging successful" 5 "store-merging1" } } */
>> +/* { dg-final { scan-tree-dump-times "Merging successful" 4 "store-merging2" } } */
>> diff --git a/gcc/testsuite/gcc.dg/store_merging_30.c b/gcc/testsuite/gcc.dg/store_merging_30.c
>> new file mode 100644
>> index 00000000000..71369c3b196
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.dg/store_merging_30.c
>> @@ -0,0 +1,86 @@
>> +/* { dg-do run } */
>> +/* { dg-require-effective-target store_merge } */
>> +/* { dg-options "-O2 -fdump-tree-store-merging" } */
>> +
>> +typedef unsigned int atomic_t;
>> +
>> +struct page
>> +{
>> +  union
>> +  {
>> +    unsigned long counters;
>> +    struct
>> +    {
>> +      union
>> +      {
>> +	struct
>> +	{
>> +	  unsigned inuse : 16;
>> +	  unsigned objects : 15;
>> +	  unsigned frozen : 1;
>> +	};
>> +      };
>> +    };
>> +  };
>> +};
>> +
>> +struct page2
>> +{
>> +  union
>> +  {
>> +    unsigned counters;
>> +    struct
>> +    {
>> +      union
>> +      {
>> +	struct
>> +	{
>> +	  unsigned inuse : 16;
>> +	  unsigned objects : 8;
>> +	  unsigned frozen : 8;
>> +	};
>> +      };
>> +    };
>> +  };
>> +};
>> +
>> +__attribute__((noipa)) void
>> +foo1 (struct page *page, unsigned long counters_new)
>> +{
>> +        struct page tmp;
>> +        tmp.counters = counters_new;
>> +        page->inuse   = tmp.inuse;
>> +        page->objects = tmp.objects;
>> +        page->frozen  = tmp.frozen;
>> +}
>> +
>> +__attribute__((noipa)) void
>> +foo2 (struct page2 *page2, unsigned long counters_new)
>> +{
>> +        struct page2 tmp;
>> +        tmp.counters = counters_new;
>> +        page2->inuse   = tmp.inuse;
>> +        page2->objects = tmp.objects;
>> +        page2->frozen  = tmp.frozen;
>> +}
>> +
>> +int main ()
>> +{
>> +  struct page page;
>> +  foo1 (&page, 0x12345678ABCDEFFEUL);
>> +
>> +  asm volatile ("" : : : "memory");
>> +  if (page.frozen != 1 || page.objects != 0x2bcd || page.inuse != 0xeffe)
>> +    __builtin_abort ();
>> +
>> +  struct page2 page2;
>> +  foo2 (&page2, 0x12345678ABCDEFFEUL);
>> +
>> +  asm volatile ("" : : : "memory");
>> +
>> +  if (page2.frozen != 0xab || page2.objects != 0xcd || page2.inuse != 0xeffe)
>> +    __builtin_abort ();
>> +  return 0;
>> +}
>> +
>> +/* { dg-final { scan-tree-dump-times "Merging successful" 2 "store-merging1" } } */
>>
>
Jakub Jelinek Feb. 26, 2020, 8:29 a.m. UTC | #4
On Wed, Feb 26, 2020 at 04:09:16PM +0800, luoxhu wrote:
> Thanks Richard, not sure about my understanding and please correct if any.
> 
> I tried Jukub's latest patch of "sccvn: Handle bitfields in push_partial_def".
> Got to know fre pass checks the load instruction's vuse chain and do the constant
> bitfield combines in push_partial_def, then native_encode_expr/native_interpret_expr
> can decode and encode the constant content and shift/combine the data.  
> This should be based on one change to my test case(by adding return page->counters;)
> to trigger the fre pass push all SSA name's partial_defs.  Currently, for SSA variables,
> this encode/interpret is not supported yet, I suppose this is the opportunity you mean.
> As this case's input is not constant, so Jukub's patch doesn't fix it.

Yeah, I've looked at your case and the problem is that
         tmp.counters = counters_new;
         page->inuse   = tmp.inuse;
         page->inuse2  = tmp.inuse2;
         page->objects = tmp.objects;
         page->frozen  = tmp.frozen;
we optimize it into essentially:
         tmp.counters = counters_new;
         page->inuse   = (cast) counters_new;
         page->inuse2  = tmp.inuse2;
         page->objects = tmp.objects;
         page->frozen  = tmp.frozen;
and so it is no longer recognized as a mem copy.  What we could do is just
teach store-merging to virtually undo that optimization and consider the
store to be a MEM_REF too.  Especially for the BIT_INSERT_EXPR Eric has
added quite a few similar optimizations already.

	Jakub
Richard Biener Feb. 26, 2020, 8:29 a.m. UTC | #5
On Wed, 26 Feb 2020, luoxhu wrote:

> On 2020/2/18 17:57, Richard Biener wrote:
> > On Tue, 18 Feb 2020, Xionghu Luo wrote:
> > 
> >> Store-merging pass should run twice, the reason is pass fre/pre will do
> >> some kind of optimizations to instructions by:
> >>    1. Converting the load from address to load from function arguments
> >>    (store_merging_30.c:foo1).
> >>    2. Converting the byte access to BIT_FIELD_REF(store_merging_30.c:foo2).
> >>    3. Other bitfield combinations or potential interference optimizations etc.
> >> These optimizations will break the store chain, store-merging pass fails
> >> to catch such kind of pattern so stores are not merged in middle end,
> >> then consecutive stb/sth instructions(should be merged to stw) are emitted
> >> finally.
> >>
> >> And why not directly move store-merging pass(numbered 194) just before
> >> fre1(numbered 35) is for case store_merging_14.c, 5 merges are done by
> >> store_merging1, and 4 merges are done fore store_merge2. So, keep the
> >> original store_merge as store_merge2 as store merge may be still available
> >> after other pass optimizations.  Most of the 30 store_merging_N.c test
> >> case dg-final pass name would be updated from store-merging to
> >> store-merging1 once this RFC patch idea got confirmed.
> >> Any comments?  Thanks.
> > 
> > Generally trying to solve a pass ordering issue by re-ordering/duplicating
> > passes might help a single testcase but will surely pessimize others.
> > So that's a no-go.
> > 
> > What the testcase shows is that store-merging needs to operate
> > similar to bswap when trying to find a "common" source for a combined
> > store.  That is, it should appropriately follow defs.  For foo2 I expect
> > it to be not too difficult, for foo1 I'd say we miss a FRE opportunity
> > here (but Jakub is working on that one IIRC).
> 
> Thanks Richard, not sure about my understanding and please correct if any.
> 
> I tried Jukub's latest patch of "sccvn: Handle bitfields in push_partial_def".
> Got to know fre pass checks the load instruction's vuse chain and do the constant
> bitfield combines in push_partial_def, then native_encode_expr/native_interpret_expr
> can decode and encode the constant content and shift/combine the data.  
> This should be based on one change to my test case(by adding return page->counters;)
> to trigger the fre pass push all SSA name's partial_defs.  Currently, for SSA variables,
> this encode/interpret is not supported yet, I suppose this is the opportunity you mean.
> As this case's input is not constant, so Jukub's patch doesn't fix it.
> 
> struct page
> {
>   union
>   {
>     unsigned counters;
>     struct
>     {
>       union
>       {
>         struct
>         {
>           unsigned inuse : 16;
>           unsigned inuse2 : 8;
>           unsigned objects : 5;
>           unsigned frozen : 3;
>         };
>       };
>     };
>   };
> };
> 
> unsigned
> foo1 (struct page *page, unsigned long counters_new)
> {
>         struct page tmp;
>         tmp.counters = counters_new;
>         page->inuse   = tmp.inuse;
>         page->inuse2  = tmp.inuse2;
>         page->objects = tmp.objects;
>         page->frozen  = tmp.frozen;
>         return page->counters;
> }
> 
> If "return page->counters;" is removed, this case won't match the fre's current code path
> in vn_reference_lookup_3 without a load(_14 = page_9(D)->D.2912.counters) to walk all the vuses.
> So it seems not a fre opportunity but exactly a store merging issue(Also checked llvm that it 
> doesn't generate byte store and short store so it only produces 1 stw for all patterns).

Huh, but the code has the return page->counters and you expec it to
return counters_new unchanged.  Yes, you also expect the code to be
optimized to

   page->counters = counters_new;
   return counters_new;

and that part would indeed be a store-merging opportunity.  Currently
FRE does

  _1 = (unsigned int) counters_new_6(D);
  tmp.D.1943.counters = _1;
  _18 = (short unsigned int) counters_new_6(D);
  page_9(D)->D.1943.D.1942.D.1941.D.1940.inuse = _18;
  _19 = BIT_FIELD_REF <_1, 8, 16>;
  page_9(D)->D.1943.D.1942.D.1941.D.1940.inuse2 = _19;
  _4 = tmp.D.1943.D.1942.D.1941.D.1940.objects;
  page_9(D)->D.1943.D.1942.D.1941.D.1940.objects = _4;
  _5 = tmp.D.1943.D.1942.D.1941.D.1940.frozen;
  page_9(D)->D.1943.D.1942.D.1941.D.1940.frozen = _5;

so it fails to optimize the other loads to BIT_FIELD_REFs of _1.  If it
did that then store-merging woudl already handle the code I think.

> Optimize this case in pass store-merging is reasonable, just as you said,
> "trying to find a "common" source for a combined store.", it requires break
> store-merging pass's behavior: so far store-merging pass track all store
> instruction's RHS and stops when RHS is a load instruction with <base_addr, bitsize,
> bitpos, bitregion_start, bitregion_end> extracted, convert instructions and bitfield
> instructions generated by pass fre are ignored in pass_store_merging::process_store.
> To teach the process_store capture real common source requires refactoring handled_load
> to continue tracking even RHS is a load instruction and support the convert  instructions
> and bitfield_instructions, this seems to be a big change.

Well, as said above FRE would elide the loads and you'll see a series of
stores with BIT_FIELD_REF as sources.

> Regarding to Jakub's comments, merging could reduce many byte stores and half stores
> to improve performance for this type of case.  There is already an 033t.esra running before,
> and not sure whether SRA should replace such kind of bitfield operations.
> Adding a store-merging pass is so simple and many passes run more than once...
> So which would be best for this optimization, please?  Thanks again :)

I think the testcase as-is might be also optimized with Andrews pending
work to lower bitfield accesses.  So a slightly adjusted testcase not
using bitfields but maybe char (or mixed char/short) members would be
useful to track as well.  There SRA might apply (SRA backs off of
doing anything to bitfields currently).

Richard.

> Xionghu
> 
> > 
> > Richard.
> > 
> >> PS:
> >> Before this patch, store_merging_30.c.035t.fre1:
> >>
> >> ... foo1:
> >> Inserted _13 = (short unsigned int) counters_new_5(D);
> >> Replaced tmp.D.2912.D.2911.D.2910.D.2909.inuse with _13 in all uses of
> >> _1 = tmp.D.2912.D.2911.D.2910.D.2909.inuse;
> >> Removing dead stmt _1 = tmp.D.2912.D.2911.D.2910.D.2909.inuse;
> >> ... foo2:
> >> Inserted _17 = BIT_FIELD_REF <_1, 8, 16>;
> >> Replaced tmp.D.2926.D.2925.D.2924.D.2923.objects with _17 in all uses of
> >> _3 = tmp.D.2926.D.2925.D.2924.D.2923.objects;
> >> Removing dead stmt _3 = tmp.D.2926.D.2925.D.2924.D.2923.objects;
> >>
> >> foo1 asm:
> >> rldicl 9,4,48,48
> >> sth 4,0(3)
> >> sth 9,2(3)
> >> blr
> >>
> >> With this patch(similar for foo2):
> >>
> >> stw r4,0(r3)
> >> blr
> >>
> >> gcc/ChangeLog
> >>
> >> 2020-02-18  Xiong Hu Luo  <luoxhu@linux.ibm.com>
> >>
> >> 	Part of PR middle-end/71509
> >> 	gimple-ssa-store-merging.c (clone): New.
> >> 	passes.def (pass_store_merging): New.
> >>
> >> gcc/testsuite/ChangeLog
> >>
> >> 2020-02-18  Xiong Hu Luo  <luoxhu@linux.ibm.com>
> >>
> >> 	Part of PR middle-end/71509
> >> 	testsuite/gcc.dg/store_merging_14.c: Update.
> >> 	testsuite/gcc.dg/store_merging_30.c: New.
> >> ---
> >>   gcc/gimple-ssa-store-merging.c          |  2 +
> >>   gcc/passes.def                          |  1 +
> >>   gcc/testsuite/gcc.dg/store_merging_14.c |  3 +-
> >>   gcc/testsuite/gcc.dg/store_merging_30.c | 86 +++++++++++++++++++++++++
> >>   4 files changed, 91 insertions(+), 1 deletion(-)
> >>   create mode 100644 gcc/testsuite/gcc.dg/store_merging_30.c
> >>
> >> diff --git a/gcc/gimple-ssa-store-merging.c b/gcc/gimple-ssa-store-merging.c
> >> index 8371323ef4a..9a5bd49fc3a 100644
> >> --- a/gcc/gimple-ssa-store-merging.c
> >> +++ b/gcc/gimple-ssa-store-merging.c
> >> @@ -2156,6 +2156,8 @@ public:
> >>     {
> >>     }
> >>   
> >> +  opt_pass * clone () { return new pass_store_merging (m_ctxt); }
> >> +
> >>     /* Pass not supported for PDP-endian, nor for insane hosts or
> >>        target character sizes where native_{encode,interpret}_expr
> >>        doesn't work properly.  */
> >> diff --git a/gcc/passes.def b/gcc/passes.def
> >> index 2bf2cb78fc5..e531531cb14 100644
> >> --- a/gcc/passes.def
> >> +++ b/gcc/passes.def
> >> @@ -85,6 +85,7 @@ along with GCC; see the file COPYING3.  If not see
> >>   	  /* pass_build_ealias is a dummy pass that ensures that we
> >>   	     execute TODO_rebuild_alias at this point.  */
> >>   	  NEXT_PASS (pass_build_ealias);
> >> +	  NEXT_PASS (pass_store_merging);
> >>   	  NEXT_PASS (pass_fre, true /* may_iterate */);
> >>   	  NEXT_PASS (pass_early_vrp);
> >>   	  NEXT_PASS (pass_merge_phi);
> >> diff --git a/gcc/testsuite/gcc.dg/store_merging_14.c b/gcc/testsuite/gcc.dg/store_merging_14.c
> >> index 9310aaf3489..bd120d18ac6 100644
> >> --- a/gcc/testsuite/gcc.dg/store_merging_14.c
> >> +++ b/gcc/testsuite/gcc.dg/store_merging_14.c
> >> @@ -214,4 +214,5 @@ main ()
> >>     return 0;
> >>   }
> >>   
> >> -/* { dg-final { scan-tree-dump-times "Merging successful" 9 "store-merging" } } */
> >> +/* { dg-final { scan-tree-dump-times "Merging successful" 5 "store-merging1" } } */
> >> +/* { dg-final { scan-tree-dump-times "Merging successful" 4 "store-merging2" } } */
> >> diff --git a/gcc/testsuite/gcc.dg/store_merging_30.c b/gcc/testsuite/gcc.dg/store_merging_30.c
> >> new file mode 100644
> >> index 00000000000..71369c3b196
> >> --- /dev/null
> >> +++ b/gcc/testsuite/gcc.dg/store_merging_30.c
> >> @@ -0,0 +1,86 @@
> >> +/* { dg-do run } */
> >> +/* { dg-require-effective-target store_merge } */
> >> +/* { dg-options "-O2 -fdump-tree-store-merging" } */
> >> +
> >> +typedef unsigned int atomic_t;
> >> +
> >> +struct page
> >> +{
> >> +  union
> >> +  {
> >> +    unsigned long counters;
> >> +    struct
> >> +    {
> >> +      union
> >> +      {
> >> +	struct
> >> +	{
> >> +	  unsigned inuse : 16;
> >> +	  unsigned objects : 15;
> >> +	  unsigned frozen : 1;
> >> +	};
> >> +      };
> >> +    };
> >> +  };
> >> +};
> >> +
> >> +struct page2
> >> +{
> >> +  union
> >> +  {
> >> +    unsigned counters;
> >> +    struct
> >> +    {
> >> +      union
> >> +      {
> >> +	struct
> >> +	{
> >> +	  unsigned inuse : 16;
> >> +	  unsigned objects : 8;
> >> +	  unsigned frozen : 8;
> >> +	};
> >> +      };
> >> +    };
> >> +  };
> >> +};
> >> +
> >> +__attribute__((noipa)) void
> >> +foo1 (struct page *page, unsigned long counters_new)
> >> +{
> >> +        struct page tmp;
> >> +        tmp.counters = counters_new;
> >> +        page->inuse   = tmp.inuse;
> >> +        page->objects = tmp.objects;
> >> +        page->frozen  = tmp.frozen;
> >> +}
> >> +
> >> +__attribute__((noipa)) void
> >> +foo2 (struct page2 *page2, unsigned long counters_new)
> >> +{
> >> +        struct page2 tmp;
> >> +        tmp.counters = counters_new;
> >> +        page2->inuse   = tmp.inuse;
> >> +        page2->objects = tmp.objects;
> >> +        page2->frozen  = tmp.frozen;
> >> +}
> >> +
> >> +int main ()
> >> +{
> >> +  struct page page;
> >> +  foo1 (&page, 0x12345678ABCDEFFEUL);
> >> +
> >> +  asm volatile ("" : : : "memory");
> >> +  if (page.frozen != 1 || page.objects != 0x2bcd || page.inuse != 0xeffe)
> >> +    __builtin_abort ();
> >> +
> >> +  struct page2 page2;
> >> +  foo2 (&page2, 0x12345678ABCDEFFEUL);
> >> +
> >> +  asm volatile ("" : : : "memory");
> >> +
> >> +  if (page2.frozen != 0xab || page2.objects != 0xcd || page2.inuse != 0xeffe)
> >> +    __builtin_abort ();
> >> +  return 0;
> >> +}
> >> +
> >> +/* { dg-final { scan-tree-dump-times "Merging successful" 2 "store-merging1" } } */
> >>
> > 
> 
>
Richard Biener Feb. 26, 2020, 8:37 a.m. UTC | #6
On Wed, 26 Feb 2020, Richard Biener wrote:

> On Wed, 26 Feb 2020, luoxhu wrote:
> 
> > On 2020/2/18 17:57, Richard Biener wrote:
> > > On Tue, 18 Feb 2020, Xionghu Luo wrote:
> > > 
> > >> Store-merging pass should run twice, the reason is pass fre/pre will do
> > >> some kind of optimizations to instructions by:
> > >>    1. Converting the load from address to load from function arguments
> > >>    (store_merging_30.c:foo1).
> > >>    2. Converting the byte access to BIT_FIELD_REF(store_merging_30.c:foo2).
> > >>    3. Other bitfield combinations or potential interference optimizations etc.
> > >> These optimizations will break the store chain, store-merging pass fails
> > >> to catch such kind of pattern so stores are not merged in middle end,
> > >> then consecutive stb/sth instructions(should be merged to stw) are emitted
> > >> finally.
> > >>
> > >> And why not directly move store-merging pass(numbered 194) just before
> > >> fre1(numbered 35) is for case store_merging_14.c, 5 merges are done by
> > >> store_merging1, and 4 merges are done fore store_merge2. So, keep the
> > >> original store_merge as store_merge2 as store merge may be still available
> > >> after other pass optimizations.  Most of the 30 store_merging_N.c test
> > >> case dg-final pass name would be updated from store-merging to
> > >> store-merging1 once this RFC patch idea got confirmed.
> > >> Any comments?  Thanks.
> > > 
> > > Generally trying to solve a pass ordering issue by re-ordering/duplicating
> > > passes might help a single testcase but will surely pessimize others.
> > > So that's a no-go.
> > > 
> > > What the testcase shows is that store-merging needs to operate
> > > similar to bswap when trying to find a "common" source for a combined
> > > store.  That is, it should appropriately follow defs.  For foo2 I expect
> > > it to be not too difficult, for foo1 I'd say we miss a FRE opportunity
> > > here (but Jakub is working on that one IIRC).
> > 
> > Thanks Richard, not sure about my understanding and please correct if any.
> > 
> > I tried Jukub's latest patch of "sccvn: Handle bitfields in push_partial_def".
> > Got to know fre pass checks the load instruction's vuse chain and do the constant
> > bitfield combines in push_partial_def, then native_encode_expr/native_interpret_expr
> > can decode and encode the constant content and shift/combine the data.  
> > This should be based on one change to my test case(by adding return page->counters;)
> > to trigger the fre pass push all SSA name's partial_defs.  Currently, for SSA variables,
> > this encode/interpret is not supported yet, I suppose this is the opportunity you mean.
> > As this case's input is not constant, so Jukub's patch doesn't fix it.
> > 
> > struct page
> > {
> >   union
> >   {
> >     unsigned counters;
> >     struct
> >     {
> >       union
> >       {
> >         struct
> >         {
> >           unsigned inuse : 16;
> >           unsigned inuse2 : 8;
> >           unsigned objects : 5;
> >           unsigned frozen : 3;
> >         };
> >       };
> >     };
> >   };
> > };
> > 
> > unsigned
> > foo1 (struct page *page, unsigned long counters_new)
> > {
> >         struct page tmp;
> >         tmp.counters = counters_new;
> >         page->inuse   = tmp.inuse;
> >         page->inuse2  = tmp.inuse2;
> >         page->objects = tmp.objects;
> >         page->frozen  = tmp.frozen;
> >         return page->counters;
> > }
> > 
> > If "return page->counters;" is removed, this case won't match the fre's current code path
> > in vn_reference_lookup_3 without a load(_14 = page_9(D)->D.2912.counters) to walk all the vuses.
> > So it seems not a fre opportunity but exactly a store merging issue(Also checked llvm that it 
> > doesn't generate byte store and short store so it only produces 1 stw for all patterns).
> 
> Huh, but the code has the return page->counters and you expec it to
> return counters_new unchanged.  Yes, you also expect the code to be
> optimized to
> 
>    page->counters = counters_new;
>    return counters_new;
> 
> and that part would indeed be a store-merging opportunity.  Currently
> FRE does
> 
>   _1 = (unsigned int) counters_new_6(D);
>   tmp.D.1943.counters = _1;
>   _18 = (short unsigned int) counters_new_6(D);
>   page_9(D)->D.1943.D.1942.D.1941.D.1940.inuse = _18;
>   _19 = BIT_FIELD_REF <_1, 8, 16>;
>   page_9(D)->D.1943.D.1942.D.1941.D.1940.inuse2 = _19;
>   _4 = tmp.D.1943.D.1942.D.1941.D.1940.objects;
>   page_9(D)->D.1943.D.1942.D.1941.D.1940.objects = _4;
>   _5 = tmp.D.1943.D.1942.D.1941.D.1940.frozen;
>   page_9(D)->D.1943.D.1942.D.1941.D.1940.frozen = _5;
> 
> so it fails to optimize the other loads to BIT_FIELD_REFs of _1.  If it
> did that then store-merging woudl already handle the code I think.

FRE fails to do that because of

              /* ???  We can't handle bitfield precision extracts without
                 either using an alternate type for the BIT_FIELD_REF and
                 then doing a conversion or possibly adjusting the offset
                 according to endianness.  */
              && (! INTEGRAL_TYPE_P (vr->type)
                  || known_eq (ref->size, TYPE_PRECISION (vr->type)))
              && multiple_p (ref->size, BITS_PER_UNIT))
...
                  || type_has_mode_precision_p (TREE_TYPE (def_rhs)))

where I applied the FUD rule and backed off possible endianess issues...

Richard.

> > Optimize this case in pass store-merging is reasonable, just as you said,
> > "trying to find a "common" source for a combined store.", it requires break
> > store-merging pass's behavior: so far store-merging pass track all store
> > instruction's RHS and stops when RHS is a load instruction with <base_addr, bitsize,
> > bitpos, bitregion_start, bitregion_end> extracted, convert instructions and bitfield
> > instructions generated by pass fre are ignored in pass_store_merging::process_store.
> > To teach the process_store capture real common source requires refactoring handled_load
> > to continue tracking even RHS is a load instruction and support the convert  instructions
> > and bitfield_instructions, this seems to be a big change.
> 
> Well, as said above FRE would elide the loads and you'll see a series of
> stores with BIT_FIELD_REF as sources.
> 
> > Regarding to Jakub's comments, merging could reduce many byte stores and half stores
> > to improve performance for this type of case.  There is already an 033t.esra running before,
> > and not sure whether SRA should replace such kind of bitfield operations.
> > Adding a store-merging pass is so simple and many passes run more than once...
> > So which would be best for this optimization, please?  Thanks again :)
> 
> I think the testcase as-is might be also optimized with Andrews pending
> work to lower bitfield accesses.  So a slightly adjusted testcase not
> using bitfields but maybe char (or mixed char/short) members would be
> useful to track as well.  There SRA might apply (SRA backs off of
> doing anything to bitfields currently).
> 
> Richard.
> 
> > Xionghu
> > 
> > > 
> > > Richard.
> > > 
> > >> PS:
> > >> Before this patch, store_merging_30.c.035t.fre1:
> > >>
> > >> ... foo1:
> > >> Inserted _13 = (short unsigned int) counters_new_5(D);
> > >> Replaced tmp.D.2912.D.2911.D.2910.D.2909.inuse with _13 in all uses of
> > >> _1 = tmp.D.2912.D.2911.D.2910.D.2909.inuse;
> > >> Removing dead stmt _1 = tmp.D.2912.D.2911.D.2910.D.2909.inuse;
> > >> ... foo2:
> > >> Inserted _17 = BIT_FIELD_REF <_1, 8, 16>;
> > >> Replaced tmp.D.2926.D.2925.D.2924.D.2923.objects with _17 in all uses of
> > >> _3 = tmp.D.2926.D.2925.D.2924.D.2923.objects;
> > >> Removing dead stmt _3 = tmp.D.2926.D.2925.D.2924.D.2923.objects;
> > >>
> > >> foo1 asm:
> > >> rldicl 9,4,48,48
> > >> sth 4,0(3)
> > >> sth 9,2(3)
> > >> blr
> > >>
> > >> With this patch(similar for foo2):
> > >>
> > >> stw r4,0(r3)
> > >> blr
> > >>
> > >> gcc/ChangeLog
> > >>
> > >> 2020-02-18  Xiong Hu Luo  <luoxhu@linux.ibm.com>
> > >>
> > >> 	Part of PR middle-end/71509
> > >> 	gimple-ssa-store-merging.c (clone): New.
> > >> 	passes.def (pass_store_merging): New.
> > >>
> > >> gcc/testsuite/ChangeLog
> > >>
> > >> 2020-02-18  Xiong Hu Luo  <luoxhu@linux.ibm.com>
> > >>
> > >> 	Part of PR middle-end/71509
> > >> 	testsuite/gcc.dg/store_merging_14.c: Update.
> > >> 	testsuite/gcc.dg/store_merging_30.c: New.
> > >> ---
> > >>   gcc/gimple-ssa-store-merging.c          |  2 +
> > >>   gcc/passes.def                          |  1 +
> > >>   gcc/testsuite/gcc.dg/store_merging_14.c |  3 +-
> > >>   gcc/testsuite/gcc.dg/store_merging_30.c | 86 +++++++++++++++++++++++++
> > >>   4 files changed, 91 insertions(+), 1 deletion(-)
> > >>   create mode 100644 gcc/testsuite/gcc.dg/store_merging_30.c
> > >>
> > >> diff --git a/gcc/gimple-ssa-store-merging.c b/gcc/gimple-ssa-store-merging.c
> > >> index 8371323ef4a..9a5bd49fc3a 100644
> > >> --- a/gcc/gimple-ssa-store-merging.c
> > >> +++ b/gcc/gimple-ssa-store-merging.c
> > >> @@ -2156,6 +2156,8 @@ public:
> > >>     {
> > >>     }
> > >>   
> > >> +  opt_pass * clone () { return new pass_store_merging (m_ctxt); }
> > >> +
> > >>     /* Pass not supported for PDP-endian, nor for insane hosts or
> > >>        target character sizes where native_{encode,interpret}_expr
> > >>        doesn't work properly.  */
> > >> diff --git a/gcc/passes.def b/gcc/passes.def
> > >> index 2bf2cb78fc5..e531531cb14 100644
> > >> --- a/gcc/passes.def
> > >> +++ b/gcc/passes.def
> > >> @@ -85,6 +85,7 @@ along with GCC; see the file COPYING3.  If not see
> > >>   	  /* pass_build_ealias is a dummy pass that ensures that we
> > >>   	     execute TODO_rebuild_alias at this point.  */
> > >>   	  NEXT_PASS (pass_build_ealias);
> > >> +	  NEXT_PASS (pass_store_merging);
> > >>   	  NEXT_PASS (pass_fre, true /* may_iterate */);
> > >>   	  NEXT_PASS (pass_early_vrp);
> > >>   	  NEXT_PASS (pass_merge_phi);
> > >> diff --git a/gcc/testsuite/gcc.dg/store_merging_14.c b/gcc/testsuite/gcc.dg/store_merging_14.c
> > >> index 9310aaf3489..bd120d18ac6 100644
> > >> --- a/gcc/testsuite/gcc.dg/store_merging_14.c
> > >> +++ b/gcc/testsuite/gcc.dg/store_merging_14.c
> > >> @@ -214,4 +214,5 @@ main ()
> > >>     return 0;
> > >>   }
> > >>   
> > >> -/* { dg-final { scan-tree-dump-times "Merging successful" 9 "store-merging" } } */
> > >> +/* { dg-final { scan-tree-dump-times "Merging successful" 5 "store-merging1" } } */
> > >> +/* { dg-final { scan-tree-dump-times "Merging successful" 4 "store-merging2" } } */
> > >> diff --git a/gcc/testsuite/gcc.dg/store_merging_30.c b/gcc/testsuite/gcc.dg/store_merging_30.c
> > >> new file mode 100644
> > >> index 00000000000..71369c3b196
> > >> --- /dev/null
> > >> +++ b/gcc/testsuite/gcc.dg/store_merging_30.c
> > >> @@ -0,0 +1,86 @@
> > >> +/* { dg-do run } */
> > >> +/* { dg-require-effective-target store_merge } */
> > >> +/* { dg-options "-O2 -fdump-tree-store-merging" } */
> > >> +
> > >> +typedef unsigned int atomic_t;
> > >> +
> > >> +struct page
> > >> +{
> > >> +  union
> > >> +  {
> > >> +    unsigned long counters;
> > >> +    struct
> > >> +    {
> > >> +      union
> > >> +      {
> > >> +	struct
> > >> +	{
> > >> +	  unsigned inuse : 16;
> > >> +	  unsigned objects : 15;
> > >> +	  unsigned frozen : 1;
> > >> +	};
> > >> +      };
> > >> +    };
> > >> +  };
> > >> +};
> > >> +
> > >> +struct page2
> > >> +{
> > >> +  union
> > >> +  {
> > >> +    unsigned counters;
> > >> +    struct
> > >> +    {
> > >> +      union
> > >> +      {
> > >> +	struct
> > >> +	{
> > >> +	  unsigned inuse : 16;
> > >> +	  unsigned objects : 8;
> > >> +	  unsigned frozen : 8;
> > >> +	};
> > >> +      };
> > >> +    };
> > >> +  };
> > >> +};
> > >> +
> > >> +__attribute__((noipa)) void
> > >> +foo1 (struct page *page, unsigned long counters_new)
> > >> +{
> > >> +        struct page tmp;
> > >> +        tmp.counters = counters_new;
> > >> +        page->inuse   = tmp.inuse;
> > >> +        page->objects = tmp.objects;
> > >> +        page->frozen  = tmp.frozen;
> > >> +}
> > >> +
> > >> +__attribute__((noipa)) void
> > >> +foo2 (struct page2 *page2, unsigned long counters_new)
> > >> +{
> > >> +        struct page2 tmp;
> > >> +        tmp.counters = counters_new;
> > >> +        page2->inuse   = tmp.inuse;
> > >> +        page2->objects = tmp.objects;
> > >> +        page2->frozen  = tmp.frozen;
> > >> +}
> > >> +
> > >> +int main ()
> > >> +{
> > >> +  struct page page;
> > >> +  foo1 (&page, 0x12345678ABCDEFFEUL);
> > >> +
> > >> +  asm volatile ("" : : : "memory");
> > >> +  if (page.frozen != 1 || page.objects != 0x2bcd || page.inuse != 0xeffe)
> > >> +    __builtin_abort ();
> > >> +
> > >> +  struct page2 page2;
> > >> +  foo2 (&page2, 0x12345678ABCDEFFEUL);
> > >> +
> > >> +  asm volatile ("" : : : "memory");
> > >> +
> > >> +  if (page2.frozen != 0xab || page2.objects != 0xcd || page2.inuse != 0xeffe)
> > >> +    __builtin_abort ();
> > >> +  return 0;
> > >> +}
> > >> +
> > >> +/* { dg-final { scan-tree-dump-times "Merging successful" 2 "store-merging1" } } */
> > >>
> > > 
> > 
> > 
> 
>
Andrew Pinski Feb. 26, 2020, 9:29 a.m. UTC | #7
On Wed, Feb 26, 2020 at 12:30 AM Richard Biener <rguenther@suse.de> wrote:
>
> On Wed, 26 Feb 2020, luoxhu wrote:
>
> > On 2020/2/18 17:57, Richard Biener wrote:
> > > On Tue, 18 Feb 2020, Xionghu Luo wrote:
> > >
> > >> Store-merging pass should run twice, the reason is pass fre/pre will do
> > >> some kind of optimizations to instructions by:
> > >>    1. Converting the load from address to load from function arguments
> > >>    (store_merging_30.c:foo1).
> > >>    2. Converting the byte access to BIT_FIELD_REF(store_merging_30.c:foo2).
> > >>    3. Other bitfield combinations or potential interference optimizations etc.
> > >> These optimizations will break the store chain, store-merging pass fails
> > >> to catch such kind of pattern so stores are not merged in middle end,
> > >> then consecutive stb/sth instructions(should be merged to stw) are emitted
> > >> finally.
> > >>
> > >> And why not directly move store-merging pass(numbered 194) just before
> > >> fre1(numbered 35) is for case store_merging_14.c, 5 merges are done by
> > >> store_merging1, and 4 merges are done fore store_merge2. So, keep the
> > >> original store_merge as store_merge2 as store merge may be still available
> > >> after other pass optimizations.  Most of the 30 store_merging_N.c test
> > >> case dg-final pass name would be updated from store-merging to
> > >> store-merging1 once this RFC patch idea got confirmed.
> > >> Any comments?  Thanks.
> > >
> > > Generally trying to solve a pass ordering issue by re-ordering/duplicating
> > > passes might help a single testcase but will surely pessimize others.
> > > So that's a no-go.
> > >
> > > What the testcase shows is that store-merging needs to operate
> > > similar to bswap when trying to find a "common" source for a combined
> > > store.  That is, it should appropriately follow defs.  For foo2 I expect
> > > it to be not too difficult, for foo1 I'd say we miss a FRE opportunity
> > > here (but Jakub is working on that one IIRC).
> >
> > Thanks Richard, not sure about my understanding and please correct if any.
> >
> > I tried Jukub's latest patch of "sccvn: Handle bitfields in push_partial_def".
> > Got to know fre pass checks the load instruction's vuse chain and do the constant
> > bitfield combines in push_partial_def, then native_encode_expr/native_interpret_expr
> > can decode and encode the constant content and shift/combine the data.
> > This should be based on one change to my test case(by adding return page->counters;)
> > to trigger the fre pass push all SSA name's partial_defs.  Currently, for SSA variables,
> > this encode/interpret is not supported yet, I suppose this is the opportunity you mean.
> > As this case's input is not constant, so Jukub's patch doesn't fix it.
> >
> > struct page
> > {
> >   union
> >   {
> >     unsigned counters;
> >     struct
> >     {
> >       union
> >       {
> >         struct
> >         {
> >           unsigned inuse : 16;
> >           unsigned inuse2 : 8;
> >           unsigned objects : 5;
> >           unsigned frozen : 3;
> >         };
> >       };
> >     };
> >   };
> > };
> >
> > unsigned
> > foo1 (struct page *page, unsigned long counters_new)
> > {
> >         struct page tmp;
> >         tmp.counters = counters_new;
> >         page->inuse   = tmp.inuse;
> >         page->inuse2  = tmp.inuse2;
> >         page->objects = tmp.objects;
> >         page->frozen  = tmp.frozen;
> >         return page->counters;
> > }
> >
> > If "return page->counters;" is removed, this case won't match the fre's current code path
> > in vn_reference_lookup_3 without a load(_14 = page_9(D)->D.2912.counters) to walk all the vuses.
> > So it seems not a fre opportunity but exactly a store merging issue(Also checked llvm that it
> > doesn't generate byte store and short store so it only produces 1 stw for all patterns).
>
> Huh, but the code has the return page->counters and you expec it to
> return counters_new unchanged.  Yes, you also expect the code to be
> optimized to
>
>    page->counters = counters_new;
>    return counters_new;
>
> and that part would indeed be a store-merging opportunity.  Currently
> FRE does
>
>   _1 = (unsigned int) counters_new_6(D);
>   tmp.D.1943.counters = _1;
>   _18 = (short unsigned int) counters_new_6(D);
>   page_9(D)->D.1943.D.1942.D.1941.D.1940.inuse = _18;
>   _19 = BIT_FIELD_REF <_1, 8, 16>;
>   page_9(D)->D.1943.D.1942.D.1941.D.1940.inuse2 = _19;
>   _4 = tmp.D.1943.D.1942.D.1941.D.1940.objects;
>   page_9(D)->D.1943.D.1942.D.1941.D.1940.objects = _4;
>   _5 = tmp.D.1943.D.1942.D.1941.D.1940.frozen;
>   page_9(D)->D.1943.D.1942.D.1941.D.1940.frozen = _5;
>
> so it fails to optimize the other loads to BIT_FIELD_REFs of _1.  If it
> did that then store-merging woudl already handle the code I think.
>
> > Optimize this case in pass store-merging is reasonable, just as you said,
> > "trying to find a "common" source for a combined store.", it requires break
> > store-merging pass's behavior: so far store-merging pass track all store
> > instruction's RHS and stops when RHS is a load instruction with <base_addr, bitsize,
> > bitpos, bitregion_start, bitregion_end> extracted, convert instructions and bitfield
> > instructions generated by pass fre are ignored in pass_store_merging::process_store.
> > To teach the process_store capture real common source requires refactoring handled_load
> > to continue tracking even RHS is a load instruction and support the convert  instructions
> > and bitfield_instructions, this seems to be a big change.
>
> Well, as said above FRE would elide the loads and you'll see a series of
> stores with BIT_FIELD_REF as sources.
>
> > Regarding to Jakub's comments, merging could reduce many byte stores and half stores
> > to improve performance for this type of case.  There is already an 033t.esra running before,
> > and not sure whether SRA should replace such kind of bitfield operations.
> > Adding a store-merging pass is so simple and many passes run more than once...
> > So which would be best for this optimization, please?  Thanks again :)
>
> I think the testcase as-is might be also optimized with Andrews pending
> work to lower bitfield accesses.  So a slightly adjusted testcase not
> using bitfields but maybe char (or mixed char/short) members would be
> useful to track as well.  There SRA might apply (SRA backs off of
> doing anything to bitfields currently).

With my current lower bitfield accesses and a "pass"/improvement to
handle merging of BIT_INSERT_EXPRs.
We can optimize the above to just:
  _1 = (unsigned int) counters_new_4(D);
  _14 = (short unsigned int) counters_new_4(D);
  _25 = BIT_INSERT_EXPR <_1, _14, 0 (16 bits)>;
  MEM[(union  *)page_7(D)] = _25;
  return _25;

I still need to figure out how to solve the last BIE though.  The cast
from unsigned long to unsigned int is getting in the way (changing
counters_new to unsigned instead, we get the perfect code and not have
BIE there).

Note x86 has still an issue with SLOW_BYTES when handling the bitfield
lowering; which I still need to understand and solve.

Thanks,
Andrew Pinski

>
> Richard.
>
> > Xionghu
> >
> > >
> > > Richard.
> > >
> > >> PS:
> > >> Before this patch, store_merging_30.c.035t.fre1:
> > >>
> > >> ... foo1:
> > >> Inserted _13 = (short unsigned int) counters_new_5(D);
> > >> Replaced tmp.D.2912.D.2911.D.2910.D.2909.inuse with _13 in all uses of
> > >> _1 = tmp.D.2912.D.2911.D.2910.D.2909.inuse;
> > >> Removing dead stmt _1 = tmp.D.2912.D.2911.D.2910.D.2909.inuse;
> > >> ... foo2:
> > >> Inserted _17 = BIT_FIELD_REF <_1, 8, 16>;
> > >> Replaced tmp.D.2926.D.2925.D.2924.D.2923.objects with _17 in all uses of
> > >> _3 = tmp.D.2926.D.2925.D.2924.D.2923.objects;
> > >> Removing dead stmt _3 = tmp.D.2926.D.2925.D.2924.D.2923.objects;
> > >>
> > >> foo1 asm:
> > >> rldicl 9,4,48,48
> > >> sth 4,0(3)
> > >> sth 9,2(3)
> > >> blr
> > >>
> > >> With this patch(similar for foo2):
> > >>
> > >> stw r4,0(r3)
> > >> blr
> > >>
> > >> gcc/ChangeLog
> > >>
> > >> 2020-02-18  Xiong Hu Luo  <luoxhu@linux.ibm.com>
> > >>
> > >>    Part of PR middle-end/71509
> > >>    gimple-ssa-store-merging.c (clone): New.
> > >>    passes.def (pass_store_merging): New.
> > >>
> > >> gcc/testsuite/ChangeLog
> > >>
> > >> 2020-02-18  Xiong Hu Luo  <luoxhu@linux.ibm.com>
> > >>
> > >>    Part of PR middle-end/71509
> > >>    testsuite/gcc.dg/store_merging_14.c: Update.
> > >>    testsuite/gcc.dg/store_merging_30.c: New.
> > >> ---
> > >>   gcc/gimple-ssa-store-merging.c          |  2 +
> > >>   gcc/passes.def                          |  1 +
> > >>   gcc/testsuite/gcc.dg/store_merging_14.c |  3 +-
> > >>   gcc/testsuite/gcc.dg/store_merging_30.c | 86 +++++++++++++++++++++++++
> > >>   4 files changed, 91 insertions(+), 1 deletion(-)
> > >>   create mode 100644 gcc/testsuite/gcc.dg/store_merging_30.c
> > >>
> > >> diff --git a/gcc/gimple-ssa-store-merging.c b/gcc/gimple-ssa-store-merging.c
> > >> index 8371323ef4a..9a5bd49fc3a 100644
> > >> --- a/gcc/gimple-ssa-store-merging.c
> > >> +++ b/gcc/gimple-ssa-store-merging.c
> > >> @@ -2156,6 +2156,8 @@ public:
> > >>     {
> > >>     }
> > >>
> > >> +  opt_pass * clone () { return new pass_store_merging (m_ctxt); }
> > >> +
> > >>     /* Pass not supported for PDP-endian, nor for insane hosts or
> > >>        target character sizes where native_{encode,interpret}_expr
> > >>        doesn't work properly.  */
> > >> diff --git a/gcc/passes.def b/gcc/passes.def
> > >> index 2bf2cb78fc5..e531531cb14 100644
> > >> --- a/gcc/passes.def
> > >> +++ b/gcc/passes.def
> > >> @@ -85,6 +85,7 @@ along with GCC; see the file COPYING3.  If not see
> > >>      /* pass_build_ealias is a dummy pass that ensures that we
> > >>         execute TODO_rebuild_alias at this point.  */
> > >>      NEXT_PASS (pass_build_ealias);
> > >> +    NEXT_PASS (pass_store_merging);
> > >>      NEXT_PASS (pass_fre, true /* may_iterate */);
> > >>      NEXT_PASS (pass_early_vrp);
> > >>      NEXT_PASS (pass_merge_phi);
> > >> diff --git a/gcc/testsuite/gcc.dg/store_merging_14.c b/gcc/testsuite/gcc.dg/store_merging_14.c
> > >> index 9310aaf3489..bd120d18ac6 100644
> > >> --- a/gcc/testsuite/gcc.dg/store_merging_14.c
> > >> +++ b/gcc/testsuite/gcc.dg/store_merging_14.c
> > >> @@ -214,4 +214,5 @@ main ()
> > >>     return 0;
> > >>   }
> > >>
> > >> -/* { dg-final { scan-tree-dump-times "Merging successful" 9 "store-merging" } } */
> > >> +/* { dg-final { scan-tree-dump-times "Merging successful" 5 "store-merging1" } } */
> > >> +/* { dg-final { scan-tree-dump-times "Merging successful" 4 "store-merging2" } } */
> > >> diff --git a/gcc/testsuite/gcc.dg/store_merging_30.c b/gcc/testsuite/gcc.dg/store_merging_30.c
> > >> new file mode 100644
> > >> index 00000000000..71369c3b196
> > >> --- /dev/null
> > >> +++ b/gcc/testsuite/gcc.dg/store_merging_30.c
> > >> @@ -0,0 +1,86 @@
> > >> +/* { dg-do run } */
> > >> +/* { dg-require-effective-target store_merge } */
> > >> +/* { dg-options "-O2 -fdump-tree-store-merging" } */
> > >> +
> > >> +typedef unsigned int atomic_t;
> > >> +
> > >> +struct page
> > >> +{
> > >> +  union
> > >> +  {
> > >> +    unsigned long counters;
> > >> +    struct
> > >> +    {
> > >> +      union
> > >> +      {
> > >> +  struct
> > >> +  {
> > >> +    unsigned inuse : 16;
> > >> +    unsigned objects : 15;
> > >> +    unsigned frozen : 1;
> > >> +  };
> > >> +      };
> > >> +    };
> > >> +  };
> > >> +};
> > >> +
> > >> +struct page2
> > >> +{
> > >> +  union
> > >> +  {
> > >> +    unsigned counters;
> > >> +    struct
> > >> +    {
> > >> +      union
> > >> +      {
> > >> +  struct
> > >> +  {
> > >> +    unsigned inuse : 16;
> > >> +    unsigned objects : 8;
> > >> +    unsigned frozen : 8;
> > >> +  };
> > >> +      };
> > >> +    };
> > >> +  };
> > >> +};
> > >> +
> > >> +__attribute__((noipa)) void
> > >> +foo1 (struct page *page, unsigned long counters_new)
> > >> +{
> > >> +        struct page tmp;
> > >> +        tmp.counters = counters_new;
> > >> +        page->inuse   = tmp.inuse;
> > >> +        page->objects = tmp.objects;
> > >> +        page->frozen  = tmp.frozen;
> > >> +}
> > >> +
> > >> +__attribute__((noipa)) void
> > >> +foo2 (struct page2 *page2, unsigned long counters_new)
> > >> +{
> > >> +        struct page2 tmp;
> > >> +        tmp.counters = counters_new;
> > >> +        page2->inuse   = tmp.inuse;
> > >> +        page2->objects = tmp.objects;
> > >> +        page2->frozen  = tmp.frozen;
> > >> +}
> > >> +
> > >> +int main ()
> > >> +{
> > >> +  struct page page;
> > >> +  foo1 (&page, 0x12345678ABCDEFFEUL);
> > >> +
> > >> +  asm volatile ("" : : : "memory");
> > >> +  if (page.frozen != 1 || page.objects != 0x2bcd || page.inuse != 0xeffe)
> > >> +    __builtin_abort ();
> > >> +
> > >> +  struct page2 page2;
> > >> +  foo2 (&page2, 0x12345678ABCDEFFEUL);
> > >> +
> > >> +  asm volatile ("" : : : "memory");
> > >> +
> > >> +  if (page2.frozen != 0xab || page2.objects != 0xcd || page2.inuse != 0xeffe)
> > >> +    __builtin_abort ();
> > >> +  return 0;
> > >> +}
> > >> +
> > >> +/* { dg-final { scan-tree-dump-times "Merging successful" 2 "store-merging1" } } */
> > >>
> > >
> >
> >
>
> --
> Richard Biener <rguenther@suse.de>
> SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
> Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)
diff mbox series

Patch

diff --git a/gcc/gimple-ssa-store-merging.c b/gcc/gimple-ssa-store-merging.c
index 8371323ef4a..9a5bd49fc3a 100644
--- a/gcc/gimple-ssa-store-merging.c
+++ b/gcc/gimple-ssa-store-merging.c
@@ -2156,6 +2156,8 @@  public:
   {
   }
 
+  opt_pass * clone () { return new pass_store_merging (m_ctxt); }
+
   /* Pass not supported for PDP-endian, nor for insane hosts or
      target character sizes where native_{encode,interpret}_expr
      doesn't work properly.  */
diff --git a/gcc/passes.def b/gcc/passes.def
index 2bf2cb78fc5..e531531cb14 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -85,6 +85,7 @@  along with GCC; see the file COPYING3.  If not see
 	  /* pass_build_ealias is a dummy pass that ensures that we
 	     execute TODO_rebuild_alias at this point.  */
 	  NEXT_PASS (pass_build_ealias);
+	  NEXT_PASS (pass_store_merging);
 	  NEXT_PASS (pass_fre, true /* may_iterate */);
 	  NEXT_PASS (pass_early_vrp);
 	  NEXT_PASS (pass_merge_phi);
diff --git a/gcc/testsuite/gcc.dg/store_merging_14.c b/gcc/testsuite/gcc.dg/store_merging_14.c
index 9310aaf3489..bd120d18ac6 100644
--- a/gcc/testsuite/gcc.dg/store_merging_14.c
+++ b/gcc/testsuite/gcc.dg/store_merging_14.c
@@ -214,4 +214,5 @@  main ()
   return 0;
 }
 
-/* { dg-final { scan-tree-dump-times "Merging successful" 9 "store-merging" } } */
+/* { dg-final { scan-tree-dump-times "Merging successful" 5 "store-merging1" } } */
+/* { dg-final { scan-tree-dump-times "Merging successful" 4 "store-merging2" } } */
diff --git a/gcc/testsuite/gcc.dg/store_merging_30.c b/gcc/testsuite/gcc.dg/store_merging_30.c
new file mode 100644
index 00000000000..71369c3b196
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/store_merging_30.c
@@ -0,0 +1,86 @@ 
+/* { dg-do run } */
+/* { dg-require-effective-target store_merge } */
+/* { dg-options "-O2 -fdump-tree-store-merging" } */
+
+typedef unsigned int atomic_t;
+
+struct page
+{
+  union
+  {
+    unsigned long counters;
+    struct
+    {
+      union
+      {
+	struct
+	{
+	  unsigned inuse : 16;
+	  unsigned objects : 15;
+	  unsigned frozen : 1;
+	};
+      };
+    };
+  };
+};
+
+struct page2
+{
+  union
+  {
+    unsigned counters;
+    struct
+    {
+      union
+      {
+	struct
+	{
+	  unsigned inuse : 16;
+	  unsigned objects : 8;
+	  unsigned frozen : 8;
+	};
+      };
+    };
+  };
+};
+
+__attribute__((noipa)) void
+foo1 (struct page *page, unsigned long counters_new)
+{
+        struct page tmp;
+        tmp.counters = counters_new;
+        page->inuse   = tmp.inuse;
+        page->objects = tmp.objects;
+        page->frozen  = tmp.frozen;
+}
+
+__attribute__((noipa)) void
+foo2 (struct page2 *page2, unsigned long counters_new)
+{
+        struct page2 tmp;
+        tmp.counters = counters_new;
+        page2->inuse   = tmp.inuse;
+        page2->objects = tmp.objects;
+        page2->frozen  = tmp.frozen;
+}
+
+int main ()
+{
+  struct page page;
+  foo1 (&page, 0x12345678ABCDEFFEUL);
+
+  asm volatile ("" : : : "memory");
+  if (page.frozen != 1 || page.objects != 0x2bcd || page.inuse != 0xeffe)
+    __builtin_abort ();
+
+  struct page2 page2;
+  foo2 (&page2, 0x12345678ABCDEFFEUL);
+
+  asm volatile ("" : : : "memory");
+
+  if (page2.frozen != 0xab || page2.objects != 0xcd || page2.inuse != 0xeffe)
+    __builtin_abort ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Merging successful" 2 "store-merging1" } } */