
[RFC] Merge VEC_COND_EXPR into MASK_STORE after loop vectorization

Message ID a6041c90-2e77-f765-6f0d-f46e326a66de@foss.arm.com
State New
Series [RFC] Merge VEC_COND_EXPR into MASK_STORE after loop vectorization

Commit Message

Renlin Li Nov. 8, 2018, 11:02 a.m. UTC
Hi all,

When allow-store-data-races is enabled, ifcvt prefers to generate a
conditional select and an unconditional store to convert certain if
statements into:

_ifc_1 = val
_ifc_2 = A[i]
val = cond? _ifc_1 : _ifc_2
A[i] = val

When the loop is vectorized, this pattern is turned into
MASK_LOAD, VEC_COND_EXPR and MASK_STORE. This could be improved.
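
For illustration, a loop of roughly this shape triggers the behavior (a
hypothetical sketch, similar in spirit to the testcase added by the patch,
not taken from it):

   /* With --param allow-store-data-races=1, ifcvt rewrites the
      conditional store into a load of a[i], a conditional select and
      an unconditional store of the selected value.  */
   void
   foo (int *a, int n, int val)
   {
     for (int i = 0; i < n; i++)
       if (a[i] & 1)
         a[i] = val;
   }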

The change here adds a post-processing function to combine the VEC_COND_EXPR
expression into the MASK_STORE, and to delete the related dead code.

I am being a little conservative here.
I didn't change the default behavior of ifcvt to always generate MASK_STORE,
although it should be safe in all cases (whether store data races are allowed or not).

MASK_STORE might not be handled as well as conditional select in the
vectorization pass, so it might be too early and too aggressive to do that in
ifcvt. The performance of MASK_STORE might also not be good on some platforms.
(We could add a --param or target hook to differentiate this ifcvt behavior
on different platforms.)

Another reason I did not do it in ifcvt is data reference analysis. To create
a MASK_STORE, a pointer is created as the first argument to the internal
function call. If the pointer is created from an array reference, e.g.
x = &A[i], data reference analysis cannot properly analyze the relationship
between MEM_REF (x) and ARRAY_REF (A, i). This creates a versioned loop
beside the vectorized one.
(I have hacks to look through the MEM_REF and restore the reference back to
ARRAY_REF (A, i).  Maybe we could do the analysis on lowered address
expressions?  I see we have the gimple laddress pass to aid the vectorizer.)
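
As a hedged illustration of that analysis problem (hypothetical source code,
not the testcase in this patch): once the store goes through a pointer derived
from the array, dependence analysis has to relate the MEM_REF back to the
ARRAY_REF, or it versions the loop:

   static int A[100];

   void
   bar (int n, int val)
   {
     for (int i = 0; i < n; i++)
       {
         int *x = &A[i];   /* the pointer a MASK_STORE would take  */
         if (A[i] & 1)     /* this access remains an ARRAY_REF     */
           *x = val;       /* this one is a MEM_REF through x      */
       }
   }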

The approach here kicks in rather late, only after a vector MASK_STORE has
been generated by the loop vectorizer. In that case, it is definitely
beneficial to do the code transformation.

Any thoughts on the best way to fix the issue?


This patch has been tested with aarch64-none-elf, no regressions.

Regards,
Renlin

gcc/ChangeLog:

2018-11-08  Renlin Li  <renlin.li@arm.com>

	* tree-vectorizer.h (combine_sel_mask_store): Declare new function.
	* tree-vect-loop.c (combine_sel_mask_store): Define new function.
	* tree-vectorizer.c (vectorize_loops): Call it.

gcc/testsuite/ChangeLog:

2018-11-08  Renlin Li  <renlin.li@arm.com>

	* gcc.target/aarch64/sve/combine_vcond_mask_store_1.c: New.

Comments

Richard Biener Nov. 8, 2018, 12:09 p.m. UTC | #1
On Thu, Nov 8, 2018 at 12:02 PM Renlin Li <renlin.li@foss.arm.com> wrote:
>
> Hi all,
>
> When allow-store-data-races is enabled, ifcvt would prefer to generated
> conditional select and unconditional store to convert certain if statement
> into:
>
> _ifc_1 = val
> _ifc_2 = A[i]
> val = cond? _ifc_1 : _ifc_2
> A[i] = val
>
> When the loop got vectorized, this pattern will be turned into
> MASK_LOAD, VEC_COND_EXPR and MASK_STORE. This could be improved.

I'm somewhat confused - the vectorizer doesn't generate a masked store when
if-conversion didn't create one in the first place.

In particular with allow-store-data-races=1 (what your testcase uses) there
are no masked loads/stores generated at all.  So at least you need a better
testcase to motivate this (one that doesn't load from array[i], so that we
know the conditional stores might trap).

So what I see (with store data races not allowed) from ifcvt is

  <bb 3> [local count: 1006632961]:
  # i_20 = PHI <i_13(9), 1(21)>
  # ivtmp_18 = PHI <ivtmp_5(9), 15(21)>
  a_10 = array[i_20];
  _1 = a_10 & 1;
  _2 = a_10 + 1;
  _32 = _1 != 0;
  _33 = &array[i_20];
  .MASK_STORE (_33, 32B, _32, _2);
  prephitmp_8 = _1 != 0 ? _2 : a_10;
  _4 = a_10 + 2;
  _34 = prephitmp_8 > 10;
  .MASK_STORE (_33, 32B, _34, _4);
  i_13 = i_20 + 1;
  ivtmp_5 = ivtmp_18 - 1;
  if (ivtmp_5 != 0)

and what you want to do is merge

  prephitmp_8 = _1 != 0 ? _2 : a_10;
  _34 = prephitmp_8 > 10;

somehow?  But your patch mentions that _4 should be prephitmp_8 so
it wouldn't do anything here?

> The change here add a post processing function to combine the VEC_COND_EXPR
> expression into MASK_STORE, and delete related dead code.
>
> I am a little bit conservative here.
> I didn't change the default behavior of ifcvt to always generate MASK_STORE,
> although it should be safe in all cases (allow or dis-allow store data race).
>
> MASK_STORE might not well handled in vectorization pass compared with
> conditional select. It might be too early and aggressive to do that in ifcvt.
> And the performance of MASK_STORE might not good for some platforms.
> (We could add --param or target hook to differentiate this ifcvt behavior
> on different platforms)
>
> Another reason I did not do that in ifcvt is the data reference
> analysis. To create a MASK_STORE, a pointer is created as the first
> argument to the internal function call. If the pointer is created out of
> array references, e.g. x = &A[i], data reference analysis could not properly
> analysis the relationship between MEM_REF (x) and ARRAY_REF (A, i). This
> will create a versioned loop beside the vectorized one.

Actually for your testcase it won't vectorize because there's compile-time
aliasing (somehow we miss handling of dependence distance zero?!)

> (I have hacks to look through the MEM_REF, and restore the reference back to
> ARRAY_REF (A, i).  Maybe we could do analysis on lowered address expression?
> I saw we have gimple laddress pass to aid the vectorizer)

An old idea of mine is to have dependence analysis fall back to lowered address
form when it fails to match two references.  This would require re-analysis and
eventually storing an alternate "inner reference" in the data-ref structure.

> The approach here comes a little bit late, on the condition that vector
> MASK_STORE is generated by loop vectorizer. In this case, it is definitely
> beneficial to do the code transformation.
>
> Any thoughts on the best way to fix the issue?
>
>
> This patch has been tested with aarch64-none-elf, no regressions.
>
> Regards,
> Renlin
>
> gcc/ChangeLog:
>
> 2018-11-08  Renlin Li  <renlin.li@arm.com>
>
>         * tree-vectorizer.h (combine_sel_mask_store): Declare new function.
>         * tree-vect-loop.c (combine_sel_mask_store): Define new function.
>         * tree-vectorizer.c (vectorize_loops): Call it.
>
> gcc/testsuite/ChangeLog:
>
> 2018-11-08  Renlin Li  <renlin.li@arm.com>
>
>         * gcc.target/aarch64/sve/combine_vcond_mask_store_1.c: New.
>
Renlin Li Nov. 8, 2018, 4:55 p.m. UTC | #2
Hi Richard,

On 11/08/2018 12:09 PM, Richard Biener wrote:
> On Thu, Nov 8, 2018 at 12:02 PM Renlin Li <renlin.li@foss.arm.com> wrote:
>>
>> Hi all,
>>
>> When allow-store-data-races is enabled, ifcvt would prefer to generated
>> conditional select and unconditional store to convert certain if statement
>> into:
>>
>> _ifc_1 = val
>> _ifc_2 = A[i]
>> val = cond? _ifc_1 : _ifc_2
>> A[i] = val
>>
>> When the loop got vectorized, this pattern will be turned into
>> MASK_LOAD, VEC_COND_EXPR and MASK_STORE. This could be improved.
> 
> I'm somewhat confused - the vectorizer doesn't generate a masked store when
> if-conversion didn't create one in the first place
> 
> In particular with allow-store-data-races=1 (what your testcase uses)
> there are no
> masked loads/stores generated at all.   So at least you need a better testcase
> to motivate (one that doesn't load from array[i] so that we know the conditional
> stores might trap).

Thanks for trying this. The test case is a little simple and artificial.
ifcvt won't generate a mask_store; instead it generates an unconditional store with allow-store-data-races=1.

My build is based on 25th Oct. I got the following IR from ifcvt with
aarch64-none-elf-gcc -S -march=armv8-a+sve -O2 -ftree-vectorize --param allow-store-data-races=1

   <bb 3> [local count: 1006632961]:
   # i_20 = PHI <i_13(9), 1(19)>
   # ivtmp_18 = PHI <ivtmp_5(9), 15(19)>
   a_10 = array[i_20];
   _1 = a_10 & 1;
   _2 = a_10 + 1;
   _ifc__32 = array[i_20];
   _ifc__33 = _2;
   _ifc__34 = _1 != 0 ? _ifc__33 : _ifc__32;
   array[i_20] = _ifc__34;
   prephitmp_8 = _1 != 0 ? _2 : a_10;
   _4 = a_10 + 2;
   _ifc__35 = array[i_20];
   _ifc__36 = _4;
   _ifc__37 = prephitmp_8 > 10 ? _ifc__36 : _ifc__35;
   array[i_20] = _ifc__37;
   i_13 = i_20 + 1;
   ivtmp_5 = ivtmp_18 - 1;
   if (ivtmp_5 != 0)
     goto <bb 9>; [93.33%]
   else
     goto <bb 8>; [6.67%]

*However*, after I rebased my patch on the latest trunk,
I got the following dump from ifcvt:
   <bb 3> [local count: 1006632961]:
   # i_20 = PHI <i_13(9), 1(21)>
   # ivtmp_18 = PHI <ivtmp_5(9), 15(21)>
   a_10 = array[i_20];
   _1 = a_10 & 1;
   _2 = a_10 + 1;
   _ifc__34 = _1 != 0 ? _2 : a_10;
   array[i_20] = _ifc__34;
   _4 = a_10 + 2;
   _ifc__37 = _ifc__34 > 10 ? _4 : _ifc__34;
   array[i_20] = _ifc__37;
   i_13 = i_20 + 1;
   ivtmp_5 = ivtmp_18 - 1;
   if (ivtmp_5 != 0)
     goto <bb 9>; [93.33%]
   else
     goto <bb 8>; [6.67%]

The redundant load is no longer generated, but you can still see the unconditional store.
After loop vectorization, the following is generated (without my change):

   vect_a_10.6_6 = .MASK_LOAD (vectp_array.4_35, 4B, loop_mask_7);
   a_10 = array[i_20];
   vect__1.7_39 = vect_a_10.6_6 & vect_cst__38;
   _1 = a_10 & 1;
   vect__2.8_41 = vect_a_10.6_6 + vect_cst__40;
   _2 = a_10 + 1;
   vect__ifc__34.9_43 = VEC_COND_EXPR <vect__1.7_39 != vect_cst__42, vect__2.8_41, vect_a_10.6_6>;
   _ifc__34 = _1 != 0 ? _2 : a_10;
   .MASK_STORE (vectp_array.10_45, 4B, loop_mask_7, vect__ifc__34.9_43);
   vect__4.12_49 = vect_a_10.6_6 + vect_cst__48;
   _4 = a_10 + 2;
   vect__ifc__37.13_51 = VEC_COND_EXPR <vect__ifc__34.9_43 > vect_cst__50, vect__4.12_49, vect__ifc__34.9_43>;
   _ifc__37 = _ifc__34 > 10 ? _4 : _ifc__34;
   .MASK_STORE (vectp_array.14_53, 4B, loop_mask_7, vect__ifc__37.13_51);

With the old ifcvt code, my change here could improve things a little by eliminating some redundant loads.
With the new code, it cannot improve it any further. I'll adjust the patch based on the latest trunk.


> 
> So what I see (with store data races not allowed) from ifcvt is

When store data races are not allowed, we don't generate an unconditional store; instead ifcvt
generates a predicated store. That's what you showed here.

As I mentioned, we could always make ifcvt generate a mask_store, as that should always be safe.
But I don't know the performance implications on other targets (I assume there must be reasons why
the code was written to generate an unconditional store when data races are allowed? My understanding
is that this option allows the compiler to be more aggressive in optimization).

The other reason is data reference analysis: a versioned loop might be created beside the
vectorized one with a more complex test case.

Again, I need to rebase and check my patch against the latest trunk, and come up with a better test case.

> 
>    <bb 3> [local count: 1006632961]:
>    # i_20 = PHI <i_13(9), 1(21)>
>    # ivtmp_18 = PHI <ivtmp_5(9), 15(21)>
>    a_10 = array[i_20];
>    _1 = a_10 & 1;
>    _2 = a_10 + 1;
>    _32 = _1 != 0;
>    _33 = &array[i_20];
>    .MASK_STORE (_33, 32B, _32, _2);
>    prephitmp_8 = _1 != 0 ? _2 : a_10;
>    _4 = a_10 + 2;
>    _34 = prephitmp_8 > 10;
>    .MASK_STORE (_33, 32B, _34, _4);
>    i_13 = i_20 + 1;
>    ivtmp_5 = ivtmp_18 - 1;
>    if (ivtmp_5 != 0)
> 
> and what you want to do is merge
> 
>    prephitmp_8 = _1 != 0 ? _2 : a_10;
>    _34 = prephitmp_8 > 10;
> 
> somehow?  But your patch mentions that _4 should be prephitmp_8 so
> it wouldn't do anything here?
> 
>> The change here add a post processing function to combine the VEC_COND_EXPR
>> expression into MASK_STORE, and delete related dead code.
>>
>> I am a little bit conservative here.
>> I didn't change the default behavior of ifcvt to always generate MASK_STORE,
>> although it should be safe in all cases (allow or dis-allow store data race).
>>
>> MASK_STORE might not well handled in vectorization pass compared with
>> conditional select. It might be too early and aggressive to do that in ifcvt.
>> And the performance of MASK_STORE might not good for some platforms.
>> (We could add --param or target hook to differentiate this ifcvt behavior
>> on different platforms)
>>
>> Another reason I did not do that in ifcvt is the data reference
>> analysis. To create a MASK_STORE, a pointer is created as the first
>> argument to the internal function call. If the pointer is created out of
>> array references, e.g. x = &A[i], data reference analysis could not properly
>> analysis the relationship between MEM_REF (x) and ARRAY_REF (A, i). This
>> will create a versioned loop beside the vectorized one.
> 
> Actually for your testcase it won't vectorize because there's compile-time
> aliasing (somehow we miss handling of dependence distance zero?!)
> 
>> (I have hacks to look through the MEM_REF, and restore the reference back to
>> ARRAY_REF (A, i).  Maybe we could do analysis on lowered address expression?
>> I saw we have gimple laddress pass to aid the vectorizer)
> 
> An old idea of mine is to have dependence analysis fall back to lowered address
> form when it fails to match two references.  This would require re-analysis and
> eventually storing an alternate "inner reference" in the data-ref structure.

Yes, that makes sense.

My hack is in get_references_in_stmt: look through the pointer to see where it is generated.
This is only for mask_store, as we know the first operand is either the original pointer or a new pointer derived from something.


Thanks,
Renlin


> 
>> The approach here comes a little bit late, on the condition that vector
>> MASK_STORE is generated by loop vectorizer. In this case, it is definitely
>> beneficial to do the code transformation.
>>
>> Any thoughts on the best way to fix the issue?
>>
>>
>> This patch has been tested with aarch64-none-elf, no regressions.
>>
>> Regards,
>> Renlin
>>
>> gcc/ChangeLog:
>>
>> 2018-11-08  Renlin Li  <renlin.li@arm.com>
>>
>>          * tree-vectorizer.h (combine_sel_mask_store): Declare new function.
>>          * tree-vect-loop.c (combine_sel_mask_store): Define new function.
>>          * tree-vectorizer.c (vectorize_loops): Call it.
>>
>> gcc/testsuite/ChangeLog:
>>
>> 2018-11-08  Renlin Li  <renlin.li@arm.com>
>>
>>          * gcc.target/aarch64/sve/combine_vcond_mask_store_1.c: New.
>>
Richard Biener Nov. 9, 2018, 11:48 a.m. UTC | #3
On Thu, Nov 8, 2018 at 5:55 PM Renlin Li <renlin.li@foss.arm.com> wrote:
>
> Hi Richard,
>
> On 11/08/2018 12:09 PM, Richard Biener wrote:
> > On Thu, Nov 8, 2018 at 12:02 PM Renlin Li <renlin.li@foss.arm.com> wrote:
> >>
> >> Hi all,
> >>
> >> When allow-store-data-races is enabled, ifcvt would prefer to generated
> >> conditional select and unconditional store to convert certain if statement
> >> into:
> >>
> >> _ifc_1 = val
> >> _ifc_2 = A[i]
> >> val = cond? _ifc_1 : _ifc_2
> >> A[i] = val
> >>
> >> When the loop got vectorized, this pattern will be turned into
> >> MASK_LOAD, VEC_COND_EXPR and MASK_STORE. This could be improved.
> >
> > I'm somewhat confused - the vectorizer doesn't generate a masked store when
> > if-conversion didn't create one in the first place
> >
> > In particular with allow-store-data-races=1 (what your testcase uses)
> > there are no
> > masked loads/stores generated at all.   So at least you need a better testcase
> > to motivate (one that doesn't load from array[i] so that we know the conditional
> > stores might trap).
>
> Thanks for trying this. The test case is a little bit simple and artificial.
> ifcvt won't generate mask_store, instead it will generate unconditional store with allow-store-data-races=1.
>
> My build is based on 25th Oct. I got the following IR from ifcvt with
> aarch64-none-elf-gcc -S -march=armv8-a+sve -O2 -ftree-vectorize --param allow-store-data-races=1
>
>    <bb 3> [local count: 1006632961]:
>    # i_20 = PHI <i_13(9), 1(19)>
>    # ivtmp_18 = PHI <ivtmp_5(9), 15(19)>
>    a_10 = array[i_20];
>    _1 = a_10 & 1;
>    _2 = a_10 + 1;
>    _ifc__32 = array[i_20];
>    _ifc__33 = _2;
>    _ifc__34 = _1 != 0 ? _ifc__33 : _ifc__32;
>    array[i_20] = _ifc__34;
>    prephitmp_8 = _1 != 0 ? _2 : a_10;
>    _4 = a_10 + 2;
>    _ifc__35 = array[i_20];
>    _ifc__36 = _4;
>    _ifc__37 = prephitmp_8 > 10 ? _ifc__36 : _ifc__35;
>    array[i_20] = _ifc__37;
>    i_13 = i_20 + 1;
>    ivtmp_5 = ivtmp_18 - 1;
>    if (ivtmp_5 != 0)
>      goto <bb 9>; [93.33%]
>    else
>      goto <bb 8>; [6.67%]
>
> *However*, after I rebased my patch on the latest trunk.
> Got the following dump from ifcvt:
>    <bb 3> [local count: 1006632961]:
>    # i_20 = PHI <i_13(9), 1(21)>
>    # ivtmp_18 = PHI <ivtmp_5(9), 15(21)>
>    a_10 = array[i_20];
>    _1 = a_10 & 1;
>    _2 = a_10 + 1;
>    _ifc__34 = _1 != 0 ? _2 : a_10;
>    array[i_20] = _ifc__34;
>    _4 = a_10 + 2;
>    _ifc__37 = _ifc__34 > 10 ? _4 : _ifc__34;
>    array[i_20] = _ifc__37;
>    i_13 = i_20 + 1;
>    ivtmp_5 = ivtmp_18 - 1;
>    if (ivtmp_5 != 0)
>      goto <bb 9>; [93.33%]
>    else
>      goto <bb 8>; [6.67%]
>
> the redundant load is not generated, but you could still see the unconditional store.

Yes, I fixed the redundant loads recently and indeed dead stores
remain (for the particular
testcase they would be easy to remove).

> After loop vectorization, the following is generated (without my change):

Huh.  But that's not because of if-conversion but because SVE needs to mask
_all_ loop operations that are not safe to execute with the loop_mask!

>    vect_a_10.6_6 = .MASK_LOAD (vectp_array.4_35, 4B, loop_mask_7);
>    a_10 = array[i_20];
>    vect__1.7_39 = vect_a_10.6_6 & vect_cst__38;
>    _1 = a_10 & 1;
>    vect__2.8_41 = vect_a_10.6_6 + vect_cst__40;
>    _2 = a_10 + 1;
>    vect__ifc__34.9_43 = VEC_COND_EXPR <vect__1.7_39 != vect_cst__42, vect__2.8_41, vect_a_10.6_6>;
>    _ifc__34 = _1 != 0 ? _2 : a_10;
>    .MASK_STORE (vectp_array.10_45, 4B, loop_mask_7, vect__ifc__34.9_43);
>    vect__4.12_49 = vect_a_10.6_6 + vect_cst__48;
>    _4 = a_10 + 2;
>    vect__ifc__37.13_51 = VEC_COND_EXPR <vect__ifc__34.9_43 > vect_cst__50, vect__4.12_49, vect__ifc__34.9_43>;
>    _ifc__37 = _ifc__34 > 10 ? _4 : _ifc__34;
>    .MASK_STORE (vectp_array.14_53, 4B, loop_mask_7, vect__ifc__37.13_51);
>
> With the old ifcvt code, my change here could improve it a little bit, eliminate some redundant load.
> With the new code, it could not improved it further. I'll adjust the patch based on the latest trunk.

So what does the patch change the above to?  The code has little to no
comments apart from a small picture with code _before_ the transform...

I was wondering whether we can implement

  l = [masked]load;
  tem = cond ? x : l;
  masked-store = tem;

pattern matching in a regular pass - forwprop for example.  Note the load
doesn't need to be masked, correct?  In fact if it is masked you need to make
sure the conditional never accesses parts that are masked in the load, no?
Or require the mask to be the same as that used by the store.  But then you
still cannot simply replace the store mask with a new mask generated from
the conditional?

Richard.

>
> >
> > So what I see (with store data races not allowed) from ifcvt is
>
> when store data races is not allowed, we won't generate unconditional store. Instead ifcvt
> generates predicated store. That's what you showed here.
>
> As I mentioned, we could always make ifcvt generate mask_store as it should be always safe.
> But I don't know the performance implication on other targets (I assume there must be reasons why
> people write code to generate unconditional store when data-race is allowed? What I understand is that,
> this option allows the compiler to be more aggressive on optimization).
>
> The other reason is the data reference analysis. There might be versioned loop created with a
> more complexer test case.
>
> Again, I need to rebase and check my patch with the latest trunk, and need to come up with a better test case.
>
> >
> >    <bb 3> [local count: 1006632961]:
> >    # i_20 = PHI <i_13(9), 1(21)>
> >    # ivtmp_18 = PHI <ivtmp_5(9), 15(21)>
> >    a_10 = array[i_20];
> >    _1 = a_10 & 1;
> >    _2 = a_10 + 1;
> >    _32 = _1 != 0;
> >    _33 = &array[i_20];
> >    .MASK_STORE (_33, 32B, _32, _2);
> >    prephitmp_8 = _1 != 0 ? _2 : a_10;
> >    _4 = a_10 + 2;
> >    _34 = prephitmp_8 > 10;
> >    .MASK_STORE (_33, 32B, _34, _4);
> >    i_13 = i_20 + 1;
> >    ivtmp_5 = ivtmp_18 - 1;
> >    if (ivtmp_5 != 0)
> >
> > and what you want to do is merge
> >
> >    prephitmp_8 = _1 != 0 ? _2 : a_10;
> >    _34 = prephitmp_8 > 10;
> >
> > somehow?  But your patch mentions that _4 should be prephitmp_8 so
> > it wouldn't do anything here?
> >
> >> The change here add a post processing function to combine the VEC_COND_EXPR
> >> expression into MASK_STORE, and delete related dead code.
> >>
> >> I am a little bit conservative here.
> >> I didn't change the default behavior of ifcvt to always generate MASK_STORE,
> >> although it should be safe in all cases (allow or dis-allow store data race).
> >>
> >> MASK_STORE might not well handled in vectorization pass compared with
> >> conditional select. It might be too early and aggressive to do that in ifcvt.
> >> And the performance of MASK_STORE might not good for some platforms.
> >> (We could add --param or target hook to differentiate this ifcvt behavior
> >> on different platforms)
> >>
> >> Another reason I did not do that in ifcvt is the data reference
> >> analysis. To create a MASK_STORE, a pointer is created as the first
> >> argument to the internal function call. If the pointer is created out of
> >> array references, e.g. x = &A[i], data reference analysis could not properly
> >> analysis the relationship between MEM_REF (x) and ARRAY_REF (A, i). This
> >> will create a versioned loop beside the vectorized one.
> >
> > Actually for your testcase it won't vectorize because there's compile-time
> > aliasing (somehow we miss handling of dependence distance zero?!)
> >
> >> (I have hacks to look through the MEM_REF, and restore the reference back to
> >> ARRAY_REF (A, i).  Maybe we could do analysis on lowered address expression?
> >> I saw we have gimple laddress pass to aid the vectorizer)
> >
> > An old idea of mine is to have dependence analysis fall back to lowered address
> > form when it fails to match two references.  This would require re-analysis and
> > eventually storing an alternate "inner reference" in the data-ref structure.
>
> Yes, that makes sense.
>
> My hack is in get_references_in_stmt. Look through the pointer to see where it is generated.
> This is only for mask_store as we know the first operand is original a pointer or a new pointer from something.
>
>
> Thanks,
> Renlin
>
>
> >
> >> The approach here comes a little bit late, on the condition that vector
> >> MASK_STORE is generated by loop vectorizer. In this case, it is definitely
> >> beneficial to do the code transformation.
> >>
> >> Any thoughts on the best way to fix the issue?
> >>
> >>
> >> This patch has been tested with aarch64-none-elf, no regressions.
> >>
> >> Regards,
> >> Renlin
> >>
> >> gcc/ChangeLog:
> >>
> >> 2018-11-08  Renlin Li  <renlin.li@arm.com>
> >>
> >>          * tree-vectorizer.h (combine_sel_mask_store): Declare new function.
> >>          * tree-vect-loop.c (combine_sel_mask_store): Define new function.
> >>          * tree-vectorizer.c (vectorize_loops): Call it.
> >>
> >> gcc/testsuite/ChangeLog:
> >>
> >> 2018-11-08  Renlin Li  <renlin.li@arm.com>
> >>
> >>          * gcc.target/aarch64/sve/combine_vcond_mask_store_1.c: New.
> >>
Renlin Li Nov. 9, 2018, 3:49 p.m. UTC | #4
Hi Richard,

On 11/09/2018 11:48 AM, Richard Biener wrote:
> On Thu, Nov 8, 2018 at 5:55 PM Renlin Li <renlin.li@foss.arm.com> wrote:
>>
>> Hi Richard,
>>
>>
>> *However*, after I rebased my patch on the latest trunk.
>> Got the following dump from ifcvt:
>>     <bb 3> [local count: 1006632961]:
>>     # i_20 = PHI <i_13(9), 1(21)>
>>     # ivtmp_18 = PHI <ivtmp_5(9), 15(21)>
>>     a_10 = array[i_20];
>>     _1 = a_10 & 1;
>>     _2 = a_10 + 1;
>>     _ifc__34 = _1 != 0 ? _2 : a_10;
>>     array[i_20] = _ifc__34;
>>     _4 = a_10 + 2;
>>     _ifc__37 = _ifc__34 > 10 ? _4 : _ifc__34;
>>     array[i_20] = _ifc__37;
>>     i_13 = i_20 + 1;
>>     ivtmp_5 = ivtmp_18 - 1;
>>     if (ivtmp_5 != 0)
>>       goto <bb 9>; [93.33%]
>>     else
>>       goto <bb 8>; [6.67%]
>>
>> the redundant load is not generated, but you could still see the unconditional store.
> 
> Yes, I fixed the redundant loads recently and indeed dead stores
> remain (for the particular
> testcase they would be easy to remove).

Right.

> 
>> After loop vectorization, the following is generated (without my change):
> 
> Huh.  But that's not because of if-conversion but because SVE needs to
> mask _all_
> loop operations that are not safe to execute with the loop_mask!
> 
>>     vect_a_10.6_6 = .MASK_LOAD (vectp_array.4_35, 4B, loop_mask_7);
>>     a_10 = array[i_20];
>>     vect__1.7_39 = vect_a_10.6_6 & vect_cst__38;
>>     _1 = a_10 & 1;
>>     vect__2.8_41 = vect_a_10.6_6 + vect_cst__40;
>>     _2 = a_10 + 1;
>>     vect__ifc__34.9_43 = VEC_COND_EXPR <vect__1.7_39 != vect_cst__42, vect__2.8_41, vect_a_10.6_6>;
>>     _ifc__34 = _1 != 0 ? _2 : a_10;
>>     .MASK_STORE (vectp_array.10_45, 4B, loop_mask_7, vect__ifc__34.9_43);
>>     vect__4.12_49 = vect_a_10.6_6 + vect_cst__48;
>>     _4 = a_10 + 2;
>>     vect__ifc__37.13_51 = VEC_COND_EXPR <vect__ifc__34.9_43 > vect_cst__50, vect__4.12_49, vect__ifc__34.9_43>;
>>     _ifc__37 = _ifc__34 > 10 ? _4 : _ifc__34;
>>     .MASK_STORE (vectp_array.14_53, 4B, loop_mask_7, vect__ifc__37.13_51);
>>
>> With the old ifcvt code, my change here could improve it a little bit, eliminate some redundant load.
>> With the new code, it could not improved it further. I'll adjust the patch based on the latest trunk.
> 
> So what does the patch change the above to?  The code has little to no
> comments apart from a
> small picture with code _before_ the transform...
It is like this:
   vect_a_10.6_6 = .MASK_LOAD (vectp_array.4_35, 4B, loop_mask_7);
   a_10 = array[i_20];
   vect__1.7_39 = vect_a_10.6_6 & vect_cst__38;
   _1 = a_10 & 1;
   vect__2.8_41 = vect_a_10.6_6 + vect_cst__40;
   _2 = a_10 + 1;
   _60 = vect__1.7_39 != vect_cst__42;
   vect__ifc__34.9_43 = VEC_COND_EXPR <_60, vect__2.8_41, vect_a_10.6_6>;
   _ifc__34 = _1 != 0 ? _2 : a_10;
   vec_mask_and_61 = _60 & loop_mask_7;
   .MASK_STORE (vectp_array.10_45, 4B, vec_mask_and_61, vect__2.8_41);
   vect__4.12_49 = vect_a_10.6_6 + vect_cst__48;
   _4 = a_10 + 2;
   vect__ifc__37.13_51 = VEC_COND_EXPR <vect__ifc__34.9_43 > vect_cst__50, vect__4.12_49, vect__ifc__34.9_43>;
   _ifc__37 = _ifc__34 > 10 ? _4 : _ifc__34;
   .MASK_STORE (vectp_array.14_53, 4B, loop_mask_7, vect__ifc__37.13_51);

As the loaded value is used later, it cannot be removed.

With the change, ideally, less data is stored.
However, it might generate more instructions.

1. The load is not eliminable. Apparently, your change eliminates most of the redundant loads;
    the rest are necessary or not easy to remove.
2. There is an additional AND instruction.

With a simpler test case like this:

static int array[100];
int test (int a, int i)
{
   for (unsigned i = 0; i < 16; i++)
     {
       if (a & 1)
	array[i] = a + 1;
     }
   return array[i];
}

The new code-gen will be:
   vect__2.4_29 = vect_cst__27 + vect_cst__28;
   _44 = vect_cst__34 != vect_cst__35;
   vec_mask_and_45 = _44 & loop_mask_32;
   .MASK_STORE (vectp_array.9_37, 4B, vec_mask_and_45, vect__2.4_29);

While the old one is:

   vect__2.4_29 = vect_cst__27 + vect_cst__28;
   vect__ifc__24.7_33 = .MASK_LOAD (vectp_array.5_30, 4B, loop_mask_32);
   vect__ifc__26.8_36 = VEC_COND_EXPR <vect_cst__34 != vect_cst__35, vect__2.4_29, vect__ifc__24.7_33>;
   .MASK_STORE (vectp_array.9_37, 4B, loop_mask_32, vect__ifc__26.8_36);


> 
> I was wondering whether we can implement
> 
>    l = [masked]load;
>    tem = cond ? x : l;
>    masked-store = tem;
> 
> pattern matching in a regular pass - forwprop for example.  Note the
> load doesn't need to be masked,
> correct?  In fact if it is masked you need to make sure the
> conditional never accesses parts that
> are masked in the load, no?  Or require the mask to be the same as
> that used by the store.  But then
> you still cannot simply replace the store mask with a new mask
> generated from the conditional?

Yes, this would require the mask for the load and the store to be the same.
This matches the pattern before loop vectorization.
The mask here is the loop mask, to ensure we are bounded by the number of iterations.

The new mask is (original mask & condition mask) (example shown above; see also the per-lane sketch below).
In this case, fewer lanes will be stored.
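
A per-lane scalar model of that equivalence (just a sketch of the reasoning,
not code from the patch), assuming the load and the store use the same
address and the same mask:

   /* Before: masked read-modify-write of the selected value.  */
   static void
   lane_before (int *ptr, _Bool mask, _Bool cond, int val)
   {
     if (mask)
       {
         int old = *ptr;             /* MASK_LOAD under MASK         */
         *ptr = cond ? val : old;    /* VEC_COND_EXPR + MASK_STORE   */
       }
   }

   /* After: one store under the combined mask; memory ends up in the
      same state, but fewer lanes are actually written.  */
   static void
   lane_after (int *ptr, _Bool mask, _Bool cond, int val)
   {
     if (mask && cond)               /* new mask = MASK & cond       */
       *ptr = val;
   }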

It is possible to do that in forwprop.
I could try to integrate the change into it if that is the correct place for it to go.

As the pattern is initially generated by the loop vectorizer, I did the change right after it, before it gets
converted into other forms. For example, forwprop will transform the original code into:

   vect__2.4_29 = vect_cst__27 + { 1, ... };
   _16 = (void *) ivtmp.13_25;
   _2 = &MEM[base: _16, offset: 0B];
   vect__ifc__24.7_33 = .MASK_LOAD (_2, 4B, loop_mask_32);
   _28 = vect_cst__34 != { 0, ... };
   _35 = .COND_ADD (_28, vect_cst__27, { 1, ... }, vect__ifc__24.7_33);
   vect__ifc__26.8_36 = _35;
   .MASK_STORE (_2, 4B, loop_mask_32, vect__ifc__26.8_36);
   ivtmp_41 = ivtmp_40 + POLY_INT_CST [4, 4];
   next_mask_43 = .WHILE_ULT (ivtmp_41, 16, { 0, ... });
   ivtmp.13_15 = ivtmp.13_25 + POLY_INT_CST [16, 16];

This makes the pattern matching not straightforward.

Thanks,
Renlin


> 
> Richard.
>
Richard Biener Nov. 14, 2018, 2:59 p.m. UTC | #5
On Fri, Nov 9, 2018 at 4:49 PM Renlin Li <renlin.li@foss.arm.com> wrote:
>
> Hi Richard,
>
> On 11/09/2018 11:48 AM, Richard Biener wrote:
> > On Thu, Nov 8, 2018 at 5:55 PM Renlin Li <renlin.li@foss.arm.com> wrote:
> >>
> >> Hi Richard,
> >>
> >>
> >> *However*, after I rebased my patch on the latest trunk.
> >> Got the following dump from ifcvt:
> >>     <bb 3> [local count: 1006632961]:
> >>     # i_20 = PHI <i_13(9), 1(21)>
> >>     # ivtmp_18 = PHI <ivtmp_5(9), 15(21)>
> >>     a_10 = array[i_20];
> >>     _1 = a_10 & 1;
> >>     _2 = a_10 + 1;
> >>     _ifc__34 = _1 != 0 ? _2 : a_10;
> >>     array[i_20] = _ifc__34;
> >>     _4 = a_10 + 2;
> >>     _ifc__37 = _ifc__34 > 10 ? _4 : _ifc__34;
> >>     array[i_20] = _ifc__37;
> >>     i_13 = i_20 + 1;
> >>     ivtmp_5 = ivtmp_18 - 1;
> >>     if (ivtmp_5 != 0)
> >>       goto <bb 9>; [93.33%]
> >>     else
> >>       goto <bb 8>; [6.67%]
> >>
> >> the redundant load is not generated, but you could still see the unconditional store.
> >
> > Yes, I fixed the redundant loads recently and indeed dead stores
> > remain (for the particular
> > testcase they would be easy to remove).
>
> Right.
>
> >
> >> After loop vectorization, the following is generated (without my change):
> >
> > Huh.  But that's not because of if-conversion but because SVE needs to
> > mask _all_
> > loop operations that are not safe to execute with the loop_mask!
> >
> >>     vect_a_10.6_6 = .MASK_LOAD (vectp_array.4_35, 4B, loop_mask_7);
> >>     a_10 = array[i_20];
> >>     vect__1.7_39 = vect_a_10.6_6 & vect_cst__38;
> >>     _1 = a_10 & 1;
> >>     vect__2.8_41 = vect_a_10.6_6 + vect_cst__40;
> >>     _2 = a_10 + 1;
> >>     vect__ifc__34.9_43 = VEC_COND_EXPR <vect__1.7_39 != vect_cst__42, vect__2.8_41, vect_a_10.6_6>;
> >>     _ifc__34 = _1 != 0 ? _2 : a_10;
> >>     .MASK_STORE (vectp_array.10_45, 4B, loop_mask_7, vect__ifc__34.9_43);
> >>     vect__4.12_49 = vect_a_10.6_6 + vect_cst__48;
> >>     _4 = a_10 + 2;
> >>     vect__ifc__37.13_51 = VEC_COND_EXPR <vect__ifc__34.9_43 > vect_cst__50, vect__4.12_49, vect__ifc__34.9_43>;
> >>     _ifc__37 = _ifc__34 > 10 ? _4 : _ifc__34;
> >>     .MASK_STORE (vectp_array.14_53, 4B, loop_mask_7, vect__ifc__37.13_51);
> >>
> >> With the old ifcvt code, my change here could improve it a little bit, eliminate some redundant load.
> >> With the new code, it could not improved it further. I'll adjust the patch based on the latest trunk.
> >
> > So what does the patch change the above to?  The code has little to no
> > comments apart from a
> > small picture with code _before_ the transform...
> It is like this:
>    vect_a_10.6_6 = .MASK_LOAD (vectp_array.4_35, 4B, loop_mask_7);
>    a_10 = array[i_20];
>    vect__1.7_39 = vect_a_10.6_6 & vect_cst__38;
>    _1 = a_10 & 1;
>    vect__2.8_41 = vect_a_10.6_6 + vect_cst__40;
>    _2 = a_10 + 1;
>    _60 = vect__1.7_39 != vect_cst__42;
>    vect__ifc__34.9_43 = VEC_COND_EXPR <_60, vect__2.8_41, vect_a_10.6_6>;
>    _ifc__34 = _1 != 0 ? _2 : a_10;
>    vec_mask_and_61 = _60 & loop_mask_7;
>    .MASK_STORE (vectp_array.10_45, 4B, vec_mask_and_61, vect__2.8_41);
>    vect__4.12_49 = vect_a_10.6_6 + vect_cst__48;
>    _4 = a_10 + 2;
>    vect__ifc__37.13_51 = VEC_COND_EXPR <vect__ifc__34.9_43 > vect_cst__50, vect__4.12_49, vect__ifc__34.9_43>;
>    _ifc__37 = _ifc__34 > 10 ? _4 : _ifc__34;
>    .MASK_STORE (vectp_array.14_53, 4B, loop_mask_7, vect__ifc__37.13_51);

Ah, OK, now I see what you do.

> As the loaded value is used later, It could not be removed.
>
> With the change, ideally, less data is stored.
> However, it might generate more instructions.
>
> 1, The load is not eliminable. Apparently, your change eliminate most of the redundant load.
>     The rest is necessary or not easy to remove.
> 2, additional AND instruction.
>
> With a simpler test case like this:
>
> static int array[100];
> int test (int a, int i)
> {
>    for (unsigned i = 0; i < 16; i++)
>      {
>        if (a & 1)
>         array[i] = a + 1;
>      }
>    return array[i];
> }
>
> The new code-gen will be:
>    vect__2.4_29 = vect_cst__27 + vect_cst__28;
>    _44 = vect_cst__34 != vect_cst__35;
>    vec_mask_and_45 = _44 & loop_mask_32;
>    .MASK_STORE (vectp_array.9_37, 4B, vec_mask_and_45, vect__2.4_29);
>
> While the old one is:
>
>    vect__2.4_29 = vect_cst__27 + vect_cst__28;
>    vect__ifc__24.7_33 = .MASK_LOAD (vectp_array.5_30, 4B, loop_mask_32);
>    vect__ifc__26.8_36 = VEC_COND_EXPR <vect_cst__34 != vect_cst__35, vect__2.4_29, vect__ifc__24.7_33>;
>    .MASK_STORE (vectp_array.9_37, 4B, loop_mask_32, vect__ifc__26.8_36);

I don't see the masked load here on x86_64 btw. (I don't see if-conversion
generating a load.)  I guess that's again the store-data-races-allowed case:
it uses a RMW cycle, and vectorization generates the masked variants for the
loop mask.  Which means for SVE, if-conversion should prefer the masked-store
variant even when store data races are allowed?

>
> >
> > I was wondering whether we can implement
> >
> >    l = [masked]load;
> >    tem = cond ? x : l;
> >    masked-store = tem;
> >
> > pattern matching in a regular pass - forwprop for example.  Note the
> > load doesn't need to be masked,
> > correct?  In fact if it is masked you need to make sure the
> > conditional never accesses parts that
> > are masked in the load, no?  Or require the mask to be the same as
> > that used by the store.  But then
> > you still cannot simply replace the store mask with a new mask
> > generated from the conditional?
>
> Yes, this would require the mask for load and store is the same.
> This matches the pattern before loop vectorization.
> The mask here is loop mask, to ensure we are bounded by the number of iterations.
>
> The new mask is the (original mask & condition mask) (example shown above).
> In this case, less lanes will be stored.
>
> It is possible we do that in forwprop.
> I could try to integrate the change into it if it is the correct place to go.
>
> As the pattern is initially generated by loop vectorizer, I did the change right after it before it got
> converted into other forms. For example, forwprop will transform the original code into:
>
>    vect__2.4_29 = vect_cst__27 + { 1, ... };
>    _16 = (void *) ivtmp.13_25;
>    _2 = &MEM[base: _16, offset: 0B];
>    vect__ifc__24.7_33 = .MASK_LOAD (_2, 4B, loop_mask_32);
>    _28 = vect_cst__34 != { 0, ... };
>    _35 = .COND_ADD (_28, vect_cst__27, { 1, ... }, vect__ifc__24.7_33);
>    vect__ifc__26.8_36 = _35;
>    .MASK_STORE (_2, 4B, loop_mask_32, vect__ifc__26.8_36);
>    ivtmp_41 = ivtmp_40 + POLY_INT_CST [4, 4];
>    next_mask_43 = .WHILE_ULT (ivtmp_41, 16, { 0, ... });
>    ivtmp.13_15 = ivtmp.13_25 + POLY_INT_CST [16, 16];
>
> This make the pattern matching not straight forward.

Ah, that's because of the .COND_ADDs (I wonder about the copy that's
left - forwprop should eliminate copies).  Note the pattern-matching
could go in the

      /* Apply forward propagation to all stmts in the basic-block.
         Note we update GSI within the loop as necessary.  */

loop which comes before the match.pd pattern matching so you'd
still see the form without the .COND_ADD I think.

There _is_ precedent for some masked-store post-processing
in the vectorizer (optimize_mask_stores), not that I like that
very much either.  Eventually those can at least be combined...

That said, I prefer the forwprop place for any pattern matching
and the current patch needs more comments to understand
what it is doing (the DCE it does is IMHO premature).  You
should also modify the masked store in-place rather than
building a new one.  I don't like how you need to use
find_data_references_in_stmt, can't you simply compare
the address and size arguments?  It should also work for
a non-masked load I guess and thus apply to non-SVE if
we manage to feed the masked store with another conditional.

Richard.

>
> Thanks,
> Renlin
>
>
> >
> > Richard.
> >
Renlin Li Nov. 20, 2018, 2:57 p.m. UTC | #6
Hi Richard,

On 11/14/2018 02:59 PM, Richard Biener wrote:
> On Fri, Nov 9, 2018 at 4:49 PM Renlin Li <renlin.li@foss.arm.com> wrote:
>>
>> Hi Richard,
>>
>> On 11/09/2018 11:48 AM, Richard Biener wrote:
>>> On Thu, Nov 8, 2018 at 5:55 PM Renlin Li <renlin.li@foss.arm.com> wrote:
>>>>
>>>> Hi Richard,
>>>>
>>>>
> I don't see the masked load here on x86_64 btw. (I don't see
> if-conversion generating a load).
> I guess that's again when store-data-races are allowed that it uses a
> RMW cycle and vectorization
> generating the masked variants for the loop-mask.  Which means for SVE
> if-conversion should
> prefer the masked-store variant even when store data races are allowed?

Yes, it looks like, for SVE, the masked-store variant is preferred even when store data races are allowed.
This decision is made in if-cvt.

mask_store needs a pointer, and if it is created from an array access, we need to make sure that data
reference analysis can properly analyze the relationship between the array reference and the pointer
reference, so that no versioned loop is generated during loop vectorization.
(This is a general improvement, and could be done in a different patch?)

> 
>>
>>>
>>> I was wondering whether we can implement
>>>
>>>     l = [masked]load;
>>>     tem = cond ? x : l;
>>>     masked-store = tem;
>>>
>>> pattern matching in a regular pass - forwprop for example.  Note the
>>> load doesn't need to be masked,
>>> correct?  In fact if it is masked you need to make sure the
>>> conditional never accesses parts that
>>> are masked in the load, no?  Or require the mask to be the same as
>>> that used by the store.  But then
>>> you still cannot simply replace the store mask with a new mask
>>> generated from the conditional?
>>
>> Yes, this would require the mask for load and store is the same.
>> This matches the pattern before loop vectorization.
>> The mask here is loop mask, to ensure we are bounded by the number of iterations.
>>
>> The new mask is the (original mask & condition mask) (example shown above).
>> In this case, less lanes will be stored.
>>
>> It is possible we do that in forwprop.
>> I could try to integrate the change into it if it is the correct place to go.
>>
>> As the pattern is initially generated by loop vectorizer, I did the change right after it before it got
>> converted into other forms. For example, forwprop will transform the original code into:
>>
>>     vect__2.4_29 = vect_cst__27 + { 1, ... };
>>     _16 = (void *) ivtmp.13_25;
>>     _2 = &MEM[base: _16, offset: 0B];
>>     vect__ifc__24.7_33 = .MASK_LOAD (_2, 4B, loop_mask_32);
>>     _28 = vect_cst__34 != { 0, ... };
>>     _35 = .COND_ADD (_28, vect_cst__27, { 1, ... }, vect__ifc__24.7_33);
>>     vect__ifc__26.8_36 = _35;
>>     .MASK_STORE (_2, 4B, loop_mask_32, vect__ifc__26.8_36);
>>     ivtmp_41 = ivtmp_40 + POLY_INT_CST [4, 4];
>>     next_mask_43 = .WHILE_ULT (ivtmp_41, 16, { 0, ... });
>>     ivtmp.13_15 = ivtmp.13_25 + POLY_INT_CST [16, 16];
>>
>> This make the pattern matching not straight forward.
> 
> Ah, that's because of the .COND_ADDs (I wonder about the copy that's
> left - forwprop should eliminate copies).  Note the pattern-matching
> could go in the
> 
>        /* Apply forward propagation to all stmts in the basic-block.
>           Note we update GSI within the loop as necessary.  */
> 
> loop which comes before the match.pd pattern matching so you'd
> still see the form without the .COND_ADD I think.
> 
> There _is_ precedence for some masked-store post-processing
> in the vectorizer (optimize_mask_stores), not that I like that
> very much either.  Eventually those can be at least combined...
Thanks for your suggestion; indeed the .COND_ADD is generated later, in the fold_stmt function.

I updated the patch in the style of "forward propagation". It starts from the
VEC_COND, and forward propagates it into the MASK_STORE when a specific pattern is found.

      X = MASK_LOAD (PTR, -, MASK)
      VAL = ...
      Y = VEC_COND (cond, VAL, X)
      MASK_STORE (PTR, -, MASK, Y)
> 
> That said, I prefer the forwprop place for any pattern matching
> and the current patch needs more comments to understand
> what it is doing (the DCE it does is IMHO premature).  You
> should also modify the masked store in-place rather than
> building a new one.  I don't like how you need to use
> find_data_references_in_stmt, can't you simply compare
> the address and size arguments?  

find_data_references_in_stmt was used because the two data references are created
as two new SSA_NAMEs from the same scalar pointer by the loop vectorizer.
I cannot directly compare the addresses, as they are complicated by loop information.

By moving the functionality into forwprop, the complications are removed by the optimizers in between.
This makes a simple comparison possible.


> It should also work for
> a non-masked load I guess and thus apply to non-SVE if
> we manage to feed the masked store with another conditional.

You are right: a non-masked load is a load with a mask of all ones.
As long as the store mask is a subset of the load mask, and they load from the same address,
we could do this combining. (I haven't added this yet, as I don't have a test case to exercise it;
a hedged sketch of the kind of loop involved is shown below.)
This probably indicates there are more cases we could rewrite around masked operations.
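
A hedged sketch of that non-masked-load case (hypothetical code, not in the
patch's testsuite): the load feeding the condition is unconditional, so on a
target without loop masking it could stay an ordinary vector load while only
the store is masked by the condition:

   void
   baz (int *a, int n, int val)
   {
     for (int i = 0; i < n; i++)
       if (a[i] > val)     /* unconditional load feeds the compare */
         a[i] = val;       /* only the store is conditional        */
   }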


Again, thanks for your review!
Renlin


> 
> Richard.
> 
>>
>> Thanks,
>> Renlin
>>
>>
>>>
>>> Richard.
>>>
commit bf42fb31fc5f9beda31a025ec3dbe244b558ed9a
Author: Renlin Li <renlin.li@arm.com>
Date:   Mon Nov 19 15:27:02 2018 +0000

    forwprop

diff --git a/gcc/testsuite/gcc.target/aarch64/sve/combine_vcond_mask_store_1.c b/gcc/testsuite/gcc.target/aarch64/sve/combine_vcond_mask_store_1.c
new file mode 100644
index 0000000..9213f06
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/combine_vcond_mask_store_1.c
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize --param allow-store-data-races=1" } */
+
+static int array[100];
+int test (int a, int i)
+{
+   for (unsigned i = 0; i < 16; i++)
+     {
+       if (a & 1)
+        array[i] = a + 1;
+     }
+   return array[i];
+}
+
+/* { dg-final { scan-assembler-not {ld1d} } } */
diff --git a/gcc/tree-ssa-forwprop.c b/gcc/tree-ssa-forwprop.c
index 7449eaf..c954b2d 100644
--- a/gcc/tree-ssa-forwprop.c
+++ b/gcc/tree-ssa-forwprop.c
@@ -629,6 +629,163 @@ forward_propagate_into_cond (gimple_stmt_iterator *gsi_p)
   return 0;
 }
 
+/* Given the following pattern:
+     X = MASK_LOAD (PTR, -, MASK)
+     VAL = ...
+     Y = VEC_COND (cond, VAL, X)
+     MASK_STORE (PTR, -, MASK, Y)
+
+   They could be combined into a single MASK_STORE with new mask.
+   The new mask is the combination of original mask and the value selection mask
+   in VEC_COND_EXPR.
+
+     AND_MASK = MASK & cond
+     MASK_STORE (PTR, -, AND_MASK, VAL)
+
+   After the transformation, the MASK_LOAD and VEC_COND_EXPR might be dead.  */
+
+static bool
+forward_propagate_vec_cond_expr (gimple_stmt_iterator *gsi)
+{
+  gimple *stmt = gsi_stmt (*gsi);
+  if (!is_gimple_assign (stmt) || gimple_assign_rhs_code (stmt) != VEC_COND_EXPR)
+    return false;
+
+  imm_use_iterator iter;
+  gimple *use_stmt;
+  use_operand_p use_p;
+  tree lhs = gimple_assign_lhs (stmt);
+  bool rewrite = false;
+
+  FOR_EACH_IMM_USE_FAST (use_p, iter, lhs)
+    {
+      use_stmt = USE_STMT (use_p);
+      if (gimple_call_internal_p (use_stmt, IFN_MASK_STORE)
+	  && !gimple_has_volatile_ops (use_stmt))
+	{
+	  rewrite = true;
+	  break;
+	}
+    }
+
+  /* Early return if it doesn't have a user we are interested in.  */
+  if (!rewrite)
+    return false;
+
+  tree sel_cond = gimple_assign_rhs1 (stmt);
+  tree true_val = gimple_assign_rhs2 (stmt);
+  tree false_val = gimple_assign_rhs3 (stmt);
+
+  FOR_EACH_IMM_USE_STMT (use_stmt, iter, lhs)
+    {
+      if (!gimple_call_internal_p (use_stmt, IFN_MASK_STORE))
+	continue;
+
+      gimple *mask_store = use_stmt;
+      tree store_ptr = gimple_call_arg (mask_store, 0);
+      tree vec_op = gimple_call_arg (mask_store, 3);
+      tree store_mask = gimple_call_arg (mask_store, 2);
+      if (vec_op != lhs)
+	continue;
+
+      /* Looking for this pattern:
+	 X = MASK_LOAD (PTR, -, MASK)
+	 VAL = ...
+	 Y = VEC_COND (cond, VAL, X)
+	 MASK_STORE (PTR, -, MASK, Y)  */
+      gimple *mask_load = NULL;
+      bool reverse_p = false;
+
+      /* A[i] = cond ? val : A[i] */
+      if (TREE_CODE (false_val) == SSA_NAME)
+	{
+	  gimple *def = SSA_NAME_DEF_STMT (false_val);
+	  if (gimple_call_internal_p (def, IFN_MASK_LOAD))
+	    mask_load = def;
+	}
+      /* A[i] = cond ? A[i] : val
+	 Transform into:
+	 A[i] = !cond ? val : A[i]  */
+      if (mask_load == NULL && TREE_CODE (true_val) == SSA_NAME)
+	{
+	  gimple *def = SSA_NAME_DEF_STMT (true_val);
+	  if (gimple_call_internal_p (def, IFN_MASK_LOAD))
+	    {
+	      enum tree_code code = TREE_CODE (sel_cond);
+	      tree op_type = TREE_TYPE (TREE_OPERAND (sel_cond, 0));
+	      code = invert_tree_comparison (code, HONOR_NANS (op_type));
+	      if (code == ERROR_MARK)
+		return false;
+
+	      sel_cond = build2_loc (EXPR_LOCATION (sel_cond), code,
+				     TREE_TYPE (sel_cond),
+				     TREE_OPERAND (sel_cond, 0),
+				     TREE_OPERAND (sel_cond, 1));
+	      mask_load = def;
+	      reverse_p = true;
+	    }
+	}
+
+      /* The pair must be in the same basic block, use the same mask,
+	 and access the same memory.  */
+      if (mask_load == NULL
+	  || gimple_has_volatile_ops (mask_load)
+	  || gimple_bb (mask_store) != gimple_bb (mask_load)
+	  || gimple_vuse (mask_store) != gimple_vuse (mask_load)
+	  || !operand_equal_p (store_mask, gimple_call_arg (mask_load, 2), 0)
+	  || !operand_equal_p (store_ptr, gimple_call_arg (mask_load, 0), 0)
+	  || !operand_equal_p (gimple_call_arg (mask_load, 1),
+			       gimple_call_arg (mask_store, 1),
+			       0))
+	continue;
+
+      if (dump_file)
+	{
+	  fprintf (dump_file, "  Merge '");
+	  print_gimple_expr (dump_file, mask_load, 0);
+	  fprintf (dump_file, "  into '");
+	  print_gimple_expr (dump_file, mask_store, 0);
+	  fprintf (dump_file, "'\n");
+	}
+
+      tree sel_mask
+	= force_gimple_operand_gsi (gsi,
+				    unshare_expr (sel_cond),
+				    true, NULL_TREE,
+				    true, GSI_SAME_STMT);
+
+      /* Replace the condition in COND_EXPR with the new variable.  */
+      if (reverse_p)
+	{
+	  gimple_set_op (stmt, 1, sel_mask);
+	  gimple_set_op (stmt, 2, false_val);
+	  gimple_set_op (stmt, 3, true_val);
+
+	  update_stmt (stmt);
+	  true_val = false_val;
+	}
+      else
+	{
+	  gimple_set_op (stmt, 1, sel_mask);
+	  update_stmt (stmt);
+	}
+
+      /* Create the new mask which should have less active lanes.  */
+      tree and_mask = make_temp_ssa_name (TREE_TYPE (store_mask),
+					  NULL, "vec_mask_and");
+      gimple *and_stmt = gimple_build_assign (and_mask, BIT_AND_EXPR,
+					      sel_mask, store_mask);
+      gsi_insert_before (gsi, and_stmt, GSI_SAME_STMT);
+
+      /* Update the MASK_STORE with new mask and value.  */
+      gimple_call_set_arg (mask_store, 2, and_mask);
+      gimple_call_set_arg (mask_store, 3, true_val);
+      update_stmt (mask_store);
+    }
+
+  return has_zero_uses (lhs);
+}
+
 /* We've just substituted an ADDR_EXPR into stmt.  Update all the
    relevant data structures to match.  */
 
@@ -2440,6 +2597,12 @@ pass_forwprop::execute (function *fun)
 	      else
 		gsi_next (&gsi);
 	    }
+	  else if (code == VEC_COND_EXPR
+		   && forward_propagate_vec_cond_expr (&gsi))
+	    {
+	      release_defs (stmt);
+	      gsi_remove (&gsi, true);
+	    }
 	  else
 	    gsi_next (&gsi);
 	}
Jeff Law Dec. 3, 2018, 7:39 p.m. UTC | #7
On 11/20/18 7:57 AM, Renlin Li wrote:
> Hi Richard,
> 
> On 11/14/2018 02:59 PM, Richard Biener wrote:
>> On Fri, Nov 9, 2018 at 4:49 PM Renlin Li <renlin.li@foss.arm.com> wrote:
>>>
>>> Hi Richard,
>>>
>>> On 11/09/2018 11:48 AM, Richard Biener wrote:
>>>> On Thu, Nov 8, 2018 at 5:55 PM Renlin Li <renlin.li@foss.arm.com> wrote:
>>>>
>>>>>
>>>>> Hi Richard,
>>>>>
>>>>>
>> I don't see the masked load here on x86_64 btw. (I don't see
>> if-conversion generating a load).
>> I guess that's again when store-data-races are allowed that it uses a
>> RMW cycle and vectorization
>> generating the masked variants for the loop-mask.  Which means for SVE
>> if-conversion should
>> prefer the masked-store variant even when store data races are allowed?
> 
> Yes, it looks like, for SVE, masked-store variant is preferred even when store data races are allowed.
> 
> This decision is made in if-cvt.
> 
> mask_store need a pointer, and if it is created from an array access, we need to make sure the data reference analysis
> 
> could properly analysis relationship between array reference and pointer reference.
> 
> So that no versioned loop is generated during loop vectorization.
> (This is a general improvement, and could be done in a different patch?)
> 
>>
>>>
>>>>
>>>> I was wondering whether we can implement
>>>>
>>>>     l = [masked]load;
>>>>     tem = cond ? x : l;
>>>>     masked-store = tem;
>>>>
>>>> pattern matching in a regular pass - forwprop for example.  Note the
>>>> load doesn't need to be masked,
>>>> correct?  In fact if it is masked you need to make sure the
>>>> conditional never accesses parts that
>>>> are masked in the load, no?  Or require the mask to be the same as
>>>> that used by the store.  But then
>>>> you still cannot simply replace the store mask with a new mask
>>>> generated from the conditional?
>>>
>>> Yes, this would require the mask for load and store is the same.
>>> This matches the pattern before loop vectorization.
>>> The mask here is loop mask, to ensure we are bounded by the number of iterations.
>>>
>>>
>>> The new mask is the (original mask & condition mask) (example shown above).
>>>
>>> In this case, less lanes will be stored.
>>>
>>> It is possible we do that in forwprop.
>>> I could try to integrate the change into it if it is the correct place to go.
>>>
>>>
>>> As the pattern is initially generated by loop vectorizer, I did the change right after it before it got
>>>
>>> converted into other forms. For example, forwprop will transform the original code into:
>>>
>>>
>>>     vect__2.4_29 = vect_cst__27 + { 1, ... };
>>>     _16 = (void *) ivtmp.13_25;
>>>     _2 = &MEM[base: _16, offset: 0B];
>>>     vect__ifc__24.7_33 = .MASK_LOAD (_2, 4B, loop_mask_32);
>>>     _28 = vect_cst__34 != { 0, ... };
>>>     _35 = .COND_ADD (_28, vect_cst__27, { 1, ... }, vect__ifc__24.7_33);
>>>     vect__ifc__26.8_36 = _35;
>>>     .MASK_STORE (_2, 4B, loop_mask_32, vect__ifc__26.8_36);
>>>     ivtmp_41 = ivtmp_40 + POLY_INT_CST [4, 4];
>>>     next_mask_43 = .WHILE_ULT (ivtmp_41, 16, { 0, ... });
>>>     ivtmp.13_15 = ivtmp.13_25 + POLY_INT_CST [16, 16];
>>>
>>> This make the pattern matching not straight forward.
>>
>> Ah, that's because of the .COND_ADDs (I wonder about the copy that's
>> left - forwprop should eliminate copies).  Note the pattern-matching
>> could go in the
>>
>>        /* Apply forward propagation to all stmts in the basic-block.
>>           Note we update GSI within the loop as necessary.  */
>>
>> loop which comes before the match.pd pattern matching so you'd
>> still see the form without the .COND_ADD I think.
>>
>> There _is_ precedence for some masked-store post-processing
>> in the vectorizer (optimize_mask_stores), not that I like that
>> very much either.  Eventually those can be at least combined...
> Thanks for your suggestion, indeed .COND_ADD is generated later in fold_stmt function.
> 
> 
> I update the patch with the style of "forward-propagation". It starts from
> VEC_COND, and forward propagate it into MASK_STORE when specific pattern is found.
> 
> 
>      X = MASK_LOAD (PTR, -, MASK)
>      VAL = ...
>      Y = VEC_COND (cond, VAL, X)
>      MASK_STORE (PTR, -, MASK, Y)
>>
>> That said, I prefer the forwprop place for any pattern matching
>> and the current patch needs more comments to understand
>> what it is doing (the DCE it does is IMHO premature).  You
>> should also modify the masked store in-place rather than
>> building a new one.  I don't like how you need to use
>> find_data_references_in_stmt, can't you simply compare
>> the address and size arguments?  
> 
> find_data_references_in_stmt is used because the two data reference are created
> 
> as two new SSA_NAMEs from same scalar pointer by loop vectorizer.
> I can not directly compare the address as the are complicated with loop information.
> 
> 
> By moving the functionality into forwprop, the complications are removed by the optimizers in between.
> 
> This makes a simple comparison possible.
So presumably this has been bootstrap tested?  If so, it's fine once you
add an appropriate ChangeLog.



> diff --git a/gcc/tree-ssa-forwprop.c b/gcc/tree-ssa-forwprop.c
> index 7449eaf..c954b2d 100644
> --- a/gcc/tree-ssa-forwprop.c
> +++ b/gcc/tree-ssa-forwprop.c
> @@ -629,6 +629,163 @@ forward_propagate_into_cond (gimple_stmt_iterator *gsi_p)
>    return 0;
>  }
>  
> +/* Given the following pattern:
> +     X = MASK_LOAD (PTR, -, MASK)
> +     VAL = ...
> +     Y = VEC_COND (cond, VAL, X)
> +     MASK_STORE (PTR, -, MASK, Y)
> +
> +   They could be combined into a single MASK_STORE with new mask.
> +   The new mask is the combination of original mask and the value selection mask
> +   in VEC_COND_EXPR.
> +
> +     AND_MASK = MASK & cond
> +     MASK_STORE (PTR, -, AND_MASK, VAL)
> +
> +   After the transformation, the MASK_LOAD and VEC_COND_EXPR might be dead.  */
> +
> +static bool
> +forward_propagate_vec_cond_expr (gimple_stmt_iterator *gsi)
> +{
> +  gimple *stmt = gsi_stmt (*gsi);
> +  if (!is_gimple_assign (stmt) || gimple_assign_rhs_code (stmt) != VEC_COND_EXPR)
> +    return false;
> +
> +  imm_use_iterator iter;
> +  gimple *use_stmt;
> +  use_operand_p use_p;
> +  tree lhs = gimple_assign_lhs (stmt);
> +  bool rewrite = false;
> +
> +  FOR_EACH_IMM_USE_FAST (use_p, iter, lhs)
> +    {
> +      use_stmt = USE_STMT (use_p);
> +      if (gimple_call_internal_p (use_stmt, IFN_MASK_STORE)
> +	  && !gimple_has_volatile_ops (use_stmt))
> +	{
> +	  rewrite = true;
> +	  break;
> +	}
> +    }
> +
> +  /* Early return if it doesn't have a user we are interested in.  */
> +  if (!rewrite)
> +    return false;
> +
> +  tree sel_cond = gimple_assign_rhs1 (stmt);
> +  tree true_val = gimple_assign_rhs2 (stmt);
> +  tree false_val = gimple_assign_rhs3 (stmt);
> +
> +  FOR_EACH_IMM_USE_STMT (use_stmt, iter, lhs)
> +    {
> +      if (!gimple_call_internal_p (use_stmt, IFN_MASK_STORE))
> +	continue;
> +
> +      gimple *mask_store = use_stmt;
> +      tree store_ptr = gimple_call_arg (mask_store, 0);
> +      tree vec_op = gimple_call_arg (mask_store, 3);
> +      tree store_mask = gimple_call_arg (mask_store, 2);
> +      if (vec_op != lhs)
> +	continue;
> +
> +      /* Looking for this pattern:
> +	 X = MASK_LOAD (PTR, -, MASK)
> +	 VAL = ...
> +	 Y = VEC_COND (cond, VAL, X)
> +	 MASK_STORE (PTR, -, MASK, Y)  */
> +      gimple *mask_load = NULL;
> +      bool reverse_p = false;
> +
> +      /* A[i] = cond ? val : A[i] */
> +      if (TREE_CODE (false_val) == SSA_NAME)
> +	{
> +	  gimple *def = SSA_NAME_DEF_STMT (false_val);
> +	  if (gimple_call_internal_p (def, IFN_MASK_LOAD))
> +	    mask_load = def;
> +	}
> +      /* A[i] = cond ? A[i] : val
> +	 Transform into:
> +	 A[i] = !cond ? val : A[i]  */
> +      if (mask_load == NULL && TREE_CODE (true_val) == SSA_NAME)
> +	{
> +	  gimple *def = SSA_NAME_DEF_STMT (true_val);
> +	  if (gimple_call_internal_p (def, IFN_MASK_LOAD))
> +	    {
> +	      enum tree_code code = TREE_CODE (sel_cond);
> +	      tree op_type = TREE_TYPE (TREE_OPERAND (sel_cond, 0));
> +	      code = invert_tree_comparison (code, HONOR_NANS (op_type));
> +	      if (code == ERROR_MARK)
> +		return false;
> +
> +	      sel_cond = build2_loc (EXPR_LOCATION (sel_cond), code,
> +				     TREE_TYPE (sel_cond),
> +				     TREE_OPERAND (sel_cond, 0),
> +				     TREE_OPERAND (sel_cond, 1));
> +	      mask_load = def;
> +	      reverse_p = true;
> +	    }
> +	}
> +
> +      /* The pair must be in the same basic block, use the same mask,
> +	 and access the same memory.  */
> +      if (mask_load == NULL
> +	  || gimple_has_volatile_ops (mask_load)
> +	  || gimple_bb (mask_store) != gimple_bb (mask_load)
> +	  || gimple_vuse (mask_store) != gimple_vuse (mask_load)
> +	  || !operand_equal_p (store_mask, gimple_call_arg (mask_load, 2), 0)
> +	  || !operand_equal_p (store_ptr, gimple_call_arg (mask_load, 0), 0)
> +	  || !operand_equal_p (gimple_call_arg (mask_load, 1),
> +			       gimple_call_arg (mask_store, 1),
> +			       0))
> +	continue;
> +
> +      if (dump_file)
> +	{
> +	  fprintf (dump_file, "  Merge '");
> +	  print_gimple_expr (dump_file, mask_load, 0);
> +	  fprintf (dump_file, "  into '");
> +	  print_gimple_expr (dump_file, mask_store, 0);
> +	  fprintf (dump_file, "'\n");
> +	}
> +
> +      tree sel_mask
> +	= force_gimple_operand_gsi (gsi,
> +				    unshare_expr (sel_cond),
> +				    true, NULL_TREE,
> +				    true, GSI_SAME_STMT);
> +

> +      /* Replace the condition in COND_EXPR with the new variable.  */
> +      if (reverse_p)
> +	{
> +	  gimple_set_op (stmt, 1, sel_mask);
> +	  gimple_set_op (stmt, 2, false_val);
> +	  gimple_set_op (stmt, 3, true_val);
> +
> +	  update_stmt (stmt);
> +	  true_val = false_val;
> +	}
> +      else
> +	{
> +	  gimple_set_op (stmt, 1, sel_mask);
> +	  update_stmt (stmt);
> +	}
Is this really correct?   I'm thinking about the case where the
COND_EXPR has multiple uses.  I don't think you can just slam in a new
condition in that case, unless you know all the uses are going to be
transformed in the same way.
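
For instance, in a made-up case like the following (it may or may not
vectorize exactly this way; the reuse is only there to give the VEC_COND
result a second consumer besides the MASK_STORE):

  int
  f (int *a, const int *b, int n)
  {
    int s = 0;
    for (int i = 0; i < n; i++)
      {
        int v = a[i] > 0 ? b[i] : a[i];  /* the value select              */
        a[i] = v;                        /* consumer 1: the (masked) store */
        s += v;                          /* consumer 2: the reduction      */
      }
    return s;
  }

Inverting the condition of a shared VEC_COND in place would change what the
second consumer sees.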

Or am I missing the intent of what that fragment is doing?

Overall it looks reasonable though.  We should try and wrap up iteration
on the implementation ASAP.  We're well into stage3 at this point.

jeff
Richard Sandiford Dec. 4, 2018, 4:33 p.m. UTC | #8
Jeff Law <law@redhat.com> writes:
> On 11/20/18 7:57 AM, Renlin Li wrote:
>> Hi Richard,
>> 
>> On 11/14/2018 02:59 PM, Richard Biener wrote:
>>> On Fri, Nov 9, 2018 at 4:49 PM Renlin Li <renlin.li@foss.arm.com> wrote:
>>>>
>>>> Hi Richard,
>>>>
>>>> On 11/09/2018 11:48 AM, Richard Biener wrote:
>>>>> On Thu, Nov 8, 2018 at 5:55 PM Renlin Li <renlin.li@foss.arm.com> wrote:
>>>>>
>>>>>>
>>>>>> Hi Richard,
>>>>>>
>>>>>>
>>> I don't see the masked load here on x86_64, btw. (I don't see
>>> if-conversion generating a load.)
>>> I guess that's again the case where store data races are allowed, so it
>>> uses an RMW cycle and the vectorization generates the masked variants
>>> for the loop mask.  Which means that for SVE if-conversion should
>>> prefer the masked-store variant even when store data races are allowed?
>> 
>> Yes, it looks like for SVE the masked-store variant is preferred even when store data races are allowed.

I agree, and I think that would cope with more cases than fixing it up
later.  E.g. the current fixup code requires the load and store to have
the same vuse, so it won't cope with cases in which another store comes
between the mask_load and mask_store.
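
Something like the following (untested, just to illustrate the shape of the
problem; the names are made up) would leave an intervening store between the
two accesses to a[i]:

  void
  g (int *restrict a, int *restrict b, const int *c, int n)
  {
    for (int i = 0; i < n; i++)
      if (c[i] > 0)
        {
          b[i] = c[i];   /* intervening store: the load and store of a[i] */
          a[i] += 1;     /* end up with different virtual operands        */
        }
  }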

An easy heuristic would be to test
targetm.vectorize.empty_mask_is_expensive.  If that's false and masked
stores are available, it's probably better to use them than introduce
the data race.
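
As an untested sketch (the wrapper name and the availability parameter are
made up here; only the target hook itself is the existing one already queried
in vectorize_loops), the if-conversion check could look something like:

  /* Return true if if-conversion should emit a MASK_STORE directly even
     when store data races are allowed.  MASKED_STORE_AVAILABLE_P stands
     for whatever masked-store availability check if-conversion does.  */
  static bool
  prefer_masked_store_p (bool masked_store_available_p)
  {
    return (masked_store_available_p
            && !targetm.vectorize.empty_mask_is_expensive (IFN_MASK_STORE));
  }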

Thanks,
Richard

Patch

diff --git a/gcc/testsuite/gcc.target/aarch64/sve/combine_vcond_mask_store_1.c b/gcc/testsuite/gcc.target/aarch64/sve/combine_vcond_mask_store_1.c
new file mode 100644
index 0000000000000000000000000000000000000000..64f6b7b00f58ee45bd4a2f91c1a9404911f1a09f
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/combine_vcond_mask_store_1.c
@@ -0,0 +1,17 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize --param allow-store-data-races=1 -fdump-tree-vect-details" } */
+
+void test ()
+{
+  static int array[100];
+  for (unsigned i = 1; i < 16; ++i)
+    {
+      int a = array[i];
+      if (a & 1)
+	array[i] = a + 1;
+      if (array[i] > 10)
+	array[i] = a + 2;
+    }
+}
+
+/* { dg-final { scan-tree-dump-times "Combining VEC_COND_EXPR and MASK_STORE" 1 "vect" } } */
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 177b284e9c617a41c33d1387ba5afbed51d8ed00..9e1a167d03ea5bf640e58b3426d42b4e3c74da56 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -8539,6 +8539,166 @@  vect_transform_loop (loop_vec_info loop_vinfo)
   return epilogue;
 }
 
+/*
+   When allow-store-data-races=1, if-conversion will convert certain if
+   statements into:
+   A[i] = cond ? val : A[i].
+   If the loop is successfully vectorized,
+   MASK_LOAD + VEC_COND_EXPR + MASK_STORE will be generated.
+
+   This pattern could be combined into a single MASK_STORE with new mask.
+   The new mask is the combination of original mask and the value selection mask
+   in VEC_COND_EXPR.
+
+   After the transformation, the MASK_LOAD and VEC_COND_EXPR might be dead.  */
+
+void
+combine_sel_mask_store (struct loop *loop)
+{
+  basic_block *bbs = get_loop_body (loop);
+  unsigned nbbs = loop->num_nodes;
+  unsigned i;
+  basic_block bb;
+  gimple_stmt_iterator gsi;
+
+  vect_location = find_loop_location (loop);
+  for (i = 0; i < nbbs; i++)
+    {
+      bb = bbs[i];
+      for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi);
+	   gsi_next (&gsi))
+	{
+	  gimple *mask_store = gsi_stmt (gsi);
+	  if (!gimple_call_internal_p (mask_store, IFN_MASK_STORE))
+	    continue;
+
+	  /*
+	     X = MASK_LOAD (PTR, -, MASK)
+	     VAL = ...
+	     Y = VEC_COND (cond, VAL, X)
+	     MASK_STORE (PTR, -, MASK, Y)
+	  */
+	  tree vec_op = gimple_call_arg (mask_store, 3);
+	  tree store_mask = gimple_call_arg (mask_store, 2);
+	  if (TREE_CODE (vec_op) == SSA_NAME)
+	    {
+	      gimple *def = SSA_NAME_DEF_STMT (vec_op);
+	      gassign *assign = dyn_cast <gassign *> (def);
+	      if (!assign || gimple_assign_rhs_code (assign) != VEC_COND_EXPR)
+		continue;
+
+	      tree sel_cond = gimple_assign_rhs1 (assign);
+	      tree true_val = gimple_assign_rhs2 (assign);
+	      tree false_val = gimple_assign_rhs3 (assign);
+	      gimple *mask_load = NULL;
+
+	      /* A[i] = cond ? val : A[i] */
+	      if (TREE_CODE (false_val) == SSA_NAME)
+		{
+		  gimple *def = SSA_NAME_DEF_STMT (false_val);
+		  if (gimple_call_internal_p (def, IFN_MASK_LOAD))
+		    mask_load = def;
+		}
+	      /* A[i] = cond ? A[i] : val
+		 Transform into:
+		 A[i] = !cond ? val : A[i]  */
+	      if (mask_load == NULL && TREE_CODE (true_val) == SSA_NAME)
+		{
+		  gimple *def = SSA_NAME_DEF_STMT (true_val);
+		  if (gimple_call_internal_p (def, IFN_MASK_LOAD))
+		    {
+		      enum tree_code code = TREE_CODE (sel_cond);
+		      tree op_type = TREE_TYPE (TREE_OPERAND (sel_cond, 0));
+		      code = invert_tree_comparison (code, HONOR_NANS (op_type));
+		      if (code == ERROR_MARK)
+			continue;
+		      sel_cond = build2_loc (EXPR_LOCATION (sel_cond), code,
+					     TREE_TYPE (sel_cond),
+					     TREE_OPERAND (sel_cond, 0),
+					     TREE_OPERAND (sel_cond, 1));
+		      mask_load = def;
+		      true_val = false_val;
+		    }
+		}
+
+	      /* The pair must be in the same basic block, use the same mask,
+		 and access the same memory.  */
+	      if (mask_load == NULL ||
+		  gimple_bb (mask_store) != gimple_bb (mask_load) ||
+		  store_mask != gimple_call_arg (mask_load, 2) ||
+		  gimple_vuse (mask_store) != gimple_vuse (mask_load))
+		continue;
+
+	      auto_vec<data_reference_p, 2> refs;
+	      opt_result res
+		= find_data_references_in_stmt (loop, mask_store, &refs);
+	      if (!res)
+		continue;
+	      data_reference_p dr_a = refs.pop ();
+	      res = find_data_references_in_stmt (loop, mask_load, &refs);
+	      if (!res)
+		continue;
+	      data_reference_p dr_b = refs.pop ();
+
+	      if (!same_data_refs (dr_a, dr_b))
+		continue;
+
+	      /* If the data reference is the same, they are accessing the
+		 same memory location.  Merge the pattern.  */
+	      tree sel_mask
+		= force_gimple_operand_gsi (&gsi, unshare_expr (sel_cond),
+					    true, NULL_TREE,
+					    true, GSI_SAME_STMT);
+
+	      tree and_mask = make_temp_ssa_name (TREE_TYPE (store_mask),
+						  NULL, "vec_mask_and");
+	      gimple *and_stmt = gimple_build_assign (and_mask, BIT_AND_EXPR,
+						      sel_mask, store_mask);
+	      gsi_insert_before (&gsi, and_stmt, GSI_SAME_STMT);
+
+	      gcall *new_stmt
+		= gimple_build_call_internal (IFN_MASK_STORE, 4,
+					      gimple_call_arg (mask_store, 0),
+					      gimple_call_arg (mask_store, 1),
+					      and_mask, true_val);
+	      gimple_call_set_nothrow (new_stmt, true);
+	      gsi_replace (&gsi, new_stmt, true);
+
+	      if (dump_enabled_p ())
+		dump_printf_loc (MSG_NOTE, vect_location,
+				 "Combining VEC_COND_EXPR and MASK_STORE:\n"
+				 "%G%G", assign, new_stmt);
+
+	      /* Remove dead statements.  */
+	      if (has_zero_uses (vec_op))
+		{
+		  auto_vec<tree> worklist;
+		  worklist.safe_push (vec_op);
+		  while (!worklist.is_empty ())
+		    {
+		      tree val = worklist.pop ();
+		      if (TREE_CODE (val) == SSA_NAME
+			  && has_zero_uses (val))
+			{
+			  ssa_op_iter i;
+			  tree op;
+			  gimple *def = SSA_NAME_DEF_STMT (val);
+
+			  FOR_EACH_SSA_TREE_OPERAND (op, def, i, SSA_OP_USE)
+			    worklist.safe_push (op);
+
+			  gimple_stmt_iterator gsi = gsi_for_stmt (def);
+			  gsi_remove (&gsi, true);
+			}
+		    }
+		}
+	    }
+	}
+    }
+
+  free (bbs);
+}
+
 /* The code below is trying to perform simple optimization - revert
    if-conversion for masked stores, i.e. if the mask of a store is zero
    do not perform it and all stored value producers also if possible.
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 988456808318dabd0058f6b0d038f8c272e75c6b..ea661ec609df56f96cb54c3f2996646c05870667 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -1510,6 +1510,7 @@  extern void vect_get_store_cost (stmt_vec_info, int,
 extern bool vect_supportable_shift (enum tree_code, tree);
 extern tree vect_gen_perm_mask_any (tree, const vec_perm_indices &);
 extern tree vect_gen_perm_mask_checked (tree, const vec_perm_indices &);
+extern void combine_sel_mask_store (struct loop*);
 extern void optimize_mask_stores (struct loop*);
 extern gcall *vect_gen_while (tree, tree, tree);
 extern tree vect_gen_while_not (gimple_seq *, tree, tree, tree);
diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
index 12bf0fcd5bde4b889fb74342c4e7dd52327efa57..a1145cdfbeb826669071dd077420908ba67bdecc 100644
--- a/gcc/tree-vectorizer.c
+++ b/gcc/tree-vectorizer.c
@@ -1143,6 +1143,7 @@  vectorize_loops (void)
       loop_vinfo = (loop_vec_info) loop->aux;
       has_mask_store = LOOP_VINFO_HAS_MASK_STORE (loop_vinfo);
       delete loop_vinfo;
+      combine_sel_mask_store (loop);
       if (has_mask_store
 	  && targetm.vectorize.empty_mask_is_expensive (IFN_MASK_STORE))
 	optimize_mask_stores (loop);