
[RFC/PATCH] vect: Recog mul_highpart pattern

Message ID da469973-d874-1eb6-739f-048831e8a898@linux.ibm.com
State New
Series [RFC/PATCH] vect: Recog mul_highpart pattern

Commit Message

Kewen.Lin July 13, 2021, 8:52 a.m. UTC
Hi,

When I added support for the multiply highpart instructions newly
introduced by Power10, I noticed that the vectorizer currently
doesn't try to vectorize the multiply highpart pattern.  I hope
this isn't intentional?

This patch extends the existing mulhs pattern handling to also
cover multiply highpart.  An alternative would be to recognize the
mul_highpart operation in a generic place for scalar code whenever
the target supports the optab for the scalar operation, based on
the assumption that a target which supports the vector version of
multiply highpart also has the scalar version.  However, since the
function can_mult_highpart_p can check/handle mult_highpart well
even without mul_highpart optab support, I think recognizing this
pattern in the vectorizer is better.  Is this on the right track?
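
For reference, here is a minimal sketch of the kind of loop this
targets (array names and the N bound are purely illustrative):

```
#define N 64
/* 16-bit inputs, widening 32-bit multiply, keep only the high half.  */
extern short a[N], b[N], c[N];

void
mul_high (void)
{
  for (int i = 0; i < N; i++)
    c[i] = ((int) a[i] * (int) b[i]) >> 16;
}
```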

Bootstrapped & regtested on powerpc64le-linux-gnu P9,
x86_64-redhat-linux and aarch64-linux-gnu.

BR,
Kewen
-----
gcc/ChangeLog:

	* tree-vect-patterns.c (vect_recog_mulhs_pattern): Add support to
	recog normal multiply highpart.
---
 gcc/tree-vect-patterns.c | 67 ++++++++++++++++++++++++++++------------
 1 file changed, 48 insertions(+), 19 deletions(-)

Comments

Richard Biener July 13, 2021, 9:35 a.m. UTC | #1
On Tue, Jul 13, 2021 at 10:53 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
>
> Hi,
>
> When I added the support for Power10 newly introduced multiply
> highpart instrutions, I noticed that currently vectorizer
> doesn't try to vectorize multiply highpart pattern, I hope
> this isn't intentional?
>
> This patch is to extend the existing pattern mulhs handlings
> to cover multiply highpart.  Another alternative seems to
> recog mul_highpart operation in a general place applied for
> scalar code when the target supports the optab for the scalar
> operation, it's based on the assumption that one target which
> supports vector version of multiply highpart should have the
> scalar version.  I noticed that the function can_mult_highpart_p
> can check/handle mult_highpart well even without mul_highpart
> optab support, I think to recog this pattern in vectorizer
> is better.  Is it on the right track?

I think it's on the right track, using IFN_LAST is a bit awkward
in case yet another case pops up so maybe you can use
a code_helper instance instead which unifies tree_code,
builtin_code and internal_fn?

I also notice that can_mult_highpart_p will return true if
only vec_widen_[us]mult_{even,odd,hi,lo} are available,
but then the result might be less optimal (or even not
handled later)?

That is, what about adding optab internal functions
for [us]mul_highpart instead, much like the existing
ones for MULH{R,}S?
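
Something along the lines of the existing MULHS/MULHRS entries in
internal-fn.def, i.e. (just a sketch, exact name and flags to be
settled):

  DEF_INTERNAL_SIGNED_OPTAB_FN (MULH, ECF_CONST | ECF_NOTHROW, first,
                                smul_highpart, umul_highpart, binary)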

Richard.

>
> Bootstrapped & regtested on powerpc64le-linux-gnu P9,
> x86_64-redhat-linux and aarch64-linux-gnu.
>
> BR,
> Kewen
> -----
> gcc/ChangeLog:
>
>         * tree-vect-patterns.c (vect_recog_mulhs_pattern): Add support to
>         recog normal multiply highpart.
>
Richard Sandiford July 13, 2021, 9:40 a.m. UTC | #2
Richard Biener <richard.guenther@gmail.com> writes:
> On Tue, Jul 13, 2021 at 10:53 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
>>
>> Hi,
>>
>> When I added the support for Power10 newly introduced multiply
>> highpart instrutions, I noticed that currently vectorizer
>> doesn't try to vectorize multiply highpart pattern, I hope
>> this isn't intentional?
>>
>> This patch is to extend the existing pattern mulhs handlings
>> to cover multiply highpart.  Another alternative seems to
>> recog mul_highpart operation in a general place applied for
>> scalar code when the target supports the optab for the scalar
>> operation, it's based on the assumption that one target which
>> supports vector version of multiply highpart should have the
>> scalar version.  I noticed that the function can_mult_highpart_p
>> can check/handle mult_highpart well even without mul_highpart
>> optab support, I think to recog this pattern in vectorizer
>> is better.  Is it on the right track?
>
> I think it's on the right track, using IFN_LAST is a bit awkward
> in case yet another case pops up so maybe you can use
> a code_helper instance instead which unifies tree_code,
> builtin_code and internal_fn?
>
> I also notice that can_mult_highpart_p will return true if
> only vec_widen_[us]mult_{even,odd,hi,lo} are available,
> but then the result might be less optimal (or even not
> handled later)?
>
> That is, what about adding optab internal functions
> for [us]mul_highpart instead, much like the existing
> ones for MULH{R,}S?

Yeah, that'd be my preference too FWIW.  All uses of MULT_HIGHPART_EXPR
already have to be guarded by can_mult_highpart_p, so replacing it with
a directly-mapped ifn seems like a natural fit.  (Then can_mult_highpart_p
should be replaced with a direct_internal_fn_supported_p query.)
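
As a rough sketch, assuming the new IFN ends up being called IFN_MULH,
the support check would then simply become something like

  if (direct_internal_fn_supported_p (IFN_MULH, vectype,
                                      OPTIMIZE_FOR_SPEED))
    /* ...use the highpart multiply path...  */;

instead of going through can_mult_highpart_p.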

Thanks,
Richard
Richard Biener July 13, 2021, 9:44 a.m. UTC | #3
On Tue, Jul 13, 2021 at 11:40 AM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Richard Biener <richard.guenther@gmail.com> writes:
> > On Tue, Jul 13, 2021 at 10:53 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
> >>
> >> Hi,
> >>
> >> When I added the support for Power10 newly introduced multiply
> >> highpart instrutions, I noticed that currently vectorizer
> >> doesn't try to vectorize multiply highpart pattern, I hope
> >> this isn't intentional?
> >>
> >> This patch is to extend the existing pattern mulhs handlings
> >> to cover multiply highpart.  Another alternative seems to
> >> recog mul_highpart operation in a general place applied for
> >> scalar code when the target supports the optab for the scalar
> >> operation, it's based on the assumption that one target which
> >> supports vector version of multiply highpart should have the
> >> scalar version.  I noticed that the function can_mult_highpart_p
> >> can check/handle mult_highpart well even without mul_highpart
> >> optab support, I think to recog this pattern in vectorizer
> >> is better.  Is it on the right track?
> >
> > I think it's on the right track, using IFN_LAST is a bit awkward
> > in case yet another case pops up so maybe you can use
> > a code_helper instance instead which unifies tree_code,
> > builtin_code and internal_fn?
> >
> > I also notice that can_mult_highpart_p will return true if
> > only vec_widen_[us]mult_{even,odd,hi,lo} are available,
> > but then the result might be less optimal (or even not
> > handled later)?
> >
> > That is, what about adding optab internal functions
> > for [us]mul_highpart instead, much like the existing
> > ones for MULH{R,}S?
>
> Yeah, that's be my preference too FWIW.  All uses of MULT_HIGHPART_EXPR
> already have to be guarded by can_mult_highpart_p, so replacing it with
> a directly-mapped ifn seems like a natural fit.  (Then can_mult_highpart_p
> should be replaced with a direct_internal_fn_supported_p query.)

But note can_mult_highpart_p covers use via vec_widen_[us]mult_{even,odd,hi,lo}
but I think this specific pattern should key on [us]mul_highpart only?

Because vec_widen_* implies a higher VF (or else we might miss vectorizing?)?

Richard.


> Thanks,
> Richard
Richard Sandiford July 13, 2021, 10:11 a.m. UTC | #4
Richard Biener <richard.guenther@gmail.com> writes:
> On Tue, Jul 13, 2021 at 11:40 AM Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>>
>> Richard Biener <richard.guenther@gmail.com> writes:
>> > On Tue, Jul 13, 2021 at 10:53 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
>> >>
>> >> Hi,
>> >>
>> >> When I added the support for Power10 newly introduced multiply
>> >> highpart instrutions, I noticed that currently vectorizer
>> >> doesn't try to vectorize multiply highpart pattern, I hope
>> >> this isn't intentional?
>> >>
>> >> This patch is to extend the existing pattern mulhs handlings
>> >> to cover multiply highpart.  Another alternative seems to
>> >> recog mul_highpart operation in a general place applied for
>> >> scalar code when the target supports the optab for the scalar
>> >> operation, it's based on the assumption that one target which
>> >> supports vector version of multiply highpart should have the
>> >> scalar version.  I noticed that the function can_mult_highpart_p
>> >> can check/handle mult_highpart well even without mul_highpart
>> >> optab support, I think to recog this pattern in vectorizer
>> >> is better.  Is it on the right track?
>> >
>> > I think it's on the right track, using IFN_LAST is a bit awkward
>> > in case yet another case pops up so maybe you can use
>> > a code_helper instance instead which unifies tree_code,
>> > builtin_code and internal_fn?
>> >
>> > I also notice that can_mult_highpart_p will return true if
>> > only vec_widen_[us]mult_{even,odd,hi,lo} are available,
>> > but then the result might be less optimal (or even not
>> > handled later)?
>> >
>> > That is, what about adding optab internal functions
>> > for [us]mul_highpart instead, much like the existing
>> > ones for MULH{R,}S?
>>
>> Yeah, that's be my preference too FWIW.  All uses of MULT_HIGHPART_EXPR
>> already have to be guarded by can_mult_highpart_p, so replacing it with
>> a directly-mapped ifn seems like a natural fit.  (Then can_mult_highpart_p
>> should be replaced with a direct_internal_fn_supported_p query.)
>
> But note can_mult_highpart_t covers use via vec_widen_[us]mult_{even,odd,hi,lo}
> but I think this specific pattern should key on [us]mul_highpart only?
>
> Because vec_widen_* implies a higher VF (or else we might miss vectorizing?)?

But wouldn't it be better to do the existing hi/lo/even/odd conversion in
gimple, rather than hide it in expand?  (Yes, this is feature creep. :-))

Richard
Kewen.Lin July 13, 2021, 10:25 a.m. UTC | #5
Hi Richi,

Thanks for the comments!

on 2021/7/13 5:35 PM, Richard Biener wrote:
> On Tue, Jul 13, 2021 at 10:53 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
>>
>> Hi,
>>
>> When I added the support for Power10 newly introduced multiply
>> highpart instrutions, I noticed that currently vectorizer
>> doesn't try to vectorize multiply highpart pattern, I hope
>> this isn't intentional?
>>
>> This patch is to extend the existing pattern mulhs handlings
>> to cover multiply highpart.  Another alternative seems to
>> recog mul_highpart operation in a general place applied for
>> scalar code when the target supports the optab for the scalar
>> operation, it's based on the assumption that one target which
>> supports vector version of multiply highpart should have the
>> scalar version.  I noticed that the function can_mult_highpart_p
>> can check/handle mult_highpart well even without mul_highpart
>> optab support, I think to recog this pattern in vectorizer
>> is better.  Is it on the right track?
> 
> I think it's on the right track, using IFN_LAST is a bit awkward
> in case yet another case pops up so maybe you can use
> a code_helper instance instead which unifies tree_code,
> builtin_code and internal_fn?
> 

If a new requirement pops up which doesn't have/introduce any IFN
but does have an existing tree_code, could we add one more field of
type tree_code, so that for the IFN_LAST path we can check the
different requirements guarded by that tree_code variable?

> I also notice that can_mult_highpart_p will return true if
> only vec_widen_[us]mult_{even,odd,hi,lo} are available,
> but then the result might be less optimal (or even not
> handled later)?
> 

I think it will always be handled?  The expander calls

rtx
expand_mult_highpart (machine_mode mode, rtx op0, rtx op1,
		      rtx target, bool uns_p)

which will further check with can_mult_highpart_p.

For the below case, 

#define SHT_CNT 16

__attribute__ ((noipa)) void
test ()
{
  for (int i = 0; i < N; i++)
    sh_c[i] = ((SI) sh_a[i] * (SI) sh_b[i]) >> 16;
}

Without this patch, it uses widen_mult as below:

  vect__1.5_19 = MEM <vector(8) short int> [(short int *)&sh_a + ivtmp.18_24 * 1];
  vect__3.8_14 = MEM <vector(8) short int> [(short int *)&sh_b + ivtmp.18_24 * 1];
  vect_patt_22.9_13 = WIDEN_MULT_LO_EXPR <vect__3.8_14, vect__1.5_19>;
  vect_patt_22.9_9 = WIDEN_MULT_HI_EXPR <vect__3.8_14, vect__1.5_19>;
  vect__6.10_25 = vect_patt_22.9_13 >> 16;
  vect__6.10_26 = vect_patt_22.9_9 >> 16;
  vect__7.11_27 = VEC_PACK_TRUNC_EXPR <vect__6.10_25, vect__6.10_26>;
  MEM <vector(8) short int> [(short int *)&sh_c + ivtmp.18_24 * 1] = vect__7.11_27;

.L2:
        lxvx 33,7,9
        lxvx 32,8,9
        vmulosh 13,0,1    // widen mult
        vmulesh 0,0,1
        xxmrglw 33,32,45  // merge
        xxmrghw 32,32,45
        vsraw 1,1,12      // shift
        vsraw 0,0,12
        vpkuwum 0,0,1     // pack
        stxvx 32,10,9
        addi 9,9,16
        bdnz .L2


With this patch, it ends up with:

  vect__1.5_14 = MEM <vector(8) short int> [(short int *)&sh_a + ivtmp.17_24 * 1];
  vect__3.8_8 = MEM <vector(8) short int> [(short int *)&sh_b + ivtmp.17_24 * 1];
  vect_patt_21.9_25 = vect__3.8_8 h* vect__1.5_14;
  MEM <vector(8) short int> [(short int *)&sh_c + ivtmp.17_24 * 1] = vect_patt_21.9_25;

.L2:
        lxvx 32,8,9
        lxvx 33,10,9
        vmulosh 13,0,1   // widen mult
        vmulesh 0,0,1
        vperm 0,0,13,12  // perm on widen mults
        stxvx 32,7,9
        addi 9,9,16
        bdnz .L2


> That is, what about adding optab internal functions
> for [us]mul_highpart instead, much like the existing
> ones for MULH{R,}S?
> 

OK, I was considering the IFN way at the beginning, but was worried
it would be easy to be told it's unnecessary since there is already
an existing tree_code.  :-)  Will update it to use the IFN way.

BR,
Kewen

> Richard.
> 
>>
>> Bootstrapped & regtested on powerpc64le-linux-gnu P9,
>> x86_64-redhat-linux and aarch64-linux-gnu.
>>
>> BR,
>> Kewen
>> -----
>> gcc/ChangeLog:
>>
>>         * tree-vect-patterns.c (vect_recog_mulhs_pattern): Add support to
>>         recog normal multiply highpart.
>>
Richard Biener July 13, 2021, 12:42 p.m. UTC | #6
On Tue, Jul 13, 2021 at 12:25 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
>
> Hi Richi,
>
> Thanks for the comments!
>
> on 2021/7/13 下午5:35, Richard Biener wrote:
> > On Tue, Jul 13, 2021 at 10:53 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
> >>
> >> Hi,
> >>
> >> When I added the support for Power10 newly introduced multiply
> >> highpart instrutions, I noticed that currently vectorizer
> >> doesn't try to vectorize multiply highpart pattern, I hope
> >> this isn't intentional?
> >>
> >> This patch is to extend the existing pattern mulhs handlings
> >> to cover multiply highpart.  Another alternative seems to
> >> recog mul_highpart operation in a general place applied for
> >> scalar code when the target supports the optab for the scalar
> >> operation, it's based on the assumption that one target which
> >> supports vector version of multiply highpart should have the
> >> scalar version.  I noticed that the function can_mult_highpart_p
> >> can check/handle mult_highpart well even without mul_highpart
> >> optab support, I think to recog this pattern in vectorizer
> >> is better.  Is it on the right track?
> >
> > I think it's on the right track, using IFN_LAST is a bit awkward
> > in case yet another case pops up so maybe you can use
> > a code_helper instance instead which unifies tree_code,
> > builtin_code and internal_fn?
> >
>
> If there is one new requirement which doesn't have/introduce IFN
> stuffs but have one existing tree_code, can we add one more field
> with type tree_code, then for the IFN_LAST path we can check the
> different requirements under the guard with that tree_code variable?
>
> > I also notice that can_mult_highpart_p will return true if
> > only vec_widen_[us]mult_{even,odd,hi,lo} are available,
> > but then the result might be less optimal (or even not
> > handled later)?
> >
>
> I think it will be handled always?  The expander calls
>
> rtx
> expand_mult_highpart (machine_mode mode, rtx op0, rtx op1,
>                       rtx target, bool uns_p)
>
> which will further check with can_mult_highpart_p.
>
> For the below case,
>
> #define SHT_CNT 16
>
> __attribute__ ((noipa)) void
> test ()
> {
>   for (int i = 0; i < N; i++)
>     sh_c[i] = ((SI) sh_a[i] * (SI) sh_b[i]) >> 16;
> }
>
> Without this patch, it use widen_mult like below:
>
>   vect__1.5_19 = MEM <vector(8) short int> [(short int *)&sh_a + ivtmp.18_24 * 1];
>   vect__3.8_14 = MEM <vector(8) short int> [(short int *)&sh_b + ivtmp.18_24 * 1];
>   vect_patt_22.9_13 = WIDEN_MULT_LO_EXPR <vect__3.8_14, vect__1.5_19>;
>   vect_patt_22.9_9 = WIDEN_MULT_HI_EXPR <vect__3.8_14, vect__1.5_19>;
>   vect__6.10_25 = vect_patt_22.9_13 >> 16;
>   vect__6.10_26 = vect_patt_22.9_9 >> 16;
>   vect__7.11_27 = VEC_PACK_TRUNC_EXPR <vect__6.10_25, vect__6.10_26>;
>   MEM <vector(8) short int> [(short int *)&sh_c + ivtmp.18_24 * 1] = vect__7.11_27;
>
> .L2:
>         lxvx 33,7,9
>         lxvx 32,8,9
>         vmulosh 13,0,1    // widen mult
>         vmulesh 0,0,1
>         xxmrglw 33,32,45  // merge
>         xxmrghw 32,32,45
>         vsraw 1,1,12      // shift
>         vsraw 0,0,12
>         vpkuwum 0,0,1     // pack
>         stxvx 32,10,9
>         addi 9,9,16
>         bdnz .L2
>
>
> With this patch, it ends up with:
>
>   vect__1.5_14 = MEM <vector(8) short int> [(short int *)&sh_a + ivtmp.17_24 * 1];
>   vect__3.8_8 = MEM <vector(8) short int> [(short int *)&sh_b + ivtmp.17_24 * 1];
>   vect_patt_21.9_25 = vect__3.8_8 h* vect__1.5_14;
>   MEM <vector(8) short int> [(short int *)&sh_c + ivtmp.17_24 * 1] = vect_patt_21.9_25;

Yes, so I'm curious what it ends up with/without the patch on x86_64 which
can do vec_widen_[us]mult_{even,odd} but not [us]mul_highpart.

Richard.

> .L2:
>         lxvx 32,8,9
>         lxvx 33,10,9
>         vmulosh 13,0,1   // widen mult
>         vmulesh 0,0,1
>         vperm 0,0,13,12  // perm on widen mults
>         stxvx 32,7,9
>         addi 9,9,16
>         bdnz .L2
>
>
> > That is, what about adding optab internal functions
> > for [us]mul_highpart instead, much like the existing
> > ones for MULH{R,}S?
> >
>
> OK, I was thinking the IFN way at the beginning, but was worried
> that it's easy to be blamed saying it's not necessary since there
> is one existing tree_code.  :-)  Will update it with IFN way.
>
> BR,
> Kewen
>
> > Richard.
> >
> >>
> >> Bootstrapped & regtested on powerpc64le-linux-gnu P9,
> >> x86_64-redhat-linux and aarch64-linux-gnu.
> >>
> >> BR,
> >> Kewen
> >> -----
> >> gcc/ChangeLog:
> >>
> >>         * tree-vect-patterns.c (vect_recog_mulhs_pattern): Add support to
> >>         recog normal multiply highpart.
> >>
>
>
Kewen.Lin July 13, 2021, 2:59 p.m. UTC | #7
on 2021/7/13 8:42 PM, Richard Biener wrote:
> On Tue, Jul 13, 2021 at 12:25 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
>>
>> Hi Richi,
>>
>> Thanks for the comments!
>>
>> on 2021/7/13 下午5:35, Richard Biener wrote:
>>> On Tue, Jul 13, 2021 at 10:53 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> When I added the support for Power10 newly introduced multiply
>>>> highpart instrutions, I noticed that currently vectorizer
>>>> doesn't try to vectorize multiply highpart pattern, I hope
>>>> this isn't intentional?
>>>>
>>>> This patch is to extend the existing pattern mulhs handlings
>>>> to cover multiply highpart.  Another alternative seems to
>>>> recog mul_highpart operation in a general place applied for
>>>> scalar code when the target supports the optab for the scalar
>>>> operation, it's based on the assumption that one target which
>>>> supports vector version of multiply highpart should have the
>>>> scalar version.  I noticed that the function can_mult_highpart_p
>>>> can check/handle mult_highpart well even without mul_highpart
>>>> optab support, I think to recog this pattern in vectorizer
>>>> is better.  Is it on the right track?
>>>
>>> I think it's on the right track, using IFN_LAST is a bit awkward
>>> in case yet another case pops up so maybe you can use
>>> a code_helper instance instead which unifies tree_code,
>>> builtin_code and internal_fn?
>>>
>>
>> If there is one new requirement which doesn't have/introduce IFN
>> stuffs but have one existing tree_code, can we add one more field
>> with type tree_code, then for the IFN_LAST path we can check the
>> different requirements under the guard with that tree_code variable?
>>
>>> I also notice that can_mult_highpart_p will return true if
>>> only vec_widen_[us]mult_{even,odd,hi,lo} are available,
>>> but then the result might be less optimal (or even not
>>> handled later)?
>>>
>>
>> I think it will be handled always?  The expander calls
>>
>> rtx
>> expand_mult_highpart (machine_mode mode, rtx op0, rtx op1,
>>                       rtx target, bool uns_p)
>>
>> which will further check with can_mult_highpart_p.
>>
>> For the below case,
>>
>> #define SHT_CNT 16
>>
>> __attribute__ ((noipa)) void
>> test ()
>> {
>>   for (int i = 0; i < N; i++)
>>     sh_c[i] = ((SI) sh_a[i] * (SI) sh_b[i]) >> 16;
>> }
>>
>> Without this patch, it use widen_mult like below:
>>
>>   vect__1.5_19 = MEM <vector(8) short int> [(short int *)&sh_a + ivtmp.18_24 * 1];
>>   vect__3.8_14 = MEM <vector(8) short int> [(short int *)&sh_b + ivtmp.18_24 * 1];
>>   vect_patt_22.9_13 = WIDEN_MULT_LO_EXPR <vect__3.8_14, vect__1.5_19>;
>>   vect_patt_22.9_9 = WIDEN_MULT_HI_EXPR <vect__3.8_14, vect__1.5_19>;
>>   vect__6.10_25 = vect_patt_22.9_13 >> 16;
>>   vect__6.10_26 = vect_patt_22.9_9 >> 16;
>>   vect__7.11_27 = VEC_PACK_TRUNC_EXPR <vect__6.10_25, vect__6.10_26>;
>>   MEM <vector(8) short int> [(short int *)&sh_c + ivtmp.18_24 * 1] = vect__7.11_27;
>>
>> .L2:
>>         lxvx 33,7,9
>>         lxvx 32,8,9
>>         vmulosh 13,0,1    // widen mult
>>         vmulesh 0,0,1
>>         xxmrglw 33,32,45  // merge
>>         xxmrghw 32,32,45
>>         vsraw 1,1,12      // shift
>>         vsraw 0,0,12
>>         vpkuwum 0,0,1     // pack
>>         stxvx 32,10,9
>>         addi 9,9,16
>>         bdnz .L2
>>
>>
>> With this patch, it ends up with:
>>
>>   vect__1.5_14 = MEM <vector(8) short int> [(short int *)&sh_a + ivtmp.17_24 * 1];
>>   vect__3.8_8 = MEM <vector(8) short int> [(short int *)&sh_b + ivtmp.17_24 * 1];
>>   vect_patt_21.9_25 = vect__3.8_8 h* vect__1.5_14;
>>   MEM <vector(8) short int> [(short int *)&sh_c + ivtmp.17_24 * 1] = vect_patt_21.9_25;
> 
> Yes, so I'm curious what it ends up with/without the patch on x86_64 which
> can do vec_widen_[us]mult_{even,odd} but not [us]mul_highpart.
> 

For the following test case:

```
#define N 32
typedef signed int bigType;
typedef signed short smallType;
#define SH_CNT 16

extern smallType small_a[N], small_b[N], small_c[N];

__attribute__((noipa)) void test_si(int n) {
  for (int i = 0; i < n; i++)
    small_c[i] = ((bigType)small_a[i] * (bigType)small_b[i]) >> SH_CNT;
}

```

on x86_64, with option set: -mfpmath=sse -msse2 -O2 -ftree-vectorize

1) Without this patch, the pattern isn't recognized and the IR looks like:

  <bb 4> [local count: 94607391]:
  bnd.5_34 = niters.4_25 >> 3;
  _13 = (sizetype) bnd.5_34;
  _29 = _13 * 16;

  <bb 5> [local count: 378429566]:
  # ivtmp.34_4 = PHI <ivtmp.34_5(5), 0(4)>
  vect__1.10_40 = MEM <vector(8) short int> [(short int *)&small_a + ivtmp.34_4 * 1];
  vect__3.13_43 = MEM <vector(8) short int> [(short int *)&small_b + ivtmp.34_4 * 1];
  vect_patt_18.14_44 = WIDEN_MULT_LO_EXPR <vect__1.10_40, vect__3.13_43>;
  vect_patt_18.14_45 = WIDEN_MULT_HI_EXPR <vect__1.10_40, vect__3.13_43>;
  vect__6.15_46 = vect_patt_18.14_44 >> 16;
  vect__6.15_47 = vect_patt_18.14_45 >> 16;
  vect__7.16_48 = VEC_PACK_TRUNC_EXPR <vect__6.15_46, vect__6.15_47>;
  MEM <vector(8) short int> [(short int *)&small_c + ivtmp.34_4 * 1] = vect__7.16_48;
  ivtmp.34_5 = ivtmp.34_4 + 16;
  if (ivtmp.34_5 != _29)
    goto <bb 5>; [75.00%]
  else
    goto <bb 6>; [25.00%]

...

*asm*:

.L4:
        movdqu  small_b(%rax), %xmm3
        movdqu  small_a(%rax), %xmm1
        addq    $16, %rax
        movdqu  small_a-16(%rax), %xmm2
        pmullw  %xmm3, %xmm1
        pmulhw  %xmm3, %xmm2
        movdqa  %xmm1, %xmm0
        punpcklwd       %xmm2, %xmm0
        punpckhwd       %xmm2, %xmm1
        psrad   $16, %xmm1
        psrad   $16, %xmm0
        movdqa  %xmm0, %xmm2
        punpcklwd       %xmm1, %xmm0
        punpckhwd       %xmm1, %xmm2
        movdqa  %xmm0, %xmm1
        punpckhwd       %xmm2, %xmm1
        punpcklwd       %xmm2, %xmm0
        punpcklwd       %xmm1, %xmm0
        movups  %xmm0, small_c-16(%rax)
        cmpq    %rdx, %rax
        jne     .L4
        movl    %edi, %eax
        andl    $-8, %eax
        testb   $7, %dil
        je      .L14

*insn dist in loop*

      1 addq
      3 movdqa
      3 movdqu
      1 movups
      1 pmulhw
      1 pmullw
      2 psrad
      3 punpckhwd
      4 punpcklwd


2) With this patch, but with the mul_highpart optab query forced to return
false, the IR looks like:

  <bb 4> [local count: 94607391]:
  bnd.5_37 = niters.4_22 >> 3;
  _13 = (sizetype) bnd.5_37;
  _32 = _13 * 16;

  <bb 5> [local count: 378429566]:
  # ivtmp.33_4 = PHI <ivtmp.33_5(5), 0(4)>
  vect__1.10_43 = MEM <vector(8) short int> [(short int *)&small_a + ivtmp.33_4 * 1];
  vect__3.13_46 = MEM <vector(8) short int> [(short int *)&small_b + ivtmp.33_4 * 1];
  vect_patt_26.14_47 = vect__1.10_43 h* vect__3.13_46;
  MEM <vector(8) short int> [(short int *)&small_c + ivtmp.33_4 * 1] = vect_patt_26.14_47;
  ivtmp.33_5 = ivtmp.33_4 + 16;
  if (ivtmp.33_5 != _32)
    goto <bb 5>; [75.00%]
  else
    goto <bb 6>; [25.00%]

*asm*:

.L4:
        movdqu  small_b(%rax), %xmm3
        movdqu  small_a(%rax), %xmm1
        addq    $16, %rax
        movdqu  small_a-16(%rax), %xmm2
        pmullw  %xmm3, %xmm1
        pmulhw  %xmm3, %xmm2
        movdqa  %xmm1, %xmm0
        punpcklwd       %xmm2, %xmm0
        punpckhwd       %xmm2, %xmm1
        movdqa  %xmm0, %xmm2
        punpcklwd       %xmm1, %xmm0
        punpckhwd       %xmm1, %xmm2
        movdqa  %xmm0, %xmm1
        punpckhwd       %xmm2, %xmm1
        punpcklwd       %xmm2, %xmm0
        punpckhwd       %xmm1, %xmm0
        movups  %xmm0, small_c-16(%rax)
        cmpq    %rdx, %rax
        jne     .L4
        movl    %edi, %eax
        andl    $-8, %eax
        testb   $7, %dil
        je      .L14

*insn dist*:

      1 addq
      3 movdqa
      3 movdqu
      1 movups
      1 pmulhw
      1 pmullw
      4 punpckhwd
      3 punpcklwd

I know little about i386 asm, but this seems better judging from the insn
distribution; the version without this patch uses two more psrad instructions.

3) FWIW, with this patch and the existing optab support, the IR looks like:

  <bb 4> [local count: 94607391]:
  bnd.5_40 = niters.4_19 >> 3;
  _75 = (sizetype) bnd.5_40;
  _91 = _75 * 16;

  <bb 5> [local count: 378429566]:
  # ivtmp.48_53 = PHI <ivtmp.48_45(5), 0(4)>
  vect__1.10_48 = MEM <vector(8) short int> [(short int *)&small_a + ivtmp.48_53 * 1];
  vect__3.13_51 = MEM <vector(8) short int> [(short int *)&small_b + ivtmp.48_53 * 1];
  vect_patt_26.14_52 = vect__1.10_48 h* vect__3.13_51;
  MEM <vector(8) short int> [(short int *)&small_c + ivtmp.48_53 * 1] = vect_patt_26.14_52;
  ivtmp.48_45 = ivtmp.48_53 + 16;
  if (ivtmp.48_45 != _91)
    goto <bb 5>; [75.00%]
  else
    goto <bb 6>; [25.00%]

  <bb 6> [local count: 94607391]:
  niters_vector_mult_vf.6_41 = niters.4_19 & 4294967288;
  tmp.7_42 = (int) niters_vector_mult_vf.6_41;
  _61 = niters.4_19 & 7;
  if (_61 == 0)
    goto <bb 12>; [12.50%]
  else
    goto <bb 7>; [87.50%]

  <bb 7> [local count: 93293400]:
  # i_38 = PHI <tmp.7_42(6), 0(3)>
  # _44 = PHI <niters_vector_mult_vf.6_41(6), 0(3)>
  niters.18_60 = niters.4_19 - _44;
  _76 = niters.18_60 + 4294967295;
  if (_76 <= 2)
    goto <bb 9>; [10.00%]
  else
    goto <bb 8>; [90.00%]

  <bb 8> [local count: 167928121]:
  _85 = (sizetype) _44;
  _86 = _85 * 2;
  vectp_small_a.23_84 = &small_a + _86;
  vectp_small_b.26_90 = &small_b + _86;
  vectp_small_c.31_98 = &small_c + _86;
  vect__9.24_89 = MEM <vector(4) short int> [(short int *)vectp_small_a.23_84];
  vect__28.27_95 = MEM <vector(4) short int> [(short int *)vectp_small_b.26_90];
  vect_patt_23.28_96 = vect__9.24_89 h* vect__28.27_95;
  MEM <vector(4) short int> [(short int *)vectp_small_c.31_98] = vect_patt_23.28_96;
  niters_vector_mult_vf.20_80 = niters.18_60 & 4294967292;
  _82 = (int) niters_vector_mult_vf.20_80;
  tmp.21_81 = i_38 + _82;
  _74 = niters.18_60 & 3;
  if (_74 == 0)
    goto <bb 11>; [25.00%]
  else
    goto <bb 9>; [75.00%]

...

*asm*:

.L4:
        movdqu  small_a(%rax), %xmm0
        movdqu  small_b(%rax), %xmm2
        addq    $16, %rax
        pmulhw  %xmm2, %xmm0
        movups  %xmm0, small_c-16(%rax)
        cmpq    %rdx, %rax
        jne     .L4
        movl    %ecx, %edx
        andl    $-8, %edx
        movl    %edx, %eax
        testb   $7, %cl
        je      .L19
.L3:
        movl    %ecx, %esi
        subl    %edx, %esi
        leal    -1(%rsi), %edi
        cmpl    $2, %edi
        jbe     .L6
        movq    small_a(%rdx,%rdx), %xmm0
        movq    small_b(%rdx,%rdx), %xmm1
        pmulhw  %xmm1, %xmm0
        movq    %xmm0, small_c(%rdx,%rdx)
        movl    %esi, %edx
        andl    $-4, %edx
        addl    %edx, %eax
        andl    $3, %esi
        je      .L1

I guess the proposed IFN would map directly to [us]mul_highpart?
Or do you expect it to also cover what we support in can_mult_highpart_p?

BR,
Kewen
Richard Biener July 14, 2021, 6:38 a.m. UTC | #8
On Tue, Jul 13, 2021 at 4:59 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
>
> on 2021/7/13 下午8:42, Richard Biener wrote:
> > On Tue, Jul 13, 2021 at 12:25 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
> >>
> >> Hi Richi,
> >>
> >> Thanks for the comments!
> >>
> >> on 2021/7/13 下午5:35, Richard Biener wrote:
> >>> On Tue, Jul 13, 2021 at 10:53 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> When I added the support for Power10 newly introduced multiply
> >>>> highpart instrutions, I noticed that currently vectorizer
> >>>> doesn't try to vectorize multiply highpart pattern, I hope
> >>>> this isn't intentional?
> >>>>
> >>>> This patch is to extend the existing pattern mulhs handlings
> >>>> to cover multiply highpart.  Another alternative seems to
> >>>> recog mul_highpart operation in a general place applied for
> >>>> scalar code when the target supports the optab for the scalar
> >>>> operation, it's based on the assumption that one target which
> >>>> supports vector version of multiply highpart should have the
> >>>> scalar version.  I noticed that the function can_mult_highpart_p
> >>>> can check/handle mult_highpart well even without mul_highpart
> >>>> optab support, I think to recog this pattern in vectorizer
> >>>> is better.  Is it on the right track?
> >>>
> >>> I think it's on the right track, using IFN_LAST is a bit awkward
> >>> in case yet another case pops up so maybe you can use
> >>> a code_helper instance instead which unifies tree_code,
> >>> builtin_code and internal_fn?
> >>>
> >>
> >> If there is one new requirement which doesn't have/introduce IFN
> >> stuffs but have one existing tree_code, can we add one more field
> >> with type tree_code, then for the IFN_LAST path we can check the
> >> different requirements under the guard with that tree_code variable?
> >>
> >>> I also notice that can_mult_highpart_p will return true if
> >>> only vec_widen_[us]mult_{even,odd,hi,lo} are available,
> >>> but then the result might be less optimal (or even not
> >>> handled later)?
> >>>
> >>
> >> I think it will be handled always?  The expander calls
> >>
> >> rtx
> >> expand_mult_highpart (machine_mode mode, rtx op0, rtx op1,
> >>                       rtx target, bool uns_p)
> >>
> >> which will further check with can_mult_highpart_p.
> >>
> >> For the below case,
> >>
> >> #define SHT_CNT 16
> >>
> >> __attribute__ ((noipa)) void
> >> test ()
> >> {
> >>   for (int i = 0; i < N; i++)
> >>     sh_c[i] = ((SI) sh_a[i] * (SI) sh_b[i]) >> 16;
> >> }
> >>
> >> Without this patch, it use widen_mult like below:
> >>
> >>   vect__1.5_19 = MEM <vector(8) short int> [(short int *)&sh_a + ivtmp.18_24 * 1];
> >>   vect__3.8_14 = MEM <vector(8) short int> [(short int *)&sh_b + ivtmp.18_24 * 1];
> >>   vect_patt_22.9_13 = WIDEN_MULT_LO_EXPR <vect__3.8_14, vect__1.5_19>;
> >>   vect_patt_22.9_9 = WIDEN_MULT_HI_EXPR <vect__3.8_14, vect__1.5_19>;
> >>   vect__6.10_25 = vect_patt_22.9_13 >> 16;
> >>   vect__6.10_26 = vect_patt_22.9_9 >> 16;
> >>   vect__7.11_27 = VEC_PACK_TRUNC_EXPR <vect__6.10_25, vect__6.10_26>;
> >>   MEM <vector(8) short int> [(short int *)&sh_c + ivtmp.18_24 * 1] = vect__7.11_27;
> >>
> >> .L2:
> >>         lxvx 33,7,9
> >>         lxvx 32,8,9
> >>         vmulosh 13,0,1    // widen mult
> >>         vmulesh 0,0,1
> >>         xxmrglw 33,32,45  // merge
> >>         xxmrghw 32,32,45
> >>         vsraw 1,1,12      // shift
> >>         vsraw 0,0,12
> >>         vpkuwum 0,0,1     // pack
> >>         stxvx 32,10,9
> >>         addi 9,9,16
> >>         bdnz .L2
> >>
> >>
> >> With this patch, it ends up with:
> >>
> >>   vect__1.5_14 = MEM <vector(8) short int> [(short int *)&sh_a + ivtmp.17_24 * 1];
> >>   vect__3.8_8 = MEM <vector(8) short int> [(short int *)&sh_b + ivtmp.17_24 * 1];
> >>   vect_patt_21.9_25 = vect__3.8_8 h* vect__1.5_14;
> >>   MEM <vector(8) short int> [(short int *)&sh_c + ivtmp.17_24 * 1] = vect_patt_21.9_25;
> >
> > Yes, so I'm curious what it ends up with/without the patch on x86_64 which
> > can do vec_widen_[us]mult_{even,odd} but not [us]mul_highpart.
> >
>
> For test case:
>
> ```
> #define N 32
> typedef signed int bigType;
> typedef signed short smallType;
> #define SH_CNT 16
>
> extern smallType small_a[N], small_b[N], small_c[N];
>
> __attribute__((noipa)) void test_si(int n) {
>   for (int i = 0; i < n; i++)
>     small_c[i] = ((bigType)small_a[i] * (bigType)small_b[i]) >> SH_CNT;
> }
>
> ```
>
> on x86_64, with option set: -mfpmath=sse -msse2 -O2 -ftree-vectorize
>
> 1) without this patch, the pattern isn't recognized, the IR looks like:
>
>   <bb 4> [local count: 94607391]:
>   bnd.5_34 = niters.4_25 >> 3;
>   _13 = (sizetype) bnd.5_34;
>   _29 = _13 * 16;
>
>   <bb 5> [local count: 378429566]:
>   # ivtmp.34_4 = PHI <ivtmp.34_5(5), 0(4)>
>   vect__1.10_40 = MEM <vector(8) short int> [(short int *)&small_a + ivtmp.34_4 * 1];
>   vect__3.13_43 = MEM <vector(8) short int> [(short int *)&small_b + ivtmp.34_4 * 1];
>   vect_patt_18.14_44 = WIDEN_MULT_LO_EXPR <vect__1.10_40, vect__3.13_43>;
>   vect_patt_18.14_45 = WIDEN_MULT_HI_EXPR <vect__1.10_40, vect__3.13_43>;
>   vect__6.15_46 = vect_patt_18.14_44 >> 16;
>   vect__6.15_47 = vect_patt_18.14_45 >> 16;
>   vect__7.16_48 = VEC_PACK_TRUNC_EXPR <vect__6.15_46, vect__6.15_47>;
>   MEM <vector(8) short int> [(short int *)&small_c + ivtmp.34_4 * 1] = vect__7.16_48;
>   ivtmp.34_5 = ivtmp.34_4 + 16;
>   if (ivtmp.34_5 != _29)
>     goto <bb 5>; [75.00%]
>   else
>     goto <bb 6>; [25.00%]
>
> ...
>
> *asm*:
>
> .L4:
>         movdqu  small_b(%rax), %xmm3
>         movdqu  small_a(%rax), %xmm1
>         addq    $16, %rax
>         movdqu  small_a-16(%rax), %xmm2
>         pmullw  %xmm3, %xmm1
>         pmulhw  %xmm3, %xmm2
>         movdqa  %xmm1, %xmm0
>         punpcklwd       %xmm2, %xmm0
>         punpckhwd       %xmm2, %xmm1
>         psrad   $16, %xmm1
>         psrad   $16, %xmm0
>         movdqa  %xmm0, %xmm2
>         punpcklwd       %xmm1, %xmm0
>         punpckhwd       %xmm1, %xmm2
>         movdqa  %xmm0, %xmm1
>         punpckhwd       %xmm2, %xmm1
>         punpcklwd       %xmm2, %xmm0
>         punpcklwd       %xmm1, %xmm0
>         movups  %xmm0, small_c-16(%rax)
>         cmpq    %rdx, %rax
>         jne     .L4
>         movl    %edi, %eax
>         andl    $-8, %eax
>         testb   $7, %dil
>         je      .L14
>
> *insn dist in loop*
>
>       1 addq
>       3 movdqa
>       3 movdqu
>       1 movups
>       1 pmulhw
>       1 pmullw
>       2 psrad
>       3 punpckhwd
>       4 punpcklwd
>
>
> 2) with this patch but make the mul_highpart optab query return false, the IR looks
> like:
>
>   <bb 4> [local count: 94607391]:
>   bnd.5_37 = niters.4_22 >> 3;
>   _13 = (sizetype) bnd.5_37;
>   _32 = _13 * 16;
>
>   <bb 5> [local count: 378429566]:
>   # ivtmp.33_4 = PHI <ivtmp.33_5(5), 0(4)>
>   vect__1.10_43 = MEM <vector(8) short int> [(short int *)&small_a + ivtmp.33_4 * 1];
>   vect__3.13_46 = MEM <vector(8) short int> [(short int *)&small_b + ivtmp.33_4 * 1];
>   vect_patt_26.14_47 = vect__1.10_43 h* vect__3.13_46;
>   MEM <vector(8) short int> [(short int *)&small_c + ivtmp.33_4 * 1] = vect_patt_26.14_47;
>   ivtmp.33_5 = ivtmp.33_4 + 16;
>   if (ivtmp.33_5 != _32)
>     goto <bb 5>; [75.00%]
>   else
>     goto <bb 6>; [25.00%]
>
> *asm*:
>
> .L4:
>         movdqu  small_b(%rax), %xmm3
>         movdqu  small_a(%rax), %xmm1
>         addq    $16, %rax
>         movdqu  small_a-16(%rax), %xmm2
>         pmullw  %xmm3, %xmm1
>         pmulhw  %xmm3, %xmm2
>         movdqa  %xmm1, %xmm0
>         punpcklwd       %xmm2, %xmm0
>         punpckhwd       %xmm2, %xmm1
>         movdqa  %xmm0, %xmm2
>         punpcklwd       %xmm1, %xmm0
>         punpckhwd       %xmm1, %xmm2
>         movdqa  %xmm0, %xmm1
>         punpckhwd       %xmm2, %xmm1
>         punpcklwd       %xmm2, %xmm0
>         punpckhwd       %xmm1, %xmm0
>         movups  %xmm0, small_c-16(%rax)
>         cmpq    %rdx, %rax
>         jne     .L4
>         movl    %edi, %eax
>         andl    $-8, %eax
>         testb   $7, %dil
>         je      .L14
>
> *insn dist*:
>
>       1 addq
>       3 movdqa
>       3 movdqu
>       1 movups
>       1 pmulhw
>       1 pmullw
>       4 punpckhwd
>       3 punpcklwd
>
> I know little on i386 asm, but this seems better from insn distribution,
> the one without this patch uses two more psrad (s).

So the advantage of 1) would be more appropriate costing in the vectorizer
(x86 does not have a native vector highpart multiply).

> 3) FWIW with this patch as well as existing optab supports, the IR looks like:
>
>   <bb 4> [local count: 94607391]:
>   bnd.5_40 = niters.4_19 >> 3;
>   _75 = (sizetype) bnd.5_40;
>   _91 = _75 * 16;
>
>   <bb 5> [local count: 378429566]:
>   # ivtmp.48_53 = PHI <ivtmp.48_45(5), 0(4)>
>   vect__1.10_48 = MEM <vector(8) short int> [(short int *)&small_a + ivtmp.48_53 * 1];
>   vect__3.13_51 = MEM <vector(8) short int> [(short int *)&small_b + ivtmp.48_53 * 1];
>   vect_patt_26.14_52 = vect__1.10_48 h* vect__3.13_51;
>   MEM <vector(8) short int> [(short int *)&small_c + ivtmp.48_53 * 1] = vect_patt_26.14_52;
>   ivtmp.48_45 = ivtmp.48_53 + 16;
>   if (ivtmp.48_45 != _91)
>     goto <bb 5>; [75.00%]
>   else
>     goto <bb 6>; [25.00%]
>
>   <bb 6> [local count: 94607391]:
>   niters_vector_mult_vf.6_41 = niters.4_19 & 4294967288;
>   tmp.7_42 = (int) niters_vector_mult_vf.6_41;
>   _61 = niters.4_19 & 7;
>   if (_61 == 0)
>     goto <bb 12>; [12.50%]
>   else
>     goto <bb 7>; [87.50%]
>
>   <bb 7> [local count: 93293400]:
>   # i_38 = PHI <tmp.7_42(6), 0(3)>
>   # _44 = PHI <niters_vector_mult_vf.6_41(6), 0(3)>
>   niters.18_60 = niters.4_19 - _44;
>   _76 = niters.18_60 + 4294967295;
>   if (_76 <= 2)
>     goto <bb 9>; [10.00%]
>   else
>     goto <bb 8>; [90.00%]
>
>   <bb 8> [local count: 167928121]:
>   _85 = (sizetype) _44;
>   _86 = _85 * 2;
>   vectp_small_a.23_84 = &small_a + _86;
>   vectp_small_b.26_90 = &small_b + _86;
>   vectp_small_c.31_98 = &small_c + _86;
>   vect__9.24_89 = MEM <vector(4) short int> [(short int *)vectp_small_a.23_84];
>   vect__28.27_95 = MEM <vector(4) short int> [(short int *)vectp_small_b.26_90];
>   vect_patt_23.28_96 = vect__9.24_89 h* vect__28.27_95;
>   MEM <vector(4) short int> [(short int *)vectp_small_c.31_98] = vect_patt_23.28_96;
>   niters_vector_mult_vf.20_80 = niters.18_60 & 4294967292;
>   _82 = (int) niters_vector_mult_vf.20_80;
>   tmp.21_81 = i_38 + _82;
>   _74 = niters.18_60 & 3;
>   if (_74 == 0)
>     goto <bb 11>; [25.00%]
>   else
>     goto <bb 9>; [75.00%]
>
> ...
>
> *asm*:
>
> .L4:
>         movdqu  small_a(%rax), %xmm0
>         movdqu  small_b(%rax), %xmm2
>         addq    $16, %rax
>         pmulhw  %xmm2, %xmm0
>         movups  %xmm0, small_c-16(%rax)
>         cmpq    %rdx, %rax
>         jne     .L4
>         movl    %ecx, %edx
>         andl    $-8, %edx
>         movl    %edx, %eax
>         testb   $7, %cl
>         je      .L19
> .L3:
>         movl    %ecx, %esi
>         subl    %edx, %esi
>         leal    -1(%rsi), %edi
>         cmpl    $2, %edi
>         jbe     .L6
>         movq    small_a(%rdx,%rdx), %xmm0
>         movq    small_b(%rdx,%rdx), %xmm1
>         pmulhw  %xmm1, %xmm0
>         movq    %xmm0, small_c(%rdx,%rdx)
>         movl    %esi, %edx
>         andl    $-4, %edx
>         addl    %edx, %eax
>         andl    $3, %esi
>         je      .L1

Oh, so x86 does indeed have a vector highpart multiply, macroized in a
not-grep-friendly way as <s>mul<mode>3_highpart ;)

> I guess the proposed IFN would be directly mapped for [us]mul_highpart?

Yes.

> or do you expect it will also cover what we support in can_mult_highpart_p?

See above - since we manage to use widen_mult we should expose this
detail for the purpose of costing.  So the pattern would directly ask for
an optab IFN mapping to [us]mul_highpart.

Richard.

> BR,
> Kewen
Kewen.Lin July 14, 2021, 7:45 a.m. UTC | #9
on 2021/7/14 2:38 PM, Richard Biener wrote:
> On Tue, Jul 13, 2021 at 4:59 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
>>
>> on 2021/7/13 下午8:42, Richard Biener wrote:
>>> On Tue, Jul 13, 2021 at 12:25 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
>>>>
>>>> Hi Richi,
>>>>
>>>> Thanks for the comments!
>>>>
>>>> on 2021/7/13 下午5:35, Richard Biener wrote:
>>>>> On Tue, Jul 13, 2021 at 10:53 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> When I added the support for Power10 newly introduced multiply
>>>>>> highpart instrutions, I noticed that currently vectorizer
>>>>>> doesn't try to vectorize multiply highpart pattern, I hope
>>>>>> this isn't intentional?
>>>>>>
>>>>>> This patch is to extend the existing pattern mulhs handlings
>>>>>> to cover multiply highpart.  Another alternative seems to
>>>>>> recog mul_highpart operation in a general place applied for
>>>>>> scalar code when the target supports the optab for the scalar
>>>>>> operation, it's based on the assumption that one target which
>>>>>> supports vector version of multiply highpart should have the
>>>>>> scalar version.  I noticed that the function can_mult_highpart_p
>>>>>> can check/handle mult_highpart well even without mul_highpart
>>>>>> optab support, I think to recog this pattern in vectorizer
>>>>>> is better.  Is it on the right track?
>>>>>
>>>>> I think it's on the right track, using IFN_LAST is a bit awkward
>>>>> in case yet another case pops up so maybe you can use
>>>>> a code_helper instance instead which unifies tree_code,
>>>>> builtin_code and internal_fn?
>>>>>
>>>>
>>>> If there is one new requirement which doesn't have/introduce IFN
>>>> stuffs but have one existing tree_code, can we add one more field
>>>> with type tree_code, then for the IFN_LAST path we can check the
>>>> different requirements under the guard with that tree_code variable?
>>>>
>>>>> I also notice that can_mult_highpart_p will return true if
>>>>> only vec_widen_[us]mult_{even,odd,hi,lo} are available,
>>>>> but then the result might be less optimal (or even not
>>>>> handled later)?
>>>>>
>>>>
>>>> I think it will be handled always?  The expander calls
>>>>
>>>> rtx
>>>> expand_mult_highpart (machine_mode mode, rtx op0, rtx op1,
>>>>                       rtx target, bool uns_p)
>>>>
>>>> which will further check with can_mult_highpart_p.
>>>>
>>>> For the below case,
>>>>
>>>> #define SHT_CNT 16
>>>>
>>>> __attribute__ ((noipa)) void
>>>> test ()
>>>> {
>>>>   for (int i = 0; i < N; i++)
>>>>     sh_c[i] = ((SI) sh_a[i] * (SI) sh_b[i]) >> 16;
>>>> }
>>>>
>>>> Without this patch, it use widen_mult like below:
>>>>
>>>>   vect__1.5_19 = MEM <vector(8) short int> [(short int *)&sh_a + ivtmp.18_24 * 1];
>>>>   vect__3.8_14 = MEM <vector(8) short int> [(short int *)&sh_b + ivtmp.18_24 * 1];
>>>>   vect_patt_22.9_13 = WIDEN_MULT_LO_EXPR <vect__3.8_14, vect__1.5_19>;
>>>>   vect_patt_22.9_9 = WIDEN_MULT_HI_EXPR <vect__3.8_14, vect__1.5_19>;
>>>>   vect__6.10_25 = vect_patt_22.9_13 >> 16;
>>>>   vect__6.10_26 = vect_patt_22.9_9 >> 16;
>>>>   vect__7.11_27 = VEC_PACK_TRUNC_EXPR <vect__6.10_25, vect__6.10_26>;
>>>>   MEM <vector(8) short int> [(short int *)&sh_c + ivtmp.18_24 * 1] = vect__7.11_27;
>>>>
>>>> .L2:
>>>>         lxvx 33,7,9
>>>>         lxvx 32,8,9
>>>>         vmulosh 13,0,1    // widen mult
>>>>         vmulesh 0,0,1
>>>>         xxmrglw 33,32,45  // merge
>>>>         xxmrghw 32,32,45
>>>>         vsraw 1,1,12      // shift
>>>>         vsraw 0,0,12
>>>>         vpkuwum 0,0,1     // pack
>>>>         stxvx 32,10,9
>>>>         addi 9,9,16
>>>>         bdnz .L2
>>>>
>>>>
>>>> With this patch, it ends up with:
>>>>
>>>>   vect__1.5_14 = MEM <vector(8) short int> [(short int *)&sh_a + ivtmp.17_24 * 1];
>>>>   vect__3.8_8 = MEM <vector(8) short int> [(short int *)&sh_b + ivtmp.17_24 * 1];
>>>>   vect_patt_21.9_25 = vect__3.8_8 h* vect__1.5_14;
>>>>   MEM <vector(8) short int> [(short int *)&sh_c + ivtmp.17_24 * 1] = vect_patt_21.9_25;
>>>
>>> Yes, so I'm curious what it ends up with/without the patch on x86_64 which
>>> can do vec_widen_[us]mult_{even,odd} but not [us]mul_highpart.
>>>
>>
>> For test case:
>>
>> ```
>> #define N 32
>> typedef signed int bigType;
>> typedef signed short smallType;
>> #define SH_CNT 16
>>
>> extern smallType small_a[N], small_b[N], small_c[N];
>>
>> __attribute__((noipa)) void test_si(int n) {
>>   for (int i = 0; i < n; i++)
>>     small_c[i] = ((bigType)small_a[i] * (bigType)small_b[i]) >> SH_CNT;
>> }
>>
>> ```
>>
>> on x86_64, with option set: -mfpmath=sse -msse2 -O2 -ftree-vectorize
>>
>> 1) without this patch, the pattern isn't recognized, the IR looks like:
>>
>>   <bb 4> [local count: 94607391]:
>>   bnd.5_34 = niters.4_25 >> 3;
>>   _13 = (sizetype) bnd.5_34;
>>   _29 = _13 * 16;
>>
>>   <bb 5> [local count: 378429566]:
>>   # ivtmp.34_4 = PHI <ivtmp.34_5(5), 0(4)>
>>   vect__1.10_40 = MEM <vector(8) short int> [(short int *)&small_a + ivtmp.34_4 * 1];
>>   vect__3.13_43 = MEM <vector(8) short int> [(short int *)&small_b + ivtmp.34_4 * 1];
>>   vect_patt_18.14_44 = WIDEN_MULT_LO_EXPR <vect__1.10_40, vect__3.13_43>;
>>   vect_patt_18.14_45 = WIDEN_MULT_HI_EXPR <vect__1.10_40, vect__3.13_43>;
>>   vect__6.15_46 = vect_patt_18.14_44 >> 16;
>>   vect__6.15_47 = vect_patt_18.14_45 >> 16;
>>   vect__7.16_48 = VEC_PACK_TRUNC_EXPR <vect__6.15_46, vect__6.15_47>;
>>   MEM <vector(8) short int> [(short int *)&small_c + ivtmp.34_4 * 1] = vect__7.16_48;
>>   ivtmp.34_5 = ivtmp.34_4 + 16;
>>   if (ivtmp.34_5 != _29)
>>     goto <bb 5>; [75.00%]
>>   else
>>     goto <bb 6>; [25.00%]
>>
>> ...
>>
>> *asm*:
>>
>> .L4:
>>         movdqu  small_b(%rax), %xmm3
>>         movdqu  small_a(%rax), %xmm1
>>         addq    $16, %rax
>>         movdqu  small_a-16(%rax), %xmm2
>>         pmullw  %xmm3, %xmm1
>>         pmulhw  %xmm3, %xmm2
>>         movdqa  %xmm1, %xmm0
>>         punpcklwd       %xmm2, %xmm0
>>         punpckhwd       %xmm2, %xmm1
>>         psrad   $16, %xmm1
>>         psrad   $16, %xmm0
>>         movdqa  %xmm0, %xmm2
>>         punpcklwd       %xmm1, %xmm0
>>         punpckhwd       %xmm1, %xmm2
>>         movdqa  %xmm0, %xmm1
>>         punpckhwd       %xmm2, %xmm1
>>         punpcklwd       %xmm2, %xmm0
>>         punpcklwd       %xmm1, %xmm0
>>         movups  %xmm0, small_c-16(%rax)
>>         cmpq    %rdx, %rax
>>         jne     .L4
>>         movl    %edi, %eax
>>         andl    $-8, %eax
>>         testb   $7, %dil
>>         je      .L14
>>
>> *insn dist in loop*
>>
>>       1 addq
>>       3 movdqa
>>       3 movdqu
>>       1 movups
>>       1 pmulhw
>>       1 pmullw
>>       2 psrad
>>       3 punpckhwd
>>       4 punpcklwd
>>
>>
>> 2) with this patch but make the mul_highpart optab query return false, the IR looks
>> like:
>>
>>   <bb 4> [local count: 94607391]:
>>   bnd.5_37 = niters.4_22 >> 3;
>>   _13 = (sizetype) bnd.5_37;
>>   _32 = _13 * 16;
>>
>>   <bb 5> [local count: 378429566]:
>>   # ivtmp.33_4 = PHI <ivtmp.33_5(5), 0(4)>
>>   vect__1.10_43 = MEM <vector(8) short int> [(short int *)&small_a + ivtmp.33_4 * 1];
>>   vect__3.13_46 = MEM <vector(8) short int> [(short int *)&small_b + ivtmp.33_4 * 1];
>>   vect_patt_26.14_47 = vect__1.10_43 h* vect__3.13_46;
>>   MEM <vector(8) short int> [(short int *)&small_c + ivtmp.33_4 * 1] = vect_patt_26.14_47;
>>   ivtmp.33_5 = ivtmp.33_4 + 16;
>>   if (ivtmp.33_5 != _32)
>>     goto <bb 5>; [75.00%]
>>   else
>>     goto <bb 6>; [25.00%]
>>
>> *asm*:
>>
>> .L4:
>>         movdqu  small_b(%rax), %xmm3
>>         movdqu  small_a(%rax), %xmm1
>>         addq    $16, %rax
>>         movdqu  small_a-16(%rax), %xmm2
>>         pmullw  %xmm3, %xmm1
>>         pmulhw  %xmm3, %xmm2
>>         movdqa  %xmm1, %xmm0
>>         punpcklwd       %xmm2, %xmm0
>>         punpckhwd       %xmm2, %xmm1
>>         movdqa  %xmm0, %xmm2
>>         punpcklwd       %xmm1, %xmm0
>>         punpckhwd       %xmm1, %xmm2
>>         movdqa  %xmm0, %xmm1
>>         punpckhwd       %xmm2, %xmm1
>>         punpcklwd       %xmm2, %xmm0
>>         punpckhwd       %xmm1, %xmm0
>>         movups  %xmm0, small_c-16(%rax)
>>         cmpq    %rdx, %rax
>>         jne     .L4
>>         movl    %edi, %eax
>>         andl    $-8, %eax
>>         testb   $7, %dil
>>         je      .L14
>>
>> *insn dist*:
>>
>>       1 addq
>>       3 movdqa
>>       3 movdqu
>>       1 movups
>>       1 pmulhw
>>       1 pmullw
>>       4 punpckhwd
>>       3 punpcklwd
>>
>> I know little on i386 asm, but this seems better from insn distribution,
>> the one without this patch uses two more psrad (s).
> 
> So the advantage of 1) would be more appropriate costing in the vectorizer
> (x86 does not have a native vector highpart multiply).
> 
>> 3) FWIW with this patch as well as existing optab supports, the IR looks like:
>>
>>   <bb 4> [local count: 94607391]:
>>   bnd.5_40 = niters.4_19 >> 3;
>>   _75 = (sizetype) bnd.5_40;
>>   _91 = _75 * 16;
>>
>>   <bb 5> [local count: 378429566]:
>>   # ivtmp.48_53 = PHI <ivtmp.48_45(5), 0(4)>
>>   vect__1.10_48 = MEM <vector(8) short int> [(short int *)&small_a + ivtmp.48_53 * 1];
>>   vect__3.13_51 = MEM <vector(8) short int> [(short int *)&small_b + ivtmp.48_53 * 1];
>>   vect_patt_26.14_52 = vect__1.10_48 h* vect__3.13_51;
>>   MEM <vector(8) short int> [(short int *)&small_c + ivtmp.48_53 * 1] = vect_patt_26.14_52;
>>   ivtmp.48_45 = ivtmp.48_53 + 16;
>>   if (ivtmp.48_45 != _91)
>>     goto <bb 5>; [75.00%]
>>   else
>>     goto <bb 6>; [25.00%]
>>
>>   <bb 6> [local count: 94607391]:
>>   niters_vector_mult_vf.6_41 = niters.4_19 & 4294967288;
>>   tmp.7_42 = (int) niters_vector_mult_vf.6_41;
>>   _61 = niters.4_19 & 7;
>>   if (_61 == 0)
>>     goto <bb 12>; [12.50%]
>>   else
>>     goto <bb 7>; [87.50%]
>>
>>   <bb 7> [local count: 93293400]:
>>   # i_38 = PHI <tmp.7_42(6), 0(3)>
>>   # _44 = PHI <niters_vector_mult_vf.6_41(6), 0(3)>
>>   niters.18_60 = niters.4_19 - _44;
>>   _76 = niters.18_60 + 4294967295;
>>   if (_76 <= 2)
>>     goto <bb 9>; [10.00%]
>>   else
>>     goto <bb 8>; [90.00%]
>>
>>   <bb 8> [local count: 167928121]:
>>   _85 = (sizetype) _44;
>>   _86 = _85 * 2;
>>   vectp_small_a.23_84 = &small_a + _86;
>>   vectp_small_b.26_90 = &small_b + _86;
>>   vectp_small_c.31_98 = &small_c + _86;
>>   vect__9.24_89 = MEM <vector(4) short int> [(short int *)vectp_small_a.23_84];
>>   vect__28.27_95 = MEM <vector(4) short int> [(short int *)vectp_small_b.26_90];
>>   vect_patt_23.28_96 = vect__9.24_89 h* vect__28.27_95;
>>   MEM <vector(4) short int> [(short int *)vectp_small_c.31_98] = vect_patt_23.28_96;
>>   niters_vector_mult_vf.20_80 = niters.18_60 & 4294967292;
>>   _82 = (int) niters_vector_mult_vf.20_80;
>>   tmp.21_81 = i_38 + _82;
>>   _74 = niters.18_60 & 3;
>>   if (_74 == 0)
>>     goto <bb 11>; [25.00%]
>>   else
>>     goto <bb 9>; [75.00%]
>>
>> ...
>>
>> *asm*:
>>
>> .L4:
>>         movdqu  small_a(%rax), %xmm0
>>         movdqu  small_b(%rax), %xmm2
>>         addq    $16, %rax
>>         pmulhw  %xmm2, %xmm0
>>         movups  %xmm0, small_c-16(%rax)
>>         cmpq    %rdx, %rax
>>         jne     .L4
>>         movl    %ecx, %edx
>>         andl    $-8, %edx
>>         movl    %edx, %eax
>>         testb   $7, %cl
>>         je      .L19
>> .L3:
>>         movl    %ecx, %esi
>>         subl    %edx, %esi
>>         leal    -1(%rsi), %edi
>>         cmpl    $2, %edi
>>         jbe     .L6
>>         movq    small_a(%rdx,%rdx), %xmm0
>>         movq    small_b(%rdx,%rdx), %xmm1
>>         pmulhw  %xmm1, %xmm0
>>         movq    %xmm0, small_c(%rdx,%rdx)
>>         movl    %esi, %edx
>>         andl    $-4, %edx
>>         addl    %edx, %eax
>>         andl    $3, %esi
>>         je      .L1
> 
> Oh, so indeed x86 has a vector highpart multiply, not grep-friendly
> macroized as <s>mul<mode>3_highpart ;)
> 
>> I guess the proposed IFN would be directly mapped for [us]mul_highpart?
> 
> Yes.
> 

Thanks for confirming!  The related patch v2 is attached and the testing
is ongoing.

>> or do you expect it will also cover what we support in can_mult_highpart_p?
> 
> See above - since we manage to use widen_mult we should expose this
> detail for the purpose of costing.  So the pattern would directly ask for
> an optab IFN mapping to [us]mul_highpart.
> 

OK, the costing issue seems to exist already, since vect_recog_divmod_pattern
checks with can_mult_highpart_p and generates mul_highpart on some paths,
which probably end up with widen_mult_{odd,even,lo,hi}.  I guess we can fix
the related costing routine by querying can_mult_highpart_p and pricing the
mul_highpart higher when it has to be implemented with widen_mult_{odd,...}.

btw, IIUC Richard S. proposed generating these widen_mult_{odd,even,lo,hi}
in gimple rather than at expand time; since a vector permutation is involved,
it seems we have to generate them during transformation?  Otherwise the
mul_highpart in scalar code, like the one from divmod pattern recognition,
seems hard to expand?

BR,
Kewen
-----
gcc/ChangeLog:

	* internal-fn.c (first_commutative_argument): Add info for IFN_MULH.
	* internal-fn.def (IFN_MULH): New internal function.
	* tree-vect-patterns.c (vect_recog_mulhs_pattern): Add support to
	recog normal multiply highpart as IFN_MULH.
---
 gcc/internal-fn.c        |  1 +
 gcc/internal-fn.def      |  2 ++
 gcc/tree-vect-patterns.c | 45 +++++++++++++++++++++++++++-------------
 3 files changed, 34 insertions(+), 14 deletions(-)

diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c
index fb8b43d1ce2..b1b4289357c 100644
--- a/gcc/internal-fn.c
+++ b/gcc/internal-fn.c
@@ -3703,6 +3703,7 @@ first_commutative_argument (internal_fn fn)
     case IFN_FNMS:
     case IFN_AVG_FLOOR:
     case IFN_AVG_CEIL:
+    case IFN_MULH:
     case IFN_MULHS:
     case IFN_MULHRS:
     case IFN_FMIN:
diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
index c3b8e730960..ed6d7de1680 100644
--- a/gcc/internal-fn.def
+++ b/gcc/internal-fn.def
@@ -169,6 +169,8 @@ DEF_INTERNAL_SIGNED_OPTAB_FN (AVG_FLOOR, ECF_CONST | ECF_NOTHROW, first,
 DEF_INTERNAL_SIGNED_OPTAB_FN (AVG_CEIL, ECF_CONST | ECF_NOTHROW, first,
 			      savg_ceil, uavg_ceil, binary)
 
+DEF_INTERNAL_SIGNED_OPTAB_FN (MULH, ECF_CONST | ECF_NOTHROW, first,
+			      smul_highpart, umul_highpart, binary)
 DEF_INTERNAL_SIGNED_OPTAB_FN (MULHS, ECF_CONST | ECF_NOTHROW, first,
 			      smulhs, umulhs, binary)
 DEF_INTERNAL_SIGNED_OPTAB_FN (MULHRS, ECF_CONST | ECF_NOTHROW, first,
diff --git a/gcc/tree-vect-patterns.c b/gcc/tree-vect-patterns.c
index b2e7fc2cc7a..4581cc6b51c 100644
--- a/gcc/tree-vect-patterns.c
+++ b/gcc/tree-vect-patterns.c
@@ -1896,8 +1896,15 @@ vect_recog_over_widening_pattern (vec_info *vinfo,
 
    1) Multiply high with scaling
      TYPE res = ((TYPE) a * (TYPE) b) >> c;
+     Here, c is bitsize (TYPE) / 2 - 1.
+
    2) ... or also with rounding
      TYPE res = (((TYPE) a * (TYPE) b) >> d + 1) >> 1;
+     Here, d is bitsize (TYPE) / 2 - 2.
+
+   3) Normal multiply high
+     TYPE res = ((TYPE) a * (TYPE) b) >> e;
+     Here, e is bitsize (TYPE) / 2.
 
    where only the bottom half of res is used.  */
 
@@ -1942,7 +1949,6 @@ vect_recog_mulhs_pattern (vec_info *vinfo,
   stmt_vec_info mulh_stmt_info;
   tree scale_term;
   internal_fn ifn;
-  unsigned int expect_offset;
 
   /* Check for the presence of the rounding term.  */
   if (gimple_assign_rhs_code (rshift_input_stmt) == PLUS_EXPR)
@@ -1991,25 +1997,37 @@ vect_recog_mulhs_pattern (vec_info *vinfo,
 
       /* Get the scaling term.  */
       scale_term = gimple_assign_rhs2 (plus_input_stmt);
+      /* Check that the scaling factor is correct.  */
+      if (TREE_CODE (scale_term) != INTEGER_CST)
+	return NULL;
+
+      /* Check pattern 2).  */
+      if (wi::to_widest (scale_term) + target_precision + 2
+	  != TYPE_PRECISION (lhs_type))
+	return NULL;
 
-      expect_offset = target_precision + 2;
       ifn = IFN_MULHRS;
     }
   else
     {
       mulh_stmt_info = rshift_input_stmt_info;
       scale_term = gimple_assign_rhs2 (last_stmt);
+      /* Check that the scaling factor is correct.  */
+      if (TREE_CODE (scale_term) != INTEGER_CST)
+	return NULL;
 
-      expect_offset = target_precision + 1;
-      ifn = IFN_MULHS;
+      /* Check for pattern 1).  */
+      if (wi::to_widest (scale_term) + target_precision + 1
+	  == TYPE_PRECISION (lhs_type))
+	ifn = IFN_MULHS;
+      /* Check for pattern 3).  */
+      else if (wi::to_widest (scale_term) + target_precision
+	       == TYPE_PRECISION (lhs_type))
+	ifn = IFN_MULH;
+      else
+	return NULL;
     }
 
-  /* Check that the scaling factor is correct.  */
-  if (TREE_CODE (scale_term) != INTEGER_CST
-      || wi::to_widest (scale_term) + expect_offset
-	   != TYPE_PRECISION (lhs_type))
-    return NULL;
-
   /* Check whether the scaling input term can be seen as two widened
      inputs multiplied together.  */
   vect_unpromoted_value unprom_mult[2];
@@ -2030,8 +2048,7 @@ vect_recog_mulhs_pattern (vec_info *vinfo,
   /* Check for target support.  */
   tree new_vectype = get_vectype_for_scalar_type (vinfo, new_type);
   if (!new_vectype
-      || !direct_internal_fn_supported_p
-	    (ifn, new_vectype, OPTIMIZE_FOR_SPEED))
+      || !direct_internal_fn_supported_p (ifn, new_vectype, OPTIMIZE_FOR_SPEED))
     return NULL;
 
   /* The IR requires a valid vector type for the cast result, even though
@@ -2043,8 +2060,8 @@ vect_recog_mulhs_pattern (vec_info *vinfo,
   /* Generate the IFN_MULHRS call.  */
   tree new_var = vect_recog_temp_ssa_var (new_type, NULL);
   tree new_ops[2];
-  vect_convert_inputs (vinfo, last_stmt_info, 2, new_ops, new_type,
-		       unprom_mult, new_vectype);
+  vect_convert_inputs (vinfo, last_stmt_info, 2, new_ops, new_type, unprom_mult,
+		       new_vectype);
   gcall *mulhrs_stmt
     = gimple_build_call_internal (ifn, 2, new_ops[0], new_ops[1]);
   gimple_call_set_lhs (mulhrs_stmt, new_var);
Richard Sandiford July 14, 2021, 8:38 a.m. UTC | #10
"Kewen.Lin" <linkw@linux.ibm.com> writes:
> gcc/ChangeLog:
>
> 	* internal-fn.c (first_commutative_argument): Add info for IFN_MULH.
> 	* internal-fn.def (IFN_MULH): New internal function.
> 	* tree-vect-patterns.c (vect_recog_mulhs_pattern): Add support to
> 	recog normal multiply highpart as IFN_MULH.

LGTM FWIW, although:

> @@ -2030,8 +2048,7 @@ vect_recog_mulhs_pattern (vec_info *vinfo,
>    /* Check for target support.  */
>    tree new_vectype = get_vectype_for_scalar_type (vinfo, new_type);
>    if (!new_vectype
> -      || !direct_internal_fn_supported_p
> -	    (ifn, new_vectype, OPTIMIZE_FOR_SPEED))
> +      || !direct_internal_fn_supported_p (ifn, new_vectype, OPTIMIZE_FOR_SPEED))
>      return NULL;
>  
>    /* The IR requires a valid vector type for the cast result, even though
> @@ -2043,8 +2060,8 @@ vect_recog_mulhs_pattern (vec_info *vinfo,
>    /* Generate the IFN_MULHRS call.  */
>    tree new_var = vect_recog_temp_ssa_var (new_type, NULL);
>    tree new_ops[2];
> -  vect_convert_inputs (vinfo, last_stmt_info, 2, new_ops, new_type,
> -		       unprom_mult, new_vectype);
> +  vect_convert_inputs (vinfo, last_stmt_info, 2, new_ops, new_type, unprom_mult,
> +		       new_vectype);
>    gcall *mulhrs_stmt
>      = gimple_build_call_internal (ifn, 2, new_ops[0], new_ops[1]);
>    gimple_call_set_lhs (mulhrs_stmt, new_var);

…these changes look like formatting only.  (I guess it's down to whether
or not the 80th column should be kept free for an “end of line+1” cursor.)

Thanks,
Richard
Kewen.Lin July 14, 2021, 10:02 a.m. UTC | #11
Hi Richard,

on 2021/7/14 4:38 PM, Richard Sandiford wrote:
> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>> gcc/ChangeLog:
>>
>> 	* internal-fn.c (first_commutative_argument): Add info for IFN_MULH.
>> 	* internal-fn.def (IFN_MULH): New internal function.
>> 	* tree-vect-patterns.c (vect_recog_mulhs_pattern): Add support to
>> 	recog normal multiply highpart as IFN_MULH.
> 
> LGTM FWIW, although:
> 

Thanks for the review!

>> @@ -2030,8 +2048,7 @@ vect_recog_mulhs_pattern (vec_info *vinfo,
>>    /* Check for target support.  */
>>    tree new_vectype = get_vectype_for_scalar_type (vinfo, new_type);
>>    if (!new_vectype
>> -      || !direct_internal_fn_supported_p
>> -	    (ifn, new_vectype, OPTIMIZE_FOR_SPEED))
>> +      || !direct_internal_fn_supported_p (ifn, new_vectype, OPTIMIZE_FOR_SPEED))
>>      return NULL;
>>  
>>    /* The IR requires a valid vector type for the cast result, even though
>> @@ -2043,8 +2060,8 @@ vect_recog_mulhs_pattern (vec_info *vinfo,
>>    /* Generate the IFN_MULHRS call.  */
>>    tree new_var = vect_recog_temp_ssa_var (new_type, NULL);
>>    tree new_ops[2];
>> -  vect_convert_inputs (vinfo, last_stmt_info, 2, new_ops, new_type,
>> -		       unprom_mult, new_vectype);
>> +  vect_convert_inputs (vinfo, last_stmt_info, 2, new_ops, new_type, unprom_mult,
>> +		       new_vectype);
>>    gcall *mulhrs_stmt
>>      = gimple_build_call_internal (ifn, 2, new_ops[0], new_ops[1]);
>>    gimple_call_set_lhs (mulhrs_stmt, new_var);
> 
> …these changes look like formatting only.  (I guess it's down to whether
> or not the 80th column should be kept free for an “end of line+1” cursor.)
> 

Yeah, just for formatting, the formatting tool (clang-format) reformatted
them.  Thanks for the information on "end of line+1" cursor, I didn't know
that before.  I guess you prefer me to keep the original format?  If so I
will remove them when committing it.  I was wondering whether I should
change the ColumnLimit field of my .clang-format to 79 to avoid this kind
of case being caught by the formatting tool again.  Hope reviewers won't
nit-pick the exact 80-column cases then. :)
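
For reference, that would just be one field in my local config (a sketch,
assuming the usual YAML .clang-format syntax; everything else stays as
copied from contrib/clang-format):

  # keep the 80th column free for an "end of line+1" cursor
  ColumnLimit: 79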

BR,
Kewen
Richard Sandiford July 14, 2021, 11:32 a.m. UTC | #12
"Kewen.Lin" <linkw@linux.ibm.com> writes:
> Hi Richard,
>
> on 2021/7/14 4:38 PM, Richard Sandiford wrote:
>> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>>> gcc/ChangeLog:
>>>
>>> 	* internal-fn.c (first_commutative_argument): Add info for IFN_MULH.
>>> 	* internal-fn.def (IFN_MULH): New internal function.
>>> 	* tree-vect-patterns.c (vect_recog_mulhs_pattern): Add support to
>>> 	recog normal multiply highpart as IFN_MULH.
>> 
>> LGTM FWIW, although:
>> 
>
> Thanks for the review!
>
>>> @@ -2030,8 +2048,7 @@ vect_recog_mulhs_pattern (vec_info *vinfo,
>>>    /* Check for target support.  */
>>>    tree new_vectype = get_vectype_for_scalar_type (vinfo, new_type);
>>>    if (!new_vectype
>>> -      || !direct_internal_fn_supported_p
>>> -	    (ifn, new_vectype, OPTIMIZE_FOR_SPEED))
>>> +      || !direct_internal_fn_supported_p (ifn, new_vectype, OPTIMIZE_FOR_SPEED))
>>>      return NULL;
>>>  
>>>    /* The IR requires a valid vector type for the cast result, even though
>>> @@ -2043,8 +2060,8 @@ vect_recog_mulhs_pattern (vec_info *vinfo,
>>>    /* Generate the IFN_MULHRS call.  */
>>>    tree new_var = vect_recog_temp_ssa_var (new_type, NULL);
>>>    tree new_ops[2];
>>> -  vect_convert_inputs (vinfo, last_stmt_info, 2, new_ops, new_type,
>>> -		       unprom_mult, new_vectype);
>>> +  vect_convert_inputs (vinfo, last_stmt_info, 2, new_ops, new_type, unprom_mult,
>>> +		       new_vectype);
>>>    gcall *mulhrs_stmt
>>>      = gimple_build_call_internal (ifn, 2, new_ops[0], new_ops[1]);
>>>    gimple_call_set_lhs (mulhrs_stmt, new_var);
>> 
>> …these changes look like formatting only.  (I guess it's down to whether
>> or not the 80th column should be kept free for an “end of line+1” cursor.)
>> 
>
> Yeah, just for formatting, the formatting tool (clang-format) reformatted
> them.  Thanks for the information on "end of line+1" cursor, I didn't know
> that before.  I guess you prefer me to keep the original format?  If so I
> will remove them when committing it.  I was wondering whether I should
> change the ColumnLimit field of my .clang-format to 79 to avoid this kind
> of case being caught by the formatting tool again.  Hope reviewers won't
> nit-pick the exact 80-column cases then. :)

TBH, 79 vs. 80 isn't normally something I'd worry about when reviewing
new code.  But I know in the past people have asked for 79 to be used
for the “end+1” reason, so I don't think we should “fix” existing code
that honours the 79 limit so that it no longer does, especially when the
lines surrounding the code aren't changing.

There's also a risk of yo-yo-ing if someone else is using clang-format
and does have the limit set to 79 columns.

So yeah, I think it'd better to commit without the two hunks above.

Thanks,
Richard
Segher Boessenkool July 14, 2021, 7:32 p.m. UTC | #13
On Wed, Jul 14, 2021 at 12:32:24PM +0100, Richard Sandiford wrote:
> TBH, 79 vs. 80 isn't normally something I'd worry about when reviewing
> new code.  But I know in the past people have asked for 79 to be used
> for the “end+1” reason, so I don't think we should “fix” existing code
> that honours the 79 limit so that it no longer does, especially when the
> lines surrounding the code aren't changing.

The normal rule is you cannot go over 80.  It is perfectly fine to have
shorter lines, certainly if that is nice for some other reason, so
automatically (by some tool) changing this is Just Wrong.


Segher
Kewen.Lin July 15, 2021, 1:37 a.m. UTC | #14
on 2021/7/14 7:32 PM, Richard Sandiford wrote:
> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>> Hi Richard,
>>
>> on 2021/7/14 4:38 PM, Richard Sandiford wrote:
>>> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>>>> gcc/ChangeLog:
>>>>
>>>> 	* internal-fn.c (first_commutative_argument): Add info for IFN_MULH.
>>>> 	* internal-fn.def (IFN_MULH): New internal function.
>>>> 	* tree-vect-patterns.c (vect_recog_mulhs_pattern): Add support to
>>>> 	recog normal multiply highpart as IFN_MULH.
>>>
>>> LGTM FWIW, although:
>>>
>>
>> Thanks for the review!
>>
>>>> @@ -2030,8 +2048,7 @@ vect_recog_mulhs_pattern (vec_info *vinfo,
>>>>    /* Check for target support.  */
>>>>    tree new_vectype = get_vectype_for_scalar_type (vinfo, new_type);
>>>>    if (!new_vectype
>>>> -      || !direct_internal_fn_supported_p
>>>> -	    (ifn, new_vectype, OPTIMIZE_FOR_SPEED))
>>>> +      || !direct_internal_fn_supported_p (ifn, new_vectype, OPTIMIZE_FOR_SPEED))
>>>>      return NULL;
>>>>  
>>>>    /* The IR requires a valid vector type for the cast result, even though
>>>> @@ -2043,8 +2060,8 @@ vect_recog_mulhs_pattern (vec_info *vinfo,
>>>>    /* Generate the IFN_MULHRS call.  */
>>>>    tree new_var = vect_recog_temp_ssa_var (new_type, NULL);
>>>>    tree new_ops[2];
>>>> -  vect_convert_inputs (vinfo, last_stmt_info, 2, new_ops, new_type,
>>>> -		       unprom_mult, new_vectype);
>>>> +  vect_convert_inputs (vinfo, last_stmt_info, 2, new_ops, new_type, unprom_mult,
>>>> +		       new_vectype);
>>>>    gcall *mulhrs_stmt
>>>>      = gimple_build_call_internal (ifn, 2, new_ops[0], new_ops[1]);
>>>>    gimple_call_set_lhs (mulhrs_stmt, new_var);
>>>
>>> …these changes look like formatting only.  (I guess it's down to whether
>>> or not the 80th column should be kept free for an “end of line+1” cursor.)
>>>
>>
>> Yeah, just for formatting, the formatting tool (clang-format) reformatted
>> them.  Thanks for the information on "end of line+1" cursor, I didn't know
>> that before.  I guess you prefer me to keep the original format?  If so I
>> will remove them when committing it.  I was wondering whether I should
>> change the ColumnLimit field of my .clang-format to 79 to avoid this kind
>> of case being caught by the formatting tool again.  Hope reviewers won't
>> nit-pick the exact 80-column cases then. :)
> 
> TBH, 79 vs. 80 isn't normally something I'd worry about when reviewing
> new code.  But I know in the past people have asked for 79 to be used
> for the “end+1” reason, so I don't think we should “fix” existing code
> that honours the 79 limit so that it no longer does, especially when the
> lines surrounding the code aren't changing.
> 

Thanks for the explanation!  Agree.

> There's also a risk of yo-yo-ing if someone else is using clang-format
> and does have the limit set to 79 columns.
> 
> So yeah, I think it'd better to commit without the two hunks above.
> 

Will fix them.  Thanks for catching and explanations!

BR,
Kewen
Kewen.Lin July 15, 2021, 1:40 a.m. UTC | #15
on 2021/7/15 3:32 AM, Segher Boessenkool wrote:
> On Wed, Jul 14, 2021 at 12:32:24PM +0100, Richard Sandiford wrote:
>> TBH, 79 vs. 80 isn't normally something I'd worry about when reviewing
>> new code.  But I know in the past people have asked for 79 to be used
>> for the “end+1” reason, so I don't think we should “fix” existing code
>> that honours the 79 limit so that it no longer does, especially when the
>> lines surrounding the code aren't changing.
> 
> The normal rule is you cannot go over 80.  It is perfectly fine to have
> shorter lines, certainly if that is nice for some other reason, so
> automatically (by some tool) changing this is Just Wrong.
> 

OK, could this be applied to the changelog entries too?  I guess yes?

BR,
Kewen
Segher Boessenkool July 15, 2021, 11:08 p.m. UTC | #16
On Thu, Jul 15, 2021 at 09:40:52AM +0800, Kewen.Lin wrote:
> on 2021/7/15 3:32 AM, Segher Boessenkool wrote:
> > The normal rule is you cannot go over 80.  It is perfectly fine to have
> > shorter lines, certainly if that is nice for some other reason, so
> > automatically (by some tool) changing this is Just Wrong.
> 
> OK, could this be applied to the changelog entries too?  I guess yes?

Yes, lines of length 80 are fine in changelogs.

But try not to make short lines (that do not end an entry).  A changelog
that looks different from other changelogs is harder to read.  Normally
you have a whole bunch of totally boring entries ("New." or "Likewise."
for example), and the few that are longer naturally stand out then,
making it easier to scan the changelogs (which is what they are used for
most of the time: search for something, and press "n" a lot).
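
For instance (a made-up entry, just to illustrate the shape), most lines
stay short and uniform, so the occasional longer one stands out:

	* some-file.c (some_function): New function.
	(another_function): Likewise.
	(a_third_function): Adjust to use some_function instead of
	open-coding the check.
	* some-file.h (some_function): Declare.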

Also try to write less trivial things somewhat briefly in changelogs:
changelogs just say *what* changed, not *why*, and it is okay to leave
out details (this is a tradeoff of course).


Segher
diff mbox series

Patch

diff --git a/gcc/tree-vect-patterns.c b/gcc/tree-vect-patterns.c
index b2e7fc2cc7a..9253c8088e9 100644
--- a/gcc/tree-vect-patterns.c
+++ b/gcc/tree-vect-patterns.c
@@ -1896,8 +1896,15 @@  vect_recog_over_widening_pattern (vec_info *vinfo,
 
    1) Multiply high with scaling
      TYPE res = ((TYPE) a * (TYPE) b) >> c;
+     Here, c is bitsize (TYPE) / 2 - 1.
+
    2) ... or also with rounding
      TYPE res = (((TYPE) a * (TYPE) b) >> d + 1) >> 1;
+     Here, d is bitsize (TYPE) / 2 - 2.
+
+   3) Normal multiply high
+     TYPE res = ((TYPE) a * (TYPE) b) >> e;
+     Here, e is bitsize (TYPE) / 2.
 
    where only the bottom half of res is used.  */
 
@@ -1942,7 +1949,6 @@  vect_recog_mulhs_pattern (vec_info *vinfo,
   stmt_vec_info mulh_stmt_info;
   tree scale_term;
   internal_fn ifn;
-  unsigned int expect_offset;
 
   /* Check for the presence of the rounding term.  */
   if (gimple_assign_rhs_code (rshift_input_stmt) == PLUS_EXPR)
@@ -1991,25 +1997,37 @@  vect_recog_mulhs_pattern (vec_info *vinfo,
 
       /* Get the scaling term.  */
       scale_term = gimple_assign_rhs2 (plus_input_stmt);
+      /* Check that the scaling factor is correct.  */
+      if (TREE_CODE (scale_term) != INTEGER_CST)
+	return NULL;
+
+      /* Check pattern 2).  */
+      if (wi::to_widest (scale_term) + target_precision + 2
+	  != TYPE_PRECISION (lhs_type))
+	return NULL;
 
-      expect_offset = target_precision + 2;
       ifn = IFN_MULHRS;
     }
   else
     {
       mulh_stmt_info = rshift_input_stmt_info;
       scale_term = gimple_assign_rhs2 (last_stmt);
+      /* Check that the scaling factor is correct.  */
+      if (TREE_CODE (scale_term) != INTEGER_CST)
+	return NULL;
 
-      expect_offset = target_precision + 1;
-      ifn = IFN_MULHS;
+      /* Check for pattern 1).  */
+      if (wi::to_widest (scale_term) + target_precision + 1
+	  == TYPE_PRECISION (lhs_type))
+	ifn = IFN_MULHS;
+      /* Check for pattern 3).  */
+      else if (wi::to_widest (scale_term) + target_precision
+	       == TYPE_PRECISION (lhs_type))
+	ifn = IFN_LAST;
+      else
+	return NULL;
     }
 
-  /* Check that the scaling factor is correct.  */
-  if (TREE_CODE (scale_term) != INTEGER_CST
-      || wi::to_widest (scale_term) + expect_offset
-	   != TYPE_PRECISION (lhs_type))
-    return NULL;
-
   /* Check whether the scaling input term can be seen as two widened
      inputs multiplied together.  */
   vect_unpromoted_value unprom_mult[2];
@@ -2029,9 +2047,14 @@  vect_recog_mulhs_pattern (vec_info *vinfo,
 
   /* Check for target support.  */
   tree new_vectype = get_vectype_for_scalar_type (vinfo, new_type);
-  if (!new_vectype
-      || !direct_internal_fn_supported_p
-	    (ifn, new_vectype, OPTIMIZE_FOR_SPEED))
+  if (!new_vectype)
+    return NULL;
+  if (ifn != IFN_LAST
+      && !direct_internal_fn_supported_p (ifn, new_vectype, OPTIMIZE_FOR_SPEED))
+    return NULL;
+  else if (ifn == IFN_LAST
+	   && !can_mult_highpart_p (TYPE_MODE (new_vectype),
+				    TYPE_UNSIGNED (new_type)))
     return NULL;
 
   /* The IR requires a valid vector type for the cast result, even though
@@ -2040,14 +2063,20 @@  vect_recog_mulhs_pattern (vec_info *vinfo,
   if (!*type_out)
     return NULL;
 
-  /* Generate the IFN_MULHRS call.  */
+  gimple *mulhrs_stmt;
   tree new_var = vect_recog_temp_ssa_var (new_type, NULL);
   tree new_ops[2];
-  vect_convert_inputs (vinfo, last_stmt_info, 2, new_ops, new_type,
-		       unprom_mult, new_vectype);
-  gcall *mulhrs_stmt
-    = gimple_build_call_internal (ifn, 2, new_ops[0], new_ops[1]);
-  gimple_call_set_lhs (mulhrs_stmt, new_var);
+  vect_convert_inputs (vinfo, last_stmt_info, 2, new_ops, new_type, unprom_mult,
+		       new_vectype);
+  if (ifn == IFN_LAST)
+    mulhrs_stmt = gimple_build_assign (new_var, MULT_HIGHPART_EXPR, new_ops[0],
+				       new_ops[1]);
+  else
+    {
+      /* Generate the IFN_MULHRS call.  */
+      mulhrs_stmt = gimple_build_call_internal (ifn, 2, new_ops[0], new_ops[1]);
+      gimple_call_set_lhs (mulhrs_stmt, new_var);
+    }
   gimple_set_location (mulhrs_stmt, gimple_location (last_stmt));
 
   if (dump_enabled_p ())