
VEC_WIDEN_MULT_(LO|HI)_EXPR vs. VEC_WIDEN_MULT_(EVEN|ODD)_EXPR in vectorization.

Message ID B71DF1153024A14EABB94E39368E44A604268D3E@SJEXCHMB13.corp.ad.broadcom.com
State New

Commit Message

Bingfeng Mei Jan. 28, 2014, 3:17 p.m. UTC
I checked the vectorization code. It seems that the only relevant place where vec_widen_mult_even/odd and vec_widen_mult_lo/hi are generated is supportable_widening_operation. One of these pairs is selected, with priority given to vec_widen_mult_even/odd if it is a reduction loop. However, the lo/hi pair seems to have wider usage than the even/odd pair (non-loop? non-reduction?). Maybe that's why AltiVec and x86 still implement both pairs. Is the following patch OK?
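
The selection priority described above can be modelled roughly as follows. This is only a sketch: `widen_pair`, `target_caps`, `stmt_ctx` and `choose_widen_pair` are hypothetical names invented for illustration, not GCC types or functions, and the real check in supportable_widening_operation recurses on VEC_WIDEN_MULT_EVEN_EXPR and tests more than plain pattern availability.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for illustration only; none of these are GCC names. */
enum widen_pair { PAIR_NONE, PAIR_LO_HI, PAIR_EVEN_ODD };

struct target_caps {
    bool has_lo_hi;     /* target provides vec_widen_(s|u)mult_lo/hi patterns */
    bool has_even_odd;  /* target provides vec_widen_(s|u)mult_even/odd patterns */
};

struct stmt_ctx {
    bool in_loop;               /* models: vect_loop is non-null */
    bool used_by_reduction;     /* models: STMT_VINFO_RELEVANT == vect_used_by_reduction */
    bool nested_in_outer_loop;  /* models: nested_in_vect_loop_p (...) */
};

/* Even/odd is tried first, but only for a reduction in a loop that is not
   nested inside an outer vectorized loop; otherwise fall back to lo/hi.
   Note that no cost comparison happens: availability alone decides. */
enum widen_pair choose_widen_pair(const struct target_caps *t,
                                  const struct stmt_ctx *s)
{
    assert(t && s);
    if (s->in_loop && s->used_by_reduction && !s->nested_in_outer_loop
        && t->has_even_odd)
        return PAIR_EVEN_ODD;
    if (t->has_lo_hi)
        return PAIR_LO_HI;
    return PAIR_NONE;
}
```

Under this model, a target that provides both pairs always gets even/odd in reduction loops, even when its even/odd expansion is the slower one. That is why the documentation change asks such targets not to provide the even/odd pair at all.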



-----Original Message-----
From: Richard Biener [mailto:richard.guenther@gmail.com] 
Sent: 28 January 2014 12:56
To: Bingfeng Mei
Cc: gcc@gcc.gnu.org
Subject: Re: VEC_WIDEN_MULT_(LO|HI)_EXPR vs. VEC_WIDEN_MULT_(EVEN|ODD)_EXPR in vectorization.

On Tue, Jan 28, 2014 at 12:08 PM, Bingfeng Mei <bmei@broadcom.com> wrote:
> Thanks, Richard. It is not very clear from the documentation.
>
> "Signed/Unsigned widening multiplication. The two inputs (operands 1 and 2)
> are vectors with N signed/unsigned elements of size S. Multiply the high/low
> or even/odd elements of the two vectors, and put the N/2 products of size 2*S
> in the output vector (operand 0)."
>
> So I thought that implementing both can help the vectorizer to optimize more loops.
> Maybe we should improve the documentation.

Maybe.  But my answer was from the top of my head - so better double-check
in the vectorizer sources.

Richard.

> Bingfeng
>
>
>
> -----Original Message-----
> From: Richard Biener [mailto:richard.guenther@gmail.com]
> Sent: 28 January 2014 11:02
> To: Bingfeng Mei
> Cc: gcc@gcc.gnu.org
> Subject: Re: VEC_WIDEN_MULT_(LO|HI)_EXPR vs. VEC_WIDEN_MULT_(EVEN|ODD)_EXPR in vectorization.
>
> On Wed, Jan 22, 2014 at 1:20 PM, Bingfeng Mei <bmei@broadcom.com> wrote:
>> Hi,
>> I noticed there is a regression in 4.8 against the ancient 4.5 in vectorization on our port. After a bit of investigation, I found the following code, which prefers the even|odd version over the lo|hi one. This is obviously the right choice for AltiVec and maybe some other targets. But the even|odd versions (which expand to a series of instructions) are less efficient on our target than the lo|hi ones. Shouldn't there be a target-specific hook to make the choice instead of hard-coding it here, or some cost-estimation technique to compare the two alternatives?
>
> Hmm, what's the reason for a target to support both?  I think the idea
> was that a target only supports either (the more efficient case).
>
> Richard.
>
>>      /* The result of a vectorized widening operation usually requires
>>          two vectors (because the widened results do not fit into one vector).
>>          The generated vector results would normally be expected to be
>>          generated in the same order as in the original scalar computation,
>>          i.e. if 8 results are generated in each vector iteration, they are
>>          to be organized as follows:
>>                 vect1: [res1,res2,res3,res4],
>>                 vect2: [res5,res6,res7,res8].
>>
>>          However, in the special case that the result of the widening
>>          operation is used in a reduction computation only, the order doesn't
>>          matter (because when vectorizing a reduction we change the order of
>>          the computation).  Some targets can take advantage of this and
>>          generate more efficient code.  For example, targets like Altivec,
>>          that support widen_mult using a sequence of {mult_even,mult_odd}
>>          generate the following vectors:
>>                 vect1: [res1,res3,res5,res7],
>>                 vect2: [res2,res4,res6,res8].
>>
>>          When vectorizing outer-loops, we execute the inner-loop sequentially
>>          (each vectorized inner-loop iteration contributes to VF outer-loop
>>          iterations in parallel).  We therefore don't allow to change the
>>          order of the computation in the inner-loop during outer-loop
>>          vectorization.  */
>>       /* TODO: Another case in which order doesn't *really* matter is when we
>>          widen and then contract again, e.g. (short)((int)x * y >> 8).
>>          Normally, pack_trunc performs an even/odd permute, whereas the
>>          repack from an even/odd expansion would be an interleave, which
>>          would be significantly simpler for e.g. AVX2.  */
>>       /* In any case, in order to avoid duplicating the code below, recurse
>>          on VEC_WIDEN_MULT_EVEN_EXPR.  If it succeeds, all the return values
>>          are properly set up for the caller.  If we fail, we'll continue with
>>          a VEC_WIDEN_MULT_LO/HI_EXPR check.  */
>>       if (vect_loop
>>           && STMT_VINFO_RELEVANT (stmt_info) == vect_used_by_reduction
>>           && !nested_in_vect_loop_p (vect_loop, stmt)
>>           && supportable_widening_operation (VEC_WIDEN_MULT_EVEN_EXPR,
>>                                              stmt, vectype_out, vectype_in,
>>                                              code1, code2, multi_step_cvt,
>>                                              interm_types))
>>         return true;
>>
>>
>> Thanks,
>> Bingfeng Mei
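
The two result layouts described in the quoted source comment can be illustrated with a scalar model. This is a sketch under stated assumptions: the function names are hypothetical, the vector length of 8 is arbitrary, and "lo" is taken to mean the first half of the elements (in GCC, which half counts as lo or hi depends on the target's endianness).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define N 8  /* narrow elements per input vector; illustrative value */

/* Scalar model of the lo/hi pair: lo gets the products of elements
   0..N/2-1 and hi those of N/2..N-1, so scalar order is preserved
   (vect1: [res1..res4], vect2: [res5..res8] in the comment's terms). */
void widen_mult_lo_hi(const int16_t *a, const int16_t *b,
                      int32_t *lo, int32_t *hi)
{
    assert(a && b && lo && hi);
    for (size_t i = 0; i < N / 2; i++) {
        lo[i] = (int32_t)a[i] * b[i];
        hi[i] = (int32_t)a[i + N / 2] * b[i + N / 2];
    }
}

/* Scalar model of the even/odd pair: products are split by index
   parity (vect1: [res1,res3,res5,res7], vect2: [res2,res4,res6,res8]). */
void widen_mult_even_odd(const int16_t *a, const int16_t *b,
                         int32_t *even, int32_t *odd)
{
    assert(a && b && even && odd);
    for (size_t i = 0; i < N / 2; i++) {
        even[i] = (int32_t)a[2 * i] * b[2 * i];
        odd[i] = (int32_t)a[2 * i + 1] * b[2 * i + 1];
    }
}

/* Sum reduction over both result vectors. The element order within
   and across the two vectors does not affect the total. */
int64_t reduce_sum(const int32_t *v1, const int32_t *v2)
{
    int64_t sum = 0;
    for (size_t i = 0; i < N / 2; i++)
        sum += (int64_t)v1[i] + v2[i];
    return sum;
}
```

Calling `reduce_sum` on either pair of result vectors gives the same total, which is why the vectorizer may pick even/odd for reductions. Any use that stores the products back in scalar order would need an extra interleave after the even/odd form.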

Comments

Richard Biener Jan. 29, 2014, 9:32 a.m. UTC | #1
On Tue, Jan 28, 2014 at 4:17 PM, Bingfeng Mei <bmei@broadcom.com> wrote:
> I checked the vectorization code. It seems that the only relevant place where vec_widen_mult_even/odd and vec_widen_mult_lo/hi are generated is supportable_widening_operation. One of these pairs is selected, with priority given to vec_widen_mult_even/odd if it is a reduction loop. However, the lo/hi pair seems to have wider usage than the even/odd pair (non-loop? non-reduction?). Maybe that's why AltiVec and x86 still implement both pairs. Is the following patch OK?

Ok.

Thanks,
Richard.

> Index: gcc/ChangeLog
> ===================================================================
> --- gcc/ChangeLog       (revision 207183)
> +++ gcc/ChangeLog       (working copy)
> @@ -1,3 +1,9 @@
> +2014-01-28  Bingfeng Mei  <bmei@broadcom.com>
> +
> +       * doc/md.texi: Mention that a target shouldn't implement
> +       vec_widen_(s|u)mul_even/odd pair if it is less efficient
> +       than hi/lo pair.
> +
>  2014-01-28  Richard Biener  <rguenther@suse.de>
>
>         Revert
> Index: gcc/doc/md.texi
> ===================================================================
> --- gcc/doc/md.texi     (revision 207183)
> +++ gcc/doc/md.texi     (working copy)
> @@ -4918,7 +4918,8 @@ the output vector (operand 0).
>  Signed/Unsigned widening multiplication.  The two inputs (operands 1 and 2)
>  are vectors with N signed/unsigned elements of size S@.  Multiply the high/low
>  or even/odd elements of the two vectors, and put the N/2 products of size 2*S
> -in the output vector (operand 0).
> +in the output vector (operand 0). A target shouldn't implement even/odd pattern
> +pair if it is less efficient than lo/hi one.
>
>  @cindex @code{vec_widen_ushiftl_hi_@var{m}} instruction pattern
>  @cindex @code{vec_widen_ushiftl_lo_@var{m}} instruction pattern

Patch

Index: gcc/ChangeLog
===================================================================
--- gcc/ChangeLog	(revision 207183)
+++ gcc/ChangeLog	(working copy)
@@ -1,3 +1,9 @@ 
+2014-01-28  Bingfeng Mei  <bmei@broadcom.com>
+
+	* doc/md.texi: Mention that a target shouldn't implement 
+	vec_widen_(s|u)mul_even/odd pair if it is less efficient
+	than hi/lo pair.
+
 2014-01-28  Richard Biener  <rguenther@suse.de>
 
 	Revert
Index: gcc/doc/md.texi
===================================================================
--- gcc/doc/md.texi	(revision 207183)
+++ gcc/doc/md.texi	(working copy)
@@ -4918,7 +4918,8 @@  the output vector (operand 0).
 Signed/Unsigned widening multiplication.  The two inputs (operands 1 and 2)
 are vectors with N signed/unsigned elements of size S@.  Multiply the high/low
 or even/odd elements of the two vectors, and put the N/2 products of size 2*S
-in the output vector (operand 0).
+in the output vector (operand 0). A target shouldn't implement even/odd pattern
+pair if it is less efficient than lo/hi one.
 
 @cindex @code{vec_widen_ushiftl_hi_@var{m}} instruction pattern
 @cindex @code{vec_widen_ushiftl_lo_@var{m}} instruction pattern