[AArch64] Fix shuffle for big-endian.

Message ID	53077F31.8070003@arm.com
State	New
Headers	show Return-Path: <gcc-patches-return-362205-incoming=patchwork.ozlabs.org@gcc.gnu.org> DomainKey-Signature: a=rsa-sha1; c=nofws; d=gcc.gnu.org; h=list-id :list-unsubscribe:list-archive:list-post:list-help:sender :message-id:date:from:mime-version:to:cc:subject:content-type; q=dns; s=default; b=UkHpmfgmb3AguAwEHD8fsfYZV2c0FMGO4o4w9VDpsEd tF48EX61qcAEIGfscrJ8hYjOjGI7FJ6VOgdQVDEktwLvEHfw7gJTePb0TygxZQ++ l0HkJi4zZXQN9GGcYTLzAc27nh2CG7oRCs/lNyVaVtwBqWvOw+RRv9aXffJRxO5U = Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk Sender: gcc-patches-owner@gcc.gnu.org Message-ID: <53077F31.8070003@arm.com> Date: Fri, 21 Feb 2014 16:30:41 +0000 From: Tejas Belagod <tbelagod@arm.com> User-Agent: Thunderbird 2.0.0.18 (X11/20081120) MIME-Version: 1.0 To: "gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org> CC: Marcus Shawcroft <Marcus.Shawcroft@arm.com> Subject: [Patch, AArch64] Fix shuffle for big-endian. Content-Type: multipart/mixed; boundary="------------020100030405060301050609"

Tejas Belagod Feb. 21, 2014, 4:30 p.m. UTC

Hi,

When a shuffle of more than one input happens, on NEON we end up with a 
'mixed-endian' format in the register list which TBL operates on. We don't make 
this correction in RTL and therefore the shuffle operation gets it incorrect. 
Here is a patch that fixes-up the index table in the selector rtx in RTL to also 
be mixed-endian to reflect what's happening on NEON.

As trunk stands, this patch will not be exercised as constant vector permute for 
Big-endian is disabled. I've tested this by locally enabling const vec_perm and 
it fixes the some regressions we have on big-endian:

aarch64_be-none-elf:
FAIL->PASS: gcc.c-torture/execute/loop-11.c execution,  -O3 -fomit-frame-pointer
FAIL->PASS: gcc.c-torture/execute/loop-11.c execution,  -O3 -fomit-frame-pointer 
-funroll-all-loops -finline-functions
FAIL->PASS: gcc.c-torture/execute/loop-11.c execution,  -O3 -fomit-frame-pointer 
-funroll-loops
FAIL->PASS: gcc.c-torture/execute/loop-11.c execution,  -O3 -g
FAIL->PASS: gcc.dg/torture/vector-shuffle1.c  -O0  execution test
FAIL->PASS: gcc.dg/torture/vshuf-v16qi.c  -O2  execution test
FAIL->PASS: gcc.dg/torture/vshuf-v2df.c  -O2  execution test
FAIL->PASS: gcc.dg/torture/vshuf-v2di.c  -O2  execution test
FAIL->PASS: gcc.dg/torture/vshuf-v2sf.c  -O2  execution test
FAIL->PASS: gcc.dg/torture/vshuf-v2si.c  -O2  execution test
FAIL->PASS: gcc.dg/torture/vshuf-v4sf.c  -O2  execution test
FAIL->PASS: gcc.dg/torture/vshuf-v4si.c  -O2  execution test
FAIL->PASS: gcc.dg/torture/vshuf-v8hi.c  -O2  execution test
FAIL->PASS: gcc.dg/torture/vshuf-v8qi.c  -O2  execution test
FAIL->PASS: gcc.dg/vect/vect-114.c -flto -ffat-lto-objects execution test
FAIL->PASS: gcc.dg/vect/vect-114.c execution test
FAIL->PASS: gcc.dg/vect/vect-15.c -flto -ffat-lto-objects execution test
FAIL->PASS: gcc.dg/vect/vect-15.c execution test

Also regressed on aarch64-none-elf.

OK for stage-1?

Thanks,
Tejas.

2014-02-21  Tejas Belagod  <tejas.belagod@arm.com>

gcc/
	* config/aarch64/aarch64.c (aarch64_evpc_tbl): Fix index vector for
	big-endian when dealing with more than one input shuffle vector.

Alan Lawrence March 12, 2014, 2:52 p.m. UTC | #1

I've been doing some local testing using this patch as a basis for some of my
own work on NEON intrinsics, and it seems good to me. A couple of points:

(1) Re. the comment that "If two vectors, we end up with a wierd mixed-endian
mode on NEON": firstly "wierd" should be spelt "weird";
secondly, if I understand right, this comment belongs with the next "if
(!d->one_vector_p...)" rather than the "if (BYTES_BIG_ENDIAN)" before which it's
written.

(2) as you say, this code is not exercised, unless you do something to remove
the 'if (BYTES_BIG_ENDIAN) return false;' earlier in that same function. Can I
politely suggest you do that here in this patch?

(3) In my own regression testing, with const_vec_perm enabled on big_endian, I
see 2*PASS->FAIL, namely

gcc.dg/vect/vect-114.c scan-tree-dump-times vect "vectorized 0 loops" 1

gcc.dg/vect/vect-114.c -flto -ffat-lto-objects  scan-tree-dump-times
vect "vectorized 0 loops" 1

These are essentially noise, but the noise is removed and I see no other
problems, if (after this patch) I re-enable the testsuite's "vect_perm" target
selector for aarch64 big-endian (testsuite/lib/target-supports.exp). Would you
like a separate patch for that or roll it in here?

Cheers, Alan

Tejas Belagod wrote:
 > > Hi,
 > >
 > > When a shuffle of more than one input happens, on NEON we end up with a
 > > 'mixed-endian' format in the register list which TBL operates on. We don't 
make
 > > this correction in RTL and therefore the shuffle operation gets it incorrect.
 > > Here is a patch that fixes-up the index table in the selector rtx in RTL to 
also
 > > be mixed-endian to reflect what's happening on NEON.
 > >
 > > As trunk stands, this patch will not be exercised as constant vector 
permute for
 > > Big-endian is disabled. I've tested this by locally enabling const vec_perm 
and
 > > it fixes the some regressions we have on big-endian:
 > >
 > > aarch64_be-none-elf:
 > > FAIL->PASS: gcc.c-torture/execute/loop-11.c execution,  -O3 
-fomit-frame-pointer
 > > FAIL->PASS: gcc.c-torture/execute/loop-11.c execution,  -O3 
-fomit-frame-pointer
 > > -funroll-all-loops -finline-functions
 > > FAIL->PASS: gcc.c-torture/execute/loop-11.c execution,  -O3 
-fomit-frame-pointer
 > > -funroll-loops
 > > FAIL->PASS: gcc.c-torture/execute/loop-11.c execution,  -O3 -g
 > > FAIL->PASS: gcc.dg/torture/vector-shuffle1.c  -O0  execution test
 > > FAIL->PASS: gcc.dg/torture/vshuf-v16qi.c  -O2  execution test
 > > FAIL->PASS: gcc.dg/torture/vshuf-v2df.c  -O2  execution test
 > > FAIL->PASS: gcc.dg/torture/vshuf-v2di.c  -O2  execution test
 > > FAIL->PASS: gcc.dg/torture/vshuf-v2sf.c  -O2  execution test
 > > FAIL->PASS: gcc.dg/torture/vshuf-v2si.c  -O2  execution test
 > > FAIL->PASS: gcc.dg/torture/vshuf-v4sf.c  -O2  execution test
 > > FAIL->PASS: gcc.dg/torture/vshuf-v4si.c  -O2  execution test
 > > FAIL->PASS: gcc.dg/torture/vshuf-v8hi.c  -O2  execution test
 > > FAIL->PASS: gcc.dg/torture/vshuf-v8qi.c  -O2  execution test
 > > FAIL->PASS: gcc.dg/vect/vect-114.c -flto -ffat-lto-objects execution test
 > > FAIL->PASS: gcc.dg/vect/vect-114.c execution test
 > > FAIL->PASS: gcc.dg/vect/vect-15.c -flto -ffat-lto-objects execution test
 > > FAIL->PASS: gcc.dg/vect/vect-15.c execution test
 > >
 > > Also regressed on aarch64-none-elf.
 > >
 > > OK for stage-1?
 > >
 > > Thanks,
 > > Tejas.
 > >
 > > 2014-02-21  Tejas Belagod  <tejas.belagod@arm.com>
 > >
 > > gcc/
 > > 	* config/aarch64/aarch64.c (aarch64_evpc_tbl): Fix index vector for
 > > 	big-endian when dealing with more than one input shuffle vector.
 > >

Alan Lawrence March 24, 2014, 5:42 p.m. UTC | #2

Further to that - all looks good after one-liner 
http://gcc.gnu.org/ml/gcc-patches/2014-03/msg01142.html has gone in. (Without 
that, enabling the code in Tejas' patch causes a regression in 
gcc.dg/torture/vshuf-v4hi.c as loading a vector constant goes wrong).

I'll send a patch to enable this and fix up the testsuite shortly...

Cheers, Alan

Alan Lawrence wrote:
> I've been doing some local testing using this patch as a basis for some of my 
> own work on NEON intrinsics, and it seems good to me. A couple of points:
> 
> (1) Re. the comment that "If two vectors, we end up with a wierd mixed-endian 
> mode on NEON": firstly "wierd" should be spelt "weird";
> secondly, if I understand right, this comment belongs with the next "if
> (!d->one_vector_p...)" rather than the "if (BYTES_BIG_ENDIAN)" before which it's
> written.
> 
> (2) as you say, this code is not exercised, unless you do something to remove 
> the 'if (BYTES_BIG_ENDIAN) return false;' earlier in that same function. Can I 
> politely suggest you do that here in this patch?
> 
> (3) In my own regression testing, with const_vec_perm enabled on big_endian, I 
> see 2*PASS->FAIL, namely
> 
> gcc.dg/vect/vect-114.c scan-tree-dump-times vect "vectorized 0 loops" 1
> 
> gcc.dg/vect/vect-114.c -flto -ffat-lto-objects  scan-tree-dump-times
> vect "vectorized 0 loops" 1
> 
> These are essentially noise, but the noise is removed and I see no other 
> problems, if (after this patch) I re-enable the testsuite's "vect_perm" target 
> selector for aarch64 big-endian (testsuite/lib/target-supports.exp). Would you 
> like a separate patch for that or roll it in here?
> 
> Cheers, Alan
> 
> Tejas Belagod wrote:
>> Hi,
>>
>> When a shuffle of more than one input happens, on NEON we end up with a 
>> 'mixed-endian' format in the register list which TBL operates on. We don't make 
>> this correction in RTL and therefore the shuffle operation gets it incorrect. 
>> Here is a patch that fixes-up the index table in the selector rtx in RTL to also 
>> be mixed-endian to reflect what's happening on NEON.
>>
>> As trunk stands, this patch will not be exercised as constant vector permute for 
>> Big-endian is disabled. I've tested this by locally enabling const vec_perm and 
>> it fixes the some regressions we have on big-endian:
>>
>> aarch64_be-none-elf:
>> FAIL->PASS: gcc.c-torture/execute/loop-11.c execution,  -O3 -fomit-frame-pointer
>> FAIL->PASS: gcc.c-torture/execute/loop-11.c execution,  -O3 -fomit-frame-pointer 
>> -funroll-all-loops -finline-functions
>> FAIL->PASS: gcc.c-torture/execute/loop-11.c execution,  -O3 -fomit-frame-pointer 
>> -funroll-loops
>> FAIL->PASS: gcc.c-torture/execute/loop-11.c execution,  -O3 -g
>> FAIL->PASS: gcc.dg/torture/vector-shuffle1.c  -O0  execution test
>> FAIL->PASS: gcc.dg/torture/vshuf-v16qi.c  -O2  execution test
>> FAIL->PASS: gcc.dg/torture/vshuf-v2df.c  -O2  execution test
>> FAIL->PASS: gcc.dg/torture/vshuf-v2di.c  -O2  execution test
>> FAIL->PASS: gcc.dg/torture/vshuf-v2sf.c  -O2  execution test
>> FAIL->PASS: gcc.dg/torture/vshuf-v2si.c  -O2  execution test
>> FAIL->PASS: gcc.dg/torture/vshuf-v4sf.c  -O2  execution test
>> FAIL->PASS: gcc.dg/torture/vshuf-v4si.c  -O2  execution test
>> FAIL->PASS: gcc.dg/torture/vshuf-v8hi.c  -O2  execution test
>> FAIL->PASS: gcc.dg/torture/vshuf-v8qi.c  -O2  execution test
>> FAIL->PASS: gcc.dg/vect/vect-114.c -flto -ffat-lto-objects execution test
>> FAIL->PASS: gcc.dg/vect/vect-114.c execution test
>> FAIL->PASS: gcc.dg/vect/vect-15.c -flto -ffat-lto-objects execution test
>> FAIL->PASS: gcc.dg/vect/vect-15.c execution test
>>
>> Also regressed on aarch64-none-elf.
>>
>> OK for stage-1?
>>
>> Thanks,
>> Tejas.
>>
>> 2014-02-21  Tejas Belagod  <tejas.belagod@arm.com>
>>
>> gcc/
>> 	* config/aarch64/aarch64.c (aarch64_evpc_tbl): Fix index vector for
>> 	big-endian when dealing with more than one input shuffle vector.
>>
> 
>

Richard Henderson March 24, 2014, 6:25 p.m. UTC | #3

On 02/21/2014 08:30 AM, Tejas Belagod wrote:
> +      /* If two vectors, we end up with a wierd mixed-endian mode on NEON.  */
> +      if (BYTES_BIG_ENDIAN)
> +	{
> +	  if (!d->one_vector_p && d->perm[i] & nunits)
> +	    {
> +	      /* Extract the offset.  */
> +	      elt = d->perm[i] & (nunits - 1);
> +	      /* Reverse the top half.  */
> +	      elt = nunits - 1 - elt;
> +	      /* Offset it by the bottom half.  */
> +	      elt += nunits;
> +	    }
> +	  else
> +	    elt = nunits - 1 - d->perm[i];
> +	}

Isn't this just

  elt = d->perm[i] ^ (nunits - 1);

all the time?  I.e. invert the index within the word,
but leave the word index (nunits) unchanged.


r~

Tejas Belagod March 25, 2014, 5:11 p.m. UTC | #4

Alan Lawrence wrote:
> Further to that - all looks good after one-liner 
> http://gcc.gnu.org/ml/gcc-patches/2014-03/msg01142.html has gone in. (Without 
> that, enabling the code in Tejas' patch causes a regression in 
> gcc.dg/torture/vshuf-v4hi.c as loading a vector constant goes wrong).
> 

Thanks for your fix.

Tejas.

> I'll send a patch to enable this and fix up the testsuite shortly...
> 
> Cheers, Alan
> 
> Alan Lawrence wrote:
>> I've been doing some local testing using this patch as a basis for some of my 
>> own work on NEON intrinsics, and it seems good to me. A couple of points:
>>
>> (1) Re. the comment that "If two vectors, we end up with a wierd mixed-endian 
>> mode on NEON": firstly "wierd" should be spelt "weird";
>> secondly, if I understand right, this comment belongs with the next "if
>> (!d->one_vector_p...)" rather than the "if (BYTES_BIG_ENDIAN)" before which it's
>> written.
>>
>> (2) as you say, this code is not exercised, unless you do something to remove 
>> the 'if (BYTES_BIG_ENDIAN) return false;' earlier in that same function. Can I 
>> politely suggest you do that here in this patch?
>>
>> (3) In my own regression testing, with const_vec_perm enabled on big_endian, I 
>> see 2*PASS->FAIL, namely
>>
>> gcc.dg/vect/vect-114.c scan-tree-dump-times vect "vectorized 0 loops" 1
>>
>> gcc.dg/vect/vect-114.c -flto -ffat-lto-objects  scan-tree-dump-times
>> vect "vectorized 0 loops" 1
>>
>> These are essentially noise, but the noise is removed and I see no other 
>> problems, if (after this patch) I re-enable the testsuite's "vect_perm" target 
>> selector for aarch64 big-endian (testsuite/lib/target-supports.exp). Would you 
>> like a separate patch for that or roll it in here?
>>
>> Cheers, Alan
>>
>> Tejas Belagod wrote:
>>> Hi,
>>>
>>> When a shuffle of more than one input happens, on NEON we end up with a 
>>> 'mixed-endian' format in the register list which TBL operates on. We don't make 
>>> this correction in RTL and therefore the shuffle operation gets it incorrect. 
>>> Here is a patch that fixes-up the index table in the selector rtx in RTL to also 
>>> be mixed-endian to reflect what's happening on NEON.
>>>
>>> As trunk stands, this patch will not be exercised as constant vector permute for 
>>> Big-endian is disabled. I've tested this by locally enabling const vec_perm and 
>>> it fixes the some regressions we have on big-endian:
>>>
>>> aarch64_be-none-elf:
>>> FAIL->PASS: gcc.c-torture/execute/loop-11.c execution,  -O3 -fomit-frame-pointer
>>> FAIL->PASS: gcc.c-torture/execute/loop-11.c execution,  -O3 -fomit-frame-pointer 
>>> -funroll-all-loops -finline-functions
>>> FAIL->PASS: gcc.c-torture/execute/loop-11.c execution,  -O3 -fomit-frame-pointer 
>>> -funroll-loops
>>> FAIL->PASS: gcc.c-torture/execute/loop-11.c execution,  -O3 -g
>>> FAIL->PASS: gcc.dg/torture/vector-shuffle1.c  -O0  execution test
>>> FAIL->PASS: gcc.dg/torture/vshuf-v16qi.c  -O2  execution test
>>> FAIL->PASS: gcc.dg/torture/vshuf-v2df.c  -O2  execution test
>>> FAIL->PASS: gcc.dg/torture/vshuf-v2di.c  -O2  execution test
>>> FAIL->PASS: gcc.dg/torture/vshuf-v2sf.c  -O2  execution test
>>> FAIL->PASS: gcc.dg/torture/vshuf-v2si.c  -O2  execution test
>>> FAIL->PASS: gcc.dg/torture/vshuf-v4sf.c  -O2  execution test
>>> FAIL->PASS: gcc.dg/torture/vshuf-v4si.c  -O2  execution test
>>> FAIL->PASS: gcc.dg/torture/vshuf-v8hi.c  -O2  execution test
>>> FAIL->PASS: gcc.dg/torture/vshuf-v8qi.c  -O2  execution test
>>> FAIL->PASS: gcc.dg/vect/vect-114.c -flto -ffat-lto-objects execution test
>>> FAIL->PASS: gcc.dg/vect/vect-114.c execution test
>>> FAIL->PASS: gcc.dg/vect/vect-15.c -flto -ffat-lto-objects execution test
>>> FAIL->PASS: gcc.dg/vect/vect-15.c execution test
>>>
>>> Also regressed on aarch64-none-elf.
>>>
>>> OK for stage-1?
>>>
>>> Thanks,
>>> Tejas.
>>>
>>> 2014-02-21  Tejas Belagod  <tejas.belagod@arm.com>
>>>
>>> gcc/
>>> 	* config/aarch64/aarch64.c (aarch64_evpc_tbl): Fix index vector for
>>> 	big-endian when dealing with more than one input shuffle vector.
>>>
>>
>

Tejas Belagod March 25, 2014, 5:12 p.m. UTC | #5

Richard Henderson wrote:
> On 02/21/2014 08:30 AM, Tejas Belagod wrote:
>> +      /* If two vectors, we end up with a wierd mixed-endian mode on NEON.  */
>> +      if (BYTES_BIG_ENDIAN)
>> +	{
>> +	  if (!d->one_vector_p && d->perm[i] & nunits)
>> +	    {
>> +	      /* Extract the offset.  */
>> +	      elt = d->perm[i] & (nunits - 1);
>> +	      /* Reverse the top half.  */
>> +	      elt = nunits - 1 - elt;
>> +	      /* Offset it by the bottom half.  */
>> +	      elt += nunits;
>> +	    }
>> +	  else
>> +	    elt = nunits - 1 - d->perm[i];
>> +	}
> 
> Isn't this just
> 
>   elt = d->perm[i] ^ (nunits - 1);
> 
> all the time?  I.e. invert the index within the word,
> but leave the word index (nunits) unchanged.
> 

Yes, I think that works. Thanks!

Tejas.

[AArch64] Fix shuffle for big-endian.

Commit Message

Comments

Patch