diff mbox

[AArch64] Fix shuffle for big-endian.

Message ID 53077F31.8070003@arm.com
State New
Headers show

Commit Message

Tejas Belagod Feb. 21, 2014, 4:30 p.m. UTC
Hi,

When a shuffle of more than one input happens, on NEON we end up with a 
'mixed-endian' format in the register list which TBL operates on. We don't make 
this correction in RTL and therefore the shuffle operation gets it incorrect. 
Here is a patch that fixes-up the index table in the selector rtx in RTL to also 
be mixed-endian to reflect what's happening on NEON.

As trunk stands, this patch will not be exercised as constant vector permute for 
Big-endian is disabled. I've tested this by locally enabling const vec_perm and 
it fixes the some regressions we have on big-endian:

aarch64_be-none-elf:
FAIL->PASS: gcc.c-torture/execute/loop-11.c execution,  -O3 -fomit-frame-pointer
FAIL->PASS: gcc.c-torture/execute/loop-11.c execution,  -O3 -fomit-frame-pointer 
-funroll-all-loops -finline-functions
FAIL->PASS: gcc.c-torture/execute/loop-11.c execution,  -O3 -fomit-frame-pointer 
-funroll-loops
FAIL->PASS: gcc.c-torture/execute/loop-11.c execution,  -O3 -g
FAIL->PASS: gcc.dg/torture/vector-shuffle1.c  -O0  execution test
FAIL->PASS: gcc.dg/torture/vshuf-v16qi.c  -O2  execution test
FAIL->PASS: gcc.dg/torture/vshuf-v2df.c  -O2  execution test
FAIL->PASS: gcc.dg/torture/vshuf-v2di.c  -O2  execution test
FAIL->PASS: gcc.dg/torture/vshuf-v2sf.c  -O2  execution test
FAIL->PASS: gcc.dg/torture/vshuf-v2si.c  -O2  execution test
FAIL->PASS: gcc.dg/torture/vshuf-v4sf.c  -O2  execution test
FAIL->PASS: gcc.dg/torture/vshuf-v4si.c  -O2  execution test
FAIL->PASS: gcc.dg/torture/vshuf-v8hi.c  -O2  execution test
FAIL->PASS: gcc.dg/torture/vshuf-v8qi.c  -O2  execution test
FAIL->PASS: gcc.dg/vect/vect-114.c -flto -ffat-lto-objects execution test
FAIL->PASS: gcc.dg/vect/vect-114.c execution test
FAIL->PASS: gcc.dg/vect/vect-15.c -flto -ffat-lto-objects execution test
FAIL->PASS: gcc.dg/vect/vect-15.c execution test

Also regressed on aarch64-none-elf.

OK for stage-1?

Thanks,
Tejas.

2014-02-21  Tejas Belagod  <tejas.belagod@arm.com>

gcc/
	* config/aarch64/aarch64.c (aarch64_evpc_tbl): Fix index vector for
	big-endian when dealing with more than one input shuffle vector.

Comments

Alan Lawrence March 12, 2014, 2:52 p.m. UTC | #1
I've been doing some local testing using this patch as a basis for some of my
own work on NEON intrinsics, and it seems good to me. A couple of points:

(1) Re. the comment that "If two vectors, we end up with a wierd mixed-endian
mode on NEON": firstly "wierd" should be spelt "weird";
secondly, if I understand right, this comment belongs with the next "if
(!d->one_vector_p...)" rather than the "if (BYTES_BIG_ENDIAN)" before which it's
written.

(2) as you say, this code is not exercised, unless you do something to remove
the 'if (BYTES_BIG_ENDIAN) return false;' earlier in that same function. Can I
politely suggest you do that here in this patch?

(3) In my own regression testing, with const_vec_perm enabled on big_endian, I
see 2*PASS->FAIL, namely

gcc.dg/vect/vect-114.c scan-tree-dump-times vect "vectorized 0 loops" 1

gcc.dg/vect/vect-114.c -flto -ffat-lto-objects  scan-tree-dump-times
vect "vectorized 0 loops" 1

These are essentially noise, but the noise is removed and I see no other
problems, if (after this patch) I re-enable the testsuite's "vect_perm" target
selector for aarch64 big-endian (testsuite/lib/target-supports.exp). Would you
like a separate patch for that or roll it in here?

Cheers, Alan

Tejas Belagod wrote:
 > > Hi,
 > >
 > > When a shuffle of more than one input happens, on NEON we end up with a
 > > 'mixed-endian' format in the register list which TBL operates on. We don't 
make
 > > this correction in RTL and therefore the shuffle operation gets it incorrect.
 > > Here is a patch that fixes-up the index table in the selector rtx in RTL to 
also
 > > be mixed-endian to reflect what's happening on NEON.
 > >
 > > As trunk stands, this patch will not be exercised as constant vector 
permute for
 > > Big-endian is disabled. I've tested this by locally enabling const vec_perm 
and
 > > it fixes the some regressions we have on big-endian:
 > >
 > > aarch64_be-none-elf:
 > > FAIL->PASS: gcc.c-torture/execute/loop-11.c execution,  -O3 
-fomit-frame-pointer
 > > FAIL->PASS: gcc.c-torture/execute/loop-11.c execution,  -O3 
-fomit-frame-pointer
 > > -funroll-all-loops -finline-functions
 > > FAIL->PASS: gcc.c-torture/execute/loop-11.c execution,  -O3 
-fomit-frame-pointer
 > > -funroll-loops
 > > FAIL->PASS: gcc.c-torture/execute/loop-11.c execution,  -O3 -g
 > > FAIL->PASS: gcc.dg/torture/vector-shuffle1.c  -O0  execution test
 > > FAIL->PASS: gcc.dg/torture/vshuf-v16qi.c  -O2  execution test
 > > FAIL->PASS: gcc.dg/torture/vshuf-v2df.c  -O2  execution test
 > > FAIL->PASS: gcc.dg/torture/vshuf-v2di.c  -O2  execution test
 > > FAIL->PASS: gcc.dg/torture/vshuf-v2sf.c  -O2  execution test
 > > FAIL->PASS: gcc.dg/torture/vshuf-v2si.c  -O2  execution test
 > > FAIL->PASS: gcc.dg/torture/vshuf-v4sf.c  -O2  execution test
 > > FAIL->PASS: gcc.dg/torture/vshuf-v4si.c  -O2  execution test
 > > FAIL->PASS: gcc.dg/torture/vshuf-v8hi.c  -O2  execution test
 > > FAIL->PASS: gcc.dg/torture/vshuf-v8qi.c  -O2  execution test
 > > FAIL->PASS: gcc.dg/vect/vect-114.c -flto -ffat-lto-objects execution test
 > > FAIL->PASS: gcc.dg/vect/vect-114.c execution test
 > > FAIL->PASS: gcc.dg/vect/vect-15.c -flto -ffat-lto-objects execution test
 > > FAIL->PASS: gcc.dg/vect/vect-15.c execution test
 > >
 > > Also regressed on aarch64-none-elf.
 > >
 > > OK for stage-1?
 > >
 > > Thanks,
 > > Tejas.
 > >
 > > 2014-02-21  Tejas Belagod  <tejas.belagod@arm.com>
 > >
 > > gcc/
 > > 	* config/aarch64/aarch64.c (aarch64_evpc_tbl): Fix index vector for
 > > 	big-endian when dealing with more than one input shuffle vector.
 > >
Alan Lawrence March 24, 2014, 5:42 p.m. UTC | #2
Further to that - all looks good after one-liner 
http://gcc.gnu.org/ml/gcc-patches/2014-03/msg01142.html has gone in. (Without 
that, enabling the code in Tejas' patch causes a regression in 
gcc.dg/torture/vshuf-v4hi.c as loading a vector constant goes wrong).

I'll send a patch to enable this and fix up the testsuite shortly...

Cheers, Alan

Alan Lawrence wrote:
> I've been doing some local testing using this patch as a basis for some of my 
> own work on NEON intrinsics, and it seems good to me. A couple of points:
> 
> (1) Re. the comment that "If two vectors, we end up with a wierd mixed-endian 
> mode on NEON": firstly "wierd" should be spelt "weird";
> secondly, if I understand right, this comment belongs with the next "if
> (!d->one_vector_p...)" rather than the "if (BYTES_BIG_ENDIAN)" before which it's
> written.
> 
> (2) as you say, this code is not exercised, unless you do something to remove 
> the 'if (BYTES_BIG_ENDIAN) return false;' earlier in that same function. Can I 
> politely suggest you do that here in this patch?
> 
> (3) In my own regression testing, with const_vec_perm enabled on big_endian, I 
> see 2*PASS->FAIL, namely
> 
> gcc.dg/vect/vect-114.c scan-tree-dump-times vect "vectorized 0 loops" 1
> 
> gcc.dg/vect/vect-114.c -flto -ffat-lto-objects  scan-tree-dump-times
> vect "vectorized 0 loops" 1
> 
> These are essentially noise, but the noise is removed and I see no other 
> problems, if (after this patch) I re-enable the testsuite's "vect_perm" target 
> selector for aarch64 big-endian (testsuite/lib/target-supports.exp). Would you 
> like a separate patch for that or roll it in here?
> 
> Cheers, Alan
> 
> Tejas Belagod wrote:
>> Hi,
>>
>> When a shuffle of more than one input happens, on NEON we end up with a 
>> 'mixed-endian' format in the register list which TBL operates on. We don't make 
>> this correction in RTL and therefore the shuffle operation gets it incorrect. 
>> Here is a patch that fixes-up the index table in the selector rtx in RTL to also 
>> be mixed-endian to reflect what's happening on NEON.
>>
>> As trunk stands, this patch will not be exercised as constant vector permute for 
>> Big-endian is disabled. I've tested this by locally enabling const vec_perm and 
>> it fixes the some regressions we have on big-endian:
>>
>> aarch64_be-none-elf:
>> FAIL->PASS: gcc.c-torture/execute/loop-11.c execution,  -O3 -fomit-frame-pointer
>> FAIL->PASS: gcc.c-torture/execute/loop-11.c execution,  -O3 -fomit-frame-pointer 
>> -funroll-all-loops -finline-functions
>> FAIL->PASS: gcc.c-torture/execute/loop-11.c execution,  -O3 -fomit-frame-pointer 
>> -funroll-loops
>> FAIL->PASS: gcc.c-torture/execute/loop-11.c execution,  -O3 -g
>> FAIL->PASS: gcc.dg/torture/vector-shuffle1.c  -O0  execution test
>> FAIL->PASS: gcc.dg/torture/vshuf-v16qi.c  -O2  execution test
>> FAIL->PASS: gcc.dg/torture/vshuf-v2df.c  -O2  execution test
>> FAIL->PASS: gcc.dg/torture/vshuf-v2di.c  -O2  execution test
>> FAIL->PASS: gcc.dg/torture/vshuf-v2sf.c  -O2  execution test
>> FAIL->PASS: gcc.dg/torture/vshuf-v2si.c  -O2  execution test
>> FAIL->PASS: gcc.dg/torture/vshuf-v4sf.c  -O2  execution test
>> FAIL->PASS: gcc.dg/torture/vshuf-v4si.c  -O2  execution test
>> FAIL->PASS: gcc.dg/torture/vshuf-v8hi.c  -O2  execution test
>> FAIL->PASS: gcc.dg/torture/vshuf-v8qi.c  -O2  execution test
>> FAIL->PASS: gcc.dg/vect/vect-114.c -flto -ffat-lto-objects execution test
>> FAIL->PASS: gcc.dg/vect/vect-114.c execution test
>> FAIL->PASS: gcc.dg/vect/vect-15.c -flto -ffat-lto-objects execution test
>> FAIL->PASS: gcc.dg/vect/vect-15.c execution test
>>
>> Also regressed on aarch64-none-elf.
>>
>> OK for stage-1?
>>
>> Thanks,
>> Tejas.
>>
>> 2014-02-21  Tejas Belagod  <tejas.belagod@arm.com>
>>
>> gcc/
>> 	* config/aarch64/aarch64.c (aarch64_evpc_tbl): Fix index vector for
>> 	big-endian when dealing with more than one input shuffle vector.
>>
> 
>
Richard Henderson March 24, 2014, 6:25 p.m. UTC | #3
On 02/21/2014 08:30 AM, Tejas Belagod wrote:
> +      /* If two vectors, we end up with a wierd mixed-endian mode on NEON.  */
> +      if (BYTES_BIG_ENDIAN)
> +	{
> +	  if (!d->one_vector_p && d->perm[i] & nunits)
> +	    {
> +	      /* Extract the offset.  */
> +	      elt = d->perm[i] & (nunits - 1);
> +	      /* Reverse the top half.  */
> +	      elt = nunits - 1 - elt;
> +	      /* Offset it by the bottom half.  */
> +	      elt += nunits;
> +	    }
> +	  else
> +	    elt = nunits - 1 - d->perm[i];
> +	}

Isn't this just

  elt = d->perm[i] ^ (nunits - 1);

all the time?  I.e. invert the index within the word,
but leave the word index (nunits) unchanged.


r~
Tejas Belagod March 25, 2014, 5:11 p.m. UTC | #4
Alan Lawrence wrote:
> Further to that - all looks good after one-liner 
> http://gcc.gnu.org/ml/gcc-patches/2014-03/msg01142.html has gone in. (Without 
> that, enabling the code in Tejas' patch causes a regression in 
> gcc.dg/torture/vshuf-v4hi.c as loading a vector constant goes wrong).
> 

Thanks for your fix.

Tejas.

> I'll send a patch to enable this and fix up the testsuite shortly...
> 
> Cheers, Alan
> 
> Alan Lawrence wrote:
>> I've been doing some local testing using this patch as a basis for some of my 
>> own work on NEON intrinsics, and it seems good to me. A couple of points:
>>
>> (1) Re. the comment that "If two vectors, we end up with a wierd mixed-endian 
>> mode on NEON": firstly "wierd" should be spelt "weird";
>> secondly, if I understand right, this comment belongs with the next "if
>> (!d->one_vector_p...)" rather than the "if (BYTES_BIG_ENDIAN)" before which it's
>> written.
>>
>> (2) as you say, this code is not exercised, unless you do something to remove 
>> the 'if (BYTES_BIG_ENDIAN) return false;' earlier in that same function. Can I 
>> politely suggest you do that here in this patch?
>>
>> (3) In my own regression testing, with const_vec_perm enabled on big_endian, I 
>> see 2*PASS->FAIL, namely
>>
>> gcc.dg/vect/vect-114.c scan-tree-dump-times vect "vectorized 0 loops" 1
>>
>> gcc.dg/vect/vect-114.c -flto -ffat-lto-objects  scan-tree-dump-times
>> vect "vectorized 0 loops" 1
>>
>> These are essentially noise, but the noise is removed and I see no other 
>> problems, if (after this patch) I re-enable the testsuite's "vect_perm" target 
>> selector for aarch64 big-endian (testsuite/lib/target-supports.exp). Would you 
>> like a separate patch for that or roll it in here?
>>
>> Cheers, Alan
>>
>> Tejas Belagod wrote:
>>> Hi,
>>>
>>> When a shuffle of more than one input happens, on NEON we end up with a 
>>> 'mixed-endian' format in the register list which TBL operates on. We don't make 
>>> this correction in RTL and therefore the shuffle operation gets it incorrect. 
>>> Here is a patch that fixes-up the index table in the selector rtx in RTL to also 
>>> be mixed-endian to reflect what's happening on NEON.
>>>
>>> As trunk stands, this patch will not be exercised as constant vector permute for 
>>> Big-endian is disabled. I've tested this by locally enabling const vec_perm and 
>>> it fixes the some regressions we have on big-endian:
>>>
>>> aarch64_be-none-elf:
>>> FAIL->PASS: gcc.c-torture/execute/loop-11.c execution,  -O3 -fomit-frame-pointer
>>> FAIL->PASS: gcc.c-torture/execute/loop-11.c execution,  -O3 -fomit-frame-pointer 
>>> -funroll-all-loops -finline-functions
>>> FAIL->PASS: gcc.c-torture/execute/loop-11.c execution,  -O3 -fomit-frame-pointer 
>>> -funroll-loops
>>> FAIL->PASS: gcc.c-torture/execute/loop-11.c execution,  -O3 -g
>>> FAIL->PASS: gcc.dg/torture/vector-shuffle1.c  -O0  execution test
>>> FAIL->PASS: gcc.dg/torture/vshuf-v16qi.c  -O2  execution test
>>> FAIL->PASS: gcc.dg/torture/vshuf-v2df.c  -O2  execution test
>>> FAIL->PASS: gcc.dg/torture/vshuf-v2di.c  -O2  execution test
>>> FAIL->PASS: gcc.dg/torture/vshuf-v2sf.c  -O2  execution test
>>> FAIL->PASS: gcc.dg/torture/vshuf-v2si.c  -O2  execution test
>>> FAIL->PASS: gcc.dg/torture/vshuf-v4sf.c  -O2  execution test
>>> FAIL->PASS: gcc.dg/torture/vshuf-v4si.c  -O2  execution test
>>> FAIL->PASS: gcc.dg/torture/vshuf-v8hi.c  -O2  execution test
>>> FAIL->PASS: gcc.dg/torture/vshuf-v8qi.c  -O2  execution test
>>> FAIL->PASS: gcc.dg/vect/vect-114.c -flto -ffat-lto-objects execution test
>>> FAIL->PASS: gcc.dg/vect/vect-114.c execution test
>>> FAIL->PASS: gcc.dg/vect/vect-15.c -flto -ffat-lto-objects execution test
>>> FAIL->PASS: gcc.dg/vect/vect-15.c execution test
>>>
>>> Also regressed on aarch64-none-elf.
>>>
>>> OK for stage-1?
>>>
>>> Thanks,
>>> Tejas.
>>>
>>> 2014-02-21  Tejas Belagod  <tejas.belagod@arm.com>
>>>
>>> gcc/
>>> 	* config/aarch64/aarch64.c (aarch64_evpc_tbl): Fix index vector for
>>> 	big-endian when dealing with more than one input shuffle vector.
>>>
>>
>
Tejas Belagod March 25, 2014, 5:12 p.m. UTC | #5
Richard Henderson wrote:
> On 02/21/2014 08:30 AM, Tejas Belagod wrote:
>> +      /* If two vectors, we end up with a wierd mixed-endian mode on NEON.  */
>> +      if (BYTES_BIG_ENDIAN)
>> +	{
>> +	  if (!d->one_vector_p && d->perm[i] & nunits)
>> +	    {
>> +	      /* Extract the offset.  */
>> +	      elt = d->perm[i] & (nunits - 1);
>> +	      /* Reverse the top half.  */
>> +	      elt = nunits - 1 - elt;
>> +	      /* Offset it by the bottom half.  */
>> +	      elt += nunits;
>> +	    }
>> +	  else
>> +	    elt = nunits - 1 - d->perm[i];
>> +	}
> 
> Isn't this just
> 
>   elt = d->perm[i] ^ (nunits - 1);
> 
> all the time?  I.e. invert the index within the word,
> but leave the word index (nunits) unchanged.
> 

Yes, I think that works. Thanks!

Tejas.
diff mbox

Patch

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index ea90311..fd473a3 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -8128,7 +8128,28 @@  aarch64_evpc_tbl (struct expand_vec_perm_d *d)
     return false;

   for (i = 0; i < nelt; ++i)
-    rperm[i] = GEN_INT (d->perm[i]);
+    {
+      int nunits = GET_MODE_NUNITS (vmode);
+      int elt = d->perm[i];
+
+      /* If two vectors, we end up with a wierd mixed-endian mode on NEON.  */
+      if (BYTES_BIG_ENDIAN)
+	{
+	  if (!d->one_vector_p && d->perm[i] & nunits)
+	    {
+	      /* Extract the offset.  */
+	      elt = d->perm[i] & (nunits - 1);
+	      /* Reverse the top half.  */
+	      elt = nunits - 1 - elt;
+	      /* Offset it by the bottom half.  */
+	      elt += nunits;
+	    }
+	  else
+	    elt = nunits - 1 - d->perm[i];
+	}
+
+      rperm[i] = GEN_INT (elt);
+    }
   sel = gen_rtx_CONST_VECTOR (vmode, gen_rtvec_v (nelt, rperm));
   sel = force_reg (vmode, sel);