diff mbox

[RTL] Eliminate redundant vec_select moves.

Message ID 52962733.7030005@arm.com
State New
Headers show

Commit Message

Tejas Belagod Nov. 27, 2013, 5:09 p.m. UTC
Richard Sandiford wrote:
> Tejas Belagod <tbelagod@arm.com> writes:
>>> The problem is that one reg rtx can span several hard registers.
>>> E.g. (reg:V4SI 32) might represent one 64-bit register (no. 32),
>>> but it might instead represent two 32-bit registers (nos. 32 and 33).
>>> Obviously the latter's not very likely for vectors this small,
>>> but more likely for larger ones (including on NEON IIRC).
>>>
>>> So if we had 2 32-bit registers being treated as a V4HI, it would be:
>>>
>>>    <--32--><--33-->
>>>    msb          lsb
>>>    0000111122223333
>>>    VVVVVVVV
>>>    00001111
>>>    msb  lsb
>>>    <--32-->
>>>
>>> for big endian and:
>>>
>>>    <--33--><--32-->
>>>    msb          lsb
>>>    3333222211110000
>>>            VVVVVVVV
>>>            11110000
>>>            msb  lsb
>>>            <--32-->
>>>
>>> for little endian.
>> Ah, ok, that makes things clearer. Thanks for that.
>>
>> I can't find any helper function that figures out if we're writing partial or 
>> full result regs. Would something like
>>
>>      REGNO (src) == REGNO (dst) &&
>>      HARD_REGNO_NREGS (src) == HARD_REGNO_NREGS (dst) == 1
>>
>> be a sane check for partial result regs?
> 
> Yeah, that should work.  I think a more general alternative would be:
> 
>   simplify_subreg_regno (REGNO (src), GET_MODE (src),
>                          offset, GET_MODE (dst)) == (int) REGNO (dst)
> 
> where:
> 
>   offset = GET_MODE_UNIT_SIZE (GET_MODE (src)) * INTVAL (XVECEXP (sel, 0))
> 
> That offset is the byte offset of the first selected element from the
> start of a vector in memory, which is also the way that SUBREG_BYTEs
> are counted.  For little-endian it gives the offset of the lsb of the
> slice, while for big-endian it gives the offset of the msb (which is
> also how SUBREG_BYTEs work).
> 
> The simplify_subreg_regno should cope with both single-register vectors
> and multi-register vectors.

Sorry for the delayed response to this.

Thanks for the tip. Here's an improved patch that implements the 
simplify_sureg_regno () method of eliminating redundant moves. Regarding the 
test case, I failed to get the ppc back-end to generate RTL pattern that this 
patch checks for. I can easily write a test case for aarch64(big and little 
endian) on these lines

typedef float float32x4_t __attribute__ ((__vector_size__ (16)));

float foo_be (float32x4_t x)
{
   return x[3];
}

float foo_le (float32x4_t x)
{
   return x[0];
}

where I know that the vector indexing will generate a vec_select on the same src 
and dst regs that could be optimized away and hence test it. But I'm struggling 
to get a test case  that the ppc altivec back-end will generate such a 
vec_select for. I see that altivec does not define vec_extract, so a simple 
indexing like this seems to happen via memory. Also, I don't know enough about 
the ppc PCS or architecture to write a test that will check for this 
optimization opportunity on same src and dst hard-registers. Any hints?

This patch has been bootstrapped on x64_64 and regressed on aarch64-none-elf and 
aarch64_be-none-elf.

Thanks for your patience,
Tejas.

Comments

Richard Sandiford Nov. 28, 2013, 9:28 a.m. UTC | #1
Tejas Belagod <tbelagod@arm.com> writes:
> Richard Sandiford wrote:
>> Tejas Belagod <tbelagod@arm.com> writes:
>>>> The problem is that one reg rtx can span several hard registers.
>>>> E.g. (reg:V4SI 32) might represent one 64-bit register (no. 32),
>>>> but it might instead represent two 32-bit registers (nos. 32 and 33).
>>>> Obviously the latter's not very likely for vectors this small,
>>>> but more likely for larger ones (including on NEON IIRC).
>>>>
>>>> So if we had 2 32-bit registers being treated as a V4HI, it would be:
>>>>
>>>>    <--32--><--33-->
>>>>    msb          lsb
>>>>    0000111122223333
>>>>    VVVVVVVV
>>>>    00001111
>>>>    msb  lsb
>>>>    <--32-->
>>>>
>>>> for big endian and:
>>>>
>>>>    <--33--><--32-->
>>>>    msb          lsb
>>>>    3333222211110000
>>>>            VVVVVVVV
>>>>            11110000
>>>>            msb  lsb
>>>>            <--32-->
>>>>
>>>> for little endian.
>>> Ah, ok, that makes things clearer. Thanks for that.
>>>
>>> I can't find any helper function that figures out if we're writing
>>> partial or
>>> full result regs. Would something like
>>>
>>>      REGNO (src) == REGNO (dst) &&
>>>      HARD_REGNO_NREGS (src) == HARD_REGNO_NREGS (dst) == 1
>>>
>>> be a sane check for partial result regs?
>> 
>> Yeah, that should work.  I think a more general alternative would be:
>> 
>>   simplify_subreg_regno (REGNO (src), GET_MODE (src),
>>                          offset, GET_MODE (dst)) == (int) REGNO (dst)
>> 
>> where:
>> 
>>   offset = GET_MODE_UNIT_SIZE (GET_MODE (src)) * INTVAL (XVECEXP (sel, 0))
>> 
>> That offset is the byte offset of the first selected element from the
>> start of a vector in memory, which is also the way that SUBREG_BYTEs
>> are counted.  For little-endian it gives the offset of the lsb of the
>> slice, while for big-endian it gives the offset of the msb (which is
>> also how SUBREG_BYTEs work).
>> 
>> The simplify_subreg_regno should cope with both single-register vectors
>> and multi-register vectors.
>
> Sorry for the delayed response to this.
>
> Thanks for the tip. Here's an improved patch that implements the 
> simplify_sureg_regno () method of eliminating redundant moves. Regarding the 
> test case, I failed to get the ppc back-end to generate RTL pattern that this 
> patch checks for. I can easily write a test case for aarch64(big and little 
> endian) on these lines
>
> typedef float float32x4_t __attribute__ ((__vector_size__ (16)));
>
> float foo_be (float32x4_t x)
> {
>    return x[3];
> }
>
> float foo_le (float32x4_t x)
> {
>    return x[0];
> }
>
> where I know that the vector indexing will generate a vec_select on
> the same src and dst regs that could be optimized away and hence test
> it. But I'm struggling to get a test case that the ppc altivec
> back-end will generate such a vec_select for. I see that altivec does
> not define vec_extract, so a simple indexing like this seems to happen
> via memory. Also, I don't know enough about the ppc PCS or
> architecture to write a test that will check for this optimization
> opportunity on same src and dst hard-registers. Any hints?

Me neither, sorry.

FWIW, the MIPS tests:

  typedef float float32x2_t __attribute__ ((__vector_size__ (8)));
  void bar (float);
  void foo_be (float32x2_t x) { bar (x[1]); }
  void foo_le (float32x2_t x) { bar (x[0]); }

also exercise it, but I don't think they add anything over the aarch64
versions.  I can add them to the testsuite anyway if it helps though.

> diff --git a/gcc/rtlanal.c b/gcc/rtlanal.c
> index 0cd0c7e..ca25ce5 100644
> --- a/gcc/rtlanal.c
> +++ b/gcc/rtlanal.c
> @@ -1180,6 +1180,22 @@ set_noop_p (const_rtx set)
>        dst = SUBREG_REG (dst);
>      }
>  
> +  /* It is a NOOP if destination overlaps with selected src vector
> +     elements.  */
> +  if (GET_CODE (src) == VEC_SELECT
> +      && REG_P (XEXP (src, 0)) && REG_P (dst)
> +      && HARD_REGISTER_P (XEXP (src, 0))
> +      && HARD_REGISTER_P (dst))
> +    {
> +      rtx par = XEXP (src, 1);
> +      rtx src0 = XEXP (src, 0);
> +      HOST_WIDE_INT offset =
> +	GET_MODE_UNIT_SIZE (GET_MODE (src0)) * INTVAL (XVECEXP (par, 0, 0));
> +
> +      return simplify_subreg_regno (REGNO (src0), GET_MODE (src0),
> +				    offset, GET_MODE (dst)) == (int)REGNO (dst);
> +    }
> +

Since this also (correctly) triggers for vector results, we need to keep
the check for consecutive indices that you had originally.  (It's always
the first index that should be used for the simplify_subreg_regno though.)

Looks good to me otherwise, thanks.

Richard.
diff mbox

Patch

diff --git a/gcc/rtlanal.c b/gcc/rtlanal.c
index 0cd0c7e..ca25ce5 100644
--- a/gcc/rtlanal.c
+++ b/gcc/rtlanal.c
@@ -1180,6 +1180,22 @@  set_noop_p (const_rtx set)
       dst = SUBREG_REG (dst);
     }
 
+  /* It is a NOOP if destination overlaps with selected src vector
+     elements.  */
+  if (GET_CODE (src) == VEC_SELECT
+      && REG_P (XEXP (src, 0)) && REG_P (dst)
+      && HARD_REGISTER_P (XEXP (src, 0))
+      && HARD_REGISTER_P (dst))
+    {
+      rtx par = XEXP (src, 1);
+      rtx src0 = XEXP (src, 0);
+      HOST_WIDE_INT offset =
+	GET_MODE_UNIT_SIZE (GET_MODE (src0)) * INTVAL (XVECEXP (par, 0, 0));
+
+      return simplify_subreg_regno (REGNO (src0), GET_MODE (src0),
+				    offset, GET_MODE (dst)) == (int)REGNO (dst);
+    }
+
   return (REG_P (src) && REG_P (dst)
 	  && REGNO (src) == REGNO (dst));
 }