diff mbox

[RTL] Eliminate redundant vec_select moves.

Message ID 529F5318.1030505@arm.com
State New
Headers show

Commit Message

Tejas Belagod Dec. 4, 2013, 4:06 p.m. UTC
Richard Sandiford wrote:
> Tejas Belagod <tbelagod@arm.com> writes:
>> Richard Sandiford wrote:
>>> Tejas Belagod <tbelagod@arm.com> writes:
>>>>> The problem is that one reg rtx can span several hard registers.
>>>>> E.g. (reg:V4SI 32) might represent one 64-bit register (no. 32),
>>>>> but it might instead represent two 32-bit registers (nos. 32 and 33).
>>>>> Obviously the latter's not very likely for vectors this small,
>>>>> but more likely for larger ones (including on NEON IIRC).
>>>>>
>>>>> So if we had 2 32-bit registers being treated as a V4HI, it would be:
>>>>>
>>>>>    <--32--><--33-->
>>>>>    msb          lsb
>>>>>    0000111122223333
>>>>>    VVVVVVVV
>>>>>    00001111
>>>>>    msb  lsb
>>>>>    <--32-->
>>>>>
>>>>> for big endian and:
>>>>>
>>>>>    <--33--><--32-->
>>>>>    msb          lsb
>>>>>    3333222211110000
>>>>>            VVVVVVVV
>>>>>            11110000
>>>>>            msb  lsb
>>>>>            <--32-->
>>>>>
>>>>> for little endian.
>>>> Ah, ok, that makes things clearer. Thanks for that.
>>>>
>>>> I can't find any helper function that figures out if we're writing
>>>> partial or
>>>> full result regs. Would something like
>>>>
>>>>      REGNO (src) == REGNO (dst) &&
>>>>      HARD_REGNO_NREGS (src) == HARD_REGNO_NREGS (dst) == 1
>>>>
>>>> be a sane check for partial result regs?
>>> Yeah, that should work.  I think a more general alternative would be:
>>>
>>>   simplify_subreg_regno (REGNO (src), GET_MODE (src),
>>>                          offset, GET_MODE (dst)) == (int) REGNO (dst)
>>>
>>> where:
>>>
>>>   offset = GET_MODE_UNIT_SIZE (GET_MODE (src)) * INTVAL (XVECEXP (sel, 0))
>>>
>>> That offset is the byte offset of the first selected element from the
>>> start of a vector in memory, which is also the way that SUBREG_BYTEs
>>> are counted.  For little-endian it gives the offset of the lsb of the
>>> slice, while for big-endian it gives the offset of the msb (which is
>>> also how SUBREG_BYTEs work).
>>>
>>> The simplify_subreg_regno should cope with both single-register vectors
>>> and multi-register vectors.
>> Sorry for the delayed response to this.
>>
>> Thanks for the tip. Here's an improved patch that implements the 
>> simplify_sureg_regno () method of eliminating redundant moves. Regarding the 
>> test case, I failed to get the ppc back-end to generate RTL pattern that this 
>> patch checks for. I can easily write a test case for aarch64(big and little 
>> endian) on these lines
>>
>> typedef float float32x4_t __attribute__ ((__vector_size__ (16)));
>>
>> float foo_be (float32x4_t x)
>> {
>>    return x[3];
>> }
>>
>> float foo_le (float32x4_t x)
>> {
>>    return x[0];
>> }
>>
>> where I know that the vector indexing will generate a vec_select on
>> the same src and dst regs that could be optimized away and hence test
>> it. But I'm struggling to get a test case that the ppc altivec
>> back-end will generate such a vec_select for. I see that altivec does
>> not define vec_extract, so a simple indexing like this seems to happen
>> via memory. Also, I don't know enough about the ppc PCS or
>> architecture to write a test that will check for this optimization
>> opportunity on same src and dst hard-registers. Any hints?
> 
> Me neither, sorry.
> 
> FWIW, the MIPS tests:
> 
>   typedef float float32x2_t __attribute__ ((__vector_size__ (8)));
>   void bar (float);
>   void foo_be (float32x2_t x) { bar (x[1]); }
>   void foo_le (float32x2_t x) { bar (x[0]); }
> 
> also exercise it, but I don't think they add anything over the aarch64
> versions.  I can add them to the testsuite anyway if it helps though.
> 
>> diff --git a/gcc/rtlanal.c b/gcc/rtlanal.c
>> index 0cd0c7e..ca25ce5 100644
>> --- a/gcc/rtlanal.c
>> +++ b/gcc/rtlanal.c
>> @@ -1180,6 +1180,22 @@ set_noop_p (const_rtx set)
>>        dst = SUBREG_REG (dst);
>>      }
>>  
>> +  /* It is a NOOP if destination overlaps with selected src vector
>> +     elements.  */
>> +  if (GET_CODE (src) == VEC_SELECT
>> +      && REG_P (XEXP (src, 0)) && REG_P (dst)
>> +      && HARD_REGISTER_P (XEXP (src, 0))
>> +      && HARD_REGISTER_P (dst))
>> +    {
>> +      rtx par = XEXP (src, 1);
>> +      rtx src0 = XEXP (src, 0);
>> +      HOST_WIDE_INT offset =
>> +	GET_MODE_UNIT_SIZE (GET_MODE (src0)) * INTVAL (XVECEXP (par, 0, 0));
>> +
>> +      return simplify_subreg_regno (REGNO (src0), GET_MODE (src0),
>> +				    offset, GET_MODE (dst)) == (int)REGNO (dst);
>> +    }
>> +
> 
> Since this also (correctly) triggers for vector results, we need to keep
> the check for consecutive indices that you had originally.  (It's always
> the first index that should be used for the simplify_subreg_regno though.)
> 
> Looks good to me otherwise, thanks.

Thanks Richard. Here is a revised patch. Sorry about the delay - I was 
investigating to make sure an LRA ICE I was seeing on aarch64 was unrelated to 
this patch. I've added a test case that I expect to pass for aarch64. I've also 
added the tests that you suggested for MIPS, but haven't checked for the target 
because I'm not sure what optimizations happen on MIPS.

OK for trunk?

Thanks,
Tejas.

2013-12-04  Tejas Belagod  <tejas.belagod@arm.com>

gcc/
	* rtlanal.c (set_noop_p): Return nonzero in case of redundant vec_select
	for overlapping register lanes.

testsuite/
	* config/gcc.dg/vect/vect-nop-move.c: New.

Comments

H.J. Lu Dec. 4, 2013, 4:14 p.m. UTC | #1
On Wed, Dec 4, 2013 at 8:06 AM, Tejas Belagod <tbelagod@arm.com> wrote:
> Thanks Richard. Here is a revised patch. Sorry about the delay - I was
> investigating to make sure an LRA ICE I was seeing on aarch64 was unrelated
> to this patch. I've added a test case that I expect to pass for aarch64.
> I've also added the tests that you suggested for MIPS, but haven't checked
> for the target because I'm not sure what optimizations happen on MIPS.
>
> OK for trunk?
>
> Thanks,
> Tejas.
>
> 2013-12-04  Tejas Belagod  <tejas.belagod@arm.com>
>
>
> gcc/
>         * rtlanal.c (set_noop_p): Return nonzero in case of redundant
> vec_select
>         for overlapping register lanes.
>
> testsuite/
>         * config/gcc.dg/vect/vect-nop-move.c: New.
>
>
> diff --git a/gcc/rtlanal.c b/gcc/rtlanal.c
> index 0cd0c7e..e1388c8 100644
> --- a/gcc/rtlanal.c
> +++ b/gcc/rtlanal.c
> @@ -1180,6 +1180,26 @@ set_noop_p (const_rtx set)
>        dst = SUBREG_REG (dst);
>      }
>
> +  /* It is a NOOP if destination overlaps with selected src vector
> +     elements.  */
> +  if (GET_CODE (src) == VEC_SELECT
> +      && REG_P (XEXP (src, 0)) && REG_P (dst)
> +      && HARD_REGISTER_P (XEXP (src, 0))
> +      && HARD_REGISTER_P (dst))
> +    {
> +      int i;
> +      rtx par = XEXP (src, 1);
> +      rtx src0 = XEXP (src, 0);
> +      int c0 = INTVAL (XVECEXP (par, 0, 0));
> +      HOST_WIDE_INT offset = GET_MODE_UNIT_SIZE (GET_MODE (src0)) * c0;
> +
> +      for (i = 1; i < XVECLEN (par, 0); i++)
> +       if (INTVAL (XVECEXP (par, 0, i)) != c0 + i)
> +         return 0;
> +      return simplify_subreg_regno (REGNO (src0), GET_MODE (src0),
> +                                   offset, GET_MODE (dst)) == (int)REGNO
> (dst);
> +    }
> +
>    return (REG_P (src) && REG_P (dst)
>           && REGNO (src) == REGNO (dst));
>  }
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-nop-move.c
> b/gcc/testsuite/gcc.dg/vect/vect-nop-move.c
> new file mode 100644
> index 0000000..1941933
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/vect-nop-move.c
> @@ -0,0 +1,64 @@
> +/* { dg-do run } */
> +/* { dg-require-effective-target vect_float } */
> +/* { dg-options "-O3 -fdump-rtl-combine-details" } */
> +
> +extern void abort (void);
> +
> +#define NOINLINE __attribute__((noinline))
> +
> +typedef float float32x4_t __attribute__ ((__vector_size__ (16)));
> +typedef float float32x2_t __attribute__ ((__vector_size__ (8)));
> +
> +NOINLINE float
> +foo32x4_be (float32x4_t x)
> +{
> +  return x[3];
> +}
> +
> +NOINLINE float
> +foo32x4_le (float32x4_t x)
> +{
> +  return x[0];
> +}
> +
> +NOINLINE float
> +bar (float a)
> +{
> +  return a;
> +}
> +
> +NOINLINE float
> +foo32x2_be (float32x2_t x)
> +{
> +  return bar (x[1]);
> +}
> +
> +NOINLINE float
> +foo32x2_le (float32x2_t x)
> +{
> +  return bar (x[0]);
> +}
> +
> +int
> +main()
> +{
> +  float32x4_t a = { 0.0f, 1.0f, 2.0f, 3.0f };
> +  float32x2_t b = { 0.0f, 1.0f };
> +
> +  if (foo32x4_be (a) != 3.0f)
> +    abort ();
> +
> +  if (foo32x4_le (a) != 0.0f)
> +    abort ();
> +
> +  if (foo32x2_be (b) != 1.0f)
> +    abort ();
> +
> +  if (foo32x2_le (b) != 0.0f)
> +    abort ();
> +
> +  return 0;
> +}
> +
> +/* { dg-final { scan-rtl-dump "deleting noop move" "combine" { target
> aarch64*-*-* } } } */

Any particular reason why it doesn't work for x86?

> +/* { dg-final { cleanup-rtl-dump "combine" } } */

Thanks.
Jeff Law Dec. 4, 2013, 5:29 p.m. UTC | #2
On 12/04/13 09:14, H.J. Lu wrote:

>> +
>> +/* { dg-final { scan-rtl-dump "deleting noop move" "combine" { target
>> aarch64*-*-* } } } */
>
> Any particular reason why it doesn't work for x86?
I don't think so.  I'm pretty sure Tejas is focused on ARM platforms for 
the obvious reason.

jeff
H.J. Lu Dec. 4, 2013, 5:31 p.m. UTC | #3
On Wed, Dec 4, 2013 at 9:29 AM, Jeff Law <law@redhat.com> wrote:
> On 12/04/13 09:14, H.J. Lu wrote:
>
>>> +
>>> +/* { dg-final { scan-rtl-dump "deleting noop move" "combine" { target
>>> aarch64*-*-* } } } */
>>
>>
>> Any particular reason why it doesn't work for x86?
>
> I don't think so.  I'm pretty sure Tejas is focused on ARM platforms for the
> obvious reason.
>

Then please add "i?86-*-* x86_64-*-*".

Thanks.
Richard Sandiford Dec. 4, 2013, 5:36 p.m. UTC | #4
Tejas Belagod <tbelagod@arm.com> writes:
> Richard Sandiford wrote:
>> Tejas Belagod <tbelagod@arm.com> writes:
>>> Richard Sandiford wrote:
>>>> Tejas Belagod <tbelagod@arm.com> writes:
>>>>>> The problem is that one reg rtx can span several hard registers.
>>>>>> E.g. (reg:V4SI 32) might represent one 64-bit register (no. 32),
>>>>>> but it might instead represent two 32-bit registers (nos. 32 and 33).
>>>>>> Obviously the latter's not very likely for vectors this small,
>>>>>> but more likely for larger ones (including on NEON IIRC).
>>>>>>
>>>>>> So if we had 2 32-bit registers being treated as a V4HI, it would be:
>>>>>>
>>>>>>    <--32--><--33-->
>>>>>>    msb          lsb
>>>>>>    0000111122223333
>>>>>>    VVVVVVVV
>>>>>>    00001111
>>>>>>    msb  lsb
>>>>>>    <--32-->
>>>>>>
>>>>>> for big endian and:
>>>>>>
>>>>>>    <--33--><--32-->
>>>>>>    msb          lsb
>>>>>>    3333222211110000
>>>>>>            VVVVVVVV
>>>>>>            11110000
>>>>>>            msb  lsb
>>>>>>            <--32-->
>>>>>>
>>>>>> for little endian.
>>>>> Ah, ok, that makes things clearer. Thanks for that.
>>>>>
>>>>> I can't find any helper function that figures out if we're writing
>>>>> partial or
>>>>> full result regs. Would something like
>>>>>
>>>>>      REGNO (src) == REGNO (dst) &&
>>>>>      HARD_REGNO_NREGS (src) == HARD_REGNO_NREGS (dst) == 1
>>>>>
>>>>> be a sane check for partial result regs?
>>>> Yeah, that should work.  I think a more general alternative would be:
>>>>
>>>>   simplify_subreg_regno (REGNO (src), GET_MODE (src),
>>>>                          offset, GET_MODE (dst)) == (int) REGNO (dst)
>>>>
>>>> where:
>>>>
>>>>   offset = GET_MODE_UNIT_SIZE (GET_MODE (src)) * INTVAL (XVECEXP (sel, 0))
>>>>
>>>> That offset is the byte offset of the first selected element from the
>>>> start of a vector in memory, which is also the way that SUBREG_BYTEs
>>>> are counted.  For little-endian it gives the offset of the lsb of the
>>>> slice, while for big-endian it gives the offset of the msb (which is
>>>> also how SUBREG_BYTEs work).
>>>>
>>>> The simplify_subreg_regno should cope with both single-register vectors
>>>> and multi-register vectors.
>>> Sorry for the delayed response to this.
>>>
>>> Thanks for the tip. Here's an improved patch that implements the 
>>> simplify_sureg_regno () method of eliminating redundant moves. Regarding the 
>>> test case, I failed to get the ppc back-end to generate RTL pattern
>>> that this
>>> patch checks for. I can easily write a test case for aarch64(big and little 
>>> endian) on these lines
>>>
>>> typedef float float32x4_t __attribute__ ((__vector_size__ (16)));
>>>
>>> float foo_be (float32x4_t x)
>>> {
>>>    return x[3];
>>> }
>>>
>>> float foo_le (float32x4_t x)
>>> {
>>>    return x[0];
>>> }
>>>
>>> where I know that the vector indexing will generate a vec_select on
>>> the same src and dst regs that could be optimized away and hence test
>>> it. But I'm struggling to get a test case that the ppc altivec
>>> back-end will generate such a vec_select for. I see that altivec does
>>> not define vec_extract, so a simple indexing like this seems to happen
>>> via memory. Also, I don't know enough about the ppc PCS or
>>> architecture to write a test that will check for this optimization
>>> opportunity on same src and dst hard-registers. Any hints?
>> 
>> Me neither, sorry.
>> 
>> FWIW, the MIPS tests:
>> 
>>   typedef float float32x2_t __attribute__ ((__vector_size__ (8)));
>>   void bar (float);
>>   void foo_be (float32x2_t x) { bar (x[1]); }
>>   void foo_le (float32x2_t x) { bar (x[0]); }
>> 
>> also exercise it, but I don't think they add anything over the aarch64
>> versions.  I can add them to the testsuite anyway if it helps though.
>> 
>>> diff --git a/gcc/rtlanal.c b/gcc/rtlanal.c
>>> index 0cd0c7e..ca25ce5 100644
>>> --- a/gcc/rtlanal.c
>>> +++ b/gcc/rtlanal.c
>>> @@ -1180,6 +1180,22 @@ set_noop_p (const_rtx set)
>>>        dst = SUBREG_REG (dst);
>>>      }
>>>  
>>> +  /* It is a NOOP if destination overlaps with selected src vector
>>> +     elements.  */
>>> +  if (GET_CODE (src) == VEC_SELECT
>>> +      && REG_P (XEXP (src, 0)) && REG_P (dst)
>>> +      && HARD_REGISTER_P (XEXP (src, 0))
>>> +      && HARD_REGISTER_P (dst))
>>> +    {
>>> +      rtx par = XEXP (src, 1);
>>> +      rtx src0 = XEXP (src, 0);
>>> +      HOST_WIDE_INT offset =
>>> +	GET_MODE_UNIT_SIZE (GET_MODE (src0)) * INTVAL (XVECEXP (par, 0, 0));
>>> +
>>> +      return simplify_subreg_regno (REGNO (src0), GET_MODE (src0),
>>> +				    offset, GET_MODE (dst)) == (int)REGNO (dst);
>>> +    }
>>> +
>> 
>> Since this also (correctly) triggers for vector results, we need to keep
>> the check for consecutive indices that you had originally.  (It's always
>> the first index that should be used for the simplify_subreg_regno though.)
>> 
>> Looks good to me otherwise, thanks.
>
> Thanks Richard. Here is a revised patch. Sorry about the delay - I was
> investigating to make sure an LRA ICE I was seeing on aarch64 was
> unrelated to this patch. I've added a test case that I expect to pass
> for aarch64. I've also added the tests that you suggested for MIPS,
> but haven't checked for the target because I'm not sure what
> optimizations happen on MIPS.

Thanks, looks good to me, but I can't approve it.  Just one minor
formatting nit:

> +      return simplify_subreg_regno (REGNO (src0), GET_MODE (src0),
> +				    offset, GET_MODE (dst)) == (int)REGNO (dst);

space after "(int)".

Richard
Jeff Law Dec. 4, 2013, 8:03 p.m. UTC | #5
On 12/04/13 09:06, Tejas Belagod wrote:
> Richard Sandiford wrote:
>> Tejas Belagod <tbelagod@arm.com> writes:
>>> Richard Sandiford wrote:
>>>> Tejas Belagod <tbelagod@arm.com> writes:
>>>>>> The problem is that one reg rtx can span several hard registers.
>>>>>> E.g. (reg:V4SI 32) might represent one 64-bit register (no. 32),
>>>>>> but it might instead represent two 32-bit registers (nos. 32 and 33).
>>>>>> Obviously the latter's not very likely for vectors this small,
>>>>>> but more likely for larger ones (including on NEON IIRC).
>>>>>>
>>>>>> So if we had 2 32-bit registers being treated as a V4HI, it would be:
>>>>>>
>>>>>>    <--32--><--33-->
>>>>>>    msb          lsb
>>>>>>    0000111122223333
>>>>>>    VVVVVVVV
>>>>>>    00001111
>>>>>>    msb  lsb
>>>>>>    <--32-->
>>>>>>
>>>>>> for big endian and:
>>>>>>
>>>>>>    <--33--><--32-->
>>>>>>    msb          lsb
>>>>>>    3333222211110000
>>>>>>            VVVVVVVV
>>>>>>            11110000
>>>>>>            msb  lsb
>>>>>>            <--32-->
>>>>>>
>>>>>> for little endian.
>>>>> Ah, ok, that makes things clearer. Thanks for that.
>>>>>
>>>>> I can't find any helper function that figures out if we're writing
>>>>> partial or
>>>>> full result regs. Would something like
>>>>>
>>>>>      REGNO (src) == REGNO (dst) &&
>>>>>      HARD_REGNO_NREGS (src) == HARD_REGNO_NREGS (dst) == 1
>>>>>
>>>>> be a sane check for partial result regs?
>>>> Yeah, that should work.  I think a more general alternative would be:
>>>>
>>>>   simplify_subreg_regno (REGNO (src), GET_MODE (src),
>>>>                          offset, GET_MODE (dst)) == (int) REGNO (dst)
>>>>
>>>> where:
>>>>
>>>>   offset = GET_MODE_UNIT_SIZE (GET_MODE (src)) * INTVAL (XVECEXP
>>>> (sel, 0))
>>>>
>>>> That offset is the byte offset of the first selected element from the
>>>> start of a vector in memory, which is also the way that SUBREG_BYTEs
>>>> are counted.  For little-endian it gives the offset of the lsb of the
>>>> slice, while for big-endian it gives the offset of the msb (which is
>>>> also how SUBREG_BYTEs work).
>>>>
>>>> The simplify_subreg_regno should cope with both single-register vectors
>>>> and multi-register vectors.
>>> Sorry for the delayed response to this.
>>>
>>> Thanks for the tip. Here's an improved patch that implements the
>>> simplify_sureg_regno () method of eliminating redundant moves.
>>> Regarding the test case, I failed to get the ppc back-end to generate
>>> RTL pattern that this patch checks for. I can easily write a test
>>> case for aarch64(big and little endian) on these lines
>>>
>>> typedef float float32x4_t __attribute__ ((__vector_size__ (16)));
>>>
>>> float foo_be (float32x4_t x)
>>> {
>>>    return x[3];
>>> }
>>>
>>> float foo_le (float32x4_t x)
>>> {
>>>    return x[0];
>>> }
>>>
>>> where I know that the vector indexing will generate a vec_select on
>>> the same src and dst regs that could be optimized away and hence test
>>> it. But I'm struggling to get a test case that the ppc altivec
>>> back-end will generate such a vec_select for. I see that altivec does
>>> not define vec_extract, so a simple indexing like this seems to happen
>>> via memory. Also, I don't know enough about the ppc PCS or
>>> architecture to write a test that will check for this optimization
>>> opportunity on same src and dst hard-registers. Any hints?
>>
>> Me neither, sorry.
>>
>> FWIW, the MIPS tests:
>>
>>   typedef float float32x2_t __attribute__ ((__vector_size__ (8)));
>>   void bar (float);
>>   void foo_be (float32x2_t x) { bar (x[1]); }
>>   void foo_le (float32x2_t x) { bar (x[0]); }
>>
>> also exercise it, but I don't think they add anything over the aarch64
>> versions.  I can add them to the testsuite anyway if it helps though.
>>
>>> diff --git a/gcc/rtlanal.c b/gcc/rtlanal.c
>>> index 0cd0c7e..ca25ce5 100644
>>> --- a/gcc/rtlanal.c
>>> +++ b/gcc/rtlanal.c
>>> @@ -1180,6 +1180,22 @@ set_noop_p (const_rtx set)
>>>        dst = SUBREG_REG (dst);
>>>      }
>>>
>>> +  /* It is a NOOP if destination overlaps with selected src vector
>>> +     elements.  */
>>> +  if (GET_CODE (src) == VEC_SELECT
>>> +      && REG_P (XEXP (src, 0)) && REG_P (dst)
>>> +      && HARD_REGISTER_P (XEXP (src, 0))
>>> +      && HARD_REGISTER_P (dst))
>>> +    {
>>> +      rtx par = XEXP (src, 1);
>>> +      rtx src0 = XEXP (src, 0);
>>> +      HOST_WIDE_INT offset =
>>> +    GET_MODE_UNIT_SIZE (GET_MODE (src0)) * INTVAL (XVECEXP (par, 0,
>>> 0));
>>> +
>>> +      return simplify_subreg_regno (REGNO (src0), GET_MODE (src0),
>>> +                    offset, GET_MODE (dst)) == (int)REGNO (dst);
>>> +    }
>>> +
>>
>> Since this also (correctly) triggers for vector results, we need to keep
>> the check for consecutive indices that you had originally.  (It's always
>> the first index that should be used for the simplify_subreg_regno
>> though.)
>>
>> Looks good to me otherwise, thanks.
>
> Thanks Richard. Here is a revised patch. Sorry about the delay - I was
> investigating to make sure an LRA ICE I was seeing on aarch64 was
> unrelated to this patch. I've added a test case that I expect to pass
> for aarch64. I've also added the tests that you suggested for MIPS, but
> haven't checked for the target because I'm not sure what optimizations
> happen on MIPS.
>
> OK for trunk?
>
> Thanks,
> Tejas.
>
> 2013-12-04  Tejas Belagod  <tejas.belagod@arm.com>
>
> gcc/
>      * rtlanal.c (set_noop_p): Return nonzero in case of redundant
> vec_select
>      for overlapping register lanes.
>
> testsuite/
>      * config/gcc.dg/vect/vect-nop-move.c: New.
Per HJ's request please test vect-nop-move on x86/x86_64 and if the 
redundant move is properly eliminated, enable the test on those targets 
(i?86-*-* x86_64-*-*).

Approved with that change for the trunk.

jeff
Tejas Belagod Dec. 5, 2013, 1:17 p.m. UTC | #6
H.J. Lu wrote:
> On Wed, Dec 4, 2013 at 9:29 AM, Jeff Law <law@redhat.com> wrote:
>> On 12/04/13 09:14, H.J. Lu wrote:
>>
>>>> +
>>>> +/* { dg-final { scan-rtl-dump "deleting noop move" "combine" { target
>>>> aarch64*-*-* } } } */
>>>
>>> Any particular reason why it doesn't work for x86?
>> I don't think so.  I'm pretty sure Tejas is focused on ARM platforms for the
>> obvious reason.
>>
> 
> Then please add "i?86-*-* x86_64-*-*".

Hi,

I tried this test on x86_64. Though the same RTL gets generated

   (set (reg:Sf) (vec_select:SF (reg:V4Sf) (parallel [const 0]))

for -msse2, this optimization does not seem to trigger. Only later in a 
post-reload-split does it get eliminated to something like

    (set (reg:SF 21 xmm0) (reg:SF 21 xmm0))

I suspect simplify_subreg_regno () may not be returning what we want here - 
sorry, I don't know enough about x86 to debug deeper.

I could either keep this test case as is or if you could give it a quick look to 
see why it does not trigger, it would be useful to add x86 to this test.

Thanks,
Tejas.
H.J. Lu Dec. 5, 2013, 1:30 p.m. UTC | #7
On Thu, Dec 5, 2013 at 5:17 AM, Tejas Belagod <tbelagod@arm.com> wrote:
> H.J. Lu wrote:
>>
>> On Wed, Dec 4, 2013 at 9:29 AM, Jeff Law <law@redhat.com> wrote:
>>>
>>> On 12/04/13 09:14, H.J. Lu wrote:
>>>
>>>>> +
>>>>> +/* { dg-final { scan-rtl-dump "deleting noop move" "combine" { target
>>>>> aarch64*-*-* } } } */
>>>>
>>>>
>>>> Any particular reason why it doesn't work for x86?
>>>
>>> I don't think so.  I'm pretty sure Tejas is focused on ARM platforms for
>>> the
>>> obvious reason.
>>>
>>
>> Then please add "i?86-*-* x86_64-*-*".
>
>
> Hi,
>
> I tried this test on x86_64. Though the same RTL gets generated
>
>   (set (reg:Sf) (vec_select:SF (reg:V4Sf) (parallel [const 0]))
>
> for -msse2, this optimization does not seem to trigger. Only later in a
> post-reload-split does it get eliminated to something like
>
>    (set (reg:SF 21 xmm0) (reg:SF 21 xmm0))
>
> I suspect simplify_subreg_regno () may not be returning what we want here -
> sorry, I don't know enough about x86 to debug deeper.

Kirill, can you take a look why it doesn't work for x86?

> I could either keep this test case as is or if you could give it a quick
> look to see why it does not trigger, it would be useful to add x86 to this
> test.
>
> Thanks,
> Tejas.
>
Kirill Yukhin Dec. 5, 2013, 1:40 p.m. UTC | #8
Hello,
On 05 Dec 05:30, H.J. Lu wrote:
> Kirill, can you take a look why it doesn't work for x86?
Okay, I'll look at this.

--
Thanks, K
Tejas Belagod Dec. 5, 2013, 4:12 p.m. UTC | #9
Jeff Law wrote:
> On 12/04/13 09:06, Tejas Belagod wrote:
>> Richard Sandiford wrote:
>>> Tejas Belagod <tbelagod@arm.com> writes:
>>>> Richard Sandiford wrote:
>>>>> Tejas Belagod <tbelagod@arm.com> writes:
>>>>>>> The problem is that one reg rtx can span several hard registers.
>>>>>>> E.g. (reg:V4SI 32) might represent one 64-bit register (no. 32),
>>>>>>> but it might instead represent two 32-bit registers (nos. 32 and 33).
>>>>>>> Obviously the latter's not very likely for vectors this small,
>>>>>>> but more likely for larger ones (including on NEON IIRC).
>>>>>>>
>>>>>>> So if we had 2 32-bit registers being treated as a V4HI, it would be:
>>>>>>>
>>>>>>>    <--32--><--33-->
>>>>>>>    msb          lsb
>>>>>>>    0000111122223333
>>>>>>>    VVVVVVVV
>>>>>>>    00001111
>>>>>>>    msb  lsb
>>>>>>>    <--32-->
>>>>>>>
>>>>>>> for big endian and:
>>>>>>>
>>>>>>>    <--33--><--32-->
>>>>>>>    msb          lsb
>>>>>>>    3333222211110000
>>>>>>>            VVVVVVVV
>>>>>>>            11110000
>>>>>>>            msb  lsb
>>>>>>>            <--32-->
>>>>>>>
>>>>>>> for little endian.
>>>>>> Ah, ok, that makes things clearer. Thanks for that.
>>>>>>
>>>>>> I can't find any helper function that figures out if we're writing
>>>>>> partial or
>>>>>> full result regs. Would something like
>>>>>>
>>>>>>      REGNO (src) == REGNO (dst) &&
>>>>>>      HARD_REGNO_NREGS (src) == HARD_REGNO_NREGS (dst) == 1
>>>>>>
>>>>>> be a sane check for partial result regs?
>>>>> Yeah, that should work.  I think a more general alternative would be:
>>>>>
>>>>>   simplify_subreg_regno (REGNO (src), GET_MODE (src),
>>>>>                          offset, GET_MODE (dst)) == (int) REGNO (dst)
>>>>>
>>>>> where:
>>>>>
>>>>>   offset = GET_MODE_UNIT_SIZE (GET_MODE (src)) * INTVAL (XVECEXP
>>>>> (sel, 0))
>>>>>
>>>>> That offset is the byte offset of the first selected element from the
>>>>> start of a vector in memory, which is also the way that SUBREG_BYTEs
>>>>> are counted.  For little-endian it gives the offset of the lsb of the
>>>>> slice, while for big-endian it gives the offset of the msb (which is
>>>>> also how SUBREG_BYTEs work).
>>>>>
>>>>> The simplify_subreg_regno should cope with both single-register vectors
>>>>> and multi-register vectors.
>>>> Sorry for the delayed response to this.
>>>>
>>>> Thanks for the tip. Here's an improved patch that implements the
>>>> simplify_sureg_regno () method of eliminating redundant moves.
>>>> Regarding the test case, I failed to get the ppc back-end to generate
>>>> RTL pattern that this patch checks for. I can easily write a test
>>>> case for aarch64(big and little endian) on these lines
>>>>
>>>> typedef float float32x4_t __attribute__ ((__vector_size__ (16)));
>>>>
>>>> float foo_be (float32x4_t x)
>>>> {
>>>>    return x[3];
>>>> }
>>>>
>>>> float foo_le (float32x4_t x)
>>>> {
>>>>    return x[0];
>>>> }
>>>>
>>>> where I know that the vector indexing will generate a vec_select on
>>>> the same src and dst regs that could be optimized away and hence test
>>>> it. But I'm struggling to get a test case that the ppc altivec
>>>> back-end will generate such a vec_select for. I see that altivec does
>>>> not define vec_extract, so a simple indexing like this seems to happen
>>>> via memory. Also, I don't know enough about the ppc PCS or
>>>> architecture to write a test that will check for this optimization
>>>> opportunity on same src and dst hard-registers. Any hints?
>>> Me neither, sorry.
>>>
>>> FWIW, the MIPS tests:
>>>
>>>   typedef float float32x2_t __attribute__ ((__vector_size__ (8)));
>>>   void bar (float);
>>>   void foo_be (float32x2_t x) { bar (x[1]); }
>>>   void foo_le (float32x2_t x) { bar (x[0]); }
>>>
>>> also exercise it, but I don't think they add anything over the aarch64
>>> versions.  I can add them to the testsuite anyway if it helps though.
>>>
>>>> diff --git a/gcc/rtlanal.c b/gcc/rtlanal.c
>>>> index 0cd0c7e..ca25ce5 100644
>>>> --- a/gcc/rtlanal.c
>>>> +++ b/gcc/rtlanal.c
>>>> @@ -1180,6 +1180,22 @@ set_noop_p (const_rtx set)
>>>>        dst = SUBREG_REG (dst);
>>>>      }
>>>>
>>>> +  /* It is a NOOP if destination overlaps with selected src vector
>>>> +     elements.  */
>>>> +  if (GET_CODE (src) == VEC_SELECT
>>>> +      && REG_P (XEXP (src, 0)) && REG_P (dst)
>>>> +      && HARD_REGISTER_P (XEXP (src, 0))
>>>> +      && HARD_REGISTER_P (dst))
>>>> +    {
>>>> +      rtx par = XEXP (src, 1);
>>>> +      rtx src0 = XEXP (src, 0);
>>>> +      HOST_WIDE_INT offset =
>>>> +    GET_MODE_UNIT_SIZE (GET_MODE (src0)) * INTVAL (XVECEXP (par, 0,
>>>> 0));
>>>> +
>>>> +      return simplify_subreg_regno (REGNO (src0), GET_MODE (src0),
>>>> +                    offset, GET_MODE (dst)) == (int)REGNO (dst);
>>>> +    }
>>>> +
>>> Since this also (correctly) triggers for vector results, we need to keep
>>> the check for consecutive indices that you had originally.  (It's always
>>> the first index that should be used for the simplify_subreg_regno
>>> though.)
>>>
>>> Looks good to me otherwise, thanks.
>> Thanks Richard. Here is a revised patch. Sorry about the delay - I was
>> investigating to make sure an LRA ICE I was seeing on aarch64 was
>> unrelated to this patch. I've added a test case that I expect to pass
>> for aarch64. I've also added the tests that you suggested for MIPS, but
>> haven't checked for the target because I'm not sure what optimizations
>> happen on MIPS.
>>
>> OK for trunk?
>>
>> Thanks,
>> Tejas.
>>
>> 2013-12-04  Tejas Belagod  <tejas.belagod@arm.com>
>>
>> gcc/
>>      * rtlanal.c (set_noop_p): Return nonzero in case of redundant
>> vec_select
>>      for overlapping register lanes.
>>
>> testsuite/
>>      * config/gcc.dg/vect/vect-nop-move.c: New.
> Per HJ's request please test vect-nop-move on x86/x86_64 and if the 
> redundant move is properly eliminated, enable the test on those targets 
> (i?86-*-* x86_64-*-*).
> 
> Approved with that change for the trunk.

Thanks Jeff.

Now that Kirill's looking at why this doesn't work for x86, could I check this 
in without enabling vect-nop-move.c for targets (i?86-*-* x86_64-*-*)? If not, 
I'm happy to wait.

Thanks,
Tejas.
Jeff Law Dec. 5, 2013, 4:20 p.m. UTC | #10
On 12/05/13 09:12, Tejas Belagod wrote:

>
> Now that Kirill's looking at why this doesn't work for x86, could I
> check this in without enabling vect-nop-move.c for targets (i?86-*-*
> x86_64-*-*)? If not, I'm happy to wait.
Yea, that's fine with me.
jeff
Jakub Jelinek Dec. 5, 2013, 10:37 p.m. UTC | #11
On Wed, Dec 04, 2013 at 08:14:43AM -0800, H.J. Lu wrote:
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/vect/vect-nop-move.c
> > @@ -0,0 +1,64 @@
> > +/* { dg-do run } */
> > +/* { dg-require-effective-target vect_float } */
> > +/* { dg-options "-O3 -fdump-rtl-combine-details" } */

Please change dg-options to dg-additional-options, otherwise
it overrides the target basic vectorization options and thus
fails on i686-linux.

> > +/* { dg-final { scan-rtl-dump "deleting noop move" "combine" { target
> > aarch64*-*-* } } } */
> 
> Any particular reason why it doesn't work for x86?
> 
> > +/* { dg-final { cleanup-rtl-dump "combine" } } */

You also need to add

/* { dg-final { cleanup-tree-dump "vect" } } */

because all vectorizer tests dump *.vect dumps.

	Jakub
Tejas Belagod Dec. 6, 2013, 11:36 a.m. UTC | #12
Jakub Jelinek wrote:
> On Wed, Dec 04, 2013 at 08:14:43AM -0800, H.J. Lu wrote:
>>> --- /dev/null
>>> +++ b/gcc/testsuite/gcc.dg/vect/vect-nop-move.c
>>> @@ -0,0 +1,64 @@
>>> +/* { dg-do run } */
>>> +/* { dg-require-effective-target vect_float } */
>>> +/* { dg-options "-O3 -fdump-rtl-combine-details" } */
> 
> Please change dg-options to dg-additional-options, otherwise
> it overrides the target basic vectorization options and thus
> fails on i686-linux.

Sorry, OK I'll fix that. Also, curious to know what these fails look like - what 
default options is this overriding to cause the fail(eg. this test doesn't need 
a cost model)?

> 
>>> +/* { dg-final { scan-rtl-dump "deleting noop move" "combine" { target
>>> aarch64*-*-* } } } */
>> Any particular reason why it doesn't work for x86?
>>
>>> +/* { dg-final { cleanup-rtl-dump "combine" } } */
> 
> You also need to add
> 
> /* { dg-final { cleanup-tree-dump "vect" } } */

OK, will fix.

Thanks,
Tejas.
Kirill Yukhin Dec. 9, 2013, 6:49 a.m. UTC | #13
Hello,

On 05 Dec 16:40, Kirill Yukhin wrote:
> On 05 Dec 05:30, H.J. Lu wrote:
> > Kirill, can you take a look why it doesn't work for x86?
> Okay, I'll look at this.

I've looked at this. It seems that `CANNOT_CHANGE_MODE_CLASS'
is too conservative for x86.

In rtlanal.c we have `simplify_subreg_regno' which call target
hook `REG_CANNOT_CHANGE_MODE_P'. It takes only 3 arguments:
from mode, to mode and regclass.

Hook in x86 called `ix86_cannot_change_mode_class' and comment
says that we cannot change mode for nonzero offsets, which sounds
quite reasonable. That is why this hook returns `true' for this
tuple <V4SF, SF, FIRST_SSE_REG> and `simplify_subreg_regno'
prohibits simplification of that:
  (set (reg:SF 21 xmm0 [orig:86 D.1816 ] [86])
       (vec_select:SF (reg:V4SF 21 xmm0 [87])
          (parallel [(const_int 0 [0])])))

I think we can extend the hook and add `offset in frommode' to it.
We may set it to -1 for the cases where it is unknown and work
conservatively in the target hook.
For most cases offset is known and we could pass it to the hook.
This will require changes throughout all targets though.

Alternatively, we may introduce another target hook, say
`CANNOT_CHANGE_MODE_CLASS_OFFSET' with same args as
`CANNOT_CHANGE_MODE_CLASS' + offset and which will be defaulted to it.
For x86 (and possibly other targets) we'll implement this hook, which
will checko ffset.

What do you think?

--
Thanks, K
Tejas Belagod Dec. 9, 2013, 9:56 a.m. UTC | #14
Kirill Yukhin wrote:
> Hello,
> 
> On 05 Dec 16:40, Kirill Yukhin wrote:
>> On 05 Dec 05:30, H.J. Lu wrote:
>>> Kirill, can you take a look why it doesn't work for x86?
>> Okay, I'll look at this.
> 
> I've looked at this. It seems that `CANNOT_CHANGE_MODE_CLASS'
> is too conservative for x86.
> 
> In rtlanal.c we have `simplify_subreg_regno' which call target
> hook `REG_CANNOT_CHANGE_MODE_P'. It takes only 3 arguments:
> from mode, to mode and regclass.
> 
> Hook in x86 called `ix86_cannot_change_mode_class' and comment
> says that we cannot change mode for nonzero offsets, which sounds
> quite reasonable. That is why this hook returns `true' for this
> tuple <V4SF, SF, FIRST_SSE_REG> and `simplify_subreg_regno'
> prohibits simplification of that:
>   (set (reg:SF 21 xmm0 [orig:86 D.1816 ] [86])
>        (vec_select:SF (reg:V4SF 21 xmm0 [87])
>           (parallel [(const_int 0 [0])])))
> 
> I think we can extend the hook and add `offset in frommode' to it.
> We may set it to -1 for the cases where it is unknown and work
> conservatively in the target hook.
> For most cases offset is known and we could pass it to the hook.
> This will require changes throughout all targets though.
> 
> Alternatively, we may introduce another target hook, say
> `CANNOT_CHANGE_MODE_CLASS_OFFSET' with same args as
> `CANNOT_CHANGE_MODE_CLASS' + offset and which will be defaulted to it.
> For x86 (and possibly other targets) we'll implement this hook, which
> will checko ffset.
> 
> What do you think?
> 

I don't think CANNOT_CHANGE_MODE_CLASS has been designed with an intention to 
consider offsets. I thought all that magic about BYTE_OFFSET resolution into 
representable hardregs was done by subreg_get_info() where the 
info.representable is set to false if the BYTE_OFFSET of the subreg didn't map 
to a full hardreg. So if your (subreg:ymode (reg:xmode) off) maps to a full 
hardreg, simplify_subreg_regno should be returning the yregno automatically.

That's my understanding, please feel free to correct me if I'm incorrect.

Thanks,
Tejas.

> --
> Thanks, K
>
Richard Sandiford Dec. 9, 2013, noon UTC | #15
Tejas Belagod <tbelagod@arm.com> writes:
> Kirill Yukhin wrote:
>> Hello,
>> 
>> On 05 Dec 16:40, Kirill Yukhin wrote:
>>> On 05 Dec 05:30, H.J. Lu wrote:
>>>> Kirill, can you take a look why it doesn't work for x86?
>>> Okay, I'll look at this.
>> 
>> I've looked at this. It seems that `CANNOT_CHANGE_MODE_CLASS'
>> is too conservative for x86.
>> 
>> In rtlanal.c we have `simplify_subreg_regno' which call target
>> hook `REG_CANNOT_CHANGE_MODE_P'. It takes only 3 arguments:
>> from mode, to mode and regclass.
>> 
>> Hook in x86 called `ix86_cannot_change_mode_class' and comment
>> says that we cannot change mode for nonzero offsets, which sounds
>> quite reasonable. That is why this hook returns `true' for this
>> tuple <V4SF, SF, FIRST_SSE_REG> and `simplify_subreg_regno'
>> prohibits simplification of that:
>>   (set (reg:SF 21 xmm0 [orig:86 D.1816 ] [86])
>>        (vec_select:SF (reg:V4SF 21 xmm0 [87])
>>           (parallel [(const_int 0 [0])])))
>> 
>> I think we can extend the hook and add `offset in frommode' to it.
>> We may set it to -1 for the cases where it is unknown and work
>> conservatively in the target hook.
>> For most cases offset is known and we could pass it to the hook.
>> This will require changes throughout all targets though.
>> 
>> Alternatively, we may introduce another target hook, say
>> `CANNOT_CHANGE_MODE_CLASS_OFFSET' with same args as
>> `CANNOT_CHANGE_MODE_CLASS' + offset and which will be defaulted to it.
>> For x86 (and possibly other targets) we'll implement this hook, which
>> will checko ffset.
>> 
>> What do you think?
>> 
>
> I don't think CANNOT_CHANGE_MODE_CLASS has been designed with an
> intention to consider offsets. I thought all that magic about
> BYTE_OFFSET resolution into representable hardregs was done by
> subreg_get_info() where the info.representable is set to false if the
> BYTE_OFFSET of the subreg didn't map to a full hardreg. So if your
> (subreg:ymode (reg:xmode) off) maps to a full hardreg,
> simplify_subreg_regno should be returning the yregno automatically.

I agree.  A subreg only reduces to a single hard register if the subreg
logically refers to the low part of the hard register.  That's a target-
independent requirement so the hook shouldn't need to worry about it.

I'm just speculating, but maybe the problem is that this was traditionally
keyed off word size.  If a subreg is smaller than a word then it must
correspond to the low part of the containing word.  So if words
are 32 bits or wider, things like (subreg:QI (reg:SI X) 1) and
(subreg:QI (reg:SI X) 2) are always invalid, even for pseudo Xs.
But that doesn't stop things like (subreg:QI (reg:DI X) 4) on 32-bit
little-endian targets.  So we can run into trouble when dealing with
wider-than-word registers, since whether the byte offset is representable
depends on the class.  And things like IRA would need this to be trapped
at the class level, rather than just for specific hard registers.

If that was the problem though, it still sounds like something that could
be handled in a target-independent way, via things like class_max_nregs.

Thanks,
Richard
H.J. Lu Dec. 9, 2013, 1 p.m. UTC | #16
On Mon, Dec 9, 2013 at 1:56 AM, Tejas Belagod <tbelagod@arm.com> wrote:
> Kirill Yukhin wrote:
>>
>> Hello,
>>
>> On 05 Dec 16:40, Kirill Yukhin wrote:
>>>
>>> On 05 Dec 05:30, H.J. Lu wrote:
>>>>
>>>> Kirill, can you take a look why it doesn't work for x86?
>>>
>>> Okay, I'll look at this.
>>
>>
>> I've looked at this. It seems that `CANNOT_CHANGE_MODE_CLASS'
>> is too conservative for x86.
>>
>> In rtlanal.c we have `simplify_subreg_regno' which call target
>> hook `REG_CANNOT_CHANGE_MODE_P'. It takes only 3 arguments:
>> from mode, to mode and regclass.
>>
>> Hook in x86 called `ix86_cannot_change_mode_class' and comment
>> says that we cannot change mode for nonzero offsets, which sounds
>> quite reasonable. That is why this hook returns `true' for this
>> tuple <V4SF, SF, FIRST_SSE_REG> and `simplify_subreg_regno'
>> prohibits simplification of that:
>>   (set (reg:SF 21 xmm0 [orig:86 D.1816 ] [86])
>>        (vec_select:SF (reg:V4SF 21 xmm0 [87])
>>           (parallel [(const_int 0 [0])])))
>>
>> I think we can extend the hook and add `offset in frommode' to it.
>> We may set it to -1 for the cases where it is unknown and work
>> conservatively in the target hook.
>> For most cases offset is known and we could pass it to the hook.
>> This will require changes throughout all targets though.
>>
>> Alternatively, we may introduce another target hook, say
>> `CANNOT_CHANGE_MODE_CLASS_OFFSET' with same args as
>> `CANNOT_CHANGE_MODE_CLASS' + offset and which will be defaulted to it.
>> For x86 (and possibly other targets) we'll implement this hook, which
>> will checko ffset.
>>
>> What do you think?
>>
>
> I don't think CANNOT_CHANGE_MODE_CLASS has been designed with an intention
> to consider offsets. I thought all that magic about BYTE_OFFSET resolution
> into representable hardregs was done by subreg_get_info() where the
> info.representable is set to false if the BYTE_OFFSET of the subreg didn't
> map to a full hardreg. So if your (subreg:ymode (reg:xmode) off) maps to a
> full hardreg, simplify_subreg_regno should be returning the yregno
> automatically.
>
> That's my understanding, please feel free to correct me if I'm incorrect.

Kirill, ARM doesn't define HARD_REGNO_NREGS_HAS_PADDING
and i386 does.  Could that be the cause?
H.J. Lu Dec. 9, 2013, 1:48 p.m. UTC | #17
On Mon, Dec 9, 2013 at 5:00 AM, H.J. Lu <hjl.tools@gmail.com> wrote:
> On Mon, Dec 9, 2013 at 1:56 AM, Tejas Belagod <tbelagod@arm.com> wrote:
>> Kirill Yukhin wrote:
>>>
>>> Hello,
>>>
>>> On 05 Dec 16:40, Kirill Yukhin wrote:
>>>>
>>>> On 05 Dec 05:30, H.J. Lu wrote:
>>>>>
>>>>> Kirill, can you take a look why it doesn't work for x86?
>>>>
>>>> Okay, I'll look at this.
>>>
>>>
>>> I've looked at this. It seems that `CANNOT_CHANGE_MODE_CLASS'
>>> is too conservative for x86.
>>>
>>> In rtlanal.c we have `simplify_subreg_regno' which call target
>>> hook `REG_CANNOT_CHANGE_MODE_P'. It takes only 3 arguments:
>>> from mode, to mode and regclass.
>>>
>>> Hook in x86 called `ix86_cannot_change_mode_class' and comment
>>> says that we cannot change mode for nonzero offsets, which sounds
>>> quite reasonable. That is why this hook returns `true' for this
>>> tuple <V4SF, SF, FIRST_SSE_REG> and `simplify_subreg_regno'
>>> prohibits simplification of that:
>>>   (set (reg:SF 21 xmm0 [orig:86 D.1816 ] [86])
>>>        (vec_select:SF (reg:V4SF 21 xmm0 [87])
>>>           (parallel [(const_int 0 [0])])))
>>>
>>> I think we can extend the hook and add `offset in frommode' to it.
>>> We may set it to -1 for the cases where it is unknown and work
>>> conservatively in the target hook.
>>> For most cases offset is known and we could pass it to the hook.
>>> This will require changes throughout all targets though.
>>>
>>> Alternatively, we may introduce another target hook, say
>>> `CANNOT_CHANGE_MODE_CLASS_OFFSET' with same args as
>>> `CANNOT_CHANGE_MODE_CLASS' + offset and which will be defaulted to it.
>>> For x86 (and possibly other targets) we'll implement this hook, which
>>> will checko ffset.
>>>
>>> What do you think?
>>>
>>
>> I don't think CANNOT_CHANGE_MODE_CLASS has been designed with an intention
>> to consider offsets. I thought all that magic about BYTE_OFFSET resolution
>> into representable hardregs was done by subreg_get_info() where the
>> info.representable is set to false if the BYTE_OFFSET of the subreg didn't
>> map to a full hardreg. So if your (subreg:ymode (reg:xmode) off) maps to a
>> full hardreg, simplify_subreg_regno should be returning the yregno
>> automatically.
>>
>> That's my understanding, please feel free to correct me if I'm incorrect.
>
> Kirill, ARM doesn't define HARD_REGNO_NREGS_HAS_PADDING
> and i386 does.  Could that be the cause?
>

Nevermind. I don't think it does.
diff mbox

Patch

diff --git a/gcc/rtlanal.c b/gcc/rtlanal.c
index 0cd0c7e..e1388c8 100644
--- a/gcc/rtlanal.c
+++ b/gcc/rtlanal.c
@@ -1180,6 +1180,26 @@  set_noop_p (const_rtx set)
       dst = SUBREG_REG (dst);
     }
 
+  /* It is a NOOP if destination overlaps with selected src vector
+     elements.  */
+  if (GET_CODE (src) == VEC_SELECT
+      && REG_P (XEXP (src, 0)) && REG_P (dst)
+      && HARD_REGISTER_P (XEXP (src, 0))
+      && HARD_REGISTER_P (dst))
+    {
+      int i;
+      rtx par = XEXP (src, 1);
+      rtx src0 = XEXP (src, 0);
+      int c0 = INTVAL (XVECEXP (par, 0, 0));
+      HOST_WIDE_INT offset = GET_MODE_UNIT_SIZE (GET_MODE (src0)) * c0;
+
+      for (i = 1; i < XVECLEN (par, 0); i++)
+	if (INTVAL (XVECEXP (par, 0, i)) != c0 + i)
+	  return 0;
+      return simplify_subreg_regno (REGNO (src0), GET_MODE (src0),
+				    offset, GET_MODE (dst)) == (int)REGNO (dst);
+    }
+
   return (REG_P (src) && REG_P (dst)
 	  && REGNO (src) == REGNO (dst));
 }
diff --git a/gcc/testsuite/gcc.dg/vect/vect-nop-move.c b/gcc/testsuite/gcc.dg/vect/vect-nop-move.c
new file mode 100644
index 0000000..1941933
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-nop-move.c
@@ -0,0 +1,64 @@ 
+/* { dg-do run } */ 
+/* { dg-require-effective-target vect_float } */
+/* { dg-options "-O3 -fdump-rtl-combine-details" } */
+
+extern void abort (void);
+
+#define NOINLINE __attribute__((noinline))
+
+typedef float float32x4_t __attribute__ ((__vector_size__ (16)));
+typedef float float32x2_t __attribute__ ((__vector_size__ (8)));
+
+NOINLINE float
+foo32x4_be (float32x4_t x)
+{
+  return x[3];
+}
+
+NOINLINE float
+foo32x4_le (float32x4_t x)
+{
+  return x[0];
+}
+
+NOINLINE float
+bar (float a)
+{
+  return a;
+}
+
+NOINLINE float
+foo32x2_be (float32x2_t x)
+{
+  return bar (x[1]);
+}
+
+NOINLINE float
+foo32x2_le (float32x2_t x)
+{
+  return bar (x[0]);
+}
+
+int
+main()
+{
+  float32x4_t a = { 0.0f, 1.0f, 2.0f, 3.0f };
+  float32x2_t b = { 0.0f, 1.0f };
+
+  if (foo32x4_be (a) != 3.0f)
+    abort ();
+
+  if (foo32x4_le (a) != 0.0f)
+    abort ();
+
+  if (foo32x2_be (b) != 1.0f)
+    abort ();
+
+  if (foo32x2_le (b) != 0.0f)
+    abort ();
+
+  return 0;
+}
+
+/* { dg-final { scan-rtl-dump "deleting noop move" "combine" { target aarch64*-*-* } } } */
+/* { dg-final { cleanup-rtl-dump "combine" } } */