
AArch64: update vget_set_lane_1.c test output

Message ID patch-18238-tamar@arm.com
State New
Series AArch64: update vget_set_lane_1.c test output

Commit Message

Tamar Christina Feb. 1, 2024, 9:24 a.m. UTC
Hi All,

In the vget_set_lane_1.c test the following entries now generate a zip1 instead of an INS

BUILD_TEST (float32x2_t, float32x2_t, , , f32, 1, 0)
BUILD_TEST (int32x2_t,   int32x2_t,   , , s32, 1, 0)
BUILD_TEST (uint32x2_t,  uint32x2_t,  , , u32, 1, 0)

This is because, for the non-Q variants, lane inserts with indices 0 and 1 are just shuffling values.
There is no performance difference between a SIMD-to-SIMD INS and ZIP, so just update the
test file.
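For the 2-lane case the two instructions compute the same result, which is why the compiler is free to pick either. A plain-C sketch of the semantics (no NEON required; `v2si`, `ins_lane1` and `zip1_2s` are illustrative names, not the test's actual code):

```c
#include <stdint.h>

/* Plain-C model of the two instruction forms for a 2-lane 32-bit vector.
   Lane numbering follows AArch64 (lane 0 is the low lane).  */
typedef struct { int32_t s[2]; } v2si;

/* ins v0.s[1], v1.s[0]: overwrite lane 1 of a with lane 0 of b.  */
static v2si ins_lane1(v2si a, v2si b)
{
  a.s[1] = b.s[0];
  return a;
}

/* zip1 v0.2s, v0.2s, v1.2s: interleave the low lanes of a and b.  */
static v2si zip1_2s(v2si a, v2si b)
{
  v2si r = { { a.s[0], b.s[0] } };
  return r;
}
```

Both produce `{a[0], b[0]}`, so a vset_lane of lane 0 of one vector into lane 1 of another can legally be emitted as either instruction.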

Regtested on aarch64-none-linux-gnu and no issues.

Ok for master?

Thanks,
Tamar

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/vget_set_lane_1.c: Update test output.

--- inline copy of patch -- 
diff --git a/gcc/testsuite/gcc.target/aarch64/vget_set_lane_1.c b/gcc/testsuite/gcc.target/aarch64/vget_set_lane_1.c
index 07a77de319206c5c6dad1c0d2d9bcc998583f9c1..a3978f68e4ff5899f395a98615a5e86c3b1389cb 100644
--- a/gcc/testsuite/gcc.target/aarch64/vget_set_lane_1.c
+++ b/gcc/testsuite/gcc.target/aarch64/vget_set_lane_1.c
@@ -22,7 +22,7 @@ BUILD_TEST (uint16x4_t, uint16x4_t, , , u16, 3, 2)
 BUILD_TEST (float32x2_t, float32x2_t, , , f32, 1, 0)
 BUILD_TEST (int32x2_t,   int32x2_t,   , , s32, 1, 0)
 BUILD_TEST (uint32x2_t,  uint32x2_t,  , , u32, 1, 0)
-/* { dg-final { scan-assembler-times "ins\\tv0.s\\\[1\\\], v1.s\\\[0\\\]" 3 } } */
+/* { dg-final { scan-assembler-times "zip1\\tv0.2s, v0.2s, v1.2s" 3 } } */
 
 BUILD_TEST (poly8x8_t, poly8x16_t, , q, p8, 7, 15)
 BUILD_TEST (int8x8_t,  int8x16_t,  , q, s8, 7, 15)

Comments

Andrew Pinski Feb. 1, 2024, 9:48 a.m. UTC | #1
On Thu, Feb 1, 2024 at 1:26 AM Tamar Christina <tamar.christina@arm.com> wrote:
>
> Hi All,
>
> In the vget_set_lane_1.c test the following entries now generate a zip1 instead of an INS
>
> BUILD_TEST (float32x2_t, float32x2_t, , , f32, 1, 0)
> BUILD_TEST (int32x2_t,   int32x2_t,   , , s32, 1, 0)
> BUILD_TEST (uint32x2_t,  uint32x2_t,  , , u32, 1, 0)
>
> This is because the non-Q variant for indices 0 and 1 are just shuffling values.
> There is no perf difference between INS SIMD to SIMD and ZIP, as such just update the
> test file.
Hmm, is this true on all cores? I suspect there is a core out there
where INS is implemented with a much lower latency than ZIP.
If we look at config/aarch64/thunderx.md, we can see INS is 2 cycles
while ZIP is 6 cycles (3/7 for the q versions).
Now I don't have any vested interest in that core any more, but I
just wanted to point out that it is not exactly true for all cores.

> Regtested on aarch64-none-linux-gnu and no issues.
>
> Ok for master?

This is PR 112375 by the way.

Thanks,
Andrew Pinski

>
> Thanks,
> Tamar
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.target/aarch64/vget_set_lane_1.c: Update test output.
>
> --- inline copy of patch --
> diff --git a/gcc/testsuite/gcc.target/aarch64/vget_set_lane_1.c b/gcc/testsuite/gcc.target/aarch64/vget_set_lane_1.c
> index 07a77de319206c5c6dad1c0d2d9bcc998583f9c1..a3978f68e4ff5899f395a98615a5e86c3b1389cb 100644
> --- a/gcc/testsuite/gcc.target/aarch64/vget_set_lane_1.c
> +++ b/gcc/testsuite/gcc.target/aarch64/vget_set_lane_1.c
> @@ -22,7 +22,7 @@ BUILD_TEST (uint16x4_t, uint16x4_t, , , u16, 3, 2)
>  BUILD_TEST (float32x2_t, float32x2_t, , , f32, 1, 0)
>  BUILD_TEST (int32x2_t,   int32x2_t,   , , s32, 1, 0)
>  BUILD_TEST (uint32x2_t,  uint32x2_t,  , , u32, 1, 0)
> -/* { dg-final { scan-assembler-times "ins\\tv0.s\\\[1\\\], v1.s\\\[0\\\]" 3 } } */
> +/* { dg-final { scan-assembler-times "zip1\\tv0.2s, v0.2s, v1.2s" 3 } } */
>
>  BUILD_TEST (poly8x8_t, poly8x16_t, , q, p8, 7, 15)
>  BUILD_TEST (int8x8_t,  int8x16_t,  , q, s8, 7, 15)
>
>
>
>
> --
Richard Sandiford Feb. 1, 2024, 2:23 p.m. UTC | #2
Andrew Pinski <pinskia@gmail.com> writes:
> On Thu, Feb 1, 2024 at 1:26 AM Tamar Christina <tamar.christina@arm.com> wrote:
>>
>> Hi All,
>>
>> In the vget_set_lane_1.c test the following entries now generate a zip1 instead of an INS
>>
>> BUILD_TEST (float32x2_t, float32x2_t, , , f32, 1, 0)
>> BUILD_TEST (int32x2_t,   int32x2_t,   , , s32, 1, 0)
>> BUILD_TEST (uint32x2_t,  uint32x2_t,  , , u32, 1, 0)
>>
>> This is because the non-Q variant for indices 0 and 1 are just shuffling values.
>> There is no perf difference between INS SIMD to SIMD and ZIP, as such just update the
>> test file.
> Hmm, is this true on all cores? I suspect there is a core out there
> where INS is implemented with a much lower latency than ZIP.
> If we look at config/aarch64/thunderx.md, we can see INS is 2 cycles
> while ZIP is 6 cycles (3/7 for q versions).
> Now I don't have any invested interest in that core any more but I
> just wanted to point out that is not exactly true for all cores.

Thanks for the pointer.  In that case, perhaps we should prefer
aarch64_evpc_ins over aarch64_evpc_zip in aarch64_expand_vec_perm_const_1?
That's enough to fix this failure, but it'll probably require other
tests to be adjusted...
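The dispatch this refers to can be sketched with a toy first-match table (an illustration only, not GCC's actual code; the names and the `{0, 2}` lane encoding are invented for the example):

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model of trying permute recognizers in a fixed order, as in
   aarch64_expand_vec_perm_const_1: for a permute such as {0, 2}
   (which both INS and ZIP1 can implement on a 2-lane vector), the
   first matching entry in the table decides the instruction.  */
struct perm { int sel[2]; };   /* lane sources over the concat {a0,a1,b0,b1} */
typedef bool (*evpc_fn)(const struct perm *);

/* INS keeps one input vector and replaces a single lane of it.  */
static bool try_ins(const struct perm *d)
{
  return (d->sel[0] == 0 && d->sel[1] >= 2)   /* keep a[0], insert from b */
      || (d->sel[1] == 1 && d->sel[0] >= 2);  /* keep a[1], insert from b */
}

/* ZIP1 interleaves the low lanes: it only produces {0, 2}.  */
static bool try_zip1(const struct perm *d)
{
  return d->sel[0] == 0 && d->sel[1] == 2;
}

/* With INS tried before ZIP1 (the reordering suggested above), the
   {0, 2} permute comes out as "ins" rather than "zip1".  */
static const char *expand_vec_perm(const struct perm *d)
{
  static const struct { evpc_fn fn; const char *name; } table[] = {
    { try_ins, "ins" }, { try_zip1, "zip1" },
  };
  for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
    if (table[i].fn(d))
      return table[i].name;
  return "unsupported";
}
```

Swapping the two table entries flips the result for every permute both recognizers accept, which is why reordering the real recognizers would require adjusting other tests.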

Richard
Tamar Christina Feb. 1, 2024, 2:31 p.m. UTC | #3
> -----Original Message-----
> From: Richard Sandiford <richard.sandiford@arm.com>
> Sent: Thursday, February 1, 2024 2:24 PM
> To: Andrew Pinski <pinskia@gmail.com>
> Cc: Tamar Christina <Tamar.Christina@arm.com>; gcc-patches@gcc.gnu.org; nd
> <nd@arm.com>; Richard Earnshaw <Richard.Earnshaw@arm.com>; Marcus
> Shawcroft <Marcus.Shawcroft@arm.com>; Kyrylo Tkachov
> <Kyrylo.Tkachov@arm.com>
> Subject: Re: [PATCH]AArch64: update vget_set_lane_1.c test output
> 
> Andrew Pinski <pinskia@gmail.com> writes:
> > On Thu, Feb 1, 2024 at 1:26 AM Tamar Christina <tamar.christina@arm.com>
> wrote:
> >>
> >> Hi All,
> >>
> >> In the vget_set_lane_1.c test the following entries now generate a zip1 instead
> of an INS
> >>
> >> BUILD_TEST (float32x2_t, float32x2_t, , , f32, 1, 0)
> >> BUILD_TEST (int32x2_t,   int32x2_t,   , , s32, 1, 0)
> >> BUILD_TEST (uint32x2_t,  uint32x2_t,  , , u32, 1, 0)
> >>
> >> This is because the non-Q variant for indices 0 and 1 are just shuffling values.
> >> There is no perf difference between INS SIMD to SIMD and ZIP, as such just
> update the
> >> test file.
> > Hmm, is this true on all cores? I suspect there is a core out there
> > where INS is implemented with a much lower latency than ZIP.
> > If we look at config/aarch64/thunderx.md, we can see INS is 2 cycles
> > while ZIP is 6 cycles (3/7 for q versions).
> > Now I don't have any invested interest in that core any more but I
> > just wanted to point out that is not exactly true for all cores.
> 
> Thanks for the pointer.  In that case, perhaps we should prefer
> aarch64_evpc_ins over aarch64_evpc_zip in aarch64_expand_vec_perm_const_1?
> That's enough to fix this failure, but it'll probably require other
> tests to be adjusted...

I think, given that ThunderX is a 10-year-old micro-architecture with several cases where
often-used instructions have very high latencies, generic codegen should not be blocked
from progressing because of it.

We use ZIPs in many things, and if ThunderX codegen is really of that much importance then I
think the old codegen should be gated behind -mcpu=thunderx rather than preventing generic
changes.

Regards,
Tamar.

> 
> Richard
Richard Sandiford Feb. 1, 2024, 4:42 p.m. UTC | #4
Tamar Christina <Tamar.Christina@arm.com> writes:
>> -----Original Message-----
>> From: Richard Sandiford <richard.sandiford@arm.com>
>> Sent: Thursday, February 1, 2024 2:24 PM
>> To: Andrew Pinski <pinskia@gmail.com>
>> Cc: Tamar Christina <Tamar.Christina@arm.com>; gcc-patches@gcc.gnu.org; nd
>> <nd@arm.com>; Richard Earnshaw <Richard.Earnshaw@arm.com>; Marcus
>> Shawcroft <Marcus.Shawcroft@arm.com>; Kyrylo Tkachov
>> <Kyrylo.Tkachov@arm.com>
>> Subject: Re: [PATCH]AArch64: update vget_set_lane_1.c test output
>> 
>> Andrew Pinski <pinskia@gmail.com> writes:
>> > On Thu, Feb 1, 2024 at 1:26 AM Tamar Christina <tamar.christina@arm.com>
>> wrote:
>> >>
>> >> Hi All,
>> >>
>> >> In the vget_set_lane_1.c test the following entries now generate a zip1 instead
>> of an INS
>> >>
>> >> BUILD_TEST (float32x2_t, float32x2_t, , , f32, 1, 0)
>> >> BUILD_TEST (int32x2_t,   int32x2_t,   , , s32, 1, 0)
>> >> BUILD_TEST (uint32x2_t,  uint32x2_t,  , , u32, 1, 0)
>> >>
>> >> This is because the non-Q variant for indices 0 and 1 are just shuffling values.
>> >> There is no perf difference between INS SIMD to SIMD and ZIP, as such just
>> update the
>> >> test file.
>> > Hmm, is this true on all cores? I suspect there is a core out there
>> > where INS is implemented with a much lower latency than ZIP.
>> > If we look at config/aarch64/thunderx.md, we can see INS is 2 cycles
>> > while ZIP is 6 cycles (3/7 for q versions).
>> > Now I don't have any invested interest in that core any more but I
>> > just wanted to point out that is not exactly true for all cores.
>> 
>> Thanks for the pointer.  In that case, perhaps we should prefer
>> aarch64_evpc_ins over aarch64_evpc_zip in aarch64_expand_vec_perm_const_1?
>> That's enough to fix this failure, but it'll probably require other
>> tests to be adjusted...
>
> I think given that Thundex-X is a 10 year old micro-architecture that is several cases where
> often used instructions have very high latencies that generic codegen should not be blocked
> from progressing because of it.
>
> we use zips in many things and if thunderx codegen is really of that much importance then I
> think the old codegen should be gated behind -mcpu=thunderx rather than preventing generic
> changes.

But you said there was no perf difference between INS and ZIP, so it
sounds like for all known cases, using INS rather than ZIP is either
neutral or better.

There's also the possible secondary benefit that the INS patterns use
standard RTL operations whereas the ZIP patterns use unspecs.

Keeping ZIP seems OK if there's a specific reason to prefer it over INS for
more modern cores, though.

Thanks,
Richard
Andrew Pinski Feb. 2, 2024, 1:16 a.m. UTC | #5
On Thu, Feb 1, 2024 at 8:42 AM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Tamar Christina <Tamar.Christina@arm.com> writes:
> >> -----Original Message-----
> >> From: Richard Sandiford <richard.sandiford@arm.com>
> >> Sent: Thursday, February 1, 2024 2:24 PM
> >> To: Andrew Pinski <pinskia@gmail.com>
> >> Cc: Tamar Christina <Tamar.Christina@arm.com>; gcc-patches@gcc.gnu.org; nd
> >> <nd@arm.com>; Richard Earnshaw <Richard.Earnshaw@arm.com>; Marcus
> >> Shawcroft <Marcus.Shawcroft@arm.com>; Kyrylo Tkachov
> >> <Kyrylo.Tkachov@arm.com>
> >> Subject: Re: [PATCH]AArch64: update vget_set_lane_1.c test output
> >>
> >> Andrew Pinski <pinskia@gmail.com> writes:
> >> > On Thu, Feb 1, 2024 at 1:26 AM Tamar Christina <tamar.christina@arm.com>
> >> wrote:
> >> >>
> >> >> Hi All,
> >> >>
> >> >> In the vget_set_lane_1.c test the following entries now generate a zip1 instead
> >> of an INS
> >> >>
> >> >> BUILD_TEST (float32x2_t, float32x2_t, , , f32, 1, 0)
> >> >> BUILD_TEST (int32x2_t,   int32x2_t,   , , s32, 1, 0)
> >> >> BUILD_TEST (uint32x2_t,  uint32x2_t,  , , u32, 1, 0)
> >> >>
> >> >> This is because the non-Q variant for indices 0 and 1 are just shuffling values.
> >> >> There is no perf difference between INS SIMD to SIMD and ZIP, as such just
> >> update the
> >> >> test file.
> >> > Hmm, is this true on all cores? I suspect there is a core out there
> >> > where INS is implemented with a much lower latency than ZIP.
> >> > If we look at config/aarch64/thunderx.md, we can see INS is 2 cycles
> >> > while ZIP is 6 cycles (3/7 for q versions).
> >> > Now I don't have any invested interest in that core any more but I
> >> > just wanted to point out that is not exactly true for all cores.
> >>
> >> Thanks for the pointer.  In that case, perhaps we should prefer
> >> aarch64_evpc_ins over aarch64_evpc_zip in aarch64_expand_vec_perm_const_1?
> >> That's enough to fix this failure, but it'll probably require other
> >> tests to be adjusted...
> >
> > I think given that Thundex-X is a 10 year old micro-architecture that is several cases where
> > often used instructions have very high latencies that generic codegen should not be blocked
> > from progressing because of it.
> >
> > we use zips in many things and if thunderx codegen is really of that much importance then I
> > think the old codegen should be gated behind -mcpu=thunderx rather than preventing generic
> > changes.
>
> But you said there was no perf difference between INS and ZIP, so it
> sounds like for all known cases, using INS rather than ZIP is either
> neutral or better.
>
> There's also the possible secondary benefit that the INS patterns use
> standard RTL operations whereas the ZIP patterns use unspecs.
>
> Keeping ZIP seems OK there's a specific reason to prefer it over INS for
> more modern cores though.

This is why I wrote in the bug report:
But maybe this needs some better cost mechanism, because right now, even
though we might be able to do an INS, ZIP might be detected first due to
re-encoding.

Maybe for GCC 15, we could rewrite the ZIP patterns not to use an unspec
for V2SI and V2DI, since in those cases it is a VEC_CONCAT (I think).
Anyway, I suspect ThunderX (88xx and OcteonTX1) is no longer that
important, but I don't speak for Marvell any more. And on the cores I care
about, ZIP and INS have similar characteristics, so using either of
them is OK.

Thanks,
Andrew Pinski

>
> Thanks,
> Richard
>
Tamar Christina Feb. 15, 2024, 8:30 a.m. UTC | #6
> -----Original Message-----
> From: Richard Sandiford <richard.sandiford@arm.com>
> Sent: Thursday, February 1, 2024 4:42 PM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: Andrew Pinski <pinskia@gmail.com>; gcc-patches@gcc.gnu.org; nd
> <nd@arm.com>; Richard Earnshaw <Richard.Earnshaw@arm.com>; Marcus
> Shawcroft <Marcus.Shawcroft@arm.com>; Kyrylo Tkachov
> <Kyrylo.Tkachov@arm.com>
> Subject: Re: [PATCH]AArch64: update vget_set_lane_1.c test output
> 
> Tamar Christina <Tamar.Christina@arm.com> writes:
> >> -----Original Message-----
> >> From: Richard Sandiford <richard.sandiford@arm.com>
> >> Sent: Thursday, February 1, 2024 2:24 PM
> >> To: Andrew Pinski <pinskia@gmail.com>
> >> Cc: Tamar Christina <Tamar.Christina@arm.com>; gcc-patches@gcc.gnu.org; nd
> >> <nd@arm.com>; Richard Earnshaw <Richard.Earnshaw@arm.com>; Marcus
> >> Shawcroft <Marcus.Shawcroft@arm.com>; Kyrylo Tkachov
> >> <Kyrylo.Tkachov@arm.com>
> >> Subject: Re: [PATCH]AArch64: update vget_set_lane_1.c test output
> >>
> >> Andrew Pinski <pinskia@gmail.com> writes:
> >> > On Thu, Feb 1, 2024 at 1:26 AM Tamar Christina <tamar.christina@arm.com>
> >> wrote:
> >> >>
> >> >> Hi All,
> >> >>
> >> >> In the vget_set_lane_1.c test the following entries now generate a zip1
> instead
> >> of an INS
> >> >>
> >> >> BUILD_TEST (float32x2_t, float32x2_t, , , f32, 1, 0)
> >> >> BUILD_TEST (int32x2_t,   int32x2_t,   , , s32, 1, 0)
> >> >> BUILD_TEST (uint32x2_t,  uint32x2_t,  , , u32, 1, 0)
> >> >>
> >> >> This is because the non-Q variant for indices 0 and 1 are just shuffling values.
> >> >> There is no perf difference between INS SIMD to SIMD and ZIP, as such just
> >> update the
> >> >> test file.
> >> > Hmm, is this true on all cores? I suspect there is a core out there
> >> > where INS is implemented with a much lower latency than ZIP.
> >> > If we look at config/aarch64/thunderx.md, we can see INS is 2 cycles
> >> > while ZIP is 6 cycles (3/7 for q versions).
> >> > Now I don't have any invested interest in that core any more but I
> >> > just wanted to point out that is not exactly true for all cores.
> >>
> >> Thanks for the pointer.  In that case, perhaps we should prefer
> >> aarch64_evpc_ins over aarch64_evpc_zip in
> aarch64_expand_vec_perm_const_1?
> >> That's enough to fix this failure, but it'll probably require other
> >> tests to be adjusted...
> >
> > I think given that Thundex-X is a 10 year old micro-architecture that is several
> cases where
> > often used instructions have very high latencies that generic codegen should not
> be blocked
> > from progressing because of it.
> >
> > we use zips in many things and if thunderx codegen is really of that much
> importance then I
> > think the old codegen should be gated behind -mcpu=thunderx rather than
> preventing generic
> > changes.
> 
> But you said there was no perf difference between INS and ZIP, so it
> sounds like for all known cases, using INS rather than ZIP is either
> neutral or better.
> 
> There's also the possible secondary benefit that the INS patterns use
> standard RTL operations whereas the ZIP patterns use unspecs.
> 
> Keeping ZIP seems OK there's a specific reason to prefer it over INS for
> more modern cores though.

Ok, that's a fair point.  Doing some due diligence, the Neoverse-E1 and
Cortex-A65 SWoGs seem to imply that their ZIPs have better throughput
than their INSs. However, the entries are inconsistent and I can't measure the
difference, so I believe this to be a documentation bug.

That said, switching the operands shows one issue: preferring
INS degrades code in cases where we are inserting the top bits of the first
parameter into the bottom of the second parameter and returning the result.

ZIP, being a three-operand instruction, allows us to put the result into the final
destination register with one operation, whereas INS requires an fmov:

foo_uzp1_s32:
        ins     v0.s[1], v1.s[0]
        fmov    d0, d0
        ret
foo_uzp2_s32:
        ins     v1.s[0], v0.s[1]
        fmov    d0, d1
        ret

I've shown the uzp case, but zip has the same issue.

So I guess flipping the order isn't better; perhaps I should instead add a case to
the zip/unzip RTL patterns for when op0 == op1?
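The register-allocation point behind the fmov can be modelled in plain C (illustrative names; `v2si` and the two functions are not the test's actual code):

```c
#include <stdint.h>

/* uzp2 is a three-operand instruction, so the odd lanes of a and b can
   be written straight into the destination register, while the INS form
   mutates one of its inputs, leaving the result in the "wrong" register
   (hence the extra fmov d0, d1 in foo_uzp2_s32 above).  */
typedef struct { int32_t s[2]; } v2si;

/* uzp2 v0.2s, v0.2s, v1.2s: result goes directly to the destination.  */
static v2si uzp2_2s(v2si a, v2si b)
{
  v2si r = { { a.s[1], b.s[1] } };
  return r;
}

/* ins v1.s[0], v0.s[1]: the result is built in b ("v1"), so returning it
   models the fmov needed to move it into the ABI return register.  */
static v2si uzp2_via_ins(v2si a, v2si b)
{
  b.s[0] = a.s[1];
  return b;
}
```

Both compute `{a[1], b[1]}`; the difference is purely which register holds the result at the end, which is what costs the extra move in the INS form.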

Thanks,
Tamar
> 
> Thanks,
> Richard
Richard Sandiford Feb. 20, 2024, 10:30 a.m. UTC | #7
Tamar Christina <Tamar.Christina@arm.com> writes:
>> -----Original Message-----
>> From: Richard Sandiford <richard.sandiford@arm.com>
>> Sent: Thursday, February 1, 2024 4:42 PM
>> To: Tamar Christina <Tamar.Christina@arm.com>
>> Cc: Andrew Pinski <pinskia@gmail.com>; gcc-patches@gcc.gnu.org; nd
>> <nd@arm.com>; Richard Earnshaw <Richard.Earnshaw@arm.com>; Marcus
>> Shawcroft <Marcus.Shawcroft@arm.com>; Kyrylo Tkachov
>> <Kyrylo.Tkachov@arm.com>
>> Subject: Re: [PATCH]AArch64: update vget_set_lane_1.c test output
>> 
>> Tamar Christina <Tamar.Christina@arm.com> writes:
>> >> -----Original Message-----
>> >> From: Richard Sandiford <richard.sandiford@arm.com>
>> >> Sent: Thursday, February 1, 2024 2:24 PM
>> >> To: Andrew Pinski <pinskia@gmail.com>
>> >> Cc: Tamar Christina <Tamar.Christina@arm.com>; gcc-patches@gcc.gnu.org; nd
>> >> <nd@arm.com>; Richard Earnshaw <Richard.Earnshaw@arm.com>; Marcus
>> >> Shawcroft <Marcus.Shawcroft@arm.com>; Kyrylo Tkachov
>> >> <Kyrylo.Tkachov@arm.com>
>> >> Subject: Re: [PATCH]AArch64: update vget_set_lane_1.c test output
>> >>
>> >> Andrew Pinski <pinskia@gmail.com> writes:
>> >> > On Thu, Feb 1, 2024 at 1:26 AM Tamar Christina <tamar.christina@arm.com>
>> >> wrote:
>> >> >>
>> >> >> Hi All,
>> >> >>
>> >> >> In the vget_set_lane_1.c test the following entries now generate a zip1
>> instead
>> >> of an INS
>> >> >>
>> >> >> BUILD_TEST (float32x2_t, float32x2_t, , , f32, 1, 0)
>> >> >> BUILD_TEST (int32x2_t,   int32x2_t,   , , s32, 1, 0)
>> >> >> BUILD_TEST (uint32x2_t,  uint32x2_t,  , , u32, 1, 0)
>> >> >>
>> >> >> This is because the non-Q variant for indices 0 and 1 are just shuffling values.
>> >> >> There is no perf difference between INS SIMD to SIMD and ZIP, as such just
>> >> update the
>> >> >> test file.
>> >> > Hmm, is this true on all cores? I suspect there is a core out there
>> >> > where INS is implemented with a much lower latency than ZIP.
>> >> > If we look at config/aarch64/thunderx.md, we can see INS is 2 cycles
>> >> > while ZIP is 6 cycles (3/7 for q versions).
>> >> > Now I don't have any invested interest in that core any more but I
>> >> > just wanted to point out that is not exactly true for all cores.
>> >>
>> >> Thanks for the pointer.  In that case, perhaps we should prefer
>> >> aarch64_evpc_ins over aarch64_evpc_zip in
>> aarch64_expand_vec_perm_const_1?
>> >> That's enough to fix this failure, but it'll probably require other
>> >> tests to be adjusted...
>> >
>> > I think given that Thundex-X is a 10 year old micro-architecture that is several
>> cases where
>> > often used instructions have very high latencies that generic codegen should not
>> be blocked
>> > from progressing because of it.
>> >
>> > we use zips in many things and if thunderx codegen is really of that much
>> importance then I
>> > think the old codegen should be gated behind -mcpu=thunderx rather than
>> preventing generic
>> > changes.
>> 
>> But you said there was no perf difference between INS and ZIP, so it
>> sounds like for all known cases, using INS rather than ZIP is either
>> neutral or better.
>> 
>> There's also the possible secondary benefit that the INS patterns use
>> standard RTL operations whereas the ZIP patterns use unspecs.
>> 
>> Keeping ZIP seems OK there's a specific reason to prefer it over INS for
>> more modern cores though.
>
> Ok, that's a fair point.  Doing some due diligence, Neoverse-E1 and
> Cortex-A65 SWoGs seem to imply that there ZIPs have better throughput
> than INSs. However the entries are inconsistent and I can't measure the
> difference so I believe this to be a documentation bug.
>
> That said, switching the operands seems to show one issue in that preferring
> INS degenerates code in cases where we are inserting the top bits of the first
> parameter into the bottom of the second parameter and returning,
>
> Zip being a Three operand instruction allows us to put the result into the final
> destination register with one operation whereas INS requires an fmov:
>
> foo_uzp1_s32:
>         ins     v0.s[1], v1.s[0]
>         fmov    d0, d0
>         ret
> foo_uzp2_s32:
>         ins     v1.s[0], v0.s[1]
>         fmov    d0, d1
>         ret

Ah, yeah, I should have thought about that.

In that case, the original patch is OK, thanks.

Richard

Patch

--- a/gcc/testsuite/gcc.target/aarch64/vget_set_lane_1.c
+++ b/gcc/testsuite/gcc.target/aarch64/vget_set_lane_1.c
@@ -22,7 +22,7 @@  BUILD_TEST (uint16x4_t, uint16x4_t, , , u16, 3, 2)
 BUILD_TEST (float32x2_t, float32x2_t, , , f32, 1, 0)
 BUILD_TEST (int32x2_t,   int32x2_t,   , , s32, 1, 0)
 BUILD_TEST (uint32x2_t,  uint32x2_t,  , , u32, 1, 0)
-/* { dg-final { scan-assembler-times "ins\\tv0.s\\\[1\\\], v1.s\\\[0\\\]" 3 } } */
+/* { dg-final { scan-assembler-times "zip1\\tv0.2s, v0.2s, v1.2s" 3 } } */
 
 BUILD_TEST (poly8x8_t, poly8x16_t, , q, p8, 7, 15)
 BUILD_TEST (int8x8_t,  int8x16_t,  , q, s8, 7, 15)