[AArch64] Implement usadv16qi and ssadv16qi standard names

Message ID 5AF99160.6080802@foss.arm.com
State New
Series
  • [AArch64] Implement usadv16qi and ssadv16qi standard names

Commit Message

Kyrill Tkachov May 14, 2018, 1:38 p.m.
Hi all,

This patch implements the usadv16qi and ssadv16qi standard names.
See the thread on gcc@gcc.gnu.org [1] for background.

The V16QImode variant is important to get right as it is the most commonly used pattern:
reducing vectors of bytes into an int.
The midend expects the optab to compute the absolute differences of operands 1 and 2 and
reduce them while widening along the way up to SImode. So the inputs are V16QImode and
the output is V4SImode.

I've tried out a few different strategies for that; the one I settled on is to emit:
UABDL2    tmp.8h, op1.16b, op2.16b
UABAL    tmp.8h, op1.16b, op2.16b
UADALP    op3.4s, tmp.8h

To work through the semantics let's say operands 1 and 2 are:
op1 { a0, a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, a14, a15 }
op2 { b0, b1, b2, b3, b4, b5, b6, b7, b8, b9, b10, b11, b12, b13, b14, b15 }
op3 { c0, c1, c2, c3 }

The UABDL2 takes the upper V8QI elements, computes their absolute differences, widens them and stores them into the V8HImode tmp:

tmp { ABS(a[8]-b[8]), ABS(a[9]-b[9]), ABS(a[10]-b[10]), ABS(a[11]-b[11]), ABS(a[12]-b[12]), ABS(a[13]-b[13]), ABS(a[14]-b[14]), ABS(a[15]-b[15]) }

The UABAL after that takes the lower V8QI elements, computes their absolute differences, widens them and accumulates them into the V8HImode tmp from the previous step:

tmp { ABS(a[8]-b[8])+ABS (a[0]-b[0]), ABS(a[9]-b[9])+ABS(a[1]-b[1]), ABS(a[10]-b[10])+ABS(a[2]-b[2]), ABS(a[11]-b[11])+ABS(a[3]-b[3]), ABS(a[12]-b[12])+ABS(a[4]-b[4]), ABS(a[13]-b[13])+ABS(a[5]-b[5]), ABS(a[14]-b[14])+ABS(a[6]-b[6]), ABS(a[15]-b[15])+ABS(a[7]-b[7]) }

Finally the UADALP does a pairwise widening reduction and accumulation into the V4SImode op3:
op3 { c0+ABS(a[8]-b[8])+ABS(a[0]-b[0])+ABS(a[9]-b[9])+ABS(a[1]-b[1]), c1+ABS(a[10]-b[10])+ABS(a[2]-b[2])+ABS(a[11]-b[11])+ABS(a[3]-b[3]), c2+ABS(a[12]-b[12])+ABS(a[4]-b[4])+ABS(a[13]-b[13])+ABS(a[5]-b[5]), c3+ABS(a[14]-b[14])+ABS(a[6]-b[6])+ABS(a[15]-b[15])+ABS(a[7]-b[7]) }
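To make the lane arithmetic above concrete, here is a scalar C model of the three-instruction sequence (the helper names are mine and purely illustrative; this models the instruction semantics, it is not GCC code):

```c
#include <assert.h>
#include <stdlib.h>  /* abs */

/* UABDL2: absolute differences of the upper eight bytes, widened to 16 bits.  */
static void
uabdl2_step (unsigned short tmp[8],
             const unsigned char a[16], const unsigned char b[16])
{
  for (int i = 0; i < 8; i++)
    tmp[i] = (unsigned short) abs (a[8 + i] - b[8 + i]);
}

/* UABAL: absolute differences of the lower eight bytes, widened and
   accumulated into tmp.  */
static void
uabal_step (unsigned short tmp[8],
            const unsigned char a[16], const unsigned char b[16])
{
  for (int i = 0; i < 8; i++)
    tmp[i] += (unsigned short) abs (a[i] - b[i]);
}

/* UADALP: pairwise widening add of tmp, accumulated into the V4SI op3.  */
static void
uadalp_step (unsigned int op3[4], const unsigned short tmp[8])
{
  for (int i = 0; i < 4; i++)
    op3[i] += (unsigned int) tmp[2 * i] + tmp[2 * i + 1];
}

/* The whole expansion: op0 = op3 accumulated with the sum of absolute
   differences of op1 and op2, kept as a V4SI vector.  */
static void
usadv16qi_model (unsigned int op0[4],
                 const unsigned char op1[16], const unsigned char op2[16],
                 const unsigned int op3[4])
{
  unsigned short tmp[8];
  for (int i = 0; i < 4; i++)
    op0[i] = op3[i];
  uabdl2_step (tmp, op1, op2);
  uabal_step (tmp, op1, op2);
  uadalp_step (op0, tmp);
}
```

Summing the four S lanes of op0 then gives exactly the scalar sum of absolute differences plus the incoming accumulator.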

(sorry for the text dump)

Remember, according to [1] the exact reduction sequence doesn't matter (for integer arithmetic at least).
I've considered other sequences as well (thanks Wilco), for example
* UABD + UADDLP + UADALP
* UABDL2 + UABDL + UADALP + UADALP

I ended up settling on the sequence in this patch as it's short (3 instructions) and in the future we can potentially
look to optimise multiple occurrences of these into something even faster (for example, accumulating into H registers for longer
before doing a single UADALP at the end to accumulate into the final S register).
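The first alternative can be modelled the same way in scalar C (illustrative helper names, not GCC code). UABD + UADDLP pairs adjacent lanes where the chosen sequence pairs lane k with lane k+8, but the total across the four S lanes is identical, which is the point of [1]: the exact reduction order doesn't matter for integer arithmetic.

```c
#include <assert.h>
#include <stdlib.h>  /* abs */

/* UABD: lane-wise absolute difference over all 16 bytes.  */
static void
uabd_step (unsigned char d[16],
           const unsigned char a[16], const unsigned char b[16])
{
  for (int i = 0; i < 16; i++)
    d[i] = (unsigned char) abs (a[i] - b[i]);
}

/* UADDLP: pairwise widening add, V16QI -> V8HI.  */
static void
uaddlp_step (unsigned short h[8], const unsigned char d[16])
{
  for (int i = 0; i < 8; i++)
    h[i] = (unsigned short) (d[2 * i] + d[2 * i + 1]);
}

/* UADALP: pairwise widening add-accumulate, V8HI -> V4SI.  */
static void
uadalp_acc (unsigned int s[4], const unsigned short h[8])
{
  for (int i = 0; i < 4; i++)
    s[i] += (unsigned int) h[2 * i] + h[2 * i + 1];
}
```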

If your microarchitecture has some strong preferences for a particular sequence, please let me know or, even better, propose a patch
to parametrise the generation sequence by code (or the appropriate RTX cost).


This expansion allows the vectoriser to avoid unpacking the bytes in two steps and performing V4SI arithmetic on them.
So, for the code:

unsigned char pix1[N], pix2[N];

int foo (void)
{
   int i_sum = 0;
   int i;

   for (i = 0; i < 16; i++)
     i_sum += __builtin_abs (pix1[i] - pix2[i]);

   return i_sum;
}

we now generate on aarch64:
foo:
         adrp    x1, pix1
         add     x1, x1, :lo12:pix1
         movi    v0.4s, 0
         adrp    x0, pix2
         add     x0, x0, :lo12:pix2
         ldr     q2, [x1]
         ldr     q3, [x0]
         uabdl2  v1.8h, v2.16b, v3.16b
         uabal   v1.8h, v2.8b, v3.8b
         uadalp  v0.4s, v1.8h
         addv    s0, v0.4s
         umov    w0, v0.s[0]
         ret


instead of:
foo:
         adrp    x1, pix1
         adrp    x0, pix2
         add     x1, x1, :lo12:pix1
         add     x0, x0, :lo12:pix2
         ldr     q0, [x1]
         ldr     q4, [x0]
         ushll   v1.8h, v0.8b, 0
         ushll2  v0.8h, v0.16b, 0
         ushll   v2.8h, v4.8b, 0
         ushll2  v4.8h, v4.16b, 0
         usubl   v3.4s, v1.4h, v2.4h
         usubl2  v1.4s, v1.8h, v2.8h
         usubl   v2.4s, v0.4h, v4.4h
         usubl2  v0.4s, v0.8h, v4.8h
         abs     v3.4s, v3.4s
         abs     v1.4s, v1.4s
         abs     v2.4s, v2.4s
         abs     v0.4s, v0.4s
         add     v1.4s, v3.4s, v1.4s
         add     v1.4s, v2.4s, v1.4s
         add     v0.4s, v0.4s, v1.4s
         addv    s0, v0.4s
         umov    w0, v0.s[0]
         ret

So I expect this new expansion to be better than the status quo in any case.
Bootstrapped and tested on aarch64-none-linux-gnu.
This gives an improvement of about 8% on 525.x264_r from SPEC2017 on a Cortex-A72.

Ok for trunk?

Thanks,
Kyrill

[1] https://gcc.gnu.org/ml/gcc/2018-05/msg00070.html


2018-05-11  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>

     * config/aarch64/aarch64.md ("unspec"): Define UNSPEC_SABAL,
     UNSPEC_SABDL2, UNSPEC_SADALP, UNSPEC_UABAL, UNSPEC_UABDL2,
     UNSPEC_UADALP values.
     * config/aarch64/iterators.md (ABAL): New int iterator.
     (ABDL2): Likewise.
     (ADALP): Likewise.
     (sur): Add mappings for the above.
     * config/aarch64/aarch64-simd.md (aarch64_<sur>abdl2<mode>_3):
     New define_insn.
     (aarch64_<sur>abal<mode>_4): Likewise.
     (aarch64_<sur>adalp<mode>_3): Likewise.
     (<sur>sadv16qi): New define_expand.

2018-05-11  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>

     * gcc.c-torture/execute/ssad-run.c: New test.
     * gcc.c-torture/execute/usad-run.c: Likewise.
     * gcc.target/aarch64/ssadv16qi.c: Likewise.
     * gcc.target/aarch64/usadv16qi.c: Likewise.

Comments

Kyrill Tkachov May 15, 2018, 8:16 a.m. | #1
I realised I had forgotten to copy the maintainers...

https://gcc.gnu.org/ml/gcc-patches/2018-05/msg00613.html

Thanks,
Kyrill

On 14/05/18 14:38, Kyrill Tkachov wrote:
> [original message and ChangeLog quoted in full; trimmed]
James Greenhalgh May 19, 2018, 1:09 a.m. | #2
On Mon, May 14, 2018 at 08:38:40AM -0500, Kyrill Tkachov wrote:
> [original message quoted in full; trimmed]
> Ok for trunk?

You don't say it explicitly here, but I presume the mid-end takes care of
zeroing the accumulator register before the loop (i.e. op3 in your sequence
in aarch64-simd.md)?

If so, looks good to me.

Ok for trunk.

By the way, now you have the patterns, presumably you could also wire them
up in arm_neon.h

Thanks for the patch!

James


Kyrill Tkachov May 21, 2018, 11:32 a.m. | #3
On 19/05/18 02:09, James Greenhalgh wrote:
> On Mon, May 14, 2018 at 08:38:40AM -0500, Kyrill Tkachov wrote:
>> [original message quoted in full; trimmed]
>> Ok for trunk?
> You don't say it explicitly here, but I presume the mid-end takes care of
> zeroing the accumulator register before the loop (i.e. op3 in your sequence
> in aarch64-simd.md)?

Yes, the midend takes care of zeroing the accumulator register
and doing the full reduction at the end of the loop.
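A portable C sketch of that division of labour (a behavioural model only; the usadv16qi body below mirrors the three-instruction expansion and all names are illustrative):

```c
#include <assert.h>
#include <stdlib.h>  /* abs */

/* Scalar model of the usadv16qi optab: acc += SAD (a, b), kept as V4SI.  */
static void
usadv16qi (unsigned int acc[4],
           const unsigned char a[16], const unsigned char b[16])
{
  unsigned short tmp[8];
  for (int i = 0; i < 8; i++)           /* uabdl2 + uabal */
    tmp[i] = (unsigned short) (abs (a[8 + i] - b[8 + i]) + abs (a[i] - b[i]));
  for (int i = 0; i < 4; i++)           /* uadalp */
    acc[i] += (unsigned int) tmp[2 * i] + tmp[2 * i + 1];
}

/* Shape of the vectorised loop: the midend zeroes the V4SI accumulator
   once before the loop (the MOVI in the generated code), threads it
   through usadv16qi on each iteration, and does one horizontal
   reduction at the end (the ADDV).  */
static int
sad_loop (const unsigned char *p1, const unsigned char *p2, int n)
{
  unsigned int acc[4] = { 0, 0, 0, 0 };             /* movi v0.4s, 0 */
  for (int i = 0; i + 16 <= n; i += 16)
    usadv16qi (acc, p1 + i, p2 + i);
  return (int) (acc[0] + acc[1] + acc[2] + acc[3]); /* addv s0, v0.4s */
}
```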

> If so, looks good to me.
>
> Ok for trunk.

Thanks, committed with r260437.
>
> By the way, now you have the patterns, presumably you could also wire them
> up in arm_neon.h

Yeah, it should be simple to wire them up.

Thanks,
Kyrill

>
> Thanks for the patch!
>
> James
>

Patch

diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index 9a6ed304432af0ca23ec7d3797783a3128776a6e..97f8dbf1c219e2df2653804f2c1f83c123cdf2d6 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -612,6 +612,67 @@  (define_insn "abd<mode>_3"
   [(set_attr "type" "neon_abd<q>")]
 )
 
+(define_insn "aarch64_<sur>abdl2<mode>_3"
+  [(set (match_operand:<VDBLW> 0 "register_operand" "=w")
+	(unspec:<VDBLW> [(match_operand:VDQV_S 1 "register_operand" "w")
+			  (match_operand:VDQV_S 2 "register_operand" "w")]
+	ABDL2))]
+  "TARGET_SIMD"
+  "<sur>abdl2\t%0.<Vwtype>, %1.<Vtype>, %2.<Vtype>"
+  [(set_attr "type" "neon_abd<q>")]
+)
+
+(define_insn "aarch64_<sur>abal<mode>_4"
+  [(set (match_operand:<VDBLW> 0 "register_operand" "=w")
+	(unspec:<VDBLW> [(match_operand:VDQV_S 1 "register_operand" "w")
+			  (match_operand:VDQV_S 2 "register_operand" "w")
+			 (match_operand:<VDBLW> 3 "register_operand" "0")]
+	ABAL))]
+  "TARGET_SIMD"
+  "<sur>abal\t%0.<Vwtype>, %1.<Vhalftype>, %2.<Vhalftype>"
+  [(set_attr "type" "neon_arith_acc<q>")]
+)
+
+(define_insn "aarch64_<sur>adalp<mode>_3"
+  [(set (match_operand:<VDBLW> 0 "register_operand" "=w")
+	(unspec:<VDBLW> [(match_operand:VDQV_S 1 "register_operand" "w")
+			  (match_operand:<VDBLW> 2 "register_operand" "0")]
+	ADALP))]
+  "TARGET_SIMD"
+  "<sur>adalp\t%0.<Vwtype>, %1.<Vtype>"
+  [(set_attr "type" "neon_reduc_add<q>")]
+)
+
+;; Emit a sequence to produce a sum-of-absolute-differences of the V16QI
+;; inputs in operands 1 and 2.  The sequence also has to perform a widening
+;; reduction of the difference into a V4SI vector and accumulate that into
+;; operand 3 before copying that into the result operand 0.
+;; Perform that with a sequence of:
+;; UABDL2	tmp.8h, op1.16b, op2.16b
+;; UABAL	tmp.8h, op1.16b, op2.16b
+;; UADALP	op3.4s, tmp.8h
+;; MOV		op0, op3 // should be eliminated in later passes.
+;; The signed version just uses the signed variants of the above instructions.
+
+(define_expand "<sur>sadv16qi"
+  [(use (match_operand:V4SI 0 "register_operand"))
+   (unspec:V16QI [(use (match_operand:V16QI 1 "register_operand"))
+		  (use (match_operand:V16QI 2 "register_operand"))] ABAL)
+   (use (match_operand:V4SI 3 "register_operand"))]
+  "TARGET_SIMD"
+  {
+    rtx reduc = gen_reg_rtx (V8HImode);
+    emit_insn (gen_aarch64_<sur>abdl2v16qi_3 (reduc, operands[1],
+					       operands[2]));
+    emit_insn (gen_aarch64_<sur>abalv16qi_4 (reduc, operands[1],
+					      operands[2], reduc));
+    emit_insn (gen_aarch64_<sur>adalpv8hi_3 (operands[3], reduc,
+					      operands[3]));
+    emit_move_insn (operands[0], operands[3]);
+    DONE;
+  }
+)
+
 (define_insn "aba<mode>_3"
   [(set (match_operand:VDQ_BHSI 0 "register_operand" "=w")
 	(plus:VDQ_BHSI (abs:VDQ_BHSI (minus:VDQ_BHSI
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 953edb7b943b9acb6fe65db93f67ce73e4498dcb..079385c58ea201225ecf54c752b3c9e3756eab49 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -141,6 +141,9 @@  (define_c_enum "unspec" [
     UNSPEC_PRLG_STK
     UNSPEC_REV
     UNSPEC_RBIT
+    UNSPEC_SABAL
+    UNSPEC_SABDL2
+    UNSPEC_SADALP
     UNSPEC_SCVTF
     UNSPEC_SISD_NEG
     UNSPEC_SISD_SSHL
@@ -159,6 +162,9 @@  (define_c_enum "unspec" [
     UNSPEC_TLSLE24
     UNSPEC_TLSLE32
     UNSPEC_TLSLE48
+    UNSPEC_UABAL
+    UNSPEC_UABDL2
+    UNSPEC_UADALP
     UNSPEC_UCVTF
     UNSPEC_USHL_2S
     UNSPEC_VSTRUCTDUMMY
diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md
index 2642de74bcfb0c729d8309cde14b76cf233ad7ab..e994e58ffb38cee2a00fae4216ae90e33e5563e1 100644
--- a/gcc/config/aarch64/iterators.md
+++ b/gcc/config/aarch64/iterators.md
@@ -1392,6 +1392,16 @@  (define_code_attr sve_imm_con [(eq "vsc")
 ;; -------------------------------------------------------------------
 ;; Int Iterators.
 ;; -------------------------------------------------------------------
+
+;; The unspec codes for the SABAL, UABAL AdvancedSIMD instructions.
+(define_int_iterator ABAL [UNSPEC_SABAL UNSPEC_UABAL])
+
+;; The unspec codes for the SABDL2, UABDL2 AdvancedSIMD instructions.
+(define_int_iterator ABDL2 [UNSPEC_SABDL2 UNSPEC_UABDL2])
+
+;; The unspec codes for the SADALP, UADALP AdvancedSIMD instructions.
+(define_int_iterator ADALP [UNSPEC_SADALP UNSPEC_UADALP])
+
 (define_int_iterator MAXMINV [UNSPEC_UMAXV UNSPEC_UMINV
 			      UNSPEC_SMAXV UNSPEC_SMINV])
 
@@ -1599,6 +1609,9 @@  (define_int_attr sur [(UNSPEC_SHADD "s") (UNSPEC_UHADD "u")
 		      (UNSPEC_SHSUB "s") (UNSPEC_UHSUB "u")
 		      (UNSPEC_SRHSUB "sr") (UNSPEC_URHSUB "ur")
 		      (UNSPEC_ADDHN "") (UNSPEC_RADDHN "r")
+		      (UNSPEC_SABAL "s") (UNSPEC_UABAL "u")
+		      (UNSPEC_SABDL2 "s") (UNSPEC_UABDL2 "u")
+		      (UNSPEC_SADALP "s") (UNSPEC_UADALP "u")
 		      (UNSPEC_SUBHN "") (UNSPEC_RSUBHN "r")
 		      (UNSPEC_ADDHN2 "") (UNSPEC_RADDHN2 "r")
 		      (UNSPEC_SUBHN2 "") (UNSPEC_RSUBHN2 "r")
diff --git a/gcc/testsuite/gcc.c-torture/execute/ssad-run.c b/gcc/testsuite/gcc.c-torture/execute/ssad-run.c
new file mode 100644
index 0000000000000000000000000000000000000000..f15f85f5753769a492cc066ac1ff8a82f39fcc30
--- /dev/null
+++ b/gcc/testsuite/gcc.c-torture/execute/ssad-run.c
@@ -0,0 +1,49 @@ 
+extern void abort ();
+extern int abs (int __x) __attribute__ ((__nothrow__, __leaf__)) __attribute__ ((__const__));
+
+static int
+foo (signed char *w, int i, signed char *x, int j)
+{
+  int tot = 0;
+  for (int a = 0; a < 16; a++)
+    {
+      for (int b = 0; b < 16; b++)
+	tot += abs (w[b] - x[b]);
+      w += i;
+      x += j;
+    }
+  return tot;
+}
+
+void
+bar (signed char *w, signed char *x, int i, int *result)
+{
+  *result = foo (w, 16, x, i);
+}
+
+int
+main (void)
+{
+  signed char m[256];
+  signed char n[256];
+  int sum, i;
+
+  for (i = 0; i < 256; ++i)
+    if (i % 2 == 0)
+      {
+	m[i] = (i % 8) * 2 + 1;
+	n[i] = -(i % 8);
+      }
+    else
+      {
+	m[i] = -((i % 8) * 2 + 2);
+	n[i] = -((i % 8) >> 1);
+      }
+
+  bar (m, n, 16, &sum);
+
+  if (sum != 2368)
+    abort ();
+
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.c-torture/execute/usad-run.c b/gcc/testsuite/gcc.c-torture/execute/usad-run.c
new file mode 100644
index 0000000000000000000000000000000000000000..904a634a497688eda6331845e2bf2805aa8a7991
--- /dev/null
+++ b/gcc/testsuite/gcc.c-torture/execute/usad-run.c
@@ -0,0 +1,49 @@ 
+extern void abort ();
+extern int abs (int __x) __attribute__ ((__nothrow__, __leaf__)) __attribute__ ((__const__));
+
+static int
+foo (unsigned char *w, int i, unsigned char *x, int j)
+{
+  int tot = 0;
+  for (int a = 0; a < 16; a++)
+    {
+      for (int b = 0; b < 16; b++)
+	tot += abs (w[b] - x[b]);
+      w += i;
+      x += j;
+    }
+  return tot;
+}
+
+void
+bar (unsigned char *w, unsigned char *x, int i, int *result)
+{
+  *result = foo (w, 16, x, i);
+}
+
+int
+main (void)
+{
+  unsigned char m[256];
+  unsigned char n[256];
+  int sum, i;
+
+  for (i = 0; i < 256; ++i)
+    if (i % 2 == 0)
+      {
+	m[i] = (i % 8) * 2 + 1;
+	n[i] = -(i % 8);
+      }
+    else
+      {
+	m[i] = -((i % 8) * 2 + 2);
+	n[i] = -((i % 8) >> 1);
+      }
+
+  bar (m, n, 16, &sum);
+
+  if (sum != 32384)
+    abort ();
+
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.target/aarch64/ssadv16qi.c b/gcc/testsuite/gcc.target/aarch64/ssadv16qi.c
new file mode 100644
index 0000000000000000000000000000000000000000..bab75992986865389dff8f9ca43c58e947ef94a0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/ssadv16qi.c
@@ -0,0 +1,27 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O3" } */
+
+#define N 1024
+
+signed char pix1[N], pix2[N];
+
+int foo (void)
+{
+  int i_sum = 0;
+  int i;
+
+  for (i = 0; i < N; i++)
+    i_sum += __builtin_abs (pix1[i] - pix2[i]);
+
+  return i_sum;
+}
+
+/* { dg-final { scan-assembler-not {\tsshll\t} } } */
+/* { dg-final { scan-assembler-not {\tsshll2\t} } } */
+/* { dg-final { scan-assembler-not {\tssubl\t} } } */
+/* { dg-final { scan-assembler-not {\tssubl2\t} } } */
+/* { dg-final { scan-assembler-not {\tabs\t} } } */
+
+/* { dg-final { scan-assembler {\tsabdl2\t} } } */
+/* { dg-final { scan-assembler {\tsabal\t} } } */
+/* { dg-final { scan-assembler {\tsadalp\t} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/usadv16qi.c b/gcc/testsuite/gcc.target/aarch64/usadv16qi.c
new file mode 100644
index 0000000000000000000000000000000000000000..b7c08ee1e1182dadba0048bb96b006f2db61ffe0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/usadv16qi.c
@@ -0,0 +1,27 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O3" } */
+
+#define N 1024
+
+unsigned char pix1[N], pix2[N];
+
+int foo (void)
+{
+  int i_sum = 0;
+  int i;
+
+  for (i = 0; i < N; i++)
+    i_sum += __builtin_abs (pix1[i] - pix2[i]);
+
+  return i_sum;
+}
+
+/* { dg-final { scan-assembler-not {\tushll\t} } } */
+/* { dg-final { scan-assembler-not {\tushll2\t} } } */
+/* { dg-final { scan-assembler-not {\tusubl\t} } } */
+/* { dg-final { scan-assembler-not {\tusubl2\t} } } */
+/* { dg-final { scan-assembler-not {\tabs\t} } } */
+
+/* { dg-final { scan-assembler {\tuabdl2\t} } } */
+/* { dg-final { scan-assembler {\tuabal\t} } } */
+/* { dg-final { scan-assembler {\tuadalp\t} } } */