[v4,0/5] arm: Add support for MVE Tail-Predicated Low Overhead Loops

Message ID 20240221173922.19137-1-andre.simoesdiasvieira@arm.com

Andre Vieira (lists) Feb. 21, 2024, 5:39 p.m. UTC
Hi,

This is a reworked version of the earlier patch series.  The main difference is
a further split of the patches:
[1/5] is arm specific and has been approved before.
[2/5] is target agnostic and has had no substantial changes since v3.
[3/5] is a new arm-specific patch, split from the original last patch, that
annotates across-lane instructions that are safe for tail predication if their
tail-predicated operands are zeroed.
[4/5] is a new arm-specific patch that could be committed independently of the
series to fix an obvious issue and remove unused unspecs & iterators.
[5/5] is the reworked last patch, refactoring the implicit predication and some
other validity checks.

Original cover letter:
This patch adds support for Arm's MVE Tail Predicated Low Overhead Loop
feature.

The M-class Arm-ARM:
https://developer.arm.com/documentation/ddi0553/bu/?lang=en
Section B5.5.1 "Loop tail predication" describes the feature
we are adding support for with this patch (although
we only add codegen for DLSTP/LETP instruction loops).

Previously with commit d2ed233cb94 we'd added support for
non-MVE DLS/LE loops through the loop-doloop pass, which, given
a standard MVE loop like:

```
#include <arm_mve.h>

void __attribute__ ((noinline)) test (int16_t *a, int16_t *b, int16_t *c, int n)
{
  while (n > 0)
    {
      mve_pred16_t p = vctp16q (n);
      int16x8_t va = vldrhq_z_s16 (a, p);
      int16x8_t vb = vldrhq_z_s16 (b, p);
      int16x8_t vc = vaddq_x_s16 (va, vb, p);
      vstrhq_p_s16 (c, vc, p);
      c+=8;
      a+=8;
      b+=8;
      n-=8;
    }
}
```
.. would output:

```
        <pre-calculate the number of iterations and place it into lr>
        dls     lr, lr
.L3:
        vctp.16 r3
        vmrs    ip, P0  @ movhi
        sxth    ip, ip
        vmsr     P0, ip @ movhi
        mov     r4, r0
        vpst
        vldrht.16       q2, [r4]
        mov     r4, r1
        vmov    q3, q0
        vpst
        vldrht.16       q1, [r4]
        mov     r4, r2
        vpst
        vaddt.i16       q3, q2, q1
        subs    r3, r3, #8
        vpst
        vstrht.16       q3, [r4]
        adds    r0, r0, #16
        adds    r1, r1, #16
        adds    r2, r2, #16
        le      lr, .L3
```

where the LE instruction will decrement LR by 1, compare and
branch if needed.
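
As a rough illustration of the counter handling described above, here is a
scalar C sketch.  The helper names and the round-up formula are assumptions
made for illustration only; the count the compiler actually emits before the
`dls` may be computed differently.

```c
#include <assert.h>

/* Hypothetical helper: models the value that has to be placed in LR
   before a DLS/LE loop, i.e. the number of loop iterations.  */
int dls_iteration_count (int n, int lanes)
{
  assert (lanes > 0);
  if (n <= 0)
    return 0;                      /* the while (n > 0) loop never runs */
  return (n + lanes - 1) / lanes;  /* round up: a partial tail needs a pass */
}

/* Hypothetical model of the DLS/LE counter: LE decrements LR by 1 and
   branches back while iterations remain.  Returns how many times the
   loop body executed.  */
int dls_le_body_executions (int n, int lanes)
{
  int lr = dls_iteration_count (n, lanes);
  int executed = 0;
  while (lr > 0)
    {
      executed++;  /* loop body */
      lr -= 1;     /* LE: decrement by one, branch if more work remains */
    }
  return executed;
}
```

With n = 20 and 8 lanes per vector this model runs the body three times, the
last pass covering the 4-element tail under the user's VCTP predicate.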

(There are also other inefficiencies in the above code, like the
pointless vmrs/sxth/vmsr on the VPR, the adds not being merged
into the vldrht/vstrht as #16 offsets, and some random movs!
But those are different problems...)

The MVE version is similar, except that:
* Instead of DLS/LE the instructions are DLSTP/LETP.
* Instead of pre-calculating the number of iterations of the
  loop, we place the number of elements to be processed by the
  loop into LR.
* Instead of decrementing the LR by one, LETP will decrement it
  by the number of elements being processed in each iteration, as
  determined by FPSCR.LTPSIZE: 16 for 8-bit elements, 8 for 16-bit
  elements, etc.
* On the final iteration, automatic Loop Tail Predication is
  performed, as if the instructions within the loop had been VPT
  predicated with a VCTP generating the VPR predicate in every
  loop iteration.
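
The differences listed above can be sketched as a scalar C model of the
counter.  This is a simplified illustration, not a description of the hardware
or of the patch: the function and struct names are hypothetical, and the only
architectural fact relied on is that MVE vectors are 128 bits wide.

```c
#include <assert.h>

/* MVE vectors are 128 bits wide, so the number of elements handled per
   iteration follows from the element size.  */
int lanes_per_vector (int element_bits)
{
  assert (element_bits == 8 || element_bits == 16
          || element_bits == 32 || element_bits == 64);
  return 128 / element_bits;
}

/* Hypothetical record of what the modelled loop did.  */
struct letp_trace
{
  int iterations;         /* times the loop body ran */
  int last_active_lanes;  /* lanes left enabled on the final iteration */
};

/* Hypothetical model of a DLSTP/LETP loop over n elements.  */
struct letp_trace model_dlstp_letp (int n, int element_bits)
{
  int lanes = lanes_per_vector (element_bits);
  struct letp_trace t = { 0, 0 };
  int lr = n;  /* DLSTP: LR holds the element count, not an iteration count */
  while (lr > 0)
    {
      /* Tail predication: only the remaining elements are active.  */
      t.last_active_lanes = lr < lanes ? lr : lanes;
      t.iterations++;
      lr -= lanes;  /* LETP: decrement by the elements per iteration */
    }
  return t;
}
```

For 20 16-bit elements the model runs three iterations and leaves 4 lanes
active on the last one, without any explicit VCTP in the loop body.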

The dlstp/letp loop now looks like:

```
        <place n into r3>
        dlstp.16        lr, r3
.L14:
        mov     r3, r0
        vldrh.16        q3, [r3]
        mov     r3, r1
        vldrh.16        q2, [r3]
        mov     r3, r2
        vadd.i16  q3, q3, q2
        adds    r0, r0, #16
        vstrh.16        q3, [r3]
        adds    r1, r1, #16
        adds    r2, r2, #16
        letp    lr, .L14
```

Since the loop tail predication is automatic, we have eliminated
the VCTP that had been specified by the user in the intrinsic
and converted the VPT-predicated instructions into their
unpredicated equivalents (which also saves us from VPST insns).

The LETP instruction here decrements LR by 8 in each iteration.
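
To make the implicit predicate concrete, here is a scalar C sketch of the
16-bit add loop.  All `model_*` names are hypothetical stand-ins written for
this illustration; they are not intrinsics and not code from the series.  The
sketch assumes the usual MVE predicate layout of one bit per byte, so each
active 16-bit lane contributes two set bits.

```c
#include <assert.h>
#include <stdint.h>

enum { LANES = 8 };  /* int16_t lanes in a 128-bit MVE vector */

/* Scalar model of the predicate a VCTP.16 would generate for n
   remaining elements: two predicate bits per active 16-bit lane.  */
uint16_t model_vctp16 (int n)
{
  int active = n < 0 ? 0 : (n < LANES ? n : LANES);
  return (uint16_t) ((1u << (2 * active)) - 1u);
}

/* Scalar model of one predicated load/add/store step: lanes whose
   predicate bit is clear are neither computed nor stored.  */
void model_predicated_add (const int16_t *a, const int16_t *b,
                           int16_t *c, uint16_t pred)
{
  for (int lane = 0; lane < LANES; lane++)
    if (pred & (1u << (2 * lane)))
      c[lane] = (int16_t) (a[lane] + b[lane]);
}

/* Model of the tail-predicated loop: the predicate is derived from the
   element count held in the loop counter, so the loop body itself
   carries no explicit VCTP.  */
void model_tail_predicated_add (const int16_t *a, const int16_t *b,
                                int16_t *c, int n)
{
  for (int lr = n; lr > 0; lr -= LANES)
    {
      model_predicated_add (a, b, c, model_vctp16 (lr));
      a += LANES;
      b += LANES;
      c += LANES;
    }
}
```

Running the model over 11 elements writes exactly 11 results; the remaining
lanes of the second vector are left untouched, which is the behaviour the
user's explicit VCTP provided in the original loop.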

Stam Markianos-Wright (1):
  arm: Add define_attr to to create a mapping between MVE predicated and
    unpredicated insns

Andre Vieira (4):
  doloop: Add support for predicated vectorized loops
  arm: Annotate instructions with mve_safe_imp_xlane_pred
  arm: Fix a wrong attribute use and remove unused unspecs and iterators
  arm: Add support for MVE Tail-Predicated Low Overhead Loops