Message ID | VE1PR08MB5599EF32635A6AB87F3106BC83F69@VE1PR08MB5599.eurprd08.prod.outlook.com |
---|---|
State | New |
Series | [v4,1/5] AArch64: Improve A64FX memset for small sizes |
The 08/09/2021 13:13, Wilco Dijkstra via Libc-alpha wrote:
> v4: no changes
>
> Remove unroll32 code since it doesn't improve performance.

OK to commit, but keep Reviewed-by: Naohiro Tamura <naohirot@fujitsu.com>

>
> ---
>
> diff --git a/sysdeps/aarch64/multiarch/memset_a64fx.S b/sysdeps/aarch64/multiarch/memset_a64fx.S
> index 55f28b644defdffb140c88da0635ef099235546c..89dba912588c243e67a9527a56b4d3a44659d542 100644
> --- a/sysdeps/aarch64/multiarch/memset_a64fx.S
> +++ b/sysdeps/aarch64/multiarch/memset_a64fx.S
> @@ -102,22 +102,6 @@ L(vl_agnostic): // VL Agnostic
>  	ccmp	vector_length, tmp1, 0, cs
>  	b.eq	L(L1_prefetch)
>
> -L(unroll32):
> -	lsl	tmp1, vector_length, 3	// vector_length * 8
> -	lsl	tmp2, vector_length, 5	// vector_length * 32
> -	.p2align 3
> -1:	cmp	rest, tmp2
> -	b.cc	L(unroll8)
> -	st1b_unroll
> -	add	dst, dst, tmp1
> -	st1b_unroll
> -	add	dst, dst, tmp1
> -	st1b_unroll
> -	add	dst, dst, tmp1
> -	st1b_unroll
> -	add	dst, dst, tmp1
> -	sub	rest, rest, tmp2
> -	b	1b
>
>  L(unroll8):
>  	lsl	tmp1, vector_length, 3
> @@ -155,7 +139,7 @@ L(L1_prefetch): // if rest >= L1_SIZE
>  	sub	rest, rest, CACHE_LINE_SIZE * 2
>  	cmp	rest, L1_SIZE
>  	b.ge	1b
> -	cbnz	rest, L(unroll32)
> +	cbnz	rest, L(unroll8)
>  	ret
>
>  	// count >= L2_SIZE
--
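For readers less familiar with the SVE assembly, the surviving `L(unroll8)` loop structure can be sketched roughly in C: per iteration it stores eight vector-length chunks (each `st1b_unroll`/`add` pair in the assembly), then handles whatever is left as a tail. This is only an illustrative sketch, not glibc code; `vector_length` stands in for the SVE VL, and plain `memset` stands in for the `st1b` vector stores.

```c
#include <stddef.h>
#include <string.h>

/* Rough C model of the unroll-by-8 memset loop kept by the patch.
   Names are illustrative; the real code operates on SVE registers. */
static void memset_unroll8_sketch(unsigned char *dst, int c,
                                  size_t count, size_t vector_length)
{
    size_t step = vector_length * 8;  /* tmp1 = vector_length << 3 */
    size_t rest = count;

    while (rest >= step) {            /* the L(unroll8) loop body */
        for (int i = 0; i < 8; i++) { /* one st1b_unroll per chunk */
            memset(dst, c, vector_length);
            dst += vector_length;
        }
        rest -= step;
    }
    memset(dst, c, rest);             /* remaining tail bytes */
}
```

The removed `L(unroll32)` block had the same shape but stored 32 chunks per iteration; since that extra unrolling did not improve performance on A64FX, falling straight through to the 8x version simplifies the code.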