Message ID | VE1PR08MB559925E4AA763668AEE7802583E49@VE1PR08MB5599.eurprd08.prod.outlook.com
---|---
State | New
Series | [v3,1/5] AArch64: Improve A64FX memset
Hi Wilco,

Thank you for the patch. LGTM, I confirmed that there is no performance change [1][2].

Reviewed-by: Naohiro Tamura <naohirot@fujitsu.com>
Tested-by: Naohiro Tamura <naohirot@fujitsu.com>

Regarding the commit title, how about something like this?
"AArch64: Improve A64FX memset by removing unroll32"

[1] https://drive.google.com/file/d/1SIw7bXX9Pi2G7wga9j5X9M2xzOHYQlIx/view?usp=sharing
[2] https://drive.google.com/file/d/1gdcuRFZbtlIpnINUMar4DIt9Ao04K36o/view?usp=sharing

Thanks.
Naohiro

> -----Original Message-----
> From: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
> Sent: Friday, July 23, 2021 1:03 AM
> To: Tamura, Naohiro/田村 直広 <naohirot@fujitsu.com>
> Cc: 'GNU C Library' <libc-alpha@sourceware.org>
> Subject: [PATCH v3 4/5] AArch64: Improve A64FX memset
>
> Remove unroll32 code since it doesn't improve performance.
>
> ---
> diff --git a/sysdeps/aarch64/multiarch/memset_a64fx.S b/sysdeps/aarch64/multiarch/memset_a64fx.S
> index fce257fa68120c2b101f29b438c397e10b4c275e..8665c272431b46dadea53c63ab74829c3aa99312 100644
> --- a/sysdeps/aarch64/multiarch/memset_a64fx.S
> +++ b/sysdeps/aarch64/multiarch/memset_a64fx.S
> @@ -102,22 +102,6 @@ L(vl_agnostic):	// VL Agnostic
>  	ccmp	vector_length, tmp1, 0, cs
>  	b.eq	L(L1_prefetch)
>
> -L(unroll32):
> -	lsl	tmp1, vector_length, 3	// vector_length * 8
> -	lsl	tmp2, vector_length, 5	// vector_length * 32
> -	.p2align 3
> -1:	cmp	rest, tmp2
> -	b.cc	L(unroll8)
> -	st1b_unroll
> -	add	dst, dst, tmp1
> -	st1b_unroll
> -	add	dst, dst, tmp1
> -	st1b_unroll
> -	add	dst, dst, tmp1
> -	st1b_unroll
> -	add	dst, dst, tmp1
> -	sub	rest, rest, tmp2
> -	b	1b
>
>  L(unroll8):
>  	lsl	tmp1, vector_length, 3
> @@ -155,7 +139,7 @@ L(L1_prefetch):	// if rest >= L1_SIZE
>  	sub	rest, rest, CACHE_LINE_SIZE * 2
>  	cmp	rest, L1_SIZE
>  	b.ge	1b
> -	cbnz	rest, L(unroll32)
> +	cbnz	rest, L(unroll8)
>  	ret
>
>  	// count >= L2_SIZE
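For readers less familiar with the assembly, the structure of the retained L(unroll8) path can be sketched in C. This is a minimal illustrative sketch, not glibc code: `VL`, `fill_unroll8`, and the scalar tail are hypothetical stand-ins (on A64FX the SVE vector length is determined at runtime, and `st1b_unroll` expands to eight SVE stores).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical fixed stand-in for the runtime SVE vector_length;
   A64FX implements 512-bit SVE, i.e. 64 bytes per vector. */
#define VL 64

/* Illustrative 8x-unrolled fill mirroring L(unroll8) in
   memset_a64fx.S: each loop iteration writes 8 vectors' worth
   of data, then the remainder is handled separately. */
static void fill_unroll8(unsigned char *dst, unsigned char v, size_t n)
{
    size_t block = 8 * VL;        /* lsl tmp1, vector_length, 3 */

    while (n >= block) {          /* loop while rest >= 8 * VL */
        for (int i = 0; i < 8; i++) {  /* st1b_unroll: 8 vector stores */
            memset(dst, v, VL);
            dst += VL;
        }
        n -= block;               /* sub rest, rest, tmp1 */
    }
    if (n)                        /* tail smaller than one block */
        memset(dst, v, n);
}
```

The removed L(unroll32) path had the same shape but wrote 32 vectors per iteration; per the measurements linked above, the extra unrolling did not improve performance, so falling straight through to the 8x loop is sufficient.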