| Message ID | VE1PR08MB5599A8A7D14BCD1B5E3EED1A83E49@VE1PR08MB5599.eurprd08.prod.outlook.com |
|---|---|
| State | New |
| Series | [v3,1/5] AArch64: Improve A64FX memset |
Hi Wilco,

Thank you for the patch. LGTM, and I confirmed that there is no performance change [1][2].

Reviewed-by: Naohiro Tamura <naohirot@fujitsu.com>
Tested-by: Naohiro Tamura <naohirot@fujitsu.com>

Regarding the commit title, how about something like this?
"AArch64: Improve A64FX memset for remaining bytes"

[1] https://drive.google.com/file/d/13Q3cx3atMAAyfGq_cGGjS77zhA_JIqDu/view?usp=sharing
[2] https://drive.google.com/file/d/1kUSQdKZx_8Vm3lOW_xE0NrkZSverhidS/view?usp=sharing

Thanks.
Naohiro

> -----Original Message-----
> From: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
> Sent: Friday, July 23, 2021 1:02 AM
> To: Tamura, Naohiro/田村 直広 <naohirot@fujitsu.com>
> Cc: 'GNU C Library' <libc-alpha@sourceware.org>
> Subject: [PATCH v3 3/5] AArch64: Improve A64FX memset
>
> Simplify handling of the remaining bytes. Avoid lots of taken branches and
> complex whilelo computations; instead, unconditionally write vectors from
> the end.
>
> ---
>
> diff --git a/sysdeps/aarch64/multiarch/memset_a64fx.S b/sysdeps/aarch64/multiarch/memset_a64fx.S
> index 608e0e2e2ff5259178e2fdadf1eea8816194d879..fce257fa68120c2b101f29b438c397e10b4c275e 100644
> --- a/sysdeps/aarch64/multiarch/memset_a64fx.S
> +++ b/sysdeps/aarch64/multiarch/memset_a64fx.S
> @@ -130,38 +130,19 @@ L(unroll8):
> 	b	1b
>
> L(last):
> -	whilelo	p0.b, xzr, rest
> -	whilelo	p1.b, vector_length, rest
> -	b.last	1f
> -	st1b	z0.b, p0, [dst, #0, mul vl]
> -	st1b	z0.b, p1, [dst, #1, mul vl]
> -	ret
> -1:	lsl	tmp1, vector_length, 1	// vector_length * 2
> -	whilelo	p2.b, tmp1, rest
> -	incb	tmp1
> -	whilelo	p3.b, tmp1, rest
> -	b.last	1f
> -	st1b	z0.b, p0, [dst, #0, mul vl]
> -	st1b	z0.b, p1, [dst, #1, mul vl]
> -	st1b	z0.b, p2, [dst, #2, mul vl]
> -	st1b	z0.b, p3, [dst, #3, mul vl]
> -	ret
> -1:	lsl	tmp1, vector_length, 2	// vector_length * 4
> -	whilelo	p4.b, tmp1, rest
> -	incb	tmp1
> -	whilelo	p5.b, tmp1, rest
> -	incb	tmp1
> -	whilelo	p6.b, tmp1, rest
> -	incb	tmp1
> -	whilelo	p7.b, tmp1, rest
> -	st1b	z0.b, p0, [dst, #0, mul vl]
> -	st1b	z0.b, p1, [dst, #1, mul vl]
> -	st1b	z0.b, p2, [dst, #2, mul vl]
> -	st1b	z0.b, p3, [dst, #3, mul vl]
> -	st1b	z0.b, p4, [dst, #4, mul vl]
> -	st1b	z0.b, p5, [dst, #5, mul vl]
> -	st1b	z0.b, p6, [dst, #6, mul vl]
> -	st1b	z0.b, p7, [dst, #7, mul vl]
> +	cmp	count, vector_length, lsl 1
> +	b.ls	2f
> +	add	tmp2, vector_length, vector_length, lsl 2
> +	cmp	count, tmp2
> +	b.ls	5f
> +	st1b	z0.b, p0, [dstend, -8, mul vl]
> +	st1b	z0.b, p0, [dstend, -7, mul vl]
> +	st1b	z0.b, p0, [dstend, -6, mul vl]
> +5:	st1b	z0.b, p0, [dstend, -5, mul vl]
> +	st1b	z0.b, p0, [dstend, -4, mul vl]
> +	st1b	z0.b, p0, [dstend, -3, mul vl]
> +2:	st1b	z0.b, p0, [dstend, -2, mul vl]
> +	st1b	z0.b, p0, [dstend, -1, mul vl]
> 	ret
>
> L(L1_prefetch):	// if rest >= L1_SIZE
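[Editor's note] For readers unfamiliar with the trick: instead of computing a whilelo predicate for each partial vector, the rewritten L(last) picks one of three fall-through paths (2, 5, or 8 whole vectors) and stores them backwards from dstend. Overlapping bytes the main loop already wrote is harmless, because every store writes the same value. Below is a minimal C sketch of that idea; VL, set_tail, and the use of memset as a stand-in for a vector store are illustrative assumptions, not glibc code, and it presumes the buffer reaches at least 8 * VL bytes back from dstend on the 8-store path (in the real code the earlier size dispatch keeps the backward stores inside the buffer).

#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Assumed vector length in bytes; A64FX implements 512-bit SVE vectors.  */
#define VL 64

/* Sketch of the patched L(last): everything before the last `rest` bytes
   has already been set by the main loop.  Choose 2, 5, or 8 whole-vector
   stores and write them backwards from the end; redundant overlap with
   already-set bytes is fine because memset stores one value everywhere.  */
static void
set_tail (unsigned char *dstend, size_t rest, unsigned char c)
{
  if (rest > 2 * VL)			/* asm: cmp count, vector_length, lsl 1; b.ls 2f */
    {
      if (rest > 5 * VL)		/* asm: tmp2 = VL + 4*VL; cmp count, tmp2; b.ls 5f */
	{
	  memset (dstend - 8 * VL, c, VL);	/* st1b ... [dstend, -8, mul vl] */
	  memset (dstend - 7 * VL, c, VL);
	  memset (dstend - 6 * VL, c, VL);
	}
      memset (dstend - 5 * VL, c, VL);		/* label 5: falls through */
      memset (dstend - 4 * VL, c, VL);
      memset (dstend - 3 * VL, c, VL);
    }
  memset (dstend - 2 * VL, c, VL);		/* label 2: falls through */
  memset (dstend - 1 * VL, c, VL);
}

int
main (void)
{
  unsigned char buf[8 * VL + 37];
  memset (buf, 0xab, 8 * VL);		/* stand-in for the unrolled main loop */
  set_tail (buf + sizeof buf, 37, 0xab);	/* handle the remaining 37 bytes */
  printf ("last byte: %#x\n", buf[sizeof buf - 1]);	/* prints 0xab */
  return 0;
}

The fall-through structure is the point of the rewrite: every size class executes a straight run of stores after at most two compare-and-branch pairs, versus the old code's per-vector whilelo computations and b.last branches.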