| Message ID | 5CA2145C.8040402@bell-sw.com |
|---|---|
| State | New |
| Series | [v5] aarch64: thunderx2 memcpy optimizations for ext-based code path |
On Mon, 2019-04-01 at 16:38 +0300, Anton Youdkevitch wrote:
> Here is the updated patch for improving the long unaligned
> code path (the one using the "ext" instruction).
>
> 1. The always-taken conditional branch at the beginning is
> removed.
>
> 2. Epilogue code is placed after the end of the loop to
> reduce the number of branches.
>
> 3. The redundant "mov" instructions inside the loop are
> gone due to the changed order of the registers in the "ext"
> instructions inside the loop; the prologue has an additional
> "ext" instruction.
>
> 4. Updating the count in the prologue was hoisted out, as
> it is the same update for each prologue.
>
> 5. Invariant code of the loop epilogue was hoisted out.
>
> 6. As the current size of the ext chunk is exactly 16
> instructions long, a "nop" was added at the beginning
> of the code sequence so that the loop entry for all the
> chunks is aligned.
>
> make check - no regression (on linux-aarch64)
> make bench - no performance regressions (on Thunderx2)
>
> Looks OK?

This looks good to me, Anton. I can check it in for you if we have a
consensus that this version is OK and there are no objections.

Steve Ellcey
sellcey@marvell.com
On 02/04/2019 23:48, Steve Ellcey wrote:
> On Mon, 2019-04-01 at 16:38 +0300, Anton Youdkevitch wrote:
>> Here is the updated patch for improving the long unaligned
>> code path (the one using the "ext" instruction).
>>
>> 1. The always-taken conditional branch at the beginning is
>> removed.
>>
>> 2. Epilogue code is placed after the end of the loop to
>> reduce the number of branches.
>>
>> 3. The redundant "mov" instructions inside the loop are
>> gone due to the changed order of the registers in the "ext"
>> instructions inside the loop; the prologue has an additional
>> "ext" instruction.
>>
>> 4. Updating the count in the prologue was hoisted out, as
>> it is the same update for each prologue.
>>
>> 5. Invariant code of the loop epilogue was hoisted out.
>>
>> 6. As the current size of the ext chunk is exactly 16
>> instructions long, a "nop" was added at the beginning
>> of the code sequence so that the loop entry for all the
>> chunks is aligned.
>>
>> make check - no regression (on linux-aarch64)
>> make bench - no performance regressions (on Thunderx2)
>>
>> Looks OK?
>
> This looks good to me, Anton. I can check it in for you if we have a
> consensus that this version is OK and there are no objections.

Yes, this is OK to commit, I have no objections.
On Fri, 2019-04-05 at 15:21 +0000, Szabolcs Nagy wrote:
> On 02/04/2019 23:48, Steve Ellcey wrote:
> > On Mon, 2019-04-01 at 16:38 +0300, Anton Youdkevitch wrote:
> > >
> > > make check - no regression (on linux-aarch64)
> > > make bench - no performance regressions (on Thunderx2)
> > >
> > > Looks OK?
> >
> > This looks good to me, Anton. I can check it in for you if we have a
> > consensus that this version is OK and there are no objections.
>
> Yes, this is OK to commit, I have no objections.

Anton, I have gone ahead and committed this for you.

Steve Ellcey
sellcey@marvell.com
On 4/6/2019 00:05, Steve Ellcey wrote:
> On Fri, 2019-04-05 at 15:21 +0000, Szabolcs Nagy wrote:
>> On 02/04/2019 23:48, Steve Ellcey wrote:
>>> On Mon, 2019-04-01 at 16:38 +0300, Anton Youdkevitch wrote:
>>>
>>>> make check - no regression (on linux-aarch64)
>>>> make bench - no performance regressions (on Thunderx2)
>>>>
>>>> Looks OK?
>>>
>>> This looks good to me, Anton. I can check it in for you if we have a
>>> consensus that this version is OK and there are no objections.
>>
>> Yes, this is OK to commit, I have no objections.
>
> Anton, I have gone ahead and committed this for you.

OK, thanks a lot!
diff --git a/sysdeps/aarch64/multiarch/memcpy_thunderx2.S b/sysdeps/aarch64/multiarch/memcpy_thunderx2.S
index b2215c1..45e9a29 100644
--- a/sysdeps/aarch64/multiarch/memcpy_thunderx2.S
+++ b/sysdeps/aarch64/multiarch/memcpy_thunderx2.S
@@ -382,7 +382,8 @@ L(bytes_0_to_3):
 	strb	A_lw, [dstin]
 	strb	B_lw, [dstin, tmp1]
 	strb	A_hw, [dstend, -1]
-L(end):	ret
+L(end):
+	ret
 
 	.p2align 4
 
@@ -544,43 +545,35 @@ L(dst_unaligned):
 	str	C_q, [dst], #16
 	ldp	F_q, G_q, [src], #32
 	bic	dst, dst, 15
+	subs	count, count, 32
 	adrp	tmp2, L(ext_table)
 	add	tmp2, tmp2, :lo12:L(ext_table)
 	add	tmp2, tmp2, tmp1, LSL #2
 	ldr	tmp3w, [tmp2]
 	add	tmp2, tmp2, tmp3w, SXTW
 	br	tmp2
-
-#define EXT_CHUNK(shft) \
 	.p2align 4 ;\
+	nop
+#define EXT_CHUNK(shft) \
 L(ext_size_ ## shft):;\
 	ext	A_v.16b, C_v.16b, D_v.16b, 16-shft;\
 	ext	B_v.16b, D_v.16b, E_v.16b, 16-shft;\
-	subs	count, count, 32;\
-	b.ge	2f;\
-1:;\
-	stp	A_q, B_q, [dst], #32;\
 	ext	H_v.16b, E_v.16b, F_v.16b, 16-shft;\
-	ext	I_v.16b, F_v.16b, G_v.16b, 16-shft;\
-	stp	H_q, I_q, [dst], #16;\
-	add	dst, dst, tmp1;\
-	str	G_q, [dst], #16;\
-	b	L(copy_long_check32);\
-2:;\
+1:;\
 	stp	A_q, B_q, [dst], #32;\
 	prfm	pldl1strm, [src, MEMCPY_PREFETCH_LDR];\
-	ldp	D_q, J_q, [src], #32;\
-	ext	H_v.16b, E_v.16b, F_v.16b, 16-shft;\
+	ldp	C_q, D_q, [src], #32;\
 	ext	I_v.16b, F_v.16b, G_v.16b, 16-shft;\
-	mov	C_v.16b, G_v.16b;\
 	stp	H_q, I_q, [dst], #32;\
+	ext	A_v.16b, G_v.16b, C_v.16b, 16-shft;\
+	ext	B_v.16b, C_v.16b, D_v.16b, 16-shft;\
 	ldp	F_q, G_q, [src], #32;\
-	ext	A_v.16b, C_v.16b, D_v.16b, 16-shft;\
-	ext	B_v.16b, D_v.16b, J_v.16b, 16-shft;\
-	mov	E_v.16b, J_v.16b;\
+	ext	H_v.16b, D_v.16b, F_v.16b, 16-shft;\
 	subs	count, count, 64;\
-	b.ge	2b;\
-	b	1b;\
+	b.ge	1b;\
+2:;\
+	ext	I_v.16b, F_v.16b, G_v.16b, 16-shft;\
+	b	L(ext_tail);
 
 EXT_CHUNK(1)
 EXT_CHUNK(2)
@@ -598,6 +591,14 @@ EXT_CHUNK(13)
 EXT_CHUNK(14)
 EXT_CHUNK(15)
 
+L(ext_tail):
+	stp	A_q, B_q, [dst], #32
+	stp	H_q, I_q, [dst], #16
+	add	dst, dst, tmp1
+	str	G_q, [dst], #16
+	b	L(copy_long_check32)
+
+
 END (MEMCPY)
 	.section .rodata
 	.p2align 4