Message ID | 20190320162930.GB13393@bell-sw.com |
---|---|
State | New |
Headers | show |
Series | [v2] aarch64: thunderx2 memcpy branches reordering | expand |
Adhemaral, The reason I did the rewriting is two-fold. First, it looks more clean as there aren't now two branches immediately following each other. Second, it can be performance beneficial as we are saving one branch on entry for the most of the cases (we take it only if the first iteration is a partial iteration). But, of course, the performance benefit was not my concern here. However, this is one instruction less anyway. On 3/20/2019 19:29, Anton Youdkevitch wrote: > Rewrote the branches in load and merge chunk > so that the order is more in line with the > most probable case. > > ChangeLog: > * sysdeps/aarch64/multiarch/memcpy_thunderx2.S: > branches reordering >
If is just a non functional change, you should indicate it. However it does change the code and thus for such refactor patches on arch-specific implementations we usually ask for, besides running the testcase for regression (which is also not indicate in your message in which platform you actually has tested), that at least some performance evaluation is provided (even if this shows no gain). On 21/03/2019 15:44, Anton Youdkevitch wrote: > Adhemaral, > > The reason I did the rewriting is two-fold. First, it > looks more clean as there aren't now two branches > immediately following each other. Second, it can be > performance beneficial as we are saving one branch > on entry for the most of the cases (we take it only if > the first iteration is a partial iteration). But, of course, > the performance benefit was not my concern here. > However, this is one instruction less anyway. > > > On 3/20/2019 19:29, Anton Youdkevitch wrote: >> Rewrote the branches in load and merge chunk >> so that the order is more in line with the >> most probable case. >> >> ChangeLog: >> * sysdeps/aarch64/multiarch/memcpy_thunderx2.S: >> branches reordering >> >
Adhemerval, On 3/21/2019 23:38, Adhemerval Zanella wrote: > If is just a non functional change, you should indicate it. However it > does change the code and thus for such refactor patches on arch-specific > implementations we usually ask for, besides running the testcase for > regression (which is also not indicate in your message in which platform > you actually has tested), that at least some performance evaluation is > provided (even if this shows no gain). make check - no regressions make bench - no performance regressions for memcpy Should I resubmit the same patch with the description mentioning it is a non functional change? > > On 21/03/2019 15:44, Anton Youdkevitch wrote: >> Adhemaral, >> >> The reason I did the rewriting is two-fold. First, it >> looks more clean as there aren't now two branches >> immediately following each other. Second, it can be >> performance beneficial as we are saving one branch >> on entry for the most of the cases (we take it only if >> the first iteration is a partial iteration). But, of course, >> the performance benefit was not my concern here. >> However, this is one instruction less anyway. >> >> >> On 3/20/2019 19:29, Anton Youdkevitch wrote: >>> Rewrote the branches in load and merge chunk >>> so that the order is more in line with the >>> most probable case. >>> >>> ChangeLog: >>> * sysdeps/aarch64/multiarch/memcpy_thunderx2.S: >>> branches reordering >>> >>
On 22/03/2019 05:13, Anton Youdkevitch wrote: > Adhemerval, > > On 3/21/2019 23:38, Adhemerval Zanella wrote: >> If is just a non functional change, you should indicate it. However it >> does change the code and thus for such refactor patches on arch-specific >> implementations we usually ask for, besides running the testcase for >> regression (which is also not indicate in your message in which platform >> you actually has tested), that at least some performance evaluation is >> provided (even if this shows no gain). > make check - no regressions > make bench - no performance regressions for memcpy > > Should I resubmit the same patch with the description mentioning it is a > non functional change? I think the provided information should be suffice. It also a good thing the patch could reduce code size (from 2964 to 2736), so I am ok with this change. Szabolcs, are you ok with this change?
diff --git a/sysdeps/aarch64/multiarch/memcpy_thunderx2.S b/sysdeps/aarch64/multiarch/memcpy_thunderx2.S index b2215c1..f637300 100644 --- a/sysdeps/aarch64/multiarch/memcpy_thunderx2.S +++ b/sysdeps/aarch64/multiarch/memcpy_thunderx2.S @@ -382,7 +382,8 @@ L(bytes_0_to_3): strb A_lw, [dstin] strb B_lw, [dstin, tmp1] strb A_hw, [dstend, -1] -L(end): ret +L(end): + ret .p2align 4 @@ -557,17 +558,9 @@ L(ext_size_ ## shft):;\ ext A_v.16b, C_v.16b, D_v.16b, 16-shft;\ ext B_v.16b, D_v.16b, E_v.16b, 16-shft;\ subs count, count, 32;\ - b.ge 2f;\ + b.lt 2f;\ 1:;\ stp A_q, B_q, [dst], #32;\ - ext H_v.16b, E_v.16b, F_v.16b, 16-shft;\ - ext I_v.16b, F_v.16b, G_v.16b, 16-shft;\ - stp H_q, I_q, [dst], #16;\ - add dst, dst, tmp1;\ - str G_q, [dst], #16;\ - b L(copy_long_check32);\ -2:;\ - stp A_q, B_q, [dst], #32;\ prfm pldl1strm, [src, MEMCPY_PREFETCH_LDR];\ ldp D_q, J_q, [src], #32;\ ext H_v.16b, E_v.16b, F_v.16b, 16-shft;\ @@ -579,8 +572,15 @@ L(ext_size_ ## shft):;\ ext B_v.16b, D_v.16b, J_v.16b, 16-shft;\ mov E_v.16b, J_v.16b;\ subs count, count, 64;\ - b.ge 2b;\ - b 1b;\ + b.ge 1b;\ +2:;\ + stp A_q, B_q, [dst], #32;\ + ext H_v.16b, E_v.16b, F_v.16b, 16-shft;\ + ext I_v.16b, F_v.16b, G_v.16b, 16-shft;\ + stp H_q, I_q, [dst], #16;\ + add dst, dst, tmp1;\ + str G_q, [dst], #16;\ + b L(copy_long_check32);\ EXT_CHUNK(1) EXT_CHUNK(2)