Message ID: 20111102194520.GB29597@kam.mff.cuni.cz
State: New
Hi,

Please check the updated patch - it's now separated from most of the middle-end changes. I tried to take your remarks into account; you can find my comments below.

The patch bootstrapped, passed make check, and showed performance gains on memset/memcpy for sizes ~256-4k (checked on Atom).

Is it ok for trunk?

ChangeLog:

2011-11-03  Zolotukhin Michael  <michael.v.zolotukhin@gmail.com>

	* config/i386/i386.h (processor_costs): Add second dimension to
	stringop_algs array.
	(clear_ratio): Tune value to improve performance.
	* config/i386/i386.c (cost models): Initialize second dimension of
	stringop_algs arrays.  Tune cost model in atom_cost, generic32_cost
	and generic64_cost.
	(promote_duplicated_reg): Add support for vector modes, add
	declaration.
	(promote_duplicated_reg_to_size): Likewise.
	(ix86_expand_move): Add support for vector moves, that use half of
	vector register.
	(expand_set_or_movmem_via_loop_with_iter): New function.
	(expand_set_or_movmem_via_loop): Enable reuse of the same iters in
	different loops, produced by this function.
	(emit_strset): New function.
	(expand_movmem_epilogue): Add epilogue generation for bigger sizes,
	use SSE-moves where possible.
	(expand_setmem_epilogue): Likewise.
	(expand_movmem_prologue): Likewise for prologue.
	(expand_setmem_prologue): Likewise.
	(expand_constant_movmem_prologue): Likewise.
	(expand_constant_setmem_prologue): Likewise.
	(decide_alg): Add new argument align_unknown.  Fix algorithm of
	strategy selection if TARGET_INLINE_ALL_STRINGOPS is set.
	(decide_alignment): Update desired alignment according to chosen
	move mode.
	(ix86_expand_movmem): Change unrolled_loop strategy to use
	SSE-moves.
	(ix86_expand_setmem): Likewise.
	(ix86_slow_unaligned_access): Implementation of new hook
	slow_unaligned_access.
	* config/i386/i386.md (strset): Enable half-SSE moves.
	* config/i386/sse.md (sse2_loadq): Add expand for sse2_loadq.
	(vec_dupv4si): Add expand for vec_dupv4si.
	(vec_dupv2di): Add expand for vec_dupv2di.
	* cse.c (cse_insn): Stop forward propagation of vector constants.
	* fwprop.c (forward_propagate_and_simplify): Likewise.
	* doc/tm.texi (SLOW_UNALIGNED_ACCESS): Remove documentation for
	deleted macro SLOW_UNALIGNED_ACCESS.
	(TARGET_SLOW_UNALIGNED_ACCESS): Add documentation on new hook.
	* doc/tm.texi.in (SLOW_UNALIGNED_ACCESS): Likewise.
	(TARGET_SLOW_UNALIGNED_ACCESS): Likewise.
	* emit-rtl.c (get_mem_align_offset): Add handling of MEM_REFs.
	(adjust_address_1): Improve algorithm for determining alignment of
	address+offset.
	* expr.c (compute_align_by_offset): New function.
	(vector_mode_for_mode): New function.
	(vector_extensions_used_for_mode): New function.
	(alignment_for_piecewise_move): Use hook slow_unaligned_access
	instead of macros SLOW_UNALIGNED_ACCESS.
	(emit_group_load_1): Likewise.
	(emit_group_store): Likewise.
	(emit_push_insn): Likewise.
	(store_field): Likewise.
	(expand_expr_real_1): Likewise.
	* expr.h (compute_align_by_offset): Add declaration.
	* optabs.c (expand_vector_broadcast_of_byte_value): New function.
	* optabs.h (expand_vector_broadcast_of_byte_value): Add
	declaration.
	* rtl.h (vector_extensions_used_for_mode): Add declaration.
	* target.def (DEFHOOK): Add hook slow_unaligned_access.
	* targhooks.c (default_slow_unaligned_access): Add default hook
	implementation.
	* targhooks.h (default_slow_unaligned_access): Add prototype.
	* testsuite/gcc.target/i386/sw-1.c: Fix compile options to preserve
	correct behaviour of the test.

> I do think we will want sse_unrolled_loop (and perhaps sse_loop, I don't know)
> algorithms added to enable the SSE codegen: there are chips that do support SSE
> but internally split the wide moves into word sized chunks and thus benefits
> are minimal and setup costs won't pay back. So we do not want to default to
> SSE codegen whenever possible, just when it is supposed to pay back.
>
> I wonder what we will want to do then with -mno-sse (and i.e. for linux kernel where
> one can not implicitely use SSE).
> Probably making sse_unrolled_loop->unrolled_loop
> transition is easiest for this quite rare case even if some of other algorithm may
> turn out to be superrior.

There is a rolled-loop algorithm that doesn't use SSE modes - such architectures could use it instead of unrolled_loop. I think performance wouldn't suffer much from that. For most modern processors, SSE moves are faster than several word-sized moves, so this change in the unrolled_loop implementation seems reasonable to me, but, of course, if you think introducing sse_unrolled_loop is worth doing, it could be done.

> No vector move because promoted vector value is not computed, yet? (it would
> make sense to bypass it to keep hot path for small blocks SSE free).

Yes, that's the reason. Actually, such a path does exist - it's used if the block size is so small that the prologue and epilogue would do all the work without the main loop.

> Where did go the logic dropping unroll factor to 2 for 32bit integer loops? This is
> important otherwise se starve the RA.

Fixed.

> The unrolled epilogue does at most one byte wide move, while the rolled does at most
> 4*16. Given that the odds are that the blocks are small, are you sure this is not
> causing performance problems?

It didn't hurt performance - quite the contrary, it was done to avoid performance problems. The point is the following. If we generated an unrolled loop with SSE moves and a prologue with alignment checks, we wouldn't know how many bytes would be left after the main loop, so in the epilogue we'd generate a loop with a trip count unknown at compile time. In the previous implementation such a loop simply used byte moves, so in the worst case we'd have (UnrollFactor*VectorWidth-1) byte moves. Now, before such a byte loop, we generate a rolled loop with SSE moves. This loop iterates at most (UnrollFactor-1) times, but that still greatly improves performance.
Finally, we would have something like this:

    main_loop:
      sse_move
      sse_move
      sse_move
      sse_move
      iter += 4*vector_size
      if (iter < count) goto main_loop
    epilogue_sse_loop:
      sse_move
      iter += vector_size
      if (iter < count) goto epilogue_sse_loop
    epilogue_byte_loop:
      byte_move
      iter++
      if (iter < count) goto epilogue_byte_loop

> +      gcc_assert (TARGET_SSE);
> +      if (TARGET_64BIT)
> +        promoted_val = promote_duplicated_reg (V2DImode, val);
> +      else
> +        promoted_val = promote_duplicated_reg (V4SImode, val);
>
> Hmm, it would seem more natural to turn this into V8QImode since it is really
> vector of 4 duplicated bytes. This will avoid some TARGET_64BIT tess bellow.
> Also AMD chips are very slow on integer->SSE moves. How the final promotion
> sequence looks there?

Here we expect that we already have a value promoted to a GPR. In 64-bit mode it has DImode, in 32-bit mode SImode - that's why we need this test here. I added some comments and an assert to make it a bit clearer.

The final promotion sequence looks like this (the example is for 32 bit; for 64 bit it looks quite similar):

    SImode_register = ByteValue*0x01010101            // generated somewhere before promote_duplicated_reg
    V4SImode_register = vec_dupv4si (SImode_register) // generated in promote_duplicated_reg

When we don't have a value promoted to a GPR, we use expand_vector_broadcast_of_byte_value instead of promote_duplicated_reg, which just generates a PARALLEL rtx. Later it's transformed to the same vec_dupv4si or similar, but that functionality isn't in the current patch - it's already implemented.

> +  /* Don't propagate vector-constants, as for now no architecture
> +     supports vector immediates.  */
> +  && !vector_extensions_used_for_mode (mode))
>
> This seems quite dubious. The instruction pattern representing the move should
> refuse the constants via its condition or predicates.

It does, but it can't generate efficient code if such constants are propagated. Take a look at the next example.
Suppose we'd generate something like this:

    v4si_reg = (0 | 0 | 0 | 0)
    sse_mov [mem], v4si_reg
    sse_mov [mem+16], v4si_reg

After constant propagation we'd have:

    v4si_reg = (0 | 0 | 0 | 0)
    sse_mov [mem], (0 | 0 | 0 | 0)
    sse_mov [mem+16], (0 | 0 | 0 | 0)

Initially, we had ordinary stores that could be generated without any problems, but after constant propagation we need to generate a store of an immediate to memory. As we no longer know that there is a specially prepared register containing this value, we either need to initialize a vector register again or split the store into several word-size stores. The first option is too expensive, so the second one takes place. It's correct, but we don't get SSE moves, so it's slower than it could be without constant propagation.

On 2 November 2011 23:45, Jan Hubicka <hubicka@ucw.cz> wrote:
> diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
> index 2c53423..6ce240a 100644
> --- a/gcc/config/i386/i386.c
> +++ b/gcc/config/i386/i386.c
> @@ -561,10 +561,14 @@ struct processor_costs ix86_size_cost = {/* costs for tuning for size */
>    COSTS_N_BYTES (2),  /* cost of FABS instruction.  */
>    COSTS_N_BYTES (2),  /* cost of FCHS instruction.  */
>    COSTS_N_BYTES (2),  /* cost of FSQRT instruction.  */
> -  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> +  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
>
> I do think we will want sse_unrolled_loop (and perhaps sse_loop, I don't know)
> algorithms added to enable the SSE codegen: there are chips that do support SSE
> but internally split the wide moves into word sized chunks and thus benefits
> are minimal and setup costs won't pay back. So we do not want to default to
> SSE codegen whenever possible, just when it is supposed to pay back.
>
> I wonder what we will want to do then with -mno-sse (and i.e. for linux kernel where
> one can not implicitely use SSE). Probably making sse_unrolled_loop->unrolled_loop
> transition is easiest for this quite rare case even if some of other algorithm may
> turn out to be superrior.
>
> +  if (align <= 8 && desired_alignment > 8)
> +    {
> +      rtx label = ix86_expand_aligntest (destptr, 8, false);
> +      destmem = adjust_automodify_address_nv (destmem, SImode, destptr, 0);
> +      emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode, value)));
> +      emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode, value)));
> +      ix86_adjust_counter (count, 8);
> +      emit_label (label);
> +      LABEL_NUSES (label) = 1;
> +    }
> +  gcc_assert (desired_alignment <= 16);
>
> No vector move because promoted vector value is not computed, yet? (it would
> make sense to bypass it to keep hot path for small blocks SSE free).
>
>     case unrolled_loop:
>       need_zero_guard = true;
> -      size_needed = GET_MODE_SIZE (Pmode) * (TARGET_64BIT ? 4 : 2);
> +      /* Use SSE instructions, if possible.  */
> +      move_mode = TARGET_SSE ? (align_unknown ? DImode : V4SImode) : Pmode;
> +      unroll_factor = 4;
>
> Where did go the logic dropping unroll factor to 2 for 32bit integer loops? This is
> important otherwise se starve the RA.
>
> @@ -21897,9 +22254,41 @@ ix86_expand_movmem (rtx dst, rtx src, rtx count_exp, rtx align_exp,
>        LABEL_NUSES (label) = 1;
>      }
>
> +  /* We haven't updated addresses, so we'll do it now.
> +     Also, if the epilogue seems to be big, we'll generate a loop (not
> +     unrolled) in it. We'll do it only if alignment is unknown, because in
> +     this case in epilogue we have to perform memmove by bytes, which is very
> +     slow.  */
>
> The unrolled epilogue does at most one byte wide move, while the rolled does at most
> 4*16. Given that the odds are that the blocks are small, are you sure this is not
> causing performance problems?
>
> @@ -21983,11 +22402,21 @@ promote_duplicated_reg (enum machine_mode mode, rtx val)
>  static rtx
>  promote_duplicated_reg_to_size (rtx val, int size_needed, int desired_align, int align)
>  {
> -  rtx promoted_val;
> +  rtx promoted_val = NULL_RTX;
>
> -  if (TARGET_64BIT
> -      && (size_needed > 4 || (desired_align > align && desired_align > 4)))
> -    promoted_val = promote_duplicated_reg (DImode, val);
> +  if (size_needed > 8 || (desired_align > align && desired_align > 8))
> +    {
> +      gcc_assert (TARGET_SSE);
> +      if (TARGET_64BIT)
> +        promoted_val = promote_duplicated_reg (V2DImode, val);
> +      else
> +        promoted_val = promote_duplicated_reg (V4SImode, val);
>
> Hmm, it would seem more natural to turn this into V8QImode since it is really
> vector of 4 duplicated bytes. This will avoid some TARGET_64BIT tess bellow.
> Also AMD chips are very slow on integer->SSE moves. How the final promotion
> sequence looks there?
>
> diff --git a/gcc/cse.c b/gcc/cse.c
> index ae67685..3b6471d 100644
> --- a/gcc/cse.c
> +++ b/gcc/cse.c
> @@ -4616,7 +4616,10 @@ cse_insn (rtx insn)
>  	 to fold switch statements when an ADDR_DIFF_VEC is used.  */
>        || (GET_CODE (src_folded) == MINUS
>            && GET_CODE (XEXP (src_folded, 0)) == LABEL_REF
> -          && GET_CODE (XEXP (src_folded, 1)) == LABEL_REF)))
> +          && GET_CODE (XEXP (src_folded, 1)) == LABEL_REF))
> +      /* Don't propagate vector-constants, as for now no architecture
> +         supports vector immediates.  */
> +      && !vector_extensions_used_for_mode (mode))
>
> This seems quite dubious. The instruction pattern representing the move should
> refuse the constants via its condition or predicates.
>
> The i386 specific bits seems quite good to me now and ready for mainline with
> the above comments addressed. It may make your life easier if you tried to
> make version that does not need any of the middle end changes - I think it is
> possible, the middle end changes are mostly about expanding memcpy/memset w/o
> loops at all. This can be handled incrementally after your current patch gets
> to mainline.
>
> Honza
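[Editor's note: the three-phase loop structure described in the exchange above (unrolled wide-move main loop, rolled wide-move epilogue of at most UnrollFactor-1 iterations, then a byte loop) can be modeled in plain C. This is a sketch, not the actual i386.c expander: `model_memset`, `UNROLL`, and the use of a `uint64_t` chunk as a stand-in for the 16-byte SSE register are all invented for illustration.]

```c
#include <stdint.h>
#include <string.h>

#define UNROLL 4  /* unroll factor of the main loop */

/* Model of the generated memset shape: the chunk type stands in for the
   vector register; memcpy models an unaligned wide store.  */
static void model_memset (unsigned char *dst, unsigned char byte, size_t count)
{
  /* Promotion: duplicate the byte across the wide "register".  */
  uint64_t wide = (uint64_t) byte * 0x0101010101010101ULL;
  size_t iter = 0;

  /* main_loop: UNROLL wide stores per iteration.  */
  while (iter + UNROLL * sizeof wide <= count)
    for (int i = 0; i < UNROLL; i++, iter += sizeof wide)
      memcpy (dst + iter, &wide, sizeof wide);

  /* epilogue_sse_loop: rolled wide stores, at most UNROLL-1 iterations.  */
  while (iter + sizeof wide <= count)
    {
      memcpy (dst + iter, &wide, sizeof wide);
      iter += sizeof wide;
    }

  /* epilogue_byte_loop: tail of fewer than sizeof wide bytes.  */
  while (iter < count)
    dst[iter++] = byte;
}
```

The point Michael makes is visible here: without the middle rolled loop, the byte loop would run up to UNROLL*sizeof(wide)-1 times; with it, the byte loop runs at most sizeof(wide)-1 times.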
> There is rolled loop algorithm, that doesn't use SSE-modes - such
> architectures could use it instead of unrolled_loop. I think the
> performance wouldn't suffer much from that.
> For the most of modern processors, SSE-moves are faster than several
> word-sized moves, so this change in unrolled_loop implementation seems
> reasonable to me, but, of course, if you think introducing
> sse_unrolled_move is worth doing, it could be done.

This doesn't seem to be quite true for AMD chips. With your change I get on amdfam10 hardware:

memset
               libcall rep1    noalg   rep4    noalg   rep8    noalg   loop    noalg   unrl    noalg   byte    profiled dynamic
block size 8192000  0:00.62 0:00.55 0:00.54 0:00.54 0:00.60 0:00.54 0:00.57 0:00.54 0:00.57 0:00.54 0:00.60 0:03.63 0:00.62 0:00.62  best: 0:00.54 loop
block size 819200   0:00.65 0:00.64 0:00.64 0:00.64 0:00.63 0:00.64 0:00.62 0:00.64 0:00.66 0:00.66 0:00.69 0:04.10 0:00.66 0:00.66  best: 0:00.62 rep8noalign
block size 81920    0:00.21 0:00.21 0:00.21 0:00.21 0:00.27 0:00.21 0:00.21 0:00.20 0:00.27 0:00.20 0:00.25 0:04.18 0:00.21 0:00.21  best: 0:00.20 loop
block size 20480    0:00.18 0:00.18 0:00.18 0:00.18 0:00.24 0:00.18 0:00.20 0:00.20 0:00.28 0:00.17 0:00.23 0:04.29 0:00.18 0:00.18  best: 0:00.17 unrl
block size 8192     0:00.15 0:00.15 0:00.15 0:00.15 0:00.21 0:00.15 0:00.17 0:00.25 0:00.28 0:00.14 0:00.20 0:01.26 0:00.14 0:00.15  best: 0:00.14 unrl
block size 4096     0:00.15 0:00.15 0:00.16 0:00.15 0:00.21 0:00.14 0:00.16 0:00.25 0:00.23 0:00.14 0:00.19 0:01.24 0:00.15 0:00.15  best: 0:00.14 rep8
block size 2048     0:00.16 0:00.18 0:00.18 0:00.17 0:00.23 0:00.16 0:00.18 0:00.26 0:00.21 0:00.17 0:00.21 0:01.25 0:00.16 0:00.16  best: 0:00.16 libcall
block size 1024     0:00.19 0:00.24 0:00.24 0:00.20 0:00.25 0:00.17 0:00.19 0:00.28 0:00.23 0:00.21 0:00.26 0:01.26 0:00.17 0:00.16  best: 0:00.17 rep8
block size 512      0:00.23 0:00.34 0:00.33 0:00.23 0:00.26 0:00.20 0:00.22 0:00.31 0:00.27 0:00.27 0:00.29 0:01.29 0:00.20 0:00.19  best: 0:00.20 rep8
block size 256      0:00.29 0:00.51 0:00.51 0:00.28 0:00.30 0:00.25 0:00.26 0:00.38 0:00.35 0:00.39 0:00.38 0:01.33 0:00.24 0:00.25  best: 0:00.25 rep8
block size 128      0:00.39 0:00.76 0:00.76 0:00.40 0:00.42 0:00.38 0:00.40 0:00.52 0:00.51 0:00.55 0:00.52 0:01.41 0:00.37 0:00.38  best: 0:00.38 rep8
block size 64       0:00.72 0:00.95 0:00.95 0:00.70 0:00.73 0:00.65 0:00.72 0:00.75 0:00.75 0:00.74 0:00.76 0:01.48 0:00.64 0:00.65  best: 0:00.65 rep8
block size 48       0:00.89 0:00.98 0:01.12 0:00.86 0:00.72 0:00.83 0:00.88 0:00.94 0:00.92 0:00.92 0:00.91 0:01.71 0:00.93 0:00.67  best: 0:00.72 rep4noalign
block size 32       0:01.18 0:01.30 0:01.30 0:01.11 0:01.13 0:01.11 0:01.13 0:01.20 0:01.19 0:01.13 0:01.19 0:01.79 0:01.15 0:01.11  best: 0:01.11 rep4
block size 24       0:01.57 0:01.71 0:01.71 0:01.52 0:01.51 0:01.52 0:01.52 0:01.57 0:01.56 0:01.49 0:01.52 0:02.30 0:01.46 0:01.53  best: 0:01.49 unrl
block size 16       0:02.53 0:02.61 0:02.61 0:02.63 0:02.52 0:02.64 0:02.61 0:02.56 0:02.52 0:02.25 0:02.50 0:03.08 0:02.26 0:02.63  best: 0:02.25 unrl
block size 14       0:02.73 0:02.77 0:02.77 0:02.62 0:02.58 0:02.64 0:02.59 0:02.60 0:02.61 Command terminated by signal 11 0:00.00 Command terminated by signal 11 0:00.00 0:03.58 0:02.48 0:02.67  best: 0:02.58 rep4noalign
block size 12       0:03.29 0:03.09 0:03.08 0:03.02 0:02.98 0:03.06 0:02.96 0:02.89 0:02.96 Command terminated by signal 11 0:00.00 Command terminated by signal 11 0:00.00 0:03.89 0:02.70 0:03.05  best: 0:02.89 loop
block size 10       0:03.58 0:03.64 0:03.60 0:03.58 0:03.31 0:03.52 0:03.38 0:03.36 0:03.38 Command terminated by signal 11 0:00.00 Command terminated by signal 11 0:00.00 0:04.42 0:03.10 0:03.43  best: 0:03.31 rep4noalign
block size 8        0:04.19 0:03.76 0:03.75 0:03.98 0:03.83 0:03.82 0:03.70 0:03.70 0:03.80 Command terminated by signal 11 0:00.00 Command terminated by signal 11 0:00.00 0:04.68 Command terminated by signal 11 0:00.00 0:03.87  best: 0:03.70 loop
block size 6        0:06.20 0:05.66 0:05.67 0:05.69 0:05.60 0:05.73 0:05.53 0:05.56 0:05.46 Command terminated by signal 11 0:00.00 Command terminated by signal 11 0:00.00 0:06.23 0:05.60 0:05.65  best: 0:05.46 loopnoalign
block size 4        0:09.58 0:06.93 0:06.94 0:07.13 0:07.30 0:07.05 0:06.94 0:07.05 0:07.28 Command terminated by signal 11 0:00.00 Command terminated by signal 11 0:00.00 0:07.37 Command terminated by signal 11 0:00.00 0:07.46  best: 0:06.93 rep1
block size 1        0:38.46 0:17.27 0:17.27 0:15.14 0:15.11 0:16.34 0:15.10 0:16.38 0:15.11 Command terminated by signal 11 0:00.00 0:15.22 0:16.33 0:14.87 0:16.31  best: 0:15.10 rep8noalign

The ICEs for SSE loop < 16 bytes need to be solved.

memset
               libcall rep1    noalg   rep4    noalg   rep8    noalg   loop    noalg   unrl    noalg   byte    profiled dynamic
block size 8192000  0:00.62 0:00.55 0:00.55 0:00.55 0:00.60 0:00.55 0:00.57 0:00.54 0:00.59 0:00.55 0:00.57 0:03.60 0:00.61 0:00.61  best: 0:00.54 loop
block size 819200   0:00.62 0:00.61 0:00.62 0:00.61 0:00.40 0:00.63 0:00.40 0:00.65 0:00.50 0:00.65 0:00.40 0:03.84 0:00.35 0:00.65  best: 0:00.40 rep4noalign
block size 81920    0:00.21 0:00.21 0:00.21 0:00.21 0:00.27 0:00.21 0:00.22 0:00.27 0:00.40 0:00.21 0:00.22 0:04.47 0:00.21 0:00.21  best: 0:00.21 libcall
block size 20480    0:00.18 0:00.18 0:00.18 0:00.18 0:00.24 0:00.18 0:00.20 0:00.28 0:00.30 0:00.17 0:00.20 0:04.07 0:00.18 0:00.18  best: 0:00.17 unrl
block size 8192     0:00.15 0:00.15 0:00.15 0:00.15 0:00.21 0:00.15 0:00.16 0:00.17 0:00.18 0:00.14 0:00.17 0:01.27 0:00.15 0:00.15  best: 0:00.14 unrl
block size 4096     0:00.15 0:00.16 0:00.16 0:00.15 0:00.21 0:00.15 0:00.17 0:00.18 0:00.17 0:00.14 0:00.17 0:01.24 0:00.15 0:00.15  best: 0:00.14 unrl
block size 2048     0:00.16 0:00.18 0:00.19 0:00.17 0:00.23 0:00.16 0:00.18 0:00.19 0:00.19 0:00.16 0:00.19 0:01.25 0:00.16 0:00.16  best: 0:00.16 libcall
block size 1024     0:00.19 0:00.24 0:00.24 0:00.21 0:00.26 0:00.17 0:00.19 0:00.22 0:00.21 0:00.21 0:00.22 0:01.27 0:00.17 0:00.17  best: 0:00.17 rep8
block size 512      0:00.22 0:00.34 0:00.33 0:00.23 0:00.26 0:00.20 0:00.22 0:00.25 0:00.25 0:00.28 0:00.28 0:01.29 0:00.20 0:00.21  best: 0:00.20 rep8
block size 256      0:00.29 0:00.51 0:00.51 0:00.29 0:00.31 0:00.25 0:00.27 0:00.35 0:00.33 0:00.39 0:00.37 0:01.33 0:00.25 0:00.25  best: 0:00.25 rep8
block size 128      0:00.39 0:00.76 0:00.76 0:00.40 0:00.42 0:00.38 0:00.41 0:00.50 0:00.50 0:00.60 0:00.54 0:01.41 0:00.38 0:00.33  best: 0:00.38 rep8
block size 64       0:00.54 0:00.86 0:00.86 0:00.55 0:00.57 0:00.53 0:00.58 0:00.73 0:00.71 0:00.95 0:00.84 0:01.47 0:00.54 0:00.53  best: 0:00.53 rep8
block size 48       0:00.71 0:00.98 0:00.98 0:00.69 0:00.72 0:00.69 0:00.75 0:00.91 0:00.87 0:01.28 0:01.07 0:01.69 0:01.11 0:00.69  best: 0:00.69 rep4
block size 32       0:00.97 0:01.14 0:01.14 0:00.89 0:00.91 0:00.87 0:00.93 0:01.08 0:01.10 0:01.47 0:01.37 0:01.70 0:01.36 0:00.87  best: 0:00.87 rep8
block size 24       0:01.55 0:01.41 0:01.72 0:01.52 0:01.53 0:01.56 0:01.27 0:01.65 0:01.60 0:02.13 0:02.05 0:02.29 0:01.98 0:01.56  best: 0:01.27 rep8noalign
block size 16       0:02.51 0:02.60 0:02.59 0:02.63 0:02.34 0:02.61 0:02.73 0:02.63 0:02.55 0:02.86 0:03.33 0:03.09 0:03.26 0:02.65  best: 0:02.34 rep4noalign
block size 14       0:02.18 0:02.24 0:02.24 0:02.09 0:02.08 0:02.19 0:02.16 0:02.15 0:02.17 0:02.99 0:03.06 0:02.96 0:02.74 0:02.76  best: 0:02.08 rep4noalign
block size 12       0:03.28 0:03.08 0:03.07 0:02.92 0:02.93 0:03.12 0:03.10 0:02.92 0:02.98 0:03.72 0:03.60 0:03.77 0:03.64 0:02.92  best: 0:02.92 loop
block size 10       0:03.49 0:03.56 0:03.56 0:03.60 0:03.39 0:03.80 0:03.73 0:03.51 0:03.65 0:04.59 0:04.50 0:04.53 0:04.55 0:03.72  best: 0:03.39 rep4noalign
block size 8        0:04.31 0:04.26 0:04.26 0:04.31 0:04.46 0:04.15 0:04.14 0:04.23 0:04.13 0:05.00 0:04.59 0:05.07 0:04.99 0:04.18  best: 0:04.13 loopnoalign
block size 6        0:06.69 0:06.06 0:06.05 0:06.26 0:05.83 0:06.09 0:05.93 0:05.91 0:05.75 0:06.42 0:06.27 0:06.39 0:06.75 0:05.94  best: 0:05.75 loopnoalign
block size 4        0:10.18 0:07.51 0:07.51 0:07.69 0:07.51 0:07.39 0:07.41 0:07.35 0:07.41 0:07.38 0:07.39 0:07.39 0:07.34 0:07.37  best: 0:07.35 loop
block size 1        0:37.06 0:15.82 0:15.82 0:14.65 0:11.15 0:14.57 0:14.48 0:14.48 0:14.46 0:14.35 0:14.31 0:15.17 0:14.19 0:14.66  best: 0:11.15 rep4noalign

> It didn't hurt performance - quite the contrary, it was done to avoid
> performance problems.
> The point of this is following. If we generated an unrolled loop with
> SSE moves and a prologue with alignment checks, then we wouldn't know,
> how much bytes will left after the main loop. So in epilogue we'll
> generate a loop with unknown at compile time trip count. In previous
> implementation such loop simply used byte-moves, so in worst case we'd
> have (UnrollFactor*VectorWidth-1) byte moves. Now before such
> byte-loop we generate a rolled loop with SSE-moves. This loop would
> iterate at most (UnrollFactor-1) times, but that still would greatly
> improve performance.
>
> Finally, we would have something like this:
> main_loop:
>   sse_move
>   sse_move
>   sse_move
>   sse_move
>   iter += 4*vector_size
>   if (iter < count) goto main_loop
> epilogue_sse_loop:
>   sse_move
>   iter += vector_size
>   if (iter < count) goto epilogue_sse_loop
> epilogue_byte_loop:
>   byte_move
>   iter++
>   if (iter < count) goto epilogue_byte_loop

Hmm, OK. I misremembered how we generate the epilogue now - it has been quite a while since I last looked into this.

> > This seems quite dubious. The instruction pattern representing the move should
> > refuse the constants via its condition or predicates.
>
> It does, but it can't generate efficient code if such constants are
> propagated. Take a look at the next example.
> Suppose we'd generate something like this:
>   v4si_reg = (0 | 0 | 0 | 0)
>   sse_mov [mem], v4si_reg
>   sse_mov [mem+16], v4si_reg
>
> After constant propagation we'd have:
>   v4si_reg = (0 | 0 | 0 | 0)
>   sse_mov [mem], (0 | 0 | 0 | 0)
>   sse_mov [mem+16], (0 | 0 | 0 | 0)

Well, if the instruction predicate of sse_mov rejected the immediate operand for a memory destination, the propagation would not happen.
Let's work out what breaks with the small blocks, and I will look into the patterns to prevent the propagation.

Honza
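[Editor's note: the promotion sequence discussed in this thread duplicates one byte across a GPR by multiplying with a 0x01...01 constant, and vec_dupv4si then splats that GPR across the vector register. A minimal C sketch of the scalar half; the function names are invented for illustration:]

```c
#include <stdint.h>

/* SImode case: SImode_register = ByteValue * 0x01010101.  */
static uint32_t broadcast_byte_32 (uint8_t v)
{
  return (uint32_t) v * 0x01010101u;
}

/* DImode case used on 64-bit targets before the V2DImode splat.  */
static uint64_t broadcast_byte_64 (uint8_t v)
{
  return (uint64_t) v * 0x0101010101010101ULL;
}
```

After this scalar step, a single vector splat (e.g. vec_dupv4si on the 32-bit result) yields the fully promoted vector value, which avoids materializing a vector immediate.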