
Use of vector instructions in memmove/memset expansion

Message ID 20111102194520.GB29597@kam.mff.cuni.cz
State New

Commit Message

Jan Hubicka Nov. 2, 2011, 7:45 p.m. UTC
The i386-specific bits seem quite good to me now and ready for mainline with
the above comments addressed.  It may make your life easier if you tried to
make a version that does not need any of the middle-end changes - I think it is
possible; the middle-end changes are mostly about expanding memcpy/memset without
loops at all.  This can be handled incrementally after your current patch gets
to mainline.

Honza

Comments

Michael Zolotukhin Nov. 3, 2011, 12:47 p.m. UTC | #1
Hi,

Please check the updated patch - it's separated from most of the
middle-end changes. I tried to take your remarks into account; you
can find my comments below.

The patch was bootstrapped and make-checked, and it showed performance gains
for memset/memcpy at sizes of roughly 256 bytes to 4K (measured on Atom).

Is it ok for trunk?

ChangeLog:
2011-11-03  Zolotukhin Michael  <michael.v.zolotukhin@gmail.com>

        * config/i386/i386.h (processor_costs): Add second dimension to
        stringop_algs array.
        (clear_ratio): Tune value to improve performance.
        * config/i386/i386.c (cost models): Initialize second dimension of
        stringop_algs arrays.  Tune cost model in atom_cost, generic32_cost
        and generic64_cost.
        (promote_duplicated_reg): Add support for vector modes, add
        declaration.
        (promote_duplicated_reg_to_size): Likewise.
        (ix86_expand_move): Add support for vector moves, that use half of
        vector register.
        (expand_set_or_movmem_via_loop_with_iter): New function.
        (expand_set_or_movmem_via_loop): Enable reuse of the same iterators in
        different loops produced by this function.
        (emit_strset): New function.
        (expand_movmem_epilogue): Add epilogue generation for bigger sizes,
        use SSE-moves where possible.
        (expand_setmem_epilogue): Likewise.
        (expand_movmem_prologue): Likewise for prologue.
        (expand_setmem_prologue): Likewise.
        (expand_constant_movmem_prologue): Likewise.
        (expand_constant_setmem_prologue): Likewise.
        (decide_alg): Add new argument align_unknown.  Fix algorithm of
        strategy selection if TARGET_INLINE_ALL_STRINGOPS is set.
        (decide_alignment): Update desired alignment according to chosen move
        mode.
        (ix86_expand_movmem): Change unrolled_loop strategy to use SSE-moves.
        (ix86_expand_setmem): Likewise.
        (ix86_slow_unaligned_access): Implementation of new hook
        slow_unaligned_access.
        * config/i386/i386.md (strset): Enable half-SSE moves.
        * config/i386/sse.md (sse2_loadq): Add expand for sse2_loadq.
        (vec_dupv4si): Add expand for vec_dupv4si.
        (vec_dupv2di): Add expand for vec_dupv2di.
        * cse.c (cse_insn): Stop forward propagation of vector constants.
        * fwprop.c (forward_propagate_and_simplify): Likewise.
        * doc/tm.texi (SLOW_UNALIGNED_ACCESS): Remove documentation for deleted
        macro SLOW_UNALIGNED_ACCESS.
        (TARGET_SLOW_UNALIGNED_ACCESS): Add documentation on new hook.
        * doc/tm.texi.in (SLOW_UNALIGNED_ACCESS): Likewise.
        (TARGET_SLOW_UNALIGNED_ACCESS): Likewise.
        * emit-rtl.c (get_mem_align_offset): Add handling of MEM_REFs.
        (adjust_address_1): Improve algorithm for determining alignment of
        address+offset.
        * expr.c (compute_align_by_offset): New function.
        (vector_mode_for_mode): New function.
        (vector_extensions_used_for_mode): New function.
        (alignment_for_piecewise_move): Use hook slow_unaligned_access instead
        of macros SLOW_UNALIGNED_ACCESS.
        (emit_group_load_1): Likewise.
        (emit_group_store): Likewise.
        (emit_push_insn): Likewise.
        (store_field): Likewise.
        (expand_expr_real_1): Likewise.
        * expr.h (compute_align_by_offset): Add declaration.
        * optabs.c (expand_vector_broadcast_of_byte_value): New function.
        * optabs.h (expand_vector_broadcast_of_byte_value): Add declaration.
        * rtl.h (vector_extensions_used_for_mode): Add declaration.
        * target.def (DEFHOOK): Add hook slow_unaligned_access.
        * targhooks.c (default_slow_unaligned_access): Add default hook
        implementation.
        * targhooks.h (default_slow_unaligned_access): Add prototype.
        * testsuite/gcc.target/i386/sw-1.c: Fix compile options to preserve
        correct behaviour of the test.


> I do think we will want sse_unrolled_loop (and perhaps sse_loop, I don't know)
> algorithms added to enable the SSE codegen: there are chips that do support SSE
> but internally split the wide moves into word sized chunks and thus benefits
> are minimal and setup costs won't pay back.  So we do not want to default to
> SSE codegen whenever possible, just when it is supposed to pay back.
>
> I wonder what we will want to do then with -mno-sse (e.g. for the Linux kernel, where
> one cannot implicitly use SSE).  Probably making the sse_unrolled_loop->unrolled_loop
> transition is easiest for this quite rare case, even if some other algorithm may
> turn out to be superior.

There is a rolled loop algorithm that doesn't use SSE modes - such
architectures could use it instead of unrolled_loop; I think
performance wouldn't suffer much from that.
For most modern processors, SSE moves are faster than several
word-sized moves, so this change in the unrolled_loop implementation
seems reasonable to me, but of course, if you think introducing
sse_unrolled_loop is worth doing, it could be done.


> No vector move because the promoted vector value is not computed yet?  (it would
> make sense to bypass it to keep the hot path for small blocks SSE-free).

Yes, that's the reason.
Actually, such a path does exist - it's used if the block size is so
small that the prologue and epilogue would do all the work without the
main loop.


> Where did the logic dropping the unroll factor to 2 for 32-bit integer loops go?  This is
> important; otherwise we starve the RA.

Fixed.


> The unrolled epilogue does at most one byte wide move, while the rolled does at most
> 4*16. Given that the odds are that the blocks are small, are you sure this is not
> causing performance problems?

It didn't hurt performance - quite the contrary, it was done to avoid
performance problems.
The point is the following. If we generate an unrolled loop with
SSE moves and a prologue with alignment checks, then we don't know
how many bytes will be left after the main loop. So in the epilogue we
have to generate a loop whose trip count is unknown at compile time. In
the previous implementation such a loop simply used byte moves, so in
the worst case we'd have (UnrollFactor*VectorWidth - 1) byte moves. Now,
before that byte loop, we generate a rolled loop with SSE moves. This
loop iterates at most (UnrollFactor - 1) times, but that still greatly
improves performance.

Finally, we would have something like this:
main_loop:
   sse_move
   sse_move
   sse_move
   sse_move
   iter += 4*vector_size
   if (iter < count ) goto main_loop
epilogue_sse_loop:
   sse_move
   iter += vector_size
   if (iter < count ) goto epilogue_sse_loop
epilogue_byte_loop:
   byte_move
   iter ++
   if (iter < count ) goto epilogue_byte_loop
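
To put numbers on it: with an unroll factor of 4 and 16-byte SSE moves, the
worst-case epilogue drops from 4*16 - 1 = 63 byte moves to at most 3 iterations
of the SSE epilogue loop followed by at most 15 byte moves.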


> +      gcc_assert (TARGET_SSE);
> +      if (TARGET_64BIT)
> +       promoted_val = promote_duplicated_reg (V2DImode, val);
> +      else
> +       promoted_val = promote_duplicated_reg (V4SImode, val);
> Hmm, it would seem more natural to turn this into V8QImode since it is really
> a vector of 4 duplicated bytes.  This will avoid some TARGET_64BIT tests below.
> Also AMD chips are very slow on integer->SSE moves.  How does the final promotion
> sequence look there?

Here we expect that we already have the value promoted to a GPR. In
64-bit mode it has DImode, in 32-bit mode SImode - that's why we need
this test here. I added some comments and an assert to make it a bit
clearer.
The final promotion sequence looks like this (the example is for 32-bit;
for 64-bit it looks quite similar):
    SImode_register = ByteValue * 0x01010101            // generated somewhere before promote_duplicated_reg
    V4SImode_register = vec_dupv4si (SImode_register)   // generated in promote_duplicated_reg
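
For instance, for a (hypothetical) byte value of 0xAB this gives
0xAB * 0x01010101 = 0xABABABAB in the GPR, and the vector register then
holds (0xABABABAB | 0xABABABAB | 0xABABABAB | 0xABABABAB).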

When we don't have a value already promoted to a GPR, we use
expand_vector_broadcast_of_byte_value instead of
promote_duplicated_reg, which just generates an rtx_PARALLEL. Later it's
transformed to the same vec_dupv4si or similar, but that functionality
isn't part of the current patch - it's already implemented.


> +             /* Don't propagate vector-constants, as for now no architecture
> +                supports vector immediates.  */
> +         && !vector_extensions_used_for_mode (mode))
>
> This seems quite dubious.  The instruction pattern representing the move should
> refuse the constants via its condition or predicates.

It does, but it can't generate efficient code if such constants are
propagated. Take a look at the following example.
Suppose we generate something like this:
    v4si_reg = (0 | 0 | 0 | 0)
    sse_mov [mem], v4si_reg
    sse_mov [mem+16], v4si_reg

After constant propagation we'd have:
    v4si_reg = (0 | 0 | 0 | 0)
    sse_mov [mem], (0 | 0 | 0 | 0)
    sse_mov [mem+16], (0 | 0 | 0 | 0)

Initially we had ordinary stores that could be generated without any
problems, but after constant propagation we need to generate a store of
an immediate to memory. As we no longer know that there is a specially
prepared register containing this value, we either need to initialize
a vector register again or split the store into several word-sized
stores. The first option is too expensive, so the second one is used.
It's correct, but we no longer have SSE moves, so it's slower than it
would be without constant propagation.
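
Concretely (a rough illustration, not from the patch itself): each 16-byte
store of the propagated constant would then be emitted as, e.g., four 4-byte
stores of $0 on a 32-bit target instead of a single SSE store.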


On 2 November 2011 23:45, Jan Hubicka <hubicka@ucw.cz> wrote:
> diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
> index 2c53423..6ce240a 100644
> --- a/gcc/config/i386/i386.c
> +++ b/gcc/config/i386/i386.c
> @@ -561,10 +561,14 @@ struct processor_costs ix86_size_cost = {/* costs for tuning for size */
>   COSTS_N_BYTES (2),                   /* cost of FABS instruction.  */
>   COSTS_N_BYTES (2),                   /* cost of FCHS instruction.  */
>   COSTS_N_BYTES (2),                   /* cost of FSQRT instruction.  */
> -  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> +  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
>
> I do think we will want sse_unrolled_loop (and perhaps sse_loop, I don't know)
> algorithms added to enable the SSE codegen: there are chips that do support SSE
> but internally split the wide moves into word sized chunks and thus benefits
> are minimal and setup costs won't pay back.  So we do not want to default to
> SSE codegen whenever possible, just when it is supposed to pay back.
>
> I wonder what we will want to do then with -mno-sse (e.g. for the Linux kernel, where
> one cannot implicitly use SSE).  Probably making the sse_unrolled_loop->unrolled_loop
> transition is easiest for this quite rare case, even if some other algorithm may
> turn out to be superior.
>
> +  if (align <= 8 && desired_alignment > 8)
> +    {
> +      rtx label = ix86_expand_aligntest (destptr, 8, false);
> +      destmem = adjust_automodify_address_nv (destmem, SImode, destptr, 0);
> +      emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode, value)));
> +      emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode, value)));
> +      ix86_adjust_counter (count, 8);
> +      emit_label (label);
> +      LABEL_NUSES (label) = 1;
> +    }
> +  gcc_assert (desired_alignment <= 16);
>
> No vector move because the promoted vector value is not computed yet?  (it would
> make sense to bypass it to keep the hot path for small blocks SSE-free).
>     case unrolled_loop:
>       need_zero_guard = true;
> -      size_needed = GET_MODE_SIZE (Pmode) * (TARGET_64BIT ? 4 : 2);
> +      /* Use SSE instructions, if possible.  */
> +      move_mode = TARGET_SSE ? (align_unknown ? DImode : V4SImode) : Pmode;
> +      unroll_factor = 4;
>
> Where did the logic dropping the unroll factor to 2 for 32-bit integer loops go?  This is
> important; otherwise we starve the RA.
> @@ -21897,9 +22254,41 @@ ix86_expand_movmem (rtx dst, rtx src, rtx count_exp, rtx align_exp,
>       LABEL_NUSES (label) = 1;
>     }
>
> +  /* We haven't updated addresses, so we'll do it now.
> +     Also, if the epilogue seems to be big, we'll generate a loop (not
> +     unrolled) in it.  We'll do it only if alignment is unknown, because in
> +     this case in epilogue we have to perform memmove by bytes, which is very
> +     slow.  */
>
> The unrolled epilogue does at most one byte wide move, while the rolled does at most
> 4*16. Given that the odds are that the blocks are small, are you sure this is not
> causing performance problems?
>
> @@ -21983,11 +22402,21 @@ promote_duplicated_reg (enum machine_mode mode, rtx val)
>  static rtx
>  promote_duplicated_reg_to_size (rtx val, int size_needed, int desired_align, int align)
>  {
> -  rtx promoted_val;
> +  rtx promoted_val = NULL_RTX;
>
> -  if (TARGET_64BIT
> -      && (size_needed > 4 || (desired_align > align && desired_align > 4)))
> -    promoted_val = promote_duplicated_reg (DImode, val);
> +  if (size_needed > 8 || (desired_align > align && desired_align > 8))
> +    {
> +      gcc_assert (TARGET_SSE);
> +      if (TARGET_64BIT)
> +       promoted_val = promote_duplicated_reg (V2DImode, val);
> +      else
> +       promoted_val = promote_duplicated_reg (V4SImode, val);
>
> Hmm, it would seem more natural to turn this into V8QImode since it is really
> a vector of 4 duplicated bytes.  This will avoid some TARGET_64BIT tests below.
> Also AMD chips are very slow on integer->SSE moves.  How does the final promotion
> sequence look there?
> diff --git a/gcc/cse.c b/gcc/cse.c
> index ae67685..3b6471d 100644
> --- a/gcc/cse.c
> +++ b/gcc/cse.c
> @@ -4616,7 +4616,10 @@ cse_insn (rtx insn)
>                 to fold switch statements when an ADDR_DIFF_VEC is used.  */
>              || (GET_CODE (src_folded) == MINUS
>                  && GET_CODE (XEXP (src_folded, 0)) == LABEL_REF
> -                 && GET_CODE (XEXP (src_folded, 1)) == LABEL_REF)))
> +                 && GET_CODE (XEXP (src_folded, 1)) == LABEL_REF))
> +             /* Don't propagate vector-constants, as for now no architecture
> +                supports vector immediates.  */
> +         && !vector_extensions_used_for_mode (mode))
>
> This seems quite dubious.  The instruction pattern representing the move should
> refuse the constants via its condition or predicates.
>
> The i386-specific bits seem quite good to me now and ready for mainline with
> the above comments addressed.  It may make your life easier if you tried to
> make a version that does not need any of the middle-end changes - I think it is
> possible; the middle-end changes are mostly about expanding memcpy/memset without
> loops at all.  This can be handled incrementally after your current patch gets
> to mainline.
>
> Honza
>
Jan Hubicka Nov. 6, 2011, 1:04 p.m. UTC | #2
> 
> There is a rolled loop algorithm that doesn't use SSE modes - such
> architectures could use it instead of unrolled_loop; I think
> performance wouldn't suffer much from that.
> For most modern processors, SSE moves are faster than several
> word-sized moves, so this change in the unrolled_loop implementation
> seems reasonable to me, but of course, if you think introducing
> sse_unrolled_loop is worth doing, it could be done.

This doesn't seem to be quite true for AMD chips.  With your change I get
the following on amdfam10 hardware:
memset
                   libcall   rep1   noalg    rep4   noalg    rep8   noalg    loop   noalg    unrl   noalg    byte profiled dynamic
block size 8192000 0:00.62 0:00.55 0:00.54 0:00.54 0:00.60 0:00.54 0:00.57 0:00.54 0:00.57 0:00.54 0:00.60 0:03.63 0:00.62 0:00.62 best: 0:00.54 loop
block size  819200 0:00.65 0:00.64 0:00.64 0:00.64 0:00.63 0:00.64 0:00.62 0:00.64 0:00.66 0:00.66 0:00.69 0:04.10 0:00.66 0:00.66 best: 0:00.62 rep8noalign
block size   81920 0:00.21 0:00.21 0:00.21 0:00.21 0:00.27 0:00.21 0:00.21 0:00.20 0:00.27 0:00.20 0:00.25 0:04.18 0:00.21 0:00.21 best: 0:00.20 loop
block size   20480 0:00.18 0:00.18 0:00.18 0:00.18 0:00.24 0:00.18 0:00.20 0:00.20 0:00.28 0:00.17 0:00.23 0:04.29 0:00.18 0:00.18 best: 0:00.17 unrl
block size    8192 0:00.15 0:00.15 0:00.15 0:00.15 0:00.21 0:00.15 0:00.17 0:00.25 0:00.28 0:00.14 0:00.20 0:01.26 0:00.14 0:00.15 best: 0:00.14 unrl
block size    4096 0:00.15 0:00.15 0:00.16 0:00.15 0:00.21 0:00.14 0:00.16 0:00.25 0:00.23 0:00.14 0:00.19 0:01.24 0:00.15 0:00.15 best: 0:00.14 rep8
block size    2048 0:00.16 0:00.18 0:00.18 0:00.17 0:00.23 0:00.16 0:00.18 0:00.26 0:00.21 0:00.17 0:00.21 0:01.25 0:00.16 0:00.16 best: 0:00.16 libcall
block size    1024 0:00.19 0:00.24 0:00.24 0:00.20 0:00.25 0:00.17 0:00.19 0:00.28 0:00.23 0:00.21 0:00.26 0:01.26 0:00.17 0:00.16 best: 0:00.17 rep8
block size     512 0:00.23 0:00.34 0:00.33 0:00.23 0:00.26 0:00.20 0:00.22 0:00.31 0:00.27 0:00.27 0:00.29 0:01.29 0:00.20 0:00.19 best: 0:00.20 rep8
block size     256 0:00.29 0:00.51 0:00.51 0:00.28 0:00.30 0:00.25 0:00.26 0:00.38 0:00.35 0:00.39 0:00.38 0:01.33 0:00.24 0:00.25 best: 0:00.25 rep8
block size     128 0:00.39 0:00.76 0:00.76 0:00.40 0:00.42 0:00.38 0:00.40 0:00.52 0:00.51 0:00.55 0:00.52 0:01.41 0:00.37 0:00.38 best: 0:00.38 rep8
block size      64 0:00.72 0:00.95 0:00.95 0:00.70 0:00.73 0:00.65 0:00.72 0:00.75 0:00.75 0:00.74 0:00.76 0:01.48 0:00.64 0:00.65 best: 0:00.65 rep8
block size      48 0:00.89 0:00.98 0:01.12 0:00.86 0:00.72 0:00.83 0:00.88 0:00.94 0:00.92 0:00.92 0:00.91 0:01.71 0:00.93 0:00.67 best: 0:00.72 rep4noalign
block size      32 0:01.18 0:01.30 0:01.30 0:01.11 0:01.13 0:01.11 0:01.13 0:01.20 0:01.19 0:01.13 0:01.19 0:01.79 0:01.15 0:01.11 best: 0:01.11 rep4
block size      24 0:01.57 0:01.71 0:01.71 0:01.52 0:01.51 0:01.52 0:01.52 0:01.57 0:01.56 0:01.49 0:01.52 0:02.30 0:01.46 0:01.53 best: 0:01.49 unrl
block size      16 0:02.53 0:02.61 0:02.61 0:02.63 0:02.52 0:02.64 0:02.61 0:02.56 0:02.52 0:02.25 0:02.50 0:03.08 0:02.26 0:02.63 best: 0:02.25 unrl
block size      14 0:02.73 0:02.77 0:02.77 0:02.62 0:02.58 0:02.64 0:02.59 0:02.60 0:02.61 Command terminated by signal 11 0:00.00 Command terminated by signal 11 0:00.00 0:03.58 0:02.48 0:02.67 best: 0:02.58 rep4noalign
block size      12 0:03.29 0:03.09 0:03.08 0:03.02 0:02.98 0:03.06 0:02.96 0:02.89 0:02.96 Command terminated by signal 11 0:00.00 Command terminated by signal 11 0:00.00 0:03.89 0:02.70 0:03.05 best: 0:02.89 loop
block size      10 0:03.58 0:03.64 0:03.60 0:03.58 0:03.31 0:03.52 0:03.38 0:03.36 0:03.38 Command terminated by signal 11 0:00.00 Command terminated by signal 11 0:00.00 0:04.42 0:03.10 0:03.43 best: 0:03.31 rep4noalign
block size       8 0:04.19 0:03.76 0:03.75 0:03.98 0:03.83 0:03.82 0:03.70 0:03.70 0:03.80 Command terminated by signal 11 0:00.00 Command terminated by signal 11 0:00.00 0:04.68 Command terminated by signal 11 0:00.00 0:03.87 best: 0:03.70 loop
block size       6 0:06.20 0:05.66 0:05.67 0:05.69 0:05.60 0:05.73 0:05.53 0:05.56 0:05.46 Command terminated by signal 11 0:00.00 Command terminated by signal 11 0:00.00 0:06.23 0:05.60 0:05.65 best: 0:05.46 loopnoalign
block size       4 0:09.58 0:06.93 0:06.94 0:07.13 0:07.30 0:07.05 0:06.94 0:07.05 0:07.28 Command terminated by signal 11 0:00.00 Command terminated by signal 11 0:00.00 0:07.37 Command terminated by signal 11 0:00.00 0:07.46 best: 0:06.93 rep1
block size       1 0:38.46 0:17.27 0:17.27 0:15.14 0:15.11 0:16.34 0:15.10 0:16.38 0:15.11 Command terminated by signal 11 0:00.00 0:15.22 0:16.33 0:14.87 0:16.31 best: 0:15.10 rep8noalign

The ICEs for the SSE loop at block sizes < 16 bytes need to be solved.
memset
                   libcall   rep1   noalg    rep4   noalg    rep8   noalg    loop   noalg    unrl   noalg    byte profiled dynamic
block size 8192000 0:00.62 0:00.55 0:00.55 0:00.55 0:00.60 0:00.55 0:00.57 0:00.54 0:00.59 0:00.55 0:00.57 0:03.60 0:00.61 0:00.61 best: 0:00.54 loop
block size  819200 0:00.62 0:00.61 0:00.62 0:00.61 0:00.40 0:00.63 0:00.40 0:00.65 0:00.50 0:00.65 0:00.40 0:03.84 0:00.35 0:00.65 best: 0:00.40 rep4noalign
block size   81920 0:00.21 0:00.21 0:00.21 0:00.21 0:00.27 0:00.21 0:00.22 0:00.27 0:00.40 0:00.21 0:00.22 0:04.47 0:00.21 0:00.21 best: 0:00.21 libcall
block size   20480 0:00.18 0:00.18 0:00.18 0:00.18 0:00.24 0:00.18 0:00.20 0:00.28 0:00.30 0:00.17 0:00.20 0:04.07 0:00.18 0:00.18 best: 0:00.17 unrl
block size    8192 0:00.15 0:00.15 0:00.15 0:00.15 0:00.21 0:00.15 0:00.16 0:00.17 0:00.18 0:00.14 0:00.17 0:01.27 0:00.15 0:00.15 best: 0:00.14 unrl
block size    4096 0:00.15 0:00.16 0:00.16 0:00.15 0:00.21 0:00.15 0:00.17 0:00.18 0:00.17 0:00.14 0:00.17 0:01.24 0:00.15 0:00.15 best: 0:00.14 unrl
block size    2048 0:00.16 0:00.18 0:00.19 0:00.17 0:00.23 0:00.16 0:00.18 0:00.19 0:00.19 0:00.16 0:00.19 0:01.25 0:00.16 0:00.16 best: 0:00.16 libcall
block size    1024 0:00.19 0:00.24 0:00.24 0:00.21 0:00.26 0:00.17 0:00.19 0:00.22 0:00.21 0:00.21 0:00.22 0:01.27 0:00.17 0:00.17 best: 0:00.17 rep8
block size     512 0:00.22 0:00.34 0:00.33 0:00.23 0:00.26 0:00.20 0:00.22 0:00.25 0:00.25 0:00.28 0:00.28 0:01.29 0:00.20 0:00.21 best: 0:00.20 rep8
block size     256 0:00.29 0:00.51 0:00.51 0:00.29 0:00.31 0:00.25 0:00.27 0:00.35 0:00.33 0:00.39 0:00.37 0:01.33 0:00.25 0:00.25 best: 0:00.25 rep8
block size     128 0:00.39 0:00.76 0:00.76 0:00.40 0:00.42 0:00.38 0:00.41 0:00.50 0:00.50 0:00.60 0:00.54 0:01.41 0:00.38 0:00.33 best: 0:00.38 rep8
block size      64 0:00.54 0:00.86 0:00.86 0:00.55 0:00.57 0:00.53 0:00.58 0:00.73 0:00.71 0:00.95 0:00.84 0:01.47 0:00.54 0:00.53 best: 0:00.53 rep8
block size      48 0:00.71 0:00.98 0:00.98 0:00.69 0:00.72 0:00.69 0:00.75 0:00.91 0:00.87 0:01.28 0:01.07 0:01.69 0:01.11 0:00.69 best: 0:00.69 rep4
block size      32 0:00.97 0:01.14 0:01.14 0:00.89 0:00.91 0:00.87 0:00.93 0:01.08 0:01.10 0:01.47 0:01.37 0:01.70 0:01.36 0:00.87 best: 0:00.87 rep8
block size      24 0:01.55 0:01.41 0:01.72 0:01.52 0:01.53 0:01.56 0:01.27 0:01.65 0:01.60 0:02.13 0:02.05 0:02.29 0:01.98 0:01.56 best: 0:01.27 rep8noalign
block size      16 0:02.51 0:02.60 0:02.59 0:02.63 0:02.34 0:02.61 0:02.73 0:02.63 0:02.55 0:02.86 0:03.33 0:03.09 0:03.26 0:02.65 best: 0:02.34 rep4noalign
block size      14 0:02.18 0:02.24 0:02.24 0:02.09 0:02.08 0:02.19 0:02.16 0:02.15 0:02.17 0:02.99 0:03.06 0:02.96 0:02.74 0:02.76 best: 0:02.08 rep4noalign
block size      12 0:03.28 0:03.08 0:03.07 0:02.92 0:02.93 0:03.12 0:03.10 0:02.92 0:02.98 0:03.72 0:03.60 0:03.77 0:03.64 0:02.92 best: 0:02.92 loop
block size      10 0:03.49 0:03.56 0:03.56 0:03.60 0:03.39 0:03.80 0:03.73 0:03.51 0:03.65 0:04.59 0:04.50 0:04.53 0:04.55 0:03.72 best: 0:03.39 rep4noalign
block size       8 0:04.31 0:04.26 0:04.26 0:04.31 0:04.46 0:04.15 0:04.14 0:04.23 0:04.13 0:05.00 0:04.59 0:05.07 0:04.99 0:04.18 best: 0:04.13 loopnoalign
block size       6 0:06.69 0:06.06 0:06.05 0:06.26 0:05.83 0:06.09 0:05.93 0:05.91 0:05.75 0:06.42 0:06.27 0:06.39 0:06.75 0:05.94 best: 0:05.75 loopnoalign
block size       4 0:10.18 0:07.51 0:07.51 0:07.69 0:07.51 0:07.39 0:07.41 0:07.35 0:07.41 0:07.38 0:07.39 0:07.39 0:07.34 0:07.37 best: 0:07.35 loop
block size       1 0:37.06 0:15.82 0:15.82 0:14.65 0:11.15 0:14.57 0:14.48 0:14.48 0:14.46 0:14.35 0:14.31 0:15.17 0:14.19 0:14.66 best: 0:11.15 rep4noalign

> It didn't hurt performance - quite the contrary, it was done to avoid
> performance problems.
> The point is the following. If we generate an unrolled loop with
> SSE moves and a prologue with alignment checks, then we don't know
> how many bytes will be left after the main loop. So in the epilogue we
> have to generate a loop whose trip count is unknown at compile time. In
> the previous implementation such a loop simply used byte moves, so in
> the worst case we'd have (UnrollFactor*VectorWidth - 1) byte moves. Now,
> before that byte loop, we generate a rolled loop with SSE moves. This
> loop iterates at most (UnrollFactor - 1) times, but that still greatly
> improves performance.
> 
> Finally, we would have something like this:
> main_loop:
>    sse_move
>    sse_move
>    sse_move
>    sse_move
>    iter += 4*vector_size
>    if (iter < count ) goto main_loop
> epilogue_sse_loop:
>    sse_move
>    iter += vector_size
>    if (iter < count ) goto epilogue_sse_loop
> epilogue_byte_loop:
>    byte_move
>    iter ++
>    if (iter < count ) goto epilogue_byte_loop

Hmm, OK.  I misremembered how we generate the epilogue now - it has been quite
a while since I last looked into this.
> > This seems quite dubious.  The instruction pattern representing the move should
> > refuse the constants via its condition or predicates.
> 
> It does, but it can't generate efficient code if such constants are
> propagated. Take a look at the following example.
> Suppose we generate something like this:
>     v4si_reg = (0 | 0 | 0 | 0)
>     sse_mov [mem], v4si_reg
>     sse_mov [mem+16], v4si_reg
> 
> After constant propagation we'd have:
>     v4si_reg = (0 | 0 | 0 | 0)
>     sse_mov [mem], (0 | 0 | 0 | 0)
>     sse_mov [mem+16], (0 | 0 | 0 | 0)

Well, if the instruction predicate of sse_mov rejected the immediate operand
for a memory destination, the propagation would not happen.

Let's work out what breaks with the small blocks, and I will look into the
patterns to prevent the propagation.
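
(A minimal sketch of the kind of check meant here - hypothetical, not the
actual sse.md pattern - would be an insn condition that refuses a constant
source whenever the destination is memory, e.g.

    /* Reject immediate sources for memory destinations so that CSE/fwprop
       cannot substitute the constant into the store.  */
    !(MEM_P (operands[0]) && CONSTANT_P (operands[1]))

so the propagated form simply fails to match and the register form is kept.)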

Honza