
Memset/memcpy patch

Message ID 20111117211057.GE30593@kam.mff.cuni.cz
State New

Commit Message

Jan Hubicka Nov. 17, 2011, 9:10 p.m. UTC
> 
> The current x86 memset/memcpy expansion is broken. It miscompiles
> many programs, including GCC itself.  Should it be reverted for now?

There was a problem in the new code doing loopy epilogues.
I am currently testing the following patch that should fix the problem.
We could either revert now and I will apply a combined patch, or I hope to fix it
tonight.

Honza

Comments

Jan Hubicka Nov. 17, 2011, 9:23 p.m. UTC | #1
> > 
> > The current x86 memset/memcpy expansion is broken. It miscompiles
> > many programs, including GCC itself.  Should it be reverted for now?
> 
> There was problem in the new code doing loopy epilogues.
> I am currently testing the following patch that shold fix the problem.
> We could either revert now and I will apply combined patch or I hope to fix that
> tonight.

To expand a little bit: I was looking into the code for most of the day today, and
the patch combines several fixes
   1) the new loopy epilogue code was quite broken. It did not work for memset at all because
      the promoted value was not always initialized, which I fixed in the version of the patch
      that is in mainline now. However, it also missed the bound check in some cases.  This is fixed
      by the expand_set_or_movmem_via_loop_with_iter change (see the sketch after this list).
   2) I misupdated the Atom cost description, so 32bit memset was not expanded inline; this is fixed
      by the memset changes.
   3) decide_alg was broken in two ways - first, it chose complex algorithms for -O0,
      and it chose the wrong variant when sse_loop is used.
   4) the epilogue loop was output even when it is not needed - i.e. when the unrolled
      loop handles 16 bytes at once and the block size is 39. This is the ix86_movmem
      and ix86_setmem change.
   5) The implementations of ix86_movmem/ix86_setmem diverged for no reason, so I got them back
      in sync. For some reason SSE code in movmem was not output for 64bit unaligned memcpy;
      that is fixed too.
   6) it seems that both bdver and core are good enough at handling misaligned blocks that
      the alignment prologues can be omitted. This greatly improves the inline sequence and
      reduces its size. I will however break this out into an independent patch.
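
To illustrate point 1, here is a minimal C sketch (my illustration of the shape of the
emitted code, not the actual RTL the expander produces) of an epilogue loop that reuses
the main loop's iterator.  Without the initial "iter >= size" test it would store one
extra chunk whenever the main loop already finished all the work:

#include <stddef.h>

/* SIZE is the number of bytes the loops are responsible for (COUNT rounded
   down to a multiple of PIECE_SIZE); ITER is left over from the main loop.
   The "iter >= size" test is the bound check that was missing.  */
static void
epilogue_set (unsigned char *dst, unsigned char val,
              size_t size, size_t piece_size, size_t iter)
{
  if (iter >= size)             /* main loop already handled everything */
    return;
  do
    {
      for (size_t i = 0; i < piece_size; i++)
        dst[iter + i] = val;
      iter += piece_size;
    }
  while (iter < size);
}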

Life would be easier if the changes were made in multiple incremental steps; stringops expansion
is relatively tricky business and relatively easy to get wrong in some cases, since there are so
many variants depending on knowledge of size/alignment and target architecture.

Honza
Jan Hubicka Nov. 18, 2011, 2:23 a.m. UTC | #2
> > > 
> > > The current x86 memset/memcpy expansion is broken. It miscompiles
> > > many programs, including GCC itself.  Should it be reverted for now?
> > 
> > There was problem in the new code doing loopy epilogues.
> > I am currently testing the following patch that shold fix the problem.
> > We could either revert now and I will apply combined patch or I hope to fix that
> > tonight.
> 
> To expand little bit. I was looking into the code for most of the day today and
> the patch combines several fixes
>    1) the new loopy epilogue code was quite broken. It did not work for memset at all because
>       the promoted value was not always initialized that I fixed in the version of patch
>       that is in mainline now. It however also miss bound check in some cases.  This is fixed
>       by the expand_set_or_movmem_via_loop_with_iter change.
>    2) I misupdated atom description so 32bit memset was not expanded inline, this is fixed
>       by memset changes
>    3) decide_alg was broken in two ways - first it gives complex algorithms for -O0
>       and it chose wrong variant when sse_loop is used.
>    4) the epilogue loop was output even in the case it is not needed - i.e. when unrolled
>       loops handled 16 bytes at once, and block size is 39. This is the ix86_movmem
>       and ix86_setmem change
>    5) The implementation of ix86_movmem/ix86_setmem diverged for no reason so I got it back
>       to sync. For some reason SSE code in movmem is not output for 64bit unaligned memcpy
>       that is fixed too.
>    6) it seems that both bdver and core is good enough on handling misaligned blocks that 
>       the alignmnet prologues can be ommited. This greatly improves and reduces size of the
>       inline sequence. I will however break this out into independent patch.
> 
> Life would be easier if the changes was made in multiple incremental steps, stringops expansion
> is relatively tricky busyness and realively easy to get wrong in some cases since there are so 
> many of them depending on knowledge of size/alignmnet and target architecture.

Hi,
this is the patch I committed after bootstrapping/regtesting x86_64-linux; with
--with-arch=core2 --with-cpu=atom the gfortran.fortran-torture/execute/arrayarg.f90
failure stays. As I've explained in the PR log, I believe it is a previously
latent problem elsewhere that is now triggered by the inline memset expansion that
is later unrolled.  I would welcome help from someone who understands the
testcase on whether it is aliasing safe or not.

Honza

	PR bootstrap/51134
	* i386.c (atom_cost): Fix 32bit memset description.
	(expand_set_or_movmem_via_loop_with_iter): Output proper bounds check for epilogue loops.
	(expand_movmem_epilogue): Handle epilogues up to size 15 w/o producing byte loop.
	(decide_alg): sse_loop is not usable when SSE2 is disabled; when not optimizing always
	use rep movsb or libcall; do not produce word sized loops when optimizing memset for
	size (to avoid the need for large constants).
	(ix86_expand_movmem): Get into sync with ix86_expand_setmem; choose unroll factors
	better; always do 128bit moves when producing SSE loops; do not produce loopy epilogue
	when size is too small.
	(promote_duplicated_reg_to_size): Do not look into desired alignments when
	doing vector expansion.
	(ix86_expand_setmem): Track better when the promoted value is available; choose unroll factors
	more sanely; output loopy epilogue only when needed.
Index: config/i386/i386.c
===================================================================
*** config/i386/i386.c	(revision 181407)
--- config/i386/i386.c	(working copy)
*************** struct processor_costs atom_cost = {
*** 1785,1791 ****
       if that fails.  */
    {{{libcall, {{4096, sse_loop}, {4096, unrolled_loop}, {-1, libcall}}}, /* Known alignment.  */
      {libcall, {{4096, sse_loop}, {4096, unrolled_loop}, {-1, libcall}}}},
!    {{libcall, {{-1, libcall}}},			       /* Unknown alignment.  */
      {libcall, {{2048, sse_loop}, {2048, unrolled_loop},
  	       {-1, libcall}}}}},
  
--- 1785,1791 ----
       if that fails.  */
    {{{libcall, {{4096, sse_loop}, {4096, unrolled_loop}, {-1, libcall}}}, /* Known alignment.  */
      {libcall, {{4096, sse_loop}, {4096, unrolled_loop}, {-1, libcall}}}},
!    {{libcall, {{2048, sse_loop}, {2048, unrolled_loop}, {-1, libcall}}}, /* Unknown alignment.  */
      {libcall, {{2048, sse_loop}, {2048, unrolled_loop},
  	       {-1, libcall}}}}},
  
*************** expand_set_or_movmem_via_loop_with_iter 
*** 21149,21168 ****
  
    top_label = gen_label_rtx ();
    out_label = gen_label_rtx ();
-   if (!reuse_iter)
-     iter = gen_reg_rtx (iter_mode);
- 
    size = expand_simple_binop (iter_mode, AND, count, piece_size_mask,
! 			      NULL, 1, OPTAB_DIRECT);
!   /* Those two should combine.  */
!   if (piece_size == const1_rtx)
      {
!       emit_cmp_and_jump_insns (size, const0_rtx, EQ, NULL_RTX, iter_mode,
  			       true, out_label);
-       predict_jump (REG_BR_PROB_BASE * 10 / 100);
      }
-   if (!reuse_iter)
-     emit_move_insn (iter, const0_rtx);
  
    emit_label (top_label);
  
--- 21149,21173 ----
  
    top_label = gen_label_rtx ();
    out_label = gen_label_rtx ();
    size = expand_simple_binop (iter_mode, AND, count, piece_size_mask,
! 		               NULL, 1, OPTAB_DIRECT);
!   if (!reuse_iter)
      {
!       iter = gen_reg_rtx (iter_mode);
!       /* Those two should combine.  */
!       if (piece_size == const1_rtx)
! 	{
! 	  emit_cmp_and_jump_insns (size, const0_rtx, EQ, NULL_RTX, iter_mode,
! 				   true, out_label);
! 	  predict_jump (REG_BR_PROB_BASE * 10 / 100);
! 	}
!       emit_move_insn (iter, const0_rtx);
!     }
!   else
!     {
!       emit_cmp_and_jump_insns (iter, size, GE, NULL_RTX, iter_mode,
  			       true, out_label);
      }
  
    emit_label (top_label);
  
*************** expand_movmem_epilogue (rtx destmem, rtx
*** 21460,21466 ****
        gcc_assert (remainder_size == 0);
        return;
      }
!   if (max_size > 8)
      {
        count = expand_simple_binop (GET_MODE (count), AND, count, GEN_INT (max_size - 1),
  				    count, 1, OPTAB_DIRECT);
--- 21465,21471 ----
        gcc_assert (remainder_size == 0);
        return;
      }
!   if (max_size > 16)
      {
        count = expand_simple_binop (GET_MODE (count), AND, count, GEN_INT (max_size - 1),
  				    count, 1, OPTAB_DIRECT);
*************** expand_movmem_epilogue (rtx destmem, rtx
*** 21475,21480 ****
--- 21480,21504 ----
     */
    if (TARGET_SINGLE_STRINGOP)
      {
+       if (max_size > 8)
+ 	{
+ 	  rtx label = ix86_expand_aligntest (count, 8, true);
+ 	  if (TARGET_64BIT)
+ 	    {
+ 	      src = change_address (srcmem, DImode, srcptr);
+ 	      dest = change_address (destmem, DImode, destptr);
+ 	      emit_insn (gen_strmov (destptr, dest, srcptr, src));
+ 	    }
+ 	  else
+ 	    {
+ 	      src = change_address (srcmem, SImode, srcptr);
+ 	      dest = change_address (destmem, SImode, destptr);
+ 	      emit_insn (gen_strmov (destptr, dest, srcptr, src));
+ 	      emit_insn (gen_strmov (destptr, dest, srcptr, src));
+ 	    }
+ 	  emit_label (label);
+ 	  LABEL_NUSES (label) = 1;
+ 	}
        if (max_size > 4)
  	{
  	  rtx label = ix86_expand_aligntest (count, 4, true);
*************** expand_movmem_epilogue (rtx destmem, rtx
*** 21508,21513 ****
--- 21532,21566 ----
        rtx offset = force_reg (Pmode, const0_rtx);
        rtx tmp;
  
+       if (max_size > 8)
+ 	{
+ 	  rtx label = ix86_expand_aligntest (count, 8, true);
+ 	  if (TARGET_64BIT)
+ 	    {
+ 	      src = change_address (srcmem, DImode, srcptr);
+ 	      dest = change_address (destmem, DImode, destptr);
+ 	      emit_move_insn (dest, src);
+ 	      tmp = expand_simple_binop (Pmode, PLUS, offset, GEN_INT (8), NULL,
+ 				         true, OPTAB_LIB_WIDEN);
+ 	    }
+ 	  else
+ 	    {
+ 	      src = change_address (srcmem, SImode, srcptr);
+ 	      dest = change_address (destmem, SImode, destptr);
+ 	      emit_move_insn (dest, src);
+ 	      tmp = expand_simple_binop (Pmode, PLUS, offset, GEN_INT (4), NULL,
+ 				         true, OPTAB_LIB_WIDEN);
+ 	      if (tmp != offset)
+ 	         emit_move_insn (offset, tmp);
+ 	      tmp = expand_simple_binop (Pmode, PLUS, offset, GEN_INT (4), NULL,
+ 				         true, OPTAB_LIB_WIDEN);
+ 	      emit_move_insn (dest, src);
+ 	    }
+ 	  if (tmp != offset)
+ 	    emit_move_insn (offset, tmp);
+ 	  emit_label (label);
+ 	  LABEL_NUSES (label) = 1;
+ 	}
        if (max_size > 4)
  	{
  	  rtx label = ix86_expand_aligntest (count, 4, true);
*************** expand_setmem_epilogue (rtx destmem, rtx
*** 21588,21604 ****
  	 Remaining part we'll move using Pmode and narrower modes.  */
  
        if (promoted_to_vector_value)
! 	while (remainder_size >= 16)
! 	  {
! 	    if (GET_MODE (destmem) != move_mode)
! 	      destmem = adjust_automodify_address_nv (destmem, move_mode,
! 						      destptr, offset);
! 	    emit_strset (destmem, promoted_to_vector_value, destptr,
! 			 move_mode, offset);
  
! 	    offset += 16;
! 	    remainder_size -= 16;
! 	  }
  
        /* Move the remaining part of epilogue - its size might be
  	 a size of the widest mode.  */
--- 21641,21668 ----
  	 Remaining part we'll move using Pmode and narrower modes.  */
  
        if (promoted_to_vector_value)
! 	{
! 	  if (promoted_to_vector_value)
! 	    {
! 	      if (max_size >= GET_MODE_SIZE (V4SImode))
! 		move_mode = V4SImode;
! 	      else if (max_size >= GET_MODE_SIZE (DImode))
! 		move_mode = DImode;
! 	    }
! 	  while (remainder_size >= GET_MODE_SIZE (move_mode))
! 	    {
! 	      if (GET_MODE (destmem) != move_mode)
! 		destmem = adjust_automodify_address_nv (destmem, move_mode,
! 							destptr, offset);
! 	      emit_strset (destmem,
! 			   promoted_to_vector_value,
! 			   destptr,
! 			   move_mode, offset);
  
! 	      offset += GET_MODE_SIZE (move_mode);
! 	      remainder_size -= GET_MODE_SIZE (move_mode);
! 	    }
! 	}
  
        /* Move the remaining part of epilogue - its size might be
  	 a size of the widest mode.  */
*************** decide_alg (HOST_WIDE_INT count, HOST_WI
*** 22022,22031 ****
  			     || (memset
  				 ? fixed_regs[AX_REG] : fixed_regs[SI_REG]));
  
! #define ALG_USABLE_P(alg) (rep_prefix_usable			\
! 			   || (alg != rep_prefix_1_byte		\
! 			       && alg != rep_prefix_4_byte      \
! 			       && alg != rep_prefix_8_byte))
    const struct processor_costs *cost;
  
    /* Even if the string operation call is cold, we still might spend a lot
--- 22086,22096 ----
  			     || (memset
  				 ? fixed_regs[AX_REG] : fixed_regs[SI_REG]));
  
! #define ALG_USABLE_P(alg) ((rep_prefix_usable			\
! 			    || (alg != rep_prefix_1_byte	\
! 			        && alg != rep_prefix_4_byte      \
! 			        && alg != rep_prefix_8_byte))    \
! 			   && (TARGET_SSE2 || alg != sse_loop))
    const struct processor_costs *cost;
  
    /* Even if the string operation call is cold, we still might spend a lot
*************** decide_alg (HOST_WIDE_INT count, HOST_WI
*** 22037,22042 ****
--- 22102,22110 ----
    else
      optimize_for_speed = true;
  
+   if (!optimize)
+     return (rep_prefix_usable ? rep_prefix_1_byte : libcall);
+ 
    cost = optimize_for_speed ? ix86_cost : &ix86_size_cost;
  
    *dynamic_check = -1;
*************** decide_alg (HOST_WIDE_INT count, HOST_WI
*** 22049,22058 ****
    /* rep; movq or rep; movl is the smallest variant.  */
    else if (!optimize_for_speed)
      {
!       if (!count || (count & 3))
! 	return rep_prefix_usable ? rep_prefix_1_byte : loop_1_byte;
        else
! 	return rep_prefix_usable ? rep_prefix_4_byte : loop;
      }
    /* Very tiny blocks are best handled via the loop, REP is expensive to setup.
     */
--- 22117,22126 ----
    /* rep; movq or rep; movl is the smallest variant.  */
    else if (!optimize_for_speed)
      {
!       if (!count || (count & 3) || memset)
! 	return rep_prefix_usable ? rep_prefix_1_byte : libcall;
        else
! 	return rep_prefix_usable ? rep_prefix_4_byte : libcall;
      }
    /* Very tiny blocks are best handled via the loop, REP is expensive to setup.
     */
*************** decide_alg (HOST_WIDE_INT count, HOST_WI
*** 22106,22118 ****
        int max = -1;
        enum stringop_alg alg;
        int i;
-       bool any_alg_usable_p = true;
        bool only_libcall_fits = true;
  
        for (i = 0; i < MAX_STRINGOP_ALGS; i++)
  	{
  	  enum stringop_alg candidate = algs->size[i].alg;
- 	  any_alg_usable_p = any_alg_usable_p && ALG_USABLE_P (candidate);
  
  	  if (candidate != libcall && candidate
  	      && ALG_USABLE_P (candidate))
--- 22174,22184 ----
*************** decide_alg (HOST_WIDE_INT count, HOST_WI
*** 22124,22130 ****
        /* If there aren't any usable algorithms, then recursing on
  	 smaller sizes isn't going to find anything.  Just return the
  	 simple byte-at-a-time copy loop.  */
!       if (!any_alg_usable_p || only_libcall_fits)
  	{
  	  /* Pick something reasonable.  */
  	  if (TARGET_INLINE_STRINGOPS_DYNAMICALLY)
--- 22190,22196 ----
        /* If there aren't any usable algorithms, then recursing on
  	 smaller sizes isn't going to find anything.  Just return the
  	 simple byte-at-a-time copy loop.  */
!       if (only_libcall_fits)
  	{
  	  /* Pick something reasonable.  */
  	  if (TARGET_INLINE_STRINGOPS_DYNAMICALLY)
*************** ix86_expand_movmem (rtx dst, rtx src, rt
*** 22253,22259 ****
    int dynamic_check;
    bool need_zero_guard = false;
    bool align_unknown;
!   int unroll_factor;
    enum machine_mode move_mode;
    rtx loop_iter = NULL_RTX;
    int dst_offset, src_offset;
--- 22319,22325 ----
    int dynamic_check;
    bool need_zero_guard = false;
    bool align_unknown;
!   unsigned int unroll_factor;
    enum machine_mode move_mode;
    rtx loop_iter = NULL_RTX;
    int dst_offset, src_offset;
*************** ix86_expand_movmem (rtx dst, rtx src, rt
*** 22316,22329 ****
      case unrolled_loop:
        need_zero_guard = true;
        move_mode = Pmode;
!       unroll_factor = TARGET_64BIT ? 4 : 2;
        size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
        break;
      case sse_loop:
        need_zero_guard = true;
        /* Use SSE instructions, if possible.  */
!       move_mode = align_unknown ? DImode : V4SImode;
!       unroll_factor = TARGET_64BIT ? 4 : 2;
        size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
        break;
      case rep_prefix_8_byte:
--- 22382,22408 ----
      case unrolled_loop:
        need_zero_guard = true;
        move_mode = Pmode;
!       unroll_factor = 1;
!       /* Select maximal available 1,2 or 4 unroll factor.
! 	 In 32bit we can not afford to use 4 registers inside the loop.  */
!       if (!count)
! 	unroll_factor = TARGET_64BIT ? 4 : 2;
!       else
! 	while (GET_MODE_SIZE (move_mode) * unroll_factor * 2 < count
! 	       && unroll_factor < (TARGET_64BIT ? 4 :2))
! 	  unroll_factor *= 2;
        size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
        break;
      case sse_loop:
        need_zero_guard = true;
        /* Use SSE instructions, if possible.  */
!       move_mode = V4SImode;
!       /* Select maximal available 1,2 or 4 unroll factor.  */
!       if (!count)
! 	unroll_factor = 4;
!       else
! 	while (GET_MODE_SIZE (move_mode) * unroll_factor * 2 < count
! 	       && unroll_factor < 4)
        size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
        break;
      case rep_prefix_8_byte:
*************** ix86_expand_movmem (rtx dst, rtx src, rt
*** 22568,22574 ****
    if (alg == sse_loop || alg == unrolled_loop)
      {
        rtx tmp;
!       if (align_unknown && unroll_factor > 1)
  	{
  	  /* Reduce epilogue's size by creating not-unrolled loop.  If we won't
  	     do this, we can have very big epilogue - when alignment is statically
--- 22647,22659 ----
    if (alg == sse_loop || alg == unrolled_loop)
      {
        rtx tmp;
!       int remainder_size = epilogue_size_needed;
! 
!       /* We may not need the epilgoue loop at all when the count is known
! 	 and alignment is not adjusted.  */
!       if (count && desired_align <= align)
! 	remainder_size = count % epilogue_size_needed;
!       if (remainder_size > 31)
  	{
  	  /* Reduce epilogue's size by creating not-unrolled loop.  If we won't
  	     do this, we can have very big epilogue - when alignment is statically
*************** promote_duplicated_reg_to_size (rtx val,
*** 22710,22716 ****
  {
    rtx promoted_val = NULL_RTX;
  
!   if (size_needed > 8 || (desired_align > align && desired_align > 8))
      {
        /* We want to promote to vector register, so we expect that at least SSE
  	 is available.  */
--- 22795,22801 ----
  {
    rtx promoted_val = NULL_RTX;
  
!   if (size_needed > 8)
      {
        /* We want to promote to vector register, so we expect that at least SSE
  	 is available.  */
*************** promote_duplicated_reg_to_size (rtx val,
*** 22724,22730 ****
        else
  	promoted_val = promote_duplicated_reg (V4SImode, val);
      }
!   else if (size_needed > 4 || (desired_align > align && desired_align > 4))
      {
        gcc_assert (TARGET_64BIT);
        promoted_val = promote_duplicated_reg (DImode, val);
--- 22809,22815 ----
        else
  	promoted_val = promote_duplicated_reg (V4SImode, val);
      }
!   else if (size_needed > 4)
      {
        gcc_assert (TARGET_64BIT);
        promoted_val = promote_duplicated_reg (DImode, val);
*************** ix86_expand_setmem (rtx dst, rtx count_e
*** 22764,22769 ****
--- 22849,22855 ----
    unsigned int unroll_factor;
    enum machine_mode move_mode;
    rtx loop_iter = NULL_RTX;
+   bool early_jump = false;
  
    if (CONST_INT_P (align_exp))
      align = INTVAL (align_exp);
*************** ix86_expand_setmem (rtx dst, rtx count_e
*** 22783,22789 ****
    /* Step 0: Decide on preferred algorithm, desired alignment and
       size of chunks to be copied by main loop.  */
  
!   align_unknown = CONST_INT_P (align_exp) && INTVAL (align_exp) > 0;
    alg = decide_alg (count, expected_size, true, &dynamic_check, align_unknown);
    desired_align = decide_alignment (align, alg, expected_size);
    unroll_factor = 1;
--- 22869,22875 ----
    /* Step 0: Decide on preferred algorithm, desired alignment and
       size of chunks to be copied by main loop.  */
  
!   align_unknown = !(CONST_INT_P (align_exp) && INTVAL (align_exp) > 0);
    alg = decide_alg (count, expected_size, true, &dynamic_check, align_unknown);
    desired_align = decide_alignment (align, alg, expected_size);
    unroll_factor = 1;
*************** ix86_expand_setmem (rtx dst, rtx count_e
*** 22813,22821 ****
        move_mode = Pmode;
        unroll_factor = 1;
        /* Select maximal available 1,2 or 4 unroll factor.  */
!       while (GET_MODE_SIZE (move_mode) * unroll_factor * 2 < count
! 	     && unroll_factor < 4)
! 	unroll_factor *= 2;
        size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
        break;
      case sse_loop:
--- 22899,22910 ----
        move_mode = Pmode;
        unroll_factor = 1;
        /* Select maximal available 1,2 or 4 unroll factor.  */
!       if (!count)
! 	unroll_factor = 4;
!       else
! 	while (GET_MODE_SIZE (move_mode) * unroll_factor * 2 < count
! 	       && unroll_factor < 4)
! 	  unroll_factor *= 2;
        size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
        break;
      case sse_loop:
*************** ix86_expand_setmem (rtx dst, rtx count_e
*** 22823,22831 ****
        move_mode = TARGET_64BIT ? V2DImode : V4SImode;
        unroll_factor = 1;
        /* Select maximal available 1,2 or 4 unroll factor.  */
!       while (GET_MODE_SIZE (move_mode) * unroll_factor * 2 < count
! 	     && unroll_factor < 4)
! 	unroll_factor *= 2;
        size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
        break;
      case rep_prefix_8_byte:
--- 22912,22923 ----
        move_mode = TARGET_64BIT ? V2DImode : V4SImode;
        unroll_factor = 1;
        /* Select maximal available 1,2 or 4 unroll factor.  */
!       if (!count)
! 	unroll_factor = 4;
!       else
! 	while (GET_MODE_SIZE (move_mode) * unroll_factor * 2 < count
! 	       && unroll_factor < 4)
! 	  unroll_factor *= 2;
        size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
        break;
      case rep_prefix_8_byte:
*************** ix86_expand_setmem (rtx dst, rtx count_e
*** 22904,22909 ****
--- 22996,23002 ----
                emit_move_insn (loop_iter, const0_rtx);
  	    }
  	  label = gen_label_rtx ();
+ 	  early_jump = true;
  	  emit_cmp_and_jump_insns (count_exp,
  				   GEN_INT (epilogue_size_needed),
  				   LTU, 0, counter_mode (count_exp), 1, label);
*************** ix86_expand_setmem (rtx dst, rtx count_e
*** 23016,23022 ****
        vec_promoted_val =
  	promote_duplicated_reg_to_size (gpr_promoted_val,
  					GET_MODE_SIZE (move_mode),
! 					desired_align, align);
        loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, NULL, destreg,
  				     NULL, vec_promoted_val, count_exp,
  				     loop_iter, move_mode, unroll_factor,
--- 23109,23115 ----
        vec_promoted_val =
  	promote_duplicated_reg_to_size (gpr_promoted_val,
  					GET_MODE_SIZE (move_mode),
! 					GET_MODE_SIZE (move_mode), align);
        loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, NULL, destreg,
  				     NULL, vec_promoted_val, count_exp,
  				     loop_iter, move_mode, unroll_factor,
*************** ix86_expand_setmem (rtx dst, rtx count_e
*** 23065,23085 ****
        LABEL_NUSES (label) = 1;
        /* We can not rely on fact that promoved value is known.  */
        vec_promoted_val = 0;
!       gpr_promoted_val = 0;
      }
   epilogue:
    if (alg == unrolled_loop || alg == sse_loop)
      {
        rtx tmp;
!       if (align_unknown && unroll_factor > 1
! 	  && epilogue_size_needed >= GET_MODE_SIZE (move_mode)
! 	  && vec_promoted_val)
  	{
  	  /* Reduce epilogue's size by creating not-unrolled loop.  If we won't
  	     do this, we can have very big epilogue - when alignment is statically
  	     unknown we'll have the epilogue byte by byte which may be very slow.  */
  	  loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, NULL, destreg,
! 	      NULL, vec_promoted_val, count_exp,
  	      loop_iter, move_mode, 1,
  	      expected_size, false);
  	  dst = change_address (dst, BLKmode, destreg);
--- 23158,23183 ----
        LABEL_NUSES (label) = 1;
        /* We can not rely on fact that promoved value is known.  */
        vec_promoted_val = 0;
!       if (early_jump)
!         gpr_promoted_val = 0;
      }
   epilogue:
    if (alg == unrolled_loop || alg == sse_loop)
      {
        rtx tmp;
!       int remainder_size = epilogue_size_needed;
!       if (count && desired_align <= align)
! 	remainder_size = count % epilogue_size_needed;
!       /* We may not need the epilgoue loop at all when the count is known
! 	 and alignment is not adjusted.  */
!       if (remainder_size > 31 
! 	  && (alg == sse_loop ? vec_promoted_val : gpr_promoted_val))
  	{
  	  /* Reduce epilogue's size by creating not-unrolled loop.  If we won't
  	     do this, we can have very big epilogue - when alignment is statically
  	     unknown we'll have the epilogue byte by byte which may be very slow.  */
  	  loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, NULL, destreg,
! 	      NULL, (alg == sse_loop ? vec_promoted_val : gpr_promoted_val), count_exp,
  	      loop_iter, move_mode, 1,
  	      expected_size, false);
  	  dst = change_address (dst, BLKmode, destreg);
*************** ix86_expand_setmem (rtx dst, rtx count_e
*** 23090,23106 ****
        if (tmp != destreg)
  	emit_move_insn (destreg, tmp);
      }
!   if (count_exp == const0_rtx)
      ;
!   else if (!gpr_promoted_val && epilogue_size_needed > 1)
      expand_setmem_epilogue_via_loop (dst, destreg, val_exp, count_exp,
  				     epilogue_size_needed);
    else
!     {
!       if (epilogue_size_needed > 1)
! 	expand_setmem_epilogue (dst, destreg, vec_promoted_val, gpr_promoted_val,
! 				val_exp, count_exp, epilogue_size_needed);
!     }
    if (jump_around_label)
      emit_label (jump_around_label);
    return true;
--- 23188,23201 ----
        if (tmp != destreg)
  	emit_move_insn (destreg, tmp);
      }
!   if (count_exp == const0_rtx || epilogue_size_needed <= 1)
      ;
!   else if (!gpr_promoted_val)
      expand_setmem_epilogue_via_loop (dst, destreg, val_exp, count_exp,
  				     epilogue_size_needed);
    else
!     expand_setmem_epilogue (dst, destreg, vec_promoted_val, gpr_promoted_val,
! 			    val_exp, count_exp, epilogue_size_needed);
    if (jump_around_label)
      emit_label (jump_around_label);
    return true;
Michael Zolotukhin Nov. 18, 2011, 12:30 p.m. UTC | #3
I found another bug in the current implementation. A patch for it doesn't
cure the i686-linux bootstrap, but it fixes failures on some tests (see
attached).

The problem was that we tried to add runtime tests for alignment even
when both SRC and DST had unknown alignment - in this case it can be
impossible to make them both aligned simultaneously, so I think it's
easier not to try to use aligned SSE moves at all. Generation of
prologues with runtime tests can be used only if at least one
alignment is known - otherwise it's incorrect. Probably, generation of
such prologues could be removed from MEMMOV entirely for now.
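
As a small illustration of why (my example, not code from the patch): aligning the
destination advances both pointers by the same number of bytes, so the source ends up
aligned only when the two pointers already agree modulo the vector size, which a
prologue cannot guarantee when both alignments are unknown:

#include <stdint.h>

/* Nonzero iff aligning DST to 16 bytes would also leave SRC 16-byte aligned.
   This holds only when the pointers are congruent mod 16, so after aligning
   DST we may not assume aligned SSE loads from SRC.  */
static int
both_alignable_p (const void *dst, const void *src)
{
  return (((uintptr_t) dst - (uintptr_t) src) & 15) == 0;
}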

Though, even with this fix the i686 bootstrap still fails. Configure line for
reproducing the bootstrap failure:
CC="gcc -m32" CXX="g++ -m32" ../configure --with-arch=core2
--with-cpu=atom --prefix=`pwd` i686-linux --with-fpmath=sse
--enable-languages=c,c++,fortran

On 18 November 2011 06:23, Jan Hubicka <hubicka@ucw.cz> wrote:
>> > >
>> > > The current x86 memset/memcpy expansion is broken. It miscompiles
>> > > many programs, including GCC itself.  Should it be reverted for now?
>> >
>> > There was problem in the new code doing loopy epilogues.
>> > I am currently testing the following patch that shold fix the problem.
>> > We could either revert now and I will apply combined patch or I hope to fix that
>> > tonight.
>>
>> To expand little bit. I was looking into the code for most of the day today and
>> the patch combines several fixes
>>    1) the new loopy epilogue code was quite broken. It did not work for memset at all because
>>       the promoted value was not always initialized that I fixed in the version of patch
>>       that is in mainline now. It however also miss bound check in some cases.  This is fixed
>>       by the expand_set_or_movmem_via_loop_with_iter change.
>>    2) I misupdated atom description so 32bit memset was not expanded inline, this is fixed
>>       by memset changes
>>    3) decide_alg was broken in two ways - first it gives complex algorithms for -O0
>>       and it chose wrong variant when sse_loop is used.
>>    4) the epilogue loop was output even in the case it is not needed - i.e. when unrolled
>>       loops handled 16 bytes at once, and block size is 39. This is the ix86_movmem
>>       and ix86_setmem change
>>    5) The implementation of ix86_movmem/ix86_setmem diverged for no reason so I got it back
>>       to sync. For some reason SSE code in movmem is not output for 64bit unaligned memcpy
>>       that is fixed too.
>>    6) it seems that both bdver and core is good enough on handling misaligned blocks that
>>       the alignmnet prologues can be ommited. This greatly improves and reduces size of the
>>       inline sequence. I will however break this out into independent patch.
>>
>> Life would be easier if the changes was made in multiple incremental steps, stringops expansion
>> is relatively tricky busyness and realively easy to get wrong in some cases since there are so
>> many of them depending on knowledge of size/alignmnet and target architecture.
>
> Hi,
> this is the patch I comitted after bootstrapping/regstesting x86_64-linux and
> --with-arch=core2 --with-cpu=atom gfortran.fortran-torture/execute/arrayarg.f90
> failure stays. As I've explained in the PR log, I believe it is previously
> latent problem elsewhere that is now triggered by inline memset expansion that
> is later unrolled.  I would welcome help from someone who understand the
> testcase on whether it is aliasing safe or not.
>
> Honza
>
>        PR bootstrap/51134
>        * i386.c (atom_cost): Fix 32bit memset description.
>        (expand_set_or_movmem_via_loop_with_iter): Output proper bounds check for epilogue loops.
>        (expand_movmem_epilogue): Handle epilogues up to size 15 w/o producing byte loop.
>        (decide_alg): sse_loop is not useable wthen SSE2 is disabled; when not optimizing always
>        use rep movsb or lincall; do not produce word sized loops when optimizing memset for
>        size (to avoid need for large constants).
>        (ix86_expand_movmem): Get into sync with ix86_expand_setmem; choose unroll factors
>        better; always do 128bit moves when producing SSE loops; do not produce loopy epilogue
>        when size is too small.
>        (promote_duplicated_reg_to_size): Do not look into desired alignments when
>        doing vector expansion.
>        (ix86_expand_setmem): Track better when promoted value is available; choose unroll factors
>        more sanely.; output loopy epilogue only when needed.
Jan Hubicka Nov. 18, 2011, 2:18 p.m. UTC | #4
> I found another bug in current implementation. A patch for it doesn't
> cure i686-linux- bootstrap, but fixes fails on some tests (see
> attached).
> 
> The problem was that we tried to add runtime tests for alignment even
> if both SRC and DST had unknown alignment - in this case it could be
> impossible to make them both aligned simultaneously, so I think it's
> easier to even not try to use aligned SSE-moves at all. Generation of
> prologues with runtime tests could be used only if at least one
> alignment is known - otherwise it's incorrect. Probably, generation of
> such prologues could be removed from MEMMOV at all for now.

The prologues always align the destination, since that helps more than
aligning the source on most chips.  I do not see a problem with that.  But
for SSE we should either arrange for unaligned load opcodes (that is what I
see in the generated code, but I guess it depends on the -march setting) or
simply disqualify the sse_loop algorithm in decide_alg when the alignment is
not known.
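
(A minimal illustration of the unknown-alignment case discussed above - this
example is not part of the patch and the function name is invented:)

   #include <string.h>

   /* If at run time dst sits at offset 1 and src at offset 2 from a 16-byte
      boundary, no byte-copy prologue can make both pointers 16-byte aligned
      at once.  An inline SSE loop for this call therefore has to use
      unaligned loads/stores (movdqu) or must not be chosen at all.  */
   void
   copy_both_unaligned (char *dst, const char *src, size_t n)
   {
     memcpy (dst + 1, src + 2, n);
   }
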
> 
> Though, even with this fix the i686 bootstrap still fails. Configure
> line for reproducing the bootstrap failure:
> CC="gcc -m32" CXX="g++ -m32" ../configure --with-arch=core2
> --with-cpu=atom --prefix=`pwd` i686-linux --with-fpmath=sse
> --enable-languages=c,c++,fortran

The default i686-linux bootstrap was working for me. I will try your
settings, but my time this evening and over the weekend is limited.

Honza
H.J. Lu Nov. 18, 2011, 4:55 p.m. UTC | #5
On Fri, Nov 18, 2011 at 6:18 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> I found another bug in the current implementation. A patch for it doesn't
>> cure the i686-linux bootstrap, but it fixes failures in some tests (see
>> attached).
>>
>> The problem was that we tried to add runtime tests for alignment even
>> when both SRC and DST had unknown alignment - in that case it can be
>> impossible to make them both aligned simultaneously, so I think it is
>> better not to try to use aligned SSE moves at all. Prologues with
>> runtime alignment tests should be generated only when at least one
>> alignment is known - otherwise the generated code is incorrect. Probably
>> such prologue generation could be removed from MEMMOV entirely for now.
>
> The prologues always align the destination, since that helps more than
> aligning the source on most chips.  I do not see a problem with that.  But
> for SSE we should either arrange for unaligned load opcodes (that is what I
> see in the generated code, but I guess it depends on the -march setting) or
> simply disqualify the sse_loop algorithm in decide_alg when the alignment is
> not known.
>>
>> Though, even with this fix the i686 bootstrap still fails. Configure
>> line for reproducing the bootstrap failure:
>> CC="gcc -m32" CXX="g++ -m32" ../configure --with-arch=core2
>> --with-cpu=atom --prefix=`pwd` i686-linux --with-fpmath=sse
>> --enable-languages=c,c++,fortran
>
> The default i686-linux bootstrap was working for me. I will try your
> settings, but my time this evening and over the weekend is limited.
>

Given that x86 memset/memcpy is still broken, I think we should revert
it for now.
diff mbox

Patch

Index: config/i386/i386.h
===================================================================
--- config/i386/i386.h	(revision 181442)
+++ config/i386/i386.h	(working copy)
@@ -276,6 +276,7 @@  enum ix86_tune_indices {
   X86_TUNE_PROMOTE_QIMODE,
   X86_TUNE_FAST_PREFIX,
   X86_TUNE_SINGLE_STRINGOP,
+  X86_TUNE_ALIGN_STRINGOP,
   X86_TUNE_QIMODE_MATH,
   X86_TUNE_HIMODE_MATH,
   X86_TUNE_PROMOTE_QI_REGS,
Index: config/i386/i386.md
===================================================================
--- config/i386/i386.md	(revision 181442)
+++ config/i386/i386.md	(working copy)
@@ -15944,6 +15944,17 @@ 
 	      (clobber (reg:CC FLAGS_REG))])]
   ""
 {
+  rtx vec_reg;
+  enum machine_mode mode = GET_MODE (operands[2]);
+  if (vector_extensions_used_for_mode (mode)
+      && CONSTANT_P (operands[2]))
+    {
+      if (mode == DImode)
+	mode = TARGET_64BIT ? V2DImode : V4SImode;
+      vec_reg = gen_reg_rtx (mode);
+      emit_move_insn (vec_reg, operands[2]);
+      operands[2] = vec_reg;
+    }
   if (GET_MODE (operands[1]) != GET_MODE (operands[2]))
     operands[1] = adjust_address_nv (operands[1], GET_MODE (operands[2]), 0);
 
Index: config/i386/i386.c
===================================================================
--- config/i386/i386.c	(revision 181442)
+++ config/i386/i386.c	(working copy)
@@ -1785,7 +1785,7 @@  struct processor_costs atom_cost = {
      if that fails.  */
   {{{libcall, {{4096, sse_loop}, {4096, unrolled_loop}, {-1, libcall}}}, /* Known alignment.  */
     {libcall, {{4096, sse_loop}, {4096, unrolled_loop}, {-1, libcall}}}},
-   {{libcall, {{-1, libcall}}},			       /* Unknown alignment.  */
+   {{libcall, {{2048, sse_loop}, {2048, unrolled_loop}, {-1, libcall}}}, /* Unknown alignment.  */
     {libcall, {{2048, sse_loop}, {2048, unrolled_loop},
 	       {-1, libcall}}}}},
 
@@ -2178,6 +2178,9 @@  static unsigned int initial_ix86_tune_fe
   /* X86_TUNE_SINGLE_STRINGOP */
   m_386 | m_P4_NOCONA,
 
+  /* X86_TUNE_ALIGN_STRINGOP */
+  ~(m_BDVER | m_CORE2I7),
+
   /* X86_TUNE_QIMODE_MATH */
   ~0,
 
@@ -3724,6 +3727,14 @@  ix86_option_override_internal (bool main
         target_flags |= MASK_NO_RED_ZONE;
     }
 
+  if (!(target_flags_explicit & MASK_NO_ALIGN_STRINGOPS))
+    {
+      if (ix86_tune_features[X86_TUNE_ALIGN_STRINGOP] & ix86_tune_mask)
+        target_flags &= ~MASK_NO_ALIGN_STRINGOPS;
+      else
+        target_flags |= MASK_NO_ALIGN_STRINGOPS;
+    }
+
   /* Keep nonleaf frame pointers.  */
   if (flag_omit_frame_pointer)
     target_flags &= ~MASK_OMIT_LEAF_FRAME_POINTER;
@@ -21149,20 +21160,25 @@  expand_set_or_movmem_via_loop_with_iter
 
   top_label = gen_label_rtx ();
   out_label = gen_label_rtx ();
-  if (!reuse_iter)
-    iter = gen_reg_rtx (iter_mode);
-
   size = expand_simple_binop (iter_mode, AND, count, piece_size_mask,
-			      NULL, 1, OPTAB_DIRECT);
-  /* Those two should combine.  */
-  if (piece_size == const1_rtx)
+		               NULL, 1, OPTAB_DIRECT);
+  if (!reuse_iter)
     {
-      emit_cmp_and_jump_insns (size, const0_rtx, EQ, NULL_RTX, iter_mode,
+      iter = gen_reg_rtx (iter_mode);
+      /* Those two should combine.  */
+      if (piece_size == const1_rtx)
+	{
+	  emit_cmp_and_jump_insns (size, const0_rtx, EQ, NULL_RTX, iter_mode,
+				   true, out_label);
+	  predict_jump (REG_BR_PROB_BASE * 10 / 100);
+	}
+      emit_move_insn (iter, const0_rtx);
+    }
+  else
+    {
+      emit_cmp_and_jump_insns (iter, size, GE, NULL_RTX, iter_mode,
 			       true, out_label);
-      predict_jump (REG_BR_PROB_BASE * 10 / 100);
     }
-  if (!reuse_iter)
-    emit_move_insn (iter, const0_rtx);
 
   emit_label (top_label);
 
@@ -21588,17 +21604,28 @@  expand_setmem_epilogue (rtx destmem, rtx
 	 Remaining part we'll move using Pmode and narrower modes.  */
 
       if (promoted_to_vector_value)
-	while (remainder_size >= 16)
-	  {
-	    if (GET_MODE (destmem) != move_mode)
-	      destmem = adjust_automodify_address_nv (destmem, move_mode,
-						      destptr, offset);
-	    emit_strset (destmem, promoted_to_vector_value, destptr,
-			 move_mode, offset);
+	{
+	  if (promoted_to_vector_value)
+	    {
+	      if (max_size >= GET_MODE_SIZE (V4SImode))
+		move_mode = V4SImode;
+	      else if (max_size >= GET_MODE_SIZE (DImode))
+		move_mode = DImode;
+	    }
+	  while (remainder_size >= GET_MODE_SIZE (move_mode))
+	    {
+	      if (GET_MODE (destmem) != move_mode)
+		destmem = adjust_automodify_address_nv (destmem, move_mode,
+							destptr, offset);
+	      emit_strset (destmem,
+			   promoted_to_vector_value,
+			   destptr,
+			   move_mode, offset);
 
-	    offset += 16;
-	    remainder_size -= 16;
-	  }
+	      offset += GET_MODE_SIZE (move_mode);
+	      remainder_size -= GET_MODE_SIZE (move_mode);
+	    }
+	}
 
       /* Move the remaining part of epilogue - its size might be
 	 a size of the widest mode.  */
@@ -22022,10 +22049,11 @@  decide_alg (HOST_WIDE_INT count, HOST_WI
 			     || (memset
 				 ? fixed_regs[AX_REG] : fixed_regs[SI_REG]));
 
-#define ALG_USABLE_P(alg) (rep_prefix_usable			\
-			   || (alg != rep_prefix_1_byte		\
-			       && alg != rep_prefix_4_byte      \
-			       && alg != rep_prefix_8_byte))
+#define ALG_USABLE_P(alg) ((rep_prefix_usable			\
+			    || (alg != rep_prefix_1_byte	\
+			        && alg != rep_prefix_4_byte      \
+			        && alg != rep_prefix_8_byte))    \
+			   && (TARGET_SSE2 || alg != sse_loop))
   const struct processor_costs *cost;
 
   /* Even if the string operation call is cold, we still might spend a lot
@@ -22037,6 +22065,9 @@  decide_alg (HOST_WIDE_INT count, HOST_WI
   else
     optimize_for_speed = true;
 
+  if (!optimize)
+    return (rep_prefix_usable ? rep_prefix_1_byte : libcall);
+
   cost = optimize_for_speed ? ix86_cost : &ix86_size_cost;
 
   *dynamic_check = -1;
@@ -22049,10 +22080,10 @@  decide_alg (HOST_WIDE_INT count, HOST_WI
   /* rep; movq or rep; movl is the smallest variant.  */
   else if (!optimize_for_speed)
     {
-      if (!count || (count & 3))
-	return rep_prefix_usable ? rep_prefix_1_byte : loop_1_byte;
+      if (!count || (count & 3) || memset)
+	return rep_prefix_usable ? rep_prefix_1_byte : libcall;
       else
-	return rep_prefix_usable ? rep_prefix_4_byte : loop;
+	return rep_prefix_usable ? rep_prefix_4_byte : libcall;
     }
   /* Very tiny blocks are best handled via the loop, REP is expensive to setup.
    */
@@ -22106,13 +22137,11 @@  decide_alg (HOST_WIDE_INT count, HOST_WI
       int max = -1;
       enum stringop_alg alg;
       int i;
-      bool any_alg_usable_p = true;
       bool only_libcall_fits = true;
 
       for (i = 0; i < MAX_STRINGOP_ALGS; i++)
 	{
 	  enum stringop_alg candidate = algs->size[i].alg;
-	  any_alg_usable_p = any_alg_usable_p && ALG_USABLE_P (candidate);
 
 	  if (candidate != libcall && candidate
 	      && ALG_USABLE_P (candidate))
@@ -22124,7 +22153,7 @@  decide_alg (HOST_WIDE_INT count, HOST_WI
       /* If there aren't any usable algorithms, then recursing on
 	 smaller sizes isn't going to find anything.  Just return the
 	 simple byte-at-a-time copy loop.  */
-      if (!any_alg_usable_p || only_libcall_fits)
+      if (only_libcall_fits)
 	{
 	  /* Pick something reasonable.  */
 	  if (TARGET_INLINE_STRINGOPS_DYNAMICALLY)
@@ -22253,7 +22282,7 @@  ix86_expand_movmem (rtx dst, rtx src, rt
   int dynamic_check;
   bool need_zero_guard = false;
   bool align_unknown;
-  int unroll_factor;
+  unsigned int unroll_factor;
   enum machine_mode move_mode;
   rtx loop_iter = NULL_RTX;
   int dst_offset, src_offset;
@@ -22316,14 +22345,27 @@  ix86_expand_movmem (rtx dst, rtx src, rt
     case unrolled_loop:
       need_zero_guard = true;
       move_mode = Pmode;
-      unroll_factor = TARGET_64BIT ? 4 : 2;
+      unroll_factor = 1;
+      /* Select maximal available 1,2 or 4 unroll factor.
+	 In 32bit we can not afford to use 4 registers inside the loop.  */
+      if (!count)
+	unroll_factor = TARGET_64BIT ? 4 : 2;
+      else
+	while (GET_MODE_SIZE (move_mode) * unroll_factor * 2 < count
+	       && unroll_factor < (TARGET_64BIT ? 4 :2))
+	  unroll_factor *= 2;
       size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
       break;
     case sse_loop:
       need_zero_guard = true;
       /* Use SSE instructions, if possible.  */
-      move_mode = align_unknown ? DImode : V4SImode;
-      unroll_factor = TARGET_64BIT ? 4 : 2;
+      move_mode = V4SImode;
+      /* Select maximal available 1,2 or 4 unroll factor.  */
+      if (!count)
+	unroll_factor = 4;
+      else
+	while (GET_MODE_SIZE (move_mode) * unroll_factor * 2 < count
+	       && unroll_factor < 4)
       size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
       break;
     case rep_prefix_8_byte:
@@ -22568,7 +22610,13 @@  ix86_expand_movmem (rtx dst, rtx src, rt
   if (alg == sse_loop || alg == unrolled_loop)
     {
       rtx tmp;
-      if (align_unknown && unroll_factor > 1)
+      int remainder_size = epilogue_size_needed;
+
+      /* We may not need the epilogue loop at all when the count is known
+	 and alignment is not adjusted.  */
+      if (count && desired_align <= align)
+	remainder_size = count & epilogue_size_needed;
+      if (remainder_size > GET_MODE_SIZE (move_mode) * 2)
 	{
 	  /* Reduce epilogue's size by creating not-unrolled loop.  If we won't
 	     do this, we can have very big epilogue - when alignment is statically
@@ -22764,6 +22812,7 @@  ix86_expand_setmem (rtx dst, rtx count_e
   unsigned int unroll_factor;
   enum machine_mode move_mode;
   rtx loop_iter = NULL_RTX;
+  bool early_jump = false;
 
   if (CONST_INT_P (align_exp))
     align = INTVAL (align_exp);
@@ -22813,9 +22862,12 @@  ix86_expand_setmem (rtx dst, rtx count_e
       move_mode = Pmode;
       unroll_factor = 1;
       /* Select maximal available 1,2 or 4 unroll factor.  */
-      while (GET_MODE_SIZE (move_mode) * unroll_factor * 2 < count
-	     && unroll_factor < 4)
-	unroll_factor *= 2;
+      if (!count)
+	unroll_factor = 4;
+      else
+	while (GET_MODE_SIZE (move_mode) * unroll_factor * 2 < count
+	       && unroll_factor < 4)
+	  unroll_factor *= 2;
       size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
       break;
     case sse_loop:
@@ -22823,9 +22875,12 @@  ix86_expand_setmem (rtx dst, rtx count_e
       move_mode = TARGET_64BIT ? V2DImode : V4SImode;
       unroll_factor = 1;
       /* Select maximal available 1,2 or 4 unroll factor.  */
-      while (GET_MODE_SIZE (move_mode) * unroll_factor * 2 < count
-	     && unroll_factor < 4)
-	unroll_factor *= 2;
+      if (!count)
+	unroll_factor = 4;
+      else
+	while (GET_MODE_SIZE (move_mode) * unroll_factor * 2 < count
+	       && unroll_factor < 4)
+	  unroll_factor *= 2;
       size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
       break;
     case rep_prefix_8_byte:
@@ -22904,6 +22959,7 @@  ix86_expand_setmem (rtx dst, rtx count_e
               emit_move_insn (loop_iter, const0_rtx);
 	    }
 	  label = gen_label_rtx ();
+	  early_jump = true;
 	  emit_cmp_and_jump_insns (count_exp,
 				   GEN_INT (epilogue_size_needed),
 				   LTU, 0, counter_mode (count_exp), 1, label);
@@ -23065,21 +23121,26 @@  ix86_expand_setmem (rtx dst, rtx count_e
       LABEL_NUSES (label) = 1;
       /* We can not rely on fact that promoved value is known.  */
       vec_promoted_val = 0;
-      gpr_promoted_val = 0;
+      if (early_jump)
+        gpr_promoted_val = 0;
     }
  epilogue:
   if (alg == unrolled_loop || alg == sse_loop)
     {
       rtx tmp;
-      if (align_unknown && unroll_factor > 1
-	  && epilogue_size_needed >= GET_MODE_SIZE (move_mode)
-	  && vec_promoted_val)
+      int remainder_size = epilogue_size_needed;
+      if (count && desired_align <= align)
+	remainder_size = count & epilogue_size_needed;
+      /* We may not need the epilogue loop at all when the count is known
+	 and alignment is not adjusted.  */
+      if (remainder_size > GET_MODE_SIZE (move_mode) * 2
+	  && (alg == sse_loop ? vec_promoted_val : gpr_promoted_val))
 	{
 	  /* Reduce epilogue's size by creating not-unrolled loop.  If we won't
 	     do this, we can have very big epilogue - when alignment is statically
 	     unknown we'll have the epilogue byte by byte which may be very slow.  */
 	  loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, NULL, destreg,
-	      NULL, vec_promoted_val, count_exp,
+	      NULL, (alg == sse_loop ? vec_promoted_val : gpr_promoted_val), count_exp,
 	      loop_iter, move_mode, 1,
 	      expected_size, false);
 	  dst = change_address (dst, BLKmode, destreg);
@@ -23090,17 +23151,14 @@  ix86_expand_setmem (rtx dst, rtx count_e
       if (tmp != destreg)
 	emit_move_insn (destreg, tmp);
     }
-  if (count_exp == const0_rtx)
+  if (count_exp == const0_rtx || epilogue_size_needed <= 1)
     ;
-  else if (!gpr_promoted_val && epilogue_size_needed > 1)
+  else if (!gpr_promoted_val)
     expand_setmem_epilogue_via_loop (dst, destreg, val_exp, count_exp,
 				     epilogue_size_needed);
   else
-    {
-      if (epilogue_size_needed > 1)
-	expand_setmem_epilogue (dst, destreg, vec_promoted_val, gpr_promoted_val,
-				val_exp, count_exp, epilogue_size_needed);
-    }
+    expand_setmem_epilogue (dst, destreg, vec_promoted_val, gpr_promoted_val,
+			    val_exp, count_exp, epilogue_size_needed);
   if (jump_around_label)
     emit_label (jump_around_label);
   return true;